How to Scrape Walmart Product Data Using Python and Handle Anti-Scraping Mechanisms
When scraping data from large e-commerce platforms like Walmart, we often encounter anti-scraping mechanisms that make direct data extraction challenging. In this article, we will walk through how to use Python to bypass common anti-scraping measures and reliably retrieve product data, using strategies such as spoofing request headers, adding request delays, routing traffic through proxy servers, and using dynamic page scraping tools.
1. Install Required Python Libraries
To scrape static pages, you need the following libraries:
pip install requests beautifulsoup4
If the page data is dynamically loaded via JavaScript, you will need to install Selenium:
pip install selenium webdriver-manager
2. Scraping Walmart Using Requests + BeautifulSoup
Here is an example that scrapes product information for the search term "laptop" on Walmart:
import requests
from bs4 import BeautifulSoup
# Walmart search URL
search_query = "laptop"
base_url = f"https://www.walmart.com/search?q={search_query}"
# Spoofing the user agent to avoid detection
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(base_url, headers=headers)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    products = soup.find_all("div", class_="search-result-gridview-item")

    for product in products:
        title = product.find("a", class_="product-title-link")
        price = product.find("span", class_="price-characteristic")

        if title and price:
            product_name = title.text.strip()
            product_price = price.text.strip()
            product_url = "https://www.walmart.com" + title["href"]

            print(f"Product Name: {product_name}")
            print(f"Price: ${product_price}")
            print(f"Link: {product_url}")
            print("-" * 50)
else:
    print("Request failed, status code:", response.status_code)
3. Handling Anti-Scraping Mechanisms
(1) Adding Request Headers
To make the crawler look more like a real user, you can send a more complete set of request headers:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.4472.124 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://www.google.com/",
}
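For long runs, rotating the User-Agent between requests adds further variety on top of the richer headers above. A minimal sketch, assuming a small hand-picked pool of UA strings (the pool itself is illustrative, not part of the original example):

import random

# Illustrative pool of common desktop User-Agent strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
]

def build_headers():
    # Pick a random User-Agent for each request
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com/",
    }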
(2) Adding Request Delays
By adding delays between consecutive requests, you can reduce the risk of triggering anti-scraping mechanisms:
import time

time.sleep(2)  # Add a delay of 2 seconds between requests
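A fixed interval is itself a detectable pattern. Randomizing the pause makes the timing look closer to human browsing; one possible sketch, with bounds chosen arbitrarily:

import random
import time

def polite_sleep(min_s=1.5, max_s=4.0):
    # Sleep for a random duration between min_s and max_s seconds
    time.sleep(random.uniform(min_s, max_s))

polite_sleep()  # Call between consecutive requests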
(3) Using Proxies
When requests from your own IP address are blocked, you can route traffic through proxy servers instead. Here’s how you can integrate Luckdata’s proxy services:
import requests

proxies = {
    "http": "http://Account:Password@ahk.luckdata.io:Port",
    "https": "http://Account:Password@ahk.luckdata.io:Port",
}
response = requests.get(base_url, headers=headers, proxies=proxies)
if response.status_code == 200:
    print("Successfully retrieved data via proxy")
else:
    print("Proxy request failed, status code:", response.status_code)
Luckdata’s proxy services support both dynamic residential and data center proxies, ensuring high anonymity and global coverage to effectively reduce IP blocking risks and improve scraping stability for large-scale data collection.
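If your plan gives you several proxy endpoints, you can also rotate them per request to spread traffic across IPs. A minimal sketch; the endpoints below are placeholders, not real Luckdata addresses:

import random
import requests

# Hypothetical proxy pool -- substitute your own credentials and endpoints
PROXY_POOL = [
    "http://Account:Password@proxy1.example.com:8000",
    "http://Account:Password@proxy2.example.com:8000",
]

def fetch_via_random_proxy(url, headers):
    proxy = random.choice(PROXY_POOL)
    # Use the chosen proxy for both HTTP and HTTPS traffic
    return requests.get(url, headers=headers, proxies={"http": proxy, "https": proxy}, timeout=15)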
4. Using Selenium to Scrape Dynamic Pages
When Walmart pages load content using JavaScript, you can use Selenium to simulate browser actions and retrieve the full data:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
search_query = "laptop"
base_url = f"https://www.walmart.com/search?q={search_query}"
driver.get(base_url)
time.sleep(5) # Wait for the page to load
products = driver.find_elements(By.CSS_SELECTOR, "div.search-result-gridview-item")
for product in products:
    try:
        title_element = product.find_element(By.CSS_SELECTOR, "a.product-title-link")
        price_element = product.find_element(By.CSS_SELECTOR, "span.price-characteristic")

        product_name = title_element.text
        product_price = price_element.text
        product_url = title_element.get_attribute("href")

        print(f"Product Name: {product_name}")
        print(f"Price: ${product_price}")
        print(f"Link: {product_url}")
        print("-" * 50)
    except Exception:
        print("Skipping a product, data may be incomplete")
driver.quit()
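The fixed time.sleep(5) above wastes time on fast loads and may be too short on slow ones. Selenium's explicit waits block only until the target elements actually appear; a sketch that could replace the sleep and the find_elements call:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for at least one product card to be present
products = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.search-result-gridview-item"))
)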
5. Using API to Retrieve Walmart Data
Using an API allows you to directly access structured data, avoiding the complexities of parsing HTML and handling anti-scraping measures. Here's how you can call Luckdata’s Walmart API:
import requests

headers = {
    'X-Luckdata-Api-Key': 'your luckdata key'
}
api_url = 'https://luckdata.io/api/walmart-API/get_vwzq'
params = {
    'url': 'https://www.walmart.com/ip/NELEUS-Mens-Dry-Fit-Mesh-Athletic-Shirts-3-Pack-Black-Gray-Olive-Green-US-Size-M/439625664?classType=VARIANT'
}
response = requests.get(api_url, headers=headers, params=params)
print(response.json())
By using Luckdata’s API, you get structured Walmart product data directly, avoiding both HTML parsing and anti-scraping countermeasures and simplifying the pipeline considerably.
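In production code it also helps to fail loudly on HTTP errors before parsing. A minimal hardening sketch around the same call; the exact structure of the JSON payload depends on the API, so inspect it before extracting fields:

response = requests.get(api_url, headers=headers, params=params, timeout=30)
response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error page

data = response.json()
# Field names depend on the API's response schema -- inspect before extracting
print(list(data.keys()) if isinstance(data, dict) else data)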
6. Saving Data to a CSV File
You can save the scraped data to a CSV file for easy analysis and processing:
import csv

data = [
    ("Product Name", "Price", "Link"),
    ("Laptop 1", "$499.99", "https://www.walmart.com/laptop1"),
    ("Laptop 2", "$799.99", "https://www.walmart.com/laptop2"),
]
with open("walmart_data.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerows(data)

print("Data has been saved to walmart_data.csv")
Conclusion
- Static Page Scraping: Use requests and BeautifulSoup to scrape static web pages.
- Dynamic Page Scraping: Use Selenium to simulate browser actions and extract data from pages that render content with JavaScript.
- Handling Anti-Scraping Mechanisms: Bypass detection by spoofing headers, adding request delays, using proxies (such as Luckdata’s proxy services), and leveraging headless Selenium.
- API Usage: Use Luckdata’s API to quickly obtain structured Walmart product data, bypassing the complexities of web scraping.
- Data Storage: Save the scraped data in CSV format for easy analysis and use in future projects.