How to Scrape Walmart Product Data with Python and Handle Anti-Scraping Measures

2025-03-19

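When scraping product data from a large e-commerce platform such as Walmart, you will usually run into a range of anti-scraping measures that make direct requests unreliable. This article shows how to use Python with spoofed request headers, request delays, proxy servers, and a dynamic-page scraping tool to work around those measures and collect product data reliably.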

1. Install the Required Python Libraries

Install the libraries needed to scrape static pages:

pip install requests beautifulsoup4

If the page data is loaded dynamically by JavaScript, you also need Selenium:

pip install selenium webdriver-manager

2. Scraping Walmart with Requests + BeautifulSoup

The following example fetches product data from a Walmart search for "laptop":

import requests
from bs4 import BeautifulSoup

# Walmart search URL
search_query = "laptop"
base_url = f"https://www.walmart.com/search?q={search_query}"

# Spoof the request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# A timeout keeps the script from hanging on an unresponsive connection
response = requests.get(base_url, headers=headers, timeout=10)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    # These class names depend on Walmart's page markup and may need updating
    products = soup.find_all("div", class_="search-result-gridview-item")
    for product in products:
        title = product.find("a", class_="product-title-link")
        price = product.find("span", class_="price-characteristic")
        if title and price:
            product_name = title.text.strip()
            product_price = price.text.strip()
            product_url = "https://www.walmart.com" + title["href"]
            print(f"Product name: {product_name}")
            print(f"Price: ${product_price}")
            print(f"URL: {product_url}")
            print("-" * 50)
else:
    print("Request failed, status code:", response.status_code)
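
The search URL above interpolates the query directly into the string, which works for a single word such as "laptop" but not for queries containing spaces or symbols. A minimal sketch of building the URL with the standard library follows; the helper name build_search_url is hypothetical and only the q parameter from the example above is assumed.

```python
from urllib.parse import urlencode

# Same search endpoint as in the example above
WALMART_SEARCH = "https://www.walmart.com/search"


def build_search_url(query: str) -> str:
    """Return the search URL with the query percent-encoded."""
    return f"{WALMART_SEARCH}?{urlencode({'q': query})}"


print(build_search_url("gaming laptop"))
# https://www.walmart.com/search?q=gaming+laptop
```

The returned string can be passed to requests.get in place of base_url.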

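3. Handling Anti-Scraping Measures

(1) Add Request Headers

Send a more complete set of request headers so the scraper looks more like a real user: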

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}
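
Sending the exact same User-Agent on every request is itself a recognizable pattern, so one common variation is to pick it from a small pool. The sketch below is illustrative only: the USER_AGENTS list and the build_headers helper are assumptions, not part of any library, and the strings should be replaced with current browser values.

```python
import random

# Hypothetical pool of User-Agent strings; replace with current browser values
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
]


def build_headers() -> dict:
    """Return request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com/",
    }


headers = build_headers()
print(headers["User-Agent"])
```

Calling build_headers() before each request gives every request its own header set while keeping the other fields consistent.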

(2) Add Request Delays

Pause between consecutive requests to lower the risk of triggering anti-scraping defenses:

import time
time.sleep(2)  # Pause for 2 seconds
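
A fixed two-second pause is easy to detect because every gap is identical. As a rough sketch, the delay can be randomized, and lengthened after a failed request. The function names and default values below are assumptions chosen for illustration, not recommended settings.

```python
import random


def polite_delay(base: float = 2.0, jitter: float = 1.0) -> float:
    """Return `base` seconds plus up to `jitter` seconds of random noise."""
    return base + random.uniform(0, jitter)


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Return an exponential backoff delay: base * 2**attempt, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


# Show the delays that would be used; pass each value to time.sleep() in a real loop
print(f"normal delay: {polite_delay():.2f}s")
for attempt in range(4):
    print(f"retry {attempt}: wait {backoff_delay(attempt):.0f}s")
```

In a scraping loop, time.sleep(polite_delay()) would go between successful requests and time.sleep(backoff_delay(attempt)) after a blocked or failed one.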

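(3) Use Proxies

When direct requests are being blocked, a proxy server can get around IP-based restrictions. The following example uses Luckdata's proxy service: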

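import requests

# Replace Account, Password, and Port with your own proxy credentials
proxies = {
    "http": "http://Account:Password@ahk.luckdata.io:Port",
    "https": "http://Account:Password@ahk.luckdata.io:Port",
}

# base_url and headers are the values defined in the earlier examples
response = requests.get(base_url, headers=headers, proxies=proxies)
if response.status_code == 200:
    print("Fetched data through the proxy")
else:
    print("Proxy request failed, status code:", response.status_code)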

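Luckdata's proxy service offers both dynamic residential proxies and datacenter proxies, with high anonymity and global coverage. This can reduce the risk of IP bans and make large-scale scraping more stable.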
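
When several proxy endpoints are available, requests can be spread across them in turn. The following is a minimal sketch of round-robin rotation; the PROXY_URLS entries are placeholder addresses and make_proxy_pool is a hypothetical helper, so substitute the endpoints your provider gives you.

```python
from itertools import cycle

# Placeholder endpoints for illustration only
PROXY_URLS = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]


def make_proxy_pool(proxy_urls):
    """Return an endless iterator of proxy dicts in the shape requests expects."""
    return cycle([{"http": url, "https": url} for url in proxy_urls])


pool = make_proxy_pool(PROXY_URLS)
for _ in range(3):
    # Each call to next() yields the next proxy, wrapping around at the end
    print(next(pool)["https"])
```

Each value from next(pool) can be passed as the proxies argument of requests.get, so consecutive requests leave from different addresses.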

4. Scraping Dynamic Pages with Selenium

When Walmart page content is loaded dynamically by JavaScript, Selenium can drive a real browser and give you the fully rendered page:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

search_query = "laptop"
base_url = f"https://www.walmart.com/search?q={search_query}"

try:
    driver.get(base_url)
    time.sleep(5)  # Wait for the page to load
    products = driver.find_elements(By.CSS_SELECTOR, "div.search-result-gridview-item")
    for product in products:
        try:
            title_element = product.find_element(By.CSS_SELECTOR, "a.product-title-link")
            price_element = product.find_element(By.CSS_SELECTOR, "span.price-characteristic")
            product_name = title_element.text
            product_price = price_element.text
            product_url = title_element.get_attribute("href")
            print(f"Product name: {product_name}")
            print(f"Price: ${product_price}")
            print(f"URL: {product_url}")
            print("-" * 50)
        except NoSuchElementException:
            print("Skipping a product; its data may be incomplete")
finally:
    driver.quit()  # Always close the browser, even if scraping fails
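
The fixed five-second sleep above either wastes time when the page loads quickly or gives up too early when it loads slowly. Polling until the product elements actually appear is usually more robust. Selenium ships its own WebDriverWait for this purpose; the dependency-free sketch below only illustrates the idea, and wait_until is a hypothetical helper rather than a Selenium function.

```python
import time


def wait_until(condition, timeout: float = 10.0, poll: float = 0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for a page that only has results on the third check
checks = []


def fake_find_elements():
    checks.append(1)
    return ["product"] if len(checks) >= 3 else []


print(wait_until(fake_find_elements, timeout=2.0, poll=0.01))
```

With a real driver, the condition would be something like lambda: driver.find_elements(By.CSS_SELECTOR, "div.search-result-gridview-item"), which returns an empty list until matching elements exist.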

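5. Getting Walmart Data Through an API

An API returns structured data directly, which removes the tedious work of parsing pages and handling anti-scraping defenses. The following example calls the Walmart API provided by Luckdata: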

import requests

headers = {
    'X-Luckdata-Api-Key': 'your luckdata key'
}
api_url = 'https://luckdata.io/api/walmart-API/get_vwzq'
params = {
    'url': 'https://www.walmart.com/ip/NELEUS-Mens-Dry-Fit-Mesh-Athletic-Shirts-3-Pack-Black-Gray-Olive-Green-US-Size-M/439625664?classType=VARIANT'
}

response = requests.get(api_url, headers=headers, params=params)
print(response.json())

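With Luckdata's API you can retrieve detailed Walmart product data quickly and receive it as structured output, without handling anti-scraping restrictions yourself, which greatly simplifies the scraping workflow.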

6. Saving Data to a CSV File

Save the collected data as a CSV file for later analysis:

import csv

# The first tuple is the header row; the rest are sample records
data = [
    ("Product name", "Price", "URL"),
    ("Laptop 1", "$499.99", "https://www.walmart.com/laptop1"),
    ("Laptop 2", "$799.99", "https://www.walmart.com/laptop2"),
]

with open("walmart_data.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerows(data)

print("Data saved to walmart_data.csv")
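
The example above writes hard-coded tuples. In practice the scraper usually produces one record per product, and writing dictionaries keeps the columns tied to field names rather than positions. The sketch below assumes each product is a dict with name, price, and url keys; those key names and both helper functions are illustrative choices.

```python
import csv
import io

# Assumed field names for each scraped product record
FIELDS = ["name", "price", "url"]


def products_to_csv(products) -> str:
    """Serialize a list of product dicts to CSV text with a header row."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any keys that are not listed in FIELDS
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(products)
    return buffer.getvalue()


def save_products(products, path: str) -> None:
    """Write the CSV text to `path` as UTF-8 without newline translation."""
    with open(path, "w", newline="", encoding="utf-8") as file:
        file.write(products_to_csv(products))


sample = [
    {"name": "Laptop 1", "price": "$499.99", "url": "https://www.walmart.com/laptop1"},
    {"name": "Laptop 2", "price": "$799.99", "url": "https://www.walmart.com/laptop2"},
]
print(products_to_csv(sample))
```

Appending a dict to a list inside the scraping loop and calling save_products(list_of_products, "walmart_data.csv") at the end would replace the print statements used in the earlier examples.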

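Summary

Static page scraping: use requests and BeautifulSoup to fetch and parse page data.
Dynamic page scraping: use Selenium to drive a browser and capture the fully rendered content.
Anti-scraping measures: work around restrictions with spoofed request headers, request delays, and a proxy service (such as Luckdata's).
API calls: use the Luckdata API to get structured Walmart data directly and simplify the scraper.
Data storage: save the scraped data as a CSV file for later analysis and processing.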