Web Scraping with Python: From Crawlers to API-Based Data Retrieval

2025-03-19

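Web scraping is the automated extraction of data from web pages, widely used in data analysis, market research, and task automation. This article shows how to scrape with Python, covering both traditional crawler techniques and API-based data retrieval, so that you can collect web data efficiently and in a compliant way.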

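1. Installing the Required Libraries

Python offers several web scraping tools. Commonly used libraries include:

requests: sends HTTP requests and retrieves page content
BeautifulSoup: parses HTML and extracts data
lxml: a faster HTML parser that BeautifulSoup can use as its backend
selenium: drives a real browser to handle dynamic pages

Install them with: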

pip install requests beautifulsoup4 lxml selenium

2. Sending an HTTP Request and Retrieving Page Content

Start by sending a GET request with requests to fetch the page's HTML source:

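import requests

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}
# A timeout keeps the script from hanging on an unresponsive server
response = requests.get(url, headers=headers, timeout=10)
if response.status_code == 200:
    print(response.text[:500])  # print only the first 500 characters
else:
    print("Request failed, status code:", response.status_code)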

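Key points:

Set a User-Agent header so the request looks like it comes from a browser, which lowers the chance of being blocked
Check the HTTP response status code (200 means the request succeeded)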
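
For anything beyond a quick test, the request step usually also needs a timeout and explicit error handling. The sketch below is one way to wrap it. The helper name fetch_html and the 10-second default are illustrative choices, not part of the original example.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper; the timeout default is an assumption.
    """
    client = session or requests.Session()
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = client.get(url, headers=headers, timeout=timeout)
        # raise_for_status() turns 4xx/5xx responses into exceptions
        response.raise_for_status()
    except requests.RequestException as exc:
        print("Request failed:", exc)
        return None
    return response.text
```

Passing a shared requests.Session as the session argument reuses connections across calls, which tends to help when many pages come from the same host.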

3. Parsing HTML and Extracting Data

Use BeautifulSoup to parse the HTML page and extract data:

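from bs4 import BeautifulSoup

# `response` is the requests response from the previous step
soup = BeautifulSoup(response.text, "html.parser")
# Get the page title (soup.title is None when the page has no <title>)
title = soup.title.get_text(strip=True) if soup.title else ""
print("Page title:", title)
# Find all links that carry an href attribute
for link in soup.find_all("a", href=True):
    print(link["href"])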

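Common parsing methods:

soup.find(tag, attrs={}): finds the first matching element, or returns None
soup.find_all(tag, attrs={}): finds all matching elements
element.text: returns the text inside the tag
element.get("attribute"): returns the value of a tag attribute, or None if it is absent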
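
To see these four calls working together, here is a small self-contained sketch that parses an inline HTML string instead of a live page. The markup, the class name item, the base URL, and the helper extract_links are all invented for illustration. urljoin from the standard library resolves relative links against the page address.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample markup invented for this example
HTML = """
<html><body>
  <h1 id="main">Products</h1>
  <ul>
    <li class="item"><a href="/p/1">First</a></li>
    <li class="item"><a href="/p/2">Second</a></li>
    <li class="item">No link here</li>
  </ul>
</body></html>
"""
BASE_URL = "https://example.com"  # assumed page address


def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every linked list item."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.find_all("li", attrs={"class": "item"}):
        anchor = item.find("a")  # None when the item has no link
        if anchor is None:
            continue
        href = anchor.get("href")
        results.append((anchor.text.strip(), urljoin(base_url, href)))
    return results


heading = BeautifulSoup(HTML, "html.parser").find("h1", attrs={"id": "main"})
print(heading.text)
print(extract_links(HTML, BASE_URL))
```

Checking the result of find() for None before using it avoids the AttributeError that otherwise appears as soon as a page deviates from the expected layout.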

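4. Handling Dynamic Pages

When page content is generated by JavaScript, requests only receives the initial HTML and cannot see the rendered data. In that case you can use selenium: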

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    # Let find_element() poll for up to 5 seconds before giving up
    driver.implicitly_wait(5)
    driver.get("https://example.com")
    # Get the page heading
    element = driver.find_element(By.TAG_NAME, "h1")
    print("Page heading:", element.text)
finally:
    driver.quit()  # always close the browser

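Notes:

A WebDriver (such as chromedriver) is required; Selenium 4.6 and later can download a matching driver automatically through Selenium Manager
implicitly_wait() sets how long find_element() keeps retrying when an element is not yet present; it does not wait for the whole page to finish loading
find_element() locates a DOM element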

5. Dealing with Anti-Scraping Measures

(1) Set a random User-Agent

Use fake_useragent to generate a random User-Agent:

pip install fake-useragent

import requests
from fake_useragent import UserAgent

headers = {"User-Agent": UserAgent().random}
response = requests.get("https://example.com", headers=headers)

(2) Add delays between requests

Avoid sending many requests in a short period; this lowers the risk of being blocked:

import time
import random

time.sleep(random.uniform(2, 5))  # wait a random 2 to 5 seconds
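
In a real crawl the delay sits between consecutive requests rather than being a single call. A rough sketch of such a loop follows. crawl_politely and its fetch and sleep parameters are hypothetical names, and the fetch function is passed in so the pacing logic stays independent of how pages are downloaded.

```python
import random
import time


def crawl_politely(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, a random one before the rest
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results
```

The sleep parameter exists mainly so the loop can be exercised without actually waiting; in normal use the default time.sleep is left in place.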

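(3) Use proxy IPs (LuckData proxies)

LuckData offers datacenter proxies, dynamic residential proxies, and unlimited dynamic residential proxies. The provider states that its pool contains more than 120 million dynamic residential proxy IPs, supports HTTP/HTTPS, and suits use cases such as brand protection, SEO monitoring, market research, and e-commerce.

LuckData proxy usage example (Python)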

import requests

# Replace Account, Password, and Port with your own values
proxyip = "http://Account:Password@ahk.luckdata.io:Port"
url = "https://api.ip.cc"
proxies = {
    'http': proxyip,
    'https': proxyip,
}
data = requests.get(url=url, proxies=proxies)
print(data.text)
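
Hard-coding credentials in the proxy URL is easy to leak and breaks when the password contains characters such as @ or :. One possible approach is to assemble the URL from separate values and percent-encode the credentials. build_proxies below is a hypothetical helper, and the host, port, and credentials in the usage lines are placeholders rather than real LuckData values.

```python
from urllib.parse import quote


def build_proxies(account, password, host, port):
    """Return a requests-style proxies dict with percent-encoded credentials."""
    # safe="" makes quote() escape every reserved character, including @ and :
    user = quote(account, safe="")
    secret = quote(password, safe="")
    proxy_url = f"http://{user}:{secret}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Placeholder values for illustration only
proxies = build_proxies("Account", "p@ss:word", "proxy.example.com", 8080)
print(proxies["https"])
```

The resulting dict can be passed as requests.get(url, proxies=proxies). In practice the account and password would more likely come from environment variables or a secrets store than from literals in the script.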

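Advantages claimed by LuckData:

Global targeting: real IPs from 200+ countries and regions, selectable down to country, state, and city level
Fast response: automated configuration, response times on the order of 0.6 ms, 99.99% uptime
Unlimited concurrency: high-performance servers that support unlimited concurrent requests
Security and compliance: the highest level of privacy protection and security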

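6. Retrieving Data Through an API

Compared with traditional crawlers, an API is a more stable and compliant way to obtain data. For example, LuckData provides APIs for platforms such as Walmart, Amazon, Google, and TikTok that can be called from Python and return structured JSON.

6.1 API request example (Python)

The following example uses the LuckData Walmart API to fetch product detail data: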

import requests

headers = {
    'X-Luckdata-Api-Key': 'your luckdata key'
}
response = requests.get(
    'https://luckdata.io/api/walmart-API/get_vwzq?url=https://www.walmart.com/ip/NELEUS-Mens-Dry-Fit-Mesh-Athletic-Shirts-3-Pack-Black-Gray-Olive-Green-US-Size-M/439625664?classType=VARIANT',
    headers=headers
)
print(response.json())  # parse the returned JSON data
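
The request above embeds one URL inside another's query string, so the inner ? and = are ambiguous unless they are percent-encoded. The sketch below builds the same request URL with urlencode from the standard library. It assumes, as is conventional for HTTP APIs but not confirmed by the source, that the endpoint decodes a percent-encoded url parameter. build_api_url is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint and product URL copied from the example above
API_ENDPOINT = "https://luckdata.io/api/walmart-API/get_vwzq"
PRODUCT_URL = (
    "https://www.walmart.com/ip/NELEUS-Mens-Dry-Fit-Mesh-Athletic-Shirts-"
    "3-Pack-Black-Gray-Olive-Green-US-Size-M/439625664?classType=VARIANT"
)


def build_api_url(endpoint, product_url):
    """Attach product_url to endpoint as a percent-encoded `url` parameter."""
    return endpoint + "?" + urlencode({"url": product_url})


request_url = build_api_url(API_ENDPOINT, PRODUCT_URL)
print(request_url)

# Decoding the query string recovers the product URL unchanged
decoded = parse_qs(urlsplit(request_url).query)["url"][0]
print(decoded == PRODUCT_URL)
```

requests performs the same encoding when the value is passed as params={"url": PRODUCT_URL} instead of being pasted into the URL string.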

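API advantages:

Avoids blocking (IP restrictions, CAPTCHAs)
Structured data (JSON is returned directly)
Suited to enterprise applications (large-scale data collection)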

7. Storing the Data

Scraped data can be saved as CSV or JSON, or stored in a database:

(1) Save as CSV

import csv

data = [("Title 1", "https://example.com/1"), ("Title 2", "https://example.com/2")]
with open("data.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Link"])
    writer.writerows(data)

(2) Save as JSON

import json

data = [{"title": "Title 1", "url": "https://example.com/1"}]
with open("data.json", "w", encoding="utf-8") as file:
    # ensure_ascii=False writes non-ASCII text as-is instead of \u escapes
    json.dump(data, file, ensure_ascii=False, indent=4)
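
The database option mentioned at the start of this section has no example, so here is a minimal sketch using sqlite3 from the standard library. The table name pages, its columns, and the helper save_rows are assumptions chosen for illustration. A UNIQUE constraint on the URL keeps repeated crawls from inserting duplicates.

```python
import sqlite3


def save_rows(conn, rows):
    """Insert (title, url) pairs, skipping URLs that are already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, url TEXT NOT NULL UNIQUE)"
    )
    # Parameter placeholders (?) keep scraped text from being parsed as SQL
    conn.executemany(
        "INSERT OR IGNORE INTO pages (title, url) VALUES (?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path such as "data.db" to persist
save_rows(conn, [("Title 1", "https://example.com/1"), ("Title 2", "https://example.com/2")])
save_rows(conn, [("Title 1 again", "https://example.com/1")])  # duplicate URL, ignored
stored = conn.execute("SELECT title, url FROM pages ORDER BY id").fetchall()
print(stored)
conn.close()
```

SQLite needs no separate server, which makes it a reasonable first step before moving to a larger database.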

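Summary

This article walked through the full workflow of web scraping with Python, including:
✅ Traditional crawler methods (requests, BeautifulSoup)
✅ Handling dynamic pages (selenium)
✅ Using LuckData proxies to get past anti-scraping restrictions
✅ Retrieving data efficiently through the LuckData API
✅ Storing data and keeping scraping efficient

With these methods you can collect web data efficiently for data analysis, business intelligence, and similar use cases: https://luckdata.io/marketplace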