Web Scraping with Python: From Crawling to API Data Extraction
Web scraping is the automated process of extracting data from websites, widely used for data analysis, market research, and automation tasks. This article explores how to perform web scraping with Python, covering traditional crawling techniques and API-based data retrieval for efficient and compliant data collection.
1. Install Necessary Libraries
Python offers various tools for web scraping, including:
requests: Sends HTTP requests to fetch webpage content
BeautifulSoup: Parses HTML structure and extracts data
lxml: Improves HTML parsing efficiency
selenium: Handles dynamic web pages
Install them using:
pip install requests beautifulsoup4 lxml selenium
2. Sending HTTP Requests to Retrieve Webpage Content
Use requests to send a GET request and retrieve the raw HTML content:
import requests

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)

if response.status_code == 200:
    print(response.text[:500])  # Print the first 500 characters
else:
    print("Request failed, status code:", response.status_code)
Key points:
Set User-Agent to mimic a browser and avoid detection
Check the HTTP response status (200 indicates success)
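For repeated requests, a requests.Session that carries a default User-Agent, together with an explicit timeout and raise_for_status(), is a common variant of the snippet above. The sketch below only illustrates that pattern; the URL and the 10-second timeout are placeholders.

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # default header for every request

# timeout prevents hanging indefinitely; raise_for_status() raises on 4xx/5xx codes
response = session.get("https://example.com", timeout=10)
response.raise_for_status()
print(response.text[:500])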
3. Parsing HTML to Extract Data
Use BeautifulSoup to parse HTML and extract relevant data:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

# Get webpage title
title = soup.title.text
print("Page Title:", title)

# Find all links
for link in soup.find_all("a"):
    print(link.get("href"))
Common parsing methods:
soup.find(tag, attrs={}): Finds a single element
soup.find_all(tag, attrs={}): Finds all matching elements
element.text: Extracts text content from a tag
element.get("attribute"): Retrieves attribute values
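To see these methods working together, the short sketch below looks up a container div by class and extracts the text and href of each link inside it; the HTML snippet and the "content" class name are purely illustrative.

from bs4 import BeautifulSoup

html = '<div class="content"><a href="/a">First</a><a href="/b">Second</a></div>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element (or None if nothing matches)
container = soup.find("div", attrs={"class": "content"})

# find_all() returns every matching element within the container
for link in container.find_all("a"):
    print(link.text, link.get("href"))  # tag text and href attribute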
4. Handling Dynamic Web Pages
If a webpage's content is generated dynamically via JavaScript, requests alone won't work. Use selenium instead:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://example.com")
driver.implicitly_wait(5)  # Wait up to 5 seconds when locating elements

# Get the main heading (h1)
element = driver.find_element(By.TAG_NAME, "h1")
print("Main Heading:", element.text)
driver.quit()
Key considerations:
Requires installing a WebDriver (e.g., chromedriver)
implicitly_wait() allows Selenium to wait for page elements
find_element() helps locate DOM elements
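When a specific element renders late, an explicit wait is often more dependable than implicitly_wait(). The sketch below waits up to 10 seconds for an h1 to appear; the tag and the timeout are only example choices.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")

# Block until the <h1> is present in the DOM, or raise TimeoutException after 10 seconds
heading = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "h1"))
)
print("Heading:", heading.text)

driver.quit()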
5. Handling Anti-Scraping Mechanisms
(1) Use Random User-Agent
Generate a random User-Agent using fake_useragent:
pip install fake-useragent
from fake_useragent import UserAgent

headers = {"User-Agent": UserAgent().random}
response = requests.get("https://example.com", headers=headers)
(2) Add Delays Between Requests
To avoid being blocked due to high-frequency requests:
import time
import random
time.sleep(random.uniform(2, 5)) # Random wait between 2 to 5 seconds
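In practice the delay usually sits inside the request loop. The sketch below combines it with the requests setup from earlier; the URL list is hypothetical.

import time
import random
import requests

# Hypothetical list of pages to fetch
urls = ["https://example.com/page1", "https://example.com/page2"]
headers = {"User-Agent": "Mozilla/5.0"}

for url in urls:
    response = requests.get(url, headers=headers)
    print(url, response.status_code)
    # Pause 2-5 seconds before the next request to reduce the risk of being blocked
    time.sleep(random.uniform(2, 5))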
(3) Use Proxy IP (LuckData Proxy Services)
LuckData provides datacenter proxies, dynamic residential proxies, and unlimited dynamic residential proxies with over 120 million residential IPs, supporting HTTP/HTTPS, ideal for brand protection, SEO monitoring, market research, and e-commerce applications.
LuckData Proxy Example (Python)
import requests

proxyip = "http://Account:Password@ahk.luckdata.io:Port"
url = "https://api.ip.cc"
proxies = {
    'http': proxyip,
    'https': proxyip,
}
data = requests.get(url=url, proxies=proxies)
print(data.text)
LuckData Proxy Advantages:
Global IP Coverage: Over 200 countries with precise location targeting (country, state, city)
Fast Response Time: Automated proxy setup with 0.6ms latency and 99.99% uptime
Unlimited Concurrent Sessions: High-performance servers with unlimited concurrent requests
Security & Compliance: Ensures privacy protection and legal compliance
6. Retrieving Data via APIs
Compared to traditional scraping, APIs offer a more stable and compliant way to obtain data. LuckData provides APIs for Walmart, Amazon, Google, TikTok, and other platforms, supporting Python requests with structured JSON responses.
6.1 API Request Example (Python)
Below is an example using LuckData's Walmart API to fetch product details:
import requests

headers = {
    'X-Luckdata-Api-Key': 'your luckdata key'
}
response = requests.get(
    'https://luckdata.io/api/walmart-API/get_vwzq?url=https://www.walmart.com/ip/NELEUS-Mens-Dry-Fit-Mesh-Athletic-Shirts-3-Pack-Black-Gray-Olive-Green-US-Size-M/439625664?classType=VARIANT',
    headers=headers
)
print(response.json()) # Parse the returned JSON data
API Benefits:
Avoid bans (IP restrictions, captchas)
Structured data output (directly returns JSON data)
Scalability for enterprise applications (large-scale data retrieval)
7. Storing Data
Scraped data can be stored in CSV, JSON, or databases:
(1) Save as CSV
import csv

data = [("Title 1", "https://example.com/1"), ("Title 2", "https://example.com/2")]

with open("data.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Link"])
    writer.writerows(data)
(2) Save as JSON
import json

data = [{"title": "Title 1", "url": "https://example.com/1"}]

with open("data.json", "w", encoding="utf-8") as file:
    json.dump(data, file, ensure_ascii=False, indent=4)
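(3) Save to a Database
For larger or ongoing scrapes, the same records can go straight into a database. Below is a minimal sketch using Python's built-in sqlite3 module; the file name, table, and column names are illustrative only.

import sqlite3

data = [("Title 1", "https://example.com/1"), ("Title 2", "https://example.com/2")]

conn = sqlite3.connect("scraped.db")
cursor = conn.cursor()

# Create the table once, then insert all rows in a single call
cursor.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, url TEXT)")
cursor.executemany("INSERT INTO pages (title, url) VALUES (?, ?)", data)

conn.commit()
conn.close()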
Conclusion
This article provided a comprehensive guide to web scraping with Python, covering:
✅ Traditional scraping methods (requests, BeautifulSoup)
✅ Handling dynamic pages (selenium)
✅ Using LuckData proxies to bypass restrictions
✅ Retrieving data efficiently via LuckData APIs
✅ Data storage and optimization techniques
By leveraging these techniques, you can efficiently gather web data for data analysis, business intelligence, and more: https://luckdata.io/marketplace