Web Scraping with Python: From Crawling to API Data Extraction
Web scraping is the automated process of extracting data from websites, widely used for data analysis, market research, and automation tasks. This article explores how to perform web scraping with Python, covering traditional crawling techniques and API-based data retrieval for efficient and compliant data collection.
1. Install Necessary Libraries
Python offers various tools for web scraping, including:
requests: Sends HTTP requests to fetch webpage content
BeautifulSoup: Parses HTML structure and extracts data
lxml: Improves HTML parsing efficiency
selenium: Handles dynamic web pages
Install them using:
pip install requests beautifulsoup4 lxml selenium
2. Sending HTTP Requests to Retrieve Webpage Content
Use requests to send a GET request and retrieve the raw HTML content:
import requests

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)

if response.status_code == 200:
    print(response.text[:500])  # Print the first 500 characters
else:
    print("Request failed, status code:", response.status_code)
Key points:
Set User-Agent to mimic a browser and avoid detection
Check the HTTP response status (200 indicates success)
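For repeated requests, a requests.Session that carries a default User-Agent, together with an explicit timeout and raise_for_status(), is a common variant of the snippet above. The sketch below only illustrates that pattern; the URL and the 10-second timeout are placeholders.

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # default header for every request

# timeout prevents hanging indefinitely; raise_for_status() raises on 4xx/5xx codes
response = session.get("https://example.com", timeout=10)
response.raise_for_status()
print(response.text[:500])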
3. Parsing HTML to Extract Data
Use BeautifulSoup to parse HTML and extract relevant data:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

# Get webpage title
title = soup.title.text
print("Page Title:", title)

# Find all links
for link in soup.find_all("a"):
    print(link.get("href"))
Common parsing methods:
soup.find(tag, attrs={}): Finds a single element
soup.find_all(tag, attrs={}): Finds all matching elements
element.text: Extracts text content from a tag
element.get("attribute"): Retrieves attribute values
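To see these methods working together, the short sketch below looks up a container div by class and extracts the text and href of each link inside it; the HTML snippet and the "content" class name are purely illustrative.

from bs4 import BeautifulSoup

html = '<div class="content"><a href="/a">First</a><a href="/b">Second</a></div>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element (or None if nothing matches)
container = soup.find("div", attrs={"class": "content"})

# find_all() returns every matching element within the container
for link in container.find_all("a"):
    print(link.text, link.get("href"))  # tag text and href attribute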
4. Handling Dynamic Web Pages
If a webpage's content is generated dynamically via JavaScript, requests alone won't work. Use selenium instead:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://example.com")
driver.implicitly_wait(5)  # Wait up to 5 seconds when locating elements

# Get the main heading (h1)
element = driver.find_element(By.TAG_NAME, "h1")
print("Main Heading:", element.text)
driver.quit()
Key considerations:
Requires installing a WebDriver (e.g., chromedriver)
implicitly_wait() allows Selenium to wait for page elements
find_element() helps locate DOM elements
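When a specific element renders late, an explicit wait is often more dependable than implicitly_wait(). The sketch below waits up to 10 seconds for an h1 to appear; the tag and the timeout are only example choices.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")

# Block until the <h1> is present in the DOM, or raise TimeoutException after 10 seconds
heading = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "h1"))
)
print("Heading:", heading.text)

driver.quit()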
5. Handling Anti-Scraping Mechanisms
(1) Use Random User-Agent
Generate a random User-Agent using fake_useragent:
pip install fake-useragent
from fake_useragent import UserAgent

headers = {"User-Agent": UserAgent().random}
response = requests.get("https://example.com", headers=headers)
(2) Add Delays Between Requests
To avoid being blocked due to high-frequency requests:
import time
import random
time.sleep(random.uniform(2, 5)) # Random wait between 2 to 5 seconds
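In practice the delay usually sits inside the request loop. The sketch below combines it with the requests setup from earlier; the URL list is hypothetical.

import time
import random
import requests

# Hypothetical list of pages to fetch
urls = ["https://example.com/page1", "https://example.com/page2"]
headers = {"User-Agent": "Mozilla/5.0"}

for url in urls:
    response = requests.get(url, headers=headers)
    print(url, response.status_code)
    # Pause 2-5 seconds before the next request to reduce the risk of being blocked
    time.sleep(random.uniform(2, 5))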
(3) Use Proxy IP (LuckData Proxy Services)
LuckData provides datacenter proxies, dynamic residential proxies, and unlimited dynamic residential proxies with over 120 million residential IPs, supporting HTTP/HTTPS, ideal for brand protection, SEO monitoring, market research, and e-commerce applications.
LuckData Proxy Example (Python)
import requests

proxyip = "http://Account:Password@ahk.luckdata.io:Port"
url = "https://api.ip.cc"
proxies = {
    'http': proxyip,
    'https': proxyip,
}
data = requests.get(url=url, proxies=proxies)
print(data.text)
LuckData Proxy Advantages:
Global IP Coverage: Over 200 countries with precise location targeting (country, state, city)
Fast Response Time: Automated proxy setup with 0.6ms latency and 99.99% uptime
Unlimited Concurrent Sessions: High-performance servers with unlimited concurrent requests
Security & Compliance: Ensures privacy protection and legal compliance
6. Retrieving Data via APIs
Compared to traditional scraping, APIs offer a more stable and compliant way to obtain data. LuckData provides APIs for Walmart, Amazon, Google, TikTok, and other platforms, supporting Python requests with structured JSON responses.
6.1 API Request Example (Python)
Below is an example using LuckData's Walmart API to fetch product details:
import requests

headers = {
    'X-Luckdata-Api-Key': 'your luckdata key'
}
response = requests.get(
    'https://luckdata.io/api/walmart-API/get_vwzq?url=https://www.walmart.com/ip/NELEUS-Mens-Dry-Fit-Mesh-Athletic-Shirts-3-Pack-Black-Gray-Olive-Green-US-Size-M/439625664?classType=VARIANT',
    headers=headers
)
print(response.json()) # Parse the returned JSON data
API Benefits:
Avoid bans (IP restrictions, captchas)
Structured data output (directly returns JSON data)
Scalability for enterprise applications (large-scale data retrieval)
7. Storing Data
Scraped data can be stored in CSV, JSON, or databases:
(1) Save as CSV
import csv

data = [("Title 1", "https://example.com/1"), ("Title 2", "https://example.com/2")]

with open("data.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Link"])
    writer.writerows(data)
(2) Save as JSON
import json

data = [{"title": "Title 1", "url": "https://example.com/1"}]

with open("data.json", "w", encoding="utf-8") as file:
    json.dump(data, file, ensure_ascii=False, indent=4)
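(3) Save to a Database
For larger or ongoing scrapes, the same records can go straight into a database. Below is a minimal sketch using Python's built-in sqlite3 module; the file name, table, and column names are illustrative only.

import sqlite3

data = [("Title 1", "https://example.com/1"), ("Title 2", "https://example.com/2")]

conn = sqlite3.connect("scraped.db")
cursor = conn.cursor()

# Create the table once, then insert all rows in a single call
cursor.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, url TEXT)")
cursor.executemany("INSERT INTO pages (title, url) VALUES (?, ?)", data)

conn.commit()
conn.close()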
Conclusion
This article provided a comprehensive guide to web scraping with Python, covering:
✅ Traditional scraping methods (requests, BeautifulSoup)
✅ Handling dynamic pages (selenium)
✅ Using LuckData proxies to bypass restrictions
✅ Retrieving data efficiently via LuckData APIs
✅ Data storage and optimization techniques
By leveraging these techniques, you can efficiently gather web data for data analysis, business intelligence, and more: https://luckdata.io/marketplace