Web Scraping with Python: From Crawling to API Data Extraction

Web scraping is the automated process of extracting data from websites, widely used for data analysis, market research, and automation tasks. This article explores how to perform web scraping with Python, covering traditional crawling techniques and API-based data retrieval for efficient and compliant data collection.

1. Install Necessary Libraries

Python offers various tools for web scraping, including:

  • requests: Sends HTTP requests to fetch webpage content

  • BeautifulSoup: Parses HTML structure and extracts data

  • lxml: A fast parser that BeautifulSoup can use as its backend for quicker HTML parsing

  • selenium: Automates a real browser to handle JavaScript-rendered (dynamic) pages

Install them using:

pip install requests beautifulsoup4 lxml selenium


2. Sending HTTP Requests to Retrieve Webpage Content

Use requests to send a GET request and retrieve raw HTML content:

import requests

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    print(response.text[:500])  # Print the first 500 characters
else:
    print("Request failed, status code:", response.status_code)

Key points:

  • Set User-Agent to mimic a browser and avoid detection

  • Check the HTTP response status (200 indicates success); a slightly more defensive variant is sketched below
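
As an illustration (this variant is not part of the original example), requests can also enforce a timeout and raise an exception on error statuses, which keeps long-running scrapers from hanging or silently ignoring failures:

import requests

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}

try:
    # timeout avoids waiting forever; raise_for_status() turns 4xx/5xx responses into exceptions
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    print(response.text[:500])
except requests.RequestException as exc:
    print("Request failed:", exc)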


3. Parsing HTML to Extract Data

Use BeautifulSoup to parse HTML and extract relevant data:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

# Get the webpage title
title = soup.title.text
print("Page Title:", title)

# Find all links
for link in soup.find_all("a"):
    print(link.get("href"))

Common parsing methods (a short combined example follows this list):

  • soup.find(tag, attrs={}): Finds a single element

  • soup.find_all(tag, attrs={}): Finds all matching elements

  • element.text: Extracts text content from a tag

  • element.get("attribute"): Retrieves attribute values
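
For instance, a minimal sketch combining these methods (the HTML snippet and the "item" class name are made up for illustration):

from bs4 import BeautifulSoup

html = '<div class="item"><a href="/page1">First</a></div>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element, or None if nothing matches
item = soup.find("div", attrs={"class": "item"})
if item is not None:
    link = item.find("a")
    print(link.text)         # text content: "First"
    print(link.get("href"))  # attribute value: "/page1"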


4. Handling Dynamic Web Pages

If a webpage's content is generated dynamically via JavaScript, requests alone only returns the initial HTML without the rendered content. Use selenium, which drives a real browser, instead:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")
driver.implicitly_wait(5)  # Wait up to 5 seconds for elements to appear

# Get the first <h1> heading
element = driver.find_element(By.TAG_NAME, "h1")
print("First heading:", element.text)

driver.quit()

Key considerations:

  • Requires installing WebDriver (e.g., chromedriver)

  • implicitly_wait() allows Selenium to wait for page elements

  • find_element() locates DOM elements by tag name, CSS selector, XPath, and more; a headless, explicit-wait variant is sketched below
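
As an illustrative variant (not part of the original example), Selenium can run Chrome in headless mode and use an explicit wait instead of an implicit one:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run without opening a visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    # Wait up to 10 seconds for an <h1> element to appear in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "h1"))
    )
    print(element.text)
finally:
    driver.quit()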


5. Handling Anti-Scraping Mechanisms

(1) Use Random User-Agent

Generate a random User-Agent using fake_useragent:

pip install fake-useragent

from fake_useragent import UserAgent
import requests

headers = {"User-Agent": UserAgent().random}
response = requests.get("https://example.com", headers=headers)

(2) Add Delays Between Requests

To avoid being blocked due to high-frequency requests:

import time
import random

time.sleep(random.uniform(2, 5))  # Random wait between 2 and 5 seconds
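
In practice the delay usually sits inside the crawl loop, for example (the URL list below is a made-up placeholder):

import time
import random
import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs

for url in urls:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    print(url, response.status_code)
    time.sleep(random.uniform(2, 5))  # pause before the next request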

(3) Use Proxy IP (LuckData Proxy Services)

LuckData provides datacenter proxies, dynamic residential proxies, and unlimited dynamic residential proxies with over 120 million residential IPs, supporting HTTP/HTTPS, ideal for brand protection, SEO monitoring, market research, and e-commerce applications.

LuckData Proxy Example (Python)

import requests

proxyip = "http://Account:Password@ahk.luckdata.io:Port"
url = "https://api.ip.cc"

proxies = {
    'http': proxyip,
    'https': proxyip,
}

data = requests.get(url=url, proxies=proxies)
print(data.text)

LuckData Proxy Advantages:

  • Global IP Coverage: Over 200 countries with precise location targeting (country, state, city)

  • Fast Response Time: Automated proxy setup with 0.6ms latency and 99.99% uptime

  • Unlimited Concurrent Sessions: High-performance servers with unlimited concurrent requests

  • Security & Compliance: Ensures privacy protection and legal compliance


6. Retrieving Data via APIs

Compared to traditional scraping, APIs offer a more stable and compliant way to obtain data. LuckData provides APIs for Walmart, Amazon, Google, TikTok, and other platforms, supporting Python requests with structured JSON responses.

6.1 API Request Example (Python)

Below is an example using LuckData's Walmart API to fetch product details:

import requests

headers = {
    'X-Luckdata-Api-Key': 'your luckdata key'
}

response = requests.get(
    'https://luckdata.io/api/walmart-API/get_vwzq?url=https://www.walmart.com/ip/NELEUS-Mens-Dry-Fit-Mesh-Athletic-Shirts-3-Pack-Black-Gray-Olive-Green-US-Size-M/439625664?classType=VARIANT',
    headers=headers
)

print(response.json())  # Parse the returned JSON data
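
As a small defensive sketch (the timeout and error handling are illustrative additions, and the exact JSON fields are defined by the API, so inspect the response before indexing into it):

import requests
import json

API_URL = (
    "https://luckdata.io/api/walmart-API/get_vwzq"
    "?url=https://www.walmart.com/ip/NELEUS-Mens-Dry-Fit-Mesh-Athletic-Shirts-3-Pack-Black-Gray-Olive-Green-US-Size-M/439625664?classType=VARIANT"
)
headers = {"X-Luckdata-Api-Key": "your luckdata key"}

response = requests.get(API_URL, headers=headers, timeout=30)
if response.ok:
    data = response.json()
    # Pretty-print the structured result to see which fields the API returns
    print(json.dumps(data, ensure_ascii=False, indent=2))
else:
    print("API request failed:", response.status_code)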

API Benefits:

  • Avoid bans (IP restrictions, captchas)

  • Structured data output (directly returns JSON data)

  • Scalability for enterprise applications (large-scale data retrieval)


7. Storing Data

Scraped data can be stored in CSV, JSON, or databases:

(1) Save as CSV

import csv

data = [("Title 1", "https://example.com/1"), ("Title 2", "https://example.com/2")]

with open("data.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Link"])
    writer.writerows(data)

(2) Save as JSON

import json

data = [{"title": "Title 1", "url": "https://example.com/1"}]

with open("data.json", "w", encoding="utf-8") as file:
    json.dump(data, file, ensure_ascii=False, indent=4)
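
(3) Save to a Database

The section intro also mentions databases as a storage option; here is a minimal sketch using Python's standard-library sqlite3 module (the database file, table name, and columns are illustrative, not part of the original article):

import sqlite3

data = [("Title 1", "https://example.com/1"), ("Title 2", "https://example.com/2")]

conn = sqlite3.connect("data.db")  # creates data.db if it does not exist
conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, link TEXT)")
conn.executemany("INSERT INTO pages (title, link) VALUES (?, ?)", data)
conn.commit()
conn.close()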


Conclusion

This article provided a comprehensive guide to web scraping with Python, covering:

  • Traditional scraping methods (requests, BeautifulSoup)

  • Handling dynamic pages (selenium)

  • Using LuckData proxies to bypass restrictions

  • Retrieving data efficiently via LuckData APIs

  • Data storage and optimization techniques

By leveraging these techniques, you can efficiently gather web data for data analysis, business intelligence, and more: https://luckdata.io/marketplace