Scrapy in Action: Efficiently Crawling Taobao Product Listings and Details

In e-commerce data extraction and analysis, scraping product data efficiently and reliably is a constant challenge. This article uses the Python-based Scrapy framework to build a scalable, maintainable crawler for Taobao that covers both product listing pages and detail pages. We'll also introduce optimization techniques and demonstrate how to store scraped data in databases such as MongoDB or MySQL.

This article is ideal for developers with basic Python knowledge who are looking to build a production-ready data scraping system and master the practical use of Scrapy.


1. Initializing a Scrapy Project and Understanding Its Structure

Scrapy is a high-performance, modular, asynchronous Python web scraping framework. It's perfect for building large-scale, structured scraping projects. Start by creating a Scrapy project:

scrapy startproject taobao_spider

Project structure overview:

taobao_spider/
├── items.py              # Data models
├── pipelines.py          # Data processing and storage
├── settings.py           # Global configurations
├── middlewares.py        # Middleware setup (User-Agent / Proxy)
└── spiders/
    └── search_spider.py  # Main crawler script

Each of these modules is extensible and can be reused across projects.
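For reference, items.py can hold a typed data model for the fields we'll scrape later. The spider in this article yields plain dicts, so this is optional; the class name and fields below are an illustrative sketch rather than a fixed schema:

# items.py -- optional typed data model (class and field names are illustrative)
import scrapy

class ProductItem(scrapy.Item):
    title = scrapy.Field()     # Product title
    price = scrapy.Field()     # Listed price
    shop = scrapy.Field()      # Store name
    sales = scrapy.Field()     # Sales count
    location = scrapy.Field()  # Shipping location
    url = scrapy.Field()       # Detail page URL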

2. Parsing Product Listings and Pagination Logic

Taobao's search page (e.g., https://s.taobao.com/search?q=headphones) uses AJAX to dynamically load content. Product data is embedded in the HTML as JSON. We can extract product IDs and then request the corresponding detail pages.

Pagination logic and product ID extraction:

# spiders/search_spider.py
import scrapy
import re

class TaobaoSearchSpider(scrapy.Spider):
    name = 'taobao_search'
    allowed_domains = ['s.taobao.com', 'item.taobao.com']
    start_urls = []

    def start_requests(self):
        keyword = 'headphones'
        for page in range(1, 6):  # First 5 pages
            offset = (page - 1) * 44  # Taobao lists 44 products per page
            url = f'https://s.taobao.com/search?q={keyword}&s={offset}'
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Product data is embedded in the page as JSON; pull out the product IDs
        item_ids = re.findall(r'"nid":"(\d+)"', response.text)
        for nid in item_ids:
            detail_url = f'https://item.taobao.com/item.htm?id={nid}'
            yield scrapy.Request(url=detail_url, callback=self.parse_detail)

We use a regex to extract the nid of each product from the HTML and generate detail page URLs.
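As a quick illustration, the regex pulls every nid value out of the raw response text (the sample string below is made up for demonstration):

import re

sample = '... "nid":"652381234567" ... "nid":"598700001111" ...'
print(re.findall(r'"nid":"(\d+)"', sample))
# ['652381234567', '598700001111']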

3. Scraping Product Detail Pages

The detail page contains core product information like title, price, store name, sales, reviews, and more. Since layouts can vary slightly between products, our parser should be fault-tolerant and robust.

Parsing logic example:

def parse_detail(self, response):
    def safe_get(selector):
        # Return the first matched value, or '' if the selector found nothing
        return selector.get().strip() if selector else ''

    title = safe_get(response.css('#J_Title .tb-main-title::attr(data-title)'))
    price = safe_get(response.css('.tb-rmb-num::text'))
    shop = safe_get(response.css('.tb-shop-name::text'))
    sales = safe_get(response.css('#J_SellCounter::text'))
    location = safe_get(response.css('.tb-deliveryAdd span::text'))

    yield {
        'title': title,
        'price': price,
        'shop': shop,
        'sales': sales,
        'location': location,
        'url': response.url,
    }

The helper function safe_get() is used to handle missing data and prevent crashes.

4. Storing Data in MongoDB Using Pipelines

Scrapy sends all parsed items to pipelines.py for processing and persistence. Below is an example pipeline for MongoDB:

pipelines.py:

import pymongo

class MongoPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient('localhost', 27017)
        self.db = self.client['taobao_data']
        self.col = self.db['products']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.col.update_one(
            {'title': item['title']},
            {'$set': dict(item)},
            upsert=True
        )
        return item

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'taobao_spider.pipelines.MongoPipeline': 300,
}

The use of upsert=True ensures that data is updated if already present, preventing duplicates.
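If you prefer a relational store, the same idea carries over to MySQL. Below is a minimal sketch using pymysql; the database name, table name, and credentials are placeholders, and a products table with a UNIQUE index on url is assumed to exist already:

# pipelines.py -- minimal MySQL variant (credentials and table are assumptions)
import pymysql

class MySQLPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host='localhost', user='root', password='your_password',
            database='taobao_data', charset='utf8mb4'
        )
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        # Insert or update keyed on the url column (assumes a UNIQUE index on url)
        sql = (
            "INSERT INTO products (title, price, shop, sales, location, url) "
            "VALUES (%s, %s, %s, %s, %s, %s) "
            "ON DUPLICATE KEY UPDATE title=VALUES(title), price=VALUES(price), "
            "shop=VALUES(shop), sales=VALUES(sales), location=VALUES(location)"
        )
        self.cursor.execute(sql, (
            item['title'], item['price'], item['shop'],
            item['sales'], item['location'], item['url']
        ))
        self.conn.commit()
        return item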

5. Performance and Anti-Blocking Optimizations

When scraping Taobao, rate limits and bot detection are common issues. Here are several key strategies:

1. Rate Limiting and Concurrency Control

# settings.py
DOWNLOAD_DELAY = 1.5
CONCURRENT_REQUESTS = 4

Adding a delay between requests and capping concurrency reduces the risk of being detected and banned.
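For finer control, Scrapy's built-in AutoThrottle extension can adjust the delay dynamically based on observed response times instead of using a fixed value. A minimal configuration looks like this:

# settings.py -- let Scrapy adapt the delay to server load
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0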

2. Random User-Agent and Proxy Rotation

In middlewares.py, define middleware that picks a random User-Agent for each request; proxies can be rotated the same way, as shown after the settings snippet below.

import random

class RandomUserAgentMiddleware:
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)

Enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'taobao_spider.middlewares.RandomUserAgentMiddleware': 400,
}
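Proxy rotation works the same way: a downloader middleware sets request.meta['proxy'] before each request. The proxy addresses below are placeholders for whatever pool you use:

# middlewares.py -- rotate proxies per request (proxy addresses are placeholders)
import random

class RandomProxyMiddleware:
    PROXIES = [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
    ]

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.PROXIES)

Enable it in DOWNLOADER_MIDDLEWARES alongside the User-Agent middleware, e.g. with a priority such as 410.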

3. Incremental Crawling Strategy

To avoid reprocessing existing records, check if a URL already exists in the database:

def process_item(self, item, spider):
    exists = self.col.find_one({'url': item['url']})
    if not exists:
        self.col.insert_one(dict(item))
    return item

4. Retry Mechanisms and Logging

Use Scrapy's built-in retry and logging settings:

RETRY_ENABLED = True
RETRY_TIMES = 3
LOG_LEVEL = 'INFO'

This ensures better fault tolerance in production.

6. Summary and Future Extensions

Using Scrapy, we’ve built a modular and production-ready Taobao crawler that covers:

  • Product listing extraction

  • Product detail scraping

  • Data storage with MongoDB

  • Performance and anti-blocking optimizations

This solution provides:

  • Clean architecture, easy to extend or maintain

  • MongoDB integration for further analysis

  • Support for user-agent rotation, throttling, and incremental updates

In the next steps, this crawler can be extended to:

  • Extract product reviews and analyze sentiment

  • Track historical prices and trends

  • Integrate Redis or Kafka for distributed crawling (see the scrapy-redis sketch below)
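As a pointer for the distributed option, the scrapy-redis extension replaces Scrapy's scheduler and duplicate filter with Redis-backed versions so several crawler instances can share one request queue. A minimal settings sketch, assuming scrapy-redis is installed and Redis runs locally, looks like this:

# settings.py -- minimal scrapy-redis setup (assumes scrapy-redis is installed)
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
SCHEDULER_PERSIST = True          # Keep the shared request queue between runs
REDIS_URL = 'redis://localhost:6379'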

Scrapy's rich ecosystem and mature architecture make it a great fit for building commercial-grade e-commerce data pipelines.


If you need the Taobao API, feel free to contact us: support@luckdata.com