Scrapy in Action: Efficiently Crawling Taobao Product Listings and Details
In e-commerce data extraction and analysis, scraping product data efficiently and reliably is a constant challenge. This article uses the Python-based Scrapy framework to build a scalable, maintainable crawler for Taobao that covers both product listing pages and detail pages. We'll also introduce optimization techniques and show how to store the scraped data in databases such as MongoDB or MySQL.
This article is ideal for developers with basic Python knowledge who are looking to build a production-ready data scraping system and master the practical use of Scrapy.

1. Initializing a Scrapy Project and Understanding Its Structure
Scrapy is a high-performance, modular, asynchronous Python web scraping framework. It's perfect for building large-scale, structured scraping projects. Start by creating a Scrapy project:
scrapy startproject taobao_spider
Project structure overview:
taobao_spider/
├── items.py              # Data models
├── pipelines.py          # Data processing and storage
├── settings.py           # Global configurations
├── middlewares.py        # Middleware setup (User-Agent / Proxy)
└── spiders/
    └── search_spider.py  # Main crawler script
Each module is highly extensible and supports modular reuse.
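For instance, the data model in items.py could be a simple Item whose fields mirror what the detail parser in section 3 extracts. This is only an illustrative sketch; ProductItem is a name chosen here for the example, not something the generated project contains:

# items.py (illustrative sketch; ProductItem is a hypothetical name)
import scrapy


class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    shop = scrapy.Field()
    sales = scrapy.Field()
    location = scrapy.Field()
    url = scrapy.Field()

The spiders below yield plain dicts, which Scrapy accepts just as well; switching to an Item like this mainly adds a declared schema, so setting an undeclared field fails early.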
2. Parsing Product Listings and Pagination Logic
Taobao's search page (e.g., https://s.taobao.com/search?q=headphones) uses AJAX to dynamically load content, and product data is embedded in the HTML as JSON. We can extract product IDs from it and then request the corresponding detail pages.
Pagination logic and product ID extraction:
# spiders/search_spider.py
import scrapy
import re


class TaobaoSearchSpider(scrapy.Spider):
    name = 'taobao_search'
    allowed_domains = ['s.taobao.com', 'item.taobao.com']
    start_urls = []

    def start_requests(self):
        keyword = 'headphones'
        for page in range(1, 6):  # First 5 pages
            offset = (page - 1) * 44
            url = f'https://s.taobao.com/search?q={keyword}&s={offset}'
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        item_ids = re.findall(r'"nid":"(\d+)"', response.text)
        for nid in item_ids:
            detail_url = f'https://item.taobao.com/item.htm?id={nid}'
            yield scrapy.Request(url=detail_url, callback=self.parse_detail)
We use a regex to extract the nid of each product from the HTML and generate the detail page URLs.
3. Scraping Product Detail Pages
The detail page contains core product information like title, price, store name, sales, reviews, and more. Since layouts can vary slightly between products, our parser should be fault-tolerant and robust.
Parsing logic example:
def parse_detail(self, response):
    def safe_get(selector):
        return selector.get().strip() if selector else ''

    title = safe_get(response.css('#J_Title .tb-main-title::attr(data-title)'))
    price = safe_get(response.css('.tb-rmb-num::text'))
    shop = safe_get(response.css('.tb-shop-name::text'))
    sales = safe_get(response.css('#J_SellCounter::text'))
    location = safe_get(response.css('.tb-deliveryAdd span::text'))

    yield {
        'title': title,
        'price': price,
        'shop': shop,
        'sales': sales,
        'location': location,
        'url': response.url
    }
The helper function safe_get() is used to handle missing data and prevent crashes when a selector matches nothing.
4. Storing Data in MongoDB Using Pipelines
Scrapy sends all parsed items to pipelines.py for processing and persistence. Below is an example pipeline for MongoDB:
pipelines.py:
import pymongo


class MongoPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient('localhost', 27017)
        self.db = self.client['taobao_data']
        self.col = self.db['products']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.col.update_one(
            {'title': item['title']},
            {'$set': dict(item)},
            upsert=True
        )
        return item
Enable the pipeline in settings.py:
ITEM_PIPELINES = {
    'taobao_spider.pipelines.MongoPipeline': 300,
}
The use of upsert=True ensures that data is updated if already present, preventing duplicates.
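If you prefer the MySQL option mentioned in the introduction, the same pipeline pattern applies. Below is a minimal sketch, assuming the pymysql driver is installed, a local MySQL server is running, and a products table with matching columns already exists; the credentials and table name are placeholders:

# pipelines.py (alternative MySQL sketch; connection details are placeholders)
import pymysql


class MySQLPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host='localhost', user='root', password='your_password',
            database='taobao_data', charset='utf8mb4'
        )
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # REPLACE INTO only deduplicates if `url` has a UNIQUE index
        sql = (
            "REPLACE INTO products (title, price, shop, sales, location, url) "
            "VALUES (%s, %s, %s, %s, %s, %s)"
        )
        self.cursor.execute(sql, (
            item['title'], item['price'], item['shop'],
            item['sales'], item['location'], item['url'],
        ))
        self.conn.commit()
        return item

Register it in ITEM_PIPELINES the same way as the MongoDB pipeline.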
5. Performance and Anti-Blocking Optimizations
When scraping Taobao, rate limits and bot detection are common issues. Here are several key strategies:
1. Rate Limiting and Concurrency Control
# settings.py
DOWNLOAD_DELAY = 1.5
CONCURRENT_REQUESTS = 4
Adding delays helps reduce detection and banning risk.
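If a fixed delay feels too blunt, Scrapy's built-in AutoThrottle extension can adjust the delay based on observed response times. A minimal configuration sketch (the numbers are starting points, not tuned values):

# settings.py (optional: let Scrapy adapt the delay automatically)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0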
2. Random User-Agent and Proxy Rotation
Define middleware in middlewares.py to randomly rotate user agents (or proxies).
import random


class RandomUserAgentMiddleware:
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...'
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
Enable it in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'taobao_spider.middlewares.RandomUserAgentMiddleware': 400,
}
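The same middleware pattern extends to proxies. The sketch below assumes you already have a pool of working HTTP proxies (the addresses shown are placeholders) and relies on Scrapy's standard request.meta['proxy'] key, which the built-in HttpProxyMiddleware honors:

# middlewares.py (proxy rotation sketch; proxy addresses are placeholders)
import random


class RandomProxyMiddleware:
    PROXIES = [
        'http://127.0.0.1:8001',
        'http://127.0.0.1:8002',
    ]

    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to each outgoing request
        request.meta['proxy'] = random.choice(self.PROXIES)

Register it in DOWNLOADER_MIDDLEWARES alongside RandomUserAgentMiddleware.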
3. Incremental Crawling Strategy
To avoid reprocessing existing records, check if a URL already exists in the database:
def process_item(self, item, spider):
    exists = self.col.find_one({'url': item['url']})
    if not exists:
        self.col.insert_one(dict(item))
    return item
4. Retry Mechanisms and Logging
Use Scrapy's built-in retry and logging settings:
RETRY_ENABLED = True
RETRY_TIMES = 3
LOG_LEVEL = 'INFO'
This ensures better fault tolerance in production.
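Two related settings are worth knowing: RETRY_HTTP_CODES controls which response codes trigger a retry, and LOG_FILE writes the log to disk instead of the console. One possible configuration (the code list mirrors Scrapy's usual defaults; the file name is chosen for this project):

# settings.py (optional refinements)
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
LOG_FILE = 'taobao_spider.log'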
6. Summary and Future Extensions
Using Scrapy, we’ve built a modular and production-ready Taobao crawler that covers:
Product listing extraction
Product detail scraping
Data storage with MongoDB
Performance and anti-blocking optimizations
This solution provides:
Clean architecture, easy to extend or maintain
MongoDB integration for further analysis
Support for user-agent rotation, throttling, and incremental updates
In the next steps, this crawler can be extended to:
Extract product reviews and analyze sentiment
Track historical prices and trends
Integrate Redis or Kafka for distributed crawling
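For the Redis route, the scrapy-redis extension is a common starting point. A minimal settings sketch, assuming scrapy-redis is installed and a Redis server is reachable on localhost:

# settings.py (scrapy-redis sketch; assumes `pip install scrapy-redis`)
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
SCHEDULER_PERSIST = True  # keep the shared request queue between runs
REDIS_URL = 'redis://localhost:6379'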
Scrapy's rich ecosystem and mature architecture make it a great fit for building commercial-grade e-commerce data pipelines.
Articles related to APIs:
Introduction to Taobao API: Basic Concepts and Application Scenarios
Taobao API: Authentication & Request Flow Explained with Code Examples
Using the Taobao API to Retrieve Product Information and Implement Keyword Search
How to Use the Taobao API to Build a Product Price Tracker and Alert System
Using the Taobao API to Build a Category-Based Product Recommendation System
Taobao Data Source Analysis and Technical Selection: API vs Web Scraping vs Hybrid Crawling
If you need the Taobao API, feel free to contact us: support@luckdata.com