In this blog post, I detail the process of scraping product data from Daraz, a leading e-commerce platform in Pakistan. Using Scrapy, a robust Python framework for web scraping, I extract essential product details such as names, prices, stock status, and product URLs. This guide walks you through each step of the process.
Step 1: Setting Up the Scrapy Project
Before diving into data extraction, it's essential to set up the Scrapy environment.
Installation
Begin by installing Scrapy using pip:
pip install scrapy
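If you prefer to keep dependencies isolated, the install can be run inside a virtual environment first (standard Python tooling, not specific to this project):
python -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate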
Creating the Scrapy Project
Initialize a new Scrapy project with the following commands:
scrapy startproject daraz
cd daraz
These commands generate the necessary directory structure, including folders for spiders and configuration files.
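The generated layout looks like this (file names follow Scrapy's default project template):
daraz/
    scrapy.cfg            # deployment configuration
    daraz/
        __init__.py
        items.py          # item definitions (unused in this guide)
        middlewares.py
        pipelines.py
        settings.py       # project-wide settings
        spiders/          # spider modules live here
            __init__.py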
Step 2: Understanding the Structure of the Website
Daraz dynamically loads its product listings using AJAX requests. By inspecting the network tab in Developer Tools, I observed that product data is returned in JSON format rather than standard HTML. This insight is crucial, as it allows for direct data retrieval from the AJAX endpoints.
The product URLs contain parameters like categories, filters, and pagination. Crafting requests with these parameters ensures the retrieval of complete and accurate data.
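For illustration, the first listing request used later in the spider looks like the first URL below; the page parameter in the second is an assumption based on typical Daraz pagination, and the exact parameters vary by category and filters:
https://www.daraz.pk/sport-action-camera-accessory-kits/?ajax=true
https://www.daraz.pk/sport-action-camera-accessory-kits/?ajax=true&page=2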
Step 3: Writing the Scrapy Spider
Within the daraz/spiders/ directory, I created a new spider (for example, in a file named daraz_product.py) to handle the data extraction.
import scrapy
import json


class DarazProductSpider(scrapy.Spider):
    name = "daraz_product"
    start_urls = ["https://www.daraz.pk/sport-action-camera-accessory-kits/?ajax=true"]

    def parse(self, response):
        # The ajax=true parameter makes Daraz return JSON instead of HTML.
        try:
            data = json.loads(response.text)
        except json.JSONDecodeError:
            self.logger.error("JSON parse failed for %s", response.url)
            return

        # Product listings live under mods.listItems in the JSON payload.
        list_items = data.get('mods', {}).get('listItems', [])
        for item in list_items:
            yield {
                'name': item.get('name'),
                'price': item.get('priceShow'),
                'stock': item.get('inStock'),
                'product_url': item.get('itemUrl'),
            }

        # Follow pagination: seoInfo.nextHref points at the next listing page.
        next_href = data.get('seoInfo', {}).get('nextHref')
        if next_href:
            # Re-append ajax=true so the next page also responds with JSON.
            if "ajax=true" not in next_href:
                next_href += "&ajax=true" if "?" in next_href else "?ajax=true"
            next_page = response.urljoin(next_href)
            yield scrapy.Request(url=next_page, callback=self.parse)
Explanation:
- name: Defines the spider's name, used to run it via scrapy crawl daraz_product.
- start_urls: Contains the initial URL with the necessary ajax=true query parameter, ensuring the site returns product data in JSON format.
- parse method: Parses the JSON response, extracts product details, and handles pagination by recursively calling itself for subsequent pages.
Step 4: Running the Spider
With the spider configured, execute it using:
scrapy crawl daraz_product -o products.json
This command initiates the spider and saves the scraped product data into a products.json file. Scrapy manages the requests, pagination, and data storage, while the spider's parse method handles the JSON parsing.
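One detail worth knowing: in recent Scrapy versions (2.1+), -o appends to an existing file while -O overwrites it, so repeated runs are cleaner with:
scrapy crawl daraz_product -O products.json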
Step 5: The Scraped Data
Upon completion, the spider yields a JSON file containing structured data, including:
- Name: Product title
- Price: Product cost
- Stock: Availability status
- Product URL: Direct link to the product page
Sample Output:
{
"name": "Wireless Earbuds",
"price": "Rs. 2,999",
"stock": "In Stock",
"product_url": "https://www.daraz.pk/products/..."
}
Step 6: Conclusion and Next Steps
Scrapy simplifies the process of scraping dynamic content from Daraz by effectively handling AJAX responses and pagination. The spider navigates through multiple pages, extracting all required product data.
Next Steps:
- Optimize the Spider: Implement error handling for missing data.
- Extract Additional Data: Include fields like reviews, ratings, and seller information.
- Export Formats: Enable CSV or Excel outputs for easier data analysis (see the examples below).
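Scrapy's feed exports support CSV out of the box; the same crawl command works with a .csv extension:
scrapy crawl daraz_product -o products.csv
Excel output isn't built into Scrapy. One common approach is converting the CSV afterwards, for example with pandas (an extra dependency, shown here only as a sketch):
import pandas as pd

# Convert the scraped CSV into an Excel workbook (requires openpyxl).
pd.read_csv("products.csv").to_excel("products.xlsx", index=False)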
Final Thoughts:
Web scraping is a powerful tool for obtaining structured data from dynamic websites. With Scrapy, automating the process of fetching, parsing, and storing product information from Daraz becomes straightforward.
Frequently Asked Questions (FAQs)
1. Why include ajax=true in the URL?
Adding ajax=true ensures that Daraz returns product data in JSON format, simplifying the parsing and extraction process.
2. How is pagination handled?
The spider checks for the nextHref field in the JSON response. If present, it appends ajax=true (when missing) and recursively sends a request to the next page.
3. Can all categories be scraped with this spider?
Yes, but you'll need to update the start_urls for each category. Alternatively, pass category URLs dynamically using command-line arguments or custom settings, as sketched below.
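A minimal sketch of the command-line approach, assuming a hypothetical category_url argument (Scrapy passes -a key=value pairs to the spider's __init__):
import scrapy

class DarazProductSpider(scrapy.Spider):
    name = "daraz_product"

    def __init__(self, category_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # category_url is a hypothetical -a argument; default to the original category.
        self.start_urls = [
            category_url
            or "https://www.daraz.pk/sport-action-camera-accessory-kits/?ajax=true"
        ]
It would then be invoked with a placeholder category of your choice:
scrapy crawl daraz_product -a category_url="https://www.daraz.pk/<some-category>/?ajax=true"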
4. Is scraping Daraz legal?
Web scraping exists in a legal gray area. While technically feasible, it's crucial to respect a site's robots.txt rules and terms of service. Always scrape responsibly to avoid overloading the server.
5. What if the spider gets blocked?
If too many requests are made rapidly, the site may block the spider. To prevent this:
- Add delays using DOWNLOAD_DELAY in settings.
- Utilize proxies or rotate user agents.
- Implement error handling and graceful retries (see the settings sketch below).
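A minimal settings.py sketch covering these points; the values are illustrative starting points, not tuned recommendations:
# settings.py (illustrative values)
DOWNLOAD_DELAY = 1.0         # pause between requests to the same domain
AUTOTHROTTLE_ENABLED = True  # adapt the delay to observed server latency
RETRY_ENABLED = True
RETRY_TIMES = 3              # retry transient failures a few times
ROBOTSTXT_OBEY = True        # respect robots.txt (see the note below)
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"  # example UA; rotate via middleware if needed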
Note: Ensure compliance with Daraz's terms of service and robots.txt when scraping data.