How I Extracted Product Data from Daraz Using Scrapy

Learn how I built a Scrapy spider to extract product data from Daraz, including names, prices, stock status, and URLs. This step-by-step guide covers handling AJAX and JSON responses, making it perfect for beginners.

In this blog, I detail the process of scraping product data from Daraz, a leading e-commerce platform in Pakistan. Utilizing Scrapy, a robust Python framework for web scraping, I aimed to extract essential product details such as names, prices, stock status, and product URLs. This guide walks you through each step of the process.

Step 1: Setting Up the Scrapy Project

Before diving into data extraction, it's essential to set up the Scrapy environment.

Installation

Begin by installing Scrapy using pip:

pip install scrapy

Creating the Scrapy Project

Initialize a new Scrapy project with the following commands:

scrapy startproject daraz
cd daraz

These commands generate the necessary directory structure, including folders for spiders and configuration files.
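
For reference, startproject produces the following layout:

daraz/
    scrapy.cfg
    daraz/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py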

Step 2: Understanding the Structure of the Website

Daraz dynamically loads its product listings using AJAX requests. By inspecting the network tab in Developer Tools, I observed that product data is returned in JSON format rather than standard HTML. This insight is crucial, as it allows for direct data retrieval from the AJAX endpoints.
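
Stripped down to the fields this guide relies on, the payload looks roughly like this (all other keys omitted):

{
  "mods": {
    "listItems": [
      {
        "name": "...",
        "priceShow": "...",
        "inStock": true,
        "itemUrl": "..."
      }
    ]
  },
  "seoInfo": {
    "nextHref": "..."
  }
}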

The product URLs contain parameters like categories, filters, and pagination. Crafting requests with these parameters ensures the retrieval of complete and accurate data.
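
As a rough sketch, such a URL can be assembled with urllib. Note that the parameter names beyond ajax=true (such as page) are assumptions from my own inspection and may vary by category:

from urllib.parse import urlencode

def build_listing_url(category_slug, page=1, extra_params=None):
    # ajax=true is required for JSON responses; "page" and any extra
    # filter or sort keys are assumed parameter names, so verify them
    # in the network tab first
    params = {"ajax": "true", "page": page}
    if extra_params:
        params.update(extra_params)
    return f"https://www.daraz.pk/{category_slug}/?{urlencode(params)}"

# Example: page 2 of the category used in this guide
print(build_listing_url("sport-action-camera-accessory-kits", page=2))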

Step 3: Writing the Scrapy Spider

Within the daraz/spiders/ directory, I created a new spider to handle the data extraction.

import scrapy
import json

class DarazProductSpider(scrapy.Spider):
    name = "daraz_product"
    # The ajax=true parameter makes Daraz return the listing as JSON instead of HTML
    start_urls = ["https://www.daraz.pk/sport-action-camera-accessory-kits/?ajax=true"]

    def parse(self, response):
        try:
            data = json.loads(response.text)
        except json.JSONDecodeError:
            self.logger.error("JSON parse failed for %s", response.url)
            return

        # Product entries live under mods -> listItems in the payload
        list_items = data.get('mods', {}).get('listItems', [])
        for item in list_items:
            yield {
                'name': item.get('name'),
                'price': item.get('priceShow'),
                'stock': item.get('inStock'),
                'product_url': item.get('itemUrl')
            }

        # seoInfo -> nextHref holds the next page's URL, but without ajax=true
        next_href = data.get('seoInfo', {}).get('nextHref')
        if next_href:
            if "ajax=true" not in next_href:
                next_href += "&ajax=true" if "?" in next_href else "?ajax=true"
            next_page = response.urljoin(next_href)
            yield scrapy.Request(url=next_page, callback=self.parse)

Explanation:

  • name: Defines the spider's name, used to run it via scrapy crawl daraz_product.
  • start_urls: Contains the initial URL with necessary query parameters, ensuring the site returns product data in JSON format.
  • parse method: Parses the JSON response, extracts product details, and handles pagination by yielding a new request with itself as the callback.

Step 4: Running the Spider

With the spider configured, execute it using:

scrapy crawl daraz_product -o products.json

This command initiates the spider and saves the scraped product data into a products.json file. Scrapy efficiently manages requests, JSON parsing, pagination, and data storage.
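
If you prefer configuring the export once instead of passing -o on every run, Scrapy's FEEDS setting (available since Scrapy 2.1) achieves the same result. A minimal sketch for settings.py:

# settings.py
FEEDS = {
    "products.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,  # replace the file on each crawl (newer Scrapy versions)
    },
}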

Step 5: The Scraped Data

Upon completion, the crawl produces a JSON file of structured records, each containing:

  • Name: Product title
  • Price: Product cost
  • Stock: Availability status
  • Product URL: Direct link to the product page

Sample Output:

{
  "name": "Wireless Earbuds",
  "price": "Rs. 2,999",
  "stock": "In Stock",
  "product_url": "https://www.daraz.pk/products/..."
}

Step 6: Conclusion and Next Steps

Scrapy simplifies the process of scraping dynamic content from Daraz by effectively handling AJAX responses and pagination. The spider navigates through multiple pages, extracting all required product data.

Next Steps:

  • Optimize the Spider: Implement error handling for missing data.
  • Extract Additional Data: Include fields like reviews, ratings, and seller information (see the sketch after this list).
  • Export Formats: Enable CSV or Excel outputs for easier data analysis.
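
As a starting point for richer records, the item mapping can be pulled into a helper. The extra keys below (ratingScore, review, sellerName) are guesses about Daraz's JSON and should be verified in the network tab before relying on them:

def parse_item(item):
    """Map one listItems entry to an output record."""
    return {
        'name': item.get('name'),
        'price': item.get('priceShow'),
        'stock': item.get('inStock'),
        'product_url': item.get('itemUrl'),
        'rating': item.get('ratingScore'),  # assumed key, verify first
        'reviews': item.get('review'),      # assumed key, verify first
        'seller': item.get('sellerName'),   # assumed key, verify first
    }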

Final Thoughts:

Web scraping is a powerful tool for obtaining structured data from dynamic websites. With Scrapy, automating the process of fetching, parsing, and storing product information from Daraz becomes straightforward.

Frequently Asked Questions (FAQs)

1. Why include ajax=true in the URL?

Adding ajax=true ensures that Daraz returns product data in JSON format, simplifying the parsing and extraction process.
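
For comparison:

https://www.daraz.pk/<category>/             -> full HTML page
https://www.daraz.pk/<category>/?ajax=true   -> JSON payload only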

2. How is pagination handled?

The spider checks for the nextHref field in the JSON response. If present, it appends ajax=true when missing and schedules a request for the next page, repeating until no nextHref remains.

3. Can all categories be scraped with this spider?

Yes, but you'll need to update the start_urls for each category. Alternatively, pass category URLs dynamically using command-line arguments or custom settings.
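
Here is a minimal sketch of the command-line-argument approach; the spider name, argument name, and reduced field set are my own choices for illustration:

import scrapy
import json

class DarazCategorySpider(scrapy.Spider):
    # Run with: scrapy crawl daraz_category -a category_url="https://www.daraz.pk/<category>/"
    name = "daraz_category"

    def __init__(self, category_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not category_url:
            raise ValueError('Pass -a category_url="<listing URL>"')
        # Reuse the ajax=true trick from Step 3
        if "ajax=true" not in category_url:
            category_url += "&ajax=true" if "?" in category_url else "?ajax=true"
        self.start_urls = [category_url]

    def parse(self, response):
        # Same JSON handling as the daraz_product spider in Step 3
        data = json.loads(response.text)
        for item in data.get('mods', {}).get('listItems', []):
            yield {'name': item.get('name'), 'price': item.get('priceShow')}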

4. Is scraping Daraz legal?

Web scraping exists in a legal gray area. While technically feasible, it's crucial to respect a site's robots.txt rules and terms of service. Always scrape responsibly to avoid overloading the server.

5. What if the spider gets blocked?

If too many requests are made rapidly, the site may block the spider. To prevent this (a sample configuration follows the list):

  • Add delays using DOWNLOAD_DELAY in settings.
  • Utilize proxies or rotate user agents.
  • Implement graceful error handling and retries.
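
A sample settings.py combining these measures; the values are starting points, not tuned recommendations:

# settings.py
DOWNLOAD_DELAY = 2               # seconds between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the delay between 0.5x and 1.5x of DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True      # adapt the delay to server response times
RETRY_ENABLED = True
RETRY_TIMES = 3                  # retry failed requests up to three times
USER_AGENT = "Mozilla/5.0 ..."   # set a realistic user agent; rotate via middleware if needed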

Note: Ensure compliance with Daraz's terms of service and robots.txt when scraping data.