Web Scraping eBay Products

We will walk you through the step-by-step process of using Scrapy to extract product data from eBay.


Web scraping can be an incredibly powerful tool, especially for gathering data from e-commerce sites like eBay. In this blog, we’ll walk through how to scrape product data from eBay using Python’s Scrapy framework. We’ll cover the basics, step by step, to help you get started.

Why Scrape E-Commerce Data From the Web?

Scraping e-commerce data can provide valuable insights into market trends, product pricing, and customer preferences. Whether you’re conducting market research, building a price comparison tool, or just curious about product information, web scraping can automate the data collection process and help you gather large amounts of data efficiently.

eBay Scraping Libraries and Tools

For scraping eBay, several libraries and tools can be used, but we’ll focus on Scrapy, a powerful and flexible web scraping framework for Python. Scrapy makes it easy to define how to extract data from web pages and handle the common challenges of web scraping, such as handling requests, parsing HTML, and storing the extracted data.

Scraping eBay Product Data With Scrapy Framework

Let’s dive into scraping eBay using Scrapy. Follow these steps to build a simple eBay scraper.

Initial Setup

First, make sure you have Python 3 installed on your machine. You can check this by running:

$ python --version

Next, install Scrapy if you haven’t already:

$ pip install scrapy

Installing Required Libraries

Ensure you have all the necessary libraries installed. Besides Scrapy, you may want `lxml` for parsing HTML; the `json` module used for handling data ships with Python's standard library, so it needs no separate install.

$ pip install lxml

Creating a Scrapy Project

Create a new Scrapy project:

$ scrapy startproject ebay_scraper

Navigate into your project directory:

$ cd ebay_scraper

Create a new spider:

$ scrapy genspider ebay_spider ebay.com

Setting Up the Spider

In your `ebay_spider.py` file, update your spider to define how it should parse the HTML document. Here’s a basic structure:


import scrapy

class EbaySpiderSpider(scrapy.Spider):
    name = 'ebay_spider'
    start_urls = ['https://www.ebay.com/sch/i.html?_nkw=laptop']

    def parse(self, response):
        # Parsing logic goes here
        pass

Inspecting the Product Page

Inspect the product pages on eBay to identify the HTML elements that contain the data you want to extract. Use your browser’s developer tools (right-click on the page and select “Inspect”) to find the XPath of these elements.

Extracting the Price Data

Update your `parse` method to extract the price data from the page. Here’s how you can extract product links and follow them:


def parse(self, response):
    product_links = response.xpath('//a[@class="s-item__link"]/@href').getall()
    yield from response.follow_all(product_links, self.parse_product)

Extracting Item Details

Define a `parse_product` method to extract details from each product page:


def parse_product(self, response):
    item = {}
    item['title'] = response.xpath('//h1[@class="x-item-title__mainTitle"]/span/text()').get()
    item['price'] = response.xpath('//div[@class="x-price-primary"]/span/text()').get()
    yield item

Extracting Dynamic Item Specifics

This part of the code is responsible for extracting detailed product information, specifically the item specifics, from an eBay product page. eBay product listings often include various specifics such as brand, model, size, color, and other attributes that describe the product in detail. The following code snippet captures these details dynamically:

val = {}
for label in response.xpath('//div[@class="vim x-about-this-item"]//dt'):
    spec_key = ''.join(label.xpath('.//text()').getall())
    spec_value = ''.join(label.xpath('./following-sibling::dd[1]//text()').getall())
    val[spec_key] = spec_value
item['Item specifics'] = val

Saving Data to JSON

To store the scraped data, you can use Scrapy’s built-in JSON exporter. Run your spider with the following command:

$ scrapy crawl ebay_spider -o products.json -a start_url="link_to_listing_page"

This will save the scraped data from the listing page to a file named `products.json`.
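Alternatively, you can configure the export once in the project's `settings.py` using Scrapy's FEEDS setting instead of passing `-o` on every run. A minimal sketch (the file name is arbitrary):

```python
# settings.py -- equivalent to passing "-o products.json" on the command line
FEEDS = {
    "products.json": {
        "format": "json",     # also supports "jsonlines", "csv", "xml", ...
        "overwrite": True,    # replace the file on each run instead of appending
    },
}
```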

Understanding HTTP Request Headers

Before we conclude, let's take a closer look at the `header` dictionary defined in the spider. It holds the HTTP headers the spider sends with each request to eBay. Let's break down each key-value pair to understand its purpose:

'Accept': 'application/json'

This header indicates that the client (our spider) expects the server to return JSON data.

'Accept-Encoding': 'gzip, deflate, br'

This specifies the content encoding that the client can handle. gzip, deflate, and br (Brotli) are all compression algorithms that help reduce the size of the response data.

'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8,es-MX;q=0.7,es;q=0.6'

This indicates the preferred languages for the response content. The q parameter indicates the relative quality factor for each language, with higher values representing higher preference.
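As a quick, standalone illustration (not part of the spider), a few lines of Python can rank the languages in that header by their q-values:

```python
def parse_accept_language(header):
    """Return (language, q) pairs sorted by descending preference."""
    prefs = []
    for part in header.split(','):
        lang, _, q = part.strip().partition(';q=')
        # A missing q-value defaults to 1.0, the highest preference.
        prefs.append((lang, float(q) if q else 1.0))
    return sorted(prefs, key=lambda p: p[1], reverse=True)

print(parse_accept_language('en-GB,en-US;q=0.9,en;q=0.8,es-MX;q=0.7,es;q=0.6'))
# [('en-GB', 1.0), ('en-US', 0.9), ('en', 0.8), ('es-MX', 0.7), ('es', 0.6)]
```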

'Connection': 'keep-alive'

This header keeps the connection open for multiple requests/responses, improving efficiency by reusing the same connection.

'Content-Type': 'application/json'

This specifies the media type of the resource being sent to the server. Here, it indicates that the content is in JSON format.

'Origin': 'https://www.ebay.com'

This header indicates the origin of the request, used in CORS (Cross-Origin Resource Sharing) to determine whether the request should be allowed.

'Sec-Fetch-Dest': 'empty'

This is a security feature used to indicate the destination of the request, helping the server understand the context of the request.

'Sec-Fetch-Mode': 'cors'

This indicates that the request is a CORS request, which allows the server to know that the request is coming from a different origin.

'Sec-Fetch-Site': 'cross-site'

This indicates that the request is coming from a different site, part of the security context of the request.

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'

The User-Agent string identifies the client software making the request. This helps the server understand the type of client and sometimes adjust the response accordingly. Here, it mimics a request coming from a Chrome browser on a Windows 10 machine.

'X-EBAY-C-MARKETPLACE-ID': 'EBAY-GB'

This header specifies the eBay marketplace identifier, indicating that the requests are intended for the eBay UK site.

'sec-ch-ua': '"Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"'

This is part of the User-Agent Client Hints, providing information about the browser and its version in a structured format.

'sec-ch-ua-mobile': '?0'

This indicates whether the client is a mobile device (?0 means no).

Complete Spider Code

Here’s the complete spider code:


import re
import scrapy

class EbaySpiderSpider(scrapy.Spider):
    name = 'ebay_spider'
    header = {
        'Accept': 'application/json',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8,es-MX;q=0.7,es;q=0.6',
        'Connection': 'keep-alive',
        'Content-Type': 'application/json',
        'Origin': 'https://www.ebay.com',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'cross-site',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        'X-EBAY-C-MARKETPLACE-ID': 'EBAY-GB',
        'sec-ch-ua': '"Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"',
        'sec-ch-ua-mobile': '?0'
    }

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.page_count = 0
        # The listing-page URL is passed on the command line via -a start_url=...
        start_url = kwargs['start_url']
        self.start_urls = [start_url + '&_fcid=3']

    def parse(self, response, **kwargs):
        self.page_count += 1
        product_links = response.xpath('//a[@class="s-item__link"]/@href').getall()
        if product_links:
            yield from response.follow_all(product_links, callback=self.parse_product, headers=self.header, meta={'page_link': response.request.url})
        page_link = response.xpath('//a[@aria-label="Go to next search page"]/@href').get()
        if page_link:
            yield response.follow(page_link, callback=self.parse)

    def parse_product(self, response):
        item = {}
        item['Pagination_Link'] = response.meta['page_link']
        breadcrumbs = response.xpath('//nav[contains(@class, "breadcrumbs")]//a[@class="seo-breadcrumb-text"]/span/text()').getall()
        for index, element in enumerate(breadcrumbs):
            item[f'Category_{index + 1}'] = element
        item['Product_link'] = response.request.url
        img_urls = re.findall('"ZOOM_GUID","URL":"(.*?)"', response.text)
        for index, img_url in enumerate(img_urls):
            item[f'Product_image_{index + 1}'] = img_url
        item['Product_title'] = response.xpath('//h1[@class="x-item-title__mainTitle"]/span/text()').get()
        item['Product_Price'] = response.xpath('//div[@class="x-price-primary"]/span/text()').get()
        val = {}
        for label in response.xpath('//div[@class="vim x-about-this-item"]//dt'):
            spec_key = ''.join(label.xpath('.//text()').getall())
            spec_value = ''.join(label.xpath('./following-sibling::dd[1]//text()').getall())
            val[spec_key] = spec_value
        item['Item specifics'] = val
        yield item

Conclusion

Scraping eBay with Python and Scrapy is a practical way to gather detailed product information. By following the steps outlined in this blog, you can set up a scraper that fetches and stores eBay data efficiently. Remember to always respect the website's `robots.txt` and terms of service to avoid any legal issues. Happy scraping!
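To that end, Scrapy exposes a few `settings.py` options that make a crawler more polite; the values below are illustrative, not tuned recommendations:

```python
# settings.py -- polite-crawling options (illustrative values)
ROBOTSTXT_OBEY = True        # respect the site's robots.txt rules
DOWNLOAD_DELAY = 1.0         # seconds to wait between requests to the same domain
AUTOTHROTTLE_ENABLED = True  # adapt the request rate to server response times
```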

Frequently Asked Questions

Why scrape e-commerce data from the web?
Scraping e-commerce data can provide valuable insights into market trends, product pricing, and customer preferences. It can be useful for market research, building price comparison tools, or gathering data for analysis.

Which libraries and tools can I use to scrape eBay?
For scraping eBay, you can use the Scrapy framework in Python. Additionally, you may need libraries like `lxml` for parsing HTML and the standard-library `json` module for handling data.

How do I set up a Scrapy project for scraping eBay?
First, install Scrapy using pip: `pip install scrapy`. Then, create a new Scrapy project with `scrapy startproject ebay_scraper`. Navigate into the project directory and generate a new spider with `scrapy genspider ebay_spider ebay.com`.

How do I extract product data from eBay pages?
Inspect the product pages on eBay to identify the HTML elements that contain the data you want to extract. Use XPath to select these elements in your Scrapy spider. For example, you can extract product links and follow them to pull detailed product information such as the title and price.

How do I handle pagination?
To handle pagination, extract the link to the next page from the response and yield a request to follow it, as shown in the blog's spider code. A counter such as `page_count` can also be used if you want to limit the number of pages you scrape.

How do I save the scraped data to a JSON file?
Scrapy provides a built-in JSON exporter. Run your spider with `scrapy crawl ebay_spider -o products.json` to export the scraped data to a JSON file.

Why does the spider send custom HTTP headers?
The HTTP headers in the Scrapy spider mimic a legitimate browser request, ensuring that the server returns the correct response. Headers such as 'User-Agent', 'Accept', and 'Accept-Language' identify the client and specify the types of responses it can handle.