Web scraping can be an incredibly powerful tool, especially for gathering data from e-commerce sites like eBay. In this blog, we’ll walk through how to scrape product data from eBay using Python’s Scrapy framework. We’ll cover the basics, step by step, to help you get started.
Why Scrape E-Commerce Data From the Web?
Scraping e-commerce data can provide valuable insights into market trends, product pricing, and customer preferences. Whether you’re conducting market research, building a price comparison tool, or just curious about product information, web scraping can automate the data collection process and help you gather large amounts of data efficiently.
eBay Scraping Libraries and Tools
For scraping eBay, several libraries and tools can be used, but we’ll focus on Scrapy, a powerful and flexible web scraping framework for Python. Scrapy makes it easy to define how to extract data from web pages and handle the common challenges of web scraping, such as handling requests, parsing HTML, and storing the extracted data.
Scraping eBay Product Data With Scrapy Framework
Let’s dive into scraping eBay using Scrapy. Follow these steps to build a simple eBay scraper.
Initial Setup
First, make sure you have Python 3 installed on your machine. You can check this by running:
$ python --version
Next, install Scrapy if you haven’t already:
$ pip install scrapy
Installing Required Libraries
Ensure you have all the necessary libraries installed. Besides Scrapy, you may need `lxml` for parsing HTML (Scrapy pulls it in as a dependency, but it’s worth confirming). The `json` module ships with Python’s standard library, so it needs no installation.
$ pip install lxml
Creating a Scrapy Project
Create a new Scrapy project:
$ scrapy startproject ebay_scraper
Navigate into your project directory:
$ cd ebay_scraper
Create a new spider:
$ scrapy genspider ebay_spider ebay.com
Setting Up the Spider
In your `ebay_spider.py` file, update your spider to define how it should parse the HTML document. Here’s a basic structure:
import scrapy

class EbaySpiderSpider(scrapy.Spider):
    name = 'ebay_spider'
    start_urls = ['https://www.ebay.com/sch/i.html?_nkw=laptop']

    def parse(self, response):
        # Parsing logic goes here
        pass
Inspecting the Product Page
Inspect the product pages on eBay to identify the HTML elements that contain the data you want to extract. Use your browser’s developer tools (right-click on the page and select “Inspect”) to find the XPath of these elements.
Extracting the Price Data
Update your `parse` method to extract the price data from the page. Here’s how you can extract product links and follow them:
def parse(self, response):
    # Collect every product link on the listing page and follow each one
    product_links = response.xpath('//a[@class="s-item__link"]/@href').getall()
    yield from response.follow_all(product_links, self.parse_product)
Extracting Item Details
Define a `parse_product` method to extract details from each product page:
def parse_product(self, response):
    item = {}
    item['title'] = response.xpath('//h1[@class="it-ttl"]/text()').get()
    item['price'] = response.xpath('//span[@id="prcIsum"]/text()').get()
    yield item
Note that eBay’s markup varies between listing templates: the `it-ttl` and `prcIsum` selectors match an older layout. If they return `None`, inspect the page for the newer `x-item-title__mainTitle` and `x-price-primary` classes.
Extracting Dynamic Item Specifics
This part of the code is responsible for extracting detailed product information, specifically the item specifics, from an eBay product page. eBay product listings often include various specifics such as brand, model, size, color, and other attributes that describe the product in detail. The following code snippet captures these details dynamically:
val = {}
for label in response.xpath('//div[@class="vim x-about-this-item"]//dt'):
    # Each <dt> holds a label (e.g. "Brand"); the following <dd> holds its value
    spec_key = ''.join(label.xpath('.//text()').getall())
    spec_value = ''.join(label.xpath('./following-sibling::dd[1]//text()').getall())
    val[spec_key] = spec_value
item['Item specifics'] = val
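The label-to-value pairing logic can be tried outside Scrapy as well. Here is a minimal sketch using only the standard library, run against a hypothetical HTML fragment that mimics the shape of eBay’s “About this item” section (the fragment and its values are illustrative, not taken from a real listing):

```python
import re

# Hypothetical fragment shaped like eBay's "About this item" markup
html = (
    '<div class="vim x-about-this-item">'
    '<dl><dt><span>Brand</span></dt><dd><span>Lenovo</span></dd>'
    '<dt><span>Model</span></dt><dd><span>ThinkPad X1</span></dd></dl>'
    '</div>'
)

# Pair each <dt> label with the <dd> value that immediately follows it
pairs = re.findall(r'<dt><span>(.*?)</span></dt><dd><span>(.*?)</span></dd>', html)
specifics = dict(pairs)
print(specifics)  # {'Brand': 'Lenovo', 'Model': 'ThinkPad X1'}
```

In the spider itself, XPath’s `following-sibling::dd[1]` does this pairing more robustly than a regex, since it tolerates nested tags and whitespace.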
Saving Data to JSON
To store the scraped data, you can use Scrapy’s built-in JSON exporter. Run your spider with the following command:
$ scrapy crawl ebay_spider -o products.json -a start_url="link_to_listing_page"
This will save the scraped data from the listing page to a file named `products.json`.
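Once the crawl finishes, the file can be loaded back with the standard `json` module. The records below are illustrative stand-ins for the spider’s output, just to show the round trip:

```python
import json

# Sample records shaped like the spider's output (illustrative values)
sample = [
    {"Product_title": "Gaming Laptop", "Product_Price": "£499.99"},
    {"Product_title": "Ultrabook", "Product_Price": "£799.00"},
]

# Scrapy's -o products.json flag writes one JSON array; round-trip it the same way
text = json.dumps(sample, indent=2)
products = json.loads(text)
titles = [p["Product_title"] for p in products]
print(titles)  # ['Gaming Laptop', 'Ultrabook']
```

In practice you would replace `json.loads(text)` with `json.load(open("products.json"))` after the spider has run.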
Understanding HTTP Request Headers
Before we conclude, let's take a closer look at headers. The `header` dictionary defines the HTTP headers the Scrapy spider sends with each request to eBay. Let's break down each key-value pair to understand its purpose:
'Accept': 'application/json'
This header indicates that the client (our spider) expects the server to return JSON data.
'Accept-Encoding': 'gzip, deflate, br'
This specifies the content encoding that the client can handle. gzip, deflate, and br (Brotli) are all compression algorithms that help reduce the size of the response data.
'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8,es-MX;q=0.7,es;q=0.6'
This indicates the preferred languages for the response content. The q parameter indicates the relative quality factor for each language, with higher values representing higher preference.
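To make the q-factor semantics concrete, here is a small sketch that parses an `Accept-Language` value into `(language, q)` pairs, sorted by preference (a missing `q` defaults to 1.0 per the HTTP spec):

```python
# Parse an Accept-Language value into (language, q) pairs, highest preference first
def parse_accept_language(value):
    parsed = []
    for part in value.split(','):
        pieces = part.strip().split(';')
        lang = pieces[0]
        q = 1.0  # q defaults to 1.0 when omitted
        for piece in pieces[1:]:
            if piece.strip().startswith('q='):
                q = float(piece.strip()[2:])
        parsed.append((lang, q))
    return sorted(parsed, key=lambda pair: pair[1], reverse=True)

prefs = parse_accept_language('en-GB,en-US;q=0.9,en;q=0.8,es-MX;q=0.7,es;q=0.6')
print(prefs[0])  # ('en-GB', 1.0)
```

So with the header above, the server would try British English first, then American English, and so on down the list.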
'Connection': 'keep-alive'
This header keeps the connection open for multiple requests/responses, improving efficiency by reusing the same connection.
'Content-Type': 'application/json'
This specifies the media type of the resource being sent to the server. Here, it indicates that the content is in JSON format.
'Origin': 'https://www.ebay.com'
This header indicates the origin of the request, used in CORS (Cross-Origin Resource Sharing) to determine whether the request should be allowed.
'Sec-Fetch-Dest': 'empty'
This is a security feature used to indicate the destination of the request, helping the server understand the context of the request.
'Sec-Fetch-Mode': 'cors'
This indicates that the request is a CORS request, which allows the server to know that the request is coming from a different origin.
'Sec-Fetch-Site': 'cross-site'
This indicates that the request is coming from a different site, part of the security context of the request.
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
The User-Agent string identifies the client software making the request. This helps the server understand the type of client and sometimes adjust the response accordingly. Here, it mimics a request coming from a Chrome browser on a Windows 10 machine.
'X-EBAY-C-MARKETPLACE-ID': 'EBAY-GB'
This header specifies the eBay marketplace identifier, indicating that the requests are intended for the eBay UK site.
'sec-ch-ua': '"Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"'
This is part of the User-Agent Client Hints, providing information about the browser and its version in a structured format.
'sec-ch-ua-mobile': '?0'
This indicates whether the client is a mobile device (?0 means no).
Complete Spider Code
Here’s the complete spider code:
import re

import scrapy


class EbaySpiderSpider(scrapy.Spider):
    name = 'ebay_spider'
    header = {
        'Accept': 'application/json',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8,es-MX;q=0.7,es;q=0.6',
        'Connection': 'keep-alive',
        'Content-Type': 'application/json',
        'Origin': 'https://www.ebay.com',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'cross-site',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        'X-EBAY-C-MARKETPLACE-ID': 'EBAY-GB',
        'sec-ch-ua': '"Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"',
        'sec-ch-ua-mobile': '?0'
    }

    def __init__(self, **kwargs):
        # The -a start_url argument is required when running the spider
        start_url = kwargs.pop('start_url')
        self.start_urls = [start_url + '&_fcid=3']
        self.page_count = 0  # tracks how many listing pages have been crawled
        super().__init__(**kwargs)

    def parse(self, response, **kwargs):
        self.page_count += 1
        product_links = response.xpath('//a[@class="s-item__link"]/@href').getall()
        if product_links:
            yield from response.follow_all(
                product_links,
                callback=self.parse_product,
                headers=self.header,
                meta={'page_link': response.request.url},
            )
        # Follow pagination until there is no "next" link
        page_link = response.xpath('//a[@aria-label="Go to next search page"]/@href').get()
        if page_link:
            yield response.follow(page_link, callback=self.parse)

    def parse_product(self, response):
        item = {}
        item['Pagination_Link'] = response.meta['page_link']
        breadcrumbs = response.xpath('//nav[contains(@class, "breadcrumbs")]//a[@class="seo-breadcrumb-text"]/span/text()').getall()
        for index, element in enumerate(breadcrumbs):
            item[f'Category_{index + 1}'] = element
        item['Product_link'] = response.request.url
        # Full-size image URLs live in a JavaScript blob, not the HTML markup
        img_urls = re.findall('"ZOOM_GUID","URL":"(.*?)"', response.text)
        for index, img_url in enumerate(img_urls):
            item[f'Product_image_{index + 1}'] = img_url
        item['Product_title'] = response.xpath('//h1[@class="x-item-title__mainTitle"]/span/text()').get()
        item['Product_Price'] = response.xpath('//div[@class="x-price-primary"]/span/text()').get()
        val = {}
        for label in response.xpath('//div[@class="vim x-about-this-item"]//dt'):
            spec_key = ''.join(label.xpath('.//text()').getall())
            spec_value = ''.join(label.xpath('./following-sibling::dd[1]//text()').getall())
            val[spec_key] = spec_value
        item['Item specifics'] = val
        yield item
Conclusion
Scraping eBay with Python and Scrapy is a practical way to gather detailed product information. By following the steps outlined in this blog, you can set up a scraper that fetches and stores eBay data efficiently. Remember to always respect the website's `robots.txt` and terms of service to avoid any legal issues. Happy scraping!
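Checking `robots.txt` can itself be automated with the standard library’s `urllib.robotparser`. The rules below are hypothetical, purely for illustration; fetch the real file from https://www.ebay.com/robots.txt before crawling (Scrapy can also enforce this for you via its `ROBOTSTXT_OBEY` setting, enabled by default in new projects):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only
rules = [
    "User-agent: *",
    "Disallow: /itm/",
    "Allow: /sch/",
]

parser = RobotFileParser()
parser.parse(rules)

# Under these example rules, search pages are allowed but item pages are not
allowed = parser.can_fetch("*", "https://www.ebay.com/sch/i.html?_nkw=laptop")
blocked = parser.can_fetch("*", "https://www.ebay.com/itm/1234567890")
print(allowed, blocked)  # True False
```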