Scrapy is a powerful Python framework for web scraping, capable of handling various tasks, including downloading PDF files from websites. In this tutorial, I'll walk you through setting up a Scrapy project to extract PDF files efficiently, with added functionality to track and resume scraping based on dates.
🛠️ Prerequisites
Ensure you have the following installed:
- Python 3.6 or higher
- Scrapy
Install Scrapy using pip:

```bash
pip install scrapy
```
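To confirm the installation, Scrapy ships a version command:

```bash
scrapy version
```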
📁 Step 1: Create a Scrapy Project
Initialize a new Scrapy project:

```bash
scrapy startproject pdf_scraper
cd pdf_scraper
```
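The command generates Scrapy's standard project layout, which looks like this:

```
pdf_scraper/
    scrapy.cfg            # deploy configuration
    pdf_scraper/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py
```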
🕷️ Step 2: Define the Spider with Date Tracking
Create a new spider in the `spiders/` directory, for example `pdf_spider.py`:
```python
import os
from datetime import datetime, timedelta

import scrapy


class PDFSpider(scrapy.Spider):
    name = 'pdf_spider'
    start_urls = ['http://example.com']  # Replace with your target URL
    date_file = 'last_scraped_date.txt'
    default_start_date = datetime(2020, 1, 1)  # Set your default start date

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.base_url = 'http://example.com/pdfs'  # Replace with your base URL
        self.end_date = datetime.today()
        self.start_date = self.load_start_date()

    def load_start_date(self):
        # Resume from the date saved by a previous run, if any
        if os.path.exists(self.date_file):
            with open(self.date_file, 'r') as f:
                date_str = f.read().strip()
            try:
                return datetime.strptime(date_str, '%Y-%m-%d')
            except ValueError:
                self.logger.warning(f"Invalid date format in {self.date_file}. Using default start date.")
        return self.default_start_date

    def save_current_date(self, current_date):
        # Persist progress so an interrupted run can pick up here
        with open(self.date_file, 'w') as f:
            f.write(current_date.strftime('%Y-%m-%d'))

    def parse(self, response):
        current_date = self.start_date
        while current_date <= self.end_date:
            self.save_current_date(current_date)
            date_str = current_date.strftime('%Y-%m-%d')
            # Construct the PDF URL for this date; adjust the pattern as needed
            pdf_url = f"{self.base_url}/{date_str}.pdf"
            yield {'file_urls': [pdf_url], 'date': date_str}
            current_date += timedelta(days=1)

    def closed(self, reason):
        # Scrapy calls closed() automatically when the spider finishes
        self.logger.info(f"Spider closed: {reason}")
        if os.path.exists(self.date_file):
            with open(self.date_file, 'r') as f:
                last_date = f.read().strip()
            self.logger.info(f"Last scraped date saved as: {last_date}")
```
Explanation:
- Initialization: I set up the spider with a base URL and determine the date range for scraping.
- Date Loading: The `load_start_date` method checks whether a file named `last_scraped_date.txt` exists. If it does, it reads the last scraped date; otherwise, it falls back to the default start date.
- Date Saving: The `save_current_date` method writes the current date to `last_scraped_date.txt`, ensuring that the scraper can resume from this date in future runs.
- Parsing: In the `parse` method, I loop through each date from the start date to the end date, construct the corresponding PDF URL, and yield it for downloading.
- Closure: When the spider closes, the `closed` method logs the last scraped date for reference.
⚙️ Step 3: Configure the Files Pipeline
In `settings.py`, enable the Files Pipeline and specify the storage directory:
```python
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

FILES_STORE = 'downloads'  # Directory to store downloaded PDFs
```
Explanation:
- Files Pipeline: This built-in Scrapy pipeline handles downloading and storing files.
- `FILES_STORE`: Specifies the directory where the downloaded PDFs will be saved.
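Note that by default the Files Pipeline stores each file under a `full/` subfolder of `FILES_STORE`, named by a SHA-1 hash of its URL. If you would rather keep the date-based filenames, here is a minimal sketch that overrides `file_path` (assuming Scrapy 2.4+ for the `item` argument; the `DatedFilesPipeline` class name is my own, not part of Scrapy):

```python
# pipelines.py -- a minimal sketch; the class name is hypothetical
from scrapy.pipelines.files import FilesPipeline


class DatedFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Keep the date-based filename from the URL,
        # e.g. http://example.com/pdfs/2020-01-01.pdf -> 2020-01-01.pdf
        return request.url.split('/')[-1]
```

To use it, point `ITEM_PIPELINES` at `'pdf_scraper.pipelines.DatedFilesPipeline': 1` instead of the built-in class.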
🚀 Step 4: Run the Spider
Execute the spider to start downloading PDFs:
```bash
scrapy crawl pdf_spider
```
The PDFs will be saved in the `downloads` directory, and the last scraped date will be recorded in `last_scraped_date.txt`.
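Because progress is persisted on every iteration, an interrupted crawl resumes where it left off. A typical session might look like this (the dates shown are illustrative):

```bash
scrapy crawl pdf_spider     # first run starts from the default start date
cat last_scraped_date.txt   # e.g. 2020-03-17 if the run stopped there
scrapy crawl pdf_spider     # second run resumes from 2020-03-17
```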
📝 Additional Tips
- Dynamic Date Extraction: If the website displays available dates dynamically (e.g., through a calendar widget), consider parsing those dates directly from the page to determine which PDFs to download; a sketch of this follows this list.
- Error Handling: Implement error handling to manage scenarios where a PDF for a specific date does not exist or cannot be downloaded; a pipeline sketch for this also appears below.
- Resuming Scraping: The `last_scraped_date.txt` file ensures that if the scraper is interrupted, it can resume from the last saved date, preventing redundant downloads.
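For dynamic date extraction, here is a minimal sketch of a spider that pulls `YYYY-MM-DD`-stamped links off the page instead of generating dates blindly (the spider name, start URL, and link pattern are assumptions about your target site):

```python
import scrapy


class PDFLinkSpider(scrapy.Spider):
    # A minimal sketch: name and start URL are placeholders
    name = 'pdf_link_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Collect hrefs ending in a YYYY-MM-DD.pdf filename
        for href in response.css('a::attr(href)').re(r'\S*\d{4}-\d{2}-\d{2}\.pdf'):
            date_str = href[-14:-4]  # the 10-character date before ".pdf"
            yield {'file_urls': [response.urljoin(href)], 'date': date_str}
```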
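For error handling, note that the Files Pipeline logs a warning and leaves the item's `files` field empty when a download fails, so a small follow-up pipeline can flag the missing dates. A minimal sketch (the class name is my own):

```python
# pipelines.py -- a hypothetical follow-up pipeline
class MissingPDFPipeline:
    def process_item(self, item, spider):
        # FilesPipeline records successful downloads in the 'files' field;
        # an empty list means the PDF for this date was not fetched.
        if not item.get('files'):
            spider.logger.warning(f"No PDF downloaded for {item.get('date')}")
        return item
```

Register it after the Files Pipeline with a higher order value, e.g. `'pdf_scraper.pipelines.MissingPDFPipeline': 300` in `ITEM_PIPELINES`.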