Extracting PDFs with Scrapy and Implementing Date Tracking

Discover how to build a Scrapy spider to download PDFs from a website, implement date tracking to resume interrupted scraping sessions, and configure the Files Pipeline for efficient file storage. This tutorial provides step-by-step instructions, code explanations, and tips for handling dynamic date extraction and error management.

Scrapy is a powerful Python framework for web scraping, capable of handling various tasks, including downloading PDF files from websites. In this tutorial, I'll walk you through setting up a Scrapy project to extract PDF files efficiently, with added functionality to track and resume scraping based on dates.

🛠️ Prerequisites

Ensure you have the following installed:

  • Python 3.8 or higher (recent Scrapy releases require at least Python 3.8)
  • Scrapy

Install Scrapy using pip:

pip install scrapy

📁 Step 1: Create a Scrapy Project

Initialize a new Scrapy project:

scrapy startproject pdf_scraper
cd pdf_scraper
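
This generates the standard Scrapy project skeleton; the spider in the next step goes in the inner spiders directory:

pdf_scraper/
    scrapy.cfg
    pdf_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py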

🕷️ Step 2: Define the Spider with Date Tracking

Create a new spider in the spiders directory. For example, pdf_spider.py:

import scrapy
import os
from datetime import datetime, timedelta

class PDFSpider(scrapy.Spider):
    name = 'pdf_spider'
    start_urls = ['http://example.com']  # Replace with your target URL
    date_file = 'last_scraped_date.txt'
    default_start_date = datetime(2020, 1, 1)  # Set your default start date

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.base_url = 'http://example.com/pdfs'  # Replace with your base URL
        self.end_date = datetime.today()
        self.start_date = self.load_start_date()

    def load_start_date(self):
        if os.path.exists(self.date_file):
            with open(self.date_file, 'r') as f:
                date_str = f.read().strip()
                try:
                    return datetime.strptime(date_str, '%Y-%m-%d')
                except ValueError:
                    self.logger.warning(f"Invalid date format in {self.date_file}. Using default start date.")
        return self.default_start_date

    def save_current_date(self, current_date):
        with open(self.date_file, 'w') as f:
            f.write(current_date.strftime('%Y-%m-%d'))

    def parse(self, response):
        current_date = self.start_date
        while current_date <= self.end_date:
            self.save_current_date(current_date)
            date_str = current_date.strftime('%Y-%m-%d')
            # Construct your PDF URL based on the date
            pdf_url = f"{self.base_url}/{date_str}.pdf"  # Adjust as needed
            yield {'file_urls': [pdf_url], 'date': date_str}
            current_date += timedelta(days=1)

    def closed(self, reason):
        # Scrapy calls closed() automatically when the spider finishes.
        self.logger.info(f"Spider closed: {reason}")
        if os.path.exists(self.date_file):
            with open(self.date_file, 'r') as f:
                last_date = f.read().strip()
                self.logger.info(f"Last scraped date saved as: {last_date}")

Explanation:

  • Initialization: I set up the spider with a base URL and determined the date range for scraping.
  • Date Loading: The load_start_date method checks if a file named last_scraped_date.txt exists. If it does, it reads the last scraped date; otherwise, it uses a default start date.
  • Date Saving: The save_current_date method writes the current date to last_scraped_date.txt (a single line such as 2023-05-01), ensuring that the scraper can resume from this date in future runs.
  • Parsing: In the parse method, I loop through each date from the start date to the end date, construct the corresponding PDF URL, and yield it for downloading.
  • Closure: Upon closing the spider, I log the last scraped date for reference.

⚙️ Step 3: Configure the Files Pipeline

In settings.py, enable the Files Pipeline and specify the storage directory:

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'downloads'  # Directory to store downloaded PDFs

Explanation:

  • Files Pipeline: This built-in Scrapy pipeline handles downloading and storing files.
  • FILES_STORE: Specifies the directory where the downloaded PDFs will be saved.
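
By default, the Files Pipeline stores each file under a full/ subdirectory of FILES_STORE, with a filename derived from the SHA-1 hash of its URL. If you would rather keep the date-based names the spider constructs, one option is to subclass the pipeline and override its file_path method. Here is a minimal sketch (the DatedFilesPipeline name is my own, and it assumes each URL ends in <date>.pdf as built in the spider); it would live in pipelines.py:

import os
from scrapy.pipelines.files import FilesPipeline

class DatedFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Name each file after the last URL segment, e.g. 2023-05-01.pdf,
        # instead of the default SHA-1 hash of the URL.
        return os.path.basename(request.url)

To use it, point ITEM_PIPELINES at 'pdf_scraper.pipelines.DatedFilesPipeline' instead of the built-in class.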

🚀 Step 4: Run the Spider

Execute the spider to start downloading PDFs:

scrapy crawl pdf_spider

The PDFs will be saved under the downloads directory (by default in a full/ subdirectory, with filenames derived from the SHA-1 hash of each URL, unless you override file_path as shown above), and the last scraped date will be recorded in last_scraped_date.txt.

📝 Additional Tips

  • Dynamic Date Extraction: If the website displays available dates dynamically (e.g., through a calendar widget or a list of links), consider parsing those dates directly from the page to determine which PDFs to download; a sketch follows this list.
  • Error Handling: Add handling for dates where no PDF exists or the download fails; the second sketch below shows one approach.
  • Resuming Scraping: Because last_scraped_date.txt is updated on every iteration, an interrupted run resumes from the last saved date instead of starting over, avoiding redundant downloads.
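
For dynamic date extraction, a hedged sketch: assuming the target page links each day's PDF directly (an assumption; adjust the selector to the real markup), parse can harvest the links instead of generating one URL per calendar day:

    def parse(self, response):
        # Alternative parse(): download only the PDFs the page actually
        # links to, rather than guessing a URL for every date.
        for href in response.css('a::attr(href)').getall():
            if href.endswith('.pdf'):
                yield {'file_urls': [response.urljoin(href)]}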
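
For error handling, the Files Pipeline passes per-file outcomes to its item_completed hook as (success, detail) pairs, and a failed download does not fail the item by default. One option (the StrictFilesPipeline name is mine) is to drop items whose downloads all failed, so dates with no PDF show up clearly in the log:

from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline

class StrictFilesPipeline(FilesPipeline):
    def item_completed(self, results, item, info):
        # results holds one (success, file_info_or_failure) tuple per URL
        # listed in the item's file_urls field.
        if not any(success for success, _ in results):
            raise DropItem(f"No PDF downloaded for {item.get('date')}")
        # Defer to the base class to populate the item's 'files' field.
        return super().item_completed(results, item, info)

Register it in ITEM_PIPELINES in place of the built-in FilesPipeline, just as with the file_path example in Step 3.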
