How we manage 100s of scrapy spiders

Stackadoc is an AI studio that specializes in helping customers implement machine learning solutions across a wide range of fields. To ensure the accuracy of our predictions, we rely heavily on data collected from the internet. To gather this data efficiently, we use scrapy (https://scrapy.org/), a powerful Python library that allows us to extract information from various websites.

Currently, our data-gathering process involves running hundreds of spiders daily to constantly update our databases and enhance the accuracy of our models. These spider scripts are hosted on a single machine, and we use scrapyd's API (https://github.com/scrapy/scrapyd) to manage and execute them. However, as the number of spiders keeps growing, the time needed to execute them has become a bottleneck. In our case, vertical scaling is no longer an option: we already use one of the largest machines available, and we have reached the limit of parallel executions on a single machine.
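
For readers unfamiliar with scrapyd, it exposes a small HTTP JSON API for scheduling and monitoring spider runs, and that is how we trigger and track our jobs. The snippet below is a minimal sketch of that interaction using the requests library; the localhost URL, the project name scraper, and the spider name example are placeholders for illustration, not our production values.

import requests

SCRAPYD_URL = "http://localhost:6800"  # assumed scrapyd address, for illustration only

# Schedule a run of one spider; scrapyd responds with a job id.
response = requests.post(
    f"{SCRAPYD_URL}/schedule.json",
    data={"project": "scraper", "spider": "example"},
)
response.raise_for_status()
print("scheduled job:", response.json()["jobid"])

# List pending/running/finished jobs for the project.
jobs = requests.get(
    f"{SCRAPYD_URL}/listjobs.json", params={"project": "scraper"}
).json()
print(len(jobs["running"]), "running,", len(jobs["pending"]), "pending")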

To reduce the overall time required, we had to start thinking about adopting a horizontal scaling approach. This involves distributing the scraping load across multiple machines, allowing us to handle a higher number of spiders simultaneously.

Over the next three articles, we will unfold the journey of how our Scrapy application was transformed into a robust, distributed system. To ensure a clear and focused discussion, we will break down the content as follows:

  1. Project Structure & Dockerizing the Scrapy Code - In article 1, we'll cover the foundational elements of our project's architecture and explain the step-by-step process of containerizing the Scrapy code within Docker.
  2. Distributing Load Across Scrapyd Instances - Article 2 will focus on how we distributed the scraping workload among multiple Scrapyd service instances.
  3. Leveraging ECS for Scrapyd Instance Management & Orchestration - In article 3, we'll examine how we utilized Amazon ECS (Elastic Container Service) to manage and orchestrate the Scrapyd instances.

Project structure

Since we will be handling and modifying code, let's discuss how our scraping project is structured.

scraper/
├── scraper/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   ├── scrapy.cfg
│   └── spiders/
│       ├── __init__.py
│       └── spider.py
├── scrapyd.conf
├── scrapyd.sh
├── Dockerfile
├── poetry.lock
└── pyproject.toml

The layout is consistent with a conventional Scrapy project architecture:

  • The scraper subdirectory is the heart of our project. It hosts our spiders in a dedicated spiders folder, as well as several critical files (a minimal sketch of how these pieces fit together follows after this list):
    • items.py defines the data structure for scraped items.
    • middlewares.py contains middleware components for request and response processing.
    • pipelines.py is where we process and store the items that the spiders have scraped.
    • settings.py specifies project settings, including configurations for the item pipelines, robots.txt rules, and more.
  • scrapy.cfg is the configuration file for the scrapy command-line tool, and it designates project settings and the locations of various components.
  • scrapyd.conf is the configuration file for the scrapyd service, where parameters for deploying and running your spiders are set. We will explore the specifics of this file in an upcoming section.
  • scrapyd.sh is a shell script that kick-starts the scrapyd service and facilitates the deployment of spiders. It executes the following commands:

# Deploy the spiders to scrapyd in the background once the service is up
sh -c 'sleep 15s; cd scraper; scrapyd-deploy' &
# Start the scrapyd service
scrapyd -d /code;

The sleep 15s command creates a short delay so that the scrapyd service is up and running before scrapyd-deploy pushes the spiders to it.

  • In addition to these components, we use Poetry, a modern tool for Python package management. The poetry.lock file records the exact versions of our dependencies, ensuring reproducible builds, while the pyproject.toml file holds the project metadata and the list of dependencies.
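
As referenced above, here is a minimal, illustrative sketch of how items.py, a spider, and pipelines.py fit together. It is not our production code; the item, spider, and pipeline names are hypothetical, and the pipeline would still need to be enabled in settings.py.

# scraper/items.py - defines the data structure for scraped items
import scrapy

class ArticleItem(scrapy.Item):  # hypothetical item, for illustration
    title = scrapy.Field()
    url = scrapy.Field()


# scraper/spiders/spider.py - a spider yields items that flow into the pipelines
import scrapy
from scraper.items import ArticleItem

class ExampleSpider(scrapy.Spider):
    name = "example"  # the name used when scheduling the spider
    start_urls = ["https://example.com"]

    def parse(self, response):
        for link in response.css("a"):
            yield ArticleItem(
                title=link.css("::text").get(),
                url=response.urljoin(link.attrib.get("href", "")),
            )


# scraper/pipelines.py - scraped items are processed and stored here
class ScraperPipeline:
    def process_item(self, item, spider):
        # e.g. clean fields, deduplicate, or write to a database
        return item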

Dockerizing scrapy code

When deploying multiple instances of our web scraping application, we encountered challenges in setting up each instance with the required libraries, tools, and code - and keeping everything synchronized and up to date.

To address these concerns, we have adopted Docker. Docker provides means to containerize our application along with its dependencies into a standalone image, streamlining the deployment process across different environments. Docker images guarantee consistency by encapsulating the same runtime environment on any machine.

Using a Docker registry also simplifies updates: we push an updated image to the registry, and subsequent deployments automatically pull it, ensuring that our spiders are always running the latest version of our application.

Below are the steps we took to dockerize our application.

Step 1: Create the Dockerfile

First, we create a file named Dockerfile in the root directory of our project. Our file looks like this:

FROM python:3.10.12-slim

ENV PYTHONFAULTHANDLER=1 \
    PYTHONUNBUFFERED=1 \
    PYTHONHASHSEED=random \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_NO_CACHE_DIR=off \
    PIP_DEFAULT_TIMEOUT=100 \
    POETRY_VERSION=1.3.1

RUN pip install "poetry==$POETRY_VERSION"

WORKDIR /code

COPY poetry.lock pyproject.toml /code/
RUN poetry install --no-interaction --no-ansi

COPY scraper/ /code/scraper/
COPY scrapyd.conf /code/scrapyd.conf
COPY scrapyd.sh /code/scrapyd.sh

ENTRYPOINT ["/code/scrapyd.sh"]

Here's a breakdown of what we're defining in the Dockerfile:

  1. FROM python:3.10.12-slim - This specifies the base image to build upon, using a slim version of the official Python 3.10.12 image from Docker Hub.
  2. ENV - We're setting environment variables related to the Python runtime and pip behavior, as well as defining the Poetry version we'll be using.
  3. The RUN pip install command installs Poetry inside our Docker image.
  4. WORKDIR /code - This sets the working directory for subsequent instructions.
  5. COPY poetry.lock pyproject.toml - These files are required by Poetry to identify and install the correct dependencies.
  6. The RUN poetry install command installs the project dependencies defined in pyproject.toml.
  7. The COPY commands transfer our application code into the image.
  8. ENTRYPOINT ["/code/scrapyd.sh"] - Sets the entry point for the Docker image. When a container is run from this image, it will execute the "scrapyd.sh" script located at the "/code/scrapyd.sh" path inside the container.

Step 2: Building the Docker Image

With the Dockerfile in place, the next step is to build a Docker image from it. This is done by running the following command:

docker build -t my_scrapy_app:latest .
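
As an optional sanity check, the freshly built image can be run locally (for example with docker run -p 6800:6800 my_scrapy_app:latest) and the containerized scrapyd probed over HTTP. The snippet below is a small sketch of such a check; it assumes the container publishes scrapyd's default port 6800 and that scrapyd.conf binds to an address reachable from the host.

import time
import requests

SCRAPYD_URL = "http://localhost:6800"  # assumes the container publishes scrapyd's default port

# Give the container a little time to start scrapyd and deploy the spiders,
# then confirm the daemon answers on its status endpoint.
for attempt in range(10):
    try:
        status = requests.get(f"{SCRAPYD_URL}/daemonstatus.json", timeout=2).json()
        print("scrapyd is up:", status)  # e.g. {"status": "ok", "running": 0, ...}
        break
    except requests.RequestException:
        time.sleep(3)
else:
    raise SystemExit("scrapyd did not come up in time")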

Step 3: Pushing the Image to a Docker Repository

For hosting our Docker image, we chose Amazon Elastic Container Registry (ECR). Nevertheless, these instructions apply to any Docker registry. The first step is to authenticate Docker with your chosen registry provider (for ECR, this is done by piping the output of the AWS CLI's aws ecr get-login-password command into docker login). After successful authentication, we can run the following commands to tag and push the Docker image:

docker tag my_scrapy_app:latest your-repository-uri:your-tag
docker push your-repository-uri:your-tag

You should replace your-repository-uri with the URI provided by your Docker repository provider and your-tag with your chosen tag for the image. The tagging command assigns the chosen tag to your local image, allowing you to distinguish between image versions or configurations.

With these steps completed, our application is fully dockerized, allowing for seamless distribution and quick, consistent deployment on any machine and enhancing our operational capabilities.

In the next article, we will discuss the approach we took to distribute the scraping load between the different instances.

Amin SAFFAR

Backend Engineer @Stackadoc