How we manage 100s of Scrapy spiders
Discover how Stackadoc, an AI studio, transformed its Scrapy data-gathering process from a single-machine bottleneck into a robust, distributed system capable of handling hundreds of spiders simultaneously, by leveraging Docker, Scrapyd, and Amazon ECS.
Stackadoc is an AI studio that specializes in helping customers implement machine learning solutions across a wide range of fields. In order to ensure the accuracy of our predictions, we heavily rely on data collected from the internet. To gather this data efficiently, we use a powerful Python library called Scrapy (https://scrapy.org/), which allows us to extract information from various websites.
Currently, our data-gathering process involves running hundreds of spiders every day to constantly update our databases and enhance the accuracy of our models. These spider scripts are hosted on a single machine, and we use scrapyd's API (https://github.com/scrapy/scrapyd) to manage and execute them. However, as the number of spiders keeps growing, the time needed to execute them has become a bottleneck for us. In our case, vertical scaling is no longer an option: we already use one of the largest machines available and have reached the limit of parallel executions on a single machine.
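For context, scheduling a run through Scrapyd boils down to a couple of HTTP calls against its JSON API. The sketch below is a minimal illustration rather than our production scheduler; the Scrapyd address and the project and spider names are placeholders:
import requests

SCRAPYD_URL = "http://localhost:6800"  # placeholder address of a running Scrapyd instance

def schedule_spider(project: str, spider: str) -> str:
    """Start one spider run via Scrapyd's schedule.json endpoint and return the job id."""
    response = requests.post(
        f"{SCRAPYD_URL}/schedule.json",
        data={"project": project, "spider": spider},
    )
    response.raise_for_status()
    return response.json()["jobid"]

def list_jobs(project: str) -> dict:
    """Return the pending, running, and finished jobs reported by listjobs.json."""
    response = requests.get(f"{SCRAPYD_URL}/listjobs.json", params={"project": project})
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    job_id = schedule_spider("scraper", "spider")  # hypothetical project and spider names
    print("Scheduled job:", job_id)
With hundreds of spiders, it is the total wall-clock time of all these jobs on one machine that becomes the limiting factor.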
To reduce the overall time required, we had to start thinking about adopting a horizontal scaling approach. This involves distributing the scraping load across multiple machines, allowing us to handle a higher number of spiders simultaneously.
Over the next three articles, we will unfold the journey of how our Scrapy application was transformed into a robust, distributed system. To ensure a clear and focused discussion, we will break down the content as follows:
- Project Structure & Dockerizing the Scrapy Code - In article 1, we'll cover the foundational elements of our project's architecture and explain the step-by-step process of containerizing the Scrapy code within Docker.
- Distributing Load Across Scrapyd Instances - Article 2 will focus on how we were able to distribute the scraping workload among multiple Scrapyd service instances.
- Leveraging ECS for Scrapyd Instance Management & Orchestration - In article 3, we'll examine how we utilized Amazon ECS (Elastic Container Service) to manage and orchestrate the Scrapyd instances.
Project structure
Since we will be handling and modifying code, let’s discuss how our scraping project is structured.
scraper/
├── scraper/
│ ├── __init__.py
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── settings.py
│ ├── scrapy.cfg
│ └── spiders/
│ ├── __init__.py
│ └── spider.py
├── scrapyd.conf
├── scrapyd.sh
├── Dockerfile
├── poetry.lock
└── pyproject.toml
The layout is consistent with a conventional Scrapy project architecture:
- The scraper subdirectory is the heart of our project. It hosts our spiders in a dedicated spiders folder, as well as several critical files:
  - items.py defines the data structure for scraped items.
  - middlewares.py contains middleware components for request and response processing.
  - pipelines.py is where we process and store the items that the spiders have scraped.
  - settings.py specifies project settings, including configurations for the item pipelines, robots.txt rules, and more.
- scrapy.cfg is the configuration file for the scrapy command-line tool; it designates project settings and the locations of various components.
- scrapyd.conf is the configuration file for the scrapyd service, where parameters for deploying and running the spiders are set. We will explore the specifics of this file in an upcoming section.
- scrapyd.sh is a shell script that kick-starts the scrapyd service and facilitates the deployment of spiders. It executes the following commands:
sh -c 'sleep 15s; cd scraper; scrapyd-deploy'&
scrapyd -d /code;
The sleep 15s command delays the scrapyd-deploy call, giving the scrapyd service started on the second line time to come up before deployment begins.
- In addition to these components, we are using poetry, a modern tool for Python package management. The poetry.lock file makes sure that the exact versions of the dependencies are recorded, ensuring reliable builds, while the pyproject.toml file holds the project metadata and the list of dependencies.
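To make this layout concrete, here is a minimal, illustrative spider of the kind that would live in the spiders/ folder; the spider name, start URL, and selectors below are placeholders, not our production code:
import scrapy

class ExampleSpider(scrapy.Spider):
    # Placeholder name and start URL; real spiders target the sites we collect data from.
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Yield one item per article block; these CSS selectors are purely illustrative.
        for article in response.css("article"):
            yield {
                "title": article.css("h2::text").get(),
                "url": article.css("a::attr(href)").get(),
            }
Items yielded by a spider like this are handed to the pipelines declared in pipelines.py (enabled through the ITEM_PIPELINES setting in settings.py) before being stored.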
Dockerizing the Scrapy code
When deploying multiple instances of our web scraping application, we encountered challenges in setting up each instance with the required libraries, tools, and code - and keeping everything synchronized and up to date.
To address these concerns, we adopted Docker. Docker provides a means to containerize our application along with its dependencies into a standalone image, streamlining the deployment process across different environments. Docker images guarantee consistency by encapsulating the same runtime environment on any machine.
Updating our application also becomes simpler with a Docker registry: we push an updated image to the registry, and subsequent deployments automatically pull it, ensuring that our spiders are always running the latest version of our application.
Below are the steps we took to dockerize our application.
Step 1: Create the Dockerfile
First, we create a file named Dockerfile in the root directory of our project. Our file looks like this:
FROM python:3.10.12-slim
ENV PYTHONFAULTHANDLER=1 \
    PYTHONUNBUFFERED=1 \
    PYTHONHASHSEED=random \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_NO_CACHE_DIR=off \
    PIP_DEFAULT_TIMEOUT=100 \
    POETRY_VERSION=1.3.1
RUN pip install "poetry==$POETRY_VERSION"
WORKDIR /code
COPY poetry.lock pyproject.toml /code/
# Install dependencies into the image's Python environment (no Poetry virtualenv),
# so the scrapyd command is on the PATH for the entrypoint script.
RUN poetry config virtualenvs.create false \
    && poetry install --no-interaction --no-ansi
COPY scraper/ /code/scraper/
COPY scrapyd.conf /code/scrapyd.conf
COPY scrapyd.sh /code/scrapyd.sh
ENTRYPOINT ["/code/scrapyd.sh"]
Here's a breakdown of what we're defining in the Dockerfile:
- FROM python:3.10.12-slim - This specifies the base image to build upon, using a slim version of the official Python 3.10.12 image from Docker Hub.
- ENV - We're setting environment variables related to the Python runtime and pip behavior, as well as defining the Poetry version we'll be using.
- The RUN pip install command installs Poetry inside our Docker image.
- WORKDIR /code - This sets the working directory for subsequent instructions.
- COPY poetry.lock pyproject.toml - These files are required by Poetry to identify and install the correct dependencies.
- The RUN poetry install command installs the project dependencies defined in pyproject.toml.
- The COPY commands transfer our application code into the image.
- ENTRYPOINT ["/code/scrapyd.sh"] - Sets the entry point for the Docker image. When a container is run from this image, it will execute the scrapyd.sh script located at /code/scrapyd.sh inside the container.
Step 2: Building the Docker Image
With the Dockerfile in place, the next step is to build a Docker image from it. This is done by running the following command:
docker build -t my_scrapy_app:latest .
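Before pushing the image anywhere, it is worth sanity-checking it locally. Assuming the container has been started with something like docker run -p 6800:6800 my_scrapy_app:latest (the port mapping is an assumption based on Scrapyd's default), a short script such as this sketch can confirm that Scrapyd is up and that the spiders were deployed by the entrypoint:
import requests

SCRAPYD_URL = "http://localhost:6800"  # the port published by the local test container

# daemonstatus.json reports whether the Scrapyd daemon is healthy and how many jobs it holds.
status = requests.get(f"{SCRAPYD_URL}/daemonstatus.json").json()
print("Daemon status:", status)

# listprojects.json and listspiders.json confirm that scrapyd-deploy registered our code.
projects = requests.get(f"{SCRAPYD_URL}/listprojects.json").json()["projects"]
print("Deployed projects:", projects)

for project in projects:
    spiders = requests.get(
        f"{SCRAPYD_URL}/listspiders.json", params={"project": project}
    ).json()["spiders"]
    print(f"Spiders in {project}:", spiders)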
Step 3: Pushing the Image to a Docker Repository
For hosting our Docker image, we chose Amazon Elastic Container Registry (ECR). Nevertheless, these instructions apply to any Docker registry. The first step is to authenticate Docker with your chosen registry provider. After successful authentication, run the following commands to tag and push the Docker image:
docker tag my_scrapy_app:latest your-repository-uri:your-tag
docker push your-repository-uri:your-tag
You should replace your-repository-uri with the URI provided by your Docker repository provider and your-tag with your chosen tag for the image. The tagging command assigns the chosen tag to your local image, allowing you to differentiate between different image versions or configurations.
With these steps complete, our application is fully dockerized, allowing for seamless distribution and quick, consistent deployment on any machine.
In the next article, we will discuss the approach we took to distribute the scraping load between the different instances.