Stackadoc is an AI studio that specializes in helping customers implement machine learning solutions across a wide range of fields. To ensure the accuracy of our predictions, we rely heavily on data collected from the internet. To gather this data efficiently, we use Scrapy (https://scrapy.org/), a powerful Python library that allows us to extract information from various websites.
Currently, our data-gathering process involves running hundreds of spiders every day to keep our databases up to date and improve the accuracy of our models. These spider scripts are hosted on a single machine, and we use scrapyd's API (https://github.com/scrapy/scrapyd) to manage and execute them. However, as the number of spiders keeps growing, the time needed to run them all has become a bottleneck. In our case, vertical scaling is no longer an option: we already use one of the largest machines available, and we have reached the limit of parallel executions on a single machine.
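Since scrapyd exposes a plain HTTP/JSON API, each spider run boils down to one POST request. Below is a minimal sketch of how runs can be scheduled; the /schedule.json endpoint is standard scrapyd, while the host, project, and spider names are hypothetical placeholders:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SCRAPYD_URL = "http://localhost:6800"  # hypothetical scrapyd host


def schedule_payload(project: str, spider: str, **spider_args: str) -> bytes:
    """Build the form-encoded body expected by scrapyd's /schedule.json."""
    params = {"project": project, "spider": spider, **spider_args}
    return urlencode(params).encode()


def schedule_spider(project: str, spider: str, **spider_args: str) -> str:
    """POST to /schedule.json; scrapyd replies with {"status": "ok", "jobid": "..."}."""
    with urlopen(f"{SCRAPYD_URL}/schedule.json",
                 data=schedule_payload(project, spider, **spider_args)) as resp:
        return json.load(resp)["jobid"]
```

Running hundreds of spiders then amounts to issuing hundreds of such requests against the same host, which is exactly where the single-machine ceiling shows up.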
To reduce the overall time required, we had to start thinking about adopting a horizontal scaling approach. This involves distributing the scraping load across multiple machines, allowing us to handle a higher number of spiders simultaneously.
Over the next three articles, we will unfold the journey of how our Scrapy application was transformed into a robust, distributed system. In this first article, we focus on containerizing the application with Docker.
Since we will be handling and modifying code, let's first discuss how our scraping project is structured.
scraper/
├── scraper/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   ├── scrapy.cfg
│   └── spiders/
│       ├── __init__.py
│       └── spider.py
├── scrapyd.conf
├── scrapyd.sh
├── Dockerfile
├── poetry.lock
└── pyproject.toml
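Before walking through each file, here is a minimal sketch of the kind of component that lives in pipelines.py; the field names and the normalization logic are illustrative, not our actual code, and it assumes spiders yield plain dicts (which Scrapy supports alongside declared Item classes):

```python
# pipelines.py-style sketch: every item a spider yields is passed through
# the process_item() method of each enabled pipeline.
class NormalizePipeline:
    def process_item(self, item, spider):
        # Real Scrapy code would raise scrapy.exceptions.DropItem here;
        # a plain ValueError keeps this sketch dependency-free.
        if not item.get("url"):
            raise ValueError("item has no url")
        # Normalize a hypothetical "title" field before storage.
        item["title"] = item.get("title", "").strip()
        return item
```

Enabling such a pipeline is a matter of listing it under ITEM_PIPELINES in settings.py.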
The layout is consistent with a conventional Scrapy project architecture:
- The scraper subdirectory is the heart of our project. It hosts our spiders in a dedicated spiders folder, as well as several critical files:
  - items.py defines the data structure for scraped items.
  - middlewares.py contains middleware components for request and response processing.
  - pipelines.py is where we process and store the items that the spiders have scraped.
  - settings.py specifies project settings, including configurations for the item pipelines, robots.txt rules, and more.
  - scrapy.cfg is the configuration file for the scrapy command-line tool; it designates project settings and the locations of various components.
- scrapyd.conf is the configuration file for the scrapyd service, where parameters for deploying and running your spiders are set. We will explore the specifics of this file in an upcoming section.
- scrapyd.sh is a shell script that kick-starts the scrapyd service and facilitates the deployment of spiders. It executes the following commands:

sh -c 'sleep 15s; cd scraper; scrapyd-deploy' &
scrapyd -d /code

The sleep 15s command creates a short delay to ensure that the scrapyd service is up and running before deployment begins.
- Dependencies are managed with poetry, a modern tool for Python package management. The poetry.lock file records the exact versions of the dependencies, ensuring reliable builds, while the pyproject.toml file holds the project metadata and the list of dependencies.

When deploying multiple instances of our web scraping application, we encountered challenges in setting up each instance with the required libraries, tools, and code, and keeping everything synchronized and up to date.
To address these concerns, we have adopted Docker. Docker provides means to containerize our application along with its dependencies into a standalone image, streamlining the deployment process across different environments. Docker images guarantee consistency by encapsulating the same runtime environment on any machine.
Updating our application also becomes simpler with a Docker registry: we push the updated image to the registry, and subsequent deployments automatically pull it, ensuring that our spiders are always running the latest version of our application.
Below are the steps we took to dockerize our application.
First, we create a file named Dockerfile in the root directory of our project. Our file looks like this:
FROM python:3.10.12-slim

ENV PYTHONFAULTHANDLER=1 \
    PYTHONUNBUFFERED=1 \
    PYTHONHASHSEED=random \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_NO_CACHE_DIR=off \
    PIP_DEFAULT_TIMEOUT=100 \
    POETRY_VERSION=1.3.1

RUN pip install "poetry==$POETRY_VERSION"

WORKDIR /code
COPY poetry.lock pyproject.toml /code/
RUN poetry install --no-interaction --no-ansi

COPY scraper/ /code/scraper/
COPY scrapyd.conf /code/scrapyd.conf
COPY scrapyd.sh /code/scrapyd.sh

ENTRYPOINT ["/code/scrapyd.sh"]
Here's a breakdown of what we're defining in the Dockerfile:
- FROM python:3.10.12-slim specifies the base image to build upon, using a slim version of the official Python 3.10.12 image from Docker Hub.
- ENV sets environment variables related to the Python runtime and pip behavior, and defines the Poetry version we'll be using.
- RUN pip install installs Poetry inside our Docker image.
- WORKDIR /code sets the working directory for subsequent instructions.
- COPY poetry.lock pyproject.toml copies the files required by Poetry to identify and install the correct dependencies.
- RUN poetry install installs the project dependencies defined in pyproject.toml.
- The subsequent COPY commands transfer our application code into the image.
- ENTRYPOINT ["/code/scrapyd.sh"] sets the entry point for the Docker image. When a container is run from this image, it will execute the scrapyd.sh script located at /code/scrapyd.sh inside the container.

With the Dockerfile in place, the next step is to build a Docker image from it. This is done by running the following command:
docker build -t my_scrapy_app:latest .
For hosting our Docker image, we chose Amazon Elastic Container Registry (ECR); nevertheless, these instructions can be applied to any Docker registry. The first step is to authenticate Docker with your chosen registry provider. After a successful authentication, run the following commands to tag and push your Docker image:
docker tag my_scrapy_app:latest your-repository-uri:your-tag
docker push your-repository-uri:your-tag
You should replace your-repository-uri with the URI provided by your Docker repository provider and your-tag with your chosen tag for the image. The tagging command assigns the chosen tag to your local image, allowing you to differentiate between image versions or configurations.
By completing these steps, our application has been successfully dockerized, allowing for seamless distribution and quick, consistent deployment on any machine, enhancing our operational capabilities.
In the next article, we will discuss the approach we took to distribute the scraping load between the different instances.