Mastering Container Management: Why We Chose AWS ECS Over Kubernetes for Scaling and Efficiency

Using ECS to manage and orchestrate scrapyd instances.

💡 Don't miss the continuation of our journey into distributed scraping! If you haven't checked it out yet, catch up by reading the first part here: How we manage 100s of scrapy spiders and the second part here: Streamlining Workloads with Scrapyd and Postgres

In the previous article we dockerized our application so it can run in isolated containers, and we modified Scrapyd's behavior to make it rely on a single shared queue between the different containers.

It's time to figure out how the container infrastructure will be managed. How do we scale up and down depending on demand? How do we provision new containers and update them with the latest code?

To do this, it was clear that we needed an orchestration tool. When people hear container orchestration they usually think of Kubernetes, but given our use case, resources, and budget we went in a different direction and opted for AWS ECS instead. Here's why:

  1. Less Overhead: AWS takes on the responsibility for managing the ECS infrastructure, reducing the workload on our small team and allowing us to focus more on application development. It's a simpler platform to navigate and manage, requiring less upkeep and configuration than Kubernetes.
  2. Seamless integration with AWS services: ECS integrates tightly with AWS services such as Elastic Load Balancer, Identity and Access Management (IAM), and Auto Scaling, among others.
  3. Cost Efficiency: ECS itself comes at no additional cost; we only pay for the resources (like EC2 instances or EBS volumes) we use.
  4. Built-in monitoring: ECS comes with built-in monitoring capabilities through its integration with AWS CloudWatch, providing real-time insights and comprehensive visibility into resource utilization directly from the AWS console.

The following image illustrates the different AWS resources we need to put in place:

At Stackadoc we use Terraform to provision and manage our AWS resources. In this section we will go through the steps needed to provision the infrastructure illustrated above, which runs our daily scraping operations.

Step 1: Building the Docker image and uploading it to ECR

resource "aws_ecr_repository" "distributed_scraping_repository" {
 name = "distributed-scraping-repository"
}

resource "docker_image" "distributed_scraping_docker_image" {
 name = "${aws_ecr_repository.distributed_scraping_repository.repository_url}:latest"
 build {
   context = "${path.root}/../docker"
 }
 triggers = {
   dir_sha1 = sha1(join("", [for f in fileset("${path.root}/..", "docker/*") : filesha1(f)]))
 }
}

resource "docker_registry_image" "image_handler" {
 name = docker_image.distributed_scraping_docker_image.name
}

The Terraform script above creates an AWS ECR (Elastic Container Registry) repository, builds a Docker image, and pushes that image to the ECR repository.

The image build will be triggered each time any file under the docker/ directory changes. This directory contains the Dockerfile and the entry script that were mentioned in the first section.

The image that we are hosting on ECR will be used by ECS to launch new containers.
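For reference, the push that these resources automate is roughly equivalent to the following AWS CLI and Docker commands. The account ID below is a placeholder; the region matches the one used later in this setup, and the commands assume configured AWS credentials and a running Docker daemon:

```shell
# Authenticate the local Docker client against ECR
aws ecr get-login-password --region eu-west-3 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-3.amazonaws.com

# Build, tag, and push the image from the docker/ directory
docker build -t distributed-scraping-repository:latest ./docker
docker tag distributed-scraping-repository:latest \
  123456789012.dkr.ecr.eu-west-3.amazonaws.com/distributed-scraping-repository:latest
docker push 123456789012.dkr.ecr.eu-west-3.amazonaws.com/distributed-scraping-repository:latest
```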

Step 2: Networking

Next we will create a VPC (Virtual Private Cloud) and configure subnets, an internet gateway, and the associated route tables. Here's a walk-through:

data "aws_availability_zones" "available" { state = "available" }

locals {
 azs_count = 2
 azs_names = data.aws_availability_zones.available.names
}

Here, we fetch the names of the available availability zones; we will use two of them.

resource "aws_vpc" "main" {
 cidr_block           = "10.10.0.0/16"
 tags                 = { Name = "distributed-scraping-vpc" }
}

We create a new Virtual Private Cloud (VPC) with CIDR block 10.10.0.0/16.


resource "aws_subnet" "public" {
 count                   = local.azs_count
 vpc_id                  = aws_vpc.main.id
 availability_zone       = local.azs_names[count.index]
 cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, 10 + count.index)
 map_public_ip_on_launch = true
 tags                    = { Name = "distributed-scraping-public-${local.azs_names[count.index]}" }
}

Next, we set up public subnets in each of the previously fetched availability zones.
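The cidrsubnet call above carves one /24 per availability zone out of the VPC's /16. With the values from this configuration, it resolves as follows:

```hcl
# cidrsubnet(prefix, newbits, netnum) extends the prefix by newbits bits
# and selects the netnum-th subnet of that size.
cidrsubnet("10.10.0.0/16", 8, 10) # => "10.10.10.0/24" (first subnet)
cidrsubnet("10.10.0.0/16", 8, 11) # => "10.10.11.0/24" (second subnet)
```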

resource "aws_internet_gateway" "internet_gateway" {
 vpc_id = aws_vpc.main.id
 tags = {
   Name = "internet_gateway"
 }
}

We create an internet gateway and attach it to our VPC.


resource "aws_route_table" "public" {
 vpc_id = aws_vpc.main.id
 route {
   cidr_block = "0.0.0.0/0"
   gateway_id = aws_internet_gateway.internet_gateway.id
 }
}

resource "aws_route_table_association" "public" {
 count          = local.azs_count
 subnet_id      = element(aws_subnet.public.*.id, count.index)
 route_table_id = aws_route_table.public.id
}

We create a route table with a default route through the internet gateway and associate it with both public subnets so they can reach the internet. Notice that we created only two subnets; this is for demonstration purposes. Amazon recommends creating at least three subnets (one per availability zone) for production environments to avoid availability issues.

Step 3: Provisioning EC2 instances

data "aws_ssm_parameter" "ecs_node_ami" {
 name = "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"
}

Here, we fetch the latest ECS optimized AMI (Amazon Machine Image) ID for Amazon Linux 2 from AWS Parameter Store. This ensures that we always use the most current and optimized Amazon Linux 2 image for ECS when creating the underlying EC2 instances that will be running our containers.

resource "aws_launch_template" "distributed_scraping_launch_template" {
 name_prefix   = "stackabot-ecs-template"
 image_id      = data.aws_ssm_parameter.ecs_node_ami.value
 instance_type = "t3.xlarge"

 key_name               = "ecs-key-pair"
 vpc_security_group_ids = [aws_security_group.ec2_security_group.id]
 iam_instance_profile {
   arn = aws_iam_instance_profile.ecs_node.arn
 }

 user_data = base64encode(<<-EOF
     #!/bin/bash
     echo ECS_CLUSTER=${aws_ecs_cluster.distributed_scraping_cluster.name} >> /etc/ecs/ecs.config;
   EOF
 )
}

The aws_launch_template block creates a launch template from which identical EC2 instances will be launched. In this template, we specify the instance type, image ID, IAM role, and security group. The user_data field, a script run at instance launch, sets the name of the ECS cluster that the EC2 instances will join.

resource "aws_autoscaling_group" "distributed_scraping_autoscaling_group" {
 name_prefix           = "distributed-scraping-auto-scaling-group"
 vpc_zone_identifier   = aws_subnet.public[*].id
 desired_capacity      = 0
 max_size              = 5
 min_size              = 0
 health_check_type     = "EC2"
 protect_from_scale_in = false

 launch_template {
   id      = aws_launch_template.distributed_scraping_launch_template.id
   version = "$Latest"
 }

 lifecycle {
   ignore_changes = [desired_capacity]
 }

 tag {
   key                 = "Name"
   value               = "distributed-scraping-ecs-cluster"
   propagate_at_launch = true
 }

 tag {
   key                 = "AmazonECSManaged"
   value               = ""
   propagate_at_launch = true
 }
}

Finally, we create the aws_autoscaling_group resource. This block sets up an auto-scaling group that uses the previously defined launch template to bring up new instances. The desired_capacity, max_size, and min_size parameters are used to control the number of instances in the group.

Step 4: Creating the ECS cluster

In this section we will create an ECS cluster, an ECS capacity provider, a CloudWatch log group, an ECS task definition, and an ECS service. Let's get to it:

resource "aws_ecs_cluster" "distributed_scraping_cluster" {
 name = "distributed-scraping-cluster"
}

Our ECS cluster, named "distributed-scraping-cluster", is created here.

resource "aws_ecs_capacity_provider" "ecs_capacity_provider" {
 name = "distributed-scraping-capacity-provider"

 auto_scaling_group_provider {
   auto_scaling_group_arn         = aws_autoscaling_group.distributed_scraping_autoscaling_group.arn
   managed_termination_protection = "DISABLED"

   managed_scaling {
     maximum_scaling_step_size = 2
     minimum_scaling_step_size = 1
     status                    = "ENABLED"
     target_capacity           = 100
   }
 }
}

We create a capacity provider. It uses our auto-scaling group to provide the EC2 infrastructure needed to run the task containers. With target_capacity set to 100, ECS managed scaling adjusts the group so that the provisioned instances are fully utilized by running tasks.

resource "aws_ecs_cluster_capacity_providers" "ecs_cluster_capacity_providers" {
 cluster_name = aws_ecs_cluster.distributed_scraping_cluster.name

 capacity_providers = [aws_ecs_capacity_provider.ecs_capacity_provider.name]

 default_capacity_provider_strategy {
   base              = 1
   weight            = 100
   capacity_provider = aws_ecs_capacity_provider.ecs_capacity_provider.name
 }
}

The capacity provider is then associated with our newly created ECS cluster here.

resource "aws_cloudwatch_log_group" "scrapy_log_group" {
 name              = "/ecs/scrapy-logs"
 skip_destroy      = true
 retention_in_days = 7
}

We also need to create a new log group in CloudWatch Logs. This group will be used to store the logs from our different scrapyd containers.

resource "aws_ecs_task_definition" "scrapy_task_definition" {
 family             = "scrapy-task"
 network_mode       = "bridge"
 task_role_arn      = aws_iam_role.ecs_task_role.arn
 execution_role_arn = aws_iam_role.ecs_exec_role.arn
 cpu                = 4096
 memory             = 15722
 runtime_platform {
   operating_system_family = "LINUX"
   cpu_architecture        = "X86_64"
 }

 container_definitions = jsonencode([
   {
     name      = "scrapy-client"
     image     = "${aws_ecr_repository.distributed_scraping_repository.repository_url}:latest"
     cpu       = 4096
     memory    = 15722
     essential = true

     linuxParameters = {
       initProcessEnabled = true
     }

     logConfiguration = {
       logDriver = "awslogs"
       options = {
         awslogs-create-group  = "true"
         awslogs-group         = "${aws_cloudwatch_log_group.scrapy_log_group.id}"
         awslogs-region        = "eu-west-3"
         awslogs-stream-prefix = "ecs"
       }
     }

   }
 ])

}

The resource above creates a task definition describing how our scrapyd containers will be launched. As the Docker image, we use the image hosted in our ECR repository. The container logs will be saved in the CloudWatch log group we defined above. We also specified the CPU and memory needed to run this task.

resource "aws_ecs_service" "scrapy_service" {
 name                   = "sc-service"
 cluster                = aws_ecs_cluster.distributed_scraping_cluster.id
 task_definition        = aws_ecs_task_definition.scrapy_task_definition.arn
 desired_count          = 0
 enable_execute_command = true

 force_new_deployment = true

 triggers = {
   redeployment = plantimestamp()
 }

 capacity_provider_strategy {
   capacity_provider = aws_ecs_capacity_provider.ecs_capacity_provider.name
   weight            = 100
 }

 lifecycle {
   ignore_changes = [desired_count]
 }
}

Finally, we create our ECS service. The ECS service is the resource responsible for running and maintaining the desired count of scrapyd containers, and it is attached to our cluster. We set desired_count to 0 at creation time and adjust it at runtime; the ignore_changes lifecycle rule keeps Terraform from resetting whatever count is currently running. We enable execute command so that we can exec into the containers for debugging purposes. The capacity provider supplies the EC2 instances, launched in the subnets we created in the networking section, on which the containers run.
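Because Terraform ignores desired_count after creation, scaling the number of scrapyd containers is done at runtime, for example with the AWS CLI (cluster and service names taken from the configuration above; requires configured AWS credentials):

```shell
# Ask ECS to run 3 scrapyd containers; the capacity provider then
# launches EC2 instances as needed to place them.
aws ecs update-service \
  --cluster distributed-scraping-cluster \
  --service sc-service \
  --desired-count 3
```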

In the culmination of this series, we successfully established a robust ECS cluster designed to efficiently manage and orchestrate our Scrapyd containers. This scalable infrastructure is pivotal in maintaining a streamlined web scraping operation, showcasing a significant leap forward in distributed data collection.

For those interested in delving deeper or building upon our work, the entire codebase discussed throughout these articles is accessible at: GitHub - Stackadoc's AWS Distributed Scrapy.

Future Directions

There's always room for enhancement and optimization. Two potential areas for further development include:

  1. Resilience in Failure Handling: Implementing a mechanism to gracefully handle container failures by rerouting ongoing spider tasks back to the shared queue. This would significantly improve the system's resilience, ensuring continuous data collection without loss of progress.
  2. Streamlined Deployment: Integrating AWS CodeDeploy for a seamless update process of the Docker images within the ECS cluster. This approach would minimize downtime and facilitate a smoother workflow for deploying updates and improvements.
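As a rough sketch of the first idea: ECS sends SIGTERM to a container before stopping it, so a shutdown hook could hand the worker's in-flight jobs back to the shared queue. The jobs table, column names, and worker IDs below are hypothetical, and sqlite3 stands in for the Postgres queue from the previous article:

```python
import sqlite3

def requeue_running_jobs(conn, worker_id):
    """Return this worker's in-flight jobs to the shared queue so that
    another container can pick them up after a failure or shutdown."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'pending', worker = NULL "
        "WHERE status = 'running' AND worker = ?",
        (worker_id,),
    )
    conn.commit()
    return cur.rowcount  # number of jobs handed back

# Demo with an in-memory queue table (schema is hypothetical)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (spider TEXT, status TEXT, worker TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [("spider_a", "running", "worker-1"),
     ("spider_b", "running", "worker-2"),
     ("spider_c", "pending", None)],
)
print(requeue_running_jobs(conn, "worker-1"))  # -> 1
```

In the real system the same UPDATE would run against Postgres, triggered from a SIGTERM handler (e.g. registered with signal.signal) and complemented by a watchdog that requeues jobs owned by workers that have stopped responding.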

Acknowledgements

Thank you to all our readers for joining us on this journey. We hope that our exploration into building a distributed scraping system has been both informative and inspiring. As the landscape of web data extraction evolves, so too will the strategies and technologies we employ. We look forward to continuing this journey together.

Amin SAFFAR

Backend Engineer @Stackadoc