Why We Chose Scaleway Over AWS for AI Music Generation
Learn why Scaleway was the perfect choice to balance cost and performance in GPU instance management.
Our tool, synthia.co, leverages Facebook's MusicGen model to create the incredible samples you request. Initially, we used Replicate to host the model in a serverless environment. This was a relief at first, as Replicate eliminated the complexities of deploying and managing multiple MusicGen models. However, the convenience came at a cost, both financially and in terms of performance. As our user base grew, so did our expenses. Moreover, we faced significant issues with cold start times, with some generation requests taking up to five minutes to produce an audio sample. Given the importance of rapid response times for maintaining user engagement, we recognized the need to deploy and manage our models in-house to improve efficiency.
In this article, we will discuss our choice of cloud provider and detail the architecture we implemented to seamlessly integrate the MusicGen model with our synthia.co platform.
Cloud provider choice
To meet our goal of responding to generation requests within 20 seconds, it was clear that running predictions on GPU instances was essential. With many cloud providers offering access to GPU instances, we evaluated several options, including AWS, Google Cloud Platform (GCP), and Scaleway.
AWS and GCP are well-known for their robust infrastructure and extensive services, but they come with a higher price tag and complexity. AWS offers a wide range of GPU instances, such as the P3 and G4 series, which are powerful but significantly more expensive than other options. GCP provides similar high-performance GPUs, like the Tesla T4 and V100, but also at a premium cost. Both AWS and GCP have a steep learning curve, with a multitude of services and configurations that can be overwhelming.
In contrast, Scaleway offers a more cost-effective solution for GPU instances. Scaleway’s pricing is generally lower than that of AWS and GCP, making it an attractive option for startups and smaller projects. Additionally, Scaleway provides a simpler and more straightforward user experience, with an easy-to-navigate interface and fewer service configurations to manage. This simplicity reduces the overhead of managing infrastructure, allowing us to focus more on our application development.
Given our requirements for affordability and ease of use, we chose Scaleway GPU instances. Scaleway not only met our performance needs but also offered a balance of cost efficiency and simplicity, making it the ideal choice for our project.
Architecture
To integrate our model with the synthia.co platform, we created a REST endpoint that accepts generation parameters and returns the generated audio to the user.
We built our API using the Python framework FastAPI, which allowed us to set up our endpoint quickly and robustly. When a user calls the endpoint, the `predict` method is invoked to generate and return an audio sample. While this setup works well for a few requests per day, it is insufficient for our needs because of the limit on the number of simultaneous predictions we can run. MusicGen models are large and memory-intensive; on an L40S GPU, for instance, we can only handle up to three predictions in parallel. This means that if more than three users request audio generation simultaneously, the extra requests fail because there is no memory left on the instance.
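As an illustration, here is a minimal sketch of what this first, queue-less version looks like. The route path, the `GenerationRequest` fields, and the `predict` body are assumptions for the example, not synthia.co's actual code:

```python
from fastapi import FastAPI, Response
from pydantic import BaseModel

app = FastAPI()

class GenerationRequest(BaseModel):
    prompt: str         # e.g. "lo-fi hip hop beat, 90 bpm" (illustrative field)
    duration: int = 10  # desired sample length in seconds (illustrative field)

def predict(prompt: str, duration: int) -> bytes:
    """Run MusicGen inference and return the audio as WAV bytes.

    Placeholder body: the real method would load the MusicGen model onto
    the GPU and synthesize the sample.
    """
    return b"\x00" * 44  # dummy bytes standing in for a WAV payload

@app.post("/generate")
def generate_audio(request: GenerationRequest) -> Response:
    # Synchronous: the HTTP request is held open until the sample is ready,
    # and every in-flight request occupies GPU memory.
    audio = predict(request.prompt, request.duration)
    return Response(content=audio, media_type="audio/wav")
```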
To address this, we implemented a queuing system to store generation requests and process them asynchronously based on available resources. The queue supports asynchronous communication and acts as a buffer that distributes requests across workers. Decoupling the API from the workers in this way is a proven pattern for building scalable and reliable applications: the producer can post messages even when no worker is available, and workers pull messages from the queue as soon as they are free, without ever exceeding the resource limits of the instance.
For our queue, we used Celery, a simple, flexible, and reliable distributed task queue designed to process vast amounts of messages while providing the tooling needed to operate such a system. As the broker, we chose RabbitMQ, a reliable and mature messaging and streaming broker that is easy to deploy, interoperable, and flexible. To store execution results and other task metadata, we used Redis.
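Wiring these three pieces together in Celery takes only a few lines. The sketch below is illustrative: the module name `tasks.py`, the connection URLs, and the task body are assumptions, not our production configuration:

```python
# tasks.py — a minimal sketch of the Celery wiring. The connection URLs
# and the task body are illustrative placeholders.
from celery import Celery

celery_app = Celery(
    "synthia",
    broker="amqp://guest:guest@rabbitmq:5672//",  # RabbitMQ holds pending tasks
    backend="redis://redis:6379/0",               # Redis stores states and results
)

@celery_app.task(bind=True)
def generate_task(self, prompt: str, duration: int) -> str:
    # In production this would call predict() on the GPU; the return value
    # (here a hypothetical storage URL) is persisted in the Redis backend.
    self.update_state(state="PROGRESS", meta={"step": "generating"})
    return f"https://storage.example.com/{self.request.id}.wav"  # placeholder
```

A worker can then be started with `celery -A tasks worker --concurrency=3`, where the concurrency would match the three parallel predictions an L40S can hold in memory.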
Here’s how the system works:
- FastAPI receives a request: the endpoint creates a Celery task, which is stored in the RabbitMQ broker.
- Task execution: When a new task is added, Celery checks for available workers. If a worker is available, it pulls the task and starts execution.
- Status monitoring: The progress and status of task execution are saved in Redis. We also exposed a second endpoint to return the status of each task.
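To make this flow concrete, here is a minimal sketch of the two endpoints; the route paths and response shapes are illustrative, and it reuses the hypothetical `tasks.py` module from the sketch above:

```python
# api.py — a sketch of how the two endpoints talk to Celery.
from celery.result import AsyncResult
from fastapi import FastAPI
from pydantic import BaseModel

from tasks import celery_app, generate_task  # the Celery app sketched above

app = FastAPI()

class GenerationRequest(BaseModel):
    prompt: str
    duration: int = 10

@app.post("/generate")
def enqueue_generation(request: GenerationRequest) -> dict:
    # .delay() publishes the task to RabbitMQ and returns immediately,
    # so the API stays responsive even when every worker is busy.
    task = generate_task.delay(request.prompt, request.duration)
    return {"task_id": task.id}

@app.get("/status/{task_id}")
def task_status(task_id: str) -> dict:
    # State and results are read back from the Redis result backend.
    result = AsyncResult(task_id, app=celery_app)
    return {
        "task_id": task_id,
        "state": result.state,  # e.g. PENDING, PROGRESS, SUCCESS, FAILURE
        "result": result.result if result.ready() else None,
    }
```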
Celery offers a monitoring tool called Flower, which we found very useful for tracking the state of workers and viewing the tasks being executed.
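In a setup like this, Flower can be launched against the same Celery app (assuming it lives in `tasks.py`, as in the sketch above) with `celery -A tasks flower`, which serves its dashboard on port 5555 by default.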
Below is an illustration of this architecture:
Each component in this architecture runs in a separate Docker container. We use Docker Swarm to orchestrate these containers, allowing us to scale the number of workers based on demand, perform blue/green deployments with minimal downtime, and automatically recover containers in case of failure. Docker Swarm enabled us to run Celery workers (responsible for generating music samples using the MusicGen models) on multiple GPU instances.
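For illustration, a trimmed Swarm stack file for this architecture might look like the following; the image names, replica count, and GPU node label are assumptions rather than our exact configuration:

```yaml
# docker-compose.yml — an illustrative sketch of a Swarm stack for this
# architecture. Image names and replica counts are placeholders.
version: "3.8"
services:
  api:
    image: registry.gitlab.com/synthia/api:latest
    ports:
      - "80:8000"
  rabbitmq:
    image: rabbitmq:3-management
  redis:
    image: redis:7
  worker:
    image: registry.gitlab.com/synthia/worker:latest
    deploy:
      replicas: 3                    # scaled up or down with demand
      placement:
        constraints:
          - node.labels.gpu == true  # pin workers to the GPU instances
```

With such a file, `docker stack deploy -c docker-compose.yml synthia` deploys or updates the stack, and `docker service scale synthia_worker=6` adjusts worker capacity on demand.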
CI/CD
This project is hosted on GitLab, where we leverage GitLab pipelines to automate our CI/CD process. Our pipeline consists of five stages:
- Build Stage: Docker images are built.
- Lint Stage: Python code is verified for PEP 8 compliance.
- Test Stage: Unit tests are executed on various components.
- Delivery Stage: The latest stable version is pushed to the GitLab container registry.
- Deployment Stage: New images are deployed on the Docker Swarm.
We utilize a GitLab Runner installed on the Swarm manager instance to initiate the deployment with the latest stable Docker images. This setup ensures a streamlined and automated workflow, from code validation to deployment.
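An abridged `.gitlab-ci.yml` for such a pipeline might look like this; the specific linter (flake8), test runner (pytest), and job scripts are assumptions for the sketch, not our exact configuration:

```yaml
# .gitlab-ci.yml — an illustrative sketch of the five stages.
stages: [build, lint, test, delivery, deploy]

build:
  stage: build
  script:
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .

lint:
  stage: lint
  script:
    - flake8 .        # PEP 8 compliance check (assumed linter)

test:
  stage: test
  script:
    - pytest tests/   # assumed test runner

delivery:
  stage: delivery
  script:
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

deploy:
  stage: deploy
  tags: [swarm-manager]  # routes the job to the runner on the Swarm manager
  script:
    - docker stack deploy --with-registry-auth -c docker-compose.yml synthia
```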
In conclusion
Using Facebook's MusicGen model, synthia.co generates audio samples on demand. Initially, we used Replicate's serverless hosting, which simplified deployment but led to higher costs and performance issues as our user base grew. To improve efficiency, we decided to deploy and manage our models in-house.
We compared AWS, Google Cloud Platform (GCP), and Scaleway. While AWS and GCP are powerful, they are expensive and complex. Scaleway provided a more cost-effective and user-friendly solution, making it our preferred choice.
Our architecture includes a FastAPI endpoint that handles requests to generate audio. We implemented a queuing system with Celery and RabbitMQ to manage requests asynchronously and use Redis to store execution results. These results can be collected via a second FastAPI endpoint.
The system is containerized with Docker and orchestrated using Docker Swarm, allowing us to scale workers, perform seamless deployments, and ensure automatic recovery. Our CI/CD pipeline on GitLab includes stages for building, linting, testing, delivering, and deploying, ensuring an efficient and automated workflow.
This setup allows us to manage resources effectively, maintain performance, and deliver high-quality audio samples while keeping costs under control.