Make it CLAP: Building a text-to-audio AI search engine in under 17 minutes!

Image: courtesy of Busta Rhymes, for the nostalgic among us

I know… It’s cool, right? It’s like building Google for sound!

In this article, we are going to build your very own personalized music search engine using an open-source sound-embedding model and Qdrant, a vector database that has been gaining serious attention in the AI community lately.

Let’s make it a symphony of sounds and be the conductor! An introduction

Musicians, sound designers, and audio professionals alike struggle with a shared reality: finding the right sound in the vast ocean of samples, stems, and tracks is especially complicated. First and foremost, sound is bound to time: it is hard to listen to two tracks at once, and virtually impossible to listen to more than that simultaneously.

Images, on the contrary, are easy to skim: scroll through a page and you can absorb dozens at a glance. Sound is a much more demanding medium, setting the stage for a much more deliberate approach to search.

As our world becomes increasingly auditory, the demand for efficient ways to search and access information through sound is growing rapidly. Traditional text-based search methods struggle to accurately capture the essence of a complex audio or musical piece (imagine trying to describe a symphony or the ambient sounds of a rainforest accurately in words; it's quite challenging).

Enter the revolutionary concept of Reverse Audio Search, akin to Google Images for sound. This technology enables users to upload or input a sound clip into a search engine to locate similar or identical audio tracks, marking a significant advancement in the realm of auditory search.

Leveraging the power of AI takes this efficiency and convenience to another level. By pairing AI capabilities with a reverse audio search engine, the search experience is greatly enhanced. Imagine creating your own personalized audio search engine right on your device: your own bespoke Google for sound.

Silence! Let the (binary) music begin

Technical music sheet

This process is divided into the following sections:

  • Environment setup.
  • Audio data pre-processing and populating the Qdrant vector database.
  • Gradio interface setup.
  • Testing text-to-audio search.

Install Docker (if you haven’t already).

Please use the great documentation of our friends at DigitalOcean!

Here’s the Ubuntu 22.04 guide (but it surely exists for your own flavour):

https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04
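Once Docker is installed, a quick hello-world run confirms the daemon is listening (this is Docker's standard smoke test, nothing project-specific):

docker run hello-world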

Orchestrating Qdrant

Kicking off the setup for this musical project starts with retrieving the Docker container image imbued with melodious elements, then deploying it on your local Docker daemon. (Make sure the Docker daemon is running before proceeding.)

Harmonize your system by pulling the Qdrant container image, a symphony of data, from the Docker Hub repository. Then conduct the container with the following command, setting the stage for the application to perform at localhost:6333, a digital concert hall for your project's operatic debut.

docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 \
   -v $(pwd)/qdrant_storage:/qdrant/storage:z \
   qdrant/qdrant

NOTE: If you are running on Windows, kindly replace $(pwd) with your local path.
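To check that the concert hall is indeed open, here is a quick sanity check from Python (a minimal sketch, assuming the qdrant-client package is installed, e.g. via pip install qdrant-client):

from qdrant_client import QdrantClient

# Connect to the local Qdrant instance started by Docker
client = QdrantClient(host="localhost", port=6333)

# List the existing collections; an empty list just means a fresh database
print(client.get_collections())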

The Python code

Set up your environment and install the requirements:

git clone git@github.com:stackadoc/audiosearch.git
cd audiosearch
virtualenv --python 3.10 venv
source venv/bin/activate
pip install -r requirements.txt

Data Pre-Processing and Populating the Vector Database

Download the demo samples

For this project I’ve used the https://www.kaggle.com/competitions/park-spring-2023-music-genre-recognition/data dataset, a collection of 4,500 files split into 10 music genre categories.

  1. First, let’s install the Kaggle package; open a Jupyter notebook in VS Code and install it with pip install kaggle.
  2. Obtain your Kaggle API key: on Kaggle, go to your account settings and, under the ‘API’ section, click ‘Create New API Token’. This downloads a file named ‘kaggle.json’ which holds the required credentials.
  3. Move the downloaded ‘kaggle.json’ file to your project directory (note that the Kaggle CLI looks for it in ~/.kaggle by default, so point the KAGGLE_CONFIG_DIR environment variable at your project directory if you keep it there).
  4. Open a terminal and run the following command to download the dataset mentioned above: kaggle competitions download -c park-spring-2023-music-genre-recognition
  5. After downloading, unzip the archive to extract its contents for further processing.
  6. Note the path of the train folder, where the genre subfolders live:

# For me, it gives the following:
KAGGLE_DB_PATH = '/home/arthur/data/kaggle/park-spring-2023-music-genre-recognition/train/train'

Populate the demo database using database.py

Simply modify the parameters CACHE_FOLDER and KAGGLE_DB_PATH in the database.py file.
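Then run the script with python database.py. Under the hood, it walks the dataset, embeds each clip with the CLAP model, and upserts the resulting vectors into Qdrant. Here is a minimal sketch of that loop; the checkpoint (laion/clap-htsat-unfused), the collection name (audio_samples), the .wav extension, and the use of librosa are illustrative assumptions, and the real script may differ:

import glob

import librosa
import torch
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from transformers import ClapModel, ClapProcessor

KAGGLE_DB_PATH = "/path/to/train"  # the train folder noted earlier

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

client = QdrantClient(host="localhost", port=6333)
client.recreate_collection(
    collection_name="audio_samples",
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

for idx, path in enumerate(glob.glob(f"{KAGGLE_DB_PATH}/*/*.wav")):
    # CLAP expects 48 kHz mono audio
    audio, _ = librosa.load(path, sr=48000, mono=True)
    inputs = processor(audios=audio, sampling_rate=48000, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_audio_features(**inputs)[0]
    # Store each vector with the file path as payload, so results map back to sounds
    client.upsert(
        collection_name="audio_samples",
        points=[PointStruct(id=idx, vector=embedding.tolist(), payload={"path": path})],
    )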

Finally, launch the Gradio app and enjoy!

Here’s the command:

python app.py

Open your favorite web browser and go to the local URL displayed in your terminal output (for me, Gradio opens at http://127.0.0.1:7861/).

Now, go and find sounds!
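For the curious, the search itself is the mirror image of the ingestion step: the query text goes through CLAP's text encoder, and the resulting vector is matched against the stored audio embeddings in Qdrant. Here is a minimal sketch of what app.py does, under the same illustrative assumptions as above (checkpoint and collection name):

import gradio as gr
import torch
from qdrant_client import QdrantClient
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
client = QdrantClient(host="localhost", port=6333)

def search(query: str) -> str:
    # Embed the text query into the same space as the audio vectors
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        vector = model.get_text_features(**inputs)[0]
    # Ask Qdrant for the five nearest audio embeddings
    hits = client.search(
        collection_name="audio_samples",
        query_vector=vector.tolist(),
        limit=5,
    )
    return "\n".join(hit.payload["path"] for hit in hits)

demo = gr.Interface(fn=search, inputs="text", outputs="text")
demo.launch()

Try queries like "energetic rock guitar" or "calm jazz piano" and the closest samples come back by file path.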

For the impatient, and for those who want to see results, please take a look at our video!

Arthur Renaud

CEO @Stackadoc