Building a semantic search engine to discover new songs
I started to hear the term “semantic search” a lot lately but I wasn’t sure about what it meant until recently. After conducting some research and delving into numerous articles, I realized that semantic search is just one of the many innovative applications enabled by Large Language Models (LLMs).
However, despite my research the concept remained somewhat fuzzy 😉.
This is because most of the articles discussing it were mainly theoretical, and made it sound just like another buzzword. So the goal of this article is to fix this by to sheding some light on the topic and by providing a real world example.
Semantic search is a set of techniques that allow you to make context-dependent searches based on meaning and intent. It is often opposed to what makes today the vast majority of web, keyword search.
Keyword search looks for exact matches between words or sentences. Even though keyword search can be given some flexibility by using techniques such as fuzzy matching and synonym mapping, in the end, it doesn’t capture the intent of the request and might miss some relevant results only because the vocabulary used was different from the one that was stored and indexed in the database.
Semantic search does not look for exact matches. Instead, it provides results that match the meaning of the words or sentences requested.
To better grasp the difference, you can follow this Tricky Query from Cohere and this example showing about how the two search paradigms handle the distinction between Chocolate Milk vs Milk Chocolate from Elastic Search
The previous examples were nice to understand the concept. Now let’s see how we can use those new tools and techniques in real life.
In the tutorial-like example below, we will search for songs based on the messagesconveyed by their lyrics.
You can find the code for this tutorial here.
We will proceed with the following steps:
Firstly, we will load a dataset containing the lyrics of popular songs from streaming platforms.
Next, we will preprocess the dataset and generate embeddings for the song lyrics using OpenAI.
Then, we will input the data into a MongoDB vector database and establish the necessary indexes.