Make similarity search using vector database

In this article I will walk you through how you can make simple similarity search with vector database and S-BERT model

Prerequisites

There are not so much prerequisites because it's a beginner level tutorial of how to work with vector databases so all you need is some basic knowledge of vector database and how to use pre-trained machine learning models. Other than that you need python 🐍 installed on your computer. We are using Milvus as our database for this tutorial but you can choose any database from below (just method will change)

Steps

Setup ML model
Setup DB and insert some data
Search in your DB

Yup it's that easy

Setup ML model

pip install -U sentence-transformers
pip install -U pymilvus

We will be using S-BERT model to covert our sentences to vector embeddings to store it easily in database. We are using pre-trained model here to generate embeddings.

# model.py
from sentence_transformers import SentenceTransformer

def get_embedding(sentence):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embedding = model.encode(sentence) # emedding.shape = (1, 384)
    return embedding

Setup DB and insert some data

We are using pymilvus to connect to our local database (not recommended for production). Then I wrote some random sentences and generated vector embeddings for all them. If you have mongodb or any other document based database then it will be familiar to work with milvus because we also create collection to store data.

make sure you make collection with proper embeddings dimension according your model

Then we can insert our embeddings with their sentences in database.

# upload.py
from pymilvus import MilvusClient
from model import get_embedding

sentences = [
    "I am a very smart person, but I have a lot of stupid questions",
    "You have same humor level as me if you laugh at dad jokes",
    "If you are stupid and funny then poeple will laugh at you"
]

vectors = [get_embedding(sentence) for sentence in sentences]

data = [
    {"id": i, "vector": vectors[i], "text": sentences[i]}
    for i in range(len(vectors))
]

print("Data has", len(data), "entities, each with fields: ", data[0].keys())
print("Vector dim:", len(data[0]["vector"]))

client = MilvusClient("demo.db")

if not client.has_collection(collection_name="demo"):
    client.create_collection(
        collection_name="demo",
        dimension=384,
    )

client.insert(collection_name="demo", data=data)

Search in your DB

Now it's last part where we will write code for our search. It too easy in milvus. You just need to create embeddings for your query and then you can search for that embedding. you can also setup indexing and all but that will be too much for this tutorial.

# search.py
from pymilvus import MilvusClient
from model import get_embedding

query = "What do you think about laughing at dad jokes?"
embeddings = get_embedding(query)

client = MilvusClient("demo.db")

query_vectors = [[float(e) for e in embeddings]]

res = client.search(
    collection_name="demo",
    data=query_vectors,
    limit=1,
    output_fields=["text"],
)

print(res)

Now it's time to test it

It's simple. Run your upload script and then search script.

python upload.py

python search.py

Congratulations!! You created your first similarity search using python.