In this article I will walk you through how you can make simple similarity search with vector database and S-BERT model
Prerequisites
There are not so much prerequisites because it's a beginner level tutorial of how to work with vector databases so all you need is some basic knowledge of vector database and how to use pre-trained machine learning models. Other than that you need python ๐ installed on your computer. We are using Milvus as our database for this tutorial but you can choose any database from below (just method will change)
Steps
Setup ML model
Setup DB and insert some data
Search in your DB
Yup it's that easy
Setup ML model
pip install -U sentence-transformers
pip install -U pymilvus
We will be using S-BERT model to covert our sentences to vector embeddings to store it easily in database. We are using pre-trained model here to generate embeddings.
# model.py
from sentence_transformers import SentenceTransformer
def get_embedding(sentence):
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode(sentence) # emedding.shape = (1, 384)
return embedding
Setup DB and insert some data
We are using pymilvus to connect to our local database (not recommended for production). Then I wrote some random sentences and generated vector embeddings for all them. If you have mongodb or any other document based database then it will be familiar to work with milvus because we also create collection to store data.
make sure you make collection with proper embeddings dimension according your model
Then we can insert our embeddings with their sentences in database.
# upload.py
from pymilvus import MilvusClient
from model import get_embedding
sentences = [
"I am a very smart person, but I have a lot of stupid questions",
"You have same humor level as me if you laugh at dad jokes",
"If you are stupid and funny then poeple will laugh at you"
]
vectors = [get_embedding(sentence) for sentence in sentences]
data = [
{"id": i, "vector": vectors[i], "text": sentences[i]}
for i in range(len(vectors))
]
print("Data has", len(data), "entities, each with fields: ", data[0].keys())
print("Vector dim:", len(data[0]["vector"]))
client = MilvusClient("demo.db")
if not client.has_collection(collection_name="demo"):
client.create_collection(
collection_name="demo",
dimension=384,
)
client.insert(collection_name="demo", data=data)
Search in your DB
Now it's last part where we will write code for our search. It too easy in milvus. You just need to create embeddings for your query and then you can search for that embedding. you can also setup indexing and all but that will be too much for this tutorial.
# search.py
from pymilvus import MilvusClient
from model import get_embedding
query = "What do you think about laughing at dad jokes?"
embeddings = get_embedding(query)
client = MilvusClient("demo.db")
query_vectors = [[float(e) for e in embeddings]]
res = client.search(
collection_name="demo",
data=query_vectors,
limit=1,
output_fields=["text"],
)
print(res)
Now it's time to test it
It's simple. Run your upload script and then search script.
python upload.py
python search.py
Congratulations!! You created your first similarity search using python.