After following this guide, you will be able to use a
pre-trained model to populate a Qdrant
database with text and to search it for relevant results.
Creating a searchable database is straightforward, and you can achieve basic functionality with Postgres'
full text search alone.
The novel advantage of pairing a pre-trained model with Qdrant is the ability to ingest and search data in any
language that the model supports. The model we will be using supports over 50 languages.
Let's consider a scenario where you have a collection of English and French song lyrics. The model can convert
each song's lyrics into a vector, which can then be stored in Qdrant. Once this is done, you can run a search query
in English and find semantically similar lyrics in both English and French.
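To make the idea concrete, here is a toy sketch of how similarity between two vectors is measured. The vectors below are made-up 4-element stand-ins for real 768-element embeddings, and the cosine-similarity function is written out by hand for illustration; in practice the model produces the vectors and Qdrant computes the similarity for you.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings of lyrics.
english_lyrics = [0.9, 0.1, 0.3, 0.4]    # hypothetical embedding of English lyrics
french_lyrics = [0.8, 0.2, 0.35, 0.45]   # hypothetical embedding of similar French lyrics
unrelated = [-0.5, 0.9, -0.2, 0.1]       # hypothetical embedding of unrelated text

print(cosine_similarity(english_lyrics, french_lyrics))  # close to 1 (similar meaning)
print(cosine_similarity(english_lyrics, unrelated))      # much lower (different meaning)
```

A multilingual model is trained so that text with the same meaning lands near the same point in vector space regardless of language, which is what makes the cross-language search work.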
This guide assumes that you are using Linux (Ubuntu, in my case) and that you have enough
knowledge to follow along with any linked pages and examples.
Requirements
- Python 3.10.x
  - You may need to run `sudo apt install libffi-dev libsqlite3-dev` before installing Python, so that the
    `_ctypes` and `sqlite3` modules are available for use.
- A running Qdrant instance (the code below connects to one at `host.docker.internal:6333`).
- The following dependencies, installed via pip:
  - Qdrant's Python client, `qdrant-client`
  - `sentence-transformers` (which pulls in `torch`)
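Assuming Docker is available, the setup can be sketched along these lines (the pip package names are inferred from the imports in the code below; check each project's documentation for version specifics):

```shell
# Start a local Qdrant instance; the official image serves the HTTP API on port 6333.
docker run -d -p 6333:6333 qdrant/qdrant

# Install the Python client and the embedding model library.
pip install qdrant-client sentence-transformers
```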
Code
```python
from qdrant_client import QdrantClient
from qdrant_client.http import models
from sentence_transformers import SentenceTransformer
from torch.cuda import is_available
from uuid import uuid4


def text_to_vector(text):
    """
    Encodes the provided text into a vector, using a pre-trained model.

    :param text: Text to convert.
    :return: Vector representation of the text.
    """
    model = SentenceTransformer(
        device="cuda" if is_available() else "cpu",
        model_name_or_path="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
        trust_remote_code=True
    )
    return model.encode(text).tolist()


def ingest(qdrant_client, text):
    """
    Ingests the provided text into Qdrant.

    :param qdrant_client: Client connected to the Qdrant instance.
    :param text: Text to ingest.
    """
    qdrant_client.upsert(
        collection_name="example",
        points=[
            models.PointStruct(
                id=str(uuid4()),
                payload={"text": text},
                vector=text_to_vector(text)
            )
        ]
    )


def search(qdrant_client, query):
    """
    Searches Qdrant for text similar to the provided query, and prints the results.

    :param qdrant_client: Client connected to the Qdrant instance.
    :param query: Query to search with.
    :return: None
    """
    results = qdrant_client.search(
        collection_name="example",
        limit=10,
        query_vector=text_to_vector(query),
        with_payload=True
    )
    print(f"Search Results for '{query}':")
    for result in results:
        score = round(result.score * 100, 2)
        print(f"\tScore: {score}\tText: '{result.payload['text']}'")


# Delete and recreate the Qdrant collection.
qdrant_client = QdrantClient(host="host.docker.internal", port=6333)
qdrant_client.recreate_collection(
    collection_name="example",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE)
)

# Ingest some example text.
ingest(qdrant_client, "Hello, World!")
ingest(qdrant_client, "Olá Mundo!")
ingest(qdrant_client, "こんにちは世界")
ingest(qdrant_client, "हैलो वर्ल्ड!")

# Search for similar text.
search(qdrant_client, "Hello, World!")
search(qdrant_client, "Olá Mundo!")
search(qdrant_client, "こんにちは世界")
search(qdrant_client, "हैलो वर्ल्ड!")
```
text_to_vector downloads and loads this model
(sentence-transformers/paraphrase-multilingual-mpnet-base-v2),
and uses it to convert any input text into a 768-element vector.
See here for a list of available models and their tradeoffs; if you switch models, the
size in VectorParams must match the new model's output dimension.
ingest creates a point and inserts it into the example collection.
search runs a Qdrant search to find and display results that are semantically similar to the input query.
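Conceptually, a cosine-similarity search scores every stored vector against the query vector and returns the best matches. Here is a toy in-memory equivalent of what the search does, with small made-up vectors in place of real 768-element embeddings (illustrative only; Qdrant uses approximate indexes rather than scanning every point):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy "collection": text payloads with made-up vectors.
points = [
    {"text": "Hello, World!", "vector": [0.9, 0.1, 0.2]},
    {"text": "Olá Mundo!", "vector": [0.8, 0.2, 0.25]},
    {"text": "goodbye", "vector": [-0.1, 0.9, -0.3]},
]

def toy_search(query_vector, limit=10):
    """Score every point against the query and return the top `limit` by similarity."""
    scored = [(cosine(query_vector, p["vector"]), p["text"]) for p in points]
    scored.sort(reverse=True)
    return scored[:limit]

# Querying with the "Hello, World!" vector ranks that point first with a score of 1.0.
for score, text in toy_search([0.9, 0.1, 0.2]):
    print(f"\tScore: {round(score * 100, 2)}\tText: '{text}'")
```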
Code Output
```
Search Results for 'Hello, World!':
	Score: 100.0	Text: 'Hello, World!'
	Score: 99.09	Text: 'हैलो वर्ल्ड!'
	Score: 96.71	Text: 'こんにちは世界'
	Score: 88.44	Text: 'Olá Mundo!'
Search Results for 'Olá Mundo!':
	Score: 100.0	Text: 'Olá Mundo!'
	Score: 89.48	Text: 'हैलो वर्ल्ड!'
	Score: 88.58	Text: 'こんにちは世界'
	Score: 88.44	Text: 'Hello, World!'
Search Results for 'こんにちは世界':
	Score: 100.0	Text: 'こんにちは世界'
	Score: 96.71	Text: 'Hello, World!'
	Score: 95.36	Text: 'हैलो वर्ल्ड!'
	Score: 88.58	Text: 'Olá Mundo!'
Search Results for 'हैलो वर्ल्ड!':
	Score: 100.0	Text: 'हैलो वर्ल्ड!'
	Score: 99.09	Text: 'Hello, World!'
	Score: 95.36	Text: 'こんにちは世界'
	Score: 89.48	Text: 'Olá Mundo!'
```
Notes
- You can use OpenAI's Whisper to generate transcriptions/lyrics of an audio file. See this post for more
  information on how to do so.