After following this guide, you will be able to use a
pre-trained model to populate a Qdrant
database with text and to search it for relevant results.
Creating a searchable database is straightforward, and you can achieve basic functionality with Postgres'
full text search alone.
The novel advantage of pairing a pre-trained model with Qdrant is the ability to ingest and search data in any
language that the model supports. The model we will be using supports over 50 languages.
Let's consider a scenario where you have a collection of English and French song lyrics. The model can convert
each song's lyrics into a vector, which can then be stored in Qdrant. Once this is done, you can run a search query
in English and find semantically similar lyrics in both English and French.
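To make the idea concrete, here is a toy sketch of how similarity between two vectors is measured. The vectors below are made-up 4-element stand-ins for real 768-element embeddings, and the cosine-similarity function is written out by hand for illustration; in practice the model produces the vectors and Qdrant computes the similarity for you.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings of lyrics.
english_lyrics = [0.9, 0.1, 0.3, 0.4]    # hypothetical embedding of English lyrics
french_lyrics = [0.8, 0.2, 0.35, 0.45]   # hypothetical embedding of similar French lyrics
unrelated = [-0.5, 0.9, -0.2, 0.1]       # hypothetical embedding of unrelated text

print(cosine_similarity(english_lyrics, french_lyrics))  # close to 1 (similar meaning)
print(cosine_similarity(english_lyrics, unrelated))      # much lower (different meaning)
```

A multilingual model is trained so that text with the same meaning lands near the same point in vector space regardless of language, which is what makes the cross-language search work.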
This guide assumes that you are using Linux (Ubuntu, in my case) and that you have enough
knowledge to follow along with any linked pages and examples.
Requirements
- Python 3.10.x
  - You may need to run `sudo apt install libffi-dev libsqlite3-dev` before installing Python, so that the
    `_ctypes` and `sqlite3` modules are available for use.
- A running Qdrant instance (the code below connects to one at `host.docker.internal:6333`).
- The following dependencies, installed via pip:
  - Qdrant's Python client, `qdrant-client`
  - `sentence-transformers` (which pulls in `torch`)
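Assuming Docker is available, the setup can be sketched along these lines (the pip package names are inferred from the imports in the code below; check each project's documentation for version specifics):

```shell
# Start a local Qdrant instance; the official image serves the HTTP API on port 6333.
docker run -d -p 6333:6333 qdrant/qdrant

# Install the Python client and the embedding model library.
pip install qdrant-client sentence-transformers
```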
Code
```python
from qdrant_client import QdrantClient
from qdrant_client.http import models
from sentence_transformers import SentenceTransformer
from torch.cuda import is_available
from uuid import uuid4


def text_to_vector(text):
    """
    Encodes the provided text into a vector, using a pre-trained model.

    :param text: Text to convert.
    :return: Vector representation of the text.
    """
    model = SentenceTransformer(
        device="cuda" if is_available() else "cpu",
        model_name_or_path="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
        trust_remote_code=True
    )
    return model.encode(text).tolist()


def ingest(qdrant_client, text):
    """
    Ingests the provided text into Qdrant.

    :param qdrant_client: Client connected to the Qdrant instance.
    :param text: Text to ingest.
    """
    qdrant_client.upsert(
        collection_name="example",
        points=[
            models.PointStruct(
                id=str(uuid4()),
                payload={"text": text},
                vector=text_to_vector(text)
            )
        ]
    )


def search(qdrant_client, query):
    """
    Searches Qdrant for text similar to the provided query, and prints the results.

    :param qdrant_client: Client connected to the Qdrant instance.
    :param query: Query to search with.
    :return: None
    """
    results = qdrant_client.search(
        collection_name="example",
        limit=10,
        query_vector=text_to_vector(query),
        with_payload=True
    )
    print(f"Search Results for '{query}':")
    for result in results:
        score = round(result.score * 100, 2)
        print(f"\tScore: {score}\tText: '{result.payload['text']}'")


# Delete and recreate the Qdrant collection.
qdrant_client = QdrantClient(host="host.docker.internal", port=6333)
qdrant_client.recreate_collection(
    collection_name="example",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE)
)

# Ingest some example text.
ingest(qdrant_client, "Hello, World!")
ingest(qdrant_client, "Olá Mundo!")
ingest(qdrant_client, "こんにちは世界")
ingest(qdrant_client, "हैलो वर्ल्ड!")

# Search for similar text.
search(qdrant_client, "Hello, World!")
search(qdrant_client, "Olá Mundo!")
search(qdrant_client, "こんにちは世界")
search(qdrant_client, "हैलो वर्ल्ड!")
```
text_to_vector downloads and loads this model
(sentence-transformers/paraphrase-multilingual-mpnet-base-v2),
and uses it to convert any input text into a 768-element vector.
See here for a list of available models and their tradeoffs; if you switch models, the
size in VectorParams must match the new model's output dimension.
ingest creates a point and inserts it into the example collection.
search runs a Qdrant search to find and display results that are semantically similar to the input query.
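Conceptually, a cosine-similarity search scores every stored vector against the query vector and returns the best matches. Here is a toy in-memory equivalent of what the search does, with small made-up vectors in place of real 768-element embeddings (illustrative only; Qdrant uses approximate indexes rather than scanning every point):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy "collection": text payloads with made-up vectors.
points = [
    {"text": "Hello, World!", "vector": [0.9, 0.1, 0.2]},
    {"text": "Olá Mundo!", "vector": [0.8, 0.2, 0.25]},
    {"text": "goodbye", "vector": [-0.1, 0.9, -0.3]},
]

def toy_search(query_vector, limit=10):
    """Score every point against the query and return the top `limit` by similarity."""
    scored = [(cosine(query_vector, p["vector"]), p["text"]) for p in points]
    scored.sort(reverse=True)
    return scored[:limit]

# Querying with the "Hello, World!" vector ranks that point first with a score of 1.0.
for score, text in toy_search([0.9, 0.1, 0.2]):
    print(f"\tScore: {round(score * 100, 2)}\tText: '{text}'")
```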
Code Output
```
Search Results for 'Hello, World!':
	Score: 100.0	Text: 'Hello, World!'
	Score: 99.09	Text: 'हैलो वर्ल्ड!'
	Score: 96.71	Text: 'こんにちは世界'
	Score: 88.44	Text: 'Olá Mundo!'
Search Results for 'Olá Mundo!':
	Score: 100.0	Text: 'Olá Mundo!'
	Score: 89.48	Text: 'हैलो वर्ल्ड!'
	Score: 88.58	Text: 'こんにちは世界'
	Score: 88.44	Text: 'Hello, World!'
Search Results for 'こんにちは世界':
	Score: 100.0	Text: 'こんにちは世界'
	Score: 96.71	Text: 'Hello, World!'
	Score: 95.36	Text: 'हैलो वर्ल्ड!'
	Score: 88.58	Text: 'Olá Mundo!'
Search Results for 'हैलो वर्ल्ड!':
	Score: 100.0	Text: 'हैलो वर्ल्ड!'
	Score: 99.09	Text: 'Hello, World!'
	Score: 95.36	Text: 'こんにちは世界'
	Score: 89.48	Text: 'Olá Mundo!'
```
Notes
- You can use OpenAI's Whisper to generate transcriptions/lyrics of an audio file. See this post for more
  information on how to do so.