Retrieval augmented generation with OpenAI and OpenSearch®
Use OpenSearch® as a vector database to generate responses to user queries using AI
Retrieval augmented generation (RAG) is an AI technique that retrieves facts from an external knowledge base to provide large language models (LLMs) with the most accurate, up-to-date context, enabling them to craft better replies and reducing the risk of hallucination.
This tutorial guides you through an example of using Aiven for OpenSearch® as a backend vector database for OpenAI embeddings, and shows how to perform text, semantic, or mixed search, which can serve as the basis for a RAG system. We'll use a set of Wikipedia articles as the base knowledge to influence the replies of a chatbot.
Why use OpenSearch as a backend vector database?
OpenSearch is a widely adopted open source search/analytics engine. It allows you to store, query, and transform documents in a variety of shapes, and it provides fast and scalable functionality for both accurate and fuzzy text search. Using OpenSearch as a vector database enables you to mix and match semantic and text search queries on top of a performant and scalable engine.
Prerequisites
Before you begin, have the following:
- An Aiven Account. You can create an account and start a free trial with Aiven by navigating to the signup page and creating a user.
- An Aiven for OpenSearch service. You can spin up an Aiven for OpenSearch service in minutes in the Aiven Console with the following steps:
  - Click on Create service
  - Select OpenSearch
  - Choose the Cloud Provider and Region
  - Select the Service plan (the `hobbyist` plan is enough for the notebook)
  - Provide the Service name
  - Click on Create service
- The OpenSearch Connection String. The connection string is visible as Service URI in the Aiven for OpenSearch service overview page.
- Your OpenAI API key
- Python and `pip` installed locally.
Installing dependencies
The tutorial requires the following Python packages:
- `openai`
- `pandas`
- `wget`
- `python-dotenv`
- `opensearch-py`
You can install the above packages with:
```bash
pip install openai pandas wget python-dotenv opensearch-py
```
OpenAI key settings
We'll use OpenAI to create embeddings starting from a set of documents, so we need an API key. You can get one from the OpenAI API Key page after logging in.
To avoid leaking the OpenAI key, you can store it as an environment variable named `OPENAI_API_KEY`.
Info
For more information on how to perform the same task across other operating systems, refer to Best Practices for API Key Safety.

To store the information safely, create a `.env` file in the same folder where the notebook is located and add the following line, replacing `<INSERT_YOUR_API_KEY_HERE>` with your OpenAI API key:
```
OPENAI_API_KEY=<INSERT_YOUR_API_KEY_HERE>
```
Connect to Aiven for OpenSearch
Once the Aiven for OpenSearch service is in the `RUNNING` state, we can retrieve the connection string from the Aiven for OpenSearch service page.

Copy the Service URI parameter and store it in the same `.env` file created above, replacing the `https://USER:PASSWORD@HOST:PORT` placeholder with the Service URI:

```
OPENSEARCH_URI=https://USER:PASSWORD@HOST:PORT
```
We can now connect to Aiven for OpenSearch by adding the following code in Python:
```python
import os

from dotenv import load_dotenv
from opensearchpy import OpenSearch

# Load environment variables from the .env file
load_dotenv()

connection_string = os.getenv("OPENSEARCH_URI")

# Create the client with SSL/TLS enabled
client = OpenSearch(connection_string, use_ssl=True, timeout=100)
```
The code above reads the OpenSearch connection string from the `.env` file (`os.getenv("OPENSEARCH_URI")`) and creates a client connection using SSL with a timeout of 100 seconds.
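To confirm that the connection works before moving on, you can ask the cluster for its version details. This is just an optional sanity check, not part of the original flow:

```python
# Optional sanity check: print the cluster distribution and version
info = client.info()
print(f"Connected to {info['version']['distribution']} {info['version']['number']}")
```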
Download the dataset
In theory we could use any dataset for this purpose, and you are more than welcome to bring your own. However, for simplicity's sake and to avoid calculating embeddings over a huge set of documents, we'll use a dataset of Wikipedia articles with pre-calculated OpenAI embeddings. We can get the file and unzip it with:
```python
import wget
import zipfile

embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'
wget.download(embeddings_url)

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("data")
```
Let's load the file into a pandas dataframe and check its content with:
```python
import pandas as pd

wikipedia_dataframe = pd.read_csv("data/vector_database_wikipedia_articles_embedded.csv")
wikipedia_dataframe.head()
```
The file contains:
- `id`: a unique Wikipedia article identifier
- `url`: the Wikipedia article URL
- `title`: the title of the Wikipedia page
- `text`: the text of the article
- `title_vector` and `content_vector`: the embeddings calculated on the title and content of the Wikipedia article, respectively
- `vector_id`: the id of the vector
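Since the embeddings are stored in the CSV as JSON-encoded strings, a quick optional sanity check is to parse one and confirm it has the 1536 entries produced by the `text-embedding-ada-002` model:

```python
import json

# Parse one embedding and verify its dimension
sample_vector = json.loads(wikipedia_dataframe.loc[0, "title_vector"])
print(len(sample_vector))  # expected: 1536
```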
Define the OpenSearch mapping to store the OpenAI embeddings
To properly store and query all the fields included in the dataset, we need to define OpenSearch index settings and a mapping optimized for storing the information, including the embeddings. For this purpose we can define the settings and the mapping via:
```python
index_settings = {
    "index": {
        "knn": True,
        "knn.algo_param.ef_search": 100
    }
}

index_mapping = {
    "properties": {
        "title_vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss"
            }
        },
        "content_vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss"
            }
        },
        "text": {"type": "text"},
        "title": {"type": "text"},
        "url": {"type": "keyword"},
        "vector_id": {"type": "long"}
    }
}
```
The code above:
- Defines an index with `knn` search enabled. The k-nearest neighbors (k-NN) search scans a vector space (generated by embeddings) in order to retrieve the k closest vectors. You can read more in the OpenSearch k-NN documentation.
- Defines a mapping with:
  - `title_vector` and `content_vector` of type `knn_vector` and dimension `1536` (vectors with 1536 entries)
  - `text`, containing the article text as a `text` field
  - `title`, containing the article title as a `text` field
  - `url`, containing the article URL as a `keyword` field
  - `vector_id`, containing the id of the vector as a `long` field
With the settings and mappings defined, we can now create the `openai_wikipedia_index` index in Aiven for OpenSearch with:
```python
index_name = "openai_wikipedia_index"

client.indices.create(
    index=index_name,
    body={"settings": index_settings, "mappings": index_mapping}
)
```
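As an optional verification step (not in the original flow), you can confirm the index was created and inspect the mapping that OpenSearch registered:

```python
# Optional: confirm the index exists and inspect its mapping
print(client.indices.exists(index=index_name))  # True if creation succeeded
print(client.indices.get_mapping(index=index_name))
```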
Load data into OpenSearch
With the index created, the next step is to parse the pandas dataframe and load the data into OpenSearch using the Bulk APIs. The following generator function transforms the rows of the dataframe into bulk actions:
```python
import json

def dataframe_to_bulk_actions(df):
    for index, row in df.iterrows():
        yield {
            "_index": index_name,
            "_id": row['id'],
            "_source": {
                'url': row["url"],
                'title': row["title"],
                'text': row["text"],
                # The embeddings are stored as JSON-encoded strings in the CSV
                'title_vector': json.loads(row["title_vector"]),
                'content_vector': json.loads(row["content_vector"]),
                'vector_id': row["vector_id"]
            }
        }
```
To speed up ingestion, we can load the data in batches of `200` rows.
```python
from opensearchpy import helpers

start = 0
end = len(wikipedia_dataframe)
batch_size = 200

for batch_start in range(start, end, batch_size):
    batch_end = min(batch_start + batch_size, end)
    batch_dataframe = wikipedia_dataframe.iloc[batch_start:batch_end]
    actions = dataframe_to_bulk_actions(batch_dataframe)
    helpers.bulk(client, actions)
```
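As a quick optional verification, you can refresh the index and compare the document count against the number of rows in the dataframe:

```python
# Optional: make the new documents searchable and count them
client.indices.refresh(index=index_name)
doc_count = client.count(index=index_name)["count"]
print(f"{doc_count} documents indexed, {len(wikipedia_dataframe)} dataframe rows")
```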
Once all the documents are loaded, we can try a query to retrieve the documents containing `Pizza`:
```python
res = client.search(index=index_name, body={
    "_source": {
        "excludes": ["title_vector", "content_vector"]
    },
    "query": {
        "match": {
            "text": {
                "query": "Pizza"
            }
        }
    }
})

print(res["hits"]["hits"][0]["_source"]["text"])
```
The result is the Wikipedia article talking about `Pizza`:
```
Pizza is an Italian food that was created in Italy (The Naples area). It is made with different toppings. Some of the most common toppings are cheese, sausages, pepperoni, vegetables, tomatoes, spices and herbs and basil. These toppings are added over a piece of bread covered with sauce. The sauce is most often tomato-based, but butter-based sauces are used, too. The piece of bread is usually called a "pizza crust". Almost any kind of topping can be put over a pizza. The toppings used are different in different parts of the world. Pizza comes from Italy from Neapolitan cuisine. However, it has become popular in many parts of the world. History The origin of the word Pizza is uncertain. The food was invented in Naples about 200 years ago. It is the name for a special type of flatbread, made with special dough. The pizza enjoyed a second birth as it was taken to the United States in the late 19th century. ...
```
Encode chatbot questions with OpenAI text-embedding-ada-002 model
To perform a semantic search, we need to calculate question encodings with the same embedding model used to encode the documents at index time. In this example, that's the `text-embedding-ada-002` model.
```python
from openai import OpenAI

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-ada-002"

# Define the client
openaiclient = OpenAI(
    # This is the default and can be omitted
    api_key=os.getenv("OPENAI_API_KEY"),
)

# Define the question
question = 'is Pineapple a good ingredient for Pizza?'

# Create the embedding
question_embedding = openaiclient.embeddings.create(input=question, model=EMBEDDING_MODEL)
```
Run semantic search queries with OpenSearch
With the above embedding calculated, we can now run semantic searches against the OpenSearch index to retrieve the necessary context for the retrieval augmented generation. We use `knn` as the query type and search the content of the `content_vector` field.
```python
response = client.search(
    index=index_name,
    body={
        "size": 15,
        "query": {
            "knn": {
                "content_vector": {
                    "vector": question_embedding.data[0].embedding,
                    "k": 3
                }
            }
        }
    }
)

for result in response["hits"]["hits"]:
    print("Id:" + str(result['_id']))
    print("Score: " + str(result["_score"]))
    print("Title: " + str(result["_source"]["title"]))
    print("Text: " + result["_source"]["text"][0:100])
```
The result is the list of articles ranked by score:
```
Id:13967
Score: 13.94602
Title: Pizza
Text: Pizza is an Italian food that was created in Italy (The Naples area). It is made with different topp
Id:90918
Score: 13.754393
Title: Pizza Hut
Text: Pizza Hut is an American pizza restaurant, or pizza parlor. Pizza Hut also serves salads, pastas and
Id:66079
Score: 13.634726
Title: Pizza Pizza
Text: Pizza Pizza Limited (PPL), doing business as Pizza Pizza (), is a franchised Canadian pizza fast foo
Id:85932
Score: 11.388243
Title: Margarita
Text: Margarita may mean: The margarita, a cocktail made with tequila and triple sec Margarita Island, a
Id:13968
Score: 10.576359
Title: Pepperoni
Text: Pepperoni is a meat food that is sometimes sliced thin and put on pizza. It is a kind of salami, whi
Id:87088
Score: 9.424156
Title: Margherita of Savoy
...
```
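The introduction also mentioned mixed search. As a minimal sketch (not part of the original notebook), you can combine a lexical `match` clause with a `knn` clause inside a `bool` query, so documents that score well on either lexical or semantic similarity surface. Note that the two clauses produce scores on different scales, so production systems usually normalize them:

```python
# A sketch of a mixed (hybrid) query: lexical match + k-NN in one bool query
hybrid_response = client.search(
    index=index_name,
    body={
        "size": 5,
        "_source": {"excludes": ["title_vector", "content_vector"]},
        "query": {
            "bool": {
                "should": [
                    # Lexical relevance on the article text
                    {"match": {"text": {"query": question}}},
                    # Semantic similarity on the content embedding
                    {"knn": {"content_vector": {
                        "vector": question_embedding.data[0].embedding,
                        "k": 5
                    }}}
                ]
            }
        }
    }
)

for result in hybrid_response["hits"]["hits"]:
    print(str(result["_score"]) + " " + result["_source"]["title"])
```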
Use OpenAI Chat Completions API to generate a RAG reply
The step above retrieves the content semantically similar to the question. Now let's use the OpenAI chat completions API to generate a retrieval-augmented reply based on the information retrieved.
```python
# Retrieve the text of the first result in the above dataset
top_hit_summary = response['hits']['hits'][0]['_source']['text']

# Craft a reply
response = openaiclient.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",
         "content": "Answer the following question: " + question
                    + " by using the following text: " + top_hit_summary}
    ]
)

choices = response.choices

for choice in choices:
    print(choice.message.content)
```
The result is going to be similar to the following:

```
Pineapple is a contentious ingredient for pizza, and it is commonly used at Pizza Pizza Limited (PPL), a Canadian fast-food restaurant with locations throughout Ontario. The menu includes a variety of toppings such as pepperoni, pineapples, mushrooms, and other non-exotic produce. Pizza Pizza has been a staple in the area for over 30 years, with over 500 locations in Ontario and expanding across the nation, including recent openings in Montreal and British Columbia.
```
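To tie the steps together, here is a minimal end-to-end sketch (the helper name and prompt wording are illustrative, not part of the original tutorial) that embeds a question, retrieves the closest article, and asks the chat model for a grounded answer:

```python
def answer_with_rag(question: str) -> str:
    # Embed the question with the same model used at index time
    embedding = openaiclient.embeddings.create(
        input=question, model=EMBEDDING_MODEL
    ).data[0].embedding

    # Retrieve the most semantically similar article from OpenSearch
    hits = client.search(
        index=index_name,
        body={"size": 1,
              "query": {"knn": {"content_vector": {"vector": embedding, "k": 1}}}}
    )["hits"]["hits"]
    context = hits[0]["_source"]["text"] if hits else ""

    # Ask the chat model to answer using the retrieved context
    completion = openaiclient.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user",
             "content": "Answer the following question: " + question
                        + " by using the following text: " + context}
        ]
    )
    return completion.choices[0].message.content

print(answer_with_rag("Who invented the pizza Margherita?"))
```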
Conclusion
OpenSearch is a powerful tool providing both text and vector search capabilities. Used alongside OpenAI APIs, it allows you to craft personalized AI applications that augment the context with semantic search and return AI-augmented responses to queries. A logical next step would be to pair OpenSearch with another storage system to, for example, store the responses that your customers find useful and train the model further. Building an end-to-end system including a database like PostgreSQL® and a streaming integration with Apache Kafka® could provide the resiliency of a relational database and the hybrid search capability of OpenSearch, with data feeding in near real time.
You can try Aiven for OpenSearch, or any of the other Open Source tools, in the Aiven platform free trial by signing up.