Retrieval evaluation with datasets

David de Matheu

December 6, 2023

Configuring RAG pipelines requires iteration across different parameters ranging from pre-processing loaders and chunkers, to the actual embedding model being used. To assist in testing different configurations, Neum AI provides several tools to test, evaluate and compare pipelines.

At the core of the Neum AI framework are pipelines, the abstraction that brings together data sources, pre-processing steps, embedding transformations and data ingestion. A pipeline can be treated as a unit that enables actions on top of it, including search and data retrieval. Therefore, when we talk about evaluation, we talk about it at the pipeline level.

We have designed tools that allow you to configure pipelines, evaluate a single pipeline against multiple sample datasets, and compare the performance of different pipeline configurations against a single dataset.

Prerequisites

We will start by installing some dependencies:

Neum AI framework


pip install neumai

Neum AI tools


pip install neumai-tools

To follow along with this blog, you will need credentials for the connectors used. For this example, we will use OpenAI to generate embeddings and Weaviate as our vector database.

OpenAI embeddings model, for which you will need an OpenAI API Key. To get an API Key, visit OpenAI. Make sure you have configured billing for the account.
Weaviate vector database, for which you will need a Weaviate Cloud Service URL and API Key. To get a URL and API Key, visit Weaviate Cloud Service.

Configure a simple pipeline

We will start with a pipeline configured using the Neum AI framework. The pipeline will extract data from a website, process it and drop it into a vector database. We will configure it using the OpenAI and Weaviate credentials. You can further customize and configure your desired pipeline using our components.


from neumai.DataConnectors import WebsiteConnector
from neumai.SinkConnectors import WeaviateSink
from neumai.EmbedConnectors import OpenAIEmbed
from neumai.Loaders.HTMLLoader import HTMLLoader
from neumai.Chunkers.RecursiveChunker import RecursiveChunker
from neumai.Sources import SourceConnector
from neumai.Shared import Selector
from neumai.Pipelines import Pipeline

website_connector =  WebsiteConnector(
    url = "https://www.neum.ai/post/retrieval-augmented-generation-at-scale",
    selector = Selector(
        to_metadata=['url']
    )
)

source = SourceConnector(
  data_connector = website_connector, 
  loader = HTMLLoader(), 
  chunker = RecursiveChunker()
)

openai_embed = OpenAIEmbed(
    api_key = "OPEN AI KEY",
)

weaviate_sink = WeaviateSink(
  url = "your-weaviate-url",
  api_key = "your-api-key",
  class_name = "your-class-name",
)

pipeline = Pipeline(
  sources=[source], 
  embed=openai_embed, 
  sink=weaviate_sink
)

Once we have the pipeline, we will run it for the first time to populate our vector database. The pipeline can be triggered again whenever the data needs to be updated (e.g. the website changes or, if using document stores, new documents are added).


print(f"Vectors stored: {pipeline.run()}")

Now that the pipeline has run and the vectors have been stored in our vector database, we can test it with a sample query:


result = pipeline.search(query="What is Celery used for?", number_of_results=1)[0]
print(f"Search Result: {result.metadata['text']}")

Evaluate the pipeline

We now have a populated vector database and a pipeline configuration. Next, we will run a dataset against the pipeline to see how it performs. To create a Dataset, we will use the built-in class and add some DatasetEntry objects. Each DatasetEntry is a test we will run against the pipeline. It contains the query and the expected output, so we can compare the expected output against what the pipeline retrieves.

Example DatasetEntry


DatasetEntry(
	id='1', 
	query="What is Retrieval Augmented Generation (RAG)?", 
	expected_output="The blog explains RAG as a method that helps in finding data quickly by performing searches in a 'natural way' and using that information to power more accurate AI applications"
)

As part of the Dataset we can establish the type of evaluation we want to use. We support two evaluation types, but you can add your own.

Cosine Evaluation: Compares the vector embeddings between the retrieved chunk and the expected output.
LLM Evaluation: Uses an LLM to check the quality and correctness of the retrieved information in answering the query at hand. (Requires you to set an OpenAI key as an environment variable: OPENAI_API_KEY)
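
To make the cosine score concrete, here is a minimal sketch of the metric itself in plain Python. It is not the CosineEvaluation implementation from neumai-tools, whose internals are not shown in this post; the function name and the toy three-dimensional vectors are assumptions made purely for illustration. In practice, the two vectors would be the embeddings of the retrieved chunk and of the expected output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (illustrative values only).
retrieved_chunk_vector = [0.2, 0.7, 0.1]
expected_output_vector = [0.25, 0.65, 0.05]

score = cosine_similarity(retrieved_chunk_vector, expected_output_vector)
print(f"Score: {score:.4f}")
```

A score close to 1 means the two vectors point in nearly the same direction, which is read as the retrieved chunk being semantically close to the expected output.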

For this demo we will use the CosineEvaluation, which yields a cosine similarity score for each evaluation:


from neumai_tools.DatasetEvaluation.Dataset import Dataset
from neumai_tools.DatasetEvaluation.DatasetUtils import DatasetEntry
from neumai_tools.DatasetEvaluation.Evaluation import CosineEvaluation
dataset = Dataset(name="Test 1", dataset_entries=[
    DatasetEntry(id='1', query="What is Retrieval Augmented Generation (RAG)?", expected_output="The blog explains RAG as a method that helps in finding data quickly by performing searches in a 'natural way' and using that information to power more accurate AI applications"),
    DatasetEntry(id='2', query="How does the RAG system function?", expected_output="It describes the process where data is extracted, processed, embedded, and stored in a vector database for fast semantic search lookup. This data is then used by AI applications for providing accurate responses based on user inputs"),
    DatasetEntry(id='3', query="What are the challenges in scaling RAG?", expected_output=" The blog discusses the challenges in ingesting and synchronizing large-scale text embeddings for RAG, including understanding the volume of data, ingestion time, search latency, cost, and the complexities of data embedding"),
    DatasetEntry(id='4', query="What specific technologies or programming languages are used in the development of Neum AI's RAG system?", expected_output="Neum AI is written in Python.")
], evaluation_type=CosineEvaluation)

Once the dataset is created, we can run it against a Pipeline or a PipelineCollection. This provides flexibility: if you are testing multiple pipeline configurations, you can see how each configuration performs against your dataset.


results = dataset.run_with_pipeline(pipeline=pipeline)

print(f'Dataset Result ID: {results.dataset_results_id}')
for result in results.dataset_results:
    print(f"For query: {result.dataset_entry.query} \n Expected Outcome: {result.dataset_entry.expected_output} \n Actual Result: {result.raw_result.metadata['text']} \n Score: {result.score}")

The result is a small report showcasing how your pipeline performed against a dataset:


Dataset Result ID: 1907ecf0-2ef2-47c6-a287-b08dd743c36c

For query: What is Retrieval Augmented Generation (RAG)?
Expected Outcome: The blog explains RAG as a method that helps in finding data quickly by performing searches in a 'natural way' and using that information to power more accurate AI applications
Actual Result: As we’ve shared in other blogs in the past, getting a Retrieval Augmented Generation (RAG) application started is pretty straightforward. The problem comes when trying to scale it and making it production-ready. In this blog we will go into some technical and architectural details of how we do this at Neum AI, specifically on how we did this for a pipeline syncing 1 billion vectors.First off, can you explain what RAG is to a 5 year old? - Thanks ChatGPT
Score: 0.8666629542920514

For query: How does the RAG system function?
Expected Outcome: It describes the process where data is extracted, processed, embedded, and stored in a vector database for fast semantic search lookup. This data is then used by AI applications for providing accurate responses based on user inputs
Actual Result: RAG helps finding data quickly by performing search in a “natural way” and use that information/knowledge to power a more accurate AI application that needs such information!This is what a typical RAG system looks likeData is extracted, processed, embedded and stored in a vector database for fast semantic search lookup
Score: 0.8942534447233399

For query: What are the challenges in scaling RAG?
Expected Outcome:  The blog discusses the challenges in ingesting and synchronizing large-scale text embeddings for RAG, including understanding the volume of data, ingestion time, search latency, cost, and the complexities of data embedding
Actual Result: As we’ve shared in other blogs in the past, getting a Retrieval Augmented Generation (RAG) application started is pretty straightforward. The problem comes when trying to scale it and making it production-ready. In this blog we will go into some technical and architectural details of how we do this at Neum AI, specifically on how we did this for a pipeline syncing 1 billion vectors.First off, can you explain what RAG is to a 5 year old? - Thanks ChatGPT
Score: 0.8438287364009266

Based on these results, we can modify parameters like the chunk size or overlap to try to improve our results. In addition to the CosineEvaluation, we can also use the LLMEvaluation, which reports on the relevancy and accuracy of the retrieved context and might provide further insight.
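
One way to act on this is to run the same dataset against several pipeline configurations and compare an aggregate score. The sketch below is a hypothetical helper, not part of neumai-tools. It assumes only what this post already shows: that dataset.run_with_pipeline(pipeline=...) returns an object whose dataset_results each carry a score. The function names and the configuration labels are our own.

```python
from statistics import mean


def compare_configurations(dataset, pipelines):
    """Run one dataset against several named pipelines.

    `pipelines` maps a label of your choosing to a pipeline object.
    Returns a dict of label -> mean score. Configurations that return
    no results are skipped rather than scored.
    """
    summary = {}
    for name, pipeline in pipelines.items():
        results = dataset.run_with_pipeline(pipeline=pipeline)
        scores = [result.score for result in results.dataset_results]
        if scores:
            summary[name] = mean(scores)
    return summary


def best_configuration(summary):
    """Return the label with the highest mean score."""
    return max(summary, key=summary.get)


# Usage sketch (hypothetical names): both pipelines are assumed to be built
# like the one earlier in this post, with different chunker settings, and
# to have been run once so their vectors are stored.
#
# summary = compare_configurations(
#     dataset,
#     {"small-chunks": pipeline_small, "large-chunks": pipeline_large},
# )
# print(summary, best_configuration(summary))
```

A mean hides per-query failures, so it is worth also reading the individual scores before settling on a configuration.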

Conclusion

This is our first iteration of adding evaluation to the framework. We see a lot of potential in this area to help developers build robust and scalable pipelines that return the right results. For evaluation itself, frameworks like RAGAS could be integrated to provide even more granular results. There is no need to reinvent the wheel, and this is something we will look into given the easy extensibility of the Neum framework.

The work doesn’t stop at evaluation; the next step is translating evaluation results into actions. This is a gap that we currently see and want to help address. Some ideas we are incubating include:

Semantic chunking which could be further improved by evaluation results.
Augmenting search results with missing information.

What other ideas should be part of the conversation?
