Configuring RAG pipelines requires iterating across parameters, from pre-processing loaders and chunkers to the embedding model itself. To assist in testing different configurations, Neum AI provides several tools to test, evaluate, and compare pipelines.
At the core of the Neum AI framework are pipelines: the abstraction that brings together data sources, pre-processing steps, embedding transformations, and data ingestion. A pipeline can be treated as a unit that enables actions on top of it, including search and data retrieval. Therefore, when we talk about evaluation, we talk about it at the pipeline level.
We have designed tools that let you configure pipelines, evaluate a single pipeline against multiple sample datasets, and compare the performance of different pipeline configurations against a single dataset.
We will start by installing some dependencies:
- Neum AI framework
- Neum AI tools
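Assuming the framework and tools are published on PyPI under the names below (worth confirming against the Neum AI docs), installation is two commands:

```
pip install neumai
pip install neumai-tools
```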
To follow along with this blog, you will need credentials for the connectors used. For this example, we will be using OpenAI to run embeddings and Weaviate as our vector database.
- OpenAI embeddings model, for which you will need an OpenAI API Key. To get an API Key, visit OpenAI. Make sure you have configured billing for the account.
- Weaviate vector database, for which you will need a Weaviate Cloud Service URL and API Key. To get a URL and API Key, visit Weaviate Cloud Service.
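If you want to keep credentials out of the pipeline code, you can stage them as environment variables before wiring things up. The variable names below are placeholders of our choosing, except for OPENAI_API_KEY, which the LLM-based evaluation described later expects to be set:

```python
import os

# Replace the placeholders with your own credentials.
os.environ["OPENAI_API_KEY"] = "<your OpenAI API key>"
os.environ["WEAVIATE_URL"] = "<your Weaviate Cloud Service URL>"
os.environ["WEAVIATE_API_KEY"] = "<your Weaviate API key>"
```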
Configure a simple pipeline
We will start with a pipeline configured using the Neum AI framework. The pipeline will extract data from a website, process it and drop it into a vector database. We will configure it using the OpenAI and Weaviate credentials. You can further customize and configure your desired pipeline using our components.
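A sketch of such a pipeline using the framework's website connector, HTML loader, recursive chunker, OpenAI embed connector, and Weaviate sink. The import paths, parameter names, and the example URL and class name reflect one version of the package and are assumptions to check against the current docs:

```python
import os

from neumai.DataConnectors.WebsiteConnector import WebsiteConnector
from neumai.Shared.Selector import Selector
from neumai.Loaders.HTMLLoader import HTMLLoader
from neumai.Chunkers.RecursiveChunker import RecursiveChunker
from neumai.Sources.SourceConnector import SourceConnector
from neumai.EmbedConnectors import OpenAIEmbed
from neumai.SinkConnectors import WeaviateSink
from neumai.Pipelines import Pipeline

# Data source: extract the contents of a website (example URL).
website_connector = WebsiteConnector(
    url="https://www.neum.ai/post/retrieval-augmented-generation-at-scale",
    selector=Selector(to_metadata=["url"]),
)

# Pre-processing: load the HTML and split it into chunks.
source = SourceConnector(
    data_connector=website_connector,
    loader=HTMLLoader(),
    chunker=RecursiveChunker(chunk_size=500, chunk_overlap=50),
)

# Embedding transformation: OpenAI embeddings.
openai_embed = OpenAIEmbed(api_key=os.environ["OPENAI_API_KEY"])

# Data ingestion: Weaviate vector database.
weaviate_sink = WeaviateSink(
    url=os.environ["WEAVIATE_URL"],
    api_key=os.environ["WEAVIATE_API_KEY"],
    class_name="NeumDemo",  # hypothetical collection/class name
)

pipeline = Pipeline(
    sources=[source],
    embed=openai_embed,
    sink=weaviate_sink,
)
```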
Once we have the pipeline, we will run it for the first time to populate our vector database. The pipeline can be triggered again whenever the data needs to be updated (e.g. the website is updated or, if using document stores, new documents are added).
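In code, that first run is a single call. This assumes `run()` returns the number of vectors written, as in the framework's examples; the print is just for visibility:

```python
# First run: extract, pre-process, embed, and store the vectors.
vectors_written = pipeline.run()
print(f"Vectors stored: {vectors_written}")
```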
Now that the pipeline ran and the vectors have been stored in our vector database, we can test it with a sample query:
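For instance, something along these lines, where the query and result count are arbitrary:

```python
# Retrieve the top results for a sample query.
results = pipeline.search(
    query="What is Retrieval Augmented Generation?",
    number_of_results=3,
)
for result in results:
    print(result)
```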
Evaluate the pipeline
We now have a populated vector database and a pipeline configuration. Next, we will run a dataset against it to see how it performs. To create a Dataset we will use the built-in class and add some DatasetEntry objects. Each DatasetEntry is a test we will run against the pipeline: it contains the query and the expected output, so we can compare them against what the pipeline retrieves.
As part of the Dataset we can establish the type of evaluation we want to use. We support two evaluation types, but you can add your own.
- Cosine Evaluation: Compares the vector embeddings between the retrieved chunk and the expected output.
- LLM Evaluation: Uses an LLM to check the quality and correctness of the retrieved information in answering the query at hand. (Requires you to set an OpenAI key as an environment variable: OPENAI_API_KEY)
For this demo we will use the CosineEvaluation, which yields a cosine similarity score for the evaluation:
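A sketch of a small dataset wired up with the cosine evaluation. The import paths and field names below are assumptions about how neumai-tools organizes the Dataset, DatasetEntry, and CosineEvaluation classes, and the queries and expected outputs are made-up examples:

```python
from neumai_tools.DatasetEvaluation.Dataset import Dataset, DatasetEntry
from neumai_tools.DatasetEvaluation.Evaluation import CosineEvaluation

dataset = Dataset(
    name="rag_basics",  # hypothetical dataset name
    dataset_entries=[
        DatasetEntry(
            id="1",
            query="What is Retrieval Augmented Generation?",
            expected_output="RAG retrieves relevant context from a data store and passes it to an LLM to ground its answer.",
        ),
        DatasetEntry(
            id="2",
            query="Why does chunking matter in a RAG pipeline?",
            expected_output="Chunk size and overlap control how much relevant context fits into each retrieved passage.",
        ),
    ],
    evaluation_type=CosineEvaluation,
)
```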
Once the dataset is created, we can run it against a Pipeline or a PipelineCollection. This provides flexibility: if you are testing multiple pipeline configurations, you can see how each configuration performs against your dataset.
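Running the dataset against the pipeline configured earlier might look like the following; the method names here are assumptions to verify against the tools package:

```python
# Evaluate a single pipeline configuration against the dataset.
dataset_results = dataset.run_with_pipeline(pipeline=pipeline)

# To compare several configurations, group them in a collection instead
# (hypothetical usage).
# from neumai.Pipelines import PipelineCollection
# collection = PipelineCollection(pipelines=[pipeline, alternative_pipeline])
# collection_results = dataset.run_with_pipeline_collection(pipeline_collection=collection)
```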
The result is a small report showcasing how your pipeline performed against a dataset:
Based on these results, we can modify parameters like the chunk size or overlap to try to improve our results. In addition to the CosineEvaluation, we can also use the LLMEvaluation to measure the relevancy and accuracy of the retrieved context, which can provide further insight.
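For example, building on the snippets above, a second configuration with larger chunks and an LLM-based evaluation could be sketched as follows (again, class and field names are assumptions to check against the package):

```python
from neumai_tools.DatasetEvaluation.Evaluation import LLMEvaluation

# A second source with different chunking parameters, reusing the
# website connector, loader, embed connector, and sink defined above.
alternative_source = SourceConnector(
    data_connector=website_connector,
    loader=HTMLLoader(),
    chunker=RecursiveChunker(chunk_size=1000, chunk_overlap=100),
)
alternative_pipeline = Pipeline(
    sources=[alternative_source],
    embed=openai_embed,
    sink=weaviate_sink,
)

# Switch the dataset to the LLM-based evaluation.
# Requires OPENAI_API_KEY to be set as an environment variable.
dataset.evaluation_type = LLMEvaluation
```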
This is our first iteration on adding evaluation to the framework. It is an area where we see a ton of potential in helping developers build robust and scalable pipelines that provide the right results. When it comes to evaluation itself, frameworks like RAGAS could be integrated to provide even more granular results. There is no need to re-invent the wheel, and this is something we will look into given the easy extensibility the Neum framework provides.
The work doesn’t stop at evaluation; it extends to translating evaluation results into actions. This is currently a gap that we see and want to help address. Some ideas we are incubating include:
- Semantic chunking which could be further improved by evaluation results.
- Augmenting search results with missing information.
What other ideas should be part of the conversation?