As a follow-up to our blog on Spreadsheets + LLMs, today we are releasing tools that help streamline the process of turning structured data into embeddings. These tools include data loaders for CSV and JSON data, as well as semantic selectors that analyze the data and provide guidance on which fields should be embedded and which should be treated as metadata.
In this blog, we will showcase how semantic selectors work and share a tutorial to get you started. We will introduce neumai-tools, an open-source Python module that contains tools to pre-process documents, and cover the inclusion of semantic selectors inside the Neum AI pre-processing playground.
How Does It Work
At a high level, it starts with the data in question. It might be a collection of JSON objects for a product listing or a CSV with customer reviews. Once we have the data we then:
- Extract the full set of columns from it and example values for each
- Classify the columns into embeddable and metadata
- Output an array of columns for each purpose
Once we have the lists of embeddable and metadata columns, we provide them to the loader, which uses them to correctly map the columns and pass the right context to the embeddings engine. At the end of the process, we get vector embeddings for the columns that needed to be embedded, while the rest of the columns are attached to each vector as metadata.
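The flow above can be sketched in a few lines of plain Python. The helper names here are hypothetical and do not reflect the actual neumai-tools API; the point is only to show how the two column lists drive the mapping:

```python
# Illustrative sketch of the overall flow; helper names are
# hypothetical and not the actual neumai-tools API.

def extract_columns(records):
    """Extract the full set of columns with one example value for each."""
    examples = {}
    for record in records:
        for column, value in record.items():
            examples.setdefault(column, value)
    return examples

def split_record(record, embeddable, metadata):
    """Map one record into text to embed plus attached metadata."""
    text = " ".join(str(record[c]) for c in embeddable if c in record)
    meta = {c: record[c] for c in metadata if c in record}
    return {"text": text, "metadata": meta}

records = [{"name": "Trail Runner", "description": "Lightweight shoe", "price": 89.99}]
columns = extract_columns(records)

# In practice, the embeddable/metadata split comes from the LLM classifier.
embeddable = ["name", "description"]
metadata = [c for c in columns if c not in embeddable]  # ["price"]

documents = [split_record(r, embeddable, metadata) for r in records]
print(documents[0])
# {'text': 'Trail Runner Lightweight shoe', 'metadata': {'price': 89.99}}
```

The `text` field is what gets sent to the embeddings engine; the `metadata` dict rides along with the resulting vector.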
Classifying columns for embeddings
The most crucial step in the process is the classification itself.
As part of it, each property and an example value are passed through an LLM-powered engine that analyzes the contents and decides whether it makes sense to embed them or not. The engine is configured to look for values that carry high semantic value and that would be referenced in abstract queries.
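One way to implement such an engine is to prompt the LLM with each column name and an example value and ask for a one-word verdict. The prompt below is a hypothetical sketch of that idea, not the prompt neumai-tools actually uses:

```python
# Hypothetical prompt construction for the classification step; the
# real engine inside neumai-tools may phrase this differently.

def build_classification_prompt(column, example_value):
    return (
        "You are classifying dataset columns for a vector search pipeline.\n"
        f"Column: {column}\n"
        f"Example value: {example_value}\n"
        "Answer EMBED if the value carries rich semantic meaning that users "
        "might reference in abstract natural-language queries, or METADATA "
        "if it is better used for filtering (ids, prices, URLs, ratings).\n"
        "Answer with exactly one word: EMBED or METADATA."
    )

prompt = build_classification_prompt("description", "Lightweight trail shoe")
print(prompt.splitlines()[1])  # Column: description
```

The one-word answer constraint keeps the response easy to parse into the two output arrays.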
To give you a sense of what you can expect, let's consider a couple of examples:
For a given set of product listing properties, we want to extract the ones that have the most semantic relevance.
In this case, we are correctly able to identify that product name, description, and category have the most semantic relevance. Other properties like price or rating are not very useful from an embedding perspective, so we keep them as metadata instead.
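As a concrete illustration, here is what that split might look like for a sample record. Both the field names and the resulting split are made up for this example, not actual engine output:

```python
# Hypothetical product listing; field names and the split are illustrative.
product = {
    "product_name": "Trail Runner 2",
    "description": "A lightweight, waterproof trail running shoe.",
    "category": "Footwear",
    "price": 89.99,
    "rating": 4.6,
}

# The split the classifier would be expected to produce for this record.
embeddable = ["product_name", "description", "category"]
metadata = ["price", "rating"]

# Every column ends up in exactly one of the two buckets.
assert set(embeddable) | set(metadata) == set(product)
assert not set(embeddable) & set(metadata)
```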
For a person’s profile, we want to extract the information that can help us best search for them.
In this case, we correctly identify Name, Work Experience, Educational Background, and Skills as the key areas we would want to do semantic search across. Values like Email or LinkedIn URL make more sense as metadata. Fields like Hobbies fall into a potential gray area that could be relevant depending on the type of search we want to do.
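Sketched as data, the profile split looks like this. The labels are illustrative assumptions, not output from the semantic selector:

```python
# Illustrative classification for a person's profile; labels are
# assumptions, not actual output from the semantic selector.
profile_fields = {
    "Name": "embed",
    "Work Experience": "embed",
    "Educational Background": "embed",
    "Skills": "embed",
    "Email": "metadata",
    "LinkedIn URL": "metadata",
    "Hobbies": "gray area",  # embed only for interest-based search
}

embeddable = [f for f, label in profile_fields.items() if label == "embed"]
print(embeddable)  # ['Name', 'Work Experience', 'Educational Background', 'Skills']
```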
Try it out in the playground
You can try out the semantic selectors yourself directly on the Neum AI pre-processing playground by choosing it in the selectors section:
Integrating Semantic Selectors into your app
To get started with using semantic selectors for embeddings and metadata fields, we will need the neumai-tools module. We will also leverage openai for our underlying LLM.
Next, we will import fields_to_embed, fields_for_metadata, JSONLoader, and CSVLoader. These utilities will help us get the right fields to embed and the right fields for metadata. Depending on the type of data you are using, you can choose between the JSON and CSV loaders. Then, choose a file that you want to analyze and pass its file_path into the methods for analysis.
Once we have the fields to embed and the fields for metadata, we can pass those arrays of values into our loaders. Within the loader, the arrays are used to select the correct values. Once you have the information loaded, you can pass it into text splitters or directly to embeddings.
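Inside the loader, the two arrays drive a simple selection over each row. A minimal pure-Python equivalent of that selection for CSV data (a sketch of the behavior, not the actual CSVLoader code) looks like this:

```python
# Minimal sketch of how a CSV loader can use the two field arrays;
# this is illustrative, not the actual neumai-tools CSVLoader.
import csv
import io

def load_rows(csv_text, fields_to_embed, fields_to_metadata):
    """Split each CSV row into embeddable text and a metadata dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row in reader:
        content = " ".join(row[f] for f in fields_to_embed)
        meta = {f: row[f] for f in fields_to_metadata}
        docs.append({"content": content, "metadata": meta})
    return docs

csv_text = "name,description,price\nTrail Runner,Lightweight shoe,89.99\n"
docs = load_rows(csv_text, ["name", "description"], ["price"])
print(docs[0]["content"])   # Trail Runner Lightweight shoe
print(docs[0]["metadata"])  # {'price': '89.99'}
```

Each resulting document carries the concatenated embeddable text as its content and the remaining fields as metadata, ready for a text splitter or an embeddings call.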
Pre-processing continues to be a key step in creating great generative AI applications that are grounded on your own data. We believe that by leveraging intelligence we can simplify pre-processing while increasing the quality of the results at scale. Simple tools like the ones above can help steer you in that direction. Please share any feedback you might have as you try out these methods.
Outside of pre-processing, scaling data pipelines for vector embeddings continues to be a challenge. As you move past initial experimentation, check out Neum AI as a platform to help you scale your applications while keeping the quality up and the latency and cost down. Neum AI provides access through the platform to capabilities like context-aware text splitting and more. Stay tuned to our social media (Twitter and LinkedIn) and Discord for more updates.