
Semantic selectors for structured data

David de Matheu
November 14, 2023

As a follow-up to our blog on Spreadsheets + LLMs, today we are releasing tools that help streamline turning structured data into embeddings. These tools include data loaders for CSV and JSON data as well as semantic selectors that analyze the data and provide guidance on which fields should be embedded and which should be treated as metadata.

In this blog, we will showcase how semantic selectors work and share a tutorial to start using them. We will introduce neumai-tools, an open-source Python module that contains tools to pre-process documents, as well as the inclusion of semantic selectors inside the Neum AI pre-processing playground.

How Does It Work?

At a high level, it starts with the data in question. It might be a collection of JSON objects for a product listing or a CSV with customer reviews. Once we have the data, we then:

  1. Extract the full set of columns from it and example values for each
  2. Classify the columns into embeddable and metadata
  3. Output an array of columns for each purpose
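The first step above can be sketched roughly like this. This is a simplified illustration of extracting columns and example values from a CSV, not the actual neumai-tools code:

```python
import csv
import io

def extract_columns_with_examples(csv_text, max_examples=3):
    """Collect every column name plus a few example values for each.

    A minimal sketch of step 1; the real neumai-tools implementation
    may differ.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    examples = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if len(examples[name]) < max_examples:
                examples[name].append(value)
    return examples

sample = "name,price\nRunning Shoes,59.99\nTrail Boots,89.99\n"
print(extract_columns_with_examples(sample))
# {'name': ['Running Shoes', 'Trail Boots'], 'price': ['59.99', '89.99']}
```

The example values matter because a column name alone (e.g. `desc`) is often too terse for the classifier to judge its semantic content.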

Once we have the lists of embeddable and metadata columns, we provide them to the loader, which uses them to correctly map the fields and pass the right context to the embeddings engine. At the end of the process, we get vector embeddings for the columns that needed to be embedded, and the rest of the columns are attached to the vector as metadata.
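In simplified form, the mapping the loader performs looks something like this. The helper below is hypothetical, shown only to illustrate how one record is split into embeddable text and attached metadata:

```python
def split_record(record, embed_fields, metadata_fields):
    """Join embeddable fields into one text passage and keep the rest
    as metadata, mirroring (in simplified form) what the loaders do."""
    text = " ".join(str(record[f]) for f in embed_fields if f in record)
    metadata = {f: record[f] for f in metadata_fields if f in record}
    return {"text": text, "metadata": metadata}

product = {
    "name": "Trail Boots",
    "description": "Waterproof hiking boots",
    "price": 89.99,
    "rating": 4.6,
}
doc = split_record(product, ["name", "description"], ["price", "rating"])
print(doc)
# {'text': 'Trail Boots Waterproof hiking boots',
#  'metadata': {'price': 89.99, 'rating': 4.6}}
```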

Classifying columns for embeddings

The most crucial step in the process is the classification itself.

As part of it, each property and its example values are passed through an LLM-powered engine that analyzes the contents and decides whether it makes sense to embed them. The engine is configured to look for values that carry high semantic value and that would be matched by abstract queries.
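To make that concrete, a classification prompt for a single field might look like the sketch below. This is a hypothetical prompt we wrote for illustration, not the actual prompt used by the Neum AI engine:

```python
def build_classification_prompt(field_name, example_values):
    """Build a hypothetical prompt asking an LLM whether a field carries
    enough semantic meaning to embed. Illustrative only; not the actual
    Neum AI prompt."""
    examples = ", ".join(repr(v) for v in example_values)
    return (
        f"Field: {field_name}\n"
        f"Example values: {examples}\n"
        "Would abstract, meaning-based queries match this field? "
        "Answer EMBED if the values carry rich semantic content, "
        "or METADATA if they are better suited for filtering."
    )

print(build_classification_prompt("description", ["Waterproof hiking boots"]))
```

The response for each field is then collected into the two output arrays from step 3.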

Example Outputs

To give you a sense of what you can expect, let's consider a couple of examples:

Product Listing

For a given set of product listing properties, we want to extract the ones that have the most semantic relevance.

In this case, we are able to correctly identify that product name, description, and category have the most semantic relevance. Other properties like price or rating are not really useful from an embedding perspective, so we keep them as metadata instead.
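The resulting split is just two plain lists of column names, e.g. (column names here are illustrative, not real engine output):

```python
# Classification result for a hypothetical product listing: the two
# lists partition the columns, so no field appears in both.
classification = {
    "embed": ["product_name", "description", "category"],
    "metadata": ["price", "rating"],
}
assert set(classification["embed"]).isdisjoint(classification["metadata"])
print(classification)
```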

People Profile

For a person’s profile, we want to extract the information that can help us best search for them.

In this case, we correctly identify Name, Work Experience, Educational Background, and Skills as key areas that we would want to do semantic search across. Values like Email or LinkedIn URL make more sense as metadata. Fields like Hobbies fall into a gray area that could be relevant depending on the type of search we want to do.

Try it out in the playground

You can try out semantic selectors yourself directly in the Neum AI pre-processing playground by choosing them in the selectors section.

Integrating Semantic Selectors into your app

To get started using semantic selectors for embedding and metadata fields, we will need the neumai-tools module. We will also leverage openai for our underlying LLM.

```bash
pip install neumai-tools openai
```

Next, we will import fields_to_embed, fields_for_metadata, JSONLoader, and CSVLoader. These utilities will help us determine the right fields to embed and the right fields for metadata. Depending on the type of data you are using, you can choose between the JSON and CSV loaders. Then, choose a file that you want to analyze and pass its file_path into the methods for analysis.

```python
from neumai_tools import fields_to_embed, fields_for_metadata, JSONLoader, CSVLoader
import openai

openai.api_key = "API_KEY"
# openai.organization = "OPTIONAL_ORGANIZATION"

file_path = "INSERT PATH TO FILE"
loader_choice = "CSVLoader or JSONLoader"

to_embed = fields_to_embed(file_path=file_path, loader_choice=loader_choice)
to_metadata = fields_for_metadata(file_path=file_path, loader_choice=loader_choice)
```

Once we have the fields to_embed and the fields to_metadata, we can now pass those arrays of values into our loaders. Within the loader, the arrays will be used to select the correct values. Once you have the information loaded, you can pass it into text splitters or directly to embeddings.

```python
loader = JSONLoader(file_path=file_path, embed_keys=to_embed, metadata_keys=to_metadata)
# or
# loader = CSVLoader(file_path=temp_file, embed_keys=embed_keys, metadata_keys=metadata_keys)

documents = loader.load()
print(documents)

# You can now take the documents and pass them through your own text splitter
# or directly push them to be embedded.
```


Pre-processing continues to be a key step in creating great generative AI applications that are grounded in your own data. We believe that by leveraging intelligence we can simplify pre-processing while increasing the quality of the results at scale. Simple tools like the ones above can help steer you in that direction. Please share any feedback you might have as you try out these methods.

Outside of pre-processing, scaling data pipelines for vector embeddings continues to be a challenge. As you move past initial experimentation, check out Neum AI as a platform to help you scale your applications while keeping quality up and latency & cost down. Neum AI provides access through the platform to capabilities like context-aware text splitting and more. Stay tuned to our social media (Twitter and LinkedIn) and Discord for more updates.

