
LLM + Spreadsheets

David de Matheu
November 14, 2023
10 min read

Spreadsheets are a popular way to organize and store information, from financial reports to customer reviews. Every time we go on Reddit or Twitter, we see people asking about this topic, so we thought we would add our two cents. For most of this blog we will talk about CSVs as a common type of spreadsheet, but the concepts apply to other types of structured data, including JSON objects. Many of the capabilities discussed below are supported by Neum AI, where you can build robust and scalable data pipelines to process data into vector embeddings. Feel free to reach out with any questions at founders@tryneum.com or schedule time.

Connecting my spreadsheets to an LLM

Let's imagine you have a bunch of spreadsheets that contain key information about your business. On a regular basis, you build reports that help you visualize the data, and you feed those spreadsheets into existing ML models to crunch the numbers. LLMs have the potential to automate several of those actions, yielding more insights faster and with less hassle. But how LLMs can help depends on the type of data at hand and the desired output. On one hand, we might have spreadsheets full of numerical data like revenue numbers, growth percentages, or dates and times. On the other, we might have spreadsheets that contain content like user reviews or product listings. LLMs can serve different roles across these types of data, driven by the kinds of questions we want to answer from them.

Spreadsheets with numerical data

For these types of spreadsheets, questions might look like:

  • Who are the top five customers by revenue?
  • What was the growth of the company in the last month?
  • What customers decreased their spend in the last week in Spain?

At the same time, answering these questions requires a number of transformations, like sums, counts, and filters, that are very much aligned with traditional data analytics. For this use case, LLMs provide value in:

  1. Helping query the data. LLMs can help translate a question into a query that runs on top of the data. For example, "Who are the top five customers by revenue?" might translate to ordering customers by revenue and limiting the result to five.
  2. Providing answers with context. The queried data (up to the token limit) can be fed into an LLM prompt as a table that the model uses as context to answer questions.
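To make the query-translation idea concrete, here is a minimal sketch. The table, the rows, and the SQL string are all hypothetical; in practice an LLM would produce the query from the user's question, while here it is hard-coded for illustration:

```python
import sqlite3

# Hypothetical rows loaded from a revenue spreadsheet.
rows = [
    ("Acme", 120000), ("Globex", 95000), ("Initech", 87000),
    ("Umbrella", 64000), ("Stark", 51000), ("Wayne", 43000),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, revenue REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

# An LLM would translate "Who are the top five customers by revenue?"
# into a query like this one (hard-coded here for illustration):
query = "SELECT name FROM customers ORDER BY revenue DESC LIMIT 5"
top_five = [name for (name,) in conn.execute(query)]
print(top_five)  # the five highest-revenue customers, in order
```

The resulting table (or the rows it returns) can then be pasted into the LLM prompt as context for step 2.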

Spreadsheets with content

In the case of content, for example product listings or customer reviews, querying data in a traditional manner is more difficult, as the value of the content lies in its semantic nature. You can't use traditional queries alone to answer questions like:

  • Show me products that are similar to X?
  • Does product Y work well?
  • What are positive features that users find with our product?

These questions require us to match the question to the data based on their semantic representation. For example, "does product Y work well?" is not a Boolean flag we are likely to have in the data; instead, we need to search through user reviews to get a general sense of what they are telling us. LLMs really shine for this use case, as they enable capabilities that were otherwise hard to achieve with more traditional ML models.
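The core mechanic behind semantic matching is comparing embedding vectors, usually via cosine similarity. A toy sketch, assuming made-up 3-dimensional vectors (real embeddings have hundreds of dimensions and come from a model such as OpenAI's embedding endpoints):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, for illustration only).
query_vec = [0.9, 0.1, 0.0]          # "Does product Y work well?"
review_positive = [0.8, 0.2, 0.1]    # "Works great, very reliable."
review_offtopic = [0.0, 0.1, 0.9]    # "The shipping box was blue."

print(cosine_similarity(query_vec, review_positive))  # high (~0.98)
print(cosine_similarity(query_vec, review_offtopic))  # low (~0.01)
```

A vector database does exactly this comparison at scale, returning the stored vectors closest to the query vector.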

To enable this use case, we can use embeddings and vector databases to index the content within the spreadsheet so we can search on top of it. For the rest of this blog, we will focus on this use case 🙂.

What do I do with all these columns of data?

Let's imagine for a second that we have a spreadsheet that contains product listings:
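(The spreadsheet image does not survive in this version of the post; the first row below is reconstructed from the query result shown later, and the second row is hypothetical. The column names match the ones used in the code that follows.)

```csv
Product Name,Price,Rating,Description,Features
Fitness Tracker Smartwatch,149.99,4.5,"Achieve your fitness goals with our Fitness Tracker Smartwatch. Monitor your activity, heart rate, and receive notifications on your wrist.","- Heart rate monitor and activity tracking. - Built-in GPS for accurate workout tracking. - Long battery life."
Wireless Earbuds,89.99,4.2,"Enjoy your music anywhere with comfortable, noise-isolating wireless earbuds.","- Bluetooth 5.0 connectivity. - Charging case included."
```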

For each product listing, we have information including its name, price, rating, description, and features. From these listings, we will focus our attention on the description and features. These columns contain the most semantically relevant information, which we want to match user queries against. Columns like price and rating provide almost no semantic meaning, so we will simply use them as metadata for our vectors.

The end result is that we will organize our data into a vector database with:

  1. A vector of the content found in the semantically relevant columns (i.e. description and features in vector form)
  2. Some metadata associated with the vector that we can retrieve along with it and pass as context to the application (i.e. name, price, rating, and the description and features text)

Example vector:

```json
{
  "id": "listing1_description",
  "vector": [...],
  "metadata": {
    "name": "...",
    "price": "...",
    "rating": "...",
    "description": "...",
    "features": "..."
  }
}
```

How would it work?

Let's look at code to put this theory into action. We will use Langchain to help us with text splitting and loading into vector databases. We will use Chroma as our vector database running locally, but this can be swapped for any other hosted vector database you might like. You can download the sample listings CSV that we will use here.

Ingesting the data into vector databases

We will start by importing the tools we will use, including CharacterTextSplitter, OpenAIEmbeddings, and Chroma.

```python
import csv

from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Define the columns we want to embed vs. which ones we want in metadata
columns_to_embed = ["Description", "Features"]
columns_to_metadata = ["Product Name", "Price", "Rating", "Description", "Features"]
```

Next, we will open the CSV file with our test listings. For each row, we will separate the columns we will use for embeddings from those we will use as metadata. We will use the Langchain Document construct to organize our content and metadata.

```python
# Process the CSV into the embeddable content vs. the metadata and put it
# into Document format so that we can chunk it into pieces.
docs = []
with open("TestListings.csv", newline="", encoding="utf-8-sig") as csvfile:
    csv_reader = csv.DictReader(csvfile)
    for i, row in enumerate(csv_reader):
        to_metadata = {col: row[col] for col in columns_to_metadata if col in row}
        values_to_embed = {k: row[k] for k in columns_to_embed if k in row}
        to_embed = "\n".join(f"{k.strip()}: {v.strip()}" for k, v in values_to_embed.items())
        newDoc = Document(page_content=to_embed, metadata=to_metadata)
        docs.append(newDoc)
```

Once we have the documents created, we will use the Langchain CharacterTextSplitter to break them down further into chunks.

```python
# Let's split the documents using character splitting.
splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,
    chunk_overlap=0,
    length_function=len,
)
documents = splitter.split_documents(docs)
```

Now that we have the chunks, we will generate the embeddings and insert the values into Chroma. Each vector inserted will have both the vector representation that will be used for similarity search as well as the metadata values we added.

```python
# Generate embeddings from documents and store them in a vector database
embeddings_model = OpenAIEmbeddings()
db = Chroma.from_documents(documents, embeddings_model)
```

Finally, we can query the data and even pass in filters to leverage the metadata we have associated.

```python
# Query the vector database for information.
query = "Heart rate monitor"
docs = db.similarity_search(query)
print(docs[0].page_content)
print(docs[0].metadata)
```

Advanced Querying

To really take advantage of the metadata we generated, we can go further and leverage the Langchain SelfQueryRetriever. We can easily define a schema for the metadata and have an LLM use it to generate filters.

```python
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Metadata schema based on the values in the CSV
metadata_field_info = [
    AttributeInfo(
        name="Product Name",
        description="Name of the product",
        type="string",
    ),
    AttributeInfo(
        name="Price",
        description="The price of the product as a number. Ex. 149.99",
        type="string",
    ),
    AttributeInfo(
        name="Rating",
        description="The rating of the product as a number from 0 to 5. Ex. 4.5",
        type="string",
    ),
    AttributeInfo(
        name="Description",
        description="Description of the product",
        type="string",
    ),
    AttributeInfo(
        name="Features",
        description="Features of the product",
        type="string",
    ),
]
document_content_description = "Product listing"

# Configure the retriever
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    db,
    document_content_description,
    metadata_field_info,
    verbose=True,
)

# Retrieve values
retriever.get_relevant_documents("good heart monitor")

# The result is:
# Description: Achieve your fitness goals with our Fitness Tracker Smartwatch.
# Monitor your activity, heart rate, and receive notifications on your wrist.
# Features: - Heart rate monitor and activity tracking.
# - Built-in GPS for accurate workout tracking.
# - Sleep analysis and guided breathing exercises.
# - Receive notifications from your smartphone.
# - Water-resistant for workouts and everyday use.
# - Long battery life.
# metadata={
#   'Product Name': 'Fitness Tracker Smartwatch',
#   'Price': '149.99',
#   'Rating': '4.5',
#   'Description': 'Achieve your fitness goals with our Fitness Tracker Smartwatch.
#    Monitor your activity, heart rate, and receive notifications on your wrist.',
#   'Features': '- Heart rate monitor and activity tracking. - Built-in GPS for
#    accurate workout tracking. - Sleep analysis and guided breathing exercises.
#    - Receive notifications from your smartphone. - Water-resistant for workouts
#    and everyday use. - Long battery life.'
# }
```

Scaling up the process

At this point, we have a basic version of the workflow working, but you might be asking yourself how you would productize it further. Maybe you have hundreds or thousands of CSVs that you want to process, or you have CSVs that are continuously being updated and want to ensure that your vectors are kept up to date. Also at retrieval, you might be looking at ways of automating the metadata schema creation and maintenance process. These same questions might apply to other data types like JSONs.
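One generic way to keep vectors in sync with a changing CSV (a sketch of a common pattern, not Neum AI's specific implementation) is to store a content hash per row and re-embed only the rows whose hash changed since the last run:

```python
import hashlib

def row_fingerprint(row: dict) -> str:
    """Stable hash of a row's contents, used to detect changes between syncs."""
    canonical = "\x1f".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Fingerprints saved from the previous sync run, keyed by a stable row id.
previous = {
    "listing1": row_fingerprint({"Product Name": "Smartwatch", "Price": "149.99"}),
}

# The same row after the CSV was updated (the price changed).
current_rows = {
    "listing1": {"Product Name": "Smartwatch", "Price": "129.99"},
}

# Only changed (or new) rows need to be re-embedded and upserted.
to_reembed = [
    row_id for row_id, row in current_rows.items()
    if previous.get(row_id) != row_fingerprint(row)
]
print(to_reembed)
```

This avoids paying for embedding calls on unchanged rows; deletions can be detected the same way, by looking for ids present in `previous` but missing from the current CSV.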

If these are questions you are trying to figure out, Neum AI can help. We have built a scalable and robust platform that helps process your data into vector embeddings inside of a vector database and keep them up-to-date even as your data source is changing. When it is time to retrieve, we provide a schema automatically that you can use yourself or plug into Langchain SelfQueryRetriever. See it all in action in this video.

If you would like to learn more head over to Neum AI or schedule time with me to show you a demo.

