Q&A with thousands of documents

David de Matheu
November 14, 2023

Probably the most common use case for LLMs today is a chatbot that lets you do Q&A with a document. Using tools like Langchain and Pinecone, it is relatively straightforward to support. This blog is not meant as a full step-by-step guide, as there are a ton of great tutorials around. Instead, it is meant to be a thought piece outlining what it takes to support such a scenario at different complexity levels: from getting started with a prototype app, to Q&A with a single document, to building scalable and robust apps that allow Q&A with thousands of documents.

Quick Intro

From a thousand feet up, there are five main components to architecting such a solution:

  1. A data store, like Azure Blob Storage or a local folder on your computer, where the documents are stored. We will be connecting to this location to extract the documents.
  2. A chunking transformation that will help us break apart the documents into smaller pieces of context that we can reference. Chunks might be as small as a word or as large as the whole document. Langchain has a great collection of text splitters we can start with.
  3. An embedding model to vectorize our information. Vectorizing translates the text into a semantic representation that we can compare against queries to find context that is semantically similar. OpenAI has the most common model to get us started: text-embedding-ada-002.
  4. A vector store like Pinecone, Weaviate or Qdrant where we will store the generated vectors. The vector store keeps the vectors organized and allows us to quickly query and retrieve them.
  5. A prompt that combines the extracted context with instructions to answer the user's questions.
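
To make the five components concrete, here is a minimal, dependency-free sketch of how they fit together. It is an illustration under loud assumptions, not the Langchain and Pinecone setup described in this post: the "embedding" is a plain bag-of-words vector standing in for a real model, the vector store is an in-memory list, and every name in it (`chunk`, `embed`, `InMemoryVectorStore`, `build_prompt`) is hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Component 2: split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> dict[str, float]:
    """Component 3 stand-in: a normalized bag-of-words vector.

    A real pipeline would call an embedding model here instead.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


class InMemoryVectorStore:
    """Component 4 stand-in: brute-force similarity search over a list."""

    def __init__(self) -> None:
        self.rows: list[tuple[dict[str, float], str]] = []

    def upsert(self, chunks: list[str]) -> None:
        self.rows.extend((embed(c), c) for c in chunks)

    def query(self, question: str, top_k: int = 2) -> list[str]:
        q = embed(question)
        ranked = sorted(self.rows, key=lambda row: -cosine(q, row[0]))
        return [text for _, text in ranked[:top_k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Component 5: combine retrieved context with instructions."""
    joined = "\n---\n".join(context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )


# Component 1 stand-in: documents would normally be read from a data store.
documents = {
    "billing.txt": "Invoices are issued on the first day of each month. "
                   "Refunds take five business days.",
    "shipping.txt": "Orders ship from the Seattle warehouse. "
                    "Standard shipping takes three to seven days.",
}

store = InMemoryVectorStore()
for text in documents.values():
    store.upsert(chunk(text))

question = "How long do refunds take?"
prompt = build_prompt(question, store.query(question, top_k=1))
print(prompt)  # this string is what would be sent to the LLM
```

Swapping any stand-in for a real service changes only that one function or class, which is the main reason to keep the five components separate.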

Starting with the basics

At its most fundamental level, supporting such a scenario can be done in 10 lines of code, plus 30 minutes of preparation setting up an OpenAI account and a Pinecone index (all for free with credits). Leveraging Langchain, we can access loaders for our document stores and text splitters. Langchain will also help us craft our prompt using retrievers, chains or agents.

The code above is not meant to be the best practice for accomplishing our goal, but it provides a simple “getting started” experience. As an outcome, you are able to push a document of your choice and have it embedded and synced into a vector store so that it is ready to query. Cool! Many of you might look at this and say, “Yeah, but that is just a basic app. How do I support my use case?” So let's go deeper.

How do we scale this to thousands of documents?

As we think about evolving from a simple prototype to a scalable and robust application, seven main problems come to mind:

  1. My data lives in a variety of sources, each with a different structure, cadence, API, etc.
  2. My data is constantly changing. How do I keep my vectors up to date?
  3. I need to scale my architecture to handle thousands of documents in parallel so that new data is available in a timely fashion.
  4. Not all the data I want to process is text. I want to process tabular data, images, video, audio and more.
  5. When I chunk my data, the context gets split across different chunks.
  6. The embeddings miss some of the context and meaning in my text (especially if I input domain-specific text).
  7. I need to ensure that the right context is retrieved from the vector database using semantic search.
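
As a small illustration of problem 5, one common mitigation (an assumption on our part, not something covered in detail in this post) is to let neighboring chunks overlap, so that a sentence cut at a chunk boundary still appears whole in at least one chunk. Langchain's text splitters expose this idea through a `chunk_overlap` parameter; the sketch below is a simplified, word-based version, and the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word windows where each window repeats the last
    `overlap` words of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(12))
for piece in chunk_with_overlap(text, chunk_size=5, overlap=2):
    print(piece)
```

The trade-off is storage and cost: a larger overlap means more chunks and more embedding calls for the same document.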

These problems have been validated over and over by every customer that we talk to. Many started like we did above, with a 10-line application that felt magical, but that will not cut it as they look to scale and extend their solution. These problems are not small by any means and will require collaboration across a large number of stakeholders to ensure a magical end-to-end experience.

At Neum AI, we have focused on a few of the problems above, specifically problems 1, 2 and 3. In the next section, we will explore how we think we can solve problem #1.

Building data pipelines for LLM developers

By no means did Neum AI invent data pipelines, nor do we claim to be the best solution out there for building them (hello Fivetran, Azure Data Factory, AWS Glue, Airbyte). We have taken a special focus on helping build data pipelines that are specific to LLM developers, starting with problem number one: bringing data from across a number of data sources into vector stores.

Neum AI sits beneath many of the steps required to build LLM apps, with an initial focus on connecting data from document stores to vector stores. In the process of connecting that data, Neum chunks and embeds it. When it comes to chunking and embedding today, Neum AI is pretty generic, leveraging out-of-the-box solutions like Langchain, NLTK and OpenAI. The initial value proposition is really about moving data automatically and at scale: connecting data sources that hold thousands of documents and having those documents land safely in a vector store. That is how we think we can help developers get to the goal of doing Q&A with thousands of documents.
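
To illustrate what "at scale" can mean in practice, here is a minimal sketch of processing many documents concurrently while isolating failures. It is not Neum AI's implementation. The function names are hypothetical, and `process_document` only chunks the text, with the embedding and vector store writes left out.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_document(doc_id: str, text: str) -> int:
    """Hypothetical per-document work: chunk, embed, and upsert.

    Only the chunking is real here. Returns the number of chunks produced.
    """
    if not text.strip():
        raise ValueError(f"{doc_id} is empty")
    return len([text[i:i + 200] for i in range(0, len(text), 200)])


def ingest(
    documents: dict[str, str], max_workers: int = 8
) -> tuple[dict[str, int], dict[str, str]]:
    """Process documents concurrently and collect per-document outcomes."""
    succeeded: dict[str, int] = {}
    failed: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(process_document, doc_id, text): doc_id
            for doc_id, text in documents.items()
        }
        for future in as_completed(futures):
            doc_id = futures[future]
            try:
                succeeded[doc_id] = future.result()
            except Exception as exc:  # one bad document should not stop the run
                failed[doc_id] = str(exc)
    return succeeded, failed


documents = {f"doc-{i}": "x" * 450 for i in range(1000)}
documents["doc-empty"] = "   "
succeeded, failed = ingest(documents)
print(len(succeeded), "ingested;", len(failed), "failed")
```

Threads fit this workload because most of the time in a real pipeline is spent waiting on network calls to the embedding provider and the vector store. Recording failures per document makes it possible to retry only what broke.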

So, how does Neum work?

For a step-by-step tutorial showing how to build a Q&A-with-documents pipeline using Neum AI, see our documentation.

To get started, Neum AI allows you to pick from a variety of document stores, embedding providers and vector stores to configure your pipeline. We are continually adding more to ensure you can access your data wherever it is. This lets you bring in your documents, wherever they live, and sync them into your vector store of choice.

Crafting data pipelines can be an anxiety-inducing endeavor, but it is a key requirement for processing thousands of documents at scale without running the risk of your documents falling out of date or of losing data. With Neum, you can use our out-of-the-box UI or APIs to configure a robust and scalable pipeline. It supports scheduling, so Neum pulls any new data available in your data source at a given cadence to ensure your vector stores always have the latest content.
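
As a rough illustration of what a scheduled run has to decide, the sketch below compares the current contents of a source against fingerprints recorded on the previous run, and plans which documents to re-embed and which to delete. This is one simple approach under our own assumptions, not a description of how Neum AI implements scheduling. The names `fingerprint` and `plan_sync` are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to detect whether a document changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source: dict[str, str], synced: dict[str, str]) -> dict[str, list[str]]:
    """Compare current source documents with fingerprints from the last run.

    `source` maps document id to current text.
    `synced` maps document id to the fingerprint stored after the last run.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in source.items()
        if synced.get(doc_id) != fingerprint(text)
    )
    to_delete = sorted(doc_id for doc_id in synced if doc_id not in source)
    return {"upsert": to_upsert, "delete": to_delete}


# State recorded by the previous run.
synced = {
    "a": fingerprint("alpha"),
    "b": fingerprint("beta"),
    "c": fingerprint("gamma"),
}
# Current state of the source: "b" was edited, "c" was removed, "d" is new.
source = {"a": "alpha", "b": "beta v2", "d": "delta"}

plan = plan_sync(source, synced)
print(plan)
```

Only the documents in `plan["upsert"]` need to be chunked and embedded again, which keeps each scheduled run proportional to what changed rather than to the size of the source.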

With this first step, Neum AI helps standardize the core pipeline that moves data from data sources into vector stores, chunking and embedding it along the way. Neum AI can handle a large number of documents, as well as different types of documents, moving through the pipeline without issue.

In the next iteration of this blog, we will deep dive into the next problem we are solving: syncing data changes. This is a key problem for use cases where the data sources are dynamic, like a Notion notebook that is being updated, or user-generated information like listings.

Let's talk 🙂

If you have made it this far, you probably agree with one or all of the problems we discussed and might be looking for a solution. If you would be interested in chatting with me, my calendar is open. Even if you are not interested in Neum as a solution, I would still be open to sharing ideas.
