Contextually splitting documents

David de Matheu

November 14, 2023

Context matters when using Large Language Models like GPT-4 and Claude, especially when discussing specialized topics. The key to effective model prompting often lies in Retrieval Augmented Generation (RAG), where content—such as an SEC filing—is broken down into manageable text chunks. These chunks are then converted into vector embeddings for easier retrieval. Traditional splitting methods, which segment text by token count or sentence, often fall short, leaving developers to craft their own, often labor-intensive, solutions.

Enter Neum AI's new feature: context-aware text splitting. This feature allows for custom strategies that better suit specific documents. It's a game-changer for consistent datasets like templated contracts or user-uploaded files, enhancing both retrieval quality and overall application performance.

In this post, we will show how the text splitter works and share a tutorial to start using it. We will introduce neumai-tools, an open-source Python module that contains tools to pre-process documents, and show how context-aware text splitting is included in the Neum AI pre-processing playground.

How does it work?

We start with a collection of documents that generally follow a given template, such as contracts or FAQs. We will try to generate a single splitting strategy that we can apply across all of those documents. The goal is for the generated strategy to produce better results than blindly splitting by sentence or number of tokens.
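To make the baseline concrete, here is a minimal sketch of blind fixed-size splitting. It treats whitespace-separated words as a rough stand-in for tokens (an assumption, since real tokenizers count differently), and the function name `split_by_word_count` is hypothetical.

```python
def split_by_word_count(text: str, max_words: int = 8) -> list[str]:
    """Blindly split text into chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


sample = (
    "Q: What is a SAFE? A: A simple agreement for future equity. "
    "Q: Who signs it? A: The company and the investor."
)
chunks = split_by_word_count(sample, max_words=8)
for chunk in chunks:
    print(repr(chunk))
```

The middle chunk ends up holding the tail of the first answer together with the second question, which is the kind of boundary problem a template-aware strategy is meant to avoid.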

We will take a couple of the documents to use as a sample. Because the documents are similar, we can pick any two that are a good approximation, or use a template if there is one (e.g. a master contract or spec template). Once we have those, we can use LLMs to analyze the documents and help generate a strategy.

We will use a multi-shot prompt system to ensure that we apply our thinking across different steps and yield the best result possible. As a pre-processing step, removing any covers, tables of contents or abstracts helps ensure that the analysis focuses on the meatiest parts of the document.

Chunking strategy

The first prompt generates a strategy to split the documents. This is the most expensive and time-consuming step, so we want to use a high-quality model that can analyze the documents and provide a good approximation. The output of this step is a high-quality outline of the steps to take, along with any obvious markings or formatting that we can parse on.
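As an illustration only, a strategy prompt could be assembled along these lines. The prompt wording, the `build_strategy_prompt` name, and the `max_chars` budget are all hypothetical and are not the prompt Neum AI uses; the sketch stops short of calling a model.

```python
def build_strategy_prompt(samples: list[str], max_chars: int = 4000) -> str:
    """Assemble a prompt asking a model to describe a splitting strategy."""
    parts = [
        "You are analyzing documents that follow a shared template.",
        "Describe, step by step, how to split them into self-contained chunks.",
        "Point out any markers (headings, numbering, Q:/A: prefixes) to split on.",
    ]
    for index, sample in enumerate(samples, start=1):
        # Truncate each sample so the prompt stays within a rough size budget.
        parts.append(f"--- Sample {index} ---\n{sample[:max_chars]}")
    return "\n\n".join(parts)


prompt = build_strategy_prompt(["Q: One? A: Yes.", "Q: Two? A: No."])
print(prompt)
```

The resulting string would then be sent to whichever high-quality model you choose, and the model's outline feeds the code-generation step.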

Chunking code

Once we have the chunking strategy established, we then use a second prompt to help generate the code to be applied to the text. With this code, we can easily run subsequent documents through the same set of transformations.

Chunking runtime

After the code is generated, we check the code to make sure it is correct and runnable. If it has any issues, we can re-generate it / fix it.
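One way such a check might look is sketched below, assuming the generated code defines a function named `split_text_into_chunks` that takes a string and returns a list of strings. The helpers `load_splitter` and `check_splitter` are hypothetical names, not part of neumai-tools.

```python
import ast


def load_splitter(code: str, function_name: str = "split_text_into_chunks"):
    """Parse generated code, execute it, and return the splitter function."""
    ast.parse(code)  # Raises SyntaxError if the code is not valid Python.
    namespace: dict = {}
    # Caution: exec runs arbitrary code. Only use it on code you have
    # reviewed, or run it inside a sandbox.
    exec(code, namespace)
    splitter = namespace.get(function_name)
    if not callable(splitter):
        raise ValueError(f"Generated code does not define {function_name}()")
    return splitter


def check_splitter(splitter, sample_text: str) -> list[str]:
    """Run the splitter on a sample and verify it returns non-empty strings."""
    chunks = splitter(sample_text)
    if not isinstance(chunks, list) or not chunks:
        raise ValueError("Splitter must return a non-empty list")
    if not all(isinstance(c, str) and c.strip() for c in chunks):
        raise ValueError("Every chunk must be a non-empty string")
    return chunks


# Stand-in for model output: a trivial splitter that keeps non-blank lines.
generated_code = '''
def split_text_into_chunks(text):
    return [line.strip() for line in text.splitlines() if line.strip()]
'''
splitter = load_splitter(generated_code)
chunks = check_splitter(splitter, "First paragraph.\n\nSecond paragraph.")
print(chunks)
```

If either helper raises, the error message can be fed back into the code-generation prompt to request a corrected version.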

Example outputs

The end result of the process is a piece of code that we can use to split text documents that follow a similar structure. For example, this is what the process yields for a couple of sample documents:

Q&A Documents

In this case, we have a document organized as questions and answers. The smart splitter identifies the format and splits the document so that each question stays in the same chunk as its answer.
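The generated code for a Q&A document might resemble the sketch below. It assumes every question begins on its own line with a `Q:` prefix; that marker and the sample text are illustrative assumptions, not actual output from the tool.

```python
import re


def split_text_into_chunks(text: str) -> list[str]:
    """Split a Q&A document so each question stays with its answer."""
    # Split right before every line that starts with "Q:" (assumed marker).
    parts = re.split(r"^(?=Q:)", text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]


faq = """Q: What is a SAFE?
A: A simple agreement for future equity.

Q: Who signs it?
A: The company and the investor.
"""
chunks = split_text_into_chunks(faq)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk now carries a full question and answer pair, so a retrieved chunk is meaningful on its own.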

Contracts (ex. SAFE)

In this case, we have a standard SAFE contract. The smart splitter identifies the format and generates several regular expressions that identify sections, paragraphs and sentences within the text, then builds chunks out of them.

You can try out the smart chunker yourself directly on the Neum AI pre-processing playground by choosing it in the text splitting section.
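A simplified sketch of regex-based contract splitting follows. It assumes section headings look like `1. Events` on their own line and that paragraphs are separated by blank lines; both patterns and the sample text are assumptions for illustration, not the code the tool generated for the SAFE.

```python
import re

# Assumed heading format: a number, a period, then the section title.
SECTION_HEADING = re.compile(r"^\d+\.\s+\S.*$", flags=re.MULTILINE)


def split_text_into_chunks(text: str) -> list[str]:
    """Split a contract into one chunk per paragraph, tagged by section."""
    chunks = []
    headings = list(SECTION_HEADING.finditer(text))
    # Note: any text before the first heading is ignored in this sketch.
    for index, match in enumerate(headings):
        # A section body runs from the end of its heading to the next heading.
        is_last = index + 1 == len(headings)
        end = len(text) if is_last else headings[index + 1].start()
        body = text[match.end():end]
        title = match.group(0).strip()
        for paragraph in re.split(r"\n\s*\n", body):
            if paragraph.strip():
                # Prefix the section title so each chunk keeps its context.
                chunks.append(f"{title}\n{paragraph.strip()}")
    return chunks


contract = """1. Events

(a) Equity Financing. The Safe converts into preferred stock.

(b) Liquidity Event. The investor receives a cash payment.

2. Definitions

"Safe" means an instrument with rights to future equity.
"""
chunks = split_text_into_chunks(contract)
for chunk in chunks:
    print(repr(chunk))
```

Prefixing each paragraph with its section title is a design choice: it costs a few extra tokens per chunk but keeps the section context attached when a chunk is retrieved in isolation.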

Integrating smart splitting into your flow

To get started, we need to install the pip package for neumai-tools. This package includes several utilities for pre-processing documents as part of a RAG data pipeline. We will also install langchain, unstructured[all-docs] and openai to use them in our examples.

```bash
pip install neumai-tools langchain unstructured[all-docs] openai
```

Once installed, we can then implement code that leverages the semantic_chunking_code and semantic_chunking utilities.

semantic_chunking_code: Outputs the code generated by the system from a sample of the document (up to 2000 tokens). The code comes out ready to be executed inside a function called split_text_into_chunks.
semantic_chunking: Takes as input the generated code and the full set of documents to be split by it. It outputs a list of Document objects that contain the text chunks and can be used to generate embeddings.

```python
# Configure OpenAI
import openai

# Import path assumed; check the neumai-tools README for the exact module
from neumai_tools import semantic_chunking_code

openai.api_key = "API_KEY"
# openai.organization = "OPTIONAL_ORGANIZATION"

text = "String of text to analyze"

# Alternatively, use LangChain or Unstructured loaders to load a file.
# For example:
# from langchain.document_loaders import UnstructuredFileLoader
# loader = UnstructuredFileLoader(file_path="File Path")
# documents = loader.load()
# text = documents[0].page_content

splitter_code = semantic_chunking_code(text)
```

We now have the splitter_code generated from the sample text we provided. We can take that splitter code and apply it to other pieces of text or documents. In this case, we will use LangChain loaders to extract the text from a document and pass it to semantic_chunking.

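```python
from langchain.document_loaders import UnstructuredFileLoader

# Import path assumed; check the neumai-tools README for the exact module
from neumai_tools import semantic_chunking

loader = UnstructuredFileLoader(file_path="File Path")
documents = loader.load()

# splitter_code is the output of semantic_chunking_code from the previous step
chunks = semantic_chunking(documents=documents, chunking_code_exec=splitter_code)
```

This code returns a list of Document objects with the generated chunks.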

Conclusion

Pre-processing continues to be a key step in creating great generative AI applications that are grounded in your own data. We believe that by leveraging intelligence we can simplify pre-processing while increasing the quality of the results at scale. Simple tools like the ones above can help steer you in that direction. Please share any feedback you have as you try out these methods.

Outside of pre-processing, scaling data pipelines for vector embeddings continues to be a challenge. As you move past initial experimentation, check out Neum AI as a platform to help you scale your applications while keeping quality up and latency and cost down. The platform provides access to capabilities like context-aware text splitting and more. Stay tuned to our social media (Twitter and LinkedIn) and Discord for more updates.
