
Pre-processing playground

David de Matheu
November 14, 2023
10 min read


To use the hosted app, head to https://neumai-playground.streamlit.app/. The project is a fork of the LangChain Text Splitter Explorer.

Check out the open-source repo: NeumTry/pre-processing-playground (github.com)

At Neum AI, we are focused on building the next generation of data pipelines, built specifically for embeddings and RAG. Preparing data to be converted into vector embeddings and ingested into vector databases is challenging. Different data types come with different requirements and best practices for converting them and optimizing them for retrieval.

It starts with choosing the right loader that will correctly extract the text and formatting from the original file. For structured data types like JSON and CSV, you need to separate the content that is worth embedding from the content that should only serve as metadata. Once we have the text that contains our context, it must be split into smaller chunks while maintaining a cohesive information structure - e.g. you don't want to split in the middle of a sentence. Chunking can take different shapes and forms depending on the type of document. For example, in a Q&A document you want to keep each question and answer together. If the document is a report with sections, you want to keep the sections together. If it is code, you want to keep classes and methods together.
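To make this concrete, here is a minimal sketch of the loader-then-splitter flow using LangChain, which the playground builds on. The file name and chunk parameters are illustrative, not the app's defaults.

```python
# Minimal loader -> splitter sketch. "report.pdf" and the chunk sizes are
# placeholders, not the playground's actual configuration.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load: extract text (and basic metadata) from the original file.
loader = PyPDFLoader("report.pdf")  # hypothetical input file
documents = loader.load()

# 2. Split: chunk the text while keeping cohesive units together.
#    RecursiveCharacterTextSplitter falls back from paragraphs to sentences
#    to words, so it avoids cutting mid-sentence where possible.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Inspect the first few chunks and their carried-over metadata.
for chunk in chunks[:3]:
    print(chunk.page_content[:80], chunk.metadata)
```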

What can the app do?

Using this repo and the associated app, you can test pre-processing flows for different documents. Chances are the documents you process follow a similar structure, so optimizing your flow on a few samples lets you apply it across your entire document set. The app allows you to upload a file, choose the loader you want to use, and pick the splitter to chunk it. In addition, you can leverage metadata selectors to attach metadata to the resulting chunks (only available for JSONs and CSVs using the provided loaders). The app does not store any data; it simply uploads files to temporary storage to use at runtime and then cleans them up.
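As a rough illustration of the metadata selector idea (not the app's exact code), here is how selected CSV columns could be attached as metadata while only the content column gets embedded. The file name and column names are assumptions for the example.

```python
# Illustrative only: embed one CSV column, keep the others as chunk metadata.
import csv

from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

content_column = "description"          # column worth embedding (assumed name)
metadata_columns = ["id", "category"]   # columns kept as metadata (assumed names)

documents = []
with open("products.csv", newline="") as f:  # hypothetical file
    for row in csv.DictReader(f):
        documents.append(
            Document(
                page_content=row[content_column],
                metadata={col: row[col] for col in metadata_columns},
            )
        )

# Chunk the content; each resulting chunk inherits its document's metadata.
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)
chunks = splitter.split_documents(documents)
```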

What is coming?

We will be adding more capabilities to the app to further match the feature set we offer through Neum AI. This includes intelligence layers to pick the correct loaders, splitters, etc., as well as more nuanced loaders and splitters that are specific to a given data type and document context (e.g. reports, Q&As, contracts). To learn more or collaborate, email founders@tryneum.com.


Ready to start scaling your RAG solution?

We are here to help. Get started today with our SDK and Cloud offerings.