
Indexing from Tweets to Product Listings with SingleStore and Neum AI

David de Matheu
November 14, 2023
10 min read

Neum AI enables AI engineers to connect their data sources to their LLMs through Retrieval Augmented Generation (RAG). Neum AI supports a variety of data sources to pull from, as well as vector databases where embeddings are stored for retrieval. Today, we are announcing support for SingleStore as both a data source and a vector database. SingleStore allows you to keep all your data in a single place while leveraging the power of vector embeddings and RAG, and Neum AI makes it easy to generate vector embeddings for that data and connect everything together.

Figure showing the RAG process including process, vector generation, storage and retrieval.

As with other integrations, SingleStore is supported through Neum AI’s large-scale, synchronized, and extensible data architecture. This means supporting millions of data points extracted from SingleStore or other data sources and converted into vector embeddings to power your search, classification, and recommendation use cases. To learn more, read our last write-up: RAG at scale.

SingleStore as a vector database

Let’s start with a simple example using SingleStore as our vector database. In this case, we will extract tweet data from Twitter / X (sorry, still weird 😝) and generate vector embeddings that we will store in SingleStore. This use case is great for developers who already use SingleStore for their data but might have other sources of unstructured data they want to leverage for Retrieval Augmented Generation (RAG), like SharePoint, Google Drive, Notion, and others.

Figure showing flow for example 1: ingesting tweets into SingleStore using NeumAI

With Neum AI, we can simply connect our existing data sources, pick the embedding model we want to use, and choose the sink where we want to store our vector embeddings.

First, we will configure a table within our SingleStore database to receive the vector embedding data we will generate from the tweets. The table is simple: a field for an id, the text of the tweet, its source, the vector embedding, and a timestamp. We will create it using the SQL command below:

```sql
CREATE TABLE tweets (
    id VARCHAR(255) NOT NULL PRIMARY KEY,
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
    source TEXT,
    vector BLOB,
    timestamp TEXT
);
```

Next, we will configure the Neum AI pipeline as a JSON object. For the source, we will use Apify’s tweet-flash actor to get Elon Musk’s tweets. We will choose OpenAI as our embedding model and SingleStore as our sink, pointing the sink connector at the tweets table we just created. Across the different connectors, we also need to configure API keys and connection strings.

POST: https://api.neum.ai/v1/pipelines

```json { "source":{ "source_name":"twitter", "metadata": { "twitter_username":"elonmusk", "number_of_tweets":50 } }, "embed":{ "embed_name":"openai", "metadata":{ "api_key":"OPENAI-API-KEY" } }, "sink":{ "sink_name":"singlestore", "metadata":{ "url":"user:password@host:port/database", "table":"tweets" } } } ```

Once we have this configured, we can run our pipeline using the Neum AI create pipeline REST API. You will need a Neum AI API key, which you can get by creating an account at dashboard.neum.ai.
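If you prefer calling the API from code, here is a minimal Python sketch that posts the configuration above using requests. The `neum-api-key` header name is an assumption here; check the Neum AI docs for the exact authentication header.

```python
# Minimal sketch: create the pipeline by POSTing the JSON configuration
# above. The "neum-api-key" header name is an assumption; check the
# Neum AI docs for the exact authentication header.
import requests

pipeline_config = {
    "source": {
        "source_name": "twitter",
        "metadata": {"twitter_username": "elonmusk", "number_of_tweets": 50},
    },
    "embed": {"embed_name": "openai", "metadata": {"api_key": "OPENAI-API-KEY"}},
    "sink": {
        "sink_name": "singlestore",
        "metadata": {"url": "user:password@host:port/database", "table": "tweets"},
    },
}

response = requests.post(
    "https://api.neum.ai/v1/pipelines",
    headers={"neum-api-key": "NEUM-API-KEY"},  # assumed header name
    json=pipeline_config,
)
print(response.status_code, response.json())
```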

The pipeline will now run for a couple of minutes, gathering the tweet data, processing it, and generating vector embeddings that are stored in SingleStore. Once there, you can query the results semantically using the Neum AI search REST API or the SingleStore APIs directly. To try out a chat-based interface, use our open-source sample.
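As a sketch of what querying SingleStore directly can look like, the snippet below embeds a question with the same OpenAI model the pipeline uses and ranks tweets with SingleStore’s DOT_PRODUCT and JSON_ARRAY_PACK functions. It assumes the pipeline stores vectors as packed blobs compatible with JSON_ARRAY_PACK; the connection details are placeholders from the configuration above.

```python
# Minimal sketch: semantic search against the tweets table directly in
# SingleStore, assuming the pipeline embedded tweets with OpenAI's
# text-embedding-ada-002 model and stores vectors as packed blobs.
import json
import openai   # pip install "openai<1.0" (pre-1.0 style API shown)
import pymysql  # pip install pymysql

openai.api_key = "OPENAI-API-KEY"

def search_tweets(query: str, top_k: int = 5):
    # Embed the query with the same model the pipeline used.
    embedding = openai.Embedding.create(
        model="text-embedding-ada-002", input=[query]
    )["data"][0]["embedding"]

    conn = pymysql.connect(
        host="host", port=3306, user="user", password="password", database="database"
    )
    with conn.cursor() as cur:
        # DOT_PRODUCT + JSON_ARRAY_PACK is SingleStore's built-in way to
        # score blob-packed vectors against a query vector.
        cur.execute(
            """
            SELECT text, DOT_PRODUCT(vector, JSON_ARRAY_PACK(%s)) AS score
            FROM tweets
            ORDER BY score DESC
            LIMIT %s
            """,
            (json.dumps(embedding), top_k),
        )
        return cur.fetchall()

for text, score in search_tweets("What does Elon think about X?"):
    print(round(score, 3), text[:80])
```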

For example, let's ask a question about X based on Elon Musk’s tweets:

Chatbot exchange showcasing answer and context for question about X

Try out other sources like S3 or Azure Blob to let Neum AI process thousands or millions of documents into vector embeddings that can be stored in SingleStore and queried quickly.

Indexing product listings in SingleStore

Now that we have seen how SingleStore can be used as a vector database, you might be asking yourself: what if I already have data in SingleStore? Can I use vector embeddings on top of it? Yes, you can. In this example, we will index data from an existing SingleStore table back into SingleStore as vector embeddings.

Figure showcasing flow for example 2: ingesting data from a SingleStore table as vectors back into SingleStore.

For this example, let’s pretend we have an existing table in SingleStore that contains all of our product listings. The table might look something like this:

Table showing schema of products in SingleStore

For each row in the table, we will create a couple of vector embeddings to index the information under Name, Description, and Category. To store all of these embeddings, we will create a separate table similar to the one we had before, only this time we will add additional columns to store some of the metadata associated with each vector embedding, including the ProductID, Name, Price, and StockQuantity. The metadata will allow us to filter products based on availability or price, for example. Filtering helps improve the quality of the retrieved data so that it is more relevant to the query the user is making.

```sql
CREATE TABLE index_product_listings (
    ID VARCHAR(255) NOT NULL PRIMARY KEY,
    text TEXT NOT NULL,
    vector BLOB NOT NULL,
    ProductID VARCHAR(255),
    Name VARCHAR(255),
    Price DECIMAL(10, 2),
    StockQuantity INT
);
```
Note: In cases where you only generate a single vector embedding per row of data, it is possible to simply add a column to the existing table to store the vector embedding. In our experience, storing the entire context of a row in a single vector embedding reduces the quality of results, which is why we generally generate multiple vector embeddings per row to capture different parts of the context.
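To make that fan-out concrete, here is an illustrative sketch (not Neum AI’s internal code) of how a single product row can become multiple embeddable documents that share the same metadata; the row values are made up for illustration:

```python
# Illustrative only: one product row fanning out into multiple embeddable
# documents that share the same metadata. Neum AI handles this internally
# based on the embed_keys / metadata_keys configuration shown below.
row = {
    "ProductID": "prod-123",
    "Name": "Trail Running Shoes",
    "Description": "Lightweight shoes with aggressive grip for muddy trails.",
    "Category": "Footwear",
    "Price": 89.99,
    "StockQuantity": 42,
}

embed_keys = ["Name", "Description", "Category"]
metadata_keys = ["ProductID", "Name", "Price", "StockQuantity"]

documents = [
    {
        "id": f'{row["ProductID"]}#{key}',  # one vector per embedded field
        "text": f'{key}: {row[key]}',       # the text that gets embedded
        "metadata": {k: row[k] for k in metadata_keys},
    }
    for key in embed_keys
]

for doc in documents:
    print(doc["id"], "->", doc["text"])
```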

Now that we have our table, let's set up the Neum AI pipeline to help us extract, process, embed and store the data back into SingleStore ready to be semantically searched. Similar to the previous example, we will configure a source, an embed and a sink connector. For source, we will use the SingleStore query connector, for embed we will use OpenAI and for sink we will use the SingleStore store connector.

For the SingleStore query connector, we will configure a query to extract data from our product listings table like so:

```sql
SELECT * FROM Products;
```

We will also configure which fields we want to turn into vector embeddings (Name, Description, and Category) and which fields we will keep as metadata (ProductID, Name, Price, and StockQuantity). For the OpenAI and SingleStore store connectors, we will use a similar configuration as before:

```json { "source":{ "source_name":"singlestore", "metadata": { "url":"user:password@host:port/database", "query":"SELECT * FROM Products;", "id_key":"ProductID", "embed_keys":["Name","Description"], "metadata_keys":["Product ID" , "Name", "Price", "StockQuantity"] } }, "embed":{ "embed_name":"openai", "metadata":{ "api_key":"OPENAI-API-KEY", } }, "sink":{ "sink_name":"singlestore", "metadata":{ "url":"user:password@host:port/database", "table":"index_product_listings" } } } ```

Once we have everything configured, we create the pipeline on Neum AI using the create pipeline REST API, with the same API key as before. The pipeline will automatically pull from SingleStore using the query, process and embed the data, and push it back into SingleStore as vector embeddings. Neum AI supports reading and processing millions of rows of data, efficiently parallelizing the workloads so the data is processed quickly and reliably.

Once the data has landed back in SingleStore as vector embeddings, we can use natural language to query information that we can use directly or pass as context to a model through RAG. To query results, use the Neum AI search REST API or the SingleStore APIs directly. To try out a chat-based interface, use our open-source sample.
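Here is a sketch of what metadata filtering can look like when querying SingleStore directly, under the same assumptions as the earlier tweets example (OpenAI embeddings, blob-packed vectors); the price and stock thresholds are arbitrary examples:

```python
# Minimal sketch: semantic product search with metadata filters in
# SingleStore, reusing the embedding approach from the tweets example.
import json
import openai   # pre-1.0 style API, as above
import pymysql

openai.api_key = "OPENAI-API-KEY"

query = "waterproof shoes for trail running"
embedding = openai.Embedding.create(
    model="text-embedding-ada-002", input=[query]
)["data"][0]["embedding"]

conn = pymysql.connect(
    host="host", port=3306, user="user", password="password", database="database"
)
with conn.cursor() as cur:
    # Filter on metadata columns first, then rank by vector similarity.
    cur.execute(
        """
        SELECT Name, Price, DOT_PRODUCT(vector, JSON_ARRAY_PACK(%s)) AS score
        FROM index_product_listings
        WHERE StockQuantity > 0 AND Price <= 100
        ORDER BY score DESC
        LIMIT 5
        """,
        (json.dumps(embedding),),
    )
    for name, price, score in cur.fetchall():
        print(f"{score:.3f}  {name}  (${price})")
```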

Image showing conversation validating product related questions

What’s next?

Using Neum AI, you can centralize your data processing across different data sources, ranging from scraping websites or Twitter down to proprietary data sources from your company like S3, Azure Blob, or SingleStore. This allows you to craft AI-powered experiences that are grounded in the specific data sources you need and to make sure the quality of the responses is up to par. By using Neum AI with SingleStore, you can further centralize your data needs, as both your sources and vector embeddings can live under the same roof.

Coming up for this integration: deeper control over pre-processing steps like the loading and chunking of data. We will also add more controls for synchronizing your data, so that Neum AI can listen for changes in your underlying SingleStore tables and automatically sync those changes to the generated vector embeddings.

Get started today with Neum AI at dashboard.neum.ai
