
Building ElectionGPT: Using Neum AI to build an authentic candidate chatbot ahead of the 2024 US Presidential Election.

David de Matheu
November 14, 2023
10 min read

Disclaimer: We are the founders of Neum AI, a data platform for embeddings management, optimization, and synchronization at large scale, essentially helping with large-scale RAG.

A couple of days ago, we released ElectionGPT.ai with the goal of showcasing experiences that use LLMs with Retrieval Augmented Generation (RAG) and are grounded in factual data from a variety of sources. In this blog, we explore how we built it using Neum AI and what we learned in the process.

Why we built ElectionGPT

We built ElectionGPT to serve as an easy-to-consume source of news about candidates for the upcoming elections. Our goal was to ground each candidate experience in the core, unbiased propositions that each candidate is putting forth. Today, there is a ton of information out there coming from a variety of media outlets. It’s difficult to follow everything and have it distilled in a way where follow-up questions can be asked. Ultimately, we wanted to create an experience that felt as close as possible to talking 1:1 with the candidate.

We also thought this type of experience extends past the election scenario: individuals and organizations want to create experiences that reach their “customers” and deliver their core message. They want to guardrail those experiences to stay attached to their desired storyline (this could be their specific brand, ideology, or message). They also want to present information quickly, as it’s happening, which requires a high level of orchestration and synchronization.

Another driver for this experience was the idea of multi-source context. More than ever, information is siloed across many repositories. In this case, candidates present themselves through their websites, Twitter (or X? 🙂), as well as through interviews and podcasts. Some of these sources are also where we find the most up-to-date and relevant information, for example Twitter / X, where candidates share minute-by-minute updates across topics.

So how did you build it?

We will walk through a couple of components of the process, starting with data sources and moving into more technical topics. At a high level, we are contextualizing OpenAI’s GPT-4 with a variety of data sources, updated daily, to produce authentic outputs based on each candidate’s opinions as drawn from the ingested data sources.

High-level architecture

Architecture overview showcasing Candidate KB sources and RAG process.

We are using a RAG pipeline to generate vector embeddings from the different data sources. We then ingest the vectors into Weaviate, where we generate a search index. Every day, the data sources get re-ingested to pick up any changes we might detect (e.g., new tweets or updates on Wikipedia). The re-ingestion avoids duplicating existing vectors in Weaviate through the use of well-crafted vector IDs. At runtime, we retrieve the vectors most similar to the user query using cosine similarity and add that context to the prompt we send OpenAI for chat completion. We leverage metadata to track which sources the data comes from. This entire process is handled through the Neum AI platform.
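To make the runtime step concrete, here is a minimal Python sketch of the retrieve-then-generate flow. It assumes the similar chunks have already been retrieved (the retrieval call itself is sketched later in this post) and uses the pre-1.0 openai package; it is an illustration, not the exact ElectionGPT code.

import openai

openai.api_key = "OPEN AI API KEY"

def answer(candidate_name: str, query: str, contexts: list) -> str:
    # Ground the system prompt in the chunks retrieved from Weaviate
    system_prompt = (
        f"You are part of the campaign team for the presidential candidate {candidate_name}. "
        "Answer based only on the context below.\n\n"
        "Context:\n" + "\n---\n".join(contexts)
    )
    # Chat completion against GPT-4 (pre-1.0 openai SDK style)
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
    )
    return completion.choices[0].message.content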

You can see the full repository for the project on GitHub. Let’s go into the details!

Finding the right data

ElectionGPT started with research into each candidate for the upcoming 2024 US presidential election. We wanted to identify the correct data sources that would provide the most authentic perspective of each candidate. It started with their websites and government plans and snowballed into other sources like Twitter / X where they were showing active engagement.

At a high level, we are pulling data from the following sources to get more authentic responses and up-to-date context beyond OpenAI’s regular training data:

  1. Wikipedia articles
  2. Ballotpedia
  3. Transcripts of interviews with the candidates
  4. Real-time Tweets (Xweets? ok I’ll stop)
  5. Government plans / Websites

Our goal was to keep enriching this with more data, but we figured those were great starting points so long as we keep pulling the latest information.

Ingesting the data

Once we had identified the right data sources, we leveraged the Neum AI platform to extract and ingest the data into a vector database, Weaviate. We generated a pipeline for each candidate and their sources. This allowed us both to trigger updates for specific candidates and to silo the data so that there is no mixing (i.e., we wouldn’t want Trump and Biden ideologies mixing and leading the bot to serve an inauthentic answer for either candidate).

Below is a sample configuration of one of the pipelines. You can see the full configuration here.

{
 "source": [
   {
     "source_name": "apify",
     "metadata": {
       "api_key": "APIFY KEY",
       "actor_name": "shanes~tweet-flash",
       "apify_metadata": {
         "from_user": [
           "RonDeSantis"
         ],
         "only_tweets": true,
         "max_tweets": 50
       }
     }
   },
   {
     "source_name": "apify",
     "metadata": {
       "api_key": "APIFY KEY",
       "actor_name": "apify~website-content-crawler",
       "apify_metadata": {
         "startUrls": [
           {
             "url": "<https://www.rev.com/blog/transcripts/florida-governor-ron-desantis-announces-2024-presidential-run-on-twitter-spaces-with-elon-musk-transcript>"
           }
         ]
       }
     }
   },
 // Additional sources added, but simplified for the write-up.
 ],
 "sink": {
   "sink_name": "weaviate",
   "metadata": {
     "url": "WCS URL",
     "api_key": "WCS API KEY",
     "class_name": "RonDeSantis_pipeline"
   }
 },
 "embed": {
   "embed_name": "openai",
   "metadata": {
     "api_key": "OPEN AI API KEY",
   }
 },
}

To help scrape tweets and websites, we leveraged Apify APIs. As part of the pipeline, we also declared an embedding model, in this case OpenAI’s text-embedding-ada-002, and the vector database, Weaviate. (Note: we used classes to separate the candidates within the DB.) Once configured, we used the Neum AI create pipeline endpoint to start running it.
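For illustration, kicking off the pipeline could look like the snippet below; the endpoint URL, auth header, and response shape are assumptions for the sketch, not Neum AI’s documented API.

import requests

config = {
    # ... the pipeline configuration JSON shown above ...
}

response = requests.post(
    "https://api.neum.ai/pipelines",           # hypothetical endpoint URL
    headers={"neum-api-key": "NEUM API KEY"},  # hypothetical auth header
    json=config,
)
print(response.json())  # e.g., the ID of the newly created pipeline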

As the data gets loaded, we make sure to extract key metadata: for example, the website it came from, the tweet ID, etc. This metadata, together with the content itself, is associated with each vector to ensure that later on we can provide proper sources as we generate answers to user queries.

This is a sample of the payload for each vector:

{
 "id":"Unique ID for each chunk, source and content",
 "vector":[.......],
 "data_object":{
   "text":"Content that was embedded",
   "source":"Source of the content"
 }
}
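As a rough sketch of what landing one of these payloads in Weaviate looks like (using the v3 Python client; the values are placeholders, and this is not Neum AI’s internal code):

import uuid
import weaviate

client = weaviate.Client(
    url="WCS URL",
    auth_client_secret=weaviate.AuthApiKey(api_key="WCS API KEY"),
)

# Weaviate object IDs must be UUIDs; we derive one deterministically from the
# chunk's identity (the ID scheme is discussed in the next section)
vector_id = str(uuid.uuid5(uuid.NAMESPACE_URL, "source#chunk#content"))

client.data_object.create(
    data_object={
        "text": "Content that was embedded",
        "source": "Source of the content",
    },
    class_name="RonDeSantis_pipeline",
    uuid=vector_id,
    vector=[0.0] * 1536,  # the text-embedding-ada-002 embedding (1536 dimensions)
)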

Keeping the data up to date

Once the pipeline is created, we can configure a schedule for it to run on. In this case we selected daily, but it can be hourly or every couple of minutes. In the background, Neum AI is constantly syncing the data from the above sources (some don’t change often, but some change by the minute, like Twitter) and storing it in the vector database so that the application can do semantic search with the latest information.

As mentioned briefly above, within the platform we use well-crafted vector IDs to avoid duplicating data and to update vectors when the underlying data changes. At a high level, our IDs are crafted from the following (see the sketch after this list):

  1. The source
  2. The chunk within the source
  3. The information contained within the chunk
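Here is an illustrative way to derive such IDs in Python; this is a hypothetical scheme, and Neum AI’s actual implementation may differ:

import hashlib
import uuid

def make_vector_id(source: str, chunk_index: int, chunk_text: str) -> str:
    # Hash the chunk's content so the ID reflects the information inside it
    content_hash = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    # uuid5 is deterministic: the same source, position, and content always map
    # to the same UUID, so re-ingesting unchanged data does not create duplicates
    key = f"{source}#{chunk_index}#{content_hash}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))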

Even though the large-scale infrastructure potential of Neum AI is not being put to the test by the data synced here, it provides a ton of ease of use: we get data consistently updated with only a couple of minutes of setup.

Providing a front-end

The ElectionGPT frontend is built on top of Next.js and Vercel. You can get access to the full repo here. More important than the UX itself is how we generate the prompt and display the results.

The OpenAI prompt gets constructed and instructs the application (powered by GPT-4) to respond to the user in an unbiased way. We then inject context into the prompt at runtime based on the user’s query, effectively performing a semantic search over our data in Weaviate, where all of the sources mentioned above have already been vectorized and are ready to be consumed.

This is what the prompt looks like:

You are part of the campaign team for the presidential candidate ${candidateName}.
Remove all the biases that you have, you are here to serve information based on the provided context about a candidate. Please respect the candidate's views on anything, even if they are controversial.
Use the history of the messages for follow up questions.
If the user asks for latest information, respond with what you have in the context.  
To better help you with more recent information, you have access to the candidate's tweets, news, interviews and more below.
Here is the context: ${get_formatted_results(responseData.results)}

In it, you can see where we are injecting the retrieved context: ${get_formatted_results(responseData.results)}.

The context is retrieved through Neum AI’s search endpoint. This endpoint automatically connects to the vector database for a given pipeline and has context of the sources and metadata that were pulled into the vector database.

In this case, as mentioned before, we associate the source information with the vector’s metadata.

"data_object":{
 "text":"Content that was embedded",
 "source":"Source of the content"
}

When we retrieve a vector, we use the source to contextualize where the data we are using in the response comes from. This informs the user of how and where the AI application came up with the answer. For news and media use cases, it’s imperative that users can fact-check or read more from the source directly if interested.
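Putting the pieces together, a sketch of the retrieval and formatting step could look like this; the endpoint path, request parameters, and response fields are assumptions, and get_formatted_results is a hypothetical reimplementation of the helper referenced in the prompt above.

import requests

def search(pipeline_id: str, query: str) -> list:
    # Query Neum AI's search endpoint for a given pipeline (hypothetical URL/shape)
    response = requests.post(
        "https://api.neum.ai/search",              # hypothetical endpoint URL
        headers={"neum-api-key": "NEUM API KEY"},  # hypothetical auth header
        json={"pipeline_id": pipeline_id, "query": query, "num_results": 5},
    )
    return response.json()["results"]              # assumed response shape

def get_formatted_results(results: list) -> str:
    # Pair each retrieved chunk with its source so the answer can cite it
    return "\n\n".join(f"{r['text']}\n(Source: {r['source']})" for r in results)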

What’s next?

We have seen traffic grow to 50+ daily active users and 500+ questions posted, and we hope more people leverage this tool for the upcoming primaries early next year and the presidential election later in the year!

Having said that, we believe this can be a new form of consuming mass media in a fun and interactive way, with sources always shown and up to date. Allowing users to have 1:1 interactions with personalities or sources of information provides a much more intimate experience. Based on feedback, we are exploring building similar experiences at more local levels (we have all seen those pamphlets that show up at our doors ahead of local elections that almost no one reads).

From a medium perspective, we are still on the fence about whether a chat interface is the best way to consume information. On one hand, it helps the user direct the conversation; on the other, it feels like more work. We want to explore experiences that feel more natural, where the initial interactions are more fluid but users can still interject and help guide the conversation.

As for ElectionGPT:

We recently launched a new feature where we display the “hottest questions” of the day, aggregated semantically by AI, and show a subset of them. We hope people can learn from what others are asking and form their own opinions around a topic.

Lastly, we have been experimenting with a Debate feature where you could simulate a debate between all these candidates, with the AI taking turns responding with information from each of them. More on this soon!

Note: If you are interested in implementing this at a local level, or if you want to partner with us on exclusive media for other applications, please reach out! Email or calendar works.
