High-Level Design

The full high-level overview of RAG Me Up can be found in the drawing below. This image shows all components that can be used in the framework; each of them is configurable or can be turned on or off depending on your needs. Deciding which components you want to use and how to configure them is crucial for setting up production-grade RAG pipelines and will be explained for each component separately.

The entire RAG pipeline executed by RAG Me Up is configured through the .env file. An example is given in .env.template, which you can rename to .env. The template aims to provide a sane starting point for generic RAG but should always be tweaked when building your own RAG setup.

RAG pipeline drawing

High Level Pipeline Explanation

Any RAG framework or pipeline roughly consists of two distinct sections:

  • Indexing - a one-time process where a large corpus of documents is processed and indexed into a (vector) database. The whole indexing phase is executed before the query phase takes place.
  • Querying - a runtime process where a user initiates a retrieval and answer process through an interaction. In this process, the indexes created during the indexing phase are used to retrieve documents, which are then used to answer the user's query.

While all of the components shown in the diagram will be discussed in detail in the remainder of this documentation, we briefly address each below, with links to subsections you can explore if you don't want to read the documentation in full. Keep in mind that when using RAG Me Up, you will first and foremost be configuring the .env file, which is the single place to set up all the components shown here.

Your Documents

Of course any RAG pipeline starts with your documents. While there are different flavors of RAG (GraphRAG, Text2SQL, etc.), RAG Me Up focuses on semantic RAG. This means that your documents, and the queries you want to run on them, should be semantic in nature. Hence, your documents should contain some form of text. Whether that text lives inside a PDF, DOCX, JSON, CSV, etc. doesn't matter, but RAG Me Up does focus on text.

Loaders

For each document type (DOCX vs XLSX vs JSON, etc.), the way to retrieve the text from the source document differs. Hence, we have loaders for each type of document separately.
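To make the idea concrete, here is a minimal sketch of dispatching on file type, using LangChain community loaders purely as an example; the class names and the mapping are illustrative assumptions and not necessarily how RAG Me Up selects its loaders internally.

```python
# Hedged sketch: pick a loader based on the file extension.
from pathlib import Path

from langchain_community.document_loaders import (
    CSVLoader,
    Docx2txtLoader,
    PyPDFLoader,
    TextLoader,
)

# Hypothetical mapping from extension to loader class.
LOADERS = {
    ".pdf": PyPDFLoader,
    ".docx": Docx2txtLoader,
    ".csv": CSVLoader,
    ".txt": TextLoader,
}

def load_document(path: str):
    loader_cls = LOADERS[Path(path).suffix.lower()]
    return loader_cls(path).load()  # a list of Document objects exposing .page_content
```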

Chunking

A strength of RAG (when compared to "just" using an LLM and uploading a document) is that it is capable of searching through large amounts of documents and also searching within those documents. To achieve the latter, documents are chopped up into chunks. There are different ways of doing this chopping up and RAG Me Up supports a few.
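As an illustration of one common strategy, the sketch below splits text recursively on character boundaries; the chunk size and overlap values are arbitrary examples and not RAG Me Up defaults.

```python
# Hedged sketch of recursive character chunking, one of several possible strategies.
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_document_text = "..."  # the text extracted by one of the loaders

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum number of characters per chunk
    chunk_overlap=100,  # overlap so information at chunk boundaries is not lost
)
chunks = splitter.split_text(long_document_text)
```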

Document Embeddings

Once we have our data chunked into parts of text, arguably the most crucial step is to convert them to vectors that we can use for comparison at query-time. RAG Me Up by default uses hybrid search, which combines dense and sparse vectors. The document embeddings are used to create the dense vectors using an LM or LLM1.
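The sketch below shows how chunks can be turned into dense vectors with a sentence-transformers model; the model name is only an example and not necessarily the embedding model RAG Me Up is configured with.

```python
# Hedged sketch: embed chunks into dense vectors for later similarity search.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
chunks = [
    "RAG combines retrieval with generation.",
    "BM25 is a sparse ranking function.",
]
embeddings = model.encode(chunks)  # shape: (num_chunks, embedding_dim)
```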

Sparse BM25 Vector store

One of the two ways chunks are indexed in RAG Me Up is using BM25 sparse vectors, which typically excel at keyword-type search. This is useful, for example, when users query our RAG system with very brief, perhaps even single-word, queries that do not necessarily lead to meaningful embeddings.

RAG Me Up uses Postgres for storing sparse vectors with a BM25 index using pg_search by ParadeDB.
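In RAG Me Up the BM25 scoring happens inside Postgres via pg_search; the sketch below merely illustrates how BM25 scores keyword overlap, using the rank_bm25 package as a stand-in.

```python
# Conceptual illustration of BM25 keyword scoring (not the pg_search implementation).
from rank_bm25 import BM25Okapi

corpus = [
    "postgres stores the sparse vectors",
    "dense embeddings capture semantics",
    "bm25 excels at keyword search",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])
scores = bm25.get_scores("keyword search".split())  # one score per corpus document
```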

Dense Vector DB

The other way chunks are indexed in RAG Me Up is by writing the document embeddings created in the previous step into an indexed database. For this, RAG Me Up also uses Postgres, indexed with the pgvector extension.
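The sketch below shows what a dense nearest-neighbour lookup with pgvector can look like; the connection string, table and column names are hypothetical, and the <=> operator is pgvector's cosine distance.

```python
# Hedged sketch of a pgvector similarity query against a hypothetical "chunks" table.
import psycopg2

query_embedding = [0.1] * 384  # in practice: the embedding of the user's query

conn = psycopg2.connect("dbname=rag user=rag")  # example connection string
with conn.cursor() as cur:
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    cur.execute(
        """
        SELECT id, content
        FROM chunks
        ORDER BY embedding <=> %s::vector
        LIMIT 10
        """,
        (vector_literal,),
    )
    rows = cur.fetchall()
```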

Query (User Interface)

Not explicitly written out in the diagram is the user interface. RAG Me Up comes with a custom user interface written in Scala to allow users to talk to the RAG system. RAG Me Up's server (Python) is built to be stateless, which means that the chat history, memory, previously retrieved documents, etc. are all supposed to be handled by the user interface.

An important part of the query handling step is to convert the user's question into vectors that are similar to what is stored in the (vector) database. While not explicitly drawn in the architecture, we use the same embedding model to create the dense vector for the query.

History Summarization

While the context windows of LLMs are ever increasing, they are still limited, and even when they are very large, it becomes increasingly hard for an LLM to focus on the essence of a message as its size grows. Keep in mind also that in the case of RAG Me Up, the chat history is part of every message sent to the LLM.

To remedy this, you can optionally summarize lengthy messages (with history) once they exceed a specific threshold.
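A minimal sketch of such threshold-based summarization is shown below; the threshold value, the prompt wording and the `llm` callable are assumptions and not RAG Me Up's actual implementation.

```python
# Hedged sketch: summarize the chat history once it exceeds a size threshold.
def maybe_summarize(history: list[str], llm, max_chars: int = 8000) -> list[str]:
    full = "\n".join(history)
    if len(full) <= max_chars:
        return history  # below the threshold, keep the history as-is
    summary = llm(
        "Summarize this conversation, keeping key facts and open questions:\n\n" + full
    )
    return [summary]  # replace the long history with its summary
```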

Document Fetch Check

When dealing with history in a RAG system, it is important to determine whether a user's question is actually a follow-up on documents that were already retrieved. New documents should only be fetched if there is no history present yet or if the user's question calls for a new retrieval. We make this decision by asking an LLM which way to go in case history is already present.

If documents should be fetched, we continue to retrieval. If not, we go to answering the question directly with documents that were previously retrieved and are present in the chat history.
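The sketch below captures this fetch-or-reuse decision; the prompt wording and the `llm` callable are assumptions used only to show the shape of the decision.

```python
# Hedged sketch of the fetch-or-reuse decision.
def should_fetch_new_documents(question: str, history: list[str], llm) -> bool:
    if not history:
        return True  # no history yet, so we must retrieve
    history_text = "\n".join(history)
    prompt = (
        "Given the conversation so far and the new question, reply 'fetch' if new "
        "documents are needed or 'reuse' if the previously retrieved documents suffice.\n\n"
        f"Conversation:\n{history_text}\n\nQuestion: {question}"
    )
    return llm(prompt).strip().lower().startswith("fetch")
```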

HyDE

When using any RAG system, there is an inherent mismatch between the indexing phase and the query phase. When indexing our document chunks, we are essentially working with (potential) answers. Whenever a user poses a query, this is a question. When we just naively embed the chunks and the question in exactly the same way, we are comparing apples to oranges, though we expect there to be some coherence between the two.

HyDE (Hypothetical Document Embeddings) tries to remedy this by generating a couple of hypothetical documents for a given query; we do this by asking the LLM to generate those documents. This way we hope to compare apples with apples by using the generated answers, instead of the query itself, to retrieve relevant document chunks. HyDE is an optional step in RAG Me Up.
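A minimal HyDE sketch is shown below: the LLM writes a hypothetical answer and that answer is embedded instead of the raw question. The prompt wording, embedding model and `llm` callable are assumptions.

```python
# Hedged HyDE sketch: embed an LLM-generated hypothetical answer, not the question.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def hyde_embedding(question: str, llm):
    hypothetical_answer = llm(
        f"Write a short passage that plausibly answers this question:\n{question}"
    )
    # Retrieval now compares answer-like text against answer-like chunks.
    return embedder.encode(hypothetical_answer)
```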

Document Retrieval

At query-time this is obviously the most crucial step to perform. Here we query the (hybrid) database to fetch documents (chunks) that are relevant to the user's question. This is done by firing a SQL query at the Postgres system holding the dense and sparse vectors. The SQL query itself already scores the retrieved document chunks on similarity (using cosine similarity and the BM25 score in a 50/50 weighting).
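Conceptually, the 50/50 weighting boils down to the sketch below. In RAG Me Up this combination happens inside the SQL query itself, and the max-based normalization of the BM25 score shown here is an assumption, not necessarily what the actual query does.

```python
# Conceptual sketch of the 50/50 hybrid weighting of dense and sparse scores.
def hybrid_score(cosine_sim: float, bm25_score: float, max_bm25: float) -> float:
    # BM25 scores are unbounded, so normalize before averaging (assumed scheme).
    bm25_norm = bm25_score / max_bm25 if max_bm25 > 0 else 0.0
    return 0.5 * cosine_sim + 0.5 * bm25_norm
```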

Reranking

One of the issues with a normal RAG pipeline's scoring is that the document chunks are embedded in isolation, and so is the user query. While this is relatively fast and can be done asynchronously, an alternative is to use cross-encoders or other models that embed the document chunk together with the query to capture attention across both. This is what a reranker aims to do. The problem, however, is that this is too time-consuming to do for the entire document set. As a solution, we can apply a reranker only to the documents that were retrieved by the regular retrieval process and rerank those. A good practice is then to retrieve a relatively large set directly from the database and return a smaller subset of documents after reranking. RAG Me Up uses flashrank for reranking as an optional step.
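To illustrate the cross-encoder idea, the sketch below uses a sentence-transformers CrossEncoder instead of flashrank; it only demonstrates the concept of scoring query and chunk together, and the model name and candidate texts are examples.

```python
# Hedged illustration of cross-encoder reranking (RAG Me Up itself uses flashrank).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
query = "how does hybrid retrieval work?"
candidates = ["chunk about BM25", "chunk about dense vectors", "unrelated chunk"]

# Each (query, chunk) pair is scored jointly, capturing attention across both.
scores = reranker.predict([(query, c) for c in candidates])
top = [c for _, c in sorted(zip(scores, candidates), reverse=True)][:2]
```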

Answer Check - LLM-as-a-judge

Once documents are retrieved and ranked, we can use a form of self-reflection or LLM-as-a-judge to determine whether or not the currently retrieved set of chunks can accurately answer the user's question. If this is not the case, we can rewrite the original user question in an attempt to obtain better document chunks in a new retrieval round. RAG Me Up allows this rewriting loop to be turned on as an optional step and will only perform it once to prevent overly lengthy or even infinite rewrite loops.
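A minimal sketch of this judge-and-rewrite loop, run at most once, is shown below; the `retrieve` and `llm` callables and the prompt wording are assumptions.

```python
# Hedged sketch of LLM-as-a-judge with a single rewrite round.
def judged_retrieval(question: str, retrieve, llm) -> list[str]:
    chunks = retrieve(question)
    context = "\n".join(chunks)
    verdict = llm(
        "Can the following context fully answer the question? Reply 'yes' or 'no'.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    if verdict.strip().lower().startswith("no"):
        rewritten = llm(
            f"Rewrite this question so that better documents can be retrieved: {question}"
        )
        chunks = retrieve(rewritten)  # only one rewrite round to avoid endless loops
    return chunks
```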

Re2

It has been shown that instructing an LLM to re-read a question - called Re2 - benefits the quality of the answer given by the LLM. Re2 is an optional step in RAG Me Up.
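In practice Re2 boils down to repeating the question in the prompt and explicitly asking the model to read it again. The template below is a hedged sketch of what that can look like; the exact wording RAG Me Up uses may differ.

```python
# Hedged sketch of a Re2-style prompt template.
def re2_prompt(question: str, context: str) -> str:
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"Read the question again: {question}\n"
        "Answer:"
    )
```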

Prompt Creation

Once the question has potentially been rewritten and the set of documents to inject is finalized, RAG Me Up sets up a prompt with the documents inserted into it, feeds the full prompt to the LLM and gets the reply back.

Provenance Attribution

Once the LLM has given an answer to the user's question, a RAG pipeline usually finalizes its query phase and returns the answer to the client. In some applications of RAG, however, it can be crucial to understand which documents were actually used by the LLM to generate the answer. Even though, after reranking, we feed a set of "as-relevant-as-possible" documents to the LLM, it may still use some more than others or even ignore some altogether.

Provenance attribution tries to assign a score to each document chunk to indicate, in hindsight, how relevant it was in generating the answer. There are different ways of doing this and RAG Me Up provides provenance attribution as an optional step.
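One possible approach, shown below purely as an illustration and not necessarily the method RAG Me Up implements, is to score each chunk by its similarity to the generated answer.

```python
# Hedged sketch of similarity-based provenance attribution.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def provenance_scores(answer: str, chunks: list[str]) -> list[float]:
    answer_vec = embedder.encode(answer, convert_to_tensor=True)
    chunk_vecs = embedder.encode(chunks, convert_to_tensor=True)
    # Higher cosine similarity suggests the chunk contributed more to the answer.
    return util.cos_sim(answer_vec, chunk_vecs)[0].tolist()
```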

Footnotes

  1. While overly simplified: LMs are Language Models like the BERT family of models and are good at converting text into semantics-preserving vectors. They are not to be confused with LLMs (Large Language Models), which generally generate text. That said, LLMs also create vectors for any text and hence can often be used as embedding models too.