Answering a factual question requires knowledge of the subject and keeping up with its developments. LLMs (including ChatGPT) often fall short on this front, but this limitation can be addressed by a recent approach called Retrieval Augmented Generation. In this article, we explain the core principles of this approach and how to configure it to create a chatbot that provides quality responses.
In a previous blog post, we introduced the phenomenon of chatbots following the grand entrance of ChatGPT. One point that caught our attention then was that ChatGPT is likely to give plausible but false responses to factual questions, a phenomenon known as "hallucination." Pranksters have thus managed to obtain a convincing explanation (convincing to a Martian, at least) of the difference between chicken eggs and cow eggs.
Since then, significant progress has been made on this issue thanks to intense research by both academic and private players. This research investigates how chatbots can provide correct and up-to-date responses in specialized sectors as knowledge in those domains evolves. Two main approaches emerge from this research: adapting the model itself to the domain (for example, by fine-tuning it on specialized data), or supplying the model with relevant reference documents at question time.
These two approaches can be used in combination or individually. This article focuses on the second approach, known as Retrieval Augmented Generation (RAG). It is renowned for providing quality responses while requiring fewer computing resources for implementation and maintenance. We start by describing this approach and then examine 10 levers to enhance its effectiveness.
The Retrieval Augmented Generation approach involves facilitating the chatbot's work by providing it with documents related to the question asked. Instead of relying on its general knowledge, the chatbot bases its response on a set of reference documents. Thanks to this architecture, and as long as the knowledge base (i.e., the set of documents) is up-to-date and complete, the risk of hallucination is greatly reduced. Additionally, one can also return the link(s) to the source document(s) and reduce the risk of variation in the response provided.
Practically, RAG is a process comprising three major steps:
- "Retrieve": identify, in the knowledge base, the document fragments that are semantically closest to the question asked;
- "Augment": add these fragments, together with instructions, to the question transmitted to the LLM;
- "Generate": let the LLM produce the response from this enriched input.
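As a first illustration, here is a minimal sketch of these three steps in Python. The `embed_fn`, `vector_store` and `llm` objects, as well as the wording of the prompt, are placeholders for the components discussed in the levers below; nothing here is prescribed by the method itself.

```python
def answer_question(question: str, embed_fn, vector_store, llm, k: int = 4) -> str:
    """Minimal RAG sketch; embed_fn, vector_store and llm are placeholder
    components chosen under the levers discussed in this article."""
    # 1. Retrieve: find the k fragments semantically closest to the question.
    question_vector = embed_fn(question)
    fragments = vector_store.search(question_vector, top_k=k)

    # 2. Augment: combine instructions, retrieved fragments and the question.
    context = "\n\n".join(fragments)
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say that you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate: let the LLM produce the final response.
    return llm.generate(prompt)
```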
We have identified 10 levers that we believe are important points of attention in this process to adapt the RAG to your use case. The setting of these levers is linked to three factors: the desired quality of the response, the response time, and the allocated budget.
1. Choice of Documents and Their Format
The issue of document selection may seem trivial, but from experience it is often the most effective lever for improving a RAG. It is essential to ensure that the set of selected documents covers the questions users are likely to ask and is kept up-to-date as the domain evolves.
Besides the choice of documents, there is also the question of their format: if the documents are available in a tagged format (such as HTML or Markdown), the second lever described below becomes easier to implement.
2. Segmenting Documents into Fragments: Method, Size, and Overlapping
Depending on the size of the documents selected in lever 1, it is usually necessary to segment them into fragments called "chunks." This is needed because an LLM cannot process more than a certain number of information units (called "tokens," usually words or parts of words); this threshold is known as the "context window." The limit is particularly decisive in the context of RAG, since the document fragments selected during the "Retrieve" step are added to the user's question as additional information. At a minimum, the size of the fragments must be defined so as not to exceed this permissible information threshold. A more sophisticated approach also considers the impact of fragment size on the quality of the response: depending on the use case, fragments that are too small may prevent the LLM from accessing all the information needed to answer the question, while fragments that are too large may dilute the useful information within a fragment and thus prevent the "Retrieve" step from retrieving it.
Fragment splitting techniques are discussed in detail in "The 5 Levels Of Text Splitting For Retrieval". The author distinguishes 5 types of techniques, by increasing level of sophistication (a minimal example of the first level is sketched after this list):
- character splitting: fixed-size chunks, possibly with overlap;
- recursive character splitting: splitting on a hierarchy of separators (paragraphs, then sentences, then words) to respect the structure of the text;
- document-specific splitting: splitting adapted to the format of the document (Markdown, HTML, code, etc.);
- semantic splitting: using embeddings to place boundaries where the meaning changes;
- agentic splitting: delegating the choice of boundaries to an LLM.
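As an illustration, here is a minimal sketch of the first, simplest level: fixed-size character chunks with overlap. The chunk size and overlap values are arbitrary placeholders to be tuned to the use case, as discussed above.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Level-1 splitting: fixed-size character chunks with overlap.

    chunk_size and overlap are illustrative values; the right settings depend
    on the context window and on the retrieval quality observed (lever 10).
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```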
3. Choice of Fragment Representation
As previously discussed, fragments are translated into digital signatures, that is, numerical vectors (embeddings). The question remains, however: what exactly do we encode? The usual choice is to encode the raw text of the fragment. More advanced approaches propose encoding, in addition or instead, a summary and/or keywords related to the fragment; an LLM can be used to generate this additional information.
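A possible sketch of this lever is shown below, assuming hypothetical `summarize` and `extract_keywords` helpers (typically implemented with a call to an LLM) and whichever embedding function is chosen under lever 5.

```python
def build_representations(fragment: str, embed_fn, summarize=None, extract_keywords=None) -> dict:
    """Return one vector per chosen representation of a fragment.

    summarize and extract_keywords are hypothetical helpers (typically LLM
    calls); embed_fn is the embedding model chosen under lever 5.
    """
    representations = {"raw": fragment}
    if summarize is not None:
        representations["summary"] = summarize(fragment)
    if extract_keywords is not None:
        representations["keywords"] = extract_keywords(fragment)
    return {name: embed_fn(text) for name, text in representations.items()}
```

Whichever representation is matched at query time, the fragment stored alongside it should remain the raw text, so that the LLM receives the full content during the "Augment" step.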
4. Choice of the User’s Question Representation
The same question applies to the representation of the user's question: one can encode the raw question or rework it first. This reworking can correct language errors, add alternative formulations such as synonyms, or split the initial question into a series of more general sub-questions.
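As an illustration, a question-rewriting step might look as follows, here using the OpenAI Python client. The model name and the instruction are assumptions to be adapted; the same pattern can be used to generate synonyms or sub-questions instead.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_question(question: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to clean up the user's question before it is embedded."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Rewrite the user's question: fix spelling and grammar, "
                        "expand abbreviations, and keep the original meaning."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```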
5. Choice of Model for Digital Signature
Transforming the two representations described above into a digital signature requires choosing a model. This model can be open access (for example, all-MiniLM-L6-v2 from Hugging Face) or paid (for example, text-embedding-3-small from OpenAI). The choice depends on the access price, the storage cost (see lever 6), and the quality of the signature. Regarding quality, one can expect it to be better for a given use case if the corpus used to train the model substantially covers the target domain. The access price, in the case of a paid model, is generally linked to the number of tokens transmitted to the model. The storage price, in the case of a paid database, is influenced by the size of the signature (vector dimension) and its numerical precision.
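A small sketch with the open-access model mentioned above shows how vector dimension and numerical precision translate into storage cost; the printed figures are specific to all-MiniLM-L6-v2.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Open-access model quoted in the text; swap in another name to compare models.
model = SentenceTransformer("all-MiniLM-L6-v2")

vectors = model.encode(["What is Retrieval Augmented Generation?"])
print(vectors.shape)   # (1, 384): this model produces 384-dimensional signatures
print(vectors.dtype)   # float32 by default

# Storage grows with dimension x precision: halving the precision roughly
# halves the footprint, at the cost of some signature quality.
vectors_fp16 = vectors.astype(np.float16)
print(vectors.nbytes, vectors_fp16.nbytes)   # 1536 vs 768 bytes for this vector
```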
6. Choice of Database for Signature Storage and Fragment Retrieval Algorithm
The signatures associated with the fragments are stored in a database, which can be open access with local storage (for example, Chroma) or paid with storage usually in a cloud (for example, Pinecone). Alongside the database itself, we also need to consider the algorithms available to identify the fragments semantically closest to the question posed by the user. In this area, a key distinction separates exhaustive algorithms that compare the question against every fragment signature (optimal result, but time-consuming) from approximate approaches that look for good semantic matches (no guarantee of optimality, but much faster).
Finally, it is also worth mentioning research on algorithms designed to retrieve a set of fragments with varied contents, offering semantic diversity among the retrieved fragments. The choice of the database and the retrieval algorithm depends on the three factors discussed previously (the desired quality of the response, the response time, and the budget). For this part, as for the other levers, we suggest first testing an open-access approach to verify whether it provides satisfactory results in terms of quality and speed of response, and only then considering a paid solution. It is important to note, specifically for this lever, that choosing a paid database involves considering the various costs associated with its use; in particular, any data storage fees will be paid throughout the life of the RAG, regardless of its level of use.
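As an example of the open-access route suggested above, here is a minimal sketch using Chroma's in-memory client; the collection name and fragments are illustrative. Chroma computes embeddings with its built-in default model unless vectors are supplied explicitly (for example, those produced under lever 5), and it relies on approximate nearest-neighbour indexing rather than an exhaustive scan.

```python
import chromadb

# Local, open-access vector database (in-memory here; a persistent client also exists).
client = chromadb.Client()
collection = client.create_collection(name="knowledge_base")

# Index the fragments produced by the splitting step (lever 2).
collection.add(
    documents=["Fragment about RAG...", "Fragment about chunking..."],
    ids=["frag-1", "frag-2"],
)

# Retrieve the fragments semantically closest to the user's question.
results = collection.query(
    query_texts=["How does Retrieval Augmented Generation work?"],
    n_results=2,
)
print(results["documents"][0])
```

The `n_results` parameter is precisely the number of fragments discussed in the next lever.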
7. Choice of the Number of Fragments Extracted from the Database
The last lever related to the "Retrieve" step involves defining the number of fragments to retrieve. This choice interacts with all the elements of the RAG architecture linked to the LLM's permissible information threshold (discussed under lever 2): the permissible length of the user's question, the size of the fragments (see lever 2), the instructions given to the LLM (see lever 8) and the choice of the LLM (see lever 9). The more fragments given to the LLM, the more likely it is that the information needed to answer the question will be found among them; however, the more likely it also is that unnecessary or redundant information will be included. Note that if the "Retrieve" step is effective (and thus systematically retrieves useful fragments), we tend to decrease this number, whereas if it is less efficient we tend to transmit the maximum number of fragments and leave the LLM more freedom to sort them out.
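A rough back-of-the-envelope calculation can help bound this number; the sketch below simply divides the remaining token budget by the average fragment size. All figures are illustrative, and in practice the model's own tokenizer should be used to count tokens exactly.

```python
def max_fragments(context_window: int, avg_fragment_tokens: int,
                  instruction_tokens: int, question_tokens: int,
                  answer_budget: int) -> int:
    """Rough upper bound on the number of fragments that fit in the context window."""
    available = context_window - instruction_tokens - question_tokens - answer_budget
    return max(available // avg_fragment_tokens, 0)

# Illustrative figures: an 8,192-token window, 500-token fragments,
# 300 tokens of instructions, a 100-token question, 800 tokens kept for the answer.
print(max_fragments(8192, 500, 300, 100, 800))   # -> 13
```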
8. Choice of Instructions Given to the LLM
Once the useful fragments have been retrieved, the second step, "Augment," consists of giving instructions to the LLM. The content of these instructions may vary depending on the level of precision desired by the designer. Experience shows that the more precise the instructions given to the LLM, the higher the quality of its responses. The article by Santu et al. [1] discusses 5 levels of instruction. In the context of RAG, several pieces of information can be included in these instructions to make them as complete as possible.
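As an illustration only (the exact content of the instructions is use-case dependent and not prescribed here), a prompt template for the "Augment" step could combine a role, rules for using the fragments, the expected behaviour when the answer is missing, the fragments themselves and the question:

```python
# Illustrative template; wording, role and output constraints are assumptions to adapt.
RAG_PROMPT_TEMPLATE = """You are an assistant for the {domain} domain.

Answer the question using only the numbered excerpts below.
If the excerpts do not contain the answer, reply that you do not know.
Cite the excerpt numbers you used, and answer in the language of the question.

Excerpts:
{fragments}

Question: {question}
Answer:"""

def build_prompt(domain: str, fragments: list[str], question: str) -> str:
    """Assemble the instructions, the retrieved fragments and the user's question."""
    numbered = "\n".join(f"[{i + 1}] {frag}" for i, frag in enumerate(fragments))
    return RAG_PROMPT_TEMPLATE.format(domain=domain, fragments=numbered, question=question)
```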
9. Choice of LLM
The choice of LLM also depends on the three factors previously discussed (quality, response time, and budget). However, this choice is not easy to make objectively, given how dynamic the LLM market is, with strong competition between open-access and paid actors. Regarding quality, the trend seems to be towards LLMs of comparable levels (a segment until recently dominated by GPT-4).
The response time, however, is not determined by the model alone: it also depends on the infrastructure hosting the LLM, which in turn raises budgetary considerations. For example, running an LLM on a standard laptop or desktop is difficult in the context of a RAG because the response time also depends on the length of the text given to the LLM, and this length grows significantly with the transmitted fragments. For a test, or for a RAG serving a moderate number of users, a paid LLM may be the simplest way to obtain responses within a satisfactory timeframe without investing in potentially costly infrastructure (possibly including a GPU). Another option is to host an open-access LLM in the cloud, but the response time must then be verified.
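A simple way to compare hosting options is to measure the end-to-end latency of a generation call for prompts of different lengths; `generate_fn` below stands for whichever LLM client is chosen (local or hosted).

```python
import time

def time_generation(generate_fn, prompt: str) -> float:
    """Measure the wall-clock latency of one generation call in seconds.

    Comparing this figure for short and long prompts shows how much the
    transmitted fragments weigh on the response time of a given setup.
    """
    start = time.perf_counter()
    generate_fn(prompt)
    return time.perf_counter() - start
```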
10. Choice of Evaluation Method and Measures
To improve the RAG during its implementation and operation, it is necessary to determine how the quality of the responses will be judged. The most common but also most time-consuming approach involves having the results of the chatbot’s use analyzed by one or more experts in the relevant field.
This step may be preceded by an initial evaluation assisted by AI, for example by having an LLM compare the chatbot's responses against reference answers.
Besides this global evaluation of the quality of the chatbot's responses, the "Retrieve" part can also be evaluated separately from the "Generate" part. In the first case, we focus on the proportion of cases where the fragments needed to answer are retrieved at the end of the "Retrieve" step; in the second, on the proportion of cases where a satisfactory response is produced when the correct fragments are transmitted to the LLM.
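As an illustration of evaluating the "Retrieve" step on its own, the sketch below computes the proportion of test questions for which at least one relevant fragment appears among the k retrieved fragments. The format of the evaluation set and the `retrieve_fn` interface are assumptions.

```python
def retrieval_hit_rate(eval_set, retrieve_fn, k: int = 4) -> float:
    """Proportion of questions for which a relevant fragment is retrieved.

    eval_set is a list of (question, relevant_fragment_ids) pairs, typically
    built by hand or with LLM assistance; retrieve_fn(question, k) is assumed
    to return the ids of the k fragments recovered for that question.
    """
    hits = 0
    for question, relevant_ids in eval_set:
        retrieved_ids = retrieve_fn(question, k)
        if set(retrieved_ids) & set(relevant_ids):
            hits += 1
    return hits / len(eval_set)
```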
Conclusion
In this article, we first introduced a method called "Retrieval Augmented Generation" that makes a chatbot more effective at answering factual questions from a domain of interest. The essence of the method is to transmit to the chatbot the references relevant to the question, these references being stored in an external knowledge base linked to that domain. We then discussed 10 levers that can be adjusted to your use case so that the resulting solution meets your expectations in terms of response quality, response time, and allocated budget. At CETIC, we are currently investigating the possibility of automatically deploying a RAG architecture with optimal choices for these levers according to the given use case.
Christian Colot, Senior R&D Engineer CETIC
Laurie Ceccotti, Researcher CETIC
[1] Santu, Shubhra Kanti Karmaker, and Dongji Feng. "TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks." arXiv preprint arXiv:2305.11430, 2023.