In this article at OpenGenus, we explore Retrieval Augmented Generation (RAG), a technique developed by Meta (formerly Facebook) for giving Large Language Models (LLMs) access to external knowledge without finetuning them on it.
Table of contents:
- Drawback of LLM
- RAG: the solution by Meta
- Add RAG in LLM pipeline
The RAG technique was introduced in the paper titled "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", which was published on 12 April 2021. The paper was written by 12 researchers from Facebook AI Research, University College London and New York University.
As of September 2023, the paper had been cited over 850 times, making it one of the influential papers in the field of Large Language Models (LLMs).
Drawback of LLM
Large Language Models are massive and often require finetuning on application-specific data for internal use cases, unless we are using them for generic purposes like text summarization. Finetuning is a memory- and compute-heavy task, and full finetuning requires us to update all the weights of the LLM.
Even parameter-efficient techniques like LoRA (Low-Rank Adaptation), which are often paired with quantization to save memory, may not always result in satisfactory performance. Another major limitation of finetuning is the need to finetune the LLM again whenever the knowledge base changes, which leads to high compute costs.
RAG: the solution by Meta
RAG, introduced by Meta AI, is a method that combines information retrieval with text generation to supply additional context to the LLM without having to train the LLM on the external knowledge base.
For complex and knowledge-intensive tasks, providing the LLM with relevant contextual information can improve the consistency and reliability of the generated responses and mitigate the problem of hallucination. In simple words, RAG is like an external expert that hands the LLM the most relevant information available as supporting context, so that the output generated by the LLM is more likely to be correct and consistent.
Add RAG in LLM pipeline
To introduce RAG into the LLM pipeline, one needs to chunk the external domain-specific knowledge base into small documents of around 150 words each, create an embedding for each chunk using a pretrained model, and store the resulting document vectors in a vector database. When an input query arrives, it is embedded in the same way, and the most relevant chunks are fetched from this external database using a metric like cosine similarity. The retrieved chunks are concatenated with the input prompt as additional context, and this combination is then passed to the text generator to produce the output response.
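The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the paper: the `embed` function is a toy bag-of-words counter standing in for a pretrained embedding model, the `index` list stands in for a vector database, and the sample `knowledge_base` and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text, max_words=150):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for a pretrained embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, index, top_k=2):
    """Return the top_k chunks whose vectors are most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query, contexts):
    """Concatenate the retrieved context with the user query."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"


# Hypothetical domain-specific knowledge base (illustrative data only).
knowledge_base = [
    "The refund window for annual plans is 30 days from the purchase date.",
    "Support is available by email on weekdays between 9 am and 5 pm.",
    "Annual plans can be upgraded at any time from the billing page.",
]

# "Vector database": a list of (chunk, vector) pairs built once, up front.
index = [(chunk, embed(chunk)) for doc in knowledge_base for chunk in chunk_words(doc)]

query = "What is the refund window for annual plans?"
top_chunks = retrieve(query, index, top_k=1)
prompt = build_prompt(query, top_chunks)
print(prompt)  # this prompt is what would be sent to the text generator
```

In a real pipeline, `embed` would call a pretrained embedding model and `index` would live in a dedicated vector database, but the flow of chunk, embed, retrieve and concatenate stays the same.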
According to the paper, RAG generates responses that are more specific, diverse and factual than those of a parametric-only baseline. The parametric knowledge stored in an LLM's weights (even after traditional finetuning) is static. RAG enables us to bypass retraining and get access to the latest information for generating reliable outputs via retrieval-based generation.
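To make the "no retraining" point concrete, here is a small sketch in which new knowledge becomes available simply by adding a document to the store. Everything here is hypothetical: `TinyStore` is an in-memory stand-in for a vector database, the documents are made-up examples, and relevance is scored by simple token overlap rather than embedding similarity to keep the example short.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercased word counts; a toy substitute for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


class TinyStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.docs = []

    def add(self, text):
        # Updating the knowledge base is just an insert; no model weights change.
        self.docs.append((text, tokens(text)))

    def best_match(self, query):
        q = tokens(query)
        # Score each document by the number of tokens it shares with the query.
        return max(self.docs, key=lambda d: sum((q & d[1]).values()))[0]


store = TinyStore()
old_doc = "The current pricing plan costs 10 dollars per month."
new_doc = "After the 2024 update the pricing plan costs 12 dollars per month."
query = "What does the pricing plan cost after the 2024 update?"

store.add(old_doc)
before = store.best_match(query)  # only the old document is available

store.add(new_doc)                # knowledge base changes, the LLM does not
after = store.best_match(query)   # the newer document is now retrieved
```

The generator itself is untouched between the two lookups; only the retrieved context changes.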
Retrieval-based techniques like RAG have become quite popular of late, and are being combined with state-of-the-art LLM applications like ChatGPT to enhance factual consistency and reduce hallucination.