Finetuning a model means updating the model's weights over a set of data to improve it in a variety of ways. This can include improving the quality of outputs, reducing hallucinations, baking more knowledge into the model itself, and reducing latency/cost.

The core of our toolkit revolves around in-context learning / retrieval augmentation, which uses models in inference mode rather than training them.

While finetuning can also be used to “augment” a model with external data, it can complement retrieval augmentation in a variety of ways:

Embedding Finetuning Benefits

  • Finetuning the embedding model can produce more meaningful embedding representations over a training distribution of data, which leads to better retrieval performance.
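
As a toy illustration of what “better representations” buys you: retrieval typically ranks documents by cosine similarity between query and document embeddings, so an embedding model finetuned on your domain pushes relevant documents up the ranking. A minimal sketch (the function names here are our own, not a LlamaIndex API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_emb, doc_embs):
    """Return document indices sorted by similarity to the query, best first."""
    scores = [cosine_similarity(query_emb, d) for d in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)
```

A finetuned embedding model changes the vectors themselves; the ranking logic above stays the same.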

LLM Finetuning Benefits

  • Allow it to learn a style over a given dataset

  • Allow it to learn a DSL that might be less represented in the training data (e.g. SQL)

  • Allow it to correct hallucinations/errors that might be hard to fix through prompt engineering

  • Allow it to distill a better model (e.g. GPT-4) into a simpler/cheaper model (e.g. gpt-3.5, Llama 2)

Integrations with LlamaIndex

This is an evolving guide, and there are currently three key integrations with LlamaIndex. Please check out the sections below for more details!

  • Finetuning embeddings for better retrieval performance

  • Finetuning Llama 2 for better text-to-SQL

  • Finetuning gpt-3.5-turbo to distill gpt-4

Finetuning Embeddings for Better Retrieval Performance

We created a comprehensive repo/guide showing you how to finetune an open-source embedding model (in this case, bge) over an unstructured text corpus. It consists of the following steps:

  1. Generating a synthetic question/answer dataset using LlamaIndex over any unstructured context.

  2. Finetuning the model.

  3. Evaluating the model.

Finetuning gives you a 5-10% increase in retrieval evaluation metrics. You can then plug this fine-tuned model into your RAG application with LlamaIndex.


Finetuning GPT-3.5 to Distill GPT-4

We have multiple guides showing how to use OpenAI’s finetuning endpoints to finetune gpt-3.5-turbo to output GPT-4 responses for RAG/agents.

We use GPT-4 to automatically generate questions from any unstructured context, and use a GPT-4 query engine pipeline to generate “ground-truth” answers. Our OpenAIFineTuningHandler callback automatically logs questions/answers to a dataset.

We then launch a finetuning job, and get back a distilled model. We can evaluate this model with Ragas to benchmark against a naive GPT-3.5 pipeline.
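
The logging step can be approximated as follows: each question and its GPT-4-generated answer is written as one example in OpenAI's chat-format finetuning JSONL. This is a hedged sketch with a helper name of our own, not the actual OpenAIFineTuningHandler implementation:

```python
import json

def write_finetuning_dataset(qa_pairs, path):
    """Write (question, answer) pairs as chat-format JSONL finetuning examples."""
    with open(path, "w") as f:
        for question, answer in qa_pairs:
            example = {
                "messages": [
                    {"role": "user", "content": question},
                    {"role": "assistant", "content": answer},
                ]
            }
            f.write(json.dumps(example) + "\n")
```

The resulting file is what gets uploaded when launching the finetuning job.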


[WIP] Finetuning GPT-3.5 to Memorize Knowledge

We have an experimental guide showing how to use OpenAI finetuning to memorize a body of text. Still a WIP! Not quite as good as RAG yet.

Finetuning Llama 2 for Better Text-to-SQL

In this tutorial, we show you how you can finetune Llama 2 on a text-to-SQL dataset, and then use it for structured analytics against any SQL database using LlamaIndex abstractions.

The stack includes sql-create-context as the training dataset, OpenLLaMa as the base model, PEFT for finetuning, Modal for cloud compute, and LlamaIndex for inference abstractions.
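
Each sql-create-context record pairs a natural-language question and a CREATE TABLE context with the target SQL. During finetuning these are formatted into prompt/completion pairs; the template below is an illustrative sketch, not the exact one used in the tutorial:

```python
def format_text_to_sql_example(record):
    """Format a sql-create-context record into a prompt/completion training pair."""
    prompt = (
        "You are a text-to-SQL model. Given the table schema, write a SQL query "
        "that answers the question.\n"
        f"Schema: {record['context']}\n"
        f"Question: {record['question']}\n"
        "SQL: "
    )
    return {"prompt": prompt, "completion": record["answer"]}
```

At inference time, LlamaIndex's abstractions fill in the schema and question from the connected SQL database and the user query.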