Open In Colab

Fine-tuning to Memorize Knowledge#

In this tutorial we experiment with some basic approaches to “baking in” knowledge with fine-tuning:

  • Synthesizing questions from existing context

  • Trying text completion

If you’re opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

!pip install llama-index
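This notebook also uses document readers from LlamaHub in the Load Data step below; depending on your environment, you may additionally need to install them (assuming the standard PyPI package names):

!pip install llama-hub pymupdf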
import os
import openai
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index import VectorStoreIndex, SummaryIndex
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]

Load Data#

!mkdir data && wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
mkdir: data: File exists
from pathlib import Path
from llama_hub.file.pdf.base import PDFReader
from llama_hub.file.unstructured.base import UnstructuredReader
from llama_hub.file.pymu_pdf.base import PyMuPDFReader
loader = PyMuPDFReader()
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))
from llama_index import Document

doc_text = "\n\n".join([d.get_content() for d in docs0])
metadata = {
    "paper_title": "Llama 2: Open Foundation and Fine-Tuned Chat Models"
}
docs = [Document(text=doc_text, metadata=metadata)]
print(docs[0].get_content())
from llama_index.callbacks import CallbackManager

callback_manager = CallbackManager([])

gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-0613", temperature=0.3),
    callback_manager=callback_manager,
)
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-4-0613", temperature=0.3),
    callback_manager=callback_manager,
)

Generate Dataset#

from llama_index.evaluation import DatasetGenerator
from llama_index.node_parser import SentenceSplitter

# try evaluation modules
from llama_index.evaluation import RelevancyEvaluator, FaithfulnessEvaluator
from llama_index import PromptTemplate
node_parser = SentenceSplitter()
nodes = node_parser.get_nodes_from_documents(docs)
from tqdm.notebook import tqdm
import json

num_questions_per_chunk = 10
question_gen_query = (
    "You are a Teacher/ Professor. Your task is to setup a quiz/examination."
    f" Using the provided context, formulate {num_questions_per_chunk} that"
    " captures an important fact from the context. \nYou MUST obey the"
    " following criteria:\n- Restrict the question to the context information"
    " provided.\n- Do NOT create a question that cannot be answered from the"
    " context.\n- Phrase the question so that it does NOT refer to specific"
    ' context. For instance, do NOT put phrases like "given provided context"'
    ' or "in this work" in the question, because if the question is asked'
    " elsewhere it wouldn't be provided specific context. Replace these"
    " terms with specific details.\nBAD questions:\nWhat did the author do in"
    " his childhood\nWhat were the main findings in this report\n\nGOOD"
    " questions:\nWhat did Barack Obama do in his childhood\nWhat were the"
    " main findings in the original Transformers paper by Vaswani et"
    " al.\n\nGenerate the questions below:\n"
)

# go through each node one at a time -
# generate questions, filter using eval modules, and dump to file

fp = open("data/qa_pairs.jsonl", "w")
for idx, node in enumerate(nodes):
    dataset_generator = DatasetGenerator(
        [node],
        question_gen_query=question_gen_query,
        service_context=gpt_4_context,
        metadata_mode="all",
    )
    node_questions_0 = dataset_generator.generate_questions_from_nodes(num=10)
    print(f"[Node {idx}] Generated questions:\n {node_questions_0}")
    # for each question, get a response
    for question in tqdm(node_questions_0):
        index = SummaryIndex([node], service_context=gpt_35_context)
        query_engine = index.as_query_engine()
        response = query_engine.query(question)
        out_dict = {"query": question, "response": str(response)}
        print(f"[Node {idx}] Outputs: {out_dict}")
        fp.write(json.dumps(out_dict) + "\n")

fp.close()
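To spot-check what was generated, you can optionally peek at the first few QA pairs in the file:

# print the first few generated QA pairs (optional spot check)
with open("data/qa_pairs.jsonl", "r") as f:
    for i, line in enumerate(f):
        if i >= 3:
            break
        print(json.loads(line))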

Filter out questions using an LLM relevancy check#

Do a second pass (using GPT-4 as a judge) to make sure only questions that can be answered by the context make it into the training set.

# try evaluation modules
from llama_index.evaluation import RelevancyEvaluator, FaithfulnessEvaluator
from llama_index import PromptTemplate
from llama_index.llms import OpenAI
query_eval_tmpl = PromptTemplate(
    "Your task is to evaluate the following: If the response for the query"
    " isn't able to answer the question provided.\nIf query isn't able to"
    " answer the question, answer NO.\nOtherwise answer YES.\nTo elaborate,"
    " you might get an answer like the following: 'The context does not"
    " contain the answer to this question.'Please return NO in that case. You"
    " be given the query and response. Return YES or NO as the answer.\nQuery:"
    " \n {query_str}\nResponse: \n {response_str}\nAnswer: "
)

eval_llm = OpenAI(model="gpt-4-0613")
def filter_data(path: str, out_path: str):
    """Keep only QA pairs whose response actually answers the query."""
    with open(path, "r") as fp, open(out_path, "w") as out_fp:
        for idx, line in enumerate(fp):
            qa_pair = json.loads(line)
            eval_result = eval_llm.complete(
                query_eval_tmpl.format(
                    query_str=qa_pair["query"], response_str=qa_pair["response"]
                )
            )
            print(f"[{idx}] QA Pair: {qa_pair} \n Eval: {eval_result}")
            # only keep lines that the evaluator judges as answerable (YES)
            if "NO" not in str(eval_result):
                out_fp.write(line)
filter_data("data/qa_pairs.jsonl", "data/qa_pairs_2.jsonl")
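To see how aggressive the filter was, you can optionally compare the number of QA pairs before and after filtering:

# compare number of QA pairs before and after filtering (optional)
with open("data/qa_pairs.jsonl") as before, open("data/qa_pairs_2.jsonl") as after:
    print(sum(1 for _ in before), "->", sum(1 for _ in after))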

Split into Training and Validation Sets#

We split into training and validation sets.

NOTE: We shuffle the data before splitting. This helps ensure that the training data has coverage throughout the document.
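The split below shuffles with Python's global random module; if you want the train/val split to be reproducible across runs, you can optionally seed it first (the seed value here is arbitrary):

import random

random.seed(42)  # arbitrary fixed seed: makes the shuffle, and hence the split, deterministic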

from copy import deepcopy
import random


def split_train_val(
    path: str, out_train_path: str, out_val_path: str, train_split=0.7
):
    with open(path, "r") as fp:
        lines = fp.readlines()

        # shuffle the lines to make sure that the "train questions" cover most of the context
        shuffled_lines = deepcopy(lines)
        random.shuffle(shuffled_lines)

        split_idx = int(train_split * len(shuffled_lines))
        train_lines = shuffled_lines[:split_idx]
        val_lines = shuffled_lines[split_idx:]
        with open(out_train_path, "w") as out_fp:
            out_fp.write("".join(train_lines))

        with open(out_val_path, "w") as out_fp:
            out_fp.write("".join(val_lines))
split_train_val(
    "data/qa_pairs_2.jsonl",
    "data/qa_pairs_train.jsonl",
    "data/qa_pairs_val.jsonl",
)

Format into Training Data#

Format into training data for OpenAI’s finetuning endpoints.

NOTE: We don’t use our OpenAIFinetuningHandler because that logs the full input prompt (including retrieved context) as the user message. Here we just want to log the query as the user message, because we want the knowledge to be “baked into” the fine-tuned gpt-3.5-turbo model rather than supplied as context.

fp = open("data/qa_pairs_train.jsonl", "r")
out_fp = open("data/qa_pairs_openai.jsonl", "w")
# TODO: try with different system prompts
system_prompt = {
    "role": "system",
    "content": (
        "You are a helpful assistant helping to answer questions about the"
        " Llama 2 paper."
    ),
}
for line in fp:
    qa_pair = json.loads(line)
    user_prompt = {"role": "user", "content": qa_pair["query"]}
    assistant_prompt = {"role": "assistant", "content": qa_pair["response"]}
    out_dict = {
        "messages": [system_prompt, user_prompt, assistant_prompt],
    }
    out_fp.write(json.dumps(out_dict) + "\n")

fp.close()
out_fp.close()
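Optionally, before kicking off the fine-tuning job, you can sanity-check that every line of the formatted file parses and follows the expected system/user/assistant layout. This is just a minimal sketch; the fine-tune engine below also runs its own validation over the file.

# minimal sanity check of the formatted file (optional)
with open("data/qa_pairs_openai.jsonl", "r") as f:
    records = [json.loads(line) for line in f]

print(f"num examples: {len(records)}")
for rec in records:
    roles = [m["role"] for m in rec["messages"]]
    assert roles == ["system", "user", "assistant"], roles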

Fine-tune the Model#

from llama_index.finetuning import OpenAIFinetuneEngine
finetune_engine = OpenAIFinetuneEngine(
    "gpt-3.5-turbo",
    "data/qa_pairs_openai.jsonl",
    # start_job_id="<start-job-id>"  # if you have an existing job, can specify id here
)
finetune_engine.finetune()
Num examples: 597
First example:
{'role': 'system', 'content': 'You are a helpful assistant helping to answer questions about the Llama 2 paper.'}
{'role': 'user', 'content': 'Who were the early reviewers of the paper on "Llama 2: Open Foundation and Fine-Tuned Chat Models" who helped improve its quality?'}
{'role': 'assistant', 'content': 'Mike Lewis, Joelle Pineau, Laurens van der Maaten, Jason Weston, and Omer Levy were the early reviewers of the paper on "Llama 2: Open Foundation and Fine-Tuned Chat Models" who helped improve its quality.'}
No errors found
Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 50, 637
mean / median: 102.51256281407035, 90.0
p5 / p95: 66.0, 155.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 2, 588
mean / median: 50.45728643216081, 35.0
p5 / p95: 18.0, 102.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~61200 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~183600 tokens
As of August 22, 2023, fine-tuning gpt-3.5-turbo is $0.008 / 1K Tokens.
This means your total cost for training will be $0.48960000000000004 per epoch.
Waiting for file to be ready...
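As a rough sanity check on the quoted price, using only the numbers printed in the log above:

tokens_per_epoch = 61_200  # "~61200 tokens that will be charged for during training"
n_epochs = 3
price_per_1k = 0.008  # USD per 1K tokens, as quoted in the log

print(tokens_per_epoch / 1000 * price_per_1k)             # ~0.49 USD per epoch
print(tokens_per_epoch / 1000 * price_per_1k * n_epochs)  # ~1.47 USD for the full 3-epoch job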
finetune_engine.get_current_job()
<FineTuningJob fine_tuning.job id=ftjob-fk0428lntJCRh6x1GKeccv8E at 0x2b95fd6c0> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-fk0428lntJCRh6x1GKeccv8E",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1694406904,
  "finished_at": 1694409009,
  "fine_tuned_model": "ft:gpt-3.5-turbo-0613:llamaindex::7xTTW0hT",
  "organization_id": "org-1ZDAvajC6v2ZtAP9hLEIsXRz",
  "result_files": [
    "file-Ao1r7cGnYJbHqCG79zAQo6lP"
  ],
  "status": "succeeded",
  "validation_file": null,
  "training_file": "file-9ndBjJX0pZ3Do4mPhADcTOef",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": 180006,
  "error": null
}
ft_model = finetune_engine.get_finetuned_model()
ft_model
OpenAI(callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x2bdaba2f0>, model='ft:gpt-3.5-turbo-0613:llamaindex::7xTTW0hT', temperature=0.1, max_tokens=None, additional_kwargs={}, max_retries=10, class_type='openai')
# [Optional] use fine-tuned model in RAG system too
from llama_index import ServiceContext

ft_context = ServiceContext.from_defaults(
    llm=ft_model,
    callback_manager=callback_manager,
)
# baseline RAG system
ft_index = VectorStoreIndex(nodes, service_context=ft_context)
ft_query_engine = ft_index.as_query_engine()
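As a quick, optional smoke test of the fine-tuned model inside the RAG system, you can issue a query directly (the question below is just an illustrative example, not taken from the generated dataset):

# example query against the RAG system backed by the fine-tuned model
response = ft_query_engine.query(
    "What reward models are trained for Llama 2-Chat RLHF?"
)
print(str(response))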

Evaluate Results#

We run evaluations over both the validation set and the training set.

Wait, isn’t evaluating over the training set cheating?

  • It’s a sanity check of how much information the model was able to memorize from the data it was trained on.

  • The training data covers quite a bit of the paper’s content, so a model that answers the training set well is at least well-equipped to answer some questions about the paper.

from llama_index.llms import ChatMessage
def load_data(path: str):
    # read a JSONL file into a list of dicts
    with open(path, "r") as fp:
        return [json.loads(line) for line in fp]
train_dicts = load_data("data/qa_pairs_train.jsonl")
eval_dicts = load_data("data/qa_pairs_val.jsonl")
def query_model(model, d):
    # print(d)
    msgs = [
        ChatMessage(
            role="system",
            content=(
                "You are a helpful assistant helping to answer questions about"
                " the Llama 2 paper."
            ),
        ),
        ChatMessage(role="user", content=d["query"]),
    ]

    # try ft-model
    response = model.chat(msgs)
    return str(response)
response = query_model(ft_model, eval_dicts[7])
print(eval_dicts[7])
print(response)
{'query': 'What is the title of the paper discussed in the context?', 'response': 'The title of the paper discussed in the context is "Llama 2: Open Foundation and Fine-Tuned Chat Models".'}
'assistant: The title of the paper discussed in the context is "Llama 2: Open Foundation and Fine-Tuned Chat Models".'
response = query_model(ft_model, train_dicts[7])
print(train_dicts[7])
print(response)
{'query': 'How is the decision made whether to use safety context distillation or not?', 'response': 'The decision to use safety context distillation is made based on the reward model score. The safety reward model is leveraged to determine whether the context-distilled output receives a better reward model score than the original answer. If the context-distilled output gets a better reward model score, it is kept. This approach helps limit the negative impact of context distillation while still utilizing it in cases where it improves the reward model score.'}
'assistant: The decision to use safety context distillation is made based on the reward model score. If the reward model score is below a certain threshold, safety context distillation is used.'

Set up Baseline RAG system to benchmark#

We set up a baseline RAG system powered by gpt-3.5-turbo, as well as a baseline gpt-4 model without retrieval, to help benchmark the quality of the results.

# baseline RAG system
base_index = VectorStoreIndex(nodes, service_context=gpt_35_context)
base_query_engine = base_index.as_query_engine()
# baseline model
base_model = OpenAI(model="gpt-4", temperature=0.3)
query_model(base_model, eval_dicts[80])
{'query': 'How does Llama 2-Chat address the issue of spreading misinformation or conspiracy theories?', 'response': "Llama 2-Chat addresses the issue of spreading misinformation or conspiracy theories by refuting any misinformation in the prompt immediately. It emphasizes the importance of relying on scientific evidence and credible sources when evaluating historical events. The model does not promote or encourage the spread of false information and instead focuses on sharing accurate and helpful information. It also highlights the importance of fact-checking and critical thinking when assessing the validity of a claim. Llama 2-Chat's programming rules prioritize respect for truth and accuracy in all forms of communication and discourage the spread of misinformation or conspiracy theories."}
"assistant: The Llama 2 paper does not specifically address the issue of spreading misinformation or conspiracy theories. However, it does mention that the model is designed to refuse outputs that are inappropriate or harmful. This could potentially include misinformation or conspiracy theories. It also states that the model's responses are based on a mixture of licensed data, data created by human trainers, and publicly available data. The developers have also used reinforcement learning from human feedback to fine-tune the model, which can help in reducing the spread of false information. However, the specifics of how misinformation or conspiracy theories are handled are not detailed in the paper."

Run Evaluations#

We log the responses from four setups: the fine-tuned model within a RAG system, the fine-tuned model on its own, the baseline RAG system (gpt-3.5-turbo), and the baseline model (gpt-4).

We then run all responses through a GPT-4 prompt, comparing each against the ground-truth answer to measure the validity of the result.

import pandas as pd
from tqdm.notebook import tqdm

EVAL_PROMPT_TMPL = PromptTemplate(
    """\
We provide a question and the 'ground-truth' answer. We also provide \
the predicted answer.

Evaluate whether the predicted answer is correct, given its similarity \
to the ground-truth. If details provided in predicted answer are reflected \
in the ground-truth answer, return "YES". To return "YES", the details don't \
need to exactly match. Be lenient in evaluation if the predicted answer \
is missing a few details. Try to make sure that there are no blatant mistakes. \
Otherwise, return "NO".

Question: {question}
Ground-truth Answer: {gt_answer}
Predicted Answer: {pred_answer}
Evaluation Result: \
"""
)


def eval_match_gt(query, gt_response, pred_response):
    llm = OpenAI(model="gpt-4", temperature=0.0)
    fmt_prompt = EVAL_PROMPT_TMPL.format(
        question=query,
        gt_answer=gt_response,
        pred_answer=pred_response,
    )
    result = llm.complete(fmt_prompt)
    if "yes" in str(result).lower():
        return 1
    else:
        return 0


def run_evals(eval_dicts):
    """Run evals - fine-tuned model, RAG system, and base model."""

    raw_responses = []
    for eval_dict in tqdm(eval_dicts):
        gt_response = eval_dict["response"]
        ft_rag_response = str(ft_query_engine.query(eval_dict["query"]))
        ft_response = str(query_model(ft_model, eval_dict))
        rag_response = str(base_query_engine.query(eval_dict["query"]))
        base_response = str(query_model(base_model, eval_dict))

        # try evaluations
        ft_rag_eval = eval_match_gt(
            eval_dict["query"], gt_response, ft_rag_response
        )
        ft_eval = eval_match_gt(eval_dict["query"], gt_response, ft_response)
        rag_eval = eval_match_gt(eval_dict["query"], gt_response, rag_response)
        base_eval = eval_match_gt(
            eval_dict["query"], gt_response, base_response
        )

        response_dict = {
            "query": eval_dict["query"],
            "gt_response": gt_response,
            "ft_rag_response": ft_rag_response,
            "ft_response": ft_response,
            "rag_response": rag_response,
            "base_response": base_response,
            "ft_rag_eval": ft_rag_eval,
            "ft_eval": ft_eval,
            "rag_eval": rag_eval,
            "base_eval": base_eval,
        }

        raw_responses.append(response_dict)

    raw_responses_df = pd.DataFrame(raw_responses)

    eval_dict = {
        "ft_rag_score": raw_responses_df["ft_rag_eval"].mean(),
        "ft_score": raw_responses_df["ft_eval"].mean(),
        "rag_score": raw_responses_df["rag_eval"].mean(),
        "base_score": raw_responses_df["base_eval"].mean(),
    }

    sub_responses_df = raw_responses_df[
        [
            "query",
            "gt_response",
            "ft_rag_response",
            "ft_response",
            "rag_response",
            "base_response",
        ]
    ]

    return eval_dict, raw_responses_df, sub_responses_df
pd.set_option("display.max_colwidth", None)

Qualitative Evaluations#

Here we show some qualitative output examples over both the training and validation sets.

eval_dict, raw_response_df, sub_responses_df = run_evals(train_dicts[7:8])
display(eval_dict)
display(sub_responses_df)
{'ft_rag_score': 1.0, 'ft_score': 1.0, 'rag_score': 1.0, 'base_score': 0.0}
Row 0

  query: How is the decision made whether to use safety context distillation or not?

  gt_response: The decision to use safety context distillation is made based on the reward model score. The safety reward model is leveraged to determine whether the context-distilled output receives a better reward model score than the original answer. If the context-distilled output gets a better reward model score, it is kept. This approach helps limit the negative impact of context distillation while still utilizing it in cases where it improves the reward model score.

  ft_rag_response: The decision on whether to use safety context distillation or not is made based on the reward model score. The safety reward model is leveraged to determine whether the context-distilled output is preferred over the original answer. Safety context distillation is only kept on examples where it receives a better reward model score. This approach helps to limit the negative impact of context distillation while still improving the model's responses on prompts that it is not good at.

  ft_response: assistant: The decision to use safety context distillation is made based on the reward model score. If the reward model score is higher than a certain threshold, safety context distillation is used.

  rag_response: The decision to use safety context distillation is made based on the reward model score. The safety reward model is used to evaluate whether the context-distilled output gets a better reward model score than the original answer. If the context-distilled output receives a better reward model score, it is kept. This approach helps limit the negative impact of context distillation while still improving the safety of the model's responses.

  base_response: assistant: The decision to use safety context distillation in the Llama 2 paper is based on the nature of the situation and the need for safety. If the model is in a situation where it needs to generate safe responses, then safety context distillation is used. This is particularly important when the model is interacting with users in real-time, where there's a need to ensure that the outputs are safe and appropriate. The decision is not explicitly mentioned in the paper but is inferred from the context and the purpose of safety context distillation.
eval_dict, raw_response_df, sub_responses_df = run_evals(eval_dicts[6:7])
display(eval_dict)
display(sub_responses_df)
{'ft_rag_score': 1.0, 'ft_score': 0.0, 'rag_score': 1.0, 'base_score': 0.0}
Row 0

  query: What model is used to predict the truthfulness and informativeness of the generated outputs from LLMs?

  gt_response: A fine-tuned GPT-3 model, referred to as "GPT-judge," is used to predict the truthfulness and informativeness of the generated outputs from LLMs.

  ft_rag_response: A fine-tuned GPT-3 model, also known as a "GPT-judge," is used to predict the truthfulness and informativeness of the generated outputs from LLMs.

  ft_response: assistant: The model used to predict the truthfulness and informativeness of the generated outputs from LLMs is called TruthfulQA.

  rag_response: A fine-tuned GPT-3 model, referred to as a "GPT-judge," is used to predict the truthfulness and informativeness of the generated outputs from LLMs.

  base_response: assistant: The Llama 2 paper does not specify a particular model used to predict the truthfulness and informativeness of the generated outputs from Language Models. The paper primarily focuses on the limitations and risks of large language models and does not delve into specific methods or models for evaluating truthfulness or informativeness.

Quantitative Evaluations#

Here we show quantitative metrics over both the training and validation sets.

import random

k = 40

train_dicts_sample = random.sample(train_dicts, k)
eval_dicts_sample = random.sample(eval_dicts, k)
result_train, raw_response_df, sub_responses_df = run_evals(train_dicts_sample)
display(result_train)
# display(raw_response_df)
{'ft_rag_score': 0.75, 'ft_score': 0.45, 'rag_score': 0.7, 'base_score': 0.3}
# look at where ft_rag_score did well but rag didn't
d = raw_response_df
d[(d["ft_rag_eval"] == 1) & (d["rag_eval"] == 0)]
Row 11

  query: What is the five-shot performance of Llama 2 with a 70B model on the Massive Multitask Language Understanding (MMLU) benchmark?

  gt_response: The five-shot performance of Llama 2 with a 70B model on the Massive Multitask Language Understanding (MMLU) benchmark is 78.5.

  ft_rag_response: The five-shot performance of Llama 2 with a 70B model on the Massive Multitask Language Understanding (MMLU) benchmark is 78.5.

  ft_response: assistant: The five-shot performance of Llama 2 with a 70B model on the Massive Multitask Language Understanding (MMLU) benchmark is 82.7.

  rag_response: The five-shot performance of Llama 2 with a 70B model on the Massive Multitask Language Understanding (MMLU) benchmark is 85.0.

  base_response: assistant: The Llama 2 paper does not provide specific information on the five-shot performance of a 70B model on the Massive Multitask Language Understanding (MMLU) benchmark. Please refer to the original paper or contact the authors for the most accurate information.

  ft_rag_eval: 1, ft_eval: 0, rag_eval: 0, base_eval: 0

Row 22

  query: What is the role of third-party SFT data in the fine-tuning process of the Llama 2 model?

  gt_response: The role of third-party SFT data in the fine-tuning process of the Llama 2 model is to provide additional examples for aligning the model towards dialogue-style instructions. However, the context does not provide specific details about how the third-party SFT data is utilized in the fine-tuning process.

  ft_rag_response: The role of third-party SFT data in the fine-tuning process of the Llama 2 model is to align LLMs (Language Model Models) towards dialogue-style instructions. However, it has been found that many of these third-party SFT datasets have insufficient diversity and quality. As a result, the focus was shifted to collecting several thousand examples of high-quality SFT data. By using fewer but higher-quality examples from vendor-based annotation efforts, the results notably improved. The findings are similar to previous research that suggests a limited set of clean instruction-tuning data can be sufficient to achieve a high level of quality. It was found that tens of thousands of SFT annotations were enough to achieve a high-quality result in the fine-tuning process of the Llama 2 model.

  ft_response: assistant: The role of third-party SFT data in the fine-tuning process of the Llama 2 model is to provide additional training data for certain categories. This data is used to improve the performance of the model in those specific categories.

  rag_response: The context does not provide information about the role of third-party SFT data in the fine-tuning process of the Llama 2 model.

  base_response: assistant: I'm sorry for any confusion, but as of my current knowledge, there is no specific model called "Llama 2". Therefore, I can't provide information about the role of third-party SFT data in its fine-tuning process. However, I can tell you that in general, third-party SFT (Supervised Fine-Tuning) data is often used in machine learning to improve the performance of a model by training it on a specific task. If you have any more details or if there's a specific context, I'd be happy to help further!

  ft_rag_eval: 1, ft_eval: 1, rag_eval: 0, base_eval: 0

Row 35

  query: What measures have been taken to mitigate the generation of harmful, offensive, or biased content by Llama 2?

  gt_response: Efforts have been made to mitigate the generation of harmful, offensive, or biased content by Llama 2. The model has undergone fine-tuning to address these issues, although some issues may still remain, especially for languages other than English where publicly available datasets were not available. OpenAI acknowledges the potential for harmful, offensive, or biased content due to the training on publicly available online datasets. They have attempted to balance safety with helpfulness, but in some instances, the safety tuning may go too far, resulting in an overly cautious approach. OpenAI encourages users of Llama 2 to be cautious and take extra steps in tuning and deployment as described in their Responsible Use Guide.

  ft_rag_response: Instruction fine-tuning has been used to mitigate the generation of harmful, offensive, or biased content by Llama 2. This process has shown to effectively reduce the percentage of toxic model generations, resulting in an effectively zero percentage of toxic generations for the fine-tuned Llama 2-Chat models of all sizes. The results of the toxicity analysis for the fine-tuned models are not presented in Table 45 because of this significant reduction in toxic content generation.

  ft_response: assistant: Llama 2 has been trained on a filtered and curated dataset to mitigate the generation of harmful, offensive, or biased content. Additionally, a reward model has been used to fine-tune Llama 2-Chat to further align the model with human preferences and to reduce the generation of harmful and untruthful outputs. The reward model is based on the comparison of model responses, where human annotators rank the responses according to their quality. This iterative process of training and evaluation has helped in mitigating the generation of harmful, offensive, or biased content by Llama 2.

  rag_response: Llama 2 has taken measures to mitigate the generation of harmful, offensive, or biased content. One of the measures is the use of the ToxiGen dataset, which contains implicitly toxic and benign sentences mentioning minority groups. Llama 2 uses a revised version of this dataset to reduce noise and filter out prompts for which annotators disagree on the target demographic group. Additionally, Llama 2 employs a default ToxiGen classifier tuned on RoBERTa to measure the toxicity of its generations. Furthermore, Llama 2 conducts a sentiment analysis using the Valence Aware Dictionary and Sentiment Reasoner (VADER) to evaluate the sentiments conveyed by the combination of prompt prefix and model generation. These measures help in analyzing and addressing the potential biases and toxicity in the generated content.

  base_response: assistant: The Llama 2 paper does not provide specific details on measures taken to mitigate the generation of harmful, offensive, or biased content. However, it's common for AI models like Llama 2 to incorporate various strategies to ensure responsible use. These may include the use of reinforcement learning from human feedback to reduce harmful and untruthful outputs, and the use of guidelines for human reviewers during the fine-tuning process to avoid potential biases. Additionally, efforts are often made to improve the default behavior of the system, and provide users with customization options to define the AI's values within broad bounds. Please refer to the original source or the organization behind Llama 2 for more specific information.

  ft_rag_eval: 1, ft_eval: 1, rag_eval: 0, base_eval: 0
result_eval, raw_response_df, sub_responses_df = run_evals(eval_dicts_sample)
display(result_eval)
# display(raw_response_df)
{'ft_rag_score': 0.825,
 'ft_score': 0.375,
 'rag_score': 0.775,
 'base_score': 0.225}