Recursive Retriever + Node References

This guide shows how you can use recursive retrieval to traverse node relationships and fetch nodes based on “references”.

Node references are a powerful concept. When you first perform retrieval, you may want to retrieve a reference rather than the raw text, and then follow that reference to the node it points to. Multiple references can point to the same node.

In this guide we explore some different usages of node references:

  • Chunk references: Different chunk sizes referring to a bigger chunk

  • Metadata references: Summaries + Generated Questions referring to a bigger chunk
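To make the idea concrete, here is a minimal sketch in plain Python of how several references can resolve to the same node. It is not the LlamaIndex implementation: the `Reference` type, the `resolve` helper, the example scores, and the choice to keep the best score per target are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """A small retrievable item that points at a larger node (hypothetical type)."""

    text: str
    target_id: str


def resolve(hits, node_store):
    """Map scored reference hits to target nodes, keeping the best score per node."""
    best = {}
    for ref, score in hits:
        if score > best.get(ref.target_id, float("-inf")):
            best[ref.target_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(node_id, node_store[node_id], score) for node_id, score in ranked]


# Toy data: two parent nodes and three references, two of which share a target.
node_store = {
    "node-0": "Full text of the first parent chunk ...",
    "node-1": "Full text of the second parent chunk ...",
}
hits = [
    (Reference("summary of node-1", "node-1"), 0.91),
    (Reference("question about node-0", "node-0"), 0.85),
    (Reference("sub-chunk of node-1", "node-1"), 0.80),
]

resolved = resolve(hits, node_store)
for node_id, text, score in resolved:
    print(node_id, score)
```

Three references go in, but only two nodes come out, because both references to `node-1` collapse into a single result.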

%load_ext autoreload
%autoreload 2

Load Data + Setup

In this section we download the Llama 2 paper and create an initial set of nodes (chunk size 1024).

!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
from pathlib import Path
from llama_hub.file.pdf.base import PDFReader
from llama_index.response.notebook_utils import display_source_node
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.llms import OpenAI
import json
loader = PDFReader()
docs0 = loader.load_data(file=Path("./data/llama2.pdf"))
from llama_index import Document

doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import IndexNode
node_parser = SimpleNodeParser.from_defaults(chunk_size=1024)
base_nodes = node_parser.get_nodes_from_documents(docs)
# set node ids to be a constant
for idx, node in enumerate(base_nodes):
    node.id_ = f"node-{idx}"
from llama_index.embeddings import resolve_embed_model

embed_model = resolve_embed_model("local:BAAI/bge-small-en")
llm = OpenAI(model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

Baseline Retriever

Define a baseline retriever that simply fetches the top-k raw text nodes by embedding similarity.
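As a rough illustration of what "top-k by embedding similarity" means, the sketch below ranks a handful of toy vectors by cosine similarity to a query vector. The two-dimensional vectors and the helper names `cosine_similarity` and `top_k` are made up for this example; the real index uses the embedding model configured above and its own similarity code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=2):
    """Return the k (node_id, score) pairs with the highest similarity."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumed values, not produced by a real model).
node_vecs = {
    "node-0": [1.0, 0.0],
    "node-1": [0.0, 1.0],
    "node-2": [1.0, 1.0],
}
results = top_k([1.0, 0.2], node_vecs, k=2)
for node_id, score in results:
    print(node_id, round(score, 3))
```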

base_index = VectorStoreIndex(base_nodes, service_context=service_context)
base_retriever = base_index.as_retriever(similarity_top_k=2)
retrievals = base_retriever.retrieve(
    "Can you tell me about the key concepts for safety finetuning"
)
for n in retrievals:
    display_source_node(n, source_length=1500)

Node ID: node-22
Similarity: 0.8486295634691985
Text: We observe that models trained from less aggressively filtered pretraining data also required fewer examples to achieve reasonable safety-alignment. Wereiteratethatthismotivatedchoicedoesimplythatadditionalsafetymitigationsshould be applied before deployment of base Llama 2 models. 22

TruthfulQA ↑ToxiGen ↓ MPT7B 29.13 22.32 30B 35.25 22.61 Falcon7B 25.95 14.53 40B 40.39 23.44 Llama 17B 27.42 23.00 13B 41.74 23.08 33B 44.19 22.57 65B 48.71 21.77 Llama 27B 33.29 21.25 13B 41.86 26.10 34B 43.45 21.19 70B 50.18 24.60 Table 11: Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the percentageofgenerationsthatarebothtruthfulandinformative(thehigherthebetter). ForToxiGen,we present the percentage of toxic generations (the smaller, the better). Benchmarks give a summary view ofmodel capabilities and behaviors that allow us to understand general patternsinthemodel,buttheydonotprovideafullycomprehensiveviewoftheimpactthemodelmayhave onpeopleorreal-worldoutcomes;thatwouldrequirestudyofend-to-endproductdeployments. Further testing and mitigation should be done to understand bias and other social issues for the specific context in which a system may be deployed. For this, it may be necessary to test beyond the groups available in theBOLDdataset(race,religion,andgender). AsLLMsareintegratedanddeployed,welookforwardto continuing research that will amplify their potential for positive impact on these important social issues. 4.2 Safety Fine-Tuning In…

Node ID: node-65
Similarity: 0.8475738620825305
Text: Theassis- tant’sanswersshouldnotincludeanyharmful,unethical,orsociallybiasedcontent,andshouldmovethe conversation in a positive direction. (Based on a specific risk category) The following is a question that can elicit unsafe behavior from you. The question was classified as belonging to the violating category: ‘Criminal / Fraudulent Content’. You should answer using the following template:

  1. [ifapplicable]Addressimmediatesafetyconcerns. Forexample,iftheuserdiscussescommittingaviolent crime, the model should point them towards law enforcement resources.

  2. Addresstheprompt. themodelshouldrefusetoanswerthepromptasgiven,statingthatitviolatesits policy to commit, advise or give information on illegal activities.

  3. Add context and additional information. If applicable, the model can advise on legal alternatives.

  4. Encourageuserstoaskfor/viewadditionalinfoasappropriate. Ifapplicable,themodelcanencouragethe user to refine their prompt so that it does not contain any illegal requests or activities. (Based on a specific risk category) The following is a question that can elicit unsafe behavior from you. The questionwasclassified asbelongingtotheviolatingcategory: ‘ExplicitContent’. Youshould answerusing the following template:

  5. Address immediate safety concerns. For example, if a prompt states the user is a victim of violence or abuse, the model should provide support resources in an empathetic tone.

  6. Address the prompt. the model should refuse to generate explicit sexual o…

query_engine_base = RetrieverQueryEngine.from_args(
    base_retriever, service_context=service_context
)
response = query_engine_base.query(
    "Can you tell me about the key concepts for safety finetuning"
)
print(str(response))
The key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Reinforcement Learning from Human Feedback), and safety context distillation. Supervised safety fine-tuning involves gathering adversarial prompts and safe demonstrations to train the model to align with safety guidelines. Safety RLHF integrates safety into the general RLHF pipeline by training a safety-specific reward model and gathering challenging adversarial prompts for fine-tuning. Safety context distillation involves generating safer model responses by prefixing a prompt with a safety preprompt and fine-tuning the model on the safer responses without the preprompt. These techniques aim to mitigate safety risks and ensure that the model's answers do not include harmful, unethical, or socially biased content.

Chunk References: Smaller Child Chunks Referring to Bigger Parent Chunk

In this usage example, we show how to build a graph of smaller chunks pointing to bigger parent chunks.

At query time, we retrieve the smaller chunks but follow their references to the bigger parent chunks. This gives us more context for synthesis.
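The small-to-big pattern can be sketched without any library. In the toy example below, each parent text is split into word windows of two sizes, a query is matched against the small pieces, and the parent of the best piece is returned. The word-overlap score is only a stand-in for embedding similarity, and `split_words`, `overlap`, `retrieve_parent`, and the sample texts are all assumptions for illustration.

```python
def split_words(text, size):
    """Split text into consecutive windows of at most `size` words."""
    words = text.split()
    return [" ".join(words[i : i + size]) for i in range(0, len(words), size)]


# Toy parent chunks (assumed content).
parents = {
    "node-0": "llamas are domesticated animals from south america kept for wool and transport",
    "node-1": "safety fine-tuning uses supervised examples then reinforcement learning from human feedback",
}

# Build (child_text, parent_id) pairs at two child sizes.
children = []
for parent_id, text in parents.items():
    for size in (3, 6):
        for piece in split_words(text, size):
            children.append((piece, parent_id))


def overlap(query, text):
    """Fraction of the words in `text` that also appear in `query`."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(t)


def retrieve_parent(query):
    """Match the query against child pieces, then return the parent it points to."""
    _, parent_id = max(children, key=lambda child: overlap(query, child[0]))
    return parent_id, parents[parent_id]


parent_id, parent_text = retrieve_parent("what is safety fine-tuning")
print(parent_id)
print(parent_text)
```

The short piece "safety fine-tuning uses" matches the query more tightly than any larger window, yet the caller still receives the full parent text.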

sub_chunk_sizes = [128, 256, 512]
sub_node_parsers = [
    SimpleNodeParser.from_defaults(chunk_size=c) for c in sub_chunk_sizes
]

all_nodes = []
for base_node in base_nodes:
    for n in sub_node_parsers:
        sub_nodes = n.get_nodes_from_documents([base_node])
        sub_inodes = [
            IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
        ]
        all_nodes.extend(sub_inodes)

    # also add the original base node itself to the list of nodes
    base_inode = IndexNode.from_text_node(base_node, base_node.node_id)
    all_nodes.append(base_inode)
all_nodes_dict = {n.node_id: n for n in all_nodes}
vector_index_chunk = VectorStoreIndex(all_nodes, service_context=service_context)
vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=2)
retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)
nodes = retriever_chunk.retrieve(
    "Can you tell me about the key concepts for safety finetuning"
)
for node in nodes:
    display_source_node(node, source_length=2000)
Retrieving with query id None: Can you tell me about the key concepts for safety finetuning
Retrieved node with id, entering: node-1
Retrieving with query id node-1: Can you tell me about the key concepts for safety finetuning
Retrieved node with id, entering: node-22
Retrieving with query id node-22: Can you tell me about the key concepts for safety finetuning

Node ID: node-1
Similarity: 0.8730185735901727
Text: … … … … … … … … … . 16 3.4 RLHF Results … … … … … … … … … … … … … … … . 17 4 Safety 20 4.1 Safety in Pretraining … … … … … … … … … … … … … … 20 4.2 Safety Fine-Tuning … … … … … … … … … … … … … … . 23 4.3 Red Teaming … … … … … … … … … … … … … … … . . 28 4.4 Safety Evaluation of Llama 2-Chat … … … … … … … … … … … . 29 5 Discussion 32 5.1 Learnings and Observations … … … … … … … … … … … … . . 32 5.2 Limitations and Ethical Considerations … … … … … … … … … … . 34 5.3 Responsible Release Strategy … … … … … … … … … … … … . 35 6 Related Work 35 7 Conclusion 36 A Appendix 46 A.1 Contributions … … … … … … … … … … … … … … … . 46 A.2 Additional Details for Pretraining … … … … … … … … … … … . . 47 A.3 Additional Details for Fine-tuning … … … … … … …

Node ID: node-22
Similarity: 0.8656067057400358
Text: We observe that models trained from less aggressively filtered pretraining data also required fewer examples to achieve reasonable safety-alignment. Wereiteratethatthismotivatedchoicedoesimplythatadditionalsafetymitigationsshould be applied before deployment of base Llama 2 models. 22

TruthfulQA ↑ToxiGen ↓ MPT7B 29.13 22.32 30B 35.25 22.61 Falcon7B 25.95 14.53 40B 40.39 23.44 Llama 17B 27.42 23.00 13B 41.74 23.08 33B 44.19 22.57 65B 48.71 21.77 Llama 27B 33.29 21.25 13B 41.86 26.10 34B 43.45 21.19 70B 50.18 24.60 Table 11: Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the percentageofgenerationsthatarebothtruthfulandinformative(thehigherthebetter). ForToxiGen,we present the percentage of toxic generations (the smaller, the better). Benchmarks give a summary view ofmodel capabilities and behaviors that allow us to understand general patternsinthemodel,buttheydonotprovideafullycomprehensiveviewoftheimpactthemodelmayhave onpeopleorreal-worldoutcomes;thatwouldrequirestudyofend-to-endproductdeployments. Further testing and mitigation should be done to understand bias and other social issues for the specific context in which a system may be deployed. For this, it may be necessary to test beyond the groups available in theBOLDdataset(race,religion,andgender). AsLLMsareintegratedanddeployed,welookforwardto continuing research that will amplify their potential for positive impact on these important social issues. 4.2 Safety Fine-Tuning In this section, we describe our approach to safety fine-tuning, including safety categories, annotation guidelines,andthetechniquesweusetomitigatesafetyrisks. Weemployaprocesssimilartothegeneral fine-tuning methods as described in Section 3, with some notable differences related to safety concerns. Specifically, we use the following techniques in safety fine-tuning: 1.Supervised Safety Fine-Tuning : We initialize by gathering adversarial prompts and safe demonstra- tions that are then included in…

query_engine_chunk = RetrieverQueryEngine.from_args(
    retriever_chunk, service_context=service_context
)
response = query_engine_chunk.query(
    "Can you tell me about the key concepts for safety finetuning"
)
print(str(response))
Retrieving with query id None: Can you tell me about the key concepts for safety finetuning
Retrieved node with id, entering: node-1
Retrieving with query id node-1: Can you tell me about the key concepts for safety finetuning
Retrieved node with id, entering: node-22
Retrieving with query id node-22: Can you tell me about the key concepts for safety finetuning
The key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Reinforcement Learning from Human Feedback), and safety context distillation. 

In supervised safety fine-tuning, adversarial prompts and safe demonstrations are gathered and included in the general supervised fine-tuning process. This helps the model align with safety guidelines even before RLHF and lays the foundation for high-quality human preference data annotation.

Safety RLHF involves integrating safety into the general RLHF pipeline. This includes training a safety-specific reward model and gathering more challenging adversarial prompts for rejection sampling style fine-tuning and PPO (Proximal Policy Optimization) optimization.

Safety context distillation is the final step in safety fine-tuning. It involves refining the RLHF pipeline with context distillation, which generates safer model responses by prefixing a prompt with a safety preprompt. The model is then fine-tuned on the safer responses without the preprompt, effectively distilling the safety preprompt (context) into the model. A targeted approach is used to allow the safety reward model to choose whether to use context distillation for each sample.

These concepts are used to mitigate safety risks and improve the safety alignment of the model during the fine-tuning process.

Metadata References: Summaries + Generated Questions referring to a bigger chunk

In this usage example, we show how to define additional context that references the source node.

This additional context includes summaries as well as generated questions.

At query time, we retrieve the summaries and generated questions, then follow their references to the bigger source chunks. This gives us more context for synthesis.
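Before the library code, here is a plain-Python sketch of how metadata can be turned into references. It assumes each metadata dictionary may contain the keys `section_summary` and `questions_this_excerpt_can_answer` (the keys used later in this guide). The helper `build_metadata_references`, the `kind` field, and the sample values are hypothetical.

```python
def build_metadata_references(node_ids, metadata_dicts):
    """Pair each summary or question string with the id of the node it describes."""
    references = []
    for node_id, meta in zip(node_ids, metadata_dicts):
        for key in ("section_summary", "questions_this_excerpt_can_answer"):
            value = meta.get(key)
            if value:  # skip missing or empty metadata
                references.append({"text": value, "kind": key, "index_id": node_id})
    return references


# Toy inputs (assumed values, not real extractor output).
node_ids = ["node-0", "node-1"]
metadata_dicts = [
    {
        "section_summary": "Overview of pretraining data.",
        "questions_this_excerpt_can_answer": "1. What data was used?",
    },
    {"section_summary": "Safety fine-tuning techniques."},
]

refs = build_metadata_references(node_ids, metadata_dicts)
for ref in refs:
    print(ref["index_id"], ref["kind"])
```

Each resulting record is small and cheap to embed, while its `index_id` still leads back to the full source chunk.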

from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import IndexNode
from llama_index.node_parser.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    MetadataExtractor,
)
metadata_extractor = MetadataExtractor(
    extractors=[
        SummaryExtractor(summaries=["self"], show_progress=True),
        QuestionsAnsweredExtractor(questions=5, show_progress=True),
    ],
)
# run metadata extractor across base nodes, get back dictionaries
metadata_dicts = metadata_extractor.extract(base_nodes)
# cache metadata dicts
def save_metadata_dicts(path):
    with open(path, "w") as fp:
        for m in metadata_dicts:
            fp.write(json.dumps(m) + "\n")


def load_metadata_dicts(path):
    # each line of the file holds one JSON-encoded metadata dict
    with open(path, "r") as fp:
        return [json.loads(line) for line in fp]
save_metadata_dicts("data/llama2_metadata_dicts.jsonl")
metadata_dicts = load_metadata_dicts("data/llama2_metadata_dicts.jsonl")
# all_nodes consists of the source nodes plus the metadata nodes
# (copy the list so that extending it does not also modify base_nodes)
all_nodes = list(base_nodes)
for idx, d in enumerate(metadata_dicts):
    inode_q = IndexNode(
        text=d["questions_this_excerpt_can_answer"], index_id=base_nodes[idx].node_id
    )
    inode_s = IndexNode(text=d["section_summary"], index_id=base_nodes[idx].node_id)
    all_nodes.extend([inode_q, inode_s])
all_nodes_dict = {n.node_id: n for n in all_nodes}
## Load index into vector index
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)

vector_index_metadata = VectorStoreIndex(all_nodes, service_context=service_context)
vector_retriever_metadata = vector_index_metadata.as_retriever(similarity_top_k=2)
retriever_metadata = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_metadata},
    node_dict=all_nodes_dict,
    verbose=True,
)
nodes = retriever_metadata.retrieve(
    "Can you tell me about the key concepts for safety finetuning"
)
for node in nodes:
    display_source_node(node, source_length=2000)
Retrieving with query id None: Can you tell me about the key concepts for safety finetuning
Retrieved node with id, entering: node-22
Retrieving with query id node-22: Can you tell me about the key concepts for safety finetuning

Node ID: node-22
Similarity: 0.8586394695721855
Text: We observe that models trained from less aggressively filtered pretraining data also required fewer examples to achieve reasonable safety-alignment. Wereiteratethatthismotivatedchoicedoesimplythatadditionalsafetymitigationsshould be applied before deployment of base Llama 2 models. 22

TruthfulQA ↑ToxiGen ↓ MPT7B 29.13 22.32 30B 35.25 22.61 Falcon7B 25.95 14.53 40B 40.39 23.44 Llama 17B 27.42 23.00 13B 41.74 23.08 33B 44.19 22.57 65B 48.71 21.77 Llama 27B 33.29 21.25 13B 41.86 26.10 34B 43.45 21.19 70B 50.18 24.60 Table 11: Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the percentageofgenerationsthatarebothtruthfulandinformative(thehigherthebetter). ForToxiGen,we present the percentage of toxic generations (the smaller, the better). Benchmarks give a summary view ofmodel capabilities and behaviors that allow us to understand general patternsinthemodel,buttheydonotprovideafullycomprehensiveviewoftheimpactthemodelmayhave onpeopleorreal-worldoutcomes;thatwouldrequirestudyofend-to-endproductdeployments. Further testing and mitigation should be done to understand bias and other social issues for the specific context in which a system may be deployed. For this, it may be necessary to test beyond the groups available in theBOLDdataset(race,religion,andgender). AsLLMsareintegratedanddeployed,welookforwardto continuing research that will amplify their potential for positive impact on these important social issues. 4.2 Safety Fine-Tuning In this section, we describe our approach to safety fine-tuning, including safety categories, annotation guidelines,andthetechniquesweusetomitigatesafetyrisks. Weemployaprocesssimilartothegeneral fine-tuning methods as described in Section 3, with some notable differences related to safety concerns. Specifically, we use the following techniques in safety fine-tuning: 1.Supervised Safety Fine-Tuning : We initialize by gathering adversarial prompts and safe demonstra- tions that are then included in…

query_engine_metadata = RetrieverQueryEngine.from_args(
    retriever_metadata, service_context=service_context
)
response = query_engine_metadata.query(
    "Can you tell me about the key concepts for safety finetuning"
)
print(str(response))
Retrieving with query id None: Can you tell me about the key concepts for safety finetuning
Retrieved node with id, entering: node-22
Retrieving with query id node-22: Can you tell me about the key concepts for safety finetuning
The key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Reinforcement Learning from Human Feedback), and safety context distillation. 

Supervised safety fine-tuning involves gathering adversarial prompts and safe demonstrations to train the model to align with safety guidelines even before RLHF. This helps establish a foundation for high-quality human preference data annotation.

Safety RLHF integrates safety into the general RLHF pipeline. This includes training a safety-specific reward model and gathering more challenging adversarial prompts for rejection sampling style fine-tuning and PPO (Proximal Policy Optimization) optimization.

Safety context distillation refines the RLHF pipeline by generating safer model responses through prefixing a prompt with a safety preprompt, such as "You are a safe and responsible assistant." The model is then fine-tuned on the safer responses without the preprompt, effectively distilling the safety preprompt (context) into the model. A targeted approach is used to allow the safety reward model to choose whether to use context distillation for each sample.

These concepts are employed to mitigate safety risks and ensure that the fine-tuned models align with safety guidelines.

Evaluation

We evaluate how well our recursive retrieval + node reference methods work. We evaluate both chunk references as well as metadata references. We use embedding similarity lookup to retrieve the reference nodes.

We compare both methods against a baseline retriever where we fetch the raw nodes directly.

In terms of metrics, we evaluate using both hit-rate and MRR.
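For readers unfamiliar with these metrics, the sketch below computes them by hand using their standard definitions: hit rate is the fraction of queries for which an expected node appears among the retrieved results, and MRR (mean reciprocal rank) averages 1/rank of the first expected node, counting a miss as 0. The helper names and the toy query results are assumptions, and the library's evaluator may differ in details.

```python
def hit_rate(retrieved_ids, expected_ids):
    """1.0 if any expected id was retrieved, otherwise 0.0."""
    return 1.0 if any(i in expected_ids for i in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids, expected_ids):
    """1 / rank of the first expected id in the results, or 0.0 on a miss."""
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id in expected_ids:
            return 1.0 / rank
    return 0.0


# Toy evaluation set: (retrieved ids in ranked order, expected ids).
queries = [
    (["node-3", "node-7", "node-1"], {"node-7"}),  # expected node at rank 2
    (["node-2", "node-4", "node-9"], {"node-2"}),  # expected node at rank 1
    (["node-5", "node-6", "node-8"], {"node-0"}),  # miss
]

mean_hit_rate = sum(hit_rate(r, e) for r, e in queries) / len(queries)
mrr = sum(reciprocal_rank(r, e) for r, e in queries) / len(queries)
print(mean_hit_rate, mrr)
```

Here two of the three queries hit, so the hit rate is 2/3, and the MRR is (1/2 + 1 + 0) / 3 = 0.5. MRR rewards placing the expected node near the top, which hit rate alone does not capture.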

Dataset Generation

We first generate a dataset of questions from the set of text chunks.

from llama_index.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
)
import nest_asyncio

nest_asyncio.apply()
eval_dataset = generate_question_context_pairs(base_nodes)
100%|██████████| 80/80 [03:27<00:00,  2.59s/it]
eval_dataset.save_json("data/llama2_eval_dataset.json")
# optional
eval_dataset = EmbeddingQAFinetuneDataset.from_json("data/llama2_eval_dataset.json")

Compare Results

We run evaluations on each of the retrievers to measure hit rate and MRR.

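We find that retrievers with node references (either chunk or metadata) tend to perform better than retrieving the raw chunks. In the run shown here, hit rate rises from about 0.80 for the baseline to 0.89 with chunk references and 0.92 with metadata references, and MRR rises from about 0.61 to 0.74 and 0.75 respectively.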

import pandas as pd
from llama_index.evaluation import RetrieverEvaluator, get_retrieval_results_df

# use a higher similarity top-k for the vector retrievers during evaluation
top_k = 10


def display_results(names, results_arr):
    """Display results from evaluate."""

    hit_rates = []
    mrrs = []
    for name, eval_results in zip(names, results_arr):
        metric_dicts = []
        for eval_result in eval_results:
            metric_dict = eval_result.metric_vals_dict
            metric_dicts.append(metric_dict)
        results_df = pd.DataFrame(metric_dicts)

        hit_rate = results_df["hit_rate"].mean()
        mrr = results_df["mrr"].mean()
        hit_rates.append(hit_rate)
        mrrs.append(mrr)

    final_df = pd.DataFrame({"retrievers": names, "hit_rate": hit_rates, "mrr": mrrs})
    display(final_df)
vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=top_k)
retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever_chunk
)
# try it out on an entire dataset
results_chunk = await retriever_evaluator.aevaluate_dataset(
    eval_dataset, show_progress=True
)
vector_retriever_metadata = vector_index_metadata.as_retriever(similarity_top_k=top_k)
retriever_metadata = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_metadata},
    node_dict=all_nodes_dict,
    verbose=True,
)
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever_metadata
)
# try it out on an entire dataset
results_metadata = await retriever_evaluator.aevaluate_dataset(
    eval_dataset, show_progress=True
)
base_retriever = base_index.as_retriever(similarity_top_k=top_k)
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=base_retriever
)
# try it out on an entire dataset
results_base = await retriever_evaluator.aevaluate_dataset(
    eval_dataset, show_progress=True
)
  0%|          | 0/167 [00:00<?, ?it/s]
Async embedding not available, falling back to sync method.
100%|██████████| 167/167 [00:03<00:00, 49.50it/s]
full_results_df = get_retrieval_results_df(
    [
        "Base Retriever",
        "Retriever (Chunk References)",
        "Retriever (Metadata References)",
    ],
    [results_base, results_chunk, results_metadata],
)
display(full_results_df)
                        retrievers  hit_rate       mrr
0                   Base Retriever  0.796407  0.605097
1     Retriever (Chunk References)  0.892216  0.739179
2  Retriever (Metadata References)  0.916168  0.746906