Auto Merging Retriever
In this notebook, we showcase our AutoMergingRetriever, which looks at a set of leaf nodes and recursively "merges" subsets of leaf nodes that reference a parent node beyond a given threshold. This allows us to consolidate potentially disparate, smaller contexts into a larger context that might help synthesis.
You can define this hierarchy yourself over a set of documents, or you can make use of our brand-new text parser: a HierarchicalNodeParser that takes in a candidate set of documents and outputs an entire hierarchy of nodes, from "coarse-to-fine".
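To make the merging behavior concrete, here is a minimal, illustrative sketch (not the actual AutoMergingRetriever implementation) of the ratio-threshold idea: if a large enough fraction of a parent node's children appear in the retrieved set, the children are swapped out for their parent.
# Illustrative sketch only -- the real retriever operates on LlamaIndex node objects
# and the parent/child relationships stored in the docstore.
from collections import defaultdict

def merge_to_parents(retrieved_ids, parent_of, children_of, ratio_threshold=0.5):
    # retrieved_ids: ids of retrieved leaf chunks
    # parent_of: child_id -> parent_id; children_of: parent_id -> list of child ids
    hits_by_parent = defaultdict(set)
    for node_id in retrieved_ids:
        parent_id = parent_of.get(node_id)
        if parent_id is not None:
            hits_by_parent[parent_id].add(node_id)

    merged = set(retrieved_ids)
    for parent_id, hits in hits_by_parent.items():
        if len(hits) / len(children_of[parent_id]) > ratio_threshold:
            merged -= hits          # drop the individual children...
            merged.add(parent_id)   # ...and keep the larger parent context instead
    return merged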
%load_ext autoreload
%autoreload 2
Load Data
Let's first load the Llama 2 paper: https://arxiv.org/pdf/2307.09288.pdf. This will be our test data.
!mkdir -p data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
from pathlib import Path
# from llama_hub.file.pdf.base import PDFReader
from llama_hub.file.pymu_pdf.base import PyMuPDFReader
loader = PyMuPDFReader()
# docs0 = loader.load_data(file=Path("./data/llama2.pdf"))
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))
By default, the PDF reader creates a separate doc for each page. For the sake of this notebook, we stitch docs together into one doc. This will help us better highlight auto-merging capabilities that "stitch" chunks together later on.
from llama_index import Document
doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
Parse Chunk Hierarchy from Text, Load into Storage
In this section we make use of the HierarchicalNodeParser. This will output a hierarchy of nodes, from top-level nodes with bigger chunk sizes to child nodes with smaller chunk sizes, where each child node has a parent node with a bigger chunk size.
By default, the hierarchy is:
1st level: chunk size 2048
2nd level: chunk size 512
3rd level: chunk size 128
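If you want a different hierarchy, you should be able to override the chunk sizes when constructing the parser (a small sketch, assuming the chunk_sizes keyword of from_defaults; the values below are arbitrary, and the rest of this notebook sticks with the defaults):
from llama_index.node_parser import HierarchicalNodeParser

# e.g. a coarser three-level hierarchy, ordered from top-level (parent) to leaf
custom_node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[4096, 1024, 256]
)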
We then load these nodes into storage. The leaf nodes are indexed and retrieved via a vector store - these are the nodes that will first be directly retrieved via similarity search. The other nodes will be retrieved from a docstore.
from llama_index.node_parser import HierarchicalNodeParser
node_parser = HierarchicalNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(docs)
len(nodes)
1029
Here we import a simple helper function for fetching "leaf" nodes within a node list. These are nodes that don't have children of their own.
from llama_index.node_parser import get_leaf_nodes, get_root_nodes
leaf_nodes = get_leaf_nodes(nodes)
len(leaf_nodes)
795
root_nodes = get_root_nodes(nodes)
Load into Storage
We define a docstore, which we load all nodes into. We then define a VectorStoreIndex containing just the leaf-level nodes.
# define storage context
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.storage import StorageContext
from llama_index import ServiceContext
from llama_index.llms import OpenAI
docstore = SimpleDocumentStore()
# insert nodes into docstore
docstore.add_documents(nodes)
# define storage context (will include vector store by default too)
storage_context = StorageContext.from_defaults(docstore=docstore)
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo"))
## Load leaf nodes into the vector index
from llama_index import VectorStoreIndex
base_index = VectorStoreIndex(
leaf_nodes, storage_context=storage_context, service_context=service_context
)
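As an optional sanity check (a hypothetical snippet, not part of the original notebook), you can verify that a leaf node's parent is recoverable from the docstore via its node relationships; this is the lookup the auto-merging step relies on.
# Hypothetical sanity check: fetch a leaf node's parent chunk from the docstore.
leaf = leaf_nodes[0]
parent_info = leaf.parent_node  # RelatedNodeInfo pointing at the parent, or None
if parent_info is not None:
    parent = docstore.get_document(parent_info.node_id)
    print(parent.get_content()[:200])  # the larger chunk that merging would surface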
Define Retriever
from llama_index.retrievers.auto_merging_retriever import AutoMergingRetriever
base_retriever = base_index.as_retriever(similarity_top_k=6)
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)
# query_str = "What were some lessons learned from red-teaming?"
# query_str = "Can you tell me about the key concepts for safety finetuning"
query_str = "What could be the potential outcomes of adjusting the amount of safety data used in the RLHF stage?"
nodes = retriever.retrieve(query_str)
base_nodes = base_retriever.retrieve(query_str)
> Merging 4 nodes into parent node.
> Parent node id: d9b0684f-c36d-4315-a78c-ef902b15cf8e.
> Parent node text: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: an...
len(nodes)
3
len(base_nodes)
6
from llama_index.response.notebook_utils import display_source_node
for node in nodes:
display_source_node(node, source_length=10000)
Node ID: 1c347302-91a0-43c6-b476-02969129671f
Similarity: 0.8694979150607424
Text: We also list two
qualitative examples where safety and helpfulness reward models don't agree with each other in Table 35.
A.4.2
Qualitative Results on Safety Data Scaling
In Section 4.2.3, we study the impact of adding more safety data into model RLHF in a quantitative manner.
Here we showcase a few samples to qualitatively examine the evolution of model behavior when we scale
safety data in Tables 36, 37, and 38. In general, we are observing that Llama 2-Chat becomes safer responding
to unsafe prompts with more safety data used.
Node ID: 3671b20d-ea5e-4afc-983e-02be6ee8302d
Similarity: 0.8616645024812453
Text: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators
write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to
the prompts, selecting the response that is safest according to a set of guidelines. We then use the human
preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to
sample from the model during the RLHF stage.
Better Long-Tail Safety Robustness without Hurting Helpfulness
Safety is inherently a long-tail problem,
where the challenge comes from a small number of very specific cases. We investigate the impact of Safety
RLHF by taking two intermediate Llama 2-Chat checkpoints – one without adversarial prompts in the RLHF
stage and one with them – and score their responses on our test sets using our safety and helpfulness reward
models. In Figure 14, we plot the score distribution shift of the safety RM on the safety test set (left) and that
of the helpfulness RM on the helpfulness test set (right). In the left hand side of the figure, we observe that
the distribution of safety RM scores on the safety set shifts to higher reward scores after safety tuning with
RLHF, and that the long tail of the distribution near zero thins out. A clear cluster appears on the top-left
corner suggesting the improvements of model safety. On the right side, we do not observe any gathering
pattern below the y = x line on the right hand side of Figure 14, which indicates that the helpfulness score
distribution is preserved after safety tuning with RLHF. Put another way, given sufficient helpfulness training
data, the addition of an additional stage of safety mitigation does not negatively impact model performance
on helpfulness to any notable degradation. A qualitative example is shown in Table 12.
Impact of Safety Data Scaling.
A tension between helpfulness and safety of LLMs has been observed in
previous studies (Bai et al., 2022a). To better understand how the addition of safety training data affects
general model performance, especially helpfulness, we investigate the trends in safety data scaling by
adjusting the amount of safety data used in the RLHF stage.
Node ID: 9004b1e1-67b6-427d-b6f8-e17cb62445a9
Similarity: 0.8546977459150967
Text: [figure axis residue from Figure 14: tick values 0–1.0 for "Helpfulness RM Score before Safety RLHF" and "Helpfulness RM Score after Safety RLHF", plus histogram counts]
Figure 14: Impact of safety RLHF measured by reward model score distributions. Left: safety reward
model scores of generations on the Meta Safety test set. The clustering of samples in the top left corner
suggests the improvements of model safety.
for node in base_nodes:
display_source_node(node, source_length=10000)
Node ID: 4805858d-2cfc-4817-b39b-9aa80796ba10
Similarity: 0.8767292201499971
Text: A qualitative example is shown in Table 12.
Impact of Safety Data Scaling.
A tension between helpfulness and safety of LLMs has been observed in
previous studies (Bai et al., 2022a). To better understand how the addition of safety training data affects
general model performance, especially helpfulness, we investigate the trends in safety data scaling by
adjusting the amount of safety data used in the RLHF stage.
Node ID: 8bbe6232-36fc-42f1-b580-eb91676cdbde
Similarity: 0.8724663025381515
Text: A clear cluster appears on the top-left
corner suggesting the improvements of model safety. On the right side, we do not observe any gathering
pattern below the y = x line on the right hand side of Figure 14, which indicates that the helpfulness score
distribution is preserved after safety tuning with RLHF. Put another way, given sufficient helpfulness training
data, the addition of an additional stage of safety mitigation does not negatively impact model performance
on helpfulness to any notable degradation. A qualitative example is shown in Table 12.
Impact of Safety Data Scaling.
Node ID: 1c347302-91a0-43c6-b476-02969129671f
Similarity: 0.8694979150607424
Text: We also list two
qualitative examples where safety and helpfulness reward models don't agree with each other in Table 35.
A.4.2
Qualitative Results on Safety Data Scaling
In Section 4.2.3, we study the impact of adding more safety data into model RLHF in a quantitative manner.
Here we showcase a few samples to qualitatively examine the evolution of model behavior when we scale
safety data in Tables 36, 37, and 38. In general, we are observing that Llama 2-Chat becomes safer responding
to unsafe prompts with more safety data used.
Node ID: 9004b1e1-67b6-427d-b6f8-e17cb62445a9
Similarity: 0.8546977459150967
Text: [figure axis residue from Figure 14: tick values 0–1.0 for "Helpfulness RM Score before Safety RLHF" and "Helpfulness RM Score after Safety RLHF", plus histogram counts]
Figure 14: Impact of safety RLHF measured by reward model score distributions. Left: safety reward
model scores of generations on the Meta Safety test set. The clustering of samples in the top left corner
suggests the improvements of model safety.
Node ID: cfdb4b18-028e-439b-963c-ede2e1e4eea3
Similarity: 0.8487467427260533
Text: Better Long-Tail Safety Robustness without Hurting Helpfulness
Safety is inherently a long-tail problem,
where the challenge comes from a small number of very specific cases. We investigate the impact of Safety
RLHF by taking two intermediate Llama 2-Chat checkpoints – one without adversarial prompts in the RLHF
stage and one with them – and score their responses on our test sets using our safety and helpfulness reward
models.
Node ID: f4793124-1595-4b27-86e0-5b3801969ca8
Similarity: 0.8487157445107789
Text: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators
write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to
the prompts, selecting the response that is safest according to a set of guidelines. We then use the human
preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to
sample from the model during the RLHF stage.
Plug it into Query Engine
from llama_index.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(retriever)
base_query_engine = RetrieverQueryEngine.from_args(base_retriever)
response = query_engine.query(query_str)
> Merging 4 nodes into parent node.
> Parent node id: 3671b20d-ea5e-4afc-983e-02be6ee8302d.
> Parent node text: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: an...
print(str(response))
Adjusting the amount of safety data used in the RLHF stage could potentially have the following outcomes:
1. Improved model safety: Increasing the amount of safety data used in RLHF may lead to improvements in model safety. This means that the model becomes better at responding to unsafe prompts and avoids generating unsafe or harmful outputs.
2. Thinning out of the long tail of safety RM scores: Increasing the amount of safety data may result in a shift in the distribution of safety reward model (RM) scores towards higher reward scores. This means that the model becomes more consistent in generating safe responses and reduces the occurrence of low safety scores.
3. Preservation of helpfulness performance: Adjusting the amount of safety data used in RLHF is not expected to negatively impact model performance on helpfulness. This means that the model's ability to generate helpful responses is maintained even after incorporating additional safety training.
4. Gathering pattern in helpfulness RM scores: There is no observed gathering pattern below the y = x line in the distribution of helpfulness RM scores after safety tuning with RLHF. This suggests that the helpfulness score distribution is preserved, indicating that the model's helpfulness performance is not significantly degraded by the addition of safety mitigation measures.
Overall, adjusting the amount of safety data used in the RLHF stage aims to strike a balance between improving model safety without compromising its helpfulness performance.
base_response = base_query_engine.query(query_str)
print(str(base_response))
Adjusting the amount of safety data used in the RLHF stage could potentially lead to improvements in model safety. This can be observed by a clear cluster appearing on the top-left corner, suggesting enhanced model safety. Additionally, it is indicated that the helpfulness score distribution is preserved after safety tuning with RLHF, indicating that the addition of safety data does not negatively impact model performance on helpfulness.
Evaluation
We evaluate how well the auto-merging retriever works compared to the baseline retriever in a more quantitative manner.
In terms of metrics, we evaluate the generated responses with a semantic similarity evaluator and a GPT-4-based pairwise comparison evaluator.
from llama_index.evaluation import (
DatasetGenerator,
QueryResponseDataset,
)
from llama_index import ServiceContext
from llama_index.llms import OpenAI
import nest_asyncio
nest_asyncio.apply()
# NOTE: run this if the dataset isn't already saved
eval_service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))
dataset_generator = DatasetGenerator(
root_nodes,
service_context=eval_service_context,
show_progress=True,
num_questions_per_chunk=2,
)
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=60)
eval_dataset.save_json("data/llama2_eval_qr_dataset.json")
# optional: load the saved dataset from disk instead of regenerating it
eval_dataset = QueryResponseDataset.from_json("data/llama2_eval_qr_dataset.json")
Compare Results
We run evaluations on the responses from each query engine (auto-merging vs. base) and compare their scores.
import asyncio
import nest_asyncio
nest_asyncio.apply()
from llama_index.evaluation import (
CorrectnessEvaluator,
SemanticSimilarityEvaluator,
RelevancyEvaluator,
PairwiseComparisonEvaluator,
)
# NOTE: can uncomment other evaluators
# evaluator = CorrectnessEvaluator(service_context=eval_service_context)
evaluator = SemanticSimilarityEvaluator(service_context=eval_service_context)
# evaluator = RelevancyEvaluator(service_context=eval_service_context)
pairwise_evaluator = PairwiseComparisonEvaluator(service_context=eval_service_context)
from tqdm import tqdm
from tqdm.asyncio import tqdm_asyncio
async def get_predicted_answers(questions, query_engine):
tasks = []
for question in questions:
tasks.append(query_engine.aquery(question))
responses = await tqdm_asyncio.gather(*tasks)
return responses
eval_qs = eval_dataset.questions
qr_pairs = eval_dataset.qr_pairs
pred_responses = await get_predicted_answers(eval_qs, query_engine)
base_pred_responses = await get_predicted_answers(eval_qs, base_query_engine)
from tqdm.asyncio import tqdm_asyncio
import numpy as np
async def avg_eval_score(qr_pairs, pred_responses, evaluator):
tasks = []
for idx, (question, reference) in tqdm(enumerate(qr_pairs)):
pred_response = pred_responses[idx]
task = evaluator.aevaluate_response(
query=question,
response=pred_response,
reference=reference,
)
tasks.append(task)
results = await tqdm_asyncio.gather(*tasks)
scores = np.array([r.score for r in results])
return scores.mean()
avg_score = await avg_eval_score(qr_pairs, pred_responses, evaluator)
60it [00:00, 822412.55it/s]
100%|██████████| 60/60 [00:01<00:00, 55.66it/s]
avg_score
0.9193331063079818
base_avg_score = await avg_eval_score(qr_pairs, base_pred_responses, evaluator)
60it [00:00, 461758.24it/s]
100%|██████████| 60/60 [00:00<00:00, 68.06it/s]
base_avg_score
0.9220138432235365
Analysis: Semantic similarity doesn't reveal a noticeable difference between the base retriever and the auto-merging retriever.
This might indicate that the auto-merging retriever didn't yield a significant improvement on this metric.
However, what if we ask GPT-4 which answer it prefers? Maybe the auto-merging retriever gives more detailed answers.
from tqdm.asyncio import tqdm_asyncio
import numpy as np
async def avg_eval_score_pairwise(
queries, ref_responses, candidate_responses, evaluator
):
tasks = []
for idx, question in enumerate(queries):
ref_response = ref_responses[idx]
candidate_response = candidate_responses[idx]
task = evaluator.aevaluate_response(
query=question,
response=candidate_response,
reference=ref_response,
)
tasks.append(task)
results = await tqdm_asyncio.gather(*tasks)
scores = np.array([r.score for r in results])
return scores.mean()
pairwise_avg_score = await avg_eval_score_pairwise(
eval_qs, base_pred_responses, pred_responses, pairwise_evaluator
)
100%|██████████| 60/60 [00:10<00:00, 5.61it/s]
pairwise_avg_score
0.65
Analysis: The pairwise comparison score measures how often the candidate answer (from the auto-merging retriever) is preferred over the base answer (from the base retriever). Here the candidate answer is preferred 65% of the time, comfortably above the 50% parity baseline.