BM25 Retriever
In this guide, we define a BM25 retriever that searches documents using the BM25 ranking method.
This notebook is very similar to the RouterQueryEngine notebook.
Setup
# NOTE: This is ONLY necessary in jupyter notebook.
# Details: Jupyter runs an event-loop behind the scenes.
# This results in nested event-loops when we start an event-loop to make async queries.
# This is normally not allowed, so we use nest_asyncio to allow it for convenience.
import nest_asyncio
nest_asyncio.apply()
import os
import openai
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().handlers = []
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index import (
    SimpleDirectoryReader,
    ServiceContext,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.retrievers import BM25Retriever
from llama_index.indices.vector_store.retrievers.retriever import VectorIndexRetriever
from llama_index.llms import OpenAI
Load Data
We first show how to convert a Document into a set of Nodes, and insert into a DocumentStore.
# load documents
documents = SimpleDirectoryReader("../data/paul_graham").load_data()
# initialize service context (set chunk size)
llm = OpenAI(model="gpt-4")
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm)
nodes = service_context.node_parser.get_nodes_from_documents(documents)
# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    service_context=service_context,
)
BM25 Retriever
We will now search the documents with the BM25 retriever.
# !pip install rank_bm25
retriever = BM25Retriever.from_defaults(index, similarity_top_k=2)
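Under the hood, BM25 ranks documents by term-frequency and inverse-document-frequency statistics rather than embeddings. As a rough, self-contained sketch of the scoring formula (not the retriever's actual implementation — `bm25_scores` and the toy corpus below are made up for illustration, using the common defaults k1=1.5, b=0.75 and one common idf variant):

```python
import math


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each document in corpus_tokens against query_tokens with BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    scores = []
    for doc in corpus_tokens:
        score = 0.0
        for term in query_tokens:
            tf = doc.count(term)  # term frequency in this document
            df = sum(1 for d in corpus_tokens if term in d)  # document frequency
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
            # length-normalized term-frequency saturation
            score += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores


# toy corpus: exact keyword overlap drives the ranking
corpus = [
    "the author studied painting at RISD".split(),
    "the author wrote essays about startups".split(),
    "paintings can last for centuries".split(),
]
scores = bm25_scores("painting at RISD".split(), corpus)
best = max(range(len(scores)), key=scores.__getitem__)  # index of top document
```

Note how the document with exact token matches ("painting", "RISD") scores highest, while documents with only morphological variants ("paintings") get no credit — this keyword sensitivity is what makes BM25 complementary to embedding retrieval.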
from llama_index.response.notebook_utils import display_source_node
# will retrieve all context from the author's life
nodes = retriever.retrieve(
    "Can you give me all the context regarding the author's life?"
)
for node in nodes:
    display_source_node(node)
Node ID: 880f345a-439c-49a0-ac49-55f2dbb61ee1
Similarity: 7.322087057741413
Text: This name didn’t last long before it was replaced by “software as a service,” but it was current …
Node ID: dfb3f58d-da85-4c90-b070-f817db51023c
Similarity: 7.121406119821665
Text: There, right on the wall, was something you could make that would last.Paintings didn’t become ob…
nodes = retriever.retrieve("What did Paul Graham do after RISD?")
for node in nodes:
    display_source_node(node)
Node ID: 450904cc-d483-4b66-90a2-c36608de5c4b
Similarity: 6.84581665569314
Text: That seemed unnatural to me, and on this point the rest of the world is coming around to my way o…
Node ID: 56ef1b5a-5519-4e31-bedb-9d81ce80f0d8
Similarity: 6.259657209129812
Text: So I decided to take a shot at it.It took 4 years, from March 26, 2015 to October 12, 2019.It was…
Router Retriever with BM25
Now we will combine the BM25 retriever with a vector index retriever behind a router.
from llama_index.tools import RetrieverTool
vector_retriever = VectorIndexRetriever(index)
bm25_retriever = BM25Retriever.from_defaults(index, similarity_top_k=2)
retriever_tools = [
    RetrieverTool.from_defaults(
        retriever=vector_retriever,
        description="Useful in most cases",
    ),
    RetrieverTool.from_defaults(
        retriever=bm25_retriever,
        description="Useful if searching about specific information",
    ),
]
from llama_index.retrievers import RouterRetriever
retriever = RouterRetriever.from_defaults(
    retriever_tools=retriever_tools,
    service_context=service_context,
    select_multi=True,
)
# will retrieve all context from the author's life
nodes = retriever.retrieve(
    "Can you give me all the context regarding the author's life?"
)
for node in nodes:
    display_source_node(node)
Selecting retriever 0: The author's life context is a broad topic and would be useful in most cases..
Node ID: dfb3f58d-da85-4c90-b070-f817db51023c
Similarity: 0.7839511925738453
Text: There, right on the wall, was something you could make that would last.Paintings didn’t become ob…
Node ID: 47b8287e-23e4-4c1a-b7f9-6db6d4c5786a
Similarity: 0.7816309032859696
Text: The students and faculty in the painting department at the Accademia were the nicest people you c…
Advanced - Hybrid Retriever + Re-Ranking
Here we extend the base retriever class and create a custom retriever that always uses both the vector retriever and the BM25 retriever.
The retrieved nodes can then be re-ranked and filtered. This lets us keep the intermediate top-k values large, letting the re-ranker filter out unneeded nodes.
To best demonstrate this, we will use a larger set of source documents – Chapter 3 from the 2022 IPCC Climate Report.
Setup data
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf
# !pip install pypdf
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
    StorageContext,
    SimpleDirectoryReader,
)
from llama_index.llms import OpenAI
# load documents
documents = SimpleDirectoryReader(
    input_files=["IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()
# initialize service context (set chunk size)
# -- here, we set a smaller chunk size, to allow for more effective re-ranking
llm = OpenAI(model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(chunk_size=256, llm=llm)
nodes = service_context.node_parser.get_nodes_from_documents(documents)
# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(
    nodes, storage_context=storage_context, service_context=service_context
)
from llama_index.retrievers import BM25Retriever
# retrieve the top 10 most similar nodes using embeddings
vector_retriever = index.as_retriever(similarity_top_k=10)
# retrieve the top 10 most similar nodes using bm25
bm25_retriever = BM25Retriever.from_defaults(index, similarity_top_k=10)
Custom Retriever Implementation
from llama_index.retrievers import BaseRetriever
class HybridRetriever(BaseRetriever):
    def __init__(self, vector_retriever, bm25_retriever):
        self.vector_retriever = vector_retriever
        self.bm25_retriever = bm25_retriever
        super().__init__()

    def _retrieve(self, query, **kwargs):
        bm25_nodes = self.bm25_retriever.retrieve(query, **kwargs)
        vector_nodes = self.vector_retriever.retrieve(query, **kwargs)

        # combine the two lists of nodes, de-duplicating by node ID
        all_nodes = []
        node_ids = set()
        for n in bm25_nodes + vector_nodes:
            if n.node.node_id not in node_ids:
                all_nodes.append(n)
                node_ids.add(n.node.node_id)
        return all_nodes
hybrid_retriever = HybridRetriever(vector_retriever, bm25_retriever)
Re-Ranker Setup
from llama_index.indices.postprocessor import SentenceTransformerRerank
reranker = SentenceTransformerRerank(
    top_n=4, model="cross-encoder/ms-marco-MiniLM-L-6-v2"
)
Use pytorch device: cpu
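The re-ranker scores each (query, node text) pair with a cross-encoder model and keeps only the `top_n` highest-scoring nodes. The sort-and-truncate step it performs can be sketched as follows (with a hypothetical word-overlap `overlap_score` standing in for the cross-encoder, so the example is self-contained):

```python
# Sketch of the rerank-and-truncate step. score_fn is a stand-in for the
# cross-encoder; the real reranker runs the model on (query, text) pairs.

def rerank(query, texts, score_fn, top_n=4):
    scored = [(score_fn(query, text), text) for text in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best first
    return [text for _, text in scored[:top_n]]


# toy scorer: count words shared between query and text
def overlap_score(query, text):
    return len(set(query.lower().split()) & set(text.lower().split()))


docs = [
    "ocean warming and climate change",
    "tax policy",
    "ocean acidification",
    "climate impacts on the ocean",
    "sports news",
]
top = rerank("climate change ocean", docs, overlap_score, top_n=2)
```

Because the truncation happens after scoring every candidate, the hybrid retriever can afford a large intermediate top-k: irrelevant nodes are cheap to discard here.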
Retrieve
from llama_index import QueryBundle
nodes = hybrid_retriever.retrieve("What is the impact of climate change on the ocean?")
reranked_nodes = reranker.postprocess_nodes(
    nodes,
    query_bundle=QueryBundle("What is the impact of climate change on the ocean?"),
)
print("Initial retrieval: ", len(nodes), " nodes")
print("Re-ranked retrieval: ", len(reranked_nodes), " nodes")
Initial retrieval: 19 nodes
Re-ranked retrieval: 4 nodes
from llama_index.response.notebook_utils import display_source_node
for node in reranked_nodes:
    display_source_node(node)
Node ID: f297871f-603e-4192-b289-f724e96ccff8
Similarity: 6.131191253662109
Text: Observations: vulnerabilities and impacts
Anthropogenic climate change has exposed ocean and coas…
Node ID: 61eb5022-6988-4e2a-98ab-3dc58bddde6a
Similarity: 6.01539945602417
Text: 3
469Oceans and Coastal Ecosystems and Their Services Chapter 3
Frequently Asked Questions
FAQ 3…
Node ID: 7ddaf395-f5f2-4593-b4fa-d8e82b57696c
Similarity: 4.70263671875
Text: {Box 3.2, 3.2.2.1, 3.4.2.5, 3.4.2.10, 3.4.3.3, Cross-Chapter
Box PALEO in Chapter 1}
Climate imp…
Node ID: 574fd2e2-dd52-45a7-bb40-ca25c9b2cc00
Similarity: 3.768509864807129
Text: In both polar regions, climate-induced changes in ocean and sea ice conditions have
expanded the…
Full Query Engine
from llama_index.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(
    retriever=hybrid_retriever,
    node_postprocessors=[reranker],
    service_context=service_context,
)
response = query_engine.query("What is the impact of climate change on the ocean?")
from llama_index.response.notebook_utils import display_response
display_response(response)
Final Response:
Climate change has greatly impacted life in the ocean and along its coasts. It has exposed ocean and coastal ecosystems to conditions that are unprecedented over millennia. Fundamental changes in the physical and chemical characteristics of the ocean are changing the timing, distribution, and abundance of oceanic and coastal organisms. These changes have been observed through multi-decadal observations, laboratory studies, and meta-analyses of published data. Additionally, climate change is degrading ocean health, altering stocks of marine resources, and threatening the sustenance provided to Indigenous Peoples, livelihoods of artisanal fisheries, and marine-based industries such as tourism, shipping, and transportation. The impacts of climate change on the ocean can influence human activities and employment by altering resource availability, spreading pathogens, flooding shorelines, and degrading ocean ecosystems.