BM25 Retriever#

In this guide, we define a bm25 retriever that search documents using bm25 method.

This notebook is very similar to the RouterQueryEngine notebook.

Setup#

If you’re opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

!pip install llama-index

# NOTE: This is ONLY necessary in jupyter notebook.
# Details: Jupyter runs an event-loop behind the scenes.
#          This results in nested event-loops when we start an event-loop to make async queries.
#          This is normally not allowed, we use nest_asyncio to allow it for convenience.
import nest_asyncio

nest_asyncio.apply()

import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().handlers = []
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index import (
    SimpleDirectoryReader,
    ServiceContext,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.retrievers import BM25Retriever
from llama_index.indices.vector_store.retrievers.retriever import (
    VectorIndexRetriever,
)
from llama_index.llms import OpenAI

Download Data#

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

Load Data#

We first show how to convert a Document into a set of Nodes, and insert into a DocumentStore.

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# initialize service context (set chunk size)
llm = OpenAI(model="gpt-4")
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm)
nodes = service_context.node_parser.get_nodes_from_documents(documents)

# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    service_context=service_context,
)

BM25 Retriever#

We will search document with bm25 retriever.

# !pip install rank_bm25

# We can pass in the index, doctore, or list of nodes to create the retriever
retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=2)

from llama_index.response.notebook_utils import display_source_node

# will retrieve context from specific companies
nodes = retriever.retrieve("What happened at Viaweb and Interleaf?")
for node in nodes:
    display_source_node(node)

Node ID: d95537b4-b398-4b47-94ff-da86f05a27f7
Similarity: 5.171801938898801
Text: I wanted to go back to RISD, but I was now broke and RISD was very expensive, so I decided to get…

Node ID: 6f84e2a5-1ab1-4389-8799-b7713e085931
Similarity: 4.838241203957084
Text: All you had to do was teach SHRDLU more words.

There weren’t any classes in AI at Cornell then, …

nodes = retriever.retrieve("What did Paul Graham do after RISD?")
for node in nodes:
    display_source_node(node)

Node ID: a4fd0b29-4138-4741-9e27-9f65d6968eb4
Similarity: 8.090884087344435
Text: Not so much because it was badly written as because the problem is so convoluted. When you’re wor…

Node ID: d95537b4-b398-4b47-94ff-da86f05a27f7
Similarity: 5.830874349482576
Text: I wanted to go back to RISD, but I was now broke and RISD was very expensive, so I decided to get…

Router Retriever with bm25 method#

Now we will combine bm25 retriever with vector index retriever.

from llama_index.tools import RetrieverTool

vector_retriever = VectorIndexRetriever(index)
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=2)

retriever_tools = [
    RetrieverTool.from_defaults(
        retriever=vector_retriever,
        description="Useful in most cases",
    ),
    RetrieverTool.from_defaults(
        retriever=bm25_retriever,
        description="Useful if searching about specific information",
    ),
]

from llama_index.retrievers import RouterRetriever

retriever = RouterRetriever.from_defaults(
    retriever_tools=retriever_tools,
    service_context=service_context,
    select_multi=True,
)

# will retrieve all context from the author's life
nodes = retriever.retrieve(
    "Can you give me all the context regarding the author's life?"
)
for node in nodes:
    display_source_node(node)

Selecting retriever 0: The author's life context is a broad topic, which may require a comprehensive approach that is useful in most cases..

Node ID: fcd399c1-3544-4df3-80a9-0a7d3fd41f1f
Similarity: 0.7942753162501964
Text: [10]

Wow, I thought, there’s an audience. If I write something and put it on the web, anyone can…

Node ID: b203e140-d549-4284-99f4-b1b5bcd996ea
Similarity: 0.7788031317604815
Text: Now all I had to do was learn Italian.

Only stranieri (foreigners) had to take this entrance exa…

Advanced - Hybrid Retriever + Re-Ranking#

Here we extend the base retriever class and create a custom retriever that always uses the vector retriever and BM25 retreiver.

Then, nodes can be re-ranked and filtered. This lets us keep intermediate top-k values large and letting the re-ranking filter out un-needed nodes.

To best demonstrate this, we will use a larger set of source documents – Chapter 3 from the 2022 IPCC Climate Report.

Setup data#

!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20.7M  100 20.7M    0     0   361k      0  0:00:58  0:00:58 --:--:--  422k

# !pip install pypdf

from llama_index import (
    VectorStoreIndex,
    ServiceContext,
    StorageContext,
    SimpleDirectoryReader,
)
from llama_index.llms import OpenAI

# load documents
documents = SimpleDirectoryReader(
    input_files=["IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()

# initialize service context (set chunk size)
# -- here, we set a smaller chunk size, to allow for more effective re-ranking
llm = OpenAI(model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(chunk_size=256, llm=llm)
nodes = service_context.node_parser.get_nodes_from_documents(documents)

# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

index = VectorStoreIndex(
    nodes, storage_context=storage_context, service_context=service_context
)

from llama_index.retrievers import BM25Retriever

# retireve the top 10 most similar nodes using embeddings
vector_retriever = index.as_retriever(similarity_top_k=10)

# retireve the top 10 most similar nodes using bm25
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)

Custom Retriever Implementation#

from llama_index.retrievers import BaseRetriever


class HybridRetriever(BaseRetriever):
    def __init__(self, vector_retriever, bm25_retriever):
        self.vector_retriever = vector_retriever
        self.bm25_retriever = bm25_retriever
        super().__init__()

    def _retrieve(self, query, **kwargs):
        bm25_nodes = self.bm25_retriever.retrieve(query, **kwargs)
        vector_nodes = self.vector_retriever.retrieve(query, **kwargs)

        # combine the two lists of nodes
        all_nodes = []
        node_ids = set()
        for n in bm25_nodes + vector_nodes:
            if n.node.node_id not in node_ids:
                all_nodes.append(n)
                node_ids.add(n.node.node_id)
        return all_nodes

index.as_retriever(similarity_top_k=5)

hybrid_retriever = HybridRetriever(vector_retriever, bm25_retriever)

Re-Ranker Setup#

# !pip install sentence_transformers

from llama_index.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(top_n=4, model="BAAI/bge-reranker-base")

Downloading (…)lve/main/config.json: 100%|██████████| 799/799 [00:00<00:00, 3.86MB/s]
Downloading pytorch_model.bin: 100%|██████████| 1.11G/1.11G [00:32<00:00, 34.4MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 443/443 [00:00<00:00, 2.19MB/s]
Downloading (…)tencepiece.bpe.model: 100%|██████████| 5.07M/5.07M [00:00<00:00, 14.1MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 279/279 [00:00<00:00, 1.48MB/s]

Use pytorch device: cpu

Retrieve#

from llama_index import QueryBundle

nodes = hybrid_retriever.retrieve(
    "What is the impact of climate change on the ocean?"
)
reranked_nodes = reranker.postprocess_nodes(
    nodes,
    query_bundle=QueryBundle(
        "What is the impact of climate change on the ocean?"
    ),
)

print("Initial retrieval: ", len(nodes), " nodes")
print("Re-ranked retrieval: ", len(reranked_nodes), " nodes")

Batches: 100%|██████████| 1/1 [00:05<00:00,  5.61s/it]

Initial retrieval:  19  nodes
Re-ranked retrieval:  4  nodes

from llama_index.response.notebook_utils import display_source_node

for node in reranked_nodes:
    display_source_node(node)

Node ID: 74b12b7b-f4b9-490a-9342-b640211468dd
Similarity: 0.998129665851593
Text: 3 469Oceans and Coastal Ecosystems and Their Services Chapter 3 Frequently Asked Questions FAQ 3…

Node ID: 2b35824c-2e96-47b7-8dfb-da25c4eefb7d
Similarity: 0.996731162071228
Text: {Box 3.2, 3.2.2.1, 3.4.2.5, 3.4.2.10, 3.4.3.3, Cross-Chapter Box PALEO in Chapter 1} Climate imp…

Node ID: 01ef2a9e-0dd0-4bce-ab60-e6a3f6456f7b
Similarity: 0.9954373240470886
Text: These ecosystems are also influenced by non-climate drivers, especially fisheries, oil and gas ex…

Node ID: 8a23b728-0352-4b01-a5c0-42765669855d
Similarity: 0.9872682690620422
Text: Additionally, climate-change-driven oxygen loss (Section 3.2.3.2; Luna et al., 2012; Belley et…

Full Query Engine#

from llama_index.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever=hybrid_retriever,
    node_postprocessors=[reranker],
    service_context=service_context,
)

response = query_engine.query(
    "What is the impact of climate change on the ocean?"
)

Batches: 100%|██████████| 1/1 [00:05<00:00,  5.74s/it]

from llama_index.response.notebook_utils import display_response

display_response(response)

Final Response: Climate change has significant impacts on the ocean. It is degrading ocean health and altering stocks of marine resources. This, combined with over-harvesting, is threatening the sustenance provided to Indigenous Peoples, the livelihoods of artisanal fisheries, and marine-based industries such as tourism, shipping, and transportation. Climate change can also influence human activities and employment by altering resource availability, spreading pathogens, flooding shorelines, and degrading ocean ecosystems. Additionally, increases in intensity, reoccurrence, and duration of marine heatwaves due to climate change can lead to species extirpation, habitat collapse, and surpassing ecological tipping points. Some habitat-forming coastal ecosystems, including coral reefs, kelp forests, and seagrass meadows, are at high risk of irreversible phase shifts due to marine heatwaves. Non-climate drivers such as fisheries, oil and gas extraction, cable laying, and mineral resource exploration also influence ocean ecosystems.