Retriever Query Engine with Custom Retrievers - Simple Hybrid Search
In this tutorial, we show you how to define a very simple version of hybrid search!
Combine keyword lookup retrieval with vector retrieval using "AND" and "OR" conditions.
Setup
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index import (
    GPTVectorStoreIndex,
    GPTSimpleKeywordTableIndex,
    SimpleDirectoryReader,
    ServiceContext,
    StorageContext,
)
from IPython.display import Markdown, display
Load Data
We first show how to convert a Document into a set of Nodes, and insert them into a DocumentStore.
# load documents
documents = SimpleDirectoryReader('../paul_graham_essay/data').load_data()
# initialize service context (set chunk size)
service_context = ServiceContext.from_defaults(chunk_size_limit=1024)
node_parser = service_context.node_parser
nodes = node_parser.get_nodes_from_documents(documents)
# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
Define Vector Index and Keyword Table Index over Same Data
We build a vector index and a keyword index over the same DocumentStore.
vector_index = GPTVectorStoreIndex(nodes, storage_context=storage_context)
keyword_index = GPTSimpleKeywordTableIndex(nodes, storage_context=storage_context)
Define Custom Retriever
We now define a custom retriever class that implements basic hybrid search, combining keyword lookup with semantic search.
setting "AND" means we take the intersection of the two retrieved sets
setting "OR" means we take the union
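To make the two modes concrete, here is a minimal sketch using plain Python sets standing in for the node IDs each retriever returns (the IDs here are hypothetical):

```python
# Hypothetical node IDs returned by each retriever
vector_ids = {"node-1", "node-2", "node-3"}   # semantic (vector) hits
keyword_ids = {"node-2", "node-3", "node-4"}  # keyword-table hits

# "AND": keep only nodes found by BOTH retrievers
and_ids = vector_ids & keyword_ids

# "OR": keep nodes found by EITHER retriever
or_ids = vector_ids | keyword_ids

print(sorted(and_ids))  # ['node-2', 'node-3']
print(sorted(or_ids))   # ['node-1', 'node-2', 'node-3', 'node-4']
```

The custom retriever below applies exactly these set operations to the IDs of the retrieved nodes.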
# import QueryBundle
from llama_index import QueryBundle
# import NodeWithScore
from llama_index.data_structs import NodeWithScore
# Retrievers
from llama_index.retrievers import BaseRetriever, VectorIndexRetriever, KeywordTableSimpleRetriever
from typing import List
class CustomRetriever(BaseRetriever):
    """Custom retriever that combines semantic search and keyword search."""

    def __init__(
        self,
        vector_retriever: VectorIndexRetriever,
        keyword_retriever: KeywordTableSimpleRetriever,
        mode: str = "AND",
    ) -> None:
        """Init params."""
        self._vector_retriever = vector_retriever
        self._keyword_retriever = keyword_retriever
        if mode not in ("AND", "OR"):
            raise ValueError("Invalid mode.")
        self._mode = mode

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve nodes given query."""
        # run both retrievers on the same query
        vector_nodes = self._vector_retriever.retrieve(query_bundle)
        keyword_nodes = self._keyword_retriever.retrieve(query_bundle)

        vector_ids = {n.node.get_doc_id() for n in vector_nodes}
        keyword_ids = {n.node.get_doc_id() for n in keyword_nodes}

        # map each node ID to its NodeWithScore for lookup after the set op
        combined_dict = {n.node.get_doc_id(): n for n in vector_nodes}
        combined_dict.update({n.node.get_doc_id(): n for n in keyword_nodes})

        if self._mode == "AND":
            retrieve_ids = vector_ids.intersection(keyword_ids)
        else:
            retrieve_ids = vector_ids.union(keyword_ids)

        retrieve_nodes = [combined_dict[rid] for rid in retrieve_ids]
        return retrieve_nodes
Plug Retriever into Query Engine
Plug the retriever into a query engine, and run some queries.
from llama_index import ResponseSynthesizer
from llama_index.query_engine import RetrieverQueryEngine
# define custom retriever
vector_retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=2)
keyword_retriever = KeywordTableSimpleRetriever(index=keyword_index)
custom_retriever = CustomRetriever(vector_retriever, keyword_retriever)
# define response synthesizer
response_synthesizer = ResponseSynthesizer.from_args()
# assemble query engine
custom_query_engine = RetrieverQueryEngine(
    retriever=custom_retriever,
    response_synthesizer=response_synthesizer,
)
# vector query engine
vector_query_engine = RetrieverQueryEngine(
    retriever=vector_retriever,
    response_synthesizer=response_synthesizer,
)
# keyword query engine
keyword_query_engine = RetrieverQueryEngine(
    retriever=keyword_retriever,
    response_synthesizer=response_synthesizer,
)
response = custom_query_engine.query("What did the author do during his time at YC?")
INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 12 tokens
INFO:llama_index.indices.keyword_table.retrievers:> Starting query: What did the author do during his time at YC?
INFO:llama_index.indices.keyword_table.retrievers:query keywords: ['time', 'yc', 'author']
INFO:llama_index.indices.keyword_table.retrievers:> Extracted keywords: ['time', 'yc']
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 2250 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
print(response)
The author worked on YC, wrote essays, hacked, and worked on a new version of Arc with Robert. He also organized a summer program for undergrads to start startups, funded a batch of 8 startups, and provided free air conditioners to the founders. He also noticed the advantages of funding startups in batches, such as the tight alumni community and the startups becoming each other's customers.
# hybrid search lets us avoid retrieving irrelevant nodes
# ("Yale" is never mentioned in the essay)
response = custom_query_engine.query("What did the author do during his time at Yale?")
print(str(response))
len(response.source_nodes)
None
0
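This shows why the default "AND" mode filters out the irrelevant result: since "Yale" never appears in the essay, the keyword retriever returns no nodes, so the intersection with the vector results is empty. A minimal sketch with hypothetical IDs:

```python
vector_ids = {"node-1", "node-2"}  # vector search still returns its top-k most similar nodes
keyword_ids = set()                # no keyword-table entries match "yale"

retrieve_ids = vector_ids & keyword_ids  # "AND" mode: intersection
print(len(retrieve_ids))  # 0 -- no nodes retrieved, so no answer is synthesized
```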
# in contrast, pure vector search will still return an answer
response = vector_query_engine.query("What did the author do during his time at Yale?")
print(str(response))
len(response.source_nodes)
The author attended Harvard for his PhD program in computer science and took art classes there. He then applied to the Rhode Island School of Design (RISD) for a Bachelor of Fine Arts (BFA) program and the Accademia di Belli Arti in Florence for an entrance exam. He eventually went to RISD and passed the entrance exam in Florence. During his time at Harvard, he also co-founded the Summer Founders Program, which invited undergrads to apply to start their own startups. He also worked on a new version of Arc with Robert Morris and wrote essays to promote the program.
2