Faithfulness Evaluator#

This notebook uses the FaithfulnessEvaluator module to measure if the response from a query engine matches any source nodes.
This is useful for measuring if the response was hallucinated.
The data is extracted from the New York City wikipedia page.

# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()
# configuring logger to INFO level
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index import (
    TreeIndex,
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
    Response,
)
from llama_index.llms import OpenAI
from llama_index.evaluation import FaithfulnessEvaluator
import pandas as pd

pd.set_option("display.max_colwidth", 0)

Using GPT-4 here for evaluation

# gpt-4
gpt4 = OpenAI(temperature=0, model="gpt-4")
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)

evaluator_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)
documents = SimpleDirectoryReader("./test_wiki_data/").load_data()
# create vector index
service_context = ServiceContext.from_defaults(chunk_size=512)
vector_index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)
# define jupyter display function
def display_eval_df(response: Response, eval_result: str) -> None:
    if response.source_nodes == []:
        print("no response!")
        return
    eval_df = pd.DataFrame(
        {
            "Response": str(response),
            "Source": response.source_nodes[0].node.text[:1000] + "...",
            "Evaluation Result": "Pass" if eval_result.passing else "Fail",
        },
        index=[0],
    )
    eval_df = eval_df.style.set_properties(
        **{
            "inline-size": "600px",
            "overflow-wrap": "break-word",
        },
        subset=["Response", "Source"]
    )
    display(eval_df)

To run evaluations you can call the .evaluate_response() function on the Response object return from the query to run the evaluations. Lets evaluate the outputs of the vector_index.

query_engine = vector_index.as_query_engine()
response_vector = query_engine.query("How did New York City get its name?")
eval_result = evaluator_gpt4.evaluate_response(response=response_vector)
display_eval_df(response_vector, eval_result)
  Response Source Evaluation Result
0 New York City got its name from the English explorer Henry Hudson, who rediscovered New York Harbor in 1609 while searching for the Northwest Passage. He named the area New York after the Duke of York, who later became King James II of England. He claimed the area for France and named it Nouvelle Angoulême (New Angoulême).A Spanish expedition, led by the Portuguese captain Estêvão Gomes sailing for Emperor Charles V, arrived in New York Harbor in January 1525 and charted the mouth of the Hudson River, which he named Río de San Antonio ('Saint Anthony's River').The Padrón Real of 1527, the first scientific map to show the East Coast of North America continuously, was informed by Gomes' expedition and labeled the northeastern United States as Tierra de Esteban Gómez in his honor.In 1609, the English explorer Henry Hudson rediscovered New York Harbor while searching for the Northwest Passage to the Orient for the Dutch East India Company.He proceeded to sail up what the Dutch would name the North River (now the Hudson River), named first by Hudson as the Mauritius after Maurice, Prince of Orange.Hudson's first mate described the harbor as "a very good Harbour for all windes" and the river as "a mile broad" and "full of fish".Hud... Fail

Benchmark on Generated Question#

Now lets generate a few more questions so that we have more to evaluate with and run a small benchmark.

from llama_index.evaluation import DatasetGenerator

question_generator = DatasetGenerator.from_documents(documents)
eval_questions = question_generator.generate_questions_from_nodes(5)

eval_questions
WARNING:llama_index.indices.service_context:chunk_size_limit is deprecated, please specify chunk_size instead
chunk_size_limit is deprecated, please specify chunk_size instead
chunk_size_limit is deprecated, please specify chunk_size instead
['What is the population of New York City as of 2020?',
 'Which borough of New York City is home to the headquarters of the United Nations?',
 'How many languages are spoken in New York City, making it the most linguistically diverse city in the world?',
 'Who founded the trading post on Manhattan Island that would later become New York City?',
 'What was New York City named after in 1664?']
import asyncio


def evaluate_query_engine(query_engine, questions):
    c = [query_engine.aquery(q) for q in questions]
    results = asyncio.run(asyncio.gather(*c))
    print("finished query")

    total_correct = 0
    for r in results:
        # evaluate with gpt 4
        eval_result = (
            1 if evaluator_gpt4.evaluate_response(response=r).passing else 0
        )
        total_correct += eval_result

    return total_correct, len(results)
vector_query_engine = vector_index.as_query_engine()
correct, total = evaluate_query_engine(vector_query_engine, eval_questions[:5])

print(f"score: {correct}/{total}")
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=35 request_id=b36e17a843c31e827f0b7034e603cf28 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=35 request_id=b36e17a843c31e827f0b7034e603cf28 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=35 request_id=b36e17a843c31e827f0b7034e603cf28 response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=35 request_id=5acb726518065db9312da9f23beef411 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=35 request_id=5acb726518065db9312da9f23beef411 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=35 request_id=5acb726518065db9312da9f23beef411 response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=46 request_id=4af43bfbe4e24fdae0ec33312ee7491e response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=46 request_id=4af43bfbe4e24fdae0ec33312ee7491e response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=46 request_id=4af43bfbe4e24fdae0ec33312ee7491e response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=37 request_id=e30413546fe5f96d3890606767f2ec53 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=37 request_id=e30413546fe5f96d3890606767f2ec53 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=37 request_id=e30413546fe5f96d3890606767f2ec53 response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=33 request_id=01f0a8dada4dae80c97a9a412f03b84f response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=33 request_id=01f0a8dada4dae80c97a9a412f03b84f response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=33 request_id=01f0a8dada4dae80c97a9a412f03b84f response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=282 request_id=ed7b1f8ba68ae32b1d8e24e0d0764e86 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=282 request_id=ed7b1f8ba68ae32b1d8e24e0d0764e86 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=282 request_id=ed7b1f8ba68ae32b1d8e24e0d0764e86 response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=820 request_id=b4532c6d665b6cfd644861ed69819cb9 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=820 request_id=b4532c6d665b6cfd644861ed69819cb9 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=820 request_id=b4532c6d665b6cfd644861ed69819cb9 response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=847 request_id=4d9bbc71a95b7e0bb69a048e251772c8 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=847 request_id=4d9bbc71a95b7e0bb69a048e251772c8 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=847 request_id=4d9bbc71a95b7e0bb69a048e251772c8 response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=952 request_id=d1657940d881929d500b1fddc46b5866 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=952 request_id=d1657940d881929d500b1fddc46b5866 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=952 request_id=d1657940d881929d500b1fddc46b5866 response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=1482 request_id=c4456f75580d227f846d3a044e5eef1b response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=1482 request_id=c4456f75580d227f846d3a044e5eef1b response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=1482 request_id=c4456f75580d227f846d3a044e5eef1b response_code=200
finished query
score: 5/5