Pandas Query Engine

import logging
import sys
from IPython.display import Markdown, display

import pandas as pd
from llama_index.query_engine import PandasQueryEngine


logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

Let’s start on a Toy DataFrame

Very simple dataframe containing city and population pairs.

# Test on some sample data
df = pd.DataFrame(
    {"city": ["Toronto", "Tokyo", "Berlin"], "population": [2930000, 13960000, 3645000]}
)
query_engine = PandasQueryEngine(df=df, verbose=True)
response = query_engine.query(
    "What is the city with the highest population?",
)
> Pandas Instructions:
```

df['city'][df['population'].idxmax()]
```
> Pandas Output: Tokyo
display(Markdown(f"<b>{response}</b>"))

Tokyo

# get pandas python instructions
print(response.metadata["pandas_instruction_str"])
df['city'][df['population'].idxmax()]

Analyzing the Titanic Dataset

The Titanic dataset is one of the most popular tabular datasets in introductory machine learning Source: https://www.kaggle.com/c/titanic

df = pd.read_csv("../data/csv/titanic_train.csv")
query_engine = PandasQueryEngine(df=df, verbose=True)
response = query_engine.query(
    "What is the correlation between survival and age?",
)
> Pandas Instructions:
```
df['survived'].corr(df['age'])
```
> Pandas Output: -0.07722109457217768
display(Markdown(f"<b>{response}</b>"))

-0.07722109457217768

# get pandas python instructions
print(response.metadata["pandas_instruction_str"])
df['survived'].corr(df['age'])