Pandas Query Engine
import logging
import sys
from IPython.display import Markdown, display
import pandas as pd
from llama_index.query_engine import PandasQueryEngine
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
Let's start on a Toy DataFrame
A very simple DataFrame containing city and population pairs.
# Test on some sample data
df = pd.DataFrame(
    {"city": ["Toronto", "Tokyo", "Berlin"], "population": [2930000, 13960000, 3645000]}
)
query_engine = PandasQueryEngine(df=df, verbose=True)
response = query_engine.query(
    "What is the city with the highest population?",
)
> Pandas Instructions:
```
df['city'][df['population'].idxmax()]
```
> Pandas Output: Tokyo
display(Markdown(f"<b>{response}</b>"))
Tokyo
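The generated expression can be checked by hand in plain pandas, without the query engine. The sketch below rebuilds the same toy DataFrame and splits the expression into two steps; the intermediate names `top_idx` and `top_city` are illustrative only.

```python
import pandas as pd

# Same toy data as above.
df = pd.DataFrame(
    {"city": ["Toronto", "Tokyo", "Berlin"], "population": [2930000, 13960000, 3645000]}
)

# idxmax() returns the index label of the row with the largest population.
top_idx = df["population"].idxmax()

# Look up the city stored under that index label.
top_city = df["city"][top_idx]

print(top_idx, top_city)  # 1 Tokyo
```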
# get pandas python instructions
print(response.metadata["pandas_instruction_str"])
df['city'][df['population'].idxmax()]
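Because the instruction string is ordinary Python, it can be re-run outside the engine, for example to audit what the model produced. The following is a minimal sketch, assuming the string is a single expression over a variable named `df`. `run_instruction` is a hypothetical helper, not part of llama_index, and it is not a description of how the engine executes instructions internally.

```python
import pandas as pd


def run_instruction(instruction: str, df: pd.DataFrame):
    """Hypothetical helper: evaluate one pandas expression against df."""
    # Emptying __builtins__ narrows what the expression can reach, but it is
    # NOT a sandbox; do not evaluate strings from untrusted sources this way.
    return eval(instruction, {"__builtins__": {}}, {"df": df})


df = pd.DataFrame(
    {"city": ["Toronto", "Tokyo", "Berlin"], "population": [2930000, 13960000, 3645000]}
)

# The string printed from response.metadata["pandas_instruction_str"] above.
instruction = "df['city'][df['population'].idxmax()]"

result = run_instruction(instruction, df)
print(result)  # Tokyo
```

Keeping the instruction string alongside the answer makes each response reproducible: the same string applied to the same DataFrame yields the same value.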
Analyzing the Titanic Dataset
The Titanic dataset is one of the most popular tabular datasets in introductory machine learning. Source: https://www.kaggle.com/c/titanic
df = pd.read_csv("../data/csv/titanic_train.csv")
query_engine = PandasQueryEngine(df=df, verbose=True)
response = query_engine.query(
    "What is the correlation between survival and age?",
)
> Pandas Instructions:
```
df['survived'].corr(df['age'])
```
> Pandas Output: -0.07722109457217768
display(Markdown(f"<b>{response}</b>"))
-0.07722109457217768
# get pandas python instructions
print(response.metadata["pandas_instruction_str"])
df['survived'].corr(df['age'])
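To make the generated instruction easier to interpret: `Series.corr` computes the Pearson correlation by default and ignores rows where either value is missing. The sketch below illustrates this on a few hypothetical rows that are not taken from the Titanic CSV; only the column names `survived` and `age` mirror the instruction above.

```python
import pandas as pd

# Hypothetical toy rows, NOT real Titanic records. One age is missing.
toy = pd.DataFrame(
    {
        "survived": [1, 0, 1, 0, 0],
        "age": [22.0, 38.0, None, 35.0, 54.0],
    }
)

# Same call shape as the generated instruction.
r = toy["survived"].corr(toy["age"])

# Dropping incomplete rows first and naming the method explicitly
# gives the same value, showing what the defaults do.
complete = toy.dropna(subset=["survived", "age"])
r_complete = complete["survived"].corr(complete["age"], method="pearson")

print(r, r_complete)
```

A value near zero, such as the one reported for the Titanic data above, indicates only a weak linear relationship between the two columns.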