Postgres Vector Store#

This notebook shows how to use PostgreSQL and pgvector to perform vector searches in LlamaIndex.

If you’re opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

%pip install llama-index-vector-stores-postgres

!pip install llama-index

# import logging
# import sys

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
import textwrap
import openai

Setup OpenAI#

The first step is to configure the OpenAI API key. It will be used to create embeddings for the documents loaded into the index.

import os

os.environ["OPENAI_API_KEY"] = "<your key>"
openai.api_key = os.environ["OPENAI_API_KEY"]
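Hardcoding the key in a notebook makes it easy to leak. As a minimal sketch of an alternative, the key can be read from the environment and the notebook can fail early when it is missing. `require_env` is a hypothetical helper introduced for illustration, not part of LlamaIndex or the OpenAI client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value


# Placeholder so the sketch runs on its own; a real key would be exported in the shell.
os.environ.setdefault("OPENAI_API_KEY", "<your key>")
api_key = require_env("OPENAI_API_KEY")
```

The returned value can then be assigned to `openai.api_key` exactly as in the cell above.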

Download Data#

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

Loading documents#

Load the documents stored in data/paul_graham/ using SimpleDirectoryReader.

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)

Document ID: 88efac05-2277-4eda-a94c-c9247c9aca1c

Create the Database#

Using an existing PostgreSQL server running on localhost, create the database we’ll be using. Note that the cell below first drops any existing database named vector_db.

import psycopg2

connection_string = "postgresql://postgres:password@localhost:5432"
db_name = "vector_db"
conn = psycopg2.connect(connection_string)
conn.autocommit = True

with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")
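Database names cannot be passed as bound query parameters, which is why the cell above interpolates `db_name` directly into the SQL. If the name ever comes from user input, it may be worth validating it first. The following is a minimal sketch under that assumption; `recreate_database_statements` and the conservative identifier pattern are illustrative choices, not something the notebook or psycopg2 prescribes.

```python
import re

# Conservative pattern for an unquoted PostgreSQL identifier (an assumption of this sketch):
# lowercase letter or underscore first, then letters, digits, or underscores, 63 chars max.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]{0,62}")


def recreate_database_statements(db_name: str) -> list[str]:
    """Build the DROP/CREATE statements after checking that db_name is a plain identifier."""
    if not _IDENTIFIER.fullmatch(db_name):
        raise ValueError(f"unsafe database name: {db_name!r}")
    return [
        f"DROP DATABASE IF EXISTS {db_name}",
        f"CREATE DATABASE {db_name}",
    ]


statements = recreate_database_statements("vector_db")
print(statements)
```

Each returned string could be passed to `c.execute(...)` in place of the f-strings above.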

Create the index#

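Here we create an index backed by Postgres using the documents loaded previously. PGVectorStore.from_params takes the connection details (database, host, port, user, and password), the name of the table that will hold the vectors, and the embedding dimension.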

from sqlalchemy import make_url

url = make_url(connection_string)
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="paul_graham_essay",
    embed_dim=1536,  # openai embedding dimension
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)
query_engine = index.as_query_engine()
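To see which values `make_url` pulls out of the connection string, the same decomposition can be sketched with the standard library alone. This uses `urllib.parse.urlsplit` as a stand-in for SQLAlchemy's parser, so treat it as an illustration of the idea rather than a replacement.

```python
from urllib.parse import urlsplit

connection_string = "postgresql://postgres:password@localhost:5432"
parts = urlsplit(connection_string)

# These mirror the keyword arguments passed to PGVectorStore.from_params above.
params = {
    "host": parts.hostname,
    "port": parts.port,
    "user": parts.username,
    "password": parts.password,
}
print(params)
```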

Query the index#

We can now ask questions using our index.

response = query_engine.query("What did the author do?")

print(textwrap.fill(str(response), 100))

The author worked on writing and programming before college. Initially, the author wrote short
stories and later started programming on an IBM 1401 using an early version of Fortran. The author
then transitioned to working with microcomputers, building a computer kit and eventually getting a
TRS-80 to further explore programming. In college, the author initially planned to study philosophy
but switched to studying AI due to a lack of interest in philosophy courses. The author was inspired
to work on AI after encountering works like Heinlein's novel "The Moon is a Harsh Mistress" and
seeing Terry Winograd using SHRDLU in a PBS documentary.

response = query_engine.query("What happened in the mid 1980s?")

print(textwrap.fill(str(response), 100))

In the mid-1980s, the author spent a significant amount of time working on a book called "On Lisp"
and had obtained a contract to publish it. They were paid large amounts of money for their work,
which allowed them to save enough to go back to RISD (Rhode Island School of Design) and pay off
their college loans. They also learned valuable lessons during this time, such as the importance of
having technology companies run by product people rather than sales people, the drawbacks of editing
code by too many people, and the significance of being the "entry level" option in a competitive
market.
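Behind each query, the question is embedded and the stored chunks closest to that embedding are retrieved before the LLM writes an answer. The following toy sketch shows the nearest-neighbour step in pure Python. It assumes cosine distance as the metric and uses hand-made 3-dimensional vectors; `cosine_distance` and `nearest_chunks` are hypothetical names, and the real search runs inside Postgres via pgvector.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_chunks(query, rows, k=2):
    """Return the k (distance, node_id) pairs closest to the query vector."""
    scored = [(cosine_distance(query, vec), node_id) for node_id, vec in rows.items()]
    return sorted(scored)[:k]


# Toy 3-dimensional "embeddings"; the notebook's OpenAI vectors have 1536 dimensions.
rows = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
nearest = nearest_chunks([1.0, 0.0, 0.0], rows, k=2)
print(nearest)
```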

Querying existing index#

vector_store = PGVectorStore.from_params(
    database="vector_db",
    host="localhost",
    password="password",
    port=5432,
    user="postgres",
    table_name="paul_graham_essay",
    embed_dim=1536,  # openai embedding dimension
)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine()

response = query_engine.query("What did the author do?")

print(textwrap.fill(str(response), 100))

The author worked on writing and programming before college. They wrote short stories and tried
writing programs on an IBM 1401 computer. They also built a microcomputer and started programming on
it, writing simple games and a word processor. In college, the author initially planned to study
philosophy but switched to AI due to their interest in intelligent computers. They taught themselves
AI by learning Lisp.

Hybrid Search#

To enable hybrid search, you need to:

Pass hybrid_search=True when constructing the PGVectorStore (and optionally set text_search_config to the desired language).
Pass vector_store_query_mode="hybrid" when constructing the query engine (this setting is passed to the retriever under the hood). You can optionally set sparse_top_k to control how many results are fetched from sparse text search (by default it uses the same value as similarity_top_k).

from sqlalchemy import make_url

url = make_url(connection_string)
hybrid_vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="paul_graham_essay_hybrid_search",
    embed_dim=1536,  # openai embedding dimension
    hybrid_search=True,
    text_search_config="english",
)

storage_context = StorageContext.from_defaults(
    vector_store=hybrid_vector_store
)
hybrid_index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

hybrid_query_engine = hybrid_index.as_query_engine(
    vector_store_query_mode="hybrid", sparse_top_k=2
)
hybrid_response = hybrid_query_engine.query(
    "Who does Paul Graham think of with the word schtick"
)

print(hybrid_response)

Roy Lichtenstein
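Hybrid search produces two ranked lists, one from vector similarity and one from full-text search, which then have to be combined. One common way to merge such lists is reciprocal rank fusion, sketched below. This is a swapped-in illustration of the general idea and is not claimed to be how PGVectorStore merges its results; the node ids and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; ids ranked highly in any list score higher."""
    scores = {}
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id so the order is deterministic.
    return sorted(scores, key=lambda node_id: (-scores[node_id], node_id))


dense_hits = ["n1", "n2", "n3"]  # hypothetical ids from vector similarity
sparse_hits = ["n3", "n4"]       # hypothetical ids from full-text search (two results)
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

A node that appears in both lists, such as `n3` here, rises to the top even though neither retriever ranked it first on both counts.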

Metadata filters#

PGVectorStore supports storing metadata in nodes, and filtering based on that metadata during the retrieval step.

Download git commits dataset#

!mkdir -p 'data/git_commits/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/csv/commit_history.csv' -O 'data/git_commits/commit_history.csv'

import csv

with open("data/git_commits/commit_history.csv", "r") as f:
    commits = list(csv.DictReader(f))

print(commits[0])
print(len(commits))

{'commit': '44e41c12ab25e36c202f58e068ced262eadc8d16', 'author': 'Lakshmi Narayanan Sreethar<[email protected]>', 'date': 'Tue Sep 5 21:03:21 2023 +0530', 'change summary': 'Fix segfault in set_integer_now_func', 'change details': 'When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache.  Fixes #6037 '}
4167

Add nodes with custom metadata#

# Create TextNode for each of the first 100 commits
from llama_index.core.schema import TextNode
from datetime import datetime

nodes = []
dates = set()
authors = set()
for commit in commits[:100]:
    author_email = commit["author"].split("<")[1][:-1]
    commit_date = datetime.strptime(
        commit["date"], "%a %b %d %H:%M:%S %Y %z"
    ).strftime("%Y-%m-%d")
    commit_text = commit["change summary"]
    if commit["change details"]:
        commit_text += "\n\n" + commit["change details"]
    nodes.append(
        TextNode(
            text=commit_text,
            metadata={
                "commit_date": commit_date,
                "author": author_email,
            },
        )
    )
    dates.add(commit_date)
    authors.add(author_email)

print(nodes[0])
print(min(dates), "to", max(dates))
print(authors)

Node ID: e084ffbd-24e0-4bd9-b7c8-287fe1abd85d
Text: Fix segfault in set_integer_now_func  When an invalid function
oid is passed to set_integer_now_func, it finds out that the function
oid is invalid but before throwing the error, it calls ReleaseSysCache
on an invalid tuple causing a segfault. Fixed that by removing the
invalid call to ReleaseSysCache.  Fixes #6037
2023-03-22 to 2023-09-05
{'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}
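The two transformations in the loop above can be checked in isolation. The sketch below parses the date string from the first CSV row and extracts an email from a made-up author string (`Jane Doe<jane@example.com>` is a placeholder, not a value from the dataset). Storing dates as zero-padded `YYYY-MM-DD` strings matters because such strings sort in calendar order, which is what lets plain string comparisons act as date-range filters.

```python
from datetime import datetime

raw_date = "Tue Sep 5 21:03:21 2023 +0530"  # format seen in the CSV row above
iso_date = datetime.strptime(raw_date, "%a %b %d %H:%M:%S %Y %z").strftime("%Y-%m-%d")

# Hypothetical author string in the "Name<email>" shape used by the dataset.
raw_author = "Jane Doe<jane@example.com>"
author_email = raw_author.split("<")[1][:-1]  # text after "<", minus the trailing ">"

# Zero-padded ISO dates compare the same way as strings and as calendar dates.
assert "2023-08-15" <= "2023-08-23" <= "2023-08-25"
print(iso_date, author_email)
```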

vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="metadata_filter_demo3",
    embed_dim=1536,  # openai embedding dimension
)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
index.insert_nodes(nodes)

print(index.as_query_engine().query("How did Lakshmi fix the segfault?"))

Lakshmi fixed the segfault by removing the invalid call to ReleaseSysCache that was causing the issue.

Apply metadata filters#

Now we can filter by commit author or by date when retrieving nodes.

from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="author", value="[email protected]"),
        MetadataFilter(key="author", value="[email protected]"),
    ],
    condition="or",
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)

{'commit_date': '2023-08-07', 'author': '[email protected]'}
{'commit_date': '2023-08-07', 'author': '[email protected]'}
{'commit_date': '2023-08-15', 'author': '[email protected]'}
{'commit_date': '2023-08-23', 'author': '[email protected]'}
{'commit_date': '2023-07-13', 'author': '[email protected]'}
{'commit_date': '2023-08-27', 'author': '[email protected]'}
{'commit_date': '2023-08-21', 'author': '[email protected]'}
{'commit_date': '2023-08-30', 'author': '[email protected]'}
{'commit_date': '2023-08-10', 'author': '[email protected]'}
{'commit_date': '2023-08-20', 'author': '[email protected]'}

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-15", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-25", operator="<="),
    ],
    condition="and",
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)

{'commit_date': '2023-08-23', 'author': '[email protected]'}
{'commit_date': '2023-08-15', 'author': '[email protected]'}
{'commit_date': '2023-08-17', 'author': '[email protected]'}
{'commit_date': '2023-08-15', 'author': '[email protected]'}
{'commit_date': '2023-08-23', 'author': '[email protected]'}
{'commit_date': '2023-08-15', 'author': '[email protected]'}
{'commit_date': '2023-08-21', 'author': '[email protected]'}
{'commit_date': '2023-08-24', 'author': '[email protected]'}
{'commit_date': '2023-08-16', 'author': '[email protected]'}
{'commit_date': '2023-08-20', 'author': '[email protected]'}

Apply nested filters#

In the above examples, we combined multiple filters using AND or OR. We can also combine multiple sets of filters.

e.g. in SQL:

WHERE (commit_date >= '2023-08-01' AND commit_date <= '2023-08-15') AND (author = '[email protected]' OR author = '[email protected]')

filters = MetadataFilters(
    filters=[
        MetadataFilters(
            filters=[
                MetadataFilter(
                    key="commit_date", value="2023-08-01", operator=">="
                ),
                MetadataFilter(
                    key="commit_date", value="2023-08-15", operator="<="
                ),
            ],
            condition="and",
        ),
        MetadataFilters(
            filters=[
                MetadataFilter(key="author", value="[email protected]"),
                MetadataFilter(key="author", value="[email protected]"),
            ],
            condition="or",
        ),
    ],
    condition="and",
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)

{'commit_date': '2023-08-07', 'author': '[email protected]'}
{'commit_date': '2023-08-07', 'author': '[email protected]'}
{'commit_date': '2023-08-15', 'author': '[email protected]'}
{'commit_date': '2023-08-10', 'author': '[email protected]'}
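The semantics of nested filters can be sketched as a small recursive evaluator over plain metadata dicts. `Leaf`, `Group`, and `matches` are hypothetical stand-ins written for this illustration; they are not the LlamaIndex classes, and in the notebook the filtering happens inside Postgres rather than in Python. The author addresses are placeholders.

```python
import operator
from dataclasses import dataclass, field

_OPS = {"==": operator.eq, ">=": operator.ge, "<=": operator.le}


@dataclass
class Leaf:
    key: str
    value: str
    op: str = "=="


@dataclass
class Group:
    filters: list = field(default_factory=list)
    condition: str = "and"


def matches(node, metadata):
    """Return True if the metadata dict satisfies a Leaf or a (possibly nested) Group."""
    if isinstance(node, Leaf):
        return _OPS[node.op](metadata[node.key], node.value)
    results = (matches(child, metadata) for child in node.filters)
    return all(results) if node.condition == "and" else any(results)


nested = Group(
    filters=[
        Group([Leaf("commit_date", "2023-08-01", ">="), Leaf("commit_date", "2023-08-15", "<=")], "and"),
        Group([Leaf("author", "a@example.com"), Leaf("author", "b@example.com")], "or"),
    ],
    condition="and",
)

rows = [
    {"commit_date": "2023-08-07", "author": "a@example.com"},  # in range, author matches
    {"commit_date": "2023-08-23", "author": "a@example.com"},  # out of range
    {"commit_date": "2023-08-10", "author": "c@example.com"},  # author does not match
]
kept = [row for row in rows if matches(nested, row)]
print(kept)
```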

PgVector Query Options#

IVFFlat Probes#

Specify the number of IVFFlat probes (1 by default).

When retrieving from the index, you can set the number of IVFFlat probes: a higher value improves recall, while a lower value improves speed.

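# Example values; query_mode and top_k are not defined earlier in this notebook
query_mode = "default"
top_k = 10

retriever = index.as_retriever(
    vector_store_query_mode=query_mode,
    similarity_top_k=top_k,
    vector_store_kwargs={"ivfflat_probes": 10},
)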
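The recall/speed trade-off comes from how an inverted-file index works: vectors are grouped into lists around centroids, and a query scans only the lists whose centroids are nearest. The toy sketch below illustrates this with hand-picked 2-D points and centroids. It is a simplified model for intuition, not pgvector's implementation, and all names in it are made up for the example.

```python
import math

# Two hand-picked centroids stand in for the clustering step of index construction.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = {"a": (1.0, 0.0), "b": (4.0, 0.0), "c": (6.0, 0.0), "d": (9.0, 0.0)}

# Assign every vector to the list of its nearest centroid.
lists = {i: [] for i in range(len(centroids))}
for name, vec in vectors.items():
    nearest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
    lists[nearest].append(name)


def search(query, probes, k=2):
    """Scan only the `probes` lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [name for i in order[:probes] for name in lists[i]]
    return sorted(candidates, key=lambda name: math.dist(query, vectors[name]))[:k]


query = (4.9, 0.0)
one_probe = search(query, probes=1)   # scans a single list and misses "c"
two_probes = search(query, probes=2)  # scans both lists and finds the true neighbours
print(one_probe, two_probes)
```

With one probe the query never looks at the list containing `c`, its second-nearest neighbour; probing both lists recovers it at the cost of scanning every vector.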

HNSW EF Search#

Specify the size of the dynamic candidate list used during search (40 by default). As with IVFFlat probes, a larger value generally trades speed for recall.

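# Example values; query_mode and top_k are not defined earlier in this notebook
query_mode = "default"
top_k = 10

retriever = index.as_retriever(
    vector_store_query_mode=query_mode,
    similarity_top_k=top_k,
    vector_store_kwargs={"hnsw_ef_search": 300},
)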
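In pgvector itself, these two knobs are session settings named `ivfflat.probes` and `hnsw.ef_search`. The sketch below renders the corresponding `SET` statements from a `vector_store_kwargs`-style dict, which can be handy when reproducing a query in psql. That the vector store applies the kwargs through exactly these statements is an assumption here, and `session_settings` is a hypothetical helper.

```python
def session_settings(vector_store_kwargs):
    """Render pgvector SET statements for the tuning keys used in this section."""
    # Mapping from the keyword names above to pgvector's setting names.
    names = {"ivfflat_probes": "ivfflat.probes", "hnsw_ef_search": "hnsw.ef_search"}
    statements = []
    for key, setting in names.items():
        if key in vector_store_kwargs:
            value = int(vector_store_kwargs[key])  # int() rejects non-numeric input
            statements.append(f"SET {setting} = {value};")
    return statements


stmts = session_settings({"ivfflat_probes": 10, "hnsw_ef_search": 300})
print(stmts)
```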