Building Data Ingestion from Scratch
In this tutorial, we show you how to build a data ingestion pipeline into a vector database.
We use Pinecone as the vector database.
We will show how to do the following:
1. Load documents.
2. Use a text splitter to split documents into chunks.
3. Manually construct nodes from each text chunk.
4. [Optional] Add metadata to each Node.
5. Generate embeddings for each text chunk.
6. Insert the nodes into a vector database.
Setup
We build an empty Pinecone Index, and define the necessary LlamaIndex wrappers/abstractions so that we can start loading data into Pinecone.
Build Pinecone Index
import pinecone
import os
api_key = os.environ["PINECONE_API_KEY"]
pinecone.init(api_key=api_key, environment="us-west1-gcp")
# dimensions are for text-embedding-ada-002
pinecone.create_index("quickstart", dimension=1536, metric="euclidean", pod_type="p1")
pinecone_index = pinecone.Index("quickstart")
# [Optional] drop contents in index
pinecone_index.delete(deleteAll=True)
Create PineconeVectorStore
A simple wrapper abstraction to use in LlamaIndex. We can also wrap it in a StorageContext so we can easily load in Nodes (a sketch of this follows the code below).
from llama_index.vector_stores import PineconeVectorStore
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
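The StorageContext wrapper is not strictly required for the manual pipeline below, but here is a minimal sketch of creating one, assuming the same llama_index version used throughout this tutorial:
from llama_index.storage import StorageContext

# wrap the vector store so nodes/indexes can be loaded through one object
storage_context = StorageContext.from_defaults(vector_store=vector_store)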
Build an Ingestion Pipeline from Scratch
We show how to build an ingestion pipeline as mentioned in the introduction.
Note that steps (2) and (3) can be handled via our NodeParser abstractions, which handle splitting and node creation; a sketch of that shortcut is shown below. For the purposes of this tutorial, we show you how to create these objects manually.
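For reference, a minimal sketch of the NodeParser shortcut for steps (2) and (3), assuming the SimpleNodeParser API in this version of llama_index (the rest of the tutorial does not use it):
from llama_index.node_parser import SimpleNodeParser

# splits documents into chunked TextNodes and carries over document metadata
node_parser = SimpleNodeParser.from_defaults(chunk_size=1024)
# once `documents` is loaded in step 1, this produces the nodes directly:
# nodes = node_parser.get_nodes_from_documents(documents)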
1. Load Data
!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
from pathlib import Path
from llama_hub.file.pymu_pdf.base import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")
2. Use a Text Splitter to Split Documents
Here we import our SentenceSplitter to split document texts into smaller chunks, while preserving paragraphs/sentences as much as possible.
from llama_index.text_splitter import SentenceSplitter
text_splitter = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)
text_chunks = []
# maintain relationship with source doc index, to help inject doc metadata in (3)
doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_splitter.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
3. Manually Construct Nodes from Text Chunks
We convert each chunk into a TextNode object, a low-level data abstraction in LlamaIndex that stores content but also allows defining metadata and relationships with other Nodes (a sketch of setting a source-document relationship follows the sample output below).
We inject metadata from the document into each node.
This essentially replicates the logic in our SimpleNodeParser.
from llama_index.schema import TextNode
nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)
# print a sample node
print(nodes[0].get_content(metadata_mode="all"))
total_pages: 77
file_path: ./data/llama2.pdf
source: 1
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov
Thomas Scialom∗
GenAI, Meta
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our
models outperform open-source chat models on most benchmarks we tested, and based on
our human evaluations for helpfulness and safety, may be a suitable substitute for closed-
source models. We provide a detailed description of our approach to fine-tuning and safety
improvements of Llama 2-Chat in order to enable the community to build on our work and
contribute to the responsible development of LLMs.
∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com
†Second author
Contributions for all the authors can be found in Section A.1.
arXiv:2307.09288v2 [cs.CL] 19 Jul 2023
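As a brief aside on relationships: here is a minimal sketch of linking each node back to its source document, assuming the NodeRelationship and RelatedNodeInfo classes in llama_index.schema (optional, and not required for the rest of the tutorial):
from llama_index.schema import NodeRelationship, RelatedNodeInfo

for idx, node in enumerate(nodes):
    src_doc = documents[doc_idxs[idx]]
    # record which source document this chunk came from
    node.relationships[NodeRelationship.SOURCE] = RelatedNodeInfo(
        node_id=src_doc.doc_id
    )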
[Optional] 4. Extract Metadata from each Node
We extract metadata from each Node using our Metadata extractors.
This will add more metadata to each Node.
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms import OpenAI
llm = OpenAI(model="gpt-3.5-turbo")
metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ],
    in_place=False,
)
nodes = metadata_extractor.process_nodes(nodes)
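To check what the extractors added (for example, a document title and generated questions; the exact keys depend on the extractors used), you can print a node's metadata again:
# inspect the metadata added by the extractors on a sample node
print(nodes[0].metadata)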
5. Generate Embeddings for each Node
Generate document embeddings for each Node using our OpenAI embedding model (text-embedding-ada-002).
Store these on the embedding property on each Node.
from llama_index.embeddings import OpenAIEmbedding
embed_model = OpenAIEmbedding()
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
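With many nodes, embedding them one at a time can be slow. Here is a minimal sketch of batching the calls, assuming the embedding model exposes get_text_embedding_batch (otherwise equivalent to the loop above):
# embed all node contents in batched API calls instead of one call per node
texts = [node.get_content(metadata_mode="all") for node in nodes]
embeddings = embed_model.get_text_embedding_batch(texts, show_progress=True)
for node, embedding in zip(nodes, embeddings):
    node.embedding = embedding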
6. Load Nodes into a Vector Store
We now insert these nodes into our PineconeVectorStore.
NOTE: We skip the VectorStoreIndex abstraction, which is a higher-level abstraction that also handles ingestion. We use VectorStoreIndex in the next section to fast-track retrieval/querying.
vector_store.add(nodes)
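As a quick sanity check that the vectors landed in the index, you can inspect Pinecone's index statistics (a sketch; the reported counts depend on your data):
# verify the upserted vector count in the Pinecone index
print(pinecone_index.describe_index_stats())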
Retrieve and Query from the Vector Store
Now that our ingestion is complete, we can retrieve/query this vector store.
NOTE: We can use our high-level VectorStoreIndex abstraction here. See the next section for how to define retrieval at a lower level!
from llama_index import VectorStoreIndex
from llama_index.storage import StorageContext
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine()
query_str = "Can you tell me about the key concepts for safety finetuning"
response = query_engine.query(query_str)
print(str(response))
The key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Reinforcement Learning from Human Feedback), and safety context distillation. Supervised safety fine-tuning involves gathering adversarial prompts and safe demonstrations to align the model with safety guidelines. Safety RLHF integrates safety into the RLHF pipeline by training a safety-specific reward model and gathering challenging adversarial prompts for fine-tuning. Safety context distillation refines the RLHF pipeline by generating safer model responses with a safety preprompt and fine-tuning the model on these responses without the preprompt. These concepts aim to mitigate safety risks and improve the model's alignment with safety guidelines.