Multi-modal#

LlamaIndex offers capabilities not only to build language-based applications, but also multi-modal applications that combine language and images.

Types of Multi-modal Use Cases#

This space is still being actively explored, but some fascinating use cases are already emerging.

RAG (Retrieval Augmented Generation)#

All of the core RAG concepts (indexing, retrieval, and synthesis) can be extended to the image setting (a minimal sketch follows the list below):

  • The input can be text or an image.

  • The stored knowledge base can consist of text and images.

  • The inputs to response generation can be text and images.

  • The final response can be text or an image.
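
For illustration, here is a minimal sketch of a multi-modal RAG pipeline. It assumes the llama-index 0.10+ package layout plus the OpenAI and CLIP integrations, and the `./data` folder and query string are placeholders; treat the exact imports and keyword arguments as version-dependent rather than a fixed API.

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# "./data" is a placeholder folder containing both text files and images.
documents = SimpleDirectoryReader("./data").load_data()

# Text and images are embedded into separate vector stores under the hood
# (image embeddings default to CLIP, which requires an extra integration).
index = MultiModalVectorStoreIndex.from_documents(documents)

# A multi-modal LLM (here GPT-4V) synthesizes the final answer over the
# retrieved text and image nodes.
openai_mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300)
query_engine = index.as_query_engine(llm=openai_mm_llm)
print(query_engine.query("Describe the product shown in the catalog images."))
```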

Check out our guides below:

Structured Outputs#

You can generate structured outputs with OpenAI's GPT-4V via LlamaIndex. The user just needs to specify a Pydantic object to define the structure of the output.
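
As an illustrative sketch: the schema, prompt, and image directory below are hypothetical, and the completion-program API is assumed from recent llama-index releases.

```python
from pydantic import BaseModel

from llama_index.core import SimpleDirectoryReader
from llama_index.core.output_parsers import PydanticOutputParser
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

class Receipt(BaseModel):
    """Hypothetical target schema for the fields to extract."""
    vendor: str
    total: float

# Placeholder directory of receipt images.
image_documents = SimpleDirectoryReader("./receipts").load_data()

program = MultiModalLLMCompletionProgram.from_defaults(
    output_parser=PydanticOutputParser(output_cls=Receipt),
    image_documents=image_documents,
    prompt_template_str="Extract the vendor name and total amount from the receipt.",
    multi_modal_llm=OpenAIMultiModal(model="gpt-4-vision-preview"),
)

receipt = program()  # returns a populated Receipt instance
print(receipt.vendor, receipt.total)
```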

Check out the guide below:

Retrieval-Augmented Image Captioning#

Oftentimes, understanding an image requires looking up information from a knowledge base. One flow here is retrieval-augmented image captioning: first caption the image with a multi-modal model, then refine the caption by retrieving relevant text from a knowledge corpus.
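
A rough sketch of this two-step flow might look like the following. The file paths and prompts are placeholders, and the refinement step here simply feeds the draft caption into a text query engine rather than prescribing a particular pipeline.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=150)

# Step 1: draft a caption from the image alone.
image_docs = SimpleDirectoryReader(input_files=["./figure.png"]).load_data()
draft = mm_llm.complete(
    prompt="Describe this image in one sentence.",
    image_documents=image_docs,
)

# Step 2: refine the draft caption against a text knowledge base.
text_docs = SimpleDirectoryReader("./corpus").load_data()
text_index = VectorStoreIndex.from_documents(text_docs)
refined = text_index.as_query_engine().query(
    f"Expand and correct this image caption using the corpus: {draft.text}"
)
print(refined)
```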

Check out our guides below:

Evaluations and Comparisons#

These sections compare different multi-modal models across a range of use cases.

LLaVA-13B, Fuyu-8B, and MiniGPT-4 Multi-Modal LLM Models Comparison for Image Reasoning#

These notebooks show how to use different Multi-Modal LLMs for image understanding and reasoning. Model inference is served via Replicate or the OpenAI GPT-4V API. We compared several popular Multi-Modal LLMs (a minimal Replicate sketch follows the list):

  • GPT-4V (OpenAI API)

  • LLaVA-13B (Replicate)

  • Fuyu-8B (Replicate)

  • MiniGPT-4 (Replicate)

  • CogVLM (Replicate)
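
For reference, here is a hedged sketch of calling one of the Replicate-hosted models. The `ReplicateMultiModal` integration and its model-name mapping are assumed from the llama-index Replicate package, the image path is a placeholder, and a `REPLICATE_API_TOKEN` environment variable is required.

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.replicate import ReplicateMultiModal
from llama_index.multi_modal_llms.replicate.base import (
    REPLICATE_MULTI_MODAL_LLM_MODELS,
)

# Placeholder image to reason over.
image_docs = SimpleDirectoryReader(input_files=["./image.jpg"]).load_data()

# Swap "llava-13b" for "fuyu-8b", "minigpt-4", or "cogvlm" to compare models.
llava = ReplicateMultiModal(
    model=REPLICATE_MULTI_MODAL_LLM_MODELS["llava-13b"],
    max_new_tokens=200,
    temperature=0.1,
)

response = llava.complete(
    prompt="What is unusual about this image?",
    image_documents=image_docs,
)
print(response.text)
```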

Check out our guides below:

Simple Evaluation of Multi-Modal RAG#

In this notebook guide, we demonstrate how to evaluate a Multi-Modal RAG system. As in the text-only case, we consider the evaluation of Retrievers and Generators separately. As we alluded to in our blog post on evaluating Multi-Modal RAGs, our approach applies adapted versions of the usual techniques for evaluating Retrievers and Generators in the text-only case. These adapted versions ship in the llama-index library (i.e., the evaluation module), and this notebook walks you through how to apply them to your own evaluation use cases.
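
For the generator side, a minimal sketch might look like this. It assumes the multi-modal relevancy and faithfulness evaluators in the llama-index evaluation module, and the `judge` helper is illustrative: the query/response pair would come from your own Multi-Modal RAG query engine, such as the one sketched in the RAG section above.

```python
from llama_index.core.evaluation.multi_modal import (
    MultiModalFaithfulnessEvaluator,
    MultiModalRelevancyEvaluator,
)
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# A multi-modal LLM serves as the judge for both generator-side evaluators.
judge_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300)
faithfulness = MultiModalFaithfulnessEvaluator(multi_modal_llm=judge_llm)
relevancy = MultiModalRelevancyEvaluator(multi_modal_llm=judge_llm)

def judge(query, response):
    """Score a (query, response) pair produced by a multi-modal query engine."""
    f = faithfulness.evaluate_response(query=query, response=response)
    r = relevancy.evaluate_response(query=query, response=response)
    print("faithful:", f.passing, "| relevant:", r.passing)
```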

Model Guides#

Here are notebook guides showing you how to interact with different multi-modal model providers.