
Nvidia Triton#

Nvidia’s Triton is an inference server that provides API access to hosted LLMs. This connector allows llama_index to interact remotely with a Triton inference server over gRPC to accelerate inference operations.

Triton Inference Server GitHub

Install tritonclient#

Since we are interacting with a Triton inference server, we will need to install the tritonclient package.

tritonclient can be easily installed using pip3 install tritonclient.

!pip3 install tritonclient

Basic Usage#

Call complete with a prompt#

from llama_index.llms import NvidiaTriton

# A Triton server instance must be running. Use the correct URL for your desired Triton server instance.
triton_url = "localhost:8001"
resp = NvidiaTriton(server_url=triton_url).complete("The tallest mountain in North America is ")
print(resp)

Call chat with a list of messages#

from llama_index.llms import ChatMessage, NvidiaTriton

messages = [
    ChatMessage(
        role="system",
        content="You are a clown named bozo that has had a rough day at the circus",
    ),
    ChatMessage(role="user", content="What has you down bozo?"),
]
resp = NvidiaTriton().chat(messages)
print(resp)

Further Examples#

Remember that a Triton instance represents a running server; ensure you have a valid server configuration running and change localhost:8001 to the correct IP/hostname:port combination for your server.
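
For example, a minimal sketch of pointing the connector at a remote server (assuming a hypothetical host triton.example.com:8001 and the same server_url parameter used in the examples above):

from llama_index.llms import NvidiaTriton

# Hypothetical remote Triton server; replace with your own IP/hostname and port.
triton_url = "triton.example.com:8001"
llm = NvidiaTriton(server_url=triton_url)
resp = llm.complete("The tallest mountain in North America is ")
print(resp)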

An example of setting up this environment can be found at Nvidia’s [GenerativeAIExamples GitHub Repo](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/RetrievalAugmentedGeneration).