
Nvidia TensorRT-LLM#

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.

TensorRT-LLM GitHub

TensorRT-LLM Environment Setup#

Since TensorRT-LLM is an SDK for running local models in-process, there are a few environment steps that must be followed before the TensorRT-LLM setup can be used.

  1. NVIDIA CUDA 12.2 or higher is currently required to run TensorRT-LLM (a quick way to verify this is shown right after this list)

  2. Install tensorrt_llm via pip with pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com

  3. For this example, we will use Llama2. The Llama2 model files need to be created via build scripts, following the instructions here

    • The following files will be created by following the step above

    • llama_float16_tp1_rank0.engine: The main output of the build script, containing the executable graph of operations with the model weights embedded.

    • config.json: Includes detailed information about the model, like its general structure and precision, as well as information about which plug-ins were incorporated into the engine.

    • model.cache: Caches some of the timing and optimization information from model compilation, making successive builds quicker.

  4. mkdir model

  5. Move all of the files mentioned above into the model directory (a minimal sketch of steps 4 and 5 follows the install cell below).
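To confirm the CUDA requirement from step 1, a quick check such as the cell below can be run first. nvidia-smi reports the driver and the CUDA version it supports, while nvcc --version reports the installed toolkit (nvcc may be missing if only the driver is present); this is just a convenience check, not part of the TensorRT-LLM setup itself.

!nvidia-smi
!nvcc --version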

!pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
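The cell below is a minimal sketch of steps 4 and 5: it creates the model directory and moves the build artifacts into it. The source directory (build_output) is a hypothetical location; point it at wherever your TensorRT-LLM build script wrote the files listed in step 3.

import shutil
from pathlib import Path

# Hypothetical build output location; adjust to match your build script's output directory.
build_output = Path("./tmp/llama/13B/trt_engines/fp16/1-gpu")
model_dir = Path("./model")
model_dir.mkdir(exist_ok=True)

# Move the engine, config, and timing cache produced in step 3 into ./model.
for name in ["llama_float16_tp1_rank0.engine", "config.json", "model.cache"]:
    src = build_output / name
    if src.exists():
        shutil.move(str(src), str(model_dir / name))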

Basic Usage#

Call complete with a prompt#

from llama_index.llms import LocalTensorRTLLM


def completion_to_prompt(completion: str) -> str:
    """
    Given a completion, return the prompt using llama2 format.
    """
    return f"<s> [INST] {completion} [/INST] "


llm = LocalTensorRTLLM(
    model_path="./model",
    engine_name="llama_float16_tp1_rank0.engine",
    tokenizer_dir="meta-llama/Llama-2-13b-chat",
    completion_to_prompt=completion_to_prompt,
)
resp = llm.complete("Who is Paul Graham?")
print(str(resp))
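The completion_to_prompt helper above simply wraps the user text in Llama2's [INST] delimiters. You can call it directly to inspect the exact prompt the engine receives, and because complete returns a CompletionResponse, the generated text can also be read from its text attribute:

# Show the Llama2-formatted prompt that will be sent to the engine.
print(completion_to_prompt("Who is Paul Graham?"))

# The generated text can also be read directly from the response object.
print(resp.text)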