Llama
Mix.install([
  {:bumblebee, "~> 0.5.0"},
  {:nx, "~> 0.7.0"},
  {:exla, "~> 0.7.0"},
  {:kino, "~> 0.12.0"}
])
Nx.global_default_backend({EXLA.Backend, client: :host})
Introduction
In this notebook we look at running Meta's Llama model, specifically Llama 2, one of the most powerful open source Large Language Models (LLMs).
Note: this is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires at least 16GB of VRAM, though at least 30GB is recommended for optimal runtime.
Text generation
In order to load Llama 2, you need to ask for access on meta-llama/Llama-2-7b-chat-hf. Once you are granted access, generate a HuggingFace auth token and put it in an HF_TOKEN Livebook secret.
Let's load the model and create a serving for text generation:
hf_token = System.fetch_env!("LB_HF_TOKEN")
repo = {:hf, "meta-llama/Llama-2-7b-chat-hf", auth_token: hf_token}
# Option 1
# {:ok, model_info} = Bumblebee.load_model(repo, type: :bf16, backend: EXLA.Backend)
# Option 2 and 3
{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)
:ok
generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )
serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    # Option 1 and 2
    # defn_options: [compiler: EXLA]
    # Option 3
    defn_options: [compiler: EXLA, lazy_transfers: :always]
  )
# Should be supervised
Kino.start_child({Nx.Serving, name: Llama, serving: serving})
We adjust the generation config to use a non-deterministic generation strategy. The most interesting part, though, is the combination of serving options.
First, note that in the Setup cell we set the default backend to {EXLA.Backend, client: :host}, which means that by default we load the parameters onto the CPU. There are a couple of combinations of options related to parameters, trading off memory usage for speed:

1. Bumblebee.load_model(..., backend: EXLA.Backend) together with defn_options: [compiler: EXLA] - load all parameters directly onto the GPU. This requires the most memory, but it should provide the fastest inference time. In case you are using multiple GPUs (and a partitioned serving), you still want to load the parameters onto the CPU first and instead use preallocate_params: true, so that the parameters are copied onto each of them (see the sketch after this list).
2. defn_options: [compiler: EXLA] - copy all parameters to the GPU before each computation and discard them afterwards (or, more specifically, as soon as they are no longer needed in the computation). This requires less memory, but the copying increases the inference time.
3. defn_options: [compiler: EXLA, lazy_transfers: :always] - lazily copy parameters to the GPU during the computation, as needed. This requires the least memory, at the cost of inference time.
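For reference, here is a rough sketch of the multi-GPU variant hinted at in option 1. It is not part of the original setup: it assumes you have several CUDA GPUs, that the parameters stay on the host (as configured in the Setup cell), and that your Bumblebee and Nx versions support the :preallocate_params serving option and partitioned servings (partitions: true). Use it instead of, not in addition to, the single serving started above.
# Sketch only: partitioned serving with parameters copied onto each GPU.
# Assumes multiple GPUs and the model_info/tokenizer/generation_config from above.
serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    # copy the host-loaded parameters onto every partition at startup
    preallocate_params: true,
    defn_options: [compiler: EXLA]
  )

Kino.start_child({Nx.Serving, name: Llama, serving: serving, partitions: true})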
As for the other options, we specify :compile with fixed shapes, so that the model is compiled only once and inputs are always padded to match these shapes. We also enable :stream to receive text chunks as the generation is progressing.
user_input = Kino.Input.textarea("User prompt", default: "What is love?")
user = Kino.Input.read(user_input)
prompt = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
#{user} [/INST] \
"""
Nx.Serving.batched_run(Llama, prompt) |> Enum.each(&IO.write/1)
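Because the serving streams chunks, you can also render the response incrementally in Livebook rather than writing it to standard output. Below is a minimal sketch (not part of the original notebook) that accumulates the chunks and re-renders them as Markdown in a Kino frame; it assumes the Llama serving and the prompt defined above.
# Render the streamed response into a frame, updating it on every chunk
frame = Kino.Frame.new() |> Kino.render()

Nx.Serving.batched_run(Llama, prompt)
|> Enum.reduce("", fn chunk, acc ->
  text = acc <> chunk
  Kino.Frame.render(frame, Kino.Markdown.new(text))
  text
end)

:ok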