HuggingFace Library Integration Helpers.
Configuration builders and integration utilities for the major HuggingFace ecosystem libraries, mirroring the "4. Libraries" section of the docs:
- 4.1 Core Libraries: Transformers, Datasets, Tokenizers, Accelerate, Evaluate
- 4.2 Generative AI: Diffusers
- 4.3 Optimization: Optimum, PEFT
- 4.4 Other Tools: Safetensors, TRL, Bitsandbytes
These helpers generate configuration maps for use with:
- `HuggingfaceClient.run_job/1` - to run library-specific training on HF infra
- `HuggingfaceClient.autotrain_create/1` - for AutoTrain fine-tuning
- Local training scripts
See: https://huggingface.co/docs
Example
# Bitsandbytes 4-bit quantization config
bnb_config = HuggingfaceClient.Libraries.bnb_config(
  load_in_4bit: true,
  bnb_4bit_quant_type: "nf4",
  bnb_4bit_compute_dtype: "bfloat16"
)

# Merge into LoRA config
training_config = Map.merge(
  HuggingfaceClient.lora_config(base_model: "meta-llama/Llama-3.1-8B"),
  %{"quantization_config" => bnb_config}
)
Summary
Functions
Builds a bitsandbytes quantization config.
Returns configuration for a Diffusers pipeline.
Returns the API schema for a Gradio Space.
Returns the URL for a Gradio Space's API endpoint.
Returns an Optimum export/optimization configuration.
Returns a QLoRA configuration combining 4-bit bitsandbytes + LoRA.
Returns configuration for a reranking model.
Returns metadata about a Safetensors file from a Hub repository.
Returns configuration for a Sentence Transformers embedding model.
Returns configuration for tokenizer settings.
Returns configuration for loading a Transformers model.
Returns a TRL (Transformer Reinforcement Learning) configuration.
Functions
Builds a bitsandbytes quantization config.
Enables 4-bit or 8-bit quantization for memory-efficient inference and training.
Options (4-bit)
- `:load_in_4bit` - enable 4-bit loading (default: `true`)
- `:bnb_4bit_quant_type` - `"nf4"` (NormalFloat4, better quality) or `"fp4"` (default: `"nf4"`)
- `:bnb_4bit_compute_dtype` - compute dtype: `"bfloat16"`, `"float16"`, `"float32"` (default: `"bfloat16"`)
- `:bnb_4bit_use_double_quant` - double quantization for extra savings (default: `true`)
Options (8-bit)
- `:load_in_8bit` - enable 8-bit loading
- `:llm_int8_threshold` - threshold for mixed-precision (default: 6.0)
- `:llm_int8_skip_modules` - list of module names to skip quantization
Example
# QLoRA-style 4-bit config
config = HuggingfaceClient.Libraries.bnb_config(
  load_in_4bit: true,
  bnb_4bit_quant_type: "nf4",
  bnb_4bit_compute_dtype: "bfloat16",
  bnb_4bit_use_double_quant: true
)

# Use in LoRA training:
training_config = Map.merge(
  HuggingfaceClient.lora_config(base_model: "meta-llama/Llama-3.1-8B"),
  %{"bnb_config" => config}
)

# 8-bit for inference only
config_8bit = HuggingfaceClient.Libraries.bnb_config(
  load_in_8bit: true,
  llm_int8_threshold: 6.0
)
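The same 8-bit setting can also be requested when building a Transformers loading config. A minimal sketch, using `transformers_config/1` documented below; the model ID is illustrative:

# :load_in_8bit is listed among transformers_config/1 options and enables
# bitsandbytes 8-bit quantization at load time
inference_config = HuggingfaceClient.Libraries.transformers_config(
  model_id: "meta-llama/Llama-3.1-8B-Instruct",
  device_map: "auto",
  load_in_8bit: true
)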
Returns configuration for a Diffusers pipeline.
Used to generate images, videos, or audio with diffusion models.
Options
- `:model_id` - HF model ID (required)
- `:task` - `"text-to-image"`, `"image-to-image"`, `"inpainting"`, `"text-to-video"` (default: `"text-to-image"`)
- `:scheduler` - diffusion scheduler: `"DDPM"`, `"DDIM"`, `"DPM++"`, `"Euler"`, `"EulerA"` (default: `"EulerA"`)
- `:dtype` - `"float16"`, `"bfloat16"`, `"float32"` (default: `"float16"`)
- `:device` - `"cuda"`, `"mps"`, `"cpu"` (default: `"cuda"`)
- `:enable_xformers` - memory-efficient attention (default: `true`)
- `:safety_checker` - enable safety checker (default: `false`)
Example
config = HuggingfaceClient.Libraries.diffusers_config(
  model_id: "black-forest-labs/FLUX.1-dev",
  task: "text-to-image",
  dtype: "bfloat16",
  device: "cuda"
)
@spec gradio_api_schema(String.t(), keyword()) :: {:ok, map()} | {:error, Exception.t()}
Returns the API schema for a Gradio Space.
Example
{:ok, schema} = HuggingfaceClient.Libraries.gradio_api_schema("gradio/hello_world")
IO.inspect(schema["endpoints"])
Returns the URL for a Gradio Space's API endpoint.
Example
api_url = HuggingfaceClient.Libraries.gradio_api_url("stabilityai/stable-diffusion")
# "https://stabilityai-stable-diffusion.hf.space/run/predict"
Returns an Optimum export/optimization configuration.
Optimum is HuggingFace's toolkit for optimizing models for specific hardware.
Options
- `:model_id` - HF model ID (required)
- `:backend` - `"onnx"`, `"openvino"`, `"tflite"`, `"coreml"`, `"neuronx"` (default: `"onnx"`)
- `:task` - task type for export (e.g. `"text-classification"`)
- `:fp16` - export in FP16 (default: `false`)
- `:optimize_for` - `"performance"`, `"size"`, `"latency"` (default: `"performance"`)
Example
config = HuggingfaceClient.Libraries.optimum_config(
  model_id: "bert-base-uncased",
  backend: "onnx",
  task: "text-classification",
  fp16: true,
  optimize_for: "latency"
)
Returns a QLoRA configuration combining 4-bit bitsandbytes + LoRA.
QLoRA is among the most memory-efficient fine-tuning approaches for large models, reducing GPU memory usage by roughly 75% compared to full fine-tuning.
Options
- `:base_model` - model to fine-tune (required)
- `:rank` - LoRA rank (default: 16)
- `:alpha` - LoRA alpha (default: 32)
- `:quant_type` - `"nf4"` or `"fp4"` (default: `"nf4"`)
- `:compute_dtype` - `"bfloat16"` or `"float16"` (default: `"bfloat16"`)
Example
# Fine-tune a 70B model on a single A100
config = HuggingfaceClient.Libraries.qlora_config(
  base_model: "meta-llama/Llama-3.1-70B-Instruct",
  rank: 64,
  alpha: 128
)

{:ok, job} = HuggingfaceClient.run_job(
  image: "huggingface/trl-latest-gpu:latest",
  command: ["python", "sft.py"] ++ HuggingfaceClient.training_to_args(config),
  flavor: "a100-large",
  access_token: token
)
Returns configuration for a reranking model.
Reranking improves RAG pipelines by scoring document relevance.
Popular models: "cross-encoder/ms-marco-MiniLM-L-6-v2", "BAAI/bge-reranker-large".
Options
- `:model_id` - cross-encoder model ID (required)
- `:max_length` - max input length (default: 512)
- `:batch_size` - scoring batch size (default: 32)
Example
config = HuggingfaceClient.Libraries.reranker_config(
  model_id: "BAAI/bge-reranker-large",
  max_length: 1024
)

# Use with TEI rerank endpoint
tei = HuggingfaceClient.tei("http://localhost:8080")

{:ok, results} = HuggingfaceClient.tei_rerank(tei,
  query: "What is deep learning?",
  texts: ["Deep learning is...", "Python is..."]
)
@spec safetensors_metadata(String.t(), String.t(), keyword()) :: {:ok, map()} | {:error, Exception.t()}
Returns metadata about a Safetensors file from a Hub repository.
Safetensors is a safe, fast format for storing model weights.
Example
{:ok, meta} = HuggingfaceClient.Libraries.safetensors_metadata(
  "gpt2", "model.safetensors"
)

IO.puts("Tensors: #{map_size(meta["tensors"])}")
Returns configuration for a Sentence Transformers embedding model.
Sentence Transformers provides state-of-the-art embeddings for semantic search, RAG, and document similarity.
Options
- `:model_id` - embedding model ID (required). Examples: `"sentence-transformers/all-MiniLM-L6-v2"`, `"BAAI/bge-large-en-v1.5"`, `"intfloat/e5-large-v2"`
- `:normalize` - normalize embeddings (default: `true`)
- `:prompt` - prompt prefix for asymmetric models
- `:batch_size` - encoding batch size (default: 32)
- `:max_seq_length` - max token length (default: 512)
Example
config = HuggingfaceClient.Libraries.sentence_transformers_config(
  model_id: "BAAI/bge-large-en-v1.5",
  normalize: true,
  prompt: "Represent this sentence for searching relevant passages: "
)

# Use with TEI client for production serving
tei = HuggingfaceClient.tei("http://localhost:8080")

{:ok, embedding} = HuggingfaceClient.tei_embed(tei,
  "What is machine learning?",
  prompt_name: "query"
)
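Embeddings can also be compared locally. A minimal sketch, assuming `tei_embed/3` returns plain lists of floats; with `:normalize` set to `true`, the dot product of two embeddings already equals their cosine similarity:

{:ok, query_emb} = HuggingfaceClient.tei_embed(tei, "What is machine learning?")
{:ok, doc_emb} = HuggingfaceClient.tei_embed(tei, "Machine learning is a branch of AI.")

# Dot product of two L2-normalized vectors is their cosine similarity
similarity =
  Enum.zip(query_emb, doc_emb)
  |> Enum.map(fn {a, b} -> a * b end)
  |> Enum.sum()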
Returns configuration for tokenizer settings.
Options
- `:model_id` - HF model ID (required)
- `:max_length` - max token length (default: 512)
- `:padding` - `"max_length"`, `"longest"`, `"do_not_pad"` (default: `"longest"`)
- `:truncation` - truncation strategy (default: `true`)
- `:add_special_tokens` - add BOS/EOS tokens (default: `true`)
- `:return_tensors` - `"pt"`, `"tf"`, `"np"` (default: `"pt"`)
Example
tokenizer_cfg = HuggingfaceClient.Libraries.tokenizer_config(
  model_id: "bert-base-uncased",
  max_length: 512,
  padding: "max_length",
  truncation: true
)
Returns configuration for loading a Transformers model.
Generates a configuration map compatible with `from_pretrained` keyword arguments.
Options
- `:model_id` - HF model ID (required)
- `:revision` - branch/commit (default: `"main"`)
- `:dtype` - `"float16"`, `"bfloat16"`, `"float32"`, `"auto"` (default: `"auto"`)
- `:device_map` - `"auto"`, `"cuda"`, `"cpu"`, or an explicit device map
- `:trust_remote_code` - allow custom model code (default: `false`)
- `:load_in_4bit` / `:load_in_8bit` - quantize with bitsandbytes
- `:attn_implementation` - `"flash_attention_2"`, `"sdpa"`, `"eager"`
- `:use_cache` - enable KV cache (default: `true`)
Example
config = HuggingfaceClient.Libraries.transformers_config(
  model_id: "meta-llama/Llama-3.1-8B-Instruct",
  dtype: "bfloat16",
  device_map: "auto",
  attn_implementation: "flash_attention_2"
)
Returns a TRL (Transformer Reinforcement Learning) configuration.
TRL supports SFT, DPO, ORPO, GRPO, PPO, and reward modeling.
Options
- `:trainer` - `"sft"`, `"dpo"`, `"orpo"`, `"grpo"`, `"ppo"`, `"reward"` (required)
- `:base_model` - model ID to train (required)
- `:dataset` - dataset ID
- `:max_seq_length` - max sequence length (default: 2048)
- `:packing` - pack sequences for efficiency (default: `true` for SFT)
- `:use_peft` - use PEFT/LoRA (default: `false`)
- `:lora_r` - LoRA rank (used when `:use_peft` is `true`)
- `:learning_rate` - learning rate
Example
# SFT training config
config = HuggingfaceClient.Libraries.trl_config(
  trainer: "sft",
  base_model: "Qwen/Qwen2.5-7B",
  dataset: "my-org/chat-dataset",
  max_seq_length: 2048,
  use_peft: true,
  lora_r: 16
)

# DPO alignment
config = HuggingfaceClient.Libraries.trl_config(
  trainer: "dpo",
  base_model: "my-org/sft-model",
  dataset: "my-org/preference-data",
  beta: 0.1
)