ImageTextToText (huggingface_client v0.1.0)

Multimodal vision-language models (VLMs).

Combines an image with a text prompt to generate a text response. Used for GPT-4V-style tasks: image captioning with context, visual reasoning, chart and document understanding, and multi-turn vision conversations.

Unlike image_to_text, which generates a caption from an image alone, this task conditions the generated text on a prompt.

Summary

Functions

run(client, args)

Runs image+text to text generation.

Functions

run(client, args)

Runs image+text to text generation.

Options

  • :image — image URL, binary data, or base64-encoded string (required)
  • :prompt — text prompt to condition the generation (required)
  • :model — override the default model (e.g. "llava-hf/llava-1.5-7b-hf")
  • :max_new_tokens — maximum number of tokens to generate
  • :temperature — sampling temperature; higher values produce more varied output
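Putting the options together, a call might look like the sketch below. The full module name, the client constructor (`HuggingfaceClient.new/1`), and the `{:ok, text}` return shape are assumptions for illustration and are not confirmed by this page; only `run(client, args)` and the option keys come from the docs above.

```elixir
# Hypothetical usage sketch — constructor name and return shape are assumed.
client = HuggingfaceClient.new(api_key: System.get_env("HF_API_KEY"))

{:ok, text} =
  HuggingfaceClient.ImageTextToText.run(client,
    image: "https://example.com/sales_chart.png",   # :image — URL form
    prompt: "What trend does this chart show?",      # :prompt — required
    model: "llava-hf/llava-1.5-7b-hf",               # optional override
    max_new_tokens: 256,
    temperature: 0.2                                 # low temperature for factual answers
  )
```

A lower `:temperature` is a reasonable choice for chart and document understanding, where deterministic answers matter more than varied phrasing.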