Multimodal vision-language models (VLMs).
Combines an image with a text prompt to generate a text response. Used for GPT-4V-style tasks: image captioning with added context, visual reasoning, chart and document understanding, and multi-turn vision conversations.
Differs from image_to_text, which generates captions from an image alone, without a guiding text prompt.
Summary
Functions
Runs image-and-text-to-text generation: takes an image and a text prompt, returns a generated text response.
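
As an illustration, here is a minimal sketch of an image-plus-text call using the Hugging Face transformers LLaVA implementation; the model id, prompt template, and image path are assumptions for the example, not part of this spec:

```python
# Minimal sketch: image + text prompt -> generated text.
# Assumes transformers >= 4.36 and the llava-hf/llava-1.5-7b-hf checkpoint;
# model id, prompt template, and file name are illustrative.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed example checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("chart.png")  # hypothetical input image
# LLaVA-1.5 expects the <image> placeholder inside its chat format.
prompt = "USER: <image>\nWhat trend does this chart show? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Varying the text prompt while keeping the same image is what separates this task from plain image_to_text captioning, which has no prompt input at all.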