Vision-Language Models
Vision-Language Models (VLMs) feature a multimodal architecture that processes image and text data simultaneously. They can perform tasks such as Visual Question Answering (VQA), image captioning, and text-to-image search. VLMs rely on techniques such as multimodal fusion with cross-attention, masked-language modeling, and image-text matching to relate visual semantics to textual representations. Information from both modalities, including detected objects, the spatial layout of the image, and text embeddings, is mapped into a shared representation space. A minimal sketch of cross-attention fusion is shown below.
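To make the fusion step concrete, here is a minimal, hypothetical PyTorch sketch in which text tokens attend to image patch embeddings through cross-attention. The module name, dimensions, and shapes are illustrative assumptions, not the design of any particular VLM.

# Minimal sketch of cross-attention fusion between text and image features.
# Dimensions and shapes are illustrative, not tied to a specific VLM.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Text tokens act as queries over image patch embeddings.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_patches):
        # text_tokens:   (batch, num_text_tokens, dim)
        # image_patches: (batch, num_patches, dim)
        fused, _ = self.attn(query=text_tokens, key=image_patches, value=image_patches)
        return self.norm(text_tokens + fused)

fusion = CrossAttentionFusion()
text = torch.randn(2, 16, 512)    # dummy text embeddings
image = torch.randn(2, 196, 512)  # dummy image patch embeddings (14x14 patches)
out = fusion(text, image)         # (2, 16, 512)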
VLM Training
Contrastive Learning (CLIP and BLIP): Often used for alignment, this involves training the model to pull the representations of semantically similar image-text pairs closer together and push apart those that are not; a loss sketch follows this list.
Multi-Task Learning: Training the model on various tasks (e.g., image captioning, visual question answering) to improve its ability to understand and integrate both modalities.
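Below is a minimal sketch of a CLIP-style contrastive (InfoNCE) loss over a batch of paired image and text embeddings. The embedding size, batch size, and temperature value are illustrative assumptions.

# Sketch of a CLIP-style contrastive (InfoNCE) loss; all sizes are placeholders.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities: logits[i, j] = sim(image_i, text_j).
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs sit on the diagonal; all other entries are negatives.
    targets = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))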
Applications
Image Captioning: Generating descriptive text for images.
Visual Question Answering: Answering questions based on visual content.
Cross-modal Retrieval: Finding images based on text queries and vice versa; a retrieval sketch follows this list.
Visual search: In e-commerce, where users upload images to find similar products.
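As an example, a pretrained CLIP checkpoint can score a text query against a set of candidate images. The sketch below uses the Hugging Face transformers CLIP classes; the checkpoint name and image file names are placeholder assumptions.

# Sketch of text-to-image retrieval with a pretrained CLIP checkpoint.
# The image file names are placeholders for a real product catalog.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["shoe.jpg", "sofa.jpg", "lamp.jpg"]]  # placeholder files
query = "a red running shoe"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_text: similarity of the query against every candidate image.
best = outputs.logits_per_text.softmax(dim=-1).argmax(dim=-1)
print("best match:", best.item())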
Evaluating Vision Language Models
VLM evaluation involves assessing the quality of the relationships the model learns between image and text data. The most common metrics, borrowed from machine translation and image captioning, compare generated text against human references; a toy BLEU computation follows the list below.
BLEU: The Bilingual Evaluation Understudy metric was originally proposed to evaluate machine translation. It measures the n-gram precision of a candidate sentence against one or more references and applies a brevity penalty to overly short candidates.
ROUGE: Recall-Oriented Understudy for Gisting Evaluation computes recall by considering how many words (or n-grams) in the reference sentence also appear in the candidate.
METEOR: Metric for Evaluation of Translation with Explicit Ordering computes the harmonic mean of unigram precision and recall, giving more weight to recall, and multiplies the result by a penalty term that accounts for word-order fragmentation.
CIDEr: Consensus-based Image Description Evaluation compares a candidate sentence to a set of human reference sentences by computing the average similarity of their TF-IDF-weighted n-grams.
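As a toy example, sentence-level BLEU can be computed with NLTK. The reference and candidate captions below are made up purely for illustration.

# Toy caption scoring with sentence-level BLEU from NLTK; captions are made up.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog is playing with a ball in the park".split(),
    "a brown dog chases a ball on the grass".split(),
]
candidate = "a dog plays with a ball on the grass".split()

# Smoothing avoids zero scores when some higher-order n-grams have no match.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")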
Limitations of Vision Language Models
Model Complexity: Language and vision models are already complex on their own, and combining the two compounds that complexity, increasing training and inference costs.
Dataset Bias: Dataset bias occurs when VLMs memorize spurious patterns shared by the training and test sets instead of genuinely grounding language in visual content.
Evaluation Difficulties: The evaluation strategies discussed above only compare a candidate sentence with reference sentences, so a valid description that happens to use different wording from the references can still receive a low score.
Requirements for Deploying VLMs
VLMs often require significant computational resources due to their large size. Top-performing open models, including some of those listed in the Examples section below, reach over 70 billion parameters, so high-performance GPUs are needed to run them, especially for real-time applications. Quantization is one common way to reduce the memory footprint, as sketched below.
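For instance, a 7B-parameter VLM can be loaded in 4-bit precision with bitsandbytes through the Hugging Face transformers API. The LLaVA checkpoint name here is just one example and can be swapped for any supported VLM.

# Sketch of loading a VLM in 4-bit precision to fit on a single GPU.
# The checkpoint name is an example; substitute any supported model.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=quant_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")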
Examples
LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning
LLaVA seamlessly integrates a pre-trained language model (Vicuna) with a visual encoder (CLIP) using a simple linear layer, creating a robust architecture capable of effectively processing and understanding language-image instructions.
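Conceptually, the LLaVA connector is just a learned projection from the vision encoder's feature space into the language model's embedding space. The sketch below uses the approximate dimensions of CLIP ViT-L/14 (1024) and a 7B Vicuna model (4096) purely for illustration; the actual model is trained end to end with visual instruction tuning.

# Rough sketch of the LLaVA idea: a linear layer projects CLIP vision features
# into the language model's token embedding space. Dimensions are illustrative.
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096
projector = nn.Linear(vision_dim, llm_dim)

patch_features = torch.randn(1, 576, vision_dim)  # dummy CLIP patch features
visual_tokens = projector(patch_features)         # (1, 576, llm_dim)
# visual_tokens are then placed alongside the text token embeddings and fed
# to the language model as an ordinary input sequence.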
PaliGemma: Gemma is a decoder-only model for text generation. Combining the SigLIP image encoder with Gemma through a linear adapter makes PaliGemma a powerful vision-language model.
NVLM is a family of multimodal LLMs developed by NVIDIA, representing a frontier-class approach to VLMs. It achieves state-of-the-art results in tasks that require a deep understanding of both text and images.
Molmo
Molmo, a sophisticated open vision-language model, seeks to close the gap with proprietary systems by building high-quality multimodal capabilities from open datasets and independent training methods.
Qwen2-VL
Qwen2-VL is the latest iteration of the VLMs in the Qwen series. It now goes beyond basic recognition of objects like plants and landmarks to understand complex relationships among multiple objects in a scene. In addition, it is capable of identifying handwritten text and multiple languages within images.
GPT-4 with Vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs alongside text prompts.
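Below is a hedged sketch of sending an image to a vision-capable GPT-4 model through the OpenAI Python SDK. The model identifier and image URL are placeholder assumptions and may differ depending on account and API version.

# Sketch of an image+text request via the OpenAI Python SDK.
# Model name and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable GPT-4 variant
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)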