A Comprehensive Review of CLIP and Its Variants: Architectures, Training, Performance, and Future Directions

Contrastive Language-Image Pre-training (CLIP) marked a significant milestone in vision-language modeling, demonstrating the efficacy of learning transferable visual concepts directly from natural language supervision at an unprecedented scale. This review provides a comprehensive analysis of the original CLIP model and its diverse array of variants. It delves into architectural innovations, ranging from minor modifications to entirely new paradigms like sequence-to-sequence modeling and the integration of generative components. Training methodologies are examined, including the evolution of datasets from noisy web scrapes to curated collections and synthetic data, variations in loss functions beyond the initial contrastive objective, and the strategic use of frozen model components for efficiency. The review systematically compares the performance trade-offs, capabilities, and limitations across different variants, highlighting improvements in zero-shot performance, efficiency, robustness, and task versatility. Widely adopted models such as OpenCLIP, ALIGN, Florence, CoCa, LiT, BLIP, and SigLIP are discussed alongside experimental and specialized variants including CLIPA, SiCLIP, CyCLIP, SCLIP, TiC-CLIP, Duoduo CLIP, LRSCLIP, TULIP, and LongCLIP, which address specific challenges like computational cost, geometric consistency, dense prediction, continual learning, 3D understanding, remote sensing, feature granularity, and long-text processing. Key evolutionary themes emerge, including the relentless pursuit of training efficiency, the synergistic combination of contrastive and generative objectives, the modular composition of models using frozen components, adaptation to specialized domains, and concerted efforts to overcome CLIP's inherent limitations, particularly in compositional reasoning and spatial understanding. Based on this analysis, persistent research gaps are identified, pointing towards future directions focused on achieving deeper multimodal reasoning, enhancing model robustness and fairness, developing more nuanced evaluation protocols, and expanding integration across a wider spectrum of modalities and tasks.

1. Introduction

The fields of computer vision and natural language processing (NLP) have undergone a significant transformation, driven by the success of large-scale pre-trained models. Initially demonstrated in NLP with models like GPT 1, this paradigm involves training high-capacity models on vast amounts of data, often in a self-supervised or weakly supervised manner, to learn general-purpose representations that can be effectively transferred to various downstream tasks.2 This shift has increasingly influenced computer vision, moving away from models trained solely on datasets with fixed, predetermined object categories.4 Such traditional supervision limits model generality and usability, requiring additional labeled data for any new visual concept.4 Concurrently, the importance of multimodal understanding – the ability of AI systems to jointly process and relate information from different modalities like vision and language – has grown substantially.3

Against this backdrop, Contrastive Language-Image Pre-training (CLIP), introduced by Radford et al. (2021), emerged as a landmark achievement.4 CLIP demonstrated that powerful and transferable visual models could be learned directly from natural language supervision at an immense scale.4 Its core innovation was the simplicity and scalability of its pre-training task: predicting the correct pairings between images and their associated text captions within large batches.5 By training on a massive, noisy dataset of 400 million image-text pairs scraped from the internet (WIT dataset), CLIP acquired remarkable zero-shot transfer capabilities.4 It could perform diverse visual tasks, including object classification, action recognition, and optical character recognition (OCR), often competitively with supervised baselines, without requiring any task-specific fine-tuning.4 This ability to generalize using natural language prompts represented a significant departure from previous methods relying on fixed classification heads.5 CLIP's success spurred a wave of research in multimodal AI, influencing the development of numerous subsequent models and establishing contrastive learning on web-scale data as a dominant paradigm.3 The original CLIP model showed impressive efficiency gains compared to alternative approaches like image-to-caption generation, attributed partly to the contrastive objective and the use of Vision Transformer (ViT) architectures.5

The profound impact of CLIP has naturally led to extensive research aimed at understanding its properties, addressing its inherent limitations, improving its efficiency, and extending its capabilities. This has resulted in a rapid proliferation of CLIP variants.3 Some variants focus on replicating and democratizing CLIP using publicly available datasets (e.g., OpenCLIP), while others push the boundaries of scale (e.g., ALIGN). Many aim to enhance efficiency by modifying the training process or architecture (e.g., LiT, SigLIP, CLIPA, SiCLIP). Others seek to augment CLIP's capabilities by integrating generative objectives (e.g., CoCa, BLIP) or adapting it for multi-task learning (e.g., Florence). Furthermore, specialized variants have emerged to tackle specific limitations like geometric inconsistency (CyCLIP), poor dense prediction performance (SCLIP), or the inability to handle temporal data drift (TiC-CLIP), long text inputs (LongCLIP), or specific domains like 3D vision (Duoduo CLIP) and remote sensing (LRSCLIP, RemoteCLIP). Given this diverse and rapidly evolving landscape, a structured overview and comparative analysis is crucial. Such a review can help researchers and practitioners navigate the plethora of approaches, understand the state-of-the-art, identify successful modification strategies, recognize persistent challenges, and chart pathways for future research.

This paper provides a comprehensive review of CLIP and its prominent variants. The scope encompasses the original CLIP model, widely adopted public implementations and successors (OpenCLIP, ALIGN, Florence/Florence-2, CoCa, LiT, BLIP/BLIP-2, SigLIP), and a curated selection of experimental or domain-specific variants chosen to illustrate key research trends. The main contributions of this review are: (1) Detailed technical summaries of the architecture, training methodology, performance characteristics, and pros/cons of key CLIP variants. (2) A comparative analysis highlighting the evolution of architectures, training paradigms (datasets, losses, parameter freezing), and performance-efficiency trade-offs across the variants. (3) Identification of recurring themes, successful modification patterns (e.g., leveraging frozen components, integrating generative losses), and persistent challenges (e.g., compositional reasoning, fine-grained understanding). (4) A discussion of identified research gaps and promising future directions for vision-language pre-training.

The remainder of this paper is organized as follows: Section 2 provides background on the original CLIP model. Section 3 details widely adopted variants. Section 4 discusses selected experimental and domain-specific variants. Section 5 offers a comparative analysis and identifies key themes. Section 6 outlines research gaps and future directions. Section 7 concludes the review.

2. Background: The Original CLIP Model

CLIP (Contrastive Language-Image Pre-training) 4 established a new direction for learning visual representations by leveraging weak supervision from vast amounts of image-text data readily available on the internet. Its design choices, while simple, proved highly effective when scaled.

2.1 Architecture

CLIP employs a dual-encoder architecture, a common structure in multimodal learning where separate networks process each modality before their representations are brought into alignment.6

  • Image Encoder: CLIP explored two main architectures for encoding images 6:
  • Modified ResNets: Five different ResNet models were evaluated (ResNet-50, ResNet-101, and three scaled versions RN50x4, RN50x16, RN50x64 following EfficientNet-style scaling). These used ResNet-D improvements and antialiased blur pooling. A key modification was replacing the final global average pooling layer with an attention pooling mechanism. This mechanism consists of a single layer of multi-head query-key-value (QKV) attention where the query is conditioned on the global average-pooled representation of the image.6
  • Vision Transformer (ViT): Several ViT models (ViT-B/32, ViT-B/16, ViT-L/14, and a higher-resolution ViT-L/14@336px) were also implemented, closely following the original ViT design but with a minor modification of adding layer normalization to the patch and position embeddings before the transformer layers.6 ViT models generally offered better computational efficiency for equivalent performance compared to the ResNets.11
  • Text Encoder: The text encoder is a Transformer model.6 The base version has 63 million parameters, structured with 12 layers, a model width of 512, and 8 attention heads. It operates on text tokenized with a lower-cased byte pair encoding (BPE) scheme with a vocabulary size of 49,152 tokens. The input text sequence length is capped at 76 tokens.6 Input sequences are bracketed with special start-of-sequence ([SOS]) and end-of-sequence ([EOS]) tokens. The activations of the transformer's highest layer at the [EOS] token position are treated as the feature representation for the entire text sequence. Masked self-attention is used within the text encoder so that the prediction for each token depends only on preceding tokens; because the [EOS] token comes last, its representation effectively attends over the whole sequence.6
  • Projection Heads: Both the image and text encoders output feature vectors. These vectors are then projected into a shared multi-modal embedding space using simple linear projection layers.6 It is within this shared space that the similarity between image and text representations is calculated.

2.2 Training Methodology

CLIP's training is centered around a contrastive objective performed at a massive scale.

  • Contrastive Objective: The core idea is to learn a mapping from images and text to the shared embedding space such that the embeddings of corresponding (image, text) pairs are close, while embeddings of non-corresponding pairs are far apart. Given a batch of N (image, text) pairs, the model computes the cosine similarity between all possible N×N pairings (N image embeddings vs. N text embeddings). The training objective aims to maximize the similarity scores for the N correct ("positive") pairs along the diagonal of the similarity matrix and minimize the scores for the N² − N incorrect ("negative") pairs off the diagonal.5 This is implemented using a symmetric cross-entropy loss computed over the similarity scores after scaling by a learnable temperature parameter (τ). This objective is a variant of the InfoNCE loss commonly used in contrastive learning.6 A minimal code sketch of this symmetric loss is given after this list.
  • Dataset (WIT): A key component of CLIP's success was the creation of a new, large-scale dataset called WebImageText (WIT). This dataset comprises 400 million (image, text) pairs gathered from publicly available internet sources.4 To ensure broad concept coverage, the collection process involved searching for pairs where the text included one of 500,000 diverse queries (e.g., object names, concepts). The dataset was intentionally kept noisy, contrasting with smaller, more heavily curated datasets like Conceptual Captions or MSCOCO.11 The scale of WIT (comparable in word count to the dataset used for GPT-2) was deemed crucial for learning generalizable representations.6
  • Scale and Efficiency: Training was performed at a significant scale. The largest models were trained for billions of image-text pair presentations. However, the authors reported significant efficiency gains compared to alternative approaches considered, such as predicting text captions from images (similar to VirTex). The contrastive objective was found to be 4x to 10x more efficient for zero-shot ImageNet classification.5 The use of ViT image encoders provided further efficiency improvements (around 3x) over ResNet counterparts.11 The best performing model reported was trained on 256 GPUs for 2 weeks.11
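
The symmetric contrastive objective described above can be summarized in a few lines. The following PyTorch-style sketch mirrors the structure of the pseudocode in the original CLIP paper; the function name and the assumption that the projection-head outputs are passed in directly are illustrative choices, not the released implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature):
    # image_features, text_features: [N, d] outputs of the projection heads
    # L2-normalize so that dot products equal cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N similarity matrix, scaled by the (learnable) temperature
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text: correct targets lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over rows (image-to-text) and columns (text-to-image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```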

2.3 Core Capabilities and Limitations

CLIP's training methodology resulted in a model with unique strengths but also notable weaknesses.

  • Strengths:
  • Zero-Shot Transfer: CLIP's defining capability is its strong zero-shot performance across a wide range of vision tasks.4 By creating classifiers "on the fly" using text prompts (e.g., "a photo of a {class name}"), CLIP can classify images into categories it was never explicitly trained on.5 It demonstrated non-trivial transfer to over 30 datasets, including tasks like OCR, action recognition, geo-localization, and fine-grained classification.4 Its zero-shot accuracy on ImageNet could match a fully supervised ResNet-50 without using any ImageNet training data.4 A short sketch of this prompt-based procedure is given after this list.
  • Robustness: Zero-shot CLIP models were found to be significantly more robust to natural distribution shifts (e.g., ImageNet-R, ImageNet-Sketch) compared to standard supervised models of equivalent accuracy on the original benchmark.5 This suggests that learning from diverse, noisy web data leads to more generalizable representations.
  • Limitations: Despite its successes, CLIP exhibits several limitations:
  • Abstract/Systematic Tasks: It struggles with tasks requiring more than object recognition, such as counting the number of objects in an image 9 or fine-grained spatial reasoning (e.g., estimating distances).11
  • Fine-Grained Classification: While capable of some fine-grained tasks zero-shot, its performance can be limited compared to specialized models, especially for subtle distinctions.9
  • Compositionality: CLIP often fails to correctly bind attributes to objects or understand complex relationships between multiple entities in a scene.23 It might recognize individual concepts but fail to grasp how they are combined (e.g., confusing "a red cube on a blue sphere" with "a blue cube on a red sphere"). This points to a 'bag-of-concepts' representation rather than a structured understanding.23
  • Negation: The model has difficulty understanding negation in text prompts.24
  • Text Length: The fixed text encoder input length (77 tokens, with an effective length often much shorter) restricts its ability to process detailed or complex textual descriptions.21
  • Data Bias: Training on unfiltered web data means the model can inherit societal biases present in the data.1
  • Embedding Space Issues: Studies have noted a "modality gap", where image and text embeddings occupy distinct regions within the shared space, and the space itself can be highly anisotropic, with embeddings concentrated in a narrow cone rather than spread uniformly across directions.24
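
Returning to the zero-shot transfer strength listed above, the sketch below builds a classifier from class names and prompt templates using a generic CLIP-style model interface. The encode_image/encode_text calls, the tokenizer, and the prompt template are illustrative assumptions rather than a specific library's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenizer, image_tensor, class_names):
    # Build one text prompt per candidate class, e.g. "a photo of a dog"
    prompts = [f"a photo of a {name}" for name in class_names]
    text_tokens = tokenizer(prompts)

    # Encode and normalize both modalities into the shared embedding space
    text_emb = F.normalize(model.encode_text(text_tokens), dim=-1)
    image_emb = F.normalize(model.encode_image(image_tensor), dim=-1)   # [1, d]

    # Cosine similarity between the image and every class prompt acts as logits
    probs = (100.0 * image_emb @ text_emb.t()).softmax(dim=-1)          # [1, num_classes]
    return class_names[probs.argmax(dim=-1).item()]
```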

The design choices that enabled CLIP's groundbreaking zero-shot transfer—namely, the simple contrastive objective focused on global image-text similarity and the reliance on massive but noisy web data—also laid the groundwork for its limitations. The focus on aligning entire images with entire captions, while effective for recognizing general concepts, inherently deemphasizes the precise localization, counting, or compositional understanding needed for other tasks. Similarly, the fixed, relatively short text input length, sufficient for typical web captions, becomes a bottleneck when more nuanced textual guidance is required. These trade-offs, inherent in the original CLIP formulation, became primary motivators for the development of the numerous variants explored in subsequent sections.

3. Widely Adopted CLIP Variants

Following the release and success of CLIP, numerous research groups developed variants aiming to replicate, improve upon, or extend its capabilities. Several of these have gained widespread adoption due to their open-source nature, performance improvements, or novel functionalities.

3.1 OpenCLIP

OpenCLIP emerged as a community effort to provide an open-source implementation of CLIP, facilitating reproducibility and further research.25 The primary goal was to replicate the performance of OpenAI's original models using publicly available datasets like LAION.13

  • Architecture: OpenCLIP offers significant flexibility, providing implementations for various visual backbones beyond those used in the original paper, including different sizes of ViT (B/32, B/16, L/14, H/14, G/14) and ConvNeXt architectures.25 It also incorporates architectures like CoCa (Contrastive Captioners).25
  • Training: Models have been trained extensively on large public datasets, primarily LAION-400M and the larger LAION-2B.13 The training process utilizes the standard contrastive loss. The OpenCLIP repository documents large-scale training runs, detailing the significant computational resources required (e.g., ViT-L/14 on LAION-400M used 400 A100 GPUs for ~127 hours) and the challenges encountered, such as training instability at scale, which necessitated techniques like using bfloat16 precision instead of float16.28 Very large batch sizes (up to 160k) were explored, often enabled by gradient checkpointing.30 Unless specifically fine-tuning a model like CoCa, the components are typically trained end-to-end without freezing.31
  • Pros: Being open-source is its primary advantage, enabling broader research access and development.25 It has successfully replicated and even surpassed the performance of original OpenAI CLIP models on certain benchmarks when trained on LAION datasets (e.g., OpenCLIP ViT-H/14 achieved 78.0% zero-shot ImageNet accuracy on LAION-2B, compared to OpenAI's ViT-L/14 at 75.5%).27 The largest OpenCLIP model, ViT-G/14, reached 80.1% zero-shot accuracy on ImageNet.32 It supports a wider range of architectures and datasets (including DataComp-1B) than the original release.25 A brief usage sketch is given after this list.
  • Cons: Training these large models remains computationally intensive and expensive, requiring substantial GPU resources and time.26 Large-scale training runs encountered stability issues requiring specific solutions.30 Performance can be sensitive to the specific training data and hyperparameters; for example, the OpenCLIP ViT-L/14 trained on LAION-400M initially underperformed OpenAI's equivalent 27, although later runs on LAION-2B closed this gap.27
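
To illustrate the accessibility noted in the pros above, the snippet below loads a pretrained OpenCLIP model with the open_clip library and scores an image against a few candidate captions. The model name and pretrained tag follow the repository's documented examples but may vary across releases, so treat them as placeholders to verify against open_clip.list_pretrained().

```python
import torch
import open_clip
from PIL import Image

# Example checkpoint; available model/tag combinations are listed by open_clip.list_pretrained()
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
```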

3.2 ALIGN (Google)

ALIGN (A Large-scale ImaGe and Noisy-text embedding) explored the limits of scaling vision-language pre-training using extremely large, noisy datasets, prioritizing scale over meticulous data curation.14

  • Architecture: ALIGN employs a straightforward dual-encoder architecture. It utilizes EfficientNet models (specifically EfficientNet-L2 in key experiments) as the image encoder and BERT models (BERT-Large) as the text encoder.14 A simple fully connected layer is used to align the output dimensions of the two encoders.37
  • Training: The defining characteristic of ALIGN is its training dataset: over 1.8 billion image alt-text pairs collected from the web with only minimal frequency-based filtering applied.14 This contrasts sharply with datasets like Conceptual Captions or CLIP's WIT, which involved more significant cleaning or curation. The model is trained using a standard contrastive loss, formulated with normalized softmax, similar to CLIP.14 No components were reported as frozen during pre-training.
  • Pros: ALIGN demonstrated that massive data scale can effectively compensate for significant noise in the training data, achieving state-of-the-art results at the time.14 It yielded strong visual representations, achieving 88.64% top-1 accuracy on ImageNet after fine-tuning and 85.5% with frozen features (linear probe), slightly outperforming CLIP's ViT-L/14 on the latter.38 Its zero-shot ImageNet classification accuracy was 76.4%, comparable to CLIP's ViT-L/14.38 It also set new records on image-text retrieval benchmarks like Flickr30K and MSCOCO, even compared to more complex cross-attention models.14 The architectural simplicity is also an advantage.
  • Cons: The primary limitation is its reliance on a massive, proprietary dataset (1.8 billion pairs), making it difficult for the broader community to replicate or build upon directly.13 While scale compensated for noise overall, the noisy nature of the data might impose limitations on tasks requiring very fine-grained understanding or high precision, although this is not explicitly quantified in the provided sources. Like other models trained on unfiltered web data, it risks inheriting societal biases.41

3.3 Florence / Florence-2 (Microsoft)

The Florence project represents Microsoft's effort to build foundational models for computer vision, aiming for broader applicability than CLIP's initial focus.2 Florence-1 aimed for universal representations across space, time, and modality 43, while Florence-2 adopted a unified sequence-to-sequence approach driven by text prompts.44

  • Architecture:
  • Florence-1: Leveraged adapters on a pre-trained backbone to handle diverse tasks. Snippets suggest a CoSwin Transformer image encoder 47 and likely a Transformer-based text encoder. Specific adapters included the Dynamic Head for spatial tasks (like detection) and the METER adapter for vision-language tasks (like VQA).43
  • Florence-2: Employs a sequence-to-sequence architecture.44 The vision encoder is DaViT.44 A multi-modal encoder-decoder Transformer processes concatenated visual token embeddings (from DaViT) and text prompt embeddings.44 It uniquely handles region-based outputs (for detection, segmentation, grounding) by incorporating special location tokens into its vocabulary and generating coordinate sequences.44 Available in base (230M parameters) and large (770M parameters) sizes.49
  • Training:
  • Florence-1: Pre-trained on large web-scale image-text data (FLD-900M mentioned, derived from datasets like COCO, LVIS, OpenImages).48 Training likely involved contrastive and potentially other objectives suitable for its multi-task adapters.43
  • Florence-2: Trained on the purpose-built FLD-5B dataset, containing 5.4 billion visual annotations on 126 million images.44 This dataset was created through an iterative process involving automated annotation by specialist models followed by model refinement.44 The model is trained end-to-end using a standard cross-entropy language modeling objective across all tasks, treating each task as a translation problem from image + prompt to text/location tokens.44 No frozen components are mentioned during its primary training phase.
  • Pros: Florence-1 reported exceptional zero-shot ImageNet accuracy (83.74% top-1).43 Florence-2's main strength is its versatility; it can perform a wide range of tasks (captioning, object detection, segmentation, grounding, OCR) using natural language prompts within a single model.44 It claims strong zero-shot and fine-tuning performance, competing effectively with larger, specialized models.45 The availability of smaller versions (e.g., Florence-2 Base) makes it more accessible.49 Florence-2 is open-source.49 It demonstrates better detection accuracy, especially for distant objects, compared to models like YOLOv8l-world in some examples.54
  • Cons: The creation of the massive, heavily annotated FLD-5B dataset required for Florence-2 is a significant undertaking, potentially limiting reproducibility.44 Direct comparisons of Florence-2's zero-shot ImageNet accuracy against CLIP are lacking in the provided materials; its evaluation focuses more on its multi-task capabilities on benchmarks like COCO captioning and RefCOCO.46 Like other foundation models, it may require fine-tuning for specialized domains like medical imaging.49 Potential issues in the training data, such as multiple images sharing one description, could affect learning.48

3.4 CoCa (Google / OpenCLIP)

CoCa (Contrastive Captioners) enhances the standard contrastive learning framework by integrating a generative captioning objective, aiming to benefit both representation learning and text generation within a single model.31

  • Architecture: CoCa builds upon the CLIP architecture (e.g., ViT image encoder, Transformer text encoder) by adding a Transformer-based text decoder on top of the text encoder.31 This decoder attends to the image encoder's output features via cross-attention layers.31 A key architectural detail is the use of attentional pooling for the image representations, which are then fed to both the contrastive loss calculation and the decoder's cross-attention mechanism.31 To separate unimodal text encoding from multimodal generation, cross-attention is omitted in the initial layers of the decoder, allowing these layers to function similarly to the text encoder for the contrastive task.55
  • Training: CoCa is trained using a combination of two loss functions: the standard contrastive loss (like CLIP) applied to the outputs of the image encoder and the unimodal layers of the text decoder, and a captioning loss (standard autoregressive cross-entropy) applied to the output of the multimodal text decoder.31 These objectives are trained jointly, sharing most of the network parameters, resulting in minimal computational overhead compared to training only the contrastive loss (estimated ~20% increase).31 The OpenCLIP implementation was trained on the LAION-2B dataset (seeing 13 billion samples).31 No components are frozen during this pre-training. For fine-tuning specifically on captioning tasks (like MSCOCO), the contrastive loss weight is set to zero, focusing training solely on the generative objective.31
  • Pros: CoCa demonstrates improved zero-shot classification performance compared to contrastive-only models of similar size trained on the same data (e.g., CoCa ViT-L/14 achieved 75.5% on ImageNet-1k vs. 73.1% for CLIP ViT-L/14).31 It also shows better performance on image-text retrieval tasks.31 The unified architecture efficiently supports both contrastive tasks (retrieval, zero-shot classification) and generative tasks (captioning, VQA after fine-tuning).31 It achieves very strong performance after fine-tuning on specific tasks (e.g., 91.0% top-1 ImageNet accuracy).55
  • Cons: Fine-tuning CoCa exclusively for generative tasks (like captioning) was observed to significantly degrade or entirely eliminate its contrastive capabilities (zero-shot classification, retrieval).31 Its zero-shot captioning performance can be limited because the noisy web text used in pre-training lacks the richness of dedicated text corpora, potentially hindering the decoder's generative quality without fine-tuning.31 The architecture is slightly more complex than a standard dual-encoder CLIP due to the addition of the decoder.

3.5 LiT (Google)

LiT (Locked-image Tuning) presents a highly efficient method for adapting powerful, pre-trained image encoders for zero-shot vision-language tasks by focusing the training effort solely on the text encoder.56

  • Architecture: The core principle of LiT is the use of a pre-trained image encoder whose weights are kept frozen (locked) during the vision-language alignment phase.56 This frozen image tower can be based on various architectures, including ResNets, ViTs (up to ViT-g/14 tested), and even MLP-Mixers, pre-trained using either supervised (e.g., on ImageNet-21k, JFT) or self-supervised methods (e.g., DINO, MoCo-v3).56 The text encoder, typically a Transformer (BERT-base, T5-base, or the standard CLIP transformer were explored), is unlocked and trained from scratch or fine-tuned to align with the fixed image representations.56
  • Training: LiT employs a contrastive loss, similar to CLIP, to align the outputs of the frozen image encoder and the trainable text encoder.57 Training occurs on standard image-text datasets (e.g., CC12M, YFCC100M, potentially subsets of ALIGN's data).58 Crucially, gradients are only computed and applied to the text tower and the projection head, not the image tower.56 A minimal sketch of this recipe is given after this list.
  • Pros: LiT offers significant data and computational efficiency compared to training large vision-language models end-to-end.56 By reusing powerful, pre-existing image models, it amortizes the cost of visual representation learning. It achieves state-of-the-art zero-shot classification performance, with a ViT-g/14 based LiT model reaching 85.2% top-1 accuracy on ImageNet.56 This performance demonstrates that aligning a text model to a fixed, high-quality visual representation can be highly effective. The approach decouples visual feature learning (which can leverage large labeled image datasets) from language alignment (which can use potentially noisier image-text web data).56 Locking the image tower simplifies training, reduces memory usage (no gradients for the image model), and potentially allows for pre-computing image embeddings if no image augmentations are used, further saving compute.56 Its flexibility allows leveraging the best available image models, regardless of their original training objective.56
  • Cons: The performance of a LiT model is fundamentally capped by the quality and scope of the frozen image encoder. It may struggle with visual concepts or nuances not captured by the pre-trained image model.60 While efficient, it might be less adaptable than end-to-end trained models where both modalities can adjust. Some studies suggest that fine-tuning both towers (like in CLIP or the 3T variant) can sometimes yield better results, particularly for retrieval tasks, though LiT often excels in classification.61 Using a locked pre-trained language model as the text tower performs poorly, indicating the text model needs adaptability.61
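
The locked-image recipe referenced above reduces to freezing the vision tower and optimizing only the text tower against the shared contrastive objective. The sketch below is a generic PyTorch illustration, not Google's implementation; the encoder objects, data loader, and contrastive_loss helper are assumed placeholders.

```python
import torch

def lit_align(image_encoder, text_encoder, loader, contrastive_loss, lr=1e-4):
    """Locked-image Tuning: train only the text tower against frozen image features."""
    image_encoder.eval()
    for p in image_encoder.parameters():
        p.requires_grad = False                       # lock the image tower

    optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=lr)
    for images, texts in loader:
        with torch.no_grad():                         # no gradients through the image tower
            img_emb = image_encoder(images)
        txt_emb = text_encoder(texts)
        loss = contrastive_loss(img_emb, txt_emb)     # same symmetric objective as Section 2.2
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```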

3.6 BLIP / BLIP-2 (Salesforce)

The BLIP (Bootstrapping Language-Image Pre-training) family focuses on creating unified models for both vision-language understanding and generation tasks, with BLIP addressing noisy web data via a unique filtering technique and BLIP-2 pioneering the efficient use of frozen unimodal foundation models.62

  • Architecture:
  • BLIP: Introduces the Multimodal Mixture of Encoder-Decoder (MED) architecture.64 This flexible architecture, based on a ViT image encoder and a BERT-style text encoder/decoder, can operate in three modes by sharing/modifying components: (1) Unimodal Encoder (separate image/text encoding for contrastive loss), (2) Image-grounded Text Encoder (adds cross-attention for image-text matching loss), and (3) Image-grounded Text Decoder (uses causal self-attention for language modeling loss).64
  • BLIP-2: Leverages large, pre-trained frozen image encoders (e.g., ViT-G) and frozen large language models (LLMs) like OPT or FlanT5.63 The modality gap is bridged by a lightweight, trainable Querying Transformer (Q-Former). The Q-Former, initialized from BERT-base (~188M parameters), uses a fixed set of learnable queries (e.g., 32) to interact with the frozen image encoder's features via cross-attention, extracting a compact visual representation. This representation is then fed as a soft visual prompt to the frozen LLM.63
  • Training:
  • BLIP: Pre-trained on a base dataset of 14M image-text pairs, then employs CapFilt (Captioning and Filtering).64 CapFilt uses a fine-tuned captioner (MED decoder) to generate synthetic captions for noisy web images (e.g., 129M Conceptual Captions dataset) and a fine-tuned filter (MED encoder) to remove noisy original/synthetic captions. The resulting bootstrapped dataset is used to train the final BLIP model from scratch using a combination of Image-Text Contrastive (ITC), Image-Text Matching (ITM), and Language Modeling (LM) losses.64 No components are frozen during the main pre-training on the bootstrapped data.
  • BLIP-2: Features a two-stage pre-training strategy where only the Q-Former is trained.63 Stage 1: Vision-Language Representation Learning. The Q-Former is trained connected to the frozen image encoder using image-text data. It optimizes three objectives simultaneously: ITC (aligning global image/text features), ITM (predicting image-text match), and Image-grounded Text Generation (ITG, forcing queries to extract visual info needed for text generation). Stage 2: Vision-to-Language Generative Learning. The Q-Former (with its learned visual extraction capability) is connected to the frozen LLM. The Q-Former's output query embeddings are projected and prepended to the text input as soft visual prompts. The Q-Former is further trained using a standard generative language modeling loss (or prefix LM loss for encoder-decoder LLMs), teaching it to effectively communicate the visual information to the LLM.63
  • Pros: BLIP's CapFilt provides an effective method for cleaning and leveraging noisy web data.62 BLIP achieved state-of-the-art results on various understanding and generation tasks, including retrieval, captioning, and VQA.62 BLIP-2 is exceptionally parameter-efficient during pre-training, as only the relatively small Q-Former is updated.63 This allows it to leverage extremely large frozen vision models and LLMs without prohibitive training costs. BLIP-2 excels at zero-shot instructed image-to-text generation tasks (like complex VQA and visual dialogue) thanks to the capabilities inherited from the frozen LLM.63 It significantly outperformed much larger models like Flamingo80B on benchmarks like zero-shot VQAv2 (65.2% accuracy for BLIP-2 ViT-g/FlanT5-XXL vs 56.3% for Flamingo80B) with 54x fewer trainable parameters.63 BLIP models also show strong retrieval performance (e.g., BLIP-2 ViT-L zero-shot COCO Text-to-Image Recall@1 of 66.3%).69 A usage sketch illustrating this instructed generation capability is given after this list.
  • Cons: The CapFilt process in BLIP adds complexity to the data preparation pipeline.64 BLIP-2's overall performance is highly dependent on the quality of the chosen frozen image encoder and LLM.63 While training is efficient, inference with BLIP-2 can still be computationally demanding due to the large size of the frozen LLM component.75 BLIP-2 may inherit limitations, biases, or safety concerns from the underlying frozen LLM, and it has not been extensively tested in real-world safety-critical applications.66
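
As a concrete illustration of the zero-shot instructed image-to-text generation mentioned in the pros above, the snippet below queries a public BLIP-2 checkpoint through the Hugging Face transformers library. The checkpoint name, prompt, and dtype/device choices are examples, and the classes assume a reasonably recent transformers release.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Example checkpoint pairing a frozen ViT-g image encoder with a frozen OPT-2.7B LLM
name = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(name)
model = Blip2ForConditionalGeneration.from_pretrained(name, torch_dtype=torch.float16)
model.to("cuda")

image = Image.open("example.jpg")
prompt = "Question: what is shown in this image? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

generated = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(generated[0], skip_special_tokens=True))
```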

3.7 SigLIP / SigLIP 2 (Google)

SigLIP (Sigmoid Loss for Language Image Pre-training) proposes a modification to the contrastive loss function used in CLIP, replacing the standard batch-wise softmax normalization with a simpler pairwise sigmoid function.76 SigLIP 2 builds upon this by incorporating additional training objectives.79

  • Architecture: SigLIP uses the standard dual-encoder architecture (e.g., ViT image encoder, Transformer text encoder) similar to CLIP.77 SigLIP 2, during its training phase, adds a standard transformer decoder connected to the vision encoder's patch features. This decoder is used for auxiliary objectives like captioning and referring expression prediction (inspired by LocCa) but is discarded for inference, leaving the standard dual-encoder structure.79
  • Training: The key innovation is the Sigmoid loss. Instead of normalizing similarity scores across the entire batch, SigLIP treats each possible image-text pair (I_i, T_j) independently as a binary classification problem.76 Matching pairs (i = j) are positive examples (target label +1), and mismatching pairs (i ≠ j) are negative examples (target label −1 in the formulation below). The loss is computed as the sum of binary logistic losses (negative log-likelihoods of sigmoid outputs) over all pairs in the batch, using learnable temperature (t) and bias (b) parameters: L_SigLIP = −Σ_{i,j} log σ(z_ij · (t · sim(I_i, T_j) + b)), where z_ij is the target label and σ is the sigmoid function.76 A minimal sketch of this loss is given after this list. This loss was applied during training on the large-scale WebLI dataset.80 SigLIP 2 enhances this by adding: (1) Decoder-based losses (LocCa): image captioning, referring expression prediction (predicting bounding boxes for text), and grounded captioning (predicting captions for boxes), using the auxiliary decoder.79 (2) Self-supervised losses (applied in the last 20% of training): self-distillation and masked prediction on image features (inspired by SILC, TIPS).79 No components are typically frozen during SigLIP pre-training, although a SigLiT variant combines the Sigmoid loss with the LiT approach (frozen image encoder).76
  • Pros: The Sigmoid loss offers several advantages over the standard softmax contrastive loss. It is simpler to implement, especially in distributed settings, as it avoids the need for global normalization across the batch.76 It is more memory-efficient, allowing for training with significantly larger batch sizes on the same hardware (demonstrated up to 1 million batch size).76 Sigmoid loss performs particularly well at smaller batch sizes (<16k) where softmax can struggle.76 This efficiency enables effective training with fewer computational resources, as demonstrated by the SigLiT variant achieving high accuracy (84.5% ImageNet ZS with ViT-g/14) using only four TPUv4 chips.76 SigLIP models achieve strong zero-shot performance (e.g., SigLIP Base 384px: 76.2% ImageNet ZS 39; SigLIP 2 So/14 384px: 84.1% 19). SigLIP 2 further improves performance, especially on dense prediction tasks and multilingual benchmarks, due to the auxiliary objectives.79
  • Cons: The performance advantage of Sigmoid loss over softmax diminishes as batch sizes become very large (e.g., >32k), although it remains competitive.77 SigLIP 2 significantly increases the complexity of the training pipeline by incorporating a decoder and multiple additional loss objectives (captioning, grounding, self-supervised) compared to the original SigLIP or CLIP.79
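
The pairwise sigmoid objective referenced in the training bullet above can be written compactly. The following is a minimal PyTorch sketch of that loss; variable names and the batch-size normalization are chosen here for illustration rather than taken from the SigLIP codebase.

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_features, text_features, t, b):
    # image_features, text_features: [N, d] outputs of the two encoders
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    logits = image_features @ text_features.t() * t + b          # [N, N] pairwise logits
    # z_ij = +1 on the diagonal (matching pairs), -1 elsewhere (mismatching pairs)
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0

    # Sum of independent binary logistic losses: -log sigmoid(z_ij * logit_ij)
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```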

These widely adopted variants illustrate two primary trajectories in the evolution of CLIP. One path focuses on refining the original contrastive learning paradigm – scaling it up with more open data (OpenCLIP, ALIGN), making it more efficient through architectural or loss function changes (LiT, SigLIP), or improving the underlying data itself. The second path seeks to expand CLIP's capabilities beyond simple contrastive alignment, integrating generative modeling (CoCa, BLIP/BLIP-2) or adopting entirely different architectures suited for multi-task learning (Florence-2).

A recurring theme across these variants is the tension between training efficiency and model capability. Approaches like LiT and SigLIP achieve remarkable efficiency gains by simplifying the training process (freezing components, simplifying the loss) while largely staying within the contrastive framework. Conversely, models like CoCa, BLIP-2, and Florence-2 introduce greater architectural or training complexity (decoders, bridging modules, sequence-to-sequence frameworks, multi-stage training, complex datasets) to unlock broader functionalities like text generation or unified task handling. This highlights a fundamental trade-off for researchers and practitioners.

Furthermore, the success of LiT and particularly BLIP-2 underscores the power of leveraging existing, powerful unimodal models. Freezing high-quality pre-trained image encoders (in LiT) or both image encoders and LLMs (in BLIP-2) and training only lightweight adapters or bridging modules has proven to be a highly parameter-efficient strategy for achieving state-of-the-art results. This suggests a move towards modularity and composition, where powerful multimodal systems are constructed by effectively connecting specialized unimodal foundation models, rather than solely relying on monolithic end-to-end training.

Table 1: Summary of Major CLIP Variants

| Model | Year/Venue | Key Innovation | Image Encoder(s) | Text Encoder(s) | Training Data (Type, Scale) | Frozen Components? | ZS ImageNet Acc (%) (Top-1, Example) | Key Pro/Con |
|---|---|---|---|---|---|---|---|---|
| CLIP (OpenAI) | 2021/ICML | Scalable contrastive learning from web data for ZS transfer | ResNet (Mod.), ViT | Transformer (BPE, 76 tok) | WIT (Web Image-Text, 400M) | No (during pretrain) | 76.2 (ViT-L/14@224px) 13 | Pro: Breakthrough ZS transfer, robust. Con: Weak compositionality/counting, short text limit. |
| OpenCLIP | 2022+ (GitHub) | Open-source replication & extension | ViT, ConvNeXt, CoCa | Transformer, CoCa | LAION (400M, 2B), DataComp (1B) | No (typically) | 78.0 (ViT-H/14, LAION-2B) 30 | Pro: Open source, high performance, flexible. Con: High training cost, stability issues at scale. |
| ALIGN | 2021/ICML | Scaling contrastive learning with massive noisy data | EfficientNet (L2) | BERT (Large) | Noisy Web Alt-Text (1.8B) | No | 76.4 38 | Pro: Scale compensates for noise, simple arch. Con: Proprietary data, potential noise issues. |
| Florence-1 | 2021 (arXiv) | Foundation model concept, adapters for space/time/modality | CoSwin? | Transformer? | FLD-900M (Web+Curated) | Base model (pretrain) | 83.74 43 | Pro: Very high ZS accuracy, broad vision scope. Con: Less documented, superseded by Florence-2. |
| Florence-2 (Large) | 2023 (arXiv) | Unified seq-to-seq prompt-based vision model | DaViT | Transformer Enc-Dec | FLD-5B (Annotated Images, 5.4B ann.) | No | N/A (Focus on multi-task) 46 | Pro: Task versatility via prompts, strong ZS/FT. Con: Complex curated dataset needed. |
| CoCa (ViT-L/14) | 2022/ECCV | Combines contrastive loss with captioning loss (decoder) | ViT-L/14 | Transformer Enc + Dec | LAION-2B (13B seen) | No (pretrain) | 75.5 31 | Pro: Improved ZS/retrieval over CLIP, unified model. Con: FT for captioning hurts contrastive. |
| LiT (ViT-g/14) | 2022/CVPR | Locked-image Tuning: train text encoder against frozen image encoder | ViT-g/14 (Frozen) | Transformer (Trainable) | Image-Text (CC12M, YFCC) | Image Encoder | 85.2 56 | Pro: Highly compute/data efficient, SOTA ZS accuracy. Con: Relies on frozen encoder quality. |
| BLIP (ViT-L) | 2022/ICML | Unified Enc-Dec (MED), data bootstrapping (CapFilt) | ViT-L | MED (Transformer Enc/Dec) | Web Data + COCO (Bootstrapped, ~143M) | No (pretrain) | N/A (Focus on VQA/Caption/Retrieval) | Pro: Effective use of noisy data, unified model. Con: CapFilt adds complexity. |
| BLIP-2 (ViT-g/FlanT5-XXL) | 2023/ICML | Lightweight Q-Former bridges frozen image encoder & frozen LLM | ViT-g (Frozen) | FlanT5-XXL (Frozen) | Image-Text (Web, COCO, etc.) | Image Enc & LLM | 65.2 (VQAv2 ZS) 63 | Pro: Highly parameter-efficient training, SOTA ZS VQA/Gen. Con: High inference cost (LLM). |
| SigLIP (ViT-L/16) | 2023/ICCV | Pairwise Sigmoid loss instead of softmax contrastive loss | ViT-L/16 | Transformer | WebLI (English subset) | No (SigLIP) / Img (SigLiT) | 76.2 (Base@384px) 80 | Pro: Efficient loss, good at small batch sizes. Con: Benefit diminishes at very large batches. |
| CLIPA-v2 (H/14) | 2023 (arXiv) | Inverse scaling law: train large models with fewer tokens | ViT-H/14 | Transformer | DataComp-1B / LAION-2B (~13B seen, stages) | No | 81.1 ($10k budget) 81 | Pro: Drastically reduced training cost for SOTA ZS. Con: Multi-stage training required. |
| TULIP (So/14) | 2025 (arXiv) | Enhanced contrastive (I-I, T-T), generative augmentation, reconstruction reg. | ViT-So/14 | Transformer | Image-Text + Augmentations | No | 85.0 (@384px) 19 | Pro: Improves fine-grained vision & reasoning. Con: Complex training objectives. |
| LongCLIP (ViT-L/14) | 2024/ECCV | Fine-tuning for long text input (>77 tokens) | ViT-L/14 | Transformer (Fine-tuned) | Image-Text + 1M Long Pairs | No (fine-tuning stage) | ~75 (Preserves CLIP ZS) 23 | Pro: Handles long text, plug-and-play. Con: Requires extra fine-tuning & long-text data. |

(Note: ZS ImageNet Acc refers to Top-1 accuracy on ImageNet-1k zero-shot benchmark. N/A indicates data not readily available in snippets or not the primary focus. Training data scale often refers to unique images or pairs, 'seen' indicates total samples processed. Costs are estimates.)

4. Experimental and Domain-Specific Variants

Beyond the widely adopted variants, a significant body of research has focused on developing more specialized versions of CLIP, targeting specific limitations, efficiency constraints, or application domains. These experimental variants often introduce novel techniques or architectural modifications.

4.1 Efficiency Variants

Given the substantial computational cost of training large CLIP models 26, several approaches specifically target efficiency, aiming to achieve competitive performance with reduced resources.

  • CLIPA / CLIPA-v2: This work identified a surprising "inverse scaling law": larger CLIP models can be effectively trained using shorter input sequences (fewer image patches/tokens).26 CLIPA leverages this by training large models like ViT-H/14 with significantly downsampled images (e.g., 70x70 or 84x84 resolution) during the main training phase, followed by shorter fine-tuning stages at higher resolutions.81 This drastically reduces the number of tokens processed per sample. CLIPA-v2 achieved an impressive 81.1% zero-shot ImageNet accuracy with an H/14 model within an estimated $10,000 budget, outperforming a comparable OpenCLIP model trained conventionally at ~39x the cost.81 While highly effective, this approach requires careful management of the multi-stage training process and the token reduction strategy.
  • SiCLIP: Designed explicitly for training on consumer-grade hardware (e.g., a single Nvidia RTX3090 GPU with 1TB storage).84 It achieves this through several synergistic techniques: (1) Architectural simplification using SAS-P blocks (Simplified Attention Sub-block Parallel) with weight sharing, reducing parameters and removing skip connections.84 (2) Weight Inheritance with Multi-stage Knowledge Distillation (WIKD), where weights from a pre-trained teacher model (MobileCLIP-S0) are inherited and frozen for some layers, while others are trained using multi-stage distillation (feature, relation, interactive contrastive spaces).84 (3) A novel Pair Matching (PM) loss that adds an auxiliary binary matching task to better distinguish pairs, especially on small datasets.84 (4) An augmented dataset CC12M-SYN, created by adding synthetic captions to the CC12M dataset using a captioning model.84 SiCLIP demonstrates competitive performance on standard benchmarks despite its resource constraints and small training dataset (12M images), offering faster inference than its teacher model.84 Its main drawback is the reliance on a specific teacher model and the complexity of integrating multiple specialized training techniques.
  • LightCLIP: Focuses on improving performance when using lightweight image encoder backbones, which are often less effective in standard CLIP training.85 It introduces a multi-level interaction paradigm during training, including softening negative labels in the contrastive loss, adding a token-level alignment objective using relaxed bipartite matching between image patches and words, and incorporating a Masked Language Modeling (MLM) objective for the text encoder, enhanced by injecting image features.86 These additions aim to extract more information for alignment without increasing inference cost.
  • CLIP-PING: Another approach to boost lightweight models, CLIP-PING uses guidance from the intrinsic neighborhood structure of data.87 It leverages features extracted from arbitrary pre-trained unimodal encoders (for both image and text) to identify nearest neighbors (NN) within the same modality and cross-nearest neighbors (XNN) between modalities. Additional contrastive loss terms based on these "proximus neighbors" are added to the standard CLIP objective, encouraging the lightweight model to align not just positive pairs but also samples that are semantically close in the guidance feature spaces. This simple addition significantly boosts performance (e.g., +5.5% ImageNet ZS for ViT-XS on 3M pairs) with minimal overhead.87 Its effectiveness depends on the availability and quality of the pre-trained encoders used for guidance.

4.2 Consistency Variants

Standard CLIP training does not explicitly enforce geometric consistency between the learned image and text embedding spaces, which can lead to inconsistencies in downstream tasks.15

  • CyCLIP: Addresses this by adding two explicit regularization terms to the CLIP loss function.15 Cross-modal consistency penalizes discrepancies between the similarity of (Image A, Text B) and (Image B, Text A) for mismatched pairs A and B. In-modal consistency encourages the similarity between two images (Image A, Image B) to be close to the similarity between their corresponding texts (Text A, Text B). By minimizing the squared differences of these similarity scores, CyCLIP enforces a more symmetrical and geometrically consistent structure in the joint embedding space. This results in significant gains in zero-shot classification accuracy (10-24%) and robustness to distribution shifts (10-27%) compared to standard CLIP, albeit at the cost of a more complex loss function with additional hyperparameters.15
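
A compact sketch of the two CyCLIP regularizers described above follows. It is written from the description in this section, so the function name, weighting coefficients, and use of mean squared differences are illustrative assumptions rather than the authors' code.

```python
import torch

def cyclip_regularizers(img_emb, txt_emb, lambda_cross=0.25, lambda_in=0.25):
    # img_emb, txt_emb: [N, d] L2-normalized embeddings of N paired images/texts
    sim_it = img_emb @ txt_emb.t()   # image-text similarities
    sim_ii = img_emb @ img_emb.t()   # image-image similarities
    sim_tt = txt_emb @ txt_emb.t()   # text-text similarities

    # Cross-modal consistency: sim(I_j, T_k) should match sim(I_k, T_j)
    cross = ((sim_it - sim_it.t()) ** 2).mean()
    # In-modal consistency: sim(I_j, I_k) should match sim(T_j, T_k)
    in_modal = ((sim_ii - sim_tt) ** 2).mean()

    # Added on top of the standard CLIP contrastive loss
    return lambda_cross * cross + lambda_in * in_modal
```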

4.3 Dense Prediction Variants

While CLIP excels at image-level tasks, its standard architecture struggles with dense prediction tasks like semantic segmentation, often failing to accurately localize features.18

  • SCLIP: Investigates CLIP's potential for dense prediction and identifies the issue as spatially invariant features learned by standard self-attention, where local tokens attend broadly rather than locally.18 SCLIP replaces the standard self-attention module in the final transformer layer of the CLIP vision encoder with a novel Correlative Self-Attention (CSA) mechanism.18 CSA calculates attention scores based on pairwise correlations between projected local visual tokens, encouraging tokens to attend strongly to themselves and semantically similar regions. Remarkably, SCLIP can reuse the pre-trained query and key projection weights from the original CLIP self-attention block, allowing for a training-free adaptation. This minimal change dramatically improves zero-shot semantic segmentation performance (average mIoU across 8 benchmarks boosted from CLIP's 14.1% to 38.2%), significantly outperforming prior state-of-the-art by enhancing feature localization while retaining semantic understanding.18 The impact on global tasks like classification was not the focus of the SCLIP paper.
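
The Correlative Self-Attention mechanism described above can be sketched roughly as below. This is one plausible reading of the description (attention driven by query-query and key-key correlations while reusing CLIP's pre-trained projection weights); the published formulation may differ in details such as scaling, multi-head handling, or which projections are combined.

```python
import torch

def correlative_self_attention(x, w_q, w_k, w_v, scale):
    # x: [num_tokens, d] local visual tokens from the last CLIP vision layer
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # reuse CLIP's pre-trained projections

    # Correlations within the same projection emphasize self- and same-object tokens,
    # sharpening localization compared to standard query-key attention
    attn = torch.softmax(q @ q.t() * scale, dim=-1) + torch.softmax(k @ k.t() * scale, dim=-1)
    return attn @ v
```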

4.4 Continual Learning Variants

Foundation models like CLIP are typically trained on static datasets, but real-world data evolves over time. Continual learning aims to update models efficiently as new data arrives without forgetting past knowledge or requiring complete retraining.17

  • TiC-CLIP: This work introduces benchmarks and a framework for Time-Continual training of CLIP models.17 They created time-stamped web-scale datasets (TiC-DataComp with 12.7B pairs over 9 years, TiC-YFCC, TiC-RedCaps) and dynamic evaluation tasks to measure temporal robustness.17 They evaluated several continual learning strategies: Sequential (fine-tuning only on the newest data batch), Cumulative-All (retraining from scratch on all data up to the current time step - the costly oracle), and Replay (fine-tuning on the new data plus a buffer of data sampled from previous time steps).17 Replay variants included replaying all past data (Replay-All) or using fixed-size buffers with different sampling strategies (Replay-Exp: exponential decay for older data amounts; Replay-Equal: equal sampling from all past steps).17 Their key finding was that simple rehearsal-based approaches (like Replay-All or Replay-Equal) significantly outperform sequential fine-tuning and can match the performance of retraining from scratch (Cumulative-All) while being substantially more compute-efficient (e.g., 2.5x reduction reported).88 Interestingly, they observed less catastrophic forgetting than typically seen in smaller-scale continual learning benchmarks.93
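
The replay strategies evaluated in TiC-CLIP amount to choosing which mixture of old and new data to fine-tune on at each time step. The sketch below shows an equal-per-step replay buffer in the spirit of Replay-Equal; the function name, buffer size, and sampling details are illustrative assumptions rather than the benchmark's exact protocol.

```python
import random

def build_replay_pool(past_datasets, new_dataset, buffer_size):
    """Mix the newest data with an equal-per-step sample of past data (Replay-Equal style)."""
    per_step = buffer_size // max(len(past_datasets), 1)
    replay = []
    for old in past_datasets:
        # Sample the same number of examples from every previous time step
        replay.extend(random.sample(old, min(per_step, len(old))))
    # Fine-tune on the union of the new data and the replay buffer at this time step
    return list(new_dataset) + replay
```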

4.5 3D Variants

Adapting CLIP's principles to understand 3D shapes is an active area, moving beyond 2D images and text.16

  • Duoduo CLIP: Learns 3D shape representations directly from multi-view 2D images rather than relying on 3D point cloud inputs, which are often used by other methods like OpenShape or Uni3D.16 It fine-tunes an off-the-shelf 2D CLIP model (ViT-B/16) using rendered multi-view images of 3D objects (e.g., from Objaverse) paired with text descriptions.16 A Multi-View Attention (MVA) mechanism, essentially cross-view attention, is incorporated to allow information sharing across the different views of the same object, boosting performance.95 The approach is designed to be permutation invariant to the order of input views and does not require explicit pose information.97 Duoduo CLIP demonstrates superior zero-shot generalization on 3D classification and retrieval tasks compared to point-cloud-based methods while being significantly more computationally efficient (requiring only 87M parameters and 57 A5000 GPU hours vs. 1B parameters and 480 A100 hours for Uni3D).95 Its use of multi-view images allows leveraging strong 2D priors from CLIP and offers flexibility, as it can handle a variable number of input views and potentially work directly with real-world photos where point clouds might be unavailable or noisy.16 Potential concerns include whether fine-tuning on rendered views might degrade performance on single natural images and the novelty relative to other multi-view attention techniques.98 Training requires access to multi-view rendering capabilities.99

4.6 Remote Sensing Variants

Applying vision-language models to the domain of remote sensing (RS) presents unique challenges, including specialized visual concepts and data scarcity.100

  • LRSCLIP: Specifically designed to align RS images with both long and short text descriptions, addressing limitations of models trained only on short captions which can lead to "hallucinations" or overlooking context.101 It introduces the LRS2M dataset, containing 2 million RS image-text pairs derived from multiple sources (e.g., RS5M, MillionAID). Crucially, LRS2M includes both short captions and corresponding long, detailed captions generated using LLMs or Vision-and-Language Models (VLMs) like VHM.20 The training strategy likely involves modifications to handle the dual text lengths, possibly using separate text encoders or an adapted single encoder, though specific architectural details are sparse in the snippets.101 LRSCLIP shows significant improvements (10-20%) in zero-shot long-text retrieval on RS benchmarks compared to a Long-CLIP baseline, while also achieving state-of-the-art results in short-text retrieval (outperforming GeoRSCLIP) and zero-shot RS image classification.20
  • RemoteCLIP: Aims to be a foundational VLM for the RS domain.100 It tackles the issue of limited labeled RS data by employing data scaling techniques. It converts heterogeneous annotations (like bounding boxes and segmentation masks) from various RS datasets into a unified image-caption format using Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion methods. It further incorporates UAV imagery to create a large pre-training dataset (claimed 12x larger than combining existing public RS datasets).103 RemoteCLIP uses a standard CLIP architecture and contrastive (InfoNCE) loss for training.100 It demonstrates strong performance across 16 different RS datasets and tasks (including classification, retrieval, and counting), consistently outperforming baseline foundation models and setting new state-of-the-art results on RS image-text retrieval benchmarks like RSITMD and RSICD.100 Its success relies heavily on the sophisticated data engineering process used to generate its training corpus.

4.7 Feature Enhancement Variants

Some variants aim to improve the quality or granularity of the learned visual representations, addressing CLIP's tendency to focus on global semantics over fine details.104

  • TULIP: Designed as an open-source, drop-in replacement for CLIP-like models, TULIP focuses on learning more fine-grained visual features while preserving strong language grounding.19 It modifies the standard contrastive pre-training framework by incorporating three key elements: (1) Generative data augmentation (GeCo), which likely involves augmenting both images and text captions synthetically. (2) Enhanced contrastive learning, which includes not only the standard image-text contrastive loss but also image-image (potentially patch-wise) and text-text contrastive objectives to enforce better unimodal structure. (3) Reconstruction regularization, adding objectives to reconstruct masked parts of the image or text, encouraging the encoders to capture more complete information.19 TULIP achieves state-of-the-art zero-shot ImageNet accuracy and shows significant improvements over models like SigLIP on vision-centric downstream tasks (e.g., 2x gain on RxRx1 linear probe) and compositional reasoning benchmarks (e.g., 3x gain on MMVP).19 The trade-off is a more complex training procedure involving multiple loss terms and data augmentation strategies that need careful balancing.

4.8 Long Text Variants

The fixed, short text input length of the original CLIP text encoder (77 tokens) is a significant limitation for applications requiring detailed descriptions.21

  • LongCLIP: Directly addresses this limitation by enabling CLIP models to process longer text sequences.21 It achieves this through an efficient fine-tuning process applied to a pre-trained CLIP model (e.g., ViT-L/14). The fine-tuning uses a relatively small dataset (1 million pairs) of images paired with long text captions. Two novel techniques are employed to prevent catastrophic forgetting of the original CLIP capabilities and to maintain alignment with the original latent space: (1) Knowledge-preserved stretching of positional embeddings, which adapts the positional encoding to handle longer sequences without drastically altering the representations of shorter, well-trained initial positions. (2) Primary component matching, which likely involves aligning the principal components of the fine-tuned feature space with those of the original CLIP space.22 LongCLIP demonstrates substantial improvements (~20%) in retrieving images from long text captions and also boosts performance (~6%) on traditional short-caption retrieval tasks (COCO, Flickr30k).22 Importantly, it is designed as a plug-and-play replacement for the standard CLIP text encoder in downstream applications like text-to-image generation, allowing them to leverage more detailed prompts.22 The fine-tuning process is reported to be very efficient (e.g., 0.25 hours on 8 GPUs).23
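A minimal sketch of knowledge-preserved positional-embedding stretching is shown below, assuming the commonly described recipe of copying a small number of leading, well-trained positions unchanged and linearly interpolating the rest; the split point (`keep=20`), the target length of 248, and the use of `F.interpolate` are illustrative assumptions rather than Long-CLIP's exact procedure.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embeddings(pos_emb: torch.Tensor,
                                  new_len: int,
                                  keep: int = 20) -> torch.Tensor:
    """Knowledge-preserved stretching of CLIP's text positional embeddings (sketch).

    pos_emb: original table, shape (77, D) for the standard CLIP text encoder.
    new_len: desired context length (e.g., 248).
    keep:    number of leading positions copied unchanged (illustrative split point).
    """
    kept = pos_emb[:keep]                          # leave early positions untouched
    tail = pos_emb[keep:].t().unsqueeze(0)         # (1, D, 77 - keep) for 1-D interpolation
    stretched = F.interpolate(tail, size=new_len - keep,
                              mode="linear", align_corners=True)
    return torch.cat([kept, stretched.squeeze(0).t()], dim=0)  # (new_len, D)

# Example: extend a (77, 512) table to 248 positions before fine-tuning on long captions.
extended = stretch_positional_embeddings(torch.randn(77, 512), new_len=248)
print(extended.shape)  # torch.Size([248, 512])
```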

The diverse array of experimental and domain-specific variants showcases the adaptability of the core CLIP framework. These variants often employ targeted modifications—whether to the architecture, loss function, training data, or training strategy—to address specific shortcomings or tailor the model to new requirements. This focused approach highlights both the modularity inherent in the dual-encoder design and the realization that achieving specialized capabilities often necessitates moving beyond the original, relatively simple CLIP formulation.

A recurring pattern in many successful variants is the strategic use of data or existing models. Sophisticated data engineering, including generating synthetic captions (SiCLIP, TULIP), converting annotations (RemoteCLIP), curating domain-specific datasets with varied text lengths (LRSCLIP), or constructing temporally ordered benchmarks (TiC-CLIP), plays a crucial role. Similarly, leveraging powerful pre-trained models, either through fine-tuning (Duoduo CLIP, LongCLIP), knowledge distillation (SiCLIP), using frozen components (LiT, BLIP-2), or extracting guidance signals (CLIP-PING), has proven highly effective. This suggests that future advancements in vision-language modeling may rely as heavily on innovative data strategies and intelligent model composition as on purely novel architectural designs.

5. Comparative Analysis and Key Themes

The proliferation of CLIP variants invites a comparative analysis to understand the evolutionary trends, recurring strategies, performance trade-offs, and the extent to which CLIP's original limitations have been addressed.

5.1 Architectural Evolution

While the dual-encoder structure remains prevalent, significant architectural diversification has occurred.

  • Encoder Choices: The Vision Transformer (ViT) has become the dominant image encoder backbone, largely replacing the ResNet options explored in the original CLIP, primarily due to its scalability and efficiency.6 Variants have explored different ViT sizes (B, L, H, G, g) 28 and alternatives like ConvNeXt 29, EfficientNet 14, and DaViT.44 Text encoders remain predominantly Transformer-based, though some variants experiment with different sizes or pre-trained language models like BERT 14 or T5.63 Lightweight architectures are explored in variants like SiCLIP.84
  • Beyond Dual Encoders: A major trend is the move beyond simple dual encoders to support generative tasks or more complex interactions. This includes adding text decoders (CoCa 31, BLIP 64, SigLIP 2 79), employing full sequence-to-sequence encoder-decoder frameworks (Florence-2 44), or introducing specialized bridging modules like the Q-Former in BLIP-2 to connect frozen unimodal models.63 Task-specific modifications, such as attention mechanism changes for dense prediction (SCLIP 18) or 3D understanding (Duoduo CLIP 95), are also common.
  • Attention Mechanisms: While standard self-attention and cross-attention are foundational, variants have introduced modifications. CLIP's replacement of global average pooling with attention pooling 6 was an early example, also adopted by CoCa.31 SCLIP's Correlative Self-Attention targets spatial reasoning for dense tasks 18, and Duoduo CLIP uses Multi-View Attention for 3D.95 The use of Rotary Position Embeddings (RoPE) has also been explored to improve spatial understanding in some contexts.109 A minimal attention-pooling sketch follows this list.
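To make the attention-pooling idea concrete, the sketch below replaces global average pooling with a single learned query that attends over all patch tokens; the single-query design, head count, and dimensions are illustrative assumptions rather than any specific model's configuration.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Minimal attention pooling: one learned query attends over all patch tokens,
    replacing global average pooling when forming the image embedding (sketch)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) from the vision backbone
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)  # (batch, 1, dim)
        return pooled.squeeze(1)                              # (batch, dim) image embedding

# Example: pool 196 ViT patch tokens of width 768 into a single image vector.
pool = AttentionPool(dim=768)
print(pool(torch.randn(4, 196, 768)).shape)  # torch.Size([4, 768])
```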

5.2 Training Paradigms

Training methodologies have evolved significantly in terms of loss functions, data strategies, and efficiency techniques.

  • Loss Functions: The original contrastive loss (InfoNCE style) remains central to many variants.6 However, alternatives like the pairwise Sigmoid loss (SigLIP 76) have emerged, offering efficiency benefits, especially at smaller batch sizes. A major trend is the augmentation of the contrastive objective with other losses: generative/language modeling losses (CoCa 31, BLIP 64, BLIP-2 63, SigLIP 2 79) enable text generation; matching losses (ITM in BLIP 64, PM in SiCLIP 84) provide finer-grained alignment signals; consistency losses (CyCLIP 15) improve geometric structure; and reconstruction losses (TULIP 105) enhance feature granularity. Knowledge distillation losses are used for training lightweight models (SiCLIP 84). Balancing these multiple objectives is a key challenge in complex variants. A minimal sketch contrasting the softmax and sigmoid contrastive losses follows this list.
  • Data Scaling and Curation: The initial success of CLIP 6 and ALIGN 14 demonstrated the power of scaling training data, even if noisy. This led to the creation and use of massive public datasets like LAION (up to 5B pairs) 13 and proprietary ones like WebLI (10B pairs).110 However, recognizing the limitations of purely noisy data, subsequent strategies focused on improving data quality or diversity. BLIP's CapFilt bootstraps captions using model-based generation and filtering.64 Florence-2 relies on the heavily annotated FLD-5B dataset.44 Synthetic data generation is employed by SiCLIP, TULIP, and RemoteCLIP to augment or create training data.84 Domain-specific datasets have been curated for 3D (Objaverse renderings for Duoduo CLIP 99) and remote sensing (LRS2M for LRSCLIP 101). TiC-CLIP introduced temporally structured datasets for continual learning.89 This evolution reflects a shift from simply scaling data quantity to optimizing data quality, relevance, and structure.
  • Parameter Freezing Strategies: Freezing parts of the model during training has become a cornerstone for efficiency. LiT pioneered freezing the image encoder 56, a strategy also used in SigLiT.76 BLIP-2 took this further by freezing both the image encoder and a large language model, training only the lightweight Q-Former.63 SiCLIP uses weight inheritance, freezing parts of a teacher model.84 This modular approach drastically reduces the number of trainable parameters and associated computational costs, making it possible to leverage powerful pre-trained unimodal models. The trade-off lies in potentially reduced adaptability compared to end-to-end training.
  • Training Efficiency Techniques: Besides parameter freezing, various techniques are employed to manage the high cost of training. Extremely large batch sizes (e.g., 160k in OpenCLIP 32, 1M in SigLIP experiments 77) are used, often enabled by gradient accumulation 25 and gradient checkpointing.30 Mixed-precision training (e.g., bfloat16) is crucial for stability and speed, especially at scale.30 CLIPA's strategy of reducing input token length based on model size offers significant speedups.81 Architectural simplifications like SiCLIP's SAS-P blocks also contribute.84
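The sketch below contrasts the two core alignment objectives discussed above: the batch-wise softmax (InfoNCE-style) loss used by CLIP and a pairwise sigmoid loss in the spirit of SigLIP. It is a minimal illustration: the temperature and bias are fixed scalars here, whereas SigLIP treats them as learnable, and no distributed-batch machinery is shown.

```python
import torch
import torch.nn.functional as F

def clip_softmax_loss(img: torch.Tensor, txt: torch.Tensor, t: float = 0.07) -> torch.Tensor:
    """Batch-wise softmax (InfoNCE-style) contrastive loss used by CLIP.
    img, txt: L2-normalized embeddings of matched pairs, shape (N, D)."""
    logits = img @ txt.t() / t
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def siglip_sigmoid_loss(img: torch.Tensor, txt: torch.Tensor,
                        t: float = 10.0, b: float = -10.0) -> torch.Tensor:
    """Pairwise sigmoid loss in the spirit of SigLIP: every (i, j) pair is treated as an
    independent binary problem, so no batch-wide softmax normalization is required."""
    logits = img @ txt.t() * t + b
    labels = 2.0 * torch.eye(img.size(0), device=img.device) - 1.0  # +1 on diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / img.size(0)

# Toy usage with random, normalized embeddings.
i = F.normalize(torch.randn(8, 512), dim=-1)
x = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_softmax_loss(i, x).item(), siglip_sigmoid_loss(i, x).item())
```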

5.3 Performance vs. Efficiency Trade-offs

A central theme is the trade-off between model performance, capability, and the resources required for training and inference.

  • Training Cost vs. Performance: Models trained end-to-end on massive datasets (CLIP, ALIGN, large OpenCLIP models) achieve high performance but incur enormous training costs (e.g., hundreds of thousands of GPU hours, translating to costs in the hundreds of thousands of dollars).30 Variants employing parameter freezing (LiT, BLIP-2) or efficiency techniques (SigLIP, CLIPA, SiCLIP) demonstrate that comparable or even superior performance on certain benchmarks (especially zero-shot classification) can be achieved with orders-of-magnitude less training compute.56 CLIPA's "inverse scaling law" provides an empirical basis for some of these gains.26
  • Inference Cost: While training efficiency is improved by freezing large components (like LLMs in BLIP-2), the inference cost can remain high due to the size of these frozen modules.75 Lightweight variants like SiCLIP explicitly target lower inference cost as well.84
  • Capability vs. Complexity: Models aiming for broader capabilities beyond contrastive alignment (CoCa, BLIP, Florence-2) often introduce architectural or training complexity (decoders, multiple losses, multi-stage training).31 This contrasts with simpler, efficiency-focused variants like LiT or SigLIP. The choice depends on the desired balance between performance on core CLIP tasks, extended functionality, and implementation/training complexity.

5.4 Addressing CLIP's Limitations: Progress and Persistent Issues

Variants have made progress on some of CLIP's weaknesses, but others remain challenging.

  • Fine-grained Recognition: Models like TULIP claim improvements by incorporating objectives designed to enhance visual feature granularity.105 However, CLIP-style models generally still lag behind vision-specific models on highly fine-grained tasks.9
  • Counting & Numerical Reasoning: This remains a significant weakness for CLIP and most variants discussed.11 Targeted solutions are sparse in the reviewed literature.111
  • Compositionality, Attribute Binding, Spatial Reasoning: These represent major, persistent limitations of the standard CLIP framework.23 While some variants offer minor improvements (e.g., SCLIP for segmentation-related spatial tasks 18, TULIP's potential reasoning boost 105, CyCLIP's consistency 15), they often fall short. The superior performance of generative MLLMs (like LLaVA) on these tasks, even when using the same CLIP vision encoder, strongly suggests that the limitation may lie in the dual-encoder contrastive architecture's inability to model complex relationships rather than in the quality of the visual features themselves.109
  • Negation: Understanding negation remains a challenge.24
  • Text Length: LongCLIP 107 and LRSCLIP 101 demonstrate that fine-tuning can successfully extend CLIP's text processing capabilities. Using powerful LLMs as text encoders (as explored in LLM2CLIP 112) offers an alternative route.
  • Robustness: CLIP's inherent robustness 11 can be degraded by standard fine-tuning.114 CyCLIP enhances robustness through consistency constraints.15 WiSE-FT provides a method to preserve robustness during fine-tuning.114 TiC-CLIP specifically addresses temporal robustness.89

The evolution from CLIP reveals diverse strategies. Some researchers refined the core contrastive learning engine by improving data (ALIGN, LAION, CapFilt), efficiency (LiT, CLIPA, SigLIP), or internal consistency (CyCLIP). Others bolted on new capabilities, primarily generation, by adding decoders or integrating LLMs (CoCa, BLIP, BLIP-2). Still others adapted the model's inputs or outputs for specific domains or tasks (SCLIP, Duoduo CLIP, LRSCLIP, LongCLIP). This branching evolution signifies that while the core CLIP concept is powerful and adaptable, achieving SOTA performance, efficiency, or specialized functionality often requires moving beyond the original formulation, leading to a complex ecosystem with no single "best" variant for all purposes.

A critical observation is the apparent ceiling for certain deep reasoning tasks within the standard contrastive dual-encoder framework. While efficiency, robustness, and domain adaptation have seen significant progress, fundamental challenges in compositionality, counting, and negation persist across most variants that retain the core CLIP architecture and objective. The fact that alternative architectures like generative MLLMs show superior performance on these specific reasoning benchmarks 109, even when leveraging the same CLIP visual encoder, points towards limitations in how CLIP's contrastive objective aligns features or in the architectural capacity of the dual-encoder setup itself for these types of reasoning. This suggests that future breakthroughs in these areas might require more radical departures from the original CLIP paradigm.

Furthermore, the journey highlights the evolving and critical role of data. The initial leap by CLIP and ALIGN was enabled by embracing massive, noisy web data.6 However, subsequent progress often involved more sophisticated data strategies: open-source large datasets enabling replication (LAION for OpenCLIP 13), methods to clean or enhance noisy data (BLIP's CapFilt 64), extremely large-scale curated annotations for multi-task learning (Florence's FLD-5B 44), the use of AI to generate synthetic training data (SiCLIP, TULIP, RemoteCLIP 84), and temporally structured data for continual learning (TiC-CLIP 89). This indicates that data strategy—encompassing scale, quality, diversity, structure, and generation—is now as vital a component of advancing vision-language models as architectural design or loss function innovation.

Table 2: Performance Comparison on Key Benchmarks

| Model | ZS ImageNet Acc (%) (Top-1) | ZS IN-Robustness Avg (%) | ZS COCO R@1 I2T / T2I (%) | ZS Flickr30k R@1 I2T / T2I (%) | ZS VQAv2 Acc (%) | Params (M) (Trainable / Total) | Training Compute (Est. GPUh / Cost) |
|---|---|---|---|---|---|---|---|
| OpenAI CLIP ViT-L/14@224px | 76.2 13 | High (Baseline) | 58.4 / 37.8 72 | 88.0 / 68.7 38 | N/A | ~427 / ~427 | High (Proprietary) |
| OpenCLIP ViT-L/14 (LAION-2B) | 75.3 27 | Similar to OpenAI | ~49.5 / ~66.0 81 | ~77.8 / ~90.8 81 | N/A | ~427 / ~427 | 41,472 A100h / ~$47k 81 |
| OpenCLIP ViT-H/14 (LAION-2B) | 78.0 30 | Improved | 73.4 / N/A (R@5) 30 | 94.0 / N/A (R@5) 32 | N/A | ~986 / ~986 | 216,712 A100h / ~$248k 81 |
| OpenCLIP ViT-G/14 (LAION-2B) | 80.1 32 | Further Improved | 74.9 / N/A (R@5) 32 | 94.9 / N/A (R@5) 32 | N/A | ~2540 / ~2540 | 232,448 A100h / ~$366k 81 |
| ALIGN (EffNet-L2/BERT-L) | 76.4 38 | High | 58.6 / 45.6 38 | 88.6 / 75.7 38 | N/A | ~600+ / ~600+ | Very High (Proprietary 1.8B data) |
| Florence-1 (CoSwin-H?) | 83.74 43 | N/A | 64.7 / 47.2 52 | N/A | 80.36 (FT) 43 | ~893 / ~893 | High (Proprietary) |
| CoCa ViT-L/14 (LAION-2B) | 75.5 31 | N/A | 61.7 / 43.9 (F30k) 31 | 87.7 / 70.8 (F30k) 31 | N/A | ~638 / ~638 | ~1.2x OpenCLIP-L/14 Cost 31 |
| LiT ViT-g/14 (JFT/WebLI) | 85.2 56 | Very High | N/A | N/A | N/A | ~695 (Text) / ~2540 (Total) | Efficient (Text-only training) |
| BLIP ViT-L (CapFilt) | N/A | N/A | 65.6 / 47.5 (FT) 115 | 94.6 / 78.0 (FT) 115 | 77.7 (FT) 115 | ~446 / ~446 | Moderate (Bootstrapping) |
| BLIP-2 ViT-g/FlanT5-XXL | N/A | N/A | 81.0 / 68.3 (ZS) 69 | 97.6 / 89.7 (ZS) 69 | 65.2 (ZS) 63 | ~108 / ~12208 | Low (Q-Former only) |
| SigLIP ViT-L/16@384px (WebLI) | 76.2 (Base) 80 | N/A | 64.5 / 47.2 19 | 89.6 / 77.9 19 | N/A | ~427 / ~427 | Efficient (Sigmoid Loss) |
| CLIPA-v2 H/14 (DataComp-1B) | 81.1 81 | N/A | 67.1 / N/A (R@5) 81 | 76.1 / N/A (R@5) 81 | N/A | ~986 / ~986 | 5,920 A100h / ~$9k 81 |
| TULIP So/14@384px | 85.0 19 | N/A | 70.1 / 54.2 19 | 93.9 / 81.8 19 | N/A | >1B / >1B | N/A |

(Notes: ZS = Zero-Shot, FT = Fine-Tuned. IN-Robustness Avg is qualitative based on descriptions. Retrieval scores are Recall@1 unless noted. Params are approximate total parameters; trainable parameters differ significantly for LiT, BLIP-2, SiCLIP. Compute costs are rough estimates from papers where available, highly dependent on hardware/settings. N/A = Not Available/Applicable from snippets.)

6. Research Gaps and Future Directions

Despite the remarkable progress driven by CLIP and its variants, significant research gaps remain, opening avenues for future investigation.

6.1 Overcoming Foundational Limitations

The core reasoning capabilities of CLIP-like models remain a primary area for improvement.

  • Compositionality and Attribute Binding: As highlighted repeatedly, standard contrastive models struggle to understand how objects and attributes combine or relate within a scene.23 This "binding problem" limits their use in tasks requiring precise understanding. Future work needs to explore architectures or objectives that explicitly model these relationships, perhaps incorporating structured representations, graph neural networks, or insights from neurosymbolic AI.24 Developing and utilizing robust benchmarks specifically targeting compositionality (like SugarCREPE, ARO, MMVP) is crucial for measuring progress.24
  • Spatial and Geometric Reasoning: While variants like SCLIP improve dense prediction 18, deeper spatial understanding (e.g., relative positioning like "left of", "behind", topology, containment) remains weak.24 Integrating geometric priors more fundamentally or exploring alternative spatial representations could be beneficial.
  • Counting and Numerical Reasoning: CLIP's inability to reliably count objects points to a fundamental gap in its capabilities.11 This likely requires incorporating mechanisms beyond visual feature similarity, potentially involving object detection modules or dedicated counting heads trained with appropriate objectives.
  • Negation and Logical Reasoning: Understanding negation ("not a yellow coat") is poorly handled.24 This requires moving beyond simple correlation learning towards models that can handle logical operators and exclusion, a significant challenge for current architectures.

6.2 Advancing Training Efficiency and Scalability

While significant strides have been made, further improvements in efficiency are needed, especially for democratization and continual learning.

  • Optimizing Token Usage: CLIPA's inverse scaling law provides an empirical guideline 81, but more principled methods for minimizing token processing without performance loss are needed. This could involve adaptive token sampling, learnable token pruning, or more efficient attention mechanisms. A simple token-subsampling sketch follows this list.
  • Effective Low-Resource Training: Variants like SiCLIP, LightCLIP, and CLIP-PING show promise for training on limited hardware or data.84 Further research into highly effective knowledge distillation techniques, extreme parameter sharing, quantization-aware training 3, or novel lightweight architectures is needed to push the boundaries of low-resource VLM training.
  • Reducing Large-Batch Dependency: While SigLIP alleviates the issue compared to softmax 76, contrastive learning generally benefits from large negative pools. Developing methods that learn effectively with smaller batches or even in online settings would broaden applicability.
  • Principled Multi-Objective Optimization: Models like CoCa, BLIP, TULIP, and CyCLIP combine multiple loss terms.15 Finding optimal, potentially dynamic, ways to weight and schedule these diverse objectives (contrastive, generative, matching, consistency, reconstruction) remains an open research question.
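As one simple instance of the token-reduction idea, the sketch below uniformly subsamples patch tokens before the transformer; CLIPA itself shortens sequences via image resizing and related strategies rather than this random subsampling, so the `subsample_tokens` helper is purely illustrative.

```python
import torch

def subsample_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a fraction of patch tokens before the transformer (sketch).

    tokens: (batch, num_tokens, dim). Dropping tokens shortens the sequence the
    encoder must process, which is one simple form of token reduction; real
    methods use image resizing or structured masking rather than uniform sampling.
    """
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    # Independently sample which tokens to keep for each example in the batch.
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

# Halving 196 ViT patch tokens roughly halves the per-layer attention cost.
print(subsample_tokens(torch.randn(4, 196, 768)).shape)  # torch.Size([4, 98, 768])
```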

6.3 Enhancing Robustness, Fairness, and Interpretability

Deploying VLMs reliably and responsibly requires addressing trustworthiness aspects.

  • Distribution Shift Robustness: While CLIP exhibits baseline robustness 11, performance can degrade under shifts or during fine-tuning.114 Methods like WiSE-FT 114 and CyCLIP 15 offer improvements, but robustness against a wider range of shifts (temporal, domain, style, adversarial) needs further investigation. TiC-CLIP provides a framework for temporal shifts.89 A minimal sketch of the WiSE-FT weight-interpolation idea follows this list.
  • Adversarial Robustness: Like many deep learning models, VLMs are vulnerable to adversarial attacks. Developing effective defense mechanisms and robust training strategies specifically for contrastively trained models is crucial.111
  • Bias Mitigation: Models trained on web data inevitably learn societal biases.1 Research is needed on methods to audit, quantify, and mitigate these biases in VLMs, potentially involving data filtering, algorithmic debiasing, or controlled generation techniques.
  • Interpretability and Explainability: Understanding why a VLM makes a certain prediction or fails on specific tasks (like compositionality) is vital for debugging and building trust.24 Developing explainability methods tailored to dual-encoder contrastive models, perhaps through attention visualization 18, gradient-based methods, or concept analysis, is an important direction.41
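The weight-space ensembling behind WiSE-FT can be sketched as a simple parameter interpolation between the zero-shot and fine-tuned checkpoints. The helper below is a minimal illustration (architectures must match, and integer buffers are copied rather than interpolated) and omits the evaluation loop typically used to pick the mixing coefficient.

```python
import copy
import torch.nn as nn

def wise_ft_interpolate(zero_shot: nn.Module, fine_tuned: nn.Module,
                        alpha: float = 0.5) -> nn.Module:
    """Weight-space ensembling in the spirit of WiSE-FT (sketch).

    Linearly interpolates every floating-point parameter between the zero-shot
    model and its fine-tuned counterpart; alpha = 0 recovers the robust zero-shot
    weights, alpha = 1 the fine-tuned ones.
    """
    merged = copy.deepcopy(zero_shot)
    zs_state, ft_state = zero_shot.state_dict(), fine_tuned.state_dict()
    merged_state = {}
    for name, zs_param in zs_state.items():
        ft_param = ft_state[name]
        if zs_param.is_floating_point():
            merged_state[name] = (1.0 - alpha) * zs_param + alpha * ft_param
        else:
            merged_state[name] = zs_param  # e.g., integer buffers are copied as-is
    merged.load_state_dict(merged_state)
    return merged

# Example: blend a fine-tuned head back toward its zero-shot weights.
robust = wise_ft_interpolate(nn.Linear(512, 1000), nn.Linear(512, 1000), alpha=0.5)
```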

6.4 Beyond Image-Text: Expanding Modalities and Tasks

The core principles of contrastive alignment can be extended beyond static images and text.

  • Video-Language Models: Extending CLIP to video understanding requires handling temporal dynamics, action recognition, and the increased computational complexity.43
  • 3D-Language Models: Integrating 3D data (point clouds, meshes, multi-view images, implicit representations like NeRFs or 3DGS) with language is a rapidly growing area.16 Challenges include robust alignment, handling sparse/noisy real-world 3D data, and scaling to complex scenes.95
  • Broader Multimodal Integration: Moving towards foundation models that seamlessly integrate other modalities like audio (e.g., CLAP 111), sensor data, tabular data, or graphs is a key goal for more holistic AI systems.3
  • Embodied AI and Action: Leveraging VLMs for robotics requires grounding language and vision in action generation (Vision-Language-Action models, VLAs).117 This involves challenges in policy learning, dynamics modeling, and real-world interaction.
  • Dense, Interactive, and Generative Tasks: Pushing beyond classification and retrieval towards high-fidelity dense prediction (segmentation 18, detection 44), interactive tasks like visual question answering 63 and dialogue, and high-quality conditional generation requires models with finer-grained understanding and often different architectural paradigms (e.g., generative MLLMs).

6.5 Domain Adaptation and Generalization Challenges

Ensuring models generalize well and adapt effectively to specific application domains remains crucial.

  • Domain Specialization vs. Generalization: While foundation models aim for generality, optimal performance in specialized domains like remote sensing 101 or medical imaging 3 often requires domain-specific data and fine-tuning. Balancing general capabilities with specialized expertise is an ongoing challenge.
  • Efficient Adaptation: Methods for efficiently adapting large VLMs to new domains or tasks with minimal data or computation are highly desirable. This includes parameter-efficient fine-tuning (PEFT) techniques and test-time adaptation strategies like CLIPArTT.119 A generic low-rank adapter sketch follows this list.
  • Out-of-Distribution (OOD) Generalization: Standard benchmarks may not fully capture real-world performance. Evaluating models on challenging OOD datasets (like WikiDO 73) reveals limitations in generalization and motivates research into methods that improve robustness to unforeseen data variations.
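As a generic PEFT illustration (not a method proposed in the works cited here), the sketch below wraps a frozen linear projection with a low-rank adapter so that only a few thousand parameters are trained during adaptation; the `LoRALinear` class, rank, and initialization are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal low-rank adapter: the frozen pretrained projection is augmented with a
    trainable low-rank update, which is the only part optimized during adaptation."""
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pretrained weights
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # start as an identity update
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Wrapping a frozen 768x768 projection: only 2 * 768 * 8 = 12,288 parameters train.
adapted = LoRALinear(nn.Linear(768, 768), rank=8)
print(sum(p.numel() for p in adapted.parameters() if p.requires_grad))  # 12288
```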

6.6 Evaluation Methodologies

Robust evaluation is critical for driving meaningful progress.

  • Beyond Standard Benchmarks: There is a pressing need for more comprehensive benchmarks that specifically target known weaknesses like compositionality, spatial reasoning, counting, and negation.24 Temporal benchmarks (TiC-eval 89) and OOD benchmarks (WikiDO 73) are also vital.
  • Evaluating Generative Capabilities: For variants with generative components (CoCa, BLIPs, Florence-2), metrics beyond standard ones like BLEU or CIDEr are needed to assess aspects like factual correctness, avoidance of hallucination, and alignment with fine-grained instructions.8
  • Human Evaluation: For complex reasoning, nuanced understanding, and assessing potential biases, human evaluation remains indispensable.41 Developing scalable and reliable human evaluation protocols is important.

The research trajectory suggests that while CLIP provided a powerful foundation, achieving human-like multimodal understanding requires addressing its inherent reasoning limitations. This may necessitate moving beyond incremental improvements to the dual-encoder contrastive framework and potentially incorporating elements from symbolic reasoning, graph representations, or more structured world models.

Furthermore, the increasing scale and capability of these models underscore the critical need for advancements in trustworthiness. Ensuring robustness against various distribution shifts and adversarial attacks, mitigating inherent biases learned from web-scale data, and developing effective interpretability tools must become central research priorities alongside performance optimization.1 Current evaluation practices often lag behind model capabilities, necessitating the development of more nuanced benchmarks that probe deeper reasoning and safety aspects.17

Finally, the future likely involves integrating CLIP-like vision-language alignment capabilities within larger, more complex multimodal ecosystems. Building on the modularity demonstrated by variants like LiT and BLIP-2 56, future systems may compose specialized foundation models across vision, language, audio, 3D, and action modalities to tackle increasingly sophisticated tasks in areas like robotics, embodied AI, and human-computer interaction.3

7. Conclusion

Contrastive Language-Image Pre-training (CLIP) fundamentally altered the landscape of computer vision and multimodal AI by demonstrating that powerful, generalizable visual representations could be learned from weak natural language supervision at scale. Its simple yet effective contrastive learning approach on massive web datasets enabled unprecedented zero-shot transfer capabilities, shifting the paradigm away from reliance on curated, category-specific labeled data.

This review has charted the journey from the original CLIP model to the diverse and rapidly expanding ecosystem of its variants. We have analyzed key developments across architecture, training methodology, performance, and capabilities. Major themes have emerged from this evolution. One significant thrust has been the drive for efficiency and accessibility, leading to open-source implementations like OpenCLIP, techniques leveraging frozen components (LiT, BLIP-2), simplified loss functions (SigLIP), and strategies exploiting scaling laws or architectural modifications for low-resource training (CLIPA, SiCLIP). Another major direction has been capability expansion, moving beyond contrastive alignment to integrate generative modeling (CoCa, BLIP), adopt multi-task sequence-to-sequence frameworks (Florence-2), or enhance specific functionalities like dense prediction (SCLIP) or long-text understanding (LongCLIP). The critical role of data strategy has also become evident, evolving from simply scaling noisy web data (CLIP, ALIGN) to sophisticated curation, bootstrapping, synthetic generation, and temporal structuring (BLIP, Florence-2, TULIP, RemoteCLIP, TiC-CLIP). Finally, the trend towards modularity, composing systems from powerful pre-trained unimodal components (as exemplified by LiT and BLIP-2), represents a promising path forward.

These advancements have yielded substantial successes. Open-source efforts have democratized research. Scalability has been proven even with noisy data. Training efficiency has improved dramatically, making powerful models more attainable. Generative capabilities have been successfully integrated, and models have been adapted effectively to specialized domains like 3D vision and remote sensing.

However, persistent challenges remain. The most significant are the fundamental limitations in deep reasoning – particularly compositionality, attribute binding, counting, and negation – which seem resistant to incremental improvements within the standard CLIP framework. Ensuring model robustness, mitigating biases inherited from web data, and developing adequate interpretability methods are critical for trustworthy deployment. Furthermore, evaluation methodologies need to evolve beyond standard benchmarks to better assess these complex capabilities and failure modes.

The future outlook for vision-language models is dynamic. We anticipate continued efforts to bridge the reasoning gap, potentially through hybrid architectures or novel learning paradigms. The integration of more modalities (video, audio, 3D, action) will likely accelerate, leading towards more comprehensive multimodal foundation models. The interplay between architectural innovation, sophisticated data strategies (including synthetic data and continual learning), and objective function design will continue to shape the field. Ultimately, the goal extends beyond simple image-text alignment towards building AI systems with a deeper, more robust, and more actionable understanding of the multimodal world.

References

  1. Large language model - Wikipedia, accessed April 29, 2025, https://en.wikipedia.org/wiki/Large_language_model
  2. Foundation Models Defining a New Era in Vision: A Survey and Outlook, accessed April 29, 2025, https://www.computer.org/csdl/journal/tp/5555/01/10834497/23mYUeDuDja
  3. A Survey on Efficient Vision-Language Models - arXiv, accessed April 29, 2025, https://arxiv.org/html/2504.09724v1
  4. (PDF) Learning Transferable Visual Models From Natural Language Supervision (2021) | Alec Radford | 3672 Citations - SciSpace, accessed April 29, 2025, https://scispace.com/papers/learning-transferable-visual-models-from-natural-language-1msnnp1spo
  5. Learning Transferable Visual Models From Natural Language Supervision, accessed April 29, 2025, https://proceedings.mlr.press/v139/radford21a/radford21a.pdf
  6. arxiv.org, accessed April 29, 2025, https://arxiv.org/abs/2103.00020
  7. awaisrauf/Awesome-CV-Foundational-Models - GitHub, accessed April 29, 2025, https://github.com/awaisrauf/Awesome-CV-Foundational-Models
  8. A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges - arXiv, accessed April 29, 2025, https://arxiv.org/html/2501.02189v5
  9. Zero-shot Image Classification with OpenAI's CLIP VIT-L14 - Analytics Vidhya, accessed April 29, 2025, https://www.analyticsvidhya.com/blog/2024/09/clip-vit-l14/
  10. CLIP Explained - Papers With Code, accessed April 29, 2025, https://paperswithcode.com/method/clip
  11. CLIP: Connecting text and images - OpenAI, accessed April 29, 2025, https://openai.com/index/clip/
  12. Learning Transferable Visual Models From Natural Language Supervision - arXiv, accessed April 29, 2025, https://arxiv.org/pdf/2103.00020
  13. LAION-5B: An open large-scale dataset for training next generation image-text models - OpenReview, accessed April 29, 2025, https://openreview.net/pdf?id=M3Y74vmsMcY
  14. Scaling Up Visual and Vision-Language Representation Learning ..., accessed April 29, 2025, https://www.researchgate.net/publication/349236966_Scaling_Up_Visual_and_Vision-Language_Representation_Learning_With_Noisy_Text_Supervision
  15. proceedings.neurips.cc, accessed April 29, 2025, https://proceedings.neurips.cc/paper_files/paper/2022/file/2cd36d327f33d47b372d4711edd08de0-Paper-Conference.pdf
  16. Duoduo CLIP: Efficient 3D Understanding with Multi-View Images - arXiv, accessed April 29, 2025, https://arxiv.org/html/2406.11579
  17. TiC-CLIP: Continual Training of CLIP Models - arXiv, accessed April 29, 2025, https://arxiv.org/html/2310.16226v3
  18. www.ecva.net, accessed April 29, 2025, https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/03178.pdf
  19. TULIP: Towards Unified Language-Image Pretraining - arXiv, accessed April 29, 2025, https://arxiv.org/html/2503.15485
  20. LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text - arXiv, accessed April 29, 2025, https://arxiv.org/html/2503.19311v1
  21. Decoding Long-CLIP: Understand the Power of Zero-Shot Classification | DigitalOcean, accessed April 29, 2025, https://www.digitalocean.com/community/tutorials/long-clip-zero-shot-classification-text-analysis
  22. [2403.15378] Long-CLIP: Unlocking the Long-Text Capability of CLIP - arXiv, accessed April 29, 2025, https://arxiv.org/abs/2403.15378
  23. Unlocking the Long-Text Capability of CLIP - arXiv, accessed April 29, 2025, https://arxiv.org/html/2403.15378v3
  24. Is CLIP ideal? No. Can we fix it? Yes! - arXiv, accessed April 29, 2025, https://arxiv.org/html/2503.08723v1
  25. mlfoundations/open_clip: An open source implementation ... - GitHub, accessed April 29, 2025, https://github.com/mlfoundations/open_clip
  26. An Inverse Scaling Law for CLIP Training, accessed April 29, 2025, https://proceedings.neurips.cc/paper_files/paper/2023/file/996e2b446391fcb8bf32a3d1645cc799-Paper-Conference.pdf
  27. OpenCLIP - open-clip-torch · PyPI, accessed April 29, 2025, https://pypi.org/project/open-clip-torch/2.0.2/
  28. open_clip/docs/PRETRAINED.md at main - GitHub, accessed April 29, 2025, https://github.com/mlfoundations/open_clip/blob/main/docs/PRETRAINED.md
  29. laion/CLIP-convnext_base_w-laion2B-s13B-b82K - Hugging Face, accessed April 29, 2025, https://huggingface.co/laion/CLIP-convnext_base_w-laion2B-s13B-b82K
  30. Large scale openCLIP: L/14, H/14 and g/14 trained on LAION-2B, accessed April 29, 2025, https://laion.ai/blog/large-openclip/
  31. Training Contrastive Captioners - LAION, accessed April 29, 2025, https://laion.ai/blog/coca/
  32. Reaching 80% zero-shot accuracy with OpenCLIP: ViT-G/14 trained on LAION-2B, accessed April 29, 2025, https://laion.ai/blog/giant-openclip/
  33. CLIP ViT H 14 Laion2B S32B B79K · Models - Dataloop AI, accessed April 29, 2025, https://dataloop.ai/library/model/laion_clip-vit-h-14-laion2b-s32b-b79k/
  34. CLIP-ViT-H-14-laion2B-s32B-b79K - PromptLayer, accessed April 29, 2025, https://www.promptlayer.com/models/clip-vit-h-14-laion2b-s32b-b79k
  35. CLIP-ViT-bigG-14-laion2B-39B-b160k - PromptLayer, accessed April 29, 2025, https://www.promptlayer.com/models/clip-vit-bigg-14-laion2b-39b-b160k-40ad
  36. Scaling Up Visual and Vision-Language Representation Learning with Align - Toolify.ai, accessed April 29, 2025, https://www.toolify.ai/ai-news/scaling-up-visual-and-visionlanguage-representation-learning-with-align-1054218
  37. ALIGN: A Large-scale ImaGe and Noisy-text Model - GeeksforGeeks, accessed April 29, 2025, https://www.geeksforgeeks.org/align-a-large-scale-image-and-noisy-text-model/
  38. ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy, accessed April 29, 2025, https://research.google/blog/align-scaling-up-visual-and-vision-language-representation-learning-with-noisy-text-supervision/
  39. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Papers With Code, accessed April 29, 2025, https://paperswithcode.com/paper/scaling-up-visual-and-vision-language
  40. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, accessed April 29, 2025, http://proceedings.mlr.press/v139/jia21b/jia21b.pdf
  41. Explanation Alignment: Quantifying the Correctness of Model Reasoning At Scale - MIT Visualization Group, accessed April 29, 2025, https://vis.csail.mit.edu/pubs/explanation-alignment.pdf
  42. Impact of Preference Noise on the Alignment Performance of Generative Language Models, accessed April 29, 2025, https://openreview.net/forum?id=nMAaCsCTCI
  43. Florence: A New Foundation Model for Computer Vision - arXiv, accessed April 29, 2025, http://arxiv.org/pdf/2111.11432
  44. CVPR Poster Florence-2: Advancing a Unified Representation for a ..., accessed April 29, 2025, https://cvpr2023.thecvf.com/virtual/2024/poster/30529
  45. Florence-2: Microsoft's New Foundation Model Explained - Encord, accessed April 29, 2025, https://encord.com/blog/florence-2-explained/
  46. microsoft/Florence-2-large - Hugging Face, accessed April 29, 2025, https://huggingface.co/microsoft/Florence-2-large
  47. Florence: Novel Vision Foundation Model by Microsoft - Zilliz Learn, accessed April 29, 2025, https://zilliz.com/learn/florence-novel-vision-foundation-model-by-microsoft
  48. Microsoft's Florence-2: The Ultimate Unified Model - viso.ai, accessed April 29, 2025, https://viso.ai/computer-vision/florence-2/
  49. How to Fine-tune Florence-2 for Object Detection Tasks - Roboflow Blog, accessed April 29, 2025, https://blog.roboflow.com/fine-tune-florence-2-object-detection/
  50. Microsoft's Florence-2: The Future of Unified Vision AI Models - Ikomia, accessed April 29, 2025, https://www.ikomia.ai/blog/microsoft-florence-2-unified-vision-ai
  51. Florence-2: Zero-Shot Vision AI by Microsoft | Ultralytics, accessed April 29, 2025, https://www.ultralytics.com/blog/florence-2-microsofts-latest-vision-language-model
  52. Florence: A New Foundation Model for Computer Vision | Papers With Code, accessed April 29, 2025, https://paperswithcode.com/paper/florence-a-new-foundation-model-for-computer
  53. Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion - arXiv, accessed April 29, 2025, https://arxiv.org/html/2412.04424v1
  54. Comparing Zero-Shot Object Detection Models: YOLO vs. Florence 2 - Sieve, accessed April 29, 2025, https://www.sievedata.com/resources/comparing-zero-shot-object-detection-yolo-florence
  55. CoCa: Contrastive Captioners are Image-Text Foundation Models, accessed April 29, 2025, https://r.jordan.im/download/language-models/yu2022.pdf
  56. arxiv.org, accessed April 29, 2025, https://arxiv.org/abs/2111.07991
  57. [2111.07991] LiT: Zero-Shot Transfer with Locked-image text Tuning - ar5iv - arXiv, accessed April 29, 2025, https://ar5iv.labs.arxiv.org/html/2111.07991
  58. arXiv:2111.07991v3 [cs.CV] 22 Jun 2022, accessed April 29, 2025, https://arxiv.org/pdf/2111.07991
  59. (PDF) LiT Tuned Models for Efficient Species Detection - ResearchGate, accessed April 29, 2025, https://www.researchgate.net/publication/368687900_LiT_Tuned_Models_for_Efficient_Species_Detection
  60. FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training - arXiv, accessed April 29, 2025, https://arxiv.org/html/2411.11927v1
  61. Three Towers: Flexible Contrastive Learning with Pretrained Image Models - OpenReview, accessed April 29, 2025, https://openreview.net/forum?id=LSYQB4CwD3
  62. Salesforce/blip-image-captioning-large - Hugging Face, accessed April 29, 2025, https://huggingface.co/Salesforce/blip-image-captioning-large
  63. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models - arXiv, accessed April 29, 2025, https://arxiv.org/html/2301.12597
  64. arxiv.org, accessed April 29, 2025, https://arxiv.org/abs/2201.12086
  65. (PDF) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation - ResearchGate, accessed April 29, 2025, https://www.researchgate.net/publication/358232774_BLIP_Bootstrapping_Language-Image_Pre-training_for_Unified_Vision-Language_Understanding_and_Generation
  66. Blip2 Image To Text · Models - Dataloop AI, accessed April 29, 2025, https://dataloop.ai/library/model/paragon-ai_blip2-image-to-text/
  67. [23.01] BLIP-2 - DOCSAID, accessed April 29, 2025, https://docsaid.org/en/papers/model-tuning/blip2/
  68. VQA v2 val Benchmark (Visual Question Answering (VQA)) - Papers With Code, accessed April 29, 2025, https://paperswithcode.com/sota/visual-question-answering-on-vqa-v2-val
  69. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Papers With Code, accessed April 29, 2025, https://paperswithcode.com/paper/blip-2-bootstrapping-language-image-pre
  70. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models - The Nemati Lab, accessed April 29, 2025, https://www.nematilab.info/bmijc/assets/081823_paper.pdf
  71. BLIP-2 - Hugging Face, accessed April 29, 2025, https://huggingface.co/docs/transformers/v4.34.1/model_doc/blip-2
  72. MS COCO Benchmark (Zero-shot Text-to-Image Retrieval) - Papers With Code, accessed April 29, 2025, https://paperswithcode.com/sota/zero-shot-text-to-image-retrieval-on-ms-coco
  73. WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models, accessed April 29, 2025, https://nips.cc/virtual/2024/poster/97785
  74. MS COCO Benchmark (Image-to-Text Retrieval) - Papers With Code, accessed April 29, 2025, https://paperswithcode.com/sota/image-to-text-retrieval-on-coco
  75. nielsr/comparing-captioning-models · BLIP 2 comparison? - Hugging Face, accessed April 29, 2025, https://huggingface.co/spaces/nielsr/comparing-captioning-models/discussions/42
  76. arxiv.org, accessed April 29, 2025, https://arxiv.org/abs/2303.15343
  77. Sigmoid Loss for Language Image Pre-Training - arXiv, accessed April 29, 2025, https://arxiv.org/pdf/2303.15343
  78. Sigmoid Loss for Language Image Pre-Training - CVF Open Access, accessed April 29, 2025, https://openaccess.thecvf.com/content/ICCV2023/papers/Zhai_Sigmoid_Loss_for_Language_Image_Pre-Training_ICCV_2023_paper.pdf
  79. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features - arXiv, accessed April 29, 2025, https://arxiv.org/html/2502.14786v1
  80. google/siglip-base-patch16-384 - Hugging Face, accessed April 29, 2025, https://huggingface.co/google/siglip-base-patch16-384
  81. arxiv.org, accessed April 29, 2025, https://arxiv.org/pdf/2306.15658
  82. Simplifying CLIP: Unleashing the Power of Large-Scale Models on Consumer-level Computers - arXiv, accessed April 29, 2025, https://arxiv.org/html/2411.14789v1
  83. CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet... - OpenReview, accessed April 29, 2025, https://openreview.net/forum?id=0hTtit3AAm
  84. simplifying clip: unleashing the power of large-scale models on consumer-level computers - arXiv, accessed April 29, 2025, https://arxiv.org/pdf/2411.14789
  85. LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models - arXiv, accessed April 29, 2025, https://arxiv.org/html/2312.00674v1
  86. LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models - arXiv, accessed April 29, 2025, https://arxiv.org/abs/2312.00674
  87. [2412.03871] CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance - arXiv, accessed April 29, 2025, https://arxiv.org/abs/2412.03871
  88. TiC-CLIP: Continual Training of CLIP Models - Apple Machine Learning Research, accessed April 29, 2025, https://machinelearning.apple.com/research/tic-clip-v2
  89. TiC-CLIP: Continual Training of CLIP Models - arXiv, https://arxiv.org/pdf/2310.16226
  90. [2310.16226] TiC-CLIP: Continual Training of CLIP Models - arXiv, accessed April 29, 2025, https://arxiv.org/abs/2310.16226
  91. Paper page - TiC-CLIP: Continual Training of CLIP Models - Hugging Face, accessed April 29, 2025, https://huggingface.co/papers/2310.16226
  92. [PDF] TiC-CLIP: Continual Training of CLIP Models - Semantic Scholar, accessed April 29, 2025, https://www.semanticscholar.org/paper/21f005c5b15f5c41a45b76b733cb928dfc8e9b05
  93. TiC-CLIP: Continual Training of CLIP Models - OpenReview, accessed April 29, 2025, https://openreview.net/forum?id=TLADT8Wrhn
  94. [2502.17860] UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting, accessed April 29, 2025, https://arxiv.org/abs/2502.17860
  95. Duoduo CLIP: Efficient 3D Understanding with Multi-View Images - arXiv, https://arxiv.org/pdf/2406.11579
  96. (PDF) Duoduo CLIP: Efficient 3D Understanding with Multi-View Images - ResearchGate, accessed April 29, 2025, https://www.researchgate.net/publication/381485790_Duoduo_CLIP_Efficient_3D_Understanding_with_Multi-View_Images
  97. [2406.11579] Duoduo CLIP: Efficient 3D Understanding with Multi-View Images - arXiv, accessed April 29, 2025, https://arxiv.org/abs/2406.11579
  98. Duoduo CLIP: Efficient 3D Understanding with Multi-View Images | OpenReview, accessed April 29, 2025, https://openreview.net/forum?id=iGbuc9ekKK
  99. 3dlg-hcvc/DuoduoCLIP: [ICLR 2025] Duoduo CLIP: Efficient 3D Understanding with Multi-View Images - GitHub, accessed April 29, 2025, https://github.com/3dlg-hcvc/DuoduoCLIP
  100. (PDF) RemoteCLIP: A Vision Language Foundation Model for Remote Sensing, accessed April 29, 2025, https://www.researchgate.net/publication/371729225_RemoteCLIP_A_Vision_Language_Foundation_Model_for_Remote_Sensing
  101. LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text - arXiv, https://arxiv.org/pdf/2503.19311
  102. LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text | Papers With Code, accessed April 29, 2025, https://paperswithcode.com/paper/lrsclip-a-vision-language-foundation-model
  103. [2306.11029] RemoteCLIP: A Vision Language Foundation Model for Remote Sensing - arXiv, accessed April 29, 2025, https://arxiv.org/abs/2306.11029
  104. TULIP: Towards Unified Language-Image Pretraining - ResearchGate, accessed April 29, 2025, https://www.researchgate.net/publication/390020614_TULIP_Towards_Unified_Language-Image_Pretraining
  105. TULIP: Towards Unified Language-Image Pretraining - arXiv, https://arxiv.org/pdf/2503.15485
  106. TULIP: Towards Unified Language-Image Pretraining, accessed April 29, 2025, https://tulip-berkeley.github.io/
  107. arxiv.org, accessed April 29, 2025, https://arxiv.org/abs/2311.16777
  108. Unlocking the Long-Text Capability of CLIP - European Computer Vision Association, accessed April 29, 2025, https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/06793.pdf
  109. Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder, accessed April 29, 2025, https://arxiv.org/html/2411.05195v2
  110. Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies - arXiv, accessed April 29, 2025, https://arxiv.org/html/2404.08197v1
  111. On the Limitations of Vision-Language Models in Understanding Image Transforms - arXiv, accessed April 29, 2025, https://arxiv.org/html/2503.09837v2
  112. LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation - arXiv, accessed April 29, 2025, https://arxiv.org/html/2411.04997v3
  113. LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation - arXiv, accessed April 29, 2025, https://arxiv.org/html/2411.04997v1
  114. [2109.01903] Robust fine-tuning of zero-shot models - ar5iv, accessed April 29, 2025, https://ar5iv.labs.arxiv.org/html/2109.01903
  115. salesforce/BLIP: PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation - GitHub, accessed April 29, 2025, https://github.com/salesforce/BLIP
  116. arXiv:2212.05171v4 [cs.CV] 12 Jun 2023, accessed April 29, 2025, https://r.jordan.im/download/language-models/xue2022.pdf
  117. Survey on Vision-Language-Action Models - arXiv, accessed April 29, 2025, https://arxiv.org/html/2502.06851v2
  118. A Survey on Vision-Language-Action Models for Embodied AI - arXiv, accessed April 29, 2025, https://arxiv.org/html/2405.14093v4
  119. [2405.00754] CLIPArTT: Adaptation of CLIP to New Domains at Test Time - arXiv, accessed April 29, 2025, https://arxiv.org/abs/2405.00754
  120. Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion - arXiv, accessed April 29, 2025, https://arxiv.org/html/2306.11593
  121. clip-ViT-L-14 | AI Model Details - AIModels.fyi, accessed April 29, 2025, https://www.aimodels.fyi/models/huggingFace/clip-vit-l-14-sentence-transformers
  122. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision - Microsoft, accessed April 29, 2025, https://www.microsoft.com/en-us/research/wp-content/uploads/2021/10/080421_Yinfei_Yang_ALIGN_MSR.pdf
  123. Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity - arXiv, accessed April 29, 2025, https://arxiv.org/html/2409.04918v2
  124. [2310.07394] CLIP for Lightweight Semantic Segmentation - arXiv, accessed April 29, 2025, https://arxiv.org/abs/2310.07394
  125. Duoduo CLIP: Efficient 3D Understanding with Multi-View Images - AIModels.fyi, accessed April 29, 2025, https://www.aimodels.fyi/papers/arxiv/duoduo-clip-efficient-3d-understanding-multi-view