CLIP Is Shortsighted: Paying Attention Beyond the First Sentence [CVPR 2026]

Marc-Antoine Lavoie1, Anas Mahmoud2, Aldo Zaimi2, Arsene Fansi Tchango2, Steven L. Waslander1
1University of Toronto Robotics Institute, 2Mila - Quebec AI Institute
DeBias-CLIP schematic

DeBias-CLIP improves long-context understanding in CLIP-style models by training on long captions with simple text augmentations. Our model has a significantly flatter text token attention distribution across the context window compared to the baseline.

Summary

  • CLIP-style Models Are Biased: We empirically demonstrate that CLIP and its variants (e.g., SigLIP) inherently favor early text tokens, a bias that persists even when training on longer captions (e.g., Long-CLIP).
  • Long-Caption Datasets Are Biased: Training and evaluation datasets used in long-context retrieval (e.g., ShareGPT4V) share a common structure: paragraph-length texts almost always start with a summary sentence, creating a shortcut that is sufficient for image-text matching.
  • DeBias-CLIP - Sentence-Level Augmentations: To resolve this issue, we introduce simple sentence-level augmentations: (1) dropping the initial summary sentence, (2) randomly permuting and subsampling the rest, and (3) padding the token sequence.
  • A Simple Drop-In Replacement: Our DeBias-CLIP method introduces zero new trainable parameters beyond extended positional embeddings, and our text augmentations add negligible computational overhead.

Uncovering the Biases of CLIP Text Encoders

Effect of the number of padding sentences on retrieval performance

Highlighting the Bias via Padding: First, we demonstrate that CLIP text encoders are heavily biased toward early tokens by evaluating retrieval on the first 2 sentences of the DOCCI dataset. When we prepend uninformative filler sentences ("This is a photo.") to the caption, performance drops significantly. Simply pushing the informative text further back in the token sequence is enough to degrade the model's retrieval accuracy.
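The padding probe above can be sketched in a few lines; this is a hypothetical reconstruction of the evaluation trick, not the authors' exact code (the filler string follows the example in the text, the function name is our own):

```python
def pad_caption(caption: str, n_pad: int,
                filler: str = "This is a photo.") -> str:
    """Prepend n_pad uninformative filler sentences to a caption,
    pushing the informative tokens deeper into the context window."""
    return " ".join([filler] * n_pad + [caption])


# Example: two filler sentences displace the real content by ~10 tokens.
padded = pad_caption("A red bicycle leans against a brick wall.", n_pad=2)
# -> "This is a photo. This is a photo. A red bicycle leans against a brick wall."
```

Retrieval is then re-evaluated with the padded captions; for a bias-free encoder, accuracy should be unaffected by the filler.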

Effect of sentence removal and permutation on CLIP retrieval performance

The Impact of Sentence Order: Next, we show that even among early tokens, sentence order strongly dictates performance. Evaluating retrieval on the first 2 sentences of DOCCI, we compare the original text (First 2) against a permuted version (Swap 2) and a baseline using only the first sentence (First only). For all CLIP models, pushing the highly informative first sentence to the second position causes a measurable drop in accuracy. Notably, this structural reliance is so pronounced in SigLIP models that the performance drop from displacing the summary sentence entirely negates the informational gain provided by the second sentence.

Effect of sentence removal and permutation for retrieval with long-context CLIP models

Long-Context Models Fail Similarly: Finally, we extend this evaluation to models explicitly trained on long captions, using Long-CLIP as our baseline. Testing on full-length captions, we move the crucial first sentence to the fourth position. Just like standard CLIP, these long-context baselines suffer a sharp drop in performance when the text is reordered. In contrast, our proposed DeBias-CLIP not only achieves consistently higher overall performance but also proves significantly more robust to this sentence permutation.

DeBias-CLIP: 3 Simple Text Augmentations

DeBias-CLIP sentence-level augmentation schematic

DeBias-CLIP builds upon the Long-CLIP baseline by training on long-caption datasets using two different text inputs: the full original caption and an augmented caption. Instead of reinforcing the early-token bias by training on the first sentence alone, our method proposes three simple sentence-level augmentations.

First, we remove the opening summary sentence. Because this initial sentence is typically highly information-dense, it provides a training shortcut that easily satisfies the text-image contrastive loss. By removing it, we force the model to stop relying on summaries and actually learn from the fine-grained details in the rest of the caption.

While this simple trick greatly improves long-text retrieval, it can also negatively impact short-text performance. To counteract this and maintain versatility, we randomly subsample the remaining sentences. This dynamically generates shorter training captions, increasing the difficulty of the task by deliberately reducing the available information.

However, subsampling alone can cause the model to drift away from the original pretrained CLIP weights, as it forces less informative tokens into the early positions. To prevent this drift, and to properly train the later positional embeddings, we add padding tokens to the beginning of the text. This pushes the meaningful information deeper into the context window. Because we simultaneously train on the original full caption, the early positions still learn from the dense summary, while the later positions are finally forced to align with deep caption details.
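The three augmentations above can be combined into a single caption transform. The sketch below is a minimal illustration under our own assumptions: sentences are split naively on periods, the subsampling probability and padding range are placeholders, and the paper pads at the token level rather than with filler sentences as done here.

```python
import random

def debias_augment(caption: str,
                   keep_prob: float = 0.5,
                   max_pad: int = 3,
                   filler: str = "This is a photo") -> str:
    """Sentence-level augmentation sketch: drop the summary sentence,
    permute and subsample the rest, then pad the start of the text."""
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    if len(sentences) < 2:
        return caption  # too short to augment

    # (1) Drop the opening summary sentence to remove the shortcut.
    body = sentences[1:]

    # (2) Randomly permute and subsample the remaining detail sentences.
    random.shuffle(body)
    kept = [s for s in body if random.random() < keep_prob] or body[:1]

    # (3) Prepend padding to push content deeper into the context window
    #     (a sentence-level stand-in for the paper's token-level padding).
    pad = [filler] * random.randint(0, max_pad)

    return ". ".join(pad + kept) + "."
```

During training, each image is paired with both the full original caption and an augmented caption, so early positions still see the dense summary while later positions learn from displaced details.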

Improved Long-Context Retrieval - Visualizations

Long-Context Retrieval Visualization

Improved Image Generation with Stable Diffusion XL

SDXL Generation Comparison

BibTeX

@article{lavoie2026clip,
  author    = {Lavoie, Marc-Antoine and Mahmoud, Anas and Zaimi, Aldo and Tchango, Arsene Fansi and Waslander, Steven L.},
  title     = {CLIP Is Shortsighted: Paying Attention Beyond the First Sentence},
  journal   = {arXiv preprint arXiv:2602.22419},
  year      = {2026}
}