Highlighting the Bias via Padding: First, we demonstrate that CLIP text encoders are heavily biased toward early tokens by evaluating retrieval on the first 2 sentences of the DOCCI dataset. When we prepend uninformative filler sentences ("This is a photo.") to the caption, performance drops significantly. Simply pushing the informative text further back in the token sequence is enough to degrade the model's retrieval accuracy.
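The padding probe can be sketched in a few lines; the function name and filler count below are illustrative, not from the evaluation code:

```python
FILLER = "This is a photo."

def pad_with_filler(caption: str, n_fillers: int) -> str:
    """Prepend n uninformative filler sentences, pushing the real
    caption deeper into the token sequence before tokenization."""
    return " ".join([FILLER] * n_fillers + [caption])

# The informative text now starts roughly n_fillers sentences later.
padded = pad_with_filler("A red bicycle leans against a brick wall.", 3)
```

Because the filler carries no image-specific information, any drop in retrieval accuracy on the padded captions is attributable to position alone.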
The Impact of Sentence Order: Next, we show that even among early tokens, sentence order strictly dictates performance. Evaluating retrieval on the first 2 sentences of DOCCI, we compare the original text (First 2) against a permuted version (Swap 2) and a baseline using only the first sentence (First only). For all CLIP models, pushing the highly informative first sentence to the second position causes a measurable drop in accuracy. Notably, this structural reliance is so pronounced in SigLIP models that the performance drop from displacing the summary sentence entirely negates any informational gain provided by the second sentence.
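The three conditions can be generated with a naive period-based splitter; this is a sketch, and the paper's exact preprocessing may differ:

```python
def split_sentences(caption: str) -> list[str]:
    # Naive split on sentence-final periods; adequate for DOCCI-style prose.
    parts = [s.strip() for s in caption.split(". ") if s.strip()]
    return [s if s.endswith(".") else s + "." for s in parts]

def make_variants(caption: str) -> dict[str, str]:
    """Build the First 2 / Swap 2 / First only evaluation inputs."""
    s = split_sentences(caption)
    assert len(s) >= 2, "need at least two sentences"
    return {
        "first_2": " ".join(s[:2]),       # original order
        "swap_2": " ".join([s[1], s[0]]), # summary sentence displaced
        "first_only": s[0],               # information-ablated baseline
    }
```

Comparing "swap_2" against "first_only" is what isolates the positional effect: both contain the first sentence, but only one places it at the front.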
Long-Context Models Fail Similarly: Finally, we extend this evaluation to models explicitly trained on long captions, using Long-CLIP as our baseline. Testing on full-length captions, we move the crucial first sentence to the fourth position. Just like standard CLIP, these long-context baselines suffer a sharp drop in performance when the text is reordered. In contrast, our proposed DeBias-CLIP not only achieves consistently higher overall performance but also proves significantly more robust to this sentence permutation.
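The reordering itself is a one-liner over the sentence list; a sketch with 0-based indexing, where target position 3 corresponds to the fourth sentence:

```python
def displace_first(sentences: list[str], target_pos: int = 3) -> list[str]:
    """Move the opening (summary) sentence to index target_pos,
    keeping the relative order of all other sentences."""
    rest = sentences[1:]
    return rest[:target_pos] + [sentences[0]] + rest[target_pos:]
```

The caption's information content is unchanged; only the position of the summary sentence moves.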
DeBias-CLIP builds upon the Long-CLIP baseline by training on long-caption datasets with two text inputs per image: the full original caption and an augmented caption. Instead of reinforcing the early-token bias by training on the first sentence alone, our method applies three simple sentence-level augmentations to construct the augmented caption.
First, we remove the opening summary sentence. Because this initial sentence is typically highly information-dense, it provides a training shortcut that easily satisfies the text-image contrastive loss. By removing it, we force the model to stop relying on summaries and actually learn from the fine-grained details in the rest of the caption.
While this simple trick greatly improves long-text retrieval, it can also negatively impact short-text performance. To counteract this and maintain versatility, we randomly subsample the remaining sentences. This dynamically generates shorter training captions, increasing the difficulty of the task by deliberately reducing the available information.
However, subsampling alone can cause the model to drift away from the original pretrained CLIP weights, as it forces less informative tokens into the early positions. To prevent this drift, and to properly train the later positional embeddings, we add padding tokens to the beginning of the text. This pushes the meaningful information deeper into the context window. Because we simultaneously train on the original full caption, the early positions still learn from the dense summary, while the later positions are finally forced to align with deep caption details.
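Putting the three augmentations together, a minimal sketch; the function name, keep probability, and padding budget are illustrative choices, not the paper's values:

```python
import random

def augment(sentences: list[str], rng: random.Random,
            keep_prob: float = 0.5, max_pad: int = 20) -> tuple[list[str], int]:
    """Sentence-level augmentation sketch:
    1. drop the opening summary sentence;
    2. randomly subsample the remaining sentences (keeping at least one);
    3. return a number of padding tokens to prepend at tokenization time,
       pushing the kept text deeper into the context window.
    Assumes the caption has at least two sentences."""
    rest = sentences[1:]
    kept = [s for s in rest if rng.random() < keep_prob]
    if not kept:  # never emit an empty caption
        kept = [rng.choice(rest)]
    n_pad = rng.randrange(max_pad + 1)
    return kept, n_pad
```

Each training step would then tokenize both the full caption and `" ".join(kept)` prefixed by `n_pad` padding tokens, so the early positional embeddings keep learning from the dense summary while the later positions are trained against fine-grained details.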
@article{lavoie2026clip,
  author  = {Lavoie, Marc-Antoine and Mahmoud, Anas and Zaimi, Aldo and Tchango, Arsene Fansi and Waslander, Steven L},
  title   = {CLIP Is Shortsighted: Paying Attention Beyond the First Sentence},
  journal = {arXiv preprint arXiv:2602.22419},
  year    = {2026},
}