CoMP: Continual Multimodal Pre-training
for Vision Foundation Models


*Equal Contributions        Corresponding Author

1 Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
2 Shanghai Innovation Institute

Pre-trained Vision Foundation Models (VFMs) provide strong representations. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed Continual Multimodal Pre-training pipeline. Specifically, CoMP uses a Continual Rotary Position Embedding to accommodate visual inputs with different resolutions, and an Alignment Loss between visual and textual features for better cross-modal alignment. After continual pre-training, leading VFMs like DINOv2 and SigLIP achieve remarkable improvements not only in multimodal understanding tasks but also in generic classification and segmentation tasks.



CoMP-MM

Under a similar pre-training data size, CoMP-MM significantly outperforms all other methods and achieves state-of-the-art performance among open-source models across multiple benchmarks, at both the 1B and 7B model scales.

CoMP-SigLIP & CoMP-DINOv2

Our models outperform CLIP, SigLIP and DINOv2 by a significant margin. Notably, our CoMP-SigLIP-So400M outperforms AIMv2-H (600M) on most tasks, and our CoMP-DINOv2-L also surpasses DINOv2-G, demonstrating the effectiveness of our method.

Methods

CoMP builds upon (1) C-RoPE, a Continual Rotary Position Embedding for vision models that adds the standard 2D RoPE on top of the learned 1D position embedding, enabling continual pre-training at native resolution; and (2) an Alignment Loss, a cross-entropy loss between visual and textual features computed through language prototypes, which aligns the multimodal representations of VFMs and LMMs.

(1) C-RoPE
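
The paper's exact implementation is not reproduced here; the following is a minimal PyTorch sketch of the C-RoPE idea, assuming a square original position-embedding grid with no class token and an illustrative frequency base. The VFM's learned 1D absolute position embedding is interpolated to the current patch grid and added to the patch tokens, while a standard 2D RoPE rotation is applied to queries and keys inside attention, so images of varying native resolutions can be processed.

import torch
import torch.nn.functional as F

def interpolate_1d_pos_embed(pos_embed, grid_h, grid_w):
    """Resize a learned position embedding of shape (1, N, D) (square grid,
    no class token assumed) to a (grid_h, grid_w) patch grid."""
    n, dim = pos_embed.shape[1], pos_embed.shape[2]
    side = int(n ** 0.5)
    pe = pos_embed.reshape(1, side, side, dim).permute(0, 3, 1, 2)
    pe = F.interpolate(pe, size=(grid_h, grid_w), mode="bicubic", align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(1, grid_h * grid_w, dim)

def rope_1d(x, pos, base=100.0):
    """Standard RoPE on the last dim of x (..., N, d), using positions pos of shape (N,)."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    ang = pos.float()[:, None] * freqs                                  # (N, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each (even, odd) pair by its angle and re-interleave.
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

def rope_2d(q, k, grid_h, grid_w):
    """2D RoPE: rotate half of the head dim by the row index, the other half by the column index."""
    d = q.shape[-1] // 2
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    ys, xs = ys.flatten(), xs.flatten()
    apply = lambda t: torch.cat([rope_1d(t[..., :d], ys), rope_1d(t[..., d:], xs)], dim=-1)
    return apply(q), apply(k)

# Illustrative usage inside a ViT block (shapes only):
#   tokens = patch_embed(image) + interpolate_1d_pos_embed(learned_pos, H_p, W_p)
#   q, k = rope_2d(q, k, H_p, W_p)   # q, k: (B, heads, H_p * W_p, head_dim)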

(2) Alignment Loss
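
Below is a minimal, hedged PyTorch sketch of a prototype-based alignment loss of this kind: visual and textual features are both projected onto a shared set of language prototypes, and a cross-entropy is computed between the two resulting distributions so that visual features move toward the language space. The pooling, the temperature tau, and the choice of the LLM token embedding table as the prototype matrix are assumptions for illustration, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def alignment_loss(visual_feats, text_feats, prototypes, tau=1.0):
    """
    visual_feats: (B, D)  pooled visual features, already projected to the LLM dim
    text_feats:   (B, D)  pooled textual features from the language model
    prototypes:   (V, D)  language prototypes, e.g. the LLM's token embedding table (assumed)
    """
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)

    # Distributions over the language prototypes.
    logits_v = v @ p.t() / tau      # (B, V)
    logits_t = t @ p.t() / tau      # (B, V)

    # Cross-entropy with the (detached) textual distribution as the soft target,
    # pulling the visual distribution toward the language side.
    target = logits_t.softmax(dim=-1).detach()
    return -(target * logits_v.log_softmax(dim=-1)).sum(dim=-1).mean()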

BibTeX


@article{comp2025,
  title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models},
  author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
  year={2025},
  journal={arXiv preprint arXiv:2503.18931},
}
        

This website is adapted from Nerfies and MathVista, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.