Pre-trained Vision Foundation Models (VFMs) provide strong representations. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed Continual Multimodal Pre-training pipeline. Specifically, CoMP uses a Continual Rotary Position Embedding to accommodate visual inputs with different resolutions, and an Alignment Loss between visual and textual features for better cross-modal alignment. After continual pre-training, leading VFMs like DINOv2 and SigLIP achieve remarkable improvements not only in multimodal understanding tasks but also in generic classification and segmentation tasks.
CoMP consists of two key components:
(1) C-RoPE, a Continual Rotary Position Embedding that lets the VFM accommodate visual inputs of varying resolutions;
(2) an Alignment Loss between visual and textual features for better cross-modal alignment.
Minimal sketches of both components are given below.
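For intuition, here is a minimal PyTorch sketch of a 2D rotary position embedding applied to ViT patch tokens: rotation angles are derived from each patch's (row, column) index, so the same module handles any input resolution by simply recomputing the grid. The function names and the row/column split of the head dimension are illustrative assumptions, not the exact C-RoPE formulation, which also has to interoperate with the VFM's original position embeddings as described in the paper.

# Minimal 2D rotary position embedding sketch (PyTorch). Illustrative only:
# names and the row/column split of the head dimension are assumptions,
# not the paper's exact C-RoPE implementation.
import torch

def rope_2d_freqs(h, w, dim, base=10000.0):
    # Complex rotation factors for an h x w patch grid. Half of the head
    # dimension encodes the row index, the other half the column index,
    # so any resolution is handled by recomputing the grid.
    assert dim % 4 == 0, "head dim must be divisible by 4 for 2D RoPE"
    quarter = dim // 4
    inv_freq = 1.0 / (base ** (torch.arange(quarter).float() / quarter))
    ys, xs = torch.meshgrid(torch.arange(h).float(), torch.arange(w).float(), indexing="ij")
    ang_y = ys.flatten()[:, None] * inv_freq[None, :]      # (h*w, dim/4)
    ang_x = xs.flatten()[:, None] * inv_freq[None, :]      # (h*w, dim/4)
    angles = torch.cat([ang_y, ang_x], dim=-1)             # (h*w, dim/2)
    return torch.polar(torch.ones_like(angles), angles)    # complex rotations

def apply_rope(x, freqs):
    # Rotate query/key tokens. x: (batch, heads, h*w, dim).
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(x_c * freqs).flatten(-2).type_as(x)

# The same module serves 224x224 and 448x448 inputs with 14x14 patches by
# recomputing the grid: rope_2d_freqs(16, 16, 64) vs rope_2d_freqs(32, 32, 64).
q = torch.randn(2, 8, 16 * 16, 64)
q_rot = apply_rope(q, rope_2d_freqs(16, 16, 64))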
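The Alignment Loss pulls visual features toward the language representation space. The snippet below is not the paper's exact objective; it sketches a generic symmetric contrastive alignment between pooled visual features and paired text features, with the temperature, pooling, and feature shapes as assumptions.

# Generic symmetric contrastive alignment between visual and textual features
# (PyTorch). Illustrative only: the paper's Alignment Loss may differ; the
# temperature, pooling, and feature shapes here are assumptions.
import torch
import torch.nn.functional as F

def alignment_loss(vis_feats, txt_feats, temperature=0.07):
    # vis_feats: (batch, dim) pooled visual features from the VFM
    # txt_feats: (batch, dim) textual features of the paired captions
    v = F.normalize(vis_feats, dim=-1)
    t = F.normalize(txt_feats, dim=-1)
    logits = v @ t.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs lie on the diagonal
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

vis = torch.randn(8, 1024)   # dummy features for a quick check
txt = torch.randn(8, 1024)
print(alignment_loss(vis, txt).item())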
@article{comp2025,
  title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models},
  author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2503.18931},
  year={2025},
}
This website is adapted from Nerfies and MathVista, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.