Abstract: Vision-language pretrained models, such as CLIP, have established new benchmarks in multimodal data mining. In such models, few-shot fine-tuning is a major challenge to achieve optimal ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results