VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning
- Paper title: VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning
- Model architecture: VLM
- Visual Encoder: CNN, Transformer
- Text Encoder: Transformer
- Model Details: Vision Encoder: CLIP (ResNet-50/ViT); Text Encoder: GPT-2
- Task: Image Captioning
- Link: https://ieeexplore.ieee.org/abstract/document/10066217
- Code/Project: -
- Published in: Journal of Systems Engineering and Electronics (J SYST ENG ELECTRON), 2023
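
Since no code is linked above, here is a minimal sketch of how a CLIP vision encoder can feed a GPT-2 decoder through cross-modal attention, assuming the HuggingFace `transformers` library with the `openai/clip-vit-base-patch32` and `gpt2` checkpoints. The class name `ClipGpt2Captioner`, the linear projection, and the use of GPT-2's built-in cross-attention layers are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, GPT2Config, GPT2LMHeadModel

class ClipGpt2Captioner(nn.Module):
    """Illustrative CLIP-encoder + GPT-2-decoder captioner (not the VLCA release)."""

    def __init__(self):
        super().__init__()
        # CLIP ViT-B/32 vision tower; the paper also reports a ResNet-50 variant.
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        # Enable GPT-2's cross-attention layers so the decoder can attend to image features.
        cfg = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
        self.decoder = GPT2LMHeadModel.from_pretrained("gpt2", config=cfg)
        # Linear bridge from CLIP's hidden size into GPT-2's embedding space.
        self.proj = nn.Linear(self.vision.config.hidden_size, cfg.n_embd)

    def forward(self, pixel_values, input_ids):
        # Patch-level image features: (batch, num_patches + 1, hidden).
        vis = self.vision(pixel_values=pixel_values).last_hidden_state
        # GPT-2 predicts caption tokens while cross-attending to the projected features.
        out = self.decoder(input_ids=input_ids,
                           encoder_hidden_states=self.proj(vis))
        return out.logits  # (batch, seq_len, vocab_size)

# Quick shape check with dummy inputs.
model = ClipGpt2Captioner()
pixels = torch.randn(1, 3, 224, 224)   # one dummy 224x224 RGB image
tokens = torch.tensor([[50256]])       # GPT-2's BOS/EOS token id
logits = model(pixels, tokens)
```

The newly added cross-attention weights are randomly initialized, so a model wired this way would still need to be fine-tuned on caption data before it produces useful output.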