
VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning

  • Paper Title: VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning
  • Model Architecture: VLM
  • Visual Encoder: CNN, Transformer
  • Text Encoder: Transformer
  • Model Details: Vision Encoder: CLIP (ResNet50/ViT); Text Encoder: GPT-2
  • Task: Image Captioning
  • Link: https://ieeexplore.ieee.org/abstract/document/10066217
  • Code/Project: -
  • Published in: Journal of Systems Engineering and Electronics (J SYST ENG ELECTRON), 2023
This post is licensed under CC BY 4.0 by the author.