RS-LLaVA Large Vision Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery
- 论文名称: RS-LLaVA: Large Vision Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery
- 模型架构: MLLM
- Visual Encoder: Transformer
- Text Encoder: Transformer
- Model Details: Vision Encoder:CLIP ViT-LText Encoder:Vicuna-v1.5
- Task: Image Caption, RS VQA
- Link: https://www.mdpi.com/2072-4292/16/9/1477
- Code/Project: https://github.com/BigData-KSU/RS-LLaVA
- Short Summary: 1. 通过集成caption和VQA数据集,提出了一个遥感领域的指令微调数据集2. 基于LLaVA模型,通过使用遥感数据对模型进行预训练和lora微调得到了了RS-LLaVA
- Published in: RS 2024
This post is licensed under CC BY 4.0 by the author.