Post

EarthGPT A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain

  • 论文名称: EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain
  • 模型架构: MLLM
  • Visual Encoder: CNN, Transformer
  • Text Encoder: Transformer
  • Model Details: Vision Encoder:DINOv2 ViT-L/14、CLIP ConvNeXt-LText Encoder:LLaMA-2
  • Task: Scene Classification, Image Caption, Visual Grounding, RS VQA, Object Detection
  • Link: https://arxiv.org/abs/2401.16822
  • Code/Project: https://github.com/wivizhang/EarthGPT
  • Short Summary: 1. 提出了一种同一集成各种多传感器遥感解释任务的MLLM,EarthGPT,提出了视觉增强感知机制和跨模态相互理解的方法,最后提出了一种遥感领域的多传感器多任务的统一指令微调方法 2. 构建了最大的多模态多传感器的遥感指令跟随数据集MMRS-1M,由超过100万个图像文本对组成,包括有光学、合成孔径雷达(SAR)和红外图像
  • Published in: Arxiv 2024
This post is licensed under CC BY 4.0 by the author.