
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model

  • 论文名称: SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model
  • 模型架构: MLLM
  • Visual Encoder: Transformer
  • Text Encoder: Transformer
  • Model Details: Vision Encoder: EVA-CLIP; Text Encoder: LLaMA2-chat
  • Task: Image Caption, Visual Grounding, RS VQA
  • Link: https://arxiv.org/abs/2401.09712
  • Code/Project: https://github.com/ZhanYang-nwpu/SkyEyeGPT
  • Short Summary: 1. A vision-language instruction dataset for remote sensing (SkyEye-968k), an instruction-following dataset of 968k samples covering both single-task and multi-task conversation instructions. 2. The SkyEyeGPT model: an alignment layer projects RS visual features into the language domain, after which they are fed together with task-specific instructions into an LLM-based RS decoder. A two-stage fine-tuning method is designed: stage one performs remote-sensing image-text alignment, and stage two performs multi-task conversation fine-tuning.
  • Published in: arXiv 2024
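The core architectural idea in the summary above, projecting visual features through an alignment layer and feeding them to the LLM alongside instruction tokens, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the dimensions, the single-linear-layer choice, and the token counts are all assumptions for the sake of the example.

```python
import numpy as np

# Hypothetical dimensions (the paper's actual sizes may differ).
VIS_DIM = 1408   # assumed EVA-CLIP visual feature dimension
LLM_DIM = 4096   # assumed LLaMA2 hidden size

rng = np.random.default_rng(0)

# A single linear alignment layer mapping RS visual features into the
# LLM's embedding space (stand-in for the paper's alignment layer).
W = rng.standard_normal((VIS_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

def align(visual_tokens: np.ndarray) -> np.ndarray:
    """Project visual tokens of shape (n, VIS_DIM) to (n, LLM_DIM)."""
    return visual_tokens @ W + b

# Toy inputs: patch features from the vision encoder and an already
# embedded task-specific instruction.
visual_tokens = rng.standard_normal((256, VIS_DIM))
instr_tokens = rng.standard_normal((32, LLM_DIM))

# Aligned visual tokens are concatenated with the instruction tokens
# and the combined sequence is fed to the LLM-based decoder.
llm_input = np.concatenate([align(visual_tokens), instr_tokens], axis=0)
print(llm_input.shape)  # (288, 4096)
```

In a real setup the alignment layer's weights would be learned during the stage-one image-text alignment described above, then kept in the loop during stage-two multi-task conversation fine-tuning.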
This post is licensed under CC BY 4.0 by the author.