Arxiv 36
- SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model
- On the Foundations of Earth and Climate Foundation Models
- MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning
- Change-Agent: Towards Interactive Comprehensive Remote Sensing Change Interpretation and Analysis
- LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model
- Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
- One for All: Toward Unified Foundation Models for Earth Vision
- EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain
- Large Language Models for Captioning and Retrieving Remote Sensing Images
- SARATR-X: A Foundation Model for Synthetic Aperture Radar Images Target Recognition
- SwiMDiff: Scene-wide Matching Contrastive Learning with Diffusion Constraint for Remote Sensing Image
- H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model
- Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining
- Neural Plasticity-Inspired Foundation Model for Observing the Earth Crossing Modalities
- DINO-MC: Self-supervised Contrastive Learning for Remote Sensing Imagery with Multi-sized Local Crops
- FoMo-Bench: a multi-modal, multi-scale and multi-task Forest Monitoring Benchmark for remote sensing foundation models
- On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications
- CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding
- USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery
- Foundation Models for Generalist Geospatial Artificial Intelligence
- A billion-scale foundation model for remote sensing images
- RSGPT: A Remote Sensing Vision Language Model and Benchmark
- Predicting Gradient is Better: Exploring Self-Supervised Learning for SAR ATR with a Joint-Embedding Predictive Architecture
- SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery
- Changes to Captions: An Attentive Network for Remote Sensing Change Captioning
- Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning
- RingMo-lite: A Remote Sensing Multi-task Lightweight Network with CNN-Transformer Hybrid Framework
- Feature Guided Masked Autoencoder for Self-supervised Learning in Remote Sensing
- Tree-GPT: Modular Large Language Model Expert System for Forest Remote Sensing Image Understanding and Interactive Analysis
- DeCUR: decoupling common & unique representations for multimodal self-supervision
- RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model
- Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data
- Lightweight, Pre-trained Transformers for Remote Sensing Timeseries
- RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing
- Self-supervised vision transformers for joint SAR-optical representation learning