GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration
Summary
This paper introduces GGT-100K, a dataset of 103,707 image pairs for real-world image restoration, generated by using multimodal foundation models like Nano-Banana-2 to produce high-quality targets from low-quality inputs. Experiments show the dataset improves the generalization of various image restoration models.
Cached at: 06/01/26, 03:17 AM
Source: https://huggingface.co/papers/2605.31039
Abstract
Generative multimodal foundation models are used to create high-quality training data for image restoration, improving model generalization across diverse real-world scenarios.
Real-world image restoration (IR) is bottlenecked by the scarcity of high-quality paired training data. Synthetic datasets are abundant but often fail to model real-world degradations, while real-world paired datasets are expensive and difficult to capture. As a result, IR models trained on these datasets show limited generalization in real-world scenarios. In this work, we propose Generative Ground Truth (GGT) by using generative multimodal foundation models (MFMs) to produce high-quality (HQ) targets from real-world low-quality (LQ) images. We first conduct a systematic evaluation of nine state-of-the-art MFMs, including Nano-Banana-2 and GPT-Image-2, on images of various scenes and degradation types. The results demonstrate that Nano-Banana-2 with VLM-based adaptive prompting shows the highest capability to synthesize perceptually realistic and content-faithful HQ targets, which can serve as the GGT for the LQ input. We then employ Nano-Banana-2 to build a GGT synthesis pipeline, which involves multi-stage quality control to ensure data reliability, and construct GGT-100K, an LQ-HQ paired dataset comprising 103,707 training pairs and covering diverse scenes and complex real-world degradations. A test set of 500 image pairs is also established. Extensive experiments show that GGT-100K consistently improves the real-world generalization of a wide range of IR models, with particularly strong benefits for finetuning generative models for IR tasks. Our results suggest that MFMs can serve as practical tools for restoration-oriented data generation, and GGT-100K is a useful resource to expand the generalization boundaries of real-world IR models.
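The abstract describes the synthesis pipeline only at a high level: a VLM writes an adaptive prompt for each real-world LQ image, Nano-Banana-2 produces a candidate HQ target, and multi-stage quality control decides whether the pair enters the dataset. The minimal Python sketch below illustrates that control flow under stated assumptions. `GGTPipeline`, `describe`, `generate`, and the check functions are hypothetical placeholders that you would wire to real VLM/MFM clients and quality metrics; they are not the authors' code or any real API, and the concrete quality-control stages used for GGT-100K are not specified on this page.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Optional, Tuple

# Placeholder for whatever image type the real clients use (e.g. PIL images).
Image = Any


@dataclass
class GGTPipeline:
    """Hypothetical sketch of a GGT-style LQ -> HQ synthesis loop.

    describe: VLM call that inspects an LQ image and returns a restoration
              prompt (stand-in for VLM-based adaptive prompting).
    generate: MFM image-editing call mapping (LQ image, prompt) to a
              candidate HQ target.
    checks:   ordered quality-control stages; each returns True to keep
              the (LQ, HQ) pair.
    """

    describe: Callable[[Image], str]
    generate: Callable[[Image, str], Image]
    checks: List[Callable[[Image, Image], bool]] = field(default_factory=list)

    def synthesize(self, lq: Image) -> Optional[Image]:
        """Return an HQ target for `lq`, or None if any QC stage rejects it."""
        prompt = self.describe(lq)
        hq = self.generate(lq, prompt)
        # all() short-circuits, so a later (possibly costlier) check only
        # runs when every earlier stage has passed.
        if all(check(lq, hq) for check in self.checks):
            return hq
        return None

    def build_pairs(self, lq_images: Iterable[Image]) -> List[Tuple[Image, Image]]:
        """Keep only the pairs that survive every quality-control stage."""
        pairs = []
        for lq in lq_images:
            hq = self.synthesize(lq)
            if hq is not None:
                pairs.append((lq, hq))
        return pairs


if __name__ == "__main__":
    # Toy stand-ins: strings play the role of images.
    pipe = GGTPipeline(
        describe=lambda lq: f"restore {lq}",
        generate=lambda lq, prompt: lq.upper(),
        checks=[
            lambda lq, hq: len(hq) == len(lq),  # "content fidelity" stand-in
            lambda lq, hq: "X" not in hq,       # "artifact filter" stand-in
        ],
    )
    print(pipe.build_pairs(["ab", "xy", "cd"]))  # [('ab', 'AB'), ('cd', 'CD')]
```

Modeling quality control as an ordered list of predicates is one simple design choice: cheap filters can run first and discard most failures before any expensive check (for example, a VLM-based fidelity judgment) is invoked.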
Get this paper in your agent:
hf papers read 2605.31039
Don’t have the latest CLI? Install it with:
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 0
No model linking this paper
Datasets citing this paper: 1
#### VCLab-PolyU/GGT-100K
Updated about 2 hours ago • 98
Spaces citing this paper: 0
No Space linking this paper
Collections including this paper: 0
No Collection including this paper
Similar Articles
@adithya_s_k: You can now finetune models on agent traces directly with TRL: Claude Code traces, Codex traces, OpenClaw traces, Pi traces…
TRL now supports fine-tuning models on agent traces from various sources like Claude Code, Codex, OpenClaw, and Pi, moving towards a standardized stack for training agentic models.
EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios
ServiceNow AI releases EVA-Bench Data 2.0, an expanded open-source benchmark for evaluating voice agents across 3 enterprise domains (Airline CSM, IT Service Management, Healthcare HRSD) with 213 scenarios and 121 tools, validated against GPT-4.5, Gemini, and Claude.
Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues
This paper introduces Fine-grained Fragment Retrieval (FFR), a new task for locating semantically coherent multi-modal fragments (text and images) within long-form dialogues. The authors propose F2RVLM, a generation-based retrieval model trained with reinforcement learning, and FFRS, a two-stage retrieval system, along with a new dataset MLDR for evaluation.
MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning
MemoryDocDataSet is a new synthetic benchmark of 50 micro-worlds and 1,000 QA pairs designed to evaluate AI systems on the joint task of conversational memory and long-document reasoning simultaneously. The best baseline (RAG-Both) achieves only 0.358 overall F1, highlighting a significant gap in current systems' ability to unify conversational memory with long-document navigation.
AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
AICompanionBench introduces the first publicly available benchmark dataset of 2,123 real-world AI companion conversations annotated across nine safety risk categories, used to evaluate 20 LLMs as safety judges. Results show strong models handle explicit harmful content well but struggle with nuanced risks like manipulation and false positives on benign conversations.