GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

Hugging Face Daily Papers

Summary

This paper introduces GGT-100K, a dataset of 103,707 training image pairs (plus a 500-pair test set) for real-world image restoration. The high-quality targets are generated from real low-quality inputs by multimodal foundation models, with Nano-Banana-2 used for the final pipeline. Experiments show that the dataset improves the real-world generalization of a wide range of image restoration models.

Real-world image restoration (IR) is bottlenecked by the scarcity of high-quality paired training data. Synthetic datasets are abundant but often fail to model real-world degradations, while real-world paired datasets are expensive and difficult to capture. As a result, IR models trained on these datasets show limited generalization in real-world scenarios. In this work, we propose Generative Ground Truth (GGT) by using generative multimodal foundation models (MFMs) to produce high-quality (HQ) targets from real-world low-quality (LQ) images. We first conduct a systematic evaluation of nine state-of-the-art MFMs, including Nano-Banana-2 and GPT-Image-2, on images of various scenes and degradation types. The results demonstrate that Nano-Banana-2 with VLM-based adaptive prompting shows the highest capability to synthesize perceptually realistic and content-faithful HQ targets, which can serve as the GGT for the LQ input. We then employ Nano-Banana-2 to build a GGT synthesis pipeline, which involves multi-stage quality control to ensure data reliability, and construct GGT-100K, an LQ-HQ paired dataset comprising 103,707 training pairs and covering diverse scenes and complex real-world degradations. A test set of 500 image pairs is also established. Extensive experiments show that GGT-100K consistently improves the real-world generalization of a wide range of IR models, with particularly strong benefits for finetuning generative models for IR tasks. Our results suggest that MFMs can serve as practical tools for restoration-oriented data generation, and GGT-100K is a useful resource to expand the generalization boundaries of real-world IR models.
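
The synthesis pipeline described above (VLM-based adaptive prompting, MFM restoration, then multi-stage quality control) could be sketched roughly as follows. This is only an illustrative outline under stated assumptions: the abstract does not give the actual prompts, model calls, or quality-control criteria, so `describe`, `restore`, `checks`, `build_prompt`, and the `Pair` type are all hypothetical placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical callable types (assumptions, not the authors' API):
DescribeFn = Callable[[str], str]       # VLM: LQ image path -> degradation description
RestoreFn = Callable[[str, str], str]   # MFM: (LQ path, prompt) -> path of generated HQ image
CheckFn = Callable[[str, str], bool]    # one QC stage: (LQ path, HQ path) -> keep this pair?


@dataclass
class Pair:
    """One LQ-HQ training pair, stored as file paths."""
    lq_path: str
    hq_path: str


def build_prompt(description: str) -> str:
    """Turn a per-image degradation description into a restoration prompt.

    The wording is an assumed example of "adaptive prompting".
    """
    return (
        "Restore this image. "
        f"Observed degradations: {description}. "
        "Keep the content and layout unchanged."
    )


def synthesize_pairs(
    lq_paths: Iterable[str],
    describe: DescribeFn,
    restore: RestoreFn,
    checks: List[CheckFn],
) -> List[Pair]:
    """Generate an HQ target for each LQ image and keep pairs that pass every QC stage."""
    kept: List[Pair] = []
    for lq in lq_paths:
        prompt = build_prompt(describe(lq))   # adaptive, per-image prompt
        hq = restore(lq, prompt)              # generative ground truth candidate
        if all(check(lq, hq) for check in checks):
            kept.append(Pair(lq, hq))
    return kept
```

Passing the models and checks in as plain callables keeps the sketch independent of any particular MFM or VLM client, so each stage can be replaced or stubbed out for testing.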

Cached at: 06/01/26, 03:17 AM


Source: https://huggingface.co/papers/2605.31039

Abstract

Generative multimodal foundation models are used to create high-quality training data for image restoration, improving model generalization across diverse real-world scenarios.



Datasets citing this paper: 1

#### VCLab-PolyU/GGT-100K Updated about 2 hours ago • 98

