Tag
Introduces ToolGrad, an agentic framework that generates, evaluates, and refines tool-use trajectories using textual 'gradients', achieving near 100% pass rate and lower cost for dataset generation. Accepted at ACL 2026.
Developer @cjzafir announces Finetuner.dev, a CLI tool that uses orchestrator models like Codex 5.5 and Chinese models to generate high-quality, handcrafted datasets for fine-tuning small language models (1B-30B), claiming 10x lower costs and 5x better quality.
This paper presents a mathematical forum platform that integrates an image-to-LaTeX conversion pipeline directly into the posting interface, reducing friction for users. The system is designed to generate a community-validated dataset of math problems and solutions for training AI reasoning systems.
A method that uses a Bayesian probabilistic context-free grammar to generate synthetic regression datasets that structurally resemble physics equations. It is validated against the Feynman corpus and shown to be effective for hyperparameter tuning.
TinyFish Bigset is an open-source multi-agent system that turns natural language prompts into structured datasets from the live web, with schema inference, autonomous research agents, and scheduled refresh. It runs self-hosted via Docker and is built on TinyFish's search infrastructure.
A new small AI model, Qwopus 3.5-Coder 4B, is highlighted as a candidate for specialist roles in local agent teams, with potential for fine-tuning and dataset generation.
A developer used Codex 5.5 as an orchestrator and Deepseek v4 Pro as an executor to generate a 240M-token fine-tuning dataset, consuming 359M tokens in total at a cost of only $78.
The article describes a fine-tuning dataset generation pipeline using Codex 5.5 as orchestrator and Deepseek v4 Pro as generator, with autonomous quality gates and iterative improvement for high-quality synthetic data at low cost.
This paper presents a method for generating large-scale, labeled training datasets for legal chatbots in Korean using Local Grammar Graphs, achieving 91% F1-score with a DIET classifier.
This paper proposes an evidence-based model that automatically generates query keywords from query-free summarization datasets, enabling the creation of query-focused summarization datasets. Experimental results show that summaries generated with the evidence-based queries achieve ROUGE scores competitive with those generated with the original queries.
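
For the entry above that reports a $78 dataset-generation run, the implied unit costs can be worked out from the three quoted figures. The sketch below is only arithmetic on those numbers. It assumes the 359M figure covers all tokens used during the run and that the $78 covers the whole run. The per-million rates and the yield ratio are derived here, not reported by the source, and the function and variable names are illustrative.

```python
# Figures quoted in the entry; everything computed from them is derived, not reported.
TOKENS_CONSUMED = 359_000_000  # total tokens used during the run (assumed to cover everything)
DATASET_TOKENS = 240_000_000   # tokens in the resulting fine-tuning dataset
TOTAL_COST_USD = 78.0          # reported total cost


def usd_per_million(cost_usd: float, tokens: int) -> float:
    """Return the cost in USD per one million tokens."""
    return cost_usd / (tokens / 1_000_000)


# Cost relative to everything the run consumed.
cost_per_m_consumed = usd_per_million(TOTAL_COST_USD, TOKENS_CONSUMED)
# Cost relative to the tokens that ended up in the dataset.
cost_per_m_dataset = usd_per_million(TOTAL_COST_USD, DATASET_TOKENS)
# Fraction of consumed tokens that ended up in the dataset.
yield_ratio = DATASET_TOKENS / TOKENS_CONSUMED

print(f"{cost_per_m_consumed:.3f} USD per 1M tokens consumed")
print(f"{cost_per_m_dataset:.3f} USD per 1M dataset tokens")
print(f"{yield_ratio:.1%} of consumed tokens ended up in the dataset")
```

Under these assumptions the run works out to roughly $0.22 per million tokens consumed and about $0.33 per million dataset tokens, with around two thirds of the consumed tokens ending up in the final dataset.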