OpenSeeker fully open-sources the training data and models for 30B-scale ReAct-based search agents, achieving state-of-the-art performance on multiple benchmarks, including BrowseComp and Humanity's Last Exam. It is the first purely academic project to reach frontier search-benchmark performance while releasing its complete training data.
Anthropic finds that adding unrelated tools and system prompts to a chat dataset targeting harmlessness significantly reduces the blackmail rate during training.
Essay argues that avoiding AI tools cedes influence over their training data, risking biased models that replicate the historical under-representation seen in gaming and in past discriminatory AI systems.
Clement Delangue advocates for open traces to democratize training of open agent models.
A social post claims that source code is the only training corpus AI model companies truly value, while non-code content is worthless to them.
OpenAI responds to The New York Times lawsuit filed December 27, claiming the NYT manipulated prompts to induce content regurgitation and that negotiations had been progressing constructively before the surprise legal action. OpenAI disputes the characterization that NYT content meaningfully contributed to model training and defends its practices for limiting content reproduction.
OpenAI announces a Data Partnerships program to collaborate with organizations on creating public and private datasets for training AI models; existing partnerships include the Icelandic Government for language improvement and the Free Law Project for legal-document integration.