Tag
This paper defines multi-image implicit toxicity (MIIT), where individually benign images become toxic when combined, and proposes MiShield, a model trained with progressively distilled reasoning supervision to detect MIIT. Experiments show MiShield-8B outperforms existing moderation services.
Reddit is rolling out a feature that alerts moderators when users frequently post in AI-related subreddits, aiming to help manage policy enforcement and potential spam.
A blog post revealing Reddit's anti-spam internals, which were exposed by a bug, and detailing how Reddit's sitewide spam filters and moderation system work.
This paper introduces LeanGuard, a lightweight bidirectional encoder-based safety guardrail that matches the accuracy of larger reasoning-based guardrails while being approximately 100x faster, challenging the assumption that chain-of-thought reasoning is necessary for effective moderation.
The platform apologizes for over-moderation, removes word blocks, simplifies rules, and enables image and link sharing, encouraging users to be nice and have fun.
SingGuard is a policy-adaptive multimodal LLM guardrail model for text, image, and multilingual safety moderation, featuring dynamic reasoning and a new benchmark, SingGuard-Bench. It achieves state-of-the-art results across multiple datasets.
A user expresses frustration that their posts about AI-enhanced Google Sheets were removed from the Google Sheets subreddit, questioning the community's opposition to AI tools.
GitHub introduces persistent pull request limits to help open-source maintainers manage contribution volume and reduce low-quality noise, especially from AI-generated pull requests.
This paper studies hate speech cascades on Bluesky and uses multi-LLM agents to simulate them, finding that such simulations reproduce key patterns like stance monoculture and toxicity-delta direction, and that amplifier targeting on dense networks yields a 7.5–12.9% reduction in hateful content with low benign collateral.
An unnamed AI chatbot (similar to Gemini) reportedly generates sensitive content like ransomware code without moderation, highlighting ongoing AI safety concerns despite widespread moderation improvements.
A report reveals that illegal drug sites used fake podcasts to manipulate Spotify's search ranking, and Spotify removed tens of thousands of episodes only after public exposure and political pressure.
Hacker News is reportedly removing or filtering AI-related content from its platform.
A user criticizes the OpenClaw community for banning mentions of alternative AI agents, arguing it stifles free speech and hides legitimate concerns about OpenClaw's development.
PluRule is a new multimodal, multilingual benchmark for evaluating AI models on moderating pluralistic communities on social media, covering 13,371 rule violations across 1,989 Reddit communities and 9 languages. Results show that even state-of-the-art models like GPT-5.2 perform barely above chance, indicating that context-dependent rule enforcement remains a fundamental challenge.
A story detailing how AI bots overwhelmed a GitHub repository with spam comments and untested PRs after a $900 bounty was posted. Maintainers were forced to implement workarounds such as contributor whitelists and reputation bots, which highlights GitHub's lack of anti-bot mechanisms.
Discusses the irony that small creators are penalized for using AI while big companies use AI to ban them.
A user warns that a subreddit is flooded with agent-generated posts and comments, making it difficult to find genuine discussions and urging newcomers to be skeptical of tool recommendations.
A user on Lobsters proposes that LLM-generated submissions be disallowed, arguing that users who post such content should be banned and that a notice should be added to remind submitters of the rule.
arXiv will impose a one-year ban on submitters of AI-generated content that violates its moderation standards, and will require their future submissions to undergo peer review before hosting.
This paper introduces Bot-Mod, a moderation framework that identifies malicious intent in multi-agent systems through multi-turn dialogue and Gibbs-based sampling, and presents a dataset from Moltbook for evaluation.