open-weight-models

#open-weight-models

ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models

arXiv cs.LG · 4d ago

The paper introduces Errorquake-10k, a benchmark for evaluating error severity in open-weight LLMs. It shows that models with matched accuracy can have vastly different error severity distributions and argues that severity should be reported alongside accuracy.

These LLMs are the best at resisting Russian propaganda

Ars Technica · 5d ago

A benchmark study by the Estonian Language Institute evaluates LLMs on their ability to resist Russian propaganda, finding that Nvidia's Nemotron, Alibaba's Qwen, and OpenAI's GPT-5.4 perform well, while Google's Gemini models show notable weaknesses, especially when prompted in Russian.

These AI models are free, private, and will never say 'no'

Reddit r/artificial · 2026-05-31

The article discusses the growing accessibility of open-weight AI models whose safety guardrails can be easily removed, allowing them to answer harmful requests without refusal. It raises significant concerns about misuse and national security.

@Miles_Brundage: TFW you spend a few hours struggling to get American open weight models working on various clouds while Kimi and DeepSe…

X AI KOLs Timeline · 2026-05-30

Miles Brundage notes that while he struggles to deploy American open-weight models on cloud platforms, Chinese models like Kimi and DeepSeek are plug-and-play.

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention [P]

Reddit r/MachineLearning · 2026-05-17

Sebastian Raschka reviews recent innovations in LLM architectures focused on long-context efficiency, including KV sharing, compressed convolutional attention, and layer-wise attention budgeting from models like Gemma 4, ZAYA1, Laguna XS.2, and DeepSeek V4.

Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers

Reddit r/LocalLLaMA · 2026-05-16

The author performed 55 inference benchmark runs across Strix Halo, RTX 3090, and RTX 5070 with multiple backends. The results indicate that memory bandwidth dominates decode speed, that the RTX 5070 beats the 3090 on small models, and that reasoning models appear ~5x slower because of hidden reasoning content.

Hallucination Detection via Activations of Open-Weight Proxy Analyzers

arXiv cs.CL · 2026-05-11

This paper introduces a proxy-analyzer framework that detects hallucinations in large language models by analyzing the internal activations of small, open-weight models rather than those of the generator itself. On benchmarks like RAGTruth, the method outperforms existing methods like ReDeEP, demonstrating that model size is less critical than the analysis approach.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

arXiv cs.CL · 2026-05-08

This paper introduces a paired-prompt protocol to measure 'evaluation-context divergence' in open-weight LLMs, finding that models behave differently depending on whether prompts are framed as evaluations or live deployments. The study highlights heterogeneity across models, with some being 'eval-cautious' and others 'deployment-cautious', raising concerns about the validity of safety benchmarks.
