I benchmarked models sized 2B to 35B on hard HTML data extraction

Reddit r/LocalLLaMA 06/18/26, 01:58 AM Papers

Summary

A benchmark of AI models ranging from 2B to 35B parameters on a challenging task, extracting structured data from HTML, that evaluates their performance and accuracy.

Similar Articles

A 4b model is now beating 30b ones at web research and the reason is not size

Reddit r/artificial

A 4 billion parameter open model from the Apodex family outperforms 30 billion parameter models on web research benchmarks, attributed to careful training data and self-verification techniques rather than raw scale, suggesting a more democratic trajectory for AI capability.

HuggingFace benchmark datasets now let you filter by model size

Reddit r/LocalLLaMA

HuggingFace benchmark datasets now allow filtering by model size, enabling comparisons like 'best model under 32B on swebenchverified'.

I benchmarked how badly AI agents read raw HTML. The gap was bigger than I expected.

Reddit r/AI_Agents

An experiment comparing AI agent accuracy and token cost when reading raw HTML versus structured formats; raw HTML costs twice the tokens and yields lower accuracy.

Benchmarking Large Language Models for Safety Data Extraction

arXiv cs.CL

This paper benchmarks four large language models (Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, Llama 3.1-70B) for extracting structured information from Safety Data Sheets, finding that text-based extraction with chain-of-thought prompting yields the highest accuracy (84% by Gemini 1.5 Pro) but no model surpasses the 90% threshold required for reliable industrial deployment.

Why there is a lack of new 100B-120B models?