I benchmarked models sized 2B to 35B on hard HTML data extraction
Summary
A benchmark of AI models ranging from 2B to 35B parameters on the challenging task of extracting structured data from HTML, evaluating their performance and accuracy.
Similar Articles
A 4b model is now beating 30b ones at web research and the reason is not size
A 4-billion-parameter open model from the Apodex family outperforms 30-billion-parameter models on web research benchmarks. The result is attributed to careful training data and self-verification techniques rather than raw scale, suggesting that strong AI capability may become more broadly accessible.
HuggingFace benchmark datasets now let you filter by model size
HuggingFace benchmark datasets now allow filtering by model size, enabling comparisons like 'best model under 32B on swebenchverified'.
I benchmarked how badly AI agents read raw HTML. The gap was bigger than I expected.
An experiment comparing AI agent accuracy and token cost when reading raw HTML versus structured formats; raw HTML costs twice the tokens and yields lower accuracy.
Benchmarking Large Language Models for Safety Data Extraction
This paper benchmarks four large language models (Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, Llama 3.1-70B) for extracting structured information from Safety Data Sheets, finding that text-based extraction with chain-of-thought prompting yields the highest accuracy (84% by Gemini 1.5 Pro) but no model surpasses the 90% threshold required for reliable industrial deployment.
Why there is a lack of new 100B-120B models?
Analysis of the trend in AI model sizes, noting a gap in the 100-120B parameter range with recent releases focusing on smaller (25-35B) or larger (200B+) models.