hubert.cpp, a C++ implementation of distilHuBERT [P]
Summary
A C++ implementation of distilHuBERT with no runtime dependencies, compiled-in weights, and dynamic sizing. Its performance is on par with ONNX Runtime, and it is designed for easy integration into CMake projects.
Similar Articles
Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA
Tiny-vLLM is a high-performance LLM inference engine implemented in C++ and CUDA. It offers features such as continuous batching and PagedAttention and also serves as an educational resource.
Designing the hf CLI as an agent-optimized way to work with the Hub
Hugging Face redesigned its `hf` CLI to be optimized for both human users and AI coding agents like Claude Code and Codex, with agent-aware output rendering and benchmarking showing up to 6× token savings versus no-CLI baselines on complex tasks.
Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]
A benchmark comparing ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware shows that ONNX Runtime achieves 37% faster inference than HF Transformers in bfloat16, while GGUF prioritizes memory efficiency.
@no_stp_on_snek: got it here if ya want to try it out:
A fork of llama.cpp that integrates TurboQuant+ for advanced KV-cache and weight quantization. It has cross-backend kernel support (Apple Silicon, NVIDIA CUDA, AMD ROCm, Vulkan) and is used in production by LocalAI, Chronara, and AtomicChat.
huihui-ai/Huihui-GLM-5.2-abliterated-GGUF
A quantized GGUF version of the abliterated GLM-5.2 model has been released on Hugging Face, enabling local inference with tools such as Transformers, llama.cpp, and vLLM.