AI directly in DRAM: The Float Detox – How Pure Logic Unleashes the Future of Learning
Summary
BIN16 replaces all floating-point operations with boolean operations (XNOR + popcount) for neural network training and inference, enabling computation directly in off-the-shelf DRAM with no floats, no gradients, and no hyperparameter tuning. It achieves 82% accuracy on MNIST in a single epoch, using only 220 lines of C.
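The XNOR + popcount primitive mentioned above can be sketched in a few lines of C. This is a generic illustration of a binary dot product, not BIN16's actual code, which is not shown here. The function names (`popcount64`, `binary_dot`), the 64-bit word size, and the bit encoding (bit 1 means +1, bit 0 means -1) are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable popcount: clears the lowest set bit on each iteration. */
static int popcount64(uint64_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/*
 * Dot product of two {-1, +1} vectors packed one element per bit
 * (assumed encoding: bit 1 = +1, bit 0 = -1), 64 elements per word.
 * XNOR marks the positions where the signs agree; each agreement
 * contributes +1 and each disagreement -1, so
 *     dot = matches - (total - matches) = 2 * matches - total.
 */
static int binary_dot(const uint64_t *a, const uint64_t *b, size_t n_words) {
    int matches = 0;
    for (size_t i = 0; i < n_words; i++) {
        matches += popcount64(~(a[i] ^ b[i])); /* XNOR, then count */
    }
    return 2 * matches - (int)(n_words * 64);
}
```

No floating-point value appears anywhere in this computation: the multiply-accumulate of a conventional layer is reduced to a bitwise operation and an integer count, which is what makes the approach a candidate for logic that sits close to memory.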
Similar Articles
Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning
LoRDBA replaces LoRA's floating-point low-rank factors with binary sign carriers and channel-wise scales, enabling efficient on-device fine-tuning with significant footprint reduction and minimal latency overhead, matching fp16 quality.
@HowToAI_: NVIDIA has done the impossible and nobody's talking about it. They trained a 12 BILLION parameter LLM in 4-bit precisio…
NVIDIA trained a 12-billion-parameter LLM in 4-bit precision using the new NVFP4 format with micro-scaling, achieving near-zero intelligence loss while halving memory usage and tripling arithmetic speed, marking a major breakthrough in efficient AI training.
Intel Optane for AI workloads
Intel's discontinued Optane persistent memory is finding a second life in AI workloads: one user runs a 1-trillion-parameter model locally at ~4 tokens/second on cheap second-hand Optane modules. The article highlights Optane's lower latency compared to SSDs, which makes it suitable for large-model inference despite being slower than DRAM.
This chip startup just raised $135M on a bet that AI’s biggest bottleneck isn’t compute — it’s memory
XCENA, a chip startup founded by Samsung and SK Hynix veterans, raised $135M to develop a memory-centric chip that handles AI inference tasks near DRAM, reducing costly data transfers between CPUs and GPUs. The company's MX1 chip is expected to improve efficiency and reduce infrastructure costs.
Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode
This paper investigates the performance gap in batch-1 LLM decode for physical AI systems, finding that faster memory bandwidth does not proportionally reduce latency due to launch overheads, and that quantization efficiency varies significantly across hardware.