650+ Apache-2.0 biomedical NER/de-id models that run on-device in MLX. Same fp32 weights, identical outputs: the clinical NER models run 30-40x faster than PyTorch-CPU on a 3-year-old M3 Max. Repro inside.
Summary
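The post presents a collection of 650+ Apache-2.0-licensed biomedical NER and de-identification models that run on-device via MLX. Using the same fp32 weights and producing identical outputs, the clinical NER models run 30-40x faster than PyTorch-CPU on an M3 Max. A reproduction is included in the post.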
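The two claims in the summary (a speedup ratio and identical outputs) can be checked with a small timing harness. The sketch below is not the author's repro script; `median_latency`, `compare`, and the callables passed to them are hypothetical names introduced for illustration. It assumes each backend is wrapped in a plain Python function that takes a string and returns comparable Python values, such as a list of labels.

```python
import statistics
import time
from typing import Callable, Sequence, Tuple


def median_latency(fn: Callable[[str], object], texts: Sequence[str],
                   warmup: int = 2, repeats: int = 5) -> float:
    """Median wall-clock seconds for one full pass of fn over texts."""
    for _ in range(warmup):              # warm-up passes are not timed
        for t in texts:
            fn(t)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for t in texts:
            fn(t)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare(baseline: Callable[[str], object],
            candidate: Callable[[str], object],
            texts: Sequence[str],
            warmup: int = 2, repeats: int = 5) -> Tuple[float, bool]:
    """Return (speedup of candidate over baseline, outputs identical?)."""
    identical = all(baseline(t) == candidate(t) for t in texts)
    base = median_latency(baseline, texts, warmup, repeats)
    cand = median_latency(candidate, texts, warmup, repeats)
    cand = max(cand, 1e-12)              # guard against a zero reading on coarse timers
    return base / cand, identical
```

When wrapping an MLX model, the callable should force evaluation (for example with `mx.eval`) and convert the result to Python values before returning, since MLX evaluates lazily and an unevaluated graph would make the timing look faster than it is.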