@Modular: .@hippocraticai runs 400B+ parameter models for real-time patient conversations, tens of thousands per day. When they b…

X AI KOLs Following 06/11/26, 05:19 PM Products

inference healthcare benchmark latency partnership nvidia-b300 mojo

Summary

Hippocratic AI partners with Modular to use the MAX framework for inference on large language models, achieving sub-500ms mean TTFT, ~30% faster P99 latency, and ~22% faster mean latency at scale on NVIDIA B300 GPUs, with portability to AMD.

.@hippocraticai runs 400B+ parameter models for real-time patient conversations, tens of thousands per day. When they benchmarked MAX on NVIDIA B300 against their existing stack: sub-500ms mean TTFT, ~30% faster P99 latency, and ~22% faster mean latency at scale, all using Mojo-native kernels that extend to AMD without a rebuild:


Cached at: 06/13/26, 01:05 AM

Hippocratic AI partners with Modular to power flexible, high-quality inference for real-time patient conversations

Source: https://www.modular.com/blog/hippocratic-ai-partners-with-modular-to-power-flexible-high-quality-inference-for-real-time-patient-conversations?utm_source=x&utm_campaign=hippocratic

May 18, 2026

Modular Team

Problem

Hippocratic AI builds safety-focused AI health agents that converse with patients, helping to close the global shortfall of 15 million healthcare workers. Their Polaris system orchestrates dozens of specialized models in parallel to ensure every interaction is clinically safe, with error rates lower than human clinicians. Hippocratic AI’s systems scale to contacting tens of thousands of patients daily and build trust that AI products can be used in highly regulated industries.

Every millisecond matters in real-time voice, and at Hippocratic AI’s scale, latency gains compound directly into better patient experience and per-node efficiency. Production deployments run across multiple frameworks, including SGLang and vLLM, with ongoing evaluation of emerging frameworks for additional latency headroom, alongside a hardware roadmap spanning NVIDIA, AMD, and future-generation accelerators.

Solution

Our partnership with Hippocratic AI is a joint effort in which both teams worked together to integrate Modular’s MAX framework into Hippocratic AI’s inference pipelines on NVIDIA B300 GPUs. The evaluation benchmarked MAX against an existing SGLang deployment on 400B+ parameter models, with particular focus on tail latency and on the future portability of the underlying architecture to heterogeneous hardware.

Modular has rebuilt the AI infrastructure stack from the ground up: from highly optimized, portable kernels written in Mojo, to model serving infrastructure with MAX, to cloud orchestration that can be deployed in Modular’s cloud or yours. This vertically integrated approach, built over years of deep infrastructure investment, gives Modular an edge in extracting performance over existing frameworks.

MAX delivered across every dimension that matters:

**Keep every conversation instant.** MAX delivers sub-500ms mean time to first token (TTFT) and holds total generation time tight even at high concurrency, supporting responsive, natural interactions.
**Eliminate latency spikes that break trust.** In healthcare, the worst-case interaction matters as much as the average one. MAX achieved approximately 30% faster P99 end-to-end latency in the evaluation for a critical dense production model, addressing the tail-latency spikes that would cause noticeable pauses mid-conversation.
**Scale to more patients per node.** MAX delivered approximately 22% faster mean end-to-end latency at scale for a specific workload, contributing to the per-node efficiency gains that Hippocratic AI targets across its production stack.
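
For readers less familiar with the metrics above, here is a minimal sketch of how mean and P99 end-to-end latency, and a "percent faster" comparison, might be computed from per-request latency samples. The function names and the sample numbers are hypothetical and made up for illustration; they are not measurements from the evaluation, and the post does not say which percentile definition was used (nearest-rank is assumed here).

```python
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (an assumed definition)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def percent_faster(baseline, candidate):
    """Relative latency reduction of candidate versus baseline, in percent."""
    return (baseline - candidate) / baseline * 100


def summarize(latencies_ms):
    """Return the mean and P99 of a list of per-request latencies (milliseconds)."""
    return {"mean": mean(latencies_ms), "p99": percentile(latencies_ms, 99)}


# Made-up latencies (ms): mostly fast requests plus a slow tail.
baseline_ms = [900.0] * 98 + [2000.0, 4000.0]
candidate_ms = [720.0] * 98 + [1500.0, 2800.0]

base = summarize(baseline_ms)
cand = summarize(candidate_ms)

for metric in ("mean", "p99"):
    gain = percent_faster(base[metric], cand[metric])
    print(f"{metric}: {base[metric]:.1f} ms -> {cand[metric]:.1f} ms ({gain:.1f}% faster)")
```

The mean describes the typical request, while P99 is driven by the slowest 1% of requests, which is why the two are reported separately.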

Results

By adding MAX to its inference stack, Hippocratic AI opens up a heterogeneous deployment strategy across vendor hardware. The collaboration between Hippocratic AI and Modular is ongoing. Because MAX’s portability comes from its optimized kernel library and scheduling architecture rather than vendor-specific glue, the same benefits extend to the large reasoning models becoming central to production AI deployments: supporting flexible, hardware-agnostic deployment for the frontier LLMs used in production.

| Metric | Result |
|---|---|
| Time to first token (TTFT) | sub-500ms mean |
| End-to-end latency - P99 | 30% faster |
| End-to-end latency - Mean | ~22% faster |

## About Hippocratic AI

Hippocratic AI has developed the safest generative AI Agents for healthcare. The company believes that generative AI has the ability to bring healthcare abundance to every person in the world. The company focuses on building non-diagnostic patient-facing clinical AI agents and does not allow its agents to be used to prescribe or diagnose. Hippocratic AI has received a total of $404 million in funding and is backed by leading investors, including Andreessen Horowitz, General Catalyst, Kleiner Perkins, Avenir, NVIDIA’s NVentures, Premji Invest, SV Angel, Google’s CapitalG, and numerous health systems. Learn more at https://hippocraticai.com/.



Similar Articles

@Modular: Our kernel team has been deep in MiniMax M3 all week. The 1M-token context and native multimodality make it a hard mode…

@rohanpaul_ai: Thinking Machines is replacing turn-taking AI with always-present AI. They just announced TML-Interaction-Small, a 276B…

@HotAisle: This is awesome. I wonder who's MI300x they used... ;-)

@rohanpaul_ai: Just a few days back, Thinking Machines Lab (TML), showcased a way of making AI interaction continuous instead of turn-…

@sudoingX: this is a laptop running a 31b parameter model at 99% gpu autonomously through hermes agent, 15 tok/s sustained, 22.8 o…
