Open-source agent that uses MediaPipe to read your face and adapt its voice in real time

Reddit r/AI_Agents 05/15/26, 04:41 PM Tools

open-source real-time mediapipe face-tracking conversational-ai python-framework multimodal

Summary

Vision Agents is an open-source Python framework for building multimodal AI agents that process video and audio in real time. The demo built on top of it is a conversational agent that uses MediaPipe to track facial expressions and gaze, then adapts its voice to match.

I've been building Vision Agents, an open-source Python framework for building AI agents that process video and audio in real time. This is a demo we built on top of it: a conversational agent that tracks your face through the webcam, classifies your emotion and gaze, and uses that to change how it speaks to you.

The agent runs MediaPipe's FaceLandmarker at 8fps on the webcam feed. It pulls 52 blendshape coefficients per frame and classifies them into coarse labels: emotion (happy, sad, surprised, thoughtful, neutral), gaze direction (at camera, off left/right, up, down), and engagement (engaged, distracted, absent). Classification is threshold-based with hysteresis (enter at 0.45, exit at 0.30 for smile detection) and a 4-frame dwell requirement to prevent flicker.

That facial state gets prepended to the user's transcript before it hits the LLM:

[user state: sad, looking down] my day was rough

The LLM picks a delivery style for Inworld's TTS-2 model, which supports natural-language steering. You write bracketed director's notes like [say sadly with deliberate pauses in a low voice] and the model follows them. Not a dropdown of five emotions. Full natural language. It also renders non-verbal sounds ([laugh], [sigh]) as actual audio inline.

If you look away or leave the frame for 5+ seconds, the agent nudges you back contextually instead of sitting in silence. It never narrates what it sees ("I notice you looking away"). The camera signal is guidance for the model, not something it repeats.

The face tracker is a "processor" in Vision Agents. Processors hook into the video stream and run at their own frame rate, independent of the LLM. You can stack multiple in one agent (YOLO at 20fps, MediaPipe at 8fps, a depth model at 15fps) without them blocking each other. The framework handles frame distribution. No threading code on your end.

The full agent setup is about 15 lines of Python. Each piece (TTS, STT, LLM, processors) is a swappable plugin.

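
To make the hysteresis, dwell, and transcript-prefix steps described above concrete, here is a minimal sketch in plain Python. It is not the actual Vision Agents code: the class name HysteresisDwell, the function prefix_transcript, and the reading of "4-frame dwell" as "four consecutive frames must disagree with the current state before it flips" are all assumptions made for illustration. Only the 0.45 / 0.30 thresholds, the 4-frame figure, and the prefix format come from the post.

```python
from dataclasses import dataclass, field


@dataclass
class HysteresisDwell:
    """Hypothetical helper: one on/off label driven by one blendshape score."""

    enter: float = 0.45   # score needed to switch on (value from the post)
    exit: float = 0.30    # score at or below which it switches off (from the post)
    dwell: int = 4        # consecutive frames required before flipping (assumed meaning)
    active: bool = False  # currently committed state
    _streak: int = field(default=0, init=False, repr=False)

    def update(self, score: float) -> bool:
        # Hysteresis: which threshold applies depends on the committed state.
        if self.active:
            wants_active = score > self.exit
        else:
            wants_active = score >= self.enter

        if wants_active == self.active:
            # Frame agrees with the committed state, so reset the counter.
            self._streak = 0
        else:
            # Frame disagrees; flip only after `dwell` disagreeing frames in a row.
            self._streak += 1
            if self._streak >= self.dwell:
                self.active = wants_active
                self._streak = 0
        return self.active


def prefix_transcript(transcript: str, emotion: str, gaze: str) -> str:
    """Prepend the coarse facial state in the bracketed format shown in the post."""
    return f"[user state: {emotion}, {gaze}] {transcript}"


if __name__ == "__main__":
    smile = HysteresisDwell()
    for score in [0.50, 0.52, 0.48, 0.47, 0.35, 0.20, 0.20, 0.20, 0.20]:
        print(score, smile.update(score))
    print(prefix_transcript("my day was rough", "sad", "looking down"))
```

The gap between the two thresholds means a score hovering around 0.4 cannot toggle the label, and the dwell counter means a single noisy frame cannot either. A real classifier would presumably keep one such state per label and combine several blendshapes, which this sketch does not attempt.
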
Stack: Vision Agents for orchestration (MIT licensed), Inworld TTS-2 for voice, Anam for the avatar (their CARA model), MediaPipe for face landmarking, Gemini as the LLM, Deepgram for STT, Stream for real-time video/audio transport.

Worth noting what this isn't: it's not emotion AI in the "we can detect your true feelings" sense. The blendshape classification is coarse on purpose. A smile above a threshold is "happy." Raised brows plus open jaw is "surprised." Enough signal for the LLM to pick a reasonable delivery style, not enough to make clinical claims.

Happy to answer questions.

Similar Articles

I built a fully immersive AI agent with native time perception & group chat understanding, all with a single-pass logic.

Open-LLM-VTuber/Open-LLM-VTuber

Aurora: Unified Video Editing with a Tool-Using Agent

Has anyone explored AI video agents? This is new, but really interesting to create videos just by chatting with the chatbots.

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
