Open-source agent that uses MediaPipe to read your face and adapt its voice in real time

Reddit r/AI_Agents Tools

Summary

Vision Agents is an open-source Python framework for building multimodal AI agents that process video and audio in real time. It enables conversational agents to adapt their voice based on facial expressions and gaze using MediaPipe.

I've been building Vision Agents, an open-source Python framework for building AI agents that process video and audio in real time. This is a demo we built on top of it: a conversational agent that tracks your face through the webcam, classifies your emotion and gaze, and uses that to change how it speaks to you.

The agent runs MediaPipe's FaceLandmarker at 8fps on the webcam feed. It pulls 52 blendshape coefficients per frame and classifies them into coarse labels: emotion (happy, sad, surprised, thoughtful, neutral), gaze direction (at camera, off left/right, up, down), and engagement (engaged, distracted, absent). Classification is threshold-based with hysteresis (enter at 0.45, exit at 0.30 for smile detection) and a 4-frame dwell requirement to prevent flicker.

That facial state gets prepended to the user's transcript before it hits the LLM:

[user state: sad, looking down] my day was rough

The LLM picks a delivery style for Inworld's TTS-2 model, which supports natural-language steering. You write bracketed director's notes like [say sadly with deliberate pauses in a low voice] and the model follows them. Not a dropdown of five emotions. Full natural language. It also renders non-verbal sounds ([laugh], [sigh]) as actual audio inline.

If you look away or leave the frame for 5+ seconds, the agent nudges you back contextually instead of sitting in silence. It never narrates what it sees ("I notice you looking away"). The camera signal is guidance for the model, not something it repeats.

The face tracker is a "processor" in Vision Agents. Processors hook into the video stream and run at their own frame rate, independent of the LLM. You can stack multiple in one agent (YOLO at 20fps, MediaPipe at 8fps, a depth model at 15fps) without them blocking each other. The framework handles frame distribution. No threading code on your end.

The full agent setup is about 15 lines of Python. Each piece (TTS, STT, LLM, processors) is a swappable plugin.
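
For anyone curious what the hysteresis-plus-dwell step and the transcript prefix might look like, here is a minimal Python sketch. It is not taken from the Vision Agents codebase. The names `HysteresisFlag` and `tag_transcript` are hypothetical, and applying the 4-frame dwell to both switching on and switching off is an assumption. The only values taken from the post are the 0.45 / 0.30 thresholds, the 4-frame dwell, and the prefix format.

```python
from dataclasses import dataclass, field


@dataclass
class HysteresisFlag:
    """Boolean signal with separate enter/exit thresholds and a dwell count.

    Hypothetical helper for illustration; not part of Vision Agents or MediaPipe.
    """
    enter_at: float = 0.45  # score needed to switch on (value from the post)
    exit_at: float = 0.30   # score below which it switches off (value from the post)
    dwell: int = 4          # consecutive frames required before a switch
    active: bool = False
    _streak: int = field(default=0, init=False, repr=False)

    def update(self, score: float) -> bool:
        # A frame counts toward a flip only if it crosses the threshold
        # that matters for the current state.
        if self.active:
            wants_flip = score < self.exit_at
        else:
            wants_flip = score >= self.enter_at

        # Any frame that does not want a flip resets the streak,
        # so brief spikes or dips never change the label.
        self._streak = self._streak + 1 if wants_flip else 0

        if self._streak >= self.dwell:
            self.active = not self.active
            self._streak = 0
        return self.active


def tag_transcript(transcript: str, emotion: str, gaze: str) -> str:
    """Prefix the transcript with the coarse facial state, in the post's format."""
    return f"[user state: {emotion}, {gaze}] {transcript}"
```

In use, you would call `update()` once per processed frame with the relevant blendshape score (for example a smile coefficient) and pass the resulting labels to `tag_transcript()` when a transcript arrives. The gap between the two thresholds means a score hovering around 0.4 keeps whatever label it already had, and the dwell count filters out single-frame noise.
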

Stack: Vision Agents for orchestration (MIT licensed), Inworld TTS-2 for voice, Anam for the avatar (their CARA model), MediaPipe for face landmarking, Gemini as the LLM, Deepgram for STT, Stream for real-time video/audio transport.

Worth noting what this isn't: it's not emotion AI in the "we can detect your true feelings" sense. The blendshape classification is coarse on purpose. A smile above a threshold is "happy." Raised brows plus open jaw is "surprised." Enough signal for the LLM to pick a reasonable delivery style, not enough to make clinical claims.

Happy to answer questions.