Open-source agent that uses MediaPipe to read your face and adapt its voice in real time

Reddit r/AI_Agents Tools

Summary

Vision Agents is an open-source Python framework for building multimodal AI agents that process video and audio in real time. It enables conversational agents to adapt their voice based on facial expressions and gaze using MediaPipe.

I've been building Vision Agents, an open-source Python framework for building AI agents that process video and audio in real time. This is a demo we built on top of it: a conversational agent that tracks your face through the webcam, classifies your emotion and gaze, and uses that to change how it speaks to you.

The agent runs MediaPipe's FaceLandmarker at 8 fps on the webcam feed. It pulls 52 blendshape coefficients per frame and classifies them into coarse labels: emotion (happy, sad, surprised, thoughtful, neutral), gaze direction (at camera, off left/right, up, down), and engagement (engaged, distracted, absent). Classification is threshold-based with hysteresis (enter at 0.45, exit at 0.30 for smile detection) and a 4-frame dwell requirement to prevent flicker.

That facial state gets prepended to the user's transcript before it hits the LLM:

`[user state: sad, looking down] my day was rough`

The LLM picks a delivery style for Inworld's TTS-2 model, which supports natural-language steering. You write bracketed director's notes like `[say sadly with deliberate pauses in a low voice]` and the model follows them. Not a dropdown of five emotions, but full natural language. It also renders non-verbal sounds (`[laugh]`, `[sigh]`) as actual audio inline.

If you look away or leave the frame for 5+ seconds, the agent nudges you back contextually instead of sitting in silence. It never narrates what it sees ("I notice you looking away"); the camera signal is guidance for the model, not something it repeats.

The face tracker is a "processor" in Vision Agents. Processors hook into the video stream and run at their own frame rate, independent of the LLM. You can stack multiple in one agent (YOLO at 20 fps, MediaPipe at 8 fps, a depth model at 15 fps) without them blocking each other. The framework handles frame distribution, so there's no threading code on your end. The full agent setup is about 15 lines of Python, and each piece (TTS, STT, LLM, processors) is a swappable plugin.

Stack: Vision Agents for orchestration (MIT licensed), Inworld TTS-2 for voice, Anam for the avatar (their CARA model), MediaPipe for face landmarking, Gemini as the LLM, Deepgram for STT, and Stream for real-time video/audio transport.

Worth noting what this isn't: it's not emotion AI in the "we can detect your true feelings" sense. The blendshape classification is coarse on purpose. A smile above a threshold is "happy"; raised brows plus an open jaw is "surprised." Enough signal for the LLM to pick a reasonable delivery style, not enough to make clinical claims.

Happy to answer questions.
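A minimal sketch of the threshold-plus-hysteresis classification described above. The enter/exit thresholds (0.45/0.30) and the 4-frame dwell come from the post; the blendshape names follow MediaPipe's FaceLandmarker output (e.g. `mouthSmileLeft`), but the surrounding class is illustrative, not the project's actual code.

```python
ENTER_THRESHOLD = 0.45   # score needed before "happy" is considered
EXIT_THRESHOLD = 0.30    # score below which "happy" is released
DWELL_FRAMES = 4         # consecutive frames required before the label flips


class SmileClassifier:
    def __init__(self):
        self.label = "neutral"
        self.candidate = "neutral"
        self.dwell = 0

    def update(self, blendshapes: dict[str, float]) -> str:
        """Update the coarse emotion label from one frame of blendshape scores."""
        smile = max(blendshapes.get("mouthSmileLeft", 0.0),
                    blendshapes.get("mouthSmileRight", 0.0))

        # Hysteresis: the threshold to enter "happy" is higher than to leave it.
        if self.label == "happy":
            proposed = "happy" if smile > EXIT_THRESHOLD else "neutral"
        else:
            proposed = "happy" if smile > ENTER_THRESHOLD else "neutral"

        # Dwell: the proposed label must persist for DWELL_FRAMES consecutive
        # frames before it replaces the current one, which suppresses flicker.
        if proposed == self.label:
            self.candidate, self.dwell = self.label, 0
        elif proposed == self.candidate:
            self.dwell += 1
            if self.dwell >= DWELL_FRAMES:
                self.label, self.dwell = proposed, 0
        else:
            self.candidate, self.dwell = proposed, 1

        return self.label
```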
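The transcript annotation step is just string prepending before the LLM call. A tiny sketch under that assumption; the helper name and signature are made up for illustration:

```python
def annotate_transcript(transcript: str, emotion: str, gaze: str) -> str:
    """Prefix the user's words with the current facial state before the LLM sees them."""
    return f"[user state: {emotion}, {gaze}] {transcript}"


print(annotate_transcript("my day was rough", "sad", "looking down"))
# -> [user state: sad, looking down] my day was rough
```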
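The re-engagement nudge can be expressed as a small timer over the engagement label. The 5-second threshold is from the post; everything else below (class, prompt wording) is an illustrative guess, not Vision Agents code:

```python
import time

NUDGE_AFTER_SECONDS = 5.0


class EngagementWatcher:
    def __init__(self):
        self.disengaged_since: float | None = None

    def update(self, engagement: str) -> str | None:
        """Return a prompt for the LLM once the user has been away long enough."""
        if engagement in ("distracted", "absent"):
            if self.disengaged_since is None:
                self.disengaged_since = time.monotonic()
            elif time.monotonic() - self.disengaged_since >= NUDGE_AFTER_SECONDS:
                self.disengaged_since = None  # only nudge once per lapse
                return (f"[user state: {engagement} for a while] "
                        "gently bring the user back to the conversation")
        else:
            self.disengaged_since = None
        return None
```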
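For the processor model, here is a self-contained illustration of the pattern the post describes: several consumers subscribe to one frame stream and each samples it at its own rate, so a slow model never blocks a fast one. This uses plain asyncio and placeholder processors; the real framework handles this fan-out for you.

```python
import asyncio
import time


class Processor:
    def __init__(self, name: str, fps: float):
        self.name = name
        self.interval = 1.0 / fps
        self.last_run = 0.0

    async def process(self, frame: int) -> None:
        # Placeholder for real work (MediaPipe, YOLO, a depth model, ...).
        print(f"{self.name} handled frame {frame}")

    async def maybe_process(self, frame: int) -> None:
        # Skip frames so this processor stays at its own target rate.
        now = time.monotonic()
        if now - self.last_run >= self.interval:
            self.last_run = now
            await self.process(frame)


async def distribute(processors: list[Processor], total_frames: int = 90) -> None:
    """Feed a ~30 fps frame stream; each processor samples it at its own rate."""
    for frame in range(total_frames):
        await asyncio.gather(*(p.maybe_process(frame) for p in processors))
        await asyncio.sleep(1 / 30)


asyncio.run(distribute([Processor("mediapipe", fps=8), Processor("yolo", fps=20)]))
```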

Similar Articles

Invideo AI uses OpenAI models to create videos 10x faster

OpenAI Blog

Invideo AI, an India-based startup, launches a multi-agent video creation platform built on OpenAI models (GPT-4.1, o3, gpt-image-1, text-to-speech) that enables users to generate professional-quality videos 10x faster from natural language prompts. The system uses specialized AI agents for planning, scripting, research, content moderation, visual generation, and narration, now serving over 50 million users creating 7 million videos monthly.

I gave AI agents eyes on my PC

Reddit r/AI_Agents

The author introduces Pupil, an open-source tool that enables AI agents to visually inspect PC UIs and identify click targets without relying on screenshots.

openai/openai-agents-python

GitHub Trending (daily)

OpenAI releases openai-agents-python, a lightweight framework for building multi-agent workflows that supports OpenAI APIs and 100+ other LLMs. The SDK includes features like sandbox agents, tools, guardrails, human-in-the-loop, tracing, and realtime voice agent capabilities.