Vision Agents is an open-source Python framework for building multimodal AI agents that process video and audio in real time. It enables conversational agents to adapt their voice based on facial expressions and gaze using MediaPipe.
I've been building Vision Agents, an open-source Python framework for building AI agents that process video and audio in real time. This is a demo we built on top of it: a conversational agent that tracks your face through the webcam, classifies your emotion and gaze, and uses that to change how it speaks to you.

The agent runs MediaPipe's FaceLandmarker at 8 fps on the webcam feed. It pulls 52 blendshape coefficients per frame and classifies them into coarse labels: emotion (happy, sad, surprised, thoughtful, neutral), gaze direction (at camera, off left/right, up, down), and engagement (engaged, distracted, absent). Classification is threshold-based with hysteresis (enter at 0.45, exit at 0.30 for smile detection) and a 4-frame dwell requirement to prevent flicker.

That facial state gets prepended to the user's transcript before it hits the LLM:

[user state: sad, looking down] my day was rough

The LLM picks a delivery style for Inworld's TTS-2 model, which supports natural-language steering. You write bracketed director's notes like [say sadly with deliberate pauses in a low voice] and the model follows them. Not a dropdown of five emotions; full natural language. It also renders non-verbal sounds ([laugh], [sigh]) as actual audio inline.

If you look away or leave the frame for 5+ seconds, the agent nudges you back contextually instead of sitting in silence. It never narrates what it sees ("I notice you looking away"); the camera signal is guidance for the model, not something it repeats.

The face tracker is a "processor" in Vision Agents. Processors hook into the video stream and run at their own frame rate, independent of the LLM. You can stack multiple in one agent (YOLO at 20 fps, MediaPipe at 8 fps, a depth model at 15 fps) without them blocking each other. The framework handles frame distribution, so there's no threading code on your end. The full agent setup is about 15 lines of Python, and each piece (TTS, STT, LLM, processors) is a swappable plugin.
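The hysteresis-plus-dwell logic described above can be sketched in a few lines. This is an illustrative sketch, not the framework's actual code; the class and method names are assumptions, but the thresholds (enter at 0.45, exit at 0.30) and the 4-frame dwell match the numbers given.

```python
class SmileDetector:
    """Threshold classifier with hysteresis and a dwell requirement.

    Enters "happy" when the smile score exceeds 0.45, exits when it
    drops below 0.30, and requires 4 consecutive agreeing frames
    before switching states (anti-flicker).
    """

    ENTER, EXIT, DWELL = 0.45, 0.30, 4

    def __init__(self) -> None:
        self.state = "neutral"
        self._candidate = "neutral"
        self._streak = 0

    def update(self, smile_score: float) -> str:
        # Hysteresis: which threshold applies depends on the current state.
        if self.state == "happy":
            candidate = "neutral" if smile_score < self.EXIT else "happy"
        else:
            candidate = "happy" if smile_score > self.ENTER else "neutral"

        # Dwell: only switch after DWELL consecutive frames agree.
        if candidate == self.state:
            self._streak = 0
        elif candidate == self._candidate:
            self._streak += 1
            if self._streak >= self.DWELL:
                self.state = candidate
                self._streak = 0
        else:
            self._candidate = candidate
            self._streak = 1
        return self.state
```

The hysteresis band (scores between 0.30 and 0.45) means a fading smile doesn't flap between labels, and the dwell count absorbs single-frame blendshape noise at 8 fps.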
Stack: Vision Agents for orchestration (MIT licensed), Inworld TTS-2 for voice, Anam for the avatar (their CARA model), MediaPipe for face landmarking, Gemini as the LLM, Deepgram for STT, and Stream for real-time video/audio transport.

Worth noting what this isn't: it's not emotion AI in the "we can detect your true feelings" sense. The blendshape classification is coarse on purpose. A smile above a threshold is "happy"; raised brows plus an open jaw is "surprised." That's enough signal for the LLM to pick a reasonable delivery style, not enough to make clinical claims.

Happy to answer questions.
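The deliberately coarse mapping can be sketched as a handful of threshold rules over the blendshape dictionary. The coefficient names below follow MediaPipe FaceLandmarker's blendshape naming; the specific thresholds and rule order are illustrative assumptions, not the demo's actual values.

```python
def classify_emotion(blendshapes: dict[str, float]) -> str:
    """Map blendshape coefficients to one coarse emotion label."""
    smile = max(blendshapes.get("mouthSmileLeft", 0.0),
                blendshapes.get("mouthSmileRight", 0.0))
    brows_up = max(blendshapes.get("browOuterUpLeft", 0.0),
                   blendshapes.get("browOuterUpRight", 0.0))
    jaw_open = blendshapes.get("jawOpen", 0.0)
    frown = max(blendshapes.get("mouthFrownLeft", 0.0),
                blendshapes.get("mouthFrownRight", 0.0))

    # Coarse on purpose: enough signal for a delivery style, no more.
    if brows_up > 0.5 and jaw_open > 0.4:
        return "surprised"
    if smile > 0.45:
        return "happy"
    if frown > 0.4:
        return "sad"
    return "neutral"
```

Nothing here tries to infer internal state; it just names a facial configuration so the LLM has a hint when choosing a delivery style.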
The article discusses the emerging concept of AI video agents that allow users to create complete videos simply by chatting with a chatbot, potentially simplifying and replacing traditional multi-tool video production workflows.
OpenAI has launched three new real-time audio models to enable continuous, multitasking voice interactions that prioritize long-context reasoning, live translation, and seamless tool use.
Invideo AI, an India-based startup, launches a multi-agent video creation platform built on OpenAI models (GPT-4.1, o3, gpt-image-1, text-to-speech) that enables users to generate professional-quality videos 10x faster from natural language prompts. The system uses specialized AI agents for planning, scripting, research, content moderation, visual generation, and narration, now serving over 50 million users creating 7 million videos monthly.
The author introduces Pupil, an open-source tool that enables AI agents to visually inspect PC UIs and identify click targets without relying on screenshots.
OpenAI releases openai-agents-python, a lightweight framework for building multi-agent workflows that supports OpenAI APIs and 100+ other LLMs. The SDK includes features like sandbox agents, tools, guardrails, human-in-the-loop, tracing, and realtime voice agent capabilities.