Are AI social apps moving from text chat to real-time video interfaces?

Reddit r/ArtificialInteligence News

Summary

A discussion about the evolution of AI social apps from text chat to real-time video interfaces, highlighting Mel's multimodal interaction stack and the technical challenges of latency, lip sync, and orchestration.

I knew text-based character chat was already working as a category, especially after seeing [Character.AI](http://Character.AI) take off, with founders who came from Google/LaMDA-type work. But it feels like the next step might be moving from text chat into real-time video interaction.

I tried Mel recently, and the interesting part to me wasn’t just that it lets you talk to characters. It was the whole interaction stack: voice input, lip sync, camera-aware responses, facial reactions, and a video character that felt much less static than the usual avatar/chatbot setup. For example, if the user is visibly on a plane, the character can ask if they’re on a plane. If the user is in a bathroom, it can notice that context too.

I’m not sure how much of the video is truly changing in real time vs. using some clever prebuilt animation/rendering system, but the lip sync was surprisingly good and the interaction felt more dynamic than most AI social apps I’ve seen so far.

For people working on multimodal or agentic interfaces, what do you think is technically hardest here?

* low-latency vision understanding
* speech timing
* lip sync
* real-time avatar rendering
* memory/context
* making it feel unscripted instead of like a scripted NPC

My guess is that the challenge is less about any single model and more about orchestration: keeping voice, vision, language, animation, and memory synced without making the whole thing feel delayed or fake.

Do you think real-time video becomes a serious AI interface, or is it mostly a novelty until latency/animation quality improves?
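To make the orchestration guess a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It is not how Mel or any real app works: the stage names, the dependency graph, and every millisecond figure are made-up assumptions, chosen only to show how end-to-end delay depends on which stages can overlap.

```python
# Hypothetical stages: name -> (latency in ms, stages it must wait for).
# All numbers are illustrative guesses, not measurements of any real product.
STAGES = {
    "speech_to_text": (300, []),
    "vision": (250, []),
    "language_model": (400, ["speech_to_text", "vision"]),
    "text_to_speech": (200, ["language_model"]),
    "lip_sync": (80, ["text_to_speech"]),
    "render": (50, ["lip_sync"]),
}


def sequential_latency(stages):
    """Total delay if every stage runs one after another."""
    return sum(cost for cost, _ in stages.values())


def pipelined_latency(stages):
    """Delay if each stage starts as soon as its dependencies finish.

    This is the length of the critical path through the dependency graph.
    """
    finish = {}

    def finish_time(name):
        if name not in finish:
            cost, deps = stages[name]
            start = max((finish_time(d) for d in deps), default=0)
            finish[name] = start + cost
        return finish[name]

    return max(finish_time(name) for name in stages)


if __name__ == "__main__":
    print("sequential:", sequential_latency(STAGES), "ms")
    print("pipelined: ", pipelined_latency(STAGES), "ms")
```

Under these assumed numbers, overlapping vision with speech recognition only removes the shorter of the two from the total. The chain from language model to speech to lip sync to rendering stays serial, which is one way to see why the whole pipeline, not any single model, would set the perceived delay.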
