[Vex] - I built an open-source terminal AI video editor that edits real footage with FFmpeg, Whisper, and agent tool calls

Reddit r/AI_Agents Tools

Summary

Vex is an open-source terminal-based AI video editing agent that uses LLMs as planners and deterministic tools (FFmpeg, Whisper) for actual edits, enabling natural-language editing commands with local-first, scriptable workflows.

Most AI video tools feel backwards. They start with the model.

I wanted the opposite. I wanted the model to be the planner, not the editor. The actual edits should come from boring, deterministic tools: FFmpeg, MoviePy, Whisper, project state, timelines, undo/redo, export validation.

So I built **Vex**.

Vex is an open-source AI video editing agent for the terminal. You launch vex, point it at a video, and talk to it like this:

trim the first 30 seconds of D:\videos\clip.mp4

remove awkward pauses

burn subtitles

add auto visuals

export it for instagram

The important part is not “AI edits video.” That is the hype version. The real idea is an **agentic harness for video editing**.

The LLM does not own the truth. It chooses tools. The project state owns the truth. Vex keeps a working copy of the footage, stores timeline operations, records artifacts, and can rebuild edits through undo/redo instead of just hoping the model remembers what happened.

The current stack includes:

* natural-language editing in a terminal REPL
* safe working-copy edits so original footage stays untouched
* trims, merges, speed changes, fades, overlays, audio edits, subtitles
* local Whisper transcription
* transcript-aware highlight cuts and vertical shorts
* auto color grading through sampled-frame analysis and reusable FFmpeg filters
* transcript-aware custom visuals through Hyperframes first, Manim when needed
* export presets for YouTube, Instagram, TikTok, X, and podcast audio
* Gemini, Claude, and OpenAI-compatible local providers like Ollama / LM Studio / llama.cpp

The auto visuals part is the most interesting piece right now. Instead of blindly throwing stock footage over a talking-head video, Vex reads the transcript, scores which spoken beats are actually visualizable, decides whether full-screen replacement or picture-in-picture is safer, generates the visual, checks frames for contrast/dead space/text overflow/edge safety, then composites the best version back into the cut.
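To make the planner/tool/state split concrete, here is a minimal Python sketch. It is not Vex's actual code: `Project`, `TOOLS`, `dispatch`, `trim_start`, and `speed` are hypothetical names, and the tool-call shape (`{"tool": ..., "args": ...}`) is an assumption about what a planner might emit. The FFmpeg command is only built as an argument list and never executed, so the example runs without FFmpeg installed.

```python
from dataclasses import dataclass, field


# Hypothetical tool builders. Each returns (input_options, video_filters, audio_filters).
def trim_start(seconds: float):
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # -ss placed before -i seeks the input, dropping the first `seconds` of footage.
    return ["-ss", str(seconds)], [], []


def speed(factor: float):
    if not 0.5 <= factor <= 2.0:
        raise ValueError("factor must be between 0.5 and 2.0")
    # setpts rescales video timestamps; atempo changes audio tempo to match.
    return [], [f"setpts=PTS/{factor}"], [f"atempo={factor}"]


TOOLS = {"trim_start": trim_start, "speed": speed}


@dataclass
class Project:
    source: str                                      # working copy, never the original file
    ops: list = field(default_factory=list)          # recorded timeline operations
    redo_stack: list = field(default_factory=list)   # operations removed by undo

    def apply(self, tool: str, **args) -> None:
        if tool not in TOOLS:
            raise ValueError(f"unknown tool: {tool}")
        TOOLS[tool](**args)           # dry-run the builder so bad arguments fail early
        self.ops.append((tool, args))
        self.redo_stack.clear()       # a new edit invalidates the redo history

    def undo(self) -> None:
        if self.ops:
            self.redo_stack.append(self.ops.pop())

    def redo(self) -> None:
        if self.redo_stack:
            self.ops.append(self.redo_stack.pop())

    def ffmpeg_args(self, output: str) -> list[str]:
        """Rebuild the whole command from the source plus the recorded ops."""
        pre, vf, af = [], [], []
        for tool, args in self.ops:
            p, v, a = TOOLS[tool](**args)
            pre += p
            vf += v
            af += a
        cmd = ["ffmpeg", "-y", *pre, "-i", self.source]
        if vf:
            cmd += ["-vf", ",".join(vf)]
        if af:
            cmd += ["-af", ",".join(af)]
        return cmd + [output]


def dispatch(project: Project, call: dict) -> None:
    """Apply one planner tool call; the model picks the tool, the state records it."""
    project.apply(call["tool"], **call.get("args", {}))


if __name__ == "__main__":
    project = Project("work/clip.mp4")
    dispatch(project, {"tool": "trim_start", "args": {"seconds": 30}})
    dispatch(project, {"tool": "speed", "args": {"factor": 1.5}})
    print(project.ffmpeg_args("out.mp4"))
    project.undo()                    # drops the speed change, keeps the trim
    print(project.ffmpeg_args("out.mp4"))
```

Because the command is always rebuilt from the untouched source plus the recorded operations, undo is just removing an entry and rebuilding. The sketch skips a lot that a real harness would need, for example composing repeated trims, validating exports, and actually running the command.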
Basically: AI chooses the move. Deterministic tools execute the move. Project state remembers the move.

That is the whole mental model.

The honest scorecard:

* Can it replace a professional editor? No.
* Can it automate a lot of boring creator editing work? Yes.
* Can it help with shorts, captions, subtitles, b-roll, color, and exports? Yes.
* Is it perfect on messy creative judgment? No.

Where it wins: repeatable editing workflows with clear instructions.

Where it still needs work: long-form taste, complex narrative edits, and making setup smoother.

I built this because I think “AI video editing” should not mean uploading everything into a black-box web app. It should also be possible to have a local-first, scriptable, inspectable editing harness where the model is just one part of the system.

Repo link in the comments below.

I’d love brutal feedback from people who edit videos, build agent tools, or have tried to automate FFmpeg workflows before. What would make this actually useful in your workflow?