Tag
Bernini proposes a unified video generation and editing framework that combines multimodal large language models for semantic planning with diffusion models for pixel rendering, achieving state-of-the-art performance through semantic interface separation and enhanced positional embeddings.