Tag
This paper introduces AnyMo, a unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, along with the OmniHuMo dataset of over 5,000 hours of motion data to enable high-quality synthesis under arbitrary modality combinations.
UniSteer introduces a text-guided activation flow matching method to learn a universal conditional velocity field in activation space, enabling versatile LLM behavior control and classification tasks without task-specific intervention modules.
MERIT is a framework that learns disentangled music representations for melody, rhythm, and timbre using conditional audio generation and source-separated stems, enabling nuanced and factor-specific audio similarity queries.