@MaxForAI: Yesterday, ByteDance Seed open-sourced a very interesting checkpoint, TaskMem. It is trained on Qwen3-VL-30B-A3B, with the goal not being to directly answer questions, but to enable multimodal Agents to learn to generate more useful long-term memory from video/environment streams. The key is to let the Agent learn in continuous video…

X AI KOLs Timeline 06/03/26, 05:19 AM Models

multimodal-agent long-term-memory video-understanding open-source bytedance rl-training streaming-task

Summary

ByteDance Seed has open-sourced the TaskMem checkpoint, built on Qwen3-VL-30B-A3B. It uses two-stage training (GSPO reinforcement learning for memory generation, then a DPO-tuned lightweight adapter) so that multimodal agents learn to generate long-term memory from video streams. Accuracy improves over the base model by 6.3 points on VideoMME, 7.0 on EgoLife, and 5.3 on EgoTempo.

Cached at: 06/03/26, 09:47 AM

Yesterday, ByteDance Seed open-sourced a very interesting checkpoint:

TaskMem

It is built on Qwen3-VL-30B-A3B. The goal is not to answer questions directly, but to teach multimodal agents to generate more useful long-term memory from video/environment streams.

The key point is to let the agent learn to judge “what is worth remembering” in a continuous video/environment stream, rather than treating memory as a simple summary, RAG library, or clipboard.

The corresponding paper is titled Task-Focused Memorization for Multimodal Agents, authored by Tao Zou, Yichen He, Tian Qiu, Yuan Lin, and Hang Li, from ByteDance Seed and Fudan University.

The core approach in the paper is a two-stage training process.

Stage 1: Learning “how to remember”
Reinforcement learning is used to train the memory generation policy so that it produces episodic memory that is accurate, non-redundant, consistent in format, and sufficiently informative.
The paper uses GSPO for training, with rewards including format, thinking length, quality, and richness.
An interesting detail: they specifically added a richness reward because optimizing only for quality would let the model game the system by generating very short but seemingly correct memories.
Models, once they find a loophole in the exam, cheat faster than college students.
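
To make the loophole concrete, here is a minimal sketch of a combined reward, assuming a simple weighted sum. The weights, the `richness_score` proxy, and all names are hypothetical; the post only names the four reward terms and does not say how the paper scores or combines them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the post does not give the paper's values.
    fmt: float = 1.0
    thinking: float = 0.5
    quality: float = 2.0
    richness: float = 1.0


def richness_score(memory: str, target_words: int = 60) -> float:
    """Toy richness proxy (an assumption): unique-word count, capped at 1.0."""
    unique_words = {w.lower().strip(".,;:!?") for w in memory.split()}
    unique_words.discard("")
    return min(len(unique_words) / target_words, 1.0)


def total_reward(format_ok: bool, thinking_ok: bool, quality: float,
                 memory: str, w: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of the four reward terms named in the post."""
    return (w.fmt * float(format_ok)
            + w.thinking * float(thinking_ok)
            + w.quality * quality
            + w.richness * richness_score(memory))


# A short memory that is hard to fault versus a detailed one with slightly lower quality.
short = "A person is in a kitchen."
detailed = ("Alice puts the red kettle on the left burner, takes two mugs from "
            "the top shelf, and tells Bob the spare keys are in the blue drawer "
            "by the door.")

no_richness = RewardWeights(richness=0.0)
print(total_reward(True, True, 1.0, short, no_richness),
      total_reward(True, True, 0.9, detailed, no_richness))  # short wins
print(total_reward(True, True, 1.0, short),
      total_reward(True, True, 0.9, detailed))               # detailed wins
```

With the richness term switched off, the terse memory scores higher; with it on, the detailed memory does, which is the behavior the post attributes to the added reward.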

Stage 2: Learning “what to remember”
After deployment, a very lightweight adapter is trained based on tasks/questions that appear in the recent environment to shift the model’s memory focus toward information that is more likely to be useful in the future.
The paper states that this adapter has only 2048 trainable parameters, the main model is frozen, and DPO is used for optimization; it functions more like a “task-oriented memory bias vector.”
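
As a rough illustration of the Stage 2 idea, the sketch below shows the standard DPO loss next to a 2048-entry bias vector. Treating the adapter as a vector added to a hidden state is an assumption based on the "bias vector" description above, and `apply_bias`, `ADAPTER_DIM`, and the log-probability values are illustrative only, not the paper's implementation.

```python
import math

ADAPTER_DIM = 2048  # trainable parameter count quoted in the post


def apply_bias(hidden, bias):
    """Assumed adapter form: add a learned vector to a frozen hidden state."""
    return [h + b for h, b in zip(hidden, bias)]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# With a zero bias the adapted model equals the frozen reference model,
# so the margin is 0 and the loss is log(2).
bias = [0.0] * ADAPTER_DIM
hidden = [0.5] * ADAPTER_DIM
assert apply_bias(hidden, bias) == hidden
start = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# If training makes the preferred (task-relevant) memory more likely, the loss drops.
after = dpo_loss(-4.0, -7.0, -5.0, -7.0)
print(len(bias), start, after)
```

Because the main model is frozen, it can serve as the DPO reference policy; only the 2048 bias values would receive gradients in this sketch.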

The experimental design is quite clever: they adapted VideoMME, EgoLife, and EgoTempo into streaming tasks.
The agent first watches the video stream and generates memory; questions appear later, and when answering, the agent cannot go back to the original video and can rely only on the memory it generated.
This setup is closer to real-world agent scenarios than typical video QA, because in real environments you can’t just rewind the recording every time (even though I wish I could).
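
The protocol can be sketched as a small evaluation loop. Everything here is a toy stand-in under stated assumptions: strings play the role of video chunks, `memorize` and `answer` are hypothetical callables, and questions are asked after the whole stream for simplicity, whereas the post only says they appear later.

```python
from typing import Callable, Iterable, List, Tuple


def run_streaming_eval(chunks: Iterable[str],
                       questions: List[Tuple[str, str]],
                       memorize: Callable[[str], str],
                       answer: Callable[[str, List[str]], str]) -> float:
    """Write memory chunk by chunk, then answer using the memory alone."""
    memory: List[str] = []
    for chunk in chunks:
        memory.append(memorize(chunk))  # the raw chunk is not kept
    correct = sum(answer(q, memory) == gold for q, gold in questions)
    return correct / len(questions)


# Toy data: each string stands in for one video segment.
stream = ["keys placed in blue drawer", "kettle on left burner", "tv shows weather"]
qa = [("blue drawer", "yes"), ("left burner", "yes"), ("red car", "no")]


def keyword_answer(question: str, memory: List[str]) -> str:
    """Answer 'yes' only if the queried phrase survives in memory."""
    return "yes" if any(question in m for m in memory) else "no"


keep_all = run_streaming_eval(stream, qa, lambda c: c, keyword_answer)
forget_all = run_streaming_eval(stream, qa, lambda c: "", keyword_answer)
print(keep_all, forget_all)
```

Since the answering step never sees `stream`, accuracy depends entirely on what the memory writer chose to keep.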

Results: TaskMem achieved accuracy scores of 67.9 on VideoMME, 45.4 on EgoLife, and 27.6 on EgoTempo.
Compared to the baseline Qwen3-VL-30B-A3B (61.6, 38.4, 22.3), improvements are 6.3, 7.0, and 5.3 percentage points respectively.
It outperforms the GPT-5.2 results reported in the paper's table on VideoMME and EgoLife; on EgoTempo, its accuracy is lower than GPT-5.2's, but its precision is higher.
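
The quoted gains can be checked with a few lines of arithmetic; the numbers below are exactly those given above, and the dictionary layout is only for illustration.

```python
# (TaskMem accuracy, Qwen3-VL-30B-A3B baseline accuracy) as quoted in the post.
scores = {
    "VideoMME": (67.9, 61.6),
    "EgoLife": (45.4, 38.4),
    "EgoTempo": (27.6, 22.3),
}

# Absolute gain in percentage points, rounded to one decimal place.
gains = {name: round(taskmem - base, 1) for name, (taskmem, base) in scores.items()}
print(gains)  # {'VideoMME': 6.3, 'EgoLife': 7.0, 'EgoTempo': 5.3}
```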

This direction is highly inspiring for personal AI, embodied agents, screenshot memory, and video understanding.
For example, when users take many screenshots, the difficulty is not just retrieval—it’s whether the system can anticipate which screenshots, which details, and which context will be useful in the future.

Link: https://huggingface.co/ByteDance-Seed/TaskMem/tree/main…

ByteDance-Seed/TaskMem at main

Source: https://huggingface.co/ByteDance-Seed/TaskMem/tree/main

Commit: b2b4dff (https://huggingface.co/ByteDance-Seed/TaskMem/commit/b2b4dff145646e82646e4ee9d657442b75cd0ba3), "Upload folder using huggingface_hub", by hyc2026 (https://huggingface.co/hyc2026), verified, 1 day ago
