@MaxForAI: Yesterday, ByteDance Seed open-sourced a very interesting checkpoint, TaskMem. It is trained on Qwen3-VL-30B-A3B, with the goal not being to directly answer questions, but to enable multimodal Agents to learn to generate more useful long-term memory from video/environment streams. The key is to let the Agent learn in continuous video…


Summary

ByteDance Seed has open-sourced the TaskMem checkpoint, trained on Qwen3-VL-30B-A3B. It uses two-stage reinforcement learning to enable multimodal Agents to learn to generate long-term memory from video streams, achieving significant improvements on benchmarks such as VideoMME and EgoLife.


Cached at: 06/03/26, 09:47 AM

Yesterday, ByteDance Seed open-sourced a very interesting checkpoint: TaskMem.

It is trained on top of Qwen3-VL-30B-A3B. The goal is not to answer questions directly, but to teach multimodal agents to generate more useful long-term memory from video/environment streams.

The key point is to let the agent learn to judge “what is worth remembering” in a continuous video/environment stream, rather than treating memory as a simple summary, RAG library, or clipboard.

The corresponding paper is titled Task-Focused Memorization for Multimodal Agents, authored by Tao Zou, Yichen He, Tian Qiu, Yuan Lin, and Hang Li, from ByteDance Seed and Fudan University.

The core approach in the paper is a two-stage training process.

Stage 1: Learning “how to remember”
Reinforcement learning is used to train the memory-generation policy so that it produces episodic memory that is accurate, non-redundant, stable in format, and sufficiently informative.
The paper trains with GSPO (Group Sequence Policy Optimization), with rewards covering format, thinking length, quality, and richness.
An interesting detail: they specifically added a richness reward because optimizing only for quality would let the model game the system by generating very short but seemingly correct memories.
Once a model finds a loophole in the exam, it cheats faster than a college student.
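
To make the loophole concrete, here is a toy sketch of a composite reward. Every function, weight, and bullet format below is an illustrative assumption, not the paper's definition, and the thinking-length term is left out for brevity. It only aims to show that a very short memory can max out a quality-only reward, while a richness term ranks the fuller memory higher.

```python
# Toy composite reward. All names, weights, and the "- " bullet format are
# hypothetical stand-ins, not the rewards defined in the TaskMem paper.

def _items(memory: str) -> list:
    """Extract the text of each '- ' bullet line."""
    return [ln[2:].strip() for ln in memory.splitlines() if ln.startswith("- ")]

def format_reward(memory: str) -> float:
    """1.0 if the memory is non-empty and every non-blank line is a bullet."""
    lines = [ln for ln in memory.splitlines() if ln.strip()]
    return 1.0 if lines and all(ln.startswith("- ") for ln in lines) else 0.0

def quality_reward(memory: str, facts: set) -> float:
    """Precision-like score: fraction of stated items that are true."""
    items = _items(memory)
    return sum(item in facts for item in items) / len(items) if items else 0.0

def richness_reward(memory: str, facts: set) -> float:
    """Recall-like score: fraction of the true facts that the memory covers."""
    return len(set(_items(memory)) & facts) / len(facts) if facts else 0.0

def total_reward(memory: str, facts: set, w_rich: float = 1.0) -> float:
    """Sum of the three toy terms; w_rich=0.0 disables the richness term."""
    return (format_reward(memory)
            + quality_reward(memory, facts)
            + w_rich * richness_reward(memory, facts))

FACTS = {
    "keys left on the desk",
    "Alice entered at 9am",
    "the stove was turned off",
    "a package arrived",
}
short_memory = "- keys left on the desk"
rich_memory = (
    "- keys left on the desk\n"
    "- Alice entered at 9am\n"
    "- the stove was turned off"
)

# Without richness the two memories tie; with it, the fuller one wins.
tie_without_richness = (total_reward(short_memory, FACTS, w_rich=0.0)
                        == total_reward(rich_memory, FACTS, w_rich=0.0))
rich_wins = total_reward(rich_memory, FACTS) > total_reward(short_memory, FACTS)
```

In this toy setup both memories are perfectly "correct", so a quality-only objective has no reason to prefer the longer one.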

Stage 2: Learning “what to remember”
After deployment, a very lightweight adapter is trained based on tasks/questions that appear in the recent environment to shift the model’s memory focus toward information that is more likely to be useful in the future.
The paper states that this adapter has only 2048 trainable parameters; the main model is frozen and the adapter is optimized with DPO (Direct Preference Optimization). It functions more like a “task-oriented memory bias vector.”
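
For readers who have not seen DPO, a minimal sketch of the standard per-pair loss follows. The function name, argument names, and the `beta` value are assumptions for illustration. How the paper builds its preferred and rejected memories, and how the 2048-parameter adapter is wired into the model, is not modeled here; in such a setup only the adapter would receive gradients from this loss.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the preferred ("chosen") and
    dispreferred ("rejected") outputs under the trainable policy and under
    a frozen reference model. beta=0.1 is an illustrative default.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# When the policy equals the reference, the margin is 0 and the loss is log(2).
loss_at_start = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Raising the chosen output's likelihood relative to the rejected one
# lowers the loss.
loss_after_shift = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss falls as the policy assigns relatively more probability to the preferred memory than the reference does, which is the sense in which a small adapter can "bias" memory focus.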

The experimental design is quite clever: they adapted VideoMME, EgoLife, and EgoTempo into streaming tasks.
The agent first watches the video stream and generates memory; questions appear later, and when answering, the agent cannot review the original video—only the generated memory.
This setup is closer to real-world agent scenarios than typical video QA, because in real environments you can’t just rewind the recording every time (even though I wish I could).
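
The protocol described above can be sketched as a small evaluation loop. This is a rough illustration under stated assumptions: `streaming_eval`, `memorize`, and `answer` are hypothetical names, clips are stood in for by caption strings, and exact-match scoring is a simplification rather than the paper's metric.

```python
from typing import Callable, Iterable, List, Tuple

def streaming_eval(
    clips: Iterable[str],
    questions: List[Tuple[str, str]],
    memorize: Callable[[str, List[str]], str],
    answer: Callable[[str, List[str]], str],
) -> float:
    """Build memory while the stream passes, then answer from memory only."""
    if not questions:
        raise ValueError("need at least one question")
    memory: List[str] = []
    for clip in clips:
        # Each clip is seen exactly once, before any question is known.
        memory.append(memorize(clip, memory))
    # From here on the raw clips are out of reach; only `memory` is passed on.
    correct = sum(answer(q, memory) == gold for q, gold in questions)
    return correct / len(questions)

# Hypothetical stand-ins for the memory generator and the answering model.
def keep_everything(clip: str, memory: List[str]) -> str:
    return clip

def keep_first_two_words(clip: str, memory: List[str]) -> str:
    return " ".join(clip.split()[:2])

def keyword_lookup(question: str, memory: List[str]) -> str:
    return "yes" if any(question in entry for entry in memory) else "no"

clips = ["red mug on table", "cat jumps on sofa"]
questions = [("mug", "yes"), ("dog", "no"), ("sofa", "yes")]

acc_full = streaming_eval(clips, questions, keep_everything, keyword_lookup)
acc_lossy = streaming_eval(clips, questions, keep_first_two_words, keyword_lookup)
```

With the lossy memorizer the "sofa" detail is never stored, so that question can no longer be answered correctly; what was written to memory, not the video itself, bounds the accuracy.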

Results: TaskMem achieved accuracy scores of 67.9 on VideoMME, 45.4 on EgoLife, and 27.6 on EgoTempo.
Compared to the baseline Qwen3-VL-30B-A3B (61.6, 38.4, 22.3), the improvements are 6.3, 7.0, and 5.3 percentage points respectively.
It outperforms the GPT-5.2 results reported in the paper's table on VideoMME and EgoLife; on EgoTempo, its accuracy is lower than GPT-5.2's, but its precision is higher.
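
As a quick check of the arithmetic, the gains can be recomputed from the scores quoted above. The dictionary names are only for illustration, and the relative gains are derived here rather than taken from the paper.

```python
# Accuracy scores as quoted in this post.
taskmem = {"VideoMME": 67.9, "EgoLife": 45.4, "EgoTempo": 27.6}
baseline = {"VideoMME": 61.6, "EgoLife": 38.4, "EgoTempo": 22.3}

# Absolute gain in percentage points, rounded to one decimal place.
gains = {name: round(taskmem[name] - baseline[name], 1) for name in taskmem}

# Relative gain over the baseline, in percent.
relative = {name: 100 * (taskmem[name] - baseline[name]) / baseline[name]
            for name in taskmem}

for name in taskmem:
    print(f"{name}: +{gains[name]} points ({relative[name]:.1f}% relative)")
```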

This direction is highly inspiring for personal AI, embodied agents, screenshot memory, and video understanding.
For example, when users take many screenshots, the difficulty is not just retrieval—it’s whether the system can anticipate which screenshots, which details, and which context will be useful in the future.

Link: https://huggingface.co/ByteDance-Seed/TaskMem/tree/main…


ByteDance-Seed/TaskMem at main

Source: https://huggingface.co/ByteDance-Seed/TaskMem/tree/main

Commit b2b4dff (https://huggingface.co/ByteDance-Seed/TaskMem/commit/b2b4dff145646e82646e4ee9d657442b75cd0ba3) by hyc2026 (https://huggingface.co/hyc2026), verified, 1 day ago: Upload folder using huggingface_hub

Similar Articles

@berryxia: Guys, the MemOS 2.0 open-source project has been updated again! It has gained 9.3K Stars on GitHub ~ This time, 'AI memory' has been upgraded from an advanced clipboard to true 'execute and learn'. Previously, many memory solutions simply stored chat logs and added semantic search, making it look like memory, but it was actually just RAG...

X AI KOLs Timeline

MemOS 2.0 open-source project update introduces the 'execute and learn' mechanism, enabling the AI Agent to automatically deconstruct and distill experience when completing tasks, evolving hierarchically from raw trajectories to muscle memory, resulting in a dedicated assistant that understands you better as you use it.

@WY_mask: Build persistent memory engine for all kinds of AI coding assistants http://github.com/rohitg00/agentmemory… Silently records code changes and context in the background, automatically extracts and compresses into structured memory, saves Token consumption from long context, associates past information, as…

X AI KOLs Timeline

agentmemory is an open-source tool that provides persistent memory for AI coding assistants. It silently records code changes and context, automatically extracts and compresses them into structured memory, reduces Token consumption, and supports multiple mainstream platforms such as Claude Code and Codex.

@berryxia: Agent memory is incredibly competitive! I have to say, the more people join this track, the better it gets! The Tencent AI team spent a full 6 months tackling just one problem: AI agents frequently dropping context in long conversations. They ended up building a complete memory system and open-sourced it directly. After reading their sharing, my biggest takeaway is...

X AI KOLs Timeline

Tencent AI has open-sourced an Agent memory system that significantly improves token efficiency and agent consistency in long dialogues through three methods: real-time context compression, Mermaid task maps, and Persona memory. Token consumption is reduced by 61%, and persona consistency jumps from 48% to 76%.

@servasyy_ai: https://x.com/servasyy_ai/status/2057463627255570937

X AI KOLs Timeline

Tencent Cloud database team open-sourced TencentDB Agent Memory, a runtime system that solves the context degradation problem in long tasks for AI agents, compressing short-term context into the memory system through three-layer backtracking and dynamic compression, and integrating a long-term memory pipeline. This is a landmark attempt for AI agent memory systems moving from 'database' to 'runtime'.

@wsl8297: When running complex tasks with AI agents, the most painful thing is often not that the model isn't strong enough, but that as the conversation gets longer, the context starts to overflow. You have to keep filling in background details, re-explaining the process, plus the redundant logs from tool calls — tokens just gush out like a broken pipe. Recently, I saw TencentDB Agent Memory open-sourced by Tencent...

X AI KOLs Timeline

Tencent has open-sourced TencentDB Agent Memory, which solves the AI agent long-context overflow problem through hierarchical memory management (symbolic short-term memory + hierarchical long-term memory). Benchmarks show token consumption reduced by up to 61% and task success rate improved by over 50%.