@jakevin7: Maka's Harness project brings DeepSeek Flash's test set results close to GLM-5.2 level ----------------------------------- maka + DeepSeek Flash V4, te…

X AI KOLs Timeline 07/03/26, 05:59 AM News

deepseek glm harness terminal-bench cache agent maka

Summary

Maka's Harness project improved the self-check mechanism, enabling DeepSeek Flash V4 to achieve evaluation results close to GLM-5.2 on the terminal-bench sample set, completing 10 programming agent tasks with only 4 RMB and a 97.5% cache hit rate.

Maka's Harness project brings DeepSeek Flash's test set results close to GLM-5.2 level

maka + DeepSeek Flash V4 scored 0.8 on the terminal-bench sample. The effective score is close to 0.9: one problem was solved correctly, but the scoring system did not count it because of "product contamination". That almost catches up to GLM-5.2's evaluation results.

The terminal-bench sample is a subset of the full 84-question terminal-bench set, consisting of 10 programming agent tasks. This run:

- Total token consumption: 60 million, of which cache hits: 58.5 million (97.5% hit rate).
- Total cost: about 4 RMB.

10 problems, 4 yuan, near-perfect score.

Does this mean DeepSeek Flash got better? No. It's the same DeepSeek Flash V4 model. What pushed the score from baseline to 0.8 was two rounds of iteration on the self-check mechanism in the agent loop. Inspired by Claude Code's self-check implementation, the changes reinforce the model's self-verification logic after it completes a task. After two iterations, Flash suddenly "had a breakthrough" on terminal-bench.

What this shows: the model's capability has an upper limit, but the quality of the harness engineering can greatly affect how well the model performs on specific tasks. Self-check is not magic; it just makes the model "verify itself before submitting the answer." This single step lets DeepSeek Flash achieve a level close to GLM-5.2.

Why the 97.5% cache hit rate is key

Terminal-bench consists of long-horizon tasks. Each problem requires dozens of tool calls, and the accumulated context reaches millions of tokens. Maka's context management played a role here:

- The prefix remains stable, so DeepSeek's cache hits heavily.
- Out of 60 million tokens, 58.5 million were cache tokens; only 1.5 million needed full computation.
- Thus 10 difficult problems cost only 4 RMB, not tens of yuan.

This is also why running DeepSeek Flash with maka is more economical than using reasonix directly: maka fully exploits DeepSeek's cache mechanism.

The next goal is to run the full 84-question terminal-bench. What makes Flash this strong is not the model but the harness engineering.
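The self-check step described in the post ("verify itself before submitting the answer") can be pictured as a small loop around the agent. The sketch below is only an illustration under stated assumptions: the `Harness` interface, `runWithSelfCheck`, and the fake harness are hypothetical names invented here, not Maka's or Claude Code's actual implementation.

```typescript
// Hypothetical shapes for illustration; none of these names come from Maka's codebase.
type Verdict = { ok: boolean; feedback: string };

interface Harness {
  // Runs the agent on the task, optionally with feedback from a failed check.
  attempt(task: string, feedback?: string): string;
  // Verifies the result before it is submitted as the final answer.
  selfCheck(task: string, result: string): Verdict;
}

interface Outcome {
  result: string;
  attempts: number;
  verified: boolean;
}

// Attempt the task, check the result, and retry with the checker's feedback
// until the check passes or the attempt budget is exhausted.
function runWithSelfCheck(h: Harness, task: string, maxAttempts = 3): Outcome {
  let feedback: string | undefined;
  let result = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    result = h.attempt(task, feedback);
    const verdict = h.selfCheck(task, result);
    if (verdict.ok) {
      return { result, attempts: attempt, verified: true };
    }
    feedback = verdict.feedback;
  }
  // Budget exhausted: return the last result, flagged as unverified.
  return { result, attempts: maxAttempts, verified: false };
}

// Fake harness: the first attempt fails the check, the retry with feedback passes.
const fakeHarness: Harness = {
  attempt: (_task, feedback) =>
    feedback ? "patched build script" : "untested build script",
  selfCheck: (_task, result) =>
    result.startsWith("patched")
      ? { ok: true, feedback: "" }
      : { ok: false, feedback: "tests were never run" },
};

const outcome = runWithSelfCheck(fakeHarness, "fix the build");
console.log(outcome);
```

The model is unchanged between attempts; only the loop around it differs, which is the point the post makes about harness engineering.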

Cached at: 07/03/26, 06:30 AM

Maka’s Harness Engineering Brings DeepSeek Flash’s Test Set Performance Close to GLM-5.2 Level

maka + DeepSeek Flash V4 scores 0.8 on the terminal-bench sample. In practice it is close to 0.9: one question was actually answered correctly, but the scoring system did not count it because of “output contamination”. It has nearly caught up to GLM-5.2’s evaluation performance.

The terminal-bench sample is a subset of the full 84-question terminal-bench set, consisting of 10 programming agent tasks. Results from this run:

- Total token consumption: 60 million, of which cache hits: 58.5 million (97.5% hit rate).
- Total cost: ~4 RMB.

10 questions, 4 yuan, near-perfect score.

Did DeepSeek Flash get stronger? No. It uses the same DeepSeek Flash V4 model. What pushed the score from baseline to 0.8 was two iterations of the self-check mechanism in the agent loop, which borrow from Claude Code’s self-check implementation and strengthen the model’s self-validation logic after task completion. After two iterations, Flash suddenly “grasped the insight” on terminal-bench.

This shows: model capabilities have an upper limit, but the quality of the harness engineering can significantly affect how well the model performs on specific tasks. Self-check isn’t magic; it just lets the model “double-check itself before submitting the answer.” This step allowed DeepSeek Flash to achieve performance close to GLM-5.2.

Why the 97.5% cache hit rate is key

terminal-bench consists of long-horizon tasks. Each question requires dozens of tool calls, with context accumulating to millions of tokens. maka’s context management plays a crucial role here: the prefix remains stable, so DeepSeek’s cache hits heavily. Out of 60 million tokens, 58.5 million are cache tokens; only 1.5 million require full computation. So 10 challenging questions cost only 4 yuan, not dozens.
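The cache figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the token counts from the post; the `TokenUsage` shape, the function names, and especially the `cachedPriceRatio` value are assumptions for illustration, not DeepSeek's actual pricing or Maka's usage schema.

```typescript
// Hypothetical usage record; field names are illustrative only.
interface TokenUsage {
  totalTokens: number;
  cachedTokens: number;
}

// Fraction of all tokens that were served from the prefix cache.
function cacheHitRate(u: TokenUsage): number {
  return u.cachedTokens / u.totalTokens;
}

// Tokens that still needed full computation.
function uncachedTokens(u: TokenUsage): number {
  return u.totalTokens - u.cachedTokens;
}

// Cost relative to paying the full (uncached) price for every token.
// cachedPriceRatio is the cached-token price divided by the uncached price;
// the value used below is a placeholder, not a real price list.
function relativeCost(u: TokenUsage, cachedPriceRatio: number): number {
  const billedAtFullPrice = uncachedTokens(u) + u.cachedTokens * cachedPriceRatio;
  return billedAtFullPrice / u.totalTokens;
}

// Numbers reported in the post: 60 million tokens, 58.5 million cache hits.
const run: TokenUsage = { totalTokens: 60_000_000, cachedTokens: 58_500_000 };

const hitRate = cacheHitRate(run);
const fullCompute = uncachedTokens(run);
const costShare = relativeCost(run, 0.1); // 0.1 is an assumed ratio

console.log(`hit rate: ${(hitRate * 100).toFixed(1)}%`);
console.log(`tokens needing full computation: ${fullCompute}`);
console.log(`cost relative to no caching: ${(costShare * 100).toFixed(2)}%`);
```

Under the placeholder ratio of 0.1, the run would cost roughly 12% of what it would cost with no cache hits at all, which illustrates why a stable prefix matters so much on long-horizon tasks.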
This is why running DeepSeek Flash with maka is more economical than using reasonix directly: maka maximizes DeepSeek’s cache mechanism.

Next goal: run the full 84-question terminal-bench. Making Flash this powerful isn’t the model’s credit; it’s the harness engineering’s credit.

# maka-agent/maka-agent

Source: https://github.com/maka-agent/maka-agent

# Maka

Maka is a local-first desktop AI workbench. It integrates model connections, conversations, tool permissions, file reading/writing, terminal execution, search, bot entry points, and run recovery into an Electron application. The goal is to let users run an observable, controllable, and continuously recoverable agent on their own computer.

This repository is under active development. This README serves two audiences:

- Users opening Maka for the first time: understand why you need to configure AI first, where data is stored, and which capabilities are already available.
- Engineers continuing to develop Maka: can quickly start, verify, locate key packages and design documents.

## What you’ll see

When entering Maka for the first time without an available model connection, the first screen guides you to complete AI configuration rather than showing an empty chat box that can’t send messages. The recommended path is:

1. Open Settings -> Models.
2. Select a real model provider, fill in the API key, or log in to an already integrated account.
3. Test the connection and select the default model.
4. Return to the main screen and start your first conversation using the quick input.

Currently integrated model types include:

- Overseas APIs: Anthropic, OpenAI, Google Gemini.
- Domestic APIs: DeepSeek, Moonshot, Z.AI Coding Plan, Kimi Coding Plan.
- Local models: Ollama.
- Custom gateway: OpenAI Compatible endpoint.
- Account subscription entries: Claude Subscription, Codex Subscription, Gemini CLI, etc. are displayed separately based on experimental/available status.
Entries not yet connected to the send pipeline won’t pretend to be available.

## Current capabilities

Maka is not a simple chat demo. It already has these core aspects:

- Desktop conversations: create, switch, archive, search, rename, stop, retry, regenerate, branch from a turn.
- Model runtime: provider runtime based on Vercel AI SDK, supports model streaming, tool calls, usage recording, error classification, and resume on startup.
- Local tools: Read, Write, Edit, Bash, Glob, Grep. File writing and command execution follow permission policies.
- First-run guidance: based on actual connection status, shows different states: “configure / select default connection / select default model / start conversation”.
- Settings center: models, accounts, usage statistics, daily review, local memory, voice models, open gateway, bot conversations, web search, network proxy, permissions & capabilities, health status, data & about.
- Local memory: MEMORY.md management, manual addition, archive/restore, agent read toggle.
- Web search: Tavily credential configuration, testing, and agent tool boundaries.
- Bot entry points: framework for configuring/testing/running Telegram, Feishu, WeCom, WeChat iLink, Discord, DingTalk, QQ.
- Open gateway: local HTTP/SSE API, uses tokens to protect external reading of session state, events, capabilities, and health summaries.
- Office document workflow: enabled after local officecli detection for reading, validation, and per-use authorization for editing.
- Runtime kernel: AgentRun ledger, RuntimeEvent read model, ToolRuntime, ModelAdapter, RunTrace, and recovery logic.

## Local & Privacy Boundaries

Maka stores working data by default in a workspace directory under Electron’s userData:

    /workspaces/default/
      llm-connections.json
      credentials.json
      settings.json
      sessions/

Important boundaries:

- Provider connection metadata and session JSONL are on the local filesystem.
- Runtime credentials such as Provider/API keys, bot tokens, proxy passwords, gateway tokens, and Tavily keys are written to local credentials.json; current format is file-first plaintext JSON, relying on OS account boundaries, with directory 0700 and file 0600 enforced on POSIX.
- Subscription OAuth tokens for Claude, Codex, Cursor, Antigravity, etc. use separate Electron safeStorage paths; these token stores fail closed when safeStorage is unavailable.
- The renderer doesn’t directly access plaintext keys; Settings only shows masked status and test results.
- File read/write, shell, and dangerous operations require permission engine approval.
- Capabilities like Incognito / Privacy context, Memory, Voice, Workspace instructions have separate contract documents governing them.

## Quick Start

The repository uses npm workspaces. Although pnpm-workspace.yaml exists, the current scripts and lockfile are npm-based.

    npm install
    npm run dev

npm run dev will first build all workspaces, then start the Electron desktop app. If ELECTRON_SKIP_BINARY_DOWNLOAD=1 was set during dependency installation, you need to install the Electron platform binary before starting:

    node node_modules/electron/install.js

Common dev commands:

    npm run build
    npm run typecheck
    npm --workspace @maka/desktop run test
    npm --workspace @maka/runtime run test
    npm --workspace @maka/core run test

For desktop visuals and real window verification:

    npm --workspace @maka/desktop run screenshots
    npm --workspace @maka/desktop run screenshots:diff:stable
    npm --workspace @maka/desktop run smoke:real-window

Basic checks before release:

    npm run check:release

## Optional Environment Variables

These variables only affect local development or specific capabilities:

| Variable | Purpose |
| --- | --- |
| ANTHROPIC_API_KEY | Bootstrap Anthropic connection on first launch. |
| OPENAI_API_KEY | Bootstrap OpenAI connection on first launch. |
| TAVILY_API_KEY / MAKA_TAVILY_API_KEY | Tavily credential source for web search. |
| MAKA_RIVE_BIN / RIVE_BIN | Specify the rive CLI used by the Rive workflow. |
| MAKA_VISUAL_SMOKE_FIXTURE | Enable deterministic visual fixture, dev/test build only. |

## Project Structure

    apps/desktop/
      src/main/        Electron main process, IPC, settings, OAuth, bot, gateway
      src/preload/     window.maka preload bridge
      src/renderer/    React desktop UI and Settings surfaces
    packages/core/     Pure contracts: sessions, events, settings, permissions, model connections
    packages/storage/  File-backed session, settings, connection, run-ledger stores
    packages/runtime/  SessionManager, AgentRun, AI SDK runtime, tools, bots, telemetry
    packages/ui/       Shared rendering components, markdown, artifacts, redaction helpers
    docs/              Product, runtime, design-system, privacy and test-plan contracts
    scripts/           Build hygiene, screenshot, smoke and release helpers

## Runtime Architecture

The current runtime has been refactored from a single large flow into clearer kernel boundaries:

    SessionManager -> AgentRun -> AiSdkBackend -> ModelAdapter -> ToolRuntime -> RunTrace -> AgentRunStore

Key principles:

- SessionManager remains the public runtime API exposed to desktop, bots, and gateway.
- AgentRun is responsible for the durable run facts and resume-on-startup of a single turn.
- ToolRuntime handles tool input validation, permissions, watchdog, abort, telemetry, artifact candidates, and error classification.
- ModelAdapter isolates provider stream/error/usage normalization.
- RunTrace is best-effort; trace write failures must not affect user conversation.

More details:

- docs/runtime-kernel.md
- docs/runtime-v2-architecture-evolution.md
- docs/runtime-v2-implementation-notes.md

## UI & Product Quality Contracts

Maka’s UI is not hastily built; it already has a dedicated design system and test plan:

- docs/design-system.md: color, density, states, animation, Settings IA, copy, and a11y contracts.
- docs/ui-quality-plan.md: real window, visual screenshots, interaction states, regression verification strategy.
- docs/full-product-test-plan.md: complete QA route from first-run, settings, sessions, tools, search, bot, gateway to failure paths.

When changing UI, don’t just run TypeScript. At minimum, include:

1. Corresponding node:test contracts for the surface.
2. Pass check-console / check-a11y.
3. Add visual fixtures or real-window smoke tests when necessary.

## Pre-contribution Checks

For typical code changes, at least run:

    npm run typecheck --workspaces --if-present
    npm run build
    git diff --check

For changes involving desktop renderer / Settings / IPC, also run the corresponding focused suites, e.g.:

    npm --workspace @maka/desktop run test -- settings-form-a11y-contract visible-copy-hygiene-contract

For runtime / storage changes, run the corresponding workspace tests:

    npm --workspace @maka/runtime run test
    npm --workspace @maka/storage run test

## Related Documentation

- CHANGELOG.md: summary of unreleased changes.
- SECURITY.md: security boundaries and reporting methods.
- docs/workspace-privacy-context.md: workspace privacy context.
- docs/search-service-threat-model.md: search service threat model.
- docs/memory-threat-model.md: local memory threat model.
- docs/voice-threat-model.md: voice capability boundaries.
- docs/maka-capability-audit-v1.md: capability maturity audit and future roadmap.
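The Runtime Architecture section states that RunTrace is best-effort and that trace write failures must not affect the user conversation. One way to picture that principle is a wrapper that swallows sink errors. This is a minimal sketch under assumptions: `TraceEvent`, `TraceSink`, `bestEffort`, and `handleTurn` are hypothetical names for illustration and do not reflect Maka's actual RunTrace code.

```typescript
// Hypothetical trace types; not taken from Maka's runtime package.
type TraceEvent = { runId: string; kind: string; at: number };
type TraceSink = (event: TraceEvent) => void;

// Wraps a sink so that any failure while writing a trace is reported to
// onDrop and otherwise ignored, instead of propagating to the caller.
function bestEffort(
  sink: TraceSink,
  onDrop: (err: unknown) => void = () => {},
): TraceSink {
  return (event) => {
    try {
      sink(event);
    } catch (err) {
      onDrop(err);
    }
  };
}

// A sink that always fails, standing in for a full disk or a locked file.
const dropped: unknown[] = [];
const failingSink: TraceSink = () => {
  throw new Error("disk full");
};
const trace = bestEffort(failingSink, (err) => {
  dropped.push(err);
});

// The conversation path calls trace() but never depends on it succeeding.
function handleTurn(userText: string): string {
  trace({ runId: "run-1", kind: "turn-start", at: Date.now() });
  return `echo: ${userText}`;
}

const reply = handleTurn("hello");
console.log(reply, `(dropped trace events: ${dropped.length})`);
```

The design choice is that observability is allowed to degrade silently while the user-facing turn still completes; durable run facts, by contrast, are described above as the responsibility of AgentRun rather than the trace.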

Similar Articles

@jakevin7: DeepSeek cache hit rate 95%, feels great. Maka's performance under the latest round of long-context tasks with the DeepSeek model is outstanding. Total runtime close to 18 hours, nearly 400 million tokens, cost 33 bucks. The Maka builders are amazing…

@jakevin7: I increasingly feel that Maka is very suitable for learning Agent. For example, recently a Maka core dev raised an issue discussing DeepSeek's cache optimization. The whole process is transparent: 1 issue + 8 PRs pushed through, from usage normalization → …

@jakevin7: Maka has been sprinting hard in the past two days, and the most noteworthy thing is out. Autonomous Task Loop v1 is live. Previously, Maka would run an agent and be done. Now it's a persistent loop: preflight → runtime → SelfCheck…

@geekbb: MCP tool that offloads low-risk tasks from Codex to DeepSeek, letting expensive models only make judgments. Average 48% cost savings over five test tasks with about 6 seconds latency. CodexSaver is an MCP tool that delegates low-risk tasks (writing tests, documentation, code explanations...) in Codex coding sessions...

@jakevin7: Sharing something interesting Maka is currently working on: letting agents automatically optimize their own system prompt, fully closed-loop, without any human intervention. Karpathy's autoresearch, AEGIS, etc. have explored similar directions—a goal-driven self-reinforcement learning system.