Tunnel vision is what happens when attention, judgment, even perception itself gets compressed into too narrow a field. What stands right in front of us grows brighter and brighter, while the surrounding signals — sometimes the ones that matter more — begin to fade.
Influenced by OpenClaw, my team went through an unusually intense stretch last March. We were building an Agentic OS, an agent-native Linux OS, and the pace tightened almost at once. A release every two weeks. Daily syncs. Weekly milestones.
The message from management was simple enough: move fast. Let AI do the quick part. If you wait until everything is fully thought through, the moment will be gone. Someone else will already have launched. This wave has to be caught, and so on.
What felt strange was that the part most worth pausing over became the part few people wanted to ask about. Were we actually speeding up the resolution of problems, or merely speeding up the production of proof that problems were being resolved?
Narrower vision does not make judgment sharper
The most dangerous thing about tunnel vision is not that it leaves you blind. It is that what remains in view can look uncannily like the right answer.
Under speed, tension, fear, exhaustion, and pressure, a person’s attention narrows into a point. Organizations do something similar. As outside examples grow more dazzling, it becomes easier for teams inside to compress their attention around the most visible signals: speed, launch, PR, visibility, the sense of staking out ground.
Then complex problems get translated into actions that are easier to manage: shorten the cycle, increase the syncs, move reporting forward, underline the milestones.
None of these moves is irrational on its own. Layer them together, though, and they begin to change in character. They start carrying a function that was never really theirs: using higher-frequency visibility to fill the gap where judgment should have offered steadiness.
A narrowed field of vision is not the same thing as clearer judgment. More often, it simply makes it easier to miss the things that decide the outcome: whether the problem has been defined clearly, whether the use case has truly converged, whether the boundaries have been drawn, whether the risks have been made explicit, and who will finally carry the cost.
AI lowers the threshold for prototyping, not the difficulty of landing a complex system
AI coding makes it faster to write something. Agent-based products make it easier to get something running. What once took weeks to assemble can now be placed on the table in a matter of hours.
The stimulus this creates for management is immediate: if others can get something running, why are we still discussing boundaries?
And this is where organizations begin to misread an external signal as an internal conclusion. If it runs, then surely all that remains is execution.
But the part that has not become any simpler — goal definition, scenario convergence, permission boundaries, quality standards, coordination mechanisms, risk attribution, rollback when things fail — gets pushed quietly to the edge, largely because it does not shine.
The cheaper the prototype becomes, the easier it is to create the feeling that success is just around the corner. And once that feeling takes hold, the path of least resistance is often not to think the problem through, but to turn the pace all the way up instead: compress the cycle, increase reporting, raise visibility, use control to patch the place where judgment is still incomplete. The first path asks people to carry uncertainty. The second can be driven by anxiety alone.
The mark of anxiety-driven management: it organizes pressure around time, rather than resources around the problem
Healthy management should organize resources around the problem itself:
Is the problem actually clear?
Can half the scenarios be cut away?
Which risks need to be surfaced early?
How should the stages be set, and when is something ready for a sandbox, a pilot, or limited rollout?
Anxiety-driven management tends to organize pressure around time:
Did we sync today?
Was the daily report sent?
Why is there still no more definite result?
Can the cycle be compressed a little further?
Both sets of questions can look like progress. But they are not moving the same thing forward. One is working on the problem. The other is trying to manage unease.
Over time, I came to understand why this kind of behavior in an organization left me not only dissatisfied, but quietly angry. It creates a subtle reward structure: high-frequency syncs naturally reward what can be shown, and punish the judgment that cannot. After a while, teams become better at translating work into visible progress than at translating uncertainty into boundaries.
And what we get in return is something that looks remarkably like efficiency: busier, denser, faster, more visible — and not necessarily any closer to being right.
It is not surprising that Linux OS security teams feel the fracture earlier
Not every team hits the wall at the same moment. A team like mine, working on Linux OS security, will run into it sooner, because what we face is not only whether something can be built, but what kind of capability it acquires once built, and what follows from that.
At the application layer, getting a prototype running early may genuinely help with trial and error. A rough result does not always turn into a systemic incident. But in OS security, the logic of a demo does not carry cleanly into a real system. What is at stake here is system boundaries, execution privileges, the radius of failure, the cost of rollback, and the burden of auditability. You may be able to accept that something is not smart enough yet. It is much harder to accept that it has already been given capabilities it should never have had.
What is truly expensive in security is often not getting something to move, but answering the questions that attract little glamour: should it have this kind of capability at all? Within what scope does it run? What can it access? Who absorbs the consequence when it gets something wrong? Where is the rollback path?
These questions do not make for bright lines on a launch deck. Yet in complex systems, they are the real cost structure. What is unfortunate is that, before a management team already deep in FOMO, this entire layer of judgment and tradeoff can disappear beneath tunnel vision.
What exhausts people is not only the workload. It is the need to keep proving that the work is real.
After speaking with the team, I realized that what many people resented was not necessarily the daily report itself. It was closer to a condition of work: being interrupted again and again, questioned again and again, asked again and again to prove one’s value. You are not only doing the work. You are also being asked to keep producing evidence that the work is happening.
The worst part is not simply the time it takes. It is what it does to the structure of attention. Deep judgment depends on continuity. Many important decisions do not appear inside a single sync. They grow slowly, inside a stretch of thought that remains relatively whole.
Once thinking is cut into fragments, people begin to slide from solving problems into performing coherence for the organization. You start choosing what can be shown right away, rather than what actually matters but remains invisible for now.
For me, there is another cost as well. It enters the body. This is not the fatigue of a single sprint. It is the slower depletion that comes from living too long in a state of alertness.
I am not against speed. I am against using high-frequency execution to pretend that uncertainty has already become low.
To be fair, management’s anxiety does not come from nowhere. External change is accelerating. Expectations are being rewritten. Daily reporting and high-frequency syncs are not useless by nature. In incident response, or in work that is already clearly defined, or when multiple teams truly do need to coordinate closely, they can even be necessary.
What I object to is something else: the problem has not yet converged, the boundaries have not yet been drawn, and yet the work is already being managed at the tempo of a high-certainty execution phase. That is not execution. It is density being used in place of judgment.
Technical teams cannot hide behind complexity either. Naming boundaries is not the same as refusing action. Clarifying risk is not the same as being conservative. Maturity is not about moving slower. It is about distinguishing more quickly what can be done first, what must be thought through first, and what must never cross the line.
Focus is not the same as narrowing. Real focus gathers resources while preserving a sense of the whole. Narrowing, under pressure, drops the surrounding information until all that remains is a local objective growing brighter — and more dangerous.
What truly deserves acceleration is the ability to make the problem converge
If there is something in the age of AI that deserves to be accelerated, I would place my hope in a few forms of reasoning that are less dazzling, but cheaper in the long run:
Problem definition: do not wrap direction in big words too early. Ask first: what concrete problem are we actually solving? Why does it deserve to be solved in a new way?
Scenario convergence: anxiety makes organizations want everything at once — narrative, product, platform, cloud migration, strategic positioning. Mature acceleration often begins with a willingness to cut.
Making boundaries explicit: permissions, data, auditability, rollback, responsibility — these should not be left for later, once something is already running. The closer you get to high-privilege systems, the earlier these questions need to be laid out in the open.
Stage-gate judgment: what is only a prototype, what belongs in a sandbox, what is ready for a pilot, and what must never touch a real environment. This matters more than shaving a few more days off the cycle.
Speed, by itself, is not a capability. The real capability lies in knowing where speed belongs.
Whether AI rebuilds organizations may finally depend on what it amplifies first
AI will almost certainly keep reshaping products, workflows, and roles. That much is hard to stop. But before any real reconstruction begins, many organizations may go through something else first: AI acts like an amplifier, enlarging the habits and instincts that were already there.
When an organization meets AI, what gets amplified first?
Do you clarify the problem first, or heat up the story first? (the ability to define the problem vs. the impulse to build a narrative)
Do you make the boundaries explicit first, or tighten the tempo first? (the ability to judge boundaries vs. the instinct to manage anxiety through control)
Do you establish stage gates first, or simply make reports and syncs more frequent? (mature gating vs. denser reporting)
Do you widen the field and see the risks, costs, and stopping conditions — or, the busier things become, does attention narrow until all anyone can see are demos, PR, and launch speed? (a wider field of judgment vs. a narrower tunnel)
If the first thing a change produces is not better judgment, but faster self-proof, then what is it really rebuilding? Is it rebuilding organizational capability at all?
What makes it sadder is that quite a few managers seem genuinely excited by precisely this.
The output was unstable. My first instinct was the usual scapegoat—LLM hallucination. But a graph runtime is supposed to reduce this kind of unpredictability, not amplify it.
After tracing the execution path, I found the culprit: _generate_cvss_vector was being scheduled twice. That directly contradicted my intended topology.
I’ll skip the play-by-play debugging here. What matters is what the anomaly triggered: a deeper look into agent orchestration—and the design patterns that fall out of it.
Rethinking Agent Orchestration
Where today’s orchestration starts to crack
As systems evolve from “generative AI” (single-shot text) into autonomous agents, architecture becomes the real stability lever—more than prompts, more than model choice.
Early paradigms favored chains: linear prompt sequences that work well for small, bounded tasks. But once an agent needs to plan, call tools, reflect, and iterate, a linear DAG (Directed Acyclic Graph) becomes a poor fit.
An agent is not a clean input–output pipeline. It is a loop of Perception → Reasoning → Action → Observation, repeated until termination—if termination exists at all. That cyclic nature violates the “acyclic” assumption. Meanwhile, many systems are drifting toward multi-agent setups: planners, executors, critics, and retrievers collaborating in parallel, all sharing and mutating context.
At that point, you inherit the problems of distributed systems: race conditions, state consistency, cyclic dependencies, and fault tolerance.
So the question becomes: what orchestration model can represent cycles and parallel collaboration without turning the runtime into a guessing game?
LangGraph’s bet is to bring the BSP (Bulk Synchronous Parallel) model—battle-tested in HPC and big-data graph computing—into agent orchestration.
Why graph computing models?
Traditional software models systems as services or objects. An agent system behaves closer to a state machine traversing a graph, where state is the asset and transitions are the work.
Cycles are the default, not the exception. ReAct is basically Think → Act → Observe → Think. DAGs can express this only indirectly (recursion, outer loops, manual re-entry), which tends to complicate call stacks and context handling. BSP treats cycles naturally: a loop is simply an ongoing sequence of supersteps.
State is the center of gravity. In agent systems, context is not “data passing through”—it is the system. Decisions are functions of the current state. BSP forces explicit state management and versioning, which aligns unusually well with LLM-based workflows.
Parallelism needs a first-class synchronization primitive. Patterns like Map-Reduce fan-out or supervisor/worker collaboration require parallel work that later converges. BSP’s barrier gives you that synchronization point natively—without ad-hoc asyncio.gather, locks, or fragile ordering assumptions.
Google Pregel & the BSP model
The Pregel framework
Pregel can be summarized in three ideas:
How it computes: a vertex state machine — decide whether to work or to sleep
How it runs: the BSP execution model — decide how the system synchronizes
How it propagates: message passing — move values across edges
This is the core intuition behind “think like a vertex.” Each vertex has two key states:
Active: the vertex runs compute(), processes incoming messages, updates its value, and sends messages to neighbors.
Inactive (halted): the vertex “sleeps” after it votes to halt.
Wake-up: receiving a message brings a halted vertex back to Active.
On a cluster, computation is sliced into supersteps:
Compute: all active vertices run in parallel (read messages from step S-1 → compute → send messages for step S+1)
Messages: values are in flight
Barrier: everyone must finish step S—and messages must be delivered—before anyone enters step S+1
No one runs ahead; no one is left behind. That rhythm eliminates a large class of race conditions.
Example: spreading the maximum value (6) across a graph.
Superstep 0: Node 1 holds the value 6.
Message: Node 1 tells Node 2: “I have a 6.”
Superstep 1: Node 2 receives 6, compares it with its own value (3), updates to 6, and propagates further.
Result: the maximum spreads through the graph like a contagion.
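To make that rhythm concrete, here is a minimal, self-contained sketch of the same max-propagation example in plain Python. It is not Pregel's (or LangGraph's) API, only the superstep loop itself: compute on last step's messages, buffer outgoing messages, and deliver them only after the barrier.

graph = {1: [2], 2: [1, 3], 3: [2]}          # adjacency: node -> neighbors
values = {1: 6, 2: 3, 3: 1}                  # initial vertex values
inbox = {n: [] for n in graph}
active = set(graph)                          # superstep 0: every vertex is active
step = 0
while active:
    outbox = {n: [] for n in graph}          # messages for the *next* superstep
    for node in active:
        new_value = max([values[node]] + inbox[node])
        changed = new_value > values[node] or step == 0
        values[node] = new_value
        if changed:                          # stay active: announce the value to neighbors
            for neighbor in graph[node]:
                outbox[neighbor].append(new_value)
        # a vertex that learned nothing new votes to halt (sends nothing)
    inbox = outbox                           # barrier: messages only become visible next step
    active = {n for n in graph if inbox[n]}  # wake-up: a message reactivates a halted vertex
    step += 1

print(values)  # {1: 6, 2: 6, 3: 6}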
The BSP model
Proposed by Leslie Valiant, Bulk Synchronous Parallel (BSP) divides execution into sequential supersteps. In each superstep, three things happen:
Local computation: each processor computes independently on local data.
Communication: processors send messages, but those messages are not visible until the next step.
Barrier synchronization: everyone waits until computation and communication complete.
This tames the chaos: because messages are only visible after the barrier, every unit observes a globally consistent state from the previous step. For the programmer, the mental model is simpler: write logic that alternates between compute and communicate, bounded by a barrier.
Decoding the LangGraph runtime
So how does LangGraph implement BSP? The core engine is the PregelLoop.
StateGraph & message passing
Everything begins with state. You define a schema (often a TypedDict or a Pydantic model) representing the data that flows through the graph.
from typing import TypedDict, Annotated
import operator
class AgentState(TypedDict):
    messages: Annotated[list[str], operator.add]
    summary: str
The key detail is Annotated[list[str], operator.add]: it defines a channel and its reducer.
Channels: decoupling reads from writes
In BSP, nodes don’t mutate shared memory directly. They publish updates to channels.
LastValue (default): keep the latest value (good for overwrites).
BinaryOperatorAggregate: the backbone of safe parallel updates. A binary operator (e.g. operator.add) merges updates at the barrier. If multiple nodes emit updates in the same superstep, the runtime aggregates them deterministically—no lost updates, no races.
Topic: a pub/sub-like channel for transient events.
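To see the reducer channel at work, here is a small fan-out example built on the AgentState above. The node names are invented for illustration, and it assumes a recent langgraph release. Both fetch nodes run in the same superstep and both write to messages; operator.add merges their updates at the barrier instead of letting one overwrite the other.

from langgraph.graph import StateGraph, START, END

def fetch_a(state: AgentState):
    return {"messages": ["result from A"]}

def fetch_b(state: AgentState):
    return {"messages": ["result from B"]}

def summarize(state: AgentState):
    return {"summary": f"{len(state['messages'])} results collected"}

builder = StateGraph(AgentState)
builder.add_node("fetch_a", fetch_a)
builder.add_node("fetch_b", fetch_b)
builder.add_node("summarize", summarize)
builder.add_edge(START, "fetch_a")        # fan out: both fetch nodes fire in the same superstep
builder.add_edge(START, "fetch_b")
builder.add_edge("fetch_a", "summarize")  # converge: both updates are merged before summarize runs
builder.add_edge("fetch_b", "summarize")
builder.add_edge("summarize", END)
graph = builder.compile()

print(graph.invoke({"messages": [], "summary": ""}))
# messages contains both fetch results; summary reflects the merged state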
PregelLoop: the lifecycle of a superstep
The heartbeat is PregelLoop.tick.
Phase 1: Plan
At the start of a superstep, the runtime checks channel versions.
It’s data-driven: if a node subscribes to a channel updated in the previous step, that node becomes active.
If the previous step ended on a conditional edge, the routing function decides which nodes are activated next.
Phase 2: Execute (local computation)
Active nodes run in parallel.
Read isolation: each node reads a snapshot of state captured at the start of the step. Even if Node A emits updates, Node B (running concurrently) still sees the old snapshot.
Write buffering: node outputs are buffered; they are not applied immediately.
Phase 3: Update (barrier)
Once all active nodes have finished, their buffered writes are applied to the channels through the reducers; only then does the barrier lift and the next superstep begin.
LangGraph source code (conceptual)
State and channels
State behavior is defined by the underlying channel type.
| Channel class | Update logic | Typical use |
| --- | --- | --- |
| LastValue | value = new_value (overwrite) | flags, latest query |
| BinaryOperatorAggregate | value = reducer(value, new_value) | chat history (add_messages), parallel results |
| Topic | append to a queue | pub/sub, event streams |
# BinaryOperatorAggregate (reducer channel type)
class BinaryOperatorAggregate(BaseChannel):
    def __init__(self, operator, initial_value):
        self.operator = operator  # e.g., operator.add
        self.value = initial_value

    def update(self, values):
        if not values:
            return False
        for new_val in values:
            if isinstance(new_val, Overwrite):
                self.value = new_val.value
            else:
                # Apply reducer: old + new -> updated
                self.value = self.operator(self.value, new_val)
        return True
Pregel loop and supersteps (simplified)
class PregelLoop:
    async def execute(self, initial_state):
        # 1. Initialize channels
        self.channels = self.initialize_channels(initial_state)
        # 2. Superstep loop
        while not self.is_terminated():
            # --- Phase A: Plan ---
            tasks = []
            for node in self.nodes:
                # Trigger: input channel updated in the previous step
                if self.check_trigger(node, self.channels):
                    # Read snapshot (immutable)
                    input_snapshot = self.read_channels(node.inputs)
                    tasks.append((node, input_snapshot))
            if not tasks:
                break
            # --- Phase B: Execute (parallel) ---
            # Nodes cannot observe each other's writes within the same step
            results = await parallel_execute(tasks)
            # --- Phase C: Update (barrier) ---
            for node, result in results:
                writes = self.parse_writes(node, result)
                for channel, values in writes:
                    self.channels[channel].update(values)
            # --- Phase D: Checkpoint ---
            self.checkpointer.put(self.channels.snapshot())
            self.step += 1
Checkpointer and “time travel”
A checkpoint is not just a save file; it’s a logical clock.
It stores both channel_values (user data) and channel_versions (synchronization metadata). That enables “time travel”: load any previous checkpoint, replay execution, or fork a new branch from a past state. For debugging multi-step agent behavior, this is not a nice-to-have—it changes what is possible.
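As a rough sketch of how that is used in practice (reusing the builder from the earlier example; API names follow current langgraph docs and may differ across versions): compile with a checkpointer, run under a thread_id, then walk the checkpoint history and resume or fork from an earlier point.

from langgraph.checkpoint.memory import MemorySaver

graph = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "run-1"}}
graph.invoke({"messages": [], "summary": ""}, config)

# Each superstep left a checkpoint behind; walk them newest-first.
for snapshot in graph.get_state_history(config):
    print(snapshot.config["configurable"]["checkpoint_id"], snapshot.values)

# "Time travel": pick an earlier checkpoint and replay (or fork) from it.
past = list(graph.get_state_history(config))[-2]
graph.invoke(None, past.config)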
Interrupts
In standard Python, pausing mid-await and serializing the suspended execution context is painful.
In BSP, the barrier between supersteps is a natural pause point. When an interrupt is configured (e.g. interrupt_before=["node_A"]), the runtime simply stops scheduling at the barrier, persists state, and exits. Resuming is just: reload checkpoint → continue with the next superstep.
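A minimal sketch of that pause-and-resume cycle, again on the toy graph above (the summarize node stands in for the node_A of the example; exact signatures may vary by langgraph version):

graph = builder.compile(
    checkpointer=MemorySaver(),
    interrupt_before=["summarize"],   # stop at the barrier right before this node runs
)
config = {"configurable": {"thread_id": "review-1"}}
graph.invoke({"messages": [], "summary": ""}, config)   # runs the fetch nodes, then pauses and returns

# A human (or another system) can inspect or edit the persisted state here.
graph.update_state(config, {"messages": ["note added during review"]})

# Resuming is just: reload the checkpoint and continue with the next superstep.
graph.invoke(None, config)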
Framework comparison
| Feature | LangGraph (BSP) | Native asyncio | Notes |
| --- | --- | --- | --- |
| Control flow | step-wise: read → run → write → sync | continuous callbacks / awaits | BSP is structured and easier to reason about; asyncio can be faster but harder to audit |
A graph can be wrapped as a node inside another graph. The parent graph pauses while the child graph advances through its own supersteps. This supports modularity and isolation—useful when you want complex agents without a monolith.
---
config:
theme: 'base'
themeVariables:
primaryColor: '#BB2528'
primaryTextColor: '#fff'
primaryBorderColor: '#7C0000'
lineColor: '#F8B229'
secondaryColor: '#006100'
tertiaryColor: '#fff'
background: '#f4f4f4'
---
graph LR
subgraph Parent Graph
Start --> Router
Router -->|Complexity High| SubGraphNode
Router -->|Complexity Low| SimpleNode
SubGraphNode --> End
SimpleNode --> End
end
subgraph SubGraphNode [Child Graph execution]
direction LR
S_Start((Start)) --> Agent1
Agent1 --> Critiques
Critiques -->|Reject| Agent1
Critiques -->|Approve| S_End((End))
end
Because state is decoupled from execution, you can “freeze” the world, let a human edit state (e.g., correct a bank transfer amount), and then resume as if the world had always been consistent.
Returning to the original bug, I applied these ideas to the agent I was developing.
Speed and isolation
I used parallel execution for data fetching (get_cve_data, get_cvss_data) to reduce latency. To avoid context pollution—where a large context from one branch (e.g., ASD generation) bleeds into another—I used subgraphs to isolate execution contexts.
To resolve the scheduling/synchronization issue, I added a no-op barrier node.
# No-op node to synchronize paths
def sync_barrier(state: CVSSVectorState):
    return {}
builder.add_node("sync_barrier", sync_barrier)
# ... route conditional edges to sync_barrier ...
# Only proceed after the barrier
builder.add_edge("sync_barrier", "normalize_cvss_data")
By making the topology explicitly respect the BSP rhythm, the “double execution” vanished. The runtime returned to a predictable cadence: compute, wait, advance.
Closing thoughts
“Knowing the tool” is the first step. “Knowing the model behind the tool” is where leverage comes from.
Moving from chains to graphs is not just a syntax upgrade—it changes how we think about time, state, and consistency in agent systems. Once you see the barrier as a clock, many problems stop being mysterious.
An agent system is, at its core, no longer a simple input-output pipeline but an open-ended loop of Perception, Reasoning, Action, and Observation. That cyclic nature breaks the traditional DAG assumption. More importantly, agent systems rarely act alone anymore: they evolve into Multi-Agent Systems in which several agents specialized in different domains (planners, executors, reviewers) must work in parallel and share context.
Viewed through an engineering lens, as the "Scaling Laws" face increasing scrutiny, I find myself agreeing with the growing consensus: Large Language Models (LLMs) are entering a "middle age" of calculated efficiency—a time for harvesting fruits rather than just planting forests.
In his Thanksgiving letter, Andrew Ng noted that while there may be bubbles in AI, they are certainly not in the application layer:
AI Application Layer: Underinvested. The potential here far exceeds common perception.
AI Inference Infrastructure: Still requires significant investment.
AI Training Infrastructure: I remain cautiously optimistic, though this is where a bubble might exist.
Context
As Generative AI transitions from experimental labs to large-scale commercial deployment, inference efficiency has become the critical variable determining economic viability. In the current landscape dominated by the Transformer architecture, the marginal cost of inference is constrained not by pure compute (FLOPs), but by the "Memory Wall."
As context windows expand from the early 4k tokens to 128k, 1M, and even 10M, managing the Key-Value (KV) Cache has emerged as the primary bottleneck for system throughput and latency.
This analysis spans from underlying physical principles to high-level application strategies. We begin by dissecting the mathematics of the KV Cache during decoding and its consumption of memory bandwidth. We then trace the architectural evolution from Multi-Head Attention (MHA) to Grouped-Query Attention (GQA), and finally to the Multi-Head Latent Attention (MLA) pioneered by DeepSeek. MLA, in particular, achieves extreme compression through the decoupling of low-rank matrix decomposition and Rotary Positional Embeddings (RoPE), laying the physical foundation for "disk-level caching."
On the system software front, we examine how vLLM’s PagedAttention borrows paging concepts from operating systems to solve fragmentation, and how SGLang’s RadixAttention utilizes Radix Trees for dynamic KV reuse. We also touch upon StreamingLLM, which exploits the "Attention Sink" phenomenon to bypass window limits for infinite streaming.
Finally, we survey the market implementation of Prompt Caching (Google, Anthropic, OpenAI, DeepSeek, Alibaba), contrasting the "High-Performance Memory" route against the "Architecture-Driven Low-Cost" route.
1. The Physical Bottleneck: Seeing Through the KV Cache
Before discussing optimization, we must understand—from first principles—why the KV Cache is the Achilles' heel of large model inference. It is not merely a question of capacity, but a conflict between Memory Bandwidth and Arithmetic Intensity.
1.1 The Autoregressive Nature of Transformer Decoding
Inference in Transformers occurs in two distinct phases:
Prefill Phase: The model processes all input tokens in parallel. Because this is highly parallelizable, it is usually Compute-bound. GPU utilization is high.
Decoding Phase: The model generates subsequent tokens one by one. This is an Autoregressive process; generating the $t$-th token depends on the internal state of the previous $t-1$ tokens.
Here, $Q$ (Query) is the vector for the current step, while $K$ (Key) and $V$ (Value) hold information from all history tokens. To avoid recalculating the $K$ and $V$ projections for the entire history at every new step, the system stores these vectors in VRAM. This is the KV Cache.
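Written out for a single decoding step, the standard attention formula makes the dependency explicit: the current query vector $q_t$ has to be scored against the keys and values of every token generated so far,

$$\text{Attention}(q_t, K_{1:t}, V_{1:t}) = \text{softmax}\!\left(\frac{q_t K_{1:t}^{\top}}{\sqrt{D_{head}}}\right) V_{1:t}$$

so caching $K_{1:t}$ and $V_{1:t}$ trades VRAM for not recomputing those projections at every step.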
1.2 The Math of VRAM Consumption
KV Cache size is a linear function of sequence length, multiplying with layers, heads, and dimensions. For a standard Transformer, it can be calculated as:
$$\text{KV Cache (bytes)} = 2 \times L_{seq} \times B_{batch} \times N_{layers} \times H_{heads} \times D_{head} \times P_{prec}$$
where the factor of 2 accounts for storing both $K$ and $V$, and:
$L_{seq}$: Current sequence length (context window).
$B_{batch}$: Batch size of concurrent requests.
$N_{layers}$: Number of layers.
$H_{heads}$: Number of attention heads.
$D_{head}$: Dimension per head.
$P_{prec}$: Precision (2 bytes for FP16).
Case Study: Llama-2 70B. Assuming FP16 precision, a sequence length of 4096, and a batch size of 1:
$N_{layers} = 80$
$H_{heads} = 64$
$D_{head} = 128$
The KV Cache for a single request is: $$2 \times 4096 \times 1 \times 80 \times 64 \times 128 \times 2 \approx 10.7 \text{ GB}$$
If we extend the context to 100k tokens, this swells to 260 GB. This far exceeds the capacity of a single NVIDIA A100 (80GB) or H100. Consequently, memory capacity limits Batch Size, preventing the GPU cores from being fully utilized, driving up unit costs.
1.3 The Memory Wall
Beyond capacity, bandwidth is the silent killer. During decoding, for every token generated, the GPU must move the entire KV Cache from High Bandwidth Memory (HBM) to the on-chip SRAM for calculation.
Compute (FLOPs): Grows linearly.
Data Transfer (Bytes): Also grows linearly.
However, because the matrix multiplication degenerates into a vector operation (Query vector), the Arithmetic Intensity (FLOPs/Bytes ratio) is extremely low. Even with an H100's massive bandwidth (~3.35 TB/s), the GPU spends most of its time waiting for data. This is the definition of a Memory-bound scenario.
2. Architectural Evolution: From MHA to MLA
To shrink the KV Cache, architects have performed surgery on the heart of the Transformer.
2.1 Multi-Head Attention (MHA): The Expensive Baseline
In the original Attention Is All You Need, the model has $H$ Query Heads and $H$ Key/Value Heads.
Mechanism: Each Query Head has a unique KV pair. Maximum expressiveness.
Cost: Size is proportional to $H$. In the long-context era, this became unsustainable.
2.2 Multi-Query Attention (MQA)
Mechanism: All Query Heads share one Key Head and one Value Head.
Compression: $H : 1$. (e.g., 64x reduction).
Trade-off: Radical memory savings, but the model loses the ability to "attend" to different nuances simultaneously, often degrading perplexity. Used in PaLM and Falcon.
2.3 Grouped-Query Attention (GQA): The Golden Mean
Introduced with Llama-2, GQA became the standard for open-source models (Llama-3, Mistral, Qwen).
Mechanism: Query Heads are divided into $G$ groups. Each group shares a KV Head.
2.4 Multi-Head Latent Attention (MLA)
DeepSeek-V2 (and V3) introduced MLA, which is not just a grouping strategy, but a fundamental reconstruction of storage.
2.4.1 Low-Rank Compression
Instead of storing the full $d_{model} \times L$ matrices, MLA assumes redundancy. It projects the input into a low-dimensional "Latent Vector" ($c_{KV}$) and stores only this compressed version. During computation, it projects this vector back up to the full dimension. This reduces memory footprint from $O(H \times d_{head})$ to $O(d_{latent})$.
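A toy numpy sketch of the low-rank idea (dimensions and weights are invented for illustration; this shows the concept, not DeepSeek's actual implementation): only the small latent vector enters the cache, and keys/values are re-expanded from it at attention time.

import numpy as np

d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

# Learned projections (random stand-ins here).
W_down = np.random.randn(d_model, d_latent) * 0.02           # compress hidden state -> latent
W_up_k = np.random.randn(d_latent, n_heads * d_head) * 0.02  # latent -> keys
W_up_v = np.random.randn(d_latent, n_heads * d_head) * 0.02  # latent -> values

x = np.random.randn(d_model)   # hidden state of one token

c_kv = x @ W_down              # this 512-dim vector is all that goes into the KV cache
k = c_kv @ W_up_k              # reconstructed on the fly during attention
v = c_kv @ W_up_v

print(c_kv.shape, k.shape, v.shape)  # (512,) cached vs (4096,) each recomputed: the footprint shrinks sharply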
2.4.2 Decoupled RoPE
The challenge with compression is Rotary Positional Embeddings (RoPE). RoPE is geometrically sensitive; applying it to a compressed vector destroys position information.
DeepSeek's solution: Decoupling.
Content Head: Captures semantics, uses low-rank compression (No RoPE).
Position Head: A separate, tiny vector specifically carrying RoPE info.
Concatenation: They are joined only during the attention score calculation.
This allows the KV Cache to be 1/5th the size of GQA models. Crucially, it makes moving the cache to SSD/RAM feasible because the bandwidth requirement drops drastically.
| Feature | MHA (Llama-1) | MQA (Falcon) | GQA (Llama-3) | MLA (DeepSeek-V3) |
| --- | --- | --- | --- | --- |
| KV Heads | = Query Heads ($H$) | 1 | Groups ($G$) | Virtual/Dynamic |
| VRAM Usage | High (100%) | Very Low (~1-2%) | Medium (~12-25%) | Extreme (~5-10%) |
| Performance | Baseline | Lossy | Near Lossless | Lossless/Better |
| RoPE | Native | Native | Native | Decoupled |
3. System-Level Management: OS Concepts Reborn
If architecture defines the "theoretical minimum," system software determines how we place that data on hardware.
3.1 PagedAttention (vLLM)
Before vLLM, memory was allocated statically based on "Max Sequence Length," leading to fragmentation and 60-80% waste.
3.1.1 The Principle
Inspired by Virtual Memory paging:
KV Block: Data is sliced into fixed blocks (e.g., 16 tokens).
Non-contiguous: Blocks can live anywhere in physical memory.
Block Table: Maps logical flow to physical blocks.
Impact:
Zero Waste: Internal fragmentation is limited to the last block.
Memory Sharing: Multiple requests sharing a System Prompt ("You are a helpful assistant...") point to the same physical blocks. Copy-on-Write is triggered only when they diverge. This is the foundation of Prompt Caching.
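A toy sketch of the block-table idea in plain Python (nothing here reflects vLLM's real data structures): logical token positions map through a per-request block table to fixed-size physical blocks, and two requests that share a prefix can point at the same physical blocks.

BLOCK_SIZE = 16   # tokens per KV block

class Allocator:
    def __init__(self):
        self.next_id = 0
    def allocate(self) -> int:
        self.next_id += 1
        return self.next_id - 1

class BlockTable:
    def __init__(self, allocator: Allocator):
        self.allocator = allocator
        self.blocks = []                       # logical block index -> physical block id

    def append_token(self, position: int) -> int:
        if position % BLOCK_SIZE == 0:         # current block is full: grab a fresh one anywhere
            self.blocks.append(self.allocator.allocate())
        return self.blocks[position // BLOCK_SIZE]

allocator = Allocator()
req_a, req_b = BlockTable(allocator), BlockTable(allocator)

for pos in range(32):                          # request A caches a 32-token system prompt
    req_a.append_token(pos)

req_b.blocks = list(req_a.blocks)              # request B shares the prompt: same physical blocks
# (copy-on-write would fork a block only once the two requests diverge)

print(req_a.blocks, req_b.blocks)              # e.g. [0, 1] [0, 1]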
3.2 RadixAttention (SGLang)
SGLang views the KV Cache not as a linear array, but as a Radix Tree.
Nodes: KV Cache states.
Edges: Token sequences.
3.2.2 Automatic Reuse
Scenario: the user asks A and receives answer B, then asks a follow-up C. The prefix A->B is already a path in the tree, so the system reuses those cached KV entries and only computes the new suffix.
LRU Eviction: When memory fills, leaves are pruned first.
3.3 StreamingLLM
For infinite streams (e.g., digital humans), simple sliding windows break the model. MIT researchers discovered Attention Sinks: The first few tokens (usually 4) anchor the entire attention mechanism. StreamingLLM keeps these "sink tokens" permanently and slides the rest, allowing infinite length with stable perplexity.
4. Extreme Compression: Quantization
FP8: Supported by H100, halves memory usage with negligible loss.
INT4: Difficult due to "Outliers" in the Key/Value matrices. Techniques like SmoothQuant and KIVI migrate outliers to weights or keep them in high precision to make INT4 viable.
5. Market Landscape: The Battle of Caching
2025 marks the era of "Context Caching" as a standard product.
5.1 DeepSeek: The Price Butcher
Leveraging MLA, DeepSeek moves cache to Disk (SSD).
Price: $0.014 / million tokens (Hit). This is ~0.5% of OpenAI's price.
Storage: Free.
TTL: Hours to days. Ideal for long-tail knowledge retrieval.
5.2 Google Gemini: TPU Scale
Implicit: Automatic for Flash models.
Explicit: For Pro models. A "Lease" model—you pay a storage fee per hour. Only economical for high-frequency queries.
5.3 Anthropic Claude: High-Speed RAM Lease
Targeted at coding and high-interaction tasks.
TTL: 5 minutes.
Mechanism: Explicit breakpoints.
Economics: You pay a premium (1.25x) to write to cache. You must reuse it within 5 minutes to break even.
5.4 OpenAI & Alibaba
OpenAI: Conservative. 50% discount on hits. No write premium.
Alibaba (Qwen): Mixed mode. Strong support for long contexts (10M tokens).
| Vendor | Mechanism | Storage Medium | TTL | Write Cost | Read Cost | Storage Fee |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek | Implicit | SSD/Disk | Long | 1.0x | ~0.05x | Free |
| Anthropic | Explicit | HBM | 5 min | 1.25x | 0.10x | Included |
| Google | Hybrid | TPU HBM | 1 hour+ | 1.0x | ~0.25x | Hourly |
| OpenAI | Implicit | HBM | Dynamic | 1.0x | 0.50x | Free |
6. Semantic Caching
Complementary to Prompt Caching (Server-side), Semantic Caching (Client-side) uses Embeddings (Vector DBs like Milvus) to match intent.
If a user asks "Price of apple?" and later "How much is an apple?", Semantic Cache returns the saved answer without hitting the LLM.
Tools: GPTCache.
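The flow is easy to sketch without any particular library; in the snippet below, embed() is only a placeholder for a real embedding model, and GPTCache packages the same idea with pluggable vector stores and similarity evaluators.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: call a real embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(768)

cache: list[tuple[np.ndarray, str]] = []   # (query embedding, cached answer)

def semantic_lookup(query: str, threshold: float = 0.92):
    q = embed(query)
    for vec, answer in cache:
        similarity = float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if similarity >= threshold:
            return answer                  # intent matches a stored query: skip the LLM call
    return None                            # cache miss: call the LLM, then remember() the answer

def remember(query: str, answer: str) -> None:
    cache.append((embed(query), answer))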
7. Case Study: Prompt Cache in Agent Dev
Code Example
# Enabling Prompt Caching in Qwen
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage, HumanMessage

prompt = ChatPromptTemplate.from_messages([
    SystemMessage(
        content=[
            {
                "type": "text",
                "text": app_prompt_template.format(vars),
                "cache_control": {"type": "ephemeral"},  # The explicit flag
            }
        ]
    ),
    HumanMessage(content=app_user_prompt_template.format(input_data)),
])
# Enabling Prompt Caching in OpenAI
from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o",
    # OpenAI supports retaining the cache for 24h
    prompt_cache_retention="24h",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
Summary
The history of Large Model inference is, essentially, a history of struggling against memory bandwidth.
Architecture & Hardware: DeepSeek's MLA proves that algorithmic innovation (low-rank compression) can unlock hardware potential (SSD storage), completely upending the business model.
Stateful APIs: The HTTP stateless protocol is no longer sufficient. LLMs are becoming "Stateful Operating Systems," and developers must manage "Context Lifecycle" just as they manage database connections.
The Cost Cliff: With prices hitting $0.014/M tokens, the bottleneck for RAG shifts from "how to retrieve less to save money" to "how much context can the model handle without hallucinating." Full Context is replacing sliced retrieval.
For developers, the strategy is clear: Use Anthropic/vLLM for high-frequency, low-latency tasks (coding assistants), and leverage DeepSeek's disk caching for massive knowledge analysis where cost is the primary constraint.
AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations - ACL Anthology, accessed November 27, 2025, https://aclanthology.org/2025.coling-main.158.pdf
FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration, accessed November 27, 2025, https://arxiv.org/html/2505.20839v1
NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics, accessed November 27, 2025, https://arxiv.org/html/2505.16210v1
zilliztech/GPTCache: Semantic cache for LLMs. Fully integrated with LangChain and llama_index. - GitHub, accessed November 27, 2025, https://github.com/zilliztech/GPTCache
A narrative has recently become common within the team: "Building an Agent is simple now. You can just piece it together with LangChain, BaiLian, or Flowise, and it runs."
At first glance, this statement is hard to refute—frameworks have indeed lowered the barrier to entry. But that "simplicity" is more of an illusion, a facade created after the complexity has been temporarily absorbed by the platform. From a technical standpoint, Agent development involves:
Orchestration and task planning;
Context and Memory management;
Domain knowledge fusion (RAG);
And the "agentification" of business logic.
These steps are not accomplished just by writing a few prompts. When developers feel it's "simple," it's because the complexity has been absorbed by the platform. The difficulty of Agents lies not in getting a demo to run, but in making it operate reliably, controllably, and sustainably over the long term.
Why Is Agent Development Mistakenly Seen as "Simple"?
On the surface, we are in an era of explosive AI growth, with platforms and tools emerging endlessly. It's true that by writing a few prompts and connecting a few chains, a "functional" Agent can be born. But this doesn't mean the complexity has vanished. Instead, the complexity has been relocated.
I break this "simplicity" down into three illusions:
1. Encapsulated Complexity
Frameworks help you string prompts and trim context, shielding developers from the details. But the underlying mechanics—debugging, tracing, and state recovery—are still burdens you must bear alone.
Take LangChain as an example. A "question-answering" Agent can be created with just a few lines of code:
from langchain.agents import initialize_agent, load_tools
from langchain.llms import OpenAI
llm = OpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)
agent = initialize_agent(tools, llm, agent="zero-shot-react-description")
agent.run("What is the current weather in Singapore, and convert it to Celsius?")
This code hides almost all complexity:
Prompt assembly, call chains, and context management are encapsulated internally.
But if the task fails (e.g., API rate limiting, tool failure), the Agent by default neither retries nor logs a trace.
What looks like a "simple run" actually means sacrificing the interfaces for observability and debugging.
2. Outsourced Complexity
Memory, RAG, and Embeddings are all handed over to the platform for custody. The price is the loss of the ability to intervene and explain.
In LangChain, you can quickly add "memory":
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history")
But this is just a short-term memory buffer. It doesn't handle:
Conflicts with old information;
State drift over multiple turns;
Or context truncation issues due to excessive length.
As the Agent scales, memory consistency and state cleanup become new sources of system complexity.
3. Postponed Complexity
It doesn't disappear; it just reappears during the execution phase:
Output drift
Inability to reproduce results
Collapse of correctness and stability
Being able to run does not equal being able to run correctly over the long term. What we call simplicity is often just us temporarily avoiding the confrontation with complexity.
The Three Layers of Agent System Complexity
1. Agent Complexity
The complexity of an Agent system manifests in its ability to be run, reproduced, and evolved. Most current Agent frameworks have solved "runnability," but "reproducibility" and "evolvability" remain significant system engineering challenges.
| Level | Core Objective | Engineering Keywords | LangChain Example Explanation |
| --- | --- | --- | --- |
| Runnability (Run) | Enable the Agent to start and execute tasks | prompt, context, tool calls, execution flow | Rapidly assembling an executable chain via initialize_agent |
| Reproducibility (Reproduce) | Make behavior controllable and debuggable | memory, state, logs, versioning | No built-in version tracking; Memory state drift requires manual management |
| Evolvability (Evolve) |  |  | Supports vector retrieval, but lacks self-assessment and reinforcement learning mechanisms |
At the "Runnability" level, the abstractions designed by frameworks like LangChain are indeed efficient. But to make an Agent's behavior stable, explainable, and continuously optimizable, additional infrastructure—such as logging systems, prompt version management, and feedback loops—is still required.
From a system engineering perspective, the difficulty of an Agent lies not in "generation" but in "execution." All platforms will eventually expose their costs along these two lifecycles.
In the implementation phase, stability is often more critical than correctness. Only when stability exists can correctness even be verified and optimized.
Intelligence's uncertainty must be underpinned by engineering's certainty. Stability and observability are the prerequisites for an Agent to be truly evolvable.
2. The Agent Amplification Effect
As shown in the image above, the same model (qwen-max), the same time, and the same prompt produce different results. This is the amplification effect that LLM uncertainty brings to Agents. Compared to the traditional software systems developers are most familiar with, the complexity and difficulty of Agents stem from this uncertainty, amplified at each semantic level by the LLM.
If a single LLM interaction has a 90% correctness rate, an Agent system requiring 10 LLM interactions will have its correctness drop to just 35%. If it requires 20 interactions, the correctness plummets to 12%.
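The arithmetic behind that, assuming every interaction must succeed and failures are independent:

# Probability that an n-step agent run is fully correct if each LLM call is right 90% of the time.
for n in (1, 10, 20):
    print(n, f"{0.9 ** n:.0%}")   # 1 -> 90%, 10 -> 35%, 20 -> 12%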
Memory's Uncertainty Amplification
Traditional software state management is deterministic (e.g., what's in the database is what's in the database). An Agent's memory, however, relies on LLM parsing, embedding, and retrieval. The results are highly uncertain. Therefore, memory is not a storage/retrieval problem, but a semantic consistency problem. This is unique to Agents.
Orchestration's Dynamic Amplification
In traditional systems, orchestration (workflow) is a fixed, predefined process. In an Agent, the orchestration—which tool to call next, and how—is often dynamically decided by the LLM. This means the orchestration problem isn't just about "sequence/concurrency"; it's about an explosion of the decision space, making testing, monitoring, and optimization far more complex.
Testability's Unpredictability Amplification
Traditional software is predictable: given input → expected output. An Agent's output is a probability distribution (a stream of tokens from the LLM); there is no strict determinism. Therefore, testing cannot rely solely on unit tests. It must incorporate replay testing, baseline comparison testing, and simulation environment testing, which is far beyond the difficulty of standard application testing.
3. From "Runnable" to "Usable"
The "'It Runs, Doesn't It?' Fallacy"
Some might say, "I can get the Agent to work just by modifying the prompts. Am I amplifying the problem myself, rather than the Agent?"
"Getting it to run by tweaking prompts" essentially means: Short-term goal + High tolerance = Good enough. The goal of an Agent system, however, is: Long-term goal + Engineering-grade reliability = Drastic increase in difficulty.
Let's first look at why tweaking prompts seems to work. Many Agent Demos or POCs (Proofs of Concept) aim for one-off tasks, like "write a summary for me" or "call this API." In these low-requirement scenarios, the raw power of the LLM masks many underlying issues:
Memory can be passed purely through context (long-term persistence is never really tested).
Orchestration can be hard-coded or hinted at in the prompt.
Testability is irrelevant; if it gets the right answer once, it's a win.
The problem is that when the requirement shifts from a "Demo" to a "Sustainably Usable System," these issues are rapidly amplified:
Prompt Modification ≠ Reliability Guarantee. Changing a prompt might fix the immediate bug, but it doesn't guarantee the same class of problem won't reappear in another case. You haven't established reproducible, maintainable decision logic; you've just engaged in "black-box tweaking."
Prompt Modification ≠ Scalability. Prompt hacking works for a single-task Agent. But in a multi-tool, multi-scenario Agent, the prompt's complexity grows exponentially and eventually becomes uncontrollable.
Prompt Modification ≠ Engineering Controllability. Traditional software can be covered by test cases to ensure logical coverage. Prompts can only partially mitigate the LLM's probabilistic fluctuations; they cannot provide strong guarantees.
This is why, ultimately, we need more structured methods for memory, orchestration, and testing—which is to say, Agent systematization.
Limitations of Agent Frameworks
Let's use the LangChain framework as an example to see if frameworks can solve the three layers of Agent complexity. LangChain provides a basic CallbackManager and LangSmith integration for tracing an Agent's execution. This functionality is often overlooked, but it is key to understanding "reproducibility" and "observability."
When executed, LangChain will output every Thought and Action to the console:
Thought: I need to use the calculator tool.
Action: Calculator
Action Input: (15 + 9) * 2
Observation: 48
Thought: I now know the final answer.
Final Answer: 48
This seemingly simple output reveals three important facts:
The Agent's internal decision process is traceable (this is the prerequisite for reproducibility).
The CallbackManager must be actively enabled by the engineer (it doesn't log by default).
The granularity of observation is limited (it cannot directly trace context trimming, memory overwrites, etc.).
LangSmith provides a more complete visual trace, but it remains an external observation tool. The Agent framework itself still lacks built-in verification mechanisms. In other words, the framework gives you the ability to "see," but it doesn't solve the problem of "control" for you.
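For instance, in the classic LangChain API used earlier, you surface those events by passing your own callback handler; a minimal sketch (hook names follow BaseCallbackHandler, and agent is the executor built above):

from langchain.callbacks.base import BaseCallbackHandler

class TraceHandler(BaseCallbackHandler):
    def on_agent_action(self, action, **kwargs):
        print("TOOL:", action.tool, "| INPUT:", action.tool_input)

    def on_tool_end(self, output, **kwargs):
        print("OBSERVATION:", output)

agent.run(
    "What is the current weather in Singapore, and convert it to Celsius?",
    callbacks=[TraceHandler()],
)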
Although frameworks like LangChain are making interesting attempts to solve the complex problems in Agent systems, we must admit that most engineering dimensions remain unresolved. (In short, frameworks solve the problem of "using an LLM to do things," but not the problem of "making the LLM do things in a way that is controllable, sustainable, and scalable like a system"):
Persistence, elastic scaling, version management, A/B testing mechanisms
| Framework | Runnability | Reproducibility | Evolvability | Notes |
| --- | --- | --- | --- | --- |
| LangChain | ✅ Mature chain calls | ⚙️ Partially observable | ⚙️ Manual tuning | Many tools, but state is unstable |
| AutoGen | ✅ Multi-Agent collaboration | ⚙️ Rudimentary memory | ❌ Lacks learning mechanism | Flexible but hard to reproduce |
| CrewAI | ✅ Easy task orchestration | ⚙️ State instability | ❌ No feedback optimization | Strong interaction, weak control |
| AliCloud BaiLian | ✅ Drag-and-drop building | ⚙️ Platform logs | ⚙️ Built-in knowledge center | Platform absorbs complexity, but is a major black box with limited control |
✅ Runnability: Generally well-supported (low barrier to entry)
⚙️ Reproducibility: Only partially supported (requires self-built state and observation layers)
❌ Evolvability: Still relies on manual effort and system design
LangChain makes Agents "buildable," but it makes the system lose its "explainability." Complexity didn't disappear; it just migrated from the code layer to the runtime.
Let's delve deeper into runtime complexity. The new problem Agent systems bring is that they don't just "run"; they must "continuously think," and the side effect of thinking is instability. This is not "traditional code complexity" but "system uncertainty introduced by intelligent behavior." It makes Agent engineering feel more like managing a complex adaptive system than a linear, controllable piece of software.
| New Dimension of Complexity | Description | Example Scenario |
| --- | --- | --- |
| Context Drift | The model misunderstands or forgets key task objectives during multi-turn reasoning | An Agent deviates from the task's semantics during a long conversation, executing irrelevant actions |
| Semantic Non-determinism | The same input may produce different outputs, making processes non-replayable | Prompt debugging results are unstable; automated testing is hard to cover |
| Task Decomposition & Planning | The quality of plans generated by the LLM is unstable; task boundaries are vague | In AutoGen's "plan+execute" model, sub-tasks overflow or loop |
| Memory Pollution | Long-term stored context introduces noise or conflicting information | The Agent "learns" incorrect knowledge, causing future execution deviations |
| Control Ambiguity | The boundary between the Agent's execution and the human/system control layer is unclear | Manual instructions are overridden, tasks are repeated, resources are abused |
| Self-Adaptation Drift | The Agent learns incorrect patterns or behaviors based on feedback | Reinforcing a hallucinatory response during an RLHF/reflection loop |
| Multi-Agent Coordination | Communication, role assignment, and conflict resolution between Agents | Task duplication or conflicts in multi-role systems like CrewAI |
The Only Solution for Agents is Systematization
Prompt Hacking fails when the problem scales. For a single, simple scenario, tweaking a prompt works. But as task complexity and the number of scenarios increase, the prompt becomes bloated and uncontrollable (e.g., one prompt stuffed with dozens of rules). It's like concatenating strings to build SQL queries: it runs at first, but inevitably leads to injection vulnerabilities and a maintenance disaster. Systematization helps by providing structured constraints and automated orchestration, rather than manual prompt tuning.
Uncertainty demands controllability. Getting it right once is a win for a demo. But in a production environment, you need 99% correctness (or 100%). Even a 1% hallucination rate will accumulate into a disaster. For example, a log analysis Agent that misses or false-reports an issue just once could lead to an undiscovered online incident. Systematization ensures controllability through testing, monitoring, and replay verification, rather than gambling on luck every time.
Knowledge persistence vs. repeating mistakes. Today, an Agent's bug is fixed by changing a prompt. Tomorrow, a new requirement comes in, and the exploration starts all over again. Knowledge isn't retained. The Agent can't remember or reuse past solutions, leading to constant redundant labor. A colleague complained that in one business system, prompt modification commits made up over a third of all code commits. Yet, when another colleague tried to reuse that prompt for a similar problem, it was completely non-transferable and had to be hacked again from scratch. Systematization, through Memory + Knowledge Bases, ensures an Agent can learn and accumulate knowledge, not reinvent the wheel every time.
Prompt Hacking / Demo Agents solve "small problems." Only Systematized Agents can solve the problems of "scalability, reliability, and persistence." These issues might not be obvious now, but they will inevitably explode as usage time and scope expand.
A Demo Agent can solve today's problem. A Systematized Agent can solve tomorrow's and the day after's.
| Dimension | Demo Agent (Can run) | Systematized Agent (Can run sustainably) |
| --- | --- | --- |
| Goal | Single task / POC success | Continuous, repeatable, multi-dependent business processes |
|  |  | Modular components / model routing / tool governance |
| Cost Curve | Fast/cheap initially; maintenance costs skyrocket later | Upfront engineering investment; stable and scalable long-term |
From "Smart" to "Reliable"
Some Real-World Agent Cases
Looking at history, we can understand rise and fall. Looking at others, we can understand our own successes and failures. The problems I've encountered in Agent system development are surely not mine alone. I asked ChatGPT to search Reddit, GitHub, and blogs for Agent development cases, hoping to use others' experiences to validate my own thinking and reflections:
1. Typical Failures of Toy-Level Agents
Auto-GPT community feedback: looping, getting stuck, unable to complete tasks (the classic early example of "runnable but not reliable"). Auto-GPT seems nearly unusable
Developer questioning if agents can go to production, noting severe step-skipping/hallucinations in multi-step tasks (system prompt + function calling isn't enough). Seriously, can LLM agents REALLY work in production?
OpenAI Realtime Agents official example repo issue: Even the "simple demo" has too many hallucinations to be usable in non-demo contexts. Lots of hallucinations?
2. Engineering Problems Exposed After Production (Not solvable by prompt changes)
3. Industry/Big-Tech Postmortems: Why "Systematization" is Needed
Anthropic: Effective agents come from "composable simple patterns + engineering practices," not from piling on frameworks (summarized from many client projects). Building Effective AI Agents
OpenAI: Released Agents SDK + built-in observability, stating clearly "it's hard to turn capabilities into production-grade agents" and requires visualization/tracing/orchestration tools. New tools for building agents
LangChain Team: Why build LangGraph/Platform—for control, durability, long-running/bursty traffic, checkpoints, retries, and memory. Claims it's used in production by LinkedIn/Uber/Klarna (vendor claims, but highlights the "systematization elements"). Building LangGraph: Designing an Agent Runtime from first principles
4. Positive Case: Treating it with a "Distributed Systems Mindset"
5. Community Reality: People are using it in production, but focus on "de-complexing + limited agents"
Developer feedback on LangGraph being production-viable: Migrated from LangChain's Agent Executor; the prototype→streamline→retain-necessities path is more robust (de-hallucinate/de-fancy, retain control). Anyone Using Langchain Agents in production?
The Four Stages of Agent Development
Over more than a year of Agent development, I've gone through a cognitive shift from "Agents are simple" to "Agents are truly complex." At first, I treated frameworks as black boxes, writing prompts and piecing things together to run a demo. As the complexity of the scenarios increased and I needed to go deeper into Agent system R&D, the difficulties gradually revealed themselves. I've tried to break down this "simple → truly hard" process:
Stage 1: The "Hello World" Stage (Looks simple)
Using frameworks like LangChain / AutoGen / CrewAI, you can get something running in a few lines of code. Most people stop at "it can chat" or "it can call tools," so they feel "Agent development is just this."
Stage 2: The Scene Adaptation Stage (Starting to hit pitfalls)
As the complexity of the problems the Agent solves increases, you slowly run into the LLM context window limit, requiring trimming, compression, or selection (i.e., Context Management problems). You find that vector retrieval results are often irrelevant, leading to non-answers, requiring optimization of preprocessing and query rewriting (RAG Knowledge Management). It runs in simple scenes, but falls into traps in slightly more complex ones.
Stage 3: The Systematization Stage (Complexity explodes)
Going further, as tool calls and context management increase, the Agent must ensure consistency across sessions and tasks. You must consider persistence, version control, and conflict resolution. A single Agent can't adapt to complex tasks; you need multi-Agent collaboration. At this point, you must solve deadlock, task conflicts, and state rollbacks. When task complexity rises, debugging the Agent flow can't be solved by tweaking prompts; you must add tracing and observability tools.
Stage 4: The Engineering Landing Stage (The real hard part)
Agentifying Business Logic: How to test it? How to guarantee controllability and stability?
Security & Compliance: Permissions, privilege escalation, data leakage. Strict security boundaries are a must.
Monitoring & SLOs: Like operating microservices, you need monitoring, alerting, and failure recovery.
In summary, frameworks like LangChain lowered the "barrier to entry" for Agents, but they did not lower the "barrier to implementation."
My Cognitive Evolution in Agent Development
I have been developing an Agent system focused on vulnerability and security assessment in my own work. As I experienced the four stages of Agent development mentioned above, my thinking and understanding of Agents also changed:
Level 0: The Framework Illusion Layer
Typical Behavior: Install LangChain / AutoGen / CrewAI, run an official demo, modify a prompt.
Cognitive Trait: Believes "Agent Development = Writing Prompts." The barrier to entry is extremely low, similar to writing a script.
Misconception: Thinks the framework solves all complexity, ignoring memory, orchestration, testing, and security.
Level 1: The Scene Splicing Layer
Typical Behavior: Can stitch together RAG, tool calls, and simple multi-agent orchestration to build a seemingly viable prototype.
Cognitive Trait: Begins to realize the importance of context management and RAG strategies.
Pain Points: Encounters "irrelevant answers," "memory corruption," and "tasks failing to complete reliably."
Misconception: Tries to use prompt hacking to solve all problems, ignoring underlying information management and system design.
Level 2: The System Design Layer
Typical Behavior: Treats the Agent as a microservices system, needing to consider architecture, observability, and state management.
Cognitive Trait: Understands that memory is essentially a database/knowledge-base problem, and orchestration is more like workflow scheduling than a chat.
Pain Points: Debugging costs are extremely high; requires tracing, logging, and metrics monitoring.
Key Challenge: How to ensure the Agent is robust, controllable, and reproducible.
Level 3: The Engineering Landing Layer
Typical Behavior: Deploys the Agent into a production business environment.
Cognitive Trait: Treats Agent development as an engineering discipline, just like SRE / Security / Distributed Systems.
Pain Points:
Testability: The non-determinism of LLMs makes it impossible to guarantee stability with traditional unit tests.
Monitoring & SLOs: The Agent must be observable and recoverable, just like a service.
Key Challenge: How to make the Agent reliable enough to carry critical business functions.
Level 4: The Intelligent Evolution Layer (Frontier Exploration)
Typical Behavior: Attempting to build an Agent system with long-term memory, autonomous learning, and evolvability.
Cognitive Trait: No longer sees the Agent as an LLM wrapper, but as a new type of distributed intelligent system.
Challenges:
Memory becomes a knowledge graph + adaptive learning problem.
Orchestration involves game theory, collaboration, and even emergent behavior.
Security requires "AI sandboxes" to prevent loss of control.
Status: Most are not at this stage; it is primarily research and experimentation.
Based on my current understanding of Agents, I now position them as system components rather than intelligent robots. My goal is not "occasional brilliance" but "sustained reliability."
Basic Principles:
Principles:
Stable first, smart second.
Observable first, optimized second.
Functionality:
Establish a replayable mechanism for state and logs.
Implement version tracking for Prompts / Memory / RAG.
Clearly define the boundaries and permission scope for each Agent.
Designate "error recovery" pathways in the architecture.
Boundaries:
If the Agent is only for one-off tasks or exploratory experiments, complexity control can be relaxed.
If used for production tasks (monitoring, automated operations), stability and security boundaries take precedence.
The deeper the framework's encapsulation, the more an external explainability layer is needed.
The Path to Agent Intelligence
Someone said 2025 might be the "Year of the Agent." After nearly a year of technical iteration, Agents have also seen considerable development from an engineering perspective. LangChain has essentially become the preferred backend option for Agent systems, and Agent R&D has evolved from prompt engineering → context engineering (as shown in the figure below).
1. Agent Development Philosophy
Agents are not a panacea. The key is to choose the appropriate automation stage for tasks of different complexities. I believe we can see from the five evolutionary stages of Agents:
Complex ≠ Better
Don't blindly chase the "strongest Agent architecture"; suitability is key.
Using a complex system for a simple task only increases cost and risk.
The Real Challenge is "Human"
Many failed cases stem from the designer choosing the wrong architecture or lacking phased thinking.
The model and workflow are not the problem; the human is.
The Importance of Design Thinking
First, assess the task's complexity and automation potential.
Then, decide the required level of intelligence (Script → LLM → RPA → Agent → Multi-Agent).
Finally, match the appropriate tool, don't use a "one-size-fits-all" approach.
2. Agent Design Patterns
1️⃣ ReAct Pattern (Reasoning + Acting)
Structure: Divided into Reasoning and Acting phases.
Mechanism:
LLM1: Understands context, plans which tool/API to call.
LLM2: Executes the action, returns the result.
Pros: Decouples reasoning and action, clear structure.
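A stripped-down sketch of the loop behind this pattern (call_llm is a stub standing in for a real model call, and the tool registry is a toy; this shows the shape of ReAct, not any framework's API):

TOOLS = {"calculator": lambda expr: str(eval(expr))}   # toy tool registry

def call_llm(prompt: str) -> str:
    # Stub standing in for a real chat-model call: ask the calculator once, then finish.
    if "Observation:" in prompt:
        return "FINAL: 48"
    return "calculator: (15 + 9) * 2"

def react(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        # Reasoning phase: decide on a tool call, or finish.
        decision = call_llm(transcript)
        if decision.startswith("FINAL:"):
            return decision[len("FINAL:"):].strip()
        tool_name, tool_input = decision.split(":", 1)
        # Acting phase: run the tool and feed the observation back into the context.
        observation = TOOLS[tool_name.strip()](tool_input.strip())
        transcript += f"Action: {decision}\nObservation: {observation}\n"
    return "stopped: step budget exhausted"

print(react("What is (15 + 9) * 2?"))   # 48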
Below are some takeaways from Agent development. Those interested can look up how various Agent players are planning their strategies.
Perhaps future frameworks will absorb even more of this complexity. But the role of the engineer will not disappear. What we must do is to re-establish order in the places where complexity has been hidden—to make intelligence not just callable, but tamable.
The core mechanism of generative large models is predicting the next most likely word or phrase from statistical probability, not genuine understanding. This process naturally trends toward a "regression to the mean." Because the model aims to produce the "safest," most "typical" output, it keeps reaching for the generic phrases and clichés that are statistically most common in its training corpus. What it excels at is replication, not true originality.
In his discussion of Software 3.0, Andrej Karpathy notes that it is not merely a technical update but a profound paradigm shift.
The Democratization of Programming and the Shift in Roles
The community broadly agrees that the greatest significance of Software 3.0 is the "democratization of programming." Because the programming language has become natural language, non-programmers can take part in creating software, greatly widening the sources of innovation. The discussion on X (formerly Twitter) has been lively, with users stressing the idea that "English is coding." For programmers, the center of gravity shifts from hand-writing code to designing effective prompts (prompt engineering), verifying and auditing AI-generated code, and managing the AI system as a whole.
LLMs as the New "Operating System"
Andrej Karpathy compares LLMs to the early computers of the 1960s, or to a new kind of operating system. This view has been widely echoed in the community. LLMs become a programmable, composable foundational platform that developers can build applications around, rather than starting everything from zero.