Hermes Gateway: Troubleshooting a Dropped Feishu Connection

青萍叙事 2026-06-10

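Preface

Yesterday I got an itch and swapped a new model into Hermes Gateway.
Right after that, Feishu stopped receiving messages.

My first reaction: did the model swap break it?
I went straight to the Gateway log and found this line: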

RuntimeError: Executor shutdown has been called

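The Feishu WebSocket connection was down too. The Gateway still looked like it was running, but it was really just an empty shell.

Investigation: start with the Gateway log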

2026-06-09 14:23:45 INFO  Gateway started
2026-06-09 14:23:46 INFO  Feishu WebSocket connected
...
2026-06-09 15:34:12 ERROR RuntimeError: Executor shutdown has been called
2026-06-09 15:34:12 ERROR Feishu WebSocket connection lost

The Gateway started at 14:23 and hit the error at 15:34, so it had been running for about 1 hour and 10 minutes.
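
As a quick check of that uptime, here is a small sketch that subtracts the two timestamps from the log excerpt above. The format string is my assumption about how the Gateway writes its timestamps.

```python
from datetime import datetime

# Assumed timestamp layout, matching the log excerpt above
FMT = "%Y-%m-%d %H:%M:%S"

started = datetime.strptime("2026-06-09 14:23:45", FMT)
failed = datetime.strptime("2026-06-09 15:34:12", FMT)

# Subtracting two datetimes yields a timedelta
uptime = failed - started
print(uptime)  # 1:10:27
```

Counting whole minutes only (14:23 to 15:34) gives 71 minutes, which is where a figure of 1 hour 11 minutes comes from.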

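The timing is telling: it didn't die right after startup, it failed after running for a while.
I swapped the model at around 15:30, so the swap looks like the trigger, but the root cause lies elsewhere.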

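Root cause: the lifecycle trap of Python's asyncio executor

The Gateway is written in Python, built around an asyncio event loop.
The problem comes from the run_in_executor call.

Python asyncio's loop.run_in_executor() hands synchronous code to a thread pool.
When you pass None as the executor, it uses the event loop's default ThreadPoolExecutor.
The lifecycle of that executor has a trap: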

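Once that executor has been shut down, it stays shut down. CPython raises this exact RuntimeError from run_in_executor(None, ...) when the loop's default executor has already been shut down, for example by loop.shutdown_default_executor() (which asyncio.run() calls during teardown) or by loop.close(). Note that the loop keeps a strong reference to its default executor, so garbage collection by itself should not be what shuts it down.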

In a long-running service this can happen silently:

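# This code can fail (sync_llm_call is the blocking LLM client call)
import asyncio

async def call_llm(prompt):
    loop = asyncio.get_running_loop()
    # None selects the loop's default executor; if that executor has
    # already been shut down, this line raises RuntimeError
    result = await loop.run_in_executor(None, sync_llm_call, prompt)
    return result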

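That is where the Executor shutdown has been called line in the log comes from.
With the executor shut down, every async task that depends on it fails, and the Feishu WebSocket message handling goes down with it.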
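
To see the failure mode in isolation, here is a minimal reproduction sketch. It is not the Gateway's code: it simply shuts down the loop's default executor on purpose (via loop.shutdown_default_executor(), available in Python 3.9 and later) and then tries to use it again.

```python
import asyncio


async def reproduce():
    loop = asyncio.get_running_loop()

    # First call works: the loop lazily creates its default executor
    await loop.run_in_executor(None, sum, [1, 2, 3])

    # Deliberately shut the default executor down while the loop keeps running
    await loop.shutdown_default_executor()

    try:
        # Any later call that relies on the default executor now fails
        await loop.run_in_executor(None, sum, [1, 2, 3])
    except RuntimeError as exc:
        return str(exc)
    return None


message = asyncio.run(reproduce())
print(message)
```

The loop itself stays alive after the shutdown, which matches the symptom of a process that still looks healthy from the outside.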

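I also suspected macOS memory-pressure handling, on the theory that the system might kill idle threads in the background and corrupt the executor's internal state.
I have no evidence for this, and macOS acts on whole processes rather than on individual threads, so treat it as an unverified guess.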

A second problem: the `--replace` trap

The most direct fix is a restart:

hermes gateway start

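After the restart, Feishu messages came back.
But that is only a stopgap; the same problem will show up again.

That is where I hit a second trap: the previous Gateway had been started in --replace mode: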

hermes gateway start --replace

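--replace kills any Gateway that is already running before starting the new one.
Sounds convenient, right?
But on macOS, if launchd manages the service, --replace causes a conflict.

When the Gateway was brought up by launchd and you start a new one by hand with --replace, launchd is not in on the handover: it sees its own job die and, with KeepAlive set, may relaunch it, so the old and new processes end up fighting.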

Best practice: let launchd manage the Gateway

On macOS, launchd is the right way to manage background services.
First, look at the existing configuration:

ls ~/Library/LaunchAgents/ai.hermes.gateway-*.plist

/Users/lu/Library/LaunchAgents/ai.hermes.gateway-default.plist
/Users/lu/Library/LaunchAgents/ai.hermes.gateway-work.plist

Each profile has its own plist file.
Here is what one contains:

cat ~/Library/LaunchAgents/ai.hermes.gateway-default.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" 
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>ai.hermes.gateway-default</string>
    <key>ProgramArguments</key>
    <array>
        <string>/Users/lu/.hermes/bin/hermes</string>
        <string>gateway</string>
        <string>start</string>
        <string>--replace</string>   <!-- the problem is here -->
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <dict>
        <key>SuccessfulExit</key>
        <false/>
    </dict>
</dict>
</plist>

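There it is: --replace sits right in ProgramArguments.
That is the problem.

Batch-fixing the plist files

I checked, and every profile's plist has the same problem.
Editing them by hand is slow, so I scripted it: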

# List every plist that still contains the flag
grep -l -e '--replace' ~/Library/LaunchAgents/ai.hermes.gateway-*.plist

/Users/lu/Library/LaunchAgents/ai.hermes.gateway-default.plist
/Users/lu/Library/LaunchAgents/ai.hermes.gateway-work.plist

Remove --replace in bulk:

for plist in ~/Library/LaunchAgents/ai.hermes.gateway-*.plist; do
    # Delete the line that contains --replace
    sed -i '' '/<string>--replace<\/string>/d' "$plist"
    echo "Fixed: $plist"
done
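
The sed loop depends on the flag sitting on a line of its own. As an alternative, here is a sketch that edits the parsed plist structure with Python's standard plistlib module. The function name strip_replace_flag is mine, not part of Hermes.

```python
import plistlib


def strip_replace_flag(plist_bytes: bytes) -> tuple[bytes, bool]:
    """Return (new plist bytes, whether anything changed)."""
    config = plistlib.loads(plist_bytes)
    args = config.get("ProgramArguments", [])

    # Keep every argument except the --replace flag
    cleaned = [arg for arg in args if arg != "--replace"]
    if cleaned == args:
        return plist_bytes, False

    config["ProgramArguments"] = cleaned
    return plistlib.dumps(config), True


# Hypothetical sample mirroring the plist shown earlier
sample = plistlib.dumps({
    "Label": "ai.hermes.gateway-default",
    "ProgramArguments": ["/Users/lu/.hermes/bin/hermes", "gateway", "start", "--replace"],
})
fixed, changed = strip_replace_flag(sample)
print(changed, plistlib.loads(fixed)["ProgramArguments"])
```

One trade-off: plistlib.dumps rewrites the whole file, so XML comments and the original key order are not preserved. If that matters, the sed one-liner is the gentler tool.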

Reload all the launchd services:

# Unload first, then load
launchctl unload ~/Library/LaunchAgents/ai.hermes.gateway-*.plist
launchctl load ~/Library/LaunchAgents/ai.hermes.gateway-*.plist

On recent macOS versions unload/load are marked legacy; bootout/bootstrap are recommended instead:

launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/ai.hermes.gateway-*.plist 2>/dev/null
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/ai.hermes.gateway-*.plist

Verify:

launchctl list | grep hermes

ai.hermes.gateway-default 0 com.apple.xpc.launchd.oneshot.0x10000001.hermes
ai.hermes.gateway-work 0 com.apple.xpc.launchd.oneshot.0x10000002.hermes

The status code is 0, so neither job has exited with an error.
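
If you want to check this from a script, here is a sketch of a parser for launchctl list output. It assumes the stock three-column layout (PID, Status, Label, with "-" as the PID of a job that is not running), which may not match the output pasted above exactly. The sample text is made up for illustration.

```python
def parse_launchctl_list(output: str) -> dict:
    """Map each job label to its PID (None if not running) and last exit status."""
    jobs = {}
    for line in output.splitlines():
        parts = line.split(None, 2)
        # Skip blank lines, malformed lines, and the header row
        if len(parts) != 3 or parts[0] == "PID":
            continue
        pid, status, label = parts
        jobs[label] = {
            "pid": None if pid == "-" else int(pid),
            "status": int(status),
        }
    return jobs


# Hypothetical sample in the assumed layout
sample = (
    "PID\tStatus\tLabel\n"
    "4242\t0\tai.hermes.gateway-default\n"
    "-\t0\tai.hermes.gateway-work\n"
)
for label, info in parse_launchctl_list(sample).items():
    print(label, info)
```

Under that layout, a numeric PID is what tells you the job is running right now; the status column only reports how the last run ended.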

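Lessons learned

Guard against asyncio executor pitfalls up front.

In a long-running Python service, the executor's lifecycle can't be left entirely to the garbage collector or to the loop's defaults.
Either hold an explicit reference to the executor, or check its state periodically.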

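# Hold an explicit reference to the executor
import asyncio
from concurrent.futures import ThreadPoolExecutor

class Gateway:
    def __init__(self):
        # Owned by the Gateway, so it lives exactly as long as the Gateway does
        self._executor = ThreadPoolExecutor(max_workers=4)

    async def call_in_background(self, func, *args):
        # Pass the executor explicitly instead of relying on the loop's default
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, func, *args)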
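
For the other option, checking the executor periodically, here is a rough sketch of a health probe. The function name and the one-second timeout are my own choices. It submits a no-op job and treats a RuntimeError or a timeout as unhealthy.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def executor_is_healthy(executor, timeout=1.0):
    """Return True if the executor still accepts and completes a trivial job."""
    loop = asyncio.get_running_loop()
    try:
        # A shut-down ThreadPoolExecutor raises RuntimeError on submit;
        # a saturated or stuck pool shows up as a timeout instead
        await asyncio.wait_for(loop.run_in_executor(executor, lambda: None), timeout)
    except (RuntimeError, asyncio.TimeoutError):
        return False
    return True


async def demo():
    executor = ThreadPoolExecutor(max_workers=1)
    before = await executor_is_healthy(executor)
    executor.shutdown(wait=True)
    after = await executor_is_healthy(executor)
    return before, after


print(asyncio.run(demo()))  # (True, False)
```

A service could run this probe on a timer and recreate the executor, or exit so launchd restarts it, when the probe fails.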

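On macOS, services must be managed by launchd.

Don't run it in the background with nohup.
Don't use --replace.
Don't use screen or tmux.
launchd is the native macOS service manager; it handles crash restarts, start at login, resource limits and so on for you.

Key configuration points:

RunAtLoad=true: start automatically at login
KeepAlive.SuccessfulExit=false: don't restart after a clean exit, only after an abnormal one
No --replace: let launchd manage the process lifecycle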
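
Those three points can be put together in a short sketch that generates a plist with Python's plistlib. build_gateway_plist is a hypothetical helper, and varying only the Label per profile is an assumption, since the plist shown earlier does not reveal how a profile is selected.

```python
import plistlib


def build_gateway_plist(profile: str, hermes_bin: str) -> bytes:
    """Build a LaunchAgent plist that follows the three points above."""
    config = {
        "Label": f"ai.hermes.gateway-{profile}",
        # No --replace here: launchd owns the process lifecycle
        "ProgramArguments": [hermes_bin, "gateway", "start"],
        # Start when the agent is loaded, which for a LaunchAgent means at login
        "RunAtLoad": True,
        # Restart only after an unsuccessful exit
        "KeepAlive": {"SuccessfulExit": False},
    }
    return plistlib.dumps(config)


plist_bytes = build_gateway_plist("default", "/Users/lu/.hermes/bin/hermes")
print(plist_bytes.decode("utf-8"))
```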

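When a connection misbehaves, read the logs first.

Feishu messages can fail to go out for many reasons: network trouble, an expired token, a dead service...
Don't restart right away; read the logs first.
The error message in the log usually points straight at the root cause.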

# Show the Gateway's most recent log lines
tail -100 ~/.hermes/logs/gateway-default.log

# Show system log entries from hermes processes over the last hour
log show --predicate 'processImagePath contains "hermes"' --last 1h
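
To pull only the error lines out of a Gateway log, here is a small sketch. It assumes the "date time LEVEL message" layout seen in the log excerpt earlier in this post; the sample text is copied from that excerpt.

```python
def error_lines(log_text: str) -> list:
    """Return the lines whose third whitespace-separated field is ERROR."""
    hits = []
    for line in log_text.splitlines():
        fields = line.split()
        # Fields: date, time, level, then the message
        if len(fields) >= 3 and fields[2] == "ERROR":
            hits.append(line)
    return hits


sample_log = """\
2026-06-09 14:23:45 INFO  Gateway started
2026-06-09 14:23:46 INFO  Feishu WebSocket connected
2026-06-09 15:34:12 ERROR RuntimeError: Executor shutdown has been called
2026-06-09 15:34:12 ERROR Feishu WebSocket connection lost
"""
for line in error_lines(sample_log):
    print(line)
```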

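This investigation gave me a deeper understanding of how Python asyncio works under the hood.
asyncio is not a silver bullet.
Details like executor management and event loop lifecycle rarely surface during development; they tend to bite only in production, after the service has been running for a while.

Luckily this time it was only Feishu messages not going out, nothing fatal.
It is still a reminder that for long-running services, monitoring and fault-tolerant design are not places to cut corners.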