feat: streaming voice input pipeline + frontend VAD integration + plugin-tool merge cleanup

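- Frontend: VAD voice detection (@ricky0123/vad-web) + dual-mode useVoiceInput (streaming WS / REST)
- Gateway: VoiceStreamManager proxies WS streaming STT to voice-service
- Voice-service: three-tier engine (DashScope REST → Realtime WS → Whisper) + ffmpeg transcoding
- Shared modules: pkg/audio (audio conversion) + pkg/dashscope (ASR REST client)
- Cleanup: remove the old plugin-manager and pkg/plugins, completing the plugin → tool merge
- Docs: complete the voice API documentation in gateway-api.md and voice-service.md
- Tooling: scripts/voice/ set of voice conversion scripts

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>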
2026-06-06 11:50:40 +08:00
parent 258cf81b25
commit 6ef9e082a6
91 changed files with 4091 additions and 3929 deletions
@@ -167,11 +167,14 @@ ws://<gateway>/ws/chat?token=<jwt>&session_id=<optional>&client_id=<optional>&de

 ```json
 {
-  "type": "message|voice_input|ping|history",
+  "type": "message|voice_input|voice_stream_start|voice_stream_chunk|voice_stream_end|ping|history",
  "session_id": "string (optional)",
  "mode": "text|voice_msg|voice_assistant",
  "content": "string (may be left empty for image-only messages; for text + image, put the question here)",
-  "audio_data": "string (required for the voice_input type, base64)",
+  "audio_data": "string (required for the voice_input / voice_stream_chunk types, base64)",
+  "format": "string (optional for voice_stream_start, audio format: webm, wav, pcm, opus; default webm)",
+  "language": "string (optional for voice_stream_start, recognition language: zh, en, ja, ko, auto; default zh)",
+  "sequence": 0,
  "attachments": [
    {
      "type": "image",
@@ -194,14 +197,18 @@ ws://<gateway>/ws/chat?token=<jwt>&session_id=<optional>&client_id=<optional>&de
  "timestamp": 1717000000000,
  "client_id": "string",
  "device_name": "string",
-  "user_agent": "string"
+  "user_agent": "string",
+  "client_msg_id": "string"
 }
 ```

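 | type | Description |
 |------|------|
 | `message` | Text chat; triggers an AI reply |
-| `voice_input` | Voice input; transcribed first, then handled as a message |
+| `voice_input` | Voice input (complete audio); transcribed first, then handled as a message |
+| `voice_stream_start` | Opens a streaming voice session; the Gateway connects to the Voice-Service streaming STT |
+| `voice_stream_chunk` | Streaming voice audio chunk (base64); the Gateway forwards it to Voice-Service |
+| `voice_stream_end` | Ends the voice stream, waits for the final recognition result, and automatically triggers the LLM reply |
 | `ping` | Heartbeat; answered automatically with pong |
 | `history` | Requests message history |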

@@ -244,7 +251,9 @@ ws://<gateway>/ws/chat?token=<jwt>&session_id=<optional>&client_id=<optional>&de
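 | `stream_chunk` | Incremental text chunk |
 | `stream_end` | AI generation finished (includes the full text) |
 | `stream_segments` | Streaming sentence segmentation (voice) |
-| `voice_transcript` | Voice transcription result |
+| `voice_transcript` | Voice transcription result (non-streaming, voice_input) |
+| `voice_interim` | Interim recognition result for streaming voice |
+| `voice_final` | Final recognized text for streaming voice |
 | `error` | Error |
 | `history_response` | Message history response |
 | `notification` | Push notification |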
@@ -325,7 +334,7 @@ Client                              Gateway

 ---

-### Voice input flow
+### Voice input flow (non-streaming)

 ```
 Client                           Gateway                       Voice-Service
@@ -343,6 +352,128 @@ Client                           Gateway                       Voice-Service
  |<-- ... usual streaming reply ...|                               |
 ```

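+> **Note:** `voice_input` is the non-streaming mode: the client sends the complete audio and receives the transcription in a single response. It suits the case where a MediaRecorder recording has already finished.
+> The streaming voice input described below is recommended; combined with frontend VAD, it recognizes speech while the user is still talking.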
+
+---
+
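+### Streaming voice input flow (voice_stream_*)
+
+Works with frontend VAD (Voice Activity Detection) to detect speech automatically and recognize it while the user is speaking. The frontend sends audio chunks frame by frame; the Gateway proxies them over WebSocket to the Voice-Service streaming STT and returns interim results in real time.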
+
+```
+Client                                Gateway                          Voice-Service
+  |                                      |                                   |
+  |-- {type:"voice_stream_start",       |                                   |
+  |    format:"webm", language:"zh"} --> |                                   |
+  |                                      |-- WS /api/v1/stt/stream --------> |
+  |                                      |<-- session ready                  |
+  |<-- {type:"voice_interim", text:""}  |                                   |
+  |                                      |                                   |
+  |-- {type:"voice_stream_chunk",       |                                   |
+  |    audio_data:"<base64>",           |                                   |
+  |    sequence:0} ------------------>   |                                   |
+  |                                      |-- binary audio frame ---------->  |
+  |                                      |<-- {type:"result",                |
+  |                                      |     text:"你好", isFinal:false}   |
+  |<-- {type:"voice_interim",           |                                   |
+  |     text:"你好"}                     |                                   |
+  |                                      |                                   |
+  |-- ... more chunks ...               |                                   |
+  |                                      |                                   |
+  |-- {type:"voice_stream_end"} ----->   |                                   |
+  |                                      |-- {action:"stop"} --------------> |
+  |                                      |<-- {type:"result",                |
+  |                                      |     text:"你好世界", isFinal:true}|
+  |<-- {type:"voice_final",             |                                   |
+  |     text:"你好世界"}                 |                                   |
+  |                                      |                                   |
+  |   (Gateway auto-sends final text    |                                   |
+  |    to AI-Core as a message)         |                                   |
+  |<-- {type:"stream_start"}            |                                   |
+  |<-- ... usual streaming LLM reply ...|                                   |
+```
+
+**Message details:**
+
+#### voice_stream_start: open a streaming voice session
+
+```json
+{
+  "type": "voice_stream_start",
+  "format": "webm",
+  "language": "zh"
+}
+```
+
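+| Field | Type | Required | Description |
+|------|------|------|------|
+| `format` | string | No | Audio format, default `"webm"`. Supported: `webm`, `wav`, `pcm`, `opus` |
+| `language` | string | No | Recognition language, default `"zh"`. Supported: `zh`, `en`, `ja`, `ko`, `auto` |
+
+On receipt, the Gateway connects to the Voice-Service streaming STT WebSocket. On success it returns an empty `voice_interim` to confirm that the session is established; on failure it returns `error`.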
+
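The defaults and supported values in the table above can be enforced on the client before anything is sent. The following is a minimal sketch, not the project's actual frontend code; `buildVoiceStreamStart` and the type names are hypothetical, and only the field names and allowed values come from the table.

```typescript
// Allowed values are copied from the field table for voice_stream_start.
const AUDIO_FORMATS = ["webm", "wav", "pcm", "opus"] as const;
const LANGUAGES = ["zh", "en", "ja", "ko", "auto"] as const;

type AudioFormat = (typeof AUDIO_FORMATS)[number];
type Language = (typeof LANGUAGES)[number];

interface VoiceStreamStart {
  type: "voice_stream_start";
  format: AudioFormat;
  language: Language;
}

// Hypothetical helper: applies the documented defaults ("webm", "zh")
// explicitly and rejects values the table does not list.
function buildVoiceStreamStart(
  format: string = "webm",
  language: string = "zh",
): VoiceStreamStart {
  if ((AUDIO_FORMATS as readonly string[]).indexOf(format) < 0) {
    throw new RangeError(`unsupported format: ${format}`);
  }
  if ((LANGUAGES as readonly string[]).indexOf(language) < 0) {
    throw new RangeError(`unsupported language: ${language}`);
  }
  return {
    type: "voice_stream_start",
    format: format as AudioFormat,
    language: language as Language,
  };
}
```

A caller would serialize the result with `JSON.stringify` and send it over the existing `/ws/chat` connection; validating locally avoids a round trip that would only end in an `error` message.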
+#### voice_stream_chunk: send an audio chunk
+
+```json
+{
+  "type": "voice_stream_chunk",
+  "audio_data": "<base64 encoded audio>",
+  "sequence": 0
+}
+```
+
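+| Field | Type | Required | Description |
+|------|------|------|------|
+| `audio_data` | string | Yes | Base64-encoded audio data |
+| `sequence` | int | No | Chunk sequence number, incrementing from 0; used for ordering and deduplication |
+
+The Gateway decodes `audio_data` and forwards it to Voice-Service as a binary frame. There is no direct response; recognition results are pushed asynchronously as `voice_interim`.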
+
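As an illustration of the chunk format, the sketch below wraps raw audio frames into `voice_stream_chunk` messages with base64 payloads and a sequence counter that starts at 0. `toBase64` and `createChunkEncoder` are hypothetical names; the base64 encoder is written out by hand only to keep the example free of runtime dependencies, and a real client could use a platform encoder instead.

```typescript
const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Standard base64 (RFC 4648) over raw bytes, with "=" padding.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    out += B64.charAt(b0 >> 2);
    out += B64.charAt(((b0 & 0x03) << 4) | (b1 >> 4));
    out += has1 ? B64.charAt(((b1 & 0x0f) << 2) | (b2 >> 6)) : "=";
    out += has2 ? B64.charAt(b2 & 0x3f) : "=";
  }
  return out;
}

interface VoiceStreamChunk {
  type: "voice_stream_chunk";
  audio_data: string;
  sequence: number;
}

// Hypothetical helper: returns a function that turns each audio frame
// into a chunk message carrying the next sequence number (0, 1, 2, ...).
function createChunkEncoder(): (frame: Uint8Array) => VoiceStreamChunk {
  let sequence = 0;
  return (frame: Uint8Array): VoiceStreamChunk => ({
    type: "voice_stream_chunk",
    audio_data: toBase64(frame),
    sequence: sequence++,
  });
}
```

Creating one encoder per streaming session keeps the counter scoped to that session, so each `voice_stream_start` begins again at sequence 0.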
+#### voice_stream_end: end the voice stream
+
+```json
+{
+  "type": "voice_stream_end"
+}
+```
+
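+The Gateway sends a stop signal to Voice-Service and waits for the final recognition result. The final text is returned as `voice_final` and automatically triggers the LLM reply flow.
+
+#### voice_interim: interim recognition result (Server → Client)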
+
+```json
+{
+  "type": "voice_interim",
+  "message_id": "voice_<random>",
+  "text": "interim recognized text",
+  "timestamp": 1717000000000
+}
+```
+
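+| Field | Description |
+|------|------|
+| `text` | The recognized text accumulated so far. It is **not the final result** and is updated as more audio arrives |
+
+> **Frontend handling:** on receiving `voice_interim`, show the live recognized text in the UI (for example in gray italics); on receiving `voice_final`, replace it with the final text.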
+
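The frontend handling described in the note above can be expressed as a small state update. This is a sketch under assumptions: `TranscriptView` and `applyVoiceEvent` are hypothetical names, and ignoring an interim result that arrives after the final one is a defensive choice made here, not documented Gateway behavior.

```typescript
// Message shapes follow the voice_interim / voice_final examples in this section.
interface VoiceEvent {
  type: "voice_interim" | "voice_final";
  message_id: string;
  text: string;
  timestamp: number;
}

interface TranscriptView {
  text: string;
  isFinal: boolean; // false: render as provisional (e.g. gray italics)
}

// Each event carries the full accumulated text, so the view is replaced,
// never appended to.
function applyVoiceEvent(view: TranscriptView, event: VoiceEvent): TranscriptView {
  // Assumption: a late interim result must not overwrite a final transcript.
  if (view.isFinal && event.type === "voice_interim") {
    return view;
  }
  return { text: event.text, isFinal: event.type === "voice_final" };
}
```

Keeping the update as a pure function makes it easy to plug into whatever state container the UI already uses; the view should be reset to `{ text: "", isFinal: false }` when a new streaming session starts.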
+#### voice_final: final recognition result (Server → Client)
+
+```json
+{
+  "type": "voice_final",
+  "message_id": "voice_<random>",
+  "text": "final recognized text",
+  "timestamp": 1717000000000
+}
+```
+
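+| Field | Description |
+|------|------|
+| `text` | The final, complete recognized text. An empty string means no speech was recognized |
+
+Once `voice_final` is issued, the Gateway automatically forwards `text` to AI-Core as a `message` to trigger the streaming LLM reply. The subsequent `stream_start` / `review` / `stream_end` flow is the same as for an ordinary text message.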
+
 ---

 ## 3. Session management