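feat: streaming voice input pipeline + frontend VAD integration + plugin-to-tool merge cleanup

- Frontend: VAD speech detection (@ricky0123/vad-web) + dual-mode useVoiceInput (streaming WS / REST)
- Gateway: VoiceStreamManager proxies WS streaming STT to voice-service
- Voice-service: three-tier engine (DashScope REST → Realtime WS → Whisper) + ffmpeg transcoding
- Shared modules: pkg/audio (audio conversion) + pkg/dashscope (ASR REST client)
- Cleanup: remove the old plugin-manager and pkg/plugins, completing the plugin → tool merge
- Docs: complete the voice API documentation in gateway-api.md and voice-service.md
- Tooling: scripts/voice/ set of voice conversion scripts

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>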
+137
-6
@@ -167,11 +167,14 @@ ws://<gateway>/ws/chat?token=<jwt>&session_id=<optional>&client_id=<optional>&de
 
 ```json
 {
-  "type": "message|voice_input|ping|history",
+  "type": "message|voice_input|voice_stream_start|voice_stream_chunk|voice_stream_end|ping|history",
   "session_id": "string (optional)",
   "mode": "text|voice_msg|voice_assistant",
   "content": "string (may be empty for image-only messages; for text plus image, put the question here)",
-  "audio_data": "string (required for voice_input, base64)",
+  "audio_data": "string (required for voice_input / voice_stream_chunk, base64)",
+  "format": "string (optional for voice_stream_start; audio format: webm, wav, pcm, opus; default webm)",
+  "language": "string (optional for voice_stream_start; recognition language: zh, en, ja, ko, auto; default zh)",
+  "sequence": 0,
   "attachments": [
     {
       "type": "image",
@@ -194,14 +197,18 @@ ws://<gateway>/ws/chat?token=<jwt>&session_id=<optional>&client_id=<optional>&de
   "timestamp": 1717000000000,
   "client_id": "string",
   "device_name": "string",
-  "user_agent": "string"
+  "user_agent": "string",
+  "client_msg_id": "string"
 }
 ```
 
 | type | Description |
 |------|------|
 | `message` | Text chat; triggers an AI reply |
-| `voice_input` | Voice input; transcribed first, then handled as a message |
+| `voice_input` | Voice input (complete audio); transcribed first, then handled as a message |
+| `voice_stream_start` | Opens a streaming voice session; the Gateway connects to the Voice-Service streaming STT |
+| `voice_stream_chunk` | Streaming voice audio chunk (base64); the Gateway forwards it to Voice-Service |
+| `voice_stream_end` | Ends the streaming voice session, waits for the final recognition result, and automatically triggers an LLM reply |
 | `ping` | Heartbeat; answered automatically with pong |
 | `history` | Requests message history |
 
@@ -244,7 +251,9 @@ ws://<gateway>/ws/chat?token=<jwt>&session_id=<optional>&client_id=<optional>&de
 | `stream_chunk` | Incremental text chunk |
 | `stream_end` | AI generation finished (includes the full text) |
 | `stream_segments` | Streaming sentence segmentation (voice) |
-| `voice_transcript` | Voice transcription result |
+| `voice_transcript` | Voice transcription result (non-streaming, voice_input) |
+| `voice_interim` | Interim recognition result for streaming voice |
+| `voice_final` | Final recognized text for streaming voice |
 | `error` | Error |
 | `history_response` | Message history response |
 | `notification` | Push notification |
@@ -325,7 +334,7 @@ Client Gateway
 
 ---
 
-### Voice input flow
+### Voice input flow (non-streaming)
 
 ```
 Client                                  Gateway                                 Voice-Service
@@ -343,6 +352,128 @@ Client Gateway Voice-Service
 |<-- ... normal streaming reply ...     |                                       |
 ```
 
+> **Note:** `voice_input` is the non-streaming mode: the client sends the complete audio and receives the transcription in a single response. It suits use after a MediaRecorder recording has finished.
+> The streaming voice input described below is recommended; combined with frontend VAD, it recognizes speech while the user is still talking.
+
+---
+
+### Streaming voice input flow (voice_stream_*)
+
+Combined with frontend VAD (Voice Activity Detection), this flow provides automatic speech detection and recognition while the user is speaking. The frontend sends audio chunks frame by frame, the Gateway proxies them over WebSocket to the Voice-Service streaming STT, and interim results are returned in real time.
+
+```
+Client                                  Gateway                                 Voice-Service
+|                                       |                                       |
+|-- {type:"voice_stream_start",         |                                       |
+|   format:"webm", language:"zh"} -->   |                                       |
+|                                       |-- WS /api/v1/stt/stream ------------> |
+|                                       |<-- session ready                      |
+|<-- {type:"voice_interim", text:""}    |                                       |
+|                                       |                                       |
+|-- {type:"voice_stream_chunk",         |                                       |
+|   audio_data:"<base64>",              |                                       |
+|   sequence:0} ------------------->    |                                       |
+|                                       |-- binary audio frame ---------------> |
+|                                       |<-- {type:"result",                    |
+|                                       |    text:"你好", isFinal:false}        |
+|<-- {type:"voice_interim",             |                                       |
+|    text:"你好"}                       |                                       |
+|                                       |                                       |
+|-- ... more chunks ...                 |                                       |
+|                                       |                                       |
+|-- {type:"voice_stream_end"} ------->  |                                       |
+|                                       |-- {action:"stop"} ------------------> |
+|                                       |<-- {type:"result",                    |
+|                                       |    text:"你好世界", isFinal:true}     |
+|<-- {type:"voice_final",               |                                       |
+|    text:"你好世界"}                   |                                       |
+|                                       |                                       |
+|   (Gateway automatically sends the    |                                       |
+|    final text to AI-Core as a message)|                                       |
+|<-- {type:"stream_start"}              |                                       |
+|<-- ... normal streaming LLM reply ... |                                       |
+```
+
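The relay step in the diagram, where the Gateway turns each Voice-Service `result` frame into a `voice_interim` or `voice_final` message for the client, can be sketched as follows. This is an illustrative TypeScript sketch only, not the Gateway's actual code: the names `SttResult`, `ClientVoiceEvent`, and `toClientEvent` are hypothetical, and the caller is assumed to supply the `message_id` and timestamp.

```typescript
// Voice-Service result frame, with the fields shown in the diagram.
interface SttResult {
  type: "result";
  text: string;
  isFinal: boolean;
}

// Message pushed to the client (hypothetical type name).
interface ClientVoiceEvent {
  type: "voice_interim" | "voice_final";
  message_id: string;
  text: string;
  timestamp: number;
}

// Map one upstream result to the client-facing message:
// isFinal=false becomes voice_interim, isFinal=true becomes voice_final.
function toClientEvent(
  result: SttResult,
  messageId: string,
  now: number
): ClientVoiceEvent {
  return {
    type: result.isFinal ? "voice_final" : "voice_interim",
    message_id: messageId,
    text: result.text,
    timestamp: now,
  };
}
```

Passing the id and clock in as arguments keeps the mapping a pure function, which makes it easy to unit test.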
+**Message details:**
+
+#### voice_stream_start: open a streaming voice session
+
+```json
+{
+  "type": "voice_stream_start",
+  "format": "webm",
+  "language": "zh"
+}
+```
+
+| Field | Type | Required | Description |
+|------|------|------|------|
+| `format` | string | No | Audio format, default `"webm"`. Supported: `webm`, `wav`, `pcm`, `opus` |
+| `language` | string | No | Recognition language, default `"zh"`. Supported: `zh`, `en`, `ja`, `ko`, `auto` |
+
+On receipt, the Gateway connects to the Voice-Service streaming STT WebSocket. On success it returns an empty `voice_interim` to confirm that the session is established; on failure it returns `error`.
+
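A client might build the `voice_stream_start` message with a small helper that applies the documented defaults and rejects unsupported values before anything is sent. The sketch below assumes a TypeScript frontend; `buildStreamStart` and the type names are hypothetical and not part of the documented API.

```typescript
type AudioFormat = "webm" | "wav" | "pcm" | "opus";
type SttLanguage = "zh" | "en" | "ja" | "ko" | "auto";

interface VoiceStreamStart {
  type: "voice_stream_start";
  format: AudioFormat;
  language: SttLanguage;
}

// Supported values, taken from the field table above.
const FORMATS: readonly string[] = ["webm", "wav", "pcm", "opus"];
const LANGUAGES: readonly string[] = ["zh", "en", "ja", "ko", "auto"];

// Build the start message. Both fields are optional on the wire;
// this helper fills in the documented defaults explicitly.
function buildStreamStart(
  format: string = "webm",
  language: string = "zh"
): VoiceStreamStart {
  if (FORMATS.indexOf(format) === -1) {
    throw new Error(`unsupported format: ${format}`);
  }
  if (LANGUAGES.indexOf(language) === -1) {
    throw new Error(`unsupported language: ${language}`);
  }
  return {
    type: "voice_stream_start",
    format: format as AudioFormat,
    language: language as SttLanguage,
  };
}
```

For example, `JSON.stringify(buildStreamStart())` yields `{"type":"voice_stream_start","format":"webm","language":"zh"}`, which can be passed to the WebSocket's `send`.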
+#### voice_stream_chunk: send an audio chunk
+
+```json
+{
+  "type": "voice_stream_chunk",
+  "audio_data": "<base64 encoded audio>",
+  "sequence": 0
+}
+```
+
+| Field | Type | Required | Description |
+|------|------|------|------|
+| `audio_data` | string | Yes | Base64-encoded audio data |
+| `sequence` | int | No | Chunk sequence number, incrementing from 0; used for ordering and deduplication |
+
+The Gateway decodes `audio_data` and forwards it to Voice-Service as a binary frame. There is no direct response; recognition results are pushed asynchronously via `voice_interim`.
+
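On the client side, each audio frame has to be Base64-encoded and tagged with an increasing `sequence`. The sketch below shows one way to do that; `ChunkSequencer` and `toBase64` are hypothetical names, and the encoder is hand-rolled only so the example is self-contained (in a browser, `btoa` is the usual choice).

```typescript
const B64 =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Standard Base64 encoding with "=" padding.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    out += B64[b0 >> 2];
    out += B64[((b0 & 0x03) << 4) | (b1 >> 4)];
    out += has1 ? B64[((b1 & 0x0f) << 2) | (b2 >> 6)] : "=";
    out += has2 ? B64[b2 & 0x3f] : "=";
  }
  return out;
}

interface VoiceStreamChunk {
  type: "voice_stream_chunk";
  audio_data: string;
  sequence: number;
}

// Wraps raw audio frames into chunk messages, numbering them from 0.
class ChunkSequencer {
  private next = 0;

  wrap(bytes: Uint8Array): VoiceStreamChunk {
    return {
      type: "voice_stream_chunk",
      audio_data: toBase64(bytes),
      sequence: this.next++,
    };
  }
}
```

A new `ChunkSequencer` would be created for each `voice_stream_start`, so that every streaming session starts again at sequence 0.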
+#### voice_stream_end: end the streaming voice session
+
+```json
+{
+  "type": "voice_stream_end"
+}
+```
+
+The Gateway sends a stop signal to Voice-Service and waits for the final recognition result. The final text is returned via `voice_final` and automatically triggers the LLM reply flow.
+
+#### voice_interim: interim recognition result (Server → Client)
+
+```json
+{
+  "type": "voice_interim",
+  "message_id": "voice_<random>",
+  "text": "interim recognized text",
+  "timestamp": 1717000000000
+}
+```
+
+| Field | Description |
+|------|------|
+| `text` | The recognized text accumulated so far; **not the final result**, it is updated as more audio arrives |
+
+> **Frontend handling:** On receiving `voice_interim`, show the live recognized text in the UI (for example, in gray italics); on receiving `voice_final`, replace it with the final text.
+
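The frontend handling described above amounts to a small state update. Because `voice_interim` carries the accumulated text rather than a delta, each event replaces the displayed text instead of appending to it. A minimal sketch, with hypothetical names (`TranscriptView`, `applyVoiceEvent`) and a `pending` flag that the UI could map to the gray-italic style:

```typescript
interface VoiceEvent {
  type: "voice_interim" | "voice_final";
  text: string;
}

// What the UI renders: the current text and whether it is still provisional.
interface TranscriptView {
  text: string;
  pending: boolean;
}

// Replace (never append) the text on every event; only voice_final
// clears the pending flag. The previous view is unused because each
// event carries the full text, but it is kept for a reducer-style signature.
function applyVoiceEvent(_view: TranscriptView, event: VoiceEvent): TranscriptView {
  return {
    text: event.text,
    pending: event.type === "voice_interim",
  };
}
```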
+#### voice_final: final recognition result (Server → Client)
+
+```json
+{
+  "type": "voice_final",
+  "message_id": "voice_<random>",
+  "text": "final recognized text",
+  "timestamp": 1717000000000
+}
+```
+
+| Field | Description |
+|------|------|
+| `text` | The final, complete recognized text. An empty string means no speech was recognized |
+
+After `voice_final`, the Gateway automatically forwards `text` to AI-Core as a `message`, which triggers a streaming LLM reply. The subsequent `stream_start` / `review` / `stream_end` flow is the same as for an ordinary text message.
+
 ---
 
 ## 3. Session management