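AskaEth 70f8b30d04 docs: add complete API documentation (unified Gateway docs + backend service docs)

Added docs/api/gateway-api.md: a unified gateway API reference for client developers, covering all 16 modules.
Added docs/api/backend-services/: detailed backend service docs (ai-core, memory-service, voice-service, iot-debug, tool-engine).
Updated .gitignore: docs/api/ is exempted so it can be pushed; everything else under docs/ is still ignored.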

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-24 12:39:55 +08:00

5.8 KiB

Voice-Service API

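Base URL: http://<host>:8093 | Auth: none

The voice service wraps two engine layers, each with a primary and a fallback engine:

STT (speech-to-text): DashScope qwen3-asr-flash-realtime (primary) + local Whisper (fallback)
TTS (text-to-speech): edge-tts (primary) + espeak-ng (fallback)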

1. POST /api/v1/transcribe: speech-to-text

Content-Type: multipart/form-data | Max body: 10 MB

Form fields

Field	Type	Required	Description
`audio`	file	yes	Audio file. The format is inferred from the file extension: wav/mp3/ogg/flac/m4a
`language`	string	no	Defaults to `"zh"`. Allowed values: `zh`, `en`, `ja`, `ko`, `auto`

Response 200

{
  "success": true,
  "text": "transcribed text",
  "language": "zh",
  "duration_ms": 1234
}

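Errors

Status	Error body	Meaning
400	`{"error":"文件过大或解析失败，最大支持 10MB"}`	File too large or failed to parse; the maximum is 10 MB
400	`{"error":"缺少 audio 文件字段"}`	The `audio` file field is missing
400	`{"error":"音频文件为空"}`	The audio file is empty
400	`{"error":"不支持的音频格式: <ext>，支持的格式: WAV, MP3, OGG, FLAC, M4A"}`	Unsupported audio format `<ext>`; supported formats are WAV, MP3, OGG, FLAC, M4A
405	`{"error":"method not allowed"}`	HTTP method other than POST
500	`{"error":"读取音频文件失败"}`	Failed to read the audio file
500	`{"success":false,"error":"<engine error>"}`	The STT engine reported an error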
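
As an illustration of the request format above, here is a minimal sketch that builds the multipart body with only the Python standard library and applies the documented extension and size checks before sending. The function name `build_transcribe_body` is hypothetical, and reading "10 MB" as 10 MiB is an assumption; the server's exact limit may differ.

```python
import uuid
from pathlib import Path

# Extensions listed for the `audio` field.
SUPPORTED_EXTENSIONS = {"wav", "mp3", "ogg", "flac", "m4a"}
# Assumption: the documented "10 MB" limit is treated as 10 MiB here.
MAX_BODY_BYTES = 10 * 1024 * 1024


def build_transcribe_body(filename: str, audio: bytes, language: str = "zh"):
    """Return (body, content_type) for POST /api/v1/transcribe."""
    ext = Path(filename).suffix.lstrip(".").lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext!r}")
    if not audio:
        raise ValueError("audio file is empty")

    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="language"\r\n\r\n'
        f"{language}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio"; filename="{Path(filename).name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")

    body = head + audio + tail
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("request body exceeds the 10 MB limit")
    return body, f"multipart/form-data; boundary={boundary}"
```

The returned pair can be passed to `urllib.request.Request(url, data=body, headers={"Content-Type": content_type}, method="POST")`, where `url` is `http://<host>:8093/api/v1/transcribe`.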

2. POST /api/v1/tts/synthesize: text-to-speech

Content-Type: application/json

Request

{
  "text": "你好世界 (required)",
  "voice": "zh-CN-XiaoxiaoNeural (default)",
  "rate": "+0% (default; e.g. +20%/-20%)"
}

Response 200: raw audio stream

Content-Type: audio/mpeg (edge-tts) or audio/wav (espeak-ng/fallback)
Content-Disposition: inline; filename=synthesized.mp3

Engine fallback chain: edge-tts (mp3) → espeak-ng (wav) → silent WAV

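Errors

Status	Error body	Meaning
400	`{"error":"请求体解析失败: ..."}`	Failed to parse the request body
400	`{"error":"text 字段不能为空"}`	The `text` field must not be empty
405	`{"error":"method not allowed"}`	HTTP method other than POST
500	`{"error":"TTS 合成失败: ..."}`	TTS synthesis failed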
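
Because the response may be MP3 or WAV depending on which engine in the fallback chain produced it, while the documented Content-Disposition filename is always `synthesized.mp3`, a client may prefer to choose the file extension from Content-Type. The following is a minimal sketch; both function names are hypothetical, and the `.bin` fallback for unknown media types is an assumption.

```python
import json

DEFAULT_VOICE = "zh-CN-XiaoxiaoNeural"  # documented default voice
DEFAULT_RATE = "+0%"                    # documented default rate


def build_synthesize_payload(text: str, voice: str = DEFAULT_VOICE,
                             rate: str = DEFAULT_RATE) -> bytes:
    """Encode the JSON body for POST /api/v1/tts/synthesize."""
    if not text:
        # Mirrors the documented 400 for an empty `text` field.
        raise ValueError("text must not be empty")
    payload = {"text": text, "voice": voice, "rate": rate}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def extension_for(content_type: str) -> str:
    """Map the response Content-Type to a file extension."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    # audio/mpeg comes from edge-tts; audio/wav from espeak-ng or the silent fallback.
    return {"audio/mpeg": ".mp3", "audio/wav": ".wav"}.get(media_type, ".bin")
```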

3. GET /api/v1/tts/voices: voice list

// Response 200
{
  "voices": [
    { "name": "zh-CN-XiaoxiaoNeural", "display_name": "晓晓 (女声)", "gender": "Female", "locale": "zh-CN" },
    { "name": "zh-CN-YunxiNeural",    "display_name": "云希 (男声)", "gender": "Male",   "locale": "zh-CN" },
    { "name": "zh-CN-XiaoyiNeural",   "display_name": "晓伊 (女声)", "gender": "Female", "locale": "zh-CN" }
  ],
  "count": 3
}
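
A client will typically pick a `name` from this list and pass it as `voice` to the synthesize endpoint. A minimal sketch of that selection follows; `pick_voice` is a hypothetical helper, and returning the first match is an arbitrary choice.

```python
from typing import Optional


def pick_voice(payload: dict, locale: str = "zh-CN",
               gender: Optional[str] = None) -> Optional[str]:
    """Return the name of the first voice matching locale and, optionally, gender."""
    for voice in payload.get("voices", []):
        if voice.get("locale") != locale:
            continue
        if gender is not None and voice.get("gender") != gender:
            continue
        return voice["name"]
    return None  # no voice matched the filters
```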

4. GET /api/v1/health: health check

{
  "status": "ok",
  "service": "voice-service",
  "stt": {
    "available": true,
    "primary": "dashscope",
    "dashscope": { "available": true, "model": "qwen3-asr-flash-realtime", "provider": "dashscope" },
    "whisper": {
      "available": true,
      "binary_available": true,
      "model_loaded": true,
      "model_name": "ggml-small.bin"
    },
    "default_language": "zh",
    "supported_languages": ["zh","en","ja","ko","auto"]
  },
  "tts": {
    "available": true,
    "edge_tts": true,
    "espeak_ng": false,
    "engine": "edge-tts",
    "default_voice": "zh-CN-XiaoxiaoNeural",
    "builtin_voices": 3
  }
}

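Status field reference

Field	Description
`stt.available`	At least one of DashScope or Whisper is available
`stt.dashscope.available`	The DashScope API key is configured
`stt.whisper.available`	Both the Whisper binary and the model file are present
`tts.available`	At least one TTS engine is available
`tts.engine`	Currently active engine: `edge-tts`, `espeak-ng`, `fallback (silent WAV)`, `none`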
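
The fields above can be reduced to a few booleans for a readiness probe. The sketch below is one possible interpretation, not the service's own logic: `summarize_health` is a hypothetical helper, and treating DashScope as the engine that will be used whenever it is available is an assumption based on its "primary" role.

```python
def summarize_health(health: dict) -> dict:
    """Condense a /api/v1/health payload into a small readiness summary."""
    stt = health.get("stt", {})
    tts = health.get("tts", {})

    # Assumption: the primary engine (DashScope) is preferred when available.
    if stt.get("dashscope", {}).get("available"):
        stt_engine = "dashscope"
    elif stt.get("whisper", {}).get("available"):
        stt_engine = "whisper"
    else:
        stt_engine = None

    engine = tts.get("engine", "none")
    return {
        "stt_ready": bool(stt.get("available")),
        "stt_engine": stt_engine,
        "tts_engine": engine,
        # The silent-WAV fallback still returns audio, but it contains no speech.
        "tts_audible": engine in ("edge-tts", "espeak-ng"),
    }
```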

5. GET /api/v1/status: service status

Same as /health, but without the top-level status field:

{
  "service": "voice-service",
  "stt": { ... },  // same as health.stt
  "tts": { ... }   // same as health.tts
}

6. GET /api/v1/tts/status: TTS-only status

{
  "service": "voice-service",
  "tts": {
    "available": true,
    "edge_tts": true,
    "espeak_ng": false,
    "engine": "edge-tts",
    "default_voice": "zh-CN-XiaoxiaoNeural",
    "builtin_voices": 3
  }
}

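7. WebSocket GET /api/v1/stt/stream: streaming STT

Query parameters: ?language=zh&format=pcm (language defaults to zh, format defaults to pcm)
Read deadline: 300s

Client → server

Binary frames: raw PCM audio (16-bit LE, 16000 Hz, mono). Each frame is forwarded to DashScope via input_audio_buffer.append.

JSON control frames:

{ "action": "stop" }
// Requests the end of the session. The server replies with done and then closes the connection.

{ "language": "en" }
// Switches the recognition language on the fly.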
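
The PCM format fixes the byte rate: 16000 samples/s × 2 bytes × 1 channel = 32000 bytes per second. The sketch below slices a PCM buffer into fixed-duration binary frames; the 100 ms frame length is an assumption chosen for illustration, since no frame size is specified here.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # documented sample rate
BYTES_PER_SAMPLE = 2     # 16-bit little-endian samples
CHANNELS = 1             # mono


def frame_size_bytes(frame_ms: int) -> int:
    """Number of PCM bytes covering `frame_ms` milliseconds of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * frame_ms // 1000


def iter_pcm_frames(pcm: bytes, frame_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive frames; the last one may be shorter."""
    size = frame_size_bytes(frame_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

With these values a 100 ms frame is 3200 bytes, so one second of audio is sent as ten binary frames.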

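Server → client (JSON text frames)

result: recognition result

{
  "type": "result",
  "text": "recognized text fragment",
  "isFinal": true
}

Field	Description
`isFinal: true`	A complete sentence, as delimited by VAD endpoint detection
`isFinal: false`	Intermediate incremental result (delta)

error

{ "type": "error", "error": "error description" }

done: reply to stop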

{ "type": "done", "action": "stop" }
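
The three message types can be folded into a small client-side state object. This is a sketch only: the `Transcript` class is hypothetical, and replacing the pending partial with each non-final `text` (rather than appending it) is an assumption, since the section describes non-final results only as intermediate deltas.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transcript:
    sentences: List[str] = field(default_factory=list)  # finalized sentences
    partial: str = ""                                    # latest non-final text
    done: bool = False                                   # set once `done` arrives

    def feed(self, raw: str) -> None:
        """Apply one JSON text frame received from the server."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "result":
            if msg.get("isFinal"):
                self.sentences.append(msg["text"])
                self.partial = ""
            else:
                # Assumption: each non-final frame replaces the pending partial.
                self.partial = msg["text"]
        elif kind == "error":
            raise RuntimeError(msg.get("error", "unknown error"))
        elif kind == "done":
            self.done = True
```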

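Connection lifecycle

HTTP upgrade request → verify that an STT engine is available (returns 503 if not)
Establish the DashScope realtime session (session.created → session.update → session.updated)
Client sends binary PCM frames → server base64-encodes them and sends input_audio_buffer.append
DashScope VAD detects speech endpoints automatically → conversation.item.input_audio_transcription.completed → forwarded as result
Client sends {"action":"stop"} → server sends session.finish → connection closes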
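
From the client's side, the lifecycle reduces to: send binary frames, send the stop control frame, then read until `done`. The sketch below is written against any object exposing async `send` and `recv` methods, so it does not depend on a particular WebSocket library. `run_session` and `FakeServer` are hypothetical; the fake is an in-memory stand-in whose replies are invented for local testing and say nothing about what the real service returns. Sending every frame before reading is also a simplification, since a real client would normally read results while still streaming.

```python
import asyncio
import json
from collections import deque


async def run_session(ws, frames) -> list:
    """Stream PCM frames, request stop, and collect final sentences until done."""
    for frame in frames:
        await ws.send(frame)                       # binary frame: raw PCM
    await ws.send(json.dumps({"action": "stop"}))  # JSON control frame

    finals = []
    while True:
        msg = json.loads(await ws.recv())
        if msg["type"] == "result" and msg.get("isFinal"):
            finals.append(msg["text"])
        elif msg["type"] == "error":
            raise RuntimeError(msg["error"])
        elif msg["type"] == "done":
            return finals


class FakeServer:
    """Hypothetical in-memory stand-in for the WebSocket connection."""

    def __init__(self):
        self.received_bytes = 0
        self._outbox = deque()

    async def send(self, data):
        if isinstance(data, bytes):
            self.received_bytes += len(data)
        elif json.loads(data).get("action") == "stop":
            # Invented reply: one final result, then the documented done frame.
            self._outbox.append(json.dumps(
                {"type": "result", "text": f"{self.received_bytes} bytes heard", "isFinal": True}))
            self._outbox.append(json.dumps({"type": "done", "action": "stop"}))

    async def recv(self):
        return self._outbox.popleft()
```

Running `asyncio.run(run_session(FakeServer(), [b"\x00" * 3200]))` exercises the whole sequence without a network connection.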
