# Voice-Service API

**Base URL:** `http://:8093` | **Auth:** none

The voice service wraps two engine layers:

- **STT (speech-to-text):** DashScope `qwen3-asr-flash-realtime` (primary) + local Whisper (fallback)
- **TTS (text-to-speech):** edge-tts (primary) + espeak-ng (fallback)

---

## Contents

1. [POST /api/v1/transcribe — Speech-to-text](#1-post-apiv1transcribe)
2. [POST /api/v1/tts/synthesize — Text-to-speech](#2-post-apiv1ttssynthesize)
3. [GET /api/v1/tts/voices — Voice list](#3-get-apiv1ttsvoices)
4. [GET /api/v1/health — Health check](#4-get-apiv1health)
5. [GET /api/v1/status — Service status](#5-get-apiv1status)
6. [GET /api/v1/tts/status — TTS status](#6-get-apiv1ttsstatus)
7. [WebSocket GET /api/v1/stt/stream — Streaming STT](#7-websocket-get-apiv1sttstream)

---

## 1. POST /api/v1/transcribe — Speech-to-text

**Content-Type:** `multipart/form-data` | **Max body:** 10 MB

### Form fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `audio` | file | Yes | Audio file. The format is inferred from the file extension: wav/mp3/ogg/flac/m4a |
| `language` | string | No | Defaults to `"zh"`. Allowed values: `zh`, `en`, `ja`, `ko`, `auto` |

### Response 200

```json
{
  "success": true,
  "text": "transcribed text",
  "language": "zh",
  "duration_ms": 1234
}
```

### Errors

| Status | Error body | Meaning |
|--------|------------|---------|
| 400 | `{"error":"文件过大或解析失败,最大支持 10MB"}` | File too large or failed to parse; the maximum is 10 MB |
| 400 | `{"error":"缺少 audio 文件字段"}` | Missing `audio` file field |
| 400 | `{"error":"音频文件为空"}` | Audio file is empty |
| 400 | `{"error":"不支持的音频格式: ,支持的格式: WAV, MP3, OGG, FLAC, M4A"}` | Unsupported audio format; supported formats are WAV, MP3, OGG, FLAC, M4A |
| 405 | `{"error":"method not allowed"}` | Method not allowed |
| 500 | `{"error":"读取音频文件失败"}` | Failed to read the audio file |
| 500 | `{"success":false,"error":""}` | Transcription failed |

---

## 2. POST /api/v1/tts/synthesize — Text-to-speech

**Content-Type:** `application/json`

### Request

```json
{
  "text": "你好世界 (required)",
  "voice": "zh-CN-XiaoxiaoNeural (default)",
  "rate": "+0% (default; e.g. +20% / -20%)"
}
```

### Response 200 — Raw audio stream

- **Content-Type:** `audio/mpeg` (edge-tts) or `audio/wav` (espeak-ng/fallback)
- **Content-Disposition:** `inline; filename=synthesized.mp3`

Engine fallback chain: edge-tts (mp3) → espeak-ng (wav) → silent WAV

### Errors

| Status | Error body | Meaning |
|--------|------------|---------|
| 400 | `{"error":"请求体解析失败: ..."}` | Failed to parse the request body |
| 400 | `{"error":"text 字段不能为空"}` | The `text` field must not be empty |
| 405 | `{"error":"method not allowed"}` | Method not allowed |
| 500 | `{"error":"TTS 合成失败: ..."}` | TTS synthesis failed |

---

## 3. GET /api/v1/tts/voices — Voice list

### Response 200

```json
{
  "voices": [
    { "name": "zh-CN-XiaoxiaoNeural", "display_name": "晓晓 (女声)", "gender": "Female", "locale": "zh-CN" },
    { "name": "zh-CN-YunxiNeural", "display_name": "云希 (男声)", "gender": "Male", "locale": "zh-CN" },
    { "name": "zh-CN-XiaoyiNeural", "display_name": "晓伊 (女声)", "gender": "Female", "locale": "zh-CN" }
  ],
  "count": 3
}
```

---

## 4. GET /api/v1/health — Health check

```json
{
  "status": "ok",
  "service": "voice-service",
  "stt": {
    "available": true,
    "primary": "dashscope",
    "dashscope": {
      "available": true,
      "model": "qwen3-asr-flash-realtime",
      "provider": "dashscope"
    },
    "whisper": {
      "available": true,
      "binary_available": true,
      "model_loaded": true,
      "model_name": "ggml-small.bin"
    },
    "default_language": "zh",
    "supported_languages": ["zh", "en", "ja", "ko", "auto"]
  },
  "tts": {
    "available": true,
    "edge_tts": true,
    "espeak_ng": false,
    "engine": "edge-tts",
    "default_voice": "zh-CN-XiaoxiaoNeural",
    "builtin_voices": 3
  }
}
```

### Status fields

| Field | Description |
|-------|-------------|
| `stt.available` | At least one of DashScope or Whisper is available |
| `stt.dashscope.available` | The DashScope API key is configured |
| `stt.whisper.available` | Both the Whisper binary and the model file are present |
| `tts.available` | At least one TTS engine is available |
| `tts.engine` | Currently active engine: `edge-tts`, `espeak-ng`, `fallback (silent WAV)`, `none` |

---

## 5. GET /api/v1/status — Service status

Same as `/health`, but without the top-level `status` field:

```json
{
  "service": "voice-service",
  "stt": { ... },  // same as health.stt
  "tts": { ... }   // same as health.tts
}
```

---

## 6. GET /api/v1/tts/status — TTS-only status

```json
{
  "service": "voice-service",
  "tts": {
    "available": true,
    "edge_tts": true,
    "espeak_ng": false,
    "engine": "edge-tts",
    "default_voice": "zh-CN-XiaoxiaoNeural",
    "builtin_voices": 3
  }
}
```

---

## 7. WebSocket GET /api/v1/stt/stream — Streaming STT

**Query parameters:** `?language=zh&format=pcm` (`language` defaults to `zh`, `format` defaults to `pcm`)

**Read deadline:** 300 s

### Client → server

**Binary frames:** raw PCM audio (16-bit LE, 16000 Hz, mono). Each frame is forwarded to DashScope via `input_audio_buffer.append`.

**JSON control frames:**

```json
{ "action": "stop" }   // Request the end of the session. The server replies with done, then closes.
{ "language": "en" }   // Switch the recognition language on the fly.
```

### Server → client (JSON text frames)

**result** — recognition result

```json
{ "type": "result", "text": "recognized text fragment", "isFinal": true }
```

| Field | Description |
|-------|-------------|
| `isFinal: true` | A complete sentence, as delimited by VAD endpoint detection |
| `isFinal: false` | An intermediate increment (delta) |

**error**

```json
{ "type": "error", "error": "error description" }
```

**done** — reply to stop

```json
{ "type": "done", "action": "stop" }
```

### Connection lifecycle

1. HTTP upgrade request → the server checks that an STT engine is available (returns 503 if not)
2. The server opens a DashScope realtime session (`session.created` → `session.update` → `session.updated`)
3. The client sends binary PCM frames → the server base64-encodes each one and sends `input_audio_buffer.append`
4. DashScope VAD detects an endpoint automatically → `conversation.item.input_audio_transcription.completed` → the server forwards a result
5. The client sends `{"action":"stop"}` → the server sends `session.finish` → the connection is closed
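
The client side of the streaming protocol above can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names (`stream_url`, `pcm_frames`, `TranscriptCollector`) and the 100 ms frame size are assumptions made for illustration, and the WebSocket transport itself is left to whichever library you use. It also assumes that non-final results are deltas to be concatenated, following the `isFinal: false` description.

```python
import json
from urllib.parse import urlencode

SAMPLE_RATE_HZ = 16000  # documented binary-frame format: 16000 Hz
BYTES_PER_SAMPLE = 2    # 16-bit little-endian, mono

# JSON control frame that asks the server to end the session.
STOP_FRAME = json.dumps({"action": "stop"})


def stream_url(base_ws_url, language="zh", fmt="pcm"):
    """Build the stream URL with the documented query parameters."""
    query = urlencode({"language": language, "format": fmt})
    return f"{base_ws_url}/api/v1/stt/stream?{query}"


def pcm_frames(pcm, frame_ms=100):
    """Split raw PCM bytes into binary frames of about frame_ms each.

    The frame duration is an assumption; the API does not prescribe one.
    """
    frame_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


class TranscriptCollector:
    """Fold server JSON text frames into finished sentences."""

    def __init__(self):
        self.sentences = []  # one entry per isFinal=true result
        self.partial = ""    # deltas received since the last final result
        self.done = False    # set once the server sends a done frame

    def feed(self, raw):
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "result":
            if msg.get("isFinal"):
                self.sentences.append(msg["text"])
                self.partial = ""
            else:
                # Assumption: each non-final frame is an increment to append.
                self.partial += msg["text"]
        elif kind == "error":
            raise RuntimeError(msg.get("error", "unknown error"))
        elif kind == "done":
            self.done = True
```

A client would open `stream_url("ws://localhost:8093")` (the host is a placeholder), send each chunk from `pcm_frames(...)` as a binary frame, pass every incoming text frame to `TranscriptCollector.feed`, and finish by sending `STOP_FRAME` and reading until `done` is set.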