
# Voice-Service API

**Base URL:** `http://<host>:8093` | **Auth:** none

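The voice service wraps two engine layers, each with a primary and a fallback engine:
- **STT (speech-to-text):** DashScope `qwen3-asr-flash-realtime` (primary) + local Whisper (fallback)
- **TTS (text-to-speech):** edge-tts (primary) + espeak-ng (fallback)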

---

## Contents

1. [POST /api/v1/transcribe: speech-to-text](#1-post-apiv1transcribe)
2. [POST /api/v1/tts/synthesize: text-to-speech](#2-post-apiv1ttssynthesize)
3. [GET /api/v1/tts/voices: voice list](#3-get-apiv1ttsvoices)
4. [GET /api/v1/health: health check](#4-get-apiv1health)
5. [GET /api/v1/status: service status](#5-get-apiv1status)
6. [GET /api/v1/tts/status: TTS status](#6-get-apiv1ttsstatus)
7. [WebSocket GET /api/v1/stt/stream: streaming STT](#7-websocket-get-apiv1sttstream)

---

## 1. POST /api/v1/transcribe: speech-to-text

**Content-Type:** `multipart/form-data` | **Max body:** 10 MB

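### Form fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `audio` | file | Yes | Audio file. The format is inferred from the file extension: wav/mp3/ogg/flac/m4a |
| `language` | string | No | Defaults to `"zh"`. Allowed values: `zh`, `en`, `ja`, `ko`, `auto` |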

### Response 200

```json
{
  "success": true,
  "text": "transcribed text",
  "language": "zh",
  "duration_ms": 1234
}
```

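### Errors

The error bodies below are the literal strings the service returns; the last column gives their meaning in English.

| Status | Error body | Meaning |
|--------|------------|---------|
| 400 | `{"error":"文件过大或解析失败，最大支持 10MB"}` | File too large or failed to parse; the maximum is 10 MB |
| 400 | `{"error":"缺少 audio 文件字段"}` | The `audio` file field is missing |
| 400 | `{"error":"音频文件为空"}` | The audio file is empty |
| 400 | `{"error":"不支持的音频格式: <ext>，支持的格式: WAV, MP3, OGG, FLAC, M4A"}` | Unsupported audio format `<ext>`; supported formats: WAV, MP3, OGG, FLAC, M4A |
| 405 | `{"error":"method not allowed"}` | HTTP method not allowed |
| 500 | `{"error":"读取音频文件失败"}` | Failed to read the audio file |
| 500 | `{"success":false,"error":"<engine error>"}` | Error message reported by the engine |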
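The request format and the client-side checks implied by the error table can be sketched with the Python standard library alone. This is a minimal sketch, not the service's own client: the helper names `build_transcribe_body` and `transcribe` are hypothetical, and applying the 10 MB limit to the whole encoded body (rather than to the file alone) is a conservative assumption.

```python
import json
import urllib.request
import uuid
from pathlib import Path

MAX_BODY_BYTES = 10 * 1024 * 1024  # documented "Max body: 10 MB"
SUPPORTED_EXTS = {"wav", "mp3", "ogg", "flac", "m4a"}
SUPPORTED_LANGS = {"zh", "en", "ja", "ko", "auto"}


def build_transcribe_body(filename: str, audio: bytes, language: str = "zh"):
    """Return (content_type, body) for POST /api/v1/transcribe."""
    ext = Path(filename).suffix.lstrip(".").lower()
    if ext not in SUPPORTED_EXTS:
        raise ValueError(f"unsupported audio format: {ext!r}")
    if language not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language: {language!r}")
    if not audio:
        raise ValueError("audio is empty")

    boundary = uuid.uuid4().hex
    # Two parts: the optional `language` field, then the required `audio` file.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="language"\r\n\r\n'
        f"{language}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    body = head + audio + tail
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("request body exceeds 10 MB")
    return f"multipart/form-data; boundary={boundary}", body


def transcribe(base_url: str, path: str, language: str = "zh") -> dict:
    """Send the file at `path` and return the decoded JSON response."""
    content_type, body = build_transcribe_body(
        Path(path).name, Path(path).read_bytes(), language
    )
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Validating locally avoids a round trip for the 400 cases; the server remains the authority, so a caller should still handle the documented error bodies.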

---

## 2. POST /api/v1/tts/synthesize: text-to-speech

**Content-Type:** `application/json`

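### Request

```json
{
  "text": "你好世界",
  "voice": "zh-CN-XiaoxiaoNeural",
  "rate": "+0%"
}
```

| Field | Required | Description |
|-------|----------|-------------|
| `text` | Yes | Text to synthesize |
| `voice` | No | Defaults to `zh-CN-XiaoxiaoNeural` |
| `rate` | No | Defaults to `+0%`; for example `+20%` or `-20%` |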

### Response 200: raw audio stream

- **Content-Type:** `audio/mpeg` (edge-tts) or `audio/wav` (espeak-ng/fallback)
- **Content-Disposition:** `inline; filename=synthesized.mp3`

Engine fallback chain: edge-tts (mp3) → espeak-ng (wav) → silent WAV

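### Errors

The error bodies below are the literal strings the service returns; the last column gives their meaning in English.

| Status | Error body | Meaning |
|--------|------------|---------|
| 400 | `{"error":"请求体解析失败: ..."}` | Failed to parse the request body |
| 400 | `{"error":"text 字段不能为空"}` | The `text` field must not be empty |
| 405 | `{"error":"method not allowed"}` | HTTP method not allowed |
| 500 | `{"error":"TTS 合成失败: ..."}` | TTS synthesis failed |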
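Because the response may be MP3 or WAV depending on which engine answered, a client has to inspect the response headers before saving the audio. The following is a minimal standard-library sketch; the function names are hypothetical. The documented `Content-Disposition` names `synthesized.mp3`, and the document does not say whether that changes for WAV output, so the sketch picks the file extension from `Content-Type` instead.

```python
import json
import urllib.request


def build_synthesize_request(base_url: str, text: str,
                             voice: str = "zh-CN-XiaoxiaoNeural",
                             rate: str = "+0%") -> urllib.request.Request:
    """Build the POST /api/v1/tts/synthesize request without sending it."""
    if not text.strip():
        raise ValueError("text must not be empty")
    payload = json.dumps(
        {"text": text, "voice": voice, "rate": rate}, ensure_ascii=False
    ).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/tts/synthesize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extension_for(content_type: str) -> str:
    """Map the response Content-Type to a file extension."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return {"audio/mpeg": ".mp3", "audio/wav": ".wav"}.get(media_type, ".bin")


def synthesize_to_file(base_url: str, text: str, stem: str = "synthesized") -> str:
    """Synthesize `text` and write the audio next to the caller; returns the path."""
    req = build_synthesize_request(base_url, text)
    with urllib.request.urlopen(req, timeout=30) as resp:
        path = stem + extension_for(resp.headers.get("Content-Type", ""))
        with open(path, "wb") as out:
            out.write(resp.read())
    return path
```

Note that the last link of the fallback chain is a silent WAV, so an `audio/wav` response is not proof that speech was produced; the `tts.engine` field of the status endpoints reports which engine is active.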

---

## 3. GET /api/v1/tts/voices: voice list

### Response 200

```json
{
  "voices": [
    { "name": "zh-CN-XiaoxiaoNeural", "display_name": "晓晓 (女声)", "gender": "Female", "locale": "zh-CN" },
    { "name": "zh-CN-YunxiNeural",    "display_name": "云希 (男声)", "gender": "Male",   "locale": "zh-CN" },
    { "name": "zh-CN-XiaoyiNeural",   "display_name": "晓伊 (女声)", "gender": "Female", "locale": "zh-CN" }
  ],
  "count": 3
}
```

---

## 4. GET /api/v1/health: health check

```json
{
  "status": "ok",
  "service": "voice-service",
  "stt": {
    "available": true,
    "primary": "dashscope",
    "dashscope": { "available": true, "model": "qwen3-asr-flash-realtime", "provider": "dashscope" },
    "whisper": {
      "available": true,
      "binary_available": true,
      "model_loaded": true,
      "model_name": "ggml-small.bin"
    },
    "default_language": "zh",
    "supported_languages": ["zh","en","ja","ko","auto"]
  },
  "tts": {
    "available": true,
    "edge_tts": true,
    "espeak_ng": false,
    "engine": "edge-tts",
    "default_voice": "zh-CN-XiaoxiaoNeural",
    "builtin_voices": 3
  }
}
```

### Status fields

| Field | Description |
|-------|-------------|
| `stt.available` | At least one of DashScope or Whisper is available |
| `stt.dashscope.available` | The DashScope API key is configured |
| `stt.whisper.available` | Both the Whisper binary and the model file exist |
| `tts.available` | At least one TTS engine is available |
| `tts.engine` | Currently active engine: `edge-tts`, `espeak-ng`, `fallback (silent WAV)`, `none` |
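The fields above can be reduced to a few booleans that a caller checks before sending audio or text. This is a minimal sketch over the decoded `/api/v1/health` JSON; `summarize_health` and its output keys are hypothetical names, and treating only `edge-tts` and `espeak-ng` as producing audible speech is an assumption drawn from the engine list.

```python
def summarize_health(health: dict) -> dict:
    """Condense a decoded /api/v1/health response into caller-facing flags."""
    stt = health.get("stt", {})
    tts = health.get("tts", {})

    # Collect the STT engines that report themselves as available.
    stt_engines = [
        name for name in ("dashscope", "whisper")
        if stt.get(name, {}).get("available")
    ]
    tts_engine = tts.get("engine", "none")

    return {
        "stt_usable": bool(stt.get("available")),
        "stt_engines": stt_engines,
        # True when only the local Whisper fallback can serve requests.
        "stt_on_fallback": stt_engines == ["whisper"],
        "tts_engine": tts_engine,
        # "fallback (silent WAV)" and "none" yield no audible speech.
        "tts_audible": tts_engine in ("edge-tts", "espeak-ng"),
    }
```

A monitoring job might alert on `stt_on_fallback` or on `tts_audible` being false even while the top-level `status` is still `"ok"`.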

---

## 5. GET /api/v1/status: service status

Same as `/health`, but without the top-level `status` field:

```json
{
  "service": "voice-service",
  "stt": { ... },  // same as health.stt
  "tts": { ... }   // same as health.tts
}
```

---

## 6. GET /api/v1/tts/status: TTS-only status

```json
{
  "service": "voice-service",
  "tts": {
    "available": true,
    "edge_tts": true,
    "espeak_ng": false,
    "engine": "edge-tts",
    "default_voice": "zh-CN-XiaoxiaoNeural",
    "builtin_voices": 3
  }
}
```

---

## 7. WebSocket GET /api/v1/stt/stream: streaming STT

**Query parameters:** `?language=zh&format=pcm` (`language` defaults to `zh`, `format` defaults to `pcm`) | **Read deadline:** 300s

### Client → server

**Binary frames:** raw PCM audio (16-bit LE, 16000 Hz, mono). Each frame is forwarded to DashScope via `input_audio_buffer.append`.
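The audio format fixes the byte arithmetic: 16000 samples per second at 2 bytes per sample is 32000 bytes per second, so a 100 ms frame is 3200 bytes. The sketch below converts float samples to 16-bit little-endian PCM and slices a buffer into frames. The 100 ms frame length and both function names are assumptions for illustration; the document does not prescribe a frame size.

```python
import struct
from typing import Iterable, Iterator

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit mono


def floats_to_pcm16le(samples: Iterable[float]) -> bytes:
    """Convert samples in [-1.0, 1.0] to 16-bit little-endian PCM, with clipping."""
    out = bytearray()
    for s in samples:
        value = max(-32768, min(32767, round(s * 32767)))
        out += struct.pack("<h", value)
    return bytes(out)


def pcm_frames(pcm: bytes, frame_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive frames of `frame_ms` milliseconds; the last may be shorter."""
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("PCM buffer must contain whole 16-bit samples")
    frame_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]
```

Each yielded chunk would be sent as one binary WebSocket frame.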

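**JSON control frames:**

```json
{ "action": "stop" }
// Ends the session. The server replies with done and then closes the connection.

{ "language": "en" }
// Switches the recognition language mid-session.
```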

### Server → client (JSON text frames)

**result**: recognition result
```json
{
  "type": "result",
  "text": "recognized text fragment",
  "isFinal": true
}
```

| Field | Description |
|-------|-------------|
| `isFinal: true` | A complete sentence, delimited by VAD endpoint detection |
| `isFinal: false` | An intermediate increment (delta) |

**error**
```json
{ "type": "error", "error": "error description" }
```

**done**: reply to stop
```json
{ "type": "done", "action": "stop" }
```
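The three server message types map naturally onto a small accumulator on the client side. This is a sketch under one stated assumption: each non-final `text` is treated as the latest partial for the sentence in progress and replaces the previous one. If the service actually sends true increments, the partial would need to be appended instead. `TranscriptCollector` is a hypothetical name.

```python
import json


class TranscriptCollector:
    """Accumulates the server's JSON text frames into a transcript."""

    def __init__(self) -> None:
        self.sentences: list[str] = []  # finalized sentences (isFinal: true)
        self.partial = ""               # latest non-final text, for live display
        self.done = False               # set once the server acknowledges stop

    def feed(self, raw: str) -> None:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "result":
            if msg.get("isFinal"):
                self.sentences.append(msg.get("text", ""))
                self.partial = ""
            else:
                # Assumption: non-final text replaces the previous partial.
                self.partial = msg.get("text", "")
        elif kind == "error":
            raise RuntimeError(msg.get("error", "unknown error"))
        elif kind == "done":
            self.done = True
        # Unknown message types are ignored so newer servers do not break the client.

    @property
    def transcript(self) -> str:
        return "".join(self.sentences)
```

Joining sentences without a separator suits Chinese output; a caller working in English would likely join with a space.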

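### Connection lifecycle

1. HTTP upgrade request → the server checks that an STT engine is available (returns 503 if not)
2. The server opens a DashScope realtime session (`session.created` → `session.update` → `session.updated`)
3. The client sends binary PCM frames → the server base64-encodes each one and sends `input_audio_buffer.append`
4. DashScope VAD detects utterance boundaries automatically → `conversation.item.input_audio_transcription.completed` → forwarded as a result
5. The client sends `{"action":"stop"}` → the server sends `session.finish` → the connection closes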
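Seen from the client, the lifecycle above reduces to: connect, send frames, send stop, read until `done`. The sketch below keeps the WebSocket library out of the picture by coding against a hypothetical `Transport` interface (`send_bytes`, `send_text`, `recv_text`), which any real client library would need to be adapted to. The `ws://` scheme is assumed from the plain-HTTP base URL, and stopping on an `error` message is also an assumption. For simplicity it sends all audio before reading; a real client would read results concurrently while streaming.

```python
import json
from typing import Iterable, Iterator, Protocol
from urllib.parse import urlencode


class Transport(Protocol):
    """Hypothetical minimal WebSocket surface used by this sketch."""

    def send_bytes(self, data: bytes) -> None: ...
    def send_text(self, data: str) -> None: ...
    def recv_text(self) -> str: ...


def stream_url(host: str, language: str = "zh", audio_format: str = "pcm") -> str:
    """Build the streaming endpoint URL with its two query parameters."""
    query = urlencode({"language": language, "format": audio_format})
    return f"ws://{host}/api/v1/stt/stream?{query}"


def stream_session(ws: Transport, frames: Iterable[bytes]) -> Iterator[dict]:
    """Send every PCM frame, request stop, then yield server messages until the end."""
    for frame in frames:
        ws.send_bytes(frame)                      # lifecycle step 3
    ws.send_text(json.dumps({"action": "stop"}))  # lifecycle step 5
    while True:
        msg = json.loads(ws.recv_text())
        yield msg
        if msg.get("type") in ("done", "error"):
            return
```

Because of the 300 s read deadline, a long-running client should keep sending audio or end the session explicitly rather than leave the connection idle.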