DroneMind/voicellmcloud/docs/API_SPECIFICATION.md

# Cloud Drone Voice Service - API Specification (v1.0)

This document defines the complete interaction protocol between the Orange Pi client and the cloud voice service.

---

## 1. Connection Information

| Item | Value |
|------|-------|
| **Protocol** | WebSocket (ws://) |
| **Endpoint** | `ws://<server-ip>:8765/v1/voice/session` |
| **Authentication** | Bearer Token |

### 1.1 TTS only (bypassing the LLM)

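On an **already established** WebSocket session, send **`tts.synthesize`** (see §3.3). The downlink is still `tts_audio_chunk` (text metadata + binary PCM) followed by **`turn.complete`**. The request is **mutually exclusive with, and queued against,** `turn.text`; it shares the same connection and player path, so no additional HTTP endpoint is needed.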

### 1.2 Device-side microphone + cloud Fun-ASR (`pcm_asr_uplink`)

Set **`transport_profile`: `pcm_asr_uplink`** in `session.start`, then use **`turn.audio.start` / `turn.audio.chunk` / `turn.audio.end`** to uplink **16 kHz mono pcm_s16le** audio (each chunk carries **Base64**-encoded raw PCM). The server calls Alibaba Cloud **Fun-ASR real-time recognition** using **the same Bailian API key as Qwen**. Once recognition completes, the downlink is identical to `text_uplink` (optionally with **`asr.partial`**). **For the full field list and timing, see** [`CLOUD_VOICE_PROTOCOL_pcm_asr_uplink_v1.md`](./CLOUD_VOICE_PROTOCOL_pcm_asr_uplink_v1.md).
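As a rough illustration of the chunk payload encoding described above, the sketch below splits raw 16 kHz mono pcm_s16le audio into fixed-duration pieces and Base64-encodes each one. The function name `pcm_to_base64_chunks` and the 100 ms chunk duration are assumptions made for this example; the actual `turn.audio.chunk` envelope fields are defined in the linked protocol document and are not reproduced here.

```python
import base64

SAMPLE_RATE_HZ = 16000   # uplink rate stated for pcm_asr_uplink
BYTES_PER_SAMPLE = 2     # s16le, mono


def pcm_to_base64_chunks(pcm: bytes, chunk_ms: int = 100) -> list[str]:
    """Split raw PCM into fixed-duration pieces and Base64-encode each one.

    chunk_ms is a hypothetical default, not a value taken from the protocol.
    """
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [
        base64.b64encode(pcm[i:i + chunk_bytes]).decode("ascii")
        for i in range(0, len(pcm), chunk_bytes)
    ]


# 250 ms of silence: two full 100 ms chunks plus one 50 ms remainder.
chunks = pcm_to_base64_chunks(bytes(8000))
```

Each resulting string would go into one `turn.audio.chunk` message; the last piece may be shorter than the others.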

**Device-side "no cloud uplink before wake-up / saving Fun-ASR pay-per-use cost"**: see [`CLOUD_VOICE_CLIENT_WAKE_GATE_v1.md`](./CLOUD_VOICE_CLIENT_WAKE_GATE_v1.md).

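**XiaoAi-style multi-turn sessions (greeting, beep, end-of-utterance prompt, 5 s timeout, flight-control/chitchat branches, server/device division of labor)**: see [`CLOUD_VOICE_ASSISTANT_SESSION_v1.md`](./CLOUD_VOICE_ASSISTANT_SESSION_v1.md).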

---

## 2. Full Communication Sequence

```
Client (Orange Pi)                             Cloud server
      |                                              |
      |-------- 1. session.start ------------------->|
      |                                              |
      |<------- 2. session.ready --------------------|
      |                                              |
      |-------- 3. turn.text (STT result) ---------->|
      |                                              |
      |<------- 4. dialog_result (chat/flight JSON) -|
      |<------- 5. tts_audio_chunk (text frame) -----|
      |<------- 6. tts_audio_chunk (binary audio) ---|
      |<------- 7. ... (more audio chunks) ... ------|
      |<------- 8. turn.complete --------------------|
      |                                              |
      |-------- 9. session.end --------------------->|
```

**TTS only (no dialog)**: after `session.ready`, the client may send `tts.synthesize` (each message carries its own `turn_id`); the server sends only `tts_audio_chunk*` → `turn.complete` (**no** `dialog_result` / `llm.text_delta`).

---

## 3. Message Formats

### 3.1 Start session (session.start)

**Client → Server**

```json
{
  "type": "session.start",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "session_id": "unique session ID (UUID)",
  "auth_token": "must match BEARER_TOKEN in the server's .env",
  "client": {
    "device_id": "unique device identifier",
    "locale": "zh-CN",
    "capabilities": {
      "playback_sample_rate_hz": 24000,
      "prefer_tts_codec": "pcm_s16le"
    }
  }
}
```
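A minimal sketch of how a client might assemble this message, assuming the field values shown in the example above. The helper name `build_session_start` and the sample token and device ID are hypothetical.

```python
import json
import uuid


def build_session_start(auth_token: str, device_id: str) -> dict:
    """Build a session.start message for the text_uplink profile."""
    return {
        "type": "session.start",
        "proto_version": "1.0",
        "transport_profile": "text_uplink",
        "session_id": str(uuid.uuid4()),  # fresh UUID per session
        "auth_token": auth_token,
        "client": {
            "device_id": device_id,
            "locale": "zh-CN",
            "capabilities": {
                "playback_sample_rate_hz": 24000,
                "prefer_tts_codec": "pcm_s16le",
            },
        },
    }


# Serialize to a WebSocket text frame; the token and device ID are placeholders.
start_frame = json.dumps(
    build_session_start("example-token", "orangepi-001"),
    ensure_ascii=False,  # keep any non-ASCII text readable on the wire
)
```

The resulting string would be sent as the first text frame after the WebSocket connection opens.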

---

### 3.2 Server ready (session.ready)

**Server → Client**

```json
{
  "type": "session.ready",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "session_id": "unique session ID (UUID)",
  "server_caps": {
    "accepts_audio_uplink": false,
    "llm": true,
    "tts_codecs": ["pcm_s16le"],
    "llm_context_turns": 4
  }
}
```

---

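### 3.3 TTS-only uplink (tts.synthesize)

Sent after `session.ready`, this request bypasses the LLM: the server synthesizes `text` directly and pushes **`tts_audio_chunk*`**, followed by **`turn.complete`** (with `metrics.llm_ms` set to `0`). The server does **not** send `dialog_result` or `llm.text_delta`, and does **not** write to the dialog history. The request is **mutually exclusive with the `turn.text` pipeline** (within one session, send the next message only after the previous one has been fully processed).

**Client → Server**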

```json
{
  "type": "tts.synthesize",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "unique turn ID (UUID; matches tts_audio_chunk.turn_id)",
  "text": "Chinese text to speak"
}
```

(The length of `text` is limited by the server-side `TTS_MAX_CHARS` setting; anything beyond the limit is truncated.)
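The request above can be sketched as a small builder that generates a fresh `turn_id` for every call. The helper name `build_tts_synthesize` is hypothetical, and the sketch does no client-side length check because the value of `TTS_MAX_CHARS` is a server setting not given here.

```python
import json
import uuid


def build_tts_synthesize(text: str) -> dict:
    """Build a tts.synthesize message with its own turn_id."""
    return {
        "type": "tts.synthesize",
        "proto_version": "1.0",
        "transport_profile": "text_uplink",
        "turn_id": str(uuid.uuid4()),  # echoed back in tts_audio_chunk.turn_id
        "text": text,
    }


first = build_tts_synthesize("电量不足，请返航")
second = build_tts_synthesize("电量不足，请返航")
tts_frame = json.dumps(first, ensure_ascii=False)
```

Because each call produces a new `turn_id`, the client can match the downlink `tts_audio_chunk` and `turn.complete` messages to the request that triggered them.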

---

### 3.4 Send text (turn.text)

**Client → Server**

```json
{
  "type": "turn.text",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "unique turn ID (UUID)",
  "text": "full Chinese text recognized by on-device STT",
  "is_final": true,
  "source": "device_stt"
}
```

---

### 3.5 Recognition result (dialog_result)

**Server → Client**

The client must branch on the `routing` field to decide how to handle the message.

#### Scenario A: Chitchat (chitchat)

```json
{
  "type": "dialog_result",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "unique turn ID (UUID)",
  "user_input": {
    "text": "original user text",
    "language": "zh",
    "is_final": true,
    "source": "device_stt"
  },
  "routing": "chitchat",
  "flight_intent": null,
  "chat_reply": "你好！我无法获取实时天气信息，建议查看手机天气App。",
  "tts_hint": {
    "speak_summary_or_reply": true,
    "voice_id": "default"
  }
}
```

**Client actions:**
1. Use the content of the `chat_reply` field for TTS playback.
2. The client **must** keep receiving the subsequent `tts_audio_chunk` frames and `turn.complete`.

---

#### Scenario B: Flight-control command (flight_intent) ✨ Key scenario

```json
{
  "type": "dialog_result",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "unique turn ID (UUID)",
  "user_input": {
    "text": "起飞然后在前方十米悬停",
    "language": "zh",
    "is_final": true,
    "source": "device_stt"
  },
  "routing": "flight_intent",
  "flight_intent": {
    "is_flight_intent": true,
    "version": 1,
    "actions": [
      {"type": "takeoff", "args": {}},
      {"type": "goto", "args": {"frame": "local_ned", "x": 10, "y": 0, "z": null}},
      {"type": "hover", "args": {}}
    ],
    "summary": "起飞后前往机头方向约十米并保持高度"
  },
  "chat_reply": null,
  "tts_hint": {
    "speak_summary_or_reply": true,
    "voice_id": "default"
  }
}
```

> For the strict field table of the flight-control payload, as well as `wait`, `takeoff.relative_altitude_m`, the optional `trace_id`, and so on, see `docs/FLIGHT_INTENT_SCHEMA_v1.md`.

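**🔊 TTS announcement convention (important):**
*   **When `routing` is `flight_intent`, the speech synthesized and sent by the cloud TTS is always:**
    > **"识别到飞控指令，正在下发指令"** ("Flight-control command recognized, dispatching command")
*   After receiving `flight_intent`, the client should start parsing the `actions` array and prepare to dispatch the flight commands, while playing this prompt.

**Client actions:**
1. **Parse the `flight_intent.actions`** array and execute the flight-control logic in order.
2. The client **must** keep receiving the subsequent `tts_audio_chunk` frames (the prompt audio) and `turn.complete`; **never disconnect or leave the receive loop early**.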
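The two scenarios above can be sketched as a single dispatch on `routing`. This is an illustration only: the function name `handle_dialog_result` and its return shape are assumptions, and raising on an unknown `routing` value is a design choice of this sketch, not protocol behavior.

```python
def handle_dialog_result(msg: dict) -> tuple[str, object]:
    """Return (routing, payload) for a dialog_result message.

    chitchat      -> payload is the chat_reply string
    flight_intent -> payload is an ordered list of (action_type, args)
    """
    routing = msg.get("routing")
    if routing == "chitchat":
        return routing, msg["chat_reply"]
    if routing == "flight_intent":
        actions = msg["flight_intent"]["actions"]
        # Keep the original order: actions are executed sequentially.
        return routing, [(a["type"], a.get("args", {})) for a in actions]
    raise ValueError(f"unknown routing: {routing!r}")


sample = {
    "type": "dialog_result",
    "routing": "flight_intent",
    "flight_intent": {
        "is_flight_intent": True,
        "version": 1,
        "actions": [
            {"type": "takeoff", "args": {}},
            {"type": "goto", "args": {"frame": "local_ned", "x": 10, "y": 0, "z": None}},
            {"type": "hover", "args": {}},
        ],
    },
    "chat_reply": None,
}
kind, plan = handle_dialog_result(sample)
```

Whichever branch is taken, the caller still has to keep reading frames until `turn.complete` arrives.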

---

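### 3.6 TTS audio stream (tts_audio_chunk)

**Protocol rules:**
The audio stream consists of paired messages:
1. **Text frame (JSON)**: metadata describing the audio chunk that follows.
2. **Binary frame (raw bytes)**: the actual PCM audio data.

**Text frame (JSON) format:**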

```json
{
  "type": "tts_audio_chunk",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "unique turn ID (UUID)",
  "seq": 0,
  "codec": "pcm_s16le",
  "sample_rate_hz": 24000,
  "is_final": false
}
```

**Binary frame (audio data):**
*   **Format**: PCM S16LE (16-bit, little-endian)
*   **Sample rate**: 24000 Hz
*   **Channels**: Mono
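Given this format, the playback duration of a buffer follows directly from its byte count: 24000 samples per second at 2 bytes per sample on one channel is 48000 bytes per second. A small sketch (the helper name `pcm_duration_seconds` is hypothetical):

```python
def pcm_duration_seconds(
    num_bytes: int,
    sample_rate_hz: int = 24000,
    bytes_per_sample: int = 2,   # 16-bit samples
    channels: int = 1,           # mono
) -> float:
    """Playback duration of a raw PCM buffer of the given size."""
    return num_bytes / (sample_rate_hz * bytes_per_sample * channels)


one_second = pcm_duration_seconds(48000)
half_second = pcm_duration_seconds(24000)
```

This can be handy for sanity-checking that the received audio has a plausible length before playback.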

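**⚠️ Key client receive logic (mandatory):**
```python
import json


async def receive_frame(ws, audio_buffer):  # illustrative helper name
    msg = await ws.recv()

    # ✅ Always check the frame type!
    if isinstance(msg, bytes):
        # Binary audio data: append it to the buffer
        audio_buffer.append(msg)
        # ❌ Never call json.loads(msg) on a binary frame!
    else:
        # JSON metadata (text frame)
        data = json.loads(msg)
        ...
```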

---

### 3.7 Turn complete (turn.complete)

**Server → Client**

Receiving this message means the current turn (LLM + TTS) has **fully finished**.

```json
{
  "type": "turn.complete",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "unique turn ID (UUID)",
  "metrics": {
    "llm_ms": 1400,
    "tts_first_byte_ms": 3100
  }
}
```

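**Client actions:**
1. On receiving this message, concatenate the binary chunks in `audio_buffer`.
2. Send the complete PCM data to the sound card for playback.
3. Get ready for the next turn.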
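The steps above can be sketched as a receive loop that only returns once `turn.complete` arrives. This is an illustration under assumptions: `ws` is any object with an async `recv()` that yields `str` for text frames and `bytes` for binary frames, and `FakeWS` is a stand-in used here only to exercise the loop without a network connection.

```python
import asyncio
import json


async def run_turn(ws):
    """Read frames until turn.complete; return (dialog_result, pcm_bytes)."""
    audio_buffer = []
    dialog_result = None
    while True:
        msg = await ws.recv()
        if isinstance(msg, bytes):
            audio_buffer.append(msg)          # binary PCM frame
            continue
        data = json.loads(msg)                # text frame (JSON)
        if data["type"] == "dialog_result":
            dialog_result = data              # keep looping, do not exit here
        elif data["type"] == "turn.complete":
            return dialog_result, b"".join(audio_buffer)
        # tts_audio_chunk metadata and other text frames fall through


class FakeWS:
    """Hypothetical stand-in for a WebSocket connection."""

    def __init__(self, frames):
        self._frames = list(frames)

    async def recv(self):
        return self._frames.pop(0)


frames = [
    json.dumps({"type": "dialog_result", "turn_id": "t1", "routing": "chitchat"}),
    json.dumps({"type": "tts_audio_chunk", "turn_id": "t1", "seq": 0}),
    b"\x01\x00\x02\x00",
    json.dumps({"type": "tts_audio_chunk", "turn_id": "t1", "seq": 1}),
    b"\x03\x00",
    json.dumps({"type": "turn.complete", "turn_id": "t1"}),
]
result, pcm = asyncio.run(run_turn(FakeWS(frames)))
```

For a `tts.synthesize` turn, no `dialog_result` is sent, so the first element of the returned pair would be `None`.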

---

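## 4. Client Checklist

Before joint debugging, the client team should confirm that the following logic is implemented correctly:

- [ ] **Distinguish frame types**: does the receive loop contain an `isinstance(msg, bytes)` check?
- [ ] **No early exit**: whether `routing` is chitchat or flight control, the client **must** receive `turn.complete` before ending the current `run_turn`.
- [ ] **Safe JSON parsing**: call `json.loads()` **only on text frames**; binary frames are appended directly.
- [ ] **Audio playback**: the received PCM data is in `int16` format and must be converted to `float32` (divide by 32768.0) before playback.
- [ ] **Flight-control prompt**: when `routing=flight_intent` is received, expect to hear the TTS speech "识别到飞控指令，正在下发指令" ("Flight-control command recognized, dispatching command").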
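The audio playback item can be sketched with the standard library alone. The helper name `pcm_s16le_to_float` is hypothetical, and the result is a list of Python floats in the range [-1.0, 1.0) rather than a true `float32` array; with NumPy available, `np.frombuffer(pcm, dtype="<i2").astype(np.float32) / 32768.0` would be the usual equivalent.

```python
import sys
from array import array


def pcm_s16le_to_float(pcm: bytes) -> list[float]:
    """Convert little-endian signed 16-bit PCM to floats in [-1.0, 1.0)."""
    samples = array("h")       # signed 16-bit integers
    samples.frombytes(pcm)     # raises ValueError on an odd byte count
    if sys.byteorder == "big":
        samples.byteswap()     # wire data is little-endian
    return [s / 32768.0 for s in samples]


# Samples -32768, 0 and 16384 encoded as s16le.
floats = pcm_s16le_to_float(b"\x00\x80\x00\x00\x00\x40")
```

Dividing by 32768.0 maps the most negative sample exactly to -1.0, while the largest positive sample (32767) lands just below 1.0.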

---

**Version**: v1.0
**Last updated**: 2024-04-07
**Protocol identifier**: `text_uplink`