# 云端无人机语音对话服务 — 完整方案（文本上行 / text_uplink）

## 文档定位

- **适用场景**：无人机端完成 **中文语音→文本（本地 STT）**，经 **WSS 仅上传 UTF-8 文本**；云端完成 **LLM（飞控结构化意图 vs 闲聊）+ TTS 流式下发**。
- **云端不做**：上行音频识别、MAVLink/PX4 直连。
- **协议标识**：`proto_version: "1.0"`，`transport_profile: "text_uplink"`（握手时声明并确认）。

---

## A. 给开发 Agent 的任务摘要

在**无仓库先验**前提下，实现 **TLS WebSocket** 后端：每轮只收 `turn.text`，输出 **`dialog_result`（含 `routing`、`flight_intent` 或 `chat_reply`）+ `tts_audio_chunk*` + `turn.complete`**；飞控 JSON **仅返回、不执行**。必须与下文章节 **F 协议**字段名与时序一致。

---

## B. 业务与领域背景

### B.1 产品

- **中文**口语交互：飞行指令 + 闲聊。口语、省略、领域词（起飞、降落、返航、悬停、高度、速度、往前飞、航点、相对位移等）由 **LLM** 在**文本**上理解。

### B.2 端侧职责（实现不在本文，但云须按此假设）

- 麦克风、唤醒/VAD、**本地 STT** → 得到 **稳定一版用户中文** → 发送 **`turn.text`**。
- 解析 `dialog_result`，**本地执行** `flight_intent`（PX4/MAVROS 等），并 **播放 TTS**。

### B.3 云端职责

- **输入**：每轮一条（或多条仅允许重复同一终稿时的幂等策略）**中文文本**。
- **输出**：结构化结果 + **TTS PCM（或约定 codec）流**。
- **不做**：用户语音识别；机上飞控执行。

### B.4 术语

| 术语 | 含义 |
|------|------|
| `session_id` | 一会话多轮，客户端生成 UUID v4。 |
| `turn_id` | 一轮「用户一句 → 助手答完」UUID v4。 |
| `routing` | `flight_intent` / `chitchat` / `error`。 |
| `flight_intent` | 与机端解析器一致的 JSON 对象。 |

---

## C. 系统架构与数据流

```text
[机载麦] → VAD/唤醒 → [本地 STT] → 中文文本
                               ↓ WSS text
                    [云端] LLM 分流 + 生成播报文案
                               ↓
              dialog_result + tts_audio_chunk* (binary)
                               ↓
[扬声器] ← 播放 TTS          [飞控] ← flight_intent（本地执行）
```

### 单轮时序（强制）

1. 客户端 → `session.start`（含 `transport_profile: "text_uplink"`）
2. 服务端 → `session.ready`（确认 profile，声明能力）
3. 客户端 → `turn.text`
4. 服务端 → `dialog_result` → `tts_audio_chunk*`（text 头 + binary 体）→ `turn.complete`
5. 下一轮换新 `turn_id`，重复 3–4。

---

## D. 大模型与意图逻辑

### D.1 分流（二选一）

1. **`flight_intent`**：话里含对本机无人机的**可执行飞行/任务意图**（含「飞高一点」「往左」等模糊指令 → 结构化尽量表达，`null`/缺字段 + `summary` 说明歧义）。
2. **`chitchat`**：日常聊天、与当次飞行无关且无有效飞控语义。

**返航**：`{"type":"return_home","args":{}}`。

### D.2 `flight_intent` Schema

当 `routing === "flight_intent"`，`flight_intent` 必须为对象，**完整约束以 [`FLIGHT_INTENT_SCHEMA_v1.md`](./FLIGHT_INTENT_SCHEMA_v1.md) 为准**（含桥/ROS 约定）。摘要：

- `is_flight_intent`: `true`
- `version`: `1`
- `actions`: 非空数组，按时间顺序；每项仅 `type` + `args`
  - `takeoff`：`args` 为 `{}` 或含可选 `relative_altitude_m`（米，>0）
  - `land` / `return_home` / `hover` / `hold`：`args` 为 `{}`
  - `goto`：`args` 须含 `frame`（`local_ned` | `body_ned`），可选 `x`/`y`/`z`（米，相对位移，可 `null`）
  - `wait`：**伴飞侧**定时等待；`args` 仅 `{"seconds": 正数}`（上限见 Schema）；「悬停 N 秒」典型为 `hover`/`hold` 后接 `wait`
- `summary`: 非空中文（播报/日志，**不参与机控**）
- `trace_id`（可选）：端到端追踪 ID，string，建议 ≤128 字符

禁止在结构化字段里夹 Markdown/代码块；顶层不得出现 Schema 未列字段。无法理解时可 `chitchat` 让用户说具体一点。

### D.3 `chitchat`

- `chat_reply`：非空自然中文。
- `flight_intent`：必须为 `null`。

### D.4 内部 LLM System 提示（服务端）

实际 System 提示由实现维护（`app/services/llm_service.py` → `build_system_prompt`），须与 [`FLIGHT_INTENT_SCHEMA_v1.md`](./FLIGHT_INTENT_SCHEMA_v1.md) **规则 A** 一致：`actions` 含 `takeoff` / `land` / `return_home` / `hover` / `hold` / `goto` / `wait`；可选顶层 `trace_id`；`takeoff` 可带 `relative_altitude_m`；**模型原文**经解析后写入 `dialog_result`，**原样不得**当最终 WS JSON 给客户端（避免未解析 JSON）。

**后处理**：解析到 `is_flight_intent === true` → `routing=flight_intent`，填 `flight_intent`，`chat_reply=null`；否则 `routing=chitchat`，`flight_intent=null`，`chat_reply=纯文本`。

### D.5 多轮上下文

- 建议每 `session_id` 保留短历史（如 2–4 轮）；**须在 `session.ready`** 写明 `llm_context_turns`。无历史则填 `0`。

---

## E. TTS 策略

| `routing` | 播报文本 |
|-----------|----------|
| `flight_intent` | 优先 **`flight_intent.summary`**（可略扩展但仍简短）。 |
| `chitchat` | **`chat_reply`**（过长时服务端截断策略写 README）。 |

默认下行：**`pcm_s16le`，24000 Hz，mono**（若与 `session.start` 不一致须重采样或协商拒绝）。

**顺序**：同 `turn_id` 下**建议先 `dialog_result` 再首包 TTS**；若交错以换首包延迟，须在 README 声明。

---

## F. 协议 v1.0 — `text_uplink` 配置（规范性）

### F.1 传输与鉴权

- **WSS**：`wss://{host}/v1/voice/session`
- **TLS 1.2+**
- 鉴权：**`Authorization: Bearer <token>`** 或 **`session.start.auth_token`**（产品只选一种）
- 每条 JSON 含：`"proto_version": "1.0"`
- **用户上行**：仅 **text 帧 JSON**；**不向服务端上传用户 PCM/Opus**（调试协议可单独附录，生产禁用）。
- **TTS**：**text 元数据 + binary 音频**（见 `tts_audio_chunk`）。

### F.2 客户端 → 服务端

**`session.start`**

```json
{
  "type": "session.start",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "session_id": "uuid-v4",
  "auth_token": "optional-if-not-using-Authorization-header",
  "client": {
    "device_id": "string",
    "locale": "zh-CN",
    "capabilities": {
      "playback_sample_rate_hz": 24000,
      "prefer_tts_codec": "pcm_s16le"
    }
  }
}
```

**`turn.text`（每轮至少一条；终稿语义一条即可）**

```json
{
  "type": "turn.text",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "uuid-v4",
  "text": "用户要说的整句中文",
  "is_final": true,
  "source": "device_stt"
}
```

`source` 枚举：`device_stt` | `debug_keyboard` | `text_only`  
`is_final`：文本上行下通常为 `true`；若未来接「流式纠错」可 `false` 仅处理最后一条 `is_final:true`（须在 README 中定义）。

**`tts.synthesize`（仅 TTS，无 LLM）**

须在 `session.ready` 之后发送；与 `turn.text` **互斥**（实现上同会话串行 pipeline）。**不**返回 `dialog_result` / `llm.text_delta`，**不**写入多轮历史；下行仅为 `tts_audio_chunk*` → `turn.complete`（`metrics.llm_ms === 0`）。

```json
{
  "type": "tts.synthesize",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "uuid-v4",
  "text": "要直接播报的中文"
}
```

**`session.end`（可选）**

```json
{
  "type": "session.end",
  "proto_version": "1.0",
  "session_id": "uuid-v4"
}
```

**禁止（本 profile）**：`turn.audio_chunk`、`turn.audio_end`。若收到 → **`error`：`INVALID_MESSAGE`，`retryable: false`**。

### F.3 服务端 → 客户端

**`session.ready`**

```json
{
  "type": "session.ready",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "session_id": "uuid-v4",
  "server_caps": {
    "accepts_audio_uplink": false,
    "llm": true,
    "tts_codecs": ["pcm_s16le"],
    "llm_context_turns": 4
  }
}
```

**`dialog_result`**

```json
{
  "type": "dialog_result",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "uuid-v4",
  "user_input": {
    "text": "与 turn.text 一致或规范化后的文本",
    "language": "zh",
    "is_final": true,
    "source": "device_stt"
  },
  "routing": "flight_intent",
  "flight_intent": {
    "is_flight_intent": true,
    "version": 1,
    "actions": [
      {"type": "takeoff", "args": {}},
      {"type": "goto", "args": {"frame": "local_ned", "x": 10, "y": 0, "z": -5}}
    ],
    "summary": "起飞后前往机头方向约十米并保持高度"
  },
  "chat_reply": null,
  "tts_hint": {
    "speak_summary_or_reply": true,
    "voice_id": "default"
  }
}
```

- **`user_input`**：表示**机端提交/云端正则化后的用户话**；**不是**云端语音识别结果。
- 闲聊示例：`routing` = `chitchat`，`flight_intent` = `null`，`chat_reply` = 非空字符串。

**`tts_audio_chunk`**

- 先 **text 帧**（JSON），再 **binary 帧**（payload）。

```json
{
  "type": "tts_audio_chunk",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "uuid-v4",
  "seq": 0,
  "codec": "pcm_s16le",
  "sample_rate_hz": 24000,
  "is_final": false
}
```

**`turn.complete`**

```json
{
  "type": "turn.complete",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "uuid-v4",
  "metrics": {
    "llm_ms": 350,
    "tts_first_byte_ms": 80
  }
}
```

（已去掉 `stt_ms`；无云端听写阶段。）

**`error`**

```json
{
  "type": "error",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "uuid-v4",
  "code": "LLM_TIMEOUT",
  "message": "human readable",
  "retryable": true
}
```

**`code` 枚举（本 profile）**  
`UNAUTHORIZED` | `INVALID_MESSAGE` | `LLM_FAILED` | `LLM_TIMEOUT` | `TTS_FAILED` | `RATE_LIMIT` | `INTERNAL`

（`STT_FAILED`、`BAD_AUDIO` 本 profile **不使用**；非法音频类消息用 `INVALID_MESSAGE`。）

### F.4 幂等

- 同一 `turn_id` 重复提交相同 `turn.text`：返回**等价** `dialog_result` + TTS，或 `error`（`INVALID_MESSAGE`）；**禁止**静默改变 `flight_intent` 语义。

### F.5 超时

- `turn.text`（`is_final:true`）收到后 → `dialog_result` / `error` 总时限；TTS 首字节单独时限；均在 README 可配。

---

## G. 工程与非功能

- 环境变量：`WS_BIND`、`BEARER_TOKEN`、`LLM_*`、`TTS_*`、各阶段 timeout、限流等。
- 日志：含 `session_id`、`turn_id`、阶段耗时；**禁止**记录 token。
- 会话内 turn **建议串行**处理，避免乱序。

### G.1 可选扩展：飞控「播报—确认—执行」门控

为降低 **本地 ASR 错字**导致的误控，推荐统一采用 **`routing: flight_intent`** + **`confirm` 对象** + 会话声明 `client.protocol.dialog_result: "cloud_voice_dialog_v1"`。权威字段、机端状态机与握手见 **[`CLOUD_VOICE_DIALOG_v1.md`](./CLOUD_VOICE_DIALOG_v1.md)**。历史备选（`flight_intent_pending`、`turn.confirmation`）见 [`CLOUD_VOICE_FLIGHT_CONFIRM_v1.md`](./CLOUD_VOICE_FLIGHT_CONFIRM_v1.md)。未声明 v1 的会话，服务端可继续下发**无 `confirm`** 的旧版 `dialog_result`（本仓库实现：兼容路径）。

---

## H. 测试与验收

1. 「今天天气」→ `chitchat`，TTS 为闲聊。
2. 「起飞然后在前方十米悬停」→ `flight_intent`，`actions` 顺序合理。
3. 「返航」→ `return_home`。
4. 纯闲聊不出现 `is_flight_intent`。
5. 发 `turn.audio_chunk` → `INVALID_MESSAGE`。
6. 鉴权失败 → `UNAUTHORIZED`。
7. `dialog_result.user_input.text` 与输入一致或可解释规范化规则。

---

## I. 复制给 Cursor / 云端 Agent 的完整执行提示词

```text
你是资深后端工程师。实现「无人机中文语音助手」云端服务，传输配置为 text_uplink：

【输入】仅通过 WSS 接收 JSON：session.start（含 transport_profile=text_uplink）、每轮 turn.text（UTF-8 中文，来自机端本地 STT）。禁止依赖用户上行音频；若收到 turn.audio_chunk / turn.audio_end 应返回 error code INVALID_MESSAGE。

【处理】LLM 按本文档第二节 D：飞控意图与闲聊二选一；flight_intent 必须严格符合文中 schema；否则 chitchat 与 chat_reply。

【输出】session.ready（accepts_audio_uplink=false）；每轮 dialog_result（必含 user_input，语义为机端文本而非云 STT）；再 tts_audio_chunk（text 头+binary PCM）；最后 turn.complete；metrics 不含 stt_ms。

【错误码】使用文档 F 节枚举。

【交付】可运行服务、README（环境变量、启动命令、llm_context_turns、timeouts、TTS 采样率），并通过文档 H 节测试。

若细节未规定，在 README 记录假设且不更改字段名与 routing 语义。
```

---

## 与旧版机端字段的兼容说明（可选）

若机端暂不改解析逻辑、只认旧字段名：可在 **`dialog_result` 同时带 `stt` 与 `user_input` 相同内容** 作为过渡；稳定后只保留 `user_input`。新云端建议**只实现 `user_input`**，机端一次性对齐。

---

## 与仓库机端代码对齐

飞控 JSON 的语义与 `src/core/qwen_intent_chat.py` 中 `FLIGHT_INTENT_CHAT_SYSTEM` / `parse_flight_intent_reply` 保持一致。