LKTX 1650f046b6 Add voicellmcloud (cloud service) under voicellmcloud/

Made-with: Cursor

2026-04-14 10:08:41 +08:00


Cloud Drone Voice Dialog Service: Complete Design (Text Uplink / text_uplink)

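Document scope

Use case: the drone performs Chinese speech-to-text on board (local STT) and uploads only UTF-8 text over WSS; the cloud runs the LLM (structured flight-control intent vs. chitchat) and streams TTS audio back.
Out of scope for the cloud: recognizing uplink audio; connecting directly to MAVLink/PX4.
Protocol identifiers: proto_version: "1.0", transport_profile: "text_uplink" (declared and confirmed during the handshake).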

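A. Task summary for the development agent

Assuming no prior knowledge of any repository, implement a TLS WebSocket backend: on each turn it receives only turn.text and outputs dialog_result (containing routing plus either flight_intent or chat_reply) + tts_audio_chunk* + turn.complete. The flight-control JSON is only returned, never executed. Field names and message ordering must match the protocol in section F below.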

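B. Business and domain background

B.1 Product

Spoken Chinese interaction: flight commands plus chitchat. Colloquial speech, ellipsis, and domain terms (take off, land, return home, hover, altitude, speed, fly forward, waypoint, relative displacement, etc.) are interpreted by the LLM at the text level.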

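B.2 Device-side responsibilities (implemented elsewhere, but the cloud must assume them)

Microphone, wake word/VAD, local STT → produce one stable version of the user's Chinese utterance → send turn.text.
Parse dialog_result, execute flight_intent locally (PX4/MAVROS, etc.), and play the TTS audio.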

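B.3 Cloud responsibilities

Input: one Chinese text message per turn (multiple messages are allowed only as repeats of the same final text, handled by the idempotency policy).
Output: a structured result + a TTS stream in PCM (or another agreed codec).
Not done: user speech recognition; on-board flight-control execution.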

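B.4 Terminology

| Term | Meaning |
| --- | --- |
| `session_id` | One session spans multiple turns; a UUID v4 generated by the client. |
| `turn_id` | One turn, from a user utterance to the end of the assistant's reply; UUID v4. |
| `routing` | `flight_intent` / `chitchat` / `error`. |
| `flight_intent` | JSON object that matches the device-side parser. |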

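C. System architecture and data flow

[On-board mic] → VAD/wake word → [Local STT] → Chinese text
                               ↓ WSS text
                    [Cloud] LLM routing + spoken-text generation
                               ↓
              dialog_result + tts_audio_chunk* (binary)
                               ↓
[Speaker] ← TTS playback     [Flight controller] ← flight_intent (executed locally)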

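Per-turn sequence (mandatory)

1. Client → session.start (with transport_profile: "text_uplink")
2. Server → session.ready (confirms the profile, declares capabilities)
3. Client → turn.text
4. Server → dialog_result → tts_audio_chunk* (text header + binary body) → turn.complete
5. The next turn uses a new turn_id and repeats steps 3–4.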
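
The server-side ordering for one turn.text turn can be checked mechanically. The following is a minimal sketch, not part of the specification: the helper name `first_order_violation` and the transition table are assumptions, and it further assumes that an error message ends the turn (the document does not say whether turn.complete follows an error). A tts.synthesize turn, which has no dialog_result, would need its own table.

```python
from typing import Optional

# Assumed transition table: which message types may follow the previous one
# within a single turn.text turn.
ALLOWED_NEXT = {
    "start": {"dialog_result", "error"},
    "dialog_result": {"tts_audio_chunk", "turn.complete", "error"},
    "tts_audio_chunk": {"tts_audio_chunk", "turn.complete", "error"},
    "turn.complete": set(),  # nothing may follow within the same turn
    "error": set(),          # assumption: an error terminates the turn
}


def first_order_violation(message_types: list[str]) -> Optional[int]:
    """Return the index of the first out-of-order message, or None if valid."""
    state = "start"
    for index, msg_type in enumerate(message_types):
        if msg_type not in ALLOWED_NEXT[state]:
            return index
        state = msg_type
    return None
```

A client test harness could feed the `type` field of every text frame received for one turn_id into this function and fail the test on any non-None result.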

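D. LLM and intent logic

D.1 Routing (exactly one of two)

flight_intent: the utterance contains an executable flight or mission intent for this drone. This includes vague commands such as 「飞高一点」 ("fly a bit higher") and 「往左」 ("go left"): express as much as possible in structured form, use null or omitted fields for the unknowns, and explain the ambiguity in summary.
chitchat: everyday conversation that is unrelated to the current flight and carries no valid flight-control semantics.

Return home: {"type":"return_home","args":{}}.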

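D.2 `flight_intent` Schema

When routing === "flight_intent", flight_intent must be an object. FLIGHT_INTENT_SCHEMA_v1.md is the authoritative source for the full constraints (including the bridge/ROS conventions). Summary:

is_flight_intent: true
version: 1
actions: non-empty array in chronological order; each item has only type + args
- takeoff: args is {} or contains the optional relative_altitude_m (meters, > 0)
- land / return_home / hover / hold: args is {}
- goto: args must contain frame (local_ned | body_ned); x/y/z are optional (meters, relative displacement, may be null)
- wait: companion-side timed wait; args is exactly {"seconds": <positive number>} (upper bound defined in the Schema); "hover for N seconds" is typically hover/hold followed by wait
summary: non-empty Chinese text (for speech output and logs; not used for flight control)
trace_id (optional): end-to-end trace ID, string, 128 characters or fewer recommended

Markdown and code blocks are forbidden inside structured fields, and the top level must not contain fields that the Schema does not list. If the utterance cannot be understood, the model may route to chitchat and ask the user to be more specific.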
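
The summary rules above translate fairly directly into a validator. This is a sketch of the summary only, not of FLIGHT_INTENT_SCHEMA_v1.md itself: the function names are hypothetical, the upper bound on wait seconds is not checked because it is defined only in the Schema, and a null relative_altitude_m is treated as absent (an assumption).

```python
from numbers import Real

NO_ARG_TYPES = {"land", "return_home", "hover", "hold"}
TOP_LEVEL_KEYS = {"is_flight_intent", "version", "actions", "summary", "trace_id"}


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_action(action: dict) -> list[str]:
    """Return a list of problems for one action; empty means it passed."""
    if (set(action) != {"type", "args"} or not isinstance(action["type"], str)
            or not isinstance(action["args"], dict)):
        return ["action must contain exactly a string 'type' and an object 'args'"]
    kind, args = action["type"], action["args"]
    errors = []
    if kind in NO_ARG_TYPES:
        if args:
            errors.append(f"{kind}: args must be empty")
    elif kind == "takeoff":
        if set(args) - {"relative_altitude_m"}:
            errors.append("takeoff: unexpected args")
        alt = args.get("relative_altitude_m")
        if alt is not None and not (_is_number(alt) and alt > 0):
            errors.append("takeoff: relative_altitude_m must be > 0")
    elif kind == "goto":
        if args.get("frame") not in ("local_ned", "body_ned"):
            errors.append("goto: frame must be local_ned or body_ned")
        for axis in ("x", "y", "z"):
            value = args.get(axis)
            if value is not None and not _is_number(value):
                errors.append(f"goto: {axis} must be a number or null")
    elif kind == "wait":
        seconds = args.get("seconds")
        if set(args) != {"seconds"} or not (_is_number(seconds) and seconds > 0):
            errors.append("wait: args must be exactly {'seconds': positive number}")
    else:
        errors.append(f"unknown action type: {kind!r}")
    return errors


def validate_flight_intent(intent: dict) -> list[str]:
    """Check the top-level fields, then every action in order."""
    errors = []
    if set(intent) - TOP_LEVEL_KEYS:
        errors.append("unexpected top-level fields")
    if intent.get("is_flight_intent") is not True:
        errors.append("is_flight_intent must be true")
    version = intent.get("version")
    if type(version) is not int or version != 1:
        errors.append("version must be 1")
    summary = intent.get("summary")
    if not (isinstance(summary, str) and summary.strip()):
        errors.append("summary must be a non-empty string")
    actions = intent.get("actions")
    if not (isinstance(actions, list) and actions):
        errors.append("actions must be a non-empty list")
    else:
        for action in actions:
            errors.extend(validate_action(action) if isinstance(action, dict)
                          else ["action must be an object"])
    return errors
```

Returning a list of messages rather than raising on the first problem makes it easier to log every schema violation of a model reply in one pass.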

D.3 `chitchat`

chat_reply: non-empty natural Chinese text.
flight_intent: must be null.

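D.4 Internal LLM system prompt (server side)

The actual system prompt is maintained by the implementation (app/services/llm_service.py → build_system_prompt) and must agree with Rule A of FLIGHT_INTENT_SCHEMA_v1.md: actions include takeoff / land / return_home / hover / hold / goto / wait; trace_id is an optional top-level field; takeoff may carry relative_altitude_m. The raw model output is parsed and then written into dialog_result; it must not be forwarded verbatim to the client as the final WS JSON (to avoid sending unparsed JSON).

Post-processing: if the parsed output has is_flight_intent === true → routing=flight_intent, fill flight_intent, chat_reply=null; otherwise routing=chitchat, flight_intent=null, chat_reply=plain text.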
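
The post-processing rule can be sketched as a small pure function. The name `build_routing_fields` is hypothetical and is not the function in app/services/llm_service.py; the sketch also assumes the model returns either a bare JSON object or plain text. A real implementation would still need to handle JSON wrapped in extra text, JSON that is not a flight intent, and an empty reply (chat_reply must be non-empty).

```python
import json


def build_routing_fields(model_output: str) -> dict:
    """Map raw model text to the routing-related fields of dialog_result."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        parsed = None  # not JSON at all: treat the reply as chitchat text
    if isinstance(parsed, dict) and parsed.get("is_flight_intent") is True:
        return {"routing": "flight_intent", "flight_intent": parsed, "chat_reply": None}
    return {"routing": "chitchat", "flight_intent": None,
            "chat_reply": model_output.strip()}
```

The returned dictionary would be merged into the full dialog_result together with type, proto_version, turn_id, and user_input.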

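D.5 Multi-turn context

Keeping a short history per session_id (for example 2–4 turns) is recommended. The retained turn count must be declared as llm_context_turns in session.ready; set it to 0 when no history is kept.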
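
One way to keep such a bounded history is a per-session deque. This is an illustrative sketch: the class name `SessionHistory` and the stored tuple shape are assumptions, and the value 4 simply mirrors the llm_context_turns shown in the session.ready example.

```python
from collections import defaultdict, deque

LLM_CONTEXT_TURNS = 4  # assumed; should equal server_caps.llm_context_turns


class SessionHistory:
    """Keeps the most recent (user_text, assistant_text) pairs per session_id."""

    def __init__(self, max_turns: int = LLM_CONTEXT_TURNS):
        # deque(maxlen=n) silently discards the oldest entry once full;
        # maxlen=0 keeps nothing, which matches llm_context_turns = 0.
        self._turns = defaultdict(lambda: deque(maxlen=max_turns))

    def append(self, session_id: str, user_text: str, assistant_text: str) -> None:
        self._turns[session_id].append((user_text, assistant_text))

    def recent(self, session_id: str) -> list:
        return list(self._turns.get(session_id, ()))

    def drop(self, session_id: str) -> None:
        """Call on session.end or disconnect so memory does not grow unbounded."""
        self._turns.pop(session_id, None)
```

Because tts.synthesize turns are not written to the history, only completed turn.text turns would call `append`.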

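E. TTS strategy

| `routing` | Text to speak |
| --- | --- |
| `flight_intent` | Prefer `flight_intent.summary` (may be expanded slightly but must stay short). |
| `chitchat` | `chat_reply` (document the server-side truncation policy for overlong replies in the README). |

Default downlink: pcm_s16le, 24000 Hz, mono (if this differs from what session.start requests, the server must resample or reject during negotiation).

Ordering: within one turn_id, sending dialog_result before the first TTS packet is recommended; if the implementation interleaves them to cut first-packet latency, the README must say so.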
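
The choice of text to speak can be sketched as follows. The function name and the 120-character cap are assumptions made for illustration; the document leaves the truncation policy to the README, and this sketch does not expand the summary.

```python
from typing import Optional

MAX_CHAT_TTS_CHARS = 120  # hypothetical cap; the real policy belongs in the README


def select_tts_text(dialog_result: dict) -> Optional[str]:
    """Pick the text to synthesize for one dialog_result, or None for errors."""
    routing = dialog_result.get("routing")
    if routing == "flight_intent":
        return dialog_result["flight_intent"]["summary"]
    if routing == "chitchat":
        # Naive truncation by character count; a production policy might cut
        # at sentence boundaries instead.
        return dialog_result["chat_reply"][:MAX_CHAT_TTS_CHARS]
    return None
```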

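F. Protocol v1.0: `text_uplink` profile (normative)

F.1 Transport and authentication

WSS: wss://{host}/v1/voice/session
TLS 1.2+
Authentication: Authorization: Bearer <token> or session.start.auth_token (a product picks exactly one)
Every JSON message contains: "proto_version": "1.0"
User uplink: JSON text frames only; user PCM/Opus is never uploaded to the server (a debug protocol may be described in a separate appendix and must be disabled in production).
TTS: text-frame metadata + binary audio (see tts_audio_chunk).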

F.2 Client → server

session.start

{
  "type": "session.start",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "session_id": "uuid-v4",
  "auth_token": "optional-if-not-using-Authorization-header",
  "client": {
    "device_id": "string",
    "locale": "zh-CN",
    "capabilities": {
      "playback_sample_rate_hz": 24000,
      "prefer_tts_codec": "pcm_s16le"
    }
  }
}

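turn.text (at least one per turn; a single message carrying the final text is sufficient)

{
  "type": "turn.text",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "uuid-v4",
  "text": "the full Chinese sentence spoken by the user",
  "is_final": true,
  "source": "device_stt"
}

source enum: device_stt | debug_keyboard | text_only
is_final: normally true under text uplink; if streaming correction is added later, it may be false, in which case only the last message with is_final:true is processed (this must be defined in the README).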

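tts.synthesize (TTS only, no LLM)

Must be sent after session.ready; mutually exclusive with turn.text (implemented as a serial pipeline within one session). It returns no dialog_result / llm.text_delta and is not written to the multi-turn history; the downlink is only tts_audio_chunk* → turn.complete (metrics.llm_ms === 0).

{
  "type": "tts.synthesize",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "uuid-v4",
  "text": "Chinese text to speak directly"
}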

session.end (optional)

{
  "type": "session.end",
  "proto_version": "1.0",
  "session_id": "uuid-v4"
}

Forbidden (in this profile): turn.audio_chunk, turn.audio_end. If received → error: INVALID_MESSAGE, retryable: false.
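
A dispatcher could screen incoming messages for these forbidden types before any other handling. The sketch below assumes the error shape shown in the error example of section F.3; the helper name `reject_if_forbidden` and the message wording are illustrative, and echoing turn_id only when the client supplied one is an assumption.

```python
from typing import Optional

FORBIDDEN_TYPES = {"turn.audio_chunk", "turn.audio_end"}


def reject_if_forbidden(message: dict) -> Optional[dict]:
    """Return an INVALID_MESSAGE error for audio-uplink types, else None."""
    if message.get("type") not in FORBIDDEN_TYPES:
        return None
    error = {
        "type": "error",
        "proto_version": "1.0",
        "transport_profile": "text_uplink",
        "code": "INVALID_MESSAGE",
        "message": f"{message['type']} is not allowed under text_uplink",
        "retryable": False,
    }
    if "turn_id" in message:  # assumption: echo turn_id only when supplied
        error["turn_id"] = message["turn_id"]
    return error
```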

F.3 Server → client

session.ready

{
  "type": "session.ready",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "session_id": "uuid-v4",
  "server_caps": {
    "accepts_audio_uplink": false,
    "llm": true,
    "tts_codecs": ["pcm_s16le"],
    "llm_context_turns": 4
  }
}

dialog_result

{
  "type": "dialog_result",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "uuid-v4",
  "user_input": {
    "text": "same as turn.text, or its normalized form",
    "language": "zh",
    "is_final": true,
    "source": "device_stt"
  },
  "routing": "flight_intent",
  "flight_intent": {
    "is_flight_intent": true,
    "version": 1,
    "actions": [
      {"type": "takeoff", "args": {}},
      {"type": "goto", "args": {"frame": "local_ned", "x": 10, "y": 0, "z": -5}}
    ],
    "summary": "起飞后前往机头方向约十米并保持高度"
  },
  "chat_reply": null,
  "tts_hint": {
    "speak_summary_or_reply": true,
    "voice_id": "default"
  }
}

user_input: the user utterance as submitted by the device, possibly normalized by the cloud; it is not a cloud speech-recognition result.
Chitchat example: routing = chitchat, flight_intent = null, chat_reply = non-empty string.

tts_audio_chunk

A text frame (JSON) comes first, followed by a binary frame (payload).

{
  "type": "tts_audio_chunk",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "uuid-v4",
  "seq": 0,
  "codec": "pcm_s16le",
  "sample_rate_hz": 24000,
  "is_final": false
}
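
The header-then-payload pairing can be sketched as a generator that yields alternating text and binary frames. The function name, the 4800-byte chunk size (100 ms of 24 kHz mono 16-bit PCM), and the reading of is_final as "last chunk of this turn" are assumptions; the document fixes only the header fields.

```python
import json
from typing import Iterator, Union


def frame_tts_chunks(turn_id: str, pcm: bytes, chunk_bytes: int = 4800,
                     sample_rate_hz: int = 24000) -> Iterator[Union[str, bytes]]:
    """Yield a JSON header (str) followed by its PCM payload (bytes), per chunk."""
    if chunk_bytes <= 0 or chunk_bytes % 2:
        raise ValueError("chunk_bytes must be a positive multiple of 2 (16-bit samples)")
    for seq, start in enumerate(range(0, len(pcm), chunk_bytes)):
        header = {
            "type": "tts_audio_chunk",
            "proto_version": "1.0",
            "transport_profile": "text_uplink",
            "turn_id": turn_id,
            "seq": seq,
            "codec": "pcm_s16le",
            "sample_rate_hz": sample_rate_hz,
            # Assumed meaning: true only on the last chunk of the turn.
            "is_final": start + chunk_bytes >= len(pcm),
        }
        yield json.dumps(header)              # send as a WebSocket text frame
        yield pcm[start:start + chunk_bytes]  # send as a WebSocket binary frame
```

A sender would dispatch on the Python type of each yielded item: `str` goes out as a text frame and `bytes` as a binary frame. Empty input yields no frames, so a caller that must always signal completion would rely on turn.complete.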

turn.complete

{
  "type": "turn.complete",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "uuid-v4",
  "metrics": {
    "llm_ms": 350,
    "tts_first_byte_ms": 80
  }
}

(stt_ms has been removed; there is no cloud transcription stage.)

error

{
  "type": "error",
  "proto_version": "1.0",
  "transport_profile": "text_uplink",
  "turn_id": "uuid-v4",
  "code": "LLM_TIMEOUT",
  "message": "human readable",
  "retryable": true
}

(STT_FAILED and BAD_AUDIO are not used in this profile; illegal audio-type messages are answered with INVALID_MESSAGE.)

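F.4 Idempotency

When the same turn.text is resubmitted with the same turn_id: return an equivalent dialog_result + TTS, or an error (INVALID_MESSAGE); silently changing the flight_intent semantics is forbidden.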
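
A small per-session cache keyed by turn_id is one way to satisfy this rule. The sketch is illustrative: the class name `TurnCache` is hypothetical, and treating a repeated turn_id with different text as a conflict (to be answered with INVALID_MESSAGE) is an assumption, since the document only covers resubmission of identical text.

```python
from typing import Optional, Tuple


class TurnCache:
    """Remembers the first text and result seen for each turn_id."""

    def __init__(self):
        self._seen = {}  # turn_id -> (text, dialog_result)

    def check(self, turn_id: str, text: str) -> Tuple[str, Optional[dict]]:
        """Return ('new', None), ('replay', cached_result) or ('conflict', None)."""
        if turn_id not in self._seen:
            return "new", None
        cached_text, cached_result = self._seen[turn_id]
        if cached_text == text:
            return "replay", cached_result  # resend the equivalent dialog_result
        return "conflict", None             # assumed: answer with INVALID_MESSAGE

    def store(self, turn_id: str, text: str, dialog_result: dict) -> None:
        # setdefault keeps the first result, so a replay can never change
        # the flight_intent that was originally returned.
        self._seen.setdefault(turn_id, (text, dialog_result))
```

The cache grows with every turn, so it would typically be scoped to one session and discarded when the session ends.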

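F.5 Timeouts

One overall deadline runs from receipt of turn.text (is_final:true) to dialog_result / error; the TTS first byte has a separate deadline; both are configurable as documented in the README.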
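
The first deadline can be enforced with `asyncio.wait_for`. This is a sketch under assumptions: the wrapper name `run_llm_with_deadline` is hypothetical, `llm_call` stands for whatever awaitable produces the dialog_result, and the error payload reuses the LLM_TIMEOUT example from section F.3.

```python
import asyncio
from typing import Awaitable


async def run_llm_with_deadline(llm_call: Awaitable[dict], turn_id: str,
                                timeout_s: float) -> dict:
    """Await the LLM stage; on timeout, return a retryable LLM_TIMEOUT error."""
    try:
        # wait_for cancels the underlying task when the deadline passes.
        return await asyncio.wait_for(llm_call, timeout=timeout_s)
    except asyncio.TimeoutError:
        return {
            "type": "error",
            "proto_version": "1.0",
            "transport_profile": "text_uplink",
            "turn_id": turn_id,
            "code": "LLM_TIMEOUT",
            "message": f"LLM stage exceeded {timeout_s} s",
            "retryable": True,
        }
```

The TTS first-byte deadline could be handled the same way by wrapping only the wait for the first synthesized chunk.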

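G. Engineering and non-functional requirements

Environment variables: WS_BIND, BEARER_TOKEN, LLM_*, TTS_*, per-stage timeouts, rate limits, etc.
Logging: include session_id, turn_id, and per-stage latency; never log tokens.
Turns within a session should be processed serially to avoid out-of-order output.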
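
The logging rule can be illustrated with two small helpers. Both names are hypothetical, the JSON log-line layout is an assumption, and the redaction is shallow: it masks only top-level keys, which covers session.start.auth_token but not tokens nested deeper.

```python
import json

SENSITIVE_KEYS = {"auth_token", "authorization"}


def redact(message: dict) -> dict:
    """Return a shallow copy with token-bearing top-level fields masked."""
    return {key: "***" if key.lower() in SENSITIVE_KEYS else value
            for key, value in message.items()}


def stage_log_line(session_id: str, turn_id: str, stage: str, elapsed_ms: int) -> str:
    """Format one structured log line carrying the IDs and stage latency."""
    return json.dumps({"session_id": session_id, "turn_id": turn_id,
                       "stage": stage, "elapsed_ms": elapsed_ms})
```

Any inbound message would pass through `redact` before being logged, and each pipeline stage (LLM, TTS first byte) would emit one `stage_log_line` on completion.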

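G.1 Optional extension: "announce, confirm, execute" gate for flight control

To reduce mis-control caused by local ASR transcription errors, the recommended unified approach is routing: flight_intent + a confirm object + the session declaration client.protocol.dialog_result: "cloud_voice_dialog_v1". See CLOUD_VOICE_DIALOG_v1.md for the authoritative fields, device-side state machine, and handshake. The historical alternative (flight_intent_pending, turn.confirmation) is described in CLOUD_VOICE_FLIGHT_CONFIRM_v1.md. For sessions that do not declare v1, the server may keep sending the legacy dialog_result without confirm (in this repository's implementation: the compatibility path).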

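H. Testing and acceptance

「今天天气」 ("today's weather") → chitchat, and the TTS speaks the chitchat reply.
「起飞然后在前方十米悬停」 ("take off, then hover ten meters ahead") → flight_intent, with actions in a sensible order.
「返航」 ("return home") → return_home.
Pure chitchat never produces is_flight_intent.
Sending turn.audio_chunk → INVALID_MESSAGE.
Authentication failure → UNAUTHORIZED.
dialog_result.user_input.text equals the input, or differs only by an explainable normalization rule.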

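I. Complete execution prompt to paste into Cursor / a cloud agent

You are a senior backend engineer. Implement the cloud service for a "drone Chinese voice assistant" with the transport profile text_uplink:

[Input] Receive JSON over WSS only: session.start (with transport_profile=text_uplink) and one turn.text per turn (UTF-8 Chinese from the device's local STT). Do not depend on user uplink audio; if turn.audio_chunk / turn.audio_end is received, return an error with code INVALID_MESSAGE.

[Processing] The LLM follows section D of this document: choose exactly one of flight-control intent or chitchat; flight_intent must strictly conform to the schema in the text; otherwise use chitchat with chat_reply.

[Output] session.ready (accepts_audio_uplink=false); per turn, a dialog_result (must include user_input, which represents device-side text, not cloud STT); then tts_audio_chunk (text header + binary PCM); finally turn.complete; metrics does not include stt_ms.

[Error codes] Use the codes enumerated in section F of the document.

[Deliverables] A runnable service and a README (environment variables, startup command, llm_context_turns, timeouts, TTS sample rate) that passes the tests in section H of the document.

If a detail is unspecified, record the assumption in the README and do not change field names or routing semantics.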

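Compatibility note for legacy device-side fields (optional)

If the device cannot change its parsing logic yet and only recognizes the legacy field name, dialog_result may carry stt alongside user_input with identical content as a transition; keep only user_input once things are stable. New cloud implementations should implement only user_input, with the device side aligning in one step.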

Alignment with the device-side code in the repository

The semantics of the flight-control JSON must stay consistent with FLIGHT_INTENT_CHAT_SYSTEM / parse_flight_intent_reply in src/core/qwen_intent_chat.py.
