TuriX-CUA In-Depth Study Notes
1. Project Overview
1.1 What is TuriX-CUA
TuriX-CUA (Computer Use Agent) is an open-source AI agent that lets AI models operate a desktop operating system directly, relying on visual understanding of the screen to complete tasks.
1.2 Core Positioning
| Feature | Description |
|---|---|
| Positioning | AI-driven digital assistant |
| Mission | "Describe your task to your computer and put your digital workhorse to work" |
| Open Source | 100% open source, free for personal and research use |
| Cross-Platform | macOS / Windows / Linux |
1.3 Performance
OSWorld benchmark score: 64.2% (ranked 3rd)
In-house macOS benchmark: 80%+ success rate
2. Core Architecture
2.1 System Architecture Diagram
```
┌─────────────────────────────────────────────────────────────────────┐
│                              User Task                              │
│                 "Book a flight and a hotel for me"                  │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          TuriX Agent Core                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐             │
│   │   Planner   │───▶│    Brain    │───▶│    Actor    │             │
│   └─────────────┘    └─────────────┘    └─────────────┘             │
│          │                  │                  │                    │
│          ▼                  ▼                  ▼                    │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐             │
│   │   Memory    │    │   Skills    │    │ Controller  │             │
│   └─────────────┘    └─────────────┘    └─────────────┘             │
│                                                │                    │
│                                                ▼                    │
│                                       ┌─────────────────┐           │
│                                       │    macOS UI     │           │
│                                       │  (Quartz/Axis)  │           │
│                                       └─────────────────┘           │
└─────────────────────────────────────────────────────────────────────┘
```
2.2 Multi-Model Division of Labor
| Model | Responsibility | Input | Output |
|---|---|---|---|
| Planner | Task decomposition, Skill selection, web search | Task + Memory | Step-by-step plan + selected Skills |
| Brain | State analysis, goal setting, step evaluation | Screenshot + Memory + Skills | step_evaluate, next_goal |
| Actor | Precise action execution | Brain instruction + screenshot | Concrete action sequence |
| Memory | Context memory compression | Step history | Structured summary |
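The division of labor in the table can be sketched as a small control loop. This is only an illustration under stated assumptions, not TuriX's implementation: `plan`, `brain_step`, `actor_step`, and `compress_memory` are hypothetical stand-ins for the four model calls, and the screenshot is replaced by a placeholder string.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    task: str
    memory: list[str] = field(default_factory=list)     # compressed history
    executed: list[dict] = field(default_factory=list)  # actions handed to the controller


def plan(task: str) -> list[str]:
    """Planner stand-in: split the task into ordered steps."""
    return [part.strip() for part in task.split(" and ")]


def brain_step(step: str, screenshot: str, memory: list[str]) -> dict:
    """Brain stand-in: evaluate the previous step and set the next goal."""
    return {"step_evaluate": "Success", "ask_human": "No", "next_goal": step}


def actor_step(next_goal: str, screenshot: str) -> list[dict]:
    """Actor stand-in: turn a goal into concrete actions."""
    return [{"input_text": {"text": next_goal}}]


def compress_memory(memory: list[str], limit: int = 3) -> list[str]:
    """Memory stand-in: keep only the most recent entries."""
    return memory[-limit:]


def run(task: str) -> AgentState:
    state = AgentState(task=task)
    for step in plan(task):
        screenshot = "<screenshot placeholder>"
        decision = brain_step(step, screenshot, state.memory)
        if decision["ask_human"] != "No":
            break  # a real agent would pause here and wait for the user
        state.executed.extend(actor_step(decision["next_goal"], screenshot))
        state.memory = compress_memory(state.memory + [f"done: {step}"])
    return state


state = run("Open Safari and go to github.com")
print(len(state.executed), state.memory)
```

Keeping each role behind its own function is what makes it possible to back each one with a different model.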
2.3 Core Code Structure
```
turix-cua/
├── src/
│   ├── agent/                    # Agent core
│   │   ├── service.py            # Main agent service
│   │   ├── planner_service.py    # Planner
│   │   ├── prompts.py            # Prompt templates
│   │   ├── output_schemas.py     # Output schema definitions
│   │   ├── views.py              # Data models
│   │   └── message_manager/      # Message management
│   │
│   ├── controller/               # Controller
│   │   ├── service.py            # Main controller service
│   │   ├── registry/             # Action registry
│   │   └── views.py              # Action definitions
│   │
│   ├── mac/                      # macOS-specific implementation
│   │   ├── actions.py            # Native action execution
│   │   ├── tree.py               # UI tree construction
│   │   └── element.py            # Element definitions
│   │
│   └── utils/                    # Utility functions
│       ├── skills.py             # Skill system
│       ├── brain_search.py       # Search augmentation
│       └── record_store.py       # Memory storage
│
├── OpenCLaw_TuriX_skill/         # OpenClaw Skill package
│   ├── SKILL.md                  # Skill definition
│   └── scripts/                  # Run scripts
│
├── skills/                       # Built-in Skills
│   └── github-web-actions.md
│
└── examples/
    ├── main.py                   # Main entry point
    └── config.json               # Configuration file
```
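3. Action System
3.1 Action Schema Details
TuriX defines a structured action schema that AI models use to generate executable desktop operation instructions.
Core Action Types
| Action | Schema Key | Parameters | Description |
|---|---|---|---|
| Finish task | done | {text: string} | Marks the task as complete |
| Text input | input_text | {text: string} | Types text into the focused element |
| Open app | open_app | {app_name: string} | Launches the specified application |
| Run script | run_apple_script | {script: string} | Executes an AppleScript |
| Click | Click | {position: [x, y]} | Left click (normalized coordinates) |
| Right click | RightSingle | {position: [x, y]} | Right click |
| Drag | Drag | {position1: [x1, y1], position2: [x2, y2]} | Drags from the start point to the end point |
| Move mouse | move_mouse | {position: [x, y]} | Moves the mouse pointer |
| Scroll up | scroll_up | {position, dx, dy} | Scrolls up at the given position |
| Scroll down | scroll_down | {position, dx, dy} | Scrolls down at the given position |
| Hotkey | Hotkey | {key: string} | Single-key shortcut |
| Key combination | multi_Hotkey | {key1, key2, key3?} | Multi-key combination |
| Record info | record_info | {text, file_name} | Records important information to a file |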
3.2 Complete Action Schema Definition
```python
# src/agent/output_schemas.py
ACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    # ----- task finished -----
                    "done": {"type": "object", "properties": {"text": {"type": "string"}}},
                    # ----- typing -----
                    "input_text": {
                        "type": "object",
                        "properties": {"text": {"type": "string"}},
                        "required": ["text"],
                    },
                    # ----- open app -----
                    "open_app": {
                        "type": "object",
                        "properties": {"app_name": {"type": "string"}},
                        "required": ["app_name"],
                    },
                    # ----- AppleScript -----
                    "run_apple_script": {
                        "type": "object",
                        "properties": {"script": {"type": "string"}},
                        "required": ["script"],
                    },
                    # ----- hotkeys -----
                    "Hotkey": {
                        "type": "object",
                        "properties": {"key": {"type": "string"}},
                        "required": ["key"],
                    },
                    "multi_Hotkey": {
                        "type": "object",
                        "properties": {
                            "key1": {"type": "string"},
                            "key2": {"type": "string"},
                            "key3": {"type": "string"},
                        },
                        "required": ["key1", "key2"],
                    },
                    # ----- clicks -----
                    "Click": {
                        "type": "object",
                        "properties": {
                            "position": {"type": "array", "items": {"type": "number"}}
                        },
                        "required": ["position"],
                    },
                    # ----- drag -----
                    "Drag": {
                        "type": "object",
                        "properties": {
                            "position1": {"type": "array", "items": {"type": "number"}},
                            "position2": {"type": "array", "items": {"type": "number"}},
                        },
                        "required": ["position1", "position2"],
                    },
                    # ----- scrolling -----
                    "scroll_up": {
                        "type": "object",
                        "properties": {
                            "position": {"type": "array", "items": {"type": "number"}},
                            "dx": {"type": "number"},
                            "dy": {"type": "number"},
                        },
                        "required": ["position"],
                    },
                    "scroll_down": {
                        "type": "object",
                        "properties": {
                            "position": {"type": "array", "items": {"type": "number"}},
                            "dx": {"type": "number"},
                            "dy": {"type": "number"},
                        },
                        "required": ["position"],
                    },
                    # ----- memory -----
                    "record_info": {
                        "type": "object",
                        "properties": {
                            "text": {"type": "string"},
                            "file_name": {"type": "string"},
                        },
                        "required": ["text", "file_name"],
                    },
                    "wait": {"type": "object", "properties": {"text": {"type": "string"}}},
                },
            },
        }
    },
}
```
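A model's output can be checked against the schema before anything is executed. The sketch below hand-rolls that check for a small subset of keys rather than using a JSON Schema library; `REQUIRED_PARAMS` and `validate_actions` are hypothetical names introduced for illustration and are not part of TuriX.

```python
# Required parameters per action key, copied from the "required" lists of a
# few ACTION_SCHEMA entries (a subset, for illustration only).
REQUIRED_PARAMS = {
    "done": [],
    "input_text": ["text"],
    "open_app": ["app_name"],
    "Click": ["position"],
    "multi_Hotkey": ["key1", "key2"],
}


def validate_actions(output: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    actions = output.get("action")
    if not isinstance(actions, list):
        return ["'action' must be a list"]
    problems = []
    for i, item in enumerate(actions):
        for key, params in item.items():
            if key not in REQUIRED_PARAMS:
                problems.append(f"action[{i}]: unknown key {key!r}")
                continue
            missing = [p for p in REQUIRED_PARAMS[key] if p not in params]
            if missing:
                problems.append(f"action[{i}].{key}: missing {missing}")
    return problems


good = {"action": [{"open_app": {"app_name": "Safari"}}, {"Click": {"position": [500, 120]}}]}
bad = {"action": [{"input_text": {}}, {"teleport": {}}]}
print(validate_actions(good))
print(validate_actions(bad))
```

Returning a list of problems instead of raising on the first one makes it easy to feed all errors back to the model in a single retry prompt.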
3.3 Controller Action Implementation
```python
# src/controller/views.py
from typing import List, Optional

from pydantic import BaseModel, Field


class InputTextAction(BaseModel):
    text: str


class OpenAppAction(BaseModel):
    app_name: str


class PressAction(BaseModel):
    key: str


class PressCombinedAction(BaseModel):
    key1: str
    key2: str
    key3: Optional[str] = None


class LeftClickPixel(BaseModel):
    position: List[float] = Field(..., description="Coordinates (normalised) [x,y]")


class RightClickPixel(BaseModel):
    position: List[float] = Field(..., description="Coordinates (normalised) [x,y]")


class ScrollUpAction(BaseModel):
    position: List[float]
    # Explicit None defaults keep dx/dy optional under Pydantic v2 as well,
    # matching the schema, where only "position" is required.
    dx: Optional[int] = None
    dy: Optional[int] = None


class ScrollDownAction(BaseModel):
    position: List[float]
    dx: Optional[int] = None
    dy: Optional[int] = None


class DragAction(BaseModel):
    position1: List[float]
    position2: List[float]
```
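The repository layout lists a `controller/registry/` package for the action registry. One common way to build such a registry is a decorator that maps schema keys to async handlers. The sketch below shows that general pattern only; it is an assumption about the approach, not the project's code, and `ActionRegistry`, `register`, and `execute` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[str]]


class ActionRegistry:
    """Maps schema keys (e.g. "open_app") to async handler functions."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}

    def register(self, key: str) -> Callable[[Handler], Handler]:
        def decorator(func: Handler) -> Handler:
            self._handlers[key] = func
            return func
        return decorator

    async def execute(self, action: dict) -> list[str]:
        """Run every {key: params} pair of one action object, in order."""
        results = []
        for key, params in action.items():
            if key not in self._handlers:
                raise KeyError(f"no handler registered for {key!r}")
            results.append(await self._handlers[key](params))
        return results


registry = ActionRegistry()


@registry.register("open_app")
async def open_app(params: dict) -> str:
    # A real handler would launch the app; this one only reports the request.
    return f"opened {params['app_name']}"


@registry.register("input_text")
async def input_text(params: dict) -> str:
    return f"typed {len(params['text'])} characters"


results = asyncio.run(registry.execute({"open_app": {"app_name": "Safari"}}))
print(results)
```

With this shape, adding an action means writing one decorated function; the dispatch code does not change.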
3.4 macOS Native Action Execution
```python
# src/mac/actions.py
import asyncio

import Quartz


async def _click_invisible(x, y, button='left'):
    """
    Perform a press-and-release click at (x, y) without leaving the cursor there.
    """
    if button == 'left':
        down_type = Quartz.kCGEventLeftMouseDown
        up_type = Quartz.kCGEventLeftMouseUp
        cg_button = Quartz.kCGMouseButtonLeft
    else:
        down_type = Quartz.kCGEventRightMouseDown
        up_type = Quartz.kCGEventRightMouseUp
        cg_button = Quartz.kCGMouseButtonRight

    # Remember where the cursor was; _get_current_mouse_position is a helper
    # defined elsewhere in the module. The excerpt ends before old_pos is used.
    old_pos = _get_current_mouse_position()
    # Move to position
    move = Quartz.CGEventCreateMouseEvent(None, Quartz.kCGEventMouseMoved, (x, y), cg_button)
    Quartz.CGEventPost(Quartz.kCGSessionEventTap, move)
    await asyncio.sleep(0.03)
    # Mouse down
    event_down = Quartz.CGEventCreateMouseEvent(None, down_type, (x, y), cg_button)
    Quartz.CGEventPost(Quartz.kCGHIDEventTap, event_down)
    # Mouse up
    event_up = Quartz.CGEventCreateMouseEvent(None, up_type, (x, y), cg_button)
    Quartz.CGEventPost(Quartz.kCGHIDEventTap, event_up)
```
3.5 Coordinate Normalization System
All coordinates are normalized to the 0-1000 range.
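The notes state only that coordinates are normalized to 0-1000. Assuming the mapping is linear over the screen's width and height (an assumption, as are the function names `to_pixels` and `to_normalized`), the conversion could look like this:

```python
def to_pixels(position, screen_width, screen_height, scale=1000):
    """Map a normalized [x, y] in the 0..scale range to pixel coordinates."""
    x, y = position
    # Clamp so slightly out-of-range model output still lands on screen.
    x = min(max(x, 0), scale)
    y = min(max(y, 0), scale)
    return (x / scale * screen_width, y / scale * screen_height)


def to_normalized(pixel, screen_width, screen_height, scale=1000):
    """Inverse mapping: pixel coordinates back to the 0..scale range."""
    px, py = pixel
    return [px / screen_width * scale, py / screen_height * scale]


print(to_pixels([500, 250], 1440, 900))  # (720.0, 225.0)
```

Normalized coordinates let the model's output stay valid across displays of different resolutions; only the final conversion depends on the actual screen size.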
4. Brain System
4.1 Brain Schema Details
Brain is TuriX's core decision component: it analyzes the current state and sets the next goal for the Actor.
```python
# src/agent/output_schemas.py
BRAIN_SCHEMA = {
    "oneOf": [
        {
            "type": "object",
            "properties": {
                "analysis": {
                    "type": "object",
                    "properties": {
                        "analysis": {"type": "string"},
                        "sop_check": {"type": "string"},
                    },
                    "required": ["analysis", "sop_check"],
                },
                "current_state": {
                    "type": "object",
                    "properties": {
                        "step_evaluate": {"type": "string"},  # "Success" / "Failed"
                        "ask_human": {"type": "string"},      # question for the user, or "No"
                        "next_goal": {"type": "string"},      # next goal
                    },
                    "required": ["step_evaluate", "ask_human", "next_goal"],
                },
            },
            "required": ["analysis", "current_state"],
        },
        {
            "type": "object",
            "properties": {
                "read_files": {
                    "type": "object",
                    "properties": {
                        "files": {"type": "array", "items": {"type": "string"}}
                    },
                    "required": ["files"],
                }
            },
            "required": ["read_files"],
        },
    ]
}
```
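4.2 Brain Output Fields
| Field | Type | Description |
|---|---|---|
| analysis.analysis | string | Detailed analysis of how well the current state matches the expected state |
| analysis.sop_check | string | Identifies which Skill step currently applies |
| current_state.step_evaluate | string | "Success" or "Failed" |
| current_state.ask_human | string | A question asking the user for confirmation (e.g. when login is required), or "No" |
| current_state.next_goal | string | An executable goal for the Actor |
| read_files.files | array of strings | Files whose contents the Brain asks to read |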
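Because BRAIN_SCHEMA is a `oneOf`, a consumer has to decide which of the two shapes a reply has before acting on it. The following is a minimal sketch of that branching; `classify_brain_output` is a hypothetical helper, and the sample reply is invented for illustration.

```python
import json


def classify_brain_output(raw: str) -> tuple[str, dict]:
    """Return ("read_files", ...) or ("decision", ...) for a Brain reply.

    Mirrors the two oneOf branches of BRAIN_SCHEMA and raises ValueError
    if the reply fits neither.
    """
    data = json.loads(raw)
    if "read_files" in data:
        files = data["read_files"].get("files")
        if not isinstance(files, list):
            raise ValueError("read_files.files must be a list")
        return "read_files", {"files": files}

    state = data.get("current_state", {})
    missing = [k for k in ("step_evaluate", "ask_human", "next_goal") if k not in state]
    if "analysis" not in data or missing:
        raise ValueError(f"incomplete decision, missing: {missing or ['analysis']}")
    return "decision", state


reply = json.dumps({
    "analysis": {"analysis": "Safari is open", "sop_check": "step 2"},
    "current_state": {
        "step_evaluate": "Success",
        "ask_human": "No",
        "next_goal": "Type github.com in the address bar and press Enter",
    },
})
kind, payload = classify_brain_output(reply)
print(kind, "->", payload["next_goal"])
```

A caller would typically forward `next_goal` to the Actor for a decision, and load the requested files and re-prompt the Brain for a `read_files` reply.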
4.3 Brain Prompt Template
```python
# src/agent/prompts.py
class BrainPrompt_turix:
    def get_system_message(self) -> SystemMessage:
        return SystemMessage(
            content=f"""
SYSTEM PROMPT FOR BRAIN MODEL:
=== GLOBAL INSTRUCTIONS ===

=== ROLE-SPECIFIC DIRECTIVES ===
"Success" if the step is complete.
5. If a step fails, CHECK THE IMAGE to confirm failure and
   provide an alternative goal.
6. If something is unclear (e.g., login required, preferences),
   ask the user for confirmation in ask_human.
7. YOU MUST WRITE THE DETAIL TEXT YOU WANT THE ACTOR TO INPUT OR
   EXECUTE IN THE NEXT GOAL, DO NOT JUST WRITE "INPUT MESSAGE".
"""
        )
```
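4.4 Brain Workflow
```
┌─────────────────────────────────────┐
│ Capture screenshot                  │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ Analyze current state               │
│ • Compare with the expected state   │
│ • Evaluate the previous step        │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ Set the next goal                   │
│ • Generate executable instructions  │
│ • Consider Skill guidance           │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ Output the decision                 │
│ • step_evaluate                     │
│ • next_goal                         │
│ • ask_human                         │
└─────────────────────────────────────┘
```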
5. MCP Integration
5.1 MCP Introduction
MCP (Model Context Protocol) allows TuriX to integrate with Claude for Desktop or other MCP-enabled agents.
5.2 MCP Integration Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│                         Claude / MCP Client                         │
│                                                                     │
│  User: "Search for AI news, write it into a Pages document,         │
│         and send it to a contact"                                   │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │ MCP Protocol
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       TuriX CUA (MCP Server)                        │
│                                                                     │
│   ┌──────────────┐                                                  │
│   │ MCP Handler  │ ◀─── Receives task instructions from Claude      │
│   └──────┬───────┘                                                  │
│          │                                                          │
│          ▼                                                          │
│   ┌──────────────┐                                                  │
│   │    Agent     │ ◀─── Decomposes and executes the task            │
│   └──────┬───────┘                                                  │
│          │                                                          │
│          ▼                                                          │
│   ┌──────────────┐                                                  │
│   │  Controller  │ ────▶ Controls the macOS UI                      │
│   └──────────────┘                                                  │
└─────────────────────────────────────────────────────────────────────┘
```
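5.3 MCP Demo Capabilities
| Demo scenario | Description |
|---|---|
| Claude searches for AI news | Claude invokes TuriX through MCP |
| Write to a Pages document | TuriX controls the desktop to create the document |
| Send to a contact | Automated email-sending flow |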
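On the wire, an MCP client asks a server to run a tool with a JSON-RPC 2.0 `tools/call` request. The sketch below only builds such a message; the tool name `run_task` and its `task` argument are hypothetical, since these notes do not list the tools the TuriX MCP server actually exposes.

```python
import json
from itertools import count

_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message, ensure_ascii=False)


# "run_task" and "task" are placeholders; a client would discover the real
# tool names from the server's tools/list response.
request = build_tool_call("run_task", {"task": "Search for AI news and save it to Pages"})
print(request)
```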
6. OpenClaw Skill Integration
6.1 Skill System Overview
Skills are Markdown playbooks that help the Agent perform more reliably in specific domains.
6.2 Skill File Structure
```markdown
---
name: github-web-actions
description: When you need to perform GitHub web operations
---
# GitHub Web Actions Skill
## Search Repositories
## Star Actions
```
6.3 Skill Parsing Implementation
```python
# src/utils/skills.py
from dataclasses import dataclass
from pathlib import Path


def _split_frontmatter(text: str) -> tuple[dict, str]:
    """Parse YAML frontmatter."""
    if not text.startswith("---"):
        return {}, text

    lines = text.splitlines()
    # ... frontmatter parsing logic (elided in these notes; it sets end_idx)
    frontmatter_lines = lines[1:end_idx]
    body = "\n".join(lines[end_idx + 1:]).lstrip("\n")
    metadata = {}
    for line in frontmatter_lines:
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        metadata[key.strip()] = value.strip()
    return metadata, body


@dataclass(frozen=True)
class SkillMetadata:
    name: str
    description: str
    path: Path


@dataclass(frozen=True)
class SkillContent:
    name: str
    description: str
    body: str
    path: Path
```
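The excerpt above elides how `end_idx` is found. A self-contained version could look like the sketch below, which assumes the frontmatter ends at the first later line consisting only of `---`; that search, and the fallback when no closing delimiter exists, are assumptions rather than the project's code.

```python
def split_frontmatter(text: str) -> tuple[dict, str]:
    """Split '---'-delimited frontmatter from the Markdown body."""
    if not text.startswith("---"):
        return {}, text

    lines = text.splitlines()
    # Assumed logic for the elided part: locate the closing '---' line.
    end_idx = None
    for idx, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            end_idx = idx
            break
    if end_idx is None:
        return {}, text  # no closing delimiter: treat everything as body

    metadata = {}
    for line in lines[1:end_idx]:
        if ":" not in line:
            continue
        key, value = line.split(":", 1)  # split once so values may contain ':'
        metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end_idx + 1:]).lstrip("\n")
    return metadata, body


skill_text = """---
name: github-web-actions
description: When you need to perform GitHub web operations
---

# GitHub Web Actions Skill
"""
meta, body = split_frontmatter(skill_text)
print(meta["name"], "|", body)
```

Note that this is a flat `key: value` reader, not a YAML parser: nested keys, lists, and quoted strings are left as raw text.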
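6.4 Skill Loading Flow
```
┌──────────────────────────────┐
│ Scan the skills/ directory   │
│ for .md files                │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Parse the frontmatter        │
│ Extract name / description   │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Planner selects              │
│ relevant Skills              │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│ Brain uses the full content  │
│ to guide each step           │
└──────────────────────────────┘
```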
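The last two stages of the flow, selection by the Planner and use of the full body by the Brain, can be sketched as follows. In TuriX the Planner is a model; here a simple word-overlap check stands in for it. `Skill`, `select_skills`, `brain_context`, and the `spreadsheet-basics` entry are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str
    body: str


def select_skills(task: str, skills: list[Skill]) -> list[Skill]:
    """Planner stand-in: pick skills whose name or description shares a word with the task."""
    # Ignore very short words ("a", "to", ...) so they do not cause false matches.
    task_words = {w for w in task.lower().split() if len(w) > 2}
    selected = []
    for skill in skills:
        text = f"{skill.name} {skill.description}".lower().replace("-", " ")
        if task_words & set(text.split()):
            selected.append(skill)
    return selected


def brain_context(selected: list[Skill]) -> str:
    """Only the selected skills contribute their full body to the Brain prompt."""
    return "\n\n".join(f"[{s.name}]\n{s.body}" for s in selected)


skills = [
    Skill("github-web-actions", "When you need to perform GitHub web operations", "1. Open github.com ..."),
    Skill("spreadsheet-basics", "Editing cells in a spreadsheet", "1. Select the cell ..."),
]
chosen = select_skills("Star a GitHub repository", skills)
print([s.name for s in chosen])
```

Selecting on the short name and description first, and loading full bodies only for the chosen skills, keeps the Brain's prompt small.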
6.5 OpenClaw Skill Package
OpenCLaw_TuriX_skill/SKILL.md:
```yaml
---
name: turix-mac
description: Computer Use Agent (CUA) for macOS automation using TuriX. Use when you need to perform visual tasks on the desktop.
---
# TuriX-Mac Skill
## When to Use
## Architecture
User Request → [Clawdbot] → [TuriX Skill] → [run_turix.sh] → [TuriX Agent]
                                                                   ↓
                                         ┌─────────────────────────┼─────────────────────────┐
                                         ↓                         ↓                         ↓
                                     [Planner]                  [Brain]                  [Memory]
                                         ↓                         ↓                         ↓
                                      [Actor] ─────────────→ [Controller] ────────────→ [macOS UI]
```
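7. Technical Highlights
7.1 Multi-Model Collaboration Architecture
| Innovation | Description |
|---|---|
| Role separation | Planner, Brain, Actor, and Memory each handle their own responsibility |
| Hot-swappable models | Switch VLMs through config.json without changing code |
| Structured output | Strict JSON Schemas ensure the output can be parsed |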
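Hot-swapping works because each role reads its model settings from its own section of config.json (the layout shown in section 8.4). The sketch below loads those sections into typed objects; `ModelConfig`, `ROLES`, and `load_models` are hypothetical names, and the real loader may differ.

```python
import json
from dataclasses import dataclass

ROLES = ("brain_llm", "actor_llm", "memory_llm")


@dataclass(frozen=True)
class ModelConfig:
    provider: str
    model_name: str
    api_key: str


def load_models(config_text: str) -> dict[str, ModelConfig]:
    """Parse the per-role model sections; raise if a role is missing."""
    config = json.loads(config_text)
    missing = [role for role in ROLES if role not in config]
    if missing:
        raise KeyError(f"config is missing sections: {missing}")
    return {role: ModelConfig(**config[role]) for role in ROLES}


config_text = json.dumps({
    "agent": {"task": "Open Safari and go to github.com"},
    "brain_llm": {"provider": "turix", "model_name": "turix-brain", "api_key": "YOUR_API_KEY"},
    "actor_llm": {"provider": "turix", "model_name": "turix-actor", "api_key": "YOUR_API_KEY"},
    "memory_llm": {"provider": "turix", "model_name": "turix-brain", "api_key": "YOUR_API_KEY"},
})
models = load_models(config_text)
print(models["actor_llm"].model_name)
```

Swapping the Actor's model then means editing only the `actor_llm` section; no Python code changes.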
7.2 UI Tree Building Technology
```python
# src/mac/tree.py
class MacUITreeBuilder:
    def __init__(self):
        self.INTERACTIVE_ACTIONS = {
            'AXPress',      # Most buttons
            'AXShowMenu',   # Menu buttons
            'AXIncrement',  # Spinners
            'AXConfirm',    # Dialogs
            'AXSetValue'    # Text fields
        }

    def _get_all_attributes(self, element: 'AXUIElement'):
        """Get all accessible attributes of an element."""
        attributes = {}
        error, attribute_names = AXUIElementCopyAttributeNames(element, None)
        if error == kAXErrorSuccess and attribute_names:
            for attr in list(attribute_names):
                attributes[attr] = self._get_attribute(element, attr)
        return attributes
```
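The INTERACTIVE_ACTIONS set suggests how a tree builder might decide which elements are worth exposing to the model: an element counts as interactive if it supports at least one of those actions. The sketch below applies that idea to plain dictionaries, because the accessibility API itself is only available on macOS. The tree and `collect_interactive` are hypothetical.

```python
INTERACTIVE_ACTIONS = {"AXPress", "AXShowMenu", "AXIncrement", "AXConfirm", "AXSetValue"}


def collect_interactive(node: dict, path: str = "") -> list[str]:
    """Depth-first walk returning the paths of elements that expose an interactive action."""
    here = f"{path}/{node['role']}"
    found = []
    if INTERACTIVE_ACTIONS & set(node.get("actions", [])):
        found.append(here)
    for child in node.get("children", []):
        found.extend(collect_interactive(child, here))
    return found


# Hand-written stand-in for one window's accessibility tree.
window = {
    "role": "AXWindow",
    "actions": ["AXRaise"],
    "children": [
        {"role": "AXButton", "actions": ["AXPress"]},
        {"role": "AXStaticText", "actions": []},
        {"role": "AXGroup", "children": [{"role": "AXTextField", "actions": ["AXSetValue"]}]},
    ],
}
print(collect_interactive(window))
```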
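7.3 No App-Specific APIs
Traditional automation: requires a dedicated API integration for each application.
TuriX approach: if a human can click it, TuriX can click it.
- WhatsApp
- Excel
- Outlook
- Internal tools
- Any macOS application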
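7.4 Performance Optimization
| Technique | Effect |
|---|---|
| Async execution | Uses asyncio to run actions concurrently |
| Memory compression | Recoverable compression of the agent's memory |
| Token optimization | Smart token counting and truncation |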
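The notes do not describe how token counting and truncation are done. As one possible reading, the sketch below keeps the newest messages that fit a token budget, using a crude estimate of four characters per token. Both the heuristic and the function names are assumptions; a real system would use the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Very rough proxy: about one token per four characters."""
    return max(1, len(text) // 4)


def truncate_history(messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages whose estimated token total fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = ["a" * 400, "b" * 200, "c" * 40]
print([len(m) for m in truncate_history(history, budget=70)])  # [200, 40]
```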
8. Key Takeaways
8.1 Core Design Patterns
8.2 Key Files Quick Reference
| File | Purpose |
|---|---|
| src/agent/service.py | Main agent logic |
| src/agent/output_schemas.py | All schema definitions |
| src/agent/prompts.py | Prompt templates |
| src/controller/service.py | Action controller |
| src/mac/actions.py | Native macOS operations |
| src/mac/tree.py | UI tree builder |
| src/utils/skills.py | Skill system |
8.3 Extension Development Guide
```python
# Adding a new action type

# 1. Add a Pydantic model in src/controller/views.py
class NewAction(BaseModel):
    param1: str
    param2: int

# 2. Add the execution logic in src/controller/service.py
async def execute_new_action(self, params):
    # Implement the action here
    pass

# 3. Add the schema in output_schemas.py
#    (fragment: an entry alongside the other action keys in ACTION_SCHEMA)
"new_action": {
    "type": "object",
    "properties": {
        "param1": {"type": "string"},
        "param2": {"type": "integer"}  # "integer" matches the int field of the model
    }
}
```
8.4 Configuration Example
examples/config.json:
```json
{
  "agent": {
    "task": "Open Safari and go to github.com"
  },
  "brain_llm": {
    "provider": "turix",
    "model_name": "turix-brain",
    "api_key": "YOUR_API_KEY"
  },
  "actor_llm": {
    "provider": "turix",
    "model_name": "turix-actor",
    "api_key": "YOUR_API_KEY"
  },
  "memory_llm": {
    "provider": "turix",
    "model_name": "turix-brain",
    "api_key": "YOUR_API_KEY"
  }
}
```
8.5 Learning Resources
*Created with ❤️ for AI enthusiasts*