TuriX-CUA In-Depth Study Notes

Table of Contents

  • Project Overview
  • Core Architecture
  • Action System
  • Brain System
  • MCP Integration
  • OpenClaw Skill Integration
  • Technical Highlights
  • Key Takeaways

    1. Project Overview

    1.1 What is TuriX-CUA

    TuriX-CUA (Computer Use Agent) is an open-source AI agent that lets AI models complete tasks by operating a desktop operating system directly, guided by visual understanding of the screen.

    1.2 Core Positioning

    | Feature | Description |
    | --- | --- |
    | Positioning | AI-driven digital assistant |
    | Mission | "Describe your task to your computer and put your digital workhorse to work" |
    | Open source | 100% open source, free for personal and research use |
    | Cross-platform | macOS / Windows / Linux |

    1.3 Performance

    OSWorld benchmark score: 64.2% (ranked 3rd)
    In-house macOS benchmark: 80%+ success rate
    

    2. Core Architecture

    2.1 System Architecture Diagram

    ┌─────────────────────────────────────────────────────────────────────┐
    │                              User Task                              │
    │                  "Book flights and a hotel for me"                  │
    └──────────────────────────────────┬──────────────────────────────────┘
                                       │
                                       ▼
    ┌─────────────────────────────────────────────────────────────────────┐
    │                          TuriX Agent Core                           │
    ├─────────────────────────────────────────────────────────────────────┤
    │                                                                     │
    │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐              │
    │  │   Planner   │───▶│    Brain    │───▶│    Actor    │              │
    │  └─────────────┘    └─────────────┘    └─────────────┘              │
    │         │                  │                  │                     │
    │         ▼                  ▼                  ▼                     │
    │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐              │
    │  │   Memory    │    │   Skills    │    │ Controller  │              │
    │  └─────────────┘    └─────────────┘    └─────────────┘              │
    │                                               │                     │
    │                                               ▼                     │
    │                                      ┌─────────────────┐            │
    │                                      │    macOS UI     │            │
    │                                      │  (Quartz/Axis)  │            │
    │                                      └─────────────────┘            │
    └─────────────────────────────────────────────────────────────────────┘
    

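    2.2 Multi-Model Division of Labor

    | Model | Responsibility | Input | Output |
    | --- | --- | --- | --- |
    | Planner | Task decomposition, Skill selection, web search | Task + Memory | Step-by-step plan + selected Skills |
    | Brain | State analysis, goal setting, step evaluation | Screenshot + Memory + Skills | step_evaluate, next_goal |
    | Actor | Precise action execution | Brain instructions + screenshot | Concrete action sequence |
    | Memory | Context memory compression | Step history | Structured summary |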

    2.3 Core Code Structure

    turix-cua/
    ├── src/
    │   ├── agent/                    # Agent core
    │   │   ├── service.py           # Main agent service
    │   │   ├── planner_service.py   # Planner
    │   │   ├── prompts.py           # Prompt templates
    │   │   ├── output_schemas.py    # Output schema definitions
    │   │   ├── views.py             # Data models
    │   │   └── message_manager/     # Message management
    │   │
    │   ├── controller/              # Controller
    │   │   ├── service.py          # Main controller service
    │   │   ├── registry/           # Action registry
    │   │   └── views.py            # Action definitions
    │   │
    │   ├── mac/                    # macOS-specific implementation
    │   │   ├── actions.py         # Native action execution
    │   │   ├── tree.py            # UI tree construction
    │   │   └── element.py         # Element definitions
    │   │
    │   └── utils/                  # Utility functions
    │       ├── skills.py          # Skill system
    │       ├── brain_search.py    # Search augmentation
    │       └── record_store.py    # Memory storage
    │
    ├── OpenCLaw_TuriX_skill/        # OpenClaw Skill package
    │   ├── SKILL.md               # Skill definition
    │   └── scripts/               # Run scripts
    │
    ├── skills/                     # Built-in Skills
    │   └── github-web-actions.md
    │
    └── examples/
        ├── main.py                # Main entry point
        └── config.json            # Configuration file
    

    3. Action System

    3.1 Action Schema Details

    TuriX defines a structured action schema that AI models use to generate executable desktop operation instructions.

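    Core Action Types Table

    | Action | Schema Key | Parameters | Description |
    | --- | --- | --- | --- |
    | Finish task | `done` | `{text: string}` | Marks the task as completed |
    | Text input | `input_text` | `{text: string}` | Types text into the focused element |
    | Open app | `open_app` | `{app_name: string}` | Launches the named application |
    | Run script | `run_apple_script` | `{script: string}` | Executes AppleScript |
    | Click | `Click` | `{position: [x, y]}` | Left click (normalized coordinates) |
    | Right click | `RightSingle` | `{position: [x, y]}` | Right click |
    | Drag | `Drag` | `{position1: [x1,y1], position2: [x2,y2]}` | Drags from the start point to the end point |
    | Move mouse | `move_mouse` | `{position: [x, y]}` | Moves the mouse pointer |
    | Scroll up | `scroll_up` | `{position, dx, dy}` | Scrolls up at the given position |
    | Scroll down | `scroll_down` | `{position, dx, dy}` | Scrolls down at the given position |
    | Hotkey | `Hotkey` | `{key: string}` | Single-key shortcut |
    | Key combination | `multi_Hotkey` | `{key1, key2, key3?}` | Multi-key combination |
    | Record info | `record_info` | `{text, file_name}` | Writes important information to a file |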

    3.2 Complete Action Schema Definition

    python

    # src/agent/output_schemas.py

    ACTION_SCHEMA = {
        "type": "object",
        "properties": {
            "action": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        # ----- task finished -----
                        "done": {"type": "object", "properties": {"text": {"type": "string"}}},

                        # ----- typing -----
                        "input_text": {
                            "type": "object",
                            "properties": {"text": {"type": "string"}},
                            "required": ["text"]
                        },

                        # ----- open app -----
                        "open_app": {
                            "type": "object",
                            "properties": {"app_name": {"type": "string"}},
                            "required": ["app_name"]
                        },

                        # ----- AppleScript -----
                        "run_apple_script": {
                            "type": "object",
                            "properties": {"script": {"type": "string"}},
                            "required": ["script"]
                        },

                        # ----- hotkeys -----
                        "Hotkey": {
                            "type": "object",
                            "properties": {"key": {"type": "string"}},
                            "required": ["key"]
                        },
                        "multi_Hotkey": {
                            "type": "object",
                            "properties": {
                                "key1": {"type": "string"},
                                "key2": {"type": "string"},
                                "key3": {"type": "string"},
                            },
                            "required": ["key1", "key2"]
                        },

                        # ----- clicks -----
                        "Click": {
                            "type": "object",
                            "properties": {
                                "position": {"type": "array", "items": {"type": "number"}}
                            },
                            "required": ["position"]
                        },

                        # ----- drag -----
                        "Drag": {
                            "type": "object",
                            "properties": {
                                "position1": {"type": "array", "items": {"type": "number"}},
                                "position2": {"type": "array", "items": {"type": "number"}},
                            },
                            "required": ["position1", "position2"]
                        },

                        # ----- scrolling -----
                        "scroll_up": {
                            "type": "object",
                            "properties": {
                                "position": {"type": "array", "items": {"type": "number"}},
                                "dx": {"type": "number"},
                                "dy": {"type": "number"},
                            },
                            "required": ["position"]
                        },
                        "scroll_down": {
                            "type": "object",
                            "properties": {
                                "position": {"type": "array", "items": {"type": "number"}},
                                "dx": {"type": "number"},
                                "dy": {"type": "number"},
                            },
                            "required": ["position"]
                        },

                        # ----- memory -----
                        "record_info": {
                            "type": "object",
                            "properties": {
                                "text": {"type": "string"},
                                "file_name": {"type": "string"}
                            },
                            "required": ["text", "file_name"]
                        },
                        "wait": {"type": "object", "properties": {"text": {"type": "string"}}},
                    },
                }
            }
        }
    }
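
To make the schema concrete, here is a minimal sketch of checking an Actor payload against the `required` lists quoted above, using only the standard library. The `REQUIRED` table and the `missing_fields` helper are hypothetical names written for these notes; they are not part of TuriX, which may validate model output differently. The sketch is also stricter than the schema in one respect: it reports unknown action keys, which the schema as quoted does not forbid.

```python
# Required parameters per action key, copied from the `required` lists in ACTION_SCHEMA.
# Actions that declare no `required` list (done, wait) map to an empty tuple.
REQUIRED = {
    "done": (),
    "wait": (),
    "input_text": ("text",),
    "open_app": ("app_name",),
    "run_apple_script": ("script",),
    "Hotkey": ("key",),
    "multi_Hotkey": ("key1", "key2"),
    "Click": ("position",),
    "Drag": ("position1", "position2"),
    "scroll_up": ("position",),
    "scroll_down": ("position",),
    "record_info": ("text", "file_name"),
}


def missing_fields(payload: dict) -> list[str]:
    """Return human-readable problems found in an {"action": [...]} payload."""
    problems = []
    for index, item in enumerate(payload.get("action", [])):
        for name, params in item.items():
            if name not in REQUIRED:
                # Stricter than the schema: unknown keys are flagged here.
                problems.append(f"action[{index}]: unknown action {name!r}")
                continue
            for field in REQUIRED[name]:
                if field not in params:
                    problems.append(f"action[{index}].{name}: missing {field!r}")
    return problems


# Hypothetical Actor output: open an app, then click the center of the screen.
good = {"action": [{"open_app": {"app_name": "Safari"}}, {"Click": {"position": [500, 500]}}]}
bad = {"action": [{"Drag": {"position1": [100, 100]}}]}

print(missing_fields(good))  # []
print(missing_fields(bad))   # ["action[0].Drag: missing 'position2'"]
```

A full JSON Schema validator would also check value types; this sketch only checks that required parameters are present.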

    3.3 Controller Action Implementation

    python

    # src/controller/views.py
    from typing import List, Optional

    from pydantic import BaseModel, Field


    class InputTextAction(BaseModel):
        text: str


    class OpenAppAction(BaseModel):
        app_name: str


    class PressAction(BaseModel):
        key: str


    class PressCombinedAction(BaseModel):
        key1: str
        key2: str
        key3: Optional[str] = None


    class LeftClickPixel(BaseModel):
        position: List[float] = Field(..., description="Coordinates (normalised) [x,y]")


    class RightClickPixel(BaseModel):
        position: List[float] = Field(..., description="Coordinates (normalised) [x,y]")


    class ScrollUpAction(BaseModel):
        position: List[float]
        dx: Optional[int]
        dy: Optional[int]


    class ScrollDownAction(BaseModel):
        position: List[float]
        dx: Optional[int]
        dy: Optional[int]


    class DragAction(BaseModel):
        position1: List[float]
        position2: List[float]

    3.4 macOS Native Action Execution

    python

    # src/mac/actions.py
    import asyncio

    import Quartz


    async def _click_invisible(x, y, button='left'):
        """
        Perform a press-and-release click at (x, y) without leaving the cursor there.
        """
        if button == 'left':
            down_type = Quartz.kCGEventLeftMouseDown
            up_type = Quartz.kCGEventLeftMouseUp
            cg_button = Quartz.kCGMouseButtonLeft
        else:
            down_type = Quartz.kCGEventRightMouseDown
            up_type = Quartz.kCGEventRightMouseUp
            cg_button = Quartz.kCGMouseButtonRight

        # Saved pointer position; this excerpt does not show it being restored.
        old_pos = _get_current_mouse_position()

        # Move to position
        move = Quartz.CGEventCreateMouseEvent(None, Quartz.kCGEventMouseMoved, (x, y), cg_button)
        Quartz.CGEventPost(Quartz.kCGSessionEventTap, move)
        await asyncio.sleep(0.03)

        # Mouse down
        event_down = Quartz.CGEventCreateMouseEvent(None, down_type, (x, y), cg_button)
        Quartz.CGEventPost(Quartz.kCGHIDEventTap, event_down)

        # Mouse up
        event_up = Quartz.CGEventCreateMouseEvent(None, up_type, (x, y), cg_button)
        Quartz.CGEventPost(Quartz.kCGHIDEventTap, event_up)

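    3.5 Coordinate Normalization System

    All coordinates are normalized to the range 0-1000, independent of screen resolution.
  • [500, 500] = center of the screen
  • [0, 0] = top-left corner
  • [1000, 1000] = bottom-right corner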
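
The mapping from normalized coordinates to screen positions can be sketched as a simple linear scale. The function name `to_pixels` and the 1440x900 screen size are assumptions chosen for illustration; how TuriX itself performs the conversion (for example, whether it works in logical points or physical pixels) is not shown in these notes.

```python
def to_pixels(position, screen_width, screen_height, scale=1000):
    """Map a normalized [x, y] in the 0..scale range to screen coordinates."""
    x, y = position
    if not (0 <= x <= scale and 0 <= y <= scale):
        raise ValueError(f"position {position} is outside 0..{scale}")
    # Linear scaling: 0 maps to the left/top edge, `scale` to the right/bottom edge.
    return (round(x / scale * screen_width), round(y / scale * screen_height))


print(to_pixels([500, 500], 1440, 900))    # (720, 450)
print(to_pixels([0, 0], 1440, 900))        # (0, 0)
print(to_pixels([1000, 1000], 1440, 900))  # (1440, 900)
```

Note that [1000, 1000] lands exactly on the screen's width and height, one step past the last addressable pixel, so a real implementation would probably clamp the result.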

    4. Brain System

    4.1 Brain Schema Details

    The Brain is TuriX's core decision component. It analyzes the current state and sets the next goal for the Actor.

    python

    # src/agent/output_schemas.py

    BRAIN_SCHEMA = {
        "oneOf": [
            {
                "type": "object",
                "properties": {
                    "analysis": {
                        "type": "object",
                        "properties": {
                            "analysis": {"type": "string"},
                            "sop_check": {"type": "string"}
                        },
                        "required": ["analysis", "sop_check"]
                    },
                    "current_state": {
                        "type": "object",
                        "properties": {
                            "step_evaluate": {"type": "string"},  # "Success" / "Failed"
                            "ask_human": {"type": "string"},      # question for the user, or "No"
                            "next_goal": {"type": "string"},      # next goal
                        },
                        "required": ["step_evaluate", "ask_human", "next_goal"]
                    },
                },
                "required": ["analysis", "current_state"],
            },
            {
                "type": "object",
                "properties": {
                    "read_files": {
                        "type": "object",
                        "properties": {
                            "files": {"type": "array", "items": {"type": "string"}}
                        },
                        "required": ["files"]
                    }
                },
                "required": ["read_files"]
            }
        ]
    }

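    4.2 Brain Output Fields Description

    | Field | Type | Description |
    | --- | --- | --- |
    | `analysis.analysis` | string | Detailed analysis of how well the current state matches the expected state |
    | `analysis.sop_check` | string | Identifies which Skill step currently applies |
    | `current_state.step_evaluate` | string | "Success" or "Failed" |
    | `current_state.ask_human` | string | A question asking the user to confirm (for example, when login is required), or "No" |
    | `current_state.next_goal` | string | An executable goal for the Actor |
    | `read_files.files` | array | Request to read the contents of specific files |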
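
The two `oneOf` branches mean a consumer has to decide which shape it received before reading any field. The following is a minimal sketch of that branching; `interpret_brain_output` and the returned `kind` labels are hypothetical names for these notes, not TuriX APIs, and the sample reply is invented for illustration.

```python
import json


def interpret_brain_output(raw: str) -> dict:
    """Classify a Brain reply as a file request, a question for the user, or a goal."""
    data = json.loads(raw)

    # Second branch of the oneOf: the Brain wants file contents before deciding.
    if "read_files" in data:
        return {"kind": "read_files", "files": data["read_files"]["files"]}

    state = data["current_state"]
    # ask_human carries either a question or the literal string "No".
    if state["ask_human"] != "No":
        return {"kind": "ask_human", "question": state["ask_human"]}

    return {
        "kind": "next_goal",
        "previous_step_ok": state["step_evaluate"] == "Success",
        "goal": state["next_goal"],
    }


reply = json.dumps({
    "analysis": {"analysis": "Safari is open on a blank tab.", "sop_check": "Step 1 of the skill"},
    "current_state": {
        "step_evaluate": "Success",
        "ask_human": "No",
        "next_goal": "Type github.com into the address bar and press Return",
    },
})
print(interpret_brain_output(reply)["kind"])  # next_goal
print(interpret_brain_output('{"read_files": {"files": ["notes.txt"]}}'))
# {'kind': 'read_files', 'files': ['notes.txt']}
```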

    4.3 Brain Prompt Template

    python

    # src/agent/prompts.py

    class BrainPrompt_turix:
        def get_system_message(self) -> SystemMessage:
            return SystemMessage(
                content=f"""
    SYSTEM PROMPT FOR BRAIN MODEL:
    === GLOBAL INSTRUCTIONS ===
    - Environment: macOS. Current time is {self.current_time}.
    - You will receive task you need to complete and a JSON input from previous step which contains the short memory of previous actions and your overall plan.
    - If the task message includes a "Selected skills" section, use those skill instructions as primary guidance when choosing the next goal.
    - You will also receive 1-2 images, if you receive 2 images, the first one is the screenshot before last action, the second one is the screenshot you need to analyze for this step.
    - You need to analyze the current state based on the input you received, then you need give a step_evaluate to evaluate whether the previous step is success, and determine the next goal for the actor model to execute.
    - You can only ask the actor model to use the apps that are already installed in the computer.

    === ROLE-SPECIFIC DIRECTIVES ===
    - Role: Brain Model for MacOS 15+ Agent.
    - For most actions to be evaluated as "Success," the screenshot should show the expected result.
    - Responsibilities
    1. Analysis and evaluate the previous goal.
    2. Determine the next goal for the actor model to execute.
    3. Check the provided image/data carefully to validate step success.
    4. Mark step_evaluate as "Success" if the step is complete.
    5. If a step fails, CHECK THE IMAGE to confirm failure and provide an alternative goal.
    6. If something is unclear (e.g., login required, preferences), ask the user for confirmation in ask_human.
    7. YOU MUST WRITE THE DETAIL TEXT YOU WANT THE ACTOR TO INPUT OR EXECUTE IN THE NEXT GOAL, DO NOT JUST WRITE "INPUT MESSAGE".
    """
            )

    4.4 Brain Workflow

    ┌───────────────────────────────────┐
    │ Capture screenshot                │
    └─────────────────┬─────────────────┘
                      │
                      ▼
    ┌───────────────────────────────────┐
    │ Analyze current state             │
    │ • Compare with expected result    │
    │ • Did the previous step succeed?  │
    └─────────────────┬─────────────────┘
                      │
                      ▼
    ┌───────────────────────────────────┐
    │ Set the next goal                 │
    │ • Produce a concrete instruction  │
    │ • Consider Skill guidance         │
    └─────────────────┬─────────────────┘
                      │
                      ▼
    ┌───────────────────────────────────┐
    │ Output the decision               │
    │ • step_evaluate                   │
    │ • next_goal                       │
    │ • ask_human                       │
    └───────────────────────────────────┘
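
The workflow above repeats once per step. A rough sketch of the surrounding loop, with the Brain and Actor models replaced by stub callables, might look like this. Everything here (`run_steps`, the stub functions, the stop conditions) is an assumption made for illustration, not TuriX's actual control loop.

```python
def run_steps(take_screenshot, brain, actor, max_steps=10):
    """Drive the screenshot -> analyze -> next goal cycle until the actor reports done."""
    history = []
    for _ in range(max_steps):
        shot = take_screenshot()
        # The Brain returns a dict with step_evaluate / ask_human / next_goal.
        decision = brain(shot, history)
        if decision["ask_human"] != "No":
            return {"status": "needs_human", "question": decision["ask_human"], "history": history}
        # The Actor turns the goal into a list of action dicts.
        actions = actor(decision["next_goal"], shot)
        history.append({"goal": decision["next_goal"], "evaluate": decision["step_evaluate"]})
        if any("done" in action for action in actions):
            return {"status": "done", "history": history}
    return {"status": "step_limit", "history": history}


# Stub models for demonstration only.
goals = iter(["Open Safari", "Type github.com in the address bar"])


def fake_brain(shot, history):
    return {"step_evaluate": "Success", "ask_human": "No", "next_goal": next(goals)}


def fake_actor(goal, shot):
    if goal.startswith("Type"):
        return [{"input_text": {"text": "github.com"}}, {"done": {"text": "finished"}}]
    return [{"open_app": {"app_name": "Safari"}}]


result = run_steps(lambda: b"", fake_brain, fake_actor)
print(result["status"], len(result["history"]))  # done 2
```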

    5. MCP Integration

    5.1 MCP Introduction

    MCP (Model Context Protocol) allows TuriX to integrate with Claude for Desktop or other MCP-enabled agents.

    5.2 MCP Integration Architecture

    ┌─────────────────────────────────────────────────────────────────────┐
    │                         Claude / MCP Client                         │
    │                                                                     │
    │  User: "Search for AI news, write it into a Pages document, and     │
    │         send it to a contact"                                       │
    └──────────────────────────────────┬──────────────────────────────────┘
                                       │ MCP Protocol
                                       ▼
    ┌─────────────────────────────────────────────────────────────────────┐
    │                       TuriX CUA (MCP Server)                        │
    │                                                                     │
    │  ┌──────────────┐                                                   │
    │  │ MCP Handler  │ ◀─── Receives task instructions from Claude       │
    │  └──────┬───────┘                                                   │
    │         │                                                           │
    │         ▼                                                           │
    │  ┌──────────────┐                                                   │
    │  │    Agent     │ ◀─── Decomposes and executes the task             │
    │  └──────┬───────┘                                                   │
    │         │                                                           │
    │         ▼                                                           │
    │  ┌──────────────┐                                                   │
    │  │  Controller  │ ────▶ Operates the macOS UI                       │
    │  └──────────────┘                                                   │
    └─────────────────────────────────────────────────────────────────────┘
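
MCP messages are JSON-RPC 2.0, and a client invokes a server-side tool with the `tools/call` method. The sketch below shows roughly what the exchange in the diagram could look like on the wire. The tool name `run_task`, its `task` argument, and both helper functions are hypothetical; the notes do not say which tools TuriX actually exposes, and a real server would use an MCP SDK rather than hand-built JSON.

```python
import json


def make_tool_call(request_id: int, task: str) -> str:
    """Build the JSON-RPC 2.0 message an MCP client sends to invoke a tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        # "run_task" and its "task" argument are hypothetical names.
        "params": {"name": "run_task", "arguments": {"task": task}},
    })


def handle_tool_call(message: str, run_agent) -> str:
    """Server side: hand the task to the agent and wrap its summary as a tool result."""
    request = json.loads(message)
    task = request["params"]["arguments"]["task"]
    summary = run_agent(task)
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"content": [{"type": "text", "text": summary}], "isError": False},
    })


# The lambda stands in for the TuriX agent.
reply = handle_tool_call(make_tool_call(1, "Open Safari"), lambda task: f"completed: {task}")
print(json.loads(reply)["result"]["content"][0]["text"])  # completed: Open Safari
```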

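    5.3 MCP Demo Capabilities

    | Demo scenario | Description |
    | --- | --- |
    | Claude searches for AI news | TuriX is invoked through MCP |
    | Write into a Pages document | TuriX controls the desktop to create the document |
    | Send to a contact | Automated email-sending workflow |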

    6. OpenClaw Skill Integration

    6.1 Skill System Overview

    Skills are Markdown playbooks that help the Agent perform more reliably in specific domains.

    6.2 Skill File Structure

    markdown

    ---
    name: github-web-actions
    description: When you need to perform GitHub web operations
    ---

    # GitHub Web Actions Skill

    ## Search Repositories

    - Navigate to github.com
    - Click the search box
    - Type your query

    ## Star Actions

    - Find the repository
    - Click the star button

    6.3 Skill Parsing Implementation

    python

    # src/utils/skills.py
    from dataclasses import dataclass
    from pathlib import Path


    def _split_frontmatter(text: str) -> tuple[dict, str]:
        """Parse YAML front matter."""
        if not text.startswith("---"):
            return {}, text

        lines = text.splitlines()
        # ... front matter parsing logic (elided in these notes; it sets end_idx)
        frontmatter_lines = lines[1:end_idx]
        body = "\n".join(lines[end_idx + 1:]).lstrip("\n")
        metadata = {}
        for line in frontmatter_lines:
            if ":" not in line:
                continue
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()
        return metadata, body


    @dataclass(frozen=True)
    class SkillMetadata:
        name: str
        description: str
        path: Path


    @dataclass(frozen=True)
    class SkillContent:
        name: str
        description: str
        body: str
        path: Path
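
The excerpt above elides the step that locates the closing `---`. A self-contained sketch of the whole split, with that step filled in under an assumption, is shown below. The search for the closing delimiter and the fallback for unterminated front matter are guesses at reasonable behavior, not the code from `src/utils/skills.py`; the function is named `split_frontmatter` (no leading underscore) to keep it distinct from the original.

```python
def split_frontmatter(text: str) -> tuple[dict, str]:
    """Split '---'-delimited front matter from a Skill file into (metadata, body)."""
    if not text.startswith("---"):
        return {}, text

    lines = text.splitlines()
    # Assumed step: find the closing '---'; the opening one is lines[0].
    end_idx = next((i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"), None)
    if end_idx is None:
        return {}, text  # unterminated front matter: treat the whole file as body

    metadata = {}
    for line in lines[1:end_idx]:
        if ":" not in line:
            continue
        # Split on the first colon only, so values may themselves contain colons.
        key, value = line.split(":", 1)
        metadata[key.strip()] = value.strip()

    body = "\n".join(lines[end_idx + 1:]).lstrip("\n")
    return metadata, body


sample = """---
name: github-web-actions
description: When you need to perform GitHub web operations
---

# GitHub Web Actions Skill
"""
meta, body = split_frontmatter(sample)
print(meta["name"])  # github-web-actions
print(body)          # # GitHub Web Actions Skill
```

Because values are kept as plain strings, this handles only flat `key: value` pairs, not general YAML.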

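    6.4 Skill Loading Flow

    ┌───────────────────────────────┐
    │ Scan the skills/ directory    │
    │ for .md files                 │
    └───────────────┬───────────────┘
                    │
                    ▼
    ┌───────────────────────────────┐
    │ Parse front matter            │
    │ Extract name / description    │
    └───────────────┬───────────────┘
                    │
                    ▼
    ┌───────────────────────────────┐
    │ Planner selects               │
    │ relevant Skills               │
    └───────────────┬───────────────┘
                    │
                    ▼
    ┌───────────────────────────────┐
    │ Brain uses the full content   │
    │ to guide each step            │
    └───────────────────────────────┘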
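
The first three stages of this flow can be sketched with the standard library alone. In TuriX the selection is made by the Planner model; the keyword match in `select_skills` below is only a stand-in so the example runs without an LLM, and both function names are hypothetical.

```python
import tempfile
from pathlib import Path


def read_metadata(path: Path) -> dict:
    """Read only the front matter (name/description) of a Skill file."""
    meta = {}
    lines = path.read_text(encoding="utf-8").splitlines()
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            if ":" in line:
                key, value = line.split(":", 1)
                meta[key.strip()] = value.strip()
    return meta


def select_skills(skills_dir: Path, task: str) -> list[Path]:
    """Stand-in for the Planner: pick Skills whose name shares a word with the task."""
    task_words = set(task.lower().split())
    chosen = []
    for path in sorted(skills_dir.glob("*.md")):
        name_words = set(read_metadata(path).get("name", "").lower().split("-"))
        if task_words & name_words:
            chosen.append(path)
    return chosen


with tempfile.TemporaryDirectory() as tmp:
    skills_dir = Path(tmp)
    (skills_dir / "github-web-actions.md").write_text(
        "---\nname: github-web-actions\ndescription: GitHub web operations\n---\n\nSteps...\n",
        encoding="utf-8",
    )
    (skills_dir / "mail.md").write_text(
        "---\nname: mail\ndescription: Send mail\n---\n", encoding="utf-8"
    )
    picked = select_skills(skills_dir, "Star a repository on GitHub")
    print([p.name for p in picked])  # ['github-web-actions.md']
```

Reading only the metadata at this stage keeps the Planner's input small; the full Skill body is needed only once a Skill has been selected and handed to the Brain.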

    6.5 OpenClaw Skill Package

    yaml

    # OpenCLaw_TuriX_skill/SKILL.md

    ---
    name: turix-mac
    description: Computer Use Agent (CUA) for macOS automation using TuriX. Use when you need to perform visual tasks on the desktop.
    ---

    # TuriX-Mac Skill

    ## When to Use

    - When asked to perform actions on the Mac desktop
    - When navigating applications that lack command-line interfaces
    - For multi-step visual workflows

    ## Architecture

    User Request → [Clawdbot] → [TuriX Skill] → [run_turix.sh] → [TuriX Agent]
                                                                       ↓
                                             ┌─────────────────────────┼─────────────────────────┐
                                             ↓                         ↓                         ↓
                                         [Planner]                  [Brain]                  [Memory]
                                             ↓                         ↓                         ↓
                                          [Actor] ───→ [Controller] ───→ [macOS UI]

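    7. Technical Highlights

    7.1 Multi-Model Collaboration Architecture

    | Innovation | Description |
    | --- | --- |
    | Role separation | Planner, Brain, Actor, and Memory each have a distinct job |
    | Hot-swappable models | Switch the VLM through config.json without changing code |
    | Structured output | Strict JSON Schema keeps model output parseable |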

    7.2 UI Tree Building Technology

    python

    # src/mac/tree.py

    class MacUITreeBuilder:
        def __init__(self):
            self.INTERACTIVE_ACTIONS = {
                'AXPress',      # Most buttons
                'AXShowMenu',   # Menu buttons
                'AXIncrement',  # Spinners
                'AXConfirm',    # Dialogs
                'AXSetValue'    # Text fields
            }

        def _get_all_attributes(self, element: 'AXUIElement'):
            """Get all accessible attributes of an element."""
            attributes = {}
            error, attribute_names = AXUIElementCopyAttributeNames(element, None)
            if error == kAXErrorSuccess and attribute_names:
                for attr in list(attribute_names):
                    attributes[attr] = self._get_attribute(element, attr)
            return attributes

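    7.3 No App-Specific APIs

    Traditional automation: a dedicated API has to be developed for each application.
    TuriX approach: if a human can click it, TuriX can click it.
  • WhatsApp
  • Excel
  • Outlook
  • Internal tools
  • Any macOS application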

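    7.4 Performance Optimization

    | Technique | Effect |
    | --- | --- |
    | Asynchronous execution | Uses asyncio for concurrent actions |
    | Memory compression | Recoverable memory compression mechanism |
    | Token optimization | Smart token counting and truncation |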

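    8. Key Takeaways

    8.1 Core Design Patterns

  • Multi-agent collaboration: planning, decision-making, and execution are separated
  • Structured output: JSON Schema constrains LLM output
  • Vision-driven: visual understanding based on screenshots
  • Skill extension: pluggable Skills in Markdown format

    8.2 Key Files Quick Reference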

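    | File | Purpose |
    | --- | --- |
    | src/agent/service.py | Main agent logic |
    | src/agent/output_schemas.py | All schema definitions |
    | src/agent/prompts.py | Prompt templates |
    | src/controller/service.py | Action controller |
    | src/mac/actions.py | macOS native operations |
    | src/mac/tree.py | UI tree builder |
    | src/utils/skills.py | Skill system |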

    8.3 Extension Development Guide

    python

    # Adding a new action type

    # 1. Add a Pydantic model in src/controller/views.py
    class NewAction(BaseModel):
        param1: str
        param2: int

    # 2. Add the execution logic in src/controller/service.py
    async def execute_new_action(self, params):
        # Implement the action here
        pass

    # 3. Add the schema in output_schemas.py
    "new_action": {
        "type": "object",
        "properties": {
            "param1": {"type": "string"},
            "param2": {"type": "number"}
        }
    }

    8.4 Configuration Example

    json

    examples/config.json

    {
      "agent": {
        "task": "Open Safari and go to github.com"
      },
      "brain_llm": {
        "provider": "turix",
        "model_name": "turix-brain",
        "api_key": "YOUR_API_KEY"
      },
      "actor_llm": {
        "provider": "turix",
        "model_name": "turix-actor",
        "api_key": "YOUR_API_KEY"
      },
      "memory_llm": {
        "provider": "turix",
        "model_name": "turix-brain",
        "api_key": "YOUR_API_KEY"
      }
    }
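
Before launching a run it can help to confirm that the configuration has the sections shown in the example. The sketch below checks for exactly those keys. Treating them all as mandatory is an assumption drawn from the example, not a documented TuriX requirement, and `check_config` is a hypothetical helper.

```python
import json

# Sections and keys taken from the example config above.
LLM_SECTIONS = ("brain_llm", "actor_llm", "memory_llm")
LLM_KEYS = ("provider", "model_name", "api_key")


def check_config(raw: str) -> list[str]:
    """Return the problems found in a config.json string; an empty list means none."""
    config = json.loads(raw)
    problems = []
    if "task" not in config.get("agent", {}):
        problems.append("agent.task is missing")
    for section in LLM_SECTIONS:
        for key in LLM_KEYS:
            if key not in config.get(section, {}):
                problems.append(f"{section}.{key} is missing")
    return problems


example = json.dumps({
    "agent": {"task": "Open Safari and go to github.com"},
    "brain_llm": {"provider": "turix", "model_name": "turix-brain", "api_key": "YOUR_API_KEY"},
    "actor_llm": {"provider": "turix", "model_name": "turix-actor", "api_key": "YOUR_API_KEY"},
    "memory_llm": {"provider": "turix", "model_name": "turix-brain"},  # api_key left out on purpose
})
print(check_config(example))  # ['memory_llm.api_key is missing']
```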

    8.5 Learning Resources

  • GitHub: https://github.com/TurixAI/TuriX-CUA
  • AtomGit: https://atomgit.com/TurixAI/TuriX-CUA
  • Discord: https://discord.gg/yaYrNAckb5
  • Technical report: https://turix.ai/technical-report/

    © 2024-2026 TuriX AI | TuriX-CUA Study Notes

    *Created with ❤️ for AI enthusiasts*