The Self-Operating Computer Framework lets a multimodal model operate a computer autonomously: the model views the screen, then issues mouse and keyboard actions to reach a stated objective. The framework is designed to work with a range of multimodal models and currently integrates with GPT-4o, o1, Gemini Pro Vision, Claude 3, and LLaVa. It was the first known project to give a multimodal model the ability to view and control a computer screen. Optical Character Recognition (OCR) and Set-of-Mark (SoM) prompting are supported to improve visual grounding. The framework runs on macOS, Windows, and Linux (with an X server installed) and is released under the MIT license.
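
The view-then-act cycle described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the framework's actual code: the `Action` schema, `run_objective`, and the three injected callables (`capture_screen`, `ask_model`, `perform`) are hypothetical names chosen for this example, and the model is replaced by a scripted stand-in so the sketch runs without any API key or GUI access.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema, for illustration)."""
    kind: str        # "click", "type", or "done"
    x: int = 0       # click target in pixels
    y: int = 0
    text: str = ""   # text to type


def run_objective(
    objective: str,
    capture_screen: Callable[[], bytes],
    ask_model: Callable[[str, bytes, List[Action]], Action],
    perform: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: capture the screen, ask the model for an action, execute it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()
        action = ask_model(objective, screenshot, history)
        history.append(action)
        if action.kind == "done":
            break  # the model reports that the objective is reached
        perform(action)
    return history


def scripted_model(objective: str, screenshot: bytes, history: List[Action]) -> Action:
    """Stand-in for a real multimodal model: replays a fixed script."""
    script = [
        Action("click", x=120, y=48),
        Action("type", text=objective),
        Action("done"),
    ]
    return script[min(len(history), len(script) - 1)]


performed: List[Action] = []
history = run_objective("open settings", lambda: b"", scripted_model, performed.append)
print([a.kind for a in history])
```

Passing the screen capture, the model call, and the input driver as callables keeps the loop independent of any particular model or operating system, and `max_steps` bounds the run if the model never reports completion.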

Features

  • Autonomous Computer Control: Lets a multimodal model operate a computer by interpreting the screen and issuing mouse and keyboard actions to complete a given task.
  • Multimodal Model Compatibility: Integrates with GPT-4o, o1, Gemini Pro Vision, Claude 3, and LLaVa.
  • Optical Character Recognition (OCR): Extracts text from the screen to supplement the model's visual processing.
  • Set-of-Mark (SoM) Prompting: Uses SoM prompting to improve visual grounding during interactions.
  • Cross-Platform Support: Runs on macOS, Windows, and Linux (with an X server installed).
  • Open Source and Flexible Licensing: Released under the MIT license, encouraging community contributions and customizable use cases.
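
To illustrate why OCR and SoM help with visual grounding: instead of guessing raw pixel coordinates, the model can name a piece of on-screen text or a numbered mark, and that reference is then resolved to a click position. The sketch below assumes OCR results arrive as (text, bounding box) pairs and SoM marks as a mapping from mark number to bounding box. `Box`, `resolve_text_target`, and `resolve_mark_target` are hypothetical names for this example and are not taken from the framework.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screen pixels (hypothetical structure)."""
    left: int
    top: int
    width: int
    height: int

    def center(self) -> Tuple[int, int]:
        return (self.left + self.width // 2, self.top + self.height // 2)


def resolve_text_target(
    ocr_results: List[Tuple[str, Box]], wanted: str
) -> Optional[Tuple[int, int]]:
    """OCR grounding: return the center of the first box whose text contains `wanted`."""
    needle = wanted.strip().lower()
    for text, box in ocr_results:
        if needle in text.strip().lower():
            return box.center()
    return None  # text not found on screen


def resolve_mark_target(
    marks: Dict[int, Box], mark_id: int
) -> Optional[Tuple[int, int]]:
    """SoM grounding: return the center of the box labeled with `mark_id`."""
    box = marks.get(mark_id)
    return box.center() if box is not None else None


# Illustrative data only; real values would come from an OCR engine or a mark overlay.
ocr_results = [("File", Box(10, 5, 40, 20)), ("Save As...", Box(100, 200, 80, 20))]
marks = {1: Box(10, 5, 40, 20), 2: Box(100, 200, 80, 20)}

print(resolve_text_target(ocr_results, "save as"))
print(resolve_mark_target(marks, 1))
```

Both helpers return `None` when the reference cannot be resolved, so a caller can ask the model again rather than click at an arbitrary position.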

License

MIT License

User Ratings

  • 5 stars: 1
  • 4 stars: 0
  • 3 stars: 0
  • 2 stars: 0
  • 1 star: 0

  • Ease: 5 / 5
  • Features: 5 / 5
  • Design: 5 / 5
  • Support: 5 / 5

User Reviews

  • Really awesome to use an AI agent and get it to operate your computer

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

Python

Related Categories

Python Intelligent Agents, Python Agentic AI Tool, Python AI Agent Frameworks, Python AI Agents

Registered

2025-01-27