Midscene.js | AI Native Landscape

Midscene.js is a cross-platform UI automation framework driven by vision-language models that uses screenshots as the primary means of element localization and interaction. It enables developers to describe automation goals and steps in natural language or lightweight scripts, reducing reliance on fragile DOM selectors. The project provides a JavaScript SDK, YAML scripting, integrations with Puppeteer and Playwright, a Bridge Mode for desktop browsers, and zero-code Chrome extension and mobile playgrounds for rapid prototyping.

Key Features

Vision-language model element localization that replaces brittle CSS and XPath selectors with visual understanding, making automation more resilient to UI changes
Unified multi-platform support covering Web, Android, and iOS through a single JavaScript SDK and consistent scripting format
Built-in replay and visual debugging tools for reproducing, inspecting, and troubleshooting each step of an automation flow
Caching for efficient replays, plus Model Context Protocol (MCP) integration for higher-level orchestration by AI agents
Zero-code Chrome extension and mobile playgrounds for rapid prototyping without writing scripts
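
To make the caching idea concrete, here is a minimal sketch of how a replay cache could avoid repeated model calls. It is an illustration only and not Midscene.js's implementation: `Plan`, `Planner`, and `PlanCache` are hypothetical names, and the planner function stands in for a vision-language model call.

```typescript
// Hypothetical types for illustration; none of these are Midscene.js APIs.
type Plan = { action: "tap" | "input"; x: number; y: number; text?: string };
type Planner = (prompt: string) => Plan; // stands in for a model call

class PlanCache {
  private store = new Map<string, Plan>();
  private planner: Planner;
  hits = 0;
  misses = 0;

  constructor(planner: Planner) {
    this.planner = planner;
  }

  // Return the cached plan for a prompt, calling the planner only on a miss.
  resolve(prompt: string): Plan {
    // Normalise whitespace and case so trivially different prompts share a key.
    const key = prompt.trim().toLowerCase();
    const cached = this.store.get(key);
    if (cached !== undefined) {
      this.hits += 1;
      return cached;
    }
    this.misses += 1;
    const plan = this.planner(prompt);
    this.store.set(key, plan);
    return plan;
  }
}
```

On a second run with the same instruction, `resolve` returns the stored plan without invoking the planner. A real cache would also need a way to detect that the screen no longer matches a stored plan and fall back to the model.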

Use Cases

End-to-end UI testing where visual understanding reduces the maintenance burden of selector-based test suites
Automated operational tasks such as form filling, demo flows, and cross-platform RPA scenarios
Natural language-driven automation where teams express complex interactions through plain text or concise scripts
AI agent orchestration where visual understanding lets agents interact with applications that expose no API
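
As a rough sketch of what a natural-language-driven flow can look like, the example below expresses a search scenario as three plain-text steps. The `UiAgent` interface and its `act`, `check`, and `query` methods are placeholders invented for this illustration, not the SDK's actual method names; `RecordingAgent` is a stub that lets the flow run without a model or a browser.

```typescript
// Hypothetical interface for illustration; consult the Midscene.js
// documentation for the SDK's real agent methods and signatures.
interface UiAgent {
  act(instruction: string): Promise<void>;
  check(assertion: string): Promise<void>;
  query<T>(description: string): Promise<T>;
}

// A flow written purely as natural-language steps, with no selectors.
async function searchFlow(agent: UiAgent, term: string): Promise<string[]> {
  await agent.act(`type "${term}" in the search box and press Enter`);
  await agent.check("a list of results is visible");
  return agent.query<string[]>("string[], the titles of the listed results");
}

// Stub agent: records each step and returns canned data instead of
// calling a vision-language model.
class RecordingAgent implements UiAgent {
  log: string[] = [];

  async act(instruction: string): Promise<void> {
    this.log.push(`act: ${instruction}`);
  }

  async check(assertion: string): Promise<void> {
    this.log.push(`check: ${assertion}`);
  }

  async query<T>(description: string): Promise<T> {
    this.log.push(`query: ${description}`);
    return ["Result A", "Result B"] as unknown as T;
  }
}
```

Keeping the flow behind an interface means the same steps could be pointed at a real model-backed agent or at a stub in unit tests.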

Technical Highlights

Prioritizes a pure-vision approach with DOM mode available as an option for data extraction tasks
Supports multiple vision-language models, including Qwen-VL and UI-TARS, so teams can trade token cost against cross-platform robustness
Designed for self-hosting with an open SDK ecosystem for local or cloud deployment
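
In a pure-vision approach, an element is located as a region of the screenshot rather than as a DOM node, so the region has to be turned into a point to click. The sketch below shows that arithmetic under two assumptions that are not taken from the source: the model returns a bounding box in screenshot pixels, and the screenshot is captured at the device pixel ratio while input events use CSS pixels. `Box` and `clickPoint` are hypothetical names.

```typescript
// Bounding box in screenshot (device) pixels; hypothetical shape.
type Box = { left: number; top: number; width: number; height: number };

// Convert a located box into the CSS-pixel coordinates of its centre.
function clickPoint(box: Box, devicePixelRatio: number): { x: number; y: number } {
  if (devicePixelRatio <= 0) {
    throw new RangeError("devicePixelRatio must be positive");
  }
  const centreX = box.left + box.width / 2;
  const centreY = box.top + box.height / 2;
  return {
    x: Math.round(centreX / devicePixelRatio),
    y: Math.round(centreY / devicePixelRatio),
  };
}

// A 120x40 button found at (200, 100) on a 2x display.
const target = clickPoint({ left: 200, top: 100, width: 120, height: 40 }, 2);
console.log(target); // { x: 130, y: 60 }
```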