Overview
Magika is an AI-powered file content type detection tool developed by Google's security research team. It uses a compact deep-learning model to identify more than 200 content types, with inference taking milliseconds per file on a single CPU. Trained on approximately 100 million samples, it reaches roughly 99% accuracy and is used at scale in Google products such as Gmail and Drive to route files to the appropriate security scanners and content processors.
Key Features
- Lightweight model of only a few megabytes; once loaded, inference takes milliseconds per file, which suits high-throughput batch processing.
- Multiple language integrations for broad flexibility: a Rust CLI, a Python API, JavaScript/TypeScript bindings, and an in-progress Go binding.
- Coverage of over 200 content types with per-type confidence thresholds and configurable modes (high-confidence, medium-confidence, best-guess).
- Easy installation via pip, pipx, or npm, plus a browser-based demo that requires no setup.
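
The interaction between per-type confidence thresholds and the three modes can be illustrated with a minimal sketch. This is not Magika's implementation: the names `Mode`, `PER_TYPE_THRESHOLD`, `DEFAULT_THRESHOLD`, `MEDIUM_THRESHOLD`, and `resolve_label`, as well as every numeric value, are hypothetical and chosen only to show one plausible way such a policy could work.

```python
from enum import Enum


class Mode(Enum):
    HIGH_CONFIDENCE = "high-confidence"
    MEDIUM_CONFIDENCE = "medium-confidence"
    BEST_GUESS = "best-guess"


# Hypothetical thresholds for illustration; real values would ship with the model.
PER_TYPE_THRESHOLD = {"python": 0.90, "javascript": 0.95}
DEFAULT_THRESHOLD = 0.80  # assumed fallback for types without their own entry
MEDIUM_THRESHOLD = 0.50   # assumed single threshold used in medium-confidence mode
FALLBACK_LABEL = "unknown"


def resolve_label(label: str, score: float, mode: Mode) -> str:
    """Return the model's label, or a generic fallback if the score is too low."""
    if mode is Mode.BEST_GUESS:
        # Always trust the top prediction, whatever its score.
        return label
    if mode is Mode.MEDIUM_CONFIDENCE:
        threshold = MEDIUM_THRESHOLD
    else:
        # High-confidence mode applies the stricter per-type threshold.
        threshold = PER_TYPE_THRESHOLD.get(label, DEFAULT_THRESHOLD)
    return label if score >= threshold else FALLBACK_LABEL
```

Under these assumed values, a `javascript` prediction with score 0.92 would be accepted in medium-confidence and best-guess modes but replaced by the fallback label in high-confidence mode, which shows why stricter modes trade coverage for precision.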
Use Cases
- Security and content inspection pipelines that need to route uploaded or transferred files to appropriate scanners and policy engines.
- Large-scale offline processing of logs, mail archives, and storage systems, where fast pre-classification lets files be dispatched efficiently to downstream processors.
- CI/CD and forensic automation workflows that require reliable file-type identification and analytics as part of build or investigation pipelines.
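
The routing pattern behind these use cases can be sketched as a dispatch table keyed by detected content type. Everything here is an assumption made for illustration: `detect_label` is a toy stand-in based on magic bytes (a real pipeline would call a detector such as Magika instead), and the scanner functions and `ROUTES` table are hypothetical.

```python
from typing import Callable, Dict

Scanner = Callable[[bytes], str]


def scan_pdf(data: bytes) -> str:
    return "pdf-scanner"


def scan_script(data: bytes) -> str:
    return "script-scanner"


def scan_generic(data: bytes) -> str:
    return "generic-scanner"


# Hypothetical mapping from content-type label to the scanner that handles it.
ROUTES: Dict[str, Scanner] = {
    "pdf": scan_pdf,
    "javascript": scan_script,
    "python": scan_script,
}


def detect_label(data: bytes) -> str:
    """Toy stand-in for a content-type detector; not Magika."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"#!/usr/bin/env python"):
        return "python"
    return "unknown"


def route(data: bytes, detect: Callable[[bytes], str] = detect_label) -> str:
    """Detect the content type, then hand the bytes to the matching scanner."""
    label = detect(data)
    scanner = ROUTES.get(label, scan_generic)  # unknown types get the generic scanner
    return scanner(data)
```

Passing the detector as a parameter keeps the routing logic independent of any one detection library, so the stand-in can be swapped for a real detector without touching the dispatch table.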
Technical Details
- Custom lightweight deep-learning model with per-type confidence thresholding that achieves approximately 99% accuracy on benchmark test sets while maintaining low latency and minimal resource consumption.
- Optimized batch inference and limited input sampling, in which only a bounded portion of each file is read, keep classification time nearly independent of file size.
- Designed for scalable CPU-based deployment without GPU requirements, making it practical for server-side and edge environments.
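
To show why limited input sampling makes classification cost nearly independent of file size, here is a minimal sketch that reads only a fixed-size window from the start and end of a file. The function name `sample_bytes`, the 512-byte `block_size`, and the choice of head and tail windows are assumptions for illustration; the offsets and sizes Magika actually reads are not stated here.

```python
import os


def sample_bytes(path: str, block_size: int = 512) -> bytes:
    """Return at most 2 * block_size bytes: the head and the tail of the file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(block_size)
        if size <= 2 * block_size:
            # Small file: the two windows would overlap, so return everything.
            return head + f.read()
        # Large file: jump straight to the final block without reading the middle.
        f.seek(size - block_size)
        tail = f.read(block_size)
    return head + tail
```

Because the amount of data read is capped at `2 * block_size`, a multi-gigabyte archive and a small script present the same fixed-size input to the downstream model.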