Magika

Tracked

Magika is an AI-powered, fast file content type detection tool released by Google's security research team.

Author Google Open Sourced 2023-08-22 Last Commit Unknown

Overview

Magika is an AI-powered file content type detection tool developed by Google's security research team that uses a compact deep-learning model to identify over 200 file types with millisecond latency on a single CPU. Trained on approximately 100 million samples, it delivers near-99% accuracy and is already used at scale across Google products like Gmail and Drive for routing files to the correct security scanners and content processors.

Key Features

  • Lightweight model requiring only a few megabytes, achieving millisecond inference per file after loading for high-throughput batch processing.
  • Multi-language bindings including a Rust CLI, Python API, JavaScript/TypeScript bindings, and an in-progress Go binding for broad integration flexibility.
  • Coverage of over 200 content types with per-type confidence thresholds and configurable modes (high-confidence, medium-confidence, best-guess).
  • Easy installation via pip, pipx, or NPM, plus a browser-based demo that requires no setup.

Use Cases

  • Security and content inspection pipelines that need to route uploaded or transferred files to appropriate scanners and policy engines.
  • Large-scale offline processing of logs, mail archives, and storage systems where fast file pre-classification enables efficient downstream distribution.
  • CI/CD and forensic automation workflows that require reliable file-type extraction and analytics as part of build or investigation pipelines.

Technical Details

  • Custom lightweight deep-learning model with per-type confidence thresholding that achieves approximately 99% accuracy on benchmark test sets while maintaining low latency and minimal resource consumption.
  • Optimized batch inference and limited input sampling techniques ensure classification speed is nearly independent of file size.
  • Designed for scalable CPU-based deployment without GPU requirements, making it practical for server-side and edge environments.