pdfly is a lightweight CLI tool for extracting metadata from PDFs and performing common PDF manipulations at scale. It provides scriptable commands that fit naturally into automation pipelines, CI jobs, and batch-processing workflows, making it straightforward to integrate PDF operations without writing custom parsing code.
Extraction Capabilities
- Fast extraction of document metadata, text content, and structured information from single or multiple PDFs
- Batch-oriented CLI designed for scripting and unattended execution in CI/CD or data pipelines
- Configurable output formats and processing options that adapt to archival, indexing, or analysis needs
Integration & Automation
- Fits into CI/CD pipelines for automated document processing and validation
- Serves post-OCR cleanup workflows where PDFs need inspection before downstream analysis
- Automates data extraction across thousands of PDFs in large document archives
Technical Design
- Built in Python on top of proven PDF parsing libraries
- Exposes both a CLI and programmatic APIs for flexibility
- BSD-3-Clause licensed with documentation hosted on Read the Docs