pdfly

A command-line tool to extract (meta)data from PDFs and manipulate PDF files at scale.

Author: py-pdf | Open Sourced: 2022-04-09 | Last Commit: Unknown

pdfly is a lightweight CLI tool from the py-pdf project for extracting metadata from PDFs and performing common PDF manipulations at scale. Its commands are scriptable, so PDF operations can run in automation pipelines, CI jobs, and batch-processing workflows without custom parsing code.

Extraction Capabilities

  • Extracts document metadata, text content, and other structured information from one or many PDFs
  • Batch-oriented CLI designed for scripting and unattended execution in CI/CD or data pipelines
  • Configurable output formats and processing options that adapt to archival, indexing, or analysis needs

Integration & Automation

  • Fits into CI/CD pipelines for automated document processing and validation
  • Serves post-OCR cleanup workflows where PDFs need inspection before downstream analysis
  • Automates data extraction across thousands of PDFs in large document archives
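As a rough illustration of the batch use described above, the sketch below walks a directory tree and runs one command per PDF, collecting exit codes so that a CI job can fail when any file cannot be processed. The helper names `run_on_pdfs` and `failed` are hypothetical and not part of pdfly. The command is passed in as a list, so no pdfly subcommand or flag is assumed by the code itself.

```python
import subprocess
from pathlib import Path
from typing import Dict, List, Sequence


def run_on_pdfs(
    root: Path, command: Sequence[str]
) -> Dict[str, "subprocess.CompletedProcess[str]"]:
    """Run `command <pdf>` once for every *.pdf file under `root`.

    Returns a dict mapping each PDF path to its completed process, so
    callers can inspect stdout, stderr, and the exit code per file.
    """
    results: Dict[str, "subprocess.CompletedProcess[str]"] = {}
    # rglob matches "*.pdf" recursively; on case-sensitive filesystems
    # this will not pick up files named "*.PDF".
    for pdf in sorted(root.rglob("*.pdf")):
        results[str(pdf)] = subprocess.run(
            [*command, str(pdf)],
            capture_output=True,
            text=True,
            check=False,  # keep going; failures are reported by failed()
        )
    return results


def failed(results: Dict[str, "subprocess.CompletedProcess[str]"]) -> List[str]:
    """Return the paths whose command exited with a non-zero status."""
    return [path for path, proc in results.items() if proc.returncode != 0]
```

A call such as `run_on_pdfs(Path("archive"), ["pdfly", "meta"])` would then invoke the tool once per file. This assumes that pdfly is on the `PATH` and that `meta` is the metadata subcommand in the installed version, which `pdfly --help` can confirm. A CI step could exit non-zero whenever `failed(results)` is non-empty.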

Technical Design

  • Built in Python on top of pypdf, the py-pdf project's PDF parsing library
  • Exposes both a CLI and programmatic APIs for flexibility
  • BSD-3-Clause licensed with documentation hosted on Read the Docs