PdfExtractor

A lightweight Elixir library for extracting text from PDF files using Python's pdfplumber. It supports single-page and multi-page extraction, optionally restricted to an area of each page.

Features

  • 🔍 Extract text from single or multiple PDF pages
  • 📍 Area-based extraction using bounding boxes
  • 🐍 Leverages Python's powerful pdfplumber library
  • 🚀 Simple and intuitive API
  • ✅ Comprehensive test coverage
  • 📚 Full documentation

Installation

Add pdf_extractor to your list of dependencies in mix.exs:

def deps do
  [
    {:pdf_extractor, "~> 0.1.0"}
  ]
end

Usage

Extract text from specific regions using bounding boxes [x0, y0, x1, y1]:

pages = [0, 1] # zero-based page indices
areas = %{
  0 => [0, 0, 300, 200],    # Top-left area of page 0
  1 => [200, 300, 600, 500] # Bottom-right area of page 1
}
PdfExtractor.PdfPlumber.extract_text("path/to/document.pdf", pages, areas)
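Because a malformed bounding box is easy to write by hand, it can help to check the areas map before passing it on. The following is a minimal sketch, not part of pdf_extractor: the module name BBoxCheck and both of its functions are hypothetical, and it assumes that a usable box has non-negative coordinates with x0 < x1 and y0 < y1. It does not check the box against the actual page size.

```elixir
defmodule BBoxCheck do
  # Hypothetical helper, not part of pdf_extractor.
  # Assumes a box is [x0, y0, x1, y1] with x0 < x1 and y0 < y1.

  # Returns true when the value is a four-number list describing a non-empty box.
  def valid_bbox?([x0, y0, x1, y1])
      when is_number(x0) and is_number(y0) and is_number(x1) and is_number(y1) do
    x0 >= 0 and y0 >= 0 and x0 < x1 and y0 < y1
  end

  # Anything else (wrong length, non-numeric entries, not a list) is rejected.
  def valid_bbox?(_other), do: false

  # Returns the sorted list of page indices whose bounding box fails the check.
  def invalid_areas(areas) when is_map(areas) do
    areas
    |> Enum.reject(fn {_page, bbox} -> valid_bbox?(bbox) end)
    |> Enum.map(fn {page, _bbox} -> page end)
    |> Enum.sort()
  end
end

areas = %{
  0 => [0, 0, 300, 200],
  1 => [200, 300, 600, 500]
}

# An empty list means every box passed the check.
IO.inspect(BBoxCheck.invalid_areas(areas))
```

Returning the offending page indices, rather than a single boolean, makes it easier to report which entry of the map needs fixing.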

Return Format

The function returns a map whose keys are zero-based page numbers and whose values are the text extracted from each page. For example, extracting three pages returns:

%{
  0 => "Text from page 0...",
  1 => "Text from page 1...",
  2 => "Text from page 2..."
}
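A map of this shape can be turned into a single string by sorting on the page number first, since Elixir maps do not guarantee key order in general. The sketch below assumes the return value is a plain map exactly as shown above; the module PageText and its join_pages function are hypothetical names used only for illustration.

```elixir
defmodule PageText do
  # Hypothetical helper, not part of pdf_extractor.
  # Assumes the result is a map of page number => extracted text.

  # Joins the pages in ascending page order, with a header line before each page.
  def join_pages(result) when is_map(result) do
    result
    |> Enum.sort_by(fn {page, _text} -> page end)
    |> Enum.map_join("\n\n", fn {page, text} ->
      "--- page #{page} ---\n#{text}"
    end)
  end
end

result = %{
  1 => "Text from page 1...",
  0 => "Text from page 0..."
}

IO.puts(PageText.join_pages(result))
```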

Documentation

Full documentation is available at https://hexdocs.pm/pdf_extractor.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built on top of the excellent pdfplumber Python library
  • Uses pythonx for seamless Python integration