Safe, Open, High-Performance — PDF for AI
OpenDataLoader PDF converts PDFs into JSON, Markdown, or HTML, ready to feed into modern AI stacks (LLMs, vector search, and RAG).
It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.
Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.
AI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.
Overview
Integration details
Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| OpenDataLoaderPDFLoader | ✅ | ❌ |
The OpenDataLoaderPDFLoader component enables you to parse PDFs into structured Document objects.
Requirements
- Python >= 3.9
- Java 11 or newer available on the system PATH
- opendataloader-pdf >= 1.1.1
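
The first two requirements can be checked before installing anything. The following is a minimal sketch using only the Python standard library; `check_requirements` is a hypothetical helper name, not part of the package. It only confirms that a `java` executable is on the PATH and does not verify that the version is 11 or newer.

```python
import shutil
import sys


def check_requirements(min_python=(3, 9)):
    """Return a list of problems found; an empty list means the basics look fine."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted} or newer is required")
    # shutil.which returns None when no executable with this name is on PATH.
    if shutil.which("java") is None:
        problems.append("No java executable found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("Requirement not met:", problem)
```

If the list comes back empty, run `java -version` by hand to confirm the Java version.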
Installation
pip install -U langchain-opendataloader-pdf
Quick start
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["path/to/document.pdf", "path/to/folder"],
    format="text"
)
documents = loader.load()
for doc in documents:
    print(doc.metadata, doc.page_content[:80])
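
Because the loader is listed as supporting lazy loading, large document sets can be consumed in fixed-size batches instead of being loaded all at once. The sketch below shows one way to do that; `in_batches` is a hypothetical helper, and the commented usage assumes the loader exposes LangChain's standard `lazy_load()` iterator.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Usage with the loader from the quick start (needs the package and Java):
# for batch in in_batches(loader.lazy_load(), 16):
#     handle(batch)  # `handle` is a placeholder for your own indexing step
```

Batching this way keeps at most `size` documents in memory at a time, which matters when `file_path` points at a large folder.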
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file_path | List[str] | ✅ Yes | — | One or more PDF file paths or directories to process. |
| format | str | No | None | Output formats (e.g. "json", "html", "markdown", "text"). |
| quiet | bool | No | False | Suppresses CLI logging output when True. |
| content_safety_off | Optional[List[str]] | No | None | List of content safety filters to disable (e.g. "all", "hidden-text", "off-page", "tiny", "hidden-ocg"). |
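
To show how these parameters combine, here is a small sketch that assembles and sanity-checks the keyword arguments before constructing the loader. `build_loader_kwargs` and the two `KNOWN_*` sets are hypothetical; the sets are copied from the examples in the table above and may not be exhaustive, so treat the validation as an assumption and not as the library's own rules.

```python
from typing import Any, Dict, List, Optional

# Copied from the examples in the parameter table; the real lists may be longer.
KNOWN_FORMATS = {"json", "html", "markdown", "text"}
KNOWN_SAFETY_FILTERS = {"all", "hidden-text", "off-page", "tiny", "hidden-ocg"}


def build_loader_kwargs(
    file_path: List[str],
    format: Optional[str] = None,
    quiet: bool = False,
    content_safety_off: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Validate arguments against the documented examples and return them as a dict."""
    if not file_path:
        raise ValueError("file_path must contain at least one path")
    if format is not None and format not in KNOWN_FORMATS:
        raise ValueError(f"unrecognized format: {format!r}")
    unknown = set(content_safety_off or []) - KNOWN_SAFETY_FILTERS
    if unknown:
        raise ValueError(f"unrecognized safety filters: {sorted(unknown)}")
    kwargs: Dict[str, Any] = {"file_path": list(file_path), "quiet": quiet}
    # Only pass optional arguments that were set, so library defaults still apply.
    if format is not None:
        kwargs["format"] = format
    if content_safety_off:
        kwargs["content_safety_off"] = list(content_safety_off)
    return kwargs


kwargs = build_loader_kwargs(
    ["path/to/document.pdf"],
    format="markdown",
    quiet=True,
    content_safety_off=["tiny"],
)
# With the package installed:
# from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
# loader = OpenDataLoaderPDFLoader(**kwargs)
```

Since content safety filtering is on by default, leaving `content_safety_off` unset keeps every filter active; list only the filters you intend to disable.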
Additional Resources