# paperlib

A local-first paper library engine with a CLI for managing academic papers.

**paperlib** is designed to import PDF papers into a structured local library, convert PDFs into Markdown using external converters, maintain stable per-paper metadata files, and provide a searchable index database. It offers optional AI-based structured summaries while remaining useful even without AI features.

## Key Features

- **Local-first**: All data lives locally in the paper library directory
- **CLI-first**: All important workflows accessible from the command line
- **JSON source of truth**: Per-paper metadata files with rebuildable SQLite index
- **AI-optional**: Core workflows work without LLM configuration
- **Machine-readable**: `--json` output for automation and integration
- **Stable interfaces**: Designed for scripts and higher-level tools

## Installation

### System Dependencies

PDF conversion relies on MinerU, which requires OpenGL support. If you are inside a graphical environment, you are likely fine.
On headless systems, install:

```bash
# Debian-based
sudo apt-get install libglvnd0

# Fedora
sudo dnf install libglvnd-glx

# Arch Linux
sudo pacman -S libglvnd

# Gentoo
sudo emerge -av media-libs/libglvnd
# or just add media-libs/libglvnd to your @world or some set
```

### Python Package

```bash
# Install with uv (recommended)
uv add paperlib

# Or with pip
pip install paperlib
```

## Quick Start

```bash
# Initialize a paper library
paperlib init

# Import a local PDF
paperlib import --pdf paper.pdf --title "My Research Paper"

# Import from arXiv
paperlib import --arxiv 2212.06340

# List all papers
paperlib list

# Show paper details
paperlib show

# Convert PDFs to Markdown (requires MinerU)
paperlib convert

# Search papers (planned, see Roadmap)
paperlib search "machine learning"

# Rebuild search index
paperlib reindex
```

## Core Commands

### Library Management

- `paperlib init [path]` - Initialize a paper library directory
- `paperlib status` - Show library configuration and layout
- `paperlib reindex` - Rebuild search index from stored papers

### Paper Import

- `paperlib import --pdf ` - Import a local PDF file
- `paperlib import --arxiv ` - Import paper from arXiv
- Options: `--title`, `--notes`, `--tags`, `--library`

### Paper Management

- `paperlib list` - List all imported papers with status
- `paperlib show ` - Show detailed paper information
- `paperlib convert` - Convert pending papers to Markdown using MinerU

### Search (Future)

- `paperlib search ` - Search papers by content and metadata

## Library Structure

A paperlib library is organized as follows:

```
library_root/
├── config/
│   ├── config.toml
│   └── prompts/
├── papers/
│   ├── arxiv/
│   │   └── 2026/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json      # Paper metadata
│   │           ├── source.pdf     # Original PDF
│   │           ├── paper.md       # Converted markdown
│   │           ├── summary.json   # AI summary (optional)
│   │           ├── summary.md     # Rendered summary
│   │           ├── assets/        # Images, figures
│   │           └── logs/          # Conversion logs
│   └── local/
│       └── /
│           └── ...
├── db/
│   └── paperlib.sqlite3           # Search index (rebuildable)
├── inbox/                         # Temporary imports
└── cache/                         # Processing cache
```

## Data Model

### Paper Metadata (`meta.json`)

Each paper has a `meta.json` file containing:

- Core identifiers: `paper_id`, `source_type`, `source_id`
- Bibliographic info: `title`, `authors`, `published_date`, `categories`
- File paths: `pdf_path`, `paper_md_path`, `summary_json_path`
- Processing status: `conversion_status`, `summary_status`
- User data: `tags`, `notes`

### Summary Data (`summary.json`)

Optional AI-generated summaries with:

- Structured fields: problem statement, method overview, results
- Categorization: problem tags, technique tags
- Relevance scoring and recommended sections

## PDF Conversion

paperlib integrates with [MinerU](https://github.com/opendatalab/MinerU) for high-quality PDF to Markdown conversion:

```bash
# Install MinerU (optional); the quotes keep shells like zsh from expanding the brackets
pip install "mineru[core]"

# Convert all pending papers
paperlib convert

# Retry failed conversions (useful after fixing system dependencies)
paperlib convert --retry-failed

# Force reconvert all papers
paperlib convert --force

# Convert specific paper
paperlib convert --paper-id
```

### Troubleshooting PDF Conversion

If conversion fails with OpenGL/display errors on headless systems:

```bash
# Check if MinerU is properly installed
uv run mineru --version

# If you get "libxcb.so.1" or similar errors, install OpenGL support:
sudo apt-get install libglvnd0    # Ubuntu/Debian
sudo pacman -S libglvnd           # Arch Linux
sudo dnf install libglvnd-glx     # Fedora

# Test conversion manually
mineru -p example.pdf -o /tmp/test_output -b pipeline

# Check paperlib conversion logs
cat path/to/library/papers/.../logs/mineru.log
```

## Machine-Readable Output

Most commands support `--json` output for automation:

```bash
paperlib list --json
paperlib show --json
paperlib status --json
```

## Development

paperlib is designed for extensibility and integration with higher-level tools.
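As one illustration of that kind of integration, the sketch below shows how a script might call a `--json` command and parse the result. The helper names `run_json` and `list_papers` are hypothetical and not part of paperlib, and the shape of the JSON is not specified in this README, so the code simply returns whatever the command prints.

```python
import json
import subprocess
from typing import Any, Sequence


def run_json(cmd: Sequence[str]) -> Any:
    """Run a command and parse its standard output as JSON.

    Raises subprocess.CalledProcessError if the command exits non-zero
    and json.JSONDecodeError if the output is not valid JSON.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)


def list_papers(executable: str = "paperlib") -> Any:
    """Return the parsed output of `paperlib list --json`.

    Assumption: the output schema is unknown here, so no particular
    fields are accessed; callers should inspect the result first.
    """
    return run_json([executable, "list", "--json"])
```

Calling `list_papers()` requires `paperlib` on the `PATH` and an initialized library. Keeping the subprocess call in one small helper lets the same wrapper be reused for `show --json` or `status --json`.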
### Running Tests

```bash
# Run all tests
uv run pytest

# Run specific test module
uv run pytest tests/test_models.py

# Run with coverage
uv run pytest --cov=paperlib
```

### Code Quality

```bash
# Format code
uv run ruff format

# Check linting
uv run ruff check

# Type checking
uv run mypy src/
```

## Architecture

paperlib follows clean architecture principles:

- **Models**: Data structures for papers and summaries
- **Storage**: File-based metadata and PDF management
- **Index**: SQLite search and retrieval layer
- **Importers**: PDF and arXiv import workflows
- **Converters**: PDF to Markdown transformation
- **CLI**: Command-line interface and argument parsing

## Roadmap

- [x] Core paper import (local PDF, arXiv)
- [x] PDF to Markdown conversion (MinerU integration)\*
- [x] Metadata management and search indexing
- [x] CLI with all basic commands
- [x] Comprehensive test suite
- [ ] Search command implementation
- [ ] AI summarization with provider abstraction
- [ ] JSON output for all commands
- [ ] Configuration file support
- [ ] Advanced arXiv workflows

**\*Note**: PDF conversion requires the `libglvnd` system dependency for OpenGL support on headless systems.

## Non-Goals

paperlib is intentionally focused and does NOT include:

- Web UI or GUI applications
- Multi-user or cloud-first features
- Mandatory daemon or background services
- Vector database requirements
- Fully autonomous research assistant behavior

## License

MIT License - see the LICENSE file for details.

## Contributing

Contributions are welcome! Please read the development guidelines in AGENTS.md and ensure all tests pass before submitting PRs.