paperlib
A local-first paper library engine with a CLI for managing academic papers.
paperlib is designed to import PDF papers into a structured local library, convert PDFs into Markdown using external converters, maintain stable per-paper metadata files, and provide a searchable index database. It offers optional AI-based structured summaries while remaining useful even without AI features.
Key Features
- Local-first: All data lives locally in the paper library directory
- CLI-first: All important workflows accessible from the command line
- JSON source of truth: Per-paper metadata files with rebuildable SQLite index
- AI-optional: Core workflows work without LLM configuration
- Machine-readable: --json output for automation and integration
- Stable interfaces: Designed for scripts and higher-level tools
Installation
System Dependencies
For PDF conversion, paperlib relies on MinerU, which requires OpenGL support. In a graphical environment, the required libraries are most likely already installed. On headless systems, install:
# Debian-based
sudo apt-get install libglvnd0
# Fedora
sudo dnf install libglvnd-glx
# Arch Linux
sudo pacman -S libglvnd
# Gentoo
sudo emerge -av media-libs/libglvnd
# or just add media-libs/libglvnd to your @world or some set
Python Package
# Install with uv (recommended)
uv add paperlib
# Or with pip
pip install paperlib
Quick Start
# Initialize a paper library
paperlib init
# Import a local PDF
paperlib import --pdf paper.pdf --title "My Research Paper"
# Import from arXiv
paperlib import --arxiv 2212.06340
# List all papers
paperlib list
# Show paper details
paperlib show <paper-id>
# Convert PDFs to Markdown (requires MinerU)
paperlib convert
# Search papers
paperlib search "machine learning"
# Rebuild search index
paperlib reindex
Core Commands
Library Management
- paperlib init [path] - Initialize a paper library directory
- paperlib status - Show library configuration and layout
- paperlib reindex - Rebuild search index from stored papers
Paper Import
- paperlib import --pdf <path> - Import a local PDF file
- paperlib import --arxiv <id> - Import paper from arXiv
- Options: --title, --notes, --tags, --library
Paper Management
- paperlib list - List all imported papers with status
- paperlib show <paper-id> - Show detailed paper information
- paperlib convert - Convert pending papers to Markdown using MinerU
Search (Future)
- paperlib search <query> - Search papers by content and metadata
Library Structure
A paperlib library is organized as follows:
library_root/
├── config/
│   ├── config.toml
│   └── prompts/
├── papers/
│   ├── arxiv/
│   │   └── 2026/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json      # Paper metadata
│   │           ├── source.pdf     # Original PDF
│   │           ├── paper.md       # Converted markdown
│   │           ├── summary.json   # AI summary (optional)
│   │           ├── summary.md     # Rendered summary
│   │           ├── assets/        # Images, figures
│   │           └── logs/          # Conversion logs
│   └── local/
│       └── <hash>/
│           └── ...
├── db/
│   └── paperlib.sqlite3           # Search index (rebuildable)
├── inbox/                         # Temporary imports
└── cache/                         # Processing cache
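Because the per-paper meta.json files are the source of truth, the index can in principle be regenerated from the directory tree alone. The following is a minimal sketch of that idea, not paperlib's actual reindex code: the function names, the `papers` table, and its columns are assumptions made for illustration only.

```python
import json
import sqlite3
from pathlib import Path


def find_meta_files(library_root: Path) -> list[Path]:
    """Return every meta.json below <library_root>/papers, sorted for stable output."""
    return sorted((library_root / "papers").rglob("meta.json"))


def rebuild_index(library_root: Path, db_path: Path) -> int:
    """Recreate a toy index from the per-paper meta.json files.

    The table name and columns are hypothetical, not paperlib's real schema.
    Returns the number of papers indexed.
    """
    rows = []
    for meta_file in find_meta_files(library_root):
        meta = json.loads(meta_file.read_text(encoding="utf-8"))
        rows.append((meta["paper_id"], meta.get("title", ""), str(meta_file)))

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            # Dropping the table first makes the rebuild repeatable.
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, meta_path TEXT)"
            )
            conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", rows)
    finally:
        conn.close()
    return len(rows)
```

Dropping and recreating the table keeps the sketch idempotent: running it twice against the same library yields the same index, which is the property that makes the database safe to delete.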
Data Model
Paper Metadata (meta.json)
Each paper has a meta.json file containing:
- Core identifiers: paper_id, source_type, source_id
- Bibliographic info: title, authors, published_date, categories
- File paths: pdf_path, paper_md_path, summary_json_path
- Processing status: conversion_status, summary_status
- User data: tags, notes
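For scripts that read meta.json directly, the field groups above could be mirrored in a small typed record. This is only a sketch: the field names come from the list above, but the types, the default values (including the "pending" status string), and the names `PaperMeta` and `parse_meta` are assumptions rather than paperlib's own model.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class PaperMeta:
    # Core identifiers
    paper_id: str
    source_type: str
    source_id: str
    # Bibliographic info (types are assumed)
    title: str = ""
    authors: list[str] = field(default_factory=list)
    published_date: Optional[str] = None
    categories: list[str] = field(default_factory=list)
    # File paths
    pdf_path: Optional[str] = None
    paper_md_path: Optional[str] = None
    summary_json_path: Optional[str] = None
    # Processing status (the "pending" default is an assumption)
    conversion_status: str = "pending"
    summary_status: str = "pending"
    # User data
    tags: list[str] = field(default_factory=list)
    notes: str = ""


def parse_meta(text: str) -> PaperMeta:
    """Build a PaperMeta from the JSON text of a meta.json file.

    Keys that are not declared on PaperMeta are ignored, so the sketch
    tolerates files that carry additional fields.
    """
    raw = json.loads(text)
    known = {f.name for f in fields(PaperMeta)}
    return PaperMeta(**{k: v for k, v in raw.items() if k in known})
```

Ignoring unknown keys is a deliberate choice here: a reader script written this way keeps working if the metadata file gains fields later.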
Summary Data (summary.json)
Optional AI-generated summaries with:
- Structured fields: problem statement, method overview, results
- Categorization: problem tags, technique tags
- Relevance scoring and recommended sections
PDF Conversion
paperlib integrates with MinerU for high-quality PDF to Markdown conversion:
# Install MinerU (optional)
pip install "mineru[core]"
# Convert all pending papers
paperlib convert
# Retry failed conversions (useful after fixing system dependencies)
paperlib convert --retry-failed
# Force reconvert all papers
paperlib convert --force
# Convert specific paper
paperlib convert --paper-id <paper-id>
Troubleshooting PDF Conversion
If conversion fails with OpenGL/display errors on headless systems:
# Check if MinerU is properly installed
uv run mineru --version
# If you get "libxcb.so.1" or similar errors, install OpenGL support:
sudo apt-get install libglvnd0 # Ubuntu/Debian
sudo pacman -S libglvnd # Arch Linux
sudo dnf install libglvnd-glx # Fedora
# Test conversion manually
mineru -p example.pdf -o /tmp/test_output -b pipeline
# Check paperlib conversion logs
cat path/to/library/papers/.../logs/mineru.log
Machine-Readable Output
Most commands support --json output for automation and integration:
# Get library configuration in JSON
paperlib status --json
# List all papers with metadata
paperlib list --json
# Get detailed paper information
paperlib show <paper-id> --json
# Get import results
paperlib import --arxiv 2212.06340 --json
# Get conversion status and results
paperlib convert --json
paperlib convert --paper-id <paper-id> --json
# Get reindexing statistics
paperlib reindex --json
JSON Output Format
All JSON responses follow a consistent envelope format:
{
"success": true,
"timestamp": "2024-01-15T10:30:00.000Z",
"data": { /* command-specific data */ }
}
For errors:
{
"success": false,
"timestamp": "2024-01-15T10:30:00.000Z",
"error": "Error message here",
"error_code": 1
}
This structured output enables reliable automation, scripting, and integration with other tools. The JSON format is stable across paperlib versions.
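A consumer of this envelope might unwrap it along the following lines. This is a hedged sketch based only on the two envelope shapes shown above; `PaperlibError` and `unwrap_envelope` are hypothetical names, not part of paperlib.

```python
import json
from typing import Any


class PaperlibError(RuntimeError):
    """Hypothetical exception carrying the envelope's error_code."""

    def __init__(self, message: str, code: int) -> None:
        super().__init__(message)
        self.code = code


def unwrap_envelope(text: str) -> Any:
    """Return the "data" payload of a success envelope.

    Raises PaperlibError when "success" is false, using the "error"
    and "error_code" fields of the error envelope.
    """
    envelope = json.loads(text)
    if envelope.get("success"):
        return envelope.get("data")
    raise PaperlibError(
        envelope.get("error", "unknown error"),
        envelope.get("error_code", 1),
    )
```

The input would typically be the standard output of a command such as `paperlib list --json`, captured for example with `subprocess.run(..., capture_output=True, text=True)`.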
Development
paperlib is designed for extensibility and integration with higher-level tools.
Running Tests
# Run all tests
uv run pytest
# Run specific test module
uv run pytest tests/test_models.py
# Run with coverage
uv run pytest --cov=paperlib
Code Quality
# Format code
uv run ruff format
# Check linting
uv run ruff check
# Type checking
uv run mypy src/
Architecture
paperlib follows clean architecture principles:
- Models: Data structures for papers and summaries
- Storage: File-based metadata and PDF management
- Index: SQLite search and retrieval layer
- Importers: PDF and arXiv import workflows
- Converters: PDF to Markdown transformation
- CLI: Command-line interface and argument parsing
Roadmap
- Core paper import (local PDF, arXiv)
- PDF to Markdown conversion (MinerU integration)*
- Metadata management and search indexing
- CLI with all basic commands
- Comprehensive test suite
- Search command implementation
- AI summarization with provider abstraction
- JSON output for core commands
- Configuration file support
- Advanced arXiv workflows
Note: PDF conversion requires the libglvnd system dependency for OpenGL support on headless systems.
Non-Goals
paperlib is intentionally focused and does NOT include:
- Web UI or GUI applications
- Multi-user or cloud-first features
- Mandatory daemon or background services
- Vector database requirements
- Fully autonomous research assistant behavior
License
MIT License - see LICENSE file for details.
Contributing
Contributions welcome! Please read the development guidelines in AGENTS.md and ensure all tests pass before submitting PRs.