paperlib

A local-first paper library engine with a CLI for managing academic papers.

paperlib imports PDF papers into a structured local library, converts them to Markdown using external converters, maintains stable per-paper metadata files, and provides a searchable index database. AI-based structured summaries are optional; the core workflows remain useful without them.

Key Features

  • Local-first: All data lives locally in the paper library directory
  • CLI-first: All important workflows accessible from the command line
  • JSON source of truth: Per-paper metadata files with a rebuildable SQLite index
  • AI-optional: Core workflows work without LLM configuration
  • Machine-readable: --json output for automation and integration
  • Stable interfaces: Designed for scripts and higher-level tools

Installation

System Dependencies

For PDF conversion, paperlib relies on MinerU, which requires OpenGL support. In a graphical environment the required libraries are likely already present. On headless systems, install:

# Debian-based
sudo apt-get install libglvnd0

# Fedora
sudo dnf install libglvnd-glx

# Arch Linux
sudo pacman -S libglvnd

# Gentoo
sudo emerge -av media-libs/libglvnd
# or add media-libs/libglvnd to @world or to another set

Python Package

# Install with uv (recommended)
uv add paperlib

# Or with pip
pip install paperlib

Quick Start

# Initialize a paper library
paperlib init

# Import a local PDF
paperlib import --pdf paper.pdf --title "My Research Paper"

# Import from arXiv
paperlib import --arxiv 2212.06340

# List all papers
paperlib list

# Show paper details
paperlib show <paper-id>

# Convert PDFs to Markdown (requires MinerU)
paperlib convert

# Search papers (listed as a future feature under Core Commands)
paperlib search "machine learning"

# Rebuild search index
paperlib reindex

Core Commands

Library Management

  • paperlib init [path] - Initialize a paper library directory
  • paperlib status - Show library configuration and layout
  • paperlib reindex - Rebuild search index from stored papers

Paper Import

  • paperlib import --pdf <path> - Import a local PDF file
  • paperlib import --arxiv <id> - Import a paper from arXiv
  • Options: --title, --notes, --tags, --library

Paper Management

  • paperlib list - List all imported papers with status
  • paperlib show <paper-id> - Show detailed paper information
  • paperlib convert - Convert pending papers to Markdown using MinerU

Search (Future)

  • paperlib search <query> - Search papers by content and metadata

Library Structure

A paperlib library is organized as follows:

library_root/
├── config/
│   ├── config.toml
│   └── prompts/
├── papers/
│   ├── arxiv/
│   │   └── 2026/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json          # Paper metadata
│   │           ├── source.pdf         # Original PDF
│   │           ├── paper.md           # Converted markdown
│   │           ├── summary.json       # AI summary (optional)
│   │           ├── summary.md         # Rendered summary
│   │           ├── assets/            # Images, figures
│   │           └── logs/              # Conversion logs
│   └── local/
│       └── <hash>/
│           └── ...
├── db/
│   └── paperlib.sqlite3              # Search index (rebuildable)
├── inbox/                             # Temporary imports
└── cache/                            # Processing cache
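
As a rough illustration of the layout above, the following Python sketch builds the directory path for an arXiv paper. It assumes the directory name is "arxiv-" followed by the arXiv ID with "." replaced by "_", which matches the example in the tree. The tree does not say what the year directory represents, so the year is passed in explicitly. The function name arxiv_paper_dir is hypothetical and not part of paperlib.

```python
from pathlib import Path


def arxiv_paper_dir(library_root: Path, arxiv_id: str, year: int) -> Path:
    """Return the directory for an arXiv paper under the layout shown above."""
    # Assumption: directory name is "arxiv-" plus the ID with "." replaced by "_".
    dir_name = "arxiv-" + arxiv_id.replace(".", "_")
    return library_root / "papers" / "arxiv" / str(year) / dir_name


paper_dir = arxiv_paper_dir(Path("library_root"), "2212.06340", 2026)
print(paper_dir.as_posix())                  # library_root/papers/arxiv/2026/arxiv-2212_06340
print((paper_dir / "meta.json").as_posix())  # library_root/papers/arxiv/2026/arxiv-2212_06340/meta.json
```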

Data Model

Paper Metadata (meta.json)

Each paper has a meta.json file containing:

  • Core identifiers: paper_id, source_type, source_id
  • Bibliographic info: title, authors, published_date, categories
  • File paths: pdf_path, paper_md_path, summary_json_path
  • Processing status: conversion_status, summary_status
  • User data: tags, notes
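
For scripts that read meta.json directly, the fields above could be modeled roughly as follows. This is a minimal sketch: the field names come from the list above, but every type, default, and example value is an assumption, and the class and helper names (PaperMeta, dumps_meta, loads_meta) are hypothetical rather than paperlib's actual model.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class PaperMeta:
    # Field names follow the list above; all types and defaults are assumptions.
    paper_id: str
    source_type: str
    source_id: str
    title: str = ""
    authors: list[str] = field(default_factory=list)
    published_date: str | None = None
    categories: list[str] = field(default_factory=list)
    pdf_path: str = "source.pdf"
    paper_md_path: str | None = None
    summary_json_path: str | None = None
    conversion_status: str = "pending"  # assumed status value
    summary_status: str = "pending"     # assumed status value
    tags: list[str] = field(default_factory=list)
    notes: str = ""


def dumps_meta(meta: PaperMeta) -> str:
    """Serialize metadata to JSON text with stable key order."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)


def loads_meta(text: str) -> PaperMeta:
    """Parse JSON text back into a PaperMeta instance."""
    return PaperMeta(**json.loads(text))


# Hypothetical example values.
meta = PaperMeta(paper_id="arxiv-2212_06340", source_type="arxiv", source_id="2212.06340")
assert loads_meta(dumps_meta(meta)) == meta
```

Sorting keys on write keeps the file diff-friendly, which suits a per-paper file that acts as the source of truth.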

Summary Data (summary.json)

Optional AI-generated summaries with:

  • Structured fields: problem statement, method overview, results
  • Categorization: problem tags, technique tags
  • Relevance scoring and recommended sections

PDF Conversion

paperlib integrates with MinerU for high-quality PDF to Markdown conversion:

# Install MinerU (optional)
pip install "mineru[core]"

# Convert all pending papers
paperlib convert

# Retry failed conversions (useful after fixing system dependencies)
paperlib convert --retry-failed

# Force reconvert all papers
paperlib convert --force

# Convert a specific paper
paperlib convert --paper-id <paper-id>
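
One way to read the flags above is as a filter over each paper's conversion_status. The Python sketch below illustrates that reading only. The status values "pending", "failed", and "done" are assumptions, as is the choice that --retry-failed processes failed papers in addition to pending ones and that --paper-id narrows the candidates before the status filter applies. The function select_for_conversion is hypothetical.

```python
def select_for_conversion(papers, retry_failed=False, force=False, paper_id=None):
    """Pick which papers a single convert run would process (illustrative only)."""
    # Assumed status values: "pending", "failed", "done".
    if paper_id is not None:
        papers = [p for p in papers if p["paper_id"] == paper_id]
    if force:
        # Reconvert everything that is still a candidate, regardless of status.
        return list(papers)
    wanted = {"pending"}
    if retry_failed:
        wanted.add("failed")
    return [p for p in papers if p["conversion_status"] in wanted]


library = [
    {"paper_id": "a", "conversion_status": "pending"},
    {"paper_id": "b", "conversion_status": "failed"},
    {"paper_id": "c", "conversion_status": "done"},
]
print([p["paper_id"] for p in select_for_conversion(library)])                     # ['a']
print([p["paper_id"] for p in select_for_conversion(library, retry_failed=True)])  # ['a', 'b']
print([p["paper_id"] for p in select_for_conversion(library, force=True)])         # ['a', 'b', 'c']
```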

Troubleshooting PDF Conversion

If conversion fails with OpenGL/display errors on headless systems:

# Check if MinerU is properly installed
uv run mineru --version

# If you get "libxcb.so.1" or similar errors, install OpenGL support:
sudo apt-get install libglvnd0  # Ubuntu/Debian
sudo pacman -S libglvnd         # Arch Linux
sudo dnf install libglvnd-glx   # Fedora

# Test conversion manually
mineru -p example.pdf -o /tmp/test_output -b pipeline

# Check paperlib conversion logs
cat path/to/library/papers/.../logs/mineru.log
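
Before reinstalling packages, it can help to check whether the shared library named in the error message is visible to the system at all. The sketch below uses Python's standard ctypes.util.find_library for that. The default name "xcb" is taken from the libxcb.so.1 error mentioned above; any other names should be taken from the actual error output. A result of None only means the lookup found nothing on this platform, so treat it as a hint rather than proof. The function check_shared_libraries is hypothetical.

```python
from ctypes.util import find_library


def check_shared_libraries(names=("xcb",)):
    """Map each library short name to what find_library reports, or None."""
    # Short names omit the "lib" prefix and the ".so" suffix, e.g. "xcb" for libxcb.so.1.
    return {name: find_library(name) for name in names}


for name, found in check_shared_libraries().items():
    print(f"lib{name}: {found or 'not found'}")
```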

Machine-Readable Output

Most commands support --json output for automation and integration:

# Get library configuration in JSON
paperlib status --json

# List all papers with metadata
paperlib list --json

# Get detailed paper information
paperlib show <paper-id> --json

# Get import results
paperlib import --arxiv 2212.06340 --json

# Get conversion status and results
paperlib convert --json
paperlib convert --paper-id <paper-id> --json

# Get reindexing statistics
paperlib reindex --json

JSON Output Format

All JSON responses follow a consistent envelope format:

{
  "success": true,
  "timestamp": "2024-01-15T10:30:00.000Z",
  "data": { /* command-specific data */ }
}

For errors:

{
  "success": false, 
  "timestamp": "2024-01-15T10:30:00.000Z",
  "error": "Error message here",
  "error_code": 1
}

This structured output enables reliable automation, scripting, and integration with other tools. The JSON format is stable across paperlib versions.
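
A script consuming this envelope might unwrap it as in the following Python sketch. It relies only on the keys shown above (success, data, error, error_code). The PaperlibError class, the unwrap_envelope helper, and the "papers" key inside data are hypothetical names used for illustration.

```python
import json


class PaperlibError(Exception):
    """Raised when an envelope reports success == false (hypothetical helper)."""

    def __init__(self, message, code):
        super().__init__(message)
        self.code = code


def unwrap_envelope(text):
    """Return the "data" payload of a JSON envelope, or raise PaperlibError."""
    envelope = json.loads(text)
    if envelope.get("success"):
        return envelope.get("data")
    raise PaperlibError(envelope.get("error", "unknown error"), envelope.get("error_code"))


# The "papers" key is a made-up payload; real command data may differ.
ok = '{"success": true, "timestamp": "2024-01-15T10:30:00.000Z", "data": {"papers": []}}'
print(unwrap_envelope(ok))  # {'papers': []}
```

In practice the text would typically come from the standard output of a command such as paperlib list --json, captured for example with subprocess.run(..., capture_output=True, text=True).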

Development

paperlib is designed for extensibility and integration with higher-level tools.

Running Tests

# Run all tests
uv run pytest

# Run specific test module
uv run pytest tests/test_models.py

# Run with coverage
uv run pytest --cov=paperlib

Code Quality

# Format code
uv run ruff format

# Check linting
uv run ruff check

# Type checking
uv run mypy src/

Architecture

paperlib follows clean architecture principles:

  • Models: Data structures for papers and summaries
  • Storage: File-based metadata and PDF management
  • Index: SQLite search and retrieval layer
  • Importers: PDF and arXiv import workflows
  • Converters: PDF to Markdown transformation
  • CLI: Command-line interface and argument parsing
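
The split between Storage and Index means the SQLite database can always be regenerated from the per-paper meta.json files. The sketch below shows the idea in Python with the standard sqlite3 module. The table schema and the rebuild_index function are hypothetical, since the real index layout is not described here, and only the paper_id, title, and tags fields are used.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, db_path: str = ":memory:") -> sqlite3.Connection:
    """Recreate a small papers table from every meta.json under papers/."""
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS papers")
    # Hypothetical schema for illustration only.
    conn.execute("CREATE TABLE papers (paper_id TEXT PRIMARY KEY, title TEXT, tags TEXT)")
    for meta_file in sorted((library_root / "papers").rglob("meta.json")):
        meta = json.loads(meta_file.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT INTO papers VALUES (?, ?, ?)",
            (meta["paper_id"], meta.get("title", ""), ",".join(meta.get("tags", []))),
        )
    conn.commit()
    return conn
```

Because the table is dropped and refilled on every run, deleting the database file loses nothing that the metadata files do not already hold.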

Roadmap

  • Core paper import (local PDF, arXiv)
  • PDF to Markdown conversion (MinerU integration)*
  • Metadata management and search indexing
  • CLI with all basic commands
  • Comprehensive test suite
  • Search command implementation
  • AI summarization with provider abstraction
  • JSON output for core commands
  • Configuration file support
  • Advanced arXiv workflows

* Note: PDF conversion requires the libglvnd system dependency for OpenGL support on headless systems.

Non-Goals

paperlib is intentionally focused and does NOT include:

  • Web UI or GUI applications
  • Multi-user or cloud-first features
  • Mandatory daemon or background services
  • Vector database requirements
  • Fully autonomous research assistant behavior

License

MIT License - see LICENSE file for details.

Contributing

Contributions welcome! Please read the development guidelines in AGENTS.md and ensure all tests pass before submitting PRs.