wyj/paperlib

wyj 76580fc4a2 doc: doc the --json option

2026-04-17 20:04:32 -04:00

paperlib

A local-first paper library engine with a CLI for managing academic papers.

paperlib is designed to import PDF papers into a structured local library, convert PDFs into Markdown using external converters, maintain stable per-paper metadata files, and provide a searchable index database. It offers optional AI-based structured summaries while remaining useful even without AI features.

Key Features

Local-first: All data lives locally in the paper library directory
CLI-first: All important workflows accessible from the command line
JSON source of truth: Per-paper metadata files with rebuildable SQLite index
AI-optional: Core workflows work without LLM configuration
Machine-readable: --json output for automation and integration
Stable interfaces: Designed for scripts and higher-level tools

Installation

System Dependencies

For PDF conversion, paperlib depends on MinerU, which requires OpenGL support. If you are in a graphical environment, the required libraries are most likely already installed. On headless systems, install:

# Debian-based
sudo apt-get install libglvnd0

# Fedora
sudo dnf install libglvnd-glx

# Arch Linux
sudo pacman -S libglvnd

# Gentoo
sudo emerge -av media-libs/libglvnd
# or just add media-libs/libglvnd to your @world or some set

Python Package

# Install with uv (recommended)
uv add paperlib

# Or with pip
pip install paperlib

Quick Start

# Initialize a paper library
paperlib init

# Import a local PDF
paperlib import --pdf paper.pdf --title "My Research Paper"

# Import from arXiv
paperlib import --arxiv 2212.06340

# List all papers
paperlib list

# Show paper details
paperlib show <paper-id>

# Convert PDFs to Markdown (requires MinerU)
paperlib convert

# Search papers (planned; see "Search (Future)" under Core Commands)
paperlib search "machine learning"

# Rebuild search index
paperlib reindex

Core Commands

Library Management

paperlib init [path] - Initialize a paper library directory
paperlib status - Show library configuration and layout
paperlib reindex - Rebuild search index from stored papers

Paper Import

paperlib import --pdf <path> - Import a local PDF file
paperlib import --arxiv <id> - Import paper from arXiv
Options: --title, --notes, --tags, --library

Paper Management

paperlib list - List all imported papers with status
paperlib show <paper-id> - Show detailed paper information
paperlib convert - Convert pending papers to Markdown using MinerU

Search (Future)

paperlib search <query> - Search papers by content and metadata

Library Structure

A paperlib library is organized as follows:

library_root/
├── config/
│   ├── config.toml
│   └── prompts/
├── papers/
│   ├── arxiv/
│   │   └── 2026/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json          # Paper metadata
│   │           ├── source.pdf         # Original PDF
│   │           ├── paper.md           # Converted markdown
│   │           ├── summary.json       # AI summary (optional)
│   │           ├── summary.md         # Rendered summary
│   │           ├── assets/            # Images, figures
│   │           └── logs/              # Conversion logs
│   └── local/
│       └── <hash>/
│           └── ...
├── db/
│   └── paperlib.sqlite3              # Search index (rebuildable)
├── inbox/                             # Temporary imports
└── cache/                            # Processing cache
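
Because every paper lives in its own directory with a meta.json file, a script can discover papers by walking papers/ without touching the SQLite index. The following is a minimal sketch based only on the layout shown above; the function name iter_paper_dirs is hypothetical and not part of paperlib.

```python
from pathlib import Path
from typing import Iterator


def iter_paper_dirs(library_root: Path) -> Iterator[Path]:
    """Yield every directory under papers/ that contains a meta.json file."""
    papers_dir = library_root / "papers"
    if not papers_dir.is_dir():
        # Not an initialized library (or the wrong path): nothing to yield.
        return
    # sorted() keeps the order deterministic across runs and filesystems.
    for meta_file in sorted(papers_dir.rglob("meta.json")):
        yield meta_file.parent


if __name__ == "__main__":
    # "library_root" is a placeholder path, as in the tree above.
    for paper_dir in iter_paper_dirs(Path("library_root")):
        has_markdown = (paper_dir / "paper.md").exists()
        print(paper_dir.name, "has paper.md" if has_markdown else "no paper.md yet")
```

Searching recursively for meta.json, rather than hard-coding the arxiv/ and local/ subtrees, keeps the sketch independent of how deep each source nests its directories.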

Data Model

Paper Metadata (`meta.json`)

Each paper has a meta.json file containing:

Core identifiers: paper_id, source_type, source_id
Bibliographic info: title, authors, published_date, categories
File paths: pdf_path, paper_md_path, summary_json_path
Processing status: conversion_status, summary_status
User data: tags, notes
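
For scripts that read meta.json directly, a small typed wrapper can make the fields easier to work with. The sketch below models a subset of the field names listed above; the types, defaults, and the names PaperMeta and parse_meta are assumptions for illustration, not paperlib's actual model.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class PaperMeta:
    # Field names come from the list above; types and defaults are guesses.
    paper_id: str
    source_type: str
    source_id: str
    title: Optional[str] = None
    authors: list[str] = field(default_factory=list)
    conversion_status: Optional[str] = None
    summary_status: Optional[str] = None
    tags: list[str] = field(default_factory=list)
    notes: Optional[str] = None


def parse_meta(text: str) -> PaperMeta:
    """Build a PaperMeta from meta.json text, ignoring keys not modeled here."""
    raw: dict[str, Any] = json.loads(text)
    known = {name: raw[name] for name in PaperMeta.__dataclass_fields__ if name in raw}
    # A missing core identifier surfaces as a TypeError from the constructor.
    return PaperMeta(**known)
```

Ignoring unknown keys means the wrapper keeps working if meta.json gains fields that this sketch does not know about.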

Summary Data (`summary.json`)

Optional AI-generated summaries with:

Structured fields: problem statement, method overview, results
Categorization: problem tags, technique tags
Relevance scoring and recommended sections

PDF Conversion

paperlib integrates with MinerU for high-quality PDF to Markdown conversion:

# Install MinerU (optional); the quotes stop shells such as zsh from expanding the brackets
pip install "mineru[core]"

# Convert all pending papers
paperlib convert

# Retry failed conversions (useful after fixing system dependencies)
paperlib convert --retry-failed

# Force reconvert all papers
paperlib convert --force

# Convert specific paper
paperlib convert --paper-id <paper-id>

Troubleshooting PDF Conversion

If conversion fails with OpenGL/display errors on headless systems:

# Check if MinerU is properly installed
uv run mineru --version

# If you get "libxcb.so.1" or similar errors, install OpenGL support:
sudo apt-get install libglvnd0  # Ubuntu/Debian
sudo pacman -S libglvnd         # Arch Linux
sudo dnf install libglvnd-glx   # Fedora

# Test conversion manually
mineru -p example.pdf -o /tmp/test_output -b pipeline

# Check paperlib conversion logs
cat path/to/library/papers/.../logs/mineru.log

Machine-Readable Output

Most commands support --json output for automation and integration:

# Get library configuration in JSON
paperlib status --json

# List all papers with metadata
paperlib list --json

# Get detailed paper information
paperlib show <paper-id> --json

# Get import results
paperlib import --arxiv 2212.06340 --json

# Get conversion status and results
paperlib convert --json
paperlib convert --paper-id <paper-id> --json

# Get reindexing statistics
paperlib reindex --json
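
From Python, the same commands can be driven through subprocess. This is a rough sketch, assuming that paperlib is on PATH and that the JSON document is the only thing written to stdout; the helper name run_paperlib_json and its runner parameter are hypothetical.

```python
import json
import subprocess
from typing import Any, Callable, Sequence


def run_paperlib_json(
    args: Sequence[str],
    runner: Callable[..., Any] = subprocess.run,
) -> dict:
    """Run `paperlib <args> --json` and return the decoded JSON document."""
    command = ["paperlib", *args, "--json"]
    # capture_output and text are standard subprocess.run keyword arguments.
    completed = runner(command, capture_output=True, text=True)
    # Assumption: stdout holds exactly one JSON document and nothing else.
    return json.loads(completed.stdout)
```

For example, run_paperlib_json(["list"]) would execute paperlib list --json. The runner parameter exists only so the helper can be exercised in tests without a real paperlib installation.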

JSON Output Format

All JSON responses follow a consistent envelope format:

{
  "success": true,
  "timestamp": "2024-01-15T10:30:00.000Z",
  "data": { /* command-specific data */ }
}

For errors:

{
  "success": false,
  "timestamp": "2024-01-15T10:30:00.000Z",
  "error": "Error message here",
  "error_code": 1
}

This structured output enables reliable automation, scripting, and integration with other tools. The JSON format is stable across paperlib versions.
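
A consumer can handle both envelope shapes in one place. The sketch below relies only on the success, data, error, and error_code keys shown above; PaperlibError and unwrap_envelope are hypothetical names introduced for illustration.

```python
import json
from typing import Any


class PaperlibError(RuntimeError):
    """Hypothetical exception for a failure envelope (not part of paperlib)."""

    def __init__(self, message: str, error_code: Any = None) -> None:
        super().__init__(message)
        self.error_code = error_code


def unwrap_envelope(text: str) -> Any:
    """Return the "data" payload, or raise PaperlibError on a failure envelope."""
    envelope = json.loads(text)
    if envelope.get("success") is True:
        return envelope.get("data")
    raise PaperlibError(
        envelope.get("error", "unknown error"),
        envelope.get("error_code"),
    )
```

Raising on failure keeps calling code free of repeated success checks, and the error_code attribute preserves the numeric code for callers that want to branch on it.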

Development

paperlib is designed for extensibility and integration with higher-level tools.

Running Tests

# Run all tests
uv run pytest

# Run specific test module
uv run pytest tests/test_models.py

# Run with coverage
uv run pytest --cov=paperlib

Code Quality

# Format code
uv run ruff format

# Check linting
uv run ruff check

# Type checking
uv run mypy src/

Architecture

paperlib follows clean architecture principles:

Models: Data structures for papers and summaries
Storage: File-based metadata and PDF management
Index: SQLite search and retrieval layer
Importers: PDF and arXiv import workflows
Converters: PDF to Markdown transformation
CLI: Command-line interface and argument parsing
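
The split between the Storage and Index layers is what makes the index disposable: the JSON files are authoritative, and the database can be regenerated from them. The toy sketch below illustrates that idea with Python's sqlite3 module. The papers table and its columns are a hypothetical schema, not the real layout of paperlib.sqlite3, and rebuild_index is not a paperlib function.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, conn: sqlite3.Connection) -> int:
    """Recreate a toy `papers` table from every meta.json under papers/."""
    # Hypothetical schema: the real index layout is not documented here.
    conn.execute("DROP TABLE IF EXISTS papers")
    conn.execute(
        "CREATE TABLE papers (paper_id TEXT PRIMARY KEY, title TEXT, tags TEXT)"
    )
    count = 0
    for meta_file in sorted((library_root / "papers").rglob("meta.json")):
        meta = json.loads(meta_file.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT INTO papers (paper_id, title, tags) VALUES (?, ?, ?)",
            (meta["paper_id"], meta.get("title"), json.dumps(meta.get("tags", []))),
        )
        count += 1
    conn.commit()
    return count
```

Dropping and recreating the table on each run means the result depends only on the files on disk, so running the rebuild twice gives the same index.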

Roadmap

Core paper import (local PDF, arXiv)
PDF to Markdown conversion (MinerU integration)*
Metadata management and search indexing
CLI with all basic commands
Comprehensive test suite
Search command implementation
AI summarization with provider abstraction
JSON output for core commands
Configuration file support
Advanced arXiv workflows

* Note: PDF conversion requires the libglvnd system dependency for OpenGL support on headless systems.

Non-Goals

paperlib is intentionally focused and does NOT include:

Web UI or GUI applications
Multi-user or cloud-first features
Mandatory daemon or background services
Vector database requirements
Fully autonomous research assistant behavior

License

MIT License - see LICENSE file for details.

Contributing

Contributions welcome! Please read the development guidelines in AGENTS.md and ensure all tests pass before submitting PRs.
