paperlib is designed to import PDF papers into a structured local library, convert them to Markdown with external converters, keep a stable metadata file for each paper, and maintain a searchable index database. AI-based structured summaries are optional, and the core workflows remain useful without them.

Key Features

Local-first: All data lives locally in the paper library directory
CLI-first: All important workflows accessible from the command line
JSON source of truth: Per-paper metadata files with rebuildable SQLite index
AI-optional: Core workflows work without LLM configuration
Machine-readable: --json output for automation and integration
Stable interfaces: Designed for scripts and higher-level tools

Installation

# Install with uv (recommended)
uv add paperlib

# Or with pip
pip install paperlib

Quick Start

# Initialize a paper library
paperlib init

# Import a local PDF
paperlib import --pdf paper.pdf --title "My Research Paper"

# Import from arXiv
paperlib import --arxiv 2212.06340

# List all papers
paperlib list

# Show paper details
paperlib show <paper-id>

# Convert PDFs to Markdown (requires MinerU)
paperlib convert

# Search papers (planned, not yet implemented)
paperlib search "machine learning"

# Rebuild search index
paperlib reindex

Core Commands

Library Management

paperlib init [path] - Initialize a paper library directory
paperlib status - Show library configuration and layout
paperlib reindex - Rebuild search index from stored papers

Paper Import

paperlib import --pdf <path> - Import a local PDF file
paperlib import --arxiv <id> - Import paper from arXiv
Options: --title, --notes, --tags, --library

Paper Management

paperlib list - List all imported papers with status
paperlib show <paper-id> - Show detailed paper information
paperlib convert - Convert pending papers to Markdown using MinerU

Search (Future)

paperlib search <query> - Search papers by content and metadata

Library Structure

A paperlib library is organized as follows:

library_root/
├── config/
│   ├── config.toml
│   └── prompts/
├── papers/
│   ├── arxiv/
│   │   └── 2026/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json          # Paper metadata
│   │           ├── source.pdf         # Original PDF
│   │           ├── paper.md           # Converted markdown
│   │           ├── summary.json       # AI summary (optional)
│   │           ├── summary.md         # Rendered summary
│   │           ├── assets/            # Images, figures
│   │           └── logs/              # Conversion logs
│   └── local/
│       └── <hash>/
│           └── ...
├── db/
│   └── paperlib.sqlite3              # Search index (rebuildable)
├── inbox/                            # Temporary imports
└── cache/                            # Processing cache
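
Because every paper carries its own meta.json, the SQLite file can in principle be thrown away and regenerated. The following is a minimal sketch of that idea, not paperlib's actual reindex code: the `papers` table, its columns, and the `rebuild_index` function are hypothetical, and only the directory layout and the `paper_id` and `title` keys come from this README.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, db_path: Path) -> int:
    """Rebuild a minimal index from every meta.json under papers/.

    The table name and columns are assumptions made for this sketch;
    the real paperlib schema is not described in the README.
    """
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    count = 0
    try:
        with conn:  # single transaction: commit on success, roll back on error
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, meta_path TEXT)"
            )
            papers_dir = library_root / "papers"
            for meta_path in sorted(papers_dir.rglob("meta.json")):
                meta = json.loads(meta_path.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT INTO papers VALUES (?, ?, ?)",
                    (
                        meta["paper_id"],
                        meta.get("title"),
                        meta_path.relative_to(library_root).as_posix(),
                    ),
                )
                count += 1
    finally:
        conn.close()
    return count
```

When experimenting, point `db_path` at a scratch file rather than db/paperlib.sqlite3, since this sketch's schema will not match the one paperlib expects.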

Data Model

Paper Metadata (`meta.json`)

Each paper has a meta.json file containing:

Core identifiers: paper_id, source_type, source_id
Bibliographic info: title, authors, published_date, categories
File paths: pdf_path, paper_md_path, summary_json_path
Processing status: conversion_status, summary_status
User data: tags, notes
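
For scripts that read meta.json directly, the fields above can be mirrored in a small dataclass. This is a sketch only: the field names are taken from the list above, but the types, defaults, and the `PaperMeta` and `load_meta` names are assumptions, and a real meta.json may contain additional keys.

```python
import json
from dataclasses import dataclass, field, fields
from pathlib import Path
from typing import Optional


@dataclass
class PaperMeta:
    # Field names follow the README; types and defaults are assumptions.
    paper_id: str
    source_type: str
    source_id: str
    title: Optional[str] = None
    authors: list[str] = field(default_factory=list)
    published_date: Optional[str] = None
    categories: list[str] = field(default_factory=list)
    pdf_path: Optional[str] = None
    paper_md_path: Optional[str] = None
    summary_json_path: Optional[str] = None
    conversion_status: Optional[str] = None
    summary_status: Optional[str] = None
    tags: list[str] = field(default_factory=list)
    notes: Optional[str] = None


def load_meta(path: Path) -> PaperMeta:
    """Load a meta.json file, ignoring keys this sketch does not model."""
    raw = json.loads(path.read_text(encoding="utf-8"))
    known = {f.name for f in fields(PaperMeta)}
    return PaperMeta(**{k: v for k, v in raw.items() if k in known})
```

Dropping unknown keys keeps the reader tolerant of fields added in later versions, at the cost of silently ignoring them.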

Summary Data (`summary.json`)

Optional AI-generated summaries with:

Structured fields: problem statement, method overview, results
Categorization: problem tags, technique tags
Relevance scoring and recommended sections

PDF Conversion

paperlib integrates with MinerU for high-quality PDF to Markdown conversion:

# Install MinerU (optional)
pip install "mineru[core]"

# Convert all pending papers
paperlib convert

# Convert specific paper
paperlib convert --paper-id <paper-id>
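
To see which papers a conversion run would pick up, a script can inspect the conversion_status field in each meta.json. The sketch below assumes the status string for unconverted papers is "pending"; that value is a guess based on the wording above, so check a real meta.json before relying on it. The `pending_papers` helper is hypothetical and not part of paperlib.

```python
import json
from pathlib import Path


def pending_papers(library_root: Path, pending_value: str = "pending") -> list[Path]:
    """Return paper directories whose meta.json marks conversion as pending.

    `pending_value` is an assumed status string, not a documented one.
    """
    found = []
    for meta_path in sorted((library_root / "papers").rglob("meta.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        if meta.get("conversion_status") == pending_value:
            found.append(meta_path.parent)
    return found
```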

Machine-Readable Output

Most commands support --json output for automation:

paperlib list --json
paperlib show <paper-id> --json
paperlib status --json
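
From a script, the --json output can be captured and parsed with the standard library. This is a minimal sketch: the `paperlib_json` helper is hypothetical, and the shape of the returned JSON is not specified in this README, so the caller should inspect it rather than assume particular keys.

```python
import json
import subprocess
from typing import Any, Callable, Sequence


def _run_cli(argv: Sequence[str]) -> str:
    """Run a command and return its stdout; raises if the exit code is non-zero."""
    result = subprocess.run(list(argv), capture_output=True, text=True, check=True)
    return result.stdout


def paperlib_json(*args: str, run: Callable[[Sequence[str]], str] = _run_cli) -> Any:
    """Call `paperlib <args> --json` and return the parsed output.

    `run` is injectable so the wrapper can be exercised without the CLI installed.
    """
    return json.loads(run(["paperlib", *args, "--json"]))
```

For example, `paperlib_json("list")` runs `paperlib list --json`, and `paperlib_json("show", paper_id)` runs the show command for one paper.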

Development

paperlib is designed for extensibility and integration with higher-level tools.

Running Tests

# Run all tests
uv run pytest

# Run specific test module
uv run pytest tests/test_models.py

# Run with coverage
uv run pytest --cov=paperlib

Code Quality

# Format code
uv run ruff format

# Check linting
uv run ruff check

# Type checking
uv run mypy src/

Architecture

paperlib follows clean architecture principles, with each layer owning a single responsibility:

Models: Data structures for papers and summaries
Storage: File-based metadata and PDF management
Index: SQLite search and retrieval layer
Importers: PDF and arXiv import workflows
Converters: PDF to Markdown transformation
CLI: Command-line interface and argument parsing
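
One way to keep the Converters layer swappable is a small interface that each converter implements. The sketch below illustrates that idea only; `Converter`, `ConversionResult`, `PlaceholderConverter`, and `run_conversion` are hypothetical names, not paperlib's actual classes, and the placeholder does not call MinerU.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class ConversionResult:
    markdown_path: Path
    ok: bool
    message: str = ""


class Converter(Protocol):
    """Hypothetical interface: turn one PDF into a paper.md in out_dir."""

    name: str

    def convert(self, pdf_path: Path, out_dir: Path) -> ConversionResult: ...


class PlaceholderConverter:
    """Stand-in that writes a stub paper.md; a real converter would invoke MinerU."""

    name = "placeholder"

    def convert(self, pdf_path: Path, out_dir: Path) -> ConversionResult:
        out_dir.mkdir(parents=True, exist_ok=True)
        md_path = out_dir / "paper.md"
        md_path.write_text(
            f"# {pdf_path.stem}\n\n(conversion not performed)\n", encoding="utf-8"
        )
        return ConversionResult(markdown_path=md_path, ok=True)


def run_conversion(converter: Converter, pdf_path: Path, out_dir: Path) -> ConversionResult:
    """Check the input exists, then delegate to whichever converter was supplied."""
    if not pdf_path.is_file():
        return ConversionResult(out_dir / "paper.md", ok=False, message=f"missing PDF: {pdf_path}")
    return converter.convert(pdf_path, out_dir)
```

With this shape, the calling code depends only on the interface, so a different backend can be added without touching the import or index layers.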

Roadmap

Core paper import (local PDF, arXiv)
PDF to Markdown conversion (MinerU integration)
Metadata management and search indexing
CLI with all basic commands
Comprehensive test suite
Search command implementation
AI summarization with provider abstraction
JSON output for all commands
Configuration file support
Advanced arXiv workflows

Non-Goals

paperlib is intentionally focused and does NOT include:

Web UI or GUI applications
Multi-user or cloud-first features
Mandatory daemon or background services
Vector database requirements
Fully autonomous research assistant behavior

License

MIT License - see LICENSE file for details.

Contributing

Contributions welcome! Please read the development guidelines in AGENTS.md and ensure all tests pass before submitting PRs.
