# paperlib

A local-first paper library engine with a CLI for managing academic papers.

**paperlib** is designed to import PDF papers into a structured local library, convert PDFs into Markdown using external converters, maintain stable per-paper metadata files, and provide a searchable index database. It offers optional AI-based structured summaries while remaining useful even without AI features.

## Key Features

- **Local-first**: All data lives locally in the paper library directory
- **CLI-first**: All important workflows accessible from the command line
- **JSON source of truth**: Per-paper JSON metadata files are authoritative; the SQLite index can be rebuilt from them
- **AI-optional**: Core workflows work without LLM configuration
- **Machine-readable**: `--json` output for automation and integration
- **Stable interfaces**: Designed for scripts and higher-level tools

## Installation

```bash
# Add to a uv-managed project (recommended)
uv add paperlib

# Or with pip
pip install paperlib
```

## Quick Start

```bash
# Initialize a paper library
paperlib init

# Import a local PDF
paperlib import --pdf paper.pdf --title "My Research Paper"

# Import from arXiv
paperlib import --arxiv 2212.06340

# List all papers
paperlib list

# Show paper details
paperlib show <paper-id>

# Convert PDFs to Markdown (requires MinerU)
paperlib convert

# Search papers (planned, not yet implemented; see Roadmap)
paperlib search "machine learning"

# Rebuild search index
paperlib reindex
```

## Core Commands

### Library Management
- `paperlib init [path]` - Initialize a paper library directory
- `paperlib status` - Show library configuration and layout
- `paperlib reindex` - Rebuild search index from stored papers

### Paper Import
- `paperlib import --pdf <path>` - Import a local PDF file
- `paperlib import --arxiv <id>` - Import paper from arXiv
- Options: `--title`, `--notes`, `--tags`, `--library`

### Paper Management
- `paperlib list` - List all imported papers with status
- `paperlib show <paper-id>` - Show detailed paper information
- `paperlib convert` - Convert pending papers to Markdown using MinerU

### Search (Future)
- `paperlib search <query>` - Search papers by content and metadata

## Library Structure

A paperlib library is organized as follows:

```
library_root/
├── config/
│   ├── config.toml
│   └── prompts/
├── papers/
│   ├── arxiv/
│   │   └── 2026/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json          # Paper metadata
│   │           ├── source.pdf         # Original PDF
│   │           ├── paper.md           # Converted markdown
│   │           ├── summary.json       # AI summary (optional)
│   │           ├── summary.md         # Rendered summary
│   │           ├── assets/            # Images, figures
│   │           └── logs/              # Conversion logs
│   └── local/
│       └── <hash>/
│           └── ...
├── db/
│   └── paperlib.sqlite3               # Search index (rebuildable)
├── inbox/                             # Temporary imports
└── cache/                             # Processing cache
```
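
The layout above can be reproduced and traversed with a few lines of Python. The sketch below is illustrative only: `create_layout`, `arxiv_paper_id`, and `find_meta_files` are hypothetical helpers, not part of paperlib's API, and the ID mapping is inferred from the single `arxiv-2212_06340` example, so the real rule may differ.

```python
from pathlib import Path

# Directories taken from the layout above.
LIBRARY_DIRS = ("config/prompts", "papers/arxiv", "papers/local", "db", "inbox", "cache")


def arxiv_paper_id(arxiv_id: str) -> str:
    """Map an arXiv ID such as '2212.06340' to a directory-safe paper ID.

    Assumption: inferred from the 'arxiv-2212_06340' example in the tree.
    """
    return "arxiv-" + arxiv_id.replace(".", "_")


def create_layout(root: Path) -> None:
    """Create the empty directory skeleton (a stand-in for `paperlib init`)."""
    for rel in LIBRARY_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)


def find_meta_files(root: Path) -> list[Path]:
    """Return every meta.json under papers/, sorted for stable output."""
    return sorted((root / "papers").rglob("meta.json"))
```

Because every paper directory carries its own `meta.json`, a recursive glob is enough to enumerate the library without touching the database.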

## Data Model

### Paper Metadata (`meta.json`)
Each paper has a `meta.json` file containing:
- Core identifiers: `paper_id`, `source_type`, `source_id`
- Bibliographic info: `title`, `authors`, `published_date`, `categories`
- File paths: `pdf_path`, `paper_md_path`, `summary_json_path`
- Processing status: `conversion_status`, `summary_status`
- User data: `tags`, `notes`
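
As a rough illustration, the fields listed above could be modeled with a dataclass that round-trips through JSON. The field names come from the list; the types, the defaults (including the `"pending"` status value), and the `save_meta`/`load_meta` helpers are assumptions made for this sketch and may not match paperlib's actual models.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class PaperMeta:
    # Core identifiers
    paper_id: str
    source_type: str
    source_id: str
    # Bibliographic info
    title: str = ""
    authors: list[str] = field(default_factory=list)
    published_date: Optional[str] = None
    categories: list[str] = field(default_factory=list)
    # File paths
    pdf_path: Optional[str] = None
    paper_md_path: Optional[str] = None
    summary_json_path: Optional[str] = None
    # Processing status (the "pending" default is an assumption)
    conversion_status: str = "pending"
    summary_status: str = "pending"
    # User data
    tags: list[str] = field(default_factory=list)
    notes: str = ""


def save_meta(meta: PaperMeta, path: Path) -> None:
    """Write the metadata as sorted, indented JSON so diffs stay small."""
    path.write_text(json.dumps(asdict(meta), indent=2, sort_keys=True), encoding="utf-8")


def load_meta(path: Path) -> PaperMeta:
    """Read a meta.json file back into a PaperMeta instance."""
    return PaperMeta(**json.loads(path.read_text(encoding="utf-8")))
```

Writing keys in sorted order keeps the files stable across rewrites, which matters when `meta.json` is the source of truth and may be kept under version control.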

### Summary Data (`summary.json`)
Optional AI-generated summaries with:
- Structured fields: problem statement, method overview, results
- Categorization: problem tags, technique tags
- Relevance scoring and recommended sections

## PDF Conversion

paperlib integrates with [MinerU](https://github.com/opendatalab/MinerU) for high-quality PDF to Markdown conversion:

```bash
# Install MinerU (optional)
pip install "mineru[core]"

# Convert all pending papers
paperlib convert

# Convert specific paper
paperlib convert --paper-id <paper-id>
```

## Machine-Readable Output

Most commands support `--json` output for automation:

```bash
paperlib list --json
paperlib show <paper-id> --json
paperlib status --json
```
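
For scripting, the `--json` output can be consumed from Python with `subprocess`. This is a minimal sketch: `run_json` and `list_papers` are hypothetical helper names, and since the JSON schema is not documented here, the result is returned untyped.

```python
import json
import subprocess
from typing import Any, Sequence


def run_json(command: Sequence[str]) -> Any:
    """Run a command and parse its stdout as JSON.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    json.JSONDecodeError if stdout is not valid JSON.
    """
    result = subprocess.run(list(command), capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def list_papers() -> Any:
    """Call `paperlib list --json`; the shape of the result is not assumed."""
    return run_json(["paperlib", "list", "--json"])
```

Passing `check=True` turns a failed command into an exception, so a script does not silently continue with empty or partial output.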

## Development

paperlib is designed for extensibility and integration with higher-level tools.

### Running Tests

```bash
# Run all tests
uv run pytest

# Run specific test module
uv run pytest tests/test_models.py

# Run with coverage
uv run pytest --cov=paperlib
```

### Code Quality

```bash
# Format code
uv run ruff format

# Check linting
uv run ruff check

# Type checking
uv run mypy src/
```

## Architecture

paperlib follows clean architecture principles, with each layer owning a single responsibility:

- **Models**: Data structures for papers and summaries
- **Storage**: File-based metadata and PDF management
- **Index**: SQLite search and retrieval layer
- **Importers**: PDF and arXiv import workflows
- **Converters**: PDF to Markdown transformation
- **CLI**: Command-line interface and argument parsing
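
The split between Storage (JSON files) and Index (SQLite) is what makes the index disposable: it can be dropped and regenerated from the `meta.json` files alone. The sketch below shows the idea under stated assumptions. `rebuild_index`, the `papers` table, and its columns are illustrative and do not describe paperlib's real schema or the implementation of `paperlib reindex`.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, db_path: Path) -> int:
    """Recreate a small `papers` table from every meta.json on disk.

    Returns the number of indexed papers. Table and column names are
    assumptions made for this example.
    """
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits the inserts when the block exits without error
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, tags TEXT, meta_path TEXT)"
            )
            count = 0
            for meta_file in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_file.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT INTO papers VALUES (?, ?, ?, ?)",
                    (
                        meta["paper_id"],
                        meta.get("title", ""),
                        ",".join(meta.get("tags", [])),
                        str(meta_file),
                    ),
                )
                count += 1
        return count
    finally:
        conn.close()
```

Since the table is dropped first, running the function twice yields the same index, which is the property that lets the database be deleted without losing data.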

## Roadmap

- [x] Core paper import (local PDF, arXiv)
- [x] PDF to Markdown conversion (MinerU integration)
- [x] Metadata management and search indexing
- [x] CLI with all basic commands
- [x] Comprehensive test suite
- [ ] Search command implementation
- [ ] AI summarization with provider abstraction
- [ ] JSON output for all commands
- [ ] Configuration file support
- [ ] Advanced arXiv workflows

## Non-Goals

paperlib is intentionally focused and does NOT include:
- Web UI or GUI applications
- Multi-user or cloud-first features
- Mandatory daemon or background services
- Vector database requirements
- Fully autonomous research assistant behavior

## License

MIT License - see LICENSE file for details.

## Contributing

Contributions welcome! Please read the development guidelines in AGENTS.md and ensure all tests pass before submitting PRs.