# paperlib
A local-first paper library engine with a CLI for managing academic papers.
**paperlib** is designed to import PDF papers into a structured local library, convert them to Markdown using external converters, maintain stable per-paper metadata files, and provide a searchable index database. AI-based structured summaries are optional; the core workflows work without them.
## Key Features
- **Local-first**: All data lives locally in the paper library directory
- **CLI-first**: All important workflows accessible from the command line
- **JSON source of truth**: Per-paper metadata files with rebuildable SQLite index
- **AI-optional**: Core workflows work without LLM configuration
- **Machine-readable**: `--json` output for automation and integration
- **Stable interfaces**: Designed for scripts and higher-level tools
## Installation
```bash
# Add to a uv-managed project (recommended)
uv add paperlib
# Or with pip
pip install paperlib
```
## Quick Start
```bash
# Initialize a paper library
paperlib init
# Import a local PDF
paperlib import --pdf paper.pdf --title "My Research Paper"
# Import from arXiv
paperlib import --arxiv 2212.06340
# List all papers
paperlib list
# Show paper details
paperlib show <paper-id>
# Convert PDFs to Markdown (requires MinerU)
paperlib convert
# Search papers (planned; not yet implemented, see Roadmap)
paperlib search "machine learning"
# Rebuild search index
paperlib reindex
```
## Core Commands
### Library Management
- `paperlib init [path]` - Initialize a paper library directory
- `paperlib status` - Show library configuration and layout
- `paperlib reindex` - Rebuild search index from stored papers
### Paper Import
- `paperlib import --pdf <path>` - Import a local PDF file
- `paperlib import --arxiv <id>` - Import paper from arXiv
- Options: `--title`, `--notes`, `--tags`, `--library`
### Paper Management
- `paperlib list` - List all imported papers with status
- `paperlib show <paper-id>` - Show detailed paper information
- `paperlib convert` - Convert pending papers to Markdown using MinerU
### Search (Future)
- `paperlib search <query>` - Search papers by content and metadata
## Library Structure
A paperlib library is organized as follows:
```
library_root/
├── config/
│   ├── config.toml
│   └── prompts/
├── papers/
│   ├── arxiv/
│   │   └── 2026/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json        # Paper metadata
│   │           ├── source.pdf       # Original PDF
│   │           ├── paper.md         # Converted Markdown
│   │           ├── summary.json     # AI summary (optional)
│   │           ├── summary.md       # Rendered summary
│   │           ├── assets/          # Images, figures
│   │           └── logs/            # Conversion logs
│   └── local/
│       └── <hash>/
│           └── ...
├── db/
│   └── paperlib.sqlite3             # Search index (rebuildable)
├── inbox/                           # Temporary imports
└── cache/                           # Processing cache
```
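As a rough illustration of the layout above, the following Python sketch builds per-paper directory paths. The function names are hypothetical and not part of paperlib's API. The sketch assumes that the `arxiv-2212_06340` directory name is the arXiv ID with `.` replaced by `_`, that the year directory is supplied by the caller (the README does not say how it is chosen), and that `<hash>` is a truncated SHA-256 of the PDF contents.

```python
import hashlib
from pathlib import Path


def arxiv_paper_id(arxiv_id: str) -> str:
    """Map an arXiv ID such as '2212.06340' to 'arxiv-2212_06340'."""
    return "arxiv-" + arxiv_id.replace(".", "_")


def arxiv_paper_dir(library_root: Path, arxiv_id: str, year: int) -> Path:
    """Directory for an arXiv paper: papers/arxiv/<year>/<paper-id>/."""
    # Assumption: the caller decides which year the paper is filed under.
    return library_root / "papers" / "arxiv" / str(year) / arxiv_paper_id(arxiv_id)


def local_paper_dir(library_root: Path, pdf_bytes: bytes) -> Path:
    """Directory for a locally imported PDF: papers/local/<hash>/."""
    # Assumption: <hash> is the first 16 hex digits of the PDF's SHA-256.
    digest = hashlib.sha256(pdf_bytes).hexdigest()[:16]
    return library_root / "papers" / "local" / digest
```

Deriving the directory from a stable identifier keeps repeated imports of the same paper pointing at the same location.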
## Data Model
### Paper Metadata (`meta.json`)
Each paper has a `meta.json` file containing:
- Core identifiers: `paper_id`, `source_type`, `source_id`
- Bibliographic info: `title`, `authors`, `published_date`, `categories`
- File paths: `pdf_path`, `paper_md_path`, `summary_json_path`
- Processing status: `conversion_status`, `summary_status`
- User data: `tags`, `notes`
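The fields listed above could be modelled roughly as follows. This is a minimal sketch, not paperlib's actual model: the field names come from the list, but the types, the defaults, and the `"pending"` status value are assumptions, and `PaperMeta`, `save_meta`, and `load_meta` are hypothetical names.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class PaperMeta:
    # Core identifiers
    paper_id: str
    source_type: str
    source_id: str
    # Bibliographic info (types are assumptions)
    title: str = ""
    authors: list[str] = field(default_factory=list)
    published_date: str | None = None
    categories: list[str] = field(default_factory=list)
    # File paths, assumed relative to the paper directory
    pdf_path: str = "source.pdf"
    paper_md_path: str | None = None
    summary_json_path: str | None = None
    # Processing status ("pending" is an assumed value)
    conversion_status: str = "pending"
    summary_status: str = "pending"
    # User data
    tags: list[str] = field(default_factory=list)
    notes: str = ""


def save_meta(meta: PaperMeta, path: Path) -> None:
    """Write the metadata as pretty-printed UTF-8 JSON."""
    text = json.dumps(asdict(meta), indent=2, ensure_ascii=False)
    path.write_text(text + "\n", encoding="utf-8")


def load_meta(path: Path) -> PaperMeta:
    """Read a meta.json file back into a PaperMeta instance."""
    return PaperMeta(**json.loads(path.read_text(encoding="utf-8")))
```

Because `meta.json` is the source of truth, a plain round-trippable structure like this keeps the file easy to inspect and edit by hand.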
### Summary Data (`summary.json`)
Optional AI-generated summaries with:
- Structured fields: problem statement, method overview, results
- Categorization: problem tags, technique tags
- Relevance scoring and recommended sections
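To illustrate how a structured summary might be turned into the `summary.md` file shown in the library layout, here is a small rendering sketch. The JSON key names (`problem_statement`, `method_overview`, `results`, `problem_tags`, `technique_tags`) and the function `render_summary_md` are hypothetical; the real `summary.json` schema may differ.

```python
# Assumed mapping from hypothetical summary.json keys to Markdown headings.
SECTIONS = [
    ("problem_statement", "Problem"),
    ("method_overview", "Method"),
    ("results", "Results"),
]


def render_summary_md(summary: dict) -> str:
    """Render a summary dict as Markdown, skipping missing sections."""
    lines = ["# Summary", ""]
    for key, heading in SECTIONS:
        text = summary.get(key)
        if text:
            lines += [f"## {heading}", "", text, ""]
    # Merge both tag lists into a single line at the end.
    tags = list(summary.get("problem_tags", [])) + list(summary.get("technique_tags", []))
    if tags:
        lines += ["**Tags:** " + ", ".join(tags), ""]
    return "\n".join(lines)
```

Keeping the rendered Markdown derivable from the JSON means `summary.md` can be regenerated at any time without calling an AI provider again.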
## PDF Conversion
paperlib integrates with [MinerU](https://github.com/opendatalab/MinerU) for high-quality PDF to Markdown conversion:
```bash
# Install MinerU (optional)
pip install "mineru[core]"
# Convert all pending papers
paperlib convert
# Convert specific paper
paperlib convert --paper-id <paper-id>
```
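The notion of "pending" papers can be sketched in Python as a scan over `meta.json` files. This is an illustration under assumptions, not paperlib's implementation: it assumes `conversion_status` holds the string `"pending"` for unconverted papers, and `pending_papers` is a hypothetical helper. It does not invoke MinerU.

```python
import json
from pathlib import Path


def pending_papers(library_root: Path) -> list[Path]:
    """Return paper directories whose meta.json marks conversion as pending."""
    pending = []
    for meta_path in sorted((library_root / "papers").rglob("meta.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        # Assumption: "pending" is the status value for unconverted papers.
        if meta.get("conversion_status") == "pending":
            pending.append(meta_path.parent)
    return pending
```

A converter loop would then process each returned directory and update its `conversion_status` once `paper.md` has been written.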
## Machine-Readable Output
Most commands support `--json` output for automation:
```bash
paperlib list --json
paperlib show <paper-id> --json
paperlib status --json
```
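A script could consume this output along the following lines. The helpers `build_command` and `run_paperlib_json` are hypothetical wrappers, and the sketch deliberately makes no assumption about the shape of the JSON beyond it being valid JSON on stdout.

```python
import json
import subprocess


def build_command(*args: str) -> list[str]:
    """Build the argument vector for `paperlib <args> --json`."""
    return ["paperlib", *args, "--json"]


def run_paperlib_json(*args: str):
    """Run paperlib with --json and return the parsed output.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    FileNotFoundError if the paperlib executable is not on PATH.
    """
    result = subprocess.run(
        build_command(*args),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

For example, `run_paperlib_json("show", "<paper-id>")` would return whatever structure `paperlib show <paper-id> --json` prints.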
## Development
paperlib is designed for extensibility and integration with higher-level tools.
### Running Tests
```bash
# Run all tests
uv run pytest
# Run specific test module
uv run pytest tests/test_models.py
# Run with coverage
uv run pytest --cov=paperlib
```
### Code Quality
```bash
# Format code
uv run ruff format
# Check linting
uv run ruff check
# Type checking
uv run mypy src/
```
## Architecture
paperlib follows clean architecture principles, with each layer owning a single responsibility:
- **Models**: Data structures for papers and summaries
- **Storage**: File-based metadata and PDF management
- **Index**: SQLite search and retrieval layer
- **Importers**: PDF and arXiv import workflows
- **Converters**: PDF to Markdown transformation
- **CLI**: Command-line interface and argument parsing
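The relationship between the Storage and Index layers (JSON files as the source of truth, SQLite as a rebuildable index) can be sketched as below. The table schema, the function names, and the `LIKE`-based title search are assumptions made for illustration; paperlib's real index and search may look different.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, db_path: Path) -> int:
    """Recreate the index from every meta.json; return the number of papers."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            # The index is disposable: drop it and rebuild from the JSON files.
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, tags TEXT, meta_path TEXT)"
            )
            count = 0
            for meta_path in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_path.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT INTO papers VALUES (?, ?, ?, ?)",
                    (
                        meta["paper_id"],
                        meta.get("title", ""),
                        ",".join(meta.get("tags", [])),
                        str(meta_path),
                    ),
                )
                count += 1
        return count
    finally:
        conn.close()


def search_titles(db_path: Path, query: str) -> list[str]:
    """Return paper IDs whose title contains the query (simple LIKE match)."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT paper_id FROM papers WHERE title LIKE ? ORDER BY paper_id",
            (f"%{query}%",),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

Since the database is derived entirely from the metadata files, deleting it loses nothing: running the rebuild again restores the same index.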
## Roadmap
- [x] Core paper import (local PDF, arXiv)
- [x] PDF to Markdown conversion (MinerU integration)
- [x] Metadata management and search indexing
- [x] CLI with all basic commands
- [x] Comprehensive test suite
- [ ] Search command implementation
- [ ] AI summarization with provider abstraction
- [ ] JSON output for all commands
- [ ] Configuration file support
- [ ] Advanced arXiv workflows
## Non-Goals
paperlib is intentionally focused and does NOT include:
- Web UI or GUI applications
- Multi-user or cloud-first features
- Mandatory daemon or background services
- Vector database requirements
- Fully autonomous research assistant behavior
## License
MIT License - see LICENSE file for details.
## Contributing
Contributions welcome! Please read the development guidelines in AGENTS.md and ensure all tests pass before submitting PRs.