# paperlib

A local-first paper library engine with a CLI for managing academic papers.

**paperlib** is designed to import PDF papers into a structured local library, convert PDFs into Markdown using external converters, maintain stable per-paper metadata files, and provide a searchable index database. It offers optional AI-based structured summaries while remaining useful even without AI features.

## Key Features

- **Local-first**: All data lives locally in the paper library directory
- **CLI-first**: All important workflows accessible from the command line
- **JSON source of truth**: Per-paper metadata files with rebuildable SQLite index
- **AI-optional**: Core workflows work without LLM configuration
- **Machine-readable**: `--json` output for automation and integration
- **Stable interfaces**: Designed for scripts and higher-level tools

## Installation

### System Dependencies

MinerU, which paperlib uses for PDF conversion, requires OpenGL support. In a graphical environment the required libraries are usually already installed. On headless systems, install:

```bash
# Debian-based
sudo apt-get install libglvnd0

# Fedora
sudo dnf install libglvnd-glx

# Arch Linux
sudo pacman -S libglvnd

# Gentoo
sudo emerge -av media-libs/libglvnd
# or add media-libs/libglvnd to @world or another set
```

### Python Package

```bash
# Install with uv (recommended)
uv add paperlib

# Or with pip
pip install paperlib
```

## Quick Start

```bash
# Initialize a paper library
paperlib init

# Import a local PDF
paperlib import --pdf paper.pdf --title "My Research Paper"

# Import from arXiv
paperlib import --arxiv 2212.06340

# List all papers
paperlib list

# Show paper details
paperlib show <paper-id>

# Convert PDFs to Markdown (requires MinerU)
paperlib convert

# Search papers (planned, see Roadmap)
paperlib search "machine learning"

# Rebuild search index
paperlib reindex
```

## Core Commands

### Library Management
- `paperlib init [path]` - Initialize a paper library directory
- `paperlib status` - Show library configuration and layout
- `paperlib reindex` - Rebuild search index from stored papers

### Paper Import
- `paperlib import --pdf <path>` - Import a local PDF file
- `paperlib import --arxiv <id>` - Import paper from arXiv
- Options: `--title`, `--notes`, `--tags`, `--library`

### Paper Management
- `paperlib list` - List all imported papers with status
- `paperlib show <paper-id>` - Show detailed paper information
- `paperlib convert` - Convert pending papers to Markdown using MinerU

### Search (Future)
- `paperlib search <query>` - Search papers by content and metadata

## Library Structure

A paperlib library is organized as follows:

```
library_root/
├── config/
│   ├── config.toml
│   └── prompts/
├── papers/
│   ├── arxiv/
│   │   └── 2026/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json       # Paper metadata
│   │           ├── source.pdf      # Original PDF
│   │           ├── paper.md        # Converted markdown
│   │           ├── summary.json    # AI summary (optional)
│   │           ├── summary.md      # Rendered summary
│   │           ├── assets/         # Images, figures
│   │           └── logs/           # Conversion logs
│   └── local/
│       └── <hash>/
│           └── ...
├── db/
│   └── paperlib.sqlite3            # Search index (rebuildable)
├── inbox/                          # Temporary imports
└── cache/                          # Processing cache
```

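Because the `meta.json` files are the source of truth, the SQLite index can in principle be thrown away and regenerated by walking `papers/`. The sketch below illustrates that idea only. The `rebuild_index` function and the three-column `papers` table are hypothetical; the schema that `paperlib reindex` actually uses is not documented here.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path) -> int:
    """Recreate a minimal index from every papers/**/meta.json; return the row count."""
    db_dir = library_root / "db"
    db_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_dir / "paperlib.sqlite3")
    try:
        with conn:  # commits when the block finishes without an exception
            # Hypothetical schema, chosen only for this illustration.
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, meta_path TEXT)"
            )
            count = 0
            for meta_path in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_path.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT INTO papers VALUES (?, ?, ?)",
                    (
                        meta["paper_id"],
                        meta.get("title"),
                        str(meta_path.relative_to(library_root)),
                    ),
                )
                count += 1
        return count
    finally:
        conn.close()
```

Dropping and recreating the table keeps the rebuild idempotent: running it twice on the same library yields the same index.
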
## Data Model

### Paper Metadata (`meta.json`)
Each paper has a `meta.json` file containing:
- Core identifiers: `paper_id`, `source_type`, `source_id`
- Bibliographic info: `title`, `authors`, `published_date`, `categories`
- File paths: `pdf_path`, `paper_md_path`, `summary_json_path`
- Processing status: `conversion_status`, `summary_status`
- User data: `tags`, `notes`

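For scripts that read `meta.json` directly, the fields listed above map naturally onto a small typed record. The following is a minimal sketch, not paperlib's own model: the field names come from the list above, while the `PaperMeta` class, the `load_meta` helper, the Python types, and which fields are optional are all assumptions.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field, fields
from pathlib import Path
from typing import Any


@dataclass
class PaperMeta:
    # Core identifiers (assumed to be always present)
    paper_id: str
    source_type: str
    source_id: str
    # Bibliographic info (types are assumptions)
    title: str | None = None
    authors: list[str] = field(default_factory=list)
    published_date: str | None = None
    categories: list[str] = field(default_factory=list)
    # File paths
    pdf_path: str | None = None
    paper_md_path: str | None = None
    summary_json_path: str | None = None
    # Processing status
    conversion_status: str | None = None
    summary_status: str | None = None
    # User data
    tags: list[str] = field(default_factory=list)
    notes: str | None = None

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> PaperMeta:
        # Ignore keys this sketch does not know about, so newer files still load.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


def load_meta(paper_dir: Path) -> PaperMeta:
    """Read <paper_dir>/meta.json into a PaperMeta."""
    text = (paper_dir / "meta.json").read_text(encoding="utf-8")
    return PaperMeta.from_dict(json.loads(text))
```

Ignoring unknown keys is a deliberate choice in this sketch: a reader script should not break when the metadata file gains fields.
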
### Summary Data (`summary.json`)
Optional AI-generated summaries with:
- Structured fields: problem statement, method overview, results
- Categorization: problem tags, technique tags
- Relevance scoring and recommended sections

## PDF Conversion

paperlib integrates with [MinerU](https://github.com/opendatalab/MinerU) for high-quality PDF to Markdown conversion:

```bash
# Install MinerU (optional); the quotes keep shells such as zsh from globbing the brackets
pip install "mineru[core]"

# Convert all pending papers
paperlib convert

# Retry failed conversions (useful after fixing system dependencies)
paperlib convert --retry-failed

# Force reconvert all papers
paperlib convert --force

# Convert specific paper
paperlib convert --paper-id <paper-id>
```

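One way to read the flags above is as a filter over each paper's `conversion_status`. The sketch below is only an illustration of that reading. The `should_convert` function is hypothetical, the status strings `"pending"` and `"failed"` are assumed values that this README does not specify, and whether `--retry-failed` also processes pending papers is likewise an assumption.

```python
def should_convert(status: str, *, retry_failed: bool = False, force: bool = False) -> bool:
    """Decide whether a paper with the given conversion status gets converted.

    The status strings are assumptions made for this sketch.
    """
    if force:
        # --force: reconvert everything, regardless of status
        return True
    if status == "pending":
        # default behaviour: only papers that have not been converted yet
        return True
    # --retry-failed: additionally pick up papers whose last conversion failed
    return retry_failed and status == "failed"
```
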
### Troubleshooting PDF Conversion

If conversion fails with OpenGL/display errors on headless systems:

```bash
# Check if MinerU is properly installed
uv run mineru --version

# If you get "libxcb.so.1" or similar errors, install OpenGL support:
sudo apt-get install libglvnd0   # Ubuntu/Debian
sudo pacman -S libglvnd          # Arch Linux
sudo dnf install libglvnd-glx    # Fedora

# Test conversion manually
mineru -p example.pdf -o /tmp/test_output -b pipeline

# Check paperlib conversion logs
cat path/to/library/papers/.../logs/mineru.log
```

## Machine-Readable Output

Most commands support `--json` output for automation and integration:

```bash
# Get library configuration in JSON
paperlib status --json

# List all papers with metadata
paperlib list --json

# Get detailed paper information
paperlib show <paper-id> --json

# Get import results
paperlib import --arxiv 2212.06340 --json

# Get conversion status and results
paperlib convert --json
paperlib convert --paper-id <paper-id> --json

# Get reindexing statistics
paperlib reindex --json
```

### JSON Output Format

All JSON responses follow a consistent envelope format:

```json
{
  "success": true,
  "timestamp": "2024-01-15T10:30:00.000Z",
  "data": { /* command-specific data */ }
}
```

For errors:
```json
{
  "success": false,
  "timestamp": "2024-01-15T10:30:00.000Z",
  "error": "Error message here",
  "error_code": 1
}
```

This structured output enables reliable automation, scripting, and integration with other tools. The JSON format is stable across paperlib versions.

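A consumer of this envelope usually wants the `data` payload on success and an exception otherwise. The sketch below shows one way to do that in Python. The `unwrap` function and `PaperlibCommandError` class are hypothetical helpers, not part of paperlib; only the envelope keys (`success`, `data`, `error`, `error_code`) come from the format above.

```python
from __future__ import annotations

import json
from typing import Any


class PaperlibCommandError(RuntimeError):
    """Raised when an envelope reports success == false (hypothetical helper)."""

    def __init__(self, message: str, code: int | None) -> None:
        super().__init__(message)
        self.code = code


def unwrap(text: str) -> Any:
    """Parse one JSON envelope and return its data, or raise on failure."""
    envelope = json.loads(text)
    if envelope.get("success"):
        return envelope.get("data")
    raise PaperlibCommandError(
        envelope.get("error", "unknown error"),
        envelope.get("error_code"),
    )
```

In a script, the input would typically be the captured output of a command such as `paperlib list --json`, for example via `subprocess.run([...], capture_output=True, text=True).stdout`, assuming the envelope is written to standard output.
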
## Development

paperlib is designed for extensibility and integration with higher-level tools.

### Running Tests

```bash
# Run all tests
uv run pytest

# Run specific test module
uv run pytest tests/test_models.py

# Run with coverage
uv run pytest --cov=paperlib
```

### Code Quality

```bash
# Format code
uv run ruff format

# Check linting
uv run ruff check

# Type checking
uv run mypy src/
```

## Architecture

paperlib follows clean architecture principles:

- **Models**: Data structures for papers and summaries
- **Storage**: File-based metadata and PDF management
- **Index**: SQLite search and retrieval layer
- **Importers**: PDF and arXiv import workflows
- **Converters**: PDF to Markdown transformation
- **CLI**: Command-line interface and argument parsing

## Roadmap

- [x] Core paper import (local PDF, arXiv)
- [x] PDF to Markdown conversion (MinerU integration)\*
- [x] Metadata management and search indexing
- [x] CLI with all basic commands
- [x] Comprehensive test suite
- [ ] Search command implementation
- [ ] AI summarization with provider abstraction
- [x] JSON output for core commands
- [ ] Configuration file support
- [ ] Advanced arXiv workflows

\***Note**: PDF conversion requires the `libglvnd` system dependency for OpenGL support on headless systems.

## Non-Goals

paperlib is intentionally focused and does NOT include:
- Web UI or GUI applications
- Multi-user or cloud-first features
- Mandatory daemon or background services
- Vector database requirements
- Fully autonomous research assistant behavior

## License

MIT License - see LICENSE file for details.

## Contributing

Contributions welcome! Please read the development guidelines in AGENTS.md and ensure all tests pass before submitting PRs.