docs: add docs
@@ -1,19 +1,213 @@
# paperlib

A local-first paper library engine with a CLI for managing academic papers.

**paperlib** imports PDF papers into a structured local library, converts them to Markdown with external converters such as MinerU, maintains stable per-paper metadata files, and provides a searchable index database. AI-based structured summaries are optional, and the library remains useful without them.

## Key Features

- **Local-first**: All data lives locally in the paper library directory
- **CLI-first**: All important workflows accessible from the command line
- **JSON source of truth**: Per-paper metadata files with rebuildable SQLite index
- **AI-optional**: Core workflows work without LLM configuration
- **Machine-readable**: `--json` output for automation and integration
- **Stable interfaces**: Designed for scripts and higher-level tools

## Installation

```bash
# Install with uv (recommended)
uv add paperlib

# Or with pip
pip install paperlib
```

## Quick Start

```bash
# Initialize a paper library
paperlib init

# Import a local PDF
paperlib import --pdf paper.pdf --title "My Research Paper"

# Import from arXiv
paperlib import --arxiv 2212.06340

# List all papers
paperlib list

# Show paper details
paperlib show <paper-id>

# Convert PDFs to Markdown (requires MinerU)
paperlib convert

# Search papers (planned; see Roadmap)
paperlib search "machine learning"

# Rebuild search index
paperlib reindex
```

## Core Commands

### Library Management
- `paperlib init [path]` - Initialize a paper library directory
- `paperlib status` - Show library configuration and layout
- `paperlib reindex` - Rebuild search index from stored papers

### Paper Import
- `paperlib import --pdf <path>` - Import a local PDF file
- `paperlib import --arxiv <id>` - Import paper from arXiv
- Options: `--title`, `--notes`, `--tags`, `--library`

### Paper Management
- `paperlib list` - List all imported papers with status
- `paperlib show <paper-id>` - Show detailed paper information
- `paperlib convert` - Convert pending papers to Markdown using MinerU

### Search (Future)
- `paperlib search <query>` - Search papers by content and metadata

## Library Structure

A paperlib library is organized as follows:

```
library_root/
├── config/
│   ├── config.toml
│   └── prompts/
├── papers/
│   ├── arxiv/
│   │   └── 2026/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json       # Paper metadata
│   │           ├── source.pdf      # Original PDF
│   │           ├── paper.md        # Converted markdown
│   │           ├── summary.json    # AI summary (optional)
│   │           ├── summary.md      # Rendered summary
│   │           ├── assets/         # Images, figures
│   │           └── logs/           # Conversion logs
│   └── local/
│       └── <hash>/
│           └── ...
├── db/
│   └── paperlib.sqlite3            # Search index (rebuildable)
├── inbox/                          # Temporary imports
└── cache/                          # Processing cache
```
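
The layout above suggests that an arXiv ID maps to a directory name by replacing the dot with an underscore. The following is a minimal sketch of that mapping, not paperlib's actual code: the function names `arxiv_dir_name` and `arxiv_paper_dir` are hypothetical, and the year segment is passed in explicitly because the rule for choosing it is not stated here.

```python
from pathlib import Path


def arxiv_dir_name(arxiv_id: str) -> str:
    """Map an arXiv ID such as '2212.06340' to a directory name."""
    # Assumption: dots become underscores, as in the `arxiv-2212_06340` example.
    return "arxiv-" + arxiv_id.replace(".", "_")


def arxiv_paper_dir(library_root: Path, year: str, arxiv_id: str) -> Path:
    """Build the per-paper directory under papers/arxiv/<year>/."""
    # The year is supplied by the caller; how paperlib picks it is not specified.
    return library_root / "papers" / "arxiv" / year / arxiv_dir_name(arxiv_id)


paper_dir = arxiv_paper_dir(Path("library_root"), "2026", "2212.06340")
print(paper_dir.as_posix())
print((paper_dir / "meta.json").as_posix())
```

Keeping the path logic in one small function would let importers, converters, and the index agree on where a paper lives.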

## Data Model

### Paper Metadata (`meta.json`)
Each paper has a `meta.json` file containing:
- Core identifiers: `paper_id`, `source_type`, `source_id`
- Bibliographic info: `title`, `authors`, `published_date`, `categories`
- File paths: `pdf_path`, `paper_md_path`, `summary_json_path`
- Processing status: `conversion_status`, `summary_status`
- User data: `tags`, `notes`
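
As an illustration of the fields listed above, here is a minimal sketch of a metadata record that round-trips through JSON. The field names come from the list; the class name `PaperMeta`, the field types, the defaults, and the status value `"pending"` are assumptions and may differ from paperlib's real model.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PaperMeta:
    # Core identifiers
    paper_id: str
    source_type: str
    source_id: str
    # Bibliographic info
    title: str = ""
    authors: List[str] = field(default_factory=list)
    published_date: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    # File paths (assumed to be stored as strings)
    pdf_path: Optional[str] = None
    paper_md_path: Optional[str] = None
    summary_json_path: Optional[str] = None
    # Processing status (the "pending" value is an assumption)
    conversion_status: str = "pending"
    summary_status: str = "pending"
    # User data
    tags: List[str] = field(default_factory=list)
    notes: str = ""


meta = PaperMeta(
    paper_id="arxiv-2212_06340",
    source_type="arxiv",
    source_id="2212.06340",
    tags=["example"],
)
text = json.dumps(asdict(meta), indent=2)   # what could be written to meta.json
restored = PaperMeta(**json.loads(text))    # what could be read back
print(restored == meta)
```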

### Summary Data (`summary.json`)
Optional AI-generated summaries with:
- Structured fields: problem statement, method overview, results
- Categorization: problem tags, technique tags
- Relevance scoring and recommended sections
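
To make the summary shape more concrete, the sketch below models the items listed above and renders them to Markdown, roughly in the spirit of `summary.md`. Every identifier here (`PaperSummary`, `render_summary_md`, and all field names) is hypothetical, since the actual `summary.json` keys are not given in this README.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PaperSummary:
    # Structured fields (names are assumptions)
    problem_statement: str = ""
    method_overview: str = ""
    results: str = ""
    # Categorization
    problem_tags: List[str] = field(default_factory=list)
    technique_tags: List[str] = field(default_factory=list)
    # Relevance scoring and recommended sections
    relevance_score: float = 0.0
    recommended_sections: List[str] = field(default_factory=list)


def render_summary_md(summary: PaperSummary) -> str:
    """Render a summary as a small Markdown document."""
    lines = [
        "## Problem", summary.problem_statement, "",
        "## Method", summary.method_overview, "",
        "## Results", summary.results, "",
        "Tags: " + ", ".join(summary.problem_tags + summary.technique_tags),
    ]
    return "\n".join(lines)


example = PaperSummary(
    problem_statement="Placeholder problem.",
    method_overview="Placeholder method.",
    results="Placeholder results.",
    problem_tags=["a"],
    technique_tags=["b"],
)
md = render_summary_md(example)
print(md)
```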

## PDF Conversion

paperlib integrates with [MinerU](https://github.com/opendatalab/MinerU) for high-quality PDF to Markdown conversion:

```bash
# Install MinerU (optional); the quotes keep shells such as zsh from expanding the brackets
pip install "mineru[core]"

# Convert all pending papers
paperlib convert

# Convert specific paper
paperlib convert --paper-id <paper-id>
```

## Machine-Readable Output

Most commands support `--json` output for automation:

```bash
paperlib list --json
paperlib show <paper-id> --json
paperlib status --json
```
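
A script could consume this output along the following lines. This is a sketch under assumptions: the JSON schema is not documented here, so the payload shape (a list of objects with `paper_id` and `conversion_status` keys) and the helper name `count_by_status` are hypothetical.

```python
import json
from collections import Counter


def count_by_status(list_json: str) -> Counter:
    """Count papers per conversion status from JSON text."""
    papers = json.loads(list_json)
    return Counter(p.get("conversion_status", "unknown") for p in papers)


# Hypothetical payload. In practice the text might come from:
#   subprocess.run(["paperlib", "list", "--json"],
#                  capture_output=True, text=True, check=True).stdout
sample = json.dumps([
    {"paper_id": "a", "conversion_status": "done"},
    {"paper_id": "b", "conversion_status": "pending"},
    {"paper_id": "c", "conversion_status": "pending"},
])

counts = count_by_status(sample)
print(counts["pending"], counts["done"])
```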

## Development

paperlib is designed for extensibility and integration with higher-level tools.

### Running Tests

```bash
# Run all tests
uv run pytest

# Run specific test module
uv run pytest tests/test_models.py

# Run with coverage
uv run pytest --cov=paperlib
```

### Code Quality

```bash
# Format code
uv run ruff format

# Check linting
uv run ruff check

# Type checking
uv run mypy src/
```

## Architecture

paperlib follows clean architecture principles:

- **Models**: Data structures for papers and summaries
- **Storage**: File-based metadata and PDF management
- **Index**: SQLite search and retrieval layer
- **Importers**: PDF and arXiv import workflows
- **Converters**: PDF to Markdown transformation
- **CLI**: Command-line interface and argument parsing
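
Because the JSON files are the source of truth, the index layer can be rebuilt from them at any time. The sketch below shows one way such a rebuild could work. It is an illustration, not paperlib's implementation: the function name `rebuild_index`, the `papers` table, and its columns are assumptions.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def rebuild_index(library_root: Path, conn: sqlite3.Connection) -> int:
    """Recreate a `papers` table from every meta.json under papers/."""
    conn.execute("DROP TABLE IF EXISTS papers")
    conn.execute(
        "CREATE TABLE papers (paper_id TEXT PRIMARY KEY, title TEXT, tags TEXT)"
    )
    count = 0
    for meta_path in sorted((library_root / "papers").rglob("meta.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT INTO papers VALUES (?, ?, ?)",
            (meta["paper_id"], meta.get("title", ""), ",".join(meta.get("tags", []))),
        )
        count += 1
    conn.commit()
    return count


# Demo against a throwaway library with a single paper.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    paper_dir = root / "papers" / "local" / "abc123"
    paper_dir.mkdir(parents=True)
    (paper_dir / "meta.json").write_text(
        json.dumps({"paper_id": "local-abc123", "title": "Demo", "tags": ["x", "y"]}),
        encoding="utf-8",
    )
    conn = sqlite3.connect(":memory:")
    indexed = rebuild_index(root, conn)
    rows = conn.execute("SELECT paper_id, title, tags FROM papers").fetchall()
    conn.close()

print(indexed, rows)
```

Dropping and recreating the table keeps the rebuild idempotent: running it twice yields the same index, which is what makes the database safe to delete.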

## Roadmap

- [x] Core paper import (local PDF, arXiv)
- [x] PDF to Markdown conversion (MinerU integration)
- [x] Metadata management and search indexing
- [x] CLI with all basic commands
- [x] Comprehensive test suite
- [ ] Search command implementation
- [ ] AI summarization with provider abstraction
- [ ] JSON output for all commands
- [ ] Configuration file support
- [ ] Advanced arXiv workflows

## Non-Goals

paperlib is intentionally focused and does NOT include:
- Web UI or GUI applications
- Multi-user or cloud-first features
- Mandatory daemon or background services
- Vector database requirements
- Fully autonomous research assistant behavior

## License

MIT License - see LICENSE file for details.

## Contributing

Contributions welcome! Please read the development guidelines in AGENTS.md and ensure all tests pass before submitting PRs.