docs: add docs

2026-04-17 16:54:30 -04:00
parent 74d140e5f8
commit 432010f431
10 changed files with 1682 additions and 19 deletions
@@ -1,19 +1,213 @@
-# `paperlib`: a CLI tool to manage paper library
+# paperlib

-This project use `mineru` to convert PDF to markdown, and establish a markdown paper library.
+A local-first paper library engine with a CLI for managing academic papers.

-## usage
+**paperlib** is designed to import PDF papers into a structured local library, convert PDFs into Markdown using external converters, maintain stable per-paper metadata files, and provide a searchable index database. It offers optional AI-based structured summaries while remaining useful even without AI features.
+
+## Key Features
+
+- **Local-first**: All data lives locally in the paper library directory
+- **CLI-first**: All important workflows accessible from the command line
+- **JSON source of truth**: Per-paper metadata files with rebuildable SQLite index
+- **AI-optional**: Core workflows work without LLM configuration
+- **Machine-readable**: `--json` output for automation and integration
+- **Stable interfaces**: Designed for scripts and higher-level tools
+
+## Installation

 ```bash
-# init a library in current directory
+# Add to a uv-managed project (recommended)
+uv add paperlib
+
+# Or with pip
+pip install paperlib
+```
+
+## Quick Start
+
+```bash
+# Initialize a paper library
 paperlib init

-# manually import a PDF
-paperlib import --pdf <path to pdf> [--arxiv-id xxxx.xxxxx]
+# Import a local PDF
+paperlib import --pdf paper.pdf --title "My Research Paper"

-# import an arXiv paper
-paperlib import --arxiv xxxx.xxxxx
+# Import from arXiv
+paperlib import --arxiv 2212.06340

-# place holder
-...
+# List all papers
+paperlib list
+
+# Show paper details
+paperlib show <paper-id>
+
+# Convert PDFs to Markdown (requires MinerU)
+paperlib convert
+
+# Search papers (planned; not yet implemented, see Roadmap)
+paperlib search "machine learning"
+
+# Rebuild search index
+paperlib reindex
 ```
+
+## Core Commands
+
+### Library Management
+- `paperlib init [path]` - Initialize a paper library directory
+- `paperlib status` - Show library configuration and layout
+- `paperlib reindex` - Rebuild search index from stored papers
+
+### Paper Import
+- `paperlib import --pdf <path>` - Import a local PDF file
+- `paperlib import --arxiv <id>` - Import paper from arXiv
+- Options: `--title`, `--notes`, `--tags`, `--library`
+
+### Paper Management
+- `paperlib list` - List all imported papers with status
+- `paperlib show <paper-id>` - Show detailed paper information
+- `paperlib convert` - Convert pending papers to Markdown using MinerU
+
+### Search (Future)
+- `paperlib search <query>` - Search papers by content and metadata
+
+## Library Structure
+
+A paperlib library is organized as follows:
+
+```
+library_root/
+├── config/
+│   ├── config.toml
+│   └── prompts/
+├── papers/
+│   ├── arxiv/
+│   │   └── 2026/
+│   │       └── arxiv-2212_06340/
+│   │           ├── meta.json          # Paper metadata
+│   │           ├── source.pdf         # Original PDF
+│   │           ├── paper.md           # Converted markdown
+│   │           ├── summary.json       # AI summary (optional)
+│   │           ├── summary.md         # Rendered summary
+│   │           ├── assets/            # Images, figures
+│   │           └── logs/              # Conversion logs
+│   └── local/
+│       └── <hash>/
+│           └── ...
+├── db/
+│   └── paperlib.sqlite3              # Search index (rebuildable)
+├── inbox/                             # Temporary imports
+└── cache/                            # Processing cache
+```
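
The layout above suggests that each paper can be discovered by locating its `meta.json`, and that the SQLite index can be rebuilt from those files alone. The Python sketch below illustrates that idea; it is not paperlib's actual implementation. The `papers` table schema and the helper names `arxiv_dir_name`, `iter_meta_files`, and `rebuild_index` are assumptions made for illustration. Only the `papers/` directory, the `meta.json` file name, the `arxiv-2212_06340` naming pattern, and the `paper_id` and `title` fields come from this README.

```python
import json
import sqlite3
from pathlib import Path
from typing import Iterator


def arxiv_dir_name(arxiv_id: str) -> str:
    """Map an arXiv ID to a directory name following the tree above.

    Assumes the pattern is "arxiv-" plus the ID with "." replaced by "_".
    """
    return "arxiv-" + arxiv_id.replace(".", "_")


def iter_meta_files(library_root: Path) -> Iterator[Path]:
    """Yield every meta.json under <library_root>/papers, in sorted order."""
    yield from sorted((library_root / "papers").rglob("meta.json"))


def rebuild_index(library_root: Path, conn: sqlite3.Connection) -> int:
    """Drop and recreate a hypothetical `papers` table from the meta.json files.

    Returns the number of papers indexed.
    """
    conn.execute("DROP TABLE IF EXISTS papers")
    conn.execute(
        "CREATE TABLE papers (paper_id TEXT PRIMARY KEY, title TEXT, meta_path TEXT)"
    )
    count = 0
    for meta_path in iter_meta_files(library_root):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT INTO papers VALUES (?, ?, ?)",
            (meta["paper_id"], meta.get("title"), str(meta_path)),
        )
        count += 1
    conn.commit()
    return count
```

Because the table is dropped and recreated on every run, the index never holds state that cannot be recovered from the JSON files, which is the property the "rebuildable" label describes.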
+
+## Data Model
+
+### Paper Metadata (`meta.json`)
+Each paper has a `meta.json` file containing:
+- Core identifiers: `paper_id`, `source_type`, `source_id`
+- Bibliographic info: `title`, `authors`, `published_date`, `categories`
+- File paths: `pdf_path`, `paper_md_path`, `summary_json_path`
+- Processing status: `conversion_status`, `summary_status`
+- User data: `tags`, `notes`
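
As a rough illustration of the fields listed above, the following Python sketch models `meta.json` as a dataclass with a JSON round trip. The field names come from the list; the types, the defaults, the `"pending"` status value, and the class name `PaperMeta` are assumptions, and the real schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PaperMeta:
    # Core identifiers
    paper_id: str
    source_type: str
    source_id: str
    # Bibliographic info
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    published_date: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    # File paths
    pdf_path: Optional[str] = None
    paper_md_path: Optional[str] = None
    summary_json_path: Optional[str] = None
    # Processing status ("pending" is an assumed placeholder value)
    conversion_status: str = "pending"
    summary_status: str = "pending"
    # User data
    tags: List[str] = field(default_factory=list)
    notes: str = ""

    @classmethod
    def from_json(cls, text: str) -> "PaperMeta":
        """Parse a meta.json document; unknown keys raise TypeError."""
        return cls(**json.loads(text))

    def to_json(self) -> str:
        """Serialize to indented JSON, keeping non-ASCII characters readable."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)
```

Rejecting unknown keys is a deliberate choice in this sketch: it surfaces schema drift early, at the cost of being strict about files written by newer versions.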
+
+### Summary Data (`summary.json`)
+Optional AI-generated summaries with:
+- Structured fields: problem statement, method overview, results
+- Categorization: problem tags, technique tags
+- Relevance scoring and recommended sections
+
+## PDF Conversion
+
+paperlib integrates with [MinerU](https://github.com/opendatalab/MinerU) for high-quality PDF to Markdown conversion:
+
+```bash
+# Install MinerU (optional)
+pip install "mineru[core]"
+
+# Convert all pending papers
+paperlib convert
+
+# Convert specific paper
+paperlib convert --paper-id <paper-id>
+```
+
+## Machine-Readable Output
+
+Most commands support `--json` output for automation:
+
+```bash
+paperlib list --json
+paperlib show <paper-id> --json
+paperlib status --json
+```
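
A script can consume this output by running the command and parsing its stdout. The Python sketch below assumes that a `--json` command prints a single JSON document to stdout and exits with status 0 on success; the shape of that document is not specified in this README, so the helper `run_json` (a hypothetical name) returns whatever was parsed.

```python
import json
import subprocess
from typing import Any, Sequence


def run_json(command: Sequence[str]) -> Any:
    """Run a command and parse its stdout as a single JSON document.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    json.JSONDecodeError if stdout is not valid JSON.
    """
    result = subprocess.run(
        list(command), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)
```

With paperlib on `PATH` and an initialized library, a call such as `run_json(["paperlib", "list", "--json"])` would return the parsed listing.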
+
+## Development
+
+paperlib is designed for extensibility and integration with higher-level tools.
+
+### Running Tests
+
+```bash
+# Run all tests
+uv run pytest
+
+# Run specific test module
+uv run pytest tests/test_models.py
+
+# Run with coverage
+uv run pytest --cov=paperlib
+```
+
+### Code Quality
+
+```bash
+# Format code
+uv run ruff format
+
+# Check linting
+uv run ruff check
+
+# Type checking
+uv run mypy src/
+```
+
+## Architecture
+
+paperlib follows clean architecture principles:
+
+- **Models**: Data structures for papers and summaries
+- **Storage**: File-based metadata and PDF management
+- **Index**: SQLite search and retrieval layer
+- **Importers**: PDF and arXiv import workflows
+- **Converters**: PDF to Markdown transformation
+- **CLI**: Command-line interface and argument parsing
+
+## Roadmap
+
+- [x] Core paper import (local PDF, arXiv)
+- [x] PDF to Markdown conversion (MinerU integration)
+- [x] Metadata management and search indexing
+- [x] CLI with all basic commands
+- [x] Comprehensive test suite
+- [ ] Search command implementation
+- [ ] AI summarization with provider abstraction
+- [ ] JSON output for all commands
+- [ ] Configuration file support
+- [ ] Advanced arXiv workflows
+
+## Non-Goals
+
+paperlib is intentionally focused and does NOT include:
+- Web UI or GUI applications
+- Multi-user or cloud-first features
+- Mandatory daemon or background services
+- Vector database requirements
+- Fully autonomous research assistant behavior
+
+## License
+
+MIT License - see LICENSE file for details.
+
+## Contributing
+
+Contributions welcome! Please read the development guidelines in AGENTS.md and ensure all tests pass before submitting PRs.