docs: add docs
This commit is contained in:
@@ -1,19 +1,213 @@
|
||||
# `paperlib`: a CLI tool to manage paper library
|
||||
# paperlib
|
||||
|
||||
This project use `mineru` to convert PDF to markdown, and establish a markdown paper library.
|
||||
A local-first paper library engine with a CLI for managing academic papers.
|
||||
|
||||
## usage
|
||||
**paperlib** is designed to import PDF papers into a structured local library, convert PDFs into Markdown using external converters, maintain stable per-paper metadata files, and provide a searchable index database. It offers optional AI-based structured summaries while remaining useful even without AI features.
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Local-first**: All data lives locally in the paper library directory
|
||||
- **CLI-first**: All important workflows accessible from the command line
|
||||
- **JSON source of truth**: Per-paper metadata files with rebuildable SQLite index
|
||||
- **AI-optional**: Core workflows work without LLM configuration
|
||||
- **Machine-readable**: `--json` output for automation and integration
|
||||
- **Stable interfaces**: Designed for scripts and higher-level tools
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
# init a library in current directory
|
||||
# Install with uv (recommended)
|
||||
uv add paperlib
|
||||
|
||||
# Or with pip
|
||||
pip install paperlib
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# Initialize a paper library
|
||||
paperlib init
|
||||
|
||||
# manually import a PDF
|
||||
paperlib import --pdf <path to pdf> [--arxiv-id xxxx.xxxxx]
|
||||
# Import a local PDF
|
||||
paperlib import --pdf paper.pdf --title "My Research Paper"
|
||||
|
||||
# import an arXiv paper
|
||||
paperlib import --arxiv xxxx.xxxxx
|
||||
# Import from arXiv
|
||||
paperlib import --arxiv 2212.06340
|
||||
|
||||
# place holder
|
||||
...
|
||||
# List all papers
|
||||
paperlib list
|
||||
|
||||
# Show paper details
|
||||
paperlib show <paper-id>
|
||||
|
||||
# Convert PDFs to Markdown (requires MinerU)
|
||||
paperlib convert
|
||||
|
||||
# Search papers
|
||||
paperlib search "machine learning"
|
||||
|
||||
# Rebuild search index
|
||||
paperlib reindex
|
||||
```
|
||||
|
||||
## Core Commands
|
||||
|
||||
### Library Management
|
||||
- `paperlib init [path]` - Initialize a paper library directory
|
||||
- `paperlib status` - Show library configuration and layout
|
||||
- `paperlib reindex` - Rebuild search index from stored papers
|
||||
|
||||
### Paper Import
|
||||
- `paperlib import --pdf <path>` - Import a local PDF file
|
||||
- `paperlib import --arxiv <id>` - Import paper from arXiv
|
||||
- Options: `--title`, `--notes`, `--tags`, `--library`
|
||||
|
||||
### Paper Management
|
||||
- `paperlib list` - List all imported papers with status
|
||||
- `paperlib show <paper-id>` - Show detailed paper information
|
||||
- `paperlib convert` - Convert pending papers to Markdown using MinerU
|
||||
|
||||
### Search (Future)
|
||||
- `paperlib search <query>` - Search papers by content and metadata
|
||||
|
||||
## Library Structure
|
||||
|
||||
A paperlib library is organized as follows:
|
||||
|
||||
```
|
||||
library_root/
|
||||
├── config/
|
||||
│ ├── config.toml
|
||||
│ └── prompts/
|
||||
├── papers/
|
||||
│ ├── arxiv/
|
||||
│ │ └── 2026/
|
||||
│ │ └── arxiv-2212_06340/
|
||||
│ │ ├── meta.json # Paper metadata
|
||||
│ │ ├── source.pdf # Original PDF
|
||||
│ │ ├── paper.md # Converted markdown
|
||||
│ │ ├── summary.json # AI summary (optional)
|
||||
│ │ ├── summary.md # Rendered summary
|
||||
│ │ ├── assets/ # Images, figures
|
||||
│ │ └── logs/ # Conversion logs
|
||||
│ └── local/
|
||||
│ └── <hash>/
|
||||
│ └── ...
|
||||
├── db/
|
||||
│ └── paperlib.sqlite3 # Search index (rebuildable)
|
||||
├── inbox/ # Temporary imports
|
||||
└── cache/ # Processing cache
|
||||
```
|
||||
|
||||
## Data Model
|
||||
|
||||
### Paper Metadata (`meta.json`)
|
||||
Each paper has a `meta.json` file containing:
|
||||
- Core identifiers: `paper_id`, `source_type`, `source_id`
|
||||
- Bibliographic info: `title`, `authors`, `published_date`, `categories`
|
||||
- File paths: `pdf_path`, `paper_md_path`, `summary_json_path`
|
||||
- Processing status: `conversion_status`, `summary_status`
|
||||
- User data: `tags`, `notes`
|
||||
|
||||
### Summary Data (`summary.json`)
|
||||
Optional AI-generated summaries with:
|
||||
- Structured fields: problem statement, method overview, results
|
||||
- Categorization: problem tags, technique tags
|
||||
- Relevance scoring and recommended sections
|
||||
|
||||
## PDF Conversion
|
||||
|
||||
paperlib integrates with [MinerU](https://github.com/opendatalab/MinerU) for high-quality PDF to Markdown conversion:
|
||||
|
||||
```bash
|
||||
# Install MinerU (optional)
|
||||
pip install mineru[core]
|
||||
|
||||
# Convert all pending papers
|
||||
paperlib convert
|
||||
|
||||
# Convert specific paper
|
||||
paperlib convert --paper-id <paper-id>
|
||||
```
|
||||
|
||||
## Machine-Readable Output
|
||||
|
||||
Most commands support `--json` output for automation:
|
||||
|
||||
```bash
|
||||
paperlib list --json
|
||||
paperlib show <paper-id> --json
|
||||
paperlib status --json
|
||||
```
|
||||
|
||||
## Development
|
||||
|
||||
paperlib is designed for extensibility and integration with higher-level tools.
|
||||
|
||||
### Running Tests
|
||||
|
||||
```bash
|
||||
# Run all tests
|
||||
uv run pytest
|
||||
|
||||
# Run specific test module
|
||||
uv run pytest tests/test_models.py
|
||||
|
||||
# Run with coverage
|
||||
uv run pytest --cov=paperlib
|
||||
```
|
||||
|
||||
### Code Quality
|
||||
|
||||
```bash
|
||||
# Format code
|
||||
uv run ruff format
|
||||
|
||||
# Check linting
|
||||
uv run ruff check
|
||||
|
||||
# Type checking
|
||||
uv run mypy src/
|
||||
```
|
||||
|
||||
## Architecture
|
||||
|
||||
paperlib follows clean architecture principles:
|
||||
|
||||
- **Models**: Data structures for papers and summaries
|
||||
- **Storage**: File-based metadata and PDF management
|
||||
- **Index**: SQLite search and retrieval layer
|
||||
- **Importers**: PDF and arXiv import workflows
|
||||
- **Converters**: PDF to Markdown transformation
|
||||
- **CLI**: Command-line interface and argument parsing
|
||||
|
||||
## Roadmap
|
||||
|
||||
- [x] Core paper import (local PDF, arXiv)
|
||||
- [x] PDF to Markdown conversion (MinerU integration)
|
||||
- [x] Metadata management and search indexing
|
||||
- [x] CLI with all basic commands
|
||||
- [x] Comprehensive test suite
|
||||
- [ ] Search command implementation
|
||||
- [ ] AI summarization with provider abstraction
|
||||
- [ ] JSON output for all commands
|
||||
- [ ] Configuration file support
|
||||
- [ ] Advanced arXiv workflows
|
||||
|
||||
## Non-Goals
|
||||
|
||||
paperlib is intentionally focused and does NOT include:
|
||||
- Web UI or GUI applications
|
||||
- Multi-user or cloud-first features
|
||||
- Mandatory daemon or background services
|
||||
- Vector database requirements
|
||||
- Fully autonomous research assistant behavior
|
||||
|
||||
## License
|
||||
|
||||
MIT License - see LICENSE file for details.
|
||||
|
||||
## Contributing
|
||||
|
||||
Contributions welcome! Please read the development guidelines in AGENTS.md and ensure all tests pass before submitting PRs.
|
||||
+288
@@ -0,0 +1,288 @@
|
||||
# CLI Reference
|
||||
|
||||
This document describes all available commands in the paperlib CLI.
|
||||
|
||||
## Global Options
|
||||
|
||||
All commands support these global options:
|
||||
|
||||
- `--help`, `-h`: Show help message
|
||||
- `--version`: Show version information
|
||||
|
||||
Many commands also support:
|
||||
- `--library`, `-L`: Specify library root directory (default: current directory)
|
||||
- `--json`: Output machine-readable JSON instead of human-readable format
|
||||
|
||||
## Commands
|
||||
|
||||
### `paperlib init [PATH]`
|
||||
|
||||
Initialize a paper library directory structure.
|
||||
|
||||
**Arguments:**
|
||||
- `PATH`: Directory to initialize (default: current directory)
|
||||
|
||||
**Examples:**
|
||||
```bash
|
||||
# Initialize library in current directory
|
||||
paperlib init
|
||||
|
||||
# Initialize library in specific directory
|
||||
paperlib init /path/to/my/papers
|
||||
|
||||
# Initialize and create parent directories
|
||||
paperlib init ~/Documents/research/papers
|
||||
```
|
||||
|
||||
**Behavior:**
|
||||
- Creates standard directory structure (config/, papers/, db/, etc.)
|
||||
- Safe to run multiple times (idempotent)
|
||||
- Creates parent directories if they don't exist
|
||||
|
||||
---
|
||||
|
||||
### `paperlib import`
|
||||
|
||||
Import papers into the library from various sources.
|
||||
|
||||
**Required (one of):**
|
||||
- `--pdf PATH`: Import a local PDF file
|
||||
- `--arxiv ID`: Import paper from arXiv by ID or URL
|
||||
|
||||
**Options:**
|
||||
- `--title TEXT`: Override paper title (for local PDFs)
|
||||
- `--notes TEXT`: Add notes about the paper
|
||||
- `--tags TAG1 TAG2`: Add tags to the paper
|
||||
- `--library PATH`: Specify library directory
|
||||
|
||||
**Examples:**
|
||||
```bash
|
||||
# Import local PDF
|
||||
paperlib import --pdf paper.pdf --title "My Research" --tags ml ai
|
||||
|
||||
# Import from arXiv
|
||||
paperlib import --arxiv 2212.06340
|
||||
|
||||
# Import with arXiv URL
|
||||
paperlib import --arxiv https://arxiv.org/abs/2212.06340
|
||||
|
||||
# Import to specific library
|
||||
paperlib import --pdf paper.pdf --library ~/research
|
||||
```
|
||||
|
||||
**Behavior:**
|
||||
- Generates stable paper ID based on content (local) or arXiv ID
|
||||
- Copies PDF to structured storage location
|
||||
- Creates meta.json with paper metadata
|
||||
- Prevents duplicate imports (same content/ID)
|
||||
- Indexes paper in search database
|
||||
|
||||
---
|
||||
|
||||
### `paperlib list`
|
||||
|
||||
List all papers in the library with their current status.
|
||||
|
||||
**Options:**
|
||||
- `--library PATH`: Specify library directory
|
||||
- `--json`: Output in JSON format
|
||||
|
||||
**Examples:**
|
||||
```bash
|
||||
# List all papers
|
||||
paperlib list
|
||||
|
||||
# List papers in specific library
|
||||
paperlib list --library ~/research
|
||||
|
||||
# Get machine-readable output
|
||||
paperlib list --json
|
||||
```
|
||||
|
||||
**Output Format:**
|
||||
```
|
||||
Found 3 papers:
|
||||
|
||||
📄 arxiv-2212_06340
|
||||
The new discontinuous Galerkin methods based numerical relativity program Nmesh
|
||||
By: Wolfgang Tichy, Liwei Ji, Ananya Adhikari (+2 more)
|
||||
Categories: gr-qc
|
||||
|
||||
⏳ local-a1b2c3d4e5f6
|
||||
Machine Learning Applications in Physics
|
||||
Categories: cs.AI, physics.comp-ph
|
||||
```
|
||||
|
||||
**Status Indicators:**
|
||||
- ⏳ Paper imported, conversion pending
|
||||
- 📄 PDF converted to Markdown
|
||||
- 📝 AI summary generated
|
||||
- ❌ Conversion or processing failed
|
||||
|
||||
---
|
||||
|
||||
### `paperlib show PAPER_ID`
|
||||
|
||||
Show detailed information about a specific paper.
|
||||
|
||||
**Arguments:**
|
||||
- `PAPER_ID`: The unique paper identifier
|
||||
|
||||
**Options:**
|
||||
- `--library PATH`: Specify library directory
|
||||
- `--json`: Output in JSON format
|
||||
|
||||
**Examples:**
|
||||
```bash
|
||||
# Show paper details
|
||||
paperlib show arxiv-2212_06340
|
||||
|
||||
# Show with JSON output
|
||||
paperlib show local-a1b2c3d4 --json
|
||||
```
|
||||
|
||||
**Output includes:**
|
||||
- All metadata fields
|
||||
- Processing status
|
||||
- File locations and existence
|
||||
- Import timestamp
|
||||
- Tags and notes
|
||||
|
||||
---
|
||||
|
||||
### `paperlib convert`
|
||||
|
||||
Convert papers from PDF to Markdown using MinerU.
|
||||
|
||||
**Options:**
|
||||
- `--library PATH`: Specify library directory
|
||||
- `--paper-id ID`: Convert specific paper only
|
||||
|
||||
**Examples:**
|
||||
```bash
|
||||
# Convert all pending papers
|
||||
paperlib convert
|
||||
|
||||
# Convert specific paper
|
||||
paperlib convert --paper-id arxiv-2212_06340
|
||||
|
||||
# Convert in specific library
|
||||
paperlib convert --library ~/research
|
||||
```
|
||||
|
||||
**Behavior:**
|
||||
- Processes papers with `conversion_status: pending`
|
||||
- Uses MinerU for PDF to Markdown conversion
|
||||
- Updates metadata with conversion status
|
||||
- Creates conversion logs in `logs/` directory
|
||||
- Handles conversion failures gracefully
|
||||
|
||||
---
|
||||
|
||||
### `paperlib reindex`
|
||||
|
||||
Rebuild the search index from stored paper metadata.
|
||||
|
||||
**Options:**
|
||||
- `--library PATH`: Specify library directory
|
||||
|
||||
**Examples:**
|
||||
```bash
|
||||
# Rebuild index
|
||||
paperlib reindex
|
||||
|
||||
# Rebuild index for specific library
|
||||
paperlib reindex --library ~/research
|
||||
```
|
||||
|
||||
**Behavior:**
|
||||
- Clears existing SQLite database
|
||||
- Scans all meta.json files in papers/ directory
|
||||
- Rebuilds full-text search index
|
||||
- Reports statistics on completion
|
||||
- Safe to run anytime (repairs corrupted index)
|
||||
|
||||
---
|
||||
|
||||
### `paperlib status`
|
||||
|
||||
Show library configuration and layout information.
|
||||
|
||||
**Options:**
|
||||
- `--library PATH`: Specify library directory
|
||||
- `--json`: Output in JSON format
|
||||
|
||||
**Examples:**
|
||||
```bash
|
||||
# Show current library status
|
||||
paperlib status
|
||||
|
||||
# Show specific library status
|
||||
paperlib status --library ~/research
|
||||
```
|
||||
|
||||
**Output:**
|
||||
```
|
||||
root: /home/user/papers
|
||||
config: /home/user/papers/config/config.toml
|
||||
database: /home/user/papers/db/paperlib.sqlite3
|
||||
papers: /home/user/papers/papers
|
||||
inbox: /home/user/papers/inbox
|
||||
cache: /home/user/papers/cache
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Future Commands
|
||||
|
||||
These commands are planned but not yet implemented:
|
||||
|
||||
### `paperlib search QUERY`
|
||||
Search papers by content and metadata.
|
||||
|
||||
### `paperlib summarize [PAPER_ID]`
|
||||
Generate AI summaries for papers.
|
||||
|
||||
### `paperlib export FORMAT`
|
||||
Export papers in various formats.
|
||||
|
||||
### `paperlib doctor`
|
||||
Diagnose and repair library issues.
|
||||
|
||||
---
|
||||
|
||||
## Exit Codes
|
||||
|
||||
paperlib commands return standard exit codes:
|
||||
|
||||
- `0`: Success
|
||||
- `1`: General error (file not found, invalid arguments, etc.)
|
||||
- `2`: Command line argument error
|
||||
|
||||
## Configuration
|
||||
|
||||
paperlib looks for configuration in these locations (in order):
|
||||
1. `$LIBRARY_ROOT/config/config.toml`
|
||||
2. `~/.config/paperlib/config.toml`
|
||||
3. Built-in defaults
|
||||
|
||||
## JSON Output Format
|
||||
|
||||
When using `--json`, commands output structured data suitable for programmatic consumption:
|
||||
|
||||
```json
|
||||
{
|
||||
"papers": [
|
||||
{
|
||||
"paper_id": "arxiv-2212_06340",
|
||||
"title": "Example Paper",
|
||||
"authors": ["Alice Smith", "Bob Jones"],
|
||||
"conversion_status": "success",
|
||||
"imported_at": "2024-01-15T10:30:00"
|
||||
}
|
||||
],
|
||||
"total": 1
|
||||
}
|
||||
```
|
||||
|
||||
This format is stable across paperlib versions for reliable automation.
|
||||
@@ -0,0 +1,638 @@
|
||||
# Integration Guide
|
||||
|
||||
This document describes how to integrate paperlib with higher-level tools and automation workflows.
|
||||
|
||||
## Overview
|
||||
|
||||
paperlib is designed as a **library engine** that higher-level tools can build upon. It provides:
|
||||
|
||||
- **Stable CLI interface** with machine-readable JSON output
|
||||
- **File-based storage** that external tools can read directly
|
||||
- **Python API** for programmatic access
|
||||
- **Event hooks** for workflow integration (future)
|
||||
|
||||
## CLI Integration
|
||||
|
||||
### Machine-Readable Output
|
||||
|
||||
Most paperlib commands support `--json` output for automation:
|
||||
|
||||
```bash
|
||||
# Get library statistics
|
||||
paperlib status --json
|
||||
{
|
||||
"library_root": "/home/user/papers",
|
||||
"total_papers": 42,
|
||||
"by_status": {"converted": 38, "pending": 4},
|
||||
"last_updated": "2024-01-15T10:30:00Z"
|
||||
}
|
||||
|
||||
# List papers with metadata
|
||||
paperlib list --json
|
||||
{
|
||||
"papers": [
|
||||
{
|
||||
"paper_id": "arxiv-2212_06340",
|
||||
"title": "Example Paper",
|
||||
"authors": ["Alice Smith", "Bob Jones"],
|
||||
"categories": ["cs.AI"],
|
||||
"conversion_status": "success",
|
||||
"summary_status": "pending",
|
||||
"imported_at": "2024-01-15T10:30:00Z"
|
||||
}
|
||||
],
|
||||
"total": 1
|
||||
}
|
||||
|
||||
# Import with JSON response
|
||||
paperlib import --arxiv 2212.06340 --json
|
||||
{
|
||||
"success": true,
|
||||
"paper_id": "arxiv-2212_06340",
|
||||
"title": "Example Paper Title",
|
||||
"message": "Successfully imported arXiv paper"
|
||||
}
|
||||
```
|
||||
|
||||
### Exit Codes
|
||||
|
||||
paperlib commands follow standard Unix exit code conventions:
|
||||
|
||||
```bash
|
||||
paperlib import --arxiv 2212.06340
|
||||
echo $? # 0 for success, 1 for error
|
||||
|
||||
# Check if paper exists before processing
|
||||
if paperlib show "$paper_id" --json >/dev/null 2>&1; then
|
||||
echo "Paper exists"
|
||||
else
|
||||
echo "Paper not found"
|
||||
fi
|
||||
```
|
||||
|
||||
### Scripting Examples
|
||||
|
||||
#### Daily arXiv Import
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# daily-arxiv.sh - Import papers from daily arXiv feed
|
||||
|
||||
LIBRARY="$HOME/research"
|
||||
ARXIV_FEED_URL="http://export.arxiv.org/rss/cs.AI"
|
||||
|
||||
# Parse RSS feed and extract arXiv IDs
|
||||
curl -s "$ARXIV_FEED_URL" | \
|
||||
grep -oP 'arxiv\.org/abs/\K[0-9]{4}\.[0-9]{4,5}' | \
|
||||
while read arxiv_id; do
|
||||
echo "Importing $arxiv_id..."
|
||||
paperlib import --arxiv "$arxiv_id" --library "$LIBRARY" --json
|
||||
done
|
||||
|
||||
# Convert newly imported papers
|
||||
paperlib convert --library "$LIBRARY"
|
||||
|
||||
# Generate daily report
|
||||
paperlib list --library "$LIBRARY" --json | \
|
||||
jq '.papers | map(select(.imported_at | startswith(now | strftime("%Y-%m-%d"))))'
|
||||
```
|
||||
|
||||
#### Batch Processing
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# batch-process.sh - Process multiple papers from a list
|
||||
|
||||
LIBRARY="$HOME/research"
|
||||
PAPER_LIST="papers.txt"
|
||||
|
||||
while IFS= read -r pdf_path; do
|
||||
if [[ -f "$pdf_path" ]]; then
|
||||
echo "Importing $pdf_path..."
|
||||
result=$(paperlib import --pdf "$pdf_path" --library "$LIBRARY" --json)
|
||||
|
||||
if [[ $? -eq 0 ]]; then
|
||||
paper_id=$(echo "$result" | jq -r '.paper_id')
|
||||
echo "Successfully imported as $paper_id"
|
||||
else
|
||||
echo "Failed to import $pdf_path"
|
||||
fi
|
||||
fi
|
||||
done < "$PAPER_LIST"
|
||||
|
||||
# Convert all pending papers
|
||||
paperlib convert --library "$LIBRARY"
|
||||
```
|
||||
|
||||
## Python API
|
||||
|
||||
### Direct Library Access
|
||||
|
||||
```python
|
||||
from paperlib.config import LibraryPaths
|
||||
from paperlib.storage import PaperStorageManager
|
||||
from paperlib.index import DatabaseManager
|
||||
from paperlib.importer import ArxivImporter, LocalImporter
|
||||
|
||||
# Initialize library components
|
||||
library_paths = LibraryPaths.from_root("/path/to/library")
|
||||
storage = PaperStorageManager(library_paths)
|
||||
database = DatabaseManager(library_paths)
|
||||
database.initialize_database()
|
||||
|
||||
# Import paper programmatically
|
||||
arxiv_importer = ArxivImporter(storage)
|
||||
metadata = arxiv_importer.import_arxiv_paper("2212.06340")
|
||||
database.index_paper(metadata)
|
||||
|
||||
# Search and retrieve
|
||||
results = list(database.search_papers("neural networks"))
|
||||
for result in results:
|
||||
paper = storage.load_paper_metadata(result["paper_id"], result["source_type"])
|
||||
print(f"{paper.title} by {', '.join(paper.authors)}")
|
||||
|
||||
# Get statistics
|
||||
stats = database.get_statistics()
|
||||
print(f"Total papers: {stats['total_papers']}")
|
||||
```
|
||||
|
||||
### Metadata Processing
|
||||
|
||||
```python
|
||||
import json
|
||||
from pathlib import Path
|
||||
from paperlib.models import PaperMetadata, PaperSummary
|
||||
|
||||
# Process all papers in library
|
||||
papers_dir = Path("/home/user/papers/papers")
|
||||
|
||||
for meta_file in papers_dir.rglob("meta.json"):
|
||||
# Load metadata
|
||||
metadata = PaperMetadata.load_from_file(meta_file)
|
||||
|
||||
# Check for summary
|
||||
summary_path = meta_file.parent / "summary.json"
|
||||
if summary_path.exists():
|
||||
summary = PaperSummary.load_from_file(summary_path)
|
||||
|
||||
# Extract key information
|
||||
tags = summary.problem_tags + summary.technique_tags
|
||||
entities = summary.entities
|
||||
|
||||
print(f"Paper: {metadata.title}")
|
||||
print(f"Tags: {', '.join(tags)}")
|
||||
print(f"Entities: {', '.join(entities)}")
|
||||
```
|
||||
|
||||
## File System Integration
|
||||
|
||||
### Direct File Access
|
||||
|
||||
Since paperlib uses a documented file layout, tools can read data directly:
|
||||
|
||||
```python
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
def scan_library(library_root: Path):
|
||||
"""Scan library and extract metadata."""
|
||||
papers = []
|
||||
|
||||
for meta_file in library_root.glob("papers/**/meta.json"):
|
||||
with meta_file.open() as f:
|
||||
metadata = json.load(f)
|
||||
papers.append(metadata)
|
||||
|
||||
return papers
|
||||
|
||||
def find_papers_by_category(library_root: Path, category: str):
|
||||
"""Find papers in a specific category."""
|
||||
matching_papers = []
|
||||
|
||||
for meta_file in library_root.glob("papers/**/meta.json"):
|
||||
with meta_file.open() as f:
|
||||
metadata = json.load(f)
|
||||
|
||||
if category in metadata.get("categories", []):
|
||||
matching_papers.append(metadata)
|
||||
|
||||
return matching_papers
|
||||
```
|
||||
|
||||
### Watch for Changes
|
||||
|
||||
```python
|
||||
import time
|
||||
from pathlib import Path
|
||||
from watchdog.observers import Observer
|
||||
from watchdog.events import FileSystemEventHandler
|
||||
|
||||
class PaperLibraryHandler(FileSystemEventHandler):
|
||||
def __init__(self, library_root):
|
||||
self.library_root = Path(library_root)
|
||||
|
||||
def on_created(self, event):
|
||||
if event.src_path.endswith("meta.json"):
|
||||
print(f"New paper imported: {event.src_path}")
|
||||
# Trigger processing workflow
|
||||
self.process_new_paper(event.src_path)
|
||||
|
||||
def on_modified(self, event):
|
||||
if event.src_path.endswith("summary.json"):
|
||||
print(f"Summary updated: {event.src_path}")
|
||||
# Update downstream systems
|
||||
|
||||
def process_new_paper(self, meta_path):
|
||||
"""Handle newly imported paper."""
|
||||
# Load metadata
|
||||
with open(meta_path) as f:
|
||||
metadata = json.load(f)
|
||||
|
||||
# Trigger downstream processing
|
||||
# - Send to processing queue
|
||||
# - Update knowledge base
|
||||
# - Generate notifications
|
||||
|
||||
# Watch library for changes
|
||||
observer = Observer()
|
||||
handler = PaperLibraryHandler("/home/user/papers")
|
||||
observer.schedule(handler, "/home/user/papers/papers", recursive=True)
|
||||
observer.start()
|
||||
```
|
||||
|
||||
## Higher-Level Tool Examples
|
||||
|
||||
### Research Dashboard
|
||||
|
||||
```python
|
||||
"""research_dashboard.py - Web dashboard for research library"""
|
||||
|
||||
from flask import Flask, jsonify, render_template
|
||||
from paperlib.config import LibraryPaths
|
||||
from paperlib.storage import PaperStorageManager
|
||||
from paperlib.index import DatabaseManager
|
||||
|
||||
app = Flask(__name__)
|
||||
|
||||
# Initialize paperlib components
|
||||
library_paths = LibraryPaths.from_root("/home/user/research")
|
||||
storage = PaperStorageManager(library_paths)
|
||||
database = DatabaseManager(library_paths)
|
||||
|
||||
@app.route('/api/papers')
|
||||
def list_papers():
|
||||
"""List all papers with metadata."""
|
||||
papers = list(database.list_papers(limit=50))
|
||||
return jsonify(papers)
|
||||
|
||||
@app.route('/api/search/<query>')
|
||||
def search_papers(query):
|
||||
"""Search papers by query."""
|
||||
results = list(database.search_papers(query, limit=20))
|
||||
return jsonify(results)
|
||||
|
||||
@app.route('/api/stats')
|
||||
def library_stats():
|
||||
"""Get library statistics."""
|
||||
stats = database.get_statistics()
|
||||
return jsonify(stats)
|
||||
|
||||
@app.route('/')
|
||||
def dashboard():
|
||||
"""Main dashboard page."""
|
||||
return render_template('dashboard.html')
|
||||
|
||||
if __name__ == '__main__':
|
||||
app.run(debug=True)
|
||||
```
|
||||
|
||||
### Daily Digest Generator
|
||||
|
||||
```python
|
||||
"""daily_digest.py - Generate daily research digest"""
|
||||
|
||||
import json
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
from paperlib.config import LibraryPaths
|
||||
from paperlib.index import DatabaseManager
|
||||
|
||||
def generate_daily_digest(library_root: str, output_file: str):
|
||||
"""Generate digest of recently imported papers."""
|
||||
|
||||
# Initialize database
|
||||
library_paths = LibraryPaths.from_root(library_root)
|
||||
database = DatabaseManager(library_paths)
|
||||
|
||||
# Get papers from last 24 hours
|
||||
yesterday = datetime.now() - timedelta(days=1)
|
||||
yesterday_iso = yesterday.isoformat()
|
||||
|
||||
recent_papers = []
|
||||
for paper in database.list_papers():
|
||||
if paper["imported_at"] >= yesterday_iso:
|
||||
recent_papers.append(paper)
|
||||
|
||||
if not recent_papers:
|
||||
print("No new papers imported yesterday.")
|
||||
return
|
||||
|
||||
# Group by category
|
||||
by_category = {}
|
||||
for paper in recent_papers:
|
||||
categories = json.loads(paper["categories_json"])
|
||||
for category in categories:
|
||||
if category not in by_category:
|
||||
by_category[category] = []
|
||||
by_category[category].append(paper)
|
||||
|
||||
# Generate HTML digest
|
||||
html_content = f"""
|
||||
<html>
|
||||
<head><title>Daily Research Digest - {datetime.now().strftime('%Y-%m-%d')}</title></head>
|
||||
<body>
|
||||
<h1>Daily Research Digest</h1>
|
||||
<p>Found {len(recent_papers)} new papers</p>
|
||||
"""
|
||||
|
||||
for category, papers in by_category.items():
|
||||
html_content += f"<h2>{category}</h2><ul>"
|
||||
for paper in papers:
|
||||
title = paper["title"]
|
||||
paper_id = paper["paper_id"]
|
||||
html_content += f'<li><strong>{title}</strong> ({paper_id})</li>'
|
||||
html_content += "</ul>"
|
||||
|
||||
html_content += "</body></html>"
|
||||
|
||||
# Write output
|
||||
Path(output_file).write_text(html_content)
|
||||
print(f"Digest written to {output_file}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
generate_daily_digest("/home/user/research", "digest.html")
|
||||
```
|
||||
|
||||
### Literature Review Assistant
|
||||
|
||||
```python
|
||||
"""review_assistant.py - AI-powered literature review helper"""
|
||||
|
||||
from paperlib.config import LibraryPaths
|
||||
from paperlib.index import DatabaseManager
|
||||
from paperlib.models import PaperSummary
|
||||
|
||||
class ReviewAssistant:
|
||||
def __init__(self, library_root: str):
|
||||
self.library_paths = LibraryPaths.from_root(library_root)
|
||||
self.database = DatabaseManager(self.library_paths)
|
||||
|
||||
def find_related_papers(self, paper_id: str, max_results: int = 10):
|
||||
"""Find papers related to the given paper."""
|
||||
|
||||
# Get source paper metadata
|
||||
source_paper = self.database.get_paper(paper_id)
|
||||
if not source_paper:
|
||||
return []
|
||||
|
||||
# Extract search terms from title and categories
|
||||
title_words = source_paper["title"].lower().split()
|
||||
categories = json.loads(source_paper["categories_json"])
|
||||
|
||||
# Search for papers with similar keywords
|
||||
search_terms = title_words + categories
|
||||
related_papers = []
|
||||
|
||||
for term in search_terms:
|
||||
results = list(self.database.search_papers(term, limit=5))
|
||||
for result in results:
|
||||
if result["paper_id"] != paper_id:
|
||||
related_papers.append(result)
|
||||
|
||||
# Remove duplicates and return top results
|
||||
seen_ids = set()
|
||||
unique_papers = []
|
||||
for paper in related_papers:
|
||||
if paper["paper_id"] not in seen_ids:
|
||||
seen_ids.add(paper["paper_id"])
|
||||
unique_papers.append(paper)
|
||||
if len(unique_papers) >= max_results:
|
||||
break
|
||||
|
||||
return unique_papers
|
||||
|
||||
def generate_topic_overview(self, topic: str):
|
||||
"""Generate overview of papers on a specific topic."""
|
||||
|
||||
# Search for papers on topic
|
||||
papers = list(self.database.search_papers(topic, limit=50))
|
||||
|
||||
if not papers:
|
||||
return f"No papers found for topic: {topic}"
|
||||
|
||||
# Analyze summaries if available
|
||||
key_entities = set()
|
||||
techniques = set()
|
||||
|
||||
for paper in papers:
|
||||
summary_path = Path(paper["summary_json_path"])
|
||||
if summary_path.exists():
|
||||
summary = PaperSummary.load_from_file(summary_path)
|
||||
key_entities.update(summary.entities)
|
||||
techniques.update(summary.technique_tags)
|
||||
|
||||
# Generate overview
|
||||
overview = f"""
|
||||
Topic: {topic}
|
||||
|
||||
Papers found: {len(papers)}
|
||||
|
||||
Key entities mentioned:
|
||||
{', '.join(sorted(key_entities)[:10])}
|
||||
|
||||
Common techniques:
|
||||
{', '.join(sorted(techniques)[:10])}
|
||||
|
||||
Recent papers:
|
||||
"""
|
||||
|
||||
# Add recent papers
|
||||
recent_papers = sorted(papers, key=lambda x: x["imported_at"], reverse=True)[:5]
|
||||
for paper in recent_papers:
|
||||
overview += f"\n- {paper['title']} ({paper['paper_id']})"
|
||||
|
||||
return overview
|
||||
|
||||
# Usage
|
||||
assistant = ReviewAssistant("/home/user/research")
|
||||
overview = assistant.generate_topic_overview("transformer architecture")
|
||||
print(overview)
|
||||
```
|
||||
|
||||
## Integration Patterns
|
||||
|
||||
### Pipeline Processing
|
||||
|
||||
```bash
|
||||
# Multi-stage processing pipeline
|
||||
paperlib import --arxiv 2212.06340 --json > import_result.json
|
||||
paper_id=$(jq -r '.paper_id' import_result.json)
|
||||
|
||||
# Convert to markdown
|
||||
paperlib convert --paper-id "$paper_id"
|
||||
|
||||
# Generate summary (when available)
|
||||
# paperlib summarize --paper-id "$paper_id"
|
||||
|
||||
# Update downstream systems
|
||||
curl -X POST "http://research-db/api/papers" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d @import_result.json
|
||||
```
|
||||
|
||||
### Event-Driven Architecture
|
||||
|
||||
```python
|
||||
"""event_handler.py - Process paperlib events"""
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
import pika # RabbitMQ client
|
||||
|
||||
class PaperLibraryEventHandler:
|
||||
def __init__(self, rabbitmq_url: str):
|
||||
self.connection = pika.BlockingConnection(pika.URLParameters(rabbitmq_url))
|
||||
self.channel = self.connection.channel()
|
||||
|
||||
def on_paper_imported(self, paper_metadata: dict):
|
||||
"""Handle new paper import."""
|
||||
message = {
|
||||
"event": "paper_imported",
|
||||
"paper_id": paper_metadata["paper_id"],
|
||||
"title": paper_metadata["title"],
|
||||
"categories": paper_metadata["categories"],
|
||||
"timestamp": paper_metadata["imported_at"]
|
||||
}
|
||||
|
||||
# Send to processing queue
|
||||
self.channel.basic_publish(
|
||||
exchange='',
|
||||
routing_key='paper_processing',
|
||||
body=json.dumps(message)
|
||||
)
|
||||
|
||||
def on_summary_generated(self, paper_id: str, summary_path: Path):
|
||||
"""Handle summary generation."""
|
||||
with summary_path.open() as f:
|
||||
summary = json.load(f)
|
||||
|
||||
message = {
|
||||
"event": "summary_generated",
|
||||
"paper_id": paper_id,
|
||||
"tags": summary["problem_tags"] + summary["technique_tags"],
|
||||
"entities": summary["entities"]
|
||||
}
|
||||
|
||||
# Send to indexing service
|
||||
self.channel.basic_publish(
|
||||
exchange='',
|
||||
routing_key='summary_indexing',
|
||||
body=json.dumps(message)
|
||||
)
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Error Handling
|
||||
|
||||
```python
|
||||
import subprocess
|
||||
import json
|
||||
|
||||
def safe_paperlib_command(command: list[str]) -> dict:
|
||||
"""Execute paperlib command with proper error handling."""
|
||||
try:
|
||||
result = subprocess.run(
|
||||
["paperlib"] + command + ["--json"],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
check=True
|
||||
)
|
||||
return json.loads(result.stdout)
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
return {
|
||||
"success": False,
|
||||
"error": e.stderr,
|
||||
"exit_code": e.returncode
|
||||
}
|
||||
|
||||
except json.JSONDecodeError as e:
|
||||
return {
|
||||
"success": False,
|
||||
"error": f"Invalid JSON response: {e}",
|
||||
"raw_output": result.stdout
|
||||
}
|
||||
|
||||
# Usage
|
||||
result = safe_paperlib_command(["import", "--arxiv", "2212.06340"])
|
||||
if result.get("success", True): # Assume success if no "success" field
|
||||
print(f"Imported paper: {result['paper_id']}")
|
||||
else:
|
||||
print(f"Import failed: {result['error']}")
|
||||
```
|
||||
|
||||
### Performance Optimization
|
||||
|
||||
```python
|
||||
# Batch operations for better performance
|
||||
from paperlib.index import DatabaseManager
|
||||
|
||||
def batch_index_papers(library_root: str, paper_ids: list[str]):
|
||||
"""Index multiple papers efficiently."""
|
||||
database = DatabaseManager(LibraryPaths.from_root(library_root))
|
||||
storage = PaperStorageManager(LibraryPaths.from_root(library_root))
|
||||
|
||||
# Begin transaction for batch insert
|
||||
with database._get_connection() as conn:
|
||||
for paper_id in paper_ids:
|
||||
metadata = storage.load_paper_metadata(paper_id, source_type)
|
||||
if metadata:
|
||||
database.index_paper(metadata)
|
||||
# Automatic commit on context exit
|
||||
```
|
||||
|
||||
### Configuration Management
|
||||
|
||||
```python
|
||||
# config_manager.py - Centralized configuration
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
class ConfigManager:
|
||||
def __init__(self):
|
||||
self.library_root = os.getenv("PAPERLIB_ROOT", Path.home() / "research")
|
||||
self.api_keys = {
|
||||
"openai": os.getenv("OPENAI_API_KEY"),
|
||||
"anthropic": os.getenv("ANTHROPIC_API_KEY")
|
||||
}
|
||||
|
||||
def get_library_path(self, name: str = "default") -> str:
|
||||
"""Get library path by name."""
|
||||
if name == "default":
|
||||
return str(self.library_root)
|
||||
return str(Path.home() / f"research-{name}")
|
||||
|
||||
def paperlib_command_base(self, library_name: str = "default") -> list[str]:
|
||||
"""Get base command for paperlib with library."""
|
||||
return ["paperlib", "--library", self.get_library_path(library_name)]
|
||||
|
||||
config = ConfigManager()
|
||||
|
||||
# Usage in scripts
|
||||
import subprocess
|
||||
cmd = config.paperlib_command_base("arxiv") + ["list", "--json"]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
```
|
||||
|
||||
This integration guide provides the foundation for building sophisticated research workflows on top of paperlib's stable, local-first architecture.
|
||||
@@ -0,0 +1,259 @@
|
||||
# Storage Layout
|
||||
|
||||
This document describes the on-disk structure and organization of a paperlib library.
|
||||
|
||||
## Overview
|
||||
|
||||
A paperlib library is a directory containing all papers, metadata, configuration, and index data. The layout is designed to be:
|
||||
|
||||
- **Human-readable**: Directory structure is intuitive and browsable
|
||||
- **Stable**: File locations don't change unexpectedly
|
||||
- **Rebuildable**: Index can be reconstructed from source files
|
||||
- **Portable**: Entire library can be moved or backed up as a unit
|
||||
|
||||
## Directory Structure
|
||||
|
||||
```
|
||||
library_root/
|
||||
├── config/ # Library configuration
|
||||
│ ├── config.toml # Main configuration file
|
||||
│ ├── vocab.yaml # Controlled vocabulary (future)
|
||||
│ └── prompts/ # AI prompt templates (future)
|
||||
│ └── summarize_paper.md
|
||||
├── papers/ # Paper storage (source of truth)
|
||||
│ ├── arxiv/ # arXiv papers organized by year
|
||||
│ │ └── 2026/
|
||||
│ │ └── arxiv-2212_06340/
|
||||
│ │ ├── meta.json # Paper metadata
|
||||
│ │ ├── source.pdf # Original PDF
|
||||
│ │ ├── paper.md # Converted markdown
|
||||
│ │ ├── summary.json # AI-generated summary
|
||||
│ │ ├── summary.md # Rendered summary
|
||||
│ │ ├── ref.bib # Bibliography (future)
|
||||
│ │ ├── assets/ # Images, figures
|
||||
│ │ └── logs/ # Processing logs
|
||||
│ │ └── mineru.log
|
||||
│ └── local/ # Local PDF imports by hash
|
||||
│ └── a1b2c3d4e5f6/
|
||||
│ └── ... (same structure)
|
||||
├── inbox/ # Temporary import staging (future)
|
||||
├── db/ # Search index (rebuildable)
|
||||
│ └── paperlib.sqlite3
|
||||
└── cache/ # Processing cache (safe to delete)
|
||||
```
|
||||
|
||||
## Paper Directory Organization
|
||||
|
||||
### arXiv Papers
|
||||
|
||||
arXiv papers are organized by year and paper ID:
|
||||
|
||||
```
|
||||
papers/arxiv/YEAR/arxiv-NORMALIZED_ID/
|
||||
```
|
||||
|
||||
Where:
|
||||
- `YEAR` is extracted from the arXiv ID (e.g., `2212.06340` → `2022`)
|
||||
- `NORMALIZED_ID` replaces dots and version numbers with underscores
|
||||
- `2212.06340` → `arxiv-2212_06340`
|
||||
- `2212.06340v2` → `arxiv-2212_06340v2`
|
||||
|
||||
**Examples:**
|
||||
```
|
||||
papers/arxiv/2022/arxiv-2212_06340/
|
||||
papers/arxiv/2023/arxiv-2301_12345v1/
|
||||
papers/arxiv/2024/arxiv-2405_98765/
|
||||
```
|
||||
|
||||
### Local Papers
|
||||
|
||||
Local papers are organized by content hash:
|
||||
|
||||
```
|
||||
papers/local/HASH_PREFIX/
|
||||
```
|
||||
|
||||
Where `HASH_PREFIX` is the first 16 characters of the SHA256 hash of the PDF file.
|
||||
|
||||
**Examples:**
|
||||
```
|
||||
papers/local/a1b2c3d4e5f67890/
|
||||
papers/local/fedcba9876543210/
|
||||
```
|
||||
|
||||
## File Types
|
||||
|
||||
### Required Files
|
||||
|
||||
Every paper directory contains:
|
||||
|
||||
#### `meta.json`
|
||||
The canonical metadata file (JSON format):
|
||||
```json
|
||||
{
|
||||
"paper_id": "arxiv-2212_06340",
|
||||
"source_type": "arxiv",
|
||||
"source_id": "2212.06340",
|
||||
"title": "Example Paper Title",
|
||||
"authors": ["Alice Smith", "Bob Jones"],
|
||||
"published_date": "2022-12-13T02:46:55",
|
||||
"categories": ["cs.AI", "stat.ML"],
|
||||
"pdf_path": "papers/arxiv/2022/arxiv-2212_06340/source.pdf",
|
||||
"paper_md_path": "papers/arxiv/2022/arxiv-2212_06340/paper.md",
|
||||
"imported_at": "2024-01-15T10:30:00",
|
||||
"conversion_status": "success",
|
||||
"summary_status": "not_requested",
|
||||
"tags": ["machine-learning"],
|
||||
"notes": "Important paper on neural networks"
|
||||
}
|
||||
```
|
||||
|
||||
#### `source.pdf`
|
||||
The original PDF file, exactly as imported.
|
||||
|
||||
### Generated Files
|
||||
|
||||
These files are created by paperlib processing:
|
||||
|
||||
#### `paper.md`
|
||||
Markdown conversion of the PDF, generated by MinerU or other converters.
|
||||
|
||||
#### `summary.json` (optional)
|
||||
AI-generated structured summary:
|
||||
```json
|
||||
{
|
||||
"schema_version": "1.0",
|
||||
"one_sentence_summary": "This paper introduces...",
|
||||
"problem_statement": "Current methods have limitations...",
|
||||
"method_overview": "We propose a novel approach...",
|
||||
"main_results": "Experiments show 95% accuracy...",
|
||||
"claimed_contributions": ["Novel architecture", "Improved performance"],
|
||||
"problem_tags": ["classification", "optimization"],
|
||||
"technique_tags": ["neural-networks", "transformers"],
|
||||
"entities": ["BERT", "ImageNet", "ResNet"],
|
||||
"relevance_to_user": 0.85
|
||||
}
|
||||
```
|
||||
|
||||
#### `summary.md` (optional)
|
||||
Human-readable summary rendered from `summary.json`.
|
||||
|
||||
### Supporting Directories
|
||||
|
||||
#### `assets/`
|
||||
Contains extracted images, figures, and other media from the PDF conversion process.
|
||||
|
||||
#### `logs/`
|
||||
Processing logs for debugging and audit trails:
|
||||
- `mineru.log` - PDF conversion logs
|
||||
- `summary.log` - AI summarization logs (future)
|
||||
|
||||
## Index Database
|
||||
|
||||
The SQLite database at `db/paperlib.sqlite3` contains:
|
||||
|
||||
### Tables
|
||||
|
||||
#### `papers`
|
||||
Main paper index with searchable fields:
|
||||
- Metadata from all `meta.json` files
|
||||
- Computed search fields (full-text, author lists, etc.)
|
||||
- Processing status tracking
|
||||
|
||||
#### `papers_fts`
|
||||
Full-text search virtual table (SQLite FTS5) for content search.
|
||||
|
||||
### Rebuilding
|
||||
|
||||
The database is **always rebuildable** from the source files:
|
||||
```bash
|
||||
paperlib reindex
|
||||
```
|
||||
|
||||
This design ensures the JSON files remain the authoritative source of truth.
|
||||
|
||||
## Path Conventions
|
||||
|
||||
### Relative Paths
|
||||
All paths in `meta.json` are relative to the library root:
|
||||
```json
|
||||
{
|
||||
"pdf_path": "papers/local/a1b2c3d4e5f6/source.pdf",
|
||||
"paper_md_path": "papers/local/a1b2c3d4e5f6/paper.md"
|
||||
}
|
||||
```
|
||||
|
||||
### Cross-Platform Compatibility
|
||||
All paths use forward slashes (`/`) regardless of operating system.
|
||||
|
||||
## Backup and Portability
|
||||
|
||||
### What to Backup
|
||||
For complete library backup, include:
|
||||
- `config/` directory (configuration)
|
||||
- `papers/` directory (source of truth)
|
||||
|
||||
### What NOT to Backup
|
||||
These can be regenerated:
|
||||
- `db/` directory (rebuildable index)
|
||||
- `cache/` directory (temporary files)
|
||||
|
||||
### Moving Libraries
|
||||
To move a library:
|
||||
1. Copy the entire directory structure
|
||||
2. Run `paperlib reindex` to rebuild the database
|
||||
3. Update any absolute paths in configuration
|
||||
|
||||
## Storage Efficiency
|
||||
|
||||
### Deduplication
|
||||
Papers are naturally deduplicated:
|
||||
- arXiv papers by normalized arXiv ID
|
||||
- Local papers by SHA256 content hash
|
||||
|
||||
### Large Files
|
||||
For papers with large asset directories:
|
||||
- Assets are stored alongside papers for locality
|
||||
- Consider using file system compression or deduplication if needed
|
||||
|
||||
## File System Requirements
|
||||
|
||||
### Permissions
|
||||
paperlib requires:
|
||||
- Read/write access to library directory
|
||||
- Ability to create subdirectories
|
||||
- Atomic file operations for metadata updates
|
||||
|
||||
### File System Features
|
||||
Recommended:
|
||||
- Case-sensitive file system (avoids conflicts)
|
||||
- Support for Unicode filenames
|
||||
- Journaling (protects against corruption)
|
||||
|
||||
### Disk Space
|
||||
Typical storage requirements:
|
||||
- PDF files: 1-10 MB each
|
||||
- Markdown conversions: 10-100 KB each
|
||||
- Metadata: ~1-5 KB per paper
|
||||
- Database index: ~1-10 KB per paper
|
||||
- Assets: Varies (0-50 MB for image-heavy papers)
|
||||
|
||||
## Migration and Versioning
|
||||
|
||||
### Schema Evolution
|
||||
When paperlib updates its storage format:
|
||||
- Metadata schema versions are tracked in each file
|
||||
- Migration tools handle format upgrades
|
||||
- Backward compatibility is maintained when possible
|
||||
|
||||
### Validation
|
||||
paperlib provides tools to validate library integrity:
|
||||
```bash
|
||||
paperlib doctor # (future command)
|
||||
```
|
||||
|
||||
This will check:
|
||||
- All referenced files exist
|
||||
- Metadata format is valid
|
||||
- Database consistency with files
|
||||
- No orphaned or corrupted data
|
||||
@@ -0,0 +1,289 @@
|
||||
# Summary Schema
|
||||
|
||||
This document defines the structure and semantics of the `summary.json` files that contain AI-generated paper summaries.
|
||||
|
||||
## Overview
|
||||
|
||||
The `summary.json` file contains structured, AI-generated analysis of a paper. It is designed to:
|
||||
|
||||
- Provide consistent, machine-readable summaries
|
||||
- Support research triage and discovery workflows
|
||||
- Enable automated categorization and search
|
||||
- Remain stable across different AI providers
|
||||
- Use controlled vocabulary when available
|
||||
|
||||
## Schema Version 1.0
|
||||
|
||||
### File Structure
|
||||
|
||||
```json
|
||||
{
|
||||
"schema_version": "1.0",
|
||||
"one_sentence_summary": "This paper introduces a novel neural architecture for...",
|
||||
"problem_statement": "Current approaches to X suffer from limitations...",
|
||||
"method_overview": "The authors propose a hybrid approach combining...",
|
||||
"main_results": "Experiments show 15% improvement over baselines...",
|
||||
"claimed_contributions": [
|
||||
"Novel attention mechanism design",
|
||||
"State-of-the-art results on ImageNet",
|
||||
"Theoretical analysis of convergence properties"
|
||||
],
|
||||
"assumptions": [
|
||||
"Data is independently distributed",
|
||||
"Computational budget allows for large models"
|
||||
],
|
||||
"limitations": [
|
||||
"Only evaluated on English text",
|
||||
"Requires significant computational resources",
|
||||
"Limited theoretical justification for design choices"
|
||||
],
|
||||
"problem_tags": ["classification", "computer-vision", "optimization"],
|
||||
"technique_tags": ["neural-networks", "attention", "transformers"],
|
||||
"entities": ["ImageNet", "BERT", "ResNet", "CIFAR-10"],
|
||||
"relevance_to_user": 0.75,
|
||||
"recommended_sections": ["Section 3.2", "Algorithm 1", "Table 2"]
|
||||
}
|
||||
```
|
||||
|
||||
## Field Definitions
|
||||
|
||||
### Required Fields
|
||||
|
||||
#### `schema_version` (string)
|
||||
- **Purpose**: Track format version for migration
|
||||
- **Format**: Semantic version string (e.g., "1.0")
|
||||
- **Required**: Yes
|
||||
|
||||
#### `one_sentence_summary` (string)
|
||||
- **Purpose**: Concise paper overview for quick scanning
|
||||
- **Guidelines**:
|
||||
- One complete sentence, under 200 characters
|
||||
- Focus on the main contribution or finding
|
||||
- Avoid technical jargon when possible
|
||||
- **Example**: "This paper introduces a new attention mechanism that improves transformer efficiency by 40% while maintaining accuracy."
|
||||
|
||||
### Core Content Fields
|
||||
|
||||
#### `problem_statement` (string)
|
||||
- **Purpose**: What problem does this paper address?
|
||||
- **Guidelines**:
|
||||
- 2-3 sentences maximum
|
||||
- Focus on the gap or limitation being addressed
|
||||
- Explain why this problem matters
|
||||
|
||||
#### `method_overview` (string)
|
||||
- **Purpose**: High-level description of the approach
|
||||
- **Guidelines**:
|
||||
- 3-4 sentences maximum
|
||||
- Focus on the key innovation or insight
|
||||
- Avoid detailed algorithmic descriptions
|
||||
|
||||
#### `main_results` (string)
|
||||
- **Purpose**: Key empirical findings or theoretical results
|
||||
- **Guidelines**:
|
||||
- Quantitative results when available
|
||||
- Highlight significance of improvements
|
||||
- Note any surprising or counterintuitive findings
|
||||
|
||||
### Structured Lists
|
||||
|
||||
#### `claimed_contributions` (array of strings)
|
||||
- **Purpose**: Authors' stated contributions
|
||||
- **Guidelines**:
|
||||
- Extract from paper's contribution list
|
||||
- Preserve authors' framing and claims
|
||||
- 3-6 items typically
|
||||
|
||||
#### `assumptions` (array of strings)
|
||||
- **Purpose**: Key assumptions underlying the work
|
||||
- **Guidelines**:
|
||||
- Mathematical, methodological, or data assumptions
|
||||
- Critical for understanding applicability
|
||||
- Often unstated but important
|
||||
|
||||
#### `limitations` (array of strings)
|
||||
- **Purpose**: Acknowledged or apparent limitations
|
||||
- **Guidelines**:
|
||||
- From authors' discussion or limitations section
|
||||
- Obvious limitations not acknowledged by authors
|
||||
- Important for understanding scope
|
||||
|
||||
### Categorization
|
||||
|
||||
#### `problem_tags` (array of strings)
|
||||
- **Purpose**: Categorize the problem domain
|
||||
- **Controlled vocabulary** (preferred values):
|
||||
- `classification`, `regression`, `clustering`
|
||||
- `optimization`, `search`, `planning`
|
||||
- `generation`, `translation`, `summarization`
|
||||
- `detection`, `segmentation`, `tracking`
|
||||
- `compression`, `encoding`, `decoding`
|
||||
- `privacy`, `security`, `robustness`
|
||||
- `interpretability`, `fairness`, `ethics`
|
||||
- `efficiency`, `scalability`, `deployment`
|
||||
|
||||
#### `technique_tags` (array of strings)
|
||||
- **Purpose**: Categorize the technical approaches
|
||||
- **Controlled vocabulary** (preferred values):
|
||||
- `neural-networks`, `deep-learning`, `transformers`
|
||||
- `cnn`, `rnn`, `lstm`, `gru`, `attention`
|
||||
- `reinforcement-learning`, `supervised-learning`, `unsupervised-learning`
|
||||
- `bayesian`, `probabilistic`, `statistical`
|
||||
- `graph-neural-networks`, `graph-algorithms`
|
||||
- `computer-vision`, `natural-language-processing`
|
||||
- `federated-learning`, `transfer-learning`, `meta-learning`
|
||||
- `adversarial`, `generative-models`, `vae`, `gan`
|
||||
|
||||
### Entities and References
|
||||
|
||||
#### `entities` (array of strings)
|
||||
- **Purpose**: Important datasets, models, algorithms, or systems mentioned
|
||||
- **Guidelines**:
|
||||
- Proper names: "ImageNet", "BERT", "ResNet"
|
||||
- Algorithms: "SGD", "Adam", "RANSAC"
|
||||
- Benchmarks: "GLUE", "COCO", "WMT"
|
||||
- Avoid generic terms like "neural network"
|
||||
|
||||
### User Relevance
|
||||
|
||||
#### `relevance_to_user` (number, optional)
|
||||
- **Purpose**: Estimated relevance score for the user
|
||||
- **Format**: Float between 0.0 and 1.0
|
||||
- **Guidelines**:
|
||||
- Based on user's research interests (if known)
|
||||
- `null` if user preferences unavailable
|
||||
- Higher scores = more relevant
|
||||
|
||||
#### `recommended_sections` (array of strings, optional)
|
||||
- **Purpose**: Specific sections worth reading in detail
|
||||
- **Format**: Section references as they appear in paper
|
||||
- **Examples**: ["Section 3.2", "Algorithm 1", "Table 2", "Appendix A"]
|
||||
|
||||
## Generation Guidelines
|
||||
|
||||
### AI Provider Instructions
|
||||
|
||||
When generating summaries, AI models should:
|
||||
|
||||
1. **Read for understanding**: Focus on the paper's core contributions
|
||||
2. **Use structured thinking**: Work through each field systematically
|
||||
3. **Prefer facts over interpretation**: Extract what authors claim, not opinions
|
||||
4. **Use controlled vocabulary**: Select from predefined tag lists when possible
|
||||
5. **Be concise**: Optimize for quick scanning and search
|
||||
6. **Handle uncertainty**: Use `null` or empty arrays for unclear fields
|
||||
|
||||
### Quality Criteria
|
||||
|
||||
Good summaries exhibit:
|
||||
- **Accuracy**: Faithful to the paper's content
|
||||
- **Completeness**: Cover all major aspects
|
||||
- **Consistency**: Similar papers get similar treatment
|
||||
- **Searchability**: Use terms that aid discovery
|
||||
- **Brevity**: Information density over verbosity
|
||||
|
||||
### Common Issues to Avoid
|
||||
|
||||
- **Hallucination**: Never invent facts not in the paper
|
||||
- **Editorializing**: Don't add opinions about paper quality
|
||||
- **Inconsistent terminology**: Use standard field names
|
||||
- **Over-abstraction**: Keep concrete details when useful
|
||||
- **Under-specification**: Provide enough detail for usefulness
|
||||
|
||||
## Schema Evolution
|
||||
|
||||
### Version History
|
||||
|
||||
- **v1.0** (current): Initial schema with core fields
|
||||
|
||||
### Migration Strategy
|
||||
|
||||
When the schema evolves:
|
||||
1. New versions increment the `schema_version` field
|
||||
2. Migration tools handle format upgrades automatically
|
||||
3. Backward compatibility maintained when possible
|
||||
4. Deprecated fields are marked but preserved
|
||||
|
||||
### Extensibility
|
||||
|
||||
Future versions may add:
|
||||
- Additional structured fields
|
||||
- Hierarchical tag taxonomies
|
||||
- Multi-lingual support
|
||||
- Citation relationship mapping
|
||||
- Experimental reproducibility metadata
|
||||
|
||||
## Integration with paperlib
|
||||
|
||||
### File Lifecycle
|
||||
|
||||
1. **Generation**: AI provider creates `summary.json`
|
||||
2. **Validation**: paperlib validates against schema
|
||||
3. **Indexing**: Content indexed for search
|
||||
4. **Rendering**: Human-readable `summary.md` generated
|
||||
5. **Updates**: Summaries can be regenerated with new models
|
||||
|
||||
### Search Integration
|
||||
|
||||
Summary fields are indexed for search:
|
||||
- Full-text search includes all text fields
|
||||
- Tag-based search uses `problem_tags` and `technique_tags`
|
||||
- Entity search uses the `entities` field
|
||||
- Relevance ranking can use `relevance_to_user` scores
|
||||
|
||||
### API Integration
|
||||
|
||||
Higher-level tools can consume summaries programmatically:
|
||||
|
||||
```python
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
# Load summary
|
||||
summary_path = Path("papers/arxiv/2022/arxiv-2212_06340/summary.json")
|
||||
with summary_path.open() as f:
|
||||
summary = json.load(f)
|
||||
|
||||
# Extract key information
|
||||
tags = summary["problem_tags"] + summary["technique_tags"]
|
||||
relevance = summary.get("relevance_to_user", 0.0)
|
||||
entities = summary["entities"]
|
||||
```
|
||||
|
||||
This enables automated workflows like:
|
||||
- Daily digest generation
|
||||
- Research recommendation systems
|
||||
- Literature review automation
|
||||
- Cross-reference discovery
|
||||
|
||||
## Examples
|
||||
|
||||
### Machine Learning Paper
|
||||
```json
|
||||
{
|
||||
"schema_version": "1.0",
|
||||
"one_sentence_summary": "Introduces EfficientNet, a family of convolutional neural networks that achieve better accuracy and efficiency than previous models through compound scaling.",
|
||||
"problem_statement": "Existing ConvNet scaling methods arbitrarily scale network dimensions, leading to suboptimal accuracy and efficiency trade-offs.",
|
||||
"method_overview": "The paper proposes compound scaling that uniformly scales network width, depth, and resolution with a fixed ratio, guided by neural architecture search to find optimal scaling coefficients.",
|
||||
"main_results": "EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster than the best existing ConvNet.",
|
||||
"claimed_contributions": [
|
||||
"Novel compound scaling method for ConvNets",
|
||||
"EfficientNet family with state-of-the-art accuracy/efficiency",
|
||||
"Systematic study of scaling dimensions"
|
||||
],
|
||||
"assumptions": [
|
||||
"ImageNet classification transfers to other vision tasks",
|
||||
"Compound scaling works across different architectures"
|
||||
],
|
||||
"limitations": [
|
||||
"Limited evaluation on tasks beyond image classification",
|
||||
"Scaling coefficients may not generalize to all architectures"
|
||||
],
|
||||
"problem_tags": ["classification", "computer-vision", "efficiency"],
|
||||
"technique_tags": ["cnn", "neural-architecture-search", "model-scaling"],
|
||||
"entities": ["ImageNet", "MobileNet", "ResNet", "NASNet"],
|
||||
"relevance_to_user": null,
|
||||
"recommended_sections": ["Section 3.1", "Table 2", "Figure 2"]
|
||||
}
|
||||
```
|
||||
|
||||
This schema provides a foundation for consistent, structured paper analysis while remaining flexible enough to evolve with new research needs and AI capabilities.
|
||||
@@ -3,7 +3,6 @@
|
||||
import shutil
|
||||
import subprocess
|
||||
from pathlib import Path
|
||||
from unittest.mock import patch
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@@ -106,7 +106,7 @@ class TestLocalImporter:
|
||||
def test_import_duplicate_pdf(self, local_importer, sample_pdf):
|
||||
"""Test importing the same PDF twice."""
|
||||
# Import once
|
||||
metadata1 = local_importer.import_pdf(pdf_path=sample_pdf)
|
||||
local_importer.import_pdf(pdf_path=sample_pdf)
|
||||
|
||||
# Try to import again
|
||||
with pytest.raises(ValueError, match="Paper already imported"):
|
||||
|
||||
@@ -6,10 +6,9 @@ from pathlib import Path
|
||||
import pytest
|
||||
|
||||
from paperlib.config import LibraryPaths
|
||||
from paperlib.converter import MinerUConverter
|
||||
from paperlib.importer import ArxivImporter, LocalImporter
|
||||
from paperlib.importer import LocalImporter
|
||||
from paperlib.index import DatabaseManager
|
||||
from paperlib.models import ConversionStatus, SourceType
|
||||
from paperlib.models import SourceType
|
||||
from paperlib.storage import PaperStorageManager
|
||||
|
||||
|
||||
|
||||
@@ -5,8 +5,6 @@ import tempfile
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from paperlib.models import (
|
||||
ConversionStatus,
|
||||
PaperMetadata,
|
||||
|
||||
@@ -1,13 +1,12 @@
|
||||
"""Tests for paperlib storage manager."""
|
||||
|
||||
import shutil
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from paperlib.config import LibraryPaths
|
||||
from paperlib.models import ConversionStatus, PaperMetadata, SourceType
|
||||
from paperlib.models import ConversionStatus, SourceType
|
||||
from paperlib.storage import PaperStorageManager
|
||||
|
||||
|
||||
|
||||
Reference in New Issue
Block a user