docs: add docs

2026-04-17 16:54:30 -04:00
parent 74d140e5f8
commit 432010f431
10 changed files with 1682 additions and 19 deletions
+288
@@ -0,0 +1,288 @@
# CLI Reference
This document describes all available commands in the paperlib CLI.
## Global Options
All commands support these global options:
- `--help`, `-h`: Show help message
- `--version`: Show version information
Many commands also support:
- `--library`, `-L`: Specify library root directory (default: current directory)
- `--json`: Output machine-readable JSON instead of human-readable format
## Commands
### `paperlib init [PATH]`
Initialize a paper library directory structure.
**Arguments:**
- `PATH`: Directory to initialize (default: current directory)
**Examples:**
```bash
# Initialize library in current directory
paperlib init
# Initialize library in specific directory
paperlib init /path/to/my/papers
# Initialize and create parent directories
paperlib init ~/Documents/research/papers
```
**Behavior:**
- Creates standard directory structure (config/, papers/, db/, etc.)
- Safe to run multiple times (idempotent)
- Creates parent directories if they don't exist
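The idempotent behavior listed above can be sketched in a few lines. This is a minimal illustration, not paperlib's actual implementation: the function name `init_library` is hypothetical, and the subdirectory list is an assumption taken from the `paperlib status` output shown later in this document.

```python
from pathlib import Path

# Assumed subdirectory names, taken from the `paperlib status` output.
LIBRARY_SUBDIRS = ("config", "papers", "db", "inbox", "cache")


def init_library(root: Path) -> list[Path]:
    """Create the library skeleton and return the directories newly created."""
    created = []
    for name in LIBRARY_SUBDIRS:
        path = root / name
        if not path.exists():
            # parents=True creates missing parent directories;
            # exist_ok=True keeps a repeated call harmless.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```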
---
### `paperlib import`
Import papers into the library from various sources.
**Required (one of):**
- `--pdf PATH`: Import a local PDF file
- `--arxiv ID`: Import paper from arXiv by ID or URL
**Options:**
- `--title TEXT`: Override paper title (for local PDFs)
- `--notes TEXT`: Add notes about the paper
- `--tags TAG1 TAG2`: Add tags to the paper
- `--library PATH`: Specify library directory
**Examples:**
```bash
# Import local PDF
paperlib import --pdf paper.pdf --title "My Research" --tags ml ai
# Import from arXiv
paperlib import --arxiv 2212.06340
# Import with arXiv URL
paperlib import --arxiv https://arxiv.org/abs/2212.06340
# Import to specific library
paperlib import --pdf paper.pdf --library ~/research
```
**Behavior:**
- Generates stable paper ID based on content (local) or arXiv ID
- Copies PDF to structured storage location
- Creates meta.json with paper metadata
- Prevents duplicate imports (same content/ID)
- Indexes paper in search database
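Because `--arxiv` accepts either a bare ID or a URL, the importer has to pull the ID out of its argument first. The sketch below shows one way to do that; `extract_arxiv_id` is a hypothetical helper, and it assumes new-style arXiv identifiers (`YYMM.NNNNN` with an optional version suffix) only.

```python
import re

# New-style arXiv identifier, optionally followed by a version such as "v2".
ARXIV_ID_RE = re.compile(r"(?<!\d)\d{4}\.\d{4,5}(v\d+)?(?!\d)")


def extract_arxiv_id(id_or_url: str) -> str:
    """Return the arXiv ID found in a bare ID or an arxiv.org URL."""
    match = ARXIV_ID_RE.search(id_or_url)
    if match is None:
        raise ValueError(f"no arXiv ID found in {id_or_url!r}")
    return match.group(0)


print(extract_arxiv_id("https://arxiv.org/abs/2212.06340"))
```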
---
### `paperlib list`
List all papers in the library with their current status.
**Options:**
- `--library PATH`: Specify library directory
- `--json`: Output in JSON format
**Examples:**
```bash
# List all papers
paperlib list
# List papers in specific library
paperlib list --library ~/research
# Get machine-readable output
paperlib list --json
```
**Output Format:**
```
Found 3 papers:
📄 arxiv-2212_06340
The new discontinuous Galerkin methods based numerical relativity program Nmesh
By: Wolfgang Tichy, Liwei Ji, Ananya Adhikari (+2 more)
Categories: gr-qc
⏳ local-a1b2c3d4e5f6
Machine Learning Applications in Physics
Categories: cs.AI, physics.comp-ph
```
**Status Indicators:**
- ⏳ Paper imported, conversion pending
- 📄 PDF converted to Markdown
- 📝 AI summary generated
- ❌ Conversion or processing failed
---
### `paperlib show PAPER_ID`
Show detailed information about a specific paper.
**Arguments:**
- `PAPER_ID`: The unique paper identifier
**Options:**
- `--library PATH`: Specify library directory
- `--json`: Output in JSON format
**Examples:**
```bash
# Show paper details
paperlib show arxiv-2212_06340
# Show with JSON output
paperlib show local-a1b2c3d4e5f6 --json
```
**Output includes:**
- All metadata fields
- Processing status
- File locations and existence
- Import timestamp
- Tags and notes
---
### `paperlib convert`
Convert papers from PDF to Markdown using MinerU.
**Options:**
- `--library PATH`: Specify library directory
- `--paper-id ID`: Convert specific paper only
**Examples:**
```bash
# Convert all pending papers
paperlib convert
# Convert specific paper
paperlib convert --paper-id arxiv-2212_06340
# Convert in specific library
paperlib convert --library ~/research
```
**Behavior:**
- Processes papers with `conversion_status: pending`
- Uses MinerU for PDF to Markdown conversion
- Updates metadata with conversion status
- Creates conversion logs in `logs/` directory
- Handles conversion failures gracefully
---
### `paperlib reindex`
Rebuild the search index from stored paper metadata.
**Options:**
- `--library PATH`: Specify library directory
**Examples:**
```bash
# Rebuild index
paperlib reindex
# Rebuild index for specific library
paperlib reindex --library ~/research
```
**Behavior:**
- Clears existing SQLite database
- Scans all meta.json files in papers/ directory
- Rebuilds full-text search index
- Reports statistics on completion
- Safe to run anytime (repairs corrupted index)
---
### `paperlib status`
Show library configuration and layout information.
**Options:**
- `--library PATH`: Specify library directory
- `--json`: Output in JSON format
**Examples:**
```bash
# Show current library status
paperlib status
# Show specific library status
paperlib status --library ~/research
```
**Output:**
```
root: /home/user/papers
config: /home/user/papers/config/config.toml
database: /home/user/papers/db/paperlib.sqlite3
papers: /home/user/papers/papers
inbox: /home/user/papers/inbox
cache: /home/user/papers/cache
```
---
## Future Commands
These commands are planned but not yet implemented:
### `paperlib search QUERY`
Search papers by content and metadata.
### `paperlib summarize [PAPER_ID]`
Generate AI summaries for papers.
### `paperlib export FORMAT`
Export papers in various formats.
### `paperlib doctor`
Diagnose and repair library issues.
---
## Exit Codes
paperlib commands return standard exit codes:
- `0`: Success
- `1`: General runtime error (file not found, etc.)
- `2`: Command-line argument error (invalid or missing arguments)
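A caller can branch on these codes instead of parsing error text. The sketch below is illustrative only: `describe_exit` is a hypothetical helper, and the Python interpreter stands in for `paperlib` so the example runs without the tool installed.

```python
import subprocess
import sys

# Meanings of the exit codes listed above.
EXIT_MEANINGS = {
    0: "success",
    1: "general error",
    2: "command-line argument error",
}


def describe_exit(returncode: int) -> str:
    """Translate a process exit code into the meaning documented above."""
    return EXIT_MEANINGS.get(returncode, f"unexpected exit code {returncode}")


# Stand-in process that exits with code 2, as paperlib would on bad arguments.
completed = subprocess.run([sys.executable, "-c", "import sys; sys.exit(2)"])
print(describe_exit(completed.returncode))
```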
## Configuration
paperlib looks for configuration in these locations (in order):
1. `$LIBRARY_ROOT/config/config.toml`
2. `~/.config/paperlib/config.toml`
3. Built-in defaults
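The lookup order can be sketched as a first-match search. This assumes the first existing file wins and that files are not merged, which the list above does not state explicitly; `find_config` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def find_config(library_root: Path, home: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing config file, or None to use built-in defaults."""
    home = home or Path.home()
    candidates = [
        library_root / "config" / "config.toml",
        home / ".config" / "paperlib" / "config.toml",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```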
## JSON Output Format
When using `--json`, commands output structured data suitable for programmatic consumption:
```json
{
"papers": [
{
"paper_id": "arxiv-2212_06340",
"title": "Example Paper",
"authors": ["Alice Smith", "Bob Jones"],
"conversion_status": "success",
"imported_at": "2024-01-15T10:30:00"
}
],
"total": 1
}
```
This format is stable across paperlib versions for reliable automation.
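As a small illustration of consuming this output, the sketch below parses a payload shaped like the example above and selects papers by conversion status. The helper name `papers_with_status` is hypothetical, and the payload is hard-coded rather than read from a real `paperlib list --json` run.

```python
import json

# Payload shaped like the `--json` example above.
SAMPLE_OUTPUT = """
{
  "papers": [
    {
      "paper_id": "arxiv-2212_06340",
      "title": "Example Paper",
      "authors": ["Alice Smith", "Bob Jones"],
      "conversion_status": "success",
      "imported_at": "2024-01-15T10:30:00"
    }
  ],
  "total": 1
}
"""


def papers_with_status(payload: str, status: str) -> list[str]:
    """Return the IDs of papers whose conversion_status equals `status`."""
    data = json.loads(payload)
    return [
        paper["paper_id"]
        for paper in data["papers"]
        if paper.get("conversion_status") == status
    ]


print(papers_with_status(SAMPLE_OUTPUT, "success"))
```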
+638
@@ -0,0 +1,638 @@
# Integration Guide
This document describes how to integrate paperlib with higher-level tools and automation workflows.
## Overview
paperlib is designed as a **library engine** that higher-level tools can build upon. It provides:
- **Stable CLI interface** with machine-readable JSON output
- **File-based storage** that external tools can read directly
- **Python API** for programmatic access
- **Event hooks** for workflow integration (future)
## CLI Integration
### Machine-Readable Output
Most paperlib commands support `--json` output for automation:
```bash
# Get library statistics
paperlib status --json
{
"library_root": "/home/user/papers",
"total_papers": 42,
"by_status": {"converted": 38, "pending": 4},
"last_updated": "2024-01-15T10:30:00Z"
}
# List papers with metadata
paperlib list --json
{
"papers": [
{
"paper_id": "arxiv-2212_06340",
"title": "Example Paper",
"authors": ["Alice Smith", "Bob Jones"],
"categories": ["cs.AI"],
"conversion_status": "success",
"summary_status": "pending",
"imported_at": "2024-01-15T10:30:00Z"
}
],
"total": 1
}
# Import with JSON response
paperlib import --arxiv 2212.06340 --json
{
"success": true,
"paper_id": "arxiv-2212_06340",
"title": "Example Paper Title",
"message": "Successfully imported arXiv paper"
}
```
### Exit Codes
paperlib commands follow standard Unix exit code conventions:
```bash
paperlib import --arxiv 2212.06340
echo $? # 0 on success, non-zero on failure
# Check if paper exists before processing
if paperlib show "$paper_id" --json >/dev/null 2>&1; then
    echo "Paper exists"
else
    echo "Paper not found"
fi
```
### Scripting Examples
#### Daily arXiv Import
```bash
#!/bin/bash
# daily-arxiv.sh - Import papers from daily arXiv feed
LIBRARY="$HOME/research"
ARXIV_FEED_URL="http://export.arxiv.org/rss/cs.AI"

# Parse RSS feed and extract arXiv IDs (grep -P requires GNU grep)
curl -s "$ARXIV_FEED_URL" | \
    grep -oP 'arxiv\.org/abs/\K[0-9]{4}\.[0-9]{4,5}' | \
    while IFS= read -r arxiv_id; do
        echo "Importing $arxiv_id..."
        paperlib import --arxiv "$arxiv_id" --library "$LIBRARY" --json
    done

# Convert newly imported papers
paperlib convert --library "$LIBRARY"

# Generate daily report (jq's strftime formats the current date in UTC)
paperlib list --library "$LIBRARY" --json | \
    jq '.papers | map(select(.imported_at | startswith(now | strftime("%Y-%m-%d"))))'
```
#### Batch Processing
```bash
#!/bin/bash
# batch-process.sh - Process multiple papers from a list
LIBRARY="$HOME/research"
PAPER_LIST="papers.txt"

while IFS= read -r pdf_path; do
    if [[ -f "$pdf_path" ]]; then
        echo "Importing $pdf_path..."
        result=$(paperlib import --pdf "$pdf_path" --library "$LIBRARY" --json)
        if [[ $? -eq 0 ]]; then
            paper_id=$(echo "$result" | jq -r '.paper_id')
            echo "Successfully imported as $paper_id"
        else
            echo "Failed to import $pdf_path"
        fi
    fi
done < "$PAPER_LIST"

# Convert all pending papers
paperlib convert --library "$LIBRARY"
```
## Python API
### Direct Library Access
```python
from paperlib.config import LibraryPaths
from paperlib.storage import PaperStorageManager
from paperlib.index import DatabaseManager
from paperlib.importer import ArxivImporter, LocalImporter
# Initialize library components
library_paths = LibraryPaths.from_root("/path/to/library")
storage = PaperStorageManager(library_paths)
database = DatabaseManager(library_paths)
database.initialize_database()
# Import paper programmatically
arxiv_importer = ArxivImporter(storage)
metadata = arxiv_importer.import_arxiv_paper("2212.06340")
database.index_paper(metadata)
# Search and retrieve
results = list(database.search_papers("neural networks"))
for result in results:
    paper = storage.load_paper_metadata(result["paper_id"], result["source_type"])
    print(f"{paper.title} by {', '.join(paper.authors)}")
# Get statistics
stats = database.get_statistics()
print(f"Total papers: {stats['total_papers']}")
```
### Metadata Processing
```python
from pathlib import Path
from paperlib.models import PaperMetadata, PaperSummary

# Process all papers in library
papers_dir = Path("/home/user/papers/papers")
for meta_file in papers_dir.rglob("meta.json"):
    # Load metadata
    metadata = PaperMetadata.load_from_file(meta_file)
    print(f"Paper: {metadata.title}")

    # Check for summary; papers without one have no tags or entities yet
    summary_path = meta_file.parent / "summary.json"
    if summary_path.exists():
        summary = PaperSummary.load_from_file(summary_path)
        # Extract key information
        tags = summary.problem_tags + summary.technique_tags
        entities = summary.entities
        print(f"Tags: {', '.join(tags)}")
        print(f"Entities: {', '.join(entities)}")
```
## File System Integration
### Direct File Access
Since paperlib uses a documented file layout, tools can read data directly:
```python
import json
from pathlib import Path


def scan_library(library_root: Path):
    """Scan library and extract metadata."""
    papers = []
    for meta_file in library_root.glob("papers/**/meta.json"):
        with meta_file.open() as f:
            metadata = json.load(f)
        papers.append(metadata)
    return papers


def find_papers_by_category(library_root: Path, category: str):
    """Find papers in a specific category."""
    matching_papers = []
    for meta_file in library_root.glob("papers/**/meta.json"):
        with meta_file.open() as f:
            metadata = json.load(f)
        if category in metadata.get("categories", []):
            matching_papers.append(metadata)
    return matching_papers
```
### Watch for Changes
```python
import json
import time
from pathlib import Path
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler


class PaperLibraryHandler(FileSystemEventHandler):
    def __init__(self, library_root):
        self.library_root = Path(library_root)

    def on_created(self, event):
        if event.src_path.endswith("meta.json"):
            print(f"New paper imported: {event.src_path}")
            # Trigger processing workflow
            self.process_new_paper(event.src_path)

    def on_modified(self, event):
        if event.src_path.endswith("summary.json"):
            print(f"Summary updated: {event.src_path}")
            # Update downstream systems

    def process_new_paper(self, meta_path):
        """Handle newly imported paper."""
        # Load metadata
        with open(meta_path) as f:
            metadata = json.load(f)
        # Trigger downstream processing
        # - Send to processing queue
        # - Update knowledge base
        # - Generate notifications


# Watch library for changes
observer = Observer()
handler = PaperLibraryHandler("/home/user/papers")
observer.schedule(handler, "/home/user/papers/papers", recursive=True)
observer.start()
try:
    # The observer runs in a background thread; keep the main thread alive
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()
```
## Higher-Level Tool Examples
### Research Dashboard
```python
"""research_dashboard.py - Web dashboard for research library"""
from flask import Flask, jsonify, render_template
from paperlib.config import LibraryPaths
from paperlib.storage import PaperStorageManager
from paperlib.index import DatabaseManager

app = Flask(__name__)

# Initialize paperlib components
library_paths = LibraryPaths.from_root("/home/user/research")
storage = PaperStorageManager(library_paths)
database = DatabaseManager(library_paths)


@app.route('/api/papers')
def list_papers():
    """List all papers with metadata."""
    papers = list(database.list_papers(limit=50))
    return jsonify(papers)


@app.route('/api/search/<query>')
def search_papers(query):
    """Search papers by query."""
    results = list(database.search_papers(query, limit=20))
    return jsonify(results)


@app.route('/api/stats')
def library_stats():
    """Get library statistics."""
    stats = database.get_statistics()
    return jsonify(stats)


@app.route('/')
def dashboard():
    """Main dashboard page."""
    return render_template('dashboard.html')


if __name__ == '__main__':
    app.run(debug=True)
```
### Daily Digest Generator
```python
"""daily_digest.py - Generate daily research digest"""
import html
import json
from datetime import datetime, timedelta
from pathlib import Path
from paperlib.config import LibraryPaths
from paperlib.index import DatabaseManager


def generate_daily_digest(library_root: str, output_file: str):
    """Generate digest of recently imported papers."""
    # Initialize database
    library_paths = LibraryPaths.from_root(library_root)
    database = DatabaseManager(library_paths)

    # Get papers from last 24 hours
    # (ISO 8601 timestamps in the same format compare correctly as strings)
    yesterday = datetime.now() - timedelta(days=1)
    yesterday_iso = yesterday.isoformat()
    recent_papers = []
    for paper in database.list_papers():
        if paper["imported_at"] >= yesterday_iso:
            recent_papers.append(paper)
    if not recent_papers:
        print("No new papers imported in the last 24 hours.")
        return

    # Group by category
    by_category = {}
    for paper in recent_papers:
        categories = json.loads(paper["categories_json"])
        for category in categories:
            if category not in by_category:
                by_category[category] = []
            by_category[category].append(paper)

    # Generate HTML digest
    html_content = f"""
<html>
<head><title>Daily Research Digest - {datetime.now().strftime('%Y-%m-%d')}</title></head>
<body>
<h1>Daily Research Digest</h1>
<p>Found {len(recent_papers)} new papers</p>
"""
    for category, papers in by_category.items():
        html_content += f"<h2>{html.escape(category)}</h2><ul>"
        for paper in papers:
            # Escape text so titles containing < or & do not break the markup
            title = html.escape(paper["title"])
            paper_id = html.escape(paper["paper_id"])
            html_content += f'<li><strong>{title}</strong> ({paper_id})</li>'
        html_content += "</ul>"
    html_content += "</body></html>"

    # Write output
    Path(output_file).write_text(html_content)
    print(f"Digest written to {output_file}")


if __name__ == "__main__":
    generate_daily_digest("/home/user/research", "digest.html")
```
### Literature Review Assistant
```python
"""review_assistant.py - AI-powered literature review helper"""
import json
from pathlib import Path
from paperlib.config import LibraryPaths
from paperlib.index import DatabaseManager
from paperlib.models import PaperSummary


class ReviewAssistant:
    def __init__(self, library_root: str):
        self.library_paths = LibraryPaths.from_root(library_root)
        self.database = DatabaseManager(self.library_paths)

    def find_related_papers(self, paper_id: str, max_results: int = 10):
        """Find papers related to the given paper."""
        # Get source paper metadata
        source_paper = self.database.get_paper(paper_id)
        if not source_paper:
            return []

        # Extract search terms from title and categories
        title_words = source_paper["title"].lower().split()
        categories = json.loads(source_paper["categories_json"])

        # Search for papers with similar keywords
        search_terms = title_words + categories
        related_papers = []
        for term in search_terms:
            results = list(self.database.search_papers(term, limit=5))
            for result in results:
                if result["paper_id"] != paper_id:
                    related_papers.append(result)

        # Remove duplicates and return top results
        seen_ids = set()
        unique_papers = []
        for paper in related_papers:
            if paper["paper_id"] not in seen_ids:
                seen_ids.add(paper["paper_id"])
                unique_papers.append(paper)
            if len(unique_papers) >= max_results:
                break
        return unique_papers

    def generate_topic_overview(self, topic: str):
        """Generate overview of papers on a specific topic."""
        # Search for papers on topic
        papers = list(self.database.search_papers(topic, limit=50))
        if not papers:
            return f"No papers found for topic: {topic}"

        # Analyze summaries if available
        key_entities = set()
        techniques = set()
        for paper in papers:
            summary_path = Path(paper["summary_json_path"])
            if summary_path.exists():
                summary = PaperSummary.load_from_file(summary_path)
                key_entities.update(summary.entities)
                techniques.update(summary.technique_tags)

        # Generate overview
        overview = f"""
Topic: {topic}
Papers found: {len(papers)}
Key entities mentioned:
{', '.join(sorted(key_entities)[:10])}
Common techniques:
{', '.join(sorted(techniques)[:10])}
Recent papers:
"""
        # Add recent papers
        recent_papers = sorted(papers, key=lambda x: x["imported_at"], reverse=True)[:5]
        for paper in recent_papers:
            overview += f"\n- {paper['title']} ({paper['paper_id']})"
        return overview


# Usage
assistant = ReviewAssistant("/home/user/research")
overview = assistant.generate_topic_overview("transformer architecture")
print(overview)
```
## Integration Patterns
### Pipeline Processing
```bash
# Multi-stage processing pipeline
paperlib import --arxiv 2212.06340 --json > import_result.json
paper_id=$(jq -r '.paper_id' import_result.json)
# Convert to markdown
paperlib convert --paper-id "$paper_id"
# Generate summary (when available)
# paperlib summarize --paper-id "$paper_id"
# Update downstream systems
curl -X POST "http://research-db/api/papers" \
-H "Content-Type: application/json" \
-d @import_result.json
```
### Event-Driven Architecture
```python
"""event_handler.py - Process paperlib events"""
import json
from pathlib import Path
import pika  # RabbitMQ client


class PaperLibraryEventHandler:
    def __init__(self, rabbitmq_url: str):
        self.connection = pika.BlockingConnection(pika.URLParameters(rabbitmq_url))
        self.channel = self.connection.channel()

    def on_paper_imported(self, paper_metadata: dict):
        """Handle new paper import."""
        message = {
            "event": "paper_imported",
            "paper_id": paper_metadata["paper_id"],
            "title": paper_metadata["title"],
            "categories": paper_metadata["categories"],
            "timestamp": paper_metadata["imported_at"]
        }
        # Send to processing queue
        self.channel.basic_publish(
            exchange='',
            routing_key='paper_processing',
            body=json.dumps(message)
        )

    def on_summary_generated(self, paper_id: str, summary_path: Path):
        """Handle summary generation."""
        with summary_path.open() as f:
            summary = json.load(f)
        message = {
            "event": "summary_generated",
            "paper_id": paper_id,
            "tags": summary["problem_tags"] + summary["technique_tags"],
            "entities": summary["entities"]
        }
        # Send to indexing service
        self.channel.basic_publish(
            exchange='',
            routing_key='summary_indexing',
            body=json.dumps(message)
        )
```
## Best Practices
### Error Handling
```python
import subprocess
import json


def safe_paperlib_command(command: list[str]) -> dict:
    """Execute paperlib command with proper error handling."""
    try:
        result = subprocess.run(
            ["paperlib"] + command + ["--json"],
            capture_output=True,
            text=True,
            check=True
        )
        return json.loads(result.stdout)
    except subprocess.CalledProcessError as e:
        return {
            "success": False,
            "error": e.stderr,
            "exit_code": e.returncode
        }
    except json.JSONDecodeError as e:
        return {
            "success": False,
            "error": f"Invalid JSON response: {e}",
            "raw_output": result.stdout
        }


# Usage
result = safe_paperlib_command(["import", "--arxiv", "2212.06340"])
if result.get("success", True):  # Assume success if no "success" field
    print(f"Imported paper: {result['paper_id']}")
else:
    print(f"Import failed: {result['error']}")
```
### Performance Optimization
```python
# Batch operations for better performance
from paperlib.config import LibraryPaths
from paperlib.index import DatabaseManager
from paperlib.storage import PaperStorageManager


def batch_index_papers(library_root: str, paper_ids: list[str]):
    """Index multiple papers efficiently."""
    library_paths = LibraryPaths.from_root(library_root)
    database = DatabaseManager(library_paths)
    storage = PaperStorageManager(library_paths)
    # Begin transaction for batch insert
    with database._get_connection() as conn:
        for paper_id in paper_ids:
            # Assumption: the source type is the paper ID prefix
            # ("arxiv-2212_06340" -> "arxiv", "local-a1b2c3d4e5f6" -> "local")
            source_type = paper_id.split("-", 1)[0]
            metadata = storage.load_paper_metadata(paper_id, source_type)
            if metadata:
                database.index_paper(metadata)
    # Automatic commit on context exit
```
### Configuration Management
```python
# config_manager.py - Centralized configuration
import os
import subprocess
from pathlib import Path


class ConfigManager:
    def __init__(self):
        self.library_root = os.getenv("PAPERLIB_ROOT", str(Path.home() / "research"))
        self.api_keys = {
            "openai": os.getenv("OPENAI_API_KEY"),
            "anthropic": os.getenv("ANTHROPIC_API_KEY")
        }

    def get_library_path(self, name: str = "default") -> str:
        """Get library path by name."""
        if name == "default":
            return str(self.library_root)
        return str(Path.home() / f"research-{name}")

    def paperlib_command(self, args: list[str], library_name: str = "default") -> list[str]:
        """Build a paperlib command line for the named library."""
        # --library is a per-command option, so it goes after the subcommand
        return ["paperlib", *args, "--library", self.get_library_path(library_name)]


config = ConfigManager()

# Usage in scripts
cmd = config.paperlib_command(["list", "--json"], "arxiv")
result = subprocess.run(cmd, capture_output=True, text=True)
```
This integration guide provides the foundation for building sophisticated research workflows on top of paperlib's stable, local-first architecture.
+259
@@ -0,0 +1,259 @@
# Storage Layout
This document describes the on-disk structure and organization of a paperlib library.
## Overview
A paperlib library is a directory containing all papers, metadata, configuration, and index data. The layout is designed to be:
- **Human-readable**: Directory structure is intuitive and browsable
- **Stable**: File locations don't change unexpectedly
- **Rebuildable**: Index can be reconstructed from source files
- **Portable**: Entire library can be moved or backed up as a unit
## Directory Structure
```
library_root/
├── config/                          # Library configuration
│   ├── config.toml                  # Main configuration file
│   ├── vocab.yaml                   # Controlled vocabulary (future)
│   └── prompts/                     # AI prompt templates (future)
│       └── summarize_paper.md
├── papers/                          # Paper storage (source of truth)
│   ├── arxiv/                       # arXiv papers organized by year
│   │   └── 2022/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json        # Paper metadata
│   │           ├── source.pdf       # Original PDF
│   │           ├── paper.md         # Converted markdown
│   │           ├── summary.json     # AI-generated summary
│   │           ├── summary.md       # Rendered summary
│   │           ├── ref.bib          # Bibliography (future)
│   │           ├── assets/          # Images, figures
│   │           └── logs/            # Processing logs
│   │               └── mineru.log
│   └── local/                       # Local PDF imports by hash
│       └── a1b2c3d4e5f6/
│           └── ... (same structure)
├── inbox/                           # Temporary import staging (future)
├── db/                              # Search index (rebuildable)
│   └── paperlib.sqlite3
└── cache/                           # Processing cache (safe to delete)
```
## Paper Directory Organization
### arXiv Papers
arXiv papers are organized by year and paper ID:
```
papers/arxiv/YEAR/arxiv-NORMALIZED_ID/
```
Where:
- `YEAR` is extracted from the arXiv ID (e.g., `2212.06340` → `2022`)
- `NORMALIZED_ID` replaces the dot with an underscore; a version suffix, if present, is kept
  - `2212.06340` → `arxiv-2212_06340`
  - `2212.06340v2` → `arxiv-2212_06340v2`
**Examples:**
```
papers/arxiv/2022/arxiv-2212_06340/
papers/arxiv/2023/arxiv-2301_12345v1/
papers/arxiv/2024/arxiv-2405_98765/
```
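The mapping from arXiv ID to directory can be sketched as follows. This is an illustration under stated assumptions, not paperlib's code: `arxiv_paper_dir` is a hypothetical name, only new-style IDs (`YYMM.NNNNN`) are handled, and the two-digit year is assumed to fall in the 2000s.

```python
def arxiv_paper_dir(arxiv_id: str) -> str:
    """Return the library-relative directory for a new-style arXiv ID."""
    # The first two digits of the ID are the two-digit year of submission.
    year = 2000 + int(arxiv_id[:2])
    # Dots become underscores; a version suffix such as "v2" is kept as-is.
    normalized = "arxiv-" + arxiv_id.replace(".", "_")
    return f"papers/arxiv/{year}/{normalized}"


print(arxiv_paper_dir("2212.06340"))
```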
### Local Papers
Local papers are organized by content hash:
```
papers/local/HASH_PREFIX/
```
Where `HASH_PREFIX` is the first 16 characters of the SHA256 hash of the PDF file.
**Examples:**
```
papers/local/a1b2c3d4e5f67890/
papers/local/fedcba9876543210/
```
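Deriving the directory name from file content might look like the sketch below. `local_paper_dir` is a hypothetical helper; the prefix length defaults to the 16 characters stated above, and the file is hashed in chunks so large PDFs are not read into memory at once.

```python
import hashlib
from pathlib import Path


def local_paper_dir(pdf_path: Path, prefix_len: int = 16) -> str:
    """Return the library-relative directory for a local PDF, keyed by SHA256."""
    digest = hashlib.sha256()
    with pdf_path.open("rb") as f:
        # Read 1 MiB at a time until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return f"papers/local/{digest.hexdigest()[:prefix_len]}"
```

Two byte-identical files always map to the same directory, which is what makes duplicate imports detectable.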
## File Types
### Required Files
Every paper directory contains:
#### `meta.json`
The canonical metadata file (JSON format):
```json
{
"paper_id": "arxiv-2212_06340",
"source_type": "arxiv",
"source_id": "2212.06340",
"title": "Example Paper Title",
"authors": ["Alice Smith", "Bob Jones"],
"published_date": "2022-12-13T02:46:55",
"categories": ["cs.AI", "stat.ML"],
"pdf_path": "papers/arxiv/2022/arxiv-2212_06340/source.pdf",
"paper_md_path": "papers/arxiv/2022/arxiv-2212_06340/paper.md",
"imported_at": "2024-01-15T10:30:00",
"conversion_status": "success",
"summary_status": "not_requested",
"tags": ["machine-learning"],
"notes": "Important paper on neural networks"
}
```
#### `source.pdf`
The original PDF file, exactly as imported.
### Generated Files
These files are created by paperlib processing:
#### `paper.md`
Markdown conversion of the PDF, generated by MinerU or other converters.
#### `summary.json` (optional)
AI-generated structured summary:
```json
{
"schema_version": "1.0",
"one_sentence_summary": "This paper introduces...",
"problem_statement": "Current methods have limitations...",
"method_overview": "We propose a novel approach...",
"main_results": "Experiments show 95% accuracy...",
"claimed_contributions": ["Novel architecture", "Improved performance"],
"problem_tags": ["classification", "optimization"],
"technique_tags": ["neural-networks", "transformers"],
"entities": ["BERT", "ImageNet", "ResNet"],
"relevance_to_user": 0.85
}
```
#### `summary.md` (optional)
Human-readable summary rendered from `summary.json`.
### Supporting Directories
#### `assets/`
Contains extracted images, figures, and other media from the PDF conversion process.
#### `logs/`
Processing logs for debugging and audit trails:
- `mineru.log` - PDF conversion logs
- `summary.log` - AI summarization logs (future)
## Index Database
The SQLite database at `db/paperlib.sqlite3` contains:
### Tables
#### `papers`
Main paper index with searchable fields:
- Metadata from all `meta.json` files
- Computed search fields (full-text, author lists, etc.)
- Processing status tracking
#### `papers_fts`
Full-text search virtual table (SQLite FTS5) for content search.
### Rebuilding
The database is **always rebuildable** from the source files:
```bash
paperlib reindex
```
This design ensures the JSON files remain the authoritative source of truth.
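To make "rebuildable" concrete, the sketch below drops and recreates an index from the `meta.json` files alone. It is deliberately simplified: the single plain `papers` table and its columns are assumptions for illustration, not paperlib's real schema, and the FTS5 table is omitted.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, conn: sqlite3.Connection) -> int:
    """Recreate a simplified papers table from meta.json files; return row count."""
    conn.execute("DROP TABLE IF EXISTS papers")
    conn.execute(
        "CREATE TABLE papers (paper_id TEXT PRIMARY KEY, title TEXT, meta_json TEXT)"
    )
    count = 0
    for meta_file in sorted(library_root.glob("papers/**/meta.json")):
        meta = json.loads(meta_file.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT OR REPLACE INTO papers VALUES (?, ?, ?)",
            (meta["paper_id"], meta.get("title", ""), json.dumps(meta)),
        )
        count += 1
    conn.commit()
    return count
```

Because every row comes from a file under `papers/`, deleting the database loses nothing that a rebuild cannot restore.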
## Path Conventions
### Relative Paths
All paths in `meta.json` are relative to the library root:
```json
{
"pdf_path": "papers/local/a1b2c3d4e5f6/source.pdf",
"paper_md_path": "papers/local/a1b2c3d4e5f6/paper.md"
}
```
### Cross-Platform Compatibility
All paths use forward slashes (`/`) regardless of operating system.
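A possible way to honor both conventions (library-relative, forward slashes) with `pathlib` is sketched below. The helper names `to_stored_path` and `from_stored_path` are hypothetical.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def to_stored_path(absolute: PurePath, library_root: PurePath) -> str:
    """Convert an absolute path into the relative, slash-separated stored form."""
    return absolute.relative_to(library_root).as_posix()


def from_stored_path(stored: str, library_root: PurePath) -> PurePath:
    """Resolve a stored path against a library root on the current platform."""
    return library_root.joinpath(*PurePosixPath(stored).parts)


# A Windows-style path still serializes with forward slashes.
windows_pdf = PureWindowsPath(r"C:\lib\papers\local\abc\source.pdf")
print(to_stored_path(windows_pdf, PureWindowsPath(r"C:\lib")))
```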
## Backup and Portability
### What to Backup
For complete library backup, include:
- `config/` directory (configuration)
- `papers/` directory (source of truth)
### What NOT to Backup
These can be regenerated:
- `db/` directory (rebuildable index)
- `cache/` directory (temporary files)
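The backup policy above can be sketched with the standard `tarfile` module. `backup_library` is a hypothetical helper, and the gzip-compressed tar archive is just one reasonable choice of format.

```python
import tarfile
from pathlib import Path

# Directories that hold the source of truth; db/ and cache/ are skipped.
BACKUP_DIRS = ("config", "papers")


def backup_library(library_root: Path, archive_path: Path) -> list[str]:
    """Archive config/ and papers/ and return the member names written."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in BACKUP_DIRS:
            source = library_root / name
            if source.is_dir():
                # arcname keeps archive paths relative to the library root.
                tar.add(source, arcname=name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

After restoring such an archive, `paperlib reindex` recreates the omitted `db/` contents.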
### Moving Libraries
To move a library:
1. Copy the entire directory structure
2. Run `paperlib reindex` to rebuild the database
3. Update any absolute paths in configuration
## Storage Efficiency
### Deduplication
Papers are naturally deduplicated:
- arXiv papers by normalized arXiv ID
- Local papers by SHA256 content hash
### Large Files
For papers with large asset directories:
- Assets are stored alongside papers for locality
- Consider using file system compression or deduplication if needed
## File System Requirements
### Permissions
paperlib requires:
- Read/write access to library directory
- Ability to create subdirectories
- Atomic file operations for metadata updates
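The atomic-update requirement is commonly met by writing to a temporary file in the same directory and renaming it over the target. The sketch below shows that general pattern; whether paperlib uses exactly this approach is an assumption, and `write_json_atomic` is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write JSON so readers see either the old file or the new one, never a mix."""
    # The temp file must live in the same directory so the rename stays
    # on one file system.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        # os.replace overwrites the destination in a single step.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```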
### File System Features
Recommended:
- Case-sensitive file system (avoids conflicts)
- Support for Unicode filenames
- Journaling (protects against corruption)
### Disk Space
Typical storage requirements:
- PDF files: 1-10 MB each
- Markdown conversions: 10-100 KB each
- Metadata: ~1-5 KB per paper
- Database index: ~1-10 KB per paper
- Assets: Varies (0-50 MB for image-heavy papers)
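These per-paper ranges add up to a rough capacity estimate. The sketch below simply sums the low and high ends of each range listed above; `estimate_bytes` is a hypothetical helper, and binary units (1 KB = 1024 bytes) are an assumption.

```python
KB = 1024
MB = 1024 * KB

# (low, high) bytes per paper, taken from the ranges listed above.
PER_PAPER = {
    "pdf": (1 * MB, 10 * MB),
    "markdown": (10 * KB, 100 * KB),
    "metadata": (1 * KB, 5 * KB),
    "index": (1 * KB, 10 * KB),
    "assets": (0, 50 * MB),
}


def estimate_bytes(paper_count: int) -> tuple[int, int]:
    """Return (low, high) total bytes for a library of `paper_count` papers."""
    low = sum(lo for lo, _ in PER_PAPER.values())
    high = sum(hi for _, hi in PER_PAPER.values())
    return paper_count * low, paper_count * high


low, high = estimate_bytes(1000)
print(f"1000 papers: {low / MB:.0f} MB to {high / MB:.0f} MB")
```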
## Migration and Versioning
### Schema Evolution
When paperlib updates its storage format:
- Metadata schema versions are tracked in each file
- Migration tools handle format upgrades
- Backward compatibility is maintained when possible
### Validation
paperlib provides tools to validate library integrity:
```bash
paperlib doctor # (future command)
```
This will check:
- All referenced files exist
- Metadata format is valid
- Database consistency with files
- No orphaned or corrupted data
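Until `paperlib doctor` exists, the first check in this list can be approximated by reading `meta.json` directly. The sketch below is an illustration only: `find_missing_files` is a hypothetical helper, and it assumes `paper.md` is expected only once `conversion_status` is `success`.

```python
import json
from pathlib import Path


def find_missing_files(library_root: Path) -> list[tuple[str, str]]:
    """Return (paper_id, relative_path) pairs for referenced files that are absent."""
    missing = []
    for meta_file in sorted(library_root.glob("papers/**/meta.json")):
        meta = json.loads(meta_file.read_text(encoding="utf-8"))
        paper_id = meta.get("paper_id", meta_file.parent.name)
        expected = [meta.get("pdf_path")]
        # Assumption: paper.md is only expected after a successful conversion.
        if meta.get("conversion_status") == "success":
            expected.append(meta.get("paper_md_path"))
        for relative in expected:
            if relative and not (library_root / relative).is_file():
                missing.append((paper_id, relative))
    return missing
```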
+289
@@ -0,0 +1,289 @@
# Summary Schema
This document defines the structure and semantics of the `summary.json` files that contain AI-generated paper summaries.
## Overview
The `summary.json` file contains structured, AI-generated analysis of a paper. It is designed to:
- Provide consistent, machine-readable summaries
- Support research triage and discovery workflows
- Enable automated categorization and search
- Remain stable across different AI providers
- Use controlled vocabulary when available
## Schema Version 1.0
### File Structure
```json
{
"schema_version": "1.0",
"one_sentence_summary": "This paper introduces a novel neural architecture for...",
"problem_statement": "Current approaches to X suffer from limitations...",
"method_overview": "The authors propose a hybrid approach combining...",
"main_results": "Experiments show 15% improvement over baselines...",
"claimed_contributions": [
"Novel attention mechanism design",
"State-of-the-art results on ImageNet",
"Theoretical analysis of convergence properties"
],
"assumptions": [
"Data is independently distributed",
"Computational budget allows for large models"
],
"limitations": [
"Only evaluated on English text",
"Requires significant computational resources",
"Limited theoretical justification for design choices"
],
"problem_tags": ["classification", "computer-vision", "optimization"],
"technique_tags": ["neural-networks", "attention", "transformers"],
"entities": ["ImageNet", "BERT", "ResNet", "CIFAR-10"],
"relevance_to_user": 0.75,
"recommended_sections": ["Section 3.2", "Algorithm 1", "Table 2"]
}
```
## Field Definitions
### Required Fields
#### `schema_version` (string)
- **Purpose**: Track format version for migration
- **Format**: Version string in MAJOR.MINOR form (e.g., "1.0")
- **Required**: Yes
#### `one_sentence_summary` (string)
- **Purpose**: Concise paper overview for quick scanning
- **Guidelines**:
  - One complete sentence, under 200 characters
  - Focus on the main contribution or finding
  - Avoid technical jargon when possible
- **Example**: "This paper introduces a new attention mechanism that improves transformer efficiency by 40% while maintaining accuracy."
### Core Content Fields
#### `problem_statement` (string)
- **Purpose**: What problem does this paper address?
- **Guidelines**:
  - 2-3 sentences maximum
  - Focus on the gap or limitation being addressed
  - Explain why this problem matters
#### `method_overview` (string)
- **Purpose**: High-level description of the approach
- **Guidelines**:
  - 3-4 sentences maximum
  - Focus on the key innovation or insight
  - Avoid detailed algorithmic descriptions
#### `main_results` (string)
- **Purpose**: Key empirical findings or theoretical results
- **Guidelines**:
  - Quantitative results when available
  - Highlight significance of improvements
  - Note any surprising or counterintuitive findings
### Structured Lists
#### `claimed_contributions` (array of strings)
- **Purpose**: Authors' stated contributions
- **Guidelines**:
  - Extract from paper's contribution list
  - Preserve authors' framing and claims
  - 3-6 items typically
#### `assumptions` (array of strings)
- **Purpose**: Key assumptions underlying the work
- **Guidelines**:
  - Mathematical, methodological, or data assumptions
  - Critical for understanding applicability
  - Often unstated but important
#### `limitations` (array of strings)
- **Purpose**: Acknowledged or apparent limitations
- **Guidelines**:
- From authors' discussion or limitations section
- Obvious limitations not acknowledged by authors
- Important for understanding scope
### Categorization
#### `problem_tags` (array of strings)
- **Purpose**: Categorize the problem domain
- **Controlled vocabulary** (preferred values):
- `classification`, `regression`, `clustering`
- `optimization`, `search`, `planning`
- `generation`, `translation`, `summarization`
- `detection`, `segmentation`, `tracking`
- `compression`, `encoding`, `decoding`
- `privacy`, `security`, `robustness`
- `interpretability`, `fairness`, `ethics`
- `efficiency`, `scalability`, `deployment`
#### `technique_tags` (array of strings)
- **Purpose**: Categorize the technical approaches
- **Controlled vocabulary** (preferred values):
- `neural-networks`, `deep-learning`, `transformers`
- `cnn`, `rnn`, `lstm`, `gru`, `attention`
- `reinforcement-learning`, `supervised-learning`, `unsupervised-learning`
- `bayesian`, `probabilistic`, `statistical`
- `graph-neural-networks`, `graph-algorithms`
- `computer-vision`, `natural-language-processing`
- `federated-learning`, `transfer-learning`, `meta-learning`
- `adversarial`, `generative-models`, `vae`, `gan`
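
Because the vocabularies are preferred rather than mandatory, a tool would more plausibly warn about unlisted tags than reject them. The sketch below assumes that behavior; the names `PROBLEM_TAGS`, `TECHNIQUE_TAGS`, and `unlisted_tags` are hypothetical, and the sets simply copy the lists above.

```python
# Preferred vocabularies copied from the lists above.
PROBLEM_TAGS = {
    "classification", "regression", "clustering",
    "optimization", "search", "planning",
    "generation", "translation", "summarization",
    "detection", "segmentation", "tracking",
    "compression", "encoding", "decoding",
    "privacy", "security", "robustness",
    "interpretability", "fairness", "ethics",
    "efficiency", "scalability", "deployment",
}
TECHNIQUE_TAGS = {
    "neural-networks", "deep-learning", "transformers",
    "cnn", "rnn", "lstm", "gru", "attention",
    "reinforcement-learning", "supervised-learning", "unsupervised-learning",
    "bayesian", "probabilistic", "statistical",
    "graph-neural-networks", "graph-algorithms",
    "computer-vision", "natural-language-processing",
    "federated-learning", "transfer-learning", "meta-learning",
    "adversarial", "generative-models", "vae", "gan",
}


def unlisted_tags(summary: dict) -> dict[str, list[str]]:
    """Return, per tag field, the tags that are not in the preferred vocabulary."""
    vocabularies = {"problem_tags": PROBLEM_TAGS, "technique_tags": TECHNIQUE_TAGS}
    return {
        field: sorted(set(summary.get(field) or []) - vocabulary)
        for field, vocabulary in vocabularies.items()
    }
```

A non-empty list is a hint that a tag may need normalizing or that the vocabulary may need extending, not necessarily an error.
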
### Entities and References
#### `entities` (array of strings)
- **Purpose**: Important datasets, models, algorithms, or systems mentioned
- **Guidelines**:
- Proper names: "ImageNet", "BERT", "ResNet"
- Algorithms: "SGD", "Adam", "RANSAC"
- Benchmarks: "GLUE", "COCO", "WMT"
- Avoid generic terms like "neural network"
### User Relevance
#### `relevance_to_user` (number or null, optional)
- **Purpose**: Estimated relevance score for the user
- **Format**: Float between 0.0 and 1.0 (inclusive), or `null`
- **Guidelines**:
- Based on user's research interests (if known)
- `null` if user preferences unavailable
- Higher scores = more relevant
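
The range rule is easy to get subtly wrong in Python, so a small check may help. This is an illustrative sketch; `check_relevance` is a hypothetical name, and treating the bounds as inclusive is an assumption.

```python
def check_relevance(value) -> bool:
    """Return True if value is None or a number in the range [0.0, 1.0]."""
    if value is None:  # JSON null: user preferences unavailable
        return True
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    # NaN fails both comparisons, so it is rejected as well.
    return 0.0 <= value <= 1.0
```
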
#### `recommended_sections` (array of strings, optional)
- **Purpose**: Specific sections worth reading in detail
- **Format**: Section references as they appear in paper
- **Examples**: ["Section 3.2", "Algorithm 1", "Table 2", "Appendix A"]
## Generation Guidelines
### AI Provider Instructions
When generating summaries, AI models should:
1. **Read for understanding**: Focus on the paper's core contributions
2. **Use structured thinking**: Work through each field systematically
3. **Prefer facts over interpretation**: Extract what authors claim, not opinions
4. **Use controlled vocabulary**: Select from predefined tag lists when possible
5. **Be concise**: Optimize for quick scanning and search
6. **Handle uncertainty**: Use `null` or empty arrays for unclear fields
### Quality Criteria
Good summaries exhibit:
- **Accuracy**: Faithful to the paper's content
- **Completeness**: Cover all major aspects
- **Consistency**: Similar papers get similar treatment
- **Searchability**: Use terms that aid discovery
- **Brevity**: Information density over verbosity
### Common Issues to Avoid
- **Hallucination**: Never invent facts not in the paper
- **Editorializing**: Don't add opinions about paper quality
- **Inconsistent terminology**: Use standard field names
- **Over-abstraction**: Keep concrete details when useful
- **Under-specification**: Provide enough detail for usefulness
## Schema Evolution
### Version History
- **v1.0** (current): Initial schema with core fields
### Migration Strategy
When the schema evolves:
1. New versions increment the `schema_version` field
2. Migration tools handle format upgrades automatically
3. Backward compatibility maintained when possible
4. Deprecated fields are marked but preserved
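
One way such a migration tool could be organized is a registry of upgrade steps keyed by source version. The sketch below is an assumption about structure, not paperlib's implementation: `CURRENT_VERSION`, `MIGRATIONS`, and `migrate` are hypothetical names, and the registry is empty because v1.0 is the only version so far.

```python
CURRENT_VERSION = "1.0"

# Hypothetical registry: source version -> (target version, upgrade function).
# Empty for now, since no earlier schema version exists.
MIGRATIONS = {}


def migrate(summary: dict) -> dict:
    """Apply registered upgrade steps until the summary reaches CURRENT_VERSION."""
    result = dict(summary)  # shallow copy so the caller's dict is left untouched
    seen = set()
    while result.get("schema_version") != CURRENT_VERSION:
        version = result.get("schema_version")
        # Stop on unknown versions and on cycles in the registry.
        if version in seen or version not in MIGRATIONS:
            raise ValueError(f"no migration path from schema_version {version!r}")
        seen.add(version)
        target, upgrade = MIGRATIONS[version]
        result = upgrade(result)
        result["schema_version"] = target
    return result
```

Chaining single-step upgrades keeps each migration small: a future v1.1 would only need one new registry entry describing how to get there from v1.0.
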
### Extensibility
Future versions may add:
- Additional structured fields
- Hierarchical tag taxonomies
- Multi-lingual support
- Citation relationship mapping
- Experimental reproducibility metadata
## Integration with paperlib
### File Lifecycle
1. **Generation**: AI provider creates `summary.json`
2. **Validation**: paperlib validates against schema
3. **Indexing**: Content indexed for search
4. **Rendering**: Human-readable `summary.md` generated
5. **Updates**: Summaries can be regenerated with new models
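
The rendering step can be pictured as a straightforward mapping from fields to Markdown sections. The layout below is an assumption made for illustration (this document does not specify what `summary.md` looks like), and `render_markdown` is a hypothetical name.

```python
def render_markdown(summary: dict) -> str:
    """Render a summary dict as Markdown, skipping fields that are empty or absent."""
    lines = ["# Summary", "", summary["one_sentence_summary"], ""]

    # Free-text fields become one section each.
    text_sections = (
        ("problem_statement", "Problem"),
        ("method_overview", "Method"),
        ("main_results", "Results"),
    )
    for key, heading in text_sections:
        text = summary.get(key)
        if text:
            lines += [f"## {heading}", "", text, ""]

    # List fields become bulleted sections.
    list_sections = (
        ("claimed_contributions", "Contributions"),
        ("assumptions", "Assumptions"),
        ("limitations", "Limitations"),
    )
    for key, heading in list_sections:
        items = summary.get(key) or []
        if items:
            lines += [f"## {heading}", ""]
            lines += [f"- {item}" for item in items]
            lines.append("")

    return "\n".join(lines)
```
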
### Search Integration
Summary fields are indexed for search:
- Full-text search includes all text fields
- Tag-based search uses `problem_tags` and `technique_tags`
- Entity search uses the `entities` field
- Relevance ranking can use `relevance_to_user` scores
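
Tag and entity search can be served by a simple inverted index. The sketch below is illustrative only: how paperlib actually indexes summaries is not described here, and `build_tag_index` and its key layout are hypothetical.

```python
from collections import defaultdict

INDEXED_FIELDS = ("problem_tags", "technique_tags", "entities")


def build_tag_index(summaries: dict[str, dict]) -> dict[tuple[str, str], set[str]]:
    """Map (field, lowercased term) to the set of paper IDs that contain it."""
    index = defaultdict(set)
    for paper_id, summary in summaries.items():
        for field in INDEXED_FIELDS:
            # Treat a missing or null field as an empty list.
            for term in summary.get(field) or []:
                index[(field, term.lower())].add(paper_id)
    return dict(index)
```

Keying on the field as well as the term keeps an entity such as "attention" from colliding with a technique tag of the same name. Lowercasing the term makes lookups case-insensitive.
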
### API Integration
Higher-level tools can consume summaries programmatically:
```python
import json
from pathlib import Path

# Load summary
summary_path = Path("papers/arxiv/2022/arxiv-2212_06340/summary.json")
with summary_path.open(encoding="utf-8") as f:
    summary = json.load(f)

# Extract key information; only the required fields are guaranteed to exist
tags = summary.get("problem_tags", []) + summary.get("technique_tags", [])
# relevance_to_user may be absent or null; fall back to 0.0 in both cases
relevance = summary.get("relevance_to_user") or 0.0
entities = summary.get("entities", [])
```
This enables automated workflows like:
- Daily digest generation
- Research recommendation systems
- Literature review automation
- Cross-reference discovery
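
As an illustration of the first workflow, a digest could start by ranking papers on `relevance_to_user`. This is a sketch under assumptions: `top_papers` is a hypothetical helper, and excluding papers without a score (rather than ranking them last) is one possible policy.

```python
def top_papers(summaries: dict[str, dict], limit: int = 5) -> list[str]:
    """Return up to `limit` paper IDs, most relevant first.

    Papers whose relevance_to_user is missing or null are excluded.
    """
    scored = [
        (summary["relevance_to_user"], paper_id)
        for paper_id, summary in summaries.items()
        if summary.get("relevance_to_user") is not None
    ]
    # Sort by descending score; break ties by paper ID for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [paper_id for _, paper_id in scored[:limit]]
```
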
## Examples
### Machine Learning Paper
```json
{
"schema_version": "1.0",
"one_sentence_summary": "Introduces EfficientNet, a family of convolutional neural networks that achieve better accuracy and efficiency than previous models through compound scaling.",
"problem_statement": "Existing ConvNet scaling methods arbitrarily scale network dimensions, leading to suboptimal accuracy and efficiency trade-offs.",
"method_overview": "The paper proposes compound scaling, which uses a single compound coefficient to scale network width, depth, and resolution uniformly with fixed ratios found by a small grid search. The method is applied to a baseline network designed with neural architecture search.",
"main_results": "EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster than the best existing ConvNet.",
"claimed_contributions": [
"Novel compound scaling method for ConvNets",
"EfficientNet family with state-of-the-art accuracy/efficiency",
"Systematic study of scaling dimensions"
],
"assumptions": [
"ImageNet classification transfers to other vision tasks",
"Compound scaling works across different architectures"
],
"limitations": [
"Limited evaluation on tasks beyond image classification",
"Scaling coefficients may not generalize to all architectures"
],
"problem_tags": ["classification", "computer-vision", "efficiency"],
"technique_tags": ["cnn", "neural-architecture-search", "model-scaling"],
"entities": ["ImageNet", "MobileNet", "ResNet", "NASNet"],
"relevance_to_user": null,
"recommended_sections": ["Section 3.1", "Table 2", "Figure 2"]
}
```
This schema provides a foundation for consistent, structured paper analysis while remaining flexible enough to evolve with new research needs and AI capabilities.