docs: add docs

2026-04-17 16:54:30 -04:00
parent 74d140e5f8
commit 432010f431
10 changed files with 1682 additions and 19 deletions
@@ -1,19 +1,213 @@
-# `paperlib`: a CLI tool to manage paper library
+# paperlib
-This project use `mineru` to convert PDF to markdown, and establish a markdown paper library.
+A local-first paper library engine with a CLI for managing academic papers.
-## usage
+**paperlib** is designed to import PDF papers into a structured local library, convert PDFs into Markdown using external converters, maintain stable per-paper metadata files, and provide a searchable index database. It offers optional AI-based structured summaries while remaining useful even without AI features.
 ## Key Features
 - **Local-first**: All data lives locally in the paper library directory
 - **CLI-first**: All important workflows accessible from the command line
 - **JSON source of truth**: Per-paper metadata files with rebuildable SQLite index
 - **AI-optional**: Core workflows work without LLM configuration
 - **Machine-readable**: `--json` output for automation and integration
 - **Stable interfaces**: Designed for scripts and higher-level tools
 ## Installation
 ```bash
-# init a library in current directory
+# Install with uv (recommended)
 uv add paperlib
 # Or with pip
 pip install paperlib
 ```
 ## Quick Start
 ```bash
 # Initialize a paper library
 paperlib init
-# manually import a PDF
+# Import a local PDF
-paperlib import --pdf <path to pdf> [--arxiv-id xxxx.xxxxx]
+paperlib import --pdf paper.pdf --title "My Research Paper"
-# import an arXiv paper
+# Import from arXiv
-paperlib import --arxiv xxxx.xxxxx
+paperlib import --arxiv 2212.06340
-# place holder
+# List all papers
-...
+paperlib list
 # Show paper details
 paperlib show <paper-id>
 # Convert PDFs to Markdown (requires MinerU)
 paperlib convert
 # Search papers (planned; see Roadmap)
 paperlib search "machine learning"
 # Rebuild search index
 paperlib reindex
 ```
 ## Core Commands
 ### Library Management
 - `paperlib init [path]` - Initialize a paper library directory
 - `paperlib status` - Show library configuration and layout
 - `paperlib reindex` - Rebuild search index from stored papers
 ### Paper Import
 - `paperlib import --pdf <path>` - Import a local PDF file
 - `paperlib import --arxiv <id>` - Import paper from arXiv
 - Options: `--title`, `--notes`, `--tags`, `--library`
 ### Paper Management
 - `paperlib list` - List all imported papers with status
 - `paperlib show <paper-id>` - Show detailed paper information
 - `paperlib convert` - Convert pending papers to Markdown using MinerU
 ### Search (Future)
 - `paperlib search <query>` - Search papers by content and metadata
 ## Library Structure
 A paperlib library is organized as follows:
 ```
 library_root/
 ├── config/
 │   ├── config.toml
 │   └── prompts/
 ├── papers/
 │   ├── arxiv/
 │   │   └── 2022/
 │   │       └── arxiv-2212_06340/
 │   │           ├── meta.json          # Paper metadata
 │   │           ├── source.pdf         # Original PDF
 │   │           ├── paper.md           # Converted markdown
 │   │           ├── summary.json       # AI summary (optional)
 │   │           ├── summary.md         # Rendered summary
 │   │           ├── assets/            # Images, figures
 │   │           └── logs/              # Conversion logs
 │   └── local/
 │       └── <hash>/
 │           └── ...
 ├── db/
 │   └── paperlib.sqlite3              # Search index (rebuildable)
 ├── inbox/                             # Temporary imports
 └── cache/                            # Processing cache
 ```
 ## Data Model
 ### Paper Metadata (`meta.json`)
 Each paper has a `meta.json` file containing:
 - Core identifiers: `paper_id`, `source_type`, `source_id`
 - Bibliographic info: `title`, `authors`, `published_date`, `categories`
 - File paths: `pdf_path`, `paper_md_path`, `summary_json_path`
 - Processing status: `conversion_status`, `summary_status`
 - User data: `tags`, `notes`
 ### Summary Data (`summary.json`)
 Optional AI-generated summaries with:
 - Structured fields: problem statement, method overview, results
 - Categorization: problem tags, technique tags
 - Relevance scoring and recommended sections
 ## PDF Conversion
 paperlib integrates with [MinerU](https://github.com/opendatalab/MinerU) for high-quality PDF to Markdown conversion:
 ```bash
 # Install MinerU (optional)
 pip install "mineru[core]"
 # Convert all pending papers
 paperlib convert
 # Convert specific paper
 paperlib convert --paper-id <paper-id>
 ```
 ## Machine-Readable Output
 Most commands support `--json` output for automation:
 ```bash
 paperlib list --json
 paperlib show <paper-id> --json
 paperlib status --json
 ```
 ## Development
 paperlib is designed for extensibility and integration with higher-level tools.
 ### Running Tests
 ```bash
 # Run all tests
 uv run pytest
 # Run specific test module
 uv run pytest tests/test_models.py
 # Run with coverage
 uv run pytest --cov=paperlib
 ```
 ### Code Quality
 ```bash
 # Format code
 uv run ruff format
 # Check linting
 uv run ruff check
 # Type checking
 uv run mypy src/
 ```
 ## Architecture
 paperlib follows clean architecture principles:
 - **Models**: Data structures for papers and summaries
 - **Storage**: File-based metadata and PDF management  
 - **Index**: SQLite search and retrieval layer
 - **Importers**: PDF and arXiv import workflows
 - **Converters**: PDF to Markdown transformation
 - **CLI**: Command-line interface and argument parsing
 ## Roadmap
 - [x] Core paper import (local PDF, arXiv)
 - [x] PDF to Markdown conversion (MinerU integration)
 - [x] Metadata management and search indexing
 - [x] CLI with all basic commands
 - [x] Comprehensive test suite
 - [ ] Search command implementation
 - [ ] AI summarization with provider abstraction
 - [ ] JSON output for all commands
 - [ ] Configuration file support
 - [ ] Advanced arXiv workflows
 ## Non-Goals
 paperlib is intentionally focused and does NOT include:
 - Web UI or GUI applications
 - Multi-user or cloud-first features
 - Mandatory daemon or background services
 - Vector database requirements
 - Fully autonomous research assistant behavior
 ## License
 MIT License - see LICENSE file for details.
 ## Contributing
 Contributions welcome! Please read the development guidelines in AGENTS.md and ensure all tests pass before submitting PRs.
@@ -0,0 +1,288 @@
 # CLI Reference
 This document describes all available commands in the paperlib CLI.
 ## Global Options
 All commands support these global options:
 - `--help`, `-h`: Show help message
 - `--version`: Show version information
 Many commands also support:
 - `--library`, `-L`: Specify library root directory (default: current directory)
 - `--json`: Output machine-readable JSON instead of human-readable format
 ## Commands
 ### `paperlib init [PATH]`
 Initialize a paper library directory structure.
 **Arguments:**
 - `PATH`: Directory to initialize (default: current directory)
 **Examples:**
 ```bash
 # Initialize library in current directory
 paperlib init
 # Initialize library in specific directory
 paperlib init /path/to/my/papers
 # Initialize and create parent directories
 paperlib init ~/Documents/research/papers
 ```
 **Behavior:**
 - Creates standard directory structure (config/, papers/, db/, etc.)
 - Safe to run multiple times (idempotent)
 - Creates parent directories if they don't exist
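The behavior listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not paperlib's actual implementation: the helper name `init_library_layout` is hypothetical, and the set of subdirectories is taken from the layout shown in the README.

```python
from pathlib import Path

# Subdirectories taken from the documented layout; the real tool may create more.
LIBRARY_SUBDIRS = (
    "config/prompts",
    "papers/arxiv",
    "papers/local",
    "db",
    "inbox",
    "cache",
)


def init_library_layout(root: Path) -> list[Path]:
    """Create the library skeleton under ``root`` and return the directories.

    Hypothetical helper: ``parents=True`` creates missing parent directories,
    and ``exist_ok=True`` makes repeated calls a no-op (idempotent).
    """
    root_path = Path(root).expanduser()
    created = []
    for subdir in LIBRARY_SUBDIRS:
        directory = root_path / subdir
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created
```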
 ---
 ### `paperlib import`
 Import papers into the library from various sources.
 **Required (one of):**
 - `--pdf PATH`: Import a local PDF file
 - `--arxiv ID`: Import paper from arXiv by ID or URL
 **Options:**
 - `--title TEXT`: Override paper title (for local PDFs)
 - `--notes TEXT`: Add notes about the paper
 - `--tags TAG1 TAG2`: Add tags to the paper
 - `--library PATH`: Specify library directory
 **Examples:**
 ```bash
 # Import local PDF
 paperlib import --pdf paper.pdf --title "My Research" --tags ml ai
 # Import from arXiv
 paperlib import --arxiv 2212.06340
 # Import with arXiv URL
 paperlib import --arxiv https://arxiv.org/abs/2212.06340
 # Import to specific library
 paperlib import --pdf paper.pdf --library ~/research
 ```
 **Behavior:**
 - Generates stable paper ID based on content (local) or arXiv ID
 - Copies PDF to structured storage location
 - Creates meta.json with paper metadata
 - Prevents duplicate imports (same content/ID)
 - Indexes paper in search database
 ---
 ### `paperlib list`
 List all papers in the library with their current status.
 **Options:**
 - `--library PATH`: Specify library directory
 - `--json`: Output in JSON format
 **Examples:**
 ```bash
 # List all papers
 paperlib list
 # List papers in specific library
 paperlib list --library ~/research
 # Get machine-readable output
 paperlib list --json
 ```
 **Output Format:**
 ```
 Found 3 papers:
 📄 arxiv-2212_06340
   The new discontinuous Galerkin methods based numerical relativity program Nmesh
   By: Wolfgang Tichy, Liwei Ji, Ananya Adhikari (+2 more)
   Categories: gr-qc
 ⏳ local-a1b2c3d4e5f6
   Machine Learning Applications in Physics
   Categories: cs.AI, physics.comp-ph
 ```
 **Status Indicators:**
 - ⏳ Paper imported, conversion pending
 - 📄 PDF converted to Markdown
 - 📝 AI summary generated
 - ❌ Conversion or processing failed
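A script that reads `meta.json` directly could derive the same indicator from the `conversion_status` and `summary_status` fields. The sketch below is illustrative only: the values `"pending"` and `"success"` appear elsewhere in these docs, while `"failed"` and the precedence of a failure over every other state are assumptions.

```python
def status_indicator(conversion_status: str, summary_status: str = "not_requested") -> str:
    """Map the two status fields of a paper to a list indicator.

    Assumptions: "failed" is the failure value for either field, and a
    failure takes precedence over every other state.
    """
    if "failed" in (conversion_status, summary_status):
        return "❌"
    if summary_status == "success":
        return "📝"
    if conversion_status == "success":
        return "📄"
    # Anything else is treated as imported but not yet converted.
    return "⏳"
```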
 ---
 ### `paperlib show PAPER_ID`
 Show detailed information about a specific paper.
 **Arguments:**
 - `PAPER_ID`: The unique paper identifier
 **Options:**
 - `--library PATH`: Specify library directory
 - `--json`: Output in JSON format
 **Examples:**
 ```bash
 # Show paper details
 paperlib show arxiv-2212_06340
 # Show with JSON output
 paperlib show local-a1b2c3d4 --json
 ```
 **Output includes:**
 - All metadata fields
 - Processing status
 - File locations and existence
 - Import timestamp
 - Tags and notes
 ---
 ### `paperlib convert`
 Convert papers from PDF to Markdown using MinerU.
 **Options:**
 - `--library PATH`: Specify library directory
 - `--paper-id ID`: Convert specific paper only
 **Examples:**
 ```bash
 # Convert all pending papers
 paperlib convert
 # Convert specific paper
 paperlib convert --paper-id arxiv-2212_06340
 # Convert in specific library
 paperlib convert --library ~/research
 ```
 **Behavior:**
 - Processes papers with `conversion_status: pending`
 - Uses MinerU for PDF to Markdown conversion
 - Updates metadata with conversion status
 - Creates conversion logs in `logs/` directory
 - Handles conversion failures gracefully
 ---
 ### `paperlib reindex`
 Rebuild the search index from stored paper metadata.
 **Options:**
 - `--library PATH`: Specify library directory
 **Examples:**
 ```bash
 # Rebuild index
 paperlib reindex
 # Rebuild index for specific library
 paperlib reindex --library ~/research
 ```
 **Behavior:**
 - Clears existing SQLite database
 - Scans all meta.json files in papers/ directory
 - Rebuilds full-text search index
 - Reports statistics on completion
 - Safe to run anytime (repairs corrupted index)
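Because `meta.json` files are the source of truth, the rebuild can be pictured as "drop everything, rescan, reinsert". The sketch below shows that idea with the standard `sqlite3` module. It is not paperlib's code: the function name `rebuild_index`, the single `papers` table, and its columns are hypothetical, and full-text indexing is left out for brevity.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path) -> int:
    """Recreate the index from every meta.json and return the paper count.

    The single ``papers`` table and its columns are illustrative only;
    paperlib's real schema is not shown in this document.
    """
    db_path = library_root / "db" / "paperlib.sqlite3"
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, categories_json TEXT)"
            )
            rows = []
            for meta_file in sorted(library_root.glob("papers/**/meta.json")):
                meta = json.loads(meta_file.read_text(encoding="utf-8"))
                rows.append(
                    (
                        meta["paper_id"],
                        meta.get("title", ""),
                        json.dumps(meta.get("categories", [])),
                    )
                )
            conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", rows)
        return len(rows)
    finally:
        conn.close()
```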
 ---
 ### `paperlib status`
 Show library configuration and layout information.
 **Options:**
 - `--library PATH`: Specify library directory
 - `--json`: Output in JSON format
 **Examples:**
 ```bash
 # Show current library status
 paperlib status
 # Show specific library status
 paperlib status --library ~/research
 ```
 **Output:**
 ```
 root: /home/user/papers
 config: /home/user/papers/config/config.toml
 database: /home/user/papers/db/paperlib.sqlite3
 papers: /home/user/papers/papers
 inbox: /home/user/papers/inbox
 cache: /home/user/papers/cache
 ```
 ---
 ## Future Commands
 These commands are planned but not yet implemented:
 ### `paperlib search QUERY`
 Search papers by content and metadata.
 ### `paperlib summarize [PAPER_ID]`
 Generate AI summaries for papers.
 ### `paperlib export FORMAT`
 Export papers in various formats.
 ### `paperlib doctor`
 Diagnose and repair library issues.
 ---
 ## Exit Codes
 paperlib commands return standard exit codes:
 - `0`: Success
 - `1`: General runtime error (file not found, paper not found, etc.)
 - `2`: Command-line usage error (invalid or missing arguments)
 ## Configuration
 paperlib looks for configuration in these locations (in order):
 1. `$LIBRARY_ROOT/config/config.toml`
 2. `~/.config/paperlib/config.toml`
 3. Built-in defaults
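The lookup order above could be resolved as in the following sketch. It assumes that the first existing file wins and that nothing is merged across locations, which this document does not state; the function name `resolve_config_path` and the `home` parameter are hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_config_path(library_root: Path, home: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing config file, or None to signal built-in defaults."""
    home = home if home is not None else Path.home()
    candidates = [
        library_root / "config" / "config.toml",         # 1. library-local config
        home / ".config" / "paperlib" / "config.toml",   # 2. per-user config
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # 3. caller falls back to built-in defaults
```

Passing `home` explicitly keeps the function easy to test without touching the real home directory.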
 ## JSON Output Format
 When using `--json`, commands output structured data suitable for programmatic consumption:
 ```json
 {
  "papers": [
    {
      "paper_id": "arxiv-2212_06340",
      "title": "Example Paper",
      "authors": ["Alice Smith", "Bob Jones"],
      "conversion_status": "success",
      "imported_at": "2024-01-15T10:30:00"
    }
  ],
  "total": 1
 }
 ```
 This format is stable across paperlib versions for reliable automation.
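As a small example of consuming this output, the sketch below parses a payload with the shape shown above and selects the papers whose conversion succeeded. The helper name `converted_paper_ids` is hypothetical, and the sample string stands in for the captured stdout of `paperlib list --json`.

```python
import json

# Sample payload copied from the format shown above.
SAMPLE_OUTPUT = """
{
  "papers": [
    {
      "paper_id": "arxiv-2212_06340",
      "title": "Example Paper",
      "authors": ["Alice Smith", "Bob Jones"],
      "conversion_status": "success",
      "imported_at": "2024-01-15T10:30:00"
    }
  ],
  "total": 1
}
"""


def converted_paper_ids(list_output: str) -> list[str]:
    """Return the IDs of papers whose conversion succeeded."""
    payload = json.loads(list_output)
    return [
        paper["paper_id"]
        for paper in payload["papers"]
        if paper.get("conversion_status") == "success"
    ]


print(converted_paper_ids(SAMPLE_OUTPUT))
```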
@@ -0,0 +1,638 @@
 # Integration Guide
 This document describes how to integrate paperlib with higher-level tools and automation workflows.
 ## Overview
 paperlib is designed as a **library engine** that higher-level tools can build upon. It provides:
 - **Stable CLI interface** with machine-readable JSON output
 - **File-based storage** that external tools can read directly
 - **Python API** for programmatic access
 - **Event hooks** for workflow integration (future)
 ## CLI Integration
 ### Machine-Readable Output
 Most paperlib commands support `--json` output for automation:
 ```bash
 # Get library statistics
 paperlib status --json
 {
  "library_root": "/home/user/papers",
  "total_papers": 42,
  "by_status": {"converted": 38, "pending": 4},
  "last_updated": "2024-01-15T10:30:00Z"
 }
 # List papers with metadata
 paperlib list --json
 {
  "papers": [
    {
      "paper_id": "arxiv-2212_06340",
      "title": "Example Paper",
      "authors": ["Alice Smith", "Bob Jones"],
      "categories": ["cs.AI"],
      "conversion_status": "success",
      "summary_status": "pending",
      "imported_at": "2024-01-15T10:30:00Z"
    }
  ],
  "total": 1
 }
 # Import with JSON response
 paperlib import --arxiv 2212.06340 --json
 {
  "success": true,
  "paper_id": "arxiv-2212_06340",
  "title": "Example Paper Title",
  "message": "Successfully imported arXiv paper"
 }
 ```
 ### Exit Codes
 paperlib commands follow standard Unix exit code conventions:
 ```bash
 paperlib import --arxiv 2212.06340
 echo $?  # 0 for success, 1 for error
 # Check if paper exists before processing
 if paperlib show "$paper_id" --json >/dev/null 2>&1; then
    echo "Paper exists"
 else
    echo "Paper not found"
 fi
 ```
 ### Scripting Examples
 #### Daily arXiv Import
 ```bash
 #!/bin/bash
 # daily-arxiv.sh - Import papers from daily arXiv feed
 LIBRARY="$HOME/research"
 ARXIV_FEED_URL="http://export.arxiv.org/rss/cs.AI"
 # Parse RSS feed and extract arXiv IDs
 curl -s "$ARXIV_FEED_URL" | \
 grep -oP 'arxiv\.org/abs/\K[0-9]{4}\.[0-9]{4,5}' | \
 sort -u | \
 while read -r arxiv_id; do
    echo "Importing $arxiv_id..."
    paperlib import --arxiv "$arxiv_id" --library "$LIBRARY" --json
 done
 # Convert newly imported papers
 paperlib convert --library "$LIBRARY"
 # Generate daily report
 paperlib list --library "$LIBRARY" --json | \
 jq '.papers | map(select(.imported_at | startswith(now | strftime("%Y-%m-%d"))))'
 ```
 #### Batch Processing
 ```bash
 #!/bin/bash
 # batch-process.sh - Process multiple papers from a list
 LIBRARY="$HOME/research"
 PAPER_LIST="papers.txt"
 while IFS= read -r pdf_path; do
    if [[ -f "$pdf_path" ]]; then
        echo "Importing $pdf_path..."
        result=$(paperlib import --pdf "$pdf_path" --library "$LIBRARY" --json)
        if [[ $? -eq 0 ]]; then
            paper_id=$(echo "$result" | jq -r '.paper_id')
            echo "Successfully imported as $paper_id"
        else
            echo "Failed to import $pdf_path"
        fi
    fi
 done < "$PAPER_LIST"
 # Convert all pending papers
 paperlib convert --library "$LIBRARY"
 ```
 ## Python API
 ### Direct Library Access
 ```python
 from paperlib.config import LibraryPaths
 from paperlib.storage import PaperStorageManager
 from paperlib.index import DatabaseManager
 from paperlib.importer import ArxivImporter, LocalImporter
 # Initialize library components
 library_paths = LibraryPaths.from_root("/path/to/library")
 storage = PaperStorageManager(library_paths)
 database = DatabaseManager(library_paths)
 database.initialize_database()
 # Import paper programmatically
 arxiv_importer = ArxivImporter(storage)
 metadata = arxiv_importer.import_arxiv_paper("2212.06340")
 database.index_paper(metadata)
 # Search and retrieve
 results = list(database.search_papers("neural networks"))
 for result in results:
    paper = storage.load_paper_metadata(result["paper_id"], result["source_type"])
    print(f"{paper.title} by {', '.join(paper.authors)}")
 # Get statistics
 stats = database.get_statistics()
 print(f"Total papers: {stats['total_papers']}")
 ```
 ### Metadata Processing
 ```python
 import json
 from pathlib import Path
 from paperlib.models import PaperMetadata, PaperSummary
 # Process all papers in library
 papers_dir = Path("/home/user/papers/papers")
 for meta_file in papers_dir.rglob("meta.json"):
    # Load metadata
    metadata = PaperMetadata.load_from_file(meta_file)
    # Check for summary
    summary_path = meta_file.parent / "summary.json"
    if summary_path.exists():
        summary = PaperSummary.load_from_file(summary_path)
        # Extract key information
        tags = summary.problem_tags + summary.technique_tags
        entities = summary.entities
        print(f"Paper: {metadata.title}")
        print(f"Tags: {', '.join(tags)}")
        print(f"Entities: {', '.join(entities)}")
 ```
 ## File System Integration
 ### Direct File Access
 Since paperlib uses a documented file layout, tools can read data directly:
 ```python
 import json
 from pathlib import Path
 def scan_library(library_root: Path):
    """Scan library and extract metadata."""
    papers = []
    for meta_file in library_root.glob("papers/**/meta.json"):
        with meta_file.open() as f:
            metadata = json.load(f)
        papers.append(metadata)
    return papers
 def find_papers_by_category(library_root: Path, category: str):
    """Find papers in a specific category."""
    matching_papers = []
    for meta_file in library_root.glob("papers/**/meta.json"):
        with meta_file.open() as f:
            metadata = json.load(f)
        if category in metadata.get("categories", []):
            matching_papers.append(metadata)
    return matching_papers
 ```
 ### Watch for Changes
 ```python
 import json
 import time
 from pathlib import Path
 from watchdog.observers import Observer
 from watchdog.events import FileSystemEventHandler
 class PaperLibraryHandler(FileSystemEventHandler):
    def __init__(self, library_root):
        self.library_root = Path(library_root)
    def on_created(self, event):
        if event.src_path.endswith("meta.json"):
            print(f"New paper imported: {event.src_path}")
            # Trigger processing workflow
            self.process_new_paper(event.src_path)
    def on_modified(self, event):
        if event.src_path.endswith("summary.json"):
            print(f"Summary updated: {event.src_path}")
            # Update downstream systems
    def process_new_paper(self, meta_path):
        """Handle newly imported paper."""
        # Load metadata
        with open(meta_path) as f:
            metadata = json.load(f)
        # Trigger downstream processing
        # - Send to processing queue
        # - Update knowledge base
        # - Generate notifications
 # Watch library for changes
 observer = Observer()
 handler = PaperLibraryHandler("/home/user/papers")
 observer.schedule(handler, "/home/user/papers/papers", recursive=True)
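 observer.start()
 try:
    # Keep the main thread alive so the observer thread can deliver events
    while True:
        time.sleep(1)
 except KeyboardInterrupt:
    observer.stop()
 observer.join()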
 ```
 ## Higher-Level Tool Examples
 ### Research Dashboard
 ```python
 """research_dashboard.py - Web dashboard for research library"""
 from flask import Flask, jsonify, render_template
 from paperlib.config import LibraryPaths
 from paperlib.storage import PaperStorageManager
 from paperlib.index import DatabaseManager
 app = Flask(__name__)
 # Initialize paperlib components
 library_paths = LibraryPaths.from_root("/home/user/research")
 storage = PaperStorageManager(library_paths)
 database = DatabaseManager(library_paths)
@app.route('/api/papers')
 def list_papers():
    """List all papers with metadata."""
    papers = list(database.list_papers(limit=50))
    return jsonify(papers)
@app.route('/api/search/<query>')
 def search_papers(query):
    """Search papers by query."""
    results = list(database.search_papers(query, limit=20))
    return jsonify(results)
@app.route('/api/stats')
 def library_stats():
    """Get library statistics."""
    stats = database.get_statistics()
    return jsonify(stats)
@app.route('/')
 def dashboard():
    """Main dashboard page."""
    return render_template('dashboard.html')
 if __name__ == '__main__':
    app.run(debug=True)
 ```
 ### Daily Digest Generator
 ```python
 """daily_digest.py - Generate daily research digest"""
 import json
 from datetime import datetime, timedelta
 from pathlib import Path
 from paperlib.config import LibraryPaths
 from paperlib.index import DatabaseManager
 def generate_daily_digest(library_root: str, output_file: str):
    """Generate digest of recently imported papers."""
    # Initialize database
    library_paths = LibraryPaths.from_root(library_root)
    database = DatabaseManager(library_paths)
    # Get papers from last 24 hours
    yesterday = datetime.now() - timedelta(days=1)
    yesterday_iso = yesterday.isoformat()
    recent_papers = []
    for paper in database.list_papers():
        if paper["imported_at"] >= yesterday_iso:
            recent_papers.append(paper)
    if not recent_papers:
        print("No new papers imported in the last 24 hours.")
        return
    # Group by category
    by_category = {}
    for paper in recent_papers:
        categories = json.loads(paper["categories_json"])
        for category in categories:
            if category not in by_category:
                by_category[category] = []
            by_category[category].append(paper)
    # Generate HTML digest
    html_content = f"""
    <html>
    <head><title>Daily Research Digest - {datetime.now().strftime('%Y-%m-%d')}</title></head>
    <body>
    <h1>Daily Research Digest</h1>
    <p>Found {len(recent_papers)} new papers</p>
    """
    for category, papers in by_category.items():
        html_content += f"<h2>{category}</h2><ul>"
        for paper in papers:
            title = paper["title"]
            paper_id = paper["paper_id"]
            html_content += f'<li><strong>{title}</strong> ({paper_id})</li>'
        html_content += "</ul>"
    html_content += "</body></html>"
    # Write output
    Path(output_file).write_text(html_content)
    print(f"Digest written to {output_file}")
 if __name__ == "__main__":
    generate_daily_digest("/home/user/research", "digest.html")
 ```
 ### Literature Review Assistant
 ```python
 """review_assistant.py - AI-powered literature review helper"""
 import json
 from pathlib import Path
 from paperlib.config import LibraryPaths
 from paperlib.index import DatabaseManager
 from paperlib.models import PaperSummary
 class ReviewAssistant:
    def __init__(self, library_root: str):
        self.library_paths = LibraryPaths.from_root(library_root)
        self.database = DatabaseManager(self.library_paths)
    def find_related_papers(self, paper_id: str, max_results: int = 10):
        """Find papers related to the given paper."""
        # Get source paper metadata
        source_paper = self.database.get_paper(paper_id)
        if not source_paper:
            return []
        # Extract search terms from title and categories
        title_words = source_paper["title"].lower().split()
        categories = json.loads(source_paper["categories_json"])
        # Search for papers with similar keywords
        search_terms = title_words + categories
        related_papers = []
        for term in search_terms:
            results = list(self.database.search_papers(term, limit=5))
            for result in results:
                if result["paper_id"] != paper_id:
                    related_papers.append(result)
        # Remove duplicates and return top results
        seen_ids = set()
        unique_papers = []
        for paper in related_papers:
            if paper["paper_id"] not in seen_ids:
                seen_ids.add(paper["paper_id"])
                unique_papers.append(paper)
                if len(unique_papers) >= max_results:
                    break
        return unique_papers
    def generate_topic_overview(self, topic: str):
        """Generate overview of papers on a specific topic."""
        # Search for papers on topic
        papers = list(self.database.search_papers(topic, limit=50))
        if not papers:
            return f"No papers found for topic: {topic}"
        # Analyze summaries if available
        key_entities = set()
        techniques = set()
        for paper in papers:
            summary_path = Path(paper["summary_json_path"])
            if summary_path.exists():
                summary = PaperSummary.load_from_file(summary_path)
                key_entities.update(summary.entities)
                techniques.update(summary.technique_tags)
        # Generate overview
        overview = f"""
        Topic: {topic}
        Papers found: {len(papers)}
        Key entities mentioned:
        {', '.join(sorted(key_entities)[:10])}
        Common techniques:
        {', '.join(sorted(techniques)[:10])}
        Recent papers:
        """
        # Add recent papers
        recent_papers = sorted(papers, key=lambda x: x["imported_at"], reverse=True)[:5]
        for paper in recent_papers:
            overview += f"\n- {paper['title']} ({paper['paper_id']})"
        return overview
 # Usage
 assistant = ReviewAssistant("/home/user/research")
 overview = assistant.generate_topic_overview("transformer architecture")
 print(overview)
 ```
 ## Integration Patterns
 ### Pipeline Processing
 ```bash
 # Multi-stage processing pipeline
 paperlib import --arxiv 2212.06340 --json > import_result.json
 paper_id=$(jq -r '.paper_id' import_result.json)
 # Convert to markdown
 paperlib convert --paper-id "$paper_id"
 # Generate summary (when available)
 # paperlib summarize --paper-id "$paper_id"
 # Update downstream systems
 curl -X POST "http://research-db/api/papers" \
     -H "Content-Type: application/json" \
     -d @import_result.json
 ```
 ### Event-Driven Architecture
 ```python
 """event_handler.py - Process paperlib events"""
 import json
 from pathlib import Path
 import pika  # RabbitMQ client
 class PaperLibraryEventHandler:
    def __init__(self, rabbitmq_url: str):
        self.connection = pika.BlockingConnection(pika.URLParameters(rabbitmq_url))
        self.channel = self.connection.channel()
    def on_paper_imported(self, paper_metadata: dict):
        """Handle new paper import."""
        message = {
            "event": "paper_imported",
            "paper_id": paper_metadata["paper_id"],
            "title": paper_metadata["title"],
            "categories": paper_metadata["categories"],
            "timestamp": paper_metadata["imported_at"]
        }
        # Send to processing queue
        self.channel.basic_publish(
            exchange='',
            routing_key='paper_processing',
            body=json.dumps(message)
        )
    def on_summary_generated(self, paper_id: str, summary_path: Path):
        """Handle summary generation."""
        with summary_path.open() as f:
            summary = json.load(f)
        message = {
            "event": "summary_generated",
            "paper_id": paper_id,
            "tags": summary["problem_tags"] + summary["technique_tags"],
            "entities": summary["entities"]
        }
        # Send to indexing service
        self.channel.basic_publish(
            exchange='',
            routing_key='summary_indexing',
            body=json.dumps(message)
        )
 ```
 ## Best Practices
 ### Error Handling
 ```python
 import subprocess
 import json
 def safe_paperlib_command(command: list[str]) -> dict:
    """Execute paperlib command with proper error handling."""
    try:
        result = subprocess.run(
            ["paperlib"] + command + ["--json"],
            capture_output=True,
            text=True,
            check=True
        )
        return json.loads(result.stdout)
    except subprocess.CalledProcessError as e:
        return {
            "success": False,
            "error": e.stderr,
            "exit_code": e.returncode
        }
    except json.JSONDecodeError as e:
        return {
            "success": False,
            "error": f"Invalid JSON response: {e}",
            "raw_output": result.stdout
        }
 # Usage
 result = safe_paperlib_command(["import", "--arxiv", "2212.06340"])
 if result.get("success", True):  # Assume success if no "success" field
    print(f"Imported paper: {result['paper_id']}")
 else:
    print(f"Import failed: {result['error']}")
 ```
 ### Performance Optimization
 ```python
 # Batch operations for better performance
 from paperlib.config import LibraryPaths
 from paperlib.index import DatabaseManager
 from paperlib.storage import PaperStorageManager
 def batch_index_papers(library_root: str, paper_ids: list[str]):
    """Index multiple papers efficiently."""
    library_paths = LibraryPaths.from_root(library_root)
    database = DatabaseManager(library_paths)
    storage = PaperStorageManager(library_paths)
    # Begin transaction for batch insert
    with database._get_connection() as conn:
        for paper_id in paper_ids:
            # Assumption: the ID prefix ("arxiv-" or "local-") names the source type
            source_type = paper_id.split("-", 1)[0]
            metadata = storage.load_paper_metadata(paper_id, source_type)
            if metadata:
                database.index_paper(metadata)
        # Automatic commit on context exit
 ```
 ### Configuration Management
 ```python
 # config_manager.py - Centralized configuration
 import os
 from pathlib import Path
 class ConfigManager:
    def __init__(self):
        self.library_root = os.getenv("PAPERLIB_ROOT", Path.home() / "research")
        self.api_keys = {
            "openai": os.getenv("OPENAI_API_KEY"),
            "anthropic": os.getenv("ANTHROPIC_API_KEY")
        }
    def get_library_path(self, name: str = "default") -> str:
        """Get library path by name."""
        if name == "default":
            return str(self.library_root)
        return str(Path.home() / f"research-{name}")
    def paperlib_command_base(self, library_name: str = "default") -> list[str]:
        """Get base command for paperlib with library."""
        return ["paperlib", "--library", self.get_library_path(library_name)]
 config = ConfigManager()
 # Usage in scripts
 import subprocess
 cmd = config.paperlib_command_base("arxiv") + ["list", "--json"]
 result = subprocess.run(cmd, capture_output=True, text=True)
 ```
 This integration guide provides the foundation for building sophisticated research workflows on top of paperlib's stable, local-first architecture.
@@ -0,0 +1,259 @@
 # Storage Layout
 This document describes the on-disk structure and organization of a paperlib library.
 ## Overview
 A paperlib library is a directory containing all papers, metadata, configuration, and index data. The layout is designed to be:
 - **Human-readable**: Directory structure is intuitive and browsable
 - **Stable**: File locations don't change unexpectedly
 - **Rebuildable**: Index can be reconstructed from source files
 - **Portable**: Entire library can be moved or backed up as a unit
 ## Directory Structure
 ```
 library_root/
 ├── config/                    # Library configuration
 │   ├── config.toml           # Main configuration file
 │   ├── vocab.yaml            # Controlled vocabulary (future)
 │   └── prompts/              # AI prompt templates (future)
 │       └── summarize_paper.md
 ├── papers/                   # Paper storage (source of truth)
 │   ├── arxiv/               # arXiv papers organized by year
 │   │   └── 2022/
 │   │       └── arxiv-2212_06340/
 │   │           ├── meta.json         # Paper metadata
 │   │           ├── source.pdf        # Original PDF
 │   │           ├── paper.md          # Converted markdown
 │   │           ├── summary.json      # AI-generated summary
 │   │           ├── summary.md        # Rendered summary
 │   │           ├── ref.bib          # Bibliography (future)
 │   │           ├── assets/          # Images, figures
 │   │           └── logs/            # Processing logs
 │   │               └── mineru.log
 │   └── local/               # Local PDF imports by hash
 │       └── a1b2c3d4e5f67890/
 │           └── ... (same structure)
 ├── inbox/                   # Temporary import staging (future)
 ├── db/                      # Search index (rebuildable)
 │   └── paperlib.sqlite3
 └── cache/                   # Processing cache (safe to delete)
 ```
 ## Paper Directory Organization
 ### arXiv Papers
 arXiv papers are organized by year and paper ID:
 ```
 papers/arxiv/YEAR/arxiv-NORMALIZED_ID/
 ```
 Where:
 - `YEAR` is derived from the `YY` prefix of the arXiv ID (e.g., `2212.06340` → `2022`)
 - `NORMALIZED_ID` replaces the dot with an underscore and keeps any version suffix
  - `2212.06340` → `arxiv-2212_06340`
  - `2212.06340v2` → `arxiv-2212_06340v2`
 **Examples:**
 ```
 papers/arxiv/2022/arxiv-2212_06340/
 papers/arxiv/2023/arxiv-2301_12345v1/
 papers/arxiv/2024/arxiv-2405_98765/
 ```
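The mapping from an arXiv ID to its directory can be sketched as follows. This is a minimal illustration, not paperlib's actual implementation: the function name `arxiv_paper_dir` is hypothetical, and it assumes new-style IDs (`YYMM.NNNNN`, optionally followed by `vN`), with the year taken as `20YY`.

```python
import re
from pathlib import PurePosixPath

# Assumption: only new-style arXiv IDs are handled (YYMM.NNNN or YYMM.NNNNN,
# optionally followed by a version suffix such as v2).
_ARXIV_ID = re.compile(r"(\d{2})(\d{2})\.(\d{4,5})(v\d+)?")


def arxiv_paper_dir(arxiv_id: str) -> PurePosixPath:
    """Return the library-relative directory for an arXiv paper (hypothetical helper)."""
    match = _ARXIV_ID.fullmatch(arxiv_id)
    if match is None:
        raise ValueError(f"unsupported arXiv ID: {arxiv_id!r}")
    year = 2000 + int(match.group(1))        # "22" -> 2022
    normalized = arxiv_id.replace(".", "_")  # the version suffix is kept as-is
    return PurePosixPath("papers", "arxiv", str(year), f"arxiv-{normalized}")


print(arxiv_paper_dir("2212.06340"))
print(arxiv_paper_dir("2301.12345v1"))
```

Old-style identifiers such as `hep-th/9901001` are rejected here rather than guessed at, since the layout above does not say how they are stored.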
 ### Local Papers
 Local papers are organized by content hash:
 ```
 papers/local/HASH_PREFIX/
 ```
 Where `HASH_PREFIX` is the first 16 characters of the SHA256 hash of the PDF file.
 **Examples:**
 ```
 papers/local/a1b2c3d4e5f67890/
 papers/local/fedcba9876543210/
 ```
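A directory name of this form could be computed as in the sketch below. The helper name `local_paper_dir` is an assumption for illustration; only the "first 16 hex characters of the SHA256 hash" rule comes from the text above.

```python
import hashlib
from pathlib import Path


def local_paper_dir(pdf_path: Path, prefix_len: int = 16) -> str:
    """Return the library-relative directory for a local PDF (hypothetical helper)."""
    digest = hashlib.sha256()
    with pdf_path.open("rb") as f:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return f"papers/local/{digest.hexdigest()[:prefix_len]}"
```

Because the name depends only on file content, importing the same bytes twice yields the same directory, which is what makes duplicate detection possible.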
 ## File Types
 ### Required Files
 Every paper directory contains:
 #### `meta.json`
 The canonical metadata file (JSON format):
 ```json
 {
  "paper_id": "arxiv-2212_06340",
  "source_type": "arxiv",
  "source_id": "2212.06340",
  "title": "Example Paper Title",
  "authors": ["Alice Smith", "Bob Jones"],
  "published_date": "2022-12-13T02:46:55",
  "categories": ["cs.AI", "stat.ML"],
  "pdf_path": "papers/arxiv/2022/arxiv-2212_06340/source.pdf",
  "paper_md_path": "papers/arxiv/2022/arxiv-2212_06340/paper.md",
  "imported_at": "2024-01-15T10:30:00",
  "conversion_status": "success",
  "summary_status": "not_requested",
  "tags": ["machine-learning"],
  "notes": "Important paper on neural networks"
 }
 ```
 #### `source.pdf`
 The original PDF file, exactly as imported.
 ### Generated Files
 These files are created by paperlib processing:
 #### `paper.md`
 Markdown conversion of the PDF, generated by MinerU or other converters.
 #### `summary.json` (optional)
 AI-generated structured summary:
 ```json
 {
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces...",
  "problem_statement": "Current methods have limitations...",
  "method_overview": "We propose a novel approach...",
  "main_results": "Experiments show 95% accuracy...",
  "claimed_contributions": ["Novel architecture", "Improved performance"],
  "problem_tags": ["classification", "optimization"],
  "technique_tags": ["neural-networks", "transformers"],
  "entities": ["BERT", "ImageNet", "ResNet"],
  "relevance_to_user": 0.85
 }
 ```
 #### `summary.md` (optional)
 Human-readable summary rendered from `summary.json`.
 ### Supporting Directories
 #### `assets/`
 Contains extracted images, figures, and other media from the PDF conversion process.
 #### `logs/`
 Processing logs for debugging and audit trails:
 - `mineru.log` - PDF conversion logs
 - `summary.log` - AI summarization logs (future)
 ## Index Database
 The SQLite database at `db/paperlib.sqlite3` contains:
 ### Tables
 #### `papers`
 Main paper index with searchable fields:
 - Metadata from all `meta.json` files
 - Computed search fields (full-text, author lists, etc.)
 - Processing status tracking
 #### `papers_fts`
 Full-text search virtual table (SQLite FTS5) for content search.
 ### Rebuilding
 The database is **always rebuildable** from the source files:
 ```bash
 paperlib reindex
 ```
 This design ensures the JSON files remain the authoritative source of truth.
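To illustrate why the index is disposable, the sketch below rebuilds a small table purely from `meta.json` files. It is not what `paperlib reindex` does internally: the function name, the explicit `db_path` argument, and the four-column table are assumptions, and the FTS5 table is left out.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, db_path: Path) -> int:
    """Recreate a minimal `papers` table from meta.json files; returns the row count."""
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    count = 0
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("DROP TABLE IF EXISTS papers")
            # Hypothetical schema: the real index stores more fields.
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, authors TEXT, source_type TEXT)"
            )
            for meta_file in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_file.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT INTO papers VALUES (?, ?, ?, ?)",
                    (
                        meta["paper_id"],
                        meta.get("title"),
                        json.dumps(meta.get("authors", [])),  # list stored as JSON text
                        meta.get("source_type"),
                    ),
                )
                count += 1
    finally:
        conn.close()
    return count
```

Nothing in the table is written back to the JSON files, so deleting the database loses no information.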
 ## Path Conventions
 ### Relative Paths
 All paths in `meta.json` are relative to the library root:
 ```json
 {
  "pdf_path": "papers/local/a1b2c3d4e5f67890/source.pdf",
  "paper_md_path": "papers/local/a1b2c3d4e5f67890/paper.md"
 }
 ```
 ### Cross-Platform Compatibility
 All paths use forward slashes (`/`) regardless of operating system.
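A script that reads `meta.json` might convert between stored and native paths as sketched below. Both helper names are hypothetical; rejecting absolute paths and `..` segments is an extra precaution assumed here, not documented paperlib behavior.

```python
from pathlib import Path, PurePosixPath


def resolve_library_path(library_root: Path, stored_path: str) -> Path:
    """Turn a stored forward-slash path into a native path under the library root."""
    relative = PurePosixPath(stored_path)
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"not a library-relative path: {stored_path!r}")
    return library_root.joinpath(*relative.parts)


def to_stored_path(library_root: Path, native_path: Path) -> str:
    """Turn a native path inside the library into the stored forward-slash form."""
    return native_path.relative_to(library_root).as_posix()
```

Going through `PurePosixPath` and `as_posix()` keeps the stored form identical on every operating system.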
 ## Backup and Portability
 ### What to Back Up
 For a complete library backup, include:
 - `config/` directory (configuration)
 - `papers/` directory (source of truth)
 ### What NOT to Back Up
 These can be regenerated:
 - `db/` directory (rebuildable index)
 - `cache/` directory (temporary files)
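The split above translates into a short backup script. This is a sketch using only the standard library; `backup_library` is a hypothetical name, and the archive should be written outside the library directory.

```python
import tarfile
from pathlib import Path

# Only the source-of-truth directories are archived; db/ and cache/ are skipped
# because they can be regenerated.
BACKUP_DIRS = ("config", "papers")


def backup_library(library_root: Path, archive_path: Path) -> list[str]:
    """Write a .tar.gz of config/ and papers/; returns the directories included."""
    included = []
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for name in BACKUP_DIRS:
            source = library_root / name
            if source.is_dir():
                tar.add(str(source), arcname=name)  # paths in the archive stay relative
                included.append(name)
    return included
```

After restoring such an archive, `paperlib reindex` recreates the database.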
 ### Moving Libraries
 To move a library:
 1. Copy the entire directory structure
 2. Run `paperlib reindex` to rebuild the database
 3. Update any absolute paths in configuration
 ## Storage Efficiency
 ### Deduplication
 Papers are naturally deduplicated:
 - arXiv papers by normalized arXiv ID
 - Local papers by SHA256 content hash
 ### Large Files
 For papers with large asset directories:
 - Assets are stored alongside papers for locality
 - Consider using file system compression or deduplication if needed
 ## File System Requirements
 ### Permissions
 paperlib requires:
 - Read/write access to library directory
 - Ability to create subdirectories
 - Atomic file operations for metadata updates
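One common way to get an atomic metadata update is to write a temporary file in the same directory and rename it over the target. The sketch below shows that pattern; it is an illustration under that assumption, not necessarily how paperlib writes `meta.json`, and `write_json_atomic` is a hypothetical name.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Replace `path` with `data` so readers see either the old or the new file."""
    # The temporary file must be on the same file system for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

An interrupted write then leaves the previous `meta.json` intact instead of a truncated file.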
 ### File System Features
 Recommended:
 - Case-sensitive file system (avoids conflicts)
 - Support for Unicode filenames
 - Journaling (protects against corruption)
 ### Disk Space
 Typical storage requirements:
 - PDF files: 1-10 MB each
 - Markdown conversions: 10-100 KB each
 - Metadata: ~1-5 KB per paper
 - Database index: ~1-10 KB per paper
 - Assets: Varies (0-50 MB for image-heavy papers)
 ## Migration and Versioning
 ### Schema Evolution
 When paperlib updates its storage format:
 - Metadata schema versions are tracked in each file
 - Migration tools handle format upgrades
 - Backward compatibility is maintained when possible
 ### Validation
 A planned `paperlib doctor` command is intended to validate library integrity:
 ```bash
 paperlib doctor  # (future command)
 ```
 This will check:
 - All referenced files exist
 - Metadata format is valid
 - Database consistency with files
 - No orphaned or corrupted data
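Until such a command exists, the first two checks can be approximated with a short script. The sketch below is an assumption-laden stand-in: `check_library` is a hypothetical name, and it only verifies that each `meta.json` parses and that the files it references exist (`paper.md` is checked only when `conversion_status` is `"success"`).

```python
import json
from pathlib import Path


def check_library(library_root: Path) -> list[str]:
    """Return human-readable problems found under papers/ (empty list if none)."""
    problems: list[str] = []
    for meta_file in sorted((library_root / "papers").rglob("meta.json")):
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:  # unreadable file or invalid JSON
            problems.append(f"{meta_file}: unreadable metadata ({exc})")
            continue
        if not isinstance(meta, dict):
            problems.append(f"{meta_file}: metadata is not a JSON object")
            continue
        required = ["pdf_path"]
        if meta.get("conversion_status") == "success":
            required.append("paper_md_path")
        for key in required:
            stored = meta.get(key)
            if not isinstance(stored, str) or not stored:
                problems.append(f"{meta_file}: missing {key}")
            elif not (library_root / stored).is_file():
                problems.append(f"{meta_file}: {key} -> {stored} does not exist")
    return problems
```

Database consistency and orphan detection are not covered by this sketch.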
@@ -0,0 +1,289 @@
 # Summary Schema
 This document defines the structure and semantics of the `summary.json` files that contain AI-generated paper summaries.
 ## Overview
 The `summary.json` file contains structured, AI-generated analysis of a paper. It is designed to:
 - Provide consistent, machine-readable summaries
 - Support research triage and discovery workflows
 - Enable automated categorization and search
 - Remain stable across different AI providers
 - Use controlled vocabulary when available
 ## Schema Version 1.0
 ### File Structure
 ```json
 {
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces a novel neural architecture for...",
  "problem_statement": "Current approaches to X suffer from limitations...",
  "method_overview": "The authors propose a hybrid approach combining...",
  "main_results": "Experiments show 15% improvement over baselines...",
  "claimed_contributions": [
    "Novel attention mechanism design",
    "State-of-the-art results on ImageNet",
    "Theoretical analysis of convergence properties"
  ],
  "assumptions": [
    "Data is independently distributed",
    "Computational budget allows for large models"
  ],
  "limitations": [
    "Only evaluated on English text",
    "Requires significant computational resources",
    "Limited theoretical justification for design choices"
  ],
  "problem_tags": ["classification", "computer-vision", "optimization"],
  "technique_tags": ["neural-networks", "attention", "transformers"],
  "entities": ["ImageNet", "BERT", "ResNet", "CIFAR-10"],
  "relevance_to_user": 0.75,
  "recommended_sections": ["Section 3.2", "Algorithm 1", "Table 2"]
 }
 ```
 ## Field Definitions
 ### Required Fields
 #### `schema_version` (string)
 - **Purpose**: Track format version for migration
 - **Format**: Version string in `MAJOR.MINOR` form (e.g., "1.0")
 - **Required**: Yes
 #### `one_sentence_summary` (string)
 - **Purpose**: Concise paper overview for quick scanning
 - **Guidelines**: 
  - One complete sentence, under 200 characters
  - Focus on the main contribution or finding
  - Avoid technical jargon when possible
 - **Example**: "This paper introduces a new attention mechanism that improves transformer efficiency by 40% while maintaining accuracy."
 ### Core Content Fields
 #### `problem_statement` (string)
 - **Purpose**: What problem does this paper address?
 - **Guidelines**:
  - 2-3 sentences maximum
  - Focus on the gap or limitation being addressed
  - Explain why this problem matters
 #### `method_overview` (string)
 - **Purpose**: High-level description of the approach
 - **Guidelines**:
  - 3-4 sentences maximum
  - Focus on the key innovation or insight
  - Avoid detailed algorithmic descriptions
 #### `main_results` (string)
 - **Purpose**: Key empirical findings or theoretical results
 - **Guidelines**:
  - Quantitative results when available
  - Highlight significance of improvements
  - Note any surprising or counterintuitive findings
 ### Structured Lists
 #### `claimed_contributions` (array of strings)
 - **Purpose**: Authors' stated contributions
 - **Guidelines**:
  - Extract from paper's contribution list
  - Preserve authors' framing and claims
  - 3-6 items typically
 #### `assumptions` (array of strings)
 - **Purpose**: Key assumptions underlying the work
 - **Guidelines**:
  - Mathematical, methodological, or data assumptions
  - Critical for understanding applicability
  - Often unstated but important
 #### `limitations` (array of strings)
 - **Purpose**: Acknowledged or apparent limitations
 - **Guidelines**:
  - Limitations stated in the authors' discussion or limitations section
  - Apparent limitations that the authors do not acknowledge
  - Important for understanding scope
 ### Categorization
 #### `problem_tags` (array of strings)
 - **Purpose**: Categorize the problem domain
 - **Controlled vocabulary** (preferred values):
  - `classification`, `regression`, `clustering`
  - `optimization`, `search`, `planning`
  - `generation`, `translation`, `summarization`
  - `detection`, `segmentation`, `tracking`
  - `compression`, `encoding`, `decoding`
  - `privacy`, `security`, `robustness`
  - `interpretability`, `fairness`, `ethics`
  - `efficiency`, `scalability`, `deployment`
 #### `technique_tags` (array of strings)  
 - **Purpose**: Categorize the technical approaches
 - **Controlled vocabulary** (preferred values):
  - `neural-networks`, `deep-learning`, `transformers`
  - `cnn`, `rnn`, `lstm`, `gru`, `attention`
  - `reinforcement-learning`, `supervised-learning`, `unsupervised-learning`
  - `bayesian`, `probabilistic`, `statistical`
  - `graph-neural-networks`, `graph-algorithms`
  - `computer-vision`, `natural-language-processing`
  - `federated-learning`, `transfer-learning`, `meta-learning`
  - `adversarial`, `generative-models`, `vae`, `gan`
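Since the vocabularies are preferred rather than mandatory, a consumer may want to know which tags fall outside them. The sketch below copies the two lists above into sets; the helper name `split_by_vocabulary` is hypothetical.

```python
PROBLEM_TAGS = frozenset({
    "classification", "regression", "clustering",
    "optimization", "search", "planning",
    "generation", "translation", "summarization",
    "detection", "segmentation", "tracking",
    "compression", "encoding", "decoding",
    "privacy", "security", "robustness",
    "interpretability", "fairness", "ethics",
    "efficiency", "scalability", "deployment",
})

TECHNIQUE_TAGS = frozenset({
    "neural-networks", "deep-learning", "transformers",
    "cnn", "rnn", "lstm", "gru", "attention",
    "reinforcement-learning", "supervised-learning", "unsupervised-learning",
    "bayesian", "probabilistic", "statistical",
    "graph-neural-networks", "graph-algorithms",
    "computer-vision", "natural-language-processing",
    "federated-learning", "transfer-learning", "meta-learning",
    "adversarial", "generative-models", "vae", "gan",
})


def split_by_vocabulary(tags: list[str], vocabulary: frozenset[str]) -> tuple[list[str], list[str]]:
    """Split tags into (preferred, other), preserving the original order."""
    preferred = [tag for tag in tags if tag in vocabulary]
    other = [tag for tag in tags if tag not in vocabulary]
    return preferred, other
```

Tags in the second list are still valid; they are simply candidates for review or for a later vocabulary extension.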
 ### Entities and References
 #### `entities` (array of strings)
 - **Purpose**: Important datasets, models, algorithms, or systems mentioned
 - **Guidelines**:
  - Proper names: "ImageNet", "BERT", "ResNet"
  - Algorithms: "SGD", "Adam", "RANSAC"  
  - Benchmarks: "GLUE", "COCO", "WMT"
  - Avoid generic terms like "neural network"
 ### User Relevance
 #### `relevance_to_user` (number or `null`, optional)
 - **Purpose**: Estimated relevance score for the user
 - **Format**: Float between 0.0 and 1.0, or `null`
 - **Guidelines**:
  - Based on the user's research interests (if known)
  - `null` if user preferences are unavailable
  - Higher scores = more relevant
 #### `recommended_sections` (array of strings, optional)
 - **Purpose**: Specific sections worth reading in detail
 - **Format**: Section references as they appear in paper
 - **Examples**: ["Section 3.2", "Algorithm 1", "Table 2", "Appendix A"]
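The field definitions above can be checked mechanically. The following is a sketch of such a check, not paperlib's validator: `validate_summary` is a hypothetical name, and it assumes that only `schema_version` and `one_sentence_summary` are mandatory while every other field may be absent or `null`.

```python
from typing import Any

TEXT_FIELDS = ("problem_statement", "method_overview", "main_results")
LIST_FIELDS = (
    "claimed_contributions", "assumptions", "limitations",
    "problem_tags", "technique_tags", "entities", "recommended_sections",
)


def validate_summary(summary: dict[str, Any]) -> list[str]:
    """Return a list of schema violations (empty if the summary looks valid)."""
    errors: list[str] = []
    if not isinstance(summary.get("schema_version"), str):
        errors.append("schema_version must be a string")
    sentence = summary.get("one_sentence_summary")
    if not isinstance(sentence, str) or not sentence:
        errors.append("one_sentence_summary must be a non-empty string")
    elif len(sentence) >= 200:
        errors.append("one_sentence_summary should be under 200 characters")
    for name in TEXT_FIELDS:
        value = summary.get(name)
        if value is not None and not isinstance(value, str):
            errors.append(f"{name} must be a string or null")
    for name in LIST_FIELDS:
        value = summary.get(name)
        if value is None:
            continue
        if not isinstance(value, list) or not all(isinstance(item, str) for item in value):
            errors.append(f"{name} must be an array of strings")
    relevance = summary.get("relevance_to_user")
    if relevance is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(relevance, bool) or not isinstance(relevance, (int, float)):
            errors.append("relevance_to_user must be a number or null")
        elif not 0.0 <= relevance <= 1.0:
            errors.append("relevance_to_user must be between 0.0 and 1.0")
    return errors
```

Returning a list of messages rather than raising on the first problem makes it easier to log everything wrong with a generated summary in one pass.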
 ## Generation Guidelines
 ### AI Provider Instructions
 When generating summaries, AI models should:
 1. **Read for understanding**: Focus on the paper's core contributions
 2. **Use structured thinking**: Work through each field systematically  
 3. **Prefer facts over interpretation**: Extract what authors claim, not opinions
 4. **Use controlled vocabulary**: Select from predefined tag lists when possible
 5. **Be concise**: Optimize for quick scanning and search
 6. **Handle uncertainty**: Use `null` or empty arrays for unclear fields
 ### Quality Criteria
 Good summaries exhibit:
 - **Accuracy**: Faithful to the paper's content
 - **Completeness**: Cover all major aspects  
 - **Consistency**: Similar papers get similar treatment
 - **Searchability**: Use terms that aid discovery
 - **Brevity**: Information density over verbosity
 ### Common Issues to Avoid
 - **Hallucination**: Never invent facts not in the paper
 - **Editorializing**: Don't add opinions about paper quality
 - **Inconsistent terminology**: Use standard field names
 - **Over-abstraction**: Keep concrete details when useful
 - **Under-specification**: Provide enough detail for usefulness
 ## Schema Evolution
 ### Version History
 - **v1.0** (current): Initial schema with core fields
 ### Migration Strategy
 When the schema evolves:
 1. New versions increment the `schema_version` field
 2. Migration tools handle format upgrades automatically  
 3. Backward compatibility maintained when possible
 4. Deprecated fields are marked but preserved
 ### Extensibility
 Future versions may add:
 - Additional structured fields
 - Hierarchical tag taxonomies
 - Multi-lingual support
 - Citation relationship mapping
 - Experimental reproducibility metadata
 ## Integration with paperlib
 ### File Lifecycle
 1. **Generation**: AI provider creates `summary.json`
 2. **Validation**: paperlib validates against schema
 3. **Indexing**: Content indexed for search
 4. **Rendering**: Human-readable `summary.md` generated
 5. **Updates**: Summaries can be regenerated with new models
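The rendering step can be pictured as a plain function from the JSON structure to Markdown text. The layout below (headings, order, bullet style) is an assumption made for illustration; the actual `summary.md` format may differ.

```python
TEXT_SECTIONS = [
    ("Problem", "problem_statement"),
    ("Method", "method_overview"),
    ("Main Results", "main_results"),
]
LIST_SECTIONS = [
    ("Claimed Contributions", "claimed_contributions"),
    ("Assumptions", "assumptions"),
    ("Limitations", "limitations"),
    ("Recommended Sections", "recommended_sections"),
]


def render_summary_markdown(summary: dict) -> str:
    """Render a summary dict as Markdown, skipping null or empty fields."""
    lines = ["# Summary", "", summary.get("one_sentence_summary") or "", ""]
    for heading, key in TEXT_SECTIONS:
        text = summary.get(key)
        if text:
            lines += [f"## {heading}", "", text, ""]
    for heading, key in LIST_SECTIONS:
        items = summary.get(key) or []
        if items:
            lines += [f"## {heading}", ""]
            lines += [f"- {item}" for item in items]
            lines.append("")
    return "\n".join(lines)
```

Because the Markdown is derived entirely from `summary.json`, it can be regenerated at any time and never needs to be edited by hand.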
 ### Search Integration
 Summary fields are indexed for search:
 - Full-text search includes all text fields
 - Tag-based search uses `problem_tags` and `technique_tags`  
 - Entity search uses the `entities` field
 - Relevance ranking can use `relevance_to_user` scores
 ### API Integration
 Higher-level tools can consume summaries programmatically:
 ```python
 import json
 from pathlib import Path
 # Load summary (path is relative to the library root)
 summary_path = Path("papers/arxiv/2022/arxiv-2212_06340/summary.json")
 with summary_path.open(encoding="utf-8") as f:
    summary = json.load(f)
 # Extract key information
 tags = summary["problem_tags"] + summary["technique_tags"]
 # relevance_to_user may be missing or null; fall back to 0.0 in both cases
 relevance = summary.get("relevance_to_user") or 0.0
 entities = summary["entities"]
 ```
 This enables automated workflows like:
 - Daily digest generation
 - Research recommendation systems  
 - Literature review automation
 - Cross-reference discovery
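As a concrete instance of the first workflow, the sketch below collects every `summary.json` in a library and ranks papers by relevance for a digest. The function name is hypothetical, the paper ID is assumed to equal the directory name, and papers with a `null` or missing score are treated as 0.0, which is one possible policy rather than a documented one.

```python
import json
from pathlib import Path


def rank_by_relevance(library_root: Path, limit: int = 10) -> list[tuple[float, str, str]]:
    """Return up to `limit` (score, paper_id, one_sentence_summary) tuples, best first."""
    ranked = []
    for summary_file in (library_root / "papers").rglob("summary.json"):
        summary = json.loads(summary_file.read_text(encoding="utf-8"))
        relevance = summary.get("relevance_to_user")
        # Assumption: unscored papers sort as 0.0 instead of being dropped.
        score = float(relevance) if isinstance(relevance, (int, float)) else 0.0
        paper_id = summary_file.parent.name
        ranked.append((score, paper_id, summary.get("one_sentence_summary", "")))
    # Highest score first; ties are broken by paper ID for a stable order.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked[:limit]
```

The one-sentence summaries returned here are short enough to be dropped straight into an email or chat digest.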
 ## Examples
 ### Machine Learning Paper
 ```json
 {
  "schema_version": "1.0",
  "one_sentence_summary": "Introduces EfficientNet, a family of convolutional neural networks that achieve better accuracy and efficiency than previous models through compound scaling.",
  "problem_statement": "Existing ConvNet scaling methods arbitrarily scale network dimensions, leading to suboptimal accuracy and efficiency trade-offs.",
  "method_overview": "The paper proposes compound scaling that uniformly scales network width, depth, and resolution with a fixed ratio, guided by neural architecture search to find optimal scaling coefficients.",
  "main_results": "EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster than the best existing ConvNet.",
  "claimed_contributions": [
    "Novel compound scaling method for ConvNets",
    "EfficientNet family with state-of-the-art accuracy/efficiency",
    "Systematic study of scaling dimensions"
  ],
  "assumptions": [
    "ImageNet classification transfers to other vision tasks",
    "Compound scaling works across different architectures"
  ],
  "limitations": [
    "Limited evaluation on tasks beyond image classification",
    "Scaling coefficients may not generalize to all architectures"
  ],
  "problem_tags": ["classification", "computer-vision", "efficiency"],
  "technique_tags": ["cnn", "neural-architecture-search", "model-scaling"],
  "entities": ["ImageNet", "MobileNet", "ResNet", "NASNet"],
  "relevance_to_user": null,
  "recommended_sections": ["Section 3.1", "Table 2", "Figure 2"]
 }
 ```
 This schema provides a foundation for consistent, structured paper analysis while remaining flexible enough to evolve with new research needs and AI capabilities.
@@ -3,7 +3,6 @@
 import shutil
 import subprocess
 from pathlib import Path
 from unittest.mock import patch
 import pytest
@@ -106,7 +106,7 @@ class TestLocalImporter:
    def test_import_duplicate_pdf(self, local_importer, sample_pdf):
        """Test importing the same PDF twice."""
        # Import once
-        metadata1 = local_importer.import_pdf(pdf_path=sample_pdf)
+        local_importer.import_pdf(pdf_path=sample_pdf)
        # Try to import again
        with pytest.raises(ValueError, match="Paper already imported"):
@@ -6,10 +6,9 @@ from pathlib import Path
 import pytest
 from paperlib.config import LibraryPaths
-from paperlib.converter import MinerUConverter
+from paperlib.importer import LocalImporter
 from paperlib.importer import ArxivImporter, LocalImporter
 from paperlib.index import DatabaseManager
-from paperlib.models import ConversionStatus, SourceType
+from paperlib.models import SourceType
 from paperlib.storage import PaperStorageManager
@@ -5,8 +5,6 @@ import tempfile
 from datetime import datetime
 from pathlib import Path
 import pytest
 from paperlib.models import (
    ConversionStatus,
    PaperMetadata,
@@ -1,13 +1,12 @@
 """Tests for paperlib storage manager."""
 import shutil
 import tempfile
 from pathlib import Path
 import pytest
 from paperlib.config import LibraryPaths
-from paperlib.models import ConversionStatus, PaperMetadata, SourceType
+from paperlib.models import ConversionStatus, SourceType
 from paperlib.storage import PaperStorageManager