docs: add docs

2026-04-17 16:54:30 -04:00
parent 74d140e5f8
commit 432010f431
10 changed files with 1682 additions and 19 deletions
@@ -0,0 +1,259 @@
+# Storage Layout
+
+This document describes the on-disk structure and organization of a paperlib library.
+
+## Overview
+
+A paperlib library is a directory containing all papers, metadata, configuration, and index data. The layout is designed to be:
+
+- **Human-readable**: Directory structure is intuitive and browsable
+- **Stable**: File locations don't change unexpectedly
+- **Rebuildable**: Index can be reconstructed from source files
+- **Portable**: Entire library can be moved or backed up as a unit
+
+## Directory Structure
+
+```
+library_root/
+├── config/                    # Library configuration
+│   ├── config.toml           # Main configuration file
+│   ├── vocab.yaml            # Controlled vocabulary (future)
+│   └── prompts/              # AI prompt templates (future)
+│       └── summarize_paper.md
+├── papers/                   # Paper storage (source of truth)
+│   ├── arxiv/               # arXiv papers organized by year
+│   │   └── 2022/
+│   │       └── arxiv-2212_06340/
+│   │           ├── meta.json         # Paper metadata
+│   │           ├── source.pdf        # Original PDF
+│   │           ├── paper.md          # Converted markdown
+│   │           ├── summary.json      # AI-generated summary
+│   │           ├── summary.md        # Rendered summary
+│   │           ├── ref.bib          # Bibliography (future)
+│   │           ├── assets/          # Images, figures
+│   │           └── logs/            # Processing logs
+│   │               └── mineru.log
+│   └── local/               # Local PDF imports by hash
+│       └── a1b2c3d4e5f67890/
+│           └── ... (same structure)
+├── inbox/                   # Temporary import staging (future)
+├── db/                      # Search index (rebuildable)
+│   └── paperlib.sqlite3
+└── cache/                   # Processing cache (safe to delete)
+```
+
+## Paper Directory Organization
+
+### arXiv Papers
+
+arXiv papers are organized by year and paper ID:
+
+```
+papers/arxiv/YEAR/arxiv-NORMALIZED_ID/
+```
+
+Where:
+- `YEAR` is derived from the two-digit year prefix of the arXiv ID (e.g., `2212.06340` → `2022`)
+- `NORMALIZED_ID` is the arXiv ID with dots replaced by underscores; a version suffix, if present, is kept. The resulting directory names are:
+  - `2212.06340` → `arxiv-2212_06340`
+  - `2212.06340v2` → `arxiv-2212_06340v2`
+
+**Examples:**
+```
+papers/arxiv/2022/arxiv-2212_06340/
+papers/arxiv/2023/arxiv-2301_12345v1/
+papers/arxiv/2024/arxiv-2405_98765/
+```
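
The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not paperlib's actual implementation: the function name `arxiv_paper_dir` is hypothetical, and the sketch assumes only new-style arXiv IDs (`YYMM.NNNNN` with an optional `vN` suffix) need to be handled.

```python
import re
from pathlib import PurePosixPath

# New-style arXiv IDs: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
# Assumption: old-style IDs such as "hep-th/9901001" are out of scope here.
_ARXIV_ID = re.compile(r"(\d{2})(\d{2})\.(\d{4,5})(v\d+)?")


def arxiv_paper_dir(arxiv_id: str) -> PurePosixPath:
    """Return the library-relative directory for a new-style arXiv ID."""
    match = _ARXIV_ID.fullmatch(arxiv_id)
    if match is None:
        raise ValueError(f"unsupported arXiv ID: {arxiv_id!r}")
    year = 2000 + int(match.group(1))        # "22" -> 2022
    normalized = arxiv_id.replace(".", "_")  # the version suffix is kept as-is
    return PurePosixPath("papers", "arxiv", str(year), f"arxiv-{normalized}")
```

For example, `arxiv_paper_dir("2212.06340v2")` yields `papers/arxiv/2022/arxiv-2212_06340v2`.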
+
+### Local Papers
+
+Local papers are organized by content hash:
+
+```
+papers/local/HASH_PREFIX/
+```
+
+Where `HASH_PREFIX` is the first 16 hexadecimal characters of the SHA256 hash of the PDF file's contents.
+
+**Examples:**
+```
+papers/local/a1b2c3d4e5f67890/
+papers/local/fedcba9876543210/
+```
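
A directory name of this form could be computed as follows. The helper name `local_paper_dir` is hypothetical, and the sketch assumes the "16 characters" are taken from the lowercase hex digest, as the examples suggest.

```python
import hashlib
from pathlib import Path, PurePosixPath

HASH_PREFIX_LEN = 16  # assumption: counted in hex characters of the digest


def local_paper_dir(pdf_path: Path) -> PurePosixPath:
    """Return the library-relative directory for a locally imported PDF."""
    digest = hashlib.sha256()
    with pdf_path.open("rb") as fh:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    prefix = digest.hexdigest()[:HASH_PREFIX_LEN]
    return PurePosixPath("papers", "local", prefix)
```

Because the name depends only on the file's bytes, importing the same PDF twice resolves to the same directory.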
+
+## File Types
+
+### Required Files
+
+Every paper directory contains:
+
+#### `meta.json`
+The canonical metadata file (JSON format):
+```json
+{
+  "paper_id": "arxiv-2212_06340",
+  "source_type": "arxiv",
+  "source_id": "2212.06340",
+  "title": "Example Paper Title",
+  "authors": ["Alice Smith", "Bob Jones"],
+  "published_date": "2022-12-13T02:46:55",
+  "categories": ["cs.AI", "stat.ML"],
+  "pdf_path": "papers/arxiv/2022/arxiv-2212_06340/source.pdf",
+  "paper_md_path": "papers/arxiv/2022/arxiv-2212_06340/paper.md",
+  "imported_at": "2024-01-15T10:30:00",
+  "conversion_status": "success",
+  "summary_status": "not_requested",
+  "tags": ["machine-learning"],
+  "notes": "Important paper on neural networks"
+}
+```
+
+#### `source.pdf`
+The original PDF file, exactly as imported.
+
+### Generated Files
+
+These files are created by paperlib processing:
+
+#### `paper.md`
+Markdown conversion of the PDF, generated by MinerU or other converters.
+
+#### `summary.json` (optional)
+AI-generated structured summary:
+```json
+{
+  "schema_version": "1.0",
+  "one_sentence_summary": "This paper introduces...",
+  "problem_statement": "Current methods have limitations...",
+  "method_overview": "We propose a novel approach...",
+  "main_results": "Experiments show 95% accuracy...",
+  "claimed_contributions": ["Novel architecture", "Improved performance"],
+  "problem_tags": ["classification", "optimization"],
+  "technique_tags": ["neural-networks", "transformers"],
+  "entities": ["BERT", "ImageNet", "ResNet"],
+  "relevance_to_user": 0.85
+}
+```
+
+#### `summary.md` (optional)
+Human-readable summary rendered from `summary.json`.
+
+### Supporting Directories
+
+#### `assets/`
+Contains extracted images, figures, and other media from the PDF conversion process.
+
+#### `logs/`
+Processing logs for debugging and audit trails:
+- `mineru.log` - PDF conversion logs
+- `summary.log` - AI summarization logs (future)
+
+## Index Database
+
+The SQLite database at `db/paperlib.sqlite3` contains:
+
+### Tables
+
+#### `papers`
+Main paper index with searchable fields:
+- Metadata from all `meta.json` files
+- Computed search fields (full-text, author lists, etc.)
+- Processing status tracking
+
+#### `papers_fts`
+Full-text search virtual table (SQLite FTS5) for content search.
+
+### Rebuilding
+
+The database is **always rebuildable** from the source files:
+```bash
+paperlib reindex
+```
+
+This design keeps the files under `papers/` as the authoritative source of truth; the database is only a derived index.
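
To illustrate what "rebuildable" means in practice, here is a rough sketch of a reindex pass. It is not the code behind `paperlib reindex`: the function name `rebuild_index` and the four-column `papers` table are assumptions made for the example, and the FTS5 table is left out for brevity.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path) -> int:
    """Recreate db/paperlib.sqlite3 from every meta.json under papers/."""
    db_dir = library_root / "db"
    db_dir.mkdir(parents=True, exist_ok=True)
    db_path = db_dir / "paperlib.sqlite3"
    db_path.unlink(missing_ok=True)  # the index is disposable, so start clean

    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical schema: a small subset of the meta.json fields.
        conn.execute(
            "CREATE TABLE papers ("
            "paper_id TEXT PRIMARY KEY, title TEXT, authors TEXT, pdf_path TEXT)"
        )
        count = 0
        for meta_file in sorted((library_root / "papers").rglob("meta.json")):
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
            conn.execute(
                "INSERT INTO papers VALUES (?, ?, ?, ?)",
                (
                    meta["paper_id"],
                    meta.get("title"),
                    ", ".join(meta.get("authors", [])),
                    meta.get("pdf_path"),
                ),
            )
            count += 1
        conn.commit()
    finally:
        conn.close()
    return count
```

The function returns the number of papers indexed. Nothing in the database is carried over between runs, which is what makes deleting `db/` safe.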
+
+## Path Conventions
+
+### Relative Paths
+All paths in `meta.json` are relative to the library root:
+```json
+{
+  "pdf_path": "papers/local/a1b2c3d4e5f67890/source.pdf",
+  "paper_md_path": "papers/local/a1b2c3d4e5f67890/paper.md"
+}
+```
+
+### Cross-Platform Compatibility
+All paths use forward slashes (`/`) regardless of operating system.
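
Both conventions can be handled with `pathlib`. The helper names below are hypothetical; the sketch assumes that stored paths should never be absolute or escape the library root.

```python
from pathlib import Path, PurePosixPath


def to_stored_path(library_root: Path, absolute: Path) -> str:
    """Convert a file path inside the library to the form stored in meta.json."""
    # relative_to() raises ValueError if the file is outside the library root.
    return absolute.relative_to(library_root).as_posix()


def from_stored_path(library_root: Path, stored: str) -> Path:
    """Resolve a stored forward-slash path against the current library root."""
    rel = PurePosixPath(stored)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"stored path must stay inside the library: {stored!r}")
    return library_root.joinpath(*rel.parts)
```

Storing the POSIX form and joining it back part by part means the same `meta.json` resolves correctly on any operating system and after the library has been moved.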
+
+## Backup and Portability
+
+### What to Back Up
+For a complete library backup, include:
+- `config/` directory (configuration)
+- `papers/` directory (source of truth)
+
+### What NOT to Back Up
+These can be regenerated:
+- `db/` directory (rebuildable index)
+- `cache/` directory (temporary files)
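
As one possible way to apply this split, the sketch below archives only the two directories that cannot be regenerated. The function name `backup_library` and the choice of a gzip-compressed tar archive are assumptions for illustration; any backup tool that can include or exclude directories would work equally well.

```python
import tarfile
from pathlib import Path

# Only these directories hold data that cannot be regenerated.
BACKUP_DIRS = ("config", "papers")


def backup_library(library_root: Path, archive_path: Path) -> None:
    """Write config/ and papers/ into a .tar.gz archive, skipping db/ and cache/."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in BACKUP_DIRS:
            src = library_root / name
            if src.is_dir():
                # arcname keeps archive paths relative to the library root.
                tar.add(src, arcname=name)
```

After restoring such an archive, running `paperlib reindex` recreates the `db/` directory.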
+
+### Moving Libraries
+To move a library:
+1. Copy the entire directory structure
+2. Run `paperlib reindex` to rebuild the database
+3. Update any absolute paths in configuration
+
+## Storage Efficiency
+
+### Deduplication
+Because directory names are derived from a paper's identity, papers are naturally deduplicated:
+- arXiv papers by normalized arXiv ID
+- Local papers by SHA256 content hash
+
+### Large Files
+For papers with large asset directories:
+- Assets are stored alongside papers for locality
+- Consider using file system compression or deduplication if needed
+
+## File System Requirements
+
+### Permissions
+paperlib requires:
+- Read/write access to library directory
+- Ability to create subdirectories
+- Atomic file operations for metadata updates
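
The atomic-update requirement is commonly met by writing to a temporary file in the same directory and then renaming it over the target. The sketch below shows that pattern; the function name `write_meta_atomically` is hypothetical and this is not necessarily how paperlib implements it.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_meta_atomically(meta_path: Path, meta: dict) -> None:
    """Replace meta_path with new content so readers never see a partial file."""
    # The temp file must live in the same directory (same file system)
    # for the final rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=meta_path.parent, prefix=".meta-", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(meta, fh, indent=2, ensure_ascii=False)
            fh.flush()
            os.fsync(fh.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, meta_path)  # atomic swap on POSIX and Windows
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

A crash at any point leaves either the old `meta.json` or the new one on disk, never a truncated mix of the two.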
+
+### File System Features
+Recommended:
+- Case-sensitive file system (avoids conflicts)
+- Support for Unicode filenames
+- Journaling (protects against corruption)
+
+### Disk Space
+Typical storage requirements:
+- PDF files: 1-10 MB each
+- Markdown conversions: 10-100 KB each
+- Metadata: ~1-5 KB per paper
+- Database index: ~1-10 KB per paper
+- Assets: Varies (0-50 MB for image-heavy papers)
+
+## Migration and Versioning
+
+### Schema Evolution
+When paperlib updates its storage format:
+- Metadata schema versions are tracked in each file
+- Migration tools handle format upgrades
+- Backward compatibility is maintained when possible
+
+### Validation
+A planned command will validate library integrity:
+```bash
+paperlib doctor  # (future command)
+```
+
+This will check:
+- All referenced files exist
+- Metadata format is valid
+- Database consistency with files
+- No orphaned or corrupted data
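
Since the command does not exist yet, the following is only a sketch of how the first two checks might look. The function name `find_problems` and the message format are invented for illustration; the sketch assumes `pdf_path` and `paper_md_path` are the fields that reference files.

```python
import json
from pathlib import Path

# Assumption: these meta.json fields hold library-relative file paths.
PATH_FIELDS = ("pdf_path", "paper_md_path")


def find_problems(library_root: Path) -> list[str]:
    """Return human-readable problems found in the meta.json files."""
    problems = []
    for meta_file in sorted((library_root / "papers").rglob("meta.json")):
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{meta_file}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(meta, dict):
            problems.append(f"{meta_file}: expected a JSON object")
            continue
        for key in PATH_FIELDS:
            stored = meta.get(key)
            if stored and not (library_root / stored).is_file():
                problems.append(f"{meta_file}: {key} points to missing file {stored}")
    return problems
```

An empty list means every parsed `meta.json` references files that exist; database consistency and orphan detection would need additional passes.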