# Storage Layout

This document describes the on-disk structure and organization of a paperlib library.

## Overview

A paperlib library is a directory containing all papers, metadata, configuration, and index data. The layout is designed to be:

- **Human-readable**: Directory structure is intuitive and browsable
- **Stable**: File locations don't change unexpectedly
- **Rebuildable**: Index can be reconstructed from source files
- **Portable**: Entire library can be moved or backed up as a unit

## Directory Structure

```
library_root/
├── config/                          # Library configuration
│   ├── config.toml                  # Main configuration file
│   ├── vocab.yaml                   # Controlled vocabulary (future)
│   └── prompts/                     # AI prompt templates (future)
│       └── summarize_paper.md
├── papers/                          # Paper storage (source of truth)
│   ├── arxiv/                       # arXiv papers organized by year
│   │   └── 2022/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json        # Paper metadata
│   │           ├── source.pdf       # Original PDF
│   │           ├── paper.md         # Converted markdown
│   │           ├── summary.json     # AI-generated summary
│   │           ├── summary.md       # Rendered summary
│   │           ├── ref.bib          # Bibliography (future)
│   │           ├── assets/          # Images, figures
│   │           └── logs/            # Processing logs
│   │               └── mineru.log
│   └── local/                       # Local PDF imports by hash
│       └── a1b2c3d4e5f67890/
│           └── ... (same structure)
├── inbox/                           # Temporary import staging (future)
├── db/                              # Search index (rebuildable)
│   └── paperlib.sqlite3
└── cache/                           # Processing cache (safe to delete)
```

## Paper Directory Organization

### arXiv Papers

arXiv papers are organized by year and paper ID:

```
papers/arxiv/YEAR/arxiv-NORMALIZED_ID/
```

Where:

- `YEAR` is derived from the first two digits of the arXiv ID (e.g., `2212.06340` → `2022`, `0001.12345` → `2000`)
- `NORMALIZED_ID` is the arXiv ID with the dot replaced by an underscore; a version suffix, if present, is kept as-is
  - `2212.06340` → `arxiv-2212_06340`
  - `2212.06340v2` → `arxiv-2212_06340v2`

The year extraction follows arXiv's YYMM.NNNNN format:

- Years 00-89 map to 2000-2089
- Years 90-99 map to 1990-1999

**Examples:**

```
papers/arxiv/2022/arxiv-2212_06340/      # 2212.06340 -> year 2022
papers/arxiv/2023/arxiv-2301_12345v1/    # 2301.12345v1 -> year 2023
papers/arxiv/2000/arxiv-0001_98765/      # 0001.98765 -> year 2000
papers/arxiv/1999/arxiv-9912_12345/      # 9912.12345 -> year 1999
```

### Local Papers

Local papers are organized by content hash:

```
papers/local/HASH_PREFIX/
```

Where `HASH_PREFIX` is the first 16 hexadecimal characters of the SHA256 hash of the PDF file.

**Examples:**

```
papers/local/a1b2c3d4e5f67890/
papers/local/fedcba9876543210/
```

## File Types

### Required Files

Every paper directory contains:

#### `meta.json`

The canonical metadata file (JSON format):

```json
{
  "paper_id": "arxiv-2212_06340",
  "source_type": "arxiv",
  "source_id": "2212.06340",
  "title": "Example Paper Title",
  "authors": ["Alice Smith", "Bob Jones"],
  "published_date": "2022-12-13T02:46:55",
  "categories": ["cs.AI", "stat.ML"],
  "pdf_path": "papers/arxiv/2022/arxiv-2212_06340/source.pdf",
  "paper_md_path": "papers/arxiv/2022/arxiv-2212_06340/paper.md",
  "imported_at": "2024-01-15T10:30:00",
  "conversion_status": "success",
  "summary_status": "not_requested",
  "tags": ["machine-learning"],
  "notes": "Important paper on neural networks"
}
```

#### `source.pdf`

The original PDF file, exactly as imported.
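
The `pdf_path` and `paper_md_path` values in `meta.json` follow directly from the directory rules in "Paper Directory Organization". The sketch below shows one way to derive them. The function names (`arxiv_year`, `arxiv_paper_dir`, `local_paper_dir`, `meta_paths`) and the ID pattern are illustrative assumptions, not paperlib's actual API:

```python
import hashlib
import re
from pathlib import PurePosixPath

# Assumed pattern for new-style IDs: YYMM.NNNN or YYMM.NNNNN, optional vN suffix.
_ARXIV_ID = re.compile(r"^(\d{2})(\d{2})\.(\d{4,5})(v\d+)?$")


def arxiv_year(arxiv_id: str) -> int:
    """Map the YY prefix of a YYMM.NNNNN ID to a four-digit year."""
    match = _ARXIV_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"unsupported arXiv ID: {arxiv_id!r}")
    yy = int(match.group(1))
    # 00-89 -> 2000-2089, 90-99 -> 1990-1999
    return 2000 + yy if yy <= 89 else 1900 + yy


def arxiv_paper_dir(arxiv_id: str) -> PurePosixPath:
    """Return the library-relative directory for an arXiv paper."""
    year = arxiv_year(arxiv_id)
    normalized = arxiv_id.replace(".", "_")  # the version suffix is kept
    return PurePosixPath("papers", "arxiv", str(year), f"arxiv-{normalized}")


def local_paper_dir(pdf_bytes: bytes) -> PurePosixPath:
    """Return the library-relative directory for a locally imported PDF."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    return PurePosixPath("papers", "local", digest[:16])  # 16-character prefix


def meta_paths(paper_dir: PurePosixPath) -> dict[str, str]:
    """Build the path fields stored in meta.json (always forward slashes)."""
    return {
        "pdf_path": str(paper_dir / "source.pdf"),
        "paper_md_path": str(paper_dir / "paper.md"),
    }
```

With these helpers, `arxiv_paper_dir("2212.06340v2")` yields `papers/arxiv/2022/arxiv-2212_06340v2`. Using `PurePosixPath` keeps the stored strings identical on every operating system.
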
### Generated Files

These files are created by paperlib processing:

#### `paper.md`

Markdown conversion of the PDF, generated by MinerU or other converters.

#### `summary.json` (optional)

AI-generated structured summary:

```json
{
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces...",
  "problem_statement": "Current methods have limitations...",
  "method_overview": "We propose a novel approach...",
  "main_results": "Experiments show 95% accuracy...",
  "claimed_contributions": ["Novel architecture", "Improved performance"],
  "problem_tags": ["classification", "optimization"],
  "technique_tags": ["neural-networks", "transformers"],
  "entities": ["BERT", "ImageNet", "ResNet"],
  "relevance_to_user": 0.85
}
```

#### `summary.md` (optional)

Human-readable summary rendered from `summary.json`.

### Supporting Directories

#### `assets/`

Contains images, figures, and other media extracted during PDF conversion.

#### `logs/`

Processing logs for debugging and audit trails:

- `mineru.log` - PDF conversion logs
- `summary.log` - AI summarization logs (future)

## Index Database

The SQLite database at `db/paperlib.sqlite3` contains:

### Tables

#### `papers`

Main paper index with searchable fields:

- Metadata from all `meta.json` files
- Computed search fields (full-text, author lists, etc.)
- Processing status tracking

#### `papers_fts`

Full-text search virtual table (SQLite FTS5) for content search.

### Rebuilding

The database is **always rebuildable** from the source files:

```bash
paperlib reindex
```

Because the index is derived data, the JSON files under `papers/` remain the authoritative source of truth.

## Path Conventions

### Relative Paths

All paths in `meta.json` are relative to the library root:

```json
{
  "pdf_path": "papers/local/a1b2c3d4e5f67890/source.pdf",
  "paper_md_path": "papers/local/a1b2c3d4e5f67890/paper.md"
}
```

### Cross-Platform Compatibility

All paths use forward slashes (`/`) regardless of operating system.
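
The two conventions above (library-relative paths, forward slashes) can be sketched with the standard `pathlib` module. The helper names `to_stored_path` and `from_stored_path` are hypothetical and shown only for illustration; the rejection of `..` segments is an added safety assumption, not documented paperlib behavior:

```python
from pathlib import Path, PurePosixPath


def to_stored_path(library_root: Path, absolute: Path) -> str:
    """Convert an absolute file path into the form stored in meta.json."""
    # relative_to raises ValueError if the file lies outside the library.
    relative = absolute.relative_to(library_root)
    return relative.as_posix()  # forward slashes on every platform


def from_stored_path(library_root: Path, stored: str) -> Path:
    """Resolve a stored path against the library root on the current OS."""
    posix = PurePosixPath(stored)
    # Assumption: stored paths must never escape the library root.
    if posix.is_absolute() or ".." in posix.parts:
        raise ValueError(f"stored path must stay inside the library: {stored!r}")
    return library_root.joinpath(*posix.parts)
```

Because only the library root is machine-specific, a library copied to another machine or operating system resolves the same stored strings without rewriting any `meta.json` file.
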
## Backup and Portability

### What to Back Up

For a complete library backup, include:

- `config/` directory (configuration)
- `papers/` directory (source of truth)

### What NOT to Back Up

These can be regenerated:

- `db/` directory (rebuildable index)
- `cache/` directory (temporary files)

### Moving Libraries

To move a library:

1. Copy the entire directory structure
2. Run `paperlib reindex` to rebuild the database
3. Update any absolute paths in configuration

## Storage Efficiency

### Deduplication

Papers are naturally deduplicated:

- arXiv papers by normalized arXiv ID
- Local papers by SHA256 content hash

### Large Files

For papers with large asset directories:

- Assets are stored alongside papers for locality
- Consider using file system compression or deduplication if needed

## File System Requirements

### Permissions

paperlib requires:

- Read/write access to the library directory
- Ability to create subdirectories
- Atomic file operations for metadata updates

### File System Features

Recommended:

- Case-sensitive file system (avoids conflicts)
- Support for Unicode filenames
- Journaling (protects against corruption)

### Disk Space

Typical storage requirements:

- PDF files: 1-10 MB each
- Markdown conversions: 10-100 KB each
- Metadata: ~1-5 KB per paper
- Database index: ~1-10 KB per paper
- Assets: Varies (0-50 MB for image-heavy papers)

## Migration and Versioning

### Schema Evolution

When paperlib updates its storage format:

- Metadata schema versions are tracked in each file
- Migration tools handle format upgrades
- Backward compatibility is maintained when possible

### Validation

paperlib provides tools to validate library integrity:

```bash
paperlib doctor  # (future command)
```

This will check:

- All referenced files exist
- Metadata format is valid
- Database consistency with files
- No orphaned or corrupted data
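
Until such a command exists, the file-level checks in this list can be approximated with a short script. The following is a minimal sketch, not the planned `paperlib doctor` implementation: `check_library` is a hypothetical name, and it assumes that `pdf_path` is mandatory while `paper_md_path` is only present once conversion has run. It does not inspect the database.

```python
import json
from pathlib import Path, PurePosixPath


def check_library(library_root: Path) -> list[str]:
    """Return human-readable problems found under papers/ (empty if none)."""
    problems: list[str] = []
    papers_root = library_root / "papers"

    for meta_file in sorted(papers_root.rglob("meta.json")):
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:  # unreadable or invalid JSON
            problems.append(f"{meta_file}: unreadable metadata ({exc})")
            continue
        if not isinstance(meta, dict):
            problems.append(f"{meta_file}: top-level value is not an object")
            continue

        # Assumption: pdf_path is required, paper_md_path is optional.
        for key, required in (("pdf_path", True), ("paper_md_path", False)):
            stored = meta.get(key)
            if stored is None:
                if required:
                    problems.append(f"{meta_file}: missing {key}")
                continue
            if not isinstance(stored, str):
                problems.append(f"{meta_file}: {key} is not a string")
                continue
            # Stored paths are library-relative and use forward slashes.
            target = library_root.joinpath(*PurePosixPath(stored).parts)
            if not target.is_file():
                problems.append(f"{meta_file}: {key} -> {stored} does not exist")

    # Orphans: a source.pdf with no meta.json beside it.
    for pdf in sorted(papers_root.rglob("source.pdf")):
        if not (pdf.parent / "meta.json").is_file():
            problems.append(f"{pdf.parent}: source.pdf without meta.json")

    return problems
```

Calling `check_library(Path("library_root"))` returns an empty list for a consistent library, so the result can be used directly as a pass/fail signal in a backup script.
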