Storage Layout
This document describes the on-disk structure and organization of a paperlib library.
Overview
A paperlib library is a directory containing all papers, metadata, configuration, and index data. The layout is designed to be:
- Human-readable: Directory structure is intuitive and browsable
- Stable: File locations don't change unexpectedly
- Rebuildable: Index can be reconstructed from source files
- Portable: Entire library can be moved or backed up as a unit
Directory Structure
library_root/
├── config/                          # Library configuration
│   ├── config.toml                  # Main configuration file
│   ├── vocab.yaml                   # Controlled vocabulary (future)
│   └── prompts/                     # AI prompt templates (future)
│       └── summarize_paper.md
├── papers/                          # Paper storage (source of truth)
│   ├── arxiv/                       # arXiv papers organized by year
│   │   └── 2022/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json        # Paper metadata
│   │           ├── source.pdf       # Original PDF
│   │           ├── paper.md         # Converted markdown
│   │           ├── summary.json     # AI-generated summary
│   │           ├── summary.md       # Rendered summary
│   │           ├── ref.bib          # Bibliography (future)
│   │           ├── assets/          # Images, figures
│   │           └── logs/            # Processing logs
│   │               └── mineru.log
│   └── local/                       # Local PDF imports by hash
│       └── a1b2c3d4e5f67890/
│           └── ... (same structure)
├── inbox/                           # Temporary import staging (future)
├── db/                              # Search index (rebuildable)
│   └── paperlib.sqlite3
└── cache/                           # Processing cache (safe to delete)
Paper Directory Organization
arXiv Papers
arXiv papers are organized by year and paper ID:
papers/arxiv/YEAR/arxiv-NORMALIZED_ID/
Where:
- YEAR is extracted from the arXiv ID (e.g., 2212.06340 → 2022)
- NORMALIZED_ID replaces dots with underscores and keeps any version suffix
  - 2212.06340 → arxiv-2212_06340
  - 2212.06340v2 → arxiv-2212_06340v2
Examples:
papers/arxiv/2022/arxiv-2212_06340/
papers/arxiv/2023/arxiv-2301_12345v1/
papers/arxiv/2024/arxiv-2405_98765/
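The mapping from an arXiv ID to its directory can be sketched as follows. This is a minimal illustration, not paperlib's actual code: the function name arxiv_paper_dir is hypothetical, and the sketch assumes new-style IDs of the form YYMM.NNNNN (optionally with a vN suffix), reading the year as 2000 + YY. Old-style IDs such as hep-th/9901001 are rejected rather than handled.

```python
import re
from pathlib import PurePosixPath

# New-style arXiv IDs: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
_ARXIV_ID = re.compile(r"^(\d{2})(\d{2})\.(\d{4,5})(v\d+)?$")


def arxiv_paper_dir(arxiv_id: str) -> PurePosixPath:
    """Return the library-relative directory for a new-style arXiv ID."""
    match = _ARXIV_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"unsupported arXiv ID: {arxiv_id!r}")
    year = 2000 + int(match.group(1))        # "22" -> 2022
    normalized = arxiv_id.replace(".", "_")  # dots -> underscores, version kept
    return PurePosixPath("papers/arxiv") / str(year) / f"arxiv-{normalized}"


print(arxiv_paper_dir("2212.06340"))    # papers/arxiv/2022/arxiv-2212_06340
print(arxiv_paper_dir("2301.12345v1"))  # papers/arxiv/2023/arxiv-2301_12345v1
```

PurePosixPath is used so the result always contains forward slashes, whatever the host operating system.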
Local Papers
Local papers are organized by content hash:
papers/local/HASH_PREFIX/
Where HASH_PREFIX is the first 16 characters of the SHA256 hash of the PDF file.
Examples:
papers/local/a1b2c3d4e5f67890/
papers/local/fedcba9876543210/
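A directory name of this kind could be derived as in the sketch below. The helper names hash_prefix and local_paper_dir are hypothetical, and the sketch assumes the prefix is taken from the lowercase hexadecimal digest, as the examples above suggest.

```python
import hashlib
from typing import BinaryIO


def hash_prefix(stream: BinaryIO, length: int = 16, chunk_size: int = 1 << 20) -> str:
    """Return the first `length` hex characters of the stream's SHA-256 digest."""
    digest = hashlib.sha256()
    # Read in chunks so large PDFs are not loaded into memory at once.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()[:length]


def local_paper_dir(pdf_path: str) -> str:
    """Return the library-relative directory for a local PDF file."""
    with open(pdf_path, "rb") as pdf:
        return f"papers/local/{hash_prefix(pdf)}"
```

Because the name depends only on file content, importing the same PDF twice yields the same directory. Sixteen hex characters correspond to 64 bits of the digest.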
File Types
Required Files
Every paper directory contains:
meta.json
The canonical metadata file (JSON format):
{
"paper_id": "arxiv-2212_06340",
"source_type": "arxiv",
"source_id": "2212.06340",
"title": "Example Paper Title",
"authors": ["Alice Smith", "Bob Jones"],
"published_date": "2022-12-13T02:46:55",
"categories": ["cs.AI", "stat.ML"],
"pdf_path": "papers/arxiv/2022/arxiv-2212_06340/source.pdf",
"paper_md_path": "papers/arxiv/2022/arxiv-2212_06340/paper.md",
"imported_at": "2024-01-15T10:30:00",
"conversion_status": "success",
"summary_status": "not_requested",
"tags": ["machine-learning"],
"notes": "Important paper on neural networks"
}
source.pdf
The original PDF file, exactly as imported.
Generated Files
These files are created by paperlib processing:
paper.md
Markdown conversion of the PDF, generated by MinerU or other converters.
summary.json (optional)
AI-generated structured summary:
{
"schema_version": "1.0",
"one_sentence_summary": "This paper introduces...",
"problem_statement": "Current methods have limitations...",
"method_overview": "We propose a novel approach...",
"main_results": "Experiments show 95% accuracy...",
"claimed_contributions": ["Novel architecture", "Improved performance"],
"problem_tags": ["classification", "optimization"],
"technique_tags": ["neural-networks", "transformers"],
"entities": ["BERT", "ImageNet", "ResNet"],
"relevance_to_user": 0.85
}
summary.md (optional)
Human-readable summary rendered from summary.json.
Supporting Directories
assets/
Contains extracted images, figures, and other media from the PDF conversion process.
logs/
Processing logs for debugging and audit trails:
- mineru.log: PDF conversion logs
- summary.log: AI summarization logs (future)
Index Database
The SQLite database at db/paperlib.sqlite3 contains:
Tables
papers
Main paper index with searchable fields:
- Metadata from all meta.json files
- Computed search fields (full-text, author lists, etc.)
- Processing status tracking
papers_fts
Full-text search virtual table (SQLite FTS5) for content search.
Rebuilding
The database is always rebuildable from the source files:
paperlib reindex
This design ensures the JSON files remain the authoritative source of truth.
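To illustrate why the index is disposable, here is a minimal sketch of a rebuild that walks every meta.json under papers/ and repopulates a papers table. It is not the implementation behind paperlib reindex: the function name rebuild_index and the column set are assumptions made for the example, and the full-text table is omitted.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, conn: sqlite3.Connection) -> int:
    """Recreate a simplified papers table from meta.json files; return the row count."""
    # Hypothetical schema: the real table likely has more columns.
    conn.execute("DROP TABLE IF EXISTS papers")
    conn.execute(
        "CREATE TABLE papers ("
        "paper_id TEXT PRIMARY KEY, title TEXT, source_type TEXT, meta_path TEXT)"
    )
    count = 0
    for meta_file in sorted((library_root / "papers").rglob("meta.json")):
        meta = json.loads(meta_file.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT INTO papers VALUES (?, ?, ?, ?)",
            (
                meta["paper_id"],
                meta.get("title"),
                meta.get("source_type"),
                # Store the path relative to the library root, with forward slashes.
                meta_file.relative_to(library_root).as_posix(),
            ),
        )
        count += 1
    conn.commit()
    return count
```

Nothing in the sketch reads from the old database, which is the property that makes deleting db/ safe.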
Path Conventions
Relative Paths
All paths in meta.json are relative to the library root:
{
"pdf_path": "papers/local/a1b2c3d4e5f67890/source.pdf",
"paper_md_path": "papers/local/a1b2c3d4e5f67890/paper.md"
}
Cross-Platform Compatibility
All paths use forward slashes (/) regardless of operating system.
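Converting between stored paths and native paths might look like the sketch below. The helper names to_stored_path and from_stored_path are hypothetical and not part of paperlib. The sketch relies only on pathlib behaviour: as_posix() always emits forward slashes, and joining the individual parts lets the host platform pick its own separator.

```python
from pathlib import Path, PurePosixPath


def to_stored_path(library_root: Path, absolute: Path) -> str:
    """Convert a path inside the library to its stored, forward-slash form."""
    return absolute.relative_to(library_root).as_posix()


def from_stored_path(library_root: Path, stored: str) -> Path:
    """Resolve a stored forward-slash path against the library root."""
    return library_root.joinpath(*PurePosixPath(stored).parts)
```

relative_to raises ValueError when the path lies outside the library root, which guards against storing a path that would break once the library is moved.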
Backup and Portability
What to Backup
For complete library backup, include:
- config/ directory (configuration)
- papers/ directory (source of truth)
What NOT to Backup
These can be regenerated:
- db/ directory (rebuildable index)
- cache/ directory (temporary files)
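A backup that follows this split could be sketched with the standard tarfile module. The function name backup_library and the choice of a gzip-compressed tar archive are assumptions for illustration. paperlib itself does not provide this helper as far as this document states.

```python
import tarfile
from pathlib import Path

# Directories that cannot be regenerated; db/ and cache/ are deliberately omitted.
BACKUP_DIRS = ("config", "papers")


def backup_library(library_root: Path, archive_path: Path) -> list[str]:
    """Archive the non-rebuildable directories; return the names that were included."""
    included = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in BACKUP_DIRS:
            source = library_root / name
            if source.is_dir():
                # arcname keeps archive paths relative to the library root.
                tar.add(source, arcname=name)
                included.append(name)
    return included
```

After restoring such an archive, the index has to be rebuilt because db/ was not part of the backup.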
Moving Libraries
To move a library:
- Copy the entire directory structure
- Run paperlib reindex to rebuild the database
- Update any absolute paths in configuration
Storage Efficiency
Deduplication
Papers are naturally deduplicated:
- arXiv papers by normalized arXiv ID
- Local papers by SHA256 content hash
Large Files
For papers with large asset directories:
- Assets are stored alongside papers for locality
- Consider using file system compression or deduplication if needed
File System Requirements
Permissions
paperlib requires:
- Read/write access to library directory
- Ability to create subdirectories
- Atomic file operations for metadata updates
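The atomicity requirement usually comes down to a write-then-rename pattern, sketched below. This is one common approach, not necessarily what paperlib does internally, and the function name write_json_atomic is hypothetical. The temporary file is created in the target's own directory so that os.replace stays on a single file system.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write JSON to a temporary file, then rename it over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp, indent=2, ensure_ascii=False)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # readers see the old file or the new one
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A crash in the middle of the write leaves at most a stray .tmp file, and the existing meta.json is untouched.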
File System Features
Recommended:
- Case-sensitive file system (avoids conflicts)
- Support for Unicode filenames
- Journaling (protects against corruption)
Disk Space
Typical storage requirements:
- PDF files: 1-10 MB each
- Markdown conversions: 10-100 KB each
- Metadata: ~1-5 KB per paper
- Database index: ~1-10 KB per paper
- Assets: Varies (0-50 MB for image-heavy papers)
Migration and Versioning
Schema Evolution
When paperlib updates its storage format:
- Metadata schema versions are tracked in each file
- Migration tools handle format upgrades
- Backward compatibility is maintained when possible
Validation
paperlib provides tools to validate library integrity:
paperlib doctor # (future command)
This will check:
- All referenced files exist
- Metadata format is valid
- Database consistency with files
- No orphaned or corrupted data
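Until such a command exists, the first two checks can be approximated by hand. The sketch below is not the planned paperlib doctor. The function name check_referenced_files is hypothetical, and it only inspects the pdf_path and paper_md_path fields shown in the meta.json example.

```python
import json
from pathlib import Path

# Path-valued fields taken from the meta.json example in this document.
PATH_FIELDS = ("pdf_path", "paper_md_path")


def check_referenced_files(library_root: Path) -> list[str]:
    """Return one message per unreadable meta.json or missing referenced file."""
    problems = []
    for meta_file in sorted((library_root / "papers").rglob("meta.json")):
        rel = meta_file.relative_to(library_root).as_posix()
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append(f"{rel}: invalid JSON")
            continue
        for field in PATH_FIELDS:
            stored = meta.get(field)
            # Stored paths are relative to the library root.
            if stored and not (library_root / stored).is_file():
                problems.append(f"{rel}: missing {field} {stored}")
    return problems
```

An empty result means every checked reference resolves. It says nothing about database consistency or orphaned files.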