paperlib/docs/storage-layout.md
2026-04-17 17:03:59 -04:00

Storage Layout

This document describes the on-disk structure and organization of a paperlib library.

Overview

A paperlib library is a directory containing all papers, metadata, configuration, and index data. The layout is designed to be:

  • Human-readable: Directory structure is intuitive and browsable
  • Stable: File locations don't change unexpectedly
  • Rebuildable: Index can be reconstructed from source files
  • Portable: Entire library can be moved or backed up as a unit

Directory Structure

library_root/
├── config/                    # Library configuration
│   ├── config.toml           # Main configuration file
│   ├── vocab.yaml            # Controlled vocabulary (future)
│   └── prompts/              # AI prompt templates (future)
│       └── summarize_paper.md
├── papers/                   # Paper storage (source of truth)
│   ├── arxiv/               # arXiv papers organized by year
│   │       └── 2022/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json         # Paper metadata
│   │           ├── source.pdf        # Original PDF
│   │           ├── paper.md          # Converted markdown
│   │           ├── summary.json      # AI-generated summary
│   │           ├── summary.md        # Rendered summary
│   │           ├── ref.bib          # Bibliography (future)
│   │           ├── assets/          # Images, figures
│   │           └── logs/            # Processing logs
│   │               └── mineru.log
│   └── local/               # Local PDF imports by hash
│       └── a1b2c3d4e5f67890/
│           └── ... (same structure)
├── inbox/                   # Temporary import staging (future)
├── db/                      # Search index (rebuildable)
│   └── paperlib.sqlite3
└── cache/                   # Processing cache (safe to delete)

Paper Directory Organization

arXiv Papers

arXiv papers are organized by year and paper ID:

papers/arxiv/YEAR/arxiv-NORMALIZED_ID/

Where:

  • YEAR is derived from the first two digits of the arXiv ID (e.g., 2212.06340 → 2022, 0001.12345 → 2000)
  • NORMALIZED_ID is the arXiv ID with the dot replaced by an underscore; a version suffix, if present, is kept
    • 2212.06340 → arxiv-2212_06340
    • 2212.06340v2 → arxiv-2212_06340v2

The year extraction follows arXiv's YYMM.NNNNN format:

  • Years 00-89 map to 2000-2089
  • Years 90-99 map to 1990-1999

Examples:

papers/arxiv/2022/arxiv-2212_06340/    # 2212.06340 -> year 2022
papers/arxiv/2023/arxiv-2301_12345v1/  # 2301.12345v1 -> year 2023  
papers/arxiv/2000/arxiv-0001_98765/    # 0001.98765 -> year 2000
papers/arxiv/1999/arxiv-9912_12345/    # 9912.12345 -> year 1999
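
The mapping above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not paperlib's actual code: the function name `arxiv_paper_dir` and the regular expression are hypothetical, and only IDs in the YYMM.NNNNN form described here are handled.

```python
import re

# Assumed ID shape: YYMM.NNNNN with an optional version suffix such as "v2".
_ARXIV_ID = re.compile(r"(\d{2})(\d{2})\.(\d{4,5})(v\d+)?")


def arxiv_paper_dir(arxiv_id: str) -> str:
    """Return the library-relative directory for an arXiv ID (hypothetical helper)."""
    match = _ARXIV_ID.fullmatch(arxiv_id)
    if match is None:
        raise ValueError(f"unrecognized arXiv ID: {arxiv_id!r}")
    yy = int(match.group(1))
    # Two-digit years 00-89 map to 2000-2089; 90-99 map to 1990-1999.
    year = 2000 + yy if yy < 90 else 1900 + yy
    # Only the dot is replaced; a version suffix is kept unchanged.
    normalized = arxiv_id.replace(".", "_")
    return f"papers/arxiv/{year}/arxiv-{normalized}"
```

Calling `arxiv_paper_dir("2212.06340")` yields `papers/arxiv/2022/arxiv-2212_06340`, matching the first example above.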

Local Papers

Local papers are organized by content hash:

papers/local/HASH_PREFIX/

Where HASH_PREFIX is the first 16 hexadecimal characters of the SHA-256 hash of the PDF file's contents.

Examples:

papers/local/a1b2c3d4e5f67890/
papers/local/fedcba9876543210/
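
As an illustration, the directory for a local PDF could be computed as follows. The helper name `local_paper_dir` is hypothetical, and the sketch assumes the prefix is taken from the lowercase hexadecimal digest, as the examples above suggest.

```python
import hashlib
from pathlib import Path


def local_paper_dir(pdf_path: Path, prefix_len: int = 16) -> str:
    """Return the library-relative directory for a local PDF (hypothetical helper)."""
    digest = hashlib.sha256()
    with pdf_path.open("rb") as handle:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    # Assumption: the prefix comes from the lowercase hex digest.
    return f"papers/local/{digest.hexdigest()[:prefix_len]}"
```

Because the directory name depends only on file contents, importing the same PDF twice resolves to the same location.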

File Types

Required Files

Every paper directory contains:

meta.json

The canonical metadata file (JSON format):

{
  "paper_id": "arxiv-2212_06340",
  "source_type": "arxiv",
  "source_id": "2212.06340",
  "title": "Example Paper Title",
  "authors": ["Alice Smith", "Bob Jones"],
  "published_date": "2022-12-13T02:46:55",
  "categories": ["cs.AI", "stat.ML"],
  "pdf_path": "papers/arxiv/2022/arxiv-2212_06340/source.pdf",
  "paper_md_path": "papers/arxiv/2022/arxiv-2212_06340/paper.md",
  "imported_at": "2024-01-15T10:30:00",
  "conversion_status": "success",
  "summary_status": "not_requested",
  "tags": ["machine-learning"],
  "notes": "Important paper on neural networks"
}

source.pdf

The original PDF file, exactly as imported.

Generated Files

These files are created by paperlib processing:

paper.md

Markdown conversion of the PDF, generated by MinerU or other converters.

summary.json (optional)

AI-generated structured summary:

{
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces...",
  "problem_statement": "Current methods have limitations...",
  "method_overview": "We propose a novel approach...",
  "main_results": "Experiments show 95% accuracy...",
  "claimed_contributions": ["Novel architecture", "Improved performance"],
  "problem_tags": ["classification", "optimization"],
  "technique_tags": ["neural-networks", "transformers"],
  "entities": ["BERT", "ImageNet", "ResNet"],
  "relevance_to_user": 0.85
}

summary.md (optional)

Human-readable summary rendered from summary.json.

Supporting Directories

assets/

Contains extracted images, figures, and other media from the PDF conversion process.

logs/

Processing logs for debugging and audit trails:

  • mineru.log - PDF conversion logs
  • summary.log - AI summarization logs (future)

Index Database

The SQLite database at db/paperlib.sqlite3 contains:

Tables

papers

Main paper index with searchable fields:

  • Metadata from all meta.json files
  • Computed search fields (full-text, author lists, etc.)
  • Processing status tracking

papers_fts

Full-text search virtual table (SQLite FTS5) for content search.

Rebuilding

The database is always rebuildable from the source files:

paperlib reindex

This design ensures the JSON files remain the authoritative source of truth.
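
To make the idea concrete, the following sketch rebuilds a small index by scanning every meta.json under papers/. It is not the implementation behind paperlib reindex: the function name `rebuild_index` and the four-column table are assumptions chosen for illustration, and the FTS5 table is left out.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path) -> int:
    """Recreate a simplified papers table from meta.json files; return the row count."""
    db_dir = library_root / "db"
    db_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_dir / "paperlib.sqlite3")
    count = 0
    try:
        with conn:  # commits on success, rolls back on error
            # Hypothetical schema: the real table layout is not specified here.
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, authors TEXT, meta_path TEXT)"
            )
            for meta_file in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_file.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT INTO papers VALUES (?, ?, ?, ?)",
                    (
                        meta["paper_id"],
                        meta.get("title"),
                        json.dumps(meta.get("authors", [])),
                        meta_file.relative_to(library_root).as_posix(),
                    ),
                )
                count += 1
    finally:
        conn.close()
    return count
```

Dropping and recreating the table on each run keeps the sketch idempotent: deleting db/ and rerunning produces the same index.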

Path Conventions

Relative Paths

All paths in meta.json are relative to the library root:

{
  "pdf_path": "papers/local/a1b2c3d4e5f67890/source.pdf",
  "paper_md_path": "papers/local/a1b2c3d4e5f67890/paper.md"
}

Cross-Platform Compatibility

All paths use forward slashes (/) regardless of operating system.
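
One way to honor both conventions is to convert at the boundary: store forward-slash relative paths, and resolve them against the library root only when touching the file system. The two helper names below are hypothetical; the pathlib calls are standard library.

```python
from pathlib import Path, PurePosixPath


def to_stored_path(library_root: Path, absolute: Path) -> str:
    """Convert a path inside the library to its stored, forward-slash form."""
    # relative_to raises ValueError if the path is outside the library root.
    return absolute.relative_to(library_root).as_posix()


def from_stored_path(library_root: Path, stored: str) -> Path:
    """Resolve a stored relative path against the library root."""
    posix = PurePosixPath(stored)
    if posix.is_absolute():
        raise ValueError(f"stored paths must be relative: {stored!r}")
    # Joining part by part lets pathlib pick the native separator.
    return library_root.joinpath(*posix.parts)
```

With this split, the same meta.json can be read on Windows, macOS, or Linux without rewriting its paths.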

Backup and Portability

What to Back Up

For a complete library backup, include:

  • config/ directory (configuration)
  • papers/ directory (source of truth)

What NOT to Back Up

These can be regenerated:

  • db/ directory (rebuildable index)
  • cache/ directory (temporary files)
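
The include and exclude lists above translate directly into a small backup script. This is a sketch using Python's standard tarfile module; the function name `backup_library` and the choice of a gzip-compressed tar archive are assumptions, not a paperlib feature.

```python
import tarfile
from pathlib import Path

# Only the directories that cannot be regenerated; db/ and cache/ are skipped.
BACKUP_DIRS = ("config", "papers")


def backup_library(library_root: Path, archive_path: Path) -> None:
    """Write config/ and papers/ into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in BACKUP_DIRS:
            source = library_root / name
            if source.is_dir():
                # arcname keeps archive paths relative to the library root.
                archive.add(source, arcname=name)
```

After restoring such an archive, running paperlib reindex recreates the db/ directory.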

Moving Libraries

To move a library:

  1. Copy the entire directory structure
  2. Run paperlib reindex to rebuild the database
  3. Update any absolute paths in configuration

Storage Efficiency

Deduplication

Papers are naturally deduplicated:

  • arXiv papers by normalized arXiv ID
  • Local papers by SHA256 content hash

Large Files

For papers with large asset directories:

  • Assets are stored alongside papers for locality
  • Consider using file system compression or deduplication if needed

File System Requirements

Permissions

paperlib requires:

  • Read/write access to library directory
  • Ability to create subdirectories
  • Atomic file operations for metadata updates
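
The atomicity requirement usually means writing to a temporary file in the same directory and then renaming it over the target, so readers see either the old or the new meta.json but never a partial one. The sketch below shows that common pattern; `write_meta_atomically` is a hypothetical name, and whether paperlib uses exactly this approach is an assumption.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_meta_atomically(meta_path: Path, meta: dict) -> None:
    """Replace meta_path with new JSON content in a single rename step."""
    # The temp file must live in the same directory (same file system)
    # for os.replace to be atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=meta_path.parent, prefix=".meta-", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(meta, handle, indent=2, ensure_ascii=False)
            handle.write("\n")
            handle.flush()
            os.fsync(handle.fileno())  # push data to disk before the rename
        os.replace(tmp_name, meta_path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```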

File System Features

Recommended:

  • Case-sensitive file system (avoids conflicts)
  • Support for Unicode filenames
  • Journaling (protects against corruption)

Disk Space

Typical storage requirements:

  • PDF files: 1-10 MB each
  • Markdown conversions: 10-100 KB each
  • Metadata: ~1-5 KB per paper
  • Database index: ~1-10 KB per paper
  • Assets: Varies (0-50 MB for image-heavy papers)
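
For rough capacity planning, the per-paper ranges above can be summed and scaled by library size. The sketch below does only that arithmetic; it assumes binary units (1 MB treated as 1024 KB), and the function name `estimate_library_kib` is hypothetical.

```python
KIB_PER_MIB = 1024

# (low, high) per-paper sizes in KiB, taken from the ranges listed above.
PER_PAPER_KIB = {
    "pdf": (1 * KIB_PER_MIB, 10 * KIB_PER_MIB),
    "markdown": (10, 100),
    "metadata": (1, 5),
    "index": (1, 10),
    "assets": (0, 50 * KIB_PER_MIB),
}


def estimate_library_kib(paper_count: int) -> tuple[int, int]:
    """Return (low, high) total size estimates in KiB for paper_count papers."""
    low = sum(lo for lo, _ in PER_PAPER_KIB.values())
    high = sum(hi for _, hi in PER_PAPER_KIB.values())
    return paper_count * low, paper_count * high
```

For 1,000 papers this gives roughly 1 GiB at the low end and about 59 GiB at the high end, with the upper bound dominated by assets.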

Migration and Versioning

Schema Evolution

When paperlib updates its storage format:

  • Schema versions are tracked per file (for example, the schema_version field in summary.json)
  • Migration tools handle format upgrades
  • Backward compatibility is maintained when possible

Validation

paperlib provides tools to validate library integrity:

paperlib doctor  # (future command)

This will check:

  • All referenced files exist
  • Metadata format is valid
  • Database is consistent with the files on disk
  • No orphaned or corrupted data
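
Until that command exists, the first two checks can be approximated with a short script. The sketch below is not the planned paperlib doctor: the function name `find_missing_files` is hypothetical, and it only verifies that each meta.json parses and that its pdf_path and paper_md_path entries point to existing files.

```python
import json
from pathlib import Path

# Path-valued keys taken from the meta.json example in this document.
PATH_KEYS = ("pdf_path", "paper_md_path")


def find_missing_files(library_root: Path) -> list[str]:
    """Return human-readable problems found under papers/ (empty if none)."""
    problems = []
    for meta_file in sorted((library_root / "papers").rglob("meta.json")):
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{meta_file}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(meta, dict):
            problems.append(f"{meta_file}: expected a JSON object")
            continue
        for key in PATH_KEYS:
            stored = meta.get(key)
            # Stored paths are relative to the library root.
            if stored and not (library_root / stored).is_file():
                problems.append(f"{meta_file}: {key} points to missing file {stored}")
    return problems
```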