# Storage Layout

This document describes the on-disk structure and organization of a paperlib library.

## Overview

A paperlib library is a directory containing all papers, metadata, configuration, and index data. The layout is designed to be:

- **Human-readable**: Directory structure is intuitive and browsable
- **Stable**: File locations don't change unexpectedly
- **Rebuildable**: Index can be reconstructed from source files
- **Portable**: Entire library can be moved or backed up as a unit

## Directory Structure

```
library_root/
├── config/                    # Library configuration
│   ├── config.toml           # Main configuration file
│   ├── vocab.yaml            # Controlled vocabulary (future)
│   └── prompts/              # AI prompt templates (future)
│       └── summarize_paper.md
├── papers/                   # Paper storage (source of truth)
│   ├── arxiv/               # arXiv papers organized by year
│   │   └── 2022/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json         # Paper metadata
│   │           ├── source.pdf        # Original PDF
│   │           ├── paper.md          # Converted markdown
│   │           ├── summary.json      # AI-generated summary
│   │           ├── summary.md        # Rendered summary
│   │           ├── ref.bib          # Bibliography (future)
│   │           ├── assets/          # Images, figures
│   │           └── logs/            # Processing logs
│   │               └── mineru.log
│   └── local/               # Local PDF imports by hash
│       └── a1b2c3d4e5f67890/
│           └── ... (same structure)
├── inbox/                   # Temporary import staging (future)
├── db/                      # Search index (rebuildable)
│   └── paperlib.sqlite3
└── cache/                   # Processing cache (safe to delete)
```

## Paper Directory Organization

### arXiv Papers

arXiv papers are organized by year and paper ID:

```
papers/arxiv/YEAR/arxiv-NORMALIZED_ID/
```

Where:
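- `YEAR` is derived from the two-digit year prefix of the arXiv ID (e.g., `2212.06340` → `2022`, `0001.12345` → `2000`)
- `NORMALIZED_ID` is the arXiv ID with the dot replaced by an underscore; a version suffix, if present, is kept unchanged
  - `2212.06340` → `2212_06340` (directory `arxiv-2212_06340`)
  - `2212.06340v2` → `2212_06340v2` (directory `arxiv-2212_06340v2`)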

The year is taken from the first two digits (`YY`) of arXiv's `YYMM.NNNNN` identifier format and expanded to four digits:
- Years 00-89 map to 2000-2089
- Years 90-99 map to 1990-1999

**Examples:**
```
papers/arxiv/2022/arxiv-2212_06340/    # 2212.06340 -> year 2022
papers/arxiv/2023/arxiv-2301_12345v1/  # 2301.12345v1 -> year 2023
papers/arxiv/2000/arxiv-0001_98765/    # 0001.98765 -> year 2000
papers/arxiv/1999/arxiv-9912_12345/    # 9912.12345 -> year 1999
```
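
The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not paperlib's actual code: the function name `arxiv_paper_dir` and the regular expression are assumptions, and the sketch accepts 4- or 5-digit sequence numbers and rejects any ID that is not in `YYMM.NNNNN` form.

```python
import re

# Hypothetical helper for illustration; not part of paperlib's documented API.
_ARXIV_ID = re.compile(r"(\d{2})(\d{2})\.(\d{4,5})(v\d+)?")


def arxiv_paper_dir(arxiv_id: str) -> str:
    """Return the library-relative directory for a YYMM.NNNNN arXiv ID."""
    match = _ARXIV_ID.fullmatch(arxiv_id)
    if match is None:
        raise ValueError(f"unsupported arXiv ID: {arxiv_id!r}")
    yy = int(match.group(1))
    # Two-digit years 90-99 map to 1990-1999; 00-89 map to 2000-2089.
    year = 1900 + yy if yy >= 90 else 2000 + yy
    # Only the dot is replaced; a version suffix such as "v2" is kept as-is.
    normalized = arxiv_id.replace(".", "_")
    return f"papers/arxiv/{year}/arxiv-{normalized}"


print(arxiv_paper_dir("2212.06340"))
print(arxiv_paper_dir("9912.12345"))
```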

### Local Papers

Local papers are organized by content hash:

```
papers/local/HASH_PREFIX/
```

Where `HASH_PREFIX` is the first 16 hexadecimal characters of the SHA-256 digest of the PDF file's contents.

**Examples:**
```
papers/local/a1b2c3d4e5f67890/
papers/local/fedcba9876543210/
```
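
As a rough sketch of how such a directory name could be computed, the following reads the PDF in chunks and truncates the hex digest. The function name `local_paper_dir` and the `prefix_len` parameter are hypothetical; only the 16-character SHA-256 prefix comes from the description above.

```python
import hashlib
from pathlib import Path


def local_paper_dir(pdf_path: Path, prefix_len: int = 16) -> str:
    """Return the library-relative directory for a locally imported PDF.

    Hypothetical helper: hashes the file contents with SHA-256 and keeps
    the first `prefix_len` hex characters of the digest.
    """
    digest = hashlib.sha256()
    with pdf_path.open("rb") as f:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return f"papers/local/{digest.hexdigest()[:prefix_len]}"
```

Because the name depends only on the file contents, importing the same PDF from two different locations yields the same directory.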

## File Types

### Required Files

Every paper directory contains:

#### `meta.json`
The canonical metadata file (JSON format):
```json
{
  "paper_id": "arxiv-2212_06340",
  "source_type": "arxiv",
  "source_id": "2212.06340",
  "title": "Example Paper Title",
  "authors": ["Alice Smith", "Bob Jones"],
  "published_date": "2022-12-13T02:46:55",
  "categories": ["cs.AI", "stat.ML"],
  "pdf_path": "papers/arxiv/2022/arxiv-2212_06340/source.pdf",
  "paper_md_path": "papers/arxiv/2022/arxiv-2212_06340/paper.md",
  "imported_at": "2024-01-15T10:30:00",
  "conversion_status": "success",
  "summary_status": "not_requested",
  "tags": ["machine-learning"],
  "notes": "Important paper on neural networks"
}
```

#### `source.pdf`
The original PDF file, exactly as imported.

### Generated Files

These files are created by paperlib processing:

#### `paper.md`
Markdown conversion of the PDF, generated by MinerU or other converters.

#### `summary.json` (optional)
AI-generated structured summary:
```json
{
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces...",
  "problem_statement": "Current methods have limitations...",
  "method_overview": "We propose a novel approach...",
  "main_results": "Experiments show 95% accuracy...",
  "claimed_contributions": ["Novel architecture", "Improved performance"],
  "problem_tags": ["classification", "optimization"],
  "technique_tags": ["neural-networks", "transformers"],
  "entities": ["BERT", "ImageNet", "ResNet"],
  "relevance_to_user": 0.85
}
```

#### `summary.md` (optional)
Human-readable summary rendered from `summary.json`.

### Supporting Directories

#### `assets/`
Contains extracted images, figures, and other media from the PDF conversion process.

#### `logs/`
Processing logs for debugging and audit trails:
- `mineru.log` - PDF conversion logs
- `summary.log` - AI summarization logs (future)

## Index Database

The SQLite database at `db/paperlib.sqlite3` contains:

### Tables

#### `papers`
Main paper index with searchable fields:
- Metadata from all `meta.json` files
- Computed search fields (full-text, author lists, etc.)
- Processing status tracking

#### `papers_fts`
Full-text search virtual table (SQLite FTS5) for content search.

### Rebuilding

The database is **always rebuildable** from the source files:
```bash
paperlib reindex
```

This design ensures the JSON files remain the authoritative source of truth.
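
To illustrate why the index is disposable, here is a simplified sketch of a rebuild: it walks `papers/`, reads every `meta.json`, and repopulates a table from scratch. The function name `rebuild_index` and the three-column schema are assumptions made for the example; the real `papers` table and the `papers_fts` virtual table are not reproduced here.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path) -> int:
    """Recreate a simplified papers table from every meta.json under papers/.

    Hypothetical sketch: the schema below is illustrative only.
    Returns the number of papers indexed.
    """
    db_path = library_root / "db" / "paperlib.sqlite3"
    db_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, authors TEXT)"
            )
            for meta_file in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_file.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT INTO papers VALUES (?, ?, ?)",
                    (
                        meta["paper_id"],
                        meta.get("title"),
                        json.dumps(meta.get("authors", [])),
                    ),
                )
                count += 1
    finally:
        conn.close()
    return count
```

Nothing in the sketch reads from the old database, which is the property that makes deleting `db/` safe.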

## Path Conventions

### Relative Paths
All paths in `meta.json` are relative to the library root:
```json
{
  "pdf_path": "papers/local/a1b2c3d4e5f67890/source.pdf",
  "paper_md_path": "papers/local/a1b2c3d4e5f67890/paper.md"
}
```

### Cross-Platform Compatibility
All paths use forward slashes (`/`) regardless of operating system.
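
A minimal sketch of both conventions using `pathlib`, assuming hypothetical helper names `to_library_path` and `from_library_path`: stored paths are made relative to the library root and serialized with forward slashes, then resolved back against the root when read.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def to_library_path(library_root: PurePath, absolute: PurePath) -> str:
    """Convert an absolute path inside the library to the stored form.

    Hypothetical helper: raises ValueError if `absolute` is outside the root.
    """
    return absolute.relative_to(library_root).as_posix()


def from_library_path(library_root: PurePath, stored: str) -> PurePath:
    """Resolve a stored forward-slash path against the library root."""
    return library_root.joinpath(*stored.split("/"))


# The stored form is the same on Windows and POSIX systems.
print(to_library_path(
    PureWindowsPath(r"C:\lib"),
    PureWindowsPath(r"C:\lib\papers\local\a1b2c3d4e5f67890\source.pdf"),
))
print(to_library_path(
    PurePosixPath("/lib"),
    PurePosixPath("/lib/papers/local/a1b2c3d4e5f67890/source.pdf"),
))
```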

## Backup and Portability

### What to Backup
For complete library backup, include:
- `config/` directory (configuration)
- `papers/` directory (source of truth)

### What NOT to Backup
These can be regenerated:
- `db/` directory (rebuildable index)
- `cache/` directory (temporary files)
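
The split between the two lists can be expressed as a small backup script. This is only a sketch under stated assumptions: the function name `backup_library` and the choice of a gzip-compressed tar archive are illustrative, not something paperlib provides.

```python
import tarfile
from pathlib import Path

# Only these directories hold data that cannot be regenerated;
# db/ and cache/ are deliberately left out.
BACKUP_DIRS = ("config", "papers")


def backup_library(library_root: Path, archive_path: Path) -> list[str]:
    """Write config/ and papers/ to a .tar.gz archive (hypothetical helper).

    Returns the names of the top-level directories that were included.
    """
    included = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in BACKUP_DIRS:
            source = library_root / name
            if source.is_dir():
                # arcname keeps paths in the archive relative to the library root.
                tar.add(source, arcname=name)
                included.append(name)
    return included
```

After restoring such an archive, running `paperlib reindex` recreates the database.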

### Moving Libraries
To move a library:
1. Copy the entire directory structure
2. Run `paperlib reindex` to rebuild the database
3. Update any absolute paths in configuration

## Storage Efficiency

### Deduplication
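Because directory names are derived from paper identity, importing the same paper twice resolves to the same directory:
- arXiv papers by normalized arXiv ID (since the version suffix is part of the normalized ID, `v1` and `v2` of a paper occupy separate directories)
- Local papers by SHA256 content hash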

### Large Files
For papers with large asset directories:
- Assets are stored alongside papers for locality
- Consider using file system compression or deduplication if needed

## File System Requirements

### Permissions
paperlib requires:
- Read/write access to library directory
- Ability to create subdirectories
- Atomic file operations for metadata updates
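
One common way to get atomic metadata updates is to write a temporary file in the same directory and rename it over the target. The sketch below shows that pattern; the function name `write_meta_atomic` is hypothetical and this is not necessarily how paperlib itself writes `meta.json`.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_meta_atomic(meta_path: Path, meta: dict) -> None:
    """Replace meta_path with new content without exposing a partial file."""
    # The temp file must live in the same directory (same file system)
    # for the final rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=meta_path.parent, prefix=".meta-", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(meta, f, indent=2, ensure_ascii=False)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, meta_path)  # atomic swap into place
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

Readers see either the old file or the new one, never a half-written `meta.json`.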

### File System Features
Recommended:
- Case-sensitive file system (avoids collisions between names that differ only in letter case)
- Support for Unicode filenames
- Journaling (protects against corruption)

### Disk Space
Typical storage requirements:
- PDF files: 1-10 MB each
- Markdown conversions: 10-100 KB each
- Metadata: ~1-5 KB per paper
- Database index: ~1-10 KB per paper
- Assets: Varies (0-50 MB for image-heavy papers)

## Migration and Versioning

### Schema Evolution
When paperlib updates its storage format:
- Metadata schema versions are tracked in each file
- Migration tools handle format upgrades
- Backward compatibility is maintained when possible

### Validation
paperlib provides tools to validate library integrity:
```bash
paperlib doctor  # (future command)
```

This will check:
- All referenced files exist
- Metadata format is valid
- Database consistency with files
- No orphaned or corrupted data
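
Until such a command exists, the first two checks can be approximated with a short script. The sketch below is an assumption-laden illustration: `find_missing_files` is a hypothetical name, and it only inspects the `pdf_path` and `paper_md_path` fields shown in the `meta.json` example.

```python
import json
from pathlib import Path


def find_missing_files(library_root: Path) -> list[str]:
    """Report unreadable meta.json files and paths that point nowhere."""
    problems = []
    for meta_file in sorted((library_root / "papers").rglob("meta.json")):
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{meta_file}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(meta, dict):
            problems.append(f"{meta_file}: top-level value is not an object")
            continue
        # Stored paths are relative to the library root.
        for key in ("pdf_path", "paper_md_path"):
            rel = meta.get(key)
            if rel and not (library_root / rel).is_file():
                problems.append(f"{meta_file}: {key} points to missing file {rel}")
    return problems
```

An empty list means every checked reference resolved; it says nothing about database consistency or orphaned data.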