wyj/paperlib

wyj 174801242d fix: arxiv year

2026-04-17 17:03:59 -04:00

Storage Layout

This document describes the on-disk structure and organization of a paperlib library.

Overview

A paperlib library is a directory containing all papers, metadata, configuration, and index data. The layout is designed to be:

Human-readable: Directory structure is intuitive and browsable
Stable: File locations don't change unexpectedly
Rebuildable: Index can be reconstructed from source files
Portable: Entire library can be moved or backed up as a unit

Directory Structure

library_root/
├── config/                    # Library configuration
│   ├── config.toml           # Main configuration file
│   ├── vocab.yaml            # Controlled vocabulary (future)
│   └── prompts/              # AI prompt templates (future)
│       └── summarize_paper.md
├── papers/                   # Paper storage (source of truth)
│   ├── arxiv/               # arXiv papers organized by year
│   │       └── 2022/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json         # Paper metadata
│   │           ├── source.pdf        # Original PDF
│   │           ├── paper.md          # Converted markdown
│   │           ├── summary.json      # AI-generated summary
│   │           ├── summary.md        # Rendered summary
│   │           ├── ref.bib          # Bibliography (future)
│   │           ├── assets/          # Images, figures
│   │           └── logs/            # Processing logs
│   │               └── mineru.log
│   └── local/               # Local PDF imports by hash
│       └── a1b2c3d4e5f67890/
│           └── ... (same structure)
├── inbox/                   # Temporary import staging (future)
├── db/                      # Search index (rebuildable)
│   └── paperlib.sqlite3
└── cache/                   # Processing cache (safe to delete)

Paper Directory Organization

arXiv Papers

arXiv papers are organized by year and paper ID:

papers/arxiv/YEAR/arxiv-NORMALIZED_ID/

Where:

YEAR is derived from the first two digits (YY) of the arXiv ID (e.g., 2212.06340 → 2022, 0001.12345 → 2000)
NORMALIZED_ID replaces the dot with an underscore and keeps any version suffix unchanged
- 2212.06340 → arxiv-2212_06340
- 2212.06340v2 → arxiv-2212_06340v2

The year extraction follows arXiv's YYMM.NNNNN format:

YY values 00-89 map to 2000-2089
YY values 90-99 map to 1990-1999

Examples:

papers/arxiv/2022/arxiv-2212_06340/    # 2212.06340 -> year 2022
papers/arxiv/2023/arxiv-2301_12345v1/  # 2301.12345v1 -> year 2023  
papers/arxiv/2000/arxiv-0001_98765/    # 0001.98765 -> year 2000
papers/arxiv/1999/arxiv-9912_12345/    # 9912.12345 -> year 1999
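
The mapping from an arXiv ID to its directory can be sketched in a few lines of Python. This only illustrates the rules above and is not paperlib's actual code: the function name `arxiv_paper_dir` and the regular expression are assumptions made for the example.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern for YYMM.NNNNN IDs with an optional version suffix.
_ARXIV_ID = re.compile(r"^(\d{2})\d{2}\.\d{4,5}(v\d+)?$")


def arxiv_paper_dir(arxiv_id: str) -> PurePosixPath:
    """Return the library-relative directory for an arXiv ID (illustrative)."""
    match = _ARXIV_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"unsupported arXiv ID: {arxiv_id!r}")
    yy = int(match.group(1))
    # YY 90-99 -> 1990-1999, YY 00-89 -> 2000-2089
    year = 1900 + yy if yy >= 90 else 2000 + yy
    # Replace the dot with an underscore; the version suffix is kept as-is.
    normalized = "arxiv-" + arxiv_id.replace(".", "_")
    return PurePosixPath("papers", "arxiv", str(year), normalized)


print(arxiv_paper_dir("2212.06340"))    # papers/arxiv/2022/arxiv-2212_06340
print(arxiv_paper_dir("9912.12345"))    # papers/arxiv/1999/arxiv-9912_12345
```

Using `PurePosixPath` keeps the result in forward-slash form on every operating system.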

Local Papers

Local papers are organized by content hash:

papers/local/HASH_PREFIX/

Where HASH_PREFIX is the first 16 characters of the SHA256 hash of the PDF file.

Examples:

papers/local/a1b2c3d4e5f67890/
papers/local/fedcba9876543210/
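
As a rough sketch of how such a directory name could be computed, the following hashes the PDF in chunks and keeps the first 16 hexadecimal characters. The helper name `local_paper_dir` and the chunk size are assumptions for illustration, not part of paperlib.

```python
import hashlib
from pathlib import Path, PurePosixPath


def local_paper_dir(pdf_path: Path, prefix_len: int = 16) -> PurePosixPath:
    """Return papers/local/HASH_PREFIX for a PDF file (illustrative)."""
    digest = hashlib.sha256()
    with open(pdf_path, "rb") as handle:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return PurePosixPath("papers", "local", digest.hexdigest()[:prefix_len])
```

Because the name depends only on file content, importing the same PDF twice resolves to the same directory.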

File Types

Required Files

Every paper directory contains:

`meta.json`

The canonical metadata file (JSON format):

{
  "paper_id": "arxiv-2212_06340",
  "source_type": "arxiv",
  "source_id": "2212.06340",
  "title": "Example Paper Title",
  "authors": ["Alice Smith", "Bob Jones"],
  "published_date": "2022-12-13T02:46:55",
  "categories": ["cs.AI", "stat.ML"],
  "pdf_path": "papers/arxiv/2022/arxiv-2212_06340/source.pdf",
  "paper_md_path": "papers/arxiv/2022/arxiv-2212_06340/paper.md",
  "imported_at": "2024-01-15T10:30:00",
  "conversion_status": "success",
  "summary_status": "not_requested",
  "tags": ["machine-learning"],
  "notes": "Important paper on neural networks"
}

`source.pdf`

The original PDF file, exactly as imported.

Generated Files

These files are created by paperlib processing:

`paper.md`

Markdown conversion of the PDF, generated by MinerU or other converters.

`summary.json` (optional)

AI-generated structured summary:

{
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces...",
  "problem_statement": "Current methods have limitations...",
  "method_overview": "We propose a novel approach...",
  "main_results": "Experiments show 95% accuracy...",
  "claimed_contributions": ["Novel architecture", "Improved performance"],
  "problem_tags": ["classification", "optimization"],
  "technique_tags": ["neural-networks", "transformers"],
  "entities": ["BERT", "ImageNet", "ResNet"],
  "relevance_to_user": 0.85
}

`summary.md` (optional)

Human-readable summary rendered from summary.json.

Supporting Directories

`assets/`

Contains extracted images, figures, and other media from the PDF conversion process.

`logs/`

Processing logs for debugging and audit trails:

mineru.log - PDF conversion logs
summary.log - AI summarization logs (future)

Index Database

The SQLite database at db/paperlib.sqlite3 contains:

Tables

`papers`

Main paper index with searchable fields:

Metadata from all meta.json files
Computed search fields (full-text, author lists, etc.)
Processing status tracking

`papers_fts`

Full-text search virtual table (SQLite FTS5) for content search.

Rebuilding

The database is always rebuildable from the source files:

paperlib reindex

This design ensures the JSON files remain the authoritative source of truth.
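
To make the idea concrete, here is a minimal sketch of rebuilding an index by walking the `meta.json` files. It is not the implementation of `paperlib reindex`: the function name `rebuild_index` and the simplified four-column `papers` table are assumptions, and the FTS5 table is left out.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path) -> int:
    """Recreate a simplified papers table from every meta.json; return the row count."""
    db_dir = library_root / "db"
    db_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(db_dir / "paperlib.sqlite3"))
    count = 0
    try:
        with conn:  # commits pending inserts when the block exits cleanly
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, authors TEXT, meta_path TEXT)"
            )
            for meta_file in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_file.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT INTO papers VALUES (?, ?, ?, ?)",
                    (
                        meta["paper_id"],
                        meta.get("title", ""),
                        "; ".join(meta.get("authors", [])),
                        meta_file.relative_to(library_root).as_posix(),
                    ),
                )
                count += 1
    finally:
        conn.close()
    return count
```

Since the table is dropped and refilled on each run, deleting `db/` beforehand and running the rebuild give the same result.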

Path Conventions

Relative Paths

All paths in meta.json are relative to the library root:

{
  "pdf_path": "papers/local/a1b2c3d4e5f67890/source.pdf",
  "paper_md_path": "papers/local/a1b2c3d4e5f67890/paper.md"
}

Cross-Platform Compatibility

All paths use forward slashes (/) regardless of operating system.
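
One way to honor both conventions in Python is to convert at the boundary: store paths with `as_posix()` and resolve them against the library root when reading. The helper names below are hypothetical and only illustrate the convention.

```python
from pathlib import Path, PurePosixPath


def to_stored_path(library_root: Path, absolute: Path) -> str:
    """Convert a path inside the library to its stored, forward-slash form."""
    return absolute.relative_to(library_root).as_posix()


def from_stored_path(library_root: Path, stored: str) -> Path:
    """Resolve a stored relative path against the library root."""
    relative = PurePosixPath(stored)
    if relative.is_absolute():
        raise ValueError(f"stored paths must be relative: {stored!r}")
    # Joining the individual parts lets pathlib pick the native separator.
    return library_root.joinpath(*relative.parts)
```

`relative_to` raises `ValueError` when the file lies outside the library root, which helps catch paths that would break portability.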

Backup and Portability

What to Back Up

For a complete library backup, include:

config/ directory (configuration)
papers/ directory (source of truth)

What NOT to Back Up

These can be regenerated:

db/ directory (rebuildable index)
cache/ directory (temporary files)
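
A backup script following this split might look like the sketch below, which copies only `config/` and `papers/`. The function name `backup_library` is an assumption; paperlib itself does not necessarily ship such a helper.

```python
import shutil
from pathlib import Path

# Directories that hold the source of truth; db/ and cache/ are skipped.
BACKUP_DIRS = ("config", "papers")


def backup_library(library_root: Path, destination: Path) -> None:
    """Copy the non-regenerable parts of a library to destination."""
    for name in BACKUP_DIRS:
        source = library_root / name
        if source.is_dir():
            # dirs_exist_ok (Python 3.8+) allows refreshing an existing backup.
            shutil.copytree(source, destination / name, dirs_exist_ok=True)
```

After restoring such a backup, the index would be rebuilt with `paperlib reindex`.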

Moving Libraries

To move a library:

Copy the entire directory structure
Run paperlib reindex to rebuild the database
Update any absolute paths in configuration

Storage Efficiency

Deduplication

Papers are naturally deduplicated:

arXiv papers by normalized arXiv ID
Local papers by SHA256 content hash

Large Files

For papers with large asset directories:

Assets are stored alongside papers for locality
Consider using file system compression or deduplication if needed

File System Requirements

Permissions

paperlib requires:

Read/write access to library directory
Ability to create subdirectories
Atomic file operations for metadata updates
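
The atomic-update requirement is commonly met by writing to a temporary file in the same directory and renaming it over the target. The sketch below shows that pattern; whether paperlib does exactly this is an assumption, and `write_json_atomic` is a hypothetical name.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    # The temp file must live in the same directory (same file system) as the target.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2, ensure_ascii=False)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, str(path))  # replaces the target in a single step
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

This is why the library needs permission to create files in each paper directory, not only to modify existing ones.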

File System Features

Recommended:

Case-sensitive file system (avoids conflicts)
Support for Unicode filenames
Journaling (protects against corruption)

Disk Space

Typical storage requirements:

PDF files: 1-10 MB each
Markdown conversions: 10-100 KB each
Metadata: ~1-5 KB per paper
Database index: ~1-10 KB per paper
Assets: Varies (0-50 MB for image-heavy papers)

Migration and Versioning

Schema Evolution

When paperlib updates its storage format:

Metadata schema versions are tracked in each file
Migration tools handle format upgrades
Backward compatibility is maintained when possible

Validation

paperlib plans to provide a command to validate library integrity:

paperlib doctor  # (future command)

This will check:

All referenced files exist
Metadata format is valid
Database consistency with files
No orphaned or corrupted data
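
Until that command exists, the first two checks can be approximated with a short script. The following is only a sketch under the layout described in this document; `check_library` is a hypothetical name and does not cover database consistency or orphaned data.

```python
import json
from pathlib import Path, PurePosixPath
from typing import List

# Keys in meta.json that hold library-relative file paths.
PATH_KEYS = ("pdf_path", "paper_md_path")


def check_library(library_root: Path) -> List[str]:
    """Return human-readable problems found in meta.json files."""
    problems = []
    for meta_file in sorted((library_root / "papers").rglob("meta.json")):
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:
            problems.append(f"{meta_file}: unreadable metadata ({exc})")
            continue
        if not isinstance(meta, dict):
            problems.append(f"{meta_file}: metadata is not a JSON object")
            continue
        for key in PATH_KEYS:
            stored = meta.get(key)
            if not stored:
                continue  # field absent or empty, e.g. no conversion yet
            target = library_root.joinpath(*PurePosixPath(stored).parts)
            if not target.is_file():
                problems.append(f"{meta_file}: {key} points to missing file {stored}")
    return problems
```

An empty list means every referenced file was found and every `meta.json` parsed as a JSON object.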
