# Storage Layout

This document describes the on-disk structure and organization of a paperlib library.

## Overview

A paperlib library is a directory containing all papers, metadata, configuration, and index data. The layout is designed to be:

- **Human-readable**: Directory structure is intuitive and browsable
- **Stable**: File locations don't change unexpectedly
- **Rebuildable**: Index can be reconstructed from source files
- **Portable**: Entire library can be moved or backed up as a unit

## Directory Structure

```
library_root/
├── config/                          # Library configuration
│   ├── config.toml                  # Main configuration file
│   ├── vocab.yaml                   # Controlled vocabulary (future)
│   └── prompts/                     # AI prompt templates (future)
│       └── summarize_paper.md
├── papers/                          # Paper storage (source of truth)
│   ├── arxiv/                       # arXiv papers organized by year
│   │   └── 2022/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json        # Paper metadata
│   │           ├── source.pdf       # Original PDF
│   │           ├── paper.md         # Converted markdown
│   │           ├── summary.json     # AI-generated summary
│   │           ├── summary.md       # Rendered summary
│   │           ├── ref.bib          # Bibliography (future)
│   │           ├── assets/          # Images, figures
│   │           └── logs/            # Processing logs
│   │               └── mineru.log
│   └── local/                       # Local PDF imports by hash
│       └── a1b2c3d4e5f67890/
│           └── ... (same structure)
├── inbox/                           # Temporary import staging (future)
├── db/                              # Search index (rebuildable)
│   └── paperlib.sqlite3
└── cache/                           # Processing cache (safe to delete)
```

## Paper Directory Organization

### arXiv Papers

arXiv papers are organized by year and paper ID:

```
papers/arxiv/YEAR/arxiv-NORMALIZED_ID/
```

Where:
- `YEAR` is derived from the first two digits of the arXiv ID (e.g., `2212.06340` → `2022`, `0001.12345` → `2000`)
- `NORMALIZED_ID` is the arXiv ID with dots replaced by underscores; a version suffix, if present, is kept. The resulting directory names look like:
  - `2212.06340` → `arxiv-2212_06340`
  - `2212.06340v2` → `arxiv-2212_06340v2`

The year is expanded from the two-digit `YY` prefix of arXiv's `YYMM.NNNNN` format:
- `YY` values 00-89 map to 2000-2089
- `YY` values 90-99 map to 1990-1999

**Examples:**
```
papers/arxiv/2022/arxiv-2212_06340/      # 2212.06340 -> year 2022
papers/arxiv/2023/arxiv-2301_12345v1/    # 2301.12345v1 -> year 2023
papers/arxiv/2000/arxiv-0001_98765/      # 0001.98765 -> year 2000
papers/arxiv/1999/arxiv-9912_12345/      # 9912.12345 -> year 1999
```

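The mapping from an arXiv ID to its directory can be sketched in a few lines of Python. This is a minimal illustration of the rules stated above, not paperlib's actual code: the function name `arxiv_paper_dir` and the regular expression are assumptions, and the pattern accepts only `YYMM.NNNN(N)` identifiers with an optional `vN` suffix.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern: YYMM, a dot, 4-5 digits, optional version suffix.
_ARXIV_ID = re.compile(r"(\d{2})(\d{2})\.(\d{4,5})(v\d+)?")


def arxiv_paper_dir(arxiv_id: str) -> PurePosixPath:
    """Return the library-relative directory for an arXiv ID."""
    match = _ARXIV_ID.fullmatch(arxiv_id)
    if match is None:
        raise ValueError(f"unsupported arXiv ID: {arxiv_id!r}")
    yy = int(match.group(1))
    # 00-89 -> 2000-2089, 90-99 -> 1990-1999
    year = 2000 + yy if yy <= 89 else 1900 + yy
    # Dots become underscores; the version suffix is kept unchanged.
    normalized = arxiv_id.replace(".", "_")
    return PurePosixPath("papers", "arxiv", str(year), f"arxiv-{normalized}")


print(arxiv_paper_dir("2212.06340v2"))  # papers/arxiv/2022/arxiv-2212_06340v2
```

Using `PurePosixPath` keeps the result in forward-slash form on every platform, which matches the path convention used in `meta.json`.
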
### Local Papers

Local papers are organized by content hash:

```
papers/local/HASH_PREFIX/
```

Where `HASH_PREFIX` is the first 16 characters of the SHA256 hash of the PDF file.

**Examples:**
```
papers/local/a1b2c3d4e5f67890/
papers/local/fedcba9876543210/
```

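As an illustration of the hash-prefix rule above, the directory for a local PDF could be computed as follows. The helper name `local_paper_dir` and the chunked read are assumptions made for this sketch; only the "first 16 hex characters of SHA256" rule comes from this document.

```python
import hashlib
from pathlib import Path, PurePosixPath


def local_paper_dir(pdf_path: Path, prefix_len: int = 16) -> PurePosixPath:
    """Return the library-relative directory for a local PDF (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(pdf_path, "rb") as handle:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return PurePosixPath("papers", "local", digest.hexdigest()[:prefix_len])
```

Because the name depends only on file content, importing the same PDF twice yields the same directory regardless of the original filename.
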
## File Types

### Required Files

Every paper directory contains:

#### `meta.json`
The canonical metadata file (JSON format):
```json
{
  "paper_id": "arxiv-2212_06340",
  "source_type": "arxiv",
  "source_id": "2212.06340",
  "title": "Example Paper Title",
  "authors": ["Alice Smith", "Bob Jones"],
  "published_date": "2022-12-13T02:46:55",
  "categories": ["cs.AI", "stat.ML"],
  "pdf_path": "papers/arxiv/2022/arxiv-2212_06340/source.pdf",
  "paper_md_path": "papers/arxiv/2022/arxiv-2212_06340/paper.md",
  "imported_at": "2024-01-15T10:30:00",
  "conversion_status": "success",
  "summary_status": "not_requested",
  "tags": ["machine-learning"],
  "notes": "Important paper on neural networks"
}
```

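A consumer of this file might load it and check for the fields it depends on. The sketch below is illustrative only: `load_meta` is a hypothetical helper, and the set of required keys is an assumption drawn from the example above rather than a documented schema.

```python
import json
from pathlib import Path

# Assumed minimal key set, taken from the example above; the real schema
# may require more or fewer fields.
REQUIRED_KEYS = {"paper_id", "source_type", "title", "pdf_path"}


def load_meta(paper_dir: Path) -> dict:
    """Load meta.json from a paper directory and check the assumed keys."""
    with open(paper_dir / "meta.json", encoding="utf-8") as handle:
        meta = json.load(handle)
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        raise ValueError(f"meta.json is missing keys: {sorted(missing)}")
    return meta
```
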
#### `source.pdf`
The original PDF file, exactly as imported.

### Generated Files

These files are created by paperlib processing:

#### `paper.md`
Markdown conversion of the PDF, generated by MinerU or other converters.

#### `summary.json` (optional)
AI-generated structured summary:
```json
{
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces...",
  "problem_statement": "Current methods have limitations...",
  "method_overview": "We propose a novel approach...",
  "main_results": "Experiments show 95% accuracy...",
  "claimed_contributions": ["Novel architecture", "Improved performance"],
  "problem_tags": ["classification", "optimization"],
  "technique_tags": ["neural-networks", "transformers"],
  "entities": ["BERT", "ImageNet", "ResNet"],
  "relevance_to_user": 0.85
}
```

#### `summary.md` (optional)
Human-readable summary rendered from `summary.json`.

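To make the relationship between the two summary files concrete, here is one way `summary.md` could be rendered from `summary.json`. The section headings and layout are invented for this sketch, since the actual rendered format is not specified here; only the JSON field names come from the example above.

```python
def render_summary_md(summary: dict) -> str:
    """Render a summary.json-style dict as Markdown (hypothetical layout)."""
    lines = ["# Summary", "", summary["one_sentence_summary"], ""]
    sections = [
        ("Problem", "problem_statement"),
        ("Method", "method_overview"),
        ("Results", "main_results"),
    ]
    for heading, key in sections:
        # Skip sections whose field is absent or empty.
        if summary.get(key):
            lines += [f"## {heading}", "", summary[key], ""]
    contributions = summary.get("claimed_contributions", [])
    if contributions:
        lines += ["## Claimed Contributions", ""]
        lines += [f"- {item}" for item in contributions]
        lines.append("")
    return "\n".join(lines)
```

Keeping the renderer a pure function of the JSON means `summary.md` can be regenerated at any time without calling the summarizer again.
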
### Supporting Directories

#### `assets/`
Contains extracted images, figures, and other media from the PDF conversion process.

#### `logs/`
Processing logs for debugging and audit trails:
- `mineru.log` - PDF conversion logs
- `summary.log` - AI summarization logs (future)

## Index Database

The SQLite database at `db/paperlib.sqlite3` contains:

### Tables

#### `papers`
Main paper index with searchable fields:
- Metadata from all `meta.json` files
- Computed search fields (full-text, author lists, etc.)
- Processing status tracking

#### `papers_fts`
Full-text search virtual table (SQLite FTS5) for content search.

### Rebuilding

The database is **always rebuildable** from the source files:
```bash
paperlib reindex
```

This design ensures the JSON files remain the authoritative source of truth.

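The rebuild idea can be illustrated with a simplified sketch that drops and repopulates a `papers` table from every `meta.json` under `papers/`. This is not the implementation behind `paperlib reindex`: the function name, the four-column schema, and the omission of the `papers_fts` table are all simplifying assumptions.

```python
import json
import sqlite3
from pathlib import Path


def reindex(library_root: Path) -> int:
    """Rebuild a simplified papers table from every meta.json file."""
    db_dir = library_root / "db"
    db_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_dir / "paperlib.sqlite3")
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, authors TEXT, pdf_path TEXT)"
            )
            count = 0
            for meta_file in sorted(library_root.glob("papers/**/meta.json")):
                meta = json.loads(meta_file.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT INTO papers VALUES (?, ?, ?, ?)",
                    (
                        meta["paper_id"],
                        meta.get("title"),
                        ", ".join(meta.get("authors", [])),
                        meta.get("pdf_path"),
                    ),
                )
                count += 1
    finally:
        conn.close()
    return count
```

Because the table is recreated from scratch on each run, deleting `db/` loses nothing that the JSON files do not already hold.
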
## Path Conventions

### Relative Paths
All paths in `meta.json` are relative to the library root:
```json
{
  "pdf_path": "papers/local/a1b2c3d4e5f67890/source.pdf",
  "paper_md_path": "papers/local/a1b2c3d4e5f67890/paper.md"
}
```

### Cross-Platform Compatibility
All paths use forward slashes (`/`) regardless of operating system.

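Both conventions can be handled with `pathlib`. The two helpers below are hypothetical names introduced for illustration; they show one way to convert between a file on disk and the forward-slash, root-relative string stored in `meta.json`.

```python
from pathlib import Path, PurePosixPath


def to_library_path(library_root: Path, file_path: Path) -> str:
    """Convert a file inside the library to the form stored in meta.json."""
    # relative_to raises ValueError if file_path is outside library_root.
    return file_path.relative_to(library_root).as_posix()


def from_library_path(library_root: Path, stored: str) -> Path:
    """Resolve a stored forward-slash path against the library root."""
    return library_root.joinpath(*PurePosixPath(stored).parts)
```

Splitting the stored string with `PurePosixPath` and rejoining it with the native `Path` lets the same `meta.json` work on Windows and POSIX systems.
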
## Backup and Portability

### What to Back Up
For a complete library backup, include:
- `config/` directory (configuration)
- `papers/` directory (source of truth)

### What NOT to Back Up
These can be regenerated:
- `db/` directory (rebuildable index)
- `cache/` directory (temporary files)

### Moving Libraries
To move a library:
1. Copy the entire directory structure
2. Run `paperlib reindex` to rebuild the database
3. Update any absolute paths in configuration

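The backup guidance above can be sketched as a small script that archives only `config/` and `papers/`. The function name `backup_library` and the choice of a gzip-compressed tar archive are assumptions made for illustration; any tool that copies those two directories serves the same purpose.

```python
import tarfile
from pathlib import Path

BACKUP_DIRS = ("config", "papers")  # db/ and cache/ can be regenerated


def backup_library(library_root: Path, archive_path: Path) -> list[str]:
    """Write config/ and papers/ to a gzip-compressed tar archive."""
    included = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in BACKUP_DIRS:
            source = library_root / name
            if source.is_dir():
                # arcname keeps archive paths relative to the library root.
                archive.add(source, arcname=name)
                included.append(name)
    return included
```

After restoring such an archive, running `paperlib reindex` recreates the `db/` directory that was left out.
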
## Storage Efficiency

### Deduplication
Papers are naturally deduplicated:
- arXiv papers by normalized arXiv ID
- Local papers by SHA256 content hash

### Large Files
For papers with large asset directories:
- Assets are stored alongside papers for locality
- Consider using file system compression or deduplication if needed

## File System Requirements

### Permissions
paperlib requires:
- Read/write access to library directory
- Ability to create subdirectories
- Atomic file operations for metadata updates

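A common way to get atomic metadata updates is to write to a temporary file in the same directory and then rename it over the target. The sketch below shows that general pattern with a hypothetical `write_meta_atomically` helper; whether paperlib uses exactly this approach is not stated in this document.

```python
import json
import os
import tempfile
from pathlib import Path


def write_meta_atomically(paper_dir: Path, meta: dict) -> None:
    """Write meta.json so readers never observe a partially written file."""
    # The temporary file lives in the same directory so that os.replace
    # stays on one file system, which is what makes the rename atomic.
    fd, tmp_name = tempfile.mkstemp(dir=paper_dir, prefix="meta.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(meta, handle, indent=2, ensure_ascii=False)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, paper_dir / "meta.json")
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```
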
### File System Features
Recommended:
- Case-sensitive file system (avoids conflicts)
- Support for Unicode filenames
- Journaling (protects against corruption)

### Disk Space
Typical storage requirements:
- PDF files: 1-10 MB each
- Markdown conversions: 10-100 KB each
- Metadata: ~1-5 KB per paper
- Database index: ~1-10 KB per paper
- Assets: Varies (0-50 MB for image-heavy papers)

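For rough capacity planning, the per-paper ranges above can be summed and scaled by library size. The sketch below does only that arithmetic; the function name is hypothetical and it assumes 1 MB = 1024 KB, which the list does not specify.

```python
# Per-paper size ranges in KB, copied from the list above (1 MB = 1024 KB assumed).
SIZE_RANGES_KB = {
    "pdf": (1 * 1024, 10 * 1024),
    "markdown": (10, 100),
    "metadata": (1, 5),
    "index": (1, 10),
    "assets": (0, 50 * 1024),
}


def estimate_library_size_gb(paper_count: int) -> tuple[float, float]:
    """Return (low, high) size estimates in GB for a library of paper_count papers."""
    low_kb = sum(low for low, _ in SIZE_RANGES_KB.values()) * paper_count
    high_kb = sum(high for _, high in SIZE_RANGES_KB.values()) * paper_count
    kb_per_gb = 1024 * 1024
    return low_kb / kb_per_gb, high_kb / kb_per_gb
```

For 1,000 papers this gives roughly 1 GB at the low end and about 59 GB at the high end, with the spread dominated by PDFs and assets.
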
## Migration and Versioning

### Schema Evolution
When paperlib updates its storage format:
- Metadata schema versions are tracked in each file
- Migration tools handle format upgrades
- Backward compatibility is maintained when possible

### Validation
paperlib will provide a command to validate library integrity:
```bash
paperlib doctor  # (future command)
```

This will check:
- All referenced files exist
- Metadata format is valid
- Database consistency with files
- No orphaned or corrupted data
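
Since `paperlib doctor` is a future command, the following is only a sketch of what the first two checks could look like. The function name `check_library` and the message format are invented for illustration, and the database and orphan checks are left out.

```python
import json
from pathlib import Path


def check_library(library_root: Path) -> list[str]:
    """Return a list of problems found; an empty list means no issues."""
    problems = []
    for meta_file in sorted(library_root.glob("papers/**/meta.json")):
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{meta_file}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(meta, dict):
            problems.append(f"{meta_file}: top-level value is not an object")
            continue
        # Stored paths are relative to the library root and use forward slashes.
        for key in ("pdf_path", "paper_md_path"):
            stored = meta.get(key)
            if stored and not (library_root / stored).is_file():
                problems.append(f"{meta_file}: {key} points to missing file {stored}")
    return problems
```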