docs: add docs

2026-04-17 16:54:30 -04:00
parent 74d140e5f8
commit 432010f431
10 changed files with 1682 additions and 19 deletions
@@ -0,0 +1,259 @@
# Storage Layout
This document describes the on-disk structure and organization of a paperlib library.
## Overview
A paperlib library is a directory containing all papers, metadata, configuration, and index data. The layout is designed to be:
- **Human-readable**: Directory structure is intuitive and browsable
- **Stable**: File locations don't change unexpectedly
- **Rebuildable**: Index can be reconstructed from source files
- **Portable**: Entire library can be moved or backed up as a unit
## Directory Structure
```
library_root/
├── config/                          # Library configuration
│   ├── config.toml                  # Main configuration file
│   ├── vocab.yaml                   # Controlled vocabulary (future)
│   └── prompts/                     # AI prompt templates (future)
│       └── summarize_paper.md
├── papers/                          # Paper storage (source of truth)
│   ├── arxiv/                       # arXiv papers organized by year
│   │   └── 2022/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json        # Paper metadata
│   │           ├── source.pdf       # Original PDF
│   │           ├── paper.md         # Converted markdown
│   │           ├── summary.json     # AI-generated summary
│   │           ├── summary.md       # Rendered summary
│   │           ├── ref.bib          # Bibliography (future)
│   │           ├── assets/          # Images, figures
│   │           └── logs/            # Processing logs
│   │               └── mineru.log
│   └── local/                       # Local PDF imports by hash
│       └── a1b2c3d4e5f67890/
│           └── ... (same structure)
├── inbox/                           # Temporary import staging (future)
├── db/                              # Search index (rebuildable)
│   └── paperlib.sqlite3
└── cache/                           # Processing cache (safe to delete)
```
## Paper Directory Organization
### arXiv Papers
arXiv papers are organized by year and paper ID:
```
papers/arxiv/YEAR/arxiv-NORMALIZED_ID/
```
Where:
- `YEAR` is derived from the arXiv ID (e.g., `2212.06340` → `2022`)
- `NORMALIZED_ID` is the arXiv ID with dots replaced by underscores (a version suffix such as `v2` is kept), so the resulting directory names are:
  - `2212.06340` → `arxiv-2212_06340`
  - `2212.06340v2` → `arxiv-2212_06340v2`
**Examples:**
```
papers/arxiv/2022/arxiv-2212_06340/
papers/arxiv/2023/arxiv-2301_12345v1/
papers/arxiv/2024/arxiv-2405_98765/
```
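The mapping from an arXiv ID to its directory can be sketched as follows. This is a minimal illustration, not paperlib's actual code: the helper name `arxiv_paper_dir` is hypothetical, and the sketch assumes new-style arXiv IDs (`YYMM.NNNNN`, optionally followed by `vN`) with the two-digit year read as 20YY.

```python
import re
from pathlib import PurePosixPath

# New-style arXiv IDs: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
_ARXIV_ID = re.compile(r"(\d{2})(\d{2})\.(\d{4,5})(v\d+)?")


def arxiv_paper_dir(arxiv_id: str) -> PurePosixPath:
    """Return the library-relative directory for an arXiv paper (hypothetical helper)."""
    match = _ARXIV_ID.fullmatch(arxiv_id)
    if match is None:
        raise ValueError(f"unsupported arXiv ID: {arxiv_id!r}")
    year = 2000 + int(match.group(1))  # assumption: "22" means 2022
    normalized = "arxiv-" + arxiv_id.replace(".", "_")  # version suffix is kept as-is
    return PurePosixPath("papers", "arxiv", str(year), normalized)


print(arxiv_paper_dir("2212.06340"))
print(arxiv_paper_dir("2301.12345v1"))
```

Old-style IDs such as `hep-th/9901001` are rejected here rather than guessed at, since this document does not say how they are laid out.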
### Local Papers
Local papers are organized by content hash:
```
papers/local/HASH_PREFIX/
```
Where `HASH_PREFIX` is the first 16 hexadecimal characters of the SHA256 hash of the PDF file's contents.
**Examples:**
```
papers/local/a1b2c3d4e5f67890/
papers/local/fedcba9876543210/
```
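Deriving the directory for a local PDF might look like the sketch below. The helper names `hash_prefix` and `local_paper_dir` are hypothetical, and the sketch assumes the prefix is taken from the lowercase hexadecimal digest, which matches the examples above.

```python
import hashlib
from pathlib import Path, PurePosixPath


def hash_prefix(data: bytes, length: int = 16) -> str:
    """Return the first `length` hex characters of the SHA256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()[:length]


def local_paper_dir(pdf_path: Path) -> PurePosixPath:
    """Return the library-relative directory for a locally imported PDF."""
    # Reads the whole file into memory; fine for typical paper-sized PDFs.
    return PurePosixPath("papers", "local", hash_prefix(pdf_path.read_bytes()))


print(hash_prefix(b"example pdf bytes"))
```

Because the directory name depends only on file contents, importing the same PDF twice yields the same path.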
## File Types
### Required Files
Every paper directory contains:
#### `meta.json`
The canonical metadata file (JSON format):
```json
{
  "paper_id": "arxiv-2212_06340",
  "source_type": "arxiv",
  "source_id": "2212.06340",
  "title": "Example Paper Title",
  "authors": ["Alice Smith", "Bob Jones"],
  "published_date": "2022-12-13T02:46:55",
  "categories": ["cs.AI", "stat.ML"],
  "pdf_path": "papers/arxiv/2022/arxiv-2212_06340/source.pdf",
  "paper_md_path": "papers/arxiv/2022/arxiv-2212_06340/paper.md",
  "imported_at": "2024-01-15T10:30:00",
  "conversion_status": "success",
  "summary_status": "not_requested",
  "tags": ["machine-learning"],
  "notes": "Important paper on neural networks"
}
```
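Reading a `meta.json` file and checking for a few expected keys could be sketched like this. The helper `load_meta` and the `REQUIRED_FIELDS` tuple are assumptions made for illustration; this document does not state which fields are mandatory.

```python
import json
from pathlib import Path

# Assumption: a subset of the fields from the example above is treated as required.
REQUIRED_FIELDS = ("paper_id", "source_type", "source_id", "title", "authors", "pdf_path")


def load_meta(paper_dir: Path) -> dict:
    """Load meta.json from a paper directory and check that expected keys exist."""
    meta = json.loads((paper_dir / "meta.json").read_text(encoding="utf-8"))
    missing = [name for name in REQUIRED_FIELDS if name not in meta]
    if missing:
        raise ValueError(f"{paper_dir}: meta.json is missing {missing}")
    return meta
```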
#### `source.pdf`
The original PDF file, exactly as imported.
### Generated Files
These files are created by paperlib processing:
#### `paper.md`
Markdown conversion of the PDF, generated by MinerU or other converters.
#### `summary.json` (optional)
AI-generated structured summary:
```json
{
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces...",
  "problem_statement": "Current methods have limitations...",
  "method_overview": "We propose a novel approach...",
  "main_results": "Experiments show 95% accuracy...",
  "claimed_contributions": ["Novel architecture", "Improved performance"],
  "problem_tags": ["classification", "optimization"],
  "technique_tags": ["neural-networks", "transformers"],
  "entities": ["BERT", "ImageNet", "ResNet"],
  "relevance_to_user": 0.85
}
```
#### `summary.md` (optional)
Human-readable summary rendered from `summary.json`.
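As a rough illustration of that rendering step, the sketch below turns the fields of `summary.json` into Markdown. The function names and the heading layout are hypothetical; the real `summary.md` format is not specified in this document.

```python
import json
from pathlib import Path


def render_summary(summary: dict) -> str:
    """Render a summary dict as Markdown (layout is an illustrative assumption)."""
    lines = [
        "# Summary", "", summary["one_sentence_summary"], "",
        "## Problem", "", summary["problem_statement"], "",
        "## Method", "", summary["method_overview"], "",
        "## Results", "", summary["main_results"], "",
        "## Claimed contributions", "",
    ]
    lines += [f"- {item}" for item in summary["claimed_contributions"]]
    return "\n".join(lines) + "\n"


def write_summary_md(paper_dir: Path) -> None:
    """Regenerate summary.md next to summary.json in the same paper directory."""
    summary = json.loads((paper_dir / "summary.json").read_text(encoding="utf-8"))
    (paper_dir / "summary.md").write_text(render_summary(summary), encoding="utf-8")
```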
### Supporting Directories
#### `assets/`
Contains extracted images, figures, and other media from the PDF conversion process.
#### `logs/`
Processing logs for debugging and audit trails:
- `mineru.log` - PDF conversion logs
- `summary.log` - AI summarization logs (future)
## Index Database
The SQLite database at `db/paperlib.sqlite3` contains:
### Tables
#### `papers`
Main paper index with searchable fields:
- Metadata from all `meta.json` files
- Computed search fields (full-text, author lists, etc.)
- Processing status tracking
#### `papers_fts`
Full-text search virtual table (SQLite FTS5) for content search.
### Rebuilding
The database is **always rebuildable** from the source files:
```bash
paperlib reindex
```
Because the index is derived data, the files under `papers/` (in particular each `meta.json`) remain the authoritative source of truth.
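Conceptually, a rebuild walks the paper directories and repopulates the database from each `meta.json`. The sketch below illustrates the idea only: the function name `reindex`, the column set, and the choice to drop and recreate the file are assumptions, and the `papers_fts` table is omitted for brevity.

```python
import json
import sqlite3
from pathlib import Path


def reindex(library_root: Path) -> int:
    """Rebuild db/paperlib.sqlite3 from meta.json files; returns the paper count."""
    db_dir = library_root / "db"
    db_dir.mkdir(parents=True, exist_ok=True)
    db_path = db_dir / "paperlib.sqlite3"
    db_path.unlink(missing_ok=True)  # the index is disposable, so start from scratch

    count = 0
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            # Hypothetical schema: a few searchable columns plus the raw JSON.
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, authors TEXT, meta_json TEXT)"
            )
            for meta_file in sorted((library_root / "papers").rglob("meta.json")):
                text = meta_file.read_text(encoding="utf-8")
                meta = json.loads(text)
                conn.execute(
                    "INSERT INTO papers VALUES (?, ?, ?, ?)",
                    (meta["paper_id"], meta.get("title", ""),
                     ", ".join(meta.get("authors", [])), text),
                )
                count += 1
    finally:
        conn.close()
    return count
```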
## Path Conventions
### Relative Paths
All paths in `meta.json` are relative to the library root:
```json
{
  "pdf_path": "papers/local/a1b2c3d4e5f67890/source.pdf",
  "paper_md_path": "papers/local/a1b2c3d4e5f67890/paper.md"
}
```
### Cross-Platform Compatibility
All paths use forward slashes (`/`) regardless of operating system.
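One way to honor both conventions is to convert at the boundary: store library-relative POSIX strings, and turn them back into native paths only when touching the disk. The helper names below are hypothetical.

```python
from pathlib import Path, PurePosixPath


def to_stored_path(library_root: Path, absolute: Path) -> str:
    """Convert a native path inside the library to the stored form."""
    # relative_to() raises ValueError if the path is outside the library root.
    return absolute.relative_to(library_root).as_posix()


def from_stored_path(library_root: Path, stored: str) -> Path:
    """Convert a stored forward-slash path back to a native path."""
    relative = PurePosixPath(stored)
    if relative.is_absolute():
        raise ValueError(f"stored path must be relative: {stored!r}")
    return library_root.joinpath(*relative.parts)
```

Rejecting absolute stored paths keeps a malformed `meta.json` from pointing outside the intended root by accident.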
## Backup and Portability
### What to Backup
For complete library backup, include:
- `config/` directory (configuration)
- `papers/` directory (source of truth)
### What NOT to Backup
These can be regenerated:
- `db/` directory (rebuildable index)
- `cache/` directory (temporary files)
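A backup that follows these rules only needs to copy two directories. The sketch below uses the standard library; `backup_library` is a hypothetical helper, not a paperlib command.

```python
import shutil
from pathlib import Path

# Only the directories that cannot be regenerated; db/ and cache/ are skipped.
BACKUP_DIRS = ("config", "papers")


def backup_library(library_root: Path, destination: Path) -> list[str]:
    """Copy config/ and papers/ into `destination`; returns the names copied."""
    copied = []
    for name in BACKUP_DIRS:
        source = library_root / name
        if source.is_dir():
            shutil.copytree(source, destination / name, dirs_exist_ok=True)
            copied.append(name)
    return copied
```

After restoring such a backup, the index is missing by design and has to be rebuilt with `paperlib reindex`.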
### Moving Libraries
To move a library:
1. Copy the entire directory structure
2. Run `paperlib reindex` to rebuild the database
3. Update any absolute paths in configuration
## Storage Efficiency
### Deduplication
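Papers are naturally deduplicated, because the directory name is derived from the paper's identity:
- arXiv papers by normalized arXiv ID
- Local papers by the SHA256 hash prefix of the PDF contents, so re-importing an identical file maps to the same directory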
### Large Files
For papers with large asset directories:
- Assets are stored alongside papers for locality
- Consider using file system compression or deduplication if needed
## File System Requirements
### Permissions
paperlib requires:
- Read/write access to library directory
- Ability to create subdirectories
- Atomic file operations for metadata updates
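The atomic-update requirement is commonly met by writing to a temporary file in the same directory and renaming it over the target. The sketch below shows that pattern; `write_meta_atomic` is a hypothetical name, and whether paperlib does exactly this is an assumption.

```python
import json
import os
import tempfile
from pathlib import Path


def write_meta_atomic(paper_dir: Path, meta: dict) -> None:
    """Replace meta.json so readers see either the old or the new file, never a partial one."""
    target = paper_dir / "meta.json"
    # The temp file must live on the same file system for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=paper_dir, prefix=".meta-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(meta, handle, indent=2, ensure_ascii=False)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, target)  # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # do not leave stray temp files behind
        raise
```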
### File System Features
Recommended:
- Case-sensitive file system (avoids conflicts)
- Support for Unicode filenames
- Journaling (protects against corruption)
### Disk Space
Typical storage requirements:
- PDF files: 1-10 MB each
- Markdown conversions: 10-100 KB each
- Metadata: ~1-5 KB per paper
- Database index: ~1-10 KB per paper
- Assets: Varies (0-50 MB for image-heavy papers)
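To turn the per-paper figures above into a library-level estimate, the ranges can simply be summed and scaled. The sketch below does that arithmetic, assuming 1 MB = 1024 KB; the names are illustrative.

```python
KB_PER_MB = 1024

# (low, high) per-paper sizes in KB, taken from the list above.
PER_PAPER_KB = {
    "pdf": (1 * KB_PER_MB, 10 * KB_PER_MB),
    "markdown": (10, 100),
    "metadata": (1, 5),
    "index": (1, 10),
    "assets": (0, 50 * KB_PER_MB),
}


def estimate_kb(paper_count: int) -> tuple[int, int]:
    """Return (low, high) total size in KB for a library of `paper_count` papers."""
    low = sum(lo for lo, _ in PER_PAPER_KB.values())
    high = sum(hi for _, hi in PER_PAPER_KB.values())
    return paper_count * low, paper_count * high


low, high = estimate_kb(1000)
print(f"{low / KB_PER_MB**2:.1f} GB to {high / KB_PER_MB**2:.1f} GB")
```

For 1,000 papers this gives a range from roughly 1 GB up to about 59 GB, with the upper bound dominated by PDFs and assets.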
## Migration and Versioning
### Schema Evolution
When paperlib updates its storage format:
- Schema versions are tracked in the files themselves (for example, the `schema_version` field in `summary.json`)
- Migration tools handle format upgrades
- Backward compatibility is maintained when possible
### Validation
paperlib is planned to provide a command for validating library integrity:
```bash
paperlib doctor # (future command)
```
This will check:
- All referenced files exist
- Metadata format is valid
- Database consistency with files
- No orphaned or corrupted data
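Until such a command exists, the first two checks can be approximated with a short script. The sketch below is illustrative only: `find_problems` is a hypothetical helper, and it covers invalid JSON and missing referenced files, not database consistency.

```python
import json
from pathlib import Path


def find_problems(library_root: Path) -> list[str]:
    """Return human-readable problems found in meta.json files under papers/."""
    problems = []
    for meta_file in sorted((library_root / "papers").rglob("meta.json")):
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{meta_file}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(meta, dict):
            problems.append(f"{meta_file}: expected a JSON object")
            continue
        # Stored paths are relative to the library root and use forward slashes.
        for key in ("pdf_path", "paper_md_path"):
            stored = meta.get(key)
            if stored and not (library_root / stored).is_file():
                problems.append(f"{meta_file}: {key} points to missing file {stored}")
    return problems
```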