docs: add docs
# Storage Layout

This document describes the on-disk structure and organization of a paperlib library.

## Overview

A paperlib library is a directory containing all papers, metadata, configuration, and index data. The layout is designed to be:

- **Human-readable**: Directory structure is intuitive and browsable
- **Stable**: File locations don't change unexpectedly
- **Rebuildable**: Index can be reconstructed from source files
- **Portable**: Entire library can be moved or backed up as a unit

## Directory Structure

```
library_root/
├── config/                      # Library configuration
│   ├── config.toml              # Main configuration file
│   ├── vocab.yaml               # Controlled vocabulary (future)
│   └── prompts/                 # AI prompt templates (future)
│       └── summarize_paper.md
├── papers/                      # Paper storage (source of truth)
│   ├── arxiv/                   # arXiv papers organized by year
│   │   └── 2022/
│   │       └── arxiv-2212_06340/
│   │           ├── meta.json    # Paper metadata
│   │           ├── source.pdf   # Original PDF
│   │           ├── paper.md     # Converted markdown
│   │           ├── summary.json # AI-generated summary
│   │           ├── summary.md   # Rendered summary
│   │           ├── ref.bib      # Bibliography (future)
│   │           ├── assets/      # Images, figures
│   │           └── logs/        # Processing logs
│   │               └── mineru.log
│   └── local/                   # Local PDF imports by hash
│       └── a1b2c3d4e5f67890/
│           └── ... (same structure)
├── inbox/                       # Temporary import staging (future)
├── db/                          # Search index (rebuildable)
│   └── paperlib.sqlite3
└── cache/                       # Processing cache (safe to delete)
```

## Paper Directory Organization

### arXiv Papers

arXiv papers are organized by year and paper ID:

```
papers/arxiv/YEAR/arxiv-NORMALIZED_ID/
```

Where:

- `YEAR` is derived from the two-digit year prefix of the arXiv ID (e.g., `2212.06340` → `2022`)
- `NORMALIZED_ID` is the arXiv ID with dots replaced by underscores; a version suffix, if present, is kept. With the `arxiv-` prefix, the directory names become:
  - `2212.06340` → `arxiv-2212_06340`
  - `2212.06340v2` → `arxiv-2212_06340v2`

**Examples:**

```
papers/arxiv/2022/arxiv-2212_06340/
papers/arxiv/2023/arxiv-2301_12345v1/
papers/arxiv/2024/arxiv-2405_98765/
```
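
The mapping from an arXiv ID to its directory can be sketched in a few lines of Python. This is an illustration of the rules above, not paperlib's actual implementation: the function name `arxiv_paper_dir` is hypothetical, the year is assumed to be `2000 + YY`, and old-style identifiers (such as `hep-th/9901001`) are rejected because this document does not say how they are laid out.

```python
import re
from pathlib import PurePosixPath

# New-style arXiv identifiers look like YYMM.NNNNN with an optional vN suffix.
_ARXIV_ID = re.compile(r"(\d{2})(\d{2})\.(\d{4,5})(v\d+)?")


def arxiv_paper_dir(arxiv_id: str) -> PurePosixPath:
    """Return the library-relative directory for a new-style arXiv ID.

    Hypothetical helper: the name and the error handling are assumptions.
    """
    match = _ARXIV_ID.fullmatch(arxiv_id)
    if match is None:
        raise ValueError(f"unsupported arXiv ID: {arxiv_id!r}")
    year = 2000 + int(match.group(1))        # assumption: "22" means 2022
    normalized = arxiv_id.replace(".", "_")  # the version suffix is kept
    return PurePosixPath("papers", "arxiv", str(year), f"arxiv-{normalized}")


print(arxiv_paper_dir("2212.06340"))    # papers/arxiv/2022/arxiv-2212_06340
print(arxiv_paper_dir("2301.12345v1"))  # papers/arxiv/2023/arxiv-2301_12345v1
```

Using `PurePosixPath` keeps the result in forward-slash form on every platform.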

### Local Papers

Local papers are organized by content hash:

```
papers/local/HASH_PREFIX/
```

Where `HASH_PREFIX` is the first 16 hexadecimal characters of the SHA-256 hash of the PDF file.

**Examples:**

```
papers/local/a1b2c3d4e5f67890/
papers/local/fedcba9876543210/
```
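
As a rough sketch of how such a directory name could be derived, the following Python hashes the PDF in chunks and keeps the first 16 hex characters. The function name `local_paper_dir` and the chunk size are assumptions made for illustration; only the use of SHA-256 and the 16-character prefix come from the description above.

```python
import hashlib
from pathlib import Path, PurePosixPath


def local_paper_dir(pdf_path: Path, prefix_len: int = 16) -> PurePosixPath:
    """Return the library-relative directory for a locally imported PDF.

    Hypothetical helper: reads the file in 1 MiB chunks so that large PDFs
    do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with pdf_path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return PurePosixPath("papers", "local", digest.hexdigest()[:prefix_len])
```

Because the name depends only on the file contents, importing the same PDF twice yields the same directory.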

## File Types

### Required Files

Every paper directory contains:

#### `meta.json`

The canonical metadata file (JSON format):

```json
{
  "paper_id": "arxiv-2212_06340",
  "source_type": "arxiv",
  "source_id": "2212.06340",
  "title": "Example Paper Title",
  "authors": ["Alice Smith", "Bob Jones"],
  "published_date": "2022-12-13T02:46:55",
  "categories": ["cs.AI", "stat.ML"],
  "pdf_path": "papers/arxiv/2022/arxiv-2212_06340/source.pdf",
  "paper_md_path": "papers/arxiv/2022/arxiv-2212_06340/paper.md",
  "imported_at": "2024-01-15T10:30:00",
  "conversion_status": "success",
  "summary_status": "not_requested",
  "tags": ["machine-learning"],
  "notes": "Important paper on neural networks"
}
```

#### `source.pdf`

The original PDF file, exactly as imported.

### Generated Files

These files are created by paperlib processing:

#### `paper.md`

Markdown conversion of the PDF, generated by MinerU or other converters.

#### `summary.json` (optional)

AI-generated structured summary:

```json
{
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces...",
  "problem_statement": "Current methods have limitations...",
  "method_overview": "We propose a novel approach...",
  "main_results": "Experiments show 95% accuracy...",
  "claimed_contributions": ["Novel architecture", "Improved performance"],
  "problem_tags": ["classification", "optimization"],
  "technique_tags": ["neural-networks", "transformers"],
  "entities": ["BERT", "ImageNet", "ResNet"],
  "relevance_to_user": 0.85
}
```

#### `summary.md` (optional)

Human-readable summary rendered from `summary.json`.
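
To make the relationship between the two summary files concrete, here is one possible way to render the JSON fields as Markdown. The section headings and the function name `render_summary_md` are invented for this sketch; the actual layout of `summary.md` is not specified in this document.

```python
def render_summary_md(summary: dict) -> str:
    """Render a summary.json payload as Markdown (hypothetical layout)."""
    sections = [
        ("Problem", summary.get("problem_statement", "")),
        ("Method", summary.get("method_overview", "")),
        ("Results", summary.get("main_results", "")),
    ]
    lines = ["# Summary", "", summary.get("one_sentence_summary", ""), ""]
    for heading, text in sections:
        if text:  # skip sections whose field is missing or empty
            lines += [f"## {heading}", "", text, ""]
    contributions = summary.get("claimed_contributions", [])
    if contributions:
        lines += ["## Contributions", ""]
        lines += [f"- {item}" for item in contributions]
        lines.append("")
    return "\n".join(lines)
```

Since the Markdown is derived entirely from `summary.json`, it can be regenerated at any time without calling the summarizer again.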

### Supporting Directories

#### `assets/`

Contains extracted images, figures, and other media from the PDF conversion process.

#### `logs/`

Processing logs for debugging and audit trails:

- `mineru.log` - PDF conversion logs
- `summary.log` - AI summarization logs (future)

## Index Database

The SQLite database at `db/paperlib.sqlite3` contains:

### Tables

#### `papers`

Main paper index with searchable fields:

- Metadata from all `meta.json` files
- Computed search fields (full-text, author lists, etc.)
- Processing status tracking

#### `papers_fts`

Full-text search virtual table (SQLite FTS5) for content search.

### Rebuilding

The database is **always rebuildable** from the source files:

```bash
paperlib reindex
```

This design ensures the JSON files remain the authoritative source of truth.
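
The rebuild idea can be illustrated with a small Python sketch that walks every `meta.json` under `papers/` and repopulates a `papers` table. It is not the code behind `paperlib reindex`: the column set is an assumption, and the `papers_fts` table is left out because FTS5 availability depends on how the local SQLite library was built.

```python
import json
import sqlite3
from pathlib import Path


def reindex(library_root: Path) -> int:
    """Rebuild a minimal papers table from meta.json files (illustrative schema)."""
    db_path = library_root / "db" / "paperlib.sqlite3"
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    count = 0
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, authors TEXT, meta_json TEXT)"
            )
            for meta_file in sorted((library_root / "papers").rglob("meta.json")):
                raw = meta_file.read_text(encoding="utf-8")
                meta = json.loads(raw)
                conn.execute(
                    "INSERT OR REPLACE INTO papers VALUES (?, ?, ?, ?)",
                    (
                        meta["paper_id"],
                        meta.get("title", ""),
                        json.dumps(meta.get("authors", [])),
                        raw,
                    ),
                )
                count += 1
    finally:
        conn.close()
    return count
```

Dropping and recreating the table on every run keeps the sketch simple and makes the point that nothing in `db/` needs to be preserved.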

## Path Conventions

### Relative Paths

All paths in `meta.json` are relative to the library root:

```json
{
  "pdf_path": "papers/local/a1b2c3d4e5f67890/source.pdf",
  "paper_md_path": "papers/local/a1b2c3d4e5f67890/paper.md"
}
```

### Cross-Platform Compatibility

All paths use forward slashes (`/`) regardless of operating system.
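
Both conventions map directly onto Python's `pathlib`. The two helpers below are hypothetical names used only for this sketch: one converts an absolute path inside the library to its stored form, the other resolves a stored path against a library root. Pure path classes are used so the example behaves the same on any operating system.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def to_stored_path(library_root: PurePath, absolute: PurePath) -> str:
    """Convert an absolute path inside the library to the stored form.

    Raises ValueError if the path is not located under library_root.
    """
    return absolute.relative_to(library_root).as_posix()


def from_stored_path(library_root: PurePath, stored: str) -> PurePath:
    """Resolve a stored forward-slash path against a library root."""
    relative = PurePosixPath(stored)
    if relative.is_absolute():
        raise ValueError(f"stored path must be relative: {stored!r}")
    return library_root.joinpath(*relative.parts)


# A Windows-style root still produces forward slashes in the stored path.
root = PureWindowsPath(r"C:\Users\me\library")
pdf = root / "papers" / "local" / "a1b2c3d4e5f67890" / "source.pdf"
print(to_stored_path(root, pdf))  # papers/local/a1b2c3d4e5f67890/source.pdf
```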

## Backup and Portability

### What to Back Up

For complete library backup, include:

- `config/` directory (configuration)
- `papers/` directory (source of truth)

### What NOT to Back Up

These can be regenerated:

- `db/` directory (rebuildable index)
- `cache/` directory (temporary files)
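
A selective backup along these lines might look like the following sketch, which archives only `config/` and `papers/` into a gzip-compressed tarball. The function name `backup_library` and the choice of `tarfile` are assumptions; any backup tool that can include or exclude directories would serve equally well.

```python
import tarfile
from pathlib import Path

# Only the directories that cannot be regenerated; db/ and cache/ are skipped.
BACKUP_DIRS = ("config", "papers")


def backup_library(library_root: Path, archive_path: Path) -> list[str]:
    """Write a .tar.gz of the non-rebuildable directories; return those included."""
    included = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in BACKUP_DIRS:
            source = library_root / name
            if source.is_dir():
                tar.add(source, arcname=name)  # directories are added recursively
                included.append(name)
    return included
```

After restoring such an archive, `paperlib reindex` recreates the contents of `db/`.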

### Moving Libraries

To move a library:

1. Copy the entire directory structure
2. Run `paperlib reindex` to rebuild the database
3. Update any absolute paths in configuration

## Storage Efficiency

### Deduplication

Papers are naturally deduplicated:

- arXiv papers by normalized arXiv ID
- Local papers by SHA256 content hash

### Large Files

For papers with large asset directories:

- Assets are stored alongside papers for locality
- Consider using file system compression or deduplication if needed

## File System Requirements

### Permissions

paperlib requires:

- Read/write access to library directory
- Ability to create subdirectories
- Atomic file operations for metadata updates
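
The atomic-update requirement usually means the common write-then-rename pattern: write the new contents to a temporary file in the same directory, flush it to disk, then rename it over the target. The sketch below shows that pattern with a hypothetical helper name, `write_meta_atomic`; whether paperlib does exactly this is not stated here.

```python
import json
import os
import tempfile
from pathlib import Path


def write_meta_atomic(meta_path: Path, meta: dict) -> None:
    """Replace meta_path so readers see either the old or the new file."""
    # The temp file lives in the same directory so the rename stays on one file system.
    fd, tmp_name = tempfile.mkstemp(
        dir=meta_path.parent, prefix=".meta-", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(meta, fh, indent=2, ensure_ascii=False)
            fh.flush()
            os.fsync(fh.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, meta_path)  # overwrites the target in one step
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)  # clean up the partial temp file
        raise
```

If the process is interrupted before `os.replace`, the existing `meta.json` is left untouched.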

### File System Features

Recommended:

- Case-sensitive file system (avoids conflicts)
- Support for Unicode filenames
- Journaling (protects against corruption)

### Disk Space

Typical storage requirements:

- PDF files: 1-10 MB each
- Markdown conversions: 10-100 KB each
- Metadata: ~1-5 KB per paper
- Database index: ~1-10 KB per paper
- Assets: Varies (0-50 MB for image-heavy papers)
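
For a rough sense of scale, the per-paper ranges above can be summed and multiplied by the size of the library. The snippet is back-of-the-envelope arithmetic only; it assumes decimal units (1 MB = 1000 KB) and treats every paper as falling at the low or the high end of each range.

```python
# Per-paper size ranges in KB, copied from the list above (1 MB = 1000 KB).
SIZE_RANGES_KB = {
    "pdf": (1_000, 10_000),
    "markdown": (10, 100),
    "metadata": (1, 5),
    "index": (1, 10),
    "assets": (0, 50_000),
}


def estimate_library_kb(paper_count: int) -> tuple[int, int]:
    """Return (low, high) total size estimates in KB for paper_count papers."""
    low = sum(lo for lo, _ in SIZE_RANGES_KB.values()) * paper_count
    high = sum(hi for _, hi in SIZE_RANGES_KB.values()) * paper_count
    return low, high


low, high = estimate_library_kb(1_000)
print(f"{low / 1e6:.1f} GB to {high / 1e6:.1f} GB")  # 1.0 GB to 60.1 GB
```

The wide spread comes almost entirely from PDFs and assets; metadata and the index are negligible by comparison.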

## Migration and Versioning

### Schema Evolution

When paperlib updates its storage format:

- Metadata schema versions are tracked in each file
- Migration tools handle format upgrades
- Backward compatibility is maintained when possible

### Validation

paperlib provides tools to validate library integrity:

```bash
paperlib doctor  # (future command)
```

This will check:

- All referenced files exist
- Metadata format is valid
- Database consistency with files
- No orphaned or corrupted data
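
Since `paperlib doctor` is a future command, the following is only a sketch of what the first two checks might involve: parsing each `meta.json` and confirming that the files it references exist. The function name `check_library` and the message format are invented here; database consistency and orphan detection are not covered.

```python
import json
from pathlib import Path

# Path-valued fields that appear in the meta.json example.
PATH_FIELDS = ("pdf_path", "paper_md_path")


def check_library(library_root: Path) -> list[str]:
    """Return human-readable problems found under papers/ (hypothetical checks)."""
    problems = []
    for meta_file in sorted((library_root / "papers").rglob("meta.json")):
        rel = meta_file.relative_to(library_root).as_posix()
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{rel}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(meta, dict):
            problems.append(f"{rel}: top-level value is not an object")
            continue
        for field in PATH_FIELDS:
            stored = meta.get(field)
            # Stored paths are relative to the library root.
            if stored and not (library_root / stored).is_file():
                problems.append(f"{rel}: {field} points to missing file {stored}")
    return problems
```

An empty list means none of the checked conditions failed; it does not prove the library is fully consistent.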