# Data Model
## Library data layout
The paper library on disk should be human-browsable.
A typical layout looks like:
```text
library_root/
  config/
    config.toml
    vocab.yaml
    prompts/
      summarize_paper.md
  inbox/
  papers/
    arxiv/
      2026/
        2604.12345/
          meta.json
          source.pdf
          paper.md
          summary.json
          summary.md
          ref.bib
          assets/
          logs/
            mineru.log
    local/
      sha256-.../
        meta.json
        source.pdf
        paper.md
        summary.json
        summary.md
  db/
    paperlib.sqlite3
  cache/
```
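As an illustration of how code might resolve paths in this layout, here is a minimal Python sketch. The helper names `arxiv_paper_dir` and `paper_files` are hypothetical, and the sketch assumes new-style arXiv IDs (`YYMM.NNNNN`, no version suffix) with the year folder derived from the first two digits. Directory names under `local/` are not covered.

```python
from pathlib import Path


def arxiv_paper_dir(library_root: Path, arxiv_id: str) -> Path:
    """Return the directory for an arXiv paper, e.g. papers/arxiv/2026/2604.12345.

    Assumption: new-style IDs (YYMM.NNNNN); the year folder is 2000 + YY.
    """
    yymm, _, number = arxiv_id.partition(".")
    if len(yymm) != 4 or not yymm.isdigit() or not number.isdigit():
        raise ValueError(f"unsupported arXiv id: {arxiv_id!r}")
    year = 2000 + int(yymm[:2])
    return library_root / "papers" / "arxiv" / str(year) / arxiv_id


def paper_files(paper_dir: Path) -> dict[str, Path]:
    """Map the per-paper file names shown in the layout to full paths."""
    names = ["meta.json", "source.pdf", "paper.md", "summary.json", "summary.md"]
    return {name: paper_dir / name for name in names}
```

For example, `arxiv_paper_dir(Path("library_root"), "2604.12345")` resolves to the `papers/arxiv/2026/2604.12345` directory shown in the tree.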
## Data boundaries
### `meta.json`
`meta.json` should contain deterministic or near-deterministic information, mostly from:
- import process
- file system state
- external paper metadata sources
Typical fields include:
- `paper_id`, `source_type`, `source_id`
- `title`, `authors`, `published_date`, `updated_date`, `categories`
- `pdf_path`, `paper_md_path`, `summary_json_path`, `summary_md_path`
- `imported_at`, `conversion_status`, `summary_status`
Avoid putting speculative AI content into `meta.json`.
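A minimal Python sketch of writing such a file follows. The field names come from the list above; the `paper_id` format, the relative path values, the `"pending"` status value, and the helper names `build_meta` and `write_meta` are assumptions made for illustration only.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_meta(source_type: str, source_id: str, title: str, authors: list[str]) -> dict:
    """Assemble deterministic metadata; no AI-generated content belongs here."""
    return {
        "paper_id": f"{source_type}:{source_id}",  # assumed ID format
        "source_type": source_type,
        "source_id": source_id,
        "title": title,
        "authors": authors,
        # Assumption: paths are stored relative to the paper directory.
        "pdf_path": "source.pdf",
        "paper_md_path": "paper.md",
        "summary_json_path": "summary.json",
        "summary_md_path": "summary.md",
        "imported_at": datetime.now(timezone.utc).isoformat(),
        "conversion_status": "pending",  # assumed status value
        "summary_status": "pending",  # assumed status value
    }


def write_meta(paper_dir: Path, meta: dict) -> Path:
    """Write meta.json via a temporary file so readers never see a partial file."""
    paper_dir.mkdir(parents=True, exist_ok=True)
    target = paper_dir / "meta.json"
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(meta, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    tmp.replace(target)
    return target
```

Sorting keys and using a fixed indent keeps the file diff-friendly, which fits the goal of a human-browsable library.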
### `summary.json`
`summary.json` is optional enrichment and may be regenerated.
It should contain structured fields such as:
- one-sentence summary, problem statement, method overview
- main results, claimed contributions, assumptions, limitations
- problem tags, technique tags, entities
- relevance-to-user fields, recommended sections
`summary.json` must include a schema version.
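One way a reader could enforce the schema version requirement is sketched below in Python. The key name `schema_version`, the set of supported versions, and the helper `load_summary` are assumptions; the document only states that a version must be present.

```python
import json
from pathlib import Path
from typing import Optional

SUPPORTED_SCHEMA_VERSIONS = {1}  # hypothetical; adjust to the real schema history


def load_summary(path: Path) -> Optional[dict]:
    """Return the parsed summary, or None if the file is absent.

    summary.json is optional enrichment, so a missing file is not an error.
    """
    if not path.exists():
        return None
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError(f"{path}: expected a JSON object")
    version = data.get("schema_version")  # assumed key name
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"{path}: unsupported or missing schema version: {version!r}")
    return data
```

Because the file may be regenerated, a caller could treat the `ValueError` as a signal to re-run summarization rather than as a fatal error.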
### SQLite
SQLite stores searchable/indexed state and job-independent status.
It should help with:
- listing papers
- filtering and search
- path lookup
- tag lookup
- status overview
But it should never be treated as the only durable source of paper metadata.
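Since the files remain the durable source, the index can in principle be recreated from them at any time. The Python sketch below shows one possible rebuild using the standard `sqlite3` module. The table name `papers`, its columns, and the function `rebuild_index` are hypothetical, not the project's actual schema.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, db_path: Path) -> int:
    """Recreate the papers table from every meta.json under papers/.

    Returns the number of papers indexed.
    """
    conn = sqlite3.connect(str(db_path))
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, dir_path TEXT, "
                "conversion_status TEXT, summary_status TEXT)"
            )
            count = 0
            for meta_path in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_path.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT OR REPLACE INTO papers VALUES (?, ?, ?, ?, ?)",
                    (
                        meta.get("paper_id"),
                        meta.get("title"),
                        # Store the directory relative to the library root.
                        meta_path.parent.relative_to(library_root).as_posix(),
                        meta.get("conversion_status"),
                        meta.get("summary_status"),
                    ),
                )
                count += 1
        return count
    finally:
        conn.close()
```

Dropping and recreating the table keeps the sketch simple; an incremental update would be a reasonable alternative for large libraries.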
## Key conventions
- `meta.json` contains stable metadata and processing status
- `summary.json` contains structured AI-generated enrichment
- `summary.md` is rendered from `summary.json`
- `paper.md` is generated from the PDF by an external converter such as MinerU
- the database is rebuildable from the files above
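
The convention that `summary.md` is rendered from `summary.json` could look roughly like the Python sketch below. The JSON key names, the section headings, and the helpers `render_summary_md` and `write_summary_md` are assumptions chosen to mirror the structured fields listed earlier, not a confirmed schema.

```python
import json
from pathlib import Path

# Hypothetical (key, heading) pairs; the real schema may name these differently.
SECTIONS = [
    ("one_sentence_summary", "Summary"),
    ("problem_statement", "Problem"),
    ("method_overview", "Method"),
    ("main_results", "Main results"),
    ("limitations", "Limitations"),
]


def render_summary_md(summary: dict) -> str:
    """Render the known sections of a summary dict as Markdown text."""
    parts = []
    for key, heading in SECTIONS:
        value = summary.get(key)
        if not value:
            continue  # skip absent or empty fields
        parts.append(f"## {heading}")
        if isinstance(value, list):
            parts.extend(f"- {item}" for item in value)
        else:
            parts.append(str(value))
        parts.append("")  # blank line after each section
    return "\n".join(parts)


def write_summary_md(paper_dir: Path) -> Path:
    """Regenerate summary.md from summary.json in the same paper directory."""
    summary = json.loads((paper_dir / "summary.json").read_text(encoding="utf-8"))
    target = paper_dir / "summary.md"
    target.write_text(render_summary_md(summary), encoding="utf-8")
    return target
```

Keeping the renderer a pure function of the JSON means `summary.md` never needs to be edited by hand and can always be regenerated.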