# Data Model

## Library data layout

The paper library on disk should be human-browsable.

A typical layout looks like:

```text
library_root/
  config/
    config.toml
    vocab.yaml
    prompts/
      summarize_paper.md

  inbox/
  papers/
    arxiv/
      2026/
        2604.12345/
          meta.json
          source.pdf
          paper.md
          summary.json
          summary.md
          ref.bib
          assets/
          logs/
            mineru.log
    local/
      sha256-.../
        meta.json
        source.pdf
        paper.md
        summary.json
        summary.md

  db/
    paperlib.sqlite3

  cache/
```
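
The arXiv branch of the layout above can be sketched as a small path helper. This is a minimal illustration, not the project's actual code: the function names `arxiv_paper_dir` and `expected_files` are hypothetical, and the year is passed in explicitly because the document does not say how it is derived.

```python
from pathlib import Path


def arxiv_paper_dir(library_root: Path, year: int, arxiv_id: str) -> Path:
    """Return the per-paper directory for an arXiv paper (hypothetical helper)."""
    return library_root / "papers" / "arxiv" / str(year) / arxiv_id


def expected_files(paper_dir: Path) -> dict[str, Path]:
    """Map the core per-paper file names from the layout to full paths."""
    names = ["meta.json", "source.pdf", "paper.md", "summary.json", "summary.md"]
    return {name: paper_dir / name for name in names}


if __name__ == "__main__":
    paper_dir = arxiv_paper_dir(Path("library_root"), 2026, "2604.12345")
    for name, path in expected_files(paper_dir).items():
        print(f"{name:14} {path.as_posix()}")
```

Keeping the path logic in one place means the directory scheme can change without touching the code that reads or writes individual files.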

## Data boundaries

### `meta.json`

`meta.json` should contain deterministic or near-deterministic information, mostly from:

- import process
- file system state
- external paper metadata sources

Typical fields include:

- `paper_id`, `source_type`, `source_id`
- `title`, `authors`, `published_date`, `updated_date`, `categories`
- `pdf_path`, `paper_md_path`, `summary_json_path`, `summary_md_path`
- `imported_at`, `conversion_status`, `summary_status`

Avoid putting speculative AI content into `meta.json`.
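
As a rough illustration of the fields listed above, the sketch below builds and writes a `meta.json` record. The field names come from this document; everything else is an assumption made for the example: the `paper_id` format, the `"pending"` status values, the relative paths, and the helper names `build_meta` and `write_meta`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_meta(source_type: str, source_id: str, title: str, authors: list[str]) -> dict:
    """Assemble a meta.json record from import-time facts only (hypothetical helper)."""
    return {
        "paper_id": f"{source_type}:{source_id}",  # assumed id format
        "source_type": source_type,
        "source_id": source_id,
        "title": title,
        "authors": authors,
        "published_date": None,  # filled from an external metadata source
        "updated_date": None,
        "categories": [],
        "pdf_path": "source.pdf",  # assumed: paths relative to the paper directory
        "paper_md_path": "paper.md",
        "summary_json_path": "summary.json",
        "summary_md_path": "summary.md",
        "imported_at": datetime.now(timezone.utc).isoformat(),
        "conversion_status": "pending",  # assumed status vocabulary
        "summary_status": "pending",
    }


def write_meta(paper_dir: Path, meta: dict) -> Path:
    """Write meta.json as indented UTF-8 JSON so it stays readable on disk."""
    paper_dir.mkdir(parents=True, exist_ok=True)
    path = paper_dir / "meta.json"
    path.write_text(json.dumps(meta, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return path
```

Note that nothing in the record is model-generated: every value is either supplied by the importer or left empty for a metadata lookup to fill.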

### `summary.json`

`summary.json` is optional enrichment and may be regenerated.

It should contain structured fields such as:

- one-sentence summary, problem statement, method overview
- main results, claimed contributions, assumptions, limitations
- problem tags, technique tags, entities
- relevance-to-user fields, recommended sections

`summary.json` must include a schema version.
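
One way to honor the schema-version requirement is to treat a summary with a missing or unknown version as absent, so it gets regenerated. The sketch below assumes the version lives in a top-level key named `schema_version` and that version `1` is the only supported one; both are illustrative choices, not something this document specifies.

```python
import json
from pathlib import Path
from typing import Optional

SUPPORTED_SCHEMA_VERSIONS = (1,)  # assumed: the versions this reader understands


def load_summary(path: Path) -> Optional[dict]:
    """Return the parsed summary, or None if it is missing or unusable."""
    if not path.exists():
        return None  # summary.json is optional enrichment
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None  # corrupt file: safe to regenerate
    if not isinstance(data, dict):
        return None
    if data.get("schema_version") not in SUPPORTED_SCHEMA_VERSIONS:
        return None  # missing or unknown version: regenerate
    return data
```

Returning `None` for every failure mode keeps callers simple: since the summary may always be regenerated, they only need to distinguish "usable" from "not usable".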

### SQLite

SQLite stores searchable/indexed state and job-independent status.

It should help with:
- listing papers, filtering and search, path lookup, tag lookup, status overview

But it should never be treated as the only durable source of paper metadata.

## Key conventions

- `meta.json` contains stable metadata and processing status
- `summary.json` contains structured AI-generated enrichment
- `summary.md` is rendered from `summary.json`
- `paper.md` is generated from the PDF by an external converter such as MinerU
- the database is rebuildable from the files above
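
The last convention can be sketched as a rebuild routine that drops the index and repopulates it from every `meta.json` on disk. The table name `papers`, its columns, and the function name `rebuild_index` are assumptions for illustration; the document does not define the actual SQLite schema.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, db_path: Path) -> int:
    """Recreate the papers table from every meta.json under papers/ (hypothetical schema)."""
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, source_type TEXT, "
                "summary_status TEXT, dir_path TEXT)"
            )
            count = 0
            for meta_path in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_path.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT OR REPLACE INTO papers VALUES (?, ?, ?, ?, ?)",
                    (
                        meta["paper_id"],
                        meta.get("title"),
                        meta.get("source_type"),
                        meta.get("summary_status"),
                        # store the directory relative to the library root
                        meta_path.parent.relative_to(library_root).as_posix(),
                    ),
                )
                count += 1
        return count
    finally:
        conn.close()
```

Because the files are the source of truth, deleting `db/paperlib.sqlite3` and calling a routine like this should lose nothing that matters.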