Data Model

Library data layout

The paper library on disk should be human-browsable.

A typical layout looks like:

library_root/
  config/
    config.toml
    vocab.yaml
    prompts/
      summarize_paper.md

  inbox/
  papers/
    arxiv/
      2026/
        2604.12345/
          meta.json
          source.pdf
          paper.md
          summary.json
          summary.md
          ref.bib
          assets/
          logs/
            mineru.log
    local/
      sha256-.../
        meta.json
        source.pdf
        paper.md
        summary.json
        summary.md

  db/
    paperlib.sqlite3

  cache/
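
The layout above implies a predictable directory for each arXiv paper. A minimal sketch of that mapping follows. The function name `arxiv_paper_dir` is hypothetical, and deriving the year folder from the first two digits of the ID is an assumption based on the `2026/2604.12345` example.

```python
from pathlib import Path


def arxiv_paper_dir(library_root: Path, arxiv_id: str) -> Path:
    """Return the paper directory for a new-style arXiv ID such as '2604.12345'."""
    yymm, _, number = arxiv_id.partition(".")
    if len(yymm) != 4 or not yymm.isdigit() or not number.isdigit():
        raise ValueError(f"unsupported arXiv id: {arxiv_id!r}")
    # Assumption: the year folder is 2000 plus the two-digit year in the ID.
    year = 2000 + int(yymm[:2])
    return library_root / "papers" / "arxiv" / str(year) / arxiv_id


paper_dir = arxiv_paper_dir(Path("library_root"), "2604.12345")
print(paper_dir.as_posix())  # library_root/papers/arxiv/2026/2604.12345
```

IDs with a version suffix or the old-style `category/number` form are rejected here rather than guessed at.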

Data boundaries

`meta.json`

meta.json should contain deterministic or near-deterministic information, mostly from:

import process
file system state
external paper metadata sources

Typical fields include:

paper_id, source_type, source_id
title, authors, published_date, updated_date, categories
pdf_path, paper_md_path, summary_json_path, summary_md_path
imported_at, conversion_status, summary_status

Avoid putting speculative AI content into meta.json.
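
As an illustration, a freshly imported paper's meta.json could be assembled as below. This is only a sketch: the field names come from the list above, but the `paper_id` format, the relative paths, and the `"pending"` status values are assumptions, and `build_meta` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone


def build_meta(source_id: str, title: str, authors: list[str]) -> dict:
    """Assemble a meta.json payload for a newly imported arXiv paper."""
    return {
        "paper_id": f"arxiv:{source_id}",  # assumed ID format
        "source_type": "arxiv",
        "source_id": source_id,
        "title": title,
        "authors": authors,
        "published_date": None,  # filled in from an external metadata source
        "updated_date": None,
        "categories": [],
        # Assumption: paths are stored relative to the paper directory.
        "pdf_path": "source.pdf",
        "paper_md_path": "paper.md",
        "summary_json_path": "summary.json",
        "summary_md_path": "summary.md",
        "imported_at": datetime.now(timezone.utc).isoformat(),
        "conversion_status": "pending",  # assumed status vocabulary
        "summary_status": "pending",
    }


meta = build_meta("2604.12345", "Example Title", ["A. Author"])
print(json.dumps(meta, indent=2))
```

Nothing in this payload comes from a model; every value is either supplied by the import step or left empty until a metadata source fills it.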

`summary.json`

summary.json is optional enrichment and may be regenerated.

It should contain structured fields such as:

one-sentence summary, problem statement, method overview
main results, claimed contributions, assumptions, limitations
problem tags, technique tags, entities
relevance-to-user fields, recommended sections

summary.json must include a schema version.
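
One way to enforce the schema version requirement is to check it at load time. The sketch below assumes the version lives in a top-level `schema_version` key and that version `1` is the only supported value; both are hypothetical, as is the `load_summary` helper.

```python
import json

# Assumption: integer versions, with 1 the only one currently understood.
SUPPORTED_SCHEMA_VERSIONS = {1}


def load_summary(text: str) -> dict:
    """Parse summary.json content and reject missing or unknown schema versions."""
    summary = json.loads(text)
    if not isinstance(summary, dict):
        raise ValueError("summary.json must contain a JSON object")
    version = summary.get("schema_version")
    if version is None:
        raise ValueError("summary.json is missing schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return summary


summary = load_summary('{"schema_version": 1, "one_sentence_summary": "..."}')
print(summary["schema_version"])  # 1
```

Because summary.json may be regenerated, a failed check can be handled by re-running summarization rather than by migrating the old file.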

SQLite

SQLite stores searchable/indexed state and job-independent status.

It should help with:

listing papers, filtering and search, path lookup, tag lookup, status overview

However, it should never be treated as the only durable source of paper metadata; the files in each paper directory remain authoritative.
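
Since the database is not the durable source, it should be possible to rebuild it by scanning the paper directories. The following is a rough sketch of such a rebuild using the standard `sqlite3` module. The `papers` table and its columns are assumptions, not a defined schema, and `rebuild_index` is a hypothetical name.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, conn: sqlite3.Connection) -> int:
    """Recreate the papers table from every meta.json under papers/."""
    conn.execute("DROP TABLE IF EXISTS papers")
    conn.execute(
        "CREATE TABLE papers ("
        "paper_id TEXT PRIMARY KEY, title TEXT, "
        "paper_dir TEXT, summary_status TEXT)"
    )
    count = 0
    for meta_path in sorted((library_root / "papers").rglob("meta.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        # Store the directory relative to the library root so the
        # library can be moved without invalidating the index.
        paper_dir = meta_path.parent.relative_to(library_root).as_posix()
        conn.execute(
            "INSERT INTO papers VALUES (?, ?, ?, ?)",
            (meta["paper_id"], meta.get("title"), paper_dir,
             meta.get("summary_status")),
        )
        count += 1
    conn.commit()
    return count
```

A caller would open `db/paperlib.sqlite3` with `sqlite3.connect(...)` and pass the connection in; dropping and recreating the table keeps the rebuild idempotent.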

Key conventions

meta.json contains stable metadata and processing status
summary.json contains structured AI-generated enrichment
summary.md is rendered from summary.json
paper.md is generated from the PDF by an external converter such as MinerU
the database is rebuildable from the files above
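
To illustrate the convention that summary.md is rendered from summary.json, here is a small deterministic renderer. It is a sketch only: the key names (`one_sentence_summary`, `problem_statement`, and so on) and the heading text are assumptions derived from the field descriptions earlier in this document.

```python
# Assumed mapping from summary.json keys to Markdown section headings.
SECTIONS = [
    ("one_sentence_summary", "Summary"),
    ("problem_statement", "Problem"),
    ("method_overview", "Method"),
    ("main_results", "Main results"),
    ("limitations", "Limitations"),
]


def render_summary_md(summary: dict) -> str:
    """Render known summary fields as Markdown, skipping empty ones."""
    parts: list[str] = []
    for key, heading in SECTIONS:
        value = summary.get(key)
        if not value:
            continue
        parts.append(f"## {heading}")
        parts.append("")
        if isinstance(value, list):
            # List-valued fields become bullet lists.
            parts.extend(f"- {item}" for item in value)
        else:
            parts.append(str(value))
        parts.append("")
    return "\n".join(parts).rstrip() + "\n"


print(render_summary_md({
    "one_sentence_summary": "A short summary.",
    "main_results": ["first result", "second result"],
}))
```

Keeping the renderer a pure function of summary.json means summary.md never needs to be edited by hand and can be regenerated whenever the JSON changes.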
