Data Model

Library data layout

The paper library on disk should be human-browsable.

A typical layout looks like:

library_root/
  config/
    config.toml
    vocab.yaml
    prompts/
      summarize_paper.md

  inbox/
  papers/
    arxiv/
      2026/
        2604.12345/
          meta.json
          source.pdf
          paper.md
          summary.json
          summary.md
          ref.bib
          assets/
          logs/
            mineru.log
    local/
      sha256-.../
        meta.json
        source.pdf
        paper.md
        summary.json
        summary.md

  db/
    paperlib.sqlite3

  cache/
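
The layout above can be navigated with a small path helper. The following is a minimal sketch, not the project's actual code: the function names `arxiv_paper_dir` and `paper_files` are hypothetical, and the year directory is assumed to come from the `YYMM` prefix of a new-style arXiv ID (so `2604.12345` maps to `2026`), which matches the example tree but is not stated in this document.

```python
from pathlib import Path
from typing import Dict


def arxiv_paper_dir(library_root: Path, arxiv_id: str) -> Path:
    """Return the per-paper directory for an arXiv paper.

    Assumption: the ID is a new-style arXiv ID starting with YYMM,
    so the year directory is "20" + YY.
    """
    year = f"20{arxiv_id[:2]}"
    return library_root / "papers" / "arxiv" / year / arxiv_id


def paper_files(paper_dir: Path) -> Dict[str, Path]:
    """Map the file names used in the layout to full paths."""
    names = ["meta.json", "source.pdf", "paper.md", "summary.json", "summary.md"]
    return {name: paper_dir / name for name in names}


paper_dir = arxiv_paper_dir(Path("library_root"), "2604.12345")
files = paper_files(paper_dir)
```

Because every path is derived from the library root and the paper ID, the same helper works whether the library is browsed by a person or by a tool.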

Data boundaries

meta.json

meta.json should contain deterministic or near-deterministic information, drawn mostly from:

  • the import process
  • file system state
  • external paper metadata sources

Typical fields include:

  • paper_id, source_type, source_id
  • title, authors, published_date, updated_date, categories
  • pdf_path, paper_md_path, summary_json_path, summary_md_path
  • imported_at, conversion_status, summary_status

Keep speculative AI-generated content out of meta.json; it belongs in summary.json.
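
As an illustration, a meta.json record using the fields listed above might be written like this. It is a sketch under assumptions: the field names come from the list, but every value, the `paper_id` format, the status strings, and the choice of paths relative to the paper directory are hypothetical. The helper name `write_meta` is also invented for this example.

```python
import json
from pathlib import Path


def write_meta(paper_dir: Path, meta: dict) -> Path:
    """Write meta.json with stable key order so diffs stay readable."""
    paper_dir.mkdir(parents=True, exist_ok=True)
    path = paper_dir / "meta.json"
    text = json.dumps(meta, indent=2, sort_keys=True, ensure_ascii=False)
    path.write_text(text + "\n", encoding="utf-8")
    return path


# All values below are placeholders, not real paper data.
example_meta = {
    "paper_id": "arxiv:2604.12345",       # hypothetical ID format
    "source_type": "arxiv",
    "source_id": "2604.12345",
    "title": "Example Paper Title",
    "authors": ["A. Author", "B. Author"],
    "published_date": "2026-04-01",
    "updated_date": "2026-04-02",
    "categories": ["cs.LG"],
    "pdf_path": "source.pdf",             # assumed relative to the paper directory
    "paper_md_path": "paper.md",
    "summary_json_path": "summary.json",
    "summary_md_path": "summary.md",
    "imported_at": "2026-04-03T12:00:00Z",
    "conversion_status": "done",          # hypothetical status values
    "summary_status": "pending",
}
```

Sorting keys and writing a trailing newline keeps the file stable across rewrites, which suits metadata that is meant to be deterministic.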

summary.json

summary.json is optional enrichment and may be regenerated.

It should contain structured fields such as:

  • one-sentence summary, problem statement, method overview
  • main results, claimed contributions, assumptions, limitations
  • problem tags, technique tags, entities
  • relevance-to-user fields, recommended sections

summary.json must include a schema version.
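
One way to use the schema version is to check it on load and treat a stale summary as absent, so that it can be regenerated. The sketch below assumes a top-level key named `schema_version` and an integer version; neither the key name nor the version value is specified in this document, and `load_summary` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Optional

SUPPORTED_SCHEMA_VERSION = 1  # hypothetical current version


def load_summary(path: Path) -> Optional[dict]:
    """Load summary.json, or return None if it is missing or stale.

    Raises ValueError when the file has no schema version at all,
    since the version field is mandatory.
    """
    if not path.exists():
        return None  # summary.json is optional enrichment
    summary = json.loads(path.read_text(encoding="utf-8"))
    version = summary.get("schema_version")  # assumed key name
    if version is None:
        raise ValueError(f"{path} has no schema_version")
    if version != SUPPORTED_SCHEMA_VERSION:
        return None  # stale schema: the caller may regenerate the summary
    return summary
```

Returning `None` for both a missing and an outdated file lets callers handle "needs (re)generation" in a single branch.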

SQLite

SQLite stores searchable/indexed state and job-independent status.

It should help with:

  • listing papers
  • filtering and search
  • path lookup
  • tag lookup
  • status overview

It should never be treated as the only durable source of paper metadata.

Key conventions

  • meta.json contains stable metadata and processing status
  • summary.json contains structured AI-generated enrichment
  • summary.md is rendered from summary.json
  • paper.md is generated from the PDF by an external converter such as MinerU
  • the database is rebuildable from the files above
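
The last convention can be sketched as a rebuild routine that scans the per-paper meta.json files and repopulates the index. This is a minimal sketch, not the project's schema: the table name `papers`, its columns, and the function name `rebuild_index` are assumptions chosen for illustration.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, db_path: Path) -> int:
    """Recreate the papers table from every meta.json under papers/.

    Returns the number of papers indexed.
    """
    conn = sqlite3.connect(str(db_path))
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                " paper_id TEXT PRIMARY KEY,"
                " title TEXT,"
                " source_type TEXT,"
                " summary_status TEXT,"
                " dir_path TEXT NOT NULL)"
            )
            count = 0
            for meta_path in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_path.read_text(encoding="utf-8"))
                # Store the directory relative to the library root.
                rel_dir = meta_path.parent.relative_to(library_root).as_posix()
                conn.execute(
                    "INSERT OR REPLACE INTO papers VALUES (?, ?, ?, ?, ?)",
                    (
                        meta["paper_id"],
                        meta.get("title"),
                        meta.get("source_type"),
                        meta.get("summary_status"),
                        rel_dir,
                    ),
                )
                count += 1
        return count
    finally:
        conn.close()
```

Dropping and recreating the table keeps the routine idempotent: running it twice against the same files yields the same index, which is what makes the database safe to delete.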