init

2026-04-17 02:00:23 -04:00
commit 767326346b
10 changed files with 3467 additions and 0 deletions
@@ -0,0 +1,636 @@
+# AGENTS.md
+
+## Project overview
+
+`paperlib` is a local-first paper library engine with a CLI.
+
+It is designed to:
+
+- import PDF papers into a structured local library
+- convert PDFs into Markdown using external converters such as MinerU
+- maintain stable per-paper metadata files and a searchable index database
+- optionally generate AI-based structured summaries
+- expose a clean CLI that is useful both for humans and for higher-level automation tools such as an arXiv daily digest workflow
+
+`paperlib` is **not** primarily an AI app. AI summarization is an optional enrichment layer, not the core of the system.
+
+The project should remain useful even when:
+
+- no LLM API key is configured
+- no summarization is enabled
+- only import / convert / index / search features are used
+
+---
+
+## Core design principles
+
+### 1. Local-first
+
+User data lives locally in the paper library directory.
+
+The library must remain usable without a server, web app, or remote database.
+
+Prefer plain files plus SQLite over opaque internal state.
+
+### 2. CLI-first
+
+The CLI is the primary interface.
+
+All important workflows should be accessible from the CLI.
+
+The Python API is useful, but secondary.
+
+### 3. JSON files are the source of truth
+
+Per-paper JSON files in the library are the durable source of truth.
+
+Examples:
+
+- `meta.json`
+- `summary.json`
+
+SQLite is an index/cache layer, not the canonical data store.
+
+This means:
+
+- the index must be rebuildable from files
+- `reindex` should be able to repair the database from on-disk records
+- code must not assume the database alone is authoritative
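To make the rebuildable-index rule concrete, here is a minimal sketch of a reindex pass that discards the existing table and repopulates it from on-disk `meta.json` files. The `papers` table, its columns, and the `reindex` function name are assumptions for illustration; the source does not define a database schema.

```python
import json
import sqlite3
from pathlib import Path


def reindex(library_root: Path, db_path: Path) -> int:
    """Rebuild a minimal index from every meta.json under papers/.

    Returns the number of meta.json files indexed.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits the inserts on success, rolls them back on error
            # Hypothetical schema: the real index will carry more columns.
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, meta_path TEXT)"
            )
            count = 0
            for meta_path in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_path.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT OR REPLACE INTO papers VALUES (?, ?, ?)",
                    (meta["paper_id"], meta.get("title"), str(meta_path)),
                )
                count += 1
        return count
    finally:
        conn.close()
```

Because the table is dropped and rebuilt from files on each run, the function can be rerun safely and never reads state that exists only in the database.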
+
+### 4. AI is optional enrichment
+
+Importing, converting, indexing, listing, showing, and searching papers must work without AI.
+
+AI summarization should be isolated behind a clean interface.
+
+Do not make core workflows depend on an LLM provider.
+
+### 5. Stable machine-readable interfaces
+
+Important commands should support `--json` output so that other tools can consume them.
+
+Examples:
+
+- `paperlib import ... --json`
+- `paperlib summarize ... --json`
+- `paperlib show ... --json`
+- `paperlib export ... --format json`
+
+### 6. Small, explicit, inspectable components
+
+Prefer simple and explicit logic over large hidden frameworks.
+
+Keep components understandable:
+
+- importer
+- converter
+- renderer
+- summarizer
+- search
+- reindex
+- doctor
+
+Avoid unnecessary abstraction until there is a real need.
+
+---
+
+## Non-goals
+
+The following are currently out of scope unless explicitly planned later:
+
+- mandatory daemon architecture
+- web UI
+- multi-user remote service
+- cloud-first design
+- vector database as a required dependency
+- opaque agent framework controlling the core library
+- “fully autonomous research assistant” behavior
+
+---
+
+## Library data layout
+
+The paper library on disk should be human-browsable.
+
+A typical layout looks like:
+
+```text
+library_root/
+  config/
+    config.toml
+    vocab.yaml
+    prompts/
+      summarize_paper.md
+
+  inbox/
+  papers/
+    arxiv/
+      2026/
+        2604.12345/
+          meta.json
+          source.pdf
+          paper.md
+          summary.json
+          summary.md
+          ref.bib
+          assets/
+          logs/
+            mineru.log
+    local/
+      sha256-.../
+        meta.json
+        source.pdf
+        paper.md
+        summary.json
+        summary.md
+
+  db/
+    paperlib.sqlite3
+
+  cache/
+```
+
+Conventions:
+
+- `meta.json` contains stable metadata and processing status
+- `summary.json` contains structured AI-generated enrichment
+- `summary.md` is rendered from `summary.json`
+- `paper.md` is generated from the PDF by an external converter such as MinerU
+- the database is rebuildable from the files above
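The arXiv branch of the layout above could be computed by a small pure function like the one below. This is a sketch under two assumptions not stated in the source: that the year directory is derived from the `YY` prefix of a new-style arXiv ID, and that version suffixes such as `v2` are stripped before the call. The name `arxiv_paper_dir` is hypothetical.

```python
from pathlib import Path


def arxiv_paper_dir(library_root: Path, arxiv_id: str) -> Path:
    """Return papers/arxiv/<year>/<arxiv_id> for a new-style ID (YYMM.NNNNN)."""
    yymm, sep, number = arxiv_id.partition(".")
    if not (sep and len(yymm) == 4 and yymm.isdigit() and number.isdigit()):
        raise ValueError(f"not a new-style arXiv ID: {arxiv_id!r}")
    # Assumption: the year directory comes from the ID's YY prefix.
    year = 2000 + int(yymm[:2])
    return library_root / "papers" / "arxiv" / str(year) / arxiv_id
```

Keeping path computation free of I/O makes it easy to cover in the path layout tests.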
+
+---
+
+## Data model boundaries
+
+### `meta.json`
+
+`meta.json` should contain deterministic or near-deterministic information, mostly from:
+
+- import process
+- file system state
+- external paper metadata sources
+
+Typical fields include:
+
+- `paper_id`
+- `source_type`
+- `source_id`
+- `title`
+- `authors`
+- `published_date`
+- `updated_date`
+- `categories`
+- `pdf_path`
+- `paper_md_path`
+- `summary_json_path`
+- `summary_md_path`
+- `imported_at`
+- `conversion_status`
+- `summary_status`
+
+Avoid putting speculative AI content into `meta.json`.
+
+### `summary.json`
+
+`summary.json` is optional enrichment and may be regenerated.
+
+It should contain structured fields such as:
+
+- one-sentence summary
+- problem statement
+- method overview
+- main results
+- claimed contributions
+- assumptions
+- limitations
+- problem tags
+- technique tags
+- entities
+- relevance-to-user fields
+- recommended sections
+
+`summary.json` must include a schema version.
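One way to enforce the schema-version requirement is to check it at load time. The sketch below assumes a top-level `schema_version` key and a single supported version `1`; both are hypothetical, since the source does not name the field or any version numbers.

```python
import json
from pathlib import Path
from typing import Any

# Hypothetical: the source defines no concrete version numbers.
SUPPORTED_SUMMARY_SCHEMA_VERSIONS = (1,)


def load_summary(path: Path) -> dict[str, Any]:
    """Load summary.json and reject files without a supported schema version."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError(f"{path}: expected a JSON object")
    version = data.get("schema_version")  # hypothetical field name
    if version not in SUPPORTED_SUMMARY_SCHEMA_VERSIONS:
        raise ValueError(f"{path}: unsupported summary schema version {version!r}")
    return data
```

Failing loudly on an unknown version keeps an older build from misreading a newer file.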
+
+### SQLite
+
+SQLite stores searchable/indexed state and job-independent status.
+
+It should help with:
+
+- listing papers
+- filtering and search
+- path lookup
+- tag lookup
+- status overview
+
+But it should never be treated as the only durable source of paper metadata.
+
+---
+
+## CLI philosophy
+
+The CLI should be easy for humans and predictable for scripts.
+
+### Important CLI expectations
+
+- human-readable by default
+- machine-readable with `--json`
+- clear exit codes
+- no hidden background magic
+- no required daemon
+- stable command names
+- idempotent operations when possible
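The expectations above can be illustrated with a toy `show`-style command. This is a sketch, not the actual `paperlib` CLI: the in-memory `PAPERS` table, the program name, and the choice of exit code `1` for a missing paper are all assumptions.

```python
import argparse
import json
import sys
from typing import Optional, Sequence

# Hypothetical in-memory records standing in for a real library lookup.
PAPERS = {"2604.12345": {"paper_id": "2604.12345", "title": "Example title"}}


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Print one paper; return 0 on success and 1 if the paper is unknown."""
    parser = argparse.ArgumentParser(prog="paperlib-show-sketch")
    parser.add_argument("paper_id")
    parser.add_argument("--json", action="store_true", help="emit machine-readable output")
    args = parser.parse_args(argv)

    paper = PAPERS.get(args.paper_id)
    if paper is None:
        # Diagnostics go to stderr so stdout stays clean for scripts.
        print(f"error: unknown paper: {args.paper_id}", file=sys.stderr)
        return 1  # assumed convention: nonzero means failure
    if args.json:
        print(json.dumps(paper, sort_keys=True))
    else:
        print(f"{paper['paper_id']}  {paper['title']}")
    return 0
```

A real entry point would pass the return value to `sys.exit`. Returning the code from `main` rather than exiting inside it keeps the function testable without spawning a process.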
+
+### Expected command families
+
+Core commands include:
+
+- `init`
+- `import`
+- `import-dir`
+- `watch`
+- `convert`
+- `reindex`
+- `doctor`
+- `status`
+- `list`
+- `show`
+- `search`
+- `open`
+- `print-path`
+- `summarize`
+- `render-summary`
+- `export`
+
+When implementing commands, preserve a clear separation between:
+
+- mutation commands
+- read/query commands
+
+---
+
+## Architecture guidelines
+
+The codebase should be organized around a few clear layers.
+
+### 1. Core domain logic
+
+Pure Python logic for:
+
+- identifying papers
+- computing paths
+- importing PDFs
+- updating metadata
+- converting PDFs to Markdown
+- rendering summaries
+- rebuilding the index
+
+This layer should be testable without the CLI.
+
+### 2. CLI layer
+
+Thin wrappers around the core domain logic.
+
+The CLI should:
+
+- parse arguments
+- call core functions
+- format output
+- handle exit codes
+
+The CLI should not contain deep business logic.
+
+### 3. Optional integrations
+
+External systems should live in integration modules, for example:
+
+- MinerU wrapper
+- filesystem watch integration
+- ripgrep integration
+- LLM provider integration
+
+Keep these adapters isolated.
+
+### 4. Optional AI layer
+
+The AI summarization layer should be behind a stable abstraction.
+
+For example:
+
+- load prompt template
+- load paper markdown
+- load optional profile / vocabulary
+- call provider
+- validate structured output
+- write `summary.json`
+- render `summary.md`
+
+Avoid leaking provider-specific behavior into unrelated modules.
+
+---
+
+## AI collaboration guidelines
+
+When using AI to help develop this project, the AI should follow these rules.
+
+### 1. Respect the project boundaries
+
+Do not redesign `paperlib` into:
+
+- a web app
+- a required daemon
+- a monolithic agent system
+- a chat-first interface
+
+Unless explicitly asked, keep the project aligned with:
+
+- local-first
+- CLI-first
+- JSON/SQLite-based architecture
+- AI-optional enrichment
+
+### 2. Prefer incremental changes
+
+Make small, reviewable changes.
+
+When implementing a feature:
+
+- first clarify which module owns it
+- avoid broad refactors unless necessary
+- preserve existing CLI semantics unless intentionally changing them
+
+### 3. Keep file formats stable
+
+Changes to `meta.json` or `summary.json` are important.
+
+If changing schemas:
+
+- update the schema version
+- update documentation
+- consider migration or backward compatibility
+- do not silently break existing libraries
+
+### 4. Avoid hidden coupling
+
+Do not make unrelated modules depend on each other unnecessarily.
+
+For example:
+
+- `search` should not depend on LLM code
+- `import` should not require summarization
+- `reindex` should not assume a specific converter
+- `render-summary` should not require calling AI again
+
+### 5. Prefer explicit data flow
+
+When adding features, keep data flow obvious.
+
+For example:
+
+- `import` creates or updates metadata
+- `convert` creates `paper.md`
+- `summarize` creates `summary.json`
+- `render-summary` creates `summary.md`
+- `reindex` rebuilds SQLite from files
+
+### 6. Do not invent capabilities
+
+If a feature is not implemented yet, do not pretend it exists.
+
+Examples:
+
+- do not write code that assumes a daemon exists
+- do not assume remote sync exists
+- do not assume vector search exists
+- do not assume arXiv-specific logic belongs in the core library
+
+### 7. Prefer durable outputs over polished prose
+
+When designing AI summarization outputs, favor:
+
+- structured JSON
+- stable field names
+- grep-friendly rendered Markdown
+- concise, reusable information
+
+over:
+
+- highly polished review prose
+- flashy but unstable output formats
+
+---
+
+## Coding guidelines
+
+### General style
+
+- Prefer straightforward Python.
+- Use type hints.
+- Keep functions small and focused.
+- Add docstrings to public functions and classes.
+- Avoid overengineering.
+- Prefer composition over deep inheritance.
+
+### Error handling
+
+- Fail clearly.
+- Provide helpful error messages.
+- Distinguish user-facing CLI errors from internal exceptions.
+- Avoid silently swallowing errors.
+
+### Logging
+
+- Use structured and informative logging where useful.
+- Avoid noisy logs in normal CLI output.
+- Keep machine-readable command output clean when `--json` is used.
+
+### File operations
+
+- Be careful with moves, copies, and overwrites.
+- Prefer atomic writes for JSON files when possible.
+- Never corrupt existing metadata due to partial writes.
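A common way to get atomic JSON writes is to write to a temporary file in the same directory and then rename it over the target. The helper below is a minimal sketch of that pattern; the name `write_json_atomic` and the formatting options are assumptions, not part of the source.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def write_json_atomic(path: Path, data: Any) -> None:
    """Write JSON to a temp file next to the target, then rename it into place."""
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2, sort_keys=True, ensure_ascii=False)
            handle.write("\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        # Atomic on POSIX when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Leave the previous file untouched and clean up the temp file.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

If serialization fails or the process is interrupted before the rename, the existing `meta.json` is left exactly as it was.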
+
+### Idempotence
+
+Where possible, commands should behave safely when run multiple times.
+
+Examples:
+
+- re-importing the same file should detect duplicates
+- `render-summary` should be repeatable
+- `reindex` should be safe to rerun
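Duplicate detection on re-import can be sketched with a content hash. The example assumes SHA-256 over the PDF bytes, which the `local/sha256-.../` directory naming suggests but the source does not spell out; `file_sha256`, `is_duplicate`, and the `known_hashes` set are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(pdf_path: Path, known_hashes: set[str]) -> bool:
    """Report whether a file with identical content was already imported."""
    return file_sha256(pdf_path) in known_hashes
```

Hashing content rather than comparing file names means a renamed copy of the same PDF is still recognized.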
+
+### Testing
+
+Add tests for:
+
+- path layout logic
+- metadata read/write behavior
+- duplicate detection
+- reindex behavior
+- summary rendering
+- search behavior
+- CLI output contracts for core commands
+
+Prefer unit tests for core logic and targeted integration tests for CLI behavior.
+
+---
+
+## Search design guidelines
+
+Search should support at least two useful modes:
+
+### 1. Field-aware structured search
+
+Examples:
+
+- tags
+- authors
+- categories
+- titles
+- summary fields
+
+### 2. Full-text-friendly search
+
+Support grep-like workflows and integration with tools such as `ripgrep`.
+
+Do not require semantic/vector search as a baseline feature.
+
+If semantic search is added later, it must remain optional and must not displace simple grep/database search.
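As a baseline for the full-text mode, a grep-like search needs nothing beyond the standard library. The sketch below scans every `paper.md` under `papers/` for a case-insensitive substring; the function name and return shape are assumptions, and a real implementation might delegate to `ripgrep` when it is installed.

```python
from pathlib import Path


def grep_papers(library_root: Path, needle: str) -> list[tuple[Path, int, str]]:
    """Return (file, line number, line) for each line containing the needle."""
    hits: list[tuple[Path, int, str]] = []
    needle_lower = needle.lower()
    for md_path in sorted((library_root / "papers").rglob("paper.md")):
        text = md_path.read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.splitlines(), start=1):
            if needle_lower in line.lower():
                hits.append((md_path, line_no, line.strip()))
    return hits
```

Because it reads the plain files directly, this search keeps working even when the SQLite index is missing or stale.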
+
+---
+
+## Summarization design guidelines
+
+Summarization should produce reusable structured outputs.
+
+### Summarization goals
+
+A summary should be useful for:
+
+- later human review
+- grep-style reverse lookup
+- building daily/weekly reports
+- indexing by problem/method/result
+- personal research triage
+
+### Summarization output
+
+Prefer generating:
+
+- `summary.json` as the canonical structured output
+- `summary.md` rendered from JSON
+
+Do not make free-form Markdown the only output.
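Rendering `summary.md` from `summary.json` can be a pure function with no AI call. The sketch below uses three hypothetical field names and headings, since the real schema is not defined in the source; the point is only that the Markdown is derived deterministically from the structured data.

```python
from typing import Any

# Hypothetical field names and headings; the real schema may differ.
SECTIONS = [
    ("one_sentence_summary", "Summary"),
    ("problem_tags", "Problem tags"),
    ("limitations", "Limitations"),
]


def render_summary_md(summary: dict[str, Any]) -> str:
    """Render a deterministic, grep-friendly Markdown view of a summary dict."""
    lines: list[str] = []
    for key, heading in SECTIONS:
        value = summary.get(key)
        if value in (None, "", []):
            continue  # skip empty fields instead of inventing text
        lines.append(f"## {heading}")
        lines.append("")
        if isinstance(value, list):
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(str(value))
        lines.append("")
    return "\n".join(lines)
```

The same input always produces the same output, so `render-summary` stays repeatable and never needs to call a provider again.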
+
+### Prompting guidelines
+
+Prompts should instruct the model to:
+
+- extract factual information
+- avoid unsupported claims
+- use concise and stable language
+- prefer controlled vocabulary when available
+- return structured JSON only
+- use `null` or empty lists for unclear fields rather than hallucinating
+
+### Provider abstraction
+
+The summarizer should not be tightly coupled to a single LLM provider.
+
+Use a provider abstraction so the project can support:
+
+- OpenAI-compatible APIs
+- local models later if desired
+- different prompt templates and vocabularies
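A provider abstraction can be as small as a one-method protocol. In the sketch below, `SummaryProvider`, `FakeProvider`, `summarize`, and the `{paper}` placeholder are hypothetical names chosen for illustration; no real LLM client API is shown or implied.

```python
import json
from typing import Any, Protocol


class SummaryProvider(Protocol):
    """Anything that turns a prompt into raw model text."""

    def complete(self, prompt: str) -> str: ...


class FakeProvider:
    """Offline stand-in for tests; returns a canned response."""

    def __init__(self, response: str) -> None:
        self.response = response

    def complete(self, prompt: str) -> str:
        return self.response


def summarize(paper_md: str, template: str, provider: SummaryProvider) -> dict[str, Any]:
    """Fill the template, call the provider, and validate the JSON reply."""
    prompt = template.replace("{paper}", paper_md)  # hypothetical placeholder
    raw = provider.complete(prompt)
    data = json.loads(raw)  # raises ValueError if the reply is not valid JSON
    if not isinstance(data, dict):
        raise ValueError("provider must return a JSON object")
    return data
```

Because `summarize` depends only on the protocol, tests can run without an API key, and a provider for an OpenAI-compatible API or a local model can be swapped in without touching the calling code.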
+
+---
+
+## What belongs in `paperlib` vs higher-level tools
+
+`paperlib` is the base library engine.
+
+It should own:
+
+- PDF import
+- local storage layout
+- conversion to Markdown
+- metadata files
+- summary files
+- index maintenance
+- CLI access to those capabilities
+
+It should not own high-level discovery workflows such as:
+
+- arXiv daily fetching
+- personalized new-paper ranking
+- daily digest generation
+- automated paper downloading from external feeds
+
+Those belong in higher-level tools that consume `paperlib`.
+
+---
+
+## Expected development workflow
+
+When implementing a new feature, the preferred order is:
+
+1. identify the owning module
+2. define or update the data contract
+3. implement the core logic
+4. add tests
+5. expose it through the CLI if appropriate
+6. update docs and examples
+
+If a change affects on-disk formats or CLI behavior, document it clearly.
+
+---
+
+## Decision heuristics
+
+When uncertain, prefer the option that is:
+
+- more local-first
+- more inspectable
+- easier to test
+- easier to recover from
+- less coupled to AI
+- more stable for scripts
+- less magical
+
+Examples:
+
+- prefer JSON + Markdown over opaque internal blobs
+- prefer explicit CLI commands over hidden automation
+- prefer rebuildable indexes over fragile single-source databases
+- prefer optional AI enrichment over mandatory AI workflows
+
+---
+
+## Documentation expectations
+
+Important features should be documented in:
+
+- `README.md` for user-facing overview
+- `docs/cli.md` for command behavior
+- `docs/storage-layout.md` for on-disk structure
+- `docs/summary-schema.md` for `summary.json`
+- `docs/integration-guide.md` for higher-level tool integration
+
+Keep docs aligned with actual behavior.
+
+---
+
+## If you are an AI agent contributing code
+
+Before making a change, ask:
+
+1. Does this belong in `paperlib`, or in a higher-level workflow project?
+2. Does this preserve local-first and CLI-first design?
+3. Does this keep AI optional rather than mandatory?
+4. Does this keep JSON files as the durable source of truth?
+5. Does this keep the system understandable to a developer reading the code later?
+
+If the answer to any of these is no, reconsider the approach.
+
+```