update: update the dev-docs for AI agent

2026-04-17 19:54:24 -04:00
parent 227484e975
commit e870fe280a
5 changed files with 325 additions and 624 deletions
@@ -0,0 +1,55 @@
+# AI Integration Guidelines
+
+## Search design
+
+Search should support at least two useful modes:
+
+### 1. Field-aware structured search
+Example fields: tags, authors, categories, titles, and summary fields.
+
+### 2. Full-text-friendly search
+Support grep-like workflows and integration with tools such as `ripgrep`.
+
+Do not require semantic/vector search as a baseline feature.
+
+If semantic search is added later, it should be optional and must not displace simple grep/database search.
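
A minimal sketch of the two modes, assuming a hypothetical `papers` table whose columns mirror the example fields above. The names `SEARCHABLE_FIELDS`, `search_by_field`, and `grep_text` are illustrative and not part of the project; the substring scan stands in for a real `ripgrep` call so the example stays self-contained.

```python
from __future__ import annotations

import sqlite3

# Hypothetical whitelist: column names are never taken from user input directly.
SEARCHABLE_FIELDS = {"title", "authors", "categories"}


def search_by_field(conn: sqlite3.Connection, field: str, value: str) -> list[str]:
    """Field-aware search: substring match on one whitelisted column."""
    if field not in SEARCHABLE_FIELDS:
        raise ValueError(f"unsupported search field: {field!r}")
    # The column name comes from the whitelist; the value is a bound parameter.
    rows = conn.execute(
        f"SELECT paper_id FROM papers WHERE {field} LIKE ? ORDER BY paper_id",
        (f"%{value}%",),
    )
    return [row[0] for row in rows]


def grep_text(documents: dict[str, str], needle: str) -> list[str]:
    """Full-text search: case-insensitive substring scan over paper text."""
    lowered = needle.lower()
    return sorted(pid for pid, text in documents.items() if lowered in text.lower())
```

Neither function touches LLM code or embeddings, which keeps the baseline usable without any AI dependency. Note that `%` and `_` inside `value` act as `LIKE` wildcards in this sketch.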
+
+## Summarization design
+
+Summarization should produce reusable structured outputs.
+
+### Summarization goals
+
+A summary should be useful for:
+- later human review
+- grep-style reverse lookup
+- building daily/weekly reports
+- indexing by problem/method/result
+- personal research triage
+
+### Summarization output
+
+Prefer generating:
+- `summary.json` as the canonical structured output
+- `summary.md` rendered from JSON
+
+Do not make free-form Markdown the only output.
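
As an illustration of rendering `summary.md` from the canonical JSON, here is a small sketch. The field names `one_sentence_summary` and `problem_tags` are assumptions chosen for the example; the actual schema may differ.

```python
from __future__ import annotations


def render_summary_md(summary: dict) -> str:
    """Render Markdown from an already-parsed summary.json (pure function, no I/O)."""
    lines = ["# Summary", ""]
    one_sentence = summary.get("one_sentence_summary")
    lines.append(one_sentence if one_sentence else "_No summary available._")
    tags = summary.get("problem_tags") or []
    if tags:
        lines += ["", "## Problem tags", ""]
        lines += [f"- {tag}" for tag in tags]
    return "\n".join(lines) + "\n"
```

Because the function depends only on its input, rendering can be repeated at any time without calling a model again.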
+
+### Prompting guidelines
+
+Prompts should instruct the model to:
+- extract factual information
+- avoid unsupported claims
+- use concise and stable language
+- prefer controlled vocabulary when available
+- return structured JSON only
+- use `null` or empty lists for unclear fields rather than hallucinating
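
The last two guidelines can also be enforced on the receiving side. The sketch below parses a model reply strictly as JSON and normalizes unclear fields to `None` or an empty list; the field names are hypothetical placeholders.

```python
from __future__ import annotations

import json

# Hypothetical field names, used only for illustration.
SCALAR_FIELDS = ("one_sentence_summary", "method_overview")
LIST_FIELDS = ("problem_tags", "technique_tags")


def parse_model_reply(reply: str) -> dict:
    """Parse a reply that must be a JSON object; normalize unclear fields."""
    data = json.loads(reply)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    result: dict = {}
    for name in SCALAR_FIELDS:
        value = data.get(name)
        # Missing, empty, or non-string values become None rather than a guess.
        result[name] = value if isinstance(value, str) and value.strip() else None
    for name in LIST_FIELDS:
        value = data.get(name)
        items = value if isinstance(value, list) else []
        result[name] = [item for item in items if isinstance(item, str)]
    return result
```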
+
+### Provider abstraction
+
+The summarizer should not be tightly coupled to a single LLM provider.
+
+Use a provider abstraction so the project can support:
+- OpenAI-compatible APIs
+- local models later if desired
+- different prompt templates and vocabularies
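
One possible shape for such an abstraction, sketched with `typing.Protocol`. The names `SummaryProvider`, `CannedProvider`, and `summarize`, as well as the `{paper}` placeholder, are assumptions made for this example only.

```python
from __future__ import annotations

from typing import Protocol


class SummaryProvider(Protocol):
    """Anything that can turn a prompt into raw model text."""

    def complete(self, prompt: str) -> str: ...


class CannedProvider:
    """Stand-in provider for tests: records the prompt, returns a fixed reply."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.last_prompt: str | None = None

    def complete(self, prompt: str) -> str:
        self.last_prompt = prompt
        return self.reply


def summarize(paper_markdown: str, template: str, provider: SummaryProvider) -> str:
    """Fill the template and delegate to whichever provider was injected."""
    prompt = template.replace("{paper}", paper_markdown)
    return provider.complete(prompt)
```

An OpenAI-compatible client or a local model would then be one more class with a `complete` method, and the rest of the code would not need to change.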
@@ -0,0 +1,74 @@
+# Architecture Guidelines
+
+The codebase should be organized around a few clear layers.
+
+## 1. Core domain logic
+
+Pure Python logic for:
+
+- identifying papers
+- computing paths
+- importing PDFs
+- updating metadata
+- converting PDFs to Markdown
+- rendering summaries
+- rebuilding the index
+
+This layer should be testable without the CLI.
+
+## 2. CLI layer
+
+Thin wrappers around the core domain logic.
+
+The CLI should:
+
+- parse arguments
+- call core functions
+- format output
+- handle exit codes
+
+The CLI should not contain deep business logic.
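
A sketch of what "thin" could look like in practice. `lookup_path`, `UserError`, the program name `paperlib`, and the exit code `2` are all hypothetical choices for the example; the point is only that `main` parses, delegates, formats, and returns an exit code.

```python
from __future__ import annotations

import argparse
import json
import sys


class UserError(Exception):
    """An error caused by user input, safe to show without a traceback."""


def lookup_path(index: dict[str, str], paper_id: str) -> str:
    """Core logic (hypothetical): resolve a paper ID to its directory."""
    try:
        return index[paper_id]
    except KeyError:
        raise UserError(f"unknown paper: {paper_id}") from None


def main(argv: list[str], index: dict[str, str]) -> int:
    """CLI wrapper: parse arguments, call core, format output, pick exit code."""
    parser = argparse.ArgumentParser(prog="paperlib")
    parser.add_argument("paper_id")
    parser.add_argument("--json", action="store_true", help="emit JSON")
    args = parser.parse_args(argv)
    try:
        path = lookup_path(index, args.paper_id)
    except UserError as exc:
        # User-facing errors go to stderr with a non-zero exit code.
        print(f"error: {exc}", file=sys.stderr)
        return 2
    if args.json:
        print(json.dumps({"paper_id": args.paper_id, "path": path}))
    else:
        print(path)
    return 0
```

Since `lookup_path` takes plain data and raises a plain exception, it can be unit-tested without going through `argparse` at all.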
+
+## 3. Optional integrations
+
+External systems should live in integration modules, for example:
+
+- MinerU wrapper
+- filesystem watch integration
+- ripgrep integration
+- LLM provider integration
+
+Keep these adapters isolated.
+
+## 4. Optional AI layer
+
+The AI summarization layer should be behind a stable abstraction.
+
+For example:
+
+- load prompt template
+- load paper markdown
+- load optional profile / vocabulary
+- call provider
+- validate structured output
+- write `summary.json`
+- render `summary.md`
+
+Avoid leaking provider-specific behavior into unrelated modules.
+
+## Component boundaries
+
+Avoid hidden coupling:
+
+- `search` should not depend on LLM code
+- `import` should not require summarization
+- `reindex` should not assume a specific converter
+- `render-summary` should not require calling AI again
+
+Prefer explicit data flow:
+
+- `import` creates or updates metadata
+- `convert` creates `paper.md`
+- `summarize` creates `summary.json`
+- `render-summary` creates `summary.md`
+- `reindex` rebuilds SQLite from files
@@ -0,0 +1,51 @@
+# Coding Guidelines
+
+## General style
+
+- Prefer straightforward Python
+- Use type hints
+- Keep functions small and focused
+- Add docstrings to public functions and classes
+- Avoid overengineering
+- Prefer composition over deep inheritance
+
+## Error handling
+
+- Fail clearly
+- Provide helpful error messages
+- Distinguish user-facing CLI errors from internal exceptions
+- Avoid silently swallowing errors
+
+## Logging
+
+- Use structured and informative logging where useful
+- Avoid noisy logs in normal CLI output
+- Keep machine-readable command output clean when `--json` is used
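
One way to keep `--json` output clean is to route all logs to stderr and write results only to stdout. The sketch below assumes a logger named `paperlib` (a hypothetical name) and accepts explicit streams so it can be tested.

```python
from __future__ import annotations

import json
import logging
import sys
from typing import TextIO


def configure_logging(verbose: bool = False, stream: TextIO | None = None) -> logging.Logger:
    """Send logs to stderr (or the given stream), never to stdout."""
    logger = logging.getLogger("paperlib")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


def emit_result(result: dict, out: TextIO | None = None) -> None:
    """Write exactly one JSON document to stdout (or the given stream)."""
    target = out if out is not None else sys.stdout
    json.dump(result, target, sort_keys=True)
    target.write("\n")
```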
+
+## File operations
+
+- Be careful with moves, copies, and overwrites
+- Prefer atomic writes for JSON files when possible
+- Never corrupt existing metadata due to partial writes
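
A common pattern for atomic JSON writes is to write a temporary file in the same directory and then swap it into place with `os.replace`. The helper name `write_json_atomic` is illustrative; this is a sketch of the pattern, not the project's implementation.

```python
from __future__ import annotations

import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial one."""
    path = Path(path)
    # The temp file lives in the target directory so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2, sort_keys=True)
            handle.write("\n")
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # replaces the destination in one step
    except BaseException:
        # Clean up the temp file; the existing target is left untouched.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

If serialization fails midway, the existing `meta.json` is left exactly as it was.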
+
+## Idempotence
+
+Where possible, commands should behave safely when run multiple times.
+
+Examples:
+- re-importing the same file should detect duplicates
+- `render-summary` should be repeatable
+- `reindex` should be safe to rerun
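
Duplicate detection on re-import could be based on a content hash, for example. In this sketch `known_hashes` is a plain dictionary standing in for whatever store the project uses, and `import_pdf` is a hypothetical name.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def import_pdf(path: Path, known_hashes: dict[str, str]) -> tuple[str, bool]:
    """Return (digest, created); created is False when the content was seen before."""
    digest = file_sha256(path)
    if digest in known_hashes:
        return digest, False
    known_hashes[digest] = str(path)
    return digest, True
```

Running the import twice on the same file leaves `known_hashes` unchanged after the first call.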
+
+## Testing
+
+Add tests for:
+- path layout logic
+- metadata read/write behavior
+- duplicate detection
+- reindex behavior
+- summary rendering
+- search behavior
+- CLI output contracts for core commands
+
+Prefer unit tests for core logic and targeted integration tests for CLI behavior.
@@ -0,0 +1,92 @@
+# Data Model
+
+## Library data layout
+
+The paper library on disk should be human-browsable.
+
+A typical layout looks like:
+
+```text
+library_root/
+  config/
+    config.toml
+    vocab.yaml
+    prompts/
+      summarize_paper.md
+
+  inbox/
+  papers/
+    arxiv/
+      2026/
+        2604.12345/
+          meta.json
+          source.pdf
+          paper.md
+          summary.json
+          summary.md
+          ref.bib
+          assets/
+          logs/
+            mineru.log
+    local/
+      sha256-.../
+        meta.json
+        source.pdf
+        paper.md
+        summary.json
+        summary.md
+
+  db/
+    paperlib.sqlite3
+
+  cache/
+```
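
For the arXiv branch of this layout, the path can be computed from the ID alone. The sketch below assumes that the year directory is derived from the `YYMM` prefix of a new-style arXiv ID and that IDs carry no version suffix; both are assumptions, and `arxiv_paper_dir` is a hypothetical helper name.

```python
from __future__ import annotations

import re
from pathlib import Path

# New-style arXiv IDs look like YYMM.NNNNN (four or five digits after the dot).
_ARXIV_ID = re.compile(r"(\d{2})(\d{2})\.(\d{4,5})")


def arxiv_paper_dir(library_root: Path, arxiv_id: str) -> Path:
    """Compute papers/arxiv/<year>/<id>/ for a new-style arXiv ID."""
    match = _ARXIV_ID.fullmatch(arxiv_id)
    if match is None:
        raise ValueError(f"not a new-style arXiv ID: {arxiv_id!r}")
    year = 2000 + int(match.group(1))  # assumption: the year comes from the ID prefix
    return Path(library_root) / "papers" / "arxiv" / str(year) / arxiv_id
```

For the ID shown in the layout, this yields `library_root/papers/arxiv/2026/2604.12345`.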
+
+## Data boundaries
+
+### `meta.json`
+
+`meta.json` should contain deterministic or near-deterministic information, mostly from:
+
+- import process
+- file system state
+- external paper metadata sources
+
+Typical fields include:
+
+- `paper_id`, `source_type`, `source_id`
+- `title`, `authors`, `published_date`, `updated_date`, `categories`
+- `pdf_path`, `paper_md_path`, `summary_json_path`, `summary_md_path`
+- `imported_at`, `conversion_status`, `summary_status`
+
+Avoid putting speculative AI content into `meta.json`.
+
+### `summary.json`
+
+`summary.json` is optional enrichment and may be regenerated.
+
+It should contain structured fields such as:
+
+- one-sentence summary, problem statement, method overview
+- main results, claimed contributions, assumptions, limitations
+- problem tags, technique tags, entities
+- relevance-to-user fields, recommended sections
+
+`summary.json` must include a schema version.
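
A reader of `summary.json` can check the version before trusting any other field. The key name `schema_version`, its integer type, and the supported set `{1}` are assumptions made for this sketch.

```python
from __future__ import annotations

# Hypothetical: the versions this code knows how to read.
SUPPORTED_SCHEMA_VERSIONS = frozenset({1})


def check_schema_version(summary: dict) -> int:
    """Return the schema version, or raise ValueError if it is missing or unsupported."""
    version = summary.get("schema_version")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(version, bool) or not isinstance(version, int):
        raise ValueError("summary.json is missing an integer schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported summary schema version: {version}")
    return version
```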
+
+### SQLite
+
+SQLite stores searchable/indexed state and job-independent status.
+
+It should help with:
+- listing papers
+- filtering and search
+- path lookup
+- tag lookup
+- status overview
+
+But it should never be treated as the only durable source of paper metadata.
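
To illustrate why the files remain the durable source, here is a sketch of a rebuild that drops the index and repopulates it from every `meta.json` under `papers/`. The table layout and the function name `reindex` are assumptions; only `paper_id` and `title` are read, and both appear in the field list above.

```python
from __future__ import annotations

import json
import sqlite3
from pathlib import Path


def reindex(library_root: Path, conn: sqlite3.Connection) -> int:
    """Rebuild the papers table from meta.json files; safe to run repeatedly."""
    conn.execute("DROP TABLE IF EXISTS papers")
    conn.execute(
        "CREATE TABLE papers (paper_id TEXT PRIMARY KEY, title TEXT, meta_path TEXT)"
    )
    count = 0
    for meta_path in sorted(Path(library_root).glob("papers/**/meta.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT OR REPLACE INTO papers VALUES (?, ?, ?)",
            (meta["paper_id"], meta.get("title"), str(meta_path)),
        )
        count += 1
    conn.commit()
    return count
```

Deleting the database and running the rebuild again should produce the same rows, since nothing in the table originates in SQLite itself.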
+
+## Key conventions
+
+- `meta.json` contains stable metadata and processing status
+- `summary.json` contains structured AI-generated enrichment
+- `summary.md` is rendered from `summary.json`
+- `paper.md` is generated from the PDF by an external converter such as MinerU
+- the database is rebuildable from the files above