update: update the dev-docs for AI agent

2026-04-17 19:54:24 -04:00
parent 227484e975
commit e870fe280a
5 changed files with 325 additions and 624 deletions
+55
@@ -0,0 +1,55 @@
# AI Integration Guidelines
## Search design
Search should support at least two modes:
### 1. Field-aware structured search
Searchable fields include tags, authors, categories, titles, and summary fields.
### 2. Full-text-friendly search
Support grep-like workflows and integration with tools such as `ripgrep`.
Do not require semantic/vector search as a baseline feature.
If semantic search is added later, it must be optional and must not displace simple grep/database search.
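As a rough illustration of the two modes, the sketch below runs a field-aware lookup against an in-memory SQLite table and a grep-like scan over plain text. The table layout (`papers`, `paper_tags`), the function names, and the sample rows are assumptions made for this example, not the project's actual schema.

```python
import re
import sqlite3


def build_demo_index() -> sqlite3.Connection:
    """Create a tiny in-memory index (hypothetical schema, demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE papers (paper_id TEXT PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE paper_tags (paper_id TEXT, tag TEXT)")
    conn.executemany(
        "INSERT INTO papers VALUES (?, ?)",
        [("p1", "Sparse Attention for Long Documents"), ("p2", "Graph Sampling Methods")],
    )
    conn.executemany(
        "INSERT INTO paper_tags VALUES (?, ?)",
        [("p1", "attention"), ("p1", "nlp"), ("p2", "graphs")],
    )
    return conn


def search_by_tag(conn: sqlite3.Connection, tag: str) -> list[str]:
    """Field-aware mode: exact match on one structured field."""
    rows = conn.execute(
        "SELECT paper_id FROM paper_tags WHERE tag = ? ORDER BY paper_id", (tag,)
    ).fetchall()
    return [row[0] for row in rows]


def grep_text(text: str, pattern: str) -> list[tuple[int, str]]:
    """Full-text mode: return (line_number, line) pairs that match a regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]
```

In practice the full-text mode could be delegated to `ripgrep` over the `paper.md` files; the pure-Python scan only stands in for that here.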
## Summarization design
Summarization should produce reusable structured outputs.
### Summarization goals
A summary should be useful for:
- later human review
- grep-style reverse lookup
- building daily/weekly reports
- indexing by problem/method/result
- personal research triage
### Summarization output
Prefer generating:
- `summary.json` as the canonical structured output
- `summary.md` rendered from JSON
Do not make free-form Markdown the only output.
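A minimal sketch of rendering Markdown from the structured output, assuming hypothetical key names (`title`, `one_sentence_summary`, `main_results`, `limitations`) that the real schema may spell differently:

```python
import json


def render_summary_md(summary: dict) -> str:
    """Render Markdown from the structured summary; no model call is involved."""
    lines = [f"# {summary.get('title') or 'Untitled'}", ""]
    if summary.get("one_sentence_summary"):
        lines += [summary["one_sentence_summary"], ""]
    # Key names below are assumptions for illustration only.
    for heading, key in (("Main results", "main_results"), ("Limitations", "limitations")):
        items = summary.get(key) or []
        if items:  # empty sections are skipped entirely
            lines.append(f"## {heading}")
            lines += [f"- {item}" for item in items]
            lines.append("")
    return "\n".join(lines).rstrip() + "\n"


summary = json.loads(
    '{"title": "Example Paper", "one_sentence_summary": "A short claim.",'
    ' "main_results": ["Result A", "Result B"], "limitations": []}'
)
markdown = render_summary_md(summary)
```

Because the renderer is a pure function of the JSON, rerunning it on the same input yields the same Markdown.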
### Prompting guidelines
Prompts should instruct the model to:
- extract factual information
- avoid unsupported claims
- use concise and stable language
- prefer controlled vocabulary when available
- return structured JSON only
- use `null` or empty lists for unclear fields rather than hallucinating
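On the receiving side, a parser can enforce the "JSON only" and "`null` or empty lists" rules. The sketch below is one possible shape; the field names and the choice of which fields are lists are assumptions for illustration.

```python
import json

# Hypothetical field names; scalar fields default to None, list fields to [].
SCALAR_FIELDS = ("one_sentence_summary", "problem_statement", "method_overview")
LIST_FIELDS = ("main_results", "limitations", "problem_tags")


def parse_summary_response(raw: str) -> dict:
    """Parse a model response that is expected to be a single JSON object."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON text
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    result: dict = {}
    for key in SCALAR_FIELDS:
        value = data.get(key)
        # Blank or missing text is treated as unclear and stored as None.
        result[key] = value.strip() if isinstance(value, str) and value.strip() else None
    for key in LIST_FIELDS:
        value = data.get(key)
        # Anything that is not a list (including null) becomes an empty list.
        result[key] = [str(item) for item in value] if isinstance(value, list) else []
    return result
```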
### Provider abstraction
The summarizer should not be tightly coupled to a single LLM provider.
Use a provider abstraction so the project can support:
- OpenAI-compatible APIs
- local models later if desired
- different prompt templates and vocabularies
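One way to express such an abstraction in Python is a small `Protocol`. The names `SummaryProvider`, `CannedProvider`, and `run_prompt`, and the `{paper}` placeholder, are hypothetical; a real OpenAI-compatible or local-model adapter would implement the same single method.

```python
from dataclasses import dataclass
from typing import Protocol


class SummaryProvider(Protocol):
    """Hypothetical interface: turn a prompt into raw model text."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class CannedProvider:
    """Stand-in provider for tests; returns a fixed response, no network call."""

    canned_response: str
    last_prompt: str = ""

    def complete(self, prompt: str) -> str:
        self.last_prompt = prompt  # recorded so tests can inspect the prompt
        return self.canned_response


def run_prompt(provider: SummaryProvider, template: str, paper_markdown: str) -> str:
    """Fill the template and delegate to whichever provider was injected."""
    prompt = template.replace("{paper}", paper_markdown)
    return provider.complete(prompt)
```

Callers depend only on `complete`, so swapping providers or prompt templates does not touch the summarizer itself.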
+74
@@ -0,0 +1,74 @@
# Architecture Guidelines
The codebase should be organized around a few clear layers.
## 1. Core domain logic
Pure Python logic for:
- identifying papers
- computing paths
- importing PDFs
- updating metadata
- converting PDFs to Markdown
- rendering summaries
- rebuilding the index
This layer should be testable without the CLI.
## 2. CLI layer
Thin wrappers around the core domain logic.
The CLI should:
- parse arguments
- call core functions
- format output
- handle exit codes
The CLI should not contain deep business logic.
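A thin wrapper along these lines might look as follows. This is a sketch only: the program name `paperlib`, the `search` subcommand's flags, the stand-in `search_papers` function, and the exit-code convention are all assumptions.

```python
import argparse
import json
from typing import Optional


def search_papers(tag: str) -> list[dict]:
    """Stand-in for a core-layer function; a real one would query the index."""
    demo = [{"paper_id": "p1", "tags": ["attention"]}, {"paper_id": "p2", "tags": ["graphs"]}]
    return [paper for paper in demo if tag in paper["tags"]]


def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(prog="paperlib")  # program name is an assumption
    subcommands = parser.add_subparsers(dest="command", required=True)
    search = subcommands.add_parser("search")
    search.add_argument("--tag", required=True)
    search.add_argument("--json", action="store_true", dest="as_json")
    args = parser.parse_args(argv)

    results = search_papers(args.tag)  # all real work happens in the core layer
    if args.as_json:
        print(json.dumps(results))  # machine-readable output only, no extra text
    else:
        for paper in results:
            print(paper["paper_id"])
    return 0 if results else 1  # hypothetical convention: 1 means "no matches"
```

An entry point would typically pass the return value to `sys.exit`, which keeps `main` itself easy to call from tests.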
## 3. Optional integrations
External systems should live in integration modules, for example:
- MinerU wrapper
- filesystem watch integration
- ripgrep integration
- LLM provider integration
Keep these adapters isolated.
## 4. Optional AI layer
The AI summarization layer should sit behind a stable abstraction.
For example, a summarization run might:
- load prompt template
- load paper markdown
- load optional profile / vocabulary
- call provider
- validate structured output
- write `summary.json`
- render `summary.md`
Avoid leaking provider-specific behavior into unrelated modules.
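The steps above can be sketched as a single function that receives the provider as a plain callable. Everything named here (`summarize_paper`, the `{paper}` placeholder, the `schema_version` key, the one-line rendering) is illustrative, and the optional profile/vocabulary step is left out for brevity.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def summarize_paper(paper_dir: Path, template: str, call_provider: Callable[[str], str]) -> dict:
    """Run the pipeline steps in order for one paper directory."""
    paper_markdown = (paper_dir / "paper.md").read_text(encoding="utf-8")
    prompt = template.replace("{paper}", paper_markdown)
    raw = call_provider(prompt)  # the only provider-specific step
    summary = json.loads(raw)
    # Minimal validation; the key name "schema_version" is an assumption.
    if not isinstance(summary, dict) or "schema_version" not in summary:
        raise ValueError("provider output is not a versioned JSON object")
    (paper_dir / "summary.json").write_text(
        json.dumps(summary, indent=2) + "\n", encoding="utf-8"
    )
    rendered = f"# Summary\n\n{summary.get('one_sentence_summary') or ''}\n"
    (paper_dir / "summary.md").write_text(rendered, encoding="utf-8")
    return summary


def fake_provider(prompt: str) -> str:
    """Test double that returns a fixed, valid response."""
    return json.dumps({"schema_version": 1, "one_sentence_summary": "Demo."})


with tempfile.TemporaryDirectory() as tmp:
    paper_dir = Path(tmp)
    (paper_dir / "paper.md").write_text("# Demo paper\n", encoding="utf-8")
    result = summarize_paper(paper_dir, "Summarize:\n{paper}", fake_provider)
    written = json.loads((paper_dir / "summary.json").read_text(encoding="utf-8"))
    rendered_md = (paper_dir / "summary.md").read_text(encoding="utf-8")
```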
## Component boundaries
Avoid hidden coupling:
- `search` should not depend on LLM code
- `import` should not require summarization
- `reindex` should not assume a specific converter
- `render-summary` should not require calling AI again
Prefer explicit data flow:
- `import` creates or updates metadata
- `convert` creates `paper.md`
- `summarize` creates `summary.json`
- `render-summary` creates `summary.md`
- `reindex` rebuilds SQLite from files
+51
@@ -0,0 +1,51 @@
# Coding Guidelines
## General style
- Prefer straightforward Python
- Use type hints
- Keep functions small and focused
- Add docstrings to public functions and classes
- Avoid overengineering
- Prefer composition over deep inheritance
## Error handling
- Fail clearly
- Provide helpful error messages
- Distinguish user-facing CLI errors from internal exceptions
- Avoid silently swallowing errors
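One possible way to separate the two kinds of failure is a dedicated exception type that only the CLI boundary translates into a message and exit code. `UserError`, `run_command`, and the exit code `2` are assumptions for this sketch.

```python
import sys


class UserError(Exception):
    """Hypothetical exception type for problems the user can fix."""


def load_required(path_exists: bool, path: str) -> str:
    """Toy core function that fails clearly when the input is missing."""
    if not path_exists:
        raise UserError(f"paper not found: {path}")
    return path


def run_command(command, *args) -> int:
    """Map user-facing errors to an exit code at the CLI boundary only."""
    try:
        command(*args)
    except UserError as error:
        print(f"error: {error}", file=sys.stderr)  # short message, no traceback
        return 2  # exit code chosen for illustration
    # Any other exception propagates with its traceback so bugs are not hidden.
    return 0
```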
## Logging
- Use structured and informative logging where useful
- Avoid noisy logs in normal CLI output
- Keep machine-readable command output clean when `--json` is used
## File operations
- Be careful with moves, copies, and overwrites
- Prefer atomic writes for JSON files when possible
- Never corrupt existing metadata due to partial writes
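A common pattern for atomic JSON writes is to write a temporary file in the same directory and then rename it over the target. The helper name `write_json_atomic` is hypothetical; the sketch relies on `os.replace`, which is atomic when both paths are on the same filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write JSON to a temp file next to the target, then rename over it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2, ensure_ascii=False)
            handle.write("\n")
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # readers see either the old or the new file
    except BaseException:
        # Remove the temp file; the existing target is left untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If serialization fails halfway, only the temporary file is affected, so an existing `meta.json` stays intact.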
## Idempotence
Where possible, commands should behave safely when run multiple times.
Examples:
- re-importing the same file should detect duplicates
- `render-summary` should be repeatable
- `reindex` should be safe to rerun
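Duplicate detection on re-import can be sketched with a content hash. The directory naming used here (`local/sha256-<full hex digest>`) and the function names are assumptions for illustration; the project may use a different identifier scheme.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash the file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def import_pdf(source: Path, papers_root: Path) -> tuple[Path, bool]:
    """Copy source into a content-addressed directory; return (path, created)."""
    # Hypothetical naming: the full digest is used as the directory suffix.
    target_dir = papers_root / "local" / f"sha256-{file_sha256(source)}"
    target = target_dir / "source.pdf"
    if target.exists():
        return target, False  # duplicate: the same bytes were imported before
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)
    return target, True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    pdf = root / "inbox.pdf"
    pdf.write_bytes(b"%PDF-1.4 demo")
    first_path, first_created = import_pdf(pdf, root / "papers")
    second_path, second_created = import_pdf(pdf, root / "papers")
```

Running the import twice returns the same path and reports `created=False` the second time, so the command is safe to repeat.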
## Testing
Add tests for:
- path layout logic
- metadata read/write behavior
- duplicate detection
- reindex behavior
- summary rendering
- search behavior
- CLI output contracts for core commands
Prefer unit tests for core logic and targeted integration tests for CLI behavior.
+92
@@ -0,0 +1,92 @@
# Data Model
## Library data layout
The paper library on disk should be human-browsable.
A typical layout looks like:
```text
library_root/
  config/
    config.toml
    vocab.yaml
    prompts/
      summarize_paper.md
  inbox/
  papers/
    arxiv/
      2026/
        2604.12345/
          meta.json
          source.pdf
          paper.md
          summary.json
          summary.md
          ref.bib
          assets/
          logs/
            mineru.log
    local/
      sha256-.../
        meta.json
        source.pdf
        paper.md
        summary.json
        summary.md
  db/
    paperlib.sqlite3
  cache/
```
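Path computation for this layout could be sketched as below. The nesting `papers/arxiv/<year>/<id>/` and the rule that the year directory is derived from the `YY` prefix of a new-style arXiv ID are assumptions read off the example above; the function names are hypothetical.

```python
import re
from pathlib import Path

# New-style arXiv IDs have the form YYMM.NNNN or YYMM.NNNNN.
ARXIV_ID = re.compile(r"^(\d{2})(\d{2})\.\d{4,5}$")


def arxiv_paper_dir(library_root: Path, arxiv_id: str) -> Path:
    """Return the directory for one arXiv paper under the assumed layout."""
    match = ARXIV_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"not a new-style arXiv ID: {arxiv_id!r}")
    year = 2000 + int(match.group(1))  # assumption: year comes from the YY prefix
    return library_root / "papers" / "arxiv" / str(year) / arxiv_id


def paper_files(paper_dir: Path) -> dict[str, Path]:
    """Map the per-paper file names shown in the layout to full paths."""
    names = ("meta.json", "source.pdf", "paper.md", "summary.json", "summary.md")
    return {name: paper_dir / name for name in names}
```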
## Data boundaries
### `meta.json`
`meta.json` should contain deterministic or near-deterministic information, mostly from:
- import process
- file system state
- external paper metadata sources
Typical fields include:
- `paper_id`, `source_type`, `source_id`
- `title`, `authors`, `published_date`, `updated_date`, `categories`
- `pdf_path`, `paper_md_path`, `summary_json_path`, `summary_md_path`
- `imported_at`, `conversion_status`, `summary_status`
Avoid putting speculative AI content into `meta.json`.
### `summary.json`
`summary.json` is optional enrichment and may be regenerated.
It should contain structured fields such as:
- one-sentence summary, problem statement, method overview
- main results, claimed contributions, assumptions, limitations
- problem tags, technique tags, entities
- relevance-to-user fields, recommended sections
`summary.json` must include a schema version.
### SQLite
SQLite stores searchable/indexed state and job-independent status.
It should help with:
- listing papers
- filtering and search
- path lookup
- tag lookup
- status overview
SQLite must never be treated as the only durable source of paper metadata; the files on disk remain authoritative.
## Key conventions
- `meta.json` contains stable metadata and processing status
- `summary.json` contains structured AI-generated enrichment
- `summary.md` is rendered from `summary.json`
- `paper.md` is generated from the PDF by an external converter such as MinerU
- the database is rebuildable from the files above
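To illustrate the last convention, the sketch below rebuilds a SQLite table purely from `meta.json` files. The table schema and the function name `rebuild_index` are assumptions; only `paper_id` and `title` are taken from the field list above.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def rebuild_index(papers_root: Path, conn: sqlite3.Connection) -> int:
    """Drop and recreate the index from every meta.json under papers_root."""
    conn.execute("DROP TABLE IF EXISTS papers")
    # Hypothetical schema: one row per paper, pointing back at its directory.
    conn.execute("CREATE TABLE papers (paper_id TEXT PRIMARY KEY, title TEXT, dir TEXT)")
    count = 0
    for meta_path in sorted(papers_root.rglob("meta.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT INTO papers VALUES (?, ?, ?)",
            (meta["paper_id"], meta.get("title"), str(meta_path.parent)),
        )
        count += 1
    conn.commit()
    return count
```

Because the table is dropped and rebuilt from the files each time, rerunning the function leaves the index in the same state.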