update: update the dev-docs for AI agent
@@ -0,0 +1,55 @@
# AI Integration Guidelines

## Search design

Search should support at least two useful modes:

### 1. Field-aware structured search
Examples: tags, authors, categories, titles, summary fields

### 2. Full-text-friendly search
Support grep-like workflows and integration with tools such as `ripgrep`.

Do not require semantic/vector search as a baseline feature.

If semantic search is ever added later, it should be optional and must not displace simple grep/database search.
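The two modes above could be sketched roughly as follows. This is only an illustration: the `papers` table, its columns, the `SEARCHABLE_FIELDS` whitelist, and both function names are assumptions, not part of the project. The `rg` flags used (`--ignore-case`, `--files-with-matches`, `--glob`, `--regexp`) are standard ripgrep options.

```python
import sqlite3
from pathlib import Path

# Hypothetical whitelist: only these columns of a hypothetical `papers` table are searchable.
SEARCHABLE_FIELDS = {"title", "authors", "categories", "tags"}


def search_by_field(conn: sqlite3.Connection, field: str, value: str) -> list[str]:
    """Return paper IDs whose `field` contains `value` (SQLite LIKE, ASCII case-insensitive)."""
    if field not in SEARCHABLE_FIELDS:
        raise ValueError(f"unsupported search field: {field}")
    # The column name comes from the whitelist; the value is bound as a parameter.
    rows = conn.execute(
        f"SELECT paper_id FROM papers WHERE {field} LIKE ? ORDER BY paper_id",
        (f"%{value}%",),
    )
    return [row[0] for row in rows]


def ripgrep_command(pattern: str, library_root: Path) -> list[str]:
    """Build an `rg` invocation that lists the paper.md files matching `pattern`."""
    return [
        "rg",
        "--ignore-case",
        "--files-with-matches",
        "--glob", "paper.md",
        "--regexp", pattern,
        str(library_root),
    ]
```

The command list can be passed to `subprocess.run` when `rg` is installed. Neither function touches LLM code, which keeps plain search usable on its own.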

## Summarization design

Summarization should produce reusable structured outputs.

### Summarization goals

A summary should be useful for:

- later human review
- grep-style reverse lookup
- building daily/weekly reports
- indexing by problem/method/result
- personal research triage

### Summarization output

Prefer generating:

- `summary.json` as the canonical structured output
- `summary.md` rendered from JSON

Do not make free-form Markdown the only output.
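As a rough sketch of "JSON is canonical, Markdown is rendered", the function below turns a parsed `summary.json` document into Markdown text. The key names (`title`, `one_sentence_summary`, `problem_statement`, `method_overview`, `main_results`) and the function name are illustrative assumptions; the real schema may differ.

```python
def render_summary_markdown(summary: dict) -> str:
    """Render a Markdown view of a parsed summary.json document (hypothetical keys)."""
    lines = [f"# {summary.get('title') or 'Untitled'}", ""]
    if summary.get("one_sentence_summary"):
        lines += [summary["one_sentence_summary"], ""]
    # Text sections are emitted only when the field is present and non-empty.
    for heading, key in (("Problem", "problem_statement"), ("Method", "method_overview")):
        if summary.get(key):
            lines += [f"## {heading}", "", summary[key], ""]
    if summary.get("main_results"):
        lines += ["## Main results", ""]
        lines += [f"- {item}" for item in summary["main_results"]]
        lines.append("")
    return "\n".join(lines)
```

Because the function depends only on its input, rendering is repeatable and never needs another model call.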

### Prompting guidelines

Prompts should instruct the model to:

- extract factual information
- avoid unsupported claims
- use concise and stable language
- prefer controlled vocabulary when available
- return structured JSON only
- use `null` or empty lists for unclear fields rather than hallucinating
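The last two points can also be enforced on the receiving side. The sketch below, with hypothetical field names and a hypothetical `parse_model_output` helper, rejects non-JSON replies and normalizes unclear fields to `None` or an empty list instead of inventing values.

```python
import json

# Hypothetical field names, chosen for illustration only.
TEXT_FIELDS = ("one_sentence_summary", "problem_statement", "method_overview")
LIST_FIELDS = ("main_results", "limitations", "problem_tags")


def parse_model_output(raw: str) -> dict:
    """Parse a model reply that is expected to be a single JSON object."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on prose replies
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    cleaned: dict = {}
    for key in TEXT_FIELDS:
        value = data.get(key)
        if isinstance(value, str):
            value = value.strip() or None  # blank strings count as "unclear"
        elif value is not None:
            raise ValueError(f"{key} must be a string or null")
        cleaned[key] = value
    for key in LIST_FIELDS:
        value = data.get(key)
        if value is None:
            value = []  # missing or null lists become empty lists
        if not isinstance(value, list) or not all(isinstance(item, str) for item in value):
            raise ValueError(f"{key} must be a list of strings")
        cleaned[key] = value
    return cleaned
```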

### Provider abstraction

The summarizer should not be tightly coupled to a single LLM provider.

Use a provider abstraction so the project can support:

- OpenAI-compatible APIs
- local models later if desired
- different prompt templates and vocabularies
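One possible shape for such an abstraction is a small `typing.Protocol`. Everything named here (`SummaryProvider`, `complete`, `FakeProvider`, `summarize`, the `{paper}` placeholder) is a hypothetical sketch rather than the project's actual interface.

```python
from typing import Protocol


class SummaryProvider(Protocol):
    """Anything that can turn a prompt into a raw text reply."""

    def complete(self, prompt: str) -> str: ...


class FakeProvider:
    """Offline stand-in for tests; records prompts and returns a fixed reply."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.reply


def summarize(paper_markdown: str, template: str, provider: SummaryProvider) -> str:
    """Fill the template and delegate to whichever provider was injected."""
    prompt = template.replace("{paper}", paper_markdown)
    return provider.complete(prompt)
```

An OpenAI-compatible client or a local model wrapper would implement the same `complete` method, so the calling code would not need to change.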
@@ -0,0 +1,74 @@
# Architecture Guidelines

The codebase should be organized around a few clear layers.

## 1. Core domain logic

Pure Python logic for:

- identifying papers
- computing paths
- importing PDFs
- updating metadata
- converting PDFs to Markdown
- rendering summaries
- rebuilding the index

This layer should be testable without the CLI.

## 2. CLI layer

Thin wrappers around the core domain logic.

The CLI should:

- parse arguments
- call core functions
- format output
- handle exit codes

The CLI should not contain deep business logic.
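A thin wrapper in this spirit might look like the sketch below. The program name `paperlib`, the `--root` option, and the `list_papers` stand-in are assumptions made for illustration; only the `--json` flag is mentioned elsewhere in these docs.

```python
import argparse
import json
from typing import Optional


def list_papers(library_root: str) -> list[str]:
    """Stand-in for a core function; a real one would read the library on disk."""
    return ["arxiv:2604.12345"]


def main(argv: Optional[list[str]] = None) -> int:
    """Parse arguments, call the core function, format output, return an exit code."""
    parser = argparse.ArgumentParser(prog="paperlib")
    parser.add_argument("--root", default=".", help="library root directory")
    parser.add_argument("--json", action="store_true", help="emit machine-readable output")
    args = parser.parse_args(argv)

    papers = list_papers(args.root)  # all real work happens in the core layer
    if args.json:
        print(json.dumps({"papers": papers}))
    else:
        for paper_id in papers:
            print(paper_id)
    return 0
```

Because `main` accepts `argv` and returns an integer, it can be exercised in tests without spawning a subprocess.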

## 3. Optional integrations

External systems should live in integration modules, for example:

- MinerU wrapper
- filesystem watch integration
- ripgrep integration
- LLM provider integration

Keep these adapters isolated.

## 4. Optional AI layer

The AI summarization layer should be behind a stable abstraction.

For example:

- load prompt template
- load paper markdown
- load optional profile / vocabulary
- call provider
- validate structured output
- write `summary.json`
- render `summary.md`

Avoid leaking provider-specific behavior into unrelated modules.

## Component boundaries

Avoid hidden coupling:

- `search` should not depend on LLM code
- `import` should not require summarization
- `reindex` should not assume a specific converter
- `render-summary` should not require calling AI again

Prefer explicit data flow:

- `import` creates or updates metadata
- `convert` creates `paper.md`
- `summarize` creates `summary.json`
- `render-summary` creates `summary.md`
- `reindex` rebuilds SQLite from files
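To make the last step of this data flow concrete, here is a minimal sketch of a `reindex` that rebuilds an index purely from `meta.json` files. The table name, its columns, and the function signature are hypothetical; the real index likely stores more fields.

```python
import json
import sqlite3
from pathlib import Path


def reindex(library_root: Path, conn: sqlite3.Connection) -> int:
    """Drop and rebuild a hypothetical `papers` table from meta.json files on disk."""
    conn.execute("DROP TABLE IF EXISTS papers")
    conn.execute(
        "CREATE TABLE papers (paper_id TEXT PRIMARY KEY, title TEXT, summary_status TEXT)"
    )
    count = 0
    # Files are the source of truth; the converter and the LLM are never consulted.
    for meta_path in sorted((library_root / "papers").rglob("meta.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT OR REPLACE INTO papers VALUES (?, ?, ?)",
            (meta["paper_id"], meta.get("title"), meta.get("summary_status")),
        )
        count += 1
    conn.commit()
    return count
```

Rebuilding from scratch keeps the command safe to rerun: a second call produces the same table contents.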
@@ -0,0 +1,51 @@
# Coding Guidelines

## General style

- Prefer straightforward Python
- Use type hints
- Keep functions small and focused
- Add docstrings to public functions and classes
- Avoid overengineering
- Prefer composition over deep inheritance

## Error handling

- Fail clearly
- Provide helpful error messages
- Distinguish user-facing CLI errors from internal exceptions
- Avoid silently swallowing errors
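One way to separate the two kinds of failure is a dedicated exception type for expected, user-facing errors. `CliError`, `run_cli`, and the default exit code of 2 are hypothetical choices for this sketch.

```python
import sys
from typing import Callable


class CliError(Exception):
    """Expected, user-facing failure such as a missing file or an unknown paper ID."""

    def __init__(self, message: str, exit_code: int = 2) -> None:
        super().__init__(message)
        self.exit_code = exit_code


def run_cli(command: Callable[[], None]) -> int:
    """Run `command` and translate user-facing failures into exit codes."""
    try:
        command()
    except CliError as exc:
        # Short message on stderr, no traceback: the user can act on this.
        print(f"error: {exc}", file=sys.stderr)
        return exc.exit_code
    # Any other exception is a bug and propagates with its traceback intact.
    return 0
```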

## Logging

- Use structured and informative logging where useful
- Avoid noisy logs in normal CLI output
- Keep machine-readable command output clean when `--json` is used
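A common way to keep `--json` output clean is to send all logging to stderr and reserve stdout for the result. The logger name `paperlib` and the helper names below are assumptions for this sketch.

```python
import json
import logging
import sys


def configure_logging(verbose: bool = False) -> logging.Logger:
    """Attach a single stderr handler; quiet by default, chatty with `verbose`."""
    logger = logging.getLogger("paperlib")
    logger.handlers.clear()  # avoid duplicate handlers when called more than once
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    logger.propagate = False
    return logger


def format_result(result: dict) -> str:
    """Serialize a command result as one stable JSON line."""
    return json.dumps(result, sort_keys=True)


def emit_result(result: dict) -> None:
    """Write only the JSON result to stdout so callers can pipe it safely."""
    sys.stdout.write(format_result(result) + "\n")
```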

## File operations

- Be careful with moves, copies, and overwrites
- Prefer atomic writes for JSON files when possible
- Never corrupt existing metadata due to partial writes
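A typical atomic-write pattern is to write to a temporary file in the same directory and then swap it into place with `os.replace`. The helper name `write_json_atomic` is hypothetical; the standard library calls are real.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write `data` to `path` so readers see either the old file or the new one."""
    path = Path(path)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2, sort_keys=True)
            handle.write("\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)
    except BaseException:
        # On any failure, remove the temp file and leave the original untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

If serialization fails midway, the existing `meta.json` or `summary.json` is never opened for writing, so it cannot be left half-written.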

## Idempotence

Where possible, commands should behave safely when run multiple times.

Examples:
- re-importing the same file should detect duplicates
- `render-summary` should be repeatable
- `reindex` should be safe to rerun
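Duplicate detection on re-import could, for instance, key on a content hash. In the sketch below the in-memory `known_hashes` mapping, the `import_pdf` signature, and the ID format are all hypothetical; a real implementation would persist this state.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def import_pdf(path: Path, known_hashes: dict[str, str]) -> tuple[str, bool]:
    """Return (paper_id, created); importing already-known content creates nothing."""
    checksum = file_sha256(path)
    if checksum in known_hashes:
        return known_hashes[checksum], False
    paper_id = f"local:sha256-{checksum}"  # hypothetical ID scheme
    known_hashes[checksum] = paper_id
    return paper_id, True
```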

## Testing

Add tests for:
- path layout logic
- metadata read/write behavior
- duplicate detection
- reindex behavior
- summary rendering
- search behavior
- CLI output contracts for core commands

Prefer unit tests for core logic and targeted integration tests for CLI behavior.
@@ -0,0 +1,92 @@
# Data Model

## Library data layout

The paper library on disk should be human-browsable.

A typical layout looks like:

```text
library_root/
  config/
    config.toml
    vocab.yaml
    prompts/
      summarize_paper.md

  inbox/
  papers/
    arxiv/
      2026/
        2604.12345/
          meta.json
          source.pdf
          paper.md
          summary.json
          summary.md
          ref.bib
          assets/
          logs/
            mineru.log
    local/
      sha256-.../
        meta.json
        source.pdf
        paper.md
        summary.json
        summary.md

  db/
    paperlib.sqlite3

  cache/
```
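The path layout above could be computed by small pure functions like the ones sketched here. Deriving the year directory from the first two digits of a new-style arXiv identifier is an assumption (the docs do not say where the year comes from), as are the function names and the choice of a full hex digest for local papers.

```python
import re
from pathlib import Path

# New-style arXiv identifiers look like YYMM.NNNNN (4 or 5 digits after the dot).
ARXIV_NEW_STYLE = re.compile(r"^(\d{2})(\d{2})\.\d{4,5}$")


def arxiv_paper_dir(library_root: Path, arxiv_id: str) -> Path:
    """Map an arXiv ID to papers/arxiv/<year>/<id>/ (year taken from the ID: an assumption)."""
    match = ARXIV_NEW_STYLE.match(arxiv_id)
    if match is None:
        raise ValueError(f"not a new-style arXiv identifier: {arxiv_id}")
    year = 2000 + int(match.group(1))
    return library_root / "papers" / "arxiv" / str(year) / arxiv_id


def local_paper_dir(library_root: Path, sha256_hex: str) -> Path:
    """Map a content hash to papers/local/sha256-<digest>/ (digest form is an assumption)."""
    return library_root / "papers" / "local" / f"sha256-{sha256_hex}"
```

Version suffixes such as `v2` and old-style identifiers are deliberately rejected by this sketch; a real implementation would need a policy for them.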

## Data boundaries

### `meta.json`

`meta.json` should contain deterministic or near-deterministic information, mostly from:

- import process
- file system state
- external paper metadata sources

Typical fields include:

- `paper_id`, `source_type`, `source_id`
- `title`, `authors`, `published_date`, `updated_date`, `categories`
- `pdf_path`, `paper_md_path`, `summary_json_path`, `summary_md_path`
- `imported_at`, `conversion_status`, `summary_status`

Avoid putting speculative AI content into `meta.json`.

### `summary.json`

`summary.json` is optional enrichment and may be regenerated.

It should contain structured fields such as:

- one-sentence summary, problem statement, method overview
- main results, claimed contributions, assumptions, limitations
- problem tags, technique tags, entities
- relevance-to-user fields, recommended sections

`summary.json` must include a schema version.

### SQLite

SQLite stores searchable/indexed state and job-independent status.

It should help with:
- listing papers, filtering and search, path lookup, tag lookup, status overview

But it should never be treated as the only durable source of paper metadata.

## Key conventions

- `meta.json` contains stable metadata and processing status
- `summary.json` contains structured AI-generated enrichment
- `summary.md` is rendered from `summary.json`
- `paper.md` is generated from the PDF by an external converter such as MinerU
- the database is rebuildable from the files above