wyj/paperlib

wyj 767326346b init

2026-04-17 02:00:23 -04:00

AGENTS.md

Project overview

paperlib is a local-first paper library engine with a CLI.

It is designed to:

import PDF papers into a structured local library
convert PDFs into Markdown using external converters such as MinerU
maintain stable per-paper metadata files and a searchable index database
optionally generate AI-based structured summaries
expose a clean CLI that is useful both for humans and for higher-level automation tools such as an arXiv daily digest workflow

paperlib is *not* primarily an AI app. AI summarization is an optional enrichment layer, not the core of the system.

The project should remain useful even when:

no LLM API key is configured
no summarization is enabled
only import / convert / index / search features are used

Core design principles

1. Local-first

User data lives locally in the paper library directory.

The library must remain usable without a server, web app, or remote database.

Prefer plain files plus SQLite over opaque internal state.

2. CLI-first

The CLI is the primary interface.

All important workflows should be accessible from the CLI.

The Python API is useful, but secondary.

3. JSON files are the source of truth

Per-paper JSON files in the library are the durable source of truth.

Examples:

meta.json
summary.json

SQLite is an index/cache layer, not the canonical data store.

This means:

the index must be rebuildable from files
reindex should be able to repair the database from on-disk records
code must not assume the database alone is authoritative
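The rebuild requirement above can be sketched as follows. This is a minimal illustration, not the project's actual reindex code: the `papers` table, its three columns, and the function name `rebuild_index` are assumptions made for the example.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, db_path: Path) -> int:
    """Recreate a minimal index table from every meta.json under papers/.

    The table name and columns are hypothetical; only the idea (files in,
    database out) comes from the guideline above.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, meta_path TEXT)"
            )
            count = 0
            # sorted() makes the scan order deterministic across runs
            for meta_path in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_path.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT OR REPLACE INTO papers VALUES (?, ?, ?)",
                    (meta["paper_id"], meta.get("title"), str(meta_path)),
                )
                count += 1
        return count
    finally:
        conn.close()
```

Because the table is dropped and refilled from disk on every call, running the function twice yields the same index, and deleting the database loses nothing that the files do not already hold.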

4. AI is optional enrichment

Importing, converting, indexing, listing, showing, and searching papers must work without AI.

AI summarization should be isolated behind a clean interface.

Do not make core workflows depend on an LLM provider.

5. Stable machine-readable interfaces

Important commands should support --json output so that other tools can consume them.

Examples:

paperlib import ... --json
paperlib summarize ... --json
paperlib show ... --json
paperlib export ... --format json
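One way to wire a `--json` flag is sketched below. It assumes `argparse` from the standard library; the source does not say which argument parser paperlib uses, and `build_parser` and `format_result` are hypothetical helper names.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser with one subcommand that accepts --json."""
    parser = argparse.ArgumentParser(prog="paperlib")
    subcommands = parser.add_subparsers(dest="command", required=True)
    show = subcommands.add_parser("show", help="show one paper")
    show.add_argument("paper_id")
    show.add_argument(
        "--json",
        dest="as_json",
        action="store_true",
        help="emit machine-readable JSON on stdout",
    )
    return parser


def format_result(result: dict, as_json: bool) -> str:
    """Return a JSON document for scripts, or aligned text for humans."""
    if as_json:
        # sort_keys keeps the key order stable across runs
        return json.dumps(result, ensure_ascii=False, sort_keys=True)
    width = max((len(key) for key in result), default=0)
    return "\n".join(f"{key:<{width}}  {value}" for key, value in result.items())
```

Keeping the formatting in one function means the human-readable and machine-readable outputs are produced from the same result dictionary and cannot drift apart.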

6. Small, explicit, inspectable components

Prefer simple and explicit logic over large hidden frameworks.

Keep components understandable:

importer
converter
renderer
summarizer
search
reindex
doctor

Avoid unnecessary abstraction until there is a real need.

Non-goals

The following are currently out of scope unless explicitly planned later:

mandatory daemon architecture
web UI
multi-user remote service
cloud-first design
vector database as a required dependency
opaque agent framework controlling the core library
“fully autonomous research assistant” behavior

Library data layout

The paper library on disk should be human-browsable.

A typical layout looks like:

library_root/
  config/
    config.toml
    vocab.yaml
    prompts/
      summarize_paper.md

  inbox/
  papers/
    arxiv/
      2026/
        2604.12345/
          meta.json
          source.pdf
          paper.md
          summary.json
          summary.md
          ref.bib
          assets/
          logs/
            mineru.log
    local/
      sha256-.../
        meta.json
        source.pdf
        paper.md
        summary.json
        summary.md

  db/
    paperlib.sqlite3

  cache/

Conventions:

meta.json contains stable metadata and processing status
summary.json contains structured AI-generated enrichment
summary.md is rendered from summary.json
paper.md is generated from the PDF by an external converter such as MinerU
the database is rebuildable from the files above
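The arXiv branch of the layout above could be computed by a small pure function such as the sketch below. The function names are hypothetical, and the year is passed in explicitly because the source does not say how it is derived. The naming of `local/` directories is left out because the layout elides it.

```python
from pathlib import Path

# File names taken from the layout above.
META_NAME = "meta.json"
PDF_NAME = "source.pdf"
MARKDOWN_NAME = "paper.md"


def arxiv_paper_dir(library_root: Path, year: int, arxiv_id: str) -> Path:
    """Return <library_root>/papers/arxiv/<year>/<arxiv_id>."""
    # Simplification for this sketch: reject ids that would escape the
    # directory or create nested directories (this also rejects old-style
    # arXiv ids that contain a slash).
    if not arxiv_id or "/" in arxiv_id or "\\" in arxiv_id or arxiv_id in {".", ".."}:
        raise ValueError(f"unsafe arXiv id for a directory name: {arxiv_id!r}")
    return library_root / "papers" / "arxiv" / f"{year:04d}" / arxiv_id


def paper_files(paper_dir: Path) -> dict[str, Path]:
    """Return the conventional file paths inside one paper directory."""
    return {
        "meta": paper_dir / META_NAME,
        "pdf": paper_dir / PDF_NAME,
        "markdown": paper_dir / MARKDOWN_NAME,
    }
```

Keeping path computation free of I/O makes the layout logic easy to unit test without touching a real library.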

Data model boundaries

`meta.json`

meta.json should contain deterministic or near-deterministic information, mostly from:

import process
file system state
external paper metadata sources

Typical fields include:

paper_id
source_type
source_id
title
authors
published_date
updated_date
categories
pdf_path
paper_md_path
summary_json_path
summary_md_path
imported_at
conversion_status
summary_status

Avoid putting speculative AI content into meta.json.

`summary.json`

summary.json is optional enrichment and may be regenerated.

It should contain structured fields such as:

one-sentence summary
problem statement
method overview
main results
claimed contributions
assumptions
limitations
problem tags
technique tags
entities
relevance-to-user fields
recommended sections

summary.json must include a schema version.
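A loader can enforce the version requirement before any other code touches the data. The sketch below is illustrative only: the key name `schema_version`, the supported set `{1}`, and the error class are assumptions, since the source does not define the schema.

```python
SUPPORTED_SCHEMA_VERSIONS = {1}  # hypothetical; the real set is not given in the source


class SummarySchemaError(ValueError):
    """Raised when a summary document cannot be trusted by this code version."""


def check_summary(data: object) -> dict:
    """Return the summary unchanged if it carries a supported schema version."""
    if not isinstance(data, dict):
        raise SummarySchemaError("summary must be a JSON object")
    version = data.get("schema_version")  # assumed key name
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(version, bool) or not isinstance(version, int):
        raise SummarySchemaError("summary is missing an integer schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise SummarySchemaError(f"unsupported schema_version: {version}")
    return data
```

Failing loudly on an unknown version is one way to avoid silently misreading summaries written by a newer or older release.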

SQLite

SQLite stores searchable/indexed state and job-independent status.

It should help with:

listing papers
filtering and search
path lookup
tag lookup
status overview

But it should never be treated as the only durable source of paper metadata.

CLI philosophy

The CLI should be easy for humans and predictable for scripts.

Important CLI expectations

human-readable by default
machine-readable with --json
clear exit codes
no hidden background magic
no required daemon
stable command names
idempotent operations when possible

Expected command families

Core commands include:

init
import
import-dir
watch
convert
reindex
doctor
status
list
show
search
open
print-path
summarize
render-summary
export

When implementing commands, preserve a clear separation between:

mutation commands
read/query commands

Architecture guidelines

The codebase should be organized around a few clear layers.

1. Core domain logic

Pure Python logic for:

identifying papers
computing paths
importing PDFs
updating metadata
converting PDFs to Markdown
rendering summaries
rebuilding the index

This layer should be testable without the CLI.

2. CLI layer

Thin wrappers around the core domain logic.

The CLI should:

parse arguments
call core functions
format output
handle exit codes

The CLI should not contain deep business logic.
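The split between the two layers might look like the sketch below: a core function that knows nothing about the terminal, and a wrapper that only parses, calls, prints, and returns an exit code. All names and the specific exit code values (0, 1, 2) are assumptions for illustration.

```python
import sys
from typing import Callable, Optional, Sequence


class UserError(Exception):
    """A problem caused by user input; reported without a traceback."""


def show_paper(paper_id: str, lookup: Callable[[str], Optional[dict]]) -> dict:
    """Core logic: return the metadata record or raise UserError."""
    record = lookup(paper_id)
    if record is None:
        raise UserError(f"unknown paper: {paper_id}")
    return record


def main(argv: Sequence[str], lookup: Callable[[str], Optional[dict]]) -> int:
    """CLI layer: parse arguments, call core, format output, pick an exit code."""
    if len(argv) != 1:
        print("usage: show PAPER_ID", file=sys.stderr)
        return 2  # usage error
    try:
        record = show_paper(argv[0], lookup)
    except UserError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1  # expected, user-facing failure
    print(record["title"])
    return 0
```

Only `UserError` is caught here, so an unexpected internal exception still surfaces with a traceback instead of being reported as a user mistake.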

3. Optional integrations

External systems should live in integration modules, for example:

MinerU wrapper
filesystem watch integration
ripgrep integration
LLM provider integration

Keep these adapters isolated.

4. Optional AI layer

The AI summarization layer should be behind a stable abstraction.

For example:

load prompt template
load paper markdown
load optional profile / vocabulary
call provider
validate structured output
write summary.json
render summary.md

Avoid leaking provider-specific behavior into unrelated modules.

AI collaboration guidelines

When using AI to help develop this project, the AI should follow these rules.

1. Respect the project boundaries

Do not redesign paperlib into:

a web app
a required daemon
a monolithic agent system
a chat-first interface

Unless explicitly asked, keep the project aligned with:

local-first
CLI-first
JSON/SQLite-based architecture
AI-optional enrichment

2. Prefer incremental changes

Make small, reviewable changes.

When implementing a feature:

first clarify which module owns it
avoid broad refactors unless necessary
preserve existing CLI semantics unless intentionally changing them

3. Keep file formats stable

Changes to meta.json or summary.json are important.

If changing schemas:

update the schema version
update documentation
consider migration or backward compatibility
do not silently break existing libraries

4. Avoid hidden coupling

Do not make unrelated modules depend on each other unnecessarily.

For example:

search should not depend on LLM code
import should not require summarization
reindex should not assume a specific converter
render-summary should not require calling AI again

5. Prefer explicit data flow

When adding features, keep data flow obvious.

For example:

import creates or updates metadata
convert creates paper.md
summarize creates summary.json
render-summary creates summary.md
reindex rebuilds SQLite from files

6. Do not invent capabilities

If a feature is not implemented yet, do not pretend it exists.

Examples:

do not write code that assumes a daemon exists
do not assume remote sync exists
do not assume vector search exists
do not assume arXiv-specific logic belongs in the core library

7. Prefer durable outputs over polished prose

When designing AI summarization outputs, favor:

structured JSON
stable field names
grep-friendly rendered Markdown
concise, reusable information

over:

highly polished review prose
flashy but unstable output formats

Coding guidelines

General style

Prefer straightforward Python.
Use type hints.
Keep functions small and focused.
Add docstrings to public functions and classes.
Avoid overengineering.
Prefer composition over deep inheritance.

Error handling

Fail clearly.
Provide helpful error messages.
Distinguish user-facing CLI errors from internal exceptions.
Avoid silently swallowing errors.

Logging

Use structured and informative logging where useful.
Avoid noisy logs in normal CLI output.
Keep machine-readable command output clean when --json is used.

File operations

Be careful with moves, copies, and overwrites.
Prefer atomic writes for JSON files when possible.
Never corrupt existing metadata due to partial writes.
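A common way to get atomic JSON writes is to write a temporary file in the same directory and rename it over the target. The sketch below uses only the standard library; the function name `write_json_atomic` is hypothetical.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write JSON to a temp file next to `path`, then rename it into place."""
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, ensure_ascii=False, indent=2, sort_keys=True)
            handle.write("\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # replaces an existing file in one step
    except BaseException:
        # Leave the old file untouched and clean up the partial temp file.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

If serialization fails halfway, the previous `meta.json` is still intact because the target path is only touched by the final rename.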

Idempotence

Where possible, commands should behave safely when run multiple times.

Examples:

re-importing the same file should detect duplicates
render-summary should be repeatable
reindex should be safe to rerun
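Duplicate detection on re-import can be based on a content hash, which fits the `sha256-...` directory naming in the layout. The sketch below is one possible approach, not a description of the actual importer; the function names and the in-memory set of known digests are assumptions.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(pdf_path: Path, known_digests: set[str]) -> bool:
    """Report whether a PDF with identical bytes has already been imported."""
    return file_sha256(pdf_path) in known_digests
```

Hashing the bytes rather than comparing file names means a renamed copy of the same PDF is still recognized, while a revised PDF with different bytes is treated as new.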

Testing

Add tests for:

path layout logic
metadata read/write behavior
duplicate detection
reindex behavior
summary rendering
search behavior
CLI output contracts for core commands

Prefer unit tests for core logic and targeted integration tests for CLI behavior.

Search design guidelines

Search should support at least two useful modes:

1. Field-aware structured search

Examples:

tags
authors
categories
titles
summary fields
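Field-aware search over the SQLite index could be as small as the sketch below. The `papers` table, its columns, and the allowlist of searchable fields are assumptions for illustration; the real index schema is not given in the source.

```python
import sqlite3

# Hypothetical set of indexed columns that may be searched.
SEARCHABLE_FIELDS = {"title", "authors", "categories"}


def search_by_field(conn: sqlite3.Connection, field: str, term: str) -> list[str]:
    """Return paper ids whose `field` contains `term`, ordered by id."""
    if field not in SEARCHABLE_FIELDS:
        raise ValueError(f"not a searchable field: {field!r}")
    # Column names cannot be bound as SQL parameters, so the allowlist
    # check above is what makes the f-string safe. The search term itself
    # is always passed as a bound parameter.
    rows = conn.execute(
        f"SELECT paper_id FROM papers WHERE {field} LIKE ? ORDER BY paper_id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

Note that `%` and `_` inside the term act as `LIKE` wildcards in this sketch; a real implementation would need to decide whether to escape them.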

2. Full-text-friendly search

Support grep-like workflows and integration with tools such as ripgrep.

Do not require semantic/vector search as a baseline feature.

If semantic search is added later, it should be optional and must not displace simple grep/database search.

Summarization design guidelines

Summarization should produce reusable structured outputs.

Summarization goals

A summary should be useful for:

later human review
grep-style reverse lookup
building daily/weekly reports
indexing by problem/method/result
personal research triage

Summarization output

Prefer generating:

summary.json as the canonical structured output
summary.md rendered from JSON

Do not make free-form Markdown the only output.
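Rendering summary.md from summary.json can be a pure, deterministic function, so that render-summary never needs to call a model again. The sketch below assumes hypothetical key names (`one_sentence_summary`, `method_overview`, `limitations`, `problem_tags`) modeled on the field list in this document; the real schema may differ.

```python
# (key in summary.json, heading in summary.md); key names are assumptions.
SECTIONS = [
    ("one_sentence_summary", "Summary"),
    ("method_overview", "Method"),
    ("limitations", "Limitations"),
    ("problem_tags", "Problem tags"),
]


def render_summary_md(summary: dict) -> str:
    """Render known fields as headed sections, skipping empty ones."""
    lines: list[str] = []
    for key, heading in SECTIONS:
        value = summary.get(key)
        if value is None or value == "" or value == []:
            continue  # field absent or left empty in the JSON
        lines.append(f"## {heading}")
        lines.append("")
        if isinstance(value, list):
            lines.extend(f"- {item}" for item in value)  # one grep-friendly line per item
        else:
            lines.append(str(value))
        lines.append("")
    return "\n".join(lines)
```

The fixed section order and one-item-per-line lists keep the Markdown stable across reruns and easy to search with grep-like tools.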

Prompting guidelines

Prompts should instruct the model to:

extract factual information
avoid unsupported claims
use concise and stable language
prefer controlled vocabulary when available
return structured JSON only
use null or empty lists for unclear fields rather than hallucinating

Provider abstraction

The summarizer should not be tightly coupled to a single LLM provider.

Use a provider abstraction so the project can support:

OpenAI-compatible APIs
local models later if desired
different prompt templates and vocabularies
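A provider abstraction along these lines could be expressed with `typing.Protocol`. This is a sketch under assumptions: the `complete` method, the `{paper}` placeholder, and the class names are invented for the example and are not an existing paperlib API.

```python
import json
from typing import Protocol


class SummaryProvider(Protocol):
    """Anything that can turn a prompt into raw model text."""

    def complete(self, prompt: str) -> str: ...


class CannedProvider:
    """Test double that returns a fixed response; no network involved."""

    def __init__(self, response: str) -> None:
        self.response = response

    def complete(self, prompt: str) -> str:
        return self.response


def summarize(paper_markdown: str, template: str, provider: SummaryProvider) -> dict:
    """Fill the template, call the provider, and parse structured output."""
    prompt = template.replace("{paper}", paper_markdown)
    raw = provider.complete(prompt)
    summary = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON text
    if not isinstance(summary, dict):
        raise ValueError("provider must return a JSON object")
    return summary
```

Because `summarize` depends only on the protocol, an OpenAI-compatible client or a local model can be swapped in later, and tests can run with `CannedProvider` and no API key.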

What belongs in `paperlib` vs higher-level tools

paperlib is the base library engine.

It should own:

PDF import
local storage layout
conversion to Markdown
metadata files
summary files
index maintenance
CLI access to those capabilities

It should not own high-level discovery workflows such as:

arXiv daily fetching
personalized new-paper ranking
daily digest generation
automated paper downloading from external feeds

Those belong in higher-level tools that consume paperlib.

Expected development workflow

When implementing a new feature, the preferred order is:

1. identify the owning module
2. define or update the data contract
3. implement the core logic
4. add tests
5. expose it through the CLI if appropriate
6. update docs and examples

If a change affects on-disk formats or CLI behavior, document it clearly.

Decision heuristics

When uncertain, prefer the option that is:

more local-first
more inspectable
easier to test
easier to recover from
less coupled to AI
more stable for scripts
less magical

Examples:

prefer JSON + Markdown over opaque internal blobs
prefer explicit CLI commands over hidden automation
prefer rebuildable indexes over fragile single-source databases
prefer optional AI enrichment over mandatory AI workflows

Documentation expectations

Important features should be documented in:

README.md for user-facing overview
docs/cli.md for command behavior
docs/storage-layout.md for on-disk structure
docs/summary-schema.md for summary.json
docs/integration-guide.md for higher-level tool integration

Keep docs aligned with actual behavior.

If you are an AI agent contributing code

Before making a change, ask:

Does this belong in paperlib, or in a higher-level workflow project?
Does this preserve local-first and CLI-first design?
Does this make AI optional, not mandatory?
Does this keep JSON files as the durable source of truth?
Does this keep the system understandable to a developer reading the code later?

If the answer to any of these is no, reconsider the approach.
