paperlib/docs/summary-schema.md

# Summary Schema

This document defines the structure and semantics of the `summary.json` files that contain AI-generated paper summaries.

## Overview

The `summary.json` file contains structured, AI-generated analysis of a paper. It is designed to:

- Provide consistent, machine-readable summaries
- Support research triage and discovery workflows
- Enable automated categorization and search
- Remain stable across different AI providers
- Use controlled vocabulary when available

## Schema Version 1.0

### File Structure

```json
{
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces a novel neural architecture for...",
  "problem_statement": "Current approaches to X suffer from limitations...",
  "method_overview": "The authors propose a hybrid approach combining...",
  "main_results": "Experiments show 15% improvement over baselines...",
  "claimed_contributions": [
    "Novel attention mechanism design",
    "State-of-the-art results on ImageNet",
    "Theoretical analysis of convergence properties"
  ],
  "assumptions": [
    "Data is independently distributed",
    "Computational budget allows for large models"
  ],
  "limitations": [
    "Only evaluated on English text",
    "Requires significant computational resources",
    "Limited theoretical justification for design choices"
  ],
  "problem_tags": ["classification", "computer-vision", "optimization"],
  "technique_tags": ["neural-networks", "attention", "transformers"],
  "entities": ["ImageNet", "BERT", "ResNet", "CIFAR-10"],
  "relevance_to_user": 0.75,
  "recommended_sections": ["Section 3.2", "Algorithm 1", "Table 2"]
}
```

## Field Definitions

### Required Fields

#### `schema_version` (string)
- **Purpose**: Track format version for migration
- **Format**: Version string in `MAJOR.MINOR` form (e.g., "1.0")
- **Required**: Yes

#### `one_sentence_summary` (string)
- **Purpose**: Concise paper overview for quick scanning
- **Guidelines**:
  - One complete sentence, under 200 characters
  - Focus on the main contribution or finding
  - Avoid technical jargon when possible
- **Example**: "This paper introduces a new attention mechanism that improves transformer efficiency by 40% while maintaining accuracy."
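
The checks described for the two required fields could be sketched as follows. This is a minimal illustration, assuming the summary has already been loaded into a dict; the function name `check_required_fields` is hypothetical and not part of paperlib.

```python
def check_required_fields(summary: dict) -> list[str]:
    """Return a list of problems found in the required fields (empty if none)."""
    problems = []

    version = summary.get("schema_version")
    if not isinstance(version, str) or not version:
        problems.append("schema_version must be a non-empty string")

    sentence = summary.get("one_sentence_summary")
    if not isinstance(sentence, str) or not sentence.strip():
        problems.append("one_sentence_summary must be a non-empty string")
    elif len(sentence) >= 200:
        # The guideline asks for fewer than 200 characters.
        problems.append("one_sentence_summary should be under 200 characters")

    return problems
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in one pass.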

### Core Content Fields

#### `problem_statement` (string)
- **Purpose**: What problem does this paper address?
- **Guidelines**:
  - 2-3 sentences maximum
  - Focus on the gap or limitation being addressed
  - Explain why this problem matters

#### `method_overview` (string)
- **Purpose**: High-level description of the approach
- **Guidelines**:
  - 3-4 sentences maximum
  - Focus on the key innovation or insight
  - Avoid detailed algorithmic descriptions

#### `main_results` (string)
- **Purpose**: Key empirical findings or theoretical results
- **Guidelines**:
  - Quantitative results when available
  - Highlight significance of improvements
  - Note any surprising or counterintuitive findings

### Structured Lists

#### `claimed_contributions` (array of strings)
- **Purpose**: Authors' stated contributions
- **Guidelines**:
  - Extract from paper's contribution list
  - Preserve authors' framing and claims
  - 3-6 items typically

#### `assumptions` (array of strings)
- **Purpose**: Key assumptions underlying the work
- **Guidelines**:
  - Mathematical, methodological, or data assumptions
  - Critical for understanding applicability
  - Often unstated but important

#### `limitations` (array of strings)
- **Purpose**: Acknowledged or apparent limitations
- **Guidelines**:
  - Drawn from the authors' discussion or limitations section
  - May also include evident limitations the authors do not acknowledge
  - Important for understanding the scope of the work

### Categorization

#### `problem_tags` (array of strings)
- **Purpose**: Categorize the problem domain
- **Controlled vocabulary** (preferred values):
  - `classification`, `regression`, `clustering`
  - `optimization`, `search`, `planning`
  - `generation`, `translation`, `summarization`
  - `detection`, `segmentation`, `tracking`
  - `compression`, `encoding`, `decoding`
  - `privacy`, `security`, `robustness`
  - `interpretability`, `fairness`, `ethics`
  - `efficiency`, `scalability`, `deployment`

#### `technique_tags` (array of strings)
- **Purpose**: Categorize the technical approaches
- **Controlled vocabulary** (preferred values):
  - `neural-networks`, `deep-learning`, `transformers`
  - `cnn`, `rnn`, `lstm`, `gru`, `attention`
  - `reinforcement-learning`, `supervised-learning`, `unsupervised-learning`
  - `bayesian`, `probabilistic`, `statistical`
  - `graph-neural-networks`, `graph-algorithms`
  - `computer-vision`, `natural-language-processing`
  - `federated-learning`, `transfer-learning`, `meta-learning`
  - `adversarial`, `generative-models`, `vae`, `gan`
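
Because the listed values are preferred rather than mandatory, a tool might flag tags outside the vocabulary instead of rejecting them. The sketch below shows one way to do that for `problem_tags`; the helper name `split_by_vocabulary` is an assumption made for illustration.

```python
# Preferred problem tags, copied from the problem_tags list in this section.
PROBLEM_VOCABULARY = {
    "classification", "regression", "clustering",
    "optimization", "search", "planning",
    "generation", "translation", "summarization",
    "detection", "segmentation", "tracking",
    "compression", "encoding", "decoding",
    "privacy", "security", "robustness",
    "interpretability", "fairness", "ethics",
    "efficiency", "scalability", "deployment",
}


def split_by_vocabulary(tags, vocabulary):
    """Partition tags into (preferred, other), preserving input order."""
    preferred = [t for t in tags if t in vocabulary]
    other = [t for t in tags if t not in vocabulary]
    return preferred, other


preferred, other = split_by_vocabulary(
    ["classification", "computer-vision", "optimization"], PROBLEM_VOCABULARY
)
print(preferred)  # ['classification', 'optimization']
print(other)      # ['computer-vision']
```

The same helper works for `technique_tags` when given a set built from that list.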

### Entities and References

#### `entities` (array of strings)
- **Purpose**: Important datasets, models, algorithms, or systems mentioned
- **Guidelines**:
  - Proper names: "ImageNet", "BERT", "ResNet"
  - Algorithms: "SGD", "Adam", "RANSAC"
  - Benchmarks: "GLUE", "COCO", "WMT"
  - Avoid generic terms like "neural network"

### User Relevance

#### `relevance_to_user` (number or `null`, optional)
- **Purpose**: Estimated relevance score for the user
- **Format**: Float between 0.0 and 1.0
- **Guidelines**:
  - Based on user's research interests (if known)
  - `null` if user preferences unavailable
  - Higher scores = more relevant
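
A check for this field might look like the sketch below. It assumes the bounds 0.0 and 1.0 are inclusive, which the format description does not state explicitly; the function name `is_valid_relevance` is hypothetical.

```python
def is_valid_relevance(value) -> bool:
    """True if value is None (preferences unavailable) or a number in [0.0, 1.0]."""
    if value is None:
        return True
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return 0.0 <= value <= 1.0
```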

#### `recommended_sections` (array of strings, optional)
- **Purpose**: Specific sections worth reading in detail
- **Format**: Section references as they appear in paper
- **Examples**: ["Section 3.2", "Algorithm 1", "Table 2", "Appendix A"]

## Generation Guidelines

### AI Provider Instructions

When generating summaries, AI models should:

1. **Read for understanding**: Focus on the paper's core contributions
2. **Use structured thinking**: Work through each field systematically
3. **Prefer facts over interpretation**: Extract what the authors claim, not your own opinions
4. **Use controlled vocabulary**: Select from predefined tag lists when possible
5. **Be concise**: Optimize for quick scanning and search
6. **Handle uncertainty**: Use `null` or empty arrays for unclear fields
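
The last instruction can be sketched as a post-processing step that makes unclear fields explicit. This is an illustration only: it assumes text fields fall back to `None` (serialized as `null`) and array fields to `[]`, and the function name `fill_unclear_fields` is hypothetical.

```python
TEXT_FIELDS = ("problem_statement", "method_overview", "main_results")
LIST_FIELDS = (
    "claimed_contributions", "assumptions", "limitations",
    "problem_tags", "technique_tags", "entities", "recommended_sections",
)


def fill_unclear_fields(partial: dict) -> dict:
    """Return a copy where unclear fields are explicit: None for text, [] for lists."""
    result = dict(partial)
    for name in TEXT_FIELDS:
        result.setdefault(name, None)
    for name in LIST_FIELDS:
        # Treat both a missing key and an explicit null as an empty array.
        if result.get(name) is None:
            result[name] = []
    result.setdefault("relevance_to_user", None)
    return result
```

Normalizing in one place means downstream code can concatenate or iterate over the array fields without guarding against `None`.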

### Quality Criteria

Good summaries exhibit:
- **Accuracy**: Faithful to the paper's content
- **Completeness**: Cover all major aspects
- **Consistency**: Similar papers get similar treatment
- **Searchability**: Use terms that aid discovery
- **Brevity**: Information density over verbosity

### Common Issues to Avoid

- **Hallucination**: Never invent facts not in the paper
- **Editorializing**: Don't add opinions about paper quality
- **Inconsistent terminology**: Use standard field names
- **Over-abstraction**: Keep concrete details when useful
- **Under-specification**: Provide enough detail for usefulness

## Schema Evolution

### Version History

- **v1.0** (current): Initial schema with core fields

### Migration Strategy

When the schema evolves:
1. New versions increment the `schema_version` field
2. Migration tools handle format upgrades automatically
3. Backward compatibility is maintained when possible
4. Deprecated fields are marked but preserved
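
One way such a migration tool could be organized is a chain of single-step upgrade functions keyed by source version. The sketch below is hypothetical: version "1.1" does not exist in this document, and the names `MIGRATIONS`, `CURRENT_VERSION`, and `migrate` are assumptions for illustration.

```python
def migrate_1_0_to_1_1(summary: dict) -> dict:
    # Hypothetical upgrade step: only the version marker changes here.
    upgraded = dict(summary)
    upgraded["schema_version"] = "1.1"
    return upgraded


# Maps a source version to the function that upgrades it by one step.
MIGRATIONS = {"1.0": migrate_1_0_to_1_1}
CURRENT_VERSION = "1.1"


def migrate(summary: dict) -> dict:
    """Apply upgrade steps until the summary reaches CURRENT_VERSION."""
    while summary["schema_version"] != CURRENT_VERSION:
        step = MIGRATIONS.get(summary["schema_version"])
        if step is None:
            raise ValueError(
                f"no migration from version {summary['schema_version']!r}"
            )
        summary = step(summary)
    return summary
```

Each step returns a new dict, so the caller's original summary is left unchanged.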

### Extensibility

Future versions may add:
- Additional structured fields
- Hierarchical tag taxonomies
- Multi-lingual support
- Citation relationship mapping
- Experimental reproducibility metadata

## Integration with paperlib

### File Lifecycle

1. **Generation**: AI provider creates `summary.json`
2. **Validation**: paperlib validates against schema
3. **Indexing**: Content indexed for search
4. **Rendering**: Human-readable `summary.md` generated
5. **Updates**: Summaries can be regenerated with new models
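
As an illustration of the rendering step, the sketch below turns a few fields into Markdown text. The layout of `summary.md` is not specified in this document, so the headings chosen here and the function name `render_markdown` are assumptions.

```python
def render_markdown(summary: dict) -> str:
    """Render a few summary fields as Markdown text."""
    lines = ["# Summary", "", summary["one_sentence_summary"], ""]
    sections = (
        ("Contributions", "claimed_contributions"),
        ("Limitations", "limitations"),
    )
    for title, field in sections:
        items = summary.get(field) or []
        if items:  # skip sections with nothing to show
            lines.append(f"## {title}")
            lines.append("")
            lines.extend(f"- {item}" for item in items)
            lines.append("")
    return "\n".join(lines)
```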

### Search Integration

Summary fields are indexed for search:
- Full-text search includes all text fields
- Tag-based search uses `problem_tags` and `technique_tags`
- Entity search uses the `entities` field
- Relevance ranking can use `relevance_to_user` scores
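
Tag-based search can be sketched as an inverted index from tag to paper. This is a minimal in-memory illustration, not a description of how paperlib indexes content; the function name `build_tag_index` and the paper ids are hypothetical.

```python
from collections import defaultdict


def build_tag_index(summaries: dict) -> dict:
    """Map each tag to the sorted list of paper ids whose summary carries it.

    `summaries` maps a paper id to its loaded summary dict.
    """
    index = defaultdict(set)
    for paper_id, summary in summaries.items():
        tags = (summary.get("problem_tags") or []) + (summary.get("technique_tags") or [])
        for tag in tags:
            index[tag].add(paper_id)
    return {tag: sorted(ids) for tag, ids in index.items()}
```

Looking up `index.get("classification", [])` then returns every paper tagged with that problem domain.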

### API Integration

Higher-level tools can consume summaries programmatically:

```python
import json
from pathlib import Path

# Load the summary from disk
summary_path = Path("papers/arxiv/2022/arxiv-2212_06340/summary.json")
with summary_path.open(encoding="utf-8") as f:
    summary = json.load(f)

# Extract key information
tags = summary["problem_tags"] + summary["technique_tags"]
entities = summary["entities"]

# relevance_to_user may be absent or null; treat both cases as 0.0
relevance = summary.get("relevance_to_user")
if relevance is None:
    relevance = 0.0
```

This enables automated workflows like:
- Daily digest generation
- Research recommendation systems
- Literature review automation
- Cross-reference discovery
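
For instance, a daily digest might order papers by relevance before listing them. The sketch below assumes summaries are already loaded into dicts and treats a missing or `null` score as 0.0; the function name `digest_order` is hypothetical.

```python
def digest_order(summaries: list[dict], limit: int = 5) -> list[str]:
    """Return one-sentence summaries, most relevant first."""

    def score(summary: dict) -> float:
        relevance = summary.get("relevance_to_user")
        return 0.0 if relevance is None else relevance

    ranked = sorted(summaries, key=score, reverse=True)
    return [s["one_sentence_summary"] for s in ranked[:limit]]
```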

## Examples

### Machine Learning Paper

```json
{
  "schema_version": "1.0",
  "one_sentence_summary": "Introduces EfficientNet, a family of convolutional neural networks that achieve better accuracy and efficiency than previous models through compound scaling.",
  "problem_statement": "Existing ConvNet scaling methods arbitrarily scale network dimensions, leading to suboptimal accuracy and efficiency trade-offs.",
  "method_overview": "The paper proposes compound scaling, which scales network width, depth, and resolution together using fixed coefficients found by a small grid search, and applies it to a baseline network designed with neural architecture search.",
  "main_results": "EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster than the best existing ConvNet.",
  "claimed_contributions": [
    "Novel compound scaling method for ConvNets",
    "EfficientNet family with state-of-the-art accuracy/efficiency",
    "Systematic study of scaling dimensions"
  ],
  "assumptions": [
    "ImageNet classification transfers to other vision tasks",
    "Compound scaling works across different architectures"
  ],
  "limitations": [
    "Limited evaluation on tasks beyond image classification",
    "Scaling coefficients may not generalize to all architectures"
  ],
  "problem_tags": ["classification", "computer-vision", "efficiency"],
  "technique_tags": ["cnn", "neural-architecture-search", "model-scaling"],
  "entities": ["ImageNet", "MobileNet", "ResNet", "NASNet"],
  "relevance_to_user": null,
  "recommended_sections": ["Section 3.1", "Table 2", "Figure 2"]
}
```

This schema provides a foundation for consistent, structured paper analysis while remaining flexible enough to evolve with new research needs and AI capabilities.