paperlib/docs/summary-schema.md
2026-04-17 16:54:30 -04:00

Summary Schema

This document defines the structure and semantics of the summary.json files that contain AI-generated paper summaries.

Overview

The summary.json file contains structured, AI-generated analysis of a paper. It is designed to:

  • Provide consistent, machine-readable summaries
  • Support research triage and discovery workflows
  • Enable automated categorization and search
  • Remain stable across different AI providers
  • Use controlled vocabulary when available

Schema Version 1.0

File Structure

{
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces a novel neural architecture for...",
  "problem_statement": "Current approaches to X suffer from limitations...",
  "method_overview": "The authors propose a hybrid approach combining...",
  "main_results": "Experiments show 15% improvement over baselines...",
  "claimed_contributions": [
    "Novel attention mechanism design",
    "State-of-the-art results on ImageNet",
    "Theoretical analysis of convergence properties"
  ],
  "assumptions": [
    "Data is independently distributed",
    "Computational budget allows for large models"
  ],
  "limitations": [
    "Only evaluated on English text",
    "Requires significant computational resources",
    "Limited theoretical justification for design choices"
  ],
  "problem_tags": ["classification", "computer-vision", "optimization"],
  "technique_tags": ["neural-networks", "attention", "transformers"],
  "entities": ["ImageNet", "BERT", "ResNet", "CIFAR-10"],
  "relevance_to_user": 0.75,
  "recommended_sections": ["Section 3.2", "Algorithm 1", "Table 2"]
}

Field Definitions

Required Fields

schema_version (string)

  • Purpose: Track format version for migration
  • Format: Version string in "MAJOR.MINOR" form (e.g., "1.0")
  • Required: Yes

one_sentence_summary (string)

  • Purpose: Concise paper overview for quick scanning
  • Guidelines:
    • One complete sentence, under 200 characters
    • Focus on the main contribution or finding
    • Avoid technical jargon when possible
  • Example: "This paper introduces a new attention mechanism that improves transformer efficiency by 40% while maintaining accuracy."

Core Content Fields

problem_statement (string)

  • Purpose: What problem does this paper address?
  • Guidelines:
    • 2-3 sentences maximum
    • Focus on the gap or limitation being addressed
    • Explain why this problem matters

method_overview (string)

  • Purpose: High-level description of the approach
  • Guidelines:
    • 3-4 sentences maximum
    • Focus on the key innovation or insight
    • Avoid detailed algorithmic descriptions

main_results (string)

  • Purpose: Key empirical findings or theoretical results
  • Guidelines:
    • Quantitative results when available
    • Highlight significance of improvements
    • Note any surprising or counterintuitive findings

Structured Lists

claimed_contributions (array of strings)

  • Purpose: Authors' stated contributions
  • Guidelines:
    • Extract from paper's contribution list
    • Preserve authors' framing and claims
    • 3-6 items typically

assumptions (array of strings)

  • Purpose: Key assumptions underlying the work
  • Guidelines:
    • Mathematical, methodological, or data assumptions
    • Critical for understanding applicability
    • Often unstated but important

limitations (array of strings)

  • Purpose: Acknowledged or apparent limitations
  • Guidelines:
    • From authors' discussion or limitations section
    • Obvious limitations not acknowledged by authors
    • Important for understanding scope

Categorization

problem_tags (array of strings)

  • Purpose: Categorize the problem domain
  • Controlled vocabulary (preferred values):
    • classification, regression, clustering
    • optimization, search, planning
    • generation, translation, summarization
    • detection, segmentation, tracking
    • compression, encoding, decoding
    • privacy, security, robustness
    • interpretability, fairness, ethics
    • efficiency, scalability, deployment

technique_tags (array of strings)

  • Purpose: Categorize the technical approaches
  • Controlled vocabulary (preferred values):
    • neural-networks, deep-learning, transformers
    • cnn, rnn, lstm, gru, attention
    • reinforcement-learning, supervised-learning, unsupervised-learning
    • bayesian, probabilistic, statistical
    • graph-neural-networks, graph-algorithms
    • computer-vision, natural-language-processing
    • federated-learning, transfer-learning, meta-learning
    • adversarial, generative-models, vae, gan
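
Because both vocabularies list preferred values rather than mandatory ones, a tool might flag off-vocabulary tags instead of rejecting them. The following is a minimal sketch under that assumption; `split_tags` is a hypothetical helper, not a documented part of paperlib, and the two sets are copied from the lists above.

```python
# Controlled vocabularies copied from the lists above.
PROBLEM_TAGS = frozenset({
    "classification", "regression", "clustering",
    "optimization", "search", "planning",
    "generation", "translation", "summarization",
    "detection", "segmentation", "tracking",
    "compression", "encoding", "decoding",
    "privacy", "security", "robustness",
    "interpretability", "fairness", "ethics",
    "efficiency", "scalability", "deployment",
})
TECHNIQUE_TAGS = frozenset({
    "neural-networks", "deep-learning", "transformers",
    "cnn", "rnn", "lstm", "gru", "attention",
    "reinforcement-learning", "supervised-learning", "unsupervised-learning",
    "bayesian", "probabilistic", "statistical",
    "graph-neural-networks", "graph-algorithms",
    "computer-vision", "natural-language-processing",
    "federated-learning", "transfer-learning", "meta-learning",
    "adversarial", "generative-models", "vae", "gan",
})


def split_tags(tags, vocabulary):
    """Split tags into (preferred, other), preserving input order."""
    preferred = [tag for tag in tags if tag in vocabulary]
    other = [tag for tag in tags if tag not in vocabulary]
    return preferred, other


preferred, other = split_tags(
    ["cnn", "neural-architecture-search", "model-scaling"], TECHNIQUE_TAGS
)
print(preferred)  # ['cnn']
print(other)      # ['neural-architecture-search', 'model-scaling']
```

Tags that land in the second list are still allowed; they are candidates for review or for a future vocabulary update.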

Entities and References

entities (array of strings)

  • Purpose: Important datasets, models, algorithms, or systems mentioned
  • Guidelines:
    • Proper names: "ImageNet", "BERT", "ResNet"
    • Algorithms: "SGD", "Adam", "RANSAC"
    • Benchmarks: "GLUE", "COCO", "WMT"
    • Avoid generic terms like "neural network"

User Relevance

relevance_to_user (number or null, optional)

  • Purpose: Estimated relevance score for the user
  • Format: Float between 0.0 and 1.0, or null
  • Guidelines:
    • Based on user's research interests (if known)
    • null if user preferences unavailable
    • Higher scores = more relevant

recommended_sections (array of strings)

  • Purpose: Specific sections worth reading in detail
  • Format: Section references as they appear in paper
  • Examples: ["Section 3.2", "Algorithm 1", "Table 2", "Appendix A"]
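
The field definitions above can be checked mechanically. The sketch below is one possible structural check, not paperlib's actual validator. `validate_summary` is a hypothetical name. The sketch assumes that only schema_version and one_sentence_summary are strictly required, and that the other fields may be null or absent when the content is unclear.

```python
TEXT_FIELDS = ("problem_statement", "method_overview", "main_results")
LIST_FIELDS = (
    "claimed_contributions", "assumptions", "limitations",
    "problem_tags", "technique_tags", "entities", "recommended_sections",
)


def validate_summary(summary):
    """Return a list of problems; an empty list means the summary passed."""
    problems = []

    # Required fields must be present and must be strings.
    for name in ("schema_version", "one_sentence_summary"):
        if not isinstance(summary.get(name), str):
            problems.append(f"{name}: required string is missing or not a string")

    # The guideline asks for fewer than 200 characters.
    sentence = summary.get("one_sentence_summary")
    if isinstance(sentence, str) and len(sentence) >= 200:
        problems.append("one_sentence_summary: must be under 200 characters")

    # Assumption: text fields may be null or absent when unclear.
    for name in TEXT_FIELDS:
        value = summary.get(name)
        if value is not None and not isinstance(value, str):
            problems.append(f"{name}: expected a string or null")

    # Assumption: list fields may be null, absent, or empty.
    for name in LIST_FIELDS:
        value = summary.get(name)
        if value is None:
            continue
        if not isinstance(value, list) or not all(isinstance(i, str) for i in value):
            problems.append(f"{name}: expected an array of strings")

    # relevance_to_user is null or a number in [0.0, 1.0]; bool is rejected.
    relevance = summary.get("relevance_to_user")
    if relevance is not None:
        is_number = isinstance(relevance, (int, float)) and not isinstance(relevance, bool)
        if not is_number or not 0.0 <= relevance <= 1.0:
            problems.append("relevance_to_user: expected null or a number in [0.0, 1.0]")

    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a generated file at once.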

Generation Guidelines

AI Provider Instructions

When generating summaries, AI models should:

  1. Read for understanding: Focus on the paper's core contributions
  2. Use structured thinking: Work through each field systematically
  3. Prefer facts over interpretation: Extract what authors claim, not opinions
  4. Use controlled vocabulary: Select from predefined tag lists when possible
  5. Be concise: Optimize for quick scanning and search
  6. Handle uncertainty: Use null or empty arrays for unclear fields

Quality Criteria

Good summaries exhibit:

  • Accuracy: Faithful to the paper's content
  • Completeness: Cover all major aspects
  • Consistency: Similar papers get similar treatment
  • Searchability: Use terms that aid discovery
  • Brevity: Information density over verbosity

Common Issues to Avoid

  • Hallucination: Never invent facts not in the paper
  • Editorializing: Don't add opinions about paper quality
  • Inconsistent terminology: Use standard field names
  • Over-abstraction: Keep concrete details when useful
  • Under-specification: Provide enough detail for usefulness

Schema Evolution

Version History

  • v1.0 (current): Initial schema with core fields

Migration Strategy

When the schema evolves:

  1. New versions increment the schema_version field
  2. Migration tools handle format upgrades automatically
  3. Backward compatibility maintained when possible
  4. Deprecated fields are marked but preserved
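
One way such a migration tool could work is a chain of per-version upgrade functions keyed by schema_version. The sketch below is illustrative only: version "1.1" and the `keywords` field are hypothetical placeholders invented for the example, since v1.0 is the only version defined so far.

```python
def _migrate_1_0_to_1_1(summary):
    """Hypothetical upgrade step: add an invented field, keep everything else."""
    upgraded = dict(summary)            # existing fields are preserved
    upgraded.setdefault("keywords", [])  # hypothetical new field
    upgraded["schema_version"] = "1.1"
    return upgraded


# Maps a schema_version to the function that upgrades it one step.
MIGRATIONS = {"1.0": _migrate_1_0_to_1_1}
LATEST_VERSION = "1.1"  # hypothetical


def upgrade(summary):
    """Apply migrations in sequence until the summary is at LATEST_VERSION."""
    current = dict(summary)  # never mutate the caller's data
    while current.get("schema_version") != LATEST_VERSION:
        step = MIGRATIONS.get(current.get("schema_version"))
        if step is None:
            raise ValueError(
                f"no migration path from schema_version {current.get('schema_version')!r}"
            )
        current = step(current)
    return current
```

Each step moves exactly one version forward, so supporting a later version only requires registering one more function.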

Extensibility

Future versions may add:

  • Additional structured fields
  • Hierarchical tag taxonomies
  • Multi-lingual support
  • Citation relationship mapping
  • Experimental reproducibility metadata

Integration with paperlib

File Lifecycle

  1. Generation: AI provider creates summary.json
  2. Validation: paperlib validates against schema
  3. Indexing: Content indexed for search
  4. Rendering: Human-readable summary.md generated
  5. Updates: Summaries can be regenerated with new models
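
The rendering step could be as simple as walking the fields in a fixed order. The sketch below assumes a summary.md layout (one heading per field) that this document does not specify. `render_summary_md` and the section titles are hypothetical.

```python
# (heading, field) pairs; the headings are an assumed layout.
TEXT_SECTIONS = [
    ("Problem", "problem_statement"),
    ("Method", "method_overview"),
    ("Results", "main_results"),
]
LIST_SECTIONS = [
    ("Contributions", "claimed_contributions"),
    ("Assumptions", "assumptions"),
    ("Limitations", "limitations"),
    ("Recommended Sections", "recommended_sections"),
]


def render_summary_md(summary):
    """Render a summary dict as Markdown, skipping null or empty fields."""
    lines = ["# Summary", ""]
    if summary.get("one_sentence_summary"):
        lines += [summary["one_sentence_summary"], ""]
    for title, key in TEXT_SECTIONS:
        text = summary.get(key)
        if text:
            lines += [f"## {title}", "", text, ""]
    for title, key in LIST_SECTIONS:
        items = summary.get(key) or []
        if items:
            lines += [f"## {title}", ""]
            lines += [f"- {item}" for item in items]
            lines.append("")
    return "\n".join(lines)
```

Skipping null and empty fields keeps the rendered file free of empty headings when a provider could not fill a field.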

Search Integration

Summary fields are indexed for search:

  • Full-text search includes all text fields
  • Tag-based search uses problem_tags and technique_tags
  • Entity search uses the entities field
  • Relevance ranking can use relevance_to_user scores
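
Tag-based and entity search can be illustrated with a small inverted index. This is a sketch of the general technique, not paperlib's indexer. `build_tag_index`, `lookup`, and the paper IDs are hypothetical, and case-insensitive matching is an assumption.

```python
from collections import defaultdict

INDEXED_FIELDS = ("problem_tags", "technique_tags", "entities")


def build_tag_index(summaries):
    """Map (field, lowercased term) to the set of paper IDs that use it."""
    index = defaultdict(set)
    for paper_id, summary in summaries.items():
        for field in INDEXED_FIELDS:
            # A null or missing field contributes nothing.
            for term in summary.get(field) or []:
                index[(field, term.lower())].add(paper_id)
    return index


def lookup(index, field, term):
    """Return the sorted paper IDs whose given field contains the term."""
    return sorted(index.get((field, term.lower()), set()))


summaries = {
    "paper-a": {"problem_tags": ["classification"], "entities": ["ImageNet"]},
    "paper-b": {"problem_tags": ["classification", "efficiency"], "entities": None},
}
index = build_tag_index(summaries)
print(lookup(index, "problem_tags", "classification"))  # ['paper-a', 'paper-b']
print(lookup(index, "entities", "imagenet"))            # ['paper-a']
```

Keying the index by field as well as term keeps a tag such as "attention" from colliding with an entity of the same name.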

API Integration

Higher-level tools can consume summaries programmatically:

import json
from pathlib import Path

# Load summary
summary_path = Path("papers/arxiv/2022/arxiv-2212_06340/summary.json")
with summary_path.open() as f:
    summary = json.load(f)

# Extract key information
tags = summary["problem_tags"] + summary["technique_tags"]
# relevance_to_user may be absent or null; treat both as 0.0
relevance = summary.get("relevance_to_user") or 0.0
entities = summary["entities"]

This enables automated workflows like:

  • Daily digest generation
  • Research recommendation systems
  • Literature review automation
  • Cross-reference discovery
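
As an illustration of the first workflow, a daily digest could rank loaded summaries by relevance_to_user and list the top few. This is a minimal sketch. `build_digest`, the output format, and the paper IDs are hypothetical, and a null or missing score is treated as 0.0.

```python
def build_digest(summaries, limit=5):
    """Return one digest line per paper, most relevant first."""

    def score(item):
        relevance = item[1].get("relevance_to_user")
        # null (None) or missing means no user preferences were available.
        return relevance if relevance is not None else 0.0

    ranked = sorted(summaries.items(), key=score, reverse=True)
    return [
        f"{paper_id}: {summary.get('one_sentence_summary', '')}"
        for paper_id, summary in ranked[:limit]
    ]


summaries = {
    "paper-a": {"one_sentence_summary": "A.", "relevance_to_user": 0.2},
    "paper-b": {"one_sentence_summary": "B.", "relevance_to_user": None},
    "paper-c": {"one_sentence_summary": "C.", "relevance_to_user": 0.9},
}
print(build_digest(summaries, limit=2))  # ['paper-c: C.', 'paper-a: A.']
```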

Examples

Machine Learning Paper

{
  "schema_version": "1.0",
  "one_sentence_summary": "Introduces EfficientNet, a family of convolutional neural networks that achieve better accuracy and efficiency than previous models through compound scaling.",
  "problem_statement": "Existing ConvNet scaling methods arbitrarily scale network dimensions, leading to suboptimal accuracy and efficiency trade-offs.",
  "method_overview": "The paper proposes compound scaling, which uniformly scales network width, depth, and resolution using a fixed set of scaling coefficients, and applies it to a baseline network found by neural architecture search.",
  "main_results": "EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster than the best existing ConvNet.",
  "claimed_contributions": [
    "Novel compound scaling method for ConvNets",
    "EfficientNet family with state-of-the-art accuracy/efficiency",
    "Systematic study of scaling dimensions"
  ],
  "assumptions": [
    "ImageNet classification transfers to other vision tasks",
    "Compound scaling works across different architectures"
  ],
  "limitations": [
    "Limited evaluation on tasks beyond image classification",
    "Scaling coefficients may not generalize to all architectures"
  ],
  "problem_tags": ["classification", "computer-vision", "efficiency"],
  "technique_tags": ["cnn", "neural-architecture-search", "model-scaling"],
  "entities": ["ImageNet", "MobileNet", "ResNet", "NASNet"],
  "relevance_to_user": null,
  "recommended_sections": ["Section 3.1", "Table 2", "Figure 2"]
}

This schema provides a foundation for consistent, structured paper analysis while remaining flexible enough to evolve with new research needs and AI capabilities.