wyj 432010f431 docs: add docs

2026-04-17 16:54:30 -04:00

Summary Schema

This document defines the structure and semantics of the summary.json files that contain AI-generated paper summaries.

Overview

The summary.json file contains structured, AI-generated analysis of a paper. It is designed to:

Provide consistent, machine-readable summaries
Support research triage and discovery workflows
Enable automated categorization and search
Remain stable across different AI providers
Use controlled vocabulary when available

Schema Version 1.0

File Structure

{
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces a novel neural architecture for...",
  "problem_statement": "Current approaches to X suffer from limitations...",
  "method_overview": "The authors propose a hybrid approach combining...",
  "main_results": "Experiments show 15% improvement over baselines...",
  "claimed_contributions": [
    "Novel attention mechanism design",
    "State-of-the-art results on ImageNet",
    "Theoretical analysis of convergence properties"
  ],
  "assumptions": [
    "Data is independently distributed",
    "Computational budget allows for large models"
  ],
  "limitations": [
    "Only evaluated on English text",
    "Requires significant computational resources",
    "Limited theoretical justification for design choices"
  ],
  "problem_tags": ["classification", "computer-vision", "optimization"],
  "technique_tags": ["neural-networks", "attention", "transformers"],
  "entities": ["ImageNet", "BERT", "ResNet", "CIFAR-10"],
  "relevance_to_user": 0.75,
  "recommended_sections": ["Section 3.2", "Algorithm 1", "Table 2"]
}
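A structural check for the file shown above could look roughly like the following sketch. `FIELD_TYPES`, `OPTIONAL_FIELDS`, and `check_structure` are hypothetical names, not part of paperlib. Two assumptions are baked in: only the fields this document marks as optional may be absent, and null is tolerated for every field except `schema_version`.

```python
# Expected JSON type for each v1.0 field, read off the example above.
FIELD_TYPES = {
    "schema_version": str,
    "one_sentence_summary": str,
    "problem_statement": str,
    "method_overview": str,
    "main_results": str,
    "claimed_contributions": list,
    "assumptions": list,
    "limitations": list,
    "problem_tags": list,
    "technique_tags": list,
    "entities": list,
    "relevance_to_user": (int, float),
    "recommended_sections": list,
}
# Assumption: only the two fields marked "optional" in this document may be absent.
OPTIONAL_FIELDS = {"relevance_to_user", "recommended_sections"}


def check_structure(summary: dict) -> list[str]:
    """Return a list of problems; an empty list means the structure looks valid."""
    problems = []
    for name, expected in FIELD_TYPES.items():
        if name not in summary:
            if name not in OPTIONAL_FIELDS:
                problems.append(f"missing field: {name}")
            continue
        value = summary[name]
        if value is None:
            # Assumption: null is tolerated everywhere except schema_version.
            if name == "schema_version":
                problems.append("schema_version must not be null")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
        elif expected is list and not all(isinstance(item, str) for item in value):
            problems.append(f"{name} must contain only strings")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report everything wrong with a file in a single pass.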

Field Definitions

Required Fields

`schema_version` (string)

Purpose: Track format version for migration
Format: Version string in MAJOR.MINOR form (e.g., "1.0")
Required: Yes

`one_sentence_summary` (string)

Purpose: Concise paper overview for quick scanning
Guidelines:
- One complete sentence, under 200 characters
- Focus on the main contribution or finding
- Avoid technical jargon when possible
Example: "This paper introduces a new attention mechanism that improves transformer efficiency by 40% while maintaining accuracy."

Core Content Fields

`problem_statement` (string)

Purpose: What problem does this paper address?
Guidelines:
- No more than 3 sentences (typically 2-3)
- Focus on the gap or limitation being addressed
- Explain why this problem matters

`method_overview` (string)

Purpose: High-level description of the approach
Guidelines:
- No more than 4 sentences (typically 3-4)
- Focus on the key innovation or insight
- Avoid detailed algorithmic descriptions

`main_results` (string)

Purpose: Key empirical findings or theoretical results
Guidelines:
- Quantitative results when available
- Highlight significance of improvements
- Note any surprising or counterintuitive findings
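The length guidelines for `one_sentence_summary`, `problem_statement`, and `method_overview` can be checked mechanically, at least approximately. The sketch below uses hypothetical helper names, and its sentence counter is a rough regex heuristic (it can miscount around abbreviations such as "e.g. Adam"), so its output is better treated as a warning than as a hard validation error.

```python
import re

# Upper bounds on sentence counts, taken from the guidelines above.
MAX_SENTENCES = {
    "one_sentence_summary": 1,
    "problem_statement": 3,
    "method_overview": 4,
}
MAX_SUMMARY_CHARS = 200  # one_sentence_summary must stay under this length


def count_sentences(text: str) -> int:
    """Rough count: split where '.', '!' or '?' is followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return len([part for part in parts if part])


def length_warnings(summary: dict) -> list[str]:
    """Return warnings for text fields that exceed the documented length guidelines."""
    warnings = []
    for field, limit in MAX_SENTENCES.items():
        text = summary.get(field) or ""
        found = count_sentences(text)
        if found > limit:
            warnings.append(f"{field}: {found} sentences (limit {limit})")
    if len(summary.get("one_sentence_summary") or "") >= MAX_SUMMARY_CHARS:
        warnings.append("one_sentence_summary: 200 characters or more")
    return warnings
```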

Structured Lists

`claimed_contributions` (array of strings)

Purpose: Authors' stated contributions
Guidelines:
- Extract from paper's contribution list
- Preserve authors' framing and claims
- 3-6 items typically

`assumptions` (array of strings)

Purpose: Key assumptions underlying the work
Guidelines:
- Mathematical, methodological, or data assumptions
- Critical for understanding applicability
- Often unstated but important

`limitations` (array of strings)

Purpose: Acknowledged or apparent limitations
Guidelines:
- From authors' discussion or limitations section
- Obvious limitations not acknowledged by authors
- Important for understanding scope

Categorization

`problem_tags` (array of strings)

Purpose: Categorize the problem domain
Controlled vocabulary (preferred values):
- classification, regression, clustering
- optimization, search, planning
- generation, translation, summarization
- detection, segmentation, tracking
- compression, encoding, decoding
- privacy, security, robustness
- interpretability, fairness, ethics
- efficiency, scalability, deployment

`technique_tags` (array of strings)

Purpose: Categorize the technical approaches
Controlled vocabulary (preferred values):
- neural-networks, deep-learning, transformers
- cnn, rnn, lstm, gru, attention
- reinforcement-learning, supervised-learning, unsupervised-learning
- bayesian, probabilistic, statistical
- graph-neural-networks, graph-algorithms
- computer-vision, natural-language-processing
- federated-learning, transfer-learning, meta-learning
- adversarial, generative-models, vae, gan
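Because both vocabularies are described as preferred values, a checker can reasonably report unfamiliar tags without rejecting them. The following is a minimal sketch under that assumption; `normalize_tag` and `off_vocabulary` are hypothetical helpers, and the normalization rule (lowercase, hyphen-separated) is inferred from the tag spellings listed above.

```python
PROBLEM_VOCABULARY = {
    "classification", "regression", "clustering",
    "optimization", "search", "planning",
    "generation", "translation", "summarization",
    "detection", "segmentation", "tracking",
    "compression", "encoding", "decoding",
    "privacy", "security", "robustness",
    "interpretability", "fairness", "ethics",
    "efficiency", "scalability", "deployment",
}
TECHNIQUE_VOCABULARY = {
    "neural-networks", "deep-learning", "transformers",
    "cnn", "rnn", "lstm", "gru", "attention",
    "reinforcement-learning", "supervised-learning", "unsupervised-learning",
    "bayesian", "probabilistic", "statistical",
    "graph-neural-networks", "graph-algorithms",
    "computer-vision", "natural-language-processing",
    "federated-learning", "transfer-learning", "meta-learning",
    "adversarial", "generative-models", "vae", "gan",
}


def normalize_tag(tag: str) -> str:
    """Lowercase and hyphenate a tag so 'Computer Vision' matches 'computer-vision'."""
    return "-".join(tag.strip().lower().replace("_", " ").split())


def off_vocabulary(tags: list[str], vocabulary: set[str]) -> list[str]:
    """Return the normalized tags that are not in the given vocabulary."""
    return [tag for tag in map(normalize_tag, tags) if tag not in vocabulary]
```

For example, `off_vocabulary(["cnn", "model-scaling"], TECHNIQUE_VOCABULARY)` returns `["model-scaling"]`, which a pipeline might log for vocabulary review.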

Entities and References

`entities` (array of strings)

Purpose: Important datasets, models, algorithms, or systems mentioned
Guidelines:
- Proper names: "ImageNet", "BERT", "ResNet"
- Algorithms: "SGD", "Adam", "RANSAC"
- Benchmarks: "GLUE", "COCO", "WMT"
- Avoid generic terms like "neural network"

User Relevance

`relevance_to_user` (number or null, optional)

Purpose: Estimated relevance score for the user
Format: Float between 0.0 and 1.0 (inclusive), or null
Guidelines:
- Based on user's research interests (if known)
- Set to null when user preferences are unavailable
- Higher scores = more relevant
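Since the score may be a number, null, or absent, consumers need to handle all three cases. The sketch below shows one way to do so; `parse_relevance` and `rank_by_relevance` are hypothetical helpers, and placing unscored papers last is a design assumption, not something this schema prescribes.

```python
from typing import Optional


def parse_relevance(value) -> Optional[float]:
    """Validate a relevance_to_user value; None means 'no score available'."""
    if value is None:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"relevance_to_user must be a number or null, got {type(value).__name__}")
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"relevance_to_user out of range: {value}")
    return float(value)


def rank_by_relevance(summaries: list[dict]) -> list[dict]:
    """Sort highest relevance first; summaries without a score go last."""
    def key(summary: dict):
        score = parse_relevance(summary.get("relevance_to_user"))
        return (score is None, -(score or 0.0))
    return sorted(summaries, key=key)
```

Keeping "unknown" distinct from 0.0 avoids treating a paper that was never scored as one that was scored irrelevant.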

`recommended_sections` (array of strings, optional)

Purpose: Specific sections worth reading in detail
Format: Section references as they appear in paper
Examples: ["Section 3.2", "Algorithm 1", "Table 2", "Appendix A"]

Generation Guidelines

AI Provider Instructions

When generating summaries, AI models should:

Read for understanding: Focus on the paper's core contributions
Use structured thinking: Work through each field systematically
Prefer facts over interpretation: Extract what authors claim, not opinions
Use controlled vocabulary: Select from predefined tag lists when possible
Be concise: Optimize for quick scanning and search
Handle uncertainty: Use null or empty arrays for unclear fields

Quality Criteria

Good summaries exhibit:

Accuracy: Faithful to the paper's content
Completeness: Cover all major aspects
Consistency: Similar papers get similar treatment
Searchability: Use terms that aid discovery
Brevity: Information density over verbosity

Common Issues to Avoid

Hallucination: Never invent facts not in the paper
Editorializing: Don't add opinions about paper quality
Inconsistent terminology: Use standard field names
Over-abstraction: Keep concrete details when useful
Under-specification: Provide enough detail for usefulness

Schema Evolution

Version History

v1.0 (current): Initial schema with core fields

Migration Strategy

When the schema evolves:

New versions increment the schema_version field
Migration tools handle format upgrades automatically
Backward compatibility maintained when possible
Deprecated fields are marked but preserved
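One common way to implement this kind of strategy is a chain of single-step upgrade functions keyed by source version. The sketch below is illustrative only: version "1.1" and its `keywords` field are invented for the example and do not exist in this schema, and `MIGRATIONS` and `migrate` are hypothetical names, not paperlib's migration tooling.

```python
from typing import Callable

Migration = Callable[[dict], dict]


def _upgrade_1_0_to_1_1(summary: dict) -> dict:
    """Hypothetical upgrade step: version 1.1 and 'keywords' are invented for illustration."""
    upgraded = dict(summary)  # copy so the caller's data is not mutated
    upgraded.setdefault("keywords", [])
    upgraded["schema_version"] = "1.1"
    return upgraded


# Maps a source version to the function that upgrades it by one step.
MIGRATIONS: dict[str, Migration] = {"1.0": _upgrade_1_0_to_1_1}


def migrate(summary: dict, target: str) -> dict:
    """Apply single-step upgrades until the summary reaches the target version."""
    current = summary
    while current["schema_version"] != target:
        step = MIGRATIONS.get(current["schema_version"])
        if step is None:
            raise ValueError(
                f"no migration path from {current['schema_version']} to {target}"
            )
        current = step(current)
    return current
```

Each step only adds fields and never removes them, which matches the stated goal of preserving deprecated fields.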

Extensibility

Future versions may add:

Additional structured fields
Hierarchical tag taxonomies
Multi-lingual support
Citation relationship mapping
Experimental reproducibility metadata

Integration with paperlib

File Lifecycle

Generation: AI provider creates summary.json
Validation: paperlib validates against schema
Indexing: Content indexed for search
Rendering: Human-readable summary.md generated
Updates: Summaries can be regenerated with new models
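The rendering step can be pictured with a small sketch. The layout of summary.md is not specified in this document, so the headings chosen here and the `render_markdown` function are assumptions for illustration, not paperlib's actual output.

```python
# (heading, field) pairs; the headings are illustrative choices.
TEXT_SECTIONS = [
    ("Problem", "problem_statement"),
    ("Method", "method_overview"),
    ("Results", "main_results"),
]
LIST_SECTIONS = [
    ("Claimed contributions", "claimed_contributions"),
    ("Assumptions", "assumptions"),
    ("Limitations", "limitations"),
]


def render_markdown(summary: dict) -> str:
    """Render a summary dict as Markdown, skipping null or empty fields."""
    lines = [f"# {summary['one_sentence_summary']}", ""]
    for title, field in TEXT_SECTIONS:
        text = summary.get(field)
        if text:
            lines += [f"## {title}", "", text, ""]
    for title, field in LIST_SECTIONS:
        items = summary.get(field) or []
        if items:
            lines += [f"## {title}", ""] + [f"- {item}" for item in items] + [""]
    return "\n".join(lines)
```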

Search Integration

Summary fields are indexed for search:

Full-text search includes all text fields
Tag-based search uses problem_tags and technique_tags
Entity search uses the entities field
Relevance ranking can use relevance_to_user scores
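Tag-based search of this kind is typically backed by an inverted index from tag to paper IDs. The following is a minimal in-memory sketch under that assumption; `build_tag_index` and `search_by_tags` are hypothetical names, and paperlib's real index may work differently.

```python
from collections import defaultdict


def build_tag_index(summaries: dict[str, dict]) -> dict[str, set[str]]:
    """Map each lowercase tag to the set of paper IDs that carry it."""
    index = defaultdict(set)
    for paper_id, summary in summaries.items():
        for field in ("problem_tags", "technique_tags"):
            # A field may be missing or null; treat both as "no tags".
            for tag in summary.get(field) or []:
                index[tag.lower()].add(paper_id)
    return dict(index)


def search_by_tags(index: dict[str, set[str]], tags: list[str]) -> set[str]:
    """Return the IDs of papers that carry every requested tag."""
    matches = [index.get(tag.lower(), set()) for tag in tags]
    return set.intersection(*matches) if matches else set()
```

The same structure would work for entity search by indexing the `entities` field instead of the two tag fields.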

API Integration

Higher-level tools can consume summaries programmatically:

import json
from pathlib import Path

# Load summary
summary_path = Path("papers/arxiv/2022/arxiv-2212_06340/summary.json")
with summary_path.open() as f:
    summary = json.load(f)

# Extract key information
tags = summary["problem_tags"] + summary["technique_tags"]
# relevance_to_user may be missing or null; treat both cases as 0.0
relevance = summary.get("relevance_to_user") or 0.0
entities = summary["entities"]

This enables automated workflows like:

Daily digest generation
Research recommendation systems
Literature review automation
Cross-reference discovery
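As an illustration of the first workflow, a daily digest could be assembled roughly as follows. This is a sketch only: `load_summaries` and `build_digest` are hypothetical helpers, the use of the parent directory name as the paper ID is an assumption about the library layout, and the 0.5 threshold is an arbitrary example value.

```python
import json
from pathlib import Path


def load_summaries(root: Path) -> dict[str, dict]:
    """Map each paper directory name to its parsed summary.json."""
    summaries = {}
    for path in sorted(root.rglob("summary.json")):
        with path.open(encoding="utf-8") as f:
            summaries[path.parent.name] = json.load(f)
    return summaries


def build_digest(summaries: dict[str, dict], min_relevance: float = 0.5, limit: int = 10) -> list[str]:
    """Return one line per paper, most relevant first, dropping low or unknown scores."""
    scored = []
    for paper_id, summary in summaries.items():
        # A missing or null score counts as 0.0 for digest purposes.
        score = summary.get("relevance_to_user") or 0.0
        if score >= min_relevance:
            scored.append((score, paper_id, summary["one_sentence_summary"]))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [f"[{score:.2f}] {paper_id}: {text}" for score, paper_id, text in scored[:limit]]
```

A caller would run something like `build_digest(load_summaries(Path("papers")))` and send the resulting lines by email or chat.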

Examples

Machine Learning Paper

{
  "schema_version": "1.0",
  "one_sentence_summary": "Introduces EfficientNet, a family of convolutional neural networks that achieve better accuracy and efficiency than previous models through compound scaling.",
  "problem_statement": "Existing ConvNet scaling methods arbitrarily scale network dimensions, leading to suboptimal accuracy and efficiency trade-offs.",
  "method_overview": "The paper proposes compound scaling, which uniformly scales network depth, width, and resolution with a single compound coefficient. The baseline network is found by neural architecture search, and the scaling constants by a small grid search.",
  "main_results": "EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster than the best existing ConvNet.",
  "claimed_contributions": [
    "Novel compound scaling method for ConvNets",
    "EfficientNet family with state-of-the-art accuracy/efficiency",
    "Systematic study of scaling dimensions"
  ],
  "assumptions": [
    "ImageNet classification transfers to other vision tasks",
    "Compound scaling works across different architectures"
  ],
  "limitations": [
    "Limited evaluation on tasks beyond image classification",
    "Scaling coefficients may not generalize to all architectures"
  ],
  "problem_tags": ["classification", "computer-vision", "efficiency"],
  "technique_tags": ["cnn", "neural-architecture-search", "model-scaling"],
  "entities": ["ImageNet", "MobileNet", "ResNet", "NASNet"],
  "relevance_to_user": null,
  "recommended_sections": ["Section 3.1", "Table 2", "Figure 2"]
}

This schema provides a foundation for consistent, structured paper analysis while remaining flexible enough to evolve with new research needs and AI capabilities.
