Context is Key: A Benchmark for Forecasting with Essential Textual Information
Source
- Raw Markdown: paper_context-is-key-2024.md
- PDF: paper_context-is-key-2024.pdf
- Dataset metadata snapshot: context-is-key-2024
- arXiv: https://arxiv.org/abs/2410.18959
- Official code: https://github.com/ServiceNow/context-is-key-forecasting
- Official benchmark viewer: https://servicenow.github.io/context-is-key-forecasting/v0/
- Official Hugging Face dataset: https://huggingface.co/datasets/ServiceNow/context-is-key
Core Claim
Context is Key argues that numerical history alone is often an under-specified forecasting interface. Some future values become predictable only when the model can use relevant context, usually natural language, that names the process, constraints, expected events, hidden historical facts, covariates, or causal relationships.
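A minimal sketch of that interface, assuming nothing beyond the paper’s framing; the class and field names are illustrative, not the benchmark’s actual API:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ContextualForecastTask:
    """Hypothetical container for a CiK-style task instance."""
    history: np.ndarray        # past numeric observations
    context: str               # essential natural-language context
    horizon: int               # number of future steps to forecast
    region_of_interest: slice  # window where the context changes the future


# Numeric history alone suggests a flat series; the context makes a
# level shift inside the forecast window predictable.
task = ContextualForecastTask(
    history=np.ones(48),
    context=("The sensor will be recalibrated at step 10 of the "
             "forecast window, doubling all subsequent readings."),
    horizon=24,
    region_of_interest=slice(10, 24),
)
```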
Alex Context
This is the wiki’s landmark anchor for context-aware time-series work. It is the first source in this corpus that makes the importance of context visually and experimentally obvious: the task is not merely “forecast this numeric sequence”, but “forecast this numeric sequence given the essential text that changes what the future means.”
For time-series world-model research, the paper is a reminder that context is not decoration. Context can be a compact interface for domain knowledge, future events, exogenous variables, constraints, and causal structure. A model that cannot read and operationalize that context will look competent on many passive benchmarks while failing on decision-relevant forecasts.
Key Contributions
- Introduces Context is Key (CiK), a 71-task benchmark for probabilistic forecasting with essential natural-language context.
- Covers seven domains: climatology, economics, energy, mechanics, public safety, transportation, and retail.
- Defines five context sources: intemporal information, future information, historical information, covariate information, and causal information.
- Introduces Region-of-Interest CRPS (RCRPS), which upweights context-sensitive forecast windows and penalizes constraint violations (sketched in code after this list).
- Evaluates statistical forecasters, numerical time-series foundation models, multimodal forecasting methods, and LLM-based forecasters with and without context.
- Shows that Direct Prompt with Llama-3.1-405B-Instruct is the best aggregate method in the paper’s main table, while no method is best across all context types.
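A sample-based sketch of the RCRPS idea, assuming equal half-weights on RoI and non-RoI timesteps, a crude scale normalization, and an assumed penalty weight `beta`; the paper’s exact normalization constant and constraint-violation term differ in detail:

```python
import numpy as np


def crps_from_samples(samples: np.ndarray, y: float) -> float:
    """Sample-based CRPS estimate for one timestep, using the
    energy form CRPS ≈ E|X - y| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2


def rcrps_sketch(forecast_samples, target, roi_mask, violation_frac, beta=10.0):
    """Schematic RCRPS over a forecast window.

    forecast_samples: (n_samples, horizon) array of forecast draws
    target:           (horizon,) realized values
    roi_mask:         (horizon,) bool array marking the region of interest
    violation_frac:   fraction of samples violating a stated constraint
    beta:             assumed penalty weight (the paper scales differently)
    """
    per_step = np.array([
        crps_from_samples(forecast_samples[:, t], target[t])
        for t in range(len(target))
    ])
    roi = per_step[roi_mask].mean() if roi_mask.any() else 0.0
    rest = per_step[~roi_mask].mean() if (~roi_mask).any() else 0.0
    scale = np.mean(np.abs(target)) + 1e-8  # crude scale normalization
    return (0.5 * roi + 0.5 * rest + beta * violation_frac) / scale
```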
Main Findings
- Context materially improves strong context-capable methods. The paper reports a 67.1% improvement for Direct Prompt with Llama-3.1-405B-Instruct when context is provided.
- Purely quantitative forecasters are structurally disadvantaged on CiK because the benchmark intentionally includes information that is not present in the numeric history.
- Prompted LLMs can exploit textual context, but they are brittle: a few significant failures can dominate a model’s aggregate RCRPS even when it has a strong average rank (illustrated with toy numbers after this list).
- Many LLM forecasters are slow or Pareto-dominated by cheaper quantitative forecasters, so context support alone is not enough for practical deployment.
- Human and LLM validation support the benchmark’s premise: human annotators marked the context as useful in 94.7% of annotations, and the paper’s LLM critique judged every task as benefiting from its context.
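Toy numbers (not results from the paper) showing how two blow-ups can dominate a model’s mean RCRPS even though it wins most per-task comparisons:

```python
import numpy as np

# Illustrative values only: model A (a brittle LLM forecaster) beats
# model B (a stable baseline) on four of six tasks, but two blow-ups
# dominate A's mean RCRPS.
rcrps_a = np.array([0.10, 0.12, 0.09, 0.11, 8.50, 9.00])
rcrps_b = np.array([0.30, 0.28, 0.31, 0.29, 0.32, 0.30])

print(f"mean RCRPS: A={rcrps_a.mean():.2f}, B={rcrps_b.mean():.2f}")
print(f"A beats B on {np.mean(rcrps_a < rcrps_b):.0%} of tasks")
# -> mean RCRPS: A=2.99, B=0.30
# -> A beats B on 67% of tasks
```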
Gotchas
- CiK is an evaluation benchmark with no training split, not a general pretraining corpus.
- The current benchmark is text-only and univariate; it excludes multivariate time-series scenarios and non-text context such as images, databases, or spatiotemporal inputs.
- Some tasks modify the forecast horizon according to the natural-language context. That is appropriate for testing context use, but it means CiK should not be read as a passive forecasting leaderboard.
- Context that describes an event, constraint, or causal relationship is not automatically an action channel. Map it to exogenous variable, event, intervention, or control input only when the benchmark semantics support that interpretation.
- RCRPS is necessary for CiK because ordinary aggregate forecast metrics can hide failures exactly where the context matters.
- The authors mitigate memorization through live data, derived series, noise, and date shifts, but the paper still treats public-data contamination as a residual risk.
- Direct Prompt depends on structured output and often constrained decoding. Smaller or base LLMs may fail the output format even before their forecasting quality is evaluated (a toy parsing sketch follows this list).
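A toy parser for that format-failure mode: the `<forecast>` tags and `(timestamp, value)` tuple format are assumptions for illustration, not the benchmark’s actual Direct Prompt template:

```python
import re

import numpy as np

FORECAST_BLOCK = re.compile(r"<forecast>(.*?)</forecast>", re.DOTALL)
PAIR = re.compile(r"\(([^,()]+),\s*([-+0-9.eE]+)\)")


def parse_forecast(completion: str, horizon: int) -> np.ndarray | None:
    """Extract one forecast trajectory from an LLM completion.

    Returns None on any format failure, which is the failure mode a
    smaller or base LLM can hit before its forecast quality is scored.
    """
    block = FORECAST_BLOCK.search(completion)
    if block is None:
        return None
    values = [float(v) for _, v in PAIR.findall(block.group(1))]
    if len(values) != horizon:
        return None
    return np.array(values)
```

Sampling many completions and stacking the parsed trajectories yields the empirical predictive distribution that a metric like RCRPS then scores.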
Links Into The Wiki
- Context is Key
- Context-Aided Forecasting
- Time-Series Foundation Models
- Time-Series Benchmark Hygiene
- Unified Multimodal Models
Open Questions
- What is the minimal efficient model interface that can combine numeric history with natural-language context without using a very large LLM at inference time?
- How should context-aided forecasting expand from univariate text-conditioned tasks to multivariate time series with heterogeneous exogenous variables?
- Can automatically generated context-aided datasets reach CiK-style relevance without leaking template artifacts or hallucinated domain rules?