otel-control-gym: A Controlled Stand For Time-Series World Models In DevOps
Status: draft system description based on the latest notes.
Collaboration
If this direction resonates with you, I would be happy to talk with like-minded people, collaborate on research, and work on use-cases together.
- Email: alexander.chemeris@gmail.com
- X: @chemeris
- Telegram: @alexanderchemeris
1. Motivation: Why This Is Needed
Modern DevOps is already good at collecting observability data: metrics, logs, traces, Kubernetes events, service graphs, alerts, and RCA reports. But most of that data answers passive questions:
- what is broken right now;
- where the root cause is most likely to be;
- how metrics will look if nothing changes;
- which service looks anomalous relative to past behavior.
In real operations, the most important question is often different: what will happen if we do something.
For example:
- what will happen to latency and error rate if `checkout` is scaled from 2 to 5 replicas;
- whether restarting `cart` will help, or whether it will make the cache problem worse;
- whether it is safe to continue rolling out a new version;
- when rollback is better, and when increasing CPU limits is enough;
- which sequence of actions returns the SLO to green fastest;
- how to distinguish real service degradation from an expected effect after changing traffic, limits, or workload mix.
Ordinary observability datasets usually record only past events. They are useful for diagnosis and forecasting, but they do not teach a model well enough to understand the consequences of actions. For that, we need trajectories of the form:
```
observation -> action -> new observation -> outcome
```

That is the environment we want to build: not just a file generator, but a controlled gym for DevOps systems. It should be a stand where a microservice application can be run reproducibly, its state can be changed, faults can be injected, graph telemetry time series can be collected, and we can evaluate how well a model predicts and chooses actions.
2. What We Want To Build
We want to build otel-control-gym: a reproducible Kubernetes environment based on OpenTelemetry Demo, Coroot, Chaos Mesh, and a custom orchestrator.
The environment MUST support two modes:
- Offline dataset generation: scripted policies run scenarios, apply actions, inject faults or interventions, and export labeled trajectories.
- Online live-stand evaluation: a trained model or controller connects to the live stand through the same gym API, observes the state, chooses an action, and receives a reward.
The key idea is that dataset generation is a special case of running the gym. We are not building a separate “dataset factory” that lives apart from the evaluation environment. We are building one environment in which scripted policies, baseline controllers, and trained models use the same interface.
The minimal interface is:
```
reset(seed, config) -> initial_observation
observe(window, granularity) -> observation
step(action, dt) -> observation, reward, done, info
```

Where:
- `reset` deploys the stand or returns it to a known state;
- `observe` returns a telemetry window: node features, edge features, events, logs, traces, Kubernetes events, and Coroot findings;
- `step(action, dt)` applies an action and advances the environment by one control step;
- `reward` measures the usefulness of the result in terms of SLO, cost, recovery speed, and safety;
- `done` marks the end of an episode, goal achievement, an emergency state, or a required reset.
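To make the contract concrete, here is a minimal sketch of that interface as a Python class. The names (`StepResult`, `OtelControlGym`) and exact signatures are illustrative assumptions, not a committed API.

```python
# A minimal sketch of the gym interface; names and types are assumptions.
from dataclasses import dataclass
from typing import Any

@dataclass
class StepResult:
    observation: Any   # telemetry window: node/edge features, events, findings
    reward: float      # SLO / cost / recovery-speed / safety signal
    done: bool         # episode end, goal reached, or emergency state
    info: dict         # diagnostics that are not part of the observation

class OtelControlGym:
    def reset(self, seed: int, config: dict) -> Any:
        """Deploy the stand or return it to a known state."""
        raise NotImplementedError

    def observe(self, window_s: int, granularity_s: int) -> Any:
        """Return a telemetry window at the requested granularity."""
        raise NotImplementedError

    def step(self, action: dict, dt_s: int) -> StepResult:
        """Apply an action and advance the environment by one control step."""
        raise NotImplementedError
```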
3. Why This Should Be A Gym, Not An Ordinary Benchmark
A benchmark usually defines a fixed set of inputs and expected answers. That is useful for comparing algorithms, but it is not enough for learning behavior.
In DevOps, behavior matters as much as diagnosis. An operator or automatic controller makes decisions over time:
```
first inspect state -> choose an action -> wait for the effect ->
reassess the situation -> choose the next action
```

This loop cannot be fully tested on a passive dataset, because a passive dataset contains only one historical branch. It cannot answer: “What would have happened if, instead of restart, we had scaled up?”
A gym gives three things that an ordinary dataset does not:
- controlled actions: the model can intervene in the system;
- repeatability: the same scenario can be run with different policies;
- closed-loop evaluation: we can measure not only prediction accuracy, but also decision quality on a live stand.
The first version does not need to imitate all of production. It is more important to build a clean, controlled environment with clear labels than to maximize realism immediately. Realism can be increased after the data contract and the observe -> action -> reward loop are proven.
4. How This Differs From Existing Datasets And Benchmarks
ChronoGraph
ChronoGraph is the closest existing source to the desired data shape: graph-structured multivariate time series with service nodes, service-to-service edges, node features, edge features, and incident labels.
But ChronoGraph remains a passive dataset. It has graph structure and dynamics, but no first-class channel for operator actions or control inputs. A model can learn to forecast future metrics and find anomalies, but it does not learn to answer: “What changes if we execute this action right now?”
For otel-control-gym, ChronoGraph is a good reference for the shape of graph time-series data, but not the final schema.
RCAEval, GAIA/MicroSS, LEMMA-RCA, OpenRCA, AnoMod, ops-lite
These datasets and benchmarks are useful, but they solve neighboring tasks.
RCAEval is good as a reproducible RCA benchmark: it has failure cases, fault types, telemetry, and expected root causes. But its focus is root-cause localization, not controlling the system after observation.
MicroSS, LEMMA-RCA, OpenRCA, and AnoMod provide useful data for AIOps, diagnosis, multimodal anomaly detection, and LLM-agent investigation. But they mostly answer “what happened and where is the cause”, not “which action should be chosen and what consequences will it have”.
ops-lite is interesting because it provides compact causal graphs for RCA. That is valuable for testing causal hypotheses, but it is not a live environment with a control loop.
AIOpsLab
AIOpsLab is closer in spirit: it separates application, task, fault, workload, and evaluator. That decomposition is worth using as an architectural example.
But in our case, the main object is not an RCA agent or a fault diagnosis task. The main object is an action-conditioned time-series world model for controlling a microservice system over time.
Therefore, AIOpsLab can be treated as a future adapter or a source of useful abstractions, but the first data contract should be world-model-oriented:
```
state/action/next_state/reward/label
```

Ordinary TSFMs
Time-series foundation models usually learn to predict future values from past values and context. That is useful for forecasting, anomaly detection, imputation, and sometimes pattern classification.
But an ordinary TSFM is usually passive. It sees the system history, but it does not know which action the operator took, which control inputs changed, or which alternative actions could have led to a different outcome.
otel-control-gym adds the missing part: actions and their consequences.
5. What A Time-Series World Model Is
An ordinary time-series model answers roughly this question:
```
If the past 30 minutes looked like this, what will the next 5 minutes look like?
```

A time-series world model answers a stronger question:

```
If the past 30 minutes looked like this, and we now take action A,
what will the system look like in 5 minutes?
```

The difference looks small, but in DevOps it is fundamental.
An ordinary TSFM models observed dynamics. It can be a very strong forecaster, but it does not necessarily understand the causal role of actions.
A time-series world model models system dynamics as a controlled process. Its inputs are:
- observation: the current and past telemetry state;
- graph: services, dependencies, and service-to-service edges;
- action: the action of an operator or controller;
- control input: a numeric or actuator-like parameter, such as replica count, traffic split, CPU limit, or workload rate;
- intervention: an intentional manipulation whose effect matters causally;
- exogenous variables: external factors not controlled by the policy;
- labels and rewards: ground truth about the scenario and an evaluation signal for the outcome.
The outputs of the world model can include:
- forecasts of node features and edge features after the action;
- probability of SLO violation;
- expected recovery time;
- cost estimate for the action;
- counterfactual prediction: what would have happened under a different action;
- latent state that a controller can use for planning.
In other words, an ordinary TSFM is like a very smart forecast chart. A time-series world model is closer to a consequence simulator: it learns the internal dynamics of the system well enough to support planning.
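As a sketch of how these inputs and outputs could be carried around in code, the dataclasses below mirror the two lists above. All field names and shapes are assumptions for illustration, not a finalized schema.

```python
# Illustrative input/output containers for the world model; every field name
# and shape here is an assumption, not a committed contract.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class WorldModelInput:
    node_features: np.ndarray   # [time, node, feature]
    edge_features: np.ndarray   # [time, edge, feature]
    graph: dict                 # services and service-to-service edges
    action: dict                # operator or controller action
    control_inputs: dict        # replicas, traffic split, limits, workload rate
    exogenous: dict = field(default_factory=dict)  # factors outside the policy

@dataclass
class WorldModelOutput:
    node_forecast: np.ndarray   # predicted node features after the action
    edge_forecast: np.ndarray   # predicted edge features after the action
    slo_violation_prob: float
    expected_recovery_s: float
    action_cost: float
    latent_state: np.ndarray    # usable by a controller for planning
```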
6. New Applications This Enables
What-if Analysis Before Change
An operator wants to know what will happen after changing limits, scale, or version. A world model can estimate several options before the action is applied:
```
scale checkout +2 replicas
restart cart
rollback recommendation
increase CPU limit for payment
shift 20% traffic to old version
```

This does not replace rollout policy or SRE procedure, but it adds another layer for risk forecasting.
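A hedged sketch of how such a what-if query could look in code: score each candidate action with a trained world model and rank by predicted risk. The `model.predict` call and the output fields are assumptions borrowed from the world-model outputs in section 5.

```python
# Rank candidate actions by predicted SLO risk; a sketch, not a committed API.
candidates = [
    {"type": "scale", "target": "checkout", "delta_replicas": +2},
    {"type": "restart", "target": "cart"},
    {"type": "rollback", "target": "recommendation"},
    {"type": "set_cpu_limit", "target": "payment", "cpu": "750m"},
    {"type": "traffic_split", "target": "frontend", "old_version_pct": 20},
]

def rank_candidates(model, observation, candidates):
    scored = []
    for action in candidates:
        out = model.predict(observation, action)  # counterfactual rollout
        scored.append((out.slo_violation_prob, out.action_cost, action))
    # lowest predicted SLO risk first, predicted cost as the tie-breaker
    return sorted(scored, key=lambda t: (t[0], t[1]))
```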
Safe Autoscaling And Resource Control
An ordinary autoscaler reacts to metrics. A world-model-based controller can account for delayed effects and graph effects. For example, scaling up one service can move the bottleneck to a dependency, increase queueing, or change the error pattern on another edge.
The gym makes it possible to train and test a controller that optimizes a complex objective, not just one CPU metric:
- latency;
- error rate;
- SLO burn rate;
- resource cost;
- stability;
- recovery time;
- penalty for unsafe actions.
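As an illustration, a reward combining these components could look like the sketch below. The component names and weights are placeholders that would have to be tuned per task.

```python
# A minimal reward sketch over the components listed above; all names and
# weights are illustrative assumptions.
def compute_reward(m: dict, weights: dict | None = None) -> float:
    w = weights or {"latency": 1.0, "errors": 1.0, "burn": 1.0,
                    "cost": 0.3, "stability": 0.5, "unsafe": 5.0}
    reward = 0.0
    reward -= w["latency"] * max(0.0, m["p95_latency_s"] - m["latency_slo_s"])
    reward -= w["errors"] * m["error_rate"]
    reward -= w["burn"] * m["slo_burn_rate"]
    reward -= w["cost"] * m["normalized_resource_cost"]
    reward -= w["stability"] * m["restarts_in_bucket"]
    reward -= w["unsafe"] * m["unsafe_action_count"]
    return reward
```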
Rollout And Rollback Decision Support
During rollout, a new version can have weak, delayed, or workload-dependent effects. Passive anomaly detection can say “the metrics are strange”. A world model should help answer:
- continue rollout;
- stop rollout;
- rollback;
- increase resources;
- change traffic split;
- wait, because the effect is expected and temporary.
Incident Recovery Planning
During an incident, it is important not only to find the root cause, but also to choose the recovery order. A world model can compare possible sequences:
```
restart -> scale -> rollback
rollback -> scale -> restart
scale -> wait -> rollback
```

This is a planning problem, not only RCA.
Counterfactual RCA
RCA usually searches for the most likely cause. A world model adds the ability to test a hypothesis with an action or simulated action:
```
If the cause is cache, then recovering cache should reduce latency on edges
that go through cart.
```

This does not replace classical RCA, but makes it more operational: a cause should explain not only the past, but also the expected effect of remediation.
Offline Training For Live Control
Without a gym, it is dangerous to train a controller directly in production. A gym provides an intermediate step:
- collect controlled episodes;
- train a world model offline;
- test forecasting and action-effect prediction;
- test the policy in the live stand;
- only then consider more realistic environments.
7. How The Gym Helps Train A Model
For someone with a DevOps background, training can be described as follows.
We run the stand many times under different conditions. Each run records:
```
what the service graph was
what the workload was
what metrics, logs, and traces were observed
which action was applied
which fault or intervention was injected
what happened afterward
which outcome is considered good or bad
```

The model receives not just “metrics for the past hour”, but pairs:

```
(observation, action) -> next_observation
```

And additional signals:

```
label: what exactly we did and where the cause was
reward: how good the outcome was
graph: how services are connected
```

This supports several model layers:
- next-state model: predict the next node and edge features;
- action-effect model: predict the difference between “do nothing” and a concrete action;
- anomaly/RCA head: find the fault type or affected service;
- reward model: estimate the usefulness of an action;
- controller: choose the action that maximizes reward.
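For instance, the next-state layer can start as a deliberately small action-conditioned regressor. The sketch below (assuming PyTorch, flattened observation windows, and one-hot actions) shows the baseline shape, not a recommended architecture.

```python
# A small action-conditioned next-state baseline: predict the next
# node-feature vector from (flattened observation window, one-hot action).
import torch
import torch.nn as nn

class ActionConditionedNextState(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, obs_dim),  # next node features
        )

    def forward(self, obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, action], dim=-1))

# Training consumes (observation, action) -> next_observation pairs:
# loss = torch.nn.functional.mse_loss(model(obs, action), next_obs)
```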
The important point is that the same gym is needed for dataset generation and for testing the trained model. That protects against the situation where a model passes an offline benchmark but cannot work in a closed loop.
8. System Architecture
The system has four planes.
8.1 Application Plane
The application plane is the microservice application where observations are generated. The base candidate is OpenTelemetry Demo.
It is convenient because it already includes:
- several services in a web-store domain;
- HTTP/gRPC dependencies;
- a load generator;
- traces, metrics, and logs;
- services with synchronous and asynchronous dependencies;
- infrastructure components such as cache, queue, and database.
For the MVP, the application can run in several service configurations:
- `small-core`: frontend, cart, checkout, product catalog, recommendation, payment, shipping, cache;
- `medium-store`: `small-core` plus ad, currency, email, quote, frontend proxy, load generator;
- `async-store`: `medium-store` plus queue, accounting, fraud detection;
- `review-store`: `medium-store` plus product reviews, PostgreSQL, LLM/flagd, if that does not make the stand too expensive.
Different configurations are needed so the model does not overfit to one graph size.
8.2 Control Plane
The control plane is a Python orchestrator plus Kubernetes and Chaos Mesh integrations.
The orchestrator MUST:
- manage `reset`, `observe`, `step`, `reward`, and `done`;
- store the seed, config, chart versions, and Git SHAs;
- apply ordinary DevOps actions through the Kubernetes API;
- apply fault interventions through Chaos Mesh;
- record action logs before applying the action;
- check action status after applying it;
- record ground-truth labels independently of Coroot diagnosis;
- run scripted policies for dataset generation;
- run online evaluation for a trained controller.
Ordinary DevOps actions matter as much as failures. The MVP should support:
- scale deployment up/down;
- add/remove pods through replica count;
- restart a deployment or individual workload;
- enable/disable an optional service group;
- rollout config change;
- rollback config change;
- change CPU/memory requests/limits;
- change autoscaler target/bounds;
- shift traffic between versions once canary support exists;
- change workload generator rate or request mix as a controlled input.
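As an example of how the orchestrator could apply the first of these actions through the Kubernetes API, here is a sketch using the official Python client; the namespace and the readiness-check detail are assumptions.

```python
# Sketch: scale a deployment via the Kubernetes API. The "otel-demo"
# namespace is an assumption about how the stand is deployed.
from kubernetes import client, config

def scale_deployment(name: str, replicas: int, namespace: str = "otel-demo"):
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    body = {"spec": {"replicas": replicas}}
    apps.patch_namespaced_deployment_scale(name, namespace, body)

# The orchestrator would log the action before applying it and verify the
# result afterward, e.g. via apps.read_namespaced_deployment(name, namespace).
```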
Fault interventions are needed for learning diagnosis and recovery:
- CPU pressure;
- memory pressure or leak-like behavior;
- network delay on an edge;
- packet loss on an edge;
- network partition on an edge;
- pod kill/restart loop;
- cache/database dependency outage;
- bad config or wrong target port at a later stage.
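Fault interventions can be submitted as Chaos Mesh custom resources. The sketch below injects network delay on a target service via the Kubernetes `CustomObjectsApi`; the selector labels and timing values are examples, not a fixed schema.

```python
# Sketch: inject network delay through a Chaos Mesh NetworkChaos resource.
# Selector labels, latency, and duration are illustrative values.
from kubernetes import client, config

def inject_network_delay(target_app: str, latency: str = "200ms",
                         duration: str = "120s", namespace: str = "otel-demo"):
    config.load_kube_config()
    chaos = {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": f"delay-{target_app}", "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",
            "selector": {"labelSelectors": {"app.kubernetes.io/name": target_app}},
            "delay": {"latency": latency},
            "duration": duration,
        },
    }
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="chaos-mesh.org", version="v1alpha1",
        namespace=namespace, plural="networkchaos", body=chaos)
```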
8.3 Telemetry Plane
The telemetry plane collects observations.
At minimum, it needs:
- Prometheus metrics for node and edge numeric features;
- OpenTelemetry Collector for metrics, traces, and logs;
- Kubernetes events;
- Coroot findings as an additional feature stream;
- raw or aggregated traces;
- log patterns and error summaries.
Coroot diagnosis MUST NOT replace ground-truth labels. It should be saved as:
- weak labels;
- baseline observability-tool output;
- additional features;
- an explanatory layer for model error analysis.
Ground truth should come from the orchestrator: it knows which action was applied, when, to which target, and with which parameters.
8.4 Export And Evaluation Plane
The export plane normalizes telemetry into time buckets and writes the canonical dataset.
The canonical format should be world-model-oriented:
```
run_<id>/
  metadata.json
  graph.json
  node_features.parquet
  edge_features.parquet
  actions.parquet
  events.parquet
  labels.parquet
  rewards.parquet
  logs.jsonl.zst
  spans.parquet
  k8s_events.jsonl.zst
  coroot_findings.json
```

Adapters can export additional views:
- ChronoGraph-like graph time series for forecasting/anomaly experiments;
- RCAEval-style flat DataFrame for RCA baselines;
- compact causal graph view for graph/path scoring.
But those adapters should not dictate the main contract. The main contract is:
```
state/action/next_state/reward/label
```

9. Data Contract
Every run should be self-contained. If the run directory is opened a year later, it should be clear:
- which application was deployed;
- which gym version was used;
- which Helm charts and configs were applied;
- which seed was used;
- which workload profile was running;
- which actions and interventions were applied;
- which telemetry streams are available;
- how temporal windows were aggregated;
- which labels and rewards are ground truth.
9.1 metadata.json
metadata.json contains:
- `run_id`;
- `seed`;
- `gym_api_version`;
- `chart_versions`;
- `git_shas`;
- `kubernetes_version`;
- `collector_config_hash`;
- `service_config`;
- `workload_profile`;
- `policy_name`;
- `run_start`;
- `run_end`;
- `action_schedule`;
- `intervention_schedule`;
- `granularity_config`.
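Illustrative metadata.json contents could look like the following; the field names come from the list above, while every value is a placeholder example, not a real run.

```python
import json

# Illustrative metadata.json contents; all values are placeholders.
metadata = {
    "run_id": "run_000123",
    "seed": 42,
    "gym_api_version": "0.1.0",
    "chart_versions": {"opentelemetry-demo": "<version>", "coroot": "<version>"},
    "git_shas": {"orchestrator": "<sha>"},
    "kubernetes_version": "<version>",
    "collector_config_hash": "<hash>",
    "service_config": "small-core",
    "workload_profile": "steady",
    "policy_name": "scripted-baseline",
    "run_start": "2025-01-01T00:00:00Z",
    "run_end": "2025-01-01T00:30:00Z",
    "action_schedule": [{"t_s": 600, "action_type": "scale_deployment",
                         "target_service": "checkout"}],
    "intervention_schedule": [],
    "granularity_config": {"control_step_seconds": 30,
                           "observation_bucket_seconds": 10},
}
print(json.dumps(metadata, indent=2))
```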
9.2 graph.json
graph.json describes the service graph:
- nodes: services, workloads, infrastructure dependencies;
- edges: service-to-service calls, protocols, expected dependencies;
- mapping to Kubernetes names;
- optional attributes: namespace, deployment, container, endpoint, protocol.
This is static or slowly changing context for the time-series model.
9.3 node_features.parquet
node_features.parquet contains time-bucketed features by service:
- `time`;
- `run_id`;
- `service`;
- request rate;
- error rate;
- latency summaries;
- CPU usage;
- memory usage;
- restarts;
- network in/out;
- saturation indicators.
9.4 edge_features.parquet
edge_features.parquet contains features by directed edge:
- `time`;
- `run_id`;
- `source`;
- `target`;
- `protocol`;
- request rate;
- error rate;
- latency summaries;
- status counters;
- retry indicators, if available.
9.5 actions.parquet
actions.parquet is the central table for the world model.
Fields:
- `time`;
- `run_id`;
- `actor`: scripted policy, human, model, or baseline;
- `action_id`;
- `action_type`;
- `target_service` or `target_edge`;
- `parameters`;
- `duration`;
- `status`;
- `precheck_status`;
- `postcheck_status`.
If an action did not apply, that is also data. Such cases must not be silently dropped, otherwise the model will learn from incorrect causal labels.
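For illustration, a single actions.parquet row could be written with pandas as below; column names follow the field list, values are placeholders, and failed or partial actions keep their rows with the appropriate `status`.

```python
# Illustrative actions.parquet row; all values are placeholders.
import pandas as pd

actions = pd.DataFrame([{
    "time": "2025-01-01T00:10:00Z",
    "run_id": "run_000123",
    "actor": "scripted_policy",
    "action_id": "act_0001",
    "action_type": "scale_deployment",
    "target_service": "checkout",
    "parameters": '{"replicas_from": 2, "replicas_to": 5}',
    "duration": 0,
    "status": "applied",       # failed/partial actions keep their rows too
    "precheck_status": "ok",
    "postcheck_status": "ok",
}])
actions.to_parquet("actions.parquet")
```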
9.6 events.parquet
events.parquet contains observed events:
- Kubernetes events;
- rollout events;
- pod lifecycle events;
- Coroot symptom events;
- workload schedule events;
- `caused_by_action_id`, if the event is linked to a known action.
9.7 labels.parquet
labels.parquet contains ground truth:
- `run_id`;
- `action_id`;
- `intervention_id`;
- `is_fault`;
- `fault_type`;
- `target`;
- `start`;
- `end`;
- `intensity`;
- `root_cause_kind`;
- `expected_symptoms`.
Labels should describe not only failures, but also normal operator actions. For a world model, an ordinary scale up is as important as a fault injection.
9.8 rewards.parquet
rewards.parquet contains the signal for control tasks:
- `time`;
- `run_id`;
- `reward_name`;
- `value`;
- `components`;
- `action_id`.
Reward can be computed from:
- SLO satisfaction;
- latency/error budget;
- recovery time;
- resource cost;
- stability;
- penalties for unsafe actions.
9.9 coroot_findings.json
coroot_findings.json contains Coroot diagnosis output:
- timestamp;
- affected services;
- affected edges;
- symptom class;
- severity or confidence, if available;
- explanation text;
- suggested root cause, if available.
This is not ground truth. It is observability-tool output that can be compared with labels or used as a weak signal.
10. Temporal Granularity
The gym MUST allow several time scales to be configured:
- `control_step_seconds`: how often the controller can act;
- `observation_bucket_seconds`: how metrics are aggregated;
- `observation_window_seconds`: how much history the model sees;
- `reward_bucket_seconds`: how often reward is computed;
- `scrape_interval_seconds`: metric collection frequency;
- `trace_aggregation_seconds`: trace aggregation window;
- `log_pattern_window_seconds`: log aggregation window;
- `action_duration_seconds`: duration of actions/interventions;
- `warmup_seconds`;
- `cooldown_seconds`.
Control steps and telemetry buckets do not have to match. For example, metrics can be aggregated every 10 seconds while the controller acts every 30 or 60 seconds. This is closer to real DevOps, where observability often has one frequency and decisions are made at another.
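A configuration matching that example (10-second telemetry buckets, a 30-second control step) could look like the sketch below; the remaining values are illustrative defaults, not recommendations.

```python
# Example granularity configuration; values other than the bucket and control
# step from the text above are illustrative defaults.
granularity_config = {
    "control_step_seconds": 30,
    "observation_bucket_seconds": 10,
    "observation_window_seconds": 1800,  # the model sees the past 30 minutes
    "reward_bucket_seconds": 30,
    "scrape_interval_seconds": 10,
    "trace_aggregation_seconds": 10,
    "log_pattern_window_seconds": 60,
    "action_duration_seconds": 120,
    "warmup_seconds": 300,
    "cooldown_seconds": 300,
}
```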
11. Workloads And Scenarios
The MVP should include several workload profiles:
- `steady`: stable load;
- `diurnal`: smooth changes similar to a daily profile;
- `burst`: short spikes;
- `checkout-heavy`: more write-like flows;
- `read-heavy`: more catalog/product flows.
Temporal diversity is as important as fault-type variety. For each action or intervention, vary:
- start time after warmup;
- duration: transient, medium, slow-burn;
- intensity: mild, visible, severe;
- presence of a recovery phase;
- no-fault baseline runs;
- normal-control episodes without failures;
- multi-action episodes after single-action scenarios are validated.
Without no-fault and normal-control episodes, the model will learn that almost every action is associated with an incident. That is a common dataset mistake in incident analysis.
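One way to enumerate such episodes is a small scenario grid over the variation axes above, with `None` faults standing in for no-fault baselines. A sketch, assuming illustrative fault names:

```python
# Scenario-matrix sketch over the variation axes; fault names are examples.
import itertools
import random

def scenario_grid(seed: int = 0):
    rng = random.Random(seed)
    faults = [None, "cpu_pressure", "network_delay"]  # None = no-fault baseline
    durations = ["transient", "medium", "slow_burn"]
    intensities = ["mild", "visible", "severe"]
    for fault, duration, intensity in itertools.product(
            faults, durations, intensities):
        yield {
            "fault": fault,
            "duration": duration,
            "intensity": intensity if fault else None,
            "start_after_warmup_s": rng.randint(60, 600),
            "with_recovery_phase": rng.random() < 0.5,
        }
```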
12. MVP: What Must Be Ready First
The first usable MVP should prove the correctness of the loop, not scale.
Minimum acceptance criteria:
- 4 service configurations;
- at least 5 ordinary DevOps actions;
- at least 5 fault/intervention types;
- at least 3 workload profiles;
- at least 100 labeled runs, including no-fault baselines;
- at least one online loop through `observe/action/step/reward`;
- graph per run;
- node numeric features;
- edge numeric features;
- action logs;
- labels;
- rewards;
- baseline scripts for:
- action-conditioned next-state prediction;
- passive forecasting;
- anomaly detection;
- RCA-style root-cause ranking.
The first milestone can be treated as successful if, after 100-500 episodes, we can answer practical questions:
- which telemetry streams are actually useful;
- which actions have a measurable effect;
- where labels diverge from observed symptoms;
- how hard it is to predict edge-level effects;
- whether a simple model can distinguish action consequences from an exogenous fault.
13. How This Differs From Production Observability
otel-control-gym should not pretend to be full production.
Production has:
- unknown configurations;
- human changes without perfect labels;
- business events;
- complex release processes;
- incomplete telemetry;
- third-party dependencies;
- seasonality;
- organizational constraints.
The gym should be different at the first stage: small, clean, controlled, and reproducible. Its value is not that it is “like production”, but that it has what production data usually does not have:
- precise action labels;
- precise intervention labels;
- repeatable scenarios;
- ability to run an alternative policy;
- ability to do online evaluation without risking real users.
After that, it can move toward realism:
- more services;
- more workload profiles;
- noisy telemetry;
- partial observability;
- irregular human-like actions;
- production traces as a background distribution;
- domain randomization.
14. Main Risks
Local Kubernetes Limitations
Some eBPF and observability capabilities can work poorly in Docker-in-Docker or local Minikube-like environments. For a stable MVP, it is better to use k3s/kind on a real VM or a remote Kubernetes cluster.
Label Drift
An action or fault may fail to apply, apply only partially, or create an effect somewhere other than expected. Therefore, the orchestrator should record status, precheck, and postcheck. Failed interventions should either be marked explicitly or excluded through a verifiable rule.
Telemetry Gaps
Traces and logs may be incomplete. The MVP should first make metrics and the service graph reliable, then add traces/logs as additional streams.
Confounding Between Actions And Faults
If faults always appear together with remediation actions, the model can learn false relationships. The dataset needs no-fault runs, normal-control runs, and scenarios where actions are applied without incidents.
Overly Synthetic Faults
Chaos Mesh faults do not cover all of production. That is acceptable for a curriculum. Clean interventions are needed first, then more realistic failure modes.
Data Volume Cost
Traces, logs, and frequent buckets will quickly increase dataset size. Start with short runs, a metrics-first format, and controlled aggregation.
15. Recommended Implementation Path
The first path should be pragmatic:
- Use the Coroot chart as a fast way to deploy OpenTelemetry Demo, Coroot, and Chaos Mesh.
- Add a Python orchestrator on top of the Kubernetes API and Chaos Mesh.
- Implement a real gym API: `reset`, `observe`, `step`, `reward`, `done`.
- Start with metrics-first node and edge features.
- Save action/intervention logs as ground truth.
- Save Coroot findings as weak labels and baseline outputs.
- Define the canonical world-model dataset contract.
- Add ChronoGraph-like and RCAEval-style adapters as secondary views.
- Run 100-500 short episodes.
- Train simple baselines and check that an action-conditioned model outperforms a passive model where the action actually changes dynamics.
16. Bottom Line
otel-control-gym is needed to move from passive observability to training models that understand the consequences of actions.
Existing datasets cover forecasting, anomaly detection, RCA, and LLM-agent diagnosis well. But they usually lack the controlled loop:
```
observe -> action -> next observation -> reward
```

A time-series world model differs from an ordinary TSFM because it models not only the continuation of a time series, but also system dynamics under actions, control inputs, and interventions. For DevOps, this enables a new class of applications: what-if analysis, safe control, rollout/rollback decision support, incident recovery planning, and offline controller training before live-stand evaluation.
The gym is the bridge between dataset and production: controlled enough to provide clean labels and repeatable experiments, and live enough to test the closed-loop behavior of a model on a real Kubernetes application.