Graph Structure As Transformer Context

Summary

This page synthesizes the sources on a narrow design question: how should a mostly standard Transformer receive graph structure as context, especially for the Kubernetes OTEL Control Gym (k8s-otel-control-gym)?

The local answer is not “use a graph Transformer” as a single category. The useful design space is a set of graph-context interfaces:

  • pairwise structural attention bias;
  • explicit node and edge tokens;
  • reversible graph serialization or graph-to-sequence traversal;
  • learned graph-token vocabularies;
  • visibility masks or sparse relation masks, once a sourced baseline for them is added.

These interfaces can make graph structure visible to a Transformer, but they are not by themselves evidence for an action-conditioned world model. The OTEL target remains:

observation + graph context + action/control input
  -> next observation + reward/label

Mechanism Map

| Pattern | Main source | How graph structure enters the Transformer | OTEL adaptation | Main risk |
| --- | --- | --- | --- | --- |
| Pairwise attention bias | Graphormer | Shortest-path distance, degree/centrality, and edge-path features become attention-bias terms. | Bias attention by service distance, dependency direction, edge type, and call-path structure. | Strong graph prior, but modifies attention and has quadratic scaling pressure. |
| Node and edge tokens | TokenGT | Nodes and edges are ordinary tokens; endpoint identifiers and type identifiers expose incidence. | Services/resources become node tokens; directed dependencies become edge tokens; action targets can point to tokens. | Identifier stability under topology drift, service renames, autoscaling, and ephemeral Kubernetes objects. |
| Reversible graph serialization | GraphGPT | Eulerian or semi-Eulerian paths serialize nodes, edges, and attributes into a reversible sequence. | Serialize whole service graphs, ego-subgraphs around affected services, trace-induced subgraphs, or action-targeted subgraphs. | Path stochasticity and subgraph sampling can disrupt temporal alignment and incident localization. |
| Reversible serialization plus BPE | Graph Tokenization | Frequency-guided graph serialization plus BPE turns recurring substructures into discrete graph tokens. | Learn reusable service-topology motifs such as fan-out, queue, cache, database dependency, or rollout target neighborhoods. | Continuous telemetry and timestamps still need typed numeric/time-series channels. |
| Learned graph-token vocabulary | GQT | A graph-specialized tokenizer learns quantized hierarchical tokens before Transformer processing. | Use as an optional graph-front-end for service topology and local neighborhoods. | Not the purest no-GNN path: the tokenizer can use graph-specialized machinery and may hide rare operational edges. |
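
To make the pairwise-attention-bias row concrete, here is a minimal sketch of how directed shortest-path distance could be turned into bias-table indices for a service graph. It is not Graphormer's implementation: the service names, the `max_hops` cap, and the extra bucket for unreachable pairs are assumptions chosen for illustration. In a model, each index would select a learned scalar that is added to the attention logit for that token pair.

```python
from collections import deque


def shortest_path_hops(nodes, edges):
    """All-pairs directed hop counts via BFS; unreachable pairs map to None."""
    adjacency = {n: [] for n in nodes}
    for src, dst in edges:
        adjacency[src].append(dst)
    hops = {}
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in adjacency[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        for target in nodes:
            hops[(start, target)] = dist.get(target)
    return hops


def bias_bucket(hop, max_hops=3):
    """Map a hop count to a bias-table index.

    Distances above max_hops share one bucket; index max_hops + 1 is
    reserved for unreachable pairs (an assumption of this sketch).
    """
    if hop is None:
        return max_hops + 1
    return min(hop, max_hops)


def bias_index_matrix(nodes, edges, max_hops=3):
    """Row = query node, column = key node, value = bias-table index."""
    hops = shortest_path_hops(nodes, edges)
    return [[bias_bucket(hops[(a, b)], max_hops) for b in nodes] for a in nodes]


# Hypothetical service dependency graph (caller -> callee).
services = ["frontend", "cart", "redis", "checkout"]
calls = [("frontend", "cart"), ("cart", "redis"), ("frontend", "checkout")]
matrix = bias_index_matrix(services, calls)
for name, row in zip(services, matrix):
    print(name, row)
```

Because the distances are directed, `frontend` sees `redis` at two hops while `redis` sees `frontend` as unreachable. That asymmetry is one way to expose dependency direction. The matrix is also dense over all node pairs, which is where the quadratic scaling pressure in the table comes from.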

OTEL Graph Time-Series Contract

For k8s-otel-control-gym, graph structure SHOULD be treated as context, not as an anonymous channel order. The minimum model input contract is:

graph segment:
  graph.json or topology snapshot
  service/resource metadata
  edge type, direction, protocol, endpoint, and ownership metadata
 
observation segment:
  node_features.parquet time patches
  edge_features.parquet time patches
  selected event/log/trace streams
 
action segment:
  action_type
  target_service or target_edge
  parameters / control input
  status, precheck, postcheck
 
target:
  next node/edge observations
  reward
  labels or diagnosis fields

The model can then be tested on whether graph context improves action-effect prediction, not merely passive forecasting.
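
The contract above can be sketched as typed records with a small consistency check. This is only an illustration under stated assumptions: the class names, field types, and the `validate` helper are hypothetical and not part of k8s-otel-control-gym, and real patches would come from the parquet files rather than inline lists.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GraphSegment:
    nodes: dict  # service/resource id -> metadata dict
    edges: list  # dicts with "src", "dst", "type" (direction is src -> dst)


@dataclass
class ObservationSegment:
    node_patches: dict  # node id -> list of floats (one time patch)
    edge_patches: dict  # (src, dst) -> list of floats
    events: list = field(default_factory=list)


@dataclass
class ActionSegment:
    action_type: str
    target_service: Optional[str] = None
    target_edge: Optional[tuple] = None
    parameters: dict = field(default_factory=dict)
    status: str = "unknown"
    precheck: Optional[str] = None
    postcheck: Optional[str] = None


@dataclass
class Target:
    next_node_patches: dict
    next_edge_patches: dict
    reward: float
    labels: dict = field(default_factory=dict)


@dataclass
class Transition:
    graph: GraphSegment
    observation: ObservationSegment
    action: ActionSegment
    target: Target


def validate(t: Transition) -> list:
    """Return human-readable contract violations; an empty list means none found."""
    problems = []
    node_ids = set(t.graph.nodes)
    edge_ids = {(e["src"], e["dst"]) for e in t.graph.edges}
    for src, dst in edge_ids:
        if src not in node_ids or dst not in node_ids:
            problems.append(f"edge {src}->{dst} has an endpoint missing from nodes")
    for node_id in t.observation.node_patches:
        if node_id not in node_ids:
            problems.append(f"observation for unknown node {node_id}")
    for edge_id in t.observation.edge_patches:
        if edge_id not in edge_ids:
            problems.append(f"observation for unknown edge {edge_id}")
    if t.action.target_service is not None and t.action.target_service not in node_ids:
        problems.append(f"action targets unknown service {t.action.target_service}")
    if t.action.target_edge is not None and t.action.target_edge not in edge_ids:
        problems.append(f"action targets unknown edge {t.action.target_edge}")
    return problems


# Hypothetical two-service episode step.
graph = GraphSegment(
    nodes={"frontend": {"kind": "Deployment"}, "cart": {"kind": "Deployment"}},
    edges=[{"src": "frontend", "dst": "cart", "type": "http"}],
)
obs = ObservationSegment(
    node_patches={"cart": [0.2, 0.4, 0.9]},
    edge_patches={("frontend", "cart"): [120.0, 180.0, 400.0]},
)
good = Transition(graph, obs, ActionSegment("scale_up", target_service="cart"),
                  Target({"cart": [0.5]}, {("frontend", "cart"): [150.0]}, reward=1.0))
bad = Transition(graph, obs, ActionSegment("restart", target_service="payments"),
                 Target({}, {}, reward=0.0))
print(validate(good))
print(validate(bad))
```

The check ties every observation and every action target to an element of the graph segment, which is what keeps the graph from degrading into an anonymous channel order.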

The first k8s-otel-control-gym model suite SHOULD compare graph-context interfaces under matched data, context length, and compute:

  1. No-graph / ID-only baseline. Flatten node and edge observations with service IDs only. This is the control that tests whether graph structure actually helps.
  2. TokenGT-style node/edge tokens. Make services/resources node tokens and directed dependencies edge tokens. Add time patches and action/control-input tokens explicitly.
  3. Graphormer-style attention bias. Keep the token layout simple but add directed shortest-path distance, dependency direction, edge type, and action-target relation as pairwise biases.
  4. Graph serialization baseline. Serialize graph snapshots or subgraphs with GraphGPT/Graph Tokenization-style traversal, then interleave telemetry and action segments.
  5. Learned graph-tokenizer front-end. Try a GQT-like vocabulary only after the simpler baselines establish what graph details must be preserved.

The pragmatic first implementation should be a hybrid of explicit node/edge tokens plus relation biases. It keeps the Transformer ordinary enough for standard sequence tooling while making service topology visible before investing in learned graph vocabularies.
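
A minimal sketch of that hybrid layout follows, assuming a single action that targets one service. Node tokens carry their own identifier twice and edge tokens carry both endpoint identifiers, in the spirit of TokenGT. The relation labels, the dictionary token format, and the helper names are hypothetical; a real model would map each label to a learned bias and each token to an embedding.

```python
def build_tokens(nodes, edges, action):
    """Lay out node tokens, then edge tokens, then one action token."""
    tokens = []
    index = {}
    for node_id in nodes:
        index[("node", node_id)] = len(tokens)
        tokens.append({"kind": "node", "id": node_id,
                       "endpoints": (node_id, node_id)})
    for src, dst, edge_type in edges:
        index[("edge", (src, dst))] = len(tokens)
        tokens.append({"kind": "edge", "id": (src, dst), "type": edge_type,
                       "endpoints": (src, dst)})
    # The action token points at its target by token position.
    target_index = index[("node", action["target_service"])]
    tokens.append({"kind": "action", "id": action["action_type"],
                   "target_index": target_index})
    return tokens


def relation(tokens, i, j):
    """Label the pair (query i, key j) for a relation-bias lookup.

    Only node-to-edge incidence and action-target links are labeled here;
    every other pair falls back to "none".
    """
    if i == j:
        return "self"
    a, b = tokens[i], tokens[j]
    for tok, other_index in ((a, j), (b, i)):
        if tok["kind"] == "action" and tok["target_index"] == other_index:
            return "action_target"
    if a["kind"] == "node" and b["kind"] == "edge":
        if b["endpoints"][0] == a["id"]:
            return "source_of"
        if b["endpoints"][1] == a["id"]:
            return "destination_of"
    return "none"


# Hypothetical two-service graph with one scaling action.
tokens = build_tokens(
    nodes=["frontend", "cart"],
    edges=[("frontend", "cart", "http")],
    action={"action_type": "scale_up", "target_service": "cart"},
)
labels = [[relation(tokens, i, j) for j in range(len(tokens))]
          for i in range(len(tokens))]
for row in labels:
    print(row)
```

The token sequence stays an ordinary list, so standard sequence tooling applies, while the label matrix carries the topology and the action target as pairwise context.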

Action-Conditioned Boundary

All five source papers are graph-learning or graph-context sources. None of them tests controlled DevOps trajectories, interventions, rewards, or counterfactual rollouts.

That boundary matters. A model that encodes the service graph well is still only a graph-aware passive dynamics model until it receives logged actions and is evaluated on:

  • next node/edge feature prediction after an action;
  • action delta versus NOOP;
  • candidate-action ranking by reward;
  • counterfactual prediction under alternative actions;
  • closed-loop regret or recovery quality on the live stand.
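
Two of these checks, action delta versus NOOP and candidate-action ranking by reward, can be sketched against any model that maps (observation, graph, action) to (next observation, reward). The `toy_model` below is a hand-written stand-in with made-up dynamics, not a trained model, and the metric name `cart.latency_ms` and the action names are assumptions for illustration.

```python
def action_delta(model, observation, graph, action, noop):
    """Per-feature predicted difference between taking an action and NOOP."""
    acted, _ = model(observation, graph, action)
    idle, _ = model(observation, graph, noop)
    return {key: acted[key] - idle[key] for key in acted}


def rank_actions(model, observation, graph, candidates):
    """Order candidate actions by predicted reward, best first."""
    scored = [(model(observation, graph, a)[1], a) for a in candidates]
    # Sort on the reward only; the action dicts themselves are never compared.
    return [a for _, a in sorted(scored, key=lambda pair: pair[0], reverse=True)]


def toy_model(observation, graph, action):
    """Stand-in for a trained model: fixed, invented dynamics for illustration."""
    latency = observation["cart.latency_ms"]
    if action["action_type"] == "scale_up":
        latency = latency * 0.5
    elif action["action_type"] == "restart":
        latency = latency * 0.8
    next_observation = {"cart.latency_ms": latency}
    reward = -latency  # lower latency is better in this toy setup
    return next_observation, reward


observation = {"cart.latency_ms": 200.0}
graph = {"nodes": ["frontend", "cart"], "edges": [("frontend", "cart")]}
noop = {"action_type": "noop"}
scale_up = {"action_type": "scale_up", "target_service": "cart"}
restart = {"action_type": "restart", "target_service": "cart"}

delta = action_delta(toy_model, observation, graph, scale_up, noop)
ranking = rank_actions(toy_model, observation, graph, [noop, restart, scale_up])
print(delta)
print([a["action_type"] for a in ranking])
```

A passive forecaster that ignores the action input would produce a delta of zero for every action, which is why the NOOP comparison separates action-conditioned models from graph-aware passive ones.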

Relation To Foundation TSFM Agenda

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Context interface | partially closes | The source cluster gives concrete ways to encode graph structure as Transformer context. | Needs an OTEL schema that joins graph context with observation windows, events, and action history. |
| Native multivariate encoding and high-channel scaling | adjacent | Node/edge tokens, graph serialization, and graph-token vocabularies avoid flattening service telemetry into anonymous channels. | Needs graph time-series experiments with topology drift, missing streams, and high-cardinality telemetry. |
| Control and counterfactuals | insufficient evidence | Graph context is necessary for action-conditioned observability world models. | Needs observation + graph + action/control input -> next observation/reward experiments. |

Open Questions

  • Which interface wins under matched compute for OTEL episodes: node/edge tokens, pairwise attention bias, graph serialization, learned graph tokens, or a hybrid?
  • How should time be represented: one graph-token set per timestep, temporal patches per node/edge token, or a flattened sequence over (time, graph element) pairs?
  • What identifier scheme survives service renames, topology changes, autoscaling, and ephemeral Kubernetes objects?
  • Should the graph tokenizer learn motifs from one OpenTelemetry Demo graph, from many generated graph variants, or from external production topologies?
  • How do we test that graph compression does not erase rare but intervention-relevant edges?