Graph Tokenization for Bridging Graphs and Transformers

Source

Core Claim

Graph Tokenization proposes a reversible graph-serialization plus BPE tokenizer that turns labeled graph structure into a discrete token sequence, letting standard Transformers process graphs without graph-specific architectural modifications.

Key Contributions

  • Defines graph tokenization as a deterministic discrete-token interface rather than a neural graph encoder, pooling layer, or continuous embedding projector.
  • Combines reversible serialization with BPE so frequent graph substructures become reusable graph tokens.
  • Introduces frequency-guided Eulerian and CPP serializations to make traversal stable under graph permutation while preserving graph structure up to isomorphism.
  • Reports state-of-the-art or competitive results on 14 graph classification and regression benchmarks with standard Transformer backbones such as BERT and GTE.
  • Shows a small autoregressive proof of concept where a decoder-only Transformer generates MNIST grid graphs from token sequences.

Why It Matters For Kubernetes OTEL Control Gym

Kubernetes OTEL Control Gym needs an interface for service graph structure that can be consumed alongside observation windows, event streams, and candidate actions. This paper is a current SOTA candidate for representing graph structure to a standard Transformer through reversible serialization plus BPE/tokenization, without requiring a graph-specific encoder or graph Transformer block.

The useful local pattern is interface-level: graph.json or graph snapshots could be serialized into structural tokens and joined with numeric telemetry observations, action history, and control inputs. The paper does not solve the whole OTEL problem because its experiments are static graph tasks, not graph time series, and it does not model operator actions, intervention effects, or action-conditioned rollouts.

Limitations

  • The method assumes labeled graphs with discrete node and edge symbols. Continuous node or edge features need quantization, a parallel numeric channel, or another typed feature interface.
  • Evidence is mostly graph-level classification and regression. Node-level and edge-level tasks are harder because BPE can merge the target entity into a larger token.
  • The Transformer still faces context-window limits on large graphs; BPE compression helps but does not remove this constraint.
  • CPP-style serialization has cubic graph-size cost, so the practical large-graph path relies on the linear frequency-guided Eulerian variant.
  • The paper is not evidence for action-conditioned control: no action, control input, intervention, reward, or counterfactual prediction channel is evaluated.

Foundation TSFM Relevance

Agenda slotVerdictEvidenceMissing pieces
Context interfacepartially closesProvides a reversible discrete-token interface for graph structure that can condition a standard Transformer without graph-specific architectural changes.Needs integration with telemetry schema, observation windows, event streams, and action history.
Patch size, dynamic tokenization, and typed interfacesadjacentUses BPE to compress recurring graph substructures into structural tokens, reducing sequence length while preserving decodeability.Does not address numeric observations, timestamps, control inputs, or time-series patching.
Native multivariate encoding and high-channel scalingadjacentCould represent service topology or graph snapshots for graph time series.Needs continuous node/edge features, time alignment, topology changes, and high-channel telemetry tests.
Control and counterfactualsinsufficient evidenceThe method can encode graph context but does not condition future observations on actions.Needs observation + graph + action/control input -> future/reward experiments.

Open Questions

  • Can graph tokenization be made stable for changing service graphs where nodes, edges, and telemetry schemas evolve over time?
  • Should continuous node and edge features be quantized into graph tokens, encoded through typed numeric features, or kept as a parallel observation channel?
  • Does a BPE vocabulary learned from one service graph or organization transfer to another, or does it overfit to local topology motifs?
  • How should graph tokens be interleaved with observation windows, action history, control inputs, and event streams in an action-conditioned world model?