Learning Graph Quantized Tokenizers
Source
- Raw Markdown: paper_gqt-2025.md
- PDF: paper_gqt-2025.pdf
- Preprint: arXiv 2410.13798
- Proceedings: ICLR 2025
- Review page: OpenReview ICLR 2025 poster
- Official code: limei0307/GQT
Core Claim
GQT argues that graph structure should be converted into learned discrete graph tokens before a Transformer consumes it. A graph-specialized tokenizer first learns local structural and feature representations, quantizes them into a hierarchical codebook vocabulary, and then feeds compact token sequences to a standard Transformer encoder so the Transformer can focus on longer-range graph interactions.
For this knowledge base, the useful idea is a learned discrete graph vocabulary: graph structure can become a token stream or conditioning object for a Transformer without forcing every downstream model to be a graph-specialized architecture.
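To make the "hierarchical codebook vocabulary" concrete, here is a minimal sketch of residual vector quantization in plain Python. It assumes small hand-written codebooks and a nearest-neighbor lookup under squared Euclidean distance; the function names (`nearest_code`, `rvq_encode`, `rvq_decode`) are illustrative and not taken from the GQT code, and in the paper the codebooks are learned rather than fixed.

```python
def nearest_code(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec level by level; each level encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen code so the next level only sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def rvq_decode(tokens, codebooks):
    """Sum the selected codes across levels to approximate the original vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (hand-written assumption, not learned): two levels, two 2-D codes each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, -0.5]],
]
tokens, residual = rvq_encode([1.4, 0.6], codebooks)
approx = rvq_decode(tokens, codebooks)
```

Each node ends up as a short tuple of code indices, one per level. The first level captures the coarse position of the representation and later levels refine it, which is what makes the vocabulary hierarchical.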
Key Contributions
- Trains a graph tokenizer with multi-task graph self-supervised objectives, combining Deep Graph Infomax, GraphMAE2-style masked/distilled learning, and a commitment loss.
- Uses Residual Vector Quantization to map each node representation into hierarchical discrete tokens and compact codebook embeddings.
- Serializes each target node with Personalized PageRank neighbors over original plus semantic edges, giving the Transformer access to long-range graph structure.
- Adds token modulation through aggregated codebook embeddings, positional encodings, hierarchical encodings, and structural gating.
- Reports state-of-the-art performance on 20 of 22 homophilic, heterophilic, large-scale, and long-range graph benchmarks, with large memory reductions after the tokenizer has been trained.
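The Personalized PageRank serialization step in the list above can be sketched as follows. This is a simplified illustration, assuming an undirected graph, power iteration with a restart probability `alpha`, and a top-`k` cutoff; the helper names, `alpha=0.15`, `iters=50`, and `k=2` are assumptions chosen for the example, not values from the paper.

```python
def build_adjacency(*edge_lists):
    """Merge several undirected edge lists (e.g. original and semantic edges)."""
    adj = {}
    for edges in edge_lists:
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Power-iteration PPR: the walk restarts at `source` with probability alpha."""
    scores = {n: 0.0 for n in adj}
    scores[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for node, nbrs in adj.items():
            # Spread the non-restart mass evenly over the neighbors.
            share = (1 - alpha) * scores[node] / len(nbrs)
            for nbr in nbrs:
                nxt[nbr] += share
        nxt[source] += alpha  # restart mass returns to the source node
        scores = nxt
    return scores


def serialize_node(adj, source, k=2):
    """Sequence for one target node: the node itself, then its top-k PPR neighbors."""
    scores = personalized_pagerank(adj, source)
    ranked = sorted((n for n in adj if n != source), key=lambda n: (-scores[n], n))
    return [source] + ranked[:k]


# Toy graph: a path a-b-c-d, plus one hypothetical semantic edge a-d.
original = [("a", "b"), ("b", "c"), ("c", "d")]
semantic = [("a", "d")]
path_only = serialize_node(build_adjacency(original), "a")
with_semantic = serialize_node(build_adjacency(original, semantic), "a")
```

On the path alone, the sequence for `a` holds its two nearest nodes, `b` and `c`. Adding the semantic edge pulls `d` into the sequence in place of `c`, which illustrates how semantic edges can expose nodes that are far away in the original topology.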
Why It Matters For Kubernetes OTEL Control Gym
k8s-otel-control-gym needs a model interface for service graphs, telemetry schemas, and graph time series. Its observations include node features, edge features, traces, events, and a graph structure that should not be flattened into an arbitrary channel order. GQT is relevant because it sketches one way to turn graph structure into discrete tokens that a Transformer can consume alongside telemetry observations.
The strongest transfer is not the exact node-classification pipeline. It is the tokenizer contract: learn a reusable graph vocabulary from service topology, local neighborhoods, semantic edges, and structural scores; then feed those tokens into a Transformer-based passive dynamics model or action-conditioned world model as graph context.
The caveat is important: GQT is not a GNN-free, end-to-end approach. The tokenizer itself relies on a GNN encoder trained with graph-specific self-supervised objectives, plus preprocessing such as semantic edges and PPR serialization. For k8s-otel-control-gym, that makes GQT a plausible graph front-end candidate, not evidence that ordinary Transformers alone can learn the whole service graph interface from raw observations and actions.
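One way to picture the tokenizer contract discussed in this section is as a narrow interface between a trained graph tokenizer and the downstream Transformer. The sketch below is hypothetical: `GraphToken`, `GraphTokenizer`, `LookupTokenizer`, `build_context`, and the service names are invented for illustration, and a frozen lookup table stands in for the trained GNN-plus-quantizer that GQT actually uses.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class GraphToken:
    node_id: str
    codes: tuple[int, ...]  # one code index per quantization level


class GraphTokenizer(Protocol):
    """Anything that maps node ids to discrete graph tokens."""

    def encode(self, node_ids: Sequence[str]) -> list[GraphToken]: ...


@dataclass
class LookupTokenizer:
    """Stand-in for a trained tokenizer: a frozen node -> codes table."""

    table: dict[str, tuple[int, ...]]
    unknown: tuple[int, ...] = (0, 0)  # fallback codes for unseen nodes (assumption)

    def encode(self, node_ids: Sequence[str]) -> list[GraphToken]:
        return [GraphToken(n, self.table.get(n, self.unknown)) for n in node_ids]


def build_context(tokenizer: GraphTokenizer, node_ids, telemetry_window):
    """Prefix graph code tokens to a window of telemetry token ids."""
    graph_part = [("graph", c) for tok in tokenizer.encode(node_ids) for c in tok.codes]
    obs_part = [("obs", t) for t in telemetry_window]
    return graph_part + obs_part


# Hypothetical service names and code indices, for illustration only.
tokenizer = LookupTokenizer({"checkout": (3, 1), "cart": (2, 5)})
context = build_context(tokenizer, ["checkout", "unknown-svc"], [7, 8])
```

Keeping the downstream model dependent only on `encode` means the graph front end can be swapped (a GQT-style tokenizer, a simpler structural tokenizer, or a static table) without changing the dynamics model. The explicit `unknown` fallback also makes visible what happens when a service appears that the tokenizer never saw, which is one of the topology-change risks listed under Limitations.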
Limitations
- The method is evaluated on graph learning benchmarks, not graph time series, observability telemetry, or controlled system trajectories.
- It has no action, control input, intervention, reward, or counterfactual rollout channel, so it does not by itself make an action-conditioned world model.
- The memory win appears after tokenizer training; the tokenizer encoder still processes the original graph structure with graph-specialized machinery.
- The learned tokens may hide node/edge details that matter for operations, such as rare dependency paths, transient incidents, delayed action effects, or topology changes.
- The paper focuses on representation and prediction tasks, not generation, online adaptation, or maintaining a dynamic service graph over time.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Graph/context interface | adjacent | Converts graph structure and node features into learned discrete tokens plus PPR-neighbor sequences that a Transformer can process. | Needs a service-graph schema, graph time-series observations, and tests on telemetry graphs rather than citation/product/benchmark graphs. |
| Adaptive tokenization | partially closes | Shows a trained quantized tokenizer can compact graph representations and provide a discrete vocabulary before Transformer training. | Needs preservation tests for spikes, rare regimes, topology changes, and intervention-relevant edges. |
| Native multivariate and graph time-series scaling | adjacent | Reports large memory reductions and strong benchmark results for large graphs after tokenization. | Does not model high-channel temporal node/edge metrics or streaming observations. |
| Control and counterfactuals | insufficient evidence | The architecture could provide graph context to an action-conditioned world model. | No action, control input, intervention, reward, policy, or counterfactual prediction experiment. |
Links Into The Wiki
- Graph Structure As Transformer Context
- Kubernetes OTEL Control Gym
- Foundation Time-Series Model Research Agenda
- Latent Tokenization
- Tokenizer Transfer
- Observability Time Series
- World Models
- High-Dimensional Time-Series Forecasting
Open Questions
- Can a GQT-style tokenizer learn a service-graph vocabulary from graph.json, telemetry schemas, and graph time-series observations without erasing operationally rare but important edges?
- Should graph tokens be static context for a passive dynamics model, or should they update with each observation window in an action-conditioned world model?
- How should action and control input tokens interact with graph tokens so intervention effects are preserved rather than averaged into passive graph dynamics?
- Is a graph-specialized tokenizer worth the extra training/preprocessing complexity compared with structural tokenization, attention bias, or visibility masks for ordinary Transformers?