Learning Graph Quantized Tokenizers

Source

Raw Markdown: paper_gqt-2025.md
PDF: paper_gqt-2025.pdf
Preprint: arXiv 2410.13798
Proceedings: ICLR 2025
Review page: OpenReview ICLR 2025 poster
Official code: limei0307/GQT

Core Claim

GQT argues that graph structure should be converted into learned discrete graph tokens before a Transformer consumes it. A graph-specialized tokenizer first learns local structural and feature representations, quantizes them into a hierarchical codebook vocabulary, and then feeds compact token sequences to a standard Transformer encoder so the Transformer can focus on longer-range graph interactions.
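
The quantization step can be illustrated with a minimal residual vector quantization sketch in plain Python. It is not the authors' implementation: the codebooks are hand-written, hypothetical values rather than learned ones, and the function names (`nearest_code`, `rvq_encode`, `rvq_decode`) are invented for this note. Each level quantizes the residual left by the previous level, so every node ends up with one discrete token per level.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest_code(x: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to x (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(x, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(
    x: Sequence[float], codebooks: Sequence[Sequence[Sequence[float]]]
) -> Tuple[List[int], Vector]:
    """Quantize x level by level; return one token per level and the final residual."""
    residual = list(x)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)
        # The next level only sees what this level failed to explain.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def rvq_decode(
    tokens: Sequence[int], codebooks: Sequence[Sequence[Sequence[float]]]
) -> Vector:
    """Reconstruct an approximation by summing the selected code vectors."""
    out = [0.0] * len(codebooks[0][0])
    for level, idx in enumerate(tokens):
        out = [o + c for o, c in zip(out, codebooks[level][idx])]
    return out


# Hypothetical two-level codebooks (coarse, then fine); real ones would be learned.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

node_embedding = [1.2, 0.9]  # stand-in for a tokenizer encoder output
tokens, residual = rvq_encode(node_embedding, CODEBOOKS)
approx = rvq_decode(tokens, CODEBOOKS)
```

Here the node is represented by the token pair in `tokens` instead of its full embedding, and `approx` is the vector a downstream model would recover from the codebooks.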

For this knowledge base, the useful idea is a learned discrete graph vocabulary: graph structure can become a token stream or conditioning object for a Transformer without forcing every downstream model to be a graph-specialized architecture.

Key Contributions

Trains a graph tokenizer with multi-task graph self-supervised objectives, combining Deep Graph Infomax, GraphMAE2-style masked/distilled learning, and a commitment loss.
Uses Residual Vector Quantization to map each node representation into hierarchical discrete tokens and compact codebook embeddings.
Serializes each target node with Personalized PageRank neighbors over original plus semantic edges, giving the Transformer access to long-range graph structure.
Adds token modulation through aggregated codebook embeddings, positional encodings, hierarchical encodings, and structural gating.
Reports state-of-the-art performance on 20 of 22 homophilic, heterophilic, large-scale, and long-range graph benchmarks, with large memory reductions after the tokenizer has been trained.
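
The Personalized PageRank serialization listed above can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's pipeline: the restart probability `alpha`, the iteration count, the neighbor budget `k`, and the toy service graph are all hypothetical, and the adjacency dictionary is assumed to already contain whatever original and semantic edges are in use.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]


def personalized_pagerank(
    graph: Graph, source: str, alpha: float = 0.15, iterations: int = 50
) -> Dict[str, float]:
    """Approximate PPR scores for `source` by power iteration with restart."""
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            neighbors = graph[node]
            if not neighbors:
                # Dangling node: send its walk mass back to the source.
                nxt[source] += (1 - alpha) * mass
                continue
            share = (1 - alpha) * mass / len(neighbors)
            for nb in neighbors:
                nxt[nb] += share
        # Restart step: scores sum to 1, so alpha of the mass teleports home.
        nxt[source] += alpha
        scores = nxt
    return scores


def serialize_node(graph: Graph, target: str, k: int = 2) -> List[str]:
    """Return the target followed by its top-k PPR neighbors, highest score first."""
    scores = personalized_pagerank(graph, target)
    candidates = [n for n in graph if n != target]
    # Sort by descending score, breaking ties by name for determinism.
    candidates.sort(key=lambda n: (-scores[n], n))
    return [target] + candidates[:k]


# Hypothetical undirected service graph, given as adjacency lists.
SERVICE_GRAPH: Graph = {
    "gateway": ["cart", "auth"],
    "cart": ["gateway", "db"],
    "auth": ["gateway"],
    "db": ["cart"],
}

sequence = serialize_node(SERVICE_GRAPH, "gateway", k=2)
ppr_scores = personalized_pagerank(SERVICE_GRAPH, "gateway")
```

Each entry of `sequence` would then be replaced by that node's quantized tokens, so the Transformer sees a short, relevance-ordered neighborhood rather than the full graph.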

Why It Matters For Kubernetes OTEL Control Gym

k8s-otel-control-gym needs a model interface for service graphs, telemetry schemas, and graph time series: the observation includes node features, edge features, traces, events, and a graph structure that should not be flattened into arbitrary channel order. GQT is relevant because it sketches one way to turn graph structure into discrete tokens that a Transformer can consume alongside telemetry observations.

The strongest transfer is not the exact node-classification pipeline. It is the tokenizer contract: learn a reusable graph vocabulary from service topology, local neighborhoods, semantic edges, and structural scores; then feed those tokens into a Transformer-based passive dynamics model or action-conditioned world model as graph context.
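
One way to picture the tokenizer contract is as a small interface that any graph front end could satisfy. The sketch below is purely illustrative: `GraphToken`, `GraphTokenizer`, `LookupTokenizer`, and `build_context` are hypothetical names invented for this note and do not come from GQT or k8s-otel-control-gym, and the codes and neighborhoods are hand-written placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Sequence, Tuple


@dataclass(frozen=True)
class GraphToken:
    node_id: str
    codes: Tuple[int, ...]  # one codebook index per quantization level
    rank: int               # position in the serialized neighborhood (0 = target)


class GraphTokenizer(Protocol):
    """Contract: turn a target node into an ordered list of graph tokens."""

    def tokenize(self, target: str) -> List[GraphToken]: ...


class LookupTokenizer:
    """Toy tokenizer backed by precomputed codes and neighbor orderings."""

    def __init__(
        self,
        codes: Dict[str, Tuple[int, ...]],
        neighborhoods: Dict[str, Sequence[str]],
    ) -> None:
        self.codes = codes
        self.neighborhoods = neighborhoods

    def tokenize(self, target: str) -> List[GraphToken]:
        ordered = [target] + list(self.neighborhoods[target])
        return [
            GraphToken(node_id=node, codes=self.codes[node], rank=rank)
            for rank, node in enumerate(ordered)
        ]


def build_context(
    tokenizer: GraphTokenizer, target: str, telemetry_window: Sequence[float]
) -> Dict[str, object]:
    """Bundle graph tokens with a telemetry window for a downstream model."""
    return {
        "graph_tokens": tokenizer.tokenize(target),
        "telemetry": list(telemetry_window),
    }


# Placeholder values; in practice these would come from a trained tokenizer.
toy = LookupTokenizer(
    codes={"gateway": (1, 1), "cart": (0, 2), "auth": (2, 0)},
    neighborhoods={"gateway": ["cart", "auth"]},
)
context = build_context(toy, "gateway", telemetry_window=[0.2, 0.4, 0.9])
```

The downstream dynamics model depends only on the `GraphTokenizer` protocol, so a GQT-style tokenizer could be swapped for a simpler structural tokenizer without changing the model interface.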

One caveat matters: GQT is not a GNN-free, end-to-end approach. The tokenizer itself relies on graph-specialized self-supervised objectives and a GNN encoder, plus preprocessing such as semantic-edge construction and PPR serialization. For k8s-otel-control-gym, that makes GQT a plausible graph front-end candidate, not evidence that ordinary Transformers alone can learn the whole service graph interface from raw observations and actions.

Limitations

The method is evaluated on graph learning benchmarks, not graph time series, observability telemetry, or controlled system trajectories.
It has no action, control input, intervention, reward, or counterfactual rollout channel, so it does not by itself make an action-conditioned world model.
The memory reduction applies only after the tokenizer has been trained; the tokenizer's encoder still processes the original graph structure with graph-specialized machinery.
The learned tokens may hide node/edge details that matter for operations, such as rare dependency paths, transient incidents, delayed action effects, or topology changes.
The paper focuses on representation and prediction tasks, not generation, online adaptation, or maintaining a dynamic service graph over time.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Graph/context interface	adjacent	Converts graph structure and node features into learned discrete tokens plus PPR-neighbor sequences that a Transformer can process.	Needs a service-graph schema, graph time-series observations, and tests on telemetry graphs rather than citation/product/benchmark graphs.
Adaptive tokenization	partially closes	Shows a trained quantized tokenizer can compact graph representations and provide a discrete vocabulary before Transformer training.	Needs preservation tests for spikes, rare regimes, topology changes, and intervention-relevant edges.
Native multivariate and graph time-series scaling	adjacent	Reports large memory reductions and strong benchmark results for large graphs after tokenization.	Does not model high-channel temporal node/edge metrics or streaming observations.
Control and counterfactuals	insufficient evidence	The architecture could provide graph context to an action-conditioned world model.	No action, control input, intervention, reward, policy, or counterfactual prediction experiment.

Links Into The Wiki

Open Questions

Can a GQT-style tokenizer learn a service-graph vocabulary from graph.json, telemetry schemas, and graph time-series observations without erasing operationally rare but important edges?
Should graph tokens be static context for a passive dynamics model, or should they update with each observation window in an action-conditioned world model?
How should action and control input tokens interact with graph tokens so intervention effects are preserved rather than averaged into passive graph dynamics?
Is a graph-specialized tokenizer worth the extra training/preprocessing complexity compared with structural tokenization, attention bias, or visibility masks for ordinary Transformers?
