---
abstract: |
  The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. Extending these models to graph-structured data remains a significant challenge. In this work, we introduce a graph tokenization framework that generates sequential representations of graphs by combining reversible graph serialization, which preserves graph information, with Byte Pair Encoding (BPE), a widely adopted tokenizer in large language models (LLMs). To better capture structural information, the graph serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear more often in the sequence and can be merged by BPE into meaningful tokens. Empirical results demonstrate that the proposed tokenizer enables Transformers such as BERT to be directly applied to graph benchmarks without architectural modifications. The proposed approach achieves state-of-the-art results on 14 benchmark datasets and frequently outperforms both graph neural networks and specialized graph transformers. This work bridges the gap between graph-structured data and the ecosystem of sequence models. Our code is available at [`\color{blue}`{=latex}here](https://github.com/BUPT-GAMMA/Graph-Tokenization-for-Bridging-Graphs-and-Transformers).
author:
- |
  Zeyuan Guo `\footnotemark[1]  `{=latex}\
  Beijing University of Posts and Telecom.,\
  Beijing 100876, China\
  `guozeyuan@bupt.edu.cn` `\And `{=latex}Enmao Diao `\footnotemark[1] `{=latex}\
  DreamSoul\
  `enmao.diao@dreamsoul.com`\
  `\phantom{placeholder placeholder placeholder placehold}`{=latex}\
  `\AND `{=latex} Cheng Yang\
  Beijing University of Posts and Telecom.,\
  Beijing 100876, China\
  `yangcheng@bupt.edu.cn` `\And`{=latex} Chuan Shi `\footnotemark[2]  `{=latex}\
  Beijing University of Posts and Telecom.,\
  Beijing 100876, China\
  `shichuan@bupt.edu.cn`
bibliography:
- iclr2026\_conference.bib
title: Graph Tokenization for Bridging Graphs and Transformers
---

```{=latex}
\let\originalparagraph\paragraph
```
```{=latex}
\renewcommand{\paragraph}[1]{
  
  \originalparagraph{#1}
}
```
```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvu{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\fix}{\marginpar{FIX}}
```
```{=latex}
\newcommand{\new}{\marginpar{NEW}}
```
```{=latex}
\maketitle
```
```{=latex}
\renewcommand{\thefootnote}{\fnsymbol{footnote}}
```
`\footnotetext[1]{Equal Contribution}`{=latex} `\footnotetext[2]{Corresponding author}`{=latex}

```{=latex}
\renewcommand{\thefootnote}{\arabic{footnote}}
```
Introduction {#sec:introduction}
============

Large pretrained Transformer models [@nlp-attn; @nlp-llm-survey], exemplified by LLMs, have achieved state-of-the-art results across diverse domains [@nlp-vit; @nlp-ast]. A key component of this success is the tokenizer, which converts raw input into sequences of discrete symbols. By structuring information into learnable units, the tokenizer provides the interface between complex data and Transformer architectures, supporting the scalability and performance of these models.

Research on extending Transformers to graph-structured data has explored two main strategies, each with inherent limitations [@g-survey-token/text; @g-datacentric-survey]. One strategy modifies the architecture by incorporating attention mechanisms into Graph Neural Networks (GNNs) to create specialized Graph Transformers [@g-trans]. These approaches require graph-specific designs that diverge from standard sequence models and their ecosystem. The other strategy converts graphs into continuous embeddings for use with Transformers [@g-graphgpt], but this often causes information loss or unstable representations, which can degrade model performance [@g-llaga].

Developing a principled graph tokenizer requires reexamining the notion of tokenization in the context of graph-structured data. Specifically, text can be modeled as a path graph, where the linear sequence of tokens provides both a fixed neighborhood structure and a canonical ordering, making tokenization relatively straightforward. In contrast, general graphs pose additional challenges, as their neighborhoods can branch in multiple directions rather than follow a simple linear sequence. They also lack permutation invariance, where graphs under node permutations are considered equivalent. Furthermore, co-occurrence statistics widely used in text, such as n-gram frequencies based on contiguous tokens, are not directly applicable to graphs.

We propose a framework that addresses these challenges by integrating graph serialization with Byte Pair Encoding (BPE), a data-driven compression algorithm widely applied in text tokenization [@nlp-bpe]. To ensure that graph structure and labels are preserved, we adopt reversible serialization methods such as extended Euler circuits and minimal-weight graph traversals. Ordering ambiguity is resolved by using global statistics to deterministically guide the serialization process, which translates common substructures into frequent and adjacent symbol patterns that BPE is well suited to merge. Specifically, BPE iteratively merges the most frequent pairs of symbols into new tokens, thereby reducing sequence length while preserving common substructures. As a result, applying BPE to serialized graphs enables the construction of a vocabulary of frequent graph neighborhoods, producing discrete tokens that are both informative and well aligned with Transformer architectures.

In this work, our contributions can be summarized as follows:

-   **General Framework for Graph Tokenization.** We introduce a tokenization framework that combines reversible graph serialization with BPE. By decoupling the encoding of graph structure from the model architecture, this framework provides an effective interface that enables standard off-the-shelf Transformer models to be applied directly to graph-structured data without requiring any architectural modifications.

-   **Structure-Guided Serialization for BPE.** We propose a deterministic serialization process guided by global statistics of graph substructures. The process addresses ordering ambiguities in graphs and aligns frequently occurring substructures into adjacent sequence patterns. Structure-Guided Serialization provides an effective basis for BPE to learn a meaningful and interpretable vocabulary of structural graph tokens.

-   **State-of-the-Art Performance on Downstream Tasks.** Our tokenizer enables standard Transformer backbones to achieve state-of-the-art results across a diverse suite of 14 benchmarks for graph classification and regression. The proposed approach frequently outperforms both established Graph Neural Networks and specialized Graph Transformers, demonstrating its effectiveness and generalization.

Related Works {#sec:related}
=============

`\label{sec:related-gnn}`{=latex}

#### Graph Neural Networks.

Graph Neural Networks [@g-gcn; @g-rethink-gnn] are the prevailing framework for learning on graph-structured data. They rely on message passing, where node representations are updated by iteratively aggregating information from local neighbors, enabling effective modeling of local graph structure [@g-gin]. To capture dependencies beyond local neighborhoods, subsequent work introduced self-attention, leading to Graph Transformers [@g-trans; @g-sgformer] and hybrid global-local models [@g-vit-mix; @g-gpatcher]. More recently, graph representation learning has been combined with Graph Foundation Models, often by mapping graph structure and features into the embedding space of pretrained foundation models [@g-graphgpt; @g-llaga; @g-llm4rgnn; @g-datacentric-fgl; @g-tfm4gad]. These approaches depend on cross-modal alignment, with performance influenced by the semantic compatibility between graph features and natural language. Our objective is to design an interface that enables graphs to be processed directly by standard, off-the-shelf Transformers.

`\label{sec:related-serialization}`{=latex}

#### Graph Serialization.

Serialization of a graph into a sequence was one of the earliest strategies for applying sequence-based models. Early methods such as DeepWalk generated node sequences through random walks and processed them with shallow neural networks [@g-deep-walk; @g-bert]. This direction was later surpassed by the message passing paradigm of GNNs [@g-mpnn], which became the dominant approach to learning graph representations. More recently, the success of sequence-native architectures such as the Transformer has renewed interest in serialization-based methods [@g-mamba]. Many existing graph-to-sequence pipelines are not reversible. Specifically, walk-based serializations break the graph into local fragments. Each sequence reflects only part of the graph, and even combining many walks cannot reconstruct the original structure or capture global connectivity [@g-randomwalk-survey]. In another case, traversal-based serializations are sensitive to node ordering and starting choices, so even isomorphic graphs may produce different graph traversal circuits [@g-euler/anal]. In contrast, our method is reversible and almost invariant to graph permutation.

`\label{sec:related-tokenizer}`{=latex}

#### Tokenization

The Transformer architecture has become the standard paradigm for sequence modeling [@nlp-attn]. Its success is closely tied to the use of effective tokenization [@nlp-gpt3; @nlp-dpsk], which is especially critical in LLMs. A tokenizer converts raw input (e.g., text) into a sequence of discrete symbols, with BPE being a widely adopted data-driven approach that builds a vocabulary by iteratively merging frequent symbol pairs [@nlp-bpe]. In prior work on graph data, the term *graph tokenization* has been used with different meanings. It has referred to neural encoders that produce continuous embeddings [@g-graphgpt], pooling or coarsening modules that compress subgraphs into super-nodes [@g-graphbpe], and vector quantization components that discretize node features or latent representations [@g-vqgraph]. In this paper, we adopt the standard definition in natural language processing, where a tokenizer is a deterministic procedure that maps a labeled graph to a sequence of discrete symbols for direct use by sequence models, in contrast to neural approaches (e.g., PS-VAE [@g-psvae]) that project substructures into continuous latent spaces via learnable encoders (see Appendix `\ref{appendix:vae_distinction}`{=latex} for a detailed comparison).

Method {#sec:method}
======

Preliminaries {#sec:preliminaries}
-------------

#### Graph.

A graph is a tuple $G = (\mathcal{V}, \mathcal{E})$, composed of a finite set of nodes $\mathcal{V}$ and a set of edges $\mathcal{E}$. Our work focuses on *labeled graphs*, which we define as a tuple $\mathcal{G} = ({G}, L, \Sigma)$, where $\Sigma$ is a finite alphabet of symbols and $L: \mathcal{V} \cup \mathcal{E} \to \Sigma$ is a labeling function. Two labeled graphs $\mathcal{G}_1 = ({G}_1, L_1, \Sigma)$ and $\mathcal{G}_2 = ({G}_2, L_2, \Sigma)$ are *isomorphic*, denoted $\mathcal{G}_1 \cong \mathcal{G}_2$, if there exists a graph isomorphism $\phi: \mathcal{V}_1 \to \mathcal{V}_2$ between ${G}_1$ and ${G}_2$ that also preserves all labels.

#### Graph Serialization.

In general, a graph serialization function $f$ maps a graph to a finite sequence of symbols. Let $\mathcal{A}$ denote the universe of possible sequence elements. The mapping is defined as $$f: \mathcal{G} \mapsto (s_1, s_2, \dots, s_k) \quad \text{s.t.} \ s_i \in \mathcal{A} \ \text{for} \ 1 \le i \le k.$$ The choice of $\mathcal{A}$ depends on the serialization method. It may consist of node identifiers ($\mathcal{A}=\mathcal{V}$), continuous embeddings ($\mathcal{A}=\mathbb{R}^d$), or discrete labels ($\mathcal{A}=\mathbb{Z}$). For the purpose of building a discrete tokenizer, we focus on serializations where the output sequence is composed of symbols from the graph's alphabet, i.e., $\mathcal{A}=\Sigma$. To serve as a reliable interface, such a serialization should satisfy two key properties:

-   **Reversibility.** A serialization $f$ is reversible if the original labeled graph $\mathcal{G}$ can be recovered from its sequence $S=f(\mathcal{G})$ up to isomorphism. Formally, let $f^{-1}(S)$ denote the set of all graphs that could produce sequence $S$. The serialization is reversible if for any $\mathcal{G}$ in the domain of $f$, there exists a reversed graph $\mathcal{G}' \in f^{-1}(f(\mathcal{G}))$ such that $\mathcal{G}' \cong \mathcal{G}$.

-   **Determinism.** A serialization function $f$ is deterministic if, for any labeled graph $\mathcal{G}$, it consistently produces the same sequence $S$. This property is essential for addressing the permutation-invariance of graphs. A deterministic serialization generates a stable sequence for all graphs within an isomorphism class.

#### Graph Tokenization.

A graph tokenizer $\Phi$ maps a labeled graph $\mathcal{G}$ to a finite sequence of discrete symbols, referred to as tokens. $$\Phi:\ \mathcal{G}\ \mapsto\ S_{\text{T}}=(t_1,\dots,t_m),\quad t_j\in\mathcal{V}_{T}.$$ In this work, we construct the graph tokenizer $\Phi$ by composing a graph serialization function $f$ with a sequence tokenizer $T$ inspired by the text tokenizers used in LLMs. The sequence tokenizer $T$ maps a sequence over the initial alphabet $\Sigma$ to a new sequence over a target vocabulary $\mathcal{V}_{T}$, where the vocabulary is typically learned from data using BPE. The overall mapping is given by $$\Phi = T \circ f .$$ When a decoding procedure is available, the original graph can be reconstructed up to isomorphism (i.e., preserving complete topology and all labels; strict node-index identity requires an additional index mapping) by applying the inverse operations $T^{-1}$ followed by $f^{-1}$. Specifically, the term *graph tokenizer* refers to methods that produce a discrete sequence. Methods that only discretize embeddings [@g-vqgraph] or apply pooling or coarsening [@g-graphbpe] are not considered tokenizers in this sense.

`\noindent`{=latex}

Graph Tokenizer {#sec:framework}
---------------

We construct our graph tokenizer $\Phi$ by composing a reversible and structure-guided serialization function $f$ with a tokenization step $T$ based on BPE. To ensure graph structural information is preserved, $f$ is designed to be *reversible*, and to produce stable sequences, we enforce a *deterministic* guiding policy for $f$. We propose a data-driven graph tokenizer $\Phi$ that is learned from a training corpus of graphs rather than relying on hand-crafted heuristics. Specifically, Algorithm `\ref{alg:tokenizer}`{=latex} details the training, encoding, and decoding procedures of `GraphTokenizer`, and Figure `\ref{fig:framework}`{=latex} illustrates the overall framework.

![Framework of the proposed graph tokenizer. (A) Substructure frequencies are collected from the training graphs. (B) Structure-guided and reversible serialization is performed using a frequency-guided Eulerian circuit, where the next edge is selected according to a priority rule (e.g., red C: 7$\to$13$\to$15$\to$17). (C) A BPE vocabulary is trained on the serialized corpus, and graphs are encoded into discrete tokens for use in downstream sequence models. ](fig/main.png){#fig:framework width="0.95\\columnwidth"}

#### Local Structural Pattern Statistics. {#sec:pattern-collect}

The training procedure begins with the computation of dataset-level statistics of local patterns, which provide a data-driven basis for ensuring determinism in graph serialization. As illustrated in Fig. `\ref{fig:framework}`{=latex}A, we count how often small labeled patterns appear in the training graphs, using molecular graphs and edge patterns as an example. These counts are then normalized into relative frequencies.

For a labeled graph $\mathcal{G}=((\mathcal{V},\mathcal{E}),L,\Sigma)$, we define a basic local pattern as a labeled edge $p=(l_u,l_e,l_v)\in\Sigma^3$, which captures the labels of the source node, the edge, and the target node. Intuitively, this is the smallest substructure that still reflects a typed relation between two labeled entities. Compared with larger subgraphs, it is computationally inexpensive, permutation-invariant over node indices, and stable under isomorphisms, which makes it a practical choice for tie-breaking during graph serialization. The count of $p$ in $\mathcal{G}$ is given by $$\mathrm{Count}(\mathcal{G},p)
= \big|\{\, e=(u,v)\in\mathcal{E} \ \mid\ (L(u),L(e),L(v))=p \,\}\big|.$$

Aggregating over the training set $\mathcal{D}$, we obtain raw counts and their normalized relative frequencies as $$C(p)=\sum_{\mathcal{G}\in\mathcal{D}}\mathrm{Count}(\mathcal{G},p)\qquad
F(p)=\frac{C(p)}{\sum_{p'\in\Sigma^3} C(p')}.
\label{eq:relfreq}$$ $F(p)$ denotes the normalized relative frequency, while raw counts $C(p)$ are introduced here as an intermediate definition.

#### Structure-Guided Reversible Serialization. {#sec:serialize}

We proceed to the next step of our framework, where each graph is converted into a sequence of symbols (Algorithm `\ref{alg:tokenizer}`{=latex}, **line `\ref{alg-serialize}`{=latex}**). In this procedure, line `\ref{alg-count-freq}`{=latex} corresponds to estimating $F$ from data, and the resulting frequency map guides the structure-aware serialization function $f_g(\cdot,F)$. The function $f_g$ addresses traversal ambiguities by prioritizing edges whose incident labeled pattern has higher $F(p)$ from itself and neighbors, with fixed lexical rules applied to break any remaining ties.

::: {#tab:serialization_comparison}
  **Method**                               **Reversibility**   **Determinism**          **Time Complexity**
  --------------------------------------- ------------------- ----------------- ------------------------------------
  Random Walk                                     No                 No                       $O(RL)$
  Node-list BFS/DFS                               No                 No           $O(|\mathcal{V}|+|\mathcal{E}|)$
  Topological Sort                                No                 No           $O(|\mathcal{V}|+|\mathcal{E}|)$
  Eulerian circuit                                Yes                No                  $O(|\mathcal{E}|)$
  SMILES (non-canonical)                          Yes                No           $O(|\mathcal{V}|+|\mathcal{E}|)$
  Canonical SMILES                                Yes                Yes          $O(|\mathcal{V}|+|\mathcal{E}|)$
  Chinese Postman Problem                         Yes                No                 $O(|\mathcal{V}|^3)$
  **Frequency-Guided Eulerian circuit**         **Yes**            **Yes**        $\boldsymbol{O(|\mathcal{E}|)}$
  **Frequency-Guided CPP**                      **Yes**            **Yes**       $\boldsymbol{ O(|\mathcal{V}|^3)}$

  : Properties of graph serialization methods. For Random Walk, $L$ is walk length and $R$ the number of walks. Implementation details are in Appendix `\ref{appendix:method_details}`{=latex}.
:::

Fig. `\ref{fig:framework}`{=latex}B illustrates this serialization process on a molecular graph. Specifically, at the red node C, the next step is chosen by the $F$-guided priority rather than arbitrarily (e.g., choose the green node C because \"C-C\" pair has the highest frequency). To ensure that graph serialization yields faithful graph representations, we require it to satisfy the properties of reversibility and determinism discussed in Section `\ref{sec:preliminaries}`{=latex}. Ensuring that these two properties hold simultaneously for general graphs is a significant challenge. To motivate our approach, we first review classical serialization methods against these criteria. To provide a clear overview, Table `\ref{tab:serialization_comparison}`{=latex} summarizes the properties of existing serialization methods. A key limitation is that no classical method is simultaneously broadly applicable, fully reversible, and inherently deterministic.

Early approaches such as *Random Walks* [@g-deep-walk] are inherently stochastic and typically explore only a local portion of the graph per sample. Even when many walks are aggregated, substructures are split across sequences without markers, so reconstruction is not guaranteed, and the procedure remains non-deterministic. Standard traversal algorithms like *Breadth-First Search (BFS)* [@algo-bfs] and *Depth-First Search (DFS)* [@algo-dfs] also fail to meet these requirements. Their non-determinism arises from arbitrary neighbor selection when multiple choices are available, and their node-list output omits edge connectivity, which prevents reconstruction of the original graph. *Topological Sort* [@algo-topo], which produces linear orderings for Directed Acyclic Graphs (DAGs), is limited to DAGs and admits multiple valid orderings, making it non-deterministic. Moreover, like other node-list traversals, it is not reversible because the precise edge connectivity information necessary for reconstruction is discarded.

```{=latex}
\begin{algorithm}[ht!]\caption{The \texttt{GraphTokenizer} Framework}
\label{alg:tokenizer}
\begin{algorithmic}[1]
\STATE \textbf{Procedure} \textsc{Train}$(\mathcal{D}, K)$
\label{alg-train}
\STATE \textbf{Input:} A training graph dataset $\mathcal{D}$; number of BPE merges $K$.
\STATE \textbf{Output:} frequency map $F$; BPE codebook $\mathcal{C}=(\mathcal{V}_T, \mathcal{R})$.
\STATE $F(p) \gets \sum_{\mathcal{G}\in\mathcal{D}} \text{Count}(\mathcal{G},p), \quad \forall p\in\Sigma^3$
\label{alg-count-freq}
\STATE $\mathcal{D}_S \gets \{\, f_g(\mathcal{G},F) \mid \mathcal{G}\in\mathcal{D} \,\}$
\label{alg-serialize}
\STATE $\mathcal{V}_T \gets \Sigma$;\; $\mathcal{R} \gets \emptyset$
\label{alg-bpe-start}
\FOR{$k=1$ to $K$}
  \STATE $(s_a^*,s_b^*) \gets \arg\max_{(s_a,s_b)} \sum_{S\in\mathcal{D}_S} \text{Count}(S, (s_a,s_b))$
  \label{alg-bpe-count}
  \STATE $s_{\text{new}} \gets s_a^* \cdot s_b^*$
  \label{alg-bpe-new-word}
  \STATE $\mathcal{V}_T \gets \mathcal{V}_T \cup \{s_{\text{new}}\}$
  \STATE $\mathcal{R} \gets \mathcal{R} \cup \{(s_a^*,s_b^*) \to s_{\text{new}}\}$
  \FOR{each $S \in \mathcal{D}_S$}
    \STATE replace all disjoint adjacent pairs $(s_a^*,s_b^*)$ in $S$ with $s_{\text{new}}$
    \label{alg-bpe-replace}
  \ENDFOR
\ENDFOR
\STATE \textbf{return} $(F, \mathcal{C})$
\label{alg-bpe-end}
\STATE

\STATE \textbf{Procedure} \textsc{Encode}$(\mathcal{G}, F, \mathcal{C})$
\label{alg-encode}
\STATE \textbf{Input:} graph $\mathcal{G}$; frequency map $F$; the codebook $\mathcal{C}=(\mathcal{V}_T, \mathcal{R})$.
\STATE \textbf{Output:} A token sequence $S_T$.
\STATE $S \gets f_g(\mathcal{G},F)$
\FOR{each $(s_a,s_b)\to s_{\text{new}}$ in $\mathcal{R}$}
  \STATE replace all disjoint adjacent pairs $(s_a,s_b)$ in $S$ with $s_{\text{new}}$
\ENDFOR
\STATE $S_T \gets S$
\STATE \textbf{return} $S_T$

\STATE

\STATE \textbf{Procedure} \textsc{Decode}$(S_T, \mathcal{C}, f^{-1})$
\label{alg-decode}
\STATE \textbf{Input:} token sequence $S_T$; codebook $\mathcal{C}=(\mathcal{V}_T, \mathcal{R})$; inverse serialization function $f^{-1}$.
\STATE \textbf{Output:} A reconstructed graph $\widehat{\mathcal{G}}$.
\STATE $S \gets S_T$
\FOR{each $(s_a,s_b)\to s_{\text{new}}$ in $\text{reversed}(\mathcal{R})$}
  \STATE replace every $s_{\text{new}}$ in $S$ with the pair $(s_a,s_b)$
\ENDFOR
\STATE $\widehat{\mathcal{G}} \gets f^{-1}(S)$ 
\STATE \textbf{return} $\widehat{\mathcal{G}}$
\end{algorithmic}


\end{algorithm}
```
In contrast to node-based traversals, methods that cover every edge of the graph are naturally reversible. A representative example is the *Eulerian circuit* [@algo-euler-def], which visits each edge exactly once. By treating each undirected edge as two opposing directed edges, the method can extend to any connected graph [@g-euler/anal]. During traversal, emitting an alternating *node--edge--node* sequence ensures that adjacent symbols share an endpoint, which preserves the information needed by $f^{-1}$ to reconstruct the edges. Despite this reversibility, the method remains non-deterministic because the classical Hierholzer's algorithm [@algo-euler] must make arbitrary choices whenever a node has multiple unvisited edges. A related approach is the *Chinese Postman Problem (CPP)* [@algo-cpp-def], which seeks a minimum weight traversal that covers all edges and thereby also preserves the complete graph structure. Non-determinism in CPP, although more constrained, is intrinsic to its solution process, typically solved using Edmonds' blossom algorithm [@algo-cpp]. The standard procedure first identifies all odd-degree nodes and constructs an auxiliary complete graph on them, with edge weights representing shortest path distances in the original graph. A minimum-weight perfect matching is then computed to determine which paths should be duplicated. If multiple minimum-weight matchings exist, the selection between them is arbitrary, which dictates the different possible traversals.

Domain-specific serialization methods for molecular graphs, such as SMILES [@algo-smiles], represent a widely adopted approach in cheminformatics. Non-canonical SMILES is reversible but not deterministic, whereas canonical SMILES achieves determinism by applying an explicit canonicalization procedure under a fixed scheme. These procedures rely on chemistry-specific perception rules (e.g., aromaticity, implicit hydrogens, and structural notations) and therefore do not generalize to arbitrary labeled graphs. Furthermore, the determinism of canonical SMILES is defined relative to the chosen canonicalization algorithm and perception rules, and implementations may differ slightly across toolkits.

Building on the preceding analysis, our strategy is to impose determinism on traversal methods that are inherently reversible. This is accomplished by introducing a guiding mechanism that leverages the global frequency map $F$ to resolve traversal ambiguities. In this way, we obtain a structure-guided graph serialization function $f_g$ that simultaneously satisfies reversibility and determinism for general graphs.

*Frequency-Guided Eulerian circuit* adapts Hierholzer's algorithm by introducing a priority rule. At any node $u$ with unvisited outgoing edges $\mathcal{E}_u$, the next edge $e^*$ is selected deterministically as $$e^* = \argmax_{e_i \in \mathcal{E}_u} \pi(e_i, F), \label{eq:priority}$$ where $\pi(e_i,F)$ assigns a scalar priority, for example $\pi(e_i,F) = F(p_i)$ for the pattern $p_i=(L(u),L(e_i),L(v))$. Although traversal may begin from any node, the resulting circuit differs only by a cyclic shift.

For example, in Fig. `\ref{fig:framework}`{=latex}B, when the traversal reaches the red C, there are four candidate neighbors (including the incoming one). According to the dataset-level statistics $F$, the C--C labeled-edge pattern has the highest $F(p)$, so $f_g$ takes that step (step 3). When it later returns to the same red C, it selects among the remaining three neighbors: the edge to Cl has the next-highest $F(p)$ (step 5), followed by steps 7 and 9.

*Frequency-Guided CPP* incorporates frequency statistics into the edge weights used by the solver. For an edge $e$ with associated pattern $p_e$, the weight is defined as $$w(e) = \alpha \cdot 1 + (1-\alpha) \cdot g(F(p_e)),$$ where $g$ is a decreasing function of frequency (e.g., $1/F(p_e)$) and $\alpha \in [0,1]$ is a tunable hyperparameter. Ties that arise during matching or tour construction are resolved using the priority policy specified in Eq. `\ref{eq:priority}`{=latex}. For disconnected graphs, each component is serialized independently and the results are concatenated in a fixed order.

#### Vocabulary Learning via BPE. {#sec:tokenize}

After converting the graph dataset $\mathcal{D}$ into a corpus of symbol sequences $\mathcal{D}_S$, the final stage of training is to learn a vocabulary from this corpus. We employ Byte Pair Encoding (BPE), inspired by the text tokenizers used in LLMs, corresponding to the main loop in Algorithm `\ref{alg:tokenizer}`{=latex} (**lines `\ref{alg-bpe-start}`{=latex}**--**`\ref{alg-bpe-end}`{=latex}**). BPE iteratively identifies the most frequently occurring adjacent pair of symbols in the corpus and merges it into a new symbol added to the vocabulary. Fig. `\ref{fig:framework}`{=latex}C illustrates the vocabulary learning procedure on a serialized molecular sequence. In this example, a pair denotes an adjacent atom--bond symbol, e.g. $(\mathrm{C},-)$. At each iteration $i$, the most frequent pair is replaced at all disjoint occurrences by a new token $T_i$, and the corresponding merge rule $(s_a,s_b)\!\to T_i$ is added to the codebook $\mathcal{C}$. The updated sequence is then passed back to the counting step, forming an iterative training loop.

The key insight of our framework lies in the interplay between structure-guided serialization and the BPE algorithm. The serialization function $f_g$ is not merely a format conversion tool but leverages the global frequency map $F$ to ensure that statistically common local graph structures are systematically encoded as frequently adjacent symbol pairs in the sequence corpus $\mathcal{D}_S$. This structured corpus forms an ideal input for BPE's greedy merging strategy. When BPE merges the most frequent pair $(s_a^*, s_b^*)$ (**line `\ref{alg-bpe-count}`{=latex}**), the operation is not arbitrary compression but the discovery of statistically salient tokens derived from graph data. Each merged token represents a larger subgraph fragment that can be recovered from the serialization. The resulting vocabulary $\mathcal{V}_T$ provides a data-driven, structurally informed representation of the graph for the downstream Transformer.

#### Encoding and Decoding. {#sec:encode}

After training, the procedure produces two components for inference: the frequency map $F$ and the BPE codebook $\mathcal{C}$. To encode a new graph, the [Encode]{.smallcaps} procedure in Algorithm `\ref{alg:tokenizer}`{=latex} is applied. The graph is first serialized into a symbol sequence by the function $f$, we apply the merge rules $\mathcal{R}$ from $\mathcal{C}$ in the learned order to obtain the final token sequence $S_T$. The [Decode]{.smallcaps} procedure in Algorithm `\ref{alg:tokenizer}`{=latex} reverses this process. The tokens in $S_T$ are first expanded back into the original symbol sequence by applying the inverse of $\mathcal{R}$, and the inverse serialization function $f^{-1}$ then reconstructs the graph. These procedures ensure that the mapping between graphs and sequences is both reversible and deterministic, providing a bidirectional interface between the two domains.

#### Applications. {#sec:application}

The primary output of our framework is a discrete sequence of tokens $S_T$ that faithfully encodes the original graph. This sequential representation provides an interface through which the Transformer ecosystem can be directly applied to graph-structured data [@nlp-attn]. For graph-level prediction tasks such as classification or regression, the token sequence can be processed by an encoder-only model (e.g., BERT). A special `[CLS]` token may be prepended, or the final hidden states pooled, to derive a vector representation for the entire graph [@g-deep-walk]. For generative tasks, a decoder-only model (e.g., GPT) can be trained to generate graphs auto-regressively by predicting the next token in the sequence, supporting applications such as molecular or material discovery [@nlp-gpt2]. Multimodal models can also support tasks such as graph summarization, where we use pretrained graph representations from the proposed tokenizer with a large language model to generate concise descriptions of input graphs [@q-transformer].

In summary, the proposed graph tokenizer reframes graph representation learning as a sequence modeling problem. Our method decouples the structural complexity of graphs from the architectural design of the model and enables direct use of advances in sequence modeling, such as longer context windows [@nlp-longcontext] and more efficient attention mechanisms [@nlp-flashattention] for a wide range of graph learning tasks.

Experiments
===========

In this section, we evaluate our proposed graph tokenizer, `GraphTokenizer` (GT). We aim to answer the following questions: (1) How effectively does BPE compress the serialized graph representations, and what is the efficiency of our approach in terms of sequence length, processing speed, and training throughput? (2) How does our framework, when paired with standard Transformer models, perform against state-of-the-art graph representation learning methods? (3) How do different design choices, such as the serialization method and BPE usage, affect performance? (4) Can the learned vocabulary and model attention provide interpretable insights into graph structures?

Experimental Setup
------------------

#### Datasets.

We evaluate our method on 14 diverse public datasets for graph classification and regression. The benchmarks span multiple domains, including molecular graphs such as Mutagenicity (`Muta`) and Proteins [@data-muta], OGBG-molhiv [@data-ogb], ZINC [@data-zinc], AQSOL [@data-aqsol], and QM9 [@data-qm9]; computer vision graphs like COIL-DEL [@data-coildel]; graph theory like Colors-3 [@data-colors3] and Synthetic [@data-synthetic]; biomedical graphs like DD [@data-dd2] [@data-muta-protein] and Peptides [@data-lrgb]; social networks (Twitter [@data-twitter]); and academic networks (DBLP [@data-dblp]). A summary of dataset statistics is provided in Appendix `\ref{appendix:datasets}`{=latex}.

#### Baselines.

Our approach is benchmarked against a comprehensive set of baselines, ranging from classic GNNs (GCN [@g-gcn], GIN [@g-gin]) to state-of-the-art models, including the powerful GCN+ [@g-rethink-gnn], Graph Transformers like GraphGPS [@g-graphgps], and the serialization-based GraphMamba [@g-mamba]. To ensure a fair comparison, all baseline results are from official implementations run on our unified data splits and preprocessing pipeline [@g-bencnhmark; @g-rethink-gnn; @data-rignn; @data-muta-protein]. Results on key benchmarks are in the main text; the rest are in Appendix `\ref{appendix:more-exp}`{=latex}.

**Implementation Details.** Our proposed method, `GraphTokenizer` (GT), encodes graphs into token sequences that are subsequently processed by a standard Transformer model for downstream tasks. We report results with two Transformer backbones: (1) **GT+BERT**, which adopts the BERT-small architecture [@nlp-bert], and (2) **GT+GTE**, which uses the more recent GTE model with a parameter count comparable to BERT-base [@nlp-gte]. By default, the tokenizer applies the Frequency-Guided Eulerian circuit (`Feuler`) serialization method followed by Byte Pair Encoding (BPE) on the resulting sequences. In the main results table, we report the best-performing serialization for each dataset; a comprehensive comparison across all serialization methods is provided in the ablation study (Table `\ref{tab:ablation}`{=latex}) and Appendix `\ref{appendix:ablation}`{=latex}. Further details on model architectures, dataset splits, and hyperparameter settings are provided in Appendix `\ref{appendix:hyperparameters}`{=latex}.

Performance Results
-------------------

We present the main performance comparison on a representative subset of classification and regression benchmarks in Tables `\ref{tab:main_results_unified}`{=latex}. For each dataset, we report the mean and standard deviation of the primary evaluation metric on five independent runs.

#### Sequence length and efficiency.

Figure `\ref{fig:efficiency}`{=latex} illustrates the impact of our tokenizer on efficiency. As shown in Figure `\ref{fig:efficiency}`{=latex}a, BPE achieves a high compression ratio, reducing sequence lengths from reversible methods to approximately 10% of their original size. Notably, the frequency-guided Eulerian method (`Feuler`) produces more compact sequences post-BPE than its unguided counterpart, confirming that our structure-guided serialization is particularly well-suited for BPE. This compression translates directly to improved training efficiency. Figure `\ref{fig:efficiency}`{=latex}b shows that with BPE, our approach using a standard Transformer backbone becomes significantly more efficient than specialized Graph Transformers like GraphGPS and even surpasses classic GNNs such as GatedGCN. While the speedup (e.g., $\sim$2.5$\times$ on `zinc` for a 10$\times$ compression) is not linear due to model overhead, the gains are substantial. This demonstrates that our graph tokenization framework not only enables standard sequence models to process graphs but also makes them a highly efficient and performant option for graph learning tasks.

![BPE achieves high compression ratios.](exp-raw/exp1_speed/token_length/zinc_token_length_comparison.png){#fig:eff_len width="\\textwidth"}

```{=latex}
\hfill
```
![Training time per epoch.](exp-raw/exp1_speed/train_time/ZINC_training_efficiency.png){#fig:eff_train width="\\textwidth"}

#### Classification and Regression.

Table `\ref{tab:main_results_unified}`{=latex} presents the performance on classification and regression benchmarks, reporting the mean and standard deviation over five independent runs. Our approach, particularly with the GTE backbone (GT+GTE), achieves state-of-the-art results on a majority of the 14 benchmarks. On the `ogbg-molhiv` benchmark, for instance, GT+GTE attains an ROC-AUC of $0.876$ on our test split (val $0.903$), significantly exceeding reported leaderboard results (test $0.8475$, val $0.8275$). This strong performance is achieved using an off-the-shelf sequence model without any graph-specific architectural modifications. Furthermore, the framework's effectiveness is evident even with the compact GT+BERT model, which already outperforms strong baselines on several datasets. Critically, scaling up to the larger GT+GTE backbone yields consistent performance gains across the board, demonstrating a clear advantage over many GNN architectures that can suffer from performance degradation with increased model capacity due to issues like over-smoothing. We further validate these findings through an expanded evaluation in Appendix `\ref{appendix:performance}`{=latex}, which includes comparisons against additional state-of-the-art architectures (e.g., Graphormer, FragNet) and LLM-based Graph Foundation Models (e.g., GraphGPT, LLAGA).

```{=latex}
\begin{table*}[ht]

\caption{Results of classification (left block) and regression (right block). The best scores are shown in bold, the second-best are underlined, and standard deviations are given in parentheses. Results for the remaining datasets are presented in Appendix~\ref{appendix:more-exp}.}
\label{tab:main_results_unified}
\resizebox{\textwidth}{!}{
\begin{tabular}{lccccc|cccc}
\toprule
\textbf{Model} &
\shortstack{\texttt{molhiv}\\ auc$\uparrow$} &
\shortstack{\texttt{p-func}\\ ap$\uparrow$} &
\shortstack{\texttt{mutag}\\ acc$\uparrow$} &
\shortstack{\texttt{coildel}\\ acc$\uparrow$} &
\shortstack{\texttt{dblp}\\ acc$\uparrow$} &
\shortstack{\texttt{qm9}\\ mae$\downarrow$} &
\shortstack{\texttt{zinc}\\ mae$\downarrow$} &
\shortstack{\texttt{aqsol}\\ mae$\downarrow$} &
\shortstack{\texttt{p-struct}\\ avg~mae$\downarrow$} \\
\midrule
GCN        & 74.0 {\scriptsize (0.9)}  & 53.2 {\scriptsize (1.4)}  & 79.7 {\scriptsize (1.7)}  & 74.6 {\scriptsize (0.4)}  & 76.6 {\scriptsize (0.8)}  & 0.134 {\scriptsize (0.004)} & 0.399 {\scriptsize (0.006)} & 1.345 {\scriptsize (0.013)} & 0.342 {\scriptsize (0.003)} \\
GIN        & 76.1 {\scriptsize (1.1)}  & 61.4 {\scriptsize (0.7)}  & 80.4 {\scriptsize (1.2)}  & 72.0 {\scriptsize (0.8)}  & 73.8 {\scriptsize (0.9)}  & 0.176 {\scriptsize (0.006)} & 0.379 {\scriptsize (0.007)} & 2.053 {\scriptsize (0.058)} & 0.338 {\scriptsize (0.002)} \\
GAT        & 72.1 {\scriptsize (0.8)}  & 51.2 {\scriptsize (1.1)}  & 80.1 {\scriptsize (0.9)}  & 74.4 {\scriptsize (1.1)}  & 76.3 {\scriptsize (0.7)}  & 0.114 {\scriptsize (0.015)} & 0.445 {\scriptsize (0.015)} & 1.388 {\scriptsize (0.008)} & 0.316 {\scriptsize (0.003)} \\
GatedGCN   & 80.6 {\scriptsize (0.6)}  & 51.2 {\scriptsize (1.0)}  & 83.6 {\scriptsize (0.8)}  & 83.7 {\scriptsize (0.4)}  & 86.0 {\scriptsize (0.4)}  & 0.096 {\scriptsize (0.007)} & 0.370 {\scriptsize (0.011)} & 0.940 {\scriptsize (0.016)} & 0.312 {\scriptsize (0.004)} \\
\midrule
GraphGPS   & 78.5 {\scriptsize (1.5)}  & 53.5 {\scriptsize (0.7)}  & 84.3 {\scriptsize (0.9)}  & 80.5 {\scriptsize (0.8)}  & 71.6 {\scriptsize (0.8)}  & 0.084 {\scriptsize (0.004)} & 0.310 {\scriptsize (0.005)} & 1.587 {\scriptsize (0.011)} & 0.251 {\scriptsize (0.001)} \\
Exphormer  & 82.3 {\scriptsize (0.7)}  & 64.5 {\scriptsize (0.9)}  & 82.7 {\scriptsize (1.1)}  & \textbf{91.5} {\scriptsize (0.2)}  & 84.9 {\scriptsize (0.8)}  & 0.080 {\scriptsize (0.005)} & 0.281 {\scriptsize (0.006)} & 0.749 {\scriptsize (0.006)} & 0.251 {\scriptsize (0.002)} \\
GraphMamba & 81.2 {\scriptsize (0.5)}  & 67.7 {\scriptsize (0.9)}  & 85.0 {\scriptsize (1.0)}  & 74.5 {\scriptsize (1.1)}  & 87.6 {\scriptsize (0.5)}  & 0.083 {\scriptsize (0.005)} & 0.209 {\scriptsize (0.009)} & 1.133 {\scriptsize (0.014)} & 0.248 {\scriptsize (0.002)} \\
GCN+       & 80.1 {\scriptsize (0.6)}  & \underline{72.6} {\scriptsize (0.6)}  & \underline{88.7} {\scriptsize (0.6)}  & 88.9 {\scriptsize (0.3)}  & 89.6 {\scriptsize (0.4)}  & \underline{0.077} {\scriptsize (0.003)} & \textbf{0.116} {\scriptsize (0.009)} & 0.712 {\scriptsize (0.009)} & \underline{0.244} {\scriptsize (0.001)} \\
\midrule
\textbf{GT+BERT} & \underline{82.6} {\scriptsize (0.4)} & 68.5 {\scriptsize (0.5)} & 87.5 {\scriptsize (0.9)} & 74.1 {\scriptsize (0.4)} & \underline{93.2} {\scriptsize (0.1)} & 0.122 {\scriptsize (0.008)} & 0.241 {\scriptsize (0.011)} & \underline{0.648} {\scriptsize (0.008)} & {0.247} {\scriptsize (0.002)} \\
\textbf{GT+GTE}  & \textbf{87.4} {\scriptsize (0.2)} & \textbf{73.1} {\scriptsize (0.2)} & \textbf{90.1} {\scriptsize (0.7)} & \underline{89.6} {\scriptsize (0.2)} & \textbf{93.6} {\scriptsize (0.1)} & \textbf{0.071} {\scriptsize (0.004)} & \underline{0.131} {\scriptsize (0.007)} & \textbf{0.609} {\scriptsize (0.016)} & \textbf{0.242} {\scriptsize (0.001)} \\
\bottomrule

\end{tabular}
}
\end{table*}
```
Ablation Studies
----------------

We conduct ablation studies to evaluate the impact of different serialization methods and the BPE tokenization step while keeping the GT+GTE backbone fixed. Table `\ref{tab:ablation}`{=latex} shows that the choice of serialization has a significant impact on performance. Reversible methods that traverse every edge (e.g., Eulerian and CPP variants) significantly outperform non-reversible node-list traversals, with only a few exceptions detailed in Appendix `\ref{appendix:more-exp}`{=latex}. Within the reversible category, the frequency-guided Eulerian circuit (`Feuler`) demonstrates a clear advantage over its unguided counterpart, not only in mean performance but also in reduced variance, indicating greater stability. In contrast, the performance gap between CPP and its frequency-guided version (`FCPP`) remains minimal. A plausible explanation is that CPP's objective of finding a minimum-weight traversal already yields a highly structured sequence, leaving limited room for further improvement from frequency-based guidance. Although `FCPP` performs comparably to `Feuler`, `Feuler` provides substantial benefits in algorithmic complexity (Appendix `\ref{appendix:cpp-complexity}`{=latex}) and scalability, making it a more practical choice for larger graphs.

A second key finding is that applying BPE to serialized sequences substantially improves model performance. Across nearly all configurations, BPE yields higher scores with a clear performance margin. This improvement is accompanied by reduced standard deviation, indicating more stable and reliable training. Moreover, these performance gains come in addition to the substantial efficiency improvements discussed previously in Figure `\ref{fig:efficiency}`{=latex}. Therefore, BPE is a critical component of our framework, enhancing both accuracy and computational efficiency.

Additional ablation studies in Appendix `\ref{appendix:ablation}`{=latex} further investigate the sensitivity to the BPE vocabulary size $K$, the choice of statistical guidance unit for frequency collection, and the impact of serialization strategies on graph counting tasks (e.g., the COLORS-3 dataset).

```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccccccc}
\toprule
\textbf{Method} &
\shortstack{\texttt{molhiv}\\  auc$\uparrow$} &
\shortstack{\texttt{coildel}\\ \ acc$\uparrow$} &
\shortstack{\texttt{p-func}\\  ap$\uparrow$} &
\shortstack{\texttt{zinc}\\  mae$\downarrow$} &
\shortstack{\texttt{qm9}\\  mae$\downarrow$} \\
\cmidrule(lr){2-3}\cmidrule(lr){4-5}\cmidrule(lr){6-7}\cmidrule(lr){8-9}\cmidrule(lr){10-11}
 & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o \\
\midrule
BFS      & 72.3 {\scriptsize (0.6)} & 81.2 {\scriptsize (0.9)} & 81.2 {\scriptsize (0.9)} & 80.1 {\scriptsize (1.3)} & 68.5 {\scriptsize (0.6)} & 67.2 {\scriptsize (0.2)} & 0.453 {\scriptsize (0.011)} & 0.696 {\scriptsize (0.013)} & 0.311 {\scriptsize (0.009)} & 0.292 {\scriptsize (0.011)} \\
DFS      & 76.0 {\scriptsize (0.4)} & 79.1 {\scriptsize (0.5)} & 80.5 {\scriptsize (0.4)} & 79.8 {\scriptsize (0.8)} & \underline{71.0} {\scriptsize (1.1)} & 68.4 {\scriptsize (0.3)} & 0.446 {\scriptsize (0.009)} & 0.705 {\scriptsize (0.008)} & 0.291 {\scriptsize (0.007)} & 0.277 {\scriptsize (0.010)} \\
TOPO     & 73.2 {\scriptsize (0.6)} & 75.6 {\scriptsize (0.8)} & 82.6 {\scriptsize (0.8)} & 81.4 {\scriptsize (1.2)} & 67.9 {\scriptsize (0.3)} & 64.5 {\scriptsize (0.5)} & 0.416 {\scriptsize (0.010)} & 0.634 {\scriptsize (0.011)} & 0.293 {\scriptsize (0.010)} & 0.275 {\scriptsize (0.013)} \\
Eulerian & 84.5 {\scriptsize (0.7)} & 81.0 {\scriptsize (1.0)} & 84.1 {\scriptsize (1.5)} & 84.0 {\scriptsize (1.5)} & 69.1 {\scriptsize (0.6)} & 66.8 {\scriptsize (1.1)} & 0.164 {\scriptsize (0.009)} & 0.160 {\scriptsize (0.016)} & 0.083 {\scriptsize (0.004)} & 0.104 {\scriptsize (0.008)} \\
Feuler   & \textbf{87.4} {\scriptsize (0.4)} & 81.3 {\scriptsize (0.5)} & 88.0 {\scriptsize (0.7)} & 85.6 {\scriptsize (0.6)} & \textbf{73.1} {\scriptsize (0.3)} & 68.1 {\scriptsize (0.9)} & \textbf{0.131} {\scriptsize (0.007)} & 0.171 {\scriptsize (0.013)} & \textbf{0.071} {\scriptsize (0.005)} & 0.088 {\scriptsize (0.007)} \\
CPP      & \underline{86.9} {\scriptsize (0.3)} & 81.2 {\scriptsize (0.5)} & \textbf{89.6} {\scriptsize (0.1)} & 86.7 {\scriptsize (0.3)} & 69.2 {\scriptsize (0.2)} & 67.0 {\scriptsize (0.8)} & 0.141 {\scriptsize (0.006)} & 0.145 {\scriptsize (0.009)} & \underline{0.073} {\scriptsize (0.004)} & 0.093 {\scriptsize (0.006)} \\
FCPP     & 86.4 {\scriptsize (0.3)} & 81.0 {\scriptsize (0.6)} & \underline{89.4} {\scriptsize (0.3)} & 86.8 {\scriptsize (1.0)} & 69.2 {\scriptsize (0.3)} & 66.3 {\scriptsize (0.5)} & \underline{0.140} {\scriptsize (0.005)} & 0.151 {\scriptsize (0.008)} & {0.079} {\scriptsize (0.005)} & 0.095 {\scriptsize (0.007)} \\
SMILES   & — & — & — & — & — & — & 0.201 {\scriptsize (0.012)} & 0.339 {\scriptsize (0.009)} & 0.092 {\scriptsize (0.008)} & 0.081 {\scriptsize (0.014)} \\
\bottomrule

\end{tabular}
}
```
#### Visualizing the Learned Vocabulary.

To examine the structural semantics captured by our tokenizer, we visualize the vocabulary constructed by BPE on the ZINC dataset. As shown in Figure `\ref{fig:bpe_vocab_example}`{=latex}, BPE iteratively merges simple, frequent substructures into progressively more complex and chemically meaningful tokens. For example, a basic sulfonyl group is first established as a token, and then combined with adjacent atoms to form larger functional groups. This confirms that BPE can automatically discover and compose semantically relevant substructures, yielding an interpretable, hierarchical vocabulary.

![Illustration of the BPE merging process on ZINC. Each row shows how simple substructures (left) are iteratively merged to form larger, chemically meaningful tokens (middle and right).](fig/word-process-vis.png){#fig:bpe_vocab_example width="0.85\\linewidth"}

A statistical analysis of the vocabulary ($K{=}2000$) on ZINC further supports these observations: the token size distribution peaks in the 4--6 node range (41.5%), which corresponds to typical functional groups, while atomic tokens (0--1 nodes) comprise only 7.1%. Over 60% of the vocabulary represents multi-node substructures, confirming that BPE discovers an optimal level of abstraction. Further analysis is provided in Appendix `\ref{appendix:qualitative}`{=latex}.

Conclusion
==========

In this paper, we introduce a general framework for graph tokenization that bridges graph-structured data with the Transformer ecosystem. Our approach combines reversible, structure-guided graph serialization with BPE to construct a faithful and efficient interface that encodes graphs into discrete token sequences. Fundamentally, the framework decouples the design of graph representations from the underlying model architecture, thereby reframing graph learning as a sequence modeling problem. This perspective enables graph data to be directly integrated into general-purpose Transformers, allowing the graph learning field to leverage rapid advancements in model architectures, training strategies, and scaling capabilities. Empirically, we demonstrate that this approach enables standard off-the-shelf Transformers to process graph data effectively and achieve state-of-the-art results on a diverse set of benchmarks, outperforming established GNNs and specialized Graph Transformers. Limitations of this work and directions for future research are discussed in Appendix `\ref{appendix:discussion}`{=latex}.

```{=latex}
\newpage
```
Acknowledgments {#acknowledgments .unnumbered}
===============

This work is supported in part by the National Natural Science Foundation of China (No. 62550138, 62192784, 62572064, 62472329), and Young Elite Scientists Sponsorship Program (No.2023QNRC001) by CAST.

Ethics Statement {#ethics-statement .unnumbered}
================

The authors of this work have read and commit to adhering to the Code of Ethics. Our research proposes a foundational framework for graph tokenization and, to the best of our knowledge, does not present any direct ethical concerns. The work does not involve the use of personally identifiable information, sensitive human-subject data, or applications with immediate potential for societal harm.

Reproducibility Statement {#reproducibility-statement .unnumbered}
=========================

To ensure full reproducibility, our complete source code is provided in the supplementary materials. This source code contains training configuration files for all experiments, and the necessary scripts to preprocess datasets from their original sources. For convenience, ready-to-use versions of the datasets are also provided. Comprehensive details on the experimental setup are documented in the Appendix, including the datasets (Appendix `\ref{appendix:datasets}`{=latex}), model architectures (Appendix `\ref{appendix:architectures}`{=latex}), hyperparameter configuration (Appendix `\ref{appendix:hyperparameters}`{=latex}), and runtime environment (Appendix `\ref{appendix:environment}`{=latex}).

```{=latex}
\bibliographystyle{iclr2026_conference}
```
```{=latex}
\newpage
```
`\LARGE `{=latex}Appendix

```{=latex}
\appendix
```
Discussion {#appendix:discussion}
==========

This section provides a candid discussion of our framework's limitations and outlines promising avenues for future research.

Limitations {#appendix:limitations}
-----------

#### Graphs with Continuous Features.

Our framework operates by first converting a labeled graph into a sequence of discrete symbols, which subsequently forms the input for Byte Pair Encoding (BPE). This process implicitly assumes that the features associated with nodes and edges are discrete and can be directly mapped to an initial symbol alphabet. Consequently, graphs where attributes are primarily continuous (e.g., real-valued vectors) cannot be handled natively. The required conversion of continuous attributes into a finite set of discrete symbols always needs a quantization step, which is inherently lossy. This directly conflicts with our framework's core principle of creating a faithful and reversible representation. Integrating continuous features in a principled manner alongside our discrete tokenization process therefore remains a key limitation. However, as elaborated in Appendix `\ref{appendix:future_work}`{=latex}, this is not a fundamental blockage. We propose that continuous features can be effectively integrated via Vector Quantization (VQ) interfaces or by treating continuous attributes as a parallel modality fused with token embeddings.

#### Node and Edge Level Tasks.

The current framework is primarily evaluated on graph-level prediction tasks. Adapting it for fine-grained objectives, such as node classification or link prediction, raises a general consideration for graph tokenization: the Byte Pair Encoding (BPE) process, which is essential for building an efficient vocabulary, may merge the specific target node or edge into a larger, composite token. As a result, the discrete identity of the target entity can be obscured, making it difficult to formulate a direct prediction objective. To mitigate this, future iterations can introduce special \`\`pointer tokens" to distinguish original nodes in a composite token or modify the BPE merge rules to prevent target nodes from being merged into composite tokens, thereby preserving their distinct addressability.

#### Computational Complexity and Scalability.

A key scalability limitation in our framework arises from the trade-off between different serialization methods. The Chinese Postman Problem (CPP) based approach, for instance, introduces a significant bottleneck: its $O(|\mathcal{V}|^3)$ complexity renders it impractical for datasets with large graphs. Consequently, we primarily adopt the highly efficient, linear-time `Feuler` method. With this choice, the main computational workload shifts to the BPE vocabulary training. However, this is a far more manageable constraint, as BPE training is a one-time, offline cost, and its practical runtime impact is a constant-factor consideration rather than a prohibitive asymptotic scaling issue (as detailed in Appendix `\ref{appendix:complexity}`{=latex}). A separate, downstream constraint is the Transformer's fixed context window, which limits the maximum size of a single graph. While this is an inherent limitation of the model architecture, our BPE compression significantly alleviates the issue, and adopting long-context models provides a clear path for future scaling.

Future Works {#appendix:future_work}
------------

Our work opens several promising avenues for future research. Below, we outline these potential directions, ordered based on our perspective of their potential impact and research scope.

-   **Graphs with Continuous Features.** To extend our framework to graphs with continuous features, a straightforward approach would be to first discretize them using methods like vector quantization [@g-vqgraph] and then apply our pipeline. However, we argue this direction is suboptimal, as such quantization is inherently lossy and conflicts with our framework's core principle of creating a faithful, reversible representation. A more promising direction is to treat discrete and continuous information in parallel channels. For instance, features learned via message passing could be incorporated as a continuous bias added to the discrete token embeddings, analogous to how positional encodings supplement token information in standard Transformers. Alternatively, spectral graph theory offers a principled way to derive global information; the eigenvectors of the graph Laplacian could be used to generate a unique continuous encoding for each token's position within the global structure, complementing the discrete sequence.

-   **Node and Edge Level Tasks.** For adapting the framework to node or edge-level tasks, an intuitive strategy might be to predict the entire composite token that contains the target entity. However, this approach is problematic as it makes the learning signal less direct and can amplify prediction errors. A more targeted and principled approach would be to modify the BPE procedure itself. Specifically, we propose fixing the target entities so they are excluded from merging during tokenization. This would preserve the target's granularity for direct prediction, introducing an interesting trade-off between task-specific fidelity and overall vocabulary efficiency that warrants further investigation.

-   **Generative Pre-training and Cross-Domain Generalization.** A key advantage of our sequence-based representation is its potential for large-scale pre-training. A natural next step is to explore pre-training on large corpora of graphs from within the same domain to enhance transferability. More ambitiously, our framework offers a new perspective on the grand challenge of cross-domain generalization. Since our method transforms any graph into a standard sequential format, we hypothesize that graphs from disparate domains can be treated as different \`languages' in NLP. **This analogy extends to generative modeling: just as LLMs generate text, our framework enables autoregressive graph generation. We provide a preliminary verification of this capability in Appendix `\ref{app:autoregressive_generation}`{=latex}, demonstrating that a standard decoder-only Transformer can generate valid graph structures.** Consequently, training a single, large-scale Transformer on a diverse, multi-domain corpus of tokenized graphs could facilitate unprecedented knowledge transfer, potentially giving rise to the scaling laws and unified representations that are foundational to true Graph Foundation Models. Finally, such pre-training may also serve as a novel tool for analyzing datasets. Our preliminary results suggest that a dataset's size does not always correlate with its \`information density,' and monitoring a model's overfitting on a masked prediction task could provide a new way to quantify a graph dataset's quality and diversity.

-   **Further Extensions and Applications.** Our framework also opens several avenues for direct applications and algorithmic refinements. By reframing graph learning as a sequence modeling problem, it immediately enables the use of powerful autoregressive models, such as GPT-style architectures, for graph generation tasks like controllable molecular design. Furthermore, the sequential representation allows our method to seamlessly integrate with emerging long-context architectures to enhance scalability for massive graphs. Extending the serialization to richer graph structures, such as heterogeneous information networks [@g-fedhgnn] and hypergraphs [@g-aehcl], is another promising direction. Finally, a more fundamental extension could involve making the serialization process itself learnable, where traversal decisions are optimized end-to-end for a specific downstream task. While this could create a highly specialized graph-to-sequence interface, it also introduces significant challenges regarding generalization and stability.

Use of Large Language Models {#appendix:llm-usage}
----------------------------

In this work, we used large language models (LLMs) to assist with two non-substantive aspects of the research workflow. No part of the scientific contributions---such as algorithm design, model architecture, theoretical formulation, or experimental evaluation---was generated by or delegated to an LLM.

-   **Manuscript Editing.** LLMs were used to help polish the language of the manuscript. This includes surface-level edits such as improving clarity, grammar, and conciseness of English expressions. All technical content, algorithmic designs, and empirical results were authored and validated by the authors.

-   **Code Documentation and Cleanup.** At the time of open-sourcing our implementation, LLMs were used to assist in non-algorithmic tasks including: adding docstrings and inline comments, generating basic usage documentation, removing deprecated or redundant code, and improving logging output for better reproducibility. All code functionality and correctness were manually verified by the authors.

Further Methodological Details {#appendix:method_details}
==============================

This section provides a formal and detailed supplement to the methodological discussions in Section `\ref{sec:method}`{=latex}. We aim to precisely define how classical graph traversal algorithms are adapted for the task of graph serialization, and to rigorously analyze their resulting properties of **reversibility** and **determinism**, pinpointing the exact sources of their respective strengths and weaknesses for our framework. All notations, unless specified otherwise, follow the definitions established in Section `\ref{sec:preliminaries}`{=latex}.

Distinction from Neural Graph Tokenization Methods {#appendix:vae_distinction}
--------------------------------------------------

It is crucial to clarify that BPE differs fundamentally from **neural-based continuous encoders** such as VAE-based subgraph tokenizers (e.g., PS-VAE [@g-psvae]), linear projectors, or ViT-style patchified embeddings. These methods utilize learnable parameters to project discrete inputs into **continuous embedding spaces**. While such continuous approximation is native for processing continuous signals (e.g., pixel intensities in CV), it introduces inevitable information loss and reconstruction ambiguity when applied to discrete structures like graphs. In contrast, our BPE approach is a deterministic, statistics-based algorithm operating entirely within the discrete symbolic domain. It does not require neural network training and merges symbols based on frequency, avoiding the gradient-based optimization complexity inherent in forcing discrete topology into continuous representations.

This distinction yields decisive practical advantages: (1) **Reversibility:** Our framework is inherently lossless and deterministic. Since graph serialization preserves topology and BPE merging is reversible via the vocabulary, we can reconstruct the exact graph structure up to isomorphism, whereas VAEs rely on probabilistic decoders that only approximate the input. (2) **Efficiency:** BPE achieves significant sequence compression (over $10\times$) without the computational overhead of training separate neural encoders, effectively reducing the quadratic complexity of the Transformer's attention mechanism. (3) **Decoupling:** By converting graphs into standard discrete symbol sequences, we completely decouple graph representation from model architecture. Neural encoding strategies typically require specific graph encoders (e.g., GNNs) or specialized projection layers to bridge the modality gap. Once tokenized by our framework, a graph is formally identical to a natural language sequence and can be processed by any unmodified, off-the-shelf Transformer without specialized graph encoders or projection layers.

Formal Analysis of Node-List Serialization Methods {#appendix:node_list_analysis}
--------------------------------------------------

We first consider the class of methods that generate a sequence composed solely of node identifiers.

#### Formal Definition.

When applied to graph serialization, a node-list traversal method is a function $f_{\text{node}}: \mathcal{G} \mapsto S$ that maps a labeled graph $\mathcal{G} = ((\mathcal{V}, \mathcal{E}), L, \Sigma)$ to a sequence of node labels $S = (s_1, s_2, \dots, s_{|\mathcal{V}|})$. This sequence is generated by a traversal that visits every node in $\mathcal{V}$ exactly once. Let the sequence of visited nodes be $(v_1, v_2, \dots, v_{|\mathcal{V}|})$, where $\{v_1, \dots, v_{|\mathcal{V}|}\} = \mathcal{V}$. The output sequence is then $S = (L(v_1), L(v_2), \dots, L(v_{|\mathcal{V}|}))$.

-   For **Breadth-First Search (BFS)** and **Topological Sort**, the order of nodes in the sequence reflects the layer-by-layer or dependency-based traversal order. It is crucial to note that for $i \in [1, |\mathcal{V}|-1]$, an edge $(v_i, v_{i+1})$ is not guaranteed to exist in $\mathcal{E}$.

-   For **Depth-First Search (DFS)**, the standard algorithm produces a sequence where each node $v_{i+1}$ is an unvisited neighbor of $v_i$ (or a node reached after backtracking). The sequence of \*first discovery\* implies a tree structure (the DFS tree), but the output sequence itself does not encode this structure explicitly.

#### Reversibility.

These methods are fundamentally **irreversible**. A serialization function $f$ is reversible if, for its output $S=f(\mathcal{G})$, the set of pre-images $f^{-1}(S)$ contains a graph $\mathcal{G}'$ such that $\mathcal{G}' \cong \mathcal{G}$. For node-list methods, this condition fails. The output sequence $S$ discards all explicit edge connectivity information. Consequently, a vast number of non-isomorphic graphs can produce the exact same node-label sequence. For example, consider two graphs on three nodes $\{A, B, C\}$ with identical labels: a path graph $\mathcal{G}_1$ with edges $\{(A,B), (B,C)\}$ and a star graph $\mathcal{G}_2$ with edges $\{(A,B), (A,C)\}$. A valid BFS starting from node A in $\mathcal{G}_1$ could produce the sequence $(L(A), L(B), L(C))$, and a BFS starting from A in $\mathcal{G}_2$ could also produce the same sequence. From the sequence alone, it is impossible to distinguish $\mathcal{G}_1$ from $\mathcal{G}_2$, thus violating reversibility.

#### Determinism.

These methods are inherently **non-deterministic**. The source of non-determinism lies in the arbitrary selection of the next node to visit from a set of valid candidates. Formally, at any step of the traversal from a node $u$, let $N_{\text{valid}}(u)$ be the set of unvisited neighbors (for BFS/DFS) or nodes with no remaining incoming edges (for Topological Sort). Since $N_{\text{valid}}(u)$ is an unordered set, any choice of $v \in N_{\text{valid}}(u)$ is permissible by the standard algorithm's definition. The final sequence is thus contingent on implementation-specific details, such as the memory layout of the graph's adjacency list, which are not canonical properties of the graph itself. This leads to different sequences for the same graph, violating determinism.

Formal Analysis of Edge-Covering Serialization Methods {#appendix:edge_cover_analysis}
------------------------------------------------------

This class of methods generates sequences by performing a walk that traverses every edge in the graph at least once.

#### Formal Definition.

A *walk* $W$ in a graph $\mathcal{G}$ is a finite sequence of alternating nodes and edges, $W = (v_0, e_1, v_1, e_2, \dots, e_k, v_k)$, where $v_i \in \mathcal{V}$, $e_i \in \mathcal{E}$, and for all $i \in [1, k]$, edge $e_i=(v_{i-1},v_i) \in \mathcal{E}$. An edge-covering serialization function $f_{\text{edge}}: \mathcal{G} \mapsto S$ maps $\mathcal{G}$ to the label sequence corresponding to such a walk, $S = (L(v_0), L(e_1), L(v_1), \dots, L(v_k))$, under the constraint that the multiset of edges in the walk, $\{e_1, \dots, e_k\}$, covers the entire edge set of the graph, i.e., $\mathcal{E} \subseteq \{e_1, \dots, e_k\}$.

-   For an **Eulerian circuit**, the algorithm is applied to a graph where every undirected edge $\{u,v\} \in \mathcal{E}$ is treated as a pair of directed edges, $(u,v)$ and $(v,u)$, forming an edge set $\mathcal{E}'$  [@g-euler/anal]. The resulting walk traverses every edge in $\mathcal{E}'$ *exactly once*.

-   For the **Chinese Postman Problem (CPP)**, the algorithm finds a walk $W$ that covers $\mathcal{E}$ while minimizing the total weight of the walk, $\sum_{i=1}^k w(e_i)$, where $w(e)$ is the weight of edge $e$ (typically 1 for unweighted graphs). This means some edges in $\mathcal{E}$ may be traversed multiple times.

#### Reversibility.

By definition, these methods are **reversible**. The output sequence $S$ is composed of a series of labeled triplets $(L(v_{i-1}), L(e_i), L(v_i))$. From this sequence, one can reconstruct the complete multiset of labeled edge traversals. Since this multiset is guaranteed to contain every edge from the original graph $\mathcal{E}$ at least once, the full topology of $\mathcal{G}$ can be losslessly recovered up to isomorphism. The mapping from the graph's structure to the information contained in the sequence is injective.

#### Determinism.

In their classical forms, these methods are **non-deterministic**.

-   For an **Eulerian Path**, the non-determinism stems from Hierholzer's algorithm. At any node $u$ during the tour construction, let $E^{\text{unvisited}}_u$ be the set of untraversed edges incident to $u$. The algorithm proceeds by selecting an arbitrary edge from this set. As $E^{\text{unvisited}}_u$ is an unordered set, this choice introduces ambiguity, leading to different valid Eulerian circuits and thus different output sequences.

-   For the **CPP**, the primary source of non-determinism arises during the solution process. The algorithm first identifies the set of odd-degree nodes, $\mathcal{V}_{\text{odd}}$. It then computes a minimum-weight perfect matching on a complete auxiliary graph constructed on $\mathcal{V}_{\text{odd}}$. If multiple distinct perfect matchings exist that share the same minimum total weight, the choice between them is arbitrary. This choice dictates which paths in the original graph are duplicated to form an Eulerian supergraph, ultimately resulting in different minimum-cost tours and thus different sequences.

Our proposed frequency-guided mechanism (Section `\ref{sec:serialize}`{=latex}) addresses these specific sources of non-determinism by providing a canonical, data-driven rule for making these choices.

Disconnected Graphs
-------------------

For completeness, we specify our procedure for handling graphs that are not connected. To ensure the overall serialization remains deterministic, an input graph $\mathcal{G}$ is first decomposed into its set of connected components, $\{\mathcal{G}_1, \mathcal{G}_2, \dots, \mathcal{G}_c\}$. Each component $\mathcal{G}_i$ is independently serialized into a sequence $S_i$ using the chosen method. The resulting set of sequences $\{S_1, \dots, S_c\}$ is then sorted to produce a canonical ordering. The sorting criterion is primarily the length of the sequence in descending order, any ties are resolved using standard lexicographical comparison. The final output is the concatenation of these sorted sequences. This guarantees that any given graph, regardless of its connectivity, maps to a single, unique sequence.

Complexity Analysis {#appendix:complexity}
-------------------

#### CPP Complexity. {#appendix:cpp-complexity}

For CPP--based serialization, the end-to-end cost is $O(|\mathcal{V}|^{3}+|\mathcal{E}|)$: $O(|\mathcal{V}|^{3})$ from the minimum-weight perfect matching on odd-degree vertices, and $O(|\mathcal{E}|)$ from finding the Euler circuit after augmentation. Since $|\mathcal{E}|\le |\mathcal{V}|^{2}$ (even $|\mathcal{E}|=\Theta(|\mathcal{V}|^{2})$ for a complete graph), we have $O(|\mathcal{V}|^{3}+|\mathcal{E}|)=O(|\mathcal{V}|^{3})$, so throughout we denote the CPP family as $O(|\mathcal{V}|^{3})$.

#### Pipeline Complexity.

We analyze the computational complexity of the `GraphTokenizer` training procedure. Let $\mathcal{D} = \{\mathcal{G}_1, \dots, \mathcal{G}_N\}$ be the training dataset, where $\mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i)$. We define $V_S = \sum_{i=1}^{N} |\mathcal{V}_i|$ and $E_S = \sum_{i=1}^{N} |\mathcal{E}_i|$ as the total number of nodes and edges in the dataset, respectively. Let $K$ be the number of BPE merge operations and $\bar{L}$ be the average initial sequence length, where $\bar{L} \approx E_S/N$.

*Statistics Collection.* This stage requires a single pass over all edges in the dataset to count local patterns. The complexity is therefore $O(E_S)$.

*Graph Serialization.* As summarized in Table `\ref{tab:serialization_comparison}`{=latex}, the complexity depends on the chosen method. Traversal-based methods such as `Eulerian` (and its guided variant `Feuler`) are linear in graph size, with a total cost for the dataset of $O(V_S + E_S)$. In contrast, `CPP` (and its guided variant `FCPP`) is dominated by solving a minimum-weight perfect matching for each graph, yielding a total complexity of $O(\sum_{i=1}^{N} |\mathcal{V}_i|^3)$. Assuming a relatively homogeneous distribution of graph sizes, this can be expressed as $O(N \cdot (\frac{V_S}{N})^3) = O(\frac{{V_S}^3}{N^2})$.

*BPE Training.* The complexity of the BPE training phase (Lines `\ref{alg-bpe-start}`{=latex}-`\ref{alg-bpe-end}`{=latex}) is dominated by the sequence merge operation. For a standard implementation using dynamic arrays, the cost of a merge is proportional to sequence length. We analyze the total complexity under two bounding scenarios.

First, consider the case where a small number of pairs are merged per iteration. The work is dominated by rewriting all $N$ sequences for each of the $K$ iterations, leading to a total complexity of: $$C_1 = O(K \cdot N \cdot \bar{L})$$

Second, consider the case where a maximal number of pairs are merged, halving the total corpus length $n = \frac{N\bar{L}}{2}$ in each step. The work performed can be described by the recurrence relation: $$T(n) = T(\frac{n}{2}) + f(n)$$ where $f(n)$ is the cost of a single merge pass. For a naive array-based merge, $f(n) = O(N \cdot (\frac{n}{N})^2) = O(\frac{n^2}{N})$. By the Master Theorem, with $a=1, b=2, d=2$, we have $d > \log_b a$, so the total complexity is dominated by the root node's work: $$C_{2, \text{naive}} = O(\frac{n^2}{N}) = O(N\bar{L}^2)$$ If the merge operation were optimized to be $O(1)$ per merged pair (e.g., using a linked list representation or mark new token only at range endpoint to be replaced), then $f(n) = O(n)$. In this case, $d=1 > \log_2 1$, so the complexity would be: $$C_{2, \text{opt}} = O(n) = O(N\bar{L})$$

The final complexity depends on the termination criterion. We employ a fixed number of iterations, $K$, to allow for flexible control over the final vocabulary size and compression ratio across experiments. In typical settings, $K$ (e.g., $10^3-10^4$) is much larger than the average sequence length $\bar{L}$ (e.g., $10^2$), making the $C_1$ term the dominant factor. Alternatively, if one were to terminate based on a minimum frequency threshold, the process would resemble the recursive scenario, making $C_2$ the more relevant complexity model. Given our fixed-iteration approach, the overall BPE training complexity is: $$C_{\text{BPE}} = O(K \cdot N \cdot \bar{L}) = \boldsymbol{O(K \cdot E_S)}$$

**Overall Training Complexity.** For serialization with `Feuler` and other methods with linear time complexity, the total complexity is dominated by the BPE training, resulting in $\boldsymbol{O(K \cdot E_S)}$. If a minimum-frequency stopping criterion were used, the complexity would instead be dominated by the $C_{2, \text{naive}}$ term, becoming $O(E_S + N\bar{L}^2) = O(N\bar{L}^2) = O(\frac{E_S^2}{N})$. For serialization with `FCPP`, the total complexity is dominated by the most expensive component, yielding $\boldsymbol{O(\frac{{V_S}^3}{N^2}+ K \cdot E_S)=O(\frac{{V_S}^3}{N^2})}$, which in practice for large graphs, implies that the serialization becomes the bottleneck.

Experimental Setup {#appendix:experimental_setup}
==================

This section provides the comprehensive configuration details required to fully reproduce all experiments presented in this paper.

Datasets {#appendix:datasets}
--------

Our evaluation is conducted on a diverse suite of benchmark datasets. The `aqsol` and `zinc` datasets are sourced from the benchmark collection proposed by [@g-bencnhmark], while the remaining datasets are standard benchmarks obtained from libraries such as PyTorch Geometric (PyG) and DGL. Table `\ref{tab:appendix_dataset_details_full}`{=latex} below offers a comprehensive summary of their statistical properties and raw feature dimensions.

```{=latex}
\resizebox{\textwidth}{!}{%
\begin{tabular}{l c c c c c c c}
\toprule
\textbf{Dataset} & \textbf{\# Graphs} & \textbf{Task} & \textbf{\# Targets} & \textbf{Nodes (Mean $\pm$ Std)} & \textbf{Edges (Mean $\pm$ Std)} & \textbf{Node Raw Dim} & \textbf{Edge Raw Dim} \\
\midrule
\texttt{aqsol} & 9,823 & Regression & 1 & $33.7 \pm 24.5$ & $67.9 \pm 50.0$ & 65 & 5 \\
\texttt{coildel} & 3,900 & Classification & 100 & $21.5 \pm 13.2$ & $108.5 \pm 77.0$ & 2 & 1 \\
\texttt{colors3} & 10,500 & Classification & 11 & $61.3 \pm 60.5$ & $182.1 \pm 187.3$ & 4 & --- \\
\texttt{dblp} & 19,456 & Classification & 2 & $10.5 \pm 8.5$ & $39.3 \pm 39.3$ & 1 & 1 \\
\texttt{dd} & 1,178 & Classification & 2 & $284.3 \pm 272.1$ & $1431.3 \pm 1388.4$ & 1 & --- \\
\texttt{molhiv} & 41,127 & Classification & 2 & $25.5 \pm 12.1$ & $54.9 \pm 26.4$ & 9 & 3 \\
\texttt{muta} & 4,337 & Classification & 2 & $30.3 \pm 20.1$ & $61.5 \pm 33.6$ & 1 & 1 \\
\texttt{p-func} & 15,535 & MT-Classification & 10 & $150.9 \pm 84.2$ & $307.3 \pm 172.2$ & 9 & 3 \\
\texttt{p-struct} & 15,535 & MT-Regression & 11 & $150.9 \pm 84.2$ & $307.3 \pm 172.2$ & 9 & 3 \\
\texttt{proteins} & 1,113 & Classification & 2 & $39.1 \pm 45.8$ & $145.6 \pm 169.3$ & 2 & --- \\
\texttt{qm9} & 130,831 & Regression & 16 & $18.0 \pm 2.9$ & $37.3 \pm 6.3$ & 16 & 4 \\
\texttt{synthetic} & 300 & Classification & 2 & $100.0 \pm 0.0$ & $392.0 \pm 0.0$ & 2 & --- \\
\texttt{twitter} & 144,033 & Classification & 2 & $4.0 \pm 1.7$ & $10.0 \pm 9.1$ & 1 & 1 \\
\texttt{zinc} & 12,000 & Regression & 1 & $43.8 \pm 8.5$ & $91.1 \pm 18.1$ & 28 & 4 \\
\bottomrule
\end{tabular}%
}
```
Model Architectures {#appendix:architectures}
-------------------

We employed two Transformer backbones in our experiments. The first, which we denote as **GT+BERT**, is based on a BERT-Small architecture. The second, **GT+GTE**, utilizes a more recent and powerful GTE-Base model. The precise architectural parameters for each are detailed side-by-side in Table `\ref{tab:model_architectures}`{=latex}. This format facilitates direct comparison and is designed to accommodate additional model configurations in future scaling law studies. Note that the vocabulary size is determined dynamically based on the dataset and tokenization strategy, causing the total parameter count to vary slightly across experiments, the results reported here include the embedding layer size corresponding to the vocabulary encoded using BPE on the ZINC dataset.

::: {#tab:model_architectures}
  **Parameter**                               **GT+BERT**      **GT+GTE**
  ----------------------------------------- --------------- ----------------
  *Model Configuration*                       BERT-Small        GTE-Base
  Number of Hidden Layers ($N$)                    4               12
  Hidden Size ($d_{\text{model}}$)                512             768
  Number of Attention Heads ($h$)                  4               12
  FFN Intermediate Size ($d_{\text{ff}}$)        2048             3072
  Activation Function                            GELU             GELU
  Dropout Rate (Attention & Hidden)               0.1             0.1
  Position Embedding                          Learned Abs         RoPE
  Max Sequence Length                             768             8192
  Layer Normalization $\epsilon$                 1e-12           1e-12
  Total Parameters (Approx.)                 $\approx$ 15M   $\approx$ 115M

  : Architectural parameters of the Transformer backbones used in our experiments.
:::

Hyperparameters {#appendix:hyperparameters}
---------------

Our hyperparameter tuning strategy was designed for systematic evaluation and reproducibility rather than exhaustive per-dataset optimization. We established a robust base configuration, detailed in our publicly available configuration files, which was applied to all experiments by default. For certain datasets, particularly those with very large or very small graphs, targeted adjustments were made to key parameters like batch size and learning rate to ensure stable training.

Table `\ref{tab:hyperparams_all}`{=latex} provides a comprehensive overview of these settings. The \"Default Configuration\" column represents the base values applied to all datasets unless otherwise specified. The subsequent columns detail the specific overrides for datasets or groups of datasets that required adjustments. This unified view clearly illustrates both our general training strategy and the specific exceptions made.

```{=latex}
\begin{table*}[ht!]


\caption{Unified hyperparameter specification. This table details the default values used for training and the specific conditions under which they were overridden.}
\label{tab:hyperparams_all}
\begin{tabular}{l l p{7cm}}
\toprule
\textbf{Parameter} & \textbf{Default Value} & \textbf{Exceptions \& Conditions} \\
\midrule
\textit{\textbf{General Settings}} \\ \addlinespace
Optimizer & AdamW & --- \\
Epochs & 200 & --- \\
Weight Decay & 0.1 & --- \\
Random Seed & 42 & --- \\
Batch Size & 32 & 16 on (\texttt{DBLP}, \texttt{Peptides\_*}, and \texttt{COIL-DEL}) \\
% \addlinespace
\midrule
\textit{\textbf{Serialization \& Tokenizer Settings}} \\ \addlinespace
Default Method & \texttt{Feuler} & --- \\
BPE Enabled & True & --- \\
BPE Merges ($K$) & 2000 & --- \\
% \addlinespace
\midrule
\textit{\textbf{Finetuning Settings}} \\ \addlinespace
Learning Rate & 1e-5 & 5e-5 on (\texttt{DBLP}, \texttt{molhiv}, \texttt{twitter}) \\
LR Warmup Ratio & 0.025 & --- \\
Max Gradient Norm & 0.5 & --- \\
Early Stopping & 20 epochs & --- \\
% \addlinespace
\midrule
\textit{\textbf{Pre-training (MLM) Settings}} \\ \addlinespace
Learning Rate & 1e-4 & 5e-5 on (\texttt{muta}, \texttt{molhiv}, \texttt{qm9}, \texttt{twitter}, \texttt{dblp}) For GTE, to prevent training instability. \\
LR Warmup Ratio & 0.12 & --- \\
Max Gradient Norm & 2.0 & --- \\
Mask Probability & 0.09 & --- \\
\bottomrule
\end{tabular}
\end{table*}
```
Computational Environment {#appendix:environment}
-------------------------

#### Software.

Our implementation relies on a shared software stack to ensure consistency. The key packages and their versions are listed below. For a complete and exhaustive list of dependencies, please refer to the environment files in our public code repository.

-   **PyTorch**: `2.1.2`

-   **CUDA Toolkit**: `12.1`

-   **DGL**: `2.4.0`

-   **PyTorch Geometric (PyG)**: `2.4.0` (with corresponding libraries `pyg-lib`, `torch-scatter`, etc.)

#### Hardware.

All experiments were conducted on a heterogeneous cluster of NVIDIA GPUs, including consumer grade (GeForce RTX 2080, 3090, 4090) and data center grade (A800, H800) hardware. We verified that our results are stable and consistent across these different architectures.

Hyperparameter Settings for Additional Baselines {#appendix:baseline_hyperparams}
------------------------------------------------

To ensure a fair comparison in the extended benchmarks (Appendix `\ref{appendix:performance}`{=latex}), we adopted the official configurations for all baseline models whenever available:

-   **Graph Transformers (GraphGT, Graphormer):** We utilized the standard hyperparameters provided in their official implementations (e.g., 6--12 layers, hidden dimension of 80--768 depending on dataset size) to match the parameter budget of our models.

-   **Graph Foundation Models (GraphGPT, LLAGA):** For the adaptation experiments, we used the open-sourced checkpoints (e.g., LLaMA-2-7b backbone) and fine-tuned the projection layers or LoRA adapters following the protocols specified in their original papers.

-   **Specialized Architectures (FragNet, Graph-ViT-MLPMixer):** We reproduced these results using the optimal settings reported in their respective publications.

Experimental Results {#appendix:more-exp}
====================

In this section, we present a comprehensive suite of additional experiments to further validate the robustness, interpretability, and comparative performance of our proposed framework. The content is organized as follows:

-   **Qualitative Analysis (Appendix `\ref{appendix:qualitative}`{=latex}):** We offer a detailed interpretation of the learned BPE vocabulary to elucidate the structural semantics captured by our tokenizer, and provide a proof-of-concept for autoregressive graph generation.

-   **Performance Benchmarks (Appendix `\ref{appendix:performance}`{=latex}):** We report complete results for all datasets and expand our comparative evaluation to include recent state-of-the-art Graph Transformers (e.g., GraphGT, Graphormer) and LLM-based Graph Foundation Models (e.g., GraphGPT, LLAGA).

-   **Ablation Studies (Appendix `\ref{appendix:ablation}`{=latex}):** We conduct systematic investigations into core design choices, ranging from architectural components (e.g., Transformer backbones, serialization methods) to specific hyperparameters (e.g., vocabulary size, statistical guidance units), and empirically validate the impact of serialization strategies on counting tasks.

-   **Scalability Analysis (Appendix `\ref{app:scalability_ogb}`{=latex}):** We report runtime measurements on large-scale OGB datasets to empirically validate the scalability of our pipeline.

Qualitative Analysis and Interpretability {#appendix:qualitative}
-----------------------------------------

#### Visualizing the Learned Vocabulary.

To understand the structural patterns captured by our tokenizer, we visualize the vocabulary constructed by BPE's merging process on the ZINC dataset. Figure `\ref{fig:bpe_vocab_example}`{=latex} illustrates how BPE iteratively merges simple, frequent substructures into progressively more complex and chemically meaningful tokens.

Each row in the figure demonstrates such a merging sequence. For instance, the top row shows that a basic structure representing a sulfonyl group (`O=S=O` at the 731st merge iteration) is established as a token. In subsequent merge steps, BPE combines this token with adjacent atoms to form a more complex token (`C[SH](=O)=O`) and then an even larger one (`CC[SH](=O)=O`). This progression directly reflects BPE's mechanism: greedily merging the most frequent adjacent symbol pairs to build a vocabulary. Similarly, the bottom row shows the process starting from a classic benzene ring, which is then merged with neighboring carbon chains to form tokens corresponding to toluene and ethylbenzene.

#### Interpretation of Learned Vocabulary.

To understand the structural semantics captured by our tokenizer, we analyzed the size distribution of the learned vocabulary on the ZINC dataset with a merge count of $K=2000$. As detailed in Table `\ref{tab:app_vocab}`{=latex}, the vocabulary is not dominated by small atomic tokens. Instead, it exhibits a distinct preference for medium-scale substructures. Specifically, atomic tokens (0-1 nodes) comprise only 7.1% of the vocabulary, while the distribution peaks in the **4-6 node range** (41.5%). In the context of molecular graphs, this size range corresponds to typical functional groups and ring structures. Combined with the 7-9 node range, over 60% of the vocabulary represents complex, multi-node substructures. This confirms that the BPE merge process, guided by our frequency statistics, successfully identifies an optimal level of abstraction---producing tokens large enough to capture meaningful local topology (e.g., cycles, branches) yet frequent enough to ensure generalization across the dataset.

```{=latex}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lccccc}
\toprule
\textbf{Token Size} & Atomic ($0 \sim 1$) & Small ($2 \sim 3$) & \textbf{Medium ($4 \sim 6$)} & Large ($7 \sim 9$) & Huge ($10+$) \\
\midrule
\textbf{Proportion} & 7.1\% & 28.5\% & \textbf{41.5\%} & 20.4\% & 2.5\% \\
\bottomrule
\end{tabular}%
}
```
#### Proof-of-Concept: Auto-regressive Graph Generation. {#app:autoregressive_generation}

A core advantage of our framework is that it converts graph data into a format formally identical to natural language, thereby theoretically enabling the use of standard autoregressive (decoder-only) models for graph generation. To empirically validate this capability, we conducted a proof-of-concept experiment using the MNIST dataset, treating images as readily visualizable grid graphs.

**Setup.** We converted each $28 \times 28$ MNIST image into a regular grid graph, where pixels correspond to nodes and are connected to their immediate spatial neighbors (up, down, left, right). These grid graphs were then tokenized using our proposed framework (Frequency-Guided Serialization + BPE) to produce discrete token sequences. We trained a standard, unmodified decoder-only Transformer (GPT-style architecture) on these sequences using the conventional next-token prediction objective ($\mathcal{L}_{\text{CLM}}$).

**Results.** As illustrated in Figure `\ref{fig:mnist_generation}`{=latex}, the model successfully learns to generate coherent graph structures token-by-token. The reconstructed graphs clearly depict recognizable digits, demonstrating that the model effectively captures the global topology and local connectivity patterns solely from the serialized token sequence. This result confirms that our framework effectively bridges the gap between graph generation and standard sequence modeling, opening the door for applying large-scale generative pre-training (e.g., GPT) to graph domains such as molecular design and material discovery.

![Visualization of autoregressive graph generation on MNIST. The model generates the graph structure token-by-token (left to right). The resulting sequences are decoded back into grid graphs, forming coherent digit images. This demonstrates the framework's capability to support generative tasks using standard decoder-only Transformers.](fig/grid_samples.png){#fig:mnist_generation width="90%"}

Performance Results {#appendix:performance}
-------------------

#### Classification and Regression Benchmarks.

We present the complete results for all datasets evaluated in our main experiments. The experimental setup and reporting format are identical to those described in the main text. The results for the remaining datasets (which were not included in the main text due to space constraints) are detailed in Table `\ref{tab:appendix_more_cls}`{=latex}.

```{=latex}
\begin{table*}[t]

\caption{Additional classification results on datasets not shown in the main table. Best scores are in \textbf{bold}, second-best are \underline{underlined}. Std in parentheses.}
\label{tab:appendix_more_cls}
\resizebox{0.6\textwidth}{!}{
\begin{tabular}{lcccc}
\toprule
\textbf{Model} &
\shortstack{\texttt{colors3}\\ acc$\uparrow$} &
\shortstack{\texttt{twitter}\\ acc$\uparrow$} &
\shortstack{\texttt{proteins}\\ acc$\uparrow$} &
\shortstack{\texttt{dd}\\ acc$\uparrow$} \\
\midrule
GCN        & 66.0 {\scriptsize (2.1)}  & 52.7 {\scriptsize (0.3)}  & 73.2 {\scriptsize (1.6)}  & 62.7 {\scriptsize (1.1)} \\
GIN        & 69.3 {\scriptsize (1.6)}  & 55.6 {\scriptsize (0.2)}  & 64.3 {\scriptsize (4.0)}  & 65.2 {\scriptsize (0.9)} \\
GAT        & 76.6 {\scriptsize (1.4)}  & 53.6 {\scriptsize (0.3)}  & 70.5 {\scriptsize (0.9)}  & 58.5 {\scriptsize (1.0)} \\
GatedGCN   & 77.1 {\scriptsize (1.2)}  & 59.8 {\scriptsize (0.4)}  & 71.1 {\scriptsize (1.2)}  & 77.2 {\scriptsize (0.7)} \\
\midrule
GraphGPS   & 77.4 {\scriptsize (1.8)}  & 53.0 {\scriptsize (0.5)}  & 67.9 {\scriptsize (0.6)}  & 76.3 {\scriptsize (0.5)} \\
Exphormer  & 73.9 {\scriptsize (1.9)}  & 55.1 {\scriptsize (0.3)}  & 74.1 {\scriptsize (0.9)}  & 74.6 {\scriptsize (0.6)} \\
GraphMamba & 93.6 {\scriptsize (0.9)}  & 56.4 {\scriptsize (0.7)}  & 70.5 {\scriptsize (1.1)}  & 76.8 {\scriptsize (0.4)} \\
GCN+       & 85.8 {\scriptsize (2.3)}  & 61.3 {\scriptsize (0.2)}  & \underline{77.1} {\scriptsize (0.9)}  & \underline{79.1} {\scriptsize (0.5)} \\
\midrule
\textbf{GT+BERT} & \underline{96.2} {\scriptsize (1.7)} & \underline{62.3} {\scriptsize (0.3)} & 75.0 {\scriptsize (0.7)} & 72.0 {\scriptsize (0.9)} \\
\textbf{GT+GTE}  & \textbf{100.0} {\scriptsize (0.0)} & \textbf{65.7} {\scriptsize (0.2)} & \textbf{79.1} {\scriptsize (0.6)} & \textbf{79.6} {\scriptsize (0.6)} \\
\bottomrule
\end{tabular}
}
\end{table*}
```
#### Comparison with State-of-the-Art Architectures.

To strictly validate the effectiveness of our tokenizer, we significantly expanded our comparative evaluation to include a broader range of state-of-the-art architectures. We conducted a comprehensive evaluation against recent Graph Transformers (**GraphGT**, **Graphormer**), serialization-based or hybrid methods (**FragNet**, **Graph-ViT-MLPMixer**), and classic baselines (**HAN**, **ChebNet**). As detailed in Table `\ref{tab:app_sota}`{=latex}, our method consistently outperforms strong baselines on graph classification (MOLHIV, COIL-DEL) and long-range modeling tasks (Peptide-func). Notably, we surpass specialized architectures like FragNet and Graph-ViT-MLPMixer without requiring their domain-specific inductive biases. On the ZINC regression task, our performance is comparable to standard Graph Transformers (e.g., Graphormer), while classic architectures (ChebNet, HAN) lag significantly behind. This confirms that a general-purpose tokenizer can unlock state-of-the-art performance across diverse graph learning benchmarks using a standard Transformer architecture.

```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccc}
\toprule
\textbf{Model} & \textbf{ZINC} (MAE $\downarrow$) & \textbf{MOLHIV} (AUC $\uparrow$) & \textbf{Peptide-func} (AP $\uparrow$) & \textbf{COIL-DEL} (ACC $\uparrow$)* \\
\midrule
GraphGT & 0.226 {\scriptsize (0.014)} & -- & 63.26 {\scriptsize (1.26)} & 86.1 {\scriptsize (0.8)} \\
Graphormer & 0.132 {\scriptsize (0.006)} & 80.51 {\scriptsize (0.53)} & -- & 88.4 {\scriptsize (0.3)} \\
FragNet & {0.078} {\scriptsize (0.005)} & 81.32 {\scriptsize (0.86)} & 66.8 {\scriptsize (0.5)} & 83.4 {\scriptsize (0.6)} \\
Graph-ViT-MLPMixer & \textbf{0.073} {\scriptsize (0.001)} & 79.97 {\scriptsize (1.02)} & 69.7 {\scriptsize (0.8)} & 89.1 {\scriptsize (0.4)} \\
\midrule
HAN* & 0.348 {\scriptsize (0.042)} & 72.3 {\scriptsize (0.6)} & 52.2 {\scriptsize (1.1)} & 74.5 {\scriptsize (0.6)} \\
ChebNet* & 0.423 {\scriptsize (0.013)} & 70.1 {\scriptsize (0.7)} & 51.3 {\scriptsize (0.8)} & 71.4 {\scriptsize (0.8)} \\
\midrule
\textbf{Our Method} & {0.131} {\scriptsize (0.007)} & \textbf{87.4} {\scriptsize (0.2)} & \textbf{73.1} {\scriptsize (0.2)} & \textbf{89.6} {\scriptsize (0.2)} \\
\bottomrule
\end{tabular}
}
```
#### Comparison with LLM-based Graph Models.

We further benchmarked our method against recent \"Graph Foundation Models\" (GFMs) such as **GraphGPT** and **LLAGA**. Since these models are fundamentally designed for text-attributed graphs (TAGs), we evaluated them via adaptation (textualizing graph structures), treating discrete node labels as textual tokens to satisfy their input constraints. The results in Table `\ref{tab:app_graphgpt}`{=latex} reveal a critical limitation of the textualization paradigm. These models struggle significantly with pure structural tasks, performing near random guessing on datasets like COIL-DEL. This highlights that while LLMs excel when rich semantic text is available, they lack the native capability to reason about complex graph topology solely from linearized text strings. In contrast, our tokenizer specifically captures structural patterns, enabling superior performance on structure-centric tasks.

::: {#tab:app_graphgpt}
  **Model**               **ZINC** (MAE $\downarrow$)            **COIL-DEL** (ACC $\uparrow$)
  ---------------- ----------------------------------------- --------------------------------------
  GraphGPT            0.373 `\scriptsize `{=latex}(0.021)       5.6 `\scriptsize `{=latex}(0.9)
  LLAGA               0.317 `\scriptsize `{=latex}(0.016)       12.5 `\scriptsize `{=latex}(1.1)
  **Our Method**    **0.131** `\scriptsize `{=latex}(0.007)   **89.6** `\scriptsize `{=latex}(0.2)

  : Performance comparison with LLM-based models via textualized graph adaptation. These models exhibit significant performance degradation on pure structural tasks compared to our native tokenization.
:::

#### Sequence length and efficiency.

To demonstrate the generalizability of our model's efficiency, we present visualizations of token efficiency and training throughput for the remaining datasets, analogous to Figure `\ref{fig:efficiency}`{=latex} in the main paper. The results, shown in Figure `\ref{fig:efficiency_qm9}`{=latex}--`\ref{fig:efficiency_aqsol}`{=latex}, confirm that our approach consistently maintains superior efficiency across a diverse range of datasets.

![Average sequence length comparison.](exp-raw/exp1_speed/token_length/qm9_token_length_comparison.png){#fig:eff_len_qm9 width="\\textwidth"}

```{=latex}
\hfill
```
![Average training time per epoch.](exp-raw/exp1_speed/train_time/QM9_training_efficiency.png){#fig:eff_train_qm9 width="\\textwidth"}

![Average sequence length comparison.](exp-raw/exp1_speed/token_length/molhiv_token_length_comparison.png){#fig:eff_len_molhiv width="\\textwidth"}

```{=latex}
\hfill
```
![Average training time per epoch.](exp-raw/exp1_speed/train_time/MOLHIV_training_efficiency.png){#fig:eff_train_molhiv width="\\textwidth"}

![Average sequence length comparison.](exp-raw/exp1_speed/token_length/dd_token_length_comparison.png){#fig:eff_len_dd width="\\textwidth"}

```{=latex}
\hfill
```
![Average training time per epoch.](exp-raw/exp1_speed/train_time/DD_training_efficiency.png){#fig:eff_train_dd width="\\textwidth"}

![Average sequence length comparison.](exp-raw/exp1_speed/token_length/colors3_token_length_comparison.png){#fig:eff_len_colors3 width="\\textwidth"}

```{=latex}
\hfill
```
![Average training time per epoch.](exp-raw/exp1_speed/train_time/COLORS3_training_efficiency.png){#fig:eff_train_colors3 width="\\textwidth"}

![Average sequence length comparison.](exp-raw/exp1_speed/token_length/coildel_token_length_comparison.png){#fig:eff_len_coildel width="\\textwidth"}

```{=latex}
\hfill
```
![Average training time per epoch.](exp-raw/exp1_speed/train_time/COILDEL_training_efficiency.png){#fig:eff_train_coildel width="\\textwidth"}

![Average sequence length comparison.](exp-raw/exp1_speed/token_length/aqsol_token_length_comparison.png){#fig:eff_len_aqsol width="\\textwidth"}

```{=latex}
\hfill
```
![Average training time per epoch.](exp-raw/exp1_speed/train_time/AQSOL_training_efficiency.png){#fig:eff_train_aqsol width="\\textwidth"}

Ablation Studies {#appendix:ablation}
----------------

To further validate our design choices, this section provides detailed ablation studies that complement the analysis in the main paper. We first present the complete ablation results for the GT+GTE model in Table `\ref{tab:appendix-ablation-gte}`{=latex} and the GT+BERT model in Table `\ref{tab:appendix-ablation-bert1}`{=latex} and Table `\ref{tab:appendix-ablation-bert2}`{=latex}. Beyond these general performance trends, we conducted systematic studies to validate our specific design choices regarding vocabulary size, frequency guidance units, and serialization strategies for counting tasks.

#### Impact of Vocabulary Size ($K$).

Our experiments identify $K=2000$ as the optimal saturation point, effectively balancing sequence compression with model learnability. As shown in Table `\ref{tab:app_bpe_merges}`{=latex}, increasing $K$ initially yields significant gains in compression (1.00x to 10.84x). However, this benefit plateaus around $K=2000$. Extending to $K=5000$ offers diminishing returns in compression (11.34x) but fails to improve MAE (0.132). This stagnation may be driven by the \"long-tail\" effect: tokens added beyond $K=2000$ appear too infrequently to receive sufficient gradient updates (often appearing fewer than 10 times across the entire dataset), thereby increasing model complexity without contributing to generalization.

```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccccccc}
\toprule
\textbf{Method} &
\shortstack{\texttt{mutag}\\ acc$\uparrow$} &
\shortstack{\texttt{colors3}\\ acc$\uparrow$} &
\shortstack{\texttt{dblp}\\ acc$\uparrow$} &
\shortstack{\texttt{aqsol}\\ mae$\downarrow$} &
\shortstack{\texttt{p-struct}\\ avg~mae$\downarrow$} \\
\cmidrule(lr){2-3}\cmidrule(lr){4-5}\cmidrule(lr){6-7}\cmidrule(lr){8-9}\cmidrule(lr){10-11}
 & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o \\
\midrule
BFS      & 79.2 {\scriptsize (0.9)} & 78.8 {\scriptsize (0.6)} & 73.4 {\scriptsize (0.8)} & 74.4 {\scriptsize (0.3)} & 91.8 {\scriptsize (0.19)} & 91.7 {\scriptsize (0.22)} & 0.701 {\scriptsize (0.017)} & 0.702 {\scriptsize (0.008)} & \underline{0.2495} {\scriptsize (0.0018)} & 0.2494 {\scriptsize (0.0005)} \\
DFS      & 81.9 {\scriptsize (0.7)} & 79.4 {\scriptsize (1.5)} & {99.9} {\scriptsize (0.1)} & \textbf{100.0} {\scriptsize (0.0)} & 92.8 {\scriptsize (0.09)} & 92.6 {\scriptsize (0.12)} & 0.684 {\scriptsize (0.013)} & 0.676 {\scriptsize (0.011)} & \textbf{0.2421} {\scriptsize (0.0026)} & 0.2468 {\scriptsize (0.0040)} \\
TOPO     & 80.1 {\scriptsize (0.8)} & 78.4 {\scriptsize (1.4)} & {76.6} {\scriptsize (1.1)} & \underline{98.3}{\scriptsize (1.3)} & {93.0} {\scriptsize (0.09)} & 92.5 {\scriptsize (0.13)} & 0.814 {\scriptsize (0.004)} & 0.737 {\scriptsize (0.003)} & 0.2626 {\scriptsize (0.0022)} & 0.2556 {\scriptsize (0.0002)} \\
Eulerian & 86.1 {\scriptsize (0.9)} & 84.3 {\scriptsize (1.3)} & 38.9 {\scriptsize (1.0)} & 44.7 {\scriptsize (2.2)} & \underline{93.1} {\scriptsize (0.09)} & 91.7 {\scriptsize (0.17)} & \textbf{0.609} {\scriptsize (0.001)} & 0.627 {\scriptsize (0.016)} & 0.2514 {\scriptsize (0.0027)} & 0.2499 {\scriptsize (0.0035)} \\
Feuler   & \textbf{90.1} {\scriptsize (0.6)} & 86.7 {\scriptsize (0.9)} & 41.0 {\scriptsize (1.2)} & 45.3 {\scriptsize (0.8)} & \textbf{93.6} {\scriptsize (0.08)} & 88.9 {\scriptsize (0.14)} & \underline{0.621} {\scriptsize (0.007)} & 0.623 {\scriptsize (0.017)} & 0.2510 {\scriptsize (0.0001)} & 0.2548 {\scriptsize (0.0004)} \\
CPP      & \underline{87.7} {\scriptsize (0.8)} & 85.4 {\scriptsize (0.9)} & 36.9 {\scriptsize (2.4)} & 45.9 {\scriptsize (1.2)} & 90.9 {\scriptsize (0.14)} & 92.1 {\scriptsize (0.21)} & 0.654 {\scriptsize (0.016)} & 0.643 {\scriptsize (0.003)} & 0.2535 {\scriptsize (0.0021)} & 0.2482 {\scriptsize (0.0005)} \\
FCPP     & \underline{87.7} {\scriptsize (0.4)} & 86.0 {\scriptsize (0.7)} & 37.5 {\scriptsize (1.4)} & 45.6 {\scriptsize (0.6)} & 92.4 {\scriptsize (0.08)} & 92.3 {\scriptsize (0.16)} & 0.654 {\scriptsize (0.017)} & 0.633 {\scriptsize (0.010)} & 0.2537 {\scriptsize (0.0039)} & 0.2481 {\scriptsize (0.0006)} \\
SMILES   & — & — & — & — & — & — & 0.741 {\scriptsize (0.010)} & \textbf{0.673} {\scriptsize (0.010)} & — & — \\
\bottomrule
\end{tabular}
}
```
```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccccccc}
\toprule
\textbf{Method} &
\shortstack{\texttt{molhiv}\\ auc$\uparrow$} &
\shortstack{\texttt{coildel}\\ acc$\uparrow$} &
\shortstack{\texttt{p-func}\\ ap$\uparrow$} &
\shortstack{\texttt{zinc}\\ mae$\downarrow$} &
\shortstack{\texttt{qm9}\\ mae$\downarrow$} \\
\cmidrule(lr){2-3}\cmidrule(lr){4-5}\cmidrule(lr){6-7}\cmidrule(lr){8-9}\cmidrule(lr){10-11}
 & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o \\
\midrule
BFS      & 76.5 {\scriptsize (0.6)} & 76.1 {\scriptsize (0.9)} & \underline{81.2} {\scriptsize (0.9)} & 80.1 {\scriptsize (1.3)} & 66.9 {\scriptsize (0.9)} & 63.2 {\scriptsize (0.9)} & 0.612 {\scriptsize (0.009)} & 0.961 {\scriptsize (0.012)} & 0.227 {\scriptsize (0.011)} & 0.324 {\scriptsize (0.013)} \\
DFS      & 74.6 {\scriptsize (0.4)} & 79.5 {\scriptsize (0.5)} & 80.5 {\scriptsize (0.4)} & 79.8 {\scriptsize (0.8)} & \textbf{71.3} {\scriptsize (0.3)} & 67.5 {\scriptsize (1.0)} & 0.537 {\scriptsize (0.008)} & 0.976 {\scriptsize (0.009)} & 0.275 {\scriptsize (0.009)} & 0.281 {\scriptsize (0.014)} \\
TOPO     & 72.2 {\scriptsize (0.6)} & 77.9 {\scriptsize (0.8)} & \textbf{82.6} {\scriptsize (0.8)} & 81.4 {\scriptsize (1.2)} & 65.3 {\scriptsize (0.5)} & 54.6 {\scriptsize (1.0)} & 0.711 {\scriptsize (0.011)} & 1.034 {\scriptsize (0.012)} & 0.277 {\scriptsize (0.010)} & 0.266 {\scriptsize (0.011)} \\
Eulerian & \underline{83.7} {\scriptsize (0.7)} & 83.9 {\scriptsize (1.0)} & 72.1 {\scriptsize (0.6)} & 69.9 {\scriptsize (2.9)} & 67.3 {\scriptsize (0.9)} & 64.1 {\scriptsize (1.6)} & 0.304 {\scriptsize (0.008)} & 0.396 {\scriptsize (0.012)} & \textbf{0.104} {\scriptsize (0.006)} & 0.127 {\scriptsize (0.007)} \\
Feuler   & 82.6 {\scriptsize (0.4)} & 81.8 {\scriptsize (0.5)} & 74.1 {\scriptsize (0.3)} & 76.1 {\scriptsize (0.9)} & \underline{68.5} {\scriptsize (1.0)} & 65.4 {\scriptsize (1.1)} & \textbf{0.241} {\scriptsize (0.006)} & 0.432 {\scriptsize (0.011)} & 0.122 {\scriptsize (0.004)} & 0.128 {\scriptsize (0.005)} \\
CPP      & \textbf{85.0} {\scriptsize (0.3)} & 82.8 {\scriptsize (0.5)} & 72.5 {\scriptsize (0.5)} & \textbf{83.3} {\scriptsize (0.7)} & 65.6 {\scriptsize (0.6)} & 59.5 {\scriptsize (1.5)} & 0.319 {\scriptsize (0.005)} & 0.464 {\scriptsize (0.008)} & 0.115 {\scriptsize (0.004)} & 0.131 {\scriptsize (0.006)} \\
FCPP     & 83.2 {\scriptsize (0.3)} & 82.2 {\scriptsize (0.6)} & 68.9 {\scriptsize (0.3)} & 78.3 {\scriptsize (1.0)} & 65.5 {\scriptsize (1.8)} & 61.0 {\scriptsize (1.2)} & 0.316 {\scriptsize (0.005)} & 0.467 {\scriptsize (0.007)} & \underline{0.107} {\scriptsize (0.005)} & 0.132 {\scriptsize (0.007)} \\
SMILES   & — & — & — & — & — & — & \underline{0.273} {\scriptsize (0.014)} & 0.320 {\scriptsize (0.007)} & 0.117 {\scriptsize (0.014)} & 0.120 {\scriptsize (0.016)} \\
\bottomrule
\end{tabular}
}
```
```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccccccc}
\toprule
\textbf{Method} &
\shortstack{\texttt{mutag}\\ acc$\uparrow$} &
\shortstack{\texttt{colors3}\\ acc$\uparrow$} &
\shortstack{\texttt{dblp}\\ acc$\uparrow$} &
\shortstack{\texttt{aqsol}\\ mae$\downarrow$} &
\shortstack{\texttt{p-struct}\\ avg~mae$\downarrow$} \\
\cmidrule(lr){2-3}\cmidrule(lr){4-5}\cmidrule(lr){6-7}\cmidrule(lr){8-9}\cmidrule(lr){10-11}
 & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o & \scriptsize w & \scriptsize w/o \\
\midrule
BFS      & 76.1 {\scriptsize (0.6)} & 76.4 {\scriptsize (0.7)} & \underline{72.7} {\scriptsize (0.9)} & 67.3 {\scriptsize (4.1)} & 91.9 {\scriptsize (0.17)} & 91.3 {\scriptsize (0.26)} & 0.851 {\scriptsize (0.002)} & 0.844 {\scriptsize (0.002)} & \underline{0.2477} {\scriptsize (0.0014)} & 0.2620 {\scriptsize (0.0021)} \\
DFS      & 79.6 {\scriptsize (0.5)} & 77.3 {\scriptsize (0.6)} & \textbf{98.3} {\scriptsize (0.1)} & 87.5 {\scriptsize (4.4)} & 92.6 {\scriptsize (0.08)} & 92.4 {\scriptsize (0.13)} & 0.831 {\scriptsize (0.002)} & 0.852 {\scriptsize (0.002)} & 0.2526 {\scriptsize (0.0036)} & 0.2550 {\scriptsize (0.0012)} \\
TOPO     & 74.9 {\scriptsize (0.5)} & 71.2 {\scriptsize (0.7)} & 60.4 {\scriptsize (2.3)} & 66.6 {\scriptsize (2.2)} & 92.7 {\scriptsize (0.08)} & 92.7 {\scriptsize (0.11)} & 0.810 {\scriptsize (0.002)} & 0.840 {\scriptsize (0.002)} & 0.2578 {\scriptsize (0.0011)} & 0.2648 {\scriptsize (0.0026)} \\
Eulerian & 85.5 {\scriptsize (0.8)} & 82.0 {\scriptsize (1.1)} & 37.8 {\scriptsize (1.4)} & 40.7 {\scriptsize (0.6)} & \underline{93.1} {\scriptsize (0.04)} & 91.8 {\scriptsize (0.14)} & \textbf{0.648} {\scriptsize (0.002)} & 0.677 {\scriptsize (0.003)} & 0.2522 {\scriptsize (0.0012)} & 0.2700 {\scriptsize (0.0049)} \\
Feuler   & \textbf{87.5} {\scriptsize (0.4)} & 84.1 {\scriptsize (0.8)} & 38.5 {\scriptsize (1.1)} & 40.3 {\scriptsize (1.0)} & \textbf{93.2} {\scriptsize (0.06)} & 88.5 {\scriptsize (0.11)} & \textbf{0.648} {\scriptsize (0.002)} & 0.685 {\scriptsize (0.004)} & \textbf{0.2476} {\scriptsize (0.0010)} & 0.2615 {\scriptsize (0.0045)} \\
CPP      & \underline{85.9} {\scriptsize (0.6)} & 83.6 {\scriptsize (0.9)} & 36.5 {\scriptsize (2.0)} & 43.5 {\scriptsize (1.2)} & 90.5 {\scriptsize (0.11)} & 91.7 {\scriptsize (0.22)} & \underline{0.651} {\scriptsize (0.018)} & 0.690 {\scriptsize (0.018)} & 0.2547 {\scriptsize (0.0018)} & 0.2734 {\scriptsize (0.0022)} \\
FCPP     & 85.7 {\scriptsize (0.5)} & 85.0 {\scriptsize (0.9)} & 37.1 {\scriptsize (0.6)} & 44.5 {\scriptsize (1.1)} & 91.8 {\scriptsize (0.13)} & 91.9 {\scriptsize (0.21)} & 0.670 {\scriptsize (0.005)} & 0.695 {\scriptsize (0.003)} & 0.2541 {\scriptsize (0.0029)} & 0.2681 {\scriptsize (0.0051)} \\
SMILES   & — & — & — & — & — & — & 0.746 {\scriptsize (0.003)} & 0.783 {\scriptsize (0.003)} & — & — \\
\bottomrule
\end{tabular}
}
```
::: {#tab:app_bpe_merges}
   Merge Count ($K$)   Compression Ratio       Performance (MAE $\downarrow$)
  ------------------- ------------------- -----------------------------------------
        0 (Raw)              1.00x           0.171 `\scriptsize `{=latex}(0.013)
          100                2.86x           0.144 `\scriptsize `{=latex}(0.012)
          500                5.37x           0.137 `\scriptsize `{=latex}(0.010)
         1000                8.11x           0.131 `\scriptsize `{=latex}(0.011)
       **2000**           **10.84x**       **0.131** `\scriptsize `{=latex}(0.007)
         5000               11.34x           0.132 `\scriptsize `{=latex}(0.009)

  : Impact of BPE merge count ($K$) on compression ratio and model performance (ZINC). $K=2000$ achieves the best balance; larger vocabularies yield diminishing compression returns and do not improve performance due to token rarity.
:::

#### Choice of Frequency Guidance Unit.

The ablation confirms that Node-Edge-Node trigrams offer the superior trade-off, achieving the highest compression rate (10.84x) while maintaining linear complexity. The results in Table `\ref{tab:app_freq_units}`{=latex} show that simpler units like Node-Node bigrams perform suboptimally (10.71x) because they discard critical edge label information. Conversely, more complex units (2-hop/3-hop paths) surprisingly *reduce* compression (10.37x) despite their high computational cost ($O(E^3/N^2)$). This counter-intuitive result may be due to **data sparsity**. Specifically, as patterns grow longer, their exact recurrence becomes exponentially rarer, rendering the frequency map too sparse to provide robust global guidance.

::: {#tab:app_freq_units}
  Unit                           Collect Cost     Compression Ratio
  ---------------------------- ----------------- -------------------
  W/O Guidance                         0               10.46x
  Node-Node Bigram                  $O(E)$             10.71x
  Node-Edge Bigram                  $O(E)$             10.72x
  **Node-Edge-Node Trigram**    $\mathbf{O(E)}$      **10.84x**
  Multi-hop Path (2 hop)          $O(E^2/N)$           10.56x
  Multi-hop Path (3 hop)         $O(E^3/N^2)$          10.37x

  : Impact of different statistical units on BPE compression ratio (ZINC). \`Collect Cost' denotes the complexity of gathering statistics. Trigrams achieve the highest compression with linear overhead.
:::

#### Serialization Strategy and Counting Tasks.

While most results are consistent with the main paper, the `COLORS-3` dataset presents a notable exception due to its unique task of counting nodes with specific colors. On this dataset, node-based serialization methods like DFS perform best as they visit each node exactly once, making the counting task trivial. In contrast, our default edge-traversal method (Eulerian) can visit a single node multiple times, creating ambiguity for direct counting.

To empirically validate this hypothesis, we conducted a controlled experiment on ZINC by simulating a node counting task (counting carbon atoms). As shown in Table `\ref{tab:app_node_counting}`{=latex}, the edge-based serialization struggles with this task (21.6% accuracy), whereas switching to a node-based DFS traversal immediately resolves the failure (83.4% accuracy). This confirms that the limitation stems specifically from the mismatch between the edge-covering serialization strategy and the node-counting objective, rather than an inherent deficiency in the tokenizer or Transformer architecture. Despite this task-specific dynamic, a crucial finding remains: our **guided** serialization significantly outperforms unguided versions across general structural tasks, reinforcing the effectiveness of our guidance mechanism.

::: {#tab:app_node_counting}
  Serialization Method    Accuracy (w/ BPE)   Accuracy (w/o BPE)
  ---------------------- ------------------- --------------------
  Edge-based (Feuler)           21.6%               24.8%
  **Node-based (DFS)**        **83.4%**           **88.4%**

  : Performance on simulated node counting tasks (ZINC). Switching from edge-based to node-based serialization resolves the counting failure, confirming our analysis regarding the COLORS-3 dataset.
:::

```{=latex}
\newpage
```
Scalability Analysis on Large-Scale OGB Datasets {#app:scalability_ogb}
------------------------------------------------

To understand the computational characteristics of our pipeline on large graphs, we profile each stage on OGB datasets containing millions of edges (`ogbn-arxiv`, `ogbg-code2`, and `ogbn-products`). All measurements use the Frequency-Guided Eulerian circuit for serialization and are performed on a single CPU thread, with results normalized to *Time per 1 Million Nodes*.

Table `\ref{tab:app_large_graph_speed}`{=latex} reveals a clear separation between the two stages. BPE encoding introduces negligible overhead, completing in tens of milliseconds per million nodes and remaining stable across datasets regardless of graph density. The computational cost of the pipeline is thus dominated by the serialization stage, whose runtime scales with edge density and serialization method. This decomposition highlights a practical advantage of our modular design: since the tokenization stage (BPE) is lightweight, the overall efficiency can be improved by selecting a serialization method suited to the scale and structure of the target graphs.

::: {#tab:app_large_graph_speed}
  Dataset          Serialization (ms)   BPE Encoding (ms)
  --------------- -------------------- -------------------
  ogbn-arxiv             14,948                57
  ogbg-code2             4,108                 74
  ogbn-products          29,670                56

  : Efficiency of different components of our method on large-scale OGB datasets (Normalized to **Time per 1 Million Nodes**).
:::