---
bibliography:
- neurips_2023.bib
title: 'Bolmo: Byteifying the Next Generation of Language Models'
---

```{=latex}
\newcommand{\cmark}{\ding{51}}
```
```{=latex}
\newcommand{\xmark}{\ding{55}}
```
```{=latex}
\newcommand{\CellW}{0.9}
```
```{=latex}
\newcommand{\CellH}{0.9}
```
```{=latex}
\newcommand{\cellxy}[5]{%
  \pgfmathsetmacro{\cx}{#1+(#5-1)*\CellW}%
  \pgfmathsetmacro{\cy}{#2+(#3-#4)*\CellH}%
}
```
```{=latex}
\newcommand{\fillcell}[6]{%
  \cellxy{#1}{#2}{#3}{#4}{#5}%
  \fill[#6,draw=black,line width=0.25pt] (\cx,\cy) rectangle ++(\CellW,\CellH);
}
```
```{=latex}
\newcommand{\celltext}[6]{%
  \cellxy{#1}{#2}{#3}{#4}{#5}%
  \node[font=\scriptsize\sffamily] at ({\cx+0.5*\CellW},{\cy+0.5*\CellH}) {#6};
}
```
```{=latex}
\newcommand{\gridlines}[4]{%
  \draw[black,line width=0.25pt] (#1,#2) rectangle ++(#4*\CellW,#3*\CellH);
  \foreach \r in {1,...,#3} {%
    \draw[black,line width=0.25pt] (#1,#2+\r*\CellH) -- ++(#4*\CellW,0);
  }
  \foreach \c in {1,...,#4} {%
    \draw[black,line width=0.25pt] (#1+\c*\CellW,#2) -- ++(0,#3*\CellH);
  }
}
```
```{=latex}
\newcommand{\cblock}[3]{
  \hspace{-1.5mm}
  \begin{tikzpicture}
    [
    node/.style={square, minimum size=10mm, thick, line width=0pt},
    ]
    \node[fill={rgb,255:red,#1;green,#2;blue,#3}] () [] {};
  \end{tikzpicture}%
}
```
```{=latex}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
```
```{=latex}
\newcommand{\aitwo}{\raisebox{-1.5pt}{\includegraphics[height=1.05em]{logos/ai2_logo.png}}\xspace}
```
```{=latex}
\newcommand{\aitoo}{\raisebox{-1.5pt}{\includegraphics[height=1.05em]{logos/ai2.pdf}}\xspace}
```
```{=latex}
\newcommand{\allenAiAff}{\raisebox{.28em}{\hspace{.02em}\scalebox{0.7}{\textbf{1}}}}
```
```{=latex}
\newcommand{\cambridgeAff}{\raisebox{.28em}{\hspace{.02em}\scalebox{0.7}{\textbf{2}}}}
```
```{=latex}
\newcommand{\uwAff}{\raisebox{.28em}{\hspace{.02em}\scalebox{0.7}{\textbf{3}}}}
```
```{=latex}
\newcommand{\edinburghAff}{\raisebox{.28em}{\hspace{.02em}\scalebox{0.7}{\textbf{4}}}}
```
```{=latex}
\newcommand{\commaAff}{\raisebox{.28em}{\hspace{.02em}\scalebox{0.7}{\textbf{,}\hspace{0.1em}}}}
```
```{=latex}
\newcommand{\coreContrib}{\raisebox{.28em}{\hspace{.05em}\includegraphics[height=.45em]{logos/core.pdf}}\hspace{0.1em}}
```
```{=latex}
\newcommand{\starOlmo}{\raisebox{.28em}{\hspace{.05em}\includegraphics[height=.5em]{logos/star.pdf}}}
```
```{=latex}
\newcommand{\huggingface}{\raisebox{-1.5pt}{\includegraphics[height=1.05em]{logos/hf.pdf}}\xspace}
```
```{=latex}
\newcommand{\hfdataset}{\raisebox{-1.5pt}{\includegraphics[height=1.05em]{logos/db.pdf}}\xspace}
```
```{=latex}
\newcommand{\emailLogo}{\raisebox{-1.5pt}{\includegraphics[height=1.05em]{logos/email.pdf}}\xspace}
```
```{=latex}
\newcommand{\github}{\raisebox{-1.5pt}{\includegraphics[height=1.05em]{logos/github.pdf}}\xspace}
```
```{=latex}
\newcommand{\wandb}{\raisebox{-1.5pt}{\includegraphics[height=1.05em]{logos/wandb-logo.pdf}}\xspace}
```
```{=latex}
\newcommand{\oldOlmo}{OLMo\xspace}
```
```{=latex}
\newcommand{\newOlmo}{Olmo\xspace}
```
```{=latex}
\newcommand{\tulu}{T\"ulu~3\xspace}
```
```{=latex}
\newcommand{\olmozero}{\textsc{\oldOlmo~1}\xspace}
```
```{=latex}
\newcommand{\olmoapril}{\textsc{\oldOlmo-0424}\xspace}
```
```{=latex}
\newcommand{\olmotoo}{\textsc{\oldOlmo~2}\xspace}
```
```{=latex}
\newcommand{\olmotooinstruct}{\textsc{\oldOlmo~2Instruct}\xspace}
```
```{=latex}
\newcommand{\olmothree}{\textsc{\newOlmo~3}\xspace}
```
```{=latex}
\newcommand{\olmothreebase}{\textsc{\newOlmo~3 Base}\xspace}
```
```{=latex}
\newcommand{\olmothreeinstruct}{\textsc{\newOlmo~3 Instruct}\xspace}
```
```{=latex}
\newcommand{\olmothreeinstructdop}{\textsc{\newOlmo~3 Instruct DPO}\xspace}
```
```{=latex}
\newcommand{\olmothreerl}{\textsc{\newOlmo{}RL}\xspace}
```
```{=latex}
\newcommand{\olmothreerlzero}{\textsc{\newOlmo~3 RL-Zero}\xspace}
```
```{=latex}
\newcommand{\olmothreerlzeromath}{\olmothreerlzero \textsc{Math}\xspace}
```
```{=latex}
\newcommand{\olmothreerlzerocode}{\olmothreerlzero \textsc{Code}\xspace}
```
```{=latex}
\newcommand{\olmothreerlzeroif}{\olmothreerlzero \textsc{IF}\xspace}
```
```{=latex}
\newcommand{\olmothreerlzerogeneral}{\olmothreerlzero \textsc{Mix}\xspace}
```
```{=latex}
\newcommand{\olmocore}{{\oldOlmo-core}\xspace}
```
```{=latex}
\newcommand{\openinstruct}{{Open~Instruct}\xspace}
```
```{=latex}
\newcommand{\roundOne}{{Mix~A}\xspace}
```
```{=latex}
\newcommand{\roundThree}{{Mix~B}\xspace}
```
```{=latex}
\newcommand{\roundFive}{{Mix~C}\xspace}
```
```{=latex}
\newcommand{\metric}[1]{\hspace{0.7em}#1}
```
```{=latex}
\newcommand{\olmothreethinking}{\textsc{\newOlmo~3 Think}\xspace}
```{=latex}
\newcommand{\olmothreethinkingsft}{\textsc{\newOlmo~3 Think SFT}\xspace}
```{=latex}
\newcommand{\dolma}{\textsc{Dolma}\xspace}
```
```{=latex}
\newcommand{\dolmaoneseven}{\textsc{Dolma~1.7}\xspace}
```
```{=latex}
\newcommand{\dolmatoo}{\textsc{Dolma~3}\xspace}
```
```{=latex}
\newcommand{\dolmathree}{\dolmatoo}
```
```{=latex}
\newcommand{\dolci}{\textsc{Dolci}\xspace}
```
```{=latex}
\newcommand{\dolcithink}{\textsc{Dolci Think}\xspace}
```
```{=latex}
\newcommand{\dolcithinksft}{\textsc{Dolci Think SFT}\xspace}
```
```{=latex}
\newcommand{\dolcithinkdpo}{\textsc{Dolci Think DPO}\xspace}
```
```{=latex}
\newcommand{\dolcithinkrl}{\textsc{Dolci Think RL}\xspace}
```
```{=latex}
\newcommand{\dolciinstruct}{\textsc{Dolci Instruct}\xspace}
```
```{=latex}
\newcommand{\dolciinstructsft}{\textsc{Dolci Instruct SFT}\xspace}
```
```{=latex}
\newcommand{\dolciinstructdpo}{\textsc{Dolci Instruct DPO}\xspace}
```
```{=latex}
\newcommand{\dolciinstructrl}{\textsc{Dolci Instruct RL}\xspace}
```
```{=latex}
\newcommand{\dolcirlzero}{\textsc{Dolci RL-Zero}\xspace}
```
```{=latex}
\newcommand{\molmo}{\textsc{Molmo}\xspace}
```
```{=latex}
\newcommand{\olmomix}{\textsc{\oldOlmo~2}\xspace}
```
```{=latex}
\newcommand{\dolmatoomix}{\textsc{Dolma 3 Mix}\xspace}
```
```{=latex}
\newcommand{\dolminos}{\textsc{\oldOlmo~2 Dolmino Mix}\xspace}
```
```{=latex}
\newcommand{\dolminostoo}{\textsc{Dolma~3 Dolmino Mix}\xspace}
```
```{=latex}
\newcommand{\longminomix}{\textsc{Dolma~3 Longmino Mix}\xspace}
```
```{=latex}
\newcommand{\longminoPool}{\textsc{Dolma~3 Longmino Pool}\xspace}
```
```{=latex}
\newcommand{\missing}{\color{red}{xxx}\xspace}
```
```{=latex}
\newcommand{\olmocr}{\textsc{olmOCR}\xspace}
```
```{=latex}
\newcommand{\olmOCR}{\olmocr}
```
```{=latex}
\newcommand{\olmothreeeval}{\textsc{OlmoBaseEval}\xspace}
```
```{=latex}
\newcommand{\olmix}{\textsc{Olmix}\xspace}
```
```{=latex}
\newcommand{\olmorl}{\olmothreerl}
```
```{=latex}
\newcommand{\olmocrPDF}{{\olmocr science PDFs}\xspace}
```
```{=latex}
\newcommand{\benjamin}[1]{{\color{red} [Benjamin]: #1}}
```
```{=latex}
\newcommand{\nascomment}[1]{{\color{blue} [Noah: #1]}}
```
```{=latex}
\newcommand{\tomasz}[1]{{\color{yellow} [Tomasz]: #1}}
```
```{=latex}
\newcommand{\soldni}[1]{{\color{olmoBlue} [Luca]: #1}}
```
```{=latex}
\newcommand{\valentin}[1]{{\color{periwinkle} [Valentin]: #1}}
```
```{=latex}
\newcommand{\edoardo}[1]{{\color{forestgreen} [Edoardo]: #1}}
```
```{=latex}
\newcommand{\tapprox}{\raisebox{0.1ex}{\ensuremath{\sim}}}
```
```{=latex}
\newcommand{\ltlm}{LTLM}
```
```{=latex}
\newcommand{\bolmolarge}{Bolmo~7B}
```
```{=latex}
\newcommand{\bolmosmall}{Bolmo~1B}
```
```{=latex}
\newcommand{\bolmo}{Bolmo}
```
```{=latex}
\newcommand{\colt}[1]{\tboxmath[colback=tokenization!60]{\mathstrut $\displaystyle #1$}}
```
```{=latex}
\newcommand{\tcolt}[1]{\tbox[colback=tokenization!60]{\strut #1}}
```
```{=latex}
\newcommand{\colb}[1]{\tboxmath[colback=boundary!60]{\mathstrut $\displaystyle #1$}}
```
```{=latex}
\newcommand{\tcolb}[1]{\tbox[colback=boundary!60]{\strut #1}}
```
```{=latex}
\newcommand{\coll}[1]{\tboxmath[colback=local!60]{\mathstrut $\displaystyle #1$}}
```
```{=latex}
\newcommand{\tcoll}[1]{\tbox[colback=local!60]{\strut #1}}
```
```{=latex}
\newcommand{\colp}[1]{\tboxmath[colback=pool!90]{\mathstrut $\displaystyle #1$}}
```
```{=latex}
\newcommand{\tcolp}[1]{\tbox[colback=pool!60]{\strut #1}}
```
```{=latex}
\newcommand{\colg}[1]{\tboxmath[colback=global!60]{\mathstrut $\displaystyle #1$}}
```
```{=latex}
\newcommand{\tcolg}[1]{\tbox[colback=global!60]{\strut #1}}
```
Introduction {#sec:intro}
============

Contemporary language models (LMs) predominantly rely on subword tokenization [@sennrich_neural_2016; @kudo_subword_2018] to segment text into a fixed vocabulary of tokens. This leads to many problems: they suffer from insufficient character understanding [@edman-etal-2024-cute; @cosma2025strawberryproblememergencecharacterlevel; @uzan2025charbenchevaluatingroletokenization], tokenization bias [@phan2024exact; @hayase2025samplinglanguagemodelbyte; @vieira2025from],[^1] inability to incorporate sufficiently many words (e.g., across languages) in their fixed vocabulary [the *vocabulary bottleneck*; @liang-etal-2023-xlm], and potentially suboptimal compute allocation [@hwang2025dynamicchunkingendtoendhierarchical; @pagnoni-etal-2025-byte]. These problems have motivated extensive research into alternatives to subword tokenization, most commonly by switching to UTF-8 bytes.[^2] Many prior byte-level LMs claim to outperform subword-level LMs on some efficiency--performance Pareto frontier [@nawrot-etal-2023-efficient; @slagle2024spacebyte; @wang2024mambabyte; @hwang2025dynamicchunkingendtoendhierarchical; @pagnoni-etal-2025-byte; @evabyte]. However, in practice, byte-level LMs have not seen widespread adoption so far, with all leading LMs still exclusively relying on subword tokenization.

We hypothesize that the key reason for this mismatch between theory and practice is that existing approaches to byte-level language modeling focus predominantly on training a new byte-level model from scratch and comparing against a compute-matched subword-level LM also trained from scratch. In contrast, the training of state-of-the-art subword-level LMs is evolving rapidly, combining innovations in training data curation, model architecture, and post-training strategies. Keeping this pace is infeasible for byte-level LM development without extensive investments.

To resolve this mismatch, we introduce Bolmo, the first family of fully open byte-level LMs achieving performance on the level of state-of-the-art subword-level LMs across various tasks. In contrast to prior byte-level LMs, which focus predominantly on training from scratch, Bolmo is trained by *byteifying* an existing subword-level LM using less than 1% of a typical pretraining budget (39.3B tokens). We train Bolmo 7B and Bolmo 1B by byteifying Olmo 3 7B [@olmo3] and OLMo 2 1B [@olmo20242olmo2furious], respectively.

Bolmo follows the same overall architecture as the recent DTP [@nawrot-etal-2023-efficient], BLT [@pagnoni-etal-2025-byte] and H-Net [@hwang2025dynamicchunkingendtoendhierarchical] models, which we refer to collectively as Latent Tokenizer Language Models (LTLMs). However, we specifically adapt the architecture to be well-suited to byteification (see Section [3.1](#sec:arch){reference-type="ref" reference="sec:arch"}). In particular, we resolve a mismatch between the expressivity of subword tokenization and the latent tokenization in LTLMs by allowing Bolmo's latent tokenization to use future context (Section [3.1.1](#sec:non_causal_boundaries){reference-type="ref" reference="sec:non_causal_boundaries"}). Alongside an efficient two-stage training procedure starting with exact distillation of the source subword-level LM (Section [3.2](#sec:byteifying){reference-type="ref" reference="sec:byteifying"}), this allows Bolmo to quickly recover, and in some cases surpass, the performance of the source subword-level LM. We believe that byteifying provides a key missing direction for research on byte-level LMs by enabling the creation of state-of-the-art byte-level LMs without extensive investments. Byteifying is complementary to training from scratch: making it cheap to byteify any subword model can quickly unveil high-performing architectures which are promising candidates for training from scratch as byte-level LMs.

Our models on average outperform all prior public byte-level LMs of comparable size; for example, Bolmo 7B achieves a +16.5% absolute improvement on STEM tasks over BLT 7B, which was trained from scratch. Bolmo 7B also greatly outperforms the source Olmo 3 on character understanding and, in some cases, on coding. In addition, Bolmo can be further sped up by training with higher compression ratios of bytes per patch, which is possible only to a limited extent in subword-level LMs (Section [5.1](#sec:increased-compression){reference-type="ref" reference="sec:increased-compression"}). Furthermore, we show that existing post-trained checkpoints can be utilized to post-train a byteified model without any additional training cost (Section [5.2](#sec:zero-cost-post-train){reference-type="ref" reference="sec:zero-cost-post-train"}); this allows for further speeding up research on byte-level LMs by re-using components in the source LM ecosystem. Finally, we provide extensive ablations on our design choices, analyzing the remaining gaps to subword-level LMs as well as the gaps that Bolmo closes (Section [6](#sec:ablations){reference-type="ref" reference="sec:ablations"}).

Overall, byteifying lets us finally make byte-level LMs a practical choice, competitive with subword-level LMs across a wide set of use cases. We hope that our public data, models, and code will help further advance research on byte-level language modeling.

Related Work {#sec:rw}
============

#### Tokenization.

LMs process information represented as a discrete sequence of symbols called *tokens* or *patches*. The process of segmenting the input into this discrete sequence is called *tokenization*, with different tokenization schemes used across modalities such as text [@kudo_subword_2018], audio [@borsos2023audiolm] and images [@dosovitskiy2020image]. The predominant approach to tokenizing text since the inception of LMs has been subword tokenization [@sennrich_neural_2016; @kudo_subword_2018]: segmenting text into a discrete sequence of units from a finite vocabulary of subword tokens (usually of size 30k-300k), typically represented as integer IDs. Subword tokenization causes a number of problems. **(i)** Information about the characters within each token is lost. While LMs have been shown to implicitly learn their tokens' constituent characters [@kaushal-mahowald-2022-tokens; @edman-etal-2024-cute] and it is possible to explicitly re-introduce character information [@cosma2025strawberryproblememergencecharacterlevel], they still fall short in tasks requiring character knowledge [@edman-etal-2024-cute; @uzan2025charbenchevaluatingroletokenization]. **(ii)** The implicit reliance of subword tokenization on the future contents of the text (called *tokenization bias*) causes unexpected behavior at inference if the prompt ends in the middle of a word or with whitespace [@phan2024exact; @hayase2025samplinglanguagemodelbyte; @vieira2025from]. **(iii)** The need for a fixed, finite subword vocabulary causes restrictive rigidity: for example, while encoding English efficiently is crucial for pretraining since the vast majority of current pretraining documents are in English, various downstream tasks have different efficiency requirements across different languages, which a vocabulary fixed at pretraining time cannot accommodate.
**(iv)** Tokenization in contemporary LMs is tied to compute allocation: in a standard LM, the same amount of compute is spent on processing every token in the prefill, every token contributes equally to the KV cache size, and a fixed amount of compute is spent on sequentially generating any new token. Although there are ways to mitigate this problem post-hoc --- such as KV cache sparsification [@ancucki2025inferencetime] and multi-token prediction [@gloeckle2024better] --- directly adapting the tokenization and thus the compute allocation based on the input instead might be more effective [@nawrot-etal-2023-efficient; @pagnoni-etal-2025-byte].
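Problem **(ii)**, tokenization bias, can be made concrete with a toy greedy longest-match tokenizer; the vocabulary and matching rule below are illustrative assumptions, not those of any production tokenizer:

```python
# Toy illustration of tokenization bias: the segmentation of a prefix
# depends on characters that come *after* it. Vocabulary is hypothetical.
VOCAB = {"_Hello", "_World", "_Wor", "_", "H", "e", "l", "o", "W", "r", "d", "!", "s"}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization into VOCAB entries."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return tokens

# The same 10-character prefix "_Hello_Wor" is segmented differently
# depending on the future characters:
print(tokenize("_Hello_Wor!"))    # ['_Hello', '_Wor', '!']
print(tokenize("_Hello_World!"))  # ['_Hello', '_World', '!']
```

A model prompted with the bare prefix `_Hello_Wor` therefore cannot know which segmentation the tokenizer would have produced, which is exactly the mid-word inference mismatch described above.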

#### Byte-level LMs.

The shortcomings of subword tokenization have motivated extensive work on a wide range of alternatives, which even include tokenizing text by rendering it into pixels and segmenting these into patches [@lotz-etal-2023-text; @rust2023language; @wei2025deepseekocrcontextsopticalcompression]. The most common alternative has been tokenizing into a smaller set of finer-grained atomic units, such as UTF-8 bytes,[^3] instead. One strand of work directly replaces subword tokens with UTF-8 bytes, keeping other aspects of the architecture mostly the same [@xue-etal-2022-byt5; @wang2024mambabyte; @minixhofer2025universal; @evabyte]. This potentially solves problems **(i) - (iii)**[^4] of subword tokenization, but compute allocation remains a problem, exacerbated by having to process byte sequences that are, on average, at least four times longer. To mitigate this problem, some architectures pool a fixed number of tokens into a single representation with a lightweight *local encoder* (e.g., another Transformer network), pass the pooled representations through a deep *global model* operating over the shortened sequence, then depool the representations back to the original granularity via a *local decoder*. This approach has been pioneered for autoregressive models by the Hourglass Transformer [@nawrot-etal-2022-hierarchical] and later adopted more broadly [@yu2023megabyte; @ho2024block]. Subsequent work has shown that replacing static pooling with dynamic tokenization improves the performance--efficiency Pareto front [@nawrot-etal-2023-efficient; @slagle2024spacebyte]. In this case, the token boundaries may be learned end-to-end, rely on entropy spikes, or be externally supervised [@nawrot-etal-2023-efficient; @hwang2025dynamicchunkingendtoendhierarchical]. 
We refer to these architectures collectively as Latent Tokenizer Language Models (LTLMs), since --- although operating over bytes --- they perform a tokenization step inside the model which aggregates the byte representations into representations over latent patches. Byte-level LTLMs finally have the ability to address issues **(i) - (iv)** of subword tokenization. The most recent LTLMs have shown promise by performing on par with subword tokenization when spending the same total amount of FLOPs on training [@hwang2025dynamicchunkingendtoendhierarchical; @pagnoni-etal-2025-byte].

#### Tokenizer Transfer and Retrofitting.

Techniques to alter a model's architecture with extra training are typically referred to as *retrofitting*, which often relies on self-distillation [@mohawk; @ancucki2025inferencetime]. The principal difficulty when this involves a change of tokenizer is finding embeddings for the new tokens; this is usually done using heuristics [@tran2020englishforeignlanguagestransferring; @minixhofer-etal-2022-wechsel; @dobler-de-melo-2023-focus] or training-based methods [@minixhofer2025zeroshottokenizertransfer]. Recently, effective tokenizer transfer methods based on *cross-tokenizer distillation* have been introduced [@dobler2025tokendistillationattentionawareinput; @haltiuk2025modelawaretokenizertransfer; @minixhofer2025universal]. Here, the original model is treated as the teacher, the tokenizer-transferred model as the student, and the objective is to match the student's behavior to the teacher's. Byteification is a special case of tokenizer transfer. It was first performed by @pagnoni-etal-2025-byte, who initialize the parameters from an existing subword model where possible and train as if from scratch. @hwang2025dynamicchunkingendtoendhierarchical later perform byteification by supervising the boundary prediction to match the subword boundaries and introducing an auxiliary embedding-matching loss. Our key contribution is creating an LTLM which is specifically suited to byteification. We do so by introducing a novel architecture (Section [3.1](#sec:arch){reference-type="ref" reference="sec:arch"}), as well as a dedicated two-stage procedure to byteify efficiently by first learning to exactly recover the behavior of the source subword model (Section [3.2](#sec:byteifying){reference-type="ref" reference="sec:byteifying"}). Together, these innovations allow a byteified model to closely match the performance of state-of-the-art subword-level LMs for the first time.

Byteified Olmo {#sec:Bolmo}
==============

![The Bolmo architecture. The embedding layer $\mathcal{T}$ transforms the input text into one representation per byte. These representations are contextualized by the local encoder $\mathcal{E}$, consisting of mLSTM blocks. The boundary predictor $\mathcal{B}$ decides where to place patch boundaries using one byte of future context. The representations are then pooled, passed through the global model $\mathcal{M}$, consisting of Transformer layers, and depooled. Finally, the local decoder $\mathcal{D}$, another mLSTM stack, contextualizes the depooled byte representations, and the language modeling head transforms them into next-byte predictions, alongside deciding where to place the next patch boundary. ](figures/bolmo_architecture-cropped.png){#fig:bolmo_architecture width="\\linewidth"}

Architecture {#sec:arch}
------------

Following the same overall structure as prior LTLMs, Bolmo can be formalized as shown in Figure [1](#fig:bolmo_architecture){reference-type="ref" reference="fig:bolmo_architecture"}.

#### Tokenization & Embedding.

$\mathcal{T}$ assigns every input UTF-8 byte in $x$[^5] a corresponding embedding in $\mathbb{R}^{d}$ from an embedding table containing an entry for every byte. The embedding table over bytes is negligible in size compared to embedding tables over subwords. However, scaling the size and sparsity of the embedding table has been shown to improve performance while having no negative effect on inference speed [@pmlr-v267-huang25bb]. Inspired by BLT's hash embeddings [@hash_embeddings; @pagnoni-etal-2025-byte], we thus increase the size of the embedding table. Specifically, we residually add to every byte embedding the embedding of the longest subword (from the original subword-level LM's embedding table) that ends at the current byte position:

$$e_i \coloneqq \mathcal{T}_{\text{Byte}}(x_i) + \mathcal{T}_{\text{SubwordSuffix}}(x_{:i})$$

where $\mathcal{T}_{\text{SubwordSuffix}}$ assigns to every byte the embedding of the subword token in the vocabulary $\mathcal{V}_\text{Subword}$ that is the longest suffix of the byte sequence up to the current position $i$. Retaining the subword embeddings is not strictly necessary, and we can generally achieve the same performance by increasing the size of the local encoder instead. However, subword embedding retention allows us to achieve a better performance--efficiency tradeoff by increasing the number of cheap, sparsely activated parameters.[^6]

#### Local Encoder.

The local encoder $\mathcal{E}$ contextualizes the byte-level embeddings through an mLSTM layer [@beck2025xlstm], resulting in the contextualized representations $\hat{e}$. We find that mLSTM improves inference speed compared to other linear RNN variants (see Section [6.3](#sec:optimizing_inference){reference-type="ref" reference="sec:optimizing_inference"}) while attaining competitive performance. We found a single mLSTM layer to be sufficient since the expressivity of the local encoder is substantially enhanced by the retained subword embeddings.

#### Boundary Predictor.

The boundary predictor $\mathcal{B}$ predicts a score $p \in [0, 1]$ for every byte based on the contextualized representations $\hat{e}$. If $p$ is greater than some threshold, a patch boundary is placed after the current byte. In contrast to prior LTLMs, Bolmo's boundary predictor is *non-causal*:[^7] it has access to one byte of future context, and it is only employed during the prefill, where future information can be used while retaining the ability to generate text. We describe non-causal boundary prediction in detail in Section [3.1.1](#sec:non_causal_boundaries){reference-type="ref" reference="sec:non_causal_boundaries"}, where we also discuss how boundary prediction is handled during decoding.

#### Pooling.

We pool byte-level representations into patch representations by selecting the representation of the last byte in every patch as the patch-level representation $h$. This is equivalent to the pooling done by @hwang2025dynamicchunkingendtoendhierarchical,[^8] and does not introduce any extra parameters. Contrary to @hwang2025dynamicchunkingendtoendhierarchical, the local models and the global model use the same representation dimensionality, obviating the need for an upprojection.[^9]
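Last-byte pooling can be sketched in a few lines; the shapes and toy values below are illustrative:

```python
import numpy as np

def pool_last(e_hat: np.ndarray, boundaries: np.ndarray) -> np.ndarray:
    """Select the representation of the last byte of each patch.

    e_hat:      (n, d) contextualized byte representations
    boundaries: (n,) 0/1 array; 1 marks the final byte of a patch
    returns:    (num_patches, d) patch representations h
    """
    return e_hat[boundaries.astype(bool)]

e_hat = np.arange(12, dtype=float).reshape(6, 2)  # toy (n=6, d=2)
boundaries = np.array([0, 1, 0, 0, 1, 1])         # patches: [0:2], [2:5], [5:6]
h = pool_last(e_hat, boundaries)                  # shape (3, 2)
```

Because pooling is a pure selection, it indeed adds no parameters, and the last byte of each patch has already seen the whole patch through the local encoder.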

#### Global Model.

The majority of compute is spent in the deep global model $\mathcal{M}$ contextualizing the patch representations $h$ into $\hat{h}$. We retain the global model of the original subword-level LM, i.e. the Olmo 3 decoder-only transformer backbone.

#### Depooling.

The global model is invoked at every patch boundary, providing a contextualized representation for every patch. It remains to depool these representations back to byte-level representations. We do so by adding, at every byte position, the latest available patch representation in $\hat{h}$ to a linear projection of the byte representations $\hat{e}$, resulting in $z$. This is similar to the depooling of @hwang2025dynamicchunkingendtoendhierarchical, again forgoing the projection due to equal global and local dimensionality.
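A sketch of this depooling step, under the assumption that "latest available" means the most recently *completed* patch, with bytes before the first boundary falling back to a zero vector (both are assumptions for illustration):

```python
import numpy as np

def depool(h_hat, e_hat, boundaries, W):
    """z_i = latest available patch representation + linear(e_hat_i).

    h_hat:      (p, d) contextualized patch representations (global model)
    e_hat:      (n, d) contextualized byte representations (local encoder)
    boundaries: (n,) 0/1; 1 marks the final byte of a patch
    W:          (d, d) projection applied to the byte stream
    """
    n, d = e_hat.shape
    # number of patches completed strictly before each byte position
    completed = np.concatenate([[0], np.cumsum(boundaries)[:-1]]).astype(int)
    latest = np.zeros((n, d))
    latest[completed > 0] = h_hat[completed[completed > 0] - 1]
    return latest + e_hat @ W.T

e_hat = np.ones((4, 2))
h_hat = np.array([[1.0, 2.0], [3.0, 4.0]])
z = depool(h_hat, e_hat, np.array([0, 1, 0, 1]), np.eye(2))
# bytes 0-1 see no completed patch; bytes 2-3 see the first patch's output
```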

#### Local Decoder.

The local decoder $\mathcal{D}$ contextualizes the depooled byte representations $z$ into $\hat{z}$ via another stack of mLSTM layers. Here, we use a larger number of mLSTM layers (in practice, four) to increase capacity: unlike in the encoder, we find it infeasible to meaningfully re-incorporate the output subword embedding matrix, which could otherwise have allowed reducing the number of decoder layers in the same way as for the encoder.

#### Language Modeling Head.

The language modeling head $\text{LMHead}$ converts the final byte representations $\hat{z}$ into scores interpretable as next-byte probabilities via a projection to the vocabulary space and softmax.

Overall, our modifications keep the total parameter count similar to that of the source subword-level LM: we remove the output embedding matrix but add new parameters in the local encoder and local decoder layers. In practice, Bolmo 1B contains 10M fewer parameters than OLMo 2 1B ($-\!0.7\%$), and Bolmo 7B contains 330M more parameters than Olmo 3 7B ($+\!4.5\%$).

### Non-Causal Patch Boundary Prediction {#sec:non_causal_boundaries}

![Subword-level LMs non-causally set boundaries over the prefill using the external subword tokenizer, then implicitly predict boundaries alongside the text content during decoding *(left)*. Prior byte-level LTLMs causally set boundaries with a lightweight boundary predictor during both prefill and decoding *(middle)*. We restore the expressivity of subword-level LM boundaries by non-causally predicting boundaries for the prefill, then predicting whether a boundary occurs alongside the next byte during decoding *(right)*.](figures/bolmo_architecture_comparison-cropped.png){#fig:bolmo_architecture_side_by_side width="\\linewidth"}

Prior LTLMs employ a causality constraint on the boundary predictions: the boundary predictor only uses past context to decide whether to place a boundary.[^10] At a glance, this seems necessary: we are aiming to predict the next byte, so we must not leak any information about it. However, although subword-level LMs employ a causality constraint over the subword tokens, the subword tokens themselves do not depend exclusively on past context: *subword tokenizers use information about future bytes to place token boundaries*. To see this, let us interpret our subword tokenizer as a function which decides whether to place a token boundary after any byte, i.e. $\mathcal{B}(x): \{0,1,..,255\}^n \rightarrow \{0,1\}^n$. Let us assume a vocabulary of English words and subwords, the example text $\texttt{\_Hello\_Wor!}$, which would typically be tokenized as $\{\texttt{\_Hello},\texttt{\_Wor},\texttt{!}\}$, and the position $i = |\texttt{\_Hello\_Wor}|-1 = 9$. Then $\mathcal{B}(\texttt{\_Hello\_Wor!})_{i} = 1$ since there is a boundary after $\texttt{r}$. However, in the text $\texttt{\_Hello\_World!}$, which would be tokenized as $\{\texttt{\_Hello},\texttt{\_World},\texttt{!}\}$, we have $\mathcal{B}(\texttt{\_Hello\_World!})_{i} = 0$, despite $\texttt{\_Hello\_Wor!}[:i{+}1] = \texttt{\_Hello\_World!}[:i{+}1] = \texttt{\_Hello\_Wor}$. In other words, although the subword-level LM only uses past subword tokens to predict the next subword token, the subword tokens themselves are created by taking future context into account. In this case, this means deciding that `_Wor` should be a token in one text but not in the other, although the text up to that point is identical. Current LTLMs, in contrast, cannot take future context into account. This creates a mismatch between the expressivity of boundary predictors and subword tokenizers. We modify the boundary predictor to resolve this mismatch. In particular, while prior boundary predictors are implemented as

$$\mathcal{B}(\hat{e})_t \coloneqq f(\hat{e}_0,\hat{e}_1,..,\hat{e}_{t})$$

we implement our boundary predictor as

$$\mathcal{B}_{\text{Bolmo}}(\hat{e})_t \coloneqq f(\hat{e}_0,\hat{e}_1,..,\hat{e}_{t},\colb{\hat{e}_{t+1}}).$$

That is, we use up to one byte of future context. Concretely, we parametrize our boundary predictor as

$$\mathcal{B}_{\text{Bolmo}}(\hat{e})_t \coloneqq \cfrac{1}{2} \left( 1 - \cfrac{(W_q \colb{\hat{e}_{t+1}})^T (W_k\hat{e}_{t})}{\|W_q \colb{\hat{e}_{t+1}}\|\|W_k \hat{e}_{t}\|} \right) \in [0,1],$$

i.e., we compute the cosine distance between a projection of the representation of the current byte and a projection of the representation of the byte one position in the future. An equivalent parametrization, although using the current byte and the byte one position before, is used by @hwang2025dynamicchunkingendtoendhierarchical. Taking one future byte into account largely resolves the mismatch between subword tokenizers and Bolmo's latent tokenization.[^11] As shown in Figure [2](#fig:bolmo_architecture_side_by_side){reference-type="ref" reference="fig:bolmo_architecture_side_by_side"}, taking future context into account can also make patches more semantically coherent: for example, in texts containing compounds such as `the flowerbed`, a boundary predictor has three intuitive options: (i) make the entire compound a single patch `flowerbed`, (ii) place a patch boundary after `r` to create the patch `flower`, or (iii) place a patch boundary after `b` (once it is evident this is a compound word) to create `flowerb`. Option (ii) is arguably the semantically most coherent one;[^12] however, for a causal boundary predictor, this would mean having to place a patch boundary after `r` for *every* text starting with `the flower`, including e.g. `the flowers`, while a non-causal one can adjust based on future context. The byteification strategy of @hwang2025dynamicchunkingendtoendhierarchical supervises based on option (iii), i.e. predicting the start of the next subword token instead of the end of the previous one, which would create a patch `flowerb` as shown in Figure [2](#fig:bolmo_architecture_side_by_side){reference-type="ref" reference="fig:bolmo_architecture_side_by_side"}.
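The parametrization above can be sketched directly; the random toy weights and sequence below are illustrative, with $W_q$ and $W_k$ playing the roles from the equation:

```python
import numpy as np

def boundary_scores(e_hat, W_q, W_k):
    """Non-causal boundary scores: half the cosine distance between a
    projection of the *next* byte's representation and a projection of
    the current byte's representation. Dissimilar adjacent bytes (a
    likely patch boundary) score near 1; similar ones near 0.
    """
    q = e_hat[1:] @ W_q.T            # projections of \hat{e}_{t+1}
    k = e_hat[:-1] @ W_k.T           # projections of \hat{e}_t
    cos = (q * k).sum(-1) / (
        np.linalg.norm(q, axis=-1) * np.linalg.norm(k, axis=-1)
    )
    return 0.5 * (1.0 - cos)         # in [0, 1]; threshold to place boundaries

rng = np.random.default_rng(0)
d = 4
e_hat = rng.normal(size=(6, d))      # toy contextualized byte representations
p = boundary_scores(e_hat, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
# one score per byte position t that has a successor t+1
```

Note that no score is produced for the final position, which is unproblematic for prefill since the last prefill byte always closes a patch before decoding begins.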

#### Output Boundary Prediction.

While using future context is fine for prefilling, we need to know whether to place a boundary without observing the next byte during decoding. We thus add a special symbol $\texttt{<b>}$ to the vocabulary and let the local decoder learn to emit $\texttt{<b>}$ at the end of every patch (the local encoder, in contrast, never sees $\texttt{<b>}$).[^13] In effect, we end up with two boundary predictors: the boundary predictor $\mathcal{B}$ ingesting the shallowly contextualized representations from the local encoder with future context (used during prefill), and a boundary predictor as part of the language modeling head ingesting deeply contextualized representations from the local decoder without future context (used during decoding). Notably, this is precisely analogous to what happens in subword-level LMs: the prefill is tokenized using the external subword tokenizer (analogous to the boundary predictor $\mathcal{B}$), and output boundaries $\texttt{<b>}$ are implicitly predicted alongside the text contents of every subword token upon decoding (analogous to our output boundary predictor) as illustrated in Figure [2](#fig:bolmo_architecture_side_by_side){reference-type="ref" reference="fig:bolmo_architecture_side_by_side"}.

#### Boundary Symbol Fusion.

Since we aim to predict the symbol `<b>` after every patch, our local decoder turns from an isotropic model into a transducer $\mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{(n+k) \times d}$, where $k$ is the number of patches, i.e., the local decoder needs to process $k$ additional positions. Although this overhead is not prohibitive in principle, it makes it difficult to compare models which use output boundary prediction with models which do not. We thus make it effectively zero-cost by doubling our byte vocabulary size from 256 to 512, adding, for every byte, a version of the same byte followed by a boundary. The goal of the local decoder, then, is to predict at every step the current byte *and* whether it is followed by a boundary. Fusing the boundary symbol turns the local decoder back into an isotropic model. The only remaining overhead is that the softmax has to be applied over 512 instead of 256 output tokens, which is negligible.
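The fused vocabulary amounts to a simple index mapping, sketched below (illustrative helper names; the actual implementation operates on tensors):

```python
# Fused vocabulary: ids 0..255 are plain bytes, ids 256..511 are the same
# bytes immediately followed by a patch boundary <b>.
def fuse(byte_id: int, boundary: bool) -> int:
    """Map a byte and a boundary bit to a fused vocabulary id in [0, 512)."""
    return byte_id + 256 * int(boundary)

def unfuse(token_id: int):
    """Recover (byte_id, boundary) from a fused vocabulary id."""
    return token_id % 256, token_id >= 256
```

Predicting a fused id thus jointly predicts the byte and its boundary bit in a single softmax over 512 entries.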

#### On end-to-end learning of non-causal boundaries.

@hwang2025dynamicchunkingendtoendhierarchical train the boundary predictor end-to-end by incorporating it in the computation graph through (i) smoothing of the contextualized global representations $\hat{h}$ using the boundary scores and (ii) a straight-through estimator of the boundary scores applied to the depooled representations $z$. Training the boundary predictor end-to-end in this style is not immediately possible using our non-causal formulation. This is the case since (i) our output boundary predictor would need to estimate the precise boundary score assigned by the boundary predictor for decoding, instead of only predicting whether a boundary occurs or not and (ii) relatedly, instead of the single bit of information leaked by discrete boundary predictions, the model can learn to leak 16 bits of information (assuming we use bfloat16) about the next byte, which is enough to uniquely identify it. This could cause the model to learn degenerate solutions by exploiting the boundary scores to pass information about the future to the local decoder. In this work, we thus focus exclusively on strategies to train the boundary predictor with external supervision instead, which we believe have been underutilized in prior work.

Byteifying Procedure {#sec:byteifying}
--------------------

We byteify by initializing the parameters of the global model from the subword-level LM checkpoint, while parameters of the local models and the LM head are initialized randomly. Our byteifying procedure consists of two stages. In the *first stage*, we aim to quickly learn weights for the local encoder, local decoder, boundary predictor and LM head which exactly recover the behavior of the subword-level LM. The parameters of the global model stay frozen in this stage. In the *second stage*, we train the entire model to let it learn to utilize byte-level information, while also optionally increasing the target compression ratio of bytes per patch.

### Stage 1: Subword-to-Byte Distillation {#sec:stage1}

The aim of the first stage is to quickly learn weights for the local encoder, local decoder, boundary predictor and LM head which recover the behavior of the subword model. Efficiency is crucial: the cost of this stage should be minimal to permit fast experimentation and to allow increasing the investment in Stage 2. To achieve these goals, we design a Stage 1 procedure which learns the desired weights without fully backpropagating through the global model. This substantially reduces the time per training step (see Appendix [14](#appendix:hyperparameters){reference-type="ref" reference="appendix:hyperparameters"}). The Stage 1 loss is minimal if and only if the byte-level model exactly mimics the source subword-level LM. It is composed of three parts.

#### Quickly Learning a Boundary Predictor $\mathcal{B}_\text{Bolmo}$.

We train the boundary predictor to emulate the boundaries placed by subword tokenization via a binary cross-entropy loss, i.e.,

$$\mathcal{L}_\mathcal{B} \coloneqq - \sum_t \left(\mathcal{B}_{\text{subword}}(x)_t \log \mathcal{B}_\text{Bolmo}(\hat{e})_t + (1 - \mathcal{B}_{\text{subword}}(x)_t) \log (1 - \mathcal{B}_\text{Bolmo}(\hat{e})_t)\right),$$

where $\mathcal{B}_{\text{subword}}(x)$ is $1$ for every byte at the last position of a subword patch, otherwise $0$. The boundary predictor $\mathcal{B}_\text{Bolmo}$, which uses future context to tokenize the prefill text, quickly achieves $>\!99\%$ accuracy.
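A minimal sketch of $\mathcal{L}_\mathcal{B}$ over plain probability lists (illustrative; in practice this is computed from logits in a vectorized fashion):

```python
import math

def boundary_bce_loss(pred, target):
    """Binary cross-entropy between predicted boundary probabilities and
    0/1 subword-boundary labels (1 = last byte of a subword patch)."""
    eps = 1e-12  # guard against log(0)
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(pred, target)
    )
```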

#### Quickly Learning a Local Encoder $\mathcal{E}$.

Assuming our boundary predictor perfectly emulates subword tokenization, our local encoder and pooling mechanism will be a perfect substitute for the subword embedding matrix if they yield the same input to the global model as the subword embedding matrix for every patch. This is the case if all pooled representations $\text{Pool}(\hat{e}, \mathcal{B}_\text{Bolmo}(\hat{e}))$ are equal to the corresponding subword embeddings $\mathcal{T}_\text{Subword}(x)$. @hwang2025dynamicchunkingendtoendhierarchical optimize toward this goal by directly minimizing the L2 distance of every pooled representation to the corresponding subword embedding. We take an alternative approach inspired by research on model stitching which shows that similar representations do not necessarily propagate through subsequent layers in a similar way [@athanasiadis2025model]. We propagate the pooled representations through $n$ layers of the global model and minimize L2 distance to the subword representations which result from propagating the subword embeddings through the same $n$ layers,

$$\mathcal{L}_\mathcal{E} \coloneqq \| \mathcal{M}_{:n}(\text{Pool}(\hat{e}, \mathcal{B}_\text{subword}(x))) - \mathcal{M}_{:n}(\mathcal{T}_\text{subword}(x)) \|.$$

Notably, we pool the local encoder representations using the true subword boundaries $\mathcal{B}_\text{subword}$ instead of $\mathcal{B}_{\text{Bolmo}}$.[^14] $\mathcal{M}_{:n}$ indicates the global model up to and including the $n$-th layer. The weights of $\mathcal{M}$ are kept frozen. If $n=0$, this reduces to the setting of @hwang2025dynamicchunkingendtoendhierarchical. Although choosing $n>0$ necessitates backpropagating through some parts of the global model, we can minimize the resulting cost by choosing a small $n$. We find $n=4$ to strike a good balance between performance and efficiency, substantially outperforming $n=0$ while remaining cheap to compute.
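The stitching loss can be sketched as follows, treating representations as flat vectors and the first $n$ frozen global-model layers as plain functions (a simplification with illustrative names):

```python
import math

def encoder_loss(pooled, subword_embeds, layers, n):
    """L2 distance after propagating both the pooled byte representations
    and the subword embeddings through the first n (frozen) global-model
    layers. n = 0 recovers direct embedding matching; we use n = 4."""
    h_bytes, h_subword = pooled, subword_embeds
    for layer in layers[:n]:
        h_bytes, h_subword = layer(h_bytes), layer(h_subword)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h_bytes, h_subword)))
```

Gradients only need to flow back through these first $n$ layers into the local encoder; the layers themselves stay frozen.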

#### Quickly Learning a Local Decoder $\mathcal{D}$.

Our local decoder and LM head are optimal if our byte-level LM assigns the same likelihood as the subword model to every text $x$. Assuming equal patch boundaries, they are optimal if the likelihood of every *patch* is equal. Since subword-level LMs implicitly predict output patch boundaries, we cannot easily compute comparable patch likelihoods in byte-level models without output boundary prediction. In this case, we would have to resort to approximations as in @minixhofer2025universal. However, since Bolmo does predict output patch boundaries, simply comparing the likelihoods of every patch results in an exact objective (i.e., a loss which is minimal if and only if both models are the same),

$$\mathcal{L}_{\mathcal{D},\text{Distill}} \coloneqq \sum_i f\left(
\prod_{j \in T(x, i)}\!\!\!\! \text{LMHead}(\hat{z}_{\text{subword}})[j, \text{next\_byte}(x, j)]
\;\Big\|\;
\text{LMHead}_\text{subword}(z_\text{subword})[i, \text{next\_tok}(x, i)]
\right),$$

where $j \in T(x, i)$ indicates all byte indices $j$ which are part of the $i$-th subword patch; this includes the indices of the special $\texttt{<b>}$ symbol if treated as separate, or the indices of the 256 special symbols consisting of a byte plus $\texttt{<b>}$ if fused. $\text{next\_tok}(\cdot)$ and $\text{next\_byte}(\cdot)$ map to the index in the vocabulary of the symbol occurring after the current symbol (token or byte), including special symbols.[^15] $z_\text{subword} = \mathcal{M}(\mathcal{T}_\text{subword}(x))$ are the representations of the subword model at the final layer, $\hat{z}_\text{subword} = \mathcal{D}(\text{Depool}(\hat{e},z_\text{subword},p))$ is the result of passing these representations through the depooling layer and the local decoder, and $\text{LMHead}_\text{subword}$ is the LM head of the source subword-level LM. As the comparison function $f$, we choose the temperature-modulated binary cross-entropy,

$$f(\hat{y} \;\|\; y) \coloneqq - \left(
y^{1/\tau} \log \hat{y}^{1/\tau} + 
(1 - y^{1/\tau}) \log (1 - \hat{y}^{1/\tau})
\right),$$

with $\tau=5$ as recommended by @minixhofer2025universal. In practice, we conduct the operations involved in the computation of $\mathcal{L}_\mathcal{D}$ in log-space to ensure stable numerics. We optionally combine the distillation loss $\mathcal{L}_{\mathcal{D},\text{Distill}}$ with a cross-entropy loss to encourage modeling the training data well and to already start exploiting byte-level information,

$$\mathcal{L}_{\mathcal{D},\text{CE}} \coloneqq \sum_j -\!\log \text{LMHead}(\hat{z}_\text{subword})[j, \text{next\_byte}(x, j)].$$
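The temperature-modulated BCE above can be evaluated stably directly from patch log-likelihoods, e.g. (a sketch with illustrative names; it assumes patch likelihoods are strictly below 1, as they are in practice):

```python
import math

def distill_bce(lp_byte, lp_sub, tau=5.0):
    """Temperature-modulated BCE between patch log-likelihoods.
    lp_byte: log-likelihood the byte-level model assigns to a patch (sum of
    byte log-probs incl. the boundary symbol); lp_sub: log-likelihood the
    subword model assigns to the same patch. Works in log-space throughout."""
    y = math.exp(lp_sub / tau)           # y^(1/tau)
    log_yhat = lp_byte / tau             # log(yhat^(1/tau))
    # log(1 - yhat^(1/tau)), via log1p for numerical stability
    log_one_minus_yhat = math.log1p(-math.exp(log_yhat))
    return -(y * log_yhat + (1 - y) * log_one_minus_yhat)
```

As a cross-entropy, this is minimized (but not zero) when the byte-level model assigns exactly the subword model's patch likelihood.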

#### Putting It Together.

In principle, the boundary predictor and local encoder on the one hand, and the local decoder and LM head on the other, could be trained separately (assuming we stop the gradient to the encoder through $\hat{z}_\text{subword}$). Although there may be scenarios where this is beneficial, we choose to train them together for simplicity. The complete Stage 1 loss is given by

$$\mathcal{L}_{\text{Stage1}} \coloneqq 
\lambda_\mathcal{B} \mathcal{L}_\mathcal{B} + 
\lambda_\mathcal{E} \mathcal{L}_\mathcal{E} + 
\lambda_{\mathcal{D},\text{Distill}} \mathcal{L}_{\mathcal{D},\text{Distill}} +
\lambda_{\mathcal{D},\text{CE}} \mathcal{L}_{\mathcal{D},\text{CE}},$$

where $\lambda_\mathcal{B},\lambda_\mathcal{E},\lambda_{\mathcal{D},\text{Distill}},\lambda_{\mathcal{D},\text{CE}} \in \mathbb{R}$ are the loss weights, which we set to $\lambda_\mathcal{B}=4, \lambda_\mathcal{E} = 1, \lambda_{\mathcal{D},\text{Distill}} = 1,\lambda_{\mathcal{D},\text{CE}}=1$. Stage 1 needs, in total, one forward pass through all layers and one backward pass through the first $n$ layers of the global model, plus forward and backward passes through the local encoder, local decoder, boundary predictor and LM head. This makes Stage 1 substantially more efficient than training the entire model. It could be further optimized by quantizing or applying inference-specific optimizations to the global model layers from the $(n\!+\!1)$-th layer onward (which we do not need to backpropagate through). We analyze the difference between inserting Stage 1 and directly training the entire model end-to-end with randomly initialized parameters (besides the global model) later in Section [6.2](#sec:stage1_ablation){reference-type="ref" reference="sec:stage1_ablation"}. Besides performance improvements, Stage 1 provides a vehicle for rapid experimentation: we can conduct Stage 1 training to rapidly check whether a particular architecture for the local encoder and decoder has sufficient capacity to emulate the input and output embedding matrices, respectively. We use this to guide the architecture search for Bolmo under the hypothesis that byte-level architectures which cannot emulate the subword model after Stage 1 will remain inadequate with further Stage 2 training.

### Stage 2: End-to-End Training {#sec:stage2}

In the second stage, we train the entire model end-to-end, retaining only the boundary loss $\mathcal{L}_\mathcal{B}$ and the cross-entropy loss $\mathcal{L}_{\mathcal{D},\text{CE}}$. For $\mathcal{L}_{\mathcal{D},\text{CE}}$, we substitute the representations $\hat{z}_\text{subword}$ depooled from the subword model with the true depooled representations $\hat{z}$, and refer to this loss as $\mathcal{L}_{\text{CE}}$,

$$\mathcal{L}_\text{Stage2} \coloneqq \lambda_{\mathcal{B}} \mathcal{L}_\mathcal{B} + \lambda_{\text{CE}} \mathcal{L}_\text{CE}.$$

We now optimize all parameters, including those of the global model $\mathcal{M}$. This stage is intended for the model to adjust to the end-to-end setting, since in Stage 1 we assumed a local encoder and boundary predictor perfectly emulating the subword model, which, although close, is not true in practice. The global model can learn to exploit the new byte-level information in Stage 2, and optionally be trained with higher compression ratios of bytes per patch (see Section [5.1](#sec:increased-compression){reference-type="ref" reference="sec:increased-compression"}).

Experiment Setup {#sec:setup}
================

#### Data.

The data mix consists of 172B tokens[^16] from the Dolma 3 pretraining data mix [@olmo3], augmented with 75M tokens of CUTE-style data [@edman-etal-2024-cute], sampled so as not to overlap with the CUTE test set, to encourage character understanding (see Appendix [11](#appendix:cute){reference-type="ref" reference="appendix:cute"} for details). Training runs for less than one epoch on this mix.

#### Model.

We use the pretrained Olmo 3 7B checkpoint after mid-training and long-context extension [@olmo3] as our starting point for byteifying into Bolmo. For the local models, we use stacks of alternating mLSTM [@beck2025xlstm] and feedforward layers of size 1 and 4 for the encoder and decoder, respectively. See Appendix [14](#appendix:hyperparameters){reference-type="ref" reference="appendix:hyperparameters"} for details on the architecture.

#### Training.

For *Stage 1*, we train on a total of 9.8B tokens ($\approx$ 43B bytes). In this stage, we train the local encoder, decoder, boundary predictor and LM head, keeping the global model frozen. For *Stage 2*, we train the entire model on a total of 39.3B tokens ($\approx$ 173B bytes). See Appendix [14](#appendix:hyperparameters){reference-type="ref" reference="appendix:hyperparameters"} for detailed training hyperparameters.

#### Baseline.

We compare against the Olmo 3 7B checkpoint with continued training on the training data such that the total number of gradient updates to the global model parameters is the same (i.e., on 39.3B tokens), to disentangle the effects of continued training with the same architecture from those of byteification.

#### Ablations and Development.

We developed Bolmo primarily through experiments on OLMo 2 [@olmo20252olmo2furious]. We optimized architectural decisions through quick Stage 1 training runs on OLMo 2 1B or 7B. Our byteifying procedure was then applied without adjustments to Olmo 3 7B. Since there is currently no 1B version of Olmo 3, we conduct experiments requiring larger sweeps across training configurations on OLMo 2 1B.

#### Evaluation.

We create the Bolmo 7B evaluation suite based on @olmo3's , skipping GSM Symbolic and BigCodeBench due to their size, and adding CUTE [@edman-etal-2024-cute] and EXECUTE [@edman-etal-2025-execute] to measure character understanding in English and across other languages, respectively. We create the Bolmo 1B evaluation suite based on @olmo3's Base Easy Suite, again adding CUTE [@edman-etal-2024-cute] to measure character understanding. For the Bolmo 1B suite, we define a set of core tasks consisting of ARC [@clark2018think], MMLU [@hendryckstest2021], CSQA [@talmor-etal-2019-commonsenseqa], HellaSwag [@zellers-etal-2019-hellaswag], WinoGrande [@Sakaguchi_Le_Bras_Bhagavatula_Choi_2020], SocialIQA [@sap-etal-2019-social], PIQA [@Bisk_Zellers_Le_bras_Gao_Choi_2020], the Basic Skills benchmark [@olmo3] and CUTE [@edman-etal-2024-cute] for use in ablations and sweeps (see Appendix [10](#appendix:benchmarks){reference-type="ref" reference="appendix:benchmarks"} for details).

Main Results {#sec:results}
============

```{=latex}
\renewcommand{\arraystretch}{1}
```
#### Bolmo 7B Results.

Table [\[tab:main\_results\]](#tab:main_results){reference-type="ref" reference="tab:main_results"} compares Bolmo with existing byte-level LMs of comparable size: EvaByte 6.5B [@evabyte], TFree-Hat 7B [@neitemeier2025hierarchical] and BLT 7B [@pagnoni-etal-2025-byte], as well as the source Olmo 3 model [@olmo3]. Bolmo performs best among all publicly known byte-level models in every category, including code, math, multiple-choice QA, and character understanding. As the only exception, Bolmo slightly trails TFree-Hat 7B in the GenQA category (70.9 vs. 71.3). Bolmo also comes close to matching the performance of the source Olmo 3 model [which is itself competitive with other subword-level LMs of comparable size; see @olmo3]. The remaining gap to Olmo 3 is largely not specific to byteifying; it can be attributed to continued training in general, see Appendix [9](#appendix:ablations){reference-type="ref" reference="appendix:ablations"}.

On code, Bolmo outperforms Olmo 3 due to higher pass\@16 rates at generally slightly lower pass\@1. This indicates that Bolmo generates more diverse continuations than Olmo 3 under the given sampling settings, which are equivalent for both models ($\text{temperature}=0.6,\text{top\_p}=0.6$, see Appendix [10](#appendix:benchmarks){reference-type="ref" reference="appendix:benchmarks"}). However, although promising, at this point we cannot conclude that byte-level models are fundamentally better suited to generating diverse continuations, since we have not comprehensively explored the quality--diversity tradeoff at the different points defined by different sampling strategies.

The character understanding results are surprising, as Bolmo's accuracy vastly surpasses its subword-level counterpart. In fact, prior byte-level models do not outperform the subword Olmo 3 model. This could be explained by the hypothesis that character understanding is primarily acquired through scale [in terms of parameters and training tokens; @cosma2025strawberryproblememergencecharacterlevel]: although byte-level models should require less scale to acquire character understanding, the increased scale of Olmo 3---likely trained on substantially more tokens than the other models---might compensate for this.[^17] In contrast, Bolmo is trained with synthetic data encouraging character understanding (Appendix [11](#appendix:cute){reference-type="ref" reference="appendix:cute"}), which speeds up the acquisition of this skill. Bolmo still outperforms Olmo 3 in a comparison where Olmo 3 had continued training on the Bolmo data mix for the same total amount of tokens (Appendix [9](#appendix:ablations){reference-type="ref" reference="appendix:ablations"}), further suggesting that while character understanding is driven by scale, it emerges sooner in byte-level models.

#### Bolmo 1B Results.

Table [\[tab:main\_results\_1b\]](#tab:main_results_1b){reference-type="ref" reference="tab:main_results_1b"} compares Bolmo 1B [trained off of OLMo 2 1B; @olmo20252olmo2furious] with existing byte-level models, including H-Net [for which no 7B checkpoint is available; @hwang2025dynamicchunkingendtoendhierarchical] and BLT 1B. Although trained on the previous Olmo generation, Bolmo 1B is competitive with prior byte-level models of similar size, outperforming H-Net and slightly trailing BLT 1B, although BLT 1B has substantially more than one billion parameters since @pagnoni-etal-2025-byte do not count the hash embedding parameters. Like Bolmo 7B, Bolmo 1B exhibits performance degradation compared to the source subword model on some tasks, e.g. $-3.2\%$ on MMLU. However, on other tasks, it outperforms OLMo 2 1B, e.g. $+5.1\%$ on Lambada, $+3.3\%$ on CoQA and $+32.5\%$ on CUTE.

```{=latex}
\renewcommand{\arraystretch}{1}
```
Training at Higher Compression Factors {#sec:increased-compression}
--------------------------------------

**Takeaway.** Byteified models can be sped up by adapting the external boundary supervision to encourage a higher number of bytes per patch during training. This creates a way to smoothly trade off efficiency and performance which does not exist for subword-level LMs due to the softmax bottleneck.

![The task performance vs. efficiency Pareto frontier of (i) the source subword-level LM with tokenizer transfer to SuperBPE to achieve higher compression in bytes per patch and (ii) Bolmo models with adapted boundary prediction to achieve higher compression (see Section [5.1](#sec:increased-compression){reference-type="ref" reference="sec:increased-compression"}). The subword-level LM breaks off the frontier as the cost of the softmax starts to dominate for larger vocabulary sizes; byte-level LMs take over the frontier at that point, as seen in the optimal region around the top-left corner.](figures/compression_vs_perf.pdf){#fig:compression_vs_perf width="\\linewidth"}

A substantial advantage of byteified models is that, unlike subword-level LMs, they are not restricted to a fixed, finite set of patches. So far, we have not exploited this advantage since our primary goal was mimicking the tokenization of the source subword model. We now investigate whether we can leverage the increased freedom in our choice of the patching strategy to train a faster model by encouraging a higher average number of bytes per patch. In particular, we experiment with ways to change the external boundary supervision from the original subword tokenization boundaries $\mathcal{B}_\text{subword}$ to a subset of those boundaries. We fix a target compression ratio $t$ of average bytes per patch. We then remove subword boundaries from $\mathcal{B}_\text{subword}$ (i.e., merge subword tokens) until the desired compression ratio is achieved. We experiment with three merging strategies.

-   **BPE**. We iteratively merge the most common pair of tokens as in Byte Pair Encoding [@sennrich_neural_2016]. In contrast to conventional BPE, we apply BPE per-example instead of over the entire corpus.[^18] This is inspired by the work of @feher-etal-2025-retrofitting, which has shown that it is possible to retrofit language models to operate over BPE merges of the tokens in their vocabulary.

-   **Entropy**. We use a small auxiliary 370M parameter subword-level LM[^19] to compute next-token entropies for every token. We then iteratively merge the pair of patches which, when summing their individual entropies, results in the lowest entropy among all entropy sums of pairs of patches in the example.

-   **Cross-Entropy**. We use the same small auxiliary LM as for entropy-based merging, but instead of merging the pair of tokens with the lowest total entropy, we iteratively merge the pair of tokens with the lowest total *cross-entropy* w.r.t. the next token in the data.
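The per-example BPE variant can be sketched as follows (illustrative; real patches are byte sequences, and ties in pair counts would need a deterministic tie-break):

```python
from collections import Counter

def merge_to_compression(patches, target_bytes_per_patch):
    """Per-example BPE: repeatedly merge the most frequent adjacent pair of
    patches until the average bytes-per-patch reaches the target ratio.
    `patches` is a list of strings (e.g. the original subword tokens)."""
    total_bytes = sum(len(p.encode()) for p in patches)  # invariant under merging
    while len(patches) > 1 and total_bytes / len(patches) < target_bytes_per_patch:
        pairs = Counter(zip(patches, patches[1:]))
        best = pairs.most_common(1)[0][0]
        merged, i = [], 0
        while i < len(patches):
            if i + 1 < len(patches) and (patches[i], patches[i + 1]) == best:
                merged.append(patches[i] + patches[i + 1])
                i += 2
            else:
                merged.append(patches[i])
                i += 1
        patches = merged
    return patches
```

The entropy- and cross-entropy-based variants follow the same loop but pick the pair to merge by the auxiliary LM's scores instead of pair frequency.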

In the case of entropy- and cross-entropy-based merging, the auxiliary LM is only required at training time to supervise the boundary predictor [as in DTP; @nawrot-etal-2023-efficient]. Unlike BLT [@pagnoni-etal-2025-byte], we do not need to retain the auxiliary LM for inference.

Even though the loss is discontinuous w.r.t. the parameters of the boundary predictor and we do not employ any technique to backpropagate through the discrete boundary predictions, we observe stable training without loss spikes with all of the above merging methods. An important nuance is that the supervision target compression ratio $t$ is not attained by the model. Despite the boundaries not being learned end-to-end, the model learns to trade off boundary prediction accuracy with the main next-byte prediction loss, like other multitask models which learn to balance performance on the constituent tasks [see e.g. @zhang2021surveymultitasklearning]. An important hyperparameter is thus the factor $\lambda_{\mathcal{B}}$ controlling the importance of the boundary prediction task; we keep $\lambda_{\mathcal{B}} = 4$ from Stage 1 training and report the attained compression ratio $c$ in addition to the target compression ratio $t$.

As the baseline, we increase the bytes per patch of the subword-level LM via tokenizer transfer to SuperBPE tokenizers [@liu2025superbpespacetravellanguage]. Here, we train SuperBPE tokenizers on top of the OLMo 2 tokenizer to reach vocabulary sizes of $\{200k, 400k\}$ using the same 10GB text sample as @liu2025superbpespacetravellanguage for tokenizer training. We use FOCUS [@dobler-de-melo-2023-focus] to initialize the embeddings of the new superword tokens.

Results are shown in Figure [3](#fig:compression_vs_perf){reference-type="ref" reference="fig:compression_vs_perf"}. Through transfer to SuperBPE, we can speed up the subword-level LM while largely retaining performance. However, at some vocabulary size threshold, the subword-level LM breaks off the frontier as the softmax begins to dominate the FLOPs (for OLMo 2 1B, this happens somewhere between a vocabulary size of 200k and 400k tokens). Byte-level LMs do not suffer from the softmax bottleneck, which enables increasing efficiency without bound at a smooth dropoff in performance. Interestingly, BPE merges outperform entropy and cross-entropy merges, in contrast with prior work using entropy-based patch boundaries [@nawrot-etal-2023-efficient; @pagnoni-etal-2025-byte]. We believe this may be a pattern specific to the byteifying setting, since the BPE merging strategy requires the fewest distinct merges to achieve any target compression (and is thus, in this sense, the closest to the pretrained model). Additional investigation with training from scratch would be necessary to validate this hypothesis.

Post-Training Byteified Models via Task Arithmetic {#sec:zero-cost-post-train}
--------------------------------------------------

**Takeaway.** An existing subword-level post-trained checkpoint can be merged into a byteified model via Task Arithmetic [@ilharco2023editing] to post-train the byteified model with zero extra training cost.

![Byteified models can be post-trained by leveraging an existing (subword-level) post-trained Olmo 3 checkpoint; shown is the performance on IFEval of the base Olmo 3 model ($\theta_{\text{PT}}$), the base Bolmo ($\theta_{\text{Bolmo}}$), a post-trained Olmo 3 checkpoint ($\theta_{\text{IT}}$), and the result of merging the post-trained checkpoint into Bolmo.](figures/zero_cost_it.png){#fig:zero_cost_it width="0.8\\linewidth"}

Byteification adds a new component (a byteified model) to the ecosystem around the source LM. A natural question is: how does this new component interact with the other components of the source LM ecosystem? To answer this question, we investigate whether we can merge existing post-trained versions of Olmo 3 into Bolmo to post-train it without any extra training cost. We use the Olmo 3 checkpoint directly post-trained on instruction following via RL [RL-Zero; @olmo3] in Deepseek-R1 style [@deepseekai2025deepseekr1incentivizingreasoningcapability] as a case study. We find that we can infuse the instruction following capabilities of this checkpoint into Bolmo via Task Arithmetic [@ilharco2023editing] by adding the weight difference between the Transformer layers of the post-trained checkpoint and the base Olmo 3 to the corresponding layers of Bolmo (see Figure [4](#fig:zero_cost_it){reference-type="ref" reference="fig:zero_cost_it"}).

While the base Bolmo model originally performs worse than Olmo 3 on IFEval (31.1% vs. 35.4%), merging via Task Arithmetic lifts its performance to be on par with the original post-trained checkpoint (67.4% vs. 66.9%). We conclude that it is possible to utilize components of the subword-level LM ecosystem to improve the corresponding byteified model. This removes the prerequisite of byte-level LM support in the infrastructure that subword-level LM post-training has benefitted immensely from [e.g., @lambert2024tulu3; @piche2025pipelinerlfasteronpolicyreinforcement] and substantially speeds up iteration times.
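The merge itself is plain parameter arithmetic over state dicts, sketched below (illustrative names; real checkpoints hold tensors rather than Python lists):

```python
def task_arithmetic_merge(bolmo, base, post, alpha=1.0):
    """Merge a post-trained subword checkpoint into the byteified model:
    for every global-model (Transformer) weight shared with the subword LM,
    add the difference between the post-trained and base subword weights.
    Local encoder/decoder, boundary predictor and LM head keep their
    byteified weights. Arguments are dicts of parameter name -> weights."""
    merged = dict(bolmo)
    for name in base:
        if name in merged:  # only parameters with a subword counterpart
            merged[name] = [
                w + alpha * (p - b)
                for w, p, b in zip(bolmo[name], post[name], base[name])
            ]
    return merged
```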

A subtle requirement to post-training byteified models via Task Arithmetic is *embedding resettability*: since we only have a one-to-one correspondence between the parameters of the source subword-level LM and the parameters of the global model $\mathcal{M}$, we can only easily adapt $\mathcal{M}$ via Task Arithmetic. The local encoder $\mathcal{E}$ and decoder $\mathcal{D}$ remain in the base model space. Whether the post-training transfer is successful thus depends on whether the base input embedding space (occupied by the local encoder $\mathcal{E}$ and the input embedding matrix of the base model) and the base output embedding space (occupied by the local decoder $\mathcal{D}$ and the output embedding matrix of the base model) remain compatible with post-trained inner Transformer layers; in other words, whether *resetting the embeddings of the post-trained model to the base model embeddings preserves performance*. We find this to be generally the case --- and more so for larger models --- although not always (see Appendix [12](#appendix:resettability){reference-type="ref" reference="appendix:resettability"}). Designing post-training methods to preserve compatibility among components of the LM ecosystem is a promising area of research, with some encouraging early findings [e.g., @shenfeld2025rlsrazoronlinereinforcement].

Ablations {#sec:ablations}
=========

Impact of Non-Causal Patch Boundaries {#sec:boundary_ablation}
-------------------------------------

**Takeaway.** Causal boundary predictors have to choose between either matching the subword tokenizer boundaries or matching the subword patch content; non-causal boundary predictors can do both, substantially improving downstream performance.

Our largest deviation from prior byte-level architectures is non-causal boundary prediction. As per Section [3.1.1](#sec:non_causal_boundaries){reference-type="ref" reference="sec:non_causal_boundaries"}, causal boundary prediction suffers from a conundrum: we either predict the start of every subword patch, which is easy but creates an offset of one byte w.r.t. the patches passed to the original subword model while also making the patches less semantically coherent, or we predict the end of every subword patch, which is hard, especially since this task has to be performed by the shallow local encoder. In contrast, non-causal boundary prediction allows vastly simplifying the task by using future context (in our case, one future byte). This way, the shallow local encoder has enough capacity to perform well. Figure [5](#fig:boundary_ablation){reference-type="ref" reference="fig:boundary_ablation"} quantifies this phenomenon: by predicting the patch end with future context, the patch end prediction task becomes easy, while retaining patches which are coherent and compatible with the global model. The remaining gap to the source subword-level LM is primarily caused by the non-causal boundary predictor still attaining less than 100% accuracy (see Appendix [9](#appendix:ablations){reference-type="ref" reference="appendix:ablations"} for details); future work on designing the boundary predictor, potentially using more future context than a single byte, could close this gap. It would even be possible to retain the subword tokenizer for boundary prediction of the prefill. However, this would re-introduce reliance on an external tokenizer, add tokenization bias (see Section [2](#sec:rw){reference-type="ref" reference="sec:rw"}), and make training at higher compression factors (see Section [5.1](#sec:increased-compression){reference-type="ref" reference="sec:increased-compression"}) harder.

![Boundary supervision by predicting the subword *patch start* or *patch end* using a *causal* or *non-causal* boundary predictor. Shown are the avg. task performance *(left)*, cos. dist. of the local encoder representations to the target subword representations *(middle)*, and the percentage of bytes where the predicted boundary differs from the true subword boundary *(right)* after Stage 1 training. Causal boundary predictors can achieve either accurate boundaries or accurate representations, but not both; non-causal boundaries enable both.](figures/boundary_ablation.png){#fig:boundary_ablation width="0.8\\linewidth"}

Is Stage 1 Training Necessary? {#sec:stage1_ablation}
------------------------------

**Takeaway.** Stage 1 training improves performance, but is not strictly necessary to obtain a good final run. The key benefit of Stage 1 training is speeding up iteration times.

Training in two stages adds implementation complexity. Can we not simply train everything end-to-end in a single stage and let the optimization process do the work? To address this question, we run experiments where we immediately train all parameters, initializing the local encoder, local decoder, boundary predictor and LM head randomly, and the parameters of the global model from the subword-level LM, i.e., starting directly from Stage 2. A fair comparison of Stage 2-only training with Stage 1 + Stage 2 training is difficult: Stage 1 training requires fewer FLOPs since we only backpropagate through a fraction of the global model (cf. Section [3.2.1](#sec:stage1){reference-type="ref" reference="sec:stage1"}), and is more memory efficient since we only need to store a small fraction of the optimizer states by omitting training of the global model. We account for this difference by approximately FLOP-matching and disregarding the memory mismatch: Stage 1 needs approximately $2 \times \text{FLOPs}_\mathcal{M}$, whereas Stage 2 needs approximately $3 \times \text{FLOPs}_\mathcal{M}$ (1x for the forward and 2x for the backward pass through the global model). We thus add $9.8B \times 2/3 = 6.5B$ tokens to Stage 2 training when omitting Stage 1 (increasing the length of Stage 2 by $17\%$). In practice, we believe the factor of $2/3$ may slightly favor the Stage 2-only run since the memory requirements for Stage 1 are lower (permitting a larger batch size) and inference-specific optimizations could be used to speed up the forward pass of the subword-level LM used in Stage 1.
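The FLOP-matching arithmetic above can be written out explicitly (a sketch of the accounting in this section, with the FLOPs of one pass through the global model $\mathcal{M}$ in arbitrary units):

```python
# FLOP accounting for matching Stage 2-only training against Stage 1 + Stage 2.
flops_global = 1.0              # one pass through the global model M (arbitrary units)
stage1_cost = 2 * flops_global  # forward through M, but no backward through M
stage2_cost = 3 * flops_global  # 1x forward + 2x backward through M

stage1_tokens = 9.8e9
# FLOP-equivalent number of Stage 2 tokens to add when omitting Stage 1.
extra_stage2_tokens = stage1_tokens * stage1_cost / stage2_cost  # 9.8B * 2/3 ~= 6.5B

stage2_tokens = 39.3e9
print(f"Stage 2 lengthened by {extra_stage2_tokens / stage2_tokens:.0%}")  # 17%
```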

Figure [6](#fig:stage1_vs_no_stage1){reference-type="ref" reference="fig:stage1_vs_no_stage1"} compares the training trajectory of runs with vs. without Stage 1 training. There are two main takeaways: (i) the 1B model benefits more from Stage 1 training than the 7B model, indicating that larger models may be more robust to catastrophic forgetting caused by large gradients at the start of training when starting directly with Stage 2, and (ii) the bits-per-byte gap narrows throughout the training trajectory but remains in favor of adding Stage 1; it is not clear how this behavior is influenced by the learning rate schedule, so we cannot easily extrapolate to higher token budgets. Since the absence of Stage 1 does not cause catastrophic degradation, we believe it is a reasonable hypothesis that Stage 1 training becomes less important with larger token budgets; however, this might be influenced in nontrivial ways by factors such as the choice of data mix.

In summary, Stage 1 improves performance compared to matched Stage 2-only training, but is not strictly necessary. A key benefit of Stage 1 is streamlining experimentation: quickly obtaining a checkpoint which should come close to the subword-level LM's performance creates a substantially shorter feedback loop than repeatedly running full training experiments.

![Ratio of bits-per-byte throughout training of runs without Stage 1 to runs with Stage 1. For runs with Stage 1, we exclude the Stage 1 loss trajectory. For runs without Stage 1, we exclude the first $9.8B \times 2/3 = 6.5B$ tokens, resulting in a comparable trajectory over the remaining $39.3B$ tokens; a ratio $>\!1$ implies that Stage 1 is beneficial.](figures/stage1_vs_no_stage1.png){#fig:stage1_vs_no_stage1 width="0.8\\linewidth"}

Selecting the Right Local Model Architecture for Fast Inference {#sec:optimizing_inference}
---------------------------------------------------------------

**Takeaway.** FLOP-derivative measurements (total training/inference FLOPs or FLOPs/byte) are a suboptimal proxy for model efficiency. We recommend primarily using wallclock inference time measurements to guide byte-level LM architecture choices.

Previous work on byte-level LMs largely compares against subword-level LMs by matching the total amount of training or inference (i.e., prefill) FLOPs [e.g. @pagnoni-etal-2025-byte] or FLOPs/byte [@hwang2025dynamicchunkingendtoendhierarchical]. This provides an incomplete picture. As observed by prior work [e.g. @ma2018shufflenetv2practicalguidelines], FLOPs do not necessarily correlate with inference speed; some sources of FLOPs are inherently more amenable to being computed efficiently on today's hardware than others, and decoding in Transformers is typically memory bound. We thus largely used inference speed measurements to guide our choice of local model architecture. Figure [7](#fig:wallclock){reference-type="ref" reference="fig:wallclock"} shows prefilling latency (time to first byte) and decoding throughput (bytes/s) measurements of our chosen architecture, as well as various candidate local model architectures we explored.

The chosen architecture using mLSTM [@beck2025xlstm] achieves competitive speeds: 125 bytes/s when decoding vs. 150 bytes/s for the subword model at the same compression, and 1s to prefill 72K bytes vs. 0.8s to prefill the tokens corresponding to the same number of bytes for the subword model. In addition, Bolmo can be made faster by training at arbitrarily higher compression factors (in contrast to subword-level LMs, see Section [5.1](#sec:increased-compression){reference-type="ref" reference="sec:increased-compression"}), and starts surpassing the subword model in inference efficiency at 6.6 bytes per patch. As shown in Figure [7](#fig:wallclock){reference-type="ref" reference="fig:wallclock"} *(right)*, mLSTM as implemented in Tiled Flash Linear Attention [TFLA; @beck:25tfla] achieves substantially higher wallclock decoding throughput than Mamba2 and Gated DeltaNet at the same FLOPs/byte. Relying purely on FLOPs to guide architecture choices would thus have potentially resulted in suboptimal inference speed due to the inconsistent correlation between the two ($R^2\approx 0.63$ to $0.66$ in our experiments).
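The crossover at higher compression can be illustrated with a simple latency model: suppose each decoded byte costs a fixed local-model time and each patch additionally costs one global-model invocation. The constants below are not measurements; they are back-solved so that the toy model reproduces the 125 bytes/s figure at the source compression (assumed here to be 4.4 bytes per patch) and the 150 bytes/s crossover at 6.6 bytes per patch:

```python
# Illustrative decoding-latency model; constants are back-solved assumptions,
# not measured values.
t_local = 4e-3     # seconds of local encoder/decoder work per byte (assumed)
t_global = 17.6e-3 # seconds per global model invocation, i.e. per patch (assumed)

def bytes_per_second(bytes_per_patch: float) -> float:
    # The global model is invoked once per patch, so its cost is amortized
    # over more bytes as the compression factor grows.
    time_per_byte = t_local + t_global / bytes_per_patch
    return 1.0 / time_per_byte

print(bytes_per_second(4.4))  # 125 bytes/s at the source compression
print(bytes_per_second(6.6))  # ~150 bytes/s: crossover with the subword model
```

The local-model term bounds the achievable speedup: as `bytes_per_patch` grows, throughput saturates at `1 / t_local`, which is why keeping the local models fast matters even at high compression.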

FLOP-matching is further complicated by the need to decide how to count FLOPs, which is nontrivial in practice. For example, the popular FLOP formulas from @hoffmann2022training assume a matrix multiplication of the input embeddings with the one-hot encoded input tokens. This is arguably not in line with hardware realities since the input embeddings can be computed via an extremely fast lookup operation, so counting the associated FLOPs can cause systematic biases.[^20] Additionally, the chunk size used to partially parallelize linear RNN training inherently provides a way to use more FLOPs to achieve faster training [via higher parallelization; as in @dao2024transformers; @yang2025gated], which further muddies the relationship between FLOPs and wallclock time.
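To make the embedding-FLOP issue concrete: under Hoffmann-style counting, the one-hot embedding matmul is charged $2 \times d \times |V|$ FLOPs per token even though it is a table lookup in practice. The numbers below follow the example in footnote 20:

```python
# Hoffmann-style counting charges the input embedding as a one-hot matmul:
# 2 * d * |V| FLOPs per token, although hardware performs a table lookup.
d, vocab, bytes_per_patch = 1536, 128256, 4.6  # values from footnote 20

emb_flops_per_token = 2 * d * vocab
emb_gflops_per_byte = emb_flops_per_token / bytes_per_patch / 1e9
print(f"{emb_gflops_per_byte:.3f} GFLOPs/byte")  # 0.086 GFLOPs/byte
```

Against the 0.42 GFLOPs/byte attributed to the subword baseline, this phantom embedding cost alone accounts for roughly a fifth of the total, while the byte-level model's tiny vocabulary makes its corresponding term negligible.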

![*(left)*: decoding throughput (bytes/s) and prefilling latency (time to first byte) for Bolmo and the source subword model across compression factors and prefill lengths; Bolmo overtakes Olmo 3 7B at a compression of 6.6 bytes per patch. *(right)*: decoding throughput and prefilling latency for 18.0K prefill bytes across candidate local model architectures and for Bolmo's final chosen architecture. Recorded with $\text{batchsize}\!=\!1$ on H100 GPUs.](figures/wallclock.png){#fig:wallclock width="\\linewidth"}

Conclusion {#sec:conclusion}
==========

We have introduced byteification as a complementary direction to training from scratch for developing byte-level LMs. Byteification let us create Bolmo, the first fully open family of byte-level LMs on par with or surpassing state-of-the-art subword-level LMs at the 1B and 7B parameter scales. Bolmo benefits from architectural and training decisions specifically designed for byteifying, and comes close to matching subword-level LMs in inference speed. We have further explored byte-level models' increased flexibility, such as arbitrarily decreasing token granularity for faster inference. Byteifying also lets us leverage other components of the ecosystem around the source subword model, for example by byteifying post-trained models zero-shot once the corresponding base model is byteified. Overall, byteifying finally makes byte-level LMs a practical choice competitive with subword-level LMs, and enables future research directions on byte-level LMs in both the byteification setting and training from scratch.

Future Directions {#sec:future_directions}
=================

We believe Bolmo enables a number of future research directions, bits of which are sketched below.

#### Bit 0. Investigating how architectures optimized for byteification perform when training from scratch.

We have restricted ourselves purely to the byteification setting. For example, we have not assessed how non-causal patch boundaries perform when training from scratch. We expect that the increased expressivity of the boundary predictor might be generally useful, but we do not yet know.

#### Bit 1. Learning non-causal boundaries end-to-end.

We have trained our boundary predictor purely through direct external supervision --- either to match subword tokens, or to match merges over subword tokens. We believe learning non-causal boundary predictors end-to-end is a highly promising direction. For example, boundaries could be learnt end-to-end during post-training of a byteified model via RL, or by adapting @hwang2025dynamicchunkingendtoendhierarchical's method of enabling gradient flow through the boundary predictor to the non-causal setting.

#### Bit 2. Scaling patch size and local model capacity.

We have designed the local models of Bolmo to minimize inference speed degradation when keeping the same patch size as the original subword model, since we have focused mainly on byteifying while keeping the patching constant. However, jointly using larger local models and a larger patch size might yield a better performance vs. efficiency tradeoff, as suggested by @pagnoni-etal-2025-byte and @pmlr-v267-huang25bb.

#### Bit 3. Multi-byte prediction.

While multi-token/byte prediction has been used to great effect to speed up language models [@gloeckle2024better; @cai2024medusa; @grivas2025fastexpressivemultitokenprediction among others], Bolmo only predicts the immediate next byte. It is not clear how many sequential invocations of the global model multi-byte prediction could save; however, even saving sequential local model computations could lead to substantial speedups and permit larger local models, synergizing with Bit 2.

#### Bit 4. Non-destructive byteification.

As per Appendix [9](#appendix:ablations){reference-type="ref" reference="appendix:ablations"}, the remaining gap between the performance of Bolmo and the original model can to a large extent be attributed to the continued training setup generally hurting performance. Investigating ways to make continued training less destructive, such as PEFT methods [e.g. @hu2022lora; @pfeiffer2023modular], could be promising.

#### Bit 5. Specialized sampling methods.

Subword-level language models have benefitted from a range of sampling methods which have been to various extents designed for, or at least empirically validated on, predominantly subword-level LMs [e.g. @Holtzman2020The; @meister-etal-2023-locally; @minh2025turning]. We have not investigated how these methods transfer to Bolmo. Developing specialized sampling methods for byteified models, for instance by adjusting the sampling strategy based on the position of the current byte within the patch, is also an intriguing topic.

#### Bit 6. More equitable input units.

Bolmo operates over UTF-8 bytes, which is a highly Latin-centric atomic unit [@limisiewicz-etal-2024-myte]. We believe that the dynamic latent tokenization can to some extent \`amortize' over the choice of the atomic unit, but it is not clear to what extent this is possible, and to what extent Bolmo-style models inherit the biases of their underlying encoding. Future work could investigate this, alongside alternative choices for the atomic unit such as MYTE [@limisiewicz-etal-2024-myte] or SCRIPT [@land2025bpestaysscriptstructured].

#### Bit 7. Batched inference optimizations.

We have shown that Bolmo can achieve throughputs competitive with subword-level LMs in the $\text{batchsize}\!=\!1$ setting, which is sufficient for edge applications. However, achieving fast batched inference of byteified models by applying e.g. PagedAttention [@kwon2023efficientmemorymanagementlarge] and continuous batching [@orca] will be necessary to unlock a wider range of applications. Here, there are additional challenges for byte-level models caused by their dynamicity (a fixed number of tokens across examples corresponds to a variable number of bytes and vice versa) which require additional work.[^21]

Acknowledgments {#acknowledgments .unnumbered}
===============

We thank the Beaker team at Ai2 for providing and maintaining the training infrastructure. We thank Tyler Romero for helpful discussions on inference efficiency, David Heineman for help with the evaluation infrastructure, Will Merill for useful discussions on linear RNNs, and Alisa Liu for useful discussions on tokenization. This work has been supported by the UK EPSRC grant `EP/T02450X/1`, and resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. Edoardo M. Ponti is supported by the ERC Starting Grant AToM-FM (101222956). We acknowledge the National Artificial Intelligence Research Resource (NAIRR) Pilot and Microsoft Azure for contributing to the results in this work.

Additional Ablations {#appendix:ablations}
====================

#### Analyzing the choice of boundary predictor.

Table [1](#tab:boundary_ablation){reference-type="ref" reference="tab:boundary_ablation"} compares various choices for the boundary predictor, confirming that fused non-causal boundary prediction of the patch end is best for byteifying. Additionally, analyzing the performance under the original subword (\`oracle') boundaries shows that the remainder of the gap to the source model after Stage 1 can mostly be explained by the remaining small percentage of boundary predictor errors.

#### Comparing byteification to standard continued training.

Table [\[tab:byteify\_vs\_ct\]](#tab:byteify_vs_ct){reference-type="ref" reference="tab:byteify_vs_ct"} compares Bolmo to an Olmo 3 model with continued training on the same data under the same training settings (same batch size, optimizer, etc., see Table [\[tab:training\_hyperparameters\]](#tab:training_hyperparameters){reference-type="ref" reference="tab:training_hyperparameters"}). Continued training without byteification generally degrades performance, potentially due to forgetting caused by a narrower data mix and a suboptimal training procedure. A notable exception is character understanding, where the model improves due to the training data targeting this skill (Appendix [11](#appendix:cute){reference-type="ref" reference="appendix:cute"}), but remains worse than Bolmo. While some gap between the byteified model and the model with continued training persists, we believe a promising direction to improve byteifying is thus to apply techniques which generally make training less prone to forgetting, such as PEFT methods [e.g. @hu2022lora; @pfeiffer2023modular].

::: {#tab:boundary_ablation}
              $\mathcal{E}$ Sim.   $\mathcal{B}$ Acc.    $L/G$      ARC        MMLU       CSQA       HS         WinoG      SocialIQA   PIQA       B.Skills   CUTE       Avg.
  ---------- -------------------- -------------------- --------- -- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ---------- ---------- ----------
  OLMo2 1B            \-                   \-             \-        61.4       40.4       66.0       68.9       65.2       55.1        76.4       72.9       27.5       59.3
                                                                                                                                                                        
  ,                  97.5                 100             8.8       57.4       35.4       64.4       65.0       65.1       54.0        74.5       66.2       19.6       55.7
  ,                  99.7                 100             8.8       59.0       36.7       64.1       67.4       65.1       53.2        74.6       71.3       30.8       58.0
  ,                  99.8                 100             9.8       58.5       36.8       63.4       68.0       66.1       53.6        74.6       71.6       31.2       58.2
  ,                  99.8                 100             8.8       58.2       36.8       62.6       67.7       66.9       52.4        75.0       70.7       30.7       57.9
                                                                                                                                                                        
  ,                  97.5               **99.2**        **8.8**     54.1       33.4       **63.6**   57.9       64.1       **52.9**    70.4       64.7       19.6       53.4
  ,                  99.7                 96.0          **8.8**     45.9       29.3       43.4       41.8       57.5       44.0        62.4       50.8       7.8        42.5
  ,                **99.8**             **99.2**          9.8       **56.2**   **34.7**   61.3       **61.4**   64.2       52.4        71.6       **69.3**   **29.8**   **55.7**
  ,                **99.8**             **99.2**        **8.8**     **56.2**   34.6       61.9       **61.4**   **65.0**   51.7        **71.9**   69.0       28.8       55.6

  : Comparison of various boundary prediction settings after Stage 1 training across oracle (subword) boundaries and the boundaries as predicted, predicting the patch start causally (, ), patch end causally (, ), patch end non-causally using a separate boundary symbol (, ) and patch end non-causally with fused boundaries (, ), our chosen setting. $\mathcal{E}$ Sim. = the cosine similarity of the pooled local encoder representations to the corresponding subword embeddings, $\mathcal{B}$ Acc.= accuracy of the boundary predictor. $L/G$ = average number of local model invocations per global model invocation. **Boldface** indicates the best result per column.
:::

```{=latex}
\renewcommand{\arraystretch}{1}
```
Benchmark Details {#appendix:benchmarks}
=================

We utilize OLMES [@olmo20242olmo2furious] for all evaluations. See Table [\[tab:task-details\]](#tab:task-details){reference-type="ref" reference="tab:task-details"} for details on our 7B evaluation suite, and Table [\[tab:easy-task-details\]](#tab:easy-task-details){reference-type="ref" reference="tab:easy-task-details"} for details on our 1B evaluation suite.

```{=latex}
\renewcommand{\arraystretch}{1}
```
| Task         | Capability               | ICL           | Format         | Metric | \# Sub |
|--------------|--------------------------|---------------|----------------|--------|--------|
| ARC          | Science QA               | 5             | RC (pmi)       | Acc    | 2      |
| MMLU         | General QA               | 5             | RC (per-char)  | Acc    | 57     |
| CSQA         | Commonsense QA           | 5             | RC (pmi)       | Acc    | -      |
| HellaSwag    | Language Modeling        | 5             | RC (per-char)  | Acc    | -      |
| WinoGrande   | Language Modeling        | 5             | RC (none)      | Acc    | -      |
| SocialIQA    | Social QA                | 5             | RC (per-char)  | Acc    | -      |
| PiQA         | Physical QA              | 5             | RC (per-char)  | Acc    | -      |
| CoQA         | Conversation QA          | 0$^\dagger$   | RC (pmi)       | Acc    | -      |
| DROP         | Passage QA               | 5             | RC (per-char)  | Acc    | -      |
| Jeopardy     | Trivia QA                | 5             | RC (per-char)  | Acc    | -      |
| NaturalQs    | General QA               | 5             | RC (per-char)  | Acc    | -      |
| SQuAD        | General QA               | 5             | RC (per-char)  | Acc    | -      |
| SciQ         | Science QA               | 5             | RC (per-char)  | Acc    | -      |
| QASPER       | Science QA               | 5             | RC (none)      | Acc    | -      |
| Basic Skills | Basic QA                 | 5             | RC (per-token) | Acc    | 6      |
| DBQA         | Science QA               | 5             | RC (per-char)  | Acc    | -      |
| ProtocolQA   | Science QA               | 5             | RC (per-char)  | Acc    | -      |
| Lambada      | Language Modeling        | 0             | RC (none)      | Acc    | -      |
| MedMCQA      | Medical QA               | 5             | RC (per-char)  | Acc    | -      |
| MedQA        | Medical QA               | 5             | RC (per-char)  | Acc    | -      |
| SciRIFF      | Science QA               | 5             | RC (none)      | Acc    | -      |
| CUTE         | Character Understanding  | 4             | Greedy Cont.   | Acc    | -      |

CUTE-Style Training Data {#appendix:cute}
========================

To encourage models trained on our data mix to learn information about the characters within a word, we generate 75M tokens (0.04% of the training data) of tasks requiring character-level understanding using the [CUTE repository](https://github.com/Leukas/CUTE). Tasks include spelling out words, reversing words, and swapping, deleting, or substituting characters within a word, given words from a source wordlist. We use a list of $n = 150000$ words, ensuring zero overlap with the CUTE test words to avoid contamination. This data is purely in English. We do not use any multilingual character understanding data, but still observe large improvements on the multilingual EXECUTE benchmark (see Section [5](#sec:results){reference-type="ref" reference="sec:results"}), suggesting that texts requiring character-level understanding help the model acquire generalizable knowledge about the characters within words. We observed that byte-level models otherwise do not acquire this knowledge through our short training schedule. However, training for longer, on more diverse data, or with larger local models could act as alternative routes to acquiring character-level knowledge.
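A hedged sketch of what such character-level tasks can look like; the actual data is generated with the CUTE repository and a 150K-word list, so the task phrasings and the helper below are purely illustrative:

```python
import random

def make_tasks(word: str, rng: random.Random) -> dict[str, str]:
    """Generate CUTE-style character-understanding prompts for one word
    (illustrative phrasings, not the actual training data)."""
    chars = list(word)
    i, j = sorted(rng.sample(range(len(chars)), 2))
    swapped = chars[:]
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return {
        f"Spell out '{word}':": " ".join(chars),       # spelling
        f"Reverse '{word}':": word[::-1],              # reversal
        f"Swap characters {i} and {j} in '{word}':": "".join(swapped),
        f"Delete character {i} from '{word}':": word[:i] + word[i + 1:],
    }

tasks = make_tasks("language", random.Random(0))
```

In practice each prompt/answer pair would be rendered as plain text and mixed into the pretraining data at the stated 0.04% rate.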

Does Post-Training Byteified Models via Task Arithmetic Always Work? {#appendix:resettability}
====================================================================

As outlined in Section [5.2](#sec:zero-cost-post-train){reference-type="ref" reference="sec:zero-cost-post-train"}, embedding resettability is a crucial prerequisite for post-training byteified models via Task Arithmetic. This is because we can transfer the global model $\mathcal{M}$ to the post-trained space via Task Arithmetic, but we cannot transfer the local models since they do not have corresponding parameters in the post-trained checkpoint.
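A minimal sketch of this transfer, with toy numpy arrays standing in for model parameters: the global model receives the post-training delta, while the local models (which have no subword counterpart) are reused from the byteified base model, implicitly resetting the embeddings to the base embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins: subword base model, its post-trained version, and the
# byteified base model (whose global model was initialized from base).
base = {"global": rng.normal(size=(4, 4)), "emb": rng.normal(size=(8, 4))}
post = {k: v + 0.01 * rng.normal(size=v.shape) for k, v in base.items()}
byteified = {"global": base["global"].copy(), "local": rng.normal(size=(2, 4))}

# Task Arithmetic: move the byteified global model into the post-trained space.
byteified["global"] += post["global"] - base["global"]
# The post-trained embeddings post["emb"] are discarded: the local models
# replace them, which is why embedding resettability matters.
```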

In Figure [8](#fig:emb_reset){reference-type="ref" reference="fig:emb_reset"}, we analyze embedding resettability across a number of models. Resetting the embeddings is possible without substantial performance degradation for a large fraction of the analyzed models, with a weak trend toward larger models being more amenable to it. Additionally, in line with the findings of @shenfeld2025rlsrazoronlinereinforcement, we find models post-trained via RL [the Olmo 3 RL-Zero family; @olmo3] to be closer to the original model; here, resetting the embeddings almost perfectly preserves the original models' performance.

Future work could investigate in more detail when post-training via Task Arithmetic is possible, and whether it is possible to restore the ability to byteify without additional training for post-trained models where this is not the case as-is.

![*(left)*: Cross-Entropy loss for post-trained models, and the same post-trained models with the embeddings reset to the corresponding base model embeddings; loss is computed on examples from the Tulu 3 dataset [@lambert2024tulu3]. *(right)*: Number of model parameters vs. the loss ratio of the model with reset embeddings to the original post-trained model. The number of parameters explains some variance (with larger models being more amenable to reset embeddings), with the remaining variance presumably being due to different post-training choices.](figures/emb_reset.png){#fig:emb_reset width="\\linewidth"}

Embedding Rank Analysis {#appendix:embedding_ranks}
=======================

Figure [9](#fig:emb_rank){reference-type="ref" reference="fig:emb_rank"} shows the explained variance ratio of the singular values of the input and output embedding matrices across a number of models. With one exception (the Qwen3-4B-Base input embeddings), there is a substantial amount of high-rank structure. This makes the embeddings difficult to approximate using lower-dimensional local models. In particular, in the case of the local encoder, the embedding rank has a hard limit given by the dimensionality of the local model if a linear upprojection or padding is used [as done, e.g., by @hwang2025dynamicchunkingendtoendhierarchical]. Concatenation of the local representations, as done by @pagnoni-etal-2025-byte, does not impose a hard limit on the rank but may still limit expressivity.
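The analysis itself is straightforward to sketch; here with a random matrix standing in for a real embedding matrix (toy sizes, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(1024, 256))  # toy |V| x d embedding matrix

# Explained variance ratio of the singular values, as plotted in Figure 9.
s = np.linalg.svd(E, compute_uv=False)  # singular values, descending
explained = (s ** 2) / np.sum(s ** 2)

# A slowly decaying spectrum means low-rank approximations lose a lot:
# the top 10% of components explain far less than 100% of the variance.
top10 = explained[: len(s) // 10].sum()
```

For a real embedding matrix, the same quantity directly upper-bounds how faithfully a local model of dimension $d' < d$ can reproduce the embeddings through a linear upprojection.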

![The explained variance ratio of the singular values of the input and output embedding matrices (normalized by the number of dimensions). The explained variance ratio smoothly decays along the number of components, up until a steep dropoff toward the highest-rank components. This indicates that it is difficult to approximate the embedding matrices using lower-rank structure. A notable exception is Qwen3-4B-Base, which may be more amenable to a lower-dimensional local encoder; we are not so bold as to dare a guess why.](figures/embedding_rank.png){#fig:emb_rank width="\\linewidth"}

Full Hyperparameters {#appendix:hyperparameters}
====================

Full architecture hyperparameters are shown in Table [\[tab:architecture\_hyperparameters\]](#tab:architecture_hyperparameters){reference-type="ref" reference="tab:architecture_hyperparameters"}, and full training in Table [\[tab:training\_hyperparameters\]](#tab:training_hyperparameters){reference-type="ref" reference="tab:training_hyperparameters"}.

[^1]: Tokenization bias is the phenomenon that sequences of tokens can implicitly leak information about the future content of the text they tokenize. For example, assuming a vocabulary of English words, the token sequence $\{\texttt{\_Hello},\texttt{\_Wor}\}$ leaks that there is no $\texttt{ld}$ after $\texttt{\_Wor}$, otherwise the text would have been tokenized as $\{\texttt{\_Hello},\texttt{\_World}\}$ instead. This can lead to unintuitive behavior in practice [see @minixhofer2025universal; @vieira2025from].

[^2]: @graves2013generating may have been the first to model language over UTF-8 bytes; see @mielke2021between for an overview.

[^3]: Although byte-level LMs are sometimes called \`tokenizer-free', it is more correct to say that UTF-8 is the tokenizer, and the vocabulary is the set of 256 distinct bytes.

[^4]: Since UTF-8 is designed primarily for the Latin script, problem (ii) of inefficiency in languages besides English might persist. However, alternative fine-grained units provide a promising alternative [@limisiewicz-etal-2024-myte; @land2025bpestaysscriptstructured].

[^5]: We treat $x$ as a sequence over bytes, i.e. $x \in \{0,\dots,255\}^n$.

[^6]: An alternative to increasing the size and sparsity of the local encoder is using a mixture of experts in the feed-forward layer, although we do not investigate this here.

[^7]: For consistency with prior work, we use the term \`non-causal' to contrast with \`causal' as in causal language models, i.e., causal in the sense of using only unidirectional context, although this is arguably a misnomer.

[^8]: @hwang2025dynamicchunkingendtoendhierarchical refer to the process of creating a single representation for every patch as *routing*, whereas we refer to this more generally as *pooling*, which also encompasses the cross-attention pooling done by @pagnoni-etal-2025-byte.

[^9]: We originally experimented with smaller local dimensions but found the upprojection mechanism to bottleneck performance by restricting the rank of the representations (see Appendix [13](#appendix:embedding_ranks){reference-type="ref" reference="appendix:embedding_ranks"}).

[^10]: The causality constraint on boundaries is referred to as *incrementality* by @pagnoni-etal-2025-byte.

[^11]: Subword tokenizers in principle have unrestricted access to the future, while we use a single byte. In practice, we find one byte of lookahead largely sufficient to match the behavior of subword tokenization. However, we believe future work on larger (or unrestricted) lookaheads could be fruitful.

[^12]: It is not clear whether human notions such as semantic coherence or faithfulness to linguistics should play a role in designing language models, see e.g. @beinborn-pinter-2023-analyzing [@minixhofer-etal-2023-compoundpiece].

[^13]: It is worth noting that predicting the boundary symbol `<b>` is analogous to the output boundary prediction in @fleshman2023toucantokenawarecharacterlevel, although the motivation differs.

[^14]: Using the true subword boundaries instead of the boundaries predicted by $\mathcal{B}_{\text{Bolmo}}$ is necessary to preserve the alignment of the pooled representations to the representations in $\mathcal{T}_\text{subword}(x)$ along the sequence dimension.

[^15]: For example, $-\!\log \text{LMHead}_\text{subword}(\text{..}(x))[i, \text{next\_tok(x, i)}]$ is the cross-entropy of the subword model.

[^16]: We count tokens as tokenized by the [Dolma2 Tokenizer](https://huggingface.co/allenai/dolma2-tokenizer).

[^17]: Not all training token/byte counts of prior byte-level models are public.

[^18]: Although applying BPE per minibatch would also be possible, we choose to apply it per-example to avoid nontrivial dependencies on the batch size.

[^19]: The auxiliary 370M parameter subword-level LM was trained on 74.3B tokens following a downscaled version of the OLMo 2 training and architecture [@olmo20242olmo2furious].

[^20]: Counting the input embedding FLOPs has limited effect if the models being compared have similar vocabulary sizes. However, for example in the case of @hwang2025dynamicchunkingendtoendhierarchical, it overestimates the FLOPs required by the subword-level baseline LM by up to 25%: The GPT3-Large matched Transformer baseline with $d=1536, |V|=128256$ and an average number of $4.6$ bytes per patch is considered to require 0.42 GFLOPs/byte, of which $2 \times 1536 \times 128256 / 4.6 \approx 0.085$ GFLOPs/byte are due to the input embeddings, while a negligible amount of the GFLOPs/byte of the byte-level models are due to the input embeddings.

[^21]: To our knowledge, the only investigation into efficient batched inference so far is through [Aleph Alpha's vllm fork](https://github.com/Aleph-Alpha/vllm).
