---
title: "Toto 2.0 TSALM Workshop Presentation Transcript"
source_video: "/home/ipse/work/iclr2026/downloads/tsalm_workshop_39063681/tsalm_datadog_073212-080537_precise.mp4"
source_slides: "paper_toto-2-tsalm-2026.pdf"
audio: "audio_toto-2-tsalm-2026.mp3"
transcription_model: "gpt-4o-mini-transcribe"
transcribed: "2026-05-16"
---

# Toto 2.0 TSALM Workshop Presentation Transcript

Machine transcript generated with the OpenAI Transcription API from extracted low-bitrate MP3 audio and lightly post-edited for obvious name, benchmark, and model-name errors. Segment headings are approximate offsets from the source recording.

## 00:00:00-00:05:00

Next up, we have the last keynote talk for the workshop, and this was supposed to be in person, but the speaker couldn't make it at the last minute. We have Othmane Abou-Amal, who is an applied AI and AI research director at Datadog, where he drives the company's AI direction across large language models, generative AI, and intelligent systems for observability. He leads the development of AI-powered capabilities that enable engineers to automatically detect, investigate, and resolve complex production issues at scale. His work sits at the intersection of research and production, spanning LLM-based applications, autonomous agents, and time-series modeling. Othmane focuses on building reliable, scalable AI systems that are deeply integrated into real-world engineering workflows, bringing cutting-edge advances in AI to practical use in modern infrastructure and observability platforms. So let's welcome Othmane, and the stage is yours. You can start.

Cool. Thanks, Arjun. Thanks, everyone. Let me share my screen. Can you all see my screen and hear me? Yes. Cool. Okay. Well, thanks, everyone, for being here, and thanks for having me. So I'm Othmane. I lead AI research here at Datadog. And yeah, I was really hoping to be in Rio with you all for this, but I had to stay back in New York for some work. So I hope you're all having an amazing time. And for today, I wanted to tell you about the work that we've been doing on our foundation models for observability, how we scale them, and where we're headed next. So this is the plan for today. I'll start by explaining what observability is and why it matters. Then I'll tell you why we think foundation models are the right path for this domain. Then I have an announcement: Toto 2.0 is coming. We've been working on this for a while, and I'm excited to share it with you today for the first time. And then I'll close with where we're going next, which is multimodal world models for observability. So with that said, let's jump in.

So first, some context. I know most of you in this room work on models and research, so I wanted to give you a feel for the domain that these models that we're working on need to operate in. So Datadog is an observability company. The way you can look at this is that we pull data from across our customers' entire technology stack and let them see inside it. This is application logs, infrastructure metrics, traces of requests, security signals, absolutely everything. And think of it as, if you run software in production, we collect all the telemetry that tells you whether it's healthy, whether it's performant, or how you can optimize it. Oh, okay. Is it better now? Okay, so this was the agenda, the one that I actually presented. Can you see the slides advancing now? Cool, okay. So there you go. This is Datadog. I was talking about how what Datadog does is pretty much that if you're running software in production, we collect the telemetry that tells you whether it's healthy.

So here's why this matters. Like a very classic way by which our customers use Datadog is by defining alerts. So when something goes wrong, an alert fires, and if it's bad enough, it will wake someone up. Think of this as you run an e-commerce website and your checkout endpoints start returning errors. So even if it's 2 a.m., you're bleeding money, so you're paging one of your engineers. Your phone rings, you're groggy, you're stressed, and now you need to figure out what broke, why, and how to fix it.
So our goal with Datadog is to give you the data and the tools that allow you to do all of this. And with AI, our goal is to essentially help that engineer go back to bed faster. And we're working towards a world where we don't need to even wake them up at all. Datadog automates the full cycle of being able to detect that something is broken, being able to investigate and identify a root cause, and then figuring out how to remediate the problem and take an action in order to stabilize the system. And this is not an easy problem. The systems that we're talking about have gotten wildly complex. There are more technologies, more microservices, more complex deployments, more people involved. And if 20 years ago you had a few servers and a release once a quarter, today customers have thousands of containers, on-demand deployment, and a dozen teams touching these systems. And AI...


## 00:05:00-00:10:00

It's effectively making this worse, not better. AI has done an incredible job automating the production of software, but producing software was never the expensive part. Running it, maintaining it, debugging it, optimizing it, that's where engineers spend 70 to 80% of their time. And more AI-generated code means more software in production, which also means that more things can break. And there is a simple problem in there, right? Engineers are now running more and more code that they haven't written themselves, and that they might not fully understand. And as a result, they need more help understanding when things break and how to fix that. And our goal is essentially to automate these types of operations and firefighting for engineers and let them focus on actually building. And you might think, like, how hard can it be, right? If this is an e-commerce website, like in the example that I was using, this is the front end and API and the database, and agents can handle that. Well, in practice, this is what a simple e-commerce website actually looks like in production. Hundreds of services, multiple data stores, event buses, ML models, CDNs, authentication, payment, and every single one of these can fail, and failures cascade. And each one of these components is emitting a massive amount of time series, logs, traces, etc. So being able to diagnose what went wrong in this system, or even decide whether the system is running in a healthy way, is extremely hard, even for the best engineers. So coming back to this 3 a.m. page problem, our goal with AI at Datadog is to close this loop from being able to detect that something is wrong, deciding what's causing it and what to do, and taking an action to bring the system back to health. And this is essentially what motivates our work on foundation models.

So why foundation models for observability data? So one key idea that I want to cover here is that, again, this is not a solved problem. Nobody has fully automated incident detection and response end-to-end out there. The signals are extremely noisy, the systems themselves are complex, the failure modes are extremely diverse, and every customer's environment is different in its own way. So when we started Datadog AI research, the core question for us was, what is our unfair advantage? And our answer to that was that we have something that nobody else has, at least not at our scale. And that's data. Datadog sits at the center of our customers' production systems, and we get to actually see everything in terms of data coming out of their systems. So think of this as trillions of machine-generated data points every single hour across tens of thousands of customers. Plus the human data that tells us what's important and what isn't, and how customers do their work, which alerts matter, which ones don't. But here is the thing. This data is effectively sequential, but it is not all textual data. A lot of it is metrics, time series, or semi-structured events like logs, or even graph structures like traces or service dependency topologies. So this is really kind of like its own modality. So the question became, how can we do for observability what foundation models did for text, for images, for biology? Can we train on our massive corpus of telemetry and build models that understand how production systems behave? And from there, unlock new capabilities, the way we were able to do that for text, images, biology, etc.
So our first answer to that was Toto, a time series foundation model trained on over a trillion data points of observability metrics. So this is work that we announced last year, where we achieved state-of-the-art performance on BOOM, our observability benchmark, and on GIFT-Eval, the standard open benchmark. We open-sourced the model last year, and we got almost 10 million downloads on Hugging Face. And more importantly, it powers a lot of internal product applications here at Datadog. And to paraphrase Ameet Talwalkar, our chief scientist, this was a very exciting moment for us because this was, to an extent, the BERT moment for time series foundation models. We were seeing unsupervised foundation models clearly beating supervised baselines across tasks. So again, this by itself was exciting, but ultimately BERT was important, but it didn't by itself give us ChatGPT or Claude Code or all these amazing capabilities that are reshaping how we work. And if you look at Toto, it's very much kind of BERT scale from a model size perspective. But even if you look at its training data, we're talking about roughly 37 billion patches. So we're potentially still extremely early on the scaling curve. So naturally, we wanted to scale up. And that's what we've been doing. We asked ourselves, can we make scaling work for time series? And this is not an obvious question, right? Most time series foundation models out there ship at a single size. And those that do...


## 00:10:00-00:15:00

ship multiple sizes usually struggle to show that bigger is actually better. So we set out to effectively change that. And today, we're releasing Toto 2.0, a family of open-weight time series models going from 4 million to 2.5 billion parameters. It's number one on BOOM, GIFT-Eval, and TIME. The weights are actually already on Hugging Face, Apache 2.0 licensed, and the full blog post with all of the technical details is being finalized and will drop in a couple of weeks. But here is the headline: the blue chart basically shows MWQL rank against parameter count for every model with published sizes. This is on BOOM, our Datadog observability benchmark, but the same holds for other open benchmarks like GIFT-Eval. Now, there is a lot of great work on this chart from other models as well. So if you look at Moirai, Chronos, TimesFM, these are all very strong models that pushed the field forward. But if you look at the scaling behavior, most model families release multiple sizes that perform roughly the same. And the Toto 2.0 line is different here. As you can see, rank improves at every step from 4 million to 2.5 billion. Every size is better than the last. And this is why we think that this is the GPT-2 moment for time series. We're able to observe that scaling works and that it works reliably.

Before I go into details, I want to give credit to the folks who have been driving this work. So this is the work of Emaad, Gerald, Chris, David, with contributions from the broader Datadog AI research team. These folks that you see here on the screen have done all the hard work. I'm just the one who gets to talk about it today.

So diving a bit deeper, what changed from 1.x? I'll be covering three categories: one, architecture; two, data; and three, how do we scale these models efficiently. So starting with architecture, Toto 2.0 keeps the 1.x core design. It is still a decoder-only patch transformer that alternates time-axis causal attention and variate-axis full attention. But we replaced the output head, patch masking, patch projections, and normalization. So the output head is now a quantile head: nine quantile levels trained with pinball loss. It's effectively more stable than the mixture models that we used for Toto 1.x, especially as we scale. We also adapted contiguous patch masking from TiRex. So instead of generating forecasts one patch at a time, autoregressively, we mask the future horizon and predict all of the patches in a single forward pass. And this effectively avoids compounding errors. And we also replaced linear patch projections with residual MLPs at both ends and switched to an arcsinh normalization that handles the huge dynamic range that we typically have in observability metrics.

Covering the optimizer a bit, this one is pretty interesting as well, and it took us a while to get it right. So when you use pinball loss, which is what quantile regression trains on, the gradients are sign-valued. So what this means is that whether the model is off by 0.01 or off by 100, the gradient has the same magnitude. So the loss gives direction, but it doesn't give you scale. So you find yourself in a position where the optimizer has to do most of the interpretation. It needs to figure out how much to update each parameter from its own internal state. And that's why the choice of an optimizer matters so much here. So what we ended up using is NormMuon for matrix-shaped parameters and AdamW for everything else.
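To make the pinball-loss point above concrete, here is a minimal sketch of a multi-quantile pinball loss and an arcsinh normalization. This is only an illustration of the two ideas described in this segment, not Toto 2.0's actual code; the framework (PyTorch), the specific quantile values, and the function names are assumptions.

```python
import torch

# Nine quantile levels, as mentioned in the talk; the specific values are an assumption.
QUANTILES = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])


def pinball_loss(pred: torch.Tensor, target: torch.Tensor,
                 quantiles: torch.Tensor = QUANTILES) -> torch.Tensor:
    """Quantile (pinball) loss.

    pred:   (..., num_quantiles) forecasts, one column per quantile level
    target: (...) ground-truth values

    Note the property mentioned in the talk: the gradient with respect to a
    prediction is either -q or (1 - q), depending only on whether we over- or
    under-predict, never on how far off we are. The loss gives direction, not scale.
    """
    err = target.unsqueeze(-1) - pred  # positive when we under-predict
    return torch.maximum(quantiles * err, (quantiles - 1.0) * err).mean()


def arcsinh_normalize(x: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """arcsinh squashes the huge dynamic range typical of observability metrics
    (request counts, latencies, bytes) while staying roughly linear near zero."""
    return torch.asinh(x / scale)
```

In this toy setup, a model with a nine-column quantile head would be trained with something like `pinball_loss(model(arcsinh_normalize(x)), arcsinh_normalize(y))`; the real training loop, patching, and masking are of course more involved.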
So NormMuon combines Muon's matrix-level preconditioning, which captures how weights within the same neuron relate to each other, with per-neuron normalization that prevents any single neuron from dominating. And as you can see on these curves, NormMuon gives us a clear win over AdamW and vanilla Muon, both at the 1B and the 2.5B scale. And by the way, this is the same split that Karpathy uses in NanoChat to train GPT-2 for under $100.

Okay, so let's chat about data. So for Toto 2.0, we trained on 3.4 trillion data points, up from 2.4 trillion. But the bigger change for us was quality more than quantity itself. We ended up going through an order of magnitude more data than what we actually used. And while in 1.0 our data skewed heavily towards 10-second intervals for the time series, for 2.0 we rebalanced to overweight longer intervals, getting a more diverse view of our data. We also retired public datasets entirely. They didn't help in our ablations. I'll come back to that in a second. So the final...


## 00:15:00-00:20:00

The final mix is 42% observability metrics from Datadog's own internal telemetry. This is Datadog's data observing our own internal systems. And 58% synthetic data generated by TempoPFN. So no customer data was used for Toto 2.0. The observability data is exclusively from Datadog observing Datadog.

So let's talk about scaling. How do we scale these models efficiently? So the key question here is, how do you train four model sizes without tuning hyperparameters separately for each one? And as you all know, hyperparameters like learning rate almost never transfer across model sizes. And that's why most teams ship one size and move on. In our case, we used UMuP, a parameterization that makes optimal hyperparameters independent of model width. So we sweep on a cheap 10 million parameter proxy model and then transfer the configuration directly to larger model sizes with zero per-size tuning. UMuP was used by Llama 4, reportedly by GPT-4 as well, and by Stable Diffusion 3. To our knowledge, Toto 2.0 is the first application of it to time series foundation models.

So even at proxy scale, the joint search space is just too big, right? The space of possibilities is too large to just sweep across all of it. So what we end up doing is breaking it into four rounds. One, we start with architecture choices. Two, we look into data frequency weighting, and this is where we discovered that public data within the data mix doesn't help. Round three is where we pick the optimizer, and this is where we landed on the NormMuon that I was talking about earlier. And we finish by picking a decay schedule. And each round here builds on the previous best. And we ended up building an optional wrapper that we call REX to manage this. And this is actually fully training-code agnostic, so it's something that we should be able to reuse for other models or even other modalities. And it just works, thanks in particular to the team that was making this happen.

So let's take a moment to appreciate this beautiful chart. Here, as you can see, you have all the full training curves. And every size trains with the exact same configuration. Validation loss goes down consistently as you go from 4 million to 2.5 billion. And again, there is no per-size tuning. This is all UMuP doing exactly what it promises.

So let's look at some results. These are our results on BOOM, which is our internal time series observability benchmark. Not internal, it's internal data, but it's a benchmark that we open-sourced as well. So as you can see, all four Toto 2.0 sizes rank ahead of every other model. Even the 22 million model outperforms the 1.0 model with seven times fewer parameters. And the 4M model is also well suited for edge devices. It's something that we're very excited about from the perspective of being able to run it embedded within customer hosts or systems. And as you can see here, the separation from everything else within our domain is pretty dramatic.

So this is GIFT-Eval. I think the update there is still in progress. But as you probably all know, GIFT-Eval is the standard public benchmark. And here, same story. Toto 2.0, even at 1B, but also 2.5B, achieves the best overall rank. And remember, this is a model that has never seen any of these evaluation domains during training. All of the training data that we ended up using is either observability data or synthetic data. Yet it generalizes and we get state-of-the-art performance on these open benchmarks.
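As an aside on the four-round sweep described a couple of paragraphs above, here is a minimal, entirely hypothetical sketch of the "each round builds on the previous best" idea. The round names loosely follow the talk, but the candidate values, the `train_and_evaluate` stand-in, and everything else here are illustrative assumptions; this is not the actual REX wrapper.

```python
from itertools import product


def train_and_evaluate(config: dict) -> float:
    """Hypothetical stand-in for launching a cheap ~10M-parameter proxy run and
    returning its validation loss. Here it just scores configs with a toy rule
    that happens to encode the talk's conclusions."""
    preferred = {"patch_projection": "residual_mlp", "normalization": "arcsinh",
                 "use_public_data": False, "optimizer": "norm_muon"}
    return sum(1.0 for key, value in preferred.items() if config.get(key) != value)


# Four sequential rounds, loosely following the talk; candidate values are made up.
ROUNDS = [
    ("architecture", {"patch_projection": ["linear", "residual_mlp"],
                      "normalization": ["standard", "arcsinh"]}),
    ("data_weighting", {"interval_mix": ["10s_heavy", "rebalanced"],
                        "use_public_data": [True, False]}),
    ("optimizer", {"optimizer": ["adamw", "muon", "norm_muon"]}),
    ("decay_schedule", {"schedule": ["cosine", "linear"]}),
]

best: dict = {}  # winning choices accumulate round by round
for round_name, grid in ROUNDS:
    keys = list(grid)
    candidates = [dict(zip(keys, combo)) for combo in product(*grid.values())]
    # Every candidate in this round inherits the best choices from earlier rounds.
    scored = [(train_and_evaluate({**best, **cand}), cand) for cand in candidates]
    _, winner = min(scored, key=lambda pair: pair[0])
    best.update(winner)
    print(f"round {round_name}: picked {winner}")
```

The piece that makes reusing the winning configuration at 1B and 2.5B reasonable is the UMuP parameterization mentioned above, which keeps the proxy-scale optimum transferable across widths; that part is not shown in this sketch.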
So one more finding that surprised us, and that I think is pretty cool and that I wanted to share here, is that long-horizon stability essentially scales with model size. So I hope you can see the graphs on the screen, but the idea is that at 4M, forecasts start to break down at horizon 4096. At 1B, the model maintains coherent structure well beyond its training context. So what appears to be the case here is that larger models don't just get more accurate, but they are actually also more stable. And with block decoding, where the model forecasts a block, conditions on it, then forecasts the next, stability improves even further. And this is a property that you want for production use, where you need to reliably forecast over long horizons. These are a couple more examples that essentially show pretty much...


## 00:20:00-00:25:00

the same behavior. And as you can see, these things are not perfect as you scale up, but you still get dramatically better results.

Okay, so let me take a step back and maybe take stock. Last year, we showed and open-sourced Toto 1.0 and called it the BERT moment for time series: the first time we were seeing time series foundation models clearly beat supervised baselines, zero-shot. Today, we showed Toto 2.0, the GPT-2 moment. Scaling works, bigger models are better models, reliably, from a single configuration, and we think that's a pretty big deal. But time series is only one modality. If you go back to the beginning of my presentation and think about that problem that we're trying to solve, detecting incidents before they happen, understanding root causes, or simulating behavioral changes, those require you to have the full picture of all of this data coming from these systems that customers are running at an insane scale.

So where do we go from here? Well, here is the vision. Our data isn't just large scale, it's also comprehensive and diverse. Metrics, logs, traces, topology, events, alerts, source code, etc. Today, we process each signal in its own silo. Metrics go to anomaly detectors, logs go to parsers, traces go to latency analyzers, etc. What we want to do is essentially to train a single model that learns the joint dynamics of all of these modalities. And when I use the word world model, I'm using it very loosely here. I'm not trying to start a debate about what qualifies exactly as a world model. But if you squint, large autoregressive language models are already a form of world models, the GPTs, Gemini, Claude, etc. They've learned a representation of how the digital world works from training data, and they effectively use it to predict, reason, and plan. So whether you call them world models or not, that's pretty much what they do. So the question for us is whether we can train a similar kind of model specialized on observability data that learns how distributed systems behave. A model that can predict what happens next and answer what-if counterfactual questions. You can also think of this as what GAIA-1 or Cosmos are to self-driving cars, but for production software systems. Sort of a learned simulator that also is very Bitter-Lesson-pilled. So one model end-to-end across all signals that beats a pipeline of separately tuned model components.

So that is a big vision, but where do we actually start? Well, first, evaluation. You can't build what you can't measure. So Stephan Xie spoke about his work a little bit earlier in this workshop. He's an amazing student from CMU who has been working with our team and led the work on ARFBench. This is a time series question answering benchmark built from real Datadog incidents: 750 QA pairs on 142 time series that we pulled from 63 real incidents within Datadog, with up to 2,000 variables and 40,000 time steps per series. So the questions that you can see essentially test what matters in incident response. When did an anomaly in the time series start? What else is behaving anomalously if you're showing multiple time series at once? Is this metric correlated with the failure that we are investigating? To make this a bit more concrete, this is an example of a task that is included in ARFBench. And let me maybe give you some numbers. The best frontier VLM, GPT-5 in this case, gets around 63% accuracy and 52% F1. That's against a random baseline of 22.5 F1.
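The slide with the concrete ARFBench example is not captured in this transcript. As a purely illustrative stand-in, here is a made-up item in the spirit of the anomaly-onset questions described above; the field names, metric names, values, and answer format are hypothetical and are not ARFBench's actual schema.

```python
# Hypothetical ARFBench-style item, for illustration only (not the real schema).
example_item = {
    "incident_id": "incident-000",  # made-up identifier
    "series": {
        "checkout.requests.errors": [0, 1, 0, 2, 1, 1, 58, 112, 97, 105],
        "checkout.requests.hits":   [930, 941, 955, 948, 950, 952, 949, 951, 947, 950],
    },
    "question": "At which timestep does the error-count anomaly begin?",
    "choices": ["t=2", "t=4", "t=6", "t=8"],
    "answer": "t=6",
}
```

Per the numbers quoted in the talk, real items are far larger than this toy one, with up to 2,000 variables and 40,000 time steps per series.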
So these are fairly hard questions. So we also trained a hybrid model that we call Toto-1.0-QA-Experimental, which wires Toto 1.0 time series representations directly into a vision-language model, with additional training on top of it. And it scores 64% accuracy, the highest of any model that we tested. And on tasks like anomaly identification specifically, it beats every other model by at least 8.8 percentage points in F1. But here's what's actually interesting in the numbers that you see here. The best models still trail human domain experts by about 9 points in accuracy and 13 points in F1. And if you look at it more precisely, the models and humans tend to make different errors. If you combine both of them into an oracle, you get somewhere around 87.2 accuracy.


## 00:25:00-00:30:00

And 82.8 F1, well beyond what either the models or humans achieve alone. And this gap between today's best models and the oracle is a space that is interesting and that we're trying to close. And by the way, the Toto-1.0-QA-Experimental model that I'm talking about here is open weight, Apache 2.0, already available on GitHub. And again, I believe that Stephan Xie presented some of this work in his talk earlier today.

So this is the picture of what we're building towards. A foundation model that takes in all of Datadog's telemetry types, learns how distributed systems behave, and powers everything downstream. This is something that we're hoping to also use as a source of simulations for training our own SRE agents. So this is more kind of like RL post-training on top of language models. But this can also be used for proactive alerting that warns you before an incident happens, counterfactual analysis that can tell you what would happen if you were to scale down a service, or if you were to have 10x more traffic on a specific endpoint. And nobody has really built this, right? Some of the architectural building blocks here exist in adjacent fields like autonomous driving, robotics, physics simulation, multimodal ML, etc. But they've never really been assembled for observability. Now, this is the large vision, and we're not trying to boil the ocean on day one. We're starting with time series plus logs as our first multimodal combination, evaluating that against Toto as a baseline, and expanding after that to other things like system topologies and traces, where we already have some early promising results.

Cool, so let's wrap up. Here is the arc. A year ago, we showed Toto 1.0. This was the BERT moment for time series, an unsupervised foundation model beating supervised baselines for the first time. Today, Toto 2.0, the GPT-2 moment: scaling works from 4 million to 2.5 billion, every size better than the last, from a single hyperparameter configuration. And ahead of us, multimodal world models for observability. A single model that learns how distributed systems behave across metrics, logs, traces, topology, and code. And the problem space here is real. Billions of dollars are lost to system failures every year. And this data is effectively unique. We believe that no one else out there has it with the scale and diversity of the data that we have here within Datadog. This is very much wide-open research, and no one has built these kinds of systems before. If that excites you, we are hiring both in New York and Paris: research scientists, research engineers, people who want to work on frontier problems with real data at large scale, and very importantly, with very real product and business applications. I'm happy to take some questions now, but also come find our team after the talk. I was not able to make it to Rio, but all of these folks on the slide are at ICLR and probably most of them are in the room right now. Feel free to scan the QR code if you want to see our open positions or stop by our booth. Thanks everyone.

Thank you very much for the talk, Othmane. Are there any questions from the room for the speaker? We have one question.

Thank you very much for the nice talk. I'm a bit curious about Toto 2.0, specifically how it handles multivariate data and covariates, which I believe you hinted at. First of all, does it handle those zero-shot, with in-context learning?
And do you think, in some sense, if it does, that it somehow has this inductive bias towards observability data and maybe doesn't generalize to other domains? Thank you.

Cool. Yeah, that's a good question. So this is the case both for Toto 1.0 and also Toto 2.0. Both of them handle multivariate data. And this is something that we very much care about, in the sense that for the type of internal product applications that we have within Datadog, being able to handle multivariate data is extremely important. So it does that, and it handles it zero-shot. And as you said, we see that work particularly well for high-cardinality multivariate data within observability. We do see good performance of that out of distribution as well, including in some of the open benchmarks that we tested on.

One more question. First of all, thank you.


## 00:30:00-00:33:25

Really good presentation and great work. A question: have you done any comparison to understand how much of the performance, separately on BOOM and separately on GIFT-Eval, could be attributed to your data versus the TempoPFN data? Because your data is mostly observability, and TempoPFN is, well, general data, right? And GIFT-Eval is a general eval and BOOM is an observability eval. So it's, I mean, a combination of different things. So have you done any comparisons to see which data contributes to which performance?

Yeah, that's a good question. So if you think about how I was talking about REX and the way that we did hyperparameter sweeps, what you're describing is something that we swept for, right? If you remember, for round two, where we do data selection, we essentially try to optimize towards figuring out the right data mix, not only between observability data and the TempoPFN synthetic data, but also some of the open data that we used for Toto 1.0. And in practice, what that told us was that open data hurts performance, it doesn't help performance. So this was essentially a step in which we could have learned that, you know, oh, maybe TempoPFN is enough. In practice, it wasn't. So we have that as a proof that the time series metrics data does help with overall performance. It is probably something that we could look into more deeply and update a little bit more. But again, in practice, the way that these models were built, we did have effectively an opportunity to say that maybe TempoPFN was enough. In practice, it wasn't. And introducing the specialized data was still very helpful there.

Do we have any more questions in the room? We have one question. Okay, that will be the last question.

Thanks for the presentation. I saw OpenTSLM-SP 1B at the end. It's not a forecasting model and not a generalist; it's not a foundation model. The one you used is just trained on four medical datasets. So it's probably not applicable to the evaluation you put out. Just mentioning.

Oh, you're talking about the ARFBench dashboard? Is that what you're talking about?

It was in the overall accuracy. So it's not comparable.

Yeah, that's a good point. All right. Okay, that's fair. I'll share that with the team. Thanks for clearing that up.

And you're using the model that we showed was the less effective one. So maybe you want to look at the Flamingo model.

Thank you. Thank you, Othmane. We will move to the last part of the workshop.
