TurboQuant

Summary

TurboQuant is an online vector quantization method for high-dimensional vectors from Google Research, NYU, and Google DeepMind. It combines random rotation, scalar quantization, and a one-bit QJL residual correction to keep reconstruction error low while yielding unbiased inner-product estimates. A later vLLM / Red Hat AI implementation study narrows the practical deployment claim: FP8 is usually the stronger default for KV-cache serving, while TurboQuant is mainly a tradeoff worth making under memory pressure.

Interface

  • Input: high-dimensional Euclidean vectors.
  • Main operations: random rotation, coordinate-wise scalar quantization, residual QJL sketching.
  • Main targets: KV-cache compression for LLM inference and vector-search index compression.
  • Key property: data-oblivious online quantization, avoiding a learned data-dependent codebook at indexing time.
  • Current artifact status: arXiv paper, ICLR 2026 OpenReview page, Google Research blog, and vLLM implementation critique; no official code repository could be verified at ingest time.
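
The three main operations listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the uniform grid (a simple stand-in for the scalar quantizer), the clipping limit of 3/sqrt(d), and the choice of one sign bit per coordinate are all assumptions made for illustration. The residual step uses the standard sign-sketch identity E[s · sign(⟨s, r⟩)] = sqrt(2/π) · r/‖r‖ for Gaussian s, which is what makes the corrected inner-product estimate unbiased over the draw of the sketch matrix.

```python
import numpy as np


def random_rotation(d, rng):
    """Draw a d x d orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # column sign fix keeps the draw uniform


def grid_decode(idx, bits, limit):
    """Map integer codes back to bin midpoints on [-limit, limit]."""
    step = 2.0 * limit / (2 ** bits)
    return -limit + (idx + 0.5) * step


def encode(x, rot, sketch, bits):
    """Rotate, quantize each coordinate, keep sign bits of the sketched residual."""
    limit = 3.0 / np.sqrt(x.shape[0])  # assumption: about 3 std devs for a unit vector
    step = 2.0 * limit / (2 ** bits)
    z = rot @ x
    idx = np.clip(np.floor((z + limit) / step), 0, 2 ** bits - 1).astype(np.int64)
    residual = z - grid_decode(idx, bits, limit)
    signs = np.where(sketch @ residual >= 0, 1.0, -1.0)  # one bit per sketch row
    return idx, signs, np.linalg.norm(residual)


def decode(idx, signs, res_norm, rot, sketch, bits):
    """Rebuild the coarse vector, add the sign-sketch residual estimate, rotate back."""
    limit = 3.0 / np.sqrt(rot.shape[0])
    m = sketch.shape[0]
    z_hat = grid_decode(idx, bits, limit)
    r_hat = np.sqrt(np.pi / 2.0) / m * res_norm * (sketch.T @ signs)
    return rot.T @ (z_hat + r_hat)


rng = np.random.default_rng(0)
d, bits = 64, 3
rot = random_rotation(d, rng)          # data-oblivious: drawn once, no training data
sketch = rng.standard_normal((d, d))   # Gaussian sketch, one sign bit per coordinate
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
y = rng.standard_normal(d)
y /= np.linalg.norm(y)

code = encode(x, rot, sketch, bits)
x_hat = decode(*code, rot, sketch, bits)
print("inner-product error:", abs(y @ x_hat - y @ x))
```

Nothing in the sketch depends on the data being indexed: the rotation and sketch matrices are drawn once from a seed, so each vector can be encoded as it arrives, which is the data-oblivious online property named in the list.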

Role In The Wiki

TurboQuant is the local object card for memory and vector-search compression. It should be used when a page needs a concrete example of geometry-preserving compression for sequence state, retrieval memory, or high-dimensional latent vectors.

For the foundation time-series agenda, it is upstream infrastructure evidence rather than direct time-series evidence. The useful question has two parts: whether compressed latent states preserve dense numeric detail, rare regimes, channel interactions, control inputs, and delayed effects, and whether the compression still improves the end-to-end serving contract once dequantization cost, latency, and throughput are counted.

Evidence