TurboQuant

Summary

TurboQuant is an online method for quantizing high-dimensional vectors, from Google Research, NYU, and Google DeepMind. It combines a random rotation, coordinate-wise scalar quantization, and a one-bit QJL correction on the residual, so that reconstruction error stays low and inner-product estimates remain unbiased. A later vLLM / Red Hat AI implementation study narrows the practical deployment claim: FP8 is usually the stronger default for KV-cache serving, while TurboQuant is mainly a tradeoff worth making under memory pressure.
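To make the memory-pressure side of that tradeoff concrete, the following is a minimal arithmetic sketch of KV-cache size as a function of bits per stored value. The function name `kv_cache_bytes`, the model shape, and the bit widths are illustrative assumptions, not figures from the paper or the vLLM study, and the calculation ignores per-vector metadata such as stored norms or scales.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV-cache size in bytes (hypothetical helper).

    Counts one key vector and one value vector per layer, KV head, and token,
    and ignores any per-vector metadata a real quantizer would also store.
    """
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return num_values * bits_per_value / 8


# Hypothetical model shape, chosen only to give round numbers.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)

GIB = 2 ** 30
for label, bits in [("FP16", 16), ("FP8", 8), ("3-bit", 3)]:
    size_gib = kv_cache_bytes(bits_per_value=bits, **shape) / GIB
    print(f"{label}: {size_gib:.2f} GiB")
# prints:
# FP16: 4.00 GiB
# FP8: 2.00 GiB
# 3-bit: 0.75 GiB
```

The footprint is only one side of the comparison. Dequantization cost, latency, and throughput have to be measured separately.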

Interface

Input: high-dimensional Euclidean vectors.
Main operations: random rotation, coordinate-wise scalar quantization, residual QJL sketching.
Main targets: KV-cache compression for LLM inference and vector-search index compression.
Key property: data-oblivious online quantization, avoiding a learned data-dependent codebook at indexing time.
Current artifact status: arXiv paper, ICLR 2026 OpenReview page, Google Research blog, and vLLM implementation critique; no official code repository was verified at ingest time.
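The three main operations can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The uniform grid clipped at `3 / sqrt(dim)` is a simple stand-in for whatever scalar codebook the paper uses. The residual sketch follows the standard sign-of-Gaussian-projection estimator. All function names and the dictionary layout are hypothetical.

```python
import numpy as np


def make_state(dim, sketch_rows, seed=0):
    """Data-oblivious shared state: a random rotation and a Gaussian sketch matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    rotation = q * np.sign(np.diag(r))  # column sign fix; result stays orthogonal
    sketch = rng.standard_normal((sketch_rows, dim))
    return rotation, sketch


def _grid_values(codes, dim, bits):
    # Assumption: uniform grid on [-clip, clip]. Coordinates of a rotated unit
    # vector are roughly N(0, 1/dim), so clip sits near three standard deviations.
    clip = 3.0 / np.sqrt(dim)
    step = 2 * clip / 2 ** bits
    return -clip + (codes + 0.5) * step


def encode(x, rotation, sketch, bits=3):
    """Rotate, scalar-quantize each coordinate, then keep one sign bit per sketch row."""
    dim = x.shape[0]
    norm = np.linalg.norm(x)            # assumes x is nonzero
    y = rotation @ (x / norm)
    clip = 3.0 / np.sqrt(dim)
    step = 2 * clip / 2 ** bits
    codes = np.clip(np.floor((y + clip) / step), 0, 2 ** bits - 1).astype(np.uint8)
    residual = y - _grid_values(codes, dim, bits)
    signs = np.where(sketch @ residual >= 0, 1, -1).astype(np.int8)
    return {"norm": norm, "codes": codes, "bits": bits,
            "signs": signs, "res_norm": np.linalg.norm(residual)}


def decode(code, rotation):
    """Reconstruct from the scalar codes only (the sign bits are not used here)."""
    y_hat = _grid_values(code["codes"], rotation.shape[0], code["bits"])
    return code["norm"] * (rotation.T @ y_hat)


def estimate_inner_product(code, query, rotation, sketch):
    """Coarse inner product plus a sign-sketch correction for the residual.

    For Gaussian g, E[sign(g.r) * (g.q)] = sqrt(2/pi) * (r.q) / ||r||, so the
    correction term is an unbiased estimate of the residual's inner product.
    """
    q_rot = rotation @ query
    y_hat = _grid_values(code["codes"], rotation.shape[0], code["bits"])
    correction = (code["res_norm"] * np.sqrt(np.pi / 2)
                  * np.mean(code["signs"] * (sketch @ q_rot)))
    return code["norm"] * (y_hat @ q_rot + correction)


rotation, sketch = make_state(dim=128, sketch_rows=128, seed=0)
rng = np.random.default_rng(1)
x, query = rng.standard_normal(128), rng.standard_normal(128)
code = encode(x, rotation, sketch, bits=3)
print("true:", x @ query, "estimate:", estimate_inner_product(code, query, rotation, sketch))
```

Nothing in `make_state` depends on the data, which is what makes the scheme online: vectors can be encoded one at a time as they arrive, with no codebook training pass.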

Role In The Wiki

TurboQuant is the local object card for memory and vector-search compression. It should be used when a page needs a concrete example of geometry-preserving compression for sequence state, retrieval memory, or high-dimensional latent vectors.

For the foundation time-series agenda, it is upstream infrastructure evidence rather than direct time-series evidence. The useful question is twofold: whether compressed latent states preserve dense numeric detail, rare regimes, channel interactions, control inputs, and delayed effects, and whether compression still improves the end-to-end serving contract once dequantization cost, latency, and throughput are counted.

Evidence
