WavSpA: Wavelet Space Attention for Boosting Transformers’ Long Sequence Learning Ability
Source
- Raw Markdown: paper_wavspa-2022.md
- PDF: paper_wavspa-2022.pdf
- Preprint: arXiv 2210.01989
- Official code: EvanZhuang/wavspa
Core Claim
Attention can be performed in a learnable wavelet coefficient space, giving Transformers access to both position and frequency information with linear-time sequence transforms.
Key Contributions
- Proposes Wavelet Space Attention.
- Applies a forward wavelet transform, performs attention in coefficient space, then reconstructs the representation with an inverse transform.
- Compares wavelet-space attention with Fourier-space attention on long-sequence benchmarks.
- Tests fixed and adaptive wavelets.
- Reports improved Long Range Arena performance and better reasoning extrapolation on LEGO-style chain-of-reasoning tasks.
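The forward-transform / attend / inverse-transform pipeline above can be sketched with a single-level Haar wavelet. This is a minimal illustration under my own assumptions, not the paper's learnable multiresolution transform: the function and weight names (`haar_forward`, `wq`, etc.) are hypothetical, attention is single-head, and the transform depth is fixed at one.

```python
import numpy as np

def haar_forward(x):
    # One level of the Haar DWT along the sequence axis.
    # x: (seq_len, d_model), seq_len even. O(n) in sequence length.
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return np.concatenate([approx, detail], axis=0)

def haar_inverse(c):
    # Exact inverse of haar_forward (Haar is orthonormal).
    half = c.shape[0] // 2
    approx, detail = c[:half], c[half:]
    x = np.empty_like(c)
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def wavelet_space_attention(x, wq, wk, wv):
    c = haar_forward(x)                        # 1. forward wavelet transform
    q, k, v = c @ wq, c @ wk, c @ wv           # 2. attention in coefficient space
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return haar_inverse(attn @ v)              # 3. reconstruct via inverse transform
```

Because the Haar basis is orthonormal, `haar_inverse(haar_forward(x))` recovers `x` exactly; the paper's adaptive variants instead learn the filter coefficients.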
Method Notes
WavSpA is not a time-series forecasting paper, but it is directly relevant to long numeric sequences. Wavelets preserve both locality and frequency structure, a natural fit for nonstationary time-series signals, where global Fourier-only bases can be too coarse.
For TSFMs, this source belongs near attention alternatives, adaptive tokenization, and frequency-aware numeric representation.
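To make the locality point concrete: a single transient in a flat series stays concentrated in a handful of Haar coefficients, while the Fourier spectrum of the same spike spreads across every frequency bin. This is an illustrative toy comparison of my own, not an experiment from the paper.

```python
import numpy as np

n = 256
signal = np.zeros(n)
signal[100] = 1.0  # one transient event in an otherwise flat series

# Finest-level Haar coefficients: only the sample pair containing the event
# is nonzero, so position information survives the transform.
approx = (signal[0::2] + signal[1::2]) / np.sqrt(2)
detail = (signal[0::2] - signal[1::2]) / np.sqrt(2)

# Fourier view: a transient excites every frequency bin with equal magnitude,
# so no single coefficient reveals where the event happened.
spectrum = np.fft.rfft(signal)

print(np.count_nonzero(np.abs(detail) > 1e-9))    # 1
print(np.count_nonzero(np.abs(spectrum) > 1e-9))  # 129 (every rfft bin)
```

This is the sense in which wavelet bases suit nonstationary signals: they trade some frequency resolution for positional localization that a global Fourier basis discards.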
Evidence And Results
The abstract reports consistent gains over ordinary Transformer attention and Fourier-space attention on Long Range Arena, plus improved extrapolation over distance in a reasoning task.
Alex Notes
- User-provided official code: EvanZhuang/wavspa.
Limitations
- Long Range Arena and LEGO are not forecasting benchmarks.
- Wavelet attention changes the sequence-mixing substrate but does not by itself solve exogenous variables, channel semantics, or action conditioning.
- Need TSFM-specific tests before treating wavelet attention as a better default for forecasting.
Links Into The Wiki
Open Questions
- Can wavelet-space attention improve long-horizon TSFM stability compared with patching and recurrent state?
- Which wavelet bases are appropriate for irregular, missing, or multivariate signals?
- Is wavelet mixing complementary to learned patching, or does it reduce the need for patching?