Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks
Source
- Raw Markdown: paper_atst-2023.md
- PDF: paper_atst-2023.pdf
- Preprint: arXiv 2306.04186
- Official code: Audio-WestlakeU/audiossl
Core Claim
Audio self-supervised learning should train both clip-level and frame-level representations, because fine-grained sound event detection needs temporally localized embeddings rather than only global clip features.
Key Contributions
- Introduces two models: ATST-Clip for clip-level representations and ATST-Frame for frame-level representations.
- Uses Transformer encoders with a teacher-student self-supervised training scheme.
- Designs separate view-creation strategies for clip-level and frame-level representation learning.
- Adds frame-wise data augmentation and masking for ATST-Frame.
- Shows that combining ATST-Clip and ATST-Frame through knowledge distillation can improve downstream performance.
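The frame-wise masking idea above can be illustrated with a minimal sketch. This is a generic random frame-masking routine on a (freq, time) spectrogram, not the paper's exact masking scheme; the function name and mask ratio are illustrative assumptions.

```python
import numpy as np

def mask_frames(spec, mask_ratio=0.5, rng=None):
    """Zero out a random subset of time frames (columns) of a (freq, time)
    spectrogram. A generic sketch, not ATST-Frame's exact masking strategy."""
    rng = rng or np.random.default_rng(0)
    n_frames = spec.shape[1]
    n_mask = int(mask_ratio * n_frames)
    # Pick distinct frame indices to mask.
    idx = rng.choice(n_frames, size=n_mask, replace=False)
    masked = spec.copy()
    masked[:, idx] = 0.0
    return masked, idx
```

Masking whole time frames (rather than random spectrogram patches) forces the model to reconstruct or match representations at specific time steps, which matches the frame-level objective.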
Method Notes
ATST is a temporal representation-learning paper for audio. It is relevant to time-series modeling because it handles high-rate temporal signals and separates global clip semantics from frame-level events.
The teacher's weights are an exponential moving average (EMA) of the student's. The student learns to match the teacher's representations across differently augmented views, and ATST-Frame additionally masks frames to force local temporal reasoning.
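The EMA update and view-matching objective can be sketched as follows. This is a toy illustration under assumed names (`ema_update`, `matching_loss`) and a negative-cosine-similarity loss, not the paper's exact loss or momentum schedule.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """Teacher parameters track an exponential moving average of the student's.
    The momentum value here is an assumption, not the paper's schedule."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k]
            for k in teacher}

def matching_loss(student_out, teacher_out):
    """Student matches the (stop-gradient) teacher output for another view;
    sketched here as negative cosine similarity shifted to be >= 0."""
    s = student_out / np.linalg.norm(student_out)
    t = teacher_out / np.linalg.norm(teacher_out)
    return 1.0 - float(s @ t)

# Toy parameters: the teacher drifts slowly toward the student.
teacher = {"w": np.array([1.0, 0.0])}
student = {"w": np.array([0.0, 1.0])}
teacher = ema_update(teacher, student)
```

Only the student receives gradients; the slowly-moving teacher provides stable targets, which is the core mechanism that prevents representation collapse in this family of methods.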
Evidence And Results
The paper reports strong clip-level task performance and especially large gains on frame-level sound event detection. The main wiki lesson: the temporal resolution of the representation is a first-class design choice, not an afterthought.
Alex Notes
- User note: Teacher-Student Transformer self-supervised learning; frame-level audio tasks, especially sound event detection.
Limitations
- Audio SSL results do not automatically transfer to general numeric time series.
- Frame-level objectives can be expensive for long sequences.
- The method is passive representation learning, not action-conditioned world modeling.
Links Into The Wiki
- Self-Supervised Representation Learning
- Unified Multimodal Models
- Time-Series Foundation Models
- Natural language guidance of high-fidelity TTS
Open Questions
- Which audio SSL tricks transfer to sensor or robotics time-series representation learning?
- Can frame-level teacher-student objectives improve anomaly localization in observability data?
- How should temporal resolution be selected when a model must support both clip-level and event-level tasks?