Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks

Source

arXiv:2306.04186 (Li, Shao, and Li; IEEE/ACM TASLP)

Core Claim

Audio self-supervised learning should train both clip-level and frame-level representations, because fine-grained sound event detection needs temporally localized embeddings rather than only global clip features.

Key Contributions

  • Introduces ATST-Clip and ATST-Frame, two teacher-student Transformers that target clip-level and frame-level representations, respectively.
  • Uses Transformer encoders with a teacher-student self-supervised training scheme.
  • Designs separate view-creation strategies for clip-level and frame-level representation learning.
  • Adds frame-wise data augmentation and masking for ATST-Frame (see the masking sketch after this list).
  • Shows that combining ATST-Clip and ATST-Frame through knowledge distillation can improve downstream performance.
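
The frame-wise masking idea is concrete enough to sketch. Below is a minimal PyTorch illustration, assuming log-mel spectrogram inputs of shape (batch, time, mel); the name `mask_frames`, the mask ratio, and the block size are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def mask_frames(spec: torch.Tensor, mask_ratio: float = 0.65,
                block_size: int = 5) -> tuple[torch.Tensor, torch.Tensor]:
    """Mask contiguous blocks of time frames in a batch of log-mel inputs.

    spec: (batch, time, mel). Returns the masked spectrogram and a boolean
    mask of shape (batch, time), True where frames were masked, so the
    frame-level loss can be restricted to masked positions.
    """
    batch, time, _ = spec.shape
    n_blocks = max(1, int(time * mask_ratio) // block_size)
    mask = torch.zeros(batch, time, dtype=torch.bool)
    for b in range(batch):
        starts = torch.randint(0, max(1, time - block_size), (n_blocks,))
        for s in starts.tolist():
            mask[b, s:s + block_size] = True
    masked = spec.clone()
    masked[mask] = 0.0  # zero fill; a learned mask token is another common choice
    return masked, mask
```

Restricting the matching loss to masked positions is what forces the student to infer local temporal structure from surrounding context rather than from the frame itself.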

Method Notes

ATST is a temporal representation-learning method for audio. It is relevant to time-series modeling because it handles high-rate temporal signals and separates global clip semantics from frame-level events.

The teacher's weights are an exponential moving average (EMA) of the student's. The student learns to match the teacher's representations under differently augmented views, and ATST-Frame additionally masks frames of the student's input to force local temporal reasoning (sketched below).
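
A minimal sketch of this scheme, assuming frame embeddings of shape (batch, time, dim); the momentum value and the cosine-similarity matching loss are generic teacher-student choices standing in for the paper's exact objective.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.999) -> None:
    """Teacher weights track the student as an exponential moving average."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)

def frame_matching_loss(student_out: torch.Tensor,
                        teacher_out: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between student and teacher frame
    embeddings, averaged over masked frames only.

    student_out, teacher_out: (batch, time, dim); mask: (batch, time) bool.
    """
    sim = F.cosine_similarity(student_out, teacher_out.detach(), dim=-1)
    return -(sim * mask).sum() / mask.sum().clamp(min=1)
```

In a training step, the two networks embed different views of the same clip; gradients flow only through the student, and `ema_update` runs after each optimizer step.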

Evidence And Results

The paper reports strong clip-level task performance and especially large gains on frame-level sound event detection. The main wiki lesson is that the temporal resolution of a learned representation is a first-class design choice.

Alex Notes

  • User note: Teacher-Student Transformer self-supervised learning; frame-level audio tasks, especially sound event detection.

Limitations

  • Audio SSL results do not automatically transfer to general numeric time series.
  • Frame-level objectives can be expensive for long sequences.
  • The method is passive representation learning, not action-conditioned world modeling.

Open Questions

  • Which audio SSL tricks transfer to sensor or robotics time-series representation learning?
  • Can frame-level teacher-student objectives improve anomaly localization in observability data?
  • How should temporal resolution be selected when a model must support both clip-level and event-level tasks?