Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks

Source

arXiv:2306.04186 (Li, Shao, and Li; IEEE/ACM TASLP)

Core Claim

Audio self-supervised learning should train both clip-level and frame-level representations, because fine-grained sound event detection needs temporally localized embeddings rather than only global clip features.

Key Contributions

  • Introduces ATST-Clip and ATST-Frame, two teacher-student Transformers that target clip-level and frame-level representations, respectively.
  • Uses Transformer encoders with a teacher-student self-supervised training scheme.
  • Designs separate view-creation strategies for clip-level and frame-level representation learning.
  • Adds frame-wise data augmentation and masking for ATST-Frame (see the masking sketch after this list).
  • Shows that combining ATST-Clip and ATST-Frame through knowledge distillation can improve downstream performance.
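
The frame-wise masking idea is concrete enough to sketch. Below is a minimal PyTorch illustration, assuming log-mel spectrogram inputs of shape (batch, time, mel); the name `mask_frames`, the mask ratio, and the block size are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def mask_frames(spec: torch.Tensor, mask_ratio: float = 0.65,
                block_size: int = 5) -> tuple[torch.Tensor, torch.Tensor]:
    """Mask contiguous blocks of time frames in a batch of log-mel inputs.

    spec: (batch, time, mel). Returns the masked spectrogram and a boolean
    mask of shape (batch, time), True where frames were masked, so the
    frame-level loss can be restricted to masked positions.
    """
    batch, time, _ = spec.shape
    n_blocks = max(1, int(time * mask_ratio) // block_size)
    mask = torch.zeros(batch, time, dtype=torch.bool)
    for b in range(batch):
        starts = torch.randint(0, max(1, time - block_size), (n_blocks,))
        for s in starts.tolist():
            mask[b, s:s + block_size] = True
    masked = spec.clone()
    masked[mask] = 0.0  # zero fill; a learned mask token is another common choice
    return masked, mask
```

Restricting the matching loss to masked positions is what forces the student to infer local temporal structure from surrounding context rather than from the frame itself.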

Method Notes

ATST is a temporal representation-learning method for audio. It is relevant to time-series modeling because it handles high-rate temporal signals and separates global clip semantics from frame-level events.

The teacher's weights are an exponential moving average (EMA) of the student's. The student learns to match the teacher's representations under differently augmented views, and ATST-Frame additionally masks frames of the student's input to force local temporal reasoning (sketched below).
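
A minimal sketch of this scheme, assuming frame embeddings of shape (batch, time, dim); the momentum value and the cosine-similarity matching loss are generic teacher-student choices standing in for the paper's exact objective.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.999) -> None:
    """Teacher weights track the student as an exponential moving average."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)

def frame_matching_loss(student_out: torch.Tensor,
                        teacher_out: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between student and teacher frame
    embeddings, averaged over masked frames only.

    student_out, teacher_out: (batch, time, dim); mask: (batch, time) bool.
    """
    sim = F.cosine_similarity(student_out, teacher_out.detach(), dim=-1)
    return -(sim * mask).sum() / mask.sum().clamp(min=1)
```

In a training step, the two networks embed different views of the same clip; gradients flow only through the student, and `ema_update` runs after each optimizer step.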

Evidence And Results

The paper reports strong clip-level task performance and especially large gains on frame-level sound event detection. The main wiki lesson is that the temporal resolution of a learned representation is a first-class design choice.

Alex Notes

  • User note: Teacher-Student Transformer self-supervised learning; frame-level audio tasks, especially sound event detection.

Limitations

  • Audio SSL results do not automatically transfer to general numeric time series.
  • Frame-level objectives can be expensive for long sequences.
  • The method is passive representation learning, not action-conditioned world modeling.

Open Questions

  • Which audio SSL tricks transfer to sensor or robotics time-series representation learning?
  • Can frame-level teacher-student objectives improve anomaly localization in observability data?
  • How should temporal resolution be selected when a model must support both clip-level and event-level tasks?