# Self-Teaching Autoencoder

## Provenance

- Source type: author blog post, code repository, demo, and X announcement.
- Blog title: Self-Teaching Autoencoder.
- Blog date: 2026-05-19 in the site's `posts.json` index.
- Author: Matteo Peluso / Matteo, `@MozarellaPesto`.
- Official blog URL: <https://the-puzzler.github.io/posts/self-teaching-autoencoder/self-teaching-autoencoder.html>
- Official share URL: <https://the-puzzler.github.io/share/self-teaching-autoencoder.html>
- Official code: <https://github.com/the-puzzler/leautoencoder>
- Official demo: <https://the-puzzler.github.io/posts/self-teaching-autoencoder/latent_brush_demo/>
- X announcement: <https://x.com/MozarellaPesto/status/2058633043364003917>
- Snapshot date: 2026-05-25.
- Local source artifacts: `source_blog.html`, `source_share_redirect.html`, `source_posts.json`, `source_github_readme.md`, `source_x_oembed.json`, `source_x_syndication.json`, and `assets/`.

## Source Status

This is not a paper or peer-reviewed report. It is an author-published experiment with a public blog post, demo, code repository, and X announcement. The claims below should be treated as author-reported project evidence until independently reproduced or turned into a formal paper.

## Announcement Snapshot

The X announcement was created on 2026-05-24 at 19:35:39 UTC. The public X embed and syndication snapshots identify the author as Matteo / `@MozarellaPesto`, preserve the root post text, and record a 13.633-second video attachment. At extraction time the syndication snapshot reported 18 replies and 475 likes; those engagement counts are not stable evidence and should not be used as technical support.

The root announcement frames the result as an autoencoder that reconstructs images without MSE or any other image-space supervision, using only the signal that the output should look like the input when viewed through the model's own representation.

## Blog Snapshot

The blog asks whether an autoencoder can learn to reconstruct images without any reconstruction loss. It removes direct pixel loss, feature-space loss, adversarial loss, and hybrid reconstruction losses, then trains an encoder-decoder through latent agreement.

The standard autoencoder setup is:

$$
z = E(x), \qquad \hat{x} = D(z).
$$

The self-teaching objective in the blog is:

$$
\mathcal{L}
=
\mathbb{E}_T \left[\lVert E(T(x)) - E(T(\hat{x})) \rVert\right]
+
\mathcal{L}_{\mathrm{SigReg}}.
$$
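The objective can be sketched numerically as follows. This is a minimal illustration, not the author's training code: the linear `encode`, the two stand-in decoders, the random pixel masks used as transformations, and the optional `regularizer` hook (standing in for the SigReg term) are all assumptions made for the example.

```python
import numpy as np

def latent_agreement_loss(encode, decode, transforms, x, regularizer=None):
    """Mean of ||E(T(x)) - E(T(x_hat))|| over transforms, plus an optional regularizer."""
    z = encode(x)
    x_hat = decode(z)  # x_hat is never compared to x in pixel space
    distances = [
        np.linalg.norm(encode(t(x)) - encode(t(x_hat)))
        for t in transforms
    ]
    loss = float(np.mean(distances))
    if regularizer is not None:
        loss += float(regularizer(z))  # hypothetical hook for the SigReg term
    return loss

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))  # toy linear encoder: 16 "pixels" -> 4 latents
x = rng.normal(size=16)

def encode(v):
    return W @ v

# Assumed transformations: random binary pixel masks that keep the input size.
masks = [(rng.random(16) < 0.5).astype(float) for _ in range(8)]
transforms = [lambda v, m=m: v * m for m in masks]

def perfect_decode(z):
    return x.copy()      # stand-in decoder that returns the input exactly

def blank_decode(z):
    return np.zeros(16)  # stand-in decoder that ignores the code

loss_perfect = latent_agreement_loss(encode, perfect_decode, transforms, x)
loss_blank = latent_agreement_loss(encode, blank_decode, transforms, x)
```

A faithful reconstruction drives the agreement term to zero, while a decoder that ignores the code is penalized, even though no pixel-space comparison appears anywhere in the loss.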

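The blog's explanation is that comparing only $E(x)$ and $E(\hat{x})$ leaves a "private language" loophole: the encoder and decoder can agree on codes that re-encode consistently without producing faithful images. Transformations tighten the constraint because the decoder output must remain representation-consistent across every transformed view. Writing $[x]_{E,T}$ for the set of images the encoder cannot distinguish from $x$ under transformation $T$, the reconstruction must lie in the intersection of these sets: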

$$
[x]_{E,T} = \{y : E(T(y)) = E(T(x))\},
\qquad
\hat{x} \in \bigcap_T [x]_{E,T}.
$$
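A toy example makes the shrinking equivalence classes concrete. The encoder below keeps only the mean of its input, and the transformations are plain crops without resizing; both are assumptions chosen for simplicity, not the blog's setup.

```python
def encode(values):
    """Toy encoder: keeps only the mean, so it discards almost everything."""
    return sum(values) / len(values)

def identity(values):
    return list(values)

def make_crop(start, width):
    """Return a transform that crops a window (no resize in this toy)."""
    return lambda values: list(values[start:start + width])

def in_class(x, y, transforms, tol=1e-9):
    """True if E(T(y)) matches E(T(x)) for every transform T."""
    return all(abs(encode(t(y)) - encode(t(x))) <= tol for t in transforms)

x = [1.0, 2.0, 3.0, 4.0]
reversed_x = [4.0, 3.0, 2.0, 1.0]  # same mean as x, very different signal
near_x = [1.5, 1.5, 3.5, 3.5]      # matches x on every width-2 crop mean
crops = [make_crop(start, 2) for start in range(3)]

identity_only = in_class(x, reversed_x, [identity])  # True: the loophole
with_crops = in_class(x, reversed_x, crops)          # False: crops rule it out
residual = in_class(x, near_x, crops)                # True: class is smaller, not a single point
```

With only the identity view, the reversed signal is indistinguishable from `x`. Intersecting over crops excludes it, but some nearby signals survive, which illustrates why the choice of transformation family matters.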

The reported progression is:

- Sparse pixel masking works on a synthetic grayscale-shape toy dataset.
- CIFAR-10 exposes that pixel masking underconstrains color and local texture.
- Crop-resize becomes the first variant that works plausibly on CIFAR-10 because small natural crops constrain color, texture, and coarse shape.
- Step-frozen judging improves the setup by freezing the encoder during the comparison step, so the decoder has to absorb the mismatch rather than the encoder learning to become invariant to decoder artifacts.
- On ordinary CelebA autoencoding, the step-frozen crop-resize variant remains stable at roughly 6x compression with the same 2M-parameter architecture used in the blog figures.
- On masked CelebA autoencoding, the blog reports two harder variants: latent size 128 with about 11M parameters and roughly 400x compression, and latent size 512 with about 36M parameters and roughly 96x compression.
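The reported compression figures for the masked variants can be sanity-checked with simple arithmetic. The sketch assumes `128x128` RGB inputs (the crop size stated in the repository README) and defines compression as input scalars divided by latent scalars; the blog may define it differently.

```python
def compression_ratio(height, width, channels, latent_size):
    """Input scalar count divided by latent scalar count (assumed definition)."""
    return (height * width * channels) / latent_size

ratio_128 = compression_ratio(128, 128, 3, 128)  # 49152 / 128 = 384.0
ratio_512 = compression_ratio(128, 128, 3, 512)  # 49152 / 512 = 96.0
```

Under these assumptions, latent size 128 gives 384x, consistent with "roughly 400x", and latent size 512 gives exactly 96x.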

The blog describes `leAutoencoder` as the masked-autoencoding version. If $y = m \odot x$ is a masked observation, the clean and masked reconstruction branches are compared against the same transformed clean target. The source explicitly notes that this is a conditional prediction problem rather than pure compression because $p(x \mid y,m)$ has nonzero entropy.
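The two-branch comparison can be sketched as below. The per-pixel Bernoulli mask, the linear encoder and pseudo-inverse decoder, and the `np.roll` transformation are hypothetical stand-ins; the repository's masking and crop helpers may work differently.

```python
import numpy as np

def mask_image(x, keep_prob, rng):
    """Return (y, m) with y = m * x and m a binary keep-mask (assumed per-pixel Bernoulli)."""
    m = (rng.random(x.shape) < keep_prob).astype(x.dtype)
    return m * x, m

def branch_distances(encode, decode, transform, x, y):
    """Compare clean and masked reconstructions against one transformed clean target."""
    target = encode(transform(x))     # shared target built from the clean image
    x_clean_hat = decode(encode(x))   # clean branch
    x_masked_hat = decode(encode(y))  # masked branch
    d_clean = float(np.linalg.norm(encode(transform(x_clean_hat)) - target))
    d_masked = float(np.linalg.norm(encode(transform(x_masked_hat)) - target))
    return d_clean, d_masked

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))  # toy linear encoder weights

def encode(v):
    return W @ v

def decode(z):
    return np.linalg.pinv(W) @ z  # toy linear decoder

def transform(v):
    return np.roll(v, 1)  # hypothetical stand-in for crop-resize

x = rng.normal(size=16)
y, m = mask_image(x, keep_prob=0.5, rng=rng)
d_clean, d_masked = branch_distances(encode, decode, transform, x, y)
```

Both branches are scored against the same clean target, so the masked branch is asked to predict content it never observed, which is why the source treats it as conditional prediction rather than pure compression.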

## Code Snapshot

The repository README describes the project as a small experiment in training an autoencoder to teach itself. According to the README, the code trains autoencoders on CelebA center-cropped to `128x128`, with:

- `main.py`: current self-teaching training loop;
- `main_regular.py`: baseline masked-image reconstruction model;
- `leae/autoencoder.py`: autoencoder architecture;
- `leae/masking.py`: masking and crop helpers;
- `leae/prep_data.py`: dataset loading;
- `leae/sigreg.py`: latent regularizer.

The README's current objective uses a clean branch and a masked branch:

```text
z_clean = E(x)
x_clean_hat = D(z_clean)

z_masked = E(mask(x))
x_masked_hat = D(z_masked)
```

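A target encoder `T` (not to be confused with the transformation $T$ in the blog's objective) scores reconstruction consistency, clean crop consistency, and masked crop consistency. The main latent objective is the average of those losses, which the README calls `mse_loss` and which is measured on latent codes rather than pixels, plus a SIGReg term on clean and masked codes: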

```text
loss = mse_loss + sigreg_loss
```
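As a rough illustration of how such a loss could be assembled, the sketch below averages three latent-space MSE terms and adds a regularizer value. The helper names, the example latent vectors, and the fixed `sigreg_loss` number are placeholders; the README does not specify these details here.

```python
def latent_mse(a, b):
    """Mean squared error between two latent vectors (plain Python lists)."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def total_loss(consistency_terms, sigreg_loss):
    """Average the latent consistency terms, then add the regularizer value."""
    mse_loss = sum(consistency_terms) / len(consistency_terms)
    return mse_loss + sigreg_loss

# Hypothetical latent pairs standing in for the three consistency terms.
reconstruction_term = latent_mse([1.0, 2.0], [1.0, 2.0])  # 0.0
clean_crop_term = latent_mse([0.0, 0.0], [1.0, 1.0])      # 1.0
masked_crop_term = latent_mse([0.0, 0.0], [2.0, 0.0])     # 2.0

loss = total_loss(
    [reconstruction_term, clean_crop_term, masked_crop_term],
    sigreg_loss=0.25,  # placeholder value, not a real SIGReg computation
)
```

Every term is computed on latent vectors, so an MSE appears in the code without any image-space reconstruction loss.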

The baseline in the README is direct masked-image reconstruction:

```text
recon = model(masked_image)
loss = MSE(recon, image)
```

The README says the baseline does not use a target encoder, latent consistency between branches, crop consistency losses, or SIGReg.

## Local Interpretation Notes

Self-Teaching Autoencoder is closest to the wiki's JEPA and representation-collapse threads, but it should not be treated as a standard JEPA paper. It is an autoencoder experiment that uses latent consistency as the reconstruction signal while keeping a decoder in the loop.

The interesting mechanism is the combination of three constraints:

- latent agreement instead of direct pixel reconstruction;
- transformations that shrink the encoder's equivalence classes;
- SIGReg or target-encoder structure to reduce collapse and shortcut agreement.

The main wiki-relevant hypothesis is that reconstruction quality and representation quality may not need to be trained as separate stages. A decoder can be constrained as part of the representation-learning loop if the objective makes decoded outputs stay on the input distribution instead of rewarding pixel averages directly.

## Limitations

- No formal paper, venue review, independent reproduction, or standard leaderboard evidence is available in this snapshot.
- The strongest results are author-reported from a blog and code repository.
- The experiments are vision autoencoding and masked autoencoding, not numeric time-series modeling or action-conditioned world modeling.
- The baseline is intentionally simple and the blog notes a caveat: the direct baseline's encoder is trained from masked input to clean output, so it is not trained to handle clean inputs in the same way.
- The masked setting still does not fully resolve mode averaging in ambiguous regions.
- Engagement counts on X, including replies and likes, are unstable and should not be used as credibility evidence.
