Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Source

Core Claim

Open VLMs can reach frontier-class performance without distilling from proprietary VLMs if the data pipeline is carefully built. Molmo is the model family; PixMo is the new open data collection.

Key Contributions

  • Releases the Molmo VLM family and the PixMo datasets with open weights, open data, code, and evaluations, per the paper’s openness framing.
  • Builds dense captions, free-form image Q&A, pointing data, counting data, and document/chart data for the core PixMo collection without any supervision generated by external VLMs.
  • Uses a standard vision-encoder plus decoder-only-LLM architecture with a connector, overlapping multi-crop image processing, and attention pooling over patch windows (see the connector sketch after this list).
  • Reports that MolmoE-1B nearly matches GPT-4V on the paper’s benchmark and user-preference evaluations, while the larger Molmo models sit between GPT-4V and GPT-4o, or second only to GPT-4o, depending on the comparison.
  • Includes extensive ablations on data, vision encoders, connector design, crop resolution, and training details.
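
A minimal sketch of the attention-pooling connector idea from the architecture bullet above, assuming PyTorch; dimensions and names are made up for illustration, and the overlapping multi-crop preprocessing is omitted. This is not the Molmo implementation.

```python
import torch
import torch.nn as nn

class AttentionPoolingConnector(nn.Module):
    """Pools each window of vision-encoder patch features into a single token
    with a learned query, then projects it into the LLM embedding space."""

    def __init__(self, vision_dim=1024, llm_dim=4096, window=4, num_heads=8):
        super().__init__()
        self.window = window  # number of patch features pooled per output token
        self.query = nn.Parameter(torch.randn(1, 1, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_feats):
        # patch_feats: (batch, num_patches, vision_dim); assumes num_patches
        # is divisible by the window size.
        b, n, d = patch_feats.shape
        windows = patch_feats.reshape(b * (n // self.window), self.window, d)
        q = self.query.expand(windows.size(0), -1, -1)
        pooled, _ = self.attn(q, windows, windows)   # one pooled token per window
        pooled = pooled.reshape(b, n // self.window, d)
        return self.proj(pooled)                     # (batch, n // window, llm_dim)
```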

Method Notes

For the wiki, Molmo/PixMo serves mainly as a data-engine and openness reference: it shows that high-quality, human-curated or purpose-built multimodal data can substitute for proprietary-model distillation in a VLM pipeline.
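
For concreteness, a hypothetical record layout for PixMo-style pointing/counting supervision; field names and types are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PointingExample:
    image_path: str                    # path or URL of the source image
    query: str                         # e.g. "point to every coffee mug"
    points: List[Tuple[float, float]]  # normalized (x, y) locations of matches
    count: int                         # number of matching objects, len(points)
```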

The architecture is still a familiar VLM stack: image encoder, connector, and LLM. The novelty is not that it removes the language model or vision encoder, but that the data and training recipe are unusually open and carefully controlled.
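
A toy composition of that stack, assuming PyTorch and an LLM that accepts precomputed input embeddings; component names are placeholders, not the Molmo code.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vision_encoder, connector, llm, text_embed):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a ViT returning patch features
        self.connector = connector            # e.g. the AttentionPoolingConnector above
        self.llm = llm                        # decoder-only LM taking embeddings directly
        self.text_embed = text_embed          # the LLM's token-embedding table

    def forward(self, images, input_ids):
        image_tokens = self.connector(self.vision_encoder(images))  # (B, T_img, D)
        text_tokens = self.text_embed(input_ids)                    # (B, T_txt, D)
        # Image tokens are prepended to the text sequence and decoded as usual.
        return self.llm(torch.cat([image_tokens, text_tokens], dim=1))
```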

Evidence And Results

The paper reports strong academic benchmark and human-evaluation results. It also provides useful ablations showing that data quality and training details can move VLM performance as much as architectural novelty.

Alex Notes

  • Alex flagged the small open-source VLM angle: a much smaller open model can outperform GPT-4V under the paper’s framing and approach GPT-4o-level behavior on some comparisons.
  • Verify the exact “10x less size” wording before reusing it as a factual claim outside these notes.

Limitations

  • VLM benchmark results can be highly sensitive to prompts, evaluation harnesses, and data overlap.
  • The strongest model variants are open-weight/open-data only in a class-specific sense: component openness varies with the choice of LLM backbone and vision encoder.
  • Image-text results should not be directly projected onto numeric time-series modeling without a data-interface argument.

Open Questions

  • Which PixMo data-engine practices transfer to time-series/text datasets?
  • Can open-data VLM training recipes reduce dependence on proprietary teacher models for temporal multimodal models?
  • Which VLM evaluation practices should be copied into time-series reasoning benchmarks?