Florence-2

Summary

Florence-2 is a prompt-based vision foundation model from Microsoft (Azure AI), trained on FLD-5B, a dataset built by an iterative data engine. It serializes task outputs as plain text or location tokens, so a single sequence-to-sequence model can cover captioning, object detection, grounding, segmentation, OCR-style tasks, and related vision-language tasks.
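To make the "location tokens" idea concrete, here is a minimal sketch of how a bounding box could be turned into a short token string that a text decoder can emit. The bin count of 1000, the `<loc_N>` spelling, and the helper names `box_to_loc_tokens` and `serialize_detection` are assumptions for illustration; they are not taken from this note and may differ from the released tokenizer.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

NUM_BINS = 1000  # assumed number of quantization bins per axis


def box_to_loc_tokens(box: Box, width: int, height: int,
                      num_bins: int = NUM_BINS) -> str:
    """Quantize pixel coordinates into bins and spell them as tokens."""

    def quantize(value: float, size: int) -> int:
        # Scale to [0, num_bins) and clamp so the image edge stays in range.
        index = int(value / size * num_bins)
        return min(max(index, 0), num_bins - 1)

    x0, y0, x1, y1 = box
    bins = [quantize(x0, width), quantize(y0, height),
            quantize(x1, width), quantize(y1, height)]
    # Hypothetical token spelling: one token per quantized coordinate.
    return "".join(f"<loc_{b}>" for b in bins)


def serialize_detection(label: str, box: Box, width: int, height: int) -> str:
    """Emit a label followed by its four location tokens as one string."""
    return label + box_to_loc_tokens(box, width, height)


if __name__ == "__main__":
    print(serialize_detection("cat", (320, 240, 640, 480), 640, 480))
```

Because every output is a string of this kind, detection, grounding, and captioning can share one decoder and one loss; only the prompt changes.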

Role In The Wiki

Florence-2 is the main local example of a practical foundation model whose value comes from an iterative dataset engine. It complements Perception Encoder: Perception Encoder uses a video data engine to create better image-video training pairs, while Florence-2 uses a broader visual annotation engine to create dense, multi-task supervision.

For time-series research, Florence-2 is a cross-domain pattern for label scarcity: create a first labeled corpus, train a seed model, use that model to expand and repair labels, then repeat with audits and filters.
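The loop described above can be sketched on a toy problem. The example below is only an illustration under stated assumptions: each time-series window is reduced to a single scalar feature, the "seed model" is a nearest-centroid classifier, and the filter is a fixed confidence margin. The names `train_centroids`, `predict`, and `data_engine` are hypothetical, and the label-repair step is left out; only expansion with filtering is shown.

```python
from statistics import mean
from typing import Dict, List, Tuple

Labeled = List[Tuple[float, int]]  # (window feature, class label 0 or 1)


def train_centroids(labeled: Labeled) -> Dict[int, float]:
    """Seed model: one centroid per class (assumes both classes are present)."""
    return {c: mean([x for x, y in labeled if y == c]) for c in (0, 1)}


def predict(centroids: Dict[int, float], x: float) -> Tuple[int, float]:
    """Return the nearest class and a margin used as a confidence proxy."""
    d0, d1 = abs(x - centroids[0]), abs(x - centroids[1])
    label = 0 if d0 <= d1 else 1
    return label, abs(d0 - d1)


def data_engine(seed: Labeled, pool: List[float], rounds: int = 3,
                min_margin: float = 1.0) -> Tuple[Labeled, List[float]]:
    """Train, pseudo-label confident items, retrain; return leftovers."""
    labeled = list(seed)
    remaining = list(pool)
    for _ in range(rounds):
        model = train_centroids(labeled)
        accepted, rejected = [], []
        for x in remaining:
            label, margin = predict(model, x)
            if margin >= min_margin:
                accepted.append((x, label))  # filter passed: keep pseudo-label
            else:
                rejected.append(x)           # too ambiguous: hold back
        if not accepted:
            break                            # nothing new to learn from
        labeled.extend(accepted)
        remaining = rejected
    return labeled, remaining


if __name__ == "__main__":
    seed = [(0.0, 0), (10.0, 1)]
    grown, leftover = data_engine(seed, [1.0, 9.0, 5.2, 4.0])
    print(len(grown), leftover)
```

The items left in `remaining` are the ones the filter refused to label; in a real pipeline these are natural candidates for the audit step rather than for automatic labeling.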

Data Engine

The Florence data engine starts with image collections and existing partial labels, adds synthetic labels from specialist models, filters noisy text and region annotations, and iteratively refines the dataset with a trained multitask model. FLD-5B contains 126M images and 5.4B annotations across text, region-text, and text-phrase-region formats.
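As one possible reading of the "filters noisy ... region annotations" step, the sketch below drops low-confidence regions and then removes near-duplicate boxes with a greedy overlap check. The `RegionAnnotation` type, the thresholds, and the choice of score filtering plus non-maximum suppression are assumptions for illustration, not a description of the actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class RegionAnnotation:
    box: Box
    text: str
    score: float  # confidence from the (hypothetical) specialist model


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def filter_regions(anns: List[RegionAnnotation], min_score: float = 0.5,
                   max_iou: float = 0.7) -> List[RegionAnnotation]:
    """Keep confident regions, highest score first, skipping heavy overlaps."""
    kept: List[RegionAnnotation] = []
    for ann in sorted(anns, key=lambda a: a.score, reverse=True):
        if ann.score < min_score:
            continue  # too uncertain to keep as a synthetic label
        if all(iou(ann.box, k.box) <= max_iou for k in kept):
            kept.append(ann)
    return kept


if __name__ == "__main__":
    anns = [
        RegionAnnotation((0, 0, 10, 10), "dog", 0.9),
        RegionAnnotation((1, 1, 10, 10), "dog", 0.8),
        RegionAnnotation((20, 20, 30, 30), "ball", 0.3),
        RegionAnnotation((50, 50, 60, 60), "tree", 0.6),
    ]
    print([a.text for a in filter_regions(anns)])
```

The same shape extends to the other annotation formats: a text annotation needs only a text-quality filter, while a text-phrase-region triplet would need both the phrase and its box to pass.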

Evidence

Official Artifacts