Microsoft Releases Florence-2: A Novel Vision Foundation Model with a Unified, Prompt-based Representation for a Variety of Computer Vision and Vision-Language Tasks

Research on artificial general intelligence (AGI) systems has moved markedly toward pretrained, adaptable representations that offer task-agnostic benefits across applications. Natural language processing (NLP) is a clear example of this trend: increasingly capable models adapt to new tasks and domains given only simple instructions. The success of NLP has inspired a similar strategy in computer vision.

One of the main obstacles to a universal representation for vision tasks is the need for broad perceptual ability. In contrast to NLP, computer vision must handle complex visual information such as object locations, masked contours, and attributes. Achieving a universal representation therefore requires mastering a range of challenging tasks, and the effort faces two distinctive hurdles. The first is the lack of comprehensive visual annotations, which prevents building a foundation model that captures the subtleties of spatial hierarchy and semantic granularity. The second is the absence of a unified pretraining framework in computer vision that uses a single network architecture to integrate semantic granularity and spatial hierarchy seamlessly.

A team of Microsoft researchers introduces Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. It addresses both problems, the lack of a consistent architecture and the scarcity of comprehensive data, by using a single prompt-based representation for all vision tasks. Multitask learning requires high-quality annotated data at broad scale. To supply it, the team's data engine generates FLD-5B, a comprehensive visual dataset with a total of 5.4B annotations for 126M images, a significant improvement over labor-intensive manual annotation. The engine has two efficient processing modules. Instead of relying on a single person to annotate each image, as was done in the past, the first module uses specialized models that annotate images automatically and collaboratively. When multiple models reach a consensus, the resulting interpretation of the image is more reliable and less biased, an idea reminiscent of the wisdom of crowds.
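The consensus idea can be illustrated with a toy majority vote. This is only a sketch of the general technique, not the FLD-5B data engine itself: the function name `consensus_labels`, the `min_agreement` threshold, and the sample labels are all hypothetical.

```python
from collections import Counter


def consensus_labels(predictions, min_agreement=0.5):
    """Keep the labels proposed by more than `min_agreement` of the models.

    `predictions` holds one list of labels per annotating model.
    This voting rule is an illustrative assumption, not the published method.
    """
    if not predictions:
        return []
    votes = Counter()
    for labels in predictions:
        votes.update(set(labels))  # each model votes at most once per label
    threshold = min_agreement * len(predictions)
    return sorted(label for label, count in votes.items() if count > threshold)


# Hypothetical outputs from three specialized models for the same image.
model_outputs = [
    ["dog", "ball", "grass"],
    ["dog", "grass"],
    ["dog", "ball", "cat"],
]
agreed = consensus_labels(model_outputs)
print(agreed)
```

Here "cat" is proposed by only one of the three models, so it falls below the majority threshold and is dropped, while the labels backed by at least two models are kept.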

The Florence-2 model stands out for its design. It integrates an image encoder and a multi-modality encoder-decoder into a sequence-to-sequence (seq2seq) architecture, following the NLP community's goal of developing versatile models with a consistent framework. This architecture can handle a variety of vision tasks without task-specific architectural changes. Because all annotations in the FLD-5B dataset are converted into textual outputs, the model can be trained with a unified multitask learning approach and consistent optimization, using the same loss function as the objective for every task. Florence-2 is a multi-purpose vision foundation model that can perform grounding, captioning, and object detection using a single model and a single set of parameters, with each task activated by a textual prompt.
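To make the idea of turning annotations into text concrete, the sketch below serializes bounding boxes into a target string that a seq2seq model could be trained to emit. The details are assumptions for illustration and are not taken from this article: the `<loc_N>` token format, the 1000 quantization bins, the prompt wording, and the helper names `quantize`, `box_to_text`, and `build_example` are all hypothetical.

```python
NUM_BINS = 1000  # assumed number of quantization bins per axis


def quantize(value, size, num_bins=NUM_BINS):
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value * num_bins / size)
    return min(max(bin_index, 0), num_bins - 1)


def box_to_text(label, box, width, height):
    """Serialize one labeled box as text: the label followed by four location tokens."""
    x1, y1, x2, y2 = box
    bins = [
        quantize(x1, width),
        quantize(y1, height),
        quantize(x2, width),
        quantize(y2, height),
    ]
    return label + "".join(f"<loc_{b}>" for b in bins)


def build_example(task_prompt, annotations, width, height):
    """Pair a textual task prompt with the textual target for one image."""
    target = "".join(
        box_to_text(label, box, width, height) for label, box in annotations
    )
    return {"prompt": task_prompt, "target": target}


# Hypothetical detection annotations for a 640x480 image.
example = build_example(
    "Locate the objects in the image.",
    [("dog", (64, 120, 320, 480)), ("ball", (400, 360, 480, 440))],
    width=640,
    height=480,
)
print(example["target"])
# dog<loc_100><loc_250><loc_500><loc_999>ball<loc_625><loc_750><loc_750><loc_916>
```

Once every task's output is a token sequence like this, captioning, grounding, and detection differ only in the prompt and the target text, so one standard sequence loss can cover all of them.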

Despite its compact size, Florence-2 competes with larger specialized models. After fine-tuning on publicly available human-annotated data, Florence-2 achieves new state-of-the-art results on the RefCOCO/+/g benchmarks. The pre-trained model also outperforms supervised and self-supervised models on downstream tasks, including COCO object detection and instance segmentation and ADE20K semantic segmentation. It yields improvements of 6.9, 5.5, and 5.9 points on the COCO and ADE20K datasets using the Mask-RCNN, DINO, and UperNet frameworks, respectively, and training is 4 times more efficient than with models pre-trained on ImageNet.

With its pre-trained universal representation, Florence-2 has proven highly effective: the experimental results show that it improves a wide range of downstream tasks.

Check out the Paper and Model Card. All credit for this research goes to the researchers of this project.

The post Microsoft Releases Florence-2: A Novel Vision Foundation Model with a Unified, Prompt-based Representation for a Variety of Computer Vision and Vision-Language Tasks appeared first on MarkTechPost.
