
    Florence-2: How it works and how to use it

    July 27, 2024

    LLMs have dominated the past several years, seeing rapid integration across a litany of industries. The primary driver of this sudden emergence into public view, and of the subsequent adoption by wider industry, was the discovery that scaling LLMs renders them generally capable agents.

    Additionally, beyond being generally performant on many tasks, LLMs are also “foundational models”. This means they can be used as the bedrock for more complicated processing pipelines that cannot rely on in-context learning alone, whether through explicit finetuning or architectural adaptation for specialized tasks.

    What happened to language with LLMs is now happening to vision with Large Vision Models (LVMs). The field of Computer Vision has lagged in developing large foundational models akin to LLMs due to difficulties that are intrinsic to developing such models. Microsoft’s new LVM Florence-2 represents a significant step towards this goal of a unified vision model, demonstrating impressive results with a compact, parameter-efficient model.

    Florence-2 is capable of performing a wide variety of image-language tasks, producing image-level, region-level, and pixel-level outputs out of the box after pre-training.

    In this article, we’ll look at an overview of what Florence-2 can do, how it works, and how to use it.

    What can Florence-2 do?

    Several years ago in Natural Language Processing, bespoke models were trained and deployed for tasks like summarization, question answering, and more. LLMs marked a seminal moment by converging these specialized architectures into a single, general-purpose model with a simple training paradigm.

    Learn more about the development of LLMs in our Introduction to LLMs for Generative AI.

    Florence-2 follows in the footsteps of LLMs, leveraging a unified architecture and simple training paradigm paired with a vast amount of data to become competent at many different tasks. Florence-2 can be considered a sort of GPT-2.5 – being able to perform tasks like:

    • Captioning
    • Optical Character Recognition
    • Object Detection
    • Region Detection
    • Region Segmentation
    • Vocabulary Segmentation

    and more, all with one set of weights and no architectural modifications, simply by providing special task tokens to the model at inference. This stands in contrast to older LVMs, which are great at transfer learning but not as good at performing tasks in isolation given simple instructions.
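
    For illustration, the publicly released checkpoints expose these capabilities through special task-prompt strings. The names below come from the Florence-2 model card on Hugging Face rather than from this article, and only a representative subset is shown:

    # A representative subset of Florence-2's task-prompt tokens (from the Hugging Face model card)
    TASK_PROMPTS = [
        "<CAPTION>", "<DETAILED_CAPTION>", "<MORE_DETAILED_CAPTION>",  # captioning
        "<OD>", "<REGION_PROPOSAL>", "<DENSE_REGION_CAPTION>",         # detection / region tasks
        "<OCR>", "<OCR_WITH_REGION>",                                  # optical character recognition
        "<CAPTION_TO_PHRASE_GROUNDING>",                               # phrase grounding
        "<REFERRING_EXPRESSION_SEGMENTATION>",                         # vocabulary segmentation
        "<OPEN_VOCABULARY_DETECTION>",                                 # vocabulary detection
    ]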

    Building such a model is a complicated task that presents a number of unique challenges – let’s take a look at one of the central challenges now.

    Challenges

    A central challenge of developing an LVM is instilling in it an ability to work at different levels of semantic and spatial resolution – what the authors refer to as “semantic granularity” and “spatial hierarchy”. A general vision model requires an ability to complete tasks at any combination of levels across these two axes:

    A general vision model must be able to operate at various degrees of granularity, both spatial and semantic (adapted from [1])

    Florence-2 addresses this challenge by following in the footsteps of LLMs. That is, Florence-2 follows the “playbook” of LLM research, building on top of other recent vision research, to learn general representations that are useful for many tasks.

    Following this playbook requires three things:

    • A singular network architecture
    • A large, sufficiently diverse dataset
    • A unified pre-training framework

    Let’s take a look at each of these components in turn.

    We’ll be following this article up with a deep dive into the challenges of building LVMs and a review of their development – sign up for our newsletter to stay in the loop when we release new content.

    How does Florence-2 work?

    Architecture

    Florence-2 is designed in a simple way – to take in textual prompts (in addition to the image being processed), and generate textual results. Unifying the way in which diverse types of information – masked contours, locations, etc. – are input to the model permits (i) a unified training procedure and (ii) easy extension to other tasks without the need for architectural modifications, which are the two hallmarks of a foundational model.

    In particular, Florence-2 adopts a classic seq2seq transformer architecture into which visual, textual, and location embeddings are fed. The input image and prompt are mapped into embeddings, which are then simply concatenated and passed into a standard Transformer encoder-decoder. Check out our video on word embeddings to learn more about how embedding models work.

    Florence-2’s architecture [1]

    Additional information

    Language and region token embeddings are generated with an extended language tokenizer/word embedding layer, where the tokenizer’s vocabulary is expanded to include location tokens (in order to accommodate region information). Input images are encoded with a DaViT encoder into a series of visual tokens, and then projected to the dimensionality of the language embeddings. The visual and language embeddings are then concatenated and fed into a standard transformer encoder-decoder.
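
    To make this data flow concrete, here is a minimal, runnable sketch of the input pathway using toy stand-ins for the real modules (DaViT, the extended tokenizer/embedding layer, and the encoder-decoder). It illustrates only the shape bookkeeping and is not the actual Florence-2 implementation; all sizes are illustrative.

    import torch
    import torch.nn as nn

    # Illustrative sizes only; the real model dimensions differ
    d_vision, d_model, vocab_size = 256, 512, 52000

    vision_encoder = nn.Linear(d_vision, d_vision)       # stand-in for the DaViT image encoder
    visual_proj = nn.Linear(d_vision, d_model)           # project visual tokens to the language dimension
    text_embedding = nn.Embedding(vocab_size, d_model)   # vocabulary extended with <loc_*> tokens
    encoder_decoder = nn.Transformer(d_model=d_model, batch_first=True)  # standard encoder-decoder

    def florence2_style_forward(image_patches, prompt_ids, target_ids):
        visual_embeds = visual_proj(vision_encoder(image_patches))      # (B, Nv, d_model)
        text_embeds = text_embedding(prompt_ids)                        # (B, Nt, d_model)
        encoder_input = torch.cat([visual_embeds, text_embeds], dim=1)  # concatenate multimodal tokens
        decoder_input = text_embedding(target_ids)                      # decoder emits text + location tokens
        return encoder_decoder(encoder_input, decoder_input)

    out = florence2_style_forward(torch.randn(1, 49, d_vision),
                                  torch.randint(0, vocab_size, (1, 8)),
                                  torch.randint(0, vocab_size, (1, 16)))
    print(out.shape)  # torch.Size([1, 16, 512])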

    Florence-2’s architecture is essentially a standard Transformer encoder-decoder – it is not particularly special, and in fact it shouldn’t be. The critical lesson learned from LLMs is that architectural minutiae are not particularly important; instead, simply scaling a model’s size and training data can yield significant performance gains.

    In many ways, it is really the dataset that is most important.

    Dataset – FLD-5B

    To be a foundational model, Florence-2 needs to be trained to have image-level understanding (e.g., captioning), spatial understanding (e.g., region detection), and visual-semantic alignment (e.g., phrase grounding). Imbuing the model with understanding at these various levels demands a large, diverse dataset.

    To this end, the authors curate FLD-5B – an open-source dataset of 5.4 billion annotations on 126 million images. This one-to-many relationship between images and annotations lets them “get more juice” out of their collected images, and potentially enables learning more powerful representations by processing the same images in distinct ways.

    FLD-5B contains text annotations, text-region annotations, and text-phrase-region annotations (linking regions to phrases within captions that provide global context about the image) at various degrees of granularity for the images in the dataset:

    FLD-5B contains many annotations for a single image to facilitate learning across several spatial and semantic resolutions [1]

    Compiling such a large, intricate dataset is a complicated task. Unlike LLMs, for which you can scrape basically any available, human-written textual data online, image-text pairs are harder to come by, and “naturally-occurring” annotations for tasks like object detection are virtually nonexistent.

    To circumvent this issue, the authors use specialist AI models to generate labels for training. For example, they use specialized region-detection models and APIs to generate region annotations for images collected in the dataset.

    This process of using AI models to generate labels (called pseudolabels) isn’t uncommon. In fact, we’ve done this at AssemblyAI as first described in our blog on Conformer-1, a previous generation of our Speech-to-Text model. The authors use a variety of specialized models and a multi-step label generation “engine” to generate useful annotations across a wide range of spatial and semantic granularities:

    FLD-5B’s annotation generation procedure [1]

    For any images sourced from a pre-existing dataset (like ImageNet), the human-generated annotations are merged with the synthetic ones, and then this entire dataset is used to train Florence-2. 

    They then use Florence-2 to generate new pseudolabels (measuring an improvement over many of the specialized models), filter these pseudolabels for quality, and then mix them into the set of original annotations and iteratively repeat this process. That is, Florence-2 itself is used as a specialist model to compile FLD-5B.
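
    To make the shape of that loop concrete, here is a hedged, toy sketch of the iterative labeling process described above. Every name in it (specialists, florence2, is_high_quality) is a hypothetical stand-in, not the paper’s actual data engine.

    # Toy sketch of an iterative pseudolabeling loop; not the paper's actual data engine
    def build_annotations(images, specialists, florence2, is_high_quality, rounds=2):
        # Round 0: initial pseudolabels come from a consensus of specialist models
        annotations = {name: [model(img) for model in specialists]
                       for name, img in images.items()}
        for _ in range(rounds):
            florence2.fit(images, annotations)            # train on the current annotation set
            for name, img in images.items():              # then relabel with Florence-2 itself
                candidate = florence2.predict(img)
                if is_high_quality(candidate):            # filter for quality ...
                    annotations[name].append(candidate)   # ... and mix back into the originals
        return annotations

    # Minimal usage with trivial stand-ins, just to show the flow
    class ToyFlorence:
        def fit(self, images, annotations): pass
        def predict(self, img): return f"florence label for {img}"

    images = {"img_0": "raw pixels"}
    specialists = [lambda img: f"specialist label for {img}"]
    print(build_annotations(images, specialists, ToyFlorence(), lambda c: True))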

    A final note of interest: the original pseudolabels are the result of many models working together to reach a consensus. Such techniques are not uncommon, and AssemblyAI also utilized this technique when training Conformer-2, the predecessor to our new Universal-1 Speech-to-Text model.

    Learn more about Universal-1 in our Research report.

    Training

    Florence-2’s training is standard language modeling with cross-entropy loss; the paper [1] provides a table of the annotation types, prompt inputs, and outputs used for the various tasks.

    Locations are specified as the special tokens <loc_x><loc_y>, where x and y are integers in the range [0, 1000]. These integers indicate coordinates on the image with a 0.1% resolution in either direction. That is, the string <loc_250><loc_500> means the point on the image that is 25% of the way across the horizontal axis, and 50% of the way down the vertical axis (<loc_0><loc_0> is the top left corner of the image).
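
    As a concrete illustration of this convention, the hypothetical helpers below (not part of the Florence-2 release) convert between pixel coordinates and location tokens:

    # Hypothetical helpers illustrating the <loc_x><loc_y> convention described above
    def point_to_loc_tokens(x_px, y_px, width, height):
        # Quantize pixel coordinates into 0.1%-resolution bins
        x_bin = int(1000 * x_px / width)
        y_bin = int(1000 * y_px / height)
        return f"<loc_{x_bin}><loc_{y_bin}>"

    def loc_tokens_to_point(x_bin, y_bin, width, height):
        # Map location bins back to approximate pixel coordinates
        return (x_bin / 1000 * width, y_bin / 1000 * height)

    print(point_to_loc_tokens(160, 240, 640, 480))  # -> <loc_250><loc_500>
    print(loc_tokens_to_point(250, 500, 640, 480))  # -> (160.0, 240.0)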

    By extending the tokenizer’s vocabulary to include location tokens, Florence-2 can process region-specific information in a unified learning format. This eliminates the need for task-specific heads and allows for a more data-centric approach.

    Now that we understand how Florence-2 works, let’s take a look at how to use it.

    How to use Florence-2

    The easiest way to get started with Florence-2 is to check out the Colab associated with this article. 



      

    It is an adapted version of the original Florence-2 inference Colab where the code sections have been re-organized with more context and useful information to help you understand how to use this model. These sections use helper functions defined in utils.py, which you can check out on GitHub.
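
    If you prefer not to use the Colab, the snippet below sketches roughly what utils.run_example does, based on the usage shown in the Florence-2 model card on Hugging Face. It is a sketch, not the article’s exact helper: the Colab’s run_example accepts a TaskType enum, while this version takes the raw task-prompt string, and the model ID, image URL, and generation settings here are assumptions rather than values from this article.

    import copy  # used by later snippets to avoid drawing on the original image
    import requests
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model_id = "microsoft/Florence-2-large"  # assumed checkpoint; smaller variants also exist
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    # Example image; the Colab loads its own image as `image` / `image_rgb`
    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    image_rgb = image.convert("RGB")

    def run_example(task_prompt, image, text_input=""):
        # Build the prompt from the task token plus any additional text input
        prompt = task_prompt + text_input
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3,
        )
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
        # Parse the raw output (including <loc_*> tokens) into a structured result
        return processor.post_process_generation(
            generated_text, task=task_prompt, image_size=(image.width, image.height)
        )

    print(run_example("<CAPTION>", image_rgb))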

    Let’s look at how to run each task now:

    Captioning

    Florence-2 supports 3 types of captioning tasks at different levels of detail – we run each here and print the results:

    tasks = [utils.TaskType.CAPTION,
             utils.TaskType.DETAILED_CAPTION,
             utils.TaskType.MORE_DETAILED_CAPTION,]

    for task in tasks:
        results = utils.run_example(task, image_rgb)
        print(f'{task.value}{results[task]}')

    Here is the output:

    <CAPTION>
    A green car parked in front of a stop sign.

    <DETAILED_CAPTION>
    The image shows a blue Volkswagen Beetle parked in front of a yellow building with two brown doors and a red stop sign. The sky is a mix of blue and white, and there are a few green trees in the background.

    <MORE_DETAILED_CAPTION>
    The image shows a vintage car parked on the side of a street. The car is a light blue Volkswagen Beetle with a white stripe running along the side. It is parked in front of a yellow building with two wooden doors. Above the car, there is a red stop sign with the word “STOP” written in white letters. The sky is blue and there are trees in the background.

    Optical Character Recognition (OCR)

    Florence-2 can perform whole-image OCR, optionally returning bounding boxes. To perform OCR only, you can use the following code:

    task = utils.TaskType.OCR
    results = utils.run_example(task, image_rgb)
    print('Detected Text: ', results[task])

    Here is the output:

    Detected Text:  
    STOP

    To run OCR with regions returned, you can use this code:

    task = utils.TaskType.OCR_WITH_REGION
    results = utils.run_example(task, image_rgb)

    # Boxes drawn directly to image, so copy to avoid adulterating image for later tasks
    image_copy = copy.deepcopy(image)
    utils.draw_ocr_bboxes(image_copy, results[task])

    Here is the output:

    Object detection

    Florence-2 can detect objects, optionally returning either categorical or descriptive labels. Here’s how to run object detection with Florence-2:

    tasks = [utils.TaskType.REGION_PROPOSAL,
             utils.TaskType.OBJECT_DETECTION,
             utils.TaskType.DENSE_REGION_CAPTION,]

    for task in tasks:
        results = utils.run_example(task, image_rgb)
        print(task.value)
        utils.plot_bbox(results[task], image)

    And here is a gif of the results (the resizing of the last frame is just due to the GIF’s processing):

    Segmentation

    To segment an object in a particular region, supply Florence-2 with a bounding box in the format “<loc_x1><loc_y1><loc_x2><loc_y2>”, where the first point is the top left corner of the bounding box and the second is the bottom right.

    Here’s how to perform segmentation with Florence-2:

    top_left = [702, 575]
    bottom_right = [866, 772]

    task_prompt = utils.TaskType.REG_TO_SEG
    # converts coordinates to the format "<loc_x1><loc_y1><loc_x2><loc_y2>"
    text_input = utils.convert_relative_to_loc(top_left + bottom_right)

    results = utils.run_example(task_prompt, image_rgb, text_input=text_input)

    bbox_coords = utils.convert_relative_to_bbox(top_left + bottom_right, image)
    box = {'bboxes': [bbox_coords], 'labels': ['']}

    # draw input bounding box and output segment
    image_copy = copy.deepcopy(image)
    image_copy = utils.draw_polygons(image_copy, results[task_prompt], fill_mask=True)
    utils.plot_bbox(box, image_copy)

    Here is the result, where both the input bounding box and output segment are drawn:

    Region description

    Florence-2 can also perform region description, which is effectively object detection or captioning for a subset of the image. Region description maps a region to either a category or descriptive annotation – here’s a script to perform it:

    top_left = [52, 332]
    bottom_right = [932, 774]
    text_input = utils.convert_relative_to_loc(top_left + bottom_right)
    bbox = utils.convert_relative_to_bbox(top_left + bottom_right, image)

    for task_prompt in [utils.TaskType.REGION_TO_CATEGORY, utils.TaskType.REGION_TO_DESCRIPTION]:
        results = utils.run_example(task_prompt, image_rgb, text_input=text_input)
        text_result = results[task_prompt].strip().split('<')[0]

        box = {'bboxes': [bbox], 'labels': [text_result]}
        utils.plot_bbox(box, image)

    Here’s a GIF of the results:

    Phrase grounding

    Given a textual input, Florence-2 can perform object detection conditioned on that input: the model detects the objects described in the text prompt, linking each identified region to an associated phrase. Here we show how Florence-2 detects and labels two regions that correspond to the two salient phrases in the provided input:

    task_prompt = utils.TaskType.PHRASE_GROUNDING
    results = utils.run_example(task_prompt, image_rgb, text_input="A green car parked in front of a yellow building.")
    utils.plot_bbox(results[utils.TaskType.PHRASE_GROUNDING], image)

    Here’s the output:

    Vocabulary detection

    Florence-2 can also identify an object given a single phrase. Vocabulary detection is similar to phrase grounding, except that it is one-to-one instead of one-to-many and that it can also detect text in the image. Here’s how to perform vocabulary detection with Florence-2:

    task_prompt = utils.TaskType.OPEN_VOCAB_DETECTION
    results = utils.run_example(task_prompt, image_rgb, text_input="a turqoise car")
    bbox_results = utils.convert_to_od_format(results[utils.TaskType.OPEN_VOCAB_DETECTION])
    utils.plot_bbox(bbox_results, image)

    Here’s the output:

    Here is an example of how vocabulary detection can be used to find text that is present in the image:

    task_prompt = utils.TaskType.OPEN_VOCAB_DETECTION
    results = utils.run_example(task_prompt, image_rgb, text_input="stop")
    bbox_results = utils.convert_to_od_format(results[utils.TaskType.OPEN_VOCAB_DETECTION])
    utils.plot_bbox(bbox_results, image)

    Here’s the output:

    Vocabulary Segmentation

    Florence-2 can also perform vocabulary segmentation, which is like vocabulary detection except for the fact that it identifies segments rather than regions:

    task_prompt = utils.TaskType.RES
    results = utils.run_example(task_prompt, image_rgb, text_input="a stop sign")
    image_copy = copy.deepcopy(image)
    utils.draw_polygons(image_copy, results[utils.TaskType.RES], fill_mask=True)

    Here’s the result:

    Note that, unlike vocabulary detection, vocabulary segmentation does not work on text in the image.

    Cascaded Tasks with Florence-2

    We can chain multiple tasks together to develop more complicated processing pipelines using only Florence-2. For example, here we supply nothing but an image. From this image, we derive a description using captioning, and then identify salient regions with phrase grounding:

    # Get a caption
    task_prompt = utils.TaskType.CAPTION
    results = utils.run_example(task_prompt, image_rgb)

    # Use the output as the input into the next task (phrase grounding)
    text_input = results[task_prompt]
    task_prompt = utils.TaskType.PHRASE_GROUNDING
    results = utils.run_example(task_prompt, image_rgb, text_input)

    results[utils.TaskType.DETAILED_CAPTION] = text_input

    print(text_input)
    utils.plot_bbox(results[utils.TaskType.PHRASE_GROUNDING], image)

    Here is the printed output:

    A green car parked in front of a stop sign.

    From there, we can go a step further and use region segmentation to identify the segments in these regions:

    polygons = []
    task_prompt = utils.TaskType.REG_TO_SEG

    # Run region to segmentation for each region identified by phrase grounding
    for box in results[utils.TaskType.PHRASE_GROUNDING.value]['bboxes']:
        box = utils.convert_bbox_to_relative(box, image)
        text_input = utils.convert_relative_to_loc(box)

        run_results = utils.run_example(task_prompt, image_rgb, text_input=text_input)
        polygons += run_results[task_prompt]['polygons']

    # Construct the (empty) labels list required to build the input dict for plotting
    labels = [[''] * len(polygon) for polygon in polygons]

    seg_results = dict(polygons=polygons, labels=labels)

    # draw the output
    image_copy = copy.deepcopy(image)
    utils.draw_polygons(image_copy, seg_results, fill_mask=True)

    So, we have taken nothing but an image, generated a caption for it, identified the salient objects in the image, and then identified these objects’ outlines.

    What’s next?

    Florence-2 is a big step forward – it can perform a variety of tasks, demonstrates strong zero-shot performance, attains state-of-the-art results on several tasks once finetuned, and is a compact model for its level of performance. Additionally, the contribution of FLD-5B to the open source community is significant and will aid in future research.

    Additional work needs to be done to train an LVM that can perform novel tasks via in-context learning as LLMs can. We’ll have another article coming out in the coming weeks with a deeper dive into the development of LVMs, so subscribe to our newsletter to stay in the loop when we release new content.

    Alternatively, check out some of our other resources on AI progress, like:

    • How Reinforcement Learning from AI Feedback works
    • RLHF vs RLAIF for language model alignment
    • The Full Story of Large Language Models and RLHF
