Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Model Summary
This Hub repository contains a Hugging Face transformers implementation of the Florence-2 model from Microsoft.
Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Florence-2 can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation. It leverages our FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.
OCR with region output format:
{'<OCR_WITH_REGION>': {'quad_boxes': [[x1, y1, x2, y2, x3, y3, x4, y4], ...], 'labels': ['text1', ...]}}
prompt = "<OCR_WITH_REGION>"
run_example(prompt)
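The output format above stores each quadrilateral as a flat list of eight coordinates. As a minimal sketch of how such a result might be consumed, the snippet below groups those coordinates into (x, y) corner points and pairs them with their labels. The helper name `parse_ocr_with_region` and the sample values are hypothetical and are not part of the model's API; the sketch assumes only the dictionary layout shown above.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def parse_ocr_with_region(result: Dict[str, dict]) -> List[Tuple[str, List[Point]]]:
    """Pair each recognized text label with its four corner points.

    Hypothetical helper: assumes the {'<OCR_WITH_REGION>': {...}} layout
    documented above and nothing else about the model output.
    """
    payload = result["<OCR_WITH_REGION>"]
    quad_boxes = payload["quad_boxes"]
    labels = payload["labels"]
    if len(quad_boxes) != len(labels):
        raise ValueError("quad_boxes and labels must have the same length")

    regions = []
    for label, quad in zip(labels, quad_boxes):
        if len(quad) != 8:
            raise ValueError(f"expected 8 coordinates per quad box, got {len(quad)}")
        # Group the flat list [x1, y1, ..., x4, y4] into (x, y) corner pairs.
        corners = [(quad[i], quad[i + 1]) for i in range(0, 8, 2)]
        regions.append((label, corners))
    return regions


# Made-up sample in the documented format (not real model output).
sample = {
    "<OCR_WITH_REGION>": {
        "quad_boxes": [[10, 20, 110, 20, 110, 50, 10, 50]],
        "labels": ["text1"],
    }
}

for label, corners in parse_ocr_with_region(sample):
    print(label, corners)
```

The resulting corner lists can be passed to whatever drawing or cropping routine is already in use, since most of them accept a sequence of (x, y) points.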
For more detailed examples, please refer to the notebook.
Benchmarks
Florence-2 Zero-shot performance
The following table presents the zero-shot performance of generalist vision foundation models on image captioning and object detection evaluation tasks. These models have not been exposed to the training data of the evaluation tasks during their training phase.
| Method | #params | COCO Cap. test CIDEr | NoCaps val CIDEr | TextCaps val CIDEr | COCO Det. val2017 mAP |
|---|---|---|---|---|---|
| Flamingo | 80B | 84.3 | - | - | - |
| Florence-2-base | 0.23B | 133.0 | 118.7 | 70.1 | 34.7 |
| Florence-2-large | 0.77B | 135.6 | 120.8 | 72.8 | 37.5 |
The following table continues the comparison with performance on other vision-language evaluation tasks.
| Method | Flickr30k test R@1 | RefCOCO val Accuracy | RefCOCO test-A Accuracy | RefCOCO test-B Accuracy | RefCOCO+ val Accuracy | RefCOCO+ test-A Accuracy | RefCOCO+ test-B Accuracy | RefCOCOg val Accuracy | RefCOCOg test Accuracy | RefCOCO RES val mIoU |
|---|---|---|---|---|---|---|---|---|---|---|
| Kosmos-2 | 78.7 | 52.3 | 57.4 | 47.3 | 45.5 | 50.7 | 42.2 | 60.6 | 61.7 | - |
| Florence-2-base | 83.6 | 53.9 | 58.4 | 49.7 | 51.5 | 56.4 | 47.9 | 66.3 | 65.1 | 34.6 |
| Florence-2-large | 84.4 | 56.3 | 61.6 | 51.4 | 53.6 | 57.9 | 49.9 | 68.0 | 67.0 | 35.8 |
Florence-2 finetuned performance
We fine-tune Florence-2 models on a collection of downstream tasks, resulting in two generalist models, Florence-2-base-ft and Florence-2-large-ft, that can handle a wide range of downstream tasks.
The table below compares the performance of specialist and generalist models on various captioning and Visual Question Answering (VQA) tasks. Specialist models are fine-tuned specifically for each task, whereas generalist models are fine-tuned in a task-agnostic manner across all tasks. The symbol "▲" indicates the usage of external OCR as input.
| Method | # Params | COCO Caption Karpathy test CIDEr | NoCaps val CIDEr | TextCaps val CIDEr | VQAv2 test-dev Acc | TextVQA test-dev Acc | VizWiz VQA test-dev Acc |
|---|---|---|---|---|---|---|---|
| Specialist Models | | | | | | | |
| CoCa | 2.1B | 143.6 | 122.4 | - | 82.3 | - | - |
| BLIP-2 | 7.8B | 144.5 | 121.6 | - | 82.2 | - | - |
| GIT2 | 5.1B | 145.0 | 126.9 | 148.6 | 81.7 | 67.3 | 71.0 |
| Flamingo | 80B | 138.1 | - | - | 82.0 | 54.1 | 65.7 |
| PaLI | 17B | 149.1 | 127.0 | 160.0▲ | 84.3 | 58.8 / 73.1▲ | 71.6 / 74.4▲ |
| PaLI-X | 55B | 149.2 | 126.3 | 147.0 / 163.7▲ | 86.0 | 71.4 / 80.8▲ | 70.9 / 74.6▲ |
| Generalist Models | | | | | | | |
| Unified-IO | 2.9B | - | 100.0 | - | 77.9 | - | 57.4 |
| Florence-2-base-ft | 0.23B | 140.0 | 116.7 | 143.9 | 79.7 | 63.6 | 63.6 |
| Florence-2-large-ft | 0.77B | 143.3 | 124.9 | 151.1 | 81.7 | 73.5 | 72.6 |
| Method | # Params | COCO Det. val2017 mAP | Flickr30k test R@1 | RefCOCO val Accuracy | RefCOCO test-A Accuracy | RefCOCO test-B Accuracy | RefCOCO+ val Accuracy | RefCOCO+ test-A Accuracy | RefCOCO+ test-B Accuracy | RefCOCOg val Accuracy | RefCOCOg test Accuracy | RefCOCO RES val mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Specialist Models | | | | | | | | | | | | |
| SeqTR | - | - | - | 83.7 | 86.5 | 81.2 | 71.5 | 76.3 | 64.9 | 74.9 | 74.2 | - |
| PolyFormer | - | - | - | 90.4 | 92.9 | 87.2 | 85.0 | 89.8 | 78.0 | 85.8 | 85.9 | 76.9 |
| UNINEXT | 0.74B | 60.6 | - | 92.6 | 94.3 | 91.5 | 85.2 | 89.6 | 79.8 | 88.7 | 89.4 | - |
| Ferret | 13B | - | - | 89.5 | 92.4 | 84.4 | 82.8 | 88.1 | 75.2 | 85.8 | 86.3 | - |
| Generalist Models | | | | | | | | | | | | |
| UniTAB | - | - | - | 88.6 | 91.1 | 83.8 | 81.0 | 85.4 | 71.6 | 84.6 | 84.7 | - |
| Florence-2-base-ft | 0.23B | 41.4 | 84.0 | 92.6 | 94.8 | 91.5 | 86.8 | 91.7 | 82.2 | 89.8 | 82.2 | 78.0 |
| Florence-2-large-ft | 0.77B | 43.4 | 85.2 | 93.4 | 95.3 | 92.0 | 88.3 | 92.9 | 83.6 | 91.2 | 91.7 | 80.5 |
BibTeX and citation info
@article{xiao2023florence,
title={Florence-2: Advancing a unified representation for a variety of vision tasks},
author={Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu},
journal={arXiv preprint arXiv:2311.06242},
year={2023}
}