llava-gemma-2b is a large multimodal model (LMM) trained using the LLaVA-v1.5 framework, with the 2-billion-parameter google/gemma-2b-it model as the language backbone and a CLIP-based vision encoder. This model card was created by Benjamin Consolvo and the authors listed above.
Intended Use

| Intended Use | Description |
| --- | --- |
| Primary intended uses | The model has been finetuned for multimodal benchmark evaluations, but it can also be used as a multimodal chatbot. |
| Primary intended users | Anyone using or evaluating multimodal models. |
| Out-of-scope uses | This model is not intended for uses that require high levels of factuality, high-stakes situations, mental health or medical applications, generating misinformation or disinformation, impersonating others, facilitating or inciting harassment or violence, or any use that could lead to a violation of a human right under the UN Declaration of Human Rights. |
For current usage, see usage.py or the following code block:
```python
import requests
from PIL import Image
from transformers import (
    LlavaForConditionalGeneration,
    AutoTokenizer,
    AutoProcessor,
    CLIPImageProcessor
)

checkpoint = "Intel/llava-gemma-2b"

# For transformers versions < 4.41.1, use the processor class shipped in this repo instead:
# from processing_llavagemma import LlavaGemmaProcessor
# processor = LlavaGemmaProcessor(
#     tokenizer=AutoTokenizer.from_pretrained(checkpoint),
#     image_processor=CLIPImageProcessor.from_pretrained(checkpoint)
# )

# Load model and processor
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

# Prepare inputs using the Gemma chat template
prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True
)
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```
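If a GPU is available, inference can optionally be run on it. The following is a minimal sketch using standard transformers/PyTorch device placement; it is not part of the original usage example and assumes the `checkpoint`, `prompt`, and `image` variables defined above.

```python
import torch

# Optional: run inference on a GPU when one is available; fall back to CPU otherwise.
# (Illustrative addition, not part of the original card.)
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LlavaForConditionalGeneration.from_pretrained(checkpoint).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
generate_ids = model.generate(**inputs, max_length=30)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```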
For straightforward use as a chatbot (without images), you can modify the last portion of code to the following:
```python
# Prepare inputs using the Gemma chat template (text-only prompt, no <image> token)
prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "Summarize the following paragraph? In this paper, we introduced LLaVA-Gemma, a compact vision-language model leveraging the Gemma Large Language Model in two variants, Gemma-2B and Gemma-7B. Our work provides a unique opportunity for researchers to explore the trade-offs between computational efficiency and multimodal understanding in small-scale models. The availability of both variants allows for a comparative analysis that sheds light on how model size impacts performance in various tasks. Our evaluations demonstrate the versatility and effectiveness of LLaVA-Gemma across a range of datasets, highlighting its potential as a benchmark for future research in small-scale vision-language models. With these models, future practitioners can optimize the performance of small-scale multimodal models more directly."}],
    tokenize=False,
    add_generation_prompt=True
)

# No image is needed for text-only chat:
# url = "https://www.ilankelman.org/stopsigns/australia.jpg"
# image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=None, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=300)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```
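Note that `batch_decode` returns the full sequence, including the echoed prompt. If only the model's reply is wanted, a common pattern (not shown in the original card) is to drop the prompt tokens before decoding; a minimal sketch, assuming the `inputs` and `generate_ids` from the block above:

```python
# Decode only the newly generated tokens, skipping the echoed prompt
prompt_length = inputs["input_ids"].shape[1]
reply_ids = generate_ids[:, prompt_length:]
reply = processor.batch_decode(reply_ids, skip_special_tokens=True)[0]
print(reply)
```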
Factors

| Factors | Description |
| --- | --- |
| Groups | - |
| Instrumentation | - |
| Environment | Trained for 4 hours on 8 Intel Gaudi 2 AI accelerators. |
| Card Prompts | Model training and deployment on alternate hardware and software will change model performance. |
Metrics

| Metrics | Description |
| --- | --- |
| Model performance measures | We evaluate the LLaVA-Gemma models on a collection of benchmarks similar to those used in other LMM works: GQA; MME; MM-Vet; POPE (accuracy and F1); VQAv2; MMVP; and the image subset of ScienceQA. Our experiments provide insights into the efficacy of various design choices within the LLaVA framework. (A sketch of how the POPE accuracy and F1 figures are computed appears below this table.) |
| Decision thresholds | - |
| Approaches to uncertainty and variability | - |
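POPE evaluates hallucination by asking yes/no questions about whether an object is present in an image; the accuracy and F1 numbers reported in the table below are standard binary-classification metrics over those answers. The following is a small illustrative sketch of that computation only; the helper name and example data are ours, not taken from the LLaVA-Gemma evaluation code.

```python
# Illustrative only: accuracy and F1 over binary yes/no answers, as reported for POPE.
# The example predictions and labels below are made up for demonstration.
def pope_accuracy_and_f1(predictions, labels):
    tp = sum(p == "yes" and l == "yes" for p, l in zip(predictions, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(predictions, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(predictions, labels))
    correct = sum(p == l for p, l in zip(predictions, labels))

    accuracy = correct / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1

acc, f1 = pope_accuracy_and_f1(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"])
print(acc, f1)  # 0.75 0.666...
```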
Training Data
The model was trained using the LLaVA-v1.5 data mixture, which consists of:
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP
- 158K GPT-generated multimodal instruction-following data
- 450K academic-task-oriented VQA data
- 40K ShareGPT data
Performance of LLaVA-Gemma models across seven benchmarks. Bold metric values indicate the strongest performance amongst the LLaVA-Gemma models. The bottom two rows show the self-reported performance of LLaVA-Phi-2 and LLaVA-v1.5, respectively. The bolded gemma-2b-it backbone in the first row (CLIP vision encoder with pretrained connector) is the model presented in this model card.
| LM Backbone | Vision Model | Pretrained Connector | GQA | MME cognition | MME perception | MM-Vet | POPE accuracy | POPE F1 | VQAv2 | ScienceQA Image | MMVP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **gemma-2b-it** | CLIP | Yes | 0.531 | 236 | 1130 | 17.7 | 0.850 | **0.839** | 70.65 | 0.564 | 0.287 |
| gemma-2b-it | CLIP | No | 0.481 | 248 | 935 | 13.1 | 0.784 | 0.762 | 61.74 | 0.549 | 0.180 |
| gemma-2b-it | DinoV2 | Yes | **0.587** | 307 | **1133** | **19.1** | **0.853** | 0.838 | **71.37** | 0.555 | 0.227 |
| gemma-2b-it | DinoV2 | No | 0.501 | **309** | 959 | 14.5 | 0.793 | 0.772 | 61.65 | 0.568 | 0.180 |
| gemma-7b-it | CLIP | Yes | 0.472 | 253 | 895 | 18.2 | 0.848 | 0.829 | 68.7 | 0.625 | **0.327** |
| gemma-7b-it | CLIP | No | 0.472 | 278 | 857 | **19.1** | 0.782 | 0.734 | 65.1 | **0.636** | 0.240 |
| gemma-7b-it | DinoV2 | Yes | 0.519 | 257 | 1021 | 14.3 | 0.794 | 0.762 | 65.2 | 0.628 | **0.327** |
| gemma-7b-it | DinoV2 | No | 0.459 | 226 | 771 | 12.2 | 0.693 | 0.567 | 57.4 | 0.598 | 0.267 |
| Phi-2b | CLIP | Yes | - | - | 1335 | 28.9 | - | 0.850 | 71.4 | 0.684 | - |
| Llama-2-7b | CLIP | Yes | 0.620 | 348 | 1511 | 30.6 | 0.850 | 0.859 | 78.5 | 0.704 | 46.1 |
Ethical Considerations
Intel is committed to respecting human rights and avoiding causing or contributing to adverse impacts on human rights. See Intel's Global Human Rights Principles. Intel's products and software are intended only to be used in applications that do not cause or contribute to adverse impacts on human rights.
| Ethical Considerations | Description |
| --- | --- |
| Data | The model was trained using the LLaVA-v1.5 data mixture as described above. |
| Human life | The model is not intended to inform decisions central to human life or flourishing. |
| Mitigations | No additional risk mitigation strategies were considered during model development. |
| Risks and harms | This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm. |
| Use cases | - |
Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
Citation details
@misc{hinck2024llavagemma,
title={LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model},
author={Musashi Hinck and Matthew L. Olson and David Cobbley and Shao-Yen Tseng and Vasudev Lal},
year={2024},
eprint={2404.01331},
url={https://arxiv.org/abs/2404.01331},
archivePrefix={arXiv},
primaryClass={cs.CL}
}