Traditional BERT models struggle with VMware-specific words (Tanzu, vSphere, etc.), technical terms, and compound words (see Weaknesses of WordPiece Tokenization).
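As a quick illustration of this weakness, a stock BERT tokenizer fragments VMware terms into multiple word pieces (the exact pieces depend on the vocabulary, so treat the expected output below as indicative only):

```python
# Illustrative only: show how an unmodified BERT vocabulary splits
# VMware-specific terms into subword pieces.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
print(tokenizer.tokenize('Tanzu runs on vSphere'))
# Expect fragments such as ['tan', '##zu', ...] rather than whole-word
# tokens (the exact pieces depend on the vocabulary).
```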
We pretrained our vBERT model to address the aforementioned issues using our BERT Pretraining Library.
We replaced the first 1k unused tokens of BERT's vocabulary with VMware-specific terms to create a modified vocabulary. We then pretrained the 'bert-large-uncased' model for an additional 66K steps (60K at a maximum sequence length of 128 and 6K at 512) on VMware domain data.
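The exact term list and tooling are not published, but the vocabulary swap itself can be sketched roughly as follows; the output directory name and the four example terms are illustrative assumptions, not the actual 1k terms:

```python
# Sketch: repurpose BERT's reserved [unusedN] vocabulary slots for
# domain-specific terms by rewriting vocab.txt. The term list here is an
# illustrative assumption; the actual 1k VMware terms are not public.
from transformers import BertTokenizer

domain_terms = ['tanzu', 'vsphere', 'vsan', 'nsx']  # illustrative subset

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
tokenizer.save_pretrained('modified-bert')  # writes vocab.txt locally

vocab_path = 'modified-bert/vocab.txt'
with open(vocab_path, encoding='utf-8') as f:
    vocab = f.read().splitlines()

# BERT's vocabulary reserves placeholder tokens [unused0], [unused1], ...
replaced = 0
for i, token in enumerate(vocab):
    if token.startswith('[unused') and replaced < len(domain_terms):
        vocab[i] = domain_terms[replaced]
        replaced += 1

with open(vocab_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab) + '\n')

# Reload: the domain terms now map to single whole-word tokens.
tokenizer = BertTokenizer.from_pretrained('modified-bert')
print(tokenizer.tokenize('Tanzu runs on vSphere'))
```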
Intended Use
The model functions as a VMware-specific language model.
How to Use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('VMware/vbert-2021-large')
model = BertModel.from_pretrained('VMware/vbert-2021-large')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('VMware/vbert-2021-large')
model = TFBertModel.from_pretrained('VMware/vbert-2021-large')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
Training
- Datasets
Publicly available VMware text data, such as VMware Docs and Blogs, was used to create the pretraining corpus (~320,000 documents, sourced in May 2021).
- Preprocessing (a minimal illustrative sketch follows this list)
  - Decoding HTML
  - Decoding Unicode
  - Stripping repeated characters
  - Splitting compound words
  - Spelling correction
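The internal vNLP Preprocessor is not public, so the following is only a minimal sketch of the first three steps in plain Python; compound-word splitting and spelling correction need external dictionaries or models and are omitted:

```python
# Minimal, assumption-laden sketch of the listed cleaning steps; this is
# not VMware's internal vNLP Preprocessor.
import html
import re
import unicodedata

def preprocess(text: str) -> str:
    # Decoding HTML: turn entities like &amp; back into characters.
    text = html.unescape(text)
    # Decoding Unicode: normalize to a canonical composed form.
    text = unicodedata.normalize('NFKC', text)
    # Stripping repeated characters: collapse runs of 3+ repeats to 2.
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    # Compound-word splitting and spelling correction are omitted; they
    # require resources the model card does not describe.
    return text

print(preprocess('Use vSphere&amp; Tanzu tooooogether!!!!!'))
```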
- Model performance measures
We benchmarked vBERT on various VMware-specific NLP downstream tasks (information retrieval, classification, etc.).
The model scored higher than the 'bert-base-uncased' model on all benchmarks.
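The benchmark suites themselves are internal. As an illustrative (not official) example of using the model's features for a retrieval-style task, one could mean-pool the hidden states and compare texts by cosine similarity:

```python
# Illustrative only: the VMware benchmark tasks are internal. This sketch
# mean-pools vBERT hidden states and scores text similarity with cosine.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('VMware/vbert-2021-large')
model = BertModel.from_pretrained('VMware/vbert-2021-large')
model.eval()

def embed(text: str) -> torch.Tensor:
    encoded = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        output = model(**encoded)
    # Mean-pool the token embeddings into one sentence vector.
    return output.last_hidden_state.mean(dim=1).squeeze(0)

query = embed('How do I configure vSAN storage policies?')
doc = embed('Steps to set up storage policies in vSAN.')
print(torch.cosine_similarity(query, doc, dim=0).item())
```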
Limitations and bias
Since the model is further pretrained from the original BERT checkpoint, it may carry the same biases embedded in the original BERT model.
The data needs to be preprocessed using our internal vNLP Preprocessor (not available to the public) to maximize the model's performance.