There are two versions of the model, AraBERTv0.1 and AraBERTv1, the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were split using the Farasa Segmenter.
We evaluate the AraBERT models on different downstream tasks and compare them to mBERT and other state-of-the-art models (to the best of our knowledge). The tasks were Sentiment Analysis on 6 different datasets (HARD, ASTD-Balanced, ArsenTD-Lev, LABR), Named Entity Recognition with the ANERcorp, and Arabic Question Answering on Arabic-SQuAD and ARCD.
AraBERTv2
What's New!
AraBERT now comes in 4 new variants to replace the old v1 versions; see the table below.
All models are available on the HuggingFace model page under the aubmindlab name. Checkpoints are available in PyTorch, TF2 and TF1 formats.
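A minimal sketch of loading one of the new checkpoints with the transformers library; the name bert-base-arabertv2 is used here only as an example, and any other model under aubmindlab should load the same way:

```python
from transformers import AutoTokenizer, AutoModel

# Example checkpoint name; any model under "aubmindlab" loads the same way
model_name = "aubmindlab/bert-base-arabertv2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# NOTE: apply the preprocessing function (see the Preprocessing section below)
# to raw text first, since v1/v2 models expect Farasa-segmented input
inputs = tokenizer("و+ لن نبالغ إذا قل +نا إن هاتف", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768]) for a base model
```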
Better Pre-Processing and New Vocab
We identified an issue with AraBERTv1's wordpiece vocabulary. The issue came from punctuation and numbers that were still attached to words when the wordpiece vocabulary was learned. We now insert a space between numbers and characters and around punctuation characters.
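As an illustration only (not the library's actual implementation), this space-insertion rule can be sketched with two regular expressions:

```python
import re

def insert_spaces(text: str) -> str:
    """Illustrative approximation: pad punctuation and digit runs with spaces."""
    # Put a space around punctuation characters (assumed character set, for illustration)
    text = re.sub(r'([؟!،.,:;"()\[\]{}«»])', r' \1 ', text)
    # Put a space between digit runs and any adjacent characters
    text = re.sub(r'(\d+)', r' \1 ', text)
    # Collapse the resulting runs of whitespace
    return re.sub(r'\s+', ' ', text).strip()

print(insert_spaces("صدر عام2020، بتكلفة 100$!"))
# صدر عام 2020 ، بتكلفة 100 $ !
```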
The new vocabulary was learned using the BertWordpieceTokenizer from the tokenizers library, and should now support the Fast tokenizer implementation from the transformers library.
P.S.: All the old BERT code should work with the new BERT, just change the model name and check the new preprocessing function.
Please read the section on how to use the preprocessing function.
Bigger Dataset and More Compute
We used ~3.5 times more data and trained for longer. For dataset sources, see the Dataset section.
| Model | Hardware | Num of examples with seq len (128 / 512) | 128 (Batch Size / Num of Steps) | 512 (Batch Size / Num of Steps) | Total Steps | Total Time (in Days) |
| --- | --- | --- | --- | --- | --- | --- |
| AraBERTv0.2-base | TPUv3-8 | 420M / 207M | 2560 / 1M | 384 / 2M | 3M | - |
| AraBERTv0.2-large | TPUv3-128 | 420M / 207M | 13440 / 250K | 2056 / 300K | 550K | - |
| AraBERTv2-base | TPUv3-8 | 520M / 245M | 13440 / 250K | 2056 / 300K | 550K | - |
| AraBERTv2-large | TPUv3-128 | 520M / 245M | 13440 / 250K | 2056 / 300K | 550K | - |
| AraBERT-base (v1/v0.1) | TPUv2-8 | - | 512 / 900K | 128 / 300K | 1.2M | 4 |
Dataset
The pretraining data used for the new AraBERT model is also used for Arabic GPT2 and ELECTRA.
The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation).
For the new dataset, we added the unshuffled OSCAR corpus (after thoroughly filtering it) to the dataset previously used in AraBERTv1, but without the websites that we previously crawled:
- Assafir news articles. A huge thank you to Assafir for giving us the data.
Preprocessing
It is recommended to apply our preprocessing function before training/testing on any dataset.
Install farasapy to segment text for AraBERT v1 & v2:

```bash
pip install farasapy
```
```python
from arabert.preprocess import ArabertPreprocessor

model_name = "bert-base-arabert"
arabert_prep = ArabertPreprocessor(model_name=model_name)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
# Farasa segmentation splits prefixes and suffixes, marking them with a "+"
arabert_prep.preprocess(text)
>>> "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري"
```
where MODEL_NAME is any model under the aubmindlab name.

via wget:
1. Go to the tf1_model.tar.gz file on huggingface.co/models/aubmindlab/MODEL_NAME.
2. Copy the oid sha256.
3. Then run `wget https://cdn-lfs.huggingface.co/aubmindlab/MODEL_NAME/INSERT_THE_SHA_HERE` (ex: for aragpt2-base: `wget https://cdn-lfs.huggingface.co/aubmindlab/aragpt2-base/3766fc03d7c2593ff2fb991d275e96b81b0ecb2098b71ff315611d052ce65248`)
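As an alternative sketch to copying the sha manually, the huggingface_hub library can resolve the LFS pointer for you (assuming the file in the repo is named tf1_model.tar.gz, as above):

```python
from huggingface_hub import hf_hub_download

# Fetch the TF1 checkpoint archive; hf_hub_download resolves the LFS pointer
path = hf_hub_download(repo_id="aubmindlab/aragpt2-base", filename="tf1_model.tar.gz")
print(path)  # local cache path of the downloaded file
```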
If you used this model, please cite us as follows. (Google Scholar has our BibTeX entry wrong (missing name); use this one instead.)
```bibtex
@inproceedings{antoun2020arabert,
  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
  pages={9}
}
```
Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, we couldn't have done it without this program, and to the AUB MIND Lab members for the continuous support. Also thanks to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.