You can find more information in our paper AraGPT2.
The code in this repository was used to train all GPT2 variants. The code supports training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.
GPT2-base and GPT2-medium use the code from the gpt2 folder and can train models from the minimaxir/gpt-2-simple repository. These models were trained using the lamb optimizer, follow the same architecture as gpt2, and are fully compatible with the transformers library.
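Because these variants are fully compatible with transformers, they should load with the stock GPT-2 classes. A minimal sketch, assuming the hub id aubmindlab/aragpt2-base (inferred from the aragpt2-large id used later in this document, so verify the exact id):
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumed hub id; "aubmindlab/aragpt2-medium" would be the medium variant
BASE_MODEL = "aubmindlab/aragpt2-base"
model = GPT2LMHeadModel.from_pretrained(BASE_MODEL)
tokenizer = GPT2TokenizerFast.from_pretrained(BASE_MODEL)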
GPT2-large and GPT2-mega were trained using the imcaspar/gpt2-ml library and follow the grover architecture. You can use the PyTorch classes found in grover/modeling_gpt2.py as a direct replacement for the corresponding classes in the transformers library (it should support version v4.x of transformers).
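A minimal sketch of that drop-in usage, assuming grover/modeling_gpt2.py is on your Python path and defines a GPT2LMHeadModel class (the exact class name is an assumption; check the file for what it actually exports):
# Load the grover-architecture checkpoint with the local drop-in class
from grover.modeling_gpt2 import GPT2LMHeadModel  # replaces transformers.GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-large")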
Both models were trained using the adafactor optimizer, since the adam and lamb optimizers use too much memory, causing the model to not fit even a single batch on a TPU core.
AraGPT2 is trained on the same large Arabic Dataset as AraBERTv2.
NOTE: The model expects the input to be preprocessed using the arabert library; otherwise, the model will not be able to generate the correct output.
Testing the model using transformers:
The model code is now hosted on HuggingFace, so you need to pass the trust_remote_code flag. The model can be used as follows:
from transformers import AutoModelForCausalLM, GPT2TokenizerFast, pipeline
from arabert.preprocess import ArabertPreprocessor

MODEL_NAME = 'aubmindlab/aragpt2-large'
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)

text = ""
text_clean = arabert_prep.preprocess(text)

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline = pipeline(
    "text-generation", model=model, tokenizer=tokenizer, trust_remote_code=True
)

# feel free to try different decoding settings
generation_pipeline(text_clean,
    pad_token_id=tokenizer.eos_token_id,
    num_beams=10,
    max_length=200,
    top_p=0.9,
    repetition_penalty=3.0,
    no_repeat_ngram_size=3)[0]['generated_text']
To create the pretraining TFRecords from raw text, run:
python create_pretraining_data.py \
 --input_file=<RAW TEXT FILE with documents/article separated by an empty line> \
 --output_file=<OUTPUT TFRecord> \
 --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
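The raw text file should contain one document or article per block, with an empty line between documents. A small illustrative sketch of preparing such a file (the file name and articles list are placeholders, not part of the repository):
# Write raw articles to a single file, separating documents with a blank line,
# which is the input format expected by create_pretraining_data.py.
articles = ["...first article text...", "...second article text..."]
with open("raw_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(articles) + "\n")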
The pretraining data used for the new AraBERT model is also used for GPT2 and ELECTRA.
The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation).
For the new dataset, we added the unshuffled OSCAR corpus, after thoroughly filtering it, to the dataset previously used in AraBERTv1, but without the websites that we previously crawled:
Assafir news articles. A huge thank you to Assafir for giving us the data.
Disclaimer
The text generated by GPT2 Arabic is automatically generated by a neural network model trained on a large amount of texts, which does not represent the authors' or their institutes' official attitudes and preferences. The text generated by GPT2 Arabic should only be used for research and scientific purposes. If it infringes on your rights and interests or violates social morality, please do not propagate it.
If you used this model, please cite us as:
@inproceedings{antoun-etal-2021-aragpt2,
title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
author = "Antoun, Wissam and
Baly, Fady and
Hajj, Hazem",
booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
month = apr,
year = "2021",
address = "Kyiv, Ukraine (Virtual)",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
pages = "196--207",
}
Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the AUB MIND Lab members for their continuous support, and to Yakshof and Assafir for data and storage access. Another thanks to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.