You can find more information in our paper AraGPT2.
The code in this repository was used to train all GPT2 variants. The code supports training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.
GPT2-base and GPT2-medium use the code from the gpt2 folder and can train models from the minimaxir/gpt-2-simple repository. These models were trained using the lamb optimizer, follow the same architecture as gpt2, and are fully compatible with the transformers library.
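Because these variants are fully compatible with transformers, they should load with the stock GPT-2 classes. A minimal sketch, assuming the hub id aubmindlab/aragpt2-base (inferred from the aragpt2-large id used later in this document, so verify the exact id):
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumed hub id; "aubmindlab/aragpt2-medium" would be the medium variant
BASE_MODEL = "aubmindlab/aragpt2-base"
model = GPT2LMHeadModel.from_pretrained(BASE_MODEL)
tokenizer = GPT2TokenizerFast.from_pretrained(BASE_MODEL)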
GPT2-large and GPT2-mega were trained using the imcaspar/gpt2-ml library and follow the grover architecture. You can use the PyTorch classes found in grover/modeling_gpt2.py as a direct replacement for the corresponding classes in the transformers library (it should support version v4.x of transformers).
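A minimal sketch of that drop-in usage, assuming grover/modeling_gpt2.py is on your Python path and defines a GPT2LMHeadModel class (the exact class name is an assumption; check the file for what it actually exports):
# Load the grover-architecture checkpoint with the local drop-in class
from grover.modeling_gpt2 import GPT2LMHeadModel  # replaces transformers.GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-large")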
Both models were trained using the adafactor optimizer, since the adam and lamb optimizers use too much memory, causing the model to not fit even a single batch on a TPU core.
AraGPT2 is trained on the same large Arabic Dataset as AraBERTv2.
NOTE: The model expects the input to be preprocessed using the arabert library; otherwise, the model will not be able to generate the correct output.
Testing the model using transformers:
The model code is now hosted on HuggingFace, so you need to pass the trust_remote_code flag. The model can be used as follows:
from transformers import AutoModelForCausalLM, GPT2TokenizerFast, pipeline
from arabert.preprocess import ArabertPreprocessor

MODEL_NAME = 'aubmindlab/aragpt2-large'
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)

text = ""
text_clean = arabert_prep.preprocess(text)

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline = pipeline(
    "text-generation", model=model, tokenizer=tokenizer, trust_remote_code=True
)

# feel free to try different decoding settings
generation_pipeline(text_clean,
    pad_token_id=tokenizer.eos_token_id,
    num_beams=10,
    max_length=200,
    top_p=0.9,
    repetition_penalty=3.0,
    no_repeat_ngram_size=3)[0]['generated_text']
To create the pretraining TFRecords from raw text, run:
python create_pretraining_data.py \
 --input_file=<RAW TEXT FILE with documents/article separated by an empty line> \
 --output_file=<OUTPUT TFRecord> \
 --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
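The raw text file should contain one document or article per block, with an empty line between documents. A small illustrative sketch of preparing such a file (the file name and articles list are placeholders, not part of the repository):
# Write raw articles to a single file, separating documents with a blank line,
# which is the input format expected by create_pretraining_data.py.
articles = ["...first article text...", "...second article text..."]
with open("raw_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(articles) + "\n")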
The pretraining data used for the new AraBERT model is also used for GPT2 and ELECTRA.
The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation).
For the new dataset, we added the unshuffled OSCAR corpus, after thoroughly filtering it, to the dataset previously used in AraBERTv1, but without the websites that we previously crawled:
Assafir news articles. A huge thank you to Assafir for giving us the data.
Disclaimer
The text generated by GPT2 Arabic is automatically generated by a neural network model trained on a large amount of texts, which does not represent the authors' or their institutes' official attitudes and preferences. The text generated by GPT2 Arabic should only be used for research and scientific purposes. If it infringes on your rights and interests or violates social morality, please do not propagate it.
If you used this model, please cite us as:
@inproceedings{antoun-etal-2021-aragpt2,
title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
author = "Antoun, Wissam and
Baly, Fady and
Hajj, Hazem",
booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
month = apr,
year = "2021",
address = "Kyiv, Ukraine (Virtual)",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
pages = "196--207",
}
Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the AUB MIND Lab members for their continuous support, and to Yakshof and Assafir for data and storage access. Another thanks to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.