starcoder2-15b-FP8

Model Overview
  • Model Architecture: starcoder2-15b
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Weight quantization: FP8
    • Activation quantization: FP8
  • Intended Use Cases: Intended for commercial and research use in English.
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
  • Release Date: 8/1/2024
  • Version: 1.0
  • License(s): bigcode-openrail-m
  • Model Developers: Neural Magic

Quantized version of starcoder2-15b.

It achieves an average score of 50.70 on the HumanEval+ benchmark, whereas the unquantized model achieves 50.25.

Model Optimizations

This model was obtained by quantizing the weights and activations of starcoder2-15b to FP8 data type, ready for inference with vLLM >= 0.5.2. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
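
The checkpoint can be loaded with vLLM like any other model. A minimal sketch is shown below; the prompt and sampling settings are illustrative and not taken from the model card:

from vllm import LLM, SamplingParams

# vLLM >= 0.5.2 picks up the FP8 quantization config stored in the checkpoint.
llm = LLM(model="neuralmagic/starcoder2-15b-FP8")
sampling_params = SamplingParams(temperature=0.2, max_tokens=256)

# StarCoder2 is a base code model, so a bare code prompt works well.
outputs = llm.generate(["def fibonacci(n):"], sampling_params)
print(outputs[0].outputs[0].text)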

Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scale maps the FP8 representations of the quantized weights and activations. LLM Compressor is used for quantization with 512 sequences from UltraChat.
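
Concretely, per-tensor symmetric quantization stores one scale per tensor, chosen so that the largest-magnitude value maps to the FP8 maximum. The snippet below is a plain-PyTorch sketch of the idea, not the LLM Compressor implementation, and it assumes the E4M3 FP8 format (maximum value 448):

import torch

def fp8_e4m3_quantize(tensor: torch.Tensor):
    # One scale for the whole tensor: the largest magnitude maps to the E4M3 max (448.0).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = tensor.abs().max().float() / fp8_max
    quantized = (tensor / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return quantized, scale

def fp8_dequantize(quantized: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Dequantization applies the same single linear scale in reverse.
    return quantized.to(torch.float16) * scale

weight = torch.randn(4096, 4096, dtype=torch.float16)
q, scale = fp8_e4m3_quantize(weight)
print("max abs error:", (fp8_dequantize(q, scale) - weight).abs().max().item())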

Creation

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below. A slight modification was needed to accommodate this model's parameters: running the code as-is throws an index error, which is resolved by replacing the erroneous line with max_quant_shape = param.shape[0].

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
"""

model_stub = "bigcode/starcoder2-15b"
model_name = model_stub.split("/")[-1]

# Spread the fp16 model across the available GPUs for calibration.
device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=8, torch_dtype=torch.float16
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.float16, device_map=device_map
)
tokenizer = AutoTokenizer.from_pretrained(model_stub)

output_dir = f"./{model_name}-FP8"

DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 4096

# Draw 512 calibration sequences from the UltraChat SFT split.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {
        "text": " ".join([msg["content"] for msg in example["messages"]])
    }

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# One-shot calibration and quantization; the compressed FP8 checkpoint is saved to output_dir.
oneshot(
    model=model,
    output_dir=output_dir,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    save_compressed=True,
)

Evaluation

The model was evaluated on the HumanEval+ benchmark with Neural Magic's fork of the EvalPlus implementation and the vLLM engine, using the following commands:

python codegen/generate.py --model neuralmagic/starcoder2-15b-FP8 --temperature 0.2 --n_samples 50 --resume --root ~ --dataset humaneval
python evalplus/sanitize.py ~/humaneval/neuralmagic--starcoder2-15b-FP8_vllm_temp_0.2
evalplus.evaluate --dataset humaneval --samples ~/humaneval/neuralmagic--starcoder2-15b-FP8_vllm_temp_0.2-sanitized

Accuracy

HumanEval+ evaluation scores:

Benchmark            starcoder2-15b   starcoder2-15b-FP8 (this model)   Recovery
base pass@1          44.8             45.0                              100.4%
base pass@10         62.7             64.0                              102.0%
base+extra pass@1    38.6             38.4                              99.48%
base+extra pass@10   54.9             55.4                              100.9%
Average              50.25            50.70                             100.7%
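
The recovery column is the FP8 score expressed as a percentage of the unquantized baseline, and the averages are the means of the four pass-rate rows (up to rounding in the table). A quick check with the values copied from above:

baseline = {"base pass@1": 44.8, "base pass@10": 62.7,
            "base+extra pass@1": 38.6, "base+extra pass@10": 54.9}
fp8 = {"base pass@1": 45.0, "base pass@10": 64.0,
       "base+extra pass@1": 38.4, "base+extra pass@10": 55.4}

for name, base_score in baseline.items():
    print(f"{name}: {100 * fp8[name] / base_score:.1f}% recovery")

print("baseline average:", round(sum(baseline.values()) / len(baseline), 2))  # 50.25
print("FP8 average:", round(sum(fp8.values()) / len(fp8), 2))                 # 50.70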
