MistralLite is a fine-tuned Mistral-7B-v0.1 language model with enhanced capabilities for processing long context (up to 32K tokens). By utilizing an adapted Rotary Embedding and sliding window during fine-tuning, MistralLite performs significantly better on several long context retrieval and question answering tasks, while keeping the simple model structure of the original model. MistralLite is useful for applications such as long context line and topic retrieval, summarization, and question answering. MistralLite can be deployed on a single AWS g5.2x instance with a SageMaker Hugging Face Text Generation Inference (TGI) endpoint, making it suitable for applications that require high performance in resource-constrained environments. You can also serve the MistralLite model directly using TGI docker containers. MistralLite also supports other serving options such as vLLM, and you can use MistralLite in Python via the HuggingFace transformers and FlashAttention-2 libraries.
MistralLite is similar to Mistral-7B-Instruct-v0.1, and their similarities and differences are summarized below:

| Model | Fine-tuned on long contexts | Max context length | RotaryEmbedding adaptation | Sliding Window Size |
|---|---|---|---|---|
| Mistral-7B-Instruct-v0.1 | up to 8K tokens | 32K | rope_theta = 10000 | 4096 |
| MistralLite | up to 16K tokens | 32K | rope_theta = 1000000 | 16384 |
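As a quick, illustrative sanity check (not part of the original model card), the table values above can be read directly from the published HuggingFace model configuration:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("amazon/MistralLite")
print(config.rope_theta)               # expected 1000000.0 per the table above
print(config.sliding_window)           # expected 16384 per the table above
print(config.max_position_embeddings)  # maximum context length in tokens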
Important - Use the prompt template below for MistralLite:
<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>
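For illustration, a small helper like the following (the name format_prompt is an assumption, not part of the model card) can wrap a user question in this template:

def format_prompt(question: str) -> str:
    # Wrap a plain question in the MistralLite prompt template shown above.
    return f"<|prompter|>{question}</s><|assistant|>"

print(format_prompt("What are the main challenges to support a long context for LLM?"))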
Motivation of Developing MistralLite
Since the release of Mistral-7B-Instruct-v0.1, the model has become increasingly popular because of its strong performance on a wide range of benchmarks. However, most of these benchmarks are evaluated on short context, and little has been investigated about its performance on long context tasks. We therefore evaluated Mistral-7B-Instruct-v0.1 against benchmarks that are specifically designed to assess the capabilities of LLMs in handling longer context. Although the model's performance was fairly competitive on contexts of fewer than 4096 tokens, there were limitations in its performance on longer contexts. Motivated to improve its performance on longer context, we fine-tuned the Mistral 7B model and produced MistralLite, which significantly boosts long context handling over Mistral-7B-Instruct-v0.1 in our detailed long context evaluations. An example of question answering over a long document is shown below:
Question:
please tell me how does pgvector help with Generative AI and give me some examples.
Answer from MistralLite:
pgvector is an open-source extension for PostgreSQL supported by Amazon Aurora PostgreSQL-Compatible Edition.
You can use pgvector to store, search, index, and query billions of embeddings that are generated from machine learning (ML) and artificial intelligence (AI) models in your database, such as those from Amazon Bedrock (limited preview) or Amazon SageMaker. A vector embedding is a numerical representation that represents the semantic meaning of content such as text, images, and video.
With pgvector, you can query embeddings in your Aurora PostgreSQL database to perform efficient semantic similarity searches of these data types, represented as vectors, combined with other tabular data in Aurora. This enables the use of generative AI and other AI/ML systems for new types of applications such as personalized recommendations based on similar text descriptions or images, candidate match based on interview notes, customer service next best action recommendations based on successful transcripts or chat session dialogs, and more.
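The example below uses MistralLite in Python with HuggingFace transformers and FlashAttention-2. As an assumption (exact versions are not specified here), recent releases of the transformers, flash-attn, and accelerate packages are installed, e.g. via pip install transformers flash-attn accelerate.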
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import torch

model_id = "amazon/MistralLite"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.bfloat16,
                                             use_flash_attention_2=True,
                                             device_map="auto",)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"

sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"{seq['generated_text']}")
Important - Use the prompt template below for MistralLite:
<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>
How to Serve MistralLite on TGI
Important:
For an end-to-end example Jupyter notebook using the native TGI container, please refer to this link.
If the input context length is greater than 12K tokens, it is recommended to use a custom TGI container; please refer to this link.
Start TGI server
Use TGI version 1.1.0 or later. The official Docker container is:
ghcr.io/huggingface/text-generation-inference:1.1.0
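As a sketch only (the port mapping, volume path, and token limits below are assumptions to adapt to your environment), the server can be started from the official container along these lines:

docker run -d --gpus all --shm-size 1g -p 443:80 \
    -v $(pwd)/models:/data \
    ghcr.io/huggingface/text-generation-inference:1.1.0 \
    --model-id amazon/MistralLite \
    --max-input-length 16000 \
    --max-total-tokens 16384 \
    --max-batch-prefill-tokens 16384

Here port 443 on the host matches the SERVER_PORT used in the Python client example below.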
Example Python code for inference with TGI (requires text_generation 0.6.1 or later):
pip install text_generation==0.6.1
from text_generation import Client

SERVER_PORT = 443
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)

def invoke_tgi(prompt, random_seed=1, max_new_tokens=400, print_stream=True, assist_role=True):
    if (assist_role):
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
    output = ""
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        return_full_text=False,
        #temperature=None,
        #truncate=None,
        #seed=random_seed,
        #typical_p=0.2,
    ):
        if hasattr(response, "token"):
            if not response.token.special:
                snippet = response.token.text
                output += snippet
                if (print_stream):
                    print(snippet, end='', flush=True)
    return output

prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_tgi(prompt)
Important - When using MistralLite for inference for the first time, it may require a brief warm-up period that can take tens of seconds. Subsequent inferences should be faster and return results in a more timely manner. This warm-up period is normal and should not affect the overall performance of the system once the initialization period has been completed.
How to Deploy MistralLite on Amazon SageMaker
Important:
For an end-to-end example Jupyter notebook using the SageMaker built-in container, please refer to this link.
If the input context length is greater than 12K tokens, it is recommended to use a custom Docker container; please refer to this link.
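The predictor object used in the examples below is assumed to come from a deployment step along the lines of the following sketch, which uses the SageMaker Python SDK with the HuggingFace LLM (TGI) container; the container version, instance type, and environment values here are illustrative assumptions rather than prescribed settings (see the linked notebook for the end-to-end flow):

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()
# HuggingFace LLM (TGI) container; the version is an assumption.
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "amazon/MistralLite",
        "SM_NUM_GPUS": "1",            # single-GPU instance
        "MAX_INPUT_LENGTH": "16000",   # illustrative long-context limits
        "MAX_TOTAL_TOKENS": "16384",
    },
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",     # illustrative instance type
    container_startup_health_check_timeout=600,
)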
To call the endpoint, please use the example code below:
input_data = {
    "inputs": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
    "parameters": {
        "do_sample": False,
        "max_new_tokens": 400,
        "return_full_text": False,
        #"typical_p": 0.2,
        #"temperature": None,
        #"truncate": None,
        #"seed": 1,
    }
}
result = predictor.predict(input_data)[0]["generated_text"]
print(result)
Alternatively, you can invoke the endpoint via boto3, as shown in the example below:
import boto3
import json

def call_endpoint(client, prompt, endpoint_name, parameters):
    payload = {"inputs": prompt,
               "parameters": parameters}
    response = client.invoke_endpoint(EndpointName=endpoint_name,
                                      Body=json.dumps(payload),
                                      ContentType="application/json")
    output = json.loads(response["Body"].read().decode())
    result = output[0]["generated_text"]
    return result

client = boto3.client("sagemaker-runtime")

parameters = {
    "do_sample": False,
    "max_new_tokens": 400,
    "return_full_text": False,
    #"typical_p": 0.2,
    #"temperature": None,
    #"truncate": None,
    #"seed": 1,
}

endpoint_name = predictor.endpoint_name
prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"
result = call_endpoint(client, prompt, endpoint_name, parameters)
print(result)
How to Serve MistralLite on vLLM
When using vLLM from Python code, please see the example code below:
from vllm import LLM, SamplingParams

prompts = [
    "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
sampling_params = SamplingParams(temperature=0, max_tokens=100)

llm = LLM(model="amazon/MistralLite")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Limitations
Before using the MistralLite model, it is important to perform your own independent assessment, and take measures to ensure that your use would comply with your own specific quality control practices and standards, and that your use would comply with the local rules, laws, regulations, licenses and terms that apply to you, and your content.