GeneralAnalysis / GA_Guard_Thinking



GA Guard Family

Website · GA Blog · GA Bench · API Access


Introducing the GA Guard series — a family of open-weight moderation models built to help developers and organizations keep language models safe, compliant, and aligned with real-world use.

GA Guard is designed to detect violations across the following seven categories:

  • Illicit Activities – instructions or content related to crimes, weapons, or illegal substances.
  • Hate & Abuse – harassment, slurs, dehumanization, or abusive language.
  • PII & IP – exposure or solicitation of sensitive personal information, secrets, or intellectual property.
  • Prompt Security – jailbreaks, prompt-injection, secret exfiltration, or obfuscation attempts.
  • Sexual Content – sexually explicit or adult material.
  • Misinformation – demonstrably false or deceptive claims presented as fact.
  • Violence & Self-Harm – content that encourages violence, self-harm, or suicide.

For each category, the model outputs a structured token indicating whether that category is violated (e.g., <hate_and_abuse_violation> or <hate_and_abuse_not_violation>).

Important: This model outputs special tokens (e.g. <hate_and_abuse_not_violation>). Do not use pipeline("text-generation"), since it strips special tokens by default. Always decode with skip_special_tokens=False to preserve the outputs.

Model Details

GA Guard Thinking features:

  • Type: Causal Language Model
  • Training: Full finetune
  • Number of Parameters: 4.0B
  • Number of Non-Embedding Parameters: 3.6B
  • Number of Layers: 36
  • Number of Attention Heads (GQA): 32 for Q and 8 for KV
  • Context Length: 262,144 tokens
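
To sanity-check these numbers locally, you can read them from the published config. The snippet below is a minimal sketch; it assumes the repository ships a standard Hugging Face config with Llama/Qwen-style attribute names (num_hidden_layers, num_key_value_heads, max_position_embeddings), so adjust if the architecture uses different names.

# Minimal sketch: inspect the architecture from the model config.
# Attribute names are assumptions based on common Llama/Qwen-style configs;
# verify them against the actual config of this repository.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("GeneralAnalysis/GA_Guard_Thinking")

print(config.num_hidden_layers)                      # expected: 36
print(config.num_attention_heads)                    # expected: 32 (query heads)
print(getattr(config, "num_key_value_heads", None))  # expected: 8 (KV heads, GQA)
print(config.max_position_embeddings)                # expected: 262144
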
Inference Examples
Transformers Library
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GeneralAnalysis/GA_Guard_Thinking")
model = AutoModelForCausalLM.from_pretrained("GeneralAnalysis/GA_Guard_Thinking")

messages = [
    {"role": "user", "content": "Who are you?"},
]

# The chat template automatically adds the guardrail system prompt and prefixes user messages with "text:".
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)

# Decode only the newly generated tokens, keeping the special category tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))

# Sample output:
# <hate_and_abuse_not_violation><illicit_activities_not_violation>...
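
Continuing from the example above, a small helper can turn the generated string into per-category verdicts. This is a minimal sketch: the exact token strings are assumptions inferred from the seven categories and the sample output, so verify them against the model's tokenizer (for example via tokenizer.additional_special_tokens) before relying on the mapping.

# Minimal parsing sketch. The category token names below are assumptions inferred from
# the sample output and the category list; confirm them against the tokenizer's special tokens.
CATEGORIES = [
    "illicit_activities",
    "hate_and_abuse",
    "pii_and_ip",
    "prompt_security",
    "sexual_content",
    "misinformation",
    "violence_and_self_harm",
]

def parse_verdicts(generated_text: str) -> dict:
    verdicts = {}
    for category in CATEGORIES:
        if f"<{category}_violation>" in generated_text:
            verdicts[category] = True      # category flagged as violated
        elif f"<{category}_not_violation>" in generated_text:
            verdicts[category] = False     # category explicitly cleared
        else:
            verdicts[category] = None      # token missing or named differently
    return verdicts

generated = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False
)
print(parse_verdicts(generated))
# e.g. {'illicit_activities': False, 'hate_and_abuse': False, ...}
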
Benchmarks

We evaluated GA Guards on public moderation suites (OpenAI Moderation, WildGuard Benchmark, and HarmBench), our adversarial GA Jailbreak Bench, and the new GA Long-Context Bench. Across all three evaluation settings, our models consistently outperform major cloud guardrails and even surpass GPT-5 (when prompted to act as a guardrail).

Public Benchmarks

On the public moderation suites, Guard Thinking reports an average F1 of 0.906, Guard 0.899, and Lite 0.875, all higher than GPT-5 (0.864) and GPT-5-mini (0.852); the cloud guardrails land in the 0.62–0.74 range.

| Guard | OpenAI Moderation (Acc / F1 / FPR) | WildGuard (Acc / F1 / FPR) | HarmBench Behaviors (Acc / F1 / FPR) | Avg Time (s) |
|---|---|---|---|---|
| GA Guard | 0.916 / 0.873 / 0.111 | 0.856 / 0.844 / 0.172 | 0.963 / 0.981 / N/A | 0.029 |
| GA Guard Thinking | 0.917 / 0.876 / 0.112 | 0.862 / 0.858 / 0.134 | 0.967 / 0.983 / N/A | 0.650 |
| GA Guard Lite | 0.896 / 0.844 / 0.109 | 0.835 / 0.819 / 0.176 | 0.929 / 0.963 / N/A | 0.016 |
| AWS Bedrock Guardrail | 0.818 / 0.754 / 0.216 | 0.642 / 0.649 / 0.449 | 0.662 / 0.797 / N/A | 0.375 |
| Azure AI Content Safety | 0.879 / 0.807 / 0.091 | 0.667 / 0.463 / 0.071 | 0.438 / 0.609 / N/A | 0.389 |
| Vertex AI Model Armor | 0.779 / 0.690 / 0.225 | 0.711 / 0.590 / 0.105 | 0.896 / 0.945 / N/A | 0.873 |
| GPT 5 | 0.838 / 0.775 / 0.188 | 0.849 / 0.830 / 0.145 | 0.975 / 0.987 / N/A | 11.275 |
| GPT 5-mini | 0.794 / 0.731 / 0.255 | 0.855 / 0.839 / 0.151 | 0.975 / 0.987 / N/A | 5.604 |
| Llama Guard 4 12B | 0.826 / 0.737 / 0.156 | 0.799 / 0.734 / 0.071 | 0.925 / 0.961 / N/A | 0.459 |
| Llama Prompt Guard 2 86M | 0.686 / 0.015 / 0.009 | 0.617 / 0.412 / 0.143 | 0.200 / 0.333 / N/A | 0.114 |
| Nvidia Llama 3.1 Nemoguard 8B | 0.852 / 0.793 / 0.174 | 0.849 / 0.818 / 0.096 | 0.875 / 0.875 / N/A | 0.358 |
| VirtueGuard Text Lite | 0.507 / 0.548 / 0.699 | 0.656 / 0.682 / 0.491 | 0.875 / 0.933 / N/A | 0.651 |
| Lakera Guard | 0.752 / 0.697 / 0.323 | 0.630 / 0.662 / 0.527 | 0.946 / 0.972 / N/A | 0.377 |
| Protect AI (prompt-injection-v2) | 0.670 / 0.014 / 0.032 | 0.559 / 0.382 / 0.248 | N/A | 0.115 |

GA Long-Context Bench

On GA Long-Context Bench (up to 256k tokens), GA Guard Thinking scores 0.893 F1, GA Guard 0.891, and Lite 0.885. Cloud baselines collapse: Vertex 0.560, AWS misclassifies nearly all inputs with a 1.0 false-positive rate, and Azure records just 0.046 F1.

| Guard | Accuracy | F1 Score | FPR | F1 Hate & Abuse | F1 Illicit Activities | F1 Misinformation | F1 PII & IP | F1 Prompt Security | F1 Sexual Content | F1 Violence & Self-Harm |
|---|---|---|---|---|---|---|---|---|---|---|
| GA Guard | 0.887 | 0.891 | 0.147 | 0.983 | 0.972 | 0.966 | 0.976 | 0.875 | 0.966 | 0.988 |
| GA Guard Thinking | 0.889 | 0.893 | 0.151 | 0.967 | 0.951 | 0.940 | 0.961 | 0.828 | 0.920 | 0.962 |
| GA Guard Lite | 0.881 | 0.885 | 0.148 | 0.979 | 0.969 | 0.972 | 0.976 | 0.846 | 0.973 | 0.985 |
| AWS Bedrock Guardrail | 0.532 | 0.695 | 1.000 | 0.149 | 0.211 | 0.131 | 0.367 | 0.175 | 0.092 | 0.157 |
| Azure AI Content Safety | 0.480 | 0.046 | 0.001 | 0.028 | 0.041 | 0.016 | 0.073 | 0.049 | 0.000 | 0.081 |
| Vertex AI Model Armor | 0.635 | 0.560 | 0.138 | 0.187 | 0.312 | 0.109 | 0.473 | 0.194 | 0.085 | 0.241 |
| GPT 5 | 0.764 | 0.799 | 0.372 | 0.219 | 0.297 | 0.189 | 0.404 | 0.243 | 0.137 | 0.229 |
| GPT 5-mini | 0.697 | 0.772 | 0.607 | 0.184 | 0.253 | 0.157 | 0.412 | 0.215 | 0.112 | 0.190 |
| Llama Guard 4 12B | 0.569 | 0.602 | 0.516 | 0.164 | 0.228 | 0.132 | 0.334 | 0.188 | 0.097 | 0.195 |
| Llama Prompt Guard 2 86M | 0.505 | 0.314 | 0.162 | N/A | N/A | N/A | N/A | 0.093 | N/A | N/A |
| Nvidia Llama 3.1 Nemoguard 8B | 0.601 | 0.360 | 0.021 | 0.243 | 0.288 | 0.097 | 0.192 | 0.116 | 0.305 | 0.321 |
| VirtueGuard Text Lite | 0.490 | 0.147 | 0.047 | 0.082 | 0.203 | 0.118 | 0.069 | 0.074 | 0.058 | 0.132 |
| Lakera Guard | 0.520 | 0.684 | 0.999 | 0.151 | 0.200 | 0.132 | 0.361 | 0.160 | 0.093 | 0.159 |
| Protect AI (prompt-injection-v2) | 0.496 | 0.102 | 0.001 | N/A | N/A | N/A | N/A | 0.032 | N/A | N/A |

GA Jailbreak Bench

On GA Jailbreak Bench, which measures resilience against adversarial attacks, Guard Thinking achieves 0.933 F1, Guard 0.930, and Lite 0.898. GPT-5 reaches 0.893, while cloud guardrails fall significantly lower.

| Guard | Accuracy | F1 Score | FPR | F1 Hate & Abuse | F1 Illicit Activities | F1 Misinformation | F1 PII & IP | F1 Prompt Security | F1 Sexual Content | F1 Violence & Self-Harm |
|---|---|---|---|---|---|---|---|---|---|---|
| GA Guard | 0.931 | 0.930 | 0.038 | 0.946 | 0.939 | 0.886 | 0.967 | 0.880 | 0.954 | 0.928 |
| GA Guard Thinking | 0.939 | 0.933 | 0.029 | 0.965 | 0.925 | 0.894 | 0.962 | 0.885 | 0.942 | 0.946 |
| GA Guard Lite | 0.902 | 0.898 | 0.065 | 0.908 | 0.900 | 0.856 | 0.936 | 0.850 | 0.934 | 0.904 |
| AWS Bedrock Guardrail | 0.606 | 0.607 | 0.396 | 0.741 | 0.456 | 0.535 | 0.576 | 0.649 | 0.721 | 0.518 |
| Azure AI Content Safety | 0.542 | 0.193 | 0.026 | 0.236 | 0.093 | 0.155 | 0.068 | 0.416 | 0.186 | 0.130 |
| Vertex AI Model Armor | 0.550 | 0.190 | 0.008 | 0.077 | 0.190 | 0.582 | 0.076 | 0.000 | 0.000 | 0.241 |
| GPT 5 | 0.900 | 0.893 | 0.035 | 0.928 | 0.942 | 0.856 | 0.799 | 0.819 | 0.953 | 0.939 |
| GPT 5-mini | 0.891 | 0.883 | 0.050 | 0.917 | 0.942 | 0.845 | 0.850 | 0.822 | 0.882 | 0.924 |
| Llama Guard 4 12B | 0.822 | 0.796 | 0.053 | 0.768 | 0.774 | 0.587 | 0.809 | 0.833 | 0.927 | 0.827 |
| Llama Prompt Guard 2 86M | 0.490 | 0.196 | 0.069 | N/A | N/A | N/A | N/A | 0.196 | N/A | N/A |
| Nvidia Llama 3.1 Nemoguard 8B | 0.668 | 0.529 | 0.038 | 0.637 | 0.555 | 0.513 | 0.524 | 0.049 | 0.679 | 0.575 |
| VirtueGuard Text Lite | 0.513 | 0.664 | 0.933 | 0.659 | 0.689 | 0.657 | 0.646 | 0.659 | 0.675 | 0.662 |
| Lakera Guard | 0.525 | 0.648 | 0.825 | 0.678 | 0.645 | 0.709 | 0.643 | 0.631 | 0.663 | 0.548 |
| Protect AI (prompt-injection-v2) | 0.528 | 0.475 | 0.198 | N/A | N/A | N/A | N/A | 0.475 | N/A | N/A |

Citation
@misc{generalanalysis2025gaguardcore,
      title        = {GA Guard Thinking}, 
      author       = {Rez Havaei and Rex Liu and General Analysis},
      year         = {2025},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      howpublished = {\url{https://huggingface.co/GeneralAnalysis/GA_Guard_Thinking}},
      note         = {Open-weight moderation model for seven safety categories},
}

License

This model is released under the CC BY-NC 4.0 license: https://choosealicense.com/licenses/cc-by-nc-4.0

Model page: https://huggingface.co/GeneralAnalysis/GA_Guard_Thinking