NuExtract-tiny-v1.5 is a fine-tuning of Qwen/Qwen2.5-0.5B, trained on a private high-quality dataset for structured information extraction. It supports long documents and several languages (English, French, Spanish, German, Portuguese, and Italian).
To use the model, provide an input text and a JSON template describing the information you need to extract.
Note: This model is trained to prioritize pure extraction, so in most cases all text generated by the model is present as is in the original text.
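A template is a JSON skeleton in which empty strings mark single-value fields and empty lists mark fields that may collect several values (this reading follows the example template in the Usage section below). A minimal illustrative template, with field names invented for the example, could be:

{
    "Person": {
        "Name": "",
        "Roles": []
    }
}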
We also provide a 3.8B version, NuExtract-v1.5, which is based on Phi-3.5-mini-instruct.
⚠️ We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default of 0.7 which is not well suited to pure extraction tasks.
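With Hugging Face transformers, the most direct way to get temperature-0 behaviour is greedy decoding; a minimal sketch, reusing the generate call from the usage code below:

# Greedy decoding: with do_sample=False, generate() ignores temperature
# and always picks the most likely token, i.e. temperature 0.
pred_ids = model.generate(**batch_encodings, max_new_tokens=max_new_tokens, do_sample=False)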
Benchmark
[Figure: Zero-shot performance (English)]
[Figure: Few-shot fine-tuning]
Usage
To use the model:
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_NuExtract(model, tokenizer, texts, template, batch_size=1, max_length=10_000, max_new_tokens=4_000):
    # Normalize the template so the model sees consistently indented JSON
    template = json.dumps(json.loads(template), indent=4)
    prompts = [f"""<|input|>\n### Template:\n{template}\n### Text:\n{text}\n\n<|output|>""" for text in texts]

    outputs = []
    with torch.no_grad():
        for i in range(0, len(prompts), batch_size):
            batch_prompts = prompts[i:i+batch_size]
            batch_encodings = tokenizer(batch_prompts, return_tensors="pt", truncation=True, padding=True, max_length=max_length).to(model.device)

            pred_ids = model.generate(**batch_encodings, max_new_tokens=max_new_tokens)
            outputs += tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    # The generated JSON is everything after the <|output|> marker
    return [output.split("<|output|>")[1] for output in outputs]
model_name = "numind/NuExtract-tiny-v1.5"
device = "cuda"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
text = """We introduce Mistral 7B, a 7–billion-parameter language model engineered forsuperior performance and efficiency. Mistral 7B outperforms the best open 13Bmodel (Llama 2) across all evaluated benchmarks, and the best released 34Bmodel (Llama 1) in reasoning, mathematics, and code generation. Our modelleverages grouped-query attention (GQA) for faster inference, coupled with slidingwindow attention (SWA) to effectively handle sequences of arbitrary length with areduced inference cost. We also provide a model fine-tuned to follow instructions,Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human andautomated benchmarks. Our models are released under the Apache 2.0 license.Code: <https://github.com/mistralai/mistral-src>Webpage: <https://mistral.ai/news/announcing-mistral-7b/>"""
template = """{ "Model": { "Name": "", "Number of parameters": "", "Number of max token": "", "Architecture": [] }, "Usage": { "Use case": [], "Licence": "" }}"""
prediction = predict_NuExtract(model, tokenizer, [text], template)[0]
print(prediction)
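Since the model prioritizes pure extraction, the output mirrors the template and copies spans from the input text, leaving fields it cannot fill empty. An illustrative (not guaranteed) result for this example:

{
    "Model": {
        "Name": "Mistral 7B",
        "Number of parameters": "7-billion",
        "Number of max token": "",
        "Architecture": [
            "grouped-query attention (GQA)",
            "sliding window attention (SWA)"
        ]
    },
    "Usage": {
        "Use case": [],
        "Licence": "Apache 2.0"
    }
}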
Sliding window prompting: for documents longer than the model's effective context, split the text into overlapping chunks and carry the extraction from each chunk forward as the "Current" state for the next one.
import json

MAX_INPUT_SIZE = 20_000
MAX_NEW_TOKENS = 6_000

def clean_json_text(text):
    # Strip whitespace and undo escaping the model sometimes emits
    text = text.strip()
    text = text.replace("\\#", "#").replace("\\&", "&")
    return text

def predict_chunk(text, template, current, model, tokenizer):
    current = clean_json_text(current)

    # "### Current:" carries the extraction state from previous chunks;
    # the trailing "{" forces the model to continue a JSON object
    input_llm = f"<|input|>\n### Template:\n{template}\n### Current:\n{current}\n### Text:\n{text}\n\n<|output|>" + "{"
    input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=MAX_INPUT_SIZE).to("cuda")
    output = tokenizer.decode(model.generate(**input_ids, max_new_tokens=MAX_NEW_TOKENS)[0], skip_special_tokens=True)

    return clean_json_text(output.split("<|output|>")[1])
def split_document(document, window_size, overlap):
    tokens = tokenizer.tokenize(document)
    print(f"\tLength of document: {len(tokens)} tokens")

    chunks = []
    if len(tokens) > window_size:
        # Step through the document in strides of (window_size - overlap)
        for i in range(0, len(tokens), window_size - overlap):
            print(f"\t{i} to {i + len(tokens[i:i + window_size])}")
            chunk = tokenizer.convert_tokens_to_string(tokens[i:i + window_size])
            chunks.append(chunk)

            if i + len(tokens[i:i + window_size]) >= len(tokens):
                break
    else:
        chunks.append(document)
    print(f"\tSplit into {len(chunks)} chunks")

    return chunks
def handle_broken_output(pred, prev):
    try:
        if all([(v in ["", []]) for v in json.loads(pred).values()]):
            # if empty json, return previous
            pred = prev
    except:
        # if broken json, return previous
        pred = prev

    return pred
def sliding_window_prediction(text, template, model, tokenizer, window_size=4000, overlap=128):
    # split text into overlapping chunks of window_size tokens
    chunks = split_document(text, window_size, overlap)

    # iterate over text chunks, threading the prediction through as state
    prev = template
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i}...")
        pred = predict_chunk(chunk, template, prev, model, tokenizer)

        # fall back to the previous state if the output is empty or broken
        pred = handle_broken_output(pred, prev)

        # iterate
        prev = pred

    return pred
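Putting it together on a long document (the file name here is hypothetical; model, tokenizer, and template are the ones defined above):

# Hypothetical long input; any document beyond ~4,000 tokens benefits from chunking
long_text = open("long_report.txt", encoding="utf-8").read()

prediction = sliding_window_prediction(long_text, template, model, tokenizer, window_size=4000, overlap=128)
print(prediction)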