Dataset
This is part of my fasttext classifier collection for curating pretraining datasets.
This classifier classifies a text into Maths or Others.
The model was trained on 1.6M records, a 50:50 mix of maths and non-maths text from websites, and achieved a test F1 score of 0.97. The 50:50 ratio is an intentional upsampling of maths data.
The classifier can be used for LLM pretraining data curation, to enhance capability in mathematics.
It is ultra fast ⚡ with a throughput of ~2000 doc/s on CPU.
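Throughput depends on hardware and document length, so it is worth measuring on your own data. Below is a minimal sketch, assuming `predict_fn` is any batch-prediction callable that returns one result per document, such as the `predict` helper in the Usage section. The name `measure_throughput` is illustrative and not part of fasttext.

```python
import time
from typing import Callable, List


def measure_throughput(
    predict_fn: Callable[[List[str]], list],  # assumed: returns one result per doc
    docs: List[str],
) -> float:
    """Return documents processed per second for a single pass over docs."""
    start = time.perf_counter()
    results = predict_fn(docs)
    elapsed = time.perf_counter() - start
    if len(results) != len(docs):
        raise ValueError("predict_fn must return one result per document")
    # Guard against a zero reading on very small inputs.
    return len(docs) / elapsed if elapsed > 0 else float("inf")
```

Calling `measure_throughput(predict, sample_docs)` on a few thousand representative documents should give a figure comparable to the one quoted above, though the exact number will vary by machine.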
Don't underestimate the "old" fasttext classifier! It is indeed a good and scalable practice.
For example, QWEN2.5-MATH leverages fasttext to curate pretraining data, although its classifier is not open sourced.
🛠️Usage
from typing import List
import re

from huggingface_hub import hf_hub_download
import fasttext

# Download the model file from the Hugging Face Hub and load it.
model_hf = fasttext.load_model(hf_hub_download("kenhktsui/maths-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
    # fasttext predicts one line at a time, so collapse newlines into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model_hf.predict(text_list)
    # pred is (labels, scores); keep the top label without its "__label__" prefix.
    return [{"label": l[0].replace("__label__", "", 1), "score": s[0]}
            for l, s in zip(*pred)]


predict([
    """This is a lightning fast model, which can classify at throughtput of 2000 doc/s with CPU""",
    """Differential geometry is a mathematical discipline that studies the geometry of smooth shapes and smooth spaces, otherwise known as smooth manifolds. It uses the techniques of single variable calculus, vector calculus, linear algebra and multilinear algebra.""",
])
# [{'label': 'Others', 'score': 0.99998367},
#  {'label': 'Maths', 'score': 0.99995637}]
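For data curation, the predictions are typically turned into a keep/drop decision. Below is a minimal sketch, assuming `predict_fn` returns dictionaries with `label` and `score` keys, as the `predict` helper above does. The function name `filter_maths`, the `threshold` of 0.9 and the `batch_size` of 1024 are hypothetical choices for illustration, not values recommended by the model author.

```python
from typing import Callable, Iterable, Iterator, List


def _keep(
    batch: List[str],
    predict_fn: Callable[[List[str]], List[dict]],
    threshold: float,
) -> Iterator[str]:
    # Pair each document with its prediction and keep confident 'Maths' hits.
    for doc, pred in zip(batch, predict_fn(batch)):
        if pred["label"] == "Maths" and pred["score"] >= threshold:
            yield doc


def filter_maths(
    docs: Iterable[str],
    predict_fn: Callable[[List[str]], List[dict]],
    threshold: float = 0.9,  # hypothetical cut-off; tune on your own data
    batch_size: int = 1024,  # hypothetical batch size
) -> Iterator[str]:
    """Yield documents labelled 'Maths' with a score of at least threshold."""
    batch: List[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield from _keep(batch, predict_fn, threshold)
            batch = []
    if batch:  # flush the final partial batch
        yield from _keep(batch, predict_fn, threshold)
```

With the model loaded, `list(filter_maths(corpus, predict))` would return the subset of `corpus` classified as maths. Batching keeps memory bounded when `corpus` is a large stream rather than an in-memory list.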