potion-code-16M is a fast static code embedding model optimized for code retrieval tasks. It is distilled from nomic-ai/CodeRankEmbed and trained on the CornStack code corpus using Tokenlearn and contrastive fine-tuning. It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than with transformer-based models, on both GPU and CPU.
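The speed comes from the architecture: a static model's encode step is just an embedding-table lookup followed by pooling, with no attention layers. A minimal sketch of the idea (the vocabulary, dimensions, and values below are illustrative toys, not the model's; model2vec also applies token weighting, which is omitted here):

```python
import numpy as np

# Toy vocabulary and embedding table. The real model ships ~62.5k tokens
# with 256-dimensional vectors; 5 tokens x 8 dims keeps the sketch small.
vocab = {"def": 0, "read": 1, "file": 2, "(": 3, ")": 4}
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), 8)).astype(np.float32)

def encode(tokens):
    """Static embedding: look up each token's vector and mean-pool."""
    ids = [vocab[t] for t in tokens if t in vocab]
    return table[ids].mean(axis=0)

emb = encode(["def", "read", "file"])
```

Because there is no forward pass, encoding cost is linear in the number of tokens and trivially parallel, which is what makes CPU inference practical.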
Installation

```bash
pip install model2vec
```
Usage
```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-code-16M")

# Embed natural language queries
query_embeddings = model.encode(["How to read a file in Python?"])

# Embed code documents
code_embeddings = model.encode(
    ["def read_file(path):\n    with open(path) as f:\n        return f.read()"]
)
```
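The returned embeddings are plain vectors, so retrieval reduces to a nearest-neighbor search. A minimal sketch of ranking code documents against a query with cosine similarity (the dummy vectors stand in for `model.encode(...)` output to keep the example self-contained):

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity between two sets of embeddings.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# In practice these come from model.encode(...); hardcoded toy vectors here.
query_embeddings = np.array([[1.0, 0.0, 0.0]])
code_embeddings = np.array([[0.9, 0.1, 0.0],   # relevant snippet
                            [0.0, 1.0, 0.0]])  # unrelated snippet

scores = cosine_sim(query_embeddings, code_embeddings)
best = int(scores.argmax())  # index of the best-matching code document
```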
How it works

potion-code-16M is created using the following pipeline:

1. Vocabulary mining: code-specific tokens are mined from CornStack and added to the base CodeRankEmbed tokenizer (42k extra tokens → ~62.5k total).
2. Distillation: the extended vocabulary is distilled from CodeRankEmbed using Model2Vec (256-dimensional embeddings, PCA whitening).
3. Tokenlearn: the distilled model is fine-tuned on 240k (query, document) pairs from CornStack using a cosine similarity loss.
4. Contrastive fine-tuning: the model is further fine-tuned using MultipleNegativesRankingLoss on 120k CornStack query-document pairs.
5. Post-SIF re-regularization: token weights are re-regularized using SIF weighting after each training stage.
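SIF (smooth inverse frequency) weighting scales each token's vector by a / (a + p(t)), where p(t) is the token's corpus probability, so frequent tokens contribute less to mean-pooled embeddings. A sketch of the re-weighting step (the function name, the a = 1e-3 default, and the toy counts are assumptions for illustration, not the repo's code):

```python
import numpy as np

def sif_reweight(embeddings, token_counts, a=1e-3):
    """Scale each token's embedding row by the SIF weight a / (a + p(t)),
    where p(t) is the token's corpus probability. Frequent tokens get
    small weights, so they dominate pooled sentence vectors less."""
    counts = np.asarray(token_counts, dtype=np.float64)
    p = counts / counts.sum()
    weights = a / (a + p)
    return embeddings * weights[:, None]

# Toy example: 3 tokens, 4-dim embeddings; token 0 is very frequent,
# token 2 is rare, so token 2 keeps a larger vector after re-weighting.
emb = np.ones((3, 4))
reweighted = sif_reweight(emb, token_counts=[9000, 900, 100])
```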
CoIR covers a broad range of code retrieval scenarios. For the use case of finding code given a natural language query, CosQA and CodeFeedback (ST/MT) are the most relevant tasks. Others are less so: COIRCodeSearchNetRetrieval retrieves text given a code query (the reverse direction), and the CodeTransOcean tasks target cross-language code translation. The hybrid row combines dense retrieval with BM25 using min-max score normalization and equal weighting (alpha = 0.5).
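The hybrid scoring described above can be sketched as follows: each retriever's raw scores are min-max normalized into [0, 1], then mixed with weight alpha (the function names and toy scores are illustrative):

```python
import numpy as np

def minmax(scores):
    # Rescale scores into [0, 1]; constant score lists map to zeros.
    s = np.asarray(scores, dtype=np.float64)
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)

def hybrid_scores(dense, bm25, alpha=0.5):
    """Min-max normalize each retriever's scores, then mix with weight
    alpha (alpha = 0.5 gives dense and BM25 equal weight)."""
    return alpha * minmax(dense) + (1 - alpha) * minmax(bm25)

# Toy scores for 3 candidate documents; note the raw scales differ,
# which is why normalization is needed before mixing.
dense = [0.82, 0.40, 0.75]
bm25 = [12.0, 30.0, 8.0]
combined = hybrid_scores(dense, bm25)
```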
The full training pipeline (distill → tokenlearn → contrastive) is in train.py. It requires minishlab/tokenlearn-cornstack-docs-coderankembed and minishlab/tokenlearn-cornstack-queries-coderankembed (20k samples per language used).
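The contrastive stage's MultipleNegativesRankingLoss treats each query's paired document as its positive and every other document in the batch as a negative, i.e. cross-entropy over the scaled similarity matrix with the diagonal as targets. A numpy sketch of the objective (the scale value and toy data are assumptions; the actual training uses sentence-transformers' implementation):

```python
import numpy as np

def mnr_loss(q, d, scale=20.0):
    """MultipleNegativesRankingLoss sketch: cosine similarities between
    every query and every document in the batch, scaled, then softmax
    cross-entropy with the matching (diagonal) document as the target."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = scale * (q @ d.T)                   # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()

# Toy batch: each document is a small perturbation of its query, so the
# aligned pairing should score a much lower loss than a shuffled one.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 16))
docs = queries + 0.1 * rng.normal(size=(4, 16))
loss = mnr_loss(queries, docs)
```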