minishlab / potion-code-16M

Model last updated: April 26, 2026

potion-code-16M Model Card

Overview

potion-code-16M is a fast static code embedding model optimized for code retrieval tasks. It is distilled from nomic-ai/CodeRankEmbed and trained on the CornStack code corpus using Tokenlearn and contrastive fine-tuning.

It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than transformer-based models on both GPU and CPU.
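To see why static embeddings are so fast, consider the following toy illustration (this is not the model2vec implementation, just a sketch of the idea): encoding reduces to an embedding-table lookup plus mean pooling, with no transformer forward pass.

```python
# Toy illustration of static embedding lookup (NOT model2vec internals).
import numpy as np

rng = np.random.default_rng(0)
vocab = {"def": 0, "read_file": 1, "(": 2, "path": 3, ")": 4}
table = rng.normal(size=(len(vocab), 256))  # one 256-d vector per token

def encode(tokens):
    # Look up each known token's vector, then mean-pool them.
    ids = [vocab[t] for t in tokens if t in vocab]
    return table[ids].mean(axis=0)

vec = encode(["def", "read_file", "(", "path", ")"])  # a single 256-d vector
```

Because the per-token cost is a single array lookup, encoding scales linearly with input length on both CPU and GPU.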

Installation
pip install model2vec
Usage
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-code-16M")

# Embed natural language queries
query_embeddings = model.encode(["How to read a file in Python?"])

# Embed code documents
code_embeddings = model.encode(["def read_file(path):\n    with open(path) as f:\n        return f.read()"])
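For retrieval, the query embeddings are scored against the code embeddings, typically with cosine similarity. A minimal sketch, using random NumPy arrays as stand-ins for the outputs of `model.encode(...)` above:

```python
# Hedged sketch of retrieval scoring; the arrays below are random
# stand-ins for real query/code embeddings from model.encode(...).
import numpy as np

def cosine_sim(a, b):
    # Normalize rows, then take dot products.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(42)
query_embeddings = rng.normal(size=(1, 256))  # stand-in query vectors
code_embeddings = rng.normal(size=(3, 256))   # stand-in code vectors

scores = cosine_sim(query_embeddings, code_embeddings)  # shape (1, 3)
best = int(scores.argmax(axis=1)[0])  # index of the best-matching snippet
```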
How it works

potion-code-16M is created using the following pipeline:

  1. Vocabulary mining: code-specific tokens are mined from CornStack and added to the base CodeRankEmbed tokenizer (42k extra tokens → ~62.5k total)
  2. Distillation: the extended vocabulary is distilled from CodeRankEmbed using Model2Vec (256-dimensional embeddings, PCA whitening)
  3. Tokenlearn: the distilled model is fine-tuned on 240k (query, document) pairs from CornStack using a cosine similarity loss
  4. Contrastive fine-tuning: the model is further fine-tuned with MultipleNegativesRankingLoss on 120k CornStack query-document pairs
  5. Post-SIF re-regularization: token weights are re-regularized with SIF weighting after each training stage
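Step 5 can be sketched as follows, assuming the standard SIF formula w(t) = a / (a + p(t)) with a typical a = 1e-3; the pipeline's exact weighting may differ in its details.

```python
# Hedged sketch of SIF re-weighting (assumed standard formulation):
# rare tokens get weights near 1, very frequent tokens are down-weighted.
import numpy as np

a = 1e-3
token_counts = np.array([50_000, 5_000, 50, 5])  # example corpus frequencies
p = token_counts / token_counts.sum()            # token probabilities
sif_weights = a / (a + p)                        # higher weight for rarer tokens

token_vectors = np.random.default_rng(0).normal(size=(4, 256))
reweighted = token_vectors * sif_weights[:, None]  # scale each token's row
```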
Results

Results on the CoIR benchmark (NDCG@10, mteb >= 2.10):

| Model | Params | AVG | AppsRetrieval | COIRCodeSearchNet | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCC | CodeTransContest | CodeTransDL | CosQA | StackOverflow | Text2SQL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeRankEmbed | 137M | 59.14 | 23.46 | 94.70 | 42.61 | 78.11 | 76.39 | 66.43 | 34.84 | 35.92 | 80.53 | 58.37 |
| potion-code-16M + Hybrid | 16M | 40.41 | 5.23 | 34.03 | 51.23 | 64.26 | 33.22 | 52.67 | 31.14 | 21.63 | 69.65 | 41.03 |
| BM25 | – | 39.11 | 4.76 | 32.45 | 59.69 | 67.85 | 33.00 | 47.29 | 32.97 | 15.53 | 69.54 | 28.07 |
| potion-code-16M | 16M | 37.05 | 3.97 | 42.99 | 36.26 | 50.27 | 43.40 | 39.76 | 31.72 | 21.37 | 57.47 | 43.34 |
| potion-retrieval-32M | 32M | 32.10 | 4.22 | 31.80 | 36.71 | 45.11 | 38.64 | 29.97 | 32.62 | 8.70 | 56.26 | 36.93 |
| potion-base-32M | 32M | 31.42 | 3.37 | 29.58 | 34.77 | 42.69 | 37.88 | 28.51 | 30.55 | 14.61 | 53.36 | 38.88 |

CoIR covers a broad range of code retrieval scenarios. For the use case of finding code given a natural language query, CosQA and CodeFeedback (ST/MT) are the most relevant tasks. Others are less so: COIRCodeSearchNetRetrieval retrieves text given a code query (the reverse direction), and the CodeTransOcean tasks target cross-language code translation. The hybrid row combines dense retrieval with BM25 using min-max score normalization and equal weighting (alpha=0.5).
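The hybrid combination described above can be sketched in a few lines: min-max normalize the dense and BM25 score lists separately, then mix them with alpha = 0.5. The score values below are made-up examples; the BM25 scores are assumed to come from an external retriever.

```python
# Sketch of hybrid dense + BM25 scoring with min-max normalization
# and equal weighting (alpha = 0.5), as described in the text.
import numpy as np

def min_max(x):
    # Rescale scores into [0, 1]; constant lists map to all zeros.
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def hybrid_scores(dense, bm25, alpha=0.5):
    return alpha * min_max(dense) + (1 - alpha) * min_max(bm25)

dense = [0.82, 0.41, 0.77]  # example cosine similarities (dense retriever)
bm25 = [12.3, 30.1, 2.4]    # example raw BM25 scores for the same documents
scores = hybrid_scores(dense, bm25)
```

Normalizing each score list before mixing matters because raw BM25 scores and cosine similarities live on very different scales.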

Model Details

| Property | Value |
|---|---|
| Parameters | ~16M |
| Embedding dimensions | 256 |
| Vocabulary size | ~62,500 |
| Teacher model | nomic-ai/CodeRankEmbed |
| Training corpus | CornStack (6 languages: Python, Java, JavaScript, Go, PHP, Ruby) |
| Max sequence length | 1,000,000 tokens (static; effectively no limit) |
Reproducibility

The full training pipeline (distill → tokenlearn → contrastive) is in train.py. It requires the minishlab/tokenlearn-cornstack-docs-coderankembed and minishlab/tokenlearn-cornstack-queries-coderankembed datasets (20k samples per language are used).

pip install model2vec tokenlearn sentence-transformers datasets skeletoken einops
python train.py
Citation
@software{minishlab2024model2vec,
  author       = {Stephan Tulkens and {van Dongen}, Thomas},
  title        = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year         = {2024},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17270888},
  url          = {https://github.com/MinishLab/model2vec},
  license      = {MIT}
}
