🔥 [2025/03/12] Our latest code embedding model, OASIS-code-1.5B, is now released.
🔥 [2025/03/12] Our preprint is now available at OASIS-arxiv.
Model Details
Model Name: OASIS (Order-Augmented Strategy for Improved Code Search)
Introduction
OASIS is a state-of-the-art code embedding model developed by Kwaipilot. The model incorporates unique, proprietary methods, including repository-level program analysis, the OASIS-instruct data synthesis algorithm, and a specialized fusion loss function, setting new benchmarks in code search efficiency and accuracy.
Intended Use
This model is ideal for developers and researchers working on code retrieval systems. OASIS excels in scenarios that require semantic understanding and retrieval of code snippets across varied programming contexts.
Training and Performance
OASIS was trained on a synthetic dataset created through repository-level analysis, giving it broad coverage of different coding styles and languages. It has demonstrated state-of-the-art performance on recent code search benchmarks.
Note: avoid torch==2.5.0 when loading the model with torch_dtype=torch.bfloat16. For optimal performance and stability, use PyTorch 2.4.1 or earlier, or upgrade to 2.5.1 or later.
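The version constraint above can be checked programmatically before loading in bfloat16. A minimal sketch (the helper name is ours, not part of the model card); it flags exactly the problematic 2.5.0 release:

```python
def bf16_load_is_safe(torch_version: str) -> bool:
    """Return False only for the problematic torch 2.5.0 release."""
    # Compare only the numeric release part, e.g. "2.5.0+cu121" -> (2, 5, 0).
    release = tuple(int(p) for p in torch_version.split("+")[0].split(".")[:3])
    return release != (2, 5, 0)

print(bf16_load_is_safe("2.4.1"))        # True: safe
print(bf16_load_is_safe("2.5.0+cu121"))  # False: avoid with torch_dtype=torch.bfloat16
print(bf16_load_is_safe("2.5.1"))        # True: safe
```

In practice you would pass `torch.__version__` to this check before calling `from_pretrained` with `torch_dtype=torch.bfloat16`.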
Transformers

```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # With left padding, the final position holds the last real token of every sequence.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # Otherwise, index each sequence at its own last non-padding position.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
```
```python
# Add query prompt
def get_query_prompt(query: str):
    query_description = 'Given a code search query, retrieve relevant code snippet that answer the query'
    prompt = f'Instruct: {query_description}\nQuery: {query}'
    return prompt

query = "How to do quicksort in python?"

code1 = """def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(1, n - i):
            if arr[j - 1] > arr[j]:
                arr[j - 1], arr[j] = arr[j], arr[j - 1]
                swapped = True
        if not swapped:
            break
    return arr"""

code2 = """def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quick_sort(less) + [pivot] + quick_sort(greater)"""
```
```python
model = AutoModel.from_pretrained("Kwaipilot/OASIS-code-1.5B", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/OASIS-code-1.5B")

# Tokenize and inference
inputs = tokenizer([get_query_prompt(query), code1, code2], max_length=1024, padding=True, truncation=True, return_tensors='pt')
outputs = model(**inputs)

# Last token pooling
embeddings = last_token_pool(outputs.hidden_states[-1], inputs['attention_mask'])
print(embeddings.shape)
# torch.Size([3, 1536])

embeddings = F.normalize(embeddings, p=2, dim=1)
similarity = embeddings @ embeddings.T
print(similarity[0, 1:])
# tensor([0.6895, 0.8240])
```
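Given the similarity scores printed above (0.6895 for the bubble sort snippet, 0.8240 for quicksort), retrieval amounts to sorting candidates by score. A minimal sketch, using the values from the example output (the variable names are ours):

```python
import torch

# Similarity of the query to each candidate, from the example output above:
# index 0 -> code1 (bubble_sort), index 1 -> code2 (quick_sort).
scores = torch.tensor([0.6895, 0.8240])
ranking = torch.argsort(scores, descending=True)
print(ranking.tolist())  # [1, 0]: quick_sort is the best match for the query
```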
Sentence Transformers
First install the Sentence Transformers library:
```shell
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Kwaipilot/OASIS-code-1.5B")  # , model_kwargs={"torch_dtype": torch.bfloat16})

query = "How to do quicksort in python?"

code1 = """def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(1, n - i):
            if arr[j - 1] > arr[j]:
                arr[j - 1], arr[j] = arr[j], arr[j - 1]
                swapped = True
        if not swapped:
            break
    return arr"""

code2 = """def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quick_sort(less) + [pivot] + quick_sort(greater)"""

# Run inference
query_embedding = model.encode([query], prompt_name="query")
code_embeddings = model.encode([code1, code2])
print(code_embeddings.shape)
# (2, 1536)

# Get the similarity scores for the embeddings
print(model.similarity(query_embedding[0], code_embeddings[0]))
# tensor([[0.6895]])
print(model.similarity(query_embedding[0], code_embeddings[1]))
# tensor([[0.8240]])
```
BibTeX

```bibtex
@misc{kwaipilotoasis,
    title  = {OASIS: Order-Augmented Strategy for Improved Code Search},
    author = {Kwaipilot team},
    year   = {2024},
}
```