# CodeModernBERT-Ghost + GPT-2 for Code Documentation Generation
This is an `EncoderDecoderModel` fine-tuned for generating documentation (docstrings) from source code snippets across multiple programming languages (Go, Java, JavaScript, PHP, Python, Ruby).

The model takes a function or method's source code as input and outputs a candidate docstring describing its purpose, parameters, and, where applicable, return values. It combines the code-understanding capabilities of CodeModernBERT-Ghost with the text-generation abilities of GPT-2, connected via cross-attention.
The model was fine-tuned on a large dataset combining code-docstring pairs from six different programming languages, aiming for broad language coverage.
**Architecture:** Encoder-Decoder

- **Encoder:** `Shuu12121/CodeModernBERT-Ghost` (a BERT-style model pre-trained on code)
- **Decoder:** `openai-community/gpt2` (initialized with pre-trained weights, adapted for the sequence-to-sequence task with cross-attention)
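In this architecture, the decoder conditions on the encoder through cross-attention: each decoder position forms queries against keys and values computed from the encoder's hidden states, so every generated docstring token can attend to the whole encoded code snippet. A minimal single-head sketch in NumPy (dimensions and weights are arbitrary, for illustration only; the real model uses multi-head attention inside each GPT-2 block):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Single-head cross-attention: queries come from the decoder,
    keys/values from the encoder output."""
    Q = decoder_states @ Wq                    # (T_dec, d)
    K = encoder_states @ Wk                    # (T_enc, d)
    V = encoder_states @ Wv                    # (T_enc, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T_dec, T_enc)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # one context vector per decoder position

rng = np.random.default_rng(0)
d = 8
enc = rng.normal(size=(5, d))   # 5 encoder (code) token states
dec = rng.normal(size=(3, d))   # 3 decoder (docstring) token states
out = cross_attention(dec, enc, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (3, 8)
```

The key point is the asymmetry: the attention matrix is `(T_dec, T_enc)`, mixing information from the source code into each decoder position.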
## Intended Uses & Limitations
**Intended Use:**

- Assisting developers in writing documentation by providing automatically generated suggestions.
- Summarizing code functionality for easier understanding.
- Educational purposes, e.g. learning about sequence-to-sequence models applied to code.
**Limitations:**

- **Accuracy:** The generated documentation may not always be accurate or complete. It might miss nuances, misunderstand complex logic, hallucinate incorrect details, or ignore certain parameters or return values. Always review and edit the generated output.
- **Code complexity:** Performance may degrade on very long or complex snippets, especially those involving deep nesting, complex control flow, or obscure library calls.
- **Language nuances:** Although trained on multiple languages, quality may vary between languages depending on their representation in the training data. The model may struggle with language-specific idioms or newer language features absent from the training data.
- **Formatting:** The model may not consistently reproduce specific docstring conventions (e.g., Google or NumPy style for Python, or particular Javadoc tags); expect only basic formatting.
- **Bias:** The model inherits biases from both underlying pre-trained models (CodeModernBERT-Ghost, GPT-2) and the training data (CodeXGLUE), and may generate biased or non-inclusive language.
- **Limited context:** The model processes code snippets in isolation and does not consider the surrounding file or project, which can be crucial for accurate documentation.
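One way to work around the formatting limitation is to post-process the model's raw summary into a fixed docstring template and let the developer fill in the details. A sketch of such a helper (the `to_google_style` function and its template are hypothetical, not part of this model or its card):

```python
def to_google_style(summary: str, params: list[str], returns: bool = True) -> str:
    """Wrap a generated one-line summary in a Google-style docstring skeleton.

    Parameter and return descriptions are left as TODO placeholders for the
    developer to fill in after reviewing the model's output.
    """
    lines = [summary.strip(), ""]
    if params:
        lines.append("Args:")
        lines.extend(f"    {p}: TODO" for p in params)
    if returns:
        lines += ["", "Returns:", "    TODO"]
    return "\n".join(lines)

doc = to_google_style("Calculates the area of a rectangle.", ["length", "width"])
print(doc)
```

This keeps the model responsible only for the summary sentence, which it handles more reliably than strict style conventions.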
## How to Use

You can use this model with the `transformers` library for inference:
```python
import torch
from transformers import AutoTokenizer, EncoderDecoderModel
import os

# Specify the path where your fine-tuned model is saved,
# or use the Hugging Face model ID if uploaded: "Shuu12121/CodeEncoderDecodeerModel-Ghost"
MODEL_NAME_OR_PATH = "Shuu12121/CodeEncoderDecodeerModel-Ghost"

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load tokenizers (assuming they are saved in subdirectories or hosted with the model)
try:
    # If loading from a local path with subdirectories:
    # encoder_tokenizer = AutoTokenizer.from_pretrained(os.path.join(MODEL_NAME_OR_PATH, "encoder_tokenizer"))
    # decoder_tokenizer = AutoTokenizer.from_pretrained(os.path.join(MODEL_NAME_OR_PATH, "decoder_tokenizer"))

    # If loading from the Hub (tokenizers saved alongside the model):
    encoder_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME_OR_PATH, trust_remote_code=True)  # Adjust trust_remote_code if needed
    decoder_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME_OR_PATH, trust_remote_code=True)  # Adjust trust_remote_code if needed

    # Ensure the decoder tokenizer has a pad token (GPT-2 does not define one by default)
    if decoder_tokenizer.pad_token is None:
        decoder_tokenizer.pad_token = decoder_tokenizer.eos_token
        print("Decoder pad_token set to eos_token.")

    # Load the fine-tuned EncoderDecoderModel
    print("Loading model...")
    model = EncoderDecoderModel.from_pretrained(MODEL_NAME_OR_PATH).to(device)
    model.eval()  # Set to evaluation mode
    print("Model loaded.")

    # Get the encoder max length from the config (important for truncation)
    max_input_length = model.config.encoder.max_position_embeddings  # Or specify manually if known
    print(f"Using max_input_length (encoder): {max_input_length}")
except Exception as e:
    print(f"Error loading model or tokenizers: {e}")
    raise


def generate_docstring(code_snippet: str, max_output_length: int = 256, num_beams: int = 5):
    """Generates a docstring for the given code snippet."""
    print("\n" + "=" * 15 + " Input Code " + "=" * 15)
    print(code_snippet.strip())

    # Tokenize the input code
    inputs = encoder_tokenizer(
        code_snippet,
        return_tensors="pt",
        padding=True,  # Padding strategy can be adjusted
        truncation=True,
        max_length=max_input_length,
    ).to(device)

    print("\nGenerating docstring...")
    try:
        with torch.no_grad():
            outputs = model.generate(
                inputs.input_ids,
                attention_mask=inputs.attention_mask,
                max_length=max_output_length,
                num_beams=num_beams,
                early_stopping=True,
                decoder_start_token_id=model.config.decoder_start_token_id,
                eos_token_id=model.config.eos_token_id,
                pad_token_id=model.config.pad_token_id,  # Make sure pad_token_id is set
                no_repeat_ngram_size=2,  # Optional: helps reduce repetition
            )
        # Decode the generated token IDs
        generated_text = decoder_tokenizer.decode(outputs[0], skip_special_tokens=True)
        print("\n" + "=" * 15 + " Generated Docstring " + "=" * 15)
        print(generated_text.strip())
        return generated_text
    except Exception as e:
        print(f"\nError during generation: {e}")
        return None


# --- Example Usage ---
if __name__ == "__main__":
    python_code = '''
def calculate_area(length, width):
    """Placeholder for docstring."""
    if length < 0 or width < 0:
        raise ValueError("Length and width must be non-negative")
    return length * width
'''
    generate_docstring(python_code)

    java_code = """
/** Placeholder */
public int findMax(int[] numbers) {
    if (numbers == null || numbers.length == 0) {
        throw new IllegalArgumentException("Input array cannot be null or empty.");
    }
    int max = numbers[0];
    for (int i = 1; i < numbers.length; i++) {
        if (numbers[i] > max) {
            max = numbers[i];
        }
    }
    return max;
}
"""
    generate_docstring(java_code)
```
## Training Data
This model was fine-tuned on the `google/code_x_glue_ct_code_to_text` dataset, using the following language subsets:

- Go (`go`)
- Java (`java`)
- JavaScript (`javascript`)
- PHP (`php`)
- Python (`python`)
- Ruby (`ruby`)
All available training, validation, and test splits for these languages were concatenated and used:

- Total training samples: ~908,224
- Total validation samples: ~44,689
- Total test samples: ~52,561 (the test split was used only for final evaluation and examples, not for model selection during training)
## Training Procedure
The model was trained using the `transformers` `Seq2SeqTrainer`.

**Preprocessing:**

- Input code (the `code` column) was tokenized with the `Shuu12121/CodeModernBERT-Ghost` tokenizer and padded/truncated to a maximum length of 2048 tokens.
- Target docstrings (the `docstring` column) were tokenized with the `openai-community/gpt2` tokenizer and padded/truncated to a maximum length of 256 tokens. The resulting `input_ids` were used as labels.
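The card states the target `input_ids` were used directly as labels. A common refinement in seq2seq training (an assumption here, not something the card confirms was done) is to replace padding positions in the labels with `-100`, the index that the cross-entropy loss in `transformers` ignores, so the model is not trained to predict pad tokens:

```python
# Toy example with made-up token IDs; pad_token_id=50256 mimics GPT-2
# reusing its EOS token as the pad token.
PAD_ID = 50256

def mask_pad_labels(input_ids: list[int], pad_id: int = PAD_ID) -> list[int]:
    """Copy target input_ids into labels, replacing pad positions with -100
    so the loss function skips them."""
    return [tok if tok != pad_id else -100 for tok in input_ids]

target = [318, 262, 2438, PAD_ID, PAD_ID]   # padded docstring token IDs (illustrative)
labels = mask_pad_labels(target)
print(labels)  # [318, 262, 2438, -100, -100]
```

Note one caveat: because GPT-2 reuses EOS as the pad token, naive masking also hides the genuine end-of-sequence token, so implementations often leave the first EOS unmasked.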