Problem: Effective code retrieval is essential for improving code generation, bug fixing, and software maintenance, especially as software complexity grows. Although current code embedding models perform well on smaller-scale benchmarks, they often struggle with real-world challenges such as locating buggy code in GitHub repositories. This may be due to the noisy, inconsistent data used in their training, which limits their ability to generalize.
Contribution: To tackle this, we introduce CoRNStack, a large-scale, high-quality training dataset designed specifically for code retrieval across multiple programming languages. CoRNStack is curated to remove noisy data and includes challenging examples that improve learning. The dataset, whose instances take the form <query, positive, negatives>, supports training both code retrieval and reranking models.
Results: With contrastive training on CoRNStack, our code retriever model (CodeRankEmbed) achieves state-of-the-art results across diverse code retrieval tasks. Our fine-tuned reranking model (CodeRankLLM) further enhances the quality of retrieved results, and when combined with our code retriever, it significantly improves the accuracy of finding relevant functions in GitHub issues—a key need in real-world software development.
The effectiveness of code embedding models depends heavily on the quality of their training data, which comes in the form of triples: a query, a relevant (positive) example, and unrelated (negative) examples. Training with high-quality positives and hard negatives, examples that are similar to the positives but do not correctly answer the query, yields high-performing code embedding models. Directly using open-source code data such as The Stack v2 for this purpose can introduce mismatched or mislabeled pairs, which weakens model performance. To address this, we propose a two-step curation method that selects the most relevant positives based on similarity scores and adds a diverse range of hard negatives. We call the resulting dataset CoRNStack, short for Consistency filtering and Robust Negatives for enriching The Stack v2.
Data Selection: We built our dataset from the de-duplicated version of The Stack v2, a rich collection of source code spanning over 600 programming and markup languages. To create text-code pairs, we took function docstrings as text and paired them with their respective functions as code. We applied filters to exclude pairs where the text was non-English, too short, or contained URLs, HTML tags, or invalid characters. Unlike past approaches, we retained pairs with text longer than 256 tokens, helping the model handle the long queries common in detailed code retrieval tasks, such as those derived from GitHub issues.
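To make the text-side filtering concrete, below is a minimal sketch of how such filters could be implemented; the regular expressions, the minimum-length cutoff, and the non-ASCII heuristic used as a stand-in for language identification are illustrative assumptions, not the exact rules used to build CoRNStack.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # illustrative URL pattern
HTML_RE = re.compile(r"<[^>]+>")       # illustrative HTML-tag pattern
MIN_WORDS = 3                          # assumed minimum docstring length

def keep_text(docstring: str) -> bool:
    """Return True if a docstring passes the text-side filters (sketch)."""
    if len(docstring.split()) < MIN_WORDS:          # too short
        return False
    if URL_RE.search(docstring) or HTML_RE.search(docstring):
        return False
    # Cheap stand-in for non-English / invalid-character detection.
    non_ascii = sum(not c.isascii() for c in docstring)
    return non_ascii / max(len(docstring), 1) <= 0.2
```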
Dual Consistency Filtering: To create a high-quality dataset of (text, code) pairs, we use an embedding model (Jina-Code-v2) to obtain text and code embeddings and compute similarity scores between all pairs. We keep a pair only if the code ranks among the top two most similar matches for its text and the similarity surpasses a set threshold. To evaluate the filtered dataset, we ran automated comparisons against other code datasets such as CosQA and CodeSearchNet. Using the Qwen2.5-Coder model, we checked whether each code snippet fully addresses its corresponding text query across thousands of samples. Our results show that CoRNStack has considerably higher <query, positive> correctness than the other datasets.
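As an illustration of the consistency check, the sketch below keeps a pair when its code is within the top two most similar codes for its text and the paired similarity clears a threshold; the threshold value, the single text-to-code direction, and the use of an in-memory N×N similarity matrix are simplifying assumptions of this sketch.

```python
import numpy as np

def consistency_filter(text_emb: np.ndarray,   # (N, D), L2-normalized text embeddings
                       code_emb: np.ndarray,   # (N, D), L2-normalized code embeddings
                       threshold: float = 0.7, # assumed similarity threshold
                       top_k: int = 2) -> np.ndarray:
    """Return indices of (text_i, code_i) pairs passing the consistency check."""
    sims = text_emb @ code_emb.T                    # (N, N) cosine similarities
    keep = []
    for i in range(sims.shape[0]):
        # How many candidate codes score strictly higher than the paired code?
        better = int((sims[i] > sims[i, i]).sum())
        if better < top_k and sims[i, i] >= threshold:
            keep.append(i)
    return np.asarray(keep)
```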
Curriculum-Based Hard Negative Mining: We improve model training by carefully selecting challenging negatives to learn from. For each (text, code) pair, we first filter out likely false negatives using a similarity-score threshold so that only truly "negative" examples remain. From these, we sample a set of negatives with probabilities that emphasize more challenging cases, controlled by a temperature parameter that is adjusted over training to gradually sharpen the selection. This curriculum-like setup exposes the model to progressively harder examples, which improves diversity and prevents overfitting. Importantly, the strategy is efficient: it relies on a precomputed similarity matrix, making it both scalable and practical.
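The sampling step could look roughly like the sketch below: candidates scoring too close to the positive are dropped as likely false negatives, and the remainder are drawn with softmax probabilities whose temperature is annealed over training. The softmax form, the relative false-negative margin, and the number of sampled negatives are assumptions made for illustration.

```python
import numpy as np

def sample_hard_negatives(sims: np.ndarray,          # similarities of the text to candidate codes
                          pos_sim: float,            # similarity of the text to its positive code
                          temperature: float,        # annealed over training (lower = harder negatives)
                          num_negatives: int = 15,   # assumed number of negatives per pair
                          margin: float = 0.95) -> np.ndarray:
    """Sample hard-negative indices for one (text, code) pair (sketch)."""
    # Discard likely false negatives: candidates nearly as similar as the positive.
    candidates = np.where(sims < margin * pos_sim)[0]
    logits = sims[candidates] / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    k = min(num_negatives, len(candidates))
    return np.random.choice(candidates, size=k, replace=False, p=probs)
```

Lowering the temperature as training progresses concentrates sampling on the most similar remaining candidates, which is what gives the schedule its curriculum flavor.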
Model: We use a bi-encoder architecture for our retriever, with weights shared between the text and code encoders. The retriever is trained with a contrastive learning objective based on the InfoNCE loss. The encoder is initialized with Arctic-Embed-M-Long, a text encoder supporting an extended context length of 8,192 tokens and pretrained on large-scale web query-document pairs along with public text retrieval datasets. We release our trained code retriever as CodeRankEmbed.
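For reference, a minimal InfoNCE objective over the positive, the mined hard negatives, and in-batch negatives might be written as follows; the temperature value and the inclusion of in-batch negatives are assumptions of this sketch rather than confirmed training details.

```python
import torch
import torch.nn.functional as F

def infonce_loss(q: torch.Tensor,    # (B, D) query (text) embeddings
                 pos: torch.Tensor,  # (B, D) positive code embeddings
                 neg: torch.Tensor,  # (B, K, D) mined hard-negative code embeddings
                 tau: float = 0.05) -> torch.Tensor:  # assumed temperature
    """Contrastive InfoNCE loss with hard and in-batch negatives (sketch)."""
    q, pos, neg = (F.normalize(t, dim=-1) for t in (q, pos, neg))
    pos_scores = (q * pos).sum(-1, keepdim=True)          # (B, 1)
    hard_scores = torch.einsum("bd,bkd->bk", q, neg)      # (B, K)
    inbatch = q @ pos.T                                    # (B, B); diagonal duplicates the positive
    diag = torch.eye(q.size(0), dtype=torch.bool, device=q.device)
    inbatch = inbatch.masked_fill(diag, float("-inf"))
    logits = torch.cat([pos_scores, hard_scores, inbatch], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # index 0 = positive
    return F.cross_entropy(logits, labels)
```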
Evaluation Datasets: We evaluate CodeRankEmbed on a variety of code retrieval tasks under zero-shot settings. We use CodeSearchNet as the benchmark for function-level text-to-code retrieval, a semantic search task where natural language queries are used to retrieve relevant code snippets. Additionally, to evaluate performance across diverse code retrieval tasks, we use the CoIR benchmark, which includes text-to-code, code-to-text, code-to-code, and hybrid code retrieval tasks (retrieving a hybrid of code and textual documents given a hybrid query).
Baselines: We compare our finetuned code retriever against state-of-the-art code embedding models of various sizes, both open-source and proprietary. The open-source code embedding models include CodeSage, CodeT5+ and Jina-Code-v2, which are currently leading text-to-code retrieval benchmarks. We also compare with the proprietary Voyage-Code-002.
Results: Our code retriever, despite being smaller than the majority of the baselines, significantly outperforms all open-source and proprietary code embedding models, establishing a new state-of-the-art for code embedding tasks. This demonstrates the robustness of our contrastive training data, with the trained model exhibiting superior cross-task generalization on CoIR despite being trained exclusively on text-to-code retrieval.
Model: Our code reranker builds on LLM-based listwise reranking, which has gained prominence for its ability to score multiple passages simultaneously. Training data for listwise reranking was generated by selecting 50,000 <query, positive, negatives> tuples from CoRNStack, filtered to ensure higher similarity scores and better ranks for the positives. Since CoRNStack does not contain the ranked orderings required for training listwise rerankers, we use orderings produced by the Qwen-2.5-32B-Instruct LLM for each example as ranking supervision. We train the reranker with a language modeling objective that minimizes the prediction error of the next token in the sequence. We release our trained code reranker as CodeRankLLM.
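To illustrate the supervision format, the sketch below assembles one listwise training example from a query, its candidate code snippets, and a teacher-provided ordering; the prompt wording and the "[3] > [1] > [2]"-style target string follow common listwise-reranking recipes and are assumptions here, not the exact templates used.

```python
def build_listwise_example(query: str,
                           passages: list[str],
                           teacher_ranking: list[int]) -> dict:
    """Build one (prompt, completion) pair for listwise reranker fine-tuning (sketch)."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Rank the following code snippets from most to least relevant to the query.\n\n"
        f"Query: {query}\n\n{numbered}\n\nRanking:"
    )
    completion = " > ".join(f"[{r}]" for r in teacher_ranking)  # e.g. "[3] > [1] > [2]"
    # The language-modeling loss is applied to the completion tokens, conditioned on the prompt.
    return {"prompt": prompt, "completion": completion}
```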
Baselines and Evaluation: We compare reranking performance with that of the zero-shot Qwen-2.5-Coder-7B-Instruct model, the base model for our fine-tuning. Since text-based LLMs are typically trained on both text and code data, we also include a listwise text reranker as a baseline: we fine-tune the Qwen-2.5-7B-Instruct LLM on 40k GPT-4-labeled listwise reranking instances derived from MS MARCO. We evaluate our models on the CodeSearchNet and AdvTest text-to-code retrieval benchmarks. During inference, we rerank the top 100 results from our code retriever, employing a window size of 10 and a step size of 5 for the listwise LLM rerankers.
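The sliding-window inference can be sketched as below, with `rerank_window` standing in for a call to the listwise LLM reranker; the back-to-front traversal, which lets strong candidates bubble toward the top, follows standard listwise sliding-window practice and is an assumption of this sketch.

```python
from typing import Callable, List

def sliding_window_rerank(query: str,
                          candidates: List[str],
                          rerank_window: Callable[[str, List[str]], List[str]],
                          window_size: int = 10,
                          step: int = 5) -> List[str]:
    """Rerank a candidate list back-to-front with overlapping windows (sketch)."""
    ranked = list(candidates)
    start = max(len(ranked) - window_size, 0)
    while True:
        # Let the listwise reranker reorder the current window in place.
        ranked[start:start + window_size] = rerank_window(query, ranked[start:start + window_size])
        if start == 0:
            break
        start = max(start - step, 0)
    return ranked
```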
Results: The text reranker Qwen-2.5-Text, although finetuned with listwise text data, performs strongly across programming languages, likely due to code examples in its pretraining data enhancing code comprehension. In contrast, the code model Qwen-2.5-Code underperforms in zero-shot listwise reranking but improves markedly after finetuning with code-specific listwise data created using CoRNStack.
Having previously evaluated our CodeRankEmbed and CodeRankLLM models on academic benchmarks, we now demonstrate their utility in assisting software development in real-world settings. Specifically, we focus on the task of function localization, which involves accurately identifying the specific functions that need to be modified in response to a GitHub issue.
Datasets: We evaluate our code retriever+reranker framework on SWE-Bench, a widely used repository-level benchmark that focuses on resolving real-world GitHub issues with code patches that pass the associated test cases. Specifically, we employ SWE-Bench-Lite, a subset of SWE-Bench, which we reformulate for function localization by treating the patched functions as the localization targets. We retained 274 of the 300 examples whose patches modify existing functions or classes; the excluded examples introduce code corresponding to new functions or import statements. The GitHub issue serves as the text query, and all functions in the repository are candidates for retrieval.
Baselines and Metrics: Our main baseline, Agentless, is an automated tool for tackling software development issues and is a top open-source performer on SWE-Bench-Lite. It operates in two phases: localization and repair. In localization, Agentless first identifies relevant files, then narrows down to specific classes, functions, and edit locations. Given the size of codebases, it uses file location information and GitHub issues to rank files that may need updates, then pinpoints functions needing changes within these files. Since Agentless selects up to three files for edits and localizes functions within them, we evaluate file localization at top 1–3 and function localization at top 5–10. We also compare against code retrieval baselines, excluding proprietary ones due to API costs.
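For concreteness, the localization metric can be computed along the lines of the sketch below, which counts an issue as correctly localized only when every gold edit location appears in the top-k results; whether full coverage or a single hit is required is an assumption of this sketch.

```python
def localization_accuracy(retrieved: list[list[str]],
                          gold: list[set[str]],
                          k: int) -> float:
    """Fraction of issues whose gold edit locations all appear in the top-k results (sketch).

    retrieved[i]: ranked file or function identifiers returned for issue i.
    gold[i]: the set of locations modified by the reference patch for issue i.
    """
    hits = sum(g.issubset(set(r[:k])) for r, g in zip(retrieved, gold))
    return hits / len(gold)
```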
Results: Our code retriever significantly outperforms Agentless and other retrieval baselines in function localization accuracy. Applying our code reranker over the retriever results yields consistent improvements in both file and function localization. While SWE-Bench-Lite is constructed from popular open-source Python repositories, we hypothesize that our retrieval-based approach could achieve greater improvements on private repositories, which are typically not included in LLM pretraining data, and we leave this investigation for future work.