Problem: Software issue localization—the task of identifying the exact code locations (files, classes, or functions) that correspond to a natural-language issue description (e.g., a bug report or feature request)—is vital for software development. Recent agentic approaches leveraging LLMs have shown promise but incur high costs due to complex multi-step reasoning and reliance on closed-source models. Conversely, traditional code-ranking techniques, optimized for small-scale query-to-code or code-to-code retrieval, struggle with the verbosity of issue localization queries.
Contribution: We present SWERank, a retrieve-and-rerank framework for software issue localization combining an embedding-based retriever (SWERankEmbed) with an LLM-based reranker (SWERankLLM). To facilitate training, we introduce SWELoc, a large-scale contrastive dataset mined from public GitHub repositories featuring real-world issue descriptions paired with corresponding code modifications.
Results: On the SWE-Bench-Lite and LocBench benchmarks, SWERank achieves state-of-the-art performance, surpassing both prior ranking models and costly agent-based systems. Additionally, we demonstrate that training existing retrievers and rerankers with SWELoc data yields significant improvements on issue localization.
Existing code retrieval datasets are valuable for NL-to-code search, which focuses on functionality matching. However, they are not ideal for training models for software issue localization, as software issues are often detailed failure descriptions rather than concise specifications. To address this, we created SWELoc, a large-scale dataset specifically for localizing code snippets relevant to software issues, derived from real-world GitHub repositories. Our curation process involves identifying relevant pull requests from Python repositories, extracting issue descriptions with code modifications, and applying filtering and negative mining to improve training data quality.
Creating Contrastive Data From GitHub Pull Requests: Our dataset, SWELoc, was created by selecting GitHub repositories associated with the top 11,000 PyPI packages, filtering them for quality (at least 80% Python, not in SWE-Bench or LocBench, and deduplicated by source code overlap). We then identified pull requests (PRs) within these repositories that were explicitly linked to resolving a GitHub issue and included modifications to test files. For each qualifying PR, we extracted the issue description and the codebase snapshot at the PR's base commit. From these (PR, codebase) pairs, we generated contrastive training data in the form of <query, positive, negatives> tuples. The issue description served as the query, and each modified function in the PR was a positive example, creating multiple training instances per PR. Negative examples consisted of all unmodified functions from the corresponding codebase snapshot, which were further refined using consistency filtering and hard-negative mining to enhance data quality and model training.
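To make the tuple construction concrete, the sketch below shows one way such <query, positive, negatives> instances could be assembled from a mined PR and its base-commit snapshot; the `pr` fields and the representation of functions as raw source strings are illustrative placeholders, not the released SWELoc pipeline.

```python
# Illustrative sketch (not the released SWELoc pipeline) of turning a mined
# (PR, codebase-snapshot) pair into <query, positive, negatives> tuples.
# `pr.issue_text` and `pr.modified_functions` are hypothetical fields; functions
# are represented here as raw source strings.

def build_contrastive_tuples(pr, snapshot_functions):
    """Create one training instance per function modified by the PR."""
    positives = set(pr.modified_functions)        # functions touched by the fix
    negatives = [f for f in snapshot_functions    # all unmodified functions in the
                 if f not in positives]           # codebase at the PR's base commit
    return [
        {"query": pr.issue_text, "positive": func, "negatives": negatives}
        for func in positives
    ]
```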
Consistency Filtering and Hard Negatives: The quality of training data, specifically the relevance of positive examples and the difficulty of negative examples, significantly impacts model performance. Issue descriptions can sometimes be vague, making scraped data unreliable for direct use in training. To address this, we first apply top-K consistency filtering, retaining only instances where the positive code snippet is semantically close to the query relative to other code snippets in the repository. Beyond filtering for the relevance of positive pairs, incorporating challenging negatives is crucial for enabling the model to distinguish between semantically similar instances. To this end, we employ a hard-negative mining strategy that selects the top-M functions most similar to the query as negatives.
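A minimal sketch of how the two steps could be composed is given below; it assumes an `embed` callable that returns L2-normalized NumPy vectors from any off-the-shelf code embedding model, and the values of K and M are illustrative, not the reported hyperparameters.

```python
# Minimal sketch of top-K consistency filtering followed by top-M hard-negative
# mining, assuming `embed` returns L2-normalized NumPy vectors.
import numpy as np

def filter_and_mine(instance, embed, K=20, M=15):
    texts = [instance["positive"]] + instance["negatives"]
    q = embed([instance["query"]])[0]        # (d,)
    cand = embed(texts)                      # (1 + N, d)
    sims = cand @ q                          # cosine similarities (normalized vectors)
    order = np.argsort(-sims)                # candidates ranked by similarity to the query
    # Consistency filtering: discard the instance if the positive (index 0)
    # is not among the top-K most query-similar snippets.
    if 0 not in order[:K]:
        return None
    # Hard-negative mining: keep only the M negatives most similar to the query.
    hard_idx = [i for i in order if i != 0][:M]
    return {"query": instance["query"],
            "positive": instance["positive"],
            "negatives": [texts[i] for i in hard_idx]}
```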
SWERank adopts a two-stage retrieve-and-rerank approach with two key components: (1) SWERankEmbed, a bi-encoder retrieval model that efficiently narrows down candidate code snippets from large codebases; and (2) SWERankLLM, an instruction-tuned listwise LLM reranker that refines these initial results for improved localization accuracy.
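Conceptually, the two stages compose as in the sketch below; `retriever.encode` and `reranker.rerank` are placeholder interfaces used for illustration, not the released SWERank API.

```python
# Conceptual sketch of the two-stage retrieve-and-rerank pipeline.
def localize(issue_text, functions, retriever, reranker, top_k=100, final_k=10):
    # Stage 1: bi-encoder retrieval narrows the codebase to top_k candidates.
    query_emb = retriever.encode([issue_text])[0]
    scores = retriever.encode(functions) @ query_emb      # cosine scores (normalized)
    shortlist = [functions[i] for i in scores.argsort()[::-1][:top_k]]
    # Stage 2: the listwise LLM reranker reorders the shortlist.
    return reranker.rerank(issue_text, shortlist)[:final_k]
```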
We trained the SWERankEmbed retrievers in two sizes: small and large. Our retrievers utilize a bi-encoder architecture, where weights are shared between the text and code encoders. They are fine-tuned on our SWELoc dataset using a contrastive learning objective based on the InfoNCE loss. SWERankEmbed-small is initialized with CodeRankEmbed, a state-of-the-art 137M parameter code embedding model, while the large variant is initialized with GTE-Qwen2-7B-Instruct, a 7B parameter text embedding model that employs Qwen2-7B-Instruct as its encoder.
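For reference, the sketch below shows a standard InfoNCE formulation over a positive and M mined negatives per query; the temperature value and the use of per-instance negatives are illustrative assumptions rather than reported training details.

```python
# Sketch of an InfoNCE contrastive objective for the shared bi-encoder.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, neg_emb, tau=0.05):
    """query_emb, pos_emb: (B, d); neg_emb: (B, M, d); all L2-normalized."""
    pos_sim = (query_emb * pos_emb).sum(-1, keepdim=True)        # (B, 1)
    neg_sim = torch.einsum("bd,bmd->bm", query_emb, neg_emb)     # (B, M)
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / tau         # positive at index 0
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```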
We also trained our rerankers in small and large sizes, using CodeRankLLM for the small version and Qwen-2.5-32B-Instruct for the large version. Our rerankers perform LLM-based listwise reranking, a technique that has gained prominence for its ability to score multiple passages simultaneously. The rerankers are initially pre-trained with text listwise reranking data to learn the listwise output format and are subsequently fine-tuned on SWELoc. Since SWELoc provides no ranked ordering among negative samples, we fine-tune with a modified objective that maximizes the likelihood that the first generated (i.e., top-ranked) identifier is the one corresponding to the positive candidate.
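One way to realize this modified objective is to supervise only the tokens of the first generated identifier, as in the sketch below; the prompt layout and the HuggingFace-style `model`/`tokenizer` interface are assumptions for illustration, not the authors' training code.

```python
# Sketch: cross-entropy is applied only to the tokens of the first identifier
# in the output, which is forced to be the positive candidate's identifier;
# the ordering of the remaining identifiers is left unsupervised.
import torch
import torch.nn.functional as F

def first_identifier_loss(model, tokenizer, listwise_prompt, positive_identifier):
    prompt_ids = tokenizer(listwise_prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(positive_identifier, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)
    logits = model(input_ids).logits                        # (1, P + T, vocab)
    # Positions P-1 .. P+T-2 predict the T tokens of the positive identifier.
    pred_logits = logits[:, prompt_ids.size(1) - 1:-1, :]
    return F.cross_entropy(pred_logits.reshape(-1, pred_logits.size(-1)),
                           target_ids.reshape(-1))
```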
Datasets: We utilize SWE-Bench-Lite and LocBench benchmarks for evaluation. Following prior work, we exclude examples from SWE-Bench-Lite where no existing functions were modified by the patch, resulting in 274 retained examples out of the original 300. While SWE-Bench-Lite primarily consists of bug reports and feature requests, LocBench contains 560 examples overall and additionally includes instances related to security and performance issues.
Baselines and Metrics: We primarily compare SWERank against prior agent-based localization methods, including OpenHands, SWE-Agent, MoatlessTools, and LocAgent, the current state-of-the-art localization approach. These methods mainly use closed-source models such as GPT-4o and Claude-3.5, though LocAgent also fine-tunes open-source models. We also compare SWERank with other performant code retrievers and rerankers, including CodeRankLLM and GPT-4.1. Following prior work, we use Accuracy@k, which deems localization successful only if all relevant code locations are identified within the top-k results. We measure localization accuracy at three granularities: file, module (class), and function.
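For concreteness, a minimal reading of Accuracy@k at a single granularity is sketched below; the input format (one ranked prediction list and one gold-location set per example) is an assumption.

```python
# Accuracy@k: an example counts as correctly localized only if *all* of its
# gold locations appear within the top-k ranked predictions.
def accuracy_at_k(ranked_predictions, gold_locations, k):
    hits = sum(1 for preds, gold in zip(ranked_predictions, gold_locations)
               if set(gold) <= set(preds[:k]))
    return hits / len(gold_locations)
```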
Results: On both SWE-Bench-Lite and LocBench, SWERank outperforms all evaluated agent-based methods and prior retrievers and rerankers, establishing a new state of the art for localization performance. Furthermore, SWERank is Pareto-optimal with respect to cost and performance, offering significant cost savings. The SWERankLLM reranker only needs to generate candidate identifiers as output to determine the ranking order, and the SWERankEmbed output embeddings can be pre-computed, resulting in negligible inference cost. In contrast, agent-based localization can incur considerable time and expense due to multi-turn interactions, each requiring the generation of lengthy reasoning steps.