Project Proposal
Title
Parallelizing the Gale-Shapley Algorithm
Rhea Kripalani and Lucy Wang
URL
https://lucyw3857116.github.io/gale-shapley.github.io/
Summary
We propose to parallelize the Gale-Shapley stable matching algorithm on a multicore CPU, and optionally GPU, platform. Our implementation aims to reduce runtime compared to the traditional O(n²) sequential version by leveraging parallelism during the proposal and update phases of the Parallel Iterative Improvement (PII) algorithm. Additional enhancements, such as smart initialization and cycle detection, may be evaluated as well.
Background
The Stable Matching Problem, also known as the Stable Marriage Problem, was first introduced by Gale and Shapley in 1962. It describes a scenario in which two groups of the same size, traditionally men and women, rank each other in order of preference. The goal is to find a one-to-one stable match for all individuals. The matching is considered stable if no two individuals would both rather be matched with each other than with their current partners. Gale and Shapley proved that a stable matching always exists, regardless of the input preference lists.
The classic Gale-Shapley algorithm is an iterative, deterministic solution that guarantees a stable matching in O(n²) time for n participants per group. In each round, each unmatched proposer (e.g., a man) submits a proposal to the highest-ranked participant on his preference list who has not yet rejected him. Each acceptor (e.g., a woman) reviews all proposals she receives, tentatively accepts her most-preferred offer, and rejects the rest. This continues until all participants are matched. Although conceptually simple, the algorithm is inherently sequential: each round depends on the updated matchings and responses from the previous round.
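To make the baseline concrete, below is a minimal sketch of the sequential algorithm we plan to implement (the array layout and function names are our own choices, not prescribed by the original paper):

```cpp
#include <queue>
#include <vector>

// Proposer-optimal Gale-Shapley. pref[m][k] is proposer m's k-th choice;
// rank[w][m] is acceptor w's rank of proposer m (lower is better).
// Returns match, where match[w] is the proposer finally paired with w.
std::vector<int> galeShapley(const std::vector<std::vector<int>>& pref,
                             const std::vector<std::vector<int>>& rank) {
    int n = (int)pref.size();
    std::vector<int> match(n, -1);    // acceptor -> proposer, -1 while free
    std::vector<int> next(n, 0);      // next index into each proposer's list
    std::queue<int> unmatched;
    for (int m = 0; m < n; ++m) unmatched.push(m);

    while (!unmatched.empty()) {
        int m = unmatched.front(); unmatched.pop();
        int w = pref[m][next[m]++];   // best choice m has not yet proposed to
        if (match[w] == -1) {
            match[w] = m;             // w was free: tentatively accept
        } else if (rank[w][m] < rank[w][match[w]]) {
            unmatched.push(match[w]); // w trades up; old partner re-proposes
            match[w] = m;
        } else {
            unmatched.push(m);        // w rejects m; m tries his next choice
        }
    }
    return match;
}
```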
Despite its simplicity and theoretical guarantees, the sequential nature of the Gale-Shapley algorithm makes it hard to scale to large input sizes or real-time applications (e.g., live matching platforms or high-speed scheduling). This has motivated multiple efforts to parallelize the stable matching problem. One approach is the Parallel Iterative Improvement (PII) algorithm, which begins with a random initial (possibly unstable) matching and refines it by repeatedly identifying and resolving unstable pairs. Although this method achieves an average-case runtime of O(n log n) using O(n²) processors, it converges only about 90% of the time. To address this limitation, the PII-SC algorithm combines cycle detection and smart initialization, significantly improving the success rate by reducing the probability of non-convergence to below 10⁻⁷.
We aim to implement and investigate tradeoffs in synchronization, communication patterns, and scalability for the sequential, PII, and PII-SC algorithms.
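As a first building block for PII, the detection of unstable (blocking) pairs parallelizes naturally over the n² candidate pairs. A rough OpenMP sketch, assuming precomputed rank matrices; the full proposal and update phases of PII follow White & Lu and are not shown here:

```cpp
#include <utility>
#include <vector>

// Collect all blocking pairs (m, w) of a perfect matching: m prefers w to
// his current partner AND w prefers m to hers. Rank matrices use lower =
// more preferred. Each thread buffers results locally to avoid contention.
std::vector<std::pair<int, int>> findBlockingPairs(
        const std::vector<int>& partnerOfMan,            // m -> w
        const std::vector<int>& partnerOfWoman,          // w -> m
        const std::vector<std::vector<int>>& manRank,    // manRank[m][w]
        const std::vector<std::vector<int>>& womanRank)  // womanRank[w][m]
{
    int n = (int)partnerOfMan.size();
    std::vector<std::pair<int, int>> blocking;
    #pragma omp parallel
    {
        std::vector<std::pair<int, int>> local;
        #pragma omp for collapse(2) nowait
        for (int m = 0; m < n; ++m)
            for (int w = 0; w < n; ++w)
                if (manRank[m][w] < manRank[m][partnerOfMan[m]] &&
                    womanRank[w][m] < womanRank[w][partnerOfWoman[w]])
                    local.emplace_back(m, w);
        #pragma omp critical
        blocking.insert(blocking.end(), local.begin(), local.end());
    }
    return blocking;
}
```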
The Challenge
Parallelizing the Gale-Shapley algorithm presents several non-trivial challenges:
- Dependency Chains
- The standard Gale-Shapley algorithm is conducted in rounds, where each unmatched proposer submits a proposal and decisions are made based on previous matchings. This introduces a strong sequential dependency: the outcome of one round affects the next, so naively parallelizing proposal rounds can lead to race conditions or incorrect behavior. Preserving the invariant that every step produces a stable intermediate state requires careful coordination or rethinking of how to “batch” proposals without violating correctness.
- Processor Scalability and Resource Constraints
- The Parallel Iterative Improvement (PII) variant of Gale-Shapley achieves good runtime by assuming the availability of O(n²) processors. However, this is impractical for most real-world systems, where the number of available CPU threads or GPU cores is much smaller. Adapting these highly parallel algorithms to real hardware requires either reducing the number of concurrent threads or finding a method to map logical threads to physical cores while maintaining performance.
- Synchronization and Communication Overhead
- Multiple threads proposing, accepting, and rejecting partners concurrently can lead to contention over shared data structures such as preference lists, proposal queues, and matchings. Ensuring that these updates occur atomically and in a consistent order adds overhead, especially in shared-memory environments. Synchronization mechanisms (e.g., locks, barriers, or atomic operations) may become bottlenecks, negating the potential speedup from parallelism.
- Correctness vs. Performance Tradeoff
- Gains from parallelism must not come at the cost of unstable or inconsistent matchings when the algorithm converges. Avoiding processor conflicts, such as two proposers simultaneously targeting the same acceptor, also requires coordination between processors. Designing algorithms that are both fast and provably correct under parallel execution is non-trivial, especially as the system scales.
- Memory Access Patterns and Data Locality
- Accessing and updating preference lists frequently can cause cache inefficiencies, particularly if the data is scattered across memory. Poor memory locality can potentially negatively impact performance on both CPUs and GPUs. Structuring memory layouts to favor spatial locality and minimizing false sharing are crucial for high-throughput parallel execution.
We will explore ways to mitigate these challenges, including batching proposals, using fine-grained locks or atomic operations (one candidate approach is sketched below), and experimenting with memory layouts that improve data locality. We will also assess whether shared-memory multicore systems or massively parallel GPUs (e.g., via CUDA) are better suited for different parts of the workload.
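As an example of the fine-grained synchronization we will evaluate, each acceptor's tentative partner could be stored as an atomic integer and updated with compare-and-swap, so that concurrent proposers never serialize on a global lock. This is a hypothetical sketch, not our final design:

```cpp
#include <atomic>
#include <vector>

// Lock-free proposal: partner[w] holds acceptor w's tentative partner
// (-1 while free). rank[w][m] is w's rank of m (lower is better).
// Returns true if w accepts m; `displaced` is then w's previous partner
// (-1 if w was free) and must be re-queued by the caller.
bool tryPropose(int m, int w, int& displaced,
                std::vector<std::atomic<int>>& partner,
                const std::vector<std::vector<int>>& rank) {
    int cur = partner[w].load(std::memory_order_relaxed);
    while (cur == -1 || rank[w][m] < rank[w][cur]) {
        if (partner[w].compare_exchange_weak(cur, m,
                                             std::memory_order_acq_rel,
                                             std::memory_order_relaxed)) {
            displaced = cur;  // whoever w just dropped (or -1 if she was free)
            return true;
        }
        // CAS failed: cur was refreshed to the winning proposer; re-check.
    }
    return false;  // w prefers her current tentative partner; m is rejected
}
```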
Resources
We will base our parallel implementation on the PII and PII-SC algorithms, as described in White & Lu, An Improved Parallel Iterative Algorithm for Stable Matching, SC13 Poster (link).
Another resource we will reference is Jones's gale-shapley repository, which generates input data for the Gale-Shapley algorithm that we can use to check the correctness of our implementations (Jones, B. C. (2019). gale-shapley. GitHub. (link)).
We will start from scratch in C++ and explore pthreads/OpenMP, and possibly MPI (via Open MPI), for the multicore CPU implementation. Optionally, we would also like to use CUDA to take advantage of large thread counts. We would like to use the GHC lab machines and PSC machines for scalability testing. Our input data (preference lists) will be randomly generated with controlled seeds for reproducibility.
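For reproducible inputs, a seeded shuffle suffices; a minimal sketch of the generator we have in mind:

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// n preference lists, each a random permutation of {0, ..., n-1}.
// Reusing the same seed reproduces the exact same test instance.
std::vector<std::vector<int>> makePreferences(int n, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<std::vector<int>> pref(n, std::vector<int>(n));
    for (auto& list : pref) {
        std::iota(list.begin(), list.end(), 0);       // 0, 1, ..., n-1
        std::shuffle(list.begin(), list.end(), rng);  // seeded permutation
    }
    return pref;
}
```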
Goals and Deliverables
Plan to Achieve
- Sequential Baseline Implementation
- We will implement the classical Gale-Shapley algorithm in C++ as a correctness reference and performance baseline. This implementation will help us verify the correctness of our parallel versions and evaluate the relative speedup gained through parallelism.
- Parallel Implementation using PII Algorithm
- We will implement a parallel version of the Gale-Shapley algorithm based on the Parallel Iterative Improvement (PII) approach. This version will use thread-level parallelism (e.g., via pthreads or OpenMP) to resolve unstable pairs in parallel. The challenge will be to design a thread-safe system that maintains correctness despite concurrent updates.
- Implement Smart Initialization and Cycle Detection (PII-SC)
- We will extend our PII implementation with the enhancements described in the PII-SC algorithm. Smart initialization will reduce the number of iterations needed to converge, and cycle detection will ensure stability even in difficult matching configurations. We aim to reduce convergence failures and bring success rates closer to 100%.
- Correctness and Convergence Validation
- We will develop a validation tool that checks the final matching for stability (i.e., ensures no blocking pairs remain; essentially the blocking-pair check sketched in the Background, run to completion). We will also benchmark the number of iterations and convergence behavior under varying input sizes and thread counts.
- Performance and Scalability Benchmarking
- We will measure runtime and speedup across multiple configurations (e.g., varying n, thread count, input distribution). Our goal is to evaluate parallel scalability, thread efficiency, and overheads using profiling tools (e.g., perf, timers, etc.); a minimal timing harness is sketched after this list.
- Poster and Visualization
- We will create visualizations of the matching evolution and performance metrics (e.g., convergence curves, speedup vs. thread count) to present our findings in the final poster and presentation.
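To keep measurements consistent across variants, we plan to wrap each implementation in a small timing harness along these lines (a sketch; the clock source and averaging policy are our own choices):

```cpp
#include <chrono>

// Wall-clock timing for one run; each algorithm variant will be wrapped in
// a harness like this and averaged over several seeded instances.
template <typename F>
double timeSeconds(F&& run) {
    auto t0 = std::chrono::steady_clock::now();
    run();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Usage (names from the earlier sketches):
//   double secs = timeSeconds([&] { galeShapley(pref, rank); });
// Speedup at T threads is then sequentialSecs / parallelSecs.
```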
Hope to Achieve
- Exploring GPU Implementation Using CUDA
- If time permits, we will design a CUDA implementation that distributes work across GPU threads. This will allow us to investigate fine-grained parallelism and compare it against our multicore CPU implementation, particularly in terms of memory access and synchronization overhead.
- Thread Efficiency under Sublinear Thread Counts
- We will experiment with configurations where the number of threads is significantly lower than n², using batching and shared work queues to simulate real-world hardware constraints (see the sketch after this list). This will help us study how the algorithm degrades under limited resources.
- Data Sensitivity Experiments
- We will vary the structure of the preference lists (e.g., full rankings vs. partial rankings, biased preferences vs. random) and analyze how these factors influence performance and convergence. This would deepen our understanding of input-sensitive behavior in parallel systems.
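One possible shape for the batching experiment: hand out fixed-size batches of the n² logical pair-checks from a shared atomic counter to a small pool of worker threads. A hypothetical sketch; the batch size is a tunable parameter:

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Distribute n*n logical pair-checks over T worker threads using a shared
// atomic counter that hands out fixed-size batches of work.
void runBatched(int n, int T, long long batch,
                const std::function<void(int, int)>& checkPair) {
    std::atomic<long long> next{0};
    const long long total = (long long)n * n;
    std::vector<std::thread> workers;
    for (int t = 0; t < T; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                long long start = next.fetch_add(batch);
                if (start >= total) break;
                long long end = std::min(start + batch, total);
                for (long long i = start; i < end; ++i)
                    checkPair((int)(i / n), (int)(i % n));  // decode (m, w)
            }
        });
    }
    for (auto& th : workers) th.join();
}
```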
Platform Choice
We will begin with a multicore CPU implementation in C++ using pthreads or OpenMP, since thread-level parallelism and shared memory are easier to control and debug during development. CPU platforms offer strong coordination and memory consistency, which benefits our algorithm's many correctness-sensitive steps. We may also explore CUDA on GPUs to evaluate performance with large thread counts.
Schedule
- Week 1 (March 24 – March 28): Finalize problem formulation, implement sequential Gale-Shapley, and set up benchmarking tools.
- Week 2 (March 31 – April 4): Implement the parallel PII algorithm using threads, and validate correctness. Potentially begin implementing the PII-SC algorithm with cycle detection and smart initialization.
- Week 3 (April 7 – April 11): Complete the PII-SC algorithm and test large input cases. Potentially begin CUDA implementation of our parallelized algorithm.
- Week 4 (April 14 – April 18): Attempt to complete CUDA implementation. Profile the performance of all algorithms, and run experiments to compare all algorithms.
- Week 5 (April 21 – April 25): Optimize final implementation, finalize results, and complete the poster and final report.