Project Proposal

Title

Parallelizing the Gale-Shapley Algorithm
Rhea Kripalani and Lucy Wang

URL

https://lucyw3857116.github.io/gale-shapley.github.io/

Summary

We propose to parallelize the Gale-Shapley stable matching algorithm on a multicore CPU, and optionally GPU, platform. Our implementation aims to reduce runtime compared to the traditional O(n^2) sequential version by exploiting parallelism in the proposal and update phases of the Parallel Iterative Improvement (PII) algorithm. Additional enhancements, such as smart initialization and cycle detection, can be evaluated as well.

Background

The Stable Matching Problem, also known as the Stable Marriage Problem, was first introduced by Gale and Shapley in 1962. It describes a scenario in which two groups of the same size, traditionally men and women, rank each other in order of preference. The goal is to find a one-to-one stable match for all individuals. A matching is considered stable if no two individuals would both rather be matched with each other than with their current partners. Gale and Shapley proved that a stable matching always exists, regardless of the input preference lists.

The classic Gale-Shapley algorithm is an iterative, deterministic solution that guarantees a stable matching in O(n^2) time for n participants per group. In each round, each unmatched proposer (e.g., a man) submits a proposal to the highest-ranked participant on their preference list who has not yet rejected them. Each acceptor (e.g., a woman) reviews all proposals she receives, tentatively accepts her most-preferred offer, and rejects the rest. This continues until all participants are matched. Although conceptually simple, the algorithm is inherently sequential, as each round depends on the updated matchings and responses from the previous round.

Despite its simplicity and theoretical guarantees, the sequential nature of the Gale-Shapley algorithm presents challenges to scaling for large input sizes or real-time applications (e.g., live matching platforms or high-speed scheduling). This has led to multiple efforts to parallelize the stable matching problem. One approach is the Parallel Iterative Improvement (PII) algorithm, which begins with a random initial (possibly unstable) matching and refines it by repeatedly identifying and resolving unstable pairs. Although this method can achieve an average-case runtime of O(n log(n)) using O(n^2) processors, it converges only about 90% of the time. To address this limitation, the PII-SC algorithm combines cycle detection and smart initialization, significantly improving the success rate by reducing the probability of failing to converge to below 10^-7.

We aim to implement and investigate tradeoffs in synchronization, communication patterns, and scalability for the sequential, PII, and PII-SC algorithms.

The Challenge

Parallelizing the Gale-Shapley algorithm presents several non-trivial challenges: concurrent proposals can target the same acceptor, creating contention on shared match state; each round depends on the matchings and responses produced by the previous round, limiting the available parallelism; and scattered accesses to large preference lists can suffer from poor data locality.

We will explore ways to mitigate these challenges, including batching proposals, using fine-grained locks or atomic operations, and experimenting with memory layouts that improve data locality. We will also assess whether shared-memory multicore systems or massively parallel GPUs (e.g., via CUDA) are better suited for different parts of the workload.

Resources

We will base our parallel implementation on the PII and PII-SC algorithms, as described in White & Lu, An Improved Parallel Iterative Algorithm for Stable Matching, SC13 Poster (link).

Another resource we will reference is Jones's gale-shapley repository, which generates test data for the Gale-Shapley algorithm that we can use to check the correctness of our implementations (Jones, B. C. (2019). gale-shapley. GitHub. (link)).

We will start from scratch in C++ and explore pthreads/OpenMP and Open MPI for the multicore CPU implementation. Optionally, we also want to use CUDA to take advantage of large thread counts. We would like to use the GHC lab machines and PSC machines for scalability testing. Our input data (preference lists) will be randomly generated with controlled seeds for reproducibility.

Goals and Deliverables

Plan to Achieve

Hope to Achieve

Platform Choice

We will begin with a multicore CPU implementation in C++ using pthreads or OpenMP, since thread-level parallelism and shared memory are easier to control and debug during development. CPU platforms offer strong coordination and memory consistency, which benefits an algorithm with many correctness-sensitive steps. We may also explore CUDA on GPUs to evaluate performance at large thread counts.

Schedule