Milestone Report

Rhea Kripalani and Lucy Wang

Summary

In the past two and a half weeks, we finalized the problem we are trying to parallelize and implemented two methods of parallelization: a shared address space among threads (OpenMP) and message passing between processes (OpenMPI).

Problem Statement: Given a full preference list for each man and each woman (with 2n participants, each participant ranks all n members of the other group), we want to write a parallelized Gale-Shapley algorithm that produces a stable pairing for every participant.
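For context, the sequential baseline is the standard proposal/rejection formulation of Gale-Shapley. The following is a minimal sketch of that loop, assuming full n-by-n preference lists; identifier names such as menPrefs and womanMatch are illustrative and do not necessarily match our code.

#include <queue>
#include <vector>

// Minimal sketch of sequential Gale-Shapley with full preference lists.
// menPrefs[m] lists women in man m's order of preference; womenPrefs[w]
// lists men in woman w's order. Returns womanMatch, where womanMatch[w]
// is the man matched with woman w.
std::vector<int> galeShapley(const std::vector<std::vector<int>> &menPrefs,
                             const std::vector<std::vector<int>> &womenPrefs) {
    int n = (int)menPrefs.size();

    // womenRank[w][m] = position of man m in woman w's list (lower is
    // better), so each accept/reject decision is O(1).
    std::vector<std::vector<int>> womenRank(n, std::vector<int>(n));
    for (int w = 0; w < n; w++)
        for (int r = 0; r < n; r++)
            womenRank[w][womenPrefs[w][r]] = r;

    std::vector<int> nextProposal(n, 0); // next index in each man's list
    std::vector<int> womanMatch(n, -1);  // current partner of each woman
    std::queue<int> freeMen;
    for (int m = 0; m < n; m++) freeMen.push(m);

    while (!freeMen.empty()) {
        int m = freeMen.front(); freeMen.pop();
        int w = menPrefs[m][nextProposal[m]++]; // propose to next choice
        if (womanMatch[w] == -1) {
            womanMatch[w] = m;                  // w was free: accept
        } else if (womenRank[w][m] < womenRank[w][womanMatch[w]]) {
            freeMen.push(womanMatch[w]);        // w prefers m: displace
            womanMatch[w] = m;
        } else {
            freeMen.push(m);                    // w rejects m: stay free
        }
    }
    return womanMatch;
}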

We went back and forth on whether to require full preference lists after coming across a GitHub project that generates data for the Gale-Shapley algorithm and lets the user specify how long the preference lists in the dataset should be. When we ran on data generated with preference lists shorter than the number of men or women, our algorithm was not able to find a stable match for everyone, so we decided to stick with full preference lists for now, as that is how the algorithm is used in most applications anyway.

Our original plan was to implement the PII and PII-SC algorithms as defined in this poster, but we realized that these algorithms are much more complex than we had previously thought. We spent a week trying to implement and debug the PII algorithm, but we couldn't get it to converge on stable pairings for everyone. To avoid spending all of our time debugging one algorithm and potentially having nothing to write about in our report, we decided to pivot to a different parallelization of the algorithm and implement it with both a shared address space (OpenMP) and message passing (OpenMPI); a rough sketch of the shared-address-space round structure follows. Our next goal is to implement Gale-Shapley with CUDA.
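To illustrate the shared-address-space idea, here is a sketch of one possible round structure, reusing the data layout from the sequential sketch above; it is an illustration, not necessarily exactly how our OpenMP version is written. In each round, every free man proposes to the next woman on his list in parallel, and every woman then keeps the best proposal she received.

#include <omp.h>
#include <vector>

// Sketch of one proposal round under a shared address space (all names
// illustrative). Returns true if at least one man is still unmatched.
bool proposalRound(const std::vector<std::vector<int>> &menPrefs,
                   const std::vector<std::vector<int>> &womenRank,
                   std::vector<int> &nextProposal,
                   std::vector<int> &manMatch,     // man -> woman, or -1
                   std::vector<int> &womanMatch) { // woman -> man, or -1
    int n = (int)menPrefs.size();
    std::vector<int> bestProposer(n, -1); // best proposer per woman this round

    // Phase 1: every free man proposes to his next choice. Conflicts on
    // the same woman are resolved in a critical section here; per-woman
    // locks or atomic compare-and-swap would also work.
    #pragma omp parallel for
    for (int m = 0; m < n; m++) {
        if (manMatch[m] != -1) continue;
        int w = menPrefs[m][nextProposal[m]++];
        #pragma omp critical
        {
            if (bestProposer[w] == -1 ||
                womenRank[w][m] < womenRank[w][bestProposer[w]])
                bestProposer[w] = m;
        }
    }

    // Phase 2: each woman compares her best proposer with her current
    // match; a displaced man becomes free and proposes again next round.
    #pragma omp parallel for
    for (int w = 0; w < n; w++) {
        int m = bestProposer[w];
        if (m == -1) continue;
        int cur = womanMatch[w];
        if (cur == -1 || womenRank[w][m] < womenRank[w][cur]) {
            if (cur != -1) manMatch[cur] = -1;
            womanMatch[w] = m;
            manMatch[m] = w;
        }
    }

    // Check whether another round is needed.
    bool anyFree = false;
    #pragma omp parallel for reduction(||:anyFree)
    for (int m = 0; m < n; m++)
        anyFree = anyFree || (manMatch[m] == -1);
    return anyFree;
}

A driver would call proposalRound repeatedly until it returns false; the loop terminates because each free man makes one new proposal per round and no man ever proposes to the same woman twice.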

Updated Schedule

Goals and Deliverables

From the Initial Proposal

In summary, we did have to move some of our goals around. Specifically, we moved the PII and PII-SC Algorithm deliverable to the “nice to haves” section. We replaced this deliverable with the shared address space, message passing, and CUDA implementations. We should be able to complete our new set of goals. Depending on how much time the CUDA implementation takes us, we might be able to achieve one of our “nice to haves” deliverables.

New Goals

Poster Sessions

Our goal is to show a graph of the speedups of our different implementations and to discuss the differences and benefits of each type of parallelization as applied to the Gale-Shapley algorithm. We will also describe how we completed our parallel implementations and discuss potential future implementations or improvements we would have pursued with more time.

Preliminary Results

Sequential Gale-Shapley Algorithm:

Test case           Initialization time   Computation time
master/100_100      0.012341196           0.000221127
random/100_100      0.001781225           0.000025335
shuffle/100_100     0.001931884           0.000029665
master/500_500      0.023034474           0.000691897
random/500_500      0.023155006           0.00072703
shuffle/500_500     0.022265412           0.000586448
master/1000_1000    0.085145133           0.003029481
random/1000_1000    0.085594264           0.002465578
shuffle/1000_1000   0.084825035           0.002196202
2000_2000           0.370548774           0.009627199

We also have the initialization and computation times of the OpenMP shared address space version for all of the test cases listed above; for simplicity, we only include the n = 1000 and n = 2000 results here.

n = 1000    Initialization time   Computation time
1 Thread    0.096207367           0.006100127
2 Threads   0.085109435           0.005300165
4 Threads   0.086066042           0.005146181
8 Threads   0.084503596           0.006054297

n = 2000    Initialization time   Computation time
1 Thread    0.387654479           0.026912896
2 Threads   0.370879461           0.021700406
4 Threads   0.369560469           0.019854909
8 Threads   0.369961981           0.022176926

Similarly, we have the results for the OpenMPI message passing version on the same test cases; a rough sketch of one way a proposal round could be organized with message passing follows the table.

n = 1000      Initialization time   Computation time
1 Process     0.365146032           0.004490897
2 Processes   0.3604769875          0.003742539
4 Processes   0.3132831853          0.00417506475
8 Processes   0.3411113851          0.006089136625

n = 2000      Initialization time   Computation time
1 Process     0.646786017           0.021756559
2 Processes   0.64432722            0.0192913835
4 Processes   0.606660147           0.022002529
8 Processes   0.6338493458          0.03287053375
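As with the OpenMP sketch, the following is only an illustration of one way to structure a message-passing round (assuming the men are split into equal, contiguous blocks across ranks, with manMatch and womanMatch replicated on every rank); it is not necessarily exactly how our OpenMPI version works.

#include <mpi.h>
#include <vector>

// Sketch of one proposal round with message passing. Each rank owns a
// contiguous block of men (localCount men starting at firstMan) and only
// their preference lists; rank 0 resolves all proposals against the
// women's rankings and broadcasts the updated matching. All names are
// illustrative; assumes n is divisible by the number of ranks.
void mpiProposalRound(const std::vector<std::vector<int>> &localMenPrefs,
                      const std::vector<std::vector<int>> &womenRank, // full copy on rank 0
                      std::vector<int> &localNextProposal,
                      std::vector<int> &manMatch,   // replicated, size n
                      std::vector<int> &womanMatch, // replicated, size n
                      int firstMan, int n) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int localCount = (int)localMenPrefs.size();
    std::vector<int> localProposals(localCount, -1); // woman proposed to, or -1
    for (int i = 0; i < localCount; i++) {
        int m = firstMan + i;
        if (manMatch[m] == -1)
            localProposals[i] = localMenPrefs[i][localNextProposal[i]++];
    }

    // Gather this round's proposals on rank 0, one slot per man.
    std::vector<int> allProposals(rank == 0 ? n : 0);
    MPI_Gather(localProposals.data(), localCount, MPI_INT,
               allProposals.data(), localCount, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        // Apply the usual accept/reject rule for every proposal.
        for (int m = 0; m < n; m++) {
            int w = allProposals[m];
            if (w == -1) continue;
            int cur = womanMatch[w];
            if (cur == -1 || womenRank[w][m] < womenRank[w][cur]) {
                if (cur != -1) manMatch[cur] = -1; // displaced man is free again
                womanMatch[w] = m;
                manMatch[m] = w;
            }
        }
    }

    // Every rank receives the updated matching before the next round.
    MPI_Bcast(manMatch.data(), n, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(womanMatch.data(), n, MPI_INT, 0, MPI_COMM_WORLD);
}

The driver would repeat this round until a scan of manMatch (identical on every rank after the broadcasts) finds no free men.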

There is not much speedup from one thread to eight; in some cases the computation time actually increases slightly as we add threads or processes. Additionally, we noticed that the computation time of our parallel algorithms is larger than that of our sequential algorithm. We believe this is because the parallel algorithms carry a lot more overhead, and our datasets aren't big enough to make that overhead worthwhile. The difference in times could also be because we use a different formulation of the Gale-Shapley algorithm to fit each method of parallelization. We hope to see clearer trends with larger datasets.

Concerns

Currently, as we try to generate larger datasets to test our implementations, we exceed our disk quota. We have deleted extra files from our AFS space, yet the largest dataset we can currently test on is n = 2000 with full preference lists. This isn't enough to compare against real applications: for example, the Gale-Shapley algorithm is used to match medical students to residencies, and around 52,000 students need to be matched. Because we can't generate such large datasets, we are concerned that we won't be able to see the benefits of parallelization over the extra overhead.