Milestone Report
Rhea Kripalani and Lucy Wang
In the past two and a half weeks, we finalized the problem we are trying to parallelize and implemented two methods of parallelization: a shared address space among threads and message passing between processes.
Problem Statement: Given a full preference list for each man and woman (if there are 2*n participants, each participant has n people in their preference list), we want to write a parallelized Gale-Shapley algorithm that will create a stable pairing for each participant.
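For reference, here is a minimal C++ sketch of the sequential algorithm (the names and data layout are illustrative, not our exact code): each free man proposes to the best woman he has not yet tried, and each woman keeps her most-preferred suitor seen so far.

```cpp
#include <queue>
#include <vector>

// Minimal sketch of sequential Gale-Shapley with full preference lists.
// prefs_m[m][k] is man m's k-th choice; rank_w[w][m] is woman w's rank of
// man m (lower = more preferred). Returns match_w, where match_w[w] is the
// man matched with woman w.
std::vector<int> gale_shapley(const std::vector<std::vector<int>>& prefs_m,
                              const std::vector<std::vector<int>>& rank_w) {
    int n = (int)prefs_m.size();
    std::vector<int> match_w(n, -1);       // woman -> man, -1 while unmatched
    std::vector<int> next_proposal(n, 0);  // next index into each man's list
    std::queue<int> free_men;
    for (int m = 0; m < n; m++) free_men.push(m);

    while (!free_men.empty()) {
        int m = free_men.front(); free_men.pop();
        int w = prefs_m[m][next_proposal[m]++];   // best woman m has not tried
        if (match_w[w] == -1) {
            match_w[w] = m;                        // w was free: she accepts
        } else if (rank_w[w][m] < rank_w[w][match_w[w]]) {
            free_men.push(match_w[w]);             // w trades up; old partner freed
            match_w[w] = m;
        } else {
            free_men.push(m);                      // w rejects m; he tries again
        }
    }
    return match_w;
}
```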
We went back and forth on whether to require full preference lists after coming across a GitHub project that generates data for the Gale-Shapley algorithm and lets the user specify how long the preference lists in the dataset should be. When we generated data with preference lists shorter than the number of men or women, our algorithm could not find a stable match for everyone; this is expected, since with incomplete preference lists a stable matching can leave some participants unmatched. We therefore decided to stick with full preference lists for now, which is also how most real applications work.
Our original plan was to implement the PII and PII-SC algorithms as defined in this poster, but we realized that these algorithms are much more complex than we previously thought. We spent a week trying to implement and debug the PII algorithm, but we could not get it to converge to stable pairings for everyone. To avoid spending all of our time debugging one algorithm and potentially having nothing to write about in our report, we pivoted to a different parallelization of the algorithm and implemented it with both a shared address space and message passing. Our next goal is to implement Gale-Shapley with CUDA.
In summary, we had to move some of our goals around. Specifically, we moved the PII and PII-SC algorithm deliverable to the “nice to haves” section and replaced it with the shared address space, message passing, and CUDA implementations. We should be able to complete our new set of goals, and depending on how long the CUDA implementation takes, we may also achieve one of our “nice to have” deliverables.
Our goal is to show a graph of the speedups of our different implementations and to discuss the differences and benefits of each type of parallelization as applied to the Gale-Shapley algorithm. We will also describe how we completed our parallel implementations and discuss future implementations or improvements we would pursue if we had more time.
Sequential Gale-Shapley Algorithm:
dataset | initialization time (s) | computation time (s) |
---|---|---|
master/100_100 | 0.012341196 | 0.000221127 |
random/100_100 | 0.001781225 | 0.000025335 |
shuffle/100_100 | 0.001931884 | 0.000029665 |
master/500_500 | 0.023034474 | 0.000691897 |
random/500_500 | 0.023155006 | 0.00072703 |
shuffle/500_500 | 0.022265412 | 0.000586448 |
master/1000_1000 | 0.085145133 | 0.003029481 |
random/1000_1000 | 0.085594264 | 0.002465578 |
shuffle/1000_1000 | 0.084825035 | 0.002196202 |
2000_2000 | 0.370548774 | 0.009627199 |
We also collected initialization and computation times for the OpenMP shared address space version on all of the test cases listed above; for brevity, we include only the n = 1000 and n = 2000 results here. (A sketch of the proposal-round parallelization appears after the tables.)
n=1000 | initialization time (s) | computation time (s) |
---|---|---|
1 Thread | 0.096207367 | 0.006100127 |
2 Threads | 0.085109435 | 0.005300165 |
4 Threads | 0.086066042 | 0.005146181 |
8 Threads | 0.084503596 | 0.006054297 |
n=2000 | initialization time (s) | computation time (s) |
---|---|---|
1 Thread | 0.387654479 | 0.026912896 |
2 Threads | 0.370879461 | 0.021700406 |
4 Threads | 0.369560469 | 0.019854909 |
8 Threads | 0.369961981 | 0.022176926 |
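As a rough illustration of the shared address space approach, below is a minimal sketch of a single proposal round; the function and variable names are hypothetical and the code simplifies our actual implementation. All currently free men propose in parallel, and a per-woman lock makes each woman's keep-the-best-suitor update atomic.

```cpp
#include <omp.h>
#include <vector>

// Sketch of one parallel proposal round (names hypothetical). The caller
// repeatedly rebuilds free_men from match_m and calls this until no man is
// free. Each lock in `locks` must be initialized with omp_init_lock; a
// per-woman lock serializes updates so each woman always holds her
// most-preferred suitor seen so far.
void proposal_round(const std::vector<std::vector<int>>& prefs_m,
                    const std::vector<std::vector<int>>& rank_w,
                    std::vector<int>& match_w,         // woman -> man, -1 if free
                    std::vector<int>& match_m,         // man -> woman, -1 if free
                    std::vector<int>& next_proposal,   // per-man index into his list
                    const std::vector<int>& free_men,  // men proposing this round
                    std::vector<omp_lock_t>& locks) {  // one lock per woman
    #pragma omp parallel for schedule(dynamic, 64)
    for (long i = 0; i < (long)free_men.size(); i++) {
        int m = free_men[i];
        int w = prefs_m[m][next_proposal[m]++];
        omp_set_lock(&locks[w]);
        int rival = match_w[w];
        if (rival == -1 || rank_w[w][m] < rank_w[w][rival]) {
            match_w[w] = m;
            match_m[m] = w;
            if (rival != -1) match_m[rival] = -1;  // dumped man is free again
        }
        omp_unset_lock(&locks[w]);
        // If w kept her current partner, m simply stays free for next round.
    }
}
```

This round-based structure differs from the one-proposal-at-a-time sequential loop, which is one reason the sequential and parallel timings are not directly comparable.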
Similarly, we have the results for the Open MPI message passing version on the same test cases. Since MPI parallelism comes from processes rather than threads, the rows below are labeled by process count. (A sketch of the per-round proposal exchange appears after the tables.)
n=1000 | initialization time (s) | computation time (s) |
---|---|---|
1 Process | 0.365146032 | 0.004490897 |
2 Processes | 0.3604769875 | 0.003742539 |
4 Processes | 0.3132831853 | 0.00417506475 |
8 Processes | 0.3411113851 | 0.006089136625 |
n=2000 | initialization time (s) | computation time (s) |
---|---|---|
1 Process | 0.646786017 | 0.021756559 |
2 Processes | 0.64432722 | 0.0192913835 |
4 Processes | 0.606660147 | 0.022002529 |
8 Processes | 0.6338493458 | 0.03287053375 |
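For context, the core communication step in a message passing formulation of each round is routing proposals to whichever rank owns the target woman and routing accept/reject decisions back. Below is a hedged sketch of that exchange using MPI_Alltoallv; the data layout and names are hypothetical rather than a verbatim excerpt of our code.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

// Sketch of one round's proposal exchange (hypothetical layout). Men are
// partitioned across ranks, and rank r also "owns" a contiguous block of
// women. Each rank packs the proposals its free men want to make as
// flattened (man, woman) int pairs, grouped by destination rank, then
// routes them with MPI_Alltoallv. The same routine can be reused in the
// reverse direction to return accept/reject decisions.
void exchange_proposals(const std::vector<std::vector<int>>& outgoing, // per-dest flattened pairs
                        std::vector<int>& incoming,                    // received flattened pairs
                        MPI_Comm comm) {
    int size; MPI_Comm_size(comm, &size);
    std::vector<int> send_counts(size), recv_counts(size);
    for (int r = 0; r < size; r++) send_counts[r] = (int)outgoing[r].size();

    // First tell every rank how many ints to expect from us.
    MPI_Alltoall(send_counts.data(), 1, MPI_INT,
                 recv_counts.data(), 1, MPI_INT, comm);

    std::vector<int> sdispls(size, 0), rdispls(size, 0);
    for (int r = 1; r < size; r++) {
        sdispls[r] = sdispls[r - 1] + send_counts[r - 1];
        rdispls[r] = rdispls[r - 1] + recv_counts[r - 1];
    }
    std::vector<int> send_buf(sdispls[size - 1] + send_counts[size - 1]);
    for (int r = 0; r < size; r++)
        std::copy(outgoing[r].begin(), outgoing[r].end(),
                  send_buf.begin() + sdispls[r]);

    incoming.resize(rdispls[size - 1] + recv_counts[size - 1]);
    MPI_Alltoallv(send_buf.data(), send_counts.data(), sdispls.data(), MPI_INT,
                  incoming.data(), recv_counts.data(), rdispls.data(), MPI_INT,
                  comm);
}
```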
There is not much speedup from one thread or process to eight; in fact, computation time sometimes increases slightly. We also noticed that the computation times of our parallel implementations are larger than those of the sequential algorithm. We believe this is because the parallel versions incur much more overhead (synchronization in OpenMP, communication in MPI), and our datasets are not big enough to amortize it. The differences could also stem from the fact that we adapted the Gale-Shapley algorithm differently to fit each method of parallelization. We hope to see clearer trends with larger datasets.
Currently, as we try to generate larger datasets to test our implementations, we exceed our disk quota. We have deleted extra files from our AFS space, yet the largest dataset we can currently test on is n = 2000 with full preference lists. This is not large enough to compare to real applications: for example, the Gale-Shapley algorithm is used to match medical students to residencies, where around 52,000 students need to be matched. Because we cannot generate such large datasets, we are concerned that the extra overhead will prevent us from seeing the benefits of parallelization.