Milestone Report
Rhea Kripalani and Lucy Wang
In the past two and a half weeks, we finalized the problem we are trying to parallelize and implemented two methods of parallelization: a shared address space among threads and message passing between processes.
Problem Statement: Given a full preference list for each man and woman (if there are 2*n participants, each participant has n people in their preference list), we want to write a parallelized Gale-Shapley algorithm that will create a stable pairing for each participant.
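For reference, here is a minimal C++ sketch of the sequential algorithm (the names and data layout are illustrative, not our exact code): each free man proposes to the best woman he has not yet tried, and each woman keeps her most-preferred suitor seen so far.

```cpp
#include <queue>
#include <vector>

// Minimal sketch of sequential Gale-Shapley with full preference lists.
// prefs_m[m][k] is man m's k-th choice; rank_w[w][m] is woman w's rank of
// man m (lower = more preferred). Returns match_w, where match_w[w] is the
// man matched with woman w.
std::vector<int> gale_shapley(const std::vector<std::vector<int>>& prefs_m,
                              const std::vector<std::vector<int>>& rank_w) {
    int n = (int)prefs_m.size();
    std::vector<int> match_w(n, -1);       // woman -> man, -1 while unmatched
    std::vector<int> next_proposal(n, 0);  // next index into each man's list
    std::queue<int> free_men;
    for (int m = 0; m < n; m++) free_men.push(m);

    while (!free_men.empty()) {
        int m = free_men.front(); free_men.pop();
        int w = prefs_m[m][next_proposal[m]++];   // best woman m has not tried
        if (match_w[w] == -1) {
            match_w[w] = m;                        // w was free: she accepts
        } else if (rank_w[w][m] < rank_w[w][match_w[w]]) {
            free_men.push(match_w[w]);             // w trades up; old partner freed
            match_w[w] = m;
        } else {
            free_men.push(m);                      // w rejects m; he tries again
        }
    }
    return match_w;
}
```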
We went back and forth on whether to require full preference lists after coming across a GitHub project that generates data for the Gale-Shapley algorithm and lets the user specify how long the preference lists in the dataset should be. When we generated data with preference lists shorter than the number of men or women, our algorithm could not find a stable match for everyone; this is expected, since with incomplete preference lists a stable matching can leave some participants unmatched. We therefore decided to stick with full preference lists for now, which is also how most real applications work.
Our original plan was to implement the PII and PII-SC algorithms as defined in this poster, but we realized that these algorithms are much more complex than we previously thought. We spent a week trying to implement and debug the PII algorithm, but we could not get it to converge to stable pairings for everyone. To avoid spending all of our time debugging one algorithm and potentially having nothing to write about in our report, we pivoted to a different parallelization of the algorithm and implemented it with both a shared address space and message passing. Our next goal is to implement Gale-Shapley with CUDA.
In summary, we had to move some of our goals around. Specifically, we moved the PII and PII-SC algorithm deliverable to the “nice to haves” section and replaced it with the shared address space, message passing, and CUDA implementations. We should be able to complete our new set of goals, and depending on how long the CUDA implementation takes, we may also achieve one of our “nice to have” deliverables.
Our goal is to show a graph of the speedups of our different implementations and to discuss the differences and benefits of each type of parallelization as applied to the Gale-Shapley algorithm. We will also describe how we completed our parallel implementations and discuss future implementations or improvements we would pursue if we had more time.
Sequential Gale-Shapley Algorithm:
dataset | initialization time (s) | computation time (s) |
---|---|---|
master/100_100 | 0.012341196 | 0.000221127 |
random/100_100 | 0.001781225 | 0.000025335 |
shuffle/100_100 | 0.001931884 | 0.000029665 |
master/500_500 | 0.023034474 | 0.000691897 |
random/500_500 | 0.023155006 | 0.00072703 |
shuffle/500_500 | 0.022265412 | 0.000586448 |
master/1000_1000 | 0.085145133 | 0.003029481 |
random/1000_1000 | 0.085594264 | 0.002465578 |
shuffle/1000_1000 | 0.084825035 | 0.002196202 |
2000_2000 | 0.370548774 | 0.009627199 |
We also collected initialization and computation times for the OpenMP shared address space version on all of the test cases listed above; for brevity, we include only the n = 1000 and n = 2000 results here. (A sketch of the proposal-round parallelization appears after the tables.)
n=1000 | initialization time (s) | computation time (s) |
---|---|---|
1 Thread | 0.096207367 | 0.006100127 |
2 Threads | 0.085109435 | 0.005300165 |
4 Threads | 0.086066042 | 0.005146181 |
8 Threads | 0.084503596 | 0.006054297 |
n=2000 | initialization time (s) | computation time (s) |
---|---|---|
1 Thread | 0.387654479 | 0.026912896 |
2 Threads | 0.370879461 | 0.021700406 |
4 Threads | 0.369560469 | 0.019854909 |
8 Threads | 0.369961981 | 0.022176926 |
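As a rough illustration of the shared address space approach, below is a minimal sketch of a single proposal round; the function and variable names are hypothetical and the code simplifies our actual implementation. All currently free men propose in parallel, and a per-woman lock makes each woman's keep-the-best-suitor update atomic.

```cpp
#include <omp.h>
#include <vector>

// Sketch of one parallel proposal round (names hypothetical). The caller
// repeatedly rebuilds free_men from match_m and calls this until no man is
// free. Each lock in `locks` must be initialized with omp_init_lock; a
// per-woman lock serializes updates so each woman always holds her
// most-preferred suitor seen so far.
void proposal_round(const std::vector<std::vector<int>>& prefs_m,
                    const std::vector<std::vector<int>>& rank_w,
                    std::vector<int>& match_w,         // woman -> man, -1 if free
                    std::vector<int>& match_m,         // man -> woman, -1 if free
                    std::vector<int>& next_proposal,   // per-man index into his list
                    const std::vector<int>& free_men,  // men proposing this round
                    std::vector<omp_lock_t>& locks) {  // one lock per woman
    #pragma omp parallel for schedule(dynamic, 64)
    for (long i = 0; i < (long)free_men.size(); i++) {
        int m = free_men[i];
        int w = prefs_m[m][next_proposal[m]++];
        omp_set_lock(&locks[w]);
        int rival = match_w[w];
        if (rival == -1 || rank_w[w][m] < rank_w[w][rival]) {
            match_w[w] = m;
            match_m[m] = w;
            if (rival != -1) match_m[rival] = -1;  // dumped man is free again
        }
        omp_unset_lock(&locks[w]);
        // If w kept her current partner, m simply stays free for next round.
    }
}
```

This round-based structure differs from the one-proposal-at-a-time sequential loop, which is one reason the sequential and parallel timings are not directly comparable.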
Similarly, we have the results for the Open MPI message passing version on the same test cases. Since MPI parallelism comes from processes rather than threads, the rows below are labeled by process count. (A sketch of the per-round proposal exchange appears after the tables.)
n=1000 | initialization time (s) | computation time (s) |
---|---|---|
1 Process | 0.365146032 | 0.004490897 |
2 Processes | 0.3604769875 | 0.003742539 |
4 Processes | 0.3132831853 | 0.00417506475 |
8 Processes | 0.3411113851 | 0.006089136625 |
n=2000 | initialization time (s) | computation time (s) |
---|---|---|
1 Process | 0.646786017 | 0.021756559 |
2 Processes | 0.64432722 | 0.0192913835 |
4 Processes | 0.606660147 | 0.022002529 |
8 Processes | 0.6338493458 | 0.03287053375 |
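For context, the core communication step in a message passing formulation of each round is routing proposals to whichever rank owns the target woman and routing accept/reject decisions back. Below is a hedged sketch of that exchange using MPI_Alltoallv; the data layout and names are hypothetical rather than a verbatim excerpt of our code.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

// Sketch of one round's proposal exchange (hypothetical layout). Men are
// partitioned across ranks, and rank r also "owns" a contiguous block of
// women. Each rank packs the proposals its free men want to make as
// flattened (man, woman) int pairs, grouped by destination rank, then
// routes them with MPI_Alltoallv. The same routine can be reused in the
// reverse direction to return accept/reject decisions.
void exchange_proposals(const std::vector<std::vector<int>>& outgoing, // per-dest flattened pairs
                        std::vector<int>& incoming,                    // received flattened pairs
                        MPI_Comm comm) {
    int size; MPI_Comm_size(comm, &size);
    std::vector<int> send_counts(size), recv_counts(size);
    for (int r = 0; r < size; r++) send_counts[r] = (int)outgoing[r].size();

    // First tell every rank how many ints to expect from us.
    MPI_Alltoall(send_counts.data(), 1, MPI_INT,
                 recv_counts.data(), 1, MPI_INT, comm);

    std::vector<int> sdispls(size, 0), rdispls(size, 0);
    for (int r = 1; r < size; r++) {
        sdispls[r] = sdispls[r - 1] + send_counts[r - 1];
        rdispls[r] = rdispls[r - 1] + recv_counts[r - 1];
    }
    std::vector<int> send_buf(sdispls[size - 1] + send_counts[size - 1]);
    for (int r = 0; r < size; r++)
        std::copy(outgoing[r].begin(), outgoing[r].end(),
                  send_buf.begin() + sdispls[r]);

    incoming.resize(rdispls[size - 1] + recv_counts[size - 1]);
    MPI_Alltoallv(send_buf.data(), send_counts.data(), sdispls.data(), MPI_INT,
                  incoming.data(), recv_counts.data(), rdispls.data(), MPI_INT,
                  comm);
}
```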
There is not much speedup from one thread or process to eight; in fact, computation time sometimes increases slightly. We also noticed that the computation times of our parallel implementations are larger than those of the sequential algorithm. We believe this is because the parallel versions incur much more overhead (synchronization in OpenMP, communication in MPI), and our datasets are not big enough to amortize it. The differences could also stem from the fact that we adapted the Gale-Shapley algorithm differently to fit each method of parallelization. We hope to see clearer trends with larger datasets.
Currently, as we try to generate larger datasets to test our implementations, we exceed our disk quota. We have deleted extra files from our AFS space, yet the largest dataset we can currently test on is n = 2000 with full preference lists. This is not large enough to compare to real applications: for example, the Gale-Shapley algorithm is used to match medical students to residencies, where around 52,000 students need to be matched. Because we cannot generate such large datasets, we are concerned that the extra overhead will prevent us from seeing the benefits of parallelization.