Date Awarded
2024
Document Type
Dissertation
Degree Name
Doctor of Philosophy (Ph.D.)
Department
Computer Science
Advisor
Bin Ren
Committee Member
Gang Zhou
Committee Member
Jie Ren
Committee Member
Michael Lewis
Committee Member
Jie Chen
Abstract
With the growing compute capacity of general-purpose Graphics Processing Units (GPUs), leveraging GPUs to accelerate parallel applications has become a critical topic in both academia and industry. However, a wide range of irregular applications with a computation- and memory-intensive nature cannot easily achieve high GPU utilization. The challenges mainly involve the following aspects: first, data dependence leads to coarse-grained kernels; second, heavy GPU memory usage may cause frequent memory evictions and extra I/O overhead; third, specific computation patterns produce memory redundancies; last, workload balance and data reusability jointly benefit overall performance, but there may be a dynamic trade-off between them. Targeting these challenges, this dissertation proposes multiple optimizations to accelerate parallel irregular applications on GPU architectures. The dissertation focuses on two real-world applications as case studies: one is calculating many-body correlation functions in a large-scale scientific system; the other is the eALS-based matrix factorization recommendation system. To accelerate the calculation of many-body correlation functions, this dissertation presents three frameworks for GPU memory management and multi-GPU scheduling. First, it presents MemHC, an optimized, systematic GPU memory management framework that employs a series of new memory reduction designs in GPU memory allocation, CPU/GPU communication, and GPU memory oversubscription. MemHC uses duplication-aware management and lazy release of GPU memory to the corresponding host management for better data reusability. Moreover, MemHC designs a novel eviction policy called Pre-Protected LRU (Least Recently Used) to reduce evictions and increase memory hits. Second, it presents MICCO, an enhanced multi-GPU scheduling framework that takes both the data dimension (e.g., data reuse and data eviction) and the computation dimension into account.
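The abstract does not detail how Pre-Protected LRU works; the following is a minimal sketch of the general idea, under the assumption that "pre-protected" entries (e.g., buffers known to be reused soon) are simply exempted from victim selection while unprotected entries follow ordinary LRU order. The class and method names are illustrative, not MemHC's actual API.

```python
from collections import OrderedDict

class PreProtectedLRU:
    """Sketch of a protection-aware LRU cache: eviction skips any entry
    currently marked as protected, choosing the least recently used
    unprotected entry as the victim instead."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # key -> value, kept in LRU order
        self.protected = set()       # keys temporarily exempt from eviction

    def protect(self, key):
        self.protected.add(key)

    def unprotect(self, key):
        self.protected.discard(key)

    def get(self, key):
        if key not in self.cache:
            return None              # cache miss
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.capacity:
            # Evict the least recently used *unprotected* entry.
            victim = next((k for k in self.cache if k not in self.protected), None)
            if victim is None:
                raise MemoryError("all entries protected; cannot evict")
            del self.cache[victim]
        self.cache[key] = value
```

Under this scheme, protecting a hot buffer prevents it from being evicted even when it falls to the LRU position, which is one way such a policy could reduce evictions and raise the hit rate.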
This work first performs a comprehensive study of the interplay between data reuse and load balance, and introduces two new concepts, the local reuse pattern and the reuse bound, to capture the optimal trade-off between them. Based on this study, MICCO designs a heuristic scheduling algorithm and a machine-learning-based regression model to generate the optimal setting of reuse bounds. Third, it presents a locality-aware multi-GPU scheduling framework. This scheduler leverages pipelined batch generation with a look-ahead strategy. It builds local dependency graphs based on locality analysis to reorganize input data for reduced memory transfers and better data reuse, achieving up to a 79.92% reduction in memory cost and a 1.67x speedup. To parallelize the eALS-based recommendation system, this dissertation proposes HEALS, an efficient CPU/GPU heterogeneous recommendation system. HEALS employs newly designed architecture-adaptive data formats to achieve load balance and good data locality on both the CPU and GPU. To mitigate data dependence, HEALS presents a CPU/GPU collaboration model for both task parallelism and data parallelism. Additionally, HEALS optimizes this collaboration model with kernel-communication overlapping and dynamic workload partitioning. HEALS also applies various kernel parallelization techniques for better GPU utilization: loop unrolling, vectorization, and warp reduction. In summary, this dissertation efficiently accelerates two typical irregular applications on GPUs by building four frameworks spanning CPU/GPU collaboration, GPU memory management, and multi-GPU scheduling.
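The abstract mentions dynamic workload partitioning between CPU and GPU but does not specify the rule. A common throughput-proportional scheme, sketched below under that assumption (the function name and rebalancing rule are illustrative, not HEALS's actual algorithm), resizes each processor's share of the next iteration so that both are expected to finish at the same time.

```python
def repartition(total_items, cpu_time, gpu_time, cpu_items, gpu_items):
    """Rebalance the CPU/GPU work split for the next iteration in
    proportion to each processor's measured per-item throughput."""
    cpu_rate = cpu_items / cpu_time   # items per second on the CPU
    gpu_rate = gpu_items / gpu_time   # items per second on the GPU
    cpu_share = cpu_rate / (cpu_rate + gpu_rate)
    next_cpu = round(total_items * cpu_share)
    return next_cpu, total_items - next_cpu
```

For example, if the CPU processed 200 items in 2 s while the GPU processed 800 items in 1 s, the CPU's share of the next 1000 items shrinks to roughly its fraction of the combined throughput, keeping both sides busy for a similar duration.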
DOI
https://dx.doi.org/10.21220/s2-n4fs-jk74
Rights
© The Author
Recommended Citation
Wang, Qihan, "Efficient Parallelization Of Irregular Applications On Gpu Architectures" (2024). Dissertations, Theses, and Masters Projects. William & Mary. Paper 1709301519.
https://dx.doi.org/10.21220/s2-n4fs-jk74