Date Awarded

2021

Document Type

Dissertation

Degree Name

Doctor of Philosophy (Ph.D.)

Department

Computer Science

Advisor

Xu Liu

Committee Member

Qun Li

Committee Member

Weizhen Mao

Committee Member

Bin Ren

Committee Member

Ang Li

Abstract

Heterogeneous architectures have become popular due to their programming flexibility and energy efficiency. They include GPU accelerators and memory subsystems consisting of fast and slow components. These architectures either lack hardware support for the fast memory component or provide a complex programming model, which places an extra burden on compilers and programmers. Achieving high performance for programs running on heterogeneous architectures therefore requires sophisticated tools and applications. However, existing tools either rely on simulators or lack support across different GPU architectures and runtime or driver versions, so they provide only limited insight. In the first project, we develop DataPlacer, a profiling tool that provides guidance for data placement. We characterize a real heterogeneous system, the TI KeyStone II, whose memory system consists of fast and slow components and whose fast memory lacks hardware support. We develop a set of parallel benchmarks to characterize the performance and power efficiency of heterogeneous architectures. DataPlacer analyzes memory access patterns and provides high-level feedback at the source-code level for optimization. We apply the data placement optimization to our benchmarks and evaluate the effectiveness of heterogeneous memory (HM) in boosting performance (11X speedup) and saving energy (50% reduction in energy consumption). In the second project, we present CUDAAdvisor, a profiling framework to guide code optimization for modern NVIDIA GPUs. General-purpose GPUs have been widely used to accelerate parallel applications. Given a relatively complex programming model and fast architecture evolution, producing efficient GPU code is nontrivial. CUDAAdvisor performs various fine-grained analyses based on the profiling results from GPU kernels, such as memory-level analysis (e.g., reuse distance and memory divergence), control-flow analysis (e.g., branch divergence), and code-/data-centric debugging.
CUDAAdvisor supports GPU profiling across different CUDA versions and architectures. We present several case studies that yield significant insights to guide GPU code optimization for performance improvement. In the third project, we present Presponse, a GPU-based incremental graph processing framework that aims to reduce response latency for large-scale graph queries. We first fill the gap left by the scarcity of incremental graph algorithms tailored for GPUs. Then, based on the key observation that graph evolution often follows patterns that can be accurately predicted, our framework speculatively preprocesses the graph during the idle period ahead of the actual graph update, significantly reducing response time. Experiments show that Presponse can predict over 90% of future graph updates, yielding up to a 25X speedup in graph query response latency.

DOI

http://dx.doi.org/10.21220/s2-f8gh-0967

Rights

© The Author
