High-Performance Computing Research Group

Experimental Analysis of Application Scalability for the Xeon Phi

Summary:

The scalability of chip multiprocessors (CMPs) such as the Intel Xeon Phi depends to a large extent on the design of efficient and scalable cache coherence protocols. Current cache coherence protocols use invalidate schemes that are known to generate a high number of coherence misses. Some important basic linear algebra solvers have data layouts that require frequent alternation of block ownership among cores. These data sharing patterns trigger a massive number of snoops that may become a performance bottleneck as the number of cores increases. Adaptive hybrid protocols have been proposed for the Tile64 to reduce coherence misses in write-invalidate protocols, which may substantially reduce cache-to-cache sharing misses. The Xeon Phi and Tile64 propose non-cache-coherent models to avoid the impact of coherence overheads on scalability. A partitioned global address space (PGAS) offers a global, logically shared memory space for all threads, with locality awareness and one-sided communication constructs.

We will experimentally analyze the shared data access patterns under the PGAS scheme for a set of basic linear algebra solvers (BLAS and MAGMA) having different memory layouts on the Xeon Phi coprocessor. We will compare the obtained throughput to that obtained on the Tile64 and on GPUs. Potential limitations due to producer-consumer interdependencies will be analyzed together with their impact on the scalability of parallel programs. We will develop a micro-benchmark that alternates data block ownership among various cores and across all memory levels to characterize the effect of thread interdependencies on overall scalability.
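
As a rough illustration of why alternating block ownership is costly under a write-invalidate scheme, the sketch below models a single memory block with simplified MSI-style states and counts the misses caused by invalidations. It is a toy model written for this page, not the planned micro-benchmark: the function names (`count_coherence_misses`, `ping_pong`), the three-state protocol, and the core and round counts are all assumptions chosen for illustration, and no timing or snoop latency is modeled.

```python
INVALID, SHARED, MODIFIED = "I", "S", "M"


def count_coherence_misses(accesses, num_cores):
    """Toy write-invalidate model for ONE memory block (assumed MSI-like states).

    accesses: iterable of (core, op) pairs, op is "r" (read) or "w" (write).
    Returns (cold_misses, coherence_misses, invalidations).
    """
    state = [INVALID] * num_cores        # per-core state of the block
    ever_cached = [False] * num_cores    # distinguishes cold from coherence misses
    cold = coherence = invalidations = 0

    for core, op in accesses:
        if state[core] == INVALID:
            # A miss is a coherence miss if this core held the block before
            # and lost it to an invalidation; otherwise it is a cold miss.
            if ever_cached[core]:
                coherence += 1
            else:
                cold += 1
            ever_cached[core] = True

        if op == "w":
            # Writing requires exclusive ownership: invalidate all other copies.
            for other in range(num_cores):
                if other != core and state[other] != INVALID:
                    state[other] = INVALID
                    invalidations += 1
            state[core] = MODIFIED
        else:
            # Reading downgrades a remote modified copy to shared.
            for other in range(num_cores):
                if other != core and state[other] == MODIFIED:
                    state[other] = SHARED
            if state[core] == INVALID:
                state[core] = SHARED

    return cold, coherence, invalidations


def ping_pong(num_cores, rounds):
    """Access trace in which the cores take turns writing the same block."""
    return [(core, "w") for _ in range(rounds) for core in range(num_cores)]


# Ownership alternates among 4 cores for 10 rounds (40 writes in total).
print(count_coherence_misses(ping_pong(4, 10), 4))   # (4, 36, 39)
# The same 40 writes issued by a single core.
print(count_coherence_misses([(0, "w")] * 40, 4))    # (1, 0, 0)
```

In this model, four cores that take turns writing the block incur a coherence miss on every write after their first one, while a single core issuing the same number of writes incurs only its initial cold miss.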

Previous Research Work

Commodity multi-core SMPs generate an enormous amount of coherence traffic. Inter-core snooping traffic may limit the scalability of SMPs in addition to increasing power dissipation. Snoop filtering was proposed to reduce unnecessary coherence actions. However, the scalability of parallel programs with frequent changes in data sharing status remains limited even when snoop requests from all remote cores are processed simultaneously. We analyze four typical computational programs having different shared-data scenarios: solving a system of linear equations, ocean simulation, bucket sorting, and alternating direction integration. These computations are analyzed with respect to their organization, recurrences, and shared data access patterns. For each case, we derive an OpenMP execution model to emphasize changes in shared data status and to ease the design of the parallel algorithm. In the evaluation, we plot the speedup obtained versus the number of cores and problem size when using one SMP with 8 cores. We discuss potential limitations on scalability versus loop recurrences and variations in data access patterns such as row-major and column-major. We also study the effects on scalability of intra- and inter-processor data swapping among the cores within a 4-core processor and across processors in the SMP.
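
The row-major versus column-major effect mentioned above can be illustrated by counting how many cache lines end up being written by more than one core under each work partitioning. The sketch below is a simplified model, not the evaluation code from the cited work: it assumes a square matrix stored in row-major order, a contiguous block partition of rows or columns, and a hypothetical line size of 8 elements. The function name `shared_lines` is made up for this example.

```python
def shared_lines(n, num_cores, line_elems, by):
    """Count cache lines of an n x n row-major matrix written by more than one core.

    by="rows": core k updates a contiguous block of rows.
    by="cols": core k updates a contiguous block of columns.
    Assumes num_cores divides n and that the matrix starts on a line boundary.
    """
    if by not in ("rows", "cols"):
        raise ValueError("by must be 'rows' or 'cols'")
    if n % num_cores != 0:
        raise ValueError("num_cores must divide n in this simplified model")

    block = n // num_cores
    writers = {}  # cache line index -> set of cores that write to it
    for i in range(n):
        for j in range(n):
            owner = (i if by == "rows" else j) // block
            line = (i * n + j) // line_elems   # row-major element offset
            writers.setdefault(line, set()).add(owner)

    return sum(1 for cores in writers.values() if len(cores) > 1)


# 16 x 16 matrix, 4 cores, 8 elements per cache line (32 lines in total).
print(shared_lines(16, 4, 8, "rows"))   # 0
print(shared_lines(16, 4, 8, "cols"))   # 32
```

With these assumed parameters, partitioning by rows leaves every line private to one core, whereas partitioning by columns makes every line a candidate for ownership alternation between two cores.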

Cooperation:

Our team consists of Dr. Mayez Al-Mouhamed, Mr. Allam Fatayer (COE M.Sc. student), and Mr. Khaled Daud (ITC-KFUPM). The team cooperates with the KAUST group on the Xeon Phi.

Publications

Mayez A. Al-Mouhamed and Khaled A. Daud, Experimental Analysis of SMP Scalability in the Presence of Coherence Traffic and Snoop Filtering, IEEE 14th International Conference on High Performance Computing and Communications (HPCC), Liverpool, United Kingdom, May 2012.
Mayez A. Al-Mouhamed and Khaled A. Daud, Analysis of Scalability in the Presence of Coherence Traffic and Snoop Filtering in Embedded SMPs, International Journal of High Performance Computing Applications, June 2013.