Kurnosov M.G.   Berlizov D.   Tkacheva T.   Tokmasheva E.I.  

Analysis and Optimization of Pipelined Broadcast Algorithms on Gigabit Ethernet and InfiniBand Networks

Reporter: Kurnosov M.G.

The Message Passing Interface (MPI) is the de-facto standard for programming computer systems with distributed memory. The first version of the MPI standard was delivered in 1994 as a joint effort of academia and industry. An important subset of the MPI standard are collective operations used to exchange the information among a group of processes: Allgather, Allreduce, Alltoall, Barrier, Broadcast, Gather, Exscan, Reduce, Scan, Scatter. The profiling studies of parallel applications showed that the collective operations are commonly used and that they could be an application performance bottleneck. Moving towards exascale systems (million processor cores), the time spent in collectives only increases. Performance and scalability of HPC applications require efficient and scalable collective operations.
    An important aspect of collective algorithm optimizations is analyzing the algorithm performance in terms of different parallel communication models: Hockney (latency-bandwidth model), LogP/LogGP, BSP. In this paper we describe scalable algorithms for implementing MPI_Bcast collective operation: k-ary tree, k-nomial tree, linear pipeline and pipelined binary tree. Algorithms were investigated according to their implementation in the Open MPI library.
When a collective operation is called, an MPI library dynamically determines which of the available broadcast algorithms should be used for a given MPI communicator (number of processes) and operation parameters (message size, data type, root process). The current implementations of MPICH, MVAPICH and Open MPI libraries use fixed algorithm selection rules based on the message size or/and number of processes. A significant part of MPI_Bcast algorithms supports tunable parameters that affect the total execution time of the operation. In this paper we did a theoretical and experimental analysis of algorithms implementing the MPI_Bcast operation and determined in latency-bandwidth model the optimal values of tunable algorithm parameters – tree degree, and segment size in pipelined versions of algorithms. In contrast to previous studies, algorithms execution time analysis in this paper is not constrained to asymptotic complexity estimations, the focus was also on the constants for the α and β parameters in the Hockney model. Theoretical results are consistent with experiments on a computer clusters with Gigabit Ethernet and InfiniBand communication networks.


To reports list