Optimizing MPIs for multi-cores : Thoughts
>> Sunday, June 28, 2009
MPI is still the programming model for most of the scientific applications which runs on super computers. Yes one could argue that map-reduce, dryad style models are coming up, but they are good for embarrassingly parallel apps or they don't exploit the fast interconnects within nodes. Anyway its a different debate.
When we look at current scientific apps running on super computers, they are already parallelized, exploiting most of the MPI constructs. MPI programming model encourages programmers to send messages and pass data around the network. But the problem with MPI applications is that, when we increase number of CPUs, we sometimes get diminishing returns (meaning anti-scaling) , and at one point it starts to give negative effect reducing the perfomance. The reason for this is that due to the increase amount of data being transmitted in the network, with the increase number of CPUs, the communication overhead exceeds actual computation and also saturate the network.
Recently I got to talk with scientists from three different domains and all of them were whining about the network saturation. The naive solutions to this problem are
- change their implementation algorithms to reduce the communications (oops, I thought we wanted to exploit the fast interconnects)
- optimize the MPI usage by using proper MPI constructs . For example using MPI_Reduce rather than using MPI_send to single node and it reducing it and doing MPI_Bcast. Remember MPI_Reduce exploits the network topology and do a hierarchical reduction.
(I know this idea is not new [1], [2], but it was very exciting to me still especially with some of my current experiences with it. )
Let's say we have an 8-core 100 node cluster (when I first get to use 2000 core clusters, yeah I was nervous and excited too :) ). This will enable us to run a job on 800-cores and make all these 800-cores banging the network. But rather than scheduling it at the core level, how about scheduling at node-level for 100 nodes and using SMP, with 8 threads each, within a node (same 800-processes)? This will cut down the network traffic, at most 8 times (the number of cores), depending on the application. Not only that, we might see an increase on performance as within each node, 8 processes are on the same address space and communication among them will be much much faster.
Exploiting this requires a change in the application too. One has to use the main thread, in each node, for MPI communications and carefully manage multi-threads within it. The challenges with this hybrid approach includes, but not limited to
- Handling multi-threads. Yeah we all know how cumbersome it is
- With some implementations, only the main thread is allowed to do MPI calls. So there can be a slight bottle-neck if the threads try to connect to processes that are not on the same address space.
- Main thread has to manage all the threads, divide the work/data among the threads, aggregate them before sending to other nodes, control intra-node communication, etc.,
- Debugging. Believe me, its very hard and painful.
What I like to have is something like in DryadLINQ, where users will write on a bit higher abstraction and the framework will optimize based on the network topology and machine configuration. But I guess we will have to wait till the next generation of programming models/frameworks to see things like that. Read more...