Monday, August 28, 2006

Ragging on RDMA

Last week, HPCwire posted an in-depth technical article by Patrick Geoffray of Myricom Inc. in which he claims that RDMA brings little to the table for high-performance system interconnects beyond what is already available through the message passing interface (MPI) approach that Myricom and a handful of other clustering specialists employ.

See http://www.hpcwire.com/hpc/815242.html

His history is a little revisionist. For example, the Virtual Interface Architecture developed in the PC world did not fade away, as he says. It was subsumed into InfiniBand, where many of its concepts were further refined and became a launching pad for RDMA.

The article makes the point that there are many ways to skin a cat, and that an engineer has to do some heavy lifting to get RDMA to work well with MPI and the Sockets interface. Good points. RDMA has its shortcomings, and the industry needs to keep working on them.

In the end, however, RDMA is the industry standards effort trying to take high-performance interconnects into the mainstream, for HPC and beyond. See http://www.rdmaconsortium.org/home. The MPI interface that is the basis for products from Myricom and others will likely not see much action outside the fairly narrow confines of the HPC market. Geoffray's critique bears some of the signs of a technologist swimming against the tide of history.

--rbm

1 comment:

Mike said...

For any marketing article to be effective, it must be founded on facts (even if only a subset) in order to convince the reader that the conclusions drawn are the correct ones. How effective the piece is then becomes a matter of the spin provided, and of how closely the spin parallels just enough of the facts that the reader buys it all. This critique is an effective marketing piece, with just enough facts presented to create FUD; but in the end it is marketing, an opinion, and any reader should be cautious in accepting or rejecting any or all of the opinions expressed. I will state up front that this is just an opinion from someone who has created a number of the technologies within the industry, including the RDMA ones under discussion here. Hence, the reader should in no way blindly accept anything I state here, but should take some time to evaluate, do some research, and think carefully about any conclusions drawn. With that in mind, and in an attempt to keep this from being too verbose, let's examine some aspects of this opinion piece. This isn't an exhaustive set of points, but some key ones to keep in mind.

The article claims that RDMA is a poor technology choice because it requires memory registration. Well, for any I/O DMA operation to occur (and this generally applies to any I/O device typically found in a modern server), the memory must be pinned (i.e. locked in memory) and mapped (i.e. a virtual-to-physical translation must be obtained to enable the I/O device to issue PCI transactions). Hmmm, last I checked that is what memory registration is - the pinning and mapping of memory and the communication of the translations to the I/O device. What RDMA technologies such as iWARP and InfiniBand do is simply formalize the registration operation, so that hardware and OS vendors are provided with consistent semantics across all implementation options, in order to simplify product integration and solution delivery. It is not clear how this can be construed as a poor technology choice, but let's move on.
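As a concrete illustration, here is a minimal sketch of what registration looks like through the libibverbs API (a protection domain is assumed to exist already; error handling is abbreviated, and this is illustrative rather than anyone's production code):

/* Registration = pin + map + hand the translations to the device. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);
    if (!buf)
        return NULL;

    /* ibv_reg_mr() pins the pages and gives the adapter the
     * virtual-to-physical translations; the access flags say which
     * operations (local and remote) may target this region. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }

    /* mr->lkey and mr->rkey are the handles later work requests use. */
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);
    return mr;
}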

The article claims that memory registration is expensive as a function of the amount of memory being registered. Well, depending upon the physical page size used (the world is not locked into 4KB pages only, since page size has implications for the processor TLB as well) and, of course, upon the implementation being measured, the performance will certainly vary, as shown by the one data set presented - hence a fact. However, not all implementations are the same, and not all applications use the more expensive memory registration that was measured. Some applications use the fast registration technique developed for iWARP and then ported back to InfiniBand. Further, some applications register memory once at the beginning of time and therefore amortize the cost across seconds or hours or even days of execution (something quite common in MPI or enterprise applications), so the cost of memory registration is very much in the noise. No such data or clarifications are provided in the article, as that might subtract from the conclusions drawn.
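To make the amortization point concrete, here is a sketch of the register-once pattern (the function names are hypothetical and error handling is omitted):

/* Pay the pinning/mapping cost a single time at startup, then reuse
 * the registration for the life of the job, so the per-message cost
 * amortizes toward zero. */
#include <infiniband/verbs.h>
#include <stdlib.h>

static struct ibv_mr *pool_mr;   /* registered once, used forever */
static void          *pool_buf;

void comm_init(struct ibv_pd *pd, size_t pool_len)
{
    pool_buf = malloc(pool_len);
    /* One expensive registration at the beginning of time... */
    pool_mr = ibv_reg_mr(pd, pool_buf, pool_len,
                         IBV_ACCESS_LOCAL_WRITE);
}

void comm_send(size_t offset, size_t len)
{
    /* ...then every transfer just references pool_mr->lkey; no
     * further pinning or mapping happens on the data path. */
    (void)offset;
    (void)len;
    /* post a work request using pool_mr->lkey here */
}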

The article claims that memory consumption is horrible with RDMA. Well, to start, such a conclusion is based on only a partial examination of the technology. Certainly, if each process or thread of execution maintains per-connection or per-endpoint buffers, then the aggregate memory consumed will increase - but this is no different from the traditional network stacks implemented in all major OS offerings. However, many applications, whether in user space or kernel space, use shared memory. So, if we expand the focus beyond the narrow examples shown in the article, the reader can find any number of usage models where the total memory consumed is often constant, or in some cases less than expected. It all boils down to application design and implementation (no technology can completely compensate for a poor design or implementation).
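A back-of-the-envelope comparison illustrates the scaling difference; the numbers below are hypothetical, chosen only to show the shape of the curve:

/* Per-connection buffering grows with the number of peers; a shared
 * pool does not. All figures are illustrative. */
#include <stdio.h>

int main(void)
{
    long procs         = 1024;      /* ranks in the job            */
    long peers         = procs - 1; /* fully connected             */
    long bufs_per_conn = 8;         /* preposted receives per peer */
    long buf_bytes     = 64 * 1024; /* 64 KiB each                 */

    long per_conn = peers * bufs_per_conn * buf_bytes;
    long shared   = 1024 * buf_bytes;  /* one pool serving all peers */

    printf("per-connection buffering: %ld MiB per process\n",
           per_conn >> 20);   /* ~511 MiB */
    printf("shared pool:              %ld MiB per process\n",
           shared >> 20);     /* 64 MiB, regardless of peer count */
    return 0;
}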

The article claims that the shared receive queue (SRQ) was a response to memory consumption issues. Well, this is partially true, but not due to the alleged epiphany in the article. During the development of iSER, as well as a close examination of other storage technologies such as Fibre Channel, careful attention was paid to the impacts on the storage controller as well as on the buffer cache designs common in distributed or cluster file systems. The RDMA architects spent a considerable amount of time and analysis comprehending the impacts of the technology on real-world applications, and eventually developed SRQ for iWARP, which was then ported back to InfiniBand (in fact, key learnings from real RDMA implementations drove the evolution of RDMA through its various incarnations to what is defined today in iWARP and InfiniBand). The SRQ model is therefore predicated on a shared memory model (hence a single memory registration can be used to reduce cost), combined with how storage works: a mix of control messages (communicated over the SRQ resources) and data placement (which can be communicated via RDMA).
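For readers unfamiliar with the mechanism, here is a minimal libibverbs sketch of an SRQ and a queue pair that draws its receives from it (queue sizes are illustrative; error handling is omitted):

/* One SRQ supplies receive buffers to many connections, so receive
 * memory scales with offered load rather than with connection count. */
#include <infiniband/verbs.h>

struct ibv_srq *make_srq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr attr = {
        .attr = {
            .max_wr  = 4096,  /* receives shared across all QPs */
            .max_sge = 1,
        },
    };
    return ibv_create_srq(pd, &attr);
}

struct ibv_qp *make_qp_on_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                              struct ibv_srq *srq)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .srq     = srq,   /* this connection consumes from the SRQ */
        .cap     = { .max_send_wr = 64, .max_send_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    return ibv_create_qp(pd, &attr);
}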

The article brings up a good metric for evaluating any technology: how much of the system's resources are left available for application use. I applaud this metric, as it is what customers are most interested in and are paying for at the end of the day. I won't debate it, but I will point out that the analysis is incomplete. It presents only the send / receive paradigm, as illustrated by a set of implementations, and ignores the RDMA paradigm. RDMA is focused on making a system more efficient by eliminating the need for the processor to be involved in data movement, and by enabling zero copy and OS bypass to reduce memory bandwidth consumption; hence it should leave more resources available for the application's execution. Perhaps there is no data readily available to compare and contrast, but any reader should take the time to understand the full impacts of RDMA technology.
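The one-sided operation at the heart of that argument looks roughly like this in libibverbs (a sketch that assumes a connected RC queue pair and a remote address/rkey pair exchanged out of band):

/* An RDMA Write: the adapter places data directly into the remote
 * application's registered buffer - no remote CPU involvement, no
 * intermediate copy, no receive posted on the far side. */
#include <infiniband/verbs.h>
#include <stdint.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, size_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode     = IBV_WR_RDMA_WRITE,
        .sg_list    = &sge,
        .num_sge    = 1,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma    = { .remote_addr = remote_addr, .rkey = rkey },
    };
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}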

The article brings up an interesting point about micro-benchmarks and illustrates a couple of data points. Well, I've never been a fan of micro-benchmarks; at best they are a guide to what is possible, and they are often inaccurate depending on whether or not they reflect the application (benchmarks are always a work in progress). In any case, if one is going to cite micro-benchmarks and keep to the facts, one should at least use the best numbers available - but then again, this is marketing. Any reader should take the time to discuss these benchmarks with their interconnect or solution provider. The numbers shown here are far from the best that marketing can provide - quite far, in fact - and given that things don't stand still, they are way off from next year's offerings.
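For reference, most of the latency figures being traded here come from a simple ping-pong loop of roughly this shape (the transport calls below are stubs standing in for whatever interconnect is being measured):

/* Generic ping-pong micro-benchmark skeleton: time N round trips and
 * report half the average round trip as one-way latency. */
#include <stdio.h>
#include <time.h>

static void send_msg(size_t len) { (void)len; /* stub: real transport here */ }
static void recv_msg(size_t len) { (void)len; /* stub: real transport here */ }

static double pingpong_usec(size_t len, int iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        send_msg(len);   /* ping */
        recv_msg(len);   /* pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 +
                  (t1.tv_nsec - t0.tv_nsec) / 1e3;
    return usec / iters / 2.0;   /* half round trip = one-way */
}

int main(void)
{
    printf("0-byte one-way latency: %.2f us\n", pingpong_usec(0, 10000));
    return 0;
}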

Well, this response is getting way too long so my apologies. I will close with the following points:

- The opinions expressed in the article are interesting, but whether due to lack of space or other constraints, the article does not explore the space in sufficient detail to support the conclusions drawn; at least, that is my opinion.

- The author is from one of the most successful cluster interconnect providers in the industry to date. That alone makes the points raised in the article worth considering. However, any reader should keep in mind that competitors often put out articles whose tone seems objective but which in the end should be treated as marketing collateral, and should judge accordingly. Readers should remain skeptical and check the facts before deciding whether the opinions expressed are FUD or not.

- Examination of the top 500 computers in the world reveals that more and more of the top systems are based on RDMA interconnects. It is hard to believe that people would spend millions constructing such systems if there were not some ROI benefit at the end of the day. RDMA technology must be doing something right to see such growth, adoption, and success across this spectrum of usage models.

Again, the above is all an opinion so take it as you will.
