Thursday, February 14, 2008

Sun still shines on IB

I was humbled when Valley vet Andy Bechtolsheim sought me out after a Sun Microsystems press conference yesterday. I had just asked Sun's systems VP John Fowler a question about unified data center networking and he pledged to ship such products on Infiniband in 2008.

Bechtolsheim wanted to make sure I knew all the technical advantages of IB as a unified net (high bandwidth, lower latency, built-in resilience) and the downside of the FCoE effort (standards not done being the biggie).

It was interesting to hear Bechtolsheim point out weaknesses in his former employer (Cisco) that could trip it up in the push toward unified Ethernet nets, especially given he had made more than a few million when the networking giant bought his Ethernet startup in the boom days.

Just another example that technologies can look very different depending on where you sit at the moment. I expect Ethernet will become the unified net and that it will be slooooow getting there. I also expect Sun will be finding interesting technology arbitrage positions as long as savvy CEOs like Jonathan Schwartz can keep smart folks like Bechtolsheim on board.


Pat said...

A few comments:

* IB switches have a contention and bisection problem. They are source-routed, so HOL blocking cuts the bandwidth in half for random traffic. Sure, they will add congestion notification in the future, but so will Ethernet. Today's Ethernet switches (the expensive ones) are over-dimensioned to be truly non-blocking.

* What cables for IB QDR ? It's quad-pumped, it's not just adding lanes. Very short copper cables or expensive optics ? Ethernet has always waited for the right price. Link aggregation on 4 10GigE links is 40 Gbps effectively. IB QDR is only 32 Gbps. But it does not matter as long as PCIE-2 x8 is the reference server slot (barely 20 Gbps useful bandwidth).

* if you remove TCP, everything is faster. That is true on Ethernet as well. If you are lossless, you don't need TCP. Ethernet crossbar latency is down to 250 ns, but I suspect that people will build Ethernet switches with something else at the core.

* resilience: that's the funny one. IB does reliability at the hop-level, not end-to-end. Kill a switch while packets are flowing, see what happens.

Anyway, the main reasons why Ethernet will be the unified wire are:
* The Cisco Steamroller + Intel 10 GigE on the motherboard.
* Mellanox is the only real IB silicon vendor. People don't like monopolies that much.

Rick Merritt said...

Thanks for sharing this additional level of technical detail as well as your opinions. When it comes time for a ConnectX briefing, I will try to explore some of these issues further.

Anonymous said...

Nothing like shoot-from-the-hip analysis. Let's be fair here:

- IB completed a congestion management specification and it is being translated into h/w today.

- IEEE is developing a congestion management specification. It will be translated into h/w likely in 2010.

- IB has both per hop and end-to-end data recovery. The transport protocols are largely strongly ordered so the recovery of an intermediate hop failure can involve potentially a large amount of data retransmission.

- Ethernet does not have per hop or end-to-end data recovery. It indicates failure and relies upon higher level services such as TCP for the recovery. This is one of the reasons why many in the IEEE oppose the per priority pause mechanism since it is largely being done to support FCoE because FC isn't as robust as TCP in this regard.

- Ethernet choosing the right optics when the time is right is an interesting observation. A better way to say it is the optics providers were and to some extent remain dominated by the telecom focused companies so optics remain expensive. The failure of the optics companies to develop optics for the data center is the issue - optics are not a function of the layer 2 protocol making the problem separate from IB or Ethernet. Further, one can argue that the reason Ethernet has not pushed on optics volumes thus lowering the cost is the 10 GbE NIC providers have not achieved any real volumes leaving the volume dominated by ISL ports. Lot of other reasons but tossing that at either IB or Ethernet's feet is wrong.

- IB represents bandwidth as raw signaling rate while Ethernet represents it as payload. Marketing people spread FUD one way or the other but the fact is both can push a good deal of data and depending upon the workload, both do the job reasonably well.

- It is absurd to state one can remove TCP when the fabric is lossless. A transport describes a data flow through the fabric for a given application - there must be some way to delineate multiple flows being multiplexed on a single port whether it is IB or Ethernet. Hence, TCP is a reality when it comes to Ethernet just like IB transport is for IB. For some special protocols such as FCoE, the TCP is replaced by the FC protocol while for others such as iSCSI, the TCP remains a core component of the communication path.

- The reason Ethernet will be the data center backbone for the volume space is because it is a commodity and is evolving to resolve functional or performance issues at a sufficient rate for customers to trust the evolutionary approach. While major players such as Cisco and Intel are on the Ethernet bandwagon, they are or were proponents of IB. Intel exited IB because PCI Express came along and that simplified many things for them making IB unnecessary as a core product offering. Cisco, though, continues to offer IB-based products, although the business is dwarfed by the Ethernet / IP side of things. And that is the most important reason why Ethernet wins at the end of the day - it is where the money is for the volume space and all companies are focused on what matters most....making money.
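
The raw-signaling-versus-payload point in the list above is easy to check with arithmetic. The sketch below is only an illustration: it assumes the per-lane signaling rates of the SDR/DDR/QDR generations (2.5, 5 and 10 Gbps), 8b/10b line coding, and a x4 link, and it treats aggregated 10GbE links as a simple sum, which ignores how link aggregation spreads individual flows across links. The function names are made up for this example.

```python
# Assumed per-lane signaling rates (Gbps) for the IB generations discussed here.
IB_LANE_SIGNALING_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}

# Assumed 8b/10b line coding: 8 data bits are carried in every 10 signaled bits.
DATA_BITS, LINE_BITS = 8, 10


def ib_rates(speed, lanes=4):
    """Return (raw signaling rate, usable data rate) in Gbps for an IB link."""
    raw = IB_LANE_SIGNALING_GBPS[speed] * lanes
    data = raw * DATA_BITS / LINE_BITS
    return raw, data


def aggregated_ethernet_rate(links, per_link_gbps=10.0):
    """Naive sum of payload rates across aggregated Ethernet links, in Gbps."""
    return links * per_link_gbps


if __name__ == "__main__":
    for speed in IB_LANE_SIGNALING_GBPS:
        raw, data = ib_rates(speed)
        print(f"IB x4 {speed}: {raw:g} Gbps signaled, {data:g} Gbps of data")
    for links in (3, 4):
        print(f"{links} x 10GbE: {aggregated_ethernet_rate(links):g} Gbps payload")
```

Under these assumptions a x4 QDR link is marketed as 40 Gbps but carries 32 Gbps of data, which is where the 32 Gbps figure quoted earlier in the thread comes from.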

I hate to use the phrase drive-by analysis since that is used in politics but people need to not fall for drive-bys and instead focus on the real questions and challenges in making either technology a real data center backbone with all that is promised by the proponents. It isn't as simple as people portray; it isn't as cheap; it isn't as performant, and so forth.

Rick Merritt said...

Thanks for another strong comment on the Ethernet side of this issue. I especially like the observation that the data center is under-served by the optical community which is driven mainly by telecom.

Makes me want to book a ticket for OFC next week.

Anonymous said...

If Infiniband is really the future, why is all of the VC money flowing to 10GbE and not to IB? The only network connection that is truly ubiquitous is Ethernet. I find it difficult to believe that IT managers will yank the Ethernet connection and "converge" on another fabric. Look how long it has taken iSCSI to transition storage networking to Ethernet - new protocols take time to be adopted and time is on Ethernet's side.

Pat said...

I am all about being fair. Comments on the comments:

- IB completed a congestion management specification and it is being translated into h/w today.
Congestion notification is not enough to get anywhere close to a full bisection. I am not talking about the sender reducing throughput to handle a N->1 congestion at the receiver, that's quite easy to do in hardware or in software. I am talking about contention due to Head of Line blocking.
First of all, you would need to be able to change path on a per-packet basis. However, that would break ordering, and most of the protocols built on top of IB verbs rely on order. Will IB do re-ordering on the receiving NIC ? I don't think so, it requires too much space. If you drain the connection before changing path like Path Migration does today, it costs a lot and you have no guarantee that the new path is contention free.
No, the only way to have a non-blocking fabric with random traffic is to over-dimension the fabric, by using a lot more internal paths or by using much larger ones (QDR for example). Either way, there is no free lunch and enterprise-class truly non-blocking Ethernet switches already pay for that premium. If IB wants to compete in that market, it will have to adapt somehow.

- IEEE is developing a congestion management specification. It will be translated into h/w likely in 2010
When you have more than one company doing the silicon, it's true that there are more discussions :-) I think it's a good thing, but I don't think it matters: all Ethernet vendors are already using their own special sauces inside their switches, nobody will wait on the standard.

- IB has both per hop and end-to-end data recovery.
You are absolutely correct, this is the RC spec. My mistake was based on experiments using the UD protocol (unreliable by definition), which is apparently the only way to scale on large IB fabrics today.

- FCoE because FC isn't as robust as TCP in this regard
This is completely backward. TCP was designed for networks where nothing can be assumed or bounded (packet loss, congestion). FC was designed for a network where everything can be assumed or bounded, just like IB. With many Ethernet switches already providing the same characteristics as FC or IB, where is the need for TCP ? The per-priority pause is not a sine qua non condition for FCoE, it just isolates flows that you can trust from the ones you can't (just like QoS in IB). If you trust your traffic, which is mostly the case in a datacenter, current pause-based Ethernet flow-control is just fine.

- Lot of other reasons but tossing that at either IB or Ethernet's feet is wrong
I agree with you that it's a demand/offer feedback loop. This is exactly what happened with all previous generations of Ethernet: once volume reaches critical mass, component prices fall, adoption increases, and volume increases further. I can see this happening with 10GbE right now with QSFP optics. IB was at 8 Gbps years before 10GbE, but only now do we have reasonably priced cabling solutions. IB is touting its upcoming QDR link as a competitive advantage over 10GbE, but it will only be a marketing advantage if the QDR cables are too expensive. By the time they are affordable, Ethernet will happily bump to 40 Gbps. In the meantime, it would be more cost effective to aggregate 3 10GbE links so that you have the same throughput as IB QDR.
You are right that the optics market was driven by telcos, but I believe it took the 10GbE takeoff to change that.

- It is absurd to state one can remove TCP when the fabric is lossless
Please, I am not that stupid. What is absurd is for IB to claim lower latency when compared to TCP. If you do have hardware flow-control, then you can replace TCP with a lightweight protocol that does end-to-end reliability (which can be made very inexpensive by assuming that packet losses are rare) and respects message boundaries (no stream demultiplexing). If you do that, then you have the same protocol overhead on Ethernet as on IB. However, you do need to have loss-less Ethernet to not have packet loss due to contention. Why do you think you always have to use TCP when using Ethernet ? I have a better example than FCoE: ATAoE is not built on TCP (it runs directly on Ethernet), it respects packet boundaries (no need for ugly markers) and is much more efficient than iSCSI.

Why is iSCSI (or iWARP) built on TCP ??? I guess IETF believes the Datacenter is like the Internet.

- Intel exited IB because PCI Express came along.
Well, PCIE didn't just come along, Intel was the one pushing it :-) Intel did indeed kick IB out of the PCI market but it made a substantial investment in Mellanox through its VC arm and provided a lot of technical assistance over the years (especially related to PCIE).

- focus on the real questions and challenges in making either technology a real data center backbone
That's what we are doing, we just disagree on some of the questions and the challenges.
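
The head-of-line blocking argument earlier in this comment can be explored with a toy model. The sketch below is a textbook-style simplification and not a model of any real IB or Ethernet switch: it assumes a crossbar with a single FIFO per input, every input permanently backlogged, destinations drawn uniformly at random, and one randomly chosen winner per output in each time slot. The function name and its parameters are invented for this illustration.

```python
import random


def hol_saturation_throughput(ports, slots, seed=0):
    """Fraction of line rate delivered by a saturated FIFO input-queued crossbar."""
    rng = random.Random(seed)
    # head[i] is the output port wanted by the packet at the head of input i's FIFO.
    head = [rng.randrange(ports) for _ in range(ports)]
    delivered = 0
    for _ in range(slots):
        # Group the inputs by the output their head-of-line packet is waiting for.
        contenders = {}
        for inp, out in enumerate(head):
            contenders.setdefault(out, []).append(inp)
        # Each requested output serves exactly one contender per slot.
        for inputs in contenders.values():
            winner = rng.choice(inputs)
            # The winner's next packet moves to the head with a fresh destination.
            head[winner] = rng.randrange(ports)
            delivered += 1
        # Losers keep their head packet and block everything queued behind it.
    return delivered / (ports * slots)


if __name__ == "__main__":
    for n in (2, 4, 16, 64):
        print(n, "ports:", round(hol_saturation_throughput(n, 5000), 3))
```

In this model throughput drops as the port count grows and settles a little under 60% of line rate; the known limit for large port counts is 2 - sqrt(2), about 58.6%. That is the kind of loss that over-dimensioning the fabric is meant to hide.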

Rick Merritt said...

Wow, first of all thanks for continuing a great discussion.

Let me pull out a couple observations worth remembering:

1. Try to compare IB latency with Ethernet latency w/o TCP overhead.

2. Watch for ATAoE

3. Cabling issues are becoming the bottleneck/enabler


Anonymous said...

Good debate. To Rick's questions:

- Latency - Ethernet addressing is quite different (different house of cards) than IB. To date, most Ethernet switch vendors have optimized latency for their target usage model with only a couple actually focused on low-latency switching. Hence, IB switches are generally lower latency than Ethernet but technically, there isn't anything that precludes a smart Ethernet design from coming very close - enough so that the issue might be moot.

- The debate over whether or not to use TCP was contentious in the IETF and the overall industry. The IBTA did not choose TCP/IP because its original charter was focused on replacing PCI/PCI-X. The messaging / consolidated fabric was added in as the work progressed. Lot of reasons behind it all but an attempt to use at least Ethernet as the IB layer 2 was rejected and life is what it is. In any case, the IETF and many felt that TCP/IP was preferred over a new protocol on top of Ethernet not just because it was a known quantity but because so much of the overall ecosystem to support the intended services already comprehended TCP/IP. The failure of IB to create this ecosystem or quite frankly being forced to recreate it is why so many left the table years ago. Could a new transport be defined? Of course. Could it gain wide adoption? Very questionable. Just look at SCTP to see the challenges any new transport hits.

- Cabling is a major issue no matter the protocol above. The components, size, weight, EMI, and so forth all play a role in determining what is possible and practical to deploy. The total cost to deploy, not just the individual component costs, must be considered. This is what has gated high-speed and large fabric diameter configurations more than anything else within the data center. Most would consider it fair to state that IB and Ethernet can largely share the same set of cables so the question becomes more one of TTM and economics.

In response to some prior comments, (a) IB x4 QDR provides performance win for ISL and some HCA where the workload is bandwidth heavy or the cost to overprovision to largely eliminate or mitigate the QoS and congestion problems isn't significant. (b) There are active cable providers for 40 Gbps solutions today sampling. IB could be a first adopter but due to cost, it may not generate enough volume to drive the costs down. This is where the optics vendors have to make some tough choices - wait for 40 GbE to drive volumes while keeping margins high in the interim or accept smaller margins and drive volume. (c) Given the cost to develop a 10 GbE NIC and the lack of volume to date, the ROI has been poor for many. Many of us hope that will change. The gotcha is the cost to develop a 40 GbE or 100 GbE is going to be significant and the volume or more importantly the TTM for volume to appear could be quite long. This presents a risk to the Ethernet providers as to how much and how fast to invest. It is not an easy answer. I'll contend this provides a modest advantage to the IB proponents since their spec is complete, they have parts to sample, and they will deploy well ahead of 40 GbE. That will enable some customers to move forward with a certain amount of confidence.