Yet another Deep Learning build? Yes, but this one won’t break my bank account.

Learning how to put together a cheap DL rig without compromising on performance and reliability.

Andrea de Luca

--

A few days ago, a fellow Deep Learning practitioner expressed interest in my latest hardware build and asked me to write a blog post about it, so here we are. Please, be patient and indulge me!

You will find a lot of blog posts on the same topic (the vast majority of them right here on Medium).
They are all interesting and quite instructive, but mine is aimed precisely at showing you how to squeeze the most from your next build in terms of bang for buck (read: TFLOPS per unit of money).

Here in Part 1 we’ll see how to select the right key pieces and put them together. In Part 2, we’ll stress-test the finished build under realistic conditions, with a glimpse of its performance.

1. How I chose the components

Let’s start by saying what I didn’t want, piece by piece.

  1. CPU
    I wanted at least two GPUs, and I didn’t want to be limited in terms of PCI Express lanes: every card would get the full 16X gen3 bandwidth.
    Tim Dettmers, author of an outstanding (although a bit outdated) DL hardware guide (http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/), experimented a lot with PCIe bandwidth and reported that running a modern card at half bandwidth (8X gen3) results in a 10% worst-case penalty. Although ~10% may seem acceptable, I wanted to avoid it, since it can make quite a difference when running long experiments.
    Furthermore, one has to take into account other devices: for example, a modern NVMe drive takes four gen3 lanes for itself.
    So, if you plan to buy some shiny 8700K-ish processor (16 lanes in total) for a dual GPU 8X/8X setup, be prepared to sacrifice something in terms of storage speed.
    The matter is somewhat less pesky with the new Ryzens, since they get 20 lanes: as long as the motherboard allows it, you can go for a somewhat acceptable 8X/8X (GPUs) + 4X (NVMe) configuration, provided you can make some sacrifices in terms of libraries (for example Intel’s MKL, which is optimized for Intel CPUs: https://software.intel.com/en-us/mkl). While those libraries have little impact on the training phase, they come in quite handy in a lot of data-intensive collateral tasks.
    Finally, you can always resort to a mighty Threadripper: with its 64 lanes, it would leave you with a lot of room for expansion. More importantly, looking at its size, you can use it as a self-defence weapon once it becomes obsolete.
    For me, it was not an option due to its price and its monstrously high power draw. When I said I wanted a cheap build, I meant the power supply and the electricity bill as well.
    [Update, July 2018: If you are a Linux user, which is typical for a deep learning practitioner, there might be other headaches: as one user reported in the comments, the Threadripper still lacks Linux drivers]
    All these considerations led me to a single option: a used Xeon E5 (more below).
  2. RAM
    I didn’t want desktop (that is, non-ECC) memory.
    In my opinion, people often underestimate the importance of having error-correcting RAM. I won’t bother you with details; let’s just say that a (quite common) bit flip in RAM can potentially corrupt any file on your hard drive. It is one of the main reasons OSes tend to go kablooey in a year or two and “have to be reinstalled”.
    ECC apart, I didn’t want to buy DDR4 memory. For whatever market-related reason I don’t even want to understand, a 32GB DDR4 kit ranges from 400 to 500 euros here in the EU. Frankly, a fraud.
    Solution: good old DDR3. Obviously, ECC.
  3. Mainboard and Chipset
    I needed a socket 2011 board supporting ECC memory and a Xeon E5 v1/v2, with at least two 16X gen3 slots, adequately spaced. That meant the Intel C602 chipset, the workstation equivalent of the X79.
    I ended up with a (used, yet new! See below) Fujitsu D3128-A2 mainboard.
  4. GPU(s)
    You cannot save money here, at least if you want a Pascal card.
    Used ones are almost as expensive as new ones (thanks to the mining hype), plus they are invariably clogged up with dust, so I decided to go for a new 1080ti, to be installed alongside my previous 1070.
    Since GPUs are so important in Deep Learning, we’ll talk extensively about them below and in Part 2.
  5. Hard Drives
    I’ll leave the hard drives out of the main description since I already had them. Let’s just say I’m using a couple of seasoned Samsung 850 Pros, although I plan to add an NVMe drive soon.
    As a general guideline, don’t forget that minibatches have to be moved in and out, and your hard drive is the one that has to serve them to the GPU. In brief, a spinning drive won’t be sufficient.
    In other words, SSDs are essential; NVMe is optional, but recommended. (You can check what a given drive actually delivers with the sketch right after this list.)
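Since the claim above is easy to verify on your own hardware, here is a minimal sequential-read sketch in Python. Treat it as a rough microbenchmark under stated assumptions: the scratch path, file size, and chunk size are arbitrary, and the OS page cache will inflate the figure unless you drop caches first.

```python
import os
import time

# Minimal sequential-read microbenchmark. TEST_FILE, SIZE_MB and CHUNK are
# arbitrary assumptions; point TEST_FILE at the drive you want to measure.
TEST_FILE = "/tmp/disk_bench.bin"
SIZE_MB = 1024
CHUNK = 1 << 20  # 1 MiB

# Write a scratch file of random data...
with open(TEST_FILE, "wb") as f:
    for _ in range(SIZE_MB):
        f.write(os.urandom(CHUNK))
os.system("sync")  # flush pending writes before timing (Linux/macOS)

# ...then time reading it back. Caveat: the page cache will inflate the
# result unless you drop caches first (on Linux, as root:
# echo 3 > /proc/sys/vm/drop_caches).
start = time.perf_counter()
with open(TEST_FILE, "rb") as f:
    while f.read(CHUNK):
        pass
elapsed = time.perf_counter() - start

print(f"Sequential read: {SIZE_MB / elapsed:.0f} MB/s")
os.remove(TEST_FILE)
```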

The Build

Once I made my choices, I started searching eBay for the appropriate hardware at reasonable prices.

Any Xeon E5 would have been sufficient for my purposes, as long as it supported 40 gen3 lanes (32 for the GPUs, the rest for storage and/or other stuff). Indeed, benchmarks show that once you have one core and two threads per GPU, at least a Sandy Bridge-generation part, and clocks above 1–1.5 GHz, you will be OK (please refer to good ol’ Dettmers for detailed benchmarks).

As you may see below, a drop from 3.6 GHz to 1.2 GHz produces a performance decrease of 8% in the worst case. Keeping yourself above 2 GHz does not, to any practical extent, affect DL performance at all.

Source: Tim Dettmers. Note: the CPUs listed above are the desktop equivalents of the E5-1620v1 and E5-1650v1. Essentially the same processors, apart from ECC support and energy efficiency.
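If you want to check the one-core-two-threads-per-GPU rule of thumb on your own machine, a trivial sketch (assuming a CUDA-enabled PyTorch install) could look like this:

```python
import os

import torch  # assumption: a CUDA-enabled PyTorch install

# Rule of thumb from above: at least one core (two threads) per GPU.
# os.cpu_count() reports logical threads, so we want two per GPU.
threads = os.cpu_count()
gpus = torch.cuda.device_count()

print(f"{threads} logical threads, {gpus} GPU(s)")
if gpus and threads < 2 * gpus:
    print("Warning: fewer than two threads per GPU; data loading may bottleneck.")
```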

If you want something really beefy (who doesn’t want to leave hundreds of Chrome tabs open while running Jupyter notebooks?), the best price/performance option is the E5-2680v2 (ten cores, twenty threads). You will find them for less than 200 euros, since a lot of them have been pulled from decommissioned servers. For me, that much power would have been wasted, and like I said, I sought less power-hungry options: 4-6 cores would be sufficient.

I found a barebones Fujitsu M730 workstation (case, PSU, mobo, an Nvidia NVS 295 and a Xeon E5-1650v2 six-core CPU) for the incredible price of 119 euros, shipped. It should be noted that all this hardware is specifically designed for 24/7 continuous operation.

I expected something incredibly worn out. Instead, I received brand new hardware. I suspect the computer was bought, and never used.

Looking at the next picture below, you’ll note that the mainboard has the correct slots for a solid DL build:

  • It has two full-fledged 16X gen3 slots for the GPUs (second and fifth from the top). You can verify the link width each card actually negotiates with the sketch right after this list.
  • It has two 8X mechanical, 4X electrical gen3 slots, ideal for hosting a couple of NVMe drives on appropriate adapter cards (worried that the board won’t boot from them? No problem at all: just install GRUB on a SATA drive and let it manage boot drives and OSes!).
  • It has one 16X mechanical, 4X electrical gen2 slot, that I will use for the Quadro NVS 295 (which will control the monitor).
  • It even has two PCI slots if you want to add more usb ports or other crappy legacy stuff.
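Here is a small sketch that shells out to nvidia-smi to show the negotiated PCIe link of each card. The query fields below exist in current NVIDIA drivers, but treat the exact names as an assumption and check nvidia-smi --help-query-gpu if they differ on your system:

```python
import subprocess

# Ask nvidia-smi for each GPU's negotiated PCIe generation and link width.
fields = "name,pcie.link.gen.current,pcie.link.width.current,pcie.link.width.max"
out = subprocess.check_output(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"],
    text=True,
)
for line in out.strip().splitlines():
    name, gen, width, width_max = [s.strip() for s in line.split(",")]
    print(f"{name}: gen{gen} x{width} (max x{width_max})")

# Caveat: idle cards often downclock their link (e.g. to x8 or gen1),
# so run this while the GPUs are under load for a meaningful reading.
```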

Was I amazingly lucky? Not quite.

The mainboard had proprietary PSU connectors. Looking at the pic below, you’ll notice it lacks the standard ATX connector, having two awkward 12V sockets instead.

The D3128-A board with its peculiar 12V sockets.

It could be used only in conjunction with the original Fujitsu PSU, which in turn was just 500W and utterly wanting in terms of PCIe connectors (just one 6-pin, whereas a 1080ti and a 1070 together require two 8-pin and one 6-pin connectors).

I was aware that you can use two different PSUs inside the same PC, provided you accept jumpering the second PSU so that it starts together with the primary one. Incredibly untidy.

Short-circuiting the ATX connector

But then, the same guy who told me to write this blog post suggested that a 2-PSU solution was doable in a more orderly way by means of a so-called add2psu, a little contraption originally aimed at cryptominers, who need a lot of PSUs to feed their mining hardware. It costs 9 euros.

The add2psu

Having solved the interconnection problem, I threw away the original case and ordered an Anidees AI8 full tower case. Specifically engineered to house two power supplies, it was incredibly cheap for its specifications: 106 euros (shipped) on Amazon. As we shall see, freeing the components from their cramped former abode proved quite effective in lowering temperatures.

Anidees AI8

It includes three 140 mm front fans and one rear fan, also 140 mm. All of them are pleasantly silent.
A fan/LED lighting controller with six outlets is also included. You can set it from the front panel to three positions: 0V (fans and their LED lighting are off), 5V (silent operation and low-intensity light), or 12V (maximum airflow and LED lighting).

The AI8’s fan hub

The second PSU that I chose, mainly for its unbeatable price/quality ratio, was the XFX P1-650G-TS3X.
Boasting four 6+2 PCIe plugs, terrific ripple suppression, and 80+ Gold compliance, it has excellent reviews all around the internet, and it’s cheap (65 euros, local store) since it’s not modular. Who cares about modularity? The chassis has plenty of space for all those wires.

The non-modular XFX P1-650G: quite a lot of wiring indeed.

Now I had a combined 1150W of power, 650W of which were for the GPUs.

What about the GPUs themselves? When it comes to buying a GPU, I strongly advise against any non-standard solution. Unless you have specific space requirements, you should go for blower-fan versions (called “Founders Edition”, “turbo”, etc., depending on the manufacturer), since unlike dual- or triple-fan solutions they exhaust hot air out of the case instead of recirculating it, and, even more importantly, they can be stacked very tightly without any fuss (more about this in Part 2). You can keep an eye on how well that works with the sketch below.
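A hedged little helper for that: polling the temperatures while the cards are under load (again via nvidia-smi; the interval and fields are my own choices):

```python
import subprocess
import time

# Poll GPU temperatures once a minute; handy for checking that tightly
# stacked blower cards keep exhausting heat properly under load.
# Stop with Ctrl+C.
while True:
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,name,temperature.gpu",
         "--format=csv,noheader"],
        text=True,
    )
    print(out.strip())
    time.sleep(60)
```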

I grabbed a 1080ti Founders Edition. At 736 euros, it was the most expensive piece in the rig, vastly exceeding all the rest in price.
It weighs a ton and almost entered the slot under its own weight. The sturdy backplate ensures torsional resistance, though I’m worried the card could rip the slot out sooner or later.

GTX 1080ti FE

Why did I choose a 1080ti over its less expensive siblings? Memory. I wanted the ability to handle large batches with complex NN architectures.
I scratched the limits of my previous 8GB card on several occasions, but if you need a more authoritative example, look at the benchmarks by Justin Johnson (Stanford AI Lab). With a large architecture (ResNet-200), he reports:

Even with a batch size of 16, the 8GB GTX 1080 did not have enough memory to run the model.

Source: https://github.com/jcjohnson/cnn-benchmarks
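If you want to see how close your own card is to that wall, a rough PyTorch sketch along these lines measures the peak VRAM of a single training step. ResNet-152 and batch size 16 are stand-ins here (torchvision does not ship a ResNet-200), so expect different numbers from Johnson’s:

```python
import torch
import torchvision

# Measure peak VRAM for one training step. resnet152 and batch size 16
# are stand-ins; the benchmark quoted above used ResNet-200.
device = torch.device("cuda:0")
model = torchvision.models.resnet152().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(16, 3, 224, 224, device=device)
y = torch.randint(0, 1000, (16,), device=device)

torch.cuda.reset_peak_memory_stats(device)
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
print(f"Peak VRAM for one step: {peak_gb:.1f} GB")
```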

Speaking of 8GB cards, I already had an MSI GTX 1070 Aero ITX. As we shall see, it’s almost 50% slower than the 1080ti (not to mention the difference in VRAM), yet it remains quite a decent GPU with a very good price/performance ratio.

The GTX 1070 aero ITX. Note its size.

Wait, didn’t I just say to avoid non-blower cards?
Sure, but I added unless you have specific needs: I need to put this one into a mini-ITX case to take with me when I travel.

You’ll find more detailed information about my GPUs in Part 2, but allow me to briefly summarize some important specifications of the Pascal cards suitable for Deep Learning.
Do not overlook the TDP, nor the number and type of power connectors: your PSU(s) have to cope with them (see the rough estimate after the table below).

The Pascal range
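To make the PSU sizing concrete, here is a back-of-the-envelope estimate in Python. The GPU TDPs are NVIDIA’s published board powers; the allowance for the rest of the system and the 30% headroom factor are my own assumptions:

```python
# Back-of-the-envelope PSU sizing: sum the cards' TDPs, add an allowance
# for the rest of the system, and keep ~30% headroom for transients.
gpu_tdp_w = {"GTX 1080ti": 250, "GTX 1070": 150}  # published board powers
system_w = 150  # CPU, drives, fans: a rough guess for this build

load_w = sum(gpu_tdp_w.values()) + system_w
recommended_w = load_w * 1.3
print(f"Estimated load: {load_w} W; recommended PSU capacity: {recommended_w:.0f} W")
```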

Let’s move on to main memory: it is recommended to get at least twice as much RAM as video memory, otherwise you could see your computer resorting to swap files, with obvious performance losses (although NVMe should mitigate the issue).
I borrowed 32GB of ECC UDIMMs from another machine of mine. Once I finish an appropriate cycle of tests and benchmarks, I’ll get at least 48–64GB of ECC RDIMMs (you’ll find them in abundance, and dirt cheap, on eBay). A quick way to check that rule of thumb on your own box is sketched below.
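A minimal sketch of the twice-the-VRAM check, assuming PyTorch and psutil are installed:

```python
import psutil  # assumption: pip install psutil
import torch

# Check the "system RAM >= 2x total VRAM" rule of thumb from the text.
vram = sum(
    torch.cuda.get_device_properties(i).total_memory
    for i in range(torch.cuda.device_count())
)
ram = psutil.virtual_memory().total

print(f"RAM: {ram / 1024**3:.0f} GiB, total VRAM: {vram / 1024**3:.0f} GiB")
if ram < 2 * vram:
    print("Warning: below the 2x rule; heavy data pipelines may hit swap.")
```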

[Update, 16 June: 64GB (8×8GB) of 1066 MHz DDR3 RDIMMs have arrived (148 EUR, shipped). Never mind the relatively low frequency: for Deep Learning purposes (and for almost any other application, to be honest) the data throughput is more than adequate. Also, remember that the 8 modules will run in quad-channel mode.]

RDIMMs add a useful buffer with respect to UDIMMs, which helps stabilize the command signals.
Not surprisingly, RDIMMs are the cheapest kind of memory module you can find on the market, while at the same time being the most reliable. That’s because they cannot be used on desktop hardware (which in turn rules out 90% of potential buyers). If you have server-grade hardware, go for them.

Below is the finished build. Note the original PSU on top.
You can also see the 1080ti sitting beside an old 750ti. In the definitive version of the build, you will see the 1070 in place of the 750ti, with the NVS 295 left to manage video outputs. I had to use the 750ti for testing purposes since I haven’t removed the 1070 from the previous build yet (it is still my production machine), the NVS only has DisplayPort outputs, and I didn’t have any DP cable [one ordered].

Building
Testing

Summary

Not taking the GPUs into account, we have:

  • Fujitsu M730 with Xeon Ivy Bridge-EP CPU, mainboard, 80+ Gold 500W PSU and Quadro NVS: 119 EUR
  • Dual-PSU windowed case: 106 EUR
  • Second 80+ Gold 650W PSU: 65 EUR
  • 64GB of ECC registered RAM (RDIMMs): 148 EUR
  • add2psu and various accessories: 35 EUR

That gives a total of 473 EUR for an ECC-equipped system (64GB) with 40 gen3 lanes.

I deliberately left out the GPUs in order to focus your attention on a price comparison with a modern Kaby Lake/Coffee Lake system. Remember: all the money you save on the bulk of the system (without compromising on memory reliability or PCIe bandwidth) can be invested in GPUs, which are what makes the real difference when it comes to training a model.

I decided to invest the money I saved on the system in two powerful GPUs, but if you want to stay under one grand and still enjoy a capable multi-GPU rig, add two GTX 1060s.

Conclusions

Speaking of comparisons, should I mention that this is the same hardware as the trashcan Mac Pro?
Configured with exactly the same CPU (and chipset) and the same amount of ECC RAM, but with two AMD D700 GPUs you cannot use for Deep Learning, that machine still costs four and a half grand:

That’s all, for now.

If you liked this blog post, I think you’ll enjoy reading Part 2, too!
