Building a cheap yet powerful Deep Learning rig, Part 2:

Stress-testing the system: Thermal Considerations, Power Consumption, and Deep Learning Performance.

Andrea de Luca

--

Welcome to Part 2 of my series about building a cheap Deep Learning rig without sacrificing performance and reliability. In case you landed on this page through a search engine, Part 1 is here.

Power Draw Measurements and Cooling Optimizations

Let’s briefly see how heavily that monster might weigh on my electricity bill. All measurements were taken at the wall outlet by means of a Fluke power meter.

With one GPU installed (no load on it), two SSDs, four 1.35V UDIMMs, five fans (seven if you take the ones inside the PSUs into account), the system idles at ~45W. Not bad.

[Update, 16 June: with eight 1.5V RDIMMs installed, the overall power draw increased by 18W]

With all CPU cores under full load (Prime95 stress test, 12 workers, one per thread), the system peaks at ~120–125W [~140W with 8 RDIMMs]. Again, no load on the GPU.
If you are surprised that the whole machine consumes less than the nominal TDP of the CPU alone (130W), remember that Xeons are better optimized for energy efficiency than their desktop counterparts. Moreover, the TDP is not meant to indicate how much power the CPU will draw under load; rather, it is a guideline for selecting adequate cooling hardware.

I should mention that the M730 workstation came with a standard, rather generous tower CPU heatsink, but no proper fan on it. Instead, the airflow generated by the front chassis fans was forced through the heatsink by an awkward air duct. The usual crappy proprietary solution: they don’t want you to replace anything.
So I grabbed an Enermax 80mm silent fan and strapped it to the heatsink with brass wire; the coupling ended up being much more reliable than any commercial plastic frame.
I also reduced its working voltage from 12V to 7V to keep it silent and avoid wearing it out prematurely.

Enermax TB-Silence 80 coupled to the heatsink with brass wires

After half an hour of Prime95, the measured load temperature hovered around 82–83C. High, but well within specifications. I reckon the fan is a bit small; maybe I’ll go for a 92mm model or increase its rotational speed. We’ll see.

Testing with two GPUs

I went ahead and tested the system in its full-fledged DL configuration by installing the GTX 1070 alongside the 1080 ti.
Idle power draw increased by a modest 8–9W. When it came to evaluating full-load CPU and GPU power usage, though, push came to shove.

I tried to generate the harshest workload I could by running FurMark’s “extreme burn-in test” (an infamous stress test known for having killed quite a few old GPUs that lacked overheat protection) on the 1080 ti, while loading the 1070 with Unigine Heaven at maximum presets. The CPU kept running Prime95.

FurMark

No deep learning task (or real-world application whatsoever) could ever generate such a tremendous workload.
For the record, I measured a maximum power draw slightly above 500W, but let me add some interesting considerations about thermal management.

My 1080 ti FE (Founder’s Edition) behaved exactly as described by a ton of reviews all around the Internet. Founder’s cards (like mine) are all identical, no matter the brand: Nvidia designs both the card and the cooling system, and the actual manufacturer (mine was Zotac) just puts the pieces together.
As a matter of fact, the GP102 chip has a maximum operating temperature of 91C, but Nvidia decided to keep the full-load temperature limit within a more reasonable 84C.
At the same time, for whatever reason, they went for a surprisingly conservative fan management profile.
As a result, once FurMark kicks in, the card quickly heats up towards the target temperature. Once 84C is hit, the card throttles down.

Indeed, at that temperature the blower fan rotates at just ~50% of its maximum speed, so the only way to prevent the temperature from rising further is to throttle down. Nothing dramatic: the clock drops by 100–200 MHz.
Maybe they preferred to trade a bit of performance for silent operation.
I couldn’t care less about silence (at least during training), so after a bit of experimenting I manually changed the fan profile (with MSI Afterburner), requesting 40% of maximum fan speed at 60C, 55% at 70C, and 72% at 80C.

This way, the card neither goes over 80C nor throttles down. Want to run your card on Linux instead of Windows? Worry not: you will find fan speed controls directly in the Nvidia driver settings app.

Note the fan spinning at 61% of its maximum rate at 75C. Very different from its stock profile.

Mind that at 70%, the fan is quite loud. It actually reminds me of an airliner taking off. But I like it!
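If you run the card on Linux and prefer scripting the curve rather than clicking through a GUI, here is a minimal sketch of the same fan profile. It is an assumption-laden example: it presumes a single GPU, Coolbits enabled in xorg.conf, a driver recent enough to expose the GPUTargetFanSpeed attribute to nvidia-settings, and nvidia-smi on the PATH; the curve points are simply the ones I set in Afterburner.

```python
# A sketch of the same fan curve for Linux, driven from a tiny script.
# Assumptions: one GPU, Coolbits enabled in xorg.conf, a driver new enough
# to expose GPUTargetFanSpeed, and nvidia-smi/nvidia-settings available.
import subprocess
import time

CURVE = [(60, 40), (70, 55), (80, 72)]   # (temperature C, fan duty %), as set in Afterburner

def target_speed(temp_c, floor=30):
    """Return the duty of the highest curve point already reached."""
    duty = floor
    for threshold, curve_duty in CURVE:
        if temp_c >= threshold:
            duty = curve_duty
    return duty

def read_temp(gpu=0):
    """Read the core temperature through nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu),
         "--query-gpu=temperature.gpu", "--format=csv,noheader"])
    return int(out.strip())

def set_fan(duty, gpu=0):
    """Take manual control of the fan and apply the duty cycle."""
    # Attribute names may vary with driver generation.
    subprocess.run(
        ["nvidia-settings",
         "-a", f"[gpu:{gpu}]/GPUFanControlState=1",
         "-a", f"[fan:{gpu}]/GPUTargetFanSpeed={duty}"],
        check=False)

if __name__ == "__main__":
    while True:
        set_fan(target_speed(read_temp()))
        time.sleep(5)
```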

What about the 1070? It is a much more tranquil card. Thanks to its lower TDP (and performance) and MSI’s sensible fan policy, it rarely hits 75–77C no matter the load. It’s silent, too.
On the other hand, lacking a blower-style cooler, it contributes to heating up the inside of the case, and I had to install it in the bottom x16 slot to give it enough breathing room.

More about GPUs: why did I pick a blower version?

As I mentioned in Part 1, I definitely prefer blower versions (Founder’s Edition, Turbo, etc.) over ‘open air’ custom coolers.

A Founder’s Edition card without its plastic cover

There is nothing wrong with custom coolers: some of them are indeed very well engineered, not to mention they are generally quieter than a blower version, and given the right conditions, they could easily outperform a blower system in terms of cooling performance.
Still, one has to consider that a multi-GPU system seldom matches those conditions:

  • Non-blower cards need to be generously spaced (more than two slots per card); otherwise the fans can’t breathe. Blower versions have no such limitation, since their intake vents sit at the far end of the card. Below you can see four Founder’s Edition Titans stacked very tightly (2 slots per card). Note their intake vents at the rear ends.
Four Titans in SLI. In a Deep Learning environment, the SLI bridge would have been unnecessary.
  • A non-blower card will contribute to heating up the inside of the case, since only a minimal fraction of the air that hits the heatsink is exhausted outside. Below you can see the typical airflow of a custom-cooled card compared to a Founder’s Edition:
Custom vs Blower coolers

Liquid Cooling

What about liquid cooling? I considered it for both the GPU and the CPU, but ultimately discarded it because:

  • You have to do maintenance every now and then.
  • A pump malfunction could be a disaster.
  • A coolant spill would be even worse.
A desktop system with both GPU and CPU cooled by water pumps

That said, liquid cooling is much more effective at removing heat from components, and a lot of people enjoy it; just be aware that it is rarely used in professional contexts, whether in general-purpose servers (CPUs) or in GPU-dedicated environments (Tesla clusters, etc.).

Just to give an example, the Nvidia DGX-2, a 400,000 USD machine (2 petaflops thanks to sixteen Tesla V100s), is air-cooled.

How powerful does a PSU have to be?

While I was quite sure the XFX TS had both sufficient amperage and wattage to handle the two video cards without frying itself, I expected it at least to spin up its own fan and generate a fair amount of heat under load.
After all, the two GPUs amount to a respectable 400W TDP.

No such thing happened: it kept spitting out air at ambient temperature with its fan barely rotating, and it remained dead silent too.

The situation was a bit different for the Fujitsu power supply. Despite being an industrial-grade unit designed for continuous operation, its fan got a bit loud under load, and it spat out a substantial amount of hot air (almost as hot as the 1080 ti’s exhaust).
One has to consider these aspects, however:

  • The two cards are not entirely powered by the XFX TS. They draw up to 75W each from their respective PCIe slots, and those slots are fed by the Fujitsu unit, so 150W must be added to its burden. In total, you end up with a respectable 150W + 125W (system) = ~275W on a multi-rail unit (see below); with 8 DIMMs installed it would approach the 300W mark. On the contrary, the XFX carries just 400W - 150W = 250W, not even half its maximum output. (There’s a quick budget check after this list.)
  • The Fujitsu unit is multi-rail, whereas the XFX one is single-rail. While something with a ‘multi’ prefix may sound cool, in a multi-rail unit the power is distributed, more or less evenly, among the various rails, and one has to worry about overloading individual rails. Below you can see the maximum amperage per rail. Let’s hope the slots and the CPU are on different rails, since whoever wrote the M730’s manual didn’t see fit to specify such trifles.
Amperages per rail for a total of five 12V rails
  • Conversely, in a single-rail unit like the XFX TS, one only has to care about the overall load, since there is just one rail and the PSU delivers its entire power through it. I could even install two Titan-class GPUs and it would handle them without fuss.
  • Last but not least, while the bottom-mounted XFX unit takes fresh air from outside the case, the top-mounted Fujitsu grabs its air from inside the case, where the 1070 and the CPU contribute greatly to making the environment quite hot. To be more precise, its intake fan sits just a few centimeters above the CPU heatsink.
    I also noticed I could roast chickens over the PCH heatsink.
The PSUs with their respective airflows
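If you want to redo the back-of-the-envelope math above for your own PSU pair, here it is as a tiny Python sketch. The figures are the rough ones quoted in this post (nominal TDPs and the measured system load), not per-rail measurements.

```python
# Rough power budget per PSU, using the figures quoted in this post.
GPU_TDP         = {"gtx_1080_ti": 250, "gtx_1070": 150}  # watts, nominal
PCIE_SLOT_LIMIT = 75    # watts a card may draw from its slot (paid by the Fujitsu)
SYSTEM_LOAD     = 125   # watts for CPU + board + drives under Prime95 (measured)

slot_power   = PCIE_SLOT_LIMIT * len(GPU_TDP)        # 150 W through the motherboard
fujitsu_load = SYSTEM_LOAD + slot_power              # ~275 W on the multi-rail unit
xfx_load     = sum(GPU_TDP.values()) - slot_power    # ~250 W on the PCIe power cables

print(f"Fujitsu (multi-rail): ~{fujitsu_load} W")
print(f"XFX TS (single-rail): ~{xfx_load} W")
```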

Deep Learning: Training a model with Pytorch and Fastai (with a grain of Keras).

Let’s begin this section by saying we won’t make any attempt to train a model for optimal accuracy or other metrics.
All I wanted to do is provide some reproducible benchmarks you can run yourself to gain some insight into this build, or to compare against your own.

Caveats:

  • I wanted to benchmark my build using the libraries I usually employ for my real-world Deep Learning projects.
    Keras and TensorFlow are great pieces of software, more mature and reliable than Fastai (and Pytorch itself), but I noticed I could attain slightly better results on the same data with the latter, while at the same time writing less code.
    Both Francois Chollet and the guys at Google are pretty receptive and capable, so I expect Keras/TF to improve a lot in the near future. We’ll see!
    [Update, 18 June: I added three epochs with an ‘official’ Keras MNIST example. You can find the code here]
  • I ran these benchmarks on Windows 10 because I’m forced to use it on this rig (I have to run Windows-only software while training NNs, mainly fintech stuff).
    Many people report that Linux delivers better performance than W10 with the same hardware, frameworks, and libraries, so you may want to consider it for a production, DL-only machine.
    If I test my rig with Linux in the future, I’ll duly update this section with detailed results.
    That said, one has to consider that Windows offers a lot more tools for tuning and monitoring the GPU, so it makes sense to use it for our present task.

Benchmark 1: Running an epoch over Kaggle’s Cats vs. Dogs dataset with Resnet101 (frozen) and precompute=False
Resnet101 is a very large and deep architecture (though neither the largest nor the deepest, that’s for sure), and we’ll see it doesn’t grant us much freedom when it comes to batch size, even with 11 GB of VRAM.
By specifying precompute=False, we don’t use precomputed activations, which increases the overall amount of work the GPU has to do. In turn, this makes it possible to use data augmentation (again, more work).

Keep in mind that we are using a pretrained model (Resnet101, like I said). This means the weights of all layers but the last few are already established, and those layers won’t be trained. In DL jargon, they are frozen.
Only the last few layers that we’ve put on top of Resnet101 will be trained. This makes a general model (any kind of image) more sensitive to cat/dog-specific images.

The maximum batch size I could choose was 128; larger batch sizes produced CUDA out-of-memory errors. However, I judged 128 to be large enough to adopt quite an aggressive learning rate (see here if you’re not following me).
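For reference, the whole run boils down to a handful of calls with the fastai library of that era (the 0.7-style API). Take it as a hedged sketch rather than my exact notebook: the dataset path and image size are assumptions; the batch size, the frozen pretrained Resnet101, precompute=False, and the aggressive learning rate are the ones described above.

```python
# Benchmark 1 sketch: one epoch, frozen Resnet101, data augmentation, bs=128.
from fastai.conv_learner import *      # fastai 0.7-style imports (also exposes resnet101)

PATH = "data/dogscats/"                # assumed Kaggle Cats vs. Dogs folder layout
sz, bs = 224, 128                      # image size is an assumption; bs as in the text

tfms = tfms_from_model(resnet101, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=bs)

learn = ConvLearner.pretrained(resnet101, data, precompute=False)  # backbone stays frozen
learn.fit(1e-1, 1)                     # one epoch at the aggressive LR mentioned above
```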

We attained such an outstanding accuracy with a few lines of code and just one epoch because Resnet101 is a wonderfully trained architecture (and it has seen a ton of Imagenet pics).

Let’s see how the GTX 1080 ti behaved during those 6 minutes and 27 seconds.

Relevant GPU data after 2 minutes.

If you look at the picture above, you will appreciate that the GPU is far from being pushed to its limits. Training just one or two layers does not let us unleash its true power. Note also how uneven the load is.

It is worth noting, however, that GPU-Z reported just 1.6 GB of memory usage, while HWMonitor, for example, reported a more realistic 80% (a bit less than 9 GB). All the other relevant readings agree (the screenshot below was taken some 10 seconds after the one above).
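If you want a third opinion besides GPU-Z and HWMonitor, you can query the driver directly through NVML. A minimal sketch, assuming the pynvml bindings are installed (e.g. pip install nvidia-ml-py3):

```python
# Read VRAM usage and load straight from the driver via NVML.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)            # 0 = the 1080 ti in my box

mem  = pynvml.nvmlDeviceGetMemoryInfo(handle)            # bytes
util = pynvml.nvmlDeviceGetUtilizationRates(handle)      # percentages

print(f"VRAM used: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")
print(f"GPU load: {util.gpu}%  memory controller: {util.memory}%")

pynvml.nvmlShutdown()
```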

We’ll try and push the GPU to its limits now.
To this end, we unfreeze the entire network, and retrain (or, to be more precise, fine-tune) all its layers. We’ll do it using differential learning rates: as you descend towards the early layers of the network, you don’t want to perturb their weights too much, since they recognize the most general features.
Note how we pass the lr array to fit(). Don’t mind cycle_len: it would take too long to explain, and this is not a blog post about the Fastai library.

One important note regarding the batch size: I had to reduce it from 128 to 24, since the 11 GB of VRAM couldn’t handle the entire network even with a batch size of 32. Consequently, I shrank the LR by a whole order of magnitude (0.1 to 0.01), even for the last few layers.
Now you understand the importance of getting as much VRAM as possible. Reducing the batch size further would make it difficult to properly tune the LR.
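Continuing the sketch from Benchmark 1, the unfrozen run looks roughly like this; the exact values in the learning-rate array are my reading of the text (one order of magnitude lower for the last layers, smaller still for the earlier groups), so treat them as placeholders.

```python
# Benchmark 2 sketch: unfreeze everything and fine-tune with differential LRs.
# PATH and tfms are the same as in the Benchmark 1 sketch above.
from fastai.conv_learner import *
import numpy as np

data  = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=24)   # bs drops to 24
learn = ConvLearner.pretrained(resnet101, data, precompute=False)

learn.unfreeze()                        # every layer of Resnet101 becomes trainable
lrs = np.array([1e-4, 1e-3, 1e-2])      # early / middle / last layer groups (placeholders)
learn.fit(lrs, 1, cycle_len=1)          # one cycle over one epoch
```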

The following sensor readings were taken roughly halfway through these 10 minutes and 41 seconds.
Now the GPU is fully utilized, and it even exceeds its nominal TDP (107.1%, below) by running at a boosted frequency (it even touched 1911 MHz at times).

I focused on the big GPU during these benchmarks, since I want to add another one as soon as my wallet allows it. Furthermore, I think it is the obvious choice for anyone who is serious about deep learning.
However, mine is still a dual-GPU build, so what about the 1070?
It completed the two benchmarks in 9 minutes 41 seconds and 17 minutes 26 seconds, respectively (not bad at all).

The card even managed to run the first fit with unchanged batch size. Running the second, however, required a batch size of 16.

Let’s finish this sequence of benchmarks with Keras and TensorFlow: three epochs over the MNIST dataset (it’s an ‘official’ Keras example, see above):
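For completeness, here is a condensed version of that example (it assumes the TensorFlow backend with channels_last image ordering; the linked code above is the authoritative one):

```python
# Condensed from the official Keras mnist_cnn example: three epochs on MNIST.
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

num_classes, batch_size, epochs = 10, 128, 3

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255
x_test  = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test  = keras.utils.to_categorical(y_test, num_classes)

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(num_classes, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adadelta", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
          validation_data=(x_test, y_test))
```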

Four seconds per epoch.
The 1070 did it in seven (first epoch) and six (the last two epochs).

No further considerations here: I just wanted to give you something reproducible if you don’t want to install Fastai and Pytorch.

Conclusions

The system behaved admirably under stress. The only component that worried me a bit was the Fujitsu PSU, but the system remained rock solid through over an hour of Prime95 + FurMark + Heaven, so I can’t complain.

As for Deep Learning, assuming we made no hidden mistakes during our experiments, one surprising thing emerges:

A GPU is not fully leveraged unless you train, retrain, or fine-tune a big model in all its layers (or at least a good part of them).
In turn, this means there is room for improvement in the other components.

I think that when we have just one or two layers to train (and/or when precompute=True), quite a lot of time is spent moving minibatches back and forth. So I’ll run another couple of benchmarks as soon as I get my hands on an NVMe SSD, and update this page accordingly.
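A quick way to test that hypothesis is to time the input pipeline alone, with no compute on the GPU, and compare it against a full training epoch. A hedged PyTorch sketch (the dataset path, image size, and batch size are assumptions matching the runs above):

```python
# Time pure data movement (disk -> CPU augmentation -> GPU) with no compute.
import time
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
ds = datasets.ImageFolder("data/dogscats/train", transform=tfm)   # assumed path
loader = DataLoader(ds, batch_size=24, num_workers=4, pin_memory=True)

start = time.time()
for xb, _ in loader:
    xb = xb.cuda(non_blocking=True)     # just move batches to the GPU, nothing else
torch.cuda.synchronize()
print(f"Pure data pipeline: {time.time() - start:.1f}s per epoch")
```

If that number turns out to be a large fraction of a full training epoch, the storage (or the CPU-side augmentation) is the bottleneck rather than the GPU.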

That’s all Folks!!

I really hope these blog posts of mine help you put together your next (cheap) DL build!
