Watercooling a Deep Learning Machine
This is an article documenting the lessons I have learnt building two watercooled deep learning boxes. There are plenty of YouTube videos and articles about watercooling gaming rigs, but much less information about multi-GPU watercooling setups.
If you research what type of GPUs to use for multi-GPU deep learning rigs, most people nowadays will recommend blower-type GPUs such as the ASUS Turbo GeForce® RTX 2080 Ti (note that for single-GPU setups, open-air, blower and AIO styles are all totally fine).
This makes sense: when you have multiple GPUs you want to keep hot air away from the other GPUs' air intakes, and with blower-style cards the hot air gets redirected straight outside the case. However, blower-type GPUs don't appear to be as popular with gamers, and as such are much harder to find second-hand.
I have built two deep learning boxes, as documented in New vs used deep-learning machine builds, Part 1 and Part 2, using a combination of new and used parts. I bought one new GTX 1080 Ti, two second-hand GTX 1080 Tis and a second-hand GTX 1080, all at great prices, and have been happy so far with the air-cooled setups.
I have, however, run into times when I would have liked to run heavy neural networks on three GPUs (in PyTorch using DataParallel, or using Keras), either for the extra GPU RAM this would afford, or just to be able to run three experiments simultaneously on one machine. My plan was to put the 3x GTX 1080 Tis in one machine and keep the GTX 1080 in the other machine for prototyping, and potentially fit a fourth GTX 1080 Ti (or an RTX 2080 Ti) in the multi-GPU rig if a second-hand one became available at a reasonable price.
Fitting 3x open-air GPUs inside a machine, and keeping them sufficiently cool when all are running at 100% for long periods, may be doable with a giant high-airflow case and a ton of fans. But I don't think four is realistic; for that you would need an open bitcoin-mining-style rig (which I didn't want due to the lack of noise deadening).
So I decided to watercool my GPUs. At first I found researching watercooling pretty intimidating. There were a lot of concepts and parts, and I didn't know a D5 pump from a 180mm tube reservoir. Probably the biggest difficulty with watercooling is the number of choices you need to make, and the amount of research you need to do.
The priorities for gaming cooling vs deep learning cooling are a little different. Many gaming PC builds I have come across show stripped-out 3.5" HDD bays, large reservoirs, colourful additives and bright lights inside the box, typically with a single pump. I wanted to use the minimum possible space for the pumps (ideally two, one for redundancy) and the reservoir, so that I could still have room for as many 3.5" hard drives (and SSDs) as possible and add drives in the future when needed. A relatively quiet machine was also important.
I decided the best approach was to watercool one GPU at first. I chose the cheapest of my GPUs to experiment with: a Gigabyte GTX 1080. The first difficult task is finding a waterblock compatible with the GPU. The waterblock manufacturers most readily available in Australia are EK and XSPC, and I found a suitable EK block at my local computer store. Now that I know how hard it can be to find a waterblock for your GPU model, I will only buy another GPU for which I can relatively easily source a waterblock.
The next parts I needed to source were the radiator, pump and reservoir. Finding a radiator was fairly straightforward: research what length and width of radiator will fit inside your case. However, you need to pay special attention to the thickness of the radiator; too thick and it won't fit. From online research I have read that, for the same volume, a thinner radiator with a larger length x width face is generally more efficient than a thicker, smaller one.
Radiators come in two fan sizes: 120mm and 140mm. For my Dark Base 900 case I chose a 420mm (i.e. 3x 140mm fans), 30mm-thick radiator, mounted at the top of the case as exhaust with the fans in pull configuration.
So, as per the figures above, I had 3x 140mm fans sitting above the radiator with the fan labels facing upwards.
I found choosing a pump pretty difficult; there are a lot of options. I was after something quiet, and read that Laing D5 pumps were quiet and reliable, but these tended to come as just the bare pump without housing, and it was unclear to me exactly what housing you would need and how you would connect the reservoir.
I spent a lot of time trying to work out what pump and reservoir combination to get, and ended up choosing an all-in-one pump, housing and reservoir that was pretty compact.
After I had bought the main parts of the system, the next stage was to design the layout and buy the loop tubing and connections. Here you have the choice of hard tubing (not recommended for beginners) or flexi-tubing. Flexi-tubing comes in two common internal diameter sizes, and it doesn't matter much which you choose. I chose 10mm ID (3/8") as that was the type available at my local shop.
Then you need fittings to connect the parts of the system. I chose mostly barbs, as these were a fifth of the price of compression fittings, though you do need to buy suitably sized worm-drive hose clamps from your local hardware store to secure the tubing to each barb. I also found it easier to gauge whether a barb (plus hose clamp) was sufficiently tight.
Then came the time to strip the GPU. I was pretty nervous about opening up a GPU to fit the waterblock, but with the aid of YouTube videos the process was pretty simple. In the case of my Gigabyte cards, you just unscrew seven screws from the back of the card, pull off the fan/heatsink housing, clean with isopropyl alcohol, add thermal pads (supplied with the waterblock) to the VRAM and VRMs and thermal paste to the GPU chip, then screw on the waterblock. The whole process took 30 minutes.
I filled the loop with a coloured premix solution, which is a good idea for your first loop: you will be able to quickly spot leaks. Some blogs don't recommend premix due to the effects the additives can have on your pump, but I didn't plan to run it without a change for months on end, so wasn't concerned by this. For a subsequent fill I used distilled water with some biocide and a silver coil placed inside the reservoir.
After watching some YouTube videos about how to jump your PSU (using a copper wire) and fill a loop (use a funnel and a squeeze bottle, with plenty of absorbent paper), I ran the unit overnight.
The next day I re-connected the power supply to the motherboard. After running tests on the GPU I noted it was hitting 90 degrees, and discovered I still had low-noise PWM adapters connected to the fans; they were running at pretty low RPMs. I disconnected these and the GPU temperature dropped significantly. The fans I was using were second-hand Cooler Master case fans, designed for airflow rather than radiator use. Later on I purchased high static pressure Noctua NF-F12 industrialPPC 2000RPM fans for this radiator. You can read more on static pressure vs airflow fans here.
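Keeping an eye on temperatures under load is the quickest way to catch problems like this. A minimal sketch of how I check temps, assuming `nvidia-smi` is on the PATH (the `parse_temps` helper is my own, not part of any library):

```python
import subprocess

# nvidia-smi's --query-gpu interface returns plain CSV that is easy to parse
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]

def parse_temps(csv_text):
    """Map GPU index -> temperature in degrees C from nvidia-smi CSV output."""
    temps = {}
    for line in csv_text.strip().splitlines():
        idx, temp = line.split(",")
        temps[int(idx)] = int(temp)
    return temps

# On the machine itself you would run something like:
#   out = subprocess.run(QUERY, capture_output=True, text=True).stdout
#   print(parse_temps(out))
print(parse_temps("0, 45\n1, 54\n2, 64"))  # sample CSV in the shape nvidia-smi emits
```

Run in a loop with a short sleep, this makes it obvious within minutes whether a fan or pump change has helped.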
The next stage was either to build a new loop in my second machine (Xeon E5-2680 v2) or add to the i7 build documented above. With the ASRock EP2C602-4L/D16 motherboard in my Xeon machine, there is not enough distance between the PCIe slots to fit 4x GPUs side by side.
There is (just) enough room between the 2x GPUs to fit a PCIe riser cable between them, and another PCIe riser cable above GPU0, as indicated by the black arrows below. After experimenting with a PCIe riser and modifying a GPU mounting bracket as shown below to fit a third GPU, I decided using the PCIe riser was just too complicated and fiddly. Plus, having a GPU mounted at the base precludes using a radiator at the base of the case.
Rather than build a loop in the Xeon box as shown above, I decided to use my X299 (i7 7820X) machine with its MSI X299 Tomahawk motherboard for the 3x GPU build, as there is more room between PCIe slots. As you can see in the image below, there is ~6cm between slots 0 and 3 (with metal shields) and ~4cm between slots 3 and 5 (a total of ~10cm from top to bottom PCIe 3.0 slot). This compares with ~2cm between each of slots 0, 1, 2 and 3 on the ASRock EP2C602-4L/D16 Xeon motherboard (a total of ~6cm from top to bottom PCIe 3.0 slot).
However, there were two issues with this motherboard. Most importantly, while the ASRock board supports 4x PCIe 3.0 x16, the Tomahawk only supports x16/x8 with a 28-lane CPU (Intel i7-7820X), though it supports x16/x4/x16/x8 with a 44-lane CPU. A quick Google search showed 44-lane i9 processors priced at ~$1400AUD, way above what I wanted to spend to get 3 GPUs working in one box. The second issue was that I couldn't physically fit a third GPU plus waterblock in the bottom PCIe slot; the power and HDD/reset cables stuck out too much to allow the GPU to seat.
I then looked at alternative X299 motherboards with x8/x8/x8 PCIe modes on a 28-lane CPU (see Tim Dettmers' post on PCIe modes and why x8 is fine here: http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/). I found an EVGA X299 FTW K on Amazon for $175USD which supported x8/x8/x8, and ordered one.
As this board is EATX-sized, I had another issue. My Dark Base 900 could accommodate an EATX motherboard, but not an EATX motherboard together with a radiator vertically mounted at the front of the case (the HDD bays needed to be moved to fit the EATX size). With 3x GPUs, several large radiators were essential, so I went looking on Gumtree for second-hand cases. I found an ex-demo EVGA DG-87 case at my local computer store going cheap and picked it up.
I managed to fit the 3x 1080 Tis with waterblocks onto the EVGA motherboard and into the case, and thought all my troubles were over. But connecting the water-loop fittings to the middle GPU proved very difficult due to the tight fit. Trying different combinations and experimenting with fittings, I found that 90-degree elbow fittings, with the smaller EK waterblock on top, allowed just enough room for the GPUs to seat properly, and I was able to screw in the GPU bracket support screws.
Note how I am using 2x pumps: the EK all-in-one reservoir/pump combo, and a Swiftech D5 in an 'EK-XTOP-REVO-D5' housing. The pumps are connected in parallel using 2x Y-shaped connectors, so that if one fails the other can still work unhindered. I am also using a tap (below the base of the picture, connected to the bottom-most tube), and importantly a plug in the end of the tap, as these taps leak. This lets the loop be drained a little more easily (you can't drain the entire loop with the tap, but it helps).
That had to be it, all done, I thought. But when I tried to power on the system, the EVGA motherboard wouldn't boot from anything but external USB or one drive (which happened to be the 250GB NVMe SSD I was using for deep learning data). I've never seen a motherboard that wouldn't let you select from all connected drives to boot. OK, I thought, easy: I'll just update the BIOS. One hour later, same issue: nothing but the NVMe available to boot from. There was no way I was going to drain the loop, pull the GPUs out, pull out the NVMe and try to get the motherboard to pick up the drive Ubuntu was installed on. So I downloaded Ubuntu 18.04 (I had been meaning to upgrade from 16.04 anyway) and installed it on the NVMe (one bonus being that boots are super fast now).
Finally (after a few NVIDIA driver issues) I got the machine running and tried DataParallel with PyTorch in fastai, i.e. wrapping the model in DataParallel after creating a learner.
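The gist of it, as a minimal plain-PyTorch sketch (the toy model and tensor shapes here are placeholders, not my actual training code; with fastai you wrap `learn.model` the same way):

```python
import torch
import torch.nn as nn

# Toy model standing in for something like ResNet50 / DenseNet121
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

use_multi_gpu = torch.cuda.device_count() > 1
if use_multi_gpu:
    # DataParallel splits each batch across all visible GPUs and gathers
    # the outputs, so the combined GPU RAM is usable for one training run
    model = nn.DataParallel(model).cuda()

x = torch.randn(4, 8)
if use_multi_gpu:
    x = x.cuda()

out = model(x)
print(tuple(out.shape))  # (batch, classes) regardless of GPU count
```

Training then proceeds exactly as it would on a single device; DataParallel handles the scatter/gather per batch.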
I was thinking: wow, after all this work, I'm finally going to be able to train big datasets on ResNet50 or DenseNet121. After a few seconds: boom, the machine lost power and turned off. Oh no, that's bad, I thought. I tried running just GPU 0, then 1, then 2: all OK. Then 0 + 1. I then added up the power use of all the equipment in the machine: 3x 1080 Tis at 250W max power but ~295W peak load, 2x pumps, 3x HDDs, 2x SSDs, 1x NVMe, and the i7-7820X (a heavy user at ~180W peak). It became apparent that even with a 1200W PSU, there could be bursts where, if the 3x GPUs were drawing hard simultaneously, the PSU wouldn't be able to cope. I really didn't want to spend another $500 on a 1600W PSU, or have the hassle of adding a second PSU and an add2PSU unit.
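A back-of-the-envelope tally makes the problem obvious (these are my rough figures from vendor specs, not measured draws):

```python
# Rough peak-draw budget in watts; all figures approximate
gpus   = 3 * 295              # 1080 Ti transient peaks of ~295W each
cpu    = 180                  # i7-7820X under heavy load
pumps  = 2 * 25               # two D5 pumps
drives = 3 * 10 + 2 * 3 + 7   # 3x HDD, 2x SSD, 1x NVMe (rough)
fans   = 50                   # case + radiator fans, controller, margin

total = gpus + cpu + pumps + drives + fans
print(total)  # 1208 -- right at a 1200W PSU's limit, with zero headroom
```

Even if the per-component guesses are off by a little, simultaneous GPU power spikes clearly have nowhere to go.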
Doing some more research, I stumbled across a method that allows you to limit the power drawn by each GPU. Sure, performance may suffer a bit, but the main thing I wanted the 3 GPUs for was the 33GB of GPU RAM, so a bit of a performance hit I can live with.
# enable persistence mode
sudo nvidia-smi -pm 1

# set GPU 0 max power use to 220W
sudo nvidia-smi -i 0 -pl 220
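To apply the same cap across all three cards (e.g. from a startup script), a small helper can generate the commands. This is just a sketch of my approach; `build_commands`, the 220W figure and the GPU count are my own choices, not standard tooling:

```python
import subprocess

POWER_LIMIT_W = 220  # below the 250W board power; tune to your PSU budget
NUM_GPUS = 3

def build_commands(num_gpus, watts):
    """nvidia-smi invocations: persistence mode once, then one power cap per GPU."""
    cmds = [["sudo", "nvidia-smi", "-pm", "1"]]
    for i in range(num_gpus):
        cmds.append(["sudo", "nvidia-smi", "-i", str(i), "-pl", str(watts)])
    return cmds

for cmd in build_commands(NUM_GPUS, POWER_LIMIT_W):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment on the actual machine
```

Note that the power limit resets on reboot, so this needs to run at startup (a systemd unit or rc.local entry works).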
And after a lot of work, I now have 3x GPUs running, with temps of ~45, ~54 and ~64 degrees and fairly low noise (I have fitted a fan controller with manual knobs so I can turn the fans down when sitting next to the machine).
Alphacool NexXxoS ST30 420mm radiator (top)
3 x Noctua NF-A14 FLX fans
Thermaltake Pacific RL420 radiator (front)
3 x Noctua NF-A14 industrialPPC-2000 PWM fans
EK-XRES 100 Revo D5 PWM pump/reservoir combo
Swiftech D5 MCP655-PWM-DRIVE pump
EK-XTOP-REVO-D5 pump housing
EVGA DG-87 case
EVGA X299 FTW K motherboard
Intel i7–7820X CPU
Noctua NH-U12S CPU cooler
Corsair AX1200i PSU
Gigabyte Gtx 1080ti Gaming OC
2x Gigabyte Gtx 1080ti AORUS Xtreme
Watercooling is time-consuming and moderately expensive, and only really necessary if you plan to have multiple open-air GPUs in your machine. If you are able to buy blower-style GPUs and have a motherboard with sufficient PCIe spacing to fit more than 2x GPUs next to each other, I would recommend that as the easiest option.
The lessons I have learnt are: Start with one GPU; your system will cope with the heat extraction more easily than if you jump straight into watercooling multiple GPUs.
Monitor your machine carefully for the first few hours of operation, first with the pump running on its own (for leaks), and then with the GPU(s) running (for leaks and temperature).
Radiator fans make a significant difference to system temperatures; use good high static pressure fans if you can afford them.
A big case with sufficient room for large and/or multiple radiators will help keep system temperatures down, which will then allow you to run your fans at lower RPMs when you're working next to it.
Serial vs parallel layout for the GPUs, pumps and radiators is worth some consideration. See some tips in the links below: