
Deep Learning Benchmarks of NVIDIA Tesla P100 PCIe, Tesla K80, and Tesla M40 GPUs


Sources of CPU benchmarks, used for estimating performance on comparable workloads, have been available throughout the course of CPU development. For example, the Standard Performance Evaluation Corporation has compiled a large set of application benchmarks, running on a range of CPUs, across a multitude of systems. There are certainly benchmarks for GPUs, but only during the past year has an organized set of deep learning benchmarks been published. Called DeepMarks, these deep learning benchmarks are available to all developers who want to get a sense of how their application might perform across various deep learning frameworks.

The benchmarking scripts used for the DeepMarks study are published at GitHub. The original DeepMarks study was run on a Titan X GPU (Maxwell microarchitecture), having 12GB of onboard video memory. Here we will examine the performance of several deep learning frameworks on a range of Tesla GPUs, including the Tesla P100 16GB PCIe, Tesla K80, and Tesla M40 12GB GPUs.

Data from Deep Learning Benchmarks

The deep learning frameworks covered in this benchmark study are TensorFlow, Caffe, Torch, and Theano. All deep learning benchmarks were single-GPU runs. The benchmarking scripts used in this study are the same as those found at DeepMarks. DeepMarks runs a series of benchmarking scripts which report the time required for a framework to process one forward propagation step plus one backpropagation step; the sum of both constitutes one training iteration. The times reported are the times required for one training iteration per batch, in milliseconds.
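As a rough illustration of this measurement, the sketch below times a forward+backward step the way a per-batch benchmark might. It is a minimal sketch, not taken from the DeepMarks scripts; `step_fn` is a hypothetical stand-in for one training iteration of whichever framework is under test.

    import time

    def ms_per_iteration(step_fn, n_warmup=10, n_iters=50):
        """Average time, in milliseconds, for one training iteration
        (one forward propagation step plus one backpropagation step)."""
        for _ in range(n_warmup):   # warm-up runs are excluded from the timing
            step_fn()
        start = time.time()
        for _ in range(n_iters):
            step_fn()
        elapsed = time.time() - start
        return 1000.0 * elapsed / n_iters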

To start, we ran CPU-only trainings of each neural network. We then ran the same trainings on each type of GPU. The plot below depicts the ranges of speedup that were obtained via GPU acceleration.

Plot of deep learning benchmark results across Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs

Figure 1. GPU speedup ranges over CPU-only trainings – geometrically averaged across all four framework types and all four neural network types.

If we expand the plot and show the speedups for the different types of neural networks, we see that some types of networks undergo a larger speedup than others.

Plot of deep learning benchmark speedups (with geometric averages) for each network on Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs

Figure 2. GPU speedups over CPU-only trainings – geometrically averaged across all four deep learning frameworks. The speedup ranges from Figure 1 are uncollapsed into values for each neural network architecture.

If we take a step back and look at the ranges of speedups the GPUs provide, there is a fairly wide range of speedup. The plot below shows the full range of speedups measured (without geometrically averaging across the various deep learning frameworks). Note that the ranges are widened and become overlapped.

Plot of deep learning benchmark results (without geometric averages) across Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs

Figure 3. Speedup factor ranges without geometric averaging across frameworks. Range is taken across the set of runtimes for all framework/network pairs.

We believe the ranges resulting from geometric averaging across frameworks (as shown in Figure 1) produce narrower distributions and appear to be a more accurate quality measure than what is shown in Figure 3. However, it is instructive to expand the plot from Figure 3 to show each deep learning framework. These ranges, as shown below, demonstrate that your neural network training time will strongly depend on which deep learning framework you select.

Plot of deep learning benchmark results for each framework (without geometric averages) across Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs

Figure 4. GPU speedups over CPU-only trainings – showing the range of speedups when training four neural network types. The speedup ranges from Figure 3 are uncollapsed into values for each deep learning framework.

As shown in all four plots above, the Tesla P100 PCIe GPU provides the fastest speedups for neural network training. With that in mind, the plot below shows the raw training times for each type of neural network on each of the four deep learning frameworks.

Plot of deep learning benchmark training iteration times for each framework on Tesla P100 16GB PCIe GPUs

Figure 5. Training iteration times (in milliseconds) for each deep learning framework and neural network architecture (as measured on the Tesla P100 16GB PCIe GPU).

We provide more discussion below. For reference, we have listed the measurements from each set of tests.

Tesla P100 16GB PCIe Benchmark Results

Framework                           AlexNet   Overfeat   GoogLeNet   VGG (ver. a)   Speedup Over CPU
Caffe                                    80        288         279            393   35x ~ 70x
TensorFlow                               46        144         253            277   16x ~ 40x
Theano                                  161        482         624          2,075   19x ~ 43x
cuDNN-fp32 (Torch)                       44        107         247            222   33x ~ 41x
geometric average over frameworks        71        215         331            473   29x ~ 42x

Table 1: Benchmarks were run on a single Tesla P100 16GB PCIe GPU. Times reported are in msec per batch. The batch size for all training iterations measured for runtime in this study is 128, except for VGG net, which uses a batch size of 64.
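As a check on how the speedup column relates to the raw numbers: each speedup is simply the CPU-only runtime divided by the GPU runtime for the same framework and network. For example, using the Caffe AlexNet times from Tables 1 and 4:

    # Speedup = CPU-only runtime / GPU runtime (both in msec per batch).
    cpu_ms = 4529   # Caffe AlexNet on dual Xeon E5-2690v4 (Table 4)
    gpu_ms = 80     # Caffe AlexNet on Tesla P100 16GB PCIe (Table 1)
    print(f"{cpu_ms / gpu_ms:.1f}x")  # ~56.6x, within Caffe's 35x ~ 70x range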

Tesla K80 Benchmark Results

Framework                           AlexNet   Overfeat   GoogLeNet   VGG (ver. a)   Speedup Over CPU
Caffe                                   365      1,187       1,236          1,747   9x ~ 15x
TensorFlow                              181        622         979          1,104   4x ~ 10x
Theano                                  515      1,716       1,793              —   8x ~ 16x
cuDNN-fp32 (Torch)                      171        379         914            743   9x ~ 12x
geometric average over frameworks       276        832       1,187          1,127   9x ~ 11x

Table 2: Benchmarks were run on a single Tesla K80 GPU chip. Times reported are in msec per batch; a dash indicates that no time was reported for that combination.

Tesla M40 Benchmark Results

Framework                           AlexNet   Overfeat   GoogLeNet   VGG (ver. a)   Speedup Over CPU
Caffe                                   128        448         468            637   22x ~ 53x
TensorFlow                               82        273         418            498   10x ~ 22x
Theano                                  245        786         963              —   17x ~ 28x
cuDNN-fp32 (Torch)                       79        182         433            400   19x ~ 22x
geometric average over frameworks       119        364         534            506   20x ~ 27x

Table 3: Benchmarks were run on a single Tesla M40 GPU. Times reported are in msec per batch; a dash indicates that no time was reported for that combination.

CPU-only Benchmark Results

Framework                           AlexNet   Overfeat   GoogLeNet   VGG (ver. a)
Caffe                                 4,529     10,350      18,545         14,010
TensorFlow                            1,823      5,275       4,018          7,341
Theano                                5,275     13,579      26,829         38,687
cuDNN-fp32 (Torch)                    1,838      3,604       8,234          9,166
geometric average over frameworks     2,991      7,190      11,326         13,819

Table 4: Benchmarks were run on dual Xeon E5-2690v4 processors in a system with 256GB RAM. Times reported are in msec per batch.

Discussion

When geometric averaging is applied across framework runtimes, a range of speedup values is derived for each GPU, as shown in Figure 1. CPU times are also averaged geometrically across framework type. These results indicate that the greatest speedups are realized with the Tesla P100, with the Tesla M40 ranking second, and the Tesla K80 yielding the lowest speedup factors. Figure 2 shows the range of speedup values by network architecture, uncollapsed from the ranges shown in Figure 1.
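For reference, the geometric mean used throughout is the n-th root of the product of the n per-framework values. A minimal version, with illustrative numbers rather than measurements from this study:

    import numpy as np

    def geometric_mean(values):
        """n-th root of the product of n values, computed via logs for stability."""
        values = np.asarray(values, dtype=float)
        return float(np.exp(np.log(values).mean()))

    # Illustrative per-framework speedups for one network/GPU combination:
    print(geometric_mean([35.0, 16.0, 19.0, 33.0]))  # ~24.3

Compared with the arithmetic mean, the geometric mean is less dominated by a single framework's outlier runtime, which is why the averaged ranges in Figure 1 are narrower.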

The speedup ranges for runtimes not geometrically averaged across frameworks are shown in Figure 3. Here the set of all runtimes corresponding to each framework/network pair is considered when determining the range of speedups for each GPU type. Figure 4 shows the speedup ranges by framework, uncollapsed from the ranges shown in Figure 3. The degree of overlap in Figure 3 suggests that geometric averaging across framework type yields a better measure of GPU performance, with narrower and more distinct ranges resulting for each GPU type, as shown in Figure 1.

The greatest speedups were observed when comparing Caffe forward+backpropagation runtime to CPU runtime when solving the GoogLeNet network model. Caffe generally showed speedups larger than any other framework for this comparison, ranging from 35x to ~70x (see Figure 4 and Table 1). Despite the higher speedups, Caffe does not turn out to be the best-performing framework on these benchmarks (see Figure 5). When comparing runtimes on the Tesla P100, Torch performs best and has the shortest runtimes (see Figure 5). Note that although the VGG net tends to be the slowest of all, it does train faster than GoogLeNet when run on the Torch framework (see Figure 5).

The data show that Theano and TensorFlow display similar speedups on GPUs (see Figure 4). Despite the fact that Theano sometimes has larger speedups than Torch, Torch and TensorFlow outperform Theano. While Torch and TensorFlow yield similar performance, Torch performs slightly better with most network/GPU combinations. However, TensorFlow outperforms Torch in most cases for CPU-only training (see Table 4).

Theano is outperformed by all other frameworks, across all benchmark measurements and devices (see Tables 1 – 4). Figure 5 shows the large runtimes for Theano compared to other frameworks run on the Tesla P100. It should be noted that since VGG net was run with a batch size of only 64, compared to 128 for all other network architectures, the runtimes can sometimes be faster with VGG net than with GoogLeNet. See, for example, the runtimes for Torch on GoogLeNet compared to VGG net, across all GPU devices (Tables 1 – 3).
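One way to make the two batch sizes comparable is to normalize each runtime to milliseconds per image. Using the Torch times from Table 1 as an example:

    # msec per image = (msec per batch) / (images per batch)
    def ms_per_image(ms_per_batch, batch_size):
        return ms_per_batch / batch_size

    print(ms_per_image(222, 64))   # Torch VGG on P100:       ~3.5 ms/image
    print(ms_per_image(247, 128))  # Torch GoogLeNet on P100: ~1.9 ms/image

On a per-image basis, VGG remains the slower of the two; the smaller batch is what makes its per-batch time look faster.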

Deep Learning Benchmark Conclusions

The single-GPU benchmark results show that speedups over CPU increase from the Tesla K80, to the Tesla M40, and finally to the Tesla P100, which yields the greatest speedups (Table 5, Figure 1) and fastest runtimes (Table 6).

Range of Speedups, by GPU type

Tesla P100 16GB PCIe   Tesla M40 12GB   Tesla K80
19x ~ 70x              10x ~ 53x        4x ~ 16x

Table 5: Measured speedups for running various deep learning frameworks on GPUs (see Tables 1 – 3)

Fastest Runtime for VGG net, by GPU type

Tesla P100 16GB PCIe   Tesla M40 12GB   Tesla K80
222                    408              743

Table 6: Best runtimes (msec / batch) across all frameworks for VGG net (ver. a). The Torch framework provides the best VGG runtimes, across all GPU types.

The results show that of the tested GPUs, the Tesla P100 16GB PCIe yields the fastest runtimes and also offers the best speedup over CPU-only runs. Regardless of which deep learning framework you prefer, these GPUs offer valuable performance boosts.

Benchmark Setup

Microway’s GPU Test Drive compute nodes were used in this study. Each is configured with 256GB of system memory and dual 14-core Intel Xeon E5-2690v4 processors (with a base frequency of 2.6GHz and a Turbo Boost frequency of 3.5GHz). Identical benchmark workloads were run on the Tesla P100 16GB PCIe, Tesla K80, and Tesla M40 GPUs. The batch size is 128 for all runtimes reported, except for VGG net (which uses a batch size of 64). All deep learning frameworks were linked to the NVIDIA cuDNN library (v5.1), instead of their own native deep network libraries. This is because linking to cuDNN yields better performance than using the native library of each framework.

When running benchmarks of Theano, slightly better runtimes resulted when CNMeM, a CUDA memory manager, is used to manage the GPU’s memory. By setting lib.cnmem=0.95, the GPU device will have CNMeM manage 95% of its memory:
THEANO_FLAGS='floatX=float32,device=gpu0,lib.cnmem=0.95,allow_gc=True' python …

Notes on Tesla M40 versus Tesla K80

The data demonstrate that the Tesla M40 outperforms the Tesla K80. When geometrically averaging runtimes across frameworks, the speedup of the Tesla K80 ranges from 9x to 11x, while for the Tesla M40, speedups range from 20x to 27x. The same relationship exists when comparing ranges without geometric averaging. This result is expected, considering that the Tesla K80 card consists of two separate GK210 GPU chips (connected by a PCIe switch on the GPU card). Since the benchmarks here were run on single GPU chips, the benchmarks reflect only half the throughput possible on a Tesla K80 GPU. If running a perfectly parallel job, or two separate jobs, the Tesla K80 should be expected to approach the throughput of a Tesla M40.
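As a sketch of how two independent jobs could be pinned to the two GK210 chips: the CUDA_VISIBLE_DEVICES environment variable is the standard mechanism for restricting a process to one device, while the script name train.py here is a hypothetical placeholder for a training run.

    import os
    import subprocess

    # Launch one training job per GK210 chip of the Tesla K80.
    # Each K80 card exposes its two chips as two CUDA devices.
    jobs = []
    for device_id in ("0", "1"):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=device_id)
        jobs.append(subprocess.Popen(["python", "train.py"], env=env))

    for job in jobs:
        job.wait()  # both chips run concurrently; wait for both to finish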

Singularity Containers

Singularity is a new type of container designed specifically for HPC environments. Singularity enables the user to define an environment within the container, which might include customized deep learning frameworks, NVIDIA device drivers, and the CUDA 8.0 toolkit. The user can copy and transport this container as a single file, bringing their customized environment to a different machine where the host OS and base hardware may be completely different. The container will process the workflow within it to execute in the host’s OS environment, just as it does in its internal container environment. The workflow is pre-defined inside of the container, including any necessary library files, packages, configuration files, environment variables, and so on.

In order to facilitate benchmarking of four different deep learning frameworks, Singularity containers were created separately for Caffe, TensorFlow, Theano, and Torch. Given its simplicity and powerful capabilities, you should expect to hear more about Singularity soon.
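As a hypothetical usage sketch (the image name caffe.img and the script benchmark.py are illustrative, not the study's actual files), a benchmark could be launched inside one of these containers with the standard singularity exec command:

    import subprocess

    # Run a benchmark script inside a framework-specific Singularity image.
    # "caffe.img" and "benchmark.py" are placeholder names.
    subprocess.run(
        ["singularity", "exec", "caffe.img", "python", "benchmark.py"],
        check=True,
    )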

References

DeepMarks
Deep Learning Benchmarks published on GitHub

Singularity
Containers for Full User Control of Environment

AlexNet
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems. 2012.

Overfeat
Sermanet, Pierre, et al. “Overfeat: Integrated recognition, localization and detection using convolutional networks.” arXiv preprint arXiv:1312.6229 (2013).

GoogLeNet
Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

VGG Net
Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).