The lifespan of Pascal video cards turned out to be quite long. The older models have been on the market for more than two years and will remain on sale for some time yet. During this period we saw new solutions based on the Volta architecture, but they remained the preserve of specialized compute accelerators. The only gaming product in the Volta family was the TITAN V, released in a small run at an extremely high price. Now, though, the moment has come for the launch of a new generation that should change everything. The new Turing-based graphics cards not only bring another performance boost, they introduce a number of technological innovations and are the first gaming solutions to support real-time ray tracing. Accordingly, even the familiar GeForce GTX name has been changed to GeForce RTX. In this review we will cover the features of the Turing architecture and the technical parameters of the new GPUs. Practical acquaintance with the video cards, including testing and comparison with older NVIDIA models, will be the subject of subsequent reviews.
GeForce RTX graphics cards
The Turing family brings several key changes: a completely new GPU architecture, new types of computing units (tensor and RT cores), and accelerated shader processing.
At the moment, three video cards have been announced: the GeForce RTX 2080 Ti, GeForce RTX 2080 and GeForce RTX 2070. All of them are based on different Turing GPUs. The top model received the most powerful processor, the TU102, whose die is shown on the slide below.
First we will give a block diagram of each new GPU and describe the general characteristics of the video cards, and then examine the architectural changes in detail. All processors are manufactured on a 12nm FinFET process. They retain the cluster structure, in which the GPU consists of several GPCs; the performance of each specific chip is scaled by changing the number of such clusters.
TU102 (GeForce RTX 2080 Ti)
The flagship TU102 graphics processor consists of 18.6 billion transistors with a die area of 754 mm². Compared with the GP102 (GeForce GTX 1080 Ti), the area of the new chip and its transistor count have grown by 55-60%. The TU102 has a total of six GPC clusters, each containing six TPC texture processing clusters that combine SM multiprocessor blocks. The latter have been noticeably reorganized and include new units, which will be discussed in more detail below. Each SM block has 64 main computing units (CUDA cores). With 72 SMs, this gives a total of 4608 stream processors. However, the GeForce RTX 2080 Ti GPU (like the GeForce GTX 1080 Ti in its time) has been cut down a bit: the top graphics card has four SMs disabled, leaving 68 active and bringing the total number of stream processors to 4352. This solution also has 544 new tensor cores, 68 RT cores, 272 texture units and 88 ROPs.
For comparison, recall that the GP102-based GeForce GTX 1080 Ti had only 3584 CUDA cores and 224 texture units, so the growth in compute resources in the new TU102 is quite significant. The memory bus remains 352-bit, but new GDDR6 memory chips are used with an effective data rate equivalent to 14 GHz. The 11 GB of memory matches the old flagship, and it is quite enough for modern games at high resolutions.
Judging by the block diagram, the TU102 processor has a total of twelve 32-bit memory controllers, so the chip can work with a 384-bit interface. Perhaps we will see such a bus, along with the full 4608 stream processors, in a new Titan. The L2 cache of the GeForce RTX 2080 Ti is 5632 KB: the full chip evidently carries 6 MB of L2, which is slightly cut down along with the bus.
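The unit counts above follow directly from the SM hierarchy. A minimal Python sketch, using the per-SM ratios given in the text (64 CUDA cores, 8 tensor cores, 1 RT core and 4 texture units per SM):

```python
# Illustrative sketch (not NVIDIA tooling): deriving Turing unit counts
# from the cluster hierarchy described above.

def turing_units(active_sms):
    """Unit counts for a Turing GPU with the given number of active SMs."""
    return {
        "cuda_cores":    active_sms * 64,  # 64 stream processors per SM
        "tensor_cores":  active_sms * 8,   # 8 tensor cores per SM
        "rt_cores":      active_sms * 1,   # 1 RT core per SM
        "texture_units": active_sms * 4,   # 4 texture units per SM
    }

# Full TU102: 6 GPCs x 6 TPCs x 2 SMs per TPC = 72 SMs -> 4608 CUDA cores
full_tu102 = turing_units(6 * 6 * 2)

# GeForce RTX 2080 Ti ships with 68 of the 72 SMs enabled
rtx_2080_ti = turing_units(68)
```

Plugging in 68 SMs reproduces the 4352 stream processors, 544 tensor cores, 68 RT cores and 272 texture units quoted above.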
TU104 (GeForce RTX 2080)
The next processor in the hierarchy, the TU104, has a configuration of six GPC clusters with four TPCs each. In the previous Pascal generation, the internal cluster structure was identical for mid-range and top-level solutions, and only the budget GPUs had fewer TPCs. This TU104 configuration is probably optimal for maintaining a balance of performance and flexible resource management: the number of clusters matches the top-end GPU, but each cluster is weaker. In this chip, 46 of the 48 SM units are active, which gives 2944 CUDA cores, 368 tensor cores, 46 RT cores and 184 texture units. The L2 cache is 4 MB, twice the L2 of the GP104 (GeForce GTX 1080).
TU106 (GeForce RTX 2070)
The premiere of a third chip for the GeForce RTX 2070 came as a surprise. By analogy with previous generations, one might have expected a simple cut-down version of the older video card's processor. Instead, the GeForce RTX 2070 is based on the TU106 GPU, with three standard clusters of six TPCs each. The total number of stream processors is 2304, with 288 tensor cores, 36 RT cores and 144 texture units. A direct comparison of the GeForce RTX 2070 with the GeForce RTX 2080 gives a 28% difference in compute units. The L2 cache remains 4 MB.
The TU104 and TU106 have a 256-bit memory bus (eight 32-bit controllers). Both video cards use GDDR6 memory with an effective frequency of 14 GHz, which increases memory bandwidth relative to the previous generation.
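The bandwidth figures are simple arithmetic: bus width in bytes times the effective per-pin data rate. A quick sketch (the 10 Gb/s GDDR5X figure for the GeForce GTX 1080 is added here for comparison):

```python
# Back-of-the-envelope memory bandwidth in GB/s:
# (bus width in bits / 8) bytes per transfer * effective rate in GT/s.

def bandwidth_gb_s(bus_bits, gbps_per_pin):
    return bus_bits / 8 * gbps_per_pin

rtx_2080_ti = bandwidth_gb_s(352, 14)  # 352-bit GDDR6 at 14 GT/s
rtx_2080    = bandwidth_gb_s(256, 14)  # 256-bit GDDR6 at 14 GT/s
gtx_1080    = bandwidth_gb_s(256, 10)  # 256-bit GDDR5X at 10 GT/s
```

This gives 616 GB/s for the GeForce RTX 2080 Ti and 448 GB/s for the RTX 2080/2070, versus 320 GB/s for the GTX 1080, a 40% raw uplift before the improved compression discussed later.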
As you can see, the overall configuration of compute units is quite powerful even for the junior GPU, not to mention the top TU102. And on top of that come the new functional blocks. As a result, Turing chips are complex and rather large dies: the TU102 has 18.6 billion transistors, the TU104 13.6 billion, and the TU106 10.8 billion. Even with the move to a 12nm process, then, we do not see an increase in operating frequencies. For the GeForce RTX 2080 Ti, the declared base clock is only 1350 MHz with a Boost Clock of up to 1635 MHz. The lower GPUs have higher operating frequencies, but they remain roughly at the level of the Pascal models.
One interesting point concerns the frequencies. For the first time, NVIDIA specifies different Boost frequencies for the same base values. In the official specifications, the Founders Edition models manufactured by NVIDIA itself carry higher Boost values, while the remaining cards are designated as Reference, which is misleading, since we are used to associating reference versions with the Founders Edition. We had the opportunity to quickly compare a video card from NVIDIA with a model from another manufacturer, and in reality the difference in frequencies is minimal, so the differing specifications are nothing to worry about. With good cooling, the performance of all GeForce RTX cards in the same series will be similar. The exception may be the rare models with a blower-type cooler that some partners have announced.
| Video adapter | GeForce RTX 2080 Ti | GeForce RTX 2080 | GeForce RTX 2070 |
| --- | --- | --- | --- |
| Number of transistors, million | 18600 | 13600 | 10800 |
| Process technology, nm | 12 | 12 | 12 |
| Die area, mm² | 754 | 545 | 445 |
| CUDA stream processors | 4352 | 2944 | 2304 |
| Tensor cores | 544 | 368 | 288 |
| RT cores | 68 | 46 | 36 |
| Texture units | 272 | 184 | 144 |
| ROPs | 88 | 64 | 64 |
| Base core clock, MHz | 1350 | 1515 | 1410 |
| Boost core clock, MHz (Reference) | 1545 | 1710 | 1620 |
| Boost core clock, MHz (Founders Edition) | 1635 | 1800 | 1710 |
| Memory bus, bit | 352 | 256 | 256 |
| Effective memory frequency, MHz | 14000 | 14000 | 14000 |
| Memory size, GB | 11 | 8 | 8 |
| DirectX support | 12 | 12 | 12 |
| Interface | PCI-E 3.0 | PCI-E 3.0 | PCI-E 3.0 |
| Official price | MSRP $999 | | |
The TDP of the new video cards remains approximately at the old level: 260 W is declared for the GeForce RTX 2080 Ti Founders Edition and 250 W for partner versions. For the GeForce RTX 2080 the figures are 225 W and 215 W, which is higher than the TDP of the GeForce GTX 1080 series but generally acceptable for top products.
After a general overview of the new GPUs, let’s talk directly about the innovations of the Turing architecture.
Turing Architecture Features
Important changes have taken place at the level of the SM multiprocessor blocks, which have the same structure in all Turing GPU variants. The new architecture inherits capabilities from both the Volta compute architecture and the Pascal gaming architecture. All computing units inside an SM are grouped into four processing arrays, each with its own control logic (register file, scheduler). There are 64 stream processors in one SM, and these computing units can now simultaneously execute integer (INT32) and floating-point (FP32) operations; on the SM diagram they are shown as separate functional blocks. Notably, Pascal had 128 CUDA cores per SM, but INT and FP calculations were performed sequentially.
According to NVIDIA, integer instructions account for up to 36% of the work in modern game shaders, so executing both types of operations concurrently in one thread should significantly speed up overall computation. At the same time, there is a certain imbalance here, since full duplication of INT32 and FP32 capacity is not strictly necessary for games; such a structure can, however, be relevant for non-gaming computation and tasks.
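A toy issue model makes the benefit concrete. This is an illustration only, not a cycle-accurate simulation; it just contrasts one shared pipe with two overlapped pipes, using NVIDIA's figure of roughly 36 integer instructions per 100 FP instructions:

```python
# Toy model: Pascal runs INT32 and FP32 through the same CUDA cores in
# sequence, while Turing's separate INT32 datapath overlaps the two.

def pascal_cycles(fp_ops, int_ops):
    return fp_ops + int_ops        # one shared pipe: work is serialized

def turing_cycles(fp_ops, int_ops):
    return max(fp_ops, int_ops)    # two pipes: the longer stream dominates

# ~36 integer instructions per 100 FP instructions (NVIDIA's figure)
speedup = pascal_cycles(100, 36) / turing_cycles(100, 36)
```

Under this simplified model the concurrent issue yields a 1.36x speedup on such a shader mix, which is in line with the idea that the integer work effectively becomes free.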
The updated unified L1 cache structure lets the TPC pipeline work with it more efficiently. While the total L1 size stays at 96 KB, latency is lower and overall throughput can increase up to twofold. The shared L2 cache has also been enlarged in all processors; for example, the TU102 GPU carries 6 MB instead of the 3 MB of the old GP102.
There are also completely new blocks: each SM contains eight tensor cores for machine learning math and one RT (ray tracing) core for ray tracing calculations. But even without the new blocks and new rendering capabilities, NVIDIA claims an average shader performance increase of about 50%, which sounds quite impressive; in VR the increase is twofold or even higher. This looks very optimistic, and over the course of the article we will reveal the many nuances that produce such a combined effect.
Once again, the frame buffer data compression algorithms have been improved, reducing the number of accesses to external memory. Combined with GDDR6 chips operating at 14 Gb/s, up to a 50% increase in effective throughput is claimed. Some users were alarmed that the GeForce RTX 2080 Ti retains 11 GB of memory while the GeForce RTX 2080/2070 get 8 GB each, since this is at the level of existing Pascal models. However, this volume is still enough for high resolutions, and Turing, in theory, works with memory even more efficiently.
Turing chips support new feature levels of DirectX 12, and asynchronous compute has been improved. The new architecture also has a number of improvements for accelerated shader processing.
Mesh Shading offers a new unified geometry pipeline that replaces the vertex, geometry and tessellation shaders. It is a more manageable pipeline with new shader types, Task Shaders and Mesh Shaders, which can process the geometry of a whole group of objects at once, reducing the total number of draw calls.
Mesh Shading will be effective in scenes with many objects and complex geometry, allowing more flexible LOD control. At the DirectX 12 level, it can be implemented via NVAPI. Mesh Shading support will also be added to OpenGL and Vulkan.
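The draw-call saving can be sketched with a conceptual model in plain Python (this is not a graphics API; the group size of 32 objects per task-shader workgroup is an arbitrary value chosen for illustration):

```python
# Conceptual model: the classic pipeline issues one CPU draw call per
# visible object; with task/mesh shaders one dispatch covers a whole
# group of objects, and culling/LOD selection happens on the GPU.

def classic_draw_calls(objects):
    # one call per object that survives CPU-side visibility culling
    return sum(1 for visible in objects if visible)

def mesh_shading_dispatches(objects, group_size=32):
    # one task-shader workgroup per group_size objects, visible or not;
    # the GPU culls inside the workgroup (ceiling division)
    return -(-len(objects) // group_size)

scene = [True, False, True] * 100   # 300 objects, 200 of them visible
```

For this made-up scene, 200 draw calls collapse into 10 dispatches; the point is that CPU submission cost stops scaling with object count.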
The Variable Rate Shading (VRS) technology looks promising. It allows the shading quality to be adjusted per 4×4 pixel block, enabling flexible optimization. For example, the periphery of the image may already be blurred by motion blur effects, where high sample accuracy does not matter. This is very relevant for racing games, where the road and surroundings at the edges of the frame are often blurry.
Three algorithms use VRS:
- Content Adaptive Shading — lowers the shading rate in zones where color changes little;
- Motion Adaptive Shading — variable quality for moving objects;
- Foveated Rendering — reduced quality for areas out of focus.
All this requires implementation by developers. However, VRS can actually improve performance. It is also one of the factors that reduce the load on video memory.
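The content-adaptive case can be illustrated with a small sketch. The variance thresholds below are invented for the example; real engines would use their own heuristics:

```python
# Illustrative sketch of content-adaptive shading: pick a coarser
# shading rate for 4x4 pixel tiles whose color varies little.

def shading_rate(tile_variance):
    """Return (w, h): how many pixels share one shader invocation."""
    if tile_variance < 0.01:   # nearly flat tile (sky, blurred road)
        return (4, 4)
    if tile_variance < 0.1:    # slowly changing gradient
        return (2, 2)
    return (1, 1)              # detailed tile: full-rate shading

def invocations_per_tile(rate):
    # a tile is 4x4 = 16 pixels; coarser rates shade fewer samples
    w, h = rate
    return (4 // w) * (4 // h)
```

A flat tile drops from 16 shader invocations to 1, which is where both the performance gain and the reduced memory traffic come from.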
Turing supports the new Texture Space Shading (TSS) model. Shader results are stored in memory in a special texture space, from where they can be reused later. TSS allows such texels to be used across temporal rendering and different coordinate systems.
TSS is one of the elements that accelerate VR processing. Each eye sees a similar image, so when rendering the right-eye frame, data from the left-eye frame is reused, and only those texels for which no suitable samples exist are shaded again.
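A toy model of this stereo reuse (texels are represented here by simple integer keys purely for illustration):

```python
# Toy model of Texture Space Shading for stereo VR: shade the left eye,
# cache results keyed by texture-space coordinates, and for the right
# eye re-shade only texels that have no cached sample.

def shade_stereo(left_texels, right_texels):
    cache = {}
    shaded = 0
    for t in left_texels:       # full shading pass for the left eye
        cache[t] = "shaded"
        shaded += 1
    for t in right_texels:      # right eye reuses overlapping texels
        if t not in cache:
            cache[t] = "shaded"
            shaded += 1
    return shaded
```

With 1000 texels per eye and 90% overlap between the views, only 1100 shader evaluations are needed instead of 2000.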
Turing's tensor cores are upgraded Volta cores, needed for tasks that use artificial intelligence. These units support INT8, INT4 and FP16 modes when working with matrix data for real-time deep learning. Each tensor core performs up to 64 FMA (fused multiply-add) operations per clock on FP16 inputs, so one SM with eight cores processes 512 FP16 operations per clock. INT8 calculations run at twice that rate, 1024 operations per clock, and INT4 at 2048. The top-end TU102 GPU is capable of a peak tensor performance of up to 130.5 TFLOPS (in the Quadro RTX 6000).
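The per-SM rates and the headline TFLOPS figure fit together arithmetically. In the sketch below, the 72 SMs and the roughly 1.77 GHz boost clock for the Quadro RTX 6000 are our assumptions, back-solved from the quoted 130.5 TFLOPS:

```python
# Tensor throughput per SM, from the rates quoted above:
# 8 tensor cores x 64 FP16 FMA = 512 FP16 ops/clock, INT8 2x, INT4 4x.

def sm_tensor_ops_per_clock(precision):
    base = 8 * 64   # eight tensor cores, 64 FP16 FMA each, per clock
    return {"FP16": base, "INT8": base * 2, "INT4": base * 4}[precision]

def peak_tensor_tflops(sms, boost_ghz):
    # each FMA counts as 2 floating-point operations
    return sms * sm_tensor_ops_per_clock("FP16") * 2 * boost_ghz / 1000
```

With the full 72 SMs and an assumed ~1.77 GHz boost clock, the formula lands on roughly the 130.5 TFLOPS cited for the Quadro RTX 6000.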
NVIDIA has been working in the field of artificial intelligence for a long time. Until recently, however, technologies based on trained neural networks seemed to be the preserve of highly specialized fields and large data centers. With the arrival of Turing the situation is changing, because we get not only a hardware platform but also new software features. NVIDIA NGX (Neural Graphics Acceleration) integrates artificial intelligence capabilities, allowing deep learning to be used to improve graphics and visual output.
On the basis of NGX, the AI Super Rez resolution-upscaling technology, the InPainting technology for restoring fragments of photos, and several other interesting features have already been implemented.
But the most important is Deep Learning Super Sampling (DLSS) anti-aliasing, an evolution of Temporal Anti-Aliasing (TAA) using Turing's new AI capabilities. TAA is currently the most common anti-aliasing method and costs only a few percent of performance. It uses data from the previous frame to sample the new one; while it smooths edges well, it introduces a certain blurring and jitter, especially in motion. DLSS uses a specially trained neural network for faster and better sampling, producing a sharper picture at an even lower performance cost.
DLSS anti-aliasing looks very promising, and it is easy to integrate into games, which should help its adoption. Interestingly, NVIDIA's charts show a very significant fps increase when DLSS is enabled. The reason is that DLSS allows different sampling modes, and in some of them it is, in effect, reconstructing the final image from a lower-resolution one, so it really can speed up rendering. Bear in mind also that many games already use adaptive resolution together with TAA, and not every user is aware of such fine-grained settings. If automatic tuning switches such a user to DLSS, they will get a noticeable improvement in image quality along with a real increase in performance.
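A rough model shows why reconstruction raises fps. The 50% render scale below is a hypothetical value for illustration, not a documented DLSS setting:

```python
# Rough model: shading a frame at a reduced internal resolution and
# reconstructing to the target resolution cuts pixel-shading work
# roughly by the square of the scale factor.

def shaded_pixels(width, height, render_scale=1.0):
    return int(width * render_scale) * int(height * render_scale)

native  = shaded_pixels(3840, 2160)        # native 4K frame
dlss_in = shaded_pixels(3840, 2160, 0.5)   # hypothetical 50% scale input
savings = 1 - dlss_in / native             # fraction of shading avoided
```

At a 50% scale, three quarters of the pixel-shading work disappears, which is why reconstruction-style modes can post large fps gains even before the quality comparison with TAA is considered.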
At the moment, we know about the integration of DLSS into the Unreal Engine and Unity engines. And the list of games that will add this anti-aliasing is constantly growing.