In addition to the innovative physical design changes that Zen 2 brings to the table, there are also numerous low-level improvements. These improvements are how AMD is targeting IPC parity with Intel.
The first and largest improvement has nothing to do with Zen versus Zen 2 per se; instead it has to do with the fabrication node. By shifting to the 7nm node, AMD claims Instructions Per Clock are increased by a full 15 percent over Zen 1.0 (not Zen+). This is not because the cores are clocked higher, which does not matter in IPC calculations; rather, it is because each component takes up less room. Combined with the less complex CCX design, this allows more space in the CCD for cache and for the inclusion of ‘smarter’ cache designs. It also means more time and room for designing an improved branch prediction unit that is bigger and smarter than its predecessor.
In modern CPUs, an enormous amount of resources is spent on the integrated branch prediction unit. This is part of the “front end” of a processor, which not only receives the incoming execution requests but guesses how a set of instructions (or ‘branch’) will go before it is actually complete. How it guesses at future needs involves a lot of math. In simplistic terms, if someone says “hand me 1, hand me 2, hand me 3”, you are going to assume the next request is “4”. Congratulations: you just predicted the future of that ‘branch’ of requests using historical data you stored in your ‘cache’. This is basically what branch prediction algorithms boil down to: using the recent history of already processed requests to preload as much as possible, so that the core is not wasting cycles idling. This is sometimes called ‘pre-fetching’. The smarter the predictor unit, the higher the IPC the processor is rated for.
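The “hand me 1, 2, 3… guess 4” idea can be sketched in a few lines. This is purely illustrative, not AMD’s actual algorithm: remember each pattern of recent requests, and guess the value that followed that pattern last time.

```python
# Toy history-based predictor (illustrative only, not a real CPU algorithm):
# remember what followed each recently-seen pattern and guess it next time.

class SimplePredictor:
    def __init__(self, history_len=3):
        self.history_len = history_len
        self.table = {}        # pattern -> value that followed it last time
        self.history = []      # sliding window of recent requests

    def predict(self):
        """Guess the next request from stored history, or None if unseen."""
        return self.table.get(tuple(self.history[-self.history_len:]))

    def record(self, value):
        """Record what actually happened so future guesses improve."""
        key = tuple(self.history[-self.history_len:])
        if len(key) == self.history_len:
            self.table[key] = value
        self.history.append(value)

p = SimplePredictor()
for v in [1, 2, 3, 4, 1, 2, 3]:
    p.record(v)
print(p.predict())   # having seen 1,2,3 -> 4 before, it now guesses 4
```

A real branch predictor works on branch addresses and taken/not-taken outcomes rather than integers, but the principle is the same: stored history drives the guess.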
Zen 2’s branch prediction unit consists of numerous parts. The first is an all-new branch predictor which uses TAGE branch prediction. TAGE stands for TAgged GEometric branch prediction and uses a hybrid approach. What this means is it not only keeps a history of the predictions it has made but tracks the accuracy of its guesses (‘tagging’ them). Then, when it needs to make a new branch prediction, it looks at the historical predictions it has made, picks the ones that fit what it is working on, and when multiple records are found… picks the one that it got right more often than the other choices.
To imagine this, let’s build on the example we already used. If someone asks you for 1, then 2, then 3 multiple times a day, and you guessed 4 the first time and got it right, but guessed 5 the next time and got it wrong, TAGE will tell you to guess ‘4’ the next time that scenario pops up. The downside is that TAGE is not exactly fast compared to simpler prediction methods. This is why TAGE is only being used for L2 branch predictions, not L1… and not even all L2 predictions. Alongside TAGE is the previous generation’s L2 Branch Target Buffer (BTB); this simpler (but faster) approach is now backstopped by a 7K-entry (or history) capacity buffer, which may not guess as accurately as TAGE, but with such a large history buffer to look through, the loss of processing cycles is minimized. Basically, and overly simplified, TAGE is used for complex and large branches where a little bit of time up front can save a lot of cycles later, and the BTB is for shorter and smaller branch predictions where the latency penalty would be greater than the cost of a ‘missed’ prediction.
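The “tagging” idea from the example above can be sketched as follows. This is a loose simplification, not the real geometric-history-length TAGE algorithm: every guess made for a pattern is scored by how often it turned out right, and the best-scoring guess wins.

```python
# Loose sketch of the "tagged" idea behind TAGE (greatly simplified):
# score every guess ever made for a pattern and prefer the best track record.

from collections import defaultdict

class TaggedPredictor:
    def __init__(self):
        # pattern -> {candidate prediction: accuracy score}
        self.scores = defaultdict(lambda: defaultdict(int))

    def predict(self, pattern):
        candidates = self.scores.get(pattern)
        if not candidates:
            return None
        # Pick the candidate that was right most often relative to misses.
        return max(candidates, key=candidates.get)

    def update(self, pattern, guess, actual):
        # "Tag" the guess: reward a hit, penalize a miss.
        self.scores[pattern][guess] += 1 if guess == actual else -1

t = TaggedPredictor()
t.update((1, 2, 3), guess=4, actual=4)   # guessed 4, was right
t.update((1, 2, 3), guess=5, actual=4)   # guessed 5, was wrong
t.update((1, 2, 3), guess=4, actual=4)   # right again
print(t.predict((1, 2, 3)))              # -> 4
```

Note the extra bookkeeping compared to the simple predictor: this is exactly the overhead that makes TAGE slower, and why Zen 2 reserves it for the branches where the payoff justifies it.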
The L1 is still using a Branch Target Buffer as Zen 1 and Zen+ did, but it too has been doubled, from 256 to 512 entries. Interestingly enough, the L1i cache is now only 32KB instead of 64KB per core as in the previous Zen generations… but as it is 8-way instead of 4-way set associative, some of the difference in capacity is made up for in efficiency. Further making up for this small loss in L1 instruction cache, the macro-operation (MOP) cache has doubled to an impressive 4K entries. The MOP cache allows the processor to decode complex operations into macro-operations once, stick the result in cache, and then skip the decoding cycles when it encounters that operation again in the future. The more cache, the more MOPs it saves; the more MOPs, the more cycles it saves; and the more cycles saved, the higher the IPC.
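The MOP cache is, in effect, memoization in silicon. A minimal sketch of the idea (the decode step here is a made-up stand-in for the real multi-stage decoder):

```python
# The MOP cache idea in miniature: decode an instruction once, cache the
# result, and skip the expensive decode on every later encounter.

decode_count = 0

def decode(instruction):
    """Stand-in for the slow, multi-cycle decode step."""
    global decode_count
    decode_count += 1
    return ("decoded", instruction)

mop_cache = {}

def fetch(instruction):
    if instruction not in mop_cache:      # MOP-cache miss: pay for decode
        mop_cache[instruction] = decode(instruction)
    return mop_cache[instruction]         # hit: decoded form for free

for instr in ["add", "mul", "add", "add", "mul"]:
    fetch(instr)
print(decode_count)   # only 2 decodes for 5 fetches
```

Doubling the cache to 4K entries simply means more distinct operations can sit in `mop_cache` before anything has to be evicted and re-decoded.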
To further help IPC, AMD’s Zen 2 is now (finally) using a 256-bit wide Floating Point Unit data path. Amongst other things (such as generating a single MOP instead of two for 256-bit instructions… and thus saving cycles and cache space), Zen 2 is now capable of handling ‘double-wide’ AVX-256 instructions in a single cycle. While this will not result in a doubling of AVX performance over original Zen models… it does indeed decrease wasted cycles, once again improving overall performance. In the same vein, FP multiplication latency has also been decreased from 4 cycles to 3 cycles.
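The saving described above is simple arithmetic. A back-of-the-envelope sketch (deliberately simplified; real instruction throughput depends on far more than datapath width):

```python
# Simplified illustration: a 128-bit datapath must split each 256-bit AVX
# instruction into two ops, while a 256-bit datapath issues it as one.

def ops_needed(instruction_width_bits, datapath_width_bits):
    # Ceiling division: passes through the datapath per instruction.
    return -(-instruction_width_bits // datapath_width_bits)

print(ops_needed(256, 128))  # Zen/Zen+ style: 2 ops per AVX-256 instruction
print(ops_needed(256, 256))  # Zen 2: 1 op - fewer cycles, less MOP-cache use
```

Halving the op count per 256-bit instruction is also where the MOP-cache saving comes from: one cached entry instead of two.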
On top of all these low-level changes is another, one that everyone will be glad to see… and immediately grasp the impact of. That is the Infinity Fabric interconnect. Due to the new chiplet configuration, even more data will be traveling over the Infinity Fabric interconnect at any given time. This is just a fact of life for Zen 2 processors. With the last generation of Zen (be it Zen or Zen+), Infinity Fabric was a major bottleneck for a lot of users. To ensure that this increase in load does not result in an even larger bottleneck, the all-new ‘Infinity Fabric 2’ is a PCIe 4.0 based interconnect. This change means that even at the exact same frequency, IF2’s bandwidth is double that of Zen 1 or Zen+. To be precise, instead of being 256 bits wide it is now 512 bits wide. AMD also states they have improved the overall efficiency of IF2 by 27 percent, further decreasing bottlenecks on this critical interconnect.
Also helping to alleviate concerns over this perceived handicap, IF2 has been decoupled from the memory clock… though not entirely. At memory speeds of DDR4-3733 (a 1866.5MHz clock for IF2) or lower, the IF2 and memory clocks are linked at a 1:1 ratio. Above this it is a 2:1 ratio, with the IF2 running at half the memory clock. This certainly takes stress off the memory and IF2 controllers when higher-frequency memory is used, but it also means even more careful consideration has to be paid to memory speed when selecting a DDR4 kit to pair with the new Ryzen 3000 processors. For example, ultra ‘fast’ and expensive DDR4-4266 memory will result in the same IF2 clock as if DDR4-2133 were used. AMD recommends DDR4-3600 or DDR4-3733 for this very reason. So, while the doubling of bandwidth from doubling the bus width of IF2 will reduce bottlenecks, it is not the slam-dunk success it could have been if the Infinity Fabric 2 clock had been entirely decoupled from the memory clock. It is, however, a great step in the right direction.
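The clock rule above is easy to express as a function. A small sketch (recall that a DDR4-XXXX rating is transfers per second, so the memory clock is half that number):

```python
# Sketch of the IF2 clock rule: 1:1 with the memory clock up to DDR4-3733,
# 2:1 (half the memory clock) above it.

def if2_clock_mhz(ddr4_rating):
    mem_clock = ddr4_rating / 2          # DDR: two transfers per clock
    if ddr4_rating <= 3733:
        return mem_clock                 # 1:1 ratio
    return mem_clock / 2                 # 2:1 ratio above DDR4-3733

print(if2_clock_mhz(3733))   # 1866.5 MHz - the 1:1 sweet spot
print(if2_clock_mhz(4266))   # 1066.5 MHz - no better than...
print(if2_clock_mhz(2133))   # 1066.5 MHz - ...bargain DDR4-2133
```

This is why DDR4-3600/3733 is the recommended pairing: it is the fastest memory that still keeps the fabric at the full 1:1 ratio.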
Another low-level improvement is in regards to how the processor handles frequencies. With the Ryzen 1000-series, certain models came with two specifications: those on the box, and an XFR rating which could boost above the specified maximum boost rating. With the Zen+ based Ryzen 2000-series, this prioritization of single-thread performance was changed to overall performance. This was done via ‘Precision Boost’, which allowed the processor to have much more fine-grained control over each core’s frequency than the boost algorithm in Ryzen 1000-series CPUs.
With the Zen 2 based Ryzen 3000-series, this Precision Boost feature has been further improved. Much like Precision Boost 1.0, Precision Boost 2.0 is an opportunistic algorithm which will actively monitor each core and provide additional frequency boost to those that need it. As with PB 1.0, it will do this in 1ms time slices. However, there is no longer a lower clock-speed limit for when more than 2 cores are active. Instead, it looks at all the cores, and as long as PPT (Package Power Tracking), TDC (Thermal Design Current), EDC (Electrical Design Current), and temperatures are within tolerances… it will work to maximize all active cores’ frequencies. So, for example, if three cores are active and three are not, the three inactive cores will have their frequency and power consumption drastically reduced, and that additional headroom will be applied to the three active cores. Temperature, though, is actually the most critical deciding factor. If the CPU considers its temperature to be close to TJMax, it will not push the cores as hard. So, the cooler the processor runs, the harder it can push the active cores.
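The per-time-slice logic described above can be sketched roughly as follows. This is a highly simplified illustration, not AMD’s firmware; all of the numeric values (idle clock, floor, step size) are invented for the example:

```python
# Toy sketch of an opportunistic boost step: if PPT/TDC/EDC and temperature
# are all within limits, nudge every active core up; otherwise back off.
# All numbers are illustrative, not real Ryzen values.

def boost_step(freqs, active, ppt_ok, tdc_ok, edc_ok, temp_c,
               tj_max=95.0, max_boost=4500, step=25):
    headroom = ppt_ok and tdc_ok and edc_ok and temp_c < tj_max
    new_freqs = []
    for f, is_active in zip(freqs, active):
        if not is_active:
            new_freqs.append(2200)                      # park idle cores low
        elif headroom:
            new_freqs.append(min(f + step, max_boost))  # never past max boost
        else:
            new_freqs.append(max(f - step, 3600))       # back off toward base
    return new_freqs

freqs = [4400, 4400, 4400, 4400]
active = [True, True, True, False]
freqs = boost_step(freqs, active, True, True, True, temp_c=60.0)
print(freqs)   # -> [4425, 4425, 4425, 2200]
```

Run every millisecond with real telemetry, a loop like this converges on the highest frequencies the power, current, and thermal budgets will allow.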
With that said, what it will not do under any circumstances is boost a core beyond the specified maximum boost frequency. For example, with a Ryzen 7 3800X this means no core will ever be pushed beyond 4.5GHz. However, it can, under some circumstances, push more than one core to this level… even all of them (though that is highly, highly… highly unlikely). Much like XFR/XFR2 and PB 1.0, the new Precision Boost 2 is enabled by default and works ‘out of the box’ with zero input needed from you. This is a nifty little feature that is very reminiscent of how modern video cards handle things internally.
Also now included are two interesting features. The first is Precision Boost Overdrive (PBO), which made its debut last generation. Unlike PB, PBO has to be manually enabled. PBO will push higher frequencies when the CPU is running hot, BUT you cannot change the temperature it considers to be ‘too hot’. So while PBO will give more performance to a hot-running CPU than PB… it is not going to be night-and-day different. Instead, where you will see the most impact is on cool-running processors. Here it will boost overall performance by allowing ‘enhancements’ to the other three metrics the CPU uses in its calculations: the maximum power consumption allowed for the socket and CPU (PPT), the maximum sustained current that can be pushed through a given motherboard’s VRM/power delivery subsystem (TDC), and the maximum peak current for that subsystem (EDC).
Depending on how good the motherboard is, the BIOS may indeed contain options that border on insane for these metrics… and allow you to tell the processor’s built-in algorithms that it really is a more powerful/higher-TDP processor than it actually is. Just understand that temperatures are going to skyrocket quickly if you let PBO use the equivalent settings of a 195W TDP chip on a 95W TDP processor. Also, do not be surprised if you see voltage peaks of 1.5V when you manually enable PBO, so good aftermarket cooling is all but a requirement to successfully see a difference with PBO over PB. What also must be taken into consideration is that PBO will void your warranty. Whether or not AMD can tell if you used PBO is another thing, but their literature is very specific: they consider PBO to be the same as manual overclocking.
If all this is not confusing enough, there is another new feature that sits on top of (or at least to the side of) PBO. That is ‘AutoOC’. AutoOC allows for upwards of 200MHz over the rated maximum boost frequency. It is not XFR. It is applied on top of PBO and allows one or all cores to hit higher than the processor’s rated ‘maximum’ frequency, though once again with all the limitations already listed taken into consideration. As such, AutoOC will be of very limited value to those without great cooling solutions, so do not expect to see upwards of a 200MHz boost outside of single active core / dual thread scenarios all that often. It too will technically void your warranty.
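Conceptually, AutoOC is just an offset on the boost ceiling; everything else (PPT, TDC, EDC, temperature) still applies. A minimal sketch, assuming the function name and parameters (these are illustrative, not an AMD API):

```python
# Illustrative sketch of AutoOC: raise the rated boost ceiling by up to
# 200MHz. Function and parameter names are made up for this example.

def effective_max_boost_mhz(rated_boost_mhz, auto_oc_offset_mhz=0):
    # AutoOC offsets are capped at +200MHz over the rated maximum boost.
    return rated_boost_mhz + min(auto_oc_offset_mhz, 200)

print(effective_max_boost_mhz(4500))                          # stock: 4500
print(effective_max_boost_mhz(4500, auto_oc_offset_mhz=200))  # AutoOC: 4700
```

Whether any core actually reaches that raised ceiling still depends on the same power, current, and thermal checks Precision Boost 2 always performs, which is why great cooling is a prerequisite.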