
Yes, the L2 has the same ~17-cycle latency penalty as before… but the lessons learned in the Lunar Lake chips have not only been carried over but further expanded upon. The result is much greater consistency from the L2 cache. Specifically, three things have been improved.
The first is that the Age-Based and Round-Robin logic has been enhanced with a new Critical Path Arbitration layer. So, instead of a simplistic “age-based” scheme that often got it wrong, the new algorithm analyzes the instruction dependency chain being worked on. If it sees a bottleneck approaching, this arbiter will grab the critical requests and push them to the front of the line… even if the next request in the queue has been waiting around and is about to “age out.” This reduces wasted cycles and increases consistency.
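For the code-inclined, here is a rough mental model of that change. To be crystal clear, this is a minimal Python sketch and not Intel’s actual logic (which they do not publish); the request fields, and especially the `dep_chain_depth` metric, are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class L2Request:
    core_id: int
    age: int              # cycles this request has waited in the queue
    dep_chain_depth: int  # instructions stalled behind this request (hypothetical metric)

def pick_age_based(queue: list[L2Request]) -> L2Request:
    """Old-school arbitration: the longest-waiting request wins."""
    return max(queue, key=lambda r: r.age)

def pick_critical_path(queue: list[L2Request]) -> L2Request:
    """Sketch of the new idea: the request that unblocks the deepest
    dependency chain jumps the line, even past an older request."""
    return max(queue, key=lambda r: (r.dep_chain_depth, r.age))

# Toy example: core 1's request is younger but gates a 12-deep chain.
queue = [L2Request(core_id=0, age=40, dep_chain_depth=2),
         L2Request(core_id=1, age=5, dep_chain_depth=12)]
print(pick_age_based(queue).core_id)      # 0 -- the oldest wins
print(pick_critical_path(queue).core_id)  # 1 -- the critical path wins
```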

Next, the L2 cache also bakes in a refined conflict avoidance algorithm. All the P-cores are once again relying on L2 cache banks that have a limited number of ports. If two (or more) P-cores demand data from the same bank at the same time, they slam into each other and stall out. There are pros and cons to increasing port counts, so instead, Intel has opted for a more intelligent hash-based mapping routine that tries to spread the data out over the entirety of the L2 banks and move it around if it sees a potential fight coming down the pipe. This in turn allows both 512-bit read ports to run as close to a consistent 100% utilization as possible, once again improving consistency even if it does not radically improve theoretical performance.
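To show why that matters (with a toy example only; the real hash function and bank count are not public), here is how a hashed mapping scatters a strided access pattern that would otherwise hammer a single bank:

```python
NUM_BANKS = 8   # illustrative; the actual L2 bank count is not public
LINE_SHIFT = 6  # 64-byte cache lines

def bank_simple(addr: int) -> int:
    """Naive mapping: the low line-address bits pick the bank, so
    power-of-two strides from multiple cores pile onto one bank."""
    return (addr >> LINE_SHIFT) % NUM_BANKS

def bank_hashed(addr: int) -> int:
    """Sketch of hash-based mapping: XOR-fold higher address bits into
    the index so regular strides scatter across all the banks."""
    line = addr >> LINE_SHIFT
    return (line ^ (line >> 3) ^ (line >> 7)) % NUM_BANKS

# A 512-byte stride hammers a single bank under the naive scheme...
addrs = [i * 512 for i in range(16)]
print({bank_simple(a) for a in addrs})  # {0} -- every access conflicts
print({bank_hashed(a) for a in addrs})  # {0..7} -- spread over all banks
```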
Lastly, the L2 arbiter has been rewritten to bring snoop requests under the same logic. Basically, when the NPU or iGPU thinks there is dirty data in the L2 cache, it sends in a “snoop agent” to check it. This takes cycles and can jam up a bank until the snoop process is complete, be it a simple “yup, still good” or extended house cleaning that yeets outdated data and replaces it. With Cougar Cove, the priority of snoop requests has been reduced, and their cycles can now even be interleaved so as to further reduce the real-world performance impact. Taken together, these improvements turn what was a veritable “first-come, first-served” gatekeeper into a more holistic, multi-algorithm approach aimed at reducing wasted cycles and improving real-world performance.
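As a hedged sketch of the scheduling side (the actual priority scheme is Intel’s secret sauce; these weights are invented for illustration), the gist is that snoops now rank below demand loads and a long snoop can be sliced across cycles instead of hogging a bank:

```python
import heapq

# Lower number = served first. Purely illustrative values; the point is
# that snoops no longer outrank demand loads.
PRIO_DEMAND_LOAD = 0
PRIO_SNOOP = 2

def schedule(requests):
    """requests: list of (kind, cycles_of_work). Yields one unit of
    work per cycle; an unfinished snoop re-queues at low priority, so
    demand loads slot in between its slices rather than stalling."""
    heap = []
    for seq, (kind, work) in enumerate(requests):
        prio = PRIO_DEMAND_LOAD if kind == "load" else PRIO_SNOOP
        heapq.heappush(heap, (prio, seq, kind, work))
    while heap:
        prio, seq, kind, work = heapq.heappop(heap)
        yield kind
        if work > 1:  # snoop not done yet; back of the low-priority line
            heapq.heappush(heap, (prio, seq, kind, work - 1))

# A 3-cycle snoop arrives first, yet the loads behind it are not stuck:
print(list(schedule([("snoop", 3), ("load", 1), ("load", 1)])))
# -> ['load', 'load', 'snoop', 'snoop', 'snoop']
```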
Moving on to the L3 cache: here we also see noticeable improvements. If one were to consider the topology of the cores and L3 cache banks, one could easily be forgiven for calling it a large, “Timey-Wimey” (or the dot in the “i” of Jeremy Bearimy) asymmetric ring bus that had a latency of upwards of 84 freaking cycles. Data routinely had to go halfway around the ring just to get to its destination.

Lunar Lake improved on this by moving to a smaller, isolated ring and shaved the latency down to ~51 cycles. However, the E-cores were yeeted from it entirely; the ring was solely for the P-cores. Furthermore, Intel moved the E-cores to the SoC tile, so P-to-E and E-to-P handoffs were straight-up trash. On the plus side, it did allow for a non-inclusive design where the L3 did not have to hold a copy of all the L2 data. There was still random wonkiness, but it worked way better than the dumpster fire many thought it would be.
Since this generation uses a “Big Compute Tile,” a small ring was not an option. Instead, Intel went with an asymmetric unified ring that tries to implement the best of both design philosophies. It is one where the latency is more consistent (in the mid-50-cycle range), and since the E-cores are on it, it does not suffer the same handoff fate as Lunar Lake. Furthermore, Intel has gone for a downright aggressive non-inclusive algorithm that not only allows but, in certain circumstances, “encourages” the arbiter not to straight-copy the L2 into the L3. Instead, it gives the L3 more room for moderate (and even low) probability guesses that would not fit if it were storing a perfect 1:1 copy of the L2.
Of course, the downside is that in L2 bank conflicts, the arbiter cannot just point one of the offending P-cores to the L3 and tell it to grab the data from there, but it is a step in the right direction. Overall, it allows this mobile processor to have a more desktop-like level of L3 latency, which is a Great (if long overdue) Thing™.
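For those who want a mental model of what “aggressively non-inclusive” actually means, here is a toy fill policy. The fields and the 0.2 probability threshold are invented for illustration; Intel has not published the real heuristics:

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    addr: int
    owned_by_l2: bool       # a P-core's private L2 currently holds this line
    core_local_reuse: bool  # predicted reuse stays on that same core

def l3_should_cache(line: CacheLine, hit_probability: float) -> bool:
    """Toy non-inclusive fill policy. A strictly inclusive L3 would
    always return True, since it must mirror every L2 line. Here, a
    line already sitting in an L2 with core-local reuse is NOT
    duplicated (which is also why a bank conflict cannot simply be
    redirected to the L3). The freed capacity goes to moderate- and
    even low-probability guesses instead."""
    if line.owned_by_l2 and line.core_local_reuse:
        return False  # skip the 1:1 copy; the L2 has it covered
    return hit_probability >= 0.2  # invented threshold

hot_l2_line = CacheLine(addr=0x1000, owned_by_l2=True, core_local_reuse=True)
prefetch_guess = CacheLine(addr=0x2000, owned_by_l2=False, core_local_reuse=False)
print(l3_should_cache(hot_l2_line, hit_probability=0.9))    # False -- no copy
print(l3_should_cache(prefetch_guess, hit_probability=0.3)) # True -- worth a slot
```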

All of those improvements are great; however, the fact remains that Cougar Cove P-cores run “up to” 400MHz slower than the Lion Cove P-cores found in the 2-series. In theory, in a perfect world where thermal envelopes don’t matter, this would be a major issue for Cougar Cove. However, by moving from frontside power delivery (FSPD) to backside power delivery (BSPD, or in Intel-speak, “PowerVia”), Intel has pulled off a major win on two different yet highly critical fronts. Together, these first negate the frequency deficit and then push real-world performance beyond what one could expect from the 2-series generation. The first front is that PowerVia results in lower vDroop. This means less additional voltage is needed to overcome the dreaded “droop,” which in turn means less heat is needlessly created.
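The “why” is straightforward physics: dynamic power scales with the square of voltage, so every millivolt of guardband that is no longer needed pays off twice over. A quick back-of-the-envelope sketch, using made-up illustrative numbers rather than anything measured:

```python
def dynamic_power(c_eff: float, volts: float, freq_hz: float) -> float:
    """Classic CMOS dynamic power approximation: P ~ C_eff * V^2 * f."""
    return c_eff * volts**2 * freq_hz

# All values below are invented for illustration only.
C_EFF = 1.1e-9   # effective switched capacitance (F)
FREQ = 4.9e9     # boost clock (Hz)
V_NOM = 1.10     # voltage the core actually needs (V)

# Frontside delivery: pad the rail with a fat guardband to ride out droop.
# Backside (PowerVia): droop shrinks, so the guardband shrinks with it.
p_fspd = dynamic_power(C_EFF, V_NOM + 0.060, FREQ)
p_bspd = dynamic_power(C_EFF, V_NOM + 0.025, FREQ)
print(f"{(1 - p_bspd / p_fspd) * 100:.1f}% less dynamic power")  # ~5.9%
```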

By moving the power delivery and its thermal overhead to the back, there is also less material between the core’s heat and the Integrated Heat Spreader. This means that core temperatures are lower and stay lower for longer. This, in turn, means that these Performance cores can hit their maximum boost numbers and stay there longer before they start to thermally limit (aka throttle). Mix in a node shrink and a transistor technology upgrade, moving from first-gen 3nm (TSMC “N3B”) with its older FinFET transistors to cutting-edge 1.8nm (aptly named “Intel 18A”) with RibbonFET, where the gate now completely surrounds the channel on all four sides to massively reduce power leakage, and these cores don’t just hit their “up to” specifications; they stick there much, much longer. Think twice as long compared to their predecessor, and even longer than competing AMD processors.
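To put a rough model behind that claim, here is a first-order thermal RC sketch showing how trimming both the power draw and the thermal resistance to the heat spreader stretches (or outright eliminates) the time-to-throttle. Every value here is invented for illustration:

```python
import math

def time_to_throttle(power_w: float, r_th: float, c_th: float,
                     t_amb: float = 35.0, t_limit: float = 100.0) -> float:
    """First-order thermal RC model: T(t) = T_amb + P*R*(1 - e^(-t/RC)).
    Returns seconds until the die hits the throttle limit, or infinity
    if the steady-state temperature never gets there."""
    t_steady = t_amb + power_w * r_th
    if t_steady <= t_limit:
        return math.inf  # boost clock is sustainable indefinitely
    tau = r_th * c_th
    return -tau * math.log(1 - (t_limit - t_amb) / (power_w * r_th))

# Hypothetical before/after: PowerVia trims the power draw a bit, and the
# thinner stack between transistors and IHS lowers thermal resistance.
print(time_to_throttle(power_w=38, r_th=2.0, c_th=12))  # ~46 s, then throttle
print(time_to_throttle(power_w=35, r_th=1.7, c_th=12))  # inf -- never throttles
```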

Put simply, when one combines this flatter thermal curve with the noticeable reduction in off-die calls for memory, these four P-cores can outperform the previous generation’s six P-cores in the real world, not just theoretically but consistently. This is especially true in work-related tasks where the P-cores are being called on for more than just “blips” of 100% activation.
