
Yes, the L2 has the same ~17-cycle performance penalty as before… but the lessons learned in the Lunar Lake chips have not only been carried over but further expanded upon, resulting in much greater consistency from the L2 cache. Namely, three things have been improved.
The first is that the Age-Based and Round-Robin logic has been enhanced with a new Critical Path Arbitration layer. So instead of simplistic ‘age’-based logic that often got it wrong, the new algorithm analyzes the instruction dependency chain being worked on, and if it sees a bottleneck approaching, this arbiter will grab the proper instructions and push them to the front of the line… even if the next in the queue was waiting around and about to age out/off the L2 data banks. Thus reducing wasted cycles and increasing consistency.
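To make the idea concrete, here is a toy sketch of the difference between a pure age-based pick and a dependency-aware one. This is purely illustrative (the request fields, the weight, and the scoring are our own assumptions, not Intel's actual logic):

```python
# Toy sketch (not Intel's actual logic) contrasting pure age-based arbitration
# with a critical-path-aware arbiter. Field names and weights are illustrative.
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    age: int          # cycles spent waiting in the L2 queue
    chain_depth: int  # length of the dependency chain stalled behind this load

def age_based(queue):
    """Old-style arbiter: the oldest request wins, regardless of what it blocks."""
    return max(queue, key=lambda r: r.age)

def critical_path(queue, depth_weight=4):
    """Arbiter that also scores how much work a request unblocks, so a
    young-but-critical load can jump the line."""
    return max(queue, key=lambda r: r.age + depth_weight * r.chain_depth)

queue = [
    Request("old_independent_load", age=10, chain_depth=0),
    Request("young_bottleneck_load", age=2, chain_depth=5),
]

print(age_based(queue).name)      # old_independent_load
print(critical_path(queue).name)  # young_bottleneck_load
```

The point of the second function: the nearly aged-out load no longer automatically wins if a younger request is holding up a long chain of dependent instructions.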

Ironically, the next improvement baked into the L2 cache is a refined conflict avoidance algorithm. All the P-cores are once again relying on L2 cache banks that have a limited number of ports… and if two (or more) P-cores demand the same cached data… they slam into each other and stall out. There are pros and cons to increasing port numbers, so instead, Intel has opted for a smarter hash-based mapping routine that spreads the data out over the entirety of the L2 banks and then moves it around if it sees a potential fight coming down the pipe. This, in turn, allows both 512-bit read ports to run as close to a consistent 100% utilization as possible. Once again, improving consistency even if it does not radically improve theoretical performance.
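A toy example of why a hashed mapping beats a naive one. The hash below is our own invention (Intel has not published theirs), but it shows the failure mode: with a plain modulo mapping, a power-of-two access stride piles every request onto one bank, while folding higher address bits into the index spreads them out:

```python
# Toy sketch of bank-conflict avoidance (illustrative, not Intel's hash).
# A naive modulo mapping sends strided addresses to the same bank; XOR-folding
# higher address bits into the index spreads them across all banks.
NUM_BANKS = 4

def naive_bank(addr):
    return addr % NUM_BANKS

def hashed_bank(addr):
    # Fold higher address bits into the index so power-of-two strides
    # no longer all collide on one bank.
    return (addr ^ (addr >> 2) ^ (addr >> 4)) % NUM_BANKS

# A core streaming with a stride of 4 lines: worst case for naive modulo.
addrs = [i * 4 for i in range(16)]

naive_banks = {naive_bank(a) for a in addrs}
hashed_banks = {hashed_bank(a) for a in addrs}

print(len(naive_banks))   # 1 -> every access piles onto a single bank
print(len(hashed_banks))  # 4 -> accesses spread over all banks
```

Same data, same banks, same ports; only the mapping changed, and the worst-case pile-up disappears.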
Lastly, the L2 arbiter has been rewritten to include snoop requests. Basically, when the NPU or iGPU thinks there is dirty data in the L2 cache, it sends in a snoop agent to check it, which takes cycles and can jam up a bank until the snoop process is complete (be it a simple “yup, still good” or extended housecleaning that yeets outdated data and replaces it). With Cougar Cove, the snoop algorithm’s priority has been reduced, and its cycles can now even be interleaved so as to further reduce real-world performance impact. Thus, all these improvements turn what was a veritable “first-come, first-served” gatekeeper approach into a more holistic multi-algorithmic approach, which aims to reduce wasted cycles and improve real-world performance.
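The interleaving point is easiest to see in a toy bank-cycle model. Again, this is a sketch under our own assumptions (fixed snoop length, one request per cycle), not Intel's scheduler:

```python
# Toy model (purely illustrative) of de-prioritized, interleaved snoops.
# A snoop needs several bank cycles; instead of holding the bank until it
# finishes, the new-style arbiter slices it up and runs core reads in between.
from collections import deque

def run(core_reads, snoop_cycles, interleave):
    bank_log = []
    snoops_left = snoop_cycles
    reads = deque(core_reads)
    if not interleave:
        # Old behaviour: the snoop occupies the bank until done, reads wait.
        bank_log += ["snoop"] * snoops_left
        snoops_left = 0
    while reads or snoops_left:
        if reads:
            bank_log.append(reads.popleft())
        if interleave and snoops_left:
            bank_log.append("snoop")
            snoops_left -= 1
    return bank_log

# First core read completes on cycle 1 when interleaved, cycle 4 otherwise.
print(run(["read_A", "read_B"], snoop_cycles=3, interleave=False))
print(run(["read_A", "read_B"], snoop_cycles=3, interleave=True))
```

The snoop still gets all of its cycles either way; it just stops monopolizing the bank while a core is waiting.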
Moving on to the L3 cache, here we also see noticeable improvements. If one were to consider the topology of the Cores and L3 Cache banks, one could easily be forgiven for calling it a Large, Timey-Wimey / the dot in the i of Jeremy Bearimy, Asymmetric Ring bus… one that had a latency of upwards of 84 freaking cycles, with data routinely having to go halfway around the ring to get to its destination.

Lunar Lake improved on this by going to a small(er) but isolated ring and shaved the latency down to ~51 cycles… but the E-Cores were yeeted from it; it was solely for the P-Cores. Furthermore, Intel moved the E-cores to the SoC tile, so P-to-E and/or E-to-P handoffs were straight up trash… but it did allow for non-inclusive data usage where the L3 did not have to hold all the L2 data in it. Which worked way better than anyone thought it would. Still random wonkiness, but surprisingly better than the dumpster fire many thought it would be.
As it is a Big Compute Tile, “small” was not an option, but what Intel did was go to an asymmetric unified ring that tries to implement the best of both design philosophies. One where the latency is more consistent (in the mid-50s of cycles) but, since the E-cores are on it, doesn’t suffer the same fate as Lunar Lake. Furthermore, Intel has gone for a downright aggressive non-inclusive algorithm that not only allows but, in certain circumstances, “encourages” the arbiter to not straight copy the L2 to the L3. Instead, it allows the L3 to have more room for moderate (and even low) probability guesses that would not fit if it were storing a perfect 1:1 copy of the L2. Of course, the downside is that in L2 bank conflicts the arbiter can’t just point one of the offending P-cores at the L3 and tell it to grab the data from there… but it is a step in the right direction. Overall, it allows this mobile processor to have a more desktop-level of L3 latency, which is a Great (if long overdue) Thing™.
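The capacity argument for going non-inclusive is just arithmetic. The sizes below are hypothetical stand-ins (not Cougar Cove's actual cache sizes), but the relationship holds for any sizes:

```python
# Toy capacity math (hypothetical sizes) for inclusive vs non-inclusive L3.
# Inclusive: the L3 must shadow every L2 line, so those bytes are spent twice.
# Non-inclusive: the L3 is free to hold speculative / victim lines instead.
L2_KB_PER_CORE, CORES, L3_KB = 3072, 4, 12288  # illustrative numbers only

total_l2 = L2_KB_PER_CORE * CORES

inclusive_unique = total_l2 + (L3_KB - total_l2)  # L3 duplicates all of L2
non_inclusive_unique = total_l2 + L3_KB           # L3 holds distinct lines

print(inclusive_unique)      # 12288 KB of unique data on-die
print(non_inclusive_unique)  # 24576 KB of unique data on-die
```

With these made-up numbers the non-inclusive policy doubles the unique data held on-die, which is exactly the room the arbiter uses for those moderate-probability guesses.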

All of those improvements are great… however, the fact remains Cougar Cove P-Cores are running “up to” 400MHz slower than the Lion Cove P-Cores found in the 2-series. In theory, and in a perfect world where thermal envelopes don’t matter, this would be a major issue for Cougar Cove. However, by moving away from front-side power delivery (FSPD) to backside power delivery (BSPD… or in Intel speak, “PowerVia”), Intel has pulled off a major win on two different and yet highly critical fronts. Both of which first negate the difference in frequencies… and then improve real-world performance beyond what one could expect from the 2-series generation. The first front is the fact that PowerVia results in lower vDroop, meaning less additional voltage is needed to overcome the droop, which in turn means less heat is needlessly being created.
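A back-of-envelope sketch of why that guard band matters. The voltages below are our own illustrative numbers (Intel has not published these figures); the takeaway is the V² relationship, not the exact percentages:

```python
# Back-of-envelope sketch (illustrative numbers, not Intel's figures) of why
# lower vDroop saves heat: dynamic power scales roughly with C * V^2 * f, and
# the guard-band voltage added to ride out droop is paid on every clock.
V_NOM = 1.10        # target core voltage (hypothetical)
DROOP_FSPD = 0.08   # extra volts budgeted for droop with front-side delivery
DROOP_BSPD = 0.03   # smaller guard band with backside delivery (PowerVia)

def dyn_power_ratio(v_supply, v_ref=V_NOM):
    """Relative dynamic power at the same frequency: P ~ C * V^2 * f."""
    return (v_supply / v_ref) ** 2

fspd = dyn_power_ratio(V_NOM + DROOP_FSPD)
bspd = dyn_power_ratio(V_NOM + DROOP_BSPD)

print(f"FSPD power vs nominal: {fspd:.3f}x")
print(f"BSPD power vs nominal: {bspd:.3f}x")
print(f"Heat saved by the smaller guard band: {(1 - bspd / fspd) * 100:.1f}%")
```

With these assumed numbers the smaller guard band shaves several percent of dynamic power at identical frequency, heat the core never has to dissipate in the first place.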

By moving the power (and its thermal overhead) to the back, there is also less in between the cores’ heat and the Integrated Heat Spreader. Meaning that core temperatures are lower and stay lower longer. Which means that these Performance cores can hit their max boost numbers and stay there longer before they start to thermally limit (aka throttle). Mix in the decrease in node size and the jump in technology from first-gen 3nm (TSMC “N3B”) with older FinFET tech to cutting-edge 1.8nm (ironically called “Intel 18A”) and RibbonFET (the gate now completely surrounds the channel on all four sides, massively reducing power leakage!)… and these cores don’t just hit their “up to” specifications, they stick there much, much longer. Think twice as long compared to their predecessor and even longer than AMD processors.

Put simply, when one combines this flatter thermal curve with a noticeable reduction in off-die calls for memory… these four P-cores can not only theoretically but consistently outperform the previous generation’s six P-Cores in the real world. Especially in work-related tasks where the P-cores are being called on for more than just ‘blips’ of 100% activation.






