More Conjecture on KNL's Near Memory
This morning, The Platform ran an interesting collection of conjectures on how KNL's on-package MCDRAM might be used, and I recommend reading through it if you're following the race to exascale. I was originally going to write this commentary as a Google+ post, but it got a little long, so pardon the lack of a proper lead-in here.
I appreciated Mr. Funk's detailed description of how processor caches interact with DRAM and how this might translate into KNL's caching mode. However, in his discussion of how MCDRAM may act as an L3 cache, he underplays exactly why MCDRAM (and the GDDR on KNC) exists on these manycore architectures at all. On-package memory is not simply another way to squeeze more performance out of a manycore processor; rather, it is a hard requirement for keeping all 60+ cores (and their 120+ 512-bit vector registers, 1.8+ MB of L1 data cache, etc.) fed. Without MCDRAM, it would be physically impossible for these KNL processors to approach their peak performance because the cores would starve for data. By extension, Mr. Funk's assumption that this MCDRAM will come with substantially lower latency than DRAM may not hold.
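To put a rough number on "memory starvation," here's a back-of-envelope sketch in C. The peak-flops and bandwidth figures are ballpark assumptions of mine (on the order of what has been floated for KNL), not published specs:

```c
#include <stdio.h>

int main(void)
{
    /* Ballpark assumptions (mine, not Intel's published numbers):        */
    /*   - KNL peak double-precision throughput on the order of 3 TF/s    */
    /*   - off-package DDR bandwidth on the order of 90 GB/s              */
    /*   - on-package MCDRAM bandwidth on the order of 450 GB/s           */
    const double peak_flops = 3.0e12;   /* flop/s */
    const double ddr_bw     = 90.0e9;   /* byte/s */
    const double mcdram_bw  = 450.0e9;  /* byte/s */

    /* Arithmetic intensity (flop/byte) needed to stay compute-bound;     */
    /* below this, the cores sit idle waiting on memory.                  */
    printf("flop/byte needed from DDR alone: %.1f\n", peak_flops / ddr_bw);
    printf("flop/byte needed from MCDRAM:    %.1f\n", peak_flops / mcdram_bw);

    /* A bandwidth-bound kernel like STREAM triad (a[i] = b[i] + s*c[i])   */
    /* does ~2 flops per 24 bytes moved, i.e. ~0.08 flop/byte, so its      */
    /* achievable fraction of peak is intensity * bandwidth / peak:        */
    const double triad_intensity = 2.0 / 24.0;
    printf("fraction of peak from DDR:    %.1f%%\n",
           100.0 * triad_intensity * ddr_bw / peak_flops);
    printf("fraction of peak from MCDRAM: %.1f%%\n",
           100.0 * triad_intensity * mcdram_bw / peak_flops);
    return 0;
}
```

Even with generous assumptions, a bandwidth-bound kernel running out of off-package DRAM alone touches well under one percent of the chip's peak; the extra on-package bandwidth is what makes the rest reachable at all.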
As a matter of fact, the massive parallelism game is not about latency at all; it came about because memory latencies hit a physical floor. Rather than driving clock rates up to reduce latency and increase serial performance, the industry has been throwing more, slower cores and threads at a given problem to mask the latency of data access for any one worker. While one thread on a Xeon Phi core is stalled on a cache miss, the other three hardware threads keep the FPU busy, and that is what delivers the high efficiency these chips need. This trade is at the core of the Xeon Phi architecture (as well as every other massively parallel architecture, including GPUs and Blue Gene), so it is unlikely that Intel has sacrificed its power envelope to actually give MCDRAM lower latency than the off-package DRAM on KNL nodes.
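Little's Law makes the same point quantitatively: the data you must keep in flight to sustain a given bandwidth is bandwidth times latency. A quick sketch, where the bandwidth and latency values are assumed round numbers rather than measurements:

```c
#include <stdio.h>

int main(void)
{
    /* Little's Law: concurrency (bytes in flight) = bandwidth x latency. */
    /* Assumed round numbers, not measurements:                           */
    const double bandwidth  = 400.0e9;   /* byte/s of target bandwidth    */
    const double latency    = 200.0e-9;  /* seconds per memory access     */
    const double line_bytes = 64.0;      /* bytes per cache line          */

    double bytes_in_flight = bandwidth * latency;
    double lines_in_flight = bytes_in_flight / line_bytes;

    /* No single in-order thread can keep on the order of a thousand      */
    /* cache-line misses outstanding; spreading them across hundreds of   */
    /* slow hardware threads can.                                         */
    printf("bytes in flight needed:        %.0f KB\n", bytes_in_flight / 1024.0);
    printf("cache-line requests in flight: %.0f\n", lines_in_flight);
    return 0;
}
```

Shaving latency helps that equation only linearly, and latency has nowhere left to go; multiplying the number of outstanding requests is the lever that still moves.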
At an architectural level, accesses to MCDRAM still need to go through memory controllers just like off-package DRAM does. Intel has not been marketing the MCDRAM controllers as "cache controllers," so the latency of an MCDRAM access is likely on par with that of the off-package memory controllers. There are simply more of these MCDRAM controllers operating in parallel (eight) than off-package DRAM controllers (two), which again suggests that bandwidth, not latency, is MCDRAM's primary capability.
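A toy model of that controller math, using only the controller counts above and assuming, purely for illustration, equal per-controller bandwidth:

```c
#include <stdio.h>

int main(void)
{
    /* Toy model: aggregate bandwidth scales with the number of parallel  */
    /* memory controllers, but the latency of any single access does not. */
    /* Per-controller bandwidth is a placeholder, not a spec.             */
    const double bw_per_controller = 1.0;   /* normalized units           */
    const int    mcdram_ctrls = 8;          /* counts from the text above */
    const int    ddr_ctrls    = 2;

    printf("MCDRAM : DDR aggregate bandwidth ratio = %.1fx\n",
           (mcdram_ctrls * bw_per_controller) / (ddr_ctrls * bw_per_controller));
    /* The per-request latency term is identical in both cases; only the  */
    /* number of requests serviced concurrently changes.                  */
    return 0;
}
```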
Judging by current trends in GPGPU and KNC programming, I think it is far more likely that this caching mode acts at a much higher level, and that Intel is providing it as a convenience for (1) algorithmically simple workloads with highly predictable memory access patterns, and (2) problems that fit entirely within MCDRAM. As with OpenACC, I'm sure there will be some problems where explicit on/off-package memory management (analogous to OpenACC's copyin, copyout, etc.) isn't necessary and cache mode will be fine. Intel will also likely bake the necessary optimizations into its compiler collection and MKL so that many common operations (BLAS, FFTs, etc.) work well in cache mode, as it did for KNC's offload mode.
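For concreteness, this is the kind of explicit data staging OpenACC already asks programmers to do for GPU memory; explicitly managing MCDRAM in flat mode would follow the same mental model, whatever the eventual API turns out to look like. A minimal OpenACC sketch of the copyin/copyout pattern:

```c
#include <stdlib.h>
#include <stdio.h>

/* Explicit data staging: the programmer decides which arrays live in the
 * fast memory and for how long, instead of hoping a transparent cache
 * figures it out. */
void scale(double *restrict a, const double *restrict b, double s, long n)
{
    /* Stage b into device/fast memory once; copy a back out when done. */
    #pragma acc data copyin(b[0:n]) copyout(a[0:n])
    {
        #pragma acc parallel loop
        for (long i = 0; i < n; i++)
            a[i] = s * b[i];
    }
}

int main(void)
{
    long n = 1 << 20;
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    for (long i = 0; i < n; i++) b[i] = (double)i;
    scale(a, b, 2.0, n);
    printf("%f\n", a[n - 1]);
    free(a); free(b);
    return 0;
}
```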
However, to answer Mr. Funk's question, "Can pre-knowledge of our application's data use--and, perhaps, even reorganization of that data--allow our application to run still faster if we instead use Flat Model mode?", the answer is almost unequivocally "YES!" Programming massively parallel architectures has never been easy, and magically transparent caches rarely deliver reliable, high performance. Even the L1 and L2 caches do not work well without very deliberate application design to accommodate wide vectors; cache alignment and access patterns are at the heart of why, in practice, it is difficult to get OpenMP codes running with high efficiency on today's KNC processors. As much as I'd like to believe otherwise, the caching mode on KNL will likely be even harder to use effectively, and explicitly managing MCDRAM will be an absolute requirement for the majority of applications.
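As an illustration of what "very deliberate application design" means even for today's transparent L1/L2 caches, here is a minimal C/OpenMP 4.0 sketch: without 64-byte-aligned allocations and an alignment hint, the compiler may fall back to peeled or unaligned code instead of clean, full-width vector loads on a 512-bit-vector machine.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

#define N 1048576   /* array length */

int main(void)
{
    double *a, *b;
    /* Allocate on 64-byte boundaries so each vector load/store maps     */
    /* cleanly onto whole cache lines / full-width vector registers.     */
    if (posix_memalign((void **)&a, 64, N * sizeof(double)) ||
        posix_memalign((void **)&b, 64, N * sizeof(double)))
        return 1;

    for (long i = 0; i < N; i++) b[i] = (double)i;

    /* Telling the compiler about the alignment and the unit-stride      */
    /* access pattern is what lets it issue aligned vector instructions. */
    #pragma omp parallel for simd aligned(a, b: 64)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("%f\n", a[N - 1]);
    free(a); free(b);
    return 0;
}
```

If squeezing performance out of the transparent L1 and L2 already demands this much care from the programmer, expecting a transparent MCDRAM cache to deliver high efficiency on its own seems optimistic.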