The Intel® Xeon Phi™ coprocessor offers high memory bandwidth: its GDDR5 memory achieves a Stream Triad score of more than 140 GB/s. The absolute latency to GDDR, however, is more than twice as high on Intel Xeon Phi coprocessors as on the server CPUs of the 2nd and 3rd Generation Intel® Core™ processor family. The two architectures tolerate that latency differently: the Intel Xeon Phi coprocessor has more threads with which to hide it, while the Intel CPUs hide it within a single thread through out-of-order scheduling.
The coprocessor GDDR has a latency of 500-1000 cycles. Reading data from the L2 cache takes 15-30 cycles, and accessing the L1 cache takes only one cycle. Bringing data into the caches, or sizing it to fit there, is therefore important. Specifically, to limit exposure to memory latency, use the following techniques:
Accessing memory consecutively is the fastest way to access memory on the Intel Xeon Phi coprocessor. Consecutive access improves cache efficiency, reduces the number of misses in the Translation Lookaside Buffer (TLB), which maps virtual addresses to physical addresses, and enables the hardware prefetcher to kick in.
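The difference between consecutive and non-consecutive access can be illustrated with a minimal sketch in C. It assumes a matrix stored row-major in a flat array; the function names sum_consecutive and sum_strided are hypothetical and chosen only for this illustration. Both functions compute the same sum, but only the first walks memory with a stride of one element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums a rows x cols matrix stored row-major in a
   flat array. The inner loop advances one element at a time, so each
   access is adjacent in memory to the previous one. */
double sum_consecutive(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            sum += a[i * cols + j];   /* stride: 1 element */
    return sum;
}

/* Same computation with the loops interchanged. The inner loop now
   jumps cols elements between accesses, so successive accesses are not
   adjacent in memory. */
double sum_strided(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double sum = 0.0;
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            sum += a[i * cols + j];   /* stride: cols elements */
    return sum;
}
```

When a loop nest resembles sum_strided, interchanging the loops so that the innermost index matches the storage order is usually the simplest way to obtain consecutive access. How much this helps depends on the matrix size and the compiler, so it should be measured rather than assumed.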