You can conduct experiments using a variety of hardware events and the associated efficiency metrics. For example, examining the kernels with respect to data read and write misses can help you identify opportunities to improve prefetching or to increase data reuse through blocking techniques (tiling).
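As an illustration of what blocking (tiling) means, the following is a minimal sketch in plain C rather than an OpenCL kernel. It assumes a square, row-major matrix multiplication; the function names `matmul_naive` and `matmul_tiled` and the `tile` parameter are hypothetical and chosen only for this example. The tiled version works on small sub-blocks so that the data it touches is more likely to stay in cache and be reused before it is evicted.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: c = a * b for n x n row-major matrices.
   The inner loop walks b column-wise, which tends to miss in cache
   once n is large. */
void matmul_naive(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Blocked (tiled) version: the same arithmetic, reorganized so that each
   tile x tile block of a, b, and c is reused while it is still cached.
   The tile size is an assumption to be tuned experimentally. */
void matmul_tiled(const float *a, const float *b, float *c,
                  size_t n, size_t tile)
{
    assert(tile > 0);

    for (size_t i = 0; i < n * n; ++i)
        c[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += tile) {
        size_t i_end = (ii + tile < n) ? ii + tile : n;
        for (size_t kk = 0; kk < n; kk += tile) {
            size_t k_end = (kk + tile < n) ? kk + tile : n;
            for (size_t jj = 0; jj < n; jj += tile) {
                size_t j_end = (jj + tile < n) ? jj + tile : n;
                /* Multiply one block of a by one block of b. */
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        float a_ik = a[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += a_ik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Comparing the data read and write miss events for the two variants across several tile sizes is one way to check whether blocking actually pays off for a given problem size. The best tile size depends on the cache hierarchy and should be measured, not assumed.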
Event-driven analysis of an OpenCL application on Intel Xeon Phi coprocessors is conceptually similar to the analysis of a regular native (or offload) application for the coprocessor. See the "Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors, Part 2" web article for more information.