Memory Movement and Initialization: Optimization and Control

Compiler Methodology for Intel® Many Integrated Core (Intel® MIC) Architecture

 

Overview


Are you initializing data or copying blocks of data from one variable to another in your application?  Probably so: moving or setting blocks of data is very common.  So how do you best optimize these operations for Intel® Xeon Phi™ coprocessors?

Job #1 - For Phi, parallelize the initialization!

A single Intel Xeon Phi coprocessor core cannot saturate the memory bandwidth available on the coprocessor.  So if only one core initializes your large arrays, you will notice a significant slowdown compared to an Intel Xeon processor (due to the relatively slow clock speed of the coprocessor cores).  Therefore, on the coprocessor it is necessary to get many cores involved in the memory initialization to ensure that the memory subsystem is driven at or near maximum bandwidth.

For example, if you have something like this:

do i=1,N
  arr1(i) = 1.1_dp
end do

you can parallelize the do loop:

!DIR$ vector nontemporal
!$OMP PARALLEL DO
do i=1,N
  arr1(i) = 1.1_dp
end do

mem*() calls in libc

The mem* family of functions in libc can take significant amounts of time in many applications.  These include the memcpy(), memset() and memmove() functions.  C programmers may call these directly in their code.  In addition, Fortran and C applications with data initializations or data copy statements may IMPLICITLY call these functions when the compiler translates the set/move/copy statements into calls to these libc mem*() functions.  Fortran may also hide calls to the libc mem*() functions inside the Fortran runtime libraries, which often "wrap" the libc mem*() functions.

Applications compiled with the Intel compilers:  because these libc mem*() functions are so common, the Intel compilers provide optimized versions of memset and memcpy in the compiler-supplied library 'libirc'.  These functions are intended to replace the calls to the libc mem*() functions with more optimized versions.  The replacement functions have the symbol names "_intel_fast_memset" and "_intel_fast_memcpy".

Some examples showing how the compiler translates these operations into _intel_fast_mem*() calls:

Fortran:

$ more memset.f90
program memsetter
  integer, parameter :: N=10000
  real :: a(N)
  integer :: i

  do i=1,N
    a(i) = 0.0
  end do
!...or with array syntax
  a = 0.0

 print*, a(1:10) !...if you don't use array a, the loop above
                  !...is completely optimized away
end program memsetter

# now compile the code at O2 or greater, use 'nm' to dump the symbols
$ ifort -c -g -O2 memset.f90
$ nm memset.o
0000000000000000 N .debug_info_seg
0000000000000000 T MAIN__
                 ...
                 U _intel_fast_memset
                 ...
0000000000000000 b memsetter_$A.0.1
 

C: 

#include <stdio.h>
float a[1000000];
float b[1000000];
int main() {
  int i, n;
  n=1000000;
  for (i=0; i<n; i++) {
    a[i] = b[i];
  }
  printf("%f", a[1]);
}


$ icc -g -c -O2 memset.c
$ nm memset.o
0000000000000000 N .debug_info_seg
                 ...
                 U _intel_fast_memcpy
 

_intel_fast_mem*() function calls and how to control their use

"memcpy" calls in user code (explicit and implicit) are translated to _intel_fast_memcpy UNLESS the user specifies non-default options such as:

  • C++:  -ffreestanding (tells the compiler that the user may provide their own versions of library entry points, so the compiler is NOT free to translate mem*() calls to other versions).
  • Fortran:  -nolib-inline (disables inline expansion of standard library and intrinsic functions, and prevents the compiler from translating mem*() calls to their _intel_fast_mem*() equivalents).

So depending on the compilation options, you may get the glibc memcpy (or the user's own version) OR _intel_fast_memcpy.

Streaming Stores - Nontemporal writes for data:

Many high-performance computing applications need to move data in huge blocks.  Normally, write operations move data through the data cache(s) on the assumption that the data may be reused again soon (a write-allocate policy).  However, in many cases an HPC application completely overwrites the cache contents (first level, second level, the whole cache hierarchy) while moving data much larger than the cache size.  This wipes out any 'useful' data that may be cached, effectively flushing the caches.  To avoid this, the programmer may specify 'streaming stores.'  Streaming store instructions on the Intel microarchitecture code name Knights Corner do not perform a read for ownership (RFO) of the target cache line before the actual store, thus saving memory bandwidth.  The data remain cached in L2 (in contrast to streaming stores on Intel® Xeon® processors, where the on-chip cache hierarchy is bypassed and the data are combined in a separate write-combining buffer).  See the article here for more details: Intel® MIC Architecture Streaming Stores.

To control the use of non-temporal streaming store instructions, the Intel compilers provide the -opt-streaming-stores (Linux*, OS X*) and /Qopt-streaming-stores (Windows*) option.  The syntax is:

Linux and OS X

-opt-streaming-stores keyword

Windows:

/Qopt-streaming-stores:keyword

Arguments

keyword

Specifies whether streaming stores are generated. Possible values are:

always

Enables generation of streaming stores for optimization. The compiler optimizes under the assumption that the application is memory bound.

never

Disables generation of streaming stores for optimization. Normal stores are performed.

auto

Lets the compiler decide which instructions to use.

Default

-opt-streaming-stores auto
or /Qopt-streaming-stores:auto

The compiler decides whether to use streaming stores or normal stores.

Description

This option enables generation of streaming stores for optimization. This method stores data with instructions that use a non-temporal buffer, which minimizes memory hierarchy pollution.

This option may be useful for applications that can benefit from streaming stores.

Control Streaming Store with Pragma/Directive:

C and Fortran: add a "simd" pragma/directive to suppress the conversion to a mem*() call, and add a "vector nontemporal" pragma/directive to generate non-temporal stores. Examples:

!DIR$ vector nontemporal
!DIR$ simd
do i=1,n
   a(i) = 0
enddo

#pragma vector nontemporal
#pragma simd
  for (i=0; i<n; i++) {
    a[i] = b[i];
  }

 

Advanced Notes:

_intel_fast_memcpy() (a library function that resides in the libirc.a library shipped with the Intel compiler) uses non-temporal stores for memcpy IF the copy size is > 256 KB. For smaller sizes you will still get vector code, but it will not use non-temporal stores.
 
The Intel compilers and libraries do NOT automatically parallelize the mem*() calls (execution happens in a single thread unless the memcpy/loop resides inside a user-parallelized code region).
 
In some specialized uses of memcpy, the application has extra knowledge of the cache behavior of the src/dest arrays and their cache locality at a larger scope than what the library code sees from a single invocation of memcpy. In such cases, you may be able to apply smarter optimizations (such as prefetching techniques that are not based solely on the input size) in a loop version, or in a smarter specialized user version of memcpy, that leads to better behavior for your application.
 
For the STREAM copy kernel, the source code does not call memcpy directly; it has a copy loop. Under default options, the compiler translates the loop into a call to _intel_fast_memcpy, which then takes the path that executes the stores as non-temporal stores. In the best-performing STREAM copy version, though, you can get slightly better performance (~14% better) using the options "-opt-streaming-stores always -opt-prefetch-distance=64,8" OR "-ffreestanding -opt-prefetch-distance=64,8", due to the better prefetching behavior in the loop version of the code vectorized by the compiler (driven by the compiler prefetching options and no translation to a memcpy library call).
 
In general, small-size memcpy performance is expected to be slower on Intel MIC Architecture than on a host processor (when it is NOT bandwidth bound, meaning small sizes plus cache-resident data) due to the coprocessor's slower single-threaded clock speed.

Takeaways

Memory movement operations can explicitly or implicitly call the memcpy() or memset() functions to move or set blocks of data.  These calls can be linked to the routines in the libc supplied by your OS.  Under certain conditions, the Intel compilers replace the slower libc calls with faster versions from the Intel compiler runtime libraries, such as _intel_fast_memcpy and _intel_fast_memset, which are optimized for Intel architecture.

Moving large data sets through the cache hierarchy can flush useful data out of cache.  Streaming stores can be used to improve memory bandwidth on Intel® Xeon Phi™ coprocessors.  The -opt-streaming-stores compiler option can be used, or the vector nontemporal pragma/directive for finer-grained control.

NEXT STEPS

It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon Phi™ Coprocessors.  The paths provided in this guide reflect the steps necessary to discover best possible application performance.

Back to Advanced Optimizations chapter
