
Intel Integrated Performance Primitives (IPP) -

Performance Tips and Tricks

Intel, Pentium, Itanium and XScale are trademarks or registered trademarks of Intel Corporation or its
subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS.


EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS,
INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING
LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER
INTELLECTUAL PROPERTY RIGHT.

August 24, 2001 Copyright © 2001 Intel Corporation.


Intel Integrated Performance Primitives (IPP)
Performance Tips and Tricks
This paper provides tips and tricks to help you use Intel Integrated Performance Primitives
(IPP) to increase the performance of your applications. The tips and tricks include:
• Aligning memory to improve performance
• Avoiding 64 KB aliasing on Intel Pentium 4 processors
• Using Fast Fourier Transform (FFT) as a processing accelerator
• Using the IPP threshold functions for denormal data
• Using external buffers in sequential procedures
• Using multi-thread processing
• Using integer fixed point arithmetic instead of FP computation in Intel
StrongARM* (Intel SA) architectures

Align memory to improve performance


The performance of IPP functions can differ significantly between aligned and misaligned
data. Memory access is faster when pointers to the data are aligned, and IPP functions
perform better when they process data through aligned pointers. To obtain aligned pointers,
developers should use the IPP memory allocation function ippsMalloc. The library provides
several variants of this function, one per data type; for example, for the 32fc data type:
Ipp32fc* x = ippsMalloc_32fc( fftlen );
These functions allocate memory and return a pointer aligned on a 32-byte boundary. The
following performance results for the IPP copy function on an Intel Pentium 4 processor show
the effect of different alignments of the source data:

Source alignment, bytes    16+0    16+4    16+3
cpe                        0.3     0.5     0.7

Here cpe is the number of CPU cycles per element.
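
As a rough illustration of what an aligning allocator does, the sketch below over-allocates with malloc and rounds the address up to a 32-byte boundary. It is a portable stand-in, not the implementation of ippsMalloc; the names aligned32_malloc, aligned32_free and is_aligned are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_BYTES 32  /* the boundary the text quotes for ippsMalloc */

/* Nonzero if p sits on an align-byte boundary (align must be a power of two). */
int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: over-allocate, round the address up, and store the
   original malloc pointer just below the aligned block for the free call. */
void *aligned32_malloc(size_t bytes)
{
    void *raw = malloc(bytes + ALIGN_BYTES + sizeof(void *));
    if (raw == NULL)
        return NULL;
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + ALIGN_BYTES - 1) & ~(uintptr_t)(ALIGN_BYTES - 1);
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

/* Releases a block obtained from aligned32_malloc. */
void aligned32_free(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

A pointer obtained this way passes is_aligned(p, 32), while the same pointer advanced by 4 bytes corresponds to the "16+4" case in the table above.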


For more efficient image processing it is desirable that every row of an image be aligned. To
align image rows, a developer can choose the width of the image in accordance with the data
type. For example, for 8u data and a 3-channel image, the width w of the image in pixels
should be chosen so that w*3 is a multiple of 8, or preferably 16. Even though the width is
enlarged this way, the area to be processed can still be defined exactly as the developer
needs, using two parameters passed to almost every image processing function: the width step
(given in bytes) and the size of the region of interest, given as a structure of type IppiSize:
typedef struct {
    int width;
    int height;
} IppiSize;
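
The row padding described above amounts to rounding the row length in bytes up to the next multiple of the chosen alignment. A minimal sketch, assuming a hypothetical helper name aligned_row_step and an alignment supplied by the caller:

```c
#include <assert.h>

/* Hypothetical helper: bytes per image row, rounded up to a multiple of
   align, for width_px pixels of `channels` channels each. The result is
   what would be passed as the width step, while the region of interest
   keeps the true width in pixels. */
int aligned_row_step(int width_px, int channels, int bytes_per_channel, int align)
{
    assert(width_px > 0 && channels > 0 && bytes_per_channel > 0 && align > 0);
    int row_bytes = width_px * channels * bytes_per_channel;
    return (row_bytes + align - 1) / align * align;
}
```

For example, a 3-channel 8u image 101 pixels wide has 303 bytes of pixel data per row; with a 16-byte alignment the step becomes 304 bytes, and the region of interest still has width 101.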


64 KB Aliasing on Intel Pentium 4 Processors


When developing for the Intel Pentium 4 processor, it is very important that two pointers
used by a function do not differ by a multiple of 64 KB. The performance of the function
decreases significantly when they do. This behavior is particular to the current version of
the Intel Pentium 4 processor. For example, consider the ippsCplxToReal_32fc function:
for( n = 0; n < length; n++ ) {
    pRe[n] = pSrc[n].re;
    pIm[n] = pSrc[n].im;
}
Performance decreases sharply if the difference between the pRe and pIm pointers is 64 KB or
a multiple of that: it becomes 80 cpe instead of 5 cpe.
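
One way to guard against this condition is to test the two addresses before calling the function and, when both arrays live in one allocation, to shift the second array slightly. The sketch below is an assumption about how such a check might look; the helper names and the 64-byte pad are hypothetical and are not part of IPP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIAS_SPAN ((size_t)64 * 1024)  /* 64 KB, from the text */
#define ALIAS_PAD  ((size_t)64)         /* hypothetical pad; keeps 32-byte alignment */

/* Nonzero if the two addresses differ by a multiple of 64 KB, that is,
   if their low 16 address bits are identical. */
int differs_by_64k_multiple(const void *a, const void *b)
{
    return (((uintptr_t)a ^ (uintptr_t)b) & (ALIAS_SPAN - 1)) == 0;
}

/* Byte offset at which to place a second array behind a first array of
   first_bytes bytes in the same allocation, nudged off a 64 KB multiple. */
size_t second_array_offset(size_t first_bytes)
{
    assert(first_bytes > 0);
    return (first_bytes % ALIAS_SPAN == 0) ? first_bytes + ALIAS_PAD
                                           : first_bytes;
}
```

With pRe at the start of a block and pIm placed at second_array_offset(length * sizeof(float)) bytes behind it, the two output pointers of the loop above never differ by an exact multiple of 64 KB.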

Using Fast Fourier Transform (FFT) as a processing accelerator


The Fast Fourier Transform (FFT) is a general-purpose way to increase the performance of
data processing, in particular in digital signal processing.
According to the convolution theorem, the convolution of two signals in the time domain can
be computed as point-wise multiplication in the frequency domain. The transformation of data
to and from the frequency domain is usually performed with the FFT. Finite Impulse Response
(FIR) filtering of an input signal can therefore be done using the IPP FFT functions, which
are very fast on Intel processors. The length of the data array may be increased to the next
greater power of two by appending zeros to the data, and then the input signal and the FIR
coefficients can be transformed by the forward FFT function. The Fourier coefficients
obtained this way are multiplied point-wise, and the result is transformed back to the time
domain. The performance improvement gained this way can be very significant.
If the same filter is applied over several processing iterations, the FFT of the filter
coefficients needs to be computed only once. The twiddle tables and the bit-reverse tables
are created in the initialization function for the forward and inverse transforms at the same
time. The main operations in this kind of filtering are presented below:
/// create the FFT tables for transforms of order fftord
ippsFFTInitAlloc_R_32f( &pFFTSpec, fftord, IPP_FFT_DIV_INV_BY_N,
                        ippAlgHintNone );
/// perform forward FFT to put source data xx into the frequency domain
ippsFFTFwd_RToPack_32f( xx, XX, pFFTSpec, 0 );
/// point-wise multiplication in the frequency domain is convolution;
/// HH is the transformed filter, computed once in the same way
ippsMulPack_32f_I( HH, XX, fftlen );
/// perform inverse FFT to get the result in the time domain
ippsFFTInv_PackToR_32f( XX, yy, pFFTSpec, 0 );
/// free FFT tables
ippsFFTFree_R_32f( pFFTSpec );
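
The zero-padding step that precedes these calls can be sketched in plain C. The helper names below are hypothetical. One point the sketch makes explicit as an assumption: for the point-wise product to equal ordinary (non-wrapping) FIR filtering, the padded length has to cover the signal length plus the filter length minus one, so the order is chosen from that sum.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: smallest FFT order such that 2^order holds the full
   linear convolution of a signal and a filter without wrap-around. */
int fft_order_for(int signal_len, int filter_len)
{
    assert(signal_len > 0 && filter_len > 0);
    int needed = signal_len + filter_len - 1;
    assert(needed > 0 && needed <= (1 << 30));
    int order = 0;
    while ((1 << order) < needed)
        ++order;
    return order;
}

/* Copies src into dst and fills the rest of dst, up to fft_len, with zeros. */
void zero_pad(const float *src, int src_len, float *dst, int fft_len)
{
    assert(src_len >= 0 && src_len <= fft_len);
    memcpy(dst, src, (size_t)src_len * sizeof(float));
    memset(dst + src_len, 0, (size_t)(fft_len - src_len) * sizeof(float));
}
```

The returned order would play the role of fftord above, and both the signal and the filter coefficients would be padded to 1 << fftord samples before the forward transforms.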

The IPP FFT functions are used in the IPP convolution and cross-correlation functions. Even
a Discrete Fourier Transform (DFT) of arbitrary length is computed using FFT. This is
possible because the DFT can be expressed as a cyclic convolution:
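
One standard identity of this kind, often attributed to Bluestein and assumed here rather than taken from the paper, substitutes nk = (n^2 + k^2 - (k-n)^2)/2 into the DFT definition, with W_N = exp(-2*pi*i/N):

```latex
X(k) \;=\; \sum_{n=0}^{N-1} x(n)\, W_N^{nk}
     \;=\; W_N^{k^2/2} \sum_{n=0}^{N-1} \Bigl[ x(n)\, W_N^{n^2/2} \Bigr]\, W_N^{-(k-n)^2/2},
\qquad k = 0, \dots, N-1 .
```

The second factor under the sum depends only on the difference k-n, which is what gives the sum the form of a convolution.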

The expression under the summation sign is a cyclic convolution and can be computed with
FFT. This is up to 100 times faster than direct DFT computation.

IPP threshold functions and denormal data


Operations on NaN (not-a-number) values or denormal data slow processing down, even if the
corresponding exceptions are masked. Denormal data occur, for example, when Infinite Impulse
Response (IIR) and FIR filters are applied to a signal obtained from a microphone and
converted to the floating point (FP) format. To avoid the slowdown caused by denormal data,
the IPP threshold functions can be applied to the input signal before filtering. For example:
ippsThreshold_LT_32f_I( src, len, 1e-6f );
ippsFIR_32f( src, dst, len, st );
The 1e-6 value is the threshold level; input data below that level are set to zero. Because
the IPP threshold function is very fast, executing the two functions is faster than executing
the filter alone when denormal numbers occur in the source data. Of course, if denormal data
arise inside the filtering procedure itself, the threshold functions do not help. In this
case, on Intel processors beginning with the Pentium 4 processor and including Itanium™
processors, it is possible to set a special computation mode, Flush to Zero, using the IPP
ippsSetFlushToZero function.

Note that the Flush-to-Zero mode applies only to computations that use Streaming SIMD
Extensions (SSE) or Streaming SIMD Extensions 2 (SSE2) instructions and does not affect x87
instruction results.
The table below gives performance numbers obtained on Intel Pentium 4 processors for normal
and denormal source data using the approaches mentioned above:

Source data and method      Performance, cpe
Normal, None                46
Denormal, None              11467
Denormal, Threshold         49
Denormal, Flush to Zero     45
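
The pre-filtering step can be illustrated in portable C. This is a stand-in that zeroes samples whose magnitude is below the level, which is the effect the text describes; it does not claim to match the exact behavior of ippsThreshold_LT_32f_I, and both helper names are hypothetical.

```c
#include <assert.h>
#include <float.h>

/* Counts denormal samples: nonzero values smaller in magnitude than
   FLT_MIN, the smallest normalized float. */
int count_denormals(const float *buf, int len)
{
    assert(len >= 0);
    int count = 0;
    for (int i = 0; i < len; ++i) {
        float v = buf[i];
        if (v != 0.0f && v > -FLT_MIN && v < FLT_MIN)
            ++count;
    }
    return count;
}

/* Hypothetical stand-in for the thresholding step: samples whose magnitude
   is below level are replaced by zero, in place. */
void zero_below_level(float *buf, int len, float level)
{
    assert(len >= 0 && level > 0.0f);
    for (int i = 0; i < len; ++i) {
        if (buf[i] > -level && buf[i] < level)
            buf[i] = 0.0f;
    }
}
```

Any level above FLT_MIN, such as the 1e-6 used in the text, removes every denormal from the input while leaving samples of ordinary magnitude untouched.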

Using external buffers in sequential procedures


Many complicated functions in IPP need additional buffers and allocate the memory
internally. If every such function allocates and frees memory independently during
processing, performance decreases for two reasons:
• memory allocation and deallocation are expensive
• cache misses occur when a newly allocated buffer is accessed
To avoid this slowdown, some functions have a special parameter that allows them to use an
external buffer. If several functions using the same external buffer are invoked
sequentially, the probability that the processed data are already in the cache increases, and
performance is higher. For example, if several FFT functions must be invoked sequentially, it
is faster to use an external memory buffer. The size of the buffer must be the maximum of the
sizes reported for the individual transforms by the function:
ippsFFTGetBufSize
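
Sizing the shared buffer is a matter of taking the largest requested size and allocating once. The sketch below assumes the individual sizes have already been collected into an array (in practice they would come from the size query named above); the helper names are hypothetical and plain malloc stands in for an aligned allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Largest of `count` buffer sizes, in bytes. */
int max_buffer_size(const int *sizes, int count)
{
    assert(sizes != NULL && count > 0);
    int largest = sizes[0];
    for (int i = 1; i < count; ++i) {
        if (sizes[i] > largest)
            largest = sizes[i];
    }
    return largest;
}

/* Hypothetical helper: one work buffer big enough for every function in
   the sequence. The caller releases it with free() after the last call. */
void *alloc_shared_buffer(const int *sizes, int count)
{
    int bytes = max_buffer_size(sizes, count);
    assert(bytes >= 0);
    return malloc(bytes > 0 ? (size_t)bytes : 1);
}
```

The same pointer is then passed as the external buffer argument to each function in the sequence, so that the working memory stays warm in the cache between calls.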

Multi-thread processing
IPP is thread safe, which means that a user can organize processing so that it runs faster
on multi-processor platforms. For example, an FFT can be computed in parallel. To do so, the
transform of length N is performed as two transforms of length N/2, each computed in a
separate thread using IPP FFT functions. One possible way to split the computation into two
independent parts that can run in parallel is decimation in time. Instead of computing one
transform of length N:
Y(n) = FFT(N,x(n)), n=[0,N-1]
we can compute two transforms of length N/2:
Y(n)     = FFT(N/2,x(even)) + W(N,n)*FFT(N/2,x(odd)), n=[0,N/2-1]
Y(n+N/2) = FFT(N/2,x(even)) - W(N,n)*FFT(N/2,x(odd)), n=[0,N/2-1]
where W(N,n) = exp(-2*pi*i*n/N) is the twiddle factor, and make the additional
computations. In spite of the additional multiplications and additions needed, the total
execution time may be shorter on a two-processor computer.
An implementation of the decimation in time algorithm in C is given below. Instead of one
FFT:

ippsFFTFwd_CToC_32fc( x, Xst, ctxN, buffer );

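Two transforms of half the length are computed, followed by a multiplication by the twiddle
factors, an addition and a subtraction:

/// two half-length transforms, in place; if they run in separate threads,
/// each call needs its own work buffer
ippsFFTFwd_CToC_32fc( xleft, xleft, ctxN2, buffer );
ippsFFTFwd_CToC_32fc( xrght, xrght, ctxN2, buffer );
/// tw = tw * xrght (tw holds the twiddle factors on entry)
ippsMul_32fc_I( xrght, tw, fftlen2 );
/// first half of the result: xleft + tw
ippsAdd_32fc( xleft, tw, Xmt, fftlen2 );
/// second half of the result: xleft - tw
ippsSub_32fc( tw, xleft, Xmt+fftlen2, fftlen2 );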
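
The combining step can be checked independently of IPP. The sketch below is a plain C version of the multiply, add and subtract sequence, with a hypothetical Cpx struct standing in for Ipp32fc; unlike the in-place IPP multiplication it leaves the twiddle table unchanged.

```c
#include <assert.h>

typedef struct { float re; float im; } Cpx;  /* hypothetical stand-in for Ipp32fc */

static Cpx cpx_mul(Cpx a, Cpx b)
{
    Cpx r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}

/* Decimation-in-time butterfly: given the half-length transforms of the
   even-indexed and odd-indexed samples and the twiddle factors
   tw[n] = exp(-2*pi*i*n/N), n = 0..half-1, fills y with the length-N
   transform, where N = 2*half. */
void dit_combine(const Cpx *even, const Cpx *odd, const Cpx *tw,
                 int half, Cpx *y)
{
    assert(half > 0);
    for (int n = 0; n < half; ++n) {
        Cpx t = cpx_mul(tw[n], odd[n]);
        y[n].re        = even[n].re + t.re;
        y[n].im        = even[n].im + t.im;
        y[n + half].re = even[n].re - t.re;
        y[n + half].im = even[n].im - t.im;
    }
}
```

For x = {1, 2, 3, 4} the even-indexed samples {1, 3} transform to {4, -2} and the odd-indexed samples {2, 4} to {6, -2}; with twiddle factors {1, -i} the combination gives {10, -2+2i, -2, -2-2i}, which is the 4-point DFT of x.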

Another possible approach is to implement the IPP functions themselves so that their code
executes in parallel on a multi-processor platform. That is possible, for example, with
OpenMP* technology. Intel® compilers support OpenMP, and OpenMP code has been developed for
some ippIP functions. To illustrate scalability (the more Intel processors, the better the
performance), the table and the chart below give performance numbers for one of the ippIP
color conversion functions, tested with different numbers of threads executing in parallel
on a 6-way Intel Pentium III computer.


In the label A:6, A denotes Intel Pentium III optimized code and the suffix ":6" means
execution in 6 threads; performance is given in cpe (processor cycles per element).

Using integer fixed point arithmetic instead of FP computation: Q16 vs. 32f

Because Intel SA and Intel XScale processors have only integer instructions, a good way to
get high performance and extremely low power consumption is to use integer fixed point
arithmetic instead of FP computation. This makes it possible to avoid both the slow FP
emulation library and power-hungry FP coprocessors. For this reason the Intel SA version of
the functions, unlike the Intel architecture (IA) version, takes integer parameters where an
FP data type would otherwise be used. For example, the shift parameters of the rotate
function are passed as fixed point values in which a defined number of bits represents the
fractional part. The convention for indicating the number of fractional bits, where
necessary, is to add a modifier such as Q16 to the function name, as in the example below:
IppStatus ippiRotateQ16_8u_C1R( const Ipp8u* pSrc, IppiSize srcSize,
    int srcStep, IppiRect srcROI, Ipp8u* pDst, int dstStep,
    IppiRect dstROI, int angle, Ipp32s xShiftQ16, Ipp32s yShiftQ16,
    int interpolate );
To get the corresponding values of the shift parameters from FP values, multiply the
original values by 2^16 (65536).
Macros are provided to convert integer values to and from fixed point numbers:
#define FRACTBITS 16
#define FRACT_05 (1 << (FRACTBITS - 1))
#define FRACT_ITOF(x) ((x) << FRACTBITS)
#define FRACT_FTOI(x) ((x) >> FRACTBITS)
#define FRACT_ROUND(x) (((x) + FRACT_05) >> FRACTBITS)
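
The macros above work on integers. Converting a floating point shift value to Q16, as needed for xShiftQ16 and yShiftQ16, is the multiplication by 2^16 described in the text; the sketch below adds rounding to nearest. The function names are hypothetical and are not part of IPP.

```c
#include <assert.h>
#include <stdint.h>

#define Q16_ONE 65536  /* 2^16: the value 1.0 in Q16 */

/* Hypothetical helper: floating point to Q16, rounded to nearest. */
int32_t q16_from_float(float v)
{
    assert(v > -32768.0f && v < 32767.0f);  /* keep the result within 32 bits */
    float scaled = v * (float)Q16_ONE;
    return (int32_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Q16 back to floating point. */
float q16_to_float(int32_t q)
{
    return (float)q / (float)Q16_ONE;
}

/* Same rounding as FRACT_ROUND: add one half, then drop the fraction.
   Restricted to non-negative inputs to stay clear of shifting negatives. */
int32_t q16_round_to_int(int32_t q)
{
    assert(q >= 0);
    return (q + Q16_ONE / 2) >> 16;
}
```

A shift of 1.5 pixels becomes 98304 in Q16, and rounding that value back to an integer gives 2.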
