Intel, Pentium, Itanium and XScale are trademarks or registered trademarks of Intel Corporation or its
subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
Intel Integrated Performance Primitives (IPP)
Performance Tips and Tricks
way are multiplied point-wise, and the result is transformed back to the spatial domain.
The performance improvement gained this way can be very significant.
If the same filter is applied over several processing iterations, the FFT of the filter
coefficients needs to be computed only once. The twiddle tables and the bit-reversal tables
are created in the initialization function for the forward and the inverse transforms at the
same time. The main operations in this kind of filtering are shown below:
ippsFFTInitAlloc_R_32f( &pFFTSpec, fftord, IPP_FFT_DIV_INV_BY_N,
ippAlgHintNone );
/// perform forward FFT to put source data x to freq-domain
ippsFFTFwd_RToPack_32f( xx, XX, pFFTSpec, 0 );
/// point-wise multiplication in freq-domain is equivalent to convolution in time-domain
ippsMulPack_32f_I( HH, XX, fftlen );
/// perform inverse FFT to get result in time-domain
ippsFFTInv_PackToR_32f( XX, yy, pFFTSpec, 0 );
/// free FFT tables
ippsFFTFree_R_32f( pFFTSpec );
The IPP FFT functions are also used inside the IPP convolution and cross-correlation
functions. Even a Discrete Fourier Transform (DFT) of arbitrary length is
computed using FFT. This is possible because the DFT can be expressed as a cyclic convolution:
The expression under the summation sign is a cyclic convolution and can be computed with FFT.
This approach can be up to 100 times faster than direct DFT computation.
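One standard identity of this kind is Bluestein's rewriting of the DFT. It is given here only as an illustration; it is an assumption that this is the form the text refers to. With $W = e^{-2\pi i/N}$ and the substitution $nk = \tfrac{1}{2}\left(n^2 + k^2 - (k-n)^2\right)$:

```latex
X(k) \;=\; \sum_{n=0}^{N-1} x(n)\, W^{nk}
     \;=\; W^{k^2/2} \sum_{n=0}^{N-1} \left[ x(n)\, W^{n^2/2} \right] W^{-(k-n)^2/2},
\qquad k = 0, \dots, N-1 .
```

The sum on the right is a convolution of the sequence $x(n)\,W^{n^2/2}$ with the sequence $W^{-n^2/2}$. Zero-padded to a convenient length of at least $2N-1$, it becomes a cyclic convolution that can be evaluated with FFTs.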
Note that the Flush-to-Zero mode applies only to computations that use Streaming SIMD
Extensions (SSE) or Streaming SIMD Extensions 2 (SSE2) instructions and does not
affect x87 instruction results.
The table below shows performance numbers obtained on Intel Pentium 4 processors for
normal and denormal source data using the different approaches mentioned above:
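The effect of the Flush-to-Zero mode can be sketched with the SSE control-register macros from `<xmmintrin.h>`. This is a minimal illustration, not the IPP mechanism: the helper name `mul_with_ftz` is hypothetical, and the sketch assumes the compiler emits SSE arithmetic for `float` (the default on x86-64). With x87 code generation the mode would have no effect, as noted above.

```c
#include <assert.h>
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE, _MM_GET_FLUSH_ZERO_MODE */

/* Multiply a by b with Flush-to-Zero switched on or off,
   then restore the mode that was active on entry. */
float mul_with_ftz(float a, float b, int ftz_on)
{
    unsigned int saved = _MM_GET_FLUSH_ZERO_MODE();
    _MM_SET_FLUSH_ZERO_MODE(ftz_on ? _MM_FLUSH_ZERO_ON : _MM_FLUSH_ZERO_OFF);

    /* volatile keeps the compiler from folding the product at compile time */
    volatile float va = a, vb = b;
    volatile float vr = va * vb;

    _MM_SET_FLUSH_ZERO_MODE(saved);
    assert(_MM_GET_FLUSH_ZERO_MODE() == saved);
    return vr;
}
```

For example, `mul_with_ftz(1e-30f, 1e-10f, 1)` produces a result below the smallest normal `float`, so it is flushed to zero, while the same call with `ftz_on = 0` returns a small nonzero denormal.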
Multi-thread processing
IPP is thread safe, so a user can organize processing so that it runs faster on
multi-processor platforms. For example, an FFT can be
computed in parallel. To do so, the transform of length N is performed as two
transforms of length N/2, each computed in a separate thread using IPP
FFT functions. One possible way to split the computation into two independent parts that can
be done in parallel is decimation in time. If we need a transform of length N:
Y(n) = FFT(N,x(n)), n=[0,N-1]
we can compute two transforms FFT(N/2) instead:
Y(n) = FFT(N/2,x(even)) + W(N,n)*FFT(N/2,x(odd)), n=[0,N/2-1]
Y(n+N/2) = FFT(N/2,x(even)) - W(N,n)*FFT(N/2,x(odd)), n=[0,N/2-1]
and make additional computations. Despite the additional multiplications and additions
needed, the total execution time may be shorter on a two-processor (two-way) computer.
An implementation of the decimation in time algorithm in C is given below. For example, use
this instead of one FFT transform:
Two transforms of a smaller order and several addition operations are made:
Another possible approach is to implement the IPP functions themselves so that their code
is executed in parallel on a multi-processor platform. That is possible, for example,
using OpenMP* technology. Intel® compilers support OpenMP technology, and OpenMP
code has been developed for some ippIP functions. To illustrate the scalability thesis (the
more Intel processors, the better the performance), several performance numbers are given in the
table and in the chart below for one of the ippIP color conversion functions tested with
different numbers of threads executed in parallel on a 6-way Intel Pentium III computer.
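The idea of parallelizing the body of a function with OpenMP can be sketched as follows. This is not an ippIP function: `rgb_to_gray` and its integer luma weights are hypothetical and chosen only to show a per-pixel loop whose iterations are independent. When the file is compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical color conversion: interleaved 8-bit RGB to 8-bit gray.
   Each output pixel depends only on its own input pixel, so the loop
   iterations can be distributed across threads. */
void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, size_t pixels)
{
    assert(rgb != NULL && gray != NULL);
    long n = (long)pixels;

    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        const uint8_t *p = rgb + 3 * i;
        /* fixed-point weights sum to 256, hence the shift by 8 */
        gray[i] = (uint8_t)((77 * p[0] + 150 * p[1] + 29 * p[2]) >> 8);
    }
}
```

The number of threads is left to the OpenMP runtime in this sketch; it can typically be controlled with the `OMP_NUM_THREADS` environment variable.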
Here "A:6" denotes the Intel Pentium III optimized code, where ":6" means execution in 6 threads;
performance is given in cpe (processor cycles per element).