1 Faculty of Physics and Astronomy, University of Wroclaw, pl. M. Borna 9, Wroclaw, Poland
2 Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
Email: sebastian.szkoda@ift.uni.wroc.pl
1. INTRODUCTION
MODEL
FIGURE 4. Schematic of the periodic boundary conditions. Each edge column is mirrored in an extra column added on the other side of the grid.
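
The mirroring described in the caption of Fig. 4 can be sketched in C as follows. This is only an illustration under stated assumptions, not the code used in the paper: the row-major layout with one byte per node, the single extra column on each side, and the name fill_ghost_columns are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: `height` rows of `width + 2` bytes each.
 * Columns 1..width hold the real lattice nodes; column 0 and
 * column width + 1 are the extra (ghost) columns. */
static void fill_ghost_columns(uint8_t *lattice, size_t width, size_t height)
{
    size_t stride = width + 2;
    for (size_t y = 0; y < height; ++y) {
        uint8_t *row = lattice + y * stride;
        row[0]         = row[width]; /* left ghost mirrors the right edge column */
        row[width + 1] = row[1];     /* right ghost mirrors the left edge column */
    }
}
```

With such a layout, the move step can read neighbours of the edge columns without any special-case index arithmetic, provided the ghost columns are refreshed once per time step.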
IMPLEMENTATIONS
CPU implementation with AVX
[Figure: the lattice of size width x height is divided into n horizontal strips of height/n rows each, assigned to core 0, core 1, ..., core n-1.]
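
The division of the lattice into n horizontal strips of height/n rows, one per core, can be sketched as a small index helper. The names row_range and strip_for_core are hypothetical, and the handling of a height that is not a multiple of n is an assumption made here for completeness; the source only shows the evenly divisible case.

```c
#include <stddef.h>

/* Half-open row range [begin, end) owned by one core. */
typedef struct { size_t begin, end; } row_range;

/* Split `height` rows among `n` cores (n > 0). Each core gets
 * height / n rows; when height is not a multiple of n, the first
 * height % n cores receive one extra row each (an assumption). */
static row_range strip_for_core(size_t height, size_t n, size_t core)
{
    size_t base  = height / n;
    size_t extra = height % n;
    row_range r;
    r.begin = core * base + (core < extra ? core : extra);
    r.end   = r.begin + base + (core < extra ? 1 : 0);
    return r;
}
```

Each worker thread would then loop over rows r.begin to r.end - 1 only, with all columns of those rows.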
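      // AVX intrinsics used below require #include <immintrin.h>
      // load the current state of the destination nodes
      valp = (__m256i*)(lattice1 + pos(e1, e2));
      __m256i cr = _mm256_loadu_si256(valp);
      // merge the moved particles (ar) with the destination state (cr);
      // plain AVX has no 256-bit integer OR, so the bitwise _pd form is used
      ar = _mm256_castpd_si256(_mm256_or_pd(_mm256_castsi256_pd(cr),
                                            _mm256_castsi256_pd(ar)));
      // store data in destination nodes
      _mm256_storeu_si256(valp, ar);
    }
    // update rest particles
    br = _mm256_castpd_si256(_mm256_and_pd(_mm256_castsi256_pd(br),
                                           _mm256_castsi256_pd(maskv2)));
    _mm256_storeu_si256(valv, br);
  }
}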
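
The two operations visible in the fragment above (OR the moved particles into the destination nodes, then AND the source nodes with a mask) can be illustrated with a portable scalar stand-in that handles 8 nodes per 64-bit word instead of 32 nodes per 256-bit register. This is a sketch only: the one-byte-per-node encoding, the position of the rest-particle bit (REST_MASK), and both function names are assumptions, not taken from the paper.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption for illustration: one byte per node, bit 6 = rest particle. */
#define REST_MASK 0x40u

/* OR the moved particles into the destination nodes, 8 nodes per step.
 * n must be a multiple of 8. memcpy gives unaligned-safe loads/stores. */
static void merge_into_destination(uint8_t *dst, const uint8_t *moved, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        uint64_t cr, ar;
        memcpy(&cr, dst + i, 8);   /* current destination state */
        memcpy(&ar, moved + i, 8); /* particles arriving at these nodes */
        ar |= cr;
        memcpy(dst + i, &ar, 8);
    }
}

/* Clear everything except the rest-particle bit in the source nodes. */
static void keep_rest_particles(uint8_t *src, size_t n)
{
    const uint64_t maskv = 0x0101010101010101ULL * REST_MASK;
    for (size_t i = 0; i < n; i += 8) {
        uint64_t br;
        memcpy(&br, src + i, 8);
        br &= maskv;
        memcpy(src + i, &br, 8);
    }
}
```

The OR is safe because each direction bit of a destination node is written from exactly one neighbour, so contributions from different directions never collide.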
FIGURE 7. Partition of the FHP lattice into computational blocks of threads for an efficient move step. The size of the outer (green) rectangle is equal to the size of a block of threads. This rectangle represents the nodes read by the corresponding block. The inner (brown) rectangle marks the nodes whose state is updated by a given block.
FIGURE 8. Assignment of shared memory array elements to a block of threads. Rectangle A marks the FHP nodes loaded from the GPU main memory into the shared memory. Rectangle C embraces A, and C \ A is used as a buffer to capture the particles moving out of A. Rectangle B ⊂ A denotes the nodes whose updated states will be saved back to the main memory. Colors correspond to those in Fig. 7.
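
The index arithmetic implied by Figs. 7 and 8 can be sketched in plain C (rather than CUDA, to keep it compilable anywhere). The sketch assumes thread blocks of bw x bh threads and margins one node wide, so that B is A shrunk by one node on each side and C is A grown by one element on each side. These widths and all function names are assumptions for illustration; the captions do not state them.

```c
#include <stdbool.h>

/* Lattice x coordinate read by thread tx of block bx, before any
 * periodic wrap. Neighbouring blocks overlap by two columns. */
static int read_x(int bx, int tx, int bw)
{
    return bx * (bw - 2) + tx - 1;
}

/* True if thread (tx, ty) lies in the inner rectangle B and therefore
 * saves its updated node back to main memory. */
static bool writes_back(int tx, int ty, int bw, int bh)
{
    return tx >= 1 && tx <= bw - 2 && ty >= 1 && ty <= bh - 2;
}

/* Position of thread (tx, ty) in the shared array C, which has
 * (bw + 2) x (bh + 2) elements; the +1 offsets leave a one-element
 * buffer ring (C \ A) around the loaded rectangle A. */
static int shared_index(int tx, int ty, int bw)
{
    return (ty + 1) * (bw + 2) + (tx + 1);
}

/* Number of blocks needed to cover `width` nodes along x, given that
 * each block updates only bw - 2 of them. */
static int blocks_along_x(int width, int bw)
{
    return (width + (bw - 2) - 1) / (bw - 2);
}
```

Under these assumptions each block reads bw x bh nodes but updates only (bw - 2) x (bh - 2) of them, which is the price paid for keeping the move step free of inter-block communication.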
4. RESULTS

CPU type                 Code type  CPU cores  CPU threads  Mups  USD/Mups  Watt/Mups
Xeon i7 960 (3.2 GHz)    seq        1          1            28    14.29     6.19
                         SSE        1          1            80    3.33      1.44
                         Pth        4          4            138   2.17      0.94
                         pSSE       4          4            340   0.88      0.38
Xeon E3 1270 (3.5 GHz)   seq        1          1            36    9.17      2.14
                         SSE        1          1            96    3.44      0.80
                         AVX        1          1            108   3.06      0.77
                         Pth        4          8            190   1.74      0.41
                         pSSE       4          8            596   0.55      0.13
                         pAVX       4          8            654   0.50      0.12
Xeon E5 4650L (2.6 GHz)  SSE        1          1            72    50.00     1.60
                         AVX        1          1            80    45.00     1.44
                         pSSE       32         32           1446  9.96      0.31
                         pAVX       16         16           1313  5.48      0.17
GTX 480                  GPU        1          1            1746  0.46      0.30
                         mGPU       2          2            3493  0.37      0.27
Tesla M2075              GPU        1          1            1103  2.72      0.32
                         GPU        1          1            1198  2.50      0.30
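
The quantities reported in the results table can be written down as two one-line formulas. The sketch below assumes that Mups stands for million lattice-node updates per second, that USD/Mups is a purchase price divided by performance, and that Watt/Mups is a power draw divided by performance. The function names are hypothetical and no prices or power figures from the paper are used.

```c
/* Million node updates per second for a run of `steps` time steps on a
 * width x height lattice that took `seconds` of wall-clock time. */
static double mups(double width, double height, double steps, double seconds)
{
    return width * height * steps / seconds / 1.0e6;
}

/* Cost per unit of performance. `cost` is either a price in USD
 * (giving USD/Mups) or a power in watts (giving Watt/Mups). */
static double cost_per_mups(double cost, double mups_value)
{
    return cost / mups_value;
}
```

For example, with purely hypothetical inputs, a 1000 x 1000 lattice advanced by 100 steps in 2 s gives 50 Mups, and a device costing 300 USD would then score 6 USD/Mups.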
[Figure 10: plot not recoverable from the extraction; surviving labels: USD/Mups, Watt/Mups, SSE, AVX.]
FIGURE 10. Comparison of the purchase (blue) and running (green) costs relative to the sequential implementation on the Xeon E3-1270 processor. All CPU data are for the E3 processor.

6. ACKNOWLEDGMENTS

REFERENCES