
System Co-Design and Data Management for Flash Devices
VLDB 2011
Philippe Bonnet,
ITU, Denmark

Luc Bouganim,
INRIA, France

Ioannis Koltsidas
IBM Research, Switzerland

Stratis D. Viglas
University of Edinburgh, United Kingdom

Bonnet, Bouganim, Koltsidas, Viglas, VLDB 2011

Flash Devices (SSD): Why Bother?

- "Just a SATA drive": I can readily plug flash devices into my server. What is the big deal?
- "IOs don't matter": CPU is the critical resource.
- "Disk is disk": ~650 million units shipped in 2010.
- "PCM is coming": 100x faster, 10 million write cycles [Papandreou et al., IMW 2011]

Some Trends ...

                 2000              Growth    2010
HDD Capacity     200 GB            x10       2 TB
HDD GB/$         0.05              x600      30
HDD IOPS         200               x1        200
SSD Capacity     14 GB (2001)      x20       256 GB
SSD GB/$         3 x 10^-4         x1000     0.5
SSD IOPS         10^3 (SCSI)       x1000     10^6+ (PCIe), 5 x 10^3+ (SATA)
PCM Capacity                                 2 x 10^5 cells, 4 bits/cell
PCM IOPS                                     10^6+ (1 chip)

... and a Fact

[Tsirogiannis et al. 2010]

Flash-based SSDs do nothing well!
They offer high throughput at low energy consumption.

SSD-based Systems

With more than 1,000 stores, Danish Supermarket group is one of Denmark's largest retailers. To help keep up with customer needs, the company manages more than 10 terabytes of business intelligence data.

Database appliances: Netezza TwinFin, Oracle Exadata
SSD-based blades:
- Scaled up: Super Micro 6026
- Scaled down: Amdahl blade [Szalay et al., 2009]

IOs matter. Systems are being designed and commercialized for efficient data management for flash devices.

Block Device

SSDs and HDDs provide the same memory abstraction: a block device interface

ERASE (address)

Figure courtesy of Kaashoek and Saltzer

Strong Modularity

SSDs and HDDs provide the same memory abstraction: a block device interface

=> There should be no impact on the application (e.g., DBMS)?

Design Assumptions

=> Actually, DBMS design is very much based on disk characteristics:
(1) locality in the logical space is preserved in the physical space,
(2) sequential access is faster than random access.

[Figure: hard disk anatomy: platter, tracks, spindle, read/write head, actuator, disk arm, controller, disk interface]

- Random accesses are avoided
- Sequential accesses are favored: extent-based allocation, clustering
- Page-based IO quantization; identical representation in memory and on disk
- Write-ahead logging; physiological logging

How do flash devices impact DBMS design?

(Bottom-up) We need to understand flash devices a bit better.
   If they exhibit stable properties
   => Design principles for data management
   If they do not exhibit stable properties
   => How to tackle the increased complexity?

(Top-down) We make assumptions about the behaviour of flash devices, and we design adapted DBMS components. We then need to make sure that (at least some) flash devices actually fit our assumptions.

Tutorial Outline
1. Introduction (Philippe)
2. Flash devices characteristics (Luc)
3. Data management for flash devices (Stratis)
4. Two outlooks (Stratis & Philippe)


A short motivating story (1)

- Alice, Bob, Charlie and Dave want to measure the performance of a given data-intensive algorithm for flash devices.
- They use different strategies but start from the same IO traces of that algorithm, and they own an MTRON and 2 identical INTEL X25-M SSDs (same model, same firmware; one used, one never used).

IO traces of Algorithm X:

RW(2000, 2.0, 8000)
SR(2000, 16.0)
RW(500, 2.0, 8000)
RW(500, 2.0, 8000)
RR(100, 4.0, 8000)

A short motivating story (2): Alice & Bob

- Alice believes in datasheets. She builds a simple SSD simulator configured with basic SSD performance numbers.
- She takes the SSD performance numbers from the datasheet and runs the simulator using the traces.

Configuration file (from the Mtron datasheet):

IOS    SR     RR     SW     RW
1      70     87     51     9023
2      81     98     64     8723
4      104    122    85     8686
8      150    167    129    8682

IO traces -> Simulator -> Results

RW(2000, 2.0, 8000)
SR(2000, 16.0)
RW(500, 2.0, 8000)
RW(500, 2.0, 8000)
RR(100, 4.0, 8000)

- Bob does not believe in datasheets. He runs simple tests on both SSDs to obtain the basic performance numbers. He then runs Alice's simulator on the traces with his numbers.
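A minimal sketch of what a table-driven simulator like Alice's could look like. This is not the authors' simulator. The cost table is copied from the configuration file on the slide (units are not stated there, so costs are abstract). The trace format is an assumption: the first argument is taken as the number of IOs, the second as the IO size, and any further argument is ignored. `parse_trace`, `cost_per_io` and `simulate` are hypothetical names.

```python
# Cost per IO, indexed by pattern and IO size (column "IOS" on the slide).
COSTS = {
    "SR": {1: 70, 2: 81, 4: 104, 8: 150},
    "RR": {1: 87, 2: 98, 4: 122, 8: 167},
    "SW": {1: 51, 2: 64, 4: 85, 8: 129},
    "RW": {1: 9023, 2: 8723, 4: 8686, 8: 8682},
}


def parse_trace(line):
    """Parse e.g. 'RW(2000, 2.0, 8000)' into (pattern, io_count, io_size).

    Assumption: argument 1 is the IO count, argument 2 the IO size;
    remaining arguments are ignored by this sketch.
    """
    name, args = line.strip().rstrip(")").split("(")
    parts = [p.strip() for p in args.split(",")]
    return name, int(parts[0]), float(parts[1])


def cost_per_io(pattern, io_size):
    """Linearly interpolate (or extrapolate) the tabulated cost."""
    table = COSTS[pattern]
    sizes = sorted(table)
    hi_idx = 1
    while hi_idx < len(sizes) - 1 and sizes[hi_idx] < io_size:
        hi_idx += 1
    lo, hi = sizes[hi_idx - 1], sizes[hi_idx]
    slope = (table[hi] - table[lo]) / (hi - lo)
    return table[lo] + slope * (io_size - lo)


def simulate(trace_lines):
    """Total estimated cost of a trace: sum of count x cost per IO."""
    total = 0.0
    for line in trace_lines:
        pattern, count, io_size = parse_trace(line)
        total += count * cost_per_io(pattern, io_size)
    return total
```

Such a simulator is stateless: the estimated cost of an IO depends only on its pattern and size, never on the history of the device. The following slides show why that assumption is fragile.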


A short motivating story (3): Charlie & Dave

- Charlie does not believe in Bob! He is more cautious: he runs long tests on the same SSDs and obtains his own basic performance numbers. Then, he proceeds as Bob.
- Dave does not like simulation and runs the traces directly on the SSDs.

IO traces:

RW(2000, 2.0, 8000)
SR(2000, 16.0)
RW(500, 2.0, 8000)
RW(500, 2.0, 8000)
RR(100, 4.0, 8000)

What is your take on the resulting measures?

A short motivating story (4): Results

[Figure: two bar charts comparing the measures obtained with each strategy. MTRON: Alice (datasheet), Bob (simple calibration), Charlie (long calibration), Dave (run on MTRON). INTEL X25, used vs. never used: Bob (simple calibration), Charlie (long calibration), Dave (run on used X25), Dave (run on new X25). Axis values lost in extraction.]

Mtron and Intel devices behave differently
Identical Intel devices behave differently

- Confidence in performance measurements is very low!
- Modeling flash devices seems difficult
- What about designing algorithms for flash devices?
  - e.g., database systems, operating systems, applications?

Outline of the first part of this tutorial

Goal: understand the impact of flash memory on software (DBMS) design and vice-versa

- We study flash chips, explaining their constraints and trends
- We then consider flash devices as black boxes and try to understand their performance behavior (uFLIP).
  Goal: find a simple model as a basis for DBMS design
- We hit a wall with the black box approach -> we open the box, i.e., the FTL, and look at FTL techniques.
- Finally, we propose an alternative to complex FTLs, better adapted for DBMS design.
The Good

NAND flash chip performance!

- A single flash chip offers great performance
  - e.g., 40 MB/s Read, 10 MB/s Program
  - Random access is as fast as sequential access
  - Low energy consumption
The Bad

The severe constraints of NAND flash chips!

- C1: Program granularity
  - Programs must be performed at flash page granularity (2 KB-16 KB)
- C2: Must erase a block (256 KB-1 MB) before updating a page
- C3: Pages must be programmed sequentially within a block
- C4: Limited lifetime (from 10^4 up to 10^5 erase operations)

[Figure: a block of 256 pages. Program granularity: a page (32 KB). Erase granularity: a block (1 MB). Pages must be programmed sequentially within the block.]
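The four constraints can be made concrete with a toy model of a single block. This is an illustrative sketch, not a chip API: the class name `FlashBlock`, its default geometry and the error messages are assumptions chosen for the example.

```python
class FlashBlock:
    """Toy model of one NAND block that enforces constraints C1-C4."""

    def __init__(self, pages_per_block=256, page_size=4096, max_erases=10_000):
        self.pages_per_block = pages_per_block
        self.page_size = page_size
        self.max_erases = max_erases        # C4: assumed endurance budget
        self.erase_count = 0
        self.pages = [None] * pages_per_block
        self.next_page = 0                  # C3: the only programmable page

    def program(self, page_no, data):
        if not 0 <= page_no < self.pages_per_block:
            raise IndexError("page number out of range")
        if len(data) != self.page_size:
            raise ValueError("C1: a whole page must be programmed at once")
        if page_no < self.next_page:
            raise ValueError("C2: page already programmed, erase the block first")
        if page_no != self.next_page:
            raise ValueError("C3: pages must be programmed sequentially")
        self.pages[page_no] = bytes(data)
        self.next_page += 1

    def erase(self):
        if self.erase_count >= self.max_erases:
            raise RuntimeError("C4: erase budget exhausted, block worn out")
        self.erase_count += 1
        self.pages = [None] * self.pages_per_block
        self.next_page = 0
```

Updating a single page in place therefore costs a full block erase followed by reprogramming every page that must be preserved, which is what the FTL techniques later in the tutorial try to avoid.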

Flash chips

A bit of electronics to understand flash chip constraints and trends

Flash cells

- Flash cell: resembles a semiconductor transistor
  - 2 gates instead of 1
  - Floating gate insulated all around by an oxide layer
- Electrons placed on the floating gate are trapped
- The floating gate will not discharge for many years

[Figure: flash cell, a floating gate transistor: control gate, floating gate, oxide layer, N+ regions, P substrate]

Flash cells: NOR vs NAND

NOR
  - Quick read (Byte)
  - Slow prog. (Byte)
  - Slow erase
  - XIP -> Code

NAND
  - Slower read (Page)
  - Quicker prog. (Page)
  - Quicker erase (Block)
  - Files, data

NAND Flash cells mode of operation

- Programming: apply a high voltage to the control gate
  -> electrons get trapped in the floating gate
- Erasing: apply a high voltage to the substrate
  -> electrons are removed from the floating gate
- Reading: the charge changes the threshold voltage of the cell
  - Single level cell (SLC) stores one bit per cell: charged = 0, not charged = 1
  - Multi level cell (MLC) stores 2 bits per cell (4 levels)
- After a number of program/erase cycles, electrons get trapped in the oxide layer -> end of life of the cell

[Figure: voltages applied for programming and erasing (20 V / 0 V), and a worn-out cell]

NAND Architecture & timings

- Based upon independent blocks (4 million cells here)
- Block: smallest erasable unit
- Page: smallest programmable unit

Geometry & Timings (NAND flash MICRON MLC: MT29F128G08CJABB)

                      MLC
Page Size             4 KB
Block Size            1 MB
Chip Size             16 GB
Read Page (µs)        150
Program Page (µs)     1000
Erase Block (µs)      3000

[Figure: NAND array. Each flash cell has a control gate and a floating gate; 1 page = 34560 bits (4 KB + 224 B); 256 pages/block]
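The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers from the slide; the derived per-page rates consider the array read and program times alone and ignore bus transfer, so they are rough lower-level estimates rather than datasheet throughput. The constant and function names are illustrative.

```python
PAGE_DATA_BYTES = 4 * 1024     # 4 KB of user data per page
PAGE_SPARE_BYTES = 224         # spare area per page
PAGES_PER_BLOCK = 256
BITS_PER_CELL = 2              # MLC
READ_PAGE_US = 150
PROGRAM_PAGE_US = 1000

bits_per_page = (PAGE_DATA_BYTES + PAGE_SPARE_BYTES) * 8
block_bytes = PAGE_DATA_BYTES * PAGES_PER_BLOCK
cells_per_block = bits_per_page * PAGES_PER_BLOCK // BITS_PER_CELL


def mb_per_s(nbytes, micros):
    """Bytes per microsecond equals 10^6 bytes per second (MB/s)."""
    return nbytes / micros


read_rate = mb_per_s(PAGE_DATA_BYTES, READ_PAGE_US)
program_rate = mb_per_s(PAGE_DATA_BYTES, PROGRAM_PAGE_US)

print(bits_per_page, block_bytes, cells_per_block)
print(round(read_rate, 1), round(program_rate, 1))
```

The results agree with the slide: 34560 bits per page, a 1 MB block, and a little over 4 million cells per block.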


Program Disturb

- Some cells not being programmed receive elevated voltage stress (near the cells being programmed)
- Stressed cells can appear weakly programmed

Reducing program disturb:
- Use Error Correction Code to recover errors
- Program pages sequentially within a block

Cooke (FMS 2007)

Impact on flash chip IOs

- Flash cell technology
  - Limited lifetime for entire blocks (when a cell wears out, the entire block is marked as failed)
- NAND layout and structure
  - Block is the smallest erase granularity
- Program disturb
  - Page is the smallest program granularity (! for SLC)
  - Pages must be programmed sequentially within a block
  - Use of ECC is mandatory -> ECC unit is the smallest read unit (generally 1 or ! page)

Flash chips: trends

- Density increases (price decreases)
  - NAND process migration: faster than Moore's Law (today 20 nm)
  - More bits/cell: SLC (1), MLC (2), TLC (3)
- Flash chip layout and structure: larger, parallel
  - Larger blocks (32 -> 256 pages)
  - Larger pages: 512 B (old SLC) -> 16 KB (future TLC)
  - Dual plane flash -> parallelism within the flash chip
- Lifetime decreases
  - 100 000 (SLC), 10 000 (MLC), 5000 (TLC)
- ECC size increases
- Basic performance decreases
  - Compensated by parallelism

Abraham (FMS 2011), StorageSearch.com

The Good

The hardware!

- A single flash chip offers great performance
  - e.g., 40 MB/s Read, 10 MB/s Program
  - Random access is as fast as sequential access
  - Low energy consumption
- A flash device contains many (e.g., 32, 64) flash chips and provides inter-chip parallelism
- Flash devices may include some (power-failure resistant) SRAM
The Bad

The severe constraints of flash chips!

- C1: Program granularity
  - Programs must be performed at flash page granularity
- C2: Must erase a block before updating a page
- C3: Pages must be programmed sequentially within a block
- C4: Limited lifetime (from 10^4 up to 10^6 erase operations)

[Figure: a block of 256 pages. Program granularity: a page (32 KB). Erase granularity: a block (1 MB). Pages must be programmed sequentially within the block.]
And The FTL

The software! The Flash Translation Layer
  - emulates a classical block device and handles flash constraints

[Figure: layers of an SSD. Block device interface (no constraint!): read sector, write sector. FTL: mapping, garbage collection, wear leveling. Flash chips (constraints): read page, program page, erase block, subject to (C1) program granularity, (C2) erase before program, (C3) sequential program within a block, (C4) limited lifetime.]

Flash devices are black boxes!

- Flash devices are not flash chips
  - They do not behave as the flash chips they contain
  - No access to the flash chip API, only to the device API
  - Complex architecture and software, proprietary and not documented

=> Flash devices are black boxes!
=> DBMS design cannot be based on flash chip behavior! We need to understand flash device behavior!

[Figure: the DBMS issues read sector / write sector requests (no constraint!) to the SSD; inside, the FTL (mapping, garbage collection, wear leveling) is marked with a "?" and sits on top of the flash chips (read page, program page, erase block) with constraints C1-C4]

Understanding flash devices behavior

- Define an experimental benchmark which can exhibit the behavior of flash devices.
- Define a broad benchmark
  - No safe assumption can be made on the device behavior (black box)
    - e.g., Random writes are expensive
  - No safe assumption on the benchmark usage!
- Design a sound benchmarking methodology
  - IO cost is highly variable and depends on the whole device history!

Methodology (1): Device state

[Figure: random writes on the Samsung SSD, out of the box vs. after filling the device]

- Enforce a well-defined device state
  - by performing random write IOs of random size on the whole device
  - The alternative, sequential IOs, is less stable, thus more difficult to enforce

Methodology (2): Startup and running phases

- When do we reach a steady state? How long to run each test?

[Figure: startup and running phases for the Mtron SSD (RW); running phase for the Kingston DTI flash drive (SW)]

- Startup and running phase: run experiments to define
  - IOIgnore: number of IOs ignored when computing statistics
  - IOCount: number of measures needed for those statistics to converge
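As a hedged illustration of how the two parameters could be applied to a series of measured response times, the sketch below discards the startup phase and computes statistics over the running phase only. The function name and the choice of mean and standard deviation are assumptions; the slide does not say which statistics are computed.

```python
from statistics import mean, stdev


def running_phase_stats(response_times, io_ignore, io_count):
    """Return (mean, standard deviation) over the running phase.

    The first io_ignore measures (startup phase) are skipped, then
    exactly io_count measures are used.
    """
    window = response_times[io_ignore:io_ignore + io_count]
    if len(window) < io_count:
        raise ValueError("not enough measures for the requested IOCount")
    return mean(window), stdev(window)
```

For example, `running_phase_stats([100, 100, 1, 2, 3], io_ignore=2, io_count=3)` ignores the two slow startup measures and summarizes the remaining three.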

Methodology (3): Interferences

[Figure: response time (log scale, 0.1 to 10) over a series of about 1500 IOs, with phases labelled Sequential Reads, Random Writes, Sequential Reads, and Pause]

- Interferences: introduce a pause between experiments

Results (1): Samsung, Memoright, Mtron

[Figure: granularity for the Memoright SSD]
- For SR, SW and RR:
  - linear behavior, almost no latency
  - good throughputs with large IO size
- For RW, !5ms for a 16KB-128KB IO

[Figure: locality for the Samsung, Memoright and Mtron SSDs]
- When limited to a focused area, RW performs very well

Results (2): Intel X25-E

[Figures: two response time (µs) plots, one of them against IO size (KB)]

- SR, SW and RW have similar performance. RR are more costly!
- RW (16 KB) performance varies from 100 µs to 100 ms!! (x1000)

Results (3): Fusion IO

- Capacity vs performance tradeoff (80 GB -> 22 GB!)
- Sensitivity to device state

[Figure: response time (µs) for IO size = 4 KB, comparing a low-level formatted device with a fully written device; series labels and values lost in extraction]

Conclusion: Flash device behavior

Finally, what is the behavior of flash devices?

Common wisdom:
- Updates in place are inefficient?
- Random writes are slower than sequential ones?
- Better not to fill the whole device if we want good performance?

- Behavior varies across devices and firmware updates
- Behavior depends heavily on the device state!

Is it a problem?

Conclusion: Flash device behavior (2)

- Flash devices are difficult (impossible?) to model!
- Hard to build DBMS design on such moving ground!

Bill Nesheim: Mythbusting Flash Performance (FMS 2011)
- Substantial performance variability
  - Some cases can be even worse than disk
- Performance outliers can have significant adverse impact
- What's needed:
  - Predictable scaling & performance over time
  - Less asymmetry between reads/writes, random/sequential
  - Predictable response time


Opening the black box!

FTL Basic components

[Figure: layers of an SSD. Block device interface (no constraint!): read sector, write sector. FTL: mapping, garbage collection, wear leveling. Flash chips (constraints): read page, program page, erase block, subject to (C1) program granularity, (C2) erase before program, (C3) sequential program within a block, (C4) limited lifetime.]

FTL Page Level Mapping

- Basic page level mapping: translation table stored in SRAM
  [Figure: logical pages mapped to physical pages in Blocks 0-3]
  - Problem: the table is too large! (1 GB for 1 TB of flash with 4 KB pages)
- Demand-based FTL: DFTL (Gupta et al. 2009)
  - The translation table is stored in flash and cached in SRAM
  [Figure: SRAM holds the Global Translation Directory and the Cached Mapping Table; flash holds translation blocks and data blocks]
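The "1 GB for 1 TB" figure and the basic mechanism can be sketched as follows. This is a toy illustration, not DFTL or any real firmware: the 4-byte entry size is an assumption (it is the value that reproduces the slide's 1 GB), and `PageMapFTL` is a hypothetical class that keeps the whole table in memory and never reclaims space.

```python
def mapping_table_bytes(capacity_bytes, page_bytes=4096, entry_bytes=4):
    """Size of a flat page-level translation table (one entry per page)."""
    return capacity_bytes // page_bytes * entry_bytes


class PageMapFTL:
    """Toy page-mapped FTL: every write goes out-of-place to the next free page."""

    def __init__(self, num_physical_pages):
        self.table = {}                               # logical page -> physical page
        self.flash = [None] * num_physical_pages
        self.valid = [False] * num_physical_pages
        self.write_ptr = 0                            # pages are programmed in order

    def write(self, lpn, data):
        if self.write_ptr >= len(self.flash):
            raise RuntimeError("no free page left: garbage collection needed")
        old = self.table.get(lpn)
        if old is not None:
            self.valid[old] = False                   # previous version becomes garbage
        ppn = self.write_ptr
        self.write_ptr += 1
        self.flash[ppn] = data
        self.valid[ppn] = True
        self.table[lpn] = ppn

    def read(self, lpn):
        return self.flash[self.table[lpn]]
```

With page mapping, updates never violate C2 or C3 because they are appended, at the price of a large table and of stale pages that garbage collection must reclaim.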

FTL - Mapping: Block Level / Hybrid

- Pure Block Level Mapping
  - Translation table at block level
  - The page offset remains the same
  - Does not comply with C3!
  [Figure: logical blocks mapped to physical Blocks 0-3]
- Hybrid Mapping
  - Updates done out-of-place in log blocks
  - Data blocks -> block mapping
  - Log blocks -> page mapping
  - Proposals differ in the way log blocks are managed
    - 1 log block for 1 data block -> BAST (Kim et al. 2002)
    - n log blocks for all data blocks -> FAST (Lee et al. 2007)
    - Exploiting locality -> LAST (Lee et al. 2008)
  - Cleaning when log blocks are exhausted -> major costs
  - Block mapping for data blocks does not comply with C3 either!
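A small sketch of block-level address translation, under the assumption of 256 pages per block. `translate` and `table_entries` are hypothetical helper names; the point is only that the page offset is preserved and that the table shrinks by a factor equal to the number of pages per block.

```python
def translate(lpn, block_table, pages_per_block=256):
    """Map a logical page number to (physical block, page offset).

    Only the block number is translated; the offset inside the block
    stays the same, which is why in-place updates clash with C3.
    """
    lbn, offset = divmod(lpn, pages_per_block)
    return block_table[lbn], offset


def table_entries(capacity_bytes, unit_bytes):
    """Number of translation entries for a given mapping unit."""
    return capacity_bytes // unit_bytes


TB = 2 ** 40
page_entries = table_entries(TB, 4 * 1024)       # page-level mapping
block_entries = table_entries(TB, 1024 * 1024)   # block-level mapping
```

For a 1 TB device with 4 KB pages and 1 MB blocks, the block-level table has 256 times fewer entries than the page-level one.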


FTL Garbage Collection

- With page mapping:
  [Figure: valid pages of Block 1 and Block 2 are copied to Block 3; Block 1 and Block 2 are then erased]
- With hybrid mapping: three cases with BAST
  [Figure: the three merge cases for Block 0 and its log block Log(Block0): switch, partial merge and full merge. Each case ends with the erasure of the obsolete block(s) and yields a new Block 0.]
- More complex with FAST
  - pages of the same block can be on different log blocks

FTL - Wear leveling

- Goal: ensure that all blocks of the flash have about the same erase count (i.e., number of program/erase cycles).
- Basic algorithm: hot-cold swapping (Jung et al. 2007)
  - Swap the blocks with min and max erase count.
- Difficulties:
  (1) When to trigger the WL algorithm
  (2) How to manage erase counts, and how to select the min or max erase count block given the limited CPU and memory resources of the flash controller
  (3) What wear leveling strategy?
  (4) Interactions between garbage collection and wear leveling

The same difficulties arise with garbage collection!
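A deliberately naive sketch of hot-cold swapping on top of a block mapping table. It is not the algorithm of Jung et al.: the trigger (a fixed gap between the maximum and minimum erase counts, parameter `threshold`) and the bookkeeping in plain dictionaries are assumptions made for illustration, and the linear scans sidestep difficulty (2) above.

```python
def hot_cold_swap(erase_counts, block_table, threshold):
    """Swap the contents of the most and least erased physical blocks.

    erase_counts: dict physical block -> erase count
    block_table:  dict logical block -> physical block
    Returns True if a swap was performed.
    """
    hot = max(erase_counts, key=erase_counts.get)
    cold = min(erase_counts, key=erase_counts.get)
    if erase_counts[hot] - erase_counts[cold] < threshold:
        return False
    # Redirect the logical blocks: data that lived on the hot block moves
    # to the cold one and vice versa.
    for lbn, pbn in block_table.items():
        if pbn == hot:
            block_table[lbn] = cold
        elif pbn == cold:
            block_table[lbn] = hot
    # Both physical blocks are erased once to receive the swapped data.
    erase_counts[hot] += 1
    erase_counts[cold] += 1
    return True
```

The swap itself costs two erases and two block copies, so how often it is triggered directly trades lifetime balance against performance.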


FTL: Trends

- Mapping: hybrid mapping; detect sequential or semi-random writes; temporal/spatial locality?; caching; adaptivity
- Garbage collection: background / on demand; TRIM management
- Wear leveling: consider hot/cold data; dynamic / static WL
- Also: compression / deduplication; security / encryption

FTL designers vs DBMS designers goals

- Flash device designers' goals:
  - Hide the flash device constraints (usability)
  - Improve the performance for most common workloads
  - Make the device auto-adaptive
  - Mask design decisions to protect their advantage (black box approach)
- DBMS designers' goals:
  - Have a model for IO performance (and behavior)
    - Predictable
    - Clear distinction between efficient and inefficient IO patterns
    - To design the storage model and query processing/optimization strategies
  - Reach best performance, even at the price of higher complexity (having full control over actual IOs)

These goals are conflicting!


Minimal FTL: Take the FTL out of the equation!

The FTL provides only wear leveling, using block mapping to address C4 (limited lifetime)

- Pros
  - Maximal performance for
    - SR, RR, SW
    - Semi-random writes
  - Maximal control for the DBMS
- Cons
  - All complexity is handled by the DBMS
  - All IOs must follow C1-C3
    - The whole DBMS must be rewritten
    - The flash device is dedicated

[Figure: the DBMS issues constrained patterns only (C1, C2, C3) to a minimal flash device, which performs block mapping and wear leveling (C4) on top of the flash chips, subject to (C1) write granularity, (C2) erase before program, (C3) sequential program within a block, (C4) limited lifetime]

Semi-random writes (uFLIP [CIDR09])

- Inter-block: random
- Intra-block: sequential
- Example with 3 blocks of 10 pages:

[Figure: IO address (0 to 30) as a function of time. IO addresses as recovered from the figure, in order: 0 10 11 1 20 21 22 2 23 24 12 3 13 14 4 25 26 15 5 16 27 7 17 18 19 28 8 29]
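One possible way to check whether a sequence of page writes is semi-random in the sense above. The definition used here is an assumption: every block starts erased, and each write to a block must target the offset immediately after the previous write to that same block. `is_semi_random` is a hypothetical helper, not part of uFLIP.

```python
def is_semi_random(addresses, pages_per_block):
    """True if writes are sequential inside each block, in any block order."""
    next_offset = {}                      # block -> next offset allowed by C3
    for addr in addresses:
        block, offset = divmod(addr, pages_per_block)
        if offset != next_offset.get(block, 0):
            return False
        next_offset[block] = offset + 1
    return True
```

With 10 pages per block, the prefix `0 10 11 1 20 21 22 2` of the slide's example interleaves three blocks freely yet passes the check, while a pattern such as `0 2` (skipping a page) does not.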


Bimodal FTL: a simple idea

- Bimodal flash devices:
  - Provide a tunnel for those IOs that respect constraints C1-C3, ensuring maximal performance
  - Manage other, unconstrained IOs in best effort
  - Minimize interferences between these two modes of operation
- Pros
  - Flexible
  - Maximal performance and control for the DBMS for constrained IOs
- Cons
  - No behavior guarantees for unconstrained IOs

[Figure: the DBMS sends constrained patterns (C1, C2, C3) and unconstrained patterns to a bimodal flash device. Unconstrained patterns go through page mapping and garbage collection (C1, C2, C3); all IOs go through block mapping and wear leveling (C4), on top of the flash chips, subject to (C1) program granularity, (C2) erase before program, (C3) sequential program within a block, (C4) limited lifetime]

Bimodal FTL: easy to implement

- Constrained IOs lead to optimal blocks
  [Figure: an optimal block (Flag = Optimal, CurPos = 6) holds Page 0, Page 1, Page 2, Page 3, Page 4, Page 5 in order; a non-optimal block (Flag = Non-Optimal, CurPos = 6) holds Page 0, Page 1, Page 1, Page 1, Page 0, Page 2]
- Optimal blocks can be trivially
  - mapped using a small map table in safe cache
  - detected using a flag and cursor in safe cache
  (16 MB for a 1 TB device)
- No interferences!
- No change to the block device interface:
  - Need to expose two constants: block size and page size

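Bimodal FTL: better than Minimal + FTL

- A non-optimal block can become optimal (thanks to GC)

[Figure: block state diagram.
 Free (CurPos = 0) -> Optimal: write at @ = CurPos, then CurPos++
 Optimal -> Optimal: write at @ = CurPos, then CurPos++
 Optimal -> Non-optimal: write at @ != CurPos
 Non-optimal -> Optimal: garbage collector actions
 Optimal or Non-optimal -> Free: TRIM
 Example: a non-optimal block (Page 0, Page 1, Page 1, Page 1, Page 0, Page 2; CurPos = 6) becomes an optimal block (Page 0, Page 1, Page 2; CurPos = 3) after garbage collection.]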
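The block states on this slide can be read as a small state machine. The sketch below is one interpretation of the figure, not the authors' implementation: `BimodalBlock` is a hypothetical class, `cur_pos` counts the physical pages programmed so far, and garbage collection is modeled as compacting the block down to `valid_pages` pages stored in order.

```python
FREE, OPTIMAL, NON_OPTIMAL = "free", "optimal", "non-optimal"


class BimodalBlock:
    """Flag and cursor kept per block to detect constrained (optimal) usage."""

    def __init__(self, pages_per_block):
        self.pages_per_block = pages_per_block
        self.state = FREE
        self.cur_pos = 0

    def write(self, offset):
        if self.cur_pos >= self.pages_per_block:
            raise RuntimeError("block full: TRIM or garbage collection needed")
        if self.state != NON_OPTIMAL and offset == self.cur_pos:
            self.state = OPTIMAL          # write at @ = CurPos
        else:
            self.state = NON_OPTIMAL      # write at @ != CurPos
        self.cur_pos += 1                 # one more physical page programmed

    def trim(self):
        self.state = FREE
        self.cur_pos = 0

    def garbage_collect(self, valid_pages):
        """Assumed GC effect: valid pages end up at offsets 0..valid_pages-1."""
        self.state = OPTIMAL if valid_pages else FREE
        self.cur_pos = valid_pages
```

Replaying the slide's example (writes to pages 0, 1, 1, 1, 0, 2) leaves the block non-optimal with a cursor of 6; collecting it with three valid pages brings it back to the optimal state with a cursor of 3.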


Impact on DBMS Design

Using bimodal flash devices, we have a solid basis for designing efficient DBMS on flash:

- What IOs should be constrained?
  - i.e., what part of the DBMS should be redesigned?
- How to enforce these constraints? Revisit literature:
  - Solutions based on flash chip behavior enforce C1-C3 constraints
  - Solutions based on existing classes of devices might not.

Example: Hash Join on HDD

[Figures: one pass partitioning; multi-pass partitioning (2 passes)]

Tradeoff: IOSize vs memory consumption

- IOSize should be as large as possible, e.g., 256 KB - 1 MB
  - To minimize IO cost when writing or reading partitions
- IOSize should be as small as possible
  - To minimize memory consumption: one pass partitioning needs 2 x IOSize x NbPartitions in RAM
  - Insufficient memory -> multi-pass -> performance degrades!
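The memory requirement is easy to evaluate for concrete settings. The formula is the slide's; the 100 partitions in the example are a hypothetical value chosen only to make the numbers tangible.

```python
KB = 1024


def one_pass_partition_memory(io_size_bytes, nb_partitions):
    """RAM needed for one pass partitioning: 2 x IOSize x NbPartitions."""
    return 2 * io_size_bytes * nb_partitions


# Hypothetical example: 100 partitions.
large_io = one_pass_partition_memory(256 * KB, 100)   # block-sized IOs
small_io = one_pass_partition_memory(4 * KB, 100)     # page-sized IOs

print(large_io // KB, "KB vs", small_io // KB, "KB")
```

With 256 KB IOs the partitioning buffers need 50 MB of RAM, against 800 KB with 4 KB IOs, a factor of 64 that decides whether a single pass fits in memory.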

Hash join on SSD and on bimodal SSD

- With non-bimodal SSDs
  - No behavior guarantees, but
  - choosing IOSize = block size (256 KB - 1 MB) should bring good performance
- With bimodal SSDs
  - Maximal performance is guaranteed (constrained patterns)
  - Use semi-random writes
  - IOSize can be reduced down to page size (4 - 16 KB) with no penalty
    - Memory savings
    - Performance improvement
    - Predictability!

Summary

- Flash chips
  - Performance & energy consumption
  - Wired in parallel in flash devices
- Hardware constraints!
  (C1) Program granularity, (C2) Erase before program,
  (C3) Sequential program within a block,
  (C4) Limited lifetime
- FTL: a complex piece of software
  - Constantly evolving, no common behavior
  - Hard to model
  - Hard to build a consistent DBMS design!

Conclusion: DBMS Design?

Complex FTLs over HW constraints -> Unpredictable performance -> No stable design
Bimodal (simple FTLs alongside complex FTLs) over HW constraints -> Predictable & optimal -> Stable design

Adding bimodality does not hinder competition between flash device manufacturers; they can
- bring down the cost of constrained IO patterns (e.g., using parallelism)
- bring down the cost of unconstrained IO patterns without jeopardizing DBMS design

Tutorial Outline
1. Introduction (Philippe)
2. Flash devices characteristics (Luc)
3. Data management for flash devices (Stratis)
4. Two outlooks (Stratis & Philippe)


Flash: a disruptive technology

- Orders of magnitude better performance than HDD
- Low power consumption
- Dropping prices

- Idea: Throw away HDDs and replace everything with Flash SSDs
  - Not enough capacity
  - Not enough money to buy the not-enough-capacity
- However, Flash fits very well between DRAM and HDD
  - DRAM/Flash/HDD price ratio: ~100:10:1 per GB
  - DRAM/Flash/HDD latency ratio: ~1:10:100
- Integrate Flash into the storage hierarchy
  - complement existing DRAM memory and HDDs

Outline

- Flash-based device design
  - Solid state drives
  - Making SSDs database-friendly
- System-level challenges
  - Hybrid systems
  - Storage, buffering and caching
  - Indexing on flash
  - Query and transaction processing


Flash-based Solid State Drives

- Common I/O interface
  - Block-addressable interface
- No mechanical latency
  - Access latency independent of the access pattern
  - 30 to 50 times more efficient in IOPS/$ per GB than HDDs
- Read / Write asymmetry
  - Reads are faster than writes
  - Erase-before-write limitation
- Limited endurance / wear leveling
  - 5 year warranty for enterprise SSDs (assuming 10 complete re-writes per day)
- Energy efficiency
  - 100 - 200 times more efficient than HDDs in IOPS / Watt
- Physical properties
  - Resistance to extreme shock, vibration, temperature, altitude
  - Near-instant start-up time

SSD architecture

- Various form factors
  - PCI-e
  - SAS (1.8" - 3.5")
  - SATA (1.8" - 3.5")
- Number of channels
  - 4 to 16 or more
- RAM buffers
  - 1 MB up to more than 256 MB
- Over-provisioning
  - 10% up to 40%
- Command parallelism
  - Intra-command
  - Inter-command
  - Reordering / merging (NCQ)

[Figure: SSD architecture. A host interface feeds a request handler; a micro-processor with RAM and a data buffer drives several channel controllers (each with ECC), and each channel connects to multiple flash chips. Firmware (Flash Translation Layer - FTL): LBA-to-PBA mapper, write page allocator, garbage collector, wear leveling, together with the LBA-to-PBA map, bad block list, free block queue and meta data cache.]

Off-the-shelf SSDs

Reference points: 15k RPM SAS HDD and 7.2k RPM SATA HDD (their IOPS values are not recoverable from the extraction)

                         A            B            C            D            E
Form Factor              PATA drive   SATA drive   SATA drive   SAS drive    PCI-e card
                         Consumer     Consumer     Consumer     Enterprise   Enterprise
Flash Chips              MLC          MLC          MLC          SLC          SLC
Capacity                 32 GB        100 GB       160 GB       140 GB       450 GB
Read Bandwidth           53 MB/s      285 MB/s     250 MB/s     220 MB/s     700 MB/s
Write Bandwidth          28 MB/s      250 MB/s     100 MB/s     115 MB/s     500 MB/s
Random 4 kB Read IOPS    3.5k         30k          35k          45k          140k
Random 4 kB Write IOPS   0.01k        10k          0.6k         16k          70k
Street Price             ~15 $/GB     ~4 $/GB      ~2.5 $/GB    ~18 $/GB     ~38 $/GB
                         (2007)       (2010)       (2010)       (2011)       (2009)

Random read IOPS: ~1 order of magnitude. Random write IOPS: > 2 orders of magnitude.

Read latency

4 KB random reads uniformly distributed over the whole medium, 50% random data

[Figure: latency (ms, 0 to 1) as a function of IOPS (0 to 160000)]

Write latency

4 KB random writes uniformly distributed over the whole medium, 50% random data

[Figure: latency (ms, 0 to 3) as a function of IOPS (0 to 45000)]

Mixed workload - Read latency

4 KB I/O operations uniformly distributed over the whole medium, 50% random data, queue depth = 32

[Figure: average read latency. Response time (ms, 0 to 4) as a function of the % of writes (0% to 100%), for devices B, C, D and E]

Mixed workload - Write latency

4 KB I/O operations uniformly distributed over the whole medium, 50% random data, queue depth = 32

[Figure: average write latency. Response time (ms, 0 to 12) as a function of the % of writes (0% to 100%), for devices B, C, D and E, with a zoomed-in panel (0 to 0.6 ms)]


In-page logging
[Lee & Moon, SIGMOD 2007]
- Page updates are logged
  - Log sector for each DB page
    - Allocated when the page becomes dirty
  - Log region in each flash block
  - Page write-backs only involve log-sector writes
    - Until a merge is required
- Upon read:
  - Fetch log records from flash
  - Apply them to the in-memory page
- Same or greater number of writes
  - But significant reduction of erasures
- However:
  - The DBMS needs to control physical placement
  - Partial flash page writes are involved
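The in-page logging bullets can be illustrated with a toy model. This is only a sketch under stated assumptions, not the implementation from Lee & Moon: the class `FlashBlock`, the `log_capacity` of four records and the dictionary page format are all hypothetical, and a merge is simply counted as one erase.

```python
from dataclasses import dataclass, field


@dataclass
class FlashBlock:
    """One erase unit: data pages plus a small log region (hypothetical layout)."""
    pages: dict = field(default_factory=dict)  # page_id -> {key: value}
    log: list = field(default_factory=list)    # (page_id, key, value) log records
    log_capacity: int = 4                      # assumed size of the log region
    erases: int = 0

    def write_back(self, page_id, updates):
        """Flush only the log records of a dirty page; merge when the log region is full."""
        for key, value in updates.items():
            if len(self.log) == self.log_capacity:
                self.merge()
            self.log.append((page_id, key, value))

    def merge(self):
        """Apply all log records to the data pages and rewrite the block (one erase)."""
        for page_id, key, value in self.log:
            self.pages.setdefault(page_id, {})[key] = value
        self.log.clear()
        self.erases += 1

    def read(self, page_id):
        """Upon read: fetch the stored page, then apply its pending log records."""
        image = dict(self.pages.get(page_id, {}))
        for pid, key, value in self.log:
            if pid == page_id:
                image[key] = value
        return image


block = FlashBlock()
block.write_back(1, {"a": 1})
block.write_back(1, {"a": 2, "b": 3})
block.write_back(2, {"x": 9})
block.write_back(2, {"y": 0})   # log region was full, so this triggers one merge
```

In this toy run five record updates cost a single erase, which is the trade the slide describes: the number of writes does not go down, but erasures do.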

Parallelism?
- Still, some operations are more efficient on hardware
  - Mapping of the address space to flash planes, dies and channels
  - ECC, encryption etc.
  - Wear-leveling still needs to be done by the device firmware
- The internal device geometry is critical to achieve maximum parallelism
  - The DBMS needs to be aware of the geometry to some degree
- Logical block size ≠ flash block size
  - Logical block size relevant to number of channels, pipelining capabilities on each channel, etc.
[Figure: logical block (0 1 2 3) mapped onto channels, each with a channel controller, ECC and flash chips]

SSDs: summary
- Flash memory has the potential to remove the I/O bottleneck
  - Especially for read-dominated workloads
- "SSD": multiple classes of devices
  - Excellent random read latency is universal
  - Read and write bandwidth varies widely
  - Dramatic difference across random write latencies
  - Dramatic differences in terms of economics: $/GB cost, power consumption, expected lifetime, reliability
- A lot of research to be done towards defining DBMS-specific interfaces


Storage, buffering and caching
[Figure: buffer pool (demand paging, eviction), SSD cache, SSD persistent storage, HDD persistent storage]

Flash memory for persistent storage
[Figure: same diagram (buffer pool, SSD cache, SSD persistent storage, HDD persistent storage)]

Hybrid storage layer
[Figure: same diagram]

Flash memory as cache
[Figure: same diagram]


Hybrid systems
- Problem setup
  - SSDs are becoming cost-effective, but still not ready to replace HDDs in the enterprise
  - Certainly conceivable to have both SSDs and HDDs at the same level of the storage hierarchy
- Research questions
  - How can we take advantage of the SSD characteristics when designing a database?
  - How can we optimally place data across both types of medium?
- Methodologies
  - Workload detection for data placement
  - Load balancing to minimize response time
  - Caching data between disks

Workload-driven page placement
[K & V, VLDB 2008]
- Flash memory and HDD at the same level of the storage hierarchy
- Monitor page use by keeping track of reads and writes
  - Logical operations (i.e., references only)
  - Physical operations (actually touching the disk)
  - Hybrid model (logical operations manifested as physical ones)
- Identify the workload of a page and appropriately place it
  - Read-intensive pages on flash
  - Write-intensive pages on HDD
  - Migrate pages once they have expensed their cost, if erroneously placed
[Figure: user-level page I/O reaches the buffer manager (replacement policy) as logical I/O operations; requests for physical I/O go to the storage manager, which issues physical I/O operations and page migrations to the SSD and HDD of the storage layer. A 2-state task system (flash and HDD states) models the read/write cost of each operation]
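The placement rule (read-intensive pages on flash, write-intensive pages on HDD, migrate only when it pays off) can be sketched as a cost comparison. The sketch below is an illustration, not the algorithm of the cited paper: the `COSTS` table holds made-up per-operation costs in milliseconds, and `should_migrate` uses a simple "saved cost exceeds migration cost" test as a stand-in for the paper's task-system analysis.

```python
# Hypothetical per-page I/O costs in milliseconds; not measured values.
COSTS = {
    "flash": {"read": 0.1, "write": 8.0},
    "hdd": {"read": 5.0, "write": 5.0},
}


def page_cost(reads, writes, medium, costs=COSTS):
    """Total I/O cost a page with this read/write history incurs on a medium."""
    return reads * costs[medium]["read"] + writes * costs[medium]["write"]


def preferred_medium(reads, writes, costs=COSTS):
    """Medium on which the observed workload of the page is cheapest."""
    return min(costs, key=lambda m: page_cost(reads, writes, m, costs))


def should_migrate(reads, writes, current, migration_cost, costs=COSTS):
    """Migrate only if the cost saved so far exceeds the (assumed) migration cost."""
    best = preferred_medium(reads, writes, costs)
    saved = page_cost(reads, writes, current, costs) - page_cost(reads, writes, best, costs)
    return best != current and saved > migration_cost


read_heavy = preferred_medium(reads=100, writes=5)
write_heavy = preferred_medium(reads=5, writes=100)
```

With these assumed costs a page with 100 reads and 5 writes is placed on flash, while the reverse mix is placed on the HDD.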

Object placement
[Canim, Mihaila, Bhattacharjee, Ross & Lang, VLDB 2009]
- Hybrid disk setup
  - Offline tool
  - Optimal object allocation across the two types of disk
- Two phases
  - Profiling: start with all objects on the HDD and monitor system use
  - Decision: based on profiling statistics, estimate the performance gained from moving each object from the HDD to the SSD
- Reduce the decision to a knapsack problem and apply greedy heuristics
- Implemented in DB2
[Figure: the database engine and its buffer pool monitor report reads/writes for the workload; the object placement advisor combines them with device parameters and an SSD budget, ranks objects by performance gain up to a cut-off point, and assigns them to the SSD or HDD of the storage system]
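The decision phase can be pictured as a greedy knapsack heuristic: rank objects by estimated gain per unit of SSD space and admit them until the budget is exhausted. The sketch below assumes that reading of the slide; the function name `place_objects`, the object names and all numbers are hypothetical, and the actual advisor in the cited work may rank differently.

```python
def place_objects(objects, ssd_budget):
    """Greedy knapsack heuristic.

    objects: list of (name, size, gain) tuples, where gain is the estimated
    performance gained by moving the object from the HDD to the SSD.
    Returns the names chosen for the SSD, in the order they were admitted.
    """
    # Rank by gain density (gain per unit of SSD space), best first.
    ranked = sorted(objects, key=lambda o: o[2] / o[1], reverse=True)
    chosen, used = [], 0
    for name, size, gain in ranked:
        if used + size <= ssd_budget:   # admit only if it still fits the budget
            chosen.append(name)
            used += size
    return chosen


# Hypothetical profiling output: (object, size in GB, estimated gain in seconds).
profile = [
    ("orders", 40, 120.0),
    ("lineitem_idx", 10, 50.0),
    ("history", 60, 60.0),
    ("temp", 20, 30.0),
]
on_ssd = place_objects(profile, ssd_budget=70)
```

Objects that do not fit stay on the HDD; a greedy ranking is not guaranteed to be optimal, which is why the slide speaks of heuristics.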

Write caching
[Soundararajan, Prabhakaran, Balakrishnan, & Wobber, FAST 2010]
[Holloway PhD Thesis, UW-Madison, 2009]
- SSD for primary storage, auxiliary HDD
- Take advantage of better HDD write performance to extend SSD lifetime and improve write throughput
- Writes are pushed to the HDD
  - Log structure ensures sequential writes
  - Fixed log size
  - Once the log is full, merge writes back to the SSD
[Figure: writes go to an HDD-resident log, reads go to the SSD, and the log is merged into the SSD]

Load balancing to maximize throughput
[Wu & Reddy, MASCOTS 2010]
- Setting consists of a transaction processing system with both types of disk
- Objective is to balance the load across media
  - Achieved when the response times across media are equal, i.e., a Wardrop equilibrium
- Algorithms to achieve this equilibrium
  - Page classification (hot or cold)
  - Page allocation and migration
[Figure: storage management layer with a hot/cold data classifier, cache, indirect mapping table (logical address to device and physical address), operation redirector, data (re)locator, device performance monitor and policy configuration, on top of the HDD and SSD devices]


Buffering in main memory
- Problem setup
  - Flash memory is used for persistent storage
  - Typical on-demand paging
- Research questions
  - Which pages do we buffer?
  - Which pages do we evict and when?
- Methodologies
  - Flash memory size alignment
  - Cost-based replacement
  - Write scheduling

Block padding LRU (BPLRU)
[Kim & Ahn, FAST 2008]
- Manages the on-disk RAM buffer
- Data blocks are organized at erase-unit granularity
- LRU queue is on data blocks
  - On reference, move the entire block to the head of the queue
  - On eviction, sequentially write the entire block
[Figure: LRU queue of blocks Blk 0 to Blk 3 with their buffered sectors, from the MRU block to the LRU block; when logical sector 11 is referenced its whole block moves to the MRU position, and the logical sectors of the victim block are written out]
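The block-level queue can be sketched with an ordered dictionary keyed by block number. This is a minimal illustration of "LRU on data blocks" only; padding and LRU compensation are not modelled, and the class name `BlockLRU`, the capacity and the sectors-per-block value are assumptions.

```python
from collections import OrderedDict


class BlockLRU:
    """Write buffer that tracks recency per erase-unit block, not per sector."""

    def __init__(self, capacity_blocks, sectors_per_block):
        self.capacity = capacity_blocks
        self.spb = sectors_per_block
        self.blocks = OrderedDict()  # block_no -> buffered sectors; last entry is the MRU block
        self.flushed = []            # (block_no, sorted sectors) written out, kept for inspection

    def write(self, sector):
        block_no = sector // self.spb
        if block_no not in self.blocks and len(self.blocks) == self.capacity:
            # Evict the LRU block and write all of its sectors in one sequential pass.
            victim, sectors = self.blocks.popitem(last=False)
            self.flushed.append((victim, sorted(sectors)))
        self.blocks.setdefault(block_no, set()).add(sector)
        self.blocks.move_to_end(block_no)   # the whole block becomes most recently used


buf = BlockLRU(capacity_blocks=2, sectors_per_block=4)
for s in (1, 9, 2, 5):   # blocks 0, 2, 0, 1
    buf.write(s)
```

Writing sector 2 refreshes all of block 0, so block 2 becomes the victim when block 1 arrives.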

BPLRU: Further optimizations
[Kim & Ahn, FAST 2008]
- Use of padding
  - If a data block to be written has not been fully read, read what's missing and write sequentially
- LRU compensation
  - Sequentially written blocks are moved to the end of the LRU queue
  - Least likely to be written in the future
[Figure: for padding, the FTL reads the missing sectors and replaces the data block in one sequential write; for LRU compensation, a sequentially written block moves from the MRU end to the LRU end of the block queue]

Cost-based replacement
- Choice of victim depends on probability of reference (as usual)
- But the eviction cost is not uniform
  - Clean pages bear no write cost, dirty pages result in a write
  - I/O asymmetry: writes more expensive than reads
- It doesn't hurt if we misestimate the heat of a page
  - So long as we save (expensive) writes
- Key idea: combine LRU-based replacement with cost-based algorithms
  - Applicable both in SSD-only as well as hybrid systems

Clean first LRU (CFLRU)
[Park, Jung, Kang, Kim & Lee, CASES 2006]
- Buffer pool divided into two regions
  - Working region: business as usual
  - Clean-first region: candidates for eviction
  - Number of candidates is called the window size W
- Always evict from the clean-first region
  - Evict clean pages before dirty ones to save write cost
- Improvement: Clean-First Dirty-Clustered [Ou, Härder & Jin, DaMoN 2009]
  - Cluster dirty pages of the clean-first region based on spatial proximity
[Figure: pages P1 to P8 ordered from MRU to LRU, split into a working region and a clean-first region of window size W; dirty and clean pages are marked]
LRU order: P8, P7, P6, P5
CFLRU order: P7, P5, P8, P6
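The two eviction orders on the slide can be reproduced with a few lines. The sketch assumes a window size of 4 and that P5 and P7 are clean while P6 and P8 are dirty, which is consistent with the orders listed on the slide; the function names are hypothetical.

```python
def cflru_order(pages, window):
    """Eviction order inside the clean-first region.

    pages: list of (page_id, dirty) ordered from MRU to LRU.
    window: size W of the clean-first region at the LRU end.
    Clean pages are evicted first (in LRU order), then dirty ones.
    """
    region = pages[max(0, len(pages) - window):]
    region = list(reversed(region))                  # LRU-most page first
    clean = [pid for pid, dirty in region if not dirty]
    dirty = [pid for pid, is_dirty in region if is_dirty]
    return clean + dirty


def cflru_victim(pages, window):
    """Next page to evict: the least recently used clean page of the window, if any."""
    return cflru_order(pages, window)[0]


# MRU -> LRU; dirty flags for P1..P4 are arbitrary, they lie outside the window.
pool = [("P1", True), ("P2", False), ("P3", True), ("P4", True),
        ("P5", False), ("P6", True), ("P7", False), ("P8", True)]
order = cflru_order(pool, window=4)
```

Plain LRU would evict P8 first and pay for a write; CFLRU evicts the clean P7 instead and keeps the dirty pages buffered longer.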

Cost-based replacement in hybrid systems
[K & V, VLDB 2008]
- Similar to the previous idea, but for hybrid setups
  - SSD and HDD for persistent storage
- Divide the buffer pool into two regions
  - Time region: typical LRU
  - Cost region: four LRU queues, one per cost class
    - Clean flash
    - Clean magnetic
    - Dirty flash
    - Dirty magnetic
  - Order queues based on cost
- Evict from time region to cost region
- Final victim is always from the cost region
[Figure: buffer pool split into a time region and a cost region whose queues are ordered by cost]

Append and pack
[Stoica, Athanassoulis, Johnson & Ailamaki, DaMoN 2009]
- Convert random writes to sequential ones
  - Shim layer between storage manager and SSD
  - On eviction, group dirty pages in blocks that are multiples of the erase unit
  - Do not overwrite old versions; instead write the block sequentially
  - Invalidate old versions
- Pay the price of a few extra reads but save the cost of random writes
[Figure: random writes enter the shim storage manager layer, which groups them, writes them sequentially to SSD persistent storage and invalidates old versions]
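The shim layer can be modelled as an append-only log plus a location map. The sketch below is a toy under stated assumptions, not the cited system: the class name `AppendAndPack`, the grouping threshold `pages_per_block` and the in-memory representation of flash are all hypothetical, and the packing (garbage collection) step is left out.

```python
class AppendAndPack:
    """Shim layer that turns random page writes into sequential appends."""

    def __init__(self, pages_per_block):
        self.ppb = pages_per_block
        self.log = []        # flash modelled as an append-only list of (page_id, data)
        self.location = {}   # page_id -> log index of the latest version
        self.invalid = set() # log indexes holding superseded versions
        self.pending = {}    # evicted dirty pages waiting to be grouped

    def evict(self, page_id, data):
        """Buffer a dirty page; flush once a full block's worth has been grouped."""
        self.pending[page_id] = data
        if len(self.pending) >= self.ppb:
            self.flush()

    def flush(self):
        """Append the group sequentially and invalidate old versions, never overwrite."""
        for page_id, data in self.pending.items():
            if page_id in self.location:
                self.invalid.add(self.location[page_id])
            self.location[page_id] = len(self.log)
            self.log.append((page_id, data))
        self.pending.clear()

    def read(self, page_id):
        """Reads go through the location map (the extra indirection is the price paid)."""
        if page_id in self.pending:
            return self.pending[page_id]
        return self.log[self.location[page_id]][1]


shim = AppendAndPack(pages_per_block=2)
shim.evict(7, "a")
shim.evict(3, "b")   # first group written sequentially
shim.evict(7, "c")
shim.evict(9, "d")   # second group; the old version of page 7 is invalidated
```

A real shim would also reclaim the invalidated slots by packing still-valid pages into fresh blocks.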

Caching in flash memory
- Problem setup
  - SSD and HDD at different levels of the storage hierarchy
  - Flash memory used as a cache for HDD pages
- Research questions
  - When and how to use the SSD as a cache?
  - Which pages to bring into the cache?
  - How to choose victim pages?
- Methodologies
  - Optimal choice of hardware configuration
  - SSD as a read cache
  - Flash-resident extended buffer pools

Incorporating SSDs into the enterprise
[Narayanan, Thereska, Donnelly, Elnikety & Rowstron, EuroSys 2009]
- Disciplined way of introducing SSD storage
  - Migration from HDDs to SSDs
  - Requirements and models to solve the configuration problem analytically
- Simply replacing HDDs with SSDs is not cost-effective
- SSDs are best used as a cache
  - 2-tiered architecture
  - Log and read cache on the SSDs, data on the HDDs
  - But even then the benefits are limited
[Figure: traces yield workload requirements and benchmarks/specs yield device models; a solver combines them with the objectives to produce a configuration. In the 2-tiered configuration, writes go to a write-ahead log and reads to a read cache on the SSD tier, above the HDD tier]

The ZFS perspective
[Leventhal, Commun. ACM, 51(7) 2008]
- Multi-tiered architecture
  - Combination of logging and read caching
- Flash memory is good for large sequential writes in an append-only fashion (no updates)
  - Also good as a read cache for HDD data
- Evict-ahead policy
  - Aggregate changes from memory and predictively push them to flash to amortize high write cost
[Figure: the buffer pool sends log operations to a log and aggregates changes that are predictively pushed to a read cache, both on the SSD tier, above the HDD tier]

DBMS buffer pool extensions
[Canim, Mihaila, Bhattacharjee, Ross & Lang, PVLDB 2010]
[Holloway PhD Thesis, UW-Madison, 2009]
- Again a multi-tiered approach
- Policies and algorithms for caching in flash-resident buffer pools
- SSD as secondary buffer pool
  - Temperature-based eviction policy from memory to SSD
  - Pages are cached only if hot enough
- Algorithms for syncing data across the caches
[Figure: the CPU/cache reads and writes records in the main memory buffer pool; pages are read from and written to the SSD buffer pool under temperature-based replacement; pages not on the SSD are read from the HDD with logical regions; dirty pages are written to the HDD and the SSD copy is updated]

Putting it all together
[K & V, ADBIS 2011]
- Extensive study of using flash memory as cache
- Page flow schemes dictate how data migrates across the tiers
  - Inclusive: data in memory is also on flash
  - Exclusive: no page is both in memory and on flash
  - Lazy: an in-memory page may or may not be on flash depending on external criteria
- Cost model predicts how a combination of workload and scheme will behave on a given configuration
- No magic combination; different schemes for different workloads and different HDDs and SSDs
[Figure: buffer pool, SSD cache, HDD persistent storage]


Indexing
- Problem setup
  - While presenting the same I/O interface as HDDs, flash memory has radically different characteristics
  - I/O asymmetry, erase-before-write limitation
- Research questions
  - How should we adapt existing indexing approaches?
  - How can we design efficient secondary storage indexes, potentially for more than one metric?
- Methodologies
  - Avoid expensive operations when updating the index
  - Self-tuning indexing, catering for flash-resident data
  - Combine SSDs and HDDs for increased throughput

Semi-random writes
[Nath & Gibbons, VLDB 2008]
- Starting point is studying typical write access patterns in the context of sampling
- Fact: random writes hurt performance
- But careful analysis of a typical workload shows that writes are rarely completely random
- Rather, they are semi-random
  - Randomly dispatched across blocks, sequentially written within a block
  - Similar to the locality principles of memory access
- Take advantage of this at the structure design level and when issuing writes
- Bulk writes to amortize write cost
[Figure: writes to Block 1, Block 2, Block 3, ..., Block n over time; writes seemingly randomly dispatched in time, but actually written sequentially within a block]

FlashDB: self-tuning B+-tree
[Nath & Kansal, IPSN 2007]
- Reads are cheap, writes are potentially expensive if random
- Two modes for B+-tree nodes
  - Disk mode: node is primarily read
  - Log mode: node is primarily updated; instead of overwriting, maintain log entries for the node and reconstruct on demand
- Translation layer presents a uniform interface for both modes
- System switches between modes by monitoring use
- Similar logging approach in [Wu, Kuo & Chang, ACM Trans. on Embedded Systems, 6(3), 2007]
  - Difference is in when and how writes are applied
  - Buffered first, then batched and applied by the B+-tree FTL
[Figure: in log mode, a B+-tree node is reconstructed by merging node data with its log entries. A 2-state task system (disk mode, log mode) charges each read/write operation in the current mode, with a migration cost from disk to log and from log to disk]

Making B+-trees flash-friendly (the FD-tree)
[Li, He, Luo & Yi, ICDE 2009]
- Multiple levels in the index
  - Each progressive level of double size
  - So long as we are not updating in place, and also performing large sequential writes to amortize the cost, it's all good
- FD-tree levels are sorted runs
  - Head-tree (first levels) in main memory
  - Once a lower level exceeds its capacity it is merged with the next one
  - Special entries (fences) are used to maintain the structure and deal with potential skew
[Figure: each level is a sorted run; when a level is full, it is merged with the lower one and written sequentially]
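The level structure can be illustrated with a strongly simplified model: sorted runs whose capacity doubles per level, with a merge into the next level whenever one overflows. This sketch deliberately omits fences and the real head-tree, so it is not the FD-tree itself; `LeveledIndex`, `head_capacity` and the write counter are hypothetical names introduced for illustration.

```python
import bisect
from heapq import merge


class LeveledIndex:
    """Sorted runs of doubling capacity; overflow is merged into the next level."""

    def __init__(self, head_capacity=2):
        self.head_capacity = head_capacity
        self.levels = [[]]            # levels[0] plays the role of the in-memory head
        self.sequential_writes = 0    # one large sequential write per merge

    def capacity(self, i):
        return self.head_capacity * (2 ** i)

    def insert(self, key):
        bisect.insort(self.levels[0], key)   # never update lower levels in place
        i = 0
        while len(self.levels[i]) > self.capacity(i):
            if i + 1 == len(self.levels):
                self.levels.append([])
            # Merge two sorted runs and write the result out sequentially.
            self.levels[i + 1] = list(merge(self.levels[i + 1], self.levels[i]))
            self.levels[i] = []
            self.sequential_writes += 1
            i += 1

    def contains(self, key):
        """Binary-search each sorted run, from the head downwards."""
        for run in self.levels:
            j = bisect.bisect_left(run, key)
            if j < len(run) and run[j] == key:
                return True
        return False


index = LeveledIndex(head_capacity=2)
for k in (5, 1, 9, 3, 7, 2):
    index.insert(k)
```

Six inserts trigger three merges here, each of which would be one sequential write on flash rather than several in-place page updates.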

Spatial indexing
- Similar observations as with B+-trees can be made on R-trees
  - They're tree indexes after all
  - Lesson is largely the same: one needs to carefully craft the structure and its algorithms for the new medium
  - Batch updates to amortize write costs [Wu, Chang & Kuo, GIS 2003]
  - Trade cheap reads for expensive writes by introducing imbalance [K & V, SSTD 2011]
- Systematic study on performance of R-trees on SSDs in [Emrich, Graf, Kriegel, Schubert & Thoma, DaMoN 2010]
  - SSDs not as sensitive as HDDs to page size
  - Capable of addressing higher dimensional data in less time

Hash indexing based on MicroHash
[Zeinalipour-Yazti, Lin, Kalogeraki, Gunopulos & Najjar, FAST 2005]
- Setup is sensor nodes with limited memory and processing capabilities
- Similar to extendible hashing techniques
  - Directory keeps track of bucket boundaries
  - Progressive expansion based on equi-width splitting
  - Expansion triggered when the number of splits exceeds some threshold
  - Infrequently used buckets are flushed to SSD
- Deletion through garbage collection and reorganization
  - For each bucket maintain the last time it was used S and the number of times it has been split C
- Batch updates are helpful
- Generalization of linear hashing by lazy splitting in [Xiang, Zhou & Meng, WAIM 2008]
  - Takes advantage of batch updates

Directory before reorganization:
  [0-10)   S: 2  C: 1
  [10-20)  S: 1  C: 3
  [20-30)  S: 0  C: 0
  [30-40)  S: 0  C: 0
Reorganization (repartitioning) if split threshold is 2
Directory after reorganization:
  [0-10)   S: 2  C: 1
  [10-15)  S: 1  C: 0
  [15-20)  S: 3  C: 1
  [20-30)  S: 0  C: 0
  [30-40)  S: 0  C: 0
[Figure: unused buckets are evicted to flash]

A design for hybrid systems (FlashStore)
[Debnath, Sengupta & Li, VLDB 2010]
- RAM memory: write and read buffers and metadata
- Flash memory: recycled append log organized as a cyclic list of pages, destaged to HDD based on recency
- Keeps track of destaged entries
[Figure: in RAM, a hash table index, write buffer, read cache, recency bit vector and disk-presence Bloom filter; on flash, a log of key-value pairs between the first valid page and the last valid page; destaging moves key-value pairs to the HDD]

Transaction management on flash: Hyder
[Bernstein, Reid & Das, CIDR 2010]
- A different architecture for transaction management
  - The target is data centers
  - Need for scale-out
- Log-structured multi-versioned database
  - "The log is the database"
- Raw flash chips used for storage
  - Though SSDs or even HDDs may be used as well
- Three-layered architecture
  - Storage layer maintains shared log
  - Index layer supports lookup and versioning
  - Transaction layer provides isolation and continuously refreshes the database cache by running the "meld" algorithm
- No partitioning
  - Shared data, multi-core nodes
[Figure: Server 1 and Server 2 each take a transaction input snapshot, produce a transaction intention (RW sets), forward the intention and append it to a scalable reliable distributed log; the log is broadcast to servers, which assemble a local log copy and roll the log forward]


Query and transaction processing
- Problem setup
  - Same types of query and transactional workload
  - Different medium; not what existing approaches have been optimized for
- Research questions
  - Are there problems that best fit SSDs?
  - Does one need radically different approaches, or slight adaptations?
  - Where in the storage hierarchy should we use SSDs and how?
- Methodologies
  - Flash-aware algorithms either by design or through adaptation
  - Offload parts of the computation to flash memory
  - Economies of scale

Old stories, new toys
- Impact of selectivity on predicate evaluation [Myers, MSc Thesis, MIT, 2007]
  - Overall, as selectivity factor increases performance degrades (needle in haystack queries)
  - At times HDDs might outperform SSDs
- Join processing on SSDs using algorithms designed for HDDs [Do & Patel, DaMoN 2009]
  - SSD joins may well become CPU-bound, so ways to improve the CPU performance become salient
  - Trading random reads for random writes pays off
  - Random writes result in varying I/O and unpredictable performance
  - Blocked I/O still improves performance
  - Block size should be a multiple of the page size

Impact of storage layout: the FlashJoin
[Tsirogiannis, Harizopoulos, Shah, Wiener & Graefe, SIGMOD 2009]
- Storage layout based on PAX
  - No need to retrieve what's not necessary
- Use a specialized operator (FlashScan) for on-the-fly projections over PAX
- Delegate join computation to two steps
  - Fetch data by projecting only what is relevant (the fetch kernel)
  - Evaluate the join predicate (through the join kernel) and materialize the result in a join index

  select R1.B, R1.C, R2.E, R2.H, R3.F
  from   R1, R2, R3
  where  R1.A = R2.D and R2.G = R3.K

[Figure: query plan over R1(A, B, C), R2(D, E, G, H) and R3(F, K, L). FlashScan operators feed column A of R1 and column D of R2 into a join kernel (A = D), followed by a fetch kernel for B, C and for E, H; column G and column K of R3 feed a second join kernel (G = K), followed by a fetch kernel for F; the output is B, C, E, H, F]
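The split into a join kernel and a fetch kernel can be sketched for the first join of the query, R1.A = R2.D. This is an illustration under stated assumptions, not the FlashJoin implementation: relations are modelled as dictionaries of column lists as a stand-in for PAX, the join kernel is a plain hash join, and the function names and sample values are hypothetical.

```python
def join_kernel(left_keys, right_keys):
    """Evaluate the join predicate on the join columns only.

    Returns the join index: (left_row_id, right_row_id) pairs of matching rows.
    """
    by_key = {}
    for rid, key in enumerate(right_keys):
        by_key.setdefault(key, []).append(rid)
    return [(lrid, rrid)
            for lrid, key in enumerate(left_keys)
            for rrid in by_key.get(key, [])]


def fetch_kernel(join_index, left, left_cols, right, right_cols):
    """Fetch only the projected columns, and only for rows that actually joined."""
    return [tuple(left[c][l] for c in left_cols) + tuple(right[c][r] for c in right_cols)
            for l, r in join_index]


# Column-wise relations (hypothetical data); only A and D are read by the join kernel.
r1 = {"A": [1, 2, 3], "B": ["b1", "b2", "b3"], "C": ["c1", "c2", "c3"]}
r2 = {"D": [3, 1, 1], "E": ["e1", "e2", "e3"], "G": [7, 8, 9], "H": ["h1", "h2", "h3"]}

join_index = join_kernel(r1["A"], r2["D"])
rows = fetch_kernel(join_index, r1, ["B", "C"], r2, ["E", "H"])
```

Row 2 of R1 never joins, so its B and C values are never fetched; on flash, the random reads issued by the fetch step are cheap enough to make this late materialization attractive.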

Membership queries
[Canim, Mihaila, Bhattacharjee, Lang & Ross, ADMS 2010]
- Motivation: poor locality of Bloom filters
  - Hurts cache performance in CPU-intensive applications
  - Good candidate for offloading to flash memory
  - Good random read performance compared to HDDs, but we still need to cross the memory-disk boundary
- Solution: being lazy pays off
  - Defer reads and writes through buffering
  - Introduce hierarchical structure to account for disk-level paging
[Figure: a buffer layer in memory whose buffer blocks keep track of deferred reads and writes, above a filter layer on SSD whose disk pages act as sub-Bloom filters]
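A rough model of the buffered structure: each flash page holds one sub-Bloom filter, and inserts are queued per page in memory so that one page write absorbs several updates. The sketch only defers writes (deferred reads are not modelled), and every name and parameter in it (`BufferedBloomFilter`, `buffer_limit`, the SHA-256 based hashing) is an assumption made for illustration rather than the design of the cited paper.

```python
import hashlib


class BufferedBloomFilter:
    """Bloom filter split into per-page sub-filters with in-memory write buffers."""

    def __init__(self, num_pages=4, bits_per_page=64, num_hashes=3, buffer_limit=2):
        self.pages = [0] * num_pages                 # each int models one flash page
        self.buffers = [[] for _ in range(num_pages)]
        self.bits_per_page = bits_per_page
        self.num_hashes = num_hashes
        self.buffer_limit = buffer_limit
        self.page_writes = 0

    def _hashes(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        page = digest[0] % len(self.pages)           # first hash picks the sub-filter
        bits = [int.from_bytes(digest[4 * i + 1:4 * i + 5], "big") % self.bits_per_page
                for i in range(self.num_hashes)]
        return page, bits

    def add(self, key):
        page, _ = self._hashes(key)
        self.buffers[page].append(key)               # defer the flash write
        if len(self.buffers[page]) >= self.buffer_limit:
            self._flush(page)

    def _flush(self, page):
        """Apply all deferred inserts of one page with a single page write."""
        for key in self.buffers[page]:
            _, bits = self._hashes(key)
            for b in bits:
                self.pages[page] |= 1 << b
        self.buffers[page].clear()
        self.page_writes += 1

    def might_contain(self, key):
        page, bits = self._hashes(key)
        if key in self.buffers[page]:                # still waiting in the buffer
            return True
        return all((self.pages[page] >> b) & 1 for b in bits)


bloom = BufferedBloomFilter(num_pages=1, buffer_limit=2)
bloom.add("alpha")
writes_after_first = bloom.page_writes
bloom.add("beta")    # buffer is full: both keys reach the page in one write
```

All bits of a key live in a single page, so a lookup touches one page instead of several random locations.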

Database operations suited for SSDs
[Lee, Moon, Park, Kim & Kim, SIGMOD 2008]
- Given transactional and query processing workloads, which operations are SSDs better for?
- Study of the I/O patterns
- Identify server storage spaces that exhibit SSD-friendly I/O patterns
  - Tables, indexes, temporary storage, log, rollback segments
- Secondary structures are better suited for SSDs
  - Long sequential writes (no updates) and random reads
  - Performance improvement of more than one order of magnitude when logging and rollback segments are delegated to SSDs
  - Factor of two improvement when temporary storage is on SSDs

Inexpensive logging in flash memory
[Chen, SIGMOD 2009]
- As mentioned, logging is one of the best fits for SSDs
  - Typically, writes are appended to the log
- The online version of the log is usually small
- USB flash memory is cheap and USB ports are abundant
- Intuition: spread the log across multiple cheap USB flash disks
  - Provide simple logging interface
[Figure: an interface (write, flush, checkpoint; recovery) over an in-memory log buffer and a request queue served by workers and an archiver]

Tutorial Outline
1. Introduction (Philippe)
2. Flash devices characteristics (Luc)
3. Data management for flash devices (Stratis)
4. Two outlooks (Stratis & Philippe)


Outlook
- SSD is a diverse class of devices
  - The only common characteristic of all members of the class is the excellent random read performance
  - Underlying technology affects performance in other operations
  - SSDs do not completely dominate HDDs (not yet)
    - Some types of SSD may be an order of magnitude slower than HDDs in random writes
- Where do they fit in the database stack?
  - Persistent storage, maybe in combination with HDDs
  - Read cache of HDD data
  - Transactional logging
  - Using the HDD as a log-structured write-cache for the SSD
  - Temporary storage and staging area
  - Any of the above
- More research necessary at the SSD/DB interfaces

Design Space
[Figure: layer stack: Query Processor, Storage Manager, OS, RAID Controller, FTL, FD HW]

Sources of increased complexity: improve performance AND predictability, no stable performance contract at interfaces, high utilization, d(techno)/dt

- Which IOs are issued?
- How are IOs scheduled?
- How are IOs interpreted?

Cross-layer issues:
- Avoid duplicating work
- Split work most effectively
- Schedule work most effectively
- Avoid arbitrary limitations
Performance Contract
[Figure: layer stack: Query Processor, Storage Manager, OS, RAID controller, FTL, FD HW]

Flash Devices Characteristics:
- Predictable, unconstrained and inefficient: USB key or low-end SSD
  - Existing DBMS are probably good enough
- Predictable, constrained and efficient: minimal FTL
  - Which DBMS functions can efficiently enforce constraints? How?
  - Performance modelling
- Unpredictable and unconstrained
  - Derive constraints for efficient regimes (ad-hoc)

Flash chips trend: Less into the chip:


Storage Class Flash

Scaramuzzo (FMS 2010)

Dealing with complexity
[Schlosser et al., CMU tech report 2003; Schlosser et al., FAST 2004; Prabhakaran et al., OSDI 2008]
[Figure: layer stack: Query Processor, Storage Manager, OS, RAID controller, FTL, FD HW]

From a memory abstraction (block device)...
  ERASE (address)

... to a communication abstraction (rich interface)
  send(link_name, outgoing_message_buffer)
  receive(link_name, incoming_message_buffer)
  Command Interpreter: ERASE (address)

Figures courtesy of Kaashoek and Saltzer

TRIM command
- ATA/ATAPI Command set
  - Data Set Management command (TRIM)
  - TRIM(LBA) is a hint to the FTL to unmap LBA-PBA
    - Unmapping is asynchronous, i.e., fast (if executed at all)
    - Read after TRIM is unspecified
- Pushed by the file systems community
  - Supported in Linux kernel 2.6.33+ and Windows 7/2008R2
  - Implemented in X25
    - X25-M has 80 GB capacity, but provides LBA for 74,4 GB
    - Trimming a whole disk does not take it back to factory settings.
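What "unmap LBA-PBA" means can be shown with a toy page-mapping FTL. This is a sketch under stated assumptions, not a model of any real device: the class `ToyFTL` is hypothetical, garbage collection is not modelled, and returning `None` for a read after TRIM is just one of the behaviours a device may exhibit, since that read is unspecified.

```python
class ToyFTL:
    """Page-mapping FTL with out-of-place updates and a TRIM hint."""

    def __init__(self):
        self.mapping = {}    # LBA -> PBA
        self.flash = {}      # PBA -> data
        self.invalid = set() # PBAs that garbage collection may reclaim
        self.next_pba = 0

    def write(self, lba, data):
        if lba in self.mapping:
            self.invalid.add(self.mapping[lba])   # old version becomes garbage
        self.mapping[lba] = self.next_pba
        self.flash[self.next_pba] = data
        self.next_pba += 1

    def trim(self, lba):
        """Hint that the LBA is no longer in use: drop the mapping, mark the page invalid."""
        pba = self.mapping.pop(lba, None)
        if pba is not None:
            self.invalid.add(pba)

    def read(self, lba):
        pba = self.mapping.get(lba)
        return None if pba is None else self.flash[pba]   # None: one possible post-TRIM result


ftl = ToyFTL()
ftl.write(5, "x")
ftl.write(5, "y")    # out-of-place update invalidates the first physical page
before_trim = ftl.read(5)
ftl.trim(5)
```

Without the hint, the FTL would keep copying the page holding "y" during garbage collection even after the file system had deleted the data.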


Beyond TRIM
[Nellans et al. FusionIO 2010, Arpaci-Dusseau et al, HotStorage 2010]

Slide courtesy of David Nellans, FusionIO, FMS'11


Atomic Writes
[Prabhakaran et al., OSDI 2008; Ouyang et al., HPCA'11]

Problem: partial writes due to system failure during an in-place update

Solution: copy on write (InnoDB physiological logging + double write buffer); an atomic write at the FTL level improves performance significantly (single write) and reduces DBMS complexity; it also limits concurrency.

Slide courtesy of Gary Orenstein, Fusion IO

Co-design: What's next?
- No in-place updates, do we still need WAL?
- If we reconsider physiological logging, do we still need page-based IOs? Do we still need the same representation in memory and on disk?
- Can we leverage prioritized IOs to improve a form of predictability?
- What does extent-based data allocation buy us?
- How to efficiently deal with arrays of flash devices?

Take-away Point # 1
- Flash devices are here to stay
  - Towards high-performance, energy proportionality
- The key issue is to improve predictability AND performance
  - As long as flash devices hide flash chip constraints to support any types of IOs, performance characteristics will remain opaque.

Take-away Point # 2
- Need to revisit DBMS design decisions stemming from hard drive characteristics
- Need to revisit strict layering between DBMS, OS and FTL
  - The complexity of flash devices should not be abstracted away if it results in unpredictable and suboptimal performance.

Take-away Point # 3
- Lot of work in DB community based on FD assumptions
- The co-design train has left the station. FusionIO and Oracle are leading the way.
  - There is an opportunity for the DB community to stop running after the technology, and influence the upcoming developments of flash devices