Fractal Tree Indexes
Theory to Practice
Percona Live London 2013
Tim Callaghan, Tokutek
tim@tokutek.com
@tmcallaghan
Tuesday, November 12, 13

Ever seen this?


IO Utilization Graph, performance is IO limited

Who is Tokutek?
Tokutek builds high-performance database software!
TokuDB - storage engine for MySQL and MariaDB
TokuMX - storage engine for MongoDB
Diagram layers: HDD & SSD storage, Storage Engine, Developer Interface

Who am I?
17 year database consumer
schema design, development, deployment
database administration + infrastructure
mostly Oracle
5 year database producer
2 years @ VoltDB
2+ years @ Tokutek

Housekeeping
Feedback is important to me
Ideas for Webinars or Presentations?
Who's using MongoDB?
Anyone using TokuDB or TokuMX?
Please ask questions

Agenda
Why Fractal Tree indexes are cool
What they enable in MySQL (TokuDB)
What they enable in MongoDB (TokuMX)
Q+A

Indexing: B-trees and Fractal Tree Indexes

B-trees

B-tree Overview - vocabulary


Internal Nodes - Path to data
Leaf Nodes - Actual Data - Sorted
Diagram labels: Pointers, Pivots

B-tree Overview - example


Root: 22
Internal nodes: 10 | 99
Leaf nodes: [2, 3, 4] [10, 20] [22, 25] [99]
* Pivot Rule is >=

B-tree Overview - search


Root: 22
Internal nodes: 10 | 99
Leaf nodes: [2, 3, 4] [10, 20] [22, 25] [99]
Find 25
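
The search on this slide can be sketched in a few lines. This is a minimal illustration, not TokuDB or InnoDB code: the `Node` class and `search` function are hypothetical names, and the tree is the four-leaf example from the slide with the >= pivot rule.

```python
from bisect import bisect_right


class Node:
    """Toy tree node: internal nodes hold pivots, leaves hold sorted keys."""

    def __init__(self, pivots=None, children=None, keys=None):
        self.pivots = pivots or []      # internal node: routing keys
        self.children = children or []  # internal node: child pointers
        self.keys = keys or []          # leaf node: sorted data

    @property
    def is_leaf(self):
        return not self.children


def search(node, key):
    """Return (found, nodes_visited) for a root-to-leaf walk."""
    visited = 1
    while not node.is_leaf:
        # Pivot rule is >=: a key equal to a pivot goes to the right child.
        node = node.children[bisect_right(node.pivots, key)]
        visited += 1
    return key in node.keys, visited


# The example tree from the slide.
root = Node(
    pivots=[22],
    children=[
        Node(pivots=[10], children=[Node(keys=[2, 3, 4]), Node(keys=[10, 20])]),
        Node(pivots=[99], children=[Node(keys=[22, 25]), Node(keys=[99])]),
    ],
)

found, visited = search(root, 25)
```

Finding 25 visits three nodes (root, one internal node, one leaf). Searching for a missing key such as 15 walks the same number of levels before reporting a miss.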

B-tree Overview - insert


Root: 22
Internal nodes: 10 | 99
Leaf nodes: [2, 3, 4] [10, 15, 20] [22, 25] [99]
Insert 15

B-tree Overview - performance
Root (RAM): 22
Internal nodes (RAM): 10 | 99
Leaf nodes (DISK): [2, 3, 4] [10, 20] [22, 25] [99]
Performance is IO limited when data > RAM,
one IO is needed for each insert/update
(actually it's one IO for every index on the table)
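
A rough way to see why inserts become IO limited once data exceeds RAM is to simulate a cache of leaf nodes. The sketch below is an assumption-laden toy, not a model of any real storage engine: it assumes uniformly random keys, an LRU cache that holds whole leaves, internal nodes that are always cached, and one leaf touched per insert. The function name and parameters are hypothetical.

```python
import random
from collections import OrderedDict


def count_leaf_ios(num_leaves, cache_leaves, num_inserts, seed=0):
    """Count leaf reads from disk for random inserts under an LRU cache."""
    rng = random.Random(seed)
    cache = OrderedDict()  # leaf id -> True, ordered from coldest to hottest
    ios = 0
    for _ in range(num_inserts):
        leaf = rng.randrange(num_leaves)  # random key lands on a random leaf
        if leaf in cache:
            cache.move_to_end(leaf)       # cache hit: no IO
        else:
            ios += 1                      # cache miss: read the leaf from disk
            cache[leaf] = True
            if len(cache) > cache_leaves:
                cache.popitem(last=False)  # evict the least recently used leaf
    return ios


# Same workload, two cache sizes: all leaves fit vs. only a tenth of them fit.
fits_in_ram = count_leaf_ios(num_leaves=1000, cache_leaves=1000, num_inserts=10000)
exceeds_ram = count_leaf_ios(num_leaves=1000, cache_leaves=100, num_inserts=10000)
```

When every leaf fits in the cache, only the first touch of each leaf costs an IO. When the cache holds a fraction of the leaves, most inserts miss, and each additional index on the table repeats that cost.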

Fractal Tree Indexes



Fractal Tree Indexes


similar to B-trees
store data in leaf nodes
use index key for ordering
different than B-trees
message buffers
big nodes (4MB vs. ~16KB)
All internal nodes have message buffers
As buffers overflow, they cascade down the tree
Messages are eventually applied to leaf nodes
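
The buffer cascade can be sketched with a toy two-level tree. This is only an illustration of the idea on the slide, under stated assumptions: the `Leaf` and `Internal` classes are hypothetical, the buffer capacity is counted in messages rather than bytes, and an overflowing buffer is flushed to all children at once, which is a simplification of whatever flushing policy a real implementation uses.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self):
        self.keys = set()
        self.writes = 0  # times this leaf was rewritten, standing in for IOs

    def apply(self, messages):
        self.keys.update(messages)
        self.writes += 1  # one write covers the whole batch of messages


class Internal:
    def __init__(self, pivots, children, capacity):
        self.pivots = pivots
        self.children = children
        self.capacity = capacity  # max buffered messages before a flush
        self.buffer = []

    def apply(self, messages):
        self.buffer.extend(messages)
        if len(self.buffer) > self.capacity:
            self.flush()

    def flush(self):
        # Group buffered messages by destination child, then push them down.
        batches = {}
        for key in self.buffer:
            batches.setdefault(bisect_right(self.pivots, key), []).append(key)
        self.buffer = []
        for index, batch in batches.items():
            self.children[index].apply(batch)


leaves = [Leaf() for _ in range(4)]
root = Internal(pivots=[25, 50, 75], children=leaves, capacity=100)

# 1000 inserts, one message each, spread over keys 0..99.
for i in range(1000):
    root.apply([(i * 37) % 100])

total_leaf_writes = sum(leaf.writes for leaf in leaves)
```

In this run the 1000 inserts cause far fewer leaf writes than inserts, because each write applies a whole batch of buffered messages. Messages still sitting in `root.buffer` have not reached a leaf yet, which is why searches must also look at buffers.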

Fractal Tree Indexes - sample data


Root: 25
Internal nodes: 10 | 99
Leaf nodes: [2, 3, 4] [10, 20] [22, 25] [99]
Looks a lot like a B-tree!

Fractal Tree Indexes - insert
insert 15;
Root: 25
Internal nodes: 10 | 99
Leaf nodes: [2, 3, 4] [10, 20] [22, 25] [99]
Pending message: insert (15)
search operations must consider messages along the way
messages cascade down the tree as buffers fill up
they are eventually applied to the leaf nodes, hundreds or thousands of operations for a single IO
CPU and cache are conserved as important data is not ejected

Fractal Tree Indexes - other operations


Root: 25
Internal nodes: 10 | 99
Leaf nodes: [2, 3, 4] [10, 20] [22, 25] [99]
add_column(c4 bigint)
delete(99)
increment(22,+5)
...
insert (100)
delete(8)
delete(2)
insert (8)
Lots of operations can be messages!
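
A minimal sketch of how a read can account for pending messages: start from what the leaf holds and replay, in order, every buffered message for that key. The message tuples, the row values, and the `apply_message` and `lookup` helpers are all hypothetical; `add_column` is left out because the toy rows are single values.

```python
def apply_message(rows, message):
    """Apply one message to a dict of key -> value, in place."""
    op, key, *args = message
    if op == "insert":
        rows[key] = args[0]
    elif op == "delete":
        rows.pop(key, None)
    elif op == "increment":
        if key in rows:
            rows[key] += args[0]
    else:
        raise ValueError(f"unknown message type: {op}")


def lookup(leaf_rows, pending, key):
    """Read a key as if every pending message had already reached the leaf."""
    view = {key: leaf_rows[key]} if key in leaf_rows else {}
    for message in pending:
        if message[1] == key:
            apply_message(view, message)
    return view.get(key)


# Leaf contents (values are made up) and buffered messages, oldest first.
leaf_rows = {2: 0, 22: 10, 99: 7}
pending = [
    ("delete", 99),
    ("increment", 22, 5),
    ("insert", 100, 1),
    ("delete", 8),
    ("delete", 2),
    ("insert", 8, 3),
]

value_of_22 = lookup(leaf_rows, pending, 22)
```

The leaf itself is untouched by `lookup`; the increment on key 22 and the delete of key 99 are visible to readers even though no leaf has been rewritten. Order matters: key 8 is deleted and then inserted, so it reads as present.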

TokuDB
Fractal Tree Indexing + MySQL/MariaDB

What is TokuDB?
Transactional MySQL Storage Engine - think InnoDB
Available for MySQL 5.5 and MariaDB 5.5
ACID and MVCC
Free/OSS Community Edition
http://github.com/Tokutek/ft-engine
Enterprise Edition
Commercial support + hot backup
Performance + Compression + Agility

TokuDB Performance
Warning - Benchmarks Ahead!

Indexed Insertion Performance


High-performance insert/update/delete for large databases (> RAM) while maintaining indexes
* old numbers, now > 25K/sec

Sysbench Performance
Sysbench read/write workload, > RAM
The fastest IO is the one you never have to do (compression)

Performance Advantages
Efficient index maintenance, especially secondary indexes
Clustered secondary indexes
Additional copy of the row is stored in the index
No additional IO to get row data from primary key
Think better covering index (all non-indexed columns)
Compression eliminates size concerns
Big blocks = sequential IO for range scans
Basement nodes are always co-located
Multi-threaded bulk loader

TokuDB Compression

Compression: TokuDB vs. InnoDB


InnoDB compression misses force node splits, which greatly reduce performance
MySQL 5.6 dynamic padding (from FB), less cache
Larger block size and flexible on-disk size wins!
Multiple compression algorithms (lzma, quicklz, zlib)
Larger, less frequent writes (much less IO)
Why it matters on spinning disks:
Compressed reads and amortized compressed writes overcome IO limitations
Why it matters on flash/SSD:
Buy less (250GB * 10x = as much as 2.5TB)
Large/less frequent writes are flash friendly

Compression + IO Reduction

Server was at 90% IO utilization with InnoDB, 10% IO utilization with TokuDB

Compression Performance

iiBench benchmark

Compression Achieved

log data (extremely compressible)

TokuDB Agility

The Challenge of MySQL Schema Changes

Common schema changes can take hours in MySQL
Adding, dropping, or expanding a column
Adding an index
And the table is unavailable for writes during the process
As a workaround, people generally
Use a replication slave, then swap with master
Use helper tools: Percona OSC, MySQL 5.6
o These have IO, CPU, RAM consequences

Schema Changes Without Downtime

In TokuDB, column add/drop/expand is instantaneous
it's just a message
Indexes can be created in the background while table is fully available
TokuDB just builds the index, it does not rebuild the table (MySQL getting better)

TokuMX
Fractal Tree Indexing + MongoDB

What is TokuMX?
TokuMX = MongoDB with improved storage (Fractal Tree indexes)
Drop in replacement for MongoDB v2.2 applications
Including replication and sharding
Same data model
Same query language
Drivers just work
Open Source
http://github.com/Tokutek/mongo
Performance + Compression + Transactions

MongoDB Storage
db.test.insert({foo:55})
db.test.ensureIndex({foo:1})
PK index (_id + pointer): pivots 18; 4, 5555; leaf entries (1,ptr5) (4,ptr1) (12,ptr8) (19,ptr7) (10000,ptr2)
Secondary index (foo + pointer): pivots 85; 40, 120; leaf entries (2,ptr5) (22,ptr6) (50,ptr4) (100,ptr7) (222,ptr3)
memory mapped heap
The pointer tells MongoDB where to look in the heap for the requested document (another IO)

TokuMX Storage
db.test.insert({foo:55})
db.test.ensureIndex({foo:1})
PK index (_id + document): pivots 18; 4, 5555; leaf entries (1,doc) (4,doc) (12,doc) (19,doc) (10000,doc)
Secondary index (foo + _id): pivots 85; 40, 120; leaf entries (2,4) (22,12) (50,19) (100,10000) (222,1)
memory mapped heap
One less IO per _id lookup, document is clustered in the index
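
The difference between the two layouts can be sketched by counting reads per lookup. This is a toy comparison, not either product's storage code: `CountingStore` and the lookup helpers are hypothetical, each dictionary read stands in for one possible IO with caching ignored, the `_id`/`foo` pairs come from the TokuMX slide, and the heap pointer names are made up.

```python
class CountingStore(dict):
    """A dict that counts reads; each read stands in for one possible IO."""

    reads = 0

    def __getitem__(self, key):
        CountingStore.reads += 1
        return super().__getitem__(key)


# _id -> foo pairs taken from the secondary index on the TokuMX slide.
foo_by_id = {1: 222, 4: 2, 12: 22, 19: 50, 10000: 100}

# Pointer layout: both indexes map to a location in a separate heap.
heap, pk_to_ptr, foo_to_ptr = CountingStore(), CountingStore(), CountingStore()
for n, (_id, foo) in enumerate(foo_by_id.items(), start=1):
    ptr = f"ptr{n}"  # illustrative pointer name
    heap[ptr] = {"_id": _id, "foo": foo}
    pk_to_ptr[_id] = ptr
    foo_to_ptr[foo] = ptr

# Clustered layout: the primary index stores the document itself,
# and the secondary index stores the _id.
pk_to_doc = CountingStore({i: {"_id": i, "foo": f} for i, f in foo_by_id.items()})
foo_to_id = CountingStore({f: i for i, f in foo_by_id.items()})


def by_id_pointer(_id):
    return heap[pk_to_ptr[_id]]       # index read, then heap read


def by_id_clustered(_id):
    return pk_to_doc[_id]             # single read, document is in the index


def by_foo_pointer(foo):
    return heap[foo_to_ptr[foo]]      # index read, then heap read


def by_foo_clustered(foo):
    return pk_to_doc[foo_to_id[foo]]  # secondary read, then primary read


def reads_for(lookup, key):
    """Run one lookup and report (document, number of reads)."""
    CountingStore.reads = 0
    doc = lookup(key)
    return doc, CountingStore.reads


doc, reads = reads_for(by_id_clustered, 12)
```

An `_id` lookup costs two reads in the pointer layout and one in the clustered layout. A lookup through the plain secondary index costs two reads in both layouts, which is the gap that clustered secondary indexes are meant to close.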

TokuMX Performance

Performance - Indexed Insertion


100 million inserts into a collection with 3 secondary indexes

Performance - Inserts on Indexed Arrays
Indexed Insertion: Multikey (100 inserts per doc)

Performance - Replication
TokuMX replication allows secondary servers to process replication without IO
Simply injecting messages into the Fractal Tree Indexes on the secondary server
The Hard Work was done on the primary
o Uniqueness checking
o Transactional locking
o Update effort (read-before-write)
Elimination of replication lag
Your secondaries are fully available for read scaling!
Wasn't that the point?

Performance - Lock Refinement


TokuMX performs locking at the document level
Extreme concurrency!
Diagram: lock granularity hierarchy, instance > database > collection > document, with MongoDB v2.0 locking at the instance level, MongoDB v2.2 at the database level, and TokuMX at the document level

Performance - Lock Refinement

Performance - Lock Refinement + Reduced IO
Sysbench benchmark (> RAM)

Performance - Reduced IO
Indexed insertion benchmark

Performance - Clustered Indexes


Clustered secondary indexes
Additional copy of the document is stored in the index
No additional IO to get row data from primary key
Think better covered index (all non-indexed fields)
Good for point queries, great for range scans
Compression eliminates size concerns

Performance - Memory Management


Two approaches to memory management
MongoDB = memory-mapped files
o Operating system determines what data is important
TokuMX = managed cache
o User defined size
o TokuMX determines what data is important
Run multiple TokuMX instances on a single server
Each has its own fixed cache size

TokuMX Compression

Compression
MongoDB does not offer compression
Compressed file systems?
Shortened field names?
o Remember: each field name is stored in every single document
TokuMX easily achieves 5x-10x compression
Buy less disk or flash
Compressed reads and writes reduce overall IO
TokuMX supports 3 compression types
zlib, quicklz, lzma (size vs. speed)
all data is compressed
Use descriptive field names!
They are easy to compress
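
The claim that descriptive field names compress well can be checked with a small experiment. This sketch is only indicative: it uses JSON text and Python's `zlib` over one stream, not BSON or TokuMX's block compression, and the documents and field names are invented for the example.

```python
import json
import zlib


def sizes(docs):
    """Return (raw_bytes, zlib_compressed_bytes) for newline-joined JSON."""
    raw = "\n".join(json.dumps(doc) for doc in docs).encode("utf-8")
    return len(raw), len(zlib.compress(raw))


# The same data with descriptive and with shortened field names.
descriptive = [
    {"customer_identifier": i, "shipping_city": "London", "order_total_pence": i * 100}
    for i in range(1000)
]
shortened = [{"c": i, "s": "London", "o": i * 100} for i in range(1000)]

raw_long, packed_long = sizes(descriptive)
raw_short, packed_short = sizes(shortened)

# How much the long names cost before and after compression.
raw_gap = raw_long - raw_short
packed_gap = packed_long - packed_short
ratio = raw_long / packed_long
```

Comparing `raw_gap` with `packed_gap` shows how much of the overhead of long field names remains once the data is compressed. The exact numbers depend on the data and the algorithm, so treat this as a way to measure your own documents rather than a prediction.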

Compression

31 million documents, BitTorrent peer data
http://cs.brown.edu/~pavlo/torrent/

TokuMX Transactions

ACID + MVCC

ACID
In MongoDB, multi-insertion operations allow for partial success
o Asked to store 5 documents, 3 succeeded
We offer all or nothing behavior
Document level locking

MVCC
In MongoDB, queries can be interrupted by writers.
o The effects of these writers are visible to the reader
TokuMX offers MVCC
o Reads are consistent as of the operation start
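
The difference between partial success and all-or-nothing behavior can be shown with a toy in-memory collection. This only illustrates the observable behavior, not how either database implements it: the `insert_partial` and `insert_atomic` helpers and the `DuplicateKey` exception are hypothetical, and a plain dict keyed by `_id` stands in for a collection.

```python
class DuplicateKey(Exception):
    pass


def insert_partial(collection, docs):
    """Insert one at a time; earlier inserts remain if a later one fails."""
    for doc in docs:
        if doc["_id"] in collection:
            raise DuplicateKey(doc["_id"])
        collection[doc["_id"]] = doc


def insert_atomic(collection, docs):
    """Stage every insert first; publish only if all of them succeed."""
    staged = {}
    for doc in docs:
        if doc["_id"] in collection or doc["_id"] in staged:
            raise DuplicateKey(doc["_id"])
        staged[doc["_id"]] = doc
    collection.update(staged)


# Five documents; the fourth collides with an existing _id.
batch = [{"_id": n} for n in (1, 2, 3, 0, 4)]
partial_store = {0: {"_id": 0}}
atomic_store = {0: {"_id": 0}}

for insert, store in ((insert_partial, partial_store), (insert_atomic, atomic_store)):
    try:
        insert(store, batch)
    except DuplicateKey:
        pass  # the batch failed; what matters is what was left behind
```

After the failed batch, `partial_store` holds the three documents inserted before the collision plus the original one, while `atomic_store` is unchanged.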

Multi-statement Transactions
TokuMX brings the following to MongoDB
db.runCommand({beginTransaction: 1, isolation: "mvcc"})
... perform 1 or more operations
db.runCommand("rollbackTransaction") | db.runCommand("commitTransaction")
Not allowed in sharded environments
mongos will reject

Tim Callaghan
VP/Engineering, Tokutek
tim@tokutek.com
@tmcallaghan
Questions?