OSDI 2006. Fay Chang, Jeffrey Dean, Sanjay Ghemawat et al. Presented by Srindhya K
Why BigTable?
Google's goals: self-managing, simple!
Bigtable
BigTable is a distributed storage system for managing structured data, designed to scale to a very large size.
Used by Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, and more.
[Figure: slice of the Webtable example. The row "com.cnn.www" stores page contents ("<html>...") at timestamps t2, t5, and t7, plus anchor columns such as anchor:my.look.ca holding link text like "CNN" and "CNN.com" at timestamps t9 and t11.]
Rows
Rows are ordered lexicographically. The row range for a table is dynamically partitioned; each partition (row range) is called a tablet.
E.g., In Webtable, pages in the same domain are grouped together by reversing the hostname components of the URLs: com.google.maps instead of maps.google.com.
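The reversal trick above is simple enough to sketch; `webtable_row_key` is an illustrative helper name, not part of any real API:

```python
def webtable_row_key(url_host: str) -> str:
    """Reverse hostname components so pages from the same domain
    end up adjacent in lexicographic row order."""
    return ".".join(reversed(url_host.split(".")))

# Pages from google.com sort next to each other, away from cnn.com:
keys = sorted(webtable_row_key(h) for h in
              ["maps.google.com", "www.cnn.com", "mail.google.com"])
# keys == ["com.cnn.www", "com.google.mail", "com.google.maps"]
```

Because tablets are contiguous row ranges, this keeps a whole domain in few tablets, making domain-wide scans efficient.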
Columns
Column keys are grouped into sets called column families. A column family must be created before data can be stored under a column key. Column keys have a two-level name structure:
family:optional_qualifier
Timestamps
Timestamps can be assigned either by Bigtable (real time, in microseconds) or by the client application (when unique timestamps are a necessity).
Items in a cell are stored in decreasing timestamp order. Application specifies how many versions (n) of data items are maintained in a cell.
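The versioning rule above can be sketched as a tiny in-memory cell; the class and method names are illustrative only:

```python
class Cell:
    """Stores values in decreasing timestamp order,
    keeping at most `max_versions` versions."""

    def __init__(self, max_versions: int):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp: int, value: str):
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: -tv[0])   # newest first
        del self.versions[self.max_versions:]       # drop versions beyond n

    def latest(self):
        return self.versions[0] if self.versions else None
```

With `max_versions=2`, writing at timestamps 2, 7, and 5 leaves only the entries for 7 and 5, newest first.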
Bigtable API
create and delete tables and column families, modify cluster, table, and column family metadata such as access control rights, Write or delete values in Bigtable, Look up values from individual rows, Iterate over a subset of the data in a table, Atomic R-M-W sequences on data stored in a single row key.
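A toy single-process sketch of that API surface (the paper's client library is C++; all names here are illustrative, not Google's actual API):

```python
class Bigtable:
    """Minimal sketch: tables hold column families and sorted rows."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, families):
        self.tables[name] = {"families": set(families), "rows": {}}

    def set(self, table, row, column, value):
        """Write a value; the column's family must already exist."""
        family = column.split(":", 1)[0]
        t = self.tables[table]
        assert family in t["families"], "column family must be created first"
        t["rows"].setdefault(row, {})[column] = value

    def lookup(self, table, row):
        """Look up all values in an individual row."""
        return self.tables[table]["rows"].get(row, {})

    def scan(self, table, start_row, end_row):
        """Iterate over rows in [start_row, end_row), in sorted order."""
        rows = self.tables[table]["rows"]
        for key in sorted(rows):
            if start_row <= key < end_row:
                yield key, rows[key]
```

Single-row operations like `set` are what make atomic read-modify-write sequences on one row key feasible: all data for a row lives on one tablet server.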
Building blocks:
Google File System (GFS) - raw storage; stores Bigtable's persistent state.
Scheduler - schedules jobs onto machines, including the jobs involved in BigTable serving.
Lock Service - the Chubby distributed lock manager; used for master election and location bootstrapping.
SSTable - a key/value database file format; stores and retrieves key/data pairs.
A persistent and distributed lock service. Consists of 5 active replicas; one replica is the master and serves requests. The service is functional when a majority of the replicas are running and in communication with one another. Implements a name service that consists of directories and files. Bigtable uses Chubby to:
Ensure there is at most one active master at a time, Store the bootstrap location of Bigtable data (Root tablet), Discover tablet servers and finalize tablet server deaths, Store Bigtable schema information (column family information).
SSTable
[Figure: an SSTable file laid out as a sequence of 64K blocks followed by a block index.]
An SSTable is a sequence of blocks (typically 64 KB each) plus a block index; the index is loaded into memory and used to locate a block with a single disk seek.
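The block-index lookup can be sketched in memory; `BLOCK_SIZE` here counts entries rather than the real format's ~64 KB of bytes, and the class is illustrative only:

```python
import bisect

class SSTable:
    """Simplified SSTable: sorted (key, value) pairs grouped into
    fixed-size blocks, with an index of each block's first key."""

    BLOCK_SIZE = 4  # entries per block; the real format uses ~64KB byte blocks

    def __init__(self, sorted_items):
        self.blocks = [sorted_items[i:i + self.BLOCK_SIZE]
                       for i in range(0, len(sorted_items), self.BLOCK_SIZE)]
        self.index = [block[0][0] for block in self.blocks]  # first key per block

    def get(self, key):
        # Binary-search the in-memory index for the one block that could
        # hold the key, then scan only that block (one "disk seek").
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None
```

Because the file is immutable and sorted, the index never changes after the SSTable is written, which is what makes sharing SSTables between tablets safe.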
Tablets
The mechanism for spreading data across all machines in the serving cluster.
Table
Multiple tablets make up a table. SSTables can be shared between tablets. Tablets do not overlap; SSTables can overlap.
[Figure: two tablets, one covering rows aardvark..apple and one covering apple_two_E..boat, each backed by multiple SSTables, some of them shared.]
Implementation
Tablet servers are added and removed dynamically. Each tablet server holds ten to a thousand tablets; each tablet is typically 100-200 MB in size. The master is responsible for: assigning tablets to tablet servers, detecting the addition and deletion of tablet servers, balancing tablet-server load, and garbage collection of files in GFS.
Locating Tablets
Since tablets move around from server to server, given a row, how do clients find the right machine?
Approach: store special tables containing tablet location info in the Bigtable cell itself.
Three-level hierarchical lookup scheme for tablets:
1st level: a file stored in Chubby contains the location of the root tablet, i.e., a directory of ranges (tablets) and associated metadata.
2nd level: each metadata tablet contains the location of a set of user tablets.
3rd level: a set of SSTable identifiers for each tablet.
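The hierarchy can be sketched with plain dictionaries; every server name and range below is made up for illustration:

```python
# 1st level: a file in Chubby names the server holding the root tablet
# (elided here; we start from the root tablet's contents).

# 2nd level: the root tablet maps row ranges to metadata tablets.
root_tablet = {("a", "m"): "meta-tablet-1", ("m", "~"): "meta-tablet-2"}

# 3rd level: each metadata tablet maps user-tablet row ranges to servers.
metadata_tablets = {
    "meta-tablet-1": {("a", "g"): "server-3", ("g", "m"): "server-9"},
    "meta-tablet-2": {("m", "t"): "server-1", ("t", "~"): "server-4"},
}

def locate(row_key):
    """Walk the hierarchy to find which server holds the tablet for row_key."""
    for (lo, hi), meta in root_tablet.items():
        if lo <= row_key < hi:
            for (mlo, mhi), server in metadata_tablets[meta].items():
                if mlo <= row_key < mhi:
                    return server
    return None
```

In the real system clients cache these locations, so most lookups skip the hierarchy entirely and only re-walk it on a cache miss.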
Tablet Assignment
Each tablet is assigned to one tablet server at a time. The master keeps track of the set of live tablet servers and the current assignment of tablets to tablet servers. Bigtable uses Chubby to keep track of tablet servers: when a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory. A tablet server stops serving its tablets if it loses its exclusive lock. The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets as soon as possible. When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them.
Tablet Serving
Write path: the server checks that the client has sufficient privileges for the write operation (via Chubby); a log record is written to the commit log file; once the write commits, its contents are inserted into the memtable.
Read path: the server checks that the client has sufficient privileges for the read operation (via Chubby); the read is performed on a merged view of (a) the SSTables that constitute the tablet and (b) the memtable.
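The merged view can be sketched with plain dicts standing in for the memtable and SSTables; the newest store is consulted first, so a fresh write shadows older on-disk values:

```python
def merged_read(row, column, memtable, sstables):
    """Read a cell from the merged view: the memtable holds the newest
    writes, SSTables hold progressively older ones; first hit wins."""
    for store in [memtable] + sstables:   # ordered newest to oldest
        value = store.get((row, column))
        if value is not None:
            return value
    return None

memtable = {("com.cnn.www", "contents:"): "<html>new"}
sstables = [{("com.cnn.www", "contents:"): "<html>old",
             ("com.cnn.www", "anchor:my.look.ca"): "CNN.com"}]
```

`merged_read("com.cnn.www", "contents:", memtable, sstables)` returns the memtable's `"<html>new"`, while a column absent from the memtable falls through to the SSTables.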
Compactions
As write operations execute, the size of the memtable increases. When it reaches a threshold, the memtable is frozen, converted to an SSTable, and written to GFS (minor compaction).
Merging compaction
- Periodically executed in the background - Reduces the number of SSTables - A good place to apply the policy of keeping only N versions
Major compaction
- A merging compaction that results in exactly one SSTable - No deletion records, only live data - Reclaims resources.
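The merge-with-version-pruning idea can be sketched over sorted runs of `(key, timestamp, value)` tuples, where `value is None` is a deletion marker. Note this simplified sketch also drops tombstones and the data they shadow, so it behaves like a major compaction over its inputs; a real merging compaction must keep deletion markers while older SSTables still exist:

```python
import heapq

def merging_compaction(sstables, keep_versions=1):
    """Merge sorted runs of (key, timestamp, value) into one run, keeping
    only the newest `keep_versions` per key; a None value is a deletion
    marker that suppresses all older versions of its key."""
    # Each input run is sorted by (key, -timestamp), so the k-way merge
    # yields each key's versions newest-first.
    merged = heapq.merge(*sstables, key=lambda e: (e[0], -e[1]))
    out, last_key, kept, deleted = [], object(), 0, False
    for key, ts, value in merged:
        if key != last_key:
            last_key, kept, deleted = key, 0, False
        if deleted or kept >= keep_versions:
            continue
        if value is None:          # tombstone: drop it and everything older
            deleted = True
            continue
        out.append((key, ts, value))
        kept += 1
    return out
```

Merging two runs where "a" has versions at t5 and t2 and "b" was deleted at t3 yields just `[("a", 5, "v5")]` with `keep_versions=1`: only live data, only the newest version.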
Refinements
Locality groups: group column families together into an SSTable. Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads.
Compression: locality groups can be compressed, using Bentley and McIlroy's scheme plus a fast compression algorithm that looks for repetitions.
Bloom filters on locality groups allow asking whether an SSTable might contain any data for a specified row/column pair. This drastically reduces the number of disk seeks required: lookups for non-existent rows or columns do not need to touch disk.
Caching for read performance (two levels of caching): the Scan Cache and the Block Cache.
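A minimal Bloom filter conveys why negative lookups can skip disk; the parameters and hashing scheme below are illustrative, not Bigtable's:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: answers 'definitely absent' or 'possibly
    present', so a negative lookup never has to touch the SSTable."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one int

    def _positions(self, key):
        # Derive num_hashes positions by salting a hash of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))
```

Keys would be row/column pairs like `"com.cnn.www/contents:"`; a `False` answer is definitive (skip the SSTable), a `True` answer may rarely be a false positive (one wasted seek, never a wrong result).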
Real Applications
Conclusion
Satisfies the goals of high-availability, high-performance, massively scalable data storage through a simple API. Successfully used by various Google products (>60). Additional features are in progress.
Significant flexibility comes from designing their own data model; bottlenecks and inefficiencies can be removed as they arise.