OSDI 2006. Fay Chang, Jeffrey Dean, Sanjay Ghemawat et al. Presented by Srindhya K
Why BigTable?
Google's goals: self-managing, simple!
Bigtable
BigTable is a distributed storage system for managing structured data, designed to scale to a very large size.
Used by Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, and more.
[Figure: slice of the Webtable example. The row "com.cnn.www" stores page contents ("<html>...") at timestamps t2, t5, and t7, plus anchor columns such as anchor:my.look.ca holding link text like "CNN" and "CNN.com" at timestamps t9 and t11.]
Rows
Rows are ordered lexicographically. The row range for a table is dynamically partitioned; each partition (row range) is called a tablet.
E.g., In Webtable, pages in the same domain are grouped together by reversing the hostname components of the URLs: com.google.maps instead of maps.google.com.
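The reversal trick above is simple enough to sketch; `webtable_row_key` is an illustrative helper name, not part of any real API:

```python
def webtable_row_key(url_host: str) -> str:
    """Reverse hostname components so pages from the same domain
    end up adjacent in lexicographic row order."""
    return ".".join(reversed(url_host.split(".")))

# Pages from google.com sort next to each other, away from cnn.com:
keys = sorted(webtable_row_key(h) for h in
              ["maps.google.com", "www.cnn.com", "mail.google.com"])
# keys == ["com.cnn.www", "com.google.mail", "com.google.maps"]
```

Because tablets are contiguous row ranges, this keeps a whole domain in few tablets, making domain-wide scans efficient.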
Columns
Column keys are grouped into sets called column families. A column family must be created before data can be stored under a column key. Column keys have a two-level name structure:
family:optional_qualifier
Timestamps
Timestamps can be assigned either by Bigtable (real time, in microseconds) or by the client application (when unique timestamps are a necessity).
Items in a cell are stored in decreasing timestamp order. Application specifies how many versions (n) of data items are maintained in a cell.
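The versioning rule above can be sketched as a tiny in-memory cell; the class and method names are illustrative only:

```python
class Cell:
    """Stores values in decreasing timestamp order,
    keeping at most `max_versions` versions."""

    def __init__(self, max_versions: int):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp: int, value: str):
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: -tv[0])   # newest first
        del self.versions[self.max_versions:]       # drop versions beyond n

    def latest(self):
        return self.versions[0] if self.versions else None
```

With `max_versions=2`, writing at timestamps 2, 7, and 5 leaves only the entries for 7 and 5, newest first.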
Bigtable API
create and delete tables and column families, modify cluster, table, and column family metadata such as access control rights, Write or delete values in Bigtable, Look up values from individual rows, Iterate over a subset of the data in a table, Atomic R-M-W sequences on data stored in a single row key.
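A toy single-process sketch of that API surface (the paper's client library is C++; all names here are illustrative, not Google's actual API):

```python
class Bigtable:
    """Minimal sketch: tables hold column families and sorted rows."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, families):
        self.tables[name] = {"families": set(families), "rows": {}}

    def set(self, table, row, column, value):
        """Write a value; the column's family must already exist."""
        family = column.split(":", 1)[0]
        t = self.tables[table]
        assert family in t["families"], "column family must be created first"
        t["rows"].setdefault(row, {})[column] = value

    def lookup(self, table, row):
        """Look up all values in an individual row."""
        return self.tables[table]["rows"].get(row, {})

    def scan(self, table, start_row, end_row):
        """Iterate over rows in [start_row, end_row), in sorted order."""
        rows = self.tables[table]["rows"]
        for key in sorted(rows):
            if start_row <= key < end_row:
                yield key, rows[key]
```

Single-row operations like `set` are what make atomic read-modify-write sequences on one row key feasible: all data for a row lives on one tablet server.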
Building blocks:
Google File System (GFS) - raw storage; stores Bigtable's persistent state.
Scheduler - schedules jobs onto machines, including the jobs involved in BigTable serving.
Lock Service - the Chubby distributed lock manager; used for master election and location bootstrapping.
SSTable - a key/value database file format; stores and retrieves key/data pairs.
A persistent and distributed lock service. Consists of 5 active replicas; one replica is the master and serves requests. The service is functional when a majority of the replicas are running and in communication with one another. Implements a name service that consists of directories and files. Bigtable uses Chubby to:
Ensure there is at most one active master at a time, Store the bootstrap location of Bigtable data (Root tablet), Discover tablet servers and finalize tablet server deaths, Store Bigtable schema information (column family information).
SSTable
[Figure: an SSTable file laid out as a sequence of 64K blocks followed by a block index.]
An SSTable is a sequence of blocks (typically 64 KB each) plus a block index; the index is loaded into memory and used to locate a block with a single disk seek.
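The block-index lookup can be sketched in memory; `BLOCK_SIZE` here counts entries rather than the real format's ~64 KB of bytes, and the class is illustrative only:

```python
import bisect

class SSTable:
    """Simplified SSTable: sorted (key, value) pairs grouped into
    fixed-size blocks, with an index of each block's first key."""

    BLOCK_SIZE = 4  # entries per block; the real format uses ~64KB byte blocks

    def __init__(self, sorted_items):
        self.blocks = [sorted_items[i:i + self.BLOCK_SIZE]
                       for i in range(0, len(sorted_items), self.BLOCK_SIZE)]
        self.index = [block[0][0] for block in self.blocks]  # first key per block

    def get(self, key):
        # Binary-search the in-memory index for the one block that could
        # hold the key, then scan only that block (one "disk seek").
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None
```

Because the file is immutable and sorted, the index never changes after the SSTable is written, which is what makes sharing SSTables between tablets safe.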
Tablets
The mechanism for spreading data across all machines in the serving cluster.
Table
Multiple tablets make up a table. SSTables can be shared between tablets. Tablets do not overlap; SSTables can overlap.
[Figure: two tablets, one covering rows aardvark..apple and one covering apple_two_E..boat, each backed by multiple SSTables, some of them shared.]
Implementation
Tablet servers are added and removed dynamically. Each tablet server holds ten to a thousand tablets; each tablet is typically 100-200 MB in size. The master is responsible for: assigning tablets to tablet servers, detecting the addition and deletion of tablet servers, balancing tablet-server load, and garbage collection of files in GFS.
Locating Tablets
Since tablets move around from server to server, given a row, how do clients find the right machine?
Approach: store special tables containing tablet location info in the Bigtable cell itself.
Three-level hierarchical lookup scheme for tablets:
1st level: a file stored in Chubby contains the location of the root tablet, i.e., a directory of ranges (tablets) and associated metadata.
2nd level: each metadata tablet contains the location of a set of user tablets.
3rd level: a set of SSTable identifiers for each tablet.
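The hierarchy can be sketched with plain dictionaries; every server name and range below is made up for illustration:

```python
# 1st level: a file in Chubby names the server holding the root tablet
# (elided here; we start from the root tablet's contents).

# 2nd level: the root tablet maps row ranges to metadata tablets.
root_tablet = {("a", "m"): "meta-tablet-1", ("m", "~"): "meta-tablet-2"}

# 3rd level: each metadata tablet maps user-tablet row ranges to servers.
metadata_tablets = {
    "meta-tablet-1": {("a", "g"): "server-3", ("g", "m"): "server-9"},
    "meta-tablet-2": {("m", "t"): "server-1", ("t", "~"): "server-4"},
}

def locate(row_key):
    """Walk the hierarchy to find which server holds the tablet for row_key."""
    for (lo, hi), meta in root_tablet.items():
        if lo <= row_key < hi:
            for (mlo, mhi), server in metadata_tablets[meta].items():
                if mlo <= row_key < mhi:
                    return server
    return None
```

In the real system clients cache these locations, so most lookups skip the hierarchy entirely and only re-walk it on a cache miss.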
Tablet Assignment
Each tablet is assigned to one tablet server at a time. The master keeps track of the set of live tablet servers and the current assignment of tablets to tablet servers. Bigtable uses Chubby to keep track of tablet servers: when a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory. A tablet server stops serving its tablets if it loses its exclusive lock. The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets as soon as possible. When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them.
Tablet Serving
Write path: the server checks that the client has sufficient privileges for the write operation (via Chubby); a log record is written to the commit log file; once the write commits, its contents are inserted into the memtable.
Read path: the server checks that the client has sufficient privileges for the read operation (via Chubby); the read is performed on a merged view of (a) the SSTables that constitute the tablet and (b) the memtable.
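The merged view can be sketched with plain dicts standing in for the memtable and SSTables; the newest store is consulted first, so a fresh write shadows older on-disk values:

```python
def merged_read(row, column, memtable, sstables):
    """Read a cell from the merged view: the memtable holds the newest
    writes, SSTables hold progressively older ones; first hit wins."""
    for store in [memtable] + sstables:   # ordered newest to oldest
        value = store.get((row, column))
        if value is not None:
            return value
    return None

memtable = {("com.cnn.www", "contents:"): "<html>new"}
sstables = [{("com.cnn.www", "contents:"): "<html>old",
             ("com.cnn.www", "anchor:my.look.ca"): "CNN.com"}]
```

`merged_read("com.cnn.www", "contents:", memtable, sstables)` returns the memtable's `"<html>new"`, while a column absent from the memtable falls through to the SSTables.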
Compactions
As write operations execute, the size of the memtable increases. When it reaches a threshold, the memtable is frozen, converted to an SSTable, and written to GFS (minor compaction).
Merging compaction
- Periodically executed in the background - Reduces the number of SSTables - A good place to apply the policy of keeping only N versions
Major compaction
- A merging compaction that results in exactly one SSTable - No deletion records, only live data - Reclaims resources.
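The merge-with-version-pruning idea can be sketched over sorted runs of `(key, timestamp, value)` tuples, where `value is None` is a deletion marker. Note this simplified sketch also drops tombstones and the data they shadow, so it behaves like a major compaction over its inputs; a real merging compaction must keep deletion markers while older SSTables still exist:

```python
import heapq

def merging_compaction(sstables, keep_versions=1):
    """Merge sorted runs of (key, timestamp, value) into one run, keeping
    only the newest `keep_versions` per key; a None value is a deletion
    marker that suppresses all older versions of its key."""
    # Each input run is sorted by (key, -timestamp), so the k-way merge
    # yields each key's versions newest-first.
    merged = heapq.merge(*sstables, key=lambda e: (e[0], -e[1]))
    out, last_key, kept, deleted = [], object(), 0, False
    for key, ts, value in merged:
        if key != last_key:
            last_key, kept, deleted = key, 0, False
        if deleted or kept >= keep_versions:
            continue
        if value is None:          # tombstone: drop it and everything older
            deleted = True
            continue
        out.append((key, ts, value))
        kept += 1
    return out
```

Merging two runs where "a" has versions at t5 and t2 and "b" was deleted at t3 yields just `[("a", 5, "v5")]` with `keep_versions=1`: only live data, only the newest version.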
Refinements
Locality groups: group column families together into an SSTable. Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads.
Compression: locality groups can be compressed, using Bentley and McIlroy's scheme plus a fast compression algorithm that looks for repetitions.
Bloom filters on locality groups allow asking whether an SSTable might contain any data for a specified row/column pair. This drastically reduces the number of disk seeks required: lookups for non-existent rows or columns do not need to touch disk.
Caching for read performance (two levels of caching): the Scan Cache and the Block Cache.
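A minimal Bloom filter conveys why negative lookups can skip disk; the parameters and hashing scheme below are illustrative, not Bigtable's:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: answers 'definitely absent' or 'possibly
    present', so a negative lookup never has to touch the SSTable."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one int

    def _positions(self, key):
        # Derive num_hashes positions by salting a hash of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))
```

Keys would be row/column pairs like `"com.cnn.www/contents:"`; a `False` answer is definitive (skip the SSTable), a `True` answer may rarely be a false positive (one wasted seek, never a wrong result).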
Real Applications
Conclusion
Satisfies the goals of high-availability, high-performance, massively scalable data storage through a simple API. Successfully used by various Google products (>60). Additional features are in progress.
Significant flexibility comes from designing their own data model; bottlenecks and inefficiencies can be removed as they arise.