
Bigtable: A Distributed Storage System for Structured Data

OSDI 2006. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al. Presented by Srindhya K.

Why BigTable?

Lots of different kinds of data!


- Crawling system: URLs, contents, links, anchors, PageRank, etc.
- Per-user data: preferences, recent queries / search history
- Geographic data, images, etc.

Many incoming requests. No commercial system is big enough:


- Scale is too large for commercial databases
- Commercial systems may not run on Google's commodity hardware
- Avoids dependence on outside vendors

Google goals

Fault-tolerant, persistent

Scalable:

- 1000s of servers
- Millions of reads/writes, efficient scans

Self-managing

Simple!

Bigtable

Bigtable is a distributed storage system for managing structured data, designed to scale to a very large size:

Petabytes of data across thousands of servers

Used for many Google projects

Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, and more.

Flexible, high-performance solution for all of Google's products.

Basic Data Model

Distributed multi-dimensional sparse map

(row: string, column: string, timestamp: int64) -> cell contents (string)


[Figure: a slice of the example Webtable. The row key is com.cnn.www; the columns shown are contents and the anchor family (anchor:cnnsi.com, anchor:my.look.ca). The contents cell holds versions of the page (<html>...) at timestamps t2, t5, and t7; the anchor cells hold the link text "CNN" (t9) and "CNN.com" (t11).]
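To make the model concrete, here is a minimal sketch in Python (not Google's actual C++ client API); the dict-based `table`, `put`, and `get` are hypothetical stand-ins for illustration only:

```python
# Minimal sketch of Bigtable's logical data model:
# (row: str, column: str, timestamp: int) -> cell contents (str).
# Purely illustrative; the real implementation is nothing like a dict.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def get(row, column):
    """Return the most recent version of a cell, or None."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

# The Webtable example from the figure above:
put("com.cnn.www", "contents:", 2, "<html>...")
put("com.cnn.www", "contents:", 5, "<html>...")
put("com.cnn.www", "contents:", 7, "<html>...")
put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
put("com.cnn.www", "anchor:my.look.ca", 11, "CNN.com")

print(get("com.cnn.www", "anchor:cnnsi.com"))  # -> CNN
```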

Rows

A row key is an arbitrary string.


Access to data under a single row key is atomic. Row keys are typically 10-100 bytes in size, up to 64 KB.

Rows are ordered lexicographically. The row range for a table is dynamically partitioned; each partition (row range) is called a tablet.

Unit of distribution and load-balancing.

Objective: make read operations single-sited!

E.g., in Webtable, pages in the same domain are grouped together by reversing the hostname components of the URLs: com.google.maps instead of maps.google.com (a sketch of this reversal follows).
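A small sketch of the hostname-reversal trick (the helper function is hypothetical, not from the paper):

```python
def webtable_row_key(url_host):
    """Reverse hostname components so pages in the same domain sort together."""
    return ".".join(reversed(url_host.split(".")))

print(webtable_row_key("maps.google.com"))  # -> com.google.maps
print(webtable_row_key("www.cnn.com"))      # -> com.cnn.www
```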

Columns

Column keys are grouped into sets called column families. A column family must be created before data can be stored under any column key in that family. Column keys have a two-level name structure:

family:optional_qualifier

Column family

- Unit of access control
- Has associated type information

Qualifier gives unbounded columns

Additional level of indexing, if desired
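A small sketch of the family:optional_qualifier structure, including the rule that a family must exist before a write (the `families` set and the helper are hypothetical):

```python
families = {"contents", "anchor", "language"}  # created via schema changes

def split_column_key(column_key):
    """Split 'family:optional_qualifier' into its two parts."""
    family, _, qualifier = column_key.partition(":")
    if family not in families:
        raise KeyError(f"column family {family!r} must be created first")
    return family, qualifier

print(split_column_key("anchor:cnnsi.com"))  # -> ('anchor', 'cnnsi.com')
```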

Timestamps

64-bit integers. Assigned by:


- Bigtable: current real time, in microseconds
- Client application: when unique timestamps are a necessity

Items in a cell are stored in decreasing timestamp order. The application specifies how many versions (n) of a data item are maintained in a cell.

Bigtable garbage collects obsolete versions
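A toy sketch of per-cell versioning: versions kept in decreasing timestamp order, with garbage collection keeping only the newest n (illustrative, not Bigtable's implementation):

```python
def insert_version(versions, timestamp, value, n):
    """versions is a list of (timestamp, value) pairs, newest first."""
    versions.append((timestamp, value))
    versions.sort(key=lambda tv: tv[0], reverse=True)
    del versions[n:]  # garbage-collect versions beyond the newest n
    return versions

cell = []
for ts in (2, 5, 7, 9):
    insert_version(cell, ts, f"<html v{ts}>", n=3)
print(cell)  # -> [(9, '<html v9>'), (7, '<html v7>'), (5, '<html v5>')]
```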

Bigtable API

Implements interfaces to:


- Create and delete tables and column families
- Modify cluster, table, and column family metadata, such as access control rights
- Write or delete values in Bigtable
- Look up values from individual rows
- Iterate over a subset of the data in a table
- Atomic read-modify-write sequences on data stored under a single row key
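The paper's real client API is C++ (RowMutation, Scanner, etc.); the toy Python class below is entirely hypothetical and only illustrates the flavor of the listed operations:

```python
# Toy in-memory stand-in for the operations listed above.
class ToyTable:
    def __init__(self):
        self.rows = {}  # row key -> {column key -> value}

    def apply(self, row, sets=(), deletes=()):
        """Apply writes and deletes to a single row (atomic in spirit)."""
        cells = self.rows.setdefault(row, {})
        for column, value in sets:
            cells[column] = value
        for column in deletes:
            cells.pop(column, None)

    def lookup(self, row):
        """Look up all cells stored under one row key."""
        return self.rows.get(row, {})

    def scan(self, start, end):
        """Iterate over rows in the half-open key range [start, end)."""
        for row in sorted(self.rows):
            if start <= row < end:
                yield row, self.rows[row]

t = ToyTable()
t.apply("com.cnn.www", sets=[("anchor:www.c-span.org", "CNN")])
t.apply("com.cnn.www", deletes=["anchor:www.abc.com"])
for row, cells in t.scan("com.cnn", "com.cno"):
    print(row, cells)
```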

Background: Building Blocks

- Google File System (GFS): raw storage
- Scheduler: schedules jobs onto machines
- Lock service: Chubby, a distributed lock manager
- SSTable: a key/value database file format

How BigTable uses the building blocks

- GFS: stores persistent state
- Scheduler: schedules jobs involved in BigTable serving
- Lock service: master election, location bootstrapping
- SSTable: stores and retrieves key/data pairs

Chubby Lock Service

A persistent and distributed lock service. Consists of five active replicas; one replica is the master and serves requests. The service is functional when a majority of the replicas are running and in communication with one another. Implements a namespace that consists of directories and small files. Bigtable uses Chubby to:

- Ensure there is at most one active master at a time
- Store the bootstrap location of Bigtable data (the root tablet)
- Discover tablet servers and finalize tablet server deaths
- Store Bigtable schema information (column family information)

SSTable

Immutable, sorted file of key/value pairs. Chunks of data (blocks) plus an index.

Index is of block ranges, not values

[Diagram: an SSTable is a sequence of 64K blocks followed by a block index.]
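A toy sketch of how the block index narrows a lookup to one block read (real SSTables live on GFS and hold serialized, optionally compressed blocks):

```python
import bisect

# Toy SSTable: blocks of sorted (key, value) pairs plus an in-memory
# index of each block's first key. The index locates the one block to read.
blocks = [
    [("aardvark", "v1"), ("ant", "v2")],
    [("apple", "v3"), ("apricot", "v4")],
]
index = [block[0][0] for block in blocks]  # first key of each block

def sstable_get(key):
    i = bisect.bisect_right(index, key) - 1   # binary search the index
    if i < 0:
        return None
    for k, v in blocks[i]:                    # scan the single chosen block
        if k == key:
            return v
    return None

print(sstable_get("apple"))  # -> v3
```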

Tablets
How data gets spread across all the machines in the serving cluster.

Large tables broken into tablets at row boundaries.

Tablet holds contiguous range of rows

Clients can often choose row keys to achieve locality

Aim for 100MB or 200MB of data per tablet

Built out of multiple SSTables


[Diagram: a tablet spanning the row range Start:aardvark .. End:apple, built from two SSTables; each SSTable holds 64K blocks plus an index.]
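A sketch of routing a row key to its tablet using sorted start keys; the tablet boundaries and server names are made up for illustration:

```python
import bisect

# Tablets hold contiguous, non-overlapping row ranges, sorted by start key.
tablet_starts = ["", "apple_two_E", "boat"]   # "" = the first tablet
tablet_servers = ["server-17", "server-3", "server-9"]

def tablet_for_row(row_key):
    """Find the tablet (here: its server) whose range contains row_key."""
    i = bisect.bisect_right(tablet_starts, row_key) - 1
    return tablet_servers[i]

print(tablet_for_row("aardvark"))  # -> server-17
print(tablet_for_row("banana"))    # -> server-3
```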

Table

- Multiple tablets make up a table
- SSTables can be shared between tablets
- Tablets do not overlap; SSTables can overlap

[Diagram: two adjacent tablets (rows aardvark .. apple and apple_two_E .. boat) built from SSTables, with one SSTable shared between them.]

Implementation

A Bigtable library is linked into every client.

Many tablet servers:

- Tablet servers are added and removed dynamically
- Ten to a thousand tablets are assigned to each tablet server
- Each tablet is typically 100-200 MB in size

One master server, responsible for:

- Assigning tablets to tablet servers
- Detecting the addition and deletion of tablet servers
- Balancing tablet-server load
- Garbage collection of files in GFS

Clients communicate directly with tablet servers for reads/writes.

Locating Tablets
Since tablets move around from server to server, given a row, how do clients find the right machine?

One approach: could use the BigTable master.

A central server would almost certainly be a bottleneck in a large system.

Instead: store special tables containing tablet location info in the Bigtable cell itself.

Locating Tablets (contd.)

Three-level hierarchical lookup scheme for tablets:

- 1st level: a file stored in Chubby contains the location of the root tablet, i.e., a directory of ranges (tablets) and associated metadata. The root tablet never splits.
- 2nd level: each METADATA tablet contains the locations of a set of user tablets.
- 3rd level: a set of SSTable identifiers for each tablet.
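A toy sketch of the three-level lookup with a client-side location cache; the dicts stand in for the Chubby file, the root tablet, and a METADATA tablet (all names hypothetical):

```python
import bisect

# Toy stand-ins for the three levels; real lookups are RPCs.
CHUBBY = {"/bigtable/root-location": "root-tablet"}
ROOT = {"root-tablet": [("", "meta-tablet-0")]}            # ranges -> METADATA tablets
META = {"meta-tablet-0": [("", "user-tablet@server-17")]}  # ranges -> user tablets

def lookup_range(ranges, row_key):
    """Pick the entry whose row range contains row_key."""
    starts = [s for s, _ in ranges]
    return ranges[bisect.bisect_right(starts, row_key) - 1][1]

location_cache = {}

def locate_tablet(row_key):
    if row_key in location_cache:                 # cache hit: no lookups at all
        return location_cache[row_key]
    root = CHUBBY["/bigtable/root-location"]      # level 1: Chubby file
    meta = lookup_range(ROOT[root], row_key)      # level 2: root tablet
    loc = lookup_range(META[meta], row_key)       # level 3: METADATA tablet
    location_cache[row_key] = loc
    return loc

print(locate_tablet("com.cnn.www"))  # -> user-tablet@server-17
```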

Tablet Assignment

- Each tablet is assigned to one tablet server at a time.
- The master keeps track of the set of live tablet servers and the current assignment of tablets to tablet servers.
- Bigtable uses Chubby to keep track of tablet servers: when a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory (a toy sketch follows this list).
- A tablet server stops serving its tablets if it loses its exclusive lock.
- The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets as soon as possible.
- When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them.
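A toy sketch of that liveness protocol, with an in-memory dict standing in for the Chubby directory of lock files (all names hypothetical):

```python
# Toy model of the Chubby directory of tablet-server lock files.
servers_dir = {}  # lock-file path -> owning server

def tablet_server_start(server_id):
    """On startup, create and exclusively lock a uniquely-named file."""
    path = f"servers/{server_id}"
    assert path not in servers_dir, "lock already held"
    servers_dir[path] = server_id
    return path

def master_scan_live_servers():
    """The master learns the live server set by listing the directory."""
    return sorted(servers_dir.values())

def chubby_session_expired(path):
    """Losing the lock means the server must stop serving its tablets."""
    servers_dir.pop(path, None)

p = tablet_server_start("ts-42")
print(master_scan_live_servers())  # -> ['ts-42']
chubby_session_expired(p)
print(master_scan_live_servers())  # -> [] (master reassigns ts-42's tablets)
```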

Tablet Serving

Write operation arrives at a tablet server:

- The server checks that the client has sufficient privileges for the write operation (via Chubby)
- A log record is appended to the commit log file
- Once the write commits, its contents are inserted into the memtable

Read operation arrives at a tablet server:

- The server checks that the client has sufficient privileges for the read operation (via Chubby)
- The read is performed on a merged view of (a) the SSTables that constitute the tablet and (b) the memtable
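A toy sketch of both paths: writes go to the commit log and then the memtable; reads merge the memtable with the tablet's SSTables (in-memory stand-ins only):

```python
# Toy tablet state; illustrative only.
commit_log = []            # would be a file in GFS
memtable = {}              # in-memory buffer of recent writes
sstables = [{"com.cnn.www/contents:": "<html v5>"}]  # older, immutable data

def write(key, value):
    commit_log.append((key, value))   # 1. log the mutation first (redo log)
    memtable[key] = value             # 2. then apply it to the memtable

def read(key):
    """Read a merged view: memtable first, then SSTables."""
    if key in memtable:
        return memtable[key]
    for sst in sstables:              # newest-to-oldest in a real tablet
        if key in sst:
            return sst[key]
    return None

print(read("com.cnn.www/contents:"))  # -> <html v5> (from an SSTable)
write("com.cnn.www/contents:", "<html v7>")
print(read("com.cnn.www/contents:"))  # -> <html v7> (from the memtable)
```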

Compactions
As write operations execute, the size of the memtable increases.

Minor compaction: converts the memtable into an SSTable


- Reduces memory usage
- Reduces log traffic on restart

Merging compaction

- Periodically executed in the background
- Reduces the number of SSTables
- A good place to apply policy: e.g., keep only N versions

Major compaction

- A merging compaction that results in exactly one SSTable
- No deletion records, only live data
- Reclaims resources
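A toy sketch of minor and major compactions over the same structures (dicts stand in for the memtable and immutable SSTables; a merging compaction would apply the same merge to a subset of the SSTables):

```python
# Toy compactions; each dict stands in for one immutable SSTable.
memtable = {"k1": "new", "k3": "DELETED"}      # tombstone marks a deletion
sstables = [{"k1": "old", "k2": "v2"}, {"k3": "v3"}]

def minor_compaction():
    """Freeze the memtable into a new SSTable; shrinks memory and the log."""
    global memtable
    sstables.append(dict(memtable))
    memtable = {}

def major_compaction():
    """Merge everything into exactly one SSTable, dropping tombstones."""
    global sstables
    merged = {}
    for sst in sstables:            # oldest first; newer values overwrite
        merged.update(sst)
    merged = {k: v for k, v in merged.items() if v != "DELETED"}
    sstables = [merged]

minor_compaction()
major_compaction()
print(sstables)  # -> [{'k1': 'new', 'k2': 'v2'}]
```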

Refinements

- Locality groups: clients group column families together into an SSTable. Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads.
- Compression: locality groups can be compressed using Bentley and McIlroy's scheme plus a fast compression algorithm that looks for repetitions.
- Bloom filters on locality groups let a tablet server ask whether an SSTable might contain any data for a specified row/column pair. This drastically reduces the number of disk seeks required: reads of non-existent rows or columns never need to touch disk. (A minimal sketch follows this list.)
- Caching for read performance (two levels): the Scan Cache and the Block Cache.

- Commit-log implementation
- Speeding up tablet recovery (log entries)
- Exploiting immutability
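A minimal Bloom filter sketch, showing why a negative answer lets a read skip an SSTable's disk seek entirely (sizes and hash construction are illustrative):

```python
import hashlib

# Tiny Bloom filter over row/column pairs. False positives are possible,
# false negatives are not -- so "no" safely means "skip this SSTable".
M, K = 1024, 3  # bit-array size and number of hash functions (illustrative)
bits = bytearray(M)

def _hashes(key):
    for i in range(K):
        h = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(h[:4], "big") % M

def bloom_add(key):
    for b in _hashes(key):
        bits[b] = 1

def bloom_might_contain(key):
    return all(bits[b] for b in _hashes(key))

bloom_add("com.cnn.www/anchor:cnnsi.com")
print(bloom_might_contain("com.cnn.www/anchor:cnnsi.com"))  # -> True
print(bloom_might_contain("com.example/contents:"))         # -> False (almost surely)
```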

Real Applications

Used in many applications at Google, for example:

- Google Analytics
- Google Earth
- Personalized Search

Conclusion

Satisfies the goals of high availability, high performance, and massively scalable data storage, behind a simple API. Successfully used by various Google products (>60). Additional features in progress:

- Secondary indexes
- Cross-data-center replication
- Deployment as a hosted service

Advantages of the custom, in-house development:

- Significant flexibility from designing their own data model
- Bottlenecks and inefficiencies can be removed as they arise
