
Cassandra: Principles and Application

Dietrich Featherston
fthrstn2@illinois.edu
d@dfeatherston.com

Department of Computer Science
University of Illinois at Urbana-Champaign

Abstract

Cassandra is a distributed database designed to be highly scalable both in terms of storage volume and request throughput while not being subject to any single point of failure. This paper presents an architectural overview of Cassandra and discusses how its design is founded in fundamental principles of distributed systems. Merits of this design and practical use are given throughout this discussion. Additionally, a project is supplied demonstrating how Cassandra can be leveraged to store and query high-volume, consumer-oriented airline flight data.

1 Introduction

Cassandra is a distributed key-value store capable of scaling to arbitrarily large data sets with no single point of failure. These data sets may span server nodes, racks, and even multiple data centers. With many organizations measuring their structured data storage needs in terabytes and petabytes rather than gigabytes, technologies for working with information at such a scale are in high demand. Furthermore, the importance of data locality is growing as systems span large distances with many network hops.

As the scale of these systems grows to cover local and wide area networks, it is important that they continue functioning in the face of faults such as broken links, crashed routers, and failed nodes. The probability that any single component of a distributed system fails is usually low, but the probability of failure at some point increases in direct proportion to the number of components. Fault-tolerant, structured data stores for working with information at this scale are therefore becoming more critical.

Cassandra is designed to continue functioning in the face of component failure in a number of user-configurable ways. As we will see in sections 1.1 and 4.3, Cassandra enables high levels of system availability by compromising data consistency, but it also allows the client to tune the extent of this tradeoff. Data in Cassandra is optionally replicated onto N different peers in its cluster while a gossip protocol ensures that each node maintains state regarding each of its peers. This reduces the hard depends-on relationship between any two nodes in the system, which increases availability and partition tolerance.

Cassandra's core design brings together the data model described in Google's Bigtable paper [2] and the eventual consistency behavior of Amazon's Dynamo [3]. Cassandra, along with its remote Thrift API [5] (discussed in section 5), was initially developed by Facebook as a data platform on which to build many of its social services, such as Inbox Search, that scale to serve hundreds of millions of users [4]. After being submitted to the Apache Software Foundation Incubator in 2009, Cassandra was accepted as a top-level Apache project in March of 2010 [22].

In sections 1.1 and 4.3 we will see how Cassandra compensates for node failure and network partitioning. More specifically, section 1.1 discusses the fundamental characteristics distributed systems must sacrifice to gain resiliency to inevitable failures at sufficiently large scale.

Cassandra is one of many new data storage systems that make up the NoSQL movement. NoSQL is a term often used to describe a class of non-relational databases that scale horizontally to very large data sets but do not in general make ACID guarantees. NoSQL data stores vary widely in their offerings and in what makes each unique. In fact, some have observed that the entire movement is a way to group together fundamentally dissimilar systems based on what they do not have in common [19].

Section 2 presents the basic data model Cassandra uses to store data and contrasts that model against traditional relational databases. Section 3 discusses distributed hash table (DHT) theory, with section 3.1 going into more detail on how Cassandra's implementation of a DHT enables fault tolerance and load balancing within the cluster. Section 4 builds on previous sections with a deeper discussion of Cassandra's architecture. Section 4.2 gives coverage to data replication within a cluster. Section 4.3 discusses Cassandra's tunable consistency model for reading and writing data, section 4.4 contrasts the consistency and isolation guarantees of Cassandra against traditional ACID-compliant databases, and section 4.5 covers cluster growth. Finally, section 6 adds to the breadth of available case studies by showing how Cassandra can be used to model worldwide commercial airline traffic.

1.1 CAP Theorem and PACELC

As a simple tool to grasp a complex subject, the CAP Theorem has rapidly become a popular model for understanding the necessary tradeoffs in distributed data systems. A brief overview of this theorem is given here, as well as a model which attempts to refine CAP as it relates to popular NoSQL systems. This overview will help support discussion of the tradeoffs made by the Cassandra data store in later sections.

The CAP Theorem, first conceived in 2000 by Eric Brewer and formalized into a theorem in 2002 by Nancy Lynch [11], states that it is impossible for a distributed service to be consistent, available, and partition-tolerant at the same instant in time. We define these terms as follows.

Consistency means that all copies of data in the system appear the same to the outside observer at all times.

Availability means that the system as a whole continues to operate in spite of node failure. For example, the hard drive in a server may fail.

Partition-tolerance requires that the system continue to operate in spite of arbitrary message loss. Such an event may be caused by a crashed router or broken network link which prevents communication between groups of nodes.

However, this simplicity leaves room for potentially incorrect interpretation of the theorem. CAP should not be interpreted as meaning the system is either available or consistent, but rather that, when there is a partition, one or the other becomes more important for the system to maintain. Daniel Abadi of Yale University's Computer Science department has described a refining model referred to as PACELC [12], which he uses to clarify some of the tradeoffs buried within the CAP Theorem. PACELC reformulates and clarifies CAP in a way that applies to many popular NoSQL systems, Cassandra included, and has become a useful model for describing the fundamental behavior of NoSQL systems.

Figure 1: PACELC tradeoffs for distributed data services (if partition P, trade availability A against consistency C; else E, trade latency L against consistency C).

Figure 1 diagrams the distributed data service tradespace as described by PACELC. It states that when a system experiences partitioning P, it tends to require that tradeoffs be made between availability A and consistency C. Else E, under normal operation, it must make tradeoffs between consistency C and latency L. This conjecture describes the real-world operation of Cassandra very well. When the system experiences partitioning, Cassandra must sacrifice consistency to remain available, since write durability is impossible when a replica is present on a failed node. Under normal conditions, the user may make tradeoffs between consistency and latency of operations. Depending on the intended usage, the user of Cassandra can opt for Availability + Partition-tolerance or Consistency + Partition-tolerance. The reasons behind these tradeoffs are made more clear in section 4.3.

2 Data Model

Cassandra is a distributed key-value store. While some of its terminology is similar to that of relational databases, the reader should be careful not to map relational abstractions onto Cassandra. Unlike SQL queries, which allow the client to express arbitrarily complex constraints and joining criteria, Cassandra only allows data to be queried by its key. Additionally, indexes on non-key columns are not allowed and Cassandra does not include a join engine; the application must implement any necessary pieces of that functionality, and all logic regarding data interpretation stays within the application layer. For this reason the Cassandra data modeler must choose keys that can be derived or discovered easily and must ensure maintenance of referential integrity. Cassandra has adopted abstractions that closely align with the design of Bigtable [2, 15]. The primary units of information in Cassandra parlance are outlined as follows. This vernacular is used throughout the paper and is especially important to understanding the case study in section 6.

Column: A column is the atomic unit of information supported by Cassandra and is expressed in the form name : value. All column names and values are stored as bytes of unlimited size and are usually interpreted as either UTF-8 strings or 64-bit long integer types. Columns within a column family can be sorted either by UTF-8-encoded name, long integer, timestamp, or using a custom algorithm provided by the application. This sorting criterion is immutable and should be chosen wisely based on the semantics of the application.

Super Column: Super columns group together like columns with a common name and are useful for modeling complex data types such as addresses and other simple data structures.

Row: A Row is the uniquely identifiable data in the system which groups together columns and super columns. Every row in Cassandra is uniquely identifiable by its key. Row keys are important to understanding Cassandra's distributed hash table implementation.

Column Family: A Column Family is the unit of abstraction containing keyed rows which group together columns and super columns of highly structured data. Column families have no defined schema of column names and types supported. This is in stark contrast to the typical relational database, which requires predefined column names and types.

Keyspace: The Keyspace is the top level unit of information in Cassandra. Column families are subordinate to exactly one keyspace.

While variations exist, all queries for information in Cassandra take the general form

    get(keyspace, column family, row key)

This data model provides the application with a great deal of freedom to evolve how information is structured, with little ceremony surrounding schema design. An exception to this rule is the definition of new keyspaces and column families, which must be known at the time a Cassandra node starts (this limitation is current as of the released version of Cassandra 0.6; the current development version allows configuration of additional keyspaces and column families at runtime). In addition, this configuration must be common to all nodes in a cluster, meaning that changes to either will require an entire cluster to be rebooted.

Once the appropriate configuration is in place, the only action required for an application to change the structure or schema of its data is to start using the desired structure. When applications require evolving the structure of their information in Cassandra, they typically implement a process of updating old rows as they are encountered. This means applications may grow in complexity by maintaining additional code capable of interpreting this older data and migrating it to the new structure. While in some circumstances this may be advantageous over maintaining a strictly typed database, it underscores the importance of choosing a data model and key strategy carefully early in the design process.
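To make the nesting of these abstractions concrete, the sketch below models the addressing scheme get(keyspace, column family, row key) as plain Java maps. This is a conceptual illustration only, not how Cassandra stores data internally, and the class and method names are invented for the example.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: the nesting of Cassandra's abstractions modeled as
// plain Java maps, mirroring get(keyspace, column family, row key) -> columns.
public class DataModelSketch {

    // keyspace -> column family -> row key -> (column name -> column value)
    // A TreeMap stands in for the sorted column layout; plain string order is
    // used here rather than one of Cassandra's configurable comparators.
    private final Map<String, Map<String, Map<String, TreeMap<String, byte[]>>>> keyspaces =
            new HashMap<String, Map<String, Map<String, TreeMap<String, byte[]>>>>();

    public TreeMap<String, byte[]> get(String keyspace, String columnFamily, String rowKey) {
        Map<String, Map<String, TreeMap<String, byte[]>>> cfs = keyspaces.get(keyspace);
        if (cfs == null) return new TreeMap<String, byte[]>();
        Map<String, TreeMap<String, byte[]>> rows = cfs.get(columnFamily);
        if (rows == null) return new TreeMap<String, byte[]>();
        TreeMap<String, byte[]> columns = rows.get(rowKey);
        return columns == null ? new TreeMap<String, byte[]>() : columns;
    }

    public void put(String keyspace, String columnFamily, String rowKey,
                    String columnName, byte[] value) {
        Map<String, Map<String, TreeMap<String, byte[]>>> cfs = keyspaces.get(keyspace);
        if (cfs == null) keyspaces.put(keyspace,
                cfs = new HashMap<String, Map<String, TreeMap<String, byte[]>>>());
        Map<String, TreeMap<String, byte[]>> rows = cfs.get(columnFamily);
        if (rows == null) cfs.put(columnFamily,
                rows = new HashMap<String, TreeMap<String, byte[]>>());
        TreeMap<String, byte[]> columns = rows.get(rowKey);
        if (columns == null) rows.put(rowKey, columns = new TreeMap<String, byte[]>());
        columns.put(columnName, value);
    }
}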

3 Distributed Hash Tables

A distributed hash table (DHT) is a strategy for decentralized, keyed data storage offering get and put operations: get(key) : value and put(key, value). Reliable decentralized lookup of data has been a popular research topic in the internet age. The area began to receive a great deal of attention during the rise and fall of Napster, which maintained a listing of file locations on a central server [14, 10]. Users would connect to the central server, browse for the data they want, and request that data over a point-to-point connection with the server containing the referenced data. However, there are a few problems with this approach. Large amounts of strain in terms of processing power and bandwidth can arise at the central node, and any sufficiently large distributed system with a central coordinating node will experience bottlenecks at that node, which can make the entire system appear unavailable. DHTs offer a scalable alternative to the central server lookup, distributing lookup and storage over a number of peers with no central coordination required.

Any read or write operation on a DHT must locate the node containing information about the key. This is done through a system of key-based routing. Each node participating in a DHT stores a range of keys along with information about the ranges of keys available at zero or more other nodes in the system. Any node contacted for information regarding a key will forward that request to the next nearest node according to its lookup table. The more information each node maintains about its neighbors, the fewer hops are required to get to the correct node. While maintaining more state at each node means lower latency lookups due to a reduced number of hops, it also means nodes must exchange a greater amount of information about one another. The tradeoff between lookup latency and internal gossip between nodes is a fundamental driver behind DHT design. Figure 3 shows some pairings of interconnectedness and lookup complexity of common DHT solutions [17]:

    Connections    Number of hops
    O(1)           O(n)
    O(log n)       O(log n)
    O(sqrt(n))     O(1)

Many DHT systems such as Chord exhibit O(log n) lookup complexity, trading lookup latency against routing and interconnectedness complexity [13, 10]. This is the case with many earlier DHT implementations, which attempt to balance network churn against lookup latency. Cassandra's DHT implementation achieves O(1) lookup complexity and is often referred to as a one-hop DHT. This is a function of the gossip architecture, which ensures that each node eventually has state information for every other node. This includes the range of keys it is responsible for, a listing of other nodes and their availability, and other state information discussed in section 4.2.

3.1 Balanced Storage

Like other DHT implementations, nodes in a Cassandra cluster can be thought of as being arranged in a ring. Servers are numbered sequentially around the ring, with the highest numbered connecting back to the lowest numbered. In Cassandra each server is assigned a unique token which represents the range of keys for which it will be responsible; a token t may be any integer such that 0 ≤ t ≤ 2^127. Keys in Cassandra may be a sequence of bytes or a 64-bit integer; however, they are converted into the token domain using a consistent hashing algorithm. MD5 hashing is used by default, but an application may supply its own hashing function to achieve specific load balancing goals.

Cassandra's storage engine thus implements an MD5-keyed distributed hash table to evenly distribute data across the cluster. As a result of employing the MD5 hashing algorithm for distributing responsibility for keys around the token ring, Cassandra achieves even distribution of data and processing responsibilities within the cluster. This is due to the domain and range characteristics of the MD5 algorithm [18]: even when given similar but unique input, MD5 produces uniform output. The hashing applied to keys acts to naturally load-balance a Cassandra cluster, and this even distribution of data is essential to avoiding hotspots that could overburden a server's storage and its capacity to handle queries.
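As a concrete illustration of this key-to-node mapping, the sketch below hashes a key into the token domain with MD5 and walks the ring to the first node whose token is greater than or equal to the key's token, wrapping to the lowest token otherwise. It is a minimal sketch under those assumptions, not Cassandra's routing code; node addresses and token assignments are supplied by the caller, and at least one node is assumed to be present.

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal sketch of one-hop key lookup over an MD5 token ring.
public class TokenRing {

    // token -> node address, kept sorted so the next token clockwise can be found
    private final TreeMap<BigInteger, String> ring = new TreeMap<BigInteger, String>();

    public void addNode(BigInteger token, String node) {
        ring.put(token, node);
    }

    public String nodeForKey(String key) throws Exception {
        BigInteger token = md5Token(key);
        // first node with token >= key token; wrap to the lowest token otherwise
        SortedMap<BigInteger, String> tail = ring.tailMap(token);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    // MD5 digest of the key interpreted as a non-negative integer token
    private static BigInteger md5Token(String key) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        return new BigInteger(1, md5.digest(key.getBytes("UTF-8")));
    }
}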

4 Architecture

Our discussion of DHTs is central to understanding Cassandra's architecture. The DHT itself does not provide for fault tolerance, however, since its design only accounts for a single node at which information regarding a key resides. In the following sections we build on this understanding with discussions of how Cassandra implements reliable, decentralized data storage over DHT internals.

To the user, all nodes in a Cassandra cluster appear identical. The fact that each node is responsible for managing a different part of the whole data set is transparent, a property sometimes referred to as replication-transparency. While each node is responsible for only a subset of the data within the system, each node is capable of servicing any user request to read from or write to a particular key. Such requests are automatically proxied to the appropriate node by checking the key against the local replica of the token table.

4.1 Anatomy of Writes and Reads

The hashing function provides a lookup to the server primarily responsible for maintaining the row for a key. If a node n in a token ring of size N has token t_n, it is responsible for a key k under the following conditions:

    t_{n-1} < md5(k) ≤ t_n                      for 0 < n < N
    md5(k) > t_{N-1}  or  md5(k) ≤ t_0          for n = 0        (1)

Completing the token domain is the node with the lowest token value, which is responsible for all keys k matching the second condition in (1). Its range is often referred to as the wrapping range, since it captures all values at or below the lowest token as well as all values higher than the highest.

Once a write request reaches the appropriate node, it is immediately written to the commit log, which is an append-only, crash-recovery file in durable storage. The only I/O for which a client will be blocked is this append operation to the commit log, which keeps write latency low. Simultaneously, an in-memory data structure known as the memtable is updated with the write. Once this memtable (a term originating from the Bigtable paper [2]) reaches a certain size, it is flushed to durable storage known as an SSTable. A write request will not return a response until the write is durable in the commit log, unless a consistency level of ZERO is specified (see section 4.3 for details).

Reads are much more I/O intensive than writes and typically incur higher latency. Each of the replica nodes will first search its memtable for any requested columns; any SSTables will also be searched. Because the constituent columns for a given key may be distributed among multiple SSTables, each SSTable includes an index to help locate those columns. As the number of keys stored at a node increases, so does the number of SSTables. To help keep read latency under control, SSTables are periodically consolidated.
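The write path just described can be summarized in a short sketch: append to a commit log for durability, apply the write to an in-memory memtable, and retire the memtable to an immutable structure standing in for an SSTable once it passes a threshold. This is a rough illustration under those assumptions, not Cassandra's storage engine; the file format, the flush threshold, and the class names are invented.

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Rough sketch of the write and read paths described in section 4.1.
public class WritePathSketch {

    private final FileOutputStream commitLog;
    private final TreeMap<String, String> memtable = new TreeMap<String, String>();
    private final List<TreeMap<String, String>> sstables = new ArrayList<TreeMap<String, String>>();
    private final int flushThreshold;

    public WritePathSketch(String commitLogPath, int flushThreshold) throws IOException {
        // append-only crash-recovery file
        this.commitLog = new FileOutputStream(commitLogPath, true);
        this.flushThreshold = flushThreshold;
    }

    public synchronized void write(String key, String value) throws IOException {
        // the only blocking I/O on the write path is this append
        commitLog.write((key + "=" + value + "\n").getBytes("UTF-8"));
        commitLog.flush();

        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            // a real flush would write an indexed, sorted file to disk; here the
            // memtable is simply retired as an immutable in-memory "SSTable"
            sstables.add(new TreeMap<String, String>(memtable));
            memtable.clear();
        }
    }

    public synchronized String read(String key) {
        // reads consult the memtable first, then each SSTable, newest first
        if (memtable.containsKey(key)) return memtable.get(key);
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String v = sstables.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }
}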

4.2 Replication

As discussed in sections 3 and 3.1, Cassandra achieves even distribution of data and processing responsibilities within the cluster through consistent hashing. To allow keys to be read and written even when a responsible node has failed, Cassandra keeps N copies distributed within the cluster. N is also known as the replication factor and is configurable by the user. Each node keeps a listing of the N - 1 alternate servers where it will maintain additional copies; this listing is part of the information gossiped to every other node, as described in sections 3.1 and 4.5. For each token in the ring, the replication strategy returns the N - 1 alternate endpoints. Higher values of N contribute to availability and partition tolerance, but at the expense of read-consistency of the replicas.

Cassandra provides two basic strategies for determining which nodes should hold replicas for each token. Each strategy is provided both the logical topology (token ring ordering) and the physical layout (IP addresses) of the cluster.

Rack unaware is the default strategy used by Cassandra and ignores the physical cluster topology. This option begins at the primary node for a token and returns the endpoints of the next N - 1 nodes in the token ring.

Rack aware strategy attempts to improve availability by strategically placing replicas in different racks and data centers. This design allows the system to continue operating in spite of a rack or a whole data center being unavailable. Nodes are assumed to be in different data centers if the second octet of their IPs differ, and in different racks if the third octet differs. This replication strategy attempts to find one node in a separate data center, another on a different rack within the same data center, and then finds the remaining endpoints using the rack unaware strategy.

Users may also implement their own replica placement strategy to meet the specific needs of the system. For example, a strategy may take physical node geography, network latency, or other specific characteristics into account. The ability to tailor replica placement is an important part of architecting a sufficiently large Cassandra cluster. In addition to reduced latency, strategically placed replicas mean that services remain more highly available in the event of a network partition. When designing such a strategy, it is important to understand whether or not a given replica strategy will infect the cluster with unwanted hotspots of activity. The MD5 hashing of keys works to naturally load-balance a Cassandra cluster, but by targeting replicas at certain nodes it is possible to introduce unwanted patterns into the distribution of data.
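The rack unaware rule reduces to a few lines of code. The sketch below walks the sorted token ring clockwise from the primary node and collects the next N - 1 endpoints, assuming one token per node and that the primary token is present in the ring; it illustrates the placement rule described above and is not Cassandra's replication strategy implementation.

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of rack-unaware replica placement: the primary node plus the next
// N - 1 nodes clockwise around the token ring, ignoring physical topology.
public class RackUnawareSketch {

    public static List<String> replicasFor(TreeMap<BigInteger, String> ring,
                                           BigInteger primaryToken,
                                           int replicationFactor) {
        List<String> endpoints = new ArrayList<String>();
        BigInteger token = primaryToken;
        int wanted = Math.min(replicationFactor, ring.size());
        while (endpoints.size() < wanted) {
            endpoints.add(ring.get(token));
            BigInteger next = ring.higherKey(token);          // next token clockwise
            token = (next != null) ? next : ring.firstKey();  // wrap around the ring
        }
        return endpoints;
    }
}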

4.3 Consistency

Cassandra is often described as an eventually consistent data store. While this is true for most cases in which Cassandra is suitable, in reality Cassandra allows the user to make tradeoffs between consistency and latency. It does so by requiring that clients specify a desired consistency level with each read or write operation: ZERO, ONE, QUORUM, ALL, or ANY. Use of these consistency levels should be tuned in order to strike the appropriate balance between consistency and latency for the application. Lowering consistency requirements also means that read and write services remain more highly available in the event of a network partition.

A consistency level of ZERO indicates that a write should be processed completely asynchronously to the client. This gives no consistency guarantee but offers the lowest possible latency. This mode must only be used when the write operation can happen at most once and consistency is unimportant, since there is no guarantee that the write will be durable and ever seen by another read operation. A consistency level of ONE means that the write request will not return until at least one server where the key is stored has written the new data to its commit log. Even if that server crashes immediately following the operation, the new data is guaranteed to eventually turn up for all reads after it is brought back online. QUORUM requires that N/2 + 1 servers have durable copies, where N is the number of replicas. A consistency level of ALL means that a write will fail unless all replicas are updated durably.

A write consistency of ANY has special properties that provide for even higher availability at the expense of consistency. In this mode Cassandra nodes can perform what is known as hinted handoff. When a write request is sent to a node in the cluster and the node responsible for managing that key is unavailable, the first node contacted, even if it is not a member of the replica group for the applicable token, will maintain a hint and asynchronously ensure that the write eventually gets to a correct replica node. Hinted handoff allows writes to succeed without blocking the client pending handoff to a replica. ZERO and ANY are special and apply only to writes. The remaining consistency levels ONE, QUORUM, and ALL simply indicate the number of replicas that must be consulted during a read or made durable during a write. When a live node in the cluster is contacted to read or write information regarding a key and that node is not a replica for the key, the request is transparently proxied to a replica for that token.

Reads require coordination among the same number of replicas but have some unique properties. First, if any replicas are in conflict, the most recent version is returned to the client. In addition, any copies that are in conflict are repaired at the time of the read. This is known in Cassandra parlance as read-repair. Whether this read-repair happens synchronously with the caller or asynchronously depends on the stringency of the consistency level specified.

It is important to mention that Cassandra does not support the notion of a dynamic quorum, in which new quorum criteria are selected when the cluster is partitioned into two or more separate parts unable to communicate with one another [1, 15]. If a partition occurs that prevents the specified number of replicas from being consulted for either a read or a write, that operation will fail until the partition is repaired; the operation fails or succeeds based on the static quorum rules of the system.

Understanding how to tune the consistency of each read and write operation, we can now better understand how to balance consistency against the combination of latency and fault tolerance. If R is given as the number of replicas consulted during a read, and W as the number consulted during a write, Cassandra can be made fully consistent under the following condition:

    R + W > N        (2)

Note that this is consistent with basic replication theory [25]. QUORUM on reads and writes meets that requirement and is a common starting position which provides consistency without inhibiting performance or fault tolerance. If an application is more read- or write-heavy, then the consistency level can be tuned for that performance profile while maintaining full consistency as long as R + W > N. If lower latency is required after exploring these options, one may choose R + W <= N, and to achieve the lowest latency operations the most lenient consistency levels may be chosen for reads and writes. Application designers choosing a data store should consider these criteria carefully against their requirements.

4.4 Isolation and Atomicity

For many reliable, data-intensive applications such as online banking or auction sites, there are portions of functionality for which consistency is only one important factor. Factors such as atomicity of operations and isolation from other client updates can be critical in many cases and are often encapsulated within a transaction construct in relational databases. Cassandra guarantees that reads or writes for a key within a single column family are always atomic. Some atomicity is therefore offered, but it is limited in scope: batch updates within a column family for multiple keys are not guaranteed atomicity [15]. There is no notion of a check-and-set operation that executes atomically, and there is no way to achieve isolation from other clients working on the same data. In some cases application designers may choose to put the subset of data that should be subject to ACID guarantees in a transactional relational database while other data resides in a data store like Cassandra.
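Returning to the consistency levels of section 4.3, the replica-count arithmetic behind them is small enough to state in code. The sketch below follows the numbers given in the text (ONE = 1, QUORUM = N/2 + 1, ALL = N) and checks the R + W > N condition of equation (2); the enum and method names are illustrative, not Cassandra's API.

// Sketch of the replica-count arithmetic behind read/write consistency levels.
public class ConsistencyMath {

    enum Level { ONE, QUORUM, ALL }

    static int replicasConsulted(Level level, int replicationFactor) {
        switch (level) {
            case ONE:    return 1;
            case QUORUM: return replicationFactor / 2 + 1;
            case ALL:    return replicationFactor;
            default:     throw new IllegalArgumentException();
        }
    }

    // R + W > N guarantees a read quorum overlaps the most recent write quorum
    static boolean fullyConsistent(Level read, Level write, int replicationFactor) {
        return replicasConsulted(read, replicationFactor)
             + replicasConsulted(write, replicationFactor) > replicationFactor;
    }

    public static void main(String[] args) {
        int n = 3;
        System.out.println(fullyConsistent(Level.QUORUM, Level.QUORUM, n)); // true  (2 + 2 > 3)
        System.out.println(fullyConsistent(Level.ONE, Level.ONE, n));       // false (1 + 1 <= 3)
    }
}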

4.5 Elastic Storage

Up to this point, discussion has focused on Cassandra's behavior in the context of a statically defined group of nodes. One of the attractions of Cassandra is that it allows scaling to arbitrarily large data sets without rethinking the fundamental approach to storage. Cassandra clusters scale through the addition of new servers rather than requiring the purchase of ever more powerful servers; this is often referred to as horizontal versus vertical scaling. Hardware requirements and costs scale roughly linearly with storage requirements. To understand how a cluster can grow or shrink over time, we revisit the topic of distributed hash tables as implemented by Cassandra. Recall that nodes are arranged in a logical ring where each node is responsible for a range of keys as mapped to the token domain using consistent hashing.

The process of introducing a new node into a Cassandra cluster is referred to as bootstrapping and is usually accomplished in one of two ways. The first is to configure the node to bootstrap itself to a particular token, which dictates its placement within the ring. For example, if a particular node has limited storage, bandwidth, or processing capability, it may make sense to assign it responsibility for a smaller slice of the token range.

A second common way to bootstrap a new node is for the cluster to select a token dictating this node's placement in the ring. The goal of the election is to choose a token for the new node that will make it responsible for approximately half of the data on the node with the most data that does not already have another node bootstrapping into its range [1, 15]. While this process is technically an election in distributed systems vernacular, nodes are not contacted in an ad-hoc way to initiate the election. Rather, the new node unilaterally makes this decision based on storage load data gossiped from other nodes periodically [1, 15].

Figure 2: Node tn' during bootstrap into the token ring between t(n-1) and tn.

Figure 2 shows a new node tn' being bootstrapped to a token between t(n-1) and tn. Once its token is set, data from the node with the next highest token will begin migration to the new node. The approximate fraction of data that will be migrated from node tn to tn' can be calculated as given in equation (3):

    (tn' - t(n-1)) / (tn - t(n-1))        (3)

This calculation can be used in selecting a token value that achieves specific load balancing goals within the cluster.
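The bootstrap arithmetic is easy to check directly. The sketch below computes the migrated fraction of equation (3) and a midpoint token that makes a new node responsible for roughly half of an existing node's range, matching the election outcome described above. It is purely illustrative; the class and method names are invented.

import java.math.BigInteger;

// Sketch of the bootstrap arithmetic in equation (3).
public class BootstrapMath {

    // (t' - tPrev) / (t - tPrev), returned as a value in [0, 1]
    static double migratedFraction(BigInteger tPrev, BigInteger tNew, BigInteger t) {
        BigInteger num = tNew.subtract(tPrev);
        BigInteger den = t.subtract(tPrev);
        return num.doubleValue() / den.doubleValue();
    }

    // choosing the midpoint of (tPrev, t] makes the new node responsible for
    // roughly half of the existing node's keys
    static BigInteger midpointToken(BigInteger tPrev, BigInteger t) {
        return tPrev.add(t.subtract(tPrev).divide(BigInteger.valueOf(2)));
    }
}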

5 Client Access

The Cassandra data store is implemented in Java. However, there is no native Java API for communicating with Cassandra from a separate address space. Instead, Cassandra implements services using Thrift [5]. Thrift is a framework that includes a high-level grammar for defining services, remote objects, and types, along with a code generator that produces client and server RMI stubs in a variety of languages. Because of Thrift's generic nature, many languages are supported, including Java, Python, Ruby, and Erlang. Each Cassandra node starts a Thrift server exposing services for interacting with data and for introspecting information about the cluster; in particular, Cassandra's Thrift API supports introspection of the token ranges in the ring [1].

Client interaction with a decentralized distributed system like Cassandra can be challenging, and the Thrift API to Cassandra exhibits those challenges. One of the more obvious problems is for the client to decide which host it will send requests to. Recall that each node is capable of responding to any client request [15]. However, choosing a node with a replica of the data to be read or written results in reduced communication overhead between nodes to coordinate a response. It follows that routing all requests to a single node can be a source of bottlenecks in the system. In addition, if that node has failed or is otherwise unreachable, the entire cluster may be perceived as being unavailable. Because of Thrift's generic nature, it leaves open the possibility of a perceived single point of failure, it does not intelligently route service invocations to replica nodes, and it is unlikely these issues will be addressed at the Thrift layer.

Hector is a native Java client for Cassandra which has begun to tackle the challenges associated with accessing this decentralized system. Hector itself is actually a layer on top of the Thrift API, and as such it depends on Thrift's client and server side bindings [7]. It introspects data about the state of the ring in order to determine the endpoint for each token range. This information is then used to implement three modes of client-level failover [7]:

FAIL FAST implements classic behavior, failing a request if the first node contacted is down.

ON FAIL TRY ONE NEXT AVAILABLE attempts to contact one more node in the ring before failing.

ON FAIL TRY ALL AVAILABLE will continue to contact nodes, up to all in the cluster, before failing.

As of this writing, no client could be found which intelligently routes requests to replica nodes. Hector introspects the status of nodes in the cluster, but this does not help in identifying the nodes where replicas are located, as that would require knowledge of the replica placement strategy discussed in section 4.2. A truly intelligent client requires up-to-date information regarding replica locations, and it stands to reason that such a client would need to become a receiver of at least a subset of the gossip passed between nodes and thus an actual member of the cluster. In this style, clients become not passive users of the cluster, but active members of it. Depending on an application's purpose, it may make sense to distribute a number of clients throughout the cluster with different responsibilities. This may also be a strategy for balancing the workload across a number of different clients operating on data retrieved from the Cassandra cluster. It is expected that best practices for client interaction with Cassandra and similar decentralized data stores will receive significant attention in the near future.
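The three failover modes reduce to a single question of how many hosts a client may try before giving up. The sketch below illustrates that behavior in a generic form; the Operation interface and the host list are invented for the example, and this is not Hector's actual API.

import java.util.List;

// Sketch of client-level failover: try one host, one extra host, or all hosts.
public class FailoverSketch {

    enum Mode { FAIL_FAST, ON_FAIL_TRY_ONE_NEXT_AVAILABLE, ON_FAIL_TRY_ALL_AVAILABLE }

    interface Operation<T> {
        T execute(String host) throws Exception;
    }

    static <T> T execute(List<String> hosts, Mode mode, Operation<T> op) throws Exception {
        int attempts;
        switch (mode) {
            case FAIL_FAST:                      attempts = 1; break;
            case ON_FAIL_TRY_ONE_NEXT_AVAILABLE: attempts = Math.min(2, hosts.size()); break;
            default:                             attempts = hosts.size(); break;
        }
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return op.execute(hosts.get(i));
            } catch (Exception e) {
                last = e; // move on to the next host, if the mode allows it
            }
        }
        if (last == null) throw new IllegalStateException("no hosts to contact");
        throw last;
    }
}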

6 Case Study: Modeling Airline Activity

To illustrate the architectural principles discussed thus far, a simple study is presented which models airline activity and allows searching for flights from one airport to another with a configurable number of connections, or hops. Following this study, a discussion is held regarding the merits and drawbacks of applying Cassandra to this problem domain.

The largest known users of Cassandra in industry are social networking sites such as Facebook, Twitter, and Digg [15, 23], and the most readily available examples of Cassandra usage focus on social networking applications. It is the goal of this paper to contribute to the breadth of examples in the community by studying an unrelated use case: modeling commercial airline activity. More specifically, this paper attempts to model a problem domain familiar to the audience (or at least the regular traveler). The author has made all source code for the data manipulation and search discussed in this section available on github [26] in a project named Brireme (a brireme is a ship used in Ancient Greece during the time of Cassandra).

In this example we largely ignore the possibilities of inconsistent data that could arise when using an eventually consistent store such as Cassandra. It is assumed that the likelihood of a user deciding on a flight that changes over the course of their search is acceptably low, and that, in the event a particular flight is invalidated over the course of a search, a purchasing system would alert the user, requiring them to begin their search again. When working with changing data at sufficiently large scale, such risks cannot be completely mitigated.

Source data for this demonstration was obtained from OAG [24]. Flight data is given in schedule format, which is a compact representation of the flights an airline intends to fly, including carrier, flight number, start day, end day, and days of week. The source data is expanded from flight schedules to flight instances and stored in Cassandra. A small subset of data is chosen for this paper which captures activity at a selection of airports over the course of a few weeks. Each query will contain four pieces of information: a date, a departure airport, an arrival airport, and the total number of flights, or hops, allowed.

In order to model airline activity in a way that allows searching for flights on a given day, we must carefully select a data model and a system of unique keys supporting the required searches. All examples are in the context of a single keyspace. To efficiently search this data for flights departing an airport on a particular day, a lookup is required that easily finds lists of such flights. For this purpose a column family is introduced which we call FlightDeparture. The departure date and airport are combined to create a unique row key, and each row contains a complete listing of flights leaving that airport on that day. For example, the key 20100720-DCA maps to a list of flights departing Washington Reagan Airport on July 20, 2010. The following map data structure shows an abbreviated row:

20100720-DCA =>
(201007200545-DCA-CLT-US-1227,
 201007200545-DCA-MBJ-US-1227,
 201007200600-DCA-ATL-DL-2939,
 201007200600-DCA-ATL-FL-183,
 201007200600-DCA-DCA-DL-6709,
 201007200600-DCA-DFW-AA-259)

Each of these flights is represented as a single column within the row: the row key is 20100720-DCA and the flight identifiers are the column names. Note that the column name itself contains all the information this column family is designed to store. This is by design, to limit the amount of data contained in a row and thus help minimize bandwidth usage and latency during searches.

A separate column family, which we call Flight, contains detailed information regarding each flight. Its row keys are the flight identifiers stored as column names in the FlightDeparture column family (e.g. 201007200730-DCA-SEA-AA-1603), so a key found while following a list of departures can be used directly to look up the columns comprising that flight. The Flight column family holds the carrier, departure and arrival airports, and flight number, along with other fields not used here:

201007200730-DCA-SEA-AA-1603 =>
(takeoff, 201007200730)
(landing, 201007201200)
(carrier, AA)
(flight, 1603)
(departureAirport, DCA)
(departureCity, WAS)
(departureCountry, US)
(arrivalAirport, SEA)
(arrivalCity, SEA)
(arrivalCountry, US)

In addition to these primary column families, auxiliary column families Carrier and Airport are maintained for looking up further details about the carrier and airports involved in a given flight. This detailed information could instead have been stored as super columns within a single Flight row, but such a strategy trades disk usage for speed of information retrieval and would present storage problems for a sufficiently large number of flights. At the same time, denormalization is often a good strategy for minimizing bandwidth and protocol overhead between the client and the nodes in the cluster, as it reduces the need to issue additional requests.
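The row keys and column names above follow a simple convention that can be assembled from flight fields. The sketch below mirrors the formats shown in the examples (20100720-DCA and 201007200730-DCA-SEA-AA-1603); the class and method names are invented for illustration.

// Sketch of the key construction used by the FlightDeparture and Flight column families.
public class FlightKeys {

    // row key in the FlightDeparture column family: <day>-<departure airport>
    static String departureRowKey(String day, String departureAirport) {
        return day + "-" + departureAirport;              // e.g. 20100720-DCA
    }

    // column name in FlightDeparture and row key in Flight:
    // <takeoff timestamp>-<from>-<to>-<carrier>-<flight number>
    static String flightKey(String takeoff, String from, String to,
                            String carrier, String flightNumber) {
        return takeoff + "-" + from + "-" + to + "-" + carrier + "-" + flightNumber;
    }

    public static void main(String[] args) {
        System.out.println(departureRowKey("20100720", "DCA"));
        System.out.println(flightKey("201007200730", "DCA", "SEA", "AA", "1603"));
        // prints: 20100720-DCA
        //         201007200730-DCA-SEA-AA-1603
    }
}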

With future planned flights loaded into Cassandra, an algorithm for searching flights over a given number of hops is developed and shown in Appendix A, with a Java implementation given in Appendix B (the full Java implementation from the author may be found at [26]). The procedures get_flights_by_origin and get_flight_by_id are taken to be simple implementations that look up single rows of data using the respective FlightDeparture and Flight column families.

The algorithm given is straightforward to understand, and its data access pattern of single-object lookups lends itself well to implementation over a key-value store. With this data set, two-hop flight routes can be calculated in just a few seconds, and flight combinations covering three hops can be calculated in under a minute on a single Cassandra node. The primary latency is the communication overhead involved in making repeated requests for flights. It is expected that these algorithms would maintain a similar performance profile as the cluster grows to encompass many nodes, as long as nodes are connected by high-bandwidth links. This hypothesis is based on the one-hop DHT substrate underlying Cassandra discussed in section 3. Further research is needed to understand how well this model scales with system and data growth.

It is worth considering how the same problem would be approached with a traditional relational database. One approach would be to implement a normalized database schema in which a single table would likely be used to represent flight data. Indexes supporting constraints on airport and flight times would also need to be in place for efficient queries without full table scans. Even this small data set represents nearly 22 million flights and 2 gigabytes of source data, and if all data were captured this volume could be expected to grow by orders of magnitude. A similar data set stored in a relational database, with the secondary indexes necessary for temporal search, would be larger still due to the additional storage requirements of those indexes and other relational features. As the indexes supporting these queries grow, such an approach would not scale gracefully: data set growth eventually requires partitioning across additional nodes, which would present a challenge. The data designer would then be faced with dropping dependencies on database features such as joins and other features that do not scale well to multi-node, multi-master environments, and with denormalizing the schema to further improve query performance. Such a pattern closely resembles a key-value store like Cassandra, but it begins to abandon the relational features the database was designed for.

7 Conclusions

This paper has provided an introduction to Cassandra and the fundamental distributed systems principles on which it is based. Throughout this discussion we have contrasted this distributed key-value design with the traditional relational database and shown how it provides a formula for working with data at internet scale. Cassandra's model for trading consistency for availability and latency has been grounded by referencing the CAP Theorem and a lesser-known refinement known as PACELC. The concept of distributed hash tables has been discussed both in general and as a substrate for building a decentralized, evolvable key-value data store. We have seen how data can be distributed across a wide-area network with intelligent replica placement to maximize availability in the event of node failure and network partitioning. Lastly, an example modeling a real-world problem domain with large data requirements has been used to illustrate these principles. In doing so, the goal has been to add to the breadth of examples in the community for working with sparse column data stores like Cassandra.

A Flight Search Algorithm

Algorithm for finding flights over a given number of hops.

begin
  proc get_flights(date, dep_airport, dest_airport, hops) ≡
    options := ();
    legs := ();
    comment: get all flights leaving dep_airport on date using the FlightDeparture CF;
    flight_keys := get_flights_by_origin(concat(date, dep_airport));
    for flight_keys.each() ⇒ flight_key do
      comment: look up flight details from the Flight CF;
      flight := get_flight_by_id(flight_key);
      arr_airport := get_arr_airport(flight_key);
      if arr_airport = dest_airport
        comment: capture this one-hop flight as a route of its own;
        route := ();
        route.add(flight);
        options.add(route);
      elsif hops > 1
        comment: recursively search for flights over the requested number of hops;
        legs.push(flight);
        traverse_flights(options, legs, date, dest_airport, 2, hops);
        legs.pop();
      fi
    od;
    comment: return a list of flight routes matching our criteria; each element of this list;
    comment: is a list of connecting flights from dep_airport to dest_airport;
    return(options).

  comment: recursive portion of the algorithm performs a depth-first traversal of flight options;
  proc traverse_flights(options, legs, date, dest_airport, level, hops) ≡
    last_leg := legs.peek();
    arrival := last_leg.arrival_airport();
    comment: get all flights leaving the connecting airport on date;
    flight_keys := get_flights_by_origin(concat(date, arrival));
    for flight_keys.each() ⇒ flight_key do
      flight := get_flight_by_id(flight_key);
      comment: consider the flight only if it departs after the last leg lands at the connecting airport;
      if flight.happens_after(last_leg)
        comment: see if this flight reaches the destination airport;
        if flight.arrival_airport() = dest_airport
          route := ();
          route.add_all(legs);
          route.add(flight);
          options.add(route);
        elsif level < hops
          legs.push(flight);
          traverse_flights(options, legs, date, dest_airport, level + 1, hops);
          legs.pop();
        fi
      fi
    od.
end

B Flight Search Algorithm (Java Implementation)

List<List<FlightInstance>> getFlights(String day, String dep, String arr,
    boolean sameCarrier, int hops) throws Exception {
  // holds all verified routes
  List<List<FlightInstance>> options = new ArrayList<List<FlightInstance>>();
  // temporary data structure for passing connecting information
  Stack<FlightInstance> legs = new Stack<FlightInstance>();

  List<String> flightIds = getFlights(day, dep);
  for (String flightId : flightIds) {
    String arrivalAirport = getArrivalAirport(flightId);
    if (arrivalAirport.equals(arr)) {
      // build new connection list with only this flight
      List<FlightInstance> flights = new ArrayList<FlightInstance>();
      flights.add(getFlightById(flightId));
      options.add(flights);
    } else if (hops > 1) {
      // look at possible destinations connecting from this flight
      legs.push(getFlightById(flightId));
      traverseFlights(options, legs, day, arr, sameCarrier, 2, hops);
      legs.pop();
    }
  }
  return options;
}

void traverseFlights(List<List<FlightInstance>> optionList, Stack<FlightInstance> legs,
    String day, String arr, boolean sameCarrier, int level, int hops) throws Exception {
  // get the connection information from the last flight and
  // search all outbound flights in search of our ultimate destination
  FlightInstance lastLeg = legs.get(legs.size() - 1);
  String arrivingAt = lastLeg.getArrivalAirport();

  List<String> flightIds = getFlights(day, arrivingAt);
  for (String flightId : flightIds) {
    FlightInstance flight = getFlightById(flightId);
    if (flight.happensAfter(lastLeg)) {
      if (canTerminate(flight, sameCarrier, arr, lastLeg)) {
        // build new route with all prior legs, adding this flight to the end
        List<FlightInstance> route = new ArrayList<FlightInstance>(legs.size() + 1);
        route.addAll(legs);
        route.add(flight);
        // copy this route to the verified set that go from dep -> arr
        optionList.add(route);
      } else if (level < hops) {
        legs.push(flight);
        traverseFlights(optionList, legs, day, arr, sameCarrier, level + 1, hops);
        legs.pop();
      }
    }
  }
}

boolean canTerminate(FlightInstance flight, boolean sameCarrier, String arr,
    FlightInstance lastLeg) {
  return flight.getArrivalAirport().equals(arr)
      && (!sameCarrier || flight.hasSameCarrier(lastLeg));
}
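A brief usage sketch follows, assuming the methods above live in a class named FlightSearch (the enclosing class name and driver are hypothetical and not part of the appendix).

// Hypothetical driver for the methods above.
public static void main(String[] args) throws Exception {
    FlightSearch search = new FlightSearch();
    // routes from DCA to SEA on 2010-07-20 with at most three legs,
    // without restricting connections to a single carrier
    List<List<FlightInstance>> routes = search.getFlights("20100720", "DCA", "SEA", false, 3);
    for (List<FlightInstance> route : routes) {
        System.out.println(route);
    }
}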

References

[1] Misc. Authors. Apache Cassandra 0.6.3 Java Source Code. Available from http://cassandra.apache.org, 2010.

[2] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R. Gruber. Bigtable: A Distributed Storage System for Structured Data. OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 2006.

[3] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, W. Vogels. Dynamo: Amazon's Highly Available Key-value Store. In Proceedings of the Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, pp. 205-220, 2007.

[4] A. Lakshman, P. Malik. Cassandra: A Decentralized Structured Storage System. 2009.

[5] M. Slee, A. Agarwal, M. Kwiatkowski. Thrift: Scalable Cross-Language Services Implementation. Facebook, Palo Alto, CA, April 2007.

[6] R. Tavory. Hector: a Java Cassandra client. http://prettyprint.me/2010/02/23/hector-a-java-cassandra-client, February 2010.

[7] R. Tavory. Hector Java Source Code. Available from http://github.com/rantav/hector, 2010.

[8] Thrift Wiki. http://wiki.apache.org/thrift

[9] F. Cristian. Understanding Fault-Tolerant Distributed Systems. University of California, San Diego, La Jolla, CA, May 1993.

[10] A. Gupta, B. Liskov, R. Rodrigues. Efficient Routing for Peer-to-Peer Overlays. MIT Computer Science and Artificial Intelligence Laboratory. Proceedings of the 1st Symposium on Networked Systems Design and Implementation, 2004.

[11] N. Lynch, S. Gilbert. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News, v. 33, issue 2, pp. 51-59, 2002.

[12] D. Abadi. Problems with CAP, and Yahoo's little known NoSQL system. http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html, Yale University, April 2010.

[13] I. Stoica, R. Morris, D. Liben-Nowell, D. Karger, M. F. Kaashoek, F. Dabek, H. Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. Technical report, MIT LCS, 2002.

[14] K. Nagaraja, S. Rollins, M. Khambatti. Looking Beyond the Legacy of Napster and Gnutella. 2004.

[15] J. Ellis, et al. Cassandra Wiki. http://wiki.apache.org/cassandra/FrontPage, 2010.

[16] Cassandra Gossiper Architecture. http://wiki.apache.org/cassandra/ArchitectureGossip, 2010.

[17] Distributed Hash Table. http://en.wikipedia.org/wiki/Distributed_hash_table

[18] MD5. http://en.wikipedia.org/wiki/MD5

[19] M. Loukides. What is data science? Analysis: The future belongs to the companies and people that turn data into products. http://radar.oreilly.com/2010/06/what-is-data-science.html, June 2010.

[20] Apache Software Foundation. Apache License Version 2.0. http://www.apache.org/licenses/

[21] J. Persyn. Database Sharding at Netlog. Presented at FOSDEM 2009, Brussels, Belgium, 2009.

[22] Apache Software Foundation. The Apache Software Foundation Announces New Top-Level Projects. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces4, Forest Hill, MD, May 4, 2010.

[23] I. Eure. Looking to the future with Cassandra. http://about.digg.com/blog/looking-future-cassandra, September 2009.

[24] OAG Aviation. http://www.oag.com/

[25] G. Coulouris, J. Dollimore, T. Kindberg. Distributed Systems: Concepts and Design. Addison Wesley, 2005.

[26] D. Featherston (the author). Brireme project on Github. http://github.com/dietrichf/brireme