Deploying Cassandra Across Multiple Data Centers with Replication
>> Thursday, October 13, 2011
Cassandra provides a highly scalable key/value store that can be used for many applications. When running Cassandra in production, you might consider deploying it across multiple data centers for various reasons. For example, your current architecture may update data in one data center while all the other data centers need a replica of the same data, and you are OK with eventual consistency.
In this blog post I will discuss how to deploy Cassandra across three data centers, making sure every data center contains a full copy of the complete data set. This is important because you then don't have to go across data centers to serve the traffic coming into a given data center.
I assume you have already downloaded and configured Cassandra on each of the boxes in your data centers. Since most of the steps here must be done on each node in every data center, I encourage you to use a tool like cluster-ssh, which lets you open connections to all the nodes and run commands in parallel.
Goals
Set up a Cassandra cluster across three data centers with four nodes in each data center. Every piece of data will be placed on three nodes (one in each data center). In other words, the replication factor is 3. Let's assume our nodes are named DC<data-center-name>N<node-id>.
Steps
Note that all these steps, except Step 4, must be followed on EACH AND EVERY node of the cluster. These steps were tested on Cassandra 0.8.7.
Step 1: Configure cassandra.yaml
Open up $CASSANDRA_HOME/conf/cassandra.yaml in your favorite text editor (did I hear emacs :D).
With the first method, the keys are spaced evenly across all twelve nodes of the cluster. The keys of each node in each data center should look like the following in our example.

Data Center   Node   Key
1             1      0
1             2      14178431955039101857246194831382806528
1             3      28356863910078203714492389662765613056
1             4      42535295865117307932921825928971026432
2             1      56713727820156407428984779325531226112
2             2      70892159775195516369780698461381853184
2             3      85070591730234615865843651857942052864
2             4      99249023685273724806639570993792679936
3             1      113427455640312814857969558651062452224
3             2      127605887595351923798765477786913079296
3             3      141784319550391032739561396922763706368
3             4      155962751505430122790891384580033478656
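To make the arithmetic behind these keys concrete, here is a minimal Python sketch. It assumes a token ring that covers 0 to 2**127 (which is what the values above imply) and the function names are hypothetical, not part of Cassandra. It uses exact integer division, so some results differ from the keys listed above in their low-order digits; those look like they were computed with floating-point arithmetic, and the small difference should not matter as long as the keys stay roughly evenly spaced.

```python
# Assumption: the token ring covers 0 .. 2**127, as the keys above suggest.
RING_SIZE = 2 ** 127


def evenly_spaced_tokens(total_nodes):
    """Return one token per node, spaced evenly around the whole ring."""
    return [i * RING_SIZE // total_nodes for i in range(total_nodes)]


def label_tokens(num_dcs, nodes_per_dc):
    """Map (data center, node) pairs to tokens, one data center at a time."""
    tokens = evenly_spaced_tokens(num_dcs * nodes_per_dc)
    return {
        (dc, node): tokens[(dc - 1) * nodes_per_dc + (node - 1)]
        for dc in range(1, num_dcs + 1)
        for node in range(1, nodes_per_dc + 1)
    }


if __name__ == "__main__":
    # Three data centers with four nodes each, as in this post.
    for (dc, node), token in sorted(label_tokens(3, 4).items()):
        print(f"DC{dc}N{node}  {token}")
```

The keys for DC1N1, DC1N4, DC2N3 and DC3N2 come out identical to the ones listed above, because they fall on exact quarters of the ring.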
With the second method, each data center gets its own evenly spaced set of keys, shifted by one from the previous data center. The keys of each node in each data center should look like the following in our example.

Data Center   Node   Key
1             1      0
1             2      42535295865117307932921825928971026432
1             3      85070591730234615865843651857942052864
1             4      127605887595351923798765477786913079296
2             1      1
2             2      42535295865117307932921825928971026433
2             3      85070591730234615865843651857942052865
2             4      127605887595351923798765477786913079297
3             1      2
3             2      42535295865117307932921825928971026434
3             3      85070591730234615865843651857942052866
3             4      127605887595351923798765477786913079298
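The second set of keys follows a simple pattern: split the ring evenly among the nodes of one data center, then shift every key by one for each additional data center so that no two nodes in the cluster share a key. The sketch below reproduces that pattern in Python. The ring size of 2**127 is an assumption read off the values above, and `per_dc_tokens` is a hypothetical helper name.

```python
# Assumption: the token ring covers 0 .. 2**127, as the keys above suggest.
RING_SIZE = 2 ** 127


def per_dc_tokens(num_dcs, nodes_per_dc):
    """Map (data center, node) pairs to tokens.

    Each data center splits the ring evenly among its own nodes; the
    data center's index (0, 1, 2, ...) is added as an offset so that
    every token in the cluster is unique.
    """
    result = {}
    for dc in range(1, num_dcs + 1):
        offset = dc - 1
        for node in range(1, nodes_per_dc + 1):
            base = (node - 1) * RING_SIZE // nodes_per_dc
            result[(dc, node)] = base + offset
    return result


if __name__ == "__main__":
    # Three data centers with four nodes each, as in this post.
    for (dc, node), token in sorted(per_dc_tokens(3, 4).items()):
        print(f"DC{dc}N{node}  {token}")
```

Because four divides 2**127 exactly, this prints the same twelve keys as the list above.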
Once we loaded the data into the cluster, we saw an equal distribution of load using the second method. It is also the recommended way for multiple data centers with snitch files.
