Cassandra provides a highly scalable key/value storage that can be used for many applications. When Cassandra is to be used in production one might consider deploying it across multiple data centers for various reasons. For example, your current architecture is such that you update data in one data center and all the other data centers should have a replication of the same data but you are ok with eventual consistency.
In this blog post I will discuss how one can deploy a Cassandra across three data centers making sure every data center contains full copy of the complete data set (this is important because you don't have to go across data centers to serve the traffic coming into a given data-center.
I assume you already downloaded and configured Cassandra on each of the boxes in your data centers. Since most of the steps we are doing here should be done for each node in every data center, I encourage you to use a tool like cluster-ssh (this will enable to open connections to all the nodes and run commands in parallel).
Goals
Setup a Cassandra cluster on three data centers with four nodes in each cluster. Every piece of data will be places on three nodes (one in each data center). In other words replication factor is 3. Let's assume our nodes are named as DC<data-center-name>N<node-id>. For example, DC2N3 will be the third node in second data center.
Steps
Note that all these steps, except Step 4, must be followed in EACH AND EVERY node of the cluster. These steps are tested on Cassandra 0.8.7 version.
Step 1: Configure cassandra.yaml
Open up $CASSANDRA_HOME/conf/cassandra.yaml in your favorite test editor (did I hear emacs :D).
- change cluster_name to a suitable value instead of the boring 'Test Cluster'.
- Set the initial_token. Current Cassandra implementation does a very poor job of distributing keys across the cluster. So you need to give all the nodes they are responsible for. There are two ways to divide the keys among nodes.
- Distributing Keys Evenly: In this scenario we will distribute the range of keys across all the nodes in the cluster. Go here and enter the number of nodes that you have in total in all data centers. For our example it is 12. Once it is generated carefully copy each value and place in each of the node's cassandra.yaml file under initial_token.
They keys of each node in the data center should look like the following in our example.
| Data Center | Node | Key |
|---|
| 1 | 1 | 0 |
| 1 | 2 | 14178431955039101857246194831382806528 |
| 1 | 3 | 28356863910078203714492389662765613056 |
| 1 | 4 | 42535295865117307932921825928971026432 |
| 2 | 1 | 56713727820156407428984779325531226112 |
| 2 | 2 | 70892159775195516369780698461381853184 |
| 2 | 3 | 85070591730234615865843651857942052864 |
| 2 | 4 | 99249023685273724806639570993792679936 |
| 3 | 1 | 113427455640312814857969558651062452224 |
| 3 | 2 | 127605887595351923798765477786913079296 |
| 3 | 3 | 141784319550391032739561396922763706368 |
| 3 | 4 | 155962751505430122790891384580033478656 |
- Distributing Load Evenly: In this scenario we will distribute the load of the cluster across all the nodes in the cluster. Go here and enter the number of nodes that you have in each data centers. For our example it is 4 in this case. Copy the values generated into the nodes's cassandra.yaml file under initial_token in first data center. Then add one to each of these values and put that value on the nodes of second data center. Then for the third data center add two to the value of tokens in first data center (or add one to the values of nodes in second data center)
They keys of each node in the data center should look like the following in our example.
| Data Center | Node | Key |
|---|
| 1 | 1 | 0 |
| 1 | 2 | 42535295865117307932921825928971026432 |
| 1 | 3 | 85070591730234615865843651857942052864 |
| 1 | 4 | 127605887595351923798765477786913079296 |
| 2 | 1 | 1 |
| 2 | 2 | 42535295865117307932921825928971026433 |
| 2 | 3 | 85070591730234615865843651857942052865 |
| 2 | 4 | 127605887595351923798765477786913079297 |
| 3 | 1 | 2 |
| 3 | 2 | 42535295865117307932921825928971026434 |
| 3 | 3 | 85070591730234615865843651857942052866 |
| 3 | 4 | 127605887595351923798765477786913079298 |
Once we loaded the data into the cluster we've seen an equal distribution of load using the second method and also it is the recommended way for multiple data centers with snitch files.
- Point data_file_directories, commitlog_directory and saved_caches_directory to proper locations and make sure those locations do exists (otherwise create them).
- Set the seeds. It is best to select one node from each data center and list it here. For example, DC1N1, DC2N2, DC3N3
- Assuming your node is properly configured to return the right address when java calls InetAddress.getLocalHost(), leave listen_address and rpc_address blank. If you are not sure type hostname in each node and get that value as the address.
- Set endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch. We will provide a snitch file later (snitch file let Cassandra know the layout of our data centers.
That's pretty much it you have to do in cassandra.yaml (assuming you haven't touched any of the other default params)
Step 2: Configure log4j-server.properties
Find log4j.appender.R.File and point it to a proper location. Make sure you remember this because this is the log you will be searching for when things are going bad.
Step 3: Configure Snitch File
Open cassandra-topology.properties in a text editor and let Cassandra know about your node and data center configuration. For our example, this is how it should look like.
# Cassandra Node IP=Data Center:Rack
DC1N1=DC1:RAC1
DC1N2=DC1:RAC1
DC1N3=DC1:RAC1
DC1N4=DC1:RAC1
DC2N1=DC2:RAC1
DC2N2=DC2:RAC1
DC2N3=DC2:RAC1
DC2N4=DC2:RAC1
DC3N1=DC3:RAC1
DC3N2=DC3:RAC1
DC3N3=DC3:RAC1
DC3N4=DC3:RAC1
# default for unknown nodes
default=DC1:RAC1
Step 4: Start Your Cluster.
Goto $CASSANDRA_HOME and type ./bin/cassandra -f to bring up the node. Once you do this in all the nodes type ./bin/nodetool -h localhost ring to make sure all the nodes are up and running.
Step 5: Create Data Model with Replication
We are almost there. Now we need to tell Cassandra to use this configuration for our data model. The best way to do is through cassandra-cli.
Goto $CASSANDRA_HOME/bin and type ./cassandra-cli.
Type connect localhost/9160; to connect to the cluster. Note the semi-colon at the end. If successful you will see Connected to: "<YOUR_CLUSTER_NAME>" on localhost/9160;
Now you need to create the keyspace with proper replication. Assuming your keyspace name is MyCompanyKS type the following.
create keyspace MyCompanyKS with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options = [{DC1:1,DC2:1,DC3:1}];
That's it. Now you have an awesome Cassandra cluster spanning across three data centers. Enjoy !!
0 comments:
Post a Comment