【Database】Cassandra

Posted by 西维蜀黍 on 2021-07-07, Last Modified on 2023-11-24

Cassandra

Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.

Main features

Distributed

Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master as every node can service any request.

Cassandra 是分布式的,这意味着它可以运行在多台机器上,并呈现给用户一个一致的整体。

对于很多存储系统(如 MySQL、Bigtable、HBase 等),一旦你开始扩展它,就需要把某些节点设为主节点,其他则作为从节点。但 Cassandra 是去中心的,也就是说每个节点都是一样的,没有节点会承担特殊的管理任务。与主从模式(master-slave)相反,Cassandra 的协议是 P2P 的,并使用 gossip 来维护存活或死亡节点的列表。

去中心化这一事实意味着 Cassandra 不会存在单点失效。Cassandra 集群中的所有节点的功能都完全一样, 所以不存在一个特殊的主机作为主节点来承担协调任务。有时这被叫做“server symmetry(服务器对称)”。

Supports replication and multi data center replication

Replication strategies are configurable. Cassandra is designed as a distributed system, for deployment of large numbers of nodes across multiple data centers. Key features of Cassandra’s distributed architecture are specifically tailored for multiple-data center deployment, for redundancy, for failover and disaster recovery.

Scalability

Designed to have read and write throughput both increase linearly as new machines are added, with the aim of no downtime or interruption to applications.

Fault-tolerant

Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.

Tunable consistency

Cassandra is typically classified as an AP system, meaning that availability and partition tolerance are generally considered to be more important than consistency in Cassandra, Writes and reads offer a tunable level of consistency, all the way from “writes never fail” to “block for all replicas to be readable”, with the quorum level in the middle.

简单地说,Cassandra 牺牲了一点一致性来换取了完全的可用性。但是 Cassandra 实际更应该表述为“可调一致性”,它允许你来方便地选定需要的一致性水平与可用性水平,在二者间找到平衡点。

因为客户端可以控制在更新到达多少个副本之前,必须阻塞系统。这是通过设置 replication factor(副本因子)来调节与之相对的一致性级别。 通过 replication factor,你可以决定准备牺牲多少性能来换取一致性。 replication factor 是你要求更新在集群中传播到的节点数(注意,更新包括所有增加、删除和更新操作)。

客户端每次操作还必须设置一个 consistency level(一致性级别)参数,这个参数决定了多少个副本写入成功才可以认定写操作是成功的,或者读取过程中读到多少个副本正确就可以认定是读成功的。这里,Cassandra 把决定一致性程度的权利留给了客户自己。

所以,如果需要的话,你可以设定 consistency level 和 replication factor 相等,从而达到一个较高的一致性水平,不过这样就必须付出同步阻塞操作的代价,只有所有节点都被更新完成才能成功返回一次更新。而实际上,Cassandra 一般都不会这么来用,原因显而易见(这样就丧失了可用性目标,影响性能,而且这不是你选择 Cassandra 的初衷)。而如果一个客户端设置 consistency level 低于 replication factor 的话,即使有节点宕机了,仍然可以写成功。

Data model

Cassandra is wide column store, and, as such, essentially a hybrid between a key-value and a tabular database management system. Its data model is a partitioned row store with tunable consistency. Rows are organized into tables; the first component of a table’s primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns may be indexed separately from the primary key.

Cassandra VS HBase

https://appinventiv.com/blog/hbase-vs-cassandra/

https://www.projectpro.io/article/hbase-vs-cassandra/485

https://www.instaclustr.com/blog/apache-hbase-vs-apache-cassandra-an-in-depth-comparison-of-nosql-databases/

Reference