Cassandra data modeling

This topic describes Cassandra data modeling concepts and recommendations.

Cassandra is a distributed, decentralized, highly available wide-column NoSQL database. It uses consistent hashing to distribute data across the cluster, and each node runs a storage engine based on the Log-Structured Merge-Tree (LSM-Tree).

The nodes in the cluster evenly divide the entire hash (token) range among themselves. Any node can act as a proxy (coordinator) that accepts client requests, and each node manages the data whose partition keys hash into its assigned range. Based on the keyspace's replication strategy and the cluster's snitch, Cassandra replicates the data in each node's range to other nodes. This replication improves data reliability and service availability.
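The placement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: MD5 stands in for Cassandra's real partitioner, the token space is shrunk to 32 bits, each node owns exactly one contiguous range, and replicas are simply the next nodes clockwise on the ring. The names token, build_ring, and replicas_for are hypothetical and are not part of any Cassandra API.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 32  # illustrative token space, not Cassandra's real range


def token(partition_key: str) -> int:
    """Hash a partition key to a position on the ring (MD5 is a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_ring(nodes):
    """Split the token space evenly; each node owns the range ending at its token."""
    step = RING_SIZE // len(nodes)
    return [((i + 1) * step - 1, node) for i, node in enumerate(nodes)]


def replicas_for(partition_key, ring, replication_factor):
    """Return the owning node plus the next nodes clockwise on the ring."""
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token; wrap around at the end.
    start = bisect_left(tokens, token(partition_key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


ring = build_ring(["node1", "node2", "node3", "node4"])
print(replicas_for("alice", ring, 3))
```

The same key always maps to the same replicas, so any node that receives a request can compute where the data lives without consulting a central directory.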

Each read and write operation in Cassandra specifies a ConsistencyLevel, such as ONE, TWO, or QUORUM. These tunable consistency levels allow Cassandra to balance service availability with data consistency for each request.
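As a rough illustration of the trade-off, the following Python sketch computes how many replicas must respond at each of the levels named above. It assumes the usual definition of a quorum as a majority of replicas (half the replication factor, rounded down, plus one) and the common rule of thumb that a read is guaranteed to see the latest write when the read and write replica counts together exceed the replication factor. The function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_replicas(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for the given consistency level."""
    fixed = {"ONE": 1, "TWO": 2}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        return quorum(replication_factor)
    raise ValueError(f"level not covered by this sketch: {level}")


def read_sees_latest_write(read_level, write_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > RF)."""
    r = required_replicas(read_level, replication_factor)
    w = required_replicas(write_level, replication_factor)
    return r + w > replication_factor


print(read_sees_latest_write("QUORUM", "QUORUM", 3))
print(read_sees_latest_write("ONE", "ONE", 3))
```

Lower levels such as ONE tolerate more unavailable replicas but give up the overlap guarantee, which is the availability-versus-consistency balance the paragraph above describes.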

Concepts to understand

Key

Cassandra uses several types of keys. The following examples illustrate these concepts:

CREATE TABLE mytable1 (name text PRIMARY KEY, age int, address text, person_id text);

CREATE TABLE mytable2 (name text, age int, address text, person_id text, PRIMARY KEY (name, age));

CREATE TABLE mytable3 (name text, age int, address text, person_id text, PRIMARY KEY ((name, age), person_id)) WITH CLUSTERING ORDER BY (person_id DESC);

PRIMARY KEY: The unique identifier for a row. It consists of one or more columns. In the examples, name, (name, age), and ((name, age), person_id) are the primary keys for mytable1, mytable2, and mytable3, respectively.
partition key: The first part of the PRIMARY KEY. Cassandra hashes the partition key to determine which nodes store the row. In the examples, the partition keys for mytable1, mytable2, and mytable3 are name, name, and (name, age), respectively. All rows with the same partition key are stored in the same partition.
clustering key: The columns of the PRIMARY KEY after the partition key. The clustering key defines the sorting order of data within a partition. In the examples, mytable1 has no clustering key. The clustering keys for mytable2 and mytable3 are age and person_id, respectively.
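The rule for splitting a PRIMARY KEY into its two parts can be expressed as a small Python sketch. The representation is an assumption made for illustration: a primary key is written as a list whose first element is either a single column name or a tuple of column names (the inner parentheses in CQL). The function split_primary_key is hypothetical.

```python
def split_primary_key(primary_key):
    """Return (partition key columns, clustering key columns)."""
    first, *rest = primary_key
    # A parenthesized first element is a composite partition key.
    partition = list(first) if isinstance(first, (tuple, list)) else [first]
    return partition, rest


# The three PRIMARY KEY definitions from the examples above.
print(split_primary_key(["name"]))                        # mytable1
print(split_primary_key(["name", "age"]))                 # mytable2
print(split_primary_key([("name", "age"), "person_id"]))  # mytable3
```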

To maximize cluster performance, distribute data evenly across all nodes. When you choose a partition key, consider factors such as partition size, data redundancy, and disk space usage. For optimal performance, keep each partition to no more than 100,000 rows and 100 MB of data.
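A quick back-of-the-envelope check against these limits might look like the following Python sketch. It assumes that 1 MB means 1024 * 1024 bytes and that you can estimate an average row size for your table. The estimate ignores storage overhead and compression, and the function name is hypothetical.

```python
MAX_ROWS = 100_000
MAX_BYTES = 100 * 1024 * 1024  # 100 MB, assuming 1 MB = 1024 * 1024 bytes


def partition_within_limits(row_count: int, avg_row_bytes: int) -> bool:
    """Check an estimated partition against the recommended row and size limits."""
    estimated_bytes = row_count * avg_row_bytes
    return row_count <= MAX_ROWS and estimated_bytes <= MAX_BYTES


print(partition_within_limits(50_000, 1_000))   # about 50 MB in 50,000 rows
print(partition_within_limits(200_000, 100))    # too many rows
print(partition_within_limits(100_000, 2_000))  # too many bytes
```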

Secondary index

Consider the following example:

CREATE INDEX mytable_idx_age ON mytable2 (age) ;

The preceding statement creates a native secondary index on the age column of mytable2. Cassandra stores this index in a new table. The schema of the index table might look like this:

CREATE TABLE mytable_index_age (age int, name text, address text, person_id text, PRIMARY KEY (age, name));

However, unlike with a regular table, you cannot use the partition key of the index table (age) to locate the node that stores an index entry. Cassandra stores each index entry on the same node as the row that it indexes, so every node holds only the index for its own data. This is known as a local index.

For the most efficient queries, specify the original table's partition key when you use a native secondary index. If you do not specify the partition key, the query might have to contact every node in the cluster. The following query patterns restrict the search by partition key or by token range:

SELECT * FROM mytable2 WHERE age = 11 AND name = 'name';
SELECT * FROM mytable2 WHERE age >= 11 AND name IN ('name1', 'name2');
SELECT * FROM mytable2 WHERE age = 11 AND TOKEN(name) > xxxxx AND TOKEN(name) < yyyyy;
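To see why the partition key matters for a local index, consider the following toy Python model. It is a sketch under simplifying assumptions: there is no replication, a byte sum stands in for real token placement, and the Node class and the owner and query_by_age functions are hypothetical. Each node indexes only the rows that it stores, so a query on age alone has to ask every node, while a query that also supplies name can go straight to one node.

```python
from collections import defaultdict


class Node:
    """A node that stores rows and a local index on age for those rows only."""

    def __init__(self, name):
        self.name = name
        self.rows = {}                     # (name, age) -> row
        self.age_index = defaultdict(set)  # age -> primary keys stored on this node

    def insert(self, row):
        key = (row["name"], row["age"])
        self.rows[key] = row
        self.age_index[row["age"]].add(key)

    def find_by_age(self, age):
        return [self.rows[k] for k in sorted(self.age_index.get(age, ()))]


def owner(nodes, partition_key):
    """Stand-in placement: pick a node from the bytes of the partition key."""
    return nodes[sum(partition_key.encode("utf-8")) % len(nodes)]


def query_by_age(nodes, age, name=None):
    """Return (matching rows, number of nodes contacted)."""
    targets = [owner(nodes, name)] if name is not None else nodes
    hits = [r for n in targets for r in n.find_by_age(age)
            if name is None or r["name"] == name]
    return hits, len(targets)


nodes = [Node("node1"), Node("node2"), Node("node3")]
for person, age in [("alice", 11), ("bob", 11), ("carol", 12)]:
    owner(nodes, person).insert({"name": person, "age": age})

print(query_by_age(nodes, 11))           # contacts every node
print(query_by_age(nodes, 11, "alice"))  # contacts only the node that owns "alice"
```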

Data modeling recommendations and principles

Before you use Cassandra, you must design your data model. Start with your application's queries. These queries will determine how you organize your data, design your primary keys, and ultimately how data is stored and retrieved.

No JOINs: Cassandra does not support JOIN operations. To join data, perform the join in your client application or create a new table in Cassandra that pre-joins the data.
No referential integrity: Cassandra does not support referential integrity across tables. You cannot use foreign keys in one table to reference data in another table.
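Denormalization: Denormalization is the opposite of normalization. Because Cassandra does not support JOINs, you typically store duplicate copies of data in the tables that serve each query instead of splitting the data across normalized tables.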
Query-first design: Unlike with a relational database, you should design your tables around your queries, not around the entities in your data and the relationships between them.
Design for optimal storage: In a relational database, the underlying data storage is transparent. In Cassandra, you must design your data model based on how data is stored on disk. Aim to design your tables so that queries read from as few partitions as possible.
Sorting is a design decision: The sort order for query results is determined when you create the table.
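The following Python sketch illustrates several of these principles together, using plain in-memory data rather than a real cluster. The users and orders data and both function names are hypothetical. client_side_join performs the join in the application. build_orders_by_user produces a denormalized, pre-joined structure that is grouped by the value a query would look up (the partition key) and sorted once at write time, so a read touches a single group and needs no sorting.

```python
from collections import defaultdict

# Hypothetical source data, as two normalized "tables".
users = [{"user_id": "u1", "name": "alice"}, {"user_id": "u2", "name": "bob"}]
orders = [
    {"order_id": "o1", "user_id": "u1", "item": "book"},
    {"order_id": "o2", "user_id": "u1", "item": "pen"},
    {"order_id": "o3", "user_id": "u2", "item": "lamp"},
]


def client_side_join(users, orders):
    """Join orders with user names in the application."""
    by_id = {u["user_id"]: u for u in users}
    return [{**o, "name": by_id[o["user_id"]]["name"]} for o in orders]


def build_orders_by_user(users, orders):
    """Pre-join the data, grouped by user_id and sorted by order_id descending."""
    table = defaultdict(list)
    for row in client_side_join(users, orders):
        table[row["user_id"]].append(row)
    for rows in table.values():
        # The sort order is fixed here, when the data is written.
        rows.sort(key=lambda r: r["order_id"], reverse=True)
    return dict(table)


orders_by_user = build_orders_by_user(users, orders)
print([r["order_id"] for r in orders_by_user["u1"]])
```

The cost of the pre-joined form is duplicated data: the user name is copied into every order row and must be kept up to date by the application.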

Note: Cassandra materialized views and SSTable Attached Secondary Indexes (SASI) are experimental features. Use them with caution.