Overview
Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle massive amounts of data across many commodity servers with no single point of failure. Originally developed at Facebook and open-sourced in 2008, Cassandra combines the distributed architecture of Amazon's Dynamo with the data model of Google's Bigtable, making it one of the most powerful databases for applications that require high availability and linear scalability.
Cassandra excels at write-heavy workloads and can handle petabytes of data across thousands of nodes while keeping response times in the low milliseconds. Its masterless "ring" architecture means every node plays the same role, which removes single points of failure and lets you scale horizontally by adding nodes. Data is automatically distributed and replicated across the cluster, with a configurable replication factor that keeps data available even when nodes or entire datacenters fail.
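To make the ring idea concrete, here is a minimal sketch of how a key could be mapped to its replicas. It is an illustration under simplifying assumptions, not Cassandra's implementation: the `Ring` class, `token_for` function, and node names are hypothetical, each node holds a single token, and MD5 stands in for the partitioner's hash (a real cluster typically uses Murmur3 tokens and many virtual nodes per host).

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a position on the ring (MD5 is used here purely for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy token ring: one token per node, replicas chosen by walking clockwise."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed node count")
        self.replication_factor = replication_factor
        # Sort nodes by token so the ring can be searched with bisect.
        self._ring = sorted((token_for(name), name) for name in nodes)
        self._tokens = [token for token, _ in self._ring]

    def replicas(self, partition_key):
        """Return the owning node plus the next RF-1 nodes clockwise."""
        # The owner is the first node whose token is >= the key's token,
        # wrapping around to the start of the ring if necessary.
        start = bisect.bisect_left(self._tokens, token_for(partition_key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("user:42"))
```

Because placement depends only on the key's token and the sorted node tokens, any node can work out where a key lives without consulting a coordinator, which is what makes a masterless design workable.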
The database uses a wide-column (column-family) data model that's optimized for fast writes and reads. Unlike traditional relational databases, Cassandra does not support joins; instead you denormalize data and design your schema around query patterns, so that each query can ideally be served from a single partition. CQL (Cassandra Query Language) provides a familiar SQL-like syntax for interacting with data, making it accessible to developers familiar with relational databases while leveraging Cassandra's distributed capabilities.
Cassandra provides tunable consistency, allowing you to trade consistency against availability and latency on a per-query basis (the trade-off described by the CAP theorem). You can choose anything from strong consistency for critical operations to eventual consistency for high-throughput scenarios. This flexibility makes Cassandra suitable for a wide range of use cases, from real-time analytics to messaging platforms to IoT data storage.
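The trade-off can be reasoned about with simple arithmetic: a read is guaranteed to overlap the latest acknowledged write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (RF). The sketch below applies that rule to a few of Cassandra's consistency levels. The function names are hypothetical, and the model ignores datacenter-aware levels such as LOCAL_QUORUM.

```python
FIXED_LEVELS = {"ONE": 1, "TWO": 2, "THREE": 3}


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge a request at the given level."""
    level = level.upper()
    if level == "QUORUM":
        required = replication_factor // 2 + 1
    elif level == "ALL":
        required = replication_factor
    elif level in FIXED_LEVELS:
        required = FIXED_LEVELS[level]
    else:
        raise ValueError(f"unsupported consistency level: {level}")
    if required > replication_factor:
        raise ValueError(f"{level} needs more replicas than RF={replication_factor}")
    return required


def reads_see_latest_write(read_level: str, write_level: str, replication_factor: int) -> bool:
    """True when read and write replica sets must overlap (R + W > RF)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


def tolerated_failures(level: str, replication_factor: int) -> int:
    """Replicas that can be down while requests at this level still succeed."""
    return replication_factor - replicas_required(level, replication_factor)


for read_level, write_level in [("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ONE", "ALL")]:
    strong = reads_see_latest_write(read_level, write_level, 3)
    print(f"read={read_level} write={write_level} RF=3 strong={strong}")
```

With RF=3, QUORUM reads and writes overlap while still tolerating one replica being down; ONE/ONE tolerates two failures but gives only eventual consistency.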
With built-in compression, efficient storage of time-series data, support for lightweight transactions, and features such as secondary indexes and materialized views (the latter marked experimental in recent releases), Cassandra has been battle-tested at companies like Netflix, Apple, Instagram, and Uber on their most demanding workloads. Whether you need to store user activity streams, time-series sensor data, messaging histories, or product catalogs, Cassandra provides the scale and reliability required for mission-critical applications.
Key Features
Linear Scalability
Scale horizontally by adding nodes with no downtime, achieving linear performance increases
No Single Point of Failure
Masterless architecture where every node is identical, ensuring continuous availability
Multi-Datacenter Replication
Built-in replication across multiple datacenters for disaster recovery and geo-distribution
Tunable Consistency
Choose consistency level per query, balancing consistency, availability, and performance
High Write Throughput
Optimized for write-heavy workloads with millisecond response times at massive scale
CQL Query Language
SQL-like query language that's familiar to developers while leveraging distributed capabilities
Use Cases
• **Time-Series Data**: IoT sensor data, application metrics, log aggregation, financial tick data
• **Real-Time Analytics**: User behavior tracking, clickstream analysis, recommendation engines
• **Messaging Platforms**: Chat applications, notification systems, activity feeds, message queues
• **E-Commerce**: Product catalogs, user sessions, shopping carts, order history
• **Social Media**: User profiles, posts, comments, likes, follower graphs
• **Gaming**: Player profiles, game state, leaderboards, in-game transactions
• **Financial Services**: Transaction records, fraud detection, trading platforms
Installation Guide
**Installation on Ubuntu:**
```bash
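# Install Java first (Cassandra 4.1 supports Java 8 and 11)
sudo apt update
sudo apt install openjdk-11-jdk -y
# Add the Cassandra 4.1 repository and its signing key
# Note: apt-key is deprecated on recent Ubuntu releases; a signed-by keyring is the modern alternative
echo "deb https://debian.cassandra.apache.org 41x main" | sudo tee /etc/apt/sources.list.d/cassandra.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
# Install Cassandra
sudo apt update
sudo apt install cassandra -y
# Start Cassandra now and on every boot
sudo systemctl start cassandra
sudo systemctl enable cassandra
# Check cluster status (a healthy node reports UN: Up/Normal)
nodetool status
# Show details for the local node
nodetool info
# Open a CQL shell to confirm the node accepts client connections
cqlsh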
```
**Quick Start:**
```sql
-- Start the shell by running `cqlsh`; everything below is CQL entered at its prompt
-- Create a keyspace (use replication_factor 1 on a single-node test install)
CREATE KEYSPACE mykeyspace WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};
-- Switch to the keyspace
USE mykeyspace;
-- Create a table keyed by user ID
CREATE TABLE users (
  user_id UUID PRIMARY KEY,
  username TEXT,
  email TEXT,
  created_at TIMESTAMP
);
-- Insert a row with a generated UUID and the current time
INSERT INTO users (user_id, username, email, created_at)
VALUES (uuid(), 'john_doe', 'john@example.com', toTimestamp(now()));
```
Configuration Tips
**Essential Configuration (cassandra.yaml):**
```yaml
# Cluster name
cluster_name: 'MyCluster'
# Data directories
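data_file_directories:
  - /var/lib/cassandra/data
# Commit log directory
commitlog_directory: /var/lib/cassandra/commitlog
# Seed nodes (contact points for gossip when a node joins)
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.1.1,192.168.1.2,192.168.1.3"
# Address other nodes use to reach this node
listen_address: 192.168.1.10
# Accept client connections on all interfaces
rpc_address: 0.0.0.0
# Required when rpc_address is 0.0.0.0: the address advertised to clients
broadcast_rpc_address: 192.168.1.10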
# Ports
native_transport_port: 9042
storage_port: 7000
```
**Performance Tuning:**
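- Set the JVM heap to 8-14 GB; larger heaps tend to lengthen garbage-collection pauses
- Use the G1 garbage collector on Java 11+
- Enable compression for both inter-node traffic and on-disk storage
- Use SSDs for data and commit log directories
- Place the commit log on a separate disk from the data directories
- Tune read and write consistency levels to match each workload's latency and durability needs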
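The heap guidance above is a manual override. When no heap size is set, the startup script derives one from system memory; the comments in `cassandra-env.sh` describe the rule as max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8 GB)). The sketch below reproduces that arithmetic so you can see what a given machine would get by default. The function name is hypothetical, and the rule is worth checking against the script shipped with your version.

```python
def default_max_heap_mb(system_memory_mb: int) -> int:
    """Default heap size per the rule quoted from cassandra-env.sh (assumed, verify locally)."""
    half_capped = min(system_memory_mb // 2, 1024)      # half of RAM, capped at 1 GB
    quarter_capped = min(system_memory_mb // 4, 8192)   # quarter of RAM, capped at 8 GB
    return max(half_capped, quarter_capped)


for ram_gb in (2, 16, 64):
    heap = default_max_heap_mb(ram_gb * 1024)
    print(f"{ram_gb} GB RAM -> {heap} MB default heap")
```

On a 64 GB host this default stops at 8 GB, so reaching the upper end of the 8-14 GB range means setting the heap explicitly in the JVM options.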
**Monitoring:**
```bash
# Cluster status
nodetool status
# Node and per-table statistics
nodetool info
# tablestats replaces the deprecated cfstats command
nodetool tablestats
# Compaction statistics
nodetool compactionstats
# Thread pool stats
nodetool tpstats
```