Install
Ubuntu
# install
sudo apt install clickhouse-server clickhouse-client
sudo service clickhouse-server start
sudo service clickhouse-server status
# connect
# 1. Download the clickhouse binary
curl https://clickhouse.com/ | sh
# 2. Start the client
# Use clickhouse-client to connect to your ClickHouse service. Open a new terminal, change directories to where your clickhouse binary is saved, and run the following command (pointing --host at the server installed above):
./clickhouse client --host 192.168.18.247
# You should see a smiling face as the client connects to your service:
my-host :)
What Is ClickHouse?
ClickHouse® is a high-performance, column-oriented SQL database management system (DBMS) for online analytical processing (OLAP).
Operation
Creating Tables in ClickHouse
Like most databases, ClickHouse logically groups tables into databases. Use the CREATE DATABASE command to create a new database in ClickHouse:
CREATE DATABASE IF NOT EXISTS helloworld
Similarly, use CREATE TABLE to define a new table. (If you do not specify the database name, the table will be created in the default database.) The following creates a table named my_first_table in the helloworld database:
CREATE TABLE helloworld.my_first_table
(
user_id UInt32,
message String,
timestamp DateTime,
metric Float32
)
ENGINE = MergeTree()
ORDER BY (user_id, timestamp)
PRIMARY KEY (user_id, timestamp);
In the example above, my_first_table is a MergeTree table with four columns:
- user_id: a 32-bit unsigned integer
- message: a String data type, which replaces types like VARCHAR, BLOB, CLOB and others from other database systems
- timestamp: a DateTime value, which represents an instant in time
- metric: a 32-bit floating-point number
A Brief Intro to Primary Keys
Before you go any further, it is important to understand how primary keys work in ClickHouse (the implementation of primary keys might seem unexpected!):
- primary keys in ClickHouse are *not unique* for each row in a table
The primary key of a ClickHouse table determines how the data is sorted when written to disk. Every 8,192 rows or 10MB of data (referred to as the index granularity) creates an entry in the primary key index file. This granularity concept creates a sparse index that can easily fit in memory, and the granules represent a stripe of the smallest amount of column data that gets processed during SELECT queries.
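The granularity is configurable per table via a table-level setting; a minimal sketch (the table name t is hypothetical):
CREATE TABLE helloworld.t
(
    id UInt32,
    value String
)
ENGINE = MergeTree()
ORDER BY id
SETTINGS index_granularity = 8192; -- the default: one sparse index entry per 8,192 rows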
The primary key can be defined using the PRIMARY KEY parameter. If you define a table without a PRIMARY KEY specified, then the key becomes the tuple specified in the ORDER BY clause. If you specify both a PRIMARY KEY and an ORDER BY, the primary key must be a prefix of the sort order.
The primary key is also the sorting key, which in this example is the tuple (user_id, timestamp). Therefore, the data stored in each column file will be sorted by user_id, then timestamp.
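A minimal sketch of the prefix rule (the table name events_wide is hypothetical): the primary key is a prefix of a longer sorting key, so the sparse index stays small while the data remains fully sorted:
CREATE TABLE helloworld.events_wide
(
    user_id UInt32,
    timestamp DateTime,
    message String
)
ENGINE = MergeTree()
ORDER BY (user_id, timestamp, message) -- full sort order on disk
PRIMARY KEY (user_id, timestamp);      -- must be a prefix of the sorting key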
Inserting Data
Basic Example
You can use the familiar INSERT INTO TABLE command with ClickHouse:
INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
(101, 'Hello, ClickHouse!', now(), -1.0 ),
(102, 'Insert a lot of rows per batch', yesterday(), 1.41421 ),
(102, 'Sort your data based on your commonly-used queries', today(), 2.718 ),
(101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159 )
Let’s verify it worked - you should see the four rows of data that were inserted:
SELECT * FROM helloworld.my_first_table
SELECT Queries
ClickHouse is a SQL database, and you query your data by writing the same type of SELECT queries you are already familiar with. For example:
SELECT *
FROM helloworld.my_first_table
ORDER BY timestamp
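The usual SQL constructs work as well; for instance, an aggregation over the sample table above (exact results depend on your data):
SELECT user_id,
       count() AS messages,
       avg(metric) AS avg_metric
FROM helloworld.my_first_table
GROUP BY user_id
ORDER BY user_id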
Updating and Deleting ClickHouse Data
Use the ALTER TABLE...UPDATE command to update rows in a table:
ALTER TABLE [<database>.]<table> UPDATE <column> = <expression> WHERE <filter_expr>
<expression> is the new value for the column where the <filter_expr> is satisfied. The <expression> must be the same datatype as the column or be convertible to the same datatype using the CAST operator. The <filter_expr> should return a UInt8 (zero or non-zero) value for each row of the data. Multiple UPDATE <column> statements can be combined in a single ALTER TABLE command, separated by commas.
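As a concrete sketch against the sample table (the values are illustrative; note that these updates run as asynchronous mutations rather than in-place edits):
ALTER TABLE helloworld.my_first_table
    UPDATE message = 'Updated', metric = 0.0
    WHERE user_id = 101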
Use the ALTER TABLE...DELETE command to delete rows:
ALTER TABLE [<database>.]<table> DELETE WHERE <filter_expr>
The <filter_expr> should return a UInt8 value for each row of data.
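A matching sketch (again with illustrative values):
ALTER TABLE helloworld.my_first_table DELETE WHERE user_id = 102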
Data Types
LowCardinality
LowCardinality is a superstructure that changes the data storage method and the rules of data processing. ClickHouse applies dictionary coding to LowCardinality columns. Operating with dictionary-encoded data significantly increases the performance of SELECT queries for many applications.
The efficiency of the LowCardinality data type depends on data diversity. If a dictionary contains fewer than 10,000 distinct values, then ClickHouse mostly shows higher efficiency of data reading and storing. If a dictionary contains more than 100,000 distinct values, then ClickHouse can perform worse in comparison with using ordinary data types.
Consider using LowCardinality instead of Enum when working with strings. LowCardinality provides more flexibility in use and often reveals the same or higher efficiency.
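A minimal sketch (the table name page_visits is hypothetical) showing where LowCardinality pays off:
CREATE TABLE helloworld.page_visits
(
    visit_time DateTime,
    country LowCardinality(String), -- few distinct values: good dictionary-coding candidate
    url String                      -- high cardinality: leave as plain String
)
ENGINE = MergeTree()
ORDER BY visit_time;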
Considerations
Best Practices
Index Design
Choosing a primary (ordering) key
Once you have extracted your desired columns, you can begin optimizing your ordering/primary key.
Some simple rules can be applied to help choose an ordering key. The following can sometimes be in conflict, so consider these in order. Users can identify a number of keys from this process, with 4-5 typically sufficient:
- Select columns that align with your common filters and access patterns. If users typically start Observability investigations by filtering by a specific column, e.g. pod name, this column will be used frequently in WHERE clauses. Prioritize including these in your key over those which are used less frequently.
- Prefer columns which help exclude a large percentage of the total rows when filtered, thus reducing the amount of data which needs to be read. Service names and status codes are often good candidates - in the latter case only if users filter by values which exclude most rows, e.g. filtering by 200s will in most systems match most rows, in comparison to 500 errors which will correspond to a small subset.
- Prefer columns that are likely to be highly correlated with other columns in the table. This will help ensure these values are also stored contiguously, improving compression.
- GROUP BY and ORDER BY operations for columns in the ordering key can be made more memory efficient.
Once you have identified the subset of columns for the ordering key, they must be declared in a specific order. This order can significantly influence both the efficiency of the filtering on secondary key columns in queries and the compression ratio for the table's data files. In general, it is best to order the keys in ascending order of cardinality. This should be balanced against the fact that filtering on columns that appear later in the ordering key will be less efficient than filtering on those that appear earlier in the tuple. Balance these behaviors and consider your access patterns. Most importantly, test variants. For further understanding of ordering keys and how to optimize them, we recommend the sparse primary indexes guide linked in the refs below.
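Putting these rules together, a sketch (hypothetical logs table; the column choices are illustrative) with the ordering key declared in ascending order of cardinality:
CREATE TABLE helloworld.logs
(
    timestamp DateTime,
    service LowCardinality(String), -- lowest cardinality, filtered often: first
    status_code UInt16,             -- low cardinality; excludes many rows for e.g. 500s
    pod_name String,                -- higher cardinality: later in the key
    body String
)
ENGINE = MergeTree()
ORDER BY (service, status_code, pod_name, timestamp);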
ref
- https://clickhouse.com/docs/en/observability/schema-design#materialized-views
- https://clickhouse.com/docs/en/optimize/sparse-primary-indexes
- https://clickhouse.com/blog/clickhouse-faster-queries-with-projections-and-primary-indexes
- https://clickhouse.com/docs/en/optimize/sparse-primary-indexes#ordering-key-columns-efficiently
Asynchronous Inserts (async_insert)
https://clickhouse.com/docs/en/cloud/bestpractices/asynchronous-inserts
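A minimal sketch of enabling asynchronous inserts per query, using the async_insert and wait_for_async_insert settings described on the linked page (the inserted row is illustrative):
INSERT INTO helloworld.my_first_table
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (103, 'Buffered server-side and flushed in batches', now(), 0.0)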
MergeTree Engine
ref
InfluxDB vs ClickHouse
ref https://swsmile.info/post/InfluxDB-learning/
ClickHouse Clients and Drivers
Reference
- https://en.wikipedia.org/wiki/ClickHouse
- https://github.com/ClickHouse/ClickHouse
- https://clickhouse.com/docs/en/concepts/why-clickhouse-is-so-fast
- https://clickhouse.com/docs/en/sql-reference