【ClickHouse】Learning

Posted by 西维蜀黍 on 2024-09-23, Last Modified on 2024-09-27

Install

Ubuntu

# install
sudo apt install clickhouse-server clickhouse-client

sudo service clickhouse-server start
sudo service clickhouse-server status

# connect
# 1. Download the binary
curl https://clickhouse.com/ | sh

# 2. Start the client
# Use the clickhouse client to connect to your ClickHouse service. Open a new terminal, change directories to where your clickhouse binary is saved, and run the following command:
./clickhouse client --host 192.168.18.247
# You should see a smiling-face prompt once the client connects to your service:
my-host :)

What Is ClickHouse?

ClickHouse® is a high-performance, column-oriented SQL database management system (DBMS) for online analytical processing (OLAP).

Operation

Creating Tables in ClickHouse

Like most databases, ClickHouse logically groups tables into databases. Use the CREATE DATABASE command to create a new database in ClickHouse:

CREATE DATABASE IF NOT EXISTS helloworld

Similarly, use CREATE TABLE to define a new table. (If you do not specify the database name, the table will be created in the default database.) The following creates a table named my_first_table in the helloworld database:

CREATE TABLE helloworld.my_first_table
(
    user_id UInt32,
    message String,
    timestamp DateTime,
    metric Float32
)
ENGINE = MergeTree()
ORDER BY (user_id, timestamp)
PRIMARY KEY (user_id, timestamp);

In the example above, my_first_table is a MergeTree table with four columns:

  • user_id: a 32-bit unsigned integer
  • message: a String data type, which replaces types like VARCHAR, BLOB, CLOB and others from other database systems
  • timestamp: a DateTime value, which represents an instant in time
  • metric: a 32-bit floating point number

A Brief Intro to Primary Keys

Before you go any further, it is important to understand how primary keys work in ClickHouse (the implementation of primary keys might seem unexpected!):

  • primary keys in ClickHouse are *not unique* for each row in a table

The primary key of a ClickHouse table determines how the data is sorted when written to disk. Every 8,192 rows or 10MB of data (referred to as the index granularity) creates an entry in the primary key index file. This granularity concept creates a sparse index that can easily fit in memory, and the granules represent a stripe of the smallest amount of column data that gets processed during SELECT queries.

The primary key can be defined using the PRIMARY KEY parameter. If you define a table without a PRIMARY KEY specified, then the key becomes the tuple specified in the ORDER BY clause. If you specify both a PRIMARY KEY and an ORDER BY, the primary key must be a prefix of the sorting key.

In this example the primary key is also the sorting key, the tuple (user_id, timestamp). Therefore, the data stored in each column file will be sorted by user_id first, then by timestamp.
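
To see the sparse primary index at work, you can ask ClickHouse to explain how a filtered query uses it. A minimal sketch against my_first_table (the exact output depends on your data and ClickHouse version):

EXPLAIN indexes = 1
SELECT *
FROM helloworld.my_first_table
WHERE user_id = 101

The output includes a PrimaryKey section showing how many parts and granules were selected out of the totals - the fewer granules selected, the less data the query has to read.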

Inserting Data

Basic Example

You can use the familiar INSERT INTO TABLE command with ClickHouse:

INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!',                                 now(),       -1.0    ),
    (102, 'Insert a lot of rows per batch',                     yesterday(), 1.41421 ),
    (102, 'Sort your data based on your commonly-used queries', today(),     2.718   ),
    (101, 'Granules are the smallest chunks of data read',      now() + 5,   3.14159 )

Let’s verify it worked - you should see the four rows of data that were inserted:

SELECT * FROM helloworld.my_first_table

SELECT Queries

ClickHouse is a SQL database, and you query your data by writing the same type of SELECT queries you are already familiar with. For example:

SELECT *
FROM helloworld.my_first_table
ORDER BY timestamp
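
Because ClickHouse is built for OLAP, aggregations are where it shines. For example, to count messages and average the metric per user:

SELECT
    user_id,
    count() AS messages,
    avg(metric) AS avg_metric
FROM helloworld.my_first_table
GROUP BY user_id
ORDER BY user_id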

Updating and Deleting ClickHouse Data

Use the ALTER TABLE...UPDATE command to update rows in a table:

ALTER TABLE [<database>.]<table> UPDATE <column> = <expression> WHERE <filter_expr>

<expression> is the new value for the column where the <filter_expr> is satisfied. The <expression> must be the same datatype as the column, or be convertible to the same datatype using the CAST operator. The <filter_expr> should return a UInt8 (zero or non-zero) value for each row of the data. Multiple UPDATE <column> statements can be combined in a single ALTER TABLE command, separated by commas.
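
For example, the following rewrites message and resets metric for one user in a single command. Note that ALTER TABLE ... UPDATE runs as an asynchronous mutation that rewrites the affected data parts, so it suits occasional bulk changes rather than frequent single-row updates:

ALTER TABLE helloworld.my_first_table
    UPDATE message = 'Updated message', metric = 0.0
    WHERE user_id = 101;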

Use the ALTER TABLE command to delete rows:

ALTER TABLE [<database>.]<table> DELETE WHERE <filter_expr>

The <filter_expr> should return a UInt8 value for each row of data.
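
For example, to delete all rows for one user (like updates, deletes run as asynchronous mutations):

ALTER TABLE helloworld.my_first_table DELETE WHERE user_id = 102;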

Data Types

LowCardinality

LowCardinality is a superstructure that changes the data storage method and the rules of data processing. ClickHouse applies dictionary coding to LowCardinality columns. Operating with dictionary-encoded data significantly increases the performance of SELECT queries for many applications.

The efficiency of the LowCardinality data type depends on data diversity. If a dictionary contains fewer than 10,000 distinct values, then ClickHouse mostly shows higher efficiency of data reading and storing. If a dictionary contains more than 100,000 distinct values, then ClickHouse can perform worse in comparison with using ordinary data types.

Consider using LowCardinality instead of Enum when working with strings. LowCardinality provides more flexibility in use and often delivers the same or higher efficiency.
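
A minimal sketch of declaring a LowCardinality column (the events table and its columns here are illustrative, not from the examples above):

CREATE TABLE helloworld.events
(
    timestamp DateTime,
    status LowCardinality(String),  -- dictionary-encoded; suits a small set of distinct values
    message String
)
ENGINE = MergeTree()
ORDER BY (status, timestamp);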

Consideration

Best Practices

Index Design

Choosing a primary (ordering) key

Once you have identified your desired columns, you can begin optimizing your ordering/primary key.

Some simple rules can be applied to help choose an ordering key. The following can sometimes be in conflict, so consider these in order. Users can identify a number of keys from this process, with 4-5 typically sufficient:

  1. Select columns that align with your common filters and access patterns. If users typically start Observability investigations by filtering on a specific column, e.g. pod name, that column will be used frequently in WHERE clauses. Prioritize including these in your key over columns which are used less frequently.
  2. Prefer columns which help exclude a large percentage of the total rows when filtered, thus reducing the amount of data which needs to be read. Service names and status codes are often good candidates - status codes only if users filter by values which exclude most rows, e.g. filtering on 200s will match most rows in most systems, whereas 500 errors correspond to a small subset.
  3. Prefer columns that are likely to be highly correlated with other columns in the table. This will help ensure these values are also stored contiguously, improving compression.
  4. GROUP BY and ORDER BY operations for columns in the ordering key can be made more memory efficient.

On identifying the subset of columns for the ordering key, they must be declared in a specific order. This order can significantly influence both the efficiency of the filtering on secondary key columns in queries and the compression ratio for the table’s data files. In general, it is best to order the keys in ascending order of cardinality. This should be balanced against the fact that filtering on columns that appear later in the ordering key will be less efficient than filtering on those that appear earlier in the tuple. Balance these behaviors and consider your access patterns. Most importantly, test variants. For further understanding of ordering keys and how to optimize them, we recommend this article.
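
As an illustration of the ascending-cardinality rule, a hypothetical logs table might put a low-cardinality service name first, then the status code, then the timestamp (all names here are illustrative):

CREATE TABLE helloworld.logs
(
    timestamp DateTime,
    service LowCardinality(String),  -- lowest cardinality, most commonly filtered
    status_code UInt16,
    pod_name String,
    body String
)
ENGINE = MergeTree()
ORDER BY (service, status_code, timestamp);

Filtering on service is the most efficient here; filtering on timestamp alone is the least efficient, since it appears last in the tuple.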

ref

Asynchronous Inserts (async_insert)

https://clickhouse.com/docs/en/cloud/bestpractices/asynchronous-inserts
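
With asynchronous inserts, ClickHouse buffers many small inserts server-side and flushes them to disk in batches. A minimal sketch of enabling it per statement, using the table from the earlier examples (see the link above for the trade-offs of wait_for_async_insert):

INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric)
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (103, 'Buffered on the server', now(), 0.5)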

MergeTree Engine

ref

InfluxDB vs ClickHouse

ref https://swsmile.info/post/InfluxDB-learning/

ClickHouse Clients and Drivers

Reference