LogoDuyệtSr. Data Engineer
HomeAboutPhotosInsightsCV

Footer

Logo

Resources

  • Rust Tiếng Việt
  • /archives
  • /series
  • /tags
  • Status

me@duyet.net

  • About
  • LinkedIn
  • Resume
  • Projects

© 2026 duyet.net | Sr. Data Engineer

Monitoring ClickHouse on Kubernetes

Now that you have your first ClickHouse instance on Kubernetes and are starting to use it, monitoring and observing what happens on it is an important task to achieve stability and optimal performance.

In this guide, you'll learn about:

  • Built-in advanced observability dashboard
  • Setting up Prometheus + Grafana for metrics visualization
  • Leveraging ClickHouse system tables for deep insights
  • Using ClickHouse Monitoring UI for simplified management

There are several effective approaches to monitoring ClickHouse:

  • Built-in dashboard
  • Export metrics to Prometheus and visualize them with Grafana
  • ClickHouse Monitoring UI that relies on powerful system tables

1. Built-in advanced observability dashboard

ClickHouse includes an advanced observability dashboard feature accessible at $HOST:$PORT/dashboard (login required).

It displays metrics such as Queries/second, CPU usage, Queries running, Merges running, Selected bytes/second, IO wait, CPU, Read/Write, Inserted rows/second, Total MergeTree parts, and Max parts for partition, etc.

2. Monitoring with Prometheus + Grafana

You can configure ClickHouse to export metrics to Prometheus. ClickHouse Operator do have a section for this:

  1. Step 1: Setup Prometheus pool data from ClickHouse into Prometheus
  2. Step 2: Setup Grafana

They also have a dashboard for ClickHouse Operator Dashboard so you can just need to import it:

  • Altinity_ClickHouse_Operator_dashboard.json
  • ClickHouse_Queries_dashboard.json
  • ClickHouseKeeper_dashboard.json

References:

  • ClickHouse Grafana plugin 4.0 - Leveling up SQL Observability
  • A Story of Open-source GitHub Activity using ClickHouse + Grafana
  • Video: Visualizing ClickHouse Data with Grafana
  • Visualizing Data with ClickHouse - Part 1 - Grafana

3. ClickHouse system tables

You should read these blog post by ClickHouse about rely on the system tables to get more insights about running queries and their performance. These contains about some topic like for example: most expensive SELECT queries, average query duration and number of requests, number of SQL queries by client or user, etc.

  • Essential Monitoring Queries - part 1 - INSERT Queries
  • Essential Monitoring Queries - part 2 - SELECT Queries

4. ClickHouse Monitoring UI Dashboard

This is my simple monitoring dashboard for ClickHouse, built with Next.js for monitoring all my clusters. It relies on system tables above that provide rich information. A live demo is available at: https://clickhouse-monitoring.vercel.app/

You can install it into Kubernetes via the latest helm chart here: https://github.com/duyet/charts/tree/master/clickhouse-monitoring

helm repo add duyet https://duyet.github.io/charts

cat <<EOF >> values.yaml
env:
  - name: CLICKHOUSE_HOST
    value: http://clickhouse-single.clickhouse.svc:8123
  - name: CLICKHOUSE_USER
    value: monitoring
  - name: CLICKHOUSE_PASSWORD
    value: ''
EOF

helm install -f values.yaml clickhouse-monitoring-release duyet/clickhouse-monitoring

Summary and Best Practices

Effective monitoring is crucial for maintaining a healthy ClickHouse cluster. Let's recap the key monitoring strategies and best practices:

Monitoring Approaches Comparison

Approach Pros Cons Best For
Built-in Dashboard ✅ No setup required
✅ Real-time metrics
✅ Lightweight
❌ Basic features
❌ No historical data
❌ Single instance view
Quick health checks, Development
Prometheus + Grafana ✅ Industry standard
✅ Extensive customization
✅ Alerting support
✅ Long-term storage
❌ Complex setup
❌ Resource intensive
❌ Requires expertise
Production environments, Large clusters
System Tables ✅ Most detailed insights
✅ SQL-based queries
✅ No external deps
❌ Manual querying
❌ No visualization
❌ Requires SQL knowledge
Debugging, Performance tuning
Monitoring UI ✅ Easy to use
✅ Multiple clusters
✅ Quick deployment
❌ Limited customization
❌ Community tool
Small to medium clusters, Quick setup

Key Metrics to Monitor

Performance Metrics:

  • 📊 Queries per second (QPS) - Request throughput
  • ⏱️ Query duration - P50, P90, P99 latencies
  • 💾 Memory usage - Per query and cluster-wide
  • 🔄 CPU utilization - Query processing load
  • 📝 Merge activity - Background operations health

Operational Metrics:

  • 🗂️ MergeTree parts - Data organization efficiency
  • 📦 Disk usage - Storage capacity tracking
  • 🔌 Connections - Active client connections
  • ⚠️ Failed queries - Error rate monitoring
  • 🔄 Replication lag - For replicated setups

Critical System Tables:

  • system.query_log - Query history and performance
  • system.processes - Currently running queries
  • system.merges - Merge operations status
  • system.replication_queue - Replication health
  • system.parts - Table parts information
  • system.metrics - Real-time metrics
  • system.asynchronous_metrics - Periodic metrics

Essential Monitoring Queries

Find slow queries:

SELECT
    query_duration_ms,
    query,
    user,
    query_start_time
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_duration_ms > 10000
ORDER BY query_duration_ms DESC
LIMIT 10

Check current running queries:

SELECT
    elapsed,
    query,
    user,
    memory_usage
FROM system.processes
ORDER BY elapsed DESC

Monitor disk usage:

SELECT
    database,
    table,
    formatReadableSize(sum(bytes)) AS size
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes) DESC
LIMIT 20

Production Recommendations

1. Implement Multi-Layer Monitoring:

  • Use Prometheus + Grafana for real-time metrics and alerting
  • Keep system tables accessible for deep debugging
  • Deploy Monitoring UI for quick cluster overview

2. Set Up Alerting:

  • High query error rate (> 1%)
  • Memory usage > 80%
  • Disk usage > 85%
  • Too many table parts (> 300 per partition)
  • Replication lag > 5 minutes
  • Query duration spikes

3. Regular Health Checks:

  • Daily: Review slow query log
  • Weekly: Analyze query patterns and optimize
  • Monthly: Capacity planning review
  • Quarterly: Performance benchmarking

4. Documentation:

  • Maintain runbooks for common issues
  • Document baseline metrics for your workload
  • Keep track of configuration changes
  • Share monitoring dashboards with team

5. Continuous Improvement:

  • Regularly review and update alert thresholds
  • Add custom metrics for business-specific KPIs
  • Optimize queries based on monitoring insights
  • Scale infrastructure proactively based on trends

Next Steps

Now that you have monitoring in place:

  1. Explore ReplicatedReplacingMergeTree for high availability
  2. Learn about query optimization techniques
  3. Implement backup and disaster recovery
  4. Set up proper access control and security

Remember: Good monitoring is not just about collecting metrics—it's about understanding your system's behavior and making informed decisions to maintain performance and reliability.

Resources

  • 📖 Official ClickHouse Monitoring Documentation
  • 🎥 Visualizing Data with ClickHouse + Grafana
  • 🔧 ClickHouse Monitoring UI
  • 📊 Grafana Dashboards
  • 💬 ClickHouse Community Slack
Mar 27, 2024·2 years ago
|Data|
DataClickhouseClickhouse On Kubernetes
|Edit|

Series: ClickHouse on Kubernetes

1
ClickHouse on Kubernetes

Complete guide to deploying ClickHouse on Kubernetes using the Altinity ClickHouse Operator. Learn how to set up your first single-node cluster, configure persistent storage, manage users, and customize ClickHouse versions. Includes practical examples and best practices from production experience managing clusters with trillions of rows.

2
ClickHouse SELECT Advances

Dynamic column selection (also known as a `COLUMNS` expression) allows you to match some columns in a result with a re2 regular expression.

3
Monitoring ClickHouse on Kubernetes

Complete guide to monitoring ClickHouse on Kubernetes. Learn about built-in dashboards, Prometheus + Grafana setup, powerful system tables for monitoring queries, and the ClickHouse Monitoring UI dashboard. Includes practical examples, essential monitoring queries, and best practices for production observability.

4
MergeTree

After starting this series ClickHouse on Kubernetes, you can now configure your first single-node ClickHouse server. Let's dive into creating your first table and understanding the basic concepts behind the ClickHouse engine, its data storage, and some cool features

5
ReplacingMergeTree

My favorite ClickHouse table engine is `ReplacingMergeTree`. The main reason is that it is similar to `MergeTree` but can automatically deduplicate based on columns in the `ORDER BY` clause, which is very useful.

6
ReplicatedReplacingMergeTree

Learn how to set up and manage ReplicatedReplacingMergeTree in ClickHouse on Kubernetes. This comprehensive guide covers cluster setup with ClickHouse Operator, data replication, performance tuning, and best practices for high availability deployments.

On this page
  • 2. Monitoring with Prometheus + Grafana
  • 3. ClickHouse system tables
  • 4. ClickHouse Monitoring UI Dashboard
  • Summary and Best Practices
  • Monitoring Approaches Comparison
  • Key Metrics to Monitor
  • Essential Monitoring Queries
  • Production Recommendations
  • Next Steps
  • Resources
On this page
  • 2. Monitoring with Prometheus + Grafana
  • 3. ClickHouse system tables
  • 4. ClickHouse Monitoring UI Dashboard
  • Summary and Best Practices
  • Monitoring Approaches Comparison
  • Key Metrics to Monitor
  • Essential Monitoring Queries
  • Production Recommendations
  • Next Steps
  • Resources