Sharvil Patel

Projects /

Realtime Metrics Pipeline

Kafka-to-ClickHouse ingestion pipeline handling 40k events/sec with sub-second dashboard latency.

Date
NOV 2025
Role
Solo project
Stack
Go / Kafka / ClickHouse / Grafana / Terraform

Placeholder writeup. Replace this with the real project. The structure below (problem, constraints, approach, outcome) is a good skeleton to keep.

Problem

Application metrics were scattered across per-service log files and a managed APM product whose retention and query costs grew faster than the traffic did. Engineers couldn’t ask ad-hoc questions like “what was p99 latency for this endpoint, for this customer tier, last Tuesday” without exporting data by hand.

Constraints

  • Sustained ingest of around 40,000 events per second at peak, with bursts to double that.
  • Dashboard queries needed to come back in under a second to be usable during incidents.
  • Run on existing infrastructure; no new managed services beyond what was already approved.

Approach

Events flow from services into Kafka, where a small fleet of Go consumers batch, validate, and flatten them before inserting into ClickHouse. The consumers are deliberately boring: at-least-once delivery, idempotent inserts keyed on event ID, and backpressure by pausing partition consumption rather than buffering in memory.

// Batches flush on size or age, whichever comes first.
func (w *writer) flushLoop(ctx context.Context) {
    ticker := time.NewTicker(w.maxAge)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            w.flush() // drain on shutdown
            return
        case <-ticker.C:
            w.flush()
        case ev := <-w.in:
            if w.add(ev) >= w.maxBatch {
                w.flush()
            }
        }
    }
}

ClickHouse tables use a MergeTree ordered by (service, endpoint, timestamp) with a materialized view maintaining per-minute rollups, so Grafana dashboards hit the rollups and incident deep-dives hit raw events.

Outcome

  • p99 dashboard query latency dropped from roughly 8 seconds to 300 milliseconds.
  • Retention went from 14 days to 13 months at lower cost than the APM product.
  • Schema changes ship through Terraform and take effect without downtime.

What I’d do differently

Start with the rollup tables instead of adding them after the first slow-dashboard complaint, and budget more time for Kafka consumer-group rebalancing edge cases, which produced the only two production incidents the pipeline has had.