Monitoring

Raccoon provides monitoring for server connection, publisher, resource usage, and event delivery. Reference for available metrics is available here. The metrics are reported using Statsd and Prometheus

How To Setup

statsd
prometheus

TL;DR
- Run Statsd supported metric collector
- Configure `METRIC_STATSD_ADDRESS` on Raccoon to send to the metric collector
- Visualize and create alerting from the collected metrics

Generally, you can follow the steps above and use any metric collector that supports statsd like Telegraf or Datadog.

This section will cover a setup example using Telegraf, Influx, Kapacitor, and Grafana stack based on the steps above.

Run Statsd Supported Metric Collector To enable statsd on Telegraf you need to enable statsd input on telegraf.conf file. Following are default configurations that you can add based on statsd input readme.

[[inputs.statsd]]
  protocol = "udp"
  max_tcp_connections = 250
  tcp_keep_alive = false
  service_address = ":8125"

  delete_gauges = true
  delete_counters = true
  delete_sets = true
  delete_timings = true

  percentiles = [50.0, 90.0, 99.0, 99.9, 99.95, 100.0]

  metric_separator = "_"

  parse_data_dog_tags = false
  datadog_extensions = false
  datadog_distributions = false

  allowed_pending_messages = 10000
  percentile_limit = 1000

[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
  database = "raccoon"
  retention_policy = "autogen"
  write_consistency = "any"
  timeout = "5s"

Configure Raccoon To Send To The Metric Collector After you have the collector with the port configured, you need to set METRIC_STATSD_ADDRESS to match the metric collector address. Suppose you deploy the telegraf using the default configuration above as sidecar or in localhost, you need to set the value to :8125.

Visualize And Create Alerting From The Collected Metrics Now that you have Raccoon and Telegraf as metric collector set, next is to use the metrics reported. You may notice that the Telegraf config above contains outputs.influxdb. That config will send the metric received to Influxdb. Make sure you have influx service accessible from the configured URL. You can visualize the metrics using Grafana. To do that, you need to add influx datasource to make the data available on Grafana. After that, you can use the data to You can visualize the metrics using Grafana.

TL;DR
- Run Prometheus
- Configure `METRIC_PROMETHEUS_PORT` and `METRIC_PROMETHEUS_ENABLED` on Raccoon
- Visualize and create alerting from the collected metrics

Setting up Prometheus is fairly straight-forward. Prometheus is available as a self-contained binary program for most platforms. For alerting you can use alertmanager that let's you define alerts and offers integration with different notification platforms. For visualisation Grafana comes with out-of-the-box support for prometheus as a data source.

You can download prometheus from their official website.

Run Prometheus

Let's explore an example setup that runs prometheus locally. Begin by creating a new directory and downloading prometheus.

$ mkdir prometheus-for-raccoon
$ cd prometheus-for-raccoon
$ wget https://github.com/prometheus/prometheus/releases/download/v2.53.1/prometheus-2.53.1.linux-amd64.tar.gz
$ tar xzvf prometheus-2.53.1.linux-amd64.tar.gz
$ cd prometheus-2.53.1.linux-amd64

Next, we will edit prometheus.yml to tell prometheus to scrape metrics from raccoon. You can use any text editor that you're familiar with.

prometheus.yml
global:
  scrape_interval: 15s 
  evaluation_interval: 15s 

scrape_configs:
  - job_name: "raccoon"
    static_configs:
      - targets: ["localhost:8888"]

Now run prometheus

$ ./prometheus --config.file=./prometheus.yml

We have now configured prometheus to scrape metrics from localhost:8888. We will now tell raccoon to expose prometheus metric on this port.

Configure METRIC_PROMETHEUS_PORT and METRIC_PROMETHEUS_ENABLED on Raccoon

By default, raccoon doesn't expose prometheus metrics. To enable prometheus metrics, you need to set the following environment variables:

METRIC_PROMETHEUS_ENABLED=true
METRIC_PROMETHEUS_PORT=8888   

Now when you run raccoon, prometheus will start collecting metrics from it.

Visualize And Create Alerting From The Collected Metrics

Now that you have Raccoon and Prometheus setup, next is to use the metrics reported. You can visualize the metrics using Grafana. To do that, you need to add prometheus datasource to make the data available on Grafana. After that, you can use the data to visualize the metrics using Grafana.

Metrics Usages

Following are key monitoring statistics that you can infer from Raccoon metrics. Refer to this section to understand how to build alerting, dashboard, or analyze the metrics.

Data Loss

To infer data loss, you can count kafka_messages_delivered_total with tag success=false. You can also infer the loss rate by calculating the following.

count(kafka_messages_delivered_total success=false)/count(kafka_messages_delivered_total)

For other publishers, just replace kafka in the metric name with the name of the publisher. For instance, analogs of kafka_messages_delivered_total for PubSub and Kinesis would be:

Latency

Raccoon provides fine-grained metrics that denote latency. That gives clues as to where to look in case something goes wrong. In summary, these are key metrics for latency:

event_processing_duration_milliseconds This metrics denotes overall latency. You need to look at other latency metrics to find the root cause when this metric is high.
server_processing_latency_milliseconds Correlate this metric with event_processing_duration_milliseconds to infer whether the issue is with Raccoon itself, or something wrong with the network, or the way sent_time is generated.-
worker_processing_duration_milliseconds High value of this metric indicates that the publisher is slow or can't keep up.

Dashboard

There is a pre-built grafana dashboard available with support for Prometheus data source.

If you're running the statsd + telegraf setup, you can configure telegraf to push metrics to Prometheus.

Monitoring

How To Setup​

Metrics Usages​

Data Loss​

Latency​

Dashboard​

How To Setup

Metrics Usages

Data Loss

Latency

Dashboard