How to manage high cardinality in metrics

High cardinality refers to having a large number of unique values in your dataset, such as unique user IDs, URLs, or session IDs. Each unique value can become its own time series, so high cardinality increases the number of unique time series in your metrics.

Why does high cardinality increase cost?

Metrics are stored as data points. The number of data points required to store metrics can be calculated as:

Data points = aggregations Ă— active time series (cardinality) Ă— retention in days

High cardinality increases the “active time series” value, leading to more data points and higher charges.
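
For example, with three aggregations (say, sum, count, and max), 10,000 active time series, and 30 days of retention (all numbers are purely illustrative):

Example calculation
3 Ă— 10,000 Ă— 30 = 900,000 data points

Halving the cardinality to 5,000 active time series halves the total to 450,000 data points, which is why cardinality is usually the most effective lever on cost.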

Finding expensive metrics

To determine which metrics have many active time series, run a query that counts the unique values of each metric over a specific time period. Here's an example query that returns the active time series counts for the last day:

SQL Query for daily active time series count per metric
SELECT
  COUNT(DISTINCT non_aggregated_metric1) AS unique_metric1,
  COUNT(DISTINCT non_aggregated_metric2) AS unique_metric2,
  COUNT(DISTINCT non_aggregated_metric3) AS unique_metric3,
  …
FROM {{source}}
WHERE dt > now() - INTERVAL 1 DAY

This query counts the distinct values of each non-aggregated metric over the past day. Metrics with many distinct values have high cardinality. Once you know which metrics have high cardinality, you can start reducing it.
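
A hypothetical result might look like this (the metric names and counts are illustrative only, not from a real dataset):

Example result
unique_metric1: 12
unique_metric2: 48210
unique_metric3: 305

In this made-up output, unique_metric2 would be the obvious high-cardinality candidate to investigate first.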

Reducing cardinality

To effectively manage high cardinality in metrics, you can apply several strategies to reduce the number of unique values:

  1. Remove unused unique data: Identify any unique data points that are unnecessary for your analysis and remove them. Often, datasets contain unique identifiers or segments that aren’t needed for meaningful insights. By cleaning up such data points, you can significantly lower the cardinality (a short sketch follows this list).
  2. Remove dynamic segments: Replace dynamic segments in your data with static placeholders. This is particularly useful for components such as URLs that contain variable elements like IDs. By transforming these dynamic segments into standard placeholders, you reduce the number of unique values.
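
For the first strategy, here is a minimal sketch of a metric expression that simply leaves a unique identifier out. The JSON structure and field names (request, method) are assumptions for illustration; adapt them to your own logs:

SQL expression
-- Label the metric by HTTP method only; a unique identifier such as a
-- user ID or session ID is deliberately not extracted, so it never
-- becomes a time series dimension
JSONExtract(json, 'request', 'method', 'Nullable(String)')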

For the second strategy, consider a set of URLs with dynamic IDs:

Original URLs
/orders/123
/orders/124
/orders/125/items

These URLs have high cardinality because each unique ID increases the number of unique values. To simplify these URLs, replace the dynamic segments (IDs) with a placeholder such as <id>.

You can use replaceRegexpAll and cutQueryString in your SQL expression to perform this transformation when adding a new metric on the Source → Logs to metrics → Create metric page. For instance:


SQL expression
replaceRegexpAll(
  cutQueryString(
    JSONExtract(json, 'request', 'path', 'Nullable(String)')
  ),
  '(^|/)\\d+(/|$)',
  '\\1<id>\\2'
)

After applying this transformation, the URLs will look like this:

Result
/orders/<id>
/orders/<id>
/orders/<id>/items

By standardizing the dynamic segments, you significantly reduce the number of unique paths. In this small dataset, the three unique paths collapse to two, a reduction of about 33%.
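
Before saving the metric, you can sanity-check the expression by running it against sample paths. This is a sketch that uses hypothetical inline values instead of your real logs:

SQL query to test the expression on sample paths
SELECT
  path,
  replaceRegexpAll(
    cutQueryString(path),
    '(^|/)\\d+(/|$)',
    '\\1<id>\\2'
  ) AS normalized_path
FROM
(
  -- arrayJoin expands the hypothetical sample paths into one row each
  SELECT arrayJoin(['/orders/123', '/orders/124?page=2', '/orders/125/items']) AS path
)

The query should return /orders/<id> for the first two paths and /orders/<id>/items for the third.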

These strategies help manage and reduce high cardinality, making your metrics more efficient and cost-effective to store and process.

Final thoughts

Managing high cardinality in metrics is crucial for efficient data storage and processing. By removing unused unique data and replacing dynamic segments with placeholders, you can significantly reduce cardinality.

However, the best approach depends on your specific data and needs. Carefully consider which unique values are essential for your analysis and tailor these strategies accordingly; that way you keep meaningful insights while controlling storage and processing costs.