How to manage high cardinality in metrics
High cardinality refers to having a large number of unique values in your dataset, such as unique user IDs, URLs, or session IDs. This increases the number of unique time series in your metrics.
Why does high cardinality increase cost?
Metrics are stored as data points. The number of data points required to store a metric can be calculated as:
Data points = aggregations × active time series (cardinality) × retention in days
High cardinality increases the “active time series” value, leading to more data points and higher charges.
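As a rough illustration of the formula, the sketch below compares the same metric at two cardinalities. The function name and every number in it are assumptions chosen for the example, not figures from any real pricing plan.

```python
def data_points(aggregations: int, active_time_series: int, retention_days: int) -> int:
    """Data points = aggregations x active time series x retention in days."""
    return aggregations * active_time_series * retention_days


# Hypothetical metric keyed by a normalized path: 50 active time series.
low = data_points(aggregations=4, active_time_series=50, retention_days=30)

# The same hypothetical metric keyed by raw URLs containing IDs: 20,000 active time series.
high = data_points(aggregations=4, active_time_series=20_000, retention_days=30)

print(low)          # 6000
print(high)         # 2400000
print(high // low)  # 400
```

With aggregations and retention held constant, the number of data points grows in direct proportion to the number of active time series.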
Finding expensive metrics
To determine which metrics have many active time series, run a query that counts the unique values of each metric over a specific time period. The following example query covers the last day:
SELECT
COUNT(DISTINCT non_aggregated_metric1) AS unique_metric1,
COUNT(DISTINCT non_aggregated_metric2) AS unique_metric2,
COUNT(DISTINCT non_aggregated_metric3) AS unique_metric3,
…
FROM {{source}}
WHERE dt > now() - INTERVAL 1 DAY
This query counts the distinct values of each non-aggregated metric over the past day. Metrics with many distinct values have high cardinality. Once you know which metrics those are, you can work on reducing their cardinality.
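To experiment with the same idea on a small sample outside the database, the counting can be sketched in Python. The `distinct_counts` helper and the sample rows are hypothetical; the helper only approximates what COUNT(DISTINCT …) does per column, skipping missing values in the same way.

```python
from collections import defaultdict


def distinct_counts(rows):
    """Return (column, distinct value count) pairs, highest cardinality first."""
    seen = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            if value is not None:  # COUNT(DISTINCT ...) also ignores NULLs
                seen[column].add(value)
    return sorted(
        ((column, len(values)) for column, values in seen.items()),
        key=lambda item: item[1],
        reverse=True,
    )


# Hypothetical sample of log rows.
rows = [
    {"path": "/orders/123", "method": "GET", "status": 200},
    {"path": "/orders/124", "method": "GET", "status": 200},
    {"path": "/orders/125/items", "method": "POST", "status": 201},
    {"path": "/orders/123", "method": "GET", "status": 200},
]

ranking = distinct_counts(rows)
print(ranking)  # [('path', 3), ('method', 2), ('status', 2)]
```

In this sample, `path` comes out on top, which makes it the first candidate to look at.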
Reducing cardinality
To effectively manage high cardinality in metrics, you can apply several strategies to reduce the number of unique values:
- Remove unused unique data: Identify any unique data points that are unnecessary for your analysis and remove them. Often, datasets contain unique identifiers or segments that aren’t needed for meaningful insights. By cleaning up such data points, you can significantly lower the cardinality.
- Remove dynamic segments: Replace dynamic segments in your data with static placeholders. This is particularly useful for components such as URLs that contain variable elements like IDs. By transforming these dynamic segments into standard placeholders, you reduce the number of unique values.
For example, consider a set of URLs with dynamic IDs:
/orders/123
/orders/124
/orders/125/items
These URLs have high cardinality because each unique ID increases the number of unique values. To simplify these URLs, replace the dynamic segments (IDs) with a placeholder such as <id>.
You can use replaceRegexpAll and cutQueryString in your SQL expression to apply this transformation when you add a new metric on the Source → Logs to metrics → Create metric page. For instance:
replaceRegexpAll(
cutQueryString(
JSONExtract(json, 'request', 'path', 'Nullable(String)')
),
'(^|/)\\d+(/|$)',
'\1<id>\2'
)
After applying this transformation, the URLs will look like this:
/orders/<id>
/orders/<id>
/orders/<id>/items
By standardizing the dynamic segments, you reduce the number of unique paths. In this small dataset, three unique paths become two, a reduction in cardinality of about 33%.
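Before creating the metric, it can help to try the pattern on sample paths locally. The sketch below is a Python approximation of the SQL expression above, not the expression itself: `normalize_path` is a hypothetical helper, and splitting on "?" is only a rough stand-in for cutQueryString.

```python
import re

# Same pattern as in the SQL expression: a purely numeric path segment.
NUMERIC_SEGMENT = re.compile(r"(^|/)\d+(/|$)")


def normalize_path(path: str) -> str:
    """Replace numeric path segments with the <id> placeholder."""
    # Rough stand-in for cutQueryString: drop everything from the first "?".
    without_query = path.split("?", 1)[0]
    # Keep the surrounding slashes (groups 1 and 2) and swap the digits for <id>.
    return NUMERIC_SEGMENT.sub(r"\1<id>\2", without_query)


paths = ["/orders/123", "/orders/124", "/orders/125/items"]
normalized = [normalize_path(p) for p in paths]

print(normalized)  # ['/orders/<id>', '/orders/<id>', '/orders/<id>/items']
print(len(set(paths)), "->", len(set(normalized)))  # 3 -> 2
```

In this Python sketch, two numeric segments in a row are only partly replaced: /a/1/2 becomes /a/<id>/2, because the two matches would have to share the slash between them and matches cannot overlap. If your paths contain adjacent IDs, check how the SQL expression handles them on your own data.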
These strategies help manage and reduce high cardinality, making your metrics more efficient and cost-effective to store and process.
Final thoughts
Managing high cardinality in metrics is crucial for efficient data storage and processing. By removing unused unique data and replacing dynamic segments with placeholders, you can significantly reduce cardinality.
However, the best approach depends on your specific data and needs. Carefully consider which unique values are essential for your analysis and adapt the strategies accordingly, so that you keep the insights you need while storing and processing fewer time series.