S3 vs NVMe hosted data

Better Stack Warehouse is powered by a purpose-built data processing pipeline for time-based events that leverages recent advances in stream processing, cloud storage, and analytical databases.

We developed the pipeline for processing massive internet-scale datasets: think petabytes to exabytes.

The pipeline has a unique combination of properties:

  • Massively scalable
  • No cardinality limitations
  • Sub-second analytical queries
  • Cost efficient

How do we achieve these seemingly mutually exclusive properties?

We work with two types of data: JSON events and time series.

JSON event

A JSON event is any time-based JSON document with arbitrary structure, smaller than 10 MB, stored in object storage. Think of an OpenTelemetry span, a structured log line, or a JavaScript object containing the properties of a user action.
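
For illustration, a JSON event could look like the snippet below. The field names are just an example, not a required schema:

{
  "dt": "2025-01-15T10:24:03.512Z",
  "message": "Order created",
  "process": "order_worker",
  "duration": 42.7,
  "user": { "id": 18293, "plan": "business" }
}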

Time series

A time series is a set of highly compressed, time-based metrics with a pre-defined schema, stored on local NVMe SSD drives. Prometheus metrics, for example.
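
To get a feel for why time series compress so well, here is a minimal sketch of a pre-defined, columnar metrics schema in ClickHouse. The table, column names, and codecs are illustrative assumptions, not our exact schema:

CREATE TABLE metrics
(
  time  DateTime CODEC(Delta, ZSTD),   -- timestamps delta-encode extremely well
  name  LowCardinality(String),        -- dictionary-encoded metric name
  value Float64 CODEC(Gorilla, ZSTD)   -- Gorilla encoding suits numeric samples
)
ENGINE = MergeTree
ORDER BY (name, time);

Because each column holds values of a single type sorted by time, specialized codecs can shrink the data far more aggressively than general-purpose compression of raw JSON.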

While you can ingest time series data directly, the secret sauce of our data pipeline is the direct integration of JSON events with time series.

Go to Warehouse → Sources → Your source → Time series on NVMe SSD and add SQL expressions that extract specific JSON attributes from JSON events in real time into highly compressed, locally stored time series columns.

Time series on NVMe SSD example

Aggregated time series

Say your JSON event contains the attribute duration that you want to use in a client-facing API generated with Queries.

A SQL expression JSONExtract('duration', 'Nullable(Float64)') of type Float64, aggregated via avg, min, and max, generates three columns in your time series schema.

Duration aggregated time series example
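
In plain ClickHouse, the same extraction takes the JSON document as its first argument. A rough sketch of the idea, where raw and events are hypothetical names for the stored event body and table:

SELECT
  JSONExtract(raw, 'duration', 'Nullable(Float64)') AS duration
FROM events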

After creating the time series, you use the new columns in your API:

Query saved as a client-facing API
SELECT
  {{time}} AS time,
  avgMerge(duration_avg),
  minMerge(duration_min),
  maxMerge(duration_max)
FROM {{source}}
GROUP BY time

*Merge() functions are a ClickHouse specialty. You don’t need to worry about them now — most of the time you will be charting trends with our Drag & drop query builder anyway.
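
If you are curious what the *Merge() calls do, here is a minimal sketch of the underlying ClickHouse mechanism. The table and column names are hypothetical, not our exact schema:

-- Simplified stand-in for extracted events
CREATE TABLE events (dt DateTime, duration Nullable(Float64))
ENGINE = MergeTree ORDER BY dt;

-- Aggregated time series columns store partial aggregation states, not final values
CREATE TABLE duration_timeseries
(
  time          DateTime,
  duration_avg  AggregateFunction(avg, Nullable(Float64)),
  duration_min  AggregateFunction(min, Nullable(Float64)),
  duration_max  AggregateFunction(max, Nullable(Float64))
)
ENGINE = AggregatingMergeTree
ORDER BY time;

-- At write time, *State() produces the partial states per time bucket
INSERT INTO duration_timeseries
SELECT
  toStartOfMinute(dt) AS time,
  avgState(duration),
  minState(duration),
  maxState(duration)
FROM events
GROUP BY time;

-- At read time, *Merge() combines the partial states into final numbers,
-- which is what the saved query above does.

Keeping cheap partial states at write time and doing the final Merge only at read time is what keeps these queries fast even over long time ranges.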

Define time series for queries you need to execute frequently or trends you want to track long-term over large data sets.

Time series enable you to create very fast analytical client-facing API endpoints even for massive datasets.

For everything else there's wide events.

And the best thing? You can always change your mind and add more time series later. We only bill you for the time series you use.

Non-aggregated time series

When you select No aggregation for your time series, we create a single new column without any *Merge() function and split your data into multiple records to keep every distinct combination of values.

Non-aggregated time series can be used in WHERE or GROUP BY clauses.

Say your JSON event contains the attribute process alongside the aggregated duration, and you want to use it in Queries for filtering and grouping the results. A SQL expression JSONExtract('process', 'Nullable(String)') of type String without any aggregation generates a single column in your schema.

Query saved as a client-facing API with filtering and grouping
SELECT
  {{time}} AS time,
  process,
  avgMerge(duration_avg),
  minMerge(duration_min),
  maxMerge(duration_max)
FROM {{source}}
WHERE process LIKE 'order_%'
GROUP BY time, process

Non-aggregated time series may significantly increase the amount of data stored on NVMe

Since a row is stored for every unique combination of values across your non-aggregated time series, we recommend keeping the cardinality of all time series without aggregations as low as possible. For example, two non-aggregated columns with 20 and 50 distinct values can produce up to 20 × 50 = 1,000 rows per time bucket instead of one.

Overview: JSON events vs. time series

|             | JSON events | Time series |
|-------------|-------------|-------------|
| Examples    | Any JSON such as structured logs, OpenTelemetry traces & spans, plain text logs | Prometheus metrics, OpenTelemetry metrics, time series extracted from JSON events |
| Best for    | Keeping large amounts of raw unstructured data with high cardinality | Fast, frequently executed queries powering analytical APIs |
| Storage     | Scalable object storage in the cloud | High-speed local NVMe drives |
| Cardinality | High cardinality | Low cardinality |
| Compression | Somewhat compressed | Heavily compressed |
| Data format | Row store | Column store |
| Sampling    | Sampling available | Always unsampled |
| Cost        | Cost-effective | Optimized for performance |

Are you planning to ingest over 100 TB per month? Need to store data in a custom data region or your own S3 bucket? Need fast query speeds even for large datasets? Please get in touch at hello@betterstack.com.