S3 vs NVMe hosted data

Better Stack Warehouse is powered by a purpose-built data processing pipeline for time-based events that leverages recent advances in stream processing, cloud storage, and analytical databases.

We developed the pipeline for processing massive internet-scale datasets: think petabytes to exabytes.

The pipeline has a unique combination of properties:

  • Massively scalable
  • No cardinality limitations
  • Sub-second analytical queries
  • Cost efficient

How do we achieve these seemingly mutually exclusive properties?

We work with two types of data: JSON events and time series.

JSON event

A JSON event is any time-based JSON document with arbitrary structure, smaller than 10 MB, stored in object storage. Think of an OpenTelemetry span, a structured log line, or a JavaScript object containing the properties of a user action.
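
For illustration, a JSON event could look like the snippet below. The field names are just an example, not a required schema:

{
  "dt": "2025-01-15T10:24:03.512Z",
  "message": "Order created",
  "process": "order_worker",
  "duration": 42.7,
  "user": { "id": 18293, "plan": "business" }
}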

Time series

A time series is a set of highly compressed, time-based metrics with a pre-defined schema, stored on local NVMe SSD drives. Prometheus metrics, for example.
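
To get a feel for why time series compress so well, here is a minimal sketch of a pre-defined, columnar metrics schema in ClickHouse. The table, column names, and codecs are illustrative assumptions, not our exact schema:

CREATE TABLE metrics
(
  time  DateTime CODEC(Delta, ZSTD),   -- timestamps delta-encode extremely well
  name  LowCardinality(String),        -- dictionary-encoded metric name
  value Float64 CODEC(Gorilla, ZSTD)   -- Gorilla encoding suits numeric samples
)
ENGINE = MergeTree
ORDER BY (name, time);

Because each column holds values of a single type sorted by time, specialized codecs can shrink the data far more aggressively than general-purpose compression of raw JSON.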

While you can ingest time series data directly, the secret sauce of our data pipeline is the direct integration of JSON events with time series.

Go to Warehouse → Sources → Your source → Time series on NVMe SSD and add SQL expressions that extract specific JSON attributes from JSON events in real time into highly compressed, locally stored time series columns.

Time series on NVMe SSD example

Aggregated time series

Say your JSON event contains the attribute duration that you want to use in a client-facing API generated with Queries.

A SQL expression JSONExtract('duration', 'Nullable(Float64)') of type Float64, aggregated via avg, min, and max, generates three columns in your time series schema.

Duration aggregated time series example
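
In plain ClickHouse, the same extraction takes the JSON document as its first argument. A rough sketch of the idea, where raw and events are hypothetical names for the stored event body and table:

SELECT
  JSONExtract(raw, 'duration', 'Nullable(Float64)') AS duration
FROM events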

After creating the time series, you use the new columns in your API:

Query saved as a client-facing API
SELECT
  {{time}} AS time,
  avgMerge(duration_avg),
  minMerge(duration_min),
  maxMerge(duration_max)
FROM {{source}}
GROUP BY time

*Merge() functions are a ClickHouse specialty. You don’t need to worry about them now — most of the time you will be charting trends with our Drag & drop query builder anyway.
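
If you are curious what the *Merge() calls do, here is a minimal sketch of the underlying ClickHouse mechanism. The table and column names are hypothetical, not our exact schema:

-- Simplified stand-in for extracted events
CREATE TABLE events (dt DateTime, duration Nullable(Float64))
ENGINE = MergeTree ORDER BY dt;

-- Aggregated time series columns store partial aggregation states, not final values
CREATE TABLE duration_timeseries
(
  time          DateTime,
  duration_avg  AggregateFunction(avg, Nullable(Float64)),
  duration_min  AggregateFunction(min, Nullable(Float64)),
  duration_max  AggregateFunction(max, Nullable(Float64))
)
ENGINE = AggregatingMergeTree
ORDER BY time;

-- At write time, *State() produces the partial states per time bucket
INSERT INTO duration_timeseries
SELECT
  toStartOfMinute(dt) AS time,
  avgState(duration),
  minState(duration),
  maxState(duration)
FROM events
GROUP BY time;

-- At read time, *Merge() combines the partial states into final numbers,
-- which is what the saved query above does.

Keeping cheap partial states at write time and doing the final Merge only at read time is what keeps these queries fast even over long time ranges.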

Define time series for queries you need to execute frequently or trends you want to track long-term over large data sets.

Time series enable you to create very fast analytical client-facing API endpoints even for massive datasets.

For everything else there's wide events.

And the best thing? You can always change your mind and add more time series later. We only bill you for the time series you use.

Non-aggregated time series

When you select No aggregation for your time series, we create a single new column without any *Merge() function and split your data into multiple records to keep every distinct combination of values.

Non-aggregated time series can be used in WHERE or GROUP BY clauses.

Say your JSON event contains the attribute process alongside the aggregated duration, and you want to use it in Queries for filtering and grouping the results. A SQL expression JSONExtract('process', 'Nullable(String)') of type String without any aggregation generates a single column in your schema.

Query saved as a client-facing API with filtering and grouping
SELECT
  {{time}} AS time,
  process,
  avgMerge(duration_avg),
  minMerge(duration_min),
  maxMerge(duration_max)
FROM {{source}}
WHERE process LIKE 'order_%'
GROUP BY time, process

Non-aggregated time series may significantly increase the amount of data stored on NVMe

Since a row is stored for every unique combination of values across your non-aggregated time series, we recommend keeping the cardinality of all time series without aggregations as low as possible. For example, two non-aggregated columns with 20 and 50 distinct values can produce up to 20 × 50 = 1,000 rows per time bucket instead of one.

Overview: JSON events vs. time series

|             | JSON events | Time series |
|-------------|-------------|-------------|
| Examples    | Any JSON such as structured logs, OpenTelemetry traces & spans, plain text logs | Prometheus metrics, OpenTelemetry metrics, time series extracted from JSON events |
| Best for    | Keeping large amounts of raw unstructured data with high cardinality | Fast, frequently executed queries powering analytical APIs |
| Storage     | Scalable object storage in the cloud | High-speed local NVMe drives |
| Cardinality | High cardinality | Low cardinality |
| Compression | Somewhat compressed | Heavily compressed |
| Data format | Row store | Column store |
| Sampling    | Sampling available | Always unsampled |
| Cost        | Cost-effective | Optimized for performance |

Are you planning to ingest over 100 TB per month? Need to store data in a custom data region or your own S3 bucket? Need fast query speeds even for large datasets? Please get in touch at hello@betterstack.com.