S3 vs. NVMe-hosted data

Better Stack Warehouse is powered by a purpose-built data processing pipeline for time-based events that leverages recent advances in stream processing, cloud storage, and analytical databases.

We developed the pipeline for processing massive internet-scale datasets: think petabytes to exabytes.

The pipeline has a unique combination of properties:

  • Massively scalable
  • No cardinality limitations
  • Sub-second analytical queries
  • Cost efficient

How do we achieve these seemingly mutually exclusive properties?

We work with two types of data: JSON events and time series.

JSON event

A JSON event is any time-based JSON document with arbitrary structure, smaller than 10 MB, stored in object storage. Think an OpenTelemetry span, a structured log line, or a JavaScript object containing the properties of a user action.
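For instance, a structured log line might arrive as a JSON event like this (the field names, including duration, are illustrative):

```json
{
  "timestamp": "2024-05-01T12:34:56Z",
  "level": "info",
  "message": "Checkout completed",
  "duration": 132.5,
  "user": { "id": "u_123", "plan": "pro" }
}
```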

Time series

A time series is a set of highly compressed, time-based metrics with a pre-defined schema, stored on local NVMe SSD drives. Prometheus metrics, for example.
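A single sample in the Prometheus exposition format looks like this (metric and label names are illustrative): a metric name, a fixed set of labels, and a numeric value at a point in time — exactly the kind of narrow, pre-defined schema that compresses well.

```
http_request_duration_seconds_sum{service="api",region="us-east"} 1027.5
```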

While you can ingest time series data directly, the secret sauce of our data pipeline is the ability to derive time series from your JSON events.

You can go to Sources → Your source → Time series on NVMe SSD tab and add SQL expressions that extract specific JSON attributes from incoming JSON events in real time into highly compressed, locally stored time series columns.

Example: Say your JSON event contains the attribute duration that you want to use in a client-facing API generated with Queries.

The SQL expression JSONExtract('duration', 'Nullable(Float64)') of type Float64, aggregated via avg, min, and max, generates three columns in your time series schema that you can use in your API:

Query saved as a client-facing API
SELECT
  {{time}} AS time,
  avgMerge(duration_avg) AS duration_avg,
  minMerge(duration_min) AS duration_min,
  maxMerge(duration_max) AS duration_max
FROM {{source}}
GROUP BY time

*Merge() functions are a ClickHouse specialty. You don’t need to worry about them now — most of the time you will be charting trends with our Drag & drop query builder anyway.
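If you are curious anyway, here is a rough self-contained sketch of the idea, runnable against any plain ClickHouse instance using the built-in numbers table (not a Warehouse source): the *State() combinator produces a partial aggregation state, and the matching *Merge() combinator combines those states into the final value.

```sql
-- avgState() stores a partial aggregation state instead of a final number;
-- avgMerge() combines such states and produces the finished average.
SELECT avgMerge(s) AS avg_value
FROM
(
    SELECT avgState(number) AS s
    FROM numbers(10)  -- numbers 0..9
)
-- avg_value = 4.5
```

Pre-aggregated states are what make the time series columns cheap to query: the heavy lifting happens at ingest time, and query time only merges small states.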

Define time series for queries you need to execute frequently or trends you want to track long-term over large data sets.

Time series enable you to create very fast analytical, client-facing API endpoints, even for massive datasets.

For everything else there's wide events.

And the best thing? You can always change your mind and add more time series later. We only bill you for the time series you use.

Overview: JSON events vs. time series

|  | JSON events | Time series |
| --- | --- | --- |
| Examples | Any JSON such as structured logs, OpenTelemetry traces & spans, plain-text logs | Prometheus metrics, OpenTelemetry metrics, time series extracted from JSON events |
| Best used for | Keeping large amounts of raw unstructured data with high cardinality | Fast, frequently executed queries powering analytical APIs |
| Storage | Scalable object storage in the cloud | High-speed local NVMe drives |
| Cardinality | High cardinality | Low cardinality |
| Compression | Somewhat compressed | Heavily compressed |
| Data format | Row store | Column store |
| Sampling | Sampling available | Always unsampled |
| Cost | Cost-effective | Optimized for performance |

Are you planning to ingest over 100 TB per month? Do you need to store data in a custom data region or your own S3 bucket? Do you need fast queries even on large datasets? Please get in touch at hello@betterstack.com.