Are you planning to ingest over 100 TB per month? Need to store data in a custom data region or your own S3 bucket? Need fast queries even over large datasets? Please get in touch at hello@betterstack.com.
S3 vs NVMe hosted data
Better Stack Warehouse is powered by a purpose-built data processing pipeline for time-based events that leverages recent advances in stream processing, cloud storage, and analytical databases.
We developed the pipeline for processing massive internet-scale datasets: think petabytes to exabytes.
The pipeline has a unique combination of properties:
- Massively scalable
- No cardinality limitations
- Sub-second analytical queries
- Cost efficient
How do we achieve these seemingly mutually exclusive properties?
We work with two types of data: JSON events and time series.
JSON event
A JSON event is any time-based JSON with an arbitrary structure, smaller than 10 MB, stored in object storage. Think an OpenTelemetry span, a structured log line, or a JavaScript object containing the properties of a user action.
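For illustration, a minimal JSON event for a hypothetical user action might look like this (all field names are invented):

```json
{
  "timestamp": "2024-06-01T12:34:56.789Z",
  "event": "checkout_completed",
  "user_id": "u_18423",
  "duration": 0.182,
  "cart": {
    "items": 3,
    "total_cents": 4599
  }
}
```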
Time series
A time series is a set of highly compressed time-based metrics with a pre-defined schema, stored on local NVMe SSD drives. Prometheus metrics, for example.
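For example, a single Prometheus time series sample in the standard exposition format (the metric name and labels here are hypothetical):

```
# HELP http_requests_total Total number of HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
```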
While you can ingest time series data directly, the secret sauce of our data pipeline is how it integrates JSON events with time series.
Go to the Sources → Your source → Time series on NVMe SSD tab and add SQL expressions that extract specific JSON attributes from your JSON events in real time into highly compressed, locally stored time series columns.
Example:
Say your JSON event contains the attribute `duration` that you want to use in a client-facing API generated with Queries. A SQL expression `JSONExtract('duration', 'Nullable(Float64)')` of type Float64, aggregated via avg, min, and max, generates three columns in your time series schema that you can use in your API with:
```sql
SELECT
  {{time}} AS time,
  avgMerge(duration_avg),
  minMerge(duration_min),
  maxMerge(duration_max)
FROM {{source}}
```
The `*Merge()` functions are a ClickHouse specialty. You don't need to worry about them now; most of the time you'll be charting trends with our Drag & drop query builder anyway.
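Curious what `*Merge()` does? Here's a minimal, self-contained ClickHouse sketch, with a made-up `duration_series` table standing in for the generated schema: partial aggregate states are written with the `*State()` combinators and combined at query time with `*Merge()`:

```sql
-- Hypothetical stand-in for a generated time series schema:
-- each column stores a partial aggregate state, not a raw value.
CREATE TABLE duration_series
(
    time DateTime,
    duration_avg AggregateFunction(avg, Float64),
    duration_min AggregateFunction(min, Float64),
    duration_max AggregateFunction(max, Float64)
)
ENGINE = AggregatingMergeTree
ORDER BY time;

-- At ingest time, *State() produces compact partial aggregates.
INSERT INTO duration_series
SELECT
    toStartOfMinute(now()) AS time,
    avgState(d),
    minState(d),
    maxState(d)
FROM (SELECT toFloat64(number % 100) AS d FROM numbers(1000))
GROUP BY time;

-- At query time, *Merge() combines the partial states into final values.
SELECT
    time,
    avgMerge(duration_avg) AS avg_duration,
    minMerge(duration_min) AS min_duration,
    maxMerge(duration_max) AS max_duration
FROM duration_series
GROUP BY time;
```

The generated columns in your source work the same way, which is why the query above uses `avgMerge()` rather than plain `avg()`.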
Define time series for queries you need to execute frequently or trends you want to track long-term over large data sets.
Time series enable you to create very fast analytical client-facing API endpoints, even for massive datasets.
For everything else, there's wide events.
And the best thing? You can always change your mind and add more time series later. We only bill you for the time series you use.
Overview: JSON events vs. time series
|  | JSON events | Time series |
| --- | --- | --- |
| Examples | Any JSON such as structured logs, OpenTelemetry traces & spans, plain text logs | Prometheus metrics, OpenTelemetry metrics, time series extracted from JSON events |
| Best used for | Keeping large amounts of raw, unstructured data with high cardinality | Fast, frequently executed queries powering analytical APIs |
| Storage | Scalable object storage in the cloud | High-speed local NVMe drives |
| Cardinality | High cardinality | Low cardinality |
| Compression | Somewhat compressed | Heavily compressed |
| Data format | Row store | Column store |
| Sampling | Sampling available | Always unsampled |
| Cost | Cost-effective | Optimized for performance |