# S3 vs NVMe hosted data

Better Stack Warehouse is powered by a **purpose-built data processing pipeline** for time-based events that leverages recent advances in stream processing, cloud storage, and analytical databases.

We developed the pipeline for processing massive internet-scale datasets: think petabytes to exabytes.

The pipeline has a unique combination of properties:

* Massively scalable
* No cardinality limitations
* Sub-second analytical queries
* Cost efficient

How do we achieve these seemingly mutually exclusive properties?

We work with two types of data: **JSON events** and **time series**.

## JSON event

A **JSON event** is any time-based JSON document with an arbitrary structure, smaller than 10 MB, stored in object storage.
Think of an [OpenTelemetry](https://betterstack.com/docs/logs/open-telemetry/) span, a structured log line, or a JavaScript object containing the properties of a user action.
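
For instance, a hypothetical JSON event capturing a user action could look like this (the field names are illustrative, not a required schema):

```json
{
  "dt": "2025-01-15T09:30:00.123Z",
  "action": "checkout_completed",
  "duration": 42.7,
  "process": "order_worker",
  "user": {
    "id": "u_123",
    "plan": "enterprise"
  }
}
```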

## Time series

A **time series** is a set of highly compressed time-based metrics with a pre-defined schema, stored on local NVMe SSD drives.
[Prometheus metrics](https://betterstack.com/docs/logs/ingesting-data/metrics/prometheus-scrape/), for example.

While you can ingest time series data directly, the secret sauce of our data pipeline is the **direct integration of JSON events with time series**.

Go to **Warehouse** → [Sources](https://warehouse.betterstack.com/team/0/sources ";_blank") → Your source → **Time series on NVMe SSD** and add SQL expressions that extract specific JSON attributes from JSON events in real time into highly compressed, locally stored time series columns.

![Time series on NVMe SSD example](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/90b4a930-cc87-4291-b1e2-3fa0bc429600/public =2460x1976)

[note]
#### Want to learn more about time series?

In [Telemetry](https://betterstack.com/docs/logs/), we use time series for **Metrics** in a similar way.

You can learn more tips and tricks in [Extracting metrics from logs](https://betterstack.com/docs/logs/dashboards/logs-to-metrics/).
[/note]

## Aggregated time series

Say your JSON event contains the attribute `duration` that you want to use in a client-facing API generated with [Queries](https://betterstack.com/docs/warehouse/querying-data/queries/).

A SQL expression `JSONExtract('duration', 'Nullable(Float64)')` of type `Float64`, aggregated via `avg`, `min`, and `max`, generates three columns in your time series schema.

![Duration aggregated time series example](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/1b35e281-9101-4bed-ed46-7750bdb11500/public =2380x2742)

After creating the time series, you use the new columns in your API:

```sql
[label Query saved as a client-facing API]
SELECT
  {{time}} AS time,
  avgMerge(duration_avg),
  minMerge(duration_min),
  maxMerge(duration_max)
FROM {{source}}
GROUP BY time
```

`*Merge()` functions are a [ClickHouse specialty](https://betterstack.com/docs/warehouse/querying-data/merge-and-state-aggregators/). You don’t need to worry about them now — most of the time you will be charting trends with our Drag & drop query builder anyway.
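
If you are curious what happens under the hood: in ClickHouse, aggregated columns like these typically hold partial aggregation states rather than final values, and the `*Merge()` combinators finalize them at query time. A minimal, illustrative sketch — the table and column names are assumptions, not your actual Warehouse schema:

```sql
-- Illustrative ClickHouse schema: each column stores a partial
-- aggregation state instead of a plain number.
CREATE TABLE duration_series
(
    time         DateTime,
    duration_avg AggregateFunction(avg, Float64),
    duration_min AggregateFunction(min, Float64),
    duration_max AggregateFunction(max, Float64)
)
ENGINE = AggregatingMergeTree
ORDER BY time;

-- Writing: partial states are produced with *State() combinators...
INSERT INTO duration_series
SELECT now() AS time, avgState(d), minState(d), maxState(d)
FROM (SELECT 42.0 AS d);

-- ...and reading finalizes them with the matching *Merge() combinators.
SELECT time, avgMerge(duration_avg) FROM duration_series GROUP BY time;
```

Storing states instead of finished numbers is what lets new data keep folding into the same row while queries stay correct.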

**Define time series for queries you need to execute frequently or trends you want to track long-term over large data sets**.

Time series enable you to create very fast analytical **client-facing API endpoints**, even for massive datasets.

For everything else, there's wide events.

And the best thing?
You can always change your mind and **add more time series later.**
We only bill you for the time series you use.

## Non-aggregated time series

When you select **No aggregation** for a time series, we create a single new column without any `*Merge()` function, and we split your data into multiple records to keep all combinations of values distinct.

**Non-aggregated time series can be used in `WHERE` or `GROUP BY` clauses.**

Say your JSON event contains the attribute `process` alongside the aggregated `duration`, and you want to use it in [Queries](https://betterstack.com/docs/warehouse/querying-data/queries/) for filtering and grouping the results. A SQL expression `JSONExtract('process', 'Nullable(String)')` of type `String` without any aggregation generates a single column in your schema.

```sql
[label Query saved as a client-facing API with filtering and grouping]
SELECT
  {{time}} AS time,
  process,
  avgMerge(duration_avg),
  minMerge(duration_min),
  maxMerge(duration_max)
FROM {{source}}
WHERE process LIKE 'order_%'
GROUP BY time, process
```

[note]
#### Non-aggregated time series may significantly increase the amount of data on NVMe

Since a row exists for every unique combination of your non-aggregated time series values, we recommend keeping the cardinality of all time series without aggregations as low as possible. For example, two non-aggregated columns with 100 and 50 distinct values can produce up to 5,000 rows per time bucket.
[/note]


## Overview: JSON events vs. time series

|                   | **JSON events**                                                                             | **Time series**                                                                                 |
| ----------------- | ---------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| **Examples**      | Any JSON such as structured logs, OpenTelemetry traces & spans, plain text logs                            | Prometheus metrics, OpenTelemetry metrics, time series extracted from JSON events                             |
| **Best for** | Keeping large amounts of raw unstructured data with high cardinality | Fast frequently executed queries powering analytical APIs |
| **Storage**       | Scalable object storage in the cloud                                                                       | High-speed local NVMe drives                                                                              |
| **Cardinality**   | High cardinality                                                                                           | Low cardinality                                                                                           |
| **Compression**   | Somewhat compressed                                                                                        | Heavily compressed                                                                                        |
| **Data format**   | Row store                                                                                                  | Column store                                                                                              |
| **Sampling**      | Sampling available                                                                                         | Always unsampled                                                                                          |
| **Cost**          | Cost-effective                                                                                             | Optimized for performance                                                                                 |

[info]
Are you planning to ingest over 100 TB per month?
Need to store data in a custom data region or your own S3 bucket?
Need a fast query speed even for large datasets?
Please get in touch at **[hello@betterstack.com](mailto:hello@betterstack.com)**.
[/info]