PostgreSQL on S3-backed Block Storage with Near-Local Performance

Started by Pierre Barre · 11 months ago · 26 messages · general

pierre@barre.sh

11 months ago

Hi everyone,

I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.

ZeroFS: https://github.com/Barre/ZeroFS

# The Architecture

ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:

PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3

By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
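
To make the caching argument concrete, here is a minimal sketch of how the average read latency depends on which tier serves a block. The tier names follow the post, but the hit ratios and per-tier latencies are made-up illustrative values, not measurements from ZeroFS, ZFS, or any S3 provider. The model also ignores the time spent on misses in earlier tiers.

```python
# Illustrative model only: the hit ratios and latencies below are
# assumptions, not measurements from ZeroFS, ZFS, or any S3 provider.

def expected_read_latency_us(tiers):
    """Average read latency for a chain of cache tiers.

    `tiers` is a list of (name, hit_ratio, latency_us) in lookup order.
    `hit_ratio` is the fraction of requests reaching that tier which it
    serves; the last tier must use 1.0 because it always answers.
    """
    remaining = 1.0  # fraction of requests that reach the current tier
    total = 0.0
    for _name, hit_ratio, latency_us in tiers:
        served = remaining * hit_ratio
        total += served * latency_us
        remaining -= served
    if remaining > 1e-9:
        raise ValueError("last tier must serve all remaining requests")
    return total


tiers = [
    ("ZFS ARC (RAM)", 0.90, 1.0),            # assumed
    ("ZFS L2ARC (local SSD)", 0.90, 100.0),  # assumed
    ("S3 via ZeroFS", 1.00, 50_000.0),       # assumed
]
print(f"{expected_read_latency_us(tiers):.1f} us")
```

With these assumed numbers, the 1% of reads that fall through to S3 account for almost all of the average (about 510 µs), which is why the cache hit ratio matters far more than the raw S3 latency.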

## Performance Results

Here are pgbench results from PostgreSQL running on this setup:

### Read/Write Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.943 ms
initial connection time = 48.043 ms
tps = 53041.006947 (without initial connection time)
```

### Read-Only Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.121 ms
initial connection time = 53.358 ms
tps = 413436.248089 (without initial connection time)
```

These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
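
As a quick consistency check on these figures: with a fixed number of clients that each run one transaction at a time, the average latency should be roughly the client count divided by the throughput. The sketch below only redoes that arithmetic with the values from the two runs above; the function name is made up for illustration.

```python
# Sanity check of the reported pgbench figures: with N clients each
# running one transaction at a time, average latency ~= N / tps.

def avg_latency_ms(clients: int, tps: float) -> float:
    """Average per-transaction latency in milliseconds."""
    return clients / tps * 1000.0


# Figures copied from the two pgbench runs above.
read_write = avg_latency_ms(clients=50, tps=53041.006947)
read_only = avg_latency_ms(clients=50, tps=413436.248089)

print(f"read/write: {read_write:.3f} ms")  # pgbench reported 0.943 ms
print(f"read-only:  {read_only:.3f} ms")   # pgbench reported 0.121 ms
```

Both computed values agree with the `latency average` lines pgbench printed, so the throughput and latency numbers are internally consistent.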

## How It Works

1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
2. Multiple cache layers hide S3 latency:
   a. ZFS ARC/L2ARC for frequently accessed blocks
   b. ZeroFS memory cache for metadata and hot data
   c. Optional local disk cache
3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
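
To illustrate the last step, here is a minimal sketch of how a byte range maps onto fixed-size 128 KB chunks. This is not ZeroFS's code or key format; `chunk_spans` and its return shape are hypothetical and only show the arithmetic involved.

```python
# Hypothetical sketch: not ZeroFS's code or key format, only an
# illustration of splitting a byte range into fixed-size chunks.

CHUNK_SIZE = 128 * 1024  # 128 KB, the chunk size mentioned in the post


def chunk_spans(offset: int, length: int, chunk_size: int = CHUNK_SIZE):
    """Yield (chunk_index, start_in_chunk, span_length) for a byte range."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    end = offset + length
    pos = offset
    while pos < end:
        index = pos // chunk_size
        start = pos % chunk_size
        # Stop at the chunk boundary or at the end of the request.
        span = min(chunk_size - start, end - pos)
        yield index, start, span
        pos += span


# A 200 KB read starting 100 KB into a file touches chunks 0, 1 and 2.
for index, start, span in chunk_spans(100 * 1024, 200 * 1024):
    print(index, start, span)
```

Each yielded tuple could then be turned into one key lookup in whatever store holds the chunks, which is why small random reads never need to fetch more than a few chunks.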

## Geo-Distributed PostgreSQL

Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.

Example architectures:

Architecture 1:

                         PostgreSQL Client
                                   |
                                   | SQL queries
                                   |
                            +--------------+
                            |  PG Proxy    |
                            | (HAProxy/    |
                            |  PgBouncer)  |
                            +--------------+
                               /        \
                              /          \
                   Synchronous            Synchronous
                   Replication            Replication
                            /              \
                           /                \
              +---------------+        +---------------+
              | PostgreSQL 1  |        | PostgreSQL 2  |
              | (Primary)     |◄------►| (Standby)     |
              +---------------+        +---------------+
                      |                        |
                      |  POSIX filesystem ops  |
                      |                        |
              +---------------+        +---------------+
              |   ZFS Pool 1  |        |   ZFS Pool 2  |
              | (3-way mirror)|        | (3-way mirror)|
              +---------------+        +---------------+
               /      |      \          /      |      \
              /       |       \        /       |       \
        NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
             |        |        |           |        |        |
        +--------++--------++--------++--------++--------++--------+
        |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
        +--------++--------++--------++--------++--------++--------+
             |         |         |         |         |         |
             |         |         |         |         |         |
        S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
        (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)

Architecture 2:

PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
              \                                /
                 \                          /
                     Same ZFS Pool (NBD)
                              |
                       6 Global ZeroFS
                              |
                         S3 Regions

The main advantages I see are:
1. Dramatic cost reduction for large datasets
2. Simplified geo-distribution
3. Infinite storage capacity
4. Built-in encryption and compression

Looking forward to your feedback and questions!

Best,
Pierre

P.S. The full project includes a custom NFS filesystem too.

Laurenz Albe

laurenz.albe@cybertec.at

11 months ago

In reply to: Pierre Barre (#1)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

On Fri, 2025-07-18 at 00:57 +0200, Pierre Barre wrote:

> Looking forward to your feedback and questions!

I think the biggest hurdle you will have to overcome is to
convince notoriously paranoid DBAs that this tall stack
provides reliable service, honors fsync() etc.

Performance is great, but it is not everything. If things
perform surprisingly well, people become suspicious.

> P.S. The full project includes a custom NFS filesystem too.

"NFS" is a key word that does not inspire confidence in
PostgreSQL circles...

Yours,
Laurenz Albe

Pierre Barre

pierre@barre.sh

11 months ago

In reply to: Laurenz Albe (#2)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

Hi Laurenz,

> I think the biggest hurdle you will have to overcome is to
> convince notoriously paranoid DBAs that this tall stack
> provides reliable service, honors fsync() etc.

Indeed, but that doesn't have to be "sudden." I think we need to gain confidence in the whole system gradually by starting with throwaway workloads (e.g., persistent volumes in CI), then moving to data we can afford to lose, then backups, and finally to production data.

>> P.S. The full project includes a custom NFS filesystem too.
>
> "NFS" is a key word that does not inspire confidence in
> PostgreSQL circles...

I've had my fair share of major annoyances with NFS too!

I think bad experiences with NFS mostly come from a combination of bad hardware, a bad NFS server implementation, and the kernel treating the mount mostly like a "local" filesystem (in terms of failure behavior).

So when it doesn't work well, everything goes down.

But the protocols themselves are not inherently bad—they are actually quite elegant. NFSv3 is just what you need to reach (very close to) POSIX compliance. The NFS server implementation in ZeroFS passes all 8,662 tests in https://github.com/Barre/pjdfstest_nfs.

https://github.com/Barre/ZeroFS/actions/runs/16367571315/job/46248240251#step:11:9376

For database workloads specifically, users will probably prefer running something like ZFS on top of the NBD server rather than using NFS directly.

Best,
Pierre

Seref Arikan

serefarikan@gmail.com

11 months ago

In reply to: Pierre Barre (#1)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

Sorry, this was meant to go to the whole group:

Very interesting! Great work. Can you clarify how exactly you're running
postgres in your tests? A specific AWS service? What's the test
infrastructure that sits above the file system?

Pierre Barre

pierre@barre.sh

11 months ago

In reply to: Seref Arikan (#4)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

Hi Seref,

For the benchmarks, I used Hetzner's cloud service with the following setup:

- A Hetzner S3 bucket in the FSN1 region
- A ccx63 virtual machine (48 vCPU, 192 GB memory)
- 3 ZeroFS NBD devices (same S3 bucket)
- A ZFS striped pool with the 3 devices
- 200 GB ZFS L2ARC
- Postgres configured accordingly memory-wise, as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
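
For readers who want to picture the storage layout, here is a minimal sketch that only builds and prints `zpool` command lines for a striped pool over three NBD devices plus an L2ARC cache device. The pool name `tank`, the device paths, and the helper functions are assumptions for illustration, not taken from the benchmark machine, and nothing is executed.

```python
# Sketch only: builds zpool command lines as argument lists and prints
# them. Pool name and device paths are hypothetical placeholders.
import shlex


def zpool_create(pool, devices, mirror=False):
    """Return argv for creating a striped (default) or mirrored pool."""
    if not devices:
        raise ValueError("at least one device is required")
    argv = ["zpool", "create", pool]
    if mirror:
        argv.append("mirror")  # all listed devices form one mirror vdev
    return argv + list(devices)


def zpool_add_cache(pool, cache_device):
    """Return argv for attaching an L2ARC (cache) device to a pool."""
    return ["zpool", "add", pool, "cache", cache_device]


nbd_devices = ["/dev/nbd0", "/dev/nbd1", "/dev/nbd2"]  # assumed paths
commands = [
    zpool_create("tank", nbd_devices),        # striped, as in the benchmark
    zpool_add_cache("tank", "/dev/nvme0n1"),  # assumed local cache device
]
for argv in commands:
    print(shlex.join(argv))
```

Passing `mirror=True` would instead produce a single mirror vdev, as in the 3-way mirrors of Architecture 1. A striped pool, by contrast, has no redundancy at the ZFS level.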

Best,
Pierre

Pierre Barre

pierre@barre.sh

11 months ago

In reply to: Pierre Barre (#5)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

Now, I'm trying to understand how the CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs: you choose between consistency and availability during partitions.

But when PostgreSQL instances share storage rather than replicate:
- Consistency seems maintained (same data)
- Availability seems maintained (client can always promote an accessible node)
- Partitions between PostgreSQL nodes don't prevent the system from functioning

It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.

How should we think about the CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?

   Client with awareness of both PostgreSQL nodes
        |                               |
        ↓       (partition here)        ↓
PostgreSQL Primary              PostgreSQL Standby
        |                               |
        └───────────┬───────────────────┘
                    ↓
             Shared ZFS Pool
                    |
        6 Global ZeroFS instances
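
One way to make the question concrete is to model the shared storage as the component that decides who may write. The sketch below is purely hypothetical: the `FencedStore` class and its epoch scheme are not part of ZeroFS, ZFS, or PostgreSQL. A promoted node obtains a higher epoch, and writes carrying an older epoch are rejected.

```python
# Hypothetical model, not an API of ZeroFS, ZFS, or PostgreSQL.

class StaleEpochError(Exception):
    """Raised when a write carries an epoch older than the current one."""


class FencedStore:
    """Shared storage that accepts writes only from the latest writer epoch."""

    def __init__(self):
        self.epoch = 0
        self.data = {}

    def promote(self) -> int:
        """Grant a new, higher epoch to whichever node is being promoted."""
        self.epoch += 1
        return self.epoch

    def write(self, epoch: int, key: str, value: str) -> None:
        if epoch != self.epoch:
            raise StaleEpochError(f"epoch {epoch} fenced by {self.epoch}")
        self.data[key] = value


store = FencedStore()
primary_epoch = store.promote()  # original primary
store.write(primary_epoch, "row", "v1")

# Partition between the nodes: the client promotes the standby.
standby_epoch = store.promote()
store.write(standby_epoch, "row", "v2")

# The old primary is still running, but its writes are now rejected.
try:
    store.write(primary_epoch, "row", "v3")
except StaleEpochError as exc:
    print("old primary fenced:", exc)
print(store.data["row"])
```

Under these assumptions the coordination has not disappeared. It has moved into the storage layer, and a node that cannot reach that layer cannot serve writes at all.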

Best,
Pierre

Seref Arikan

serefarikan@gmail.com

11 months ago

In reply to: Pierre Barre (#5)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

Thanks, I learned something else: I didn't know Hetzner offered
S3-compatible storage.

The interesting thing is, a few searches about the performance return
mostly negative impressions about their object storage in comparison to the
original S3.

Finding out what kind of performance your benchmarks would yield on a pure
AWS setting would be interesting. I am not asking you to do that, but you
may get even better performance in that case :)

Cheers,
Seref

Pierre Barre

pierre@barre.sh

11 months ago

In reply to: Seref Arikan (#7)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

> The interesting thing is, a few searches about the performance return mostly negative impressions about their object storage in comparison to the original S3.

I think they had a rough start, but it's quite good now from what I've experienced. It's also dirt-cheap, and they don't bill for operations. So if you run ZeroFS on that you only pay for raw storage at €4.99 a month.

Combine that with their dirt-cheap dedicated servers (https://www.hetzner.com/dedicated-rootserver/matrix-ax/) and you can have a multi-terabyte Postgres database for under €50 a month.

I'm dreaming of running https://www.merklemap.com/ on such a setup, but it's too early yet :)

Finding out what kind of performance your benchmarks would yield on a pure AWS setting would be interesting. I am not asking you to do that, but you may get even better performance in that case :)

Yes, I need to try that!

Best,
Pierre

Pierre Barre

pierre@barre.sh

11 months ago

In reply to: Laurenz Albe (#2)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

"NFS" is a key word that does not inspire confidence in

PostgreSQL circles...

Coming back to this, I just implemented 9P, which should translate to proper fsync semantics.

mount -t 9p -o trans=tcp,port=5564,version=9p2000.L,msize=65536,access=user 127.0.0.1 /mnt/9p
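A quick way to see what fsync costs on a mount like this is to time it directly. The sketch below is only an illustration and not part of ZeroFS: the helper name `time_fsyncs`, the sample count, and the 8 KiB payload are assumptions. It measures latency only; it cannot prove that data actually survives a crash.

```python
import os
import tempfile
import time


def time_fsyncs(directory, samples=20, payload=b"x" * 8192):
    """Return per-call fsync latencies (in seconds) for small appends in `directory`."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # returns once the OS reports the data as flushed
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


if __name__ == "__main__":
    # Hypothetical target: replace with the mount point under test, e.g. /mnt/9p.
    target = tempfile.gettempdir()
    results = sorted(time_fsyncs(target))
    print(f"median fsync latency: {results[len(results) // 2] * 1e3:.3f} ms")
```

If the median on the 9P mount is indistinguishable from a RAM-backed directory, that is a hint the flush may not be reaching the backing store; a latency closer to a network round trip is more plausible for an S3-backed system.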

Best,
Pierre

#10

Nico Williams

nico@cryptonector.com

11 months ago

In reply to: Laurenz Albe (#2)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

On Fri, Jul 18, 2025 at 06:40:58AM +0200, Laurenz Albe wrote:

On Fri, 2025-07-18 at 00:57 +0200, Pierre Barre wrote:

Looking forward to your feedback and questions!

I think the biggest hurdle you will have to overcome is to
convince notoriously paranoid DBAs that this tall stack
provides reliable service, honors fsync() etc.

Is there a test suite that can be used to test PG's ACIDity in the face
of simulated power failures?

Performance is great, but it is not everything. If things
perform surprisingly well, people become suspicious.

P.S. The full project includes a custom NFS filesystem too.

"NFS" is a key word that does not inspire confidence in
PostgreSQL circles...

Certainly NFSv3 should. NFSv4 is much safer but I've no experience
running PG on it and I assume there will be cases where recovery from
network and/or server failures is slow.

#11

Nico Williams

nico@cryptonector.com

11 months ago

In reply to: Pierre Barre (#5)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

On Fri, Jul 18, 2025 at 12:57:39PM +0200, Pierre Barre wrote:

- Postgres configured accordingly memory-wise as well as with
synchronous_commit = off, wal_init_zero = off and wal_recycle = off.

Bingo. That's why it's fast (synchronous_commit = off). It's also why
it's not safe _unless_ you have a local, fast, persistent ZIL device
(which I assume you don't).

Nico
--

#12

Pierre Barre

pierre@barre.sh

11 months ago

In reply to: Nico Williams (#11)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

It’s not a matter of “safe” or “unsafe”: there are mountains of valid workloads that don’t require synchronous_commit. synchronous_commit doesn’t automatically make your system safe either, and if that’s a requirement, there are many workarounds, as you suggested. It certainly doesn’t make the setup useless.

Best,
Pierre

#13

Jeff Ross

jross@openvistas.net

11 months ago

In reply to: Pierre Barre (#12)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

This then begs the obvious question: how fast is this with
synchronous_commit = on?

#14

Marco Torres

mtors25@gmail.com

11 months ago

In reply to: Pierre Barre (#1)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

My humble take on this project: well done! You are opening the doors to
work on a much-needed endeavor, decouple compute from storage, and
potentially elaborate on other projects for an active/active cluster! I
applaud you.

#15

Pierre Barre

pierre@barre.sh

11 months ago

In reply to: Marco Torres (#14)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

Hi Marco,

Thanks for the kind words!

and potentially elaborate on other projects for an active/active cluster! I applaud you.

I wrote an argument there: https://github.com/Barre/ZeroFS?tab=readme-ov-file#cap-theorem

I definitely want to write a proof of concept when I get some time.

Best,
Pierre

#16

Pierre Barre

pierre@barre.sh

11 months ago

In reply to: Jeff Ross (#13)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

This then begs the obvious question of how fast is this with
synchronous_commit = on?

Probably not awful, especially with commit_delay.

I'll try that and report back.

Best,
Pierre

#17

Pierre Barre

pierre@barre.sh

11 months ago

In reply to: Pierre Barre (#16)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

Hi,

I went ahead and did that test.

Here is the postgresql config I used for reference (note the wal options (recycle, init_zero) as well as full_page_writes = off, because ZeroFS cannot have torn writes by design).

https://gist.github.com/Barre/8d68f0d00446389998a31f4e60f3276d
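For readers unfamiliar with the term: a torn write is a page that reaches storage only partially because a crash interrupted it, which is what full_page_writes normally protects against. The toy simulation below is a sketch under stated assumptions (a 4 KiB atomic sector size, and the hypothetical helper `write_page`); it says nothing about how ZeroFS itself achieves atomic writes.

```python
PAGE_SIZE = 8192    # PostgreSQL's default block size
SECTOR_SIZE = 4096  # assumed atomic write unit of a conventional disk


def write_page(disk, new_page, crash_after_sectors=None):
    """Write new_page sector by sector; stop early to mimic a crash mid-write."""
    sectors = PAGE_SIZE // SECTOR_SIZE
    limit = sectors if crash_after_sectors is None else crash_after_sectors
    for i in range(limit):
        lo, hi = i * SECTOR_SIZE, (i + 1) * SECTOR_SIZE
        disk[lo:hi] = new_page[lo:hi]


old = bytes([0xAA]) * PAGE_SIZE
new = bytes([0xBB]) * PAGE_SIZE

disk = bytearray(old)
write_page(disk, new, crash_after_sectors=1)  # crash after the first sector

# A torn page is neither the old version nor the new one.
torn = bytes(disk) not in (old, new)
print("torn page after crash:", torn)
```

On storage where a page write is all-or-nothing, the page after a crash is always either `old` or `new`, which is the property that makes turning full_page_writes off defensible.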

The test ran on Azure, on a Standard D16ads v5 (16 vCPUs, 64 GiB memory).

This time I didn't run ZFS with L2ARC; I just mounted ZeroFS over 9P.

synchronous_commit = off

postgres@zerofs:~$ pgbench -vvv -c 100 -j 40 -t 1000 bench
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
latency average = 6.239 ms
initial connection time = 68.922 ms
tps = 16026.940646 (without initial connection time)

synchronous_commit = on

postgres@zerofs:~$ pgbench -vvv -c 50 -j 15 -t 1000 bench
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 50000/50000
number of failed transactions: 0 (0.000%)
latency average = 197.723 ms
initial connection time = 46.089 ms
tps = 252.878721 (without initial connection time)

Not great bare-bones with synchronous_commit = on, but still usable!
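The two runs above used different client counts (100 vs 50), so the raw tps figures are not directly comparable. A small sketch, using only the numbers printed by pgbench above, checks them against Little's law (throughput = concurrency / average latency) and normalizes to tps per client. The function name `expected_tps` is just an illustrative helper.

```python
# (clients, reported average latency in ms, reported tps) copied from the runs above
runs = {
    "synchronous_commit = off": (100, 6.239, 16026.940646),
    "synchronous_commit = on": (50, 197.723, 252.878721),
}


def expected_tps(clients, latency_ms):
    """Little's law: throughput = concurrency / average latency."""
    return clients / (latency_ms / 1000.0)


for name, (clients, latency_ms, reported) in runs.items():
    estimate = expected_tps(clients, latency_ms)
    per_client = reported / clients
    print(f"{name}: estimated {estimate:.0f} tps, reported {reported:.0f} tps, "
          f"{per_client:.2f} tps per client")
```

Per client, that works out to roughly 160 tps with synchronous_commit = off versus roughly 5 tps with it on, a gap of about 30x, which is consistent with each commit waiting on a flush of around 200 ms.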

Best,
Pierre

#18

Pierre Barre

pierre@barre.sh

11 months ago

In reply to: Pierre Barre (#17)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

I built Postgres (same version, 16.9) with --with-blocksize=32 (I'd really love for this to be an initdb-time flag!) and did some more testing:

synchronous_commit = off

postgres@zerofs:~$ pgbench -vvv -c 100 -j 40 -t 10000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 5.727 ms
initial connection time = 59.223 ms
tps = 17460.128835 (without initial connection time)

synchronous_commit = on

postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 1000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
latency average = 301.800 ms
initial connection time = 62.237 ms
tps = 331.345391 (without initial connection time)

=====================================

Then, using the same setup (same server, same Postgres build), I created a ZeroFS NBD device with ext4 on top:

/dev/nbd0 on /mnt_9p type ext4 (rw,relatime,stripe=32)

synchronous_commit = off

postgres@zerofs:/mnt_9p$ pgbench -vvv -c 100 -j 40 -t 10000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 3.615 ms
initial connection time = 45.653 ms
tps = 27665.373366 (without initial connection time)

synchronous_commit = on

postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 1000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
latency average = 337.762 ms
initial connection time = 43.969 ms
tps = 296.066616 (without initial connection time)
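To make the two configurations easier to compare, the sketch below computes the relative difference in tps between them. It assumes the first pair of runs in this message used the 9P mount, as in the previous message; the dictionary labels and the helper `relative_change` are illustrative only.

```python
# tps figures copied from the runs above (32 kB block build, 100 clients, 40 threads)
results = {
    "synchronous_commit = off": {"9P": 17460.128835, "ext4 on NBD": 27665.373366},
    "synchronous_commit = on": {"9P": 331.345391, "ext4 on NBD": 296.066616},
}


def relative_change(baseline, other):
    """Percentage change of `other` relative to `baseline`."""
    return (other - baseline) / baseline * 100.0


for setting, tps in results.items():
    change = relative_change(tps["9P"], tps["ext4 on NBD"])
    print(f"{setting}: ext4 on NBD vs 9P = {change:+.1f}%")
```

With asynchronous commits, ext4 on NBD comes out roughly 58% ahead; with synchronous commits it is about 11% behind. These are single runs, so the smaller gap in particular may be within run-to-run noise.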

Best,
Pierre

#19

Pierre Barre

pierre@barre.sh

11 months ago

In reply to: Pierre Barre (#18)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

And finally, some read-only benchmarks with the same Postgres build.

9P:

postgres@zerofs:/mnt_9p$ pgbench -vvv -c 100 -j 40 -t 10000 bench -S
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 0.539 ms
initial connection time = 59.157 ms
tps = 185652.686153 (without initial connection time)

ext4:

postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 10000 bench -S
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 0.547 ms
initial connection time = 44.054 ms
tps = 182836.180428 (without initial connection time)
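
As a quick sanity check on the two runs above: pgbench clients each run one transaction at a time, so throughput should be roughly the number of clients divided by the average latency. The sketch below is only that arithmetic; the helper name `expected_tps` is made up for illustration, and the figures are copied from the output above. Small deviations are expected because pgbench prints the latency rounded to three decimals.

```python
# Rough cross-check of pgbench output: with N closed-loop clients,
# tps is approximately N / average latency.

def expected_tps(clients: int, latency_ms: float) -> float:
    """Approximate tps for `clients` clients that each take `latency_ms` per transaction."""
    return clients / (latency_ms / 1000.0)

# (label, clients, latency average in ms, reported tps), copied from the runs above
runs = [
    ("9P, select only", 100, 0.539, 185652.686153),
    ("ext4, select only", 100, 0.547, 182836.180428),
]

for label, clients, latency_ms, reported in runs:
    estimate = expected_tps(clients, latency_ms)
    deviation = abs(estimate - reported) / reported
    print(f"{label}: estimated {estimate:.0f} tps, "
          f"reported {reported:.0f} tps, deviation {deviation:.2%}")
```

The same relation holds for the read/write runs, which is why the synchronous_commit = on results show both high latency and low tps: they are two views of the same measurement.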

Best,
Pierre

On Sat, Jul 26, 2025, at 03:16, Pierre Barre wrote:

I built Postgres (same version, 16.9) but with --with-blocksize=32 (I'd
really love it if this were an initdb-time flag!) and did some more
testing:

synchronous_commit = off

postgres@zerofs:~$ pgbench -vvv -c 100 -j 40 -t 10000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 5.727 ms
initial connection time = 59.223 ms
tps = 17460.128835 (without initial connection time)

synchronous_commit = on

postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 1000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
latency average = 301.800 ms
initial connection time = 62.237 ms
tps = 331.345391 (without initial connection time)

=====================================

Then, using the same setup (same server, same Postgres build), I created
a ZeroFS NBD device with ext4 on top:

/dev/nbd0 on /mnt_9p type ext4 (rw,relatime,stripe=32)

synchronous_commit = off

postgres@zerofs:/mnt_9p$ pgbench -vvv -c 100 -j 40 -t 10000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 3.615 ms
initial connection time = 45.653 ms
tps = 27665.373366 (without initial connection time)

synchronous_commit = on

postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 1000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
latency average = 337.762 ms
initial connection time = 43.969 ms
tps = 296.066616 (without initial connection time)

Best,
Pierre

On Fri, Jul 25, 2025, at 11:25, Pierre Barre wrote:

Hi,

I went ahead and did that test.

Here is the PostgreSQL config I used, for reference (note the WAL
options wal_recycle and wal_init_zero, as well as full_page_writes = off,
because ZeroFS cannot have torn writes by design).

https://gist.github.com/Barre/8d68f0d00446389998a31f4e60f3276d

The test ran on an Azure Standard D16ads v5 instance (16 vCPUs, 64 GiB memory).

This time, I didn't run ZFS with L2ARC; I just mounted ZeroFS over 9P.

synchronous_commit = off

postgres@zerofs:~$ pgbench -vvv -c 100 -j 40 -t 1000 bench
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
latency average = 6.239 ms
initial connection time = 68.922 ms
tps = 16026.940646 (without initial connection time)

synchronous_commit = on

postgres@zerofs:~$ pgbench -vvv -c 50 -j 15 -t 1000 bench
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 50000/50000
number of failed transactions: 0 (0.000%)
latency average = 197.723 ms
initial connection time = 46.089 ms
tps = 252.878721 (without initial connection time)

Not great barebones with synchronous_commit = on, but still usable!

Best,
Pierre

#20

Vladimir Churyukin

vladimir@churyukin.com

11 months ago

In reply to: Pierre Barre (#6)

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

Shared storage would require a lot of extra work. That's essentially what
AWS Aurora does.
You would need functionality to sync in-memory state between nodes,
because every instance will have cached data that can easily become
stale after any write operation.
That alone is not simple. You would have to modify some of the locking
logic, and most likely make many other changes in many places. Postgres
was simply not built with the assumption that storage can be shared.
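
To make the stale-cache point concrete, here is a toy sketch. It is not how Postgres's buffer manager works; `SharedStorage` and `Node` are hypothetical stand-ins, and the only thing it shows is that two nodes with private caches over the same storage disagree after one of them writes.

```python
class SharedStorage:
    """Stand-in for a block device that several database nodes can see."""

    def __init__(self):
        self.pages = {}

    def read(self, page_id):
        return self.pages.get(page_id)

    def write(self, page_id, data):
        self.pages[page_id] = data


class Node:
    """A database instance with its own private in-memory page cache."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage
        self.cache = {}

    def read(self, page_id):
        # Serve from the private cache when possible; nothing tells this
        # node that another node may have changed the page in the meantime.
        if page_id not in self.cache:
            self.cache[page_id] = self.storage.read(page_id)
        return self.cache[page_id]

    def write(self, page_id, data):
        # Write-through: update the local cache and the shared storage.
        self.cache[page_id] = data
        self.storage.write(page_id, data)


storage = SharedStorage()
storage.write(1, "balance=100")

node_a = Node("a", storage)
node_b = Node("b", storage)

node_b.read(1)                   # node_b now holds page 1 in its cache
node_a.write(1, "balance=50")    # node_a changes the page on shared storage

stale = node_b.read(1)           # served from node_b's cache
fresh = storage.read(1)          # what is actually on shared storage
print(f"node_b sees {stale!r}, storage holds {fresh!r}")
```

Fixing this means either invalidating or refreshing other nodes' caches on every write, which is the kind of coordination layer that a shared-storage design has to add.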

-Vladimir

On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pierre@barre.sh> wrote:

Now, I'm trying to understand how CAP theorem applies here. Traditional
PostgreSQL replication has clear CAP trade-offs - you choose between
consistency and availability during partitions.

But when PostgreSQL instances share storage rather than replicate:
- Consistency seems maintained (same data)
- Availability seems maintained (client can always promote an accessible
node)
- Partitions between PostgreSQL nodes don't prevent the system from
functioning

It seems that CAP assumes specific implementation details (like nodes
maintaining independent state) without explicitly stating them.

How should we think about CAP theorem when distributed nodes share storage
rather than coordinate state? Are the trade-offs simply moved to a
different layer, or does shared storage fundamentally change the analysis?

 Client with awareness of both PostgreSQL nodes
        |                               |
        ↓       (partition here)        ↓
PostgreSQL Primary              PostgreSQL Standby
        |                               |
        └───────────┬───────────────────┘
                    ↓
             Shared ZFS Pool
                    |
        6 Global ZeroFS instances

Best,
Pierre

On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:

Hi Seref,

For the benchmarks, I used Hetzner's cloud service with the following
setup:

- A Hetzner S3 bucket in the FSN1 region
- A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
- 3 ZeroFS NBD devices (same S3 bucket)
- A ZFS striped pool with the 3 devices
- 200GB ZFS L2ARC
- Postgres configured accordingly memory-wise as well as with
synchronous_commit = off, wal_init_zero = off and wal_recycle = off.

Best,
Pierre

On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:

Sorry, this was meant to go to the whole group:

Very interesting! Great work. Can you clarify how exactly you're
running Postgres in your tests? A specific AWS service? What's the test
infrastructure that sits above the file system?

On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pierre@barre.sh> wrote:

Hi everyone,

I wanted to share a project I've been working on that enables
PostgreSQL to run on S3 storage while maintaining performance comparable to
local NVMe. The approach uses block-level access rather than trying to map
filesystem operations to S3 objects.

ZeroFS: https://github.com/Barre/ZeroFS

# The Architecture

ZeroFS provides NBD (Network Block Device) servers that expose S3
storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built
on these block devices:

PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3

By providing block-level access and leveraging ZFS's caching
capabilities (L2ARC), we can achieve microsecond latencies despite the
underlying storage being in S3.

## Performance Results

Here are pgbench results from PostgreSQL running on this setup:

### Read/Write Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.943 ms
initial connection time = 48.043 ms
tps = 53041.006947 (without initial connection time)
```

### Read-Only Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.121 ms
initial connection time = 53.358 ms
tps = 413436.248089 (without initial connection time)
```

These numbers are with 50 concurrent clients and the actual data
stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches,
while cold data comes from S3.
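
The hot/cold split described above can be pictured as a read-through lookup across tiers. This is a simplified sketch, not ZeroFS or ZFS code: the class `TieredReader`, the tier names, and the promote-on-read policy are assumptions for illustration (real L2ARC, for instance, is populated from ARC eviction rather than on read), and there is no eviction at all here.

```python
from typing import Callable, Dict, List, Tuple


class TieredReader:
    """Look a block up in successively slower caches, then in cold storage."""

    def __init__(self, tier_names: List[str], fetch_cold: Callable[[int], bytes]):
        # Fastest tier first; each tier is just a dict in this sketch.
        self.tiers: List[Tuple[str, Dict[int, bytes]]] = [(n, {}) for n in tier_names]
        self.fetch_cold = fetch_cold

    def read(self, block_id: int) -> Tuple[bytes, str]:
        """Return the block and the name of the tier that served it."""
        for i, (name, cache) in enumerate(self.tiers):
            if block_id in cache:
                data = cache[block_id]
                # Copy the block into every faster tier (assumed policy).
                for _, faster in self.tiers[:i]:
                    faster[block_id] = data
                return data, name
        # Miss everywhere: go to the object store and fill all tiers.
        data = self.fetch_cold(block_id)
        for _, cache in self.tiers:
            cache[block_id] = data
        return data, "object store"


def fake_s3_fetch(block_id: int) -> bytes:
    """Hypothetical stand-in for a slow S3 GET."""
    return f"block-{block_id}".encode()


reader = TieredReader(["ARC", "L2ARC", "ZeroFS memory cache"], fake_s3_fetch)
first = reader.read(7)    # cold: comes from the object store
second = reader.read(7)   # hot: comes from the fastest tier
print(first[1], "->", second[1])
```

Only the first access to a block pays the object-store round trip, which is why a benchmark whose working set fits in the caches can report latencies far below S3's.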

## How It Works

1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS
can use like any other block device
2. Multiple cache layers hide S3 latency:
   a. ZFS ARC/L2ARC for frequently accessed blocks
   b. ZeroFS memory cache for metadata and hot data
   c. Optional local disk cache
3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
4. Files are split into 128KB chunks for insertion into ZeroFS'
LSM-tree
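
For step 4, here is a small sketch of how a byte range maps onto fixed-size chunks. It assumes "128KB" means 128 KiB (131072 bytes); the function name `chunks_for_range` is made up, and how ZeroFS actually keys chunks in its LSM-tree is not shown here.

```python
from typing import Iterator, Tuple

CHUNK_SIZE = 128 * 1024  # assumption: 128KB is read as 128 KiB


def chunks_for_range(offset: int, length: int,
                     chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, int, int]]:
    """Yield (chunk_index, start_in_chunk, end_in_chunk) covering a byte range."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    end = offset + length
    pos = offset
    while pos < end:
        index = pos // chunk_size          # which chunk this byte falls in
        start = pos % chunk_size           # position inside that chunk
        stop = min(chunk_size, start + (end - pos))
        yield index, start, stop
        pos += stop - start


# An 8 KiB write that straddles the boundary between chunk 0 and chunk 1.
touched = list(chunks_for_range(CHUNK_SIZE - 4096, 8192))
print(touched)
```

A write that stays inside one chunk touches a single entry, while one that crosses a boundary touches two, so an aligned 8 KiB Postgres page always lands in exactly one 128 KiB chunk.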

## Geo-Distributed PostgreSQL

Since each region can run its own ZeroFS instance, you can create
geographically distributed PostgreSQL setups.

Example architectures:

Architecture 1:

                      PostgreSQL Client
                              |
                              | SQL queries
                              |
                      +--------------+
                      |  PG Proxy    |
                      | (HAProxy/    |
                      |  PgBouncer)  |
                      +--------------+
                         /        \
                        /          \
             Synchronous            Synchronous
             Replication            Replication
                      /              \
                     /                \
           +---------------+        +---------------+
           | PostgreSQL 1  |        | PostgreSQL 2  |
           | (Primary)     |◄------►| (Standby)     |
           +---------------+        +---------------+
                 |                        |
                 |  POSIX filesystem ops  |
                 |                        |
           +---------------+        +---------------+
           |   ZFS Pool 1  |        |   ZFS Pool 2  |
           | (3-way mirror)|        | (3-way mirror)|
           +---------------+        +---------------+
          /      |      \          /      |      \
         /       |       \        /       |       \
NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
    |        |        |           |        |        |
+--------++--------++--------++--------++--------++--------+
|ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
+--------++--------++--------++--------++--------++--------+
    |         |         |         |         |         |
    |         |         |         |         |         |
S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
(us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)

Architecture 2:

PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
                \                    /
                 \                  /
                  Same ZFS Pool (NBD)
                           |
                    6 Global ZeroFS
                           |
                       S3 Regions

The main advantages I see are:
1. Dramatic cost reduction for large datasets
2. Simplified geo-distribution
3. Infinite storage capacity
4. Built-in encryption and compression

Looking forward to your feedback and questions!

Best,
Pierre

P.S. The full project includes a custom NFS filesystem too.