Proposal: Exploring LSM Tree‑Based Storage Engine for PostgreSQL (Inspired by MyRocks)

Started by Manish Rai Jain · 8 months ago · 5 messages

#1 Manish Rai Jain
manishrjain@gmail.com

Hi hackers,

I’ve been exploring the idea of integrating an LSM tree–based storage engine
into PostgreSQL — similar in spirit to MyRocks for MySQL — by
replacing the underlying
storage while preserving PostgreSQL’s upper layers (planner, executor, MVCC,
etc.).

The motivation stems from the well‑known write‑amplification issues
with B‑trees
under high write throughput. An LSM‑based engine could offer:

- Significant improvements in write performance and space efficiency,
especially under heavy ingestion workloads.
- Better scalability with larger datasets, particularly when compression
is applied.
- Comparable read performance (with trade‑offs depending on workload),
and opportunities to optimize through Bloom filters, tiered compaction
strategies, etc.
- Reduced reliance on manual VACUUM: obsolete versions would be purged
naturally during LSM compactions, potentially eliminating routine heap
vacuuming (transaction‑ID wrap‑around handling and stats collection would
still need careful design).
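To make the Bloom-filter point concrete, here is a minimal sketch of the kind of per-SSTable filter an LSM engine consults to skip files that cannot contain a key. The sizes, hash choice, and names are illustrative assumptions, not any particular engine's implementation:

```c
#include <stdint.h>

/* Minimal Bloom filter: one bit array queried with k derived hash
 * probes. An LSM engine keeps one per SSTable so point reads can skip
 * files that definitely do not contain the key. Sizes illustrative. */
#define BLOOM_BITS 1024
#define BLOOM_K    4

typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom_t;

/* FNV-1a, seeded so one hash function yields k independent probes */
uint64_t bloom_hash(const char *key, uint64_t seed)
{
    uint64_t h = 1469598103934665603ULL ^ seed;
    for (; *key; key++) {
        h ^= (uint8_t) *key;
        h *= 1099511628211ULL;
    }
    return h;
}

void bloom_add(bloom_t *b, const char *key)
{
    for (uint64_t i = 0; i < BLOOM_K; i++) {
        uint64_t bit = bloom_hash(key, i) % BLOOM_BITS;
        b->bits[bit / 8] |= (uint8_t) (1u << (bit % 8));
    }
}

/* 0 => key definitely absent; 1 => key possibly present */
int bloom_maybe_contains(const bloom_t *b, const char *key)
{
    for (uint64_t i = 0; i < BLOOM_K; i++) {
        uint64_t bit = bloom_hash(key, i) % BLOOM_BITS;
        if (!(b->bits[bit / 8] & (uint8_t) (1u << (bit % 8))))
            return 0;
    }
    return 1;
}
```

A "definitely absent" answer lets a read skip an SSTable entirely, which is what keeps LSM point-read amplification in check.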

The hoped‑for outcome is a faster, more scalable PostgreSQL for >1 TB
workloads, while maintaining the rich feature set and ecosystem
compatibility users expect from Postgres.

Unlike Neon, this approach is not targeting cloud‑native object storage or
remote WAL streaming, but instead optimizing for maximum performance on
local disks or high‑performance block volumes, where write throughput and
compaction efficiency matter most.

This would likely involve implementing a new Table Access Method
(TAM), possibly
backed by a forked engine such as BadgerDB or RocksDB, adapted to
support PostgreSQL’s
MVCC and WAL semantics.
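As a rough illustration of the MVCC adaptation (using a deliberately simplified visibility rule, not PostgreSQL's actual one), compaction could drop a tuple version once its deleting transaction is older than every active snapshot, which is the work routine VACUUM does today:

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified tuple version: xmin = inserting xid, xmax = deleting xid
 * (0 = not deleted). Real PostgreSQL visibility rules are far more
 * involved; this only sketches the compaction-time purge idea. */
typedef struct {
    uint32_t xmin;
    uint32_t xmax;
} tuple_version;

/* A version is dead (safe to drop during compaction) once its deleter
 * committed before every snapshot still running: xmax != 0 and
 * xmax < oldest_active_xid. Compacts in place, returns survivors. */
size_t compact_versions(tuple_version *v, size_t n,
                        uint32_t oldest_active_xid)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int dead = v[i].xmax != 0 && v[i].xmax < oldest_active_xid;
        if (!dead)
            v[kept++] = v[i];
    }
    return kept;
}
```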

I’d love to hear your thoughts:

1. Does this direction make sense for experimentation within the Postgres
ecosystem?
2. Are there known architectural blockers or prior discussions/attempts in
this space worth revisiting?
3. Would such a project be best developed entirely as a fork, or is there
openness to evolving TAM to better support pluggable storage with LSM‑like
semantics?

Looking forward to your feedback.
- Manish

https://github.com/manishrjain

#2 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Manish Rai Jain (#1)
Re: Proposal: Exploring LSM Tree‑Based Storage Engine for PostgreSQL (Inspired by MyRocks)

On Sun, 11 May 2025 at 14:06, Manish Rai Jain <manishrjain@gmail.com> wrote:

> Hi hackers,
>
> I’ve been exploring the idea of integrating an LSM tree–based storage engine into PostgreSQL — similar in spirit to MyRocks for MySQL — by replacing the underlying storage while preserving PostgreSQL’s upper layers (planner, executor, MVCC, etc.).
>
> The motivation stems from the well‑known write‑amplification issues with B‑trees under high write throughput. An LSM‑based engine could offer:
>
> - Significant improvements in write performance and space efficiency, especially under heavy ingestion workloads.
> - Better scalability with larger datasets, particularly when compression is applied.
> - Comparable read performance (with trade‑offs depending on workload), and opportunities to optimize through Bloom filters, tiered compaction strategies, etc.
> - Reduced reliance on manual VACUUM: obsolete versions would be purged naturally during LSM compactions, potentially eliminating routine heap vacuuming (transaction‑ID wrap‑around handling and stats collection would still need careful design).
>
> The hoped‑for outcome is a faster, more scalable PostgreSQL for >1 TB workloads, while maintaining the rich feature set and ecosystem compatibility users expect from Postgres.
>
> Unlike Neon, this approach is not targeting cloud‑native object storage or remote WAL streaming, but instead optimizing for maximum performance on local disks or high‑performance block volumes, where write throughput and compaction efficiency matter most.

Note: Neon does not exactly structure anything PostgreSQL-visible in
an LSM structure. From Postgres' point of view, it's much more similar
to an LSM-based file system than an LSM-based TableAM or IndexAM.

> This would likely involve implementing a new Table Access Method (TAM), possibly backed by a forked engine such as BadgerDB or RocksDB, adapted to support PostgreSQL’s MVCC and WAL semantics.
>
> I’d love to hear your thoughts:
>
> 1. Does this direction make sense for experimentation within the Postgres ecosystem?

IMV, PostgreSQL is uniquely positioned for experimentation with
different index data structures, while it has significant restrictions
on Table AM implementations through the Table AM API and our shared
buffers. Given the significant differences between the usual LSM
(arbitrarily large ranges of values) and block-based storage, I'm not
sure how well Postgres would accommodate LSM-based storage. See also
below.

> 2. Are there known architectural blockers or prior discussions/attempts in this space worth revisiting?

It depends on what you're planning to do. LSM-based TableAMs are
usually implemented with index-organized data in mind. PostgreSQL,
however, expects its tables to have heap-like semantics for stored
tuples, with TID-addressable tuples that identify an unchanging set of
non-summarizingly indexed column values.

Additionally, PostgreSQL has its own buffer pool that is
block/page-based. You might be able to convince an LSM engine to use
block IO and page storage, but if that's not in your plans, you'll
probably also have to figure out how to do buffering and
checkpointing, and integrate with restartpoints.
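One conceivable way to reconcile the two models is to make the LSM key a memcmp-sortable encoding of the (block, offset) pair, so that existing index AMs can keep storing plain TIDs. The layout below is a hypothetical sketch, not an existing design:

```c
#include <stdint.h>
#include <string.h>

/* Pack a PostgreSQL-style TID (32-bit block number, 16-bit offset)
 * into a big-endian 6-byte key so that plain memcmp() ordering on the
 * key bytes matches (block, offset) ordering. Hypothetical layout. */
void tid_to_key(uint32_t block, uint16_t offset, uint8_t key[6])
{
    key[0] = (uint8_t)(block >> 24);
    key[1] = (uint8_t)(block >> 16);
    key[2] = (uint8_t)(block >> 8);
    key[3] = (uint8_t)(block);
    key[4] = (uint8_t)(offset >> 8);
    key[5] = (uint8_t)(offset);
}

void key_to_tid(const uint8_t key[6], uint32_t *block, uint16_t *offset)
{
    *block  = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16)
            | ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
    *offset = (uint16_t)(((uint16_t)key[4] << 8) | key[5]);
}
```

Because the byte order of the key matches TID order, range scans over the LSM visit tuples in the same order a heap scan would.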

> 3. Would such a project be best developed entirely as a fork, or is there openness to evolving TAM to better support pluggable storage with LSM‑like semantics?

I think the project is open to extensions of the TableAM to better
support new TableAMs, but only if that benefits the project as a
whole. E.g. if there are changes that impact the way things are going
to be indexed, you'll need to have a compelling story about why that's
needed and how indexes need to be adapted to support the new system.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#3 Aleksander Alekseev
aleksander@timescale.com
In reply to: Manish Rai Jain (#1)
Re: Proposal: Exploring LSM Tree‑Based Storage Engine for PostgreSQL (Inspired by MyRocks)

Hi Manish,

> I’ve been exploring the idea of integrating an LSM tree–based storage engine into PostgreSQL — similar in spirit to MyRocks for MySQL — by replacing the underlying storage while preserving PostgreSQL’s upper layers (planner, executor, MVCC, etc.).
> [...]

Personally I wouldn't expect too much from RocksDB, considering that
historically LSM-trees were designed to improve performance on HDDs.
The idea was to replace random writes with sequential ones. These days
we seem to be shifting to SSDs, so this approach should be beneficial
in relatively niche cases where the amount of data is so large that
it's economically justified to buy HDDs... and you still want Postgres
for those tens or hundreds of terabytes of data, as a single instance
(no FDWs etc.). In theory, at least.

This being said, I don't want to discourage you from implementing a
TAM that would use RocksDB. This would be an excellent project for
several reasons:

* We could check if the above statement is true or false
* This could be considered as a reference 3rd party TAM implementation
* This also would be an implementation of index-organized tables
* Perhaps you will discover certain shortcomings of TAM API in the
process so we could improve it
* This promises to be a very, very fun project

> I’d love to hear your thoughts:
>
> 1. Does this direction make sense for experimentation within the Postgres ecosystem?

Yes.

> 2. Are there known architectural blockers or prior discussions/attempts in this space worth revisiting?

From the performance point of view, maybe. Note that RocksDB basically
has its own shared buffers, transactions, and write-ahead logging,
which makes me wonder if it would be simpler to implement an LSM-tree
from scratch. On the flip side, you could start by using RocksDB
essentially as a key-value store, see the results, and then replace it
with your own implementation behind the same interface.

Supporting backups could be a challenge, including incremental ones.
I'm not sure if/how pg_basebackup and our other tools interact with
TAMs.

> 3. Would such a project be best developed entirely as a fork, or is there openness to evolving TAM to better support pluggable storage with LSM‑like semantics?

Please don't create forks. As a person whose responsibilities, among
other things, included helping to maintain several PG forks and
upgrading them to the next upstream releases, I assure you that you
never want to do this.

Just use the TAM interface and create a separate PostgreSQL extension
on GitHub. This mailing list can be used for general TAM discussions.
If and when your extension becomes mature enough, it can be discussed
whether it's worth adding to /contrib/ or not.

--
Best regards,
Aleksander Alekseev

#4 Nico Williams
nico@cryptonector.com
In reply to: Manish Rai Jain (#1)
Re: Proposal: Exploring LSM Tree‑Based Storage Engine for PostgreSQL (Inspired by MyRocks)

On Sun, May 11, 2025 at 11:06:00AM -0700, Manish Rai Jain wrote:

> The motivation stems from the well‑known write‑amplification issues
> with B‑trees under high write throughput. An LSM‑based engine could
> offer:

Consider the ZFS copy-on-write b-tree-ish design, which has _even more_
write amplification than any normal b-tree. The ZFS solution to that is
LSM-ish: specifically, the ZFS intent log (ZIL), which lets the
filesystem write w/o amplification for a while at the cost of:

a) having to keep an in-memory representation of the filesystem as
modified by the ZIL.

b) having to spend time on boot after power failure reconstructing (a)
from the on-disk ZIL.

The obvious way to reduce this cost of the ZIL is to periodically write
a simple index for all of the extant ZIL, so that reading the
filesystem can be done by first searching the latest ZIL index, then
the b-tree-ish filesystem itself. This begins to resemble an LSM.
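The read path described here (newest intent-log state first, then the b-tree-ish base structure) can be sketched as follows; this is a simplified single-process illustration with hypothetical names, not ZFS's actual code:

```c
#include <string.h>
#include <stddef.h>

/* Append-only intent log that absorbs writes; reads scan it
 * newest-first before falling back to the base structure. */
typedef struct { const char *key; const char *val; } log_entry;

typedef struct {
    log_entry entries[64];
    size_t    len;
} intent_log;

void log_append(intent_log *l, const char *key, const char *val)
{
    l->entries[l->len].key = key;
    l->entries[l->len].val = val;
    l->len++;
}

/* Newest log entry wins; NULL means "not in the log, consult base". */
const char *log_lookup(const intent_log *l, const char *key)
{
    for (size_t i = l->len; i > 0; i--)
        if (strcmp(l->entries[i - 1].key, key) == 0)
            return l->entries[i - 1].val;
    return NULL;
}

const char *lookup(const intent_log *l, const char *key,
                   const char *(*base_get)(const char *))
{
    const char *v = log_lookup(l, key);
    return v ? v : base_get(key);
}

/* Stand-in for the persistent b-tree-ish base structure. */
const char *demo_base_get(const char *key)
{
    return strcmp(key, "a") == 0 ? "base-a" : NULL;
}
```

The periodic ZIL index Nico mentions would replace the linear newest-first scan with a search of the latest index, which is exactly the memtable-plus-SSTable shape of an LSM.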

But what are the benefits of a copy-on-write design with write
amplification mitigation by an indexed "intent log" / LSM? This:

- you get to do content-addressed storage (which ZFS isn't quite, but
any greenfield thing could be)

- you get one root hash to identify the whole content (same as Git and
such), which gets you:

- hardware bit flip fault detection and recovery (if you add
mirroring or raid-z)

- read-only snapshots, which gets you:

- MVCC semantics, w/o MVCC

- O(1) cloning of read-only snapshots

- zero-lock reads w/ high concurrency!

(Either you lock the txid you find at the root when you start
reading -and lock nothing else- so that pages freed in later
transactions are not reused while your read tx is still going,
or you lock nothing, and if you ever find that checksums don't
match [most likely because the block you're reading has been
freed and reused] you restart the read operation, much like with
SERIALIZABLE transactions one has to be prepared to restart.)
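The second option (lock nothing, restart on checksum mismatch) is essentially an optimistic read loop. A simplified single-threaded sketch of the pattern, with illustrative names and a stand-in checksum:

```c
#include <stdint.h>
#include <stddef.h>

/* Lock-free optimistic read: copy the block without locking, verify
 * its checksum, and retry on mismatch (the block was presumably freed
 * and reused by a concurrent writer). Single-threaded illustration. */
typedef struct {
    uint8_t  data[64];
    uint32_t checksum;
} block_t;

uint32_t block_checksum(const uint8_t *data, size_t n)
{
    uint32_t h = 2166136261u;            /* FNV-1a */
    for (size_t i = 0; i < n; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns the attempt number that validated, or 0 if none did. */
int read_block(const block_t *b, uint8_t out[64], int max_retries)
{
    for (int attempt = 1; attempt <= max_retries; attempt++) {
        for (size_t i = 0; i < 64; i++)  /* unlocked copy */
            out[i] = b->data[i];
        if (block_checksum(out, 64) == b->checksum)
            return attempt;              /* consistent snapshot */
        /* mismatch: block was reused mid-read; restart */
    }
    return 0;
}
```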

But now if you want to build a distributed DB then you almost certainly
have to go back to MVCC and lose the read-only snapshots and clones.
But you could still have zero-lock reads w/ high concurrency, which is
not nothing!

But for PG it's too late to even consider such a thing unless... one
were to:

- first add support for something other than heaps for table storage

I'm sure many would like to have this.

- then follow the SQLite4 (yes, 4, not 3) model so you can store the
whole DB in one LSM (or one b-tree w/ write amplification amortization)

This strikes me as a lot of work.

Nico
--

#5 Konstantin Osipov
kostja.osipov@gmail.com
In reply to: Manish Rai Jain (#1)
Re: Proposal: Exploring LSM Tree‑Based Storage Engine for PostgreSQL (Inspired by MyRocks)

* Manish Rai Jain <manishrjain@gmail.com> [25/05/11 22:06]:

> Hi hackers,
>
> 1. Does this direction make sense for experimentation within the Postgres
> ecosystem?
> 2. Are there known architectural blockers or prior discussions/attempts in
> this space worth revisiting?
> 3. Would such a project be best developed entirely as a fork, or is there
> openness to evolving TAM to better support pluggable storage with LSM‑like
> semantics?

I think it would be difficult to fully integrate RocksDB, since it
has its own transaction control and recovery, and uses
multi-threading rather than multi-processing.

PostgreSQL's TAM expects that the PostgreSQL WAL is used for
replication, and most transaction-control functions (e.g. locking)
are outside the TAM domain.
outside the TAM domain.

--
Konstantin Osipov, Moscow, Russia