Built-in Raft replication
Hi,
I am considering starting work on implementing a built-in Raft
replication for PostgreSQL.
Raft's advantage is that it unifies log replication, cluster
configuration/membership/topology management and initial state transfer
into a single protocol.
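To illustrate the "single protocol" point: in Raft, ordinary data, membership changes, and snapshot/compaction markers all travel through one ordered log. A minimal sketch (all names here are illustrative, not PostgreSQL or any Raft library's API):

```python
from dataclasses import dataclass
from enum import Enum, auto

class Kind(Enum):
    DATA = auto()      # ordinary replicated record -- what WAL carries today
    CONFIG = auto()    # membership change, ordered through the same log
    SNAPSHOT = auto()  # compaction marker, the basis of initial state transfer

@dataclass
class Entry:
    term: int          # leader's election term when the entry was appended
    index: int         # position in the single shared log
    kind: Kind
    payload: object

def apply_entry(state, members, entry):
    """Data, membership and snapshots all flow through one ordered log."""
    if entry.kind is Kind.DATA:
        state = state + [entry.payload]
    elif entry.kind is Kind.CONFIG:
        members = set(entry.payload)  # new topology takes effect at this index
    return state, members
```

Because membership changes are just log entries, every node learns the topology in the same order as the data, which is what makes storing the configuration in system tables natural.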
Currently the cluster configuration/topology is often managed by
Patroni or similar tools; however, there are certain usability
drawbacks with this approach:
- it's a separate tool, requiring an external state provider like etcd;
Raft could store its configuration in system tables, which is
also an observability improvement, since everyone could look up the
cluster state the same way as everything else;
- the same goes for the watchdog: Raft has a built-in failure detector
that's configuration-aware;
- flexible quorums: currently the quorum size is a configuration
parameter; with built-in Raft, extending the quorum could be a matter
of starting a new node and pointing it at an existing cluster.
Going forward, I can see PostgreSQL providing transparent bouncing
at the pg_wire level, given that the Raft state would then be part of the
system, so drivers and all cluster nodes could easily see where
the leader is.
If anyone is working on Raft already, I'd be happy to discuss
the details. I am fairly new to the PostgreSQL hackers ecosystem,
so I'm cautious about starting work in isolation or discovering there is
no interest in accepting the feature into the trunk.
Thanks,
--
Konstantin Osipov
On Mon, 14 Apr 2025 at 22:15, Konstantin Osipov <kostja.osipov@gmail.com> wrote:
Hi,
Hi
I am considering starting work on implementing a built-in Raft
replication for PostgreSQL.
Just some thought on top of my mind, if you need my voice here:
I have a hard time believing the community will be positive about this
change in-core. It has a better chance as a contrib extension. In fact, if
we want a built-in consensus algorithm, Paxos is a better option,
because you can use PostgreSQL as local crash-safe storage for
single-decree Paxos: just store your state (ballot number, last vote) in a
heap table.
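A minimal sketch of that idea: a single-decree Paxos acceptor whose only durable state (promised ballot, last vote) lives in one ordinary table row. Here sqlite3 stands in for a PostgreSQL heap table, and every name is hypothetical:

```python
import sqlite3

class PaxosAcceptor:
    """Single-decree Paxos acceptor; durable state is one table row."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute("""CREATE TABLE IF NOT EXISTS paxos_state (
                            id INTEGER PRIMARY KEY CHECK (id = 1),
                            promised_ballot INTEGER NOT NULL,
                            accepted_ballot INTEGER,
                            accepted_value TEXT)""")
        conn.execute("INSERT OR IGNORE INTO paxos_state VALUES (1, -1, NULL, NULL)")
        conn.commit()

    def _state(self):
        return self.conn.execute(
            "SELECT promised_ballot, accepted_ballot, accepted_value "
            "FROM paxos_state WHERE id = 1").fetchone()

    def prepare(self, ballot):
        """Phase 1: promise to ignore ballots below `ballot`."""
        promised, acc_ballot, acc_value = self._state()
        if ballot <= promised:
            return None                      # reject a stale proposer
        self.conn.execute("UPDATE paxos_state SET promised_ballot = ? WHERE id = 1",
                          (ballot,))
        self.conn.commit()                   # durable before replying
        return (acc_ballot, acc_value)       # report the last vote, if any

    def accept(self, ballot, value):
        """Phase 2: record the vote if the promise still holds."""
        promised, _, _ = self._state()
        if ballot < promised:
            return False
        self.conn.execute(
            "UPDATE paxos_state SET promised_ballot = ?, accepted_ballot = ?, "
            "accepted_value = ? WHERE id = 1", (ballot, ballot, value))
        self.conn.commit()
        return True
```

The point of the sketch: the acceptor needs no log of its own, only a crash-safe row that is updated before each reply, which is exactly what a heap table already provides.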
OTOH Raft needs to write its own log, and what's worse, it sometimes
needs to remove already-written parts of it (so it is not append-only,
unlike WAL). A production system which maintains two
kinds of logs with different semantics is a very hard system to
maintain.
There is actually a production-ready (non-open-source) implementation of
Raft as an extension, called BiHA, by pgpro.
--
Best regards,
Kirill Reshke
* Kirill Reshke <reshkekirill@gmail.com> [25/04/14 20:48]:
I am considering starting work on implementing a built-in Raft
replication for PostgreSQL.

Just some thought on top of my mind, if you need my voice here:
I have a hard time believing the community will be positive about this
change in-core. It has more changes as contrib extension. In fact, if
we want a built-in consensus algorithm, Paxos is a better option,
because you can use postgresql as local crash-safe storage for single
decree paxos, just store your state (ballot number, last voice) in a
heap table.
But Raft is a log replication algorithm, not a consensus
algorithm. It does use consensus, but that's for leader election.
Paxos could be used for log replication, but that would be
expensive. In fact etcd uses Raft, and etcd is used by Patroni. So
I completely lost your line of thought here.
OTOH Raft needs to write its own log, and what's worse, it sometimes
needs to remove already written parts of it (so, it is not appended
only, unlike WAL). If you have a production system which maintains two
kinds of logs with different semantics, it is a very hard system to
maintain..
My proposal is exactly to replace (or rather, extend) the current
synchronous log replication with Raft. Entry removal can be stacked
on top of an append-only format, and production implementations
exist which do that.
So, no, it's a single log, and in fact the current WAL will do.
There is actually a prod-ready (non open source) implementation of
RAFT as extension, called BiHA, by pgpro.
My guess is that BiHA is an extension because proprietary code is easier
to maintain that way. I'd rather say the fact that there is a
proprietary implementation out in the field confirms it could be a
good idea to have it in the PostgreSQL trunk.
In any case I'm interested in contributing to the trunk, not in
building a proprietary module/fork.
--
Konstantin Osipov
14.04.2025 20:44, Kirill Reshke пишет:
OTOH Raft needs to write its own log, and what's worse, it sometimes
needs to remove already written parts of it (so, it is not appended
only, unlike WAL). If you have a production system which maintains two
kinds of logs with different semantics, it is a very hard system to
maintain..
Raft is a log replication protocol which uses a log position and a term.
But... PostgreSQL already has a log position and a term in its WAL structure:
PostgreSQL's timeline is actually the Term.
A Raft implementer just needs to correct the rules for Term/Timeline switching:
- instead of "the next TimeLine number is just an increment of the largest
known TimeLine number", it needs to be "the next TimeLine number is the
result of Leader Election".
And yes, "it sometimes needs to remove already written parts of it".
But... that is exactly what every PostgreSQL cluster manager has to do
to rejoin the previous leader as a follower of the new leader: pg_rewind.
So, PostgreSQL already has 70-90% of the Raft implementation details.
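The rule change described above can be sketched in a few lines. The names are illustrative, not PostgreSQL functions; the point is only the contrast between "increment the largest known timeline" and "a timeline is the term granted by a majority election":

```python
def next_timeline_classic(known_timelines):
    """Today's rule: a promoting standby takes the next unused timeline number."""
    return max(known_timelines) + 1

def next_timeline_raft(candidate_term, votes_received, cluster_size):
    """Raft-style rule: a term (timeline) only takes effect once a majority
    of the cluster has granted its vote to the candidate proposing it."""
    if votes_received > cluster_size // 2:
        return candidate_term   # election won: this term is the new timeline
    return None                 # election lost: no timeline switch happens
```

Under the classic rule two partitioned standbys can both promote onto "fresh" timelines; under the election rule at most one candidate per term can gather a majority.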
Raft doesn't have to be implemented in PostgreSQL.
Raft has to be finished!!!
PS: One of the biggest issues is forced snapshot on replica promotion. It
really slows down leader switch time. It looks like it is not really
needed, or some small workaround should be enough.
--
regards
Yura Sokolov aka funny-falcon
Hi Konstantin,
I am considering starting work on implementing a built-in Raft
replication for PostgreSQL.
Generally speaking I like the idea. The more important question IMO is
whether we want to maintain Raft within the PostgreSQL core project.
Building distributed systems on commodity hardware was a popular idea
back in the 2000s. These days you can rent a server with 2 TB of RAM
for something like 2000 USD/month (numbers from memory that were
valid ~5 years ago), which will fit many of the existing businesses (!)
in memory. And you can rent another one for a replica, just in order
not to have to recover from a backup if something happens to your primary
server. The common wisdom is: if you can avoid building distributed
systems, don't build one.
Which brings up the question of whether we want to maintain something
like this (which will include logic for nodes joining or leaving the
cluster, a proxy server / service discovery for clients, test cases /
infrastructure for all of this, plus upgrading the cluster, docs, ...)
for presumably few users whose business doesn't fit on a single
server *and* who want automatic failover (not manual)
*and* who don't already use Patroni/Stolon/CockroachDB/Neon/...
Although the idea is tempting, personally I'm inclined to think that
it's better to invest community resources into something else.
--
Best regards,
Aleksander Alekseev
15.04.2025 13:20, Aleksander Alekseev пишет:
Hi Konstantin,
I am considering starting work on implementing a built-in Raft
replication for PostgreSQL.

Generally speaking I like the idea. The more important question IMO is
whether we want to maintain Raft within the PostgreSQL core project.

Building distributed systems on commodity hardware was a popular idea
back in the 2000s. These days you can rent a server with 2 Tb of RAM
for something like 2000 USD/month (numbers from my memory that were
valid ~5 years ago) which will fit many of the existing businesses (!)
in memory. And you can rent another one for a replica, just in order
not to recover from a backup if something happens to your primary
server. The common wisdom is if you can avoid building distributed
systems, don't build one.

Which brings the question if we want to maintain something like this
(which will include logic for cases when a node joins or leaves the
cluster, proxy server / service discovery for clients, test cases /
infrastructure for all this and also upgrading the cluster, docs, ...)
for a presumably view users which business doesn't fit in a single
server *and* they want an automatic failover (not the manual one)
*and* they don't use Patroni/Stolon/CockroachDB/Neon/... already.

Although the idea is tempting personally I'm inclined to think that
it's better to invest community resources into something else.
Raft is not for "commodity hardware". It is for reliability.
Yes, it needs 3 servers instead of 2. It costs more than simple replication
with "manual" failover.
But if a business needs high availability, it won't rely on "manual"
failover. And if a business relies on correctness, it won't rely on any
solution which "automatically switches between two replicas", because there
is no way to guarantee correctness with just two replicas. And many stories
of lost transactions with Patroni/Stolon already confirm this thesis.
CockroachDB/Neon - they are good solutions for distributed systems. But, as
you've said, many clients don't need distributed systems. They just need
reliable replication.
I've been working at a company which uses MongoDB (3.6 and up) as its
primary storage, and it seemed to me a godsend. Everything just worked.
Replication was as reliable as one could imagine. It outlived several
hardware incidents without manual intervention. It allowed cluster
maintenance (software and hardware upgrades) without application downtime.
I really dream that PostgreSQL will be as reliable as MongoDB without
the need for external services.
--
regards
Yura Sokolov aka funny-falcon
* Yura Sokolov <y.sokolov@postgrespro.ru> [25/04/15 12:02]:
OTOH Raft needs to write its own log, and what's worse, it sometimes
needs to remove already written parts of it (so, it is not appended
only, unlike WAL). If you have a production system which maintains two
kinds of logs with different semantics, it is a very hard system to
maintain..

Raft is log replication protocol which uses log position and term.
But... PostgreSQL already have log position and term in its WAL structure.
PostgreSQL's timeline is actually the Term.
Raft implementer needs just to correct rules for Term/Timeline switching:
- instead of "next TimeLine number is just increment of largest known
TimeLine number" it needs to be "next TimeLine number is the result of
Leader Election".

And yes, "it sometimes needs to remove already written parts of it".
But... It is exactly what every PostgreSQL's cluster manager software have
to do to join previous leader as a follower to new leader - pg_rewind.

So, PostgreSQL already have 70-90%% of Raft implementation details.
Raft doesn't have to be implemented in PostgreSQL.
Raft has to be finished!!!

PS: One of the biggest issues is forced snapshot on replica promotion. It
really slows down leader switch time. It looks like it is not really
needed, or some small workaround should be enough.
I'd say my pet peeve is storing the cluster topology (the so-called
Raft configuration) inside the database, not in an external
state provider. Agreed on the other points.
--
Konstantin Osipov
Hi Yura,
I've been working in a company which uses MongoDB (3.6 and up) as their
primary storage. And it seemed to me as "God Send". Everything just worked.
Replication was as reliable as one could imagine. It outlives several
hardware incidents without manual intervention. It allowed cluster
maintenance (software and hardware upgrades) without application downtime.
I really dream PostgreSQL will be as reliable as MongoDB without need of
external services.
I completely understand. I had exactly the same experience with
Stolon. Everything just worked. And the setup took like 5 minutes.
It's a pity this project doesn't seem to get as much attention as
Patroni. Probably because attention requires traveling and presenting
the project at conferences, which costs money. Or perhaps people are
just happy with Patroni. I'm not sure what state Stolon is in today.
--
Best regards,
Aleksander Alekseev
15.04.2025 14:15, Aleksander Alekseev пишет:
Hi Yura,
I've been working in a company which uses MongoDB (3.6 and up) as their
primary storage. And it seemed to me as "God Send". Everything just worked.
Replication was as reliable as one could imagine. It outlives several
hardware incidents without manual intervention. It allowed cluster
maintenance (software and hardware upgrades) without application downtime.
I really dream PostgreSQL will be as reliable as MongoDB without need of
external services.

I completely understand. I had exactly the same experience with
Stolon. Everything just worked. And the setup took like 5 minutes.

It's a pity this project doesn't seem to get as much attention as
Patroni. Probably because attention requires traveling and presenting
the project at conferences which costs money. Or perhaps people are
just happy with Patroni. I'm not sure in which state Stolon is today.
But the key point: if PostgreSQL is improved a bit, there will be no
need for either Patroni or Stolon. Isn't that great?
--
regards
Yura Sokolov aka funny-falcon
* Aleksander Alekseev <aleksander@timescale.com> [25/04/15 13:20]:
I am considering starting work on implementing a built-in Raft
replication for PostgreSQL.

Generally speaking I like the idea. The more important question IMO is
whether we want to maintain Raft within the PostgreSQL core project.

Building distributed systems on commodity hardware was a popular idea
back in the 2000s. These days you can rent a server with 2 Tb of RAM
for something like 2000 USD/month (numbers from my memory that were
valid ~5 years ago) which will fit many of the existing businesses (!)
in memory. And you can rent another one for a replica, just in order
not to recover from a backup if something happens to your primary
server. The common wisdom is if you can avoid building distributed
systems, don't build one.

Which brings the question if we want to maintain something like this
(which will include logic for cases when a node joins or leaves the
cluster, proxy server / service discovery for clients, test cases /
infrastructure for all this and also upgrading the cluster, docs, ...)
for a presumably view users which business doesn't fit in a single
server *and* they want an automatic failover (not the manual one)
*and* they don't use Patroni/Stolon/CockroachDB/Neon/... already.

Although the idea is tempting personally I'm inclined to think that
it's better to invest community resources into something else.
My personal takeaway from this as a community member would be
seamless coordinator failover in Greenplum and all of its forks
(CloudBerry, Greengage, synxdata, what not). I also imagine there
are a number of PostgreSQL derivatives that could benefit from
built-in transparent failover, since it standardizes the solution
space.
--
Konstantin Osipov
* Yura Sokolov <y.sokolov@postgrespro.ru> [25/04/15 14:02]:
I've been working in a company which uses MongoDB (3.6 and up) as their
primary storage. And it seemed to me as "God Send". Everything just worked.
Replication was as reliable as one could imagine. It outlives several
hardware incidents without manual intervention. It allowed cluster
maintenance (software and hardware upgrades) without application downtime.
I really dream PostgreSQL will be as reliable as MongoDB without need of
external services.
Thanks for pointing out MongoDB - so built-in Raft would help
FerretDB as well.
--
Konstantin Osipov
On Mon, Apr 14, 2025 at 1:15 PM Konstantin Osipov <kostja.osipov@gmail.com>
wrote:
If anyone is working on Raft already I'd be happy to discuss
the details. I am fairly new to the PostgreSQL hackers ecosystem
so cautious of starting work in isolation/knowing there is no
interest in accepting the feature into the trunk.
Putting aside the technical concerns about this specific idea, it's best to
start by laying out a very detailed plan of what you would want to change,
and what you see as the costs and benefits. It's also extremely helpful to
think about developing this as an extension. If you get stuck due to
extension limitations, propose additional hooks. If the hooks will not
work, explain why.
Getting this into core is going to be a long, multi-year effort, in which
people are going to be pushing back the entire time, so prepare yourself
for that. My immediate retort is going to be: why would we add this if
there are existing tools that already do the job just fine? Postgres has
lots of tasks that it is happy to let other programs/OS
subsystems/extensions/etc. handle instead.
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
* Greg Sabino Mullane <htamfids@gmail.com> [25/04/15 18:08]:
If anyone is working on Raft already I'd be happy to discuss
the details. I am fairly new to the PostgreSQL hackers ecosystem
so cautious of starting work in isolation/knowing there is no
interest in accepting the feature into the trunk.

Putting aside the technical concerns about this specific idea, it's best to
start by laying out a very detailed plan of what you would want to change,
and what you see as the costs and benefits. It's also extremely helpful to
think about developing this as an extension. If you get stuck due to
extension limitations, propose additional hooks. If the hooks will not
work, explain why.

Getting this into core is going to be a long, multi-year effort, in which
people are going to be pushing back the entire time, so prepare yourself
for that. My immediate retort is going to be: why would we add this if
there are existing tools that already do the job just fine? Postgres has
lots of tasks that it is happy to let other programs/OS
subsystems/extensions/etc. handle instead.
I had hoped I explained why external state providers cannot
provide the same seamless UX as built-in ones. The key idea is to
have built-in configuration management, so that adding and
removing replicas does not require changes in multiple disjoint
parts of the installation (server configurations, proxies,
clients).
I understand and accept that it's a multi-year effort, but I do
not accept the retort - my main point is that external tools
are not a replacement, and I'd like to reach consensus on that.
--
Konstantin Osipov, Moscow, Russia
On Tue, Apr 15, 2025 at 8:08 AM Greg Sabino Mullane <htamfids@gmail.com>
wrote:
On Mon, Apr 14, 2025 at 1:15 PM Konstantin Osipov <kostja.osipov@gmail.com>
wrote:

If anyone is working on Raft already I'd be happy to discuss
the details. I am fairly new to the PostgreSQL hackers ecosystem
so cautious of starting work in isolation/knowing there is no
interest in accepting the feature into the trunk.

Putting aside the technical concerns about this specific idea, it's best
to start by laying out a very detailed plan of what you would want to
change, and what you see as the costs and benefits. It's also extremely
helpful to think about developing this as an extension. If you get stuck
due to extension limitations, propose additional hooks. If the hooks will
not work, explain why.
This is exactly what I wanted to write as well. The idea is great. At the
same time, I think, consensus on many decisions will be extremely hard to
reach, so this project has a high risk of being very long. Unless it's an
extension, at least in the beginning.
Nik
Nikolay Samokhvalov <nik@postgres.ai> writes:
This is exactly what I wanted to write as well. The idea is great. At the
same time, I think, consensus on many decisions will be extremely hard to
reach, so this project has a high risk of being very long. Unless it's an
extension, at least in the beginning.
Yeah. The two questions you'd have to get past to get this into PG
core are:
1. Why can't it be an extension? (You claimed it would work more
seamlessly in core, but I don't think you've made a proven case.)
2. Why depend on Raft rather than some other project?
Longtime PG developers are going to be particularly hard on point 2,
because we have a track record now of outliving outside projects
that we thought we could rely on. One example here is the Snowball
stemmer; while its upstream isn't quite dead, it's twitching only
feebly, and seems to have a bus factor of 1. Another example is the
Spencer regex engine; we thought we could depend on Tcl to be the
upstream for that, but for a decade or more they've acted as though
*we* are the upstream. And then there's libxml2. And uuid-ossp.
And autoconf. And various documentation toolchains. Need I go on?
The great advantage of implementing an outside dependency in an
extension is that if the depended-on project dies, we can say a few
words of mourning and move on. It's a lot harder to walk away from
in-core features.
regards, tom lane
On 16 Apr 2025, at 04:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
feebly, and seems to have a bus factor of 1. Another example is the
Spencer regex engine; we thought we could depend on Tcl to be the
upstream for that, but for a decade or more they've acted as though
*we* are the upstream.
I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.
IMO, to better understand what is proposed we need some more description of the proposed system. How will the new system be configured? initdb and what then? How does a new node join the cluster? What runs pg_rewind when necessary?
Some time ago Peter E proposed being able to start replication on top of an empty directory, so that initial sync would be more straightforward. And Heikki proposed removing the archive race condition when choosing a new timeline. I think these steps are gradual movement in the same direction.
My view is that what Konstantin wants is automatic replication topology management. For some reason this technology is called HA, DCS, Raft, Paxos and many other scary words. But basically it manages the primary_conninfo of some nodes to provide some fault-tolerance properties. I'd start to design from there, not from the Raft paper.
Best regards, Andrey Borodin.
Andrey Borodin <x4mmm@yandex-team.ru> writes:
I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.
Hmm, OK. I thought that the proposal involved relying on some existing
code, but re-reading the thread that was said nowhere. Still, that
moves it from a large project to a really large project :-(
I continue to think that it'd be best to try to implement it as
an extension, at least up till the point of finding show-stopping
reasons why it cannot be that.
regards, tom lane
On Wed, Apr 16, 2025 at 9:37 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
My view is what Konstantin wants is automatic replication topology management. For some reason this technology is called HA, DCS, Raft, Paxos and many other scary words. But basically it manages primary_conn_info of some nodes to provide some fault-tolerance properties. I'd start to design from here, not from Raft paper.
In my experience, the load of managing hundreds of replicas which all
participate in the RAFT protocol becomes more than the regular transaction
load. So making every replica a RAFT participant will affect the
ability to deploy hundreds of replicas. We may build an extension which
has a similar role in the PostgreSQL world as ZooKeeper has in Hadoop. It
can then be used for other distributed systems as well - like shared-nothing
clusters based on FDW. There's already a proposal to bring
CREATE SERVER to the world of logical replication - so I see these two
worlds uniting in future. The way I imagine it is some PostgreSQL
instances, which have this extension installed, will act as a RAFT
cluster (similar to Zookeeper ensemble or etcd cluster). The
distributed system based on logical replication or FDW or both will
use this ensemble to manage its shared state. The same ensemble can be
shared across multiple distributed clusters if it has scaling
capabilities.
--
Best Wishes,
Ashutosh Bapat
On 16 Apr 2025, at 09:33, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
In my experience, the load of managing hundreds of replicas which all
participate in RAFT protocol becomes more than regular transaction
load. So making every replica a RAFT participant will affect the
ability to deploy hundreds of replica.
No need to make all standbys voters. And no need for a flat topology. pg_consul uses 2/3 or 3/5 HA groups, and cascades all other standbys from the HA group.
Existing tools already solve the original problem; Konstantin is just proposing to solve it in some standard “official” way.
We may build an extension which
has a similar role in PostgreSQL world as zookeeper in Hadoop.
Patroni, pg_consul and others already use zookeeper, etcd and similar systems for consensus.
Is it any better as extension than as etcd?
It can
be then used for other distributed systems as well - like shared
nothing clusters based on FDW.
I didn’t get the FDW analogy. Why would other distributed systems choose a Postgres extension over ZooKeeper?
There's already a proposal to bring
CREATE SERVER to the world of logical replication - so I see these two
worlds uniting in future.
Again, I’m lost here. Which two worlds?
The way I imagine it is some PostgreSQL
instances, which have this extension installed, will act as a RAFT
cluster (similar to Zookeeper ensemble or etcd cluster).
That’s exactly what is proposed here.
The
distributed system based on logical replication or FDW or both will
use this ensemble to manage its shared state. The same ensemble can be
shared across multiple distributed clusters if it has scaling
capabilities.
Yes, shared DCS are common these days. AFAIK, we use one Zookeeper instance per hundred Postgres clusters to coordinate pg_consuls.
Actually, scalability is the opposite of the topic of this thread. Let me explain.
Currently, Postgres automatic failover tools rely on databases with built-in automatic failover. Konstantin is proposing to shorten this loop and make Postgres use its own built-in automatic failover.
So, existing tooling allows you to have 3 hosts for the DCS, with a majority of 2 hosts able to elect a new leader in case of failover.
And you can have only 2 hosts for Postgres - primary and standby. You can have 2 big Postgres machines with 64 CPUs, and 3 one-CPU hosts for ZooKeeper/etcd.
If you use built-in failover you have to resort to 3 big Postgres machines, because you need a 2/3 majority. Of course, you can install a MySQL-style arbiter - a host that has no real PGDATA and only participates in voting. But that is a solution to a problem induced by built-in autofailover.
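The sizing argument above is plain majority arithmetic, which a couple of helper functions (hypothetical names, just for illustration) make explicit:

```python
def majority(n):
    """Smallest vote count that constitutes a quorum among n voters."""
    return n // 2 + 1

def tolerated_failures(n):
    """How many of n voters may fail while a quorum can still form."""
    return n - majority(n)
```

A 2-node group tolerates zero failures (majority of 2 is 2), which is why built-in failover pushes you to a third full Postgres node, whereas a separate 3-node DCS lets the two big database machines stay at two.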
Best regards, Andrey Borodin.
On 16 Apr 2025, at 09:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrey Borodin <x4mmm@yandex-team.ru> writes:
I think it's what Konstantin is proposing. To have our own Raft implementation, without dependencies.
Hmm, OK. I thought that the proposal involved relying on some existing
code, but re-reading the thread that was said nowhere. Still, that
moves it from a large project to a really large project :-(

I continue to think that it'd be best to try to implement it as
an extension, at least up till the point of finding show-stopping
reasons why it cannot be that.
I think I can provide some reasons why it can be neither an extension, nor any part running within the postmaster's reign.
1. When joining a cluster, there's no PGDATA to run postmaster on top of.
2. After failover, the old primary node must rejoin the cluster by running pg_rewind and following the timeline switch.
The system in hand must be able to manipulate PGDATA without starting Postgres.
My question to Konstantin is: why wouldn't you just add Raft to Patroni? Is there a reason why something like Patroni is not in core and no one rushes to get it in?
Everyone is using it, or a system like it.
Best regards, Andrey Borodin.