Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
Hi Hackers,
I am considering implementing an RPO (recovery point objective) enforcement
feature for Postgres, where WAL writes on the primary are stalled when the
WAL distance between the primary and the standby exceeds the configured
threshold (replica_lag_in_bytes). This feature is particularly useful in
disaster recovery setups where the primary and standby are in different
regions, synchronous replication can't be set up for latency and performance
reasons, and yet some level of RPO enforcement is required.
The idea here is to calculate the lag between the primary and the (async?)
standby server during XLogInsert and block the caller until the lag is
less than the threshold value. We can calculate the max lag by iterating
over ReplicationSlotCtl->replication_slots. If this is not something we
want to do in core, at least adding a hook for XLogInsert is of
great value.
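For illustration only, a minimal sketch of such a lag calculation could look
like the helper below. The name max_replica_lag_bytes() is made up, the slot's
restart_lsn is used as a conservative proxy for standby progress, and the
locking details and GetFlushRecPtr() signature vary between server versions.

/*
 * Hypothetical helper (not part of any posted patch): worst-case lag, in
 * bytes, across all in-use replication slots, using restart_lsn as a
 * conservative proxy for standby progress.
 */
#include "postgres.h"
#include "access/xlog.h"
#include "replication/slot.h"
#include "storage/lwlock.h"
#include "storage/spin.h"

static uint64
max_replica_lag_bytes(void)
{
    XLogRecPtr  flush = GetFlushRecPtr();   /* current local flush LSN */
    uint64      max_lag = 0;

    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        XLogRecPtr  restart_lsn;

        if (!s->in_use)
            continue;

        SpinLockAcquire(&s->mutex);
        restart_lsn = s->data.restart_lsn;
        SpinLockRelease(&s->mutex);

        if (!XLogRecPtrIsInvalid(restart_lsn) && flush > restart_lsn)
            max_lag = Max(max_lag, (uint64) (flush - restart_lsn));
    }
    LWLockRelease(ReplicationSlotControlLock);

    return max_lag;
}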
A few other scenarios I can think of with the hook are:
1. Enforcing RPO as described above
2. Enforcing rate limit and slow throttling when sync standby is falling
behind (could be flush lag or replay lag)
3. Transactional log rate governance - useful for cloud providers to
provide SKU sizes based on allowed WAL writes.
Thoughts?
Thanks,
Satya
On Thu, Dec 23, 2021 at 5:53 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
Hi Hackers,
I am considering implementing RPO (recovery point objective) enforcement feature for Postgres where the WAL writes on the primary are stalled when the WAL distance between the primary and standby exceeds the configured (replica_lag_in_bytes) threshold. This feature is useful particularly in the disaster recovery setups where primary and standby are in different regions and synchronous replication can't be set up for latency and performance reasons yet requires some level of RPO enforcement.
+1 for the idea in general. However, blocking writes on the primary seems
like an extremely radical idea. The replicas can fall behind transiently at
times, and blocking writes on the primary may cause applications to fail
during those transient periods. This is not a problem if the applications
have retry logic for their writes. How about blocking writes on the primary
only if the replicas fall behind the primary for a certain period of time?
The idea here is to calculate the lag between the primary and the standby (Async?) server during XLogInsert and block the caller until the lag is less than the threshold value. We can calculate the max lag by iterating over ReplicationSlotCtl->replication_slots.
The "falling behind" can also be quantified by the number of
write-transactions on the primary. I think it's good to have the users
choose what the "falling behind" means for them. We can have something
like the "recovery_target" param with different options name, xid,
time, lsn.
If this is not something we don't want to do in the core, at least adding a hook for XlogInsert is of great value.
IMHO, this feature may not be needed by everyone, so the hook-way seems
reasonable: postgres vendors can then provide different implementations
(for instance, an extension implementing the hook could block writes on
the primary, write some log messages, inform some service layer that the
replicas are falling behind the primary, etc.). Since XLogInsert is a hot
path that gets called very frequently, the hook really should do as little
work as possible; otherwise write latency may increase.
A few other scenarios I can think of with the hook are:
Enforcing RPO as described above
Enforcing rate limit and slow throttling when sync standby is falling behind (could be flush lag or replay lag)
Transactional log rate governance - useful for cloud providers to provide SKU sizes based on allowed WAL writes.
Thoughts?
The hook can help to achieve the above objectives but where to place
it and what parameters it should take as input (or what info it should
emit out of the server via the hook) are important too.
Having said that, the RPO feature can also be implemented outside of
postgres. A simple implementation could be: get the primary's current WAL
LSN using pg_current_wal_lsn() and all the replicas' restart_lsn from
pg_replication_slots; if they differ by a certain amount, then issue the
ALTER SYSTEM SET READ ONLY command [1] on the primary. This requires
connections to the server and proper access rights. The feature could also
be implemented as an extension (without the hook) which doesn't require any
connections to the server yet can access the required info (the primary's
current WAL LSN, the restart_lsn of the replication slots, etc.), but then
the RPO enforcement may not be immediate, as the server doesn't have any
hooks in XLogInsert or some other area.
[1]: /messages/by-id/CAAJ_b967uKBiW6gbHr5aPzweURYjEGv333FHVHxvJmMhanwHXA@mail.gmail.com
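To make the external approach concrete, here is a rough libpq-based sketch of
such a monitor. It is an illustration only: the connection string, the 1 GB
threshold, and the corrective action are assumptions, and the query simply
takes the worst-case restart_lsn lag across all slots.

/* External RPO monitor sketch using libpq (illustration only). */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    const char *conninfo = "dbname=postgres";         /* assumed primary */
    const long  threshold = 1024L * 1024 * 1024;      /* assumed 1 GB RPO budget */
    PGconn     *conn = PQconnectdb(conninfo);
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* worst-case byte lag across all slots, 0 if there are none */
    res = PQexec(conn,
                 "SELECT coalesce(max(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)), 0) "
                 "FROM pg_replication_slots WHERE restart_lsn IS NOT NULL");

    if (PQresultStatus(res) == PGRES_TUPLES_OK)
    {
        long    lag = atol(PQgetvalue(res, 0, 0));

        if (lag > threshold)
        {
            /* policy action goes here: alert, block writes (e.g. via [1]), ... */
            fprintf(stderr, "replica lag %ld bytes exceeds RPO threshold\n", lag);
        }
    }

    PQclear(res);
    PQfinish(conn);
    return 0;
}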
Regards,
Bharath Rupireddy.
On Thu, Dec 23, 2021 at 5:53 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
Hi Hackers,
I am considering implementing RPO (recovery point objective) enforcement feature for Postgres where the WAL writes on the primary are stalled when the WAL distance between the primary and standby exceeds the configured (replica_lag_in_bytes) threshold. This feature is useful particularly in the disaster recovery setups where primary and standby are in different regions and synchronous replication can't be set up for latency and performance reasons yet requires some level of RPO enforcement.
Limiting transaction rate when the standby falls behind is a good feature ...
The idea here is to calculate the lag between the primary and the standby (Async?) server during XLogInsert and block the caller until the lag is less than the threshold value. We can calculate the max lag by iterating over ReplicationSlotCtl->replication_slots. If this is not something we don't want to do in the core, at least adding a hook for XlogInsert is of great value.
but doing it in XLogInsert does not seem to be a good idea. It's a
common point for all kinds of logging, including VACUUM. We could
accidentally stall a critical VACUUM operation because of that.
As Bharath described, it would be better handled by application-level monitoring.
--
Best Wishes,
Ashutosh Bapat
On Thu, Dec 23, 2021 at 5:18 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:
On Thu, Dec 23, 2021 at 5:53 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
Hi Hackers,
I am considering implementing RPO (recovery point objective) enforcement
feature for Postgres where the WAL writes on the primary are stalled when
the WAL distance between the primary and standby exceeds the configured
(replica_lag_in_bytes) threshold. This feature is useful particularly in
the disaster recovery setups where primary and standby are in different
regions and synchronous replication can't be set up for latency and
performance reasons yet requires some level of RPO enforcement.
Limiting transaction rate when the standby falls behind is a good feature ...
The idea here is to calculate the lag between the primary and the
standby (Async?) server during XLogInsert and block the caller until the
lag is less than the threshold value. We can calculate the max lag by
iterating over ReplicationSlotCtl->replication_slots. If this is not
something we don't want to do in the core, at least adding a hook for
XlogInsert is of great value.
but doing it in XLogInsert does not seem to be a good idea.
Isn't XLogInsert the best place to throttle/govern in a simple and fair
way, particularly for long-running transactions on the server?
It's a
common point for all kinds of logging including VACUUM. We could
accidently stall a critical VACUUM operation because of that.
Agreed, but again this is a policy decision that the DBA can relax/enforce. I
expect the RPO to be in the range of a few hundred MBs to GBs, and on a
healthy system the lag typically never comes close to this value. The hook
implementation can take care of the nitty-gritty details of policy
enforcement based on the needs, for example: not throttling some backend
processes like vacuum and the checkpointer; throttling based on roles, for
example not throttling superuser connections; and throttling based on replay
lag, write lag, checkpoints taking longer, or being close to disk full. Each
of these can easily be translated into GUCs. Depending on the direction of
the thread on the hook vs. a feature in core, I can add more implementation
details.
As Bharath described, it would be better handled by application-level
monitoring.
Both RPO-based WAL throttling and application-level monitoring can co-exist,
as each one has its own merits and challenges. Otherwise, each application
developer has to implement their own throttling logic, and oftentimes it is
hard to get right.
--
Best Wishes,
Ashutosh Bapat
Please find the attached draft patch.
On Thu, Dec 23, 2021 at 2:47 AM Bharath Rupireddy <
bharath.rupireddyforpostgres@gmail.com> wrote:
On Thu, Dec 23, 2021 at 5:53 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
Hi Hackers,
I am considering implementing RPO (recovery point objective) enforcement
feature for Postgres where the WAL writes on the primary are stalled when
the WAL distance between the primary and standby exceeds the configured
(replica_lag_in_bytes) threshold. This feature is useful particularly in
the disaster recovery setups where primary and standby are in different
regions and synchronous replication can't be set up for latency and
performance reasons yet requires some level of RPO enforcement.
+1 for the idea in general. However, blocking writes on primary seems
an extremely radical idea. The replicas can fall behind transiently at
times and blocking writes on the primary may stop applications failing
for these transient times. This is not a problem if the applications
have retry logic for the writes. How about blocking writes on primary
if the replicas fall behind the primary for a certain period of time?
My proposal is to block the caller from writing until the lag situation
improves. I don't want to throw any errors and fail the transaction. I think
we are aligned?
The idea here is to calculate the lag between the primary and the
standby (Async?) server during XLogInsert and block the caller until the
lag is less than the threshold value. We can calculate the max lag by
iterating over ReplicationSlotCtl->replication_slots.
The "falling behind" can also be quantified by the number of
write-transactions on the primary. I think it's good to have the users
choose what the "falling behind" means for them. We can have something
like the "recovery_target" param with different options name, xid,
time, lsn.
The transactions can be of arbitrary size and length and these options may
not provide the desired results. Time is a worthy option to add.
If this is not something we don't want to do in the core, at least
adding a hook for XlogInsert is of great value.
IMHO, this feature may not be needed by everyone, the hook-way seems
reasonable so that the postgres vendors can provide different
implementations (for instance they can write an extension that
implements this hook which can block writes on primary, write some log
messages, inform some service layer of the replicas falling behind the
primary etc.). If we were to have the hook in XLogInsert which gets
called so frequently or XLogInsert is a hot-path, the hook really
should do as little work as possible, otherwise the write operations
latency may increase.
A Hook is a good start. If there is enough interest then an extension can
be added to the contrib module.
A few other scenarios I can think of with the hook are:
Enforcing RPO as described above
Enforcing rate limit and slow throttling when sync standby is falling behind (could be flush lag or replay lag)
Transactional log rate governance - useful for cloud providers to
provide SKU sizes based on allowed WAL writes.
Thoughts?
The hook can help to achieve the above objectives but where to place
it and what parameters it should take as input (or what info it should
emit out of the server via the hook) are important too.
XLogInsert, in my opinion, is the best place to call it, and the hook can be
something like "void xlog_insert_hook(void)", as all that the throttling
logic requires is the current flush position, which can be obtained
from GetFlushRecPtr, and the ReplicationSlotCtl. Attached is a draft patch.
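For reference, roughly what an extension using the hook from the attached
draft patch could look like is sketched below. max_replica_lag_bytes() is the
hypothetical helper sketched earlier, replica_lag_in_bytes is an assumed
policy knob, and, as pointed out later in the thread, XLogInsert() runs inside
a critical section, which makes sleeping here problematic.

/*
 * Sketch of an extension installing the xlog_insert_hook from the draft
 * patch.  Requires a server built with that patch; names and the polling
 * wait are assumptions, for illustration only.
 */
#include "postgres.h"
#include "fmgr.h"
#include "access/xlog.h"

PG_MODULE_MAGIC;

void        _PG_init(void);

extern uint64 max_replica_lag_bytes(void);  /* helper sketched earlier */

static int  replica_lag_in_bytes = 1024 * 1024 * 1024;  /* assumed GUC */
static xlog_insert_hook_type prev_xlog_insert_hook = NULL;

static void
throttle_xlog_insert(void)
{
    /* crude polling wait; only for illustration */
    while (max_replica_lag_bytes() > (uint64) replica_lag_in_bytes)
        pg_usleep(10000L);      /* 10 ms */

    if (prev_xlog_insert_hook)
        prev_xlog_insert_hook();
}

void
_PG_init(void)
{
    prev_xlog_insert_hook = xlog_insert_hook;
    xlog_insert_hook = throttle_xlog_insert;
}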
Having said all, the RPO feature can also be implemented outside of
the postgres, a simple implementation could be - get the primary
current wal lsn using pg_current_wal_lsn and all the replicas
restart_lsn using pg_replication_slot, if they differ by certain
amount, then issue ALTER SYSTEM SET READ ONLY command [1] on the
primary, this requires the connections to the server and proper access
rights. This feature can also be implemented as an extension (without
the hook) which doesn't require any connections to the server yet can
access the required info primary current_wal_lsn, restart_lsn of the
replication slots etc, but the RPO enforcement may not be immediate as
the server doesn't have any hooks in XLogInsert or some other area.
READ ONLY is a decent choice, but wouldn't it fail the writes, or not take
effect until the end of the transaction?
[1] - /messages/by-id/CAAJ_b967uKBiW6gbHr5aPzweURYjEGv333FHVHxvJmMhanwHXA@mail.gmail.com
Regards,
Bharath Rupireddy.
Attachments:
0001-Add-xlog_insert_hook-to-give-control-to-the-plugins.patch (application/octet-stream)
From 9b1b06fa5541aa643fcb8133bdd1e1bf4c433949 Mon Sep 17 00:00:00 2001
From: root <root@vm-pgsrc.voh31zlp2kzufgr23fvzpey3uf.xx.internal.cloudapp.net>
Date: Thu, 23 Dec 2021 21:43:56 +0000
Subject: [PATCH] Add xlog_insert_hook to give control to the plugins on the
WAL write actions. This helps solve scenarios like WAL rate governance,
throttling the writes based on the policies defined by the plugin owners.
---
src/backend/access/transam/xloginsert.c | 9 +++++++++
src/include/access/xlog.h | 5 +++++
2 files changed, 14 insertions(+)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 689384a411..c9c5bb4f1d 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -128,6 +128,9 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
uint16 hole_length, char *dest, uint16 *dlen);
+/* Hook for plugins to get control in xlog_insert() */
+xlog_insert_hook_type xlog_insert_hook = NULL;
+
/*
* Begin constructing a WAL record. This must be called before the
* XLogRegister* functions and XLogInsert().
@@ -456,6 +459,12 @@ XLogInsert(RmgrId rmid, uint8 info)
return EndPos;
}
+ /*
+ * Allow a plugin to take action on inserting a new WAL record.
+ */
+ if (xlog_insert_hook)
+ (*xlog_insert_hook)();
+
do
{
XLogRecPtr RedoRecPtr;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34f6c89f06..cf761fbc1c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -299,6 +299,7 @@ extern void BootStrapXLOG(void);
extern void LocalProcessControlFile(bool reset);
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
+extern void InitXLOGAccess(void);
extern void CreateCheckPoint(int flags);
extern bool CreateRestartPoint(int flags);
extern WALAvailability GetWALAvailability(XLogRecPtr targetLSN);
@@ -367,4 +368,8 @@ extern SessionBackupState get_backup_status(void);
/* files to signal promotion to primary */
#define PROMOTE_SIGNAL_FILE "promote"
+/* hook for plugins to get control in XlogInsert */
+typedef void (*xlog_insert_hook_type) (void);
+extern PGDLLIMPORT xlog_insert_hook_type xlog_insert_hook;
+
#endif /* XLOG_H */
--
2.17.1
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
XLogInsert in my opinion is the best place to call it and the hook can be
something like this "void xlog_insert_hook(NULL)" as all the throttling
logic required is the current flush position which can be obtained
from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch.
IMHO, it is not a good idea to call an external hook function inside a
critical section. Generally, we ensure that we do not call any code path
within a critical section which can throw an error, and if we start calling
an external hook then we lose that control. It should be blocked at the
operation level itself, e.g. ALTER TABLE READ ONLY, or by some other hook at
a little higher level.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Dec 24, 2021 at 4:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
XLogInsert in my opinion is the best place to call it and the hook can be something like this "void xlog_insert_hook(NULL)" as all the throttling logic required is the current flush position which can be obtained from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch.
IMHO, it is not a good idea to call an external hook function inside a critical section. Generally, we ensure that we do not call any code path within a critical section which can throw an error and if we start calling the external hook then we lose that control. It should be blocked at the operation level itself e.g. ALTER TABLE READ ONLY, or by some other hook at a little higher level.
Yeah, good point. It's not advisable to give control to an external
module inside a critical section. For instance, memory allocation isn't
allowed (see [1]), and ereport(ERROR, ...) would be promoted to PANIC inside
a critical section (see [2], [3]). Moreover, a critical section is meant to
be short-spanned, i.e. executing as little code as possible. There's no
guarantee that an external module would follow these rules.
I suggest we do it at the level of transaction start, i.e. when a txnid
is getting allocated, i.e. in AssignTransactionId(). If we do this,
when the limit for the throttling is exceeded, the current txn (even
if it is a long-running txn) continues to do its WAL insertions, and only
the next txns would get blocked. But this is okay and can be conveyed to
the users via documentation if need be. We already block txnid assignment
for parallel workers in this function, so this is a good choice IMO.
Thoughts?
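A rough sketch of what that throttle point could look like follows. The
function name, the GUC, and the polling loop are assumptions; waiting here
happens outside any critical section, though the backend may still hold
heavyweight locks.

/*
 * Hypothetical throttle point, called from AssignTransactionId() just
 * before a new XID is handed out.  Read-only backends never reach
 * AssignTransactionId(), so they are unaffected.
 */
#include "postgres.h"
#include "miscadmin.h"

extern uint64 max_replica_lag_bytes(void);  /* helper sketched earlier */
extern int  replica_lag_in_bytes;           /* assumed GUC */

static void
ThrottleXidAssignIfLagging(void)
{
    while (max_replica_lag_bytes() > (uint64) replica_lag_in_bytes)
    {
        CHECK_FOR_INTERRUPTS();  /* stay cancellable while waiting */
        pg_usleep(100000L);      /* recheck every 100 ms */
    }
}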
[1]:
/*
* You should not do memory allocations within a critical section, because
* an out-of-memory error will be escalated to a PANIC. To enforce that
* rule, the allocation functions Assert that.
*/
#define AssertNotInCriticalSection(context) \
Assert(CritSectionCount == 0 || (context)->allowInCritSection)
[2]:
/*
* If we are inside a critical section, all errors become PANIC
* errors. See miscadmin.h.
*/
if (CritSectionCount > 0)
elevel = PANIC;
[3]:
* A related, but conceptually distinct, mechanism is the "critical section"
* mechanism. A critical section not only holds off cancel/die interrupts,
* but causes any ereport(ERROR) or ereport(FATAL) to become ereport(PANIC)
* --- that is, a system-wide reset is forced. Needless to say, only really
* *critical* code should be marked as a critical section! Currently, this
* mechanism is only used for XLOG-related code.
Regards,
Bharath Rupireddy.
On Fri, Dec 24, 2021 at 3:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
XLogInsert in my opinion is the best place to call it and the hook can be
something like this "void xlog_insert_hook(NULL)" as all the throttling
logic required is the current flush position which can be obtained
from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch.
IMHO, it is not a good idea to call an external hook function inside a
critical section. Generally, we ensure that we do not call any code path
within a critical section which can throw an error and if we start calling
the external hook then we lose that control.
Thank you for the comment. XLogInsertRecord is inside a critical section
but not XLogInsert. Am I missing something?
It should be blocked at the operation level itself e.g. ALTER TABLE READ
ONLY, or by some other hook at a little higher level.
There is a lot of maintenance overhead with a custom implementation at the
individual database and table level. This doesn't provide the necessary
control that I am looking for.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sun, Dec 26, 2021 at 3:52 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
On Fri, Dec 24, 2021 at 3:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
XLogInsert in my opinion is the best place to call it and the hook can
be something like this "void xlog_insert_hook(NULL)" as all the throttling
logic required is the current flush position which can be obtained
from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch.
IMHO, it is not a good idea to call an external hook function inside a
critical section. Generally, we ensure that we do not call any code path
within a critical section which can throw an error and if we start calling
the external hook then we lose that control.
Thank you for the comment. XLogInsertRecord is inside a critical section
but not XLogInsert. Am I missing something?
Actually, all the WAL insertions are done under a critical section (with a
few exceptions); that means if you look at all the references of XLogInsert(),
it is always called under a critical section, and that is my main worry
about hooking at the XLogInsert level.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sat, Dec 25, 2021 at 6:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sun, Dec 26, 2021 at 3:52 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
XLogInsert in my opinion is the best place to call it and the hook can
be something like this "void xlog_insert_hook(NULL)" as all the throttling
logic required is the current flush position which can be obtained
from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch.
IMHO, it is not a good idea to call an external hook function inside a
critical section. Generally, we ensure that we do not call any code path
within a critical section which can throw an error and if we start calling
the external hook then we lose that control.
Thank you for the comment. XLogInsertRecord is inside a critical section
but not XLogInsert. Am I missing something?
Actually all the WAL insertions are done under a critical section (except
few exceptions), that means if you see all the references of XLogInsert(),
it is always called under the critical section and that is my main worry
about hooking at XLogInsert level.
Got it, understood the concern. But can we document the limitations of the
hook and let the hook take care of it? I don't expect an error to be thrown
here since we are not planning to allocate memory or make file system calls
but instead look at the shared memory state and add delays when required.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sun, Dec 26, 2021 at 1:06 PM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
Got it, understood the concern. But can we document the limitations of the hook and let the hook take care of it? I don't expect an error to be thrown here since we are not planning to allocate memory or make file system calls but instead look at the shared memory state and add delays when required.
It wouldn't work. You can't make any assumption about how long it
would take for the replication lag to resolve, so you may have to wait
for a very long time. It means that at the very least the sleep has
to be interruptible and therefore can raise an error. In general
there isn't much you can do in a critical section, so this approach
doesn't seem sensible to me.
On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
Actually all the WAL insertions are done under a critical section (except
few exceptions), that means if you see all the references of XLogInsert(),
it is always called under the critical section and that is my main worry
about hooking at XLogInsert level.
Got it, understood the concern. But can we document the limitations of the
hook and let the hook take care of it? I don't expect an error to be thrown
here since we are not planning to allocate memory or make file system calls
but instead look at the shared memory state and add delays when required.
Yet another problem is that if we are in XLogInsert(), that means we are
holding the buffer locks on all the pages we have modified, so if we add a
hook at that level which can make it wait, then we would also block any of
the read operations that need to read from those buffers. I haven't thought
of what could be a better way to do this, but this is certainly not good.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
Actually all the WAL insertions are done under a critical section
(except few exceptions), that means if you see all the references of
XLogInsert(), it is always called under the critical section and that is my
main worry about hooking at XLogInsert level.
Got it, understood the concern. But can we document the limitations of
the hook and let the hook take care of it? I don't expect an error to be
thrown here since we are not planning to allocate memory or make file
system calls but instead look at the shared memory state and add delays
when required.
Yet another problem is that if we are in XlogInsert() that means we are
holding the buffer locks on all the pages we have modified, so if we add a
hook at that level which can make it wait then we would also block any of
the read operations needed to read from those buffers. I haven't thought
what could be better way to do this but this is certainly not good.
Yes, this is a problem. The other approach is adding a hook at
XLogWrite/XLogFlush? All the other backends will be waiting behind the
WALWriteLock. The process that is performing the write enters into a busy
loop with small delays until the criteria are met. Inability to process the
interrupts inside the critical section is a challenge in both approaches.
Any other thoughts?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Greetings,
* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
Actually all the WAL insertions are done under a critical section
(except few exceptions), that means if you see all the references of
XLogInsert(), it is always called under the critical section and that is my
main worry about hooking at XLogInsert level.
Got it, understood the concern. But can we document the limitations of
the hook and let the hook take care of it? I don't expect an error to be
thrown here since we are not planning to allocate memory or make file
system calls but instead look at the shared memory state and add delays
when required.
Yet another problem is that if we are in XlogInsert() that means we are
holding the buffer locks on all the pages we have modified, so if we add a
hook at that level which can make it wait then we would also block any of
the read operations needed to read from those buffers. I haven't thought
what could be better way to do this but this is certainly not good.
Yes, this is a problem. The other approach is adding a hook at
XLogWrite/XLogFlush? All the other backends will be waiting behind the
WALWriteLock. The process that is performing the write enters into a busy
loop with small delays until the criteria are met. Inability to process the
interrupts inside the critical section is a challenge in both approaches.
Any other thoughts?
Why not have this work the exact same way sync replicas do, except that
it's based off of some byte/time lag for some set of async replicas?
That is, in RecordTransactionCommit(), perhaps right after the
SyncRepWaitForLSN() call, or maybe even add this to that function? Sure
seems like there's a lot of similarity.
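To make the shape of that concrete, a hypothetical AsyncRepWaitForLag(),
loosely modelled on SyncRepWaitForLSN() and called from
RecordTransactionCommit() right after it, might look like the sketch below.
The names, the GUC, and the polling loop are assumptions, not anyone's patch;
a real version would sleep on a latch and be woken by walsenders, and as
Andres argues further down, the wait may need to happen only after the
transaction has actually committed.

/*
 * Hypothetical AsyncRepWaitForLag(), intended to be called from
 * RecordTransactionCommit() right after SyncRepWaitForLSN().
 */
#include "postgres.h"
#include "miscadmin.h"
#include "access/xlogdefs.h"

extern uint64 max_replica_lag_bytes(void);  /* helper sketched earlier */

void
AsyncRepWaitForLag(XLogRecPtr commit_lsn, uint64 limit_bytes)
{
    /* a finer-grained check would compare each slot's LSN to commit_lsn */
    (void) commit_lsn;

    while (max_replica_lag_bytes() > limit_bytes)
    {
        CHECK_FOR_INTERRUPTS(); /* remain cancellable, like the syncrep wait */
        pg_usleep(100000L);     /* recheck every 100 ms */
    }
}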
Thanks,
Stephen
Stephen, thank you!
On Wed, Dec 29, 2021 at 5:46 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com>
wrote:
On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
Actually all the WAL insertions are done under a critical section
(except few exceptions), that means if you see all the references of
XLogInsert(), it is always called under the critical section and that is my
main worry about hooking at XLogInsert level.
Got it, understood the concern. But can we document the limitations of
the hook and let the hook take care of it? I don't expect an error to be
thrown here since we are not planning to allocate memory or make file
system calls but instead look at the shared memory state and add delays
when required.
Yet another problem is that if we are in XlogInsert() that means we are
holding the buffer locks on all the pages we have modified, so if we add a
hook at that level which can make it wait then we would also block any
of
the read operations needed to read from those buffers. I haven't
thought
what could be better way to do this but this is certainly not good.
Yes, this is a problem. The other approach is adding a hook at
XLogWrite/XLogFlush? All the other backends will be waiting behind the
WALWriteLock. The process that is performing the write enters into a busy
loop with small delays until the criteria are met. Inability to process the
interrupts inside the critical section is a challenge in both approaches.
Any other thoughts?
Why not have this work the exact same way sync replicas do, except that
it's based off of some byte/time lag for some set of async replicas?
That is, in RecordTransactionCommit(), perhaps right after the
SyncRepWaitForLSN() call, or maybe even add this to that function? Sure
seems like there's a lot of similarity.
I was thinking of achieving log governance (throttling WAL MB/sec) and also
providing RPO guarantees. In this model, it is hard to throttle WAL
generation of a long running transaction (for example copy/select into).
However, this meets my RPO needs. Are you in support of adding a hook or
the actual change? IMHO, the hook allows more creative options. I can go
ahead and make a patch accordingly.
Thanks,
Stephen
Greetings,
On Wed, Dec 29, 2021 at 14:04 SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
Stephen, thank you!
On Wed, Dec 29, 2021 at 5:46 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com>
wrote:
On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
Actually all the WAL insertions are done under a critical section
(except few exceptions), that means if you see all the references of
XLogInsert(), it is always called under the critical section and that is my
main worry about hooking at XLogInsert level.
Got it, understood the concern. But can we document the limitations
of
the hook and let the hook take care of it? I don't expect an error
to be
thrown here since we are not planning to allocate memory or make file
system calls but instead look at the shared memory state and add delays
when required.
Yet another problem is that if we are in XlogInsert() that means we
are
holding the buffer locks on all the pages we have modified, so if we
add a
hook at that level which can make it wait then we would also block
any of
the read operations needed to read from those buffers. I haven't
thought
what could be better way to do this but this is certainly not good.
Yes, this is a problem. The other approach is adding a hook at
XLogWrite/XLogFlush? All the other backends will be waiting behind the
WALWriteLock. The process that is performing the write enters into a busy
loop with small delays until the criteria are met. Inability to process
the
interrupts inside the critical section is a challenge in both
approaches.
Any other thoughts?
Why not have this work the exact same way sync replicas do, except that
it's based off of some byte/time lag for some set of async replicas?
That is, in RecordTransactionCommit(), perhaps right after the
SyncRepWaitForLSN() call, or maybe even add this to that function? Sure
seems like there's a lot of similarity.
I was thinking of achieving log governance (throttling WAL MB/sec) and
also providing RPO guarantees. In this model, it is hard to throttle WAL
generation of a long running transaction (for example copy/select into).
Long running transactions have a lot of downsides and are best discouraged.
I don’t know that we should be designing this for that case specifically,
particularly given the complications it would introduce as discussed on
this thread already.
However, this meets my RPO needs. Are you in support of adding a hook or
the actual change? IMHO, the hook allows more creative options. I can go
ahead and make a patch accordingly.
I would think this would make more sense as part of core rather than a
hook, as that then requires an extension and additional setup to get going,
which raises the bar quite a bit when it comes to actually being used.
Thanks,
Stephen
On Wed, Dec 29, 2021 at 11:16 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
On Wed, Dec 29, 2021 at 14:04 SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:Stephen, thank you!
On Wed, Dec 29, 2021 at 5:46 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com>
wrote:
On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
Actually all the WAL insertions are done under a critical section
(except few exceptions), that means if you see all the references of
XLogInsert(), it is always called under the critical section and
that is my
main worry about hooking at XLogInsert level.
Got it, understood the concern. But can we document the limitations
of
the hook and let the hook take care of it? I don't expect an error
to be
thrown here since we are not planning to allocate memory or make
file
system calls but instead look at the shared memory state and add
delays
when required.
Yet another problem is that if we are in XlogInsert() that means we
are
holding the buffer locks on all the pages we have modified, so if we
add a
hook at that level which can make it wait then we would also block
any of
the read operations needed to read from those buffers. I haven't
thought
what could be better way to do this but this is certainly not good.
Yes, this is a problem. The other approach is adding a hook at
XLogWrite/XLogFlush? All the other backends will be waiting behind the
WALWriteLock. The process that is performing the write enters into a busy
loop with small delays until the criteria are met. Inability to
process the
interrupts inside the critical section is a challenge in both
approaches.
Any other thoughts?
Why not have this work the exact same way sync replicas do, except that
it's based off of some byte/time lag for some set of async replicas?
That is, in RecordTransactionCommit(), perhaps right after the
SyncRepWaitForLSN() call, or maybe even add this to that function? Sure
seems like there's a lot of similarity.
I was thinking of achieving log governance (throttling WAL MB/sec) and
also providing RPO guarantees. In this model, it is hard to throttle WAL
generation of a long running transaction (for example copy/select into).
Long running transactions have a lot of downsides and are best
discouraged. I don’t know that we should be designing this for that case
specifically, particularly given the complications it would introduce as
discussed on this thread already.However, this meets my RPO needs. Are you in support of adding a hook or
the actual change? IMHO, the hook allows more creative options. I can go
ahead and make a patch accordingly.
I would think this would make more sense as part of core rather than a
hook, as that then requires an extension and additional setup to get going,
which raises the bar quite a bit when it comes to actually being used.
Sounds good, I will work on making the changes accordingly.
Thanks,
Stephen
Hi,
On 2021-12-27 16:40:28 -0800, SATYANARAYANA NARLAPURAM wrote:
Yet another problem is that if we are in XlogInsert() that means we are
holding the buffer locks on all the pages we have modified, so if we add a
hook at that level which can make it wait then we would also block any of
the read operations needed to read from those buffers. I haven't thought
what could be better way to do this but this is certainly not good.
Yes, this is a problem. The other approach is adding a hook at
XLogWrite/XLogFlush?
That's pretty much the same - XLogInsert() can trigger an
XLogWrite()/Flush().
I think it's a complete no-go to add throttling to these places. It's quite
possible that it'd cause new deadlocks, and it's almost guaranteed to have
unintended consequences (e.g. replication falling back further because
XLogFlush() is being throttled).
I also don't think it's a sane thing to add hooks to these places. It's
complicated enough as-is, adding the chance for random other things to happen
during such crucial operations will make it even harder to maintain.
Greetings,
Andres Freund
On Wed, Dec 29, 2021 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-27 16:40:28 -0800, SATYANARAYANA NARLAPURAM wrote:
Yet another problem is that if we are in XlogInsert() that means we are
holding the buffer locks on all the pages we have modified, so if we add a
hook at that level which can make it wait then we would also block any
of
the read operations needed to read from those buffers. I haven't
thought
what could be better way to do this but this is certainly not good.
Yes, this is a problem. The other approach is adding a hook at
XLogWrite/XLogFlush?
That's pretty much the same - XLogInsert() can trigger an
XLogWrite()/Flush().
I think it's a complete no-go to add throttling to these places. It's quite
possible that it'd cause new deadlocks, and it's almost guaranteed to have
unintended consequences (e.g. replication falling back further because
XLogFlush() is being throttled).I also don't think it's a sane thing to add hooks to these places. It's
complicated enough as-is, adding the chance for random other things to
happen
during such crucial operations will make it even harder to maintain.
Andres, thanks for the comments. Agreed on this based on the previous
discussions on this thread. Could you please share your thoughts on adding
it after SyncRepWaitForLSN()?
Greetings,
Andres Freund
Hi,
On 2021-12-29 11:34:53 -0800, SATYANARAYANA NARLAPURAM wrote:
On Wed, Dec 29, 2021 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
Andres, thanks for the comments. Agreed on this based on the previous
discussions on this thread. Could you please share your thoughts on adding
it after SyncRepWaitForLSN()?
I don't think that's good either - you're delaying transaction commit
(i.e. xact becoming visible / locks being released). That also has the danger
of increasing lock contention (albeit more likely to be heavyweight locks /
serializable state). It'd have to be after the transaction actually committed.
Greetings,
Andres Freund
On Thu, Dec 30, 2021 at 1:09 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-29 11:34:53 -0800, SATYANARAYANA NARLAPURAM wrote:
On Wed, Dec 29, 2021 at 11:31 AM Andres Freund <andres@anarazel.de>
wrote:
Andres, thanks for the comments. Agreed on this based on the previous
discussions on this thread. Could you please share your thoughts on adding
it after SyncRepWaitForLSN()?
I don't think that's good either - you're delaying transaction commit
(i.e. xact becoming visible / locks being released).
Agree with that.
That also has the danger
of increasing lock contention (albeit more likely to be heavyweight locks /
serializable state). It'd have to be after the transaction actually
committed.
Yeah, I think that would make sense, even though we would be allowing a new
backend to get connected, insert WAL, and get committed, but after that it
will be throttled. However, if the number of max connections is very
high, then even after we detect a lag a significant amount of WAL could
be generated, even if we keep long-running transactions aside. But I think
it will still serve the purpose of what Satya is trying to achieve.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 29, 2021 at 10:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Dec 30, 2021 at 1:09 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-29 11:34:53 -0800, SATYANARAYANA NARLAPURAM wrote:
On Wed, Dec 29, 2021 at 11:31 AM Andres Freund <andres@anarazel.de>
wrote:
Andres, thanks for the comments. Agreed on this based on the previous
discussions on this thread. Could you please share your thoughts on adding
it after SyncRepWaitForLSN()?
I don't think that's good either - you're delaying transaction commit
(i.e. xact becoming visible / locks being released).
Agree with that.
That also has the danger
of increasing lock contention (albeit more likely to be heavyweight locks
/
serializable state). It'd have to be after the transaction actually
committed.
Yeah, I think that would make sense, even though we will be allowing a new
backend to get connected insert WAL, and get committed but after that, it
will be throttled. However, if the number of max connections will be very
high then even after we detected a lag there a significant amount WAL could
be generated, even if we keep long-running transactions aside. But I think
still it will serve the purpose of what Satya is trying to achieve.
I am afraid there are problems with making the RPO check after releasing the
locks. By this time the transaction is committed and visible to the other
backends (ProcArrayEndTransaction is already called), even though the
intention is to block committing transactions that violate the defined RPO.
Even though we block existing connections from starting a new transaction, it
is possible to do writes by opening a new connection or canceling the query. I
am not too worried about the lock contention, as the system is already
hosed because of the policy. This behavior is very similar to what
happens when the sync standby is not responding. Thoughts?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 30, 2021 at 12:36 PM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
Yeah, I think that would make sense, even though we will be allowing a
new backend to get connected insert WAL, and get committed but after that,
it will be throttled. However, if the number of max connections will be
very high then even after we detected a lag there a significant amount WAL
could be generated, even if we keep long-running transactions aside. But I
think still it will serve the purpose of what Satya is trying to achieve.
I am afraid there are problems with making the RPO check post releasing
the locks. By this time the transaction is committed and visible to the
other backends (ProcArrayEndTransaction is already called) though the
intention is to block committing transactions that violate the defined RPO.
Even though we block existing connections starting a new transaction, it is
possible to do writes by opening a new connection / canceling the query. I
am not too much worried about the lock contention as the system is already
hosed because of the policy. This behavior is very similar to what
happens when the Sync standby is not responding. Thoughts?
Yeah, that's true, but even if we block the transactions from
committing, it is still possible that a new connection comes in and
generates more WAL. I agree with the other part, though: if you
throttle after committing, then the user can cancel the queries and generate
more WAL from those sessions as well. But that is an extreme case where
application developers want to bypass the throttling and generate
more WAL.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 30, 2021 at 1:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Dec 30, 2021 at 12:36 PM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
Yeah, I think that would make sense, even though we will be allowing a new backend to get connected insert WAL, and get committed but after that, it will be throttled. However, if the number of max connections will be very high then even after we detected a lag there a significant amount WAL could be generated, even if we keep long-running transactions aside. But I think still it will serve the purpose of what Satya is trying to achieve.
I am afraid there are problems with making the RPO check post releasing the locks. By this time the transaction is committed and visible to the other backends (ProcArrayEndTransaction is already called) though the intention is to block committing transactions that violate the defined RPO. Even though we block existing connections starting a new transaction, it is possible to do writes by opening a new connection / canceling the query. I am not too much worried about the lock contention as the system is already hosed because of the policy. This behavior is very similar to what happens when the Sync standby is not responding. Thoughts?
Yeah, that's true, but even if we are blocking the transactions from committing then also it is possible that a new connection can come and generate more WAL, yeah but I agree with the other part that if you throttle after committing then the user can cancel the queries and generate more WAL from those sessions as well. But that is an extreme case where application developers want to bypass the throttling and want to generate more WALs.
How about having the new hook at the start of the new txn? If we do
this, when the limit for the throttling is exceeded, the current txn
(even if it is a long running one) continues to do the WAL insertions,
the next txns would get blocked. Thoughts?
Regards,
Bharath Rupireddy.
On Thu, Dec 30, 2021 at 1:41 PM Bharath Rupireddy <
bharath.rupireddyforpostgres@gmail.com> wrote:
Yeah, that's true, but even if we are blocking the transactions from
committing then also it is possible that a new connection can come and
generate more WAL, yeah but I agree with the other part that if you
throttle after committing then the user can cancel the queries and generate
more WAL from those sessions as well. But that is an extreme case where
application developers want to bypass the throttling and want to generate
more WALs.
How about having the new hook at the start of the new txn? If we do
this, when the limit for the throttling is exceeded, the current txn
(even if it is a long running one) continues to do the WAL insertions,
the next txns would get blocked. Thoughts?
Do you mean while StartTransactionCommand or while assigning a new
transaction id? If it is at StartTransactionCommand, then we would be
blocking the sessions which are only performing read queries, right? If we
are doing it at the transaction assignment level, then we might be holding
some of the locks, so this might not be any better than throttling inside the
commit.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 30, 2021 at 12:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Dec 30, 2021 at 1:41 PM Bharath Rupireddy <
bharath.rupireddyforpostgres@gmail.com> wrote:
Yeah, that's true, but even if we are blocking the transactions from
committing then also it is possible that a new connection can come and
generate more WAL, yeah but I agree with the other part that if you
throttle after committing then the user can cancel the queries and generate
more WAL from those sessions as well. But that is an extreme case where
application developers want to bypass the throttling and want to generate
more WALs.
How about having the new hook at the start of the new txn? If we do
this, when the limit for the throttling is exceeded, the current txn
(even if it is a long running one) continues to do the WAL insertions,
the next txns would get blocked. Thoughts?
Do you mean while StartTransactionCommand or while assigning a new
transaction id? If it is at StartTransactionCommand then we would be
blocking the sessions which are only performing read queries right?
Definitely not at StartTransactionCommand, but possibly while assigning a
transaction Id in AssignTransactionId. Blocking readers is never the intent.
If we are doing at the transaction assignment level then we might be
holding some of the locks so this might not be any better than throttling
inside the commit.
If we define RPO as "no transaction can commit when the wal_distance is more
than the configured MB", we have to throttle the writes before committing the
transaction, and new WAL generation by new or active connections doesn't
matter, as those transactions can't be committed and made visible to the
user. If the RPO is defined as "no new write transactions allowed when
wal_distance > configured MB", then we can block assigning new transaction
IDs until the RPO policy is met. IMHO, following the sync replication
semantics is easier and more explainable, as it is already familiar to the
customers.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
On 2021-12-29 23:06:31 -0800, SATYANARAYANA NARLAPURAM wrote:
I am afraid there are problems with making the RPO check post releasing the
locks. By this time the transaction is committed and visible to the other
backends (ProcArrayEndTransaction is already called) though the intention
is to block committing transactions that violate the defined RPO.
Shrug. Anything transaction based has way bigger holes than this.
Even though we block existing connections starting a new transaction, it is
possible to do writes by opening a new connection / canceling the query.
If your threat model is users explicitly trying to circumvent this they can
cause problems much more easily. Trigger a bunch of vacuums, big COPYs etc.
I am not too much worried about the lock contention as the system is already
hosed because of the policy. This behavior is very similar to what happens
when the Sync standby is not responding. Thoughts?
I don't see why we'd bury ourselves deeper in problems just because we already
have a problem. There are reasons why we want the delay for syncrep to be
before xact completion - but I don't see those applying to WAL throttling to a
significant degree, particularly not when it's done on a transaction level.
Greetings,
Andres Freund
On Wed, Dec 22, 2021 at 4:23 PM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
Hi Hackers,
I am considering implementing RPO (recovery point objective) enforcement
feature for Postgres where the WAL writes on the primary are stalled when
the WAL distance between the primary and standby exceeds the configured
(replica_lag_in_bytes) threshold. This feature is useful particularly in
the disaster recovery setups where primary and standby are in different
regions and synchronous replication can't be set up for latency and
performance reasons yet requires some level of RPO enforcement.
The idea here is to calculate the lag between the primary and the standby
(Async?) server during XLogInsert and block the caller until the lag is
less than the threshold value. We can calculate the max lag by iterating
over ReplicationSlotCtl->replication_slots. If this is not something we
don't want to do in the core, at least adding a hook for XlogInsert is of
great value.
A few other scenarios I can think of with the hook are:
1. Enforcing RPO as described above
2. Enforcing rate limit and slow throttling when sync standby is
falling behind (could be flush lag or replay lag)
3. Transactional log rate governance - useful for cloud providers to
provide SKU sizes based on allowed WAL writes.
Thoughts?
A very similar requirement was discussed in the past in [1]: not
exactly RPO enforcement, but a large bulk operation/transaction negatively
impacting concurrent transactions due to replication lag.
It would be good to refer to that thread, as it explains the challenges of
implementing the functionality mentioned in this thread, the main challenge
being that there is no common place to put the throttling logic, instead
requiring calls to be sprinkled around in various parts.
[1] /messages/by-id/CA+U5nMLfxBgHQ1VLSeBHYEMjHXz_OHSkuFdU6_1quiGM0RNKEg@mail.gmail.com
Hi,
On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
That's pretty much the same - XLogInsert() can trigger an
XLogWrite()/Flush().
I think it's a complete no-go to add throttling to these places. It's quite
possible that it'd cause new deadlocks, and it's almost guaranteed to have
unintended consequences (e.g. replication falling back further because
XLogFlush() is being throttled).
I thought of another way to implement this feature. What if we checked the
current distance somewhere within XLogInsert(), but only set
InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
check if XLogDelayPending is true and sleep the appropriate time.
That way the sleep doesn't happen with important locks held / within a
critical section, but we still delay close to where we went over the maximum
lag. And the overhead should be fairly minimal.
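A rough sketch of that deferred-delay mechanism is below. XLogDelayPending and
the helper functions are hypothetical names; InterruptPending and
ProcessInterrupts() are the existing interrupt machinery, and the comments
only indicate where the pieces would be called from.

/*
 * Sketch of the mechanism described above: the check in XLogInsert() stays
 * cheap, and the sleep happens later, outside the critical section.
 */
#include "postgres.h"
#include "miscadmin.h"

extern uint64 max_replica_lag_bytes(void);  /* helper sketched earlier */
extern int  replica_lag_in_bytes;           /* assumed GUC */

/* set when the lag budget is exceeded */
volatile sig_atomic_t XLogDelayPending = false;

/* called from XLogInsert() after the record has been inserted */
static void
XLogCheckReplicaLag(void)
{
    if (max_replica_lag_bytes() > (uint64) replica_lag_in_bytes)
    {
        InterruptPending = true;
        XLogDelayPending = true;
    }
}

/* called from ProcessInterrupts(), alongside the other pending-flag checks */
static void
XLogProcessDelayPending(void)
{
    XLogDelayPending = false;
    /* a real patch would use an interruptible latch wait with a wait event */
    while (max_replica_lag_bytes() > (uint64) replica_lag_in_bytes)
        pg_usleep(10000L);      /* 10 ms */
}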
I'm doubtful that implementing the waits on a transactional level provides a
meaningful enough amount of control - there's just too much WAL that can be
generated within a transaction.
Greetings,
Andres Freund
On Wed, Jan 5, 2022 at 11:16 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
That's pretty much the same - XLogInsert() can trigger an
XLogWrite()/Flush().
I think it's a complete no-go to add throttling to these places. It's quite
possible that it'd cause new deadlocks, and it's almost guaranteed to have
unintended consequences (e.g. replication falling back further because
XLogFlush() is being throttled).
I thought of another way to implement this feature. What if we checked the
current distance somewhere within XLogInsert(), but only set
InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
check if XLogDelayPending is true and sleep the appropriate time.
That way the sleep doesn't happen with important locks held / within a
critical section, but we still delay close to where we went over the maximum
lag. And the overhead should be fairly minimal.
+1, this sounds like a really good idea to me.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 5, 2022 at 9:46 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
That's pretty much the same - XLogInsert() can trigger an
XLogWrite()/Flush().
I think it's a complete no-go to add throttling to these places. It's quite
possible that it'd cause new deadlocks, and it's almost guaranteed to have
unintended consequences (e.g. replication falling back further because
XLogFlush() is being throttled).
I thought of another way to implement this feature. What if we checked the
current distance somewhere within XLogInsert(), but only set
InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
check if XLogDelayPending is true and sleep the appropriate time.
That way the sleep doesn't happen with important locks held / within a
critical section, but we still delay close to where we went over the maximum
lag. And the overhead should be fairly minimal.
+1 to the idea; this way we can throttle large and small transactions
fairly, in the same way. I will work on this model and share the patch.
Please note that the lock contention still exists in this case.
I'm doubtful that implementing the waits on a transactional level provides a
meaningful enough amount of control - there's just too much WAL that can be
generated within a transaction.
Greetings,
Andres Freund
On Thu, Jan 6, 2022 at 11:27 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
On Wed, Jan 5, 2022 at 9:46 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
That's pretty much the same - XLogInsert() can trigger an
XLogWrite()/Flush().
I think it's a complete no-go to add throttling to these places. It's quite
possible that it'd cause new deadlocks, and it's almost guaranteed to have
unintended consequences (e.g. replication falling back further because
XLogFlush() is being throttled).
I thought of another way to implement this feature. What if we checked the
current distance somewhere within XLogInsert(), but only set
InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
check if XLogDelayPending is true and sleep the appropriate time.
That way the sleep doesn't happen with important locks held / within a
critical section, but we still delay close to where we went over the maximum
lag. And the overhead should be fairly minimal.
+1 to the idea, this way we can fairly throttle large and smaller
transactions the same way. I will work on this model and share the patch.
Please note that the lock contention still exists in this case.
Generally, while checking for interrupts we should not be holding any
lwlock, which means there is no risk of holding buffer locks, so any
other reader can continue to read from those buffers. We will only be
holding some heavyweight locks, like relation/tuple locks, but those
will not impact anyone except writers trying to update the same tuple,
or DDL that wants to modify the table definition. So I don't think we
have any issue here, because we don't want other writers to continue
anyway.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 5, 2022 at 10:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Jan 6, 2022 at 11:27 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:On Wed, Jan 5, 2022 at 9:46 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
That's pretty much the same - XLogInsert() can trigger an
XLogWrite()/Flush().
I think it's a complete no-go to add throttling to these places. It's quite
possible that it'd cause new deadlocks, and it's almost guaranteed to have
unintended consequences (e.g. replication falling back further because
XLogFlush() is being throttled).
I thought of another way to implement this feature. What if we checked the
current distance somewhere within XLogInsert(), but only set
InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
check if XLogDelayPending is true and sleep the appropriate time.
That way the sleep doesn't happen with important locks held / within a
critical section, but we still delay close to where we went over the maximum
lag. And the overhead should be fairly minimal.
+1 to the idea, this way we can fairly throttle large and smaller
transactions the same way. I will work on this model and share the patch.
Please note that the lock contention still exists in this case.
Generally while checking for the interrupt we should not be holding
any lwlock that means we don't have the risk of holding any buffer
locks, so any other reader can continue to read from those buffers.
We will only be holding some heavyweight locks like relation/tuple
lock but that will not impact anyone except the writers trying to
update the same tuple or the DDL who want to modify the table
definition so I don't think we have any issue here because anyway we
don't want other writers to continue.
Yes, it should be OK. I was just making it explicit, following Andres'
earlier comment about holding heavyweight locks while waiting before the
commit.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
I noticed this thread and thought I'd share my experiences building
something similar for Multi-AZ DB clusters [0]. It's not a strict RPO
mechanism, but it does throttle backends in an effort to keep the
replay lag below a configured maximum. I can share the code if there
is interest.
I wrote it as a new extension, and except for one piece that I'll go
into later, I was able to avoid changes to core PostgreSQL code. The
extension manages a background worker that periodically assesses the
state of the designated standbys and updates an atomic in shared
memory that indicates how long to delay. A transaction callback
checks this value and sleeps as necessary. Delay can be injected for
write-enabled transactions on the primary, read-only transactions on
the standbys, or both. The extension is heavily configurable so that
it can meet the needs of a variety of workloads.
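A stripped-down sketch of that shape (illustrative only, not the actual
extension code; all names here are made up):

    #include "postgres.h"
    #include "access/xact.h"
    #include "miscadmin.h"
    #include "port/atomics.h"

    typedef struct ThrottleShmem
    {
        pg_atomic_uint64 delay_us;      /* delay published by the bgworker */
    } ThrottleShmem;

    /* points into shared memory, set up in the extension's shmem hook */
    static ThrottleShmem *throttle_state;

    /* registered in each backend via RegisterXactCallback() */
    static void
    throttle_xact_callback(XactEvent event, void *arg)
    {
        if (event == XACT_EVENT_PRE_COMMIT)
        {
            uint64      delay = pg_atomic_read_u64(&throttle_state->delay_us);

            if (delay > 0)
                pg_usleep((long) delay);    /* inject the computed delay */
        }
    }

The background worker periodically recomputes the delay from the standbys'
state and publishes it with pg_atomic_write_u64().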
One interesting challenge I encountered was accurately determining the
amount of replay lag. The problem was twofold. First, if there is no
activity on the primary, there will be nothing to replay on the
standbys, so the replay lag will appear to grow unbounded. To work
around this, the extension's background worker periodically creates an
empty COMMIT record. Second, if a standby reconnects after a long
time, the replay lag won't be accurate for some time. Instead, the
replay lag will slowly increase until it reaches the correct value.
Since the delay calculation looks at the trend of the replay lag, this
apparent unbounded growth causes it to inject far more delay than is
necessary. My guess is that this is related to 9ea3c64, and maybe it
is worth rethinking that logic. For now, the extension just
periodically reports the value of GetLatestXTime() from the standbys
to the primary to get an accurate reading. This is done via a new
replication callback mechanism (which requires core PostgreSQL
changes). I can share this patch along with the extension, as I bet
there are other applications for it.
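For illustration only (not the extension's actual code), the keep-alive in
the background worker could be as small as:

    /* Force a commit record so replay_lag keeps advancing on an
     * otherwise idle primary. */
    StartTransactionCommand();
    (void) GetTopTransactionId();   /* assign an XID so commit writes WAL */
    CommitTransactionCommand();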
I should also note that the extension only considers "active" standbys
and primaries. That is, ones with an active WAL sender or WAL
receiver. This avoids the need to guess what should be done during a
network partition, but it also means that we must gracefully handle
standbys reconnecting with massive amounts of lag. The extension is
designed to slowly ramp up the amount of injected delay until the
standby's apply lag is trending down at a sufficient rate.
I see that an approach was suggested upthread for throttling based on
WAL distance instead of per-transaction. While the transaction
approach works decently well for certain workloads (e.g., many small
transactions like those from pgbench), it might require further tuning
for very large transactions or workloads with a variety of transaction
sizes. For that reason, I would definitely support building a way to
throttle based on WAL generation. It might be a good idea to avoid
throttling critical activity such as anti-wraparound vacuuming, too.
Nathan
[0]: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/multi-az-db-clusters-concepts.html
On 11.01.2022 03:06, Bossart, Nathan wrote:
I noticed this thread and thought I'd share my experiences building
something similar for Multi-AZ DB clusters [0]. It's not a strict RPO
mechanism, but it does throttle backends in an effort to keep the
replay lag below a configured maximum. I can share the code if there
is interest.
I wrote it as a new extension, and except for one piece that I'll go
into later, I was able to avoid changes to core PostgreSQL code. The
extension manages a background worker that periodically assesses the
state of the designated standbys and updates an atomic in shared
memory that indicates how long to delay. A transaction callback
checks this value and sleeps as necessary. Delay can be injected for
write-enabled transactions on the primary, read-only transactions on
the standbys, or both. The extension is heavily configurable so that
it can meet the needs of a variety of workloads.
One interesting challenge I encountered was accurately determining the
amount of replay lag. The problem was twofold. First, if there is no
activity on the primary, there will be nothing to replay on the
standbys, so the replay lag will appear to grow unbounded. To work
around this, the extension's background worker periodically creates an
empty COMMIT record. Second, if a standby reconnects after a long
time, the replay lag won't be accurate for some time. Instead, the
replay lag will slowly increase until it reaches the correct value.
Since the delay calculation looks at the trend of the replay lag, this
apparent unbounded growth causes it to inject far more delay than is
necessary. My guess is that this is related to 9ea3c64, and maybe it
is worth rethinking that logic. For now, the extension just
periodically reports the value of GetLatestXTime() from the standbys
to the primary to get an accurate reading. This is done via a new
replication callback mechanism (which requires core PostgreSQL
changes). I can share this patch along with the extension, as I bet
there are other applications for it.
I should also note that the extension only considers "active" standbys
and primaries. That is, ones with an active WAL sender or WAL
receiver. This avoids the need to guess what should be done during a
network partition, but it also means that we must gracefully handle
standbys reconnecting with massive amounts of lag. The extension is
designed to slowly ramp up the amount of injected delay until the
standby's apply lag is trending down at a sufficient rate.
I see that an approach was suggested upthread for throttling based on
WAL distance instead of per-transaction. While the transaction
approach works decently well for certain workloads (e.g., many small
transactions like those from pgbench), it might require further tuning
for very large transactions or workloads with a variety of transaction
sizes. For that reason, I would definitely support building a way to
throttle based on WAL generation. It might be a good idea to avoid
throttling critical activity such as anti-wraparound vacuuming, too.
Nathan
[0] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/multi-az-db-clusters-concepts.html
We have faced a similar problem in Zenith (an open-source Aurora) and had
to implement a back pressure mechanism to prevent overflow of WAL at the
stateless compute nodes and overly long delays of page reconstruction. Our
implementation is the following:
1. Three GUCs are added: max_replication_write/flush/apply_lag
2. Replication lags are checked in XLogInsert and, if one of the 3
thresholds is reached, InterruptPending is set.
3. In ProcessInterrupts we block backend execution until the lag is within
the specified boundary:
#define BACK_PRESSURE_DELAY 10000L // 0.01 sec

while (true)
{
    ProcessInterrupts_pg();

    // Suspend writers until replicas catch up
    lag = backpressure_lag();
    if (lag <= 0)
        break;

    set_ps_display("backpressure throttling");
    elog(DEBUG2, "backpressure throttling: lag %lu", lag);
    pg_usleep(BACK_PRESSURE_DELAY);
}
What is wrong here is that a backend can be blocked for a long time
(causing the client application to fail due to timeout expiration) and
holds its acquired locks while sleeping.
We are thinking about a smarter way of choosing the throttling delay (for
example, an exponential increase of the throttling sleep interval until
some maximal value is reached; a rough sketch follows below).
But it is really hard to find a universal scheme that is good for all use
cases (for example, short-lived sessions where clients connect to the
server to execute just one query).
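For example, something like this (sketch only, reusing backpressure_lag()
from the loop above):

    long    delay = 1000L;              /* start at 1 ms */

    while (backpressure_lag() > 0)
    {
        set_ps_display("backpressure throttling");
        pg_usleep(delay);
        delay = Min(delay * 2, 100000L); /* double, capped at 100 ms */
    }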
Concerning throttling at the end of a transaction, which eliminates the
problem of holding locks and does not require changes in the postgres
core: unfortunately it doesn't address the problem of large transactions
(for example a bulk load of data using COPY). In that case just one
transaction can cause an arbitrarily large lag.
I am not sure how critical the problem of holding locks during throttling
is: yes, it may block other database activity, including vacuum and the
execution of read-only queries.
But it should not block the walsender, and so should not cause a deadlock.
And in most cases read-only transactions do not conflict with write
transactions, so suspending a write transaction should not block readers.
Another problem with throttling is large WAL records (for example, a
custom logical replication WAL record can be arbitrarily large). If such
a record is larger than the replication lag limit, then it can cause a
deadlock.
On Tue, Jan 11, 2022 at 2:11 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
We have faced with the similar problem in Zenith (open source Aurora)
and have to implement back pressure mechanism to prevent overflow of WAL
at stateless compute nodes
and too long delays of page reconstruction. Our implementation is the
following:
1. Three GUCs are added: max_replication_write/flush/apply_lag
2. Replication lags are checked in XLogInsert and if one of 3 thresholds
is reached then InterruptPending is set.
3. In ProcessInterrupts we block backend execution until lag is within
specified boundary:

#define BACK_PRESSURE_DELAY 10000L // 0.01 sec

while (true)
{
    ProcessInterrupts_pg();

    // Suspend writers until replicas catch up
    lag = backpressure_lag();
    if (lag <= 0)
        break;

    set_ps_display("backpressure throttling");
    elog(DEBUG2, "backpressure throttling: lag %lu", lag);
    pg_usleep(BACK_PRESSURE_DELAY);
}

What is wrong here is that backend can be blocked for a long time
(causing failure of client application due to timeout expiration) and
hold acquired locks while sleeping.
Do we ever call CHECK_FOR_INTERRUPTS() while holding "important"
locks? I haven't seen any asserts or anything of that sort in
ProcessInterrupts(), though; it looks like it's the caller's
responsibility not to process interrupts while holding heavyweight
locks. There are some points on this upthread [1].
I don't think we have a problem with the various postgres timeouts
(statement_timeout, lock_timeout, idle_in_transaction_session_timeout,
idle_session_timeout, client_connection_check_interval) while we wait
for the replication lag to get better in ProcessInterrupts(). I think
SIGALRM can be raised while we wait for the replication lag to get
better, but it can't be handled there. Why can't we just disable these
timeouts before going to wait and reset/enable them right after the
replication lag gets better?
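A rough sketch of what I mean, reusing the wait loop from Konstantin's
mail (whether this interacts safely with the timeout machinery would need
careful checking):

    /* stop statement_timeout, lock_timeout etc. from firing during the wait */
    disable_all_timeouts(false);

    while (backpressure_lag() > 0)
        pg_usleep(10000L);

    /* re-arm the timeouts we normally run with; note this restarts the
     * statement timer rather than restoring its original deadline */
    if (StatementTimeout > 0)
        enable_timeout_after(STATEMENT_TIMEOUT, StatementTimeout);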
And the clients can always have their own
no-reply-kill-the-transaction sort of timeout; if so, let them fail
and deal with it. I don't think we can do much about this.
We are thinking about smarter way of choosing throttling delay (for
example exponential increase of throttling sleep interval until some
maximal value is reached).
But it is really hard to find some universal schema which will be good
for all use cases (for example in case of short living session, which
clients are connected to the server to execute just one query).
I think there has to be an upper limit on the wait, perhaps a
preconfigured amount of time. I know others upthread aren't happy
with failing transactions because of the replication lag, but my
point is: how much time would we let the backends wait or throttle
WAL writes? It mustn't be forever (say, if a broken connection to the
async standby is found).
[1]: /messages/by-id/20220105174643.lozdd3radxv4tlmx@alap3.anarazel.de
Regards,
Bharath Rupireddy.