[Patch] ALTER SYSTEM READ ONLY

Started by Amul Sulalmost 6 years ago199 messageshackers
Jump to latest
#1Amul Sul
sulamul@gmail.com

Hi,

Attached patch proposes $Subject feature which forces the system into
read-only
mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
WRITE executed.

The high-level goal is to make the availability/scale-out situation
better. The feature
will help HA setup where the master server needs to stop accepting WAL
writes
immediately and kick out any transaction expecting WAL writes at the end,
in case
of network down on master or replication connections failures.

For example, this feature allows for a controlled switchover without
needing to shut
down the master. You can instead make the master read-only, wait until the
standby
catches up, and then promote the standby. The master remains available for
read
queries throughout, and also for WAL streaming, but without the possibility
of any
new write transactions. After switchover is complete, the master can be
shut down
and brought back up as a standby without needing to use pg_rewind.
(Eventually, it
would be nice to be able to make the read-only master into a standby
without having
to restart it, but that is a problem for another patch.)

This might also help in failover scenarios. For example, if you detect that
the master
has lost network connectivity to the standby, you might make it read-only
after 30 s,
and promote the standby after 60 s, so that you never have two writable
masters at
the same time. In this case, there's still some split-brain, but it's still
better than what
we have now.

Design:
----------
The proposed feature is built atop of super barrier mechanism commit[1] to
coordinate
global state changes to all active backends. Backends which executed
ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
process to change the requested WAL read/write state aka WAL prohibited and
WAL
permitted state respectively. When the checkpointer process sees the WAL
prohibit
state change request, it emits a global barrier and waits until all
backends that
participate in the ProcSignal absorbs it. Once it has done the WAL
read/write state in
share memory and control file will be updated so that XLogInsertAllowed()
returns
accordingly.

If there are open transactions that have acquired an XID, the sessions are
killed
before the barrier is absorbed. They can't commit without writing WAL, and
they
can't abort without writing WAL, either, so we must at least abort the
transaction. We
don't necessarily need to kill the session, but it's hard to avoid in all
cases because
(1) if there are subtransactions active, we need to force the top-level
abort record to
be written immediately, but we can't really do that while keeping the
subtransactions
on the transaction stack, and (2) if the session is idle, we also need the
top-level abort
record to be written immediately, but can't send an error to the client
until the next
command is issued without losing wire protocol synchronization. For now, we
just use
FATAL to kill the session; maybe this can be improved in the future.

Open transactions that don't have an XID are not killed, but will get an
ERROR if they
try to acquire an XID later, or if they try to write WAL without acquiring
an XID (e.g. VACUUM).
To make that happen, the patch adds a new coding rule: a critical section
that will write
WAL must be preceded by a call to CheckWALPermitted(),
AssertWALPermitted(), or
AssertWALPermitted_HaveXID(). The latter variants are used when we know for
certain
that inserting WAL here must be OK, either because we have an XID (we would
have
been killed by a change to read-only if one had occurred) or for some other
reason.

The ALTER SYSTEM READ WRITE command can be used to reverse the effects of
ALTER SYSTEM READ ONLY. Both ALTER SYSTEM READ ONLY and ALTER
SYSTEM READ WRITE update not only the shared memory state but also the
control
file, so that changes survive a restart.

The transition between read-write and read-only is a pretty major
transition, so we emit
log message for each successful execution of a ALTER SYSTEM READ {ONLY |
WRITE}
command. Also, we have added a new GUC system_is_read_only which returns
"on"
when the system is in WAL prohibited state or recovery.

Another part of the patch that quite uneasy and need a discussion is that
when the
shutdown in the read-only state we do skip shutdown checkpoint and at a
restart, first
startup recovery will be performed and latter the read-only state will be
restored to
prohibit further WAL write irrespective of recovery checkpoint succeed or
not. The
concern is here if this startup recovery checkpoint wasn't ok, then it will
never happen
even if it's later put back into read-write mode. Thoughts?

Quick demo:
----------------
We have few active sessions, section 1 has performed some writes and stayed
in the
idle state for some time, in between in session 2 where superuser
successfully changed
system state in read-only via ALTER SYSTEM READ ONLY command which kills
session 1. Any other backend who is trying to run write transactions
thereafter will see
a read-only system error.

------------- SESSION 1 -------------
session_1=# BEGIN;
BEGIN
session_1=*# CREATE TABLE foo AS SELECT i FROM generate_series(1,5) i;
SELECT 5

------------- SESSION 2 -------------
session_2=# ALTER SYSTEM READ ONLY;
ALTER SYSTEM

------------- SESSION 1 -------------
session_1=*# COMMIT;
FATAL: system is now read only
HINT: Cannot continue a transaction if it has performed writes while
system is read only.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

------------- SESSION 3 -------------
session_3=# CREATE TABLE foo_bar (i int);
ERROR: cannot execute CREATE TABLE in a read-only transaction

------------- SESSION 4 -------------
session_4=# CHECKPOINT;
ERROR: system is now read only

System can put back to read-write mode by "ALTER SYSTEM READ WRITE" :

------------- SESSION 2 -------------
session_2=# ALTER SYSTEM READ WRITE;
ALTER SYSTEM

------------- SESSION 3 -------------
session_3=# CREATE TABLE foo_bar (i int);
CREATE TABLE

------------- SESSION 4 -------------
session_4=# CHECKPOINT;
CHECKPOINT

TODOs:
-----------
1. Documentation.

Attachments summary:
------------------------------
I tried to split the changes so that it can be easy to read and see the
incremental implementation.

0001: Patch by Robert, to add ability support error in global barrier
absorption.
0002: Patch implement ALTER SYSTEM { READ | WRITE} syntax and psql tab
completion support for it.
0003: A basic implementation where the system can accept $Subject command
and change system to read-only by an emitting barrier.
0004: Patch does the enhancing where the backed execute $Subject command
only and places a request to the checkpointer which is
responsible to change
the state by the emitting barrier. Also, store the state into the
control file to
make It persists across the server restarts.
0005: Patch tightens the check to prevent error in the critical section.
0006: Documentation - WIP

Credit:
-------
The feature is one of the part of Andres Frued's high-level design ideas
for inbuilt
graceful failover for PostgreSQL. Feature implementation design by Robert
Haas.
Initial patch by Amit Khandekar further works and improvement by me under
Robert's
guidance includes this mail writeup as well.

Ref:
----
1] Global barrier commit # 16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b

Thank you !

Regards,
Amul Sul
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v1-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patchapplication/octet-stream; name=v1-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patchDownload+490-32
v1-0006-Documentation-WIP.patchapplication/octet-stream; name=v1-0006-Documentation-WIP.patchDownload+61-12
v1-0002-Add-alter-system-read-only-write-syntax.patchapplication/octet-stream; name=v1-0002-Add-alter-system-read-only-write-syntax.patchDownload+70-3
v1-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patchapplication/octet-stream; name=v1-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patchDownload+246-73
v1-0001-Allow-error-or-refusal-while-absorbing-barriers.patchapplication/octet-stream; name=v1-0001-Allow-error-or-refusal-while-absorbing-barriers.patchDownload+63-13
v1-0004-Use-checkpointer-to-make-system-READ-ONLY-or-READ.patchapplication/octet-stream; name=v1-0004-Use-checkpointer-to-make-system-READ-ONLY-or-READ.patchDownload+202-21
#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Amul Sul (#1)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, Jun 16, 2020 at 7:26 PM amul sul <sulamul@gmail.com> wrote:

Hi,

Attached patch proposes $Subject feature which forces the system into read-only
mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
WRITE executed.

The high-level goal is to make the availability/scale-out situation better. The feature
will help HA setup where the master server needs to stop accepting WAL writes
immediately and kick out any transaction expecting WAL writes at the end, in case
of network down on master or replication connections failures.

For example, this feature allows for a controlled switchover without needing to shut
down the master. You can instead make the master read-only, wait until the standby
catches up, and then promote the standby. The master remains available for read
queries throughout, and also for WAL streaming, but without the possibility of any
new write transactions. After switchover is complete, the master can be shut down
and brought back up as a standby without needing to use pg_rewind. (Eventually, it
would be nice to be able to make the read-only master into a standby without having
to restart it, but that is a problem for another patch.)

This might also help in failover scenarios. For example, if you detect that the master
has lost network connectivity to the standby, you might make it read-only after 30 s,
and promote the standby after 60 s, so that you never have two writable masters at
the same time. In this case, there's still some split-brain, but it's still better than what
we have now.

Design:
----------
The proposed feature is built atop of super barrier mechanism commit[1] to coordinate
global state changes to all active backends. Backends which executed
ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
process to change the requested WAL read/write state aka WAL prohibited and WAL
permitted state respectively. When the checkpointer process sees the WAL prohibit
state change request, it emits a global barrier and waits until all backends that
participate in the ProcSignal absorbs it. Once it has done the WAL read/write state in
share memory and control file will be updated so that XLogInsertAllowed() returns
accordingly.

Do we prohibit the checkpointer to write dirty pages and write a
checkpoint record as well? If so, will the checkpointer process
writes the current dirty pages and writes a checkpoint record or we
skip that as well?

If there are open transactions that have acquired an XID, the sessions are killed
before the barrier is absorbed.

What about prepared transactions?

They can't commit without writing WAL, and they
can't abort without writing WAL, either, so we must at least abort the transaction. We
don't necessarily need to kill the session, but it's hard to avoid in all cases because
(1) if there are subtransactions active, we need to force the top-level abort record to
be written immediately, but we can't really do that while keeping the subtransactions
on the transaction stack, and (2) if the session is idle, we also need the top-level abort
record to be written immediately, but can't send an error to the client until the next
command is issued without losing wire protocol synchronization. For now, we just use
FATAL to kill the session; maybe this can be improved in the future.

Open transactions that don't have an XID are not killed, but will get an ERROR if they
try to acquire an XID later, or if they try to write WAL without acquiring an XID (e.g. VACUUM).

What if vacuum is on an unlogged relation? Do we allow writes via
vacuum to unlogged relation?

To make that happen, the patch adds a new coding rule: a critical section that will write
WAL must be preceded by a call to CheckWALPermitted(), AssertWALPermitted(), or
AssertWALPermitted_HaveXID(). The latter variants are used when we know for certain
that inserting WAL here must be OK, either because we have an XID (we would have
been killed by a change to read-only if one had occurred) or for some other reason.

The ALTER SYSTEM READ WRITE command can be used to reverse the effects of
ALTER SYSTEM READ ONLY. Both ALTER SYSTEM READ ONLY and ALTER
SYSTEM READ WRITE update not only the shared memory state but also the control
file, so that changes survive a restart.

The transition between read-write and read-only is a pretty major transition, so we emit
log message for each successful execution of a ALTER SYSTEM READ {ONLY | WRITE}
command. Also, we have added a new GUC system_is_read_only which returns "on"
when the system is in WAL prohibited state or recovery.

Another part of the patch that quite uneasy and need a discussion is that when the
shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
startup recovery will be performed and latter the read-only state will be restored to
prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
even if it's later put back into read-write mode.

I am not able to understand this problem. What do you mean by
"recovery checkpoint succeed or not", do you add a try..catch and skip
any error while performing recovery checkpoint?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#3tushar
tushar.ahuja@enterprisedb.com
In reply to: Amul Sul (#1)
Re: [Patch] ALTER SYSTEM READ ONLY

On 6/16/20 7:25 PM, amul sul wrote:

Attached patch proposes $Subject feature which forces the system into
read-only
mode where insert write-ahead log will be prohibited until ALTER
SYSTEM READ
WRITE executed.

Thanks Amul.

1) ALTER SYSTEM

postgres=# alter system read only;
ALTER SYSTEM
postgres=# alter  system reset all;
ALTER SYSTEM
postgres=# create table t1(n int);
ERROR:  cannot execute CREATE TABLE in a read-only transaction

Initially i thought after firing 'Alter system reset all' , it will be
back to  normal.

can't we have a syntax like - "Alter system set read_only='True' ; "

so that ALTER SYSTEM command syntax should be same for all.

postgres=# \h alter system
Command:     ALTER SYSTEM
Description: change a server configuration parameter
Syntax:
ALTER SYSTEM SET configuration_parameter { TO | = } { value | 'value' |
DEFAULT }

ALTER SYSTEM RESET configuration_parameter
ALTER SYSTEM RESET ALL

How we are going to justify this in help command of ALTER SYSTEM ?

2)When i connected to postgres in a single user mode , i was not able to
set the system in read only

[edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres

PostgreSQL stand-alone backend 14devel
backend> alter system read only;
ERROR:  checkpointer is not running

backend>

--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company

#4Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#2)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Do we prohibit the checkpointer to write dirty pages and write a
checkpoint record as well? If so, will the checkpointer process
writes the current dirty pages and writes a checkpoint record or we
skip that as well?

I think the definition of this feature should be that you can't write
WAL. So, it's OK to write dirty pages in general, for example to allow
for buffer replacement so we can continue to run read-only queries.
But there's no reason for the checkpointer to do it: it shouldn't try
to checkpoint, and therefore it shouldn't write dirty pages either.
(I'm not sure if this is how the patch currently works; I'm describing
how I think it should work.)

If there are open transactions that have acquired an XID, the sessions are killed
before the barrier is absorbed.

What about prepared transactions?

They don't matter. The problem with a running transaction that has an
XID is that somebody might end the session, and then we'd have to
write either a commit record or an abort record. But a prepared
transaction doesn't have that problem. You can't COMMIT PREPARED or
ROLLBACK PREPARED while the system is read-only, as I suppose anybody
would expect, but their mere existence isn't a problem.

What if vacuum is on an unlogged relation? Do we allow writes via
vacuum to unlogged relation?

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that
acquires an XID.

Another part of the patch that quite uneasy and need a discussion is that when the
shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
startup recovery will be performed and latter the read-only state will be restored to
prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
even if it's later put back into read-write mode.

I am not able to understand this problem. What do you mean by
"recovery checkpoint succeed or not", do you add a try..catch and skip
any error while performing recovery checkpoint?

What I think should happen is that the end-of-recovery checkpoint
should be skipped, and then if the system is put back into read-write
mode later we should do it then. But I think right now the patch
performs the end-of-recovery checkpoint before restoring the read-only
state, which seems 100% wrong to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Robert Haas
robertmhaas@gmail.com
In reply to: tushar (#3)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 9:51 AM tushar <tushar.ahuja@enterprisedb.com> wrote:

1) ALTER SYSTEM

postgres=# alter system read only;
ALTER SYSTEM
postgres=# alter system reset all;
ALTER SYSTEM
postgres=# create table t1(n int);
ERROR: cannot execute CREATE TABLE in a read-only transaction

Initially i thought after firing 'Alter system reset all' , it will be
back to normal.

can't we have a syntax like - "Alter system set read_only='True' ; "

No, this needs to be separate from the GUC-modification syntax, I
think. It's a different kind of state change. It doesn't, and can't,
just edit postgresql.auto.conf.

2)When i connected to postgres in a single user mode , i was not able to
set the system in read only

[edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres

PostgreSQL stand-alone backend 14devel
backend> alter system read only;
ERROR: checkpointer is not running

backend>

Hmm, that's an interesting finding. I wonder what happens if you make
the system read only, shut it down, and then restart it in single-user
mode. Given what you see here, I bet you can't put it back into a
read-write state from single user mode either, which seems like a
problem. Either single-user mode should allow changing between R/O and
R/W, or alternatively single-user mode should ignore ALTER SYSTEM READ
ONLY and always allow writes anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#2)
Re: [Patch] ALTER SYSTEM READ ONLY

Amit Kapila <amit.kapila16@gmail.com> writes:

On Tue, Jun 16, 2020 at 7:26 PM amul sul <sulamul@gmail.com> wrote:

Attached patch proposes $Subject feature which forces the system into read-only
mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
WRITE executed.

Do we prohibit the checkpointer to write dirty pages and write a
checkpoint record as well?

I think this is a really bad idea and should simply be rejected.

Aside from the points you mention, such a switch would break autovacuum.
It would break the ability for scans to do HOT-chain cleanup, which would
likely lead to some odd behaviors (if, eg, somebody flips the switch
between where that's supposed to happen and where an update needs to
happen on the same page). It would break the ability for indexscans to do
killed-tuple marking, which is critical for performance in some scenarios.
It would break the ability to set tuple hint bits, which is even more
critical for performance. It'd possibly break, or at least complicate,
logic in index AMs to deal with index format updates --- I'm fairly sure
there are places that will try to update out-of-date data structures
rather than cope with the old structure, even in nominally read-only
searches.

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems. Someday we will probably want to have ALTER SYSTEM
write WAL so that standby servers can absorb the settings changes.
But if writing WAL is disabled, how can you ever turn the thing off again?

Lastly, the arguments in favor seem pretty bogus. HA switchover normally
involves just killing the primary server, not expecting that you can
leisurely issue some commands to it first. Commands that involve a whole
bunch of subtle interlocking --- and, therefore, aren't going to work if
anything has gone wrong already anywhere in the server --- seem like a
particularly poor thing to be hanging your HA strategy on. I also wonder
what this accomplishes that couldn't be done much more simply by killing
the walsenders.

In short, I see a huge amount of complexity here, an ongoing source of
hard-to-identify, hard-to-fix bugs, and not very much real usefulness.

regards, tom lane

#7Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Aside from the points you mention, such a switch would break autovacuum.
It would break the ability for scans to do HOT-chain cleanup, which would
likely lead to some odd behaviors (if, eg, somebody flips the switch
between where that's supposed to happen and where an update needs to
happen on the same page). It would break the ability for indexscans to do
killed-tuple marking, which is critical for performance in some scenarios.
It would break the ability to set tuple hint bits, which is even more
critical for performance. It'd possibly break, or at least complicate,
logic in index AMs to deal with index format updates --- I'm fairly sure
there are places that will try to update out-of-date data structures
rather than cope with the old structure, even in nominally read-only
searches.

This seems like pretty dubious hand-waving. Of course, things that
write WAL are going to be broken by a switch that prevents writing
WAL; but if they were not, there would be no purpose in having such a
switch, so that's not really an argument. But you seem to have mixed
in some things that don't require writing WAL, and claimed without
evidence that those would somehow also be broken. I don't think that's
the case, but even if it were, so what? We live with all of these
restrictions on standbys anyway.

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems. Someday we will probably want to have ALTER SYSTEM
write WAL so that standby servers can absorb the settings changes.
But if writing WAL is disabled, how can you ever turn the thing off again?

I mean, the syntax that we use for a feature like this is arbitrary. I
picked this one, so I like it, but it can easily be changed if other
people want something else. The rest of this argument doesn't seem to
me to make very much sense. The existing ALTER SYSTEM functionality to
modify a text configuration file isn't replicated today and I'm not
sure why we should make it so, considering that replication generally
only considers things that are guaranteed to be the same on the master
and the standby, which this is not. But even if we did, that has
nothing to do with whether some functionality that changes the system
state without changing a text file ought to also be replicated. This
is a piece of cluster management functionality and it makes no sense
to replicate it. And no right-thinking person would ever propose to
change a feature that renders the system read-only in such a way that
it was impossible to deactivate it. That would be nuts.

Lastly, the arguments in favor seem pretty bogus. HA switchover normally
involves just killing the primary server, not expecting that you can
leisurely issue some commands to it first.

Yeah, that's exactly the problem I want to fix. If you kill the master
server, then you have interrupted service, even for read-only queries.
That sucks. Also, even if you don't care about interrupting service on
the master, it's actually sorta hard to guarantee a clean switchover.
The walsenders are supposed to send all the WAL from the master before
exiting, but if the connection is broken for some reason, then the
master is down and the standbys can't stream the rest of the WAL. You
can start it up again, but then you might generate more WAL. You can
try to copy the WAL around manually from one pg_wal directory to
another, but that's not a very nice thing for users to need to do
manually, and seems buggy and error-prone.

And how do you figure out where the WAL ends on the master and make
sure that the standby replayed it all? If the master is up, it's easy:
you just use the same queries you use all the time. If the master is
down, you have to use some different technique that involves manually
examining files or scrutinizing pg_controldata output. It's actually
very difficult to get this right.

Commands that involve a whole
bunch of subtle interlocking --- and, therefore, aren't going to work if
anything has gone wrong already anywhere in the server --- seem like a
particularly poor thing to be hanging your HA strategy on.

It's important not to conflate controlled switchover with failover.
When there's a failover, you have to accept some risk of data loss or
service interruption; but a controlled switchover does not need to
carry the same risks and there are plenty of systems out there where
it doesn't.

I also wonder
what this accomplishes that couldn't be done much more simply by killing
the walsenders.

Killing the walsenders does nothing ... the clients immediately reconnect.

In short, I see a huge amount of complexity here, an ongoing source of
hard-to-identify, hard-to-fix bugs, and not very much real usefulness.

I do think this is complex and the risk of bugs that are hard to
identify or hard to fix certainly needs to be considered. I
strenuously disagree with the idea that there is not very much real
usefulness. Getting failover set up in a way that actually works
robustly is, in my experience, one of the two or three most serious
challenges my employer's customers face today. The core server support
we provide for that is breathtakingly primitive, and it's urgent that
we do better. Cloud providers are moving users from PostgreSQL to
their own forks of PostgreSQL in vast numbers in large part because
users don't want to deal with this crap, and the cloud providers have
made it so they don't have to. People running PostgreSQL themselves
need complex third-party tools and even then the experience isn't as
good as what a major cloud provider would offer. This patch is not
going to fix that, but I think it's a step in the right direction, and
I hope others will agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#7)
Re: [Patch] ALTER SYSTEM READ ONLY

Robert Haas <robertmhaas@gmail.com> writes:

This seems like pretty dubious hand-waving. Of course, things that
write WAL are going to be broken by a switch that prevents writing
WAL; but if they were not, there would be no purpose in having such a
switch, so that's not really an argument. But you seem to have mixed
in some things that don't require writing WAL, and claimed without
evidence that those would somehow also be broken.

Which of the things I mentioned don't require writing WAL?

You're right that these are the same things that we already forbid on a
standby, for the same reason, so maybe it won't be as hard to identify
them as I feared. I wonder whether we should envision this as "demote
primary to standby" rather than an independent feature.

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems.

... no right-thinking person would ever propose to
change a feature that renders the system read-only in such a way that
it was impossible to deactivate it. That would be nuts.

My point was that putting this in ALTER SYSTEM paints us into a corner
as to what we can do with ALTER SYSTEM in the future: we won't ever be
able to make that do anything that would require writing WAL. And I
don't entirely believe your argument that that will never be something
we'd want to do.

regards, tom lane

#9Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#8)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 12:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Which of the things I mentioned don't require writing WAL?

Writing hint bits and marking index tuples as killed do not write WAL
unless checksums are enabled.

You're right that these are the same things that we already forbid on a
standby, for the same reason, so maybe it won't be as hard to identify
them as I feared. I wonder whether we should envision this as "demote
primary to standby" rather than an independent feature.

See my comments on the nearby pg_demote thread. I think we want both.

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems.

... no right-thinking person would ever propose to
change a feature that renders the system read-only in such a way that
it was impossible to deactivate it. That would be nuts.

My point was that putting this in ALTER SYSTEM paints us into a corner
as to what we can do with ALTER SYSTEM in the future: we won't ever be
able to make that do anything that would require writing WAL. And I
don't entirely believe your argument that that will never be something
we'd want to do.

I think that depends a lot on how you view ALTER SYSTEM. I believe it
would be reasonable to view ALTER SYSTEM as a catch-all for commands
that make system-wide state changes, even if those changes are not all
of the same kind as each other; some might be machine-local, and
others cluster-wide; some WAL-logged, and others not. I don't think
it's smart to view ALTER SYSTEM through a lens that boxes it into only
editing postgresql.auto.conf; if that were so, we ought to have called
it ALTER CONFIGURATION FILE or something rather than ALTER SYSTEM. For
that reason, I do not see the choice of syntax as painting us into a
corner.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: [Patch] ALTER SYSTEM READ ONLY

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Jun 17, 2020 at 12:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Which of the things I mentioned don't require writing WAL?

Writing hint bits and marking index tuples as killed do not write WAL
unless checksums are enabled.

And your point is? I thought enabling checksums was considered
good practice these days.

You're right that these are the same things that we already forbid on a
standby, for the same reason, so maybe it won't be as hard to identify
them as I feared. I wonder whether we should envision this as "demote
primary to standby" rather than an independent feature.

See my comments on the nearby pg_demote thread. I think we want both.

Well, if pg_demote can be done for X amount of effort, and largely
gets the job done, while this requires 10X or 100X the effort and
introduces 10X or 100X as many bugs, I'm not especially convinced
that we want both.

regards, tom lane

#11Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 12:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Writing hint bits and marking index tuples as killed do not write WAL
unless checksums are enabled.

And your point is? I thought enabling checksums was considered
good practice these days.

I don't want to have an argument about what typical or best practices
are; I wasn't trying to make any point about that one way or the
other. I'm just saying that the operations you listed don't
necessarily all write WAL. In an event, even if they did, the larger
point is that standbys work like that, too, so it's not unprecedented
or illogical to think of such things.

You're right that these are the same things that we already forbid on a
standby, for the same reason, so maybe it won't be as hard to identify
them as I feared. I wonder whether we should envision this as "demote
primary to standby" rather than an independent feature.

See my comments on the nearby pg_demote thread. I think we want both.

Well, if pg_demote can be done for X amount of effort, and largely
gets the job done, while this requires 10X or 100X the effort and
introduces 10X or 100X as many bugs, I'm not especially convinced
that we want both.

Sure: if two features duplicate each other, and one of them is way
more work and way more buggy, then it's silly to have both, and we
should just accept the easy, bug-free one. However, as I said in the
other email to which I referred you, I currently believe that these
two features actually don't duplicate each other and that using them
both together would be quite beneficial. Also, even if they did, I
don't know where you are getting the idea that this feature will be
10X or 100X more work and more buggy than the other one. I have looked
at this code prior to it being posted, but I haven't looked at the
other code at all; I am guessing that you have looked at neither. I
would be happy if you did, because it is often the case that
architectural issues that escape other people are apparent to you upon
examination, and it's always nice to know about those earlier rather
than later so that one can decide to (a) give up or (b) fix them. But
I see no point in speculating in the abstract that such issues may
exist and that they may be more severe in one case than the other. My
own guess is that, properly implemented, they are within 2-3X of each
in one direction or the other, not 10-100X. It is almost unbelievable
to me that the pg_demote patch could be 100X simpler than this one; if
it were, I'd be practically certain it was a 5-minute hack job
unworthy of any serious consideration.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#7)
Re: [Patch] ALTER SYSTEM READ ONLY

Hi,

On 2020-06-17 12:07:22 -0400, Robert Haas wrote:

On Wed, Jun 17, 2020 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems. Someday we will probably want to have ALTER SYSTEM
write WAL so that standby servers can absorb the settings changes.
But if writing WAL is disabled, how can you ever turn the thing off again?

I mean, the syntax that we use for a feature like this is arbitrary. I
picked this one, so I like it, but it can easily be changed if other
people want something else. The rest of this argument doesn't seem to
me to make very much sense. The existing ALTER SYSTEM functionality to
modify a text configuration file isn't replicated today and I'm not
sure why we should make it so, considering that replication generally
only considers things that are guaranteed to be the same on the master
and the standby, which this is not. But even if we did, that has
nothing to do with whether some functionality that changes the system
state without changing a text file ought to also be replicated. This
is a piece of cluster management functionality and it makes no sense
to replicate it. And no right-thinking person would ever propose to
change a feature that renders the system read-only in such a way that
it was impossible to deactivate it. That would be nuts.

I agree that the concrete syntax here doesn't seem to matter much. If
this worked by actually putting a GUC into the config file, it would
perhaps matter a bit more, but it doesn't afaict. It seems good to
avoid new top-level statements, and ALTER SYSTEM seems to fit well.

I wonder if there's an argument about wanting to be able to execute this
command over a physical replication connection? I think this feature
fairly obviously is a building block for "gracefully failover to this
standby", and it seems like it'd be nicer if that didn't potentially
require two pg_hba.conf entries for the to-be-promoted primary on the
current/old primary?

Lastly, the arguments in favor seem pretty bogus. HA switchover normally
involves just killing the primary server, not expecting that you can
leisurely issue some commands to it first.

Yeah, that's exactly the problem I want to fix. If you kill the master
server, then you have interrupted service, even for read-only queries.
That sucks. Also, even if you don't care about interrupting service on
the master, it's actually sorta hard to guarantee a clean switchover.
The walsenders are supposed to send all the WAL from the master before
exiting, but if the connection is broken for some reason, then the
master is down and the standbys can't stream the rest of the WAL. You
can start it up again, but then you might generate more WAL. You can
try to copy the WAL around manually from one pg_wal directory to
another, but that's not a very nice thing for users to need to do
manually, and seems buggy and error-prone.

Also (I'm sure you're aware) if you just non-gracefully shut down the
old primary, you're going to have to rewind the old primary to be able
to use it as a standby. And if you non-gracefully stop you're gonna
incur checkpoint overhead, which is *massive* on non-toy
databases. There's a huge practical difference between a minor version
upgrade causing 10s of unavailability and causing 5min-30min.

And how do you figure out where the WAL ends on the master and make
sure that the standby replayed it all? If the master is up, it's easy:
you just use the same queries you use all the time. If the master is
down, you have to use some different technique that involves manually
examining files or scrutinizing pg_controldata output. It's actually
very difficult to get this right.

Yea, it's absurdly hard. I think it's really kind of ridiculous that we
expect others to get this right if we, the developers of this stuff,
can't really get it right because it's so complicated. Which imo makes
this:

Commands that involve a whole
bunch of subtle interlocking --- and, therefore, aren't going to work if
anything has gone wrong already anywhere in the server --- seem like a
particularly poor thing to be hanging your HA strategy on.

more of an argument for having this type of stuff builtin.

It's important not to conflate controlled switchover with failover.
When there's a failover, you have to accept some risk of data loss or
service interruption; but a controlled switchover does not need to
carry the same risks and there are plenty of systems out there where
it doesn't.

Yup.

Greetings,

Andres Freund

#13Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#4)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 8:12 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Do we prohibit the checkpointer to write dirty pages and write a
checkpoint record as well? If so, will the checkpointer process
writes the current dirty pages and writes a checkpoint record or we
skip that as well?

I think the definition of this feature should be that you can't write
WAL. So, it's OK to write dirty pages in general, for example to allow
for buffer replacement so we can continue to run read-only queries.
But there's no reason for the checkpointer to do it: it shouldn't try
to checkpoint, and therefore it shouldn't write dirty pages either.
(I'm not sure if this is how the patch currently works; I'm describing
how I think it should work.)

You are correct -- writing dirty pages is not restricted.

If there are open transactions that have acquired an XID, the sessions are killed
before the barrier is absorbed.

What about prepared transactions?

They don't matter. The problem with a running transaction that has an
XID is that somebody might end the session, and then we'd have to
write either a commit record or an abort record. But a prepared
transaction doesn't have that problem. You can't COMMIT PREPARED or
ROLLBACK PREPARED while the system is read-only, as I suppose anybody
would expect, but their mere existence isn't a problem.

What if vacuum is on an unlogged relation? Do we allow writes via
vacuum to unlogged relation?

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that
acquires an XID.

Another part of the patch that quite uneasy and need a discussion is that when the
shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
startup recovery will be performed and latter the read-only state will be restored to
prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
even if it's later put back into read-write mode.

I am not able to understand this problem. What do you mean by
"recovery checkpoint succeed or not", do you add a try..catch and skip
any error while performing recovery checkpoint?

What I think should happen is that the end-of-recovery checkpoint
should be skipped, and then if the system is put back into read-write
mode later we should do it then. But I think right now the patch
performs the end-of-recovery checkpoint before restoring the read-only
state, which seems 100% wrong to me.

Yeah, we need more thought on how to proceed further. I am kind of agree that
the current behavior is not right with Robert since writing end-of-recovery
checkpoint violates the no-wal-write rule.

Regards,
Amul

#14Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#5)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 8:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 17, 2020 at 9:51 AM tushar <tushar.ahuja@enterprisedb.com> wrote:

1) ALTER SYSTEM

postgres=# alter system read only;
ALTER SYSTEM
postgres=# alter system reset all;
ALTER SYSTEM
postgres=# create table t1(n int);
ERROR: cannot execute CREATE TABLE in a read-only transaction

Initially i thought after firing 'Alter system reset all' , it will be
back to normal.

can't we have a syntax like - "Alter system set read_only='True' ; "

No, this needs to be separate from the GUC-modification syntax, I
think. It's a different kind of state change. It doesn't, and can't,
just edit postgresql.auto.conf.

2)When i connected to postgres in a single user mode , i was not able to
set the system in read only

[edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres

PostgreSQL stand-alone backend 14devel
backend> alter system read only;
ERROR: checkpointer is not running

backend>

Hmm, that's an interesting finding. I wonder what happens if you make
the system read only, shut it down, and then restart it in single-user
mode. Given what you see here, I bet you can't put it back into a
read-write state from single user mode either, which seems like a
problem. Either single-user mode should allow changing between R/O and
R/W, or alternatively single-user mode should ignore ALTER SYSTEM READ
ONLY and always allow writes anyway.

Ok, will try to enable changing between R/O and R/W in the next version.

Thanks Tushar for the testing.

Regards,
Amul

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#4)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 8:12 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Do we prohibit the checkpointer to write dirty pages and write a
checkpoint record as well? If so, will the checkpointer process
writes the current dirty pages and writes a checkpoint record or we
skip that as well?

I think the definition of this feature should be that you can't write
WAL. So, it's OK to write dirty pages in general, for example to allow
for buffer replacement so we can continue to run read-only queries.

For buffer replacement, many-a-times we have to also perform
XLogFlush, what do we do for that? We can't proceed without doing
that and erroring out from there means stopping read-only query from
the user perspective.

But there's no reason for the checkpointer to do it: it shouldn't try
to checkpoint, and therefore it shouldn't write dirty pages either.

What is the harm in doing the checkpoint before we put the system into
READ ONLY state? The advantage is that we can at least reduce the
recovery time if we allow writing checkpoint record.

What if vacuum is on an unlogged relation? Do we allow writes via
vacuum to unlogged relation?

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that
acquires an XID.

If the truncate operation errors out, then won't the system will again
trigger a new autovacuum worker for the same relation as we update
stats at the end? Also, in general for regular tables, if there is an
error while it tries to WAL, it could again trigger the autovacuum
worker for the same relation. If this is true then unnecessarily it
will generate a lot of dirty pages and don't think it will be good for
the system to behave that way?

Another part of the patch that quite uneasy and need a discussion is that when the
shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
startup recovery will be performed and latter the read-only state will be restored to
prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
even if it's later put back into read-write mode.

I am not able to understand this problem. What do you mean by
"recovery checkpoint succeed or not", do you add a try..catch and skip
any error while performing recovery checkpoint?

What I think should happen is that the end-of-recovery checkpoint
should be skipped, and then if the system is put back into read-write
mode later we should do it then.

But then if we have to perform recovery again, it will start from the
previous checkpoint. I think we have to live with it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In reply to: Robert Haas (#7)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, 17 Jun 2020 12:07:22 -0400
Robert Haas <robertmhaas@gmail.com> wrote:
[...]

Commands that involve a whole
bunch of subtle interlocking --- and, therefore, aren't going to work if
anything has gone wrong already anywhere in the server --- seem like a
particularly poor thing to be hanging your HA strategy on.

It's important not to conflate controlled switchover with failover.
When there's a failover, you have to accept some risk of data loss or
service interruption; but a controlled switchover does not need to
carry the same risks and there are plenty of systems out there where
it doesn't.

Yes. Maybe we should make sure the wording we are using is the same for
everyone. I already hear/read "failover", "controlled failover", "switchover" or
"controlled switchover", this is confusing. My definition of switchover is:

swapping primary and secondary status between two replicating instances. With
no data loss. This is a controlled procedure where all steps must succeed to
complete.
If a step fails, the procedure fail back to the original primary with no data
loss.

However, Wikipedia has a broader definition, including situations where the
switchover is executed upon a failure: https://en.wikipedia.org/wiki/Switchover

Regards,

#17Simon Riggs
simon@2ndQuadrant.com
In reply to: Amul Sul (#1)
Re: [Patch] ALTER SYSTEM READ ONLY

On Tue, 16 Jun 2020 at 14:56, amul sul <sulamul@gmail.com> wrote:

The high-level goal is to make the availability/scale-out situation
better. The feature
will help HA setup where the master server needs to stop accepting WAL
writes
immediately and kick out any transaction expecting WAL writes at the end,
in case
of network down on master or replication connections failures.

For example, this feature allows for a controlled switchover without
needing to shut
down the master. You can instead make the master read-only, wait until the
standby
catches up, and then promote the standby. The master remains available for
read
queries throughout, and also for WAL streaming, but without the
possibility of any
new write transactions. After switchover is complete, the master can be
shut down
and brought back up as a standby without needing to use pg_rewind.
(Eventually, it
would be nice to be able to make the read-only master into a standby
without having
to restart it, but that is a problem for another patch.)

This might also help in failover scenarios. For example, if you detect
that the master
has lost network connectivity to the standby, you might make it read-only
after 30 s,
and promote the standby after 60 s, so that you never have two writable
masters at
the same time. In this case, there's still some split-brain, but it's
still better than what
we have now.

If there are open transactions that have acquired an XID, the sessions are
killed
before the barrier is absorbed.

inbuilt graceful failover for PostgreSQL

That doesn't appear to be very graceful. Perhaps objections could be
assuaged by having a smoother transition and perhaps not even a full
barrier, initially.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
Mission Critical Databases

#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#7)
Re: [Patch] ALTER SYSTEM READ ONLY

On Wed, Jun 17, 2020 at 9:37 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 17, 2020 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Lastly, the arguments in favor seem pretty bogus. HA switchover normally
involves just killing the primary server, not expecting that you can
leisurely issue some commands to it first.

Yeah, that's exactly the problem I want to fix. If you kill the master
server, then you have interrupted service, even for read-only queries.

Yeah, but if there is a synchronuos_standby (standby that provide sync
replication), user can always route the connections to it
(automatically if there is some middleware which can detect and route
the connection to standby)

That sucks. Also, even if you don't care about interrupting service on
the master, it's actually sorta hard to guarantee a clean switchover.

Fair enough. However, it is not described in the initial email
(unless I have missed it; there is a mention that this patch is one
part of that bigger feature but no further explanation of that bigger
feature) how this feature will allow a clean switchover. I think
before we put the system into READ ONLY state, there could be some WAL
which we haven't sent to standby, what we do we do for that.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#19Amul Sul
sulamul@gmail.com
In reply to: Amit Kapila (#15)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jun 18, 2020 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 17, 2020 at 8:12 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Do we prohibit the checkpointer to write dirty pages and write a
checkpoint record as well? If so, will the checkpointer process
writes the current dirty pages and writes a checkpoint record or we
skip that as well?

I think the definition of this feature should be that you can't write
WAL. So, it's OK to write dirty pages in general, for example to allow
for buffer replacement so we can continue to run read-only queries.

For buffer replacement, many-a-times we have to also perform
XLogFlush, what do we do for that? We can't proceed without doing
that and erroring out from there means stopping read-only query from
the user perspective.

Read-only does not restrict XLogFlush().

But there's no reason for the checkpointer to do it: it shouldn't try
to checkpoint, and therefore it shouldn't write dirty pages either.

What is the harm in doing the checkpoint before we put the system into
READ ONLY state? The advantage is that we can at least reduce the
recovery time if we allow writing checkpoint record.

The checkpoint could take longer, intending to quickly switch to the read-only
state.

What if vacuum is on an unlogged relation? Do we allow writes via
vacuum to unlogged relation?

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that
acquires an XID.

If the truncate operation errors out, then won't the system will again
trigger a new autovacuum worker for the same relation as we update
stats at the end? Also, in general for regular tables, if there is an
error while it tries to WAL, it could again trigger the autovacuum
worker for the same relation. If this is true then unnecessarily it
will generate a lot of dirty pages and don't think it will be good for
the system to behave that way?

No new autovacuum worker will be forked in the read-only state and existing will
have an error if they try to write WAL after barrier absorption.

Another part of the patch that quite uneasy and need a discussion is that when the
shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
startup recovery will be performed and latter the read-only state will be restored to
prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
even if it's later put back into read-write mode.

I am not able to understand this problem. What do you mean by
"recovery checkpoint succeed or not", do you add a try..catch and skip
any error while performing recovery checkpoint?

What I think should happen is that the end-of-recovery checkpoint
should be skipped, and then if the system is put back into read-write
mode later we should do it then.

But then if we have to perform recovery again, it will start from the
previous checkpoint. I think we have to live with it.

Let me explain the case, if we do skip the end-of-recovery checkpoint while
starting the system in read-only mode and then later changing the state to
read-write and do a few write operations and online checkpoints, that will be
fine? I am yet to explore those things.

Regards,
Amul

#20Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#15)
Re: [Patch] ALTER SYSTEM READ ONLY

On Thu, Jun 18, 2020 at 5:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

For buffer replacement, many-a-times we have to also perform
XLogFlush, what do we do for that? We can't proceed without doing
that and erroring out from there means stopping read-only query from
the user perspective.

I think we should stop WAL writes, then XLogFlush() once, then declare
the system R/O. After that there might be more XLogFlush() calls but
there won't be any new WAL, so they won't do anything.

But there's no reason for the checkpointer to do it: it shouldn't try
to checkpoint, and therefore it shouldn't write dirty pages either.

What is the harm in doing the checkpoint before we put the system into
READ ONLY state? The advantage is that we can at least reduce the
recovery time if we allow writing checkpoint record.

Well, as Andres says in
/messages/by-id/20200617180546.yucxtiupvxghxss6@alap3.anarazel.de
it can take a really long time.

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that
acquires an XID.

If the truncate operation errors out, then won't the system will again
trigger a new autovacuum worker for the same relation as we update
stats at the end?

Not if we do what I said in that paragraph. If we're not launching new
workers we can't again trigger a worker for the same relation.

Also, in general for regular tables, if there is an
error while it tries to WAL, it could again trigger the autovacuum
worker for the same relation. If this is true then unnecessarily it
will generate a lot of dirty pages and don't think it will be good for
the system to behave that way?

I don't see how this would happen. VACUUM can't really dirty pages
without writing WAL, can it? And, anyway, if there's an error, we're
not going to try again for the same relation unless we launch new
workers.

What I think should happen is that the end-of-recovery checkpoint
should be skipped, and then if the system is put back into read-write
mode later we should do it then.

But then if we have to perform recovery again, it will start from the
previous checkpoint. I think we have to live with it.

Yeah. I don't think it's that bad. The case where you shut down the
system while it's read-only should be a somewhat unusual one. Normally
you would mark it read only and then promote a standby and shut the
old master down (or demote it). But what you want is that if it does
happen to go down for some reason before all the WAL is streamed, you
can bring it back up and finish streaming the WAL without generating
any new WAL.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#19)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#17)
In reply to: Robert Haas (#20)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#23)
#25Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#20)
#26Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#25)
#27tushar
tushar.ahuja@enterprisedb.com
In reply to: Amul Sul (#26)
#28Amul Sul
sulamul@gmail.com
In reply to: tushar (#27)
#29Michael Banck
michael.banck@credativ.de
In reply to: tushar (#27)
#30Michael Paquier
michael@paquier.xyz
In reply to: Amul Sul (#28)
#31Amul Sul
sulamul@gmail.com
In reply to: Michael Banck (#29)
#32Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#30)
#33Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#32)
#34Prabhat Sahu
prabhat.sahu@enterprisedb.com
In reply to: Amul Sul (#33)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Prabhat Sahu (#34)
#36Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Robert Haas (#35)
#37SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: Soumyadeep Chakraborty (#36)
#38Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Amul Sul (#1)
#39Amul Sul
sulamul@gmail.com
In reply to: Soumyadeep Chakraborty (#36)
#40Amul Sul
sulamul@gmail.com
In reply to: SATYANARAYANA NARLAPURAM (#37)
#41Amul Sul
sulamul@gmail.com
In reply to: Soumyadeep Chakraborty (#38)
#42Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Amul Sul (#39)
#43Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Amul Sul (#41)
#44Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Robert Haas (#21)
#45Andres Freund
andres@anarazel.de
In reply to: Amul Sul (#33)
#46Amul Sul
sulamul@gmail.com
In reply to: Soumyadeep Chakraborty (#44)
#47Amul Sul
sulamul@gmail.com
In reply to: Andres Freund (#45)
#48Robert Haas
robertmhaas@gmail.com
In reply to: Soumyadeep Chakraborty (#36)
#49Robert Haas
robertmhaas@gmail.com
In reply to: Soumyadeep Chakraborty (#42)
#50Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#45)
#51Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Amul Sul (#46)
#52Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Robert Haas (#48)
#53Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Robert Haas (#49)
#54Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#50)
#55Amul Sul
sulamul@gmail.com
In reply to: Soumyadeep Chakraborty (#51)
#56Robert Haas
robertmhaas@gmail.com
In reply to: Soumyadeep Chakraborty (#52)
#57Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#56)
#58Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#57)
#59Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#58)
#60Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#58)
#61Andres Freund
andres@anarazel.de
In reply to: Amul Sul (#59)
#62Amul Sul
sulamul@gmail.com
In reply to: Andres Freund (#61)
#63Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#62)
#64Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#60)
#65Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#63)
#66Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#65)
#67Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#64)
#68Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#67)
#69Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#68)
#70Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#68)
#71Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#70)
#72Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#71)
#73Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#62)
#74Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#70)
#75Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#73)
#76Amul Sul
sulamul@gmail.com
In reply to: Andres Freund (#75)
#77Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#76)
#78Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#77)
#79Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#78)
#80Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#78)
#81Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#80)
#82Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#81)
#83Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#82)
#84Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#83)
#85Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#84)
#86Amul Sul
sulamul@gmail.com
In reply to: Andres Freund (#85)
#87Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#86)
#88Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#86)
#89Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#88)
#90Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#89)
#91Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#90)
#92Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#91)
#93Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#92)
#94Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#88)
#95Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#94)
#96Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#93)
#97Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#94)
#98Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#95)
#99Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: Amul Sul (#98)
#100Amul Sul
sulamul@gmail.com
In reply to: Ibrar Ahmed (#99)
#101Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#100)
#102Prabhat Sahu
prabhat.sahu@enterprisedb.com
In reply to: Amul Sul (#100)
#103Amul Sul
sulamul@gmail.com
In reply to: Prabhat Sahu (#102)
#104Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#103)
#105Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Amul Sul (#104)
#106Amul Sul
sulamul@gmail.com
In reply to: Bharath Rupireddy (#105)
#107Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#106)
#108Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#107)
#109Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#108)
#110Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#109)
#111Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#110)
#112Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#111)
#113Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#112)
#114Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#113)
#115Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#114)
#116Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#115)
#117Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#116)
#118Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#117)
#119Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#118)
#120Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#119)
#121Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#120)
#122Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#121)
#123Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#122)
#124Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#123)
#125Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#124)
#126Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#124)
#127Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#126)
#128Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#125)
#129Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#128)
#130Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#129)
#131Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#130)
#132Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#131)
#133Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#132)
#134Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#133)
#135Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#134)
#136Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#135)
#137Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#136)
#138Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#137)
#139Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#138)
#140Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#139)
#141Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#138)
#142Prabhat Sahu
prabhat.sahu@enterprisedb.com
In reply to: Robert Haas (#141)
#143Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#141)
#144Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#143)
#145Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#144)
#146Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amul Sul (#144)
#147Amul Sul
sulamul@gmail.com
In reply to: Mark Dilger (#146)
#148Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#147)
#149Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amul Sul (#148)
#150Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#149)
#151Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#150)
#152Amul Sul
sulamul@gmail.com
In reply to: Mark Dilger (#149)
#153Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amul Sul (#152)
#154Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#153)
#155Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#154)
#156Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#155)
#157Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#156)
#158Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amul Sul (#1)
#159Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#135)
#160Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#159)
#161Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#160)
#162Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#161)
#163Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#152)
#164Amul Sul
sulamul@gmail.com
In reply to: Mark Dilger (#158)
#165Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amul Sul (#164)
#166Amul Sul
sulamul@gmail.com
In reply to: Mark Dilger (#165)
#167Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Amul Sul (#166)
#168Amul Sul
sulamul@gmail.com
In reply to: Mark Dilger (#167)
#169Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#162)
#170Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#169)
#171Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#170)
#172Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#171)
#173Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Robert Haas (#172)
#174Amul Sul
sulamul@gmail.com
In reply to: Rushabh Lathia (#173)
#175Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Amul Sul (#174)
#176Amul Sul
sulamul@gmail.com
In reply to: Jaime Casanova (#175)
#177Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#176)
#178Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#177)
#179Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#178)
#180Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#179)
#181Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#180)
#182Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#181)
#183Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#182)
#184Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#183)
#185Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#184)
#186Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#182)
#187Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#186)
#188Robert Haas
robertmhaas@gmail.com
In reply to: Amul Sul (#187)
#189Amul Sul
sulamul@gmail.com
In reply to: Robert Haas (#188)
#190Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#189)
#191Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#190)
#192Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#191)
#193Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#192)
#194Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#193)
#195Amul Sul
sulamul@gmail.com
In reply to: Amul Sul (#194)
#196Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Amul Sul (#100)
#197Amul Sul
sulamul@gmail.com
In reply to: Bharath Rupireddy (#196)
#198Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Amul Sul (#195)
#199Amul Sul
sulamul@gmail.com
In reply to: Jacob Champion (#198)