Hot standby and b-tree killed items
Whenever a B-tree index scan fetches a heap tuple that turns out to be
dead, the B-tree item is marked as killed by calling _bt_killitems. When
the page gets full, all the killed items are removed by calling
_bt_vacuum_one_page.
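That flow can be sketched as a toy simulation (illustrative Python, not PostgreSQL source; the class and method names merely echo _bt_killitems and _bt_vacuum_one_page):

```python
# Illustrative sketch: killed index items accumulate as hints and are
# only physically removed when space is needed on the page.

class BTreePage:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []          # each item: [key, heap_tid, killed_flag]

    def insert(self, key, heap_tid):
        # On a full page, first try to reclaim space from killed items,
        # mirroring the _bt_vacuum_one_page call before a split is considered.
        if len(self.items) >= self.capacity:
            self.vacuum_one_page()
        if len(self.items) >= self.capacity:
            raise RuntimeError("page split needed")
        self.items.append([key, heap_tid, False])

    def kill_item(self, heap_tid):
        # Mirrors _bt_killitems: a scan found the heap tuple dead, so the
        # index entry is flagged; physical removal happens later.
        for it in self.items:
            if it[1] == heap_tid:
                it[2] = True

    def vacuum_one_page(self):
        # Physically remove killed items to make room on the page.
        self.items = [it for it in self.items if not it[2]]
```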
That's a problem for hot standby. If any of the killed b-tree items
point to a tuple that is still visible to a running read-only
transaction, we have the same situation as with vacuum, and have to
either wait for the read-only transaction to finish before applying the
WAL record or kill the transaction.
It looks like there are some cosmetic changes related to that in the
patch (the signature of _bt_delitems is modified), but no actual
changes that would handle that situation. I didn't see it on the TODO on
the hot standby wiki either. Am I missing something, or is the patch?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-12-19 at 10:49 +0200, Heikki Linnakangas wrote:
Whenever a B-tree index scan fetches a heap tuple that turns out to be
dead, the B-tree item is marked as killed by calling _bt_killitems. When
the page gets full, all the killed items are removed by calling
_bt_vacuum_one_page.

That's a problem for hot standby. If any of the killed b-tree items
point to a tuple that is still visible to a running read-only
transaction, we have the same situation as with vacuum, and have to
either wait for the read-only transaction to finish before applying the
WAL record or kill the transaction.

It looks like there's some cosmetic changes related to that in the
patch, the signature of _bt_delitems is modified, but there's no actual
changes that would handle that situation. I didn't see it on the TODO on
the hot standby wiki either. Am I missing something, or the patch?
ResolveRedoVisibilityConflicts() describes the current patch's position
on this point, which on review is wrong, I agree.
It looks like I assumed that _bt_delitems is only called during VACUUM,
even though I knew it wasn't. I know I was going to split XLOG_BTREE_VACUUM
into two record types at one point, one for delete, one for vacuum. In
the end I didn't. Anyhow, it's wrong.
We have infrastructure in place to make this work correctly; we just need
to add a latestRemovedXid field to xl_btree_vacuum. So that part is easily
solved.
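A minimal sketch of that record change and the standby-side decision it enables (illustrative Python; only the latestRemovedXid field comes from the discussion, the other field names are assumptions):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class XlBtreeVacuum:
    # Hypothetical layout: 'node' and 'block' are stand-ins, not the
    # actual WAL record fields.
    node: str                   # relation identifier
    block: int                  # index page being cleaned
    latestRemovedXid: int       # newest xid whose tuples this cleanup removes
    offsets: List[int] = field(default_factory=list)

def standby_conflicts(rec: XlBtreeVacuum, standby_snapshot_xmin: int) -> bool:
    # Redo must wait for (or cancel) any standby query whose snapshot
    # could still see tuples removed up to latestRemovedXid.
    return rec.latestRemovedXid >= standby_snapshot_xmin
```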
Thanks for spotting it. More like that please!
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
We have infrastructure in place to make this work correctly, just need
to add latestRemovedXid field to xl_btree_vacuum. So that part is easily
solved.
That's tricky because there's no xmin/xmax on index tuples. You could
conservatively use OldestXmin as latestRemovedXid, but that could stall
the WAL redo a lot more than necessary. Or you could store
latestRemovedXid in the page header, but that would need to be
WAL-logged to ensure that it's valid after crash. Or you could look at
the heap to fetch the xmin/xmax, but that would be expensive.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-12-19 at 12:24 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
We have infrastructure in place to make this work correctly, just need
to add latestRemovedXid field to xl_btree_vacuum. So that part is easily
solved.

That's tricky because there's no xmin/xmax on index tuples.
Doh.
You could
conservatively use OldestXmin as latestRemovedXid, but that could stall
the WAL redo a lot more than necessary. Or you could store
latestRemovedXid in the page header, but that would need to be
WAL-logged to ensure that it's valid after crash. Or you could look at
the heap to fetch the xmin/xmax, but that would be expensive.
Agreed. Probably need to use OldestXmin then.
If I was going to add anything to the btree page header, it would be
latestRemovedLSN, only set during recovery. That way we don't have to
explicitly kill queries; we can do a wait on OldestXmin and then let
them ERROR out when they find a page that has been modified.
I have a suspicion that we may need some modification of that solution
for all data blocks, so we don't kill too many queries.
Hmmm. I wonder if we can track latestRemovedLSN for all of
shared_buffers. That was initially rejected, but if we set the
latestRemovedLSN to be the block's LSN when we read it in, that would be
fairly useful. Either way we use 8 bytes of RAM per buffer.
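The latestRemovedLSN idea could be sketched like this (illustrative Python simulation; none of these names are actual PostgreSQL symbols):

```python
# Sketch: track, per shared buffer, the LSN of the latest cleanup applied
# during recovery, and let a standby query error out lazily, only when it
# actually touches a page cleaned after its snapshot was taken.

NBUFFERS = 8
latest_removed_lsn = [0] * NBUFFERS   # the 8 bytes of RAM per buffer noted above

def redo_cleanup(buf: int, lsn: int) -> None:
    # Recovery records the cleanup LSN instead of killing queries eagerly.
    latest_removed_lsn[buf] = max(latest_removed_lsn[buf], lsn)

def check_buffer(buf: int, snapshot_lsn: int) -> None:
    # A query fails only if this page was cleaned after its snapshot.
    if latest_removed_lsn[buf] > snapshot_lsn:
        raise RuntimeError("snapshot too old")
```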
BTW, I noticed the other day that Oracle 11g only allows you to have a
read-only slave *or* to continue replaying; you need to manually switch
back and forth between those modes. They can't do
*both*, as Postgres will be able to do. That's because their undo
information is stored off-block in the Undo Tablespace, so is not
available for standby queries. Nice one, Postgres.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
I'm confused. Shouldn't read-only transactions on the slave just be
hacked to not set any hint bits, including lp_delete?
--
Greg
On 19 Dec 2008, at 03:49, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>
wrote:
Whenever a B-tree index scan fetches a heap tuple that turns out to
be dead, the B-tree item is marked as killed by calling
_bt_killitems. When the page gets full, all the killed items are
removed by calling _bt_vacuum_one_page.

That's a problem for hot standby. If any of the killed b-tree items
point to a tuple that is still visible to a running read-only
transaction, we have the same situation as with vacuum, and have to
either wait for the read-only transaction to finish before applying
the WAL record or kill the transaction.

It looks like there's some cosmetic changes related to that in the
patch, the signature of _bt_delitems is modified, but there's no
actual changes that would handle that situation. I didn't see it on
the TODO on the hot standby wiki either. Am I missing something, or
the patch?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-12-19 at 09:22 -0500, Greg Stark wrote:
I'm confused shouldn't read-only transactions on the slave just be
hacked to not set any hint bits including lp_delete?
They could be, though I see no value in doing so.
But that is not Heikki's point. He is discussing what happens on the
primary and the effects that must then occur on the standby. He has
rightly pointed out a (pluggable) hole in my logic.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> wrote:
If I was going to add anything to the btree page header, it would be
latestRemovedLSN, only set during recovery. That way we don't have to
explicitly kill queries, we can do a wait on OldestXmin then let
them ERROR out when they find a page that has been modified.

I have a suspicion that we may need some modification of that solution
for all data blocks, so we don't kill too many queries.
If the failure is caused by the timing of various database
transactions, and the query is likely to run successfully after a
delay and a retry, please use SQLSTATE of '40001'. Some software
(ours, for one) will recognize this and retry the query automatically,
so that the user impact is essentially the same as blocking.
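The retry behaviour Kevin describes might look like this on the client side (illustrative Python with a stubbed executor; not tied to any real driver API):

```python
import time

class DbError(Exception):
    # Stand-in for a driver exception carrying the SQLSTATE.
    def __init__(self, sqlstate):
        super().__init__(sqlstate)
        self.sqlstate = sqlstate

def with_retry(run_query, max_tries=3, delay=0.0):
    # Treat SQLSTATE '40001' (serialization_failure) as retryable, so
    # the user sees a short delay rather than an error.
    for attempt in range(max_tries):
        try:
            return run_query()
        except DbError as e:
            # Retry only serialization failures; re-raise anything else,
            # and give up after the last attempt.
            if e.sqlstate != "40001" or attempt == max_tries - 1:
                raise
            time.sleep(delay)
```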
-Kevin
On Fri, 2008-12-19 at 11:54 -0600, Kevin Grittner wrote:
Simon Riggs <simon@2ndQuadrant.com> wrote:
If I was going to add anything to the btree page header, it would be
latestRemovedLSN, only set during recovery. That way we don't have to
explicitly kill queries, we can do a wait on OldestXmin then let
them ERROR out when they find a page that has been modified.

I have a suspicion that we may need some modification of that solution
for all data blocks, so we don't kill too many queries.
If the failure is caused by the timing of various database
transactions, and the query is likely to run successfully after a
delay and a retry, please use SQLSTATE of '40001'. Some software
(ours, for one) will recognize this and retry the query automatically,
so that the user impact is essentially the same as blocking.
I understand the need, but we won't be using SQLSTATE = 40001.
That corresponds to ERRCODE_T_R_SERIALIZATION_FAILURE, which that error
would not be.
The error message ought to be "snapshot too old", which could raise a
chuckle, so I called it something else.
The point you raise is a good one, and I think we should publish a list
of retryable error messages. I once contemplated proposing a special log
level for retryable errors, but decided it was not quite a good idea.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-12-19 at 10:52 +0000, Simon Riggs wrote:
You could
conservatively use OldestXmin as latestRemovedXid, but that could stall
the WAL redo a lot more than necessary. Or you could store
latestRemovedXid in the page header, but that would need to be
WAL-logged to ensure that it's valid after crash. Or you could look at
the heap to fetch the xmin/xmax, but that would be expensive.

Agreed. Probably need to use OldestXmin then.
Just finished coding this up, plus TODO item to pin every index page.
Will post after some further testing.
Used RecentOldestXmin.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> wrote:
I understand the need, but we won't be using SQLSTATE = 40001.
That corresponds to ERRCODE_T_R_SERIALIZATION_FAILURE, which that error
would not be.
Isn't it a problem with serialization of database transactions? You
hit it in a different way, but if it is a temporary failure due to the
timing of the transactions, I strongly feel that that is the correct
SQLSTATE to use. Perhaps additional information to provide useful
context could go in the info or hint areas?
-Kevin
Simon Riggs <simon@2ndQuadrant.com> writes:
The error message ought to be "snapshot too old", which could raise a
chuckle, so I called it something else.

The point you raise is a good one and I think we should publish a list
of retryable error messages. I contemplated once proposing a special log
level for a retryable error, but not quite a good idea.
I'm a bit concerned about the idea of killing off queries to allow WAL to
proceed. While I have nothing against that being an option, I think we should
be aiming to make it not necessary for correctness, and not the default. By
default I think we should stick to stalling WAL replay and only resort
to killing queries if the user specifically requests it.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
On Fri, 2008-12-19 at 18:59 +0000, Gregory Stark wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
The error message ought to be "snapshot too old", which could raise a
chuckle, so I called it something else.

The point you raise is a good one and I think we should publish a list
of retryable error messages. I contemplated once proposing a special log
level for a retryable error, but not quite a good idea.

I'm a bit concerned about the idea of killing off queries to allow WAL to
proceed. While I have nothing against that being an option I think we should
be aiming to make it not necessary for correctness and not the default. By
default I think WAL replay should stick to stalling WAL replay and only resort
to killing queries if the user specifically requests it.
Increasing the waiting time increases the failover time and thus
decreases the value of the standby as an HA system. Others value high
availability more than you do, so we agreed to provide an option
to allow the maximum waiting time to be set.
max_standby_delay is set in recovery.conf, with values from 0 (forever)
to 2,000,000 secs, settable in milliseconds. So think of it like a
deadlock detector for recovery apply.
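A sketch of what that setting might look like (based only on the description in this mail; the exact syntax and units are not verified against any release):

```
# recovery.conf (hypothetical example)
max_standby_delay = 20000   # in milliseconds; 0 means wait forever
```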
Also, there is a set of functions to control the way recovery proceeds,
much as you might control an MP3 player (start, stop, pause). There are
also functions to pause at a specific xid, pause at a specific time, or pause
at the next cleanup record. That allows you to set max_standby_delay
lower and then freeze the server for longer to run a long query if
required. It also allows you to do PITR by trial and error rather than
with one-shot specify-in-advance settings. There is a function to manually
end recovery at a useful place if desired.
I hope your needs and wishes are catered for by that?
(I have a Plan B in case we need it during wider user testing, as
explained up thread.)
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> wrote:
max_standby_delay is set in recovery.conf, value 0 (forever) -
2,000,000 secs, settable in milliseconds. So think of it like a deadlock
detector for recovery apply.
Aha! A deadlock is a type of serialization failure. (In fact, on
databases with lock-based concurrency control rather than MVCC, it can
be the ONLY type of serialization failure.)
-Kevin
On Fri, 2008-12-19 at 13:47 -0600, Kevin Grittner wrote:
Simon Riggs <simon@2ndQuadrant.com> wrote:
max_standby_delay is set in recovery.conf, value 0 (forever) -
2,000,000 secs, settable in milliseconds. So think of it like a deadlock
detector for recovery apply.
Aha! A deadlock is a type of serialization failure. (In fact, on
databases with lock-based concurrency control rather than MVCC, it can
be the ONLY type of serialization failure.)
The SQL Standard specifically names this error as thrown when "it
detects the inability to guarantee the serializability of two or more
concurrent SQL-transactions". Now that really should only apply when
running with SERIALIZABLE transactions, but I grant you the standard
doesn't explicitly say that.
You give me the strange sense that you want this because of some quirk
in your software, rather than an overwhelming desire to see these two
situations described the same.
I guess using that SQLSTATE would make it simpler to understand why
the error occurs and also how to handle it (i.e. resubmit). So there
probably is a wider argument for making developers' jobs a little easier
by doing it; i.e. usability will be improved if we do that.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
Increasing the waiting time increases the failover time and thus
decreases the value of the standby as an HA system. Others value high
availability higher than you and so we had agreed to provide an option
to allow the max waiting time to be set.
Sure, it's a nice option to have. But I think the default should be to pause
WAL replay.
The question I had was whether your solution for btree pointers marked dead
and later dropped from the index works when the user hasn't configured a
timeout and doesn't want standby queries killed.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Simon Riggs <simon@2ndQuadrant.com> wrote:
max_standby_delay is set in recovery.conf, value 0 (forever) -
2,000,000 secs, settable in milliseconds. So think of it like a deadlock
detector for recovery apply.
Aha! A deadlock is a type of serialization failure. (In fact, on
databases with lock-based concurrency control rather than MVCC, it can
be the ONLY type of serialization failure.)
I think the fundamental difference is that a deadlock or serialization failure
can be predicted as a potential problem when writing the code. This new error is
something that can happen for any query at any time, even plain old read-only
SELECT queries.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
Simon Riggs <simon@2ndQuadrant.com> wrote:
The SQL Standard specifically names this error as thrown when "it
detects the inability to guarantee the serializability of two or more
concurrent SQL-transactions". Now that really should only apply when
running with SERIALIZABLE transactions,
I disagree. Database integrity could not be guaranteed without
detection of conflicting modifications at READ COMMITTED and up, and
this is the normal means of indicating these problems.
but I grant you the standard doesn't explicitly say that.
I think that's intentional.
You give me the strange sense that you want this because of some quirk
in your software, rather than an overwhelming desire to see these two
situations described the same.
Well, we are very unlikely to ever use this feature, so it's not
really something I care about for us; it just struck me that there may
be others who care about categorizing errors accurately according to
the SQL standard, and that what you were describing sounded like a new
type of serialization failure in the PostgreSQL environment, and
should be classified that way.
The primary quirkiness of our software is that it needs to be able to
run with a number of different database products, and we do want to
take advantage of whatever information is available in a portable
format. This is not the only standard SQLSTATE we look for and handle
appropriately for the documented meaning, but it is an important one,
as it has simplified application programming and reduced the confusing
error messages which reach our end users.
I guess making it that SQLSTATE would make it simpler to understand why
the error occurs and also how to handle it (i.e. resubmit).
Precisely.
-Kevin
Gregory Stark <stark@enterprisedb.com> wrote:
I think the fundamental difference is that a deadlock or serialization failure
can be predicted as a potential problem when writing the code. This is
something that can happen for any query any time, even plain old read-only
select queries.
I've heard that on Oracle it is (or at least was) possible to get a
serialization failure on a single SELECT statement which was the only
user-requested activity on the system, because it could conflict with
automatic maintenance operations.
In Sybase and Microsoft databases it is definitely possible for a
plain old read-only SELECT statement to be a deadlock victim (reported
as a serialization failure) if some of the data it is referencing is
being updated concurrently. In these (and many other) products, a
lock must be acquired before a row can be read. Imagine: the SELECT
locks a row against updates, another transaction locks some other row
against any access, then the UPDATE tries to change the row locked by
the SELECT while the SELECT tries to read the row locked by the
UPDATE.
PostgreSQL is much less prone to serialization failures, but it is
certainly understandable if hot standby replication introduces new
cases of it.
-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
PostgreSQL is much less prone to serialization failures, but it is
certainly understandable if hot standby replication introduces new
cases of it.
In this case it will be possible to get this error even if you're just running
a single SELECT query -- and that's the *only* query in the database at all.
A vacuum being replayed -- even in a different database -- could trigger the
error. Or with the btree split issue, a data load -- again even in a different
database -- would be quite likely to cause your SELECT to be killed.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Gregory Stark <stark@enterprisedb.com> wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
PostgreSQL is much less prone to serialization failures, but it is
certainly understandable if hot standby replication introduces new
cases of it.

In this case it will be possible to get this error even if you're just running
a single SELECT query -- and that's the *only* query in the database at all.
A vacuum being replayed -- even in a different database -- could trigger the
error. Or with the btree split issue, a data load -- again even in a different
database -- would be quite likely cause your SELECT to be killed.
OK. Does that make serialization failure a bad description of the
problem?
If these steps are serialized (run one after the other), is there a
problem? It just seems that the hot standby near-synchronous
replication creates a situation where tasks on multiple, linked
databases might need to be serialized.
It does seem like it will be important to provide as much information
as possible to the user about what's causing the problem, along with hints
about what to do. PostgreSQL has nice features for that, though.
Since I have no vested interest here, I'm not inclined to belabor the
point. I was really just trying to make sure the feature was as
useful as possible to others, some of whom might be looking for
standard SQLSTATE values to help the software take the right course.
If others feel the 40001 code would confuse more than enlighten, I'll
respect that.
-Kevin