Issue in GIN fast-insert: XLogBeginInsert + Read/LockBuffer ordering
Hi,
In Neon, we've had to modify the GIN fast insertion path as attached,
due to an unexpected XLOG insertion and buffer locking ordering issue.
The xlog readme [0]access/transam/README, section "Write-Ahead Log Coding", line 436-470 mentions that the common order of operations is 1)
pin and lock any buffers you need for the log record, then 2) start a
critical section, then 3) call XLogBeginInsert.
In Neon, we rely on this documented order of operations to expect to
be able to WAL-log hint pages (freespace map, visibility map) when
they are written to disk (e.g. cache eviction, checkpointer). In
general, this works fine, except that in ginHeapTupleFastInsert we
call XLogBeginInsert() before the last of the buffers for the eventual
record was read, thus creating a path where eviction is possible in a
`begininsert_called = true` context. That breaks our current code by
being unable to evict (WAL-log) the dirtied hint pages.
PFA a patch that rectifies this issue, by moving the XLogBeginInsert()
down to where 1.) we have all relevant buffers pinned and locked, and
2.) we're in a critical section, making that part of the code
consistent with the general scheme for XLog insertion.
Kind regards,
Matthias van de Meent
[0]: access/transam/README, section "Write-Ahead Log Coding", line 436-470
Attachments:
v1-0001-Fix-GIN-fast-path-XLogBeginInsert-and-Read-LockBu.patchapplication/octet-stream; name=v1-0001-Fix-GIN-fast-path-XLogBeginInsert-and-Read-LockBu.patchDownload
From 05d0b9209ec5636e8b017b2d1050767517a96cd7 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 8 Sep 2022 12:38:15 +0200
Subject: [PATCH v1] Fix GIN fast-path XLogBeginInsert() and Read/LockBuffer
ordering issue
XLogBeginInsert() should be called only after the caller has locked all
relevant buffers, and starting a critical section (where and when
applicable), according to the XLog Manual in access/transam/README,
section "Write-Ahead Log Coding".
---
src/backend/access/gin/ginfast.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 6c677447a9..5784c365bf 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -285,9 +285,6 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
memset(&sublist, 0, sizeof(GinMetaPageData));
makeSublist(index, collector->tuples, collector->ntuples, &sublist);
- if (needWal)
- XLogBeginInsert();
-
/*
* metapage was unlocked, see above
*/
@@ -307,6 +304,9 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
metadata->nPendingPages = sublist.nPendingPages;
metadata->nPendingHeapTuples = sublist.nPendingHeapTuples;
+
+ if (needWal)
+ XLogBeginInsert();
}
else
{
@@ -335,7 +335,10 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
metadata->nPendingHeapTuples += sublist.nPendingHeapTuples;
if (needWal)
+ {
+ XLogBeginInsert();
XLogRegisterBuffer(1, buffer, REGBUF_STANDARD);
+ }
}
}
else
--
2.30.2
HI,
On Sep 8, 2022, 19:08 +0800, Matthias van de Meent <boekewurm+postgres@gmail.com>, wrote:
In general, this works fine, except that in ginHeapTupleFastInsert we
call XLogBeginInsert() before the last of the buffers for the eventual
record was read, thus creating a path where eviction is possible in a
`begininsert_called = true` context. That breaks our current code by
being unable to evict (WAL-log) the dirtied hint pages.
Does it break Postgres or Neon?
I look around the codes and as far as I can see, dirty pages could be flushed whether begininsert_called is true or false.
PFA a patch that rectifies this issue, by moving the XLogBeginInsert()
down to where 1.) we have all relevant buffers pinned and locked, and
2.) we're in a critical section, making that part of the code
consistent with the general scheme for XLog insertion.
+1, Make sense.
Regards,
Zhang Mingli
Hi,
On Sep 8, 2022, 19:08 +0800, Matthias van de Meent <boekewurm+postgres@gmail.com>, wrote:
PFA a patch that rectifies this issue, by moving the XLogBeginInsert()
down to where 1.) we have all relevant buffers pinned and locked, and
2.) we're in a critical section, making that part of the code
consistent with the general scheme for XLog insertion.
In the same function, there is disorder of XLogBeginInsert and START_CRIT_SECTION.
```
collectordata = ptr = (char *) palloc(collector->sumsize);
data.ntuples = collector->ntuples;
if (needWal)
XLogBeginInsert();
START_CRIT_SECTION();
```
Shall we adjust that too?
Regards,
Zhang Mingli
On Fri, Sep 30, 2022 at 06:22:02PM +0800, Zhang Mingli wrote:
In the same function, there is disorder of XLogBeginInsert and START_CRIT_SECTION.
```
collectordata = ptr = (char *) palloc(collector->sumsize);data.ntuples = collector->ntuples;
if (needWal)
XLogBeginInsert();START_CRIT_SECTION();
```Shall we adjust that too?
Nice catches, both of you. Let's adjust everything spotted in one
shot. Matthias' patch makes the ordering right, but the
initialization path a bit harder to follow when using a separate list.
The code is short so it does not strike me as a big deal, and it comes
from the fact that we need to lock an existing buffer when merging two
lists. For the branch where we insert into a tail page, the pages are
already locked but it looks to be enough to move XLogBeginInsert()
before the first XLogRegisterBuffer() call.
Would any of you like to write a patch?
--
Michael
HI,
On Oct 12, 2022, 10:09 +0800, Michael Paquier <michael@paquier.xyz>, wrote:
Nice catches, both of you. Let's adjust everything spotted in one
shot. Matthias' patch makes the ordering right, but the
initialization path a bit harder to follow when using a separate list.
The code is short so it does not strike me as a big deal, and it comes
from the fact that we need to lock an existing buffer when merging two
lists. For the branch where we insert into a tail page, the pages are
already locked but it looks to be enough to move XLogBeginInsert()
before the first XLogRegisterBuffer() call.Would any of you like to write a patch?
--
Michael
Patch added.
Regards,
Zhang Mingli
Attachments:
v2-0001-Fix-GIN-fast-path-XLogBeginInsert-and-Read-LockBu.patchapplication/octet-streamDownload
From 7506944d6712093fad871a4b63bbbb07ee84e492 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 8 Sep 2022 12:38:15 +0200
Subject: [PATCH v2] Fix GIN fast-path XLogBeginInsert() and Read/LockBuffer
ordering issue
XLogBeginInsert() should be called only after the caller has locked all
relevant buffers, and starting a critical section (where and when
applicable), according to the XLog Manual in access/transam/README,
section "Write-Ahead Log Coding".
---
src/backend/access/gin/ginfast.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index ab4bb67d4b..f750b5ed9e 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -285,9 +285,6 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
memset(&sublist, 0, sizeof(GinMetaPageData));
makeSublist(index, collector->tuples, collector->ntuples, &sublist);
- if (needWal)
- XLogBeginInsert();
-
/*
* metapage was unlocked, see above
*/
@@ -307,6 +304,9 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
metadata->nPendingPages = sublist.nPendingPages;
metadata->nPendingHeapTuples = sublist.nPendingHeapTuples;
+
+ if (needWal)
+ XLogBeginInsert();
}
else
{
@@ -335,7 +335,10 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
metadata->nPendingHeapTuples += sublist.nPendingHeapTuples;
if (needWal)
+ {
+ XLogBeginInsert();
XLogRegisterBuffer(1, buffer, REGBUF_STANDARD);
+ }
}
}
else
@@ -361,11 +364,11 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
data.ntuples = collector->ntuples;
+ START_CRIT_SECTION();
+
if (needWal)
XLogBeginInsert();
- START_CRIT_SECTION();
-
/*
* Increase counter of heap tuples
*/
--
2.34.1
On Wed, Oct 12, 2022 at 09:39:11PM +0800, Zhang Mingli wrote:
Patch added.
Thanks. This looks fine on a second look, so applied.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
Thanks. This looks fine on a second look, so applied.
Don't we need to back-patch these fixes?
regards, tom lane
On Wed, Oct 12, 2022 at 08:54:34PM -0400, Tom Lane wrote:
Don't we need to back-patch these fixes?
I guess I should, though I have finished by not doing due to the
unlikeliness of the problem, where we would need the combination of a
page eviction with an error in the critical section to force a PANIC,
or a crash before the WAL gets inserted. Other opinions?
--
Michael
On 2022-Oct-13, Michael Paquier wrote:
On Wed, Oct 12, 2022 at 08:54:34PM -0400, Tom Lane wrote:
Don't we need to back-patch these fixes?
I guess I should, though I have finished by not doing due to the
unlikeliness of the problem, where we would need the combination of a
page eviction with an error in the critical section to force a PANIC,
or a crash before the WAL gets inserted. Other opinions?
I suppose it's a matter of whether any bugs are possible outside of
Neon. If yes, then definitely this should be backpatched. Offhand, I
don't see any. On the other hand, even if no bugs are known, then it's
still valuable to have all code paths do WAL insertion in the same way,
rather than having a single place that is alone in doing it in a
different way. But if we don't know of any bugs, then backpatching
might be more risk than not doing so.
I confess I don't understand why is it important that XLogBeginInsert is
called inside the critical section. It seems to me that that part is
only a side-effect of having to acquire the buffer locks in the critical
section. Right?
I noticed that line 427 logs the GIN metapage with flag REGBUF_STANDARD;
is the GIN metapage really honoring pd_upper? I see only pg_lower being
set.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"La grandeza es una experiencia transitoria. Nunca es consistente.
Depende en gran parte de la imaginación humana creadora de mitos"
(Irulan)
On Mon, Oct 24, 2022 at 02:22:16PM +0200, Alvaro Herrera wrote:
I suppose it's a matter of whether any bugs are possible outside of
Neon. If yes, then definitely this should be backpatched. Offhand, I
don't see any. On the other hand, even if no bugs are known, then it's
still valuable to have all code paths do WAL insertion in the same way,
rather than having a single place that is alone in doing it in a
different way. But if we don't know of any bugs, then backpatching
might be more risk than not doing so.
I have been putting my mind on that once again, and I don't see how
this would cause an issue in vanilla PG code. XLogBeginInsert() does
its checks, meaning that we'd get a PANIC instead of an ERROR now that
these calls are within a critical section but that should not matter
because we know that recovery has ended or we would not be able to do
GIN insertions like that. Then, it only switches to
begininsert_called to true, that we use for sanity checks in the
various WAL insert paths. As Matthias has outlined, Neon relies on
begininsert_called more than we do currently. FWIW, I think that
we're still fine not backpatching that, even considering the
consistency of the code with stable branches. Now, if there is a
strong trend in favor of a backpatch, I'd be fine with this conclusion
as well.
I confess I don't understand why is it important that XLogBeginInsert is
called inside the critical section. It seems to me that that part is
only a side-effect of having to acquire the buffer locks in the critical
section. Right?
Yeah, you are right that it would not matter for XLogBeginInsert(),
though I'd like to think that this is a good practice on consistency
grounds with anywhere else, and we respect what's documented in the
README.
I noticed that line 427 logs the GIN metapage with flag REGBUF_STANDARD;
is the GIN metapage really honoring pd_upper? I see only pg_lower being
set.
Hmm. Not sure.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
On Mon, Oct 24, 2022 at 02:22:16PM +0200, Alvaro Herrera wrote:
I confess I don't understand why is it important that XLogBeginInsert is
called inside the critical section. It seems to me that that part is
only a side-effect of having to acquire the buffer locks in the critical
section. Right?
Yeah, you are right that it would not matter for XLogBeginInsert(),
though I'd like to think that this is a good practice on consistency
grounds with anywhere else, and we respect what's documented in the
README.
Yeah --- it's documented that way, and there doesn't seem to be
a good reason not to honor that here.
regards, tom lane
On 2022-Oct-25, Tom Lane wrote:
Michael Paquier <michael@paquier.xyz> writes:
On Mon, Oct 24, 2022 at 02:22:16PM +0200, Alvaro Herrera wrote:
I confess I don't understand why is it important that XLogBeginInsert is
called inside the critical section. It seems to me that that part is
only a side-effect of having to acquire the buffer locks in the critical
section. Right?Yeah, you are right that it would not matter for XLogBeginInsert(),
though I'd like to think that this is a good practice on consistency
grounds with anywhere else, and we respect what's documented in the
README.Yeah --- it's documented that way, and there doesn't seem to be
a good reason not to honor that here.
Okay, so if we follow this argument, then the logical conclusion is that
this *should* be backpatched, after all.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Maybe there's lots of data loss but the records of data loss are also lost.
(Lincoln Yeoh)
On Tue, Oct 25, 2022 at 09:37:08AM +0200, Alvaro Herrera wrote:
Okay, so if we follow this argument, then the logical conclusion is that
this *should* be backpatched, after all.
After sleeping on it and looking at all the stable branches involved,
backpatched down to v10.
--
Michael