BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
The following bug has been logged on the website:
Bug reference: 17257
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 14.0
Operating system: Ubuntu 20.04
Description:
When running concurrent installchecks (x100) I've observed the autovacuum
process hanging:
tester 328713 328706 0 12:19 ? 00:02:03 postgres: logical
replication launcher
tester 3301788 328706 0 16:33 ? 00:00:02 postgres: postgres
regress030 127.0.0.1(44534) CREATE TABLE
tester 3302076 328706 99 16:33 ? 05:17:10 postgres: autovacuum
worker regress030
tester 3311424 328706 0 16:34 ? 00:03:09 postgres: postgres
regress090 127.0.0.1(48744) ALTER TABLE
tester 3311453 328706 1 16:34 ? 00:03:12 postgres: postgres
regress048 127.0.0.1(48768) ALTER TABLE
tester 3311461 328706 1 16:34 ? 00:03:11 postgres: postgres
regress017 127.0.0.1(48774) DROP TABLE
tester 3311490 328706 0 16:34 ? 00:03:09 postgres: postgres
regress006 127.0.0.1(48804) ALTER TABLE
tester 3311493 328706 0 16:34 ? 00:03:08 postgres: postgres
regress008 127.0.0.1(48806) ALTER TABLE
gdb showed the following stacktrace:
0x0000564d6c93cd1f in GetPrivateRefCountEntry (buffer=<optimized out>,
do_move=<optimized out>) at bufmgr.c:312
312 Assert(BufferIsValid(buffer));
(gdb) bt
#0 0x0000564d6c93cd1f in GetPrivateRefCountEntry (buffer=<optimized out>,
do_move=<optimized out>) at bufmgr.c:312
#1 0x0000564d6c93fb79 in GetPrivateRefCount (buffer=111422) at
bufmgr.c:398
#2 BufferGetBlockNumber (buffer=buffer@entry=111422) at bufmgr.c:2771
#3 0x0000564d6b83519d in heap_prune_chain (prstate=0x7fff07f03e10,
rootoffnum=7, buffer=111422) at pruneheap.c:625
#4 heap_page_prune (relation=relation@entry=0x7fa60a9cff80,
buffer=buffer@entry=111422,
vistest=vistest@entry=0x564d710816a0 <GlobalVisCatalogRels>,
old_snap_xmin=old_snap_xmin@entry=0,
old_snap_ts=old_snap_ts@entry=0, report_stats=report_stats@entry=false,
off_loc=<optimized out>) at pruneheap.c:278
#5 0x0000564d6b84a7d4 in lazy_scan_prune (vacrel=0x625000044190,
buf=<optimized out>, blkno=<optimized out>,
page=0x7fa642bdd580 <incomplete sequence \334>, vistest=<optimized out>,
prunestate=<optimized out>)
at vacuumlazy.c:1741
#6 0x0000564d6b8597cd in lazy_scan_heap (aggressive=<optimized out>,
params=<optimized out>, vacrel=<optimized out>)
at vacuumlazy.c:1384
#7 heap_vacuum_rel (rel=<optimized out>, params=<optimized out>,
bstrategy=<optimized out>) at vacuumlazy.c:638
#8 0x0000564d6bf9b6b6 in table_relation_vacuum (bstrategy=<optimized out>,
params=0x6250000272c4, rel=0x7fa60a9cff80)
at ../../../src/include/access/tableam.h:1678
#9 vacuum_rel (relid=<optimized out>, relation=<optimized out>,
params=0x6250000272c4) at vacuum.c:2034
#10 0x0000564d6bfa11f0 in vacuum (relations=0x625000041278,
relations@entry=0x6250000552b0,
params=params@entry=0x6250000272c4, bstrategy=<optimized out>,
bstrategy@entry=0x625000053048,
isTopLevel=isTopLevel@entry=true) at vacuum.c:475
#11 0x0000564d6c6e3a37 in autovacuum_do_vac_analyze
(bstrategy=0x625000053048, tab=0x6250000272c0) at autovacuum.c:3242
#12 do_autovacuum () at autovacuum.c:2490
#13 0x0000564d6c6e6681 in AutoVacWorkerMain (argv=0x0, argc=0) at
autovacuum.c:1714
#14 0x0000564d6c6e6a95 in StartAutoVacWorker () at autovacuum.c:1499
#15 0x0000564d6c7272d4 in StartAutovacuumWorker () at postmaster.c:5561
#16 sigusr1_handler (postgres_signal_arg=<optimized out>) at
postmaster.c:5266
#17 <signal handler called>
#18 __GI___sigprocmask (how=2, set=<optimized out>, oset=0x0) at
../sysdeps/unix/sysv/linux/sigprocmask.c:39
#19 0x0000564d6b52efc4 in __interceptor_sigprocmask.part.0 ()
#20 0x0000564d6c727d9e in ServerLoop () at postmaster.c:1707
#21 0x0000564d6c72c309 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x603000000250) at postmaster.c:1417
#22 0x0000564d6b4f6995 in main (argc=3, argv=0x603000000250) at main.c:209
(gdb) b 1820
Breakpoint 1 at 0x564d6b84b378: file vacuumlazy.c, line 1820.
(gdb) cont
Continuing.
Breakpoint 1, lazy_scan_prune (vacrel=0x625000044190, buf=<optimized out>,
blkno=<optimized out>,
page=0x7fa642bdd580 <incomplete sequence \334>, vistest=<optimized out>,
prunestate=<optimized out>)
at vacuumlazy.c:1820
1820 goto retry;
(gdb) print *vacrel
$15 = {rel = 0x7fa60a9cff80, indrels = 0x625000043b70, nindexes = 5,
failsafe_active = false,
consider_bypass_optimization = true, do_index_vacuuming = true,
do_index_cleanup = true, do_rel_truncate = true,
bstrategy = 0x625000053048, lps = 0x0, old_rel_pages = 6, old_live_tuples
= 133, relfrozenxid = 726, relminmxid = 1,
OldestXmin = 48155004, FreezeLimit = 4293122300, MultiXactCutoff =
4289973311,
relnamespace = 0x625000044140 "pg_catalog", relname = 0x625000044168
"pg_constraint", indname = 0x0, blkno = 11,
offnum = 32, phase = VACUUM_ERRCB_PHASE_SCAN_HEAP, dead_tuples =
0x62a0000a8240, rel_pages = 13, scanned_pages = 12,
pinskipped_pages = 0, frozenskipped_pages = 0, tupcount_pages = 12,
pages_removed = 0, lpdead_item_pages = 9,
nonempty_pages = 11, new_rel_tuples = 0, new_live_tuples = 0, indstats =
0x6250000444d8, num_index_scans = 0,
tuples_deleted = 9, lpdead_items = 226, new_dead_tuples = 0, num_tuples =
237, live_tuples = 237}
I've also caught a similar hang while (auto)vacuuming pg_attrdef,
pg_shdepend, and pg_class.
It seems that this eternal retry is caused by the HEAPTUPLE_RECENTLY_DEAD
state of a tuple processed in the lazy_scan_prune loop.
29.10.2021 16:00, PG Bug reporting form wrote:
The following bug has been logged on the website:
Bug reference: 17257
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 14.0
Operating system: Ubuntu 20.04
Description:
When running concurrent installchecks (x100) I've observed the autovacuum
process hanging:
I can propose a debugging patch to reproduce the issue: it replaces
the hang with an assert and modifies a pair of crash-causing test
scripts to simplify reproduction. (Sorry, I have no time now to prune
down the scripts further, as I have to leave for a week.)
The reproducing script is:
###
export PGDATABASE=regression
createdb regression
echo "
vacuum (verbose, skip_locked, index_cleanup off) pg_catalog.pg_class;
select pg_sleep(random()/50);
" >/tmp/pseudo-autovacuum.sql
( timeout 10m bash -c "while true; do psql -f
src/test/regress/sql/inherit.sql >>psql1.log 2>&1; coredumpctl
--no-pager >/dev/null 2>&1 && break; done" ) &
( timeout 10m bash -c "while true; do psql -f
src/test/regress/sql/vacuum.sql >>psql2.log 2>&1; coredumpctl --no-pager
>/dev/null 2>&1 && break; done" ) &
pgbench -n -f /tmp/pseudo-autovacuum.sql -C -c 40 -T 600 >pgbench.log 2>&1 &
wait
###
(I've set:
autovacuum=off
fsync=off
in postgresql.conf)
It leads to:
TIME PID UID GID SIG COREFILE EXE
Fri 2021-10-29 16:15:59 MSK 2337121 1000 1000 6 present
.../usr/local/pgsql/bin/postgres
real 2m19,425s
user 4m26,078s
sys 0m31,658s
Core was generated by `postgres: law regression [local]
VACUUM '.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f3d17e88859 in __GI_abort () at abort.c:79
#2 0x0000557358fbecb8 in ExceptionalCondition
(conditionName=conditionName@entry=0x557359027544 "numretries < 100",
errorType=errorType@entry=0x55735901800b "FailedAssertion",
fileName=0x7ffef5bbfa70 "\231\354\373XsU",
fileName@entry=0x557359027520 "vacuumlazy.c",
lineNumber=lineNumber@entry=1726) at assert.c:69
#3 0x0000557358ba0bdb in lazy_scan_prune (vacrel=0x55735a508d60,
buf=16184, blkno=270, page=0x7f3d16ef6f80 "",
vistest=0x5573592dd540 <GlobalVisCatalogRels>,
prunestate=0x7ffef5bc0fb0) at vacuumlazy.c:1823
#4 0x0000557358ba38c5 in lazy_scan_heap (aggressive=false,
params=0x7ffef5bc13a0, vacrel=<optimized out>)
at vacuumlazy.c:1384
#5 heap_vacuum_rel (rel=0x7f3d183ebec8, params=0x7ffef5bc13a0,
bstrategy=<optimized out>) at vacuumlazy.c:638
#6 0x0000557358cec785 in table_relation_vacuum (bstrategy=<optimized
out>, params=0x7ffef5bc13a0, rel=0x7f3d183ebec8)
at ../../../src/include/access/tableam.h:1678
#7 vacuum_rel (relid=1259, relation=<optimized out>,
params=params@entry=0x7ffef5bc13a0) at vacuum.c:2034
#8 0x0000557358cedf9e in vacuum (relations=0x55735a57bea8,
params=0x7ffef5bc13a0, bstrategy=<optimized out>,
isTopLevel=<optimized out>) at vacuum.c:475
#9 0x0000557358cee4f5 in ExecVacuum
(pstate=pstate@entry=0x55735a4ab730, vacstmt=vacstmt@entry=0x55735a48a408,
isTopLevel=isTopLevel@entry=true) at vacuum.c:268
#10 0x0000557358e9de4e in standard_ProcessUtility (pstmt=0x55735a48a758,
queryString=0x55735a4895e0 "vacuum (verbose, skip_locked,
index_cleanup off) pg_catalog.pg_class;",
readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, dest=0x55735a48a848,
qc=0x7ffef5bc1700) at utility.c:858
#11 0x0000557358e9c411 in PortalRunUtility
(portal=portal@entry=0x55735a4ed040, pstmt=pstmt@entry=0x55735a48a758,
isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55735a48a848,
qc=qc@entry=0x7ffef5bc1700) at pquery.c:1155
#12 0x0000557358e9c54d in PortalRunMulti
(portal=portal@entry=0x55735a4ed040, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55735a48a848, altdest=altdest@entry=0x55735a48a848,
qc=qc@entry=0x7ffef5bc1700) at pquery.c:1312
#13 0x0000557358e9cbe9 in PortalRun (portal=portal@entry=0x55735a4ed040,
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x55735a48a848,
altdest=altdest@entry=0x55735a48a848, qc=0x7ffef5bc1700) at pquery.c:788
#14 0x0000557358e9895b in exec_simple_query (
query_string=0x55735a4895e0 "vacuum (verbose, skip_locked,
index_cleanup off) pg_catalog.pg_class;")
at postgres.c:1214
#15 0x0000557358e9a561 in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7ffef5bc1b70, dbname=<optimized out>,
username=<optimized out>) at postgres.c:4486
#16 0x0000557358e064dd in BackendRun (port=0x55735a4aeb00,
port=0x55735a4aeb00) at postmaster.c:4506
#17 BackendStartup (port=0x55735a4aeb00) at postmaster.c:4228
#18 ServerLoop () at postmaster.c:1745
#19 0x0000557358e07481 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1417
#20 0x0000557358b352e2 in main (argc=3, argv=0x55735a4838a0) at main.c:209
Sometimes the server crashes in another way, as described in bug #17255.
And I've also seen the following crash (maybe it's yet another bug, but
I can't explore it now):
Core was generated by `postgres: law regression [local]
EXPLAIN '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 RelationBuildPartitionDesc (omit_detached=true, rel=0x7f09b87a27e8)
at partdesc.c:230
230 datum = heap_getattr(tuple,
Anum_pg_class_relpartbound,
(gdb) bt
#0 RelationBuildPartitionDesc (omit_detached=true, rel=0x7f09b87a27e8)
at partdesc.c:230
#1 RelationGetPartitionDesc (rel=0x7f09b87a27e8,
omit_detached=<optimized out>) at partdesc.c:110
#2 0x000055ec12169032 in PartitionDirectoryLookup (pdir=0x55ec12ee7660,
rel=rel@entry=0x7f09b87a27e8)
at partdesc.c:425
#3 0x000055ec12158b35 in set_relation_partition_info
(relation=0x7f09b87a27e8, rel=0x55ec12f337c0,
root=0x55ec12effae0) at plancat.c:2197
#4 get_relation_info (root=<optimized out>, relationObjectId=<optimized
out>, inhparent=<optimized out>,
rel=<optimized out>) at plancat.c:472
#5 0x000055ec1215d26f in build_simple_rel
(root=root@entry=0x55ec12effae0, relid=5,
parent=parent@entry=0x55ec12ee7330) at relnode.c:307
#6 0x000055ec1214f495 in expand_partitioned_rtentry
(root=0x55ec12effae0, relinfo=0x55ec12ee7330,
parentrte=0x55ec12e006d0, parentRTindex=1, parentrel=0x7f09b87a0630,
top_parentrc=0x0, lockmode=1) at inherit.c:398
#7 0x000055ec1214f67e in expand_inherited_rtentry
(root=root@entry=0x55ec12effae0, rel=0x55ec12ee7330,
rte=0x55ec12e006d0, rti=rti@entry=1) at inherit.c:143
#8 0x000055ec1212eb8d in add_other_rels_to_query
(root=root@entry=0x55ec12effae0) at initsplan.c:163
#9 0x000055ec12131e3f in query_planner (root=root@entry=0x55ec12effae0,
qp_callback=qp_callback@entry=0x55ec12132b20 <standard_qp_callback>,
qp_extra=qp_extra@entry=0x7ffcffe4e890)
at planmain.c:264
#10 0x000055ec121371b7 in grouping_planner (root=<optimized out>,
tuple_fraction=<optimized out>) at planner.c:1442
#11 0x000055ec12139d0e in subquery_planner
(glob=glob@entry=0x55ec12ef3eb8, parse=parse@entry=0x55ec12e005b8,
parent_root=parent_root@entry=0x0,
hasRecursion=hasRecursion@entry=false,
tuple_fraction=tuple_fraction@entry=0)
at planner.c:1019
#12 0x000055ec1213a383 in standard_planner (parse=0x55ec12e005b8,
query_string=<optimized out>, cursorOptions=2048,
boundParams=<optimized out>) at planner.c:400
#13 0x000055ec1220d7f8 in pg_plan_query (querytree=0x55ec12e005b8,
querytree@entry=0x7ffcffe4ead0,
query_string=query_string@entry=0x55ec12dff520 "explain (costs off)
select * from range_list_parted;",
cursorOptions=cursorOptions@entry=2048,
boundParams=boundParams@entry=0x0) at postgres.c:847
#14 0x000055ec12025fef in ExplainOneQuery (query=0x7ffcffe4ead0,
cursorOptions=2048, into=0x0, es=0x55ec12ef3bd0,
queryString=0x55ec12dff520 "explain (costs off) select * from
range_list_parted;", params=0x0, queryEnv=0x0)
at explain.c:397
#15 0x000055ec1202679f in ExplainQuery (pstate=0x55ec12ef3a70,
stmt=0x55ec12e003d8, params=0x0, dest=0x55ec12ef39d8)
at ../../../src/include/nodes/nodes.h:604
#16 0x000055ec122132c9 in standard_ProcessUtility (pstmt=0x55ec12e01058,
queryString=0x55ec12dff520 "explain (costs off) select * from
range_list_parted;", readOnlyTree=<optimized out>,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x55ec12ef39d8, qc=0x7ffcffe4ed60)
at utility.c:862
#17 0x000055ec122116cf in PortalRunUtility
(portal=portal@entry=0x55ec12e63060, pstmt=0x55ec12e01058,
isTopLevel=<optimized out>,
setHoldSnapshot=setHoldSnapshot@entry=true, dest=dest@entry=0x55ec12ef39d8,
qc=qc@entry=0x7ffcffe4ed60) at pquery.c:1155
#18 0x000055ec12211bb0 in FillPortalStore (portal=0x55ec12e63060,
isTopLevel=<optimized out>)
at ../../../src/include/nodes/nodes.h:604
#19 0x000055ec12211ebd in PortalRun (portal=portal@entry=0x55ec12e63060,
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x55ec12e01148,
altdest=altdest@entry=0x55ec12e01148, qc=0x7ffcffe4ef60) at pquery.c:760
#20 0x000055ec1220dcf5 in exec_simple_query (
query_string=0x55ec12dff520 "explain (costs off) select * from
range_list_parted;") at postgres.c:1214
#21 0x000055ec1220f886 in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7ffcffe4f4b0, dbname=<optimized out>,
username=<optimized out>) at postgres.c:4486
#22 0x000055ec1218071a in BackendRun (port=<optimized out>,
port=<optimized out>) at postmaster.c:4506
#23 BackendStartup (port=<optimized out>) at postmaster.c:4228
#24 ServerLoop () at postmaster.c:1745
#25 0x000055ec12181682 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x55ec12df9ce0) at postmaster.c:1417
#26 0x000055ec11eca3a0 in main (argc=3, argv=0x55ec12df9ce0) at main.c:209
Best regards,
Alexander
Attachments:
vacuum-hang-demo.patch (text/x-patch)
On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I can propose the debugging patch to reproduce the issue that replaces
the hang with the assert and modifies a pair of crash-causing test
scripts to simplify the reproducing. (Sorry, I have no time now to prune
down the scripts further as I have to leave for a week.)
This bug is similar to the one fixed in commit d9d8aa9b. And so I
wonder if code like GlobalVisTestFor() is missing something that it
needs for partitioned tables.
--
Peter Geoghegan
On Fri, 29 Oct 2021 at 20:17, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I can propose the debugging patch to reproduce the issue that replaces
the hang with the assert and modifies a pair of crash-causing test
scripts to simplify the reproducing. (Sorry, I have no time now to prune
down the scripts further as I have to leave for a week.)
This bug is similar to the one fixed in commit d9d8aa9b. And so I
wonder if code like GlobalVisTestFor() is missing something that it
needs for partitioned tables.
Without `autovacuum = off; fsync = off` I could not replicate the
issue in the configured 10m time window; with those options I did get
the reported trace in minutes.
I think that I also have found the culprit, which is something we
talked about in [0]: GlobalVisState->maybe_needed was not guaranteed
to never move backwards when recalculated, and because vacuum can
update its snapshot bounds (heap_prune_satisfies_vacuum ->
GlobalVisTestIsRemovableFullXid -> GlobalVisUpdate) this maybe_needed
could move backwards, resulting in the observed behaviour.
It was my understanding based on the mail conversation that Andres
would fix this observed issue too while fixing [0] (whose fix was
included with beta 2), but apparently I was wrong; I can't find the
code for 'maybe_needed'-won't-move-backwards-in-a-backend.
I (again) propose the attached patch, which ensures that this
maybe_needed field will not move backwards for a backend. It is
based on 14, but should be applied on head as well, because it's
lacking there as well.
Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
that is not legal given how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).
Note: After fixing the issue with retreating maybe_needed I also hit
your segfault, and I'm still trying to find out what the source of
that issue might be. I do think it is an issue separate from the stuck
vacuum, though.
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/20210609184506.rqm5rikoikm47csf@alap3.anarazel.de
Attachments:
0001-Fix-stuck-vacuum-due-to-retreating-GlobalVisState-ma.patch (application/octet-stream)
On Mon, 1 Nov 2021 at 16:15, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Fri, 29 Oct 2021 at 20:17, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I can propose the debugging patch to reproduce the issue that replaces
the hang with the assert and modifies a pair of crash-causing test
scripts to simplify the reproducing. (Sorry, I have no time now to prune
down the scripts further as I have to leave for a week.)
This bug is similar to the one fixed in commit d9d8aa9b. And so I
wonder if code like GlobalVisTestFor() is missing something that it
needs for partitioned tables.
Without `autovacuum = off; fsync = off` I could not replicate the
issue in the configured 10m time window; with those options I did get
the reported trace in minutes.
I think that I also have found the culprit, which is something we
talked about in [0]: GlobalVisState->maybe_needed was not guaranteed
to never move backwards when recalculated, and because vacuum can
update its snapshot bounds (heap_prune_satisfies_vacuum ->
GlobalVisTestIsRemovableFullXid -> GlobalVisUpdate) this maybe_needed
could move backwards, resulting in the observed behaviour.
It was my understanding based on the mail conversation that Andres
would fix this observed issue too while fixing [0] (whose fix was
included with beta 2), but apparently I was wrong; I can't find the
code for 'maybe_needed'-won't-move-backwards-in-a-backend.
I (again) propose the attached patch, which ensures that this
maybe_needed field will not move backwards for a backend. It is
based on 14, but should be applied on head as well, because it's
lacking there as well.
Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
that is not legal in how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).
Note: After fixing the issue with retreating maybe_needed I also hit
your segfault, and I'm still trying to find out what the source of
that issue might be. I do think it is an issue seperate from stuck
vacuum, though.
After further debugging, I think both of these might be caused by the
same issue, namely xmin horizon confusion resulting from restored
snapshots:
I seem to repeatedly get backends whose xmin is set from
InvalidTransactionId to some value < min(ProcGlobal->xids), which then
results in shared_oldest_nonremovable (and others) being less than the
value of their previous iteration. This leads to the infinite loop in
lazy_scan_prune (it stores and uses one value of
*_oldest_nonremovable, whereas heap_page_prune uses a more up-to-date
variant). Ergo, this issue is not really solved by my previous patch,
because apparently at this point we have snapshots with an xmin that is
only registered in the backend's procarray entry once that xmin is
already out of scope, which makes it generally impossible to determine
which tuples may or may not yet be vacuumed.
I noticed that when this happens, generally a parallel vacuum worker
is involved. I also think that this is intimately related to [0], and
how snapshots are restored in parallel workers: A vacuum worker is
generally ignored, but if its snapshot has the oldest xmin available,
then a parallel worker launched from that vacuum worker will move the
visible xmin backwards. Same for concurrent index creation jobs.
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/202110191807.5svc3kmm32tl@alvherre.pgsql
On Wed, Nov 3, 2021 at 8:46 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I seem to repeatedly get backends of which the xmin is set from
InvalidTransactionId to some value < min(ProcGlobal->xids), which then
result in shared_oldest_nonremovable (and others) being less than the
value of their previous iteration. This leads to the infinite loop in
lazy_scan_prune (it stores and uses one value of
*_oldest_nonremovable, whereas heap_page_prune uses a more up-to-date
variant).
I noticed that when this happens, generally a parallel vacuum worker
is involved.
Hmm. That is plausible. The way that VACUUM (and concurrent index
builds) avoid being seen via the PROC_IN_VACUUM thing is pretty
delicate. Wouldn't surprise me if the parallel VACUUM issue subtly
broke lazy_scan_prune in the way that we see here.
What about testing? Can we find a simple way of reducing this
complicated repro to a less complicated repro with a failing
assertion? Maybe an assertion that we get to keep after the bug is
fixed?
--
Peter Geoghegan
On Wed, 3 Nov 2021 at 17:21, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Nov 3, 2021 at 8:46 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I seem to repeatedly get backends of which the xmin is set from
InvalidTransactionId to some value < min(ProcGlobal->xids), which then
result in shared_oldest_nonremovable (and others) being less than the
value of their previous iteration. This leads to the infinite loop in
lazy_scan_prune (it stores and uses one value of
*_oldest_nonremovable, whereas heap_page_prune uses a more up-to-date
variant).
I noticed that when this happens, generally a parallel vacuum worker
is involved.
Hmm. That is plausible. The way that VACUUM (and concurrent index
builds) avoid being seen via the PROC_IN_VACUUM thing is pretty
delicate. Wouldn't surprise me if the parallel VACUUM issue subtly
broke lazy_scan_prune in the way that we see here.
What about testing? Can we find a simple way of reducing this
complicated repro to a less complicated repro with a failing
assertion? Maybe an assertion that we get to keep after the bug is
fixed?
I added the attached instrumentation for checking xmin validity, which
asserts what I believe are correct claims about the proc
infrastructure:
- It is always safe to set ->xmin to InvalidTransactionId: This
removes any claim that we have a snapshot anyone should worry about.
- If we have a valid ->xmin set, it is always safe to increase its value.
- Otherwise, the xmin must not lower the overall xmin of the database
it is connected to, plus some potential conditions for status flags.
It also may not be set without first taking the ProcArrayLock:
without synchronised access to the proc array, you cannot guarantee
you can set your xmin to a globally correct value.
It worked well with the bgworker flags patch [0], until I added this
instrumentation to SnapshotResetXmin and ran the regression tests: I
stumbled upon the following issue with aborting transactions, and I
don't know what the correct solution is to solve it:
AbortTransaction (see xact.c) calls ProcArrayEndTransaction, which can
reset MyProc->xmin to InvalidTransactionId (both directly and through
ProcArrayEndTransactionInternal). So far, this is safe.
However, later in AbortTransaction we call ResourceOwnerRelease(...,
RESOURCE_RELEASE_AFTER_LOCKS...), which will clean up the snapshots
stored in its owner->snapshotarr array using UnregisterSnapshot.
Then, if UnregisterSnapshot determines that a snapshot is now not
referenced anymore, and that snapshot has no active count, then it
will call SnapshotResetXmin().
Finally, when SnapshotResetXmin() is called, the oldest still
registered snapshot in RegisteredSnapshots will be pulled and
MyProc->xmin will be set to that snapshot's xmin.
Similarly, in AbortTransaction we call AtEOXact_Inval, which calls
ProcessInvalidationMessages -> LocalExecuteInvalidationMessage ->
InvalidateCatalogSnapshot -> SnapshotResetXmin, also setting
MyProc->xmin back to a non-InvalidXid value.
Note that from a third-party observer's standpoint we've just moved
our horizons backwards, and the regression tests (correctly) fail when
assertions are enabled.
I don't know what the expected behaviour is, but I do know that this
violates the expected invariant that xmin never goes backwards
(for any of the cluster, database, or data levels).
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/CAD21AoDkERUJkGEuQRiyGKmVRt2duU378UgnwBpqXQjA+EY3Lg@mail.gmail.com
Attachments:
v2-0001-Add-instrumentation-for-xmin-horizon-validation.patch (application/x-patch)
On Fri, Nov 5, 2021 at 4:43 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I added the attached instrumentation for checking xmin validity, which
asserts what I believe are correct claims about the proc
infrastructure:
This test case involves partitioning, but also pruning, which is very
particular about heap tuple headers being a certain way following
updates. I wonder if we're missing a
HeapTupleHeaderIndicatesMovedPartitions() test somewhere. Could be in
heapam/VACUUM/pruning code, or could be somewhere else.
Take a look at commit f16241bef7 to get some idea of what I mean.
--
Peter Geoghegan
On Fri, 5 Nov 2021 at 22:25, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Nov 5, 2021 at 4:43 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I added the attached instrumentation for checking xmin validity, which
asserts what I believe are correct claims about the proc
infrastructure:
This test case involves partitioning, but also pruning, which is very
particular about heap tuple headers being a certain way following
updates. I wonder if we're missing a
HeapTupleHeaderIndicatesMovedPartitions() test somewhere. Could be in
heapam/VACUUM/pruning code, or could be somewhere else.
If you watch closely, the second backtrace in [0] (the segfault)
originates from the code that builds the partition bounds based on
relcaches / catalog tables, which are never partitioned. Although it
is indeed in the partition infrastructure, if we'd have a tuple with
HeapTupleHeaderIndicatesMovedPartitions() at that point, then that'd
be a bug (we do not partition catalogs).
But I hit this same segfault earlier while testing, and I deduced that
the problem I hit at that point was that an index
entry could not resolve to a heap tuple (or the scan at partdesc.c:227
otherwise returned NULL where one result was expected); so that tuple
is NULL at partdesc.c:230, and heap_getattr subsequently segfaults
when it dereferences the null tuple pointer to access its fields.
Due to the blatant visibility horizon confusion, the failing scan
being on the pg_class table, and the test case including aggressive
manual vacuuming of the pg_class table, I assume that the error was
caused by vacuum having removed tuples from pg_class, while other
backends still required / expected access to these tuples.
Kind regards,
Matthias
[0]: /messages/by-id/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
On 2021-11-05 12:43:00 +0100, Matthias van de Meent wrote:
I added the attached instrumentation for checking xmin validity, which
asserts what I believe are correct claims about the proc
infrastructure:
- It is always safe to set ->xmin to InvalidTransactionId: This
removes any claim that we have a snapshot anyone should worry about.
- If we have a valid ->xmin set, it is always safe to increase its value.
I think I know what you mean, but of course you cannot just increase xmin if
there are existing snapshots requiring that xmin.
- Otherwise, the xmin must not lower the overall xmin of the database
it is connected to, plus some potential conditions for status flags.
walsenders can end up doing this IIRC.
It also may not be set without first taking the ProcArrayLock:
without synchronised access to the proc array, you cannot guarantee
you can set your xmin to a globally correct value.
There's possibly one exception around this, which is snapshot import. But
that's rare enough that an unnecessary acquisition is fine.
It worked well with the bgworker flags patch [0], until I added this
instrumentation to SnapshotResetXmin and ran the regression tests: I
stumbled upon the following issue with aborting transactions, and I
don't know what the correct solution is to solve it:
AbortTransaction (see xact.c) calls ProcArrayEndTransaction, which can
reset MyProc->xmin to InvalidTransactionId (both directly and through
ProcArrayEndTransactionInternal). So far, this is safe.
However, later in AbortTransaction we call ResourceOwnerRelease(...,
RESOURCE_RELEASE_AFTER_LOCKS...), which will clean up the snapshots
stored in its owner->snapshotarr array using UnregisterSnapshot.
Then, if UnregisterSnapshot determines that a snapshot is now not
referenced anymore, and that snapshot has no active count, then it
will call SnapshotResetXmin().
Finally, when SnapshotResetXmin() is called, the oldest still
registered snapshot in RegisteredSnapshots will be pulled and
MyProc->xmin will be set to that snapshot's xmin.
Yea, that's not great. This is a pretty old behaviour, IIRC?
We have an unwritten rule that a backend's xmin may not go back, but
this is not enforced.
I don't think we can make any of this into hard assertions. There's e.g. the
following comment:
* Note: despite the above, it's possible for the calculated values to move
* backwards on repeated calls. The calculated values are conservative, so
* that anything older is definitely not considered as running by anyone
* anymore, but the exact values calculated depend on a number of things. For
* example, if there are no transactions running in the current database, the
* horizon for normal tables will be latestCompletedXid. If a transaction
* begins after that, its xmin will include in-progress transactions in other
* databases that started earlier, so another call will return a lower value.
* Nonetheless it is safe to vacuum a table in the current database with the
* first result. There are also replication-related effects: a walsender
* process can set its xmin based on transactions that are no longer running
* on the primary but are still being replayed on the standby, thus possibly
* making the values go backwards. In this case there is a possibility that
* we lose data that the standby would like to have, but unless the standby
* uses a replication slot to make its xmin persistent there is little we can
* do about that --- data is only protected if the walsender runs continuously
* while queries are executed on the standby. (The Hot Standby code deals
* with such cases by failing standby queries that needed to access
* already-removed data, so there's no integrity bug.) The computed values
* are also adjusted with vacuum_defer_cleanup_age, so increasing that setting
* on the fly is another easy way to make horizons move backwards, with no
* consequences for data integrity.
Greetings,
Andres Freund
On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I can propose the debugging patch to reproduce the issue that replaces
the hang with the assert and modifies a pair of crash-causing test
scripts to simplify the reproducing. (Sorry, I have no time now to prune
down the scripts further as I have to leave for a week.)
The reproducing script is:
I cannot reproduce this bug by following your steps, even when the
assertion is made to fail after only 5 retries (5 is still ludicrously
excessive, 100 might be overkill). And even when I don't use a debug
build (and make the assertion into an equivalent PANIC). I wonder why
that is. I didn't have much trouble following your similar repro for
bug #17255.
My immediate goal in trying to follow your reproducer was to determine
what effect (if any) the pending bugfix for #17255 [1] has on this
bug. It seems more than possible that this bug is in fact a different
manifestation of the same underlying problem we see in #17255. And so
that should be the next thing we check here.
[1]: /messages/by-id/CAH2-WzkpG9KLQF5sYHaOO_dSVdOjM+dv=nTEn85oNfMUTk836Q@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan
On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote:
Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
v17 commit 1ccc1e05ae essentially did that.
that is not legal in how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).
Do you think commit 1ccc1e05ae creates problems in that respect? It does have
the effect of retaining tuples for which GlobalVisState rules "retain" but
HeapTupleSatisfiesVacuum() would have ruled "delete". If that doesn't create
problems, then back-patching commit 1ccc1e05ae could be a fix for remaining
infinite-retries scenarios, if any.
On Wed, Nov 10, 2021 at 12:28:43PM -0800, Peter Geoghegan wrote:
On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I can propose the debugging patch to reproduce the issue that replaces
the hang with the assert and modifies a pair of crash-causing test
scripts to simplify the reproducing. (Sorry, I have no time now to prune
down the scripts further as I have to leave for a week.)
The reproducing script is:
I cannot reproduce this bug by following your steps, even when the
assertion is made to fail after only 5 retries (5 is still ludicrously
excessive, 100 might be overkill). And even when I don't use a debug
build (and make the assertion into an equivalent PANIC). I wonder why
that is. I didn't have much trouble following your similar repro for
bug #17255.
For what it's worth, I needed "-X" on the script's psql command lines to keep
my ~/.psqlrc from harming things. I also wondered if the regression database
needed to be populated with a "make installcheck" run. The script had a
"createdb regression" without a "make installcheck", so I assumed an empty
regression database was intended.
My immediate goal in trying to follow your reproducer was to determine
what effect (if any) the pending bugfix for #17255 [1] has on this
bug. It seems more than possible that this bug is in fact a different
manifestation of the same underlying problem we see in #17255. And so
that should be the next thing we check here.
[1] /messages/by-id/CAH2-WzkpG9KLQF5sYHaOO_dSVdOjM+dv=nTEn85oNfMUTk836Q@mail.gmail.com
Using the /messages/by-id/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
procedure, I see these results:
- A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the
"numretries" assertion failure in each of five 10m runs.
- At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the
script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)",
File: "pruneheap.c", Line: 964. That happened once in two 10m runs, so it
was harder to reach than the numretries failure.
- At 18b87b201f, a 1440m script run got no failures.
I've seen symptoms that suggest the infinite-numretries bug remains reachable,
but I don't know how to reproduce them. (Given the upthread notes about xmin
going backward during end-of-xact, I'd first try some pauses there.) If it
does remain reachable, likely some other code change between 2021-10 and
2022-01 made this particular test script no longer reach it.
Thanks,
nm
On Sun, Dec 24, 2023 at 6:44 PM Noah Misch <noah@leadboat.com> wrote:
On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote:
Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
v17 commit 1ccc1e05ae essentially did that.
Obviously, 1ccc1e05ae would fix the immediate problem of infinite
retries, since it just rips out the loop.
that is not legal in how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).
Do you think commit 1ccc1e05ae creates problems in that respect? It does have
the effect of retaining tuples for which GlobalVisState rules "retain" but
HeapTupleSatisfiesVacuum() would have ruled "delete". If that doesn't create
problems, then back-patching commit 1ccc1e05ae could be a fix for remaining
infinite-retries scenarios, if any.
My guess is that there is a decent chance that backpatching 1ccc1e05ae
would be okay, but that isn't much use. I really don't know either way
right now. And I wouldn't like to speculate too much further before
gaining a proper understanding of what's going on here. Seems to be
specific to partitioning with cross-partition updates.
Using the /messages/by-id/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
procedure, I see these results:
- A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the
"numretries" assertion failure in each of five 10m runs.- At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the
script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)",
File: "pruneheap.c", Line: 964. That happened once in two 10m runs, so it
was harder to reach than the numretries failure.
- At 18b87b201f, a 1440m script run got no failures.
I've seen symptoms that suggest the infinite-numretries bug remains reachable,
but I don't know how to reproduce them. (Given the upthread notes about xmin
going backward during end-of-xact, I'd first try some pauses there.) If it
does remain reachable, likely some other code change between 2021-10 and
2022-01 made this particular test script no longer reach it.
I am aware of a production database that appears to run into the same
problem. Inserts and concurrent cross-partition updates are used
heavily on this instance (the table in question uses partitioning).
Perhaps you happened upon a similar problematic production database,
and found this thread when researching the issue? Maybe we've both
seen the same problem in the wild?
I have seen VACUUM get stuck like this on multiple versions, all
associated with the same application code/partitioning
scheme/workload. This includes a 15.4 instance, and various 14.* point
release instances. It seems likely that there is a bug here, and that
it affects Postgres 14, 15, and 16.
--
Peter Geoghegan
On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
On Sun, Dec 24, 2023 at 6:44 PM Noah Misch <noah@leadboat.com> wrote:
On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote:
Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
v17 commit 1ccc1e05ae essentially did that.
Obviously, 1ccc1e05ae would fix the immediate problem of infinite
retries, since it just rips out the loop.
Yep.
that is not legal in how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).
Do you think commit 1ccc1e05ae creates problems in that respect? It does have
the effect of retaining tuples for which GlobalVisState rules "retain" but
HeapTupleSatisfiesVacuum() would have ruled "delete". If that doesn't create
problems, then back-patching commit 1ccc1e05ae could be a fix for remaining
infinite-retries scenarios, if any.
My guess is that there is a decent chance that backpatching 1ccc1e05ae
would be okay, but that isn't much use. I really don't know either way
right now. And I wouldn't like to speculate too much further before
gaining a proper understanding of what's going on here.
Fair enough. While I agree there's a decent chance back-patching would be
okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
Matthias theorized. Something like: we update relfrozenxid based on
OldestXmin, even though GlobalVisState caused us to retain a tuple older than
OldestXmin. Then relfrozenxid disagrees with table contents.
Seems to be
specific to partitioning with cross-partition updates.
Using the /messages/by-id/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
procedure, I see these results:
- A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the
"numretries" assertion failure in each of five 10m runs.- At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the
script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)",
File: "pruneheap.c", Line: 964. That happened once in two 10m runs, so it
was harder to reach than the numretries failure.
- At 18b87b201f, a 1440m script run got no failures.
I've seen symptoms that suggest the infinite-numretries bug remains reachable,
but I don't know how to reproduce them. (Given the upthread notes about xmin
going backward during end-of-xact, I'd first try some pauses there.) If it
does remain reachable, likely some other code change between 2021-10 and
2022-01 made this particular test script no longer reach it.
I am aware of a production database that appears to run into the same
problem. Inserts and concurrent cross-partition updates are used
heavily on this instance (the table in question uses partitioning).
Perhaps you happened upon a similar problematic production database,
and found this thread when researching the issue? Maybe we've both
seen the same problem in the wild?
I did find this thread while researching the symptoms I was seeing. No
partitioning where I saw them.
I have seen VACUUM get stuck like this on multiple versions, all
associated with the same application code/partitioning
scheme/workload. This includes a 15.4 instance, and various 14.* point
release instances. It seems likely that there is a bug here, and that
it affects Postgres 14, 15, and 16.
Agreed.
On Sat, Jan 6, 2024 at 12:24 PM Noah Misch <noah@leadboat.com> wrote:
On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
My guess is that there is a decent chance that backpatching 1ccc1e05ae
would be okay, but that isn't much use. I really don't know either way
right now. And I wouldn't like to speculate too much further before
gaining a proper understanding of what's going on here.
Fair enough. While I agree there's a decent chance back-patching would be
okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
Matthias theorized. Something like: we update relfrozenxid based on
OldestXmin, even though GlobalVisState caused us to retain a tuple older than
OldestXmin. Then relfrozenxid disagrees with table contents.
Either every relevant code path has the same OldestXmin to work off
of, or the whole NewRelfrozenXid/relfrozenxid-tracking thing can't be
expected to work as designed. I find it a bit odd that
pruneheap.c/GlobalVisState has no direct understanding of this
dependency (none that I can discern, at least). Wouldn't it at least
be more natural if pruneheap.c could access OldestXmin when run inside
VACUUM? (Could just be used by defensive hardening code.)
We're also relying on vacuumlazy.c's call to vacuum_get_cutoffs()
(which itself calls GetOldestNonRemovableTransactionId) taking place
before vacuumlazy.c goes on to call GlobalVisTestFor() a few lines
further down (I think). It seems like even the code in procarray.c
might have something to say about the vacuumlazy.c dependency, too.
But offhand it doesn't look like it does, either. Why shouldn't we
expect random implementation details in code like ComputeXidHorizons()
to break the assumption/dependency within vacuumlazy.c?
I also worry about the possibility that GlobalVisTestShouldUpdate()
masks problems in this area (as opposed to causing the problems). It
seems very hard to test.
I did find this thread while researching the symptoms I was seeing. No
partitioning where I saw them.
If this was an isolated incident then it could perhaps have been a
symptom of corruption. Though corruption seems highly unlikely to be
involved with the cases that I've seen, since they appear
intermittently, across a variety of different contexts/versions.
--
Peter Geoghegan
On Sat, Jan 6, 2024 at 1:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Sat, Jan 6, 2024 at 12:24 PM Noah Misch <noah@leadboat.com> wrote:
Fair enough. While I agree there's a decent chance back-patching would be
okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
Matthias theorized. Something like: we update relfrozenxid based on
OldestXmin, even though GlobalVisState caused us to retain a tuple older than
OldestXmin. Then relfrozenxid disagrees with table contents.
Either every relevant code path has the same OldestXmin to work off
of, or the whole NewRelfrozenXid/relfrozenxid-tracking thing can't be
expected to work as designed. I find it a bit odd that
pruneheap.c/GlobalVisState has no direct understanding of this
dependency (none that I can discern, at least).
What do you think of the idea of adding a defensive "can't happen"
error to lazy_scan_prune() that will catch DEAD or RECENTLY_DEAD
tuples with storage that still contain an xmax < OldestXmin? This
probably won't catch every possible problem, but I suspect it'll work
well enough.
--
Peter Geoghegan
On Sat, Jan 06, 2024 at 01:30:40PM -0800, Peter Geoghegan wrote:
On Sat, Jan 6, 2024 at 12:24 PM Noah Misch <noah@leadboat.com> wrote:
On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
My guess is that there is a decent chance that backpatching 1ccc1e05ae
would be okay, but that isn't much use. I really don't know either way
right now. And I wouldn't like to speculate too much further before
gaining a proper understanding of what's going on here.
Fair enough. While I agree there's a decent chance back-patching would be
okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
Matthias theorized. Something like: we update relfrozenxid based on
OldestXmin, even though GlobalVisState caused us to retain a tuple older than
OldestXmin. Then relfrozenxid disagrees with table contents.
Either every relevant code path has the same OldestXmin to work off
of, or the whole NewRelfrozenXid/relfrozenxid-tracking thing can't be
expected to work as designed. I find it a bit odd that
pruneheap.c/GlobalVisState has no direct understanding of this
dependency (none that I can discern, at least). Wouldn't it at least
be more natural if pruneheap.c could access OldestXmin when run inside
VACUUM? (Could just be used by defensive hardening code.)
Tied to that decision is the choice of semantics when the xmin horizon moves
backward during one VACUUM, e.g. when a new walsender xmin does so. Options:
1. Continue to remove tuples based on the OldestXmin from VACUUM's start. We
could have already removed some of those tuples, so the walsender xmin
won't achieve a guarantee anyway. (VACUUM would want ratchet-like behavior
in GlobalVisState, possibly by sharing OldestXmin with pruneheap like you
say.)
2. Move OldestXmin backward, to reflect the latest xmin horizon. (Perhaps
VACUUM would just pass GlobalVisState to a function that returns the
compatible OldestXmin.)
Which way is better?
We're also relying on vacuumlazy.c's call to vacuum_get_cutoffs()
(which itself calls GetOldestNonRemovableTransactionId) taking place
before vacuumlazy.c goes on to call GlobalVisTestFor() a few lines
further down (I think). It seems like even the code in procarray.c
might have something to say about the vacuumlazy.c dependency, too.
But offhand it doesn't look like it does, either. Why shouldn't we
expect random implementation details in code like ComputeXidHorizons()
to break the assumption/dependency within vacuumlazy.c?
Makes sense.
On Sat, Jan 06, 2024 at 01:41:23PM -0800, Peter Geoghegan wrote:
What do you think of the idea of adding a defensive "can't happen"
error to lazy_scan_prune() that will catch DEAD or RECENTLY_DEAD
tuples with storage that still contain an xmax < OldestXmin? This
probably won't catch every possible problem, but I suspect it'll work
well enough.
So before the "goto retry", ERROR if the tuple header suggests this mismatch
is happening? That, or limiting the retry count, could help.
On Sat, Jan 6, 2024 at 5:44 PM Noah Misch <noah@leadboat.com> wrote:
Tied to that decision is the choice of semantics when the xmin horizon moves
backward during one VACUUM, e.g. when a new walsender xmin does so. Options:
1. Continue to remove tuples based on the OldestXmin from VACUUM's start. We
could have already removed some of those tuples, so the walsender xmin
won't achieve a guarantee anyway. (VACUUM would want ratchet-like behavior
in GlobalVisState, possibly by sharing OldestXmin with pruneheap like you
say.)
2. Move OldestXmin backward, to reflect the latest xmin horizon. (Perhaps
VACUUM would just pass GlobalVisState to a function that returns the
compatible OldestXmin.)
Which way is better?
I suppose that a hybrid of these two approaches makes the most sense.
A design that's a lot closer to your #1 than to your #2.
Under this scheme, pruneheap.c would be explicitly aware of
OldestXmin, and would promise to respect the exact invariant that we
need to avoid getting stuck in lazy_scan_prune's loop (or avoid
confused NewRelfrozenXid tracking on HEAD, which no longer has this
loop). But it'd be limited to that exact invariant; we'd still avoid
unduly imposing any requirements on pruning-away deleted tuples whose
xmax was >= OldestXmin. lazy_scan_prune/vacuumlazy.c shouldn't care if
we prune away any "extra" heap tuples, just because we can (or just
because it's convenient to the implementation). Andres has in the past
placed a lot of emphasis on being able to update the
GlobalVisState-wise bounds on the fly. Not sure that it's really that
important that VACUUM does that, but there is no reason to not allow
it. So we can keep that property (as well as the aforementioned
high-level OldestXmin immutability property).
More importantly (at least to me), this scheme allows vacuumlazy.c to
continue to treat OldestXmin as an immutable cutoff for both pruning
and freezing -- the high level design doesn't need any revisions. We
already "freeze away" multixact member XIDs >= OldestXmin in certain
rare cases (i.e. we remove lockers that are determined to no longer be
running in FreezeMultiXactId's "second pass" slow path), so there is
nothing fundamentally novel about the idea of removing some extra XIDs
>= OldestXmin in passing, just because it happens to be convenient to
some low-level piece of code that's external to vacuumlazy.c.
What do you think of that general approach? I see no reason why it
matters if OldestXmin goes backwards across two VACUUM operations, so
I haven't tried to avoid that.
On Sat, Jan 06, 2024 at 01:41:23PM -0800, Peter Geoghegan wrote:
What do you think of the idea of adding a defensive "can't happen"
error to lazy_scan_prune() that will catch DEAD or RECENTLY_DEAD
tuples with storage that still contain an xmax < OldestXmin? This
probably won't catch every possible problem, but I suspect it'll work
well enough.
So before the "goto retry", ERROR if the tuple header suggests this mismatch
is happening? That, or limiting the retry count, could help.
When I wrote this code, my understanding was that the sole reason for
needing to loop back was a concurrently-aborted xact. In principle we
ought to be able to test the tuple to detect if it's that exact case
(the only truly valid case), and then throw an error if we somehow got
it wrong. That kind of hardening would at least be correct according
to my original understanding of things.
There is an obvious practical concern with adding such hardening now:
what if the current loop is accidentally protective, in whatever way?
That seems quite possible. I seem to recall that Andres supposed at
some point that the loop's purpose wasn't limited to the
concurrently-aborted-inserter case that I believed was the only
relevant case back when I worked on what became commit 8523492d4e
("Remove tupgone special case from vacuumlazy.c"). I don't have a
reference for that, but I'm pretty sure it was said at some point
around the release of 14.
--
Peter Geoghegan
On Mon, Jan 08, 2024 at 12:02:01PM -0500, Peter Geoghegan wrote:
On Sat, Jan 6, 2024 at 5:44 PM Noah Misch <noah@leadboat.com> wrote:
Tied to that decision is the choice of semantics when the xmin horizon moves
backward during one VACUUM, e.g. when a new walsender xmin does so. Options:
1. Continue to remove tuples based on the OldestXmin from VACUUM's start. We
could have already removed some of those tuples, so the walsender xmin
won't achieve a guarantee anyway. (VACUUM would want ratchet-like behavior
in GlobalVisState, possibly by sharing OldestXmin with pruneheap like you
say.)
2. Move OldestXmin backward, to reflect the latest xmin horizon. (Perhaps
VACUUM would just pass GlobalVisState to a function that returns the
compatible OldestXmin.)
Which way is better?
I suppose that a hybrid of these two approaches makes the most sense.
A design that's a lot closer to your #1 than to your #2.
Under this scheme, pruneheap.c would be explicitly aware of
OldestXmin, and would promise to respect the exact invariant that we
need to avoid getting stuck in lazy_scan_prune's loop (or avoid
confused NewRelfrozenXid tracking on HEAD, which no longer has this
loop). But it'd be limited to that exact invariant; we'd still avoid
unduly imposing any requirements on pruning-away deleted tuples whose
xmax was >= OldestXmin. lazy_scan_prune/vacuumlazy.c shouldn't care if
we prune away any "extra" heap tuples, just because we can (or just
because it's convenient to the implementation). Andres has in the past
placed a lot of emphasis on being able to update the
GlobalVisState-wise bounds on the fly. Not sure that it's really that
important that VACUUM does that, but there is no reason to not allow
it. So we can keep that property (as well as the aforementioned
high-level OldestXmin immutability property).
More importantly (at least to me), this scheme allows vacuumlazy.c to
continue to treat OldestXmin as an immutable cutoff for both pruning
and freezing -- the high level design doesn't need any revisions. We
already "freeze away" multixact member XIDs >= OldestXmin in certain
rare cases (i.e. we remove lockers that are determined to no longer be
running in FreezeMultiXactId's "second pass" slow path), so there is
nothing fundamentally novel about the idea of removing some extra XIDs
>= OldestXmin in passing, just because it happens to be convenient to
some low-level piece of code that's external to vacuumlazy.c.
What do you think of that general approach?
That all sounds good to me.
I see no reason why it
matters if OldestXmin goes backwards across two VACUUM operations, so
I haven't tried to avoid that.
That may be fully okay, or we may want to clamp OldestXmin to be no older than
relfrozenxid. I don't feel great about the system moving relfrozenxid
backward unless it observed an older XID, and observing an older XID would be
a corruption signal. I don't have a specific way non-monotonic relfrozenxid
breaks things, though.
On Mon, Jan 8, 2024 at 1:21 PM Noah Misch <noah@leadboat.com> wrote:
I see no reason why it
matters if OldestXmin goes backwards across two VACUUM operations, so
I haven't tried to avoid that.
That may be fully okay, or we may want to clamp OldestXmin to be no older than
relfrozenxid. I don't feel great about the system moving relfrozenxid
backward unless it observed an older XID, and observing an older XID would be
a corruption signal. I don't have a specific way non-monotonic relfrozenxid
breaks things, though.
We're already prepared for this -- relfrozenxid simply cannot go
backwards, regardless of what vacuumlazy.c thinks. That is,
vac_update_relstats() won't accept a new relfrozenxid that is < its
existing value (unless it's a value "from the future", which is a way
of recovering after historical pg_upgrade-related corruption bugs).
If memory serves it doesn't take much effort to exercise the relevant
code within vac_update_relstats(). I'm pretty sure that the regression
tests will fail if you run them after removing its defensive
no-older-relfrozenxid test (though I haven't checked recently).
--
Peter Geoghegan