Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Started by Dilip Kumar over 5 years ago, 32 messages
#1 Dilip Kumar
dilipbalaut@gmail.com
1 attachment(s)

The attached patch allows the vacuum to continue by emitting a
WARNING for each corrupted tuple instead of immediately erroring out,
as discussed at [1].

Basically, it provides a new GUC called vacuum_tolerate_damage to
control whether the vacuum continues or stops when a corrupted tuple
is encountered. If vacuum_tolerate_damage is set, then in all the
cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
it will emit a warning and return false, changing nothing in the
tuple, with 'tuple_totally_frozen' also set to false. Since we return
false, the caller will not try to freeze such a tuple, and because
tuple_totally_frozen is false the page will not be marked all-frozen
even if all the other tuples on the page are frozen.
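
For illustration, usage would look something like this (a minimal
sketch, assuming the patch is applied; 'damaged_table' is a
placeholder table name):

    -- vacuum_tolerate_damage is proposed as a USERSET GUC, so it can
    -- be enabled for just one session:
    SET vacuum_tolerate_damage = on;
    VACUUM (FREEZE, VERBOSE) damaged_table;  -- WARNINGs instead of an ERROR
    RESET vacuum_tolerate_damage;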

Alternatively, we could try to freeze the other XIDs in the tuple
that are not corrupted, but I don't think we would gain anything from
that, because if one of xmin or xmax is wrong then the next vacuum
run will hit the same WARNING or ERROR again. Is there any other
opinion on this?

[1]: /messages/by-id/CA+TgmoaZwZHtFFU6NUJgEAp6adDs-qWfNOXpZGQpZMmm0VTDfg@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patch (application/x-patch)
From 5421c567a168705272bde425a02eb248c4468cb0 Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Thu, 16 Jul 2020 11:41:28 +0530
Subject: [PATCH v1] Provide a GUC to allow vacuum to continue on corrupted
 tuple

A new GUC to control whether vacuum errors out immediately on
detection of a corrupted tuple, or just emits a WARNING for each such
instance and completes the rest of the vacuuming.
---
 doc/src/sgml/config.sgml                      | 21 ++++++++
 src/backend/access/heap/heapam.c              | 28 +++++++++--
 src/backend/commands/vacuum.c                 | 15 ++++++
 src/backend/utils/misc/guc.c                  |  9 ++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/commands/vacuum.h                 |  2 +
 .../test_misc/t/002_vacuum_tolerate_damage.pl | 49 +++++++++++++++++++
 7 files changed, 120 insertions(+), 5 deletions(-)
 create mode 100644 src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b353c61683..3c1e647e27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8250,6 +8250,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+      <varlistentry id="guc-vacuum-tolerate-damage" xreflabel="vacuum_tolerate_damage">
+      <term><varname>vacuum_tolerate_damage</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>vacuum_tolerate_damage</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When set to off, which is the default, <command>VACUUM</command> will raise
+        an ERROR-level error if it detects a corrupted tuple.  This causes the
+        operation to stop immediately.
+       </para>
+
+       <para>
+        If set to on, <command>VACUUM</command> will instead emit a WARNING for each
+        such tuple and the operation will continue with the rest of the
+        vacuuming.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e09c8101e7..fc31799da5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	frz->t_infomask = tuple->t_infomask;
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
+	*totally_frozen_p = false;
+
 	/*
 	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
 	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
@@ -6137,19 +6140,25 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	else
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
-			ereport(ERROR,
+		{
+			ereport(vacuum_damage_elevel(),
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
+			return false;
+		}
 
 		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
 		if (xmin_frozen)
 		{
 			if (!TransactionIdDidCommit(xid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel(),
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
 										 xid, cutoff_xid)));
+				return false;
+			}
 
 			frz->t_infomask |= HEAP_XMIN_FROZEN;
 			changed = true;
@@ -6218,10 +6227,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	else if (TransactionIdIsNormal(xid))
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
-			ereport(ERROR,
+		{
+			ereport(vacuum_damage_elevel(),
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmax %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
+			return false;
+		}
 
 		if (TransactionIdPrecedes(xid, cutoff_xid))
 		{
@@ -6233,10 +6245,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
 				TransactionIdDidCommit(xid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel(),
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("cannot freeze committed xmax %u",
 										 xid)));
+				return false;
+			}
 			freeze_xmax = true;
 		}
 		else
@@ -6249,10 +6264,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		xmax_already_frozen = true;
 	}
 	else
-		ereport(ERROR,
+	{
+		ereport(vacuum_damage_elevel(),
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
 								 xid, tuple->t_infomask)));
+		return false;
+	}
 
 	if (freeze_xmax)
 	{
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 576c7e63e9..5a093c231c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -63,6 +63,8 @@ int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
 
+/* Whether to continue with the vacuuming after detecting a corrupted tuple */
+bool vacuum_tolerate_damage = false;
 
 /* A few variables that don't seem worth passing around as parameters */
 static MemoryContext vac_context = NULL;
@@ -2102,3 +2104,16 @@ get_vacopt_ternary_value(DefElem *def)
 {
 	return defGetBoolean(def) ? VACOPT_TERNARY_ENABLED : VACOPT_TERNARY_DISABLED;
 }
+
+/*
+ * vacuum_damage_elevel
+ *
+ * Return the error level for reporting a corrupted tuple encountered while
+ * trying to freeze tuples during vacuum.  Based on the GUC value, this is
+ * either ERROR or WARNING.
+ */
+int
+vacuum_damage_elevel(void)
+{
+	return vacuum_tolerate_damage ? WARNING : ERROR;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 031ca0327f..683812dd4a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2041,6 +2041,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_tolerate_damage", PGC_USERSET, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Whether to continue running vacuum after detecting corrupted tuple."),
+		},
+		&vacuum_tolerate_damage,
+		false,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e430e33c7b..88654f4983 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -666,6 +666,7 @@
 #vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
 						# before index cleanup, 0 always performs
 						# index cleanup
+#vacuum_tolerate_damage = false
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index a4cd721400..5e711832bc 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -237,6 +237,7 @@ extern int	vacuum_freeze_min_age;
 extern int	vacuum_freeze_table_age;
 extern int	vacuum_multixact_freeze_min_age;
 extern int	vacuum_multixact_freeze_table_age;
+extern bool vacuum_tolerate_damage;
 
 /* Variables for cost-based parallel vacuum */
 extern pg_atomic_uint32 *VacuumSharedCostBalance;
@@ -278,6 +279,7 @@ extern bool vacuum_is_relation_owner(Oid relid, Form_pg_class reltuple,
 									 int options);
 extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
 									 int options, bool verbose, LOCKMODE lmode);
+extern int vacuum_damage_elevel(void);
 
 /* in commands/analyze.c */
 extern void analyze_rel(Oid relid, RangeVar *relation,
diff --git a/src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl b/src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl
new file mode 100644
index 0000000000..acf1266ab1
--- /dev/null
+++ b/src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl
@@ -0,0 +1,49 @@
+# Verify the vacuum_tolerate_damage GUC
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Initialize a test cluster
+my $node = get_new_node('primary');
+$node->init();
+$node->start;
+
+# Run a SQL command and return psql's stderr
+sub run_sql_command
+{
+	my $sql = shift;
+	my $stderr;
+
+	$node->psql(
+		'postgres',
+		$sql,
+		stderr        => \$stderr);
+	return $stderr;
+}
+
+my $output;
+
+# Insert 2 tuples into the table and update the relfrozenxid for the table
+# to a future xid.
+run_sql_command(
+	"create table test_vac (a int) WITH (autovacuum_enabled = off);
+	 insert into test_vac values (1), (2);
+	 update pg_class set relfrozenxid=txid_current()::text::xid where relname='test_vac';");
+
+$output = run_sql_command('vacuum(freeze, disable_page_skipping) test_vac;');
+ok( $output =~ m/ERROR:.*found xmin.*before relfrozenxid/);
+
+# set the vacuum_tolerate_damage and run again
+$output = run_sql_command('set vacuum_tolerate_damage=true;
+						   vacuum(freeze, disable_page_skipping) test_vac;');
+
+
+# this time we should get a WARNING for both tuples
+ok( scalar( @{[ $output=~/WARNING:.*found xmin.*before relfrozenxid/gi ]}) == 2);
+
+run_sql_command('DROP TABLE test_vac;');
+
+$node->stop('fast');
-- 
2.23.0

#2 Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Dilip Kumar (#1)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Hi Dilip!

On 17 July 2020, at 15:46, Dilip Kumar <dilipbalaut@gmail.com> wrote:

The attached patch allows the vacuum to continue by emitting a
WARNING for each corrupted tuple instead of immediately erroring out,
as discussed at [1].

Basically, it provides a new GUC called vacuum_tolerate_damage to
control whether the vacuum continues or stops when a corrupted tuple
is encountered. If vacuum_tolerate_damage is set, then in all the
cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
it will emit a warning and return false, changing nothing in the
tuple, with 'tuple_totally_frozen' also set to false. Since we return
false, the caller will not try to freeze such a tuple, and because
tuple_totally_frozen is false the page will not be marked all-frozen
even if all the other tuples on the page are frozen.

Alternatively, we could try to freeze the other XIDs in the tuple
that are not corrupted, but I don't think we would gain anything from
that, because if one of xmin or xmax is wrong then the next vacuum
run will hit the same WARNING or ERROR again. Is there any other
opinion on this?

FWIW AFAIK this ERROR was the reason why we had to use older versions of heap_prepare_freeze_tuple() in our recovery kit [0].
So +1 from me.
But I do not think that just ignoring corruption here is sufficient. Soon after this freeze problem the user will probably have to deal with an absent CLOG.
I think this GUC is only a part of an incomplete solution.
Personally I'd be happy if this is backported - our recovery kit would be much smaller. But this does not seem like a valid reason.

Thanks!

Best regards, Andrey Borodin.

[0]: https://github.com/dsarafan/pg_dirty_hands/blob/master/src/pg_dirty_hands.c#L443

#3 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#1)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Fri, Jul 17, 2020 at 4:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The attached patch allows the vacuum to continue by emitting a
WARNING for each corrupted tuple instead of immediately erroring out,
as discussed at [1].

Basically, it provides a new GUC called vacuum_tolerate_damage to
control whether the vacuum continues or stops when a corrupted tuple
is encountered. If vacuum_tolerate_damage is set, then in all the
cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
it will emit a warning and return false, changing nothing in the
tuple, with 'tuple_totally_frozen' also set to false. Since we return
false, the caller will not try to freeze such a tuple, and because
tuple_totally_frozen is false the page will not be marked all-frozen
even if all the other tuples on the page are frozen.

Alternatively, we could try to freeze the other XIDs in the tuple
that are not corrupted, but I don't think we would gain anything from
that, because if one of xmin or xmax is wrong then the next vacuum
run will hit the same WARNING or ERROR again. Is there any other
opinion on this?

Robert has mentioned at [1] that we probably should refuse to update
'relfrozenxid/relminmxid' when we encounter such a tuple, and emit a
WARNING instead of an error. I think we should do that in some cases,
but IMHO it is not a good idea in all of them. Basically, if the xmin
precedes the relfrozenxid then we could probably still allow the
relfrozenxid to be updated, whereas if the xmin precedes the cutoff
xid and is still uncommitted then we should probably stop the
relfrozenxid from being updated, so that the CLOG does not get
truncated. I will make these changes if we agree with the idea. Or
should we keep it simple and never allow 'relfrozenxid/relminmxid' to
be updated in such cases?
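
For context, the CLOG truncation point follows the oldest datfrozenxid
in the cluster, which in turn follows the oldest relfrozenxid in each
database; the horizons involved can be inspected with a query along
these lines (a sketch, not part of the patch):

    -- Oldest frozen-xid horizons, per database and per table:
    SELECT datname, datfrozenxid, age(datfrozenxid) FROM pg_database;
    SELECT relname, relfrozenxid, age(relfrozenxid)
    FROM pg_class WHERE relkind = 'r'
    ORDER BY age(relfrozenxid) DESC LIMIT 5;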

[1]: /messages/by-id/CA+TgmoaZwZHtFFU6NUJgEAp6adDs-qWfNOXpZGQpZMmm0VTDfg@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#4 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andrey M. Borodin (#2)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Sun, Jul 19, 2020 at 4:56 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

Hi Dilip!

On 17 July 2020, at 15:46, Dilip Kumar <dilipbalaut@gmail.com> wrote:

The attached patch allows the vacuum to continue by emitting a
WARNING for each corrupted tuple instead of immediately erroring out,
as discussed at [1].

Basically, it provides a new GUC called vacuum_tolerate_damage to
control whether the vacuum continues or stops when a corrupted tuple
is encountered. If vacuum_tolerate_damage is set, then in all the
cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
it will emit a warning and return false, changing nothing in the
tuple, with 'tuple_totally_frozen' also set to false. Since we return
false, the caller will not try to freeze such a tuple, and because
tuple_totally_frozen is false the page will not be marked all-frozen
even if all the other tuples on the page are frozen.

Alternatively, we could try to freeze the other XIDs in the tuple
that are not corrupted, but I don't think we would gain anything from
that, because if one of xmin or xmax is wrong then the next vacuum
run will hit the same WARNING or ERROR again. Is there any other
opinion on this?

FWIW AFAIK this ERROR was the reason why we had to use older versions of heap_prepare_freeze_tuple() in our recovery kit [0].
So +1 from me.

Thanks for showing interest in this patch.

But I do not think that just ignoring corruption here is sufficient. Soon after this freeze problem the user will probably have to deal with an absent CLOG.
I think this GUC is only a part of an incomplete solution.
Personally I'd be happy if this is backported - our recovery kit would be much smaller. But this does not seem like a valid reason.

I agree that this solves just one part of the problem, and in some
cases it may not help if the CLOG itself is corrupted, i.e. does not
exist for the xids that are not yet frozen.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#3)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On 2020-Jul-20, Dilip Kumar wrote:

On Fri, Jul 17, 2020 at 4:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

If vacuum_tolerate_damage is set, then in all the
cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
it will emit a warning and return false, changing nothing in the
tuple, with 'tuple_totally_frozen' also set to false. Since we return
false, the caller will not try to freeze such a tuple, and because
tuple_totally_frozen is false the page will not be marked all-frozen
even if all the other tuples on the page are frozen.

Robert has mentioned at [1] that we probably should refuse to update
'relfrozenxid/relminmxid' when we encounter such a tuple, and emit a
WARNING instead of an error.

Isn't this already happening per your description above?

I think we should do that in some cases,
but IMHO it is not a good idea in all of them. Basically, if the xmin
precedes the relfrozenxid then we could probably still allow the
relfrozenxid to be updated, whereas if the xmin precedes the cutoff
xid and is still uncommitted then we should probably stop the
relfrozenxid from being updated, so that the CLOG does not get truncated.

I'm not sure I understand 100% what you're talking about here (the first
half seems dangerous unless you misspoke), but in any case it seems a
pointless optimization. I mean, if the heap is corrupted, you can hope
to complete the vacuum (which will hopefully return which *other* tuples
are similarly corrupt) but trying to advance relfrozenxid is a lost
cause.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6 Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Alvaro Herrera (#5)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On 20 July 2020, at 21:44, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I think we should do that in some cases,
but IMHO it is not a good idea in all of them. Basically, if the xmin
precedes the relfrozenxid then we could probably still allow the
relfrozenxid to be updated, whereas if the xmin precedes the cutoff
xid and is still uncommitted then we should probably stop the
relfrozenxid from being updated, so that the CLOG does not get truncated.

I'm not sure I understand 100% what you're talking about here (the first
half seems dangerous unless you misspoke), but in any case it seems a
pointless optimization. I mean, if the heap is corrupted, you can hope
to complete the vacuum (which will hopefully return which *other* tuples
are similarly corrupt) but trying to advance relfrozenxid is a lost
cause.

I think the point here is to actually move relfrozenxid back. But the mince can't be turned back: if the CLOG has been rotated, the table is corrupted beyond easy repair.

I'm not sure it's Dilip's case, but I'll try to describe what I was encountering.

We were observing this kind of corruption in three cases:
1. With a bug in a patched Linux kernel page cache, we could lose an FS page write.
2. With a bug in WAL-G block-level incremental backup, we could lose an update of the page.
3. With a firmware bug in SSD drives from one vendor, a write to the block storage device was lost.
One page in the database is then at some non-latest version (but with a correct checksum; it's just an old version). In our case it was usually the VACUUMing of a page that was lost (including the freezing of all its tuples). Some tuples are not marked as frozen, while the VM has the frozen bit set for the page. Everything works just fine until someone updates a tuple on the same page: the VM bit is reset, and eventually a user will try to consult the CLOG, which has already been truncated.
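
One way to see this state on a suspect page is something like the
following (a sketch using the pageinspect extension; the table name
and block number are placeholders):

    -- Compare raw tuple xmins on a page against the table's relfrozenxid;
    -- an unfrozen xmin older than relfrozenxid, on a page the VM claims
    -- is frozen, is exactly the symptom described above.
    CREATE EXTENSION IF NOT EXISTS pageinspect;
    SELECT lp, t_xmin, t_xmax, t_infomask
    FROM heap_page_items(get_raw_page('some_table', 0));
    SELECT relfrozenxid FROM pg_class WHERE relname = 'some_table';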

This is why we may need to defer CLOG truncation or even move relfrozenxid back.

FWIW we coped with this by actively monitoring this kind of corruption with this amcheck patch [0]. One can observe these lost page updates cheaply in indexes and act at the first sight of corruption: identify the source of the buggy behaviour.

Dilip, does this sound like a corruption case you are working on?

Thanks!

Best regards, Andrey Borodin.

[0]: https://commitfest.postgresql.org/24/2254/

#7 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrey M. Borodin (#6)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On 2020-Jul-20, Andrey M. Borodin wrote:

I think the point here is to actually move relfrozenxid back. But the
mince can't be turned back: if the CLOG has been rotated, the table is
corrupted beyond easy repair.

Oh, I see. Hmm. Well, if you discover relfrozenxid that's newer and
the pg_clog files are still there, then yeah it might make sense to move
relfrozenxid back. But it seems difficult to do correctly ... you have
to move datfrozenxid back too ... frankly, I'd rather not go there.

I'm not sure it's Dilip's case, but I'll try to describe what I was encountering.

We were observing this kind of corruption in three cases:
1. With a bug in a patched Linux kernel page cache, we could lose an FS page write.

I think I've seen this too. (Or possibly your #3, which from Postgres
POV is the same thing -- a write was claimed done but actually not
done.)

FWIW we coped with this by actively monitoring this kind of corruption
with this amcheck patch [0]. One can observe these lost page updates
cheaply in indexes and act at the first sight of corruption: identify
the source of the buggy behaviour.

Right.

I wish we had some way to better protect against this kind of problems,
but I don't have any ideas. Some things can be protected against with
checksums, but if you just lose a write, there's nothing to indicate
that. We don't have a per-page write counter, or a central repository
of per-page LSNs or checksums, and it seems very expensive to maintain
such things.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8 Andres Freund
andres@anarazel.de
In reply to: Dilip Kumar (#1)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Hi,

On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:

The attached patch allows the vacuum to continue by emitting WARNING
for the corrupted tuple instead of immediately error out as discussed
at [1].

Basically, it provides a new GUC called vacuum_tolerate_damage, to
control whether to continue the vacuum or to stop on the occurrence of
a corrupted tuple. So if the vacuum_tolerate_damage is set then in
all the cases in heap_prepare_freeze_tuple where the corrupted xid is
detected, it will emit a warning and return that nothing is changed in
the tuple and the 'tuple_totally_frozen' will also be set to false.
Since we are returning false the caller will not try to freeze such
tuple and the tuple_totally_frozen is also set to false so that the
page will not be marked to all frozen even if all other tuples in the
page are frozen.

I'm extremely doubtful this is a good idea. In all likelihood this will
just exacerbate corruption.

You cannot just stop freezing tuples, that'll lead to relfrozenxid
getting *further* out of sync with the actual table contents. And you
cannot just freeze such tuples, because that has a good chance of making
deleted tuples suddenly visible, leading to unique constraint violations
etc., which will then subsequently lead to clog lookup errors and such.

At the very least you'd need to signal up that relfrozenxid/relminmxid
cannot be increased. Without that I think it's entirely unacceptable to
do this.

If we really were to do something like this the option would need to be
called vacuum_allow_making_corruption_worse or such. It needs to be
*exceedingly* clear that it will likely lead to making everything much
worse.

@@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask = tuple->t_infomask;
frz->xmax = HeapTupleHeaderGetRawXmax(tuple);

I don't think it can be right to just update heap_prepare_freeze_tuple()
but not FreezeMultiXactId().

Greetings,

Andres Freund

#9 Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Alvaro Herrera (#7)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On 21 July 2020, at 00:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

FWIW we coped with this by actively monitoring this kind of corruption
with this amcheck patch [0]. One can observe these lost page updates
cheaply in indexes and act at the first sight of corruption: identify
the source of the buggy behaviour.

Right.

I wish we had some way to better protect against this kind of problems,
but I don't have any ideas. Some things can be protected against with
checksums, but if you just lose a write, there's nothing to indicate
that. We don't have a per-page write counter, or a central repository
of per-page LSNs or checksums, and it seems very expensive to maintain
such things.

If we had data checksums in another fork, we could flush them on checkpoint.
These checksums could protect against a lost page update.
And it would be much easier to maintain such checksums for SLRUs.

Best regards, Andrey Borodin.

#10 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#5)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Mon, Jul 20, 2020 at 10:14 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2020-Jul-20, Dilip Kumar wrote:

On Fri, Jul 17, 2020 at 4:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

If vacuum_tolerate_damage is set, then in all the
cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
it will emit a warning and return false, changing nothing in the
tuple, with 'tuple_totally_frozen' also set to false. Since we return
false, the caller will not try to freeze such a tuple, and because
tuple_totally_frozen is false the page will not be marked all-frozen
even if all the other tuples on the page are frozen.

Robert has mentioned at [1] that we probably should refuse to update
'relfrozenxid/relminmxid' when we encounter such a tuple, and emit a
WARNING instead of an error.

Isn't this already happening per your description above?

As per the above description, we avoid setting the page as all-frozen.
But the vacrelstats->scanned_pages count has already been increased
for this page, and right after lazy_scan_heap we will still update the
pg_class tuple with the new FreezeLimit and MultiXactCutoff.
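
So after a damage-tolerant run, whether the horizons moved can be
checked with something like this (a sketch; 'test_vac' is the table
from the patch's TAP test):

    -- If corruption was reported, relfrozenxid/relminmxid should be unchanged:
    SELECT relfrozenxid, relminmxid FROM pg_class WHERE relname = 'test_vac';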

I think we should do that in some cases,
but IMHO it is not a good idea in all of them. Basically, if the xmin
precedes the relfrozenxid then we could probably still allow the
relfrozenxid to be updated, whereas if the xmin precedes the cutoff
xid and is still uncommitted then we should probably stop the
relfrozenxid from being updated, so that the CLOG does not get truncated.

I'm not sure I understand 100% what you're talking about here (the first
half seems dangerous unless you misspoke), but in any case it seems a
pointless optimization. I mean, if the heap is corrupted, you can hope
to complete the vacuum (which will hopefully return which *other* tuples
are similarly corrupt) but trying to advance relfrozenxid is a lost
cause.

I agree with your point. I think we just need to avoid advancing the
relfrozenxid in all such cases.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#11 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andres Freund (#8)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Tue, Jul 21, 2020 at 2:00 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:

The attached patch allows the vacuum to continue by emitting a
WARNING for each corrupted tuple instead of immediately erroring out,
as discussed at [1].

Basically, it provides a new GUC called vacuum_tolerate_damage to
control whether the vacuum continues or stops when a corrupted tuple
is encountered. If vacuum_tolerate_damage is set, then in all the
cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
it will emit a warning and return false, changing nothing in the
tuple, with 'tuple_totally_frozen' also set to false. Since we return
false, the caller will not try to freeze such a tuple, and because
tuple_totally_frozen is false the page will not be marked all-frozen
even if all the other tuples on the page are frozen.

I'm extremely doubtful this is a good idea. In all likelihood this will
just exacerbate corruption.

You cannot just stop freezing tuples, that'll lead to relfrozenxid
getting *further* out of sync with the actual table contents. And you
cannot just freeze such tuples, because that has a good chance of making
deleted tuples suddenly visible, leading to unique constraint violations
etc., which will then subsequently lead to clog lookup errors and such.

I agree with the point. But if we keep giving the ERROR in such
cases, the situation is not any better either: we are not freezing
such tuples, and we cannot advance the relfrozenxid. So we follow
the same rule, that we don't freeze those tuples and also don't
advance the relfrozenxid; the only difference is that we allow the
vacuum to continue with the other tuples.

At the very least you'd need to signal up that relfrozenxid/relminmxid
cannot be increased. Without that I think it's entirely unacceptable to
do this.

I agree with that point. I was just unsure whether we should disallow
advancing the relfrozenxid in all such cases, or whether in the cases
where the xid already precedes the relfrozenxid we could allow it to
advance, since it cannot become any worse. But, as Alvaro pointed out,
there is no point in optimizing such cases. I will update the patch to
stop advancing the relfrozenxid if we find any corrupted xid during
tuple freeze.

If we really were to do something like this the option would need to be
called vacuum_allow_making_corruption_worse or such. It needs to be
*exceedingly* clear that it will likely lead to making everything much
worse.

Maybe we can describe this clearly in the documentation.

@@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask = tuple->t_infomask;
frz->xmax = HeapTupleHeaderGetRawXmax(tuple);

I don't think it can be right to just update heap_prepare_freeze_tuple()
but not FreezeMultiXactId().

oh, I missed this part. I will fix it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#12 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#11)
1 attachment(s)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Tue, Jul 21, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jul 21, 2020 at 2:00 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:

The attached patch allows the vacuum to continue by emitting a
WARNING for each corrupted tuple instead of immediately erroring out,
as discussed at [1].

Basically, it provides a new GUC called vacuum_tolerate_damage to
control whether the vacuum continues or stops when a corrupted tuple
is encountered. If vacuum_tolerate_damage is set, then in all the
cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
it will emit a warning and return false, changing nothing in the
tuple, with 'tuple_totally_frozen' also set to false. Since we return
false, the caller will not try to freeze such a tuple, and because
tuple_totally_frozen is false the page will not be marked all-frozen
even if all the other tuples on the page are frozen.

I'm extremely doubtful this is a good idea. In all likelihood this will
just exacerbate corruption.

You cannot just stop freezing tuples, that'll lead to relfrozenxid
getting *further* out of sync with the actual table contents. And you
cannot just freeze such tuples, because that has a good chance of making
deleted tuples suddenly visible, leading to unique constraint violations
etc., which will then subsequently lead to clog lookup errors and such.

I agree with the point. But if we keep giving the ERROR in such
cases, the situation is not any better either: we are not freezing
such tuples, and we cannot advance the relfrozenxid. So we follow
the same rule, that we don't freeze those tuples and also don't
advance the relfrozenxid; the only difference is that we allow the
vacuum to continue with the other tuples.

At the very least you'd need to signal up that relfrozenxid/relminmxid
cannot be increased. Without that I think it's entirely unacceptable to
do this.

I agree with that point. I was just unsure whether we should disallow
advancing the relfrozenxid in all such cases, or whether in the cases
where the xid already precedes the relfrozenxid we could allow it to
advance, since it cannot become any worse. But, as Alvaro pointed out,
there is no point in optimizing such cases. I will update the patch to
stop advancing the relfrozenxid if we find any corrupted xid during
tuple freeze.

If we really were to do something like this the option would need to be
called vacuum_allow_making_corruption_worse or such. It needs to be
*exceedingly* clear that it will likely lead to making everything much
worse.

Maybe we can describe this clearly in the documentation.

@@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask = tuple->t_infomask;
frz->xmax = HeapTupleHeaderGetRawXmax(tuple);

I don't think it can be right to just update heap_prepare_freeze_tuple()
but not FreezeMultiXactId().

oh, I missed this part. I will fix it.

Please find the updated patch. In this version, we don't allow the
relfrozenxid and relminmxid to advance if corruption is detected, and
the handling has also been added in FreezeMultiXactId.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patch (application/octet-stream)
From a5ab1de277e70fc9a0975f8064fe7db2e7aeb338 Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Thu, 16 Jul 2020 11:41:28 +0530
Subject: [PATCH v2] Provide a GUC to allow vacuum to continue on corrupted
 tuple

A new GUC to control whether vacuum errors out immediately on
detection of a corrupted tuple, or just emits a WARNING for each such
instance and completes the rest of the vacuuming.
---
 doc/src/sgml/config.sgml                      | 21 ++++
 src/backend/access/heap/heapam.c              | 95 ++++++++++++++++---
 src/backend/access/heap/vacuumlazy.c          | 24 ++++-
 src/backend/commands/vacuum.c                 | 15 +++
 src/backend/utils/misc/guc.c                  |  9 ++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/heapam_xlog.h              |  3 +-
 src/include/commands/vacuum.h                 |  2 +
 .../test_misc/t/002_vacuum_tolerate_damage.pl | 49 ++++++++++
 9 files changed, 202 insertions(+), 17 deletions(-)
 create mode 100644 src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b353c61683..3c1e647e27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8250,6 +8250,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+      <varlistentry id="guc-vacuum-tolerate-damage" xreflabel="vacuum_tolerate_damage">
+      <term><varname>vacuum_tolerate_damage</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>vacuum_tolerate_damage</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When set to off, which is the default, <command>VACUUM</command> will raise
+        an ERROR-level error if it detects a corrupted tuple.  This causes the
+        operation to stop immediately.
+       </para>
+
+       <para>
+        If set to on, <command>VACUUM</command> will instead emit a WARNING for each
+        such tuple and the operation will continue with the rest of the
+        vacuuming.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e09c8101e7..ab7e38182e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -5784,6 +5785,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 #define		FRM_RETURN_IS_XID		0x0004
 #define		FRM_RETURN_IS_MULTI		0x0008
 #define		FRM_MARK_COMMITTED		0x0010
+#define		FRM_CORRUPTED_XID		0x0020
 
 /*
  * FreezeMultiXactId
@@ -5805,6 +5807,12 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  * FRM_RETURN_IS_MULTI
  *		The return value is a new MultiXactId to set as new Xmax.
  *		(caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ * FRM_CORRUPTED_XID
+ *		This flag will be set if a corrupted multi-xact is detected.
+ *		Ideally, in such cases it will report an error but based on the
+ *		vacuum_damage_elevel, it might just complain about the corruption
+ *		and allow vacuum to continue.  So if the flag is set the caller will
+ *		not do anything to this tuple.
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
@@ -5836,10 +5844,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 	else if (MultiXactIdPrecedes(multi, relminmxid))
-		ereport(ERROR,
+	{
+		ereport(vacuum_damage_elevel(),
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found multixact %u from before relminmxid %u",
 								 multi, relminmxid)));
+		*flags |= FRM_CORRUPTED_XID;
+		return InvalidTransactionId;
+	}
 	else if (MultiXactIdPrecedes(multi, cutoff_multi))
 	{
 		/*
@@ -5850,10 +5862,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 */
 		if (MultiXactIdIsRunning(multi,
 								 HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)))
-			ereport(ERROR,
+		{
+			ereport(vacuum_damage_elevel(),
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("multixact %u from before cutoff %u found to be still running",
 									 multi, cutoff_multi)));
+			*flags |= FRM_CORRUPTED_XID;
+			return InvalidTransactionId;
+		}
 
 		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
 		{
@@ -5869,10 +5885,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			Assert(TransactionIdIsValid(xid));
 
 			if (TransactionIdPrecedes(xid, relfrozenxid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel(),
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("found update xid %u from before relfrozenxid %u",
 										 xid, relfrozenxid)));
+				*flags |= FRM_CORRUPTED_XID;
+				return InvalidTransactionId;
+			}
 
 			/*
 			 * If the xid is older than the cutoff, it has to have aborted,
@@ -5881,9 +5901,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			if (TransactionIdPrecedes(xid, cutoff_xid))
 			{
 				if (TransactionIdDidCommit(xid))
-					ereport(ERROR,
+				{
+					ereport(vacuum_damage_elevel(),
 							(errcode(ERRCODE_DATA_CORRUPTED),
 							 errmsg_internal("cannot freeze committed update xid %u", xid)));
+					*flags |= FRM_CORRUPTED_XID;
+					return InvalidTransactionId;
+				}
 				*flags |= FRM_INVALIDATE_XMAX;
 				xid = InvalidTransactionId; /* not strictly necessary */
 			}
@@ -5957,10 +5981,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 
 			Assert(TransactionIdIsValid(xid));
 			if (TransactionIdPrecedes(xid, relfrozenxid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel(),
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("found update xid %u from before relfrozenxid %u",
 										 xid, relfrozenxid)));
+				pfree(members);
+				*flags |= FRM_CORRUPTED_XID;
+				return InvalidTransactionId;
+			}
 
 			/*
 			 * It's an update; should we keep it?  If the transaction is known
@@ -6007,10 +6036,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			 */
 			if (TransactionIdIsValid(update_xid) &&
 				TransactionIdPrecedes(update_xid, cutoff_xid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel(),
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("found update xid %u from before xid cutoff %u",
 										 update_xid, cutoff_xid)));
+				pfree(members);
+				*flags |= FRM_CORRUPTED_XID;
+				return InvalidTransactionId;
+			}
 
 			/*
 			 * If we determined that it's an Xid corresponding to an update
@@ -6110,7 +6144,8 @@ bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  TransactionId relfrozenxid, TransactionId relminmxid,
 						  TransactionId cutoff_xid, TransactionId cutoff_multi,
-						  xl_heap_freeze_tuple *frz, bool *totally_frozen_p)
+						  xl_heap_freeze_tuple *frz, bool *totally_frozen_p,
+						  bool *corrupted_xid)
 {
 	bool		changed = false;
 	bool		xmax_already_frozen = false;
@@ -6123,6 +6158,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	frz->t_infomask = tuple->t_infomask;
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
+	*totally_frozen_p = false;
+	*corrupted_xid = false;
+
 	/*
 	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
 	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
@@ -6137,19 +6175,27 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	else
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
-			ereport(ERROR,
+		{
+			ereport(vacuum_damage_elevel(),
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
+			*corrupted_xid = true;
+			return false;
+		}
 
 		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
 		if (xmin_frozen)
 		{
 			if (!TransactionIdDidCommit(xid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel(),
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
 										 xid, cutoff_xid)));
+				*corrupted_xid = true;
+				return false;
+			}
 
 			frz->t_infomask |= HEAP_XMIN_FROZEN;
 			changed = true;
@@ -6175,6 +6221,15 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
 									cutoff_xid, cutoff_multi, &flags);
+		/*
+		 * Return as nothing changed if we have detected a corrupted
+		 * multi-xact.
+		 */
+		if (flags & FRM_CORRUPTED_XID)
+		{
+			*corrupted_xid = true;
+			return false;
+		}
 
 		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
 
@@ -6218,10 +6273,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	else if (TransactionIdIsNormal(xid))
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
-			ereport(ERROR,
+		{
+			ereport(vacuum_damage_elevel(),
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmax %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
+			*corrupted_xid = true;
+			return false;
+		}
 
 		if (TransactionIdPrecedes(xid, cutoff_xid))
 		{
@@ -6233,10 +6292,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
 				TransactionIdDidCommit(xid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel(),
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("cannot freeze committed xmax %u",
 										 xid)));
+				*corrupted_xid = true;
+				return false;
+			}
 			freeze_xmax = true;
 		}
 		else
@@ -6249,10 +6312,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		xmax_already_frozen = true;
 	}
 	else
-		ereport(ERROR,
+	{
+		ereport(vacuum_damage_elevel(),
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
 								 xid, tuple->t_infomask)));
+		*corrupted_xid = true;
+		return false;
+	}
 
 	if (freeze_xmax)
 	{
@@ -6364,11 +6431,13 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	xl_heap_freeze_tuple frz;
 	bool		do_freeze;
 	bool		tuple_totally_frozen;
+	bool		corrupted_xid;
 
 	do_freeze = heap_prepare_freeze_tuple(tuple,
 										  relfrozenxid, relminmxid,
 										  cutoff_xid, cutoff_multi,
-										  &frz, &tuple_totally_frozen);
+										  &frz, &tuple_totally_frozen,
+										  &corrupted_xid);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1bbc4598f7..df5c765928 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -312,6 +312,7 @@ typedef struct LVRelStats
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
+	bool		corrupted_tuple_detected;
 
 	/* Used for error callback */
 	char	   *indname;
@@ -492,6 +493,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	vacrelstats->num_index_scans = 0;
 	vacrelstats->pages_removed = 0;
 	vacrelstats->lock_waiter_detected = false;
+	vacrelstats->corrupted_tuple_detected = false;
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
@@ -591,8 +593,20 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	if (new_rel_allvisible > new_rel_pages)
 		new_rel_allvisible = new_rel_pages;
 
-	new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
-	new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
+	/*
+	 * Don't advance the relfrozenxid and relminmxid, if we have detected a
+	 * corrupted xid while trying to freeze some tuple.
+	 */
+	if (scanned_all_unfrozen && !vacrelstats->corrupted_tuple_detected)
+	{
+		new_frozen_xid = FreezeLimit;
+		new_min_multi = MultiXactCutoff;
+	}
+	else
+	{
+		new_frozen_xid = InvalidTransactionId;
+		new_min_multi = InvalidMultiXactId;
+	}
 
 	vac_update_relstats(onerel,
 						new_rel_pages,
@@ -1444,6 +1458,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			else
 			{
 				bool		tuple_totally_frozen;
+				bool		corrupted_xid;
 
 				num_tuples += 1;
 				hastup = true;
@@ -1456,8 +1471,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 											  relfrozenxid, relminmxid,
 											  FreezeLimit, MultiXactCutoff,
 											  &frozen[nfrozen],
-											  &tuple_totally_frozen))
+											  &tuple_totally_frozen,
+											  &corrupted_xid))
 					frozen[nfrozen++].offset = offnum;
+				else if (corrupted_xid)
+					vacrelstats->corrupted_tuple_detected = true;
 
 				if (!tuple_totally_frozen)
 					all_frozen = false;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 576c7e63e9..5a093c231c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -63,6 +63,8 @@ int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
 
+/* Whether to continue with the vacuuming after detecting a corrupted tuple */
+bool vacuum_tolerate_damage = false;
 
 /* A few variables that don't seem worth passing around as parameters */
 static MemoryContext vac_context = NULL;
@@ -2102,3 +2104,16 @@ get_vacopt_ternary_value(DefElem *def)
 {
 	return defGetBoolean(def) ? VACOPT_TERNARY_ENABLED : VACOPT_TERNARY_DISABLED;
 }
+
+/*
+ * vacuum_damage_elevel
+ *
+ * Return the error level for reporting a corrupted tuple encountered while
+ * trying to freeze tuples during vacuum.  Based on the GUC value, this is
+ * either ERROR or WARNING.
+ */
+int
+vacuum_damage_elevel(void)
+{
+	return vacuum_tolerate_damage ? WARNING : ERROR;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 031ca0327f..683812dd4a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2041,6 +2041,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_tolerate_damage", PGC_USERSET, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Whether to continue running vacuum after detecting corrupted tuple."),
+		},
+		&vacuum_tolerate_damage,
+		false,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e430e33c7b..88654f4983 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -666,6 +666,7 @@
 #vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
 						# before index cleanup, 0 always performs
 						# index cleanup
+#vacuum_tolerate_damage = false
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index aa17f7df84..61def10e2a 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -412,7 +412,8 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  TransactionId cutoff_xid,
 									  TransactionId cutoff_multi,
 									  xl_heap_freeze_tuple *frz,
-									  bool *totally_frozen);
+									  bool *totally_frozen,
+									  bool *corrupted_xid);
 extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
 									  xl_heap_freeze_tuple *xlrec_tp);
 extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index a4cd721400..5e711832bc 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -237,6 +237,7 @@ extern int	vacuum_freeze_min_age;
 extern int	vacuum_freeze_table_age;
 extern int	vacuum_multixact_freeze_min_age;
 extern int	vacuum_multixact_freeze_table_age;
+extern bool vacuum_tolerate_damage;
 
 /* Variables for cost-based parallel vacuum */
 extern pg_atomic_uint32 *VacuumSharedCostBalance;
@@ -278,6 +279,7 @@ extern bool vacuum_is_relation_owner(Oid relid, Form_pg_class reltuple,
 									 int options);
 extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
 									 int options, bool verbose, LOCKMODE lmode);
+extern int vacuum_damage_elevel(void);
 
 /* in commands/analyze.c */
 extern void analyze_rel(Oid relid, RangeVar *relation,
diff --git a/src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl b/src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl
new file mode 100644
index 0000000000..acf1266ab1
--- /dev/null
+++ b/src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl
@@ -0,0 +1,49 @@
+# Verify the vacuum_tolerate_damage GUC
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Initialize a test cluster
+my $node = get_new_node('primary');
+$node->init();
+$node->start;
+
+# Run a SQL command and return psql's stderr
+sub run_sql_command
+{
+	my $sql = shift;
+	my $stderr;
+
+	$node->psql(
+		'postgres',
+		$sql,
+		stderr        => \$stderr);
+	return $stderr;
+}
+
+my $output;
+
+# Insert 2 tuples into the table and update the relfrozenxid for the table
+# to a future xid.
+run_sql_command(
+	"create table test_vac (a int) WITH (autovacuum_enabled = off);
+	 insert into test_vac values (1), (2);
+	 update pg_class set relfrozenxid=txid_current()::text::xid where relname='test_vac';");
+
+$output = run_sql_command('vacuum(freeze, disable_page_skipping) test_vac;');
+ok( $output =~ m/ERROR:.*found xmin.*before relfrozenxid/);
+
+# set the vacuum_tolerate_damage and run again
+$output = run_sql_command('set vacuum_tolerate_damage=true;
+						   vacuum(freeze, disable_page_skipping) test_vac;');
+
+
+# this time we should get a WARNING for both tuples
+ok( scalar( @{[ $output=~/WARNING:.*found xmin.*before relfrozenxid/gi ]}) == 2);
+
+run_sql_command('DROP TABLE test_vac;');
+
+$node->stop('fast');
-- 
2.23.0

#13 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#12)
1 attachment(s)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Tue, Jul 21, 2020 at 4:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jul 21, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jul 21, 2020 at 2:00 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:

The attached patch allows the vacuum to continue by emitting a
WARNING for each corrupted tuple instead of immediately erroring out,
as discussed at [1].

Basically, it provides a new GUC called vacuum_tolerate_damage to
control whether the vacuum continues or stops when a corrupted tuple
is encountered. If vacuum_tolerate_damage is set, then in all the
cases in heap_prepare_freeze_tuple where a corrupted xid is detected,
it will emit a warning and return false, changing nothing in the
tuple, with 'tuple_totally_frozen' also set to false. Since we return
false, the caller will not try to freeze such a tuple, and because
tuple_totally_frozen is false the page will not be marked all-frozen
even if all the other tuples on the page are frozen.

I'm extremely doubtful this is a good idea. In all likelihood this will
just exacerbate corruption.

You cannot just stop freezing tuples, that'll lead to relfrozenxid
getting *further* out of sync with the actual table contents. And you
cannot just freeze such tuples, because that has a good chance of making
deleted tuples suddenly visible, leading to unique constraint violations
etc., which will then subsequently lead to clog lookup errors and such.

I agree with the point. But if we keep giving the ERROR in such
cases, the situation is not any better either: we are not freezing
such tuples, and we cannot advance the relfrozenxid. So we follow
the same rule, that we don't freeze those tuples and also don't
advance the relfrozenxid; the only difference is that we allow the
vacuum to continue with the other tuples.

At the very least you'd need to signal up that relfrozenxid/relminmxid
cannot be increased. Without that I think it's entirely unacceptable to
do this.

I agree with that point. I was just unsure whether we should disallow
advancing the relfrozenxid in all such cases, or whether in the cases
where the xid already precedes the relfrozenxid we could allow it to
advance, since it cannot become any worse. But, as Alvaro pointed out,
there is no point in optimizing such cases. I will update the patch to
stop advancing the relfrozenxid if we find any corrupted xid during
tuple freeze.

If we really were to do something like this the option would need to be
called vacuum_allow_making_corruption_worse or such. It needs to be
*exceedingly* clear that it will likely lead to making everything much
worse.

Maybe we can describe this clearly in the documentation.

@@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask = tuple->t_infomask;
frz->xmax = HeapTupleHeaderGetRawXmax(tuple);

I don't think it can be right to just update heap_prepare_freeze_tuple()
but not FreezeMultiXactId().

Oh, I missed this part. I will fix it.

Please find the updated patch. In this version, we don't allow the
relfrozenxid and relminmxid to advance if corruption is detected, and
we have also added the handling in FreezeMultiXactId.

In the previous version, the feature was enabled for the cluster/vacuum
full commands as well. In the attached patch I have enabled it only
if we are running a vacuum command; it will not be enabled during a
table rewrite. If we think that it should be enabled for 'vacuum
full' then we might need to pass a flag from cluster_rel all the
way down to heap_freeze_tuple.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patchapplication/octet-stream; name=v3-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patchDownload
From 165549f7adf60796f3e397b3e5b66c2cb315873a Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Thu, 16 Jul 2020 11:41:28 +0530
Subject: [PATCH v3] Provide a GUC to allow vacuum to continue on corrupted
 tuple

A new GUC to control whether the vacuum will error out immediately
on detection of a corrupted tuple or it will just emit a WARNING for
each such instance and complete the rest of the vacuuming.
---
 doc/src/sgml/config.sgml                      |  21 ++++
 src/backend/access/heap/heapam.c              | 100 +++++++++++++++---
 src/backend/access/heap/vacuumlazy.c          |  24 ++++-
 src/backend/commands/vacuum.c                 |  18 ++++
 src/backend/utils/misc/guc.c                  |   9 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/heapam_xlog.h              |   4 +-
 src/include/commands/vacuum.h                 |   2 +
 .../test_misc/t/002_vacuum_tolerate_damage.pl |  49 +++++++++
 9 files changed, 209 insertions(+), 19 deletions(-)
 create mode 100644 src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b353c61683..3c1e647e27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8250,6 +8250,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+      <varlistentry id="guc-vacuum-tolerate-damage" xreflabel="vacuum_tolerate_damage">
+      <term><varname>vacuum_tolerate_damage</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>vacuum_tolerate_damage</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When set to off, which is the default, <command>VACUUM</command> will raise
+        an ERROR if a corrupted tuple is detected.  This causes the operation to
+        stop immediately.
+       </para>
+
+       <para>
+        If set to on, <command>VACUUM</command> will instead emit a WARNING for each
+        such tuple, but the operation will continue vacuuming the rest of
+        the table.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e09c8101e7..4248fc684e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -5784,6 +5785,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 #define		FRM_RETURN_IS_XID		0x0004
 #define		FRM_RETURN_IS_MULTI		0x0008
 #define		FRM_MARK_COMMITTED		0x0010
+#define		FRM_CORRUPTED_XID		0x0020
 
 /*
  * FreezeMultiXactId
@@ -5805,12 +5807,18 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  * FRM_RETURN_IS_MULTI
  *		The return value is a new MultiXactId to set as new Xmax.
  *		(caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ * FRM_CORRUPTED_XID
+ *		This flag will be set if a corrupted multixact is detected.
+ *		Normally such cases report an error, but depending on the
+ *		vacuum_damage_elevel it may only warn about the corruption
+ *		and allow the vacuum to continue.  If the flag is set, the
+ *		caller must not do anything to this tuple.
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				  TransactionId relfrozenxid, TransactionId relminmxid,
 				  TransactionId cutoff_xid, MultiXactId cutoff_multi,
-				  uint16 *flags)
+				  uint16 *flags, bool is_vacuum)
 {
 	TransactionId xid = InvalidTransactionId;
 	int			i;
@@ -5836,10 +5844,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 	else if (MultiXactIdPrecedes(multi, relminmxid))
-		ereport(ERROR,
+	{
+		ereport(vacuum_damage_elevel(is_vacuum),
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found multixact %u from before relminmxid %u",
 								 multi, relminmxid)));
+		*flags |= FRM_CORRUPTED_XID;
+		return InvalidTransactionId;
+	}
 	else if (MultiXactIdPrecedes(multi, cutoff_multi))
 	{
 		/*
@@ -5850,10 +5862,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 */
 		if (MultiXactIdIsRunning(multi,
 								 HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)))
-			ereport(ERROR,
+		{
+			ereport(vacuum_damage_elevel(is_vacuum),
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("multixact %u from before cutoff %u found to be still running",
 									 multi, cutoff_multi)));
+			*flags |= FRM_CORRUPTED_XID;
+			return InvalidTransactionId;
+		}
 
 		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
 		{
@@ -5869,10 +5885,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			Assert(TransactionIdIsValid(xid));
 
 			if (TransactionIdPrecedes(xid, relfrozenxid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel(is_vacuum),
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("found update xid %u from before relfrozenxid %u",
 										 xid, relfrozenxid)));
+				*flags |= FRM_CORRUPTED_XID;
+				return InvalidTransactionId;
+			}
 
 			/*
 			 * If the xid is older than the cutoff, it has to have aborted,
@@ -5881,9 +5901,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			if (TransactionIdPrecedes(xid, cutoff_xid))
 			{
 				if (TransactionIdDidCommit(xid))
-					ereport(ERROR,
+				{
+					ereport(vacuum_damage_elevel(is_vacuum),
 							(errcode(ERRCODE_DATA_CORRUPTED),
 							 errmsg_internal("cannot freeze committed update xid %u", xid)));
+					*flags |= FRM_CORRUPTED_XID;
+					return InvalidTransactionId;
+				}
 				*flags |= FRM_INVALIDATE_XMAX;
 				xid = InvalidTransactionId; /* not strictly necessary */
 			}
@@ -5957,10 +5981,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 
 			Assert(TransactionIdIsValid(xid));
 			if (TransactionIdPrecedes(xid, relfrozenxid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel(is_vacuum),
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("found update xid %u from before relfrozenxid %u",
 										 xid, relfrozenxid)));
+				pfree(members);
+				*flags |= FRM_CORRUPTED_XID;
+				return InvalidTransactionId;
+			}
 
 			/*
 			 * It's an update; should we keep it?  If the transaction is known
@@ -6007,10 +6036,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			 */
 			if (TransactionIdIsValid(update_xid) &&
 				TransactionIdPrecedes(update_xid, cutoff_xid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel(is_vacuum),
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("found update xid %u from before xid cutoff %u",
 										 update_xid, cutoff_xid)));
+				pfree(members);
+				*flags |= FRM_CORRUPTED_XID;
+				return InvalidTransactionId;
+			}
 
 			/*
 			 * If we determined that it's an Xid corresponding to an update
@@ -6110,7 +6144,8 @@ bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  TransactionId relfrozenxid, TransactionId relminmxid,
 						  TransactionId cutoff_xid, TransactionId cutoff_multi,
-						  xl_heap_freeze_tuple *frz, bool *totally_frozen_p)
+						  xl_heap_freeze_tuple *frz, bool *totally_frozen_p,
+						  bool *is_corrupted_xid, bool is_vacuum)
 {
 	bool		changed = false;
 	bool		xmax_already_frozen = false;
@@ -6123,6 +6158,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	frz->t_infomask = tuple->t_infomask;
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
+	*totally_frozen_p = false;
+	*is_corrupted_xid = false;
+
 	/*
 	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
 	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
@@ -6137,19 +6175,27 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	else
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
-			ereport(ERROR,
+		{
+			ereport(vacuum_damage_elevel(is_vacuum),
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
+			*is_corrupted_xid = true;
+			return false;
+		}
 
 		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
 		if (xmin_frozen)
 		{
 			if (!TransactionIdDidCommit(xid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel(is_vacuum),
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
 										 xid, cutoff_xid)));
+				*is_corrupted_xid = true;
+				return false;
+			}
 
 			frz->t_infomask |= HEAP_XMIN_FROZEN;
 			changed = true;
@@ -6174,7 +6220,17 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
-									cutoff_xid, cutoff_multi, &flags);
+									cutoff_xid, cutoff_multi,
+									&flags, is_vacuum);
+		/*
+		 * Return without changing anything if we have detected a corrupted
+		 * multixact.
+		 */
+		if (flags & FRM_CORRUPTED_XID)
+		{
+			*is_corrupted_xid = true;
+			return false;
+		}
 
 		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
 
@@ -6218,10 +6274,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	else if (TransactionIdIsNormal(xid))
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
-			ereport(ERROR,
+		{
+			ereport(vacuum_damage_elevel(is_vacuum),
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmax %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
+			*is_corrupted_xid = true;
+			return false;
+		}
 
 		if (TransactionIdPrecedes(xid, cutoff_xid))
 		{
@@ -6233,10 +6293,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
 				TransactionIdDidCommit(xid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel(is_vacuum),
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("cannot freeze committed xmax %u",
 										 xid)));
+				*is_corrupted_xid = true;
+				return false;
+			}
 			freeze_xmax = true;
 		}
 		else
@@ -6249,10 +6313,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		xmax_already_frozen = true;
 	}
 	else
-		ereport(ERROR,
+	{
+		ereport(vacuum_damage_elevel(is_vacuum),
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
 								 xid, tuple->t_infomask)));
+		*is_corrupted_xid = true;
+		return false;
+	}
 
 	if (freeze_xmax)
 	{
@@ -6364,11 +6432,13 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	xl_heap_freeze_tuple frz;
 	bool		do_freeze;
 	bool		tuple_totally_frozen;
+	bool		corrupted_xid;
 
 	do_freeze = heap_prepare_freeze_tuple(tuple,
 										  relfrozenxid, relminmxid,
 										  cutoff_xid, cutoff_multi,
-										  &frz, &tuple_totally_frozen);
+										  &frz, &tuple_totally_frozen,
+										  &corrupted_xid, false);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1bbc4598f7..4e0d7cab94 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -312,6 +312,7 @@ typedef struct LVRelStats
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
+	bool		corrupted_tuple_detected;
 
 	/* Used for error callback */
 	char	   *indname;
@@ -492,6 +493,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	vacrelstats->num_index_scans = 0;
 	vacrelstats->pages_removed = 0;
 	vacrelstats->lock_waiter_detected = false;
+	vacrelstats->corrupted_tuple_detected = false;
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
@@ -591,8 +593,20 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	if (new_rel_allvisible > new_rel_pages)
 		new_rel_allvisible = new_rel_pages;
 
-	new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
-	new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
+	/*
+	 * Don't advance the relfrozenxid and relminmxid, if we have detected a
+	 * corrupted xid while trying to freeze some tuple.
+	 */
+	if (scanned_all_unfrozen && !vacrelstats->corrupted_tuple_detected)
+	{
+		new_frozen_xid = FreezeLimit;
+		new_min_multi = MultiXactCutoff;
+	}
+	else
+	{
+		new_frozen_xid = InvalidTransactionId;
+		new_min_multi = InvalidMultiXactId;
+	}
 
 	vac_update_relstats(onerel,
 						new_rel_pages,
@@ -1444,6 +1458,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			else
 			{
 				bool		tuple_totally_frozen;
+				bool		corrupted_xid;
 
 				num_tuples += 1;
 				hastup = true;
@@ -1456,8 +1471,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 											  relfrozenxid, relminmxid,
 											  FreezeLimit, MultiXactCutoff,
 											  &frozen[nfrozen],
-											  &tuple_totally_frozen))
+											  &tuple_totally_frozen,
+											  &corrupted_xid, true))
 					frozen[nfrozen++].offset = offnum;
+				else if (corrupted_xid)
+					vacrelstats->corrupted_tuple_detected = true;
 
 				if (!tuple_totally_frozen)
 					all_frozen = false;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 576c7e63e9..53218d6799 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -63,6 +63,8 @@ int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
 
+/* Whether to continue with the vacuuming after detecting a corrupted tuple */
+bool vacuum_tolerate_damage = false;
 
 /* A few variables that don't seem worth passing around as parameters */
 static MemoryContext vac_context = NULL;
@@ -2102,3 +2104,19 @@ get_vacopt_ternary_value(DefElem *def)
 {
 	return defGetBoolean(def) ? VACOPT_TERNARY_ENABLED : VACOPT_TERNARY_DISABLED;
 }
+
+/*
+ * vacuum_damage_elevel
+ *
+ * Return the error level for reporting the corrupted tuple while trying to
+ * freeze the tuple during vacuum.  Based on the GUC value, it will return
+ * whether to report with ERROR or to report with WARNING.
+ */
+int
+vacuum_damage_elevel(bool is_vacuum)
+{
+	if (is_vacuum && vacuum_tolerate_damage)
+		return WARNING;
+	else
+		return ERROR;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 031ca0327f..683812dd4a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2041,6 +2041,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_tolerate_damage", PGC_USERSET, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Whether to continue running vacuum after detecting corrupted tuple."),
+		},
+		&vacuum_tolerate_damage,
+		false,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e430e33c7b..88654f4983 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -666,6 +666,7 @@
 #vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
 						# before index cleanup, 0 always performs
 						# index cleanup
+#vacuum_tolerate_damage = false
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index aa17f7df84..99cd3de276 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -412,7 +412,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  TransactionId cutoff_xid,
 									  TransactionId cutoff_multi,
 									  xl_heap_freeze_tuple *frz,
-									  bool *totally_frozen);
+									  bool *totally_frozen,
+									  bool *corrupted_xid,
+									  bool is_vacuum);
 extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
 									  xl_heap_freeze_tuple *xlrec_tp);
 extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index a4cd721400..17624acdbb 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -237,6 +237,7 @@ extern int	vacuum_freeze_min_age;
 extern int	vacuum_freeze_table_age;
 extern int	vacuum_multixact_freeze_min_age;
 extern int	vacuum_multixact_freeze_table_age;
+extern bool vacuum_tolerate_damage;
 
 /* Variables for cost-based parallel vacuum */
 extern pg_atomic_uint32 *VacuumSharedCostBalance;
@@ -278,6 +279,7 @@ extern bool vacuum_is_relation_owner(Oid relid, Form_pg_class reltuple,
 									 int options);
 extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
 									 int options, bool verbose, LOCKMODE lmode);
+extern int vacuum_damage_elevel(bool is_vacuum);
 
 /* in commands/analyze.c */
 extern void analyze_rel(Oid relid, RangeVar *relation,
diff --git a/src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl b/src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl
new file mode 100644
index 0000000000..acf1266ab1
--- /dev/null
+++ b/src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl
@@ -0,0 +1,49 @@
+# Verify the vacuum_tolerate_damage GUC
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Initialize a test cluster
+my $node = get_new_node('primary');
+$node->init();
+$node->start;
+
+# Run a SQL command and return psql's stderr
+sub run_sql_command
+{
+	my $sql = shift;
+	my $stderr;
+
+	$node->psql(
+		'postgres',
+		$sql,
+		stderr        => \$stderr);
+	return $stderr;
+}
+
+my $output;
+
+# Insert 2 tuples into the table and update the relfrozenxid for the table to
+# a future xid.
+run_sql_command(
+	"create table test_vac (a int) WITH (autovacuum_enabled = off);
+	 insert into test_vac values (1), (2);
+	 update pg_class set relfrozenxid=txid_current()::text::xid where relname='test_vac';");
+
+$output = run_sql_command('vacuum(freeze, disable_page_skipping) test_vac;');
+ok( $output =~ m/ERROR:.*found xmin.*before relfrozenxid/);
+
+# set the vacuum_tolerate_damage and run again
+$output = run_sql_command('set vacuum_tolerate_damage=true;
+						   vacuum(freeze, disable_page_skipping) test_vac;');
+
+
+# This time we should get a WARNING for both tuples
+ok( scalar( @{[ $output=~/WARNING:.*found xmin.*before relfrozenxid/gi ]}) == 2);
+
+run_sql_command('DROP TABLE test_vac;');
+
+$node->stop('fast');
-- 
2.23.0

#14Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#8)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Mon, Jul 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote:

I'm extremely doubtful this is a good idea. In all likelihood this will
just exacerbate corruption.

You cannot just stop freezing tuples, that'll lead to relfrozenxid
getting *further* out of sync with the actual table contents. And you
cannot just freeze such tuples, because that has a good chance of making
deleted tuples suddenly visible, leading to unique constraint violations
etc. Which will then subsequently lead to clog lookup errors and such.

I think that the behavior ought to be:

- If we encounter any damaged tuples (e.g. tuple xid < relfrozenxid),
we give up on advancing relfrozenxid and relminmxid. This vacuum won't
change them at all.

- We do nothing to the damaged tuples themselves.

- We can still prune pages, and we can still freeze tuples that do not
appear to be damaged.

This amounts to an assumption that relfrozenxid is probably sane, and
that there are individual tuples that are messed up. It's probably not
the right thing if relfrozenxid got overwritten with a nonsense value
without changing the table contents. But, I think it's difficult to
cater to all contingencies. In my experience, the normal problem here
is that there are a few tuples or pages in the table that somehow
escaped vacuuming for long enough that they contain references to XIDs
from before the last time relfrozenxid was advanced - so continuing to
do what we can to the rest of the table is the right thing to do.
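
In rough pseudo-code, building on the corrupted_tuple_detected flag
that the v3 patch already tracks, this amounts to (a sketch only, not
a tested patch):

    /* in the freeze loop: remember the damage instead of erroring out */
    if (found_corruption)
        vacrelstats->corrupted_tuple_detected = true;

    /* at the end of the scan: refuse to advance the horizons */
    if (scanned_all_unfrozen && !vacrelstats->corrupted_tuple_detected)
    {
        new_frozen_xid = FreezeLimit;
        new_min_multi = MultiXactCutoff;
    }
    else
    {
        new_frozen_xid = InvalidTransactionId;  /* keep old relfrozenxid */
        new_min_multi = InvalidMultiXactId;     /* keep old relminmxid */
    }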

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#13)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Tue, Jul 21, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In the previous version, the feature was enabled for the cluster/vacuum
full commands as well. In the attached patch I have enabled it only
if we are running a vacuum command; it will not be enabled during a
table rewrite. If we think that it should be enabled for 'vacuum
full' then we might need to pass a flag from cluster_rel all the
way down to heap_freeze_tuple.

I think this is a somewhat clunky way of accomplishing this. The
caller passes down a flag to heap_prepare_freeze_tuple() which decides
whether or not an error is forced, and then that function and
FreezeMultiXactId use vacuum_damage_elevel() to combine the results of
that flag with the value of the vacuum_tolerate_damage GUC. But that
means that a decision that could be made in one place is instead made
in many places. If we just had heap_freeze_tuple() and
FreezeMultiXactId() take an argument int vacuum_damage_elevel, then
heap_freeze_tuple() could pass ERROR and lazy_scan_heap() could
arrange to pass WARNING or ERROR based on the value of
vacuum_tolerate_damage. I think that would likely end up cleaner. What
do you think?
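
Concretely, the kind of signature I'm imagining is roughly this (a
sketch only, untested):

    static TransactionId
    FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
                      TransactionId relfrozenxid, TransactionId relminmxid,
                      TransactionId cutoff_xid, MultiXactId cutoff_multi,
                      uint16 *flags, int vacuum_damage_elevel);

    /* heap_freeze_tuple() can then always pass ERROR ... */
    heap_prepare_freeze_tuple(..., ERROR);

    /* ... while lazy_scan_heap() decides based on the GUC */
    heap_prepare_freeze_tuple(...,
                              vacuum_tolerate_damage ? WARNING : ERROR);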

I also suggest renaming is_corrupted_xid to found_corruption. With the
current name, it's not very clear which XID we're saying is corrupted;
in fact, the problem might be a MultiXactId rather than an XID, and
the real issue might be with the table's pg_class entry or something.

The new arguments to heap_prepare_freeze_tuple() need to be documented
in its header comment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#8)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Mon, Jul 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote:

If we really were to do something like this the option would need to be
called vacuum_allow_making_corruption_worse or such. It needs to be
*exceedingly* clear that it will likely lead to making everything much
worse.

I don't really understand this objection. How does letting VACUUM
continue after problems have been detected make anything worse? I
agree that if it does, it shouldn't touch relfrozenxid or relminmxid,
but the patch has been adjusted to work that way. Assuming you don't
touch relfrozenxid or relminmxid, what harm comes from continuing to
freeze undamaged tuples and remove dead tuples after
finding a bad tuple? You may have already done an arbitrary amount of
that before encountering the damage, and doing it afterward is no
different. Doing the index vacuuming step is different, but I don't
see how that would exacerbate corruption either.

The point is that when you make VACUUM fail, you not only don't
advance relfrozenxid/relminmxid, but also don't remove dead tuples. In
the long run, either thing will kill you, but it is not difficult to
have a situation where failing to remove dead tuples kills you a lot
faster. The table can just bloat until performance tanks, and then the
application goes down, even if you still had 100+ million XIDs before
you needed a wraparound vacuum.

Honestly, I wonder why continuing (but without advancing relfrozenxid
or relminmxid) shouldn't be the default behavior. I mean, if it
actually corrupts your data, then it clearly shouldn't be, and
probably shouldn't even be an optional behavior, but I still don't see
why it would do that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#16)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Hi,

On 2020-08-28 12:37:17 -0400, Robert Haas wrote:

On Mon, Jul 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote:

If we really were to do something like this the option would need to be
called vacuum_allow_making_corruption_worse or such. Its need to be
*exceedingly* clear that it will likely lead to making everything much
worse.

I don't really understand this objection. How does letting VACUUM
continue after problems have been detected make anything worse?

It can break HOT chains, plain ctid chains, etc., for example; if
earlier / follower tuples are removed, that breakage can't be detected
anymore at a later time.

The point is that when you make VACUUM fail, you not only don't
advance relfrozenxid/relminmxid, but also don't remove dead tuples. In
the long run, either thing will kill you, but it is not difficult to
have a situation where failing to remove dead tuples kills you a lot
faster. The table can just bloat until performance tanks, and then the
application goes down, even if you still had 100+ million XIDs before
you needed a wraparound vacuum.

Honestly, I wonder why continuing (but without advancing relfrozenxid
or relminmxid) shouldn't be the default behavior. I mean, if it
actually corrupts your data, then it clearly shouldn't be, and
probably shouldn't even be an optional behavior, but I still don't see
why it would do that.

I think it's an EXTREMELY bad idea to enable anything like this by
default. It'll make bugs entirely undiagnosable, because we'll remove a
lot of the evidence of what the problem is. And we've had many long
standing bugs in this area, several only found because we actually
started to bleat about them. And quite evidently, we have more bugs to
fix in the area.

Greetings,

Andres Freund

#18Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#17)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Fri, Aug 28, 2020 at 1:29 PM Andres Freund <andres@anarazel.de> wrote:

It can break HOT chains, plain ctid chains, etc., for example; if
earlier / follower tuples are removed, that breakage can't be detected
anymore at a later time.

I think I need a more specific example here to understand the problem.
If the xmax of one tuple matches the xmin of the next, then either
both values precede relfrozenxid or both follow it. In the former
case, neither tuple should be frozen and the chain should not get
broken; in the latter case, everything's normal anyway. If the xmax
and xmin don't match, then the chain was already broken. Perhaps we
are removing important evidence, though it seems like that might've
happened anyway prior to reaching the damaged page, but we're not
making whatever corruption may exist any worse. At least, not as far
as I can see.

And we've had many long
standing bugs in this area, several only found because we actually
started to bleat about them. And quite evidently, we have more bugs to
fix in the area.

I agree with all of this, but I do not think that it establishes that
we should abandon the entire VACUUM. "Bleating" about something
usually means logging it, and I think you understand that I am not now
nor have I ever complained about the logging we are doing here. I also
think you understand why I don't like the current behavior, and that
EDB has actual customers who have actually been damaged by it. All the
same, I don't expect to win an argument about changing the default,
but I hope to win one about at least providing an option. And if we're
not even going to do that much, then I hope to come out of this
discussion with a clear understanding of exactly why that's a bad
idea. I don't think "we need the data for forensics" is a sufficient
justification for "if you end up with one corrupted XID in a
billion-row table, your entire table will bloat out the wazoo, and
there is no option to get any other behavior."

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#15)
1 attachment(s)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Fri, Aug 28, 2020 at 9:49 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 21, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In the previous version, the feature was enabled for the cluster/vacuum
full commands as well. In the attached patch I have enabled it only
if we are running a vacuum command; it will not be enabled during a
table rewrite. If we think that it should be enabled for 'vacuum
full' then we might need to pass a flag from cluster_rel all the
way down to heap_freeze_tuple.

I think this is a somewhat clunky way of accomplishing this. The
caller passes down a flag to heap_prepare_freeze_tuple() which decides
whether or not an error is forced, and then that function and
FreezeMultiXactId use vacuum_damage_elevel() to combine the results of
that flag with the value of the vacuum_tolerate_damage GUC. But that
means that a decision that could be made in one place is instead made
in many places. If we just had heap_freeze_tuple() and
FreezeMultiXactId() take an argument int vacuum_damage_elevel, then
heap_freeze_tuple() could pass ERROR and lazy_scan_heap() could
arrange to pass WARNING or ERROR based on the value of
vacuum_tolerate_damage. I think that would likely end up cleaner. What
do you think?

I agree, this way it is much cleaner. I have made the change in the attached patch.

I also suggest renaming is_corrupted_xid to found_corruption. With the
current name, it's not very clear which XID we're saying is corrupted;
in fact, the problem might be a MultiXactId rather than an XID, and
the real issue might be with the table's pg_class entry or something.

Okay, changed to found_corruption.

The new arguments to heap_prepare_freeze_tuple() need to be documented
in its header comment.

Done.

I have also made a few more cosmetic changes to the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patchapplication/octet-stream; name=v4-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patchDownload
From e060390eac9c277b1023038a7d4c6310ba016e5c Mon Sep 17 00:00:00 2001
From: dilipkumar <dilipbalaut@gmail.com>
Date: Thu, 16 Jul 2020 11:41:28 +0530
Subject: [PATCH v4] Provide a GUC to allow vacuum to continue on corrupted
 tuple

A new GUC to control whether the vacuum will error out immediately
on detection of a corrupted tuple or it will just emit a WARNING for
each such instance and complete the rest of the vacuuming.  However,
nothing will be done to the corrupted tuple itself and we will also
not advance the relfrozenxid and relminmxid.
---
 doc/src/sgml/config.sgml                           |  22 ++++
 src/backend/access/heap/heapam.c                   | 113 ++++++++++++++++++---
 src/backend/access/heap/vacuumlazy.c               |  26 ++++-
 src/backend/commands/vacuum.c                      |   6 ++
 src/backend/utils/misc/guc.c                       |   9 ++
 src/backend/utils/misc/postgresql.conf.sample      |   1 +
 src/include/access/heapam_xlog.h                   |   4 +-
 src/include/commands/vacuum.h                      |   1 +
 .../test_misc/t/002_vacuum_tolerate_damage.pl      |  49 +++++++++
 9 files changed, 212 insertions(+), 19 deletions(-)
 create mode 100644 src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7a7177c..f3f7b45 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8340,6 +8340,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+      <varlistentry id="guc-vacuum-tolerate-damage" xreflabel="vacuum_tolerate_damage">
+      <term><varname>vacuum_tolerate_damage</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>vacuum_tolerate_damage</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When set to off, which is the default, <command>VACUUM</command> will raise
+        an ERROR if a corrupted tuple is detected.  This causes the operation to
+        stop immediately.
+       </para>
+
+       <para>
+        If set to on, <command>VACUUM</command> will instead emit a WARNING for each
+        such tuple, but the operation will continue vacuuming the remaining tuples.  However,
+        nothing will be done to the corrupted tuples themselves, and the
+        relfrozenxid and relminmxid will not be advanced.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9b5f417..35b3579 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -52,6 +52,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -5802,6 +5803,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 #define		FRM_RETURN_IS_XID		0x0004
 #define		FRM_RETURN_IS_MULTI		0x0008
 #define		FRM_MARK_COMMITTED		0x0010
+#define		FRM_FOUND_CORRUPTION	0x0020
 
 /*
  * FreezeMultiXactId
@@ -5823,12 +5825,19 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  * FRM_RETURN_IS_MULTI
  *		The return value is a new MultiXactId to set as new Xmax.
  *		(caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ * FRM_FOUND_CORRUPTION
+ *		This flag will be set if a corrupted multixact is detected.
+ *		Normally such cases report an error, but depending on the
+ *		vacuum_damage_elevel it may only warn about the corruption
+ *		and allow the vacuum to continue.  If the flag is set, the
+ *		caller must not do anything to this tuple.
+ * vacuum_damage_elevel - Error level for reporting the corrupted tuple.
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				  TransactionId relfrozenxid, TransactionId relminmxid,
 				  TransactionId cutoff_xid, MultiXactId cutoff_multi,
-				  uint16 *flags)
+				  uint16 *flags, int vacuum_damage_elevel)
 {
 	TransactionId xid = InvalidTransactionId;
 	int			i;
@@ -5854,10 +5863,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 	else if (MultiXactIdPrecedes(multi, relminmxid))
-		ereport(ERROR,
+	{
+		ereport(vacuum_damage_elevel,
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found multixact %u from before relminmxid %u",
 								 multi, relminmxid)));
+		*flags |= FRM_FOUND_CORRUPTION;
+		return InvalidTransactionId;
+	}
 	else if (MultiXactIdPrecedes(multi, cutoff_multi))
 	{
 		/*
@@ -5868,10 +5881,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 */
 		if (MultiXactIdIsRunning(multi,
 								 HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)))
-			ereport(ERROR,
+		{
+			ereport(vacuum_damage_elevel,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("multixact %u from before cutoff %u found to be still running",
 									 multi, cutoff_multi)));
+			*flags |= FRM_FOUND_CORRUPTION;
+			return InvalidTransactionId;
+		}
 
 		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
 		{
@@ -5887,10 +5904,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			Assert(TransactionIdIsValid(xid));
 
 			if (TransactionIdPrecedes(xid, relfrozenxid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel,
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("found update xid %u from before relfrozenxid %u",
 										 xid, relfrozenxid)));
+				*flags |= FRM_FOUND_CORRUPTION;
+				return InvalidTransactionId;
+			}
 
 			/*
 			 * If the xid is older than the cutoff, it has to have aborted,
@@ -5899,9 +5920,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			if (TransactionIdPrecedes(xid, cutoff_xid))
 			{
 				if (TransactionIdDidCommit(xid))
-					ereport(ERROR,
+				{
+					ereport(vacuum_damage_elevel,
 							(errcode(ERRCODE_DATA_CORRUPTED),
 							 errmsg_internal("cannot freeze committed update xid %u", xid)));
+					*flags |= FRM_FOUND_CORRUPTION;
+					return InvalidTransactionId;
+				}
 				*flags |= FRM_INVALIDATE_XMAX;
 				xid = InvalidTransactionId; /* not strictly necessary */
 			}
@@ -5975,10 +6000,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 
 			Assert(TransactionIdIsValid(xid));
 			if (TransactionIdPrecedes(xid, relfrozenxid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel,
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("found update xid %u from before relfrozenxid %u",
 										 xid, relfrozenxid)));
+				pfree(members);
+				*flags |= FRM_FOUND_CORRUPTION;
+				return InvalidTransactionId;
+			}
 
 			/*
 			 * It's an update; should we keep it?  If the transaction is known
@@ -6025,10 +6055,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			 */
 			if (TransactionIdIsValid(update_xid) &&
 				TransactionIdPrecedes(update_xid, cutoff_xid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel,
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("found update xid %u from before xid cutoff %u",
 										 update_xid, cutoff_xid)));
+				pfree(members);
+				*flags |= FRM_FOUND_CORRUPTION;
+				return InvalidTransactionId;
+			}
 
 			/*
 			 * If we determined that it's an Xid corresponding to an update
@@ -6110,6 +6145,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
+ * vacuum_damage_elevel - Error level for reporting the corrupted tuple.
+ * found_corruption - Out parameter, must be a valid pointer if the
+ * vacuum_damage_elevel is set to WARNING.  This will be set to true
+ * if any corrupted tuple is found.
+ *
  * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
  * XID older than it could neither be running nor seen as running by any
  * open transaction.  This ensures that the replacement will not change
@@ -6128,7 +6168,8 @@ bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  TransactionId relfrozenxid, TransactionId relminmxid,
 						  TransactionId cutoff_xid, TransactionId cutoff_multi,
-						  xl_heap_freeze_tuple *frz, bool *totally_frozen_p)
+						  xl_heap_freeze_tuple *frz, bool *totally_frozen_p,
+						  bool *found_corruption, int vacuum_damage_elevel)
 {
 	bool		changed = false;
 	bool		xmax_already_frozen = false;
@@ -6141,6 +6182,17 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	frz->t_infomask = tuple->t_infomask;
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
+	*totally_frozen_p = false;
+
+	/*
+	 * The vacuum_damage_elevel must be set to either ERROR or WARNING and
+	 * if it is set to WARNING then the caller must pass a valid pointer for
+	 * found_corruption.
+	 */
+	Assert((vacuum_damage_elevel == WARNING) ||
+		   (vacuum_damage_elevel == ERROR));
+	Assert((found_corruption != NULL) || (vacuum_damage_elevel == ERROR));
+
 	/*
 	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
 	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
@@ -6155,19 +6207,27 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	else
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
-			ereport(ERROR,
+		{
+			ereport(vacuum_damage_elevel,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
+			*found_corruption = true;
+			return false;
+		}
 
 		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
 		if (xmin_frozen)
 		{
 			if (!TransactionIdDidCommit(xid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel,
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
 										 xid, cutoff_xid)));
+				*found_corruption = true;
+				return false;
+			}
 
 			frz->t_infomask |= HEAP_XMIN_FROZEN;
 			changed = true;
@@ -6192,7 +6252,17 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
-									cutoff_xid, cutoff_multi, &flags);
+									cutoff_xid, cutoff_multi,
+									&flags, vacuum_damage_elevel);
+		/*
+		 * If we have detected a corrupted multi-xact id then just set the
+		 * found_corruption flag and return.
+		 */
+		if (flags & FRM_FOUND_CORRUPTION)
+		{
+			*found_corruption = true;
+			return false;
+		}
 
 		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
 
@@ -6236,10 +6306,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	else if (TransactionIdIsNormal(xid))
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
-			ereport(ERROR,
+		{
+			ereport(vacuum_damage_elevel,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmax %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
+			*found_corruption = true;
+			return false;
+		}
 
 		if (TransactionIdPrecedes(xid, cutoff_xid))
 		{
@@ -6251,10 +6325,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
 				TransactionIdDidCommit(xid))
-				ereport(ERROR,
+			{
+				ereport(vacuum_damage_elevel,
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("cannot freeze committed xmax %u",
 										 xid)));
+				*found_corruption = true;
+				return false;
+			}
 			freeze_xmax = true;
 		}
 		else
@@ -6267,10 +6345,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		xmax_already_frozen = true;
 	}
 	else
-		ereport(ERROR,
+	{
+		ereport(vacuum_damage_elevel,
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
 								 xid, tuple->t_infomask)));
+		*found_corruption = true;
+		return false;
+	}
 
 	if (freeze_xmax)
 	{
@@ -6386,7 +6468,8 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	do_freeze = heap_prepare_freeze_tuple(tuple,
 										  relfrozenxid, relminmxid,
 										  cutoff_xid, cutoff_multi,
-										  &frz, &tuple_totally_frozen);
+										  &frz, &tuple_totally_frozen,
+										  NULL, ERROR);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a0da444..6f848ea 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -312,6 +312,7 @@ typedef struct LVRelStats
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
+	bool		corrupted_tuple_detected;
 
 	/* Used for error callback */
 	char	   *indname;
@@ -497,6 +498,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	vacrelstats->num_index_scans = 0;
 	vacrelstats->pages_removed = 0;
 	vacrelstats->lock_waiter_detected = false;
+	vacrelstats->corrupted_tuple_detected = false;
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
@@ -597,8 +599,20 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	if (new_rel_allvisible > new_rel_pages)
 		new_rel_allvisible = new_rel_pages;
 
-	new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
-	new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
+	/*
+	 * Don't advance the relfrozenxid and relminmxid, if we have detected a
+	 * corrupted tuple.
+	 */
+	if (scanned_all_unfrozen && !vacrelstats->corrupted_tuple_detected)
+	{
+		new_frozen_xid = FreezeLimit;
+		new_min_multi = MultiXactCutoff;
+	}
+	else
+	{
+		new_frozen_xid = InvalidTransactionId;
+		new_min_multi = InvalidMultiXactId;
+	}
 
 	vac_update_relstats(onerel,
 						new_rel_pages,
@@ -1460,6 +1474,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			else
 			{
 				bool		tuple_totally_frozen;
+				bool		found_corruption = false;
 
 				num_tuples += 1;
 				hastup = true;
@@ -1472,8 +1487,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 											  relfrozenxid, relminmxid,
 											  FreezeLimit, MultiXactCutoff,
 											  &frozen[nfrozen],
-											  &tuple_totally_frozen))
+											  &tuple_totally_frozen,
+											  &found_corruption,
+											  vacuum_tolerate_damage ?
+											  WARNING : ERROR))
 					frozen[nfrozen++].offset = offnum;
+				else if (found_corruption)
+					vacrelstats->corrupted_tuple_detected = true;
 
 				if (!tuple_totally_frozen)
 					all_frozen = false;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 23eb605..401a45f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -63,6 +63,12 @@ int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
 
+/*
+ * Whether to continue with the vacuuming after detecting a corrupted tuple.
+ * If this flag is set then on detection of a corrupted tuple during a freeze,
+ * it will report each such occurrence with WARNING and continue the vacuum.
+ */
+bool vacuum_tolerate_damage = false;
 
 /* A few variables that don't seem worth passing around as parameters */
 static MemoryContext vac_context = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6..c7a9eb3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2036,6 +2036,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_tolerate_damage", PGC_USERSET, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Whether to continue running vacuum after detecting corrupted tuple."),
+		},
+		&vacuum_tolerate_damage,
+		false,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f..a04d6aa 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -671,6 +671,7 @@
 #vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
 						# before index cleanup, 0 always performs
 						# index cleanup
+#vacuum_tolerate_damage = false
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index aa17f7d..fb0bb31 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -412,7 +412,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  TransactionId cutoff_xid,
 									  TransactionId cutoff_multi,
 									  xl_heap_freeze_tuple *frz,
-									  bool *totally_frozen);
+									  bool *totally_frozen,
+									  bool *found_corruption,
+									  int vacuum_damage_elevel);
 extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
 									  xl_heap_freeze_tuple *xlrec_tp);
 extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index a4cd721..322fa17 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -237,6 +237,7 @@ extern int	vacuum_freeze_min_age;
 extern int	vacuum_freeze_table_age;
 extern int	vacuum_multixact_freeze_min_age;
 extern int	vacuum_multixact_freeze_table_age;
+extern bool	vacuum_tolerate_damage;
 
 /* Variables for cost-based parallel vacuum */
 extern pg_atomic_uint32 *VacuumSharedCostBalance;
diff --git a/src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl b/src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl
new file mode 100644
index 0000000..acf1266
--- /dev/null
+++ b/src/test/modules/test_misc/t/002_vacuum_tolerate_damage.pl
@@ -0,0 +1,49 @@
+# Verify the vacuum_tolerate_damage GUC
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Initialize a test cluster
+my $node = get_new_node('primary');
+$node->init();
+$node->start;
+
+# Run a SQL command and return psql's stderr
+sub run_sql_command
+{
+	my $sql = shift;
+	my $stderr;
+
+	$node->psql(
+		'postgres',
+		$sql,
+		stderr        => \$stderr);
+	return $stderr;
+}
+
+my $output;
+
+# Insert 2 tuples into the table and update the relfrozenxid for the table to
+# a future xid.
+run_sql_command(
+	"create table test_vac (a int) WITH (autovacuum_enabled = off);
+	 insert into test_vac values (1), (2);
+	 update pg_class set relfrozenxid=txid_current()::text::xid where relname='test_vac';");
+
+$output = run_sql_command('vacuum(freeze, disable_page_skipping) test_vac;');
+ok( $output =~ m/ERROR:.*found xmin.*before relfrozenxid/);
+
+# set the vacuum_tolerate_damage and run again
+$output = run_sql_command('set vacuum_tolerate_damage=true;
+						   vacuum(freeze, disable_page_skipping) test_vac;');
+
+
+# This time we should get a WARNING for both tuples
+ok( scalar( @{[ $output=~/WARNING:.*found xmin.*before relfrozenxid/gi ]}) == 2);
+
+run_sql_command('DROP TABLE test_vac;');
+
+$node->stop('fast');
-- 
1.8.3.1

#20Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#18)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Sat, Aug 29, 2020 at 1:46 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Aug 28, 2020 at 1:29 PM Andres Freund <andres@anarazel.de> wrote:

It can break HOT chains, plain ctid chains, etc., for example; if
earlier / follower tuples are removed, that breakage can't be detected
anymore at a later time.

I think I need a more specific example here to understand the problem.
If the xmax of one tuple matches the xmin of the next, then either
both values precede relfrozenxid or both follow it. In the former
case, neither tuple should be frozen and the chain should not get
broken; in the latter case, everything's normal anyway. If the xmax
and xmin don't match, then the chain was already broken. Perhaps we
are removing important evidence, though it seems like that might've
happened anyway prior to reaching the damaged page, but we're not
making whatever corruption may exist any worse. At least, not as far
as I can see.

One example: suppose during vacuum there are 2 tuples in the HOT
chain, and the xmin of the first tuple is corrupted (i.e. smaller
than relfrozenxid), while the xmax of this tuple (which is the same as
the xmin of the second tuple) is smaller than the cutoff_xid when
trying to freeze the tuple. As a result, vacuum will freeze the second
tuple but leave the first tuple untouched.

Now, if we later go through heap_hot_search_buffer, the xmax of the
first tuple will not match the xmin of the second tuple, as we have
frozen the second tuple. But I feel this is easily fixable, right?
Instead of not doing anything to the corrupted tuple, we can
partially freeze it: just leave the corrupted xid alone but mark the
other xid as frozen if it is smaller than cutoff_xid.
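
To make the scenario concrete (all the xids below are made up for
illustration):

    relfrozenxid = 100, cutoff_xid = 150

    tuple 1 (chain root):   xmin = 90    <- corrupted, precedes relfrozenxid
                            xmax = 120
    tuple 2 (HOT update):   xmin = 120   <- below cutoff_xid, gets frozen

After freezing, tuple 2's xmin reads as frozen rather than 120, so it
no longer matches tuple 1's xmax, and the chain check in
heap_hot_search_buffer fails.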

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#21Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#20)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Sat, Aug 29, 2020 at 4:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

One example: suppose during vacuum there are 2 tuples in the HOT
chain, and the xmin of the first tuple is corrupted (i.e. smaller
than relfrozenxid), while the xmax of this tuple (which is the same as
the xmin of the second tuple) is smaller than the cutoff_xid when
trying to freeze the tuple. As a result, vacuum will freeze the second
tuple but leave the first tuple untouched.

Now, if we later go through heap_hot_search_buffer, the xmax of the
first tuple will not match the xmin of the second tuple, as we have
frozen the second tuple. But I feel this is easily fixable, right?
Instead of not doing anything to the corrupted tuple, we can
partially freeze it: just leave the corrupted xid alone but mark the
other xid as frozen if it is smaller than cutoff_xid.

That seems reasonable to me. Andres, what do you think?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#21)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Hi,

On 2020-09-14 13:26:27 -0400, Robert Haas wrote:

On Sat, Aug 29, 2020 at 4:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

One example: suppose during vacuum there are 2 tuples in the HOT
chain, and the xmin of the first tuple is corrupted (i.e. smaller
than relfrozenxid), while the xmax of this tuple (which is the same as
the xmin of the second tuple) is smaller than the cutoff_xid when
trying to freeze the tuple. As a result, vacuum will freeze the second
tuple but leave the first tuple untouched.

Now, if we later go through heap_hot_search_buffer, the xmax of the
first tuple will not match the xmin of the second tuple, as we have
frozen the second tuple. But I feel this is easily fixable, right?
Instead of not doing anything to the corrupted tuple, we can
partially freeze it: just leave the corrupted xid alone but mark the
other xid as frozen if it is smaller than cutoff_xid.

That seems reasonable to me. Andres, what do you think?

It seems pretty dangerous to me. What exactly are you going to put into
xmin/xmax here? And how would anything you put into the first tuple not
break index lookups? There's no such thing as a frozen xmax (so far), so
what are you going to put in there? A random different xid?
FrozenTransactionId? HEAP_XMAX_INVALID?

This whole approach just seems likely to exacerbate corruption while
also making it impossible to debug. That's ok enough if it's an explicit
user action, but doing it based on a config variable setting seems
absurdly dangerous to me.

Greetings,

Andres Freund

#23Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#22)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On 2020-Sep-14, Andres Freund wrote:

It seems pretty dangerous to me. What exactly are you going to put into
xmin/xmax here? And how would anything you put into the first tuple not
break index lookups? There's no such thing as a frozen xmax (so far), so
what are you going to put in there? A random different xid?
FrozenTransactionId? HEAP_XMAX_INVALID?

This whole approach just seems likely to exacerbate corruption while
also making it impossible to debug. That's ok enough if it's an explicit
user action, but doing it based on a config variable setting seems
absurdly dangerous to me.

FWIW I agree with Andres' stance on this. The current system is *very*
complicated and bugs are obscure already. If we hide them, what we'll
be getting is a system where data can become corrupted for no apparent
reason.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#24Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#23)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Mon, Sep 14, 2020 at 3:00 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

FWIW I agree with Andres' stance on this. The current system is *very*
complicated and bugs are obscure already. If we hide them, what we'll
be getting is a system where data can become corrupted for no apparent
reason.

I think I might have to give up on this proposal given the level of
opposition to it, but the nature of the opposition doesn't make any
sense to me on a technical level. Suppose a tuple with tid A has been
updated, producing a new version at tid B. The argument that is now
being offered is that if A has been found to be corrupt then we'd
better stop vacuuming the table altogether lest we reach B and vacuum
it too, further corrupting the table and destroying forensic evidence.
But even ignoring the fact that many users want to get the database
running again more than they want to do forensics, it's entirely
possible that B < A, in which case the damage has already been done.
Therefore, I can't see any argument that this patch creates any
scenario that can't happen already. It seems entirely reasonable to me
to say, as a review comment, hey, you haven't sufficiently considered
this particular scenario, that still needs work. But the argument here
is much more about whether this is a reasonable thing to do in general
and under any circumstances, and it feels to me like you guys are
saying "no" without offering any really convincing evidence that there
are unfixable problems here. IOW, I agree that having a GUC
corrupt_my_tables_more=true is not a reasonable thing, but I disagree
that the proposal on the table is tantamount to that.

The big picture here is that people have terabyte-scale tables, 1 or 2
tuples get corrupted, and right now the only real fix is to dump and
restore the whole table, which leads to prolonged downtime. The
pg_surgery stuff should help with that, and the work to make VACUUM
report the exact TID will also help, and if we can get the heapcheck
stuff Mark Dilger is working on committed, that will provide an
alternative and probably better way of finding this kind of
corruption, which is all to the good. However, I disagree with the
idea that a typical user who has a 2TB table with one corrupted tuple on
page 0 probably wants VACUUM to fail over and over again, letting the
table bloat like crazy, instead of bleating loudly but still vacuuming
the other 99.9999% of the table. I mean, somebody probably wants
that, and that's fine. But I have a hard time imagining it as a
typical view. Am I just lacking in imagination?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#25Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#24)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Hi,

On 2020-09-14 15:50:49 -0400, Robert Haas wrote:

On Mon, Sep 14, 2020 at 3:00 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

FWIW I agree with Andres' stance on this. The current system is *very*
complicated and bugs are obscure already. If we hide them, what we'll
be getting is a system where data can become corrupted for no apparent
reason.

I think I might have to give up on this proposal given the level of
opposition to it, but the nature of the opposition doesn't make any
sense to me on a technical level. Suppose a tuple with tid A has been
updated, producing a new version at tid B. The argument that is now
being offered is that if A has been found to be corrupt then we'd
better stop vacuuming the table altogether lest we reach B and vacuum
it too, further corrupting the table and destroying forensic evidence.
But even ignoring the fact that many users want to get the database
running again more than they want to do forensics, it's entirely
possible that B < A, in which case the damage has already been done.

My understanding of the case we're discussing is that it's corruption
(e.g. relfrozenxid being different than table contents) affecting a HOT
chain. I.e. by definition all within a single page. We won't have
modified part of it independent of B < A, because freezing is
all-or-nothing. Just breaking the HOT chain into two or something like
that will just make things worse, because indexes won't find tuples, and
because reindexing might then get confused e.g. by HOT chains without a
valid start, or by having two visible tuples for the same PK.

But even ignoring the fact that many users want to get the database
running again more than they want to do forensics

The user isn't always right. And I am not objecting to providing a
tool to get things running. I'm objecting to VACUUM doing so, especially
when it's a system-wide config option triggering that behaviour.

Therefore, I can't see any argument that this patch creates any
scenario that can't happen already. It seems entirely reasonable to me
to say, as a review comment, hey, you haven't sufficiently considered
this particular scenario, that still needs work. But the argument here
is much more about whether this is a reasonable thing to do in general
and under any circumstances, and it feels to me like you guys are
saying "no" without offering any really convincing evidence that there
are unfixable problems here.

I don't think that's quite the calculation. You're suggesting to make
already really complicated and failure-prone code even more complicated
by adding heuristic error recovery to it. That has substantial cost,
even if we were to get it perfectly right (which I don't believe we
will).

The big picture here is that people have terabyte-scale tables, 1 or 2
tuples get corrupted, and right now the only real fix is to dump and
restore the whole table, which leads to prolonged downtime. The
pg_surgery stuff should help with that, and the work to make VACUUM
report the exact TID will also help, and if we can get the heapcheck
stuff Mark Dilger is working on committed, that will provide an
alternative and probably better way of finding this kind of
corruption, which is all to the good.

Agreed.

However, I disagree with the idea that a typical user who has a 2TB
table with one corrupted tuple on page 0 probably wants VACUUM to fail over
and over again, letting the table bloat like crazy, instead of
bleating loudly but still vacuuming the other 99.9999% of the
table. I mean, somebody probably wants that, and that's fine. But I
have a hard time imagining it as a typical view. Am I just lacking in
imagination?

I know that that kind of user exists, but yeah, I disagree extremely
strongly that that's what the majority of users want. And I don't
think that's something we should encourage. Those
cases indicate that either postgres has a bug, or their storage / memory
/ procedures have an issue. Reacting by making it harder to diagnose is
just a bad idea.

Greetings,

Andres Freund

#26Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#25)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Mon, Sep 14, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:

My understanding of the case we're discussing is that it's corruption
(e.g. relfrozenxid being different than table contents) affecting a HOT
chain. I.e. by definition all within a single page. We won't have
modified part of it independent of B < A, because freezing is
all-or-nothing. Just breaking the HOT chain into two or something like
that will just make things worse, because indexes won't find tuples, and
because reindexing might then get confused e.g. by HOT chains without a
valid start, or by having two visible tuples for the same PK.

If we adopt the proposal made by Dilip, we will not do that. We must
have a.xmax = b.xmin, and that value is either less than relfrozenxid
or it is not. If we skip an entire tuple because one XID is bad, then
we could break the HOT chain when a.xmin is bad and the remaining
values are OK. But if we decide separately for xmin and xmax then we
should be alright. Alternatively, if we're only concerned about HOT
chains, we could skip the entire page if any tuple on the page shows
evidence of damage.
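
For the page-level variant, a toy sketch reusing the stand-in types
from the sketch in #21 above (hypothetical; the real code would walk
the line pointers with PageGetItemId() and friends):

/*
 * Check the whole page first; if any tuple looks damaged, warn and
 * leave the entire page (and thus every HOT chain on it) unfrozen.
 */
static bool
toy_page_has_damage(const ToyTuple *tuples, int ntuples, ToyXid relfrozenxid)
{
    for (int i = 0; i < ntuples; i++)
    {
        if (toy_xid_damaged(tuples[i].xmin, relfrozenxid) ||
            toy_xid_damaged(tuples[i].xmax, relfrozenxid))
            return true;
    }
    return false;
}

static void
toy_freeze_page(ToyTuple *tuples, int ntuples,
                ToyXid relfrozenxid, ToyXid cutoff_xid)
{
    if (toy_page_has_damage(tuples, ntuples, relfrozenxid))
    {
        fprintf(stderr, "WARNING: damaged tuple on page, skipping freeze\n");
        return;
    }

    for (int i = 0; i < ntuples; i++)
        toy_prepare_freeze(&tuples[i], relfrozenxid, cutoff_xid);
}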

I don't think that's quite the calculation. You're suggesting to make
already really complicated and failure-prone code even more complicated
by adding heuristic error recovery to it. That has substantial cost,
even if we were to get it perfectly right (which I don't believe we
will).

That's a legitimate concern, but I think it would make more sense to
first make the design as good as we can and then decide whether it's
adequate than to decide ab initio that there's no way to make it good
enough.

However, I disagree with the idea that a typical user who has a 2TB
table with one corrupted tuple on page 0 probably wants VACUUM to fail over
and over again, letting the table bloat like crazy, instead of
bleating loudly but still vacuuming the other 99.9999% of the
table. I mean, somebody probably wants that, and that's fine. But I
have a hard time imagining it as a typical view. Am I just lacking in
imagination?

I know that that kind of user exists, but yeah, I disagree extremely
strongly that that's what the majority of users want. And I don't
think that's something we should encourage. Those
cases indicate that either postgres has a bug, or their storage / memory
/ procedures have an issue. Reacting by making it harder to diagnose is
just a bad idea.

Well, the people I tend to deal with are not going to let me conduct a
lengthy investigation almost no matter what, and the more severe the
operational consequences of the problem are, the less likely it is
that I'm going to have time to figure anything out. Being able to
create some kind of breathing room is pretty valuable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#26)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Hi,

On 2020-09-14 17:00:48 -0400, Robert Haas wrote:

On Mon, Sep 14, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:

My understanding of the case we're discussing is that it's corruption
(e.g. relfrozenxid being different than table contents) affecting a HOT
chain. I.e. by definition all within a single page. We won't have
modified part of it independent of B < A, because freezing is
all-or-nothing. Just breaking the HOT chain into two or something like
that will just make things worse, because indexes won't find tuples, and
because reindexing might then get confused e.g. by HOT chains without a
valid start, or by having two visible tuples for the same PK.

If we adopt the proposal made by Dilip, we will not do that. We must
have a.xmax = b.xmin, and that value is either less than relfrozenxid
or it is not. If we skip an entire tuple because one XID is bad, then
we could break the HOT chain when a.xmin is bad and the remaining
values are OK. But if we decide separately for xmin and xmax then we
should be alright.

I thought I precisely addressed this case:

What exactly are you going to put into xmin/xmax here? And how would
anything you put into the first tuple not break index lookups? There's
no such thing as a frozen xmax (so far), so what are you going to put
in there? A random different xid? FrozenTransactionId?
HEAP_XMAX_INVALID?

What am I missing?

Greetings,

Andres Freund

#28Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andres Freund (#27)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Tue, Sep 15, 2020 at 2:35 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-09-14 17:00:48 -0400, Robert Haas wrote:

On Mon, Sep 14, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:

My understanding of the case we're discussing is that it's corruption
(e.g. relfrozenxid being different than table contents) affecting a HOT
chain. I.e. by definition all within a single page. We won't have
modified part of it independent of B < A, because freezing is
all-or-nothing. Just breaking the HOT chain into two or something like
that will just make things worse, because indexes won't find tuples, and
because reindexing might then get confused e.g. by HOT chains without a
valid start, or by having two visible tuples for the same PK.

If we adopt the proposal made by Dilip, we will not do that. We must
have a.xmax = b.xmin, and that value is either less than relfrozenxid
or it is not. If we skip an entire tuple because one XID is bad, then
we could break the HOT chain when a.xmin is bad and the remaining
values are OK. But if we decide separately for xmin and xmax then we
should be alright.

I thought I precisely addressed this case:

What exactly are you going to put into xmin/xmax here? And how would
anything you put into the first tuple not break index lookups? There's
no such thing as a frozen xmax (so far), so what are you going to put
in there? A random different xid? FrozenTransactionId?
HEAP_XMAX_INVALID?

What am I missing?

What problem do you see if we set xmax to InvalidTransactionId and
set the HEAP_XMAX_INVALID flag in the infomask?  I mean, even now, if
the xmax is older than the cutoff xid then we do the same thing, i.e.
if (freeze_xmax)
{
    ..
    /* wipe out the xmax entirely */
    frz->xmax = InvalidTransactionId;
    ..
    /* clear all xmax-related bits and mark the xmax invalid */
    frz->t_infomask &= ~HEAP_XMAX_BITS;
    frz->t_infomask |= HEAP_XMAX_INVALID;
    /* the tuple is no longer HOT-updated or key-updated */
    frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
    frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
    changed = true;
}

So if we do that, it will not be part of the hot chain anymore. I
might be missing something, but I can't see how it can be more broken
than what it is without our change. I agree that in the case of a
corrupted xmin we would now mark the tuple with HEAP_XMAX_INVALID
without freezing the xmin, but that is a valid status for a tuple
anyway.

However, if we think it can still cause some issues, then I feel we
can skip the whole page, as Robert suggested.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#29Andres Freund
andres@anarazel.de
In reply to: Dilip Kumar (#28)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote:

What problem do you see if we set xmax to InvalidTransactionId and
set the HEAP_XMAX_INVALID flag in the infomask?

1) It'll make a dead tuple appear live. You cannot do this for tuples
with an xid below the horizon.
2) it'll break HOT chain following / indexes.

#30Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andres Freund (#29)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Tue, Sep 15, 2020 at 11:14 AM Andres Freund <andres@anarazel.de> wrote:

On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote:

What problem do you see if we set xmax to InvalidTransactionId and
set the HEAP_XMAX_INVALID flag in the infomask?

1) It'll make a dead tuple appear live. You cannot do this for tuples
with an xid below the horizon.

How is that possible? A tuple which has a committed xmax, where the
xmax is older than the oldestXmin, should not come up for freezing
(because such tuples are already gone), unless the xmax is a lock-only
xid. So if the xmax is smaller than the cutoff xid then either it is
lock-only or it is aborted. If the xmax is lock-only then I don't see
any issue; OTOH, if it is an aborted xid and it is already smaller than
the cutoff xid, then it is a live tuple anyway.

2) it'll break HOT chain following / indexes.

If my theory above in point 1 is correct, then I don't see this issue either.
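
To spell out that case analysis (purely illustrative; the enum and
its cases are mine, not anything in the PostgreSQL sources):

/*
 * Possible states of an xmax that is older than cutoff_xid on a
 * tuple that survived pruning, per the argument above.
 */
typedef enum
{
    TOY_XMAX_LOCK_ONLY,         /* locker is long gone */
    TOY_XMAX_ABORTED,           /* deleter aborted */
    TOY_XMAX_COMMITTED          /* deleter committed */
} ToyOldXmaxKind;

static const char *
toy_explain_old_xmax(ToyOldXmaxKind kind)
{
    switch (kind)
    {
        case TOY_XMAX_LOCK_ONLY:
            return "lock-only: clearing xmax loses nothing";
        case TOY_XMAX_ABORTED:
            return "aborted: the tuple was live all along";
        case TOY_XMAX_COMMITTED:
            return "committed below cutoff: pruning should already have"
                   " removed the tuple, so reaching here means corruption";
    }
    return "unreachable";
}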

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#31Andres Freund
andres@anarazel.de
In reply to: Dilip Kumar (#30)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Hi,

On 2020-09-15 12:52:25 +0530, Dilip Kumar wrote:

On Tue, Sep 15, 2020 at 11:14 AM Andres Freund <andres@anarazel.de> wrote:

On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote:

What problem do you see if we set xmax to InvalidTransactionId and
set the HEAP_XMAX_INVALID flag in the infomask?

1) It'll make a dead tuple appear live. You cannot do this for tuples
with an xid below the horizon.

How is that possible? A tuple which has a committed xmax, where the
xmax is older than the oldestXmin, should not come up for freezing
(because such tuples are already gone), unless the xmax is a lock-only xid.

There've been several cases of this in the past. A fairly easy way is a
corrupted relfrozenxid (of which there are many examples).

You simply cannot just assume that everything is OK and argue that
that's why it's OK to fix data corruption in some approximate manner. By
definition everything *is not ok* if you ever come here.

Greetings,

Andres Freund

#32Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#31)
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Tue, Sep 15, 2020 at 2:04 PM Andres Freund <andres@anarazel.de> wrote:

How is that possible? A tuple which has a committed xmax, where the
xmax is older than the oldestXmin, should not come up for freezing
(because such tuples are already gone), unless the xmax is a lock-only xid.

There've been several cases of this in the past. A fairly easy way is a
corrupted relfrozenxid (of which there are many examples).

Hmm, so is the case you're worried about here the case where the
freezing threshold is greater than the pruning threshold? I.e., the
relfrozenxid has been updated to a value greater than the xmin we
derive from the procarray?

If that's not the case, then I don't see what problem there can be
here. To reach heap_prepare_freeze_tuple the tuple has to survive
pruning. If xmin < freezing-threshold and freezing-threshold <
pruning-threshold and the tuple survived pruning, then xmin must be a
committed transaction visible to everyone so setting xmin to
FrozenTransactionId is fine. If xmax < freezing-threshold and
freezing-threshold < pruning-threshold and the tuple survived pruning,
xmax must be visible to everyone and can't be running so it must have
aborted, so setting xmax to InvalidTransactionId is fine.

On the other hand if, somehow, freezing-threshold > pruning-threshold,
then freezing seems categorically unsafe. Doing so would change
visibility decisions of transactions that are still running, or that
were running at the time when we computed the pruning threshold. The
sanity checks in heap_prepare_freeze_tuple() seem like they would
catch many such cases, but I'm not sure they're all water-tight. It
might be better to skip calling heap_prepare_freeze_tuple() altogether
if the freezing threshold does not precede the pruning threshold.
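
A sketch of that guard, in the same toy terms as the earlier sketches
(the real code would compare with TransactionIdPrecedesOrEquals() to
handle xid wraparound):

/*
 * Only attempt freezing when the freeze cutoff does not exceed the
 * prune horizon; otherwise freezing could change visibility decisions
 * for transactions the prune horizon still considers running.
 */
static bool
toy_freeze_is_safe(ToyXid cutoff_xid, ToyXid prune_xid)
{
    return cutoff_xid <= prune_xid;
}

Vacuum would then skip heap_prepare_freeze_tuple() for the relation
whenever toy_freeze_is_safe() returns false.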

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company