Reviewing freeze map code
Hi,
The freeze map changes, besides being very important, seem to be one of
the patches with a high risk profile in 9.6. Robert had asked whether
I'd take a look. I thought it'd be a good idea to review that while
running tests for
/messages/by-id/CAMkU=1w85Dqt766AUrCnyqCXfZ+rsk1witAc_=v5+Pce93Sftw@mail.gmail.com
For starters, I'm just going through the commits. It seems the relevant
pieces are:
a892234 Change the format of the VM fork to add a second bit per page.
77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
fd31cd2 Don't vacuum all-frozen pages.
7087166 pg_upgrade: Convert old visibility map format to new format.
ba0a198 Add pg_visibility contrib module.
did I miss anything important?
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
some of the review items here are mere matters of style/preference. Feel
entirely free to discard them, but I thought if I'm going through the
change anyway...
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
a892234 Change the format of the VM fork to add a second bit per page.
TL;DR: fairly minor stuff.
+ * heap_tuple_needs_eventual_freeze
+ *
+ * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
+ * will eventually require freezing. Similar to heap_tuple_needs_freeze,
+ * but there's no cutoff, since we're trying to figure out whether freezing
+ * will ever be needed, not whether it's needed now.
+ */
+bool
+heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
Wouldn't redefining this to heap_tuple_is_frozen() and then inverting the
checks be easier to understand?
+ /*
+ * If xmax is a valid xact or multixact, this tuple is also not frozen.
+ */
+ if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+ {
+ MultiXactId multi;
+
+ multi = HeapTupleHeaderGetRawXmax(tuple);
+ if (MultiXactIdIsValid(multi))
+ return true;
+ }
Hm. What's the test inside the if() for? There shouldn't be any case
where xmax is invalid if HEAP_XMAX_IS_MULTI is set. Now there's a
check like that outside of this commit, but it seems strange to me
(Alvaro, perhaps you could comment on this?).
+ *
+ * Clearing both visibility map bits is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
* replay of the updating operation as well.
I think including "both" here makes things less clear, because it
differentiates clearing one bit from clearing both. There's no practical
difference atm, but still.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
*
I think the remaining sentence isn't entirely accurate, there's now more
than one bit, and they're different with regard to scan_all/!scan_all
vacuums (or will be - maybe this is updated further in a later commit? But
if so, that sentence shouldn't yet be removed...).
-
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-
Hm, why was this moved to the header? Sounds like something the outside
shouldn't care about.
#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
Hm. This isn't really a mapping to an individual bit anymore - but I
don't really have a better name in mind. Maybe TO_OFFSET?
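To make the "offset, not a bit" point concrete, here's a minimal standalone sketch of the two-bit mapping math. The inline helpers (heapblk_to_mapbyte, heapblk_to_mapbit) are hypothetical stand-ins for the macros; constants are simplified for illustration:

```c
#include <stdint.h>

#define BITS_PER_HEAPBLOCK   2
#define HEAPBLOCKS_PER_BYTE  (8 / BITS_PER_HEAPBLOCK)   /* now 4, not 8 */

/* Which byte of the map holds this heap block's bits. */
static inline uint32_t heapblk_to_mapbyte(uint32_t blkno)
{
    return blkno / HEAPBLOCKS_PER_BYTE;
}

/* Bit offset *within* that byte where the block's two bits start:
 * always an even number (0, 2, 4, 6), hence "offset" rather than "bit". */
static inline uint32_t heapblk_to_mapbit(uint32_t blkno)
{
    return (blkno % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;
}
```

So heap block 5 lands in map byte 1 at bit offset 2, occupying two adjacent bits there.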
+static const uint8 number_of_ones_for_visible[256] = {
...
+};
+static const uint8 number_of_ones_for_frozen[256] = {
...
};
Did somebody verify the new contents are correct?
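One way to verify the tables mechanically: for each byte value, count how many of the four 2-bit slots have the relevant flag set, and compare against the table entry. The table contents are elided above, so this sketch (expected_ones is a hypothetical helper) only computes what each entry should be; flag values match the new format's per-slot bits:

```c
#include <stdint.h>

#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02

/* Count how many of the 4 two-bit slots in `byte` have `flag` set.
 * number_of_ones_for_visible[b] should equal
 * expected_ones(b, VISIBILITYMAP_ALL_VISIBLE), and likewise for frozen. */
static uint8_t expected_ones(uint8_t byte, uint8_t flag)
{
    uint8_t count = 0;

    for (int slot = 0; slot < 4; slot++)
        if ((byte >> (slot * 2)) & flag)
            count++;
    return count;
}
```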
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
This seems rather easy to misunderstand, as this really only clears all
the bits for one page, not actually all the bits.
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass flags
+ * for which it needs to check the value in visibility map.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +351,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
I'm not seeing what flags the above comment change is referring to?
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single byte read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
Not a new issue, and *very* likely to be irrelevant in practice (given
the value is only referenced once): But there's really no guarantee
map[mapByte] is only read once here.
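If one wanted to nail that down, a volatile-qualified access forces exactly one load of the map byte before the bits are extracted. This is only a hedged sketch (vm_get_status is a hypothetical stand-in for the function under discussion, not the committed code):

```c
#include <stdint.h>

#define VISIBILITYMAP_VALID_BITS 0x03

static inline uint8_t vm_get_status(const uint8_t *map,
                                    uint32_t mapByte, int mapBit)
{
    /* A volatile read is evaluated exactly once, even where the compiler
     * would otherwise be free to re-read map[mapByte]. */
    uint8_t byte = *(const volatile uint8_t *) &map[mapByte];

    return (byte >> mapBit) & VISIBILITYMAP_VALID_BITS;
}
```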
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
Not really a new issue again: The parameter types (previously return
type) to this function seem wrong to me.
@@ -1934,5 +1992,14 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
+ /*
+ * We don't bother clearing *all_frozen when the page is discovered not
+ * to be all-visible, so do that now if necessary. The page might fail
+ * to be all-frozen for other reasons anyway, but if it's not all-visible,
+ * then it definitely isn't all-frozen.
+ */
+ if (!all_visible)
+ *all_frozen = false;
+
Why don't we just set *all_frozen to false when appropriate? It'd be
just as many lines and probably easier to understand?
+ /*
+ * If the page is marked as all-visible but not all-frozen, we should
+ * so mark it. Note that all_frozen is only valid if all_visible is
+ * true, so we must check both.
+ */
This kinda seems to imply that all-visible implies all_frozen. Also, why
has that block been added to the end of the if/else if chain? Seems like
it belongs below the (all_visible && !all_visible_according_to_vm) block.
Greetings,
Andres Freund
On Tue, May 3, 2016 at 6:48 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
The freeze map changes, besides being very important, seem to be one of
the patches with a high risk profile in 9.6. Robert had asked whether
I'd take a look. I thought it'd be a good idea to review that while
running tests for
/messages/by-id/CAMkU=1w85Dqt766AUrCnyqCXfZ+rsk1witAc_=v5+Pce93Sftw@mail.gmail.com
Thank you for reviewing.
For starters, I'm just going through the commits. It seems the relevant
pieces are:
a892234 Change the format of the VM fork to add a second bit per page.
77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
fd31cd2 Don't vacuum all-frozen pages.
7087166 pg_upgrade: Convert old visibility map format to new format.
ba0a198 Add pg_visibility contrib module.
did I miss anything important?
That's all.
Regards,
--
Masahiko Sawada
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
Nothing to say here.
fd31cd2 Don't vacuum all-frozen pages.
Hm. I do wonder if it's going to bite us that we don't have a way to
actually force vacuuming of the whole table (besides manually rm'ing the
VM). I've more than once seen VACUUM used to try to do some integrity
checking of the database. How are we actually going to test that the
feature works correctly? They'd have to write checks on top of
pg_visibility to see whether things are borked.
/*
* Compute whether we actually scanned the whole relation. If we did, we
* can adjust relfrozenxid and relminmxid.
*
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
Comment is out-of-date now.
- if (blkno == next_not_all_visible_block)
+ if (blkno == next_unskippable_block)
{
- /* Time to advance next_not_all_visible_block */
- for (next_not_all_visible_block++;
- next_not_all_visible_block < nblocks;
- next_not_all_visible_block++)
+ /* Time to advance next_unskippable_block */
+ for (next_unskippable_block++;
+ next_unskippable_block < nblocks;
+ next_unskippable_block++)
Hm. So we continue with the course of re-processing pages, even if
they're marked all-frozen. For all-visible there at least can be a
benefit by freezing earlier, but for all-frozen pages there's really no
point. I don't really buy the arguments for the skipping logic. But
even disregarding that, maybe we should skip processing a block if it's
all-frozen (without preventing the page from being read?); as there's no
possible benefit? Acquiring the exclusive/content lock and stuff is far
from free.
Not really related to this patch, but the FORCE_CHECK_PAGE is rather
ugly.
+ /*
+ * The current block is potentially skippable; if we've seen a
+ * long enough run of skippable blocks to justify skipping it, and
+ * we're not forced to check it, then go ahead and skip.
+ * Otherwise, the page must be at least all-visible if not
+ * all-frozen, so we can set all_visible_according_to_vm = true.
+ */
+ if (skipping_blocks && !FORCE_CHECK_PAGE())
+ {
+ /*
+ * Tricky, tricky. If this is in aggressive vacuum, the page
+ * must have been all-frozen at the time we checked whether it
+ * was skippable, but it might not be any more. We must be
+ * careful to count it as a skipped all-frozen page in that
+ * case, or else we'll think we can't update relfrozenxid and
+ * relminmxid. If it's not an aggressive vacuum, we don't
+ * know whether it was all-frozen, so we have to recheck; but
+ * in this case an approximate answer is OK.
+ */
+ if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+ vacrelstats->frozenskipped_pages++;
continue;
+ }
Hm. This indeed seems a bit tricky. Not sure how to make it easier
though without just ripping out the SKIP_PAGES_THRESHOLD stuff.
Hm. This also doubles the number of VM accesses. While I guess that's
not noticeable most of the time, it's still not nice; especially when a
large relation is entirely frozen, because it'll mean we'll sequentially
go through the visibility map twice.
I wondered for a minute whether #14057 could cause really bad issues
here
/messages/by-id/20160331103739.8956.94469@wrigleys.postgresql.org
but I don't see it being more relevant here.
Andres
Hi,
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
7087166 pg_upgrade: Convert old visibility map format to new format.
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
...
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
..
Uh, shouldn't we actually fail if we read incompletely? Rather than
silently ignoring the problem? Ok, this causes no corruption, but it
indicates that something went significantly wrong.
+ char new_vmbuf[BLCKSZ];
+ char *new_cur = new_vmbuf;
+ bool empty = true;
+ bool old_lastpart;
+
+ /* Copy page header in advance */
+ memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);
Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
with old_lastpart && !empty, right?
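A minimal sketch of the suggested fix: zero the whole output page before laying down the header, so any tail the conversion loop leaves unwritten contains zeroes rather than stack garbage. Function name and the header-size constant here are illustrative, not the pg_upgrade code:

```c
#include <string.h>

#define BLCKSZ 8192
#define SizeOfPageHeaderData 24   /* illustrative value */

static void init_new_vm_page(char *new_vmbuf, const char *pageheader)
{
    /* Zero first, then copy the page header in advance: bytes past the
     * last converted bitmap byte stay zero instead of being undefined. */
    memset(new_vmbuf, 0, BLCKSZ);
    memcpy(new_vmbuf, pageheader, SizeOfPageHeaderData);
}
```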
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ close(src_fd);
+ return getErrorText();
+ }
I know you guys copied this, but what's the force thing about?
Especially as it's always set to true by the callers (i.e. what is the
parameter even about?)? Wouldn't we at least have to specify O_TRUNC in
the force case?
+ old_cur += BITS_PER_HEAPBLOCK_OLD;
+ new_cur += BITS_PER_HEAPBLOCK;
I'm not sure I'm understanding the point of the BITS_PER_HEAPBLOCK_OLD
stuff - as long as it's hardcoded into rewriteVisibilityMap() we'll not
be able to have differing ones anyway, should we decide to add a third
bit?
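For reference, the conversion step itself amounts to widening each old 1-bit entry into a 2-bit slot: one old byte (8 heap blocks) becomes two new bytes, with the old all-visible bit placed in the new slot's all-visible position and the all-frozen bit left clear. A hedged sketch (convert_vm_byte is a hypothetical helper, not the pg_upgrade code; low byte covers the first 4 heap blocks):

```c
#include <stdint.h>

#define BITS_PER_HEAPBLOCK_OLD 1
#define BITS_PER_HEAPBLOCK     2

/* Expand one old-format map byte into a 16-bit pair of new-format bytes.
 * Old bit i (block i's all-visible bit) lands at new bit 2*i; the
 * adjacent all-frozen bit at 2*i+1 stays zero. */
static uint16_t convert_vm_byte(uint8_t old)
{
    uint16_t out = 0;

    for (int i = 0; i < 8; i++)
        if (old & (1 << i))
            out |= (uint16_t) 1 << (i * BITS_PER_HEAPBLOCK);
    return out;
}
```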
- Andres
On Mon, May 2, 2016 at 8:25 PM, Andres Freund <andres@anarazel.de> wrote:
+ * heap_tuple_needs_eventual_freeze
+ *
+ * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
+ * will eventually require freezing. Similar to heap_tuple_needs_freeze,
+ * but there's no cutoff, since we're trying to figure out whether freezing
+ * will ever be needed, not whether it's needed now.
+ */
+bool
+heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)

Wouldn't redefining this to heap_tuple_is_frozen() and then inverting the
checks be easier to understand?
I thought it much safer to keep this as close to a copy of
heap_tuple_needs_freeze() as possible. Copying a function and
inverting all of the return values is much more likely to introduce
bugs, IME.
+ /*
+ * If xmax is a valid xact or multixact, this tuple is also not frozen.
+ */
+ if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+ {
+ MultiXactId multi;
+
+ multi = HeapTupleHeaderGetRawXmax(tuple);
+ if (MultiXactIdIsValid(multi))
+ return true;
+ }

Hm. What's the test inside the if() for? There shouldn't be any case
where xmax is invalid if HEAP_XMAX_IS_MULTI is set. Now there's a
check like that outside of this commit, but it seems strange to me
(Alvaro, perhaps you could comment on this?).
Here again I was copying existing code, with appropriate simplifications.
+ *
+ * Clearing both visibility map bits is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
* replay of the updating operation as well.

I think including "both" here makes things less clear, because it
differentiates clearing one bit from clearing both. There's no practical
difference atm, but still.
I agree.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
*

I think the remaining sentence isn't entirely accurate, there's now more
than one bit, and they're different with regard to scan_all/!scan_all
vacuums (or will be - maybe this is updated further in a later commit? But
if so, that sentence shouldn't yet be removed...).
We can adjust the language, but I don't really see a big problem here.
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-
Hm, why was this moved to the header? Sounds like something the outside
shouldn't care about.
Oh... yeah. Let's undo that.
#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
Hm. This isn't really a mapping to an individual bit anymore - but I
don't really have a better name in mind. Maybe TO_OFFSET?
Well, it sorta is... but we could change it, I suppose.
+static const uint8 number_of_ones_for_visible[256] = {
...
+};
+static const uint8 number_of_ones_for_frozen[256] = {
...
};

Did somebody verify the new contents are correct?
I admit that I didn't. It seemed like an unlikely place for a goof,
but I guess we should verify.
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*

This seems rather easy to misunderstand, as this really only clears all
the bits for one page, not actually all the bits.
We could change "in" to "for one page in the".
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass flags
+ * for which it needs to check the value in visibility map.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +351,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,

I'm not seeing what flags the above comment change is referring to?
Ugh. I think that's leftover cruft from an earlier patch version that
should have been excised from what got committed.
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single byte read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}

Not a new issue, and *very* likely to be irrelevant in practice (given
the value is only referenced once): But there's really no guarantee
map[mapByte] is only read once here.
Meh. But we can fix if you want to.
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)

Not really a new issue again: The parameter types (previously return
type) to this function seem wrong to me.
Not this patch's job to tinker.
@@ -1934,5 +1992,14 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}

+ /*
+ * We don't bother clearing *all_frozen when the page is discovered not
+ * to be all-visible, so do that now if necessary. The page might fail
+ * to be all-frozen for other reasons anyway, but if it's not all-visible,
+ * then it definitely isn't all-frozen.
+ */
+ if (!all_visible)
+ *all_frozen = false;
+

Why don't we just set *all_frozen to false when appropriate? It'd be
just as many lines and probably easier to understand?
I thought that looked really easy to mess up, either now or down the
road. This way seemed more solid to me. That's a judgement call, of
course.
+ /*
+ * If the page is marked as all-visible but not all-frozen, we should
+ * so mark it. Note that all_frozen is only valid if all_visible is
+ * true, so we must check both.
+ */

This kinda seems to imply that all-visible implies all_frozen. Also, why
has that block been added to the end of the if/else if chain? Seems like
it belongs below the (all_visible && !all_visible_according_to_vm) block.
We can adjust the comment a bit to make it more clear, if you like,
but I doubt it's going to cause serious misunderstanding. As for the
placement, the reason I put it at the end is because I figured that we
did not want to mark it all-frozen if any of the "oh crap, emit a
warning" cases applied.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
Nothing to say here.
fd31cd2 Don't vacuum all-frozen pages.
Hm. I do wonder if it's going to bite us that we don't have a way to
actually force vacuuming of the whole table (besides manually rm'ing the
VM). I've more than once seen VACUUM used to try to do some integrity
checking of the database. How are we actually going to test that the
feature works correctly? They'd have to write checks on top of
pg_visibility to see whether things are borked.
Let's add VACUUM (FORCE) or something like that.
/*
* Compute whether we actually scanned the whole relation. If we did, we
* can adjust relfrozenxid and relminmxid.
*
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/Comment is out-of-date now.
OK.
- if (blkno == next_not_all_visible_block)
+ if (blkno == next_unskippable_block)
{
- /* Time to advance next_not_all_visible_block */
- for (next_not_all_visible_block++;
- next_not_all_visible_block < nblocks;
- next_not_all_visible_block++)
+ /* Time to advance next_unskippable_block */
+ for (next_unskippable_block++;
+ next_unskippable_block < nblocks;
+ next_unskippable_block++)

Hm. So we continue with the course of re-processing pages, even if
they're marked all-frozen. For all-visible there at least can be a
benefit by freezing earlier, but for all-frozen pages there's really no
point. I don't really buy the arguments for the skipping logic. But
even disregarding that, maybe we should skip processing a block if it's
all-frozen (without preventing the page from being read?); as there's no
possible benefit? Acquiring the exclusive/content lock and stuff is far
from free.
I wanted to tinker with this logic as little as possible in the
interest of ending up with something that worked. I would not have
written it this way.
Not really related to this patch, but the FORCE_CHECK_PAGE is rather
ugly.
+1.
+ /*
+ * The current block is potentially skippable; if we've seen a
+ * long enough run of skippable blocks to justify skipping it, and
+ * we're not forced to check it, then go ahead and skip.
+ * Otherwise, the page must be at least all-visible if not
+ * all-frozen, so we can set all_visible_according_to_vm = true.
+ */
+ if (skipping_blocks && !FORCE_CHECK_PAGE())
+ {
+ /*
+ * Tricky, tricky. If this is in aggressive vacuum, the page
+ * must have been all-frozen at the time we checked whether it
+ * was skippable, but it might not be any more. We must be
+ * careful to count it as a skipped all-frozen page in that
+ * case, or else we'll think we can't update relfrozenxid and
+ * relminmxid. If it's not an aggressive vacuum, we don't
+ * know whether it was all-frozen, so we have to recheck; but
+ * in this case an approximate answer is OK.
+ */
+ if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+ vacrelstats->frozenskipped_pages++;
continue;
+ }

Hm. This indeed seems a bit tricky. Not sure how to make it easier
though without just ripping out the SKIP_PAGES_THRESHOLD stuff.
Yep, I had the same problem.
Hm. This also doubles the number of VM accesses. While I guess that's
not noticeable most of the time, it's still not nice; especially when a
large relation is entirely frozen, because it'll mean we'll sequentially
go through the visibility map twice.
Compared to what we're saving, that's obviously a trivial cost.
That's not to say that we might not want to improve it, but it's
hardly a disaster.
In short: wah, wah, wah.
I wondered for a minute whether #14057 could cause really bad issues
here
/messages/by-id/20160331103739.8956.94469@wrigleys.postgresql.org
but I don't see it being more relevant here.
I don't really understand what the concern is here, but if it's not a
problem, let's not spend time trying to clarify.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
7087166 pg_upgrade: Convert old visibility map format to new format.
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
...
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
..

Uh, shouldn't we actually fail if we read incompletely? Rather than
silently ignoring the problem? Ok, this causes no corruption, but it
indicates that something went significantly wrong.
Sure, that's reasonable.
+ char new_vmbuf[BLCKSZ];
+ char *new_cur = new_vmbuf;
+ bool empty = true;
+ bool old_lastpart;
+
+ /* Copy page header in advance */
+ memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);

Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
with old_lastpart && !empty, right?
Oh, dear. That seems like a possible data corruption bug. Maybe we'd
better fix that right away (although I don't actually have time before
the wrap).
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ close(src_fd);
+ return getErrorText();
+ }

I know you guys copied this, but what's the force thing about?
Especially as it's always set to true by the callers (i.e. what is the
parameter even about?)? Wouldn't we at least have to specify O_TRUNC in
the force case?
I just work here.
+ old_cur += BITS_PER_HEAPBLOCK_OLD;
+ new_cur += BITS_PER_HEAPBLOCK;

I'm not sure I'm understanding the point of the BITS_PER_HEAPBLOCK_OLD
stuff - as long as it's hardcoded into rewriteVisibilityMap() we'll not
be able to have differing ones anyway, should we decide to add a third
bit?
I think that's just a matter of style.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 05/06/2016 01:40 PM, Robert Haas wrote:
On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
Nothing to say here.
fd31cd2 Don't vacuum all-frozen pages.
Hm. I do wonder if it's going to bite us that we don't have a way to
actually force vacuuming of the whole table (besides manually rm'ing the
VM). I've more than once seen VACUUM used to try to do some integrity
checking of the database. How are we actually going to test that the
feature works correctly? They'd have to write checks on top of
pg_visibility to see whether things are borked.

Let's add VACUUM (FORCE) or something like that.
This is actually inverted. Vacuum by default should vacuum the entire
relation, however if we are going to keep the existing behavior of this
patch, VACUUM (FROZEN) seems to be better than (FORCE)?
Sincerely,
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 2016-05-06 13:48:09 -0700, Joshua D. Drake wrote:
On 05/06/2016 01:40 PM, Robert Haas wrote:
On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
Nothing to say here.
fd31cd2 Don't vacuum all-frozen pages.
Hm. I do wonder if it's going to bite us that we don't have a way to
actually force vacuuming of the whole table (besides manually rm'ing the
VM). I've more than once seen VACUUM used to try to do some integrity
checking of the database. How are we actually going to test that the
feature works correctly? They'd have to write checks on top of
pg_visibility to see whether things are borked.

Let's add VACUUM (FORCE) or something like that.
Yes, that makes sense.
This is actually inverted. Vacuum by default should vacuum the entire
relation
What? Why on earth would that be a good idea? Not to speak of the fact
that that's not been the case since ~8.4?
, however if we are going to keep the existing behavior of this
patch, VACUUM (FROZEN) seems to be better than (FORCE)?
There already is FREEZE - meaning something different - so I doubt it.
Andres
On 05/06/2016 01:50 PM, Andres Freund wrote:
Let's add VACUUM (FORCE) or something like that.
Yes, that makes sense.
This is actually inverted. Vacuum by default should vacuum the entire
relation

What? Why on earth would that be a good idea? Not to speak of the fact
that that's not been the case since ~8.4?
Sorry, I just meant the default behavior shouldn't change but I do agree
that we need the ability to keep the same behavior.
, however if we are going to keep the existing behavior of this
patch, VACUUM (FROZEN) seems to be better than (FORCE)?

There already is FREEZE - meaning something different - so I doubt it.
Yeah I thought about that, it is the word "FORCE" that bothers me. When
you use FORCE there is an assumption that no matter what, it plows
through (think rm -f). So if we don't use FROZEN, that's cool but FORCE
doesn't work either.
Sincerely,
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
* Joshua D. Drake (jd@commandprompt.com) wrote:
Yeah I thought about that, it is the word "FORCE" that bothers me.
When you use FORCE there is an assumption that no matter what, it
plows through (think rm -f). So if we don't use FROZEN, that's cool
but FORCE doesn't work either.
Isn't that exactly what this FORCE option being contemplated would do
though? Plow through the entire relation, regardless of what the VM
says is all frozen or not?
Seems like FORCE is a good word for that to me.
Thanks!
Stephen
On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote:
On 05/06/2016 01:50 PM, Andres Freund wrote:
Let's add VACUUM (FORCE) or something like that.
Yes, that makes sense.
This is actually inverted. Vacuum by default should vacuum the entire
relation

What? Why on earth would that be a good idea? Not to speak of the fact
that that's not been the case since ~8.4?

Sorry, I just meant the default behavior shouldn't change but I do agree
that we need the ability to keep the same behavior.
Which default behaviour shouldn't change? The one in master where we
skip known frozen pages? Or the released branches where we can't skip those?
, however if we are going to keep the existing behavior of this
patch, VACUUM (FROZEN) seems to be better than (FORCE)?

There already is FREEZE - meaning something different - so I doubt it.
Yeah I thought about that, it is the word "FORCE" that bothers me. When you
use FORCE there is an assumption that no matter what, it plows through
(think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work
either.
SCANALL?
On 05/06/2016 01:58 PM, Stephen Frost wrote:
* Joshua D. Drake (jd@commandprompt.com) wrote:
Yeah I thought about that, it is the word "FORCE" that bothers me.
When you use FORCE there is an assumption that no matter what, it
plows through (think rm -f). So if we don't use FROZEN, that's cool
but FORCE doesn't work either.

Isn't that exactly what this FORCE option being contemplated would do
though? Plow through the entire relation, regardless of what the VM
says is all frozen or not?

Seems like FORCE is a good word for that to me.
Except that we aren't FORCING a vacuum. That is the part I have
contention with. To me, FORCE means:
No matter what else is happening, we are vacuuming this relation (think
locks).
But I am also not going to dig in my heels. If that is truly what
-hackers come up with, thank you for at least considering what I said.
Sincerely,
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 05/06/2016 01:58 PM, Andres Freund wrote:
On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote:
On 05/06/2016 01:50 PM, Andres Freund wrote:
There already is FREEZE - meaning something different - so I doubt it.
Yeah I thought about that, it is the word "FORCE" that bothers me. When you
use FORCE there is an assumption that no matter what, it plows through
(think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work
either.
SCANALL?
VACUUM THEWHOLEDAMNTHING
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)
On 05/06/2016 02:01 PM, Josh berkus wrote:
On 05/06/2016 01:58 PM, Andres Freund wrote:
On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote:
On 05/06/2016 01:50 PM, Andres Freund wrote:
There already is FREEZE - meaning something different - so I doubt it.
Yeah I thought about that, it is the word "FORCE" that bothers me. When you
use FORCE there is an assumption that no matter what, it plows through
(think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work
either.
SCANALL?
VACUUM THEWHOLEDAMNTHING
I know that would never fly but damn if that wouldn't be an awesome
keyword for VACUUM.
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
* Josh berkus (josh@agliodbs.com) wrote:
On 05/06/2016 01:58 PM, Andres Freund wrote:
On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote:
On 05/06/2016 01:50 PM, Andres Freund wrote:
There already is FREEZE - meaning something different - so I doubt it.
Yeah I thought about that, it is the word "FORCE" that bothers me. When you
use FORCE there is an assumption that no matter what, it plows through
(think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work
either.
SCANALL?
VACUUM THEWHOLEDAMNTHING
+100
(hahahaha)
Thanks!
Stephen
On 2016-05-06 14:03:11 -0700, Joshua D. Drake wrote:
On 05/06/2016 02:01 PM, Josh berkus wrote:
On 05/06/2016 01:58 PM, Andres Freund wrote:
On 2016-05-06 13:54:09 -0700, Joshua D. Drake wrote:
On 05/06/2016 01:50 PM, Andres Freund wrote:
There already is FREEZE - meaning something different - so I doubt it.
Yeah I thought about that, it is the word "FORCE" that bothers me. When you
use FORCE there is an assumption that no matter what, it plows through
(think rm -f). So if we don't use FROZEN, that's cool but FORCE doesn't work
either.
SCANALL?
VACUUM THEWHOLEDAMNTHING
I know that would never fly but damn if that wouldn't be an awesome keyword
for VACUUM.
It bothers me more than it probably should: Nobody tests, reviews,
whatever a complex patch with significant data-loss potential. But as
soon somebody dares to mention an option name...
On 05/06/2016 02:03 PM, Stephen Frost wrote:
VACUUM THEWHOLEDAMNTHING
+100
(hahahaha)
You know what? Why not? Seriously? We aren't a product. This is supposed
to be a bit fun. Let's have some fun with it? It would be so easy to
turn that into a positive advocacy opportunity.
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 05/06/2016 02:08 PM, Andres Freund wrote:
It bothers me more than it probably should: Nobody tests, reviews,
whatever a complex patch with significant data-loss potential. But as
soon somebody dares to mention an option name...
Definitely more than it should, because it's gonna happen *every* time.
https://en.wikipedia.org/wiki/Law_of_triviality
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)
On 2016-05-06 14:10:04 -0700, Josh berkus wrote:
On 05/06/2016 02:08 PM, Andres Freund wrote:
It bothers me more than it probably should: Nobody tests, reviews,
whatever a complex patch with significant data-loss potential. But as
soon somebody dares to mention an option name...
Definitely more than it should, because it's gonna happen *every* time.
Doesn't mean it should not be frowned upon.
On 05/06/2016 02:12 PM, Andres Freund wrote:
On 2016-05-06 14:10:04 -0700, Josh berkus wrote:
On 05/06/2016 02:08 PM, Andres Freund wrote:
It bothers me more than it probably should: Nobody tests, reviews,
whatever a complex patch with significant data-loss potential. But as
soon somebody dares to mention an option name...
Definitely more than it should, because it's gonna happen *every* time.
Doesn't mean it should not be frowned upon.
Or made light of, hence my post. Personally I don't care what the
option is called, as long as we have docs for it.
For the serious testing, does anyone have a good technique for creating
loads which would stress-test vacuum freezing? It's hard for me to come
up with anything which wouldn't be very time-and-resource intensive
(like running at 10,000 TPS for a week).
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)
On 05/06/2016 02:08 PM, Andres Freund wrote:
VACUUM THEWHOLEDAMNTHING
I know that would never fly but damn if that wouldn't be an awesome keyword
for VACUUM.
It bothers me more than it probably should: Nobody tests, reviews,
whatever a complex patch with significant data-loss potential. But as
soon somebody dares to mention an option name...
That is a fair complaint but let me ask you something:
How do I test?
Is there a script I can run? Are there specific things I can do to try
and break it? What are we looking for exactly?
A lot of -hackers seem to forget that although we have 100 -hackers, we
have 10000 "consultant/practitioners". Could I read the code and with a
weekend of WTF and -hackers questions figure out what is going on? Yes,
but a lot of people couldn't, and I don't have the time.
You want me (or people like me) to test more? Give us an easy way to do
it. Otherwise, we do what we can, which is try and interface on the
things that will directly and immediately affect us (like keywords and
syntax).
Sincerely,
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 2016-05-06 14:15:47 -0700, Josh berkus wrote:
For the serious testing, does anyone have a good technique for creating
loads which would stress-test vacuum freezing? It's hard for me to come
up with anything which wouldn't be very time-and-resource intensive
(like running at 10,000 TPS for a week).
I've changed the limits for freezing options a while back, so you can
now set autovacuum_freeze_max_age as low as 100000 (best set
vacuum_freeze_table_age accordingly). You'll have to come up with a
workload that doesn't overwrite all data continuously (otherwise
there'll never be old rows), but otherwise it should now be fairly easy
to test that kind of scenario.
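To make that concrete, a test-only postgresql.conf fragment along these lines (the values are illustrative, not recommendations; the 100000 floor for autovacuum_freeze_max_age is the lowered limit mentioned above) makes freezing activity show up quickly on a small cluster:

```ini
# Illustrative settings to trigger freezing quickly in a test cluster only.
autovacuum_freeze_max_age = 100000   # force anti-wraparound vacuums early
vacuum_freeze_table_age   = 50000    # make regular vacuums aggressive sooner
vacuum_freeze_min_age     = 1000     # freeze tuples almost immediately
autovacuum_naptime        = 10s      # check tables for work frequently
```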
Andres
Hi,
On 2016-05-06 14:17:13 -0700, Joshua D. Drake wrote:
How do I test?
Is there a script I can run?
Unfortunately there are few interesting things to test with pre-made
scripts. There's no relevant OS dependency here, so each already
existing test doesn't really lead to significantly increased coverage
being run by other people. Generally, when testing for correctness
issues, it's often of limited benefit to run tests written by the author
or reviewer - such scripts will usually just test things either has
thought of. The dangerous areas are the ones neither author nor reviewer
has considered.
Are there specific things I can do to try and break it?
Upgrade clusters using pg_upgrade and make sure things like index only
scans still work and yield correct data. Set up workloads that involve
freezing, and check that less WAL (and not more!) is generated with 9.6
than with 9.5. Make sure queries still work.
What are we looking for exactly?
Data corruption, efficiency problems.
A lot of -hackers seem to forget that although we have 100 -hackers, we have
10000 "consultant/practitioners". Could I read the code and with a weekend
of WTF and -hackers questions figure out what is going on, yes but a lot of
people couldn't and I don't have the time.
I think tests without reading the code are quite sensible and
important. And it perfectly makes sense to ask for information about
what to test. But fundamentally testing is a lot of work, as is writing
and reviewing code; unless you're really really good at destructive
testing, you won't find much in a 15 minute break.
You want me (or people like me) to test more? Give us an easy way to
do it.
Useful additional testing and easy just don't go well together. By the
time I have made it easy I've done the testing that's needed.
Otherwise, we do what we can, which is try and interface on the things that
will directly and immediately affect us (like keywords and syntax).
The amount of bikeshedding on -hackers steals energy and time from
actually working on stuff, including testing. So I have little sympathy
for the amount of bike shedding done.
Greetings,
Andres Freund
Joshua D. Drake wrote:
On 05/06/2016 01:40 PM, Robert Haas wrote:
On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
Nothing to say here.
fd31cd2 Don't vacuum all-frozen pages.
Hm. I do wonder if it's going to bite us that we don't have a way to
actually force vacuuming of the whole table (besides manually rm'ing the
VM). I've more than once seen VACUUM used to try to do some integrity
checking of the database. How are we actually going to test that the
feature works correctly? They'd have to write checks on top of
pg_visibility to see whether things are borked.
Let's add VACUUM (FORCE) or something like that.
This is actually inverted. Vacuum by default should vacuum the entire
relation, however if we are going to keep the existing behavior of this
patch, VACUUM (FROZEN) seems to be better than (FORCE)?
Prior to some 7.x release, VACUUM actually did what we ripped out in
the 9.0 release as VACUUM FULL. We actually changed the mode of operation
quite heavily into the "lazy" mode which didn't acquire access exclusive
lock, and it was a huge relief. I think that changing the mode of
operation to be the lightest possible thing that gets the job done
is convenient for users, because their existing scripts continue to
clean their tables, only they take less time. No need to tweak the
maintenance scripts.
I don't know what happens when the freeze_table_age threshold is
reached. Do we scan the whole table when that happens? Because if we
do, then we don't need a new keyword: just invoke the command after
lowering the setting.
Another question on this feature is what happens with the table age
(relfrozenxid, relminmxid) when the table is not wholly scanned by
vacuum.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund wrote:
On 2016-05-06 14:17:13 -0700, Joshua D. Drake wrote:
How do I test?
Is there a script I can run?
Unfortunately there's few interesting things to test with pre-made
scripts. There's no relevant OS dependency here, so each already
existing test doesn't really lead to significantly increased coverage
being run by other people. Generally, when testing for correctness
issues, it's often of limited benefit to run tests written by the author
or reviewer - such scripts will usually just test things either has
thought of. The dangerous areas are the ones neither author nor reviewer
has considered.
We touched this question in connection with multixact freezing and
wraparound. Testers seem to want to be given a script that they can
install and run, then go for a beer and get back to a bunch of errors to
report. But it doesn't work that way; writing a useful test script
requires a lot of effort. Jeff Janes has done astounding work in these
matters. (I don't think we credit him enough for that.)
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 05/06/2016 02:29 PM, Andres Freund wrote:
Hi,
On 2016-05-06 14:17:13 -0700, Joshua D. Drake wrote:
How do I test?
Is there a script I can run?
Unfortunately there's few interesting things to test with pre-made
scripts. There's no relevant OS dependency here, so each already
existing test doesn't really lead to significantly increased coverage
being run by other people. Generally, when testing for correctness
issues, it's often of limited benefit to run tests written by the author
or reviewer - such scripts will usually just test things either has
thought of. The dangerous areas are the ones neither author nor reviewer
has considered.
I can't argue with that.
Are there specific things I can do to try and break it?
Upgrade clusters using pg_upgrade and make sure things like index only
scans still work and yield correct data. Set up workloads that involve
freezing, and check that less WAL (and not more!) is generated with 9.6
than with 9.5. Make sure queries still work.
What are we looking for exactly?
Data corruption, efficiency problems.
I am really not trying to be difficult here but Data Corruption is an
easy one... what is the metric we accept as an efficiency problem?
A lot of -hackers seem to forget that although we have 100 -hackers, we have
10000 "consultant/practitioners". Could I read the code and with a weekend
of WTF and -hackers questions figure out what is going on, yes but a lot of
people couldn't and I don't have the time.
I think tests without reading the code are quite sensible and
important. And it perfectly makes sense to ask for information about
what to test. But fundamentally testing is a lot of work, as is writing
and reviewing code; unless you're really really good at destructive
testing, you won't find much in a 15 minute break.
Yes, this is true but with a proper testing framework, I don't need a 15
minute break. I need 1 hour to configure, the rest just "happens" and
reports back.
I have cycles to test, I have team members to help test (as do *lots* of
other people) but sometimes we just get lost in how to help.
You want me (or people like me) to test more? Give us an easy way to
do it.
Useful additional testing and easy just don't go well together. By the
time I have made it easy I've done the testing that's needed.
I don't know that I can agree with this. A proper harness allows you to
execute: go.sh and boom... 2, 4, even 8 hours later you get a report. I
will not argue that it isn't easy to implement but I know it can be done.
Otherwise, we do what we can, which is try and interface on the things that
will directly and immediately affect us (like keywords and syntax).
The amount of bikeshedding on -hackers steals energy and time from
actually working on stuff, including testing. So I have little sympathy
for the amount of bike shedding done.
Ensuring a reasonable and thought-out interface for users is not bike
shedding, it is at least as important and possibly more important than
any feature we add.
Sincerely,
JD
Greetings,
Andres Freund
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 2016-05-06 14:39:57 -0700, Joshua D. Drake wrote:
What are we looking for exactly?
Data corruption, efficiency problems.
I am really not trying to be difficult here but Data Corruption is an easy
one... what is the metric we accept as an efficiency problem?
That's indeed not easy to define. In this case I'd say vacuums taking
longer, index only scans being slower, more WAL being generated would
count?
I think tests without reading the code are quite sensible and
important. And it perfectly makes sense to ask for information about
what to test. But fundamentally testing is a lot of work, as is writing
and reviewing code; unless you're really really good at destructive
testing, you won't find much in a 15 minute break.
Yes, this is true but with a proper testing framework, I don't need a 15
minute break. I need 1 hour to configure, the rest just "happens" and
reports back.
That only works if somebody writes such tests. And in that case the
tester having run will often suffice (until related changes are being
made). I'm not arguing against introducing more tests into the codebase
- I'm rather fervently for that. But that really isn't what's going to
avoid issues like this feature (or multixact) causing problems, because
those tests will just test what the author thought of.
You want me (or people like me) to test more? Give us an easy way to
do it.
Useful additional testing and easy just don't go well together. By the
time I have made it easy I've done the testing that's needed.
I don't know that I can agree with this. A proper harness allows you to
execute: go.sh and boom... 2, 4, even 8 hours later you get a report. I will
not argue that it isn't easy to implement but I know it can be done.
The problem is that the contents of go.sh are the much more relevant
part than the 8 hours.
Greetings,
Andres Freund
On 2016-05-06 18:36:52 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
On 2016-05-06 14:17:13 -0700, Joshua D. Drake wrote:
How do I test?
Is there a script I can run?
Unfortunately there's few interesting things to test with pre-made
scripts. There's no relevant OS dependency here, so each already
existing test doesn't really lead to significantly increased coverage
being run by other people. Generally, when testing for correctness
issues, it's often of limited benefit to run tests written by the author
or reviewer - such scripts will usually just test things either has
thought of. The dangerous areas are the ones neither author nor reviewer
has considered.
We touched this question in connection with multixact freezing and
wraparound. Testers seem to want to be given a script that they can
install and run, then go for a beer and get back to a bunch of errors to
report. But it doesn't work that way; writing a useful test script
requires a lot of effort.
Right. And once written, often enough running it on a lot more instances
only marginally increases the coverage.
Jeff Janes has done astounding work in these matters. (I don't think
we credit him enough for that.)
+many.
On 05/06/2016 02:48 PM, Andres Freund wrote:
On 2016-05-06 14:39:57 -0700, Joshua D. Drake wrote:
Yes, this is true but with a proper testing framework, I don't need a 15
minute break. I need 1 hour to configure, the rest just "happens" and
reports back.
That only works if somebody writes such tests.
Agreed.
And in that case the
tester having run will often suffice (until related changes are being
made). I'm not arguing against introducing more tests into the codebase
- I'm rather fervently for that. But that really isn't what's going to
avoid issues like this feature (or multixact) causing problems, because
those tests will just test what the author thought of.
Good point. I am not sure how to address the alternative though.
You want me (or people like me) to test more? Give us an easy way to
do it.
Useful additional testing and easy just don't go well together. By the
time I have made it easy I've done the testing that's needed.
I don't know that I can agree with this. A proper harness allows you to
execute: go.sh and boom... 2, 4, even 8 hours later you get a report. I will
not argue that it isn't easy to implement but I know it can be done.
The problem is that the contents of go.sh are the much more relevant
part than the 8 hours.
True.
Please don't misunderstand, I am not saying this is "easy". I just hope
that it is something we work for.
Sincerely,
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On 2016-05-06 18:31:03 -0300, Alvaro Herrera wrote:
I don't know what happens when the freeze_table_age threshold is
reached.
We scan all non-frozen pages, whereas we earlier had to scan all pages.
That's really both the significant benefit, and the danger. Because if
we screw up the all-frozen bits in the visibilitymap, we'll be screwed
soon after.
Do we scan the whole table when that happens?
No, there's atm no way to force a whole-table vacuum, besides manually
rm'ing the _vm fork.
Another question on this feature is what happens with the table age
(relfrozenxid, relminmxid) when the table is not wholly scanned by
vacuum.
Basically we increase the horizons whenever scanning all pages that are
not known to be frozen (+ potentially some frozen ones due to the
skipping logic). Without that there'd really not be a point in the
freeze map feature, as we'd continue to have the expensive
anti-wraparound vacuums.
Andres
On Fri, May 6, 2016 at 2:49 PM, Andres Freund <andres@anarazel.de> wrote:
Jeff Janes has done astounding work in these matters. (I don't think
we credit him enough for that.)
+many.
Agreed. I'm a huge fan of what Jeff has been able to do in this area.
I often say so. It would be even better if Jeff's approach to testing
was followed as an example by other people, but I wouldn't bet on it
ever happening. It requires real persistence and deep understanding to
do well.
--
Peter Geoghegan
On Sat, May 7, 2016 at 8:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, May 2, 2016 at 8:25 PM, Andres Freund <andres@anarazel.de> wrote:
+static const uint8 number_of_ones_for_visible[256] = { ... };
+static const uint8 number_of_ones_for_frozen[256] = { ... };
Did somebody verify the new contents are correct?
I admit that I didn't. It seemed like an unlikely place for a goof,
but I guess we should verify.
Looks correct. The tables match the output of the attached script.
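The attached script itself isn't reproduced in the archive; as a sketch, a stand-alone check of the same property could look like this (assuming, per the committed VM format, four 2-bit pairs per byte with the all-visible bit in the low position and the all-frozen bit in the high position of each pair):

```python
# Rebuild the two lookup tables from first principles: each visibility-map
# byte covers 4 heap pages, 2 bits per page (low bit = all-visible,
# high bit = all-frozen), so each table entry is a masked popcount.
VISIBLE_MASK = 0x55  # 0b01010101: the all-visible bit of each 2-bit pair
FROZEN_MASK = 0xAA   # 0b10101010: the all-frozen bit of each 2-bit pair

def ones(byte, mask):
    """Count how many of the 4 pages in this byte have the masked bit set."""
    return bin(byte & mask).count("1")

number_of_ones_for_visible = [ones(b, VISIBLE_MASK) for b in range(256)]
number_of_ones_for_frozen = [ones(b, FROZEN_MASK) for b in range(256)]

# Spot checks: all bits set, none set, and the two alternating patterns.
assert number_of_ones_for_visible[0xFF] == 4
assert number_of_ones_for_frozen[0xFF] == 4
assert number_of_ones_for_visible[0x00] == 0
assert number_of_ones_for_visible[0x55] == 4 and number_of_ones_for_frozen[0x55] == 0
assert number_of_ones_for_visible[0xAA] == 0 and number_of_ones_for_frozen[0xAA] == 4
```

Comparing these generated lists against the C initializers is then a mechanical diff.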
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
On 2016-05-07 10:00:27 +1200, Thomas Munro wrote:
On Sat, May 7, 2016 at 8:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Did somebody verify the new contents are correct?
I admit that I didn't. It seemed like an unlikely place for a goof,
but I guess we should verify.
Looks correct. The tables match the output of the attached script.
Great!
Alvaro Herrera wrote:
We touched this question in connection with multixact freezing and
wraparound. Testers seem to want to be given a script that they can
install and run, then go for a beer and get back to a bunch of errors to
report.
Here I spent some time trying to explain what to test to try and find
certain multixact bugs
/messages/by-id/20150605213832.GZ133018@postgresql.org
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, May 7, 2016 at 6:00 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
On 05/06/2016 01:58 PM, Stephen Frost wrote:
* Joshua D. Drake (jd@commandprompt.com) wrote:
Yeah I thought about that, it is the word "FORCE" that bothers me.
When you use FORCE there is an assumption that no matter what, it
plows through (think rm -f). So if we don't use FROZEN, that's cool
but FORCE doesn't work either.
Isn't that exactly what this FORCE option being contemplated would do
though? Plow through the entire relation, regardless of what the VM
says is all frozen or not?
Seems like FORCE is a good word for that to me.
Except that we aren't FORCING a vacuum. That is the part I have contention
with. To me, FORCE means:
No matter what else is happening, we are vacuuming this relation (think
locks).
But I am also not going to dig in my heels. If that is truly what -hackers
come up with, thank you at least considering what I said.
Sincerely,
JD
As Joshua mentioned, the FORCE word might imply doing VACUUM while plowing
through locks.
I guess that it might confuse users.
IMO, since this option will be for emergencies, the SCANALL word works for me.
Or other ideas are,
VACUUM IGNOREVM
VACUUM RESCURE
Regards,
--
Masahiko Sawada
On Sat, May 7, 2016 at 11:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, May 7, 2016 at 6:00 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
On 05/06/2016 01:58 PM, Stephen Frost wrote:
* Joshua D. Drake (jd@commandprompt.com) wrote:
Yeah I thought about that, it is the word "FORCE" that bothers me.
When you use FORCE there is an assumption that no matter what, it
plows through (think rm -f). So if we don't use FROZEN, that's cool
but FORCE doesn't work either.
Isn't that exactly what this FORCE option being contemplated would do
though? Plow through the entire relation, regardless of what the VM
says is all frozen or not?
Seems like FORCE is a good word for that to me.
Except that we aren't FORCING a vacuum. That is the part I have contention
with. To me, FORCE means:
No matter what else is happening, we are vacuuming this relation (think
locks).
But I am also not going to dig in my heels. If that is truly what -hackers
come up with, thank you at least considering what I said.
Sincerely,
JD
As Joshua mentioned, the FORCE word might imply doing VACUUM while plowing
through locks.
I guess that it might confuse users.
IMO, since this option will be for emergencies, the SCANALL word works for me.
Or other ideas are,
VACUUM IGNOREVM
VACUUM RESCURE
Oops, VACUUM RESCUE is correct.
Regards,
--
Masahiko Sawada
On Sun, May 8, 2016 at 3:18 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, May 7, 2016 at 11:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, May 7, 2016 at 6:00 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
On 05/06/2016 01:58 PM, Stephen Frost wrote:
* Joshua D. Drake (jd@commandprompt.com) wrote:
Yeah I thought about that, it is the word "FORCE" that bothers me.
When you use FORCE there is an assumption that no matter what, it
plows through (think rm -f). So if we don't use FROZEN, that's cool
but FORCE doesn't work either.
Isn't that exactly what this FORCE option being contemplated would do
though? Plow through the entire relation, regardless of what the VM
says is all frozen or not?
Seems like FORCE is a good word for that to me.
Except that we aren't FORCING a vacuum. That is the part I have contention
with. To me, FORCE means:
No matter what else is happening, we are vacuuming this relation (think
locks).
But I am also not going to dig in my heels. If that is truly what -hackers
come up with, thank you at least considering what I said.
Sincerely,
JD
As Joshua mentioned, the FORCE word might imply doing VACUUM while plowing
through locks.
I guess that it might confuse users.
IMO, since this option will be for emergencies, the SCANALL word works for me.
Or other ideas are,
VACUUM IGNOREVM
VACUUM RESCURE
Oops, VACUUM RESCUE is correct.
Attached draft patch adds a SCANALL option to VACUUM in order to scan
all pages forcibly, ignoring the visibility map information.
The option name is SCANALL for now, but we could change it once we have consensus.
Regards,
--
Masahiko Sawada
Attachments:
vacuum_scanall_v1.patchtext/x-patch; charset=US-ASCII; name=vacuum_scanall_v1.patchDownload
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 19fd748..130a70e 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -21,9 +21,9 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
-VACUUM [ ( { FULL | FREEZE | VERBOSE | ANALYZE } [, ...] ) ] [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
-VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ <replaceable class="PARAMETER">table_name</replaceable> ]
-VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
+VACUUM [ ( { FULL | FREEZE | VERBOSE | ANALYZE | SCANALL } [, ...] ) ] [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
+VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ SCANALL ] [ <replaceable class="PARAMETER">table_name</replaceable> ]
+VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ SCANALL ] ANALYZE [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
</synopsis>
</refsynopsisdiv>
@@ -120,6 +120,17 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ <replaceable class="PARAMETER">
</varlistentry>
<varlistentry>
+ <term><literal>SCANALL</literal></term>
+ <listitem>
+ <para>
+ Forces vacuum to scan all pages, ignoring the visibility map.
+ A full scan of all pages is always performed when the table is
+ rewritten, so this option is redundant when <literal>FULL</> is specified.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
<term><literal>ANALYZE</literal></term>
<listitem>
<para>
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 426e756..85e04ac 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -138,7 +138,7 @@ static BufferAccessStrategy vac_strategy;
/* non-export function prototypes */
static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool aggressive);
+ Relation *Irel, int nindexes, bool aggressive, bool scan_all);
static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
static void lazy_vacuum_index(Relation indrel,
@@ -185,6 +185,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
double read_rate,
write_rate;
bool aggressive; /* should we scan all unfrozen pages? */
+ bool scan_all; /* should we scan all pages forcibly? */
bool scanned_all_unfrozen; /* actually scanned all such pages? */
TransactionId xidFullScanLimit;
MultiXactId mxactFullScanLimit;
@@ -233,6 +234,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
mxactFullScanLimit);
+ /* If SCANALL option is specified, we have to scan all pages forcibly */
+ scan_all = options & VACOPT_SCANALL;
+
vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
@@ -246,14 +250,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
vacrelstats->hasindex = (nindexes > 0);
/* Do the vacuuming */
- lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, aggressive);
+ lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, aggressive, scan_all);
/* Done with indexes */
vac_close_indexes(nindexes, Irel, NoLock);
/*
- * Compute whether we actually scanned the whole relation. If we did, we
- * can adjust relfrozenxid and relminmxid.
+ * Compute whether we actually scanned the whole relation. If we did,
+ * we can adjust relfrozenxid and relminmxid.
*
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
@@ -261,7 +265,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
if ((vacrelstats->scanned_pages + vacrelstats->frozenskipped_pages)
< vacrelstats->rel_pages)
{
- Assert(!aggressive);
+ Assert(!aggressive && !scan_all);
scanned_all_unfrozen = false;
}
else
@@ -442,7 +446,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
*/
static void
lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool aggressive)
+ Relation *Irel, int nindexes, bool aggressive, bool scan_all)
{
BlockNumber nblocks,
blkno;
@@ -513,6 +517,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* such pages do not need freezing and do not affect the value that we can
* safely set for relfrozenxid or relminmxid.
*
+ * When scan_all is set, we scan all pages forcibly, ignoring the
+ * visibility map status, after which we can safely advance relfrozenxid
+ * and relminmxid.
+ *
* Before entering the main loop, establish the invariant that
* next_unskippable_block is the next block number >= blkno that's not we
* can't skip based on the visibility map, either all-visible for a
@@ -639,11 +647,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* The current block is potentially skippable; if we've seen a
* long enough run of skippable blocks to justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least all-visible if not
- * all-frozen, so we can set all_visible_according_to_vm = true.
+ * SCANALL option is not specified, and we're not forced to check it,
+ * then go ahead and skip. Otherwise, the page must be at least
+ * all-visible if not all-frozen, so we can set
+ * all_visible_according_to_vm = true.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks && !scan_all && !FORCE_CHECK_PAGE())
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
@@ -1316,6 +1325,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
"Skipped %u pages due to buffer pins.\n",
vacrelstats->pinskipped_pages),
vacrelstats->pinskipped_pages);
+
appendStringInfo(&buf, ngettext("%u page is entirely empty.\n",
"%u pages are entirely empty.\n",
empty_pages),
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 18ec5f0..f73ca47 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -408,7 +408,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <node> overlay_placing substr_from substr_for
%type <boolean> opt_instead
-%type <boolean> opt_unique opt_concurrently opt_verbose opt_full
+%type <boolean> opt_unique opt_concurrently opt_verbose opt_full opt_scanall
%type <boolean> opt_freeze opt_default opt_recheck
%type <defelt> opt_binary opt_oids copy_delimiter
@@ -626,7 +626,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
RESET RESTART RESTRICT RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROW ROWS RULE
- SAVEPOINT SCHEMA SCROLL SEARCH SECOND_P SECURITY SELECT SEQUENCE SEQUENCES
+ SAVEPOINT SCHEMA SCROLL SCANALL SEARCH SECOND_P SECURITY SELECT SEQUENCE SEQUENCES
SERIALIZABLE SERVER SESSION SESSION_USER SET SETS SETOF SHARE SHOW
SIMILAR SIMPLE SKIP SMALLINT SNAPSHOT SOME SQL_P STABLE STANDALONE_P START
STATEMENT STATISTICS STDIN STDOUT STORAGE STRICT_P STRIP_P SUBSTRING
@@ -9299,7 +9299,7 @@ cluster_index_specification:
*
*****************************************************************************/
-VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
+VacuumStmt: VACUUM opt_full opt_freeze opt_verbose opt_scanall
{
VacuumStmt *n = makeNode(VacuumStmt);
n->options = VACOPT_VACUUM;
@@ -9309,11 +9309,13 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
n->options |= VACOPT_FREEZE;
if ($4)
n->options |= VACOPT_VERBOSE;
+ if ($5)
+ n->options |= VACOPT_SCANALL;
n->relation = NULL;
n->va_cols = NIL;
$$ = (Node *)n;
}
- | VACUUM opt_full opt_freeze opt_verbose qualified_name
+ | VACUUM opt_full opt_freeze opt_verbose opt_scanall qualified_name
{
VacuumStmt *n = makeNode(VacuumStmt);
n->options = VACOPT_VACUUM;
@@ -9323,13 +9325,15 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
n->options |= VACOPT_FREEZE;
if ($4)
n->options |= VACOPT_VERBOSE;
- n->relation = $5;
+ if ($5)
+ n->options |= VACOPT_SCANALL;
+ n->relation = $6;
n->va_cols = NIL;
$$ = (Node *)n;
}
- | VACUUM opt_full opt_freeze opt_verbose AnalyzeStmt
+ | VACUUM opt_full opt_freeze opt_verbose opt_scanall AnalyzeStmt
{
- VacuumStmt *n = (VacuumStmt *) $5;
+ VacuumStmt *n = (VacuumStmt *) $6;
n->options |= VACOPT_VACUUM;
if ($2)
n->options |= VACOPT_FULL;
@@ -9337,6 +9341,8 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
n->options |= VACOPT_FREEZE;
if ($4)
n->options |= VACOPT_VERBOSE;
+ if ($5)
+ n->options |= VACOPT_SCANALL;
$$ = (Node *)n;
}
| VACUUM '(' vacuum_option_list ')'
@@ -9369,6 +9375,7 @@ vacuum_option_elem:
| VERBOSE { $$ = VACOPT_VERBOSE; }
| FREEZE { $$ = VACOPT_FREEZE; }
| FULL { $$ = VACOPT_FULL; }
+ | SCANALL { $$ = VACOPT_SCANALL; }
;
AnalyzeStmt:
@@ -9411,7 +9418,9 @@ opt_full: FULL { $$ = TRUE; }
opt_freeze: FREEZE { $$ = TRUE; }
| /*EMPTY*/ { $$ = FALSE; }
;
-
+opt_scanall: SCANALL { $$ = TRUE; }
+ | /* EMPTY */ { $$ = FALSE; }
+ ;
opt_name_list:
'(' name_list ')' { $$ = $2; }
| /*EMPTY*/ { $$ = NIL; }
@@ -14083,6 +14092,7 @@ type_func_name_keyword:
| OUTER_P
| OVERLAPS
| RIGHT
+ | SCANALL
| SIMILAR
| TABLESAMPLE
| VERBOSE
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 714cf15..fc6338d 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2822,7 +2822,8 @@ typedef enum VacuumOption
VACOPT_FREEZE = 1 << 3, /* FREEZE option */
VACOPT_FULL = 1 << 4, /* FULL (non-concurrent) vacuum */
VACOPT_NOWAIT = 1 << 5, /* don't wait to get lock (autovacuum only) */
- VACOPT_SKIPTOAST = 1 << 6 /* don't process the TOAST table, if any */
+ VACOPT_SKIPTOAST = 1 << 6, /* don't process the TOAST table, if any */
+ VACOPT_SCANALL = 1 << 7 /* SCANALL option */
} VacuumOption;
typedef struct VacuumStmt
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 17ffef5..04214b0 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -335,6 +335,7 @@ PG_KEYWORD("row", ROW, COL_NAME_KEYWORD)
PG_KEYWORD("rows", ROWS, UNRESERVED_KEYWORD)
PG_KEYWORD("rule", RULE, UNRESERVED_KEYWORD)
PG_KEYWORD("savepoint", SAVEPOINT, UNRESERVED_KEYWORD)
+PG_KEYWORD("scanall", SCANALL, TYPE_FUNC_NAME_KEYWORD)
PG_KEYWORD("schema", SCHEMA, UNRESERVED_KEYWORD)
PG_KEYWORD("scroll", SCROLL, UNRESERVED_KEYWORD)
PG_KEYWORD("search", SEARCH, UNRESERVED_KEYWORD)
On Tue, May 3, 2016 at 6:48 AM, Andres Freund <andres@anarazel.de> wrote:
fd31cd2 Don't vacuum all-frozen pages.
-	appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+	appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
 					 vacrelstats->pages_removed,
 					 vacrelstats->rel_pages,
-					 vacrelstats->pinskipped_pages);
+					 vacrelstats->pinskipped_pages,
+					 vacrelstats->frozenskipped_pages);
The verbose information about skipped frozen pages is emitted only by
autovacuum.
But I think that this information is also helpful for manual vacuum.
Please find attached a patch which fixes that.
Regards,
--
Masahiko Sawada
Attachments:
lazy_scan_heap_verbose_output.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 426e756..fa6e5fa 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1316,6 +1316,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
"Skipped %u pages due to buffer pins.\n",
vacrelstats->pinskipped_pages),
vacrelstats->pinskipped_pages);
+ appendStringInfo(&buf, _("Skipped %u frozen pages.\n"),
+ vacrelstats->frozenskipped_pages);
appendStringInfo(&buf, ngettext("%u page is entirely empty.\n",
"%u pages are entirely empty.\n",
empty_pages),
On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached draft patch adds SCANALL option to VACUUM in order to scan
all pages forcibly while ignoring visibility map information.
The option name is SCANALL for now but we could change it after we get consensus.
If we're going to go that way, I'd say it should be scan_all rather
than scanall. Makes it clearer, at least IMHO.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, May 9, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached draft patch adds SCANALL option to VACUUM in order to scan
all pages forcibly while ignoring visibility map information.
The option name is SCANALL for now but we could change it after we get consensus.
If we're going to go that way, I'd say it should be scan_all rather
than scanall. Makes it clearer, at least IMHO.
Just to add some diversity to opinions, maybe there should be a
separate command for performing integrity checks. Currently the best
ways to actually verify database correctness do so as a side effect.
The question that I get pretty much every time after I explain why we
have data checksums, is "how do I check that they are correct" and we
don't have a nice answer for that now. We could also use some ways to
sniff out corrupted rows that don't involve crashing the server in a
loop. Vacuuming pages that supposedly don't need vacuuming just to
verify integrity seems very much in the same vein.
I know right now isn't exactly the best time to hastily slap on such a
feature, but I just wanted the thought to be out there for
consideration.
Regards,
Ants Aasma
On Mon, May 9, 2016 at 7:40 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
On Mon, May 9, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached draft patch adds SCANALL option to VACUUM in order to scan
all pages forcibly while ignoring visibility map information.
The option name is SCANALL for now but we could change it after we get consensus.
If we're going to go that way, I'd say it should be scan_all rather
than scanall. Makes it clearer, at least IMHO.
Just to add some diversity to opinions, maybe there should be a
separate command for performing integrity checks. Currently the best
ways to actually verify database correctness do so as a side effect.
The question that I get pretty much every time after I explain why we
have data checksums, is "how do I check that they are correct" and we
don't have a nice answer for that now. We could also use some ways to
sniff out corrupted rows that don't involve crashing the server in a
loop. Vacuuming pages that supposedly don't need vacuuming just to
verify integrity seems very much in the same vein.
I know right now isn't exactly the best time to hastily slap on such a
feature, but I just wanted the thought to be out there for
consideration.
I think that it's quite reasonable to have ways of performing an
integrity check that are separate from VACUUM, but this is about
having a way to force VACUUM to scan all-frozen pages - and it's hard
to imagine that we want a different command name for that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 10, 2016 at 11:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, May 9, 2016 at 7:40 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
On Mon, May 9, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached draft patch adds SCANALL option to VACUUM in order to scan
all pages forcibly while ignoring visibility map information.
The option name is SCANALL for now but we could change it after we get consensus.
If we're going to go that way, I'd say it should be scan_all rather
than scanall. Makes it clearer, at least IMHO.
Just to add some diversity to opinions, maybe there should be a
separate command for performing integrity checks. Currently the best
ways to actually verify database correctness do so as a side effect.
The question that I get pretty much every time after I explain why we
have data checksums, is "how do I check that they are correct" and we
don't have a nice answer for that now. We could also use some ways to
sniff out corrupted rows that don't involve crashing the server in a
loop. Vacuuming pages that supposedly don't need vacuuming just to
verify integrity seems very much in the same vein.
I know right now isn't exactly the best time to hastily slap on such a
feature, but I just wanted the thought to be out there for
consideration.
I think that it's quite reasonable to have ways of performing an
integrity check that are separate from VACUUM, but this is about
having a way to force VACUUM to scan all-frozen pages
Or a second way I came up with is having a tool to safely remove a particular
_vm file, executed via SQL or a client tool like
pg_resetxlog.
Attached is an updated VACUUM SCAN_ALL patch.
Please find it.
Regards,
--
Masahiko Sawada
Attachments:
vacuum_scanall_v2.patch (text/x-patch; charset=US-ASCII)
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 19fd748..8f63fad 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -21,9 +21,9 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
-VACUUM [ ( { FULL | FREEZE | VERBOSE | ANALYZE } [, ...] ) ] [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
-VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ <replaceable class="PARAMETER">table_name</replaceable> ]
-VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
+VACUUM [ ( { FULL | FREEZE | VERBOSE | ANALYZE | SCAN_ALL } [, ...] ) ] [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
+VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ SCAN_ALL ] [ <replaceable class="PARAMETER">table_name</replaceable> ]
+VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ SCAN_ALL ] ANALYZE [ <replaceable class="PARAMETER">table_name</replaceable> [ (<replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ] ]
</synopsis>
</refsynopsisdiv>
@@ -120,6 +120,17 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ <replaceable class="PARAMETER">
</varlistentry>
<varlistentry>
+ <term><literal>SCAN_ALL</literal></term>
+ <listitem>
+ <para>
+ Forces vacuum to scan all pages, ignoring the visibility map.
+ A full scan of all pages is always performed when the table is
+ rewritten, so this option is redundant when <literal>FULL</> is specified.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
<term><literal>ANALYZE</literal></term>
<listitem>
<para>
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 426e756..eee93c4 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -138,7 +138,7 @@ static BufferAccessStrategy vac_strategy;
/* non-export function prototypes */
static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool aggressive);
+ Relation *Irel, int nindexes, bool aggressive, bool scan_all);
static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
static void lazy_vacuum_index(Relation indrel,
@@ -185,6 +185,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
double read_rate,
write_rate;
bool aggressive; /* should we scan all unfrozen pages? */
+ bool scan_all; /* should we scan all pages forcibly? */
bool scanned_all_unfrozen; /* actually scanned all such pages? */
TransactionId xidFullScanLimit;
MultiXactId mxactFullScanLimit;
@@ -233,6 +234,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
mxactFullScanLimit);
+ /* If SCAN_ALL option is specified, we have to scan all pages forcibly */
+ scan_all = options & VACOPT_SCANALL;
+
vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
@@ -246,14 +250,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
vacrelstats->hasindex = (nindexes > 0);
/* Do the vacuuming */
- lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, aggressive);
+ lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, aggressive, scan_all);
/* Done with indexes */
vac_close_indexes(nindexes, Irel, NoLock);
/*
- * Compute whether we actually scanned the whole relation. If we did, we
- * can adjust relfrozenxid and relminmxid.
+ * Compute whether we actually scanned the whole relation. If we did,
+ * we can adjust relfrozenxid and relminmxid.
*
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
@@ -261,7 +265,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
if ((vacrelstats->scanned_pages + vacrelstats->frozenskipped_pages)
< vacrelstats->rel_pages)
{
- Assert(!aggressive);
+ Assert(!aggressive && !scan_all);
scanned_all_unfrozen = false;
}
else
@@ -442,7 +446,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
*/
static void
lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool aggressive)
+ Relation *Irel, int nindexes, bool aggressive, bool scan_all)
{
BlockNumber nblocks,
blkno;
@@ -513,6 +517,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* such pages do not need freezing and do not affect the value that we can
* safely set for relfrozenxid or relminmxid.
*
+ * When scan_all is set, we scan all pages forcibly, ignoring the
+ * visibility map status, after which we can safely advance relfrozenxid
+ * and relminmxid.
+ *
* Before entering the main loop, establish the invariant that
* next_unskippable_block is the next block number >= blkno that's not we
* can't skip based on the visibility map, either all-visible for a
@@ -639,11 +647,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* The current block is potentially skippable; if we've seen a
* long enough run of skippable blocks to justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least all-visible if not
- * all-frozen, so we can set all_visible_according_to_vm = true.
+ * SCAN_ALL option is not specified, and we're not forced to check it,
+ * then go ahead and skip. Otherwise, the page must be at least
+ * all-visible if not all-frozen, so we can set
+ * all_visible_according_to_vm = true.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks && !scan_all && !FORCE_CHECK_PAGE())
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
@@ -1316,6 +1325,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
"Skipped %u pages due to buffer pins.\n",
vacrelstats->pinskipped_pages),
vacrelstats->pinskipped_pages);
+
appendStringInfo(&buf, ngettext("%u page is entirely empty.\n",
"%u pages are entirely empty.\n",
empty_pages),
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 18ec5f0..085a6f5 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -408,7 +408,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <node> overlay_placing substr_from substr_for
%type <boolean> opt_instead
-%type <boolean> opt_unique opt_concurrently opt_verbose opt_full
+%type <boolean> opt_unique opt_concurrently opt_verbose opt_full opt_scanall
%type <boolean> opt_freeze opt_default opt_recheck
%type <defelt> opt_binary opt_oids copy_delimiter
@@ -626,7 +626,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
RESET RESTART RESTRICT RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROW ROWS RULE
- SAVEPOINT SCHEMA SCROLL SEARCH SECOND_P SECURITY SELECT SEQUENCE SEQUENCES
+ SAVEPOINT SCHEMA SCROLL SCANALL SEARCH SECOND_P SECURITY SELECT SEQUENCE SEQUENCES
SERIALIZABLE SERVER SESSION SESSION_USER SET SETS SETOF SHARE SHOW
SIMILAR SIMPLE SKIP SMALLINT SNAPSHOT SOME SQL_P STABLE STANDALONE_P START
STATEMENT STATISTICS STDIN STDOUT STORAGE STRICT_P STRIP_P SUBSTRING
@@ -9299,7 +9299,7 @@ cluster_index_specification:
*
*****************************************************************************/
-VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
+VacuumStmt: VACUUM opt_full opt_freeze opt_verbose opt_scanall
{
VacuumStmt *n = makeNode(VacuumStmt);
n->options = VACOPT_VACUUM;
@@ -9309,11 +9309,13 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
n->options |= VACOPT_FREEZE;
if ($4)
n->options |= VACOPT_VERBOSE;
+ if ($5)
+ n->options |= VACOPT_SCANALL;
n->relation = NULL;
n->va_cols = NIL;
$$ = (Node *)n;
}
- | VACUUM opt_full opt_freeze opt_verbose qualified_name
+ | VACUUM opt_full opt_freeze opt_verbose opt_scanall qualified_name
{
VacuumStmt *n = makeNode(VacuumStmt);
n->options = VACOPT_VACUUM;
@@ -9323,13 +9325,15 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
n->options |= VACOPT_FREEZE;
if ($4)
n->options |= VACOPT_VERBOSE;
- n->relation = $5;
+ if ($5)
+ n->options |= VACOPT_SCANALL;
+ n->relation = $6;
n->va_cols = NIL;
$$ = (Node *)n;
}
- | VACUUM opt_full opt_freeze opt_verbose AnalyzeStmt
+ | VACUUM opt_full opt_freeze opt_verbose opt_scanall AnalyzeStmt
{
- VacuumStmt *n = (VacuumStmt *) $5;
+ VacuumStmt *n = (VacuumStmt *) $6;
n->options |= VACOPT_VACUUM;
if ($2)
n->options |= VACOPT_FULL;
@@ -9337,6 +9341,8 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
n->options |= VACOPT_FREEZE;
if ($4)
n->options |= VACOPT_VERBOSE;
+ if ($5)
+ n->options |= VACOPT_SCANALL;
$$ = (Node *)n;
}
| VACUUM '(' vacuum_option_list ')'
@@ -9369,6 +9375,7 @@ vacuum_option_elem:
| VERBOSE { $$ = VACOPT_VERBOSE; }
| FREEZE { $$ = VACOPT_FREEZE; }
| FULL { $$ = VACOPT_FULL; }
+ | SCAN_ALL { $$ = VACOPT_SCANALL; }
;
AnalyzeStmt:
@@ -9411,7 +9418,9 @@ opt_full: FULL { $$ = TRUE; }
opt_freeze: FREEZE { $$ = TRUE; }
| /*EMPTY*/ { $$ = FALSE; }
;
-
+opt_scanall: SCAN_ALL { $$ = TRUE; }
+ | /* EMPTY */ { $$ = FALSE; }
+ ;
opt_name_list:
'(' name_list ')' { $$ = $2; }
| /*EMPTY*/ { $$ = NIL; }
@@ -14083,6 +14092,7 @@ type_func_name_keyword:
| OUTER_P
| OVERLAPS
| RIGHT
+ | SCANALL
| SIMILAR
| TABLESAMPLE
| VERBOSE
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 714cf15..fc6338d 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2822,7 +2822,8 @@ typedef enum VacuumOption
VACOPT_FREEZE = 1 << 3, /* FREEZE option */
VACOPT_FULL = 1 << 4, /* FULL (non-concurrent) vacuum */
VACOPT_NOWAIT = 1 << 5, /* don't wait to get lock (autovacuum only) */
- VACOPT_SKIPTOAST = 1 << 6 /* don't process the TOAST table, if any */
+ VACOPT_SKIPTOAST = 1 << 6, /* don't process the TOAST table, if any */
+ VACOPT_SCANALL = 1 << 7 /* SCANALL option */
} VacuumOption;
typedef struct VacuumStmt
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 17ffef5..04214b0 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -335,6 +335,7 @@ PG_KEYWORD("row", ROW, COL_NAME_KEYWORD)
PG_KEYWORD("rows", ROWS, UNRESERVED_KEYWORD)
PG_KEYWORD("rule", RULE, UNRESERVED_KEYWORD)
PG_KEYWORD("savepoint", SAVEPOINT, UNRESERVED_KEYWORD)
+PG_KEYWORD("scanall", SCANALL, TYPE_FUNC_NAME_KEYWORD)
PG_KEYWORD("schema", SCHEMA, UNRESERVED_KEYWORD)
PG_KEYWORD("scroll", SCROLL, UNRESERVED_KEYWORD)
PG_KEYWORD("search", SEARCH, UNRESERVED_KEYWORD)
On 5/6/16 4:20 PM, Andres Freund wrote:
On 2016-05-06 14:15:47 -0700, Josh berkus wrote:
For the serious testing, does anyone have a good technique for creating
loads which would stress-test vacuum freezing? It's hard for me to come
up with anything which wouldn't be very time-and-resource intensive
(like running at 10,000 TPS for a week).
I've changed the limits for freezing options a while back, so you can
now set autovacuum_freeze_max_age as low as 100000 (best set
vacuum_freeze_table_age accordingly). You'll have to come up with a
workload that doesn't overwrite all data continuously (otherwise
there'll never be old rows), but otherwise it should now be fairly easy
to test that kind of scenario.
There's also been a tool for forcibly advancing XID floating around for
quite some time. Using that could have the added benefit of verifying
anti-wrap still works correctly. (Might be worth testing mxid wrap too...)
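For reference, the knobs Andres mentions could be set along these lines in postgresql.conf for a test cluster (values are illustrative only; 100000 is the floor for autovacuum_freeze_max_age):

```
# Testing-only settings: make freezing happen quickly
autovacuum_freeze_max_age = 100000   # floor value; triggers anti-wraparound vacuum early
vacuum_freeze_table_age = 50000      # whole-table (aggressive) vacuum kicks in sooner
vacuum_freeze_min_age = 0            # freeze tuples at the first opportunity
```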
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 5/6/16 4:55 PM, Peter Geoghegan wrote:
On Fri, May 6, 2016 at 2:49 PM, Andres Freund <andres@anarazel.de> wrote:
Jeff Janes has done astounding work in these matters. (I don't think
we credit him enough for that.)
+many.
Agreed. I'm a huge fan of what Jeff has been able to do in this area.
I often say so. It would be even better if Jeff's approach to testing
was followed as an example by other people, but I wouldn't bet on it
ever happening. It requires real persistence and deep understanding to
do well.
It takes deep understanding to *design* the tests, not to write them.
There's a lot of folks out there that will never understand enough to
design tests meant to expose data corruption but who could easily code
someone else's design, especially if we provided tools/ways to tweak a
cluster to make testing easier/faster (such as artificially advancing
XID/MXID).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 5/6/16 4:08 PM, Joshua D. Drake wrote:
VACUUM THEWHOLEDAMNTHING
+100
(hahahaha)
You know what? Why not? Seriously? We aren't product. This is supposed
to be a bit fun. Let's have some fun with it? It would be so easy to
turn that into a positive advocacy opportunity.
Honestly, for an option this obscure, I agree. I don't think we'd want
any normally used stuff named so glibly, but I sure as heck could have
used some easter-eggs like this when I was doing training.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 5/10/16 11:42 PM, Jim Nasby wrote:
On 5/6/16 4:55 PM, Peter Geoghegan wrote:
On Fri, May 6, 2016 at 2:49 PM, Andres Freund <andres@anarazel.de> wrote:
Jeff Janes has done astounding work in these matters. (I don't think
we credit him enough for that.)
+many.
Agreed. I'm a huge fan of what Jeff has been able to do in this area.
I often say so. It would be even better if Jeff's approach to testing
was followed as an example by other people, but I wouldn't bet on it
ever happening. It requires real persistence and deep understanding to
do well.
It takes deep understanding to *design* the tests, not to write them.
There's a lot of folks out there that will never understand enough to
design tests meant to expose data corruption but who could easily code
someone else's design, especially if we provided tools/ways to tweak a
cluster to make testing easier/faster (such as artificially advancing
XID/MXID).
Speaking of which, another email in the thread made me realize that
there's a test condition no one has mentioned: verifying we don't lose
tuples after wraparound.
To test this, you'd want a table that's mostly frozen. Ideally, dirty a
single tuple on a bunch of frozen pages, with committed updates,
deletes, and un-committed inserts. Advance XID far enough to get you
close to wrap-around. Do a vacuum, SELECT count(*), advance XID past
wraparound, SELECT count(*) again and you should get the same number.
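A rough sketch of that test, where the table name, row counts, and the
XID-advance mechanism are all placeholders (pg_resetxlog -x can set the next
transaction ID on a stopped cluster):

```sql
-- Rough outline of the wraparound survival test; names and sizes are placeholders.
CREATE TABLE freeze_test AS
    SELECT g AS id, repeat('x', 100) AS pad FROM generate_series(1, 100000) g;
VACUUM FREEZE freeze_test;                              -- get pages marked all-frozen
UPDATE freeze_test SET pad = pad WHERE id % 1000 = 0;   -- dirty a tuple on some frozen pages
-- ...advance XID to just short of wraparound (e.g. pg_resetxlog -x on a
-- stopped cluster), then:
VACUUM freeze_test;
SELECT count(*) FROM freeze_test;                       -- remember this number
-- ...advance XID past wraparound, then:
SELECT count(*) FROM freeze_test;                       -- must match the earlier count
```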
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Tue, May 10, 2016 at 10:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Or second way I came up with is having tool to remove particular _vm
file safely, which is executed via SQL or client tool like
pg_resetxlog.
Attached updated VACUUM SCAN_ALL patch.
Please find it.
We should support scan_all only with the new-style options syntax for
VACUUM; that is, vacuum (scan_all) rename. That doesn't require
making scan_all a keyword, which is good: this is a minor feature, and
we don't want to bloat the parsing tables for it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 16, 2016 at 10:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 10, 2016 at 10:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Or second way I came up with is having tool to remove particular _vm
file safely, which is executed via SQL or client tool like
pg_resetxlog.
Attached updated VACUUM SCAN_ALL patch.
Please find it.
We should support scan_all only with the new-style options syntax for
VACUUM; that is, vacuum (scan_all) rename. That doesn't require
making scan_all a keyword, which is good: this is a minor feature, and
we don't want to bloat the parsing tables for it.
I agree with having new-style options syntax.
Isn't it better to have SCAN_ALL option without parentheses?
Syntaxes are;
VACUUM SCAN_ALL table_name;
VACUUM SCAN_ALL; -- for all tables on database
Regards,
--
Masahiko Sawada
Masahiko Sawada wrote:
On Mon, May 16, 2016 at 10:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
We should support scan_all only with the new-style options syntax for
VACUUM; that is, vacuum (scan_all) rename. That doesn't require
making scan_all a keyword, which is good: this is a minor feature, and
we don't want to bloat the parsing tables for it.
I agree with having new-style options syntax.
Isn't it better to have SCAN_ALL option without parentheses?
Syntaxes are;
VACUUM SCAN_ALL table_name;
VACUUM SCAN_ALL; -- for all tables on database
No, I agree with Robert that we shouldn't add any more such options to
avoid keyword proliferation.
Syntaxes are;
VACUUM (SCAN_ALL) table_name;
VACUUM (SCAN_ALL); -- for all tables on database
Is SCAN_ALL really the best we can do here? The business of having an
underscore in an option name has no precedent (other than
CURRENT_DATABASE and the like). How about COMPLETE, TOTAL, or WHOLE?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, May 17, 2016 at 3:32 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Masahiko Sawada wrote:
On Mon, May 16, 2016 at 10:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
We should support scan_all only with the new-style options syntax for
VACUUM; that is, vacuum (scan_all) rename. That doesn't require
making scan_all a keyword, which is good: this is a minor feature, and
we don't want to bloat the parsing tables for it.I agree with having new-style options syntax.
Isn't it better to have SCAN_ALL option without parentheses?Syntaxes are;
VACUUM SCAN_ALL table_name;
VACUUM SCAN_ALL; -- for all tables on database
No, I agree with Robert that we shouldn't add any more such options to
avoid keyword proliferation.
Syntaxes are;
VACUUM (SCAN_ALL) table_name;
VACUUM (SCAN_ALL); -- for all tables on database
Okay, I agree with this.
Is SCAN_ALL really the best we can do here? The business of having an
underscore in an option name has no precedent (other than
CURRENT_DATABASE and the like).
Another way would be to have a tool or function that removes the _vm file safely, for example.
How about COMPLETE, TOTAL, or WHOLE?
IMHO, I don't have a strong opinion about SCAN_ALL as long as we
document that option and the option name doesn't confuse users.
But ISTM that COMPLETE or TOTAL might mislead users into thinking a
normal vacuum doesn't do its job completely.
Regards,
--
Masahiko Sawada
On 05/17/2016 12:32 PM, Alvaro Herrera wrote:
Syntaxes are;
VACUUM (SCAN_ALL) table_name;
VACUUM (SCAN_ALL); -- for all tables on database
Is SCAN_ALL really the best we can do here? The business of having an
underscore in an option name has no precedent (other than
CURRENT_DATABASE and the like). How about COMPLETE, TOTAL, or WHOLE?
VACUUM (ANALYZE, VERBOSE, WHOLE)
....
That seems reasonable? I agree that SCAN_ALL doesn't fit. I am not
trying to pull a left turn but is there a technical reason we don't just
make FULL do this?
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On Tue, May 17, 2016 at 4:34 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
On 05/17/2016 12:32 PM, Alvaro Herrera wrote:
Syntaxes are;
VACUUM (SCAN_ALL) table_name;
VACUUM (SCAN_ALL); -- for all tables on database
Is SCAN_ALL really the best we can do here? The business of having an
underscore in an option name has no precedent (other than
CURRENT_DATABASE and the like). How about COMPLETE, TOTAL, or WHOLE?
VACUUM (ANALYZE, VERBOSE, WHOLE)
....
That seems reasonable? I agree that SCAN_ALL doesn't fit. I am not trying to
pull a left turn but is there a technical reason we don't just make FULL do
this?
The FULL option requires an AccessExclusiveLock, which could be a problem.
Regards,
--
Masahiko Sawada
On 17/05/16 21:32, Alvaro Herrera wrote:
Is SCAN_ALL really the best we can do here? The business of having an
underscore in an option name has no precedent (other than
CURRENT_DATABASE and the like).
ALTER DATABASE has options for ALLOW_CONNECTIONS, CONNECTION_LIMIT, and
IS_TEMPLATE.
How about COMPLETE, TOTAL, or WHOLE?
Sure, I'll play this game. I like EXHAUSTIVE.
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 18/05/16 09:34, Vik Fearing wrote:
On 17/05/16 21:32, Alvaro Herrera wrote:
Is SCAN_ALL really the best we can do here? The business of having an
underscore in an option name has no precedent (other than
CURRENT_DATABASE and the like).
ALTER DATABASE has options for ALLOW_CONNECTIONS, CONNECTION_LIMIT, and
IS_TEMPLATE.
How about COMPLETE, TOTAL, or WHOLE?
Sure, I'll play this game. I like EXHAUSTIVE.
I prefer 'WHOLE', as it seems more obvious (and not because of the pun
relating to 'wholesomeness'!!!)
On Tue, May 17, 2016 at 5:47 PM, Gavin Flower
<GavinFlower@archidevsys.co.nz> wrote:
On 18/05/16 09:34, Vik Fearing wrote:
On 17/05/16 21:32, Alvaro Herrera wrote:
Is SCAN_ALL really the best we can do here? The business of having an
underscore in an option name has no precedent (other than
CURRENT_DATABASE and the like).
ALTER DATABASE has options for ALLOW_CONNECTIONS, CONNECTION_LIMIT, and
IS_TEMPLATE.
How about COMPLETE, TOTAL, or WHOLE?
Sure, I'll play this game. I like EXHAUSTIVE.
I prefer 'WHOLE', as it seems more obvious (and not because of the pun
relating to 'wholesomeness'!!!)
I think that users might believe that they need VACUUM (WHOLE) a lot
more often than they will actually need this option. "Of course I
want to vacuum my whole table!"
I think we should give this a name that hints more strongly at this
being an exceptional thing, like vacuum (even_frozen_pages).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/18/16 6:37 AM, Robert Haas wrote:
On Tue, May 17, 2016 at 5:47 PM, Gavin Flower
<GavinFlower@archidevsys.co.nz> wrote:
On 18/05/16 09:34, Vik Fearing wrote:
On 17/05/16 21:32, Alvaro Herrera wrote:
Is SCAN_ALL really the best we can do here? The business of having an
underscore in an option name has no precedent (other than
CURRENT_DATABASE and the like).
ALTER DATABASE has options for ALLOW_CONNECTIONS, CONNECTION_LIMIT, and
IS_TEMPLATE.
How about COMPLETE, TOTAL, or WHOLE?
Sure, I'll play this game. I like EXHAUSTIVE.
I prefer 'WHOLE', as it seems more obvious (and not because of the pun
relating to 'wholesomeness'!!!)
I think that users might believe that they need VACUUM (WHOLE) a lot
more often than they will actually need this option. "Of course I
want to vacuum my whole table!"
I think we should give this a name that hints more strongly at this
being an exceptional thing, like vacuum (even_frozen_pages).
How about just FROZEN? Perhaps it's too confusing to have that and
FREEZE, but I thought I would throw it out there.
--
-David
david@pgmasters.net
On Wed, May 18, 2016 at 8:41 AM, David Steele <david@pgmasters.net> wrote:
I think we should give this a name that hints more strongly at this
being an exceptional thing, like vacuum (even_frozen_pages).
How about just FROZEN? Perhaps it's too confusing to have that and FREEZE,
but I thought I would throw it out there.
It's not a bad thought, but I do think it might be a bit confusing.
My main priority for this new option is that people aren't tempted to
use it very often, and I think a name like "even_frozen_pages" is more
likely to accomplish that than just "frozen".
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 05/18/2016 05:51 AM, Robert Haas wrote:
On Wed, May 18, 2016 at 8:41 AM, David Steele <david@pgmasters.net> wrote:
I think we should give this a name that hints more strongly at this
being an exceptional thing, like vacuum (even_frozen_pages).
How about just FROZEN? Perhaps it's too confusing to have that and FREEZE,
but I thought I would throw it out there.
It's not a bad thought, but I do think it might be a bit confusing.
My main priority for this new option is that people aren't tempted to
use it very often, and I think a name like "even_frozen_pages" is more
likely to accomplish that than just "frozen".
freeze_all_pages?
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On Wed, May 18, 2016 at 9:42 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
It's not a bad thought, but I do think it might be a bit confusing.
My main priority for this new option is that people aren't tempted to
use it very often, and I think a name like "even_frozen_pages" is more
likely to accomplish that than just "frozen".
freeze_all_pages?
No, that's what the existing FREEZE option does. This new option is
about unnecessarily vacuuming pages that don't need it. The
expectation is that vacuuming all-frozen pages will be a no-op.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
2016-05-18 16:45 GMT+03:00 Robert Haas <robertmhaas@gmail.com>:
No, that's what the existing FREEZE option does. This new option is
about unnecessarily vacuuming pages that don't need it. The
expectation is that vacuuming all-frozen pages will be a no-op.
VACUUM (INCLUDING ALL) ?
--
Victor Y. Yegorov
On 05/18/2016 09:55 AM, Victor Yegorov wrote:
2016-05-18 16:45 GMT+03:00 Robert Haas <robertmhaas@gmail.com
<mailto:robertmhaas@gmail.com>>:
No, that's what the existing FREEZE option does. This new option is
about unnecessarily vacuuming pages that don't need it. The
expectation is that vacuuming all-frozen pages will be a no-op.
VACUUM (INCLUDING ALL) ?
VACUUM (FORCE ALL) ?
Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
On Wed, May 18, 2016 at 7:09 AM, Joe Conway <mail@joeconway.com> wrote:
On 05/18/2016 09:55 AM, Victor Yegorov wrote:
2016-05-18 16:45 GMT+03:00 Robert Haas <robertmhaas@gmail.com
<mailto:robertmhaas@gmail.com>>:
No, that's what the existing FREEZE option does. This new option is
about unnecessarily vacuuming pages that don't need it. The
expectation is that vacuuming all-frozen pages will be a no-op.
VACUUM (INCLUDING ALL) ?
VACUUM (FORCE ALL) ?
How about going with something that says more about why we are doing
it, rather than trying to describe in one or two words what it is
doing?
VACUUM (FORENSIC)
VACUUM (DEBUG)
VACUUM (LINT)
Cheers,
Jeff
On Wed, May 18, 2016 at 8:52 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
How about going with something that says more about why we are doing
it, rather than trying to describe in one or two words what it is
doing?
VACUUM (FORENSIC)
VACUUM (DEBUG)
VACUUM (LINT)
+1
--
Peter Geoghegan
On 05/18/2016 03:51 PM, Peter Geoghegan wrote:
On Wed, May 18, 2016 at 8:52 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
How about going with something that says more about why we are doing
it, rather than trying to describe in one or two words what it is
doing?
VACUUM (FORENSIC)
VACUUM (DEBUG)
VACUUM (LINT)
+1
Maybe this is the wrong perspective. I mean, is there a reason we even
need this option, other than a lack of any other way to do a full table
scan to check for corruption, etc.? If we're only doing this for
integrity checking, then maybe it's better if it becomes a function,
which could be later extended with additional forensic features?
--
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)
Josh berkus <josh@agliodbs.com> writes:
Maybe this is the wrong perspective. I mean, is there a reason we even
need this option, other than a lack of any other way to do a full table
scan to check for corruption, etc.? If we're only doing this for
integrity checking, then maybe it's better if it becomes a function,
which could be later extended with additional forensic features?
Yes, I've been wondering that too. VACUUM is not meant as a corruption
checker, and should not be made into one, so what is the point of this
flag exactly?
(AFAIK, "select count(*) from table" would offer a similar amount of
sanity checking as a full-table VACUUM scan does, so it's not like
we've removed functionality with no near-term replacement.)
regards, tom lane
On 2016-05-18 18:25:39 -0400, Tom Lane wrote:
Josh berkus <josh@agliodbs.com> writes:
Maybe this is the wrong perspective. I mean, is there a reason we even
need this option, other than a lack of any other way to do a full table
scan to check for corruption, etc.? If we're only doing this for
integrity checking, then maybe it's better if it becomes a function,
which could be later extended with additional forensic features?
Yes, I've been wondering that too. VACUUM is not meant as a corruption
checker, and should not be made into one, so what is the point of this
flag exactly?
Well, so far a VACUUM FREEZE (or just setting vacuum_freeze_table_age =
0) verified the correctness of the visibility map; and that found a
number of bugs. Now visibilitymap grew additional responsibilities,
with a noticeable risk of data eating bugs, and there's no way to verify
whether visibilitymap's frozen bits are set correctly.
(AFAIK, "select count(*) from table" would offer a similar amount of
sanity checking as a full-table VACUUM scan does, so it's not like
we've removed functionality with no near-term replacement.)
I don't think that'd do anything comparable to
/*
* As of PostgreSQL 9.2, the visibility map bit should never be set if
* the page-level bit is clear. However, it's possible that the bit
* got cleared after we checked it and before we took the buffer
* content lock, so we must recheck before jumping to the conclusion
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
&& VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
If we had a checking module for all this it'd possibly be sufficient,
but we don't.
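For the all-visible side, the pg_visibility module added in ba0a198 can at
least surface VM/page-header disagreements without a full vacuum; a sketch
(the table name is a placeholder, and note there is no equivalent
cross-check for the frozen bit yet):

```sql
-- Sketch using contrib/pg_visibility: list pages whose VM all-visible bit
-- is set but whose page-level PD_ALL_VISIBLE flag is clear.
CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT blkno, all_visible, all_frozen, pd_all_visible
FROM pg_visibility('some_table'::regclass)   -- table name is a placeholder
WHERE all_visible AND NOT pd_all_visible;
```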
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2016-05-18 18:25:39 -0400, Tom Lane wrote:
Yes, I've been wondering that too. VACUUM is not meant as a corruption
checker, and should not be made into one, so what is the point of this
flag exactly?
Well, so far a VACUUM FREEZE (or just setting vacuum_freeze_table_age =
0) verified the correctness of the visibility map; and that found a
number of bugs. Now visibilitymap grew additional responsibilities,
with a noticeable risk of data eating bugs, and there's no way to verify
whether visibilitymap's frozen bits are set correctly.
Meh. I'm not sure we should grow a rather half-baked feature we'll never
be able to remove as a substitute for a separate sanity checker. The
latter is really the right place for this kind of thing.
regards, tom lane
On 2016-05-18 18:42:16 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2016-05-18 18:25:39 -0400, Tom Lane wrote:
Yes, I've been wondering that too. VACUUM is not meant as a corruption
checker, and should not be made into one, so what is the point of this
flag exactly?
Well, so far a VACUUM FREEZE (or just setting vacuum_freeze_table_age =
0) verified the correctness of the visibility map; and that found a
number of bugs. Now visibilitymap grew additional responsibilities,
with a noticeable risk of data eating bugs, and there's no way to verify
whether visibilitymap's frozen bits are set correctly.
Meh. I'm not sure we should grow a rather half-baked feature we'll never
be able to remove as a substitute for a separate sanity checker. The
latter is really the right place for this kind of thing.
It's not a new feature, it's a feature we removed as a side effect. And
one that allows us to evaluate whether the new feature actually works.
Andres Freund wrote:
(AFAIK, "select count(*) from table" would offer a similar amount of
sanity checking as a full-table VACUUM scan does, so it's not like
we've removed functionality with no near-term replacement.)
I don't think that'd do anything comparable to
/*
* As of PostgreSQL 9.2, the visibility map bit should never be set if
* the page-level bit is clear. However, it's possible that the bit
* got cleared after we checked it and before we took the buffer
* content lock, so we must recheck before jumping to the conclusion
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
&& VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}If we had a checking module for all this it'd possibly be sufficient,
but we don't.
Here's an idea. We need core-blessed extensions (src/extensions/, you
know I've proposed this before), so why not take this opportunity to
create our first such and make it carry a function to scan a table
completely to do this task.
Since we were considering a new VACUUM option, surely this is serious
enough to warrant more than just contrib.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 18, 2016 at 3:57 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Since we were considering a new VACUUM option, surely this is serious
enough to warrant more than just contrib.
I would like to see us consider the long-term best place for amcheck's
functionality at the same time. Ideally, verification would be a
somewhat generic operation, with AM-specific code invoked as
appropriate.
--
Peter Geoghegan
On Fri, May 06, 2016 at 04:42:48PM -0400, Robert Haas wrote:
On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
+ char new_vmbuf[BLCKSZ];
+ char *new_cur = new_vmbuf;
+ bool empty = true;
+ bool old_lastpart;
+
+ /* Copy page header in advance */
+ memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);
Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
with old_lastpart && !empty, right?
Oh, dear. That seems like a possible data corruption bug. Maybe we'd
better fix that right away (although I don't actually have time before
the wrap).
[This is a generic notification.]
The above-described topic is currently a PostgreSQL 9.6 open item. Robert,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping 9.6rc1. Consequently, I will appreciate your
efforts toward speedy resolution. Thanks.
[1]: /messages/by-id/20160527025039.GA447393@tornado.leadboat.com
On Sun, May 29, 2016 at 2:44 PM, Noah Misch <noah@leadboat.com> wrote:
On Fri, May 06, 2016 at 04:42:48PM -0400, Robert Haas wrote:
On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
+ char new_vmbuf[BLCKSZ];
+ char *new_cur = new_vmbuf;
+ bool empty = true;
+ bool old_lastpart;
+
+ /* Copy page header in advance */
+ memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);
Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
with old_lastpart && !empty, right?
Oh, dear. That seems like a possible data corruption bug. Maybe we'd
better fix that right away (although I don't actually have time before
the wrap).
[This is a generic notification.]
The above-described topic is currently a PostgreSQL 9.6 open item. Robert,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping 9.6rc1. Consequently, I will appreciate your
efforts toward speedy resolution. Thanks.
[1]: /messages/by-id/20160527025039.GA447393@tornado.leadboat.com
Thank you for the notification.
The check tool for the visibility map is still under discussion.
I'm going to address the other review comments and send the patch ASAP.
Regards,
--
Masahiko Sawada
On Wed, May 18, 2016 at 3:57 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Andres Freund wrote:
If we had a checking module for all this it'd possibly be sufficient,
but we don't.
Here's an idea. We need core-blessed extensions (src/extensions/, you
know I've proposed this before), so why not take this opportunity to
create our first such and make it carry a function to scan a table
completely to do this task.
Since we were considering a new VACUUM option, surely this is serious
enough to warrant more than just contrib.
What does "core-blessed" mean? The commit rights for contrib/ are the
What does "core-blessed" mean? The commit rights for contrib/ are the
same as they are for src/
Cheers,
Jeff
On Tue, May 31, 2016 at 4:40 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, May 18, 2016 at 3:57 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Andres Freund wrote:
If we had a checking module for all this it'd possibly be sufficient,
but we don't.
Here's an idea. We need core-blessed extensions (src/extensions/, you
know I've proposed this before), so why not take this opportunity to
create our first such and make it carry a function to scan a table
completely to do this task.
Since we were considering a new VACUUM option, surely this is serious
enough to warrant more than just contrib.
What does "core-blessed" mean? The commit rights for contrib/ are the
same as they are for src/
Personally, I understand contrib/ modules to be third-party plugins that
are considered not mature enough to be part of src/backend or
src/bin, but one day they could become so. See pg_upgrade's recent
move for example. src/extensions/ would include third-party plugins that
are thought to be useful and are part of the main server package, but are
not something that we want to enable by default.
--
Michael
On Sun, May 29, 2016 at 1:44 AM, Noah Misch <noah@leadboat.com> wrote:
On Fri, May 06, 2016 at 04:42:48PM -0400, Robert Haas wrote:
On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
+ char new_vmbuf[BLCKSZ];
+ char *new_cur = new_vmbuf;
+ bool empty = true;
+ bool old_lastpart;
+
+ /* Copy page header in advance */
+ memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);
Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
with old_lastpart && !empty, right?
Oh, dear. That seems like a possible data corruption bug. Maybe we'd
better fix that right away (although I don't actually have time before
the wrap).
[This is a generic notification.]
The above-described topic is currently a PostgreSQL 9.6 open item. Robert,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping 9.6rc1. Consequently, I will appreciate your
efforts toward speedy resolution. Thanks.
I am going to try to find time to look at this later this week, but
realistically it's going to be a little bit difficult to find that
time. I was away over Memorial Day weekend and was in meetings most
of today. I have a huge pile of email to catch up on. I will send
another status update no later than Friday. If Andres or anyone else
wants to jump in and fix this up meanwhile, that would be great.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, May 7, 2016 at 5:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, May 2, 2016 at 8:25 PM, Andres Freund <andres@anarazel.de> wrote:
+ * heap_tuple_needs_eventual_freeze
+ *
+ * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
+ * will eventually require freezing. Similar to heap_tuple_needs_freeze,
+ * but there's no cutoff, since we're trying to figure out whether freezing
+ * will ever be needed, not whether it's needed now.
+ */
+bool
+heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
Wouldn't redefining this to heap_tuple_is_frozen() and then inverting the
checks be easier to understand?
I thought it much safer to keep this as close to a copy of
heap_tuple_needs_freeze() as possible. Copying a function and
inverting all of the return values is much more likely to introduce
bugs, IME.
I agree.
+ /*
+ * If xmax is a valid xact or multixact, this tuple is also not frozen.
+ */
+ if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+ {
+ MultiXactId multi;
+
+ multi = HeapTupleHeaderGetRawXmax(tuple);
+ if (MultiXactIdIsValid(multi))
+ return true;
+ }
Hm. What's the test inside the if() for? There shouldn't be any case
where xmax is invalid if HEAP_XMAX_IS_MULTI is set. Now there's a
check like that outside of this commit, but it seems strange to me
(Alvaro, perhaps you could comment on this?).Here again I was copying existing code, with appropriate simplifications.
+ *
+ * Clearing both visibility map bits is not separately WAL-logged. The callers
+ * must make sure that whenever a bit is cleared, the bit is cleared on WAL
+ * replay of the updating operation as well.

I think including "both" here makes things less clear, because it
differentiates clearing one bit from clearing both. There's no practical
difference atm, but still.

I agree.

Fixed.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
 *

I think the remaining sentence isn't entirely accurate, there's now more
than one bit, and they're different with regard to scan_all/!scan_all
vacuums (or will be - maybe this is updated further in a later commit? But
if so, that sentence shouldn't yet be removed...).

We can adjust the language, but I don't really see a big problem here.

This comment is not incorporated in this patch so far.
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-
Hm, why was this moved to the header? Sounds like something the outside
shouldn't care about.

Oh... yeah. Let's undo that.
Fixed.
#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
Hm. This isn't really a mapping to an individual bit anymore - but I
don't really have a better name in mind. Maybe TO_OFFSET?

Well, it sorta is... but we could change it, I suppose.
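For reference, the addressing arithmetic being renamed here works out as follows; a sketch assuming the default BLCKSZ of 8192 and a 24-byte MAXALIGN'd page header (both of those are my assumptions, not values taken from the patch):

```c
/* Sketch of the two-bits-per-heap-block VM addressing discussed above.
 * BLCKSZ 8192 and a 24-byte MAXALIGN'd page header are assumptions here,
 * not values fixed by the patch. */
#define BLCKSZ               8192
#define PAGE_HEADER_SIZE     24          /* MAXALIGN(SizeOfPageHeaderData) */
#define MAPSIZE              (BLCKSZ - PAGE_HEADER_SIZE)
#define BITS_PER_BYTE        8
#define BITS_PER_HEAPBLOCK   2
#define HEAPBLOCKS_PER_BYTE  (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)   /* 4 */
#define HEAPBLOCKS_PER_PAGE  (MAPSIZE * HEAPBLOCKS_PER_BYTE)        /* 32672 */

/* heap block -> (VM page, byte within page, bit offset within byte) */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x)  (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_OFFSET(x)   (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
```

Each heap block thus occupies a two-bit slot, which is why "offset" describes the result better than "bit".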
+static const uint8 number_of_ones_for_visible[256] = {
...
+};
+static const uint8 number_of_ones_for_frozen[256] = {
...
+};

Did somebody verify the new contents are correct?

I admit that I didn't. It seemed like an unlikely place for a goof,
but I guess we should verify.

/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
 *

This seems rather easy to misunderstand, as this really only clears all
the bits for one page, not actually all the bits.

We could change "in" to "for one page in the".

Fixed.
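One way to verify the tables is to recompute the expected counts from the bit layout; a standalone sketch (the helper names are mine, and comparing against the committed 256-entry arrays is the actual verification step):

```c
/* Cross-check helpers for the number_of_ones_for_visible/_frozen tables:
 * with two bits per heap block (ALL_VISIBLE = 0x01, ALL_FROZEN = 0x02),
 * the entry for a byte must equal the number of 2-bit slots with the
 * relevant bit set. */
#include <stdint.h>

static int vm_ones_visible(uint8_t byte)
{
    int n = 0;
    for (int slot = 0; slot < 4; slot++)      /* 4 heap blocks per byte */
        if ((byte >> (slot * 2)) & 0x01)      /* VISIBILITYMAP_ALL_VISIBLE */
            n++;
    return n;
}

static int vm_ones_frozen(uint8_t byte)
{
    int n = 0;
    for (int slot = 0; slot < 4; slot++)
        if ((byte >> (slot * 2)) & 0x02)      /* VISIBILITYMAP_ALL_FROZEN */
            n++;
    return n;
}
```

Looping i over 0..255 and asserting `number_of_ones_for_visible[i] == vm_ones_visible(i)` (and likewise for frozen) would settle the question mechanically.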
 * the bit for heapBlk, or InvalidBuffer. The caller is responsible for
-* releasing *buf after it's done testing and setting bits.
+* releasing *buf after it's done testing and setting bits, and must pass flags
+* for which it needs to check the value in visibility map.
 *
 * NOTE: This function is typically called without a lock on the heap page,
 * so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +351,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,

I'm not seeing what flags the above comment change is referring to?

Ugh. I think that's leftover cruft from an earlier patch version that
should have been excised from what got committed.

Fixed.
 /*
- * A single-bit read is atomic.  There could be memory-ordering effects
+ * A single byte read is atomic.  There could be memory-ordering effects
  * here, but for performance reasons we make it the caller's job to worry
  * about that.
  */
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
 }

Not a new issue, and *very* likely to be irrelevant in practice (given
the value is only referenced once): But there's really no guarantee
map[mapByte] is only read once here.

Meh. But we can fix if you want to.

Fixed.
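The read-once concern can be addressed by copying the shared byte into a local before shifting; a sketch (the function name is mine, and volatile is used here only to make the single read explicit to the compiler):

```c
#include <stdint.h>

#define VISIBILITYMAP_VALID_BITS 0x03

/* Reads the shared VM byte into a local exactly once before extracting
 * the two status bits; the single assignment is the only access to map[]. */
static uint8_t vm_status(const volatile uint8_t *map, uint32_t mapByte,
                         uint8_t mapOffset)
{
    uint8_t b = map[mapByte];   /* one read of the shared byte */
    return (b >> mapOffset) & VISIBILITYMAP_VALID_BITS;
}
```

Storing into an intermediate variable, as the fix below does, achieves the same thing in practice even without volatile.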
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)

Not really a new issue again: The parameter types (previously return
type) to this function seem wrong to me.

Not this patch's job to tinker.

This comment is not incorporated in this patch yet.
@@ -1934,5 +1992,14 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
 }

+ /*
+  * We don't bother clearing *all_frozen when the page is discovered not
+  * to be all-visible, so do that now if necessary.  The page might fail
+  * to be all-frozen for other reasons anyway, but if it's not all-visible,
+  * then it definitely isn't all-frozen.
+  */
+ if (!all_visible)
+     *all_frozen = false;
+

Why don't we just set *all_frozen to false when appropriate? It'd be
just as many lines and probably easier to understand?

I thought that looked really easy to mess up, either now or down the
road. This way seemed more solid to me. That's a judgement call, of
course.

To make it easier to understand, I changed it that way.
+ /*
+  * If the page is marked as all-visible but not all-frozen, we should
+  * so mark it.  Note that all_frozen is only valid if all_visible is
+  * true, so we must check both.
+  */

This kinda seems to imply that all-visible implies all_frozen. Also, why
has that block been added to the end of the if/else if chain? Seems like
it belongs below the (all_visible && !all_visible_according_to_vm) block.

We can adjust the comment a bit to make it more clear, if you like,
but I doubt it's going to cause serious misunderstanding. As for the
placement, the reason I put it at the end is because I figured that we
did not want to mark it all-frozen if any of the "oh crap, emit a
warning" cases applied.

Fixed the comment.

I think that we should take care of the all-visible problem first, and then
take care of the all-frozen problem, so this patch doesn't change the placement.
The attached patch fixes only the above comments; the others are being addressed now.
--
Regards,
--
Masahiko Sawada
Attachments:
fix_freeze_map_a892234_v1.patchtext/x-diff; charset=US-ASCII; name=fix_freeze_map_a892234_v1.patchDownload
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index eaab4be..05422f1 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -33,7 +33,7 @@
* is set, we know the condition is true, but if a bit is not set, it might or
* might not be true.
*
- * Clearing both visibility map bits is not separately WAL-logged. The callers
+ * Clearing visibility map bits is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
* replay of the updating operation as well.
*
@@ -104,13 +104,16 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
+/* Number of heap blocks we can represent in one byte */
+#define HEAPBLOCKS_PER_BYTE (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)
+
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
/* tables for fast counting of set bits for visible and frozen */
static const uint8 number_of_ones_for_visible[256] = {
@@ -156,7 +159,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear all bits in visibility map
+ * visibilitymap_clear - clear all bits for one page in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -167,8 +170,8 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
+ int mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapOffset;
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -267,7 +270,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+ uint8 mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
Page page;
uint8 *map;
@@ -291,11 +294,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = (uint8 *)PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
+ if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))
{
START_CRIT_SECTION();
- map[mapByte] |= (flags << mapBit);
+ map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -338,8 +341,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits, and must pass flags
- * for which it needs to check the value in visibility map.
+ * releasing *buf after it's done testing and setting bits.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -353,8 +355,9 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+ uint8 mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
char *map;
+ uint8 result;
#ifdef TRACE_VISIBILITYMAP
elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
@@ -384,7 +387,8 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
+ result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+ return result;
}
/*
@@ -456,7 +460,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
/* last remaining block, byte, and bit */
BlockNumber truncBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
uint32 truncByte = HEAPBLK_TO_MAPBYTE(nheapblocks);
- uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
+ uint8 truncOffset = HEAPBLK_TO_OFFSET(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
@@ -478,7 +482,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
* because we don't get a chance to clear the bits if the heap is extended
* again.
*/
- if (truncByte != 0 || truncBit != 0)
+ if (truncByte != 0 || truncOffset != 0)
{
Buffer mapBuffer;
Page page;
@@ -511,7 +515,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
* ((1 << 7) - 1) = 01111111
*----
*/
- map[truncByte] &= (1 << truncBit) - 1;
+ map[truncByte] &= (1 << truncOffset) - 1;
MarkBufferDirty(mapBuffer);
UnlockReleaseBuffer(mapBuffer);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 426e756..e39321b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1192,9 +1192,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
/*
- * If the page is marked as all-visible but not all-frozen, we should
- * so mark it. Note that all_frozen is only valid if all_visible is
- * true, so we must check both.
+ * If the all-visible page turns out to be all-frozen but not marked,
+ * we should so mark it. Note that all_frozen is only valid if all_visible
+ * is true, so we must check both.
*/
else if (all_visible_according_to_vm && all_visible && all_frozen &&
!VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
@@ -2068,6 +2068,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -2087,6 +2088,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
if (!HeapTupleHeaderXminCommitted(tuple.t_data))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -2098,6 +2100,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
if (!TransactionIdPrecedes(xmin, OldestXmin))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -2116,23 +2119,16 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
case HEAPTUPLE_RECENTLY_DEAD:
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
- all_visible = false;
- break;
-
+ {
+ all_visible = false;
+ *all_frozen = false;
+ break;
+ }
default:
elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
break;
}
} /* scan along page */
- /*
- * We don't bother clearing *all_frozen when the page is discovered not to
- * be all-visible, so do that now if necessary. The page might fail to be
- * all-frozen for other reasons anyway, but if it's not all-visible, then
- * it definitely isn't all-frozen.
- */
- if (!all_visible)
- *all_frozen = false;
-
return all_visible;
}
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index b8dc54c..65e78ec 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,8 +19,8 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Number of bits for one heap page */
#define BITS_PER_HEAPBLOCK 2
-#define HEAPBLOCKS_PER_BYTE (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)
/* Flags for bit map */
#define VISIBILITYMAP_ALL_VISIBLE 0x01
On Sat, May 7, 2016 at 5:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 4, 2016 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
77a1d1e Department of second thoughts: remove PD_ALL_FROZEN.
Nothing to say here.
fd31cd2 Don't vacuum all-frozen pages.
Hm. I do wonder if it's going to bite us that we don't have a way to
actually force vacuuming of the whole table (besides manually rm'ing the
VM). I've more than once seen VACUUM used to try to do some integrity
checking of the database. How are we actually going to test that the
feature works correctly? They'd have to write checks on top of
pg_visibility to see whether things are borked.

Let's add VACUUM (FORCE) or something like that.
/*
* Compute whether we actually scanned the whole relation. If we did, we
* can adjust relfrozenxid and relminmxid.
*
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/Comment is out-of-date now.
OK.
Fixed.
- if (blkno == next_not_all_visible_block)
+ if (blkno == next_unskippable_block)
  {
-     /* Time to advance next_not_all_visible_block */
-     for (next_not_all_visible_block++;
-          next_not_all_visible_block < nblocks;
-          next_not_all_visible_block++)
+     /* Time to advance next_unskippable_block */
+     for (next_unskippable_block++;
+          next_unskippable_block < nblocks;
+          next_unskippable_block++)

Hm. So we continue with the course of re-processing pages, even if
they're marked all-frozen. For all-visible there at least can be a
benefit by freezing earlier, but for all-frozen pages there's really no
point. I don't really buy the arguments for the skipping logic. But
even disregarding that, maybe we should skip processing a block if it's
all-frozen (without preventing the page from being read?); as there's no
possible benefit? Acquiring the exclusive/content lock and stuff is far
from free.

I wanted to tinker with this logic as little as possible in the
interest of ending up with something that worked. I would not have
written it this way.

Not really related to this patch, but the FORCE_CHECK_PAGE is rather
ugly.

+1.
+ /*
+  * The current block is potentially skippable; if we've seen a
+  * long enough run of skippable blocks to justify skipping it, and
+  * we're not forced to check it, then go ahead and skip.
+  * Otherwise, the page must be at least all-visible if not
+  * all-frozen, so we can set all_visible_according_to_vm = true.
+  */
+ if (skipping_blocks && !FORCE_CHECK_PAGE())
+ {
+     /*
+      * Tricky, tricky.  If this is in aggressive vacuum, the page
+      * must have been all-frozen at the time we checked whether it
+      * was skippable, but it might not be any more.  We must be
+      * careful to count it as a skipped all-frozen page in that
+      * case, or else we'll think we can't update relfrozenxid and
+      * relminmxid.  If it's not an aggressive vacuum, we don't
+      * know whether it was all-frozen, so we have to recheck; but
+      * in this case an approximate answer is OK.
+      */
+     if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+         vacrelstats->frozenskipped_pages++;
      continue;
+ }

Hm. This indeed seems a bit tricky. Not sure how to make it easier
though without just ripping out the SKIP_PAGES_THRESHOLD stuff.

Yep, I had the same problem.
Hm. This also doubles the number of VM accesses. While I guess that's
not noticeable most of the time, it's still not nice; especially when a
large relation is entirely frozen, because it'll mean we'll sequentially
go through the visibility map twice.

Compared to what we're saving, that's obviously a trivial cost.
That's not to say that we might not want to improve it, but it's
hardly a disaster.

In short: wah, wah, wah.

The attached patch optimises the page-skipping logic so that blkno can jump to
next_unskippable_block directly while counting the number of all-visible
and all-frozen pages. So we can avoid double-checking the visibility map.
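The jump-ahead idea can be sketched independently of vacuumlazy.c; everything below (the vm[] array standing in for visibilitymap_get_status(), the Stats struct, and the scan() helper) is my own illustrative naming, not the committed code:

```c
/* Illustrative sketch of the optimisation described above: while probing
 * the VM for the next unskippable block, remember how many of the skipped
 * pages were all-frozen, then advance blkno over the whole run at once
 * instead of re-reading the VM for every page in the run. */
#include <stdint.h>

#define ALL_VISIBLE 0x01
#define ALL_FROZEN  0x02
#define SKIP_PAGES_THRESHOLD 32

typedef struct { uint32_t scanned, frozenskipped; } Stats;

static void scan(const uint8_t *vm, uint32_t nblocks, int aggressive, Stats *st)
{
    uint32_t blkno = 0;
    while (blkno < nblocks)
    {
        /* find the next unskippable block, counting frozen pages in the run */
        uint32_t next = blkno, n_frozen = 0;
        uint8_t  need = aggressive ? ALL_FROZEN : ALL_VISIBLE;

        while (next < nblocks && (vm[next] & need))
        {
            if (vm[next] & ALL_FROZEN)
                n_frozen++;
            next++;
        }
        if (next - blkno >= SKIP_PAGES_THRESHOLD)
        {
            st->frozenskipped += n_frozen;
            blkno = next;           /* jump over the whole run */
            continue;
        }
        /* otherwise process pages one by one (details elided) */
        st->scanned++;
        blkno++;
    }
}
```

The point is that each VM byte in a skipped run is consulted once, where the committed logic re-reads the VM per page to decide whether to bump frozenskipped_pages.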
Regards,
--
Masahiko Sawada
Attachments:
fix_freeze_map_fd31cd2.patchapplication/octet-stream; name=fix_freeze_map_fd31cd2.patchDownload
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index e39321b..06041fb 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -252,8 +252,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
vac_close_indexes(nindexes, Irel, NoLock);
/*
- * Compute whether we actually scanned the whole relation. If we did, we
- * can adjust relfrozenxid and relminmxid.
+ * Compute whether we actually scanned all unfrozen pages. If we did,
+ * we can adjust relfrozenxid and relminmxid.
*
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
@@ -468,6 +468,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PROGRESS_VACUUM_MAX_DEAD_TUPLES
};
int64 initprog_val[3];
+ BlockNumber n_all_visible;
+ BlockNumber n_all_frozen;
pg_rusage_init(&ru0);
@@ -498,6 +500,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
initprog_val[2] = vacrelstats->max_dead_tuples;
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
+ n_all_visible = 0;
+ n_all_frozen = 0;
+
/*
* Except when aggressive is set, we want to skip pages that are
* all-visible according to the visibility map, but only when we can skip
@@ -546,19 +551,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_unskippable_block < nblocks;
next_unskippable_block++)
{
- uint8 vmstatus;
+ uint8 vmskipflags;
- vmstatus = visibilitymap_get_status(onerel, next_unskippable_block,
+ vmskipflags = visibilitymap_get_status(onerel, next_unskippable_block,
&vmbuffer);
if (aggressive)
{
- if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
+ if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
+ n_all_frozen++;
}
else
{
- if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
break;
+
+ /*
+ * Count the number of all-visible and all-frozen pages
+ * in this bunch of skippable blocks.
+ */
+ if (vmskipflags & VISIBILITYMAP_ALL_FROZEN)
+ n_all_frozen++;
+ else
+ n_all_visible++;
}
vacuum_delay_point();
}
@@ -568,7 +583,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
skipping_blocks = false;
- for (blkno = 0; blkno < nblocks; blkno++)
+ blkno = 0;
+ while (blkno < nblocks)
{
Buffer buf;
Page page;
@@ -586,16 +602,20 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
TransactionId visibility_cutoff_xid = InvalidTransactionId;
/* see note above about forcing scanning of last page */
-#define FORCE_CHECK_PAGE() \
- (blkno == nblocks - 1 && should_attempt_truncation(vacrelstats))
+#define FORCE_CHECK_PAGE(blk) \
+ ((blk) == nblocks - 1 && should_attempt_truncation(vacrelstats))
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
if (blkno == next_unskippable_block)
{
+ n_all_visible = 0;
+ n_all_frozen = 0;
+
/* Time to advance next_unskippable_block */
for (next_unskippable_block++;
- next_unskippable_block < nblocks;
+ next_unskippable_block < nblocks &&
+ !FORCE_CHECK_PAGE(next_unskippable_block);
next_unskippable_block++)
{
uint8 vmskipflags;
@@ -607,11 +627,24 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
{
if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
+ /*
+ * Count the number of all-frozen pages in this bunch
+ * of skippable blocks.
+ */
+ n_all_frozen++;
}
else
{
if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
break;
+ /*
+ * Count the number of all-visible and all-frozen pages
+ * in this bunch of skippable blocks.
+ */
+ if (vmskipflags & VISIBILITYMAP_ALL_FROZEN)
+ n_all_frozen++;
+ else
+ n_all_visible++;
}
vacuum_delay_point();
}
@@ -637,26 +670,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
/*
- * The current block is potentially skippable; if we've seen a
- * long enough run of skippable blocks to justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least all-visible if not
- * all-frozen, so we can set all_visible_according_to_vm = true.
+ * The current block is the first of a run of skippable blocks, and
+ * we know how many blocks we can skip. If we've
+ * seen a long enough run of skippable blocks to justify skipping it,
+ * then go ahead and skip. Otherwise, the page must be at least all-visible
+ * if not all-frozen, so we can set all_visible_according_to_vm = true.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks)
{
+ BlockNumber n_skippable_blocks = n_all_visible + n_all_frozen;
+
/*
- * Tricky, tricky. If this is in aggressive vacuum, the page
- * must have been all-frozen at the time we checked whether it
- * was skippable, but it might not be any more. We must be
- * careful to count it as a skipped all-frozen page in that
- * case, or else we'll think we can't update relfrozenxid and
- * relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was all-frozen, so we have to recheck; but
- * in this case an approximate answer is OK.
+ * We know that there are 'n_all_frozen' all-frozen pages
+ * among the skippable pages counted in the scan we did just above.
*/
- if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
- vacrelstats->frozenskipped_pages++;
+ vacrelstats->frozenskipped_pages += n_all_frozen;
+
+ /* Jump to the next unskippable block directly */
+ blkno += n_skippable_blocks;
+
continue;
}
all_visible_according_to_vm = true;
@@ -751,10 +783,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* it's OK to skip vacuuming pages we get a lock conflict on. They
* will be dealt with in some future vacuum.
*/
- if (!aggressive && !FORCE_CHECK_PAGE())
+ if (!aggressive && !FORCE_CHECK_PAGE(blkno))
{
ReleaseBuffer(buf);
vacrelstats->pinskipped_pages++;
+ blkno++;
continue;
}
@@ -782,6 +815,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
vacrelstats->pinskipped_pages++;
if (hastup)
vacrelstats->nonempty_pages = blkno + 1;
+ blkno++;
continue;
}
if (!aggressive)
@@ -794,6 +828,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
vacrelstats->pinskipped_pages++;
if (hastup)
vacrelstats->nonempty_pages = blkno + 1;
+ blkno++;
continue;
}
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
@@ -844,6 +879,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
UnlockReleaseBuffer(buf);
RecordPageWithFreeSpace(onerel, blkno, freespace);
+ blkno++;
continue;
}
@@ -883,6 +919,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
UnlockReleaseBuffer(buf);
RecordPageWithFreeSpace(onerel, blkno, freespace);
+ blkno++;
continue;
}
@@ -1224,6 +1261,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
*/
if (vacrelstats->num_dead_tuples == prev_dead_count)
RecordPageWithFreeSpace(onerel, blkno, freespace);
+
+ blkno++;
}
/* report that everything is scanned and vacuumed */
On Wed, Jun 1, 2016 at 3:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The attached patch fixes only the above comments; the others are being addressed now.
Committed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 2, 2016 at 11:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The attached patch optimises the page-skipping logic so that blkno can jump to
next_unskippable_block directly while counting the number of all-visible
and all-frozen pages. So we can avoid double-checking the visibility map.
I think this is 9.7 material. This patch has already won the
"scariest patch" tournament. Changing the logic more than necessary
at this late date seems like it just increases the scariness. I think
this is an opportunity for further optimization, not a defect.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 3, 2016 at 11:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jun 2, 2016 at 11:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The attached patch optimises the page-skipping logic so that blkno can jump to
next_unskippable_block directly while counting the number of all-visible
and all-frozen pages. So we can avoid double-checking the visibility map.

I think this is 9.7 material. This patch has already won the
"scariest patch" tournament. Changing the logic more than necessary
at this late date seems like it just increases the scariness. I think
this is an opportunity for further optimization, not a defect.
I agree with you.
I'll submit this as an improvement for 9.7.
That patch also incorporates the following review comment.
We can push at least this fix.
/*
* Compute whether we actually scanned the whole relation. If we did, we
* can adjust relfrozenxid and relminmxid.
*
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/Comment is out-of-date now.
I'm addressing the review comments on commit 7087166, and will post the patch.
And testing feature for freeze map is under the discussion.
Regards,
--
Masahiko Sawada
On Fri, Jun 3, 2016 at 10:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
That patch also incorporates the following review comment.
We can push at least this fix.
Can you submit that part as a separate patch?
I'm addressing the review comments on commit 7087166, and will post the patch.
When?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jun 4, 2016 at 12:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jun 3, 2016 at 10:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
That patch also incorporates the following review comment.
We can push at least this fix.

Can you submit that part as a separate patch?
Attached.
I'm addressing the review comments on commit 7087166, and will post the patch.
When?
On Saturday.
Regards,
--
Masahiko Sawada
Attachments:
fix_freeze_map_fd31cd2.patchapplication/octet-stream; name=fix_freeze_map_fd31cd2.patchDownload
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index e39321b..06041fb 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -252,8 +252,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
vac_close_indexes(nindexes, Irel, NoLock);
/*
- * Compute whether we actually scanned the whole relation. If we did, we
- * can adjust relfrozenxid and relminmxid.
+ * Compute whether we actually scanned all unfrozen pages. If we did,
+ * we can adjust relfrozenxid and relminmxid.
*
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Can you submit that part as a separate patch?
Attached.
Thanks, committed.
I'm addressing the review comments on commit 7087166, and will post the patch.
When?
On Saturday.
Great. Will that address everything for this open item, then?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, May 7, 2016 at 5:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, May 5, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-05-02 14:48:18 -0700, Andres Freund wrote:
7087166 pg_upgrade: Convert old visibility map format to new format.
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
...
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
...

Uh, shouldn't we actually fail if we read incompletely? Rather than
silently ignoring the problem? Ok, this causes no corruption, but it
indicates that something went significantly wrong.

Sure, that's reasonable.

Fixed.
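On the short-read point: a minimal sketch of a copy loop that fails on a partial block instead of silently stopping (copy_blocks and its error handling are mine, not the pg_upgrade code):

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Returns 0 on success, -1 on I/O error or a truncated (non-block-multiple)
 * source file; a short read is reported instead of being silently ignored. */
static int copy_blocks(int src_fd, int dst_fd)
{
    char    buffer[BLCKSZ];
    ssize_t n;

    while ((n = read(src_fd, buffer, BLCKSZ)) > 0)
    {
        if (n != BLCKSZ)
        {
            /* partial block: the file size isn't a multiple of BLCKSZ */
            fprintf(stderr, "unexpected short read: %zd bytes\n", n);
            errno = EIO;
            return -1;
        }
        if (write(dst_fd, buffer, BLCKSZ) != BLCKSZ)
            return -1;
    }
    return (n < 0) ? -1 : 0;    /* n == 0 is clean EOF */
}
```

A relation fork whose size isn't a whole number of blocks indicates something went significantly wrong upstream, so surfacing it beats continuing quietly.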
+ char new_vmbuf[BLCKSZ];
+ char *new_cur = new_vmbuf;
+ bool empty = true;
+ bool old_lastpart;
+
+ /* Copy page header in advance */
+ memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);

Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
with old_lastpart && !empty, right?

Oh, dear. That seems like a possible data corruption bug. Maybe we'd
better fix that right away (although I don't actually have time before
the wrap).

Since force is always set to true, I removed the force argument from
copyFile() and rewriteVisibilityMap(), and the destination file is now
always opened with the O_RDWR, O_CREAT, and O_TRUNC flags.
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+     close(src_fd);
+     return getErrorText();
+ }

I know you guys copied this, but what's the force thing about?
Especially as it's always set to true by the callers (i.e. what is the
parameter even about?)? Wouldn't we at least have to specify O_TRUNC in
the force case?

I just work here.

+ old_cur += BITS_PER_HEAPBLOCK_OLD;
+ new_cur += BITS_PER_HEAPBLOCK;

I'm not sure I'm understanding the point of the BITS_PER_HEAPBLOCK_OLD
stuff - as long as it's hardcoded into rewriteVisibilityMap() we'll not
be able to have differing ones anyway, should we decide to add a third
bit?

I think that's just a matter of style.

So this comment is not incorporated.

Patch attached; please review it.
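For context on what rewriteVisibilityMap() has to do, here is a hedged sketch of the core byte conversion: one old byte (eight one-bit all-visible flags) expands to two new bytes (four two-bit slots each), with ALL_FROZEN initially clear. The helper name and layout details are my assumptions, not the pg_upgrade code:

```c
#include <stdint.h>

#define BITS_PER_HEAPBLOCK_OLD 1
#define BITS_PER_HEAPBLOCK     2
#define ALL_VISIBLE            0x01

/* Convert one byte of the old one-bit-per-page VM into two bytes of the
 * new two-bit format: each set old bit becomes ALL_VISIBLE in its 2-bit
 * slot, and ALL_FROZEN starts out clear (pg_upgrade can't know frozenness). */
static void rewrite_vm_byte(uint8_t old_byte, uint8_t out[2])
{
    out[0] = out[1] = 0;
    for (int blk = 0; blk < 8; blk++)        /* 8 heap blocks per old byte */
    {
        if (old_byte & (1 << blk))
        {
            int new_byte = blk / 4;          /* 4 heap blocks per new byte */
            int new_off  = (blk % 4) * BITS_PER_HEAPBLOCK;
            out[new_byte] |= ALL_VISIBLE << new_off;
        }
    }
}
```

This doubling is also why the rewrite has to track whether it is in the first or last half of an old page when reassembling new VM pages.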
Regards,
--
Masahiko Sawada
Attachments:
fix_freeze_map_7087166.patchapplication/octet-stream; name=fix_freeze_map_7087166.patchDownload
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 5d87408..3d49774 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -22,7 +22,7 @@
#ifndef WIN32
-static int copy_file(const char *fromfile, const char *tofile, bool force);
+static int copy_file(const char *fromfile, const char *tofile);
#else
static int win32_pghardlink(const char *src, const char *dst);
#endif
@@ -34,12 +34,12 @@ static int win32_pghardlink(const char *src, const char *dst);
* Copies a relation file from src to dst.
*/
const char *
-copyFile(const char *src, const char *dst, bool force)
+copyFile(const char *src, const char *dst)
{
#ifndef WIN32
- if (copy_file(src, dst, force) == -1)
+ if (copy_file(src, dst) == -1)
#else
- if (CopyFile(src, dst, !force) == 0)
+ if (CopyFile(src, dst, false) == 0)
#endif
return getErrorText();
else
@@ -68,7 +68,7 @@ linkFile(const char *src, const char *dst)
#ifndef WIN32
static int
-copy_file(const char *srcfile, const char *dstfile, bool force)
+copy_file(const char *srcfile, const char *dstfile)
{
#define COPY_BUF_SIZE (50 * BLCKSZ)
@@ -87,7 +87,7 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
if ((src_fd = open(srcfile, O_RDONLY, 0)) < 0)
return -1;
- if ((dest_fd = open(dstfile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ if ((dest_fd = open(dstfile, O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR)) < 0)
{
save_errno = errno;
@@ -159,12 +159,13 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
* VACUUM.
*/
const char *
-rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+rewriteVisibilityMap(const char *fromfile, const char *tofile)
{
int src_fd = 0;
int dst_fd = 0;
char buffer[BLCKSZ];
ssize_t bytesRead;
+ ssize_t totalBytesRead;
ssize_t src_filesize;
int rewriteVmBytesPerPage;
BlockNumber new_blkno = 0;
@@ -185,7 +186,7 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
return getErrorText();
}
- if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR)) < 0)
{
close(src_fd);
return getErrorText();
@@ -200,13 +201,23 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
* page is empty, we skip it, mostly to avoid turning one-page visibility
* maps for small relations into two pages needlessly.
*/
- while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ while (totalBytesRead < src_filesize)
{
char *old_cur;
char *old_break;
char *old_blkend;
PageHeaderData pageheader;
- bool old_lastblk = ((BLCKSZ * (new_blkno + 1)) == src_filesize);
+ bool old_lastblk;
+
+ if ((bytesRead = read(src_fd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ close(dst_fd);
+ close(src_fd);
+ return getErrorText();
+ }
+
+ totalBytesRead += BLCKSZ;
+ old_lastblk = (totalBytesRead == src_filesize);
/* Save the page header data */
memcpy(&pageheader, buffer, SizeOfPageHeaderData);
@@ -229,6 +240,9 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
bool empty = true;
bool old_lastpart;
+ /* Zero out new_vmbuf */
+ memset(new_vmbuf, 0, sizeof(new_vmbuf));
+
/* Copy page header in advance */
memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 5b00142..10182de 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -367,10 +367,9 @@ bool pid_lock_file_exists(const char *datadir);
/* file.c */
-const char *copyFile(const char *src, const char *dst, bool force);
+const char *copyFile(const char *src, const char *dst);
const char *linkFile(const char *src, const char *dst);
-const char *rewriteVisibilityMap(const char *fromfile, const char *tofile,
- bool force);
+const char *rewriteVisibilityMap(const char *fromfile, const char *tofile);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index 0c1a822..85cb717 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -248,9 +248,9 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
/* Rewrite visibility map if needed */
if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
- msg = rewriteVisibilityMap(old_file, new_file, true);
+ msg = rewriteVisibilityMap(old_file, new_file);
else
- msg = copyFile(old_file, new_file, true);
+ msg = copyFile(old_file, new_file);
if (msg)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
@@ -262,7 +262,7 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
/* Rewrite visibility map if needed */
if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
- msg = rewriteVisibilityMap(old_file, new_file, true);
+ msg = rewriteVisibilityMap(old_file, new_file);
else
msg = linkFile(old_file, new_file);
On Fri, Jun 3, 2016 at 10:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
+ char        new_vmbuf[BLCKSZ];
+ char       *new_cur = new_vmbuf;
+ bool        empty = true;
+ bool        old_lastpart;
+
+ /* Copy page header in advance */
+ memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);
Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
with old_lastpart && !empty, right?
Oh, dear. That seems like a possible data corruption bug. Maybe we'd
better fix that right away (although I don't actually have time before
the wrap).
Actually, on second thought, I'm not seeing the bug here. It seems to
me that the loop commented this way:
/* Process old page bytes one by one, and turn it into new page. */
...should always write to every byte in new_vmbuf, because we process
exactly half the bytes in the old block at a time, and so that's going
to generate exactly one full page of new bytes. Am I missing
something?
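Robert's counting argument can be sketched with a toy version of the rewrite. The constants MY_BLCKSZ and MY_PAGE_HEADER below are stand-ins for the real BLCKSZ and SizeOfPageHeaderData, and the bit layout is simplified, so this is only an illustration of the arithmetic, not the actual pg_upgrade code: each old-format byte always yields exactly two new-format bytes, so rewriting half an old page's payload fills a whole new page's payload with no byte left unwritten.

```c
#include <assert.h>
#include <stdint.h>

/* Toy geometry; the real values come from the PostgreSQL headers. */
#define MY_BLCKSZ 8192
#define MY_PAGE_HEADER 24        /* stand-in for SizeOfPageHeaderData */

/*
 * Expand one old-format VM byte (1 bit per heap block, 8 blocks/byte) into
 * two new-format bytes (2 bits per heap block, 4 blocks/byte).  Every old
 * all-visible bit becomes the two-bit pattern 01: all-visible set,
 * all-frozen clear.
 */
void
rewrite_vm_byte(uint8_t old_byte, uint8_t out[2])
{
    for (int half = 0; half < 2; half++)
    {
        uint8_t nb = 0;

        for (int bit = 0; bit < 4; bit++)
        {
            if ((old_byte >> (half * 4 + bit)) & 1)
                nb |= (uint8_t) (1 << (2 * bit)); /* all-visible bit only */
        }
        out[half] = nb;
    }
}

/* Bytes of new payload produced from half an old page's payload. */
long
new_payload_from_half_old_page(void)
{
    long half_old = (MY_BLCKSZ - MY_PAGE_HEADER) / 2;

    return half_old * 2;        /* == MY_BLCKSZ - MY_PAGE_HEADER exactly */
}
```

Since every old byte produces two new bytes unconditionally, the loop cannot leave a gap in the output page, which is why the missing memset turns out not to be a live bug here.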
Since the force is always set true, I removed the force from argument
of copyFile() and rewriteVisibilityMap().
And destination file is always opened with O_RDWR, O_CREAT, O_TRUNC flags.
I'm not happy with this. I think we should always open with O_EXCL,
because the new file is not expected to exist and if it does,
something's probably broken. I think we should default to the safe
behavior (which is failing) rather than the unsafe behavior (which is
clobbering data).
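The difference being argued here can be shown with a minimal sketch (create_new_file is a hypothetical helper, not pg_upgrade code; only POSIX calls are used): with O_EXCL, open() refuses to clobber an existing destination and fails with EEXIST, whereas O_TRUNC would silently empty a file that should not have been there.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Safe-create pattern: O_CREAT | O_EXCL makes open() fail atomically with
 * EEXIST if the path already exists, instead of truncating it.  Returns
 * the new fd, or -1 with errno set.
 */
int
create_new_file(const char *path)
{
    return open(path, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
}
```

Failing by default and letting the user investigate is the conservative choice when the destination is never expected to exist.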
(Status update for Noah: I expect Masahiko Sawada will respond
quickly, but if not I'll give some kind of update by Monday COB
anyhow.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jun 4, 2016 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jun 3, 2016 at 10:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
+ char        new_vmbuf[BLCKSZ];
+ char       *new_cur = new_vmbuf;
+ bool        empty = true;
+ bool        old_lastpart;
+
+ /* Copy page header in advance */
+ memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);
Shouldn't we zero out new_vmbuf? Afaics we're not necessarily zeroing it
with old_lastpart && !empty, right?
Oh, dear. That seems like a possible data corruption bug. Maybe we'd
better fix that right away (although I don't actually have time before
the wrap).
Actually, on second thought, I'm not seeing the bug here. It seems to
me that the loop commented this way:
/* Process old page bytes one by one, and turn it into new page. */
...should always write to every byte in new_vmbuf, because we process
exactly half the bytes in the old block at a time, and so that's going
to generate exactly one full page of new bytes. Am I missing
something?
Yeah, you're right.
rewriteVisibilityMap() always writes every byte of new_vmbuf.
Since the force is always set true, I removed the force from argument
of copyFile() and rewriteVisibilityMap().
And destination file is always opened with O_RDWR, O_CREAT, O_TRUNC flags.
I'm not happy with this. I think we should always open with O_EXCL,
because the new file is not expected to exist and if it does,
something's probably broken. I think we should default to the safe
behavior (which is failing) rather than the unsafe behavior (which is
clobbering data).
I specified O_EXCL instead of O_TRUNC.
Attached updated patch.
Regards,
--
Masahiko Sawada
Attachment: fix_freeze_map_7087166_v2.patch (application/octet-stream)
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 5d87408..643bb93 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -22,7 +22,7 @@
#ifndef WIN32
-static int copy_file(const char *fromfile, const char *tofile, bool force);
+static int copy_file(const char *fromfile, const char *tofile);
#else
static int win32_pghardlink(const char *src, const char *dst);
#endif
@@ -34,12 +34,12 @@ static int win32_pghardlink(const char *src, const char *dst);
* Copies a relation file from src to dst.
*/
const char *
-copyFile(const char *src, const char *dst, bool force)
+copyFile(const char *src, const char *dst)
{
#ifndef WIN32
- if (copy_file(src, dst, force) == -1)
+ if (copy_file(src, dst) == -1)
#else
- if (CopyFile(src, dst, !force) == 0)
+ if (CopyFile(src, dst, true) == 0)
#endif
return getErrorText();
else
@@ -68,7 +68,7 @@ linkFile(const char *src, const char *dst)
#ifndef WIN32
static int
-copy_file(const char *srcfile, const char *dstfile, bool force)
+copy_file(const char *srcfile, const char *dstfile)
{
#define COPY_BUF_SIZE (50 * BLCKSZ)
@@ -87,7 +87,7 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
if ((src_fd = open(srcfile, O_RDONLY, 0)) < 0)
return -1;
- if ((dest_fd = open(dstfile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ if ((dest_fd = open(dstfile, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR)) < 0)
{
save_errno = errno;
@@ -159,12 +159,13 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
* VACUUM.
*/
const char *
-rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+rewriteVisibilityMap(const char *fromfile, const char *tofile)
{
int src_fd = 0;
int dst_fd = 0;
char buffer[BLCKSZ];
ssize_t bytesRead;
+ ssize_t totalBytesRead;
ssize_t src_filesize;
int rewriteVmBytesPerPage;
BlockNumber new_blkno = 0;
@@ -185,7 +186,7 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
return getErrorText();
}
- if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR)) < 0)
{
close(src_fd);
return getErrorText();
@@ -200,13 +201,23 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
* page is empty, we skip it, mostly to avoid turning one-page visibility
* maps for small relations into two pages needlessly.
*/
- while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ while (totalBytesRead < src_filesize)
{
char *old_cur;
char *old_break;
char *old_blkend;
PageHeaderData pageheader;
- bool old_lastblk = ((BLCKSZ * (new_blkno + 1)) == src_filesize);
+ bool old_lastblk;
+
+ if ((bytesRead = read(src_fd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ close(dst_fd);
+ close(src_fd);
+ return getErrorText();
+ }
+
+ totalBytesRead += BLCKSZ;
+ old_lastblk = (totalBytesRead == src_filesize);
/* Save the page header data */
memcpy(&pageheader, buffer, SizeOfPageHeaderData);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 5b00142..10182de 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -367,10 +367,9 @@ bool pid_lock_file_exists(const char *datadir);
/* file.c */
-const char *copyFile(const char *src, const char *dst, bool force);
+const char *copyFile(const char *src, const char *dst);
const char *linkFile(const char *src, const char *dst);
-const char *rewriteVisibilityMap(const char *fromfile, const char *tofile,
- bool force);
+const char *rewriteVisibilityMap(const char *fromfile, const char *tofile);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index 0c1a822..85cb717 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -248,9 +248,9 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
/* Rewrite visibility map if needed */
if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
- msg = rewriteVisibilityMap(old_file, new_file, true);
+ msg = rewriteVisibilityMap(old_file, new_file);
else
- msg = copyFile(old_file, new_file, true);
+ msg = copyFile(old_file, new_file);
if (msg)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
@@ -262,7 +262,7 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
/* Rewrite visibility map if needed */
if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
- msg = rewriteVisibilityMap(old_file, new_file, true);
+ msg = rewriteVisibilityMap(old_file, new_file);
else
msg = linkFile(old_file, new_file);
On Sat, Jun 4, 2016 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Can you submit that part as a separate patch?
Attached.
Thanks, committed.
I'm addressing the review comments on commit 7087166, and will post the patch.
When?
On Saturday.
Great. Will that address everything for this open item, then?
I attached the patch for commit 7087166 in another mail.
I think that only the test tool for the visibility map remains, and that
is still under discussion.
Even if we have a verification tool or function for the visibility map,
we cannot repair its contents if they turn out to be wrong.
So I think we should have a way to regenerate the visibility map.
For this purpose, one idea is to do a vacuum that ignores the visibility
map, via a new option or a new function.
But IMHO, it's not a good idea to allow a function to do vacuum, and
expanding the VACUUM syntax might be somewhat overkill.
So another idea is to have a GUC parameter, vacuum_even_frozen_page for example.
If this parameter is set to true (false by default), we forcibly vacuum
the whole table and regenerate the visibility map.
The advantage of this idea is that we don't need to expand the VACUUM
syntax, and we can relatively easily remove this parameter if it's no
longer necessary.
Thoughts?
Regards,
--
Masahiko Sawada
On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Jun 4, 2016 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Can you submit that part as a separate patch?
Attached.
Thanks, committed.
I'm addressing the review comments on commit 7087166, and will post the patch.
When?
On Saturday.
Great. Will that address everything for this open item, then?
Attached patch for commit 7087166 on another mail.
I think that only the test tool for visibility map is remaining and
under the discussion.
Even if we have verification tool or function for visibility map, we
cannot repair the contents of visibility map if we turned out that
contents of visibility map is something wrong.
So I think we should have the way that re-generates the visibility map.
For this purpose, doing vacuum while ignoring visibility map by a new
option or new function is one idea.
But IMHO, it's not good idea to allow a function to do vacuum, and
expanding the VACUUM syntax might be somewhat overkill.
So other idea is to have GUC parameter, vacuum_even_frozen_page for example.
If this parameter is set true (false by default), we do vacuum whole
table forcibly and re-generate visibility map.
The advantage of this idea is that we don't necessary to expand VACUUM
syntax and relatively easily can remove this parameter if it's not
necessary anymore.
Attached is a sample patch that controls full-table vacuum with a new GUC parameter.
Regards,
--
Masahiko Sawada
Attachment: vacuum_even_frozen_page.patch (text/plain)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 784c3e9..03f026d 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -125,6 +125,8 @@ typedef struct LVRelStats
bool lock_waiter_detected;
} LVRelStats;
+/* GUC parameter */
+bool vacuum_even_frozen_page; /* should we scan all pages forcibly? */
/* A few variables that don't seem worth passing around as parameters */
static int elevel = -1;
@@ -138,7 +140,7 @@ static BufferAccessStrategy vac_strategy;
/* non-export function prototypes */
static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool aggressive);
+ Relation *Irel, int nindexes, bool aggressive, bool scan_all);
static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
static void lazy_vacuum_index(Relation indrel,
@@ -246,7 +248,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
vacrelstats->hasindex = (nindexes > 0);
/* Do the vacuuming */
- lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, aggressive);
+ lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, aggressive,
+ vacuum_even_frozen_page);
/* Done with indexes */
vac_close_indexes(nindexes, Irel, NoLock);
@@ -261,7 +264,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
if ((vacrelstats->scanned_pages + vacrelstats->frozenskipped_pages)
< vacrelstats->rel_pages)
{
- Assert(!aggressive);
+ Assert(!aggressive && !vacuum_even_frozen_page);
scanned_all_unfrozen = false;
}
else
@@ -442,7 +445,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
*/
static void
lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool aggressive)
+ Relation *Irel, int nindexes, bool aggressive, bool scan_all)
{
BlockNumber nblocks,
blkno;
@@ -513,6 +516,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* such pages do not need freezing and do not affect the value that we can
* safely set for relfrozenxid or relminmxid.
*
+ * When scan_all is set, we have to scan all pages forcibly while ignoring
+ * visibility map status, and then we can safely set for relfrozenxid or
+ * relminmxid.
+ *
* Before entering the main loop, establish the invariant that
* next_unskippable_block is the next block number >= blkno that's not we
* can't skip based on the visibility map, either all-visible for a
@@ -639,11 +646,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* The current block is potentially skippable; if we've seen a
* long enough run of skippable blocks to justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least all-visible if not
- * all-frozen, so we can set all_visible_according_to_vm = true.
+ * scan_all is not specified, and we're not forced to check it,
+ * then go ahead and skip. Otherwise, the page must be at least
+ * all-visible if not all-frozen, so we can set
+ * all_visible_according_to_vm = true.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks && !scan_all && !FORCE_CHECK_PAGE())
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e246a9c..f86fa68 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1662,6 +1662,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"vacuum_even_frozen_page", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Do vacuum even frozen page while ignoring visibility map."),
+ NULL
+ },
+ &vacuum_even_frozen_page,
+ false,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 80cd4a8..31916dd 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -154,6 +154,7 @@ extern int vacuum_freeze_min_age;
extern int vacuum_freeze_table_age;
extern int vacuum_multixact_freeze_min_age;
extern int vacuum_multixact_freeze_table_age;
+extern bool vacuum_even_frozen_page;
/* in commands/vacuum.c */
On Mon, Jun 6, 2016 at 5:44 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Jun 4, 2016 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jun 3, 2016 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Can you submit that part as a separate patch?
Attached.
Thanks, committed.
I'm addressing the review comments on commit 7087166, and will post the patch.
When?
On Saturday.
Great. Will that address everything for this open item, then?
Attached patch for commit 7087166 on another mail.
I think that only the test tool for visibility map is remaining and
under the discussion.
Even if we have verification tool or function for visibility map, we
cannot repair the contents of visibility map if we turned out that
contents of visibility map is something wrong.
So I think we should have the way that re-generates the visibility map.
For this purpose, doing vacuum while ignoring visibility map by a new
option or new function is one idea.
But IMHO, it's not good idea to allow a function to do vacuum, and
expanding the VACUUM syntax might be somewhat overkill.
So other idea is to have GUC parameter, vacuum_even_frozen_page for example.
If this parameter is set true (false by default), we do vacuum whole
table forcibly and re-generate visibility map.
The advantage of this idea is that we don't necessary to expand VACUUM
syntax and relatively easily can remove this parameter if it's not
necessary anymore.
Attached is a sample patch that controls full page vacuum by new GUC parameter.
Don't we want a reloption for that? Just wondering...
--
Michael
On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Attached is a sample patch that controls full page vacuum by new GUC parameter.
Don't we want a reloption for that? Just wondering...
Why? Just for consistency? I think the bigger question here is
whether we need to do anything at all. It's true that, without some
new option, we'll lose the ability to forcibly vacuum every page in
the relation, even if all-frozen. But there's not much use case for
that in the first place. It will be potentially helpful if it turns
out that we have a bug that sets the all-frozen bit on pages that are
not, in fact, all-frozen. Otherwise, what's the use?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 6, 2016 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Attached is a sample patch that controls full page vacuum by new GUC parameter.
Don't we want a reloption for that? Just wondering...
Why? Just for consistency? I think the bigger question here is
whether we need to do anything at all. It's true that, without some
new option, we'll lose the ability to forcibly vacuum every page in
the relation, even if all-frozen. But there's not much use case for
that in the first place. It will be potentially helpful if it turns
out that we have a bug that sets the all-frozen bit on pages that are
not, in fact, all-frozen. Otherwise, what's the use?
I cannot agree with using this parameter as a reloption.
We would set it to true only when a serious bug is discovered and we want
to regenerate the visibility maps of specific tables.
I thought that controlling it with a GUC parameter would be more
convenient than adding a new option.
Regards,
--
Masahiko Sawada
On Sat, Jun 4, 2016 at 12:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached updated patch.
The error-checking enhancements here look good to me, except that you
forgot to initialize totalBytesRead. I've committed those changes
with a fix for that problem and will look at the rest of this
separately.
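The bug Robert fixed is easy to reproduce in miniature: the v2 patch declared totalBytesRead and then tested it in the while condition before ever assigning it, which reads an indeterminate value. A simplified reader with the initialization in place (TOY_BLCKSZ and read_whole_file are illustrative stand-ins, not the actual pg_upgrade code) looks like this:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#define TOY_BLCKSZ 512          /* stand-in for BLCKSZ */

/*
 * Read a file block by block, mirroring the loop shape in
 * rewriteVisibilityMap().  The key line is the "= 0": without it, the
 * first evaluation of the while condition uses garbage.  Returns the
 * total bytes read, or -1 on any error or short read.
 */
long
read_whole_file(const char *path, long filesize)
{
    char    buffer[TOY_BLCKSZ];
    long    totalBytesRead = 0; /* the missing initialization */
    int     fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    while (totalBytesRead < filesize)
    {
        if (read(fd, buffer, TOY_BLCKSZ) != TOY_BLCKSZ)
        {
            close(fd);
            return -1;          /* short read: treat as an error */
        }
        totalBytesRead += TOY_BLCKSZ;
    }
    close(fd);
    return totalBytesRead;
}
```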
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Masahiko Sawada <sawada.mshk@gmail.com> writes:
On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
So other idea is to have GUC parameter, vacuum_even_frozen_page for example.
If this parameter is set true (false by default), we do vacuum whole
table forcibly and re-generate visibility map.
The advantage of this idea is that we don't necessary to expand VACUUM
syntax and relatively easily can remove this parameter if it's not
necessary anymore.
Attached is a sample patch that controls full page vacuum by new GUC parameter.
I find this approach fairly ugly ... it's randomly inconsistent with other
VACUUM parameters for no very defensible reason. Taking out GUCs is not
easier than taking out statement parameters; you risk breaking
applications either way.
regards, tom lane
On Mon, Jun 6, 2016 at 7:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jun 4, 2016 at 12:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached updated patch.
The error-checking enhancements here look good to me, except that you
forgot to initialize totalBytesRead. I've committed those changes
with a fix for that problem and will look at the rest of this
separately.
Committed that now, too.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 6, 2016 at 9:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
On Sat, Jun 4, 2016 at 1:46 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
So other idea is to have GUC parameter, vacuum_even_frozen_page for example.
If this parameter is set true (false by default), we do vacuum whole
table forcibly and re-generate visibility map.
The advantage of this idea is that we don't necessary to expand VACUUM
syntax and relatively easily can remove this parameter if it's not
necessary anymore.
Attached is a sample patch that controls full page vacuum by new GUC parameter.
I find this approach fairly ugly ... it's randomly inconsistent with other
VACUUM parameters for no very defensible reason.
Just to be sure I understand, in what way is it inconsistent?
Taking out GUCs is not
easier than taking out statement parameters; you risk breaking
applications either way.
Agreed, but that doesn't really answer the question of which one we
should have, if either. My gut feeling on this is to either do
nothing or add a VACUUM option (not a GUC, not a reloption) called
even_frozen_pages, default false. What is your opinion?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jun 6, 2016 at 9:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Taking out GUCs is not
easier than taking out statement parameters; you risk breaking
applications either way.
Agreed, but that doesn't really answer the question of which one we
should have, if either. My gut feeling on this is to either do
nothing or add a VACUUM option (not a GUC, not a reloption) called
even_frozen_pages, default false. What is your opinion?
That's about where I stand, with some preference for "do nothing".
I'm not convinced we need this.
regards, tom lane
On Fri, Jun 3, 2016 at 11:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
(Status update for Noah: I expect Masahiko Sawada will respond
quickly, but if not I'll give some kind of update by Monday COB
anyhow.)
I believe this open item is now closed, unless Andres has more
comments or wishes to discuss any point further, with the exception
that we still need to decide whether to add VACUUM (even_frozen_pages)
or some variant of that. I have added a new open item for that issue
and marked this one as resolved.
My intended strategy as the presumptive owner of the new items is to
do nothing unless more of a consensus emerges than we have presently.
We do not seem to have clear agreement on whether to add the new
option; whether to make it a GUC, a reloption, a VACUUM syntax option,
or some combination of those things; and whether it should blow up the
existing VM and rebuild it (as proposed by Sawada-san) or just force
frozen pages to be scanned in the hope that something good will happen
(as proposed by Andres). In the absence of consensus, doing nothing
is a reasonable choice here.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
My gut feeling on this is to either do nothing or add a VACUUM option
(not a GUC, not a reloption) called even_frozen_pages, default false.
What is your opinion?
+1 for that approach -- I thought that was already agreed weeks ago and
the only question was what to name that option. even_frozen_pages
sounds better than SCANALL, SCAN_ALL, FREEZE, FORCE (the other
options I saw proposed in that subthread), so +1 for that naming
too.
I don't like doing nothing; that means that when we discover a bug we'll
have to tell users to rm a file whose name requires a complicated
catalog query to find out, so -1 for that.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 6, 2016 at 10:18 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Robert Haas wrote:
My gut feeling on this is to either do nothing or add a VACUUM option
(not a GUC, not a reloption) called even_frozen_pages, default false.
What is your opinion?
+1 for that approach -- I thought that was already agreed weeks ago and
the only question was what to name that option. even_frozen_pages
sounds better than SCANALL, SCAN_ALL, FREEZE, FORCE (the other
options I saw proposed in that subthread), so +1 for that naming
too.
I don't like doing nothing; that means that when we discover a bug we'll
have to tell users to rm a file whose name requires a complicated
catalog query to find out, so -1 for that.
So... I agree that it is definitely not good if we have to tell users
to rm a file, but I am not quite sure how this new option would
prevent us from having to say that? Here are some potential kinds of
bugs we might have:
1. Sometimes, the all-frozen bit doesn't get set when it should.
2. Sometimes, the all-frozen bit gets set when it shouldn't.
3. Some combination of (1) and (2), so that the VM fork can't be
trusted in either direction.
If (1) happens, removing the VM fork is not a good idea; what people
will want to do is re-run a VACUUM FREEZE.
If (2) or (3) happens, removing the VM fork might be a good idea, but
it's not really clear that VACUUM (even_frozen_pages) will help much.
For one thing, if there are actually unfrozen tuples on those pages
and the clog pages which they reference are already gone or recycled,
rerunning VACUUM on the table in any form might permanently lose data,
or maybe it will just fail.
If because of the nature of the bug you somehow know that case doesn't
pertain, then I suppose the bug is that the tuple-level and page-level
state is out of sync. VACUUM (even_frozen_pages) probably won't help
with that much either, because VACUUM never clears the all-frozen bit
without also clearing the all-visible bit, and that only if the page
contains dead tuples, which in this case it probably doesn't.
I'm intuitively sympathetic to the idea that we should have an option
for this, but I can't figure out in what case we'd actually tell
anyone to use it. It would be useful for the kinds of bugs listed
above to have VACUUM (rebuild_vm) to blow away the VM fork and rebuild
it, but that's different semantics than what we proposed for VACUUM
(even_frozen_pages). And I'd be sort of inclined to handle that case
by providing some other way to remove VM forks (like a new function in
the pg_visibilitymap contrib module, maybe?) and then just tell people
to run regular VACUUM afterwards, rather than putting the actual VM
fork removal into VACUUM.
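To sketch what I mean (the function name here is purely hypothetical, nothing like it exists today):

```sql
-- Hypothetical recovery sequence: discard the suspect VM fork, then
-- let a plain VACUUM rebuild it from the page-level information.
SELECT pg_truncate_visibility_map('some_table'::regclass);  -- invented name
VACUUM some_table;
```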
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I'm intuitively sympathetic to the idea that we should have an option
for this, but I can't figure out in what case we'd actually tell
anyone to use it. It would be useful for the kinds of bugs listed
above to have VACUUM (rebuild_vm) to blow away the VM fork and rebuild
it, but that's different semantics than what we proposed for VACUUM
(even_frozen_pages). And I'd be sort of inclined to handle that case
by providing some other way to remove VM forks (like a new function in
the pg_visibilitymap contrib module, maybe?) and then just tell people
to run regular VACUUM afterwards, rather than putting the actual VM
fork removal into VACUUM.
There's a lot to be said for that approach. If we do it, I'd be a bit
inclined to offer an option to blow away the FSM as well.
regards, tom lane
On 2016-06-06 05:34:32 -0400, Robert Haas wrote:
On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Attached is a sample patch that controls full page vacuum by new GUC parameter.
Don't we want a reloption for that? Just wondering...
Why? Just for consistency? I think the bigger question here is
whether we need to do anything at all. It's true that, without some
new option, we'll lose the ability to forcibly vacuum every page in
the relation, even if all-frozen. But there's not much use case for
that in the first place. It will be potentially helpful if it turns
out that we have a bug that sets the all-frozen bit on pages that are
not, in fact, all-frozen. Otherwise, what's the use?
Except that we right now don't have any realistic way to figure out
whether this new feature actually does the right thing. Which makes
testing this *considerably* harder than just VACUUM (dwim). I think it's
unacceptable to release this feature without a way that'll tell that it
so far has/has not corrupted the database. Would that, in a perfect
world, be vacuum? No, probably not. But since we're not in a perfect world...
Andres
On Mon, Jun 6, 2016 at 11:28 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-06 05:34:32 -0400, Robert Haas wrote:
On Mon, Jun 6, 2016 at 5:11 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Attached is a sample patch that controls full page vacuum by new GUC parameter.
Don't we want a reloption for that? Just wondering...
Why? Just for consistency? I think the bigger question here is
whether we need to do anything at all. It's true that, without some
new option, we'll lose the ability to forcibly vacuum every page in
the relation, even if all-frozen. But there's not much use case for
that in the first place. It will be potentially helpful if it turns
out that we have a bug that sets the all-frozen bit on pages that are
not, in fact, all-frozen. Otherwise, what's the use?
Except that we right now don't have any realistic way to figure out
whether this new feature actually does the right thing. Which makes
testing this *considerably* harder than just VACUUM (dwim). I think it's
unacceptable to release this feature without a way that'll tell that it
so far has/has not corrupted the database. Would that, in a perfect
world, be vacuum? No, probably not. But since we're not in a perfect world...
I just don't see how running VACUUM on the all-frozen pages is going
to help. In terms of diagnostic tools, you can get the VM bits and
page-level bits using the pg_visibility extension; I wrote it
precisely because of concerns like the ones you raise here. If you
want to cross-check the page-level bits against the tuple-level bits,
you can do that with the pageinspect extension. And if you do those
things, you can actually find out whether stuff is broken. Vacuuming
the all-frozen pages won't tell you that. It will either do nothing
(which doesn't tell you that things are OK) or it will change
something (possibly without reporting any message, and possibly making
a bad situation worse instead of better).
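For example, with pg_visibility installed, looking at the bits is a one-liner (a sketch; 't1' stands in for the table under suspicion):

```sql
-- Inspect the VM bits and the page-level all-visible flag side by side:
SELECT blkno, all_visible, all_frozen, pd_all_visible
FROM pg_visibility('t1'::regclass);
```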
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-06 11:37:25 -0400, Robert Haas wrote:
On Mon, Jun 6, 2016 at 11:28 AM, Andres Freund <andres@anarazel.de> wrote:
Except that we right now don't have any realistic way to figure out
whether this new feature actually does the right thing. Which makes
testing this *considerably* harder than just VACUUM (dwim). I think it's
unacceptable to release this feature without a way that'll tell that it
so far has/has not corrupted the database. Would that, in a perfect
world, be vacuum? No, probably not. But since we're not in a perfect world...
I just don't see how running VACUUM on the all-frozen pages is going
to help.
Because we can tell people in the beta2 announcement or some wiki page
"please run VACUUM(scan_all)" and check whether it emits WARNINGs. And
if we suspect freeze map in bug reports, we can just ask reporters to
run a VACUUM (scan_all).
In terms of diagnostic tools, you can get the VM bits and
page-level bits using the pg_visibility extension; I wrote it
precisely because of concerns like the ones you raise here. If you
want to cross-check the page-level bits against the tuple-level bits,
you can do that with the pageinspect extension. And if you do those
things, you can actually find out whether stuff is broken.
That's WAY out of reach of any "normal users". Adding a vacuum option
is doable, writing complex queries is not.
Vacuuming the all-frozen pages won't tell you that. It will either do
nothing (which doesn't tell you that things are OK) or it will change
something (possibly without reporting any message, and possibly making
a bad situation worse instead of better).
We found a number of bugs for the equivalent issues in all-visible
handling via the vacuum error reporting around those.
Greetings,
Andres Freund
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jun 6, 2016 at 11:28 AM, Andres Freund <andres@anarazel.de> wrote:
Except that we right now don't have any realistic way to figure out
whether this new feature actually does the right thing.
I just don't see how running VACUUM on the all-frozen pages is going
to help.
Yes. I don't see that any of the proposed features would be very useful
for answering the question "is my VM incorrect". Maybe they would fix
problems, and maybe not, but in any case you couldn't rely on VACUUM
to tell you about a problem. (Even if you've got warning messages in
there, they might disappear into the postmaster log during an
auto-vacuum. Warning messages in VACUUM are not a good debugging
technology.)
regards, tom lane
On Mon, Jun 6, 2016 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote:
In terms of diagnostic tools, you can get the VM bits and
page-level bits using the pg_visibility extension; I wrote it
precisely because of concerns like the ones you raise here. If you
want to cross-check the page-level bits against the tuple-level bits,
you can do that with the pageinspect extension. And if you do those
things, you can actually find out whether stuff is broken.That's WAY out ouf reach of any "normal users". Adding a vacuum option
is doable, writing complex queries is not.
Why would they have to write the complex query? Wouldn't they just
need to run the one we wrote for them?
I mean, I'm not 100% dead set against this option you want, but in all
honestly, I would never, ever tell anyone to use it. Unleashing
VACUUM on possibly-damaged data is just asking it to decide to prune
away tuples you don't want gone. I would try very hard to come up
with something to give that user that was only going to *read* the
possibly-damaged data with as little chance of modifying or erasing it
as possible.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote:
On Mon, Jun 6, 2016 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote:
In terms of diagnostic tools, you can get the VM bits and
page-level bits using the pg_visibility extension; I wrote it
precisely because of concerns like the ones you raise here. If you
want to cross-check the page-level bits against the tuple-level bits,
you can do that with the pageinspect extension. And if you do those
things, you can actually find out whether stuff is broken.
That's WAY out of reach of any "normal users". Adding a vacuum option
is doable, writing complex queries is not.
Why would they have to write the complex query? Wouldn't they just
need to run that we wrote for them?
I mean, I'm not 100% dead set against this option you want, but in all
honestly, I would never, ever tell anyone to use it. Unleashing
VACUUM on possibly-damaged data is just asking it to decide to prune
away tuples you don't want gone. I would try very hard to come up
with something to give that user that was only going to *read* the
possibly-damaged data with as little chance of modifying or erasing it
as possible.
I certainly agree with this.
We need a read-only utility which checks that the system is in a correct
and valid state. There are a few of those which have been built for
different pieces, I believe, and we really should have one for the
visibility map, but I don't think it makes sense to imply in any way
that VACUUM can or should be used for that.
Thanks!
Stephen
On 2016-06-06 14:24:14 -0400, Stephen Frost wrote:
* Robert Haas (robertmhaas@gmail.com) wrote:
On Mon, Jun 6, 2016 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote:
In terms of diagnostic tools, you can get the VM bits and
page-level bits using the pg_visibility extension; I wrote it
precisely because of concerns like the ones you raise here. If you
want to cross-check the page-level bits against the tuple-level bits,
you can do that with the pageinspect extension. And if you do those
things, you can actually find out whether stuff is broken.
That's WAY out of reach of any "normal users". Adding a vacuum option
is doable, writing complex queries is not.
Why would they have to write the complex query? Wouldn't they just
need to run that we wrote for them?
Then write that query. Verify that that query performs halfway
reasonably fast. Document that it should be run against databases after
subjecting them to tests. That'd address my concern as well.
I mean, I'm not 100% dead set against this option you want, but in all
honestly, I would never, ever tell anyone to use it. Unleashing
VACUUM on possibly-damaged data is just asking it to decide to prune
away tuples you don't want gone. I would try very hard to come up
with something to give that user that was only going to *read* the
possibly-damaged data with as little chance of modifying or erasing it
as possible.
I'm more concerned about actually being able to verify that the freeze
logic does actually something meaningful, in situation where we'd *NOT*
expect any problems. If we're not trusting vacuum in that situation,
well ...
I certainly agree with this.
We need a read-only utility which checks that the system is in a correct
and valid state. There are a few of those which have been built for
different pieces, I believe, and we really should have one for the
visibility map, but I don't think it makes sense to imply in any way
that VACUUM can or should be used for that.
Meh. This is vacuum behaviour that *has existed* up to this point. You
essentially removed it. Sure, I'm all for adding a verification
tool. But that's just pie in the sky at this point. We have a complex,
data loss threatening feature, which just about nobody can verify at
this point. That's crazy.
Greetings,
Andres Freund
On Mon, Jun 6, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote:
Why would they have to write the complex query? Wouldn't they just
need to run that we wrote for them?
Then write that query. Verify that that query performs halfway
reasonably fast. Document that it should be run against databases after
subjecting them to tests. That'd address my concern as well.
You know, I am starting to lose a teeny bit of patience here. I do
appreciate you reviewing this code, very much, and genuinely, and it
would be great if more people wanted to review it. But this kind of
reads like you think that I'm being a jerk, which I'm trying pretty
hard not to be, and like you have the right to assign me
arbitrary work, which I think you don't. If you want to have a
reasonable conversation about what the options are for making this
better, great. If you want to me to do some work to help improve
things on a patch I committed, that is 100% fair. But I don't know
what I did to earn this response which, to me, reads as rather
demanding and rather exasperated.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2016-06-06 15:16:10 -0400, Robert Haas wrote:
On Mon, Jun 6, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote:
Why would they have to write the complex query? Wouldn't they just
need to run that we wrote for them?
Then write that query. Verify that that query performs halfway
reasonably fast. Document that it should be run against databases after
subjecting them to tests. That'd address my concern as well.
You know, I am starting to lose a teeny bit of patience here.
Same here.
I do appreciate you reviewing this code, very much, and genuinely, and
it would be great if more people wanted to review it.
But this kind of reads like you think that I'm being a jerk, which I'm
trying pretty hard not to be
I don't think you're a jerk. But I am losing a good bit of my patience
here. I've posted these issues a month ago, and for a long while the
only thing that happened was bikeshedding about the name of something
that wasn't even decided to happen yet (obviously said bikeshedding
isn't your fault).
and like you have the right to tell assign me arbitrary work, which I
think you don't.
It's not like adding a parameter for this would be a lot of work,
there's even a patch out there. I'm getting impatient because I feel
the issue of this critical feature not being testable is getting ignored
and/or played down. And then sidetracked into a general "let's add a
database consistency checker" type discussion. Which we need, but won't
get in 9.6.
If you say: "I agree with the feature in principle, but I don't want to
spend time to review/commit it." - ok, that's fair enough. But at the
moment that isn't what I'm reading between the lines.
If you want to have a
reasonable conversation about what the options are for making this
better, great.
Yes, I want that.
If you want me to do some work to help improve things on a patch I
committed, that is 100% fair. But I don't know what I did to earn
this response which, to me, reads as rather demanding and rather
exasperated.
I don't think it's absurd to make some demands on the committer of an
impact-heavy feature, about at least finding a realistic path towards
the new feature being realistically testable. This is a scary (but
*REALLY IMPORTANT*) patch, and I don't understand why it's ok that we
can't push it through a couple wraparounds under high concurrency, and
easily verify that the freeze map is in sync with the actual data.
And yes, I *am* exasperated that I'm the only one who appears to be
scared by the lack of that capability. I think the feature is in a
*lot* better shape than multixacts, but it certainly has the potential
to do even more damage in ways that'll essentially be unrecoverable.
Andres
Andres, all,
* Andres Freund (andres@anarazel.de) wrote:
On 2016-06-06 15:16:10 -0400, Robert Haas wrote:
On Mon, Jun 6, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote:
and like you have the right to tell assign me arbitrary work, which I
think you don't.
It's not like adding a parameter for this would be a lot of work,
there's even a patch out there. I'm getting impatient because I feel
the issue of this critical feature not being testable is getting ignored
and/or played down. And then sidetracked into a general "let's add a
database consistency checker" type discussion. Which we need, but won't
get in 9.6.
To be clear, I was pointing out that we've had similar types of
consistency checkers implemented for other big features (eg: Heikki's
work on checking that WAL works) and that it'd be good to have one here
also.
That could be as simple as a query with the right things installed, or
it might be an independent tool, but not having any way to check isn't
good. That said, trying to make VACUUM do that doesn't make sense to me
either.
Perhaps that's not an option due to the lateness of the hour or the lack
of manpower behind it, but that doesn't seem to be what has been said so
far.
If you want to me to do some work to help improve things on a patch I
committed, that is 100% fair. But I don't know what I did to earn
this response which, to me, reads as rather demanding and rather
exasperated.
I don't think it's absurd to make some demands on the committer of a
impact-heavy feature, about at least finding a realistic path towards
the new feature being realistically testable. This is a scary (but
*REALLY IMPORTANT*) patch, and I don't understand why it's ok that we
can't push a it through a couple wraparounds under high concurrency, and
easily verify that the freeze map is in sync with the actual data.
And yes, I *am* exasperated, that I'm the only one that appears to be
scared by the lack of that capability. I think the feature is in a
*lot* better shape than multixacts, but it certainly has the potential
to do even more damage in ways that'll essentially be unrecoverable.
Not having a straightforward way to ensure that it's working properly is
certainly concerning to me as well.
Thanks!
Stephen
On 2016-06-06 16:18:19 -0400, Stephen Frost wrote:
That could be as simple as a query with the right things installed, or
it might be an independent tool, but not having any way to check isn't
good. That said, trying to make VACUUM do that doesn't make sense to me
either.
The point is that VACUUM *has* these types of checks. And had so for
many years:
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
         && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
    elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
         relname, blkno);
    visibilitymap_clear(onerel, blkno, vmbuffer);
}
...
else if (PageIsAllVisible(page) && has_dead_tuples)
{
    elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
         relname, blkno);
    PageClearAllVisible(page);
    MarkBufferDirty(buf);
    visibilitymap_clear(onerel, blkno, vmbuffer);
}
the point is that, after the introduction of the freeze bit, there's no
way to reach them anymore (and they're missing a useful extension of
these warnings, but ...); these warnings have caught bugs. I don't
think I'd advocate for the vacuum option otherwise.
Greetings,
Andres Freund
On Mon, Jun 6, 2016 at 4:06 PM, Andres Freund <andres@anarazel.de> wrote:
I do appreciate you reviewing this code, very much, and genuinely, and
it would be great if more people wanted to review it.
But this kind of reads like you think that I'm being a jerk, which I'm
trying pretty hard not to be
I don't think you're a jerk. But I am losing a good bit of my patience
here. I've posted these issues a month ago, and for a long while the
only thing that happened was bikeshedding about the name of something
that wasn't even decided to happen yet (obviously said bikeshedding
isn't your fault).
No, the bikeshedding is not my fault.
As for the timing, you posted your first comments exactly a week
before beta1 when I was still busy addressing issues that were
reported before you reported yours, and I did not think it was
realistic to get them addressed in the time available. If you'd sent
them two weeks sooner, I would probably have done so. Now, it's been
four weeks since beta1 wrapped, one of which was PGCon. As far as I
understand at this point in time, your review identified exactly zero
potential data loss bugs. (We thought there was one, but it looks
like there isn't.) All of the non-critical defects you identified
have now been fixed, apart from the lack of a better testing tool.
And since there is ongoing discussion (call it bikeshedding if you
want) about what would actually help in that area, I really don't feel
like anything very awful is happening here.
I really don't understand how you can not weigh in on the original
thread leading up to my mid-March commits and say "hey, this needs a
better testing tool", and then when you finally get around to
reviewing it in May, I'm supposed to drop everything and write one
immediately. Why do you get two months from the time of commit to
weigh in but I get no time to respond? For my part, I thought I *had*
written a testing tool - that's what pg_visibility is and that's what
I used to test the feature before committing it. Now, you think
that's not good enough, and I respect your opinion, but it's not as if
you said this back when this was being committed. Or at least if you
did, I don't remember it.
and like you have the right to tell assign me arbitrary work, which I
think you don't.
It's not like adding a parameter for this would be a lot of work,
there's even a patch out there. I'm getting impatient because I feel
the issue of this critical feature not being testable is getting ignored
and/or played down. And then sidetracked into a general "let's add a
database consistency checker" type discussion. Which we need, but won't
get in 9.6.
I know there's a patch. Both Tom and I are skeptical about whether it
adds value, and I really don't think you've spelled out in as much
detail why you think it will help as I have why I think it won't.
Initially, I was like "ok, sure, we should have that", but the more I
thought about it (another advantage of time passing: you can think
about things more) the less convinced I was that it did anything
useful. I don't think that's very unreasonable. The importance of
the feature is exactly why we *should* think carefully about what is
best here and not just do the first thing that pops into our head.
If you say: "I agree with the feature in principle, but I don't want to
spend time to review/commit it." - ok, that's fair enough. But at the
moment that isn't what I'm reading between the lines.
No, what I'm saying is "I'm not confident that this feature adds
value, and I'm afraid that by adding it we are making ourselves feel
better without solving any real problem". I'm also saying "let's try
to agree on what problems we need to solve first and then decide on
the solutions".
If you want to have a
reasonable conversation about what the options are for making this
better, great.Yes, I want that.
Great.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 6, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote:
Why would they have to write the complex query? Wouldn't they just
need to run that we wrote for them?
Then write that query. Verify that that query performs halfway
reasonably fast. Document that it should be run against databases after
subjecting them to tests. That'd address my concern as well.
Here is a first attempt at such a query. It requires that the
pageinspect and pg_visibility extensions be installed.
SELECT c.oid, v.blkno, array_agg(hpi.lp) AS affect_lps
FROM pg_class c,
     LATERAL ROWS FROM (pg_visibility(c.oid)) v,
     LATERAL ROWS FROM
       (heap_page_items(get_raw_page(c.oid::regclass::text, blkno::int4))) hpi
WHERE c.relkind IN ('r', 't', 'm')
  AND v.all_frozen
  AND (((hpi.t_infomask & 768) != 768 AND hpi.t_xmin NOT IN (1, 2))
       OR (hpi.t_infomask & 2048) != 2048)
GROUP BY 1, 2 ORDER BY 1, 2;
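For anyone decoding the magic numbers (assuming I've read htup_details.h correctly):

```sql
-- The infomask tests above, spelled out (names from access/htup_details.h):
--   t_infomask & 768  = HEAP_XMIN_COMMITTED (0x0100) | HEAP_XMIN_INVALID (0x0200);
--                       both bits set together mean HEAP_XMIN_FROZEN
--   t_infomask & 2048 = HEAP_XMAX_INVALID (0x0800)
--   t_xmin IN (1, 2)  = BootstrapTransactionId / FrozenTransactionId
-- So a tuple is flagged when its xmin is neither frozen-by-flag nor a
-- permanent XID, or its xmax is not marked invalid.
```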
I am not sure this is 100% correct, especially the XMAX-checking part:
is HEAP_XMAX_INVALID guaranteed to be set on a fully-frozen tuple? Is
the method of constructing the first argument to get_raw_page() going
to be robust in all cases?
I'm not sure what the performance will be on a large table, either.
That will have to be checked. And I obviously have not done extensive
stress runs yet. But maybe it's a start. Comments?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 6, 2016 at 11:35 AM, Andres Freund <andres@anarazel.de> wrote:
We need a read-only utility which checks that the system is in a correct
and valid state. There are a few of those which have been built for
different pieces, I believe, and we really should have one for the
visibility map, but I don't think it makes sense to imply in any way
that VACUUM can or should be used for that.
Meh. This is vacuum behaviour that *has existed* up to this point. You
essentially removed it. Sure, I'm all for adding a verification
tool. But that's just pie in the skie at this point. We have a complex,
data loss threatening feature, which just about nobody can verify at
this point. That's crazy.
FWIW, I agree with the general sentiment. Building a stress-testing
suite would have been a good idea. In general, testability is a design
goal that I'd be willing to give up other things for.
--
Peter Geoghegan
On Mon, Jun 6, 2016 at 4:27 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-06 16:18:19 -0400, Stephen Frost wrote:
That could be as simple as a query with the right things installed, or
it might be an independent tool, but not having any way to check isn't
good. That said, trying to make VACUUM do that doesn't make sense to me
either.
The point is that VACUUM *has* these types of checks. And had so for
many years:
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
&& VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
...
else if (PageIsAllVisible(page) && has_dead_tuples)
{
elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_clear(onerel, blkno, vmbuffer);
}the point is that, after the introduction of the freeze bit, there's no
way to reach them anymore (and they're missing a useful extension of
these warnings, but ...); these warnings have caught bugs. I don't
think it'd advocate for the vacuum option otherwise.
So a couple of things:
1. I think it is pretty misleading to say that those checks aren't
reachable any more. It's not like we freeze every page when we mark
it all-visible. In most cases, I think that what will happen is that
the page will be marked all-visible and then, because it is
all-visible, skipped by subsequent vacuums, so that it doesn't get
marked all-frozen until a few hundred million transactions later. Of
course there will be some cases when a page gets marked all-visible
and all-frozen at the same time, but I don't see why we should expect
that to be the norm.
2. With the new pg_visibility extension, you can actually check the
same thing that first warning checks like this:
select * from pg_visibility('t1'::regclass) where all_visible and not
pd_all_visible;
IMHO, that's a substantial improvement over running VACUUM and
checking whether it spits out a WARNING.
The second one, you can't currently trigger for all-frozen pages. The
query I just sent in my other email could perhaps be adapted to that
purpose, but maybe this is a good-enough reason to add VACUUM
(even_frozen_pages).
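In the meantime, pg_visibility alone can at least catch one impossible state (a sketch, relying on the invariant that an all-frozen page must also be all-visible):

```sql
-- A page marked all-frozen but not all-visible in the VM is inconsistent:
SELECT blkno
FROM pg_visibility('t1'::regclass)
WHERE all_frozen AND NOT all_visible;
```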
3. If you think there are analogous checks that I should add for the
frozen case, or that you want to add yourself, please say what they
are specifically. I *did* think about it when I wrote that code and I
didn't see how to make it work. If I had, I would have added them.
The whole point of review here is, hopefully, to illuminate what
should have been done differently - if I'd known how to do it better,
I would have done so. Provide an idea, or better yet, provide a
patch. If you see how to do it, coding it up shouldn't be the hard
part.
Thanks,
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-06 16:41:19 -0400, Robert Haas wrote:
I really don't understand how you can not weigh in on the original
thread leading up to my mid-March commits and say "hey, this needs a
better testing tool", and then when you finally get around to
reviewing it in May, I'm supposed to drop everything and write one
immediately.
Meh. Asking you to "drop everything" and starting to push a month later
are very different things. The reason I'm pushing is because this atm
seems likely to slip enough that we'll decide "can't do this for
9.6". And I think that'd be seriously bad.
Why do you get two months from the time of commit to weigh in but I
get no time to respond?
Really? You've started to apply pressure to fix things days after
they've been discovered. It's been a month.
For my part, I thought I *had*
written a testing tool - that's what pg_visibility is and that's what
I used to test the feature before committing it.
I think looking only at page level data, and not at row level data, is
insufficient. And I think we need to make $tool output the data in a way
that only returns data if things are wrong (that can be a pre-canned
query).
Now, you think that's not good enough, and I respect your opinion, but
it's not as if you said this back when this was being committed. Or
at least if you did, I don't remember it.
I think I mentioned testing ages ago, but not around the commit, no. I
kind of had assumed that it was there. I don't think that's really
relevant though. Backend flushing was discussed and benchmarked over
months as well; and while I don't agree with your conclusion, it's
absolutely sane of you to push for changing the default on that; even if
you didn't immediately push back.
I know there's a patch. Both Tom and I are skeptical about whether it
adds value, and I really don't think you've spelled out in as much
detail why you think it will help as I have why I think it won't.
The primary reason I think it'll help is that it allows users/testers to
run a simple one-line command (VACUUM (scan_all);) in their database, and
they'll get a clear "WARNING: XXX is bad" message if something's broken,
and nothing if things are ok. Vacuum isn't a bad place for that,
because it'll be the place that removes dead item pointers and such if
things were wrongly labeled; and because we historically have emitted
warnings from there. The more complex stuff we ask testers to run, the
less likely it is that they'll actually do that.
I'd also be ok with adding & documenting (beta release notes)
CREATE EXTENSION pg_visibility;
SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT pg_check_visibility(oid);
or something along those lines.
Greetings,
Andres Freund
On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-06 16:41:19 -0400, Robert Haas wrote:
I really don't understand how you can not weigh in on the original
thread leading up to my mid-March commits and say "hey, this needs a
better testing tool", and then when you finally get around to
reviewing it in May, I'm supposed to drop everything and write one
immediately.
Meh. Asking you to "drop everything" and starting to push a month later
are very different things. The reason I'm pushing is because this atm
seems likely to slip enough that we'll decide "can't do this for
9.6". And I think that'd be seriously bad.
To be clear, I'm not objecting to you pushing on this. I just think
your tone sounds a bit, uh, antagonized.
Why do you get two months from the time of commit to weigh in but I
get no time to respond?
Really? You've started to apply pressure to fix things days after
they've been discovered. It's been a month.
Yes, it would have been nice if I had gotten to this one sooner. But
it's not like you said "hey, hurry up" before I started working on it.
You waited until I did start working on it and *then* complained that
I didn't get to it sooner. I cannot rewind time.
For my part, I thought I *had*
written a testing tool - that's what pg_visibility is and that's what
I used to test the feature before committing it.
I think looking only at page level data, and not at row level data, is
insufficient. And I think we need to make $tool output the data in a way
that only returns data if things are wrong (that can be a pre-canned
query).
OK. I didn't think that was necessary, but it sure can't hurt.
I know there's a patch. Both Tom and I are skeptical about whether it
adds value, and I really don't think you've spelled out in as much
detail why you think it will help as I have why I think it won't.
The primary reason I think it'll help is that it allows users/testers to
run a simple one-line command (VACUUM (scan_all);) in their database, and
they'll get a clear "WARNING: XXX is bad" message if something's broken,
and nothing if things are ok. Vacuum isn't a bad place for that,
because it'll be the place that removes dead item pointers and such if
things were wrongly labeled; and because we historically have emitted
warnings from there. The more complex stuff we ask testers to run, the
less likely it is that they'll actually do that.
OK, now I understand. Let's see if there is general agreement on this
and then we can decide how to proceed. I think the main danger here
is that people will think that this option is more useful than it
really is and start using it in all kinds of cases where it isn't
really necessary in the hopes that it will fix problems it really
can't fix. I think we need to write the documentation in such a way
as to be deeply discouraging to people who might otherwise be prone to
unwarranted optimism. Otherwise, 5 years from now, we're going to be
fielding complaints from people who are unhappy that there's no way to
make autovacuum run with (even_frozen_pages true).
I'd also be ok with adding & documenting (beta release notes)
CREATE EXTENSION pg_visibility;
SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT pg_check_visibility(oid);
or something along those lines.
That wouldn't be too useful as-written in my book, because it gives
you no detail on what exactly the problem was. Maybe it could be
"pg_check_visibility(regclass) RETURNS SETOF tid", where the returned
TIDs are non-frozen TIDs on frozen pages. Then I think something like
this would work:
SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class WHERE relkind
IN ('r', 't', 'm');
If you get any rows back, you've got trouble.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2016-06-06 17:00:19 -0400, Robert Haas wrote:
1. I think it is pretty misleading to say that those checks aren't
reachable any more. It's not like we freeze every page when we mark
it all-visible.
True. What I mean is that you can't force the checks (and some that I
think should be added) to occur anymore. Once a page is frozen it'll be
kinda hard to predict whether vacuum touches it (due to the skip logic).
2. With the new pg_visibility extension, you can actually check the
same thing that first warning checks like this:
select * from pg_visibility('t1'::regclass) where all_visible and not
pd_all_visible;
Right, but not the second.
IMHO, that's a substantial improvement over running VACUUM and
checking whether it spits out a WARNING.
I think it's a mixed bag. I do think that WARNINGS are a lot easier to
understand for a casual user/tester; rather than having to write/copy
queries which return results where you don't know what the expected
result is. I agree that it's better to have that in a non-modifying way
- although I'm afraid atm it's not really possible to do a
HeapTupleSatisfies* without modifications :(.
3. If you think there are analogous checks that I should add for the
frozen case, or that you want to add yourself, please say what they
are specifically. I *did* think about it when I wrote that code and I
didn't see how to make it work. If I had, I would have added them.
The whole point of review here is, hopefully, to illuminate what
should have been done differently - if I'd known how to do it better,
I would have done so. Provide an idea, or better yet, provide a
patch. If you see how to do it, coding it up shouldn't be the hard
part.
I think it's pretty important (and not hard) to add a check for
(all_frozen_according_to_vm && has_unfrozen_tuples). Checking for
VM_ALL_FROZEN && !VM_ALL_VISIBLE looks worthwhile as well, especially as
we could check that always, without a measurable overhead. But the
former primarily makes sense if we have a way to force the check to
occur in a way that's not dependent on the state of neighbouring pages.
Greetings,
Andres Freund
Hi,
On 2016-06-06 17:22:38 -0400, Robert Haas wrote:
I'd also be ok with adding & documenting (beta release notes)
CREATE EXTENSION pg_visibility;
SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT pg_check_visibility(oid);
or something along those lines.
That wouldn't be too useful as-written in my book, because it gives
you no detail on what exactly the problem was.
True. I don't think that's a big issue though, because we'd likely want
a lot more detail after a report anyway; to analyze things properly.
Maybe it could be
"pg_check_visibility(regclass) RETURNS SETOF tid", where the returned
TIDs are non-frozen TIDs on frozen pages. Then I think something like
this would work:
SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class WHERE relkind
IN ('r', 't', 'm');
If you get any rows back, you've got trouble.
That'd work too; with the slight danger of returning way too much data.
- Andres
On Tue, Jun 7, 2016 at 2:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <andres@anarazel.de> wrote:
I'd also be ok with adding & documenting (beta release notes)
CREATE EXTENSION pg_visibility;
SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT
pg_check_visibility(oid);
or something along those lines.
That wouldn't be too useful as-written in my book, because it gives
you no detail on what exactly the problem was. Maybe it could be
"pg_check_visibility(regclass) RETURNS SETOF tid", where the returned
TIDs are non-frozen TIDs on frozen pages. Then I think something like
this would work:
SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class WHERE relkind
IN ('r', 't', 'm');
I have implemented the above function in the attached patch. Currently, it
returns SETOF tupleids, but if we want some variant of the same, that should
also be possible.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
pg_check_visibility_func_v1.patch (application/octet-stream)
diff --git a/contrib/pg_visibility/pg_visibility--1.0.sql b/contrib/pg_visibility/pg_visibility--1.0.sql
index da511e5..d0bdeca 100644
--- a/contrib/pg_visibility/pg_visibility--1.0.sql
+++ b/contrib/pg_visibility/pg_visibility--1.0.sql
@@ -44,9 +44,16 @@ RETURNS record
AS 'MODULE_PATHNAME', 'pg_visibility_map_summary'
LANGUAGE C STRICT;
+-- Show the tids of non-frozen tuples, if any, on all-frozen pages of a relation.
+CREATE FUNCTION pg_check_visibility(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_visibility'
+LANGUAGE C STRICT;
+
-- Don't want these to be available to public.
REVOKE ALL ON FUNCTION pg_visibility_map(regclass, bigint) FROM PUBLIC;
REVOKE ALL ON FUNCTION pg_visibility(regclass, bigint) FROM PUBLIC;
REVOKE ALL ON FUNCTION pg_visibility_map(regclass) FROM PUBLIC;
REVOKE ALL ON FUNCTION pg_visibility(regclass) FROM PUBLIC;
REVOKE ALL ON FUNCTION pg_visibility_map_summary(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_check_visibility(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 5e5c7cc..51663e6 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -25,14 +25,23 @@ typedef struct vbits
uint8 bits[FLEXIBLE_ARRAY_MEMBER];
} vbits;
+typedef struct tupleids
+{
+ BlockNumber next;
+ BlockNumber count;
+ ItemPointer tids;
+} tupleids;
+
PG_FUNCTION_INFO_V1(pg_visibility_map);
PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
PG_FUNCTION_INFO_V1(pg_visibility);
PG_FUNCTION_INFO_V1(pg_visibility_rel);
PG_FUNCTION_INFO_V1(pg_visibility_map_summary);
+PG_FUNCTION_INFO_V1(pg_check_visibility);
static TupleDesc pg_visibility_tupdesc(bool include_blkno, bool include_pd);
static vbits *collect_visibility_data(Oid relid, bool include_pd);
+static tupleids *collect_nonfrozen_items(Oid relid);
/*
* Visibility map information for a single block of a relation.
@@ -259,6 +268,36 @@ pg_visibility_map_summary(PG_FUNCTION_ARGS)
}
/*
+ * Return the tids of non-frozen tuples present in all-frozen pages. Any
+ * such tids indicate corrupt tuples.
+ */
+Datum
+pg_check_visibility(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ tupleids *info;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ Oid relid = PG_GETARG_OID(0);
+ MemoryContext oldcontext;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ funcctx->user_fctx = collect_nonfrozen_items(relid);
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+ info = (tupleids *) funcctx->user_fctx;
+
+ if (info->next < info->count)
+ SRF_RETURN_NEXT(funcctx, PointerGetDatum(&info->tids[info->next++]));
+
+ SRF_RETURN_DONE(funcctx);
+}
+
+/*
* Helper function to construct whichever TupleDesc we need for a particular
* call.
*/
@@ -348,3 +387,95 @@ collect_visibility_data(Oid relid, bool include_pd)
return info;
}
+
+/*
+ * Collect non-frozen items on the all-frozen pages of a relation.
+ */
+static tupleids *
+collect_nonfrozen_items(Oid relid)
+{
+ Relation rel;
+ BlockNumber nblocks;
+ tupleids *info;
+ BlockNumber blkno;
+ uint64 nallocated;
+ uint64 count_non_frozen = 0;
+ Buffer vmbuffer = InvalidBuffer;
+ BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+ rel = relation_open(relid, AccessShareLock);
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /*
+ * Guess an initial array size; we don't expect many corrupted tuples, so
+ * start with a small number. Sizing it as MaxHeapTuplesPerPage allows us
+ * to check whether the tids array needs to be enlarged at the page level
+ * rather than at the tuple level.
+ */
+ nallocated = MaxHeapTuplesPerPage;
+ info = palloc0(sizeof(tupleids));
+ info->tids = palloc(nallocated * sizeof(ItemPointerData));
+ info->next = 0;
+
+ for (blkno = 0; blkno < nblocks; ++blkno)
+ {
+ /* Make sure we are interruptible. */
+ CHECK_FOR_INTERRUPTS();
+
+ /* enlarge output array if needed. */
+ if (count_non_frozen >= nallocated)
+ {
+ nallocated *= 2;
+ info->tids = repalloc(info->tids, nallocated * sizeof(ItemPointerData));
+ }
+
+ /* collect the non-frozen tuples on a frozen page. */
+ if (VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+ {
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum,
+ maxoff;
+
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ bstrategy);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ HeapTupleHeader tuphdr;
+ ItemId itemid;
+
+ itemid = PageGetItemId(page, offnum);
+
+ /* Unused or redirect line pointers are of no interest. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid))
+ continue;
+
+ tuphdr = (HeapTupleHeader) PageGetItem(page, itemid);
+ if (heap_tuple_needs_eventual_freeze(tuphdr))
+ {
+ info->tids[count_non_frozen] = tuphdr->t_ctid;
+ ++count_non_frozen;
+ }
+ }
+
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* Clean up. */
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
+ relation_close(rel, AccessShareLock);
+
+ info->count = count_non_frozen;
+
+ return info;
+}
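With this patch applied, the check would boil down to a single query per relation (table name hypothetical); zero rows would mean the check found nothing suspicious:

```sql
-- list non-frozen tuples sitting on pages the VM claims are all-frozen
SELECT t_ctid FROM pg_check_visibility('t1'::regclass);
```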
On Tue, Jun 7, 2016 at 11:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 7, 2016 at 2:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <andres@anarazel.de> wrote:
I'd also be ok with adding & documenting (beta release notes)
CREATE EXTENSION pg_visibility;
SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT
pg_check_visibility(oid);
or something along those lines.
That wouldn't be too useful as-written in my book, because it gives
you no detail on what exactly the problem was. Maybe it could be
"pg_check_visibility(regclass) RETURNS SETOF tid", where the returned
TIDs are non-frozen TIDs on frozen pages. Then I think something like
this would work:
SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class WHERE relkind
IN ('r', 't', 'm');
I have implemented the above function in attached patch. Currently, it
returns SETOF tupleids, but if we want some variant of same, that should
also be possible.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thank you for implementing the patch.
I've not tested it deeply, but here are some comments.
This check tool only checks whether a frozen page has a live, unfrozen tuple.
That is, it doesn't cover the case where an all-frozen page mistakenly
contains a frozen dead tuple.
I think this tool should check such case, otherwise the function name
would need to be changed.
+ /* Clean up. */
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
I think that we should use BufferIsValid() here.
Regards,
--
Masahiko Sawada
On 2016-06-07 19:49:59 +0530, Amit Kapila wrote:
On Tue, Jun 7, 2016 at 2:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 6, 2016 at 5:06 PM, Andres Freund <andres@anarazel.de> wrote:
I'd also be ok with adding & documenting (beta release notes)
CREATE EXTENSION pg_visibility;
SELECT relname FROM pg_class WHERE relkind IN ('r', 'm') AND NOT pg_check_visibility(oid);
or something along those lines.
That wouldn't be too useful as-written in my book, because it gives
you no detail on what exactly the problem was. Maybe it could be
"pg_check_visibility(regclass) RETURNS SETOF tid", where the returned
TIDs are non-frozen TIDs on frozen pages. Then I think something like
this would work:
SELECT c.oid, pg_check_visibility(c.oid) FROM pg_class WHERE relkind
IN ('r', 't', 'm');
I have implemented the above function in attached patch. Currently, it
returns SETOF tupleids, but if we want some variant of same, that should
also be possible.
Cool!
I think if we go with the pg_check_visibility approach, we should also
copy the other consistency checks from vacuumlazy.c, given they can't
easily be triggered. Wonder how we can report both block and tuple
level issues. Kinda inclined to report everything as a block level
issue?
Regards,
Andres
On 6/6/16 3:57 PM, Peter Geoghegan wrote:
On Mon, Jun 6, 2016 at 11:35 AM, Andres Freund <andres@anarazel.de> wrote:
We need a read-only utility which checks that the system is in a correct
and valid state. There are a few of those which have been built for
different pieces, I believe, and we really should have one for the
visibility map, but I don't think it makes sense to imply in any way
that VACUUM can or should be used for that.
Meh. This is vacuum behaviour that *has existed* up to this point. You
essentially removed it. Sure, I'm all for adding a verification
tool. But that's just pie in the sky at this point. We have a complex,
data loss threatening feature, which just about nobody can verify at
this point. That's crazy.
FWIW, I agree with the general sentiment. Building a stress-testing
suite would have been a good idea. In general, testability is a design
goal that I'd be willing to give up other things for.
Related to that, I suspect it would be helpful if it was possible to
test boundary cases in this kind of critical code by separating the
logic from the underlying implementation. It becomes very hard to verify
the system does the right thing in some of these scenarios, because it's
so difficult to put the system into that state to begin with. Stuff that
depends on burning through a large number of XIDs is an example of that.
(To be clear, I'm talking about unit-test kind of stuff here, not
validating an existing system.)
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Tue, Jun 7, 2016 at 10:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I have implemented the above function in attached patch. Currently, it
returns SETOF tupleids, but if we want some variant of same, that should
also be possible.
I think we'd want to bump the pg_visibility version to 1.1 and do the
upgrade dance, since the existing thing was in beta1.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 7, 2016 at 10:10 PM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Thank you for implementing the patch.
I've not test it deeply but here are some comments.
This check tool only checks if the frozen page has live-unfrozen tuple.
That is, it doesn't care in case where the all-frozen page mistakenly
has dead-frozen tuple.
Do you mean to say that we should have a check for ItemIdIsDead() and then,
if an item is found to be dead, add it to the array of non-frozen items? If
so, earlier I thought we might not need this check as we are already using
heap_tuple_needs_eventual_freeze(), but looking at it again, it seems
wise to check for dead items separately, as those won't be covered by the
other check.
+ /* Clean up. */
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
I think that we should use BufferIsValid() here.
We can use BufferIsValid() as well, but I am trying to be consistent with
nearby code; refer to collect_visibility_data(). We can change all the
places together if people prefer that way.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 8, 2016 at 8:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jun 7, 2016 at 10:19 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
I have implemented the above function in attached patch. Currently, it
returns SETOF tupleids, but if we want some variant of same, that should
also be possible.
I think we'd want to bump the pg_visibility version to 1.1 and do the
upgrade dance, since the existing thing was in beta1.
Okay, will do it in the next version of the patch.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 7, 2016 at 11:01 PM, Andres Freund <andres@anarazel.de> wrote:>
I think if we go with the pg_check_visibility approach, we should also
copy the other consistency checks from vacuumlazy.c, given they can't
easily be triggered.
Are you referring to checks that are done in lazy_scan_heap() for each
block? I think the meaningful checks in this context could be (a) page
is marked as visible, but corresponding vm is not marked. (b) page is
marked as all visible and has dead tuples. (c) vm bit indicates frozen, but
page contains non-frozen tuples.
I think right now the design of pg_visibility is such that it returns the
required information at page level to user by means of various functions
like pg_visibility, pg_visibility_map, etc. If we want to add page level
checks in this new routine as well, then we have to think what should be
the output if such checks fails, shall we issue warning, shall we return
information in some other way. Also, I think there will be some duplicity
with the already provided information via other functions of this module.
Wonder how we can report both block and tuple
level issues. Kinda inclined to report everything as a block level
issue?
The way currently this module provides information, it seems better to have
separate API's for block and tuple level inconsistency. For block level, I
think most of the information can be retrieved by existing API's and for
tuple level, this new API can be used.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-06-08 10:04:56 +0530, Amit Kapila wrote:
On Tue, Jun 7, 2016 at 11:01 PM, Andres Freund <andres@anarazel.de> wrote:>
I think if we go with the pg_check_visibility approach, we should also
copy the other consistency checks from vacuumlazy.c, given they can't
easily be triggered.
Are you referring to checks that are done in lazy_scan_heap() for each
block?
Yes.
I think the meaningful checks in this context could be (a) page
is marked as visible, but corresponding vm is not marked. (b) page is
marked as all visible and has dead tuples. (c) vm bit indicates frozen, but
page contains non-frozen tuples.
Yes.
I think right now the design of pg_visibility is such that it returns the
required information at page level to user by means of various functions
like pg_visibility, pg_visibility_map, etc. If we want to add page level
checks in this new routine as well, then we have to think what should be
the output if such checks fails, shall we issue warning, shall we return
information in some other way.
Right.
Also, I think there will be some duplicity
with the already provided information via other functions of this module.
Don't think that's a problem. One part of the functionality then is
returning the available information, the other is checking for problems
and only returning problematic blocks.
Wonder how we can report both block and tuple
level issues. Kinda inclined to report everything as a block level
issue?The way currently this module provides information, it seems better to have
separate API's for block and tuple level inconsistency. For block level, I
think most of the information can be retrieved by existing API's and for
tuple level, this new API can be used.
I personally think simplicity is more important than detail here; but
it's not that important. If this reports a problem, you can look into
the nitty gritty using existing functions.
Andres
On Wed, Jun 8, 2016 at 12:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 7, 2016 at 10:10 PM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Thank you for implementing the patch.
I've not test it deeply but here are some comments.
This check tool only checks if the frozen page has live-unfrozen tuple.
That is, it doesn't care in case where the all-frozen page mistakenly
has dead-frozen tuple.
Do you mean to say that we should have a check for ItemIdIsDead() and then
if item is found to be dead, then add it to array of non_frozen items?
Yes.
If so, earlier I thought we might not need this check as we are already using
heap_tuple_needs_eventual_freeze(),
You're right. Sorry, I had misunderstood.
but now again looking at it, it seems
wise to check for dead items separately as those won't be covered by other
check.
Sounds good.
+ /* Clean up. */
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
I think that we should use BufferIsValid() here.
We can use BufferIsValid() as well, but I am trying to be consistent with
nearby code, refer collect_visibility_data(). We can change at all places
together if people prefer that way.
In vacuumlazy.c we use it like BufferIsValid(vmbuffer), so I think we
can replace all of these to be safer, unless there is a specific reason
not to.
Regards,
--
Masahiko Sawada
On Wed, Jun 8, 2016 at 11:39 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-08 10:04:56 +0530, Amit Kapila wrote:
On Tue, Jun 7, 2016 at 11:01 PM, Andres Freund <andres@anarazel.de>
wrote:>
I think if we go with the pg_check_visibility approach, we should also
copy the other consistency checks from vacuumlazy.c, given they can't
easily be triggered.
Are you referring to checks that are done in lazy_scan_heap() for each
block?
Yes.
I think the meaningful checks in this context could be (a) page
is marked as visible, but corresponding vm is not marked. (b) page is
marked as all visible and has dead tuples. (c) vm bit indicates frozen,
but
page contains non-frozen tuples.
Yes.
If we want to address both page-level and tuple-level inconsistencies, I
can see the following possibility.
1. An API that returns a setof records, each containing a block that has an
inconsistent vm bit, a block where an all-visible page contains dead tuples,
and a block where the vm bit indicates frozen but the page contains
non-frozen tuples. Three separate block numbers are required in the record
to distinguish the problem with each block.
Signature of API will be something like:
pg_check_visibility_blocks(regclass, corrupt_vm_blkno OUT bigint,
corrupt_dead_blkno OUT bigint, corrupt_frozen_blkno OUT bigint) RETURNS
SETOF record
2. An API that provides information about non-frozen tuples on a frozen page
Signature of API:
CREATE FUNCTION pg_check_visibility_tuples(regclass, t_ctid OUT tid)
RETURNS SETOF tid
This is the same as what is present in the current patch [1].
In this scheme, the user can use the first API to find corrupt blocks, if
any, and if further information is required, the second API can be used.
Does that address your concern? If you, Robert, and others are okay with
the above idea, then I will send an updated patch.
[1]: /messages/by-id/CAA4eK1JHz=OB4Ya+_1dMRqgxrKCt4LxiSyukgm3ZzuxF2ONqGA@mail.gmail.com
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 8, 2016 at 4:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
If we want to address both page level and tuple level inconsistencies, I
could see below possibility.
1. An API that returns setof records containing a block that have
inconsistent vm bit, a block where visible page contains dead tuples and a
block where vm bit indicates frozen, but page contains non-frozen tuples.
Three separate block numbers are required in record to distinguish the
problem with block.
Signature of API will be something like:
pg_check_visibility_blocks(regclass, corrupt_vm_blkno OUT bigint,
corrupt_dead_blkno OUT bigint, corrupt_frozen_blkno OUT bigint) RETURNS
SETOF record
I don't understand this, and I think we're making this too
complicated. The function that just returned non-frozen TIDs on
supposedly-frozen pages was simple. Now we're trying to redesign this
into a general-purpose integrity checker on the eve of beta2, and I
think that's a bad idea. We don't have time to figure that out, get
consensus on it, and do it well, and I don't want to be stuck
supporting something half-baked from now until eternity. Let's scale
back our goals here to something that can realistically be done well
in the time available.
Here's my proposal:
1. You already implemented a function to find non-frozen tuples on
supposedly all-frozen pages. Great.
2. Let's implement a second function to find dead tuples on supposedly
all-visible pages.
3. And then let's call it good.
If we start getting into the game of "well, that's not enough because
you can also check for X", that's an infinite treadmill. There will
always be more things we can check. But that's the project of
building an integrity checker, which while worthwhile, is out of scope
for 9.6.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Jun 8, 2016 at 6:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 8, 2016 at 4:01 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
If we want to address both page-level and tuple-level inconsistencies, I
could see the below possibility.
1. An API that returns a setof records containing a block that has an
inconsistent vm bit, a block where a visible page contains dead tuples, and a
block where the vm bit indicates frozen but the page contains non-frozen
tuples. Three separate block numbers are required in the record to
distinguish the problem with the block. The signature of the API will be
something like:
pg_check_visibility_blocks(regclass, corrupt_vm_blkno OUT bigint,
corrupt_dead_blkno OUT bigint, corrupt_frozen_blkno OUT bigint) RETURNS
SETOF record
I don't understand this,
This new API was to address Andres's concern of checking block-level
inconsistency as we do in lazy_scan_heap. It returns a set of inconsistent
blocks.
The function that just returned non-frozen TIDs on
supposedly-frozen pages was simple. Now we're trying to redesign this
into a general-purpose integrity checker on the eve of beta2, and I
think that's a bad idea. We don't have time to figure that out, get
consensus on it, and do it well, and I don't want to be stuck
supporting something half-baked from now until eternity. Let's scale
back our goals here to something that can realistically be done well
in the time available.
Here's my proposal:
1. You already implemented a function to find non-frozen tuples on
supposedly all-frozen pages. Great.
2. Let's implement a second function to find dead tuples on supposedly
all-visible pages.
3. And then let's call it good.
Your proposal sounds good; I will send an updated patch if there are no
further concerns.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 8, 2016 at 6:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Here's my proposal:
1. You already implemented a function to find non-frozen tuples on
supposedly all-frozen pages. Great.
2. Let's implement a second function to find dead tuples on supposedly
all-visible pages.
I am planning to name them pg_check_frozen and pg_check_visible; let me
know if something else suits better.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 9, 2016 at 8:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jun 8, 2016 at 6:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Here's my proposal:
1. You already implemented a function to find non-frozen tuples on
supposedly all-frozen pages. Great.
2. Let's implement a second function to find dead tuples on supposedly
all-visible pages.
I am planning to name them pg_check_frozen and pg_check_visible; let me
know if something else suits better.
Attached patch implements the above 2 functions. I have addressed the
comments by Sawada-san and you in the latest patch and updated the
documentation as well.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
pg_check_visibility_func_v2.patch (application/octet-stream)
diff --git a/contrib/pg_visibility/Makefile b/contrib/pg_visibility/Makefile
index fbbaa2e..379591a 100644
--- a/contrib/pg_visibility/Makefile
+++ b/contrib/pg_visibility/Makefile
@@ -4,7 +4,7 @@ MODULE_big = pg_visibility
OBJS = pg_visibility.o $(WIN32RES)
EXTENSION = pg_visibility
-DATA = pg_visibility--1.0.sql
+DATA = pg_visibility--1.1.sql pg_visibility--1.0--1.1.sql
PGFILEDESC = "pg_visibility - page visibility information"
ifdef USE_PGXS
diff --git a/contrib/pg_visibility/pg_visibility--1.0--1.1.sql b/contrib/pg_visibility/pg_visibility--1.0--1.1.sql
new file mode 100644
index 0000000..494f42f
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility--1.0--1.1.sql
@@ -0,0 +1,15 @@
+/* contrib/pg_visibility/pg_visibility--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_visibility UPDATE TO '1.1'" to load this file. \quit
+
+
+CREATE FUNCTION pg_check_frozen(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_frozen'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION pg_check_visible(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_visible'
+LANGUAGE C STRICT;
diff --git a/contrib/pg_visibility/pg_visibility--1.0.sql b/contrib/pg_visibility/pg_visibility--1.0.sql
deleted file mode 100644
index da511e5..0000000
--- a/contrib/pg_visibility/pg_visibility--1.0.sql
+++ /dev/null
@@ -1,52 +0,0 @@
-/* contrib/pg_visibility/pg_visibility--1.0.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION pg_visibility" to load this file. \quit
-
--- Show visibility map information.
-CREATE FUNCTION pg_visibility_map(regclass, blkno bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean)
-RETURNS record
-AS 'MODULE_PATHNAME', 'pg_visibility_map'
-LANGUAGE C STRICT;
-
--- Show visibility map and page-level visibility information.
-CREATE FUNCTION pg_visibility(regclass, blkno bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean,
- pd_all_visible OUT boolean)
-RETURNS record
-AS 'MODULE_PATHNAME', 'pg_visibility'
-LANGUAGE C STRICT;
-
--- Show visibility map information for each block in a relation.
-CREATE FUNCTION pg_visibility_map(regclass, blkno OUT bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean)
-RETURNS SETOF record
-AS 'MODULE_PATHNAME', 'pg_visibility_map_rel'
-LANGUAGE C STRICT;
-
--- Show visibility map and page-level visibility information for each block.
-CREATE FUNCTION pg_visibility(regclass, blkno OUT bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean,
- pd_all_visible OUT boolean)
-RETURNS SETOF record
-AS 'MODULE_PATHNAME', 'pg_visibility_rel'
-LANGUAGE C STRICT;
-
--- Show summary of visibility map bits for a relation.
-CREATE FUNCTION pg_visibility_map_summary(regclass,
- OUT all_visible bigint, OUT all_frozen bigint)
-RETURNS record
-AS 'MODULE_PATHNAME', 'pg_visibility_map_summary'
-LANGUAGE C STRICT;
-
--- Don't want these to be available to public.
-REVOKE ALL ON FUNCTION pg_visibility_map(regclass, bigint) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility(regclass, bigint) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility_map(regclass) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility(regclass) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility_map_summary(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility--1.1.sql b/contrib/pg_visibility/pg_visibility--1.1.sql
new file mode 100644
index 0000000..b49b644
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility--1.1.sql
@@ -0,0 +1,67 @@
+/* contrib/pg_visibility/pg_visibility--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_visibility" to load this file. \quit
+
+-- Show visibility map information.
+CREATE FUNCTION pg_visibility_map(regclass, blkno bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility_map'
+LANGUAGE C STRICT;
+
+-- Show visibility map and page-level visibility information.
+CREATE FUNCTION pg_visibility(regclass, blkno bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean,
+ pd_all_visible OUT boolean)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility'
+LANGUAGE C STRICT;
+
+-- Show visibility map information for each block in a relation.
+CREATE FUNCTION pg_visibility_map(regclass, blkno OUT bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_visibility_map_rel'
+LANGUAGE C STRICT;
+
+-- Show visibility map and page-level visibility information for each block.
+CREATE FUNCTION pg_visibility(regclass, blkno OUT bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean,
+ pd_all_visible OUT boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_visibility_rel'
+LANGUAGE C STRICT;
+
+-- Show summary of visibility map bits for a relation.
+CREATE FUNCTION pg_visibility_map_summary(regclass,
+ OUT all_visible bigint, OUT all_frozen bigint)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility_map_summary'
+LANGUAGE C STRICT;
+
+-- Show tupleids of non-frozen tuples if any in all_frozen pages
+-- for a relation.
+CREATE FUNCTION pg_check_frozen(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_frozen'
+LANGUAGE C STRICT;
+
+-- Show tupleids of dead tuples if any in all_visible pages for a relation.
+CREATE FUNCTION pg_check_visible(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_visible'
+LANGUAGE C STRICT;
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_visibility_map(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility_map(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility_map_summary(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_check_frozen(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_check_visible(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 5e5c7cc..e746e27 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -25,14 +25,24 @@ typedef struct vbits
uint8 bits[FLEXIBLE_ARRAY_MEMBER];
} vbits;
+typedef struct tupleids
+{
+ BlockNumber next;
+ BlockNumber count;
+ ItemPointer tids;
+} tupleids;
+
PG_FUNCTION_INFO_V1(pg_visibility_map);
PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
PG_FUNCTION_INFO_V1(pg_visibility);
PG_FUNCTION_INFO_V1(pg_visibility_rel);
PG_FUNCTION_INFO_V1(pg_visibility_map_summary);
+PG_FUNCTION_INFO_V1(pg_check_frozen);
+PG_FUNCTION_INFO_V1(pg_check_visible);
static TupleDesc pg_visibility_tupdesc(bool include_blkno, bool include_pd);
static vbits *collect_visibility_data(Oid relid, bool include_pd);
+static tupleids *collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen);
/*
* Visibility map information for a single block of a relation.
@@ -259,6 +269,66 @@ pg_visibility_map_summary(PG_FUNCTION_ARGS)
}
/*
+ * Return the tids of non-frozen tuples present in all-frozen pages. All
+ * such tids indicate corrupt tuples.
+ */
+Datum
+pg_check_frozen(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ tupleids *info;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ Oid relid = PG_GETARG_OID(0);
+ MemoryContext oldcontext;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ funcctx->user_fctx = collect_corrupt_items(relid, false, true);
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+ info = (tupleids *) funcctx->user_fctx;
+
+ if (info->next < info->count)
+ SRF_RETURN_NEXT(funcctx, PointerGetDatum(&info->tids[info->next++]));
+
+ SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Return the tids of dead tuples present in all-visible pages. All such
+ * tids indicate corrupt tuples.
+ */
+Datum
+pg_check_visible(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ tupleids *info;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ Oid relid = PG_GETARG_OID(0);
+ MemoryContext oldcontext;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ funcctx->user_fctx = collect_corrupt_items(relid, true, false);
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+ info = (tupleids *) funcctx->user_fctx;
+
+ if (info->next < info->count)
+ SRF_RETURN_NEXT(funcctx, PointerGetDatum(&info->tids[info->next++]));
+
+ SRF_RETURN_DONE(funcctx);
+}
+
+/*
* Helper function to construct whichever TupleDesc we need for a particular
* call.
*/
@@ -348,3 +418,117 @@ collect_visibility_data(Oid relid, bool include_pd)
return info;
}
+
+/*
+ * Collect dead items on all-visible pages and/or non-frozen items on
+ * all-frozen pages for a relation.
+ */
+static tupleids *
+collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
+{
+ Relation rel;
+ BlockNumber nblocks;
+ tupleids *info;
+ BlockNumber blkno;
+ uint64 nallocated;
+ uint64 count_corrupt_items = 0;
+ Buffer vmbuffer = InvalidBuffer;
+ BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+ rel = relation_open(relid, AccessShareLock);
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /*
+ * Guess an initial array size; we don't expect many corrupted tuples, so
+ * start with a small number. Having the initial size be MaxHeapTuplesPerPage
+ * allows us to check whether the tids array needs to be enlarged at the
+ * page level rather than at the tuple level.
+ */
+ nallocated = MaxHeapTuplesPerPage;
+ info = palloc0(sizeof(tupleids));
+ info->tids = palloc(nallocated * sizeof(ItemPointerData));
+ info->next = 0;
+
+ for (blkno = 0; blkno < nblocks; ++blkno)
+ {
+ bool check_frozen = false;
+ bool check_visible = false;
+
+ /* Make sure we are interruptible. */
+ CHECK_FOR_INTERRUPTS();
+
+ /* enlarge output array if needed. */
+ if (count_corrupt_items >= nallocated)
+ {
+ nallocated *= 2;
+ info->tids = repalloc(info->tids, nallocated * sizeof(ItemPointerData));
+ }
+
+ if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+ check_frozen = true;
+ if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+ check_visible = true;
+
+ /* collect the non-frozen tuples on a frozen page. */
+ if (check_visible || check_frozen)
+ {
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum,
+ maxoff;
+
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ bstrategy);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ HeapTupleHeader tuphdr;
+ ItemId itemid;
+
+ itemid = PageGetItemId(page, offnum);
+
+ /* Unused or redirect line pointers are of no interest. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid))
+ continue;
+
+ /*
+ * Count dead item as corrupt. We don't expect dead items on
+ * all visible or all frozen pages.
+ */
+ if (ItemIdIsDead(itemid))
+ {
+ ItemPointerData tid;
+
+ ItemPointerSet(&tid, blkno, offnum);
+ info->tids[count_corrupt_items++] = tid;
+ continue;
+ }
+
+ if (check_frozen)
+ {
+ tuphdr = (HeapTupleHeader) PageGetItem(page, itemid);
+ if (heap_tuple_needs_eventual_freeze(tuphdr))
+ info->tids[count_corrupt_items++] = tuphdr->t_ctid;
+ }
+ }
+
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* Clean up. */
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
+ relation_close(rel, AccessShareLock);
+
+ info->count = count_corrupt_items;
+
+ return info;
+}
diff --git a/contrib/pg_visibility/pg_visibility.control b/contrib/pg_visibility/pg_visibility.control
index 1d71853..f93ed01 100644
--- a/contrib/pg_visibility/pg_visibility.control
+++ b/contrib/pg_visibility/pg_visibility.control
@@ -1,5 +1,5 @@
# pg_visibility extension
comment = 'examine the visibility map (VM) and page-level visibility info'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/pg_visibility'
relocatable = true
diff --git a/doc/src/sgml/pgvisibility.sgml b/doc/src/sgml/pgvisibility.sgml
index 48b003d..feb9597 100644
--- a/doc/src/sgml/pgvisibility.sgml
+++ b/doc/src/sgml/pgvisibility.sgml
@@ -30,9 +30,10 @@
<para>
Functions which display information about <literal>PD_ALL_VISIBLE</>
- are much more costly than those which only consult the visibility map,
- because they must read the relation's data blocks rather than only the
- (much smaller) visibility map.
+ and check all-visible or all-frozen pages are much more costly
+ than those which only consult the visibility map, because they must
+ read the relation's data blocks rather than only the (much smaller)
+ visibility map.
</para>
<sect2>
@@ -92,6 +93,28 @@
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><function>pg_check_frozen(regclass, t_ctid OUT tid) returns setof tid</function></term>
+
+ <listitem>
+ <para>
+ Returns the tupleids of non-frozen tuples present in the all-frozen pages
+ for a relation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>pg_check_visible(regclass, t_ctid OUT tid) returns setof tid</function></term>
+
+ <listitem>
+ <para>
+ Returns the tupleids of dead tuples present in the all-visible pages
+ for a relation.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
<para>
On Thu, Jun 9, 2016 at 5:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Attached patch implements the above 2 functions. I have addressed the
comments by Sawada-san and you in the latest patch and updated the
documentation as well.
I made a number of changes to this patch. Here is the new version.
1. The algorithm you were using for growing the array size is unsafe
and can easily overrun the array. Suppose that each of the first two
pages has some corrupt tuples, more than 50% of MaxHeapTuplesPerPage
but less than the full value of MaxHeapTuplesPerPage. Your code will
conclude that the array does not need to be enlarged after processing the
first page, and the second page's tuples will then overrun it. I switched
this to what I consider the normal coding pattern for such problems.
2. The all-visible checks seemed to me to be incorrect and incomplete.
I made the check match the logic in lazy_scan_heap.
3. Your 1.0 -> 1.1 upgrade script was missing copies of the REVOKE
statements you added to the 1.1 script. I added them.
4. The tests as written were not safe under concurrency; they could
return spurious results if the page changed between the time you
checked the visibility map and the time you actually examined the
tuples. I think people will try running these functions on live
systems, so I changed the code to recheck the VM bits after locking
the page. Unfortunately, there's either still a concurrency-related
problem here or there's a bug in the all-frozen code itself because I
once managed to get pg_check_frozen('pgbench_accounts') to return a
TID while pgbench was running concurrently. That's a bit alarming,
but since I can't reproduce it I don't really have a clue how to track
down the problem.
5. I made various cosmetic improvements.
If there are not objections, I will go ahead and commit this tomorrow,
because even if there is a bug (see point #4 above) I think it's
better to have this in the tree than not. However, code review and/or
testing with these new functions seems like it would be an extremely
good idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
check-visibility-v3.patch (binary/octet-stream)
diff --git a/contrib/pg_visibility/Makefile b/contrib/pg_visibility/Makefile
index fbbaa2e..379591a 100644
--- a/contrib/pg_visibility/Makefile
+++ b/contrib/pg_visibility/Makefile
@@ -4,7 +4,7 @@ MODULE_big = pg_visibility
OBJS = pg_visibility.o $(WIN32RES)
EXTENSION = pg_visibility
-DATA = pg_visibility--1.0.sql
+DATA = pg_visibility--1.1.sql pg_visibility--1.0--1.1.sql
PGFILEDESC = "pg_visibility - page visibility information"
ifdef USE_PGXS
diff --git a/contrib/pg_visibility/pg_visibility--1.0--1.1.sql b/contrib/pg_visibility/pg_visibility--1.0--1.1.sql
new file mode 100644
index 0000000..2c97dfd
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility--1.0--1.1.sql
@@ -0,0 +1,17 @@
+/* contrib/pg_visibility/pg_visibility--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_visibility UPDATE TO '1.1'" to load this file. \quit
+
+CREATE FUNCTION pg_check_frozen(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_frozen'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION pg_check_visible(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_visible'
+LANGUAGE C STRICT;
+
+REVOKE ALL ON FUNCTION pg_check_frozen(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_check_visible(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility--1.0.sql b/contrib/pg_visibility/pg_visibility--1.0.sql
deleted file mode 100644
index da511e5..0000000
--- a/contrib/pg_visibility/pg_visibility--1.0.sql
+++ /dev/null
@@ -1,52 +0,0 @@
-/* contrib/pg_visibility/pg_visibility--1.0.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION pg_visibility" to load this file. \quit
-
--- Show visibility map information.
-CREATE FUNCTION pg_visibility_map(regclass, blkno bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean)
-RETURNS record
-AS 'MODULE_PATHNAME', 'pg_visibility_map'
-LANGUAGE C STRICT;
-
--- Show visibility map and page-level visibility information.
-CREATE FUNCTION pg_visibility(regclass, blkno bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean,
- pd_all_visible OUT boolean)
-RETURNS record
-AS 'MODULE_PATHNAME', 'pg_visibility'
-LANGUAGE C STRICT;
-
--- Show visibility map information for each block in a relation.
-CREATE FUNCTION pg_visibility_map(regclass, blkno OUT bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean)
-RETURNS SETOF record
-AS 'MODULE_PATHNAME', 'pg_visibility_map_rel'
-LANGUAGE C STRICT;
-
--- Show visibility map and page-level visibility information for each block.
-CREATE FUNCTION pg_visibility(regclass, blkno OUT bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean,
- pd_all_visible OUT boolean)
-RETURNS SETOF record
-AS 'MODULE_PATHNAME', 'pg_visibility_rel'
-LANGUAGE C STRICT;
-
--- Show summary of visibility map bits for a relation.
-CREATE FUNCTION pg_visibility_map_summary(regclass,
- OUT all_visible bigint, OUT all_frozen bigint)
-RETURNS record
-AS 'MODULE_PATHNAME', 'pg_visibility_map_summary'
-LANGUAGE C STRICT;
-
--- Don't want these to be available to public.
-REVOKE ALL ON FUNCTION pg_visibility_map(regclass, bigint) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility(regclass, bigint) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility_map(regclass) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility(regclass) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility_map_summary(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility--1.1.sql b/contrib/pg_visibility/pg_visibility--1.1.sql
new file mode 100644
index 0000000..b49b644
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility--1.1.sql
@@ -0,0 +1,67 @@
+/* contrib/pg_visibility/pg_visibility--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_visibility" to load this file. \quit
+
+-- Show visibility map information.
+CREATE FUNCTION pg_visibility_map(regclass, blkno bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility_map'
+LANGUAGE C STRICT;
+
+-- Show visibility map and page-level visibility information.
+CREATE FUNCTION pg_visibility(regclass, blkno bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean,
+ pd_all_visible OUT boolean)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility'
+LANGUAGE C STRICT;
+
+-- Show visibility map information for each block in a relation.
+CREATE FUNCTION pg_visibility_map(regclass, blkno OUT bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_visibility_map_rel'
+LANGUAGE C STRICT;
+
+-- Show visibility map and page-level visibility information for each block.
+CREATE FUNCTION pg_visibility(regclass, blkno OUT bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean,
+ pd_all_visible OUT boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_visibility_rel'
+LANGUAGE C STRICT;
+
+-- Show summary of visibility map bits for a relation.
+CREATE FUNCTION pg_visibility_map_summary(regclass,
+ OUT all_visible bigint, OUT all_frozen bigint)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility_map_summary'
+LANGUAGE C STRICT;
+
+-- Show tupleids of non-frozen tuples if any in all_frozen pages
+-- for a relation.
+CREATE FUNCTION pg_check_frozen(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_frozen'
+LANGUAGE C STRICT;
+
+-- Show tupleids of dead tuples if any in all_visible pages for a relation.
+CREATE FUNCTION pg_check_visible(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_visible'
+LANGUAGE C STRICT;
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_visibility_map(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility_map(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility_map_summary(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_check_frozen(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_check_visible(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 5e5c7cc..7802e22 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -14,25 +14,38 @@
#include "funcapi.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/procarray.h"
#include "utils/rel.h"
PG_MODULE_MAGIC;
typedef struct vbits
{
- BlockNumber next;
- BlockNumber count;
+ BlockNumber next;
+ BlockNumber count;
uint8 bits[FLEXIBLE_ARRAY_MEMBER];
} vbits;
+typedef struct corrupt_items
+{
+ BlockNumber next;
+ BlockNumber count;
+ ItemPointer tids;
+} corrupt_items;
+
PG_FUNCTION_INFO_V1(pg_visibility_map);
PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
PG_FUNCTION_INFO_V1(pg_visibility);
PG_FUNCTION_INFO_V1(pg_visibility_rel);
PG_FUNCTION_INFO_V1(pg_visibility_map_summary);
+PG_FUNCTION_INFO_V1(pg_check_frozen);
+PG_FUNCTION_INFO_V1(pg_check_visible);
static TupleDesc pg_visibility_tupdesc(bool include_blkno, bool include_pd);
static vbits *collect_visibility_data(Oid relid, bool include_pd);
+static corrupt_items *collect_corrupt_items(Oid relid, bool all_visible,
+ bool all_frozen);
+static void record_corrupt_item(corrupt_items *items, ItemPointer tid);
/*
* Visibility map information for a single block of a relation.
@@ -129,7 +142,7 @@ pg_visibility_map_rel(PG_FUNCTION_ARGS)
if (SRF_IS_FIRSTCALL())
{
Oid relid = PG_GETARG_OID(0);
- MemoryContext oldcontext;
+ MemoryContext oldcontext;
funcctx = SRF_FIRSTCALL_INIT();
oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
@@ -173,7 +186,7 @@ pg_visibility_rel(PG_FUNCTION_ARGS)
if (SRF_IS_FIRSTCALL())
{
Oid relid = PG_GETARG_OID(0);
- MemoryContext oldcontext;
+ MemoryContext oldcontext;
funcctx = SRF_FIRSTCALL_INIT();
oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
@@ -214,8 +227,8 @@ pg_visibility_map_summary(PG_FUNCTION_ARGS)
{
Oid relid = PG_GETARG_OID(0);
Relation rel;
- BlockNumber nblocks;
- BlockNumber blkno;
+ BlockNumber nblocks;
+ BlockNumber blkno;
Buffer vmbuffer = InvalidBuffer;
int64 all_visible = 0;
int64 all_frozen = 0;
@@ -259,6 +272,68 @@ pg_visibility_map_summary(PG_FUNCTION_ARGS)
}
/*
+ * Return the TIDs of non-frozen tuples present in pages marked all-frozen
+ * in the visibility map. We hope no one will ever find any, but there could
+ * be bugs, database corruption, etc.
+ */
+Datum
+pg_check_frozen(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ corrupt_items *items;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ Oid relid = PG_GETARG_OID(0);
+ MemoryContext oldcontext;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ funcctx->user_fctx = collect_corrupt_items(relid, false, true);
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+ items = (corrupt_items *) funcctx->user_fctx;
+
+ if (items->next < items->count)
+ SRF_RETURN_NEXT(funcctx, PointerGetDatum(&items->tids[items->next++]));
+
+ SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Return the TIDs of not-all-visible tuples in pages marked all-visible
+ * in the visibility map. We hope no one will ever find any, but there could
+ * be bugs, database corruption, etc.
+ */
+Datum
+pg_check_visible(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ corrupt_items *items;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ Oid relid = PG_GETARG_OID(0);
+ MemoryContext oldcontext;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ funcctx->user_fctx = collect_corrupt_items(relid, true, false);
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+ items = (corrupt_items *) funcctx->user_fctx;
+
+ if (items->next < items->count)
+ SRF_RETURN_NEXT(funcctx, PointerGetDatum(&items->tids[items->next++]));
+
+ SRF_RETURN_DONE(funcctx);
+}
+
+/*
* Helper function to construct whichever TupleDesc we need for a particular
* call.
*/
@@ -292,16 +367,16 @@ static vbits *
collect_visibility_data(Oid relid, bool include_pd)
{
Relation rel;
- BlockNumber nblocks;
+ BlockNumber nblocks;
vbits *info;
- BlockNumber blkno;
+ BlockNumber blkno;
Buffer vmbuffer = InvalidBuffer;
- BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
rel = relation_open(relid, AccessShareLock);
nblocks = RelationGetNumberOfBlocks(rel);
- info = palloc0(offsetof(vbits, bits) + nblocks);
+ info = palloc0(offsetof(vbits, bits) +nblocks);
info->next = 0;
info->count = nblocks;
@@ -320,8 +395,8 @@ collect_visibility_data(Oid relid, bool include_pd)
info->bits[blkno] |= (1 << 1);
/*
- * Page-level data requires reading every block, so only get it if
- * the caller needs it. Use a buffer access strategy, too, to prevent
+ * Page-level data requires reading every block, so only get it if the
+ * caller needs it. Use a buffer access strategy, too, to prevent
* cache-trashing.
*/
if (include_pd)
@@ -348,3 +423,189 @@ collect_visibility_data(Oid relid, bool include_pd)
return info;
}
+
+/*
+ * Returns a list of items whose visibility map information does not match
+ * the status of the tuples on the page.
+ *
+ * If all_visible is passed as true, this will include all items which are
+ * on pages marked as all-visible in the visibility map but which do not
+ * seem to in fact be all-visible.
+ *
+ * If all_frozen is passed as true, this will include all items which are
+ * on pages marked as all-frozen but which do not seem to in fact be frozen.
+ */
+static corrupt_items *
+collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
+{
+ Relation rel;
+ BlockNumber nblocks;
+ corrupt_items *items;
+ BlockNumber blkno;
+ Buffer vmbuffer = InvalidBuffer;
+ BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ TransactionId OldestXmin = InvalidTransactionId;
+
+ rel = relation_open(relid, AccessShareLock);
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (all_visible)
+ OldestXmin = GetOldestXmin(rel, true);
+
+ /*
+ * Guess an initial array size. We don't expect many corrupted tuples, so
+ * start with a small array. This function uses the "next" field to track
+ * the next offset where we can store an item (which is the same thing as
+ * the number of items found so far) and the "count" field to track the
+ * number of entries allocated. We'll repurpose these fields before
+ * returning.
+ */
+ items = palloc0(sizeof(corrupt_items));
+ items->next = 0;
+ items->count = 64;
+ items->tids = palloc(items->count * sizeof(ItemPointerData));
+
+ /* Loop over every block in the relation. */
+ for (blkno = 0; blkno < nblocks; ++blkno)
+ {
+ bool check_frozen = false;
+ bool check_visible = false;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum,
+ maxoff;
+
+ /* Make sure we are interruptible. */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Use the visibility map to decide whether to check this page. */
+ if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+ check_frozen = true;
+ if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+ check_visible = true;
+ if (!check_visible && !check_frozen)
+ continue;
+
+ /* Read and lock the page. */
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ bstrategy);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * The visibility map bits might have changed while we were acquiring
+ * the page lock. Recheck to avoid returning spurious results.
+ */
+ if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+ check_frozen = false;
+ if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+ check_visible = false;
+ if (!check_visible && !check_frozen)
+ {
+ UnlockReleaseBuffer(buffer);
+ continue;
+ }
+
+ /* Iterate over each tuple on the page. */
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ HeapTupleData tuple;
+ ItemId itemid;
+
+ itemid = PageGetItemId(page, offnum);
+
+ /* Unused or redirect line pointers are of no interest. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid))
+ continue;
+
+ /* Dead line pointers are neither all-visible nor frozen. */
+ if (ItemIdIsDead(itemid))
+ {
+ ItemPointerData tid;
+
+ ItemPointerSet(&tid, blkno, offnum);
+ record_corrupt_item(items, &tid);
+ continue;
+ }
+
+ /* Initialize a HeapTupleData structure for checks below. */
+ tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+ tuple.t_len = ItemIdGetLength(itemid);
+ tuple.t_tableOid = relid;
+
+ /*
+ * If we're checking whether the page is all-visible, we expect
+ * the tuple to be live, xmin to be hinted committed, and the xmin
+ * to be old enough that everyone can see it. The tests here
+ * should match the ones in lazy_scan_heap.
+ */
+ if (check_visible)
+ {
+ HTSV_Result state;
+
+ state = HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buffer);
+ if (state != HEAPTUPLE_LIVE ||
+ !HeapTupleHeaderXminCommitted(tuple.t_data))
+ record_corrupt_item(items, &tuple.t_data->t_ctid);
+ else
+ {
+ TransactionId xmin;
+
+ xmin = HeapTupleHeaderGetXmin(tuple.t_data);
+ if (!TransactionIdPrecedes(xmin, OldestXmin))
+ record_corrupt_item(items, &tuple.t_data->t_ctid);
+ }
+ }
+
+ /*
+ * If we're checking whether the page is all-frozen, we expect the
+ * tuple to be in a state where it will never need freezing.
+ */
+ if (check_frozen)
+ {
+ if (heap_tuple_needs_eventual_freeze(tuple.t_data))
+ record_corrupt_item(items, &tuple.t_data->t_ctid);
+ }
+ }
+
+ UnlockReleaseBuffer(buffer);
+ }
+
+ /* Clean up. */
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
+ relation_close(rel, AccessShareLock);
+
+ /*
+ * Before returning, repurpose the fields to match caller's expectations.
+ * next is now the next item that should be read (rather than written) and
+ * count is now the number of items we wrote (rather than the number we
+ * allocated).
+ */
+ items->count = items->next;
+ items->next = 0;
+
+ return items;
+}
+
+/*
+ * Remember one corrupt item.
+ */
+static void
+record_corrupt_item(corrupt_items *items, ItemPointer tid)
+{
+ /* enlarge output array if needed. */
+ if (items->next >= items->count)
+ {
+ items->count *= 2;
+ items->tids = repalloc(items->tids,
+ items->count * sizeof(ItemPointerData));
+ }
+ /* and add the new item */
+ items->tids[items->next++] = *tid;
+}
diff --git a/contrib/pg_visibility/pg_visibility.control b/contrib/pg_visibility/pg_visibility.control
index 1d71853..f93ed01 100644
--- a/contrib/pg_visibility/pg_visibility.control
+++ b/contrib/pg_visibility/pg_visibility.control
@@ -1,5 +1,5 @@
# pg_visibility extension
comment = 'examine the visibility map (VM) and page-level visibility info'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/pg_visibility'
relocatable = true
diff --git a/doc/src/sgml/pgvisibility.sgml b/doc/src/sgml/pgvisibility.sgml
index 48b003d..4cdca7d 100644
--- a/doc/src/sgml/pgvisibility.sgml
+++ b/doc/src/sgml/pgvisibility.sgml
@@ -32,7 +32,8 @@
Functions which display information about <literal>PD_ALL_VISIBLE</>
are much more costly than those which only consult the visibility map,
because they must read the relation's data blocks rather than only the
- (much smaller) visibility map.
+ (much smaller) visibility map. Functions that check the relation's
+ data blocks are similarly expensive.
</para>
<sect2>
@@ -92,6 +93,31 @@
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><function>pg_check_frozen(regclass, t_ctid OUT tid) returns setof tid</function></term>
+
+ <listitem>
+ <para>
+ Returns the TIDs of non-frozen tuples present in pages marked all-frozen
+ in the visibility map. If this function returns a non-empty set of
+ TIDs, the database is corrupt.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>pg_check_visible(regclass, t_ctid OUT tid) returns setof tid</function></term>
+
+ <listitem>
+ <para>
+ Returns the TIDs of tuples which are not all-visible despite the fact
+ that the pages which contain them are marked as all-visible in the
+ visibility map. If this function returns a non-empty set of TIDs, the
+ database is corrupt.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
<para>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9b38d35..168649f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2374,6 +2374,7 @@ convert_testexpr_context
core_YYSTYPE
core_yy_extra_type
core_yyscan_t
+corrupt_items
cost_qual_eval_context
count_agg_clauses_context
create_upper_paths_hook_type
Hi Robert, Amit,
thanks for working on this.
On 2016-06-09 12:11:15 -0400, Robert Haas wrote:
4. The tests as written were not safe under concurrency; they could
return spurious results if the page changed between the time you
checked the visibility map and the time you actually examined the
tuples. I think people will try running these functions on live
systems, so I changed the code to recheck the VM bits after locking
the page. Unfortunately, there's either still a concurrency-related
problem here or there's a bug in the all-frozen code itself because I
once managed to get pg_check_frozen('pgbench_accounts') to return a
TID while pgbench was running concurrently. That's a bit alarming,
but since I can't reproduce it I don't really have a clue how to track
down the problem.
Ugh, that's a bit concerning.
If there are not objections, I will go ahead and commit this tomorrow,
because even if there is a bug (see point #4 above) I think it's
better to have this in the tree than not. However, code review and/or
testing with these new functions seems like it would be an extremely
good idea.
I'll try to spend some time on that today (code review & testing).
Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
I found a few, relatively minor, issues.
1) I think we should perform a relkind check in
collect_corrupt_items(). Atm we'll "gladly" run against an index. If
we actually entered the main portion of the loop in
collect_corrupt_items(), that could end up corrupting the table (via
HeapTupleSatisfiesVacuum()). But it's probably safe, because the vm
fork doesn't exist for anything but heap/toast relations.
2) GetOldestXmin() currently specifies a relation, which can cause
trouble in recovery:
/*
* If we're not computing a relation specific limit, or if a shared
* relation has been passed in, backends in all databases have to be
* considered.
*/
allDbs = rel == NULL || rel->rd_rel->relisshared;
/* Cannot look for individual databases during recovery */
Assert(allDbs || !RecoveryInProgress());
i.e. we'll Assert out. I think that needs to be fixed.
3) Harmless here, but I think it's bad policy to release locks
on normal relations before the end of xact.
+ relation_close(rel, AccessShareLock);
+
4)
+ if (check_visible)
+ {
+ HTSV_Result state;
+
+ state = HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buffer);
+ if (state != HEAPTUPLE_LIVE ||
+ !HeapTupleHeaderXminCommitted(tuple.t_data))
+ record_corrupt_item(items, &tuple.t_data->t_ctid);
+ else
This theoretically could give false positives, if GetOldestXmin() went
backwards. But I think that's ok.
5) There's a bunch of whitespace damage in the diff, like
Oid relid = PG_GETARG_OID(0);
- MemoryContext oldcontext;
+ MemoryContext oldcontext;
Otherwise this looks good. I played with it for a while, and besides
finding intentionally caused corruption, it didn't flag anything
(besides crashing on a standby, as in 2)).
Greetings,
Andres Freund
On 2016-06-09 19:33:52 -0700, Andres Freund wrote:
I played with it for a while, and besides
finding intentionally caused corruption, it didn't flag anything
(besides crashing on a standby, as in 2)).
Ugh. Just seconds after I sent that email:
oid | t_ctid
------------------+--------------
pgbench_accounts | (889641,33)
pgbench_accounts | (893854,56)
pgbench_accounts | (924226,13)
pgbench_accounts | (1073457,51)
pgbench_accounts | (1084904,16)
pgbench_accounts | (1111996,26)
(6 rows)
oid | t_ctid
-----+--------
(0 rows)
oid | t_ctid
------------------+--------------
pgbench_accounts | (739198,13)
pgbench_accounts | (887254,11)
pgbench_accounts | (1050391,6)
pgbench_accounts | (1158640,46)
pgbench_accounts | (1238067,18)
pgbench_accounts | (1273282,22)
pgbench_accounts | (1355816,54)
pgbench_accounts | (1361880,33)
(8 rows)
oid | t_ctid
-----+--------
(0 rows)
Seems to be correlated with a concurrent vacuum, but it's hard to tell,
because I didn't have psql output a timestamp.
Greetings,
Andres Freund
On Fri, Jun 10, 2016 at 8:08 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-09 19:33:52 -0700, Andres Freund wrote:
I played with it for a while, and besides finding intentionally caused
corruption, it didn't flag anything (besides crashing on a standby, as
in 2)). Ugh. Just seconds after I sent that email: [...]
Is this output of pg_check_visible() or pg_check_frozen()?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On June 9, 2016 7:46:06 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
Is this output of pg_check_visible() or pg_check_frozen()?
Unfortunately I don't know. I was running a union of both, I didn't really expect to hit an issue... I guess I'll put a PANIC in the relevant places and check whether I can reproduce.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Jun 10, 2016 at 8:27 AM, Andres Freund <andres@anarazel.de> wrote:
Is this output of pg_check_visible() or pg_check_frozen()?
Unfortunately I don't know. I was running a union of both, I didn't really
expect to hit an issue... I guess I'll put a PANIC in the relevant places
and check whether I can reproduce.
I have tried in multiple ways by running pgbench with read-write tests, but
could not see any such behaviour. I have tried by even crashing and
restarting the server and then again running pgbench. Do you see these
records on master or slave?
While looking at code in this area, I observed that during replay of
records (heap_xlog_delete), we first clear the vm, then update the page.
So we don't have Buffer lock while updating the vm where as in the patch
(collect_corrupt_items()), we are relying on the fact that for clearing vm
bit one needs to acquire buffer lock. Can that cause a problem?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-06-10 11:58:26 +0530, Amit Kapila wrote:
I have tried in multiple ways by running pgbench with read-write tests, but
could not see any such behaviour.
It took over an hour of pgbench on a fast laptop till I saw it.
I have tried by even crashing and
restarting the server and then again running pgbench. Do you see these
records on master or slave?
Master, but with an existing standby. So it could be related to
hot_standby_feedback or such.
While looking at code in this area, I observed that during replay of
records (heap_xlog_delete), we first clear the vm, then update the page.
So we don't have Buffer lock while updating the vm where as in the patch
(collect_corrupt_items()), we are relying on the fact that for clearing vm
bit one needs to acquire buffer lock. Can that cause a problem?
Unsetting a vm bit is always safe, right? The invariant is that the VM
may never falsely say all_visible/frozen, but it's perfectly ok for a
page to be all_visible/frozen, without the VM bit being present.
Andres
On Thu, Jun 9, 2016 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
2. The all-visible checks seemed to me to be incorrect and incomplete.
I made the check match the logic in lazy_scan_heap.
Okay, I thought we just wanted to check for dead tuples. If we want the
logic to be similar to lazy_scan_heap(), then I think we should also
consider applying the old snapshot threshold limit to oldestxmin. We
currently do that in vacuum_set_xid_limits() for VACUUM. Is there a
reason for not considering it for the visibility check function?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jun 10, 2016 at 1:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jun 9, 2016 at 5:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Attached patch implements the above 2 functions. I have addressed the
comments by Sawada San and you in latest patch and updated the documentation
as well.

I made a number of changes to this patch. Here is the new version.
1. The algorithm you were using for growing the array size is unsafe
and can easily overrun the array. Suppose that each of the first two
pages have some corrupt tuples, more than 50% of MaxHeapTuplesPerPage
but less than the full value of MaxHeapTuplesPerPage. Your code will
conclude that the array does need to be enlarged after processing the
first page. I switched this to what I consider the normal coding
pattern for such problems.

2. The all-visible checks seemed to me to be incorrect and incomplete.
I made the check match the logic in lazy_scan_heap.

3. Your 1.0 -> 1.1 upgrade script was missing copies of the REVOKE
statements you added to the 1.1 script. I added them.

4. The tests as written were not safe under concurrency; they could
return spurious results if the page changed between the time you
checked the visibility map and the time you actually examined the
tuples. I think people will try running these functions on live
systems, so I changed the code to recheck the VM bits after locking
the page. Unfortunately, there's either still a concurrency-related
problem here or there's a bug in the all-frozen code itself because I
once managed to get pg_check_frozen('pgbench_accounts') to return a
TID while pgbench was running concurrently. That's a bit alarming,
but since I can't reproduce it I don't really have a clue how to track
down the problem.

5. I made various cosmetic improvements.
If there are not objections, I will go ahead and commit this tomorrow,
because even if there is a bug (see point #4 above) I think it's
better to have this in the tree than not. However, code review and/or
testing with these new functions seems like it would be an extremely
good idea.
Thank you for working on this.
Here are some minor comments.
---
+/*
+ * Return the TIDs of not-all-visible tuples in pages marked all-visible
If there is even one non-visible tuple in pages marked all-visible,
the database might be corrupted.
Is it better "not-visible" or "non-visible" instead of "not-all-visible"?
---
Do we need to check the page header flag as well? The database might
also be corrupt if a page with PD_ALL_VISIBLE set contains a
non-visible tuple. We could emit a WARNING in that case.
Also, using the attached tool, which allows us to set a spurious
visibility map status without actually modifying the tuples, I
manually created some situations where the database is corrupted and
tested them; ISTM that the check functions work fine.
I'm not proposing the tool as a new feature, of course, but please use
it as appropriate.
Regards,
--
Masahiko Sawada
Attachments:
set_spurious_vm_status.patch
commit 6f76f11a3842e5401cbbb740bc346bb5329a06e0
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri Jun 10 16:28:39 2016 -0700
Add cheat func.
diff --git a/contrib/pg_visibility/pg_visibility--1.1.sql b/contrib/pg_visibility/pg_visibility--1.1.sql
index b49b644..a289a45 100644
--- a/contrib/pg_visibility/pg_visibility--1.1.sql
+++ b/contrib/pg_visibility/pg_visibility--1.1.sql
@@ -57,6 +57,30 @@ RETURNS SETOF tid
AS 'MODULE_PATHNAME', 'pg_check_visible'
LANGUAGE C STRICT;
+CREATE FUNCTION set_vm_status(
+rel regclass,
+blkno bigint,
+all_visible bool,
+all_frozen bool,
+blkno OUT INT
+)
+RETURNS INT
+AS 'MODULE_PATHNAME', 'set_vm_status'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION set_vm_status(
+rel regclass,
+all_visible bool,
+all_frozen bool,
+blkno OUT BIGINT,
+status OUT INT)
+RETURNS SETOF RECORD
+AS $$
+ SELECT blkno, set_vm_status(rel, blkno, $2, $3)
+ FROM generate_series(0, pg_relation_size($1) / current_setting('block_size')::bigint - 1) AS blkno;
+$$
+LANGUAGE SQL STRICT;
+
-- Don't want these to be available to public.
REVOKE ALL ON FUNCTION pg_visibility_map(regclass, bigint) FROM PUBLIC;
REVOKE ALL ON FUNCTION pg_visibility(regclass, bigint) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 3ccc981..e43cc23 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -41,6 +41,8 @@ PG_FUNCTION_INFO_V1(pg_visibility_map_summary);
PG_FUNCTION_INFO_V1(pg_check_frozen);
PG_FUNCTION_INFO_V1(pg_check_visible);
+PG_FUNCTION_INFO_V1(set_vm_status);
+
static TupleDesc pg_visibility_tupdesc(bool include_blkno, bool include_pd);
static vbits *collect_visibility_data(Oid relid, bool include_pd);
static corrupt_items *collect_corrupt_items(Oid relid, bool all_visible,
@@ -609,3 +611,60 @@ record_corrupt_item(corrupt_items *items, ItemPointer tid)
/* and add the new item */
items->tids[items->next++] = *tid;
}
+
+
+/*
+ * Set spurious visibility map status to specified block.
+ */
+Datum
+set_vm_status(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ BlockNumber blkno = PG_GETARG_INT64(1);
+ bool all_visible = PG_GETARG_BOOL(2);
+ bool all_frozen = PG_GETARG_BOOL(3);
+ Buffer buffer = InvalidBuffer;
+ Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+ Relation rel;
+ uint8 status;
+ Page page;
+ BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+ rel = relation_open(relid, ShareUpdateExclusiveLock);
+
+ if (blkno < 0 || blkno > MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ bstrategy);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buffer);
+
+ /* Create visibility map bits */
+ if (all_visible)
+ flags = VISIBILITYMAP_ALL_VISIBLE;
+ if (all_frozen)
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+
+ /* Set visibility map status anyway */
+ PageSetAllVisible(page);
+ visibilitymap_pin(rel, blkno, &vmbuffer);
+ if (flags)
+ visibilitymap_set(rel, blkno, buffer, InvalidXLogRecPtr,
+ vmbuffer, InvalidTransactionId, flags);
+ else
+ visibilitymap_clear(rel, blkno, vmbuffer);
+
+ status = visibilitymap_get_status(rel, blkno, &vmbuffer);
+ MarkBufferDirty(buffer);
+
+ UnlockReleaseBuffer(buffer);
+ ReleaseBuffer(vmbuffer);
+
+ relation_close(rel, ShareUpdateExclusiveLock);
+
+ PG_RETURN_INT16(status);
+}
On Fri, Jun 10, 2016 at 12:09 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-10 11:58:26 +0530, Amit Kapila wrote:
While looking at code in this area, I observed that during replay of
records (heap_xlog_delete), we first clear the vm, then update the page.
So we don't have Buffer lock while updating the vm where as in the patch
(collect_corrupt_items()), we are relying on the fact that for clearing
the vm bit one needs to acquire buffer lock. Can that cause a problem?
Unsetting a vm bit is always safe, right?
I think so, which means this should not be a problem area.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-06-09 23:39:24 -0700, Andres Freund wrote:
On 2016-06-10 11:58:26 +0530, Amit Kapila wrote:
I have tried in multiple ways by running pgbench with read-write tests, but
could not see any such behaviour.

It took over an hour of pgbench on a fast laptop till I saw it.

I have tried by even crashing and
restarting the server and then again running pgbench. Do you see these
records on master or slave?

Master, but with an existing standby. So it could be related to
hot_standby_feedback or such.
I just managed to trigger it again.
#1 0x00007fa1a73778da in __GI_abort () at abort.c:89
#2 0x00007f9f1395e59c in record_corrupt_item (items=items@entry=0x2137be0, tid=0x7f9fb8681c0c)
at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:612
#3 0x00007f9f1395ead5 in collect_corrupt_items (relid=relid@entry=29449, all_visible=all_visible@entry=0 '\000', all_frozen=all_frozen@entry=1 '\001')
at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:572
#4 0x00007f9f1395f476 in pg_check_frozen (fcinfo=0x7ffe5343a200) at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:292
#5 0x00000000005fdbec in ExecMakeTableFunctionResult (funcexpr=0x2168630, econtext=0x2168320, argContext=<optimized out>, expectedDesc=0x2168ef0,
randomAccess=0 '\000') at /home/andres/src/postgresql/src/backend/executor/execQual.c:2211
#6 0x0000000000616992 in FunctionNext (node=node@entry=0x2168210) at /home/andres/src/postgresql/src/backend/executor/nodeFunctionscan.c:94
#7 0x00000000005ffdcb in ExecScanFetch (recheckMtd=0x6166f0 <FunctionRecheck>, accessMtd=0x616700 <FunctionNext>, node=0x2168210)
at /home/andres/src/postgresql/src/backend/executor/execScan.c:95
#8 ExecScan (node=node@entry=0x2168210, accessMtd=accessMtd@entry=0x616700 <FunctionNext>, recheckMtd=recheckMtd@entry=0x6166f0 <FunctionRecheck>)
at /home/andres/src/postgresql/src/backend/executor/execScan.c:145
#9 0x00000000006169e4 in ExecFunctionScan (node=node@entry=0x2168210) at /home/andres/src/postgresql/src/backend/executor/nodeFunctionscan.c:268
the error happened just after I restarted a standby, so it's not
unlikely to be related to hot_standby_feedback.
(gdb) p *tuple.t_data
$5 = {t_choice = {t_heap = {t_xmin = 9105470, t_xmax = 26049273, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 9105470,
datum_typmod = 26049273, datum_typeid = 0}}, t_ctid = {ip_blkid = {bi_hi = 1, bi_lo = 19765}, ip_posid = 3}, t_infomask2 = 4, t_infomask = 770,
t_hoff = 24 '\030', t_bits = 0x7f9fb8681c17 ""}
Infomask is:
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */
This indeed looks borked. Such a tuple should never survive
if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
check_frozen = false;
especially not when
(gdb) p PageIsAllVisible(page)
$3 = 4
(fwiw, checking PD_ALL_VISIBLE in those functions sounds like a good plan)
I've got another earlier case (that I somehow missed seeing), below
check_visible:
(gdb) p *tuple->t_data
$2 = {t_choice = {t_heap = {t_xmin = 13616549, t_xmax = 25210801, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 13616549,
datum_typmod = 25210801, datum_typeid = 0}}, t_ctid = {ip_blkid = {bi_hi = 0, bi_lo = 52320}, ip_posid = 67}, t_infomask2 = 32772, t_infomask = 8962,
t_hoff = 24 '\030', t_bits = 0x7f9fda2f8717 ""}
infomask is:
#define HEAP_UPDATED 0x2000 /* this is UPDATEd version of row */
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */
infomask2 is:
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
I'll run again, with a debugger attached, maybe I can get some more
information.
Regards,
Andres
On Fri, Jun 10, 2016 at 1:59 PM, Andres Freund <andres@anarazel.de> wrote:
Master, but with an existing standby. So it could be related to
hot_standby_feedback or such.

I just managed to trigger it again. [...] The error happened just after
I restarted a standby, so it's not unlikely to be related to
hot_standby_feedback.
After some off-list discussion and debugging, Andres and I have
managed to identify three issues here (so far). Two are issues in the
testing, and one is a data-corrupting bug in the freeze map code.
1. pg_check_visible keeps on using the same OldestXmin for all its
checks even though the real OldestXmin may advance in the meantime.
This can lead to spurious problem reports: pg_check_visible() thinks
that the tuple isn't all visible yet and reports it as corruption, but
in reality there's no problem.
2. pg_check_visible includes the same check for heap-xmin-committed
that vacuumlazy.c uses, but hint bits aren't crash safe, so this could
lead to a spurious trouble report in a scenario involving a crash.
3. vacuumlazy.c includes this code:
if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
MultiXactCutoff, &frozen[nfrozen]))
frozen[nfrozen++].offset = offnum;
else if (heap_tuple_needs_eventual_freeze(tuple.t_data))
all_frozen = false;
That's wrong, because a "true" return value from
heap_prepare_freeze_tuple() means only that it has done *some*
freezing work on the tuple, not that it's done all of the freezing
work that will ever need to be done. So, if the tuple's xmin can be
frozen and is aborted but not older than vacuum_freeze_min_age, then
heap_prepare_freeze_tuple() won't free xmax, but the page will still
be marked all-frozen, which is bad. I think it normally won't matter
because the xmax will probably be hinted invalid anyway, since we just
pruned the page which should have set hint bits everywhere, but if
those hint bits were lost then we'd eventually end up with an
accessible xmax pointing off into space.
My first thought was to just delete the "else" but that would be bad
because we'd fail to set all-frozen immediately in a lot of cases
where we should. This needs a bit more thought than I have time to
give it right now.
(I will update on the status of this open item again no later than
Monday; probably sooner.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas wrote:
3. vacuumlazy.c includes this code:
if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
MultiXactCutoff, &frozen[nfrozen]))
frozen[nfrozen++].offset = offnum;
else if (heap_tuple_needs_eventual_freeze(tuple.t_data))
all_frozen = false;
That's wrong, because a "true" return value from
heap_prepare_freeze_tuple() means only that it has done *some*
freezing work on the tuple, not that it's done all of the freezing
work that will ever need to be done. So, if the tuple's xmin can be
frozen and is aborted but not older than vacuum_freeze_min_age, then
heap_prepare_freeze_tuple() won't freeze xmax, but the page will still
be marked all-frozen, which is bad. I think it normally won't matter
because the xmax will probably be hinted invalid anyway, since we just
pruned the page which should have set hint bits everywhere, but if
those hint bits were lost then we'd eventually end up with an
accessible xmax pointing off into space.
Good catch. Also consider multixact freezing: if there is a
long-running transaction which is a lock-only member of tuple's Xmax,
and the multixact needs freezing because it's older than the multixact
cutoff, we set the xmax to a new multixact which includes that old
locker. See FreezeMultiXactId.
My first thought was to just delete the "else" but that would be bad
because we'd fail to set all-frozen immediately in a lot of cases
where we should. This needs a bit more thought than I have time to
give it right now.
How about changing the return tuple of heap_prepare_freeze_tuple to
a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing
needed"
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 10, 2016 at 4:55 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Robert Haas wrote:
3. vacuumlazy.c includes this code:
if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
MultiXactCutoff, &frozen[nfrozen]))
frozen[nfrozen++].offset = offnum;
else if (heap_tuple_needs_eventual_freeze(tuple.t_data))
all_frozen = false;
That's wrong, because a "true" return value from
heap_prepare_freeze_tuple() means only that it has done *some*
freezing work on the tuple, not that it's done all of the freezing
work that will ever need to be done. So, if the tuple's xmin can be
frozen and is aborted but not older than vacuum_freeze_min_age, then
heap_prepare_freeze_tuple() won't freeze xmax, but the page will still
be marked all-frozen, which is bad. I think it normally won't matter
because the xmax will probably be hinted invalid anyway, since we just
pruned the page which should have set hint bits everywhere, but if
those hint bits were lost then we'd eventually end up with an
accessible xmax pointing off into space.
Good catch. Also consider multixact freezing: if there is a
long-running transaction which is a lock-only member of tuple's Xmax,
and the multixact needs freezing because it's older than the multixact
cutoff, we set the xmax to a new multixact which includes that old
locker. See FreezeMultiXactId.
My first thought was to just delete the "else" but that would be bad
because we'd fail to set all-frozen immediately in a lot of cases
where we should. This needs a bit more thought than I have time to
give it right now.
How about changing the return tuple of heap_prepare_freeze_tuple to
a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing
needed"
Yes, I think something like that sounds about right.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jun 11, 2016 at 1:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
3. vacuumlazy.c includes this code:
if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
MultiXactCutoff,
&frozen[nfrozen]))
frozen[nfrozen++].offset = offnum;
else if (heap_tuple_needs_eventual_freeze(tuple.t_data))
all_frozen = false;
That's wrong, because a "true" return value from
heap_prepare_freeze_tuple() means only that it has done *some*
freezing work on the tuple, not that it's done all of the freezing
work that will ever need to be done. So, if the tuple's xmin can be
frozen and is aborted but not older than vacuum_freeze_min_age, then
heap_prepare_freeze_tuple() won't freeze xmax, but the page will still
be marked all-frozen, which is bad.
To clarify, are you talking about a case where insertion has aborted?
In such a case, won't the all_visible flag be set to false based on the
return value from HeapTupleSatisfiesVacuum(), and if so, shouldn't the
later code avoid marking the page as all_frozen?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
How about changing the return tuple of heap_prepare_freeze_tuple to
a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing
needed"
Yes, I think something like that sounds about right.
Here's a patch. I took the approach of adding a separate bool out
parameter instead. I am also attaching an update of the
check-visibility patch, which responds to assorted review comments and
adjusts for the problems found on Friday that could otherwise
lead to false positives. I'm still getting occasional TIDs from the
pg_check_visible() function during pgbench runs, though, so evidently
not all is well with the world.
(Official status update: I'm hoping that senior hackers will carefully
review these patches for defects. If they do not, I plan to commit
the patches anyway neither less than 48 nor more than 60 hours from
now after re-reviewing them myself.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
fix-freeze-map-v1.patch
From 95dca25221681959e7b1d2c628f6cf97ee20fe49 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 13 Jun 2016 11:49:07 -0400
Subject: [PATCH 2/2] Fix possible data-corrupting bug in new freeze map code.
The prior code incorrectly assumed that if a tuple had been frozen, it
would not need to be frozen again later. However, this can be false,
because xmin and xmax (and conceivably xvac, if dealing with tuples from
very old releases) could be frozen at separate times.
---
src/backend/access/heap/heapam.c | 45 ++++++++++++++++++++++++++++-----------
src/backend/commands/vacuumlazy.c | 8 +++++--
src/include/access/heapam_xlog.h | 3 ++-
3 files changed, 40 insertions(+), 16 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0b3332e..22b3f5f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6377,7 +6377,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* are older than the specified cutoff XID and cutoff MultiXactId. If so,
* setup enough state (in the *frz output argument) to later execute and
* WAL-log what we would need to do, and return TRUE. Return FALSE if nothing
- * is to be changed.
+ * is to be changed. In addition, set *totally_frozen_p to true if the tuple
+ * will be totally frozen after these operations are performed and false if
+ * more freezing will eventually be required.
*
* Caller is responsible for setting the offset field, if appropriate.
*
@@ -6402,12 +6404,12 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz)
-
+ xl_heap_freeze_tuple *frz, bool *totally_frozen_p)
{
bool changed = false;
bool freeze_xmax = false;
TransactionId xid;
+ bool totally_frozen = true;
frz->frzflags = 0;
frz->t_infomask2 = tuple->t_infomask2;
@@ -6416,11 +6418,15 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
/* Process xmin */
xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
+ if (TransactionIdIsNormal(xid))
{
- frz->t_infomask |= HEAP_XMIN_FROZEN;
- changed = true;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ {
+ frz->t_infomask |= HEAP_XMIN_FROZEN;
+ changed = true;
+ }
+ else
+ totally_frozen = false;
}
/*
@@ -6458,6 +6464,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (flags & FRM_MARK_COMMITTED)
frz->t_infomask |= HEAP_XMAX_COMMITTED;
changed = true;
+ totally_frozen = false;
}
else if (flags & FRM_RETURN_IS_MULTI)
{
@@ -6479,16 +6486,19 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
frz->xmax = newxmax;
changed = true;
+ totally_frozen = false;
}
else
{
Assert(flags & FRM_NOOP);
}
}
- else if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
+ else if (TransactionIdIsNormal(xid))
{
- freeze_xmax = true;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ freeze_xmax = true;
+ else
+ totally_frozen = false;
}
if (freeze_xmax)
@@ -6514,8 +6524,15 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
+ /*
+ * For Xvac, we ignore the cutoff_xid and just always perform the
+ * freeze operation. The oldest release in which such a value can
+ * actually be set is PostgreSQL 8.4, because old-style VACUUM FULL
+ * was removed in PostgreSQL 9.0. Note that if we were to respect
+ * cutoff_xid here, we'd need to make sure to clear totally_frozen
+ * when we skipped freezing on that basis.
+ */
+ if (TransactionIdIsNormal(xid))
{
/*
* If a MOVED_OFF tuple is not dead, the xvac transaction must
@@ -6537,6 +6554,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
}
}
+ *totally_frozen_p = totally_frozen;
return changed;
}
@@ -6587,9 +6605,10 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
{
xl_heap_freeze_tuple frz;
bool do_freeze;
+ bool tuple_totally_frozen;
do_freeze = heap_prepare_freeze_tuple(tuple, cutoff_xid, cutoff_multi,
- &frz);
+ &frz, &tuple_totally_frozen);
/*
* Note that because this is not a WAL-logged operation, we don't need to
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 0010ca9..cb5777f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1054,6 +1054,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
+ bool tuple_totally_frozen;
+
num_tuples += 1;
hastup = true;
@@ -1062,9 +1064,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* freezing. Note we already have exclusive buffer lock.
*/
if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
- MultiXactCutoff, &frozen[nfrozen]))
+ MultiXactCutoff, &frozen[nfrozen],
+ &tuple_totally_frozen))
frozen[nfrozen++].offset = offnum;
- else if (heap_tuple_needs_eventual_freeze(tuple.t_data))
+
+ if (!tuple_totally_frozen)
all_frozen = false;
}
} /* scan along page */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index ad30217..a822d0b 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -386,7 +386,8 @@ extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz);
+ xl_heap_freeze_tuple *frz,
+ bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
--
2.5.4 (Apple Git-61)
check-visibility-v4.patch
From 3054c56ea8644d78d427fd36f3736322a51742a8 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 10 Jun 2016 14:42:46 -0400
Subject: [PATCH 1/2] Add integrity-checking functions to pg_visibility.
The new pg_check_visible() and pg_check_frozen() functions can be used to
verify that the visibility map bits for a relation's data pages match the
actual state of the tuples on those pages.
Amit Kapila and Robert Haas, reviewed by Andres Freund.
---
contrib/pg_visibility/Makefile | 2 +-
contrib/pg_visibility/pg_visibility--1.0--1.1.sql | 17 ++
contrib/pg_visibility/pg_visibility--1.0.sql | 52 ----
contrib/pg_visibility/pg_visibility--1.1.sql | 67 +++++
contrib/pg_visibility/pg_visibility.c | 306 ++++++++++++++++++++++
contrib/pg_visibility/pg_visibility.control | 2 +-
doc/src/sgml/pgvisibility.sgml | 28 +-
src/tools/pgindent/typedefs.list | 1 +
8 files changed, 420 insertions(+), 55 deletions(-)
create mode 100644 contrib/pg_visibility/pg_visibility--1.0--1.1.sql
delete mode 100644 contrib/pg_visibility/pg_visibility--1.0.sql
create mode 100644 contrib/pg_visibility/pg_visibility--1.1.sql
diff --git a/contrib/pg_visibility/Makefile b/contrib/pg_visibility/Makefile
index fbbaa2e..379591a 100644
--- a/contrib/pg_visibility/Makefile
+++ b/contrib/pg_visibility/Makefile
@@ -4,7 +4,7 @@ MODULE_big = pg_visibility
OBJS = pg_visibility.o $(WIN32RES)
EXTENSION = pg_visibility
-DATA = pg_visibility--1.0.sql
+DATA = pg_visibility--1.1.sql pg_visibility--1.0--1.1.sql
PGFILEDESC = "pg_visibility - page visibility information"
ifdef USE_PGXS
diff --git a/contrib/pg_visibility/pg_visibility--1.0--1.1.sql b/contrib/pg_visibility/pg_visibility--1.0--1.1.sql
new file mode 100644
index 0000000..2c97dfd
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility--1.0--1.1.sql
@@ -0,0 +1,17 @@
+/* contrib/pg_visibility/pg_visibility--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_visibility UPDATE TO '1.1'" to load this file. \quit
+
+CREATE FUNCTION pg_check_frozen(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_frozen'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION pg_check_visible(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_visible'
+LANGUAGE C STRICT;
+
+REVOKE ALL ON FUNCTION pg_check_frozen(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_check_visible(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility--1.0.sql b/contrib/pg_visibility/pg_visibility--1.0.sql
deleted file mode 100644
index da511e5..0000000
--- a/contrib/pg_visibility/pg_visibility--1.0.sql
+++ /dev/null
@@ -1,52 +0,0 @@
-/* contrib/pg_visibility/pg_visibility--1.0.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION pg_visibility" to load this file. \quit
-
--- Show visibility map information.
-CREATE FUNCTION pg_visibility_map(regclass, blkno bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean)
-RETURNS record
-AS 'MODULE_PATHNAME', 'pg_visibility_map'
-LANGUAGE C STRICT;
-
--- Show visibility map and page-level visibility information.
-CREATE FUNCTION pg_visibility(regclass, blkno bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean,
- pd_all_visible OUT boolean)
-RETURNS record
-AS 'MODULE_PATHNAME', 'pg_visibility'
-LANGUAGE C STRICT;
-
--- Show visibility map information for each block in a relation.
-CREATE FUNCTION pg_visibility_map(regclass, blkno OUT bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean)
-RETURNS SETOF record
-AS 'MODULE_PATHNAME', 'pg_visibility_map_rel'
-LANGUAGE C STRICT;
-
--- Show visibility map and page-level visibility information for each block.
-CREATE FUNCTION pg_visibility(regclass, blkno OUT bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean,
- pd_all_visible OUT boolean)
-RETURNS SETOF record
-AS 'MODULE_PATHNAME', 'pg_visibility_rel'
-LANGUAGE C STRICT;
-
--- Show summary of visibility map bits for a relation.
-CREATE FUNCTION pg_visibility_map_summary(regclass,
- OUT all_visible bigint, OUT all_frozen bigint)
-RETURNS record
-AS 'MODULE_PATHNAME', 'pg_visibility_map_summary'
-LANGUAGE C STRICT;
-
--- Don't want these to be available to public.
-REVOKE ALL ON FUNCTION pg_visibility_map(regclass, bigint) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility(regclass, bigint) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility_map(regclass) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility(regclass) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility_map_summary(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility--1.1.sql b/contrib/pg_visibility/pg_visibility--1.1.sql
new file mode 100644
index 0000000..b49b644
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility--1.1.sql
@@ -0,0 +1,67 @@
+/* contrib/pg_visibility/pg_visibility--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_visibility" to load this file. \quit
+
+-- Show visibility map information.
+CREATE FUNCTION pg_visibility_map(regclass, blkno bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility_map'
+LANGUAGE C STRICT;
+
+-- Show visibility map and page-level visibility information.
+CREATE FUNCTION pg_visibility(regclass, blkno bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean,
+ pd_all_visible OUT boolean)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility'
+LANGUAGE C STRICT;
+
+-- Show visibility map information for each block in a relation.
+CREATE FUNCTION pg_visibility_map(regclass, blkno OUT bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_visibility_map_rel'
+LANGUAGE C STRICT;
+
+-- Show visibility map and page-level visibility information for each block.
+CREATE FUNCTION pg_visibility(regclass, blkno OUT bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean,
+ pd_all_visible OUT boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_visibility_rel'
+LANGUAGE C STRICT;
+
+-- Show summary of visibility map bits for a relation.
+CREATE FUNCTION pg_visibility_map_summary(regclass,
+ OUT all_visible bigint, OUT all_frozen bigint)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility_map_summary'
+LANGUAGE C STRICT;
+
+-- Show tupleids of non-frozen tuples if any in all_frozen pages
+-- for a relation.
+CREATE FUNCTION pg_check_frozen(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_frozen'
+LANGUAGE C STRICT;
+
+-- Show tupleids of dead tuples if any in all_visible pages for a relation.
+CREATE FUNCTION pg_check_visible(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_visible'
+LANGUAGE C STRICT;
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_visibility_map(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility_map(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility_map_summary(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_check_frozen(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_check_visible(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 9edf239..f84e72b 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -14,6 +14,7 @@
#include "funcapi.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/procarray.h"
#include "utils/rel.h"
PG_MODULE_MAGIC;
@@ -25,14 +26,26 @@ typedef struct vbits
uint8 bits[FLEXIBLE_ARRAY_MEMBER];
} vbits;
+typedef struct corrupt_items
+{
+ BlockNumber next;
+ BlockNumber count;
+ ItemPointer tids;
+} corrupt_items;
+
PG_FUNCTION_INFO_V1(pg_visibility_map);
PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
PG_FUNCTION_INFO_V1(pg_visibility);
PG_FUNCTION_INFO_V1(pg_visibility_rel);
PG_FUNCTION_INFO_V1(pg_visibility_map_summary);
+PG_FUNCTION_INFO_V1(pg_check_frozen);
+PG_FUNCTION_INFO_V1(pg_check_visible);
static TupleDesc pg_visibility_tupdesc(bool include_blkno, bool include_pd);
static vbits *collect_visibility_data(Oid relid, bool include_pd);
+static corrupt_items *collect_corrupt_items(Oid relid, bool all_visible,
+ bool all_frozen);
+static void record_corrupt_item(corrupt_items *items, ItemPointer tid);
/*
* Visibility map information for a single block of a relation.
@@ -259,6 +272,68 @@ pg_visibility_map_summary(PG_FUNCTION_ARGS)
}
/*
+ * Return the TIDs of non-frozen tuples present in pages marked all-frozen
+ * in the visibility map. We hope no one will ever find any, but there could
+ * be bugs, database corruption, etc.
+ */
+Datum
+pg_check_frozen(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ corrupt_items *items;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ Oid relid = PG_GETARG_OID(0);
+ MemoryContext oldcontext;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ funcctx->user_fctx = collect_corrupt_items(relid, false, true);
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+ items = (corrupt_items *) funcctx->user_fctx;
+
+ if (items->next < items->count)
+ SRF_RETURN_NEXT(funcctx, PointerGetDatum(&items->tids[items->next++]));
+
+ SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Return the TIDs of not-all-visible tuples in pages marked all-visible
+ * in the visibility map. We hope no one will ever find any, but there could
+ * be bugs, database corruption, etc.
+ */
+Datum
+pg_check_visible(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ corrupt_items *items;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ Oid relid = PG_GETARG_OID(0);
+ MemoryContext oldcontext;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ funcctx->user_fctx = collect_corrupt_items(relid, true, false);
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+ items = (corrupt_items *) funcctx->user_fctx;
+
+ if (items->next < items->count)
+ SRF_RETURN_NEXT(funcctx, PointerGetDatum(&items->tids[items->next++]));
+
+ SRF_RETURN_DONE(funcctx);
+}
+
+/*
* Helper function to construct whichever TupleDesc we need for a particular
* call.
*/
@@ -348,3 +423,234 @@ collect_visibility_data(Oid relid, bool include_pd)
return info;
}
+
+/*
+ * Returns a list of items whose visibility map information does not match
+ * the status of the tuples on the page.
+ *
+ * If all_visible is passed as true, this will include all items which are
+ * on pages marked as all-visible in the visibility map but which do not
+ * seem to in fact be all-visible.
+ *
+ * If all_frozen is passed as true, this will include all items which are
+ * on pages marked as all-frozen but which do not seem to in fact be frozen.
+ */
+static corrupt_items *
+collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
+{
+ Relation rel;
+ BlockNumber nblocks;
+ corrupt_items *items;
+ BlockNumber blkno;
+ Buffer vmbuffer = InvalidBuffer;
+ BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ TransactionId OldestXmin = InvalidTransactionId;
+
+ if (all_visible)
+ {
+ /* Don't pass rel; that will fail in recovery. */
+ OldestXmin = GetOldestXmin(NULL, true);
+ }
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (rel->rd_rel->relkind != RELKIND_RELATION &&
+ rel->rd_rel->relkind != RELKIND_MATVIEW &&
+ rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table, materialized view, or TOAST table",
+ RelationGetRelationName(rel))));
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /*
+ * Guess an initial array size. We don't expect many corrupted tuples, so
+ * start with a small array. This function uses the "next" field to track
+ * the next offset where we can store an item (which is the same thing as
+ * the number of items found so far) and the "count" field to track the
+ * number of entries allocated. We'll repurpose these fields before
+ * returning.
+ */
+ items = palloc0(sizeof(corrupt_items));
+ items->next = 0;
+ items->count = 64;
+ items->tids = palloc(items->count * sizeof(ItemPointerData));
+
+ /* Loop over every block in the relation. */
+ for (blkno = 0; blkno < nblocks; ++blkno)
+ {
+ bool check_frozen = false;
+ bool check_visible = false;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum,
+ maxoff;
+
+ /* Make sure we are interruptible. */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Use the visibility map to decide whether to check this page. */
+ if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+ check_frozen = true;
+ if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+ check_visible = true;
+ if (!check_visible && !check_frozen)
+ continue;
+
+ /* Read and lock the page. */
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ bstrategy);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * The visibility map bits might have changed while we were acquiring
+ * the page lock. Recheck to avoid returning spurious results.
+ */
+ if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+ check_frozen = false;
+ if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+ check_visible = false;
+ if (!check_visible && !check_frozen)
+ {
+ UnlockReleaseBuffer(buffer);
+ continue;
+ }
+
+ /* Iterate over each tuple on the page. */
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ HeapTupleData tuple;
+ ItemId itemid;
+
+ itemid = PageGetItemId(page, offnum);
+
+ /* Unused or redirect line pointers are of no interest. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid))
+ continue;
+
+ /* Dead line pointers are neither all-visible nor frozen. */
+ if (ItemIdIsDead(itemid))
+ {
+ ItemPointerData tid;
+
+ ItemPointerSet(&tid, blkno, offnum);
+ record_corrupt_item(items, &tid);
+ continue;
+ }
+
+ /* Initialize a HeapTupleData structure for checks below. */
+ tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+ tuple.t_len = ItemIdGetLength(itemid);
+ tuple.t_tableOid = relid;
+
+ /*
+ * If we're checking whether the page is all-visible, we expect
+ * the tuple to be live, xmin to be hinted committed, and the xmin
+ * to be old enough that everyone can see it.
+ *
+ * NB: Neither lazy_scan_heap nor heap_page_is_all_visible will
+ * mark a page all-visible unless every tuple is hinted committed.
+ * However, those hint bits could be lost after a crash, so we
+ * can't be certain that they'll be set here.
+ */
+ if (check_visible)
+ {
+ HTSV_Result state;
+
+ state = HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buffer);
+ if (state != HEAPTUPLE_LIVE)
+ record_corrupt_item(items, &tuple.t_data->t_ctid);
+ else
+ {
+ TransactionId xmin;
+
+ xmin = HeapTupleHeaderGetXmin(tuple.t_data);
+ if (!TransactionIdPrecedes(xmin, OldestXmin))
+ {
+ TransactionId RecomputedOldestXmin;
+
+ /*
+ * Time has passed since we computed OldestXmin, so
+ * it's possible that this tuple is all-visible in
+ * reality even though it doesn't appear so based on
+ * our previously-computed value. Let's compute a new
+ * value so we can be certain whether there is a
+ * problem.
+ *
+ * From a concurrency point of view, it sort of sucks
+ * to retake ProcArrayLock here while we're holding
+ * the buffer exclusively locked, but it should be
+ * safe against deadlocks, because surely
+ * GetOldestXmin() should never take a buffer lock.
+ * And this shouldn't happen often, so it's worth
+ * being careful so as to avoid false positives.
+ */
+ RecomputedOldestXmin = GetOldestXmin(NULL, true);
+
+ if (TransactionIdPrecedes(OldestXmin,
+ RecomputedOldestXmin))
+ OldestXmin = RecomputedOldestXmin;
+
+ /*
+ * If we still fail this test, VACUUM definitely
+ * marked the tuple all-visible too early.
+ */
+ if (!TransactionIdPrecedes(xmin, OldestXmin))
+ record_corrupt_item(items, &tuple.t_data->t_ctid);
+ }
+ }
+ }
+
+ /*
+ * If we're checking whether the page is all-frozen, we expect the
+ * tuple to be in a state where it will never need freezing.
+ */
+ if (check_frozen)
+ {
+ if (heap_tuple_needs_eventual_freeze(tuple.t_data))
+ record_corrupt_item(items, &tuple.t_data->t_ctid);
+ }
+ }
+
+ UnlockReleaseBuffer(buffer);
+ }
+
+ /* Clean up. */
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
+ relation_close(rel, AccessShareLock);
+
+ /*
+ * Before returning, repurpose the fields to match caller's expectations.
+ * next is now the next item that should be read (rather than written) and
+ * count is now the number of items we wrote (rather than the number we
+ * allocated).
+ */
+ items->count = items->next;
+ items->next = 0;
+
+ return items;
+}
+
+/*
+ * Remember one corrupt item.
+ */
+static void
+record_corrupt_item(corrupt_items *items, ItemPointer tid)
+{
+ /* enlarge output array if needed. */
+ if (items->next >= items->count)
+ {
+ items->count *= 2;
+ items->tids = repalloc(items->tids,
+ items->count * sizeof(ItemPointerData));
+ }
+ /* and add the new item */
+ items->tids[items->next++] = *tid;
+}
diff --git a/contrib/pg_visibility/pg_visibility.control b/contrib/pg_visibility/pg_visibility.control
index 1d71853..f93ed01 100644
--- a/contrib/pg_visibility/pg_visibility.control
+++ b/contrib/pg_visibility/pg_visibility.control
@@ -1,5 +1,5 @@
# pg_visibility extension
comment = 'examine the visibility map (VM) and page-level visibility info'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/pg_visibility'
relocatable = true
diff --git a/doc/src/sgml/pgvisibility.sgml b/doc/src/sgml/pgvisibility.sgml
index 48b003d..4cdca7d 100644
--- a/doc/src/sgml/pgvisibility.sgml
+++ b/doc/src/sgml/pgvisibility.sgml
@@ -32,7 +32,8 @@
Functions which display information about <literal>PD_ALL_VISIBLE</>
are much more costly than those which only consult the visibility map,
because they must read the relation's data blocks rather than only the
- (much smaller) visibility map.
+ (much smaller) visibility map. Functions that check the relation's
+ data blocks are similarly expensive.
</para>
<sect2>
@@ -92,6 +93,31 @@
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><function>pg_check_frozen(regclass, t_ctid OUT tid) returns setof tid</function></term>
+
+ <listitem>
+ <para>
+ Returns the TIDs of non-frozen tuples present in pages marked all-frozen
+ in the visibility map. If this function returns a non-empty set of
+ TIDs, the database is corrupt.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>pg_check_visible(regclass, t_ctid OUT tid) returns setof tid</function></term>
+
+ <listitem>
+ <para>
+ Returns the TIDs of tuples which are not all-visible despite the fact
+ that the pages which contain them are marked as all-visible in the
+ visibility map. If this function returns a non-empty set of TIDs, the
+ database is corrupt.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
<para>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9aa29f6..0c61fc2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2372,6 +2372,7 @@ convert_testexpr_context
core_YYSTYPE
core_yy_extra_type
core_yyscan_t
+corrupt_items
cost_qual_eval_context
count_agg_clauses_context
create_upper_paths_hook_type
--
2.5.4 (Apple Git-61)
On June 13, 2016 11:02:42 AM CDT, Robert Haas <robertmhaas@gmail.com> wrote:
(Official status update: I'm hoping that senior hackers will carefully
review these patches for defects. If they do not, I plan to commit
the patches anyway neither less than 48 nor more than 60 hours from
now after re-reviewing them myself.)
I'm traveling today and tomorrow, but will look after that.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Jun 14, 2016 at 4:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
How about changing the return tuple of heap_prepare_freeze_tuple to
a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing
needed"

Yes, I think something like that sounds about right.
Here's a patch. I took the approach of adding a separate bool out
parameter instead. I am also attaching an update of the
check-visibility patch which responds to assorted review comments and
adjusts it for the problems found on Friday which could otherwise
lead to false positives. I'm still getting occasional TIDs from the
pg_check_visible() function during pgbench runs, though, so evidently
not all is well with the world.
I'm still working out how half this stuff works, but I managed to get
pg_check_visible() to spit out a row every few seconds with the
following brute force approach:
CREATE TABLE foo (n int);
INSERT INTO foo SELECT generate_series(1, 100000);
Three client threads (see attached script):
1. Run VACUUM in a tight loop.
2. Run UPDATE foo SET n = n + 1 in a tight loop.
3. Run SELECT pg_check_visible('foo'::regclass) in a tight loop, and
print out any rows it produces.
I noticed that the tuples that it reported were always offset 1 in a
page, and that the page always had a maxoff over a couple of hundred,
and that we called record_corrupt_item because VM_ALL_VISIBLE returned
true but HeapTupleSatisfiesVacuum on the first tuple returned
HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE.
It did that because HEAP_XMAX_COMMITTED was not set and
TransactionIdIsInProgress returned true for xmax.
Not sure how much of this was already obvious! I will poke at it some
more tomorrow.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
On Tue, Jun 14, 2016 at 2:53 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Tue, Jun 14, 2016 at 4:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
How about changing the return tuple of heap_prepare_freeze_tuple to
a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing
needed"

Yes, I think something like that sounds about right.
Here's a patch. I took the approach of adding a separate bool out
parameter instead. I am also attaching an update of the
check-visibility patch which responds to assorted review comments and
adjusting it for the problems found on Friday which could otherwise
lead to false positives. I'm still getting occasional TIDs from the
pg_check_visible() function during pgbench runs, though, so evidently
not all is well with the world.

I'm still working out how half this stuff works, but I managed to get
pg_check_visible() to spit out a row every few seconds with the
following brute force approach:

CREATE TABLE foo (n int);
INSERT INTO foo SELECT generate_series(1, 100000);

Three client threads (see attached script):
1. Run VACUUM in a tight loop.
2. Run UPDATE foo SET n = n + 1 in a tight loop.
3. Run SELECT pg_check_visible('foo'::regclass) in a tight loop, and
print out any rows it produces.

I noticed that the tuples that it reported were always offset 1 in a
page, and that the page always had a maxoff over a couple of hundred,
and that we called record_corrupt_item because VM_ALL_VISIBLE returned
true but HeapTupleSatisfiesVacuum on the first tuple returned
HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE.
It did that because HEAP_XMAX_COMMITTED was not set and
TransactionIdIsInProgress returned true for xmax.
So this seems like it might be a visibility map bug rather than a bug
in the test code, but I'm not completely sure of that. How was it
legitimate to mark the page as all-visible if a tuple on the page
still had a live xmax? If xmax is live and not just a locker then the
tuple is not visible to the transaction that wrote xmax, at least.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 14, 2016 at 8:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jun 14, 2016 at 2:53 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Tue, Jun 14, 2016 at 4:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
How about changing the return tuple of heap_prepare_freeze_tuple to
a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing
needed"

Yes, I think something like that sounds about right.
Here's a patch. I took the approach of adding a separate bool out
parameter instead. I am also attaching an update of the
check-visibility patch which responds to assorted review comments and
adjusting it for the problems found on Friday which could otherwise
lead to false positives. I'm still getting occasional TIDs from the
pg_check_visible() function during pgbench runs, though, so evidently
not all is well with the world.

I'm still working out how half this stuff works, but I managed to get
pg_check_visible() to spit out a row every few seconds with the
following brute force approach:

CREATE TABLE foo (n int);
INSERT INTO foo SELECT generate_series(1, 100000);

Three client threads (see attached script):
1. Run VACUUM in a tight loop.
2. Run UPDATE foo SET n = n + 1 in a tight loop.
3. Run SELECT pg_check_visible('foo'::regclass) in a tight loop, and
print out any rows it produces.

I noticed that the tuples that it reported were always offset 1 in a
page, and that the page always had a maxoff over a couple of hundred,
and that we called record_corrupt_item because VM_ALL_VISIBLE returned
true but HeapTupleSatisfiesVacuum on the first tuple returned
HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE.
It did that because HEAP_XMAX_COMMITTED was not set and
TransactionIdIsInProgress returned true for xmax.

So this seems like it might be a visibility map bug rather than a bug
in the test code, but I'm not completely sure of that. How was it
legitimate to mark the page as all-visible if a tuple on the page
still had a live xmax? If xmax is live and not just a locker then the
tuple is not visible to the transaction that wrote xmax, at least.
Ah, wait a minute. I see how this could happen. Hang on, let me
update the pg_visibility patch.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I noticed that the tuples that it reported were always offset 1 in a
page, and that the page always had a maxoff over a couple of hundred,
and that we called record_corrupt_item because VM_ALL_VISIBLE returned
true but HeapTupleSatisfiesVacuum on the first tuple returned
HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE.
It did that because HEAP_XMAX_COMMITTED was not set and
TransactionIdIsInProgress returned true for xmax.

So this seems like it might be a visibility map bug rather than a bug
in the test code, but I'm not completely sure of that. How was it
legitimate to mark the page as all-visible if a tuple on the page
still had a live xmax? If xmax is live and not just a locker then the
tuple is not visible to the transaction that wrote xmax, at least.

Ah, wait a minute. I see how this could happen. Hang on, let me
update the pg_visibility patch.
The problem should be fixed in the attached revision of
pg_check_visible. I think what happened is:
1. pg_check_visible computed an OldestXmin.
2. Some transaction committed.
3. VACUUM computed a newer OldestXmin and marked a page all-visible with it.
4. pg_check_visible then used its older OldestXmin to check the
visibility of tuples on that page, and saw delete-in-progress as a
result.
I added a guard against a similar scenario involving xmin in the last
version of this patch, but forgot that we need to protect xmax in the
same way. With this version of the patch, I can no longer get any
TIDs to pop up out of pg_check_visible in my testing. (I haven't run
your test script for lack of the proper Python environment...)
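The recheck logic just described can be modeled as a toy. Function names and plain-integer XIDs here are simplifications: real XID comparisons use modulo-2^32 arithmetic, and visibility involves more than xmin.

```python
# Toy model of the OldestXmin recheck described above. XIDs are plain
# ints and "all-visible" is reduced to an xmin-precedes-horizon test;
# both are simplifying assumptions, not the real heapam logic.

def transaction_id_precedes(a, b):
    # The real code uses modulo-2^32 XID arithmetic; a plain compare
    # suffices for this sketch.
    return a < b

def tuple_all_visible(xmin, oldest_xmin):
    return transaction_id_precedes(xmin, oldest_xmin)

def check_tuple(xmin, cached_oldest_xmin, recomputed_oldest_xmin):
    """Report corruption only if the tuple still fails the visibility
    test under a freshly recomputed horizon."""
    if tuple_all_visible(xmin, cached_oldest_xmin):
        return "ok"
    # The cached horizon may simply be stale; recompute before complaining.
    if not transaction_id_precedes(cached_oldest_xmin, recomputed_oldest_xmin):
        return "corrupt"  # horizon did not advance: a genuine mismatch
    if tuple_all_visible(xmin, recomputed_oldest_xmin):
        return "ok"       # false positive avoided
    return "corrupt"

# The scenario from the mail: the checker cached horizon 100, a transaction
# with xid 100 committed, and VACUUM marked the page using horizon 101.
print(check_tuple(xmin=100, cached_oldest_xmin=100,
                  recomputed_oldest_xmin=101))  # prints "ok"
```

With the stale horizon alone the tuple would have been reported; recomputing resolves it as a false positive, which is the guard the patch adds for xmax as well as xmin.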
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
check-visibility-v5.patch (text/x-diff; charset=US-ASCII)
From 18815b0d6fcfc2048e47f104ef85ee981687d4de Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 10 Jun 2016 14:42:46 -0400
Subject: [PATCH 1/2] Add integrity-checking functions to pg_visibility.
The new pg_check_visible() and pg_check_frozen() functions can be used to
verify that the visibility map bits for a relation's data pages match the
actual state of the tuples on those pages.
Amit Kapila and Robert Haas, reviewed by Andres Freund. Additional
testing help by Thomas Munro.
---
contrib/pg_visibility/Makefile | 2 +-
contrib/pg_visibility/pg_visibility--1.0--1.1.sql | 17 ++
contrib/pg_visibility/pg_visibility--1.0.sql | 52 ----
contrib/pg_visibility/pg_visibility--1.1.sql | 67 +++++
contrib/pg_visibility/pg_visibility.c | 313 ++++++++++++++++++++++
contrib/pg_visibility/pg_visibility.control | 2 +-
doc/src/sgml/pgvisibility.sgml | 28 +-
src/tools/pgindent/typedefs.list | 1 +
8 files changed, 427 insertions(+), 55 deletions(-)
create mode 100644 contrib/pg_visibility/pg_visibility--1.0--1.1.sql
delete mode 100644 contrib/pg_visibility/pg_visibility--1.0.sql
create mode 100644 contrib/pg_visibility/pg_visibility--1.1.sql
diff --git a/contrib/pg_visibility/Makefile b/contrib/pg_visibility/Makefile
index fbbaa2e..379591a 100644
--- a/contrib/pg_visibility/Makefile
+++ b/contrib/pg_visibility/Makefile
@@ -4,7 +4,7 @@ MODULE_big = pg_visibility
OBJS = pg_visibility.o $(WIN32RES)
EXTENSION = pg_visibility
-DATA = pg_visibility--1.0.sql
+DATA = pg_visibility--1.1.sql pg_visibility--1.0--1.1.sql
PGFILEDESC = "pg_visibility - page visibility information"
ifdef USE_PGXS
diff --git a/contrib/pg_visibility/pg_visibility--1.0--1.1.sql b/contrib/pg_visibility/pg_visibility--1.0--1.1.sql
new file mode 100644
index 0000000..2c97dfd
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility--1.0--1.1.sql
@@ -0,0 +1,17 @@
+/* contrib/pg_visibility/pg_visibility--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_visibility UPDATE TO '1.1'" to load this file. \quit
+
+CREATE FUNCTION pg_check_frozen(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_frozen'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION pg_check_visible(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_visible'
+LANGUAGE C STRICT;
+
+REVOKE ALL ON FUNCTION pg_check_frozen(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_check_visible(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility--1.0.sql b/contrib/pg_visibility/pg_visibility--1.0.sql
deleted file mode 100644
index da511e5..0000000
--- a/contrib/pg_visibility/pg_visibility--1.0.sql
+++ /dev/null
@@ -1,52 +0,0 @@
-/* contrib/pg_visibility/pg_visibility--1.0.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION pg_visibility" to load this file. \quit
-
--- Show visibility map information.
-CREATE FUNCTION pg_visibility_map(regclass, blkno bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean)
-RETURNS record
-AS 'MODULE_PATHNAME', 'pg_visibility_map'
-LANGUAGE C STRICT;
-
--- Show visibility map and page-level visibility information.
-CREATE FUNCTION pg_visibility(regclass, blkno bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean,
- pd_all_visible OUT boolean)
-RETURNS record
-AS 'MODULE_PATHNAME', 'pg_visibility'
-LANGUAGE C STRICT;
-
--- Show visibility map information for each block in a relation.
-CREATE FUNCTION pg_visibility_map(regclass, blkno OUT bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean)
-RETURNS SETOF record
-AS 'MODULE_PATHNAME', 'pg_visibility_map_rel'
-LANGUAGE C STRICT;
-
--- Show visibility map and page-level visibility information for each block.
-CREATE FUNCTION pg_visibility(regclass, blkno OUT bigint,
- all_visible OUT boolean,
- all_frozen OUT boolean,
- pd_all_visible OUT boolean)
-RETURNS SETOF record
-AS 'MODULE_PATHNAME', 'pg_visibility_rel'
-LANGUAGE C STRICT;
-
--- Show summary of visibility map bits for a relation.
-CREATE FUNCTION pg_visibility_map_summary(regclass,
- OUT all_visible bigint, OUT all_frozen bigint)
-RETURNS record
-AS 'MODULE_PATHNAME', 'pg_visibility_map_summary'
-LANGUAGE C STRICT;
-
--- Don't want these to be available to public.
-REVOKE ALL ON FUNCTION pg_visibility_map(regclass, bigint) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility(regclass, bigint) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility_map(regclass) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility(regclass) FROM PUBLIC;
-REVOKE ALL ON FUNCTION pg_visibility_map_summary(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility--1.1.sql b/contrib/pg_visibility/pg_visibility--1.1.sql
new file mode 100644
index 0000000..b49b644
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility--1.1.sql
@@ -0,0 +1,67 @@
+/* contrib/pg_visibility/pg_visibility--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_visibility" to load this file. \quit
+
+-- Show visibility map information.
+CREATE FUNCTION pg_visibility_map(regclass, blkno bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility_map'
+LANGUAGE C STRICT;
+
+-- Show visibility map and page-level visibility information.
+CREATE FUNCTION pg_visibility(regclass, blkno bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean,
+ pd_all_visible OUT boolean)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility'
+LANGUAGE C STRICT;
+
+-- Show visibility map information for each block in a relation.
+CREATE FUNCTION pg_visibility_map(regclass, blkno OUT bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_visibility_map_rel'
+LANGUAGE C STRICT;
+
+-- Show visibility map and page-level visibility information for each block.
+CREATE FUNCTION pg_visibility(regclass, blkno OUT bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean,
+ pd_all_visible OUT boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_visibility_rel'
+LANGUAGE C STRICT;
+
+-- Show summary of visibility map bits for a relation.
+CREATE FUNCTION pg_visibility_map_summary(regclass,
+ OUT all_visible bigint, OUT all_frozen bigint)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility_map_summary'
+LANGUAGE C STRICT;
+
+-- Show tupleids of non-frozen tuples if any in all_frozen pages
+-- for a relation.
+CREATE FUNCTION pg_check_frozen(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_frozen'
+LANGUAGE C STRICT;
+
+-- Show tupleids of dead tuples if any in all_visible pages for a relation.
+CREATE FUNCTION pg_check_visible(regclass, t_ctid OUT tid)
+RETURNS SETOF tid
+AS 'MODULE_PATHNAME', 'pg_check_visible'
+LANGUAGE C STRICT;
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_visibility_map(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility_map(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility_map_summary(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_check_frozen(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_check_visible(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 9edf239..abb92f3 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -14,6 +14,7 @@
#include "funcapi.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/procarray.h"
#include "utils/rel.h"
PG_MODULE_MAGIC;
@@ -25,14 +26,28 @@ typedef struct vbits
uint8 bits[FLEXIBLE_ARRAY_MEMBER];
} vbits;
+typedef struct corrupt_items
+{
+ BlockNumber next;
+ BlockNumber count;
+ ItemPointer tids;
+} corrupt_items;
+
PG_FUNCTION_INFO_V1(pg_visibility_map);
PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
PG_FUNCTION_INFO_V1(pg_visibility);
PG_FUNCTION_INFO_V1(pg_visibility_rel);
PG_FUNCTION_INFO_V1(pg_visibility_map_summary);
+PG_FUNCTION_INFO_V1(pg_check_frozen);
+PG_FUNCTION_INFO_V1(pg_check_visible);
static TupleDesc pg_visibility_tupdesc(bool include_blkno, bool include_pd);
static vbits *collect_visibility_data(Oid relid, bool include_pd);
+static corrupt_items *collect_corrupt_items(Oid relid, bool all_visible,
+ bool all_frozen);
+static void record_corrupt_item(corrupt_items *items, ItemPointer tid);
+static bool tuple_all_visible(HeapTuple tup, TransactionId OldestXmin,
+ Buffer buffer);
/*
* Visibility map information for a single block of a relation.
@@ -259,6 +274,68 @@ pg_visibility_map_summary(PG_FUNCTION_ARGS)
}
/*
+ * Return the TIDs of non-frozen tuples present in pages marked all-frozen
+ * in the visibility map. We hope no one will ever find any, but there could
+ * be bugs, database corruption, etc.
+ */
+Datum
+pg_check_frozen(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ corrupt_items *items;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ Oid relid = PG_GETARG_OID(0);
+ MemoryContext oldcontext;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ funcctx->user_fctx = collect_corrupt_items(relid, false, true);
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+ items = (corrupt_items *) funcctx->user_fctx;
+
+ if (items->next < items->count)
+ SRF_RETURN_NEXT(funcctx, PointerGetDatum(&items->tids[items->next++]));
+
+ SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Return the TIDs of not-all-visible tuples in pages marked all-visible
+ * in the visibility map. We hope no one will ever find any, but there could
+ * be bugs, database corruption, etc.
+ */
+Datum
+pg_check_visible(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ corrupt_items *items;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ Oid relid = PG_GETARG_OID(0);
+ MemoryContext oldcontext;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ funcctx->user_fctx = collect_corrupt_items(relid, true, false);
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+ items = (corrupt_items *) funcctx->user_fctx;
+
+ if (items->next < items->count)
+ SRF_RETURN_NEXT(funcctx, PointerGetDatum(&items->tids[items->next++]));
+
+ SRF_RETURN_DONE(funcctx);
+}
+
+/*
* Helper function to construct whichever TupleDesc we need for a particular
* call.
*/
@@ -348,3 +425,239 @@ collect_visibility_data(Oid relid, bool include_pd)
return info;
}
+
+/*
+ * Returns a list of items whose visibility map information does not match
+ * the status of the tuples on the page.
+ *
+ * If all_visible is passed as true, this will include all items which are
+ * on pages marked as all-visible in the visibility map but which do not
+ * seem to in fact be all-visible.
+ *
+ * If all_frozen is passed as true, this will include all items which are
+ * on pages marked as all-frozen but which do not seem to in fact be frozen.
+ */
+static corrupt_items *
+collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
+{
+ Relation rel;
+ BlockNumber nblocks;
+ corrupt_items *items;
+ BlockNumber blkno;
+ Buffer vmbuffer = InvalidBuffer;
+ BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ TransactionId OldestXmin = InvalidTransactionId;
+
+ if (all_visible)
+ {
+ /* Don't pass rel; that will fail in recovery. */
+ OldestXmin = GetOldestXmin(NULL, true);
+ }
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (rel->rd_rel->relkind != RELKIND_RELATION &&
+ rel->rd_rel->relkind != RELKIND_MATVIEW &&
+ rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table, materialized view, or TOAST table",
+ RelationGetRelationName(rel))));
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /*
+ * Guess an initial array size. We don't expect many corrupted tuples, so
+ * start with a small array. This function uses the "next" field to track
+ * the next offset where we can store an item (which is the same thing as
+ * the number of items found so far) and the "count" field to track the
+ * number of entries allocated. We'll repurpose these fields before
+ * returning.
+ */
+ items = palloc0(sizeof(corrupt_items));
+ items->next = 0;
+ items->count = 64;
+ items->tids = palloc(items->count * sizeof(ItemPointerData));
+
+ /* Loop over every block in the relation. */
+ for (blkno = 0; blkno < nblocks; ++blkno)
+ {
+ bool check_frozen = false;
+ bool check_visible = false;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum,
+ maxoff;
+
+ /* Make sure we are interruptible. */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Use the visibility map to decide whether to check this page. */
+ if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+ check_frozen = true;
+ if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+ check_visible = true;
+ if (!check_visible && !check_frozen)
+ continue;
+
+ /* Read and lock the page. */
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ bstrategy);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * The visibility map bits might have changed while we were acquiring
+ * the page lock. Recheck to avoid returning spurious results.
+ */
+ if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+ check_frozen = false;
+ if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+ check_visible = false;
+ if (!check_visible && !check_frozen)
+ {
+ UnlockReleaseBuffer(buffer);
+ continue;
+ }
+
+ /* Iterate over each tuple on the page. */
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ HeapTupleData tuple;
+ ItemId itemid;
+
+ itemid = PageGetItemId(page, offnum);
+
+ /* Unused or redirect line pointers are of no interest. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid))
+ continue;
+
+ /* Dead line pointers are neither all-visible nor frozen. */
+ if (ItemIdIsDead(itemid))
+ {
+ ItemPointerData tid;
+
+ ItemPointerSet(&tid, blkno, offnum);
+ record_corrupt_item(items, &tid);
+ continue;
+ }
+
+ /* Initialize a HeapTupleData structure for checks below. */
+ tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+ tuple.t_len = ItemIdGetLength(itemid);
+ tuple.t_tableOid = relid;
+
+ /*
+ * If we're checking whether the page is all-visible, we expect
+ * the tuple to be all-visible.
+ */
+ if (check_visible &&
+ !tuple_all_visible(&tuple, OldestXmin, buffer))
+ {
+ TransactionId RecomputedOldestXmin;
+
+ /*
+ * Time has passed since we computed OldestXmin, so it's
+ * possible that this tuple is all-visible in reality even
+ * though it doesn't appear so based on our
+ * previously-computed value. Let's compute a new value so we
+ * can be certain whether there is a problem.
+ *
+ * From a concurrency point of view, it sort of sucks to
+ * retake ProcArrayLock here while we're holding the buffer
+ * exclusively locked, but it should be safe against
+ * deadlocks, because surely GetOldestXmin() should never take
+ * a buffer lock. And this shouldn't happen often, so it's
+ * worth being careful so as to avoid false positives.
+ */
+ RecomputedOldestXmin = GetOldestXmin(NULL, true);
+
+ if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
+ record_corrupt_item(items, &tuple.t_data->t_ctid);
+ else
+ {
+ OldestXmin = RecomputedOldestXmin;
+ if (!tuple_all_visible(&tuple, OldestXmin, buffer))
+ record_corrupt_item(items, &tuple.t_data->t_ctid);
+ }
+ }
+
+ /*
+ * If we're checking whether the page is all-frozen, we expect the
+ * tuple to be in a state where it will never need freezing.
+ */
+ if (check_frozen)
+ {
+ if (heap_tuple_needs_eventual_freeze(tuple.t_data))
+ record_corrupt_item(items, &tuple.t_data->t_ctid);
+ }
+ }
+
+ UnlockReleaseBuffer(buffer);
+ }
+
+ /* Clean up. */
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
+ relation_close(rel, AccessShareLock);
+
+ /*
+ * Before returning, repurpose the fields to match caller's expectations.
+ * next is now the next item that should be read (rather than written) and
+ * count is now the number of items we wrote (rather than the number we
+ * allocated).
+ */
+ items->count = items->next;
+ items->next = 0;
+
+ return items;
+}
+
+/*
+ * Remember one corrupt item.
+ */
+static void
+record_corrupt_item(corrupt_items *items, ItemPointer tid)
+{
+ /* enlarge output array if needed. */
+ if (items->next >= items->count)
+ {
+ items->count *= 2;
+ items->tids = repalloc(items->tids,
+ items->count * sizeof(ItemPointerData));
+ }
+ /* and add the new item */
+ items->tids[items->next++] = *tid;
+}
+
+/*
+ * Check whether a tuple is all-visible relative to a given OldestXmin value.
+ * The buffer should contain the tuple and should be locked and pinned.
+ */
+static bool
+tuple_all_visible(HeapTuple tup, TransactionId OldestXmin, Buffer buffer)
+{
+ HTSV_Result state;
+ TransactionId xmin;
+
+ state = HeapTupleSatisfiesVacuum(tup, OldestXmin, buffer);
+ if (state != HEAPTUPLE_LIVE)
+ return false; /* all-visible implies live */
+
+ /*
+ * Neither lazy_scan_heap nor heap_page_is_all_visible will mark a page
+ * all-visible unless every tuple is hinted committed. However, those hint
+ * bits could be lost after a crash, so we can't be certain that they'll
+ * be set here. So just check the xmin.
+ */
+
+ xmin = HeapTupleHeaderGetXmin(tup->t_data);
+ if (!TransactionIdPrecedes(xmin, OldestXmin))
+ return false; /* xmin not old enough for all to see */
+
+ return true;
+}
diff --git a/contrib/pg_visibility/pg_visibility.control b/contrib/pg_visibility/pg_visibility.control
index 1d71853..f93ed01 100644
--- a/contrib/pg_visibility/pg_visibility.control
+++ b/contrib/pg_visibility/pg_visibility.control
@@ -1,5 +1,5 @@
# pg_visibility extension
comment = 'examine the visibility map (VM) and page-level visibility info'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/pg_visibility'
relocatable = true
diff --git a/doc/src/sgml/pgvisibility.sgml b/doc/src/sgml/pgvisibility.sgml
index 48b003d..4cdca7d 100644
--- a/doc/src/sgml/pgvisibility.sgml
+++ b/doc/src/sgml/pgvisibility.sgml
@@ -32,7 +32,8 @@
Functions which display information about <literal>PD_ALL_VISIBLE</>
are much more costly than those which only consult the visibility map,
because they must read the relation's data blocks rather than only the
- (much smaller) visibility map.
+ (much smaller) visibility map. Functions that check the relation's
+ data blocks are similarly expensive.
</para>
<sect2>
@@ -92,6 +93,31 @@
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><function>pg_check_frozen(regclass, t_ctid OUT tid) returns setof tid</function></term>
+
+ <listitem>
+ <para>
+ Returns the TIDs of non-frozen tuples present in pages marked all-frozen
+ in the visibility map. If this function returns a non-empty set of
+ TIDs, the database is corrupt.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>pg_check_visible(regclass, t_ctid OUT tid) returns setof tid</function></term>
+
+ <listitem>
+ <para>
+ Returns the TIDs of tuples which are not all-visible despite the fact
+ that the pages which contain them are marked as all-visible in the
+ visibility map. If this function returns a non-empty set of TIDs, the
+ database is corrupt.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
<para>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9aa29f6..0c61fc2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2372,6 +2372,7 @@ convert_testexpr_context
core_YYSTYPE
core_yy_extra_type
core_yyscan_t
+corrupt_items
cost_qual_eval_context
count_agg_clauses_context
create_upper_paths_hook_type
--
2.5.4 (Apple Git-61)
On Wed, Jun 15, 2016 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I noticed that the tuples that it reported were always offset 1 in a
page, and that the page always had a maxoff over a couple of hundred,
and that we called record_corrupt_item because VM_ALL_VISIBLE returned
true but HeapTupleSatisfiesVacuum on the first tuple returned
HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE.
It did that because HEAP_XMAX_COMMITTED was not set and
TransactionIdIsInProgress returned true for xmax.

So this seems like it might be a visibility map bug rather than a bug
in the test code, but I'm not completely sure of that. How was it
legitimate to mark the page as all-visible if a tuple on the page
still had a live xmax? If xmax is live and not just a locker then the
tuple is not visible to the transaction that wrote xmax, at least.

Ah, wait a minute. I see how this could happen. Hang on, let me
update the pg_visibility patch.

The problem should be fixed in the attached revision of
pg_check_visible. I think what happened is:

1. pg_check_visible computed an OldestXmin.
2. Some transaction committed.
3. VACUUM computed a newer OldestXmin and marked a page all-visible with it.
4. pg_check_visible then used its older OldestXmin to check the
visibility of tuples on that page, and saw delete-in-progress as a
result.

I added a guard against a similar scenario involving xmin in the last
version of this patch, but forgot that we need to protect xmax in the
same way. With this version of the patch, I can no longer get any
TIDs to pop up out of pg_check_visible in my testing. (I haven't run
your test script for lack of the proper Python environment...)
I can still reproduce the problem with this new patch. What I see is
that the OldestXmin, the new RecomputedOldestXmin and the tuple's xmax
are all the same.
--
Thomas Munro
http://www.enterprisedb.com
On Wed, Jun 15, 2016 at 11:43 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Wed, Jun 15, 2016 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I noticed that the tuples that it reported were always offset 1 in a
page, and that the page always had a maxoff over a couple of hundred,
and that we called record_corrupt_item because VM_ALL_VISIBLE returned
true but HeapTupleSatisfiesVacuum on the first tuple returned
HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE.
It did that because HEAP_XMAX_COMMITTED was not set and
TransactionIdIsInProgress returned true for xmax.

So this seems like it might be a visibility map bug rather than a bug
in the test code, but I'm not completely sure of that. How was it
legitimate to mark the page as all-visible if a tuple on the page
still had a live xmax? If xmax is live and not just a locker then the
tuple is not visible to the transaction that wrote xmax, at least.

Ah, wait a minute. I see how this could happen. Hang on, let me
update the pg_visibility patch.

The problem should be fixed in the attached revision of
pg_check_visible. I think what happened is:

1. pg_check_visible computed an OldestXmin.
2. Some transaction committed.
3. VACUUM computed a newer OldestXmin and marked a page all-visible with it.
4. pg_check_visible then used its older OldestXmin to check the
visibility of tuples on that page, and saw delete-in-progress as a
result.
I added a guard against a similar scenario involving xmin in the last
version of this patch, but forgot that we need to protect xmax in the
same way. With this version of the patch, I can no longer get any
TIDs to pop up out of pg_check_visible in my testing. (I haven't run
your test script for lack of the proper Python environment...)
I can still reproduce the problem with this new patch. What I see is
that the OldestXmin, the new RecomputedOldestXmin and the tuple's xmax
are all the same.
I spent some time chasing down the exact circumstances. I suspect
that there may be an interlocking problem in heap_update. Using the
line numbers from cae1c788 [1], I see the following interaction
between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
in reference to the same block number:
[VACUUM] sets all visible bit
[UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
[UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
[SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
[SELECT] observes VM_ALL_VISIBLE as true
[SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
[SELECT] barfs
[UPDATE] heapam.c:4116 visibilitymap_clear(...)
[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/heapam.c;hb=cae1c788b9b43887e4a4fa51a11c3a8ffa334939
--
Thomas Munro
http://www.enterprisedb.com
On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I spent some time chasing down the exact circumstances. I suspect
that there may be an interlocking problem in heap_update. Using the
line numbers from cae1c788 [1], I see the following interaction
between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
in reference to the same block number:
[VACUUM] sets all visible bit
[UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
[UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
[SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
[SELECT] observes VM_ALL_VISIBLE as true
[SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
[SELECT] barfs
[UPDATE] heapam.c:4116 visibilitymap_clear(...)
Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
and CTID without logging anything or clearing the all-visible flag and
then releases the lock on the heap page to go do some more work that
might even ERROR out. Only if that other work all goes OK do we
relock the page and perform the WAL-logged actions.
That doesn't seem like a good idea even in existing releases, because
you've taken a tuple on an all-visible page and made it not
all-visible, and you've made a page modification that is not
necessarily atomic without logging it. This is particularly bad in
9.6, because if that page is also all-frozen then XMAX will eventually
be pointing into space and VACUUM will never visit the page to
re-freeze it the way it would have done in earlier releases. However,
even in older releases, I think there's a remote possibility of data
corruption. Backend #1 makes these changes to the page, releases the
lock, and errors out. Backend #2 writes the page to the OS. DBA
takes a hot backup, tearing the page in the middle of XMAX. Oops.
I'm not sure what to do about this: this part of the heap_update()
logic has been like this forever, and I assume that if it were easy to
refactor this away, somebody would have done it by now.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I spent some time chasing down the exact circumstances. I suspect
that there may be an interlocking problem in heap_update. Using the
line numbers from cae1c788 [1], I see the following interaction
between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
in reference to the same block number:
[VACUUM] sets all visible bit
[UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data,
xmax_old_tuple);
[UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
[SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
[SELECT] observes VM_ALL_VISIBLE as true
[SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
[SELECT] barfs
[UPDATE] heapam.c:4116 visibilitymap_clear(...)
Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
and CTID without logging anything or clearing the all-visible flag and
then releases the lock on the heap page to go do some more work that
might even ERROR out.
Can't we clear the all-visible flag before releasing the lock? We can use
logic of already_marked as it is currently used in code to clear it just
once.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 15, 2016 at 9:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I spent some time chasing down the exact circumstances. I suspect
that there may be an interlocking problem in heap_update. Using the
line numbers from cae1c788 [1], I see the following interaction
between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
in reference to the same block number:
[VACUUM] sets all visible bit
[UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data,
xmax_old_tuple);
[UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
[SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
[SELECT] observes VM_ALL_VISIBLE as true
[SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
[SELECT] barfs
[UPDATE] heapam.c:4116 visibilitymap_clear(...)
Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
and CTID without logging anything or clearing the all-visible flag and
then releases the lock on the heap page to go do some more work that
might even ERROR out.
Can't we clear the all-visible flag before releasing the lock? We can use
logic of already_marked as it is currently used in code to clear it just
once.
That just kicks the can down the road. Then you have PD_ALL_VISIBLE
clear but the VM bit is still set. And you still haven't WAL-logged
anything.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 15, 2016 at 7:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 15, 2016 at 9:43 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas <robertmhaas@gmail.com>
wrote:
On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I spent some time chasing down the exact circumstances. I suspect
that there may be an interlocking problem in heap_update. Using the
line numbers from cae1c788 [1], I see the following interaction
between the VACUUM, UPDATE and SELECT (pg_check_visible) backends,
all
in reference to the same block number:
[VACUUM] sets all visible bit
[UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data,
xmax_old_tuple);
[UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
[SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
[SELECT] observes VM_ALL_VISIBLE as true
[SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
[SELECT] barfs
[UPDATE] heapam.c:4116 visibilitymap_clear(...)
Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
and CTID without logging anything or clearing the all-visible flag and
then releases the lock on the heap page to go do some more work that
might even ERROR out.
Can't we clear the all-visible flag before releasing the lock? We can
use
logic of already_marked as it is currently used in code to clear it just
once.
That just kicks the can down the road. Then you have PD_ALL_VISIBLE
clear but the VM bit is still set.
I mean to say clear both as we are doing currently in code:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
    all_visible_cleared = true;
    PageClearAllVisible(BufferGetPage(buffer));
    visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
                        vmbuffer);
}
And you still haven't WAL-logged
anything.
Yeah, I think the WAL requirement is more difficult to meet; releasing
the lock on the buffer before writing WAL could lead to such a buffer
being flushed before the WAL.
I feel this is an existing bug and should go to the Older Bugs section
of the open items page.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 15, 2016 at 9:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I spent some time chasing down the exact circumstances. I suspect
that there may be an interlocking problem in heap_update. Using the
line numbers from cae1c788 [1], I see the following interaction
between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
in reference to the same block number:
[VACUUM] sets all visible bit
[UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
[UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
[SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
[SELECT] observes VM_ALL_VISIBLE as true
[SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
[SELECT] barfs
[UPDATE] heapam.c:4116 visibilitymap_clear(...)
Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
and CTID without logging anything or clearing the all-visible flag and
then releases the lock on the heap page to go do some more work that
might even ERROR out. Only if that other work all goes OK do we
relock the page and perform the WAL-logged actions.
That doesn't seem like a good idea even in existing releases, because
you've taken a tuple on an all-visible page and made it not
all-visible, and you've made a page modification that is not
necessarily atomic without logging it. This is particularly bad in
9.6, because if that page is also all-frozen then XMAX will eventually
be pointing into space and VACUUM will never visit the page to
re-freeze it the way it would have done in earlier releases. However,
even in older releases, I think there's a remote possibility of data
corruption. Backend #1 makes these changes to the page, releases the
lock, and errors out. Backend #2 writes the page to the OS. DBA
takes a hot backup, tearing the page in the middle of XMAX. Oops.
I'm not sure what to do about this: this part of the heap_update()
logic has been like this forever, and I assume that if it were easy to
refactor this away, somebody would have done it by now.
How about changing collect_corrupt_items to acquire
AccessExclusiveLock so that it can check safely?
Regards,
--
Masahiko Sawada
On Wed, Jun 15, 2016 at 10:03 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I'm not sure what to do about this: this part of the heap_update()
logic has been like this forever, and I assume that if it were easy to
refactor this away, somebody would have done it by now.
How about changing collect_corrupt_items to acquire
AccessExclusiveLock for safely checking?
Well, that would make it a lot less likely for
pg_check_{visible,frozen} to detect the bug in heap_update(), but it
wouldn't fix the bug in heap_update().
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 15, 2016 at 9:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
That just kicks the can down the road. Then you have PD_ALL_VISIBLE
clear but the VM bit is still set.
I mean to say clear both as we are doing currently in code:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
    all_visible_cleared = true;
    PageClearAllVisible(BufferGetPage(buffer));
    visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
                        vmbuffer);
}
Sure, but without emitting a WAL record, that's just broken. You
could have the heap page get flushed to disk and the VM page not get
flushed to disk, and then crash, and now you have the classic VM
corruption scenario.
And you still haven't WAL-logged
anything.
Yeah, I think WAL requirement is more difficult to meet and I think
releasing the lock on buffer before writing WAL could lead to flush of such
a buffer before WAL.
I feel this is an existing-bug and should go to Older Bugs Section in open
items page.
It does seem to be an existing bug, but the freeze map makes the
problem more serious, I think.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 15, 2016 at 08:56:52AM -0400, Robert Haas wrote:
On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I spent some time chasing down the exact circumstances. I suspect
that there may be an interlocking problem in heap_update. Using the
line numbers from cae1c788 [1], I see the following interaction
between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
in reference to the same block number:
[VACUUM] sets all visible bit
[UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
[UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
[SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
[SELECT] observes VM_ALL_VISIBLE as true
[SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
[SELECT] barfs
[UPDATE] heapam.c:4116 visibilitymap_clear(...)
Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
and CTID without logging anything or clearing the all-visible flag and
then releases the lock on the heap page to go do some more work that
might even ERROR out. Only if that other work all goes OK do we
relock the page and perform the WAL-logged actions.
That doesn't seem like a good idea even in existing releases, because
you've taken a tuple on an all-visible page and made it not
all-visible, and you've made a page modification that is not
necessarily atomic without logging it. This is particularly bad in
9.6, because if that page is also all-frozen then XMAX will eventually
be pointing into space and VACUUM will never visit the page to
re-freeze it the way it would have done in earlier releases. However,
even in older releases, I think there's a remote possibility of data
corruption. Backend #1 makes these changes to the page, releases the
lock, and errors out. Backend #2 writes the page to the OS. DBA
takes a hot backup, tearing the page in the middle of XMAX. Oops.
I agree the non-atomic, unlogged change is a problem. A related threat
doesn't require a torn page:
AssignTransactionId() xid=123
heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123);
some ERROR before heap_update() finishes
rollback; -- xid=123
some backend flushes the modified page
immediate shutdown
AssignTransactionId() xid=123
commit; -- xid=123
If nothing wrote an xlog record that witnesses xid 123, the cluster can
reassign it after recovery. The failed update is now considered a successful
update, and the row improperly becomes dead. That's important.
I don't know whether the 9.6 all-frozen mechanism materially amplifies the
consequences of this bug. The interaction with visibility map and freeze map
is not all bad; indeed, it can reduce the risk of experiencing consequences
from the non-atomic, unlogged change bug. If the row is all-visible when
heap_update() starts, every transaction should continue to consider the row
visible until heap_update() finishes successfully. If an ERROR interrupts
heap_update(), visibility verdicts should be as though the heap_update() never
happened. If one of the previously-described mechanisms would make an xmax
visibility test give the wrong answer, an all-visible bit could mask the
problem for a while. Having said that, the freeze map hurts in scenarios involving
toast_insert_or_update() failures and no crash recovery. Instead of VACUUM
cleaning up the aborted xmax, that xmax could persist long enough for its xid
to be reused in a successful transaction. When some other modification
finally clears all-frozen and all-visible, the row improperly becomes dead.
Both scenarios are fairly rare; I don't know which is more rare. [Disclaimer:
I have not built test cases to verify those alleged failure mechanisms.]
If we made this pre-9.6 bug a 9.6 open item, would anyone volunteer to own it?
Then we wouldn't need to guess whether 9.6 will be safer with the freeze map
or safer without the freeze map.
Thanks,
nm
On 2016-06-15 08:56:52 -0400, Robert Haas wrote:
Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
and CTID without logging anything or clearing the all-visible flag and
then releases the lock on the heap page to go do some more work that
might even ERROR out. Only if that other work all goes OK do we
relock the page and perform the WAL-logged actions.
That doesn't seem like a good idea even in existing releases, because
you've taken a tuple on an all-visible page and made it not
all-visible, and you've made a page modification that is not
necessarily atomic without logging it.
Right, that's broken.
I'm not sure what to do about this: this part of the heap_update()
logic has been like this forever, and I assume that if it were easy to
refactor this away, somebody would have done it by now.
Well, I think generally nobody seriously looked at actually refactoring
heap_update(), even though that'd be a good idea. But in this instance,
the problem seems relatively fundamental:
We need to lock the origin page, to do visibility checks, etc. Then we
need to figure out the target page. Even disregarding toasting - which
we could be doing earlier with some refactoring - we're going to have to
release the page level lock, to lock them in ascending order. Otherwise
we'll risk kinda likely deadlocks. We also certainly don't want to nest
the lwlocks for the toast stuff, inside a content lock for the old
tuple's page.
So far the best idea I have - and it's really not a good one - is to
invent a new hint-bit that tells concurrent updates to acquire a
heavyweight tuple lock, while releasing the page-level lock. If that
hint bit does not require any other modifications - and we don't need an
xid in xmax for this use case - that'll avoid doing all the other
`already_marked` stuff early, which should address the correctness
issue. It's kinda invasive though, and probably has performance
implications.
Does anybody have a better idea?
Regards,
Andres
On Mon, Jun 20, 2016 at 3:33 PM, Andres Freund <andres@anarazel.de> wrote:
I'm not sure what to do about this: this part of the heap_update()
logic has been like this forever, and I assume that if it were easy to
refactor this away, somebody would have done it by now.
Well, I think generally nobody seriously looked at actually refactoring
heap_update(), even though that'd be a good idea. But in this instance,
the problem seems relatively fundamental:
We need to lock the origin page, to do visibility checks, etc. Then we
need to figure out the target page. Even disregarding toasting - which
we could be doing earlier with some refactoring - we're going to have to
release the page level lock, to lock them in ascending order. Otherwise
we'll risk kinda likely deadlocks. We also certainly don't want to nest
the lwlocks for the toast stuff, inside a content lock for the old
tuple's page.
So far the best idea I have - and it's really not a good one - is to
invent a new hint-bit that tells concurrent updates to acquire a
heavyweight tuple lock, while releasing the page-level lock. If that
hint bit does not require any other modifications - and we don't need an
xid in xmax for this use case - that'll avoid doing all the other
`already_marked` stuff early, which should address the correctness
issue. It's kinda invasive though, and probably has performance
implications.
Does anybody have a better idea?
What exactly is the point of all of that already_marked stuff? I
mean, suppose we just don't do any of that before we go off to do
toast_insert_or_update and RelationGetBufferForTuple. Eventually,
when we reacquire the page lock, we might find that somebody else has
already updated the tuple, but couldn't that be handled by
(approximately) looping back up to l2 just as we do in several other
cases?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
What exactly is the point of all of that already_marked stuff?
Preventing the old tuple from being locked/updated by another backend,
while unlocking the buffer.
I
mean, suppose we just don't do any of that before we go off to do
toast_insert_or_update and RelationGetBufferForTuple. Eventually,
when we reacquire the page lock, we might find that somebody else has
already updated the tuple, but couldn't that be handled by
(approximately) looping back up to l2 just as we do in several other
cases?
We'd potentially have to undo a fair amount more work: the toasted data
would have to be deleted and such, just to retry. Which isn't going to
be super easy, because all of it will be happening with the current cid (we
can't just increase CommandCounterIncrement() for correctness reasons).
Andres
On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
What exactly is the point of all of that already_marked stuff?
Preventing the old tuple from being locked/updated by another backend,
while unlocking the buffer.
I
mean, suppose we just don't do any of that before we go off to do
toast_insert_or_update and RelationGetBufferForTuple. Eventually,
when we reacquire the page lock, we might find that somebody else has
already updated the tuple, but couldn't that be handled by
(approximately) looping back up to l2 just as we do in several other
cases?
We'd potentially have to undo a fair amount more work: the toasted data
would have to be deleted and such, just to retry. Which isn't going to
be super easy, because all of it will be happening with the current cid (we
can't just increase CommandCounterIncrement() for correctness reasons).
Why would we have to delete the TOAST data? AFAIUI, the tuple points
to the TOAST data, but not the other way around. So if we change our
mind about where to put the tuple, I don't think that requires
re-TOASTing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
What exactly is the point of all of that already_marked stuff?
Preventing the old tuple from being locked/updated by another backend,
while unlocking the buffer.
I
mean, suppose we just don't do any of that before we go off to do
toast_insert_or_update and RelationGetBufferForTuple. Eventually,
when we reacquire the page lock, we might find that somebody else has
already updated the tuple, but couldn't that be handled by
(approximately) looping back up to l2 just as we do in several other
cases?
We'd potentially have to undo a fair amount more work: the toasted data
would have to be deleted and such, just to retry. Which isn't going to
be super easy, because all of it will be happening with the current cid (we
can't just increase CommandCounterIncrement() for correctness reasons).
Why would we have to delete the TOAST data? AFAIUI, the tuple points
to the TOAST data, but not the other way around. So if we change our
mind about where to put the tuple, I don't think that requires
re-TOASTing.
Consider what happens if we, after restarting at l2, notice that we
can't actually insert, but return in the !HeapTupleMayBeUpdated
branch. If the caller doesn't error out - and there's certainly callers
doing that - we'd "leak" a toasted datum. Unless the transaction aborts,
the toasted datum would never be cleaned up, because there's no datum
pointing to it, so no heap_delete will ever recurse into the toast
datum (via toast_delete()).
Andres
On Fri, Jun 17, 2016 at 3:36 PM, Noah Misch <noah@leadboat.com> wrote:
I agree the non-atomic, unlogged change is a problem. A related threat
doesn't require a torn page:
AssignTransactionId() xid=123
heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123);
some ERROR before heap_update() finishes
rollback; -- xid=123
some backend flushes the modified page
immediate shutdown
AssignTransactionId() xid=123
commit; -- xid=123
If nothing wrote an xlog record that witnesses xid 123, the cluster can
reassign it after recovery. The failed update is now considered a successful
update, and the row improperly becomes dead. That's important.
I wonder if that was originally supposed to be handled with the
HEAP_XMAX_UNLOGGED flag which was removed in 11919160. A comment in
the heap WAL logging commit f2bfe8a2 said that tqual routines would
see the HEAP_XMAX_UNLOGGED flag in the event of a crash before logging
(though I'm not sure if the tqual routines ever actually did that).
--
Thomas Munro
http://www.enterprisedb.com
On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-15 08:56:52 -0400, Robert Haas wrote:
Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
and CTID without logging anything or clearing the all-visible flag and
then releases the lock on the heap page to go do some more work that
might even ERROR out. Only if that other work all goes OK do we
relock the page and perform the WAL-logged actions.
That doesn't seem like a good idea even in existing releases, because
you've taken a tuple on an all-visible page and made it not
all-visible, and you've made a page modification that is not
necessarily atomic without logging it.
Right, that's broken.
I'm not sure what to do about this: this part of the heap_update()
logic has been like this forever, and I assume that if it were easy to
refactor this away, somebody would have done it by now.
Well, I think generally nobody seriously looked at actually refactoring
heap_update(), even though that'd be a good idea. But in this instance,
the problem seems relatively fundamental:
We need to lock the origin page, to do visibility checks, etc. Then we
need to figure out the target page. Even disregarding toasting - which
we could be doing earlier with some refactoring - we're going to have to
release the page level lock, to lock them in ascending order. Otherwise
we'll risk kinda likely deadlocks.
Can we consider using some strategy to avoid deadlocks without releasing
the lock on the old page? Consider if we could have a mechanism such that
RelationGetBufferForTuple() will ensure that it always returns a new buffer
whose targetblock is greater than the old block (on which we already hold
a lock). I think the tricky part here is whether we can get anything like
that from the FSM. Also, there could be cases where we need to extend the
heap even when there were pages in the heap with space available, because
we ignored them as their block numbers are smaller than the block number
on which we have a lock.
We also certainly don't want to nest
the lwlocks for the toast stuff, inside a content lock for the old
tuple's page.
So far the best idea I have - and it's really not a good one - is to
invent a new hint-bit that tells concurrent updates to acquire a
heavyweight tuple lock, while releasing the page-level lock. If that
hint bit does not require any other modifications - and we don't need an
xid in xmax for this use case - that'll avoid doing all the other
`already_marked` stuff early, which should address the correctness
issue.
Don't we need to clear such a flag in case of error? Also, don't we need to
reset it later, such as when modifying the old page before WAL logging?
It's kinda invasive though, and probably has performance
implications.
Do you see a performance implication due to the requirement of a heavyweight
tuple lock in more cases than now, or something else?
Some other ways could be:
Before releasing the lock on buffer containing old tuple, clear the VM and
visibility info from page and WAL log it. I think this could impact
performance depending on how frequently we need to perform this action.
Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic
was introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a) and set
it in the old tuple header before releasing the lock on the buffer, and
teach tqual.c to honor the flag. I think tqual.c should consider
HEAP_XMAX_UNLOGGED as an indication of an aborted transaction unless it is
currently in progress. Also, I think we need to clear this flag before WAL
logging in heap_update.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund <andres@anarazel.de> wrote:
Well, I think generally nobody seriously looked at actually refactoring
heap_update(), even though that'd be a good idea. But in this instance,
the problem seems relatively fundamental:
We need to lock the origin page, to do visibility checks, etc. Then we
need to figure out the target page. Even disregarding toasting - which
we could be doing earlier with some refactoring - we're going to have to
release the page level lock, to lock them in ascending order. Otherwise
we'll risk kinda likely deadlocks.
Can we consider to use some strategy to avoid deadlocks without releasing
the lock on old page? Consider if we could have a mechanism such that
RelationGetBufferForTuple() will ensure that it always returns a new buffer
which has targetblock greater than the old block (on which we already held a
lock). I think here tricky part is whether we can get anything like that
from FSM. Also, there could be cases where we need to extend the heap when
there were pages in heap with space available, but we have ignored them
because there block number is smaller than the block number on which we have
lock.
Doesn't that mean that over time, given a workload that does only or
mostly updates, your records tend to migrate further and further away
from the start of the file, leaving a growing unusable space at the
beginning, until you eventually need to CLUSTER/VACUUM FULL?
I was wondering about speculatively asking for a free page with a
lower block number than the origin page, if one is available, before
locking the origin page. Then after locking the origin page, if it
turns out you need a page but didn't get it earlier, asking for a free
page with a higher block number than the origin page.
--
Thomas Munro
http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Some other ways could be:
Before releasing the lock on buffer containing old tuple, clear the VM and
visibility info from page and WAL log it. I think this could impact
performance depending on how frequently we need to perform this action.
Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was
introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a) and set the
same in old tuple header before releasing lock on buffer and teach tqual.c
to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as
an indication of aborted transaction unless it is currently in-progress.
Also, I think we need to clear this flag before WAL logging in heap_update.
I also noticed that and wondered whether it was a mistake to take that
out. It appears to have been removed as part of the logic to clear
away UNDO log support in 11919160, but it may have been an important
part of the heap_update protocol. Though (as I mentioned nearby in a
reply to Noah) I'm not sure if the tqual.c side, which would ignore
the unlogged xmax in the event of a badly timed crash, was ever
implemented.
--
Thomas Munro
http://www.enterprisedb.com
On 2016-06-21 08:59:13 +0530, Amit Kapila wrote:
Can we consider using some strategy to avoid deadlocks without releasing
the lock on the old page? Consider if we could have a mechanism such that
RelationGetBufferForTuple() will ensure that it always returns a new buffer
which has targetblock greater than the old block (on which we already held
a lock). I think the tricky part here is whether we can get anything like that
from the FSM. Also, there could be cases where we need to extend the heap when
there were pages in the heap with space available, but we have ignored them
because their block number is smaller than the block number on which we
have a lock.
I can't see that being acceptable, from a space-usage POV.
So far the best idea I have - and it's really not a good one - is to
invent a new hint-bit that tells concurrent updates to acquire a
heavyweight tuple lock, while releasing the page-level lock. If that
hint bit does not require any other modifications - and we don't need an
xid in xmax for this use case - that'll avoid doing all the other
`already_marked` stuff early, which should address the correctness
issue.
Don't we need to clear such a flag in case of error? Also don't we need to
reset it later, like when modifying the old page later before WAL?
If the flag just says "acquire a heavyweight lock", then there's no need
for that. That's cheap enough to just do if it's erroneously set. At
least I can't see any reason.
It's kinda invasive though, and probably has performance
implications.
Do you see a performance implication due to requiring a heavyweight tuple
lock in more cases than now, or something else?
Because of that, yes.
Some other ways could be:
Before releasing the lock on buffer containing old tuple, clear the VM and
visibility info from page and WAL log it. I think this could impact
performance depending on how frequently we need to perform this action.
Doubling the number of xlog inserts in heap_update would certainly be
measurable :(. My guess is that the heavyweight tuple lock approach will
be less expensive.
Andres
On Tue, Jun 21, 2016 at 9:08 AM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund <andres@anarazel.de>
wrote:
Well, I think generally nobody seriously looked at actually refactoring
heap_update(), even though that'd be a good idea. But in this instance,
the problem seems relatively fundamental:
We need to lock the origin page, to do visibility checks, etc. Then we
need to figure out the target page. Even disregarding toasting - which
we could be doing earlier with some refactoring - we're going to have to
release the page level lock, to lock them in ascending order. Otherwise
we'll risk kinda likely deadlocks.
Can we consider using some strategy to avoid deadlocks without releasing
the lock on the old page? Consider if we could have a mechanism such that
RelationGetBufferForTuple() will ensure that it always returns a new buffer
which has targetblock greater than the old block (on which we already held a
lock). I think the tricky part here is whether we can get anything like that
from the FSM. Also, there could be cases where we need to extend the heap when
there were pages in the heap with space available, but we have ignored them
because their block number is smaller than the block number on which we have a
lock.
Doesn't that mean that over time, given a workload that does only or
mostly updates, your records tend to migrate further and further away
from the start of the file, leaving a growing unusable space at the
beginning, until you eventually need to CLUSTER/VACUUM FULL?
Updates should ideally fit in the same page as the old tuple in
many of the cases if fillfactor is properly configured, considering
update-mostly loads. Why would the records always migrate further
away? They should get the space freed by other updates in
intermediate pages. I think there could be some impact space-wise, but
freed-up space will eventually be used.
I was wondering about speculatively asking for a free page with a
lower block number than the origin page, if one is available, before
locking the origin page.
Do you want to lock it as well? In any case, I think adding the code
without deciding whether the update can be accommodated in the current page
could prove to be costly.
Then after locking the origin page, if it
turns out you need a page but didn't get it earlier, asking for a free
page with a higher block number than the origin page.
Something like that might work out if it is feasible and people agree on
pursuing such an approach.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 21, 2016 at 9:16 AM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
Some other ways could be:
Before releasing the lock on buffer containing old tuple, clear the VM and
visibility info from page and WAL log it. I think this could impact
performance depending on how frequently we need to perform this action.
Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was
introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a) and set the
same in old tuple header before releasing lock on buffer and teach tqual.c
to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as
an indication of aborted transaction unless it is currently in-progress.
Also, I think we need to clear this flag before WAL logging in heap_update.
I also noticed that and wondered whether it was a mistake to take that
out. It appears to have been removed as part of the logic to clear
away UNDO log support in 11919160, but it may have been an important
part of the heap_update protocol. Though (as I mentioned nearby in a
reply to Noah) I'm not sure if the tqual.c side, which would ignore
the unlogged xmax in the event of a badly timed crash, was ever
implemented.
Right, my observation is similar to yours, and that's what I am suggesting
as one alternative to fix this issue. I think making this approach work
(even if it doesn't have any problems) might turn out to be tricky.
However, the plus point of this approach seems to be that it shouldn't
impact performance in most cases.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 21, 2016 at 9:21 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-21 08:59:13 +0530, Amit Kapila wrote:
Can we consider using some strategy to avoid deadlocks without releasing
the lock on the old page? Consider if we could have a mechanism such that
RelationGetBufferForTuple() will ensure that it always returns a new buffer
which has targetblock greater than the old block (on which we already held
a lock). I think the tricky part here is whether we can get anything like that
from the FSM. Also, there could be cases where we need to extend the heap when
there were pages in the heap with space available, but we have ignored them
because their block number is smaller than the block number on which we
have a lock.
I can't see that being acceptable, from a space-usage POV.
So far the best idea I have - and it's really not a good one - is to
invent a new hint-bit that tells concurrent updates to acquire a
heavyweight tuple lock, while releasing the page-level lock. If that
hint bit does not require any other modifications - and we don't need an
xid in xmax for this use case - that'll avoid doing all the other
`already_marked` stuff early, which should address the correctness
issue.
Don't we need to clear such a flag in case of error? Also don't we need to
reset it later, like when modifying the old page later before WAL?
If the flag just says "acquire a heavyweight lock", then there's no need
for that. That's cheap enough to just do if it's erroneously set. At
least I can't see any reason.
I think it will just increase the chances of other backends acquiring a
heavyweight lock.
It's kinda invasive though, and probably has performance
implications.
Do you see a performance implication due to requiring a heavyweight tuple
lock in more cases than now, or something else?
Because of that, yes.
Some other ways could be:
Before releasing the lock on buffer containing old tuple, clear the VM and
visibility info from page and WAL log it. I think this could impact
performance depending on how frequently we need to perform this action.
Doubling the number of xlog inserts in heap_update would certainly be
measurable :(. My guess is that the heavyweight tuple lock approach will
be less expensive.
Probably, but I think the heavyweight tuple lock is more invasive. I think
increasing the number of xlog inserts could surely impact performance,
depending upon how frequently we need to do it. I think we might want to
combine it with the idea of having RelationGetBufferForTuple() return a
higher block number, such that if we don't find a higher block number from
the FSM, then we can release the lock on the old page and try to get the locks on
the old and new buffers as we do now. This will further reduce the chances of
increasing xlog insert calls and address the issue of space-wastage.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
I
mean, suppose we just don't do any of that before we go off to do
toast_insert_or_update and RelationGetBufferForTuple. Eventually,
when we reacquire the page lock, we might find that somebody else has
already updated the tuple, but couldn't that be handled by
(approximately) looping back up to l2 just as we do in several other
cases?
We'd potentially have to undo a fair amount more work: the toasted data
would have to be deleted and such, just to retry. Which isn't going to be
super easy, because all of it will be happening with the current cid (we
can't just increase CommandCounterIncrement() for correctness reasons).
Why would we have to delete the TOAST data? AFAIUI, the tuple points
to the TOAST data, but not the other way around. So if we change our
mind about where to put the tuple, I don't think that requires
re-TOASTing.
Consider what happens if we, after restarting at l2, notice that we
can't actually insert, but return in the !HeapTupleMayBeUpdated
branch. If the caller doesn't error out - and there's certainly callers
doing that - we'd "leak" a toasted datum. Unless the transaction aborts,
the toasted datum would never be cleaned up, because there's no datum
pointing to it, so no heap_delete will ever recurse into the toast
datum (via toast_delete()).
OK, I see what you mean. Still, that doesn't seem like such a
terrible cost. If you try to update a tuple and if it looks like you
can update it but then after TOASTing you find that the status of the
tuple has changed such that you can't update it after all, then you
might need to go set xmax = MyTxid() on all of the TOAST tuples you
created (whose CTIDs we could save someplace, so that it's just a
matter of finding them by CTID to kill them). That's not likely to
happen particularly often, though, and when it does happen it's not
insanely expensive. We could also reduce the cost by letting the
caller of heap_update() decide whether to back out the work; if the
caller intends to throw an error anyway, then there's no point.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund <andres@anarazel.de> wrote:
Consider what happens if we, after restarting at l2, notice that we
can't actually insert, but return in the !HeapTupleMayBeUpdated
branch.
OK, I see what you mean. Still, that doesn't seem like such a
terrible cost. If you try to update a tuple and if it looks like you
can update it but then after TOASTing you find that the status of the
tuple has changed such that you can't update it after all, then you
might need to go set xmax = MyTxid() on all of the TOAST tuples you
created (whose CTIDs we could save someplace, so that it's just a
matter of finding them by CTID to kill them).
... and if you get an error or crash partway through that, what happens?
regards, tom lane
On Mon, Jun 20, 2016 at 11:51 PM, Andres Freund <andres@anarazel.de> wrote:
So far the best idea I have - and it's really not a good one - is to
invent a new hint-bit that tells concurrent updates to acquire a
heavyweight tuple lock, while releasing the page-level lock. If that
hint bit does not require any other modifications - and we don't need an
xid in xmax for this use case - that'll avoid doing all the other
`already_marked` stuff early, which should address the correctness
issue.
Don't we need to clear such a flag in case of error? Also don't we need to
reset it later, like when modifying the old page later before WAL?
If the flag just says "acquire a heavyweight lock", then there's no need
for that. That's cheap enough to just do if it's erroneously set. At
least I can't see any reason.
I don't quite understand the intended semantics of this proposed flag.
If we don't already have the tuple lock at that point, we can't go
acquire it before releasing the content lock, can we?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 21, 2016 at 10:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund <andres@anarazel.de> wrote:
Consider what happens if we, after restarting at l2, notice that we
can't actually insert, but return in the !HeapTupleMayBeUpdated
branch.
OK, I see what you mean. Still, that doesn't seem like such a
terrible cost. If you try to update a tuple and if it looks like you
can update it but then after TOASTing you find that the status of the
tuple has changed such that you can't update it after all, then you
might need to go set xmax = MyTxid() on all of the TOAST tuples you
created (whose CTIDs we could save someplace, so that it's just a
matter of finding them by CTID to kill them).
... and if you get an error or crash partway through that, what happens?
Then the transaction is aborted anyway, and we haven't leaked anything
because VACUUM will clean it up.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-21 10:50:36 -0400, Robert Haas wrote:
On Mon, Jun 20, 2016 at 11:51 PM, Andres Freund <andres@anarazel.de> wrote:
So far the best idea I have - and it's really not a good one - is to
invent a new hint-bit that tells concurrent updates to acquire a
heavyweight tuple lock, while releasing the page-level lock. If that
hint bit does not require any other modifications - and we don't need an
xid in xmax for this use case - that'll avoid doing all the other
`already_marked` stuff early, which should address the correctness
issue.
Don't we need to clear such a flag in case of error? Also don't we need to
reset it later, like when modifying the old page later before WAL?
If the flag just says "acquire a heavyweight lock", then there's no need
for that. That's cheap enough to just do if it's erroneously set. At
least I can't see any reason.
I don't quite understand the intended semantics of this proposed flag.
Whenever the flag is set, we have to acquire the heavyweight tuple lock
before continuing. That guarantees nobody else can modify the tuple,
while the lock is released, without requiring us to modify more than one
hint bit. That should fix the torn page issue, no?
If we don't already have the tuple lock at that point, we can't go
acquire it before releasing the content lock, can we?
Why not? Afaics the way that tuple locks are used, the nesting should
be fine.
Andres
On Tue, Jun 21, 2016 at 12:54 PM, Andres Freund <andres@anarazel.de> wrote:
I don't quite understand the intended semantics of this proposed flag.
Whenever the flag is set, we have to acquire the heavyweight tuple lock
before continuing. That guarantees nobody else can modify the tuple,
while the lock is released, without requiring us to modify more than one
hint bit. That should fix the torn page issue, no?
Yeah, I guess that would work.
If we don't already have the tuple lock at that point, we can't go
acquire it before releasing the content lock, can we?
Why not? Afaics the way that tuple locks are used, the nesting should
be fine.
Well, the existing places where we acquire the tuple lock within
heap_update() are all careful to release the page lock first, so I'm
skeptical that doing it the other order is safe. Certainly, if we've
got some code that grabs the page lock and then the tuple lock and
other code that grabs the tuple lock and then the page lock, that's a
deadlock waiting to happen. I'm also a bit dubious that LockAcquire
is safe to call in general with interrupts held.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-21 13:03:24 -0400, Robert Haas wrote:
On Tue, Jun 21, 2016 at 12:54 PM, Andres Freund <andres@anarazel.de> wrote:
I don't quite understand the intended semantics of this proposed flag.
Whenever the flag is set, we have to acquire the heavyweight tuple lock
before continuing. That guarantees nobody else can modify the tuple,
while the lock is released, without requiring us to modify more than one
hint bit. That should fix the torn page issue, no?
Yeah, I guess that would work.
If we don't already have the tuple lock at that point, we can't go
acquire it before releasing the content lock, can we?
Why not? Afaics the way that tuple locks are used, the nesting should
be fine.
Well, the existing places where we acquire the tuple lock within
heap_update() are all careful to release the page lock first, so I'm
skeptical that doing it the other order is safe. Certainly, if we've
got some code that grabs the page lock and then the tuple lock and
other code that grabs the tuple lock and then the page lock, that's a
deadlock waiting to happen.
Just noticed this piece of code while looking into this:

    UnlockReleaseBuffer(buffer);
    if (have_tuple_lock)
        UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
    if (vmbuffer != InvalidBuffer)
        ReleaseBuffer(vmbuffer);
    return result;

seems weird to release the vmbuffer after the tuplelock...
I'm also a bit dubious that LockAcquire is safe to call in general
with interrupts held.
Looks like we could just acquire the tuple-lock *before* doing the
toast_insert_or_update/RelationGetBufferForTuple, but after releasing
the buffer lock. That'd allow us to avoid doing the nested locking,
and should make the recovery just a goto l2;, ...
Andres
On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund <andres@anarazel.de> wrote:
I'm also a bit dubious that LockAcquire is safe to call in general
with interrupts held.
Looks like we could just acquire the tuple-lock *before* doing the
toast_insert_or_update/RelationGetBufferForTuple, but after releasing
the buffer lock. That'd allow us to avoid doing the nested locking,
and should make the recovery just a goto l2;, ...
Why isn't that racey? Somebody else can grab the tuple lock after we
release the buffer content lock and before we acquire the tuple lock.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-21 15:38:25 -0400, Robert Haas wrote:
On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund <andres@anarazel.de> wrote:
I'm also a bit dubious that LockAcquire is safe to call in general
with interrupts held.
Looks like we could just acquire the tuple-lock *before* doing the
toast_insert_or_update/RelationGetBufferForTuple, but after releasing
the buffer lock. That'd allow us to avoid doing the nested locking,
and should make the recovery just a goto l2;, ...
Why isn't that racey? Somebody else can grab the tuple lock after we
release the buffer content lock and before we acquire the tuple lock.
Sure, but by the time the tuple lock is released, they'd have updated
xmax. So once we acquired that we can just do

    if (xmax_infomask_changed(oldtup.t_data->t_infomask, infomask) ||
        !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data),
                             xwait))
        goto l2;

which is fine, because we've not yet done the toasting.
I'm not sure whether this approach is better than deleting potentially
toasted data though. It's probably faster, but will likely touch more
places in the code, and eat up an infomask bit (infomask & HEAP_MOVED
== HEAP_MOVED in my prototype).
Andres
On Tue, Jun 21, 2016 at 3:46 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-21 15:38:25 -0400, Robert Haas wrote:
On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund <andres@anarazel.de> wrote:
I'm also a bit dubious that LockAcquire is safe to call in general
with interrupts held.
Looks like we could just acquire the tuple-lock *before* doing the
toast_insert_or_update/RelationGetBufferForTuple, but after releasing
the buffer lock. That'd allow us to avoid doing the nested locking,
and should make the recovery just a goto l2;, ...
Why isn't that racey? Somebody else can grab the tuple lock after we
release the buffer content lock and before we acquire the tuple lock.
Sure, but by the time the tuple lock is released, they'd have updated
xmax. So once we acquired that we can just do

    if (xmax_infomask_changed(oldtup.t_data->t_infomask, infomask) ||
        !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data),
                             xwait))
        goto l2;

which is fine, because we've not yet done the toasting.
I see.
I'm not sure whether this approach is better than deleting potentially
toasted data though. It's probably faster, but will likely touch more
places in the code, and eat up an infomask bit (infomask & HEAP_MOVED
== HEAP_MOVED in my prototype).
Ugh. That's not very desirable at all.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-21 16:32:03 -0400, Robert Haas wrote:
On Tue, Jun 21, 2016 at 3:46 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-21 15:38:25 -0400, Robert Haas wrote:
On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund <andres@anarazel.de> wrote:
I'm also a bit dubious that LockAcquire is safe to call in general
with interrupts held.
Looks like we could just acquire the tuple-lock *before* doing the
toast_insert_or_update/RelationGetBufferForTuple, but after releasing
the buffer lock. That'd allow us to avoid doing the nested locking,
and should make the recovery just a goto l2;, ...
Why isn't that racey? Somebody else can grab the tuple lock after we
release the buffer content lock and before we acquire the tuple lock.
Sure, but by the time the tuple lock is released, they'd have updated
xmax. So once we acquired that we can just do

    if (xmax_infomask_changed(oldtup.t_data->t_infomask, infomask) ||
        !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data),
                             xwait))
        goto l2;

which is fine, because we've not yet done the toasting.
I see.
I'm not sure whether this approach is better than deleting potentially
toasted data though. It's probably faster, but will likely touch more
places in the code, and eat up an infomask bit (infomask & HEAP_MOVED
== HEAP_MOVED in my prototype).
Ugh. That's not very desirable at all.
I'm looking into three approaches right now:
1) Flag approach from above
2) Undo toasting on concurrent activity, retry
3) Use WAL logging for the already_marked = true case.
1) primarily suffers from a significant amount of complexity. I still
have a bug in there that sometimes triggers "attempted to update
invisible tuple" ERRORs. Otherwise it seems to perform decently
performancewise - even on workloads with many backends hitting the same
tuple, the retry-rate is low.
2) Seems to work too, but due to the amount of time the tuple is not
locked, the retry rate can be really high. As we perform a significant
amount of work (toast insertion & index manipulation or extending a
file) while the tuple is not locked, it's quite likely that another
session tries to modify the tuple in between. I think it's possible to
essentially livelock.
3) This approach so far seems the best. It's possible to reuse the
xl_heap_lock record (in an afaics backwards compatible manner), and in
most cases the overhead isn't that large. It's of course annoying to
emit more WAL, but it's not that big an overhead compared to extending a
file, or to toasting. It's also by far the simplest fix.
Comments?
Andres Freund wrote:
I'm looking into three approaches right now:
3) Use WAL logging for the already_marked = true case.
3) This approach so far seems the best. It's possible to reuse the
xl_heap_lock record (in an afaics backwards compatible manner), and in
most cases the overhead isn't that large. It's of course annoying to
emit more WAL, but it's not that big an overhead compared to extending a
file, or to toasting. It's also by far the simplest fix.
I suppose it's fine if we crash midway between emitting this WAL record and
the actual heap_update one, since the xmax will appear to come from an
aborted xid, right?
I agree that the overhead is probably negligible, considering that this
only happens when toast is invoked. It's probably not as great when the
new tuple goes to another page, though.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
Andres Freund wrote:
I'm looking into three approaches right now:
3) Use WAL logging for the already_marked = true case.
3) This approach so far seems the best. It's possible to reuse the
xl_heap_lock record (in an afaics backwards compatible manner), and in
most cases the overhead isn't that large. It's of course annoying to
emit more WAL, but it's not that big an overhead compared to extending a
file, or to toasting. It's also by far the simplest fix.
I suppose it's fine if we crash midway between emitting this WAL record and
the actual heap_update one, since the xmax will appear to come from an
aborted xid, right?
Yea, that should be fine.
I agree that the overhead is probably negligible, considering that this
only happens when toast is invoked. It's probably not as great when the
new tuple goes to another page, though.
I think it has to happen in both cases unfortunately. We could try to
add some optimizations (e.g. only release lock & WAL log if the target
page, via fsm, is before the current one), but I don't really want to go
there in the back branches.
Andres
On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
Andres Freund wrote:
I'm looking into three approaches right now:
3) Use WAL logging for the already_marked = true case.
3) This approach so far seems the best. It's possible to reuse the
xl_heap_lock record (in an afaics backwards compatible manner), and in
most cases the overhead isn't that large. It's of course annoying to
emit more WAL, but it's not that big an overhead compared to extending a
file, or to toasting. It's also by far the simplest fix.
+1 for proceeding with Approach-3.
I suppose it's fine if we crash midway from emitting this wal record and
the actual heap_update one, since the xmax will appear to come from an
aborted xid, right?Yea, that should be fine.
I agree that the overhead is probably negligible, considering that this
only happens when toast is invoked. It's probably not as great when the
new tuple goes to another page, though.
I think it has to happen in both cases unfortunately. We could try to
add some optimizations (e.g. only release lock & WAL log if the target
page, via fsm, is before the current one), but I don't really want to go
there in the back branches.
You are right, I think we can try such an optimization in Head and
that too if we see a performance hit with adding this new WAL in
heap_update.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 21, 2016 at 10:59:25AM +1200, Thomas Munro wrote:
On Fri, Jun 17, 2016 at 3:36 PM, Noah Misch <noah@leadboat.com> wrote:
I agree the non-atomic, unlogged change is a problem. A related threat
doesn't require a torn page:
AssignTransactionId() xid=123
heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123);
some ERROR before heap_update() finishes
rollback; -- xid=123
some backend flushes the modified page
immediate shutdown
AssignTransactionId() xid=123
commit; -- xid=123
If nothing wrote an xlog record that witnesses xid 123, the cluster can
reassign it after recovery. The failed update is now considered a successful
update, and the row improperly becomes dead. That's important.
I wonder if that was originally supposed to be handled with the
HEAP_XMAX_UNLOGGED flag which was removed in 11919160. A comment in
the heap WAL logging commit f2bfe8a2 said that tqual routines would
see the HEAP_XMAX_UNLOGGED flag in the event of a crash before logging
(though I'm not sure if the tqual routines ever actually did that).
HEAP_XMAX_UNLOGGED does appear to have originated in contemplation of this
same hazard. Looking at the three commits in "git log -S HEAP_XMAX_UNLOGGED"
(f2bfe8a b58c041 1191916), nothing ever completed the implementation by
testing for that flag.
On Tue, Jun 21, 2016 at 6:59 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
What exactly is the point of all of that already_marked stuff?
Preventing the old tuple from being locked/updated by another backend,
while unlocking the buffer.
I mean, suppose we just don't do any of that before we go off to do
toast_insert_or_update and RelationGetBufferForTuple. Eventually,
when we reacquire the page lock, we might find that somebody else has
already updated the tuple, but couldn't that be handled by
(approximately) looping back up to l2 just as we do in several other
cases?
We'd potentially have to undo a fair amount more work: the toasted data
would have to be deleted and such, just to retry. Which isn't going to
be super easy, because all of it will be happening with the current cid (we
can't just increase CommandCounterIncrement() for correctness reasons).
Why would we have to delete the TOAST data? AFAIUI, the tuple points
to the TOAST data, but not the other way around. So if we change our
mind about where to put the tuple, I don't think that requires
re-TOASTing.
Consider what happens if we, after restarting at l2, notice that we
can't actually insert, but return in the !HeapTupleMayBeUpdated
branch. If the caller doesn't error out - and there's certainly callers
doing that - we'd "leak" a toasted datum.
Sorry to interrupt, but I have a question about this case.
Is there a case where we go back to l2 after we have created the toasted
datum (i.e. called toast_insert_or_update)?
IIUC, after we store the toast datum we just insert the heap tuple and log
WAL (or error out for some reason).
Regards,
--
Masahiko Sawada
On Tue, Jun 28, 2016 at 8:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Jun 21, 2016 at 6:59 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-20 17:55:19 -0400, Robert Haas wrote:
On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-20 16:10:23 -0400, Robert Haas wrote:
What exactly is the point of all of that already_marked stuff?
Preventing the old tuple from being locked/updated by another backend,
while unlocking the buffer.
I mean, suppose we just don't do any of that before we go off to do
toast_insert_or_update and RelationGetBufferForTuple. Eventually,
when we reacquire the page lock, we might find that somebody else has
already updated the tuple, but couldn't that be handled by
(approximately) looping back up to l2 just as we do in several other
cases?
We'd potentially have to undo a fair amount more work: the toasted data
would have to be deleted and such, just to retry. Which isn't going to
be super easy, because all of it will be happening with the current cid (we
can't just increase CommandCounterIncrement() for correctness reasons).
Why would we have to delete the TOAST data? AFAIUI, the tuple points
to the TOAST data, but not the other way around. So if we change our
mind about where to put the tuple, I don't think that requires
re-TOASTing.
Consider what happens if we, after restarting at l2, notice that we
can't actually insert, but return in the !HeapTupleMayBeUpdated
branch. If the caller doesn't error out - and there's certainly callers
doing that - we'd "leak" a toasted datum.
Sorry to interrupt, but I have a question about this case.
Is there a case where we go back to l2 after we have created the toasted
datum (i.e. called toast_insert_or_update)?
IIUC, after we store the toast datum we just insert the heap tuple and log
WAL (or error out for some reason).
I understood now, sorry for the noise.
Regards,
--
Masahiko Sawada
On Fri, Jun 24, 2016 at 11:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
Andres Freund wrote:
I'm looking into three approaches right now:
3) Use WAL logging for the already_marked = true case.
3) This approach so far seems the best. It's possible to reuse the
xl_heap_lock record (in an afaics backwards compatible manner), and in
most cases the overhead isn't that large. It's of course annoying to
emit more WAL, but it's not that big an overhead compared to extending a
file, or to toasting. It's also by far the simplest fix.
+1 for proceeding with Approach-3.
I suppose it's fine if we crash midway from emitting this wal record and
the actual heap_update one, since the xmax will appear to come from an
aborted xid, right?
Yea, that should be fine.
I agree that the overhead is probably negligible, considering that this
only happens when toast is invoked. It's probably not as great when the
new tuple goes to another page, though.
I think it has to happen in both cases unfortunately. We could try to
add some optimizations (e.g. only release lock & WAL log if the target
page, via fsm, is before the current one), but I don't really want to go
there in the back branches.
You are right, I think we can try such an optimization in Head and
that too if we see a performance hit with adding this new WAL in
heap_update.
+1 for #3 approach, and attached draft patch for that.
I think attached patch would fix this problem but please let me know
if this patch is not what you're thinking.
Regards,
--
Masahiko Sawada
Attachments:
emit_wal_already_marked_true_case.patch (text/x-diff, charset=US-ASCII)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..2f3fd83 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,28 @@ l2:
if (need_toast || newtupsize > pagefree)
{
+ /*
+ * To prevent data corruption due to updating old tuple by
+ * other backends after released buffer, we need to emit that
+ * xmax of old tuple is set and clear visibility map bits if
+ * needed before relasing buffer. We can reuse xl_heap_lock
+ * for this pupose. It should be fine even if we crash midway
+ * from this section and the actual updating one later, since
+ * the xmax will appear to come from an aborted xid.
+ */
+ START_CRIT_SECTION();
+
+ /* Celar PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(BufferGetPage(buffer)))
+ {
+ all_visible_cleared = true;
+ PageClearAllVisible(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer);
+ }
+
+ MarkBufferDirty(buffer);
+
/* Clear obsolete visibility flags ... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3958,26 @@ l2:
/* temporarily make it look not-updated */
oldtup.t_data->t_ctid = oldtup.t_self;
already_marked = true;
+
+ if (RelationNeedsWAL(relation))
+ {
+ xl_heap_lock xlrec;
+ XLogRecPtr recptr;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+ xlrec.locking_xid = xid;
+ xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+ oldtup.t_data->t_infomask2);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+ recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
/*
On Wed, Jun 29, 2016 at 11:14 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Jun 24, 2016 at 11:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote:
Andres Freund wrote:
I'm looking into three approaches right now:
3) Use WAL logging for the already_marked = true case.
3) This approach so far seems the best. It's possible to reuse the
xl_heap_lock record (in an afaics backwards compatible manner), and in
most cases the overhead isn't that large. It's of course annoying to
emit more WAL, but it's not that big an overhead compared to extending a
file, or to toasting. It's also by far the simplest fix.
You are right, I think we can try such an optimization in Head and
that too if we see a performance hit with adding this new WAL in
heap_update.
+1 for #3 approach, and attached draft patch for that.
I think attached patch would fix this problem but please let me know
if this patch is not what you're thinking.
Review comments:
+ if (RelationNeedsWAL(relation))
+ {
+ xl_heap_lock xlrec;
+ XLogRecPtr recptr;
+
..
+ xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+ xlrec.locking_xid = xid;
+ xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+ oldtup.t_data->t_infomask2);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+ recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+ PageSetLSN(page, recptr);
+ }
There is nothing in this record which records the information about the
visibility clear flag. How will you ensure the flag is cleared after a
crash? Have you considered logging the cid using log_heap_new_cid() for
logical decoding?
It seems to me that the value of locking_xid should be xmax_old_tuple;
why have you chosen xid?
+ /* Celar PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(BufferGetPage(buffer)))
+ {
+ all_visible_cleared = true;
+ PageClearAllVisible(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer);
+ }
+
+ MarkBufferDirty(buffer);
+
/* Clear obsolete visibility flags ... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
I think it is better to first update the tuple-related info and then clear
the PD_ALL_VISIBLE flags (for the order, refer to how it is done in heap_update
in the code below where you are trying to add the new code).
A couple of typos -
/relasing/releasing
/Celar/Clear
I think in this approach, it is important to measure the performance
of update; maybe you can use the simple-update option of pgbench for
various workloads. Try it with different fill factors (the -F fillfactor
option in pgbench).
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
There is nothing in this record which recorded the information about
visibility clear flag.
I think we can actually defer the clearing to the lock release? A tuple
being locked doesn't require the vm being cleared.
I think in this approach, it is important to measure the performance
of update; maybe you can use the simple-update option of pgbench for
various workloads. Try it with different fill factors (the -F fillfactor
option in pgbench).
Probably not sufficient, also needs toast activity, to show the really
bad case of many lock releases.
On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
There is nothing in this record which recorded the information about
visibility clear flag.
I think we can actually defer the clearing to the lock release?
How about the case if after we release the lock on page, the heap page
gets flushed, but not vm and then server crashes? After recovery,
vacuum will never consider such a page for freezing as the vm bit
still says all_frozen. Another possibility could be that WAL for
xl_heap_lock got flushed, but not the heap page before crash, now
after recovery, it will set the tuple with appropriate infomask and
other flags, but the heap page will still be marked as ALL_VISIBLE. I
think that can lead to a situation which Thomas Munro has reported
upthread.
In all other cases in heapam.c, after clearing the vm and the corresponding
flag in the heap page, we record the same in WAL. Why make this a
different case, and how is it safe to do it here and not at other
places?
A tuple
being locked doesn't require the vm being cleared.
I think in this approach, it is important to measure the performance
of update; maybe you can use the simple-update option of pgbench for
various workloads. Try it with different fill factors (the -F fillfactor
option in pgbench).
Probably not sufficient, also needs toast activity, to show the really
bad case of many lock releases.
Okay, makes sense.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
There is nothing in this record which recorded the information about
visibility clear flag.
I think we can actually defer the clearing to the lock release?
How about the case if after we release the lock on page, the heap page
gets flushed, but not vm and then server crashes?
In the released branches there's no need to clear all visible at that
point. Note how heap_lock_tuple doesn't clear it at all. So we should be
fine there, and that's the part where reusing an existing record is
important (for compatibility).
But your question made me realize that we desperately *do* need to
clear the frozen bit in heap_lock_tuple in 9.6...
Greetings,
Andres Freund
On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
There is nothing in this record which recorded the information about
visibility clear flag.
I think we can actually defer the clearing to the lock release?
How about the case if after we release the lock on page, the heap page
gets flushed, but not vm and then server crashes?
In the released branches there's no need to clear all visible at that
point. Note how heap_lock_tuple doesn't clear it at all. So we should be
fine there, and that's the part where reusing an existing record is
important (for compatibility).
For back branches, I agree that heap_lock_tuple is sufficient, but in
that case we should not clear the vm or the page bit at all, as is done in
the proposed patch.
But your question made me realize that we desperately *do* need to
clear the frozen bit in heap_lock_tuple in 9.6...
Right.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
There is nothing in this record which recorded the information about
visibility clear flag.
I think we can actually defer the clearing to the lock release?
How about the case if after we release the lock on page, the heap page
gets flushed, but not vm and then server crashes?
In the released branches there's no need to clear all visible at that
point. Note how heap_lock_tuple doesn't clear it at all. So we should be
fine there, and that's the part where reusing an existing record is
important (for compatibility).
For back branches, I agree that heap_lock_tuple is sufficient,
Even if we use heap_lock_tuple, if the server crashed after flushing the heap
but not the vm, after crash recovery the page is still marked all-visible
in the vm.
This case could happen even on released branches, and could make
IndexOnlyScan return a wrong result?
Regards,
--
Masahiko Sawada
On Thu, Jun 30, 2016 at 8:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
There is nothing in this record which recorded the information about
visibility clear flag.
I think we can actually defer the clearing to the lock release?
How about the case if after we release the lock on page, the heap page
gets flushed, but not vm and then server crashes?
In the released branches there's no need to clear all visible at that
point. Note how heap_lock_tuple doesn't clear it at all. So we should be
fine there, and that's the part where reusing an existing record is
important (for compatibility).
For back branches, I agree that heap_lock_tuple is sufficient,
Even if we use heap_lock_tuple, if the server crashed after flushing the heap
but not the vm, after crash recovery the page is still marked all-visible
in the vm.
So, in this case both vm and page will be marked as all_visible.
This case could happen even on released branches, and could make
IndexOnlyScan return a wrong result?
Why do you think IndexOnlyScan will return a wrong result? If the
server crashes in the way you described, the transaction that has
made the modifications will anyway be considered aborted, so the result of
IndexOnlyScan should not be wrong.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jun 30, 2016 at 8:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-30 08:59:16 +0530, Amit Kapila wrote:
On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-06-29 19:04:31 +0530, Amit Kapila wrote:
There is nothing in this record which recorded the information about
visibility clear flag.
I think we can actually defer the clearing to the lock release?
How about the case if after we release the lock on page, the heap page
gets flushed, but not vm and then server crashes?
In the released branches there's no need to clear all visible at that
point. Note how heap_lock_tuple doesn't clear it at all. So we should be
fine there, and that's the part where reusing an existing record is
important (for compatibility).
For back branches, I agree that heap_lock_tuple is sufficient,
Even if we use heap_lock_tuple, if the server crashed after flushing the heap
but not the vm, after crash recovery the page is still marked all-visible
in the vm.
So, in this case both vm and page will be marked as all_visible.
This case could happen even on released branches, and could make
IndexOnlyScan return a wrong result?
Why do you think IndexOnlyScan will return a wrong result? If the
server crashes in the way you described, the transaction that has
made the modifications will anyway be considered aborted, so the result of
IndexOnlyScan should not be wrong.
Ah, you're right, I misunderstood.
Attached updated patch incorporating your comments.
I've changed it so that heap_xlog_lock clears vm flags if page is
marked all frozen.
Regards,
--
Masahiko Sawada
Attachments:
emit_wal_already_marked_true_case_v2.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..562fa24 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,17 @@ l2:
if (need_toast || newtupsize > pagefree)
{
+ /*
+ * To prevent data corruption due to updating old tuple by
+ * other backends after released buffer, we need to emit that
+ * xmax of old tuple is set and clear visibility map bits if
+ * needed before releasing buffer. We can reuse xl_heap_lock
+ * for this purpose. It should be fine even if we crash midway
+ * from this section and the actual updating one later, since
+ * the xmax will appear to come from an aborted xid.
+ */
+ START_CRIT_SECTION();
+
/* Clear obsolete visibility flags ... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3947,44 @@ l2:
/* temporarily make it look not-updated */
oldtup.t_data->t_ctid = oldtup.t_self;
already_marked = true;
+
+ /* Clear PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(BufferGetPage(buffer)))
+ {
+ all_visible_cleared = true;
+ PageClearAllVisible(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer);
+ }
+
+ MarkBufferDirty(buffer);
+
+ if (RelationNeedsWAL(relation))
+ {
+ xl_heap_lock xlrec;
+ XLogRecPtr recptr;
+
+ /*
+ * For logical decoding we need combocids to properly decode the
+ * catalog.
+ */
+ if (RelationIsAccessibleInLogicalDecoding(relation))
+ log_heap_new_cid(relation, &oldtup);
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+ xlrec.locking_xid = xmax_old_tuple;
+ xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+ oldtup.t_data->t_infomask2);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+ recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
/*
@@ -4140,7 +4189,8 @@ l2:
*/
if (RelationIsAccessibleInLogicalDecoding(relation))
{
- log_heap_new_cid(relation, &oldtup);
+ if (!already_marked)
+ log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
}
@@ -8694,6 +8744,23 @@ heap_xlog_lock(XLogReaderState *record)
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* The visibility map need to be cleared */
+ if (PageIsAllVisible(page))
+ {
+ RelFileNode rnode;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber blkno;
+ Relation reln;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ reln = CreateFakeRelcacheEntry(rnode);
+
+ visibilitymap_pin(reln, blkno, &vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer);
+ PageClearAllVisible(page);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Ah, you're right, I misunderstood.
Attached updated patch incorporating your comments.
I've changed it so that heap_xlog_lock clears vm flags if page is
marked all frozen.
I believe that this should be separated into two patches, since there
are two issues here:
1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
2. heap_update releases the buffer content lock without logging the
changes it has made.
With respect to #1, there is no need to clear the all-visible bit,
only the all-frozen bit. However, that's a bit tricky given that we
removed PD_ALL_FROZEN. Should we think about putting that back again?
Should we just clear all-visible and call it good enough? The only
cost of that is that vacuum will come along and mark the page
all-visible again instead of skipping it, but that's probably not an
enormous expense in most cases.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-07-01 15:18:39 -0400, Robert Haas wrote:
On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Ah, you're right, I misunderstood.
Attached updated patch incorporating your comments.
I've changed it so that heap_xlog_lock clears vm flags if page is
marked all frozen.
I believe that this should be separated into two patches, since there
are two issues here:
1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
2. heap_update releases the buffer content lock without logging the
changes it has made.
With respect to #1, there is no need to clear the all-visible bit,
only the all-frozen bit. However, that's a bit tricky given that we
removed PD_ALL_FROZEN. Should we think about putting that back again?
I think it's fine to just do the vm lookup.
Should we just clear all-visible and call it good enough?
Given that we need to do that in heap_lock_tuple, which entirely
preserves all-visible (but shouldn't preserve all-frozen), ISTM we
better find something that doesn't invalidate all-visible.
The only
cost of that is that vacuum will come along and mark the page
all-visible again instead of skipping it, but that's probably not an
enormous expense in most cases.
I think the main cost is not having the page marked as all-visible for
index-only purposes. If it's an insert mostly table, it can be a long
while till vacuum comes around.
Andres
On 7/1/16 2:23 PM, Andres Freund wrote:
The only
cost of that is that vacuum will come along and mark the page
all-visible again instead of skipping it, but that's probably not an
enormous expense in most cases.
I think the main cost is not having the page marked as all-visible for
index-only purposes. If it's an insert mostly table, it can be a long
while till vacuum comes around.
ISTM that's something that should be addressed anyway (and separately), no?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 2016-07-01 15:42:22 -0500, Jim Nasby wrote:
On 7/1/16 2:23 PM, Andres Freund wrote:
The only
cost of that is that vacuum will come along and mark the page
all-visible again instead of skipping it, but that's probably not an
enormous expense in most cases.
I think the main cost is not having the page marked as all-visible for
index-only purposes. If it's an insert mostly table, it can be a long
while till vacuum comes around.
ISTM that's something that should be addressed anyway (and separately), no?
Huh? That's the current behaviour in heap_lock_tuple.
On 7/1/16 3:43 PM, Andres Freund wrote:
On 2016-07-01 15:42:22 -0500, Jim Nasby wrote:
On 7/1/16 2:23 PM, Andres Freund wrote:
The only
cost of that is that vacuum will come along and mark the page
all-visible again instead of skipping it, but that's probably not an
enormous expense in most cases.
I think the main cost is not having the page marked as all-visible for
index-only purposes. If it's an insert mostly table, it can be a long
while till vacuum comes around.
ISTM that's something that should be addressed anyway (and separately), no?
Huh? That's the current behaviour in heap_lock_tuple.
Oh, I was referring to autovac not being aggressive enough on
insert-mostly tables. Certainly if there's a reasonable way to avoid
invalidating the VM when locking a tuple that'd be good.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-01 15:18:39 -0400, Robert Haas wrote:
On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Ah, you're right, I misunderstood.
Attached updated patch incorporating your comments.
I've changed it so that heap_xlog_lock clears vm flags if page is
marked all frozen.
I believe that this should be separated into two patches, since there
are two issues here:
1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
2. heap_update releases the buffer content lock without logging the
changes it has made.
With respect to #1, there is no need to clear the all-visible bit,
only the all-frozen bit. However, that's a bit tricky given that we
removed PD_ALL_FROZEN. Should we think about putting that back again?
I think it's fine to just do the vm lookup.
Should we just clear all-visible and call it good enough?
Given that we need to do that in heap_lock_tuple, which entirely
preserves all-visible (but shouldn't preserve all-frozen), ISTM we
better find something that doesn't invalidate all-visible.
Sounds logical, considering that we have a way to set all-frozen and
vacuum does that as well. So probably either we need to have a new
API or add a new parameter to visibilitymap_clear() to indicate the
same. If we want to go that route, isn't it better to have
PD_ALL_FROZEN as well?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Why do you think IndexOnlyScan will return wrong result? If the
server crash in the way as you described, the transaction that has
made modifications will anyway be considered aborted, so the result of
IndexOnlyScan should not be wrong.
Ah, you're right, I misunderstood.
Attached updated patch incorporating your comments.
I've changed it so that heap_xlog_lock clears vm flags if page is
marked all frozen.
I think we should make a similar change in heap_lock_tuple API as
well. Also, currently by default heap_xlog_lock tuple tries to clear
the visibility flags, isn't it better to handle it as we do at all
other places (ex. see log_heap_update), by logging the information
about same. I think it is always advisable to log every action we
want replay to perform. That way, it is always easy to extend it
based on if some change is required only in certain cases, but not in
others.
Though, it is important to get the patch right, but I feel in the
meantime, it might be better to start benchmarking. AFAIU, even if
change some part of information while WAL logging it, the benchmark
results won't be much different.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 2, 2016 at 12:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Why do you think IndexOnlyScan will return wrong result? If the
server crash in the way as you described, the transaction that has
made modifications will anyway be considered aborted, so the result of
IndexOnlyScan should not be wrong.
Ah, you're right, I misunderstood.
Attached updated patch incorporating your comments.
I've changed it so that heap_xlog_lock clears vm flags if page is
marked all frozen.
I think we should make a similar change in heap_lock_tuple API as
well.
Also, currently by default heap_xlog_lock tuple tries to clear
the visibility flags, isn't it better to handle it as we do at all
other places (ex. see log_heap_update), by logging the information
about same.
I will deal with them.
Though, it is important to get the patch right, but I feel in the
meantime, it might be better to start benchmarking. AFAIU, even if
change some part of information while WAL logging it, the benchmark
results won't be much different.
Okay, I will do the benchmark test as well.
Regards,
--
Masahiko Sawada
On Sat, Jul 2, 2016 at 12:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-01 15:18:39 -0400, Robert Haas wrote:
On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Ah, you're right, I misunderstood.
Attached updated patch incorporating your comments.
I've changed it so that heap_xlog_lock clears vm flags if page is
marked all frozen.
I believe that this should be separated into two patches, since there
are two issues here:
1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so.
2. heap_update releases the buffer content lock without logging the
changes it has made.
With respect to #1, there is no need to clear the all-visible bit,
only the all-frozen bit. However, that's a bit tricky given that we
removed PD_ALL_FROZEN. Should we think about putting that back again?
I think it's fine to just do the vm lookup.
Should we just clear all-visible and call it good enough?
Given that we need to do that in heap_lock_tuple, which entirely
preserves all-visible (but shouldn't preserve all-frozen), ISTM we
better find something that doesn't invalidate all-visible.
Sounds logical, considering that we have a way to set all-frozen and
vacuum does that as well. So probably either we need to have a new
API or add a new parameter to visibilitymap_clear() to indicate the
same. If we want to go that route, isn't it better to have
PD_ALL_FROZEN as well?
Can't we call visibilitymap_set with all-visible but not all-frozen
bits instead of clearing flags?
Regards,
--
Masahiko Sawada
On Mon, Jul 4, 2016 at 2:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Jul 2, 2016 at 12:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-01 15:18:39 -0400, Robert Haas wrote:
Should we just clear all-visible and call it good enough?
Given that we need to do that in heap_lock_tuple, which entirely
preserves all-visible (but shouldn't preserve all-frozen), ISTM we
better find something that doesn't invalidate all-visible.
Sounds logical, considering that we have a way to set all-frozen and
vacuum does that as well. So probably either we need to have a new
API or add a new parameter to visibilitymap_clear() to indicate the
same. If we want to go that route, isn't it better to have
PD_ALL_FROZEN as well?
Can't we call visibilitymap_set with all-visible but not all-frozen
bits instead of clearing flags?
That doesn't sound to be an impressive way to deal. First,
visibilitymap_set logs the action itself which will generate two WAL
records (one for visibility map and another for lock tuple). Second,
it doesn't look consistent to me.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 4, 2016 at 5:44 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Jul 2, 2016 at 12:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Why do you think IndexOnlyScan will return wrong result? If the
server crash in the way as you described, the transaction that has
made modifications will anyway be considered aborted, so the result of
IndexOnlyScan should not be wrong.
Ah, you're right, I misunderstood.
Attached updated patch incorporating your comments.
I've changed it so that heap_xlog_lock clears vm flags if page is
marked all frozen.
I think we should make a similar change in heap_lock_tuple API as
well.
Also, currently by default heap_xlog_lock tuple tries to clear
the visibility flags, isn't it better to handle it as we do at all
other places (ex. see log_heap_update), by logging the information
about same.
I will deal with them.
Though, it is important to get the patch right, but I feel in the
meantime, it might be better to start benchmarking. AFAIU, even if
change some part of information while WAL logging it, the benchmark
results won't be much different.
Okay, I will do the benchmark test as well.
I measured the throughput and the output quantity of WAL with HEAD and
HEAD+patch(attached) on my virtual environment.
I used pgbench with attached custom script file which sets 3200 length
string to the filler column in order to make toast data.
The scale factor is 1000 and pgbench options are, -c 4 -T 600 -f toast_test.sql.
After changing the filler column to the text data type, I ran it.
* Throughput
HEAD : 1833.204172
Patched : 1827.399482
* Output quantity of WAL
HEAD : 7771 MB
Patched : 8082 MB
The throughput is almost the same, but the output quantity of WAL is
slightly increased (about 4%).
Regards,
--
Masahiko Sawada
Attachments:
emit_wal_already_marked_true_case_v3.patch (text/x-diff; charset=US-ASCII)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..fd66527 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,17 @@ l2:
if (need_toast || newtupsize > pagefree)
{
+ /*
+ * To prevent data corruption due to updating old tuple by
+ * other backends after released buffer, we need to emit that
+ * xmax of old tuple is set and clear visibility map bits if
+ * needed before releasing buffer. We can reuse xl_heap_lock
+ * for this purpose. It should be fine even if we crash midway
+ * from this section and the actual updating one later, since
+ * the xmax will appear to come from an aborted xid.
+ */
+ START_CRIT_SECTION();
+
/* Clear obsolete visibility flags ... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3947,46 @@ l2:
/* temporarily make it look not-updated */
oldtup.t_data->t_ctid = oldtup.t_self;
already_marked = true;
+
+ /* Clear PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(BufferGetPage(buffer)))
+ {
+ all_visible_cleared = true;
+ PageClearAllVisible(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer);
+ }
+
+ MarkBufferDirty(buffer);
+
+ if (RelationNeedsWAL(relation))
+ {
+ xl_heap_lock xlrec;
+ XLogRecPtr recptr;
+
+ /*
+ * For logical decoding we need combocids to properly decode the
+ * catalog.
+ */
+ if (RelationIsAccessibleInLogicalDecoding(relation))
+ log_heap_new_cid(relation, &oldtup);
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+ xlrec.locking_xid = xmax_old_tuple;
+ xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+ oldtup.t_data->t_infomask2);
+ if (all_visible_cleared)
+ xlrec.infobits_set |= XLHL_ALL_VISIBLE_CLEARED;
+ XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+ recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
/*
@@ -4140,7 +4191,8 @@ l2:
*/
if (RelationIsAccessibleInLogicalDecoding(relation))
{
- log_heap_new_cid(relation, &oldtup);
+ if (!already_marked)
+ log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
}
@@ -4513,6 +4565,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
new_infomask2;
bool first_time = true;
bool have_tuple_lock = false;
+ bool all_visible_cleared = false;
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -5034,6 +5087,18 @@ failed:
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
tuple->t_data->t_ctid = *tid;
+ /* Clear PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(page))
+ {
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber block = BufferGetBlockNumber(*buffer);
+
+ all_visible_cleared = true;
+ PageClearAllVisible(page);
+ visibilitymap_pin(relation, block, &vmbuffer);
+ visibilitymap_clear(relation, block, vmbuffer);
+ }
+
MarkBufferDirty(*buffer);
/*
@@ -5060,6 +5125,8 @@ failed:
xlrec.locking_xid = xid;
xlrec.infobits_set = compute_infobits(new_infomask,
tuple->t_data->t_infomask2);
+ if (all_visible_cleared)
+ xlrec.infobits_set |= XLHL_ALL_VISIBLE_CLEARED;
XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
/* we don't decode row locks atm, so no need to log the origin */
@@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record)
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* The visibility map need to be cleared */
+ if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0)
+ {
+ RelFileNode rnode;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber blkno;
+ Relation reln;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ reln = CreateFakeRelcacheEntry(rnode);
+
+ visibilitymap_pin(reln, blkno, &vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer);
+ PageClearAllVisible(page);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index a822d0b..41b3c54 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info
#define XLHL_XMAX_EXCL_LOCK 0x04
#define XLHL_XMAX_KEYSHR_LOCK 0x08
#define XLHL_KEYS_UPDATED 0x10
+#define XLHL_ALL_VISIBLE_CLEARED 0x20
/* This is what we need to know about lock */
typedef struct xl_heap_lock
On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote:
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..fd66527 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,17 @@ l2:
if (need_toast || newtupsize > pagefree)
{
+ /*
+ * To prevent data corruption due to updating old tuple by
+ * other backends after released buffer
That's not really the reason, is it? The prime problem is crash safety /
replication. The row-lock we're faking (by setting xmax to our xid),
prevents concurrent updates until this backend died.
, we need to emit that
+ * xmax of old tuple is set and clear visibility map bits if
+ * needed before releasing buffer. We can reuse xl_heap_lock
+ * for this purpose. It should be fine even if we crash midway
+ * from this section and the actual updating one later, since
+ * the xmax will appear to come from an aborted xid.
+ */
+ START_CRIT_SECTION();
+
/* Clear obsolete visibility flags ... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3947,46 @@ l2:
/* temporarily make it look not-updated */
oldtup.t_data->t_ctid = oldtup.t_self;
already_marked = true;
+
+ /* Clear PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(BufferGetPage(buffer)))
+ {
+ all_visible_cleared = true;
+ PageClearAllVisible(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer);
+ }
+
+ MarkBufferDirty(buffer);
+
+ if (RelationNeedsWAL(relation))
+ {
+ xl_heap_lock xlrec;
+ XLogRecPtr recptr;
+
+ /*
+ * For logical decoding we need combocids to properly decode the
+ * catalog.
+ */
+ if (RelationIsAccessibleInLogicalDecoding(relation))
+ log_heap_new_cid(relation, &oldtup);
Hm, I don't see that being necessary here. Row locks aren't logically
decoded, so there's no need to emit this here.
+ /* Clear PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(page))
+ {
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber block = BufferGetBlockNumber(*buffer);
+
+ all_visible_cleared = true;
+ PageClearAllVisible(page);
+ visibilitymap_pin(relation, block, &vmbuffer);
+ visibilitymap_clear(relation, block, vmbuffer);
+ }
+
That clears all-visible unnecessarily, we only need to clear all-frozen.
@@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record)
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* The visibility map need to be cleared */
+ if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0)
+ {
+ RelFileNode rnode;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber blkno;
+ Relation reln;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ reln = CreateFakeRelcacheEntry(rnode);
+
+ visibilitymap_pin(reln, blkno, &vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer);
+ PageClearAllVisible(page);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index a822d0b..41b3c54 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info
#define XLHL_XMAX_EXCL_LOCK 0x04
#define XLHL_XMAX_KEYSHR_LOCK 0x08
#define XLHL_KEYS_UPDATED 0x10
+#define XLHL_ALL_VISIBLE_CLEARED 0x20
Hm. We can't easily do that in the back-patched version, because a
standby won't know to check for the flag. That's kinda ok, since we
don't yet need to clear all-visible at that point of
heap_update. But that better means we don't do so on the master either.
Greetings,
Andres Freund
On Thu, Jul 7, 2016 at 3:36 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote:
@@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record)
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* The visibility map need to be cleared */
+ if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0)
+ {
+ RelFileNode rnode;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber blkno;
+ Relation reln;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ reln = CreateFakeRelcacheEntry(rnode);
+
+ visibilitymap_pin(reln, blkno, &vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer);
+ PageClearAllVisible(page);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index a822d0b..41b3c54 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info
#define XLHL_XMAX_EXCL_LOCK 0x04
#define XLHL_XMAX_KEYSHR_LOCK 0x08
#define XLHL_KEYS_UPDATED 0x10
+#define XLHL_ALL_VISIBLE_CLEARED 0x20
Hm. We can't easily do that in the back-patched version, because a
standby won't know to check for the flag. That's kinda ok, since we
don't yet need to clear all-visible at that point of
heap_update. But that better means we don't do so on the master either.
To clarify, do you mean to say let's have XLHL_ALL_FROZEN_CLEARED
and do that just for master, and for back-branches there is no need to
clear any visibility flags?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thank you for reviewing!
On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote:
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..fd66527 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,17 @@ l2:
if (need_toast || newtupsize > pagefree)
{
+ /*
+ * To prevent data corruption due to updating old tuple by
+ * other backends after released buffer
That's not really the reason, is it? The prime problem is crash safety /
replication. The row-lock we're faking (by setting xmax to our xid),
prevents concurrent updates until this backend died.
Fixed.
, we need to emit that
+ * xmax of old tuple is set and clear visibility map bits if
+ * needed before releasing buffer. We can reuse xl_heap_lock
+ * for this purpose. It should be fine even if we crash midway
+ * from this section and the actual updating one later, since
+ * the xmax will appear to come from an aborted xid.
+ */
+ START_CRIT_SECTION();
+
/* Clear obsolete visibility flags ... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3947,46 @@ l2:
/* temporarily make it look not-updated */
oldtup.t_data->t_ctid = oldtup.t_self;
already_marked = true;
+
+ /* Clear PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(BufferGetPage(buffer)))
+ {
+ all_visible_cleared = true;
+ PageClearAllVisible(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer);
+ }
+
+ MarkBufferDirty(buffer);
+
+ if (RelationNeedsWAL(relation))
+ {
+ xl_heap_lock xlrec;
+ XLogRecPtr recptr;
+
+ /*
+ * For logical decoding we need combocids to properly decode the
+ * catalog.
+ */
+ if (RelationIsAccessibleInLogicalDecoding(relation))
+ log_heap_new_cid(relation, &oldtup);
Hm, I don't see that being necessary here. Row locks aren't logically
decoded, so there's no need to emit this here.
Fixed.
+ /* Clear PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(page))
+ {
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber block = BufferGetBlockNumber(*buffer);
+
+ all_visible_cleared = true;
+ PageClearAllVisible(page);
+ visibilitymap_pin(relation, block, &vmbuffer);
+ visibilitymap_clear(relation, block, vmbuffer);
+ }
+
That clears all-visible unnecessarily, we only need to clear all-frozen.
Fixed.
@@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record)
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* The visibility map need to be cleared */
+ if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0)
+ {
+ RelFileNode rnode;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber blkno;
+ Relation reln;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ reln = CreateFakeRelcacheEntry(rnode);
+
+ visibilitymap_pin(reln, blkno, &vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer);
+ PageClearAllVisible(page);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index a822d0b..41b3c54 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info
#define XLHL_XMAX_EXCL_LOCK 0x04
#define XLHL_XMAX_KEYSHR_LOCK 0x08
#define XLHL_KEYS_UPDATED 0x10
+#define XLHL_ALL_VISIBLE_CLEARED 0x20
Hm. We can't easily do that in the back-patched version, because a
standby won't know to check for the flag. That's kinda ok, since we
don't yet need to clear all-visible at that point of
heap_update. But that better means we don't do so on the master either.
Attached latest version patch.
I changed visibilitymap_clear function so that it allows to specify
bits being cleared.
The function that needs to clear only the all-frozen bit on the
visibility map calls visibilitymap_clear_extended to clear that
particular bit.
Other functions can call visibilitymap_clear to clear all the bits
for one page.
Instead of adding XLHL_ALL_VISIBLE_CLEARED, we do a vm lookup for back branches.
To reduce unnecessary visibility map lookups, I changed it so that
we check PD_ALL_VISIBLE on the heap page first, and then look up the
all-frozen bit on the visibility map if necessary.
Regards,
--
Masahiko Sawada
Attachments:
emit_wal_already_marked_true_case_v4.patch (text/x-diff; charset=US-ASCII)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..e2efba3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,16 @@ l2:
if (need_toast || newtupsize > pagefree)
{
+ /*
+ * For crash safety, we need to emit that xmax of old tuple is set
+ * and clear only the all-frozen bit on visibility map if needed
+ * before releasing the buffer. We can reuse xl_heap_lock for this
+ * purpose. It should be fine even if we crash midway from this
+ * section and the actual updating one later, since the xmax will
+ * appear to come from an aborted xid.
+ */
+ START_CRIT_SECTION();
+
/* Clear obsolete visibility flags ... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3946,36 @@ l2:
/* temporarily make it look not-updated */
oldtup.t_data->t_ctid = oldtup.t_self;
already_marked = true;
+
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(BufferGetPage(buffer)) &&
+ VM_ALL_FROZEN(relation, block, &vmbuffer))
+ {
+ visibilitymap_clear_extended(relation, block, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ }
+
+ MarkBufferDirty(buffer);
+
+ if (RelationNeedsWAL(relation))
+ {
+ xl_heap_lock xlrec;
+ XLogRecPtr recptr;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+ xlrec.locking_xid = xmax_old_tuple;
+ xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+ oldtup.t_data->t_infomask2);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+ recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
/*
@@ -4506,6 +4546,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemPointer tid = &(tuple->t_self);
ItemId lp;
Page page;
+ Buffer vmbuffer = InvalidBuffer;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -5034,6 +5075,17 @@ failed:
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
tuple->t_data->t_ctid = *tid;
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(page) &&
+ VM_ALL_FROZEN(relation, BufferGetBlockNumber(*buffer), &vmbuffer))
+ {
+ BlockNumber block = BufferGetBlockNumber(*buffer);
+
+ visibilitymap_pin(relation, block, &vmbuffer);
+ visibilitymap_clear_extended(relation, block, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ }
+
MarkBufferDirty(*buffer);
/*
@@ -5072,6 +5124,8 @@ failed:
END_CRIT_SECTION();
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
/*
* Don't update the visibility map here. Locking a tuple doesn't change
@@ -8694,6 +8748,22 @@ heap_xlog_lock(XLogReaderState *record)
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* Clear the all-frozen bit on the visibility map if needed */
+ if (PageIsAllVisible(BufferGetPage(buffer)))
+ {
+ RelFileNode rnode;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber blkno;
+ Relation reln;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ reln = CreateFakeRelcacheEntry(rnode);
+ visibilitymap_pin(reln, blkno, &vmbuffer);
+ visibilitymap_clear_extended(reln, blkno, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ ReleaseBuffer(vmbuffer);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b472d31..c52c0b0 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -11,7 +11,7 @@
* src/backend/access/heap/visibilitymap.c
*
* INTERFACE ROUTINES
- * visibilitymap_clear - clear a bit in the visibility map
+ * visibilitymap_clear - clear all bits for one page in the visibility map
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
@@ -157,23 +157,34 @@ static const uint8 number_of_ones_for_frozen[256] = {
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
static void vm_extend(Relation rel, BlockNumber nvmblocks);
+/*
+ * A shorthand for visibilitymap_clear_extended, for clearing all bits for one
+ * page in visibility map with VISIBILITYMAP_VALID_BITS.
+ */
+void
+visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+{
+ visibilitymap_clear_extended(rel, heapBlk, buf, VISIBILITYMAP_VALID_BITS);
+}
/*
- * visibilitymap_clear - clear all bits for one page in visibility map
+ * visibilitymap_clear_extended - clear bit(s) for one page in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
* any I/O.
*/
void
-visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+visibilitymap_clear_extended(Relation rel, BlockNumber heapBlk, Buffer buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
- uint8 mask = VISIBILITYMAP_VALID_BITS << mapOffset;
+ uint8 mask = flags << mapOffset;
char *map;
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
+
#ifdef TRACE_VISIBILITYMAP
elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
#endif
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index fca99ca..f305b03 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -36,6 +36,8 @@
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
+extern void visibilitymap_clear_extended(Relation rel, BlockNumber heapBlk,
+ Buffer vmbuf, uint8 flags);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote:
Hm. We can't easily do that in the back-patched version; because a
standby won't know to check for the flag . That's kinda ok, since we
don't yet need to clear all-visible yet at that point of
heap_update. But that better means we don't do so on the master either.
Is there any reason to back-patch this in the first place?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote:
Hm. We can't easily do that in the back-patched version; because a
standby won't know to check for the flag . That's kinda ok, since we
don't yet need to clear all-visible yet at that point of
heap_update. But that better means we don't do so on the master either.
Is there any reason to back-patch this in the first place?
Wasn't this determined to be a pre-existing bug? I think the
probability of occurrence has increased, but it's still possible in
earlier releases. I wonder if there are unexplained bugs that can be
traced down to this.
I'm not really following this (sorry about that) but I wonder if (in
back branches) the failure to propagate in case the standby wasn't
updated can cause actual problems. If it does, maybe it'd be a better
idea to have a new WAL record type instead of piggybacking on lock
tuple. Then again, apparently the probability of this bug is low enough
that we shouldn't sweat over it ... Moreso considering Robert's apparent
opinion that perhaps we shouldn't backpatch at all in the first place.
In any case I would like to see much more commentary in the patch next
to the new XLHL flag. It's sufficiently different from the rest that it
deserves it, IMO.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jul 7, 2016 at 10:53 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Robert Haas wrote:
On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote:
Hm. We can't easily do that in the back-patched version; because a
standby won't know to check for the flag . That's kinda ok, since we
don't yet need to clear all-visible yet at that point of
heap_update. But that better means we don't do so on the master either.
Is there any reason to back-patch this in the first place?
Wasn't this determined to be a pre-existing bug? I think the
probability of occurrence has increased, but it's still possible in
earlier releases. I wonder if there are unexplained bugs that can be
traced down to this.
I'm not really following this (sorry about that) but I wonder if (in
back branches) the failure to propagate in case the standby wasn't
updated can cause actual problems. If it does, maybe it'd be a better
idea to have a new WAL record type instead of piggybacking on lock
tuple. Then again, apparently the probability of this bug is low enough
that we shouldn't sweat over it ... More so considering Robert's apparent
opinion that perhaps we shouldn't backpatch at all in the first place.
In any case I would like to see much more commentary in the patch next
to the new XLHL flag. It's sufficiently different from the rest that it
deserves it, IMO.
There are two issues being discussed on this thread. One of them is a
new issue in 9.6: heap_lock_tuple needs to clear the all-frozen bit in
the freeze map even though it does not clear all-visible. The one
that's actually a preexisting bug is that we can start to update a
tuple without WAL-logging anything and then release the page lock in
order to go perform TOAST insertions. At this point, other backends
(on the master) will see this tuple as in the process of being updated
because xmax has been set and ctid has been made to point back to the
same tuple.
I'm guessing that if the UPDATE goes on to complete, any discrepancy
between the master and the standby is erased by the replay of the WAL
record covering the update itself. I haven't checked that, but it
seems like that WAL record must set both xmax and ctid appropriately
or we'd be in big trouble. The infomask bits are in play too, but
presumably the update's WAL is going to set those correctly also. So
in this case I don't think there's really any issue for the standby.
Or for the master, either: it may technically be true the tuple is not
all-visible any more, but the only backend that could potentially fail
to see it is the one performing the update, and no user code can run
in the middle of toast_insert_or_update, so I think we're OK.
On the other hand, if the UPDATE aborts, there's now a persistent
difference between the master and standby: the infomask, xmax, and
ctid of the tuple may differ. I don't know whether that could cause
any problem. It's probably a very rare case, because there aren't all
that many things that will cause us to abort in the middle of
inserting TOAST tuples. Out of disk space comes to mind, or maybe
some kind of corruption that throws an elog().
As far as back-patching goes, the question is whether it's worth the
risk. Introducing new WAL logging at this point could certainly cause
performance problems if nothing else, never mind the risk of
garden-variety bugs. I'm not sure it's worth taking that risk in
released branches for the sake of a bug which has existed for a decade
without anybody finding it until now. I'm not going to argue strongly
for that position, but I think it's worth thinking about.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-07-07 10:37:15 -0400, Robert Haas wrote:
On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote:
Hm. We can't easily do that in the back-patched version; because a
standby won't know to check for the flag. That's kinda ok, since we
don't need to clear all-visible yet at that point of
heap_update. But that'd better mean we don't do so on the master either.
Is there any reason to back-patch this in the first place?
It seems not unlikely that this has caused corruption in the past; and
that we chalked it up to hardware corruption or something. Both toasting
and file extension frequently take extended amounts of time under load,
the window for crashing in the wrong moment isn't small...
Andres
On Thu, Jul 7, 2016 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-07 10:37:15 -0400, Robert Haas wrote:
On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote:
Hm. We can't easily do that in the back-patched version; because a
standby won't know to check for the flag. That's kinda ok, since we
don't need to clear all-visible yet at that point of
heap_update. But that'd better mean we don't do so on the master either.
Is there any reason to back-patch this in the first place?
It seems not unlikely that this has caused corruption in the past; and
that we chalked it up to hardware corruption or something. Both toasting
and file extension frequently take extended amounts of time under load,
the window for crashing in the wrong moment isn't small...
Yeah, that's true, but I'm having a bit of trouble imagining exactly how
we end up with corruption that actually matters. I guess a torn page
could do it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-07-07 14:01:05 -0400, Robert Haas wrote:
On Thu, Jul 7, 2016 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-07 10:37:15 -0400, Robert Haas wrote:
On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote:
Hm. We can't easily do that in the back-patched version; because a
standby won't know to check for the flag. That's kinda ok, since we
don't need to clear all-visible yet at that point of
heap_update. But that'd better mean we don't do so on the master either.
Is there any reason to back-patch this in the first place?
It seems not unlikely that this has caused corruption in the past; and
that we chalked it up to hardware corruption or something. Both toasting
and file extension frequently take extended amounts of time under load,
the window for crashing in the wrong moment isn't small...
Yeah, that's true, but I'm having a bit of trouble imagining exactly how
we end up with corruption that actually matters. I guess a torn page
could do it.
I think Noah pointed out a bad scenario: If we crash after putting the
xid in the page header, but before WAL logging, the xid could get reused
after the crash. By a different transaction. And suddenly the row isn't
visible anymore, after the reused xid commits...
On Thu, Jul 7, 2016 at 2:04 PM, Andres Freund <andres@anarazel.de> wrote:
Yeah, that's true, but I'm having a bit of trouble imagining exactly how
we end up with corruption that actually matters. I guess a torn page
could do it.
I think Noah pointed out a bad scenario: If we crash after putting the
xid in the page header, but before WAL logging, the xid could get reused
after the crash. By a different transaction. And suddenly the row isn't
visible anymore, after the reused xid commits...
Oh, wow. Yikes. OK, so I guess we should try to back-patch the fix, then.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 7, 2016 at 12:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for reviewing!
On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote:
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..fd66527 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,17 @@ l2:
if (need_toast || newtupsize > pagefree)
{
+ /*
+ * To prevent data corruption due to updating old tuple by
+ * other backends after released buffer
That's not really the reason, is it? The prime problem is crash safety /
replication. The row-lock we're faking (by setting xmax to our xid),
prevents concurrent updates until this backend died.
Fixed.
, we need to emit that
+ * xmax of old tuple is set and clear visibility map bits if
+ * needed before releasing buffer. We can reuse xl_heap_lock
+ * for this purpose. It should be fine even if we crash midway
+ * from this section and the actual updating one later, since
+ * the xmax will appear to come from an aborted xid.
+ */
+ START_CRIT_SECTION();
+
/* Clear obsolete visibility flags ... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3947,46 @@ l2:
/* temporarily make it look not-updated */
oldtup.t_data->t_ctid = oldtup.t_self;
already_marked = true;
+
+ /* Clear PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(BufferGetPage(buffer)))
+ {
+ all_visible_cleared = true;
+ PageClearAllVisible(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer);
+ }
+
+ MarkBufferDirty(buffer);
+
+ if (RelationNeedsWAL(relation))
+ {
+ xl_heap_lock xlrec;
+ XLogRecPtr recptr;
+
+ /*
+ * For logical decoding we need combocids to properly decode the
+ * catalog.
+ */
+ if (RelationIsAccessibleInLogicalDecoding(relation))
+ log_heap_new_cid(relation, &oldtup);
Hm, I don't see that being necessary here. Row locks aren't logically
decoded, so there's no need to emit this here.
Fixed.
+ /* Clear PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(page))
+ {
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber block = BufferGetBlockNumber(*buffer);
+
+ all_visible_cleared = true;
+ PageClearAllVisible(page);
+ visibilitymap_pin(relation, block, &vmbuffer);
+ visibilitymap_clear(relation, block, vmbuffer);
+ }
+
That clears all-visible unnecessarily, we only need to clear all-frozen.
Fixed.
@@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record)
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* The visibility map need to be cleared */
+ if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0)
+ {
+ RelFileNode rnode;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber blkno;
+ Relation reln;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ reln = CreateFakeRelcacheEntry(rnode);
+
+ visibilitymap_pin(reln, blkno, &vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer);
+ PageClearAllVisible(page);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index a822d0b..41b3c54 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info
#define XLHL_XMAX_EXCL_LOCK 0x04
#define XLHL_XMAX_KEYSHR_LOCK 0x08
#define XLHL_KEYS_UPDATED 0x10
+#define XLHL_ALL_VISIBLE_CLEARED 0x20
Hm. We can't easily do that in the back-patched version, because a
standby won't know to check for the flag. That's kinda ok, since we
don't need to clear all-visible yet at that point of
heap_update. But that'd better mean we don't do so on the master either.
Attached latest version patch.
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(BufferGetPage(buffer)) &&
+ VM_ALL_FROZEN(relation, block, &vmbuffer))
+ {
+ visibilitymap_clear_extended(relation, block, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ }
+
+ if (RelationNeedsWAL(relation))
+ {
..
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+ xlrec.locking_xid = xmax_old_tuple;
+ xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+ oldtup.t_data->t_infomask2);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+ recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
..
One thing that looks awkward in this code is that it doesn't record
whether the frozen bit is actually cleared during the actual operation
and then during replay it always clears the frozen bit, irrespective of
whether it has been cleared by the actual operation or not.
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(page) &&
+ VM_ALL_FROZEN(relation, BufferGetBlockNumber(*buffer), &vmbuffer))
+ {
+ BlockNumber block = BufferGetBlockNumber(*buffer);
+
+ visibilitymap_pin(relation, block, &vmbuffer);
I think it is not right to call visibilitymap_pin after holding a
buffer lock (visibilitymap_pin can perform I/O). Refer heap_update
for how to pin the visibility map.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 8, 2016 at 10:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jul 7, 2016 at 12:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for reviewing!
On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote:
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..fd66527 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,17 @@ l2:
if (need_toast || newtupsize > pagefree)
{
+ /*
+ * To prevent data corruption due to updating old tuple by
+ * other backends after released buffer
That's not really the reason, is it? The prime problem is crash safety /
replication. The row-lock we're faking (by setting xmax to our xid),
prevents concurrent updates until this backend died.
Fixed.
, we need to emit that
+ * xmax of old tuple is set and clear visibility map bits if
+ * needed before releasing buffer. We can reuse xl_heap_lock
+ * for this purpose. It should be fine even if we crash midway
+ * from this section and the actual updating one later, since
+ * the xmax will appear to come from an aborted xid.
+ */
+ START_CRIT_SECTION();
+
/* Clear obsolete visibility flags ... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3947,46 @@ l2:
/* temporarily make it look not-updated */
oldtup.t_data->t_ctid = oldtup.t_self;
already_marked = true;
+
+ /* Clear PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(BufferGetPage(buffer)))
+ {
+ all_visible_cleared = true;
+ PageClearAllVisible(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer);
+ }
+
+ MarkBufferDirty(buffer);
+
+ if (RelationNeedsWAL(relation))
+ {
+ xl_heap_lock xlrec;
+ XLogRecPtr recptr;
+
+ /*
+ * For logical decoding we need combocids to properly decode the
+ * catalog.
+ */
+ if (RelationIsAccessibleInLogicalDecoding(relation))
+ log_heap_new_cid(relation, &oldtup);
Hm, I don't see that being necessary here. Row locks aren't logically
decoded, so there's no need to emit this here.
Fixed.
+ /* Clear PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(page))
+ {
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber block = BufferGetBlockNumber(*buffer);
+
+ all_visible_cleared = true;
+ PageClearAllVisible(page);
+ visibilitymap_pin(relation, block, &vmbuffer);
+ visibilitymap_clear(relation, block, vmbuffer);
+ }
+
That clears all-visible unnecessarily, we only need to clear all-frozen.
Fixed.
@@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record)
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* The visibility map need to be cleared */
+ if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0)
+ {
+ RelFileNode rnode;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber blkno;
+ Relation reln;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ reln = CreateFakeRelcacheEntry(rnode);
+
+ visibilitymap_pin(reln, blkno, &vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer);
+ PageClearAllVisible(page);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index a822d0b..41b3c54 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info
#define XLHL_XMAX_EXCL_LOCK 0x04
#define XLHL_XMAX_KEYSHR_LOCK 0x08
#define XLHL_KEYS_UPDATED 0x10
+#define XLHL_ALL_VISIBLE_CLEARED 0x20
Hm. We can't easily do that in the back-patched version, because a
standby won't know to check for the flag. That's kinda ok, since we
don't need to clear all-visible yet at that point of
heap_update. But that'd better mean we don't do so on the master either.
Attached latest version patch.
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(BufferGetPage(buffer)) &&
+ VM_ALL_FROZEN(relation, block, &vmbuffer))
+ {
+ visibilitymap_clear_extended(relation, block, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ }
+
+ if (RelationNeedsWAL(relation))
+ {
..
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+ xlrec.locking_xid = xmax_old_tuple;
+ xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+ oldtup.t_data->t_infomask2);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+ recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
..
One thing that looks awkward in this code is that it doesn't record
whether the frozen bit is actually cleared during the actual operation
and then during replay it always clears the frozen bit, irrespective of
whether it has been cleared by the actual operation or not.
I changed it so that we look all-frozen bit up first, and then clear
it if needed.
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(page) &&
+ VM_ALL_FROZEN(relation, BufferGetBlockNumber(*buffer), &vmbuffer))
+ {
+ BlockNumber block = BufferGetBlockNumber(*buffer);
+
+ visibilitymap_pin(relation, block, &vmbuffer);
I think it is not right to call visibilitymap_pin after holding a
buffer lock (visibilitymap_pin can perform I/O). Refer heap_update
for how to pin the visibility map.
Thank you for your advice!
Fixed.
Attached are the two separated patches; please give me feedback.
Regards,
--
Masahiko Sawada
Attachments:
Attachment: 0001-Fix-heap_udpate-set-xmax-without-WAL-logging-in-the-.patch (text/plain, US-ASCII)
From 2020960bc7688afa8b0526c857a0e0af8b20d370 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 11 Jul 2016 12:39:32 -0700
Subject: [PATCH 1/2] Fix heap_udpate set xmax without WAL logging in the
already_marked = true case.
---
src/backend/access/heap/heapam.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..e7cb8ca 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,16 @@ l2:
if (need_toast || newtupsize > pagefree)
{
+ /*
+ * For crash safety, we need to emit that xmax of old tuple is set
+ * and clear only the all-frozen bit on visibility map if needed
+ * before releasing the buffer. We can reuse xl_heap_lock for this
+ * purpose. It should be fine even if we crash midway from this
+ * section and the actual updating one later, since the xmax will
+ * appear to come from an aborted xid.
+ */
+ START_CRIT_SECTION();
+
/* Clear obsolete visibility flags ... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3946,28 @@ l2:
/* temporarily make it look not-updated */
oldtup.t_data->t_ctid = oldtup.t_self;
already_marked = true;
+
+ MarkBufferDirty(buffer);
+
+ if (RelationNeedsWAL(relation))
+ {
+ xl_heap_lock xlrec;
+ XLogRecPtr recptr;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+ xlrec.locking_xid = xmax_old_tuple;
+ xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+ oldtup.t_data->t_infomask2);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+ recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
/*
--
2.8.1
Attachment: 0002-Clear-all-frozen-bit-on-visibility-map-when-xmax-is-.patch (text/plain, US-ASCII)
From 37e8901df9fe6aa6125059257dc6b8d659bb2929 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 11 Jul 2016 12:44:57 -0700
Subject: [PATCH 2/2] Clear all-frozen bit on visibility map when xmax is set.
---
src/backend/access/heap/heapam.c | 49 +++++++++++++++++++++++++++++++++
src/backend/access/heap/visibilitymap.c | 19 ++++++++++---
src/include/access/visibilitymap.h | 2 ++
3 files changed, 66 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e7cb8ca..23e9b75 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3947,6 +3947,14 @@ l2:
oldtup.t_data->t_ctid = oldtup.t_self;
already_marked = true;
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(BufferGetPage(buffer)) &&
+ VM_ALL_FROZEN(relation, block, &vmbuffer))
+ {
+ visibilitymap_clear_extended(relation, block, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ }
+
MarkBufferDirty(buffer);
if (RelationNeedsWAL(relation))
@@ -4538,6 +4546,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemPointer tid = &(tuple->t_self);
ItemId lp;
Page page;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber block;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4547,6 +4557,15 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
bool have_tuple_lock = false;
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+ block = ItemPointerGetBlockNumber(tid);
+
+ /*
+ * Before locking the buffer, pin the visibility map page if it appears
+ * to be necessary
+ */
+ if (PageIsAllVisible(BufferGetPage(*buffer)))
+ visibilitymap_pin(relation, block, &vmbuffer);
+
LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
page = BufferGetPage(*buffer);
@@ -5066,6 +5085,15 @@ failed:
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
tuple->t_data->t_ctid = *tid;
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(page) &&
+ VM_ALL_FROZEN(relation, block, &vmbuffer))
+ {
+ visibilitymap_clear_extended(relation, block, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ }
+
+
MarkBufferDirty(*buffer);
/*
@@ -5104,6 +5132,8 @@ failed:
END_CRIT_SECTION();
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
/*
* Don't update the visibility map here. Locking a tuple doesn't change
@@ -8726,6 +8756,25 @@ heap_xlog_lock(XLogReaderState *record)
}
HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
+
+ /* The all-frozen bit on visibility map needs to be cleared if needed */
+ if (PageIsAllVisible(BufferGetPage(buffer)))
+ {
+ RelFileNode rnode;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber block;
+ Relation reln;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &block);
+ reln = CreateFakeRelcacheEntry(rnode);
+ visibilitymap_pin(reln, block, &vmbuffer);
+
+ if (VM_ALL_FROZEN(reln, block, &vmbuffer))
+ visibilitymap_clear_extended(reln, block, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ ReleaseBuffer(vmbuffer);
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b472d31..c52c0b0 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -11,7 +11,7 @@
* src/backend/access/heap/visibilitymap.c
*
* INTERFACE ROUTINES
- * visibilitymap_clear - clear a bit in the visibility map
+ * visibilitymap_clear - clear all bits for one page in the visibility map
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
@@ -157,23 +157,34 @@ static const uint8 number_of_ones_for_frozen[256] = {
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
static void vm_extend(Relation rel, BlockNumber nvmblocks);
+/*
+ * A shorthand for visibilitymap_clear_extended, for clearing all bits for one
+ * page in visibility map with VISIBILITYMAP_VALID_BITS.
+ */
+void
+visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+{
+ visibilitymap_clear_extended(rel, heapBlk, buf, VISIBILITYMAP_VALID_BITS);
+}
/*
- * visibilitymap_clear - clear all bits for one page in visibility map
+ * visibilitymap_clear_extended - clear bit(s) for one page in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
* any I/O.
*/
void
-visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+visibilitymap_clear_extended(Relation rel, BlockNumber heapBlk, Buffer buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
- uint8 mask = VISIBILITYMAP_VALID_BITS << mapOffset;
+ uint8 mask = flags << mapOffset;
char *map;
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
+
#ifdef TRACE_VISIBILITYMAP
elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
#endif
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index fca99ca..f305b03 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -36,6 +36,8 @@
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
+extern void visibilitymap_clear_extended(Relation rel, BlockNumber heapBlk,
+ Buffer vmbuf, uint8 flags);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
--
2.8.1
Hi,
So I'm generally happy with 0001, baring some relatively minor
adjustments. I am however wondering about one thing:
On 2016-07-11 23:51:05 +0900, Masahiko Sawada wrote:
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..e7cb8ca 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,16 @@ l2:
if (need_toast || newtupsize > pagefree)
{
+ /*
+ * For crash safety, we need to emit that xmax of old tuple is set
+ * and clear only the all-frozen bit on visibility map if needed
+ * before releasing the buffer. We can reuse xl_heap_lock for this
+ * purpose. It should be fine even if we crash midway from this
+ * section and the actual updating one later, since the xmax will
+ * appear to come from an aborted xid.
+ */
+ START_CRIT_SECTION();
+
/* Clear obsolete visibility flags ... */
oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3946,28 @@ l2:
/* temporarily make it look not-updated */
oldtup.t_data->t_ctid = oldtup.t_self;
already_marked = true;
+
+ MarkBufferDirty(buffer);
+
+ if (RelationNeedsWAL(relation))
+ {
+ xl_heap_lock xlrec;
+ XLogRecPtr recptr;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+ xlrec.locking_xid = xmax_old_tuple;
+ xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+ oldtup.t_data->t_infomask2);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+ recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+ PageSetLSN(page, recptr);
+ }
Master does
/* temporarily make it look not-updated */
oldtup.t_data->t_ctid = oldtup.t_self;
here, and as is the wal record won't reflect that, because:
static void
heap_xlog_lock(XLogReaderState *record)
{
...
/*
* Clear relevant update flags, but only if the modified infomask says
* there's no update.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask))
{
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
ItemPointerSet(&htup->t_ctid,
BufferGetBlockNumber(buffer),
offnum);
}
won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which
will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and
standby / after crash recovery. I'm failing to see any harmful
consequences right now, but differences between master and standby are a bad
thing. Pre 9.3 that's not a problem, we reset ctid and HOT_UPDATED
unconditionally there. I think I'm more comfortable with setting
HEAP_XMAX_LOCK_ONLY until the tuple is finally updated - that also
coincides more closely with the actual meaning.
Any arguments against?
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(BufferGetPage(buffer)) &&
+ VM_ALL_FROZEN(relation, block, &vmbuffer))
+ {
+ visibilitymap_clear_extended(relation, block, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ }
+
FWIW, I don't think it's worth introducing visibilitymap_clear_extended.
As this is a 9.6-only patch, I think it's better to change
visibilitymap_clear's API.
Unless somebody protests I'm planning to commit with those adjustments
tomorrow.
Greetings,
Andres Freund
On Thu, Jul 14, 2016 at 11:36 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
Master does
/* temporarily make it look not-updated */
oldtup.t_data->t_ctid = oldtup.t_self;
here, and as is the wal record won't reflect that, because:
static void
heap_xlog_lock(XLogReaderState *record)
{
...
/*
* Clear relevant update flags, but only if the modified infomask says
* there's no update.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask))
{
HeapTupleHeaderClearHotUpdated(htup);
/* Make sure there is no forward chain link in t_ctid */
ItemPointerSet(&htup->t_ctid,
BufferGetBlockNumber(buffer),
offnum);
}
won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which
will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and
standby / after crash recovery. I'm failing to see any harmful
consequences right now, but differences between master and standby are a bad
thing. Pre 9.3 that's not a problem, we reset ctid and HOT_UPDATED
unconditionally there. I think I'm more comfortable with setting
HEAP_XMAX_LOCK_ONLY until the tuple is finally updated - that also
coincides more closely with the actual meaning.
Just thinking out loud. If we set HEAP_XMAX_LOCK_ONLY during update,
then won't it impact the return value of
HeapTupleHeaderIsOnlyLocked()? It will start returning true whereas
otherwise I think it would have returned false due to in_progress
transaction. As HeapTupleHeaderIsOnlyLocked() is being used at many
places, it might impact those cases, I have not checked in deep
whether such an impact would cause any real issue, but it seems to me
that some analysis is needed there unless you think we are safe with
respect to that.
Any arguments against?
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(BufferGetPage(buffer)) &&
+ VM_ALL_FROZEN(relation, block, &vmbuffer))
+ {
+ visibilitymap_clear_extended(relation, block, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ }
+
FWIW, I don't think it's worth introducing visibilitymap_clear_extended.
As this is a 9.6-only patch, I think it's better to change
visibilitymap_clear's API.
Unless somebody protests I'm planning to commit with those adjustments
tomorrow.
Do you think performance tests done by Sawada-san are sufficient to
proceed here?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2016-07-14 18:12:42 +0530, Amit Kapila wrote:
Just thinking out loud. If we set HEAP_XMAX_LOCK_ONLY during update,
then won't it impact the return value of
HeapTupleHeaderIsOnlyLocked()? It will start returning true whereas
otherwise I think it would have returned false due to the in-progress
transaction. As HeapTupleHeaderIsOnlyLocked() is used in many
places, it might impact those cases. I have not checked in depth
whether such an impact would cause any real issue, but it seems to me
that some analysis is needed there, unless you think we are safe in
that respect.
I don't think that's an issue: Right now the row will be considered
deleted in that moment; with the change it's considered locked. The
latter is surely more appropriate.
Any arguments against?
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(BufferGetPage(buffer)) &&
+     VM_ALL_FROZEN(relation, block, &vmbuffer))
+ {
+     visibilitymap_clear_extended(relation, block, vmbuffer,
+                                  VISIBILITYMAP_ALL_FROZEN);
+ }

FWIW, I don't think it's worth introducing visibilitymap_clear_extended.
As this is a 9.6-only patch, I think it's better to change
visibilitymap_clear's API.

Unless somebody protests I'm planning to commit with those adjustments
tomorrow.

Do you think performance tests done by Sawada-san are sufficient to
proceed here?
I'm doing some more, but generally yes. I also don't think we have much
of a choice anyway.
Greetings,
Andres Freund
On 2016-07-13 23:06:07 -0700, Andres Freund wrote:
won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which
will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and
standby / after crash recovery. I'm failing to see any harmful
consequences right now, but differences between master and standby are a bad
thing.
I think it's actually critical, because HEAP_HOT_UPDATED /
HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like
heap_hot_search_buffer()).
Andres
On 2016-07-14 20:53:07 -0700, Andres Freund wrote:
On 2016-07-13 23:06:07 -0700, Andres Freund wrote:
won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which
will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and
standby / after crash recovery. I'm failing to see any harmful
consequences right now, but differences between master and standby are a bad
thing.

I think it's actually critical, because HEAP_HOT_UPDATED /
HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like
heap_hot_search_buffer()).
I've pushed a quite heavily revised version of the first patch to
9.1-master. I manually verified using pageinspect, gdb breakpoints and a
standby that xmax, infomask etc are set correctly (leading to finding
a4d357bf). As there's noticeable differences, especially 9.2->9.3,
between versions, I'd welcome somebody having a look at the commits.
Regards,
Andres
On 2016-07-13 23:06:07 -0700, Andres Freund wrote:
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(BufferGetPage(buffer)) &&
+     VM_ALL_FROZEN(relation, block, &vmbuffer))
+ {
+     visibilitymap_clear_extended(relation, block, vmbuffer,
+                                  VISIBILITYMAP_ALL_FROZEN);
+ }

FWIW, I don't think it's worth introducing visibilitymap_clear_extended.
As this is a 9.6-only patch, I think it's better to change
visibilitymap_clear's API.
Besides that easily fixed issue, the code also has the significant issue
that it's only performing the visibilitymap processing in the
BLK_NEEDS_REDO case. But that's not ok, because both in the BLK_RESTORED
and the BLK_DONE cases the visibilitymap isn't guaranteed (or even
likely in the former case) to have been updated.
I think we have two choices how to deal with that: First, we can add a
new flags variable to xl_heap_lock similar to
xl_heap_insert/update/... and bump page magic, or we can squeeze the
information into infobits_set. The latter seems fairly ugly, and
fragile to me; so unless somebody protests I'm going with the former. I
think due to padding the additional byte doesn't make any size
difference anyway.
Regards,
Andres
On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-13 23:06:07 -0700, Andres Freund wrote:
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(BufferGetPage(buffer)) &&
+     VM_ALL_FROZEN(relation, block, &vmbuffer))
+ {
+     visibilitymap_clear_extended(relation, block, vmbuffer,
+                                  VISIBILITYMAP_ALL_FROZEN);
+ }

FWIW, I don't think it's worth introducing visibilitymap_clear_extended.
As this is a 9.6-only patch, I think it's better to change
visibilitymap_clear's API.

Besides that easily fixed issue, the code also has the significant issue
that it's only performing the visibilitymap processing in the
BLK_NEEDS_REDO case. But that's not ok, because both in the BLK_RESTORED
and the BLK_DONE cases the visibilitymap isn't guaranteed (or even
likely in the former case) to have been updated.

I think we have two choices how to deal with that: First, we can add a
new flags variable to xl_heap_lock similar to
xl_heap_insert/update/... and bump page magic,
+1 for going in this way. This will keep us consistent with how we clear
the visibility info in other places, like heap_xlog_update().
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> wrote:
I think we have two choices how to deal with that: First, we can add a
new flags variable to xl_heap_lock similar to
xl_heap_insert/update/... and bump page magic,
+1 for going in this way. This will keep us consistent with how we clear
the visibility info in other places, like heap_xlog_update().
Yeah. We've already forced a catversion bump for beta3, and I'm about
to go fix PG_CONTROL_VERSION as well, so there's basically no downside
to doing an xlog version bump as well. At least, not if you can get it
in before Monday.
regards, tom lane
On July 16, 2016 8:49:06 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> wrote:
I think we have two choices how to deal with that: First, we can add a
new flags variable to xl_heap_lock similar to
xl_heap_insert/update/... and bump page magic,

+1 for going in this way. This will keep us consistent with how we clear
the visibility info in other places, like heap_xlog_update().
Yeah. We've already forced a catversion bump for beta3, and I'm about
to go fix PG_CONTROL_VERSION as well, so there's basically no downside
to doing an xlog version bump as well. At least, not if you can get it
in before Monday.
OK, Cool. Will do it later today.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 2016-07-16 10:45:26 -0700, Andres Freund wrote:
On July 16, 2016 8:49:06 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> wrote:
I think we have two choices how to deal with that: First, we can add a
new flags variable to xl_heap_lock similar to
xl_heap_insert/update/... and bump page magic,

+1 for going in this way. This will keep us consistent with how we clear
the visibility info in other places, like heap_xlog_update().

Yeah. We've already forced a catversion bump for beta3, and I'm about
to go fix PG_CONTROL_VERSION as well, so there's basically no downside
to doing an xlog version bump as well. At least, not if you can get it
in before Monday.

OK, Cool. Will do it later today.
Took till today. Attached is a rather heavily revised version of
Sawada-san's patch. Most notably the recovery routines take care to
reset the vm in all cases, we don't perform visibilitymap_get_status
from inside a critical section anymore, and
heap_lock_updated_tuple_rec() also resets the vm (although I'm not
entirely sure that can practically be hit).
I'm doing some more testing, and Robert said he could take a quick look
at the patch. If somebody else... Will push sometime after dinner.
Regards,
Andres
Attachments:
0001-Clear-all-frozen-visibilitymap-status-when-locking-t.patch (text/x-patch; charset=us-ascii)
From 26f6eff8cef9b436e328a7364d6e4954b702208b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 17 Jul 2016 19:30:38 -0700
Subject: [PATCH] Clear all-frozen visibilitymap status when locking tuples.
Since a892234 & fd31cd265 the visibilitymap's freeze bit is used to
avoid vacuuming the whole relation in anti-wraparound vacuums. Doing so
correctly relies on not adding xids to the heap without also unsetting
the visibilitymap flag. Tuple locking related code has not done so.
To avoid pessimizing heap_lock_tuple, allow selectively resetting only
the all-frozen bit with visibilitymap_clear(). To avoid having to use
visibilitymap_get_status (e.g. via VM_ALL_FROZEN) inside a critical
section, have visibilitymap_clear() return whether any bits have been
reset.
The flags fields added to xl_heap_lock and xl_heap_lock_updated
require bumping the WAL magic. Since there's already been a catversion
bump since the last beta, that's not an issue.
Author: Masahiko Sawada, heavily revised by Andres Freund
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: -
---
src/backend/access/heap/heapam.c | 126 +++++++++++++++++++++++++++++---
src/backend/access/heap/visibilitymap.c | 18 +++--
src/backend/access/rmgrdesc/heapdesc.c | 6 +-
src/backend/commands/vacuumlazy.c | 6 +-
src/include/access/heapam_xlog.h | 9 ++-
src/include/access/visibilitymap.h | 4 +-
src/include/access/xlog_internal.h | 2 +-
7 files changed, 145 insertions(+), 26 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2815d91..1216f3f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2423,7 +2423,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_VALID_BITS);
}
/*
@@ -2737,7 +2737,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
PageClearAllVisible(page);
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_VALID_BITS);
}
/*
@@ -3239,7 +3239,7 @@ l1:
all_visible_cleared = true;
PageClearAllVisible(page);
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_VALID_BITS);
}
/* store transaction information of xact deleting the tuple */
@@ -3925,6 +3925,7 @@ l2:
TransactionId xmax_lock_old_tuple;
uint16 infomask_lock_old_tuple,
infomask2_lock_old_tuple;
+ bool cleared_all_frozen = false;
/*
* To prevent concurrent sessions from updating the tuple, we have to
@@ -3968,6 +3969,17 @@ l2:
/* temporarily make it look not-updated, but locked */
oldtup.t_data->t_ctid = oldtup.t_self;
+ /*
+ * Clear all-frozen bit on visibility map if needed. We could
+ * immediately reset ALL_VISIBLE, but given that the WAL logging
+ * overhead would be unchanged, that doesn't seem necessarily
+ * worthwhile.
+ */
+ if (PageIsAllVisible(BufferGetPage(buffer)) &&
+ visibilitymap_clear(relation, block, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN))
+ cleared_all_frozen = true;
+
MarkBufferDirty(buffer);
if (RelationNeedsWAL(relation))
@@ -3982,6 +3994,8 @@ l2:
xlrec.locking_xid = xmax_lock_old_tuple;
xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
oldtup.t_data->t_infomask2);
+ xlrec.flags =
+ cleared_all_frozen ? XLH_LOCK_ALL_FROZEN_CLEARED : 0;
XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
PageSetLSN(page, recptr);
@@ -4159,20 +4173,20 @@ l2:
/* record address of new tuple in t_ctid of old one */
oldtup.t_data->t_ctid = heaptup->t_self;
- /* clear PD_ALL_VISIBLE flags */
+ /* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_VALID_BITS);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
PageClearAllVisible(BufferGetPage(newbuf));
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
- vmbuffer_new);
+ vmbuffer_new, VISIBILITYMAP_VALID_BITS);
}
if (newbuf != buffer)
@@ -4556,6 +4570,8 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
ItemPointer tid = &(tuple->t_self);
ItemId lp;
Page page;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber block;
TransactionId xid,
xmax;
uint16 old_infomask,
@@ -4563,8 +4579,18 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
new_infomask2;
bool first_time = true;
bool have_tuple_lock = false;
+ bool cleared_all_frozen = false;
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+ block = ItemPointerGetBlockNumber(tid);
+
+ /*
+ * Before locking the buffer, pin the visibility map page if it may be
+ * necessary.
+ */
+ if (PageIsAllVisible(BufferGetPage(*buffer)))
+ visibilitymap_pin(relation, block, &vmbuffer);
+
LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
page = BufferGetPage(*buffer);
@@ -5094,6 +5120,13 @@ failed:
if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
tuple->t_data->t_ctid = *tid;
+ /* Clear only the all-frozen bit on visibility map if needed */
+ if (PageIsAllVisible(page) &&
+ visibilitymap_clear(relation, block, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN))
+ cleared_all_frozen = true;
+
+
MarkBufferDirty(*buffer);
/*
@@ -5120,6 +5153,7 @@ failed:
xlrec.locking_xid = xid;
xlrec.infobits_set = compute_infobits(new_infomask,
tuple->t_data->t_infomask2);
+ xlrec.flags = cleared_all_frozen ? XLH_LOCK_ALL_FROZEN_CLEARED : 0;
XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
/* we don't decode row locks atm, so no need to log the origin */
@@ -5132,6 +5166,8 @@ failed:
END_CRIT_SECTION();
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
/*
* Don't update the visibility map here. Locking a tuple doesn't change
@@ -5587,6 +5623,9 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
TransactionId xmax,
new_xmax;
TransactionId priorXmax = InvalidTransactionId;
+ bool cleared_all_frozen = false;
+ Buffer vmbuffer;
+ BlockNumber block;
ItemPointerCopy(tid, &tupid);
@@ -5594,6 +5633,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
{
new_infomask = 0;
new_xmax = InvalidTransactionId;
+ block = ItemPointerGetBlockNumber(&tupid);
ItemPointerCopy(&tupid, &(mytup.t_self));
if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
@@ -5610,6 +5650,16 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
l4:
CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Before locking the buffer, pin the visibility map page if it may be
+ * necessary.
+ */
+ if (PageIsAllVisible(BufferGetPage(buf)))
+ visibilitymap_pin(rel, block, &vmbuffer);
+ else
+ vmbuffer = InvalidBuffer;
+
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
/*
@@ -5749,6 +5799,11 @@ l4:
xid, mode, false,
&new_xmax, &new_infomask, &new_infomask2);
+ if (PageIsAllVisible(BufferGetPage(buf)) &&
+ visibilitymap_clear(rel, block, vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN))
+ cleared_all_frozen = true;
+
START_CRIT_SECTION();
/* ... and set them */
@@ -5773,6 +5828,8 @@ l4:
xlrec.offnum = ItemPointerGetOffsetNumber(&mytup.t_self);
xlrec.xmax = new_xmax;
xlrec.infobits_set = compute_infobits(new_infomask, new_infomask2);
+ xlrec.flags =
+ cleared_all_frozen ? XLH_LOCK_ALL_FROZEN_CLEARED : 0;
XLogRegisterData((char *) &xlrec, SizeOfHeapLockUpdated);
@@ -5789,6 +5846,9 @@ l4:
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
UnlockReleaseBuffer(buf);
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
+
return HeapTupleMayBeUpdated;
}
@@ -5796,6 +5856,8 @@ l4:
priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
UnlockReleaseBuffer(buf);
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
}
}
@@ -8107,7 +8169,7 @@ heap_xlog_delete(XLogReaderState *record)
Buffer vmbuffer = InvalidBuffer;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8185,7 +8247,7 @@ heap_xlog_insert(XLogReaderState *record)
Buffer vmbuffer = InvalidBuffer;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8305,7 +8367,7 @@ heap_xlog_multi_insert(XLogReaderState *record)
Buffer vmbuffer = InvalidBuffer;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8460,7 +8522,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Buffer vmbuffer = InvalidBuffer;
visibilitymap_pin(reln, oldblk, &vmbuffer);
- visibilitymap_clear(reln, oldblk, vmbuffer);
+ visibilitymap_clear(reln, oldblk, vmbuffer, VISIBILITYMAP_VALID_BITS);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8544,7 +8606,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
Buffer vmbuffer = InvalidBuffer;
visibilitymap_pin(reln, newblk, &vmbuffer);
- visibilitymap_clear(reln, newblk, vmbuffer);
+ visibilitymap_clear(reln, newblk, vmbuffer, VISIBILITYMAP_VALID_BITS);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8724,6 +8786,27 @@ heap_xlog_lock(XLogReaderState *record)
ItemId lp = NULL;
HeapTupleHeader htup;
+ /*
+ * The visibility map may need to be fixed even if the heap page is
+ * already up-to-date.
+ */
+ if (xlrec->flags & XLH_LOCK_ALL_FROZEN_CLEARED)
+ {
+ RelFileNode rnode;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber block;
+ Relation reln;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &block);
+ reln = CreateFakeRelcacheEntry(rnode);
+
+ visibilitymap_pin(reln, block, &vmbuffer);
+ visibilitymap_clear(reln, block, vmbuffer, VISIBILITYMAP_ALL_FROZEN);
+
+ ReleaseBuffer(vmbuffer);
+ FreeFakeRelcacheEntry(reln);
+ }
+
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
{
page = (Page) BufferGetPage(buffer);
@@ -8776,6 +8859,27 @@ heap_xlog_lock_updated(XLogReaderState *record)
xlrec = (xl_heap_lock_updated *) XLogRecGetData(record);
+ /*
+ * The visibility map may need to be fixed even if the heap page is
+ * already up-to-date.
+ */
+ if (xlrec->flags & XLH_LOCK_ALL_FROZEN_CLEARED)
+ {
+ RelFileNode rnode;
+ Buffer vmbuffer = InvalidBuffer;
+ BlockNumber block;
+ Relation reln;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &block);
+ reln = CreateFakeRelcacheEntry(rnode);
+
+ visibilitymap_pin(reln, block, &vmbuffer);
+ visibilitymap_clear(reln, block, vmbuffer, VISIBILITYMAP_ALL_FROZEN);
+
+ ReleaseBuffer(vmbuffer);
+ FreeFakeRelcacheEntry(reln);
+ }
+
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
{
page = BufferGetPage(buffer);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b472d31..b60d8e4 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -11,7 +11,7 @@
* src/backend/access/heap/visibilitymap.c
*
* INTERFACE ROUTINES
- * visibilitymap_clear - clear a bit in the visibility map
+ * visibilitymap_clear - clear bits for one page in the visibility map
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
@@ -159,20 +159,23 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear all bits for one page in visibility map
+ * visibilitymap_clear - clear bit(s) for one page in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
- * any I/O.
+ * any I/O. Returns whether any bits have been cleared.
*/
-void
-visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+bool
+visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
- uint8 mask = VISIBILITYMAP_VALID_BITS << mapOffset;
+ uint8 mask = flags << mapOffset;
char *map;
+ bool cleared = false;
+
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
#ifdef TRACE_VISIBILITYMAP
elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
@@ -189,9 +192,12 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
map[mapByte] &= ~mask;
MarkBufferDirty(buf);
+ cleared = true;
}
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ return cleared;
}
/*
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 2b31ea4..7c763b6 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -85,7 +85,8 @@ heap_desc(StringInfo buf, XLogReaderState *record)
{
xl_heap_lock *xlrec = (xl_heap_lock *) rec;
- appendStringInfo(buf, "off %u: xid %u ", xlrec->offnum, xlrec->locking_xid);
+ appendStringInfo(buf, "off %u: xid %u: flags %u ",
+ xlrec->offnum, xlrec->locking_xid, xlrec->flags);
out_infobits(buf, xlrec->infobits_set);
}
else if (info == XLOG_HEAP_INPLACE)
@@ -138,7 +139,8 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
{
xl_heap_lock_updated *xlrec = (xl_heap_lock_updated *) rec;
- appendStringInfo(buf, "off %u: xmax %u ", xlrec->offnum, xlrec->xmax);
+ appendStringInfo(buf, "off %u: xmax %u: flags %u ",
+ xlrec->offnum, xlrec->xmax, xlrec->flags);
out_infobits(buf, xlrec->infobits_set);
}
else if (info == XLOG_HEAP2_NEW_CID)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 32b6fdd..4075f4d 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1179,7 +1179,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ visibilitymap_clear(onerel, blkno, vmbuffer,
+ VISIBILITYMAP_VALID_BITS);
}
/*
@@ -1201,7 +1202,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ visibilitymap_clear(onerel, blkno, vmbuffer,
+ VISIBILITYMAP_VALID_BITS);
}
/*
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index a822d0b..06a8242 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -243,15 +243,19 @@ typedef struct xl_heap_cleanup_info
#define XLHL_XMAX_KEYSHR_LOCK 0x08
#define XLHL_KEYS_UPDATED 0x10
+/* flag bits for xl_heap_lock / xl_heap_lock_updated's flag field */
+#define XLH_LOCK_ALL_FROZEN_CLEARED 0x01
+
/* This is what we need to know about lock */
typedef struct xl_heap_lock
{
TransactionId locking_xid; /* might be a MultiXactId not xid */
OffsetNumber offnum; /* locked tuple's offset on page */
int8 infobits_set; /* infomask and infomask2 bits to set */
+ uint8 flags; /* XLH_LOCK_* flag bits */
} xl_heap_lock;
-#define SizeOfHeapLock (offsetof(xl_heap_lock, infobits_set) + sizeof(int8))
+#define SizeOfHeapLock (offsetof(xl_heap_lock, flags) + sizeof(int8))
/* This is what we need to know about locking an updated version of a row */
typedef struct xl_heap_lock_updated
@@ -259,9 +263,10 @@ typedef struct xl_heap_lock_updated
TransactionId xmax;
OffsetNumber offnum;
uint8 infobits_set;
+ uint8 flags;
} xl_heap_lock_updated;
-#define SizeOfHeapLockUpdated (offsetof(xl_heap_lock_updated, infobits_set) + sizeof(uint8))
+#define SizeOfHeapLockUpdated (offsetof(xl_heap_lock_updated, flags) + sizeof(uint8))
/* This is what we need to know about confirmation of speculative insertion */
typedef struct xl_heap_confirm
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index fca99ca..00bbd4c 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -34,8 +34,8 @@
#define VM_ALL_FROZEN(r, b, v) \
((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
+ Buffer vmbuf, uint8 flags);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 2627519..0a595cc 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -31,7 +31,7 @@
/*
* Each page of XLOG file has a header like this:
*/
-#define XLOG_PAGE_MAGIC 0xD092 /* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD093 /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
--
2.8.1
On Sun, Jul 17, 2016 at 10:48 PM, Andres Freund <andres@anarazel.de> wrote:
Took till today. Attached is a rather heavily revised version of
Sawada-san's patch. Most notably the recovery routines take care to
reset the vm in all cases, we don't perform visibilitymap_get_status
from inside a critical section anymore, and
heap_lock_updated_tuple_rec() also resets the vm (although I'm not
entirely sure that can practically be hit).

I'm doing some more testing, and Robert said he could take a quick look
at the patch. If somebody else... Will push sometime after dinner.
Thanks very much for working on this. Random suggestions after a quick look:
+ * Before locking the buffer, pin the visibility map page if it may be
+ * necessary.
s/necessary/needed/
More substantively, what happens if the situation changes before we
obtain the buffer lock? I think you need to release the page lock,
pin the page after all, and then relock the page.
There seem to be several ways to escape from this function without
releasing the pin on vmbuffer. From the visibilitymap_pin call here,
search downward for "return".
+ * visibilitymap_clear - clear bit(s) for one page in visibility map
I don't really like the parenthesized-s convention as a shorthand for
"one or more". It tends to confuse non-native English speakers.
+ * any I/O. Returns whether any bits have been cleared.
I suggest "Returns true if any bits have been cleared and false otherwise".
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 18, 2016 at 8:18 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-16 10:45:26 -0700, Andres Freund wrote:
On July 16, 2016 8:49:06 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund <andres@anarazel.de> wrote:
I think we have two choices how to deal with that: First, we can add a
new flags variable to xl_heap_lock similar to
xl_heap_insert/update/... and bump page magic,

+1 for going in this way. This will keep us consistent with how we clear
the visibility info in other places, like heap_xlog_update().

Yeah. We've already forced a catversion bump for beta3, and I'm about
to go fix PG_CONTROL_VERSION as well, so there's basically no downside
to doing an xlog version bump as well. At least, not if you can get it
in before Monday.

OK, Cool. Will do it later today.
Took till today. Attached is a rather heavily revised version of
Sawada-san's patch. Most notably the recovery routines take care to
reset the vm in all cases, we don't perform visibilitymap_get_status
from inside a critical section anymore, and
heap_lock_updated_tuple_rec() also resets the vm (although I'm not
entirely sure that can practically be hit).
@@ -4563,8 +4579,18 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
+ /*
+ * Before locking the buffer, pin the visibility map page if it may be
+ * necessary.
+ */
+ if (PageIsAllVisible(BufferGetPage(*buffer)))
+ visibilitymap_pin(relation, block, &vmbuffer);
+
LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
I think we need to check for PageIsAllVisible and try to pin the
visibility map after taking the lock on buffer. I think it is quite
possible that, in the time this routine takes to acquire the lock on the
buffer, the page becomes all visible. To avoid a similar hazard, we
check the visibility of the page after acquiring the buffer lock in
heap_update(), at the place below.
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
Similarly, I think heap_lock_updated_tuple_rec() needs to take care of
same. While I was typing this e-mail, it seems Robert has already
pointed the same issue.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-07-17 23:34:01 -0400, Robert Haas wrote:
Thanks very much for working on this. Random suggestions after a quick look:
+ * Before locking the buffer, pin the visibility map page if it may be
+ * necessary.

s/necessary/needed/
More substantively, what happens if the situation changes before we
obtain the buffer lock? I think you need to release the page lock,
pin the page after all, and then relock the page.
It shouldn't be able to. Cleanup locks, which are required for
vacuumlazy to do anything relevant, aren't possible with the buffer
pinned. This pattern is used in heap_delete/heap_update, so I think
we're on a reasonably well trodden path.
There seem to be several ways to escape from this function without
releasing the pin on vmbuffer. From the visibilitymap_pin call here,
search downward for "return".
Hm, that's cleary not good.
The best thing to address that seems to be to create a
separate jump label, which checks vmbuffer and releases the page
lock. Unless you have a better idea.
+ * visibilitymap_clear - clear bit(s) for one page in visibility map
I don't really like the parenthesized-s convention as a shorthand for
"one or more". It tends to confuse non-native English speakers.+ * any I/O. Returns whether any bits have been cleared.
I suggest "Returns true if any bits have been cleared and false otherwise".
Will change.
- Andres
On 2016-07-18 09:07:19 +0530, Amit Kapila wrote:
+ /*
+ * Before locking the buffer, pin the visibility map page if it may be
+ * necessary.
+ */
+ if (PageIsAllVisible(BufferGetPage(*buffer)))
+ visibilitymap_pin(relation, block, &vmbuffer);
+ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);

I think we need to check for PageIsAllVisible and try to pin the
visibility map after taking the lock on buffer. I think it is quite
possible that in the time this routine tries to acquire lock on
buffer, the page becomes all visible.
I don't see how. Without a cleanup lock it's not possible to mark a page
all-visible/frozen. We might miss the bit becoming unset concurrently,
but that's ok.
Andres
On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-18 09:07:19 +0530, Amit Kapila wrote:
+ /*
+ * Before locking the buffer, pin the visibility map page if it may be
+ * necessary.
+ */
+ if (PageIsAllVisible(BufferGetPage(*buffer)))
+ visibilitymap_pin(relation, block, &vmbuffer);
+ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);

I think we need to check for PageIsAllVisible and try to pin the
visibility map after taking the lock on buffer. I think it is quite
possible that in the time this routine tries to acquire lock on
buffer, the page becomes all visible.

I don't see how. Without a cleanup lock it's not possible to mark a page
all-visible/frozen.
Consider the below scenario.
Vacuum
a. acquires a cleanup lock for page - 10
b. busy in checking visibility of tuples
--assume, here it takes some time and in the meantime Session-1
performs step (a) and (b) and start waiting in step- (c)
c. marks the page as all-visible (PageSetAllVisible)
d. unlockandrelease the buffer
Session-1
a. In heap_lock_tuple(), readbuffer for page-10
b. check PageIsAllVisible(), found page is not all-visible, so didn't
acquire the visibilitymap_pin
c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
release the lock
d. Got the lock, but now the page is marked as all-visible, so ideally
need to recheck the page and acquire the visibilitymap_pin
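The race in those steps can be modeled with a toy simulation; the flags below stand in for the page-header bit and the vm-page pin, and none of the names are the real API:

```c
#include <stdbool.h>

/* Toy model of the race: vacuum can set all-visible while the locker waits. */
static bool page_all_visible;
static bool vm_page_pinned;

static void visibilitymap_pin_stub(void) { vm_page_pinned = true; }

/* Session-1's pre-lock check: pins the vm page only if the flag is set now. */
static void prelock_check(void)
{
    if (page_all_visible)
        visibilitymap_pin_stub();
}

/* Vacuum's step (c): flips the flag while session-1 waits on the lock. */
static void vacuum_marks_all_visible(void) { page_all_visible = true; }

/* The proposed fix: after getting the lock, recheck and pin if it changed. */
static bool postlock_recheck_and_pin(void)
{
    if (!vm_page_pinned && page_all_visible)
    {
        /* real code: unlock the buffer, visibilitymap_pin(), relock, retry */
        visibilitymap_pin_stub();
        return true;            /* recheck fired */
    }
    return false;
}

/* Runs the full race; returns true iff the post-lock recheck catches it. */
static bool run_race_scenario(void)
{
    page_all_visible = false;
    vm_page_pinned = false;
    prelock_check();                    /* step (b): flag clear, no pin taken */
    vacuum_marks_all_visible();         /* vacuum's step (c) while we wait */
    return postlock_recheck_and_pin();  /* step (d): the recheck must fire */
}
```

Without the post-lock recheck, session-1 would reach the clear-the-bit step with no vm page pinned, which is exactly the hazard described above.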
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-18 09:07:19 +0530, Amit Kapila wrote:
+ /*
+ * Before locking the buffer, pin the visibility map page if it may be
+ * necessary.
+ */
+ if (PageIsAllVisible(BufferGetPage(*buffer)))
+ visibilitymap_pin(relation, block, &vmbuffer);
+ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);

I think we need to check for PageIsAllVisible and try to pin the
visibility map after taking the lock on buffer. I think it is quite
possible that in the time this routine tries to acquire lock on
buffer, the page becomes all visible.

I don't see how. Without a cleanup lock it's not possible to mark a page
all-visible/frozen.

Consider the below scenario.
Vacuum
a. acquires a cleanup lock for page - 10
b. busy in checking visibility of tuples
--assume, here it takes some time and in the meantime Session-1
performs step (a) and (b) and start waiting in step- (c)
c. marks the page as all-visible (PageSetAllVisible)
d. unlockandrelease the buffer

Session-1
a. In heap_lock_tuple(), readbuffer for page-10
b. check PageIsAllVisible(), found page is not all-visible, so didn't
acquire the visbilitymap_pin
c. LockBuffer in ExlusiveMode - here it will wait for vacuum to
release the lock
d. Got the lock, but now the page is marked as all-visible, so ideally
need to recheck the page and acquire the visibilitymap_pin
So, I've tried pretty hard to reproduce that. While the theory above is
sound, I believe the relevant code-path is essentially dead for SQL
callable code, because we'll always hold a buffer pin before even
entering heap_update/heap_lock_tuple. It's possible that you could
concoct a dangerous scenario with follow_updates though; but I can't
immediately see how. Due to that, and given the closing-in beta
release, I'm planning to push a version of the patch with the returns
fixed, but not this part. It seems better to have the majority of the
fix in.
Andres
On 2016-07-18 01:33:10 -0700, Andres Freund wrote:
On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-18 09:07:19 +0530, Amit Kapila wrote:
+ /*
+ * Before locking the buffer, pin the visibility map page if it may be
+ * necessary.
+ */
+ if (PageIsAllVisible(BufferGetPage(*buffer)))
+ visibilitymap_pin(relation, block, &vmbuffer);
+ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);

I think we need to check for PageIsAllVisible and try to pin the
visibility map after taking the lock on buffer. I think it is quite
possible that in the time this routine tries to acquire lock on
buffer, the page becomes all visible.

I don't see how. Without a cleanup lock it's not possible to mark a page
all-visible/frozen.

Consider the below scenario.
Vacuum
a. acquires a cleanup lock for page - 10
b. busy in checking visibility of tuples
--assume, here it takes some time and in the meantime Session-1
performs step (a) and (b) and start waiting in step- (c)
c. marks the page as all-visible (PageSetAllVisible)
d. unlockandrelease the buffer

Session-1
a. In heap_lock_tuple(), readbuffer for page-10
b. check PageIsAllVisible(), found page is not all-visible, so didn't
acquire the visbilitymap_pin
c. LockBuffer in ExlusiveMode - here it will wait for vacuum to
release the lock
d. Got the lock, but now the page is marked as all-visible, so ideally
need to recheck the page and acquire the visibilitymap_pin

So, I've tried pretty hard to reproduce that. While the theory above is
sound, I believe the relevant code-path is essentially dead for SQL
callable code, because we'll always hold a buffer pin before even
entering heap_update/heap_lock_tuple. It's possible that you could
concoct a dangerous scenario with follow_updates though; but I can't
immediately see how. Due to that, and based on the closing in beta
release, I'm planning to push a version of the patch that the returns
fixed; but not this. It seems better to have the majority of the fix
in.
Pushed that way. Let's try to figure out a good solution to a) test this
case b) how to fix it in a reasonable way. Note that there's also
http://archives.postgresql.org/message-id/20160718071729.tlj4upxhaylwv75n%40alap3.anarazel.de
which seems related.
Regards,
Andres
On Sat, Jul 16, 2016 at 10:08 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-14 20:53:07 -0700, Andres Freund wrote:
On 2016-07-13 23:06:07 -0700, Andres Freund wrote:
won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which
will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and
standby / after crash recovery. I'm failing to see any harmful
consequences right now, but differences between master and standby are a bad
thing.

I think it's actually critical, because HEAP_HOT_UPDATED /
HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like
heap_hot_search_buffer()).

I've pushed a quite heavily revised version of the first patch to
9.1-master. I manually verified using pageinspect, gdb breakpoints and a
standby that xmax, infomask etc are set correctly (leading to finding
a4d357bf). As there's noticeable differences, especially 9.2->9.3,
between versions, I'd welcome somebody having a look at the commits.
Waoh, man. Thanks!
I have just been pinged this weekend about a setup that likely has
faced this exact problem in the shape of "tuple concurrently updated",
with a node getting kill-9-ed by some framework because it did not
finish its shutdown checkpoint after some time in some test, which
forced it to do crash recovery. I have not been able to put my hands
on the raw data to have a look at the flags set within those tuples,
but I got the strong feeling that this is related. After a couple of
rounds doing so, it was possible to see "tuple concurrently updated"
errors for a relation that has few pages and a high update rate, using
9.4.
More seriously, I have spent some time looking at what you have pushed
on each branch, and the fixes are looking correct to me.
--
Michael
On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
Consider the below scenario.
Vacuum
a. acquires a cleanup lock for page - 10
b. busy in checking visibility of tuples
--assume, here it takes some time and in the meantime Session-1
performs step (a) and (b) and start waiting in step- (c)
c. marks the page as all-visible (PageSetAllVisible)
d. unlockandrelease the buffer

Session-1
a. In heap_lock_tuple(), readbuffer for page-10
b. check PageIsAllVisible(), found page is not all-visible, so didn't
acquire the visbilitymap_pin
c. LockBuffer in ExlusiveMode - here it will wait for vacuum to
release the lock
d. Got the lock, but now the page is marked as all-visible, so ideally
need to recheck the page and acquire the visibilitymap_pin

So, I've tried pretty hard to reproduce that. While the theory above is
sound, I believe the relevant code-path is essentially dead for SQL
callable code, because we'll always hold a buffer pin before even
entering heap_update/heap_lock_tuple.
It is possible that we don't hold any buffer pin before entering
heap_update() and/or heap_lock_tuple(). For heap_update(), it is
possible when it enters via simple_heap_update() path. For
heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE
statement, and maybe others as well. Let me also try to explain with a
test for both cases, if the above is not clear enough.
Case-1 for heap_update()
-----------------------------------
Session-1
Create table t1(c1 int);
Alter table t1 alter column c1 set default 10; --via debugger stop at
StoreAttrDefault()/heap_update, while you are in heap_update(), note
down the block number
Session-2
vacuum (DISABLE_PAGE_SKIPPING) pg_attribute; -- In lazy_scan_heap(),
stop at line (if (all_visible && !all_visible_according_to_vm)) for
block number noted in Session-1.
Session-1
In debugger, proceed and let it wait at lockbuffer, note that it will
not pin the visibility map.
Session-2
Set the visibility flag and complete the operation.
Session-1
You will notice that it will attempt to unlock the buffer, pin the
visibility map, lock the buffer again.
Case-2 for heap_lock_tuple()
----------------------------------------
Session-1
Create table i_conflict(c1 int, c2 char(100));
Create unique index idx_u on i_conflict(c1);
Insert into i_conflict values(1,'aaa');
Insert into i_conflict values(1,'aaa') On Conflict (c1) Do Update Set
c2='bbb'; -- via debugger, stop at line 385 in nodeModifyTable.c (In
ExecInsert(), at
if (onconflict == ONCONFLICT_UPDATE)).
Session-2
-------------
vacuum (DISABLE_PAGE_SKIPPING) i_conflict --stop before setting the
all-visible flag
Session-1
--------------
In debugger, proceed and let it wait at lockbuffer, note that it will
not pin the visibility map.
Session-2
---------------
Set the visibility flag and complete the operation.
Session-1
--------------
PANIC: wrong buffer passed to visibilitymap_clear --this is problematic.
Attached patch fixes the problem for me. Note, I have not tried to
reproduce the problem for heap_lock_updated_tuple_rec(), but I think
if you are convinced with above cases, then we should have a similar
check in it as well.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
pin_vm_lock_tuple-v1.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 38bba16..636e7b9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4585,9 +4585,10 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
block = ItemPointerGetBlockNumber(tid);
/*
- * Before locking the buffer, pin the visibility map page if it may be
- * necessary. XXX: It might be possible for this to change after acquiring
- * the lock below. We don't yet deal with that case.
+ * Before locking the buffer, pin the visibility map page if it appears to
+ * be necessary. Since we haven't got the lock yet, someone else might be
+ * in the middle of changing this, so we'll need to recheck after we have
+ * the lock.
*/
if (PageIsAllVisible(BufferGetPage(*buffer)))
visibilitymap_pin(relation, block, &vmbuffer);
@@ -5075,6 +5076,23 @@ failed:
goto out_locked;
}
+ /*
+ * If we didn't pin the visibility map page and the page has become all
+ * visible while we were busy locking the buffer, or during some
+ * subsequent window during which we had it unlocked, we'll have to unlock
+ * and re-lock, to avoid holding the buffer lock across an I/O. That's a
+ * bit unfortunate, especially since we'll now have to recheck whether the
+ * tuple has been locked or updated under us, but hopefully it won't
+ * happen very often.
+ */
+ if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ {
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ visibilitymap_pin(relation, block, &vmbuffer);
+ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+ goto l3;
+ }
+
xmax = HeapTupleHeaderGetRawXmax(tuple->t_data);
old_infomask = tuple->t_data->t_infomask;
@@ -5665,9 +5683,10 @@ l4:
CHECK_FOR_INTERRUPTS();
/*
- * Before locking the buffer, pin the visibility map page if it may be
- * necessary. XXX: It might be possible for this to change after
- * acquiring the lock below. We don't yet deal with that case.
+ * Before locking the buffer, pin the visibility map page if it
+ * appears to be necessary. Since we haven't got the lock yet,
+ * someone else might be in the middle of changing this, so we'll need
+ * to recheck after we have the lock.
*/
if (PageIsAllVisible(BufferGetPage(buf)))
visibilitymap_pin(rel, block, &vmbuffer);
@@ -5803,6 +5822,19 @@ l4:
}
}
+ /*
+ * If we didn't pin the visibility map page and the page has become
+ * all visible, we'll have to unlock and re-lock. See heap_lock_tuple
+ * for details.
+ */
+ if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(buf)))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ visibilitymap_pin(rel, block, &vmbuffer);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ goto l4;
+ }
+
/* compute the new Xmax and infomask values for the tuple ... */
compute_new_xmax_infomask(xmax, old_infomask, mytup.t_data->t_infomask2,
xid, mode, false,
On Sat, Jul 23, 2016 at 3:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Attached patch fixes the problem for me. Note, I have not tried to
reproduce the problem for heap_lock_updated_tuple_rec(), but I think
if you are convinced with above cases, then we should have a similar
check in it as well.
I don't think this hunk is correct:
+ /*
+ * If we didn't pin the visibility map page and the page has become
+ * all visible, we'll have to unlock and re-lock. See heap_lock_tuple
+ * for details.
+ */
+ if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(buf)))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ visibilitymap_pin(rel, block, &vmbuffer);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ goto l4;
+ }
The code beginning at label l4 assumes that the buffer is unlocked,
but this code leaves the buffer locked. Also, I don't see the point
of doing this test so far down in the function. Why not just recheck
*immediately* after taking the buffer lock? If you find out that you
need the pin after all, then LockBuffer(buf,
BUFFER_LOCK_UNLOCK); visibilitymap_pin(rel, block, &vmbuffer);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); but *do not* go back to l4.
Unless I'm missing something, putting this block further down, as you
have it, buys nothing, because none of that intervening code can
release the buffer lock without using goto to jump back to l4.
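The objection can be illustrated with a toy lock-depth model; enter_l4() here just models the label's precondition that the buffer is unlocked on entry, and every name is made up for the sketch:

```c
#include <stdbool.h>

static int lock_depth;                 /* >0 means the buffer lock is held */

static void lock_buffer(void)   { lock_depth++; }
static void unlock_buffer(void) { lock_depth--; }

/*
 * Model of the l4 entry point: like the real label, it assumes the
 * buffer is UNLOCKED on entry and takes the lock itself.
 */
static bool enter_l4(void)
{
    if (lock_depth != 0)
        return false;                  /* would self-deadlock for real */
    lock_buffer();
    return true;
}

/* Buggy recheck: re-locks and then jumps back to l4 while holding it. */
static bool buggy_recheck_then_goto_l4(void)
{
    unlock_buffer();
    /* visibilitymap_pin() would go here in the real code */
    lock_buffer();
    return enter_l4();                 /* fails: lock already held */
}

/* Fixed recheck: re-lock and simply fall through, no jump back to l4. */
static bool fixed_recheck_fall_through(void)
{
    unlock_buffer();
    /* visibilitymap_pin() would go here in the real code */
    lock_buffer();
    return true;                       /* continue with the lock held */
}
```

The fall-through variant matches the suggestion of rechecking immediately after taking the lock, since nothing between the lock acquisition and the recheck has been decided yet.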
+ /*
+ * If we didn't pin the visibility map page and the page has become all
+ * visible while we were busy locking the buffer, or during some
+ * subsequent window during which we had it unlocked, we'll have to unlock
+ * and re-lock, to avoid holding the buffer lock across an I/O. That's a
+ * bit unfortunate, especially since we'll now have to recheck whether the
+ * tuple has been locked or updated under us, but hopefully it won't
+ * happen very often.
+ */
+ if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ {
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ visibilitymap_pin(relation, block, &vmbuffer);
+ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+ goto l3;
+ }
In contrast, this looks correct: l3 expects the buffer to be locked
already, and the code between that label and this point can unlock and
re-lock the buffer, potentially multiple times.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 27, 2016 at 3:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jul 23, 2016 at 3:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Attached patch fixes the problem for me. Note, I have not tried to
reproduce the problem for heap_lock_updated_tuple_rec(), but I think
if you are convinced with above cases, then we should have a similar
check in it as well.

I don't think this hunk is correct:
+ /*
+ * If we didn't pin the visibility map page and the page has become
+ * all visible, we'll have to unlock and re-lock. See heap_lock_tuple
+ * for details.
+ */
+ if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(buf)))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ visibilitymap_pin(rel, block, &vmbuffer);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ goto l4;
+ }

The code beginning at label l4 assumes that the buffer is unlocked,
but this code leaves the buffer locked. Also, I don't see the point
of doing this test so far down in the function. Why not just recheck
*immediately* after taking the buffer lock?
Right, in this case we can recheck immediately after taking the buffer
lock; updated patch attached. In passing, I have noticed that
heap_delete() doesn't do this unlocking, vm pinning, and relocking at
the appropriate place. It checks immediately after taking the lock,
whereas the code further down does unlock and lock the buffer again. I
think we should do it as in the attached patch
(pin_vm_heap_delete-v1.patch).
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
pin_vm_lock_tuple-v2.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 38bba16..e24ef65 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4585,9 +4585,10 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
block = ItemPointerGetBlockNumber(tid);
/*
- * Before locking the buffer, pin the visibility map page if it may be
- * necessary. XXX: It might be possible for this to change after acquiring
- * the lock below. We don't yet deal with that case.
+ * Before locking the buffer, pin the visibility map page if it appears to
+ * be necessary. Since we haven't got the lock yet, someone else might be
+ * in the middle of changing this, so we'll need to recheck after we have
+ * the lock.
*/
if (PageIsAllVisible(BufferGetPage(*buffer)))
visibilitymap_pin(relation, block, &vmbuffer);
@@ -5075,6 +5076,23 @@ failed:
goto out_locked;
}
+ /*
+ * If we didn't pin the visibility map page and the page has become all
+ * visible while we were busy locking the buffer, or during some
+ * subsequent window during which we had it unlocked, we'll have to unlock
+ * and re-lock, to avoid holding the buffer lock across an I/O. That's a
+ * bit unfortunate, especially since we'll now have to recheck whether the
+ * tuple has been locked or updated under us, but hopefully it won't
+ * happen very often.
+ */
+ if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ {
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ visibilitymap_pin(relation, block, &vmbuffer);
+ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+ goto l3;
+ }
+
xmax = HeapTupleHeaderGetRawXmax(tuple->t_data);
old_infomask = tuple->t_data->t_infomask;
@@ -5665,9 +5683,10 @@ l4:
CHECK_FOR_INTERRUPTS();
/*
- * Before locking the buffer, pin the visibility map page if it may be
- * necessary. XXX: It might be possible for this to change after
- * acquiring the lock below. We don't yet deal with that case.
+ * Before locking the buffer, pin the visibility map page if it
+ * appears to be necessary. Since we haven't got the lock yet,
+ * someone else might be in the middle of changing this, so we'll need
+ * to recheck after we have the lock.
*/
if (PageIsAllVisible(BufferGetPage(buf)))
visibilitymap_pin(rel, block, &vmbuffer);
@@ -5677,6 +5696,19 @@ l4:
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
/*
+ * If we didn't pin the visibility map page and the page has become
+ * all visible while we were busy locking the buffer, we'll have to
+ * unlock and re-lock, to avoid holding the buffer lock across an I/O.
+ * That's a bit unfortunate, but hopefully shouldn't happen often.
+ */
+ if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(buf)))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ visibilitymap_pin(rel, block, &vmbuffer);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ }
+
+ /*
* Check the tuple XMIN against prior XMAX, if any. If we reached the
* end of the chain, we're done, so return success.
*/
pin_vm_heap_delete-v1.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 38bba16..d2914f0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3031,19 +3031,6 @@ heap_delete(Relation relation, ItemPointer tid,
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
- /*
- * If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
- */
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
- {
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- visibilitymap_pin(relation, block, &vmbuffer);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
- }
-
lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
Assert(ItemIdIsNormal(lp));
@@ -3189,6 +3176,20 @@ l1:
}
/*
+ * If we didn't pin the visibility map page and the page has become all
+ * visible while we were busy locking the buffer, we'll have to unlock and
+ * re-lock, to avoid holding the buffer lock across an I/O. See
+ * heap_update for further details.
+ */
+ if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ visibilitymap_pin(relation, block, &vmbuffer);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ goto l1;
+ }
+
+ /*
* We're about to do the actual delete -- check for conflict first, to
* avoid possibly having to roll back work we've just done.
*
On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote:
On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
Consider the below scenario.
Vacuum
a. acquires a cleanup lock for page - 10
b. busy in checking visibility of tuples
--assume, here it takes some time and in the meantime Session-1
performs step (a) and (b) and start waiting in step- (c)
c. marks the page as all-visible (PageSetAllVisible)
d. unlockandrelease the buffer

Session-1
a. In heap_lock_tuple(), readbuffer for page-10
b. check PageIsAllVisible(), found page is not all-visible, so didn't
acquire the visbilitymap_pin
c. LockBuffer in ExlusiveMode - here it will wait for vacuum to
release the lock
d. Got the lock, but now the page is marked as all-visible, so ideally
need to recheck the page and acquire the visibilitymap_pin

So, I've tried pretty hard to reproduce that. While the theory above is
sound, I believe the relevant code-path is essentially dead for SQL
callable code, because we'll always hold a buffer pin before even
entering heap_update/heap_lock_tuple.

It is possible that we don't hold any buffer pin before entering
heap_update() and or heap_lock_tuple(). For heap_update(), it is
possible when it enters via simple_heap_update() path. For
heap_lock_tuple(), it is possible for ON CONFLICT DO Update statement
and may be others as well.
This is currently listed as a 9.6 open item. Is it indeed a regression in
9.6, or do released versions have the same defect? If it is a 9.6 regression,
do you happen to know which commit, or at least which feature, caused it?
Thanks,
nm
On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch <noah@leadboat.com> wrote:
On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote:
On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
Consider the below scenario.
Vacuum
a. acquires a cleanup lock for page - 10
b. busy in checking visibility of tuples
--assume, here it takes some time and in the meantime Session-1
performs step (a) and (b) and start waiting in step- (c)
c. marks the page as all-visible (PageSetAllVisible)
d. unlockandrelease the buffer

Session-1
a. In heap_lock_tuple(), readbuffer for page-10
b. check PageIsAllVisible(), found page is not all-visible, so didn't
acquire the visbilitymap_pin
c. LockBuffer in ExlusiveMode - here it will wait for vacuum to
release the lock
d. Got the lock, but now the page is marked as all-visible, so ideally
need to recheck the page and acquire the visibilitymap_pin

So, I've tried pretty hard to reproduce that. While the theory above is
sound, I believe the relevant code-path is essentially dead for SQL
callable code, because we'll always hold a buffer pin before even
entering heap_update/heap_lock_tuple.

It is possible that we don't hold any buffer pin before entering
heap_update() and or heap_lock_tuple(). For heap_update(), it is
possible when it enters via simple_heap_update() path. For
heap_lock_tuple(), it is possible for ON CONFLICT DO Update statement
and may be others as well.

This is currently listed as a 9.6 open item. Is it indeed a regression in
9.6, or do released versions have the same defect? If it is a 9.6 regression,
do you happen to know which commit, or at least which feature, caused it?
Commit eca0f1db is the reason for this specific issue.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 02, 2016 at 02:10:29PM +0530, Amit Kapila wrote:
On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch <noah@leadboat.com> wrote:
On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote:
On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
Consider the below scenario.
Vacuum
a. acquires a cleanup lock for page - 10
b. busy in checking visibility of tuples
--assume, here it takes some time and in the meantime Session-1
performs step (a) and (b) and start waiting in step- (c)
c. marks the page as all-visible (PageSetAllVisible)
d. unlockandrelease the buffer

Session-1
a. In heap_lock_tuple(), readbuffer for page-10
b. check PageIsAllVisible(), found page is not all-visible, so didn't
acquire the visbilitymap_pin
c. LockBuffer in ExlusiveMode - here it will wait for vacuum to
release the lock
d. Got the lock, but now the page is marked as all-visible, so ideally
need to recheck the page and acquire the visibilitymap_pin

So, I've tried pretty hard to reproduce that. While the theory above is
sound, I believe the relevant code-path is essentially dead for SQL
callable code, because we'll always hold a buffer pin before even
entering heap_update/heap_lock_tuple.

It is possible that we don't hold any buffer pin before entering
heap_update() and or heap_lock_tuple(). For heap_update(), it is
possible when it enters via simple_heap_update() path. For
heap_lock_tuple(), it is possible for ON CONFLICT DO Update statement
and may be others as well.

This is currently listed as a 9.6 open item. Is it indeed a regression in
9.6, or do released versions have the same defect? If it is a 9.6 regression,
do you happen to know which commit, or at least which feature, caused it?

Commit eca0f1db is the reason for this specific issue.
[Action required within 72 hours. This is a generic notification.]
The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
in advance of shipping 9.6rc1 next week. Consequently, I will appreciate your
efforts toward speedy resolution. Thanks.
[1]: /messages/by-id/20160527025039.GA447393@tornado.leadboat.com
Hi,
On 2016-08-02 10:55:18 -0400, Noah Misch wrote:
[Action required within 72 hours. This is a generic notification.]
The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
since you committed the patch believed to have created it, you own this open
item.
Well kinda (it was a partial fix for something not originally by me),
but I'll deal with it. Reading now, will commit tomorrow.
Regards,
Andres
On Thu, Aug 4, 2016 at 3:24 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2016-08-02 10:55:18 -0400, Noah Misch wrote:
[Action required within 72 hours. This is a generic notification.]
The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
since you committed the patch believed to have created it, you own this open
item.Well kinda (it was a partial fix for something not originally by me),
but I'll deal with it. Reading now, will commit tomorrow.
Thanks. I kept meaning to get to this one, and failing to do so.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company