recovering from "found xmin ... from before relfrozenxid ..."
Hi,
A number of EDB customers have had this error crop up on their tables for
reasons that we have usually not been able to determine. In many
cases, it's probably down to things like running buggy old releases
for a long time before upgrading, or bad backup and recovery
procedures. It's more than possible that there are still-unfixed
server bugs, but I do not have any compelling evidence of such bugs at
this time. Unfortunately, once you're in this situation, it's kind of
hard to find your way out of it. There are a few problems:
1. There's nothing to identify the tuple that has the problem, and no
way to know how many more of them there might be. Back-patching
b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
part of this.
2. In some other, similar situations, e.g. where the tuple data is
garbled, it's often possible to get out from under the problem by
deleting the tuple at issue. But I think that doesn't necessarily fix
anything in this case.
3. We've had some success with using a PL/pgSQL loop with an
EXCEPTION block to extract all the accessible tuples from the table.
Then you can truncate the original table and reinsert the data. But
this is slow, so it stinks if the table is big, and it's not a viable
approach if the table in question is a system catalog table -- at
least not if it's something critical like pg_class.
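For reference, the kind of salvage loop described in point 3 looks roughly like the sketch below. It probes each possible TID individually (a plain sequential scan could die on the first bad tuple), trapping errors per-tuple. Table names are illustrative; 291 is MaxHeapTuplesPerPage for 8kB pages.

```sql
-- Rough sketch only; assumes a damaged table "broken_tbl".
CREATE TABLE salvaged_tbl (LIKE broken_tbl);

DO $$
DECLARE
    pages bigint := pg_relation_size('broken_tbl')
                    / current_setting('block_size')::bigint;
BEGIN
    FOR blk IN 0 .. pages - 1 LOOP
        FOR off IN 1 .. 291 LOOP  -- MaxHeapTuplesPerPage for 8kB pages
            BEGIN
                -- TID scan touches only this one line pointer.
                INSERT INTO salvaged_tbl
                    SELECT * FROM broken_tbl
                    WHERE ctid = format('(%s,%s)', blk, off)::tid;
            EXCEPTION WHEN OTHERS THEN
                RAISE NOTICE 'skipping (%,%): %', blk, off, SQLERRM;
            END;
        END LOOP;
    END LOOP;
END;
$$;
```

As noted above, this is O(pages * offsets) round trips through the executor, which is why it stinks for big tables.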
I realize somebody's probably going to say "well, you shouldn't try to
repair a database that's in this state, you shouldn't let it happen in
the first place, and if it does happen, you should track the root
cause to the ends of the earth." But I think that's a completely
impractical approach. I at least have no idea how I'm supposed to
figure out when and how a bad relfrozenxid ended up in the table, and
by the time the problem is discovered after an upgrade the problem
that caused it may be quite old. Moreover, not everyone is as
interested in an extended debugging exercise as they are in getting
the system working again, and VACUUM failing repeatedly is a pretty
serious problem.
Therefore, one of my colleagues has - at my request - created a couple
of functions called heap_force_kill() and heap_force_freeze() which
take an array of TIDs. The former truncates them all to dead line
pointers. The latter resets the infomask and xmin to make the xmin
frozen. (It should probably handle the xmax too; not sure that the
current version does that, but it's easily fixed if not.) The
intention is that you can use these to either get rid of, or get
access to, tuples whose visibility information is corrupted for
whatever reason. These are pretty sharp tools; you could corrupt a
perfectly-good table by incautious use of them, or destroy a large
amount of data. You could, for example, force-freeze a tuple created
by a transaction which added a column, inserted data, and rolled back;
that would likely be disastrous. However, in the cases that I'm
thinking about, disaster has already struck, and something that you
can use to get things back to a saner state is better than just
leaving the table perpetually broken. Without something like this, the
backup plan is probably to shut down the server and try to edit the
pages using a perl script or something, but that seems clearly worse.
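For illustration, usage would presumably look something like the following; the exact signatures are whatever the functions end up with, and this assumes they take a relation and an array of TIDs as described above. The table name and TIDs are made up.

```sql
-- Hypothetical usage; signatures assumed to be (regclass, tid[]).

-- Make two damaged tuples readable again by force-freezing their xmin:
SELECT heap_force_freeze('broken_tbl'::regclass,
                         ARRAY['(1280,14)', '(1281,3)']::tid[]);

-- Or discard them outright, truncating them to dead line pointers:
SELECT heap_force_kill('broken_tbl'::regclass,
                       ARRAY['(1280,14)', '(1281,3)']::tid[]);
```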
So I have these questions:
- Do people think it would be smart/good/useful to include something
like this in PostgreSQL?
- If so, how? I would propose a new contrib module that we back-patch
all the way, because the VACUUM errors were back-patched all the way,
and there seems to be no advantage in making people wait 5 years for a
new version that has some kind of tooling in this area.
- Any ideas for additional things we should include, or improvements
on the sketch above?
Thanks,
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,
* Robert Haas (robertmhaas@gmail.com) wrote:
- Do people think it would be smart/good/useful to include something
like this in PostgreSQL?
Absolutely, yes.
- If so, how? I would propose a new contrib module that we back-patch
all the way, because the VACUUM errors were back-patched all the way,
and there seems to be no advantage in making people wait 5 years for a
new version that has some kind of tooling in this area.
While I agree that this would be a good and useful new contrib module to
have, I don't think it would be appropriate to back-patch it into PG
formally.
Unfortunately, that gets into the discussion that's cropped up on a few
other threads of late- that we don't have a good place to put extensions
which are well maintained/recommended by core PG hackers, and which are
able to work with lots of different versions of PG, and are versioned
and released independently of PG (and, ideally, built for all the
versions of PG that we distribute through our packages).
Given the lack of such a place today, I'd at least suggest starting with
proposing it as a new contrib module for v14.
- Any ideas for additional things we should include, or improvements
on the sketch above?
Not right off-hand, but I will think about it; there could certainly be a
lot of very interesting tools in such a toolbox.
Thanks!
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Robert Haas (robertmhaas@gmail.com) wrote:
- If so, how? I would propose a new contrib module that we back-patch
all the way, because the VACUUM errors were back-patched all the way,
and there seems to be no advantage in making people wait 5 years for a
new version that has some kind of tooling in this area.
While I agree that this would be a good and useful new contrib module to
have, I don't think it would be appropriate to back-patch it into PG
formally.
Yeah, I don't care for that either. That's a pretty huge violation of our
normal back-patching rules, and I'm not convinced that it's justified.
No objection to adding it as a new contrib module.
regards, tom lane
On Mon, Jul 13, 2020 at 2:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
1. There's nothing to identify the tuple that has the problem, and no
way to know how many more of them there might be. Back-patching
b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
part of this.
I am in favor of backpatching such changes in cases where senior
community members feel that it could help with hypothetical
undiscovered data corruption issues -- if they're willing to take
responsibility for the change. It certainly wouldn't be the first
time. A "defense in depth" mindset seems like the right one when it
comes to data corruption bugs. Early detection is really important.
Moreover, not everyone is as
interested in an extended debugging exercise as they are in getting
the system working again, and VACUUM failing repeatedly is a pretty
serious problem.
That's absolutely consistent with my experience. Most users want to
get back to business as usual now, while letting somebody else do the
hard work of debugging.
Therefore, one of my colleagues has - at my request - created a couple
of functions called heap_force_kill() and heap_force_freeze() which
take an array of TIDs.
So I have these questions:
- Do people think it would be smart/good/useful to include something
like this in PostgreSQL?
I'm in favor of it.
- If so, how? I would propose a new contrib module that we back-patch
all the way, because the VACUUM errors were back-patched all the way,
and there seems to be no advantage in making people wait 5 years for a
new version that has some kind of tooling in this area.
I'm in favor of it being *possible* to backpatch tooling that is
clearly related to correctness in a fundamental way. Obviously this
would mean that we'd be revising our general position on backpatching
to allow some limited exceptions around corruption. I'm not sure that
this meets that standard, though. It's hardly something that we can
expect all that many users to be able to use effectively.
I may be biased, but I'd be inclined to permit it in the case of
something like amcheck, or pg_visibility, on the grounds that they're
more or less the same as the new VACUUM errcontext instrumentation you
mentioned. The same cannot be said of something like this new
heap_force_kill() stuff.
- Any ideas for additional things we should include, or improvements
on the sketch above?
Clearly you should work out a way of making it very hard to
accidentally (mis)use. For example, maybe you make the functions check
for the presence of a sentinel file in the data directory.
--
Peter Geoghegan
Hi,
On 2020-07-13 17:12:18 -0400, Robert Haas wrote:
1. There's nothing to identify the tuple that has the problem, and no
way to know how many more of them there might be. Back-patching
b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
part of this.
Not fully, I'm afraid. Afaict it doesn't currently tell you the item
pointer offset, just the block number, right? We probably should extend
it to also include the offset...
2. In some other, similar situations, e.g. where the tuple data is
garbled, it's often possible to get out from under the problem by
deleting the tuple at issue. But I think that doesn't necessarily fix
anything in this case.
Huh, why not? That worked in the cases I saw.
Therefore, one of my colleagues has - at my request - created a couple
of functions called heap_force_kill() and heap_force_freeze() which
take an array of TIDs. The former truncates them all to dead line
pointers. The latter resets the infomask and xmin to make the xmin
frozen. (It should probably handle the xmax too; not sure that the
current version does that, but it's easily fixed if not.)
xmax is among the problematic cases IIRC, so yes, it'd be good to fix
that.
Greetings,
Andres Freund
On Mon, Jul 13, 2020 at 6:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, I don't care for that either. That's a pretty huge violation of our
normal back-patching rules, and I'm not convinced that it's justified.
I think that our normal back-patching rules are based primarily on the
risk of breaking things, and a new contrib module carries a pretty
negligible risk of breaking anything that works today. I wouldn't
propose to back-patch something on those grounds just as a way of
delivering a new feature more quickly, but that's not the intention
here. At least in my experience, un-VACUUM-able tables have gotten
several orders of magnitude more common since Andres put those changes
in. As far as I can recall, EDB has not had this many instances of
different customers reporting the same problem since the 9.3-era
multixact issues. So far, this does not rise to that level, but it is
by no means a negligible issue, either. I believe it deserves to be
taken quite seriously, especially because the existing options for
helping customers with this kind of problem are so limited.
Now, if this goes into v14, we can certainly stick it up on github, or
put it out there in some other way for users to download,
self-compile, and install, but that seems noticeably less convenient
for people who need it, and I'm not clear what the benefit to the
project is.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 13, 2020 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
Not fully, I'm afraid. Afaict it doesn't currently tell you the item
pointer offset, just the block number, right? We probably should extend
it to also include the offset...
Oh, I hadn't realized that limitation. That would be good to fix. It
would be even better, I think, if we could have VACUUM proceed with
the rest of vacuuming the table, emitting warnings about each
instance, instead of blowing up when it hits the first bad tuple, but
I think you may have told me sometime that doing so would be, uh, less
than straightforward. We probably should refuse to update
relfrozenxid/relminmxid when this is happening, but I *think* it would
be better to still proceed with dead tuple cleanup as far as we can,
or at least have an option to enable that behavior. I'm not positive
about that, but not being able to complete VACUUM at all is a FAR more
urgent problem than not being able to freeze, even though in the long
run the latter is more severe.
2. In some other, similar situations, e.g. where the tuple data is
garbled, it's often possible to get out from under the problem by
deleting the tuple at issue. But I think that doesn't necessarily fix
anything in this case.
Huh, why not? That worked in the cases I saw.
I'm not sure I've seen a case where that didn't work, but I don't see
a reason why it couldn't happen. Do you think the code is structured
in such a way that a deleted tuple is guaranteed to be pruned even if
the XID is old? What if clog has been truncated so that the xmin can't
be looked up?
xmax is among the problematic cases IIRC, so yes, it'd be good to fix
that.
Thanks for the input.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Oh, I hadn't realized that limitation. That would be good to fix. It
would be even better, I think, if we could have VACUUM proceed with
the rest of vacuuming the table, emitting warnings about each
instance, instead of blowing up when it hits the first bad tuple, but
I think you may have told me sometime that doing so would be, uh, less
than straightforward. We probably should refuse to update
relfrozenxid/relminmxid when this is happening, but I *think* it would
be better to still proceed with dead tuple cleanup as far as we can,
or at least have an option to enable that behavior. I'm not positive
about that, but not being able to complete VACUUM at all is a FAR more
urgent problem than not being able to freeze, even though in the long
run the latter is more severe.
+1 for proceeding in this direction, rather than handing users tools
that they *will* hurt themselves with.
The more that I think about it, the more I think that the proposed
functions are tools for wizards only, and so I'm getting hesitant
about having them in contrib at all. We lack a better place to
put them, but that doesn't mean they should be there.
regards, tom lane
On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Oh, I hadn't realized that limitation. That would be good to fix. It
would be even better, I think, if we could have VACUUM proceed with
the rest of vacuuming the table, emitting warnings about each
instance, instead of blowing up when it hits the first bad tuple, but
I think you may have told me sometime that doing so would be, uh, less
than straightforward. We probably should refuse to update
relfrozenxid/relminmxid when this is happening, but I *think* it would
be better to still proceed with dead tuple cleanup as far as we can,
or at least have an option to enable that behavior. I'm not positive
about that, but not being able to complete VACUUM at all is a FAR more
urgent problem than not being able to freeze, even though in the long
run the latter is more severe.
+1 for proceeding in this direction, rather than handing users tools
that they *will* hurt themselves with.
The more that I think about it, the more I think that the proposed
functions are tools for wizards only, and so I'm getting hesitant
about having them in contrib at all. We lack a better place to
put them, but that doesn't mean they should be there.
It's not an either/or; it's a both/and. To recover from this problem,
you need to:
1. Be able to tell which tuples are affected.
2. Do something about it.
I think there are a number of strategies that we could pursue around
either of those things, and there are better and worse ways of
accomplishing them, but having one without the other isn't too great.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2020-07-13 20:47:10 -0400, Robert Haas wrote:
On Mon, Jul 13, 2020 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
Not fully, I'm afraid. Afaict it doesn't currently tell you the item
pointer offset, just the block numer, right? We probably should extend
it to also include the offset...
Oh, I hadn't realized that limitation. That would be good to fix.
Yea. And it'd even be good if we were to end up implementing your
suggestion below about continuing vacuuming other tuples.
It would be even better, I think, if we could have VACUUM proceed with
the rest of vacuuming the table, emitting warnings about each
instance, instead of blowing up when it hits the first bad tuple, but
I think you may have told me sometime that doing so would be, uh, less
than straightforward.
Yea, it's not that simple to implement. Not impossible either.
We probably should refuse to update relfrozenxid/relminmxid when this
is happening, but I *think* it would be better to still proceed with
dead tuple cleanup as far as we can, or at least have an option to
enable that behavior. I'm not positive about that, but not being able
to complete VACUUM at all is a FAR more urgent problem than not being
able to freeze, even though in the long run the latter is more severe.
I'm hesitant to default to removing tuples once we've figured out that
something is seriously wrong. It could easily enough make us plow ahead
and delete valuable data on other tuples, even if we'd already detected
there's a problem. But I also see the problem you raise. That's not
academic: IIRC, a number of the multixact corruption issues the checks
detected weren't guaranteed to be caught.
2. In some other, similar situations, e.g. where the tuple data is
garbled, it's often possible to get out from under the problem by
deleting the tuple at issue. But I think that doesn't necessarily fix
anything in this case.
Huh, why not? That worked in the cases I saw.
I'm not sure I've seen a case where that didn't work, but I don't see
a reason why it couldn't happen. Do you think the code is structured
in such a way that a deleted tuple is guaranteed to be pruned even if
the XID is old?
I think so, leaving aside some temporary situations perhaps.
What if clog has been truncated so that the xmin can't be looked up?
That's possible, but probably only in cases where xmin actually
committed.
Greetings,
Andres Freund
On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The more that I think about it, the more I think that the proposed
functions are tools for wizards only, and so I'm getting hesitant
about having them in contrib at all. We lack a better place to
put them, but that doesn't mean they should be there.
Also, I want to clarify that in a typical situation in which a
customer is facing this problem, I don't have any access to their
system. I basically never touch customer systems directly. Typically,
the customer sends us log files and a description of the problem and
their goals, and we send them back advice or instructions. So it's
impractical to imagine that this can be something where you have to
know the secret magic wizard password to get access to it. We'd just
have to give the customers who need to use this tool said password,
and then the jig is up - they can redistribute that password to all
the non-wizards on the Internet, if they so choose.
I understand that it's not too great when we give people access to
sharp tools and they hurt themselves with said tools. But this is open
source. That's how it goes.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 13, 2020 at 9:10 PM Andres Freund <andres@anarazel.de> wrote:
What if clog has been truncated so that the xmin can't be looked up?
That's possible, but probably only in cases where xmin actually
committed.
Isn't that the normal case? I'm imagining something like:
- Tuple gets inserted. Transaction commits.
- VACUUM processes table.
- Mischievous fairies mark page all-visible in the visibility map.
- VACUUM runs lots more times, relfrozenxid advances, but without ever
looking at the page in question, because it's all-visible.
- clog is truncated, rendering xmin no longer accessible.
- User runs VACUUM disabling page skipping, gets ERROR.
- User deletes offending tuple.
- At this point, I think the tuple is both invisible and unprunable?
- Fairies happy, user sad.
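In rough SQL terms, the sequence above would look something like this (the corruption step itself can't be produced from SQL, so it's represented by a comment; table name and error text are illustrative):

```sql
CREATE TABLE t (i int);
INSERT INTO t VALUES (1);          -- tuple inserted, transaction commits
VACUUM t;                          -- page eventually marked all-visible
-- <fairies corrupt things here; later VACUUMs skip the all-visible
--  page, relfrozenxid advances past the tuple's xmin, and clog for
--  that xmin is truncated away>
VACUUM (DISABLE_PAGE_SKIPPING) t;  -- ERROR: found xmin ... from before relfrozenxid ...
DELETE FROM t WHERE i = 1;         -- tuple deleted...
VACUUM (DISABLE_PAGE_SKIPPING) t;  -- ...but possibly still unprunable: same ERROR
```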
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The more that I think about it, the more I think that the proposed
functions are tools for wizards only, and so I'm getting hesitant
about having them in contrib at all. We lack a better place to
put them, but that doesn't mean they should be there.
I understand that it's not too great when we give people access to
sharp tools and they hurt themselves with said tools. But this is open
source. That's how it goes.
I think you're attacking a straw man. I'm well aware of how open source
works, thanks. What I'm saying is that contrib is mostly seen to be
reasonably harmless stuff. Sure, you can overwrite data you didn't want
to with adminpack's pg_file_write. But that's the price of having such a
capability at all, and in general it's not hard for users to understand
both the uses and risks of that function. That statement does not apply
to the functions being proposed here. It doesn't seem like they could
possibly be safe to use without very specific expert advice --- and even
then, we're talking rather small values of "safe". So I wish we had some
other way to distribute them than via contrib.
regards, tom lane
On 2020/07/14 9:41, Robert Haas wrote:
On Mon, Jul 13, 2020 at 6:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, I don't care for that either. That's a pretty huge violation of our
normal back-patching rules, and I'm not convinced that it's justified.
I think that our normal back-patching rules are based primarily on the
risk of breaking things, and a new contrib module carries a pretty
negligible risk of breaking anything that works today. I wouldn't
propose to back-patch something on those grounds just as a way of
delivering a new feature more quickly, but that's not the intention
here. At least in my experience, un-VACUUM-able tables have gotten
several orders of magnitude more common since Andres put those changes
in. As far as I can recall, EDB has not had this many instances of
different customers reporting the same problem since the 9.3-era
multixact issues. So far, this does not rise to that level, but it is
by no means a negligible issue, either. I believe it deserves to be
taken quite seriously, especially because the existing options for
helping customers with this kind of problem are so limited.
Now, if this goes into v14, we can certainly stick it up on github, or
put it out there in some other way for users to download,
self-compile, and install, but that seems noticeably less convenient
for people who need it, and I'm not clear what the benefit to the
project is.
But can updates to this tool fit the release schedule and
policy of PostgreSQL?
While investigating a problem using this tool, we may want to
add a new feature to the tool because it's necessary for the investigation.
But users would need to wait for the next minor version release to use this
new feature.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hi!
On 14 July 2020, at 02:12, Robert Haas <robertmhaas@gmail.com> wrote:
So I have these questions:
- Do people think it would be smart/good/useful to include something
like this in PostgreSQL?
- If so, how? I would propose a new contrib module that we back-patch
all the way
My 0.05₽.
At Yandex we used to fix similar corruption with our pg_dirty_hands extension [0].
But then we developed our internal pg_heapcheck module (unfortunately we did not publish it) and incorporated aggressive recovery into heapcheck.
Now that the community has an official heapcheck, I think it's worth keeping detection and fixing tools together.
Best regards, Andrey Borodin.
[0]: https://github.com/dsarafan/pg_dirty_hands/blob/master/src/pg_dirty_hands.c
On 2020-07-14 02:41, Robert Haas wrote:
I think that our normal back-patching rules are based primarily on the
risk of breaking things, and a new contrib module carries a pretty
negligible risk of breaking anything that works today.
I think that all feature code ought to go through a beta cycle. So if
this code makes it to 14.0 or 14.1, then I'd consider backpatching it.
Now, if this goes into v14, we can certainly stick it up on github, or
put it out there in some other way for users to download,
self-compile, and install, but that seems noticeably less convenient
for people who need it, and I'm not clear what the benefit to the
project is.
In the meantime, if you're wizard enough to deal with this kind of
thing, you could also clone the module from the PG14 tree and build it
against older versions manually.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 14, 2020 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The more that I think about it, the more I think that the proposed
functions are tools for wizards only, and so I'm getting hesitant
about having them in contrib at all. We lack a better place to
put them, but that doesn't mean they should be there.
I understand that it's not too great when we give people access to
sharp tools and they hurt themselves with said tools. But this is open
source. That's how it goes.
I think you're attacking a straw man. I'm well aware of how open source
works, thanks. What I'm saying is that contrib is mostly seen to be
reasonably harmless stuff. Sure, you can overwrite data you didn't want
to with adminpack's pg_file_write. But that's the price of having such a
capability at all, and in general it's not hard for users to understand
both the uses and risks of that function. That statement does not apply
to the functions being proposed here. It doesn't seem like they could
possibly be safe to use without very specific expert advice --- and even
then, we're talking rather small values of "safe". So I wish we had some
other way to distribute them than via contrib.
The counterexample to this is pg_resetwal. The number of people who have
broken their database with that tool is not small.
That said, we could have a separate "class" of extensions/tools in the
distribution, and encourage packagers to pack them up as separate packages
for example. Technically they don't have to be in the same source
repository at all of course, but I have a feeling some of them might be a
lot easier to maintain if they are. And then the user would just have to
install something like "postgresql-14-wizardtools". They'd still be
available to everybody, of course, but at least the knives would be in a
closed drawer until intentionally picked up.
--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>
On Tue, Jul 14, 2020 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think you're attacking a straw man. I'm well aware of how open source
works, thanks. What I'm saying is that contrib is mostly seen to be
reasonably harmless stuff. Sure, you can overwrite data you didn't want
to with adminpack's pg_file_write. But that's the price of having such a
capability at all, and in general it's not hard for users to understand
both the uses and risks of that function. That statement does not apply
to the functions being proposed here. It doesn't seem like they could
possibly be safe to use without very specific expert advice --- and even
then, we're talking rather small values of "safe".
Would it be possible to make them safe(r)? For instance, truncate
only, don't freeze; only tuples whose visibility information is
corrupted; and only in non-catalog tables. What exactly is the risk in
that case? Foreign keys might not be satisfied, which might make it
impossible to restore a dump, but is that worse than what a DBA can do
anyway? I would think that it is not and would leave the database in a
state DBAs are much better equipped to deal with.
Or would it be possible to create a table like the original table
(minus any constraints) and copy all tuples with corrupted visibility
there before truncating to a dead line pointer?
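Putting the pieces proposed in this thread together, a quarantine workflow along those lines might look like the sketch below. The function names and signatures are as proposed upthread (assumed here to take a relation and a tid[]); the table names and TIDs are made up. Note that reading the damaged tuples may itself require force-freezing them first.

```sql
-- Hypothetical workflow; names and signatures are illustrative.

-- 1. Make the damaged tuples readable again:
SELECT heap_force_freeze('broken_tbl'::regclass,
                         ARRAY['(1280,14)']::tid[]);

-- 2. Stash them in a constraint-free copy of the table's structure:
CREATE TABLE quarantined (LIKE broken_tbl);
INSERT INTO quarantined
    SELECT * FROM broken_tbl
    WHERE ctid = ANY (ARRAY['(1280,14)']::tid[]);

-- 3. Only then truncate them to dead line pointers in the original:
SELECT heap_force_kill('broken_tbl'::regclass,
                       ARRAY['(1280,14)']::tid[]);
```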
Jochem
On Tue, Jul 14, 2020 at 3:08 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
In the meantime, if you're wizard enough to deal with this kind of
thing, you could also clone the module from the PG14 tree and build it
against older versions manually.
But what if you are NOT a wizard, and a wizard is giving you
directions? Then having to build from source is a real pain. And
that's normally the situation I'm in when a customer has this issue.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 14, 2020 at 4:59 AM Magnus Hagander <magnus@hagander.net> wrote:
The counterexample to this is pg_resetwal. The number of people who have broken their database with that tool is not small.
Very true.
That said, we could have a separate "class" of extensions/tools in the distribution, and encourage packagers to pack them up as separate packages for example. Technically they don't have to be in the same source repository at all of course, but I have a feeling some of them might be a lot easier to maintain if they are. And then the user would just have to install something like "postgresql-14-wizardtools". They'd still be available to everybody, of course, but at least the knives would be in a closed drawer until intentionally picked up.
I don't think that does much to help with the immediate problem here,
because people are being bitten by this problem *now* and a packaging
change like this will take a long time to happen and become standard
out there, but I think it's a good idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 14, 2020 at 1:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jul 14, 2020 at 4:59 AM Magnus Hagander <magnus@hagander.net>
wrote:
The counterexample to this is pg_resetwal. The number of people who have
broken their database with that tool is not small.
Very true.
That said, we could have a separate "class" of extensions/tools in the
distribution, and encourage packagers to pack them up as separate packages
for example. Technically they don't have to be in the same source
repository at all of course, but I have a feeling some of them might be a
lot easier to maintain if they are. And then the user would just have to
install something like "postgresql-14-wizardtools". They'd still be
available to everybody, of course, but at least the knives would be in a
closed drawer until intentionally picked up.
I don't think that does much to help with the immediate problem here,
because people are being bitten by this problem *now* and a packaging
change like this will take a long time to happen and become standard
out there, but I think it's a good idea.
I don't think that it necessarily has to be. As long as we're talking about
adding something and not actually changing their existing packages, getting
this into both yum and apt shouldn't be *that* hard, if it's coordinated
well with Christoph and Devrim (obviously that's based on my experience and
they will have to give a more complete answer themselves). It would be a
lot more complicated if it involved changing an existing package.
--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>
On Mon, Jul 13, 2020 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
But can updating this tool fit the release schedule and
policy of PostgreSQL?
While investigating the problem by using this tool, we may want to
add new feature into the tool because it's necessary for the investigation.
But users would need to wait for the next minor version release to use this
new feature.
Yeah, that's a point that needs careful thought. I don't think it
means that we shouldn't have something in core; after all, this is a
problem that is created in part by the way that PostgreSQL itself
works, and I think it would be quite unfriendly if we refused to do
anything about that in the core distribution. On the other hand, it
might be a good reason not to back-patch, which is something most
people don't seem enthusiastic about anyway.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 14, 2020 at 8:25 AM Magnus Hagander <magnus@hagander.net> wrote:
I don't think that it necessarily has to be. As long as we're talking about adding something and not actually changing their existing packages, getting this into both yum and apt shouldn't be *that* hard, if it's coordinated well with Christoph and Devrim (obviously that's based on my experience and they will have to give a more complete answer themselves). It would be a lot more complicated if it involved changing an existing package.
I mean, you presumably could not move pg_resetwal to this new package
in existing branches, right?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 14, 2020 at 4:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jul 14, 2020 at 8:25 AM Magnus Hagander <magnus@hagander.net> wrote:
I don't think that it necessarily has to be. As long as we're talking
about adding something and not actually changing their existing packages,
getting this into both yum and apt shouldn't be *that* hard, if it's
coordinated well with Christoph and Devrim (obviously that's based on my
experience and they will have to give a more complete answer themselves).
It would be a lot more complicated if it involved changing an existing
package.
I mean, you presumably could not move pg_resetwal to this new package
in existing branches, right?
Probably and eventually. But that can be done for 14+ (or 13+ depending on
how "done" the packaging is there -- we should just make sure that hits the
biggest platform in the same release).
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Greetings,
* Magnus Hagander (magnus@hagander.net) wrote:
On Tue, Jul 14, 2020 at 4:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jul 14, 2020 at 8:25 AM Magnus Hagander <magnus@hagander.net> wrote:
I don't think that it necessarily has to be. As long as we're talking
about adding something and not actually changing their existing packages,
getting this into both yum and apt shouldn't be *that* hard, if it's
coordinated well with Christoph and Devrim (obviously that's based on my
experience and they will have to give a more complete answer themselves).
It would be a lot more complicated if it involved changing an existing
package.
I mean, you presumably could not move pg_resetwal to this new package
in existing branches, right?
Probably and eventually. But that can be done for 14+ (or 13+ depending on
how "done" the packaging is there -- we should just make sure that hits the
biggest platform in the same release).
Considering we just got rid of the -contrib independent package on at
least Debian-based systems, it doesn't really seem likely that the
packagers are going to be anxious to create a new one- they are not
without costs.
Also, in such dire straits as this thread is contemplating, I would
think we'd *really* like to have access to these tools with as small an
amount of change as absolutely possible to the system: what if
pg_extension itself got munged and we aren't able to install this new
contrib module, for example?
I would suggest that, instead, we make this part of core, but have it be
in a relatively clearly marked special schema that isn't part of
search_path by default- eg: pg_hacks, or pg_dirty_hands (I kinda like
the latter though it seems a bit unprofessional for us).
I'd also certainly be in support of having a contrib module with the
same functions that's independent from core and available and able to be
installed on pre-v14 systems. I'd further support having another repo
that's "core maintained" or however we want to phrase it which includes
this proposed module (and possibly all of contrib) and which has a
different release cadence and requirements for what gets into it, has
its own packages, etc.
Thanks,
Stephen
On 2020-Jul-13, Andres Freund wrote:
Hi,
On 2020-07-13 17:12:18 -0400, Robert Haas wrote:
1. There's nothing to identify the tuple that has the problem, and no
way to know how many more of them there might be. Back-patching
b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
part of this.
Not fully, I'm afraid. Afaict it doesn't currently tell you the item
pointer offset, just the block number, right? We probably should extend
it to also include the offset...
Just having the block number is already a tremendous step forward; with
that you can ask the customer to send a pageinspect dump of tuple
headers, and then the problem is obvious. Now if you want to add block
number to that, by all means do so.
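Such a dump can be produced with pageinspect's heap_page_items(); a sketch, where "some_table" and block 0 stand in for whatever relation and block the VACUUM error reported:

```sql
-- Sketch only: dump tuple headers for the block VACUUM complained about.
-- "some_table" and block 0 are placeholders for the reported relation/block.
CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT lp, lp_flags, t_xmin, t_xmax, t_infomask, t_infomask2
FROM heap_page_items(get_raw_page('some_table', 0));
```

Comparing t_xmin and the infomask hint bits against the table's relfrozenxid usually makes the problem obvious.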
FWIW I do support the idea of backpatching the vacuum errcontext commit.
One useful thing to do is to mark a tuple frozen unconditionally if it's
marked hinted XMIN_COMMITTED; no need to consult pg_clog in that case.
The attached (for 9.6) does that; IIRC it would have helped in a couple
of cases.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
0001-Don-t-test-for-xact-commit-if-tuple-is-hinted.patch (text/x-diff; charset=us-ascii)
From 2d5574e4e3caa0299d27d77f0fc319463fc84162 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 18 Jul 2019 12:50:11 -0400
Subject: [PATCH] Don't test for xact commit if tuple is hinted
---
src/backend/access/heap/heapam.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fe3e83b489..d42bc6502c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6724,7 +6724,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
if (TransactionIdPrecedes(xid, cutoff_xid))
{
- if (!TransactionIdDidCommit(xid))
+ if (!HeapTupleHeaderXminCommitted(tuple) &&
+ !TransactionIdDidCommit(xid))
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
--
2.20.1
Hi,
On 2020-07-13 21:18:10 -0400, Robert Haas wrote:
On Mon, Jul 13, 2020 at 9:10 PM Andres Freund <andres@anarazel.de> wrote:
What if clog has been truncated so that the xmin can't be looked up?
That's possible, but probably only in cases where xmin actually
committed.
Isn't that the normal case? I'm imagining something like:
- Tuple gets inserted. Transaction commits.
- VACUUM processes table.
- Mischievous fairies mark page all-visible in the visibility map.
- VACUUM runs lots more times, relfrozenxid advances, but without ever
looking at the page in question, because it's all-visible.
- clog is truncated, rendering xmin no longer accessible.
- User runs VACUUM disabling page skipping, gets ERROR.
- User deletes offending tuple.
- At this point, I think the tuple is both invisible and unprunable?
- Fairies happy, user sad.
I'm not saying it's impossible that that happens, but the cases I did
investigate didn't look like this. If something just roguely wrote to
the VM I'd expect a lot more "is not marked all-visible but visibility
map bit is set in relation" type WARNINGs, and I've not seen much of
those (they're WARNINGs though, so maybe we wouldn't). Presumably this
wouldn't always just happen with tuples that'd trigger an error first
during hot pruning.
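(As an aside, the pg_visibility extension can cross-check the visibility map against the heap; a sketch, with "some_table" as a placeholder:)

```sql
-- Sketch: pg_check_visible() returns TIDs of tuples that are not actually
-- all-visible even though their page is marked all-visible in the VM.
CREATE EXTENSION IF NOT EXISTS pg_visibility;

SELECT * FROM pg_check_visible('some_table');
```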
I've definitely seen indications of both datfrozenxid and relfrozenxid
getting corrupted (in particular vac_update_datfrozenxid being racy as
hell), xid wraparound, indications of multixact problems (although it's
possible we've now fixed those) and some signs of corrupted relcache
entries for shared relations leading to vacuums being skipped.
Greetings,
Andres Freund
Hi,
On 2020-07-14 13:20:25 -0400, Alvaro Herrera wrote:
On 2020-Jul-13, Andres Freund wrote:
Hi,
On 2020-07-13 17:12:18 -0400, Robert Haas wrote:
1. There's nothing to identify the tuple that has the problem, and no
way to know how many more of them there might be. Back-patching
b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
part of this.
Not fully, I'm afraid. Afaict it doesn't currently tell you the item
pointer offset, just the block number, right? We probably should extend
it to also include the offset...
Just having the block number is already a tremendous step forward; with
that you can ask the customer to send a pageinspect dump of tuple
headers, and then the problem is obvious. Now if you want to add block
number to that, by all means do so.
offset number I assume?
One useful thing to do is to mark a tuple frozen unconditionally if it's
marked hinted XMIN_COMMITTED; no need to consult pg_clog in that case.
The attached (for 9.6) does that; IIRC it would have helped in a couple
of cases.
I think it might also have hidden corruption in at least one case where
we subsequently fixed a bug (and helped detect at least one unfixed
bug). That should only be possible if either required clog has been
removed, or if relfrozenxid/datfrozenxid are corrupt, right?
Greetings,
Andres Freund
Hi,
On 2020-07-14 07:51:27 -0400, Robert Haas wrote:
On Tue, Jul 14, 2020 at 3:08 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
In the meantime, if you're wizard enough to deal with this kind of
thing, you could also clone the module from the PG14 tree and build it
against older versions manually.
But what if you are NOT a wizard, and a wizard is giving you
directions? Then having to build from source is a real pain. And
that's normally the situation I'm in when a customer has this issue.
The "found xmin ... from before relfrozenxid ..." cases should all be
fixable without needing such a function, and without it making fixing
them significantly easier, no? As far as I understand your suggested
solution, you need the tid(s) of these tuples, right? If you have those,
I don't think it's meaningfully harder to INSERT ... DELETE WHERE ctid =
.... or something like that.
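Spelled out, that could look something like this (a sketch; the table name and TID are placeholders, and it assumes the damaged tuple's data is still readable):

```sql
-- Sketch: copy the damaged tuple forward, then delete the original by TID.
-- "(0,4)" is a placeholder for the TID reported by VACUUM.
BEGIN;
INSERT INTO some_table SELECT * FROM some_table WHERE ctid = '(0,4)';
DELETE FROM some_table WHERE ctid = '(0,4)';
COMMIT;
```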
ISTM that the hard part is finding all problematic tuples in an
efficient manner (i.e. that doesn't require one manual VACUUM for each
individual block + parsing VACUUM's error message), not "fixing" those
tuples.
Greetings,
Andres Freund
On 2020-Jul-14, Andres Freund wrote:
Hi,
On 2020-07-14 13:20:25 -0400, Alvaro Herrera wrote:
Just having the block number is already a tremendous step forward; with
that you can ask the customer to send a pageinspect dump of tuple
headers, and then the problem is obvious. Now if you want to add block
number to that, by all means do so.
offset number I assume?
Eh, yeah, that.
One useful thing to do is to mark a tuple frozen unconditionally if it's
marked hinted XMIN_COMMITTED; no need to consult pg_clog in that case.
The attached (for 9.6) does that; IIRC it would have helped in a couple
of cases.
I think it might also have hidden corruption in at least one case where
we subsequently fixed a bug (and helped detect at least one unfixed
bug). That should only be possible if either required clog has been
removed, or if relfrozenxid/datfrozenxid are corrupt, right?
Yes, that's precisely the reason I never submitted it :-)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 14, 2020 at 3:42 PM Andres Freund <andres@anarazel.de> wrote:
The "found xmin ... from before relfrozenxid ..." cases should all be
fixable without needing such a function, and without it making fixing
them significantly easier, no? As far as I understand your suggested
solution, you need the tid(s) of these tuples, right? If you have those,
I don't think it's meaningfully harder to INSERT ... DELETE WHERE ctid =
.... or something like that.
ISTM that the hard part is finding all problematic tuples in an
efficient manner (i.e. that doesn't require one manual VACUUM for each
individual block + parsing VACUUMs error message), not "fixing" those
tuples.
I haven't tried the INSERT ... DELETE approach, but I've definitely
seen a case where a straight UPDATE did not fix the problem; VACUUM
continued failing afterwards. In that case, it was a system catalog
that was affected, and not one where TRUNCATE + re-INSERT was remotely
practical. The only solution I could come up with was to drop the
database and recreate it. Fortunately in that case the affected
database didn't seem to have any actual data in it, but if it had been
a 1TB database I think we would have been in really bad trouble.
Do you have a reason for believing that INSERT ... DELETE is going to
be better than UPDATE? It seems to me that either way you can end up
with a deleted and thus invisible tuple that you still can't get rid
of.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2020-07-14 15:59:21 -0400, Robert Haas wrote:
On Tue, Jul 14, 2020 at 3:42 PM Andres Freund <andres@anarazel.de> wrote:
The "found xmin ... from before relfrozenxid ..." cases should all be
fixable without needing such a function, and without it making fixing
them significantly easier, no? As far as I understand your suggested
solution, you need the tid(s) of these tuples, right? If you have those,
I don't think it's meaningfully harder to INSERT ... DELETE WHERE ctid =
.... or something like that.
ISTM that the hard part is finding all problematic tuples in an
efficient manner (i.e. that doesn't require one manual VACUUM for each
individual block + parsing VACUUM's error message), not "fixing" those
tuples.
I haven't tried the INSERT ... DELETE approach, but I've definitely
seen a case where a straight UPDATE did not fix the problem; VACUUM
continued failing afterwards.
The only way I can see that to happen is for the old tuple's multixact
being copied forward. That'd not happen with INSERT ... DELETE.
In that case, it was a system catalog
that was affected, and not one where TRUNCATE + re-INSERT was remotely
practical.
FWIW, a rewriting ALTER TABLE would likely also fix it. But obviously
that'd require allow_system_table_mods...
Do you have a reason for believing that INSERT ... DELETE is going to
be better than UPDATE? It seems to me that either way you can end up
with a deleted and thus invisible tuple that you still can't get rid
of.
None of the "new" checks around freezing would apply to deleted
tuples. So we shouldn't fail with an error like $subject.
Greetings,
Andres Freund
On Wed, Jul 15, 2020 at 11:41 AM Andres Freund <andres@anarazel.de> wrote:
Do you have a reason for believing that INSERT ... DELETE is going to
be better than UPDATE? It seems to me that either way you can end up
with a deleted and thus invisible tuple that you still can't get rid
of.
None of the "new" checks around freezing would apply to deleted
tuples. So we shouldn't fail with an error like $subject.
It can definitely happen at least transiently:
S1:
rhaas=# create table wubble (a int, b text);
CREATE TABLE
rhaas=# insert into wubble values (1, 'glumpf');
INSERT 0 1
S2:
rhaas=# begin transaction isolation level repeatable read;
BEGIN
rhaas=*# select * from wubble;
a | b
---+--------
1 | glumpf
(1 row)
S1:
rhaas=# delete from wubble;
DELETE 1
rhaas=# update pg_class set relfrozenxid =
(relfrozenxid::text::integer + 1000000)::text::xid where relname =
'wubble';
UPDATE 1
rhaas=# vacuum verbose wubble;
INFO: vacuuming "public.wubble"
ERROR: found xmin 528 from before relfrozenxid 1000527
CONTEXT: while scanning block 0 of relation "public.wubble"
S2:
rhaas=*# commit;
COMMIT
S1:
rhaas=# vacuum verbose wubble;
INFO: vacuuming "public.wubble"
INFO: "wubble": removed 1 row versions in 1 pages
INFO: "wubble": found 1 removable, 0 nonremovable row versions in 1
out of 1 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 531
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "wubble": truncated 1 to 0 pages
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: vacuuming "pg_toast.pg_toast_16415"
INFO: index "pg_toast_16415_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_toast_16415": found 0 removable, 0 nonremovable row
versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 532
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
I see your point, though: the tuple has to be able to survive
HOT-pruning in order to cause a problem when we check whether it needs
freezing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 16, 2020 at 10:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
I see your point, though: the tuple has to be able to survive
HOT-pruning in order to cause a problem when we check whether it needs
freezing.
Here's an example where the new sanity checks fail on an invisible
tuple without any concurrent transactions:
$ initdb
$ pg_ctl start -l ~/logfile
$ createdb
$ psql
create table simpsons (a int, b text);
vacuum freeze;
$ cat > txid.sql
select txid_current();
$ pgbench -t 131072 -c 8 -j 8 -n -f txid.sql
$ psql
insert into simpsons values (1, 'homer');
$ pg_ctl stop
$ pg_resetwal -x 1000 $PGDATA
$ pg_ctl start -l ~/logfile
$ psql
update pg_class set relfrozenxid = (relfrozenxid::text::integer +
2000000)::text::xid where relname = 'simpsons';
rhaas=# select * from simpsons;
a | b
---+---
(0 rows)
rhaas=# vacuum simpsons;
ERROR: found xmin 1049082 from before relfrozenxid 2000506
CONTEXT: while scanning block 0 of relation "public.simpsons"
This is a fairly insane situation, because we should have relfrozenxid
< tuple xid < xid counter, but instead we have xid counter < tuple xid
< relfrozenxid, but it demonstrates that it's possible to have a
database which is sufficiently corrupt that you can't escape from the
new sanity checks using only INSERT, UPDATE, and DELETE.
Now, an even easier way to create a table with a tuple that prevents
vacuuming and also can't just be deleted is to simply remove a
required pg_clog file (and maybe restart the server to clear out any
cached data in the SLRUs). What we typically do with customers who
need to recover from that situation today is give them a script to
fabricate a bogus CLOG file that shows all transactions as committed
(or, perhaps, aborted). But I think that the tools proposed on this
thread might be a better approach in certain cases. If the problem is
that a pg_clog file vanished, then recreating it with whatever content
you think is closest to what was probably there before is likely the
best you can do. But if you've got some individual tuples with crazy
xmin values, you don't really want to drop matching files in pg_clog;
it's better to fix the tuples.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi All,
Attached is the patch that adds heap_force_kill(regclass, tid[]) and
heap_force_freeze(regclass, tid[]) functions which Robert mentioned in the
first email in this thread. The patch basically adds an extension named
pg_surgery that contains these functions. Please have a look and let me
know your feedback. Thank you.
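For reference, the patch's regression test exercises the functions like this:

```sql
CREATE EXTENSION pg_surgery;

-- Force-freeze, then force-kill, the tuple at TID (0, 4) of table htab
-- (names taken from the regression test included in the patch).
SELECT heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
SELECT heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
```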
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Thu, Jul 16, 2020 at 9:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
Attachments:
0001-Add-contrib-pg_surgery-to-perform-surgery-on-the-dam.patch (text/x-patch; charset=US-ASCII)
From 75a6ba0e86b6d199201477c5ebd60258ef16f181 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Fri, 24 Jul 2020 11:26:45 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on the damaged heap
tables.
This contrib module basically adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used in the scripts
to perform surgery on the damaged heap tables.
Ashutosh Sharma.
---
contrib/Makefile | 1 +
contrib/pg_surgery/Makefile | 23 ++
contrib/pg_surgery/expected/pg_surgery.out | 109 +++++++++
contrib/pg_surgery/heap_surgery_funcs.c | 356 +++++++++++++++++++++++++++++
contrib/pg_surgery/pg_surgery--1.0.sql | 14 ++
contrib/pg_surgery/pg_surgery.control | 5 +
contrib/pg_surgery/sql/pg_surgery.sql | 72 ++++++
7 files changed, 580 insertions(+)
create mode 100644 contrib/pg_surgery/Makefile
create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
create mode 100644 contrib/pg_surgery/heap_surgery_funcs.c
create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
create mode 100644 contrib/pg_surgery/pg_surgery.control
create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..07d5734 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
pg_standby \
pg_stat_statements \
pg_trgm \
+ pg_surgery \
pgcrypto \
pgrowlocks \
pgstattuple \
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..baf2a88
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+ $(WIN32RES) \
+ heap_surgery_funcs.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out
new file mode 100644
index 0000000..98419b0
--- /dev/null
+++ b/contrib/pg_surgery/expected/pg_surgery.out
@@ -0,0 +1,109 @@
+create extension pg_surgery;
+--
+-- test pg_surgery functions with all the supported relations. Should pass.
+--
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+select * from htab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select ctid, xmax from htab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,4) | 0
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from htab where ctid = '(0, 4)';
+ a
+---
+(0 rows)
+
+rollback;
+-- toast table.
+begin;
+create table ttab(a text);
+insert into ttab select string_agg(chr(floor(random() * 26)::int + 65), '') from generate_series(1,10000);
+select * from ttab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select ctid, xmax from ttab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,1) | 0
+(1 row)
+
+select heap_force_kill('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from ttab where ctid = '(0, 1)';
+ a
+---
+(0 rows)
+
+rollback;
+--
+-- test pg_surgery functions with the unsupported relations. Should fail.
+--
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create index ptab_idx on ptab (a);
+-- indexes are not supported. so, all these should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create view vw as select 1;
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create materialized view mvw as select 10;
+-- materialized views are not supported as well. so, all these should fail.
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "mvw" is not a table or TOAST table
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "mvw" is not a table or TOAST table
+create sequence seq;
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+-- cleanup.
+drop materialized view mvw;
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/contrib/pg_surgery/heap_surgery_funcs.c b/contrib/pg_surgery/heap_surgery_funcs.c
new file mode 100644
index 0000000..ac60ad7
--- /dev/null
+++ b/contrib/pg_surgery/heap_surgery_funcs.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * heap_surgery_funcs.c
+ * Functions to perform surgery on the damaged heap table.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pg_surgery/heap_surgery_funcs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "catalog/pg_am_d.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/rel.h"
+
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif
+
+/*
+ * Macros for accessing the item pointer array. Taken from the intarray contrib
+ * header.
+ */
+#define ARRPTR(x) ((ItemPointer) ARR_DATA_PTR(x))
+#define ARRNELEMS(x) ArrayGetNItems(ARR_NDIM(x), ARR_DIMS(x))
+
+/* Reject arrays we can't handle. */
+#define CHECKARRVALID(x) \
+ do { \
+ if (ARR_HASNULL(x) && array_contains_nulls(x)) \
+ ereport(ERROR, \
+ (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED), \
+ errmsg("array must not contain nulls"))); \
+ } while(0)
+
+#define ARRISEMPTY(x) (ARRNELEMS(x) == 0)
+
+/* Sort the elements of the array. */
+#define SORT(x) \
+ do { \
+ int _nelems_ = ARRNELEMS(x); \
+ if (_nelems_ > 1) \
+ qsort((void*) ARRPTR(x), _nelems_, sizeof(ItemPointerData), \
+ tidcmp); \
+ } while(0)
+
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HTupleForceOption
+{
+ FORCE_KILL,
+ FORCE_FREEZE
+} HTupleForceOption;
+
+PG_FUNCTION_INFO_V1(heap_force_kill);
+PG_FUNCTION_INFO_V1(heap_force_freeze);
+
+static int32 tidcmp(const void *a, const void *b);
+static Datum heap_force_common(FunctionCallInfo fcinfo,
+ HTupleForceOption force_opt);
+static void check_relation_relkind(Relation rel);
+static void verify_tid(ItemPointer tid, BlockNumber *blkno,
+ OffsetNumber *offno);
+static BlockNumber tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr,
+ OffsetNumber *offnos);
+
+/*-------------------------------------------------------------------------
+ * heap_force_kill()
+ *
+ * Force kill the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_kill(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_kill(PG_FUNCTION_ARGS)
+{
+ return heap_force_common(fcinfo, FORCE_KILL);
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_freeze()
+ *
+ * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_freeze(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_freeze(PG_FUNCTION_ARGS)
+{
+ return heap_force_common(fcinfo, FORCE_FREEZE);
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_common()
+ *
+ * Common code for heap_force_kill and heap_force_freeze
+ *-------------------------------------------------------------------------
+ */
+static Datum
+heap_force_common(FunctionCallInfo fcinfo, HTupleForceOption force_opt)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids;
+ Relation rel;
+ Buffer buf;
+ Page page;
+ ItemId itemid;
+ BlockNumber blkno;
+ OffsetNumber *offnos;
+ OffsetNumber offno,
+ noffs,
+ curr_start_ptr,
+ next_start_ptr;
+ int i;
+
+ Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("%s cannot be executed during recovery.",
+ force_opt == FORCE_KILL ? "heap_force_kill()" :
+ "heap_force_freeze()")));
+
+ /* Basic sanity checking. */
+ CHECKARRVALID(ta);
+
+ if (ARRISEMPTY(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (!superuser() && GetUserId() != rel->rd_rel->relowner)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser or object owner to use %s.",
+ force_opt == FORCE_KILL ? "heap_force_kill()" :
+ "heap_force_freeze()")));
+
+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ check_relation_relkind(rel);
+
+ SORT(ta);
+
+ tids = ARRPTR(ta);
+ ntids = ARRNELEMS(ta);
+
+ offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+ noffs = curr_start_ptr = next_start_ptr = 0;
+
+ do
+ {
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ CHECK_FOR_INTERRUPTS();
+
+ buf = ReadBuffer(rel, blkno);
+ LockBufferForCleanup(buf);
+
+ page = BufferGetPage(buf);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+ Assert(noffs <= PageGetMaxOffsetNumber(page));
+
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (i = 0; i < noffs; i++)
+ {
+ itemid = PageGetItemId(page, offnos[i]);
+
+ /* Follow any redirections until we find something useful. */
+ while (ItemIdIsRedirected(itemid))
+ {
+ offno = ItemIdGetRedirect(itemid);
+ itemid = PageGetItemId(page, offno);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /* Nothing to do if the itemid is unused or already dead. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+ continue;
+
+ Assert(ItemIdIsNormal(itemid));
+
+ if (force_opt == FORCE_KILL)
+ ItemIdSetDead(itemid);
+ else if (force_opt == FORCE_FREEZE)
+ {
+ HeapTupleHeader htup;
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
+ HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
+
+ /*
+ * Clear all the visibility-related bits of this tuple and mark
+ * it as frozen.
+ */
+ htup->t_infomask &= ~HEAP_XACT_MASK;
+ htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID);
+ }
+ else
+ {
+ /* shouldn't happen. */
+ elog(ERROR, "invalid heap tuple force option: %d", force_opt);
+ }
+ }
+
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+
+ /*
+ * Update the current start pointer before fetching next set of offset
+ * numbers from the tids belonging to the same page.
+ */
+ curr_start_ptr = next_start_ptr;
+ } while (next_start_ptr != ntids);
+
+ relation_close(rel, AccessShareLock);
+
+ pfree(ta);
+ pfree(offnos);
+
+ PG_RETURN_VOID();
+}
+
+/*-------------------------------------------------------------------------
+ * tidcmp()
+ *
+ * Compare two item pointers, return -1, 0, or +1.
+ *
+ * See ItemPointerCompare for details.
+ * ------------------------------------------------------------------------
+ */
+static int32
+tidcmp(const void *a, const void *b)
+{
+ ItemPointer iptr1 = ((const ItemPointer) a);
+ ItemPointer iptr2 = ((const ItemPointer) b);
+
+ return ItemPointerCompare(iptr1, iptr2);
+}
+
+/*-------------------------------------------------------------------------
+ * check_relation_relkind()
+ *
+ * Check if the given relation is of the relkind supported by the callers.
+ * ------------------------------------------------------------------------
+ */
+static void
+check_relation_relkind(Relation rel)
+{
+ if (rel->rd_rel->relkind != RELKIND_RELATION &&
+ rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table or TOAST table",
+ RelationGetRelationName(rel))));
+}
+
+/*-------------------------------------------------------------------------
+ * verify_tid()
+ *
+ * Verify the given item pointer.
+ * ------------------------------------------------------------------------
+ */
+static void
+verify_tid(ItemPointer tid, BlockNumber *blkno, OffsetNumber *offno)
+{
+ if (!ItemPointerIsValid(tid))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid item pointer")));
+
+ *blkno = ItemPointerGetBlockNumberNoCheck(tid);
+ *offno = ItemPointerGetOffsetNumberNoCheck(tid);
+
+ if (!BlockNumberIsValid(*blkno))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+ if (!OffsetNumberIsValid(*offno))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid offset number")));
+}
+
+/*-------------------------------------------------------------------------
+ * tids_same_page_fetch_offnums()
+ *
+ * Find out all the tids residing in the same page as tids[next_start_ptr] and
+ * fetch the offset number stored in each of them into a caller-allocated offset
+ * number array.
+ * ------------------------------------------------------------------------
+ */
+static BlockNumber
+tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr, OffsetNumber *offnos)
+{
+ int i;
+ BlockNumber prev_blkno = InvalidBlockNumber,
+ blkno;
+ OffsetNumber offno;
+
+ for (i = *next_start_ptr; i < ntids; i++)
+ {
+ ItemPointerData tid = tids[i];
+
+ verify_tid(&tid, &blkno, &offno);
+
+ if (i == *next_start_ptr || (prev_blkno == blkno))
+ offnos[i - *next_start_ptr] = offno;
+ else
+ break;
+
+ prev_blkno = blkno;
+ }
+
+ *next_start_ptr = i;
+ return prev_blkno;
+}
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql
new file mode 100644
index 0000000..0878be6
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery--1.0.sql
@@ -0,0 +1,14 @@
+/* contrib/pg_surgery/pg_surgery--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit
+
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control
new file mode 100644
index 0000000..31d8cab
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery.control
@@ -0,0 +1,5 @@
+# pg_surgery extension
+comment = 'extension to perform surgery the damaged heap table'
+default_version = '1.0'
+module_pathname = '$libdir/pg_surgery'
+relocatable = true
diff --git a/contrib/pg_surgery/sql/pg_surgery.sql b/contrib/pg_surgery/sql/pg_surgery.sql
new file mode 100644
index 0000000..936e28b
--- /dev/null
+++ b/contrib/pg_surgery/sql/pg_surgery.sql
@@ -0,0 +1,72 @@
+create extension pg_surgery;
+
+--
+-- test pg_surgery functions with all the supported relations. Should pass.
+--
+
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+
+select * from htab where xmin = 2;
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select ctid, xmax from htab where xmin = 2;
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select * from htab where ctid = '(0, 4)';
+rollback;
+
+-- toast table.
+begin;
+create table ttab(a text);
+insert into ttab select string_agg(chr(floor(random() * 26)::int + 65), '') from generate_series(1,10000);
+
+select * from ttab where xmin = 2;
+select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+select ctid, xmax from ttab where xmin = 2;
+
+select heap_force_kill('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+select * from ttab where ctid = '(0, 1)';
+rollback;
+
+--
+-- test pg_surgery functions with the unsupported relations. Should fail.
+--
+
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create index ptab_idx on ptab (a);
+
+-- indexes are not supported. so, all these should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create view vw as select 1;
+
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create materialized view mvw as select 10;
+
+-- materialized views are not supported as well. so, all these should fail.
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create sequence seq;
+
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+
+-- cleanup.
+drop materialized view mvw;
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
--
1.8.3.1
On 24 July 2020, at 14:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Attached is the patch that adds heap_force_kill(regclass, tid[]) and heap_force_freeze(regclass, tid[]) functions which Robert mentioned in the first email in this thread. The patch basically adds an extension named pg_surgery that contains these functions. Please have a look and let me know your feedback. Thank you.
Thanks for the patch!
I have just few random thoughts.
I think here we should report that we haven't done what was asked.
+ /* Nothing to do if the itemid is unused or already dead. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+ continue;
Also, should we try to fix VM along the way?
Are there any caveats with concurrent VACUUM? (I do not see any, just asking)
It would be good to have some checks for interrupts in safe places.
I think we should not trust the user entirely here. I'd prefer validation and a graceful exit, not a core dump.
+ Assert(noffs <= PageGetMaxOffsetNumber(page));
For some reason we had unlogged versions of these functions. But I do not recall the exact rationale.
Also, I'd be happy if we had something like "Restore this tuple iff this does not break a unique constraint". To do so we need to sort tids by xmin/xmax, to revive the most recent data first.
Thanks!
Best regards, Andrey Borodin.
Hi,
Thanks for sharing your thoughts. Please find my comments inline below:
I think here we should report that we haven't done what was asked.
+ /* Nothing to do if the itemid is unused or already dead. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+ continue;
Okay. Will add a log message saying "skipping tid ... because ..."
Also, should we try to fix VM along the way?
I think we should let VACUUM do that.
Are there any caveats with concurrent VACUUM? (I do not see any, just
asking)
As of now, I don't see any.
It would be good to have some checks for interrupts in safe places.
I think I have already added those wherever I felt it was required. If you
feel it's missing somewhere, it would be good if you could point it out.
I think we should not trust user entierly here. I'd prefer validation and
graceful exit, not a core dump.
+ Assert(noffs <= PageGetMaxOffsetNumber(page));
Yeah, sounds reasonable. Will do that.
For some reason we had unlogged versions of these functions. But I do not
recall exact rationale..
Also, I'd be happy if we had something like "Restore this tuple iff this
does not break unique constraint". To do so we need to sort tids by
xmin\xmax, to revive most recent data first.
I didn't get this point. Could you please elaborate.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On 27 July 2020, at 09:36, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Also, should we try to fix VM along the way?
I think we should let VACUUM do that.
Sometimes VACUUM will not get to these pages, because they are marked All Frozen.
And possibly some tuples will get stale on this page again
Are there any caveats with concurrent VACUUM? (I do not see any, just asking)
As of now, I don't see any.
VACUUM has a collection of dead item pointers. We will not resurrect any of them, right?
It would be good to have some checks for interrupts in safe places.
I think I have already added those wherever I felt it was required. If you feel it's missing somewhere, it could be good if you could point it out.
Sorry, I just overlooked that call, everything is fine here.
Also, I'd be happy if we had something like "Restore this tuple iff this does not break unique constraint". To do so we need to sort tids by xmin\xmax, to revive most recent data first.
I didn't get this point. Could you please elaborate.
You may have 10 corrupted tuples for the same record, that was updated 9 times. And if you have a unique constraint on the table you may want to keep only the latest version of the row. So you want to kill 9 tuples and freeze 1.
Thanks!
Best regards, Andrey Borodin.
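Andrey's kill-nine-freeze-one scenario would translate into calls like these (the table name and tids here are purely hypothetical; in practice they would come from inspecting the damaged page):

```sql
-- Hypothetical: (0,1)..(0,9) are obsolete versions of a row, (0,10) is
-- the version to keep. Kill the old versions, then freeze the survivor.
SELECT heap_force_kill('t'::regclass,
       ARRAY['(0,1)','(0,2)','(0,3)','(0,4)','(0,5)',
             '(0,6)','(0,7)','(0,8)','(0,9)']::tid[]);
SELECT heap_force_freeze('t'::regclass, ARRAY['(0,10)']::tid[]);
```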
Hello Ashutosh,
On Fri, 24 Jul 2020 at 14:35, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hi All,
Attached is the patch that adds heap_force_kill(regclass, tid[]) and heap_force_freeze(regclass, tid[]) functions which Robert mentioned in the first email in this thread. The patch basically adds an extension named pg_surgery that contains these functions. Please have a look and let me know your feedback. Thank you.
Thanks for the patch.
1. We would be marking buffer dirty and writing wal even if we have
not done any changes( ex if we pass invalid/dead tids). Maybe we can
handle this better?
cosmetic changes
1. Maybe "HTupleSurgicalOption" instead of "HTupleForceOption" and
also the variable names could use surgery instead.
2. extension comment pg_surgery.control "extension to perform surgery
the damaged heap table" -> "extension to perform surgery on the
damaged heap table"
On Thu, Jul 16, 2020 at 9:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jul 16, 2020 at 10:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
I see your point, though: the tuple has to be able to survive
HOT-pruning in order to cause a problem when we check whether it needs
freezing.
Here's an example where the new sanity checks fail on an invisible
tuple without any concurrent transactions:
$ initdb
$ pg_ctl start -l ~/logfile
$ createdb
$ psql
create table simpsons (a int, b text);
vacuum freeze;
$ cat > txid.sql
select txid_current();
$ pgbench -t 131072 -c 8 -j 8 -n -f txid.sql
$ psql
insert into simpsons values (1, 'homer');
$ pg_ctl stop
$ pg_resetwal -x 1000 $PGDATA
$ pg_ctl start -l ~/logfile
$ psql
update pg_class set relfrozenxid = (relfrozenxid::text::integer +
2000000)::text::xid where relname = 'simpsons';
rhaas=# select * from simpsons;
 a | b
---+---
(0 rows)
rhaas=# vacuum simpsons;
ERROR: found xmin 1049082 from before relfrozenxid 2000506
CONTEXT: while scanning block 0 of relation "public.simpsons"
This is a fairly insane situation, because we should have relfrozenxid
< tuple xid < xid counter, but instead we have xid counter < tuple xid
< relfrozenxid, but it demonstrates that it's possible to have a
database which is sufficiently corrupt that you can't escape from the
new sanity checks using only INSERT, UPDATE, and DELETE.
Now, an even easier way to create a table with a tuple that prevents
vacuuming and also can't just be deleted is to simply remove a
required pg_clog file (and maybe restart the server to clear out any
cached data in the SLRUs). What we typically do with customers who
need to recover from that situation today is give them a script to
fabricate a bogus CLOG file that shows all transactions as committed
(or, perhaps, aborted). But I think that the tools proposed on this
thread might be a better approach in certain cases. If the problem is
that a pg_clog file vanished, then recreating it with whatever content
you think is closest to what was probably there before is likely the
best you can do. But if you've got some individual tuples with crazy
xmin values, you don't really want to drop matching files in pg_clog;
it's better to fix the tuples.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
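The bogus-CLOG fabrication Robert describes can be sketched as below. This is a hedged illustration, not a supported procedure: the segment name 0000 and the 256 kB segment size (8 kB BLCKSZ × 32 SLRU pages) are assumptions to verify against the damaged cluster, and the sketch writes to a scratch directory rather than a live $PGDATA (where the file would go under pg_xact/, or pg_clog/ before v10):

```shell
# Build a clog segment that reads as "committed" for every transaction.
# Each xact uses 2 status bits; 0x55 = 01010101 puts the "committed"
# pattern (0b01) in all four slots of every byte.
outdir=$(mktemp -d)
dd if=/dev/zero bs=8192 count=32 2>/dev/null | tr '\0' '\125' > "$outdir/0000"
# Inspect the result; a real repair would copy this to $PGDATA/pg_xact/0000.
ls -l "$outdir/0000"
```

Filling with 0x55 marks everything committed; using tr '\0' '\252' (0xAA, i.e. 0b10 in every slot) instead would mark everything aborted, which is the other choice Robert mentions.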
--
M Beena Emerson
Sr. Software Engineer
edbpostgres.com
I think we should let VACUUM do that.
Sometimes VACUUM will not get to these pages, because they are marked All
Frozen.
And possibly some tuples will get stale on this page again
Hmm, okay, will have a look into this. Thanks.
Are there any caveats with concurrent VACUUM? (I do not see any, just
asking)
As of now, I don't see any.
VACUUM has a collection of dead item pointers. We will not resurrect any of
them, right?
We won't be doing any such things.
It would be good to have some checks for interrupts in safe places.
I think I have already added those wherever I felt it was required. If
you feel it's missing somewhere, it could be good if you could point it out.
Sorry, I just overlooked that call, everything is fine here.
Okay, thanks for confirming.
Also, I'd be happy if we had something like "Restore this tuple iff
this does not break unique constraint". To do so we need to sort tids by
xmin\xmax, to revive most recent data first.
I didn't get this point. Could you please elaborate.
You may have 10 corrupted tuples for the same record, that was updated 9
times. And if you have unique constraint on the table you may want to have
only latest version of the row. So you want to kill 9 tuples and freeze 1.
Okay, in that case, users need to pass the tids of the 9 tuples that they
want to kill to the heap_force_kill function and the tid of the tuple that
they want marked as frozen to the heap_force_freeze function. Just to be
clear, this tool is not meant to detect data corruption; it is only meant
to remove/clean the corrupted data in a table so that operations like
vacuum and pg_dump/restore succeed. It is the user's responsibility to
identify the corrupted data and pass its tid to either of these functions
as per the requirement.
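To make that division of responsibility concrete, a repair session might look something like this (the table name and tid are placeholders; the tid would come from the VACUUM error message or a corruption-checking tool):

```sql
-- VACUUM failed with: found xmin ... from before relfrozenxid ...
-- Suppose inspection points at tuple (0,4) of "damaged_tab".
SELECT ctid, xmin, xmax FROM damaged_tab WHERE ctid = '(0,4)';
-- Remove it (or use heap_force_freeze() to keep it instead):
SELECT heap_force_kill('damaged_tab'::regclass, ARRAY['(0,4)']::tid[]);
-- Maintenance operations should now succeed:
VACUUM damaged_tab;
```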
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Hi Beena,
Thanks for the review.
1. We would be marking buffer dirty and writing wal even if we have
not done any changes( ex if we pass invalid/dead tids). Maybe we can
handle this better?
Yeah, we can skip this when nothing has changed. Will take care of it in
the next version of the patch.
cosmetic changes
1. Maybe "HTupleSurgicalOption" instead of "HTupleForceOption" and
also the variable names could use surgery instead.
I think that looks fine. I would rather prefer the word "Force" just
because all the enum options contain the word "Force" in it.
2. extension comment pg_surgery.control "extension to perform surgery
the damaged heap table" -> "extension to perform surgery on the
damaged heap table"
Okay, will fix that typo.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Tue, Jul 14, 2020 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
A number of EDB customers have had this error crop on their tables for
reasons that we have usually not been able to determine. In many
<long-shot>Do you happen to know if they ever used the
snapshot-too-old feature?</long-shot>
On Wed, Jul 29, 2020 at 3:23 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, Jul 14, 2020 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
A number of EDB customers have had this error crop on their tables for
reasons that we have usually not been able to determine. In many
<long-shot>Do you happen to know if they ever used the
snapshot-too-old feature?</long-shot>
I don't have any reason to believe that they did. Why?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 30, 2020 at 1:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jul 29, 2020 at 3:23 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, Jul 14, 2020 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
A number of EDB customers have had this error crop on their tables for
reasons that we have usually not been able to determine. In many
<long-shot>Do you happen to know if they ever used the
snapshot-too-old feature?</long-shot>
I don't have any reason to believe that they did. Why?
Nothing specific, I was just contemplating the problems with that
feature and the patches[1] proposed so far to fix some of them, and
what types of corruption might be possible due to that stuff, and it
occurred to me to ask if you'd thought about that in connection to
these reports.
[1]: /messages/by-id/CA+TgmoY=aqf0zjTD+3dUWYkgMiNDegDLFjo+6ze=Wtpik+3XqA@mail.gmail.com
On Wed, Jul 29, 2020 at 9:58 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:
I think we should let VACUUM do that.
Sometimes VACUUM will not get to these pages, because they are marked All
Frozen.
And possibly some tuples will get stale on this page again
Hmm, okay, will have a look into this. Thanks.
I had a look over this and found that one can use the DISABLE_PAGE_SKIPPING
option with VACUUM to disable all its page-skipping behavior.
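For reference, that invocation would be (htab being a placeholder table name):

```sql
-- Visit every page, including those marked all-visible/all-frozen in the VM:
VACUUM (DISABLE_PAGE_SKIPPING) htab;
```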
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attached is the new version of patch that addresses the comments from
Andrey and Beena.
On Wed, Jul 29, 2020 at 10:07 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:
Hi Beena,
Thanks for the review.
1. We would be marking buffer dirty and writing wal even if we have
not done any changes( ex if we pass invalid/dead tids). Maybe we can
handle this better?
Yeah, we can skip this when nothing has changed. Will take care of it in
the next version of the patch.
cosmetic changes
1. Maybe "HTupleSurgicalOption" instead of "HTupleForceOption" and
also the variable names could use surgery instead.
I think that looks fine. I would rather prefer the word "Force" just
because all the enum options contain the word "Force" in it.
2. extension comment pg_surgery.control "extension to perform surgery
the damaged heap table" -> "extension to perform surgery on the
damaged heap table"
Okay, will fix that typo.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachments:
v2-0001-Add-contrib-pg_surgery-to-perform-surgery-on-the-dam.patch (text/x-patch)
From e78c6fc6a99bba7f38482864a36dc812b0798464 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Fri, 31 Jul 2020 17:24:50 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on the damaged heap
tables.
This contrib module basically adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used in the scripts
to perform surgery on the damaged heap tables.
Ashutosh Sharma.
---
contrib/Makefile | 1 +
contrib/pg_surgery/Makefile | 23 ++
contrib/pg_surgery/expected/pg_surgery.out | 109 +++++++++
contrib/pg_surgery/heap_surgery_funcs.c | 378 +++++++++++++++++++++++++++++
contrib/pg_surgery/pg_surgery--1.0.sql | 14 ++
contrib/pg_surgery/pg_surgery.control | 5 +
contrib/pg_surgery/sql/pg_surgery.sql | 72 ++++++
7 files changed, 602 insertions(+)
create mode 100644 contrib/pg_surgery/Makefile
create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
create mode 100644 contrib/pg_surgery/heap_surgery_funcs.c
create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
create mode 100644 contrib/pg_surgery/pg_surgery.control
create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..07d5734 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
pg_standby \
pg_stat_statements \
+ pg_surgery \
pg_trgm \
pgcrypto \
pgrowlocks \
pgstattuple \
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..baf2a88
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+ $(WIN32RES) \
+ heap_surgery_funcs.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out
new file mode 100644
index 0000000..98419b0
--- /dev/null
+++ b/contrib/pg_surgery/expected/pg_surgery.out
@@ -0,0 +1,109 @@
+create extension pg_surgery;
+--
+-- test pg_surgery functions with all the supported relations. Should pass.
+--
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+select * from htab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select ctid, xmax from htab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,4) | 0
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from htab where ctid = '(0, 4)';
+ a
+---
+(0 rows)
+
+rollback;
+-- toast table.
+begin;
+create table ttab(a text);
+insert into ttab select string_agg(chr(floor(random() * 26)::int + 65), '') from generate_series(1,10000);
+select * from ttab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select ctid, xmax from ttab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,1) | 0
+(1 row)
+
+select heap_force_kill('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from ttab where ctid = '(0, 1)';
+ a
+---
+(0 rows)
+
+rollback;
+--
+-- test pg_surgery functions with the unsupported relations. Should fail.
+--
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create index ptab_idx on ptab (a);
+-- indexes are not supported. so, all these should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create view vw as select 1;
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create materialized view mvw as select 10;
+-- materialized views are not supported as well. so, all these should fail.
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "mvw" is not a table or TOAST table
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "mvw" is not a table or TOAST table
+create sequence seq;
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+-- cleanup.
+drop materialized view mvw;
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/contrib/pg_surgery/heap_surgery_funcs.c b/contrib/pg_surgery/heap_surgery_funcs.c
new file mode 100644
index 0000000..93f5d36
--- /dev/null
+++ b/contrib/pg_surgery/heap_surgery_funcs.c
@@ -0,0 +1,378 @@
+/*-------------------------------------------------------------------------
+ *
+ * heap_surgery_funcs.c
+ * Functions to perform surgery on the damaged heap table.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pg_surgery/heap_surgery_funcs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "catalog/pg_am_d.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/rel.h"
+
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif
+
+/*
+ * Macros for accessing the item pointer array. Taken from the intarray contrib
+ * header.
+ */
+#define ARRPTR(x) ((ItemPointer) ARR_DATA_PTR(x))
+#define ARRNELEMS(x) ArrayGetNItems(ARR_NDIM(x), ARR_DIMS(x))
+
+/* Reject arrays we can't handle. */
+#define CHECKARRVALID(x) \
+ do { \
+ if (ARR_HASNULL(x) && array_contains_nulls(x)) \
+ ereport(ERROR, \
+ (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED), \
+ errmsg("array must not contain nulls"))); \
+ } while(0)
+
+#define ARRISEMPTY(x) (ARRNELEMS(x) == 0)
+
+/* Sort the elements of the array. */
+#define SORT(x) \
+ do { \
+ int _nelems_ = ARRNELEMS(x); \
+ if (_nelems_ > 1) \
+ qsort((void*) ARRPTR(x), _nelems_, sizeof(ItemPointerData), \
+ tidcmp); \
+ } while(0)
+
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HTupleForceOption
+{
+ FORCE_KILL,
+ FORCE_FREEZE
+} HTupleForceOption;
+
+PG_FUNCTION_INFO_V1(heap_force_kill);
+PG_FUNCTION_INFO_V1(heap_force_freeze);
+
+static int32 tidcmp(const void *a, const void *b);
+static Datum heap_force_common(FunctionCallInfo fcinfo,
+ HTupleForceOption force_opt);
+static void check_relation_relkind(Relation rel);
+static void verify_tid(ItemPointer tid, BlockNumber *blkno,
+ OffsetNumber *offno);
+static BlockNumber tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr,
+ OffsetNumber *offnos);
+
+/*-------------------------------------------------------------------------
+ * heap_force_kill()
+ *
+ * Force kill the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_kill(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_kill(PG_FUNCTION_ARGS)
+{
+ return heap_force_common(fcinfo, FORCE_KILL);
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_freeze()
+ *
+ * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_freeze(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_freeze(PG_FUNCTION_ARGS)
+{
+ return heap_force_common(fcinfo, FORCE_FREEZE);
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_common()
+ *
+ * Common code for heap_force_kill and heap_force_freeze
+ *-------------------------------------------------------------------------
+ */
+static Datum
+heap_force_common(FunctionCallInfo fcinfo, HTupleForceOption force_opt)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids;
+ Relation rel;
+ Buffer buf;
+ Page page;
+ ItemId itemid;
+ BlockNumber blkno;
+ OffsetNumber *offnos;
+ OffsetNumber offno,
+ noffs,
+ curr_start_ptr,
+ next_start_ptr,
+ maxoffset;
+ int i,
+ nskippedItems;
+
+ Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("%s cannot be executed during recovery.",
+ force_opt == FORCE_KILL ? "heap_force_kill()" :
+ "heap_force_freeze()")));
+
+ /* Basic sanity checking. */
+ CHECKARRVALID(ta);
+
+ if (ARRISEMPTY(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (!superuser() && GetUserId() != rel->rd_rel->relowner)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser or object owner to use %s.",
+ force_opt == FORCE_KILL ? "heap_force_kill()" :
+ "heap_force_freeze()")));
+
+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ check_relation_relkind(rel);
+
+ SORT(ta);
+
+ tids = ARRPTR(ta);
+ ntids = ARRNELEMS(ta);
+
+ offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+ noffs = curr_start_ptr = next_start_ptr = 0;
+
+ do
+ {
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ CHECK_FOR_INTERRUPTS();
+
+ buf = ReadBuffer(rel, blkno);
+ LockBufferForCleanup(buf);
+
+ page = BufferGetPage(buf);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+
+ maxoffset = PageGetMaxOffsetNumber(page);
+
+ if (noffs > maxoffset)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("number of offsets specified for block %u exceeds the max offset number %u",
+ blkno, maxoffset)));
+
+ nskippedItems = 0;
+
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (i = 0; i < noffs; i++)
+ {
+ itemid = PageGetItemId(page, offnos[i]);
+
+ /* Follow any redirections until we find something useful. */
+ while (ItemIdIsRedirected(itemid))
+ {
+ offno = ItemIdGetRedirect(itemid);
+ itemid = PageGetItemId(page, offno);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /* Nothing to do if the itemid is unused or already dead. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+ {
+ nskippedItems++;
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) because it is already marked %s",
+ blkno, offnos[i],
+ ItemIdIsDead(itemid) ? "dead" : "unused")));
+ continue;
+ }
+
+ Assert(ItemIdIsNormal(itemid));
+
+ if (force_opt == FORCE_KILL)
+ ItemIdSetDead(itemid);
+ else if (force_opt == FORCE_FREEZE)
+ {
+ HeapTupleHeader htup;
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
+ HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
+
+ /*
+ * Clear all the visibility-related bits of this tuple and mark
+ * it as frozen.
+ */
+ htup->t_infomask &= ~HEAP_XACT_MASK;
+ htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID);
+ }
+ else
+ {
+ /* shouldn't happen. */
+ elog(ERROR, "invalid heap tuple force option: %d\n", force_opt);
+ }
+ }
+
+ if (nskippedItems == noffs)
+ goto skip_wal;
+
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+
+skip_wal:
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+
+ /*
+ * Update the current start pointer before fetching next set of offset
+ * numbers from the tids belonging to the same page.
+ */
+ curr_start_ptr = next_start_ptr;
+ } while (next_start_ptr != ntids);
+
+ relation_close(rel, AccessShareLock);
+
+ pfree(ta);
+ pfree(offnos);
+
+ PG_RETURN_VOID();
+}
+
+/*-------------------------------------------------------------------------
+ * tidcmp()
+ *
+ * Compare two item pointers, return -1, 0, or +1.
+ *
+ * See ItemPointerCompare for details.
+ * ------------------------------------------------------------------------
+ */
+static int32
+tidcmp(const void *a, const void *b)
+{
+ ItemPointer iptr1 = ((const ItemPointer) a);
+ ItemPointer iptr2 = ((const ItemPointer) b);
+
+ return ItemPointerCompare(iptr1, iptr2);
+}
+
+/*-------------------------------------------------------------------------
+ * check_relation_relkind()
+ *
+ * Check if the given relation is of the relkind supported by the callers.
+ * ------------------------------------------------------------------------
+ */
+static void
+check_relation_relkind(Relation rel)
+{
+ if (rel->rd_rel->relkind != RELKIND_RELATION &&
+ rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table or TOAST table",
+ RelationGetRelationName(rel))));
+}
+
+/*-------------------------------------------------------------------------
+ * verify_tid()
+ *
+ * Verify the given item pointer.
+ * ------------------------------------------------------------------------
+ */
+static void
+verify_tid(ItemPointer tid, BlockNumber *blkno, OffsetNumber *offno)
+{
+ if (!ItemPointerIsValid(tid))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid item pointer")));
+
+ *blkno = ItemPointerGetBlockNumberNoCheck(tid);
+ *offno = ItemPointerGetOffsetNumberNoCheck(tid);
+
+ if (!BlockNumberIsValid(*blkno))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+ if (!OffsetNumberIsValid(*offno))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid offset number")));
+}
+
+/*-------------------------------------------------------------------------
+ * tids_same_page_fetch_offnums()
+ *
+ * Find out all the tids residing in the same page as tids[next_start_ptr] and
+ * fetch the offset number stored in each of them into a caller-allocated offset
+ * number array.
+ * ------------------------------------------------------------------------
+ */
+static BlockNumber
+tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr, OffsetNumber *offnos)
+{
+ int i;
+ BlockNumber prev_blkno,
+ blkno;
+ OffsetNumber offno;
+
+ for (i = *next_start_ptr; i < ntids; i++)
+ {
+ ItemPointerData tid = tids[i];
+
+ verify_tid(&tid, &blkno, &offno);
+
+ if (i == *next_start_ptr || (prev_blkno == blkno))
+ offnos[i - *next_start_ptr] = offno;
+ else
+ break;
+
+ prev_blkno = blkno;
+ }
+
+ *next_start_ptr = i;
+ return prev_blkno;
+}
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql
new file mode 100644
index 0000000..0878be6
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery--1.0.sql
@@ -0,0 +1,14 @@
+/* contrib/pg_surgery/pg_surgery--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit
+
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control
new file mode 100644
index 0000000..e042ad4
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery.control
@@ -0,0 +1,5 @@
+# pg_surgery extension
+comment = 'extension to perform surgery on the damaged heap table'
+default_version = '1.0'
+module_pathname = '$libdir/pg_surgery'
+relocatable = true
diff --git a/contrib/pg_surgery/sql/pg_surgery.sql b/contrib/pg_surgery/sql/pg_surgery.sql
new file mode 100644
index 0000000..936e28b
--- /dev/null
+++ b/contrib/pg_surgery/sql/pg_surgery.sql
@@ -0,0 +1,72 @@
+create extension pg_surgery;
+
+--
+-- test pg_surgery functions with all the supported relations. Should pass.
+--
+
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+
+select * from htab where xmin = 2;
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select ctid, xmax from htab where xmin = 2;
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select * from htab where ctid = '(0, 4)';
+rollback;
+
+-- toast table.
+begin;
+create table ttab(a text);
+insert into ttab select string_agg(chr(floor(random() * 26)::int + 65), '') from generate_series(1,10000);
+
+select * from ttab where xmin = 2;
+select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+select ctid, xmax from ttab where xmin = 2;
+
+select heap_force_kill('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+select * from ttab where ctid = '(0, 1)';
+rollback;
+
+--
+-- test pg_surgery functions with the unsupported relations. Should fail.
+--
+
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create index ptab_idx on ptab (a);
+
+-- indexes are not supported. so, all these should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create view vw as select 1;
+
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create materialized view mvw as select 10;
+
+-- materialized views are not supported as well. so, all these should fail.
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create sequence seq;
+
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+
+-- cleanup.
+drop materialized view mvw;
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
--
1.8.3.1
On 31 Jul 2020, at 17:32, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Wed, Jul 29, 2020 at 9:58 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
I think we should let VACUUM do that.
Sometimes VACUUM will not get to these pages, because they are marked All Frozen.
And possibly some tuples will get stale on this page again.
Hmm, okay, will have a look into this. Thanks.
I had a look over this and found that one can use the DISABLE_PAGE_SKIPPING option with VACUUM to disable all its page-skipping behavior.
Oh, wow, I didn't know that. Thanks! This actually will do the trick.
I'll try to review your patch again next week.
Thanks!
Best regards, Andrey Borodin.
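For reference, the page-skipping override discussed above is invoked through VACUUM's parenthesized option list; a minimal example, assuming the table name htab from the regression test:

```sql
-- visit every page, including all-frozen ones, so the forcibly-frozen
-- tuples' storage gets cleaned up
VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) htab;
```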
On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Attached is the new version of patch that addresses the comments from Andrey and Beena.
+PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
the -> a
I also suggest: heap table -> relation, because we might want to
extend this to other cases later.
+-- toast table.
+begin;
+create table ttab(a text);
+insert into ttab select string_agg(chr(floor(random() * 26)::int +
65), '') from generate_series(1,10000);
+select * from ttab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
I don't understand the point of this. You're not testing the function
on the TOAST table; you're testing it on the main table when there
happens to be a TOAST table that is probably getting used for
something. But that's not really relevant to what is being tested
here, so as written this seems redundant with the previous cases.
+-- test pg_surgery functions with the unsupported relations. Should fail.
Please name the specific functions being tested here in case we add
more in the future that are tested separately.
+++ b/contrib/pg_surgery/heap_surgery_funcs.c
I think we could drop _funcs from the file name.
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif
The #ifdef here is not required, and if you look at other contrib
modules you'll see that they don't have it.
I don't like all the macros at the top of the file much. CHECKARRVALID
is only used in one place, so it seems to me that you might as well
just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
Once you do that, heap_force_common() can just compute the number of
array elements once, instead of doing it once inside ARRISEMPTY, then
again inside SORT, and then a third time to initialize ntids. You can
just have a local variable in that function that is set once and then
used as needed. Then you are only doing ARRNELEMS once, so you can get
rid of that macro too. The same technique can be used to get rid of
ARRPTR. So then all the macros go away without introducing any code
duplication.
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HTupleForceOption
+{
+ FORCE_KILL,
+ FORCE_FREEZE
+} HTupleForceOption;
I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE. Also, how about
option -> operation?
+ return heap_force_common(fcinfo, FORCE_KILL);
I think it might be more idiomatic to use PG_RETURN_DATUM here. I
know it's the same thing, though, and perhaps I'm even wrong about the
prevailing style.
+ Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
I think this is unnecessary. It's an enum with 2 values.
+ if (ARRISEMPTY(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));
I don't see why this should be an error. Why can't we just continue
normally and if it does nothing, it does nothing? You'd need to change
the do..while loop to a while loop so that the end condition is tested
at the top, but that seems fine.
+ rel = relation_open(relid, AccessShareLock);
Maybe we should take RowExclusiveLock, since we are going to modify
rows. Not sure how much it matters, though.
+ if (!superuser() && GetUserId() != rel->rd_rel->relowner)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser or object owner to use %s.",
+ force_opt == FORCE_KILL ? "heap_force_kill()" :
+ "heap_force_freeze()")));
This is the wrong way to do a permissions check, and it's also the
wrong way to write an error message about having failed one. To see
the correct method, grep for pg_class_aclcheck. However, I think that
we shouldn't in general trust the object owner to do this, unless the
super-user gave permission. This is a data-corrupting operation, and
only the boss is allowed to authorize it. So I think you should also
add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
have this check as a backup. Then, the superuser is always allowed,
and if they choose to GRANT EXECUTE on this function to some users,
those users can do it for their own relations, but not others.
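A sketch of the locking-down suggested here, assuming the function signatures from the patch (the role name admin_role is hypothetical, purely for illustration):

```sql
REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;

-- The superuser can then delegate explicitly, case by case:
GRANT EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) TO admin_role;
```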
+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ check_relation_relkind(rel);
Seems like these checks are in the wrong order. Also, maybe you could
call the function something like check_relation_ok() and put the
permissions test, the relkind test, and the relam test all inside of
it, just to tighten up the code in this main function a bit.
+ if (noffs > maxoffset)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("number of offsets specified for block %u exceeds the max
offset number %u",
+ blkno, maxoffset)));
Hmm, this doesn't seem quite right. The actual problem is if an
individual item pointer's offset number is greater than maxoffset,
which can be true even if the total number of offsets is less than
maxoffset. So I think you need to remove this check and add a check
inside the loop which follows that offnos[i] is in range.
The way you've structured that loop is actually problematic -- I don't
think we want to be calling elog() or ereport() inside a critical
section. You could fix the case that checks for an invalid force_opt
by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
NOTICE case you have here is a bigger problem. You really cannot
modify the buffer like this and then decide, oops, never mind, I think
I won't mark it dirty or write WAL for the changes. If you do that,
the buffer is still in memory, but it's now been modified. A
subsequent operation that modifies it will start with the altered
state you created here, quite possibly leading to WAL that cannot be
correctly replayed on the standby. In other words, you've got to
decide for certain whether you want to proceed with the operation
*before* you enter the critical section. You also need to emit any
messages before or after the critical section. So you could:
1. If you encounter a TID that's unused or dead, skip it silently.
-or-
2. Loop over offsets twice. The first time, ERROR if you find any one
that is unused or dead. Then start a critical section. Loop again and
do the real work.
-or-
3. Like #2, but emit a NOTICE about a unused or dead item rather than
an ERROR, and skip the critical section and the second loop if you did
that >0 times.
-or-
4. Like #3, but don't skip anything just because you emitted a NOTICE
about the page.
#3 is closest to the behavior you have now, but I'm not sure what else
it has going for it. It doesn't seem like particularly intuitive
behavior that finding a dead or unused TID should cause other item
TIDs on the same page not to get processed while still permitting TIDs
on other pages to get processed. I don't think that's the behavior
users will be expecting. I think my vote is for #4, which will emit a
NOTICE about any TID that is dead or unused -- and I guess also about
any TID whose offset number is out of range -- but won't actually skip
any operations that can be performed. But there are decent arguments
for #1 or #2 too.
+ (errmsg("skipping tid (%u, %u) because it is already marked %s",
+ blkno, offnos[i],
+ ItemIdIsDead(itemid) ? "dead" : "unused")));
I believe this violates our guidelines on message construction. Have
two completely separate messages -- and maybe lose the word "already":
"skipping tid (%u, %u) because it is dead"
"skipping tid (%u, %u) because it is unused"
The point of this is that it makes it easier for translators.
I see very little point in what verify_tid() is doing. Before using
each block number, we should check that it's less than or equal to a
cached value of RelationGetNumberOfBlocks(rel). That's necessary in
any case to avoid funny errors; and then the check here against
specifically InvalidBlockNumber is redundant. For the offset number,
same thing: we need to check each offset against the page's
PageGetMaxOffsetNumber(page); and if we do that then we don't need
these checks.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Robert,
Thanks for the review.
I've gone through all your review comments and understood all of them
except this one:
You really cannot
modify the buffer like this and then decide, oops, never mind, I think
I won't mark it dirty or write WAL for the changes. If you do that,
the buffer is still in memory, but it's now been modified. A
subsequent operation that modifies it will start with the altered
state you created here, quite possibly leading to WAL that cannot be
correctly replayed on the standby. In other words, you've got to
decide for certain whether you want to proceed with the operation
*before* you enter the critical section.
Could you please explain this point once more in detail? I am not quite
able to understand under what circumstances a buffer would be modified, but
won't be marked as dirty or a WAL won't be written for it.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Mon, Aug 3, 2020 at 5:05 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Could you please explain this point once more in detail? I am not quite able to understand under what circumstances a buffer would be modified, but won't be marked as dirty or a WAL won't be written for it.
Whenever this branch is taken:
+ if (nskippedItems == noffs)
+ goto skip_wal;
At this point you have already modified the page, using ItemIdSetDead,
HeapTupleHeaderSet*, and/or directly adjusting htup->infomask. If this
branch is taken, then MarkBufferDirty() and log_newpage_buffer() are
skipped.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Aug 3, 2020 at 7:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Aug 3, 2020 at 5:05 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Could you please explain this point once more in detail? I am not quite able to understand under what circumstances a buffer would be modified, but won't be marked as dirty or a WAL won't be written for it.
Whenever this branch is taken:
+ if (nskippedItems == noffs)
+ goto skip_wal;
If the above path is taken, that means none of the items in the page
got changed. As per the following if-check, whenever an item in the
offnos[] array is found dead or unused, it is skipped (due to the
continue statement), which means the item is neither marked dead nor
marked frozen. Now, if this happens for all the items in a page, then
the above condition (nskippedItems == noffs) would be true and hence
the buffer would remain unchanged, so we don't mark such a buffer as
dirty and don't do any WAL logging for it. This is my understanding;
please let me know if I am missing something here. Thank you.
if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
{
nskippedItems++;
ereport(NOTICE,
(errmsg("skipping tid (%u, %u) because it is
already marked %s",
blkno, offnos[i],
ItemIdIsDead(itemid) ? "dead" : "unused")));
continue;
}
At this point you have already modified the page, using ItemIdSetDead,
HeapTupleHeaderSet*, and/or directly adjusting htup->infomask. If this
branch is taken, then MarkBufferDirty() and log_newpage_buffer() are
skipped.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Hi Robert,
Thanks for the review. Please find my comments inline:
On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Attached is the new version of patch that addresses the comments from Andrey and Beena.
+PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
the -> a
I also suggest: heap table -> relation, because we might want to
extend this to other cases later.
Corrected.
+-- toast table.
+begin;
+create table ttab(a text);
+insert into ttab select string_agg(chr(floor(random() * 26)::int + 65), '') from generate_series(1,10000);
+select * from ttab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
I don't understand the point of this. You're not testing the function
on the TOAST table; you're testing it on the main table when there
happens to be a TOAST table that is probably getting used for
something. But that's not really relevant to what is being tested
here, so as written this seems redundant with the previous cases.
Yeah, it's being tested on the main table, not on a toast table. I've
removed this test-case and also restricted direct access to the toast
table using heap_force_kill/freeze functions. I think we shouldn't be
using these functions to do any changes in the toast table. We will
only use these functions with the main table and let VACUUM remove the
corresponding data chunks (pointed by the tuple that got removed from
the main table).
Another option would be to identify all the data chunks corresponding
to the tuple (ctid) being killed from the main table and remove them
one by one. We would only do this if the tuple being killed from the
main table has external storage. We would have to add a bunch of code
for this; otherwise, we can let VACUUM do it for us. Let me know your
thoughts on this.
+-- test pg_surgery functions with the unsupported relations. Should fail.
Please name the specific functions being tested here in case we add
more in the future that are tested separately.
Done.
+++ b/contrib/pg_surgery/heap_surgery_funcs.c
I think we could drop _funcs from the file name.
Done.
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif
The #ifdef here is not required, and if you look at other contrib
modules you'll see that they don't have it.
Okay, done.
I don't like all the macros at the top of the file much. CHECKARRVALID
is only used in one place, so it seems to me that you might as well
just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
Done.
Once you do that, heap_force_common() can just compute the number of
array elements once, instead of doing it once inside ARRISEMPTY, then
again inside SORT, and then a third time to initialize ntids. You can
just have a local variable in that function that is set once and then
used as needed. Then you are only doing ARRNELEMS once, so you can get
rid of that macro too. The same technique can be used to get rid of
ARRPTR. So then all the macros go away without introducing any code
duplication.
Done.
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HTupleForceOption
+{
+ FORCE_KILL,
+ FORCE_FREEZE
+} HTupleForceOption;
I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE.
Done.
Also, how about
option -> operation?
I think both look okay to me.
+ return heap_force_common(fcinfo, FORCE_KILL);
I think it might be more idiomatic to use PG_RETURN_DATUM here. I
know it's the same thing, though, and perhaps I'm even wrong about the
prevailing style.
Done.
+ Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
I think this is unnecessary. It's an enum with 2 values.
Removed.
+ if (ARRISEMPTY(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));
I don't see why this should be an error. Why can't we just continue
normally and if it does nothing, it does nothing? You'd need to change
the do..while loop to a while loop so that the end condition is tested
at the top, but that seems fine.
I think it's okay to have this check, so I've left it as-is. We have
such checks in other contrib modules as well, wherever an array is
passed as input to the function.
+ rel = relation_open(relid, AccessShareLock);
Maybe we should take RowExclusiveLock, since we are going to modify
rows. Not sure how much it matters, though.
I don't know how much difference it would make, but as you suggested,
I've replaced AccessShareLock with RowExclusiveLock.
+ if (!superuser() && GetUserId() != rel->rd_rel->relowner)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser or object owner to use %s.",
+ force_opt == FORCE_KILL ? "heap_force_kill()" :
+ "heap_force_freeze()")));
This is the wrong way to do a permissions check, and it's also the
wrong way to write an error message about having failed one. To see
the correct method, grep for pg_class_aclcheck. However, I think that
we shouldn't in general trust the object owner to do this, unless the
super-user gave permission. This is a data-corrupting operation, and
only the boss is allowed to authorize it. So I think you should also
add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
have this check as a backup. Then, the superuser is always allowed,
and if they choose to GRANT EXECUTE on this function to some users,
those users can do it for their own relations, but not others.
Done.
+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ check_relation_relkind(rel);
Seems like these checks are in the wrong order.
I don't think there is anything wrong with the order. I can see the
same order in other contrib modules as well.
Also, maybe you could
call the function something like check_relation_ok() and put the
permissions test, the relkind test, and the relam test all inside of
it, just to tighten up the code in this main function a bit.
Yeah, I've added a couple of functions named sanity_check_relation and
sanity_check_tid_array and shifted all the sanity checks there.
+ if (noffs > maxoffset)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("number of offsets specified for block %u exceeds the max offset number %u",
+ blkno, maxoffset)));
Hmm, this doesn't seem quite right. The actual problem is if an
individual item pointer's offset number is greater than maxoffset,
which can be true even if the total number of offsets is less than
maxoffset. So I think you need to remove this check and add a check
inside the loop which follows that offnos[i] is in range.
Agreed and done.
The way you've structured that loop is actually problematic -- I don't
think we want to be calling elog() or ereport() inside a critical
section. You could fix the case that checks for an invalid force_opt
by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
NOTICE case you have here is a bigger problem.
Done.
You really cannot
modify the buffer like this and then decide, oops, never mind, I think
I won't mark it dirty or write WAL for the changes. If you do that,
the buffer is still in memory, but it's now been modified. A
subsequent operation that modifies it will start with the altered
state you created here, quite possibly leading to WAL that cannot be
correctly replayed on the standby. In other words, you've got to
decide for certain whether you want to proceed with the operation
*before* you enter the critical section. You also need to emit any
messages before or after the critical section. So you could:
This is still not clear. I think Robert needs to respond to my earlier comment.
I believe this violates our guidelines on message construction. Have
two completely separate messages -- and maybe lose the word "already":
"skipping tid (%u, %u) because it is dead"
"skipping tid (%u, %u) because it is unused"
The point of this is that it makes it easier for translators.
Done.
I see very little point in what verify_tid() is doing. Before using
each block number, we should check that it's less than or equal to a
cached value of RelationGetNumberOfBlocks(rel). That's necessary in
any case to avoid funny errors; and then the check here against
specifically InvalidBlockNumber is redundant. For the offset number,
same thing: we need to check each offset against the page's
PageGetMaxOffsetNumber(page); and if we do that then we don't need
these checks.
Done.
Please check the attached patch for the changes.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachments:
v3-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch
From 76ea0329aed2afd2fdfaaaaf38ee3d825de4795e Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Wed, 5 Aug 2020 15:39:51 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap
table.
This contrib module adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used
to perform surgery on damaged heap tables.
Ashutosh Sharma
---
contrib/Makefile | 1 +
contrib/pg_surgery/Makefile | 23 ++
contrib/pg_surgery/expected/pg_surgery.out | 78 +++++++
contrib/pg_surgery/heap_surgery.c | 355 +++++++++++++++++++++++++++++
contrib/pg_surgery/pg_surgery--1.0.sql | 18 ++
contrib/pg_surgery/pg_surgery.control | 5 +
contrib/pg_surgery/sql/pg_surgery.sql | 61 +++++
7 files changed, 541 insertions(+)
create mode 100644 contrib/pg_surgery/Makefile
create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
create mode 100644 contrib/pg_surgery/heap_surgery.c
create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
create mode 100644 contrib/pg_surgery/pg_surgery.control
create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..07d5734 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
pg_standby \
pg_stat_statements \
+ pg_surgery \
pg_trgm \
pgcrypto \
pgrowlocks \
pgstattuple \
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..ecf2e20
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+ $(WIN32RES) \
+ heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out
new file mode 100644
index 0000000..8391e88
--- /dev/null
+++ b/contrib/pg_surgery/expected/pg_surgery.out
@@ -0,0 +1,78 @@
+create extension pg_surgery;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- normal heap table passes.
+--
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+select * from htab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select ctid, xmax from htab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,4) | 0
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from htab where ctid = '(0, 4)';
+ a
+---
+(0 rows)
+
+rollback;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+-- a partitioned table (the parent table) doesn't contain any tuples.
+create table ptab (a int) partition by list (a);
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create index ptab_idx on ptab (a);
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create view vw as select 1;
+-- views are not supported either, so these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create materialized view mvw as select 10;
+-- materialized views are not supported either, so these should fail.
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "mvw" is not a table
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "mvw" is not a table
+create sequence seq;
+-- sequences are not supported either, so these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+-- cleanup.
+drop materialized view mvw;
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
new file mode 100644
index 0000000..b72417b
--- /dev/null
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -0,0 +1,355 @@
+/*-------------------------------------------------------------------------
+ *
+ * heap_surgery.c
+ * Functions to perform surgery on the damaged heap table.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pg_surgery/heap_surgery.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "catalog/pg_am_d.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+
+PG_MODULE_MAGIC;
+
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HeapTupleForceOption
+{
+ HEAP_FORCE_KILL,
+ HEAP_FORCE_FREEZE
+} HeapTupleForceOption;
+
+PG_FUNCTION_INFO_V1(heap_force_kill);
+PG_FUNCTION_INFO_V1(heap_force_freeze);
+
+static int32 tidcmp(const void *a, const void *b);
+static Datum heap_force_common(FunctionCallInfo fcinfo,
+ HeapTupleForceOption heap_force_opt);
+static void sanity_check_tid_array(ArrayType *ta, int *ntids);
+static void sanity_check_relation(Relation rel);
+static BlockNumber tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr,
+ OffsetNumber *offnos);
+
+/*-------------------------------------------------------------------------
+ * heap_force_kill()
+ *
+ * Force kill the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_kill(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_kill(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_KILL));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_freeze()
+ *
+ * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_freeze(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_freeze(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_FREEZE));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_common()
+ *
+ * Common code for heap_force_kill and heap_force_freeze
+ *-------------------------------------------------------------------------
+ */
+static Datum
+heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids;
+ Relation rel;
+ Buffer buf;
+ Page page;
+ ItemId itemid;
+ BlockNumber blkno;
+ OffsetNumber *offnos;
+ OffsetNumber offno,
+ noffs,
+ curr_start_ptr,
+ next_start_ptr,
+ maxoffset;
+ int i,
+ nskippedItems,
+ nblocks;
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("Heap surgery functions cannot be executed during recovery.")));
+
+ /* Basic sanity checking. */
+ sanity_check_tid_array(ta, &ntids);
+
+ rel = relation_open(relid, RowExclusiveLock);
+
+ sanity_check_relation(rel);
+
+ tids = ((ItemPointer) ARR_DATA_PTR(ta));
+
+ /*
+ * If there is more than one tid in the array, sort it so that we can
+ * easily fetch all the tids belonging to one particular page from the
+ * array.
+ */
+ if (ntids > 1)
+ qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+
+ offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+ noffs = curr_start_ptr = next_start_ptr = 0;
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ do
+ {
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ if (blkno >= nblocks)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("block number %u is out of range for relation \"%s\"",
+ blkno, RelationGetRelationName(rel))));
+
+ CHECK_FOR_INTERRUPTS();
+
+ buf = ReadBuffer(rel, blkno);
+ LockBufferForCleanup(buf);
+
+ page = BufferGetPage(buf);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+
+ maxoffset = PageGetMaxOffsetNumber(page);
+
+ nskippedItems = 0;
+
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (i = 0; i < noffs; i++)
+ {
+ if (offnos[i] > maxoffset)
+ {
+ /* Count as skipped so an all-invalid page is not dirtied or WAL-logged. */
+ nskippedItems++;
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) because it contains an invalid offset",
+ blkno, offnos[i]));
+ continue;
+ }
+
+ itemid = PageGetItemId(page, offnos[i]);
+
+ /* Follow any redirections until we find something useful. */
+ while (ItemIdIsRedirected(itemid))
+ {
+ offno = ItemIdGetRedirect(itemid);
+ itemid = PageGetItemId(page, offno);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /* Nothing to do if the itemid is unused or already dead. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+ {
+ nskippedItems++;
+
+ if (!ItemIdIsUsed(itemid))
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) because it is marked unused",
+ blkno, offnos[i])));
+ else
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) because it is marked dead",
+ blkno, offnos[i])));
+ continue;
+ }
+
+ Assert(ItemIdIsNormal(itemid));
+
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ ItemIdSetDead(itemid);
+ else
+ {
+ HeapTupleHeader htup;
+
+ Assert(heap_force_opt == HEAP_FORCE_FREEZE);
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
+ HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
+
+ /*
+ * Clear all the visibility-related bits of this tuple and mark
+ * it as frozen.
+ */
+ htup->t_infomask &= ~HEAP_XACT_MASK;
+ htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID);
+ }
+ }
+
+ if (nskippedItems == noffs)
+ goto skip_wal;
+
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+
+skip_wal:
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+
+ /*
+ * Update the current start pointer before fetching next set of offset
+ * numbers from the tids belonging to the same page.
+ */
+ curr_start_ptr = next_start_ptr;
+ } while (next_start_ptr != ntids);
+
+ relation_close(rel, RowExclusiveLock);
+
+ pfree(ta);
+ pfree(offnos);
+
+ PG_RETURN_VOID();
+}
+
+/*-------------------------------------------------------------------------
+ * tidcmp()
+ *
+ * Compare two item pointers, return -1, 0, or +1.
+ *
+ * See ItemPointerCompare for details.
+ * ------------------------------------------------------------------------
+ */
+static int32
+tidcmp(const void *a, const void *b)
+{
+ ItemPointer iptr1 = ((const ItemPointer) a);
+ ItemPointer iptr2 = ((const ItemPointer) b);
+
+ return ItemPointerCompare(iptr1, iptr2);
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_tid_array()
+ *
+ * Perform sanity check on the given tid array.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_tid_array(ArrayType *ta, int *ntids)
+{
+ if (ARR_HASNULL(ta) && array_contains_nulls(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ errmsg("array must not contain nulls")));
+
+ *ntids = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta));
+
+ if (*ntids == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_relation()
+ *
+ * Perform sanity check on the given relation.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_relation(Relation rel)
+{
+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ if (rel->rd_rel->relkind != RELKIND_RELATION)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table",
+ RelationGetRelationName(rel))));
+
+ if (pg_class_aclcheck(RelationGetRelid(rel), GetUserId(), ACL_ALL_RIGHTS_RELATION) != ACLCHECK_OK)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("permission denied for relation \"%s\"",
+ RelationGetRelationName(rel))));
+}
+
+/*-------------------------------------------------------------------------
+ * tids_same_page_fetch_offnums()
+ *
+ * Find out all the tids residing in the same page as tids[next_start_ptr] and
+ * fetch the offset number stored in each of them into a caller-allocated offset
+ * number array.
+ * ------------------------------------------------------------------------
+ */
+static BlockNumber
+tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr, OffsetNumber *offnos)
+{
+ int i;
+ BlockNumber prev_blkno,
+ blkno;
+ OffsetNumber offno;
+
+ for (i = *next_start_ptr; i < ntids; i++)
+ {
+ ItemPointerData tid = tids[i];
+
+ if (!ItemPointerIsValid(&tid))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid item pointer")));
+
+ blkno = ItemPointerGetBlockNumberNoCheck(&tid);
+ offno = ItemPointerGetOffsetNumberNoCheck(&tid);
+
+ if (i == *next_start_ptr || (prev_blkno == blkno))
+ offnos[i - *next_start_ptr] = offno;
+ else
+ break;
+
+ prev_blkno = blkno;
+ }
+
+ *next_start_ptr = i;
+ return prev_blkno;
+}
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql
new file mode 100644
index 0000000..2ae7f22
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery--1.0.sql
@@ -0,0 +1,18 @@
+/* contrib/pg_surgery/pg_surgery--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit
+
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control
new file mode 100644
index 0000000..2bcdad1
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery.control
@@ -0,0 +1,5 @@
+# pg_surgery extension
+comment = 'extension to perform surgery on a damaged relation'
+default_version = '1.0'
+module_pathname = '$libdir/pg_surgery'
+relocatable = true
diff --git a/contrib/pg_surgery/sql/pg_surgery.sql b/contrib/pg_surgery/sql/pg_surgery.sql
new file mode 100644
index 0000000..bcd1878
--- /dev/null
+++ b/contrib/pg_surgery/sql/pg_surgery.sql
@@ -0,0 +1,61 @@
+create extension pg_surgery;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- normal heap table passes.
+--
+
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+
+select * from htab where xmin = 2;
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select ctid, xmax from htab where xmin = 2;
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select * from htab where ctid = '(0, 4)';
+rollback;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+
+-- a partitioned table (the parent table) doesn't contain any tuples.
+create table ptab (a int) partition by list (a);
+
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create index ptab_idx on ptab (a);
+
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create view vw as select 1;
+
+-- views are not supported either, so these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create materialized view mvw as select 10;
+
+-- materialized views are not supported either, so these should fail.
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create sequence seq;
+
+-- sequences are not supported either, so these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+
+-- cleanup.
+drop materialized view mvw;
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
--
1.8.3.1
On Mon, Aug 3, 2020 at 12:13 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
If the above path is taken that means none of the items in the page
got changed.
Oops. I didn't realize that, sorry. Maybe it would be a little more
clear if instead of "int nSkippedItems" you had "bool
did_modify_page"? Then you could initialize it to false and set it to
true just before doing the page modifications.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 5, 2020 at 9:42 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Yeah, it's being tested on the main table, not on a toast table. I've
removed this test-case and also restricted direct access to the toast
table using heap_force_kill/freeze functions. I think we shouldn't be
using these functions to do any changes in the toast table. We will
only use these functions with the main table and let VACUUM remove the
corresponding data chunks (pointed by the tuple that got removed from
the main table).
I agree with removing the test case, but I disagree with restricting
this from being used on the TOAST table. These are tools for experts,
who may use them as they see fit. It's unlikely that it would be
useful to use this on a TOAST table, I think, but not impossible.
Another option would be to identify all the data chunks corresponding
to the tuple (ctid) being killed from the main table and remove them
one by one. We will only do this if the tuple from the main table that
has been marked as killed has an external storage. We will have to add
a bunch of code for this otherwise we can let VACUUM do this for us.
Let me know your thoughts on this.
I don't think VACUUM will do anything for us automatically -- it isn't
going to know that we force-killed the tuple in the main table.
Normally, a tuple delete would have to set xmax on the TOAST tuples
and then VACUUM would do its thing, but in this case that won't
happen. So if you force-kill a tuple in the main table you would end
up with a space leak in the TOAST table.
The problem here is that one reason you might be force-killing a tuple in
the main table is because it's full of garbage. If so, trying to
decode the tuple so that you can find the TOAST pointers might crash
or error out, or maybe that part will work but then you'll error out
trying to look up the corresponding TOAST tuples, either because the
values are not valid or because the TOAST table itself is generally
hosed in some way. So I think it is probably best if we keep this tool
as simple as possible, with as few dependencies as we can, and
document the possible negative outcomes of using it. It's not
impossible to recover from a space-leak like this; you can always move
the data into a new table with CTAS and then drop the old one. Not
sure whether CLUSTER or VACUUM FULL would also be sufficient.
Separately, we might want to add a TOAST-checker to amcheck, or
enhance the heap-checker Mark is working on, and one of the things it
could do is check for TOAST entries to which nothing points. Then if
you force-kill tuples in the main table you could also use that tool
to look for things in the TOAST table that ought to also be
force-killed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Aug 6, 2020 at 1:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Aug 3, 2020 at 12:13 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
If the above path is taken that means none of the items in the page
got changed.

Oops. I didn't realize that, sorry. Maybe it would be a little more
clear if instead of "int nSkippedItems" you had "bool
did_modify_page"? Then you could initialize it to false and set it to
true just before doing the page modifications.
Okay, np, in that case, as you suggested, I will replace "int
nSkippedItems" with "did_modify_page" to increase the clarity. I will
do this change in the next version of patch. Thanks.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 6, 2020 at 1:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 5, 2020 at 9:42 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Yeah, it's being tested on the main table, not on a toast table. I've
removed this test-case and also restricted direct access to the toast
table using heap_force_kill/freeze functions. I think we shouldn't be
using these functions to do any changes in the toast table. We will
only use these functions with the main table and let VACUUM remove the
corresponding data chunks (pointed by the tuple that got removed from
the main table).

I agree with removing the test case, but I disagree with restricting
this from being used on the TOAST table. These are tools for experts,
who may use them as they see fit. It's unlikely that it would be
useful to use this on a TOAST table, I think, but not impossible.
Okay, If you want I can remove the restriction on a toast table, but,
then that means a user can directly remove the data chunks from the
toast table without changing anything in the main table. This means we
won't be able to query the main table as it will fail with an error
like "ERROR: unexpected chunk number ...". So, we will have to find
some way to identify the pointer to the chunks that got deleted from
the toast table and remove that pointer from the main table. We also
need to make sure that before we remove a tuple (pointer) from the
main table, we identify all the remaining data chunks pointed by this
tuple and remove them completely only then that table would be
considered to be in a good state. Now, I am not sure if we can
currently do all these things.
Another option would be to identify all the data chunks corresponding
to the tuple (ctid) being killed from the main table and remove them
one by one. We will only do this if the tuple from the main table that
has been marked as killed has an external storage. We will have to add
a bunch of code for this otherwise we can let VACUUM do this for us.
Let me know your thoughts on this.

I don't think VACUUM will do anything for us automatically -- it isn't
going to know that we force-killed the tuple in the main table.
Normally, a tuple delete would have to set xmax on the TOAST tuples
and then VACUUM would do its thing, but in this case that won't
happen. So if you force-kill a tuple in the main table you would end
up with a space leak in the TOAST table.

The problem here is that one reason you might be force-killing a tuple in
the main table is because it's full of garbage. If so, trying to
decode the tuple so that you can find the TOAST pointers might crash
or error out, or maybe that part will work but then you'll error out
trying to look up the corresponding TOAST tuples, either because the
values are not valid or because the TOAST table itself is generally
hosed in some way. So I think it is probably best if we keep this tool
as simple as possible, with as few dependencies as we can, and
document the possible negative outcomes of using it.
I completely agree with you.
It's not
impossible to recover from a space-leak like this; you can always move
the data into a new table with CTAS and then drop the old one. Not
sure whether CLUSTER or VACUUM FULL would also be sufficient.
Yeah, I think, we can either use CTAS or VACUUM FULL, both look fine.
Separately, we might want to add a TOAST-checker to amcheck, or
enhance the heap-checker Mark is working on, and one of the things it
could do is check for TOAST entries to which nothing points. Then if
you force-kill tuples in the main table you could also use that tool
to look for things in the TOAST table that ought to also be
force-killed.
Okay, good to know that. Thanks for sharing this info.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hi Robert,
Thanks for the review. Please find my comments inline:
On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Attached is the new version of patch that addresses the comments from Andrey and Beena.
+PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
the -> a
I also suggest: heap table -> relation, because we might want to
extend this to other cases later.

Corrected.
+-- toast table.
+begin;
+create table ttab(a text);
+insert into ttab select string_agg(chr(floor(random() * 26)::int + 65), '') from generate_series(1,10000);
+select * from ttab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+

I don't understand the point of this. You're not testing the function
on the TOAST table; you're testing it on the main table when there
happens to be a TOAST table that is probably getting used for
something. But that's not really relevant to what is being tested
here, so as written this seems redundant with the previous cases.

Yeah, it's being tested on the main table, not on a toast table. I've
removed this test-case and also restricted direct access to the toast
table using heap_force_kill/freeze functions. I think we shouldn't be
using these functions to do any changes in the toast table. We will
only use these functions with the main table and let VACUUM remove the
corresponding data chunks (pointed by the tuple that got removed from
the main table).

Another option would be to identify all the data chunks corresponding
to the tuple (ctid) being killed from the main table and remove them
one by one. We will only do this if the tuple from the main table that
has been marked as killed has an external storage. We will have to add
a bunch of code for this otherwise we can let VACUUM do this for us.
Let me know your thoughts on this.

+-- test pg_surgery functions with the unsupported relations. Should fail.

Please name the specific functions being tested here in case we add
more in the future that are tested separately.

Done.

+++ b/contrib/pg_surgery/heap_surgery_funcs.c

I think we could drop _funcs from the file name.
Done.
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif

The #ifdef here is not required, and if you look at other contrib
modules you'll see that they don't have it.

Okay, done.
I don't like all the macros at the top of the file much. CHECKARRVALID
is only used in one place, so it seems to me that you might as well
just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.

Done.
Once you do that, heap_force_common() can just compute the number of
array elements once, instead of doing it once inside ARRISEMPTY, then
again inside SORT, and then a third time to initialize ntids. You can
just have a local variable in that function that is set once and then
used as needed. Then you are only doing ARRNELEMS once, so you can get
rid of that macro too. The same technique can be used to get rid of
ARRPTR. So then all the macros go away without introducing any code
duplication.

Done.
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HTupleForceOption
+{
+ FORCE_KILL,
+ FORCE_FREEZE
+} HTupleForceOption;

I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE.

Done.
Also, how about
option -> operation?
I think both look okay to me.
+ return heap_force_common(fcinfo, FORCE_KILL);
I think it might be more idiomatic to use PG_RETURN_DATUM here. I
know it's the same thing, though, and perhaps I'm even wrong about the
prevailing style.

Done.
+ Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
I think this is unnecessary. It's an enum with 2 values.
Removed.
+ if (ARRISEMPTY(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));

I don't see why this should be an error. Why can't we just continue
normally and if it does nothing, it does nothing? You'd need to change
the do..while loop to a while loop so that the end condition is tested
at the top, but that seems fine.

I think it's okay to have this check. So, just left it as-is. We do
have such checks in other contrib modules as well wherever the array
is being passed as an input to the function.

+ rel = relation_open(relid, AccessShareLock);
Maybe we should take RowExclusiveLock, since we are going to modify
rows. Not sure how much it matters, though.

I don't know how it would make a difference, but still as you said
replaced AccessShare with RowExclusive.

+ if (!superuser() && GetUserId() != rel->rd_rel->relowner)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser or object owner to use %s.",
+ force_opt == FORCE_KILL ? "heap_force_kill()" :
+ "heap_force_freeze()")));

This is the wrong way to do a permissions check, and it's also the
wrong way to write an error message about having failed one. To see
the correct method, grep for pg_class_aclcheck. However, I think that
we shouldn't in general trust the object owner to do this, unless the
super-user gave permission. This is a data-corrupting operation, and
only the boss is allowed to authorize it. So I think you should also
add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
have this check as a backup. Then, the superuser is always allowed,
and if they choose to GRANT EXECUTE on this function to some users,
those users can do it for their own relations, but not others.

Done.

+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ check_relation_relkind(rel);

Seems like these checks are in the wrong order.
I don't think there is anything wrong with the order. I can see the
same order in other contrib modules as well.

Also, maybe you could
call the function something like check_relation_ok() and put the
permissions test, the relkind test, and the relam test all inside of
it, just to tighten up the code in this main function a bit.

Yeah, I've added a couple of functions named sanity_check_relation and
sanity_check_tid_array and shifted all the sanity checks there.

+ if (noffs > maxoffset)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("number of offsets specified for block %u exceeds the max offset number %u",
+ blkno, maxoffset)));

Hmm, this doesn't seem quite right. The actual problem is if an
individual item pointer's offset number is greater than maxoffset,
which can be true even if the total number of offsets is less than
maxoffset. So I think you need to remove this check and add a check
inside the loop which follows that offnos[i] is in range.

Agreed and done.
The way you've structured that loop is actually problematic -- I don't
think we want to be calling elog() or ereport() inside a critical
section. You could fix the case that checks for an invalid force_opt
by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
NOTICE case you have here is a bigger problem.

Done.
You really cannot
modify the buffer like this and then decide, oops, never mind, I think
I won't mark it dirty or write WAL for the changes. If you do that,
the buffer is still in memory, but it's now been modified. A
subsequent operation that modifies it will start with the altered
state you created here, quite possibly leading to WAL that cannot be
correctly replayed on the standby. In other words, you've got to
decide for certain whether you want to proceed with the operation
*before* you enter the critical section. You also need to emit any
messages before or after the critical section. So you could:

This is still not clear. I think Robert needs to respond to my earlier
comment.
I believe this violates our guidelines on message construction. Have
two completely separate messages -- and maybe lose the word "already":

"skipping tid (%u, %u) because it is dead"
"skipping tid (%u, %u) because it is unused"

The point of this is that it makes it easier for translators.
Done.
I see very little point in what verify_tid() is doing. Before using
each block number, we should check that it's less than or equal to a
cached value of RelationGetNumberOfBlocks(rel). That's necessary in
any case to avoid funny errors; and then the check here against
specifically InvalidBlockNumber is redundant. For the offset number,
same thing: we need to check each offset against the page's
PageGetMaxOffsetNumber(page); and if we do that then we don't need
these checks.

Done.
Please check the attached patch for the changes.
I also looked at this version of the patch and have some small comments:
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids;
+ Relation rel;
+ Buffer buf;
+ Page page;
+ ItemId itemid;
+ BlockNumber blkno;
+ OffsetNumber *offnos;
+ OffsetNumber offno,
+ noffs,
+ curr_start_ptr,
+ next_start_ptr,
+ maxoffset;
+ int i,
+ nskippedItems,
+ nblocks;
You declare all variables at the top of heap_force_common() function
but I think we can declare some variables such as buf, page inside of
the do loop.
---
+ if (offnos[i] > maxoffset)
+ {
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) because it contains an invalid offset",
+ blkno, offnos[i]));
+ continue;
+ }
If all tids on a page take the above path, we will end up logging FPI
in spite of modifying nothing on the page.
---
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I have been doing some user-level testing of this feature, apart from
the sanity tests for the extension and its functions.
I tried to corrupt tuples and was then able to fix them using the
heap_force_freeze/kill functions.
--corrupt relfrozenxid, causing vacuum to fail.
update pg_class set relfrozenxid = (relfrozenxid::text::integer +
10)::text::xid where relname = 'test_tbl';
UPDATE 1
insert into test_tbl values (2, 'BBB');
postgres=# vacuum test_tbl;
ERROR: found xmin 507 from before relfrozenxid 516
CONTEXT: while scanning block 0 of relation "public.test_tbl"
postgres=# select *, ctid, xmin, xmax from test_tbl;
a | b | ctid | xmin | xmax
---+-----+-------+------+------
1 | AAA | (0,1) | 505 | 0
2 | BBB | (0,2) | 507 | 0
(2 rows)
--fixed using heap_force_freeze
postgres=# select heap_force_freeze('test_tbl'::regclass,
ARRAY['(0,2)']::tid[]);
heap_force_freeze
-------------------
postgres=# vacuum test_tbl;
VACUUM
postgres=# select *, ctid, xmin, xmax from test_tbl;
a | b | ctid | xmin | xmax
---+-----+-------+------+------
1 | AAA | (0,1) | 505 | 0
2 | BBB | (0,2) | 2 | 0
(2 rows)
--corrupt a tuple header in the base/oid file, causing table access to fail.
postgres=# select ctid, * from test_tbl;
ERROR: could not access status of transaction 4294967295
DETAIL: Could not open file "pg_xact/0FFF": No such file or directory.
--removed corrupted tuple using heap_force_kill
postgres=# select heap_force_kill('test_tbl'::regclass,
ARRAY['(0,2)']::tid[]);
heap_force_kill
-----------------
(1 row)
postgres=# select ctid, * from test_tbl;
ctid | a | b
-------+---+-----
(0,1) | 1 | AAA
(1 row)
I will continue my testing with the latest patch and will update here
if I find anything.
Thanks & Regards,
Rajkumar Raghuwanshi
On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada <
masahiko.sawada@2ndquadrant.com> wrote:
Show quoted text
Hello Masahiko-san,
Thanks for looking into the patch. Please find my comments inline below:
On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Please check the attached patch for the changes.
I also looked at this version patch and have some small comments:
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids;
+ Relation rel;
+ Buffer buf;
+ Page page;
+ ItemId itemid;
+ BlockNumber blkno;
+ OffsetNumber *offnos;
+ OffsetNumber offno,
+              noffs,
+              curr_start_ptr,
+              next_start_ptr,
+              maxoffset;
+ int i,
+     nskippedItems,
+     nblocks;

You declare all variables at the top of heap_force_common() function
but I think we can declare some variables such as buf, page inside of
the do loop.
Sure, I will do this in the next version of patch.
---

+ if (offnos[i] > maxoffset)
+ {
+     ereport(NOTICE,
+             errmsg("skipping tid (%u, %u) because it contains an invalid offset",
+                    blkno, offnos[i]));
+     continue;
+ }

If all tids on a page take the above path, we will end up logging FPI
in spite of modifying nothing on the page.
Yeah, that's right. I've taken care of this in the new version of
patch which I am yet to share.
---

+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+     log_newpage_buffer(buf, true);

I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
I think we are already setting the page lsn in the log_newpage() which
is being called from log_newpage_buffer(). So, AFAIU, no change would
be required here. Please let me know if I am missing something.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Thu, 6 Aug 2020 at 18:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hello Masahiko-san,
Thanks for looking into the patch. Please find my comments inline below:
On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Please check the attached patch for the changes.
I also looked at this version patch and have some small comments:
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids;
+ Relation rel;
+ Buffer buf;
+ Page page;
+ ItemId itemid;
+ BlockNumber blkno;
+ OffsetNumber *offnos;
+ OffsetNumber offno,
+              noffs,
+              curr_start_ptr,
+              next_start_ptr,
+              maxoffset;
+ int i,
+     nskippedItems,
+     nblocks;

You declare all variables at the top of heap_force_common() function
but I think we can declare some variables such as buf, page inside of
the do loop.

Sure, I will do this in the next version of patch.

---

+ if (offnos[i] > maxoffset)
+ {
+     ereport(NOTICE,
+             errmsg("skipping tid (%u, %u) because it contains an invalid offset",
+                    blkno, offnos[i]));
+     continue;
+ }

If all tids on a page take the above path, we will end up logging FPI
in spite of modifying nothing on the page.

Yeah, that's right. I've taken care of this in the new version of
patch which I am yet to share.

---

+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+     log_newpage_buffer(buf, true);

I think we need to set the returned LSN by log_newpage_buffer() to the
page lsn.
I think we are already setting the page lsn in the log_newpage() which
is being called from log_newpage_buffer(). So, AFAIU, no change would
be required here. Please let me know if I am missing something.
You're right. I'd missed the comment of log_newpage_buffer(). Thanks!
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attached v4 patch fixes the latest comments from Robert and Masahiko-san.
Changes:
1) Allow the heap_force_kill and heap_force_freeze functions to be used
with toast tables.
2) Replace "int nskippedItems" with "bool did_modify_page" flag to
know if any modification happened in the page or not.
3) Declare some of the variables such as buf, page inside of the do
loop instead of declaring them at the top of heap_force_common
function.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Thu, Aug 6, 2020 at 2:49 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
Show quoted text
Attachments:
v4-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch (text/x-patch)
From 6d616af51a8f9e6352145f799f635ec907ed83c2 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Thu, 6 Aug 2020 18:02:29 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap
table.
This contrib module basically adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used in the scripts
to perform surgery on the damaged heap tables.
Ashutosh Sharma.
---
contrib/Makefile | 1 +
contrib/pg_surgery/Makefile | 23 ++
contrib/pg_surgery/expected/pg_surgery.out | 78 +++++++
contrib/pg_surgery/heap_surgery.c | 362 +++++++++++++++++++++++++++++
contrib/pg_surgery/pg_surgery--1.0.sql | 18 ++
contrib/pg_surgery/pg_surgery.control | 5 +
contrib/pg_surgery/sql/pg_surgery.sql | 61 +++++
7 files changed, 548 insertions(+)
create mode 100644 contrib/pg_surgery/Makefile
create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
create mode 100644 contrib/pg_surgery/heap_surgery.c
create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
create mode 100644 contrib/pg_surgery/pg_surgery.control
create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..07d5734 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
pg_standby \
pg_stat_statements \
pg_trgm \
+ pg_surgery \
pgcrypto \
pgrowlocks \
pgstattuple \
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..ecf2e20
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+ $(WIN32RES) \
+ heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out
new file mode 100644
index 0000000..c01fcf1
--- /dev/null
+++ b/contrib/pg_surgery/expected/pg_surgery.out
@@ -0,0 +1,78 @@
+create extension pg_surgery;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- normal heap table passes.
+--
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+select * from htab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select ctid, xmax from htab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,4) | 0
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from htab where ctid = '(0, 4)';
+ a
+---
+(0 rows)
+
+rollback;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create index ptab_idx on ptab (a);
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create view vw as select 1;
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create materialized view mvw as select 10;
+-- materialized views are not supported as well. so, all these should fail.
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "mvw" is not a table or TOAST table
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "mvw" is not a table or TOAST table
+create sequence seq;
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+-- cleanup.
+drop materialized view mvw;
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
new file mode 100644
index 0000000..0c1b66e
--- /dev/null
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -0,0 +1,362 @@
+/*-------------------------------------------------------------------------
+ *
+ * heap_surgery.c
+ * Functions to perform surgery on the damaged heap table.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pg_surgery/heap_surgery.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "catalog/pg_am_d.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+
+PG_MODULE_MAGIC;
+
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HeapTupleForceOption
+{
+ HEAP_FORCE_KILL,
+ HEAP_FORCE_FREEZE
+} HeapTupleForceOption;
+
+PG_FUNCTION_INFO_V1(heap_force_kill);
+PG_FUNCTION_INFO_V1(heap_force_freeze);
+
+static int32 tidcmp(const void *a, const void *b);
+static Datum heap_force_common(FunctionCallInfo fcinfo,
+ HeapTupleForceOption heap_force_opt);
+static void sanity_check_tid_array(ArrayType *ta, int *ntids);
+static void sanity_check_relation(Relation rel);
+static BlockNumber tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr,
+ OffsetNumber *offnos);
+
+/*-------------------------------------------------------------------------
+ * heap_force_kill()
+ *
+ * Force kill the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_kill(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_kill(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_KILL));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_freeze()
+ *
+ * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_freeze(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_freeze(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_FREEZE));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_common()
+ *
+ * Common code for heap_force_kill and heap_force_freeze
+ *-------------------------------------------------------------------------
+ */
+static Datum
+heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids,
+ nblocks;
+ Relation rel;
+ OffsetNumber *offnos;
+ OffsetNumber noffs,
+ curr_start_ptr,
+ next_start_ptr;
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("heap surgery functions cannot be executed during recovery.")));
+
+ /* Basic sanity checking. */
+ sanity_check_tid_array(ta, &ntids);
+
+ rel = relation_open(relid, RowExclusiveLock);
+
+ sanity_check_relation(rel);
+
+ tids = ((ItemPointer) ARR_DATA_PTR(ta));
+
+ /*
+ * If there is more than one tid in the array, sort it so that we can
+ * easily fetch all the tids belonging to one particular page from the
+ * array.
+ */
+ if (ntids > 1)
+ qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+
+ offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+ noffs = curr_start_ptr = next_start_ptr = 0;
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ do
+ {
+ Buffer buf;
+ Page page;
+ BlockNumber blkno;
+ OffsetNumber maxoffset;
+ int i;
+ bool did_modify_page;
+
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ if (blkno < 0 || blkno >= nblocks)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("block number %u is out of range for relation \"%s\"",
+ blkno, RelationGetRelationName(rel))));
+
+ CHECK_FOR_INTERRUPTS();
+
+ buf = ReadBuffer(rel, blkno);
+ LockBufferForCleanup(buf);
+
+ page = BufferGetPage(buf);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+
+ maxoffset = PageGetMaxOffsetNumber(page);
+
+ did_modify_page = false;
+
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (i = 0; i < noffs; i++)
+ {
+ OffsetNumber offno;
+ ItemId itemid;
+
+ if (offnos[i] > maxoffset)
+ {
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) because it contains an invalid offset",
+ blkno, offnos[i]));
+ continue;
+ }
+
+ itemid = PageGetItemId(page, offnos[i]);
+
+ /* Follow any redirections until we find something useful. */
+ while (ItemIdIsRedirected(itemid))
+ {
+ offno = ItemIdGetRedirect(itemid);
+ itemid = PageGetItemId(page, offno);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /* Nothing to do if the itemid is unused or already dead. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+ {
+ if (!ItemIdIsUsed(itemid))
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) because it is marked unused",
+ blkno, offnos[i])));
+ else
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) because it is marked dead",
+ blkno, offnos[i])));
+ continue;
+ }
+
+ Assert(ItemIdIsNormal(itemid));
+
+ did_modify_page = true;
+
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ ItemIdSetDead(itemid);
+ else
+ {
+ HeapTupleHeader htup;
+
+ Assert(heap_force_opt == HEAP_FORCE_FREEZE);
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
+ HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
+
+ /*
+ * Clear all the visibility-related bits of this tuple and mark
+ * it as frozen.
+ */
+ htup->t_infomask &= ~HEAP_XACT_MASK;
+ htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID);
+ }
+ }
+
+ /*
+ * If we didn't modify the page, let's not mark the buffer as dirty or
+ * do the WAL logging.
+ */
+ if (!did_modify_page)
+ goto skip_wal;
+
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+
+skip_wal:
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+
+ /*
+ * Update the current start pointer before fetching next set of offset
+ * numbers from the tids belonging to the same page.
+ */
+ curr_start_ptr = next_start_ptr;
+ } while (next_start_ptr != ntids);
+
+ relation_close(rel, RowExclusiveLock);
+
+ pfree(ta);
+ pfree(offnos);
+
+ PG_RETURN_VOID();
+}
+
+/*-------------------------------------------------------------------------
+ * tidcmp()
+ *
+ * Compare two item pointers, return -1, 0, or +1.
+ *
+ * See ItemPointerCompare for details.
+ * ------------------------------------------------------------------------
+ */
+static int32
+tidcmp(const void *a, const void *b)
+{
+ ItemPointer iptr1 = ((const ItemPointer) a);
+ ItemPointer iptr2 = ((const ItemPointer) b);
+
+ return ItemPointerCompare(iptr1, iptr2);
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_tid_array()
+ *
+ * Perform sanity check on the given tid array.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_tid_array(ArrayType *ta, int *ntids)
+{
+ if (ARR_HASNULL(ta) && array_contains_nulls(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ errmsg("array must not contain nulls")));
+
+ *ntids = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta));
+
+ if (*ntids == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_relation()
+ *
+ * Perform sanity check on the given relation.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_relation(Relation rel)
+{
+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ if (rel->rd_rel->relkind != RELKIND_RELATION &&
+ rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table or TOAST table",
+ RelationGetRelationName(rel))));
+
+ if (pg_class_aclcheck(RelationGetRelid(rel), GetUserId(), ACL_ALL_RIGHTS_RELATION) != ACLCHECK_OK)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("permission denied for relation \"%s\"",
+ RelationGetRelationName(rel))));
+}
+
+/*-------------------------------------------------------------------------
+ * tids_same_page_fetch_offnums()
+ *
+ * Find out all the tids residing in the same page as tids[next_start_ptr] and
+ * fetch the offset number stored in each of them into a caller-allocated offset
+ * number array.
+ * ------------------------------------------------------------------------
+ */
+static BlockNumber
+tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr, OffsetNumber *offnos)
+{
+ int i;
+ BlockNumber prev_blkno,
+ blkno;
+ OffsetNumber offno;
+
+ for (i = *next_start_ptr; i < ntids; i++)
+ {
+ ItemPointerData tid = tids[i];
+
+ if (!ItemPointerIsValid(&tid))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid item pointer")));
+
+ blkno = ItemPointerGetBlockNumberNoCheck(&tid);
+ offno = ItemPointerGetOffsetNumberNoCheck(&tid);
+
+ if (i == *next_start_ptr || (prev_blkno == blkno))
+ offnos[i - *next_start_ptr] = offno;
+ else
+ break;
+
+ prev_blkno = blkno;
+ }
+
+ *next_start_ptr = i;
+ return prev_blkno;
+}
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql
new file mode 100644
index 0000000..2ae7f22
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery--1.0.sql
@@ -0,0 +1,18 @@
+/* contrib/pg_surgery/pg_surgery--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit
+
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control
new file mode 100644
index 0000000..2bcdad1
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery.control
@@ -0,0 +1,5 @@
+# pg_surgery extension
+comment = 'extension to perform surgery on a damaged relation'
+default_version = '1.0'
+module_pathname = '$libdir/pg_surgery'
+relocatable = true
diff --git a/contrib/pg_surgery/sql/pg_surgery.sql b/contrib/pg_surgery/sql/pg_surgery.sql
new file mode 100644
index 0000000..bcd1878
--- /dev/null
+++ b/contrib/pg_surgery/sql/pg_surgery.sql
@@ -0,0 +1,61 @@
+create extension pg_surgery;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- normal heap table passes.
+--
+
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+
+select * from htab where xmin = 2;
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select ctid, xmax from htab where xmin = 2;
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select * from htab where ctid = '(0, 4)';
+rollback;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create index ptab_idx on ptab (a);
+
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create view vw as select 1;
+
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create materialized view mvw as select 10;
+
+-- materialized views are not supported as well. so, all these should fail.
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create sequence seq;
+
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+
+-- cleanup.
+drop materialized view mvw;
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
--
1.8.3.1
On Thu, Aug 6, 2020 at 2:11 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Okay, If you want I can remove the restriction on a toast table, but,
then that means a user can directly remove the data chunks from the
toast table without changing anything in the main table. This means we
won't be able to query the main table as it will fail with an error
like "ERROR: unexpected chunk number ...". So, we will have to find
some way to identify the pointer to the chunks that got deleted from
the toast table and remove that pointer from the main table. We also
need to make sure that before we remove a tuple (pointer) from the
main table, we identify all the remaining data chunks pointed by this
tuple and remove them completely only then that table would be
considered to be in a good state. Now, I am not sure if we can
currently do all these things.
That's the user's problem. If they don't have a plan for that, they
shouldn't use this tool. I don't think we can or should try to handle
it in this code.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Aug 6, 2020 at 9:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 6, 2020 at 2:11 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Okay, If you want I can remove the restriction on a toast table, but,
then that means a user can directly remove the data chunks from the
toast table without changing anything in the main table. This means we
won't be able to query the main table as it will fail with an error
like "ERROR: unexpected chunk number ...". So, we will have to find
some way to identify the pointer to the chunks that got deleted from
the toast table and remove that pointer from the main table. We also
need to make sure that before we remove a tuple (pointer) from the
main table, we identify all the remaining data chunks pointed by this
tuple and remove them completely only then that table would be
considered to be in a good state. Now, I am not sure if we can
currently do all these things.

That's the user's problem. If they don't have a plan for that, they
shouldn't use this tool. I don't think we can or should try to handle
it in this code.
Okay, thanks.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Thanks Rajkumar for testing the patch.
Here are some additional test-cases that I would suggest you execute,
if possible:
1) You may try running the test-cases that you have executed so far
with SR setup and see if the changes are getting reflected on the
standby.
2) You may also try running some concurrent test-cases, e.g. try
running these functions with VACUUM or some other SQL commands
(preferably DML commands) in parallel.
3) See what happens when you pass some invalid tids (containing
invalid block or offset number) to these functions. You may also try
running these functions on the same tuple repeatedly and see the
behaviour.
...
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 6, 2020 at 2:25 PM Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
I have been doing some user-level testing of this feature. Apart from sanity tests for the extension and its functions,
I have tried to corrupt tuples and was then able to fix them using the heap_force_freeze/kill functions.
--corrupt relfrozenxid, causing vacuum to fail.
update pg_class set relfrozenxid = (relfrozenxid::text::integer + 10)::text::xid where relname = 'test_tbl';
UPDATE 1
insert into test_tbl values (2, 'BBB');
postgres=# vacuum test_tbl;
ERROR: found xmin 507 from before relfrozenxid 516
CONTEXT: while scanning block 0 of relation "public.test_tbl"
postgres=# select *, ctid, xmin, xmax from test_tbl;
a | b | ctid | xmin | xmax
---+-----+-------+------+------
1 | AAA | (0,1) | 505 | 0
2 | BBB | (0,2) | 507 | 0
(2 rows)
--fixed using heap_force_freeze
postgres=# select heap_force_freeze('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
heap_force_freeze
-------------------
postgres=# vacuum test_tbl;
VACUUM
postgres=# select *, ctid, xmin, xmax from test_tbl;
a | b | ctid | xmin | xmax
---+-----+-------+------+------
1 | AAA | (0,1) | 505 | 0
2 | BBB | (0,2) | 2 | 0
(2 rows)
--corrupt table headers in the base/oid file, causing table access to fail.
postgres=# select ctid, * from test_tbl;
ERROR: could not access status of transaction 4294967295
DETAIL: Could not open file "pg_xact/0FFF": No such file or directory.
--removed corrupted tuple using heap_force_kill
postgres=# select heap_force_kill('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
heap_force_kill
-----------------
(1 row)
postgres=# select ctid, * from test_tbl;
ctid | a | b
-------+---+-----
(0,1) | 1 | AAA
(1 row)
I will be continuing with my testing with the latest patch and update here if found anything.
Thanks & Regards,
Rajkumar Raghuwanshi

On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hi Robert,
Thanks for the review. Please find my comments inline:
On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Attached is the new version of patch that addresses the comments from Andrey and Beena.
+PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
the -> a
I also suggest: heap table -> relation, because we might want to
extend this to other cases later.

Corrected.
+-- toast table.
+begin;
+create table ttab(a text);
+insert into ttab select string_agg(chr(floor(random() * 26)::int + 65), '') from generate_series(1,10000);
+select * from ttab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)

I don't understand the point of this. You're not testing the function
on the TOAST table; you're testing it on the main table when there
happens to be a TOAST table that is probably getting used for
something. But that's not really relevant to what is being tested
here, so as written this seems redundant with the previous cases.

Yeah, it's being tested on the main table, not on a toast table. I've
removed this test-case and also restricted direct access to the toast
table using heap_force_kill/freeze functions. I think we shouldn't be
using these functions to do any changes in the toast table. We will
only use these functions with the main table and let VACUUM remove the
corresponding data chunks (pointed by the tuple that got removed from
the main table).

Another option would be to identify all the data chunks corresponding
to the tuple (ctid) being killed from the main table and remove them
one by one. We will only do this if the tuple from the main table that
has been marked as killed has an external storage. We will have to add
a bunch of code for this otherwise we can let VACUUM do this for us.
Let me know your thoughts on this.

+-- test pg_surgery functions with the unsupported relations. Should fail.
Please name the specific functions being tested here in case we add
more in the future that are tested separately.

Done.
+++ b/contrib/pg_surgery/heap_surgery_funcs.c

I think we could drop _funcs from the file name.
Done.
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif

The #ifdef here is not required, and if you look at other contrib
modules you'll see that they don't have it.

Okay, done.
I don't like all the macros at the top of the file much. CHECKARRVALID
is only used in one place, so it seems to me that you might as well
just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.

Done.
Once you do that, heap_force_common() can just compute the number of
array elements once, instead of doing it once inside ARRISEMPTY, then
again inside SORT, and then a third time to initialize ntids. You can
just have a local variable in that function that is set once and then
used as needed. Then you are only doing ARRNELEMS once, so you can get
rid of that macro too. The same technique can be used to get rid of
ARRPTR. So then all the macros go away without introducing any code
duplication.

Done.
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HTupleForceOption
+{
+	FORCE_KILL,
+	FORCE_FREEZE
+} HTupleForceOption;

I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE.

Done.
Also, how about
option -> operation?
I think both look okay to me.
+ return heap_force_common(fcinfo, FORCE_KILL);
I think it might be more idiomatic to use PG_RETURN_DATUM here. I
know it's the same thing, though, and perhaps I'm even wrong about the
prevailing style.

Done.
+ Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
I think this is unnecessary. It's an enum with 2 values.
Removed.
+	if (ARRISEMPTY(ta))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("empty tid array")));

I don't see why this should be an error. Why can't we just continue
normally and if it does nothing, it does nothing? You'd need to change
the do..while loop to a while loop so that the end condition is tested
at the top, but that seems fine.

I think it's okay to have this check. So, just left it as-is. We do
have such checks in other contrib modules as well wherever the array
is being passed as an input to the function.

+	rel = relation_open(relid, AccessShareLock);
Maybe we should take RowExclusiveLock, since we are going to modify
rows. Not sure how much it matters, though.

I don't know how it would make a difference, but still as you said
replaced AccessShare with RowExclusive.

+	if (!superuser() && GetUserId() != rel->rd_rel->relowner)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser or object owner to use %s.",
+						force_opt == FORCE_KILL ? "heap_force_kill()" :
+						"heap_force_freeze()")));

This is the wrong way to do a permissions check, and it's also the
wrong way to write an error message about having failed one. To see
the correct method, grep for pg_class_aclcheck. However, I think that
we shouldn't in general trust the object owner to do this, unless the
super-user gave permission. This is a data-corrupting operation, and
only the boss is allowed to authorize it. So I think you should also
add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
have this check as a backup. Then, the superuser is always allowed,
and if they choose to GRANT EXECUTE on this function to some users,
those users can do it for their own relations, but not others.

Done.

+	if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("only heap AM is supported")));
+
+	check_relation_relkind(rel);

Seems like these checks are in the wrong order.
I don't think there is anything wrong with the order. I can see the
same order in other contrib modules as well.

Also, maybe you could
call the function something like check_relation_ok() and put the
permissions test, the relkind test, and the relam test all inside of
it, just to tighten up the code in this main function a bit.Yeah, I've added a couple of functions named sanity_check_relation and
sanity_check_tid_array and shifted all the sanity checks there.

+	if (noffs > maxoffset)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("number of offsets specified for block %u exceeds the max offset number %u",
+						blkno, maxoffset)));

Hmm, this doesn't seem quite right. The actual problem is if an
individual item pointer's offset number is greater than maxoffset,
which can be true even if the total number of offsets is less than
maxoffset. So I think you need to remove this check and add a check
inside the loop which follows that offnos[i] is in range.

Agreed and done.
The way you've structured that loop is actually problematic -- I don't
think we want to be calling elog() or ereport() inside a critical
section. You could fix the case that checks for an invalid force_opt
by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
NOTICE case you have here is a bigger problem.

Done.
You really cannot
modify the buffer like this and then decide, oops, never mind, I think
I won't mark it dirty or write WAL for the changes. If you do that,
the buffer is still in memory, but it's now been modified. A
subsequent operation that modifies it will start with the altered
state you created here, quite possibly leading to WAL that cannot be
correctly replayed on the standby. In other words, you've got to
decide for certain whether you want to proceed with the operation
*before* you enter the critical section. You also need to emit any
messages before or after the critical section. So you could:

This is still not clear. I think Robert needs to respond to my earlier comment.
I believe this violates our guidelines on message construction. Have
two completely separate messages -- and maybe lose the word "already":

"skipping tid (%u, %u) because it is dead"
"skipping tid (%u, %u) because it is unused"

The point of this is that it makes it easier for translators.
Done.
I see very little point in what verify_tid() is doing. Before using
each block number, we should check that it's less than or equal to a
cached value of RelationGetNumberOfBlocks(rel). That's necessary in
any case to avoid funny errors; and then the check here against
specifically InvalidBlockNumber is redundant. For the offset number,
same thing: we need to check each offset against the page's
PageGetMaxOffsetNumber(page); and if we do that then we don't need
these checks.

Done.
Please check the attached patch for the changes.
I also looked at this version of the patch and have some small comments:
+	Oid			relid = PG_GETARG_OID(0);
+	ArrayType  *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+	ItemPointer tids;
+	int			ntids;
+	Relation	rel;
+	Buffer		buf;
+	Page		page;
+	ItemId		itemid;
+	BlockNumber blkno;
+	OffsetNumber *offnos;
+	OffsetNumber offno,
+				noffs,
+				curr_start_ptr,
+				next_start_ptr,
+				maxoffset;
+	int			i,
+				nskippedItems,
+				nblocks;

You declare all variables at the top of the heap_force_common() function
but I think we can declare some variables such as buf, page inside of
the do loop.

---
+		if (offnos[i] > maxoffset)
+		{
+			ereport(NOTICE,
+					 errmsg("skipping tid (%u, %u) because it contains an invalid offset",
+							blkno, offnos[i]));
+			continue;
+		}

If all tids on a page take the above path, we will end up logging FPI
in spite of modifying nothing on the page.

---
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+		log_newpage_buffer(buf, true);

I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Attached v4 patch fixes the latest comments from Robert and Masahiko-san.
Compiler warning:
heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
always false [-Werror,-Wtautological-compare]
if (blkno < 0 || blkno >= nblocks)
~~~~~ ^ ~
There's a certain inconsistency to these messages:
rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# insert into foo values (1);
INSERT 0 1
rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
NOTICE: skipping tid (0, 2) because it contains an invalid offset
heap_force_kill
-----------------
(1 row)
rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
ERROR: invalid item pointer
LOCATION: tids_same_page_fetch_offnums, heap_surgery.c:347
rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
ERROR: block number 1 is out of range for relation "foo"
From a user perspective it seems like I've made three very similar
mistakes: in the first case, the offset is too high, in the second
case it's too low, and in the third case the block number is out of
range. But in one case I get a NOTICE and in the other two cases I get
an ERROR. In one case I get the relation name and in the other two
cases I don't. The two complaints about an invalid offset are phrased
completely differently from each other. For example, suppose you do
this:
ERROR: tid (%u, %u) is invalid for relation "%s" because the block
number is out of range (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item
number is out of range for this block (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
I think I misled you when I said to use pg_class_aclcheck. I think it
should actually be pg_class_ownercheck.
I think the relkind sanity check should permit RELKIND_MATVIEW also.
It's unclear to me why the freeze logic here shouldn't do this part of
what heap_prepare_freeze_tuple() does when freezing xmax:
frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
Likewise, why should we not freeze or invalidate xvac in the case
where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
would do?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
There's a certain inconsistency to these messages:

rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# insert into foo values (1);
INSERT 0 1
rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
NOTICE: skipping tid (0, 2) because it contains an invalid offset
heap_force_kill
-----------------
(1 row)
rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
ERROR: invalid item pointer
LOCATION: tids_same_page_fetch_offnums, heap_surgery.c:347
rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
ERROR: block number 1 is out of range for relation "foo"

From a user perspective it seems like I've made three very similar
mistakes: in the first case, the offset is too high, in the second
case it's too low, and in the third case the block number is out of
range. But in one case I get a NOTICE and in the other two cases I get
an ERROR. In one case I get the relation name and in the other two
cases I don't. The two complaints about an invalid offset are phrased
completely differently from each other. For example, suppose you do
this:

ERROR: tid (%u, %u) is invalid for relation "%s" because the block
number is out of range (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item
number is out of range for this block (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
Thank you for your suggestions. To make this consistent, I am planning
to make the following changes:
Remove the error message to report "invalid item pointer" from
tids_same_page_fetch_offnums() and expand the if-check to detect any
invalid offset number in the CRITICAL section such that it not just
checks if the offset number is > maxoffset, but also checks if the
offset number is equal to 0. That way it would also do the job that
"if (!ItemPointerIsValid)" was doing for us.
Further, if any invalid block number is detected, then I am planning
to skip all the tids associated with this block and move to the next
block. Hence, instead of reporting the error I would report the NOTICE
message to the user.
The other two messages for reporting unused items and dead items
remain the same. Hence, with above change, we would be reporting the
following 4 messages:
NOTICE: skipping all the tids in block %u for relation "%s" because
the block number is out of range
NOTICE: skipping tid (%u, %u) for relation "%s" because the item
number is out of range for this block
NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked dead
NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked unused
Please let me know if you are okay with the above changes or not?
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
The other two messages for reporting unused items and dead items
remain the same. Hence, with above change, we would be reporting the
following 4 messages:

NOTICE: skipping all the tids in block %u for relation "%s" because
the block number is out of range

NOTICE: skipping tid (%u, %u) for relation "%s" because the item
number is out of range for this block

NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked dead
NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked unused
Please let me know if you are okay with the above changes or not?
That seems broadly reasonable, but I would suggest phrasing the first
message like this:
skipping block %u for relation "%s" because the block number is out of range
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 11, 2020 at 7:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
The other two messages for reporting unused items and dead items
remain the same. Hence, with above change, we would be reporting the
following 4 messages:

NOTICE: skipping all the tids in block %u for relation "%s" because
the block number is out of range

NOTICE: skipping tid (%u, %u) for relation "%s" because the item
number is out of range for this block

NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked dead
NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked unused
Please let me know if you are okay with the above changes or not?
That seems broadly reasonable, but I would suggest phrasing the first
message like this:skipping block %u for relation "%s" because the block number is out of range
Okay, thanks for the confirmation. I'll do that.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Thanks Robert for the review. Please find my comments inline below:
On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Attached v4 patch fixes the latest comments from Robert and Masahiko-san.
Compiler warning:
heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
always false [-Werror,-Wtautological-compare]
if (blkno < 0 || blkno >= nblocks)
~~~~~ ^ ~
Fixed.
There's a certain inconsistency to these messages:
rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# insert into foo values (1);
INSERT 0 1
rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
NOTICE: skipping tid (0, 2) because it contains an invalid offset
heap_force_kill
-----------------
(1 row)
rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
ERROR: invalid item pointer
LOCATION: tids_same_page_fetch_offnums, heap_surgery.c:347
rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
ERROR: block number 1 is out of range for relation "foo"

From a user perspective it seems like I've made three very similar
mistakes: in the first case, the offset is too high, in the second
case it's too low, and in the third case the block number is out of
range. But in one case I get a NOTICE and in the other two cases I get
an ERROR. In one case I get the relation name and in the other two
cases I don't. The two complaints about an invalid offset are phrased
completely differently from each other. For example, suppose you do
this:

ERROR: tid (%u, %u) is invalid for relation "%s" because the block
number is out of range (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item
number is out of range for this block (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
Corrected.
I think I misled you when I said to use pg_class_aclcheck. I think it
should actually be pg_class_ownercheck.
okay, I've changed it to pg_class_ownercheck.
I think the relkind sanity check should permit RELKIND_MATVIEW also.
Yeah, actually we should allow MATVIEW, don't know why I thought of
blocking it earlier.
It's unclear to me why the freeze logic here shouldn't do this part of
what heap_prepare_freeze_tuple() does when freezing xmax:
frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
Yeah, we should have these changes when freezing the xmax.
Likewise, why should we not freeze or invalidate xvac in the case
where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
would do?
Again, we should have this as well.
Apart from above, this time I've also added the documentation on
pg_surgery module and added a few more test-cases.
Attached patch with above changes.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Attachment:
v5-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch
From 2aa56b9632cf1d291f4433afd972dc647d354dcb Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Wed, 12 Aug 2020 14:38:14 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap
table.
This contrib module basically adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used in the scripts
to perform surgery on the damaged heap tables.
Ashutosh Sharma.
---
contrib/Makefile | 1 +
contrib/pg_surgery/Makefile | 23 ++
contrib/pg_surgery/expected/pg_surgery.out | 161 +++++++++++++
contrib/pg_surgery/heap_surgery.c | 375 +++++++++++++++++++++++++++++
contrib/pg_surgery/pg_surgery--1.0.sql | 18 ++
contrib/pg_surgery/pg_surgery.control | 5 +
contrib/pg_surgery/sql/pg_surgery.sql | 89 +++++++
doc/src/sgml/contrib.sgml | 1 +
doc/src/sgml/filelist.sgml | 1 +
doc/src/sgml/pgsurgery.sgml | 94 ++++++++
10 files changed, 768 insertions(+)
create mode 100644 contrib/pg_surgery/Makefile
create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
create mode 100644 contrib/pg_surgery/heap_surgery.c
create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
create mode 100644 contrib/pg_surgery/pg_surgery.control
create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
create mode 100644 doc/src/sgml/pgsurgery.sgml
diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..07d5734 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
pg_standby \
pg_stat_statements \
pg_trgm \
+ pg_surgery \
pgcrypto \
pgrowlocks \
pgstattuple \
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..ecf2e20
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+ $(WIN32RES) \
+ heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out
new file mode 100644
index 0000000..55f8dce
--- /dev/null
+++ b/contrib/pg_surgery/expected/pg_surgery.out
@@ -0,0 +1,161 @@
+create extension pg_surgery;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+select * from htab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select ctid, xmax from htab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,4) | 0
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from htab where ctid = '(0, 4)';
+ a
+---
+(0 rows)
+
+rollback;
+-- materialized view.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+select * from mvw where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select * from mvw where xmin = 2;
+ a
+---
+ 3
+(1 row)
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from mvw where ctid = '(0, 3)';
+ a
+---
+(0 rows)
+
+rollback;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create index ptab_idx on ptab (a);
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create view vw as select 1;
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+create sequence seq;
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only heap AM is supported
+--
+-- Some negative test-cases with invalid inputs.
+--
+begin;
+create table htab(a int);
+-- invalid block number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping block 0 for relation "htab" because the block number is out of range
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping block 0 for relation "htab" because the block number is out of range
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+insert into htab values (10);
+-- invalid offset number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE: skipping tid (0, 2) for relation "htab" because the item number is out of range for this block
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE: skipping tid (0, 2) for relation "htab" because the item number is out of range for this block
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+-- dead tuple, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping tid (0, 1) for relation "htab" because it is marked dead
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping tid (0, 1) for relation "htab" because it is marked dead
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+rollback;
+-- cleanup.
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
new file mode 100644
index 0000000..20ac4fc
--- /dev/null
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -0,0 +1,375 @@
+/*-------------------------------------------------------------------------
+ *
+ * heap_surgery.c
+ * Functions to perform surgery on the damaged heap table.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pg_surgery/heap_surgery.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "catalog/pg_am_d.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+
+PG_MODULE_MAGIC;
+
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HeapTupleForceOption
+{
+ HEAP_FORCE_KILL,
+ HEAP_FORCE_FREEZE
+} HeapTupleForceOption;
+
+PG_FUNCTION_INFO_V1(heap_force_kill);
+PG_FUNCTION_INFO_V1(heap_force_freeze);
+
+static int32 tidcmp(const void *a, const void *b);
+static Datum heap_force_common(FunctionCallInfo fcinfo,
+ HeapTupleForceOption heap_force_opt);
+static void sanity_check_tid_array(ArrayType *ta, int *ntids);
+static void sanity_check_relation(Relation rel);
+static BlockNumber tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr,
+ OffsetNumber *offnos);
+
+/*-------------------------------------------------------------------------
+ * heap_force_kill()
+ *
+ * Force kill the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_kill(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_kill(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_KILL));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_freeze()
+ *
+ * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_freeze(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_freeze(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_FREEZE));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_common()
+ *
+ * Common code for heap_force_kill and heap_force_freeze
+ *-------------------------------------------------------------------------
+ */
+static Datum
+heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids,
+ nblocks;
+ Relation rel;
+ OffsetNumber *offnos;
+ OffsetNumber noffs,
+ curr_start_ptr,
+ next_start_ptr;
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("heap surgery functions cannot be executed during recovery.")));
+
+ /* Basic sanity checking. */
+ sanity_check_tid_array(ta, &ntids);
+
+ rel = relation_open(relid, RowExclusiveLock);
+
+ sanity_check_relation(rel);
+
+ tids = ((ItemPointer) ARR_DATA_PTR(ta));
+
+ /*
+ * If there is more than one tid in the array, sort it so that we can
+ * easily fetch all the tids belonging to one particular page from the
+ * array.
+ */
+ if (ntids > 1)
+ qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+
+ offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+ noffs = curr_start_ptr = next_start_ptr = 0;
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ do
+ {
+ Buffer buf;
+ Page page;
+ BlockNumber blkno;
+ OffsetNumber maxoffset;
+ int i;
+ bool did_modify_page = false;
+
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+
+ /*
+ * Update the current start pointer so that next time when
+ * tids_same_page_fetch_offnums() is called, we can calculate the number
+ * of offsets present in the offnos array.
+ */
+ curr_start_ptr = next_start_ptr;
+
+ /* Check whether the block number is valid. */
+ if (blkno >= nblocks)
+ {
+ ereport(NOTICE,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("skipping block %u for relation \"%s\" because the block number is out of range",
+ blkno, RelationGetRelationName(rel))));
+ continue;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ buf = ReadBuffer(rel, blkno);
+ LockBufferForCleanup(buf);
+
+ page = BufferGetPage(buf);
+
+ maxoffset = PageGetMaxOffsetNumber(page);
+
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (i = 0; i < noffs; i++)
+ {
+ OffsetNumber offno;
+ ItemId itemid;
+
+ if (offnos[i] == 0 || offnos[i] > maxoffset)
+ {
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) for relation \"%s\" because the item number is out of range for this block",
+ blkno, offnos[i], RelationGetRelationName(rel)));
+ continue;
+ }
+
+ itemid = PageGetItemId(page, offnos[i]);
+
+ /* Follow any redirections until we find something useful. */
+ while (ItemIdIsRedirected(itemid))
+ {
+ offno = ItemIdGetRedirect(itemid);
+ itemid = PageGetItemId(page, offno);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /* Nothing to do if the itemid is unused or already dead. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+ {
+ if (!ItemIdIsUsed(itemid))
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked unused",
+ blkno, offnos[i], RelationGetRelationName(rel))));
+ else
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked dead",
+ blkno, offnos[i], RelationGetRelationName(rel))));
+ continue;
+ }
+
+ Assert(ItemIdIsNormal(itemid));
+
+ did_modify_page = true;
+
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ ItemIdSetDead(itemid);
+ else
+ {
+ HeapTupleHeader htup;
+
+ Assert(heap_force_opt == HEAP_FORCE_FREEZE);
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ /* We might have MOVED_OFF/MOVED_IN tuples in the database */
+ if (htup->t_infomask & HEAP_MOVED)
+ {
+ if (htup->t_infomask & HEAP_MOVED_OFF)
+ HeapTupleHeaderSetXvac(htup, InvalidTransactionId);
+ else
+ HeapTupleHeaderSetXvac(htup, FrozenTransactionId);
+ }
+
+ HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
+ HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
+
+ /*
+ * Clear all the visibility-related bits of this tuple and mark
+ * it as frozen. Also, get rid of HOT_UPDATED and KEYS_UPDATES
+ * bits.
+ */
+ htup->t_infomask &= ~HEAP_XACT_MASK;
+ htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID);
+ htup->t_infomask2 &= ~HEAP_HOT_UPDATED;
+ htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+ }
+ }
+
+ /*
+ * We do not mark the buffer dirty or do WAL logging for unmodifed
+ * pages.
+ */
+ if (!did_modify_page)
+ goto skip_wal;
+
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+
+skip_wal:
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+ } while (next_start_ptr != ntids);
+
+ relation_close(rel, RowExclusiveLock);
+
+ pfree(ta);
+ pfree(offnos);
+
+ PG_RETURN_VOID();
+}
+
+/*-------------------------------------------------------------------------
+ * tidcmp()
+ *
+ * Compare two item pointers, return -1, 0, or +1.
+ *
+ * See ItemPointerCompare for details.
+ * ------------------------------------------------------------------------
+ */
+static int32
+tidcmp(const void *a, const void *b)
+{
+ ItemPointer iptr1 = ((const ItemPointer) a);
+ ItemPointer iptr2 = ((const ItemPointer) b);
+
+ return ItemPointerCompare(iptr1, iptr2);
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_tid_array()
+ *
+ * Perform sanity check on the given tid array.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_tid_array(ArrayType *ta, int *ntids)
+{
+ if (ARR_HASNULL(ta) && array_contains_nulls(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ errmsg("array must not contain nulls")));
+
+ *ntids = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta));
+
+ if (*ntids == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_relation()
+ *
+ * Perform sanity check on the given relation.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_relation(Relation rel)
+{
+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ if (rel->rd_rel->relkind != RELKIND_RELATION &&
+ rel->rd_rel->relkind != RELKIND_MATVIEW &&
+ rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table, materialized view, or TOAST table",
+ RelationGetRelationName(rel))));
+
+ /* Must be owner of the table or superuser. */
+ if (!pg_class_ownercheck(RelationGetRelid(rel), GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER,
+ get_relkind_objtype(rel->rd_rel->relkind),
+ RelationGetRelationName(rel));
+}
+
+/*-------------------------------------------------------------------------
+ * tids_same_page_fetch_offnums()
+ *
+ * Find out all the tids residing in the same page as tids[next_start_ptr] and
+ * fetch the offset number stored in each of them into a caller-allocated offset
+ * number array.
+ * ------------------------------------------------------------------------
+ */
+static BlockNumber
+tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr, OffsetNumber *offnos)
+{
+ int i;
+ BlockNumber prev_blkno,
+ blkno;
+ OffsetNumber offno;
+
+ prev_blkno = blkno = InvalidBlockNumber;
+
+ for (i = *next_start_ptr; i < ntids; i++)
+ {
+ ItemPointerData tid = tids[i];
+
+ blkno = ItemPointerGetBlockNumberNoCheck(&tid);
+ offno = ItemPointerGetOffsetNumberNoCheck(&tid);
+
+ if (i == *next_start_ptr || (prev_blkno == blkno))
+ offnos[i - *next_start_ptr] = offno;
+ else
+ break;
+
+ prev_blkno = blkno;
+ }
+
+ *next_start_ptr = i;
+ return prev_blkno;
+}
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql
new file mode 100644
index 0000000..2ae7f22
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery--1.0.sql
@@ -0,0 +1,18 @@
+/* contrib/pg_surgery/pg_surgery--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit
+
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control
new file mode 100644
index 0000000..2bcdad1
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery.control
@@ -0,0 +1,5 @@
+# pg_surgery extension
+comment = 'extension to perform surgery on a damaged relation'
+default_version = '1.0'
+module_pathname = '$libdir/pg_surgery'
+relocatable = true
diff --git a/contrib/pg_surgery/sql/pg_surgery.sql b/contrib/pg_surgery/sql/pg_surgery.sql
new file mode 100644
index 0000000..4aef380
--- /dev/null
+++ b/contrib/pg_surgery/sql/pg_surgery.sql
@@ -0,0 +1,89 @@
+create extension pg_surgery;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+
+select * from htab where xmin = 2;
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select ctid, xmax from htab where xmin = 2;
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select * from htab where ctid = '(0, 4)';
+rollback;
+
+-- materialized view.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+
+select * from mvw where xmin = 2;
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where xmin = 2;
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where ctid = '(0, 3)';
+rollback;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create index ptab_idx on ptab (a);
+
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create view vw as select 1;
+
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create sequence seq;
+
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+
+--
+-- Some negative test-cases with invalid inputs.
+--
+begin;
+create table htab(a int);
+
+-- invalid block number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+insert into htab values (10);
+
+-- invalid offset number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+
+-- dead tuple, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+rollback;
+
+-- cleanup.
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 261a559..c82dde2 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -125,6 +125,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>;
&pgrowlocks;
&pgstatstatements;
&pgstattuple;
+ &pgsurgery;
&pgtrgm;
&pgvisibility;
&postgres-fdw;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 64b5da0..828396d 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -139,6 +139,7 @@
<!ENTITY pgstandby SYSTEM "pgstandby.sgml">
<!ENTITY pgstatstatements SYSTEM "pgstatstatements.sgml">
<!ENTITY pgstattuple SYSTEM "pgstattuple.sgml">
+<!ENTITY pgsurgery SYSTEM "pgsurgery.sgml">
<!ENTITY pgtrgm SYSTEM "pgtrgm.sgml">
<!ENTITY pgvisibility SYSTEM "pgvisibility.sgml">
<!ENTITY postgres-fdw SYSTEM "postgres-fdw.sgml">
diff --git a/doc/src/sgml/pgsurgery.sgml b/doc/src/sgml/pgsurgery.sgml
new file mode 100644
index 0000000..92b9060
--- /dev/null
+++ b/doc/src/sgml/pgsurgery.sgml
@@ -0,0 +1,94 @@
+<!-- doc/src/sgml/pgsurgery.sgml -->
+
+<sect1 id="pgsurgery" xreflabel="pg_surgery">
+ <title>pg_surgery</title>
+
+ <indexterm zone="pgsurgery">
+ <primary>pg_surgery</primary>
+ </indexterm>
+
+ <para>
+ The <filename>pg_surgery</filename> module provides various functions to
+ perform surgery on a damaged relation.
+ </para>
+
+ <para>
+ By default, only the superusers have <literal>EXECUTE</literal> privilege on
+ the functions provided by this module. After the extension has been installed,
+ users may issue <command>GRANT</command> commands to change the privileges on
+ the functions to allow others to execute them.
+ </para>
+
+ <sect2>
+ <title>Functions</title>
+
+ <variablelist>
+ <varlistentry>
+ <term>
+ <function>heap_force_kill(regclass, tid[]) returns void</function>
+ </term>
+
+ <listitem>
+ <para>
+ <function>heap_force_kill</function> marks a <quote>used</quote> line
+ pointer as <quote>dead</quote> without examining the tuple. This may help
+ users to force-kill a corrupted tuple in the table. It takes the relation
+ oid and an array of item pointers as input arguments. For example:
+<programlisting>
+test=> select * from t1 where ctid = '(0, 1)';
+-[ RECORD 1 ]
+a | 10
+
+test=# select heap_force_kill('t1'::regclass, ARRAY['(0, 1)']::tid[]);
+-[ RECORD 1 ]---+-
+heap_force_kill |
+
+test=# select * from t1 where ctid = '(0, 1)';
+(0 rows)
+</programlisting>
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>
+ <function>heap_force_freeze(regclass, tid[]) returns void</function>
+ </term>
+
+ <listitem>
+ <para>
+ <function>heap_force_freeze</function> marks a live tuple as frozen by
+ setting its xmin to FrozenTransactionId and xmax to InvalidTransactionId.
+ It clears all the visibility related bits in the infomask of the tuple and
+ sets the infomask flag to HEAP_XMIN_FROZEN and HEAP_XMAX_INVALID. This may
+ help users to force-freeze a corrupted tuple in the table. Similar to
+ <function>heap_force_kill</function> function, this also takes the relation
+ oid and an array of item pointers as input arguments. For example:
+<programlisting>
+test=> select ctid from t1 where xmin = 2;
+(0 rows)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 1)']::tid[]);
+-[ RECORD 1 ]-----+-
+heap_force_freeze |
+
+test=# select ctid from t1 where xmin = 2;
+-[ RECORD 1 ]
+ctid | (0,1)
+</programlisting>
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect2>
+
+ <sect2>
+ <title>Authors</title>
+
+ <para>
+ Ashutosh Sharma <email>ashu.coek88@gmail.com</email>
+ </para>
+ </sect2>
+
+</sect1>
--
1.8.3.1
Hi Ashutosh
I stumbled upon this thread today, went through your patch and it looks good. A minor suggestion in sanity_check_relation():
if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("only heap AM is supported")));
Instead of checking the access method OID, it seems better to check the handler OID like so:
if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
The reason is that the current version of sanity_check_relation() would emit an error for the following case even when the table structure is actually heap.
create access method myam type table handler heap_tableam_handler;
create table mytable (…) using myam;
Asim
Hi Asim,
Thanks for having a look into the patch and for sharing your feedback.
Please find my comments inline below:
On Thu, Aug 13, 2020 at 12:36 PM Asim Praveen <pasim@vmware.com> wrote:
Hi Ashutosh
I stumbled upon this thread today, went through your patch and it looks good. A minor suggestion in sanity_check_relation():
if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("only heap AM is supported")));
Instead of checking the access method OID, it seems better to check the handler OID like so:
if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
The reason is current version of sanity_check_relation() would emit error for the following case even when the table structure is actually heap.
create access method myam type table handler heap_tableam_handler;
create table mytable (…) using myam;
This looks like a very good suggestion to me. I will do this change in
the next version. Just wondering if we should be doing similar changes
in other contrib modules (like pgrowlocks, pageinspect and
pgstattuple) as well?
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 13, 2020 at 3:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
This looks like a very good suggestion to me. I will do this change in
the next version. Just wondering if we should be doing similar changes
in other contrib modules (like pgrowlocks, pageinspect and
pgstattuple) as well?
It seems like it should be consistent, but I'm not sure the proposed
change is really an improvement.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 12 Aug 2020 at 22:27, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Thanks Robert for the review. Please find my comments inline below:
On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Attached v4 patch fixes the latest comments from Robert and Masahiko-san.
Compiler warning:
heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
always false [-Werror,-Wtautological-compare]
if (blkno < 0 || blkno >= nblocks)
~~~~~ ^ ~
Fixed.
There's a certain inconsistency to these messages:
rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# insert into foo values (1);
INSERT 0 1
rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
NOTICE: skipping tid (0, 2) because it contains an invalid offset
heap_force_kill
-----------------

(1 row)
rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
ERROR: invalid item pointer
LOCATION: tids_same_page_fetch_offnums, heap_surgery.c:347
rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
ERROR: block number 1 is out of range for relation "foo"
From a user perspective it seems like I've made three very similar
mistakes: in the first case, the offset is too high, in the second
case it's too low, and in the third case the block number is out of
range. But in one case I get a NOTICE and in the other two cases I get
an ERROR. In one case I get the relation name and in the other two
cases I don't. The two complaints about an invalid offset are phrased
completely differently from each other. For example, suppose you do
this:
ERROR: tid (%u, %u) is invalid for relation "%s" because the block
number is out of range (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item
number is out of range for this block (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
Corrected.
I think I misled you when I said to use pg_class_aclcheck. I think it
should actually be pg_class_ownercheck.
Okay, I've changed it to pg_class_ownercheck.
I think the relkind sanity check should permit RELKIND_MATVIEW also.
Yeah, actually we should allow MATVIEW, don't know why I thought of
blocking it earlier.
It's unclear to me why the freeze logic here shouldn't do this part of
what heap_prepare_freeze_tuple() does when freezing xmax:
frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
Yeah, we should have these changes when freezing the xmax.
Likewise, why should we not freeze or invalidate xvac in the case
where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
would do?
Again, we should have this as well.
Apart from above, this time I've also added the documentation on
pg_surgery module and added a few more test-cases.
Attached patch with above changes.
Thank you for updating the patch! Here are my comments on v5 patch:
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
pg_standby \
pg_stat_statements \
pg_trgm \
+ pg_surgery \
pgcrypto \
I guess we use alphabetical order here. So pg_surgery should be placed
before pg_trgm.
---
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ ItemIdSetDead(itemid);
I think that if the page is an all-visible page, we should clear an
all-visible bit on the visibility map corresponding to the page and
PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
return the wrong results.
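To illustrate the concern (a minimal sketch using the pg_visibility
contrib module; the table name and values here are hypothetical, not
from the patch):

```sql
-- An index-only scan answers queries from the index plus the
-- visibility map.  If heap_force_kill() marks an item dead on an
-- all-visible page without clearing the map bit, the index-only scan
-- can still "see" the killed tuple while a seq scan cannot.
create extension pg_visibility;
create table vm_demo (a int primary key);
insert into vm_demo select generate_series(1, 1000);
vacuum vm_demo;  -- sets the all-visible bit for the page
select all_visible from pg_visibility_map('vm_demo', 0);
-- after heap_force_kill('vm_demo', array['(0,1)'::tid]) the bit would
-- still read true until a subsequent VACUUM, so an index-only scan on
-- "a" could disagree with a seq scan over the same predicate.
```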
---
+ /*
+ * We do not mark the buffer dirty or do WAL logging for unmodifed
+ * pages.
+ */
+ if (!did_modify_page)
+ goto skip_wal;
+
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+
+skip_wal:
+ END_CRIT_SECTION();
+
s/unmodifed/unmodified/
Do we really need to use goto? I think we can modify it like follows:
if (did_modify_page)
{
/* Mark buffer dirty before we write WAL. */
MarkBufferDirty(buf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
log_newpage_buffer(buf, true);
}
END_CRIT_SECTION();
---
pg_force_freeze() can revive a tuple that is already deleted but not
vacuumed yet. Therefore, the user might need to reindex indexes after
using that function. For instance, with the following script, the last
two queries: index scan and seq scan, will return different results.
set enable_seqscan to off;
set enable_bitmapscan to off;
set enable_indexonlyscan to off;
create table tbl (a int primary key);
insert into tbl values (1);
update tbl set a = a + 100 where a = 1;
explain analyze select * from tbl where a < 200;
-- revive deleted tuple on heap
select heap_force_freeze('tbl', array['(0,1)'::tid]);
-- index scan returns 2 tuples
explain analyze select * from tbl where a < 200;
-- seq scan returns 1 tuple
set enable_seqscan to on;
explain analyze select * from tbl;
Also, if a tuple updated and moved to another partition is revived by
heap_force_freeze(), its ctid still has special values:
MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
see a problem yet caused by a visible tuple having the special ctid
value, but it might be worth considering either to reset the ctid
value as well or to not freeze an already-deleted tuple.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Masahiko-san,
Thanks for the review. Please check the comments inline below:
On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
Thank you for updating the patch! Here are my comments on v5 patch:
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
 pg_standby \
 pg_stat_statements \
 pg_trgm \
+ pg_surgery \
 pgcrypto \
I guess we use alphabetical order here. So pg_surgery should be placed
before pg_trgm.
Okay, will take care of this in the next version of patch.
---
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ ItemIdSetDead(itemid);
I think that if the page is an all-visible page, we should clear an
all-visible bit on the visibility map corresponding to the page and
PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
return the wrong results.
I think we should let VACUUM do that. Please note that this module is
intended to be used only on a damaged relation and should only be
operated on damaged tuples of such relations. And the execution of any
of the functions provided by this module on a damaged relation must be
followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
This is necessary to bring back a damaged relation to the sane state
once a surgery is performed on it. I will try to add this note in the
documentation for this module.
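For example, the intended follow-up might look like this (a sketch on
a hypothetical table; "damaged_tbl" and the tid are placeholders):

```sql
-- Perform the surgery on the known-bad tids, then force VACUUM to
-- visit every page so it can rebuild the visibility-map and
-- page-header state that the surgery functions deliberately leave
-- untouched.
select heap_force_kill('damaged_tbl'::regclass, array['(0,1)'::tid]);
vacuum (disable_page_skipping) damaged_tbl;
-- If tuples were revived with heap_force_freeze(), the indexes may
-- also be out of sync with the heap:
reindex table damaged_tbl;
```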
---
+ /*
+ * We do not mark the buffer dirty or do WAL logging for unmodifed
+ * pages.
+ */
+ if (!did_modify_page)
+ goto skip_wal;
+
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+
+skip_wal:
+ END_CRIT_SECTION();
+
s/unmodifed/unmodified/
okay, will fix this typo.
Do we really need to use goto? I think we can modify it like follows:
if (did_modify_page)
{
/* Mark buffer dirty before we write WAL. */
MarkBufferDirty(buf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
log_newpage_buffer(buf, true);
}
END_CRIT_SECTION();
No, we don't need it. We can achieve the same by checking the status
of did_modify_page flag as you suggested. I will do this change in the
next version.
---
pg_force_freeze() can revive a tuple that is already deleted but not
vacuumed yet. Therefore, the user might need to reindex indexes after
using that function. For instance, with the following script, the last
two queries: index scan and seq scan, will return different results.
set enable_seqscan to off;
set enable_bitmapscan to off;
set enable_indexonlyscan to off;
create table tbl (a int primary key);
insert into tbl values (1);
update tbl set a = a + 100 where a = 1;
explain analyze select * from tbl where a < 200;
-- revive deleted tuple on heap
select heap_force_freeze('tbl', array['(0,1)'::tid]);
-- index scan returns 2 tuples
explain analyze select * from tbl where a < 200;
-- seq scan returns 1 tuple
set enable_seqscan to on;
explain analyze select * from tbl;
I am not sure if this is the right use-case for the pg_force_freeze
function. I think we should only be running pg_force_freeze
on a tuple for which VACUUM reports a "found xmin ABC from before
relfrozenxid PQR" sort of error; otherwise it might worsen things
instead of making them better. Now, the question is - can VACUUM
report this type of error for a deleted tuple, or would it only report
it for a live tuple? AFAIU this won't be reported for deleted tuples
because VACUUM wouldn't consider freezing a tuple that has been
deleted.
Also, if a tuple updated and moved to another partition is revived by
heap_force_freeze(), its ctid still has special values:
MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
see a problem yet caused by a visible tuple having the special ctid
value, but it might be worth considering either to reset the ctid
value as well or to not freeze an already-deleted tuple.
For this as well, the answer remains the same as above.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Hello Masahiko-san,
I've spent some more time trying to understand the code in
lazy_scan_heap function to know under what all circumstances a VACUUM
can fail with "found xmin ... before relfrozenxid ..." error for a
tuple whose xmin is behind relfrozenxid. Here are my observations:
1) It can fail with this error for a live tuple
OR,
2) It can also fail with this error if a tuple (that went through
update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE.
OR,
3) If there are any concurrent transactions, then the tuple might be
marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS
or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with
this error.
Now, AFAIU, as we will be dealing with a damaged table, point #3 being
the cause of this error looks impossible in our case
because I don't think we will be doing anything in parallel when
performing surgery on a damaged table; in fact, we shouldn't be doing
any such things. However, it is quite possible that reason #2 could
cause VACUUM to fail with this sort of error, but, as we are already
skipping redirected item pointers in heap_force_common(), I think we
would never be marking a HEAP_HOT_UPDATED tuple as frozen, and I don't
see any problem in marking a HEAP_ONLY_TUPLE as frozen. So, probably,
we may not need to handle point #2 either.
Further, I also don't see VACUUM reporting this error for a tuple that
has been moved from one partition to another. So, I think we might not
need to do any special handling for a tuple that got updated and its
new version was moved to another partition.
If you feel I am missing something here, please correct me. Thank you.
Moreover, while I was exploring the above, I noticed that in
lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum() we check
for redirected item pointers, and if any redirected item pointer is
detected we do not call HeapTupleSatisfiesVacuum(). So, not sure how
HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked
with HEAP_HOT_UPDATED. I am referring to the following code in
lazy_scan_heap().
for (offnum = FirstOffsetNumber;
offnum <= maxoff;
offnum = OffsetNumberNext(offnum))
{
ItemId itemid;
itemid = PageGetItemId(page, offnum);
.............
.............
/* Redirect items mustn't be touched */ <-- this check
would bypass the redirected item pointers from being checked for
HeapTupleSatisfiesVacuum.
if (ItemIdIsRedirected(itemid))
{
hastup = true; /* this page won't be truncatable */
continue;
}
..............
..............
switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
{
case HEAPTUPLE_DEAD:
if (HeapTupleIsHotUpdated(&tuple) ||
HeapTupleIsHeapOnly(&tuple) ||
params->index_cleanup == VACOPT_TERNARY_DISABLED)
nkeep += 1;
else
tupgone = true; /* we can delete the tuple */
..............
..............
}
So, the point is, would HeapTupleIsHotUpdated(&tuple) ever be true?
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attached is the new version of patch that addresses the comments from
Asim Praveen and Masahiko-san. It also improves the documentation to
some extent.
Attachments:
v6-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch
From d765464116ee87ed64ea270ba75784954bc75377 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Tue, 18 Aug 2020 16:14:15 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap
table.
This contrib module basically adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used in the scripts
to perform surgery on the damaged heap tables.
Ashutosh Sharma.
---
contrib/Makefile | 1 +
contrib/pg_surgery/Makefile | 23 ++
contrib/pg_surgery/expected/pg_surgery.out | 161 +++++++++++++
contrib/pg_surgery/heap_surgery.c | 375 +++++++++++++++++++++++++++++
contrib/pg_surgery/pg_surgery--1.0.sql | 18 ++
contrib/pg_surgery/pg_surgery.control | 5 +
contrib/pg_surgery/sql/pg_surgery.sql | 89 +++++++
doc/src/sgml/contrib.sgml | 1 +
doc/src/sgml/filelist.sgml | 1 +
doc/src/sgml/pgsurgery.sgml | 121 ++++++++++
10 files changed, 795 insertions(+)
create mode 100644 contrib/pg_surgery/Makefile
create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
create mode 100644 contrib/pg_surgery/heap_surgery.c
create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
create mode 100644 contrib/pg_surgery/pg_surgery.control
create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
create mode 100644 doc/src/sgml/pgsurgery.sgml
diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..c8d2a16 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
pg_prewarm \
pg_standby \
pg_stat_statements \
+ pg_surgery \
pg_trgm \
pgcrypto \
pgrowlocks \
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..ecf2e20
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+ $(WIN32RES) \
+ heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out
new file mode 100644
index 0000000..9858de2
--- /dev/null
+++ b/contrib/pg_surgery/expected/pg_surgery.out
@@ -0,0 +1,161 @@
+create extension pg_surgery;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+select * from htab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select ctid, xmax from htab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,4) | 0
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from htab where ctid = '(0, 4)';
+ a
+---
+(0 rows)
+
+rollback;
+-- materialized view.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+select * from mvw where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select * from mvw where xmin = 2;
+ a
+---
+ 3
+(1 row)
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from mvw where ctid = '(0, 3)';
+ a
+---
+(0 rows)
+
+rollback;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+create index ptab_idx on ptab (a);
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+create view vw as select 1;
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+create sequence seq;
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "seq" is not a table, materialized view, or TOAST table
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "seq" is not a table, materialized view, or TOAST table
+--
+-- Some negative test-cases with invalid inputs.
+--
+begin;
+create table htab(a int);
+-- invalid block number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping block 0 for relation "htab" because the block number is out of range
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping block 0 for relation "htab" because the block number is out of range
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+insert into htab values (10);
+-- invalid offset number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE: skipping tid (0, 2) for relation "htab" because the item number is out of range for this block
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE: skipping tid (0, 2) for relation "htab" because the item number is out of range for this block
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+-- dead tuple, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping tid (0, 1) for relation "htab" because it is marked dead
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping tid (0, 1) for relation "htab" because it is marked dead
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+rollback;
+-- cleanup.
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
new file mode 100644
index 0000000..68d8205
--- /dev/null
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -0,0 +1,375 @@
+/*-------------------------------------------------------------------------
+ *
+ * heap_surgery.c
+ * Functions to perform surgery on the damaged heap table.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pg_surgery/heap_surgery.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_proc_d.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+
+PG_MODULE_MAGIC;
+
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HeapTupleForceOption
+{
+ HEAP_FORCE_KILL,
+ HEAP_FORCE_FREEZE
+} HeapTupleForceOption;
+
+PG_FUNCTION_INFO_V1(heap_force_kill);
+PG_FUNCTION_INFO_V1(heap_force_freeze);
+
+static int32 tidcmp(const void *a, const void *b);
+static Datum heap_force_common(FunctionCallInfo fcinfo,
+ HeapTupleForceOption heap_force_opt);
+static void sanity_check_tid_array(ArrayType *ta, int *ntids);
+static void sanity_check_relation(Relation rel);
+static BlockNumber tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr,
+ OffsetNumber *offnos);
+
+/*-------------------------------------------------------------------------
+ * heap_force_kill()
+ *
+ * Force kill the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_kill(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_kill(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_KILL));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_freeze()
+ *
+ * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_freeze(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_freeze(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_FREEZE));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_common()
+ *
+ * Common code for heap_force_kill and heap_force_freeze
+ *-------------------------------------------------------------------------
+ */
+static Datum
+heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids,
+ nblocks;
+ Relation rel;
+ OffsetNumber *offnos;
+ OffsetNumber noffs,
+ curr_start_ptr,
+ next_start_ptr;
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("heap surgery functions cannot be executed during recovery.")));
+
+ /* Basic sanity checking. */
+ sanity_check_tid_array(ta, &ntids);
+
+ rel = relation_open(relid, RowExclusiveLock);
+
+ sanity_check_relation(rel);
+
+ tids = ((ItemPointer) ARR_DATA_PTR(ta));
+
+ /*
+ * If there is more than one tid in the array, sort it so that we can
+ * easily fetch all the tids belonging to one particular page from the
+ * array.
+ */
+ if (ntids > 1)
+ qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+
+ offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+ noffs = curr_start_ptr = next_start_ptr = 0;
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ do
+ {
+ Buffer buf;
+ Page page;
+ BlockNumber blkno;
+ OffsetNumber maxoffset;
+ int i;
+ bool did_modify_page = false;
+
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+
+ /*
+ * Update the current start pointer so that next time when
+ * tids_same_page_fetch_offnums() is called, we can calculate the number
+ * of offsets present in the offnos array.
+ */
+ curr_start_ptr = next_start_ptr;
+
+ /* Check whether the block number is valid. */
+ if (blkno >= nblocks)
+ {
+ ereport(NOTICE,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("skipping block %u for relation \"%s\" because the block number is out of range",
+ blkno, RelationGetRelationName(rel))));
+ continue;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ buf = ReadBuffer(rel, blkno);
+ LockBufferForCleanup(buf);
+
+ page = BufferGetPage(buf);
+
+ maxoffset = PageGetMaxOffsetNumber(page);
+
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (i = 0; i < noffs; i++)
+ {
+ OffsetNumber offno;
+ ItemId itemid;
+
+ if (offnos[i] == 0 || offnos[i] > maxoffset)
+ {
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) for relation \"%s\" because the item number is out of range for this block",
+ blkno, offnos[i], RelationGetRelationName(rel)));
+ continue;
+ }
+
+ itemid = PageGetItemId(page, offnos[i]);
+
+ /* Follow any redirections until we find something useful. */
+ while (ItemIdIsRedirected(itemid))
+ {
+ offno = ItemIdGetRedirect(itemid);
+ itemid = PageGetItemId(page, offno);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /* Nothing to do if the itemid is unused or already dead. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+ {
+ if (!ItemIdIsUsed(itemid))
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked unused",
+ blkno, offnos[i], RelationGetRelationName(rel))));
+ else
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked dead",
+ blkno, offnos[i], RelationGetRelationName(rel))));
+ continue;
+ }
+
+ Assert(ItemIdIsNormal(itemid));
+
+ did_modify_page = true;
+
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ ItemIdSetDead(itemid);
+ else
+ {
+ HeapTupleHeader htup;
+
+ Assert(heap_force_opt == HEAP_FORCE_FREEZE);
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
+ HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
+
+ /* We might have MOVED_OFF/MOVED_IN tuples in the database */
+ if (htup->t_infomask & HEAP_MOVED)
+ {
+ if (htup->t_infomask & HEAP_MOVED_OFF)
+ HeapTupleHeaderSetXvac(htup, InvalidTransactionId);
+ else
+ HeapTupleHeaderSetXvac(htup, FrozenTransactionId);
+ }
+
+ /*
+ * Clear all the visibility-related bits of this tuple and mark
+ * it as frozen. Also, get rid of the HOT_UPDATED and KEYS_UPDATED
+ * bits.
+ */
+ htup->t_infomask &= ~HEAP_XACT_MASK;
+ htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID);
+ htup->t_infomask2 &= ~HEAP_HOT_UPDATED;
+ htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+ }
+ }
+
+ /*
+ * Mark the buffer dirty and WAL-log the change only if the page was
+ * actually modified.
+ */
+ if (did_modify_page)
+ {
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+ }
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+ } while (next_start_ptr != ntids);
+
+ relation_close(rel, RowExclusiveLock);
+
+ pfree(ta);
+ pfree(offnos);
+
+ PG_RETURN_VOID();
+}
+
+/*-------------------------------------------------------------------------
+ * tidcmp()
+ *
+ * Compare two item pointers, return -1, 0, or +1.
+ *
+ * See ItemPointerCompare for details.
+ * ------------------------------------------------------------------------
+ */
+static int32
+tidcmp(const void *a, const void *b)
+{
+ ItemPointer iptr1 = ((const ItemPointer) a);
+ ItemPointer iptr2 = ((const ItemPointer) b);
+
+ return ItemPointerCompare(iptr1, iptr2);
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_tid_array()
+ *
+ * Perform sanity check on the given tid array.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_tid_array(ArrayType *ta, int *ntids)
+{
+ if (ARR_HASNULL(ta) && array_contains_nulls(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ errmsg("array must not contain nulls")));
+
+ *ntids = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta));
+
+ if (*ntids == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_relation()
+ *
+ * Perform sanity check on the given relation.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_relation(Relation rel)
+{
+ if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only relations using heap_tableam_handler are supported")));
+
+ if (rel->rd_rel->relkind != RELKIND_RELATION &&
+ rel->rd_rel->relkind != RELKIND_MATVIEW &&
+ rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table, materialized view, or TOAST table",
+ RelationGetRelationName(rel))));
+
+ /* Must be owner of the table or superuser. */
+ if (!pg_class_ownercheck(RelationGetRelid(rel), GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER,
+ get_relkind_objtype(rel->rd_rel->relkind),
+ RelationGetRelationName(rel));
+}
+
+/*-------------------------------------------------------------------------
+ * tids_same_page_fetch_offnums()
+ *
+ * Find out all the tids residing in the same page as tids[next_start_ptr] and
+ * fetch the offset number stored in each of them into a caller-allocated offset
+ * number array.
+ * ------------------------------------------------------------------------
+ */
+static BlockNumber
+tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr, OffsetNumber *offnos)
+{
+ int i;
+ BlockNumber prev_blkno,
+ blkno;
+ OffsetNumber offno;
+
+ prev_blkno = blkno = InvalidBlockNumber;
+
+ for (i = *next_start_ptr; i < ntids; i++)
+ {
+ ItemPointerData tid = tids[i];
+
+ blkno = ItemPointerGetBlockNumberNoCheck(&tid);
+ offno = ItemPointerGetOffsetNumberNoCheck(&tid);
+
+ if (i == *next_start_ptr || (prev_blkno == blkno))
+ offnos[i - *next_start_ptr] = offno;
+ else
+ break;
+
+ prev_blkno = blkno;
+ }
+
+ *next_start_ptr = i;
+ return prev_blkno;
+}
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql
new file mode 100644
index 0000000..2ae7f22
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery--1.0.sql
@@ -0,0 +1,18 @@
+/* contrib/pg_surgery/pg_surgery--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit
+
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control
new file mode 100644
index 0000000..2bcdad1
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery.control
@@ -0,0 +1,5 @@
+# pg_surgery extension
+comment = 'extension to perform surgery on a damaged relation'
+default_version = '1.0'
+module_pathname = '$libdir/pg_surgery'
+relocatable = true
diff --git a/contrib/pg_surgery/sql/pg_surgery.sql b/contrib/pg_surgery/sql/pg_surgery.sql
new file mode 100644
index 0000000..4aef380
--- /dev/null
+++ b/contrib/pg_surgery/sql/pg_surgery.sql
@@ -0,0 +1,89 @@
+create extension pg_surgery;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+
+select * from htab where xmin = 2;
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select ctid, xmax from htab where xmin = 2;
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select * from htab where ctid = '(0, 4)';
+rollback;
+
+-- materialized view.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+
+select * from mvw where xmin = 2;
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where xmin = 2;
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where ctid = '(0, 3)';
+rollback;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+
+-- a partitioned table (the parent) doesn't contain any tuples, so these should fail.
+create table ptab (a int) partition by list (a);
+
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create index ptab_idx on ptab (a);
+
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create view vw as select 1;
+
+-- views are not supported either, so these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create sequence seq;
+
+-- sequences are not supported either, so these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+
+--
+-- Some negative test-cases with invalid inputs.
+--
+begin;
+create table htab(a int);
+
+-- invalid block number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+insert into htab values (10);
+
+-- invalid offset number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+
+-- dead tuple, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+rollback;
+
+-- cleanup.
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 261a559..c82dde2 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -125,6 +125,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>;
&pgrowlocks;
&pgstatstatements;
&pgstattuple;
+ &pgsurgery;
&pgtrgm;
&pgvisibility;
&postgres-fdw;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 64b5da0..828396d 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -139,6 +139,7 @@
<!ENTITY pgstandby SYSTEM "pgstandby.sgml">
<!ENTITY pgstatstatements SYSTEM "pgstatstatements.sgml">
<!ENTITY pgstattuple SYSTEM "pgstattuple.sgml">
+<!ENTITY pgsurgery SYSTEM "pgsurgery.sgml">
<!ENTITY pgtrgm SYSTEM "pgtrgm.sgml">
<!ENTITY pgvisibility SYSTEM "pgvisibility.sgml">
<!ENTITY postgres-fdw SYSTEM "postgres-fdw.sgml">
diff --git a/doc/src/sgml/pgsurgery.sgml b/doc/src/sgml/pgsurgery.sgml
new file mode 100644
index 0000000..8d2ed81
--- /dev/null
+++ b/doc/src/sgml/pgsurgery.sgml
@@ -0,0 +1,121 @@
+<!-- doc/src/sgml/pgsurgery.sgml -->
+
+<sect1 id="pgsurgery" xreflabel="pg_surgery">
+ <title>pg_surgery</title>
+
+ <indexterm zone="pgsurgery">
+ <primary>pg_surgery</primary>
+ </indexterm>
+
+ <para>
+ The <filename>pg_surgery</filename> module provides various functions to
+ perform surgery on a damaged relation. It is intended to be used only on
+ damaged relations, and only on the damaged tuples of such relations.
+ </para>
+
+ <para>
+ By default, only superusers have <literal>EXECUTE</literal> privilege on
+ the functions provided by this module. Once the extension has been installed,
+ users may issue <command>GRANT</command> commands to change the privileges on
+ the functions to allow others to execute them.
+ </para>
+
+ <note>
+ <para>
+ After surgery is performed on a damaged relation using this module,
+ <command>VACUUM</command> with the <literal>DISABLE_PAGE_SKIPPING</literal>
+ option must be run on that relation to bring it back into a sane state.
+ </para>
+ </note>
+
+ <note>
+ <para>
+ While surgery is being performed on a damaged relation, nothing else should
+ be done on that relation in parallel. This ensures that no other transaction
+ is trying to modify a damaged tuple while it is being operated on.
+ </para>
+ </note>
+
+ <sect2>
+ <title>Functions</title>
+
+ <variablelist>
+ <varlistentry>
+ <term>
+ <function>heap_force_kill(regclass, tid[]) returns void</function>
+ </term>
+
+ <listitem>
+ <para>
+ <function>heap_force_kill</function> marks a <quote>used</quote> line
+ pointer as <quote>dead</quote> without examining the tuple. This may help
+ users force-kill a corrupted tuple in a table, whether or not it is
+ accessible. It takes the relation oid and an array of item pointers as
+ input arguments. For example:
+<programlisting>
+test=> select * from t1 where ctid = '(0, 1)';
+ERROR: could not access status of transaction 4007513275
+DETAIL: Could not open file "pg_xact/0EED": No such file or directory.
+
+test=# select heap_force_kill('t1'::regclass, ARRAY['(0, 1)']::tid[]);
+-[ RECORD 1 ]---+-
+heap_force_kill |
+
+test=# select * from t1 where ctid = '(0, 1)';
+(0 rows)
+</programlisting>
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>
+ <function>heap_force_freeze(regclass, tid[]) returns void</function>
+ </term>
+
+ <listitem>
+ <para>
+ <function>heap_force_freeze</function> marks a live tuple as frozen by
+ setting its xmin to FrozenTransactionId and xmax to InvalidTransactionId.
+ It clears all the visibility-related bits in the tuple's infomask and
+ sets the HEAP_XMIN_FROZEN and HEAP_XMAX_INVALID infomask flags. This may
+ help users force-freeze a tuple in a table that should actually have
+ been marked as frozen. Like <function>heap_force_kill</function>,
+ it takes the relation oid and an array of item pointers as input
+ arguments. For example:
+<programlisting>
+test=> vacuum t1;
+ERROR: found xmin 507 from before relfrozenxid 515
+CONTEXT: while scanning block 0 of relation "public.t1"
+
+test=# select ctid from t1 where xmin = 507;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
+-[ RECORD 1 ]-----+-
+heap_force_freeze |
+
+test=# vacuum t1;
+VACUUM
+</programlisting>
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect2>
+
+ <sect2>
+ <title>Authors</title>
+
+ <para>
+ Ashutosh Sharma <email>ashu.coek88@gmail.com</email>
+ </para>
+ </sect2>
+
+</sect1>
--
1.8.3.1
On Tue, Jul 14, 2020 at 12:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Jul 13, 2020 at 2:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
1. There's nothing to identify the tuple that has the problem, and no
way to know how many more of them there might be. Back-patching
b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
part of this.
I am in favor of backpatching such changes in cases where senior
community members feel that it could help with hypothetical
undiscovered data corruption issues -- if they're willing to take
responsibility for the change. It certainly wouldn't be the first
time. A "defense in depth" mindset seems like the right one when it
comes to data corruption bugs. Early detection is really important.
Moreover, not everyone is as
interested in an extended debugging exercise as they are in getting
the system working again, and VACUUM failing repeatedly is a pretty
serious problem.
That's absolutely consistent with my experience. Most users want to
get back to business as usual now, while letting somebody else do the
hard work of debugging.
Also even if you do trace the problem you still have to recover.
And sometimes I have found latent corruption from times when dbs were
running on older versions and older servers, making debugging largely a
futile exercise.
Therefore, one of my colleagues has - at my request - created a couple
of functions called heap_force_kill() and heap_force_freeze() which
take an array of TIDs.
So I have these questions:
- Do people think it would me smart/good/useful to include something
like this in PostgreSQL?
I'm in favor of it.
+1
Would be worth extending it with some functions to grab rows that have
various TOAST oids too.
- If so, how? I would propose a new contrib module that we back-patch
all the way, because the VACUUM errors were back-patched all the way,
and there seems to be no advantage in making people wait 5 years for a
new version that has some kind of tooling in this area.
I'm in favor of it being *possible* to backpatch tooling that is
clearly related to correctness in a fundamental way. Obviously this
would mean that we'd be revising our general position on backpatching
to allow some limited exceptions around corruption. I'm not sure that
this meets that standard, though. It's hardly something that we can
expect all that many users to be able to use effectively.
I may be biased, but I'd be inclined to permit it in the case of
something like amcheck, or pg_visibility, on the grounds that they're
more or less the same as the new VACUUM errcontext instrumentation you
mentioned. The same cannot be said of something like this new
heap_force_kill() stuff.
- Any ideas for additional things we should include, or improvements
on the sketch above?
Clearly you should work out a way of making it very hard to
accidentally (mis)use. For example, maybe you make the functions check
for the presence of a sentinel file in the data directory.
Agreed.
--
Peter Geoghegan
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
Thanks for the suggestions Ashutosh, I have done testing around these suggestions
and found no issues. I will continue testing the same with the updated patch posted
on this thread.
On Fri, Aug 7, 2020 at 12:45 PM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:
Thanks Rajkumar for testing the patch.
Here are some of the additional test-cases that I would suggest you to
execute, if possible:
1) You may try running the test-cases that you have executed so far
with SR setup and see if the changes are getting reflected on the
standby.
2) You may also try running some concurrent test-cases for e.g. try
running these functions with VACUUM or some other sql commands
(preferable DML commands) in parallel.
3) See what happens when you pass some invalid tids (containing
invalid block or offset number) to these functions. You may also try
running these functions on the same tuple repeatedly and see the
behaviour....
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Thanks & Regards,
Rajkumar Raghuwanshi
On 2020-Aug-17, Ashutosh Sharma wrote:
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ ItemIdSetDead(itemid);
I think that if the page is an all-visible page, we should clear an
all-visible bit on the visibility map corresponding to the page and
PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
return the wrong results.
I think we should let VACUUM do that. Please note that this module is
intended to be used only on a damaged relation and should only be
operated on damaged tuples of such relations. And the execution of any
of the functions provided by this module on a damaged relation must be
followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
This is necessary to bring back a damaged relation to the sane state
once a surgery is performed on it. I will try to add this note in the
documentation for this module.
It makes sense to recommend VACUUM after fixing the page, but I agree
with Sawada-san that it would be sensible to reset the VM bit while
doing surgery, since that's the state that the page would be in. We
should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
but if users fail to do so, then leaving the VM bit set just means that
we know *for certain* that there will be further corruption as soon as
the XID counter advances sufficiently.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hello Masahiko-san,
Thanks for the review. Please check the comments inline below:
On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:Thank you for updating the patch! Here are my comments on v5 patch:
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
 pg_standby \
 pg_stat_statements \
 pg_trgm \
+ pg_surgery \
 pgcrypto \
I guess we use alphabetical order here. So pg_surgery should be placed
before pg_trgm.
Okay, will take care of this in the next version of patch.
---
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ ItemIdSetDead(itemid);
I think that if the page is an all-visible page, we should clear an
all-visible bit on the visibility map corresponding to the page and
PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
return the wrong results.
I think we should let VACUUM do that. Please note that this module is
intended to be used only on a damaged relation and should only be
operated on damaged tuples of such relations. And the execution of any
of the functions provided by this module on a damaged relation must be
followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
This is necessary to bring back a damaged relation to the sane state
once a surgery is performed on it. I will try to add this note in the
documentation for this module.--- + /* + * We do not mark the buffer dirty or do WAL logging for unmodifed + * pages. + */ + if (!did_modify_page) + goto skip_wal; + + /* Mark buffer dirty before we write WAL. */ + MarkBufferDirty(buf); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + log_newpage_buffer(buf, true); + +skip_wal: + END_CRIT_SECTION(); +s/unmodifed/unmodified/
okay, will fix this typo.
Do we really need to use goto? I think we can modify it like follows:
if (did_modify_page)
{
/* Mark buffer dirty before we write WAL. */
MarkBufferDirty(buf);

/* XLOG stuff */
if (RelationNeedsWAL(rel))
log_newpage_buffer(buf, true);
}

END_CRIT_SECTION();
No, we don't need it. We can achieve the same by checking the status
of did_modify_page flag as you suggested. I will do this change in the
next version.
---
pg_force_freeze() can revive a tuple that is already deleted but not
vacuumed yet. Therefore, the user might need to reindex indexes after
using that function. For instance, with the following script, the last
two queries: index scan and seq scan, will return different results.
set enable_seqscan to off;
set enable_bitmapscan to off;
set enable_indexonlyscan to off;
create table tbl (a int primary key);
insert into tbl values (1);
update tbl set a = a + 100 where a = 1;
explain analyze select * from tbl where a < 200;
-- revive deleted tuple on heap
select heap_force_freeze('tbl', array['(0,1)'::tid]);
-- index scan returns 2 tuples
explain analyze select * from tbl where a < 200;
-- seq scan returns 1 tuple
set enable_seqscan to on;
explain analyze select * from tbl;
I am not sure if this is the right use-case of pg_force_freeze
function. I think we should only be running pg_force_freeze function
on a tuple for which VACUUM reports "found xmin ABC from before
relfrozenxid PQR" sort of error otherwise it might worsen the things
instead of making it better.
Should this also be documented? I think that it's hard to force the
user to always use this module in the right situation but we need to
show at least when to use.
Also, if a tuple updated and moved to another partition is revived by
heap_force_freeze(), its ctid still has special values:
MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
see a problem yet caused by a visible tuple having the special ctid
value, but it might be worth considering either to reset ctid value as
well or to not freezing already-deleted tuple.
For this as well, the answer remains the same as above.
Perhaps the same is true when a tuple header is corrupted including
xmin and ctid for some reason and the user wants to fix it? I'm
concerned that a live tuple having the wrong ctid will cause SEGV or
PANIC error in the future.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Aug 18, 2020 at 9:44 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2020-Aug-17, Ashutosh Sharma wrote:
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ ItemIdSetDead(itemid);
I think that if the page is an all-visible page, we should clear an
all-visible bit on the visibility map corresponding to the page and
PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
return the wrong results.
I think we should let VACUUM do that. Please note that this module is
intended to be used only on a damaged relation and should only be
operated on damaged tuples of such relations. And the execution of any
of the functions provided by this module on a damaged relation must be
followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
This is necessary to bring back a damaged relation to the sane state
once a surgery is performed on it. I will try to add this note in the
documentation for this module.
It makes sense to recommend VACUUM after fixing the page, but I agree
with Sawada-san that it would be sensible to reset the VM bit while
doing surgery, since that's the state that the page would be in.
Sure, I will try to do that change but I would still recommend to
always run VACUUM with DISABLE_PAGE_SKIPPING option on the relation
that underwent surgery.
We
should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
but if users fail to do so, then leaving the VM bit set just means that
we know *for certain* that there will be further corruption as soon as
the XID counter advances sufficiently.
Yeah, I've already added a note for this in the documentation:
Note: "After a surgery is performed on a damaged relation using this
module, we must run VACUUM with DISABLE_PAGE_SKIPPING option on that
relation to bring it back into a sane state."
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
pg_force_freeze() can revive a tuple that is already deleted but not
vacuumed yet. Therefore, the user might need to reindex indexes after
using that function. For instance, with the following script, the last
two queries: index scan and seq scan, will return different results.
set enable_seqscan to off;
set enable_bitmapscan to off;
set enable_indexonlyscan to off;
create table tbl (a int primary key);
insert into tbl values (1);
update tbl set a = a + 100 where a = 1;
explain analyze select * from tbl where a < 200;
-- revive deleted tuple on heap
select heap_force_freeze('tbl', array['(0,1)'::tid]);
-- index scan returns 2 tuples
explain analyze select * from tbl where a < 200;
-- seq scan returns 1 tuple
set enable_seqscan to on;
explain analyze select * from tbl;
I am not sure if this is the right use-case of pg_force_freeze
function. I think we should only be running pg_force_freeze function
on a tuple for which VACUUM reports "found xmin ABC from before
relfrozenxid PQR" sort of error otherwise it might worsen the things
instead of making it better.
Should this also be documented? I think that it's hard to force the
user to always use this module in the right situation but we need to
show at least when to use.
I've already added some examples in the documentation explaining the
use-case of force_freeze function. If required, I will also add a note
about it.
Also, if a tuple updated and moved to another partition is revived by
heap_force_freeze(), its ctid still has special values:
MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
see a problem yet caused by a visible tuple having the special ctid
value, but it might be worth considering either to reset ctid value as
well or to not freezing already-deleted tuple.
For this as well, the answer remains the same as above.
Perhaps the same is true when a tuple header is corrupted including
xmin and ctid for some reason and the user wants to fix it? I'm
concerned that a live tuple having the wrong ctid will cause SEGV or
PANIC error in the future.
If a tuple header itself is corrupted, then I think we must kill that
tuple. If only xmin and t_ctid fields are corrupted, then probably we
can think of resetting the ctid value of that tuple. However, it won't
be always possible to detect the corrupted ctid value. It's quite
possible that the corrupted ctid field has valid values for block
number and offset number in it, but it's actually corrupted and it
would be difficult to consider such ctid as corrupted. Hence, we can't
do anything about such types of corruption. Probably in such cases we
need to run VACUUM FULL on such tables so that new ctid gets generated
for each tuple in the table.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
pg_force_freeze() can revive a tuple that is already deleted but not
vacuumed yet. Therefore, the user might need to reindex indexes after
using that function. For instance, with the following script, the last
two queries: index scan and seq scan, will return different results.set enable_seqscan to off;
set enable_bitmapscan to off;
set enable_indexonlyscan to off;
create table tbl (a int primary key);
insert into tbl values (1);
update tbl set a = a + 100 where a = 1;
explain analyze select * from tbl where a < 200;
-- revive deleted tuple on heap
select heap_force_freeze('tbl', array['(0,1)'::tid]);
-- index scan returns 2 tuples
explain analyze select * from tbl where a < 200;
-- seq scan returns 1 tuple
set enable_seqscan to on;
explain analyze select * from tbl;
I am not sure if this is the right use-case of pg_force_freeze
function. I think we should only be running pg_force_freeze function
on a tuple for which VACUUM reports "found xmin ABC from before
relfrozenxid PQR" sort of error otherwise it might worsen the things
instead of making it better.
Should this also be documented? I think that it's hard to force the
user to always use this module in the right situation but we need to
show at least when to use.
I've already added some examples in the documentation explaining the
use-case of force_freeze function. If required, I will also add a note
about it.
Also, if a tuple updated and moved to another partition is revived by
heap_force_freeze(), its ctid still has special values:
MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
see a problem yet caused by a visible tuple having the special ctid
value, but it might be worth considering either to reset ctid value as
well or to not freezing already-deleted tuple.
For this as well, the answer remains the same as above.
Perhaps the same is true when a tuple header is corrupted including
xmin and ctid for some reason and the user wants to fix it? I'm
concerned that a live tuple having the wrong ctid will cause SEGV or
PANIC error in the future.
If a tuple header itself is corrupted, then I think we must kill that
tuple. If only xmin and t_ctid fields are corrupted, then probably we
can think of resetting the ctid value of that tuple. However, it won't
be always possible to detect the corrupted ctid value. It's quite
possible that the corrupted ctid field has valid values for block
number and offset number in it, but it's actually corrupted and it
would be difficult to consider such ctid as corrupted. Hence, we can't
do anything about such types of corruption. Probably in such cases we
need to run VACUUM FULL on such tables so that new ctid gets generated
for each tuple in the table.
Understood.
Perhaps such corruption will be detectable by the new heapam check functions discussed on another thread. My point was that it might be better to attempt to restore the tuple header to as sane a state as possible when fixing a live tuple, in order to prevent further problems such as database crashes caused by malicious attackers.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
pg_force_freeze() can revive a tuple that is already deleted but not vacuumed yet, so the user might need to reindex indexes after using that function. For instance, with the script quoted above, the last two queries (index scan and seq scan) will return different results.
Agreed. So, to handle the ctid related concern that you raised, I'm
planning to do the following changes to ensure that the tuple being
marked as frozen contains the correct item pointer value. Please let
me know if you are okay with these changes.
        HeapTupleHeader htup;
+       ItemPointerData ctid;

        Assert(heap_force_opt == HEAP_FORCE_FREEZE);

+       ItemPointerSet(&ctid, blkno, offnos[i]);
+
        htup = (HeapTupleHeader) PageGetItem(page, itemid);

+       /*
+        * Make sure that this tuple holds the correct item pointer
+        * value.
+        */
+       if (!HeapTupleHeaderIndicatesMovedPartitions(htup) &&
+           !ItemPointerEquals(&ctid, &htup->t_ctid))
+           ItemPointerSet(&htup->t_ctid, blkno, offnos[i]);
+
        HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
        HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Given that a live tuple never indicates that it has been moved to another partition, I guess the first condition in the if statement is not necessary. The rest looks good to me, although other hackers might think differently.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 20, 2020 at 11:04 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
Okay, thanks for confirming. I am planning to go ahead with this
approach. Will later see what others have to say about it.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Hi Masahiko-san,
Please find the updated patch with the following new changes:
1) It adds code changes in the heap_force_kill function to clear the all-visible bit in the visibility map for a page that is marked all-visible. Along the way, it also clears the PD_ALL_VISIBLE flag on the page header.
2) It adds code changes in the heap_force_freeze function to reset the ctid value in a tuple header if it is corrupted.
3) It adds several notes and examples to the documentation stating when and how to use the functions provided by this module.
Please have a look and let me know if you have any other concerns.
Thanks,
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachment: v7-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch (text/x-patch)
From 4615bd2356a8102fed3f71efeb7181f8857f04e6 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Fri, 21 Aug 2020 18:13:35 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap
table.
This contrib module adds a couple of functions, heap_force_kill and
heap_force_freeze, that can be used to perform surgery on damaged
heap tables.

Ashutosh Sharma.
---
contrib/Makefile | 1 +
contrib/pg_surgery/Makefile | 23 ++
contrib/pg_surgery/expected/pg_surgery.out | 161 +++++++++++
contrib/pg_surgery/heap_surgery.c | 418 +++++++++++++++++++++++++++++
contrib/pg_surgery/pg_surgery--1.0.sql | 18 ++
contrib/pg_surgery/pg_surgery.control | 5 +
contrib/pg_surgery/sql/pg_surgery.sql | 89 ++++++
doc/src/sgml/contrib.sgml | 1 +
doc/src/sgml/filelist.sgml | 1 +
doc/src/sgml/pgsurgery.sgml | 136 ++++++++++
10 files changed, 853 insertions(+)
create mode 100644 contrib/pg_surgery/Makefile
create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
create mode 100644 contrib/pg_surgery/heap_surgery.c
create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
create mode 100644 contrib/pg_surgery/pg_surgery.control
create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
create mode 100644 doc/src/sgml/pgsurgery.sgml
diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..c8d2a16 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
pg_prewarm \
pg_standby \
pg_stat_statements \
+ pg_surgery \
pg_trgm \
pgcrypto \
pgrowlocks \
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..ecf2e20
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+ $(WIN32RES) \
+ heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out
new file mode 100644
index 0000000..9858de2
--- /dev/null
+++ b/contrib/pg_surgery/expected/pg_surgery.out
@@ -0,0 +1,161 @@
+create extension pg_surgery;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+select * from htab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select ctid, xmax from htab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,4) | 0
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from htab where ctid = '(0, 4)';
+ a
+---
+(0 rows)
+
+rollback;
+-- materialized view.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+select * from mvw where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select * from mvw where xmin = 2;
+ a
+---
+ 3
+(1 row)
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from mvw where ctid = '(0, 3)';
+ a
+---
+(0 rows)
+
+rollback;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+create index ptab_idx on ptab (a);
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+create view vw as select 1;
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+create sequence seq;
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "seq" is not a table, materialized view, or TOAST table
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "seq" is not a table, materialized view, or TOAST table
+--
+-- Some negative test-cases with invalid inputs.
+--
+begin;
+create table htab(a int);
+-- invalid block number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping block 0 for relation "htab" because the block number is out of range
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping block 0 for relation "htab" because the block number is out of range
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+insert into htab values (10);
+-- invalid offset number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE: skipping tid (0, 2) for relation "htab" because the item number is out of range for this block
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE: skipping tid (0, 2) for relation "htab" because the item number is out of range for this block
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+-- dead tuple, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping tid (0, 1) for relation "htab" because it is marked dead
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping tid (0, 1) for relation "htab" because it is marked dead
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+rollback;
+-- cleanup.
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
new file mode 100644
index 0000000..5228990
--- /dev/null
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -0,0 +1,418 @@
+/*-------------------------------------------------------------------------
+ *
+ * heap_surgery.c
+ * Functions to perform surgery on the damaged heap table.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pg_surgery/heap_surgery.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/visibilitymap.h"
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_proc_d.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+
+PG_MODULE_MAGIC;
+
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HeapTupleForceOption
+{
+ HEAP_FORCE_KILL,
+ HEAP_FORCE_FREEZE
+} HeapTupleForceOption;
+
+PG_FUNCTION_INFO_V1(heap_force_kill);
+PG_FUNCTION_INFO_V1(heap_force_freeze);
+
+static int32 tidcmp(const void *a, const void *b);
+static Datum heap_force_common(FunctionCallInfo fcinfo,
+ HeapTupleForceOption heap_force_opt);
+static void sanity_check_tid_array(ArrayType *ta, int *ntids);
+static void sanity_check_relation(Relation rel);
+static BlockNumber tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr,
+ OffsetNumber *offnos);
+
+/*-------------------------------------------------------------------------
+ * heap_force_kill()
+ *
+ * Force kill the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_kill(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_kill(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_KILL));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_freeze()
+ *
+ * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_freeze(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_freeze(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_FREEZE));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_common()
+ *
+ * Common code for heap_force_kill and heap_force_freeze
+ *-------------------------------------------------------------------------
+ */
+static Datum
+heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids,
+ nblocks;
+ Relation rel;
+ OffsetNumber *offnos;
+ OffsetNumber noffs,
+ curr_start_ptr,
+ next_start_ptr;
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("heap surgery functions cannot be executed during recovery.")));
+
+ /* Basic sanity checking. */
+ sanity_check_tid_array(ta, &ntids);
+
+ rel = relation_open(relid, RowExclusiveLock);
+
+ sanity_check_relation(rel);
+
+ tids = ((ItemPointer) ARR_DATA_PTR(ta));
+
+ /*
+ * If there is more than one tid in the array, sort it so that we can
+ * easily fetch all the tids belonging to one particular page from the
+ * array.
+ */
+ if (ntids > 1)
+ qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+
+ offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+ noffs = curr_start_ptr = next_start_ptr = 0;
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ do
+ {
+ Buffer buf;
+ Buffer vmbuf = InvalidBuffer;
+ Page page;
+ BlockNumber blkno;
+ OffsetNumber maxoffset;
+ int i;
+ bool did_modify_page = false;
+ bool did_modify_vm = false;
+
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+
+ /*
+ * Update the current start pointer so that next time when
+ * tids_same_page_fetch_offnums() is called, we can calculate the number
+ * of offsets present in the offnos array.
+ */
+ curr_start_ptr = next_start_ptr;
+
+ /* Check whether the block number is valid. */
+ if (blkno >= nblocks)
+ {
+ ereport(NOTICE,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("skipping block %u for relation \"%s\" because the block number is out of range",
+ blkno, RelationGetRelationName(rel))));
+ continue;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ buf = ReadBuffer(rel, blkno);
+ LockBufferForCleanup(buf);
+
+ page = BufferGetPage(buf);
+
+ maxoffset = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Before entering the critical section, pin the visibility map page if
+ * it appears to be necessary.
+ */
+ if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
+ visibilitymap_pin(rel, blkno, &vmbuf);
+
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (i = 0; i < noffs; i++)
+ {
+ OffsetNumber offno;
+ ItemId itemid;
+
+ if (offnos[i] == 0 || offnos[i] > maxoffset)
+ {
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) for relation \"%s\" because the item number is out of range for this block",
+ blkno, offnos[i], RelationGetRelationName(rel)));
+ continue;
+ }
+
+ itemid = PageGetItemId(page, offnos[i]);
+
+ /* Follow any redirections until we find something useful. */
+ while (ItemIdIsRedirected(itemid))
+ {
+ offno = ItemIdGetRedirect(itemid);
+ itemid = PageGetItemId(page, offno);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /* Nothing to do if the itemid is unused or already dead. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+ {
+ if (!ItemIdIsUsed(itemid))
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked unused",
+ blkno, offnos[i], RelationGetRelationName(rel))));
+ else
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked dead",
+ blkno, offnos[i], RelationGetRelationName(rel))));
+ continue;
+ }
+
+ Assert(ItemIdIsNormal(itemid));
+
+ did_modify_page = true;
+
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ {
+
+ ItemIdSetDead(itemid);
+
+ /*
+ * If the page is marked all-visible, we must clear the
+ * PD_ALL_VISIBLE flag on the page header and the all-visible bit
+ * in the visibility map for this page.
+ */
+ if (PageIsAllVisible(page))
+ {
+ PageClearAllVisible(page);
+ visibilitymap_clear(rel, blkno, vmbuf,
+ VISIBILITYMAP_ALL_VISIBLE);
+ did_modify_vm = true;
+ }
+ }
+ else
+ {
+ HeapTupleHeader htup;
+ ItemPointerData ctid;
+
+ Assert(heap_force_opt == HEAP_FORCE_FREEZE);
+
+ ItemPointerSet(&ctid, blkno, offnos[i]);
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ /*
+ * Make sure that this tuple holds the correct item pointer
+ * value.
+ */
+ if (!ItemPointerEquals(&ctid, &htup->t_ctid))
+ ItemPointerSet(&htup->t_ctid, blkno, offnos[i]);
+
+ HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
+ HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
+
+ /* We might have MOVED_OFF/MOVED_IN tuples in the database */
+ if (htup->t_infomask & HEAP_MOVED)
+ {
+ if (htup->t_infomask & HEAP_MOVED_OFF)
+ HeapTupleHeaderSetXvac(htup, InvalidTransactionId);
+ else
+ HeapTupleHeaderSetXvac(htup, FrozenTransactionId);
+ }
+
+ /*
+ * Clear all the visibility-related bits of this tuple and mark
+ * it as frozen. Also, get rid of the HOT_UPDATED and KEYS_UPDATED
+ * bits.
+ */
+ htup->t_infomask &= ~HEAP_XACT_MASK;
+ htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID);
+ htup->t_infomask2 &= ~HEAP_HOT_UPDATED;
+ htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+ }
+ }
+
+ /*
+ * Only if the page was modified do we mark the buffer dirty and
+ * WAL-log the change.
+ */
+ if (did_modify_page)
+ {
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+ }
+
+ /* WAL log the VM page if it was modified. */
+ if (did_modify_vm && RelationNeedsWAL(rel))
+ log_newpage_buffer(vmbuf, false);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+
+ if (vmbuf != InvalidBuffer)
+ ReleaseBuffer(vmbuf);
+ } while (next_start_ptr != ntids);
+
+ relation_close(rel, RowExclusiveLock);
+
+ pfree(ta);
+ pfree(offnos);
+
+ PG_RETURN_VOID();
+}
+
+/*-------------------------------------------------------------------------
+ * tidcmp()
+ *
+ * Compare two item pointers, return -1, 0, or +1.
+ *
+ * See ItemPointerCompare for details.
+ * ------------------------------------------------------------------------
+ */
+static int32
+tidcmp(const void *a, const void *b)
+{
+ ItemPointer iptr1 = ((const ItemPointer) a);
+ ItemPointer iptr2 = ((const ItemPointer) b);
+
+ return ItemPointerCompare(iptr1, iptr2);
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_tid_array()
+ *
+ * Perform sanity check on the given tid array.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_tid_array(ArrayType *ta, int *ntids)
+{
+ if (ARR_HASNULL(ta) && array_contains_nulls(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ errmsg("array must not contain nulls")));
+
+ *ntids = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta));
+
+ if (*ntids == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_relation()
+ *
+ * Perform sanity check on the given relation.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_relation(Relation rel)
+{
+ if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only relations using heap_tableam_handler are supported")));
+
+ if (rel->rd_rel->relkind != RELKIND_RELATION &&
+ rel->rd_rel->relkind != RELKIND_MATVIEW &&
+ rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table, materialized view, or TOAST table",
+ RelationGetRelationName(rel))));
+
+ /* Must be owner of the table or superuser. */
+ if (!pg_class_ownercheck(RelationGetRelid(rel), GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER,
+ get_relkind_objtype(rel->rd_rel->relkind),
+ RelationGetRelationName(rel));
+}
+
+/*-------------------------------------------------------------------------
+ * tids_same_page_fetch_offnums()
+ *
+ * Find out all the tids residing in the same page as tids[next_start_ptr] and
+ * fetch the offset number stored in each of them into a caller-allocated offset
+ * number array.
+ * ------------------------------------------------------------------------
+ */
+static BlockNumber
+tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr, OffsetNumber *offnos)
+{
+ int i;
+ BlockNumber prev_blkno,
+ blkno;
+ OffsetNumber offno;
+
+ prev_blkno = blkno = InvalidBlockNumber;
+
+ for (i = *next_start_ptr; i < ntids; i++)
+ {
+ ItemPointerData tid = tids[i];
+
+ blkno = ItemPointerGetBlockNumberNoCheck(&tid);
+ offno = ItemPointerGetOffsetNumberNoCheck(&tid);
+
+ if (i == *next_start_ptr || (prev_blkno == blkno))
+ offnos[i - *next_start_ptr] = offno;
+ else
+ break;
+
+ prev_blkno = blkno;
+ }
+
+ *next_start_ptr = i;
+ return prev_blkno;
+}
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql
new file mode 100644
index 0000000..2ae7f22
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery--1.0.sql
@@ -0,0 +1,18 @@
+/* contrib/pg_surgery/pg_surgery--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit
+
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control
new file mode 100644
index 0000000..2bcdad1
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery.control
@@ -0,0 +1,5 @@
+# pg_surgery extension
+comment = 'extension to perform surgery on a damaged relation'
+default_version = '1.0'
+module_pathname = '$libdir/pg_surgery'
+relocatable = true
diff --git a/contrib/pg_surgery/sql/pg_surgery.sql b/contrib/pg_surgery/sql/pg_surgery.sql
new file mode 100644
index 0000000..4aef380
--- /dev/null
+++ b/contrib/pg_surgery/sql/pg_surgery.sql
@@ -0,0 +1,89 @@
+create extension pg_surgery;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+
+select * from htab where xmin = 2;
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select ctid, xmax from htab where xmin = 2;
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select * from htab where ctid = '(0, 4)';
+rollback;
+
+-- materialized view.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+
+select * from mvw where xmin = 2;
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where xmin = 2;
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where ctid = '(0, 3)';
+rollback;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+
+-- a partitioned table (the parent) does not contain any tuples, so these should fail.
+create table ptab (a int) partition by list (a);
+
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create index ptab_idx on ptab (a);
+
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create view vw as select 1;
+
+-- views are not supported either, so these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create sequence seq;
+
+-- sequences are not supported either, so these should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+
+--
+-- Some negative test-cases with invalid inputs.
+--
+begin;
+create table htab(a int);
+
+-- invalid block number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+insert into htab values (10);
+
+-- invalid offset number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+
+-- dead tuple, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+rollback;
+
+-- cleanup.
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 261a559..c82dde2 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -125,6 +125,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>;
&pgrowlocks;
&pgstatstatements;
&pgstattuple;
+ &pgsurgery;
&pgtrgm;
&pgvisibility;
&postgres-fdw;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 64b5da0..828396d 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -139,6 +139,7 @@
<!ENTITY pgstandby SYSTEM "pgstandby.sgml">
<!ENTITY pgstatstatements SYSTEM "pgstatstatements.sgml">
<!ENTITY pgstattuple SYSTEM "pgstattuple.sgml">
+<!ENTITY pgsurgery SYSTEM "pgsurgery.sgml">
<!ENTITY pgtrgm SYSTEM "pgtrgm.sgml">
<!ENTITY pgvisibility SYSTEM "pgvisibility.sgml">
<!ENTITY postgres-fdw SYSTEM "postgres-fdw.sgml">
diff --git a/doc/src/sgml/pgsurgery.sgml b/doc/src/sgml/pgsurgery.sgml
new file mode 100644
index 0000000..8120c1b
--- /dev/null
+++ b/doc/src/sgml/pgsurgery.sgml
@@ -0,0 +1,136 @@
+<!-- doc/src/sgml/pgsurgery.sgml -->
+
+<sect1 id="pgsurgery" xreflabel="pg_surgery">
+ <title>pg_surgery</title>
+
+ <indexterm zone="pgsurgery">
+ <primary>pg_surgery</primary>
+ </indexterm>
+
+ <para>
+ The <filename>pg_surgery</filename> module provides functions to perform
+ surgery on a damaged relation. These functions are intended for use only
+ on damaged relations, and only on the damaged tuples of such relations.
+ </para>
+
+ <para>
+ By default, only superusers have <literal>EXECUTE</literal> privilege on
+ the functions provided by this module. Once the extension has been installed,
+ users may issue <command>GRANT</command> commands to change the privileges on
+ the functions to allow others to execute them.
+ </para>
+
+ <note>
+ <para>
+ After surgery has been performed on a damaged relation using this module,
+ you must run VACUUM with the DISABLE_PAGE_SKIPPING option on that relation
+ to bring it back to a sane state (see examples in the next section).
+ </note>
+
+ <note>
+ <para>
+ While performing surgery on a damaged relation, we must not be doing anything
+ else on that relation in parallel. This is to ensure that when we are
+ operating on a damaged tuple there is no other transaction trying to modify
+ that tuple.
+ </para>
+ </note>
+
+ <sect2>
+ <title>Functions</title>
+
+ <variablelist>
+ <varlistentry>
+ <term>
+ <function>heap_force_kill(regclass, tid[]) returns void</function>
+ </term>
+
+ <listitem>
+ <para>
+ <function>heap_force_kill</function> marks a <quote>used</quote> line
+ pointer as <quote>dead</quote> without examining the tuple. It is used to
+ force-kill a corrupted tuple in the table that may or may not be
+ accessible. It takes the relation oid and an array of item pointers as
+ input arguments. Here is an example demonstrating when and how this
+ function is used:
+<programlisting>
+test=> select * from t1 where ctid = '(0, 1)';
+ERROR: could not access status of transaction 4007513275
+DETAIL: Could not open file "pg_xact/0EED": No such file or directory.
+
+test=# select heap_force_kill('t1'::regclass, ARRAY['(0, 1)']::tid[]);
+-[ RECORD 1 ]---+-
+heap_force_kill |
+
+test=# vacuum (disable_page_skipping) t1;
+VACUUM
+
+test=# select * from t1 where ctid = '(0, 1)';
+(0 rows)
+</programlisting>
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>
+ <function>heap_force_freeze(regclass, tid[]) returns void</function>
+ </term>
+
+ <listitem>
+ <para>
+ <function>heap_force_freeze</function> marks a live tuple (whose xmin is
+ corrupted) as frozen by setting its xmin to FrozenTransactionId and xmax
+ to InvalidTransactionId. It clears all the visibility-related bits in the
+ tuple's infomask and sets the HEAP_XMIN_FROZEN and HEAP_XMAX_INVALID
+ bits in it. It is used to force-freeze a tuple in the table for
+ which VACUUM fails with "found xmin ... from before relfrozenxid ..."
+ error. Similar to the <function>heap_force_kill</function> function, this
+ also takes the relation oid and an array of item pointers as input arguments.
+ Here is an example demonstrating when and how this function is used:
+<programlisting>
+test=> vacuum t1;
+ERROR: found xmin 507 from before relfrozenxid 515
+CONTEXT: while scanning block 0 of relation "public.t1"
+
+test=# select ctid from t1 where xmin = 507;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
+-[ RECORD 1 ]-----+-
+heap_force_freeze |
+
+test=# vacuum (disable_page_skipping) t1;
+VACUUM
+</programlisting>
+ </para>
+
+ <note>
+ <para>
+ This function should only be used on a tuple for which VACUUM fails
+ with the "found xmin ... from before relfrozenxid ..." error. The user
+ must verify that only the xmin and ctid fields of the tuple are
+ corrupted; if any other fields are corrupted, this function must not be
+ run on that tuple.
+ </para>
+ </note>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect2>
+
+ <sect2>
+ <title>Authors</title>
+
+ <para>
+ Ashutosh Sharma <email>ashu.coek88@gmail.com</email>
+ </para>
+ </sect2>
+
+</sect1>
--
1.8.3.1
On Tue, Aug 18, 2020 at 12:14 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
It makes sense to recommend VACUUM after fixing the page, but I agree
with Sawada-san that it would be sensible to reset the VM bit while
doing surgery, since that's the state that the page would be in. We
should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
but if users fail to do so, then leaving the VM bit set just means that
we know *for certain* that there will be further corruption as soon as
the XID counter advances sufficiently.
+1.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi!
21 авг. 2020 г., в 18:24, Ashutosh Sharma <ashu.coek88@gmail.com> написал(а):
Please find the updated patch with the following new changes:
Do you have plans to support pg_surgery as external extension? For example for earlier versions of Postgres and for new features, like amcheck_next is maintained.
ISTM that I'll have to use something like that tomorrow and I'm in doubt - should I resurrect our pg_dirty_hands or try your new pg_surgey...
Thanks!
Best regards, Andrey Borodin.
On Mon, Aug 24, 2020 at 7:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Aug 18, 2020 at 12:14 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
It makes sense to recommend VACUUM after fixing the page, but I agree
with Sawada-san that it would be sensible to reset the VM bit while
doing surgery, since that's the state that the page would be in. We
should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
but if users fail to do so, then leaving the VM bit set just means that
we know *for certain* that there will be further corruption as soon as
the XID counter advances sufficiently.
+1.
This has been taken care of in the latest v7 patch.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Mon, Aug 24, 2020 at 11:00 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
Hi!
21 авг. 2020 г., в 18:24, Ashutosh Sharma <ashu.coek88@gmail.com> написал(а):
Please find the updated patch with the following new changes:
Do you have plans to support pg_surgery as external extension? For example for earlier versions of Postgres and for new features, like amcheck_next is maintained.
ISTM that I'll have to use something like that tomorrow and I'm in doubt - should I resurrect our pg_dirty_hands or try your new pg_surgey...
AFAICS, we don't have any plans to support pg_surgery as an external
extension as of now. Based on the discussion that has happened earlier
in this thread, I think we might also back-patch this contrib module.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hi Masahiko-san,
Please find the updated patch with the following new changes:
Thank you for updating the patch!
1) It adds the code changes in heap_force_kill function to clear an
all-visible bit on the visibility map corresponding to the page that
is marked all-visible. Along the way it also clears PD_ALL_VISIBLE
flag on the page header.
I think we need to clear all visibility map bits by using
VISIBILITYMAP_VALID_BITS. Otherwise, the page has all-frozen bit but
not all-visible bit, which is not a valid state.
2) It adds the code changes in heap_force_freeze function to reset the
ctid value in a tuple header if it is corrupted.
3) It adds several notes and examples in the documentation stating
when and how we need to use the functions provided by this module.
Please have a look and let me know for any other concern.
And here are small comments on the heap_surgery.c:
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+
+ /*
+ * Update the current start pointer so that next time when
+ * tids_same_page_fetch_offnums() is called, we can calculate the number
+ * of offsets present in the offnos array.
+ */
+ curr_start_ptr = next_start_ptr;
+
+ /* Check whether the block number is valid. */
+ if (blkno >= nblocks)
+ {
+ ereport(NOTICE,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("skipping block %u for relation \"%s\"
because the block number is out of range",
+ blkno, RelationGetRelationName(rel))));
+ continue;
+ }
+
+ CHECK_FOR_INTERRUPTS();
I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top
of the do loop for safety. I think it's unlikely to happen but the
user might mistakenly specify a lot of wrong block numbers.
----
+ offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+ noffs = curr_start_ptr = next_start_ptr = 0;
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ do
+ {
(snip)
+
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+
(snip)
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (i = 0; i < noffs; i++)
You copy all offset numbers belonging to the same page to palloc'd
array, offnos, and iterate it while processing the tuples. I might be
missing something but I think we can do that without allocating the
space for offset numbers. Is there any reason for this? I guess we can
do that by just iterating the sorted tids array.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 25 Aug 2020 at 17:08, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hi Masahiko-san,
Please find the updated patch with the following new changes:
Thank you for updating the patch!
1) It adds the code changes in heap_force_kill function to clear an
all-visible bit on the visibility map corresponding to the page that
is marked all-visible. Along the way it also clears PD_ALL_VISIBLE
flag on the page header.
I think we need to clear all visibility map bits by using
VISIBILITYMAP_VALID_BITS. Otherwise, the page has all-frozen bit but
not all-visible bit, which is not a valid state.
2) It adds the code changes in heap_force_freeze function to reset the
ctid value in a tuple header if it is corrupted.
3) It adds several notes and examples in the documentation stating
when and how we need to use the functions provided by this module.
Please have a look and let me know for any other concern.
And here are small comments on the heap_surgery.c:
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+
+ /*
+ * Update the current start pointer so that next time when
+ * tids_same_page_fetch_offnums() is called, we can calculate the number
+ * of offsets present in the offnos array.
+ */
+ curr_start_ptr = next_start_ptr;
+
+ /* Check whether the block number is valid. */
+ if (blkno >= nblocks)
+ {
+ ereport(NOTICE,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("skipping block %u for relation \"%s\" because the block number is out of range",
+ blkno, RelationGetRelationName(rel))));
+ continue;
+ }
+
+ CHECK_FOR_INTERRUPTS();
I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top
of the do loop for safety. I think it's unlikely to happen but the
user might mistakenly specify a lot of wrong block numbers.
----
+ offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+ noffs = curr_start_ptr = next_start_ptr = 0;
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ do
+ {
(snip)
+
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+
(snip)
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (i = 0; i < noffs; i++)
You copy all offset numbers belonging to the same page to palloc'd
array, offnos, and iterate it while processing the tuples. I might be
missing something but I think we can do that without allocating the
space for offset numbers. Is there any reason for this? I guess we can
do that by just iterating the sorted tids array.
Let me share other comments on the latest version patch:
Some words need to be tagged. For instance, I found the following words:
VACUUM
DISABLE_PAGE_SKIPPING
HEAP_XMIN_FROZEN
HEAP_XMAX_INVALID
---
+test=# select ctid from t1 where xmin = 507;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
+-[ RECORD 1 ]-----+-
+heap_force_freeze |
I think it's better to use a consistent output format. The former uses
the normal format whereas the latter uses the expanded format.
---
+ <note>
+ <para>
+ While performing surgery on a damaged relation, we must not be doing anything
+ else on that relation in parallel. This is to ensure that when we are
+ operating on a damaged tuple there is no other transaction trying to modify
+ that tuple.
+ </para>
+ </note>
If we prefer to avoid concurrent operations on the target relation why
don't we use AccessExclusiveLock?
---
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
I think these functions should be PARALLEL UNSAFE.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Masahiko-san,
Thank you for the review. Please check my comments inline below:
On Tue, Aug 25, 2020 at 1:39 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hi Masahiko-san,
Please find the updated patch with the following new changes:
Thank you for updating the patch!
1) It adds the code changes in heap_force_kill function to clear an
all-visible bit on the visibility map corresponding to the page that
is marked all-visible. Along the way it also clears PD_ALL_VISIBLE
flag on the page header.
I think we need to clear all visibility map bits by using
VISIBILITYMAP_VALID_BITS. Otherwise, the page has all-frozen bit but
not all-visible bit, which is not a valid state.
Yeah, makes sense, I will do that change in the next version of patch.
2) It adds the code changes in heap_force_freeze function to reset the
ctid value in a tuple header if it is corrupted.
3) It adds several notes and examples in the documentation stating
when and how we need to use the functions provided by this module.
Please have a look and let me know for any other concern.
And here are small comments on the heap_surgery.c:
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+
+ /*
+ * Update the current start pointer so that next time when
+ * tids_same_page_fetch_offnums() is called, we can calculate the number
+ * of offsets present in the offnos array.
+ */
+ curr_start_ptr = next_start_ptr;
+
+ /* Check whether the block number is valid. */
+ if (blkno >= nblocks)
+ {
+ ereport(NOTICE,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("skipping block %u for relation \"%s\" because the block number is out of range",
+ blkno, RelationGetRelationName(rel))));
+ continue;
+ }
+
+ CHECK_FOR_INTERRUPTS();
I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top
of the do loop for safety. I think it's unlikely to happen but the
user might mistakenly specify a lot of wrong block numbers.
Okay, np, will shift it to top of the do loop.
----
+ offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+ noffs = curr_start_ptr = next_start_ptr = 0;
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ do
+ {
(snip)
+
+ /*
+ * Get the offset numbers from the tids belonging to one particular page
+ * and process them one by one.
+ */
+ blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+ offnos);
+
+ /* Calculate the number of offsets stored in offnos array. */
+ noffs = next_start_ptr - curr_start_ptr;
+
(snip)
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (i = 0; i < noffs; i++)
You copy all offset numbers belonging to the same page to palloc'd
array, offnos, and iterate it while processing the tuples. I might be
missing something but I think we can do that without allocating the
space for offset numbers. Is there any reason for this? I guess we can
do that by just iterating the sorted tids array.
Hmmm.. okay, I see your point. I think probably what you are trying to
suggest here is to make use of the current and next start pointers to
get the tids belonging to the same page and process them one by one
instead of fetching the offset numbers of all tids belonging to one
page into the offnos array and then iterate through the offnos array.
I think that is probably possible and I will try to do that in the
next version of patch. If there is something else that you have in
your mind, please let me know.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 25, 2020 at 8:17 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
+ <note>
+ <para>
+ While performing surgery on a damaged relation, we must not be doing anything
+ else on that relation in parallel. This is to ensure that when we are
+ operating on a damaged tuple there is no other transaction trying to modify
+ that tuple.
+ </para>
+ </note>
If we prefer to avoid concurrent operations on the target relation why
don't we use AccessExclusiveLock?
I disagree with the content of the note. It's up to the user whether
to perform any concurrent operations on the target relation, and in
many cases it would be fine to do so. Users who can afford to take the
table off-line to repair the problem don't really need this tool in
the first place.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 25, 2020 at 11:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Aug 25, 2020 at 8:17 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
+ <note>
+ <para>
+ While performing surgery on a damaged relation, we must not be doing anything
+ else on that relation in parallel. This is to ensure that when we are
+ operating on a damaged tuple there is no other transaction trying to modify
+ that tuple.
+ </para>
+ </note>
If we prefer to avoid concurrent operations on the target relation why
don't we use AccessExclusiveLock?
I disagree with the content of the note. It's up to the user whether
to perform any concurrent operations on the target relation, and in
many cases it would be fine to do so. Users who can afford to take the
table off-line to repair the problem don't really need this tool in
the first place.
The only reason I added this note was to ensure that we do not revive
the tuple that is deleted but not yet vacuumed. There is one
corner-case scenario as reported by you in [1] where you have
explained a scenario under which vacuum can report "found xmin ...
from before relfrozenxid ..." sort of error for the deleted tuples.
And as per the explanation provided there, it can happen when there
are multiple transactions operating on the same tuple. However, I
think we can take care of this scenario by doing some code changes in
heap_force_freeze to identify the deleted tuples and maybe skip such
tuples. So, yeah, I will do the code changes for handling this and
remove the note added in the documentation. Thank you Robert and
Masahiko-san for pointing this out.
[1]: /messages/by-id/CA+TgmobfJ8CkabKJZ-1FGfvbSz+b8bBX807Y6hHEtVfzVe+g6A@mail.gmail.com
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Thanks for the review. Please find my comments inline below:
On Tue, Aug 25, 2020 at 5:47 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
Let me share other comments on the latest version patch:
Some words need to be tagged. For instance, I found the following words:
VACUUM
DISABLE_PAGE_SKIPPING
HEAP_XMIN_FROZEN
HEAP_XMAX_INVALID
Okay, done.
---
+test=# select ctid from t1 where xmin = 507;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
+-[ RECORD 1 ]-----+-
+heap_force_freeze |
I think it's better to use a consistent output format. The former uses
the normal format whereas the latter uses the expanded format.
Yep, makes sense, done.
---
+ <note>
+ <para>
+ While performing surgery on a damaged relation, we must not be doing anything
+ else on that relation in parallel. This is to ensure that when we are
+ operating on a damaged tuple there is no other transaction trying to modify
+ that tuple.
+ </para>
+ </note>

If we prefer to avoid concurrent operations on the target relation why
don't we use AccessExclusiveLock?
Removed this note from the documentation and added a note saying: "The
user needs to ensure that they do not operate pg_force_freeze function
on a deleted tuple because it may revive the deleted tuple."
---
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;

I think these functions should be PARALLEL UNSAFE.
By default the functions are marked PARALLEL UNSAFE, so I think there
is nothing to do here.
Attached patch with above changes.
This patch also takes care of all the other review comments from [1].
[1]: /messages/by-id/CA+fd4k6+JWq2MfQt5b7fSJ2wMvCes9TRfbDhVO_fQP9B8JJRAA@mail.gmail.com
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v8-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch (text/x-patch)
From a8067f03cf5ea900790a96ea0059ae080a164ed6 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Wed, 26 Aug 2020 15:57:59 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap
table.
This contrib module adds two functions, heap_force_kill and
heap_force_freeze, that can be used in scripts to perform surgery on
damaged heap tables.
Ashutosh Sharma.
---
contrib/Makefile | 1 +
contrib/pg_surgery/Makefile | 23 ++
contrib/pg_surgery/expected/pg_surgery.out | 161 ++++++++++++
contrib/pg_surgery/heap_surgery.c | 404 +++++++++++++++++++++++++++++
contrib/pg_surgery/pg_surgery--1.0.sql | 18 ++
contrib/pg_surgery/pg_surgery.control | 5 +
contrib/pg_surgery/sql/pg_surgery.sql | 89 +++++++
doc/src/sgml/contrib.sgml | 1 +
doc/src/sgml/filelist.sgml | 1 +
doc/src/sgml/pgsurgery.sgml | 144 ++++++++++
10 files changed, 847 insertions(+)
create mode 100644 contrib/pg_surgery/Makefile
create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
create mode 100644 contrib/pg_surgery/heap_surgery.c
create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
create mode 100644 contrib/pg_surgery/pg_surgery.control
create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
create mode 100644 doc/src/sgml/pgsurgery.sgml
diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..c8d2a16 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
pg_prewarm \
pg_standby \
pg_stat_statements \
+ pg_surgery \
pg_trgm \
pgcrypto \
pgrowlocks \
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..ecf2e20
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+ $(WIN32RES) \
+ heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out
new file mode 100644
index 0000000..9858de2
--- /dev/null
+++ b/contrib/pg_surgery/expected/pg_surgery.out
@@ -0,0 +1,161 @@
+create extension pg_surgery;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+select * from htab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select ctid, xmax from htab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,4) | 0
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from htab where ctid = '(0, 4)';
+ a
+---
+(0 rows)
+
+rollback;
+-- materialized view.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+select * from mvw where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select * from mvw where xmin = 2;
+ a
+---
+ 3
+(1 row)
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from mvw where ctid = '(0, 3)';
+ a
+---
+(0 rows)
+
+rollback;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+create index ptab_idx on ptab (a);
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+create view vw as select 1;
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: only the relation using heap_tableam_handler is supported
+create sequence seq;
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "seq" is not a table, materialized view, or TOAST table
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "seq" is not a table, materialized view, or TOAST table
+--
+-- Some negative test-cases with invalid inputs.
+--
+begin;
+create table htab(a int);
+-- invalid block number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping block 0 for relation "htab" because the block number is out of range
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping block 0 for relation "htab" because the block number is out of range
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+insert into htab values (10);
+-- invalid offset number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE: skipping tid (0, 2) for relation "htab" because the item number is out of range for this block
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE: skipping tid (0, 2) for relation "htab" because the item number is out of range for this block
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+-- dead tuple, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping tid (0, 1) for relation "htab" because it is marked dead
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping tid (0, 1) for relation "htab" because it is marked dead
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+rollback;
+-- cleanup.
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
new file mode 100644
index 0000000..9c87593
--- /dev/null
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -0,0 +1,404 @@
+/*-------------------------------------------------------------------------
+ *
+ * heap_surgery.c
+ * Functions to perform surgery on the damaged heap table.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pg_surgery/heap_surgery.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/visibilitymap.h"
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_proc_d.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+
+PG_MODULE_MAGIC;
+
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HeapTupleForceOption
+{
+ HEAP_FORCE_KILL,
+ HEAP_FORCE_FREEZE
+} HeapTupleForceOption;
+
+PG_FUNCTION_INFO_V1(heap_force_kill);
+PG_FUNCTION_INFO_V1(heap_force_freeze);
+
+static int32 tidcmp(const void *a, const void *b);
+static Datum heap_force_common(FunctionCallInfo fcinfo,
+ HeapTupleForceOption heap_force_opt);
+static void sanity_check_tid_array(ArrayType *ta, int *ntids);
+static void sanity_check_relation(Relation rel);
+static BlockNumber find_tids_one_page(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr);
+
+/*-------------------------------------------------------------------------
+ * heap_force_kill()
+ *
+ * Force kill the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_kill(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_kill(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_KILL));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_freeze()
+ *
+ * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_freeze(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_freeze(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_FREEZE));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_common()
+ *
+ * Common code for heap_force_kill and heap_force_freeze
+ *-------------------------------------------------------------------------
+ */
+static Datum
+heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids,
+ nblocks;
+ Relation rel;
+ OffsetNumber curr_start_ptr,
+ next_start_ptr;
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("heap surgery functions cannot be executed during recovery.")));
+
+ /* Basic sanity checking. */
+ sanity_check_tid_array(ta, &ntids);
+
+ rel = relation_open(relid, RowExclusiveLock);
+
+ sanity_check_relation(rel);
+
+ tids = ((ItemPointer) ARR_DATA_PTR(ta));
+
+ /*
+ * If there is more than one tid in the array, sort it so that we can
+ * easily fetch all the tids belonging to one particular page from the
+ * array.
+ */
+ if (ntids > 1)
+ qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+
+ curr_start_ptr = next_start_ptr = 0;
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ do
+ {
+ Buffer buf;
+ Buffer vmbuf = InvalidBuffer;
+ Page page;
+ BlockNumber blkno;
+ OffsetNumber maxoffset;
+ int i;
+ bool did_modify_page = false;
+ bool did_modify_vm = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Find all the tids belonging to one particular page starting from
+ * next_start_ptr and process them one by one.
+ */
+ blkno = find_tids_one_page(tids, ntids, &next_start_ptr);
+
+ /* Check whether the block number is valid. */
+ if (blkno >= nblocks)
+ {
+ /* Update the current_start_ptr before moving to the next page. */
+ curr_start_ptr = next_start_ptr;
+
+ ereport(NOTICE,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("skipping block %u for relation \"%s\" because the block number is out of range",
+ blkno, RelationGetRelationName(rel))));
+ continue;
+ }
+
+ buf = ReadBuffer(rel, blkno);
+ LockBufferForCleanup(buf);
+
+ page = BufferGetPage(buf);
+
+ maxoffset = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Before entering the critical section, pin the visibility map page if
+ * it appears to be necessary.
+ */
+ if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
+ visibilitymap_pin(rel, blkno, &vmbuf);
+
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (i = curr_start_ptr; i < next_start_ptr; i++)
+ {
+ OffsetNumber offno = ItemPointerGetOffsetNumberNoCheck(&tids[i]);
+ ItemId itemid;
+
+ /* Check whether the offset number is valid. */
+ if (offno == 0 || offno > maxoffset)
+ {
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) for relation \"%s\" because the item number is out of range for this block",
+ blkno, offno, RelationGetRelationName(rel)));
+ continue;
+ }
+
+ itemid = PageGetItemId(page, offno);
+
+ /* Follow any redirections until we find something useful. */
+ while (ItemIdIsRedirected(itemid))
+ {
+ offno = ItemIdGetRedirect(itemid);
+ itemid = PageGetItemId(page, offno);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /* Nothing to do if the itemid is unused or already dead. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+ {
+ if (!ItemIdIsUsed(itemid))
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked unused",
+ blkno, offno, RelationGetRelationName(rel))));
+ else
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked dead",
+ blkno, offno, RelationGetRelationName(rel))));
+ continue;
+ }
+
+ Assert(ItemIdIsNormal(itemid));
+
+ did_modify_page = true;
+
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ {
+
+ ItemIdSetDead(itemid);
+
+ /*
+ * If the page is marked all-visible, we must clear the
+ * PD_ALL_VISIBLE flag on the page header and the all-visible bit
+ * in the visibility map corresponding to the page.
+ */
+ if (PageIsAllVisible(page))
+ {
+ PageClearAllVisible(page);
+ visibilitymap_clear(rel, blkno, vmbuf,
+ VISIBILITYMAP_VALID_BITS);
+ did_modify_vm = true;
+ }
+ }
+ else
+ {
+ HeapTupleHeader htup;
+ ItemPointerData ctid;
+
+ Assert(heap_force_opt == HEAP_FORCE_FREEZE);
+
+ ItemPointerSet(&ctid, blkno, offno);
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ /*
+ * Make sure that this tuple holds the correct item pointer
+ * value.
+ */
+ if (!ItemPointerEquals(&ctid, &htup->t_ctid))
+ ItemPointerSet(&htup->t_ctid, blkno, offno);
+
+ HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
+ HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
+
+ /* We might have MOVED_OFF/MOVED_IN tuples in the database */
+ if (htup->t_infomask & HEAP_MOVED)
+ {
+ if (htup->t_infomask & HEAP_MOVED_OFF)
+ HeapTupleHeaderSetXvac(htup, InvalidTransactionId);
+ else
+ HeapTupleHeaderSetXvac(htup, FrozenTransactionId);
+ }
+
+ /*
+ * Clear all the visibility-related bits of this tuple and mark
+ * it as frozen. Also, get rid of the HOT_UPDATED and KEYS_UPDATED
+ * bits.
+ */
+ htup->t_infomask &= ~HEAP_XACT_MASK;
+ htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID);
+ htup->t_infomask2 &= ~HEAP_HOT_UPDATED;
+ htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+ }
+ }
+
+ /*
+ * Mark the buffer dirty and WAL-log the changes only if the page was
+ * actually modified.
+ */
+ if (did_modify_page)
+ {
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+ }
+
+ /* WAL log the VM page if it was modified. */
+ if (did_modify_vm && RelationNeedsWAL(rel))
+ log_newpage_buffer(vmbuf, false);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+
+ if (vmbuf != InvalidBuffer)
+ ReleaseBuffer(vmbuf);
+
+ /* Update the current_start_ptr before moving to the next page. */
+ curr_start_ptr = next_start_ptr;
+ } while (next_start_ptr != ntids);
+
+ relation_close(rel, RowExclusiveLock);
+
+ pfree(ta);
+
+ PG_RETURN_VOID();
+}
+
+/*-------------------------------------------------------------------------
+ * tidcmp()
+ *
+ * Compare two item pointers, return -1, 0, or +1.
+ *
+ * See ItemPointerCompare for details.
+ * ------------------------------------------------------------------------
+ */
+static int32
+tidcmp(const void *a, const void *b)
+{
+ ItemPointer iptr1 = ((const ItemPointer) a);
+ ItemPointer iptr2 = ((const ItemPointer) b);
+
+ return ItemPointerCompare(iptr1, iptr2);
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_tid_array()
+ *
+ * Perform sanity check on the given tid array.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_tid_array(ArrayType *ta, int *ntids)
+{
+ if (ARR_HASNULL(ta) && array_contains_nulls(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ errmsg("array must not contain nulls")));
+
+ *ntids = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta));
+
+ if (*ntids == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_relation()
+ *
+ * Perform sanity check on the given relation.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_relation(Relation rel)
+{
+ if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only the relation using heap_tableam_handler is supported")));
+
+ if (rel->rd_rel->relkind != RELKIND_RELATION &&
+ rel->rd_rel->relkind != RELKIND_MATVIEW &&
+ rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table, materialized view, or TOAST table",
+ RelationGetRelationName(rel))));
+
+ /* Must be owner of the table or superuser. */
+ if (!pg_class_ownercheck(RelationGetRelid(rel), GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER,
+ get_relkind_objtype(rel->rd_rel->relkind),
+ RelationGetRelationName(rel));
+}
+
+/*-------------------------------------------------------------------------
+ * find_tids_one_page()
+ *
+ * Find all the tids residing in the same page as tids[next_start_ptr] and
+ * update next_start_ptr so that it points to the first tid in the next page.
+ * ------------------------------------------------------------------------
+ */
+static BlockNumber
+find_tids_one_page(ItemPointer tids, int ntids, OffsetNumber *next_start_ptr)
+{
+ int i;
+ BlockNumber prev_blkno,
+ blkno;
+
+ prev_blkno = blkno = InvalidBlockNumber;
+
+ for (i = *next_start_ptr; i < ntids; i++)
+ {
+ ItemPointerData tid = tids[i];
+
+ blkno = ItemPointerGetBlockNumberNoCheck(&tid);
+
+ if (i == *next_start_ptr)
+ prev_blkno = blkno;
+
+ if (prev_blkno != blkno)
+ break;
+ }
+
+ *next_start_ptr = i;
+ return prev_blkno;
+}
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql
new file mode 100644
index 0000000..2ae7f22
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery--1.0.sql
@@ -0,0 +1,18 @@
+/* contrib/pg_surgery/pg_surgery--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit
+
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control
new file mode 100644
index 0000000..2bcdad1
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery.control
@@ -0,0 +1,5 @@
+# pg_surgery extension
+comment = 'extension to perform surgery on a damaged relation'
+default_version = '1.0'
+module_pathname = '$libdir/pg_surgery'
+relocatable = true
diff --git a/contrib/pg_surgery/sql/pg_surgery.sql b/contrib/pg_surgery/sql/pg_surgery.sql
new file mode 100644
index 0000000..4aef380
--- /dev/null
+++ b/contrib/pg_surgery/sql/pg_surgery.sql
@@ -0,0 +1,89 @@
+create extension pg_surgery;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+
+select * from htab where xmin = 2;
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select ctid, xmax from htab where xmin = 2;
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select * from htab where ctid = '(0, 4)';
+rollback;
+
+-- materialized view.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+
+select * from mvw where xmin = 2;
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where xmin = 2;
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where ctid = '(0, 3)';
+rollback;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create index ptab_idx on ptab (a);
+
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create view vw as select 1;
+
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create sequence seq;
+
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+
+--
+-- Some negative test-cases with invalid inputs.
+--
+begin;
+create table htab(a int);
+
+-- invalid block number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+insert into htab values (10);
+
+-- invalid offset number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+
+-- dead tuple, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+rollback;
+
+-- cleanup.
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 261a559..c82dde2 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -125,6 +125,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>;
&pgrowlocks;
&pgstatstatements;
&pgstattuple;
+ &pgsurgery;
&pgtrgm;
&pgvisibility;
&postgres-fdw;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 64b5da0..828396d 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -139,6 +139,7 @@
<!ENTITY pgstandby SYSTEM "pgstandby.sgml">
<!ENTITY pgstatstatements SYSTEM "pgstatstatements.sgml">
<!ENTITY pgstattuple SYSTEM "pgstattuple.sgml">
+<!ENTITY pgsurgery SYSTEM "pgsurgery.sgml">
<!ENTITY pgtrgm SYSTEM "pgtrgm.sgml">
<!ENTITY pgvisibility SYSTEM "pgvisibility.sgml">
<!ENTITY postgres-fdw SYSTEM "postgres-fdw.sgml">
diff --git a/doc/src/sgml/pgsurgery.sgml b/doc/src/sgml/pgsurgery.sgml
new file mode 100644
index 0000000..e410167
--- /dev/null
+++ b/doc/src/sgml/pgsurgery.sgml
@@ -0,0 +1,144 @@
+<!-- doc/src/sgml/pgsurgery.sgml -->
+
+<sect1 id="pgsurgery" xreflabel="pg_surgery">
+ <title>pg_surgery</title>
+
+ <indexterm zone="pgsurgery">
+ <primary>pg_surgery</primary>
+ </indexterm>
+
+ <para>
+ The <filename>pg_surgery</filename> module provides functions to perform
+ surgery on a damaged relation. These functions are intended to be used
+ only on a damaged relation, and only on its damaged tuples.
+ </para>
+
+ <para>
+ By default, only superusers have <literal>EXECUTE</literal> privilege on
+ the functions provided by this module. Once the extension has been installed,
+ users may issue <command>GRANT</command> commands to change the privileges on
+ the functions to allow others to execute them.
+ </para>
+
+ <note>
+ <para>
+ After surgery has been performed on a damaged relation using this module, you
+ must run <command>VACUUM</command> with the
+ <option>DISABLE_PAGE_SKIPPING</option> option on that relation to bring it
+ back into a sane state (see the examples in the next section).
+ </para>
+ </note>
+
+ <sect2>
+ <title>Functions</title>
+
+ <variablelist>
+ <varlistentry>
+ <term>
+ <function>heap_force_kill(regclass, tid[]) returns void</function>
+ </term>
+
+ <listitem>
+ <para>
+ <function>heap_force_kill</function> marks a <quote>used</quote> line
+ pointer as <quote>dead</quote> without examining the tuple. It is used to
+ force-kill a corrupted tuple in the table that may or may not be
+ accessible. It takes the relation OID and an array of item pointers as
+ input arguments. Here is an example demonstrating when and how this
+ function is used:
+<programlisting>
+test=> select * from t1 where ctid = '(0, 1)';
+ERROR: could not access status of transaction 4007513275
+DETAIL: Could not open file "pg_xact/0EED": No such file or directory.
+
+test=# select heap_force_kill('t1'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+test=# select * from t1 where ctid = '(0, 1)';
+(0 rows)
+
+test=# vacuum (disable_page_skipping) t1;
+VACUUM
+</programlisting>
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>
+ <function>heap_force_freeze(regclass, tid[]) returns void</function>
+ </term>
+
+ <listitem>
+ <para>
+ <function>heap_force_freeze</function> marks a live tuple (whose
+ <literal>xmin</literal> is corrupted) as <quote>frozen</quote> by setting
+ its <literal>xmin</literal> to <literal>FrozenTransactionId</literal> and
+ <literal>xmax</literal> to <literal>InvalidTransactionId</literal>. It
+ clears all the visibility-related bits in the infomask of the tuple and
+ sets the <literal>HEAP_XMIN_FROZEN</literal> and
+ <literal>HEAP_XMAX_INVALID</literal> flags in it. It is used to
+ force-freeze a tuple in the table for which <command>VACUUM</command>
+ fails with a <quote>found xmin ... from before relfrozenxid ...</quote>
+ error. Like <function>heap_force_kill</function>, this function also takes
+ the relation OID and an array of item pointers as input arguments. Here is
+ an example demonstrating when and how this function is used:
+<programlisting>
+test=> vacuum t1;
+ERROR: found xmin 507 from before relfrozenxid 515
+CONTEXT: while scanning block 0 of relation "public.t1"
+
+test=# select ctid from t1 where xmin = 507;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+test=# select ctid from t1 where xmin = 2;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+test=# vacuum (disable_page_skipping) t1;
+VACUUM
+</programlisting>
+ </para>
+
+ <note>
+ <para>
+ This function should only be used on a tuple for which
+ <command>VACUUM</command> fails with a
+ <quote>found xmin ... from before relfrozenxid ...</quote> error. The
+ user must verify that only the <literal>xmin</literal> and
+ <literal>ctid</literal> fields of the tuple are corrupted; if any other
+ fields are corrupted, the user must avoid running this function on such
+ tuples. The user also needs to ensure that this function is not used on a
+ deleted tuple, because it may revive the deleted tuple.
+ </para>
+ </note>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect2>
+
+ <sect2>
+ <title>Authors</title>
+
+ <para>
+ Ashutosh Sharma <email>ashu.coek88@gmail.com</email>
+ </para>
+ </sect2>
+
+</sect1>
--
1.8.3.1
On Wed, Aug 26, 2020 at 7:36 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Removed this note from the documentation and added a note saying: "The
user needs to ensure that they do not operate pg_force_freeze function
on a deleted tuple because it may revive the deleted tuple."
I do not agree with that note, either. I believe that trying to tell
people what things specifically they should do or avoid doing with the
tool is the wrong approach. Instead, the thrust of the message should
be to tell people that if you use this, it may corrupt your database,
and that's your problem. The difficulty with telling people what
specifically they ought to avoid doing is that experts will be annoyed
to be told that something is not safe when they know that it is fine,
and non-experts will think that some uses are safer than they really
are.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 26, 2020 at 9:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 26, 2020 at 7:36 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Removed this note from the documentation and added a note saying: "The
user needs to ensure that they do not operate pg_force_freeze function
on a deleted tuple because it may revive the deleted tuple."

I do not agree with that note, either. I believe that trying to tell
people what things specifically they should do or avoid doing with the
tool is the wrong approach. Instead, the thrust of the message should
be to tell people that if you use this, it may corrupt your database,
and that's your problem. The difficulty with telling people what
specifically they ought to avoid doing is that experts will be annoyed
to be told that something is not safe when they know that it is fine,
and non-experts will think that some uses are safer than they really
are.
Okay, point noted.
Thanks,
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Okay, point noted.
I spent some time today working on this patch. I'm fairly happy with
it now and intend to commit it if nobody sees a big problem with that.
Per discussion, I do not intend to back-patch at this time. The two
most significant changes I made to your version are:
1. I changed things around to avoid using any form of ereport() in a
critical section. I'm not actually sure whether it is project policy
to avoid ereport(NOTICE, ...) or similar in a critical section, but it
seems prudent: anything that fails in a critical section causes a
PANIC, so the less we do there, the better.
2. I changed the code so that it does not try to follow redirected
line pointers; instead, it skips them with an appropriate message, as
we were already doing for dead and unused line pointers. I think the
way you had it coded might've been my suggestion originally, but the
more I looked into it the less I liked it. One problem is that it
didn't match the docs. A second is that following a corrupted line
pointer might index off the end of the line pointer array, and while
that probably shouldn't happen, we are talking about corruption
recovery here. Then I realized that, as you coded it, if the line
pointer was redirected to a line pointer that is in turn dead (or
unused, if there's corruption) the user would get a NOTICE complaining
about a TID they hadn't specified, which seems like it would be very
confusing. I thought about trying to fix all that stuff, but it just
didn't seem worth it, because I can't think of a good reason to pass
this function the TID of a redirected line pointer in the first place.
If you're doing surgery, you should probably specify the exact thing
upon which you want to operate, not some other thing that points to
it.
Here is a list of other changes I made:
* Added a .gitignore file.
* Renamed the regression test file from pg_surgery to heap_surgery to
match the name of the single C source file we currently have.
* Capitalized TID in a few places.
* Ran pgindent.
* Adjusted various comments.
* Removed the check for an empty TID array. I don't see any reason why
this should be an error case and I don't see much precedent for having
such a check.
* Fixed the code to work properly with that change and added a test case.
* Added a check that the array is not multi-dimensional.
* Put the AM type check before the relkind check, following existing precedent.
* Adjusted the check to use the AM OID rather than the handler OID,
following existing precedent. Fixed the message wording accordingly.
* Changed the documentation wording to say less about specific
recovery procedures and focus more on the general idea that this is
dangerous.
* Removed all but one of the test cases that checked what happens if
you use this on a non-heap; three tests for basically the same thing
seemed excessive.
* Added some additional tests to improve code coverage. There are now
only a handful of lines not covered.
* Reorganized the test cases somewhat.
New patch attached.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
v9-0001-pg_surgery-rmh-based-on-ashutosh-sharma-v8.patch (application/octet-stream)
From b0de48fbe9bff15f5b71d6f34899bacdabc440ee Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 27 Aug 2020 13:57:18 -0400
Subject: [PATCH v9] pg_surgery rmh based on ashutosh sharma v8
---
contrib/Makefile | 1 +
contrib/pg_surgery/.gitignore | 4 +
contrib/pg_surgery/Makefile | 23 +
contrib/pg_surgery/expected/heap_surgery.out | 180 ++++++++
contrib/pg_surgery/heap_surgery.c | 426 +++++++++++++++++++
contrib/pg_surgery/pg_surgery--1.0.sql | 18 +
contrib/pg_surgery/pg_surgery.control | 5 +
contrib/pg_surgery/sql/heap_surgery.sql | 91 ++++
doc/src/sgml/contrib.sgml | 1 +
doc/src/sgml/filelist.sgml | 1 +
doc/src/sgml/pgsurgery.sgml | 107 +++++
src/tools/pgindent/typedefs.list | 1 +
12 files changed, 858 insertions(+)
create mode 100644 contrib/pg_surgery/.gitignore
create mode 100644 contrib/pg_surgery/Makefile
create mode 100644 contrib/pg_surgery/expected/heap_surgery.out
create mode 100644 contrib/pg_surgery/heap_surgery.c
create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
create mode 100644 contrib/pg_surgery/pg_surgery.control
create mode 100644 contrib/pg_surgery/sql/heap_surgery.sql
create mode 100644 doc/src/sgml/pgsurgery.sgml
diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d415b6..c8d2a16273 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
pg_prewarm \
pg_standby \
pg_stat_statements \
+ pg_surgery \
pg_trgm \
pgcrypto \
pgrowlocks \
diff --git a/contrib/pg_surgery/.gitignore b/contrib/pg_surgery/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/contrib/pg_surgery/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000000..a66776c4c4
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+ $(WIN32RES) \
+ heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = heap_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
new file mode 100644
index 0000000000..9451c5747b
--- /dev/null
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -0,0 +1,180 @@
+create extension pg_surgery;
+-- create a normal heap table and insert some rows.
+-- note that we don't commit the transaction, so autovacuum can't interfere.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+-- test empty TID array
+select heap_force_freeze('htab'::regclass, ARRAY[]::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+-- nothing should be frozen yet
+select * from htab where xmin = 2;
+ a
+---
+(0 rows)
+
+-- freeze forcibly
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+-- now we should have one frozen tuple
+select ctid, xmax from htab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,4) | 0
+(1 row)
+
+-- kill forcibly
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+-- should be gone now
+select * from htab where ctid = '(0, 4)';
+ a
+---
+(0 rows)
+
+-- should now be skipped because it's already dead
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+NOTICE: skipping tid (0, 4) for relation "htab" because it is marked dead
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+NOTICE: skipping tid (0, 4) for relation "htab" because it is marked dead
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+-- freeze two TIDs at once while skipping an out-of-range block number
+select heap_force_freeze('htab'::regclass,
+ ARRAY['(0, 1)', '(0, 3)', '(1, 1)']::tid[]);
+NOTICE: skipping block 1 for relation "htab" because the block number is out of range
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+-- we should now have two frozen tuples
+select ctid, xmax from htab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,1) | 0
+ (0,3) | 0
+(2 rows)
+
+-- out-of-range TIDs should be skipped
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 0)', '(0, 6)']::tid[]);
+NOTICE: skipping tid (0, 0) for relation "htab" because the item number is out of range
+NOTICE: skipping tid (0, 6) for relation "htab" because the item number is out of range
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+rollback;
+-- set up a new table with a redirected line pointer
+create table htab2(a int) with (autovacuum_enabled = off);
+insert into htab2 values (100);
+update htab2 set a = 200;
+vacuum htab2;
+-- redirected TIDs should be skipped
+select heap_force_kill('htab2'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping tid (0, 1) for relation "htab2" because it redirects to item 2
+ heap_force_kill
+-----------------
+
+(1 row)
+
+-- now create an unused line pointer
+select ctid from htab2;
+ ctid
+-------
+ (0,2)
+(1 row)
+
+update htab2 set a = 300;
+select ctid from htab2;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+vacuum freeze htab2;
+-- unused TIDs should be skipped
+select heap_force_kill('htab2'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE: skipping tid (0, 2) for relation "htab2" because it is marked unused
+ heap_force_kill
+-----------------
+
+(1 row)
+
+-- multidimensional TID array should be rejected
+select heap_force_kill('htab2'::regclass, ARRAY[['(0, 2)']]::tid[]);
+ERROR: argument must be empty or one-dimensional array
+-- TID array with nulls should be rejected
+select heap_force_kill('htab2'::regclass, ARRAY[NULL]::tid[]);
+ERROR: array must not contain nulls
+-- but we should be able to kill the one tuple we have
+select heap_force_kill('htab2'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+-- materialized view.
+-- note that we don't commit the transaction, so autovacuum can't interfere.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+select * from mvw where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select * from mvw where xmin = 2;
+ a
+---
+ 3
+(1 row)
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from mvw where ctid = '(0, 3)';
+ a
+---
+(0 rows)
+
+rollback;
+-- check that it fails on an unsupported relkind
+create view vw as select 1;
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "vw" is not a table, materialized view, or TOAST table
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "vw" is not a table, materialized view, or TOAST table
+-- cleanup.
+drop table htab2;
+drop view vw;
+drop extension pg_surgery;
diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
new file mode 100644
index 0000000000..0d27dab897
--- /dev/null
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -0,0 +1,426 @@
+/*-------------------------------------------------------------------------
+ *
+ * heap_surgery.c
+ * Functions to perform surgery on a damaged heap table.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pg_surgery/heap_surgery.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/visibilitymap.h"
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_proc_d.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+
+PG_MODULE_MAGIC;
+
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HeapTupleForceOption
+{
+ HEAP_FORCE_KILL,
+ HEAP_FORCE_FREEZE
+} HeapTupleForceOption;
+
+PG_FUNCTION_INFO_V1(heap_force_kill);
+PG_FUNCTION_INFO_V1(heap_force_freeze);
+
+static int32 tidcmp(const void *a, const void *b);
+static Datum heap_force_common(FunctionCallInfo fcinfo,
+ HeapTupleForceOption heap_force_opt);
+static void sanity_check_tid_array(ArrayType *ta, int *ntids);
+static void sanity_check_relation(Relation rel);
+static BlockNumber find_tids_one_page(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr);
+
+/*-------------------------------------------------------------------------
+ * heap_force_kill()
+ *
+ * Force kill the tuple(s) pointed to by the item pointer(s) stored in the
+ * given TID array.
+ *
+ * Usage: SELECT heap_force_kill(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_kill(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_KILL));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_freeze()
+ *
+ * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the
+ * given TID array.
+ *
+ * Usage: SELECT heap_force_freeze(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_freeze(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_FREEZE));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_common()
+ *
+ * Common code for heap_force_kill and heap_force_freeze
+ *-------------------------------------------------------------------------
+ */
+static Datum
+heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids,
+ nblocks;
+ Relation rel;
+ OffsetNumber curr_start_ptr,
+ next_start_ptr;
+ bool include_this_tid[MaxHeapTuplesPerPage];
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("heap surgery functions cannot be executed during recovery.")));
+
+ /* Check inputs. */
+ sanity_check_tid_array(ta, &ntids);
+
+ rel = relation_open(relid, RowExclusiveLock);
+
+ /* Check target relation. */
+ sanity_check_relation(rel);
+
+ tids = ((ItemPointer) ARR_DATA_PTR(ta));
+
+ /*
+ * If there is more than one TID in the array, sort them so that we can
+ * easily fetch all the TIDs belonging to one particular page from the
+ * array.
+ */
+ if (ntids > 1)
+ qsort((void *) tids, ntids, sizeof(ItemPointerData), tidcmp);
+
+ curr_start_ptr = next_start_ptr = 0;
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /*
+ * Loop, performing the necessary actions for each block.
+ */
+ while (next_start_ptr != ntids)
+ {
+ Buffer buf;
+ Buffer vmbuf = InvalidBuffer;
+ Page page;
+ BlockNumber blkno;
+ OffsetNumber curoff;
+ OffsetNumber maxoffset;
+ int i;
+ bool did_modify_page = false;
+ bool did_modify_vm = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Find all the TIDs belonging to one particular page starting from
+ * next_start_ptr and process them one by one.
+ */
+ blkno = find_tids_one_page(tids, ntids, &next_start_ptr);
+
+ /* Check whether the block number is valid. */
+ if (blkno >= nblocks)
+ {
+ /* Update the current_start_ptr before moving to the next page. */
+ curr_start_ptr = next_start_ptr;
+
+ ereport(NOTICE,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("skipping block %u for relation \"%s\" because the block number is out of range",
+ blkno, RelationGetRelationName(rel))));
+ continue;
+ }
+
+ buf = ReadBuffer(rel, blkno);
+ LockBufferForCleanup(buf);
+
+ page = BufferGetPage(buf);
+
+ maxoffset = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Figure out which TIDs we are going to process and which ones we are
+ * going to skip.
+ */
+ memset(include_this_tid, 0, sizeof(include_this_tid));
+ for (i = curr_start_ptr; i < next_start_ptr; i++)
+ {
+ OffsetNumber offno = ItemPointerGetOffsetNumberNoCheck(&tids[i]);
+ ItemId itemid;
+
+ /* Check whether the offset number is valid. */
+ if (offno == InvalidOffsetNumber || offno > maxoffset)
+ {
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) for relation \"%s\" because the item number is out of range",
+ blkno, offno, RelationGetRelationName(rel)));
+ continue;
+ }
+
+ itemid = PageGetItemId(page, offno);
+
+ /* Only accept an item ID that is used. */
+ if (ItemIdIsRedirected(itemid))
+ {
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) for relation \"%s\" because it redirects to item %u",
+ blkno, offno, RelationGetRelationName(rel),
+ ItemIdGetRedirect(itemid)));
+ continue;
+ }
+ else if (ItemIdIsDead(itemid))
+ {
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked dead",
+ blkno, offno, RelationGetRelationName(rel))));
+ continue;
+ }
+ else if (!ItemIdIsUsed(itemid))
+ {
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked unused",
+ blkno, offno, RelationGetRelationName(rel))));
+ continue;
+ }
+
+ /* Mark it for processing. */
+ Assert(offno < MaxHeapTuplesPerPage);
+ include_this_tid[offno] = true;
+ }
+
+ /*
+ * Before entering the critical section, pin the visibility map page
+ * if it appears to be necessary.
+ */
+ if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
+ visibilitymap_pin(rel, blkno, &vmbuf);
+
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (curoff = FirstOffsetNumber; curoff <= maxoffset;
+ curoff = OffsetNumberNext(curoff))
+ {
+ ItemId itemid;
+
+ if (!include_this_tid[curoff])
+ continue;
+
+ itemid = PageGetItemId(page, curoff);
+ Assert(ItemIdIsNormal(itemid));
+
+ did_modify_page = true;
+
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ {
+ ItemIdSetDead(itemid);
+
+ /*
+ * If the page is marked all-visible, we must clear the
+ * PD_ALL_VISIBLE flag on the page header and the all-visible
+ * bit in the visibility map corresponding to the page.
+ */
+ if (PageIsAllVisible(page))
+ {
+ PageClearAllVisible(page);
+ visibilitymap_clear(rel, blkno, vmbuf,
+ VISIBILITYMAP_VALID_BITS);
+ did_modify_vm = true;
+ }
+ }
+ else
+ {
+ HeapTupleHeader htup;
+
+ Assert(heap_force_opt == HEAP_FORCE_FREEZE);
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ /*
+ * Reset all visibility-related fields of the tuple. This
+ * logic should mimic heap_execute_freeze_tuple(), but we
+ * choose to reset xmin and ctid just to be sure that no
+ * potentially-garbled data is left behind.
+ */
+ ItemPointerSet(&htup->t_ctid, blkno, curoff);
+ HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
+ HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
+ if (htup->t_infomask & HEAP_MOVED)
+ {
+ if (htup->t_infomask & HEAP_MOVED_OFF)
+ HeapTupleHeaderSetXvac(htup, InvalidTransactionId);
+ else
+ HeapTupleHeaderSetXvac(htup, FrozenTransactionId);
+ }
+
+ /*
+ * Clear all the visibility-related bits of this tuple and
+ * mark it as frozen. Also, get rid of HOT_UPDATED and
+ * KEYS_UPDATED bits.
+ */
+ htup->t_infomask &= ~HEAP_XACT_MASK;
+ htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID);
+ htup->t_infomask2 &= ~HEAP_HOT_UPDATED;
+ htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+ }
+ }
+
+ /*
+ * Only if the page was modified do we mark the buffer dirty and
+ * write WAL.
+ */
+ if (did_modify_page)
+ {
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+ }
+
+ /* WAL log the VM page if it was modified. */
+ if (did_modify_vm && RelationNeedsWAL(rel))
+ log_newpage_buffer(vmbuf, false);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+
+ if (vmbuf != InvalidBuffer)
+ ReleaseBuffer(vmbuf);
+
+ /* Update the current_start_ptr before moving to the next page. */
+ curr_start_ptr = next_start_ptr;
+ }
+
+ relation_close(rel, RowExclusiveLock);
+
+ pfree(ta);
+
+ PG_RETURN_VOID();
+}
+
+/*-------------------------------------------------------------------------
+ * tidcmp()
+ *
+ * Compare two item pointers, return -1, 0, or +1.
+ *
+ * See ItemPointerCompare for details.
+ * ------------------------------------------------------------------------
+ */
+static int32
+tidcmp(const void *a, const void *b)
+{
+ ItemPointer iptr1 = ((const ItemPointer) a);
+ ItemPointer iptr2 = ((const ItemPointer) b);
+
+ return ItemPointerCompare(iptr1, iptr2);
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_tid_array()
+ *
+ * Perform sanity checks on the given tid array, and set *ntids to the
+ * number of items in the array.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_tid_array(ArrayType *ta, int *ntids)
+{
+ if (ARR_HASNULL(ta) && array_contains_nulls(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ errmsg("array must not contain nulls")));
+
+ if (ARR_NDIM(ta) > 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_EXCEPTION),
+ errmsg("argument must be empty or one-dimensional array")));
+
+ *ntids = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta));
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_relation()
+ *
+ * Perform sanity checks on the given relation.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_relation(Relation rel)
+{
+ if (rel->rd_rel->relkind != RELKIND_RELATION &&
+ rel->rd_rel->relkind != RELKIND_MATVIEW &&
+ rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table, materialized view, or TOAST table",
+ RelationGetRelationName(rel))));
+
+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ /* Must be owner of the table or superuser. */
+ if (!pg_class_ownercheck(RelationGetRelid(rel), GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER,
+ get_relkind_objtype(rel->rd_rel->relkind),
+ RelationGetRelationName(rel));
+}
+
+/*-------------------------------------------------------------------------
+ * find_tids_one_page()
+ *
+ * Find all the tids residing in the same page as tids[next_start_ptr] and
+ * update next_start_ptr so that it points to the first tid in the next page.
+ * ------------------------------------------------------------------------
+ */
+static BlockNumber
+find_tids_one_page(ItemPointer tids, int ntids, OffsetNumber *next_start_ptr)
+{
+ int i;
+ BlockNumber prev_blkno,
+ blkno;
+
+ prev_blkno = blkno = InvalidBlockNumber;
+
+ for (i = *next_start_ptr; i < ntids; i++)
+ {
+ ItemPointerData tid = tids[i];
+
+ blkno = ItemPointerGetBlockNumberNoCheck(&tid);
+
+ if (i == *next_start_ptr)
+ prev_blkno = blkno;
+
+ if (prev_blkno != blkno)
+ break;
+ }
+
+ *next_start_ptr = i;
+ return prev_blkno;
+}
diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql
new file mode 100644
index 0000000000..2ae7f228c7
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery--1.0.sql
@@ -0,0 +1,18 @@
+/* contrib/pg_surgery/pg_surgery--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit
+
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control
new file mode 100644
index 0000000000..2bcdad1e3f
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery.control
@@ -0,0 +1,5 @@
+# pg_surgery extension
+comment = 'extension to perform surgery on a damaged relation'
+default_version = '1.0'
+module_pathname = '$libdir/pg_surgery'
+relocatable = true
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
new file mode 100644
index 0000000000..8a27214e9c
--- /dev/null
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -0,0 +1,91 @@
+create extension pg_surgery;
+
+-- create a normal heap table and insert some rows.
+-- note that we don't commit the transaction, so autovacuum can't interfere.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+
+-- test empty TID array
+select heap_force_freeze('htab'::regclass, ARRAY[]::tid[]);
+
+-- nothing should be frozen yet
+select * from htab where xmin = 2;
+
+-- freeze forcibly
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+
+-- now we should have one frozen tuple
+select ctid, xmax from htab where xmin = 2;
+
+-- kill forcibly
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+
+-- should be gone now
+select * from htab where ctid = '(0, 4)';
+
+-- should now be skipped because it's already dead
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+
+-- freeze two TIDs at once while skipping an out-of-range block number
+select heap_force_freeze('htab'::regclass,
+ ARRAY['(0, 1)', '(0, 3)', '(1, 1)']::tid[]);
+
+-- we should now have two frozen tuples
+select ctid, xmax from htab where xmin = 2;
+
+-- out-of-range TIDs should be skipped
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 0)', '(0, 6)']::tid[]);
+
+rollback;
+
+-- set up a new table with a redirected line pointer
+create table htab2(a int) with (autovacuum_enabled = off);
+insert into htab2 values (100);
+update htab2 set a = 200;
+vacuum htab2;
+
+-- redirected TIDs should be skipped
+select heap_force_kill('htab2'::regclass, ARRAY['(0, 1)']::tid[]);
+
+-- now create an unused line pointer
+select ctid from htab2;
+update htab2 set a = 300;
+select ctid from htab2;
+vacuum freeze htab2;
+
+-- unused TIDs should be skipped
+select heap_force_kill('htab2'::regclass, ARRAY['(0, 2)']::tid[]);
+
+-- multidimensional TID array should be rejected
+select heap_force_kill('htab2'::regclass, ARRAY[['(0, 2)']]::tid[]);
+
+-- TID array with nulls should be rejected
+select heap_force_kill('htab2'::regclass, ARRAY[NULL]::tid[]);
+
+-- but we should be able to kill the one tuple we have
+select heap_force_kill('htab2'::regclass, ARRAY['(0, 3)']::tid[]);
+
+-- materialized view.
+-- note that we don't commit the transaction, so autovacuum can't interfere.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+
+select * from mvw where xmin = 2;
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where xmin = 2;
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where ctid = '(0, 3)';
+rollback;
+
+-- check that it fails on an unsupported relkind
+create view vw as select 1;
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+-- cleanup.
+drop table htab2;
+drop view vw;
+drop extension pg_surgery;
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 261a559e81..c82dde2726 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -125,6 +125,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>;
&pgrowlocks;
&pgstatstatements;
&pgstattuple;
+ &pgsurgery;
&pgtrgm;
&pgvisibility;
&postgres-fdw;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 64b5da0070..828396d4a9 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -139,6 +139,7 @@
<!ENTITY pgstandby SYSTEM "pgstandby.sgml">
<!ENTITY pgstatstatements SYSTEM "pgstatstatements.sgml">
<!ENTITY pgstattuple SYSTEM "pgstattuple.sgml">
+<!ENTITY pgsurgery SYSTEM "pgsurgery.sgml">
<!ENTITY pgtrgm SYSTEM "pgtrgm.sgml">
<!ENTITY pgvisibility SYSTEM "pgvisibility.sgml">
<!ENTITY postgres-fdw SYSTEM "postgres-fdw.sgml">
diff --git a/doc/src/sgml/pgsurgery.sgml b/doc/src/sgml/pgsurgery.sgml
new file mode 100644
index 0000000000..134be9bebd
--- /dev/null
+++ b/doc/src/sgml/pgsurgery.sgml
@@ -0,0 +1,107 @@
+<!-- doc/src/sgml/pgsurgery.sgml -->
+
+<sect1 id="pgsurgery" xreflabel="pg_surgery">
+ <title>pg_surgery</title>
+
+ <indexterm zone="pgsurgery">
+ <primary>pg_surgery</primary>
+ </indexterm>
+
+ <para>
+ The <filename>pg_surgery</filename> module provides various functions to
+ perform surgery on a damaged relation. These functions are unsafe by design
+ and using them may corrupt (or further corrupt) your database. For example,
+ these functions can easily be used to make a table inconsistent with its
+ own indexes, to cause <literal>UNIQUE</literal> or
+ <literal>FOREIGN KEY</literal> constraint violations, or even to make
+ tuples visible which, when read, will cause a database server crash.
+ They should be used with great caution and only as a last resort.
+ </para>
+
+ <sect2>
+ <title>Functions</title>
+
+ <variablelist>
+ <varlistentry>
+ <term>
+ <function>heap_force_kill(regclass, tid[]) returns void</function>
+ </term>
+
+ <listitem>
+ <para>
+ <function>heap_force_kill</function> marks <quote>used</quote> line
+ pointers as <quote>dead</quote> without examining the tuples. The
+ intended use of this function is to forcibly remove tuples that are not
+ otherwise accessible. For example:
+<programlisting>
+test=> select * from t1 where ctid = '(0, 1)';
+ERROR: could not access status of transaction 4007513275
+DETAIL: Could not open file "pg_xact/0EED": No such file or directory.
+
+test=# select heap_force_kill('t1'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+test=# select * from t1 where ctid = '(0, 1)';
+(0 rows)
+
+</programlisting>
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>
+ <function>heap_force_freeze(regclass, tid[]) returns void</function>
+ </term>
+
+ <listitem>
+ <para>
+ <function>heap_force_freeze</function> marks tuples as frozen without
+ examining the tuple data. The intended use of this function is to
+ make accessible tuples which are inaccessible due to corrupted
+ visibility information, or which prevent the table from being
+ successfully vacuumed due to corrupted visibility information.
+ For example:
+<programlisting>
+test=> vacuum t1;
+ERROR: found xmin 507 from before relfrozenxid 515
+CONTEXT: while scanning block 0 of relation "public.t1"
+
+test=# select ctid from t1 where xmin = 507;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+test=# select ctid from t1 where xmin = 2;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+</programlisting>
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect2>
+
+ <sect2>
+ <title>Authors</title>
+
+ <para>
+ Ashutosh Sharma <email>ashu.coek88@gmail.com</email>
+ </para>
+ </sect2>
+
+</sect1>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d990463ce..421de88263 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3573,3 +3573,4 @@ yyscan_t
z_stream
z_streamp
zic_t
+HeapTupleForceOption
--
2.24.3 (Apple Git-128)
On Aug 26, 2020, at 4:36 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
This patch also takes care of all the other review comments from - [1].
[1] - /messages/by-id/CA+fd4k6+JWq2MfQt5b7fSJ2wMvCes9TRfbDhVO_fQP9B8JJRAA@mail.gmail.com
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
<v8-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch>
Hi Ashutosh,
I took a look at the v8 patch, created a commitfest entry [1] because I did not find one already existing, and have the following review comments:
HeapTupleForceOption should be added to src/tools/pgindent/typedefs.list.
The tidcmp function can be removed, and ItemPointerCompare used directly by qsort as:
- qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+ qsort((void*) tids, ntids, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
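For illustration, the comparator has to have exactly the type qsort() expects, int (*)(const void *, const void *), and must order TIDs by block first so that all TIDs on one heap page end up adjacent. Tid and tid_cmp below are simplified stand-ins for ItemPointerData and the patch's tidcmp(), not the real PostgreSQL definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified model of ItemPointerData: a (block, offset) pair. */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

/* Comparator in the exact shape qsort() wants: order by block number,
 * then by offset, so each heap page's TIDs are grouped together. */
static int
tid_cmp(const void *a, const void *b)
{
    const Tid *t1 = (const Tid *) a;
    const Tid *t2 = (const Tid *) b;

    if (t1->block != t2->block)
        return (t1->block < t2->block) ? -1 : 1;
    if (t1->offset != t2->offset)
        return (t1->offset < t2->offset) ? -1 : 1;
    return 0;
}
```

One argument for keeping a thin wrapper like tidcmp rather than casting ItemPointerCompare into shape: calling a function through a pointer of a different function type is formally undefined behavior in C, even where it happens to work in practice.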
sanity_check_tid_array() has two error messages:
"array must not contain nulls"
"empty tid array"
I would change the first to say "tid array must not contain nulls", as "tid" is the name of the parameter being checked. It is also more consistent with the second error message, but that doesn't matter to me so much, as I'd argue for removing the second check. I don't see why an empty array should draw an error. It seems more reasonable to just return early since there is no work to do. Consider if somebody uses a function that returns the tids for all corrupt tuples in a table, aggregates that into an array, and hands that to this function. It doesn't seem like an error for that aggregated array to have zero elements in it. I suppose you could emit a NOTICE in this case?
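For what it's worth, with the empty-array check removed the main loop already degenerates gracefully: while (next_start_ptr != ntids) simply never runs when ntids is zero. A hypothetical miniature of that control flow (force_common here is an illustrative stand-in, not the real heap_force_common):

```c
#include <assert.h>
#include <stddef.h>

/* Miniature of the page-at-a-time loop: when ntids == 0 the loop body
 * never executes, so an empty TID array is naturally a no-op rather
 * than an error case that needs an explicit check. */
static int
force_common(size_t ntids)
{
    size_t next_start_ptr = 0;
    int pages_processed = 0;

    while (next_start_ptr != ntids)
    {
        /* stand-in for processing one page's worth of TIDs */
        next_start_ptr++;
        pages_processed++;
    }
    return pages_processed;
}
```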
Upthread:
On Aug 13, 2020, at 12:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
This looks like a very good suggestion to me. I will do this change in
the next version. Just wondering if we should be doing similar changes
in other contrib modules (like pgrowlocks, pageinspect and
pgstattuple) as well?

It seems like it should be consistent, but I'm not sure the proposed
change is really an improvement.
You have used Asim's proposed check:
if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("only the relation using heap_tableam_handler is supported")));
which Robert seems unenthusiastic about, but if you are going that direction, I think at least the language of the error message should be changed. I recommend something like:
if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("only the relation using heap_tableam_handler is supported")));
+ errmsg("\"%s\" does not use a heap access method",
+ RelationGetRelationName(rel))));
where "a heap access method" could also be written as "a heap table type access method", "a heap table compatible access method", and so forth. There doesn't seem to be enough precedent to dictate exactly how to phrase this, or perhaps I'm just not looking in the right place.
The header comment for function find_tids_one_page should state the requirement that the tids array must be sorted.
The heap_force_common function contains multiple ereport(NOTICE,...) within a critical section. I don't think that is normal practice. Can those reports be buffered until after the critical section is exited? You also have a CHECK_FOR_INTERRUPTS within the critical section.
[1]: https://commitfest.postgresql.org/29/2700/
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 28, 2020 at 1:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Okay, point noted.
I spent some time today working on this patch. I'm fairly happy with
it now and intend to commit it if nobody sees a big problem with that.
Per discussion, I do not intend to back-patch at this time. The two
most significant changes I made to your version are:

1. I changed things around to avoid using any form of ereport() in a
critical section. I'm not actually sure whether it is project policy
to avoid ereport(NOTICE, ...) or similar in a critical section, but it
seems prudent, because if anything fails in a critical section, we
will PANIC, so doing fewer things there seems prudent.

2. I changed the code so that it does not try to follow redirected
line pointers; instead, it skips them with an appropriate message, as
we were already doing for dead and unused line pointers. I think the
way you had it coded might've been my suggestion originally, but the
more I looked into it the less I liked it. One problem is that it
didn't match the docs. A second is that following a corrupted line
pointer might index off the end of the line pointer array, and while
that probably shouldn't happen, we are talking about corruption
recovery here. Then I realized that, as you coded it, if the line
pointer was redirected to a line pointer that is in turn dead (or
unused, if there's corruption) the user would get a NOTICE complaining
about a TID they hadn't specified, which seems like it would be very
confusing. I thought about trying to fix all that stuff, but it just
didn't seem worth it, because I can't think of a good reason to pass
this function the TID of a redirected line pointer in the first place.
If you're doing surgery, you should probably specify the exact thing
upon which you want to operate, not some other thing that points to
it.

Here is a list of other changes I made:
* Added a .gitignore file.
* Renamed the regression test file from pg_surgery to heap_surgery to
match the name of the single C source file we currently have.
* Capitalized TID in a few places.
* Ran pgindent.
* Adjusted various comments.
* Removed the check for an empty TID array. I don't see any reason why
this should be an error case and I don't see much precedent for having
such a check.
* Fixed the code to work properly with that change and added a test case.
* Added a check that the array is not multi-dimensional.
* Put the AM type check before the relkind check, following existing precedent.
* Adjusted the check to use the AM OID rather than the handler OID,
following existing precedent. Fixed the message wording accordingly.
* Changed the documentation wording to say less about specific
recovery procedures and focus more on the general idea that this is
dangerous.
* Removed all but one of the test cases that checked what happens if
you use this on a non-heap; three tests for basically the same thing
seemed excessive.
* Added some additional tests to improve code coverage. There are now
only a handful of lines not covered.
* Reorganized the test cases somewhat.

New patch attached.
Thank you Robert for the patch. I've looked into the changes you've
made to the v8 patch and they all look good to me.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Hi Mark,
Thanks for the review. Please find my comments inline below:
HeapTupleForceOption should be added to src/tools/pgindent/typedefs.list.
This has been fixed in the v9 patch.
The tidcmp function can be removed, and ItemPointerCompare used directly by qsort as:
- qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+ qsort((void*) tids, ntids, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
Will have a look into this.
sanity_check_tid_array() has two error messages:
"array must not contain nulls"
"empty tid array"I would change the first to say "tid array must not contain nulls", as "tid" is the name of the parameter being checked. It is also more consistent with the second error message, but that doesn't matter to me so much, as I'd argue for removing the second check. I don't see why an empty array should draw an error. It seems more reasonable to just return early since there is no work to do. Consider if somebody uses a function that returns the tids for all corrupt tuples in a table, aggregates that into an array, and hands that to this function. It doesn't seem like an error for that aggregated array to have zero elements in it. I suppose you could emit a NOTICE in this case?
This comment no longer applies after the changes made in the v9 patch.
Upthread:
On Aug 13, 2020, at 12:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
This looks like a very good suggestion to me. I will do this change in
the next version. Just wondering if we should be doing similar changes
in other contrib modules (like pgrowlocks, pageinspect and
pgstattuple) as well?

It seems like it should be consistent, but I'm not sure the proposed
change is really an improvement.

You have used Asim's proposed check:
if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("only the relation using heap_tableam_handler is supported")));which Robert seems unenthusiastic about, but if you are going that direction, I think at least the language of the error message should be changed. I recommend something like:
if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("only the relation using heap_tableam_handler is supported")));
+ errmsg("\"%s\" does not use a heap access method",
+ RelationGetRelationName(rel))));

where "a heap access method" could also be written as "a heap table type access method", "a heap table compatible access method", and so forth. There doesn't seem to be enough precedent to dictate exactly how to phrase this, or perhaps I'm just not looking in the right place.
Same here. This also no longer applies after the changes made in the v9 patch.
The header comment for function find_tids_one_page should state the requirement that the tids array must be sorted.
Okay, will add a comment for this.
The heap_force_common function contains multiple ereport(NOTICE,...) within a critical section. I don't think that is normal practice. Can those reports be buffered until after the critical section is exited? You also have a CHECK_FOR_INTERRUPTS within the critical section.
This has been fixed in the v9 patch.
Thanks for adding a commitfest entry for this.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Fri, 28 Aug 2020 at 05:14, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Okay, point noted.
I spent some time today working on this patch. I'm fairly happy with
it now and intend to commit it if nobody sees a big problem with that.
Per discussion, I do not intend to back-patch at this time. The two
most significant changes I made to your version are:
Thank you for updating the patch.
1. I changed things around to avoid using any form of ereport() in a
critical section. I'm not actually sure whether it is project policy
to avoid ereport(NOTICE, ...) or similar in a critical section, but it
seems prudent, because if anything fails in a critical section, we
will PANIC, so doing fewer things there seems prudent.

2. I changed the code so that it does not try to follow redirected
line pointers; instead, it skips them with an appropriate message, as
we were already doing for dead and unused line pointers. I think the
way you had it coded might've been my suggestion originally, but the
more I looked into it the less I liked it. One problem is that it
didn't match the docs. A second is that following a corrupted line
pointer might index off the end of the line pointer array, and while
that probably shouldn't happen, we are talking about corruption
recovery here. Then I realized that, as you coded it, if the line
pointer was redirected to a line pointer that is in turn dead (or
unused, if there's corruption) the user would get a NOTICE complaining
about a TID they hadn't specified, which seems like it would be very
confusing. I thought about trying to fix all that stuff, but it just
didn't seem worth it, because I can't think of a good reason to pass
this function the TID of a redirected line pointer in the first place.
If you're doing surgery, you should probably specify the exact thing
upon which you want to operate, not some other thing that points to
it.

Here is a list of other changes I made:
* Added a .gitignore file.
* Renamed the regression test file from pg_surgery to heap_surgery to
match the name of the single C source file we currently have.
* Capitalized TID in a few places.
* Ran pgindent.
* Adjusted various comments.
* Removed the check for an empty TID array. I don't see any reason why
this should be an error case and I don't see much precedent for having
such a check.
* Fixed the code to work properly with that change and added a test case.
* Added a check that the array is not multi-dimensional.
* Put the AM type check before the relkind check, following existing precedent.
* Adjusted the check to use the AM OID rather than the handler OID,
following existing precedent. Fixed the message wording accordingly.
* Changed the documentation wording to say less about specific
recovery procedures and focus more on the general idea that this is
dangerous.
You've removed from the doc the description about executing VACUUM with
the DISABLE_PAGE_SKIPPING option on the target relation after using the
pg_surgery functions, but I guess it's better to recommend that in the
doc for safety. Could you please tell me the reason for removing it?
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
The tidcmp function can be removed, and ItemPointerCompare used
directly by qsort as:
- qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+ qsort((void*) tids, ntids, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
Will have a look into this.
We can certainly do it this way, but I would still prefer having a comparator
function (tidcmp) here: it makes the code look a bit cleaner, and it keeps us
consistent with the way comparator functions are passed to qsort in several
other places in postgres, which improves readability. For example, gin has a
comparator function that does the same thing tidcmp is doing here. See below:
static int
qsortCompareItemPointers(const void *a, const void *b)
{
int res = ginCompareItemPointers((ItemPointer) a, (ItemPointer) b);
/* Assert that there are no equal item pointers being sorted */
Assert(res != 0);
return res;
}
Here as well, it could have been done the way you are suggesting, but writing
a small comparator function with the prototype that qsort expects seemed like
the better option, so I am leaving this as-is. Please let me know if you feel
otherwise.
The header comment for function find_tids_one_page should state the
requirement that the tids array must be sorted.
Okay, will add a comment for this.
Added a comment for this in the attached patch.
Please have a look into the attached patch for the changes and let me know
for any other concerns. Thank you.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachments:
v10-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch (text/x-patch)
From ee49cfbfc153bbf6212d90ac81a2f9d13fa0eef7 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Fri, 28 Aug 2020 15:00:41 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged
relation.
This contrib module provides a couple of functions named heap_force_kill
and heap_force_freeze that can be used to perform surgery on a damaged
heap table.
Author: Ashutosh Sharma and Robert Haas.
Reviewed by: Sawada Masahiko, Andrey Borodin, Beena Emerson, Mark Dilger and Robert Haas
Tested by: Rajkumar Raghuwanshi
Discussion: https://www.postgresql.org/message-id/CA%2BTgmoZW1fsU-QUNCRUQMGUygBDPVeOTLCqRdQZch%3DEYZnctSA%40mail.gmail.com
---
contrib/Makefile | 1 +
contrib/pg_surgery/.gitignore | 4 +
contrib/pg_surgery/Makefile | 23 ++
contrib/pg_surgery/expected/heap_surgery.out | 180 +++++++++++
contrib/pg_surgery/heap_surgery.c | 428 +++++++++++++++++++++++++++
contrib/pg_surgery/pg_surgery--1.0.sql | 18 ++
contrib/pg_surgery/pg_surgery.control | 5 +
contrib/pg_surgery/sql/heap_surgery.sql | 91 ++++++
doc/src/sgml/contrib.sgml | 1 +
doc/src/sgml/filelist.sgml | 1 +
doc/src/sgml/pgsurgery.sgml | 107 +++++++
src/tools/pgindent/typedefs.list | 1 +
12 files changed, 860 insertions(+)
create mode 100644 contrib/pg_surgery/.gitignore
create mode 100644 contrib/pg_surgery/Makefile
create mode 100644 contrib/pg_surgery/expected/heap_surgery.out
create mode 100644 contrib/pg_surgery/heap_surgery.c
create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
create mode 100644 contrib/pg_surgery/pg_surgery.control
create mode 100644 contrib/pg_surgery/sql/heap_surgery.sql
create mode 100644 doc/src/sgml/pgsurgery.sgml
diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..c8d2a16 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
pg_prewarm \
pg_standby \
pg_stat_statements \
+ pg_surgery \
pg_trgm \
pgcrypto \
pgrowlocks \
diff --git a/contrib/pg_surgery/.gitignore b/contrib/pg_surgery/.gitignore
new file mode 100644
index 0000000..5dcb3ff
--- /dev/null
+++ b/contrib/pg_surgery/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..a66776c
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+ $(WIN32RES) \
+ heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = heap_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
new file mode 100644
index 0000000..9451c57
--- /dev/null
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -0,0 +1,180 @@
+create extension pg_surgery;
+-- create a normal heap table and insert some rows.
+-- note that we don't commit the transaction, so autovacuum can't interfere.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+-- test empty TID array
+select heap_force_freeze('htab'::regclass, ARRAY[]::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+-- nothing should be frozen yet
+select * from htab where xmin = 2;
+ a
+---
+(0 rows)
+
+-- freeze forcibly
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+-- now we should have one frozen tuple
+select ctid, xmax from htab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,4) | 0
+(1 row)
+
+-- kill forcibly
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+-- should be gone now
+select * from htab where ctid = '(0, 4)';
+ a
+---
+(0 rows)
+
+-- should now be skipped because it's already dead
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+NOTICE: skipping tid (0, 4) for relation "htab" because it is marked dead
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+NOTICE: skipping tid (0, 4) for relation "htab" because it is marked dead
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+-- freeze two TIDs at once while skipping an out-of-range block number
+select heap_force_freeze('htab'::regclass,
+ ARRAY['(0, 1)', '(0, 3)', '(1, 1)']::tid[]);
+NOTICE: skipping block 1 for relation "htab" because the block number is out of range
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+-- we should now have two frozen tuples
+select ctid, xmax from htab where xmin = 2;
+ ctid | xmax
+-------+------
+ (0,1) | 0
+ (0,3) | 0
+(2 rows)
+
+-- out-of-range TIDs should be skipped
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 0)', '(0, 6)']::tid[]);
+NOTICE: skipping tid (0, 0) for relation "htab" because the item number is out of range
+NOTICE: skipping tid (0, 6) for relation "htab" because the item number is out of range
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+rollback;
+-- set up a new table with a redirected line pointer
+create table htab2(a int) with (autovacuum_enabled = off);
+insert into htab2 values (100);
+update htab2 set a = 200;
+vacuum htab2;
+-- redirected TIDs should be skipped
+select heap_force_kill('htab2'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE: skipping tid (0, 1) for relation "htab2" because it redirects to item 2
+ heap_force_kill
+-----------------
+
+(1 row)
+
+-- now create an unused line pointer
+select ctid from htab2;
+ ctid
+-------
+ (0,2)
+(1 row)
+
+update htab2 set a = 300;
+select ctid from htab2;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+vacuum freeze htab2;
+-- unused TIDs should be skipped
+select heap_force_kill('htab2'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE: skipping tid (0, 2) for relation "htab2" because it is marked unused
+ heap_force_kill
+-----------------
+
+(1 row)
+
+-- multidimensional TID array should be rejected
+select heap_force_kill('htab2'::regclass, ARRAY[['(0, 2)']]::tid[]);
+ERROR: argument must be empty or one-dimensional array
+-- TID array with nulls should be rejected
+select heap_force_kill('htab2'::regclass, ARRAY[NULL]::tid[]);
+ERROR: array must not contain nulls
+-- but we should be able to kill the one tuple we have
+select heap_force_kill('htab2'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+-- materialized view.
+-- note that we don't commit the transaction, so autovacuum can't interfere.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+select * from mvw where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+select * from mvw where xmin = 2;
+ a
+---
+ 3
+(1 row)
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+select * from mvw where ctid = '(0, 3)';
+ a
+---
+(0 rows)
+
+rollback;
+-- check that it fails on an unsupported relkind
+create view vw as select 1;
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "vw" is not a table, materialized view, or TOAST table
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR: "vw" is not a table, materialized view, or TOAST table
+-- cleanup.
+drop table htab2;
+drop view vw;
+drop extension pg_surgery;
diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
new file mode 100644
index 0000000..eb96b4b
--- /dev/null
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -0,0 +1,428 @@
+/*-------------------------------------------------------------------------
+ *
+ * heap_surgery.c
+ * Functions to perform surgery on the damaged heap table.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pg_surgery/heap_surgery.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/visibilitymap.h"
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_proc_d.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+
+PG_MODULE_MAGIC;
+
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HeapTupleForceOption
+{
+ HEAP_FORCE_KILL,
+ HEAP_FORCE_FREEZE
+} HeapTupleForceOption;
+
+PG_FUNCTION_INFO_V1(heap_force_kill);
+PG_FUNCTION_INFO_V1(heap_force_freeze);
+
+static int32 tidcmp(const void *a, const void *b);
+static Datum heap_force_common(FunctionCallInfo fcinfo,
+ HeapTupleForceOption heap_force_opt);
+static void sanity_check_tid_array(ArrayType *ta, int *ntids);
+static void sanity_check_relation(Relation rel);
+static BlockNumber find_tids_one_page(ItemPointer tids, int ntids,
+ OffsetNumber *next_start_ptr);
+
+/*-------------------------------------------------------------------------
+ * heap_force_kill()
+ *
+ * Force kill the tuple(s) pointed to by the item pointer(s) stored in the
+ * given TID array.
+ *
+ * Usage: SELECT heap_force_kill(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_kill(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_KILL));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_freeze()
+ *
+ * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the
+ * given TID array.
+ *
+ * Usage: SELECT heap_force_freeze(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_freeze(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_FREEZE));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_common()
+ *
+ * Common code for heap_force_kill and heap_force_freeze
+ *-------------------------------------------------------------------------
+ */
+static Datum
+heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids,
+ nblocks;
+ Relation rel;
+ OffsetNumber curr_start_ptr,
+ next_start_ptr;
+ bool include_this_tid[MaxHeapTuplesPerPage];
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("heap surgery functions cannot be executed during recovery.")));
+
+ /* Check inputs. */
+ sanity_check_tid_array(ta, &ntids);
+
+ rel = relation_open(relid, RowExclusiveLock);
+
+ /* Check target relation. */
+ sanity_check_relation(rel);
+
+ tids = ((ItemPointer) ARR_DATA_PTR(ta));
+
+ /*
+ * If there is more than one TID in the array, sort them so that we can
+ * easily fetch all the TIDs belonging to one particular page from the
+ * array.
+ */
+ if (ntids > 1)
+ qsort((void *) tids, ntids, sizeof(ItemPointerData), tidcmp);
+
+ curr_start_ptr = next_start_ptr = 0;
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /*
+ * Loop, performing the necessary actions for each block.
+ */
+ while (next_start_ptr != ntids)
+ {
+ Buffer buf;
+ Buffer vmbuf = InvalidBuffer;
+ Page page;
+ BlockNumber blkno;
+ OffsetNumber curoff;
+ OffsetNumber maxoffset;
+ int i;
+ bool did_modify_page = false;
+ bool did_modify_vm = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Find all the TIDs belonging to one particular page starting from
+ * next_start_ptr and process them one by one.
+ */
+ blkno = find_tids_one_page(tids, ntids, &next_start_ptr);
+
+ /* Check whether the block number is valid. */
+ if (blkno >= nblocks)
+ {
+ /* Update the current_start_ptr before moving to the next page. */
+ curr_start_ptr = next_start_ptr;
+
+ ereport(NOTICE,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("skipping block %u for relation \"%s\" because the block number is out of range",
+ blkno, RelationGetRelationName(rel))));
+ continue;
+ }
+
+ buf = ReadBuffer(rel, blkno);
+ LockBufferForCleanup(buf);
+
+ page = BufferGetPage(buf);
+
+ maxoffset = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Figure out which TIDs we are going to process and which ones we are
+ * going to skip.
+ */
+ memset(include_this_tid, 0, sizeof(include_this_tid));
+ for (i = curr_start_ptr; i < next_start_ptr; i++)
+ {
+ OffsetNumber offno = ItemPointerGetOffsetNumberNoCheck(&tids[i]);
+ ItemId itemid;
+
+ /* Check whether the offset number is valid. */
+ if (offno == InvalidOffsetNumber || offno > maxoffset)
+ {
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) for relation \"%s\" because the item number is out of range",
+ blkno, offno, RelationGetRelationName(rel)));
+ continue;
+ }
+
+ itemid = PageGetItemId(page, offno);
+
+ /* Only accept an item ID that is used. */
+ if (ItemIdIsRedirected(itemid))
+ {
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) for relation \"%s\" because it redirects to item %u",
+ blkno, offno, RelationGetRelationName(rel),
+ ItemIdGetRedirect(itemid)));
+ continue;
+ }
+ else if (ItemIdIsDead(itemid))
+ {
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked dead",
+ blkno, offno, RelationGetRelationName(rel))));
+ continue;
+ }
+ else if (!ItemIdIsUsed(itemid))
+ {
+ ereport(NOTICE,
+ (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked unused",
+ blkno, offno, RelationGetRelationName(rel))));
+ continue;
+ }
+
+ /* Mark it for processing. */
+ Assert(offno < MaxHeapTuplesPerPage);
+ include_this_tid[offno] = true;
+ }
+
+ /*
+ * Before entering the critical section, pin the visibility map page
+ * if it appears to be necessary.
+ */
+ if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
+ visibilitymap_pin(rel, blkno, &vmbuf);
+
+ /* No ereport(ERROR) from here until all the changes are logged. */
+ START_CRIT_SECTION();
+
+ for (curoff = FirstOffsetNumber; curoff <= maxoffset;
+ curoff = OffsetNumberNext(curoff))
+ {
+ ItemId itemid;
+
+ if (!include_this_tid[curoff])
+ continue;
+
+ itemid = PageGetItemId(page, curoff);
+ Assert(ItemIdIsNormal(itemid));
+
+ did_modify_page = true;
+
+ if (heap_force_opt == HEAP_FORCE_KILL)
+ {
+ ItemIdSetDead(itemid);
+
+ /*
+ * If the page is marked all-visible, we must clear
+ * PD_ALL_VISIBLE flag on the page header and an all-visible
+ * bit on the visibility map corresponding to the page.
+ */
+ if (PageIsAllVisible(page))
+ {
+ PageClearAllVisible(page);
+ visibilitymap_clear(rel, blkno, vmbuf,
+ VISIBILITYMAP_VALID_BITS);
+ did_modify_vm = true;
+ }
+ }
+ else
+ {
+ HeapTupleHeader htup;
+
+ Assert(heap_force_opt == HEAP_FORCE_FREEZE);
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ /*
+ * Reset all visibility-related fields of the tuple. This
+ * logic should mimic heap_execute_freeze_tuple(), but we
+ * choose to reset xmin and ctid just to be sure that no
+ * potentially-garbled data is left behind.
+ */
+ ItemPointerSet(&htup->t_ctid, blkno, curoff);
+ HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
+ HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
+ if (htup->t_infomask & HEAP_MOVED)
+ {
+ if (htup->t_infomask & HEAP_MOVED_OFF)
+ HeapTupleHeaderSetXvac(htup, InvalidTransactionId);
+ else
+ HeapTupleHeaderSetXvac(htup, FrozenTransactionId);
+ }
+
+ /*
+ * Clear all the visibility-related bits of this tuple and
+ * mark it as frozen. Also, get rid of HOT_UPDATED and
+ * KEYS_UPDATED bits.
+ */
+ htup->t_infomask &= ~HEAP_XACT_MASK;
+ htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID);
+ htup->t_infomask2 &= ~HEAP_HOT_UPDATED;
+ htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+ }
+ }
+
+ /*
+ * Mark the buffer dirty and do the WAL logging only if the page was
+ * actually modified.
+ */
+ if (did_modify_page)
+ {
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+ }
+
+ /* WAL log the VM page if it was modified. */
+ if (did_modify_vm && RelationNeedsWAL(rel))
+ log_newpage_buffer(vmbuf, false);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+
+ if (vmbuf != InvalidBuffer)
+ ReleaseBuffer(vmbuf);
+
+ /* Update the current_start_ptr before moving to the next page. */
+ curr_start_ptr = next_start_ptr;
+ }
+
+ relation_close(rel, RowExclusiveLock);
+
+ pfree(ta);
+
+ PG_RETURN_VOID();
+}
+
+/*-------------------------------------------------------------------------
+ * tidcmp()
+ *
+ * Compare two item pointers, return -1, 0, or +1.
+ *
+ * See ItemPointerCompare for details.
+ * ------------------------------------------------------------------------
+ */
+static int32
+tidcmp(const void *a, const void *b)
+{
+ ItemPointer iptr1 = ((const ItemPointer) a);
+ ItemPointer iptr2 = ((const ItemPointer) b);
+
+ return ItemPointerCompare(iptr1, iptr2);
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_tid_array()
+ *
+ * Perform sanity checks on the given tid array, and set *ntids to the
+ * number of items in the array.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_tid_array(ArrayType *ta, int *ntids)
+{
+ if (ARR_HASNULL(ta) && array_contains_nulls(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ errmsg("array must not contain nulls")));
+
+ if (ARR_NDIM(ta) > 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_EXCEPTION),
+ errmsg("argument must be empty or one-dimensional array")));
+
+ *ntids = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta));
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_relation()
+ *
+ * Perform sanity checks on the given relation.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_relation(Relation rel)
+{
+ if (rel->rd_rel->relkind != RELKIND_RELATION &&
+ rel->rd_rel->relkind != RELKIND_MATVIEW &&
+ rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table, materialized view, or TOAST table",
+ RelationGetRelationName(rel))));
+
+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ /* Must be owner of the table or superuser. */
+ if (!pg_class_ownercheck(RelationGetRelid(rel), GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER,
+ get_relkind_objtype(rel->rd_rel->relkind),
+ RelationGetRelationName(rel));
+}
+
+/*-------------------------------------------------------------------------
+ * find_tids_one_page()
+ *
+ * Find all the tids residing in the same page as tids[next_start_ptr], and
+ * update next_start_ptr so that it points to the first tid in the next page.
+ *
+ * NOTE: The input tids[] array must be sorted.
+ * ------------------------------------------------------------------------
+ */
+static BlockNumber
+find_tids_one_page(ItemPointer tids, int ntids, OffsetNumber *next_start_ptr)
+{
+ int i;
+ BlockNumber prev_blkno,
+ blkno;
+
+ prev_blkno = blkno = InvalidBlockNumber;
+
+ for (i = *next_start_ptr; i < ntids; i++)
+ {
+ ItemPointerData tid = tids[i];
+
+ blkno = ItemPointerGetBlockNumberNoCheck(&tid);
+
+ if (i == *next_start_ptr)
+ prev_blkno = blkno;
+
+ if (prev_blkno != blkno)
+ break;
+ }
+
+ *next_start_ptr = i;
+ return prev_blkno;
+}
diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql
new file mode 100644
index 0000000..2ae7f22
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery--1.0.sql
@@ -0,0 +1,18 @@
+/* contrib/pg_surgery/pg_surgery--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit
+
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control
new file mode 100644
index 0000000..2bcdad1
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery.control
@@ -0,0 +1,5 @@
+# pg_surgery extension
+comment = 'extension to perform surgery on a damaged relation'
+default_version = '1.0'
+module_pathname = '$libdir/pg_surgery'
+relocatable = true
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
new file mode 100644
index 0000000..8a27214
--- /dev/null
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -0,0 +1,91 @@
+create extension pg_surgery;
+
+-- create a normal heap table and insert some rows.
+-- note that we don't commit the transaction, so autovacuum can't interfere.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+
+-- test empty TID array
+select heap_force_freeze('htab'::regclass, ARRAY[]::tid[]);
+
+-- nothing should be frozen yet
+select * from htab where xmin = 2;
+
+-- freeze forcibly
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+
+-- now we should have one frozen tuple
+select ctid, xmax from htab where xmin = 2;
+
+-- kill forcibly
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+
+-- should be gone now
+select * from htab where ctid = '(0, 4)';
+
+-- should now be skipped because it's already dead
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+
+-- freeze two TIDs at once while skipping an out-of-range block number
+select heap_force_freeze('htab'::regclass,
+ ARRAY['(0, 1)', '(0, 3)', '(1, 1)']::tid[]);
+
+-- we should now have two frozen tuples
+select ctid, xmax from htab where xmin = 2;
+
+-- out-of-range TIDs should be skipped
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 0)', '(0, 6)']::tid[]);
+
+rollback;
+
+-- set up a new table with a redirected line pointer
+create table htab2(a int) with (autovacuum_enabled = off);
+insert into htab2 values (100);
+update htab2 set a = 200;
+vacuum htab2;
+
+-- redirected TIDs should be skipped
+select heap_force_kill('htab2'::regclass, ARRAY['(0, 1)']::tid[]);
+
+-- now create an unused line pointer
+select ctid from htab2;
+update htab2 set a = 300;
+select ctid from htab2;
+vacuum freeze htab2;
+
+-- unused TIDs should be skipped
+select heap_force_kill('htab2'::regclass, ARRAY['(0, 2)']::tid[]);
+
+-- multidimensional TID array should be rejected
+select heap_force_kill('htab2'::regclass, ARRAY[['(0, 2)']]::tid[]);
+
+-- TID array with nulls should be rejected
+select heap_force_kill('htab2'::regclass, ARRAY[NULL]::tid[]);
+
+-- but we should be able to kill the one tuple we have
+select heap_force_kill('htab2'::regclass, ARRAY['(0, 3)']::tid[]);
+
+-- materialized view.
+-- note that we don't commit the transaction, so autovacuum can't interfere.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+
+select * from mvw where xmin = 2;
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where xmin = 2;
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where ctid = '(0, 3)';
+rollback;
+
+-- check that it fails on an unsupported relkind
+create view vw as select 1;
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+-- cleanup.
+drop table htab2;
+drop view vw;
+drop extension pg_surgery;
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 261a559..c82dde2 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -125,6 +125,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>;
&pgrowlocks;
&pgstatstatements;
&pgstattuple;
+ &pgsurgery;
&pgtrgm;
&pgvisibility;
&postgres-fdw;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 64b5da0..828396d 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -139,6 +139,7 @@
<!ENTITY pgstandby SYSTEM "pgstandby.sgml">
<!ENTITY pgstatstatements SYSTEM "pgstatstatements.sgml">
<!ENTITY pgstattuple SYSTEM "pgstattuple.sgml">
+<!ENTITY pgsurgery SYSTEM "pgsurgery.sgml">
<!ENTITY pgtrgm SYSTEM "pgtrgm.sgml">
<!ENTITY pgvisibility SYSTEM "pgvisibility.sgml">
<!ENTITY postgres-fdw SYSTEM "postgres-fdw.sgml">
diff --git a/doc/src/sgml/pgsurgery.sgml b/doc/src/sgml/pgsurgery.sgml
new file mode 100644
index 0000000..134be9b
--- /dev/null
+++ b/doc/src/sgml/pgsurgery.sgml
@@ -0,0 +1,107 @@
+<!-- doc/src/sgml/pgsurgery.sgml -->
+
+<sect1 id="pgsurgery" xreflabel="pg_surgery">
+ <title>pg_surgery</title>
+
+ <indexterm zone="pgsurgery">
+ <primary>pg_surgery</primary>
+ </indexterm>
+
+ <para>
+ The <filename>pg_surgery</filename> module provides various functions to
+ perform surgery on a damaged relation. These functions are unsafe by design
+ and using them may corrupt (or further corrupt) your database. For example,
+ these functions can easily be used to make a table inconsistent with its
+ own indexes, to cause <literal>UNIQUE</literal> or
+ <literal>FOREIGN KEY</literal> constraint violations, or even to make
+ tuples visible which, when read, will cause a database server crash.
+ They should be used with great caution and only as a last resort.
+ </para>
+
+ <sect2>
+ <title>Functions</title>
+
+ <variablelist>
+ <varlistentry>
+ <term>
+ <function>heap_force_kill(regclass, tid[]) returns void</function>
+ </term>
+
+ <listitem>
+ <para>
+ <function>heap_force_kill</function> marks <quote>used</quote> line
+ pointers as <quote>dead</quote> without examining the tuples. The
+ intended use of this function is to forcibly remove tuples that are not
+ otherwise accessible. For example:
+<programlisting>
+test=> select * from t1 where ctid = '(0, 1)';
+ERROR: could not access status of transaction 4007513275
+DETAIL: Could not open file "pg_xact/0EED": No such file or directory.
+
+test=# select heap_force_kill('t1'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_kill
+-----------------
+
+(1 row)
+
+test=# select * from t1 where ctid = '(0, 1)';
+(0 rows)
+
+</programlisting>
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>
+ <function>heap_force_freeze(regclass, tid[]) returns void</function>
+ </term>
+
+ <listitem>
+ <para>
+ <function>heap_force_freeze</function> marks tuples as frozen without
+ examining the tuple data. The intended use of this function is to
+ make accessible tuples which are inaccessible due to corrupted
+ visibility information, or which prevent the table from being
+ successfully vacuumed due to corrupted visibility information.
+ For example:
+<programlisting>
+test=> vacuum t1;
+ERROR: found xmin 507 from before relfrozenxid 515
+CONTEXT: while scanning block 0 of relation "public.t1"
+
+test=# select ctid from t1 where xmin = 507;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+
+test=# select ctid from t1 where xmin = 2;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+</programlisting>
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect2>
+
+ <sect2>
+ <title>Authors</title>
+
+ <para>
+ Ashutosh Sharma <email>ashu.coek88@gmail.com</email>
+ </para>
+ </sect2>
+
+</sect1>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d99046..421de88 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3573,3 +3573,4 @@ yyscan_t
z_stream
z_streamp
zic_t
+HeapTupleForceOption
--
1.8.3.1
On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
We can certainly do it this way, but I would still prefer having a comparator function (tidcmp) here: it makes the code look a bit cleaner, and it keeps us consistent with the way comparator function arguments are passed to qsort at several other places in postgres, which improves readability and simplicity. For example, there is a comparator function for gin that does the same thing tidcmp is doing here.
Me too. Casting one kind of function pointer to another kind of
function pointer assumes that the compiler is using the same
argument-passing conventions in both cases, which seems slightly
risky. It also means that if the signature for the function were to
diverge further from the signature that we need in the future, the
compiler might not warn us about it. Perhaps there is some case where
the performance gains would be sufficient to justify those risks,
but this is certainly not that case.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 28, 2020 at 4:07 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
You've removed the description about executing VACUUM with
DISABLE_PAGE_SKIPPING option on the target relation after using
pg_surgery functions from the doc but I guess it’s better to recommend
that in the doc for safety. Could you please tell me the reason for
removing that?
Well, I think that was added because there wasn't code to clear the
visibility map bits, either the page-level copies or the bits in the
map itself. But we added code for that, so now I don't really see why
it's necessary or even desirable.
Here are a few example scenarios:
1. My table is not corrupt. For no particular reason, I force-freeze
or force-kill a tuple which is neither dead nor all-visible.
Concurrent queries might return wrong answers, but the table is not
corrupt. It does not require VACUUM and would not benefit from it.
Actually, it doesn't need anything at all.
2. My table is not corrupt. For no particular reason, I force-freeze a
tuple which is dead. I believe it's possible that the index entries
for that tuple might be gone already, but VACUUM won't fix that.
REINDEX or a table rewrite would, though. It's also possible, if the
dead tuple was added by an aborted transaction which added columns to
the table, that the tuple might have been created using a tuple
descriptor that differs from the table's current tuple descriptor. If
so, I think scanning the table could produce a crash. VACUUM won't fix
this, either. I would need to delete or force-kill the offending
tuple.
3. I have one or more tuples in my table that are intact except that
they have garbage values for xmin, resulting in VACUUM failure or
possibly even SELECT failure if the CLOG entries are also missing. I
force-kill or force-freeze them. If by chance the affected tuples were
also omitted from one or more indexes, a REINDEX or table rewrite is
needed to fix them, but a VACUUM will not help. On the other hand, if
those tuples are present in the indexes, there's no remaining problem
and VACUUM is not needed for the purpose of restoring the integrity of
the table. If the problem has been ongoing for a while, VACUUM might
be needed to advance relfrozenxid, but that doesn't require
DISABLE_PAGE_SKIPPING.
4. I have some pages in my table that have incorrect visibility map
bits. In this case, I need VACUUM (DISABLE_PAGE_SKIPPING). However, I
don't need the functions we're talking about here at all unless I also
have tuples with corrupted visibility information. If I do happen to
have both tuples with corrupted visibility information and also pages
with incorrect visibility map bits, then I suppose I need both these
tools and also VACUUM (DISABLE_PAGE_SKIPPING). Probably, I'll want to
do the VACUUM second. But, if I happened to do the VACUUM first and
then use these functions afterward, the worst thing that could happen
is that I might end up with some dead tuples that could've gotten
removed faster if I'd switched the order. And that's not a disaster.
Basically, I can see no real reason to recommend VACUUM
(DISABLE_PAGE_SKIPPING) here. There are problems that can be fixed
with that command, and there are problems that can be fixed by this
method, but they are mostly independent of each other. We should not
recommend that people run VACUUM "just in case." That kind of fuzzy
thinking seems relatively prevalent already, and it leads to people
spending a lot of time running slow maintenance commands that do
nothing to help them, and which occasionally make things worse.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 28 Aug 2020 at 23:39, Robert Haas <robertmhaas@gmail.com> wrote:
Thank you for your explanation. That makes a lot of sense to me.
If VACUUM could fix a particular kind of problem when used together
with pg_surgery, we could recommend running it, but I agree that
corruption of a heap table is not such a case.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Please have a look at the attached patch for the changes and let me know of any other concerns. Thank you.
I have committed this version.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 10, 2020 at 11:21:02AM -0400, Robert Haas wrote:
I have committed this version.
Thanks; I marked it as such in the CF app.
https://commitfest.postgresql.org/29/2700/
--
Justin
Robert Haas <robertmhaas@gmail.com> writes:
I have committed this version.
This failure says that the test case is not entirely stable:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-12%2005%3A13%3A12
diff -U3 /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out
--- /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out 2020-09-11 06:31:36.000000000 +0000
+++ /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out 2020-09-12 11:40:26.000000000 +0000
@@ -116,7 +116,6 @@
vacuum freeze htab2;
-- unused TIDs should be skipped
select heap_force_kill('htab2'::regclass, ARRAY['(0, 2)']::tid[]);
- NOTICE: skipping tid (0, 2) for relation "htab2" because it is marked unused
heap_force_kill
-----------------
sungazer's first run after pg_surgery went in was successful, so it's
not a hard failure. I'm guessing that it's timing dependent.
The most obvious theory for the cause is that what VACUUM does with
a tuple depends on whether the tuple's xmin is below global xmin,
and a concurrent autovacuum could very easily be holding back global
xmin. While I can't easily get autovac to run at just the right
time, I did verify that a concurrent regular session holding back
global xmin produces the symptom seen above. (To replicate, insert
"select pg_sleep(...)" in heap_surgery.sql before "-- now create an unused
line pointer"; run make installcheck; and use the delay to connect
to the database manually, start a serializable transaction, and do
any query to acquire a snapshot.)
I suggest that the easiest way to make this test reliable is to
make the test tables be temp tables (which allows dropping the
autovacuum_enabled = off property, too). In the wake of commit
a7212be8b, that should guarantee that vacuum has stable tuple-level
behavior regardless of what is happening concurrently.
regards, tom lane
On Sun, Sep 13, 2020 at 3:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thanks for reporting. I'm able to reproduce the issue by adding some
delay just before "-- now create an unused line pointer" and using the
delay to start a new session, at either the repeatable read or
serializable transaction isolation level, that runs some query on the
test table. To fix this, as you suggested, I've converted the test
table to a temp table. Attached is the patch with the changes.
Please have a look and let me know about any concerns.
Thanks,
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_regression_pg_surgey.patch (text/x-patch; charset=US-ASCII)
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
index 9451c57..d2cf647 100644
--- a/contrib/pg_surgery/expected/heap_surgery.out
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -87,7 +87,7 @@ NOTICE: skipping tid (0, 6) for relation "htab" because the item number is out
rollback;
-- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+create temp table htab2(a int);
insert into htab2 values (100);
update htab2 set a = 200;
vacuum htab2;
@@ -175,6 +175,5 @@ ERROR: "vw" is not a table, materialized view, or TOAST table
select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
ERROR: "vw" is not a table, materialized view, or TOAST table
-- cleanup.
-drop table htab2;
drop view vw;
drop extension pg_surgery;
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
index 8a27214..3635bd5 100644
--- a/contrib/pg_surgery/sql/heap_surgery.sql
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -41,7 +41,7 @@ select heap_force_freeze('htab'::regclass, ARRAY['(0, 0)', '(0, 6)']::tid[]);
rollback;
-- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+create temp table htab2(a int);
insert into htab2 values (100);
update htab2 set a = 200;
vacuum htab2;
@@ -86,6 +86,5 @@ select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
-- cleanup.
-drop table htab2;
drop view vw;
drop extension pg_surgery;
On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Tom, do you have any concerns about this fix?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Tom, do you have any concerns about this fix?
It seems OK as far as it goes. Two thoughts:
* Do we need a comment in the test pointing out that the table must be
temp to ensure that we get stable vacuum results? Or will the commit
log message be enough documentation?
* Should any of the other tables in the test be converted to temp?
I see that the other test cases are kluging around related issues
by dint of never committing their tables at all. It's not clear
to me how badly those test cases have been distorted by that, or
whether it means they're testing less-than-typical situations.
Anyway, if you're satisfied with leaving the other cases as-is,
I have no objections.
regards, tom lane
On Wed, Sep 16, 2020 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
It seems OK as far as it goes. Two thoughts:
* Do we need a comment in the test pointing out that the table must be
temp to ensure that we get stable vacuum results? Or will the commit
log message be enough documentation?
I'll add a note for this.
* Should any of the other tables in the test be converted to temp?
I see that the other test cases are kluging around related issues
by dint of never committing their tables at all. It's not clear
to me how badly those test cases have been distorted by that, or
whether it means they're testing less-than-typical situations.
Are you trying to say that we could achieve what test cases 1 and 2
are doing with a single temp table, and that we should aim for that
because it would make the tests more efficient and easier to maintain?
If so, I will make the necessary changes and submit a new patch for
it. Please confirm.
Thanks,
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Ashutosh Sharma <ashu.coek88@gmail.com> writes:
On Wed, Sep 16, 2020 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
* Should any of the other tables in the test be converted to temp?
Are you trying to say that we can achieve the things being done in
test-case 1 and 2 by having a single temp table and we should aim for
it because it will make the test-case more efficient and easy to
maintain?
Well, I'm just looking at the comment that says the reason for the
begin/rollback structure is to keep autovacuum's hands off the table.
In most if not all of the other places where we need that, the preferred
method is to make the table temp or mark it with (autovacuum = off).
While this way isn't wrong exactly, nor inefficient, it does seem
a little restrictive. For instance, you can't easily test cases that
involve intentional errors.
Another point is that we have a few optimizations that apply to tables
created in the current transaction. I'm not sure whether any of them
would fire in this test case, but if they do (now or in the future)
that might mean you aren't testing the usual scenario for use of
pg_surgery, which is surely not going to be a new-in-transaction
table. (That might be an argument for preferring autovacuum = off
over a temp table, too.)
Like I said, I don't have a big problem with leaving the rest of the
test as-is. It just seems to be doing things in an unusual way for
no very good reason.
regards, tom lane
On Wed, Sep 16, 2020 at 9:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I agree with you on both the above points. I'll try to make the
necessary changes to address all your comments. Also, I'd prefer
creating a normal heap table with autovacuum = off over the temp table
for the reasons you mentioned in the second point.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 16, 2020 at 10:40 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Wed, Sep 16, 2020 at 9:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ashutosh Sharma <ashu.coek88@gmail.com> writes:
On Wed, Sep 16, 2020 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
* Should any of the other tables in the test be converted to temp?
Are you trying to say that we can achieve the things being done in
test-case 1 and 2 by having a single temp table and we should aim for
it because it will make the test-case more efficient and easy to
maintain?
Well, I'm just looking at the comment that says the reason for the
begin/rollback structure is to keep autovacuum's hands off the table.
In most if not all of the other places where we need that, the preferred
method is to make the table temp or mark it with (autovacuum = off).
While this way isn't wrong exactly, nor inefficient, it does seem
a little restrictive. For instance, you can't easily test cases that
involve intentional errors.
Another point is that we have a few optimizations that apply to tables
created in the current transaction. I'm not sure whether any of them
would fire in this test case, but if they do (now or in the future)
that might mean you aren't testing the usual scenario for use of
pg_surgery, which is surely not going to be a new-in-transaction
table. (That might be an argument for preferring autovacuum = off
over a temp table, too.)
I agree with you on both the above points. I'll try to make the
necessary changes to address all your comments. Also, I'd prefer
creating a normal heap table with autovacuum = off over the temp table
for the reasons you mentioned in the second point.
Attached is the patch with the changes suggested here. I've tried to
use a normal heap table with (autovacuum = off) wherever possible.
Please have a look and let me know for any other issues.
Thanks,
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v2-fix-regression-pg_surgey.patch (text/x-patch; charset=US-ASCII)
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
index 9451c57..033e3aa 100644
--- a/contrib/pg_surgery/expected/heap_surgery.out
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -1,8 +1,8 @@
create extension pg_surgery;
-- create a normal heap table and insert some rows.
--- note that we don't commit the transaction, so autovacuum can't interfere.
-begin;
-create table htab(a int);
+-- note that we don't want autovacuum to interfere and therefore we disable it
+-- on the test table.
+create table htab(a int) with (autovacuum_enabled = off);
insert into htab values (100), (200), (300), (400), (500);
-- test empty TID array
select heap_force_freeze('htab'::regclass, ARRAY[]::tid[]);
@@ -85,9 +85,10 @@ NOTICE: skipping tid (0, 6) for relation "htab" because the item number is out
(1 row)
-rollback;
-- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+-- note that unlike other test cases we need a temp table here to ensure that we
+-- get a stable vacuum results.
+create temp table htab2(a int);
insert into htab2 values (100);
update htab2 set a = 200;
vacuum htab2;
@@ -175,6 +176,6 @@ ERROR: "vw" is not a table, materialized view, or TOAST table
select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
ERROR: "vw" is not a table, materialized view, or TOAST table
-- cleanup.
-drop table htab2;
+drop table htab;
drop view vw;
drop extension pg_surgery;
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
index 8a27214..234e667 100644
--- a/contrib/pg_surgery/sql/heap_surgery.sql
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -1,9 +1,9 @@
create extension pg_surgery;
-- create a normal heap table and insert some rows.
--- note that we don't commit the transaction, so autovacuum can't interfere.
-begin;
-create table htab(a int);
+-- note that we don't want autovacuum to interfere and therefore we disable it
+-- on the test table.
+create table htab(a int) with (autovacuum_enabled = off);
insert into htab values (100), (200), (300), (400), (500);
-- test empty TID array
@@ -38,10 +38,10 @@ select ctid, xmax from htab where xmin = 2;
-- out-of-range TIDs should be skipped
select heap_force_freeze('htab'::regclass, ARRAY['(0, 0)', '(0, 6)']::tid[]);
-rollback;
-
-- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+-- note that unlike other test cases we need a temp table here to ensure that we
+-- get a stable vacuum results.
+create temp table htab2(a int);
insert into htab2 values (100);
update htab2 set a = 200;
vacuum htab2;
@@ -86,6 +86,6 @@ select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
-- cleanup.
-drop table htab2;
+drop table htab;
drop view vw;
drop extension pg_surgery;
On Wed, Sep 16, 2020 at 1:48 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Attached is the patch with the changes suggested here. I've tried to
use a normal heap table with (autovacuum = off) wherever possible.
Please have a look and let me know for any other issues.
I think the comment needs some wordsmithing -- "unlike other cases" is
not that informative, and "we get a stable vacuum results" isn't
either very clear or all that grammatical. If we're going to add a
comment add here, why not just "use a temp table, so autovacuum can't
interfere"?
Tom, I know that you often have strong feelings about the exact
wording and details of this kind of stuff, so if you feel moved to
commit something that is fine with me. If not, I will take my best
shot at it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Tom, I know that you often have strong feelings about the exact
wording and details of this kind of stuff, so if you feel moved to
commit something that is fine with me. If not, I will take my best
shot at it.
I'm not feeling terribly picky about it --- so it's all yours.
regards, tom lane
On Wed, Sep 16, 2020 at 11:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Tom, I know that you often have strong feelings about the exact
wording and details of this kind of stuff, so if you feel moved to
commit something that is fine with me. If not, I will take my best
shot at it.
I'm not feeling terribly picky about it --- so it's all yours.
OK. After some more study of the thread and some more experimentation,
I came up with the attached. I'll go ahead and commit this if nobody
objects.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-pg_surgery-Try-to-stabilize-regression-tests.patch (application/octet-stream)
From b1d366fe6a6131a0aa070593971912c49ac3ce2f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 16 Sep 2020 14:27:34 -0400
Subject: [PATCH] pg_surgery: Try to stabilize regression tests.
According to buildfarm member sungazer, the behavior of VACUUM can be
unstable in these tests even if we prevent autovacuum from running on
the tables in question, apparently because even a manual vacuum can
behave differently depending on whether anything else is running that
holds back the global xmin. So use a temporary table instead, which
as of commit a7212be8b9e0885ee769e8c55f99ef742cda487b enables
vacuuming using a more aggressive cutoff.
This approach can't be used for the regression test that involves a
materialized view, but that test doesn't run vacuum, so it shouldn't
be prone to this particular failure mode.
Analysis by Tom Lane. Patch by Ashutosh Sharma and me.
Discussion: http://postgr.es/m/665524.1599948007@sss.pgh.pa.us
---
contrib/pg_surgery/expected/heap_surgery.out | 10 ++++------
contrib/pg_surgery/sql/heap_surgery.sql | 11 ++++-------
2 files changed, 8 insertions(+), 13 deletions(-)
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
index 9451c5747b..d4a757ffa0 100644
--- a/contrib/pg_surgery/expected/heap_surgery.out
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -1,8 +1,7 @@
create extension pg_surgery;
-- create a normal heap table and insert some rows.
--- note that we don't commit the transaction, so autovacuum can't interfere.
-begin;
-create table htab(a int);
+-- use a temp table so that vacuum behavior doesn't depend on global xmin
+create temp table htab (a int);
insert into htab values (100), (200), (300), (400), (500);
-- test empty TID array
select heap_force_freeze('htab'::regclass, ARRAY[]::tid[]);
@@ -85,9 +84,9 @@ NOTICE: skipping tid (0, 6) for relation "htab" because the item number is out
(1 row)
-rollback;
-- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+-- use a temp table so that vacuum behavior doesn't depend on global xmin
+create temp table htab2(a int);
insert into htab2 values (100);
update htab2 set a = 200;
vacuum htab2;
@@ -175,6 +174,5 @@ ERROR: "vw" is not a table, materialized view, or TOAST table
select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
ERROR: "vw" is not a table, materialized view, or TOAST table
-- cleanup.
-drop table htab2;
drop view vw;
drop extension pg_surgery;
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
index 8a27214e9c..6526b27535 100644
--- a/contrib/pg_surgery/sql/heap_surgery.sql
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -1,9 +1,8 @@
create extension pg_surgery;
-- create a normal heap table and insert some rows.
--- note that we don't commit the transaction, so autovacuum can't interfere.
-begin;
-create table htab(a int);
+-- use a temp table so that vacuum behavior doesn't depend on global xmin
+create temp table htab (a int);
insert into htab values (100), (200), (300), (400), (500);
-- test empty TID array
@@ -38,10 +37,9 @@ select ctid, xmax from htab where xmin = 2;
-- out-of-range TIDs should be skipped
select heap_force_freeze('htab'::regclass, ARRAY['(0, 0)', '(0, 6)']::tid[]);
-rollback;
-
-- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+-- use a temp table so that vacuum behavior doesn't depend on global xmin
+create temp table htab2(a int);
insert into htab2 values (100);
update htab2 set a = 200;
vacuum htab2;
@@ -86,6 +84,5 @@ select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
-- cleanup.
-drop table htab2;
drop view vw;
drop extension pg_surgery;
--
2.24.3 (Apple Git-128)
Robert Haas <robertmhaas@gmail.com> writes:
OK. After some more study of the thread and some more experimentation,
I came up with the attached. I'll go ahead and commit this if nobody
objects.
This is OK by me.
regards, tom lane
On Thu, Sep 17, 2020 at 12:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
OK. After some more study of the thread and some more experimentation,
I came up with the attached. I'll go ahead and commit this if nobody
objects.
This is OK by me.
Looks good to me too.
Thanks,
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 16, 2020 at 10:27 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
This is OK by me.
Looks good to me too.
Cool, thanks to both of you for looking. Committed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Cool, thanks to both of you for looking. Committed.
Hmph, according to whelk this is worse not better:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-09-18%2017%3A42%3A11
I'm at a loss to understand what's going on there.
regards, tom lane
On Sat, Sep 19, 2020 at 4:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Cool, thanks to both of you for looking. Committed.
Hmph, according to whelk this is worse not better:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-09-18%2017%3A42%3A11
I'm at a loss to understand what's going on there.
I think our assumption that changing the tests to have temp tables
will make them safe w.r.t concurrent activity doesn't seem to be
correct. We do set OldestXmin for temp tables aggressive enough that
it allows us to remove all dead tuples but the test case behavior lies
on whether we are able to prune the chain. AFAICS, we are using
different cut-offs in heap_page_prune when it is called via
lazy_scan_heap. So that seems to be causing both the failures.
--
With Regards,
Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes:
I think our assumption that changing the tests to have temp tables
will make them safe w.r.t. concurrent activity doesn't seem to be
correct. We do set OldestXmin for temp tables aggressively enough that
it allows us to remove all dead tuples, but the test case behavior relies
on whether we are able to prune the chain. AFAICS, we are using
different cut-offs in heap_page_prune when it is called via
lazy_scan_heap. So that seems to be causing both the failures.
Hm, reasonable theory.
I was able to partially reproduce whelk's failure here. I got a
couple of cases of "cannot freeze committed xmax", which then leads
to the second NOTICE diff; but I couldn't reproduce the first
NOTICE diff. That was out of about a thousand tries :-( so it's not
looking like a promising thing to reproduce without modifying the test.
I wonder whether "cannot freeze committed xmax" doesn't represent an
actual bug, ie is a7212be8b setting the cutoff *too* aggressively?
But if so, why's it so hard to reproduce?
regards, tom lane
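The "cannot freeze committed xmax" failure being chased here can be sketched as the following sequence (hypothetical; it assumes some other backend holds an open transaction while it runs, which is exactly why it is hard to hit deterministically):

```sql
-- On a temp table, VACUUM (FREEZE) uses freeze_min_age = 0, and with
-- the a7212be8b cutoff it may attempt to freeze a committed-dead tuple,
-- tripping heap_prepare_freeze_tuple's sanity check.
create temp table t (a int);
insert into t values (1);
update t set a = 2;    -- leaves a committed-dead (HOT-updated) tuple
vacuum (freeze) t;     -- can fail intermittently under concurrency
```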
I wrote:
I was able to partially reproduce whelk's failure here. I got a
couple of cases of "cannot freeze committed xmax", which then leads
to the second NOTICE diff; but I couldn't reproduce the first
NOTICE diff. That was out of about a thousand tries :-( so it's not
looking like a promising thing to reproduce without modifying the test.
... however, it's trivial to reproduce via manual interference,
using the same strategy discussed recently for another case:
add a pg_sleep at the start of the heap_surgery.sql script,
run "make installcheck", and while that's running start another
session in which you begin a serializable transaction, execute
any old SELECT, and wait. AFAICT this reproduces all of whelk's
symptoms with 100% reliability.
With a little more effort, this could be automated by putting
some long-running transaction (likely, it needn't be any more
complicated than "select pg_sleep(10)") in a second test script
launched in parallel with heap_surgery.sql.
So this confirms the suspicion that the cause of the buildfarm
failures is a concurrently-open transaction, presumably from
autovacuum. I don't have time to poke further right now.
regards, tom lane
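The manual reproduction described above amounts to two concurrent sessions; a hypothetical transcript (session labels are illustrative):

```sql
-- Session 1: open a serializable transaction and hold a snapshot.
begin transaction isolation level serializable;
select count(*) from pg_class;  -- any old SELECT; then just wait

-- Session 2 (while session 1 idles): run the test suite, e.g. with a
-- pg_sleep added at the top of heap_surgery.sql:
--   make -C contrib/pg_surgery installcheck
-- The held-open snapshot holds back the global xmin, so the test's
-- VACUUM no longer prunes/freezes as the expected output assumes.
```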
I wrote:
So this confirms the suspicion that the cause of the buildfarm
failures is a concurrently-open transaction, presumably from
autovacuum. I don't have time to poke further right now.
I spent some more time analyzing this, and there seem to be two distinct
issues:
1. My patch a7212be8b does indeed have a problem. It will allow
vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp
table if freeze_min_age is zero (ie VACUUM FREEZE). If there's
any concurrent transactions, this falls foul of
heap_prepare_freeze_tuple's expectation that
* NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
* anyone's idea of the tuple state.
The "cannot freeze committed xmax" error message appears to be banking on
the assumption that we'd not reach heap_prepare_freeze_tuple for any
committed-dead tuple unless its xmax is past the specified cutoff_xid.
2. As Amit suspected, there's an inconsistency between pruneheap.c's
rules for which tuples are removable and vacuum.c's rules for that.
This seems like a massive bug in its own right: what's the point of
pruneheap.c going to huge effort to decide whether it should keep a
tuple if vacuum will then kill it anyway? I do not understand why
whoever put in the GlobalVisState stuff only applied it in pruneheap.c
and not VACUUM proper.
These two points interact, in that we don't get to the "cannot freeze"
failure except when pruneheap has decided not to remove something that
is removable according to VACUUM's rules. (VACUUM doesn't actually
remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
the tuple, and heap_prepare_freeze_tuple spits up.) However, if I revert
a7212be8b then the pg_surgery test still fails in the presence of a
concurrent transaction (both of the expected "skipping TID" notices
disappear). So reverting that patch wouldn't get us out of trouble.
I think to move forward, we need to figure out what the freezing
behavior ought to be for temp tables. We could make it the same
as it was before a7212be8b, which'd just require some more complexity
in vacuum_set_xid_limits. However, that negates the idea that we'd
like VACUUM's behavior on a temp table to be fully independent of
whether concurrent transactions exist. I'd prefer to allow a7212be8b's
behavior to stand, but then it seems we need to lobotomize the error
check in heap_prepare_freeze_tuple to some extent.
Independently of that, it seems like we need to fix things so that
when pruneheap.c is called by vacuum, it makes EXACTLY the same
dead-or-not-dead decisions that the main vacuum code makes. This
business with applying some GlobalVisState rule or other instead
seems just as unsafe as can be.
AFAICS, there is no chance of the existing pg_surgery regression test
being fully stable if we don't fix both things.
BTW, attached is a quick-hack patch to allow automated testing
of this scenario, along the lines I sketched yesterday. This
test passes if you run the two scripts serially, but not when
you run them in parallel. I'm not proposing this for commit;
it's a hack and its timing behavior is probably not stable enough
for the buildfarm. But it's pretty useful for poking at these
issues.
regards, tom lane
Attachments:
add-concurrent-xact-to-pg_surgery-test.patch (text/x-diff; charset=us-ascii)
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
index a66776c4c4..a7bcfca0fb 100644
--- a/contrib/pg_surgery/Makefile
+++ b/contrib/pg_surgery/Makefile
@@ -9,7 +9,7 @@ EXTENSION = pg_surgery
DATA = pg_surgery--1.0.sql
PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
-REGRESS = heap_surgery
+REGRESS = --schedule=surgery_schedule
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
index d4a757ffa0..6ef5597e7c 100644
--- a/contrib/pg_surgery/expected/heap_surgery.out
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -1,4 +1,10 @@
create extension pg_surgery;
+select pg_sleep(0.1); -- ensure concurrent transaction is ready
+ pg_sleep
+----------
+
+(1 row)
+
-- create a normal heap table and insert some rows.
-- use a temp table so that vacuum behavior doesn't depend on global xmin
create temp table htab (a int);
diff --git a/contrib/pg_surgery/expected/sleep.out b/contrib/pg_surgery/expected/sleep.out
new file mode 100644
index 0000000000..208e31163d
--- /dev/null
+++ b/contrib/pg_surgery/expected/sleep.out
@@ -0,0 +1,6 @@
+select pg_sleep(1.0); -- long enough for whole heap_surgery test
+ pg_sleep
+----------
+
+(1 row)
+
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
index 6526b27535..212c657f3c 100644
--- a/contrib/pg_surgery/sql/heap_surgery.sql
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -1,5 +1,7 @@
create extension pg_surgery;
+select pg_sleep(0.1); -- ensure concurrent transaction is ready
+
-- create a normal heap table and insert some rows.
-- use a temp table so that vacuum behavior doesn't depend on global xmin
create temp table htab (a int);
diff --git a/contrib/pg_surgery/sql/sleep.sql b/contrib/pg_surgery/sql/sleep.sql
new file mode 100644
index 0000000000..a85c25bca2
--- /dev/null
+++ b/contrib/pg_surgery/sql/sleep.sql
@@ -0,0 +1 @@
+select pg_sleep(1.0); -- long enough for whole heap_surgery test
diff --git a/contrib/pg_surgery/surgery_schedule b/contrib/pg_surgery/surgery_schedule
new file mode 100644
index 0000000000..1bb3aa5e53
--- /dev/null
+++ b/contrib/pg_surgery/surgery_schedule
@@ -0,0 +1 @@
+test: heap_surgery sleep
Hi Tom,
On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
So this confirms the suspicion that the cause of the buildfarm
failures is a concurrently-open transaction, presumably from
autovacuum. I don't have time to poke further right now.
I spent some more time analyzing this, and there seem to be two distinct
issues:
1. My patch a7212be8b does indeed have a problem. It will allow
vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp
table if freeze_min_age is zero (ie VACUUM FREEZE). If there's
any concurrent transactions, this falls foul of
heap_prepare_freeze_tuple's expectation that
* NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
anyone's idea of the tuple state.
The "cannot freeze committed xmax" error message appears to be banking on
the assumption that we'd not reach heap_prepare_freeze_tuple for any
committed-dead tuple unless its xmax is past the specified cutoff_xid.
2. As Amit suspected, there's an inconsistency between pruneheap.c's
rules for which tuples are removable and vacuum.c's rules for that.
This seems like a massive bug in its own right: what's the point of
pruneheap.c going to huge effort to decide whether it should keep a
tuple if vacuum will then kill it anyway? I do not understand why
whoever put in the GlobalVisState stuff only applied it in pruneheap.c
and not VACUUM proper.
These two points interact, in that we don't get to the "cannot freeze"
failure except when pruneheap has decided not to remove something that
is removable according to VACUUM's rules. (VACUUM doesn't actually
remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
the tuple, and heap_prepare_freeze_tuple spits up.) However, if I revert
a7212be8b then the pg_surgery test still fails in the presence of a
concurrent transaction (both of the expected "skipping TID" notices
disappear). So reverting that patch wouldn't get us out of trouble.
I think to move forward, we need to figure out what the freezing
behavior ought to be for temp tables. We could make it the same
as it was before a7212be8b, which'd just require some more complexity
in vacuum_set_xid_limits. However, that negates the idea that we'd
like VACUUM's behavior on a temp table to be fully independent of
whether concurrent transactions exist. I'd prefer to allow a7212be8b's
behavior to stand, but then it seems we need to lobotomize the error
check in heap_prepare_freeze_tuple to some extent.
Independently of that, it seems like we need to fix things so that
when pruneheap.c is called by vacuum, it makes EXACTLY the same
dead-or-not-dead decisions that the main vacuum code makes. This
business with applying some GlobalVisState rule or other instead
seems just as unsafe as can be.
AFAICS, there is no chance of the existing pg_surgery regression test
being fully stable if we don't fix both things.
From the explanation you provided above it seems like the test-case
for the pg_surgery module is failing because there is some issue with the
changes done in commit a7212be8b (shown below). In other words, I
believe that the test-case for pg_surgery has actually detected an
issue in this commit.
commit a7212be8b9e0885ee769e8c55f99ef742cda487b
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue Sep 1 18:37:12 2020 -0400
Set cutoff xmin more aggressively when vacuuming a temporary table.
....
So, do you mean to say that if the issues related to temp tables
induced by the above commit are fixed, it will make the regression test
for pg_surgery stable?
Please let me know if I am missing something here. Thank you.
BTW, attached is a quick-hack patch to allow automated testing
of this scenario, along the lines I sketched yesterday. This
test passes if you run the two scripts serially, but not when
you run them in parallel. I'm not proposing this for commit;
it's a hack and its timing behavior is probably not stable enough
for the buildfarm. But it's pretty useful for poking at these
issues.
Yeah, understood, thanks for sharing this.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
So this confirms the suspicion that the cause of the buildfarm
failures is a concurrently-open transaction, presumably from
autovacuum. I don't have time to poke further right now.
..
2. As Amit suspected, there's an inconsistency between pruneheap.c's
rules for which tuples are removable and vacuum.c's rules for that.
This seems like a massive bug in its own right: what's the point of
pruneheap.c going to huge effort to decide whether it should keep a
tuple if vacuum will then kill it anyway? I do not understand why
whoever put in the GlobalVisState stuff only applied it in pruneheap.c
and not VACUUM proper.
These two points interact, in that we don't get to the "cannot freeze"
failure except when pruneheap has decided not to remove something that
is removable according to VACUUM's rules. (VACUUM doesn't actually
remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
the tuple, and heap_prepare_freeze_tuple spits up.) However, if I revert
a7212be8b then the pg_surgery test still fails in the presence of a
concurrent transaction (both of the expected "skipping TID" notices
disappear). So reverting that patch wouldn't get us out of trouble.
I think to move forward, we need to figure out what the freezing
behavior ought to be for temp tables. We could make it the same
as it was before a7212be8b, which'd just require some more complexity
in vacuum_set_xid_limits. However, that negates the idea that we'd
like VACUUM's behavior on a temp table to be fully independent of
whether concurrent transactions exist. I'd prefer to allow a7212be8b's
behavior to stand, but then it seems we need to lobotomize the error
check in heap_prepare_freeze_tuple to some extent.
Independently of that, it seems like we need to fix things so that
when pruneheap.c is called by vacuum, it makes EXACTLY the same
dead-or-not-dead decisions that the main vacuum code makes. This
business with applying some GlobalVisState rule or other instead
seems just as unsafe as can be.
Yeah, on a quick look it seems that before commit dc7420c2c9
pruneheap.c and the main VACUUM code used to make the same decision, and
that is the commit which introduced the GlobalVisState stuff.
AFAICS, there is no chance of the existing pg_surgery regression test
being fully stable if we don't fix both things.
What if we ensure that it runs with autovacuum = off and there is no
parallel test running? I am not sure about the second part, but if we
can do that then the test will probably be stable.
--
With Regards,
Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes:
On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
AFAICS, there is no chance of the existing pg_surgery regression test
being fully stable if we don't fix both things.
What if ensure that it runs with autovacuum = off and there is no
parallel test running? I am not sure about the second part but if we
can do that then the test will be probably stable.
Then it'll not be usable under "make installcheck", which is not
very nice. It's also arguable that you aren't testing pg_surgery
under real-world conditions if you do it like that.
Moreover, I think that both of these points need to be addressed
anyway, as they represent bugs that are reachable independently
of pg_surgery. Admittedly, we do not have a test case that
proves that the inconsistency between pruneheap and vacuum has
any bad effects in the absence of a7212be8b. But do you really
want to bet that there are none?
regards, tom lane
On Sun, Sep 20, 2020 at 1:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
1. My patch a7212be8b does indeed have a problem. It will allow
vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp
table if freeze_min_age is zero (ie VACUUM FREEZE). If there's
any concurrent transactions, this falls foul of
heap_prepare_freeze_tuple's expectation that
* NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
anyone's idea of the tuple state.
The "cannot freeze committed xmax" error message appears to be banking on
the assumption that we'd not reach heap_prepare_freeze_tuple for any
committed-dead tuple unless its xmax is past the specified cutoff_xid.
2. As Amit suspected, there's an inconsistency between pruneheap.c's
rules for which tuples are removable and vacuum.c's rules for that.
This seems like a massive bug in its own right: what's the point of
pruneheap.c going to huge effort to decide whether it should keep a
tuple if vacuum will then kill it anyway? I do not understand why
whoever put in the GlobalVisState stuff only applied it in pruneheap.c
and not VACUUM proper.
I am not sure I fully understand why you're contrasting pruneheap.c
with vacuum here, because vacuum just does HOT pruning to remove dead
tuples - maybe calling the relevant functions with different
arguments, but it doesn't have its own independent logic for that.
The key point is that the freezing code isn't, or at least
historically wasn't, very smart about dead tuples. For example, I
think if you told it to freeze something that was dead it would just
do it, which is obviously bad. And that's why Andres stuck those
sanity checks in there. But it's still pretty fragile. I think perhaps
the pruning code should be rewritten in such a way that it can be
combined with the code that freezes and marks pages all-visible, so
that there's not so much action at a distance, but such an endeavor is
in itself pretty scary, and certainly not back-patchable.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, Sep 20, 2020 at 1:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
2. As Amit suspected, there's an inconsistency between pruneheap.c's
rules for which tuples are removable and vacuum.c's rules for that.
This seems like a massive bug in its own right: what's the point of
pruneheap.c going to huge effort to decide whether it should keep a
tuple if vacuum will then kill it anyway? I do not understand why
whoever put in the GlobalVisState stuff only applied it in pruneheap.c
and not VACUUM proper.
I am not sure I fully understand why you're contrasting pruneheap.c
with vacuum here, because vacuum just does HOT pruning to remove dead
tuples - maybe calling the relevant functions with different
arguments, but it doesn't have its own independent logic for that.
Right, but what we end up with is that the very same tuple xmin and
xmax might result in pruning/deletion, or not, depending on whether
it's part of a HOT chain or not. That's at best pretty weird, and
at worst it means that corner-case bugs in other places are triggered
in only one of the two scenarios ... which is what we have here.
The key point is that the freezing code isn't, or at least
historically wasn't, very smart about dead tuples. For example, I
think if you told it to freeze something that was dead it would just
do it, which is obviously bad. And that's why Andres stuck those
sanity checks in there. But it's still pretty fragile. I think perhaps
the pruning code should be rewritten in such a way that it can be
combined with the code that freezes and marks pages all-visible, so
that there's not so much action at a distance, but such an endeavor is
in itself pretty scary, and certainly not back-patchable.
Not sure. The pruning code is trying to serve two masters, that is
both VACUUM and on-the-fly cleanup during ordinary queries. If you
try to merge it with other tasks that VACUUM does, you're going to
have a mess for the second usage. I fear there's going to be pretty
strong conservation of cruft either way.
FWIW, weakening the sanity checks in heap_prepare_freeze_tuple is
*not* my preferred fix here. But it'll take some work in other
places to preserve them.
regards, tom lane
On Mon, Sep 21, 2020 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Right, but what we end up with is that the very same tuple xmin and
xmax might result in pruning/deletion, or not, depending on whether
it's part of a HOT chain or not. That's at best pretty weird, and
at worst it means that corner-case bugs in other places are triggered
in only one of the two scenarios ... which is what we have here.
I'm not sure I really understand how that's happening, because surely
HOT chains and non-HOT chains are pruned by the same code, but it
doesn't sound good.
FWIW, weakening the sanity checks in heap_prepare_freeze_tuple is
*not* my preferred fix here. But it'll take some work in other
places to preserve them.
Makes sense.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2020-09-21 16:02:29 -0400, Robert Haas wrote:
On Mon, Sep 21, 2020 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Right, but what we end up with is that the very same tuple xmin and
xmax might result in pruning/deletion, or not, depending on whether
it's part of a HOT chain or not. That's at best pretty weird, and
at worst it means that corner-case bugs in other places are triggered
in only one of the two scenarios ... which is what we have here.
I'm not sure I really understand how that's happening, because surely
HOT chains and non-HOT chains are pruned by the same code, but it
doesn't sound good.
Not necessarily, unfortunately:
                case HEAPTUPLE_DEAD:

                    /*
                     * Ordinarily, DEAD tuples would have been removed by
                     * heap_page_prune(), but it's possible that the tuple
                     * state changed since heap_page_prune() looked. In
                     * particular an INSERT_IN_PROGRESS tuple could have
                     * changed to DEAD if the inserter aborted. So this
                     * cannot be considered an error condition.
                     *
                     * If the tuple is HOT-updated then it must only be
                     * removed by a prune operation; so we keep it just as if
                     * it were RECENTLY_DEAD. Also, if it's a heap-only
                     * tuple, we choose to keep it, because it'll be a lot
                     * cheaper to get rid of it in the next pruning pass than
                     * to treat it like an indexed tuple. Finally, if index
                     * cleanup is disabled, the second heap pass will not
                     * execute, and the tuple will not get removed, so we must
                     * treat it like any other dead tuple that we choose to
                     * keep.
                     *
                     * If this were to happen for a tuple that actually needed
                     * to be deleted, we'd be in trouble, because it'd
                     * possibly leave a tuple below the relation's xmin
                     * horizon alive. heap_prepare_freeze_tuple() is prepared
                     * to detect that case and abort the transaction,
                     * preventing corruption.
                     */
                    if (HeapTupleIsHotUpdated(&tuple) ||
                        HeapTupleIsHeapOnly(&tuple) ||
                        params->index_cleanup == VACOPT_TERNARY_DISABLED)
                        nkeep += 1;
                    else
                        tupgone = true; /* we can delete the tuple */
                    all_visible = false;
So if, e.g., a transaction aborts between the heap_page_prune() call and
this check, the pruning behaviour depends on whether the tuple is part
of a HOT chain or not.
Greetings,
Andres Freund
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Sep 21, 2020 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Right, but what we end up with is that the very same tuple xmin and
xmax might result in pruning/deletion, or not, depending on whether
it's part of a HOT chain or not. That's at best pretty weird, and
at worst it means that corner-case bugs in other places are triggered
in only one of the two scenarios ... which is what we have here.
I'm not sure I really understand how that's happening, because surely
HOT chains and non-HOT chains are pruned by the same code, but it
doesn't sound good.
No, they're not. lazy_scan_heap() will never remove a tuple that
is HeapTupleIsHotUpdated or HeapTupleIsHeapOnly, even if it thinks
it's DEAD -- cf. vacuumlazy.c, about line 1350. So tuples in
a HOT chain are deleted exactly when pruneheap.c sees fit to do so.
OTOH, for tuples not in a HOT chain, the decision is (I believe)
entirely up to lazy_scan_heap(). And the core of my complaint is that
pruneheap.c's decisions about what is DEAD are not reliably identical
to what HeapTupleSatisfiesVacuum thinks.
I don't mind if a free-standing prune operation has its own rules,
but when it's invoked by VACUUM it ought to follow VACUUM's rules about
what is dead or alive. What remains unclear at this point is whether
we ought to import some of the intelligence added by the GlobalVisState
patch into VACUUM's behavior, or just dumb down pruneheap.c so that
it applies exactly the HeapTupleSatisfiesVacuum rules when invoked
by VACUUM.
regards, tom lane
Hi,
On 2020-09-20 13:13:16 -0400, Tom Lane wrote:
2. As Amit suspected, there's an inconsistency between pruneheap.c's
rules for which tuples are removable and vacuum.c's rules for that.
This seems like a massive bug in its own right: what's the point of
pruneheap.c going to huge effort to decide whether it should keep a
tuple if vacuum will then kill it anyway? I do not understand why
whoever put in the GlobalVisState stuff only applied it in pruneheap.c
and not VACUUM proper.
The reason for that is that the GlobalVisState stuff is computed
heuristically (and then re-checked if that's not sufficient to prune a
tuple, unless that has already been done). That's done so GetSnapshotData() doesn't
have to look at each backend's ->xmin, which is quite a massive speedup
at higher connection counts, as each backend's ->xmin changes much more
often than each backend's xid.
But for VACUUM we need to do the accurate scan of the procarray anyway,
because we need an accurate value for things other than HOT pruning
decisions.
What exactly do you mean by the "going to huge effort to decide" bit?
I think to move forward, we need to figure out what the freezing
behavior ought to be for temp tables. We could make it the same
as it was before a7212be8b, which'd just require some more complexity
in vacuum_set_xid_limits. However, that negates the idea that we'd
like VACUUM's behavior on a temp table to be fully independent of
whether concurrent transactions exist. I'd prefer to allow a7212be8b's
behavior to stand, but then it seems we need to lobotomize the error
check in heap_prepare_freeze_tuple to some extent.
I think that's an argument for what I suggested elsewhere, which is that
we should move the logic for a different horizon for temp tables out of
vacuum_set_xid_limits, and into procarray.
Independently of that, it seems like we need to fix things so that
when pruneheap.c is called by vacuum, it makes EXACTLY the same
dead-or-not-dead decisions that the main vacuum code makes. This
business with applying some GlobalVisState rule or other instead
seems just as unsafe as can be.
It's not great, I agree. Not sure there is a super nice answer
though. Note that, even before my changes, vacuumlazy can decide
differently than pruning whether a tuple is live. E.g. when an inserting
transaction aborts. That's pretty much unavoidable as long as we have
multiple HTSV calls for a tuple, since none of our locking can (nor
should) prevent concurrent transactions from aborting.
Before your new code avoiding the GetOldestNonRemovableTransactionId()
call for temp tables, the GlobalVis* can never be more pessimistic than
decisions based on a prior GetOldestNonRemovableTransactionId call (as
that internally updates the heuristic horizons used by GlobalVis).
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
The reason for that is that the GlobalVisState stuff is computed
heuristically (and then re-checked if that's not sufficient to prune a
tuple, unless that has already been done). That's done so GetSnapshotData() doesn't
have to look at each backend's ->xmin, which is quite a massive speedup
at higher connection counts, as each backend's ->xmin changes much more
often than each backend's xid.
OK.
What exactly do you mean by the "going to huge effort to decide" bit?
I'd looked at all the complexity around GlobalVisState, but failed to
register that it should be pretty cheap on a per-tuple basis. So never
mind that complaint. The point that remains is just that it's different
from HeapTupleSatisfiesVacuum's rules.
I think to move forward, we need to figure out what the freezing
behavior ought to be for temp tables. We could make it the same
as it was before a7212be8b, which'd just require some more complexity
in vacuum_set_xid_limits. However, that negates the idea that we'd
like VACUUM's behavior on a temp table to be fully independent of
whether concurrent transactions exist. I'd prefer to allow a7212be8b's
behavior to stand, but then it seems we need to lobotomize the error
check in heap_prepare_freeze_tuple to some extent.
I think that's an argument for what I suggested elsewhere, which is that
we should move the logic for a different horizon for temp tables out of
vacuum_set_xid_limits, and into procarray.
But procarray does not seem like a great place for
table-persistence-dependent decisions either?
Independently of that, it seems like we need to fix things so that
when pruneheap.c is called by vacuum, it makes EXACTLY the same
dead-or-not-dead decisions that the main vacuum code makes. This
business with applying some GlobalVisState rule or other instead
seems just as unsafe as can be.
It's not great, I agree. Not sure there is a super nice answer
though. Note that, even before my changes, vacuumlazy can decide
differently than pruning whether a tuple is live. E.g. when an inserting
transaction aborts. That's pretty much unavoidable as long as we have
multiple HTSV calls for a tuple, since none of our locking can (nor
should) prevent concurrent transactions from aborting.
It's clear that if the environment changes between test A and test B,
we might get different results. What I'm not happy about is that the
rules are different, so we might get different results even if the
environment did not change.
regards, tom lane
Hi,
On 2020-09-21 16:40:40 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I think to move forward, we need to figure out what the freezing
behavior ought to be for temp tables. We could make it the same
as it was before a7212be8b, which'd just require some more complexity
in vacuum_set_xid_limits. However, that negates the idea that we'd
like VACUUM's behavior on a temp table to be fully independent of
whether concurrent transactions exist. I'd prefer to allow a7212be8b's
behavior to stand, but then it seems we need to lobotomize the error
check in heap_prepare_freeze_tuple to some extent.
I think that's an argument for what I suggested elsewhere, which is that
we should move the logic for a different horizon for temp tables out of
vacuum_set_xid_limits, and into procarray.
But procarray does not seem like a great place for
table-persistence-dependent decisions either?
That ship has sailed a long long time ago though. GetOldestXmin() has
looked at the passed-in relation for quite a while, and even before
that we had logic about 'allDbs' etc. It doesn't easily seem possible
to avoid that, given how intimately that's coupled with how snapshots
are built and used, database & vacuumFlags checks etc.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2020-09-21 16:40:40 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I think that's an argument for what I suggested elsewhere, which is that
we should move the logic for a different horizon for temp tables out of
vacuum_set_xid_limits, and into procarray.
But procarray does not seem like a great place for
table-persistence-dependent decisions either?
That ship has sailed a long long time ago though. GetOldestXmin() has
looked at the passed-in relation for quite a while, and even before
that we had logic about 'allDbs' etc. It doesn't easily seem possible
to avoid that, given how intimately that's coupled with how snapshots
are built and used, database & vacuumFlags checks etc.
OK. Given that you've got strong feelings about this, do you want to
propose a patch? I'm happy to fix it, since it's at least in part my
bug, but I probably won't do it exactly like you would.
regards, tom lane
On 2020-09-21 17:03:53 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2020-09-21 16:40:40 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I think that's an argument for what I suggested elsewhere, which is that
we should move the logic for a different horizon for temp tables out of
vacuum_set_xid_limits, and into procarray.
But procarray does not seem like a great place for
table-persistence-dependent decisions either?
That ship has sailed a long long time ago though. GetOldestXmin() has
looked at the passed-in relation for quite a while, and even before
that we had logic about 'allDbs' etc. It doesn't easily seem possible
to avoid that, given how intimately that's coupled with how snapshots
are built and used, database & vacuumFlags checks etc.
OK. Given that you've got strong feelings about this, do you want to
propose a patch? I'm happy to fix it, since it's at least in part my
bug, but I probably won't do it exactly like you would.
I can give it a try. I can see several paths of varying invasiveness,
not sure yet what the best approach is. Let me think about it for a bit.
Greetings,
Andres Freund
Hi,
On 2020-09-21 14:20:03 -0700, Andres Freund wrote:
I can give it a try. I can see several paths of varying invasiveness,
not sure yet what the best approach is. Let me think about it for a bit.
Ugh, sorry for taking so long to get around to this.
Attached is a *prototype* implementation of this concept, which clearly is
lacking some comment work (and is intentionally lacking some
re-indentation).
I described my thoughts about how to limit the horizons for temp tables in
/messages/by-id/20201014203103.72oke6hqywcyhx7s@alap3.anarazel.de
Besides comments this probably mainly needs a bit more tests around temp
table vacuuming. Should have at least an isolation test that verifies
that temp table rows can be a) vacuumed b) pruned away in the presence
of other sessions with xids.
Greetings,
Andres Freund
Attachments:
WIP-horizon-stuff.diff (text/x-diff; charset=us-ascii)
diff --git i/src/backend/commands/vacuum.c w/src/backend/commands/vacuum.c
index ddeec870d81..8851fafed64 100644
--- i/src/backend/commands/vacuum.c
+++ w/src/backend/commands/vacuum.c
@@ -950,24 +950,8 @@ vacuum_set_xid_limits(Relation rel,
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
- if (RELATION_IS_LOCAL(rel) && !IsInTransactionBlock(isTopLevel))
{
/*
- * If we are processing a temp relation (which by prior checks must be
- * one belonging to our session), and we are not inside any
- * transaction block, then there can be no tuples in the rel that are
- * still in-doubt, nor can there be any that are dead but possibly
- * still interesting to some snapshot our session holds. We don't
- * need to care whether other sessions could see such tuples, either.
- * So we can aggressively set the cutoff xmin to be the nextXid.
- */
- *oldestXmin = ReadNewTransactionId();
- }
- else
- {
- /*
- * Otherwise, calculate the cutoff xmin normally.
- *
* We can always ignore processes running lazy vacuum. This is
* because we use these values only for deciding which tuples we must
* keep in the tables. Since lazy vacuum doesn't write its XID
diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
index 07c5eeb7495..c3cfe34ffee 100644
--- i/src/backend/storage/ipc/procarray.c
+++ w/src/backend/storage/ipc/procarray.c
@@ -153,6 +153,8 @@ typedef struct ProcArrayStruct
* I.e. the difference to GlobalVisCatalogRels is that
* replication slot's catalog_xmin is not taken into account.
*
+ * 4) GlobalVisTempRels, XXXX
+ *
* GlobalVisTestFor(relation) returns the appropriate state
* for the relation.
*
@@ -234,6 +236,9 @@ typedef struct ComputeXidHorizonsResult
* defined tables.
*/
TransactionId data_oldest_nonremovable;
+
+ TransactionId temp_oldest_nonremovable;
+
} ComputeXidHorizonsResult;
@@ -257,12 +262,13 @@ static TransactionId standbySnapshotPendingXmin;
/*
* State for visibility checks on different types of relations. See struct
- * GlobalVisState for details. As shared, catalog, and user defined
+ * GlobalVisState for details. As shared, catalog, normal and temporary
* relations can have different horizons, one such state exists for each.
*/
static GlobalVisState GlobalVisSharedRels;
static GlobalVisState GlobalVisCatalogRels;
static GlobalVisState GlobalVisDataRels;
+static GlobalVisState GlobalVisTempRels;
/*
* This backend's RecentXmin at the last time the accurate xmin horizon was
@@ -1668,6 +1674,10 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->oldest_considered_running = initial;
h->shared_oldest_nonremovable = initial;
h->data_oldest_nonremovable = initial;
+ if (TransactionIdIsValid(MyProc->xid))
+ h->temp_oldest_nonremovable = MyProc->xid;
+ else
+ h->temp_oldest_nonremovable = initial;
}
/*
@@ -1760,6 +1770,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+ /* temp relations cannot be accessed in recovery */
}
else
{
@@ -1785,6 +1796,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->data_oldest_nonremovable =
TransactionIdRetreatedBy(h->data_oldest_nonremovable,
vacuum_defer_cleanup_age);
+ /* defer doesn't apply to temp relations */
}
/*
@@ -1844,6 +1856,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->catalog_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+ h->temp_oldest_nonremovable));
Assert(!TransactionIdIsValid(h->slot_xmin) ||
TransactionIdPrecedesOrEquals(h->oldest_considered_running,
h->slot_xmin));
@@ -1878,6 +1892,8 @@ GetOldestNonRemovableTransactionId(Relation rel)
return horizons.shared_oldest_nonremovable;
else if (RelationIsAccessibleInLogicalDecoding(rel))
return horizons.catalog_oldest_nonremovable;
+ else if (RELATION_IS_LOCAL(rel))
+ return horizons.temp_oldest_nonremovable;
else
return horizons.data_oldest_nonremovable;
}
@@ -2357,6 +2373,9 @@ GetSnapshotData(Snapshot snapshot)
GlobalVisDataRels.definitely_needed =
FullTransactionIdNewer(def_vis_fxid_data,
GlobalVisDataRels.definitely_needed);
+ GlobalVisTempRels.definitely_needed =
+ FullTransactionIdNewer(latest_completed,
+ GlobalVisTempRels.definitely_needed);
/*
* Check if we know that we can initialize or increase the lower
@@ -2375,6 +2394,18 @@ GetSnapshotData(Snapshot snapshot)
GlobalVisDataRels.maybe_needed =
FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
oldestfxid);
+ if (TransactionIdIsNormal(myxid))
+ {
+ GlobalVisTempRels.maybe_needed =
+ FullTransactionIdNewer(GlobalVisTempRels.maybe_needed,
+ FullXidRelativeTo(latest_completed, myxid));
+ }
+ else
+ {
+ GlobalVisTempRels.maybe_needed =
+ FullTransactionIdNewer(GlobalVisTempRels.maybe_needed,
+ latest_completed);
+ }
}
RecentXmin = xmin;
@@ -3892,6 +3923,8 @@ GlobalVisTestFor(Relation rel)
state = &GlobalVisSharedRels;
else if (need_catalog)
state = &GlobalVisCatalogRels;
+ else if (RELATION_IS_LOCAL(rel))
+ state = &GlobalVisTempRels;
else
state = &GlobalVisDataRels;
@@ -3942,6 +3975,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
GlobalVisDataRels.maybe_needed =
FullXidRelativeTo(horizons->latest_completed,
horizons->data_oldest_nonremovable);
+ GlobalVisTempRels.maybe_needed =
+ FullXidRelativeTo(horizons->latest_completed,
+ horizons->temp_oldest_nonremovable);
/*
* In longer running transactions it's possible that transactions we
@@ -3957,6 +3993,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
GlobalVisDataRels.definitely_needed =
FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
GlobalVisDataRels.definitely_needed);
+ GlobalVisTempRels.definitely_needed =
+ FullTransactionIdNewer(GlobalVisTempRels.maybe_needed,
+ GlobalVisTempRels.definitely_needed);
ComputeXidHorizonsResultLastXmin = RecentXmin;
}
diff --git i/contrib/pg_surgery/Makefile w/contrib/pg_surgery/Makefile
index a66776c4c41..48a96008cea 100644
--- i/contrib/pg_surgery/Makefile
+++ w/contrib/pg_surgery/Makefile
@@ -9,7 +9,7 @@ EXTENSION = pg_surgery
DATA = pg_surgery--1.0.sql
PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
-REGRESS = heap_surgery
+REGRESS = --schedule=$(srcdir)/surgery_schedule
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git i/contrib/pg_surgery/expected/heap_surgery.out w/contrib/pg_surgery/expected/heap_surgery.out
index d4a757ffa01..6ef5597e7cc 100644
--- i/contrib/pg_surgery/expected/heap_surgery.out
+++ w/contrib/pg_surgery/expected/heap_surgery.out
@@ -1,4 +1,10 @@
create extension pg_surgery;
+select pg_sleep(0.1); -- ensure concurrent transaction is ready
+ pg_sleep
+----------
+
+(1 row)
+
-- create a normal heap table and insert some rows.
-- use a temp table so that vacuum behavior doesn't depend on global xmin
create temp table htab (a int);
diff --git i/contrib/pg_surgery/sql/heap_surgery.sql w/contrib/pg_surgery/sql/heap_surgery.sql
index 6526b27535d..212c657f3c7 100644
--- i/contrib/pg_surgery/sql/heap_surgery.sql
+++ w/contrib/pg_surgery/sql/heap_surgery.sql
@@ -1,5 +1,7 @@
create extension pg_surgery;
+select pg_sleep(0.1); -- ensure concurrent transaction is ready
+
-- create a normal heap table and insert some rows.
-- use a temp table so that vacuum behavior doesn't depend on global xmin
create temp table htab (a int);
Hi,
On 2020-10-15 01:37:35 -0700, Andres Freund wrote:
Attached is a *prototype* implementation of this concept, which clearly is
lacking some comment work (and is intentionally lacking some
re-indentation).
I described my thoughts about how to limit the horizons for temp tables in
/messages/by-id/20201014203103.72oke6hqywcyhx7s@alap3.anarazel.de
Attached is an updated version of this patch. Quite a bit of polish,
added removal of the isTopLevel arguments added by a7212be8b9e that are now
unnecessary, and changed the initialization of the temp table horizons
to be latestCompletedXid + 1 instead of just latestCompletedXid when no
xid is assigned.
Besides comments this probably mainly needs a bit more tests around temp
table vacuuming. Should have at least an isolation test that verifies
that temp table rows can be a) vacuumed b) pruned away in the presence
of other sessions with xids.
I added an isolationtester test for this. It verifies that dead rows in
temp tables get vacuumed and pruned despite concurrent sessions having
older snapshots. It does so by forcing an IOS and checking the number of
heap fetches reported by EXPLAIN. I also added a companion test for
permanent relations, ensuring that such rows do not get removed.
Any comments? Otherwise I'll push that patch tomorrow.
Greetings,
Andres Freund
Attachments:
v2-0001-Centralize-horizon-determination-for-temp-tables-.patch (text/x-diff; charset=us-ascii)
From 1ba3c5eeb250d3e755ad5c044aede1816d040ae4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 27 Oct 2020 20:39:57 -0700
Subject: [PATCH v2] Centralize horizon determination for temp tables, fix bug
due to skew.
Also add tests to verify that temp tables can be pruned despite other
sessions having older snapshots.
Author: Andres Freund
Discussion: https://postgr.es/m/20201014203103.72oke6hqywcyhx7s@alap3.anarazel.de
Discussion: https://postgr.es/m/20201015083735.derdzysdtqdvxshp@alap3.anarazel.de
---
src/include/commands/cluster.h | 3 +-
src/include/commands/vacuum.h | 1 -
src/backend/access/heap/vacuumlazy.c | 1 -
src/backend/commands/cluster.c | 28 +--
src/backend/commands/vacuum.c | 76 +++---
src/backend/storage/ipc/procarray.c | 59 ++++-
src/test/isolation/expected/horizons.out | 281 +++++++++++++++++++++++
src/test/isolation/isolation_schedule | 1 +
src/test/isolation/specs/horizons.spec | 165 +++++++++++++
9 files changed, 541 insertions(+), 74 deletions(-)
create mode 100644 src/test/isolation/expected/horizons.out
create mode 100644 src/test/isolation/specs/horizons.spec
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 1eb144204b6..e05884781b9 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -19,8 +19,7 @@
extern void cluster(ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, int options,
- bool isTopLevel);
+extern void cluster_rel(Oid tableOid, Oid indexOid, int options);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
bool recheck, LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d9475c99890..a4cd7214009 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -267,7 +267,6 @@ extern void vacuum_set_xid_limits(Relation rel,
int freeze_min_age, int freeze_table_age,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
- bool isTopLevel,
TransactionId *oldestXmin,
TransactionId *freezeLimit,
TransactionId *xidFullScanLimit,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 4f2f38168dc..be5439dd9d8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -471,7 +471,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- true, /* we must be a top-level command */
&OldestXmin, &FreezeLimit, &xidFullScanLimit,
&MultiXactCutoff, &mxactFullScanLimit);
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 0d647e912c4..04d12a7ece6 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -67,13 +67,10 @@ typedef struct
} RelToCluster;
-static void rebuild_relation(Relation OldHeap, Oid indexOid,
- bool isTopLevel, bool verbose);
+static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose);
static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
- bool isTopLevel, bool verbose,
- bool *pSwapToastByContent,
- TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti);
+ bool verbose, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
@@ -173,7 +170,7 @@ cluster(ClusterStmt *stmt, bool isTopLevel)
table_close(rel, NoLock);
/* Do the job. */
- cluster_rel(tableOid, indexOid, stmt->options, isTopLevel);
+ cluster_rel(tableOid, indexOid, stmt->options);
}
else
{
@@ -222,8 +219,7 @@ cluster(ClusterStmt *stmt, bool isTopLevel)
PushActiveSnapshot(GetTransactionSnapshot());
/* Do the job. */
cluster_rel(rvtc->tableOid, rvtc->indexOid,
- stmt->options | CLUOPT_RECHECK,
- isTopLevel);
+ stmt->options | CLUOPT_RECHECK);
PopActiveSnapshot();
CommitTransactionCommand();
}
@@ -254,7 +250,7 @@ cluster(ClusterStmt *stmt, bool isTopLevel)
* and error messages should refer to the operation as VACUUM not CLUSTER.
*/
void
-cluster_rel(Oid tableOid, Oid indexOid, int options, bool isTopLevel)
+cluster_rel(Oid tableOid, Oid indexOid, int options)
{
Relation OldHeap;
bool verbose = ((options & CLUOPT_VERBOSE) != 0);
@@ -404,7 +400,7 @@ cluster_rel(Oid tableOid, Oid indexOid, int options, bool isTopLevel)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, indexOid, isTopLevel, verbose);
+ rebuild_relation(OldHeap, indexOid, verbose);
/* NB: rebuild_relation does table_close() on OldHeap */
@@ -549,12 +545,11 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
*
* OldHeap: table to rebuild --- must be opened and exclusive-locked!
* indexOid: index to cluster by, or InvalidOid to rewrite in physical order.
- * isTopLevel: should be passed down from ProcessUtility.
*
* NB: this routine closes OldHeap at the right time; caller should not.
*/
static void
-rebuild_relation(Relation OldHeap, Oid indexOid, bool isTopLevel, bool verbose)
+rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid tableSpace = OldHeap->rd_rel->reltablespace;
@@ -582,7 +577,7 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool isTopLevel, bool verbose)
AccessExclusiveLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(OIDNewHeap, tableOid, indexOid, isTopLevel, verbose,
+ copy_table_data(OIDNewHeap, tableOid, indexOid, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
/*
@@ -733,8 +728,7 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, char relpersistence,
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
- bool isTopLevel, bool verbose,
+copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
bool *pSwapToastByContent, TransactionId *pFreezeXid,
MultiXactId *pCutoffMulti)
{
@@ -832,7 +826,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, isTopLevel,
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
&OldestXmin, &FreezeXid, NULL, &MultiXactCutoff,
NULL);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ddeec870d81..1b6717f727e 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -907,8 +907,7 @@ get_all_vacuum_rels(int options)
/*
* vacuum_set_xid_limits() -- compute oldestXmin and freeze cutoff points
*
- * Input parameters are the target relation, applicable freeze age settings,
- * and isTopLevel which should be passed down from ProcessUtility.
+ * Input parameters are the target relation, applicable freeze age settings.
*
* The output parameters are:
* - oldestXmin is the cutoff value used to distinguish whether tuples are
@@ -934,7 +933,6 @@ vacuum_set_xid_limits(Relation rel,
int freeze_table_age,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
- bool isTopLevel,
TransactionId *oldestXmin,
TransactionId *freezeLimit,
TransactionId *xidFullScanLimit,
@@ -950,53 +948,33 @@ vacuum_set_xid_limits(Relation rel,
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
- if (RELATION_IS_LOCAL(rel) && !IsInTransactionBlock(isTopLevel))
- {
- /*
- * If we are processing a temp relation (which by prior checks must be
- * one belonging to our session), and we are not inside any
- * transaction block, then there can be no tuples in the rel that are
- * still in-doubt, nor can there be any that are dead but possibly
- * still interesting to some snapshot our session holds. We don't
- * need to care whether other sessions could see such tuples, either.
- * So we can aggressively set the cutoff xmin to be the nextXid.
- */
- *oldestXmin = ReadNewTransactionId();
- }
- else
- {
- /*
- * Otherwise, calculate the cutoff xmin normally.
- *
- * We can always ignore processes running lazy vacuum. This is
- * because we use these values only for deciding which tuples we must
- * keep in the tables. Since lazy vacuum doesn't write its XID
- * anywhere (usually no XID assigned), it's safe to ignore it. In
- * theory it could be problematic to ignore lazy vacuums in a full
- * vacuum, but keep in mind that only one vacuum process can be
- * working on a particular table at any time, and that each vacuum is
- * always an independent transaction.
- */
- *oldestXmin = GetOldestNonRemovableTransactionId(rel);
+ /*
+ * We can always ignore processes running lazy vacuum. This is because we
+ * use these values only for deciding which tuples we must keep in the
+ * tables. Since lazy vacuum doesn't write its XID anywhere (usually no
+ * XID assigned), it's safe to ignore it. In theory it could be
+ * problematic to ignore lazy vacuums in a full vacuum, but keep in mind
+ * that only one vacuum process can be working on a particular table at
+ * any time, and that each vacuum is always an independent transaction.
+ */
+ *oldestXmin = GetOldestNonRemovableTransactionId(rel);
- if (OldSnapshotThresholdActive())
+ if (OldSnapshotThresholdActive())
+ {
+ TransactionId limit_xmin;
+ TimestampTz limit_ts;
+
+ if (TransactionIdLimitedForOldSnapshots(*oldestXmin, rel,
+ &limit_xmin, &limit_ts))
{
- TransactionId limit_xmin;
- TimestampTz limit_ts;
-
- if (TransactionIdLimitedForOldSnapshots(*oldestXmin, rel,
- &limit_xmin, &limit_ts))
- {
- /*
- * TODO: We should only set the threshold if we are pruning on
- * the basis of the increased limits. Not as crucial here as
- * it is for opportunistic pruning (which often happens at a
- * much higher frequency), but would still be a significant
- * improvement.
- */
- SetOldSnapshotThresholdTimestamp(limit_ts, limit_xmin);
- *oldestXmin = limit_xmin;
- }
+ /*
+ * TODO: We should only set the threshold if we are pruning on the
+ * basis of the increased limits. Not as crucial here as it is
+ * for opportunistic pruning (which often happens at a much higher
+ * frequency), but would still be a significant improvement.
+ */
+ SetOldSnapshotThresholdTimestamp(limit_ts, limit_xmin);
+ *oldestXmin = limit_xmin;
}
}
@@ -1930,7 +1908,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
cluster_options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(relid, InvalidOid, cluster_options, true);
+ cluster_rel(relid, InvalidOid, cluster_options);
}
else
table_relation_vacuum(onerel, params, vac_strategy);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 07c5eeb7495..ee4caa51152 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -131,7 +131,7 @@ typedef struct ProcArrayStruct
* different types of relations. As e.g. a normal user defined table in one
* database is inaccessible to backends connected to another database, a test
* specific to a relation can be more aggressive than a test for a shared
- * relation. Currently we track three different states:
+ * relation. Currently we track four different states:
*
* 1) GlobalVisSharedRels, which only considers an XID's
* effects visible-to-everyone if neither snapshots in any database, nor a
@@ -153,6 +153,9 @@ typedef struct ProcArrayStruct
* I.e. the difference to GlobalVisCatalogRels is that
* replication slot's catalog_xmin is not taken into account.
*
+ * 4) GlobalVisTempRels, which only considers the current session, as temp
+ * tables are not visible to other sessions.
+ *
* GlobalVisTestFor(relation) returns the appropriate state
* for the relation.
*
@@ -234,6 +237,13 @@ typedef struct ComputeXidHorizonsResult
* defined tables.
*/
TransactionId data_oldest_nonremovable;
+
+ /*
+ * Oldest xid for which deleted tuples need to be retained in this
+ * session's temporary tables.
+ */
+ TransactionId temp_oldest_nonremovable;
+
} ComputeXidHorizonsResult;
@@ -257,12 +267,13 @@ static TransactionId standbySnapshotPendingXmin;
/*
* State for visibility checks on different types of relations. See struct
- * GlobalVisState for details. As shared, catalog, and user defined
+ * GlobalVisState for details. As shared, catalog, normal and temporary
* relations can have different horizons, one such state exists for each.
*/
static GlobalVisState GlobalVisSharedRels;
static GlobalVisState GlobalVisCatalogRels;
static GlobalVisState GlobalVisDataRels;
+static GlobalVisState GlobalVisTempRels;
/*
* This backend's RecentXmin at the last time the accurate xmin horizon was
@@ -1668,6 +1679,23 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->oldest_considered_running = initial;
h->shared_oldest_nonremovable = initial;
h->data_oldest_nonremovable = initial;
+
+ /*
+ * Only modifications made by this backend affect the horizon for
+ * temporary relations. Instead of a check in each iteration of the
+ * loop over all PGPROCs it is cheaper to just initialize to the
+ * current top-level xid, if any.
+ *
+ * Without an assigned xid we could use a horizon as aggressive as
+ * ReadNewTransactionId(), but we can get away with the much cheaper
+ * latestCompletedXid + 1: if this backend has no xid, there can, by
+ * definition, be no newer changes in the temp table than
+ * latestCompletedXid.
+ */
+ if (TransactionIdIsValid(MyProc->xid))
+ h->temp_oldest_nonremovable = MyProc->xid;
+ else
+ h->temp_oldest_nonremovable = initial;
}
/*
@@ -1760,6 +1788,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+ /* temp relations cannot be accessed in recovery */
}
else
{
@@ -1785,6 +1814,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->data_oldest_nonremovable =
TransactionIdRetreatedBy(h->data_oldest_nonremovable,
vacuum_defer_cleanup_age);
+ /* defer doesn't apply to temp relations */
}
/*
@@ -1844,6 +1874,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->catalog_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+ h->temp_oldest_nonremovable));
Assert(!TransactionIdIsValid(h->slot_xmin) ||
TransactionIdPrecedesOrEquals(h->oldest_considered_running,
h->slot_xmin));
@@ -1878,6 +1910,8 @@ GetOldestNonRemovableTransactionId(Relation rel)
return horizons.shared_oldest_nonremovable;
else if (RelationIsAccessibleInLogicalDecoding(rel))
return horizons.catalog_oldest_nonremovable;
+ else if (RELATION_IS_LOCAL(rel))
+ return horizons.temp_oldest_nonremovable;
else
return horizons.data_oldest_nonremovable;
}
@@ -2054,8 +2088,8 @@ GetSnapshotDataReuse(Snapshot snapshot)
* RecentXmin: the xmin computed for the most recent snapshot. XIDs
* older than this are known not running any more.
*
- * And try to advance the bounds of GlobalVisSharedRels, GlobalVisCatalogRels,
- * GlobalVisDataRels for the benefit of theGlobalVisTest* family of functions.
+ * And try to advance the bounds of GlobalVis{Shared,Catalog,Data,Temp}Rels
+ * for the benefit of the GlobalVisTest* family of functions.
*
* Note: this function should probably not be called with an argument that's
* not statically allocated (see xip allocation below).
@@ -2357,6 +2391,15 @@ GetSnapshotData(Snapshot snapshot)
GlobalVisDataRels.definitely_needed =
FullTransactionIdNewer(def_vis_fxid_data,
GlobalVisDataRels.definitely_needed);
+ /* See temp_oldest_nonremovable computation in ComputeXidHorizons() */
+ if (TransactionIdIsNormal(myxid))
+ GlobalVisTempRels.definitely_needed =
+ FullXidRelativeTo(latest_completed, myxid);
+ else
+ {
+ GlobalVisTempRels.definitely_needed = latest_completed;
+ FullTransactionIdAdvance(&GlobalVisTempRels.definitely_needed);
+ }
/*
* Check if we know that we can initialize or increase the lower
@@ -2375,6 +2418,8 @@ GetSnapshotData(Snapshot snapshot)
GlobalVisDataRels.maybe_needed =
FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
oldestfxid);
+ /* accurate value known */
+ GlobalVisTempRels.maybe_needed = GlobalVisTempRels.definitely_needed;
}
RecentXmin = xmin;
@@ -3892,6 +3937,8 @@ GlobalVisTestFor(Relation rel)
state = &GlobalVisSharedRels;
else if (need_catalog)
state = &GlobalVisCatalogRels;
+ else if (RELATION_IS_LOCAL(rel))
+ state = &GlobalVisTempRels;
else
state = &GlobalVisDataRels;
@@ -3942,6 +3989,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
GlobalVisDataRels.maybe_needed =
FullXidRelativeTo(horizons->latest_completed,
horizons->data_oldest_nonremovable);
+ GlobalVisTempRels.maybe_needed =
+ FullXidRelativeTo(horizons->latest_completed,
+ horizons->temp_oldest_nonremovable);
/*
* In longer running transactions it's possible that transactions we
@@ -3957,6 +4007,7 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
GlobalVisDataRels.definitely_needed =
FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
GlobalVisDataRels.definitely_needed);
+ GlobalVisTempRels.definitely_needed = GlobalVisTempRels.maybe_needed;
ComputeXidHorizonsResultLastXmin = RecentXmin;
}
diff --git a/src/test/isolation/expected/horizons.out b/src/test/isolation/expected/horizons.out
new file mode 100644
index 00000000000..07bbc9832cd
--- /dev/null
+++ b/src/test/isolation/expected/horizons.out
@@ -0,0 +1,281 @@
+Parsed test spec with 2 sessions
+
+starting permutation: pruner_create_perm ll_start pruner_query_plan pruner_query pruner_query pruner_delete pruner_query pruner_query ll_commit pruner_drop
+step pruner_create_perm:
+ CREATE TABLE horizons_tst (data int unique) WITH (autovacuum_enabled = off);
+ INSERT INTO horizons_tst(data) VALUES(1),(2);
+
+step ll_start:
+ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ SELECT 1;
+
+?column?
+
+1
+step pruner_query_plan:
+ EXPLAIN (COSTS OFF) SELECT * FROM horizons_tst ORDER BY data;
+
+QUERY PLAN
+
+Index Only Scan using horizons_tst_data_key on horizons_tst
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_delete:
+ DELETE FROM horizons_tst;
+
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step ll_commit: COMMIT;
+step pruner_drop:
+ DROP TABLE horizons_tst;
+
+
+starting permutation: pruner_create_temp ll_start pruner_query_plan pruner_query pruner_query pruner_delete pruner_query pruner_query ll_commit pruner_drop
+step pruner_create_temp:
+ CREATE TEMPORARY TABLE horizons_tst (data int unique) WITH (autovacuum_enabled = off);
+ INSERT INTO horizons_tst(data) VALUES(1),(2);
+
+step ll_start:
+ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ SELECT 1;
+
+?column?
+
+1
+step pruner_query_plan:
+ EXPLAIN (COSTS OFF) SELECT * FROM horizons_tst ORDER BY data;
+
+QUERY PLAN
+
+Index Only Scan using horizons_tst_data_key on horizons_tst
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_delete:
+ DELETE FROM horizons_tst;
+
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+0
+step ll_commit: COMMIT;
+step pruner_drop:
+ DROP TABLE horizons_tst;
+
+
+starting permutation: pruner_create_temp ll_start pruner_query pruner_query pruner_begin pruner_delete pruner_query pruner_query ll_commit pruner_commit pruner_drop
+step pruner_create_temp:
+ CREATE TEMPORARY TABLE horizons_tst (data int unique) WITH (autovacuum_enabled = off);
+ INSERT INTO horizons_tst(data) VALUES(1),(2);
+
+step ll_start:
+ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ SELECT 1;
+
+?column?
+
+1
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_begin: BEGIN;
+step pruner_delete:
+ DELETE FROM horizons_tst;
+
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step ll_commit: COMMIT;
+step pruner_commit: COMMIT;
+step pruner_drop:
+ DROP TABLE horizons_tst;
+
+
+starting permutation: pruner_create_perm ll_start pruner_query pruner_query pruner_delete pruner_vacuum pruner_query pruner_query ll_commit pruner_drop
+step pruner_create_perm:
+ CREATE TABLE horizons_tst (data int unique) WITH (autovacuum_enabled = off);
+ INSERT INTO horizons_tst(data) VALUES(1),(2);
+
+step ll_start:
+ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ SELECT 1;
+
+?column?
+
+1
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_delete:
+ DELETE FROM horizons_tst;
+
+step pruner_vacuum:
+ VACUUM horizons_tst;
+
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step ll_commit: COMMIT;
+step pruner_drop:
+ DROP TABLE horizons_tst;
+
+
+starting permutation: pruner_create_temp ll_start pruner_query pruner_query pruner_delete pruner_vacuum pruner_query pruner_query ll_commit pruner_drop
+step pruner_create_temp:
+ CREATE TEMPORARY TABLE horizons_tst (data int unique) WITH (autovacuum_enabled = off);
+ INSERT INTO horizons_tst(data) VALUES(1),(2);
+
+step ll_start:
+ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ SELECT 1;
+
+?column?
+
+1
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+2
+step pruner_delete:
+ DELETE FROM horizons_tst;
+
+step pruner_vacuum:
+ VACUUM horizons_tst;
+
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+0
+step pruner_query:
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+
+?column?
+
+0
+step ll_commit: COMMIT;
+step pruner_drop:
+ DROP TABLE horizons_tst;
+
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index aa386ab1a25..f2e752c4454 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -78,6 +78,7 @@ test: timeouts
test: vacuum-concurrent-drop
test: vacuum-conflict
test: vacuum-skip-locked
+test: horizons
test: predicate-hash
test: predicate-gist
test: predicate-gin
diff --git a/src/test/isolation/specs/horizons.spec b/src/test/isolation/specs/horizons.spec
new file mode 100644
index 00000000000..19cc0a086ae
--- /dev/null
+++ b/src/test/isolation/specs/horizons.spec
@@ -0,0 +1,165 @@
+# Test that pruning and vacuuming pay attention to concurrent sessions
+# in the right way. For normal relations that means that rows cannot
+# be pruned away if there's an older snapshot, in contrast to that
+# temporary tables should nearly always be prunable.
+
+setup {
+ CREATE OR REPLACE FUNCTION explain_json(p_query text)
+ RETURNS json
+ LANGUAGE plpgsql AS $$
+ DECLARE
+ v_ret json;
+ BEGIN
+ EXECUTE p_query INTO STRICT v_ret;
+ RETURN v_ret;
+ END;$$;
+}
+
+teardown {
+ DROP FUNCTION explain_json(text);
+}
+
+session "lifeline"
+
+# Start a transaction, force a snapshot to be held
+step "ll_start"
+{
+ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ SELECT 1;
+}
+
+step "ll_commit" { COMMIT; }
+
+
+session "pruner"
+
+setup
+{
+ SET enable_seqscan = false;
+ SET enable_indexscan = false;
+ SET enable_bitmapscan = false;
+}
+
+step "pruner_create_temp"
+{
+ CREATE TEMPORARY TABLE horizons_tst (data int unique) WITH (autovacuum_enabled = off);
+ INSERT INTO horizons_tst(data) VALUES(1),(2);
+}
+
+step "pruner_create_perm"
+{
+ CREATE TABLE horizons_tst (data int unique) WITH (autovacuum_enabled = off);
+ INSERT INTO horizons_tst(data) VALUES(1),(2);
+}
+
+# Temp tables cannot be dropped in the teardown, so just always do so
+# as part of the permutation
+step "pruner_drop"
+{
+ DROP TABLE horizons_tst;
+}
+
+step "pruner_delete"
+{
+ DELETE FROM horizons_tst;
+}
+
+step "pruner_begin" { BEGIN; }
+step "pruner_commit" { COMMIT; }
+
+step "pruner_vacuum"
+{
+ VACUUM horizons_tst;
+}
+
+# Show the heap fetches of an ordered index-only-scan (other plans
+# have been forbidden above) - that tells us how many non-killed leaf
+# entries there are.
+step "pruner_query"
+{
+ SELECT explain_json($$
+ EXPLAIN (FORMAT json, BUFFERS, ANALYZE)
+ SELECT * FROM horizons_tst ORDER BY data;$$)->0->'Plan'->'Heap Fetches';
+}
+
+# Verify that the query plan still is an IOS
+step "pruner_query_plan"
+{
+ EXPLAIN (COSTS OFF) SELECT * FROM horizons_tst ORDER BY data;
+}
+
+
+# Show that with a permanent relation deleted rows cannot be pruned
+# away if there's a concurrent session still seeing the rows.
+permutation
+ "pruner_create_perm"
+ "ll_start"
+ "pruner_query_plan"
+ # Run query that could do pruning twice, first has chance to prune,
+ # second would not perform heap fetches if first query did.
+ "pruner_query"
+ "pruner_query"
+ "pruner_delete"
+ "pruner_query"
+ "pruner_query"
+ "ll_commit"
+ "pruner_drop"
+
+# Show that with a temporary relation deleted rows can be pruned away,
+# even if there's a concurrent session with a snapshot from before the
+# deletion. That's safe because the session with the older snapshot
+# cannot access the temporary table.
+permutation
+ "pruner_create_temp"
+ "ll_start"
+ "pruner_query_plan"
+ "pruner_query"
+ "pruner_query"
+ "pruner_delete"
+ "pruner_query"
+ "pruner_query"
+ "ll_commit"
+ "pruner_drop"
+
+# Verify that pruning in temporary relations doesn't remove rows still
+# visible in the current session
+permutation
+ "pruner_create_temp"
+ "ll_start"
+ "pruner_query"
+ "pruner_query"
+ "pruner_begin"
+ "pruner_delete"
+ "pruner_query"
+ "pruner_query"
+ "ll_commit"
+ "pruner_commit"
+ "pruner_drop"
+
+# Show that vacuum cannot remove deleted rows still visible to another
+# session's snapshot, when accessing a permanent table.
+permutation
+ "pruner_create_perm"
+ "ll_start"
+ "pruner_query"
+ "pruner_query"
+ "pruner_delete"
+ "pruner_vacuum"
+ "pruner_query"
+ "pruner_query"
+ "ll_commit"
+ "pruner_drop"
+
+# Show that vacuum can remove deleted rows still visible to another
+# session's snapshot, when accessing a temporary table.
+permutation
+ "pruner_create_temp"
+ "ll_start"
+ "pruner_query"
+ "pruner_query"
+ "pruner_delete"
+ "pruner_vacuum"
+ "pruner_query"
+ "pruner_query"
+ "ll_commit"
+ "pruner_drop"
--
2.28.0.651.g306ee63a70
Hi,
On 2020-10-27 20:51:10 -0700, Andres Freund wrote:
On 2020-10-15 01:37:35 -0700, Andres Freund wrote:
Attached is a *prototype* implementation of this concept, which clearly is
lacking some comment work (and is intentionally lacking some
re-indentation). I described my thoughts about how to limit the horizons
for temp tables in
/messages/by-id/20201014203103.72oke6hqywcyhx7s@alap3.anarazel.de
Attached is an updated version of this patch. Quite a bit of polish,
removal of the now-unnecessary isTopLevel arguments added by commit
a7212be8b9e, and a change to initialize the temp table horizons to
latestCompletedXid + 1 instead of just latestCompletedXid when no xid is
assigned.
Besides comments this probably mainly needs a bit more tests around temp
table vacuuming. Should have at least an isolation test that verifies
that temp table rows can be a) vacuumed b) pruned away in the presence
of other sessions with xids.
I added an isolationtester test for this. It verifies that dead rows in
temp tables get vacuumed and pruned despite concurrent sessions having
older snapshots. It does so by forcing an IOS and checking the number of
heap fetches reported by EXPLAIN. I also added a companion test for
permanent relations, ensuring that such rows do not get removed.
Any comments? Otherwise I'll push that patch tomorrow.
Just pushed this. Let's see what the BF says...
It's kinda cool how much more aggressive hot pruning / killtuples now is
for temp tables.
Greetings,
Andres Freund
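For readers who want to reproduce the pruning behavior interactively rather than via isolationtester, a minimal two-session sketch following the committed spec (table and settings here mirror the test; `Heap Fetches` dropping to 0 on the repeated EXPLAIN indicates the dead index entries were killed despite the other session's old snapshot):

```sql
-- Session 1: hold a snapshot that would normally block pruning.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT 1;

-- Session 2: dead rows in a temp table can be pruned anyway.
CREATE TEMPORARY TABLE horizons_demo (data int UNIQUE)
    WITH (autovacuum_enabled = off);
INSERT INTO horizons_demo VALUES (1), (2);
-- Force an index-only scan, as the isolation spec does.
SET enable_seqscan = off;
SET enable_indexscan = off;
SET enable_bitmapscan = off;
DELETE FROM horizons_demo;
-- The first run has a chance to prune / kill index tuples; the second
-- should then report "Heap Fetches: 0", even though session 1 still
-- holds its old snapshot.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM horizons_demo ORDER BY data;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM horizons_demo ORDER BY data;
```

Running the same sequence against a permanent table should leave the heap fetches unchanged until session 1 commits.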
On 2020-10-28 18:13:44 -0700, Andres Freund wrote:
Just pushed this. Let's see what the BF says...
It says that apparently something is unstable about my new test. It
first passed on a few animals, but then failed a lot in a row. Looking.
Hi,
On 2020-10-28 19:09:14 -0700, Andres Freund wrote:
On 2020-10-28 18:13:44 -0700, Andres Freund wrote:
Just pushed this. Let's see what the BF says...
It says that apparently something is unstable about my new test. It
first passed on a few animals, but then failed a lot in a row. Looking.
The differentiating factor is force_parallel_mode=regress.
Ugh, this is nasty: The problem is that we can end up computing the
horizons the first time before MyDatabaseId is even set. Which leads us
to compute a too aggressive horizon for plain tables, because we skip
over them, as MyDatabaseId still is InvalidOid:
/*
* Normally queries in other databases are ignored for anything but
* the shared horizon. But in recovery we cannot compute an accurate
* per-database horizon as all xids are managed via the
* KnownAssignedXids machinery.
*/
if (in_recovery ||
proc->databaseId == MyDatabaseId ||
proc->databaseId == 0) /* always include WalSender */
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, xmin);
That then subsequently leads us to consider a row fully dead in
heap_hot_search_buffers(), triggering the killtuples logic and causing
the test to fail.
With force_parallel_mode=regress we constantly start parallel workers,
which makes it much more likely that this case is hit.
It's trivial to fix luckily...
Greetings,
Andres Freund
Hi,
On 2020-10-28 21:00:30 -0700, Andres Freund wrote:
On 2020-10-28 19:09:14 -0700, Andres Freund wrote:
On 2020-10-28 18:13:44 -0700, Andres Freund wrote:
Just pushed this. Let's see what the BF says...
It says that apparently something is unstable about my new test. It
first passed on a few animals, but then failed a lot in a row. Looking.
The differentiating factor is force_parallel_mode=regress.
Ugh, this is nasty: The problem is that we can end up computing the
horizons the first time before MyDatabaseId is even set. Which leads us
to compute a too aggressive horizon for plain tables, because we skip
over them, as MyDatabaseId still is InvalidOid:
/*
* Normally queries in other databases are ignored for anything but
* the shared horizon. But in recovery we cannot compute an accurate
* per-database horizon as all xids are managed via the
* KnownAssignedXids machinery.
*/
if (in_recovery ||
proc->databaseId == MyDatabaseId ||
proc->databaseId == 0) /* always include WalSender */
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, xmin);
That then subsequently leads us to consider a row fully dead in
heap_hot_search_buffers(), triggering the killtuples logic and causing
the test to fail.
With force_parallel_mode=regress we constantly start parallel workers,
which makes it much more likely that this case is hit.
It's trivial to fix luckily...
Pushed that fix, hopefully that calms the BF.
Greetings,
Andres Freund
Hi hackers!
Does anyone maintain opensource pg_surgery analogs for released versions of PG?
It seems to me I'll have to use something like this, and I just thought that I should consider pg_surgery in favour of our pg_dirty_hands.
Thanks!
Best regards, Andrey Borodin.
On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Does anyone maintain opensource pg_surgery analogs for released versions of PG?
It seems to me I'll have to use something like this, and I just thought that I should consider pg_surgery in favour of our pg_dirty_hands.
I do not. I'm still of the opinion that we ought to back-patch
pg_surgery. This didn't attract a consensus before, and it's hard to
dispute that it's a new feature in what would be a back branch. But
it's unclear to me how users are otherwise supposed to recover from
some of the bugs that are or have been present in those back branches.
I'm not sure that I see the logic in telling people we'll try to
prevent them from getting hosed in the future but if they're already
hosed they can wait for v14 to fix it. They can't wait that long, and
a dump-and-restore cycle is awfully painful.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Jan 18, 2021 at 08:54:10AM -0500, Robert Haas wrote:
On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Does anyone maintain opensource pg_surgery analogs for released
versions of PG? It seems to me I'll have to use something like this
and I just though that I should consider pg_surgery in favour of our
pg_dirty_hands.I do not. I'm still of the opinion that we ought to back-patch
pg_surgery. This didn't attract a consensus before, and it's hard to
dispute that it's a new feature in what would be a back branch. But
it's unclear to me how users are otherwise supposed to recover from
some of the bugs that are or have been present in those back branches.
One other possibility would be to push a version of pg_surgery that is
compatible with the back-branches somewhere external (e.g. either
git.postgresql.org and/or GitHub), so that it can be picked up by
distributions and/or individual users in need.
That is assuming it does not need assorted server changes to go with it; I
did not read the thread in detail, but I was under the impression it is a
client program?
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
On Mon, Jan 18, 2021 at 9:25 AM Michael Banck <michael.banck@credativ.de> wrote:
One other possiblity would be to push a version of pg_surgery that is
compatible with the back-branches somewhere external (e.g. either
git.postgresql.org and/or Github), so that it can be picked up by
distributions and/or individual users in need.
Sure, but I don't see how that's better.
That is Assuming it does not need assorted server changes to go with; I
did not read the thread in detail but I was under the assumption it is a
client program?
It's a server extension. It does not require core changes.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Jan 18, 2021, at 18:54, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Does anyone maintain opensource pg_surgery analogs for released versions of PG?
It seems to me I'll have to use something like this and I just though that I should consider pg_surgery in favour of our pg_dirty_hands.I do not. I'm still of the opinion that we ought to back-patch
pg_surgery.
+1.
Yesterday I spent a few hours packaging pg_dirty_hands and pg_surgery (BTW, it works fine for 12).
It's a kind of 911 tool: one doesn't think they will need it until they actually do, and the clock is ticking.
OTOH, it opens new ways to shoot oneself in the foot.
Best regards, Andrey Borodin.
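For anyone weighing pg_surgery against home-grown tools like pg_dirty_hands, its surface area is small: two SQL-callable functions taking a relation and an array of TIDs. A rough usage sketch (the table name and TIDs below are placeholders; in practice you would take the ctid from the "found xmin ... from before relfrozenxid" error context or from a scan filtered on ctid):

```sql
CREATE EXTENSION pg_surgery;

-- Forcibly freeze the damaged tuples, making them visible to everyone
-- regardless of their (corrupt) xmin:
SELECT heap_force_freeze('mytab'::regclass, ARRAY['(0,1)']::tid[]);

-- Or forcibly mark them dead so a subsequent VACUUM can remove them:
SELECT heap_force_kill('mytab'::regclass, ARRAY['(0,2)']::tid[]);
```

Either call rewrites tuple headers in place, so as noted upthread it is very much a last-resort repair, to be followed by a full dump/restore or at least thorough validation.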
On Mon, Sep 21, 2020 at 1:11 PM Andres Freund <andres@anarazel.de> wrote:
I'm not sure I really understand how that's happening, because surely
HOT chains and non-HOT chains are pruned by the same code, but it
doesn't sound good.Not necessarily, unfortunately:
case HEAPTUPLE_DEAD:
So if, e.g., a transaction aborts between the heap_page_prune() call and
this check, the pruning behaviour depends on whether the tuple is part of
a HOT chain or not.
I have a proposal that includes removing this "tupgone = true" special case:
/messages/by-id/CAH2-Wzm7Y=_g3FjVHv7a85AfUbuSYdggDnEqN1hodVeOctL+Ow@mail.gmail.com
Of course this won't change the fact that vacuumlazy.c can disagree
with pruning about what is dead -- that is a necessary consequence of
having multiple HTSV calls for the same tuple in vacuumlazy.c (it can
change in the presence of concurrently aborted transactions). But
removing the "tupgone = true" special case does seem much more
consistent, and simpler overall. We have lots of code that is only
needed to make that special case work. For example, the whole idea of
a dedicated XLOG_HEAP2_CLEANUP_INFO record for recovery conflicts can
go -- we can get by with only XLOG_HEAP2_CLEAN records from pruning,
IIRC.
Have I missed something? The special case in question seems pretty
awful to me, so I have to wonder why somebody else didn't remove it
long ago...
--
Peter Geoghegan