Idea for improving buildfarm robustness

Started by Tom Lane · over 10 years ago · 28 messages · pgsql-hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

A problem the buildfarm has had for a long time is that if for some reason
the scripts fail to stop a test postmaster, the postmaster process will
hang around and cause subsequent runs to fail because of socket conflicts.
This seems to have gotten a lot worse lately due to the influx of very
slow buildfarm machines, but the risk has always been there.

I've been thinking about teaching the buildfarm script to "kill -9"
any postmasters left around at the end of the run, but that's fairly
problematic: how do you find such processes (since "ps" output isn't
hugely portable, especially not to Windows), and how do you tell them
apart from postmasters not started by the script? So the idea was on
hold.

But today I thought of another way: suppose that we teach the postmaster
to commit hara-kiri if the $PGDATA directory goes away. Since the
buildfarm script definitely does remove all the temporary data directories
it creates, this ought to get the job done.

An easy way to do that would be to have it check every so often if
pg_control can still be read. We should not have it fail on ENFILE or
EMFILE, since that would create a new failure hazard under heavy load,
but ENOENT or similar would be reasonable grounds for deciding that
something is horribly broken. (At least on Windows, failing on EPERM
doesn't seem wise either, since we've seen antivirus products randomly
causing such errors.)
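A minimal sketch of such a check (hypothetical names, not the actual patch): try to open global/pg_control read-only, and treat only "file is gone" errnos as grounds to die, surviving fd exhaustion and permission glitches.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* errnos that plausibly mean pg_control really has been removed */
static bool
errno_means_deleted(int err)
{
    return err == ENOENT || err == ENOTDIR;
}

/* true while the data directory still looks alive */
static bool
data_dir_still_there(const char *pg_control_path)
{
    int fd = open(pg_control_path, O_RDONLY);

    if (fd >= 0)
    {
        close(fd);
        return true;
    }
    /* ENFILE/EMFILE (table full) or EPERM (e.g. antivirus) are not fatal */
    return !errno_means_deleted(errno);
}
```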

I wouldn't want to do this every time through the postmaster's main loop,
but we could do this once an hour for no added cost by adding the check
where it does TouchSocketLockFiles; or once every few minutes if we
carried a separate variable like last_touch_time. Once an hour would be
plenty to fix the buildfarm's problem, I should think.
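The decoupled-timestamp variant could look roughly like this (variable and interval names are invented here, by analogy with last_touch_time):

```c
#include <stdbool.h>
#include <time.h>

#define DATADIR_CHECK_INTERVAL 60   /* seconds; i.e. once a minute */

static time_t last_datadir_check = 0;

/* called from the postmaster's main loop; true when it's time to recheck */
static bool
datadir_check_due(time_t now)
{
    if (now - last_datadir_check >= DATADIR_CHECK_INTERVAL)
    {
        last_datadir_check = now;
        return true;
    }
    return false;
}
```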

Another question is what exactly "commit hara-kiri" should consist of.
We could just abort() or _exit(1) and leave it to child processes to
notice that the postmaster is gone, or we could make an effort to clean
up. I'd be a bit inclined to treat it like a SIGQUIT situation, ie
kill all the children and exit. The children are probably having
problems of their own if the data directory's gone, so forcing
termination might be best to keep them from getting stuck.

Also, perhaps we'd only enable this behavior in --enable-cassert builds,
to avoid any risk of a postmaster incorrectly choosing to suicide in a
production scenario. Or maybe that's overly conservative.

Thoughts?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#1)
Re: Idea for improving buildfarm robustness

On 09/29/2015 11:48 AM, Tom Lane wrote:

> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away. Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.

This would also be useful for production. I can't count the number of
times I've accidentally blown away a replica's PGDATA without shutting
the postmaster down first, and then had to do a bunch of kill -9.

In general, having the postmaster survive deletion of PGDATA is
suboptimal. In rare cases of having it survive installation of a new
PGDATA (via PITR restore, for example), I've even seen the zombie
postmaster corrupt the data files.

So if you want this change to be useful beyond the buildfarm, it should
check every few minutes, and you'd SIGQUIT.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#3 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#1)
Re: Idea for improving buildfarm robustness

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away. Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.

Yes, please.

> An easy way to do that would be to have it check every so often if
> pg_control can still be read. We should not have it fail on ENFILE or
> EMFILE, since that would create a new failure hazard under heavy load,
> but ENOENT or similar would be reasonable grounds for deciding that
> something is horribly broken. (At least on Windows, failing on EPERM
> doesn't seem wise either, since we've seen antivirus products randomly
> causing such errors.)

Sounds pretty reasonable to me.

> I wouldn't want to do this every time through the postmaster's main loop,
> but we could do this once an hour for no added cost by adding the check
> where it does TouchSocketLockFiles; or once every few minutes if we
> carried a separate variable like last_touch_time. Once an hour would be
> plenty to fix the buildfarm's problem, I should think.

I have a bad (?) habit of doing exactly this during development and
would really like it to be a bit more often than once/hour, unless
there's a particular problem with that.

> Another question is what exactly "commit hara-kiri" should consist of.
> We could just abort() or _exit(1) and leave it to child processes to
> notice that the postmaster is gone, or we could make an effort to clean
> up. I'd be a bit inclined to treat it like a SIGQUIT situation, ie
> kill all the children and exit. The children are probably having
> problems of their own if the data directory's gone, so forcing
> termination might be best to keep them from getting stuck.

I like the idea of killing all the children and then exiting.

> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
> to avoid any risk of a postmaster incorrectly choosing to suicide in a
> production scenario. Or maybe that's overly conservative.

That would work for my use-case. Perhaps only on --enable-cassert
builds for back-branches but enable it in master and see how things go
for 9.6? I agree that it feels overly conservative, but given our
recent history, we should be overly cautious with the back branches.

> Thoughts?

Thanks!

Stephen

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#3)
Re: Idea for improving buildfarm robustness

Stephen Frost <sfrost@snowman.net> writes:

> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> I wouldn't want to do this every time through the postmaster's main loop,
>> but we could do this once an hour for no added cost by adding the check
>> where it does TouchSocketLockFiles; or once every few minutes if we
>> carried a separate variable like last_touch_time. Once an hour would be
>> plenty to fix the buildfarm's problem, I should think.
>
> I have a bad (?) habit of doing exactly this during development and
> would really like it to be a bit more often than once/hour, unless
> there's a particular problem with that.

Yeah, Josh mentioned the same. It would only take another three or four
lines of code to decouple it from TouchSocketLockFiles, and then it's
just a question of how much are you worried about the performance cost of
additional file-open attempts. I think either one-minute or five-minute
intervals would be pretty defensible.

regards, tom lane


#5 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#1)
Re: Idea for improving buildfarm robustness

On 09/29/2015 02:48 PM, Tom Lane wrote:

> A problem the buildfarm has had for a long time is that if for some reason
> the scripts fail to stop a test postmaster, the postmaster process will
> hang around and cause subsequent runs to fail because of socket conflicts.
> This seems to have gotten a lot worse lately due to the influx of very
> slow buildfarm machines, but the risk has always been there.
>
> I've been thinking about teaching the buildfarm script to "kill -9"
> any postmasters left around at the end of the run, but that's fairly
> problematic: how do you find such processes (since "ps" output isn't
> hugely portable, especially not to Windows), and how do you tell them
> apart from postmasters not started by the script? So the idea was on
> hold.
>
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away. Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.
>
> An easy way to do that would be to have it check every so often if
> pg_control can still be read. We should not have it fail on ENFILE or
> EMFILE, since that would create a new failure hazard under heavy load,
> but ENOENT or similar would be reasonable grounds for deciding that
> something is horribly broken. (At least on Windows, failing on EPERM
> doesn't seem wise either, since we've seen antivirus products randomly
> causing such errors.)
>
> I wouldn't want to do this every time through the postmaster's main loop,
> but we could do this once an hour for no added cost by adding the check
> where it does TouchSocketLockFiles; or once every few minutes if we
> carried a separate variable like last_touch_time. Once an hour would be
> plenty to fix the buildfarm's problem, I should think.
>
> Another question is what exactly "commit hara-kiri" should consist of.
> We could just abort() or _exit(1) and leave it to child processes to
> notice that the postmaster is gone, or we could make an effort to clean
> up. I'd be a bit inclined to treat it like a SIGQUIT situation, ie
> kill all the children and exit. The children are probably having
> problems of their own if the data directory's gone, so forcing
> termination might be best to keep them from getting stuck.
>
> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
> to avoid any risk of a postmaster incorrectly choosing to suicide in a
> production scenario. Or maybe that's overly conservative.
>
> Thoughts?

It's a fine idea. This is much more likely to be robust than any
buildfarm client fix.

Not every buildfarm member uses cassert, so I'm not sure that's the best
way to go. axolotl doesn't, and it's one of those that regularly has
speed problems. Maybe a not-very-well-publicized GUC, or an environment
setting? Or maybe just enable this anyway. If the data directory is gone
what's the point in keeping the postmaster around? Shutting it down
doesn't seem likely to cause any damage.

cheers

andrew


#6 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#4)
Re: Idea for improving buildfarm robustness

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

> Stephen Frost <sfrost@snowman.net> writes:
>> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>>> I wouldn't want to do this every time through the postmaster's main loop,
>>> but we could do this once an hour for no added cost by adding the check
>>> where it does TouchSocketLockFiles; or once every few minutes if we
>>> carried a separate variable like last_touch_time. Once an hour would be
>>> plenty to fix the buildfarm's problem, I should think.
>>
>> I have a bad (?) habit of doing exactly this during development and
>> would really like it to be a bit more often than once/hour, unless
>> there's a particular problem with that.
>
> Yeah, Josh mentioned the same. It would only take another three or four
> lines of code to decouple it from TouchSocketLockFiles, and then it's
> just a question of how much are you worried about the performance cost of
> additional file-open attempts. I think either one-minute or five-minute
> intervals would be pretty defensible.

Perhaps I'm missing something, but it doesn't strike me as a terribly
expensive operation, and once a minute would work out quite well for my
needs, at least.

Running for long after pg_control has disappeared doesn't strike me as a
great idea anyway..

Thanks!

Stephen

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: Idea for improving buildfarm robustness

Andrew Dunstan <andrew@dunslane.net> writes:

> On 09/29/2015 02:48 PM, Tom Lane wrote:
>> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
>> to avoid any risk of a postmaster incorrectly choosing to suicide in a
>> production scenario. Or maybe that's overly conservative.
>
> Not every buildfarm member uses cassert, so I'm not sure that's the best
> way to go. axolotl doesn't, and it's one of those that regularly has
> speed problems. Maybe a not-very-well-publicized GUC, or an environment
> setting? Or maybe just enable this anyway. If the data directory is gone
> what's the point in keeping the postmaster around? Shutting it down
> doesn't seem likely to cause any damage.

The only argument I can see against just turning it on all the time is
the possibility of false positives. I mentioned ENFILE and EPERM as
foreseeable false-positive conditions, and I'm worried that there might be
others. It might be good if we have a small list of specific errnos that
cause us to conclude we should die, rather than a small list that cause us
not to. But as long as we're reasonably confident that we're seeing an
error that means somebody deleted pg_control, I think abandoning ship
is just fine.

regards, tom lane


#8 Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#1)
Re: Idea for improving buildfarm robustness

On 09/29/2015 12:18 PM, Tom Lane wrote:

> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 09/29/2015 02:48 PM, Tom Lane wrote:
>>> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
>>> to avoid any risk of a postmaster incorrectly choosing to suicide in a
>>> production scenario. Or maybe that's overly conservative.
>>
>> Not every buildfarm member uses cassert, so I'm not sure that's the best
>> way to go. axolotl doesn't, and it's one of those that regularly has
>> speed problems. Maybe a not-very-well-publicized GUC, or an environment
>> setting? Or maybe just enable this anyway. If the data directory is gone
>> what's the point in keeping the postmaster around? Shutting it down
>> doesn't seem likely to cause any damage.
>
> The only argument I can see against just turning it on all the time is
> the possibility of false positives. I mentioned ENFILE and EPERM as
> foreseeable false-positive conditions, and I'm worried that there might be
> others. It might be good if we have a small list of specific errnos that
> cause us to conclude we should die, rather than a small list that cause us
> not to. But as long as we're reasonably confident that we're seeing an
> error that means somebody deleted pg_control, I think abandoning ship
> is just fine.

Give me source with the change, and I'll put it on a cheap, low-bandwidth
AWS instance and hammer the heck out of it. That should raise any false
positives we can expect.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#2)
Re: Idea for improving buildfarm robustness

Josh Berkus <josh@agliodbs.com> writes:

> On 09/29/2015 11:48 AM, Tom Lane wrote:
>> But today I thought of another way: suppose that we teach the postmaster
>> to commit hara-kiri if the $PGDATA directory goes away. Since the
>> buildfarm script definitely does remove all the temporary data directories
>> it creates, this ought to get the job done.
>
> This would also be useful for production. I can't count the number of
> times I've accidentally blown away a replica's PGDATA without shutting
> the postmaster down first, and then had to do a bunch of kill -9.
>
> In general, having the postmaster survive deletion of PGDATA is
> suboptimal. In rare cases of having it survive installation of a new
> PGDATA (via PITR restore, for example), I've even seen the zombie
> postmaster corrupt the data files.

Side comment on that: if you'd actually removed $PGDATA, I can't see how
that would happen. The postmaster and children would have open CWD
handles to the now-disconnected-from-anything-else directory inode,
which would not enable them to reach files created under the new directory
inode. (They don't ever use absolute paths, only relative, or at least
that's the way it's supposed to work.)

However ... if you'd simply deleted everything *under* $PGDATA but not
that directory itself, then this type of failure mode is 100% plausible.
And that's not an unreasonable thing to do, especially if you've set
things up so that $PGDATA's parent is not a writable directory.

Testing accessibility of "global/pg_control" would be enough to catch this
case, but only if we do it before you create a new one. So that seems
like an argument for making the test relatively often. The once-a-minute
option is sounding better and better.

We could possibly add additional checks, like trying to verify that
pg_control has the same inode number it used to. But I'm afraid that
would add portability issues and false-positive hazards that would
outweigh the value.

regards, tom lane


#10 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#9)
Re: Idea for improving buildfarm robustness

Tom Lane wrote:

> Testing accessibility of "global/pg_control" would be enough to catch this
> case, but only if we do it before you create a new one. So that seems
> like an argument for making the test relatively often. The once-a-minute
> option is sounding better and better.

If we weren't afraid of portability issues or checks that only work on
certain platforms, we could use inotify on linux and get it to signal
postmaster when pg_control is deleted. There are various
implementations of similar things in different platforms (kqueues on
BSD, surely there's gotta be something in Linux) -- though admittedly
that code may quickly become worse than the select/poll loops (which are
ugly enough). Maybe it'd be okay if we just use a descriptor that sets
the process latch when signalled.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#11 Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#9)
Re: Idea for improving buildfarm robustness

On 09/29/2015 12:47 PM, Tom Lane wrote:

> We could possibly add additional checks, like trying to verify that
> pg_control has the same inode number it used to. But I'm afraid that
> would add portability issues and false-positive hazards that would
> outweigh the value.

Not sure you remember the incident, but I think years ago that would
have saved me some heartache ;-)

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

#12 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Joe Conway (#11)
Re: Idea for improving buildfarm robustness

Joe Conway wrote:

> On 09/29/2015 12:47 PM, Tom Lane wrote:
>> We could possibly add additional checks, like trying to verify that
>> pg_control has the same inode number it used to. But I'm afraid that
>> would add portability issues and false-positive hazards that would
>> outweigh the value.
>
> Not sure you remember the incident, but I think years ago that would
> have saved me some heartache ;-)

I remember it, but I'm not sure it would have helped you. As I recall,
your trouble was that after a reboot the init script decided to initdb
the mount point -- postmaster wouldn't have been running at all ...

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#13 Joe Conway
mail@joeconway.com
In reply to: Alvaro Herrera (#12)
Re: Idea for improving buildfarm robustness

On 09/29/2015 01:48 PM, Alvaro Herrera wrote:

> Joe Conway wrote:
>> On 09/29/2015 12:47 PM, Tom Lane wrote:
>>> We could possibly add additional checks, like trying to verify that
>>> pg_control has the same inode number it used to. But I'm afraid that
>>> would add portability issues and false-positive hazards that would
>>> outweigh the value.
>>
>> Not sure you remember the incident, but I think years ago that would
>> have saved me some heartache ;-)
>
> I remember it, but I'm not sure it would have helped you. As I recall,
> your trouble was that after a reboot the init script decided to initdb
> the mount point -- postmaster wouldn't have been running at all ...

Right, which the init script no longer does as far as I'm aware, so
hopefully it will never happen again to anyone.

But it was still a case where the postmaster started on one copy of
PGDATA (the newly init'd copy), and then the contents of the real PGDATA
was swapped in (when the filesystem was finally mounted), causing
corruption to the production data.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

#14 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Joe Conway (#13)
Re: Idea for improving buildfarm robustness

Joe Conway wrote:

> On 09/29/2015 01:48 PM, Alvaro Herrera wrote:
>> I remember it, but I'm not sure it would have helped you. As I recall,
>> your trouble was that after a reboot the init script decided to initdb
>> the mount point -- postmaster wouldn't have been running at all ...
>
> Right, which the init script no longer does as far as I'm aware, so
> hopefully will never happen again to anyone.

Yeah.

> But it was still a case where the postmaster started on one copy of
> PGDATA (the newly init'd copy), and then the contents of the real PGDATA
> was swapped in (when the filesystem was finally mounted), causing
> corruption to the production data.

Ah, I didn't remember that part of it, but it makes sense.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#8)
Re: Idea for improving buildfarm robustness

Josh Berkus <josh@agliodbs.com> writes:

> Give me source with the change, and I'll put it on a cheap, low-bandwidth
> AWS instance and hammer the heck out of it. That should raise any false
> positives we can expect.

Here's a draft patch against HEAD (looks like it will work on 9.5 or
9.4 without modifications, too).

regards, tom lane

Attachments:

die-on-no-pgdata.patch (text/x-diff, charset=us-ascii) +85 −15
#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#15)
Re: Idea for improving buildfarm robustness

I wrote:

> Josh Berkus <josh@agliodbs.com> writes:
>> Give me source with the change, and I'll put it on a cheap, low-bandwidth
>> AWS instance and hammer the heck out of it. That should raise any false
>> positives we can expect.
>
> Here's a draft patch against HEAD (looks like it will work on 9.5 or
> 9.4 without modifications, too).

BTW: in addition to whatever AWS testing Josh has in mind, it'd be good if
someone tried it on Windows. AFAIK, the self-kill() should work in the
postmaster on Windows, but that should be checked. Also, does the set of
errnos it checks cover typical deletion cases on Windows? Try both
removal of $PGDATA in toto and removal of just pg_control or just global/.

regards, tom lane


#17 Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#1)
Re: Idea for improving buildfarm robustness

On 09/29/2015 12:47 PM, Tom Lane wrote:

> Josh Berkus <josh@agliodbs.com> writes:
>> In general, having the postmaster survive deletion of PGDATA is
>> suboptimal. In rare cases of having it survive installation of a new
>> PGDATA (via PITR restore, for example), I've even seen the zombie
>> postmaster corrupt the data files.
>
> However ... if you'd simply deleted everything *under* $PGDATA but not
> that directory itself, then this type of failure mode is 100% plausible.
> And that's not an unreasonable thing to do, especially if you've set
> things up so that $PGDATA's parent is not a writable directory.

I don't remember the exact setup, but this is likely the case. Probably
1/3 of the systems I monitor have a root-owned mount point for PGDATA's
parent directory.

> Testing accessibility of "global/pg_control" would be enough to catch this
> case, but only if we do it before you create a new one. So that seems
> like an argument for making the test relatively often. The once-a-minute
> option is sounding better and better.
>
> We could possibly add additional checks, like trying to verify that
> pg_control has the same inode number it used to. But I'm afraid that
> would add portability issues and false-positive hazards that would
> outweigh the value.

It's not worth doing extra stuff for this.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#18 Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#16)
Re: Idea for improving buildfarm robustness

On Wed, Sep 30, 2015 at 7:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I wrote:
>> Josh Berkus <josh@agliodbs.com> writes:
>>> Give me source with the change, and I'll put it on a cheap, low-bandwidth
>>> AWS instance and hammer the heck out of it. That should raise any false
>>> positives we can expect.
>>
>> Here's a draft patch against HEAD (looks like it will work on 9.5 or
>> 9.4 without modifications, too).
>
> BTW: in addition to whatever AWS testing Josh has in mind, it'd be good if
> someone tried it on Windows. AFAIK, the self-kill() should work in the
> postmaster on Windows, but that should be checked. Also, does the set of
> errnos it checks cover typical deletion cases on Windows? Try both
> removal of $PGDATA in toto and removal of just pg_control or just global/.

Just tested on Windows, and this is working fine for me. It seems to
me as well that looking only for ENOENT and ENOTDIR is fine (here is
what I looked at for reference, note the extra EXDEV or STRUNCATE for
example with MS 2015):
https://msdn.microsoft.com/en-us/library/5814770t.aspx
--
Michael


#19 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Alvaro Herrera (#14)
Re: Idea for improving buildfarm robustness

On 9/29/15 4:13 PM, Alvaro Herrera wrote:

> Joe Conway wrote:
>> On 09/29/2015 01:48 PM, Alvaro Herrera wrote:
>>> I remember it, but I'm not sure it would have helped you. As I recall,
>>> your trouble was that after a reboot the init script decided to initdb
>>> the mount point -- postmaster wouldn't have been running at all ...
>>
>> Right, which the init script no longer does as far as I'm aware, so
>> hopefully will never happen again to anyone.
>
> Yeah.
>
>> But it was still a case where the postmaster started on one copy of
>> PGDATA (the newly init'd copy), and then the contents of the real PGDATA
>> was swapped in (when the filesystem was finally mounted), causing
>> corruption to the production data.
>
> Ah, I didn't remember that part of it, but it makes sense.

Ouch. So it sounds like there's value to seeing if pg_control isn't what
we expect it to be.

Instead of looking at the inode (portability problem), what if
pg_control contained a random number that was created at initdb time? On
startup postmaster would read that value and then if it ever changed
after that you'd know something just went wrong.

Perhaps even stronger would be to write a new random value on startup;
that way you'd know if an old copy accidentally got put in place.
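A toy sketch of this random-token idea (the file layout and helper names are invented here, not a proposal for pg_control's actual format): stamp a token at startup, and a swapped-in copy is detected because its token won't match.

```c
#include <stdbool.h>
#include <stdio.h>

/* write the instance token; in the real thing this would live in pg_control */
static bool
write_instance_token(const char *path, unsigned long token)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return false;
    fprintf(f, "%lu\n", token);
    fclose(f);
    return true;
}

/* true only if the file exists and still carries the expected token */
static bool
instance_token_matches(const char *path, unsigned long expected)
{
    FILE *f = fopen(path, "r");
    unsigned long got;
    bool ok;

    if (f == NULL)
        return false;
    ok = (fscanf(f, "%lu", &got) == 1 && got == expected);
    fclose(f);
    return ok;
}
```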
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


#20 Andrew Dunstan
andrew@dunslane.net
In reply to: Michael Paquier (#18)
Re: Idea for improving buildfarm robustness

On 09/30/2015 01:18 AM, Michael Paquier wrote:

> On Wed, Sep 30, 2015 at 7:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wrote:
>>> Josh Berkus <josh@agliodbs.com> writes:
>>>> Give me source with the change, and I'll put it on a cheap, low-bandwidth
>>>> AWS instance and hammer the heck out of it. That should raise any false
>>>> positives we can expect.
>>>
>>> Here's a draft patch against HEAD (looks like it will work on 9.5 or
>>> 9.4 without modifications, too).
>>
>> BTW: in addition to whatever AWS testing Josh has in mind, it'd be good if
>> someone tried it on Windows. AFAIK, the self-kill() should work in the
>> postmaster on Windows, but that should be checked. Also, does the set of
>> errnos it checks cover typical deletion cases on Windows? Try both
>> removal of $PGDATA in toto and removal of just pg_control or just global/.
>
> Just tested on Windows, and this is working fine for me. It seems to
> me as well that looking only for ENOENT and ENOTDIR is fine (here is
> what I looked at for reference, note the extra EXDEV or STRUNCATE for
> example with MS 2015):
> https://msdn.microsoft.com/en-us/library/5814770t.aspx

Incidentally, AWS and Windows are not mutually exclusive. I used an AWS
Windows instance the other day when I validated the instructions for
building with Mingw.

cheers

andrew


#21 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#19)

#22 Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#1)

#23 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#21)

#24 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#23)

#25 Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#24)

#26 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#25)

#27 Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#1)

#28 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#27)