Something flaky in the "relfilenode mapping" infrastructure

Started by Tom Laneabout 12 years ago17 messageshackers

tgl@sss.pgh.pa.us

about 12 years ago

Buildfarm member prairiedog thinks there's something unreliable about
commit f01d1ae3a104019d6d68aeff85c4816a275130b3:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-03-27%2008%3A12%3A11

================== pgsql.13462/src/test/regress/regression.diffs ===================
*** /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/expected/alter_table.out	Thu Mar 27 04:12:40 2014
--- /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/results/alter_table.out	Thu Mar 27 04:52:02 2014
***************
*** 2333,2339 ****
      ) mapped;
   incorrectly_mapped | have_mappings 
  --------------------+---------------
!                   0 | t
  (1 row)

  -- Checks on creating and manipulation of user defined relations in
--- 2333,2339 ----
      ) mapped;
   incorrectly_mapped | have_mappings 
  --------------------+---------------
!                   1 | t
  (1 row)

-- Checks on creating and manipulation of user defined relations in

======================================================================

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

about 12 years ago

In reply to: Tom Lane (#1)

Re: Something flaky in the "relfilenode mapping" infrastructure

Hi,

On 2014-03-27 08:02:35 -0400, Tom Lane wrote:

Buildfarm member prairiedog thinks there's something unreliable about
commit f01d1ae3a104019d6d68aeff85c4816a275130b3:

*** /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/expected/alter_table.out	Thu Mar 27 04:12:40 2014
--- /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/results/alter_table.out	Thu Mar 27 04:52:02 2014
***************
*** 2333,2339 ****
) mapped;
incorrectly_mapped | have_mappings 
--------------------+---------------
!                   0 | t
(1 row)

That's rather odd. It has survived for a couple of months on the other
buildfarm animals now... Could one of you apply the attached patch
adding more details to eventual failures?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Tom Lane

tgl@sss.pgh.pa.us

about 12 years ago

In reply to: Andres Freund (#2)

Re: Something flaky in the "relfilenode mapping" infrastructure

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-03-27 08:02:35 -0400, Tom Lane wrote:

Buildfarm member prairiedog thinks there's something unreliable about
commit f01d1ae3a104019d6d68aeff85c4816a275130b3:

That's rather odd. It has survived for a couple of months on the other
buildfarm animals now... Could one of you apply the attached patch
adding more details to eventual failures?

Any objection to separating out the have_mappings bit? It wasn't terribly
appropriate before, but it seems really out of place in this formulation.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

about 12 years ago

In reply to: Tom Lane (#3)

Re: Something flaky in the "relfilenode mapping" infrastructure

On 2014-03-28 16:41:15 -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-03-27 08:02:35 -0400, Tom Lane wrote:

Buildfarm member prairiedog thinks there's something unreliable about
commit f01d1ae3a104019d6d68aeff85c4816a275130b3:

That's rather odd. It has survived for a couple of months on the other
buildfarm animals now... Could one of you apply the attached patch
adding more details to eventual failures?

Any objection to separating out the have_mappings bit? It wasn't terribly
appropriate before, but it seems really out of place in this formulation.

The patch I sent removed the have_mapping thing entirely? Do you mean it
should be there, but as a separate query?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

about 12 years ago

In reply to: Andres Freund (#4)

Re: Something flaky in the "relfilenode mapping" infrastructure

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-03-28 16:41:15 -0400, Tom Lane wrote:

Any objection to separating out the have_mappings bit? It wasn't terribly
appropriate before, but it seems really out of place in this formulation.

The patch I sent removed the have_mapping thing entirely? Do you mean it
should be there, but as a separate query?

Oh, so it did. Well, do you think we need a query checking that?
I hadn't questioned the need to do so, but if you feel it's unnecessary
I'm certainly willing to pull it.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

about 12 years ago

In reply to: Tom Lane (#5)

Re: Something flaky in the "relfilenode mapping" infrastructure

On 2014-03-28 16:45:28 -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-03-28 16:41:15 -0400, Tom Lane wrote:

Any objection to separating out the have_mappings bit? It wasn't terribly
appropriate before, but it seems really out of place in this formulation.

The patch I sent removed the have_mapping thing entirely? Do you mean it
should be there, but as a separate query?

Oh, so it did. Well, do you think we need a query checking that?
I hadn't questioned the need to do so, but if you feel it's unnecessary
I'm certainly willing to pull it.

I don't think it's necessary. As far as I understand LATERAL, a join to
a function returning NULL will still return the row. So, the test now
would only test whether there are rows in pg_class which seems a bit
pointless.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

about 12 years ago

In reply to: Andres Freund (#6)

Re: Something flaky in the "relfilenode mapping" infrastructure

Andres Freund <andres@2ndquadrant.com> writes:

I don't think it's necessary. As far as I understand LATERAL, a join to
a function returning NULL will still return the row. So, the test now
would only test whether there are rows in pg_class which seems a bit
pointless.

Yeah, after looking closer I'd come to the same conclusion. If the
lateral function call could generate zero rows it'd perhaps be a risk,
but not in this formulation.

Will commit in a moment.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

about 12 years ago

In reply to: Andres Freund (#2)

Re: Something flaky in the "relfilenode mapping" infrastructure

On 2014-03-28 21:36:11 +0100, Andres Freund wrote:

Hi,

On 2014-03-27 08:02:35 -0400, Tom Lane wrote:

Buildfarm member prairiedog thinks there's something unreliable about
commit f01d1ae3a104019d6d68aeff85c4816a275130b3:
*** /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/expected/alter_table.out	Thu Mar 27 04:12:40 2014
--- /Users/buildfarm/bf-data/HEAD/pgsql.13462/src/test/regress/results/alter_table.out	Thu Mar 27 04:52:02 2014
***************
*** 2333,2339 ****
) mapped;
incorrectly_mapped | have_mappings 
--------------------+---------------
!                   0 | t
(1 row)
That's rather odd. It has survived for a couple of months on the other
buildfarm animals now... Could one of you apply the attached patch
adding more details to eventual failures?

So I had made a notice to recheck on
this. http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prairiedog&br=HEAD
indicates there haven't been any further failures... So, for now I
assume this was caused by some problem fixed elsewhere.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

about 12 years ago

In reply to: Andres Freund (#8)

Re: Something flaky in the "relfilenode mapping" infrastructure

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-03-27 08:02:35 -0400, Tom Lane wrote:

Buildfarm member prairiedog thinks there's something unreliable about
commit f01d1ae3a104019d6d68aeff85c4816a275130b3:

So I had made a notice to recheck on
this. http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prairiedog&br=HEAD
indicates there haven't been any further failures... So, for now I
assume this was caused by some problem fixed elsewhere.

Hard to say. In any case, I agree we can't make any progress unless we
see it again.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Noah Misch

noah@leadboat.com

almost 12 years ago

In reply to: Tom Lane (#9)

Re: Something flaky in the "relfilenode mapping" infrastructure

On Tue, Apr 15, 2014 at 03:28:41PM -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-03-27 08:02:35 -0400, Tom Lane wrote:

Buildfarm member prairiedog thinks there's something unreliable about
commit f01d1ae3a104019d6d68aeff85c4816a275130b3:

So I had made a notice to recheck on
this. http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prairiedog&br=HEAD
indicates there haven't been any further failures... So, for now I
assume this was caused by some problem fixed elsewhere.

Hard to say. In any case, I agree we can't make any progress unless we
see it again.

The improved test just tripped:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Tom Lane

tgl@sss.pgh.pa.us

almost 12 years ago

In reply to: Noah Misch (#10)

Re: Something flaky in the "relfilenode mapping" infrastructure

Noah Misch <noah@leadboat.com> writes:

The improved test just tripped:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07

Ugh. If the MTBF is circa three months, how will we catch this before
we're dead?

A quick look around the machine's buildfarm directory says there's nothing
left behind that's not included in the buildfarm server report, so I can't
offer any immediate insight. But I can certainly load it up running some
additional tests if anyone has an idea what to look for.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

Andres Freund

andres@anarazel.de

almost 12 years ago

In reply to: Noah Misch (#10)

Re: Something flaky in the "relfilenode mapping" infrastructure

On 2014-06-12 00:38:36 -0400, Noah Misch wrote:

On Tue, Apr 15, 2014 at 03:28:41PM -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-03-27 08:02:35 -0400, Tom Lane wrote:

Buildfarm member prairiedog thinks there's something unreliable about
commit f01d1ae3a104019d6d68aeff85c4816a275130b3:

So I had made a notice to recheck on
this. http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prairiedog&br=HEAD
indicates there haven't been any further failures... So, for now I
assume this was caused by some problem fixed elsewhere.

Hard to say. In any case, I agree we can't make any progress unless we
see it again.

The improved test just tripped:

Hrmpf. Just one of these days I was happy thinking it was gone...

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07

Hm. My guess it's that it's just a 'harmless' concurrency issue. The
test currently run in concurrency with others: I think what happens is
that the table gets dropped in the other relation after the query has
acquired the mvcc snapshot (used for the pg_class) test.
But why is it triggering on such a 'unusual' system and not on others?
That's what worries me a bit.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Tom Lane

tgl@sss.pgh.pa.us

almost 12 years ago

In reply to: Andres Freund (#12)

Re: Something flaky in the "relfilenode mapping" infrastructure

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-06-12 00:38:36 -0400, Noah Misch wrote:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07

Hm. My guess it's that it's just a 'harmless' concurrency issue. The
test currently run in concurrency with others: I think what happens is
that the table gets dropped in the other relation after the query has
acquired the mvcc snapshot (used for the pg_class) test.
But why is it triggering on such a 'unusual' system and not on others?
That's what worries me a bit.

prairiedog is pretty damn slow by modern standards. OTOH, I think it
is not the slowest machine in the buildfarm; hamster for instance seems
to be at least a factor of 2 slower. So I'm not sure whether to believe
it's just a timing issue.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Noah Misch

noah@leadboat.com

almost 12 years ago

In reply to: Tom Lane (#13)

Re: Something flaky in the "relfilenode mapping" infrastructure

On Thu, Jun 12, 2014 at 02:44:10AM -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-06-12 00:38:36 -0400, Noah Misch wrote:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07

Hm. My guess it's that it's just a 'harmless' concurrency issue. The
test currently run in concurrency with others: I think what happens is
that the table gets dropped in the other relation after the query has
acquired the mvcc snapshot (used for the pg_class) test.
But why is it triggering on such a 'unusual' system and not on others?
That's what worries me a bit.

I can reproduce a similar disturbance in the test query using gdb and a
concurrent table drop, and the table reported in the prairiedog failure is a
table dropped in a concurrent test group. That explanation may not be the
full story behind these particular failures, but it certainly could cause
similar failures in the future.

Let's prevent this by only reporting rows for relations that still exist after
the query is complete.

prairiedog is pretty damn slow by modern standards. OTOH, I think it
is not the slowest machine in the buildfarm; hamster for instance seems
to be at least a factor of 2 slower. So I'm not sure whether to believe
it's just a timing issue.

That kernel's process scheduler could be a factor.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

#15

Tom Lane

tgl@sss.pgh.pa.us

almost 12 years ago

In reply to: Noah Misch (#14)

Re: Something flaky in the "relfilenode mapping" infrastructure

Noah Misch <noah@leadboat.com> writes:

On Thu, Jun 12, 2014 at 02:44:10AM -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-06-12 00:38:36 -0400, Noah Misch wrote:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-12%2000%3A17%3A07

Hm. My guess it's that it's just a 'harmless' concurrency issue. The
test currently run in concurrency with others: I think what happens is
that the table gets dropped in the other relation after the query has
acquired the mvcc snapshot (used for the pg_class) test.
But why is it triggering on such a 'unusual' system and not on others?
That's what worries me a bit.

I can reproduce a similar disturbance in the test query using gdb and a
concurrent table drop, and the table reported in the prairiedog failure is a
table dropped in a concurrent test group. That explanation may not be the
full story behind these particular failures, but it certainly could cause
similar failures in the future.

Yeah, that seems like a plausible explanation, since the table shown
in the failure report is one that would be getting dropped concurrently,
and the discrepancy is that we get NULL rather than the expected value
for the pg_filenode_relation result, which is expected if the table is
already dropped when the mapping function is called.

Let's prevent this by only reporting rows for relations that still exist after
the query is complete.

I think this is a bad solution though; it risks masking actual problems.

What seems like a better fix to me is to change the test

mapped_oid IS DISTINCT FROM oid

mapped_oid <> oid

pg_class.oid will certainly never read as NULL, so what this will do is
allow the single case where the function returns NULL. AFAIK there is
no reason to suppose that a NULL result would mean anything except "the
table's been dropped", so changing it this way will allow only that case
and not any others.

Alternatively, we could do something like you suggest but adjust the
second join so that it suppresses only rows in which mapped_oid is null
*and* there's no longer a matching OID in pg_class. That would provide
additional confidence that the null result is a valid indicator of a
just-dropped table.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

Noah Misch

noah@leadboat.com

almost 12 years ago

In reply to: Tom Lane (#15)

Re: Something flaky in the "relfilenode mapping" infrastructure

On Thu, Jun 12, 2014 at 10:50:44PM -0400, Tom Lane wrote:

Alternatively, we could do something like you suggest but adjust the
second join so that it suppresses only rows in which mapped_oid is null
*and* there's no longer a matching OID in pg_class. That would provide
additional confidence that the null result is a valid indicator of a
just-dropped table.

Can't hurt; I adjusted it along those lines and committed. For the record,
the failure on prairiedog has appeared a third time:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-13%2005%3A19%3A35

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Tom Lane

tgl@sss.pgh.pa.us

almost 12 years ago

In reply to: Noah Misch (#16)

Re: Something flaky in the "relfilenode mapping" infrastructure

Noah Misch <noah@leadboat.com> writes:

On Thu, Jun 12, 2014 at 10:50:44PM -0400, Tom Lane wrote:

Alternatively, we could do something like you suggest but adjust the
second join so that it suppresses only rows in which mapped_oid is null
*and* there's no longer a matching OID in pg_class. That would provide
additional confidence that the null result is a valid indicator of a
just-dropped table.

Can't hurt; I adjusted it along those lines and committed.

Looks good.

For the record,
the failure on prairiedog has appeared a third time:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-06-13%2005%3A19%3A35

Yeah, I saw that. I wonder if we recently changed something that improved
the odds of the timing being just so. Two failures in two days seems out
of line with the critter's previous history.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Something flaky in the "relfilenode mapping" infrastructure

Attachments:

Attachments: