buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

Started by Tom Lane · over 7 years ago · 46 messages · pgsql-hackers
#1Tom Lane
tgl@sss.pgh.pa.us

We've seen several occurrences of $subject in the buildfarm in the past
month or so. Scraping the logs, I find

coypu | 2018-06-14 21:17:49 | HEAD | Check | 2018-06-14 23:31:44.505 CEST [5b22deb8.30e1:124] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
coypu | 2018-06-14 21:17:49 | HEAD | Check | 2018-06-14 23:31:44.653 CEST [5b22deb8.6d4d:132] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
gull | 2018-06-21 04:02:11 | HEAD | Check | 2018-06-20 21:27:06.843 PDT [31138:3] FATAL: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
mandrill | 2018-06-22 16:14:16 | HEAD | Check | 2018-06-22 16:46:24.138 UTC [14353240:43] pg_regress/create_table_like ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
mandrill | 2018-06-22 16:14:16 | HEAD | Check | 2018-06-22 16:46:24.137 UTC [20710034:1] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
lapwing | 2018-07-05 07:20:02 | HEAD | Check | 2018-07-05 07:21:45.585 UTC [24022:1] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
lapwing | 2018-07-05 07:20:02 | HEAD | Check | 2018-07-05 07:21:45.591 UTC [23941:39] pg_regress/roleattributes ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
mantid | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 21:21:32.557 EDT [26072:1] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
mantid | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 21:21:32.557 EDT [25990:82] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
coypu | 2018-07-17 01:27:39 | HEAD | Check | 2018-07-17 03:47:26.264 CEST [5b4d4aa4.3ac4:71] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
coypu | 2018-07-20 11:18:02 | HEAD | Check | 2018-07-20 13:32:14.104 CEST [5b51c833.4884:131] ERROR: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

While I didn't go back indefinitely far, there are no other occurrences of
"could not read block" failures in the last three months. This suggests
strongly that we broke something in early June. Don't know what though.
Ideas?

In case anyone's wondering, 2662 is pg_class_oid_index.

regards, tom lane

#2Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#1)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On 2018-07-20 16:15:14 -0400, Tom Lane wrote:

> We've seen several occurrences of $subject in the buildfarm in the past
> month or so. Scraping the logs, I find
>
> [buildfarm log lines repeated from #1, snipped]
>
> While I didn't go back indefinitely far, there are no other occurrences of
> "could not read block" failures in the last three months. This suggests
> strongly that we broke something in early June. Don't know what though.
> Ideas?
>
> In case anyone's wondering, 2662 is pg_class_oid_index.

While I do not immediately see how, and the affected versions don't
really point that way, the timing does make me think of
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a54e1f1587793b5bf926630ec9ce142e4578d7a0

HEAD/REL_11_STABLE apparently solely being affected points elsewhere,
but I don't immediately know where.

Greetings,

Andres Freund

#3Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

Hi,

On 2018-07-20 13:24:50 -0700, Andres Freund wrote:

> On 2018-07-20 16:15:14 -0400, Tom Lane wrote:
>> We've seen several occurrences of $subject in the buildfarm in the past
>> month or so. Scraping the logs, I find
>>
>> [buildfarm log lines repeated from #1, snipped]
>>
>> While I didn't go back indefinitely far, there are no other occurrences of
>> "could not read block" failures in the last three months. This suggests
>> strongly that we broke something in early June. Don't know what though.
>> Ideas?
>>
>> In case anyone's wondering, 2662 is pg_class_oid_index.
>
> While I do not immediately see how, and the affected versions don't
> really point that way, the timing does make me think of
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a54e1f1587793b5bf926630ec9ce142e4578d7a0
>
> HEAD/REL_11_STABLE apparently solely being affected points elsewhere,
> but I don't immediately know where.

Hm, there was:
http://archives.postgresql.org/message-id/20180628150209.n2qch5jtn3vt2xaa%40alap3.anarazel.de

I don't immediately see it being responsible, but I wonder if there's a
chance it actually is: Note that it happens in a parallel group that
includes vacuum.sql, which does a VACUUM FULL pg_class - but I still
don't immediately see how it could apply.

But either way, we need to fix the issue mentioned in the above email.

Greetings,

Andres Freund

#4Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#3)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On Wed, Jul 25, 2018 at 4:07 PM, Andres Freund <andres@anarazel.de> wrote:

>> HEAD/REL_11_STABLE apparently solely being affected points elsewhere,
>> but I don't immediately know where.
>
> Hm, there was:
> http://archives.postgresql.org/message-id/20180628150209.n2qch5jtn3vt2xaa%40alap3.anarazel.de
>
> I don't immediately see it being responsible, but I wonder if there's a
> chance it actually is: Note that it happens in a parallel group that
> includes vacuum.sql, which does a VACUUM FULL pg_class - but I still
> don't immediately see how it could apply.

It's now pretty clear that it was not that particular bug, since I
pushed a fix, and yet the issue hasn't gone away on affected buildfarm
animals. There was a recurrence of the problem on lapwing, for example
[1].

Anyway, "VACUUM FULL pg_class" should be expected to corrupt
pg_class_oid_index when we happen to get a parallel build, since
pg_class is a mapped relation, and I've identified that as a problem
for parallel CREATE INDEX [2]. If that was the ultimate cause of the
issue, it would explain why only REL_11_STABLE and master are
involved.

My guess is that the metapage considers the root page to be at block 3
(block 3 is often the root page for small though not tiny B-Trees),
which for whatever reason is where we get a short read. I don't know
why there is a short read, but corrupting mapped catalog indexes at
random can be expected to cause all kinds of chaos, so that doesn't
mean much.

In any case, I'll probably push a fix for this other bug on Friday,
barring any objections. It's possible that that will make the problem
go away.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2018-08-04%2004%3A20%3A01
[2]: https://www.postgresql.org/message-id/CAH2-Wzn=j0i8rxCAo6E=tBO9XuYXb8HbUsnW7J_StKON8dDOhQ@mail.gmail.com
--
Peter Geoghegan

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#4)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

Peter Geoghegan <pg@bowt.ie> writes:

> On Wed, Jul 25, 2018 at 4:07 PM, Andres Freund <andres@anarazel.de> wrote:
>> I don't immediately see it being responsible, but I wonder if there's a
>> chance it actually is: Note that it happens in a parallel group that
>> includes vacuum.sql, which does a VACUUM FULL pg_class - but I still
>> don't immediately see how it could apply.
>
> Anyway, "VACUUM FULL pg_class" should be expected to corrupt
> pg_class_oid_index when we happen to get a parallel build, since
> pg_class is a mapped relation, and I've identified that as a problem
> for parallel CREATE INDEX [2]. If that was the ultimate cause of the
> issue, it would explain why only REL_11_STABLE and master are
> involved.

Oooh ... but pg_class wouldn't be big enough to get a parallel
index rebuild during that test, would it?

regards, tom lane

#6Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#5)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> Anyway, "VACUUM FULL pg_class" should be expected to corrupt
>> pg_class_oid_index when we happen to get a parallel build, since
>> pg_class is a mapped relation, and I've identified that as a problem
>> for parallel CREATE INDEX [2]. If that was the ultimate cause of the
>> issue, it would explain why only REL_11_STABLE and master are
>> involved.
>
> Oooh ... but pg_class wouldn't be big enough to get a parallel
> index rebuild during that test, would it?

Typically not, but I don't think that we can rule it out right away.
"min_parallel_table_scan_size = 0" will make it happen, and that
happens in quite a few regression tests. Though that doesn't include
vacuum.sql.
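[Editor's sketch, not part of the original mail: the knob under discussion, written out as the commands one would run against a scratch cluster. The file name force_parallel.sql, the table t, and the database test are placeholders.]

```shell
# Write the settings under discussion into a scratch SQL script; zeroing
# min_parallel_table_scan_size lets the planner consider a parallel scan
# even on a trivially small table.
cat > force_parallel.sql <<'EOF'
SET min_parallel_table_scan_size = 0;
SET max_parallel_workers_per_gather = 2;
EXPLAIN (COSTS OFF) SELECT count(*) FROM t;
EOF

# To try it against a live scratch cluster (not run here):
#   psql -X -f force_parallel.sql test
grep -c '^SET' force_parallel.sql   # prints 2
```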

The animals mandrill, mantid and lapwing are all
"force_parallel_mode=regress", according to their notes. That actually
isn't enough to force parallel CREATE INDEX, though I tend to think it
should be. I don't see anything interesting in their "extra_config",
but perhaps "min_parallel_table_scan_size = 0" got in some other way.
I don't know all that much about the buildfarm client code, and it's
late.

--
Peter Geoghegan

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#6)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

Peter Geoghegan <pg@bowt.ie> writes:

> On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oooh ... but pg_class wouldn't be big enough to get a parallel
>> index rebuild during that test, would it?
>
> Typically not, but I don't think that we can rule it out right away.

Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether
a parallel build happened or not, so that we can check the buildfarm logs
next time we see that failure?

> I don't know all that much about the buildfarm client code, and it's
> late.

It doesn't really stick in any undocumented configuration changes,
AFAIK. Possibly Andrew would have some more insight.

regards, tom lane

#8Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#7)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On Wed, Aug 8, 2018 at 10:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> Typically not, but I don't think that we can rule it out right away.
>
> Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether
> a parallel build happened or not, so that we can check the buildfarm logs
> next time we see that failure?

I can do that tomorrow. Of course, it might be easier to just push the
pending fix for the mapped relation bug, and see if that makes the
issue go away. That would take longer, but it's also the only thing
that's likely to be definitive anyway.

If you want to go ahead with it yourself, I suggest modifying the
DEBUG1 elog()s within index_build() to be LOG elevel.

--
Peter Geoghegan

#9Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#7)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On 08/09/2018 01:03 AM, Tom Lane wrote:

> Peter Geoghegan <pg@bowt.ie> writes:
>> On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Oooh ... but pg_class wouldn't be big enough to get a parallel
>>> index rebuild during that test, would it?
>> Typically not, but I don't think that we can rule it out right away.
>
> Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether
> a parallel build happened or not, so that we can check the buildfarm logs
> next time we see that failure?
>
>> I don't know all that much about the buildfarm client code, and it's
>> late.
>
> It doesn't really stick in any undocumented configuration changes,
> AFAIK. Possibly Andrew would have some more insight.

No, everything should be visible in the config. Hidden things are what I
try to avoid.

The only things the configure() step adds are the prefix and pgport
settings, and the cache file. make() only adds a "-j jobs" if so
configured. make_check() normally just runs "make NO_LOCALE=1 check".

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#8)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On Wed, Aug 8, 2018 at 10:08 PM, Peter Geoghegan <pg@bowt.ie> wrote:

>> Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether
>> a parallel build happened or not, so that we can check the buildfarm logs
>> next time we see that failure?
>
> I can do that tomorrow. Of course, it might be easier to just push the
> pending fix for the mapped relation bug, and see if that makes the
> issue go away. That would take longer, but it's also the only thing
> that's likely to be definitive anyway.

I'm going to push my fix for the relmapper.c parallel CREATE INDEX bug
tomorrow, and see what happens. I have a hard time imagining how there
could be a parallel index build on pg_class, now that Andrew has
indicated there is no reason to think that
"min_parallel_table_scan_size = 0" could slip in.

--
Peter Geoghegan

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#6)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

Peter Geoghegan <pg@bowt.ie> writes:

> On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oooh ... but pg_class wouldn't be big enough to get a parallel
>> index rebuild during that test, would it?
>
> Typically not, but I don't think that we can rule it out right away.

Didn't take long to show that the relmapper issue wasn't it:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2018-08-10%2021%3A21%3A40

So we're back to square one. Although Tomas' recent report might
give us something new to chase.

regards, tom lane

#12Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#11)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On Fri, Aug 10, 2018 at 7:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Didn't take long to show that the relmapper issue wasn't it:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2018-08-10%2021%3A21%3A40
>
> So we're back to square one. Although Tomas' recent report might
> give us something new to chase.

Thanks for pointing out that Tomas had a lead - I missed that.

I'm concerned that this item has the potential to delay the release,
since, as you said, we're back to the drawing board.

--
Peter Geoghegan

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#12)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

Peter Geoghegan <pg@bowt.ie> writes:

> I'm concerned that this item has the potential to delay the release,
> since, as you said, we're back to the drawing board.

Me too. I will absolutely not vote to release 11.0 before we've
solved this ...

regards, tom lane

#14Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#13)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On Fri, Aug 10, 2018 at 8:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Me too. I will absolutely not vote to release 11.0 before we've
> solved this ...

I believe that that's the right call, assuming things don't change.

This is spooky in a way that creates a lot of doubts in my mind. I
don't think it's at all advisable to make an executive decision to
release, while still not having the foggiest idea what's really going
on.

--
Peter Geoghegan

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#13)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On 08/11/2018 05:02 AM, Tom Lane wrote:

> Peter Geoghegan <pg@bowt.ie> writes:
>> I'm concerned that this item has the potential to delay the release,
>> since, as you said, we're back to the drawing board.
>
> Me too. I will absolutely not vote to release 11.0 before we've
> solved this ...

Not sure. I can easily reproduce that on REL_10_STABLE, so it's not a
new issue specific to 11.0.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#15)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On 08/11/2018 03:16 PM, Tomas Vondra wrote:

> On 08/11/2018 05:02 AM, Tom Lane wrote:
>> Peter Geoghegan <pg@bowt.ie> writes:
>>> I'm concerned that this item has the potential to delay the release,
>>> since, as you said, we're back to the drawing board.
>> Me too. I will absolutely not vote to release 11.0 before we've
>> solved this ...
>
> Not sure. I can easily reproduce that on REL_10_STABLE, so it's not a
> new issue specific to 11.0.

For the record, I can actually reproduce this on 9.6 (haven't tried
older releases, but I suspect it's there too). Instead of using the
failing subscription, I've used another pgbench script doing this:

SET statement_timeout = 5;
COPY t TO '/dev/null';

and doing something like:

pgbench -n -c 20 -T 300 -f copy.sql test

I don't think the 20-client COPY concurrency is necessary, it just makes
restarts after the statement_timeout faster.
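[Editor's sketch, not part of the original mail: the recipe above written out as a small driver. It assumes a scratch database named test containing the table t, with the VACUUM FULL pg_class / insert workload from the other thread running concurrently.]

```shell
# Per-client script: each COPY is killed after 5 ms by statement_timeout,
# producing the constant error/restart churn Tomas describes.
cat > copy.sql <<'EOF'
SET statement_timeout = 5;
COPY t TO '/dev/null';
EOF

# Against a live cluster (not run here):
#   pgbench -n -c 20 -T 300 -f copy.sql test
grep -c '' copy.sql   # prints 2
```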

This produces about the same backtrace as reported on the other thread.
Attached are both a plain 'bt' and a 'bt full'.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

backtrace-96.txt (text/plain)
backtrace-full-96.txt (text/plain)
#17Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#16)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

Hi,

On 2018-08-11 15:40:19 +0200, Tomas Vondra wrote:

> For the record, I can actually reproduce this on 9.6 (haven't tried
> older releases, but I suspect it's there too). Instead of using the
> failing subscription, I've used another pgbench script doing this:
>
> SET statement_timeout = 5;
> COPY t TO '/dev/null';
>
> and doing something like:
>
> pgbench -n -c 20 -T 300 -f copy.sql test

Just to confirm: That's with the vacuum full and insert running
concurrently? And then just restarting the above copy.sql (as pgbench
errors out after the timeouts) until you get the error?

I'm a bit confused what the copy + timeout is doing here? It shouldn't
trigger any invalidations itself, and the backtrace appears to be from
the insert.sql you posted earlier? Unclear why a copy to /dev/null
should trigger anything like this?

Greetings,

Andres Freund

#18Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#17)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On 08/11/2018 04:08 PM, Andres Freund wrote:

> Hi,
>
> On 2018-08-11 15:40:19 +0200, Tomas Vondra wrote:
>> For the record, I can actually reproduce this on 9.6 (haven't tried
>> older releases, but I suspect it's there too). Instead of using the
>> failing subscription, I've used another pgbench script doing this:
>>
>> SET statement_timeout = 5;
>> COPY t TO '/dev/null';
>>
>> and doing something like:
>>
>> pgbench -n -c 20 -T 300 -f copy.sql test
>
> Just to confirm: That's with the vacuum full and insert running
> concurrently? And then just restarting the above copy.sql (as pgbench
> errors out after the timeouts) until you get the error?

Yes, pretty much.

> I'm a bit confused what the copy + timeout is doing here? It shouldn't
> trigger any invalidations itself, and the backtrace appears to be from
> the insert.sql you posted earlier? Unclear why a copy to /dev/null
> should trigger anything like this?

My goal was to simulate the failing subscription sync, which does COPY
and fails because of duplicate data. The statement_timeout seemed like a
good approximation of that. It may be unnecessary, I don't know.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#19Todd A. Cook
ToddA.Cook@synopsys.com
In reply to: Peter Geoghegan (#6)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On 8/9/18, 12:56 AM, "Peter Geoghegan" <pg@bowt.ie> wrote:

> On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Anyway, "VACUUM FULL pg_class" should be expected to corrupt
>>> pg_class_oid_index when we happen to get a parallel build, since
>>> pg_class is a mapped relation, and I've identified that as a problem
>>> for parallel CREATE INDEX [2]. If that was the ultimate cause of the
>>> issue, it would explain why only REL_11_STABLE and master are
>>> involved.
>>
>> Oooh ... but pg_class wouldn't be big enough to get a parallel
>> index rebuild during that test, would it?
>
> Typically not, but I don't think that we can rule it out right away.
> "min_parallel_table_scan_size = 0" will make it happen, and that
> happens in quite a few regression tests. Though that doesn't include
> vacuum.sql.
>
> The animals mandrill, mantid and lapwing are all
> "force_parallel_mode=regress", according to their notes. That actually
> isn't enough to force parallel CREATE INDEX, though I tend to think it
> should be. I don't see anything interesting in their "extra_config",
> but perhaps "min_parallel_table_scan_size = 0" got in some other way.
> I don't know all that much about the buildfarm client code, and it's
> late.

Sorry, I just noticed this. Mantid is my animal, so I can set "min_parallel_table_scan_size = 0"
on it if that would be helpful. (Please reply directly if so; I'm not able to keep up with pgsql-hackers
right now.)

-- todd

#20Peter Geoghegan
pg@bowt.ie
In reply to: Todd A. Cook (#19)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

On Tue, Aug 14, 2018 at 2:07 PM, Todd A. Cook <ToddA.Cook@synopsys.com> wrote:

> Sorry, I just noticed this. Mantid is my animal, so I can set "min_parallel_table_scan_size = 0"
> on it if that would be helpful. (Please reply directly if so; I'm not able to keep up with pgsql-hackers
> right now.)

We've already been able to rule that out as a factor here. Thanks all
the same, though.

--
Peter Geoghegan

#21Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#17)
#22Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#22)
#25Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#24)
#26Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#26)
#28Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#27)
#29Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#26)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#30)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#29)
#33Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#31)
#34Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#32)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#33)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#34)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#30)
#38Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#37)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#38)
#40Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#39)
#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#40)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#41)
#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#42)
#44Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#43)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#44)
#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#45)