Breakage with VACUUM ANALYSE + partitions
Hi,
I've not determined what's causing the following issue, but this is
the simplest reproducible test case I've managed to create so far in
9.6 latest git master:
postgresql.conf:
shared_buffers = 256MB
createdb pgbench
pgbench -i -s 80 pgbench
psql pgbench
\pset pager off
CREATE TABLE pgbench_accounts_1 (CHECK (bid % 2 = 0)) INHERITS
(pgbench_accounts);
CREATE TABLE pgbench_accounts_2 (CHECK (bid % 2 = 1)) INHERITS
(pgbench_accounts);
WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 0
RETURNING *) INSERT INTO pgbench_accounts_1 SELECT * FROM del;
WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 1
RETURNING *) INSERT INTO pgbench_accounts_2 SELECT * FROM del;
VACUUM ANALYSE;
EXPLAIN ANALYSE SELECT count(*) FROM pgbench_accounts;
This last statement produces:
ERROR: could not read block 0 in file "base/35160/35173": read only
0 of 8192 bytes
Looking at the file, I get:
File: ‘35173’
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: 801h/2049d Inode: 4504115 Links: 1
Access: (0600/-rw-------) Uid: (1000001/ thom) Gid: (1000001/ thom)
Access: 2016-03-20 19:37:50.433485085 +0000
Modify: 2016-03-20 19:39:49.422959222 +0000
Change: 2016-03-20 19:39:49.422959222 +0000
Birth: -
This is the same file info I get if I just do a regular VACUUM
instead, but that doesn't result in the error. And if I follow that
up with a VACUUM ANALYSE, the error still doesn't occur.
The base directory also shows this listing for files named after the
affected filenode:
-rw------- 1 thom thom 0 Mar 20 19:39 35173
-rw------- 1 thom thom 0 Mar 20 19:39 35173.1
-rw------- 1 thom thom 16384 Mar 20 19:39 35173_fsm
-rw------- 1 thom thom 0 Mar 20 19:39 35173_vm
After I see the error, I get the same error again if I then try to run
plain VACUUM.
The filenode 35173 shown in this error message corresponds to the
parent pgbench_accounts table.
If I reduce the scale factor to 50, or increase shared_buffers to
512MB, there's no error. If, instead of VACUUM ANALYSE, I just use
VACUUM, or just ANALYSE, the error doesn't happen.
System details:
OS: Linux Mint Debian Edition (Wheezy 7.0)
Linux Kernel: 3.11.8
RAM: 16GB
Filesystem: ext4 (rw,noatime,discard,errors=remount-ro,data=ordered)
Regards,
Thom
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Sun, Mar 20, 2016 at 3:55 PM, Thom Brown <thom@linux.com> wrote:
I've not determined what's causing the following issue, but this is
the simplest reproducible test case I've managed to create so far in
9.6 latest git master: [...]
ERROR: could not read block 0 in file "base/35160/35173": read only
0 of 8192 bytes
Hmm. The natural thing to suspect here would be a problem with the
freeze map changes. But I tried this and I couldn't reproduce it.
Any chance you can try this on each of the following commits?
fd31cd265138019dcccc9b5fe53043670898bc9f
77a1d1e79892a20ed15a67be42b96949b8546bf6
a892234f830e832110f63fc0a2afce2fb21d1584
68c521eb92c3515e3306f51a7fd3f32d16c97524
Make sure to re-initdb each time.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 24, 2016 at 4:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Mar 20, 2016 at 3:55 PM, Thom Brown <thom@linux.com> wrote:
[...]
ERROR: could not read block 0 in file "base/35160/35173": read only
0 of 8192 bytes
Hmm. The natural thing to suspect here would be a problem with the
freeze map changes. But I tried this and I couldn't reproduce it.
Any chance you can try this on each of the following commits? [...]
Make sure to re-initdb each time.
I am able to reproduce the problem with the latest HEAD (473b932).
During the vacuum phase, vacuum truncates the table to 0 bytes
because no records are present in it, and invalidates the smgr
relation.
During the analyse phase, an smgr relation that includes segno = 1 is
then constructed at the following location, and this time the smgr
relation is not invalidated:
#0 _mdfd_openseg (reln=0x1b8d790, forknum=MAIN_FORKNUM, segno=1,
oflags=0) at md.c:1736
#1 0x00000000007fa032 in _mdfd_getseg (reln=0x1b8d790,
forknum=MAIN_FORKNUM, blkno=131147, skipFsync=0 '\000',
behavior=EXTENSION_RETURN_NULL) at md.c:1815
#2 0x00000000007f8449 in mdwriteback (reln=0x1b8d790,
forknum=MAIN_FORKNUM, blocknum=131147, nblocks=1) at md.c:686
#3 0x00000000007faca3 in smgrwriteback (reln=0x1b8d790,
forknum=MAIN_FORKNUM, blocknum=131147, nblocks=1) at smgr.c:663
#4 0x00000000007c5211 in IssuePendingWritebacks (context=0xe70fc0
<BackendWritebackContext>) at bufmgr.c:4112
#5 0x00000000007c5071 in ScheduleBufferTagForWriteback
(context=0xe70fc0 <BackendWritebackContext>, tag=0x7ff18deaa180) at
bufmgr.c:4041
#6 0x00000000007c0a9d in BufferAlloc (smgr=0x1b8d490,
relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=247,
strategy=0x1bded48, foundPtr=0x7ffea3f0f34f "") at bufmgr.c:1134
#7 0x00000000007bff58 in ReadBuffer_common (smgr=0x1b8d490,
relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=247,
mode=RBM_NORMAL, strategy=0x1bded48, hit=0x7ffea3f0f43b "")
at bufmgr.c:744
#8 0x00000000007bfd6e in ReadBufferExtended (reln=0x7ff19dc69280,
forkNum=MAIN_FORKNUM, blockNum=247, mode=RBM_NORMAL,
strategy=0x1bded48) at bufmgr.c:663
#9 0x00000000005e09f7 in acquire_sample_rows (onerel=0x7ff19dc69280,
elevel=13, rows=0x1be7f28, targrows=15000, totalrows=0x7ffea3f0f598,
totaldeadrows=0x7ffea3f0f590)
at analyze.c:1025
#10 0x00000000005e1877 in acquire_inherited_sample_rows
(onerel=0x7ff19dc76a78, elevel=13, rows=0x1be7f28, targrows=30000,
totalrows=0x7ffea3f0f780, totaldeadrows=0x7ffea3f0f778)
at analyze.c:1410
#11 0x00000000005dfa68 in do_analyze_rel (onerel=0x7ff19dc76a78,
options=3, params=0x7ffea3f0fab0, va_cols=0x0, acquirefunc=0x5e08ca
<acquire_sample_rows>, relpages=0, inh=1 '\001',
in_outer_xact=0 '\000', elevel=13) at analyze.c:486
#12 0x00000000005df276 in analyze_rel (relid=16463, relation=0x0,
options=3, params=0x7ffea3f0fab0, va_cols=0x0, in_outer_xact=0 '\000',
bstrategy=0x1bded48) at analyze.c:264
So further operations on the table use the already-constructed smgr
relation, assume the first segment contains RELSEG_SIZE blocks, and
try to scan them. But there are 0 pages in the table, thus it produces
the error. The issue doesn't occur from another session; for the same
reason, if we do only a vacuum operation, the error does not occur.
Regards,
Hari Babu
Fujitsu Australia
On Thu, Mar 24, 2016 at 12:59 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
So further operations on the table uses the already constructed smgr relation
and treats that there are RELSEG_SIZE number of blocks in the page and try
to do the scan. But there are 0 pages in the table thus it produces the error.
The issue doesn't occur from another session. Because of this reason only
if we do only vacuum operation, the error not occurred.
Yeah, I had a suspicion that this might have to do with invalidation
messages based on Thom's description, but I think we still need to
track down which commit is at fault.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 24, 2016 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 24, 2016 at 12:59 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
[...]
Yeah, I had a suspicion that this might have to do with invalidation
messages based on Thom's description, but I think we still need to
track down which commit is at fault.
I could reproduce the failure on Linux, not on OSX, and bisecting the
failure, the first bad commit is this one:
commit: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
author: Andres Freund <andres@anarazel.de>
date: Thu, 10 Mar 2016 17:04:34 -0800
Allow to trigger kernel writeback after a configurable number of writes.
The failure is a little bit sporadic: based on my tests, 1 or 2 runs
out of 10 could pass, so one good commit was recognized as such only
after passing the SQL sequence sent by Thom 5 times in a row. I also
did some manual tests, and those point to this commit as well.
I am adding Fabien and Andres in CC for some feedback.
--
Michael
On 25 March 2016 at 12:41, Michael Paquier <michael.paquier@gmail.com> wrote:
[...]
I could reproduce the failure on Linux, not on OSX, and bisecting the
failure, the first bad commit is this one:
commit: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
author: Andres Freund <andres@anarazel.de>
date: Thu, 10 Mar 2016 17:04:34 -0800
Allow to trigger kernel writeback after a configurable number of writes.
The failure is a little bit sporadic, based on my tests 1/2 runs out
of 10 could pass, so one good commit was recognized as such after
passing the SQL sequence sent by Thom 5 times in a row. I also did
some manual tests and those are pointing to this commit as well.
I've had to adapt the test, adding waits, to get it to fail more
reliably. I've just scattered it with sleeps:
\pset pager off
select pg_sleep(1);
CREATE TABLE pgbench_accounts_1 (CHECK (bid % 2 = 0)) INHERITS
(pgbench_accounts);
select pg_sleep(1);
CREATE TABLE pgbench_accounts_2 (CHECK (bid % 2 = 1)) INHERITS
(pgbench_accounts);
select pg_sleep(1);
WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 0 RETURNING
*) INSERT INTO pgbench_accounts_1 SELECT * FROM del;
select pg_sleep(1);
WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 1 RETURNING
*) INSERT INTO pgbench_accounts_2 SELECT * FROM del;
select pg_sleep(1);
VACUUM ANALYSE;
select pg_sleep(3);
EXPLAIN ANALYSE SELECT count(*) FROM pgbench_accounts;
But this brings me to the same commit you found.
Thom
On Fri, Mar 25, 2016 at 8:41 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
[...]
I could reproduce the failure on Linux, not on OSX, and bisecting the
failure, the first bad commit is this one:
commit: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
author: Andres Freund <andres@anarazel.de>
date: Thu, 10 Mar 2016 17:04:34 -0800
Allow to trigger kernel writeback after a configurable number of writes.
Gosh, that's surprising. I wonder if that just revealed an underlying
issue rather than creating it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-03-25 12:02:05 -0400, Robert Haas wrote:
[...]
Gosh, that's surprising. I wonder if that just revealed an underlying
issue rather than creating it.
I think that might be the case, but I'm not entirely sure yet. It
appears to me that the current backend - others don't show the problem -
still has the first segment of pgbench_accounts open (in the md.c
mdfdvec sense); likely because there were remaining flush
requests. Thus, when mdnblocks is called to get the size of the relation
we return the size of the first segment (131072) plus the size of the
second segment (0, doesn't exist). That then leads to this error.
I don't really understand yet how the "open segment" thing happens.
On 2016-03-25 12:02:05 -0400, Robert Haas wrote:
Gosh, that's surprising. I wonder if that just revealed an underlying
issue rather than creating it.
I think that's the case; it's just somewhat unlikely to hit in other
cases.
If SMgrRelation->md_fd[n] is an empty relation, then mdread() or
another routine asking for a block in the second segment will error
out. But even if the first segment is zero bytes, _mdfd_getseg() will
dutifully try to open the second segment, which will succeed in the
case of a truncated relation, because we leave the truncated segment
file in place.
ISTM that _mdfd_getseg better check the length of the last segment
before opening the next one?
Regards,
Andres
On Sat, Mar 26, 2016 at 01:39:42PM +0100, Andres Freund wrote:
[...]
ISTM that _mdfd_getseg better check the length of the last segment
before opening the next one?
[This is a generic notification.]
The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
since you committed the patch believed to have created it, you own this open
item. If that responsibility lies elsewhere, please let us know whose
responsibility it is to fix this. Since new open items may be discovered at
any time and I want to plan to have them all fixed well in advance of the ship
date, I will appreciate your efforts toward speedy resolution. Please
present, within 72 hours, a plan to fix the defect within seven days of this
message. Thanks.
Non-generic addendum: granted, s/created/unearthed/ is looking more precise.
On Sat, Mar 26, 2016 at 8:39 AM, Andres Freund <andres@anarazel.de> wrote:
[...]
ISTM that _mdfd_getseg better check the length of the last segment
before opening the next one?
Well, I agree that it's pretty strange that _mdfd_getseg() makes no
such check, but I still don't think I understand what's going on here.
Backends shouldn't be requesting nonexistent blocks from a relation -
higher-level safeguards, like holding AccessExclusiveLock before
trying to complete a DROP or TRUNCATE - are supposed to prevent that.
If this patch is causing us to hold onto smgr references to a relation
on which we no longer hold locks, I think that's irretrievably broken
and should be reverted. I really doubt this will be the only thing
that goes wrong if you do that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Apr 04, 2016 at 12:45:25PM -0400, Robert Haas wrote:
Backends shouldn't be requesting nonexistent blocks from a relation -
higher-level safeguards, like holding AccessExclusiveLock before
trying to complete a DROP or TRUNCATE - are supposed to prevent that.
If this patch is causing us to hold onto smgr references to a relation
on which we no longer hold locks, I think that's irretrievably broken
and should be reverted. I really doubt this will be the only thing
that goes wrong if you do that.
+1
On 2016-04-04 12:45:25 -0400, Robert Haas wrote:
Well, I agree that it's pretty strange that _mdfd_getseg() makes no
such check, but I still don't think I understand what's going on here.
Backends shouldn't be requesting nonexistent blocks from a relation -
higher-level safeguards, like holding AccessExclusiveLock before
trying to complete a DROP or TRUNCATE - are supposed to prevent that.
I don't think that's really true: We write blocks back without holding
any sort of relation level locks; and thus do _mdfd_getseg() type
accesses as well.
And we're not really "requesting nonexistent blocks" - the segments are
just opened to get the associated file descriptor, and they're opened
with EXTENSION_RETURN_NULL. It turns out to be important
performance-wise to reuse fd's when triggering kernel writeback.
If this patch is causing us to hold onto smgr references to a relation
on which we no longer hold locks, I think that's irretrievably broken
and should be reverted. I really doubt this will be the only thing
that goes wrong if you do that.
As we already have done that for writes for a long time, I'm a bit
surprised about that statement.
Greetings,
Andres Freund
On Mon, Apr 11, 2016 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-04-04 12:45:25 -0400, Robert Haas wrote:
Well, I agree that it's pretty strange that _mdfd_getseg() makes no
such check, but I still don't think I understand what's going on here.
Backends shouldn't be requesting nonexistent blocks from a relation -
higher-level safeguards, like holding AccessExclusiveLock before
trying to complete a DROP or TRUNCATE - are supposed to prevent that.
I don't think that's really true: We write blocks back without holding
any sort of relation level locks; and thus do _mdfd_getseg() type
accesses as well.
You're right, but I think that's more because I didn't say it
correctly than because you haven't done something novel. DROP and
relation truncation know about shared buffers, and they clear out any
blocks that might be affected as part of the operation, which means
that no other backend will see them after they are gone. The lock
makes sure that no other references can be added while we're busy
removing any that are already there. So I think that there is
currently an invariant that any block we are attempting to access
should actually still exist. It sounds like these references are
sticking around in backend-private memory, which means they are
neither protected by locks nor able to be cleared out on drop or
truncate. I think that's a new thing, and a bit scary.
And we're not really "requesting nonexistant blocks" - the segments are
just opened to get the associated file descriptor, and they're opened
with EXTENSION_RETURN_NULL. It turns out to be important
performance-wise to reuse fd's when triggering kernel writeback.
The possibly-saving grace here, I suppose, is that the references
we're worried about are just being used to issue hints to the
operating system. So I guess if we sent a hint on a wrong block or
skip sending a hint altogether because of some failure, no harm done,
as long as we don't error out.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-04-11 13:04:48 -0400, Robert Haas wrote:
You're right, but I think that's more because I didn't say it
correctly than because you haven't done something novel.
Could be.
DROP and
relation truncation know about shared buffers, and they clear out any
blocks that might be affected as part of the truncate
operation, which means that no other backend will see them after they
are gone. The lock makes sure that no other references can be added
while we're busy removing any that are already there. So I think that
there is currently an invariant that any block we are attempting to
access should actually still exist.
Note that we're not actually accessing any blocks, we're just opening a
segment to get the associated file descriptor.
It sounds like these references are sticking around in backend-private
memory, which means they are neither protected by locks nor able to be
cleared out on drop or truncate. I think that's a new thing, and a
bit scary.
True. But how else would you batch flush requests in a sorted manner,
without re-opening file descriptors? And that's
pretty essential for performance.
I can think of a number of relatively easy ways to address this:
1) Just zap (or issue?) all pending flush requests when getting an
smgrinval/smgrclosenode
2) Do 1), but filter for the closed relnode
3) Actually handle the case of the last open segment not being
RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
I'm kind of inclined to do both 3) and 1).
The possibly-saving grace here, I suppose, is that the references
we're worried about are just being used to issue hints to the
operating system.
Indeed.
So I guess if we sent a hint on a wrong block or
skip sending a hint altogether because of some failure, no harm done,
as long as we don't error out.
Which the writeback code is careful not to do; AFAICS it's just the
"already open segment" issue causing problems here.
- Andres
Hello Andres,
I can think of a number of relatively easy ways to address this:
1) Just zap (or issue?) all pending flush requests when getting an
smgrinval/smgrclosenode
2) Do 1), but filter for the closed relnode
3) Actually handle the case of the last open segment not being
RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.I'm kind of inclined to do both 3) and 1).
Alas, I'm a little bit outside my area of expertise. Just for
suggestion purposes, possibly totally wrong (please accept my apologies
if this is the case), here are the ideas I had while thinking about
these issues; some may be quite close to your suggestions above:
- keep track of file/segment closings/reopenings (say with a counter), so
that if a flush is about a file/segment which has been closed or
reopened, it is just skipped. I'm not sure this is enough, because one
process may do the file cleaning and another may want to flush, although
I guess there is some kind of locking/shared data structure to manage
such interactions between pg processes.
- because of the "empty the bucket when filled" behavior of the current
implementation, a buffer to be flushed may be kept for a very
long time in the bucket. I think that flushing advice becomes stale
after a while and so should not be issued (the buffer may have been
written again, ...), or the bucket should be flushed after a while
even if not full.
Also, a detail in "pg_flush_data": there is a series of three if/endif
blocks, but it is not obvious to me whether they are mutually exclusive
while looking at the source, so I was wondering whether the code could
try several flushing approaches one after the other... This is probably
not the case, but I would be more at ease with an if/elif/elif/endif
structure there so that it is clearer.
--
Fabien.
On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
I can think of a number of relatively easy ways to address this:
1) Just zap (or issue?) all pending flush requests when getting an
smgrinval/smgrclosenode
2) Do 1), but filter for the closed relnode
3) Actually handle the case of the last open segment not being
RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
I'm kind of inclined to do both 3) and 1).
#3 seems like it's probably about 15 years overdue, so let's do that anyway.
I don't quite understand why #1 fixes the problem - not because I
doubt you, but just because I haven't studied it much.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
At 2016-04-12 09:00:57 -0400, robertmhaas@gmail.com wrote:
On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
3) Actually handle the case of the last open segment not being
RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
#3 seems like it's probably about 15 years overdue, so let's do that
anyway.
I don't have anything useful to contribute, I'm just trying to
understand this fix.
_mdfd_getseg() looks like this:

    targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
    for (nextsegno = 1; nextsegno <= targetseg; nextsegno++)
    {
        Assert(nextsegno == v->mdfd_segno + 1);

        if (v->mdfd_chain == NULL)
        {
            /*
             * Normally we will create new segments only if authorized by the
             * caller (i.e., we are doing mdextend()). […]
             */
            if (behavior == EXTENSION_CREATE || InRecovery)
            {
                if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE)
                {
                    /* mdextend */
                }
                v->mdfd_chain = _mdfd_openseg(reln, forknum, nextsegno, O_CREAT);
            }
            else
            {
                /* We won't create segment if not existent */
                v->mdfd_chain = _mdfd_openseg(reln, forknum, nextsegno, 0);
            }
            if (v->mdfd_chain == NULL)
                /* return NULL or ERROR */ ;
        }
        v = v->mdfd_chain;
    }
Do I understand correctly that the case of the "last open segment"
(i.e., the one for which mdfd_chain == NULL) not being RELSEG_SIZE
(i.e., _mdnblocks(reln, forknum, v) < RELSEG_SIZE) should not call
_mdfd_openseg on nextsegno if behaviour is not EXTENSION_CREATE or
InRecovery?
And that "We won't create segment if not existent" should happen, but
doesn't only because the next segment file wasn't removed earlier, so
we have to add an extra check for that case?
In other words, is something like the following what's needed here, or
is there more to it?
-- Abhijit
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 578276d..58ea5a0 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1783,6 +1783,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
if (v->mdfd_chain == NULL)
{
+ bool thisSegNeedsExtension = _mdnblocks(reln, forknum, v) < RELSEG_SIZE;
+
/*
* Normally we will create new segments only if authorized by the
* caller (i.e., we are doing mdextend()). But when doing WAL
@@ -1799,7 +1801,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
*/
if (behavior == EXTENSION_CREATE || InRecovery)
{
- if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE)
+ if (thisSegNeedsExtension)
{
char *zerobuf = palloc0(BLCKSZ);
@@ -1810,7 +1812,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
}
v->mdfd_chain = _mdfd_openseg(reln, forknum, +nextsegno, O_CREAT);
}
- else
+ else if (!thisSegNeedsExtension)
{
/* We won't create segment if not existent */
v->mdfd_chain = _mdfd_openseg(reln, forknum, nextsegno, 0);
On Thu, Apr 14, 2016 at 1:39 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
At 2016-04-12 09:00:57 -0400, robertmhaas@gmail.com wrote:
[...]
Do I understand correctly that the case of the "last open segment"
(i.e., the one for which mdfd_chain == NULL) not being RELSEG_SIZE
should not call _mdfd_openseg on nextsegno if behaviour is not
EXTENSION_CREATE or InRecovery?
And that "We won't create segment if not existent" should happen, but
doesn't only because the next segment file wasn't removed earlier, so
we have to add an extra check for that case?
In other words, is something like the following what's needed here, or
is there more to it?
Something like that is what I was thinking about, but I notice it has
the disadvantage of adding lseeks to cater to a shouldn't-happen
condition. Not sure if we should care.
My attempts to test things were also singularly unrewarding. Even
after messing with the filesystem in various ways, I couldn't figure
out exactly how this makes a difference.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Apr 12, 2016 at 09:00:57AM -0400, Robert Haas wrote:
On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
I can think of a number of relatively easy ways to address this:
1) Just zap (or issue?) all pending flush requests when getting an
smgrinval/smgrclosenode
2) Do 1), but filter for the closed relnode
3) Actually handle the case of the last open segment not being
RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
I'm kind of inclined to do both 3) and 1).
Andres, you own the underlying open item, right? Do you plan to implement
this fix you have outlined? If so, when?
#3 seems like it's probably about 15 years overdue, so let's do that anyway.
I don't quite understand why #1 fixes the problem - not because I
doubt you, but just because I haven't studied it much.