BUG #15427: DROP INDEX did not free up disk space

Started by PG Bug reporting formover 7 years ago6 messagesbugs

noreply@postgresql.org

over 7 years ago

The following bug has been logged on the website:

Bug reference: 15427
Logged by: Andrew P
Email address: ap@zip.com.au
PostgreSQL version: 10.5
Operating system: Debian Sarge (9)
Description:

Performed a DROP INDEX to free up disk space but, whilst the command
succeeded the disk space was not freed up.

The index was:

CREATE INDEX radlelink_rlreid_idx ON radlelink (rlid, reid);

Config of table:

CREATE TABLE radlelink (
rlid BIGINT REFERENCES radlogins (id) ON DELETE
CASCADE,
reid BIGINT REFERENCES radextra (id) ON DELETE CASCADE
);

Ran 'lsof -nP +L1' on the server as per suggestion on postgresql irc channel
and this was the (abridged output):

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
postgres 2633 postgres 197u REG 252,5 16777216 0 3410028
/var/lib/postgresql/10/main/pg_wal/00000001000003960000006E (deleted)
postgres 2633 postgres 325u REG 252,5 0 0 3409445
/var/lib/postgresql/10/main/base/16562/17437 (deleted)
postgres 2633 postgres 326u REG 252,5 1073741824 0 3409702
/var/lib/postgresql/10/main/base/16562/17437.1 (deleted)
postgres 2633 postgres 327u REG 252,5 1073741824 0 3409704
/var/lib/postgresql/10/main/base/16562/17437.2 (deleted)
postgres 2633 postgres 328u REG 252,5 1073741824 0 3409705
/var/lib/postgresql/10/main/base/16562/17437.3 (deleted)
postgres 2633 postgres 329u REG 252,5 1073741824 0 3409452
/var/lib/postgresql/10/main/base/16562/17437.4 (deleted)
postgres 2633 postgres 330u REG 252,5 1073741824 0 3409709
/var/lib/postgresql/10/main/base/16562/17437.5 (deleted)
postgres 2633 postgres 331u REG 252,5 1073741824 0 3409710
/var/lib/postgresql/10/main/base/16562/17437.6 (deleted)
postgres 2633 postgres 332u REG 252,5 1073741824 0 3409706
/var/lib/postgresql/10/main/base/16562/17437.7 (deleted)
postgres 2633 postgres 333u REG 252,5 1073741824 0 3409722
/var/lib/postgresql/10/main/base/16562/17437.8 (deleted)
postgres 2633 postgres 334u REG 252,5 1073741824 0 3409724
/var/lib/postgresql/10/main/base/16562/17437.9 (deleted)
postgres 2633 postgres 335u REG 252,5 1073741824 0 3409716
/var/lib/postgresql/10/main/base/16562/17437.10 (deleted)
...
postgres 2633 postgres 403u REG 252,5 1073741824 0 3409970
/var/lib/postgresql/10/main/base/16562/17437.78 (deleted)
postgres 2633 postgres 404u REG 252,5 1073741824 0 3409976
/var/lib/postgresql/10/main/base/16562/17437.79 (deleted)
postgres 2633 postgres 405u REG 252,5 1073741824 0 3409969
/var/lib/postgresql/10/main/base/16562/17437.80 (deleted)
postgres 2633 postgres 406u REG 252,5 1073741824 0 3409983
/var/lib/postgresql/10/main/base/16562/17437.81 (deleted)
postgres 2633 postgres 407u REG 252,5 1073741824 0 3409984
/var/lib/postgresql/10/main/base/16562/17437.82 (deleted)
postgres 2633 postgres 408u REG 252,5 1073741824 0 3409981
/var/lib/postgresql/10/main/base/16562/17437.83 (deleted)
postgres 2633 postgres 409u REG 252,5 1073741824 0 3410002
/var/lib/postgresql/10/main/base/16562/17437.84 (deleted)
postgres 2633 postgres 410u REG 252,5 1073741824 0 3410012
/var/lib/postgresql/10/main/base/16562/17437.85 (deleted)
postgres 2633 postgres 411u REG 252,5 1073741824 0 3409991
/var/lib/postgresql/10/main/base/16562/17437.86 (deleted)
postgres 2633 postgres 412u REG 252,5 1073741824 0 3410025
/var/lib/postgresql/10/main/base/16562/17437.87 (deleted)
postgres 2633 postgres 413u REG 252,5 264241152 0 3410026
/var/lib/postgresql/10/main/base/16562/17437.88 (deleted)

PID 2633 was:
postgres 2633 12.5 0.4 305868 158632 ? Ss Oct09 544:04 postgres:
10/main: postgres radlogs [local] idle

I executed 'SELECT 1;' in that session (it was a psql session) and 5s later
it returned, the files were closed and the disk space freed.

There were also two autovacuum processes up at the time:
postgres 29102 2.4 0.2 355320 76724 ? Ss 13:09 0:45 postgres:
10/main: autovacuum worker process radlogs
postgres 29378 16.0 0.1 355312 37976 ? Ss 13:40 0:00 postgres:
10/main: autovacuum worker process radlogs

Hope this helps.

Andres Freund

andres@anarazel.de

over 7 years ago

In reply to: PG Bug reporting form (#1)

Re: BUG #15427: DROP INDEX did not free up disk space

Hi,

On 2018-10-12 03:27:28 +0000, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 15427
Logged by: Andrew P
Email address: ap@zip.com.au
PostgreSQL version: 10.5
Operating system: Debian Sarge (9)
Description:

Performed a DROP INDEX to free up disk space but, whilst the command
succeeded the disk space was not freed up.

The index was:

CREATE INDEX radlelink_rlreid_idx ON radlelink (rlid, reid);

Config of table:

CREATE TABLE radlelink (
rlid BIGINT REFERENCES radlogins (id) ON DELETE
CASCADE,
reid BIGINT REFERENCES radextra (id) ON DELETE CASCADE
);

Ran 'lsof -nP +L1' on the server as per suggestion on postgresql irc channel
and this was the (abridged output):

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
postgres 2633 postgres 197u REG 252,5 16777216 0 3410028
/var/lib/postgresql/10/main/pg_wal/00000001000003960000006E (deleted)
postgres 2633 postgres 325u REG 252,5 0 0 3409445
/var/lib/postgresql/10/main/base/16562/17437 (deleted)
postgres 2633 postgres 326u REG 252,5 1073741824 0 3409702
/var/lib/postgresql/10/main/base/16562/17437.1 (deleted)
postgres 2633 postgres 327u REG 252,5 1073741824 0 3409704
/var/lib/postgresql/10/main/base/16562/17437.2 (deleted)
postgres 2633 postgres 328u REG 252,5 1073741824 0 3409705
/var/lib/postgresql/10/main/base/16562/17437.3 (deleted)
postgres 2633 postgres 329u REG 252,5 1073741824 0 3409452
/var/lib/postgresql/10/main/base/16562/17437.4 (deleted)
postgres 2633 postgres 330u REG 252,5 1073741824 0 3409709
/var/lib/postgresql/10/main/base/16562/17437.5 (deleted)
postgres 2633 postgres 331u REG 252,5 1073741824 0 3409710
/var/lib/postgresql/10/main/base/16562/17437.6 (deleted)
postgres 2633 postgres 332u REG 252,5 1073741824 0 3409706
/var/lib/postgresql/10/main/base/16562/17437.7 (deleted)
postgres 2633 postgres 333u REG 252,5 1073741824 0 3409722
/var/lib/postgresql/10/main/base/16562/17437.8 (deleted)
postgres 2633 postgres 334u REG 252,5 1073741824 0 3409724
/var/lib/postgresql/10/main/base/16562/17437.9 (deleted)
postgres 2633 postgres 335u REG 252,5 1073741824 0 3409716
/var/lib/postgresql/10/main/base/16562/17437.10 (deleted)
...
postgres 2633 postgres 403u REG 252,5 1073741824 0 3409970
/var/lib/postgresql/10/main/base/16562/17437.78 (deleted)
postgres 2633 postgres 404u REG 252,5 1073741824 0 3409976
/var/lib/postgresql/10/main/base/16562/17437.79 (deleted)
postgres 2633 postgres 405u REG 252,5 1073741824 0 3409969
/var/lib/postgresql/10/main/base/16562/17437.80 (deleted)
postgres 2633 postgres 406u REG 252,5 1073741824 0 3409983
/var/lib/postgresql/10/main/base/16562/17437.81 (deleted)
postgres 2633 postgres 407u REG 252,5 1073741824 0 3409984
/var/lib/postgresql/10/main/base/16562/17437.82 (deleted)
postgres 2633 postgres 408u REG 252,5 1073741824 0 3409981
/var/lib/postgresql/10/main/base/16562/17437.83 (deleted)
postgres 2633 postgres 409u REG 252,5 1073741824 0 3410002
/var/lib/postgresql/10/main/base/16562/17437.84 (deleted)
postgres 2633 postgres 410u REG 252,5 1073741824 0 3410012
/var/lib/postgresql/10/main/base/16562/17437.85 (deleted)
postgres 2633 postgres 411u REG 252,5 1073741824 0 3409991
/var/lib/postgresql/10/main/base/16562/17437.86 (deleted)
postgres 2633 postgres 412u REG 252,5 1073741824 0 3410025
/var/lib/postgresql/10/main/base/16562/17437.87 (deleted)
postgres 2633 postgres 413u REG 252,5 264241152 0 3410026
/var/lib/postgresql/10/main/base/16562/17437.88 (deleted)

PID 2633 was:
postgres 2633 12.5 0.4 305868 158632 ? Ss Oct09 544:04 postgres:
10/main: postgres radlogs [local] idle

I executed 'SELECT 1;' in that session (it was a psql session) and 5s later
it returned, the files were closed and the disk space freed.

There were also two autovacuum processes up at the time:
postgres 29102 2.4 0.2 355320 76724 ? Ss 13:09 0:45 postgres:
10/main: autovacuum worker process radlogs
postgres 29378 16.0 0.1 355312 37976 ? Ss 13:40 0:00 postgres:
10/main: autovacuum worker process radlogs

Hope this helps.

The problem here is that even though we send a invalidation message to
all backends, there's nothing to force backends to process invalidation
messages in time if they're idle. Sure, at some point, when
independently enough inval messages have been created, we'll send out a
catchup interrupt. But that's not necessarily going to be that soon in
a production database.

ISTM that we need to force catchup interrupts in a few cases, like when
smgr invals for truncation.

Greetings,

Andres Freund

Tom Lane

tgl@sss.pgh.pa.us

over 7 years ago

In reply to: Andres Freund (#2)

Re: BUG #15427: DROP INDEX did not free up disk space

Andres Freund <andres@anarazel.de> writes:

On 2018-10-12 03:27:28 +0000, PG Bug reporting form wrote:

Performed a DROP INDEX to free up disk space but, whilst the command
succeeded the disk space was not freed up.

The problem here is that even though we send a invalidation message to
all backends, there's nothing to force backends to process invalidation
messages in time if they're idle.

Uh, what's that got to do with it?

My recollection (though I've not looked at the code) is that we
truncate the file to ensure disk space is freed, whether or not
all backends have closed their FDs for the file. This is quite
independent of the sinval mechanism.

I'm suspicious that this report reflects some weird behavior
of a non-mainline filesystem ...

regards, tom lane

Andres Freund

andres@anarazel.de

over 7 years ago

In reply to: Tom Lane (#3)

Re: BUG #15427: DROP INDEX did not free up disk space

Hi,

On 2018-10-11 23:57:16 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2018-10-12 03:27:28 +0000, PG Bug reporting form wrote:

Performed a DROP INDEX to free up disk space but, whilst the command
succeeded the disk space was not freed up.

The problem here is that even though we send a invalidation message to
all backends, there's nothing to force backends to process invalidation
messages in time if they're idle.

Uh, what's that got to do with it?

If you look at the bugreport: As soon as the op, on my suggestion,
triggered sinval processing (by issuing a SELECT 1;) the space was
freed. So clearly the open FDs were part of the problem.

My recollection (though I've not looked at the code) is that we
truncate the file to ensure disk space is freed, whether or not
all backends have closed their FDs for the file.

That's not quite how we do it tho. In mdunlinkfork() we truncate the
*first* segment, but all further segments are unlink()ed directly. For
that to work, sinvals need to be processed. So

This is quite independent of the sinval mechanism.

is clearly currently not correct.

I'd argue that even if we were to change mdunlinkfork() so it always
truncates and then unlinks, we should *still* make sure that sinval
messages for things like smgrdounlinkfork() are processed in some
bounded time. Consider e.g. that you might drop a table + tablespace and
then would want to unmount - this'd prevent that.

Greetings,

Andres Freund

Tom Lane

tgl@sss.pgh.pa.us

over 7 years ago

In reply to: Andres Freund (#4)

Re: BUG #15427: DROP INDEX did not free up disk space

Andres Freund <andres@anarazel.de> writes:

On 2018-10-11 23:57:16 -0400, Tom Lane wrote:

Uh, what's that got to do with it?

If you look at the bugreport: As soon as the op, on my suggestion,
triggered sinval processing (by issuing a SELECT 1;) the space was
freed. So clearly the open FDs were part of the problem.

TBH, I think the space-freeup was more likely driven off a background
checkpoint completion, which is where the truncation happens.

regards, tom lane

Andres Freund

andres@anarazel.de

over 7 years ago

In reply to: Tom Lane (#5)

Re: BUG #15427: DROP INDEX did not free up disk space

Hi,

On 2018-10-12 00:33:14 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2018-10-11 23:57:16 -0400, Tom Lane wrote:

Uh, what's that got to do with it?

If you look at the bugreport: As soon as the op, on my suggestion,
triggered sinval processing (by issuing a SELECT 1;) the space was
freed. So clearly the open FDs were part of the problem.

TBH, I think the space-freeup was more likely driven off a background
checkpoint completion, which is where the truncation happens.

Uh, as I wrote, mdunlinkfork(), which backs dropping an index via
index_drop()->RelationDropStorage() and then
smgrDoPendingDeletes()->smgrdounlinkall()->mdunlink()->mdunlinkfork(),
unlinks all segments beyond the first itself:

static void
mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
char *path;
int ret;

path = relpath(rnode, forkNum);

/*
* Delete or truncate the first segment.
*/
if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
{
ret = unlink(path);
if (ret < 0 && errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
else
{
/* truncate(2) would be easier here, but Windows hasn't got it */
int fd;

fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
if (fd >= 0)
{
int save_errno;

ret = ftruncate(fd, 0);
save_errno = errno;
CloseTransientFile(fd);
errno = save_errno;
}
else
ret = -1;
if (ret < 0 && errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
errmsg("could not truncate file \"%s\": %m", path)));

/* Register request to unlink first segment later */
register_unlink(rnode);
}

/*
* Delete any additional segments.
*/
if (ret >= 0)
{
char *segpath = (char *) palloc(strlen(path) + 12);
BlockNumber segno;

/*
* Note that because we loop until getting ENOENT, we will correctly
* remove all inactive segments as well as active ones.
*/
for (segno = 1;; segno++)
{
sprintf(segpath, "%s.%u", path, segno);
if (unlink(segpath) < 0)
{
/* ENOENT is expected after the last segment... */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", segpath)));
break;
}
}
pfree(segpath);
}

pfree(path);
}

As you clearly can see, unlink() is called directly here for anything
but the first segment (which is registered to be unlinked in
checkpointer via register_unlink()).

Note that the checkpointer based machinery doesn't even *support*
unlinking anything beyond the first segment:

void
mdpostckpt(void)
{
...
while (pendingUnlinks != NIL)
...
/* Unlink the file */
path = relpathperm(entry->rnode, MAIN_FORKNUM);
if (unlink(path) < 0)

there's no segment handling here.

You're right that mdtruncate() leaves later segments around in a
truncated manner. But that's because of an orthogonal concern:
* The full and partial segments are collectively the "active" segments.
* Inactive segments are those that once contained data but are currently
* not needed because of an mdtruncate() operation. The reason for leaving
* them present at size zero, rather than unlinking them, is that other
* backends and/or the checkpointer might be holding open file references to
* such segments. If the relation expands again after mdtruncate(), such
* that a deactivated segment becomes active again, it is important that
* such file references still be valid --- else data might get written
* out to an unlinked old copy of a segment file that will eventually
* disappear.

that doesn't apply to dropping relations.

Greetings,

Andres Freund