Why does PostgreSQL ftruncate before unlink?
When dropping lots of tables, I noticed PostgreSQL taking longer than
I would have expected.
strace seems to report that the largest contributor is the ftruncate
and not the unlink. I'm curious what the logic is behind using
ftruncate before unlink.
I'm using an ext4 filesystem.
--
Jon
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Fri, Feb 21, 2014 at 4:14 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> When dropping lots of tables, I noticed PostgreSQL taking longer than
> I would have expected.
> strace seems to report that the largest contributor is the ftruncate
> and not the unlink. I'm curious what the logic is behind using
> ftruncate before unlink.
> I'm using an ext4 filesystem.
I'm guessing that this is so that it can be rolled back. Unlink is
likely issued at commit.
On Sunday, February 23, 2014, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Fri, Feb 21, 2014 at 4:14 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>> When dropping lots of tables, I noticed PostgreSQL taking longer than
>> I would have expected.
>> strace seems to report that the largest contributor is the ftruncate
>> and not the unlink. I'm curious what the logic is behind using
>> ftruncate before unlink.
>> I'm using an ext4 filesystem.
> I'm guessing that this is so that it can be rolled back. Unlink is
> likely issued at commit.
I would hope that ftruncate is issued at commit as well. That doesn't
sound undoable.
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Sunday, February 23, 2014, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> I'm guessing that this is so that it can be rolled back. Unlink is
>> likely issued at commit.
> I would hope that ftruncate is issued at commit as well. That doesn't
> sound undoable.
It's more subtle than that. I'm too lazy to look at the comments in md.c
right now, but basically the reason for not doing an instant unlink is
to ensure that if a relation is truncated and then re-extended, open file
pointers held by other backends will still be valid. The ftruncate is
done to ensure that allocated disk space goes away as soon as that's safe
(ie, at commit of the truncation); but immediate unlink would require
forcing more cross-backend synchronization than we want to have.
If memory serves, the inode should get removed during the next checkpoint.
regards, tom lane
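The ordering described above (release the space by truncating once that is safe, remove the file name later) can be tried outside PostgreSQL. What follows is only a toy sketch in Python for a POSIX system, not the md.c code: the names truncate_now, demo and "segment" are made up for illustration, and a second file descriptor stands in for another backend that already had the file open.

```python
import os
import tempfile


def truncate_now(path):
    """Release the file's data immediately but keep its name in place."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.ftruncate(fd, 0)
    finally:
        os.close(fd)


def demo():
    """Return observations from a truncate-now, unlink-later sequence."""
    seen = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"x" * 8192)

        # Stand-in for another backend that already has the file open.
        other_fd = os.open(path, os.O_RDWR)
        try:
            truncate_now(path)  # the allocated space goes away here
            seen["size_after_truncate"] = os.fstat(other_fd).st_size
            seen["name_still_present"] = os.path.exists(path)

            # The descriptor opened earlier still refers to the named file,
            # so re-extending through it is visible under the same path.
            os.write(other_fd, b"y" * 100)
            seen["size_after_reextend"] = os.stat(path).st_size
        finally:
            os.close(other_fd)

        os.unlink(path)  # deferred removal of the name
        seen["name_present_after_unlink"] = os.path.exists(path)
    return seen


if __name__ == "__main__":
    print(demo())
```

The sketch only shows the file-level behaviour; it says nothing about when PostgreSQL actually issues either call.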
On Sun, Feb 23, 2014 at 9:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> On Sunday, February 23, 2014, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>>> I'm guessing that this is so that it can be rolled back. Unlink is
>>> likely issued at commit.
>> I would hope that ftruncate is issued at commit as well. That doesn't
>> sound undoable.
> It's more subtle than that. I'm too lazy to look at the comments in md.c
> right now, but basically the reason for not doing an instant unlink is
> to ensure that if a relation is truncated and then re-extended, open file
> pointers held by other backends will still be valid. The ftruncate is
> done to ensure that allocated disk space goes away as soon as that's safe
> (ie, at commit of the truncation); but immediate unlink would require
> forcing more cross-backend synchronization than we want to have.
> If memory serves, the inode should get removed during the next checkpoint.
I was moments away from commenting to say that I had traced the flow
of the code to md.c and found the comments there quite illuminating. I
wonder if there is a different way to solve the underlying issue
without relying on ftruncate (which seems to be somewhat expensive).
--
Jon
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> On Sun, Feb 23, 2014 at 9:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If memory serves, the inode should get removed during the next checkpoint.
> I was moments away from commenting to say that I had traced the flow
> of the code to md.c and found the comments there quite illuminating. I
> wonder if there is a different way to solve the underlying issue
> without relying on ftruncate (which seems to be somewhat expensive).
Hm. The code is designed the way it is on the assumption that ftruncate
doesn't do anything that unlink wouldn't have to do anyway. If it really
is significantly slower on popular filesystems, maybe we need to revisit
that.
regards, tom lane
On Sun, Feb 23, 2014 at 10:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> On Sun, Feb 23, 2014 at 9:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> If memory serves, the inode should get removed during the next checkpoint.
>> I was moments away from commenting to say that I had traced the flow
>> of the code to md.c and found the comments there quite illuminating. I
>> wonder if there is a different way to solve the underlying issue
>> without relying on ftruncate (which seems to be somewhat expensive).
> Hm. The code is designed the way it is on the assumption that ftruncate
> doesn't do anything that unlink wouldn't have to do anyway. If it really
> is significantly slower on popular filesystems, maybe we need to revisit
> that.
Here is an example.
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.95    3.207681        4182       767           ftruncate
  0.05    0.001579           1      2428      2301 unlink
--
Jon
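For anyone wanting to check the arithmetic in that summary, here is a small sketch in Python that parses the two rows as posted and recomputes the per-call cost. The function name parse_summary and the dictionary keys are invented for this example, and the parser only handles rows shaped like the ones above.

```python
SUMMARY = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.95    3.207681        4182       767           ftruncate
  0.05    0.001579           1      2428      2301 unlink
"""


def parse_summary(text):
    """Parse data rows of a syscall summary into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric "% time" field; skip the header
        # and the dashed separator.
        if len(fields) < 5 or not fields[0].replace(".", "", 1).isdigit():
            continue
        seconds = float(fields[1])
        calls = int(fields[3])
        # The errors column is blank when no call failed.
        errors = int(fields[4]) if len(fields) == 6 else 0
        rows.append({
            "syscall": fields[-1],
            "seconds": seconds,
            "calls": calls,
            "errors": errors,
            "per_call_us": seconds / calls * 1e6,
        })
    return rows


if __name__ == "__main__":
    for row in parse_summary(SUMMARY):
        ok = row["calls"] - row["errors"]
        print(f'{row["syscall"]}: {row["per_call_us"]:.2f} us/call, '
              f'{ok} of {row["calls"]} calls succeeded')
```

Two things stand out from the numbers as posted: each ftruncate averaged roughly 4.2 ms, and 2301 of the 2428 unlink calls returned an error, so only 127 unlinks succeeded against 767 ftruncates. The summary does not say why those unlinks failed.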
On Mon, Feb 24, 2014 at 6:38 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> Here is an example.
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  99.95    3.207681        4182       767           ftruncate
>   0.05    0.001579           1      2428      2301 unlink
Are these times for unlink after ftruncate? Because (in Linux, which is
the one I use on my desktops and am familiar with) unlinks of big
files are slow too, so for a more meaningful comparison you would
need to time ftruncate+unlink against a plain unlink of the same files.
IIRC they take nearly equal time.
Francisco Olarte.
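The comparison suggested above could be set up roughly as follows. This is a minimal sketch in Python, assuming a POSIX system; the helper names, file names, file size and file count are arbitrary choices for illustration, and the timings include the open/close needed to call ftruncate. Results will depend heavily on the filesystem, the file sizes and what is cached, so no numbers are claimed here.

```python
import os
import tempfile
import time


def make_file(path, size):
    """Write `size` bytes of data and flush them to disk."""
    with open(path, "wb") as f:
        f.write(b"\0" * size)
        f.flush()
        os.fsync(f.fileno())


def timed_plain_unlink(path):
    """Time a bare unlink of a file that still holds its data."""
    start = time.perf_counter()
    os.unlink(path)
    return time.perf_counter() - start


def timed_truncate_then_unlink(path):
    """Time ftruncate to zero length followed by unlink."""
    start = time.perf_counter()
    fd = os.open(path, os.O_RDWR)
    try:
        os.ftruncate(fd, 0)
    finally:
        os.close(fd)
    os.unlink(path)
    return time.perf_counter() - start


def compare(directory, size=16 * 1024 * 1024, count=5):
    """Return total seconds spent in each strategy over `count` files."""
    strategies = (
        ("unlink", timed_plain_unlink),
        ("ftruncate+unlink", timed_truncate_then_unlink),
    )
    totals = {label: 0.0 for label, _ in strategies}
    for i in range(count):
        for n, (label, timer) in enumerate(strategies):
            path = os.path.join(directory, f"file-{n}-{i}.dat")
            make_file(path, size)
            totals[label] += timer(path)
    return totals


if __name__ == "__main__":
    # Point `dir=` at the filesystem under test (ext4 in this thread).
    with tempfile.TemporaryDirectory() as tmp:
        for label, seconds in compare(tmp).items():
            print(f"{label}: {seconds:.6f} s")
```

Timing both strategies on identically created files keeps the comparison like for like, which the strace summary alone cannot do since its unlink calls ran on files that had already been truncated.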