Fix for large file support
The current version of PostgreSQL supports only 1GB chunks. This limit is
defined in the pg_config_manual.h header file, though the setting itself
allows chunks of at most 2GB. The main problem is that the md storage
manager and buffile use the "long" data type (32 bits) for offsets
instead of "off_t", defined in <sys/types.h>.
off_t is 32 bits long on a 32-bit OS, and 64 bits long on a 64-bit OS or
when the application is compiled with large file support.
The attached patch allows chunks bigger than 4GB to be configured on an
OS with large file support.
I tested it on a 7GB table and it works.
Please look at it and let me know your comments, or whether I missed
something.
TODO/questions:
1) Clean up/update the comments about the limitation.
2) Is there some documentation to update?
3) I would like to add a check comparing sizeof(off_t) with the chunk
size setting, to protect postgres against a misconfigured chunk size. Is
mdinit() a good place for this check?
4) I'm going to get a bigger machine to test with a really big table.
with regards Zdenek
Attachments:
largefile.diff (text/x-patch, +83 −80)
Zdenek Kotala wrote:
The current version of PostgreSQL supports only 1GB chunks. This limit is
defined in the pg_config_manual.h header file, though the setting itself
allows chunks of at most 2GB. The main problem is that the md storage
manager and buffile use the "long" data type (32 bits) for offsets
instead of "off_t", defined in <sys/types.h>. off_t is 32 bits long on a
32-bit OS, and 64 bits long on a 64-bit OS or when the application is
compiled with large file support. The attached patch allows chunks
bigger than 4GB to be configured on an OS with large file support. I
tested it on a 7GB table and it works.
What does it actually buy us, though? Does it mean the maximum field
size will grow beyond 1Gb? Or give better performance?
cheers
andrew
Andrew Dunstan wrote:
Does it mean the maximum field size will grow beyond 1Gb?
No. Because it is limited by varlena size. See
http://www.postgresql.org/docs/8.2/interactive/storage-toast.html
Or give better performance?
Yes. The list of chunks is stored as a linked list, and for some
operations (e.g. extending the relation) all chunks are opened and their
sizes checked. On big tables this takes some time. For example, if you
have a 1TB table and you want to add a new block, you must go and open
all 1024 files.
By the way, the ./configure script performs a check for __LARGE_FILE_
support, but it looks like it is not used anywhere.
There could be a small time penalty from 64-bit arithmetic. However,
that happens only if large file support is enabled on a 32-bit OS.
Zdenek
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
Andrew Dunstan wrote:
Or give better performance?
Yes. The list of chunks is stored as a linked list, and for some
operations (e.g. extending the relation) all chunks are opened and their
sizes checked. On big tables this takes some time. For example, if you
have a 1TB table and you want to add a new block, you must go and open
all 1024 files.
Indeed, but that would be far more effectively addressed by fixing the
*other* code path that doesn't segment at all (the
LET_OS_MANAGE_FILESIZE option, which is most likely broken these days
for lack of testing). I don't see the point of a halfway measure like
increasing RELSEG_SIZE.
regards, tom lane
Tom Lane wrote:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
Indeed, but that would be far more effectively addressed by fixing the
*other* code path that doesn't segment at all (the
LET_OS_MANAGE_FILESIZE option, which is most likely broken these days
for lack of testing). I don't see the point of a halfway measure like
increasing RELSEG_SIZE.
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem with
this option: the size of the offset. I went through the code and did not
see any other problem there. However, as you mentioned, it needs more
testing. I am going to get a server with a large disk array and test it.
I would like to add an --enable-largefile switch to configure, to make
this accessible to a wider group of users. What do you think about it?
Zdenek
[ redirecting to -hackers for wider comment ]
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
Tom Lane wrote:
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem with
this option: the size of the offset. I went through the code and did not
see any other problem there. However, as you mentioned, it needs more
testing. I am going to get a server with a large disk array and test it.
I would like to add an --enable-largefile switch to configure, to make
this accessible to a wider group of users. What do you think about it?
Yeah, I was going to suggest the same thing --- but not with that switch
name. We already use enable/disable-largefile to control whether 64-bit
file access is built at all (this mostly affects pg_dump at the moment).
I think the clearest way might be to flip the sense of the variable.
I never found "LET_OS_MANAGE_FILESIZE" to be a good name anyway. I'd
suggest "USE_SEGMENTED_FILES", which defaults to "on", and you can
turn it off via --disable-segmented-files if configure confirms your
OS has largefile support (thus you could not specify both this and
--disable-largefile).
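Under that naming, usage would presumably look like this (hypothetical
invocations sketching the proposed semantics; the flags only exist once
the patch lands):

```shell
# Default build: segmented relations, 1GB segments.
./configure

# Non-segmented relations; only valid when configure confirms the OS
# has large file support.
./configure --disable-segmented-files

# Invalid combination: cannot drop segmentation while also disabling
# 64-bit file access.
./configure --disable-segmented-files --disable-largefile   # rejected
```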
regards, tom lane
Tom Lane wrote:
[ redirecting to -hackers for wider comment ]
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
Tom Lane wrote:
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem with
this option: the size of the offset. I went through the code and did not
see any other problem there. However, as you mentioned, it needs more
testing. I am going to get a server with a large disk array and test it.
I would like to add an --enable-largefile switch to configure, to make
this accessible to a wider group of users. What do you think about it?
Yeah, I was going to suggest the same thing --- but not with that switch
name. We already use enable/disable-largefile to control whether 64-bit
file access is built at all (this mostly affects pg_dump at the moment).
hmm :( It looks like the ./configure largefile detection does not work
on Solaris.
I think the clearest way might be to flip the sense of the variable.
I never found "LET_OS_MANAGE_FILESIZE" to be a good name anyway. I'd
suggest "USE_SEGMENTED_FILES", which defaults to "on", and you can
turn it off via --disable-segmented-files if configure confirms your
OS has largefile support (thus you could not specify both this and
--disable-largefile).
It sounds good. There is one thing to clarify (for the present): how
should buffile be handled? It does not currently support non-segmented
files. I suggest using the same switch to enable/disable segments there.
Zdenek
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
It sounds good. There is one thing to clarify (for the present): how
should buffile be handled? It does not currently support non-segmented
files. I suggest using the same switch to enable/disable segments there.
Do you think it really matters? terabyte-sized temp files seem a bit
unlikely, and anyway I don't think the performance argument applies;
md.c's tendency to open all the files at once is irrelevant.
regards, tom lane
If we expose LET_OS_MANAGE_FILESIZE, should we add a flag to the
control file so that you can't start a backend that has that defined
against a cluster that was initialized without it?
On Apr 6, 2007, at 2:45 PM, Tom Lane wrote:
[ redirecting to -hackers for wider comment ]
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
Tom Lane wrote:
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem with
this option: the size of the offset. I went through the code and did not
see any other problem there. However, as you mentioned, it needs more
testing. I am going to get a server with a large disk array and test it.
I would like to add an --enable-largefile switch to configure, to make
this accessible to a wider group of users. What do you think about it?
Yeah, I was going to suggest the same thing --- but not with that switch
name. We already use enable/disable-largefile to control whether 64-bit
file access is built at all (this mostly affects pg_dump at the moment).
I think the clearest way might be to flip the sense of the variable.
I never found "LET_OS_MANAGE_FILESIZE" to be a good name anyway. I'd
suggest "USE_SEGMENTED_FILES", which defaults to "on", and you can
turn it off via --disable-segmented-files if configure confirms your
OS has largefile support (thus you could not specify both this and
--disable-largefile).
regards, tom lane
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim Nasby <decibel@decibel.org> writes:
If we expose LET_OS_MANAGE_FILESIZE, should we add a flag to the
control file so that you can't start a backend that has that defined
against a cluster that was initialized without it?
I imagine we'd flag that as relsegsize = 0 or some such.
regards, tom lane
Tom Lane wrote:
Jim Nasby <decibel@decibel.org> writes:
If we expose LET_OS_MANAGE_FILESIZE, should we add a flag to the
control file so that you can't start a backend that has that defined
against a cluster that was initialized without it?
I imagine we'd flag that as relsegsize = 0 or some such.
Yes, I have it in my patch. I put relsegsize = 0 in the control file
when non-segmentation mode is enabled.
Zdenek
This has been saved for the 8.4 release:
http://momjian.postgresql.org/cgi-bin/pgpatches_hold
---------------------------------------------------------------------------
Tom Lane wrote:
[ redirecting to -hackers for wider comment ]
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
Tom Lane wrote:
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem with
this option: the size of the offset. I went through the code and did not
see any other problem there. However, as you mentioned, it needs more
testing. I am going to get a server with a large disk array and test it.
I would like to add an --enable-largefile switch to configure, to make
this accessible to a wider group of users. What do you think about it?
Yeah, I was going to suggest the same thing --- but not with that switch
name. We already use enable/disable-largefile to control whether 64-bit
file access is built at all (this mostly affects pg_dump at the moment).
I think the clearest way might be to flip the sense of the variable.
I never found "LET_OS_MANAGE_FILESIZE" to be a good name anyway. I'd
suggest "USE_SEGMENTED_FILES", which defaults to "on", and you can
turn it off via --disable-segmented-files if configure confirms your
OS has largefile support (thus you could not specify both this and
--disable-largefile).
regards, tom lane
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote:
[ redirecting to -hackers for wider comment ]
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
Tom Lane wrote:
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem with
this option: the size of the offset. I went through the code and did not
see any other problem there. However, as you mentioned, it needs more
testing. I am going to get a server with a large disk array and test it.
I would like to add an --enable-largefile switch to configure, to make
this accessible to a wider group of users. What do you think about it?
Yeah, I was going to suggest the same thing --- but not with that switch
name. We already use enable/disable-largefile to control whether 64-bit
file access is built at all (this mostly affects pg_dump at the moment).
I think the clearest way might be to flip the sense of the variable.
I never found "LET_OS_MANAGE_FILESIZE" to be a good name anyway. I'd
suggest "USE_SEGMENTED_FILES", which defaults to "on", and you can
turn it off via --disable-segmented-files if configure confirms your
OS has largefile support (thus you could not specify both this and
--disable-largefile).
Here is the latest version of the non-segment support patch. I changed
LET_OS_MANAGE_FILESIZE to USE_SEGMENTED_FILES and added a
--disable-segmented-files switch to configure. I kept the tuplestore
behavior; it still splits files in both modes.
I also cleaned up some other data types a little (e.g. int -> mode_t).
Autoconf and autoheader must be run after applying the patch.
I tested it with a 9GB table and both modes work fine.
Please let me know your comments.
Zdenek
This has been saved for the 8.4 release:
http://momjian.postgresql.org/cgi-bin/pgpatches_hold
---------------------------------------------------------------------------
Zdenek Kotala wrote:
Tom Lane wrote:
[ redirecting to -hackers for wider comment ]
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
Tom Lane wrote:
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem with
this option: the size of the offset. I went through the code and did not
see any other problem there. However, as you mentioned, it needs more
testing. I am going to get a server with a large disk array and test it.
I would like to add an --enable-largefile switch to configure, to make
this accessible to a wider group of users. What do you think about it?
Yeah, I was going to suggest the same thing --- but not with that switch
name. We already use enable/disable-largefile to control whether 64-bit
file access is built at all (this mostly affects pg_dump at the moment).
I think the clearest way might be to flip the sense of the variable.
I never found "LET_OS_MANAGE_FILESIZE" to be a good name anyway. I'd
suggest "USE_SEGMENTED_FILES", which defaults to "on", and you can
turn it off via --disable-segmented-files if configure confirms your
OS has largefile support (thus you could not specify both this and
--disable-largefile).
Here is the latest version of the non-segment support patch. I changed
LET_OS_MANAGE_FILESIZE to USE_SEGMENTED_FILES and added a
--disable-segmented-files switch to configure. I kept the tuplestore
behavior; it still splits files in both modes.
I also cleaned up some other data types a little (e.g. int -> mode_t).
Autoconf and autoheader must be run after applying the patch.
I tested it with a 9GB table and both modes work fine.
Please let me know your comments.
Zdenek
[ application/x-gzip is not supported, skipping... ]
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
Here is the latest version of the non-segment support patch. I changed
LET_OS_MANAGE_FILESIZE to USE_SEGMENTED_FILES and added a
--disable-segmented-files switch to configure. I kept the tuplestore
behavior; it still splits files in both modes.
Applied with minor corrections.
regards, tom lane
Tom Lane wrote:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
Here is the latest version of the non-segment support patch. I changed
LET_OS_MANAGE_FILESIZE to USE_SEGMENTED_FILES and added a
--disable-segmented-files switch to configure. I kept the tuplestore
behavior; it still splits files in both modes.
Applied with minor corrections.
Why is this not the default when supported? I am wondering both from the
point of view of the user, and in terms of development direction.
Peter Eisentraut <peter_e@gmx.net> writes:
Tom Lane wrote:
Applied with minor corrections.
Why is this not the default when supported?
Fear.
Maybe eventually, but right now I think it's too risky.
One point that I already found out the hard way is that sizeof(off_t) = 8
does not guarantee the availability of largefile support; there can also
be filesystem-level constraints, and perhaps other things we know not of
at this point.
I think this needs to be treated as experimental until it's got a few
more than zero miles under its belt. I wouldn't be too surprised to
find that we have to implement it as a run-time switch instead of
compile-time, in order to not fail miserably when somebody sticks a
tablespace on an archaic filesystem.
regards, tom lane
Peter Eisentraut wrote:
Tom Lane wrote:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
Here is the latest version of the non-segment support patch. I changed
LET_OS_MANAGE_FILESIZE to USE_SEGMENTED_FILES and added a
--disable-segmented-files switch to configure. I kept the tuplestore
behavior; it still splits files in both modes.
Applied with minor corrections.
Why is this not the default when supported? I am wondering both from the
point of view of the user, and in terms of development direction.
Also it would get more buildfarm coverage if it were default. If it
breaks something we'll notice earlier.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
Also it would get more buildfarm coverage if it were default. If it
breaks something we'll notice earlier.
Since nothing the regression tests do even approaches 1GB, the odds that
the buildfarm will notice problems are approximately zero.
regards, tom lane
Tom Lane wrote:
I think this needs to be treated as experimental until it's got a few
more than zero miles under its belt.
OK, then maybe we should document that.
I wouldn't be too surprised to
find that we have to implement it as a run-time switch instead of
compile-time, in order to not fail miserably when somebody sticks a
tablespace on an archaic filesystem.
Yes, that sounds quite useful. Let's wait and see what happens.