Problems with >2GB tables on Linux 2.0
Hi,
I saw a message a couple of weeks ago from someone having problems with
larger than 2GB tables. I have similar problems.
PostgreSQL: anon-cvs as of today (2/5/1999)
OS: Redhat Linux 5.2 (running 2.0.35)
I created a database called mcrl, and a table called mcrl3_1.
I copied in a set of 450MB of data twice (which comes to a pg file size of 2.4GB or so).
When it hit 2GB I got this message:
mcrl=> copy mcrl3_1 FROM '/home/gjerde/mcrl/MCR3_1.txt';
ERROR: mcrl3_1: cannot extend
The table file looks like this:
[postgres@snowman mcrl]$ ls -l mcrl*
-rw------- 1 postgres postgres 2147482624 Feb 5 16:49 mcrl3_1
It did NOT create the .1 file however, which I did see when I tried this on 6.4.2 (but still didn't work).
I looked around in the code (specifically src/backend/storage/smgr/*.c), but couldn't figure too much of it out. I'll have to figure out how postgres handles the database files first.
Hope this helps,
Ole Gjerde
I just found out that you guys have been discussing this problem... oops...
I looked through the mailing-list archive, and didn't find any posts, but February isn't in the archive yet :)
Ole Gjerde
I may be dating myself really badly here, but isn't there a hard limit on the file system at 2Gig? I thought the file size attribute in Unix is represented as a 32 bit signed long, which has a maximum value of 2147483647. If I'm right, it means the problem is fundamentally with the file system, not with Postgres, and you won't solve this unless the OS supports larger files.
--
------------------------------------------------------------
Thomas Reinke Tel: (416) 460-7021
Director of Technology Fax: (416) 598-2319
E-Soft Inc. http://www.e-softinc.com
On Sat, 6 Feb 1999, Thomas Reinke wrote:
I may be dating myself really badly here, but isn't there a hard limit on the file system at 2Gig?
PostgreSQL has internal code that is supposed to automagically break up a table into 2GB chunks so that this isn't a problem...
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
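The chunking Marc refers to lives in src/backend/storage/smgr/md.c. As a minimal, self-contained sketch of the idea (not the actual md.c code), a block number is mapped to a segment file and a block within that segment, so a relation is spread over relname, relname.1, relname.2, ... and no single file has to pass the OS's 2GB ceiling. The block number below is arbitrary; RELSEG_SIZE is the 262144-block value the tree shipped with at the time.

#include <stdio.h>

#define BLCKSZ      8192                        /* default block size */
#define RELSEG_SIZE ((8388608 / BLCKSZ) * 256)  /* 262144 blocks = 2GB as shipped */

int main(void)
{
    long blkno  = 300000;                  /* some block of a large relation */
    long segno  = blkno / RELSEG_SIZE;     /* which file: relname, relname.1, ... */
    long segblk = blkno % RELSEG_SIZE;     /* block within that segment file */

    if (segno == 0)
        printf("block %ld -> base file, byte offset %ld\n",
               blkno, segblk * (long) BLCKSZ);
    else
        printf("block %ld -> segment .%ld, byte offset %ld\n",
               blkno, segno, segblk * (long) BLCKSZ);
    return 0;
}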
Thomas Reinke wrote:
I may be dating myself really badly here, but isn't there a hard limit on the file system at 2Gig?
There is logic inside PostgreSQL to overflow to the next file at 2GB, but apparently this is currently broken.
AFAIK, there are people working on it now.
--------------
Hannu
On Sat, 6 Feb 1999, Hannu Krosing wrote:
There is logic inside PostgreSQL to overflow to the next file at 2GB, but apparently this is currently broken. AFAIK, there are people working on it now.
Yes, me ;-)
I have an idea where the failure is occurring, but I'm still testing the relevant parts of the code.
--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf
On Sun, 7 Feb 1999, Peter T Mount wrote:
I have an idea where the failure is occurring, but I'm still testing the relevant parts of the code.
Well, just now I think I know what's going on.
First, I've reduced the size at which postgres breaks the file to 2MB (256 blocks). I then ran the test script that imports some large records into a test table.
As expected, the splitting of the file works fine. So the code isn't
broken. What I think is happening is that the code extends the table, then
tests to see if it's at the 2Gig limit, and when it is, creates the next
file for that table.
However, I think the OS has problems with a file exactly 2Gb in size.
I've attached a patch that should reduce the max table size by 1 block.
This should prevent us from hitting the physical limit.
Note: I haven't tested this patch yet!
It compiles but, because the test takes 4 hours for my machine to reach
2Gb, and I have a few other things to do today, I'll run it overnight.
Hopefully, first thing tomorrow, we'll know if it works.
Peter
--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf
*** ./backend/storage/smgr/md.c.orig Mon Feb 1 17:55:57 1999
--- ./backend/storage/smgr/md.c Sun Feb 7 14:48:35 1999
***************
*** 77,86 ****
*
* 19 Mar 98 darrenk
*
*/
#ifndef LET_OS_MANAGE_FILESIZE
! #define RELSEG_SIZE ((8388608 / BLCKSZ) * 256)
#endif
/* routines declared here */
--- 77,91 ----
*
* 19 Mar 98 darrenk
*
+ * After testing, we need to add one less block to the file, otherwise
+ * we extend beyond the 2-gig limit.
+ *
+ * 07 Feb 99 Peter Mount
+ *
*/
#ifndef LET_OS_MANAGE_FILESIZE
! #define RELSEG_SIZE (((8388608 / BLCKSZ) * 256)-BLCKSZ)
#endif
/* routines declared here */
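For reference, with the default BLCKSZ of 8192 the original expression evaluates to (8388608 / 8192) * 256 = 262144 blocks, which is exactly 262144 * 8192 = 2147483648 bytes; the patched expression evaluates to 262144 - 8192 = 253952 blocks, or 2080374784 bytes (about 1.94GB), which keeps a full segment below the 2147483647-byte signed 32-bit ceiling.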
Peter T Mount <peter@retep.org.uk> writes:
As expected, the splitting of the file works fine. So the code isn't
broken. What I think is happening is that the code extends the table, then
tests to see if it's at the 2Gig limit, and when it is, creates the next
file for that table.
However, I think the OS has problems with a file exactly 2Gb in size.
Oh! I didn't realize that we were trying to extend the file to
*exactly* 2Gb. Indeed that's a very dangerous thing to do: the file
size in bytes will be 0x80000000, which will appear to be negative when
viewed as a signed 32-bit integer; unless your OS is very careful about
signed vs. unsigned arithmetic, it will break.
For that matter it's not impossible that our own code contains similar
problems, if it does much calculating with byte offsets into the file.
(The pushups that darrenk had to do in order to calculate RELSEG_SIZE
in the first place should have suggested to him that running right at
the overflow limit was not such a hot idea...)
I'd suggest setting the limit a good deal less than 2Gb to avoid any
risk of arithmetic overflow. Maybe 200000 8K blocks, instead of 262144.
And change the comment while you're at it, not just the code ;-)
regards, tom lane
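A small, self-contained illustration of the overflow Tom describes (a sketch, not PostgreSQL code): the old RELSEG_SIZE of 262144 blocks times the default 8192-byte block is exactly 0x80000000 bytes, one more than fits in a signed 32-bit file offset.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int64_t nblocks = 262144;               /* the old RELSEG_SIZE */
    int64_t blcksz  = 8192;                 /* default BLCKSZ */
    int64_t bytes   = nblocks * blcksz;     /* 2147483648 = 0x80000000 */

    /* On Linux 2.0, off_t is a signed 32-bit type, so an offset of exactly
     * 2GB wraps to a negative value once squeezed into 32 bits. */
    int32_t off32 = (int32_t) bytes;

    printf("262144 blocks * 8192 bytes = %lld (0x%llx)\n",
           (long long) bytes, (unsigned long long) bytes);
    printf("the same value as a signed 32-bit offset: %ld\n", (long) off32);
    return 0;
}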
For that matter it's not impossible that our own code contains similar
problems, if it does much calculating with byte offsets into the file.
(The pushups that darrenk had to do in order to calculate RELSEG_SIZE
in the first place should have suggested to him that running right at
the overflow limit was not such a hot idea...)
Not my code to begin with...
RELSEG_SIZE was always there hard-coded to 262144 to assume the block
size would be 8k. At the time of my changes, I didn't think thru what
it was for, I only changed the code that was there to calculate it and
get the same value as before for variable disc block sizes.
I agree that running right at the limit is a Bad Thing, but analyzing
that wasn't my main area of concern with that patch.
darrenk
On Sun, 7 Feb 1999, Tom Lane wrote:
I'd suggest setting the limit a good deal less than 2Gb to avoid any risk of arithmetic overflow. Maybe 200000 8K blocks, instead of 262144.
That might be an idea.
I've just re-synced my copy of the cvs source, so I'll set it there, and we'll know by the morning if it's worked or not.
And change the comment while you're at it, not just the code ;-)
Will do :-)
Peter
--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf
On Sun, 7 Feb 1999, Stupor Genius wrote:
I agree that running right at the limit is a Bad Thing, but analyzing that wasn't my main area of concern with that patch.
I agree with you. I think that the original error stemmed from when
RELSEG_SIZE was originally set.
Anyhow, I'm about to start the test, using RELSEG_SIZE set to 243968 which
works out to be 1.6Gb. That should stay well away from the overflow
problem.
Peter
--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf
On Sun, 7 Feb 1999, Peter T Mount wrote:
Anyhow, I'm about to start the test, using RELSEG_SIZE set to 243968 which
works out to be 1.6Gb. That should stay well away from the overflow
problem.
Hi,
I just did a checkout of the cvs code, hardcoded RELSEG_SIZE to 243968,
and it works beautifully now!
I imported about 2.2GB of data (table file size) and it looks like this:
-rw------- 1 postgres postgres 1998585856 Feb 7 16:22 mcrl3_1
-rw------- 1 postgres postgres 219611136 Feb 7 16:49 mcrl3_1.1
-rw------- 1 postgres postgres 399368192 Feb 7 16:49 mcrl3_1_partnumber_index
And it works fine. I did some selects on data that should have ended up in the .1 file, and it works great. The best thing about it is that it seems at least as fast as MSSQL on the same data, if not faster.
It did take like 45 minutes to create that index. Isn't that a bit long (AMD K6-2 350MHz)? :)
Suggestion: How hard would it be to make copy tablename FROM 'somefile'
give some feedback? Either some kind of percentage or just print out
something after each 10k row chunks or something like that.
Thanks,
Ole Gjerde
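As a rough illustration of the feedback Ole is asking for (a sketch only, not the backend's COPY code; the input path is just the one from his example), count rows as they are read and report every 10,000. In the backend this would presumably become an elog(NOTICE, ...) in the COPY FROM loop.

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("/home/gjerde/mcrl/MCR3_1.txt", "r");
    char  line[8192];
    long  rows = 0;

    if (fp == NULL)
        return 1;

    /* Lines longer than the buffer would be counted more than once;
     * close enough for a progress indicator. */
    while (fgets(line, sizeof(line), fp) != NULL)
    {
        if (++rows % 10000 == 0)
            fprintf(stderr, "loaded %ld rows\n", rows);
    }
    fprintf(stderr, "done: %ld rows total\n", rows);
    fclose(fp);
    return 0;
}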
On Sun, 7 Feb 1999 gjerde@icebox.org wrote:
I just did a checkout of the cvs code, hardcoded RELSEG_SIZE to 243968, and it works beautifully now!
Problem here is that RELSEG_SIZE is dependent on the block size. Seeing we
can increase the block size from 8k, this would break.
As I type, my machine is populating the test table.
I imported about 2.2GB of data (table file size) and it looks like this:
-rw------- 1 postgres postgres 1998585856 Feb 7 16:22 mcrl3_1
-rw------- 1 postgres postgres 219611136 Feb 7 16:49 mcrl3_1.1
-rw------- 1 postgres postgres 399368192 Feb 7 16:49 mcrl3_1_partnumber_index
And it works fine. I did some selects on data that should have ended up in the .1 file, and it works great. The best thing about it is that it seems at least as fast as MSSQL on the same data, if not faster.
This is what I got when I tested it using a reduced file size. It's what
made me decide to reduce the size by 1 in the patch I posted earlier.
However, I'm using John's suggestion of reducing the file size a lot more,
to ensure we don't hit any math errors, etc. So the max file size is about
1.6Gb.
It did take like 45 minutes to create that index. Isn't that a bit long (AMD K6-2 350MHz)? :)
Well, it's taking my poor old P133 about 2 hours to hit 2Gb at the moment.
Suggestion: How hard would it be to make copy tablename FROM 'somefile'
give some feedback? Either some kind of percentage or just print out
something after each 10k row chunks or something like that.
Attached is the test script I'm using, minus the data file.
Peter
--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf
Attachment: importtestdb (text/plain; charset=US-ASCII)
Peter T Mount <peter@retep.org.uk> writes:
I just did a checkout of the cvs code, hardcoded RELSEG_SIZE to 243968,
and it works beautifully now!
Problem here is that RELSEG_SIZE is dependent on the block size. Seeing we
can increase the block size from 8k, this would break.
Of course it should really be defined as
#define RELSEG_SIZE (2000000000 / BLCKSZ)
for some suitable magic constant.
regards, tom lane
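With the default 8K block size that definition comes out to 2000000000 / 8192 = 244140 blocks per segment, i.e. 244140 * 8192 = 1999994880 bytes; and because the integer division rounds down, a full segment stays under 2000000000 bytes (well below the 2147483647-byte signed 32-bit limit) for any block size.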
Thus spake Tom Lane
I'd suggest setting the limit a good deal less than 2Gb to avoid any
risk of arithmetic overflow. Maybe 200000 8K blocks, instead of 262144.
Why not make it substantially lower by default? Makes it easier to split
a database across spindles. Even better, how about putting extra extents
into different directories like data/base.1, data/base.2, etc? Then as
the database grows you can add drives, move the extents into them and
mount the new drives. The software doesn't even notice the change.
Just a thought.
--
D'Arcy J.M. Cain <darcy@{druid|vex}.net> | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 424 2871 (DoD#0082) (eNTP) | what's for dinner.
"D'Arcy J.M. Cain" wrote:
Why not make it substantially lower by default? Makes it easier to split a database across spindles. Even better, how about putting extra extents into different directories like data/base.1, data/base.2, etc? Then as the database grows you can add drives, move the extents into them and mount the new drives. The software doesn't even notice the change.
It would also be a great way to help optimization if indexes were in a separate directory from the tables.
And of course our current way of keeping all the large object files in one directory (even _the same_ with other data) sucks. It has kept me away from using large objects at all, as I've heard that Linux (or rather ext2fs) is not very good at dealing with huge directories. And I have no use for only a few large objects ;)
There have been suggestions about splitting up the large object storage by the hex representation of the oid value (= part of current filename), but a good start would be to put them just in a separate directory under pg_data. The temp files are also good candidates for putting in a separate directory.
The next step would be of course dataspaces, probably most easily implemented as directories:
CREATE DATASPACE PG_DATA1 STORAGE='/mnt/scsi.105.7/data1';
SET DEFAULT_DATASPACE TO PG_DATA1;
CREATE TABLE ... IN DATASPACE PG_DATA;
CREATE INDEX ... ;
Then we wouldn't have to move and symlink the tables and indexes manually.
--------------
Hannu
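A hypothetical sketch of the OID-based splitting Hannu mentions (the directory layout and the xinv file-name prefix are assumptions for illustration, not the actual on-disk format): derive a subdirectory from the hex form of the large object's OID so that no single directory ends up holding every file.

#include <stdio.h>

typedef unsigned int Oid;

/* Build a path like <datadir>/large_objects/00/01/xinv65537 from the OID. */
static void lo_filename(char *buf, size_t buflen, const char *datadir, Oid loid)
{
    snprintf(buf, buflen, "%s/large_objects/%02x/%02x/xinv%u",
             datadir, (loid >> 24) & 0xff, (loid >> 16) & 0xff, loid);
}

int main(void)
{
    char path[1024];

    lo_filename(path, sizeof(path), "/usr/local/pgsql/data", (Oid) 65537);
    printf("%s\n", path);
    return 0;
}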
"D'Arcy" "J.M." Cain <darcy@druid.net> writes:
Thus spake Tom Lane
I'd suggest setting the limit a good deal less than 2Gb to avoid any
risk of arithmetic overflow. Maybe 200000 8K blocks, instead of 262144.
Why not make it substantially lower by default?
Configure-time option, anyone ;-) ?
Makes it easier to split
a database across spindles. Even better, how about putting extra extents
into different directories like data/base.1, data/base.2, etc?
This could be a pretty good idea. Right now, if you need to split a
database across multiple filesystems, you have to do a bunch of tedious
hand manipulation of symlinks. With an option like this, you could
automatically distribute your larger tables across filesystems...
set up the subdirectories as symlinks once, and forget it...
regards, tom lane
On Sun, 7 Feb 1999 gjerde@icebox.org wrote:
I just did a checkout of the cvs code, hardcoded RELSEG_SIZE to 243968, and it works beautifully now! I imported about 2.2GB of data (table file size).
Great. This has been on the TODO list for quite some time. Glad it is
fixed.
Suggestion: How hard would it be to make copy tablename FROM 'somefile' give some feedback? Either some kind of percentage or just print out something after each 10k row chunks or something like that.
We could, but it would then make the output file larger.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
However, I'm using John's suggestion of reducing the file size a lot more, to ensure we don't hit any math errors, etc. So the max file size is about 1.6Gb.
I can imagine people finding that strange. Is it really needed? Is there some math that could overflow with a larger value?
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Sun, 7 Feb 1999, Bruce Momjian wrote:
I can imagine people finding that strange. Is it really needed? Is there some math that could overflow with a larger value?
Not sure. My original choice was to subtract 1 from the calculated
maximum, which meant it would split just before the 2Gb limit.
However, running with the lower value:
1998585856 Feb 8 02:25 /opt/db/base/test/smallcat
599007232 Feb 8 03:21 /opt/db/base/test/smallcat.1
Total 26653000 rows loaded
Would anyone really notice the lower value?
Perhaps we could make this another compile time setting, like the block
size?
Peter
--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf