Fix for large file support

Started by Zdenek Kotala almost 19 years ago · 36 messages

#1 Zdenek Kotala
Zdenek.Kotala@Sun.COM
1 attachment(s)

The current version of Postgres supports only 1GB chunks. This limit is
defined in the pg_config_manual.h header file. However, this setting
allows chunks of at most 2GB. The main problem is that the md storage
manager and buffile use the "long" data type (32 bits) for offsets
instead of "off_t" as defined in <sys/types.h>.

off_t is 32 bits long on a 32-bit OS, and 64 bits long on a 64-bit OS or
when the application is compiled with large file support.
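
For illustration, a minimal standalone program (a sketch, not part of the
patch) shows how large file support widens off_t; defining
_FILE_OFFSET_BITS before any include is the usual mechanism on 32-bit
platforms:

/* Sketch only: on a 32-bit build, off_t grows from 4 to 8 bytes when
 * _FILE_OFFSET_BITS is 64; on a 64-bit build it is 8 bytes either way. */
#define _FILE_OFFSET_BITS 64	/* comment out to see the 32-bit default */
#include <stdio.h>
#include <sys/types.h>

int
main(void)
{
	printf("sizeof(long)  = %u\n", (unsigned) sizeof(long));
	printf("sizeof(off_t) = %u\n", (unsigned) sizeof(off_t));
	return 0;
}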

The attached patch allows setting up chunks bigger than 4GB on an OS with
large file support.
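
To see why the widening casts matter, consider the offset arithmetic once
a segment passes 2GB (a sketch assuming BLCKSZ = 8192 and a 64-bit off_t;
the block number is made up):

#include <stdio.h>
#include <sys/types.h>

#define BLCKSZ 8192				/* assumed default block size */

int
main(void)
{
	unsigned int blocknum = 300000; /* a block past the 2GB boundary */

	/* 300000 * 8192 = 2457600000, which no longer fits in a signed
	 * 32-bit value, so the narrow result typically turns negative. */
	int			bad = (int) (blocknum * BLCKSZ);
	off_t		good = (off_t) BLCKSZ * blocknum;	/* widen first */

	printf("int:   %d\n", bad);
	printf("off_t: %lld\n", (long long) good);
	return 0;
}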

I tested it on a 7GB table and it works.

Please look at it and let me know your comments, or whether I missed something.

TODO/questions:

1) Clean up/update the comments about the limitation.

2) Is there some documentation to update?

3) I would like to add a check comparing sizeof(off_t) with the chunk size
setting, to protect Postgres against a misconfigured chunk size (see the
sketch after this list). Is mdinit() a good place for this check?

4) I'm going to get a bigger machine to test with a really big table.
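
A minimal sketch of the check item 3 proposes (a fragment; the placement
in mdinit() and the exact wording are assumptions, not part of the patch):

	/* Hypothetical guard, e.g. at the top of mdinit() in md.c: refuse
	 * to start if a RELSEG_SIZE-block segment cannot be addressed by
	 * this platform's off_t. */
	if (sizeof(off_t) < 8 &&
		(int64) RELSEG_SIZE * BLCKSZ > (int64) INT_MAX)
		elog(FATAL, "RELSEG_SIZE * BLCKSZ exceeds the range of off_t; "
			 "rebuild with large file support or a smaller RELSEG_SIZE");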

with regards Zdenek

Attachments:

largefile.diff (text/x-patch)
Index: src/backend/storage/file/buffile.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/storage/file/buffile.c,v
retrieving revision 1.25
diff -c -r1.25 buffile.c
*** src/backend/storage/file/buffile.c	5 Jan 2007 22:19:37 -0000	1.25
--- src/backend/storage/file/buffile.c	6 Apr 2007 12:08:47 -0000
***************
*** 42,48 ****
   * Note we adhere to this limit whether or not LET_OS_MANAGE_FILESIZE
   * is defined, although md.c ignores it when that symbol is defined.
   */
! #define MAX_PHYSICAL_FILESIZE  (RELSEG_SIZE * BLCKSZ)
  
  /*
   * This data structure represents a buffered file that consists of one or
--- 42,48 ----
   * Note we adhere to this limit whether or not LET_OS_MANAGE_FILESIZE
   * is defined, although md.c ignores it when that symbol is defined.
   */
! #define MAX_PHYSICAL_FILESIZE  ((off_t)RELSEG_SIZE * BLCKSZ)
  
  /*
   * This data structure represents a buffered file that consists of one or
***************
*** 54,60 ****
  	int			numFiles;		/* number of physical files in set */
  	/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
  	File	   *files;			/* palloc'd array with numFiles entries */
! 	long	   *offsets;		/* palloc'd array with numFiles entries */
  
  	/*
  	 * offsets[i] is the current seek position of files[i].  We use this to
--- 54,60 ----
  	int			numFiles;		/* number of physical files in set */
  	/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
  	File	   *files;			/* palloc'd array with numFiles entries */
! 	off_t	   *offsets;		/* palloc'd array with numFiles entries */
  
  	/*
  	 * offsets[i] is the current seek position of files[i].  We use this to
***************
*** 70,76 ****
  	 * Position as seen by user of BufFile is (curFile, curOffset + pos).
  	 */
  	int			curFile;		/* file index (0..n) part of current pos */
! 	int			curOffset;		/* offset part of current pos */
  	int			pos;			/* next read/write position in buffer */
  	int			nbytes;			/* total # of valid bytes in buffer */
  	char		buffer[BLCKSZ];
--- 70,76 ----
  	 * Position as seen by user of BufFile is (curFile, curOffset + pos).
  	 */
  	int			curFile;		/* file index (0..n) part of current pos */
! 	off_t		curOffset;		/* offset part of current pos */
  	int			pos;			/* next read/write position in buffer */
  	int			nbytes;			/* total # of valid bytes in buffer */
  	char		buffer[BLCKSZ];
***************
*** 95,101 ****
  	file->numFiles = 1;
  	file->files = (File *) palloc(sizeof(File));
  	file->files[0] = firstfile;
! 	file->offsets = (long *) palloc(sizeof(long));
  	file->offsets[0] = 0L;
  	file->isTemp = false;
  	file->dirty = false;
--- 95,101 ----
  	file->numFiles = 1;
  	file->files = (File *) palloc(sizeof(File));
  	file->files[0] = firstfile;
! 	file->offsets = (off_t *) palloc(sizeof(off_t));
  	file->offsets[0] = 0L;
  	file->isTemp = false;
  	file->dirty = false;
***************
*** 121,128 ****
  
  	file->files = (File *) repalloc(file->files,
  									(file->numFiles + 1) * sizeof(File));
! 	file->offsets = (long *) repalloc(file->offsets,
! 									  (file->numFiles + 1) * sizeof(long));
  	file->files[file->numFiles] = pfile;
  	file->offsets[file->numFiles] = 0L;
  	file->numFiles++;
--- 121,128 ----
  
  	file->files = (File *) repalloc(file->files,
  									(file->numFiles + 1) * sizeof(File));
! 	file->offsets = (off_t *) repalloc(file->offsets,
! 									  (file->numFiles + 1) * sizeof(off_t));
  	file->files[file->numFiles] = pfile;
  	file->offsets[file->numFiles] = 0L;
  	file->numFiles++;
***************
*** 273,281 ****
  		bytestowrite = file->nbytes - wpos;
  		if (file->isTemp)
  		{
! 			long		availbytes = MAX_PHYSICAL_FILESIZE - file->curOffset;
  
! 			if ((long) bytestowrite > availbytes)
  				bytestowrite = (int) availbytes;
  		}
  
--- 273,281 ----
  		bytestowrite = file->nbytes - wpos;
  		if (file->isTemp)
  		{
! 			off_t		availbytes = MAX_PHYSICAL_FILESIZE - file->curOffset;
  
! 			if ((off_t) bytestowrite > availbytes)
  				bytestowrite = (int) availbytes;
  		}
  
***************
*** 445,454 ****
   * impossible seek is attempted.
   */
  int
! BufFileSeek(BufFile *file, int fileno, long offset, int whence)
  {
  	int			newFile;
! 	long		newOffset;
  
  	switch (whence)
  	{
--- 445,454 ----
   * impossible seek is attempted.
   */
  int
! BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
  {
  	int			newFile;
! 	off_t		newOffset;
  
  	switch (whence)
  	{
***************
*** 531,537 ****
  }
  
  void
! BufFileTell(BufFile *file, int *fileno, long *offset)
  {
  	*fileno = file->curFile;
  	*offset = file->curOffset + file->pos;
--- 530,536 ----
  }
  
  void
! BufFileTell(BufFile *file, int *fileno, off_t *offset)
  {
  	*fileno = file->curFile;
  	*offset = file->curOffset + file->pos;
***************
*** 544,559 ****
   * the file.  Note that users of this interface will fail if their files
   * exceed BLCKSZ * LONG_MAX bytes, but that is quite a lot; we don't work
   * with tables bigger than that, either...
   *
   * Result is 0 if OK, EOF if not.  Logical position is not moved if an
   * impossible seek is attempted.
   */
  int
! BufFileSeekBlock(BufFile *file, long blknum)
  {
  	return BufFileSeek(file,
  					   (int) (blknum / RELSEG_SIZE),
! 					   (blknum % RELSEG_SIZE) * BLCKSZ,
  					   SEEK_SET);
  }
  
--- 543,558 ----
   * the file.  Note that users of this interface will fail if their files
   * exceed BLCKSZ * LONG_MAX bytes, but that is quite a lot; we don't work
   * with tables bigger than that, either...
   *
   * Result is 0 if OK, EOF if not.  Logical position is not moved if an
   * impossible seek is attempted.
   */
  int
! BufFileSeekBlock(BufFile *file, BlockNumber blknum)
  {
  	return BufFileSeek(file,
  					   (int) (blknum / RELSEG_SIZE),
! 					   ((off_t)blknum % RELSEG_SIZE) * BLCKSZ,
  					   SEEK_SET);
  }
  
Index: src/backend/storage/file/fd.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/storage/file/fd.c,v
retrieving revision 1.137
diff -c -r1.137 fd.c
*** src/backend/storage/file/fd.c	6 Mar 2007 02:06:14 -0000	1.137
--- src/backend/storage/file/fd.c	6 Apr 2007 12:08:47 -0000
***************
*** 128,134 ****
  	File		nextFree;		/* link to next free VFD, if in freelist */
  	File		lruMoreRecently;	/* doubly linked recency-of-use list */
  	File		lruLessRecently;
! 	long		seekPos;		/* current logical file position */
  	char	   *fileName;		/* name of file, or NULL for unused VFD */
  	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
  	int			fileFlags;		/* open(2) flags for (re)opening the file */
--- 128,134 ----
  	File		nextFree;		/* link to next free VFD, if in freelist */
  	File		lruMoreRecently;	/* doubly linked recency-of-use list */
  	File		lruLessRecently;
! 	off_t		seekPos;		/* current logical file position */
  	char	   *fileName;		/* name of file, or NULL for unused VFD */
  	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
  	int			fileFlags;		/* open(2) flags for (re)opening the file */
***************
*** 1136,1143 ****
  	return pg_fsync(VfdCache[file].fd);
  }
  
! long
! FileSeek(File file, long offset, int whence)
  {
  	int			returnCode;
  
--- 1136,1143 ----
  	return pg_fsync(VfdCache[file].fd);
  }
  
! off_t
! FileSeek(File file, off_t offset, int whence)
  {
  	int			returnCode;
  
***************
*** 1203,1209 ****
   * XXX not actually used but here for completeness
   */
  #ifdef NOT_USED
! long
  FileTell(File file)
  {
  	Assert(FileIsValid(file));
--- 1203,1209 ----
   * XXX not actually used but here for completeness
   */
  #ifdef NOT_USED
! off_t
  FileTell(File file)
  {
  	Assert(FileIsValid(file));
***************
*** 1214,1220 ****
  #endif
  
  int
! FileTruncate(File file, long offset)
  {
  	int			returnCode;
  
--- 1214,1220 ----
  #endif
  
  int
! FileTruncate(File file, off_t offset)
  {
  	int			returnCode;
  
***************
*** 1227,1233 ****
  	if (returnCode < 0)
  		return returnCode;
  
! 	returnCode = ftruncate(VfdCache[file].fd, (size_t) offset);
  	return returnCode;
  }
  
--- 1227,1233 ----
  	if (returnCode < 0)
  		return returnCode;
  
! 	returnCode = ftruncate(VfdCache[file].fd, (off_t) offset);
  	return returnCode;
  }
  
Index: src/backend/storage/smgr/md.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.127
diff -c -r1.127 md.c
*** src/backend/storage/smgr/md.c	17 Jan 2007 16:25:01 -0000	1.127
--- src/backend/storage/smgr/md.c	6 Apr 2007 12:08:48 -0000
***************
*** 325,331 ****
  void
  mdextend(SMgrRelation reln, BlockNumber blocknum, char *buffer, bool isTemp)
  {
! 	long		seekpos;
  	int			nbytes;
  	MdfdVec    *v;
  
--- 325,331 ----
  void
  mdextend(SMgrRelation reln, BlockNumber blocknum, char *buffer, bool isTemp)
  {
! 	off_t		seekpos;
  	int			nbytes;
  	MdfdVec    *v;
  
***************
*** 351,360 ****
  	v = _mdfd_getseg(reln, blocknum, isTemp, EXTENSION_CREATE);
  
  #ifndef LET_OS_MANAGE_FILESIZE
! 	seekpos = (long) (BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)));
! 	Assert(seekpos < BLCKSZ * RELSEG_SIZE);
  #else
! 	seekpos = (long) (BLCKSZ * (blocknum));
  #endif
  
  	/*
--- 351,360 ----
  	v = _mdfd_getseg(reln, blocknum, isTemp, EXTENSION_CREATE);
  
  #ifndef LET_OS_MANAGE_FILESIZE
! 	seekpos =  (off_t)BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
! 	Assert(seekpos < (off_t)BLCKSZ * RELSEG_SIZE);
  #else
! 	seekpos =  (off_t)BLCKSZ * blocknum;
  #endif
  
  	/*
***************
*** 507,523 ****
  void
  mdread(SMgrRelation reln, BlockNumber blocknum, char *buffer)
  {
! 	long		seekpos;
  	int			nbytes;
  	MdfdVec    *v;
  
  	v = _mdfd_getseg(reln, blocknum, false, EXTENSION_FAIL);
  
  #ifndef LET_OS_MANAGE_FILESIZE
! 	seekpos = (long) (BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)));
! 	Assert(seekpos < BLCKSZ * RELSEG_SIZE);
  #else
! 	seekpos = (long) (BLCKSZ * (blocknum));
  #endif
  
  	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
--- 507,523 ----
  void
  mdread(SMgrRelation reln, BlockNumber blocknum, char *buffer)
  {
! 	off_t		seekpos;
  	int			nbytes;
  	MdfdVec    *v;
  
  	v = _mdfd_getseg(reln, blocknum, false, EXTENSION_FAIL);
  
  #ifndef LET_OS_MANAGE_FILESIZE
! 	seekpos = (off_t)BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
! 	Assert(seekpos < (off_t)BLCKSZ * RELSEG_SIZE);
  #else
! 	seekpos = (off_t)BLCKSZ * (blocknum);
  #endif
  
  	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
***************
*** 571,577 ****
  void
  mdwrite(SMgrRelation reln, BlockNumber blocknum, char *buffer, bool isTemp)
  {
! 	long		seekpos;
  	int			nbytes;
  	MdfdVec    *v;
  
--- 571,577 ----
  void
  mdwrite(SMgrRelation reln, BlockNumber blocknum, char *buffer, bool isTemp)
  {
! 	off_t		seekpos;
  	int			nbytes;
  	MdfdVec    *v;
  
***************
*** 583,592 ****
  	v = _mdfd_getseg(reln, blocknum, isTemp, EXTENSION_FAIL);
  
  #ifndef LET_OS_MANAGE_FILESIZE
! 	seekpos = (long) (BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)));
! 	Assert(seekpos < BLCKSZ * RELSEG_SIZE);
  #else
! 	seekpos = (long) (BLCKSZ * (blocknum));
  #endif
  
  	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
--- 583,592 ----
  	v = _mdfd_getseg(reln, blocknum, isTemp, EXTENSION_FAIL);
  
  #ifndef LET_OS_MANAGE_FILESIZE
! 	seekpos = (off_t)BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
! 	Assert(seekpos < (off_t)BLCKSZ * RELSEG_SIZE);
  #else
! 	seekpos = (off_t)BLCKSZ * (blocknum);
  #endif
  
  	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
***************
*** 1297,1303 ****
  static BlockNumber
  _mdnblocks(SMgrRelation reln, MdfdVec *seg)
  {
! 	long		len;
  
  	len = FileSeek(seg->mdfd_vfd, 0L, SEEK_END);
  	if (len < 0)
--- 1297,1303 ----
  static BlockNumber
  _mdnblocks(SMgrRelation reln, MdfdVec *seg)
  {
! 	off_t		len;
  
  	len = FileSeek(seg->mdfd_vfd, 0L, SEEK_END);
  	if (len < 0)
Index: src/backend/utils/sort/tuplestore.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/sort/tuplestore.c,v
retrieving revision 1.30
diff -c -r1.30 tuplestore.c
*** src/backend/utils/sort/tuplestore.c	5 Jan 2007 22:19:47 -0000	1.30
--- src/backend/utils/sort/tuplestore.c	6 Apr 2007 12:08:49 -0000
***************
*** 130,143 ****
  	bool		eof_reached;	/* read reached EOF (always valid) */
  	int			current;		/* next array index (valid if INMEM) */
  	int			readpos_file;	/* file# (valid if WRITEFILE and not eof) */
! 	long		readpos_offset; /* offset (valid if WRITEFILE and not eof) */
  	int			writepos_file;	/* file# (valid if READFILE) */
! 	long		writepos_offset;	/* offset (valid if READFILE) */
  
  	/* markpos_xxx holds marked position for mark and restore */
  	int			markpos_current;	/* saved "current" */
  	int			markpos_file;	/* saved "readpos_file" */
! 	long		markpos_offset; /* saved "readpos_offset" */
  };
  
  #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
--- 130,143 ----
  	bool		eof_reached;	/* read reached EOF (always valid) */
  	int			current;		/* next array index (valid if INMEM) */
  	int			readpos_file;	/* file# (valid if WRITEFILE and not eof) */
! 	off_t		readpos_offset; /* offset (valid if WRITEFILE and not eof) */
  	int			writepos_file;	/* file# (valid if READFILE) */
! 	off_t		writepos_offset;	/* offset (valid if READFILE) */
  
  	/* markpos_xxx holds marked position for mark and restore */
  	int			markpos_current;	/* saved "current" */
  	int			markpos_file;	/* saved "readpos_file" */
! 	off_t		markpos_offset; /* saved "readpos_offset" */
  };
  
  #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
Index: src/include/storage/buffile.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/storage/buffile.h,v
retrieving revision 1.20
diff -c -r1.20 buffile.h
*** src/include/storage/buffile.h	5 Jan 2007 22:19:57 -0000	1.20
--- src/include/storage/buffile.h	6 Apr 2007 12:08:49 -0000
***************
*** 26,31 ****
--- 26,34 ----
  #ifndef BUFFILE_H
  #define BUFFILE_H
  
+ #include <sys/types.h>
+ #include "block.h"
+ 
  /* BufFile is an opaque type whose details are not known outside buffile.c. */
  
  typedef struct BufFile BufFile;
***************
*** 38,45 ****
  extern void BufFileClose(BufFile *file);
  extern size_t BufFileRead(BufFile *file, void *ptr, size_t size);
  extern size_t BufFileWrite(BufFile *file, void *ptr, size_t size);
! extern int	BufFileSeek(BufFile *file, int fileno, long offset, int whence);
! extern void BufFileTell(BufFile *file, int *fileno, long *offset);
! extern int	BufFileSeekBlock(BufFile *file, long blknum);
  
  #endif   /* BUFFILE_H */
--- 41,48 ----
  extern void BufFileClose(BufFile *file);
  extern size_t BufFileRead(BufFile *file, void *ptr, size_t size);
  extern size_t BufFileWrite(BufFile *file, void *ptr, size_t size);
! extern int	BufFileSeek(BufFile *file, int fileno, off_t offset, int whence);
! extern void BufFileTell(BufFile *file, int *fileno, off_t *offset);
! extern int	BufFileSeekBlock(BufFile *file, BlockNumber blknum);
  
  #endif   /* BUFFILE_H */
Index: src/include/storage/fd.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/storage/fd.h,v
retrieving revision 1.57
diff -c -r1.57 fd.h
*** src/include/storage/fd.h	5 Jan 2007 22:19:57 -0000	1.57
--- src/include/storage/fd.h	6 Apr 2007 12:08:50 -0000
***************
*** 67,74 ****
  extern int	FileRead(File file, char *buffer, int amount);
  extern int	FileWrite(File file, char *buffer, int amount);
  extern int	FileSync(File file);
! extern long FileSeek(File file, long offset, int whence);
! extern int	FileTruncate(File file, long offset);
  
  /* Operations that allow use of regular stdio --- USE WITH CAUTION */
  extern FILE *AllocateFile(const char *name, const char *mode);
--- 67,74 ----
  extern int	FileRead(File file, char *buffer, int amount);
  extern int	FileWrite(File file, char *buffer, int amount);
  extern int	FileSync(File file);
! extern off_t FileSeek(File file, off_t offset, int whence);
! extern int	FileTruncate(File file, off_t offset);
  
  /* Operations that allow use of regular stdio --- USE WITH CAUTION */
  extern FILE *AllocateFile(const char *name, const char *mode);
#2 Andrew Dunstan
andrew@dunslane.net
In reply to: Zdenek Kotala (#1)
Re: Fix for large file support

Zdenek Kotala wrote:

The current version of Postgres supports only 1GB chunks. This limit is
defined in the pg_config_manual.h header file. However, this setting
allows chunks of at most 2GB. The main problem is that the md storage
manager and buffile use the "long" data type (32 bits) for offsets
instead of "off_t" as defined in <sys/types.h>.

off_t is 32 bits long on a 32-bit OS, and 64 bits long on a 64-bit OS or
when the application is compiled with large file support.

The attached patch allows setting up chunks bigger than 4GB on an OS with
large file support.

I tested it on a 7GB table and it works.

What does it actually buy us, though? Does it mean the maximum field
size will grow beyond 1GB? Or give better performance?

cheers

andrew

#3 Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Andrew Dunstan (#2)
Re: Fix for large file support

Andrew Dunstan wrote:

Does it mean the maximum field size will grow beyond 1GB?

No, because it is limited by the varlena size. See
http://www.postgresql.org/docs/8.2/interactive/storage-toast.html

Or give better performance?

Yes. The list of chunks is stored as a linked list, and for some operations
(e.g. expand) all chunks are opened and their sizes are checked. On big
tables that takes some time. For example, if you have a 1TB table and you
want to add a new block, you must go and open all 1024 files.

By the way, the ./configure script performs a check for __LARGE_FILE_
support, but it looks like it is not used anywhere.

There could be a small time penalty from 64-bit arithmetic. However, it
happens only if large file support is enabled on a 32-bit OS.

Zdenek

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zdenek Kotala (#3)
Re: Fix for large file support

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Andrew Dunstan wrote:

Or give better performance?

Yes. The list of chunks is stored as a linked list, and for some operations
(e.g. expand) all chunks are opened and their sizes are checked. On big
tables that takes some time. For example, if you have a 1TB table and you
want to add a new block, you must go and open all 1024 files.

Indeed, but that would be far more effectively addressed by fixing the
*other* code path that doesn't segment at all (the
LET_OS_MANAGE_FILESIZE option, which is most likely broken these days
for lack of testing). I don't see the point of a halfway measure like
increasing RELSEG_SIZE.

regards, tom lane

#5 Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#4)
Re: Fix for large file support

Tom Lane wrote:

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Andrew Dunstan wrote:

Indeed, but that would be far more effectively addressed by fixing the
*other* code path that doesn't segment at all (the
LET_OS_MANAGE_FILESIZE option, which is most likely broken these days
for lack of testing). I don't see the point of a halfway measure like
increasing RELSEG_SIZE.

LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem of this
option: the size of the offset. I went through the code and did not see any
other problem there. However, as you mentioned, it needs more testing. I am
going to get a server with a large disk array and I will test it.

I would like to add an --enable-largefile switch to configure to make this
accessible to a wider group of users. What do you think about it?

Zdenek

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zdenek Kotala (#5)
Re: [PATCHES] Fix for large file support

[ redirecting to -hackers for wider comment ]

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Tom Lane wrote:
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem of this
option: the size of the offset. I went through the code and did not see any
other problem there. However, as you mentioned, it needs more testing. I am
going to get a server with a large disk array and I will test it.

I would like to add an --enable-largefile switch to configure to make this
accessible to a wider group of users. What do you think about it?

Yeah, I was going to suggest the same thing --- but not with that switch
name. We already use enable/disable-largefile to control whether 64-bit
file access is built at all (this mostly affects pg_dump at the moment).

I think the clearest way might be to flip the sense of the variable.
I never found "LET_OS_MANAGE_FILESIZE" to be a good name anyway. I'd
suggest "USE_SEGMENTED_FILES", which defaults to "on", and you can
turn it off via --disable-segmented-files if configure confirms your
OS has largefile support (thus you could not specify both this and
--disable-largefile).

regards, tom lane

#7 Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#6)
Re: [PATCHES] Fix for large file support

Tom Lane wrote:

[ redirecting to -hackers for wider comment ]

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Tom Lane wrote:
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem of this
option: the size of the offset. I went through the code and did not see any
other problem there. However, as you mentioned, it needs more testing. I am
going to get a server with a large disk array and I will test it.

I would like to add an --enable-largefile switch to configure to make this
accessible to a wider group of users. What do you think about it?

Yeah, I was going to suggest the same thing --- but not with that switch
name. We already use enable/disable-largefile to control whether 64-bit
file access is built at all (this mostly affects pg_dump at the moment).

Hmm :( It looks like the ./configure largefile detection does not work on
Solaris.

I think the clearest way might be to flip the sense of the variable.
I never found "LET_OS_MANAGE_FILESIZE" to be a good name anyway. I'd
suggest "USE_SEGMENTED_FILES", which defaults to "on", and you can
turn it off via --disable-segmented-files if configure confirms your
OS has largefile support (thus you could not specify both this and
--disable-largefile).

It sounds good. There is one thing to clarify (for the present): how to
handle buffile? It does not currently support non-segmented files. I
suggest using the same switch to enable/disable segments there.

Zdenek

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zdenek Kotala (#7)
Re: [PATCHES] Fix for large file support

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

It sounds good. There is one thing to clarify (for the present): how to
handle buffile? It does not currently support non-segmented files. I
suggest using the same switch to enable/disable segments there.

Do you think it really matters? Terabyte-sized temp files seem a bit
unlikely, and anyway I don't think the performance argument applies;
md.c's tendency to open all the files at once is irrelevant.

regards, tom lane

#9 Jim Nasby
decibel@decibel.org
In reply to: Tom Lane (#6)
Re: [PATCHES] Fix for large file support

If we expose LET_OS_MANAGE_FILESIZE, should we add a flag to the
control file so that you can't start a backend that has that defined
against a cluster that was initialized without it?

On Apr 6, 2007, at 2:45 PM, Tom Lane wrote:

[ redirecting to -hackers for wider comment ]

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Tom Lane wrote:
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem of this
option: the size of the offset. I went through the code and did not see any
other problem there. However, as you mentioned, it needs more testing. I am
going to get a server with a large disk array and I will test it.

I would like to add an --enable-largefile switch to configure to make this
accessible to a wider group of users. What do you think about it?

Yeah, I was going to suggest the same thing --- but not with that switch
name. We already use enable/disable-largefile to control whether 64-bit
file access is built at all (this mostly affects pg_dump at the moment).

I think the clearest way might be to flip the sense of the variable.
I never found "LET_OS_MANAGE_FILESIZE" to be a good name anyway. I'd
suggest "USE_SEGMENTED_FILES", which defaults to "on", and you can
turn it off via --disable-segmented-files if configure confirms your
OS has largefile support (thus you could not specify both this and
--disable-largefile).

regards, tom lane


--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#9)
Re: [PATCHES] Fix for large file support

Jim Nasby <decibel@decibel.org> writes:

If we expose LET_OS_MANAGE_FILESIZE, should we add a flag to the
control file so that you can't start a backend that has that defined
against a cluster that was initialized without it?

I imagine we'd flag that as relsegsize = 0 or some such.

regards, tom lane

#11 Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#10)
Re: [PATCHES] Fix for large file support

Tom Lane wrote:

Jim Nasby <decibel@decibel.org> writes:

If we expose LET_OS_MANAGE_FILESIZE, should we add a flag to the
control file so that you can't start a backend that has that defined
against a cluster that was initialized without it?

I imagine we'd flag that as relsegsize = 0 or some such.

Yes, I have it in my patch. I put relsegsize = 0 in the control file when
non-segmentation mode is enabled.

Zdenek

#12 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#6)
Re: [PATCHES] Fix for large file support

This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Tom Lane wrote:

[ redirecting to -hackers for wider comment ]

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Tom Lane wrote:
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem of this
option: the size of the offset. I went through the code and did not see any
other problem there. However, as you mentioned, it needs more testing. I am
going to get a server with a large disk array and I will test it.

I would like to add an --enable-largefile switch to configure to make this
accessible to a wider group of users. What do you think about it?

Yeah, I was going to suggest the same thing --- but not with that switch
name. We already use enable/disable-largefile to control whether 64-bit
file access is built at all (this mostly affects pg_dump at the moment).

I think the clearest way might be to flip the sense of the variable.
I never found "LET_OS_MANAGE_FILESIZE" to be a good name anyway. I'd
suggest "USE_SEGMENTED_FILES", which defaults to "on", and you can
turn it off via --disable-segmented-files if configure confirms your
OS has largefile support (thus you could not specify both this and
--disable-largefile).

regards, tom lane


--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#13 Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#6)
1 attachment(s)
Re: Fix for large file support (nonsegment mode support)

Tom Lane wrote:

[ redirecting to -hackers for wider comment ]

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Tom Lane wrote:
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem of this
option: the size of the offset. I went through the code and did not see any
other problem there. However, as you mentioned, it needs more testing. I am
going to get a server with a large disk array and I will test it.

I would like to add an --enable-largefile switch to configure to make this
accessible to a wider group of users. What do you think about it?

Yeah, I was going to suggest the same thing --- but not with that switch
name. We already use enable/disable-largefile to control whether 64-bit
file access is built at all (this mostly affects pg_dump at the moment).

I think the clearest way might be to flip the sense of the variable.
I never found "LET_OS_MANAGE_FILESIZE" to be a good name anyway. I'd
suggest "USE_SEGMENTED_FILES", which defaults to "on", and you can
turn it off via --disable-segmented-files if configure confirms your
OS has largefile support (thus you could not specify both this and
--disable-largefile).

Here is the latest version of the nonsegment support patch. I changed
LET_OS_MANAGE_FILESIZE to USE_SEGMENTED_FILES and added a
--disable-segmented-files switch to configure. I kept the tuplestore
behavior, and it still splits files in both modes.

I also cleaned up some other datatypes a little (e.g. int -> mode_t).
Autoconf and autoheader must be run after applying the patch.

I tested it with a 9GB table and both modes work fine.

Please let me know your comments.

Zdenek

Attachments:

nonseg.patch.gz (application/x-gzip)
#14 Bruce Momjian
bruce@momjian.us
In reply to: Zdenek Kotala (#13)
Re: Fix for large file support (nonsegment mode support)

This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Zdenek Kotala wrote:

Tom Lane wrote:

[ redirecting to -hackers for wider comment ]

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Tom Lane wrote:
LET_OS_MANAGE_FILESIZE is a good way. I think I fixed one problem of this
option: the size of the offset. I went through the code and did not see any
other problem there. However, as you mentioned, it needs more testing. I am
going to get a server with a large disk array and I will test it.

I would like to add an --enable-largefile switch to configure to make this
accessible to a wider group of users. What do you think about it?

Yeah, I was going to suggest the same thing --- but not with that switch
name. We already use enable/disable-largefile to control whether 64-bit
file access is built at all (this mostly affects pg_dump at the moment).

I think the clearest way might be to flip the sense of the variable.
I never found "LET_OS_MANAGE_FILESIZE" to be a good name anyway. I'd
suggest "USE_SEGMENTED_FILES", which defaults to "on", and you can
turn it off via --disable-segmented-files if configure confirms your
OS has largefile support (thus you could not specify both this and
--disable-largefile).

Here is the latest version of the nonsegment support patch. I changed
LET_OS_MANAGE_FILESIZE to USE_SEGMENTED_FILES and added a
--disable-segmented-files switch to configure. I kept the tuplestore
behavior, and it still splits files in both modes.

I also cleaned up some other datatypes a little (e.g. int -> mode_t).
Autoconf and autoheader must be run after applying the patch.

I tested it with a 9GB table and both modes work fine.

Please let me know your comments.

Zdenek

[ application/x-gzip is not supported, skipping... ]


--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zdenek Kotala (#13)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Here is the latest version of the nonsegment support patch. I changed
LET_OS_MANAGE_FILESIZE to USE_SEGMENTED_FILES and added a
--disable-segmented-files switch to configure. I kept the tuplestore
behavior, and it still splits files in both modes.

Applied with minor corrections.

regards, tom lane

#16 Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#15)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Tom Lane wrote:

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Here is the latest version of the nonsegment support patch. I changed
LET_OS_MANAGE_FILESIZE to USE_SEGMENTED_FILES and added a
--disable-segmented-files switch to configure. I kept the tuplestore
behavior, and it still splits files in both modes.

Applied with minor corrections.

Why is this not the default when supported? I am wondering both from the
point of view of the user, and in terms of development direction.

#17 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#16)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Peter Eisentraut <peter_e@gmx.net> writes:

Tom Lane wrote:

Applied with minor corrections.

Why is this not the default when supported?

Fear.

Maybe eventually, but right now I think it's too risky.

One point that I already found out the hard way is that sizeof(off_t) = 8
does not guarantee the availability of largefile support; there can also
be filesystem-level constraints, and perhaps other things we know not of
at this point.

I think this needs to be treated as experimental until it's got a few
more than zero miles under its belt. I wouldn't be too surprised to
find that we have to implement it as a run-time switch instead of
compile-time, in order to not fail miserably when somebody sticks a
tablespace on an archaic filesystem.

regards, tom lane

#18 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Peter Eisentraut (#16)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Peter Eisentraut wrote:

Tom Lane wrote:

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Here is the latest version of the nonsegment support patch. I changed
LET_OS_MANAGE_FILESIZE to USE_SEGMENTED_FILES and added a
--disable-segmented-files switch to configure. I kept the tuplestore
behavior, and it still splits files in both modes.

Applied with minor corrections.

Why is this not the default when supported? I am wondering both from the
point of view of the user, and in terms of development direction.

Also, it would get more buildfarm coverage if it were the default. If it
breaks something we'll notice earlier.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#18)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Alvaro Herrera <alvherre@commandprompt.com> writes:

Also, it would get more buildfarm coverage if it were the default. If it
breaks something we'll notice earlier.

Since nothing the regression tests do even approach 1GB, the odds that
the buildfarm will notice problems are approximately zero.

regards, tom lane

#20 Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#17)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Tom Lane wrote:

I think this needs to be treated as experimental until it's got a few
more than zero miles under its belt.

OK, then maybe we should document that.

I wouldn't be too surprised to
find that we have to implement it as a run-time switch instead of
compile-time, in order to not fail miserably when somebody sticks a
tablespace on an archaic filesystem.

Yes, that sounds quite useful. Let's wait and see what happens.

#21 Zeugswetter Andreas OSB SD
Andreas.Zeugswetter@s-itsolutions.at
In reply to: Tom Lane (#17)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Why is this not the default when supported?

Fear.

Maybe eventually, but right now I think it's too risky.

One point that I already found out the hard way is that sizeof(off_t) = 8
does not guarantee the availability of largefile support; there can also
be filesystem-level constraints, and perhaps other things we know not of
at this point.

Exactly, e.g. AIX is one of those. jfs (not the newer jfs2) has an option
to enable large files, which is not the default and cannot be changed
after crfs. And even if it is enabled, jfs has a 64GB filesize limit!
Does anybody know others that support large but not huge files?

Andreas

#22 Zeugswetter Andreas OSB SD
Andreas.Zeugswetter@s-itsolutions.at
In reply to: Alvaro Herrera (#18)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Why is this not the default when supported? I am wondering both from the
point of view of the user, and in terms of development direction.

Also, it would get more buildfarm coverage if it were the default. If it
breaks something we'll notice earlier.

No, we don't, because the buildfarm does not test huge files.

Andreas

#23 Larry Rosenman
ler@lerctr.org
In reply to: Tom Lane (#17)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

On Mon, 10 Mar 2008, Tom Lane wrote:

Peter Eisentraut <peter_e@gmx.net> writes:

Tom Lane wrote:

Applied with minor corrections.

Why is this not the default when supported?

Fear.

Maybe eventually, but right now I think it's too risky.

One point that I already found out the hard way is that sizeof(off_t) = 8
does not guarantee the availability of largefile support; there can also
be filesystem-level constraints, and perhaps other things we know not of
at this point.

Just to note an additional filesystem that will need special action...
The VxFS filesystem has a largefiles option, per filesystem. At least that
was the case on SCO UnixWare (No, I no longer run it).

LER

regards, tom lane

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893

#24 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#20)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Peter Eisentraut <peter_e@gmx.net> writes:

Tom Lane wrote:

I think this needs to be treated as experimental until it's got a few
more than zero miles under its belt.

OK, then maybe we should document that.

Agreed, but at this point we don't even know what hazards we need to
document.

regards, tom lane

#25 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas OSB SD (#21)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

"Zeugswetter Andreas OSB SD" <Andreas.Zeugswetter@s-itsolutions.at> writes:

Exactly, e.g. AIX is one of those. jfs (not the newer jfs2) has an option
to enable large files, which is not the default and cannot be changed
after crfs. And even if it is enabled, jfs has a 64GB filesize limit!
Does anybody know others that support large but not huge files?

Yeah, HPUX 10 is similar --- 128GB hard maximum. It does say you
can convert an existing filesystem to largefile support, but it has
to be unmounted.

These examples suggest that maybe what we want is not so much a "no
segments ever" mode as a segment size larger than 1GB.

regards, tom lane

#26 Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#25)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Tom Lane wrote:

"Zeugswetter Andreas OSB SD" <Andreas.Zeugswetter@s-itsolutions.at> writes:

Exactly, e.g. AIX is one of those. jfs (not the newer jfs2) has an option
to enable large files, which is not the default and cannot be changed
after crfs. And even if it is enabled, jfs has a 64GB filesize limit!
Does anybody know others that support large but not huge files?

Yeah, HPUX 10 is similar --- 128GB hard maximum. It does say you
can convert an existing filesystem to largefile support, but it has
to be unmounted.

These examples suggest that maybe what we want is not so much a "no
segments ever" mode as a segment size larger than 1GB.

The patch allows using segment files bigger than 2/4GB, and it is possible
to change this in the source file. But as was mentioned in this thread,
maybe something like "CREATE TABLESPACE name LOCATION '/my/location'
SEGMENTS 10GB" would be a good solution. If SEGMENTS is not mentioned, then
the default value is used.
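
For scale (assuming the default BLCKSZ of 8192; this arithmetic is an
illustration, not from the patch): a segment holds RELSEG_SIZE * BLCKSZ
bytes, so "SEGMENTS 10GB" would correspond to RELSEG_SIZE =
10 * 2^30 / 8192 = 1310720 blocks, versus the stock 131072 blocks for 1GB.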

Zdenek

PS: ZFS is happy with 2^64-byte file sizes, and UFS has a 1TB file size
limit (depending on the Solaris version).

#27 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zdenek Kotala (#26)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Tom Lane wrote:

These examples suggest that maybe what we want is not so much a "no
segments ever" mode as a segment size larger than 1GB.

PS: ZFS is happy with 2^64-byte file sizes, and UFS has a 1TB file size
limit (depending on the Solaris version).

So even on Solaris, "no segments ever" is actually a pretty awful idea.
As it stands, the code would fail on tables > 1TB.

I'm thinking we need to reconsider this patch. Rather than disabling
segmentation altogether, we should see it as allowing use of segments
larger than 1GB. I suggest that we ought to just flat rip out the "non
segmenting" code paths in md.c, and instead look into what segment sizes
are appropriate on different platforms.

regards, tom lane

#28 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#27)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Tom Lane wrote:

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Tom Lane wrote:

These examples suggest that maybe what we want is not so much a "no
segments ever" mode as a segment size larger than 1GB.

PS: ZFS is happy with 2^64-byte file sizes, and UFS has a 1TB file size
limit (depending on the Solaris version).

So even on Solaris, "no segments ever" is actually a pretty awful idea.
As it stands, the code would fail on tables > 1TB.

I'm thinking we need to reconsider this patch. Rather than disabling
segmentation altogether, we should see it as allowing use of segments
larger than 1GB. I suggest that we ought to just flat rip out the "non
segmenting" code paths in md.c, and instead look into what segment sizes
are appropriate on different platforms.

Agreed.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#29 Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#27)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Tom Lane wrote:

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Tom Lane wrote:

These examples suggest that maybe what we want is not so much a "no
segments ever" mode as a segment size larger than 1GB.

PS: ZFS is happy with 2^64-byte file sizes, and UFS has a 1TB file size
limit (depending on the Solaris version).

So even on Solaris, "no segments ever" is actually a pretty awful idea.
As it stands, the code would fail on tables > 1TB.

I'm thinking we need to reconsider this patch. Rather than disabling
segmentation altogether, we should see it as allowing use of segments
larger than 1GB. I suggest that we ought to just flat rip out the "non
segmenting" code paths in md.c, and instead look into what segment sizes
are appropriate on different platforms.

Yes, I agree. It seems only ZFS is OK at this moment, and if somebody sets
32TB he gets nonsegment mode anyway. I looked into the POSIX standard and
there is a useful function which can be used. See

http://www.opengroup.org/onlinepubs/009695399/functions/pathconf.html

Maybe we can put an additional test into configure and collect appropriate
data from the buildfarm.

I think the current patch could stay in CVS, and I will rip out the
nonsegment code path in a new patch.

Zdenek

#30 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zdenek Kotala (#29)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

I think the current patch could stay in CVS, and I will rip out the
nonsegment code path in a new patch.

Sure, I feel no need to revert what's applied. Have at it.

regards, tom lane

#31 Peter Eisentraut
peter_e@gmx.net
In reply to: Zdenek Kotala (#29)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Zdenek Kotala wrote:

Yes, I agree. It seems only ZFS is OK at this moment, and if somebody sets
32TB he gets nonsegment mode anyway.

Surely if you set the segment size to INT64_MAX, you will get nonsegmented
behavior anyway, so two code paths might not be necessary at all.

I looked into the POSIX standard and
there is a useful function which can be used. See

http://www.opengroup.org/onlinepubs/009695399/functions/pathconf.html

Maybe we can put an additional test into configure and collect appropriate
data from the buildfarm.

It might be good to just check first if it returns realistic values for the
example cases that have been mentioned.

#32 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#31)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Peter Eisentraut <peter_e@gmx.net> writes:

Zdenek Kotala wrote:

Maybe we can put an additional test into configure and collect appropriate
data from the buildfarm.

It might be good to just check first if it returns realistic values for the
example cases that have been mentioned.

Yeah, please just make up a ten-line C program that prints the numbers
you want, and post it on -hackers for people to try. If manual testing
says that it's printing useful numbers, then it would be time enough to
think about how to get it into the buildfarm.

regards, tom lane
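
A sketch along the requested lines (an illustration, not a program actually
posted in the thread), using the POSIX pathconf(_PC_FILESIZEBITS) probe
Zdenek pointed at; run it with one or more mount points as arguments:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int			i;

	for (i = 1; i < argc; i++)
	{
		long		bits;

		errno = 0;
		bits = pathconf(argv[i], _PC_FILESIZEBITS);
		if (bits == -1 && errno != 0)
			perror(argv[i]);	/* real error */
		else if (bits == -1)
			printf("%s: no explicit limit reported\n", argv[i]);
		else
			printf("%s: %ld file size bits\n", argv[i], bits);
	}
	return 0;
}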

#33 Peter Eisentraut
peter_e@gmx.net
In reply to: Zdenek Kotala (#26)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Zdenek Kotala wrote:

But as was mentioned in this thread, maybe something like
"CREATE TABLESPACE name LOCATION '/my/location' SEGMENTS 10GB" would be a
good solution. If SEGMENTS is not mentioned, then the default value is used.

I think you would need a tool to resegmentize a table or tablespace offline,
usable for example when recovering a backup.

Also, tablespace configuration information is of course also stored in a
table. pg_tablespace probably won't become large, but it would probably
still need to be special-cased, along with other system catalogs perhaps.

And then, how to coordinate offline resegmenting and online tablespace
operations in a crash-safe way?

Another factor I just thought of is that tar, commonly used as part of a
backup procedure, can on some systems only handle files up to 8 GB in size.
There are supposed to be newer formats that can avoid that restriction, but
it's not clear how widely available these are and what the incantation is to
get at them. Of course we don't use tar directly, but if we ever make large
segments the default, we ought to provide some clear advice for the user on
how to make their backups.

#34 Martijn van Oosterhout
kleptog@svana.org
In reply to: Peter Eisentraut (#33)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

On Wed, Mar 19, 2008 at 09:38:12AM +0100, Peter Eisentraut wrote:

Another factor I just thought of is that tar, commonly used as part of a
backup procedure, can on some systems only handle files up to 8 GB in size.
There are supposed to be newer formats that can avoid that restriction, but
it's not clear how widely available these are and what the incantation is to
get at them. Of course we don't use tar directly, but if we ever make large
segments the default, we ought to provide some clear advice for the user on
how to make their backups.

By my reading, GNU tar handles larger files and no-one else (not even a
POSIX standard tar) can...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#35 Ken
In reply to: Martijn van Oosterhout (#34)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

On Wed, Mar 19, 2008 at 10:51:12AM +0100, Martijn van Oosterhout wrote:

On Wed, Mar 19, 2008 at 09:38:12AM +0100, Peter Eisentraut wrote:

Another factor I just thought of is that tar, commonly used as part of a
backup procedure, can on some systems only handle files up to 8 GB in size.
There are supposed to be newer formats that can avoid that restriction, but
it's not clear how widely available these are and what the incantation is to
get at them. Of course we don't use tar directly, but if we ever make large
segments the default, we ought to provide some clear advice for the user on
how to make their backups.

By my reading, GNU tar handles larger files and no-one else (not even a
POSIX standard tar) can...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

The star program written by Joerg Schilling is a very well-written,
POSIX-compatible tar program that can easily handle files larger than
8GB. It is another backup option.

Cheers,
Ken

#36 Zdeněk Kotala
Zdenek.Kotala@Sun.COM
In reply to: Peter Eisentraut (#33)
Re: [PATCHES] Fix for large file support (nonsegment mode support)

Peter Eisentraut wrote:

Zdenek Kotala wrote:

But as was mentioned in this thread, maybe something like
"CREATE TABLESPACE name LOCATION '/my/location' SEGMENTS 10GB" would be a
good solution. If SEGMENTS is not mentioned, then the default value is used.

I think you would need a tool to resegmentize a table or tablespace offline,
usable for example when recovering a backup.

Do you mean something like a strip(1) command? I don't see any use case for
terabytes of data. You usually have a problem finding a place to back up to.

Also, tablespace configuration information is of course also stored in a
table. pg_tablespace probably won't become large, but it would probably
still need to be special-cased, along with other system catalogs perhaps.

That is true, and unfortunately a singularity. It is the same as the
database list, which is in a table as well but is also stored as a text
file for startup purposes. I am more inclined to use a non-table
configuration file for tablespaces, because I don't see any advantage to
having it under MVCC control, and it would also allow defining storage for
pg_global and pg_default.

And then, how to coordinate offline resegmenting and online tablespace
operations in a crash-safe way?

Another factor I just thought of is that tar, commonly used as part of a
backup procedure, can on some systems only handle files up to 8 GB in size.
There are supposed to be newer formats that can avoid that restriction, but
it's not clear how widely available these are and what the incantation is to
get at them. Of course we don't use tar directly, but if we ever make large
segments the default, we ought to provide some clear advice for the user on
how to make their backups.

I think tar is OK - at least on Solaris. See man largefile.

The default segment size should still be 1GB. If a DBA decides to increase
it to a higher value, then it is his responsibility to find a way to
process such big files.

Zdenek