Extending varlena

Started by David Fetter over 17 years ago, 25 messages
#1David Fetter
david@fetter.org

Folks,

As the things stored in databases grow, we're going to start needing
to think about database objects that 4 bytes of size can't describe.
People are already storing video in lo and bytea fields. To date, the
sizes of media files have never trended downward.

What would need to happen for the next jump up from where varlena is
now, to 8 bytes? Would we want to use the bit-stuffing model that the
current varvarlena uses?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Fetter (#1)
Re: Extending varlena

David Fetter <david@fetter.org> writes:

What would need to happen for the next jump up from where varlena is
now, to 8 bytes?

Dealing with upwards-of-4GB blobs as single Datums isn't remotely sane,
and won't become so in the near (or even medium) future. So I don't
see the point of doing all the work that would be involved in making
this go.

What would make more sense is to redesign the large-object stuff to be
somewhat modern and featureful, and provide stream-access APIs (think
lo_read, lo_seek, etc) that allow offsets wider than 32 bits. The main
things I think we'd need to consider besides just the access API are

- permissions features (more than "none" anyway)
- better management of orphaned objects (obsoleting vacuumlo)
- support > 16TB of large objects (maybe partition pg_largeobject?)
- dump and restore probably need improvement to be practical for such
large data volumes
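
For illustration, here is a minimal client-side sketch of what such a
wide-offset stream API could look like from libpq. The 64-bit entry points
(lo_lseek64, lo_tell64) are hypothetical -- today only the 32-bit lo_lseek
and lo_tell exist:

/* Sketch only: read 1 KB starting 6 GB into a large object, assuming
 * hypothetical 64-bit variants of the existing libpq LO functions. */
#include <stdio.h>
#include <stdint.h>
#include "libpq-fe.h"
#include "libpq/libpq-fs.h"

static void
read_past_4gb(PGconn *conn, Oid loid)
{
	char	buf[1024];
	int		fd = lo_open(conn, loid, INV_READ);
	int64_t	pos;

	if (fd < 0)
		return;

	/* seek beyond the 2^32 boundary -- impossible with 32-bit lo_lseek */
	if (lo_lseek64(conn, fd, INT64_C(6) * 1024 * 1024 * 1024, SEEK_SET) < 0)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		lo_close(conn, fd);
		return;
	}
	pos = lo_tell64(conn, fd);				/* hypothetical, returns int64 */
	lo_read(conn, fd, buf, sizeof(buf));	/* existing read call */
	fprintf(stderr, "read %zu bytes at offset %lld\n",
			sizeof(buf), (long long) pos);
	lo_close(conn, fd);
}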

regards, tom lane

#3David Fetter
david@fetter.org
In reply to: Tom Lane (#2)
Re: Extending varlena

On Mon, Aug 18, 2008 at 04:22:56PM -0400, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

What would need to happen for the next jump up from where varlena
is now, to 8 bytes?

Dealing with upwards-of-4GB blobs as single Datums isn't remotely
sane, and won't become so in the near (or even medium) future. So I
don't see the point of doing all the work that would be involved in
making this go.

OK

What would make more sense is to redesign the large-object stuff to
be somewhat modern and featureful, and provide stream-access APIs
(think lo_read, lo_seek, etc) that allow offsets wider than 32 bits.

Great!

The main things I think we'd need to consider besides just the
access API are

- permissions features (more than "none" anyway)

Would ROLEs work, or are you thinking of the per-row and per-column
access controls people sometimes want?

- better management of orphaned objects (obsoleting vacuumlo)
- support > 16TB of large objects (maybe partition pg_largeobject?)
- dump and restore probably need improvement to be practical for such
large data volumes

That, and the usual upgrade-in-place :)

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#4Andrew Chernow
ac@esilo.com
In reply to: David Fetter (#1)
Re: Extending varlena

David Fetter wrote:

Folks,

As the things stored in databases grow, we're going to start needing
to think about database objects that 4 bytes of size can't describe.
People are already storing video in lo and bytea fields. To date, the
sizes of media files have never trended downward.

I always find these requests puzzling. Is it really useful to store the
data for a jpeg, video file or a 10GB tar ball in a database column?
Does anyone actually search for byte sequences within those data streams
(maybe if it were text)? I would think that the metadata is what gets
searched: title, track, name, file times, size, etc... Database storage
is normally pricey, stocked with 15K drives, so wasting that expensive
storage with non-searchable binary blobs doesn't make much sense. Why
not offload the data to a file system with 7200 RPM SATA drives and
store a reference to it in the db? Keep the db more compact and simpler
to manage.

Andrew Chernow
eSilo, LLC

#5David Fetter
david@fetter.org
In reply to: Andrew Chernow (#4)
Re: Extending varlena

On Mon, Aug 18, 2008 at 06:09:13PM -0400, Andrew Chernow wrote:

David Fetter wrote:

Folks,

As the things stored in databases grow, we're going to start
needing to think about database objects that 4 bytes of size can't
describe. People are already storing video in lo and bytea fields.
To date, the sizes of media files have never trended downward.

I always find these requests puzzling. Is it really useful to store
the data for a jpeg, video file or a 10GB tar ball in a database
column?

It is if you need transaction semantics. Think medical records, etc.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#6Andrew Chernow
ac@esilo.com
In reply to: David Fetter (#5)
Re: Extending varlena

David Fetter wrote:

On Mon, Aug 18, 2008 at 06:09:13PM -0400, Andrew Chernow wrote:

David Fetter wrote:

Folks,

As the things stored in databases grow, we're going to start
needing to think about database objects that 4 bytes of size can't
describe. People are already storing video in lo and bytea fields.
To date, the sizes of media files have never trended downward.

I always find these requests puzzling. Is it really useful to store
the data for a jpeg, video file or a 10GB tar ball in a database
column?

It is if you need transaction semantics. Think medical records, etc.

Cheers,
David.

I see that, although developing the middleware between db and fs is
rather trivial. I think that is the puzzling part. It just feels awkward
to "me" to just stuff it in the db. You can do more by distributing.

Anyways (back on topic), I am in favor of removing limits from any
section of the database ... not just your suggestion. The end-user
application should impose limits.

Andrew

#7Josh Berkus
josh@agliodbs.com
In reply to: Andrew Chernow (#4)
Re: Extending varlena

Andrew,

I always find these requests puzzling. Is it really useful to store the
data for a jpeg, video file or a 10GB tar ball in a database column?

Some people find it useful. Because LOs are actually easier to manage in
PG than in most other DBMSes, right now that's a significant source of
PostgreSQL adoption. I'd like to encourage those users by giving them
more useful LO features.

--
--Josh

Josh Berkus
PostgreSQL
San Francisco

#8Jeff Davis
pgsql@j-davis.com
In reply to: Andrew Chernow (#4)
Re: Extending varlena

I always find these requests puzzling. Is it really useful to store the
data for a jpeg, video file or a 10GB tar ball in a database column?

One use case is that it can use the existing postgresql protocol, and
does not require extra filesystem mounts, extra error handling, and
other complexity.

Regards,
Jeff Davis

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Chernow (#6)
Re: Extending varlena

Andrew Chernow <ac@esilo.com> writes:

Anyways (back on topic), I am in favor of removing limits from any
section of the database ... not just your suggestion. The end-user
application should impose limits.

That's nice as an abstract principle, but there are only so many hours
in the day, so we need to prioritize which limits we're going to get rid
of. The 4-byte limit on individual Datum sizes does not strike me as a
limit that's going to be significant for practical use any time soon.
(I grant David's premise that people will soon want to work with objects
that are larger than that --- but not that they'll want to push them
around as indivisible, store-and-fetch-as-a-unit field values.)

regards, tom lane

#10Andrew Chernow
ac@esilo.com
In reply to: Tom Lane (#9)
Re: Extending varlena

Tom Lane wrote:

Andrew Chernow <ac@esilo.com> writes:

Anyways (back on topic), I am in favor of removing limits from any
section of the database ... not just your suggestion. The end-user
application should impose limits.

That's nice as an abstract principle, but there are only so many hours
in the day, so we need to prioritize which limits we're going to get rid
of. The 4-byte limit on individual Datum sizes does not strike me as a
limit that's going to be significant for practical use any time soon.
(I grant David's premise that people will soon want to work with objects
that are larger than that --- but not that they'll want to push them
around as indivisible, store-and-fetch-as-a-unit field values.)

regards, tom lane

Yeah, my comments were overly general. I wasn't suggesting attention be
put on one limit over another. I was only saying that the act of
removing a limit (of which many are arbitrary) is most often a good one.

andrew

#11Andrew Chernow
ac@esilo.com
In reply to: Jeff Davis (#8)
Re: Extending varlena

Jeff Davis wrote:

I always find these requests puzzling. Is it really useful to store the
data for a jpeg, video file or a 10GB tar ball in a database column?

One use case is that it can use the existing postgresql protocol,

So can what I am suggesting. How about a user-defined C function in the
backend that talks to the fs and uses SPI to sync info with a record?
Now the operation is behind a transaction. Yet, one must handle fs
orphans from evil crash cases.

Just one solution, but other more creative cats may have better ideas.
The point is, it can be done without too much effort. A little TLC :)
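
A rough sketch of that idea, under the assumption of a helper table
blob_refs(path text) and a made-up function name; quoting, error handling
and cleanup of orphaned files are glossed over:

/* Hypothetical backend function: write the payload to the filesystem and
 * record its path via SPI in the same transaction.  If the transaction
 * aborts after the write, the file is orphaned -- a sweeper is still
 * needed, as noted above. */
#include "postgres.h"
#include "fmgr.h"
#include "executor/spi.h"
#include "utils/builtins.h"
#include <fcntl.h>
#include <unistd.h>

PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(store_blob_ref);

Datum
store_blob_ref(PG_FUNCTION_ARGS)
{
	text	   *path = PG_GETARG_TEXT_PP(0);
	bytea	   *payload = PG_GETARG_BYTEA_PP(1);
	char	   *cpath = text_to_cstring(path);
	char		query[1024];
	int			fd;

	fd = open(cpath, O_CREAT | O_WRONLY | O_TRUNC, 0600);
	if (fd < 0)
		ereport(ERROR, (errmsg("could not create \"%s\"", cpath)));
	(void) write(fd, VARDATA_ANY(payload), VARSIZE_ANY_EXHDR(payload));
	close(fd);

	/* record the reference in an ordinary table
	 * (real code must quote/escape cpath) */
	SPI_connect();
	snprintf(query, sizeof(query),
			 "INSERT INTO blob_refs(path) VALUES ('%s')", cpath);
	SPI_exec(query, 0);
	SPI_finish();

	PG_RETURN_BOOL(true);
}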

Andrew

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Fetter (#3)
Re: Extending varlena

David Fetter <david@fetter.org> writes:

On Mon, Aug 18, 2008 at 04:22:56PM -0400, Tom Lane wrote:

The main things I think we'd need to consider besides just the
access API are

- permissions features (more than "none" anyway)

Would ROLEs work, or are you thinking of the per-row and per-column
access controls people sometimes want?

Well, obviously roles are the entities that receive permissions, but
on what do we base granting permissions to LOs?

With the current model that a LO is an independent entity that is merely
referenced (or not) by OIDs in the database, it seems like we'd have to
grant/revoke permissions to individual LOs, identified by OID; which
sure seems messy to me. People don't really want to name their LOs
by OID anyway --- it's just a convention that's forced on them by the
current implementation.

I was kinda wondering about something closer to the TOAST model, where
a blob is only referenceable from a value that's in a table field;
and that value encapsulates the "name" of the blob in some way that
needn't even be user-visible. This'd greatly simplify the
cleanup-dead-objects problem, and we could avoid addressing the
permissions problem at all, since regular SQL permissions on the table
would serve fine. But it's not clear what regular SQL fetch and update
behaviors should be like for such a thing. (Fetching or storing the
whole blob value is right out, IMHO.) ISTR hearing of concepts roughly
like this in other DBs --- does it ring a bell for anyone?

regards, tom lane

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Fetter (#5)
Re: Extending varlena

David Fetter <david@fetter.org> writes:

On Mon, Aug 18, 2008 at 06:09:13PM -0400, Andrew Chernow wrote:

I always find these requests puzzling. Is it really useful to store
the data for a jpeg, video file or a 10GB tar ball in a database
column?

It is if you need transaction semantics. Think medical records, etc.

The basic problem with outside-the-DB data storage is keeping it in sync
with your inside-the-DB metadata. In a slowly changing dataset you can
probably get away with external storage, but if there's lots of updates
then allowing the DB to manage the storage definitely makes life easier.

This is not to say that you want SQL-style operations on the blobs;
in fact I think you probably don't, which is why I was pointing to
a LO-style API.

regards, tom lane

#14David Fetter
david@fetter.org
In reply to: Tom Lane (#12)
Re: Extending varlena

On Mon, Aug 18, 2008 at 07:31:04PM -0400, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

On Mon, Aug 18, 2008 at 04:22:56PM -0400, Tom Lane wrote:

The main things I think we'd need to consider besides just the
access API are

- permissions features (more than "none" anyway)

Would ROLEs work, or are you thinking of the per-row and per-column
access controls people sometimes want?

Well, obviously roles are the entities that receive permissions, but
on what do we base granting permissions to LOs?

With the current model that a LO is an independent entity that is merely
referenced (or not) by OIDs in the database, it seems like we'd have to
grant/revoke permissions to individual LOs, identified by OID; which
sure seems messy to me. People don't really want to name their LOs
by OID anyway --- it's just a convention that's forced on them by the
current implementation.

I was kinda wondering about something closer to the TOAST model, where
a blob is only referenceable from a value that's in a table field;
and that value encapsulates the "name" of the blob in some way that
needn't even be user-visible.

This vaguely reminds me of Sybase's hidden primary keys.

This'd greatly simplify the
cleanup-dead-objects problem, and we could avoid addressing the
permissions problem at all, since regular SQL permissions on the table
would serve fine. But it's not clear what regular SQL fetch and update
behaviors should be like for such a thing. (Fetching or storing the
whole blob value is right out, IMHO.) ISTR hearing of concepts roughly
like this in other DBs --- does it ring a bell for anyone?

Informix has some pretty good blob-handling:

http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.sqlr.doc/sqlrmst101.htm

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#15Gregory Williamson
Gregory.Williamson@digitalglobe.com
In reply to: David Fetter (#1)
Re: Extending varlena

David Fetter wrote"

<...>

This'd greatly simplify the
cleanup-dead-objects problem, and we could avoid addressing the
permissions problem at all, since regular SQL permissions on the table
would serve fine. But it's not clear what regular SQL fetch and update
behaviors should be like for such a thing. (Fetching or storing the
whole blob value is right out, IMHO.) ISTR hearing of concepts roughly
like this in other DBs --- does it ring a bell for anyone?

Informix has some pretty good blob-handling:

http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.sqlr.doc/sqlrmst101.htm

Agreed. I used Informix a few years back in a system that scanned both sides of multi-page financial documents; we stored them in Informix's blobs, which IIRC could be tuned to a given number of bytes. We found that 90% of our images fit in a given size, and since Informix's raw-disk access let it move the whole blob in a single pass, it was quite fast and gave us all the warmth and fuzziness of ACID functionality. But we didn't fetch parts of the blob -- metadata lived in its own table. There is/was an Illustra/Informix blade which in theory lets you do some processing of images (indexing), but that seems like a very specialized case.

Greg Williamson
Senior DBA
DigitalGlobe

Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information and must be protected in accordance with those provisions. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.

(My corporate masters made me say this.)

#16Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#2)
Re: Extending varlena

On Mon, 2008-08-18 at 16:22 -0400, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

What would need to happen for the next jump up from where varlena is
now, to 8 bytes?

Dealing with upwards-of-4GB blobs as single Datums isn't remotely sane,
and won't become so in the near (or even medium) future. So I don't
see the point of doing all the work that would be involved in making
this go.

What would make more sense is to redesign the large-object stuff to be
somewhat modern and featureful, and provide stream-access APIs (think
lo_read, lo_seek, etc) that allow offsets wider than 32 bits. The main
things I think we'd need to consider besides just the access API are

- permissions features (more than "none" anyway)
- better management of orphaned objects (obsoleting vacuumlo)
- support > 16TB of large objects (maybe partition pg_largeobject?)
- dump and restore probably need improvement to be practical for such
large data volumes

Sounds like a good list.

Probably also using a separate Sequence to allocate numbers rather than
using up all the Oids on LOs would be a good plan.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#16)
Re: Extending varlena

Simon Riggs <simon@2ndQuadrant.com> writes:

Probably also using a separate Sequence to allocate numbers rather than
using up all the Oids on LOs would be a good plan.

Well, assuming that your Large Objects are actually Large, you aren't
going to need as many OIDs as all that ;-)

However: I was chewing on this point a few days ago, and it seemed to me
that essentially duplicating the functionality of the OID generator
wasn't likely to be a win. What seems more practical is to extend the
internal next-OID counter to 64 bits, and allow callers to get either
the full 64 bits or just the lowest 32 bits depending on what they need.
This change would actually be entirely transparent to 32-bit callers,
and the extra cycles to manage a 64-bit counter would surely be lost in
the noise compared to acquiring/releasing OidGenLock.
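
A toy model of that counter, just to illustrate the shape (locking,
wraparound handling, and checkpointing of the real OID generator are all
ignored here):

/* Widened counter: existing 32-bit callers keep seeing the low word,
 * while new callers can take the full 64 bits. */
#include <stdint.h>

static uint64_t next_oid_counter = 100000;	/* arbitrary start value */

static uint64_t
get_new_oid64(void)
{
	return next_oid_counter++;
}

static uint32_t
get_new_oid32(void)
{
	/* 32-bit callers are unaffected by the widening */
	return (uint32_t) get_new_oid64();
}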

regards, tom lane

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#17)
Re: Extending varlena

On Mon, 2008-08-18 at 23:43 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

Probably also using a separate Sequence to allocate numbers rather than
using up all the Oids on LOs would be a good plan.

Well, assuming that your Large Objects are actually Large, you aren't
going to need as many OIDs as all that ;-)

However: I was chewing on this point a few days ago, and it seemed to me
that essentially duplicating the functionality of the OID generator
wasn't likely to be a win. What seems more practical is to extend the
internal next-OID counter to 64 bits, and allow callers to get either
the full 64 bits or just the lowest 32 bits depending on what they need.
This change would actually be entirely transparent to 32-bit callers,
and the extra cycles to manage a 64-bit counter would surely be lost in
the noise compared to acquiring/releasing OidGenLock.

Sounds very cool.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#19Jeremy Drake
pgsql@jdrake.com
In reply to: Tom Lane (#2)
1 attachment(s)
Re: Extending varlena

On Mon, 18 Aug 2008, Tom Lane wrote:

What would make more sense is to redesign the large-object stuff to be
somewhat modern and featureful, and provide stream-access APIs (think
lo_read, lo_seek, etc) that allow offsets wider than 32 bits.

A few years ago, I was working on such a project for a company I used to
work for. The company changed directions shortly thereafter, and the
project was dropped, but perhaps the patch might still be useful as a
starting point for someone else.

The original patch is
http://archives.postgresql.org/pgsql-hackers/2005-09/msg01026.php, and the
advice I was working on implementing was in
http://archives.postgresql.org/pgsql-hackers/2005-09/msg01063.php

I am attaching the latest version of the patch I found around. As it was
almost 3 years ago, I am a little fuzzy on where I left off, but I do
remember that I was trying to work through the suggestions Tom Lane gave
in that second linked email. I would recommend discarding the libpq
changes, since that seemed to not pass muster.

Note that this patch was against 8.0.3. There only seem to be a few
issues applying it to the current head, but I haven't really dug into them
to see how difficult it would be to update. Luckily, the large object
code is fairly slow-moving, so there aren't too many conflicts. One thing
I did notice is that it looks like someone extracted one of the functions
I wrote in this patch and applied it as a 32-bit version. Good for them.
I'm glad someone got some use out of this project, and perhaps more use
will come of it.

--
At the source of every error which is blamed on the computer you will
find at least two human errors, including the error of blaming it on
the computer.

Attachments:

postgres-8.0.3-64bit-lo-r2.patch (text/plain; charset=US-ASCII)
diff -Nur postgresql-8.0.3-orig/src/backend/libpq/be-fsstubs.c postgresql-8.0.3/src/backend/libpq/be-fsstubs.c
--- postgresql-8.0.3-orig/src/backend/libpq/be-fsstubs.c	2004-12-31 13:59:50.000000000 -0800
+++ postgresql-8.0.3/src/backend/libpq/be-fsstubs.c	2005-10-03 11:43:36.000000000 -0700
@@ -233,6 +233,34 @@
 	PG_RETURN_INT32(status);
 }
 
+
+Datum
+lo_lseek64(PG_FUNCTION_ARGS)
+{
+	int32		fd = PG_GETARG_INT32(0);
+	int64		offset = PG_GETARG_INT64(1);
+	int32		whence = PG_GETARG_INT32(2);
+	MemoryContext currentContext;
+	int64			status;
+
+	if (fd < 0 || fd >= cookies_size || cookies[fd] == NULL)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("invalid large-object descriptor: %d", fd)));
+		PG_RETURN_INT64(-1);
+	}
+
+	Assert(fscxt != NULL);
+	currentContext = MemoryContextSwitchTo(fscxt);
+
+	status = inv_seek(cookies[fd], offset, whence);
+
+	MemoryContextSwitchTo(currentContext);
+
+	PG_RETURN_INT64(status);
+}
+
 Datum
 lo_creat(PG_FUNCTION_ARGS)
 {
@@ -283,6 +311,165 @@
 	PG_RETURN_INT32(inv_tell(cookies[fd]));
 }
 
+
+Datum
+lo_tell64(PG_FUNCTION_ARGS)
+{
+	int32		fd = PG_GETARG_INT32(0);
+
+	if (fd < 0 || fd >= cookies_size || cookies[fd] == NULL)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("invalid large-object descriptor: %d", fd)));
+		PG_RETURN_INT64(-1);
+	}
+
+	/*
+	 * We assume we do not need to switch contexts for inv_tell. That is
+	 * true for now, but is probably more than this module ought to
+	 * assume...
+	 */
+	PG_RETURN_INT64(inv_tell(cookies[fd]));
+}
+
+Datum
+lo_length(PG_FUNCTION_ARGS)
+{
+	int32		fd = PG_GETARG_INT32(0);
+	int32		sz = 0;
+	MemoryContext currentContext;
+
+	if (fd < 0 || fd >= cookies_size || cookies[fd] == NULL)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("invalid large-object descriptor: %d", fd)));
+		PG_RETURN_INT32(-1);
+	}
+	Assert(fscxt != NULL);
+	currentContext = MemoryContextSwitchTo(fscxt);
+
+	sz = inv_length(cookies[fd]);
+
+	MemoryContextSwitchTo(currentContext);
+
+	PG_RETURN_INT32(sz);
+}
+
+Datum
+lo_length64(PG_FUNCTION_ARGS)
+{
+	int32		fd = PG_GETARG_INT32(0);
+	int64		sz = 0;
+	MemoryContext currentContext;
+
+	if (fd < 0 || fd >= cookies_size || cookies[fd] == NULL)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("invalid large-object descriptor: %d", fd)));
+		PG_RETURN_INT64(-1);
+	}
+	Assert(fscxt != NULL);
+	currentContext = MemoryContextSwitchTo(fscxt);
+
+	sz = inv_length(cookies[fd]);
+
+	MemoryContextSwitchTo(currentContext);
+
+	PG_RETURN_INT64(sz);
+}
+
+Datum
+lo_truncate(PG_FUNCTION_ARGS)
+{
+	int32		fd = PG_GETARG_INT32(0);
+	int32		offset = PG_GETARG_INT32(1);
+	int		status = 0;
+	MemoryContext currentContext;
+
+	if (fd < 0 || fd >= cookies_size || cookies[fd] == NULL)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("invalid large-object descriptor: %d", fd)));
+		PG_RETURN_INT32(-1);
+	}
+	Assert(fscxt != NULL);
+	currentContext = MemoryContextSwitchTo(fscxt);
+
+	status = inv_truncate(cookies[fd], offset);
+
+	MemoryContextSwitchTo(currentContext);
+
+	PG_RETURN_INT32(status);
+}
+
+
+Datum
+lo_truncate64(PG_FUNCTION_ARGS)
+{
+	int32		fd = PG_GETARG_INT32(0);
+	int64		offset = PG_GETARG_INT64(1);
+	int		status = 0;
+	MemoryContext currentContext;
+
+	if (fd < 0 || fd >= cookies_size || cookies[fd] == NULL)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("invalid large-object descriptor: %d", fd)));
+		PG_RETURN_INT64(-1);
+	}
+	Assert(fscxt != NULL);
+	currentContext = MemoryContextSwitchTo(fscxt);
+
+	status = inv_truncate(cookies[fd], offset);
+
+	MemoryContextSwitchTo(currentContext);
+
+	PG_RETURN_INT32(status);
+}
+
+Datum
+lo_stat(PG_FUNCTION_ARGS)
+{
+	Oid			lobjId = PG_GETARG_OID(0);
+	MemoryContext currentContext;
+	int32 sz = 0;
+
+	if (fscxt == NULL)
+	{
+		CreateFSContext();
+	}
+
+	currentContext = MemoryContextSwitchTo(fscxt);
+	sz = inv_stat (lobjId);
+	MemoryContextSwitchTo(currentContext);
+
+	PG_RETURN_INT32(sz);
+}
+
+Datum
+lo_stat64(PG_FUNCTION_ARGS)
+{
+	Oid			lobjId = PG_GETARG_OID(0);
+	MemoryContext currentContext;
+	int64 sz = 0;
+
+	if (fscxt == NULL)
+	{
+		CreateFSContext();
+	}
+
+	currentContext = MemoryContextSwitchTo(fscxt);
+	sz = inv_stat (lobjId);
+	MemoryContextSwitchTo(currentContext);
+
+	PG_RETURN_INT64(sz);
+}
+
 Datum
 lo_unlink(PG_FUNCTION_ARGS)
 {
Binary files postgresql-8.0.3-orig/src/backend/storage/large_object/.inv_api.c.swp and postgresql-8.0.3/src/backend/storage/large_object/.inv_api.c.swp differ
diff -Nur postgresql-8.0.3-orig/src/backend/storage/large_object/inv_api.c postgresql-8.0.3/src/backend/storage/large_object/inv_api.c
--- postgresql-8.0.3-orig/src/backend/storage/large_object/inv_api.c	2004-12-31 14:00:59.000000000 -0800
+++ postgresql-8.0.3/src/backend/storage/large_object/inv_api.c	2005-10-03 12:15:39.000000000 -0700
@@ -255,23 +255,21 @@
  * NOTE: LOs can contain gaps, just like Unix files.  We actually return
  * the offset of the last byte + 1.
  */
-static uint32
-inv_getsize(LargeObjectDesc *obj_desc)
+static int64
+inv_getsize(Oid lobjId)
 {
 	bool		found = false;
-	uint32		lastbyte = 0;
+	int64		lastbyte = 0;
 	ScanKeyData skey[1];
 	IndexScanDesc sd;
 	HeapTuple	tuple;
 
-	Assert(PointerIsValid(obj_desc));
-
 	open_lo_relation();
 
 	ScanKeyInit(&skey[0],
 				Anum_pg_largeobject_loid,
 				BTEqualStrategyNumber, F_OIDEQ,
-				ObjectIdGetDatum(obj_desc->id));
+				ObjectIdGetDatum(lobjId));
 
 	sd = index_beginscan(lo_heap_r, lo_index_r,
 						 SnapshotNow, 1, skey);
@@ -298,7 +296,7 @@
 				heap_tuple_untoast_attr((varattrib *) datafield);
 			pfreeit = true;
 		}
-		lastbyte = data->pageno * LOBLKSIZE + getbytealen(datafield);
+		lastbyte = (int64) data->pageno * LOBLKSIZE + getbytealen(datafield);
 		if (pfreeit)
 			pfree(datafield);
 		break;
@@ -309,12 +307,40 @@
 	if (!found)
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
-				 errmsg("large object %u does not exist", obj_desc->id)));
+				 errmsg("large object %u does not exist", lobjId)));
 	return lastbyte;
 }
 
-int
-inv_seek(LargeObjectDesc *obj_desc, int offset, int whence)
+/*
+ * Determine size of a large object
+ *
+ * NOTE: LOs can contain gaps, just like Unix files.  We actually return
+ * the offset of the last byte + 1.
+ */
+static int64
+inv_fgetsize(LargeObjectDesc *obj_desc)
+{
+	Assert(PointerIsValid(obj_desc));
+
+	return inv_getsize(obj_desc->id);
+}
+
+int64
+inv_length(LargeObjectDesc * obj_desc)
+{
+	Assert(PointerIsValid(obj_desc));
+
+	return inv_fgetsize(obj_desc);
+}
+
+int64
+inv_stat(Oid lobjId)
+{
+	return inv_getsize(lobjId);
+}
+
+int64
+inv_seek(LargeObjectDesc *obj_desc, int64 offset, int whence)
 {
 	Assert(PointerIsValid(obj_desc));
 
@@ -322,20 +348,20 @@
 	{
 		case SEEK_SET:
 			if (offset < 0)
-				elog(ERROR, "invalid seek offset: %d", offset);
+				elog(ERROR, "invalid seek offset: " INT64_FORMAT, offset);
 			obj_desc->offset = offset;
 			break;
 		case SEEK_CUR:
-			if (offset < 0 && obj_desc->offset < ((uint32) (-offset)))
-				elog(ERROR, "invalid seek offset: %d", offset);
+			if (offset < 0 && obj_desc->offset < ((uint64) (-offset)))
+				elog(ERROR, "invalid seek offset: " INT64_FORMAT, offset);
 			obj_desc->offset += offset;
 			break;
 		case SEEK_END:
 			{
-				uint32		size = inv_getsize(obj_desc);
+				int64		size = inv_fgetsize(obj_desc);
 
-				if (offset < 0 && size < ((uint32) (-offset)))
-					elog(ERROR, "invalid seek offset: %d", offset);
+				if (offset < 0 && size < ((uint64) (-offset)))
+					elog(ERROR, "invalid seek offset: " INT64_FORMAT, offset);
 				obj_desc->offset = size + offset;
 			}
 			break;
@@ -345,7 +371,7 @@
 	return obj_desc->offset;
 }
 
-int
+int64
 inv_tell(LargeObjectDesc *obj_desc)
 {
 	Assert(PointerIsValid(obj_desc));
@@ -357,11 +383,11 @@
 inv_read(LargeObjectDesc *obj_desc, char *buf, int nbytes)
 {
 	int			nread = 0;
-	int			n;
-	int			off;
+	int64			n;
+	int64			off;
 	int			len;
 	int32		pageno = (int32) (obj_desc->offset / LOBLKSIZE);
-	uint32		pageoff;
+	int64		pageoff;
 	ScanKeyData skey[2];
 	IndexScanDesc sd;
 	HeapTuple	tuple;
@@ -400,7 +426,7 @@
 		 * there may be missing pages if the LO contains unwritten
 		 * "holes". We want missing sections to read out as zeroes.
 		 */
-		pageoff = ((uint32) data->pageno) * LOBLKSIZE;
+		pageoff = ((int64) data->pageno) * LOBLKSIZE;
 		if (pageoff > obj_desc->offset)
 		{
 			n = pageoff - obj_desc->offset;
@@ -447,6 +473,157 @@
 }
 
 int
+inv_truncate(LargeObjectDesc *obj_desc, int64 offset)
+{
+	int64 objsz = inv_fgetsize(obj_desc);
+	if (offset > objsz)
+	{
+		int64 tempoff = obj_desc->offset;
+		obj_desc->offset = offset - 1;
+		char nul = '\0';
+		int ret = inv_write(obj_desc, &nul, 1);
+		obj_desc->offset = tempoff;
+		if (ret < 0)
+			return -1;
+		else
+			return 0;
+	}
+	else
+	{
+		int			n;
+		int			off;
+		int			len;
+		int32		pageno = (int32) (offset / LOBLKSIZE);
+		ScanKeyData skey[2];
+		IndexScanDesc sd;
+		HeapTuple	oldtuple;
+		Form_pg_largeobject olddata;
+		bytea	   *datafield;
+		bool		pfreeit;
+		struct
+		{
+			bytea		hdr;
+			char		data[LOBLKSIZE];
+		}			workbuf;
+		char	   *workb = VARATT_DATA(&workbuf.hdr);
+		HeapTuple	newtup;
+		Datum		values[Natts_pg_largeobject];
+		char		nulls[Natts_pg_largeobject];
+		char		replace[Natts_pg_largeobject];
+		CatalogIndexState indstate;
+
+		Assert(PointerIsValid(obj_desc));
+
+		open_lo_relation();
+
+		indstate = CatalogOpenIndexes(lo_heap_r);
+
+		ScanKeyInit(&skey[0],
+					Anum_pg_largeobject_loid,
+					BTEqualStrategyNumber, F_OIDEQ,
+					ObjectIdGetDatum(obj_desc->id));
+
+		ScanKeyInit(&skey[1],
+					Anum_pg_largeobject_pageno,
+					BTGreaterEqualStrategyNumber, F_INT4GE,
+					Int32GetDatum(pageno));
+
+		sd = index_beginscan(lo_heap_r, lo_index_r,
+							 SnapshotNow, 2, skey);
+
+		oldtuple = NULL;
+		olddata = NULL;
+
+		while ((oldtuple = index_getnext(sd, ForwardScanDirection)) != NULL)
+		{
+			olddata = (Form_pg_largeobject) GETSTRUCT(oldtuple);
+			Assert(olddata->pageno >= pageno);
+
+			/*
+			 * If we have a pre-existing page, see if it is the page we want
+			 * to write, or a later one.
+			 */
+			if (olddata != NULL && olddata->pageno == pageno)
+			{
+				/*
+				 * Update an existing page with fresh data.
+				 *
+				 * First, load old data into workbuf
+				 */
+				datafield = &(olddata->data);
+				pfreeit = false;
+				if (VARATT_IS_EXTENDED(datafield))
+				{
+					datafield = (bytea *)
+						heap_tuple_untoast_attr((varattrib *) datafield);
+					pfreeit = true;
+				}
+				len = getbytealen(datafield);
+				Assert(len <= LOBLKSIZE);
+				memcpy(workb, VARDATA(datafield), len);
+				if (pfreeit)
+					pfree(datafield);
+
+				/*
+				 * Fill any hole
+				 */
+				off = (int) (offset % LOBLKSIZE);
+
+				/*
+				 * Insert appropriate portion of new data
+				 */
+				n = LOBLKSIZE - off;
+				MemSet(workb + off, 0, n);
+				/* compute valid length of new page */
+				len = off;
+				VARATT_SIZEP(&workbuf.hdr) = len + VARHDRSZ;
+
+				/*
+				 * Form and insert updated tuple
+				 */
+				memset(values, 0, sizeof(values));
+				memset(nulls, ' ', sizeof(nulls));
+				memset(replace, ' ', sizeof(replace));
+				values[Anum_pg_largeobject_data - 1] = PointerGetDatum(&workbuf);
+				replace[Anum_pg_largeobject_data - 1] = 'r';
+				newtup = heap_modifytuple(oldtuple, lo_heap_r,
+										  values, nulls, replace);
+				simple_heap_update(lo_heap_r, &newtup->t_self, newtup);
+				CatalogIndexInsert(indstate, newtup);
+				heap_freetuple(newtup);
+
+				/*
+				 * We're done with this old page.
+				 */
+				oldtuple = NULL;
+				olddata = NULL;
+			}
+			else
+			{
+				/*
+				 * Delete page.
+				 *
+				 * First, fill any hole
+				 */
+				simple_heap_delete(lo_heap_r, &oldtuple->t_self);
+			}
+		}
+
+		index_endscan(sd);
+
+		CatalogCloseIndexes(indstate);
+
+		/*
+		 * Advance command counter so that my tuple updates will be seen by
+		 * later large-object operations in this transaction.
+		 */
+		CommandCounterIncrement();
+	}
+
+	return 0;
+}
+
+int
 inv_write(LargeObjectDesc *obj_desc, char *buf, int nbytes)
 {
 	int			nwritten = 0;
diff -Nur postgresql-8.0.3-orig/src/include/catalog/pg_proc.h postgresql-8.0.3/src/include/catalog/pg_proc.h
--- postgresql-8.0.3-orig/src/include/catalog/pg_proc.h	2004-12-31 14:03:25.000000000 -0800
+++ postgresql-8.0.3/src/include/catalog/pg_proc.h	2005-10-03 11:56:27.000000000 -0700
@@ -1224,12 +1224,28 @@
 DESCR("large object read");
 DATA(insert OID = 955 (  lowrite		   PGNSP PGUID 12 f f t f v 2 23 "23 17" _null_ lowrite - _null_ ));
 DESCR("large object write");
+DATA(insert OID = 0 (  lo_truncate		   PGNSP PGUID 12 f f t f v 2 23 "23 23" _null_ lo_truncate - _null_ ));
+DESCR("large object resize");
+DATA(insert OID = 0 (  lo_truncate64		   PGNSP PGUID 12 f f t f v 2 23 "23 20" _null_ lo_truncate64 - _null_ ));
+DESCR("large object resize (64bit)");
+DATA(insert OID = 0   (  lo_lseek64		   PGNSP PGUID 12 f f t f v 3 20 "23 20 23" _null_	lo_lseek64 - _null_ ));
+DESCR("large object seek (64bit)");
 DATA(insert OID = 956 (  lo_lseek		   PGNSP PGUID 12 f f t f v 3 23 "23 23 23" _null_	lo_lseek - _null_ ));
 DESCR("large object seek");
 DATA(insert OID = 957 (  lo_creat		   PGNSP PGUID 12 f f t f v 1 26 "23" _null_  lo_creat - _null_ ));
 DESCR("large object create");
+DATA(insert OID = 0   (  lo_tell64		   PGNSP PGUID 12 f f t f v 1 20 "23" _null_  lo_tell64 - _null_ ));
+DESCR("large object position (64bit)");
 DATA(insert OID = 958 (  lo_tell		   PGNSP PGUID 12 f f t f v 1 23 "23" _null_  lo_tell - _null_ ));
 DESCR("large object position");
+DATA(insert OID = 0   (  lo_stat		   PGNSP PGUID 12 f f t f v 1 23 "26" _null_  lo_stat - _null_ ));
+DESCR("large object size");
+DATA(insert OID = 0   (  lo_stat64		   PGNSP PGUID 12 f f t f v 1 20 "26" _null_  lo_stat64 - _null_ ));
+DESCR("large object size (64bit)");
+DATA(insert OID = 0   (  lo_length		   PGNSP PGUID 12 f f t f v 1 23 "23" _null_  lo_length - _null_ ));
+DESCR("large object size");
+DATA(insert OID = 0   (  lo_length64		   PGNSP PGUID 12 f f t f v 1 20 "23" _null_  lo_length64 - _null_ ));
+DESCR("large object size (64bit)");
 
 DATA(insert OID = 959 (  on_pl			   PGNSP PGUID 12 f f t f i 2  16 "600 628" _null_	on_pl - _null_ ));
 DESCR("point on line?");
diff -Nur postgresql-8.0.3-orig/src/include/libpq/be-fsstubs.h postgresql-8.0.3/src/include/libpq/be-fsstubs.h
--- postgresql-8.0.3-orig/src/include/libpq/be-fsstubs.h	2004-12-31 14:03:32.000000000 -0800
+++ postgresql-8.0.3/src/include/libpq/be-fsstubs.h	2005-10-03 11:42:43.000000000 -0700
@@ -32,8 +32,17 @@
 
 extern Datum lo_lseek(PG_FUNCTION_ARGS);
 extern Datum lo_tell(PG_FUNCTION_ARGS);
+extern Datum lo_lseek64(PG_FUNCTION_ARGS);
+extern Datum lo_tell64(PG_FUNCTION_ARGS);
 extern Datum lo_unlink(PG_FUNCTION_ARGS);
 
+extern Datum lo_length(PG_FUNCTION_ARGS);
+extern Datum lo_length64(PG_FUNCTION_ARGS);
+extern Datum lo_stat(PG_FUNCTION_ARGS);
+extern Datum lo_stat64(PG_FUNCTION_ARGS);
+extern Datum lo_truncate(PG_FUNCTION_ARGS);
+extern Datum lo_truncate64(PG_FUNCTION_ARGS);
+
 /*
  * These are not fmgr-callable, but are available to C code.
  * Probably these should have had the underscore-free names,
diff -Nur postgresql-8.0.3-orig/src/include/storage/large_object.h postgresql-8.0.3/src/include/storage/large_object.h
--- postgresql-8.0.3-orig/src/include/storage/large_object.h	2004-12-31 14:03:42.000000000 -0800
+++ postgresql-8.0.3/src/include/storage/large_object.h	2005-10-03 11:12:42.000000000 -0700
@@ -33,7 +33,7 @@
 {
 	Oid			id;				/* LO's identifier */
 	SubTransactionId subid;		/* owning subtransaction ID */
-	uint32		offset;			/* current seek pointer */
+	uint64		offset;			/* current seek pointer */
 	int			flags;			/* locking info, etc */
 
 /* flag bits: */
@@ -69,9 +69,12 @@
 extern LargeObjectDesc *inv_open(Oid lobjId, int flags);
 extern void inv_close(LargeObjectDesc *obj_desc);
 extern int	inv_drop(Oid lobjId);
-extern int	inv_seek(LargeObjectDesc *obj_desc, int offset, int whence);
-extern int	inv_tell(LargeObjectDesc *obj_desc);
+extern int64	inv_seek(LargeObjectDesc *obj_desc, int64 offset, int whence);
+extern int64	inv_tell(LargeObjectDesc *obj_desc);
+extern int64	inv_length(LargeObjectDesc *obj_desc);
+extern int64	inv_stat(Oid lobjId);
 extern int	inv_read(LargeObjectDesc *obj_desc, char *buf, int nbytes);
 extern int	inv_write(LargeObjectDesc *obj_desc, char *buf, int nbytes);
+extern int	inv_truncate(LargeObjectDesc *obj_desc, int64 offset);
 
 #endif   /* LARGE_OBJECT_H */
diff -Nur postgresql-8.0.3-orig/src/interfaces/libpq/fe-lobj.c postgresql-8.0.3/src/interfaces/libpq/fe-lobj.c
--- postgresql-8.0.3-orig/src/interfaces/libpq/fe-lobj.c	2004-12-31 14:03:50.000000000 -0800
+++ postgresql-8.0.3/src/interfaces/libpq/fe-lobj.c	2005-09-20 22:30:48.000000000 -0700
@@ -264,6 +264,61 @@
 }
 
 /*
+ * lo_lseek64
+ *	  change the current read or write location on a large object
+ * currently, only L_SET is a legal value for whence
+ *
+ */
+
+int64
+lo_lseek64(PGconn *conn, int fd, int64 offset, int whence)
+{
+	PQArgBlock	argv[3];
+	PGresult   *res;
+	int64			retval;
+	int			result_len;
+
+	if (conn->lobjfuncs == NULL)
+	{
+		if (lo_initialize(conn) < 0)
+			return -1;
+	}
+
+	if (conn->lobjfuncs->fn_lo_lseek64 == 0)
+	{
+		printfPQExpBuffer(&conn->errorMessage,
+		  libpq_gettext("cannot determine OID of function lo_lseek64\n"));
+
+		return -1;
+	}
+
+	argv[0].isint = 1;
+	argv[0].len = 4;
+	argv[0].u.integer = fd;
+
+	argv[1].isint = 1;
+	argv[1].len = 8;
+	argv[1].u.ptr = (int *) &offset;
+
+	argv[2].isint = 1;
+	argv[2].len = 4;
+	argv[2].u.integer = whence;
+
+	res = PQfn(conn, conn->lobjfuncs->fn_lo_lseek64,
+			   (int *)&retval, &result_len, 1, argv, 3);
+	if (PQresultStatus(res) == PGRES_COMMAND_OK)
+	{
+		PQclear(res);
+		return retval;
+	}
+	else
+	{
+		PQclear(res);
+		return -1;
+	}
+}
+
+/*
  * lo_creat
  *	  create a new large object
  * the mode is a bitmask describing different attributes of the new object
@@ -305,6 +360,53 @@
 
 
 /*
+ * lo_tell64
+ *	  returns the current seek location of the large object
+ *
+ */
+
+int64
+lo_tell64(PGconn *conn, int fd)
+{
+	int64			retval;
+	PQArgBlock	argv[1];
+	PGresult   *res;
+	int			result_len;
+
+	if (conn->lobjfuncs == NULL)
+	{
+		if (lo_initialize(conn) < 0)
+			return -1;
+	}
+
+	if (conn->lobjfuncs->fn_lo_tell64 == 0)
+	{
+		printfPQExpBuffer(&conn->errorMessage,
+		  libpq_gettext("cannot determine OID of function lo_tell64\n"));
+
+		return -1;
+	}
+
+	argv[0].isint = 1;
+	argv[0].len = 4;
+	argv[0].u.integer = fd;
+
+	res = PQfn(conn, conn->lobjfuncs->fn_lo_tell64,
+			   (int *) &retval, &result_len, 1, argv, 1);
+	if (PQresultStatus(res) == PGRES_COMMAND_OK)
+	{
+		PQclear(res);
+		return retval;
+	}
+	else
+	{
+		PQclear(res);
+		return -1;
+	}
+}
+
+
+/*
  * lo_tell
  *	  returns the current seek location of the large object
  *
@@ -570,7 +672,9 @@
 			"'lo_creat', "
 			"'lo_unlink', "
 			"'lo_lseek', "
+			"'lo_lseek64', "
 			"'lo_tell', "
+			"'lo_tell64', "
 			"'loread', "
 			"'lowrite') "
 			"and pronamespace = (select oid from pg_catalog.pg_namespace "
@@ -583,6 +687,8 @@
 			"or proname = 'lo_unlink' "
 			"or proname = 'lo_lseek' "
 			"or proname = 'lo_tell' "
+			"or proname = 'lo_lseek64' "
+			"or proname = 'lo_tell64' "
 			"or proname = 'loread' "
 			"or proname = 'lowrite'";
 
@@ -621,6 +727,10 @@
 			lobjfuncs->fn_lo_lseek = foid;
 		else if (!strcmp(fname, "lo_tell"))
 			lobjfuncs->fn_lo_tell = foid;
+		else if (!strcmp(fname, "lo_lseek64"))
+			lobjfuncs->fn_lo_lseek64 = foid;
+		else if (!strcmp(fname, "lo_tell64"))
+			lobjfuncs->fn_lo_tell64 = foid;
 		else if (!strcmp(fname, "loread"))
 			lobjfuncs->fn_lo_read = foid;
 		else if (!strcmp(fname, "lowrite"))
diff -Nur postgresql-8.0.3-orig/src/interfaces/libpq/fe-misc.c postgresql-8.0.3/src/interfaces/libpq/fe-misc.c
--- postgresql-8.0.3-orig/src/interfaces/libpq/fe-misc.c	2004-12-31 14:03:50.000000000 -0800
+++ postgresql-8.0.3/src/interfaces/libpq/fe-misc.c	2005-09-20 22:30:48.000000000 -0700
@@ -223,6 +223,27 @@
 			conn->inCursor += 4;
 			*result = (int) ntohl(tmp4);
 			break;
+		case 8:
+			if (conn->inCursor + 8 > conn->inEnd)
+				return EOF;
+			else
+			{
+				int64 * i64res = (int64 *)result;
+				uint32 h32;
+				memcpy(&h32, conn->inBuffer + conn->inCursor, 4);
+				conn->inCursor += 4;
+				memcpy(&tmp4, conn->inBuffer + conn->inCursor, 4);
+				conn->inCursor += 4;
+#ifdef INT64_IS_BUSTED
+				/* just lose the high half */
+				*i64res = (int) ntohl(tmp4);
+#else
+				*i64res = (int) ntohl(h32);
+				*i64res <<= 32;
+				*i64res |= (uint32) ntohl(tmp4);
+#endif
+			}
+			break;
 		default:
 			pqInternalNotice(&conn->noticeHooks,
 						 "integer of size %lu not supported by pqGetInt",
@@ -231,12 +252,23 @@
 	}
 
 	if (conn->Pfdebug)
-		fprintf(conn->Pfdebug, "From backend (#%lu)> %d\n", (unsigned long) bytes, *result);
+		fprintf(conn->Pfdebug, "From backend (#%lu)> " INT64_FORMAT "\n", (unsigned long) bytes, (bytes == 8) ? *(int64 *)result : (int64)*result);
 
 	return 0;
 }
 
 /*
+ * pqGetInt64
+ *	read an 8 byte integer and convert from network byte order
+ *	to local byte order
+ */
+int
+pqGetInt64(int64 *result, PGconn *conn)
+{
+	return pqGetInt((int *)result, 8, conn);
+}
+
+/*
  * pqPutInt
  * write an integer of 2 or 4 bytes, converting from host byte order
  * to network byte order.
@@ -272,6 +304,41 @@
 	return 0;
 }
 
+
+/*
+ * pqPutInt64
+ * write an integer of 8 bytes, converting from host byte order
+ * to network byte order.
+ */
+int
+pqPutInt64(int64 value, PGconn *conn)
+{
+	uint32		n32;
+
+	/* High order half first, since we're doing MSB-first */
+#ifdef INT64_IS_BUSTED
+	/* don't try a right shift of 32 on a 32-bit word */
+	n32 = (value < 0) ? -1 : 0;
+#else
+	n32 = (uint32) (value >> 32);
+#endif
+	n32 = htonl(n32);
+	if (pqPutMsgBytes((const char *) &n32, 4, conn))
+		return EOF;
+
+	/* Now the low order half */
+	n32 = (uint32) value;
+	n32 = htonl(n32);
+
+	if (pqPutMsgBytes((const char *) &n32, 4, conn))
+		return EOF;
+
+	if (conn->Pfdebug)
+		fprintf(conn->Pfdebug, "To backend (8#)> " INT64_FORMAT "\n", value);
+
+	return 0;
+}
+
 /*
  * Make sure conn's output buffer can hold bytes_needed bytes (caller must
  * include already-stored data into the value!)
diff -Nur postgresql-8.0.3-orig/src/interfaces/libpq/fe-protocol3.c postgresql-8.0.3/src/interfaces/libpq/fe-protocol3.c
--- postgresql-8.0.3-orig/src/interfaces/libpq/fe-protocol3.c	2004-12-31 14:03:50.000000000 -0800
+++ postgresql-8.0.3/src/interfaces/libpq/fe-protocol3.c	2005-09-20 22:30:48.000000000 -0700
@@ -1233,11 +1233,24 @@
 
 		if (args[i].isint)
 		{
-			if (pqPutInt(args[i].u.integer, args[i].len, conn))
+			if (args[i].len <= 4)
 			{
-				pqHandleSendFailure(conn);
-				return NULL;
+				if (pqPutInt(args[i].u.integer, args[i].len, conn))
+				{
+					pqHandleSendFailure(conn);
+					return NULL;
+				}
+			}
+			else
+			{
+				if (pqPutInt64(*(int64 *)args[i].u.ptr, conn))
+				{
+					pqHandleSendFailure(conn);
+					return NULL;
+				}
 			}
+
+
 		}
 		else
 		{
diff -Nur postgresql-8.0.3-orig/src/interfaces/libpq/libpq-fe.h postgresql-8.0.3/src/interfaces/libpq/libpq-fe.h
--- postgresql-8.0.3-orig/src/interfaces/libpq/libpq-fe.h	2004-12-31 14:03:50.000000000 -0800
+++ postgresql-8.0.3/src/interfaces/libpq/libpq-fe.h	2005-09-20 22:30:48.000000000 -0700
@@ -27,6 +27,7 @@
  * such as Oid.
  */
 #include "postgres_ext.h"
+#include "postgres_fe.h"
 
 /* SSL type is needed here only to declare PQgetssl() */
 #ifdef USE_SSL
@@ -479,8 +480,10 @@
 extern int	lo_read(PGconn *conn, int fd, char *buf, size_t len);
 extern int	lo_write(PGconn *conn, int fd, char *buf, size_t len);
 extern int	lo_lseek(PGconn *conn, int fd, int offset, int whence);
+extern int64	lo_lseek64(PGconn *conn, int fd, int64 offset, int whence);
 extern Oid	lo_creat(PGconn *conn, int mode);
 extern int	lo_tell(PGconn *conn, int fd);
+extern int64	lo_tell64(PGconn *conn, int fd);
 extern int	lo_unlink(PGconn *conn, Oid lobjId);
 extern Oid	lo_import(PGconn *conn, const char *filename);
 extern int	lo_export(PGconn *conn, Oid lobjId, const char *filename);
diff -Nur postgresql-8.0.3-orig/src/interfaces/libpq/libpq-int.h postgresql-8.0.3/src/interfaces/libpq/libpq-int.h
--- postgresql-8.0.3-orig/src/interfaces/libpq/libpq-int.h	2005-01-05 16:59:47.000000000 -0800
+++ postgresql-8.0.3/src/interfaces/libpq/libpq-int.h	2005-09-20 22:30:48.000000000 -0700
@@ -232,6 +232,8 @@
 	Oid			fn_lo_unlink;	/* OID of backend function lo_unlink	*/
 	Oid			fn_lo_lseek;	/* OID of backend function lo_lseek		*/
 	Oid			fn_lo_tell;		/* OID of backend function lo_tell		*/
+	Oid			fn_lo_lseek64;	/* OID of backend function lo_lseek64		*/
+	Oid			fn_lo_tell64;		/* OID of backend function lo_tell64		*/
 	Oid			fn_lo_read;		/* OID of backend function LOread		*/
 	Oid			fn_lo_write;	/* OID of backend function LOwrite		*/
 } PGlobjfuncs;
@@ -457,7 +459,9 @@
 extern int	pqGetnchar(char *s, size_t len, PGconn *conn);
 extern int	pqPutnchar(const char *s, size_t len, PGconn *conn);
 extern int	pqGetInt(int *result, size_t bytes, PGconn *conn);
+extern int	pqGetInt64(int64 *result, PGconn *conn);
 extern int	pqPutInt(int value, size_t bytes, PGconn *conn);
+extern int	pqPutInt64(int64 value, PGconn *conn);
 extern int	pqPutMsgStart(char msg_type, bool force_len, PGconn *conn);
 extern int	pqPutMsgEnd(PGconn *conn);
 extern int	pqReadData(PGconn *conn);
diff -Nur postgresql-8.0.3-orig/src/interfaces/libpq/libpq.rc postgresql-8.0.3/src/interfaces/libpq/libpq.rc
--- postgresql-8.0.3-orig/src/interfaces/libpq/libpq.rc	2005-05-09 19:16:53.000000000 -0700
+++ postgresql-8.0.3/src/interfaces/libpq/libpq.rc	2005-09-20 23:31:58.000000000 -0700
@@ -1,8 +1,8 @@
 #include <winver.h>
 
 VS_VERSION_INFO VERSIONINFO
- FILEVERSION 8,0,3,5129
- PRODUCTVERSION 8,0,3,5129
+ FILEVERSION 8,0,3,5263
+ PRODUCTVERSION 8,0,3,5263
  FILEFLAGSMASK 0x3fL
  FILEFLAGS 0
  FILEOS VOS__WINDOWS32
diff -Nur postgresql-8.0.3-orig/src/test/examples/Makefile postgresql-8.0.3/src/test/examples/Makefile
--- postgresql-8.0.3-orig/src/test/examples/Makefile	2005-03-25 10:18:41.000000000 -0800
+++ postgresql-8.0.3/src/test/examples/Makefile	2005-09-20 22:30:48.000000000 -0700
@@ -6,11 +6,11 @@
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS) -g
 override LDLIBS := $(libpq_pgport) $(LDLIBS)
 
 
-PROGS = testlibpq testlibpq2 testlibpq3 testlibpq4 testlo
+PROGS = testlibpq testlibpq2 testlibpq3 testlibpq4 testlo testlo64
 
 all: $(PROGS)
 
diff -Nur postgresql-8.0.3-orig/src/test/examples/testlo64.c postgresql-8.0.3/src/test/examples/testlo64.c
--- postgresql-8.0.3-orig/src/test/examples/testlo64.c	1969-12-31 16:00:00.000000000 -0800
+++ postgresql-8.0.3/src/test/examples/testlo64.c	2005-09-20 22:30:48.000000000 -0700
@@ -0,0 +1,267 @@
+/*-------------------------------------------------------------------------
+ *
+ * testlo.c
+ *	  test using large objects with libpq
+ *
+ * Portions Copyright (c) 1996-2005, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  $PostgreSQL: pgsql/src/test/examples/testlo.c,v 1.25 2004/12/31 22:03:58 pgsql Exp $
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+#include "libpq-fe.h"
+#include "libpq/libpq-fs.h"
+
+#define BUFSIZE			1024
+
+/*
+ * importFile -
+ *	  import file "in_filename" into database as large object "lobjOid"
+ *
+ */
+static Oid
+importFile(PGconn *conn, char *filename)
+{
+	Oid			lobjId;
+	int			lobj_fd;
+	char		buf[BUFSIZE];
+	int			nbytes,
+				tmp;
+	int			fd;
+
+	/*
+	 * open the file to be read in
+	 */
+	fd = open(filename, O_RDONLY, 0666);
+	if (fd < 0)
+	{							/* error */
+		fprintf(stderr, "can't open unix file\"%s\"\n", filename);
+	}
+
+	/*
+	 * create the large object
+	 */
+	lobjId = lo_creat(conn, INV_READ | INV_WRITE);
+	if (lobjId == 0)
+		fprintf(stderr, "can't create large object");
+
+	lobj_fd = lo_open(conn, lobjId, INV_WRITE);
+
+	/*
+	 * read in from the Unix file and write to the inversion file
+	 */
+	while ((nbytes = read(fd, buf, BUFSIZE)) > 0)
+	{
+		tmp = lo_write(conn, lobj_fd, buf, nbytes);
+		if (tmp < nbytes)
+			fprintf(stderr, "error while reading \"%s\"", filename);
+	}
+
+	close(fd);
+	lo_close(conn, lobj_fd);
+
+	return lobjId;
+}
+
+static void
+pickout(PGconn *conn, Oid lobjId, int64 start, int len)
+{
+	int			lobj_fd;
+	char	   *buf;
+	int			nbytes;
+	int			nread;
+
+	lobj_fd = lo_open(conn, lobjId, INV_READ);
+	if (lobj_fd < 0)
+		fprintf(stderr, "can't open large object %u", lobjId);
+
+	if (lo_lseek64(conn, lobj_fd, start, SEEK_SET) < 0)
+	{
+		fprintf(stderr, "error lo_lseek64: %s\n", PQerrorMessage(conn));
+		return;
+	}
+
+	buf = malloc(len + 1);
+
+	nread = 0;
+	while (len - nread > 0)
+	{
+		nbytes = lo_read(conn, lobj_fd, buf, len - nread);
+		buf[nbytes] = '\0';
+		fprintf(stderr, ">>> %s", buf);
+		nread += nbytes;
+		if (nbytes <= 0)
+			break;				/* no more data? */
+	}
+	free(buf);
+	fprintf(stderr, "\n");
+	lo_close(conn, lobj_fd);
+}
+
+static void
+overwrite(PGconn *conn, Oid lobjId, int64 start, int len)
+{
+	int			lobj_fd;
+	char	   *buf;
+	int			nbytes;
+	int			nwritten;
+	int			i;
+
+	lobj_fd = lo_open(conn, lobjId, INV_READ);
+	if (lobj_fd < 0)
+		fprintf(stderr, "can't open large object %u", lobjId);
+
+	lo_lseek64(conn, lobj_fd, start, SEEK_SET);
+	buf = malloc(len + 1);
+
+	for (i = 0; i < len; i++)
+		buf[i] = 'X';
+	buf[i] = '\0';
+
+	nwritten = 0;
+	while (len - nwritten > 0)
+	{
+		nbytes = lo_write(conn, lobj_fd, buf + nwritten, len - nwritten);
+		nwritten += nbytes;
+		if (nbytes <= 0)
+		{
+			fprintf(stderr, "\nWRITE FAILED!\n");
+			break;
+		}
+	}
+	free(buf);
+	fprintf(stderr, "\n");
+	lo_close(conn, lobj_fd);
+}
+
+
+/*
+ * exportFile -
+ *	  export large object "lobjOid" to file "out_filename"
+ *
+ */
+static void
+exportFile(PGconn *conn, Oid lobjId, char *filename)
+{
+	int			lobj_fd;
+	char		buf[BUFSIZE];
+	int			nbytes,
+				tmp;
+	int			fd;
+
+	/*
+	 * create an inversion "object"
+	 */
+	lobj_fd = lo_open(conn, lobjId, INV_READ);
+	if (lobj_fd < 0)
+		fprintf(stderr, "can't open large object %u", lobjId);
+
+	/*
+	 * open the file to be written to
+	 */
+	fd = open(filename, O_CREAT | O_WRONLY | O_TRUNC, 0666);
+	if (fd < 0)
+	{							/* error */
+		fprintf(stderr, "can't open unix file\"%s\"",
+				filename);
+	}
+
+	/*
+	 * read in from the Unix file and write to the inversion file
+	 */
+	while ((nbytes = lo_read(conn, lobj_fd, buf, BUFSIZE)) > 0)
+	{
+		tmp = write(fd, buf, nbytes);
+		if (tmp < nbytes)
+		{
+			fprintf(stderr, "error while writing \"%s\"",
+					filename);
+		}
+	}
+
+	lo_close(conn, lobj_fd);
+	close(fd);
+
+	return;
+}
+
+static void
+exit_nicely(PGconn *conn)
+{
+	PQfinish(conn);
+	exit(1);
+}
+
+int
+main(int argc, char **argv)
+{
+	char	   *in_filename,
+			   *out_filename;
+	char	   *database;
+	Oid			lobjOid;
+	PGconn	   *conn;
+	PGresult   *res;
+
+	if (argc != 4)
+	{
+		fprintf(stderr, "Usage: %s database_name in_filename out_filename\n",
+				argv[0]);
+		exit(1);
+	}
+
+	database = argv[1];
+	in_filename = argv[2];
+	out_filename = argv[3];
+
+	/*
+	 * set up the connection
+	 */
+	conn = PQsetdb(NULL, NULL, NULL, NULL, database);
+
+	/* check to see that the backend connection was successfully made */
+	if (PQstatus(conn) != CONNECTION_OK)
+	{
+		fprintf(stderr, "Connection to database failed: %s",
+				PQerrorMessage(conn));
+		exit_nicely(conn);
+	}
+
+	res = PQexec(conn, "begin");
+	PQclear(res);
+	printf("importing file \"%s\" ...\n", in_filename);
+/*	lobjOid = importFile(conn, in_filename); */
+	lobjOid = lo_import(conn, in_filename);
+	if (lobjOid == 0)
+		fprintf(stderr, "%s\n", PQerrorMessage(conn));
+	else
+	{
+		printf("\tas large object %u.\n", lobjOid);
+
+		printf("picking out bytes 4294967000-4294968000 of the large object\n");
+		pickout(conn, lobjOid, 4294967000ULL, 1000);
+
+		printf("overwriting bytes 4294967000-4294968000 of the large object with X's\n");
+		overwrite(conn, lobjOid, 4294967000ULL, 1000);
+
+		printf("exporting large object to file \"%s\" ...\n", out_filename);
+/*		exportFile(conn, lobjOid, out_filename); */
+		if (!lo_export(conn, lobjOid, out_filename))
+			fprintf(stderr, "%s\n", PQerrorMessage(conn));
+	}
+
+	res = PQexec(conn, "end");
+	PQclear(res);
+	PQfinish(conn);
+	return 0;
+}
#20Magnus Hagander
magnus@hagander.net
In reply to: Josh Berkus (#7)
Re: Extending varlena

Josh Berkus wrote:

Andrew,

I always find these requests puzzling. Is it really useful to store the
data for a jpeg, video file or a 10GB tar ball in a database column?

Some people find it useful. Because LOs are actually easier to manage in
PG than in most other DBMSes, right now that's a significant source of
PostgreSQL adoption. I'd like to encourage those users by giving them
more useful LO features.

Given that they don't replicate with most (any?) of the currently
available replication solutions, the fact that they are easy to use
becomes irrelevant fairly quickly to larger installations in my experience.

But the interface *is* nice, so if we could fix that, it would be very good.

//Magnus

#21Magnus Hagander
magnus@hagander.net
In reply to: Simon Riggs (#16)
Re: Extending varlena

Simon Riggs wrote:

On Mon, 2008-08-18 at 16:22 -0400, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

What would need to happen for the next jump up from where varlena is
now, to 8 bytes?

Dealing with upwards-of-4GB blobs as single Datums isn't remotely sane,
and won't become so in the near (or even medium) future. So I don't
see the point of doing all the work that would be involved in making
this go.

What would make more sense is to redesign the large-object stuff to be
somewhat modern and featureful, and provide stream-access APIs (think
lo_read, lo_seek, etc) that allow offsets wider than 32 bits. The main
things I think we'd need to consider besides just the access API are

- permissions features (more than "none" anyway)
- better management of orphaned objects (obsoleting vacuumlo)
- support > 16TB of large objects (maybe partition pg_largeobject?)
- dump and restore probably need improvement to be practical for such
large data volumes

Sounds like a good list.

Probably also using a separate Sequence to allocate numbers rather than
using up all the Oids on LOs would be a good plan.

The ability to partition the large object store would not suck either...
For backup/recovery purposes mainly.

//Magnus

#22Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#2)
Re: Extending varlena

On Monday, 18 August 2008, Tom Lane wrote:

- permissions features (more than "none" anyway)
- better management of orphaned objects (obsoleting vacuumlo)
- support > 16TB of large objects (maybe partition pg_largeobject?)
- dump and restore probably need improvement to be practical for such
  large data volumes

If you replace the third point by "maybe partition TOAST tables", replace
large object handle by TOAST pointer, and create an API to work on TOAST
pointers, how are the two so much different? And why should they be? I can
see that there are going to be needs to access large data with interfaces
that are not traditional SQL, but at least the storage handling could be the
same. That way you would solve the first two points and others for free.
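
Purely as a thought experiment, an API over TOAST pointers might look like a
stream interface on a handle; everything below is hypothetical and nothing
like it exists in the tree:

/* Hypothetical only: the same storage machinery as TOAST, addressed
 * through a handle and stream calls instead of whole-datum fetches. */
typedef struct ToastStream ToastStream;		/* opaque handle */

extern ToastStream *toast_stream_open(Oid relid, int16 attnum, Datum value);
extern int64	toast_stream_seek(ToastStream *ts, int64 offset, int whence);
extern int		toast_stream_read(ToastStream *ts, char *buf, int len);
extern int		toast_stream_write(ToastStream *ts, const char *buf, int len);
extern void		toast_stream_close(ToastStream *ts);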

#23Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#12)
Re: Extending varlena

Tom Lane <tgl@sss.pgh.pa.us> wrote:

I was kinda wondering about something closer to the TOAST model, where
a blob is only referenceable from a value that's in a table field;
and that value encapsulates the "name" of the blob in some way that
needn't even be user-visible. This'd greatly simplify the
cleanup-dead-objects problem, and we could avoid addressing the
permissions problem at all, since regular SQL permissions on the table
would serve fine. But it's not clear what regular SQL fetch and update
behaviors should be like for such a thing. (Fetching or storing the
whole blob value is right out, IMHO.) ISTR hearing of concepts roughly
like this in other DBs --- does it ring a bell for anyone?

It'd probably be good to have methods parallel to the JDBC API within
the implementation.

http://java.sun.com/javase/6/docs/api/java/sql/Blob.html
http://java.sun.com/javase/6/docs/api/java/sql/Clob.html
http://java.sun.com/javase/6/docs/api/java/sql/NClob.html

http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#getBlob(int)
http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#getClob(int)
http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#getNClob(int)

http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#updateBlob(int,%20java.sql.Blob)
http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#updateBlob(int,%20java.io.InputStream)
http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#updateBlob(int,%20java.io.InputStream,%20long)
(similar for CLOB & NCLOB)

Reading a blob value gives you an object which lets you perform the
stream-based operations against the blob. To set a blob value on an
insert, you prepare a statement and then link the stream to the blob
-- the insertRow method sucks the data from the stream. To set a blob
on an update, you use an updateable cursor (or maybe a prepared
statement) to do the same. You can set a lob from another lob
directly in SQL. I assume we'd want to support streams directly inline
in the protocol, as well as support functions to convert between
datums and streams (for, say, tiny little 2MB or 10MB values), and
files and streams (kinda like COPY).

-Kevin

#24Stephen R. van den Berg
In reply to: Peter Eisentraut (#22)
Re: Extending varlena

Peter Eisentraut wrote:

If you replace the third point by "maybe partition TOAST tables", replace
large object handle by TOAST pointer, and create an API to work on TOAST
pointers, how are the two so much different? And why should they be? I can
see that there are going to be needs to access large data with interfaces
that are not traditional SQL, but at least the storage handling could be the
same. That way you would solve the first two points and others for free.

I've long wondered why toast and large object storage are not one and the
same (indeed).
It seems like a natural solution to marry the two.
--
Sincerely,
Stephen R. van den Berg.
E-mails should be like a lady's skirt:
Long enough to cover the subject, and short enough to be interesting.

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen R. van den Berg (#24)
Re: Extending varlena

Peter Eisentraut wrote:

If you replace the third point by "maybe partition TOAST tables", replace
large object handle by TOAST pointer, and create an API to work on TOAST
pointers, how are the two so much different? And why should they be?

The reason they should be different is that (IMHO anyway) you don't want
the default behavior of SELECT * FROM ... to include pulling back the
entire contents of the blob. Indeed, we *can't* have that be the
behavior, unless we want to go back to the proposal that started this
thread of making the entire system safe for multi-gigabyte datums.

It's certainly possible that the underlying implementation could be
just TOAST, but we need some other API at the SQL level.

regards, tom lane