PDF files: to store in database or not

Started by Rich Shepardover 9 years ago26 messagesgeneral
Jump to latest
#1Rich Shepard
rshepard@appl-ecosys.com

With no experience of storing binary data in a bytea column I don't know
when its use is appropriate. I suspect that for an application I'm
developing it would be better to store row-related documents outside the
database, and want to learn if that is the appropriate approach.

Consider an application that manages a fleet of vehicles. There's a
Vehicles table with information on each one (perhaps make, model, VIN, year
of purchase) and a Services table. There are many PDF documents associated
with each row in the tables: purchase contract, insurance form, service and
maintenance records, etc.

My thinking is to not store these documents in the database, but to store
them in subdirectories outside the database.

Your thoughts?

Rich

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Joshua D. Drake
jd@commandprompt.com
In reply to: Rich Shepard (#1)
Re: PDF files: to store in database or not

On 12/06/2016 10:30 AM, Rich Shepard wrote:

My thinking is to not store these documents in the database, but to store
them in subdirectories outside the database.

Your thoughts?

Due to the widely variable size of a PDF document, I would say no. I
would store the metadata and file location.

Sincerely,

JD

Rich

--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Rich Shepard
rshepard@appl-ecosys.com
In reply to: Joshua D. Drake (#2)
Re: PDF files: to store in database or not

On Tue, 6 Dec 2016, Joshua D. Drake wrote:

Due to the widely variable size of a PDF document, I would say no. I would
store the metadata and file location.

Joshua,

I read your answer as "don't store them in the database, but store the
location in a column."

Thanks for confirming,

Rich

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Moreno Andreo
moreno.andreo@evolu-s.it
In reply to: Rich Shepard (#3)
Re: PDF files: to store in database or not

... but what if database is replicated?

Thanks
Moreno.

Il 06/12/2016 19:50, Rich Shepard ha scritto:

On Tue, 6 Dec 2016, Joshua D. Drake wrote:

Due to the widely variable size of a PDF document, I would say no. I
would
store the metadata and file location.

Joshua,

I read your answer as "don't store them in the database, but store the
location in a column."

Thanks for confirming,

Rich

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Eric Schwarzenbach
subscriber@blackbrook.org
In reply to: Rich Shepard (#1)
Re: PDF files: to store in database or not

On 12/06/2016 01:30 PM, Rich Shepard wrote:

With no experience of storing binary data in a bytea column I don't
know
when its use is appropriate. I suspect that for an application I'm
developing it would be better to store row-related documents outside the
database, and want to learn if that is the appropriate approach.

Consider an application that manages a fleet of vehicles. There's a
Vehicles table with information on each one (perhaps make, model, VIN,
year
of purchase) and a Services table. There are many PDF documents
associated
with each row in the tables: purchase contract, insurance form,
service and
maintenance records, etc.

My thinking is to not store these documents in the database, but to
store
them in subdirectories outside the database.

Your thoughts?

Rich

I'd also be interested in answers to this that are give specific pros
and cons, and not in terms of "its better to do this than that."
What's "better" depends on how much you value the various pros and the
cons.

One of the pros of keeping them in the database is ease of protecting
adds and updates to the files and their related data with a transaction
and being able to have system where it iss pretty much impossible for
the documents to ever be out of sync with the related data.

I maintain some systems that do keep the documents outside of the
database, and the application code maintains the transactional integrity
of the files and data, and for the most part we don't have integrity
problems. In the worst of an add or update operation being interrupted
by a system crash or unexpected error, we have a new document saved but
the data about this document has not been written to the database and it
is as if that operation never happened. The file may really be there but
the system does not "know about it." This works even for updates because
our system versions documents and the old version is not written over,
there is simply a new version that the system never "knows" about.
Without versioning this would be more of a problem, and you would
probably need to protect yourself with code that does something like
temporarily keeping the last version of a file during an update and
switching over the metadata to reference the new document only at the
very last operation in the transaction.

We also have the potential of the database not matching the file store
when a system is migrated or "cloned." We are very careful about this,
but we've at least once had a case where a client's IT depart screwed it
up, and got a mismatched system to which they started writing new data.
Luckily this was a test or staging system and no production data was lost.

I've often wondered if we'd have been better off storing the files in
the database. This design decision was made some years ago, and our
concerns around this had to do with performance, but I don't know that
we had any real data that this should have been a concern, and I suspect
you could ameliorate if not eliminate this as an issue by careful
design. I'd loved to hear this idea confirmed or debunked by someone who
has more expertise (and ideally, done actual testing).

Cheers,

Eric

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Eric Schwarzenbach
subscriber@blackbrook.org
In reply to: Joshua D. Drake (#2)
Re: PDF files: to store in database or not

On 12/06/2016 01:34 PM, Joshua D. Drake wrote:

On 12/06/2016 10:30 AM, Rich Shepard wrote:

My thinking is to not store these documents in the database, but to
store
them in subdirectories outside the database.

Your thoughts?

Due to the widely variable size of a PDF document, I would say no. I
would store the metadata and file location.

Can you elaborate on this? Why is the variable size an issue? Are you
assuming the files go into the same table as the rest of the data? (They
certainly don't have to, and I would assume that not to be the smartest
design.)

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7David Wall
d.wall@computer.org
In reply to: Eric Schwarzenbach (#6)
Re: PDF files: to store in database or not

On 12/6/16 11:12 AM, Eric Schwarzenbach wrote:

On 12/06/2016 01:34 PM, Joshua D. Drake wrote:

On 12/06/2016 10:30 AM, Rich Shepard wrote:

My thinking is to not store these documents in the database, but
to store
them in subdirectories outside the database.

Your thoughts?

Due to the widely variable size of a PDF document, I would say no. I
would store the metadata and file location.

Can you elaborate on this? Why is the variable size an issue? Are you
assuming the files go into the same table as the rest of the data?
(They certainly don't have to, and I would assume that not to be the
smartest design.)

The advantages of storing in the database is that a DB backup will have
everything, instead of a DB backup and a file system backup. Using a
BLOB, you can certainly keep track of variable length PDFs.

Also, if in the database, it can be part of a transaction so you will
not have any issues keeping the DB and filesystem in sync.

David

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8John R Pierce
pierce@hogranch.com
In reply to: David Wall (#7)
Re: PDF files: to store in database or not

On 12/6/2016 11:21 AM, David Wall wrote:

The advantages of storing in the database is that a DB backup will
have everything, instead of a DB backup and a file system backup.
Using a BLOB, you can certainly keep track of variable length PDFs.

and one of the disadvantages of storing in the database is those db
backups become way huger if there's a lot of this file data.

--
john r pierce, recycling bits in santa cruz

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9Joshua D. Drake
jd@commandprompt.com
In reply to: Eric Schwarzenbach (#6)
Re: PDF files: to store in database or not

On 12/06/2016 11:12 AM, Eric Schwarzenbach wrote:

On 12/06/2016 01:34 PM, Joshua D. Drake wrote:

On 12/06/2016 10:30 AM, Rich Shepard wrote:

My thinking is to not store these documents in the database, but to
store
them in subdirectories outside the database.

Your thoughts?

Due to the widely variable size of a PDF document, I would say no. I
would store the metadata and file location.

Can you elaborate on this? Why is the variable size an issue?

Because it will use at least that size in memory to deliver the document
to you. Consider a 100MB PDF (not at all uncommon), now imagine 40
connections requesting that PDF.

Sincerely,

JD

--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10Joshua D. Drake
jd@commandprompt.com
In reply to: Moreno Andreo (#4)
Re: PDF files: to store in database or not

On 12/06/2016 11:09 AM, Moreno Andreo wrote:

... but what if database is replicated?

Use a network mounted filesystem (or replicated filesystem).

JD

--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11Joshua D. Drake
jd@commandprompt.com
In reply to: Rich Shepard (#3)
Re: PDF files: to store in database or not

On 12/06/2016 10:50 AM, Rich Shepard wrote:

On Tue, 6 Dec 2016, Joshua D. Drake wrote:

Due to the widely variable size of a PDF document, I would say no. I
would
store the metadata and file location.

Joshua,

I read your answer as "don't store them in the database, but store the
location in a column."

Correct.

Thanks for confirming,

Rich

--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#12Joshua D. Drake
jd@commandprompt.com
In reply to: David Wall (#7)
Re: PDF files: to store in database or not

On 12/06/2016 11:21 AM, David Wall wrote:

The advantages of storing in the database is that a DB backup will have
everything, instead of a DB backup and a file system backup. Using a
BLOB, you can certainly keep track of variable length PDFs.

This is true but also not necessarily an advantage. Your backups will be
larger, unless you are using logical backups and omitting certain tables.

Your resource utilization will be higher (memory, CPU etc...) to pull
and decode the binary.

JD

--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#13Rich Shepard
rshepard@appl-ecosys.com
In reply to: David Wall (#7)
Re: PDF files: to store in database or not

On Tue, 6 Dec 2016, David Wall wrote:

The advantages of storing in the database is that a DB backup will have
everything, instead of a DB backup and a file system backup. Using a BLOB,
you can certainly keep track of variable length PDFs.

David,

I did not realize that a BLOB is not the same as a bytea (page 217 of the
9.6 PDF manual), and I cannot find BLOB as a postgres data type. Please
point me in the right direction to learn how to store PDFs as BLOBs.

Thanks,

Rich

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#14Eric Schwarzenbach
subscriber@blackbrook.org
In reply to: Joshua D. Drake (#9)
Re: PDF files: to store in database or not

On 12/06/2016 02:40 PM, Joshua D. Drake wrote:

On 12/06/2016 11:12 AM, Eric Schwarzenbach wrote:

On 12/06/2016 01:34 PM, Joshua D. Drake wrote:

On 12/06/2016 10:30 AM, Rich Shepard wrote:

My thinking is to not store these documents in the database, but to
store
them in subdirectories outside the database.

Your thoughts?

Due to the widely variable size of a PDF document, I would say no. I
would store the metadata and file location.

Can you elaborate on this? Why is the variable size an issue?

Because it will use at least that size in memory to deliver the
document to you. Consider a 100MB PDF (not at all uncommon), now
imagine 40 connections requesting that PDF.

Are you sure the whole thing necessarily gets pulled into memory? JDBC
and ODBC support streaming on their BLOB interfaces and isn't the whole
point of this that an application can stream large files a chunk at a
time, the same way it would from the file system? Of course if the db
engine always pulls the whole thing into memory to work with it
regardless of the API, that's another thing, but that wouldn't seem like
a very good design, and I have more faith in the PostgreSQL developers
than that...but I'd certainly like to know for sure.

Cheers,

Eric

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#15John R Pierce
pierce@hogranch.com
In reply to: Rich Shepard (#13)
Re: PDF files: to store in database or not

On 12/6/2016 12:10 PM, Rich Shepard wrote:

I did not realize that a BLOB is not the same as a bytea (page 217
of the
9.6 PDF manual), and I cannot find BLOB as a postgres data type. Please
point me in the right direction to learn how to store PDFs as BLOBs.

indeed BYTEA is postgres's type for storing arbitrary binary objects
that are called BLOB in certain other databases.

--
john r pierce, recycling bits in santa cruz

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: John R Pierce (#15)
Re: PDF files: to store in database or not

John R Pierce <pierce@hogranch.com> writes:

On 12/6/2016 12:10 PM, Rich Shepard wrote:

I did not realize that a BLOB is not the same as a bytea (page 217
of the
9.6 PDF manual), and I cannot find BLOB as a postgres data type. Please
point me in the right direction to learn how to store PDFs as BLOBs.

indeed BYTEA is postgres's type for storing arbitrary binary objects
that are called BLOB in certain other databases.

Well, there are also "large objects", which aren't really a data type at
all. If you're storing stuff large enough that you need to write/read
it in chunks rather than all at once, the large-object APIs are what
you want.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#17Rich Shepard
rshepard@appl-ecosys.com
In reply to: John R Pierce (#15)
Re: PDF files: to store in database or not

On Tue, 6 Dec 2016, John R Pierce wrote:

indeed BYTEA is postgres's type for storing arbitrary binary objects that
are called BLOB in certain other databases.

John,

I thought so.

Thanks,

Rich

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#18David Wall
d.wall@computer.org
In reply to: Tom Lane (#16)
Re: PDF files: to store in database or not

On 12/6/16 12:33 PM, Tom Lane wrote:

John R Pierce <pierce@hogranch.com> writes:

On 12/6/2016 12:10 PM, Rich Shepard wrote:

I did not realize that a BLOB is not the same as a bytea (page 217
of the
9.6 PDF manual), and I cannot find BLOB as a postgres data type. Please
point me in the right direction to learn how to store PDFs as BLOBs.

indeed BYTEA is postgres's type for storing arbitrary binary objects
that are called BLOB in certain other databases.

Well, there are also "large objects", which aren't really a data type at
all. If you're storing stuff large enough that you need to write/read
it in chunks rather than all at once, the large-object APIs are what
you want.

regards, tom lane

Yeah, we've not used much BYTEA, but use PG's large objects. It also
has a streaming API and you don't have to encode/decode every byte going
in and out of the DB.

In a table, you juse define the "blob_data" column as an OID. Since we
use Java/JDBC, this is handled by ResultSet.getBlob() for a
java.sql.Blob object.

Some complain about DB backups being biggers if the PDFs are inside,
which is true, but this only presumes you don't care about the
filesystem PDFs being backed up separately (and no way to ensure a
reliable DB backup and PDF filesystem backup if the system is active
when doing the backups). You can certainly put the files in a
filesystem and point to them, but you'll likely need some access control
or people will be able to download any/all PDFs in a given folder. In
the DB, you surely will have access control as I presume you don't allow
browser access to the DB <smile>.

Either way, you may want to see if your PDFs compress well or not as
that may save some storage space at the cost of compress/decompress on
accesses.

David

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#19John R Pierce
pierce@hogranch.com
In reply to: David Wall (#18)
Re: PDF files: to store in database or not

On 12/6/2016 1:02 PM, David Wall wrote:

You can certainly put the files in a filesystem and point to them, but
you'll likely need some access control or people will be able to
download any/all PDFs in a given folder. In the DB, you surely will
have access control as I presume you don't allow browser access to the
DB <smile>.

thats easily remedied by using storage accessed via https or whatever
with directory listing disabled, and hashed names. Or, only allow your
app server direct access to the file system if its via NFS or whatever.

Either way, you may want to see if your PDFs compress well or not as
that may save some storage space at the cost of compress/decompress on
accesses.

pretty sure most all PDF's are already compressed, and won't compress
much further.

--
john r pierce, recycling bits in santa cruz

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#20John DeSoi
desoi@pgedit.com
In reply to: Eric Schwarzenbach (#5)
Re: PDF files: to store in database or not

On Dec 6, 2016, at 1:09 PM, Eric Schwarzenbach <subscriber@blackbrook.org> wrote:

I've often wondered if we'd have been better off storing the files in the database. This design decision was made some years ago, and our concerns around this had to do with performance, but I don't know that we had any real data that this should have been a concern, and I suspect you could ameliorate if not eliminate this as an issue by careful design. I'd loved to hear this idea confirmed or debunked by someone who has more expertise (and ideally, done actual testing).

I have been storing PDFs in Postgres for several years without any problems. Documents range in size from a few pages to 100+ pages. I'm using a bytea column, not large objects. I store the documents in a separate database from the rest of the application data in order to make it easy to exclude in database dumps or backup in some other way. I'm currently managing about 600,000 documents.

I created some functions that enable a subset of the document database to be synchronized elsewhere. For example, we need to keep only the last 3 years of documents on a website for user access. Using Postgres has made this easy to manage and verify. And with replication we automatically have the document database available on the backup web server without additional effort.

John DeSoi, Ph.D.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#21Rich Shepard
rshepard@appl-ecosys.com
In reply to: John DeSoi (#20)
#22Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Rich Shepard (#21)
#23Chris Travers
chris.travers@gmail.com
In reply to: Rich Shepard (#21)
#24Rich Shepard
rshepard@appl-ecosys.com
In reply to: Adrian Klaver (#22)
#25John DeSoi
desoi@pgedit.com
In reply to: Chris Travers (#23)
#26Rich Shepard
rshepard@appl-ecosys.com
In reply to: Chris Travers (#23)