Where to store Blobs?
Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
Now I realized: Nobody talked about Blobs.
I guess most people do not store Blobs in PostgreSQL.
Where do you store Blobs?
(In my case, Blobs are PDF/image files with sizes up to 20 MByte.
I am not talking about very big blobs of several hundred MByte.)
--
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines
On 3/13/19 7:28 AM, Thomas Güttler wrote:
Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing
list. Now I realized: Nobody talked about Blobs.
I guess most people do not store Blobs in PostgreSQL.
Where do you store Blobs?
Probably the preferred method:
https://www.postgresql.org/docs/11/datatype-binary.html
Another method:
https://www.postgresql.org/docs/11/largeobjects.html
(In my case, Blobs are PDF/image files with sizes up to 20 MByte.
I am not talking about very big blobs of several hundred MByte.)
--
Adrian Klaver
adrian.klaver@aklaver.com
Thomas Güttler wrote:
Now I realized: Nobody talked about Blobs.
I guess most people do not store Blobs in PostgreSQL.
Where do you store Blobs?
(In my case, Blobs are PDF/image files with sizes up to 20 MByte.
I am not talking about very big blobs of several hundred MByte.)
Ideally outside the database, if they are many.
Large databases are harder to back up than large file systems.
If you keep 20MB binaries in the database, you'd use the "bytea" data type.
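That approach can be sketched as follows; the `docs` table and the `store_blob` helper are illustrative, not from this thread. It is written against the generic DB-API so the parameter placeholder can vary by driver (psycopg2 uses `%s`, sqlite3 uses `?`):

```python
def store_blob(conn, name, payload, placeholder="%s"):
    # Insert one binary payload into a hypothetical docs(name, data) table
    # through any DB-API 2.0 connection; with PostgreSQL/psycopg2 the data
    # column would be bytea, and Python bytes are adapted to it directly.
    sql = "INSERT INTO docs (name, data) VALUES ({p}, {p})".format(p=placeholder)
    with conn:  # commit on success, roll back on error
        conn.cursor().execute(sql, (name, payload))
```

A 20 MB file is well within what bytea handles; the payload is simply passed as bytes.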
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
On 3/13/19 9:28 AM, Thomas Güttler wrote:
Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing
list. Now I realized: Nobody talked about Blobs.
I guess most people do not store Blobs in PostgreSQL.
Where do you store Blobs?
(In my case, Blobs are PDF/image files with sizes up to 20 MByte.
I am not talking about very big blobs of several hundred MByte.)
We store PDFs and TIFFs of dozens to a few hundred KB in the database as bytea.
--
Angular momentum makes the world go 'round.
I store them as bytea in the database despite the fact that there are
benefits to storing them in the file system. The reason is that it is
easier to secure access to the database alone than to secure both the
database and access to the file system.
Chuck Martin
Avondale Software
On Wed, Mar 13, 2019 at 10:34 AM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:
On Wed, 13 Mar 2019 at 10:27, Thomas Güttler
<guettliml@thomas-guettler.de> wrote:
I guess most people do not store Blobs in PostgreSQL.
Where do you store Blobs?
Things have changed, but at one time, we were using RT as our
ticketing system (https://bestpractical.com/request-tracker) and it
would capture documents as database objects.
The table where RT stowed downloadable documents was one of the
largest tables in the database because it held a few 50 MB copies
of Access databases and some multi-MB spreadsheets.
It worked fine; no problems evident from it. It was certainly
surprising to find such large documents there, and if people had
gotten in the habit of putting GBs of data into RT, that would have
probably led to some policy changes to prevent it, but unless you're
pretty actively trying to blow the system up, it just works.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
On Wed, Mar 13, 2019 at 11:50:37AM -0400, Christopher Browne wrote:
I guess most people do not store Blobs in PostgreSQL.
- bytea puts practical limits on size
- LO storage happens inside a system (!) table (pg_largeobject)
Nowadays, there are Foreign Data Wrappers which might
encapsulate files as if they lived inside the database.
Also, a combination of
COPY ... TO ... (FORMAT binary)
pg_read_binary_file()
and suitable plpgsql security definer functions might provide
a Poor Man's integrated external storage for binary files.
Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B
On 13/3/19 4:28 PM, Thomas Güttler wrote:
Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
Where do you store Blobs?
(In my case, Blobs are PDF/image files with sizes up to 20 MByte.
I am not talking about very big blobs of several hundred MByte.)
bytea. Keeping those in the file system and trying to keep the file system
in sync with the db is a PITA. I also don't know what happens in the
transactional department (ACID) with lo_* large objects. In most cases
bytea is just fine.
--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt
Hi,
On 13.03.19 at 15:28, Thomas Güttler wrote:
Where do you store Blobs?
Within PostgreSQL, of course. The system I have in mind stores ZIP and
PDF files, usually a few MBs each; we're currently at a total of about
100 GB and there are no evident problems.
For this application, it is extremely important that saving/modifying
the binaries is part of a larger transaction that completes/fails
atomically.
Replication is almost real-time and access to the binaries is provided
with the same libraries (JDBC/Hibernate) as everything else.
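That atomicity can be illustrated with a short sketch, using the stdlib sqlite3 module as a stand-in (the system above uses PostgreSQL via JDBC/Hibernate, and the `records`/`files` tables are hypothetical): the metadata row and the binary commit together, or not at all.

```python
import sqlite3

def save_record_with_file(conn, title, payload):
    # One transaction for both statements: if the blob insert fails,
    # the metadata row is rolled back as well, so the database never
    # references a binary that was not saved.
    with conn:
        cur = conn.cursor()
        cur.execute("INSERT INTO records (title) VALUES (?)", (title,))
        cur.execute("INSERT INTO files (record_id, body) VALUES (?, ?)",
                    (cur.lastrowid, payload))
```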
Best regards,
-hannes
It makes dump/restore of the database unnecessarily expensive in terms of
time and space, imho.
I store large models in the database because I need to have historical
data to compare to. That said, I could probably also automate a git repo,
but that would be just that much more work, and git with binary files really
doesn't make sense. Storage is really cheap, and I assume the database
stores bytea types like it does text, in a separate, partitioned
section of the database.
Thanks,
~Ben
On Wed, Mar 13, 2019 at 1:37 PM Jamesie Pic <jpic@yourlabs.org> wrote:
I don't think there is a suitable 'one size fits all' answer to this
question. A lot will depend on how you intend to use the blobs and what
sort of hardware architecture, especially storage systems, you have.
At first glance, sticking everything in the DB seems like an easy
choice. However, that can result in very large databases, which in turn
can lead to issues with respect to backup, replication etc. If all you're
after is storage, then sometimes you're better off using the file system
for the blobs and keeping the metadata in the db. It can potentially be
faster and easier to serve up blobs from the file system compared to the
db if that is the main use case, but if the blobs are more dynamic or
you use collections of blobs to build a master blob etc, the db has some
advantages.
If you really need database-like functionality, given the relative
cheapness of storage and the wealth of options available, storing the
blobs in the database can have advantages. However, it will be important
to select the most appropriate datatype. What some people think of as a
'blob' is just an array of bytes to many DBs, and as usual, you need to
decide what the best storage representation is for your requirements.
Keep in mind that a more general 'blob'-like storage type often trades
functionality for flexibility: it accepts almost anything, whereas a
more precise data type is more restrictive about what can be inserted,
but offers more functionality regarding what you can do with it (at the
db level).
Tim
Thomas Güttler <guettliml@thomas-guettler.de> writes:
Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
Now I realized: Nobody talked about Blobs.
I guess most people do not store Blobs in PostgreSQL.
Where do you store Blobs?
(In my case, Blobs are PDF/image files with sizes up to 20 MByte.
I am not talking about very big blobs of several hundred MByte.)
--
Tim Cross
I am curious. Why did nobody say:
store blobs in a storage server (like s3) and only store the blob-id in PostgreSQL?
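For what it's worth, such a setup might look like the sketch below; `BlobStore` is a hypothetical in-memory stand-in for a real object-store client (boto3's put_object/get_object for S3, for example), and only the returned key would be kept in an ordinary PostgreSQL text column:

```python
import hashlib

class BlobStore:
    # Stand-in for an object store such as S3: maps a key to raw bytes.
    def __init__(self):
        self._objects = {}

    def put(self, payload):
        # Content-addressed key: identical payloads share one object,
        # and the key doubles as an integrity checksum.
        key = hashlib.sha256(payload).hexdigest()
        self._objects[key] = payload
        return key  # this is the "blob-id" to store in PostgreSQL

    def get(self, key):
        return self._objects[key]
```

The obvious caveat: the object store cannot take part in the database transaction, so the two can drift out of sync.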
Regards,
Thomas
On 13.03.19 at 15:28, Thomas Güttler wrote:
Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
Now I realized: Nobody talked about Blobs.
I guess most people do not store Blobs in PostgreSQL.
Where do you store Blobs?
(In my case, Blobs are PDF/image files with sizes up to 20 MByte.
I am not talking about very big blobs of several hundred MByte.)
--
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines
On Mon, Mar 18, 2019 at 11:42:00AM +0100, Thomas Güttler wrote:
I am curious. Why did nobody say:
store blobs in a storage server (like s3) and only store the blob-id in PostgreSQL?
That's been rehashed to the point of becoming a FAQ:
https://wiki.postgresql.org/wiki/BinaryFilesInDB
Karsten Hilbert
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B
Storing files in the database has already been demonstrated to be an
inefficient trade-off [0]. As such, this post provides a subjective story
about files in the database, rather than an objective comparison.
So, I have this ridiculous situation: a database holding 12k
user-submitted forms with files, made by 3000 users, which I hope we can
agree “is absolutely nothing”, yet it eats 35 GB of PG data.
As a result, dumping the database already takes 32 minutes and is extremely
write-intensive, causing disk backlogs of more than 5 seconds during the
operation.
If I had stored file paths, like I should have, the db would take a few MBs
and the dump a few seconds, maybe even less than one second.
Also, the backup could just read the file tree from the file system and
synchronize only new files: for legal reasons, we never modify files
programmatically after they are written, which makes it even more
ridiculous to dump them every time we do a backup.
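Under that "written once, never modified" constraint, the synchronization could be as small as this sketch (the helper name is mine; a real setup would likely just use rsync):

```python
import shutil
from pathlib import Path

def sync_new_files(src, dst):
    # Copy only files that do not yet exist in the backup tree; safe
    # because files are never modified after being written.
    copied = []
    for f in sorted(Path(src).rglob("*")):
        if not f.is_file():
            continue
        rel = f.relative_to(src)
        target = Path(dst) / rel
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)
            copied.append(str(rel))
    return copied
```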
As such, my biggest regret is having listened to the manager who imposed
this design decision, but I was a fresh hire at the time and trusted him.
Today, we have the choice of two poisons for deployment:
- not shutting down the service during the backup, saving 32 minutes of
downtime, but that's 32 minutes of writes at risk in the case of a
faulty destructive operation (yes, that doesn't happen 99% of the time);
- shutting down the service during the backup, as we should, but that
means 32 minutes of extra downtime.
In my experience, storing files in the database is a critical mistake. I
mean, maybe if you know what you're doing, and I thought the said manager
knew what he was doing.
But at the end of a year the manager decided to ignore all monitoring
alarms concerning disk space that were raised during backups. As such, I
kept freeing as much space as possible whenever a backup failed to
happen, which fortunately canceled the full deployment and left the
service online, even though with a full disk.
I raised the warning to the customer for months, but the manager
kept insisting that we close our eyes to it, and kept on applying
palliative fixes when needed.
Of course, piling up palliative fixes in production eventually created the
situation where the disk was too full to make a backup. The manager who
had installed the server OS had put 3 disks in a RAID1 array with extremely
poor partitioning.
As such, I had to spend a night repartitioning the RAID1, so that the /
would be on 10G instead of 40G. That kept us going a bit longer, but
unfortunately one week less than expected, because I had forgotten to
account for exponential growth in the estimate.
Leading to even more incidents. If you store files in the database, which
you shouldn't unless you're 100% sure about what you're doing, then do not
ignore disk space warnings during backups. Or else ... well, what do you
think happens when an airplane pilot ignores the alarms on their dashboard?
99% of incidents are a sequence of predictable events.
https://habiletechnologies.com/blog/better-saving-files-database-file-system/
https://softwareengineering.stackexchange.com/questions/150669/is-it-a-bad-practice-to-store-large-files-10-mb-in-a-database
https://blog.yourlabs.org/post/184290880553/story-of-a-database-with-files-in-it
I forgot to mention that my deployments include automated migrations as
often as possible, sometimes destructive for refactoring purposes. As such,
to maintain PostgreSQL on a basic Linux box I am:
- for having an automated backup step in the automated deployment script
before it plays destructive migrations,
- against the needless overhead of coupling binary and relational data
in operations, which slows the whole thing down or makes it less reliable.
I also have supposedly many new points against, mixed with more detail on
the points briefly exposed in my previous email, going deeper into how this
fits the big picture of my personal practice ... and how it has
destabilized my prod for months:
https://blog.yourlabs.org/post/184290880553/storing-hd-photos-in-a-relational-database-recipe
tl;dr
If you store media files in PostgreSQL on a production server, then do take
disk space alarms seriously even if they happen only during backups.
Otherwise I fail to see how to avoid a pattern of recurring incidents:
"manually unblocking automated deployments" (a double interest cost on the
debt, because it also defeats the purpose of automating deployment), when
not filling up a disk during the nightly backup dump ...
Hope this helps,
Have a great day
Good question, and there are some excellent thoughts and cautionary tales
in the thread already. I've faced this question several times down the
years and, as others have said, the best answer depends. Roughly speaking,
I can think of three obvious places to store documents:
* A database.
* A file system.
* A commodity file-system-like cloud service, such as S3, B2 Cloud, Azure,
or Google Cloud.
There are pros and cons to each, and circumstances that will make one
option or another untenable. It also depends a lot on your overall
architecture, if documents can change, how your searches work, your
security/regulatory requirements, the overall workflow, etc.
But given all of that, I *hate* bloating a database with BLOBs that aren't
themselves searchable. That's what file systems are for. This past week,
I've been coding against the B2 Cloud API and it's quite nice...simpler
than S3, and much cheaper. In this setup, we keep a document catalog in the
database that ties together a source record and an external document. The
"cloud_doc" record contains identifiers and information from both parts of
this story. So, descriptors, tags and IDs needed when you're looking for
something. In our case, that's large XML log files. These logs are either
of zero value or great value. We need to keep them, but 99.99999% of the
time they're pure bloat. We've got a record about the event the log
details. The cloud_doc record includes a reference back to the original,
and to the storage location of the full log. So, the service type, bucket
ID, document ID, etc. So the catalog record links our real data with the
remote document, same as you would storing file paths to a local file tree.
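As a sketch, such a catalog record might look like the following; every field name here is illustrative, not the actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class CloudDoc:
    # The database-side catalog row tying a source record to a document
    # stored on a cloud service; mirrors the "cloud_doc" idea above.
    catalog_id: int     # primary key of this catalog record
    source_id: int      # the record (e.g. an event) the log belongs to
    service: str        # e.g. "b2" or "s3"
    bucket_id: str
    document_id: str
    tags: dict = field(default_factory=dict)  # searchable descriptors

    def remote_metadata(self):
        # Back-references pushed to the cloud object as "tags", so the
        # parent of each external document can be recalculated if the
        # catalog and the remote store ever fall out of sync.
        return {"catalog_id": str(self.catalog_id),
                "source_id": str(self.source_id),
                **self.tags}
```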
Storing the data externally has some risks and limitations that make it
unsuitable in some situations. For example, you can't include the file or
cloud system in a transaction in any normal sense. Also, depending on
platform, there may be an unacceptable lag between pushing a file up and it
becoming visible to other systems. But for common cases, external (file or
cloud) storage can be pretty great.
Likewise, it's kind of scary to have the files someplace else where Bad
Things Might Happen and your database and file tree are out of sync. That's
genuinely worrisome. It's likely overkill, but I like to store references
back into the database in the meta-data for the document up on the cloud.
On S3 and B2, you store meta-data as "tags". I haven't used the document
storage systems on Azure or Google, but they likely have similar
functionality. A "tag" up on the cloud repository is a name-value-pair that
you can add custom data to. So, for example, I'll store the ID of our cloud
document catalog record in the meta-data (tags) for the document on the
cloud service. I also stash the ID of the source record that the log
relates to. If all goes well, we'll never need these tags but if worst came
to worst, they provide a way of calculating the parents of each external
document.
That last paragraph about tags points out something important: Cloud
document storage can be preferable to a standard file system for reasons
other than simplified provisioning and cost. Many file systems don't let
you readily add and search meta-data like this, whereas it's a quite
accessible feature of cloud storage systems.
Going in another direction entirely, there may be times when what you need
is a local file storage system for documents. File systems are...scary, but
what about SQLite? You have a single file with a modern SQL syntax where
you can stuff BLOBs. You don't bloat your Postgres data file, but still
have a database for the documents instead of a file system. SQLite only
looks to solve problems in a pretty specific set of cases but, as it turns
out, those cases are also quite common. Combining it with a Postgres setup
seems pretty exotic, but there might be a time. It depends on your setup.
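A minimal sketch of that idea with Python's stdlib sqlite3 module (table and helper names are mine, not from the thread):

```python
import sqlite3

def open_doc_store(path=":memory:"):
    # One ordinary file (or :memory:), SQL access, and the BLOBs stay
    # out of the Postgres data files entirely.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS documents ("
                 "id INTEGER PRIMARY KEY, name TEXT UNIQUE, body BLOB NOT NULL)")
    return conn

def put_doc(conn, name, body):
    with conn:  # transactional, even for the document store
        conn.execute("INSERT OR REPLACE INTO documents (name, body) VALUES (?, ?)",
                     (name, body))

def get_doc(conn, name):
    row = conn.execute("SELECT body FROM documents WHERE name = ?",
                       (name,)).fetchone()
    return row[0] if row else None
```

Postgres keeps the searchable metadata; the documents live in one SQLite file you can copy or ship independently.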
On Sat, Apr 20, 2019 at 10:59 AM Jamesie Pic <jpic@yourlabs.org> wrote:
Thanks for your feedback. In my case, Bob the manager said this would "make
backups easier" xD
The general pro that I see is "transactional". While I can understand that,
it's irrelevant in our case: users upload files through AJAX, which happens
**before** they submit the form. That means the file gets saved in an HTTP
exchange that happens **before** the actual data-insertion transaction.
As such, be careful that this argument is not irrelevant in your
case, as it is in mine.