Best Strategy for Large Number of Images

Started by Estevan Rechover 4 years ago25 messagesgeneral

softrech@gmail.com

over 4 years ago

Hi,

I don't currently use PostgreSQL, but I plan to migrate and I have a
question about the best way/strategy for storing images. I have about 2
million images with a growth trend of around 1 million images per year. I
plan to store in bytea format in an isolated table. Is this recommended? Is
there another better way?

Rob Sargent

robjsargent@gmail.com

over 4 years ago

In reply to: Estevan Rech (#1)

Re: Best Strategy for Large Number of Images

On 12/15/21 10:45, Estevan Rech wrote:

Hi,

I don't currently use PostgreSQL, but I plan to migrate and I have a
question about the best way/strategy for storing images. I have about
2 million images with a growth trend of around 1 million images per
year. I plan to store in bytea format in an isolated table. Is this
recommended? Is there another better way?

How big are the images on average?

Estevan Rech

softrech@gmail.com

over 4 years ago

In reply to: Rob Sargent (#2)

Re: Best Strategy for Large Number of Images

350kb

Em qua., 15 de dez. de 2021 às 14:50, Rob Sargent <robjsargent@gmail.com>
escreveu:

Show quoted text

On 12/15/21 10:45, Estevan Rech wrote:

Hi,

I don't currently use PostgreSQL, but I plan to migrate and I have a
question about the best way/strategy for storing images. I have about 2
million images with a growth trend of around 1 million images per year. I
plan to store in bytea format in an isolated table. Is this recommended? Is
there another better way?

How big are the images on average?

David G. Johnston

david.g.johnston@gmail.com

over 4 years ago

In reply to: Estevan Rech (#3)

Re: Best Strategy for Large Number of Images

On Wed, Dec 15, 2021 at 10:57 AM Estevan Rech <softrech@gmail.com> wrote:

350kb

Em qua., 15 de dez. de 2021 às 14:50, Rob Sargent <robjsargent@gmail.com>
escreveu:

On 12/15/21 10:45, Estevan Rech wrote:

Hi,

I don't currently use PostgreSQL, but I plan to migrate and I have a
question about the best way/strategy for storing images. I have about 2
million images with a growth trend of around 1 million images per year. I
plan to store in bytea format in an isolated table. Is this recommended? Is
there another better way?

How big are the images on average?

Please don't top-post - inline or bottom posting (with trimming) is the
custom here.

It does depend (both on the size and usage patterns).

An alternative way is to simply not store images within the database
itself. Store a pointer, and then store the image externally. The
application then gets the pointer from the database and uses it to retrieve
the image from the image server.

David J.

Estevan Rech

softrech@gmail.com

over 4 years ago

In reply to: David G. Johnston (#4)

Re: Best Strategy for Large Number of Images

I think about using it inside the database to facilitate the select and
generation of reports...

Show quoted text

Adrian Klaver

adrian.klaver@aklaver.com

over 4 years ago

In reply to: Estevan Rech (#5)

Re: Best Strategy for Large Number of Images

On 12/15/21 10:17 AM, Estevan Rech wrote:

I think about using it inside the database to facilitate the select and
generation of reports...

As long as the meta data about the images is stored in the database that
is still possible.

--
Adrian Klaver
adrian.klaver@aklaver.com

Estevan Rech

softrech@gmail.com

over 4 years ago

In reply to: Adrian Klaver (#6)

Re: Best Strategy for Large Number of Images

But PostgreSQL can handle it? Does it have good search performance? and is
memory usage feasible?

Adrian Klaver

adrian.klaver@aklaver.com

over 4 years ago

In reply to: Estevan Rech (#7)

Re: Best Strategy for Large Number of Images

On 12/15/21 10:48 AM, Estevan Rech wrote:

But PostgreSQL can handle it? Does it have good search performance? and
is memory usage feasible?

Search on what?

--
Adrian Klaver
adrian.klaver@aklaver.com

David G. Johnston

david.g.johnston@gmail.com

over 4 years ago

In reply to: Estevan Rech (#7)

Re: Best Strategy for Large Number of Images

On Wednesday, December 15, 2021, Estevan Rech <softrech@gmail.com> wrote:

But PostgreSQL can handle it? Does it have good search performance? and is
memory usage feasible?

Images of that size won’t be stored on the main table so performance when
not asking for image data should be normal. Memory is one of those usage
patterns and hardware “it depends” things. Probably it will be OK.

Personally, it is easier, and less complex, than the alternative. Do it
and migrate later if issues arise.

David J.

#10

Estevan Rech

softrech@gmail.com

over 4 years ago

In reply to: David G. Johnston (#9)

Re: Best Strategy for Large Number of Images

The possibilities are known, but does anyone have experience with this
scenario?

#11

Ron

ronljohnsonjr@gmail.com

over 4 years ago

In reply to: Estevan Rech (#1)

Re: Best Strategy for Large Number of Images

On 12/15/21 11:45 AM, Estevan Rech wrote:

Hi,

I don't currently use PostgreSQL, but I plan to migrate and I have a
question about the best way/strategy for storing images. I have about 2
million images with a growth trend of around 1 million images per year. I
plan to store in bytea format in an isolated table. Is this recommended?
Is there another better way?

We have an application which store images in the database using bytea
fields. The next version of the software will store the images externally,
with the filename in the database, because filesystems just inherently do a
better job at storing *lots* of files than does a database.

--
Angular momentum makes the world go 'round.

#12

Rob Sargent

robjsargent@gmail.com

over 4 years ago

In reply to: Estevan Rech (#10)

Re: Best Strategy for Large Number of Images

On 12/15/21 12:22, Estevan Rech wrote:

The possibilities are known, but does anyone have experience with this
scenario?

My previous employer dealt with radiology images. These were in the
file system with an image server as suggested up thread.

#13

Adrian Klaver

adrian.klaver@aklaver.com

over 4 years ago

In reply to: Estevan Rech (#10)

Re: Best Strategy for Large Number of Images

On 12/15/21 11:22, Estevan Rech wrote:

The possibilities are known, but does anyone have experience with this
scenario?

It would help if you elaborated on the scenerio is?

There is no questioning that you can store images in the database. The
issues are what you want to do with them once they are there, the
hardware(real or virtual) you have available, the user load on the
database, etc?

Providing at least an outline that answers the above would go a long way
to getting some reasonable answers based on experience.

--
Adrian Klaver
adrian.klaver@aklaver.com

#14

Estevan Rech

softrech@gmail.com

over 4 years ago

In reply to: Adrian Klaver (#13)

Re: Best Strategy for Large Number of Images

Adrian,

I have an application that takes pictures of items and uses them as
evidence in a report.

These images are saved and then used to generate a report.

Each item averages 10 photos and I have about 2 million photos currently,
with an average growth of 1 million photos over the next year.

I think about using it within the database because of the ease of selecting
records with the content of the images (when necessary).

I think my biggest concern is to continue to use this strategy for the next
2 years and encounter a PostgreSQL limitation or some failure and have to
migrate database again or have significant loss of images. And on the
contrary too, if I use it on disk and in 2 years I find some failure or
file limit in the operating system (currently it's windows server 2016)...

Related to this is the backup service that seems to me that in a disaster,
I can restore the database relatively quickly if it's in the database. On
the disk, I believe that the restoration is much slower and I don't think
it's reliable that all the images are copied/restored.

#15

Gavan Schneider

list.pg.gavan@pendari.org

over 4 years ago

In reply to: Estevan Rech (#14)

Re: Best Strategy for Large Number of Images

On 16 Dec 2021, at 7:12, Estevan Rech wrote:

I have an application that takes pictures of items and uses them as
evidence in a report.

These images are saved and then used to generate a report.

Once you start talking reports and evidence I think audit trail and verification a.k.a. “chain of evidence”.
With the images in the database they can be part of your information ecosystem, get a datestamp along with checksum and all this included in the audit trail.
All this can be done with the images in the file system and the database holds the pointers only… it just may not be as easy to convince the oppositions lawyers that it is tamper proof, except the best you can really get is tamper evident. 😉

Each item averages 10 photos and I have about 2 million photos currently,
with an average growth of 1 million photos over the next year.

I think about using it within the database because of the ease of selecting
records with the content of the images (when necessary).

That will be the same however you go. The selection will be on the context data you enter with the images. There has been no mention of searching within the image itself, and this is not something PG can do “out of the box” anyway.

I think my biggest concern is to continue to use this strategy for the next
2 years and encounter a PostgreSQL limitation or some failure and have to
migrate database again or have significant loss of images. And on the
contrary too, if I use it on disk and in 2 years I find some failure or
file limit in the operating system (currently it's windows server 2016)...

The total image data will end up somewhere less than 4Tb if your estimates continue to hold. That’s not going to stress PG. It’s not such a big deal on a modern home computer either, but I can’t make any informed comment on Windows anything. Realistically, once the data is in the database the whole can be moved to a more capable machine and/or OS if problems start to appear. PG will work wherever you deploy it.

Related to this is the backup service that seems to me that in a disaster,
I can restore the database relatively quickly if it's in the database. On
the disk, I believe that the restoration is much slower and I don't think
it's reliable that all the images are copied/restored.

If you have a reliable tested backup system for the database and don’t want to add to the workload then get the images into the database. If there are reasons why you end up with the images outside the database then, of course, the backup process has to be able to keep everything together and guarantee a full restore. Doable as well, but it does add some complexity.

If I am reading this correctly the use case is one of receiving data in the form of photos and field reports, entering the report items associated with the photos into the information system, generating a report (with embedded photos), and keeping records so disputes can be handled. This is not a high volume access to the images so there is no real need for optimum filesystem speed to serve the images… keeping them in the database as bytea is perfectly workable and will work for data volumes well beyond the current estimates.

Gavan Schneider
——
Gavan Schneider, Sodwalls, NSW, Australia
Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.
— H. L. Mencken, 1920

#16

Marc Millas

marc.millas@mokadb.com

over 4 years ago

In reply to: Estevan Rech (#14)

Re: Best Strategy for Large Number of Images

Hi,
when I am doing this I store the data in one postgres cluster, with some
kind of id for each image , and the images in another with the id as link.
The customer app is written so that it issues a dedicated http request for
each image. (I use nginx to create a dedicated "path".
*pro*: the images dont over charge the shared buffer cache of the data db.,
its easier to cache the images if its relevant perf wise. backups are much
easier on the data db as its much smaller, .very easy to scale
horizontaly by adding dedicated images db, .etc...
*con*: as with all 2 storage solution, even the disk option, you
"may" encounter phase pb ie. id existing in one db and not in the other.
but this depends on how you maintain the images.

my 2 cents

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

On Wed, Dec 15, 2021 at 9:12 PM Estevan Rech <softrech@gmail.com> wrote:

Show quoted text

Adrian,

I have an application that takes pictures of items and uses them as
evidence in a report.

These images are saved and then used to generate a report.

Each item averages 10 photos and I have about 2 million photos currently,
with an average growth of 1 million photos over the next year.

I think about using it within the database because of the ease of
selecting records with the content of the images (when necessary).

I think my biggest concern is to continue to use this strategy for the
next 2 years and encounter a PostgreSQL limitation or some failure and have
to migrate database again or have significant loss of images. And on the
contrary too, if I use it on disk and in 2 years I find some failure or
file limit in the operating system (currently it's windows server 2016)...

Related to this is the backup service that seems to me that in a disaster,
I can restore the database relatively quickly if it's in the database. On
the disk, I believe that the restoration is much slower and I don't think
it's reliable that all the images are copied/restored.

#17

Bruce Momjian

bruce@momjian.us

over 4 years ago

In reply to: Estevan Rech (#1)

Re: Best Strategy for Large Number of Images

On Wed, Dec 15, 2021 at 02:45:15PM -0300, Estevan Rech wrote:

Hi,

I don't currently use PostgreSQL, but I plan to migrate and I have a question
about the best way/strategy for storing images. I have about 2 million images
with a growth trend of around 1 million images per year. I plan to store in
bytea format in an isolated table. Is this recommended? Is there another better
way?

You might want to read this:

https://momjian.us/main/blogs/pgblog/2020.html#June_19_2020

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#18

Benedict Holland

benedict.m.holland@gmail.com

over 4 years ago

In reply to: Bruce Momjian (#17)

Re: Best Strategy for Large Number of Images

I would recommend storing a link to the file rather than the file itself.
Other than that, use BLOBS. I would probably recommend not storing any
binary objects in a database for a variety of reasons but if you have to
then bytea will work, assuming they are smaller than 2gb.

Thanks,
- Ben

On Wed, Dec 15, 2021, 4:41 PM Bruce Momjian <bruce@momjian.us> wrote:

Show quoted text

On Wed, Dec 15, 2021 at 02:45:15PM -0300, Estevan Rech wrote:

Hi,

I don't currently use PostgreSQL, but I plan to migrate and I have a

question

about the best way/strategy for storing images. I have about 2 million

images

with a growth trend of around 1 million images per year. I plan to store

in

bytea format in an isolated table. Is this recommended? Is there another

better

way?

You might want to read this:

https://momjian.us/main/blogs/pgblog/2020.html#June_19_2020

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#19

Bèrto ëd Sèra

berto.d.sera@gmail.com

over 4 years ago

In reply to: Benedict Holland (#18)

Re: Best Strategy for Large Number of Images

I don't currently use PostgreSQL, but I plan to migrate and I have a

question

about the best way/strategy for storing images. I have about 2 million

images

with a growth trend of around 1 million images per year. I plan to store

in bytea

format in an isolated table. Is this recommended? Is there another better

way?

Do you need to edit those images? Or do you need to make sure they are not
edited, and if they are, you want to know who did it?

If so, I'd say make a versioned external repo (a git repo will do), and
have your application store a chain of SHA-1s of the file commits. If
anyone wants to alter your images behind your back, they'll need to gain
access to both the DB and the repo, and know how you pair the two
information levels (how deeply paranoid you need to be is up to
your business requirements).

If you just want to keep the images and no particular security is required,
I'd say just store them in the DB. 1.000.000/year pictures are ~2/minute,
if the flow is regular. As per your later posts they are not so big that
they could not be managed, you don't seem to expect a lot of concurrent
reads and doing it like this all you have to think of is ONE db
backup/restore procedure. This might be more complex if you expect traffic
peaks on the insert phase, of course.

Berto

#20

David G. Johnston

david.g.johnston@gmail.com

over 4 years ago

In reply to: Estevan Rech (#14)

Re: Best Strategy for Large Number of Images

On Wed, Dec 15, 2021 at 1:12 PM Estevan Rech <softrech@gmail.com> wrote:

Related to this is the backup service that seems to me that in a disaster,
I can restore the database relatively quickly if it's in the database. On
the disk, I believe that the restoration is much slower and I don't think
it's reliable that all the images are copied/restored.

Everything ends up "on disk" at some point - whether it's a database or a
simple file server - so speed and reliability would generally be a wash
conceptually.

Honestly, with that more complete description (and the assumption these
images are primary work products that are being paid for), I would say you
should be doing both. Use something like AWS S3 to store a copy - saving
its path into the DB as metadata - while also leaving a copy within the
database. As mentioned nearby, also generate a hash and store that.
Institute a rule that once an image has been generated it is never
altered. If you need to deal with editing, create a new record, and maybe
record the "source image hash" so you can build a chain if desired.

I'd probably store the image data on a separate table than the metadata -
in a one-to-one relationship. Maybe even partition the image file table.
That lets you drop images from the DB if desired - with the option to
restore them from the AWS S3 system if needed. You can over-engineer
things this way but I do think that having two primary sources for the
images, plus their backups, is reasonable for the types of images you are
dealing with.

David J.

#21