dynamic schema modeling and performance
I'm looking for thoughts on the best way to handle dynamic schemas.
The application I am developing revolves around user defined entities. Each
entity is a tabular dataset with user defined columns and data types.
Entities can also be related to each other through Parent-Child
relationships. Some entities will be 100% user driven, while others (such
as an entity representing a photo) will be partially user driven (all photo
entities will have common fields + custom user additions).
I was hoping to get opinions on whether postgresql would be a suitable
backend. A couple of options I have thought of are:
1. Each entity is represented as a table in psql. The schema would be
dynamically updated (with limits) when an entity mapping is updated. I
believe that this would provide the best data constraints and allow the
best data normalization. *A concern I have is that there could be an
enormous number of tables generated, and the performance impact this might
have in the future*. I could then run Elasticsearch as a denormalized cache
for efficient querying and full-text search.
2. Use a nosql database. This provides the "dynamic" schema aspect. A
concern here is the lack of relation support, thus leading to a more
denormalized data structure and the potential for the data to become
corrupted.
Any opinions on the use of psql for this case, or other options would be
greatly appreciated!
RJ
dataverse.org uses PostgreSQL, is well documented, and is completely
user driven. Maybe the concept could be useful for you. I have installed
and configured a few instances for use by researchers.
regards
Poul
--
Med venlig hilsen / Best regards
Poul Kristensen
Linux-OS/Virtualization expert and Oracle DBA
If you are asking if you should go nosql, 99% you should not.
Poul,
I took a quick look at the demo site, but didn't see anything where the user was defining the fields. It looks like they can choose from a list of predetermined metadata fields. Looking at the code, but not actually seeing the total db schema, it looks like they might be using the EAV pattern?
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
I thought that might be an answer around here :)
I guess I was looking for what might be a better approach.
Is dynamically creating a table for each entity a bad idea? I can see something like creating a schema for each project (a group of related entities) and then creating a table for each entity within that schema. I don't expect to have more than a few thousand projects anytime soon. We have a relatively targeted audience.
Or would it be better to use jsonb data types and create a denormalized index elsewhere?
I've done the dynamic-table-per-project approach previously and it worked
great. Even dynamic indexes on it. If it's in the low thousands it should
work OK. If more than that, use as many static columns as possible, put
everything dynamic in jsonb, and validate with per-project constraints.
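A minimal sketch of the hybrid layout described here, with hypothetical table and column names (nothing below comes from the thread itself): common fields live in static columns, user-defined attributes go in one jsonb column, and a per-entity-type CHECK constraint enforces required keys.

```sql
-- Hypothetical hybrid table: shared fields as static columns,
-- user-defined attributes in a single jsonb column.
CREATE TABLE records (
    id          bigserial PRIMARY KEY,
    project_id  integer NOT NULL,
    entity_type text    NOT NULL,                    -- e.g. 'photo', 'sample'
    parent_id   bigint  REFERENCES records (id),     -- Parent-Child relationships
    attrs       jsonb   NOT NULL DEFAULT '{}'
);

-- Example constraint: every 'photo' row must carry the common
-- photo fields inside attrs (?& tests that all keys exist).
ALTER TABLE records ADD CONSTRAINT photo_requires_common_fields
    CHECK (entity_type <> 'photo'
           OR attrs ?& ARRAY['taken_at', 'camera']);

-- GIN index so ad-hoc attribute filters stay efficient.
CREATE INDEX records_attrs_idx ON records USING gin (attrs);
```

With the default `jsonb_ops` operator class the index supports both key-existence (`?`, `?&`) and containment (`@>`) tests; `jsonb_path_ops` is smaller and faster if only containment is needed.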
Postgres can function as a nosql database -- you can use jsonb for
example to archive data in such a way that the data model can be
changed without making schema adjustments. Another way to do it is
EAV pattern as noted. These might be good strategies if you're not
sure where you want to end up.
It really comes down to this: how formal do you want your data model
to be? Adding formality leads to performance optimizations, exposes
your data to the fantastic SQL language, and allows rigorous
assumptions to be made by external dependencies and trusted.
Formality also brings a degree of inflexibility, since your data has to
be forced into predetermined structures.
merlin
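For comparison, a bare-bones sketch of the EAV pattern mentioned above, with hypothetical table and column names; each user-defined attribute value becomes its own row rather than a column:

```sql
-- One row per entity instance.
CREATE TABLE entity (
    id         bigserial PRIMARY KEY,
    project_id integer NOT NULL,
    type       text    NOT NULL
);

-- Attribute definitions per entity type (the user-defined "columns").
CREATE TABLE attribute (
    id          serial PRIMARY KEY,
    entity_type text NOT NULL,
    name        text NOT NULL,
    data_type   text NOT NULL,            -- enforced by the application
    UNIQUE (entity_type, name)
);

-- One row per (entity, attribute) value.
CREATE TABLE entity_value (
    entity_id    bigint  NOT NULL REFERENCES entity (id),
    attribute_id integer NOT NULL REFERENCES attribute (id),
    value        text    NOT NULL,
    PRIMARY KEY (entity_id, attribute_id)
);
```

The usual complaint about EAV is visible here: reassembling one logical row takes a join (or a crosstab) per attribute, and `value` degrades to text, which is where the reported poor performance and weak typing come from.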
We do know where we want to end up. We've had the application running for a
while using a triple store db. We're looking to move away from the triple
store due to performance issues. Our core concept is that each project can
define a set of entities and their relations. Each entity has a set of
attributes. We need to be able to efficiently run ad-hoc queries across
multiple entities in a project, filtering via the attribute values.
I think our business needs probably eliminate the possibility of data
integrity at the db level. We currently do application level data
validation.
Regarding EAV, is there a good way to do this? Most everything I read
online says that EAV is a terrible idea and performance is lackluster.
However, there are 6NF advocates who say that done right, it is highly
performant. Are there any articles on the correct way to implement EAV?
Would jsonb or EAV likely provide better query performance?
On 14/04/2017 at 19:03, Rj Ewing wrote:
We do know where we want to end up. We've had the application running
for a while using a triple store db. We're looking to move away from
the triple store due to performance issues. Our core concept is that
each project can define a set of entities and their relations. Each
entity has a set of attributes. We need to be able to efficiently run
ad-hoc queries across multiple entities in a project, filtering via
the attribute values
What kind of queries are you talking about?
I think our business needs probably eliminate the possibility of data
integrity at the db level. We currently do application level data
validation.
Application-level checks can only be done if you exclusively lock the
database from before you start the check until the moment you no longer
need the certainty.
That's usually a no-go because it makes your data single-user for the
duration of the check.
Regarding EAV, is there a good way to do this? Most everything I read
online says that EAV is a terrible idea and performance is lackluster.
However, there are 6NF advocates who say that done right, it
is highly performant. Are there any articles on the correct way to
implement EAV?
Performance always depends entirely on what you are doing with it, and
on what scale. This is something that you should experiment with.
Would jsonb or EAV likely provide better query performance?
Can't tell without more information about what you actually do with the
data.
But dynamic schemas can be just fine, as long as your application is
100% in control over what can and cannot be done,
and as long as you use separate databases per customer/project/whatever.
You will probably want to scale up at some point
and move customers to different servers, so you might as well take that
into account before you start.
Some example queries are:
give me all attributes for entity1 where entity1.attribute1 > 1000 and
entity1.attribute15 = "someValue"
give me all attributes for entity1 where entity1.parentId = 1
give me all attributes for entity1 & parent_entity where entity1.attribute2
= "this"
Nothing too complicated.
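Assuming a hypothetical jsonb-based table `records (id, parent_id, entity_type, attrs jsonb)` (the names here are illustrative, not from the thread), these example queries might translate roughly to:

```sql
-- entity1.attribute1 > 1000 AND entity1.attribute15 = 'someValue'
SELECT attrs
FROM   records
WHERE  entity_type = 'entity1'
  AND  (attrs->>'attribute1')::numeric > 1000
  AND  attrs->>'attribute15' = 'someValue';

-- entity1.parentId = 1
SELECT attrs
FROM   records
WHERE  entity_type = 'entity1'
  AND  parent_id = 1;

-- entity1 joined to its parent, filtered on attribute2 = 'this'
SELECT c.attrs AS entity1_attrs, p.attrs AS parent_attrs
FROM   records c
JOIN   records p ON p.id = c.parent_id
WHERE  c.entity_type = 'entity1'
  AND  c.attrs->>'attribute2' = 'this';
```

Note that a GIN index on `attrs` only helps equality/containment tests; a range predicate like the `> 1000` filter would want an expression btree index, e.g. `CREATE INDEX ON records (((attrs->>'attribute1')::numeric));`.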
Our application-level checks consist mostly of validating the uploaded data
against user-defined rules. They are fairly basic.
Scale wise...
Nothing huge. In terms of a more traditional schema, where each entity was
a separate table, the largest table might contain 1,000,000 rows, with the
vast majority of them under 10,000. Another issue is that there might be 30
attributes for a given entity. However the distribution of data might look
like 8 of 30 are 100% populated (each row has a value), 4 of 30 are 80%
populated, 15 are 50% populated, and the rest are <25% populated.