historical log of data records

Started by Sanjay Minni · over 4 years ago · 7 messages · general
#1 Sanjay Minni
sanjay.minni@gmail.com

I need to keep a copy of old data as the rows are changed.

For a general RDBMS, I could think of keeping all the data in the same table
with a flag to indicate older copies of updated / deleted rows, or keeping a
parallel table and copying these rows into it under program / trigger
control. Each has its pros and cons.

In Postgres, would I have to follow the same methods, or are there any
features / packages available?

regards
Sanjay Minni

#2 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Sanjay Minni (#1)
Re: historical log of data records

On Tue, 2021-11-16 at 13:56 +0530, Sanjay Minni wrote:

I need to keep a copy of old data as the rows are changed. 

For a general RDBMS I could think of keeping all the data in the same table with a flag
to indicate older copies of updated /  deleted rows or keep a parallel table and copy
these rows into the parallel data under program / trigger control. Each has its pros and cons. 

In Postgres would i have to follow the same methods or are there any features / packages available ?

Yes, I would use one of these methods.

The only feature I can think of that may help is partitioning: if you have one partition
for the current data and one for the deleted data, then updating the flag would
automatically move the row between partitions, so you don't need a trigger.
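
A minimal sketch of that idea, assuming a hypothetical `orders` table list-partitioned on a boolean `deleted` flag (row movement between partitions on UPDATE is supported since PostgreSQL 11):

```sql
-- Hypothetical example: list-partition on a "deleted" flag.
CREATE TABLE orders (
    id      bigint  NOT NULL,
    amount  numeric NOT NULL,
    deleted boolean NOT NULL DEFAULT false
) PARTITION BY LIST (deleted);

CREATE TABLE orders_current PARTITION OF orders FOR VALUES IN (false);
CREATE TABLE orders_history PARTITION OF orders FOR VALUES IN (true);

-- "Deleting" a row just flips the flag; PostgreSQL moves the row
-- from orders_current to orders_history automatically.
UPDATE orders SET deleted = true WHERE id = 42;
```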

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#3 Sanjay Minni
sanjay.minni@gmail.com
In reply to: Laurenz Albe (#2)
Re: historical log of data records

Are you referring to Table Partitioning?
https://www.postgresql.org/docs/14/ddl-partitioning.html

with warm regards
Sanjay Minni
+91-9900-902902
http://planage.com
https://www.linkedin.com/in/sanjayminni/

#4 Alban Hertroys
haramrae@gmail.com
In reply to: Laurenz Albe (#2)
Re: historical log of data records

Are you building (something like) a data-vault? If so, keep in mind that you will have a row for every update, not just a single deleted row. Enriching the data can be really useful in such cases.

For a data-vault at a previous employer, we determined how to treat new rows by comparing a (md5) hash of the new and old rows, adding the hash and a validity interval to the stored rows. Historic data went to a separate table for each respective current table.
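
That hash comparison can be sketched in SQL roughly like this (the table and column names here are made up for illustration):

```sql
-- Detect real changes by hashing the payload columns of the incoming
-- row and the stored current row; equal hashes mean nothing changed,
-- so no new history version needs to be written.
SELECT s.id
FROM   staging s
JOIN   customer_current c USING (id)
WHERE  md5(ROW(s.name, s.address)::text)
    <> md5(ROW(c.name, c.address)::text);
```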

The current tables “inherited” the PKs from the tables on the source systems (this was a data-warehouse DB). Obviously that same PK cannot be applied to the historic tables, where there _will_ be duplicates, although they should be at non-overlapping validity intervals.

Alternatively, since this is time-series data, it would probably be a good idea to store that in a way optimised for that. TimescaleDB comes to mind, or arrays as per Pavel’s suggestion at https://stackoverflow.com/questions/68440130/time-series-data-on-postgresql.

Regards,

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.

#5 Sanjay Minni
sanjay.minni@gmail.com
In reply to: Alban Hertroys (#4)
Re: historical log of data records

Alban,

It's a simple financial transaction processing application. The application
permits editing / updating / deleting of entered data, even multiple times,
but an audit trail tracing the data through all versions back to the
original must be preserved.
(As outlined, programmatically I could approach it by keeping a parallel
set of tables and copying the row being replaced into the parallel table
set, or by keeping all record versions in a single table with a flag to
indicate the final / current version.)
I am looking to see whether there are better ways to do it.

with warm regards
Sanjay Minni
+91-9900-902902

#6 Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Sanjay Minni (#1)
Re: historical log of data records

There is an extension (I think it's this one:
https://pgxn.org/dist/temporal_tables/) which we've used for one
project so far. I'm not sure whether I would use it for further
projects: it does what it's supposed to do, and it's nice not having to
reinvent the wheel. But it did cause some problems during a PostgreSQL
upgrade (nothing insurmountable, just a bit of a hassle) and IIRC also
with Django migration scripts. So, given that it's easy to roll your own,
I probably would (but I do suffer a bit from NIH syndrome in general, so
that's not surprising).
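
If it is that extension, basic usage looks roughly like this (adapted from its documentation; the `accounts` table is hypothetical):

```sql
CREATE EXTENSION temporal_tables;

-- The versioned table carries a system-period range column.
CREATE TABLE accounts (
    id         bigint PRIMARY KEY,
    balance    numeric NOT NULL,
    sys_period tstzrange NOT NULL
        DEFAULT tstzrange(current_timestamp, null)
);

-- History table with the same shape.
CREATE TABLE accounts_history (LIKE accounts);

-- The extension's versioning() trigger maintains sys_period and
-- copies superseded row versions into accounts_history.
CREATE TRIGGER versioning_trigger
BEFORE INSERT OR UPDATE OR DELETE ON accounts
FOR EACH ROW EXECUTE PROCEDURE
    versioning('sys_period', 'accounts_history', true);
```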

I would strongly advise using triggers. In my experience, implementing
this in the application logic is a recipe for inconsistencies: someone
will always forget to duplicate a row, or set the wrong timestamp, or
whatever (BTDT).
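
A hand-rolled version of that trigger approach might look like this sketch (again assuming a hypothetical `accounts` table):

```sql
-- Side table with the same columns plus audit metadata.
CREATE TABLE accounts_history (
    LIKE accounts,
    changed_at timestamptz NOT NULL DEFAULT current_timestamp,
    changed_op text NOT NULL
);

-- Copy the previous row version on every UPDATE or DELETE.
CREATE FUNCTION accounts_audit() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO accounts_history
    SELECT (OLD).*, current_timestamp, TG_OP;
    RETURN NULL;  -- return value is ignored in an AFTER trigger
END;
$$;

CREATE TRIGGER accounts_audit_trg
AFTER UPDATE OR DELETE ON accounts
FOR EACH ROW EXECUTE FUNCTION accounts_audit();
```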

I don't have a strong preference between just using valid_from/valid_to
fields in the same table(s) and using separate history tables. Both
work, one or the other may be slightly more convenient depending on the
use case. Since you mentioned that the purpose is auditing, I'd probably
lean towards separate tables in your case.

hp

--
Peter J. Holzer | hjp@hjp.at | http://www.hjp.at/
"Story must make more sense than reality." -- Charles Stross, "Creative writing challenge!"

#7 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Sanjay Minni (#3)
Re: historical log of data records

On Tue, 2021-11-16 at 15:24 +0530, Sanjay Minni wrote:

Are you referring to Table Partitioning ?
https://www.postgresql.org/docs/14/ddl-partitioning.html

Yes, exactly.

Yours,
Laurenz Albe