Postgres and data warehouses

Started by Nigel Gardiner · about 11 years ago · 10 messages · general
#1 Nigel Gardiner
nigelgardiner@gmail.com

I'm looking at making a data warehouse to address our rapidly spiralling
report query times against the OLTP. I'm looking first at what it would
take to make this a real-time data warehouse, as opposed to batch-driven.

One approach I've seen used to achieve real time data warehousing is to
have middleware that intercepts all database writes and echoes them
to a program that rolls up the data warehouse values and updates the facts,
dimensions and so on, on the fly. Another approach was to use triggers on
the tables of interest that journal the changes to side tables, which
then get processed by a batch job to achieve the same thing.

One of the problems of the trigger-on-the-transactional-database approach
is that if there is ever a problem with the trigger, the main transaction
is affected. I'm not sure whether that is avoidable with proper exception
handling in the trigger code. It does mean a lot of trigger code to
maintain, and another schema to maintain (the journalled changes), so there
are several drawbacks.
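
To make the exception-handling question concrete, here is a minimal sketch of the trigger approach (the `orders` table, its `id` column, and all names here are hypothetical). Wrapping the journal insert in its own BEGIN ... EXCEPTION block keeps a journaling failure from aborting the main transaction, at the cost of silently dropping that journal entry:

```sql
-- Hypothetical journal table; names are illustrative only.
CREATE TABLE journal_changes (
    id          bigserial   PRIMARY KEY,
    table_name  text        NOT NULL,
    operation   text        NOT NULL,   -- 'INSERT' / 'UPDATE' / 'DELETE'
    row_pk      bigint      NOT NULL,
    changed_at  timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION journal_orders() RETURNS trigger AS $$
BEGIN
    BEGIN
        INSERT INTO journal_changes (table_name, operation, row_pk)
        VALUES (TG_TABLE_NAME, TG_OP,
                CASE WHEN TG_OP = 'DELETE' THEN OLD.id ELSE NEW.id END);
    EXCEPTION WHEN OTHERS THEN
        -- Swallow the error so the main transaction is unaffected,
        -- accepting that this journal entry is lost.
        RAISE WARNING 'journaling failed: %', SQLERRM;
    END;
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_journal
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE PROCEDURE journal_orders();
```

Note that the inner BEGIN ... EXCEPTION block creates a subtransaction for every row touched, which has a cost of its own in write-heavy systems.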

I've had a quick search and haven't seen this approach used yet, but I was
thinking the asynchronous replication of Postgres databases could be used
as a streaming journal of changes to be processed by a data warehouse. The
other approach that suggests itself is WAL file shipping. I've not dug into
the async replication protocol yet; before I do so I just wanted to get
some brief feedback on whether I'm on the wrong track, and whether there's
some better approach I should be looking at first instead.

Any feedback much appreciated.

Regards,
Nigel

#2 Stephen Frost
sfrost@snowman.net
In reply to: Nigel Gardiner (#1)
Re: Postgres and data warehouses

* Nigel Gardiner (nigelgardiner@gmail.com) wrote:

> I've had a quick search and haven't seen this approach used yet, but I was
> thinking, the asynchronous replication of Postgres databases could be used
> as a streaming journal of changes to be processed by a data warehouse. The
> other approach that suggests itself is WAL file shipping. I've not dug into
> the async rep protocol yet, before I do so I just wanted to get some brief
> feedback on whether I'm on the wrong track or not, and if there's some
> better approach I should be looking at first instead.

Using a streaming replica for data warehouse queries is quite common.
The issue arises if you want to change the data structure or store
data on the replica, because a streaming replica based on WAL shipping is
read-only. You can create FDW tables which you can write to (even on
the replica), or you can create an independent database which has FDW
tables pointing to the replica. They have their own pros and cons, of course.
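
As a sketch of the second arrangement (an independent reporting database with FDW tables pointing at the replica) using postgres_fdw; the server, user, and table names here are all invented:

```sql
-- On the independent reporting database: pull tables from the replica.
CREATE EXTENSION postgres_fdw;

CREATE SERVER oltp_replica
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'replica.example.com', dbname 'oltp');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER oltp_replica
    OPTIONS (user 'report_reader', password 'secret');

-- A foreign table mirroring a table on the replica.
CREATE FOREIGN TABLE orders_remote (
    id      bigint,
    placed  timestamptz,
    total   numeric
) SERVER oltp_replica
  OPTIONS (schema_name 'public', table_name 'orders');

-- Local, writable warehouse tables live alongside it.
CREATE TABLE order_facts AS
    SELECT date_trunc('day', placed) AS day, sum(total) AS revenue
    FROM orders_remote
    GROUP BY 1;
```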

Thanks!

Stephen

#3 Bill Moran
wmoran@potentialtech.com
In reply to: Stephen Frost (#2)
Re: Postgres and data warehouses

On Sun, 8 Mar 2015 11:20:16 -0400
Stephen Frost <sfrost@snowman.net> wrote:

> * Nigel Gardiner (nigelgardiner@gmail.com) wrote:
> > I've had a quick search and haven't seen this approach used yet, but I was
> > thinking, the asynchronous replication of Postgres databases could be used
> > as a streaming journal of changes to be processed by a data warehouse. The
> > other approach that suggests itself is WAL file shipping. I've not dug into
> > the async rep protocol yet, before I do so I just wanted to get some brief
> > feedback on whether I'm on the wrong track or not, and if there's some
> > better approach I should be looking at first instead.
>
> Using a streaming replica for data warehouse queries is quite common.
> The issue there is if you want to change the data structure or store
> data on the replica because a streaming replica based on WAL shipping is
> read-only. You can create FDW tables which you can write to (even on
> the replica) or you can create an independent database which has FDW
> tables to the replica. They have their own pros and cons, of course.

Another option is to replicate using Slony instead of streaming, which allows
you to create additional tables on the replica that are read/write in
addition to triggers that only fire on the replica. It's complicated, but
pretty damn powerful.

--
Bill Moran

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4 John R Pierce
pierce@hogranch.com
In reply to: Nigel Gardiner (#1)
Re: Postgres and data warehouses

On 3/8/2015 7:40 AM, Nigel Gardiner wrote:

> I'm looking at making a data warehouse to address our rapidly
> spiralling report query times against the OLTP. I'm looking first at
> what it would take to make this a real-time data warehouse, as opposed
> to batch-driven.

we use a hybrid architecture. we have a 'republisher' process that
repeatedly slurps new data from the OLTP database and sends it to the
back end databases, using a 'publish/subscribe' messaging bus.
several back end databases subscribe to this data, and their subscriber
processes insert the incoming data into those OLAP and various other
reporting databases. this way the reporting databases can have
completely different schemas optimized for their needs, and have
different retention requirements than the OLTP database.

this republisher is usually within a few seconds of live new data. in
our case it's made fairly easy to track 'new' because all our OLTP
transactions are event-oriented.
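
A very crude way to sketch this publish/subscribe shape with nothing but Postgres is LISTEN/NOTIFY. This is not John's implementation (which is proprietary), just an illustration; the table, trigger, and channel names are invented:

```sql
-- OLTP side: announce each new event row on a notification channel.
CREATE OR REPLACE FUNCTION publish_event() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('events', NEW.id::text);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_publish
    AFTER INSERT ON events
    FOR EACH ROW EXECUTE PROCEDURE publish_event();

-- Subscriber side (a psql session or client library):
LISTEN events;
-- on each notification, fetch the row by id and insert it,
-- transformed, into the reporting database's own schema
```

Note that notifications are lost if the subscriber is disconnected, so a real messaging bus or a durable queue table is more robust for this job.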

--
john r pierce 37N 122W
somewhere on the middle of the left coast


#5 Rob Sargent
robjsargent@gmail.com
In reply to: John R Pierce (#4)
Re: Postgres and data warehouses

I strongly suspect many would like further details on your implementation. A user conference session well recorded perhaps?


#6 John R Pierce
pierce@hogranch.com
In reply to: Rob Sargent (#5)
Re: Postgres and data warehouses

On 3/8/2015 8:24 PM, Rob Sargent wrote:
> I strongly suspect many would like further details on your implementation. A user conference session well recorded perhaps?

the details are proprietary and quite specific to our workload and
requirements.

--
john r pierce 37N 122W
somewhere on the middle of the left coast


#7 Rob Sargent
robjsargent@gmail.com
In reply to: John R Pierce (#6)
Re: Postgres and data warehouses

I expected the latter but … some proprietors like to brag :)

Standing down.


#8 Jerry Sievers
gsievers19@comcast.net
In reply to: Nigel Gardiner (#1)
Re: Postgres and data warehouses

Nigel Gardiner <nigelgardiner@gmail.com> writes:

> I'm looking at making a data warehouse to address our rapidly spiralling
> report query times against the OLTP. I'm looking first at what it would
> take to make this a real-time data warehouse, as opposed to batch-driven.
>
> One approach I've seen used to achieve real time data warehousing is to
> have middleware that is intercepting all database writes and echoing them
> to a program that rolls up the data warehouse values and updates the facts,
> dimensions and so on on the fly. Another approach was to use triggers on
> the tables of interest to write to tables to journal the changes, which
> then get processed by a batch job to achieve the same thing.
>
> One of the problems of the trigger on the transactional database
> approach is that if there is ever a problem with the trigger, the main
> transaction is affected. I'm not sure if that is avoidable with proper
> exception handling in the trigger code? It does mean a lot of trigger
> code to maintain, and another schema to maintain (the journalled
> changes), so there were several drawbacks.

Firing a trigger on INSERT, UPDATE, or DELETE that simply loads a journal
table with the PK value and the event type is so utterly trivial as to be
a non-issue anywhere but the most high-impact environments.

> I've had a quick search and haven't seen this approach used yet, but I was
> thinking, the asynchronous replication of Postgres databases could be used
> as a streaming journal of changes to be processed by a data warehouse. The
> other approach that suggests itself is WAL file shipping. I've not dug into
> the async rep protocol yet, before I do so I just wanted to get some brief
> feedback on whether I'm on the wrong track or not, and if there's some
> better approach I should be looking at first instead.

Consider whether the new Logical Change Set (logical decoding) features of
9.4 might apply to your case. They may accomplish the same as a
trigger-based solution without all the extra supporting structures
hitherto necessary.
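
For reference, the 9.4 logical decoding machinery can be exercised from plain SQL with the built-in test_decoding output plugin (the slot name is arbitrary):

```sql
-- Requires wal_level = logical and max_replication_slots > 0
-- in postgresql.conf.
SELECT * FROM pg_create_logical_replication_slot('dw_feed', 'test_decoding');

-- ... run some writes on the OLTP tables, then consume the changes:
SELECT * FROM pg_logical_slot_get_changes('dw_feed', NULL, NULL);
-- each returned row describes one change, e.g.
--   table public.orders: INSERT: id[integer]:42 total[numeric]:9.99

-- Drop the slot when done, or WAL will be retained indefinitely.
SELECT pg_drop_replication_slot('dw_feed');
```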


--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


#9 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Rob Sargent (#7)
Re: Postgres and data warehouses

On 3/8/15 11:02 PM, Rob Sargent wrote:
> I expected the latter but … some proprietors like to brag :)
>
> Standing down.
>
> On Mar 8, 2015, at 9:56 PM, John R Pierce <pierce@hogranch.com> wrote:
>> On 3/8/2015 8:24 PM, Rob Sargent wrote:
>>> I strongly suspect many would like further details on your implementation. A user conference session well recorded perhaps?
>>
>> the details are proprietary and quite specific to our workload and requirements.

You could build something similar with Skytools; it's what Skype
actually created it for. They ran many "real time" datamarts/warehouses
using it.

The big difference between it and Slony is that Skytools is modular, so
it's designed to allow for this kind of flexibility.
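
For anyone unfamiliar with Skytools, its PgQ component exposes durable queues through SQL functions, roughly along these lines (a sketch from the PgQ docs; verify the exact signatures against the Skytools version you install):

```sql
-- Producer side: create a queue and push events into it.
SELECT pgq.create_queue('dw_events');
SELECT pgq.insert_event('dw_events', 'order_created', '42');

-- Consumer side: register, then process events in batches.
SELECT pgq.register_consumer('dw_events', 'warehouse_loader');
SELECT pgq.next_batch('dw_events', 'warehouse_loader');  -- returns a batch id
SELECT * FROM pgq.get_batch_events(1);   -- events in the batch (id illustrative)
SELECT pgq.finish_batch(1);              -- mark the batch done
```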
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


#10 Brent Tubbs
brent.tubbs@gmail.com
In reply to: Jim Nasby (#9)
Re: Postgres and data warehouses

The logical replication features in 9.4 look very promising as an
alternative to the binary replication that requires an identical schema on
both ends. There's no reason that the thing you're replicating to couldn't
be something like RabbitMQ or Storm.

I've been playing with a JSON-based decoder plugin so Postgres can emit
changes to non-SQL systems like this. See
https://bitbucket.org/btubbs/decoder_json/src. It should still be
considered experimental at this point, and only used by people unafraid
of hacking on C. More eyeballs would be welcome.

I blogged about more options on streaming updates from Postgres at
http://btubbs.com/streaming-updates-from-postgres.html.

On Tue, Mar 10, 2015 at 8:11 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

> You could build something similar with Skytools; it's what Skype actually
> created it for. [...]