time-based range partitioning and truncate/delete different timezones

Started by Niels Jespersenabout 5 years ago3 messagesgeneral
Jump to latest

Hi all

I need ideas regarding a way to design an insert, truncate/delete, and reinsert process on data that is in UTC and database range partitions that are in the Europe/Brussels time zone. Therefore, any input is welcome.

The issue at hand is this. We receive daily file deliveries of time-based measurements. The files are zipped csv's. A file covers a UTC-day (e.g. 2021-01-15 00:00:00+00 to 2021-01-15 23:59:59+00). About 100 million rows a day.

Loaded by Python into a Postgres 12 database table, range-partitioned on Europe/Brussels days. Database session in UTC, since timestamps in data has no time zone indicator. Most rows hit the same partition, but the last hour (this time of year, in the summer it is 2 hours) hits the next partition. Streamed directly from zip by psycopg2.cursor.copy_expert. Works perfect.

Then analysis on the data can proceed. Aggregating stuff on daily or monthly basis. This works fine since data is partitioned on our local time zone and we want aggregates on the same time zone.

Then, two weeks later a better delivery comes in covering the same UTC-period. Data quality is better, so we want to replace the old data with the new data. If data were partitioned on the same time zone boundary as the file covers, then a truncate partition, insert new data would be the obvious solution. That is no good, since the truncate will wipe an hour too much on one end and an hour too little on the other end.

So what is a better solution? Delete from t where t.timecolumn between a and b, and reinsert the new data. My instinct says no, but I cannot really think of a good alternative. It is 100 million rows with 5 or 6 numeric columns.

Regards Niels Jespersen

#2Michael Lewis
mlewis@entrata.com
In reply to: Niels Jespersen (#1)
Re: time-based range partitioning and truncate/delete different timezones

What version are you using? How long are you keeping data for? It is
possible to partition by hour or would that exceed the number of
recommended partitions too quickly for your retention time period? Else, I
would partition on date according to the timezone of your data. Selecting
from multiple partitions for the aggregate should be performant enough in
most cases. The truncate / detach type commands may need to be the priority.

In reply to: Michael Lewis (#2)
Re: time-based range partitioning and truncate/delete different timezones

<!doctype html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div style="font-family:-apple-system, Calibri , HelveticaNeue , sans-serif;">
<div dir="auto">
<div dir="auto">
<div id="signature-div-B95D2988-9DF3-4E51-AEF4-3D672DB60076" dir="auto"><br>
<br>
</div>
<div id="122C45FC-10C3-43BF-9CFC-1D25D1DBD2BE" spellcheck="true" autocorrect="true" autosuggest="true" dir="auto">
<hr>
<div style="font-family:-apple-system,Calibri,HelveticaNeue,sans-serif;" dir="auto">
<br>
<b>Fra:</b> Michael Lewis &lt;mlewis@entrata.com&gt;<br>
<b>Dato:</b> 15. januar 2021 kl. 19.49.32 CET<br>
<b>Til:</b> Niels Jespersen &lt;NJN@dst.dk&gt;<br>
<b>Cc:</b> pgsql-general@postgresql.org &lt;pgsql-general@postgresql.org&gt;<br>
<b>Emne:</b> Re: time-based range partitioning and truncate/delete different timezones<br>
</div>
<br>
<br>
<div dir="ltr">What version are you using? How long are you keeping data for? It is possible to partition by hour or would that exceed the number of recommended partitions too quickly for your retention time period? Else, I would partition on date according
to the timezone of your data. Selecting from multiple partitions for the aggregate should be performant enough in most cases. The truncate / detach type commands may need to be the priority.</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">Version 12.1.&nbsp;</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">6. months retention, which would give about 4.400 partitions using hourly partitioning. Maybe partitioning on the same timezone as input data and in something like 6 hour intervals is the way to go. I will think about it, based on your input.&nbsp;</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">Thank you for your ideas</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">Regards Niels</div>
<div dir="ltr"><br>
</div>
<div dir="ltr"><br>
</div>
<div dir="ltr"><br>
</div>
</div>
</div>
</div>
</div>
</body>
</html>