CSV Logging questions

Started by Greg Stark over 8 years ago, 3 messages
#1 Greg Stark
stark@mit.edu

I was just looking over the CSV logging code and have a few questions
about why things were done the way they were done.

1) Why do we gather a per-session log line number? Is it just to
help people importing the logs avoid duplicate entries from partial
files? Is there some other purpose, given that entries will already
be sequential in the csv file?

2) Why is the error location (source file and line) conditional on
log_error_verbosity? Surely the whole point of a structured log is
that you can log everything and choose what to display later -- which
is why csv logging doesn't look at log_line_prefix to determine which
other bits to display. There's no added cost to including this
information unconditionally, and it's far from the largest piece of
data being logged either.

3) Similarly, I wonder if the statement should always be included
even when hide_stmt is set, so that users can write sensible queries
against the data even if it means duplicating data.

4) Why the session start time? Is this just so that <process_id,
session_start_time> uniquely identifies a session? (See the sketch
below.) Should we perhaps generate a unique session identifier
instead?
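
(For context: the %c escape in log_line_prefix already builds a
session identifier out of exactly that pair. A standalone sketch of
the idea -- based on my reading of elog.c, untested:)

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/*
 * Sketch of how the %c "session ID" escape composes an identifier
 * from the session start time and the backend PID, both in hex,
 * joined by a dot.  In the backend these values come from
 * MyStartTime and MyProcPid; this standalone version is just for
 * illustration.
 */
static void
format_session_id(char *dest, size_t destlen, time_t start_time, int pid)
{
    snprintf(dest, destlen, "%lx.%x", (long) start_time, pid);
}

int
main(void)
{
    char        sessid[32];

    format_session_id(sessid, sizeof(sessid), time(NULL), (int) getpid());
    printf("session id: %s\n", sessid);     /* e.g. 59ad63fc.4a2b */
    return 0;
}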

The real reason I'm looking at this is that I'm looking at the
json_log plugin from Michael Paquier. It doesn't have the log line
numbers, and I can't figure out whether it should, because I can't
quite figure out why they exist in the CSV files. I think there are a
few other fields that have been added in Postgres but are missing
from the JSON log because of version skew.

I'm wondering if we should abstract out the CSV format so that,
instead of using emit_log_hook, you would register a new format
supplying an "add_log_attribute(key,val)" hook, which the core would
call once per attribute for each registered format. That way you
could have as many log formats as you want and be sure they all
receive the same data. It would also mean the timestamps would be in
sync, and we could probably eliminate the occurrences of the wrong
format appearing in the wrong logs.
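
Roughly the shape I have in mind -- all of these names are invented
for illustration, nothing like this exists today:

/*
 * Hypothetical format-plugin API (invented names).  The core would
 * call begin_entry(), then add_log_attribute() once per field, then
 * end_entry(), for every registered format, so each format receives
 * exactly the same data for a given log entry.
 */
typedef struct LogFormat
{
    const char *name;           /* e.g. "csv", "json" */
    void        (*begin_entry) (void);
    void        (*add_log_attribute) (const char *key, const char *value);
    void        (*end_entry) (void);
} LogFormat;

/* A format plugin would register itself, e.g. from _PG_init(). */
extern void register_log_format(const LogFormat *format);

A format would then differ only in how it serialises the key/value
stream it is handed.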

--
greg

#2 David Fetter
david@fetter.org
In reply to: Greg Stark (#1)
Re: CSV Logging questions

On Mon, Sep 04, 2017 at 05:27:40PM +0100, Greg Stark wrote:

I was just looking over the CSV logging code and have a few questions
about why things were done the way they were done.

1) Why do we gather a per-session log line number? Is it just to
help people importing the logs avoid duplicate entries from partial
files? Is there some other purpose, given that entries will already
be sequential in the csv file?

2) Why is the error location (source file and line) conditional on
log_error_verbosity? Surely the whole point of a structured log is
that you can log everything and choose what to display later -- which
is why csv logging doesn't look at log_line_prefix to determine which
other bits to display. There's no added cost to including this
information unconditionally, and it's far from the largest piece of
data being logged either.

3) Similarly, I wonder if the statement should always be included
even when hide_stmt is set, so that users can write sensible queries
against the data even if it means duplicating data.

4) Why the session start time? Is this just so that <process_id,
session_start_time> uniquely identifies a session? Should we perhaps
generate a unique session identifier instead?

The real reason I'm looking at this is that I'm looking at the
json_log plugin from Michael Paquier. It doesn't have the log line
numbers, and I can't figure out whether it should, because I can't
quite figure out why they exist in the CSV files. I think there are a
few other fields that have been added in Postgres but are missing
from the JSON log because of version skew.

I'm wondering if we should abstract out the CSV format so that,
instead of using emit_log_hook, you would register a new format
supplying an "add_log_attribute(key,val)" hook, which the core would
call once per attribute for each registered format. That way you
could have as many log formats as you want and be sure they all
receive the same data. It would also mean the timestamps would be in
sync, and we could probably eliminate the occurrences of the wrong
format appearing in the wrong logs.

+1 for making the emitters all work off the same source.

Any idea how much work we're talking about to do these things?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Stark (#1)
Re: [HACKERS] CSV Logging questions

On Mon, Sep 4, 2017 at 12:27 PM, Greg Stark <stark@mit.edu> wrote:

1) Why do we gather a per-session log line number? Is it just to
help people importing the logs avoid duplicate entries from partial
files? Is there some other purpose, given that entries will already
be sequential in the csv file?

I think the idea is that if you see a line for a session you can tell
whether there are any earlier lines for the same session. It might
not be obvious, because they could be much earlier in the log if a
session was idle for a while. I've certainly run into this problem in
real-world troubleshooting.
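
(For reference, that counter is simple per-backend state in elog.c;
a simplified sketch, not verbatim:)

/*
 * Simplified sketch of the per-session log line counter (based on
 * elog.c, not verbatim).  The counter is static in each backend and
 * reset when the PID changes, so a session's messages are numbered
 * 1, 2, 3, ... no matter how far apart they land in the log file.
 */
static long log_line_number = 0;    /* messages emitted so far */
static int  log_my_pid = 0;         /* PID the counter belongs to */

static long
next_log_line_number(int my_pid)
{
    if (log_my_pid != my_pid)
    {
        /* new process, e.g. after fork: restart the sequence */
        log_line_number = 0;
        log_my_pid = my_pid;
    }
    return ++log_line_number;
}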

2) Why is the error location (source file and line) conditional on
log_error_verbosity? Surely the whole point of a structured log is
that you can log everything and choose what to display later -- which
is why csv logging doesn't look at log_line_prefix to determine which
other bits to display. There's no added cost to including this
information unconditionally, and it's far from the largest piece of
data being logged either.

3) Similarly, I wonder if the statement should always be included
even when hide_stmt is set, so that users can write sensible queries
against the data even if it means duplicating data.

I think the principle that the CSV log should contain all of the
output fields can be taken too far, and I'd put both of these ideas in
that category. I don't see any reason to believe there couldn't be a
user who wants CSV logging but not at maximum verbosity -- and
hide_stmt is used for cases like this:

ereport(LOG,
        (errmsg("statement: %s", query_string),
         errhidestmt(true),
         errdetail_execute(parsetree_list)));
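
(If memory serves, that's the log_statement output path in
exec_simple_query(): the message text already is the statement, so
errhidestmt(true) keeps the separate statement field from merely
duplicating it on that line.)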

Actually, I think it's poor design to force the CSV log to contain all
of the output fields. For some users, that might make it unusable by
making the output too big. I think it would be better if the data
were self-identifying - e.g. by sticking a header line on each log
file - and perhaps complete by default, but still configurable. We've
had the idea of adding new %-escapes shot down on the grounds that
that would force us to include them all the time in CSV output and
they are too specialized to justify this, but that seems to me to be a
case of the tail wagging the dog.
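
A sketch of the header idea -- purely hypothetical, and the
csvlog_fields knob below doesn't exist -- could be as simple as
writing one row of column names whenever a new log file is opened:

#include <stdio.h>

/*
 * Hypothetical sketch: emit a CSV header naming the configured
 * columns when a new log file is opened, so the file stays
 * self-identifying even if the column set becomes configurable
 * (say, via an imagined csvlog_fields GUC).
 */
static void
write_csv_header(FILE *logfile, const char *const *fields, int nfields)
{
    int         i;

    for (i = 0; i < nfields; i++)
        fprintf(logfile, "%s%s", i > 0 ? "," : "", fields[i]);
    fputc('\n', logfile);
}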

4) Why the session start time? Is this just so that <process_id,
session_start_time> uniquely identifies a session? Should we perhaps
generate a unique session identifier instead?

We could do that, but it doesn't seem better...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company