Emitting JSON to file using COPY TO
Hello!
I'm trying to emit a JSON aggregation of JSON rows to a file using COPY TO,
but I'm running into problems with COPY TO double quoting the output.
Here is a minimal example that demonstrates the problem I'm having:
create table public.tbl_json_test (id int, t_test text);
-- insert text that includes double quotes
insert into public.tbl_json_test (id, t_test) values (1, 'here''s a "string"');
-- select a JSON aggregation of JSON rows
select json_agg(row_to_json(t)) from (select * from public.tbl_json_test) t;
-- this yields the correct result in proper JSON format:
-- [{"id":1,"t_test":"here's a \"string\""}]
copy (select json_agg(row_to_json(t)) from (select * from
public.tbl_json_test) t) to '/tmp/tbl_json_test.json';
-- once the JSON results are copied to file, the JSON is broken due to
-- double quoting:
-- [{"id":1,"t_test":"here's a \\"string\\""}]
-- this fails to be parsed using jq on the command line:
-- cat /tmp/tbl_json_test.json | jq .
-- jq: parse error: Invalid numeric literal at line 1, column 40
We populate a text field in a table with text containing at least one
double quote ("). We then select from that table, formatting the result as
a JSON aggregation of JSON rows. At this point the JSON syntax is
correct, with the double quotes properly escaped. The problem is that
once we use COPY TO to emit the results to a file, the output gets escaped
again with a second escape character (\), breaking the JSON and causing a
syntax error (as we can see above using the `jq` command line tool).
I have tried to get COPY TO to copy the results to file "as-is" by setting
the escape and the quote characters to the empty string (''), but they only
apply to the CSV format.
Is there a way to emit JSON results to file from within postgres?
Effectively, an "as-is" option to COPY TO would work well for this JSON use
case.
Any assistance would be appreciated.
Thanks,
Davin
On Sat, Nov 25, 2023 at 12:22 PM Davin Shearer <scholarsmate@gmail.com>
wrote:
> Is there a way to emit JSON results to file from within postgres?
Use psql to directly output query results to a file instead of using COPY,
which produces structured output in a format you don't want.
David J.
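A concrete sketch of that suggestion (assumes a running server reachable with default credentials, plus the example table from the first message; the output path is illustrative):

```shell
# -t (tuples only) suppresses headers/footers, -A (unaligned) suppresses
# padding, and -o writes the raw query output to a file, so psql performs
# no quoting or escaping of its own on the JSON text.
psql -t -A -o /tmp/tbl_json_test.json \
     -c "select json_agg(row_to_json(t)) from (select * from public.tbl_json_test) t"

# The emitted file should now parse cleanly:
jq . /tmp/tbl_json_test.json
```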
On 11/25/23 11:21, Davin Shearer wrote:
> Hello!
>
> I'm trying to emit a JSON aggregation of JSON rows to a file using COPY
> TO, but I'm running into problems with COPY TO double quoting the
> output. Here is a minimal example that demonstrates the problem I'm
> having:
>
> I have tried to get COPY TO to copy the results to file "as-is" by
> setting the escape and the quote characters to the empty string (''),
> but they only apply to the CSV format.
>
> Is there a way to emit JSON results to file from within postgres?
> Effectively, an "as-is" option to COPY TO would work well for this JSON
> use case.

Not using COPY.

See David Johnson's post for one way using the client psql.

Otherwise you will need to use any of the many ETL programs out there
that are designed for this sort of thing.

> Any assistance would be appreciated.
>
> Thanks,
> Davin
--
Adrian Klaver
adrian.klaver@aklaver.com
On Sat, Nov 25, 2023 at 10:00 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:
> On 11/25/23 11:21, Davin Shearer wrote:
>> Hello!
>>
>> I'm trying to emit a JSON aggregation of JSON rows to a file using COPY
>> TO, but I'm running into problems with COPY TO double quoting the
>> output. Here is a minimal example that demonstrates the problem I'm
>> having:
>>
>> I have tried to get COPY TO to copy the results to file "as-is" by
>> setting the escape and the quote characters to the empty string (''),
>> but they only apply to the CSV format.
>>
>> Is there a way to emit JSON results to file from within postgres?
>> Effectively, an "as-is" option to COPY TO would work well for this JSON
>> use case.
>
> Not using COPY.
>
> See David Johnson's post for one way using the client psql.
>
> Otherwise you will need to use any of the many ETL programs out there
> that are designed for this sort of thing.
Guys, I don't get answers like that. The JSON spec is clear:
On Mon, Nov 27, 2023 at 10:33 AM Dominique Devienne <ddevienne@gmail.com>
wrote:
> On Sat, Nov 25, 2023 at 10:00 PM Adrian Klaver <adrian.klaver@aklaver.com>
> wrote:
>> On 11/25/23 11:21, Davin Shearer wrote:
>>> Hello!
>>>
>>> I'm trying to emit a JSON aggregation of JSON rows to a file using COPY
>>> TO, but I'm running into problems with COPY TO double quoting the
>>> output. Here is a minimal example that demonstrates the problem I'm
>>> having:
>>>
>>> I have tried to get COPY TO to copy the results to file "as-is" by
>>> setting the escape and the quote characters to the empty string (''),
>>> but they only apply to the CSV format.
>>>
>>> Is there a way to emit JSON results to file from within postgres?
>>> Effectively, an "as-is" option to COPY TO would work well for this JSON
>>> use case.
>>
>> Not using COPY.
>>
>> See David Johnson's post for one way using the client psql.
>>
>> Otherwise you will need to use any of the many ETL programs out there
>> that are designed for this sort of thing.
>
> Guys, I don't get answers like that. The JSON spec is clear:
Oops, sorry, user error. --DD
PS: The JSON spec is a bit ambiguous. First it says
    Any codepoint except " or \ or control characters
And then it clearly shows \" as a valid escape sequence...
Sounds like jq is too restrictive?
Or is the double-escape the culprit?
i.e. \\ is in the final text, so that's just a backslash,
and then the double-quote is no longer escaped.
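That second reading can be checked directly (a sketch with Python's JSON parser, using the strings from the example above):

```python
import json

# What SELECT produced: the inner quote escaped once -- valid JSON.
good = '[{"id":1,"t_test":"here\'s a \\"string\\""}]'
print(json.loads(good))  # parses fine

# What COPY TO (text format) wrote: the backslashes doubled, so a JSON
# parser sees an escaped backslash followed by an unescaped quote.
bad = '[{"id":1,"t_test":"here\'s a \\\\"string\\\\""}]'
try:
    json.loads(bad)
except json.JSONDecodeError as err:
    print("parse error:", err)
```

So jq is not being too restrictive here; the doubled backslash really does end the escape early.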
I've recently noticed json_agg(row_to_json(t))
is equivalent to json_agg(t)
Maybe use that instead? Does that make a difference?
I haven't noticed wrong escaping of double-quotes yet,
but then I'm using the binary mode of queries. Perhaps that matters.
On second thought, I guess that's COPY in its text modes doing the escaping?
Interesting. The text-based modes of COPY are configurable. There's even a
JSON mode.
By miracle, would the JSON output mode recognize JSON[B] values, and avoid
the escaping?
On Monday, November 27, 2023, Dominique Devienne <ddevienne@gmail.com>
wrote:
> There's even a JSON mode.
> By miracle, would the JSON output mode recognize JSON[B] values, and avoid
> the escaping?
I agree there should be a copy option for “not formatted” so if you dump a
single column result in that format you get the raw unescaped contents of
the column. As soon as you ask for a format your json is now embedded so it
is a value within another format and any structural aspects of the wrapper
present in the json text representation need to be escaped.
David J.
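That embedding effect is easy to see with CSV as the wrapper format (a sketch using Python's csv module rather than COPY itself):

```python
import csv
import io
import json

# A JSON value embedded as one CSV field: the CSV writer must quote the
# field and double every embedded double-quote, per CSV quoting rules.
payload = json.dumps({"t_test": 'here\'s a "string"'})
buf = io.StringIO()
csv.writer(buf).writerow([1, payload])
line = buf.getvalue().strip()
print(line)  # the JSON's quotes appear doubled inside the CSV field

# Reading back through a CSV parser undoes the wrapper's escaping,
# recovering the original JSON intact.
((_, back),) = csv.reader(io.StringIO(line))
assert json.loads(back)["t_test"] == 'here\'s a "string"'
```

The point being: the escaping is not wrong, it belongs to the wrapper, and a format-aware reader must strip it before handing the payload to a JSON parser.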
Hi
On Mon, Nov 27, 2023 at 2:27 PM David G. Johnston <
david.g.johnston@gmail.com> wrote:
> On Monday, November 27, 2023, Dominique Devienne <ddevienne@gmail.com>
> wrote:
>> There's even a JSON mode.
>> By miracle, would the JSON output mode recognize JSON[B] values, and
>> avoid the escaping?
>
> I agree there should be a copy option for “not formatted” so if you dump a
> single column result in that format you get the raw unescaped contents of
> the column. As soon as you ask for a format your json is now embedded so it
> is a value within another format and any structural aspects of the wrapper
> present in the json text representation need to be escaped.
Is it better to use the LO API for this purpose? It is native for not
formatted data.
Regards
Pavel
On Monday, November 27, 2023, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Hi
>
> On Mon, Nov 27, 2023 at 2:27 PM David G. Johnston <
> david.g.johnston@gmail.com> wrote:
>> On Monday, November 27, 2023, Dominique Devienne <ddevienne@gmail.com>
>> wrote:
>>> There's even a JSON mode.
>>> By miracle, would the JSON output mode recognize JSON[B] values, and
>>> avoid the escaping?
>>
>> I agree there should be a copy option for “not formatted” so if you dump
>> a single column result in that format you get the raw unescaped contents
>> of the column. As soon as you ask for a format your json is now embedded
>> so it is a value within another format and any structural aspects of the
>> wrapper present in the json text representation need to be escaped.
>
> Is it better to use the LO API for this purpose? It is native for not
> formatted data.
Using LO is, IMO, never the answer. But if you are using a driver API
anyway just handle the normal select query result.
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> I agree there should be a copy option for “not formatted” so if you dump a
> single column result in that format you get the raw unescaped contents of
> the column.
I'm not sure I even buy that. JSON data in particular is typically
multi-line, so how will you know where the row boundaries are?
That is, is a newline a row separator or part of the data?
You can debate the intelligence of any particular quoting/escaping
scheme, but imagining that you can get away without having one at
all will just create its own problems.
regards, tom lane
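The row-boundary concern can be made concrete: a pretty-printed JSON value spans several physical lines, so a hypothetical naive "one row per line" reader (not COPY itself) cannot recover the rows:

```python
import json

rows = [{"id": 1, "t_test": 'here\'s a "string"'}, {"id": 2, "t_test": "two"}]

# Pretty-printed JSON spans multiple physical lines per value...
dump = "\n".join(json.dumps(r, indent=2) for r in rows)

# ...so splitting on newlines misparses: the first line is just "{".
try:
    [json.loads(line) for line in dump.splitlines()]
except json.JSONDecodeError:
    print("newline is part of the data, not a row separator")
```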
On Mon, Nov 27, 2023 at 3:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "David G. Johnston" <david.g.johnston@gmail.com> writes:
>> I agree there should be a copy option for “not formatted” so if you dump
>> a single column result in that format you get the raw unescaped contents
>> of the column.
>
> I'm not sure I even buy that. JSON data in particular is typically
> multi-line, so how will you know where the row boundaries are?
> That is, is a newline a row separator or part of the data?
>
> You can debate the intelligence of any particular quoting/escaping
> scheme, but imagining that you can get away without having one at
> all will just create its own problems.
What I was suggesting is not a "not formatted" option.
Rather, for JSON values (i.e. typed `json` or `jsonb`) in a
JSON-formatted COPY operation, the JSON values should not be
serialized to text that is then output as a JSON text value by COPY,
but "inlined" as a "real" JSON value within the JSON document output by
COPY.

This is a special case, where the inner and outer "values" (for lack of a
better terminology) are *both* JSON documents, and given that JSON is
hierarchical, the inner JSON value can either be 1) serialized to text
first, which must thus be escaped using the JSON escaping rules, or
2) NOT serialized, but "inlined" or "spliced into" the outer COPY JSON
document.

I guess COPY in JSON mode supports only #1 now? While #2 makes more sense
to me. But both options are valid. Is that clearer?
BTW, JSON is not multi-line, except for insignificant whitespace.
So even COPY in JSON mode is not supposed to be line based I guess?
Unless COPY in JSON mode is more like NDJSON (https://ndjson.org/)? --DD
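For reference, the NDJSON convention sidesteps the row-boundary question by requiring exactly one compact JSON value per physical line (a sketch):

```python
import json

rows = [{"id": 1, "t_test": 'here\'s a "string"'}, {"id": 2, "t_test": "two"}]

# Emit one compact JSON object per line. json.dumps escapes any embedded
# newline as \n, so a physical newline can only ever be a row boundary.
ndjson = "\n".join(json.dumps(r) for r in rows)

# Line-based splitting is now unambiguous:
back = [json.loads(line) for line in ndjson.splitlines()]
assert back == rows
```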
On 11/27/23 01:44, Dominique Devienne wrote:
> On Mon, Nov 27, 2023 at 10:33 AM Dominique Devienne
> <ddevienne@gmail.com> wrote:
>> On second thought, I guess that's COPY in its text modes doing the
>> escaping?
>
> Interesting. The text-based modes of COPY are configurable. There's even
> a JSON mode.

Where are you seeing the JSON mode for COPY? AFAIK there is only text
and CSV formats.

> By miracle, would the JSON output mode recognize JSON[B] values, and
> avoid the escaping?
--
Adrian Klaver
adrian.klaver@aklaver.com
On Mon, Nov 27, 2023 at 5:04 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:
> On 11/27/23 01:44, Dominique Devienne wrote:
>> On Mon, Nov 27, 2023 at 10:33 AM Dominique Devienne
>> <ddevienne@gmail.com> wrote:
>>> On second thought, I guess that's COPY in its text modes doing the
>>> escaping?
>>
>> Interesting. The text-based modes of COPY are configurable. There's even
>> a JSON mode.
>
> Where are you seeing the JSON mode for COPY? AFAIK there is only text
> and CSV formats.
Indeed. Somehow I thought there was...
I've used the TEXT and BINARY modes, and remembered a wishful thinking JSON
mode!
OK then, if there was, then what I wrote would apply :). --DD
This would be a very special case for COPY. It applies only to a single
column of JSON values. The original problem can be solved with psql
--tuples-only as David wrote earlier.
$ psql -tc 'select json_agg(row_to_json(t))
from (select * from public.tbl_json_test) t;'
[{"id":1,"t_test":"here's a \"string\""}]
Special-casing any encoding/escaping scheme leads to bugs and harder
parsing.
Just my 2c.
--
Filip Sedlák
Thanks for the responses everyone.
I worked around the issue using the `psql -tc` method as Filip described.
I think it would be great to support writing JSON using COPY TO at
some point so I can emit JSON to files using a PostgreSQL function directly.
-Davin
On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <filip@sedlakovi.org> wrote:
> This would be a very special case for COPY. It applies only to a single
> column of JSON values. The original problem can be solved with psql
> --tuples-only as David wrote earlier.
>
> $ psql -tc 'select json_agg(row_to_json(t))
>             from (select * from public.tbl_json_test) t;'
>
> [{"id":1,"t_test":"here's a \"string\""}]
>
> Special-casing any encoding/escaping scheme leads to bugs and harder
> parsing.
>
> Just my 2c.
>
> --
> Filip Sedlák
On 11/29/23 10:32, Davin Shearer wrote:
> Thanks for the responses everyone.
>
> I worked around the issue using the `psql -tc` method as Filip described.
>
> I think it would be great to support writing JSON using COPY TO at
> some point so I can emit JSON to files using a PostgreSQL function
> directly.
>
> -Davin
>
> On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <filip@sedlakovi.org> wrote:
>> This would be a very special case for COPY. It applies only to a single
>> column of JSON values. The original problem can be solved with psql
>> --tuples-only as David wrote earlier.
>>
>> $ psql -tc 'select json_agg(row_to_json(t))
>>             from (select * from public.tbl_json_test) t;'
>>
>> [{"id":1,"t_test":"here's a \"string\""}]
>>
>> Special-casing any encoding/escaping scheme leads to bugs and harder
>> parsing.
(moved to hackers)
I did a quick PoC patch (attached) -- if there is interest and no hard
objections I would like to get it up to speed for the January commitfest.
Currently the patch lacks documentation and regression test support.
Questions:
----------
1. Is supporting JSON array format sufficient, or does it need to
support some other options? How flexible does the support scheme need to be?
2. This only supports COPY TO and we would undoubtedly want to support
COPY FROM for JSON as well, but is that required from the start?
Thanks for any feedback.
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
copyto_json.000.diff (text/x-patch; charset=UTF-8, +126 -80)
On Fri, Dec 01, 2023 at 02:28:55PM -0500, Joe Conway wrote:
> I did a quick PoC patch (attached) -- if there is interest and no hard
> objections I would like to get it up to speed for the January commitfest.
Cool. I would expect there to be interest, given all the other JSON
support that has been added thus far.
I noticed that, with the PoC patch, "json" is the only format that must be
quoted. Without quotes, I see a syntax error. I'm assuming there's a
conflict with another json-related rule somewhere in gram.y, but I haven't
tracked down exactly which one is causing it.
> 1. Is supporting JSON array format sufficient, or does it need to
> support some other options? How flexible does the support scheme need to be?
I don't presently have a strong opinion on this one. My instinct would be
start with something simple, though. I don't think we offer any special
options for log_destination...
> 2. This only supports COPY TO and we would undoubtedly want to support
> COPY FROM for JSON as well, but is that required from the start?
I would vote for including COPY FROM support from the start.
> ! if (!cstate->opts.json_mode)
I think it's unfortunate that this further complicates the branching in
CopyOneRowTo(), but after some quick glances at the code, I'm not sure it's
worth refactoring a bunch of stuff to make this nicer.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
I'm really glad to see this taken up as a possible new feature and will
definitely use it if it gets released. I'm impressed with how clean,
understandable, and approachable the postgres codebase is in general and
how easy it is to read and understand this patch.
I reviewed the patch (though I didn't build and test the code) and have a
concern with adding the '[' at the beginning and ']' at the end of the json
output. Those are already added by `json_agg` (
https://www.postgresql.org/docs/current/functions-aggregate.html) as you
can see in my initial email. Adding them in the COPY TO may be redundant
(e.g., [[{"key":"value"...}....]]).
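The redundancy is easy to demonstrate (a sketch; assumes, as the docs describe, that json_agg() output is already one JSON array):

```python
import json

# json_agg() already wraps the whole result set in one JSON array:
agg = json.dumps([{"id": 1, "t_test": "a"}])

# If COPY TO then added its own surrounding brackets around that column,
# the result would be a doubly nested array:
wrapped = "[" + agg + "]"
assert json.loads(wrapped) == [[{"id": 1, "t_test": "a"}]]
```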
I think COPY TO makes good sense to support, though COPY FROM maybe not so
much as JSON isn't necessarily flat and rectangular like CSV.
For my use-case, I'm emitting JSON files to Apache NiFi for processing, and
NiFi has superior handling of JSON (via JOLT parsers) versus CSV where
parsing is generally done with regex. I want to be able to emit JSON using
a postgres function and thus COPY TO.
Definitely +1 for COPY TO.
I don't think COPY FROM will work out well unless the JSON is required to
be flat and rectangular. I would vote -1 to leave it out due to the
necessary restrictions making it not generally useful.
Hope it helps,
Davin
On Fri, Dec 1, 2023 at 6:10 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:
> On Fri, Dec 01, 2023 at 02:28:55PM -0500, Joe Conway wrote:
>> I did a quick PoC patch (attached) -- if there is interest and no hard
>> objections I would like to get it up to speed for the January commitfest.
>
> Cool. I would expect there to be interest, given all the other JSON
> support that has been added thus far.
>
> I noticed that, with the PoC patch, "json" is the only format that must be
> quoted. Without quotes, I see a syntax error. I'm assuming there's a
> conflict with another json-related rule somewhere in gram.y, but I haven't
> tracked down exactly which one is causing it.
>
>> 1. Is supporting JSON array format sufficient, or does it need to support
>> some other options? How flexible does the support scheme need to be?
>
> I don't presently have a strong opinion on this one. My instinct would be
> start with something simple, though. I don't think we offer any special
> options for log_destination...
>
>> 2. This only supports COPY TO and we would undoubtedly want to support
>> COPY FROM for JSON as well, but is that required from the start?
>
> I would vote for including COPY FROM support from the start.
>
>> ! if (!cstate->opts.json_mode)
>
> I think it's unfortunate that this further complicates the branching in
> CopyOneRowTo(), but after some quick glances at the code, I'm not sure it's
> worth refactoring a bunch of stuff to make this nicer.
>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
On 12/1/23 22:00, Davin Shearer wrote:
> I'm really glad to see this taken up as a possible new feature and will
> definitely use it if it gets released. I'm impressed with how clean,
> understandable, and approachable the postgres codebase is in general and
> how easy it is to read and understand this patch.
>
> I reviewed the patch (though I didn't build and test the code) and have
> a concern with adding the '[' at the beginning and ']' at the end of the
> json output. Those are already added by `json_agg`
> (https://www.postgresql.org/docs/current/functions-aggregate.html) as
> you can see in my initial email. Adding them in the COPY TO may be
> redundant (e.g., [[{"key":"value"...}....]]).
With this patch in place you don't use json_agg() at all. See the
example output (this is real output with the patch applied):
(oops -- I meant to send this with the same email as the patch)
8<-------------------------------------------------
create table foo(id int8, f1 text, f2 timestamptz);
insert into foo
select g.i,
'line: ' || g.i::text,
clock_timestamp()
from generate_series(1,4) as g(i);
copy foo to stdout (format 'json');
[
{"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]
8<-------------------------------------------------
> I think COPY TO makes good sense to support, though COPY FROM maybe not
> so much as JSON isn't necessarily flat and rectangular like CSV.
Yeah -- definitely not as straight forward but possibly we just support
the array-of-jsonobj-rows as input as well?
> For my use-case, I'm emitting JSON files to Apache NiFi for processing,
> and NiFi has superior handling of JSON (via JOLT parsers) versus CSV
> where parsing is generally done with regex. I want to be able to emit
> JSON using a postgres function and thus COPY TO.
>
> Definitely +1 for COPY TO.
>
> I don't think COPY FROM will work out well unless the JSON is required
> to be flat and rectangular. I would vote -1 to leave it out due to the
> necessary restrictions making it not generally useful.
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 12/1/23 18:09, Nathan Bossart wrote:
> On Fri, Dec 01, 2023 at 02:28:55PM -0500, Joe Conway wrote:
>> I did a quick PoC patch (attached) -- if there is interest and no hard
>> objections I would like to get it up to speed for the January commitfest.
>
> Cool. I would expect there to be interest, given all the other JSON
> support that has been added thus far.
Thanks for the review
> I noticed that, with the PoC patch, "json" is the only format that must be
> quoted. Without quotes, I see a syntax error. I'm assuming there's a
> conflict with another json-related rule somewhere in gram.y, but I haven't
> tracked down exactly which one is causing it.
It seems to be because 'json' is also a type name ($$ =
SystemTypeName("json")).
What do you think about using 'json_array' instead? It is more specific
and accurate, and avoids the need to quote.
test=# copy foo to stdout (format json_array);
[
{"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]
>> 1. Is supporting JSON array format sufficient, or does it need to support
>> some other options? How flexible does the support scheme need to be?
>
> I don't presently have a strong opinion on this one. My instinct would be
> start with something simple, though. I don't think we offer any special
> options for log_destination...
WFM
>> 2. This only supports COPY TO and we would undoubtedly want to support
>> COPY FROM for JSON as well, but is that required from the start?
>
> I would vote for including COPY FROM support from the start.
Check. My thought is to only accept the same format we emit -- i.e. only
take a json array.
>> ! if (!cstate->opts.json_mode)
>
> I think it's unfortunate that this further complicates the branching in
> CopyOneRowTo(), but after some quick glances at the code, I'm not sure it's
> worth refactoring a bunch of stuff to make this nicer.
Yeah that was my conclusion.
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Joe Conway <mail@joeconway.com> writes:
>> I noticed that, with the PoC patch, "json" is the only format that must be
>> quoted. Without quotes, I see a syntax error. I'm assuming there's a
>> conflict with another json-related rule somewhere in gram.y, but I haven't
>> tracked down exactly which one is causing it.
While I've not looked too closely, I suspect this might be due to the
FORMAT_LA hack in base_yylex:
/* Replace FORMAT by FORMAT_LA if it's followed by JSON */
switch (next_token)
{
case JSON:
cur_token = FORMAT_LA;
break;
}
So if you are writing a production that might need to match
FORMAT followed by JSON, you need to match FORMAT_LA too.
(I spent a little bit of time last week trying to get rid of
FORMAT_LA, thinking that it didn't look necessary. Did not
succeed yet.)
regards, tom lane