json_to_recordset() and CTE performance

Started by Matt DeLuco · over 5 years ago · 3 messages · general
#1Matt DeLuco
matt@deluco.net

Hello,

I’ve written an SQL function that takes a single JSON parameter, in this case an array of objects, each with eight properties (example below). The function uses json_to_recordset() in a CTE to insert three rows across two tables for each input record. It takes nearly 7 minutes to insert my dataset of 8935 records (in JSON), driven by a small Python script (the script parses the JSON in under 1s), on PostgreSQL 13 on my MacBook Air.

As an experiment I wrote a second SQL function with eight parameters, one per property of the JSON objects, requiring a discrete function call per record. A Python script that inserts all 8935 records this way (making that many calls to the SQL function) executes in around 2 minutes.
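
(Roughly the shape of that second function; the function name and return type here are illustrative, and the body is abbreviated:)

CREATE OR REPLACE FUNCTION journal.create_single(
    in_bank_account_id   BIGINT,
    in_transaction_id    TEXT,
    in_transaction_date  DATE,
    in_posted_date       DATE,
    in_amount            finance.monetary,
    in_description       TEXT,
    in_parent_account_id BIGINT,
    in_child_account_id  BIGINT
) RETURNS BIGINT
AS $$
    -- The real body performs the same journal.journal and
    -- journal.account2journal inserts as the batch function below,
    -- for a single record; abbreviated to the id allocation here.
    SELECT nextval(pg_get_serial_sequence('journal', 'id'));
$$ LANGUAGE sql;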

I’m very much a novice at interpreting EXPLAIN (ANALYZE) and am hoping someone can help me better optimize my original function. Both the function and the EXPLAIN ANALYZE output are provided below. Is this perhaps a limitation of CTEs or json_to_recordset(), such that an entirely different approach is necessary (like the second function, with one call per record)? I was really hoping to make this work: I’ve written a small API to my database using SQL and PL/pgSQL functions, each taking a single JSON parameter, so my web backend acts almost as an HTTP proxy to the database. It’s a different way of doing things as far as webdev goes, but (for me) an interesting experiment. I like PostgreSQL and really want to take advantage of it.

I’ve tried a few “optimizations”. Removing the final SELECT and returning VOID saves around 1.5 minutes. Removing some extra JOINs saves a little time, but nothing substantial (the joins against account.t_account in some places are there to check “ownership” of the record by a given client_id).
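
(Since the inserts live in CTEs, the VOID version makes the last INSERT the top-level statement instead of ending with a SELECT. A self-contained toy of that shape, not my actual function:)

CREATE TABLE demo_a (id BIGINT);
CREATE TABLE demo_b (id BIGINT);

-- Toy illustration only: the final INSERT is the top-level statement,
-- so no result set is built or aggregated.
CREATE FUNCTION demo_void(in_json JSON) RETURNS VOID
AS $$
    WITH data AS (
        SELECT (elem::TEXT)::BIGINT AS id
        FROM json_array_elements(in_json) AS t(elem)
    ),
    first_insert AS (
        INSERT INTO demo_a SELECT id FROM data
    )
    INSERT INTO demo_b SELECT id FROM data;
$$ LANGUAGE sql;

-- SELECT demo_void('[1, 2, 3]'::JSON);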

Using a subquery seems like it could be much faster than a CTE, but I don’t know how to insert three rows across two tables using a subquery.

Any advice is appreciated, thank you in advance.

Matt

INPUT:
(The CTE illustrates all the properties and types)

[
    {
        "bank_account_id": 1324,
        "transaction_id": "abc123",
        "transaction_date": "2020-10-20",
        ...
    },
    ...
]

OUTPUT:
(Not sure what I’ve done to create the nested arrays, but it’s unnecessary;
as noted above, I can shave off ~1.5 min by returning VOID.)

[
    [
        {
            "id": 250185
        },
        {
            "id": 250186
        },
        ...
    ]
]

FUNCTION:

CREATE OR REPLACE FUNCTION journal.create_with_categories(in_json JSON)
RETURNS JSON
AS $$

WITH data AS (
    SELECT
        (in_json#>>'{context, client_id}')::BIGINT AS client_id,
        nextval(pg_get_serial_sequence('journal', 'id')) AS journal_id,
        bank_transaction_type,
        x.*
    FROM json_to_recordset(in_json->'data')
        AS x (
            bank_account_id BIGINT,
            transaction_id TEXT,
            transaction_date DATE,
            posted_date DATE,
            amount finance.monetary,
            description TEXT,
            parent_account_id BIGINT,
            child_account_id BIGINT
        ),
        LATERAL bank_account.get_transaction_type_by_id(x.bank_account_id, x.amount)
            AS bank_transaction_type
),
insert_journal_entry AS (
    INSERT INTO journal.journal (
        client_id, id, bank_account_id,
        transaction_id, transaction_date, posted_date,
        description
    )
    SELECT
        client_id, journal_id, bank_account_id,
        transaction_id, transaction_date, posted_date,
        description
    FROM data
),
insert_account_entries AS (
    INSERT INTO journal.account2journal (
        account_id, journal_id, amount, type
    )
    -- T account
    SELECT
        t.id,
        d.journal_id,
        @ d.amount,
        CASE WHEN d.bank_transaction_type = 'debit'::transaction_type
            THEN 'credit'::transaction_type
            ELSE 'debit'::transaction_type
        END
    FROM data d
    LEFT JOIN account.t_account t
        ON (t.id = COALESCE(d.child_account_id, d.parent_account_id))
    WHERE t.client_id = d.client_id OR t.id IS NULL

    UNION ALL

    -- bank account
    SELECT
        t.id, d.journal_id, @ d.amount, d.bank_transaction_type
    FROM data d
    JOIN bank_account.bank_account b
        ON (b.id = d.bank_account_id)
    JOIN account.t_account t
        ON (t.id = b.t_account_id)
    WHERE
        t.client_id = d.client_id
)
SELECT json_agg(d) FROM (SELECT d.journal_id AS id FROM data AS d) AS d;

$$ LANGUAGE sql;

EXPLAIN ANALYZE:
(From logs)

Aggregate  (cost=24.24..24.25 rows=1 width=32) (actual time=388926.249..388926.371 rows=1 loops=1)
  Buffers: shared hit=53877 dirtied=2
  CTE data
    ->  Nested Loop  (cost=0.26..4.76 rows=100 width=148) (actual time=183.906..388716.550 rows=8935 loops=1)
          Buffers: shared hit=53877 dirtied=2
          ->  Function Scan on json_to_recordset x  (cost=0.01..1.00 rows=100 width=128) (actual time=130.645..142.316 rows=8935 loops=1)
          ->  Function Scan on get_transaction_type_by_id bank_transaction_type  (cost=0.25..0.26 rows=1 width=4) (actual time=0.154..0.156 rows=1 loops=8935)
                Buffers: shared hit=18054
  CTE insert_journal_entry
    ->  Insert on journal  (cost=0.00..2.00 rows=100 width=96) (actual time=453.563..453.563 rows=0 loops=1)
          Buffers: shared hit=79242 dirtied=295
          ->  CTE Scan on data  (cost=0.00..2.00 rows=100 width=96) (actual time=0.006..10.001 rows=8935 loops=1)
  CTE insert_account_entries
    ->  Insert on account2journal  (cost=4.86..15.23 rows=2 width=52) (actual time=816.381..816.381 rows=0 loops=1)
          Buffers: shared hit=159273 dirtied=335 written=17
          ->  Result  (cost=4.86..15.23 rows=2 width=52) (actual time=0.206..109.222 rows=17870 loops=1)
                Buffers: shared hit=5
                ->  Append  (cost=4.86..15.20 rows=2 width=52) (actual time=0.197..95.060 rows=17870 loops=1)
                      Buffers: shared hit=5
                      ->  Hash Left Join  (cost=4.86..7.14 rows=1 width=52) (actual time=0.195..35.512 rows=8935 loops=1)
                            Hash Cond: (COALESCE(d_1.child_account_id, d_1.parent_account_id) = t.id)
                            Filter: ((t.client_id = d_1.client_id) OR (t.id IS NULL))
                            Buffers: shared hit=2
                            ->  CTE Scan on data d_1  (cost=0.00..2.00 rows=100 width=68) (actual time=0.004..6.544 rows=8935 loops=1)
                            ->  Hash  (cost=3.27..3.27 rows=127 width=16) (actual time=0.137..0.137 rows=127 loops=1)
                                  Buffers: shared hit=2
                                  ->  Seq Scan on t_account t  (cost=0.00..3.27 rows=127 width=16) (actual time=0.026..0.073 rows=127 loops=1)
                                        Buffers: shared hit=2
                      ->  Hash Join  (cost=3.80..8.03 rows=1 width=52) (actual time=40.182..53.796 rows=8935 loops=1)
                            Hash Cond: ((t_1.id = b.t_account_id) AND (t_1.client_id = d_2.client_id))
                            Buffers: shared hit=3
                            ->  Seq Scan on t_account t_1  (cost=0.00..3.27 rows=127 width=16) (actual time=0.022..0.079 rows=127 loops=1)
                                  Buffers: shared hit=2
                            ->  Hash  (cost=3.59..3.59 rows=14 width=60) (actual time=40.118..40.118 rows=8935 loops=1)
                                  Buffers: shared hit=1
                                  ->  Hash Join  (cost=1.32..3.59 rows=14 width=60) (actual time=0.071..17.863 rows=8935 loops=1)
                                        Hash Cond: (d_2.bank_account_id = b.id)
                                        Buffers: shared hit=1
                                        ->  CTE Scan on data d_2  (cost=0.00..2.00 rows=100 width=60) (actual time=0.005..3.740 rows=8935 loops=1)
                                        ->  Hash  (cost=1.14..1.14 rows=14 width=16) (actual time=0.030..0.030 rows=14 loops=1)
                                              Buffers: shared hit=1
                                              ->  Seq Scan on bank_account b  (cost=0.00..1.14 rows=14 width=16) (actual time=0.012..0.016 rows=14 loops=1)
                                                    Buffers: shared hit=1
  ->  CTE Scan on data d  (cost=0.00..2.00 rows=100 width=8) (actual time=183.918..388812.950 rows=8935 loops=1)
        Buffers: shared hit=53877 dirtied=2
Trigger for constraint journal_client_id_fkey on journal: time=194.194 calls=8935
Trigger for constraint journal_bank_account_id_fkey on journal: time=204.014 calls=8935
Trigger trigger_journal_import_sequence on journal: time=373.344 calls=1
Trigger for constraint account2journal_account_id_fkey on account2journal: time=580.482 calls=17870
Trigger trigger_debits_equal_credits_on_insert on account2journal: time=116.653 calls=1

#2Michael Lewis
mlewis@entrata.com
In reply to: Matt DeLuco (#1)
Re: json_to_recordset() and CTE performance

Which version? What is the value of work_mem, and what other configs are
non-default? I see some estimates that are rather off, like -

->  Nested Loop  (cost=0.26..4.76 rows=100 width=148) (actual time=183.906..388716.550 rows=8935 loops=1)
      Buffers: shared hit=53877 dirtied=2
      ->  Function Scan on json_to_recordset x  (cost=0.01..1.00 *rows=100* width=128) (actual time=130.645..142.316 *rows=8935* loops=1)
      ->  Function Scan on get_transaction_type_by_id bank_transaction_type  (cost=0.25..0.26 rows=1 width=4) (actual time=0.154..0.156 rows=1 loops=8935)
            Buffers: shared hit=18054

Sometimes putting the data into a temp table and analyzing it can be rather
helpful, to ensure the planner has statistics on the number of records,
ndistinct, most common values, etc. I would try doing that with the result
of json_to_recordset, skipping the function call
to get_transaction_type_by_id until a later step, just to see how it performs.
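
Something along these lines, shown as the statements you'd run inside a
plpgsql version of your function (untested sketch, reusing your column list):

CREATE TEMP TABLE data_tmp ON COMMIT DROP AS
SELECT x.*
FROM json_to_recordset(in_json->'data') AS x (
    bank_account_id   BIGINT,
    transaction_id    TEXT,
    transaction_date  DATE,
    posted_date       DATE,
    amount            finance.monetary,
    description       TEXT,
    parent_account_id BIGINT,
    child_account_id  BIGINT
);
ANALYZE data_tmp;  -- gives the planner real row counts and statistics
-- ...then both INSERT ... SELECT statements read FROM data_tmp, and
-- get_transaction_type_by_id() is applied at that later step.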

That said, it seems the row estimate from json_to_recordset may be hardcoded
at 100. I haven't checked the source code, but I know that when defining a
set-returning function there is a ROWS option, which gives the planner a
static value to assume for the function's output, so a fixed default like
that would make sense.
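
For your own set-returning functions the syntax looks like this (an
illustrative toy, not your schema; json_to_recordset itself can't be
changed this way):

-- ROWS gives the planner a fixed row-count estimate for a set-returning
-- function; without it, a SQL-language SRF is assumed to return 1000 rows.
CREATE FUNCTION expand_ids(arr JSON)
RETURNS SETOF BIGINT
LANGUAGE sql
ROWS 10000  -- planner now assumes ~10000 rows per call
AS $$
    SELECT (elem::TEXT)::BIGINT
    FROM json_array_elements(arr) AS t(elem)
$$;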

#3Matt DeLuco
matt@deluco.net
In reply to: Michael Lewis (#2)
Re: json_to_recordset() and CTE performance

PostgreSQL 13.0.

You’d have to be specific about the configs you’re looking for. I’m using Postgres.app (postgresapp.com) and am uncertain whether it’s distributed with non-default configs.

But a quick grep shows these items are configured:
max_wal_size = 1GB
min_wal_size = 80MB
shared_buffers = 128MB

work_mem is not configured, so presumably it’s the default of 4MB.
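
(If it helps, I believe the full set of non-default settings can be listed with:)

SELECT name, setting, source
FROM pg_settings
WHERE source <> 'default';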

I’ll try the temp tables. That seems similar to what I found searching online - are you suggesting that as a permanent solution, or just as a means to better analyze performance?

Thanks,

Matt
