substring on bit(n) and bytea types is slow

Started by Evgeny Morozov · about 10 years ago · 4 messages · general
#1 Evgeny Morozov
evgeny.morozov+list+pgsql@shift-technology.com

Hi,

Queries like this:

SELECT substring(bitarray from (32 * (n - 1) + 1) for 32) -- bitarray is a column of type bit(64000000)
FROM array_test_bit
JOIN generate_series(1, 10000) n ON true;

SELECT substring(bytearr from (8 * (n - 1) + 1) for 8) -- bytearr is a column of type bytea
FROM array_test_bytea
JOIN generate_series(1, 10000) n ON true;

...are really slow: each takes over a minute, with a postgres backend
process pinned at 100% CPU while the query runs. The same thing in SQL
Server 2014 (using varbinary(max) columns) runs fast - about 20 seconds for
4 million rows. Are byte/bit arrays just inherently slow in Postgres, or is
substring the wrong function to use on them?

The context is that I want to efficiently store many integers. The obvious
answer is integer[], but most of my integers can fit into less than 32
bits, so I'd like to see if I can pack them more efficiently.
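
For instance, if each value fit in 24 bits, the read side of such a packing
scheme might look something like this (just a sketch of the idea, reusing the
bytearr column from the query above; get_byte offsets are zero-based):

```sql
-- Hypothetical: unpack the n-th 24-bit value from a packed bytea column
SELECT (get_byte(bytearr, 3 * (n - 1)) << 16)
     | (get_byte(bytearr, 3 * (n - 1) + 1) << 8)
     |  get_byte(bytearr, 3 * (n - 1) + 2) AS value_n
FROM array_test_bytea
JOIN generate_series(1, 10000) n ON true;
```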

Regards,

Evgeny Morozov

#2 Arjen Nienhuis
a.g.nienhuis@gmail.com
In reply to: Evgeny Morozov (#1)
Re: substring on bit(n) and bytea types is slow

On Feb 29, 2016 22:26, "Evgeny Morozov" <evgeny.morozov+list+pgsql@shift-technology.com> wrote:

> SELECT substring(bitarray from (32 * (n - 1) + 1) for 32) -- bitarray is a column of type bit(64000000)
> FROM array_test_bit
> JOIN generate_series(1, 10000) n ON true;

Substring on a bit string is not optimized for long TOASTed values.
Substring on text is optimized for that. The current code fetches the whole
8MB from the table every time.
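
For the bytea case there is a possible workaround (a sketch; the sliced fetch
only applies when the value is stored out of line and uncompressed, which is
what EXTERNAL storage requests):

```sql
-- Ask for out-of-line, uncompressed storage so substring(bytea) can fetch
-- just the requested slice instead of detoasting the whole value.
ALTER TABLE array_test_bytea ALTER COLUMN bytearr SET STORAGE EXTERNAL;

-- The setting only affects newly stored values, so force a rewrite of the
-- existing rows (appending an empty bytea is a value-preserving no-op).
UPDATE array_test_bytea SET bytearr = bytearr || ''::bytea;
```

For bit(n) there is no such fast path in substring, so repacking the data
into bytea or text is the more promising route.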

#3 Evgeny Morozov
evgeny.morozov@shift-technology.com
In reply to: Arjen Nienhuis (#2)
Re: substring on bit(n) and bytea types is slow

On 2 March 2016 at 00:33, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:

> On Feb 29, 2016 22:26, "Evgeny Morozov" <evgeny.morozov+list+pgsql@shift-technology.com> wrote:
>
>> SELECT substring(bitarray from (32 * (n - 1) + 1) for 32) -- bitarray is a column of type bit(64000000)
>> FROM array_test_bit
>> JOIN generate_series(1, 10000) n ON true;
>
> Substring on a bit string is not optimized for long TOASTed values.
> Substring on text is optimized for that. The current code fetches the whole
> 8MB from the table every time.

I see, thanks. Is there a better way to pack a large number of integers
efficiently with reasonable read/write performance?

I tried arrays of bit varying, which seemed perfect, but in practice when I
stored 4M integers in one, each taking as few bits as possible, the table
takes 13 MB - the same as if I just stored them all as bit(24). In fact, an
array of 4M bit(10) integers also takes 13 MB, while bit(8) takes only
0.7 MB. bit(9) is where things get weird: for the integers 1 to 4M it takes
13 MB, but if I multiply them by 2 (i.e. store 4M even integers) it takes
0.7 MB! So there must be some kind of compression going on there, but I
don't understand how it works.
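
The measurements above can be reproduced with something along these lines (a
sketch only: the table name array_test_varbit is made up, exact sizes will
vary by version, and casting an integer to bit(9) keeps just its rightmost
9 bits):

```sql
-- Store 4M even integers as a single bit(9) array and measure the result.
CREATE TABLE array_test_varbit AS
SELECT array_agg((i * 2)::bit(9)) AS a
FROM generate_series(1, 4000000) i;

SELECT pg_size_pretty(pg_total_relation_size('array_test_varbit'));
```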
