substring on bit(n) and bytea types is slow

Started by Evgeny Morozov · about 10 years ago · 4 messages · general
#1 Evgeny Morozov
evgeny.morozov+list+pgsql@shift-technology.com

Hi,

Queries like this:

SELECT substring(bitarray from (32 * (n - 1) + 1) for 32) -- bitarray is a column of type bit(64000000)
FROM array_test_bit
JOIN generate_series(1, 10000) n ON true;

SELECT substring(bytearr from (8 * (n - 1) + 1) for 8) -- bytearr is a column of type bytea
FROM array_test_bytea
JOIN generate_series(1, 10000) n ON true;

...are really slow: each takes over a minute, with a postgres backend
process pinned at 100% CPU while the query runs. The same thing in SQL
Server 2014 (using varbinary(max) columns) runs fast - about 20 seconds for
4 million rows. Are byte/bit arrays just inherently slow in Postgres, or is
substring the wrong function to use on them?

The context is that I want to efficiently store many integers. The obvious
answer is integer[], but most of my integers can fit into less than 32
bits, so I'd like to see if I can pack them more efficiently.
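
For instance, if each value fit in 24 bits, the read side of such a packing
scheme might look something like this (just a sketch of the idea, reusing the
bytearr column from the query above; get_byte offsets are zero-based):

```sql
-- Hypothetical: unpack the n-th 24-bit value from a packed bytea column
SELECT (get_byte(bytearr, 3 * (n - 1)) << 16)
     | (get_byte(bytearr, 3 * (n - 1) + 1) << 8)
     |  get_byte(bytearr, 3 * (n - 1) + 2) AS value_n
FROM array_test_bytea
JOIN generate_series(1, 10000) n ON true;
```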

Regards,

Evgeny Morozov

#2 Arjen Nienhuis
a.g.nienhuis@gmail.com
In reply to: Evgeny Morozov (#1)
Re: substring on bit(n) and bytea types is slow

On Feb 29, 2016 22:26, "Evgeny Morozov" <evgeny.morozov+list+pgsql@shift-technology.com> wrote:

> SELECT substring(bitarray from (32 * (n - 1) + 1) for 32) -- bitarray is a column of type bit(64000000)
> FROM array_test_bit
> JOIN generate_series(1, 10000) n ON true;

Substring on a bit string is not optimized for long TOASTed values.
Substring on text is optimized for that. The current code fetches the whole
8MB from the table every time.
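
For the bytea case there is a possible workaround (a sketch; the sliced fetch
only applies when the value is stored out of line and uncompressed, which is
what EXTERNAL storage requests):

```sql
-- Ask for out-of-line, uncompressed storage so substring(bytea) can fetch
-- just the requested slice instead of detoasting the whole value.
ALTER TABLE array_test_bytea ALTER COLUMN bytearr SET STORAGE EXTERNAL;

-- The setting only affects newly stored values, so force a rewrite of the
-- existing rows (appending an empty bytea is a value-preserving no-op).
UPDATE array_test_bytea SET bytearr = bytearr || ''::bytea;
```

For bit(n) there is no such fast path in substring, so repacking the data
into bytea or text is the more promising route.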

#3 Evgeny Morozov
evgeny.morozov@shift-technology.com
In reply to: Arjen Nienhuis (#2)
Re: substring on bit(n) and bytea types is slow

On 2 March 2016 at 00:33, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:

> On Feb 29, 2016 22:26, "Evgeny Morozov" <evgeny.morozov+list+pgsql@shift-technology.com> wrote:
>
>> SELECT substring(bitarray from (32 * (n - 1) + 1) for 32) -- bitarray is a column of type bit(64000000)
>> FROM array_test_bit
>> JOIN generate_series(1, 10000) n ON true;
>
> Substring on a bit string is not optimized for long TOASTed values.
> Substring on text is optimized for that. The current code fetches the whole
> 8MB from the table every time.

I see, thanks. Is there a better way to pack a large number of integers
efficiently with reasonable read/write performance?

I tried arrays of bit varying, which seemed perfect, but in practice when I
stored 4M integers in one, each taking as few bits as possible, the table
takes 13 MB - the same as if I just stored them all as bit(24). In fact, an
array of 4M bit(10) integers also takes 13 MB, while bit(8) takes only
0.7 MB. bit(9) is where things get weird: for the integers 1 to 4M it takes
13 MB, but if I multiply them by 2 (i.e. store 4M even integers) it takes
0.7 MB! So there must be some kind of compression going on there, but I
don't understand how it works.
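
The measurements above can be reproduced with something along these lines (a
sketch only: the table name array_test_varbit is made up, exact sizes will
vary by version, and casting an integer to bit(9) keeps just its rightmost
9 bits):

```sql
-- Store 4M even integers as a single bit(9) array and measure the result.
CREATE TABLE array_test_varbit AS
SELECT array_agg((i * 2)::bit(9)) AS a
FROM generate_series(1, 4000000) i;

SELECT pg_size_pretty(pg_total_relation_size('array_test_varbit'));
```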
