Way to check whether a particular block is on the shared_buffer?
Hello,
Do we have a reliable way to check whether a particular heap block
is already on the shared buffer, but not modify?
Right now, ReadBuffer and ReadBufferExtended are entrypoint of the
buffer manager for extensions. However, it tries to acquire an
available buffer pool instead of the victim buffer, regardless of
the ReadBufferMode.
It is different from what I want to do:
1. Check whether the supplied BlockNum is already loaded on the
shared buffer.
2. If yes, the caller gets buffer descriptor as usual ReadBuffer.
3. If not, the caller gets InvalidBuffer without modification of
the shared buffer, also no victim buffer pool.
It allows extensions (likely a custom scan provider) to take
different strategies for large table's scan, according to the
latest status of individual blocks.
If we don't have these interface, it seems to me an enhancement
of the ReadBuffer_common and (Local)BufferAlloc is the only way
to implement the feature.
Of course, we need careful investigation definition of the 'valid'
buffer pool. How about a buffer pool with BM_IO_IN_PROGRESS?
How about a buffer pool that needs storage extend (thus, no relevant
physical storage does not exists yet)? ... and so on.
As an aside, background of my motivation is the slide below:
http://www.slideshare.net/kaigai/sqlgpussd-english
(LT slides in JPUG conference last Dec)
I'm under investigation of SSD-to-GPU direct feature on top of
the custom-scan interface. It intends to load a bunch of data
blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
loading onto CPU/RAM, to preprocess the data to be filtered out.
It only makes sense if the target blocks are not loaded to the
CPU/RAM yet, because SSD device is essentially slower than RAM.
So, I like to have a reliable way to check the latest status of
the shared buffer, to kwon whether a particular block is already
loaded or not.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 1/31/16 7:38 PM, Kouhei Kaigai wrote:
I'm under investigation of SSD-to-GPU direct feature on top of
the custom-scan interface. It intends to load a bunch of data
blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
loading onto CPU/RAM, to preprocess the data to be filtered out.
It only makes sense if the target blocks are not loaded to the
CPU/RAM yet, because SSD device is essentially slower than RAM.
So, I like to have a reliable way to check the latest status of
the shared buffer, to kwon whether a particular block is already
loaded or not.
That completely ignores the OS cache though... wouldn't that be a major
issue?
To answer your direct question, I'm no expert, but I haven't seen any
functions that do exactly what you want. You'd have to pull relevant
bits from ReadBuffer_*. Or maybe a better method would just be to call
BufTableLookup() without any locks and if you get a result > -1 just
call the relevant ReadBuffer function. Sometimes you'll end up calling
ReadBuffer even though the buffer isn't in shared buffers, but I would
think that would be a rare occurrence.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 1/31/16 7:38 PM, Kouhei Kaigai wrote:
I'm under investigation of SSD-to-GPU direct feature on top of
the custom-scan interface. It intends to load a bunch of data
blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
loading onto CPU/RAM, to preprocess the data to be filtered out.
It only makes sense if the target blocks are not loaded to the
CPU/RAM yet, because SSD device is essentially slower than RAM.
So, I like to have a reliable way to check the latest status of
the shared buffer, to kwon whether a particular block is already
loaded or not.That completely ignores the OS cache though... wouldn't that be a major
issue?
Once we can ensure the target block is not cached in the shared buffer,
it is a job of the driver that support P2P DMA to handle OS page cache.
Once driver get a P2P DMA request from PostgreSQL, it checks OS page
cache status and determine the DMA source; whether OS buffer or SSD block.
To answer your direct question, I'm no expert, but I haven't seen any
functions that do exactly what you want. You'd have to pull relevant
bits from ReadBuffer_*. Or maybe a better method would just be to call
BufTableLookup() without any locks and if you get a result > -1 just
call the relevant ReadBuffer function. Sometimes you'll end up calling
ReadBuffer even though the buffer isn't in shared buffers, but I would
think that would be a rare occurrence.
Thanks, indeed, extension can call BufTableLookup(). PrefetchBuffer()
has a good example for this.
If it returned a valid buf_id, we have nothing difficult; just call
ReadBuffer() to pin the buffer.
Elsewhere, when BufTableLookup() returned negative, it means a pair of
(relation, forknum, blocknum) does not exist on the shared buffer.
So, extension enqueues P2P DMA request for asynchronous translation,
then driver processes the P2P DMA soon but later.
Concurrent access may always happen. PostgreSQL uses MVCC, so the backend
which issued P2P DMA does not need to pay attention for new tuples that
didn't exist on executor start time, even if other backend loads and
updates the same buffer just after the above BufTableLookup().
On the other hands, we have to pay attention whether a fraction of
the buffer page is partially written to OS buffer or storage. It is
in the scope of operating system, so it is not controllable from us.
One idea I can find out is, temporary suspension of FlushBuffer() for
a particular pairs of (relation, forknum, blocknum) until P2P DMA gets
completed. Even if concurrent backend updates the buffer page after the
BufTableLookup(), it allows to prevent OS caches and storages getting
dirty during the P2P DMA.
How about people's thought?
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
KaiGai-san,
On 2016/02/01 10:38, Kouhei Kaigai wrote:
As an aside, background of my motivation is the slide below:
http://www.slideshare.net/kaigai/sqlgpussd-english
(LT slides in JPUG conference last Dec)I'm under investigation of SSD-to-GPU direct feature on top of
the custom-scan interface. It intends to load a bunch of data
blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
loading onto CPU/RAM, to preprocess the data to be filtered out.
It only makes sense if the target blocks are not loaded to the
CPU/RAM yet, because SSD device is essentially slower than RAM.
So, I like to have a reliable way to check the latest status of
the shared buffer, to kwon whether a particular block is already
loaded or not.
Quite interesting stuff, thanks for sharing!
I'm in no way expert on this but could this generally be attacked from the
smgr API perspective? Currently, we have only one implementation - md.c
(the hard-coded RelationData.smgr_which = 0). If we extended that and
provided end-to-end support so that there would be md.c alternatives to
storage operations, I guess that would open up opportunities for
extensions to specify smgr_which as an argument to ReadBufferExtended(),
provided there is already support in place to install md.c alternatives
(perhaps in .so). Of course, these are just musings and, perhaps does not
really concern the requirements of custom scan methods you have been
developing.
Thanks,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
KaiGai-san,
On 2016/02/01 10:38, Kouhei Kaigai wrote:
As an aside, background of my motivation is the slide below:
http://www.slideshare.net/kaigai/sqlgpussd-english
(LT slides in JPUG conference last Dec)I'm under investigation of SSD-to-GPU direct feature on top of
the custom-scan interface. It intends to load a bunch of data
blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
loading onto CPU/RAM, to preprocess the data to be filtered out.
It only makes sense if the target blocks are not loaded to the
CPU/RAM yet, because SSD device is essentially slower than RAM.
So, I like to have a reliable way to check the latest status of
the shared buffer, to kwon whether a particular block is already
loaded or not.Quite interesting stuff, thanks for sharing!
I'm in no way expert on this but could this generally be attacked from the
smgr API perspective? Currently, we have only one implementation - md.c
(the hard-coded RelationData.smgr_which = 0). If we extended that and
provided end-to-end support so that there would be md.c alternatives to
storage operations, I guess that would open up opportunities for
extensions to specify smgr_which as an argument to ReadBufferExtended(),
provided there is already support in place to install md.c alternatives
(perhaps in .so). Of course, these are just musings and, perhaps does not
really concern the requirements of custom scan methods you have been
developing.
Thanks for your idea. Indeed, smgr hooks are good candidate to implement
the feature, however, what I need is a thin intermediation layer rather
than alternative storage engine.
It becomes clear we need two features here.
1. A feature to check whether a particular block is already on the shared
buffer pool.
It is available. BufTableLookup() under the BufMappingPartitionLock
gives us the information we want.
2. A feature to suspend i/o write-out towards a particular blocks
that are registered by other concurrent backend, unless it is not
unregistered (usually, at the end of P2P DMA).
==> to be discussed.
When we call smgrwrite(), like FlushBuffer(), it fetches function pointer
from the 'smgrsw' array, then calls smgr_write.
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer, bool skipFsync)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
buffer, skipFsync);
}
If extension would overwrite smgrsw[] array, then call the original
function under the control by extension, it allows to suspend the call
of the original smgr_write until completion of P2P DMA.
It may be a minimum invasive way to implement, and portable to any
further storage layers.
How about your thought? Even though it is a bit different from your
original proposition.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Kouhei Kaigai wrote:
On 1/31/16 7:38 PM, Kouhei Kaigai wrote:
To answer your direct question, I'm no expert, but I haven't seen any
functions that do exactly what you want. You'd have to pull relevant
bits from ReadBuffer_*. Or maybe a better method would just be to call
BufTableLookup() without any locks and if you get a result > -1 just
call the relevant ReadBuffer function. Sometimes you'll end up calling
ReadBuffer even though the buffer isn't in shared buffers, but I would
think that would be a rare occurrence.Thanks, indeed, extension can call BufTableLookup(). PrefetchBuffer()
has a good example for this.If it returned a valid buf_id, we have nothing difficult; just call
ReadBuffer() to pin the buffer.
Isn't this what (or very similar to)
ReadBufferExtended(RBM_ZERO_AND_LOCK) is already doing?
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 1/31/16 7:38 PM, Kouhei Kaigai wrote:
To answer your direct question, I'm no expert, but I haven't seen any
functions that do exactly what you want. You'd have to pull relevant
bits from ReadBuffer_*. Or maybe a better method would just be to call
BufTableLookup() without any locks and if you get a result > -1 just
call the relevant ReadBuffer function. Sometimes you'll end up calling
ReadBuffer even though the buffer isn't in shared buffers, but I would
think that would be a rare occurrence.Thanks, indeed, extension can call BufTableLookup(). PrefetchBuffer()
has a good example for this.If it returned a valid buf_id, we have nothing difficult; just call
ReadBuffer() to pin the buffer.Isn't this what (or very similar to)
ReadBufferExtended(RBM_ZERO_AND_LOCK) is already doing?
This operation actually acquires a buffer page, fills up with zero
and a valid buffer page is wiped out if no free buffer page.
I want to keep the contents of the shared buffer already loaded on
the main memory. P2P DMA and GPU preprocessing intends to minimize
main memory consumption by rows to be filtered by scan qualifiers.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
KaiGai-san,
On 2016/02/01 10:38, Kouhei Kaigai wrote:
As an aside, background of my motivation is the slide below:
http://www.slideshare.net/kaigai/sqlgpussd-english
(LT slides in JPUG conference last Dec)I'm under investigation of SSD-to-GPU direct feature on top of
the custom-scan interface. It intends to load a bunch of data
blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
loading onto CPU/RAM, to preprocess the data to be filtered out.
It only makes sense if the target blocks are not loaded to the
CPU/RAM yet, because SSD device is essentially slower than RAM.
So, I like to have a reliable way to check the latest status of
the shared buffer, to kwon whether a particular block is already
loaded or not.Quite interesting stuff, thanks for sharing!
I'm in no way expert on this but could this generally be attacked from the
smgr API perspective? Currently, we have only one implementation - md.c
(the hard-coded RelationData.smgr_which = 0). If we extended that and
provided end-to-end support so that there would be md.c alternatives to
storage operations, I guess that would open up opportunities for
extensions to specify smgr_which as an argument to ReadBufferExtended(),
provided there is already support in place to install md.c alternatives
(perhaps in .so). Of course, these are just musings and, perhaps does not
really concern the requirements of custom scan methods you have been
developing.Thanks for your idea. Indeed, smgr hooks are good candidate to implement
the feature, however, what I need is a thin intermediation layer rather
than alternative storage engine.It becomes clear we need two features here.
1. A feature to check whether a particular block is already on the shared
buffer pool.
It is available. BufTableLookup() under the BufMappingPartitionLock
gives us the information we want.2. A feature to suspend i/o write-out towards a particular blocks
that are registered by other concurrent backend, unless it is not
unregistered (usually, at the end of P2P DMA).
==> to be discussed.When we call smgrwrite(), like FlushBuffer(), it fetches function pointer
from the 'smgrsw' array, then calls smgr_write.void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer, bool skipFsync)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
buffer, skipFsync);
}If extension would overwrite smgrsw[] array, then call the original
function under the control by extension, it allows to suspend the call
of the original smgr_write until completion of P2P DMA.It may be a minimum invasive way to implement, and portable to any
further storage layers.How about your thought? Even though it is a bit different from your
original proposition.
I tried to design a draft of enhancement to realize the above i/o write-out
suspend/resume, with less invasive way as possible as we can.
ASSUMPTION: I intend to implement this feature as a part of extension,
because this i/o suspend/resume checks are pure overhead increment
for the core features, unless extension which utilizes it.
Three functions shall be added:
extern int GetStorageMgrNumbers(void);
extern f_smgr GetStorageMgrHandlers(int smgr_which);
extern void SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);
As literal, GetStorageMgrNumbers() returns the number of storage manager
currently installed. It always return 1 right now.
GetStorageMgrHandlers() returns the currently configured f_smgr table to
the supplied smgr_which. It allows extensions to know current configuration
of the storage manager, even if other extension already modified it.
SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
the current one.
If extension wants to intermediate 'smgr_write', extension will replace
the 'smgr_write' by own function, then call the original function, likely
mdwrite, from the alternative function.
In this case, call chain shall be:
FlushBuffer, and others...
+-- smgrwrite(...)
+-- (extension's own function)
+-- mdwrite
Once extension's own function blocks write i/o until P2P DMA completed by
concurrent process, we don't need to care about partial update of OS cache
or storage device.
It is not difficult for extensions to implement a feature to track/untrack
a pair of (relFileNode, forkNum, blockNum), automatic untracking according
to the resource-owner, and a mechanism to block the caller by P2P DMA
completion.
On the other hands, its flexibility seems to me a bit larger than necessity
(what I want to implement is just a blocker of buffer write i/o). And, it
may give people wrong impression for the feature of pluggable storage.
How about folk's thought?
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
2. A feature to suspend i/o write-out towards a particular blocks
that are registered by other concurrent backend, unless it is not
unregistered (usually, at the end of P2P DMA).
==> to be discussed.
I think there's still a race condition here though...
A
finds buffer not in shared buffers
B
reads buffer in
modifies buffer
starts writing buffer to OS
A
Makes call to block write, but write is already in process; thinks
writes are now blocked
Reads corrupted block
Much hilarity ensues
Or maybe you were just glossing over that part for brevity.
...
I tried to design a draft of enhancement to realize the above i/o write-out
suspend/resume, with less invasive way as possible as we can.ASSUMPTION: I intend to implement this feature as a part of extension,
because this i/o suspend/resume checks are pure overhead increment
for the core features, unless extension which utilizes it.Three functions shall be added:
extern int GetStorageMgrNumbers(void);
extern f_smgr GetStorageMgrHandlers(int smgr_which);
extern void SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);As literal, GetStorageMgrNumbers() returns the number of storage manager
currently installed. It always return 1 right now.
GetStorageMgrHandlers() returns the currently configured f_smgr table to
the supplied smgr_which. It allows extensions to know current configuration
of the storage manager, even if other extension already modified it.
SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
the current one.
If extension wants to intermediate 'smgr_write', extension will replace
the 'smgr_write' by own function, then call the original function, likely
mdwrite, from the alternative function.In this case, call chain shall be:
FlushBuffer, and others... +-- smgrwrite(...) +-- (extension's own function) +-- mdwrite
ISTR someone (Robert Haas?) complaining that this method of hooks is
cumbersome to use and can be fragile if multiple hooks are being
installed. So maybe we don't want to extend it's usage...
I'm also not sure whether this is better done with an smgr hook or a
hook into shared buffer handling...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
2. A feature to suspend i/o write-out towards a particular blocks
that are registered by other concurrent backend, unless it is not
unregistered (usually, at the end of P2P DMA).
==> to be discussed.I think there's still a race condition here though...
A
finds buffer not in shared buffersB
reads buffer in
modifies buffer
starts writing buffer to OSA
Makes call to block write, but write is already in process; thinks
writes are now blocked
Reads corrupted block
Much hilarity ensuesOr maybe you were just glossing over that part for brevity.
Thanks, this part was not clear from my previous description.
At the time when B starts writing buffer to OS, extension will catch
this i/o request using a hook around the smgrwrite, then the mechanism
registers the block to block P2P DMA request during B's write operation.
(Of course, it unregisters the block at end of the smgrwrite)
So, even if A wants to issue P2P DMA concurrently, it cannot register
the block until B's write operation.
In practical, this operation shall be "try lock", because B's write
operation implies existence of the buffer in main memory, so B does
not need to wait A's write operation if B switch DMA source from SSD
to main memory.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
...
I tried to design a draft of enhancement to realize the above i/o write-out
suspend/resume, with less invasive way as possible as we can.ASSUMPTION: I intend to implement this feature as a part of extension,
because this i/o suspend/resume checks are pure overhead increment
for the core features, unless extension which utilizes it.Three functions shall be added:
extern int GetStorageMgrNumbers(void);
extern f_smgr GetStorageMgrHandlers(int smgr_which);
extern void SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);As literal, GetStorageMgrNumbers() returns the number of storage manager
currently installed. It always return 1 right now.
GetStorageMgrHandlers() returns the currently configured f_smgr table to
the supplied smgr_which. It allows extensions to know current configuration
of the storage manager, even if other extension already modified it.
SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
the current one.
If extension wants to intermediate 'smgr_write', extension will replace
the 'smgr_write' by own function, then call the original function, likely
mdwrite, from the alternative function.In this case, call chain shall be:
FlushBuffer, and others... +-- smgrwrite(...) +-- (extension's own function) +-- mdwriteISTR someone (Robert Haas?) complaining that this method of hooks is
cumbersome to use and can be fragile if multiple hooks are being
installed. So maybe we don't want to extend it's usage...I'm also not sure whether this is better done with an smgr hook or a
hook into shared buffer handling...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
-----Original Message-----
From: Jim Nasby [mailto:Jim.Nasby@BlueTreble.com]
Sent: Friday, February 05, 2016 9:17 AM
To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org; Robert Haas
Cc: Amit Langote
Subject: Re: [HACKERS] Way to check whether a particular block is on the
shared_buffer?On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
2. A feature to suspend i/o write-out towards a particular blocks
that are registered by other concurrent backend, unless it is not
unregistered (usually, at the end of P2P DMA).
==> to be discussed.I think there's still a race condition here though...
A
finds buffer not in shared buffersB
reads buffer in
modifies buffer
starts writing buffer to OSA
Makes call to block write, but write is already in process; thinks
writes are now blocked
Reads corrupted block
Much hilarity ensuesOr maybe you were just glossing over that part for brevity.
...
I tried to design a draft of enhancement to realize the above i/o write-out
suspend/resume, with less invasive way as possible as we can.ASSUMPTION: I intend to implement this feature as a part of extension,
because this i/o suspend/resume checks are pure overhead increment
for the core features, unless extension which utilizes it.Three functions shall be added:
extern int GetStorageMgrNumbers(void);
extern f_smgr GetStorageMgrHandlers(int smgr_which);
extern void SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);As literal, GetStorageMgrNumbers() returns the number of storage manager
currently installed. It always return 1 right now.
GetStorageMgrHandlers() returns the currently configured f_smgr table to
the supplied smgr_which. It allows extensions to know current configuration
of the storage manager, even if other extension already modified it.
SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
the current one.
If extension wants to intermediate 'smgr_write', extension will replace
the 'smgr_write' by own function, then call the original function, likely
mdwrite, from the alternative function.In this case, call chain shall be:
FlushBuffer, and others... +-- smgrwrite(...) +-- (extension's own function) +-- mdwriteISTR someone (Robert Haas?) complaining that this method of hooks is
cumbersome to use and can be fragile if multiple hooks are being
installed. So maybe we don't want to extend it's usage...I'm also not sure whether this is better done with an smgr hook or a
hook into shared buffer handling...
# sorry, I oversight the later part of your reply.
I can agree that smgr hooks shall be primarily designed to make storage
systems pluggable, even if we can use this hooks for suspend & resume of
write i/o stuff.
In addition, "pluggable storage" is a long-standing feature, even though
it is not certain whether existing smgr hooks are good starting point.
It may be a risk if we implement a grand feature on top of the hooks
but out of its primary purpose.
So, my preference is a mechanism to hook buffer write to implement this
feature. (Or, maybe a built-in write i/o suspend / resume stuff if it
has nearly zero cost when no extension activate the feature.)
One downside of this approach is larger number of hook points.
We have to deploy the hook nearby existing smgrwrite of LocalBufferAlloc
and FlushRelationBuffers, in addition to FlushBuffer, at least.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Feb 4, 2016 at 11:34 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
I can agree that smgr hooks shall be primarily designed to make storage
systems pluggable, even if we can use this hooks for suspend & resume of
write i/o stuff.
In addition, "pluggable storage" is a long-standing feature, even though
it is not certain whether existing smgr hooks are good starting point.
It may be a risk if we implement a grand feature on top of the hooks
but out of its primary purpose.So, my preference is a mechanism to hook buffer write to implement this
feature. (Or, maybe a built-in write i/o suspend / resume stuff if it
has nearly zero cost when no extension activate the feature.)
One downside of this approach is larger number of hook points.
We have to deploy the hook nearby existing smgrwrite of LocalBufferAlloc
and FlushRelationBuffers, in addition to FlushBuffer, at least.
I don't understand what you're hoping to achieve by introducing
pluggability at the smgr layer. I mean, md.c is pretty much good for
read and writing from anything that looks like a directory of files.
Another smgr layer would make sense if we wanted to read and write via
some kind of network protocol, or if we wanted to have some kind of
storage substrate that did internally to itself some of the tasks for
which we are currently relying on the filesystem - e.g. if we wanted
to be able to use a raw device, or perhaps more plausibly if we wanted
to reduce the number of separate files we need, or provide a substrate
that can clip an unused extent out of the middle of a relation
efficiently. But I don't understand what this has to do with what
you're trying to do here. The subject of this thread is about whether
you can check for the presence of a block in shared_buffers, and as
discussed upthread, you can. I don't quite follow how we made the
jump from there to smgr pluggability.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
-----Original Message-----
From: Robert Haas [mailto:robertmhaas@gmail.com]
Sent: Monday, February 08, 2016 1:52 AM
To: Kaigai Kouhei(海外 浩平)
Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
Subject: Re: [HACKERS] Way to check whether a particular block is
on the shared_buffer?On Thu, Feb 4, 2016 at 11:34 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
I can agree that smgr hooks shall be primarily designed to make storage
systems pluggable, even if we can use this hooks for suspend & resume of
write i/o stuff.
In addition, "pluggable storage" is a long-standing feature, even though
it is not certain whether existing smgr hooks are good starting point.
It may be a risk if we implement a grand feature on top of the hooks
but out of its primary purpose.So, my preference is a mechanism to hook buffer write to implement this
feature. (Or, maybe a built-in write i/o suspend / resume stuff if it
has nearly zero cost when no extension activate the feature.)
One downside of this approach is larger number of hook points.
We have to deploy the hook nearby existing smgrwrite of LocalBufferAlloc
and FlushRelationBuffers, in addition to FlushBuffer, at least.I don't understand what you're hoping to achieve by introducing
pluggability at the smgr layer. I mean, md.c is pretty much good for
read and writing from anything that looks like a directory of files.
Another smgr layer would make sense if we wanted to read and write via
some kind of network protocol, or if we wanted to have some kind of
storage substrate that did internally to itself some of the tasks for
which we are currently relying on the filesystem - e.g. if we wanted
to be able to use a raw device, or perhaps more plausibly if we wanted
to reduce the number of separate files we need, or provide a substrate
that can clip an unused extent out of the middle of a relation
efficiently. But I don't understand what this has to do with what
you're trying to do here. The subject of this thread is about whether
you can check for the presence of a block in shared_buffers, and as
discussed upthread, you can. I don't quite follow how we made the
jump from there to smgr pluggability.
Yes, smgr pluggability is not what I want to investigate in this thread.
It was not the purpose of the discussion, just one potential idea for the
implementation.
Through the discussion, it became clear that an extension can check for
the existence of a buffer for a particular block using the existing
infrastructure.
On the other hand, it also became clear that we have to guarantee the OS
buffer or storage block is not partially updated during the P2P DMA.

My motivation is the potential utilization of SSD-to-GPU P2P DMA to
filter out unnecessary rows and columns prior to loading to CPU/RAM.
It needs to ensure that PostgreSQL does not write out buffers to the OS
buffers, to avoid unexpected data corruption.
What I want to achieve is to suspend buffer writes towards a particular
(relnode, forknum, blocknum) pair for a short time, until data processing
by the GPU (or other external device) completes.
In addition, it is preferable that this works regardless of the choice of
storage manager, even if we have multiple options on top of a pluggable
smgr in the future.
The data processing close to the storage requires that the OS buffer not
be updated concurrently under the P2P DMA. So, I want the feature below.
1. An extension (that controls the GPU and P2P DMA) can register a
particular (relnode, forknum, blocknum) pair as a block suspended for write.
2. Once a particular block is suspended, smgrwrite (or its caller) shall
block until the suspended block is unregistered.
3. The extension unregisters the block when the P2P DMA from it completes;
the suspended concurrent backends then resume their write I/O.
4. On the other hand, the extension cannot register the block while some
other backend is concurrently executing smgrwrite, to avoid potential
data corruption.
One idea was the injection of a thin layer on top of the smgr mechanism
to implement the above.
However, I'm uncertain whether injecting hooks across the entire smgr is
a straightforward way to achieve it.
The minimum facility I want is control at the head and tail of
smgrwrite(): to suspend a concurrent write before smgr_write runs, and to
notify the mechanism when the concurrent smgr_write has completed.
It does not need smgr pluggability, but the entry points would have to be
located around the smgr functions.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Feb 7, 2016 at 9:49 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
On the other hand, it also became clear that we have to guarantee the OS
buffer or storage block is not partially updated during the P2P DMA.
My motivation is the potential utilization of SSD-to-GPU P2P DMA to
filter out unnecessary rows and columns prior to loading to CPU/RAM.
It needs to ensure that PostgreSQL does not write out buffers to the OS
buffers, to avoid unexpected data corruption.

What I want to achieve is to suspend buffer writes towards a particular
(relnode, forknum, blocknum) pair for a short time, until data processing
by the GPU (or other external device) completes.
In addition, it is preferable that this works regardless of the choice of
storage manager, even if we have multiple options on top of a pluggable
smgr in the future.
It seems like you just need to take an exclusive content lock on the
buffer, or maybe a shared content lock would be sufficient.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
-----Original Message-----
From: Robert Haas [mailto:robertmhaas@gmail.com]
Sent: Wednesday, February 10, 2016 1:58 AM
To: Kaigai Kouhei(海外 浩平)
Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
Subject: Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

On Sun, Feb 7, 2016 at 9:49 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
On the other hand, it also became clear that we have to guarantee the OS
buffer or storage block is not partially updated during the P2P DMA.
My motivation is the potential utilization of SSD-to-GPU P2P DMA to
filter out unnecessary rows and columns prior to loading to CPU/RAM.
It needs to ensure that PostgreSQL does not write out buffers to the OS
buffers, to avoid unexpected data corruption.

What I want to achieve is to suspend buffer writes towards a particular
(relnode, forknum, blocknum) pair for a short time, until data processing
by the GPU (or other external device) completes.
In addition, it is preferable that this works regardless of the choice of
storage manager, even if we have multiple options on top of a pluggable
smgr in the future.

It seems like you just need to take an exclusive content lock on the
buffer, or maybe a shared content lock would be sufficient.
Unfortunately, it was not sufficient.
By assumption, the buffer page to be suspended does not exist when a
backend process issues a series of P2P DMA commands. (If the block were
already loaded into shared buffers, there would be no need to issue P2P
DMA; we would just use the usual memory<->device DMA, because RAM is much
faster than SSD.)
The process knows the (rel, fork, block) triple, but no BufferDesc for
this block exists, so it cannot acquire locks on the BufferDesc structure.

Even if the block does not exist at this point, a concurrent process may
load the same page. A BufferDesc for this page would be assigned at that
point, but the process that issues the P2P DMA command has no chance to
lock anything in that BufferDesc.

That is why I assume the suspend/resume mechanism has to take a
(rel, fork, block) triple as the identifier of the target block.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Tue, Feb 9, 2016 at 6:35 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
Unfortunately, it was not sufficient.
By assumption, the buffer page to be suspended does not exist when a
backend process issues a series of P2P DMA commands. (If the block were
already loaded into shared buffers, there would be no need to issue P2P
DMA; we would just use the usual memory<->device DMA, because RAM is much
faster than SSD.)
The process knows the (rel, fork, block) triple, but no BufferDesc for
this block exists, so it cannot acquire locks on the BufferDesc structure.

Even if the block does not exist at this point, a concurrent process may
load the same page. A BufferDesc for this page would be assigned at that
point, but the process that issues the P2P DMA command has no chance to
lock anything in that BufferDesc.

That is why I assume the suspend/resume mechanism has to take a
(rel, fork, block) triple as the identifier of the target block.
I see the problem, but I'm not terribly keen on putting in the hooks
that it would take to let you solve it without hacking core. It
sounds like an awfully invasive thing for a pretty niche requirement.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 9, 2016 at 6:35 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
Unfortunately, it was not sufficient.
By assumption, the buffer page to be suspended does not exist when a
backend process issues a series of P2P DMA commands. (If the block were
already loaded into shared buffers, there would be no need to issue P2P
DMA; we would just use the usual memory<->device DMA, because RAM is much
faster than SSD.)
The process knows the (rel, fork, block) triple, but no BufferDesc for
this block exists, so it cannot acquire locks on the BufferDesc structure.

Even if the block does not exist at this point, a concurrent process may
load the same page. A BufferDesc for this page would be assigned at that
point, but the process that issues the P2P DMA command has no chance to
lock anything in that BufferDesc.

That is why I assume the suspend/resume mechanism has to take a
(rel, fork, block) triple as the identifier of the target block.

I see the problem, but I'm not terribly keen on putting in the hooks
that it would take to let you solve it without hacking core. It
sounds like an awfully invasive thing for a pretty niche requirement.
Hmm. In my experience, discussing whether a feature is niche or commodity
is often not productive. So let me change the viewpoint.

We may be able to utilize an OS-level locking mechanism here.
Even though it depends on the filesystem implementation under the VFS, we
may use the inode->i_mutex lock that is acquired during the buffer copy
from user space to the kernel, at least on a few major filesystems: ext4,
xfs and btrfs, according to my research. Likewise, the modified NVMe SSD
driver can acquire the inode->i_mutex lock during the P2P DMA transfer.

Once we can assume the OS buffer is updated atomically under that lock,
we don't need to worry about corrupted pages, but we still need to pay
attention to the scenario where an updated buffer page is moved to the GPU.

In this case, PD_ALL_VISIBLE may give us a hint. The GPU side has no MVCC
infrastructure, so I intend to move all-visible pages only.
If someone updates the buffer concurrently and then writes out the page
including invisible tuples, the PD_ALL_VISIBLE flag will be cleared,
because the updated tuples should not be visible to the transaction that
issued the P2P DMA.
Once the GPU meets a page with !PD_ALL_VISIBLE, it can return an error
status that tells the CPU to retry this page. In this case, the page is
likely already loaded into shared buffers, so the retry penalty is not so
large.

I'll try to investigate an implementation along these lines.
Please correct me if I misunderstand something (especially the treatment
of PD_ALL_VISIBLE).
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Thu, Feb 11, 2016 at 9:05 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
Hmm. In my experience, discussing whether a feature is niche or commodity
is often not productive. So let me change the viewpoint.

We may be able to utilize an OS-level locking mechanism here.
Even though it depends on the filesystem implementation under the VFS, we
may use the inode->i_mutex lock that is acquired during the buffer copy
from user space to the kernel, at least on a few major filesystems: ext4,
xfs and btrfs, according to my research. Likewise, the modified NVMe SSD
driver can acquire the inode->i_mutex lock during the P2P DMA transfer.

Once we can assume the OS buffer is updated atomically under that lock,
we don't need to worry about corrupted pages, but we still need to pay
attention to the scenario where an updated buffer page is moved to the GPU.

In this case, PD_ALL_VISIBLE may give us a hint. The GPU side has no MVCC
infrastructure, so I intend to move all-visible pages only.
If someone updates the buffer concurrently and then writes out the page
including invisible tuples, the PD_ALL_VISIBLE flag will be cleared,
because the updated tuples should not be visible to the transaction that
issued the P2P DMA.
Once the GPU meets a page with !PD_ALL_VISIBLE, it can return an error
status that tells the CPU to retry this page. In this case, the page is
likely already loaded into shared buffers, so the retry penalty is not so
large.

I'll try to investigate an implementation along these lines.
Please correct me if I misunderstand something (especially the treatment
of PD_ALL_VISIBLE).
I suppose there's no theoretical reason why the buffer couldn't go
from all-visible to not-all-visible and back to all-visible again all
during the time you are copying it.
Honestly, I think trying to access buffers without going through
shared_buffers is likely to be very hard to make correct and probably
a loser. Copying the data into shared_buffers and then to the GPU is,
doubtless, at least somewhat slower. But I kind of doubt that it's
enough slower to make up for all of the problems you're going to have
with the approach you've chosen.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
Sent: Saturday, February 13, 2016 1:46 PM
To: Kaigai Kouhei(海外 浩平)
Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
Subject: Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

On Thu, Feb 11, 2016 at 9:05 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
Hmm. In my experience, discussing whether a feature is niche or commodity
is often not productive. So let me change the viewpoint.

We may be able to utilize an OS-level locking mechanism here.
Even though it depends on the filesystem implementation under the VFS, we
may use the inode->i_mutex lock that is acquired during the buffer copy
from user space to the kernel, at least on a few major filesystems: ext4,
xfs and btrfs, according to my research. Likewise, the modified NVMe SSD
driver can acquire the inode->i_mutex lock during the P2P DMA transfer.

Once we can assume the OS buffer is updated atomically under that lock,
we don't need to worry about corrupted pages, but we still need to pay
attention to the scenario where an updated buffer page is moved to the GPU.

In this case, PD_ALL_VISIBLE may give us a hint. The GPU side has no MVCC
infrastructure, so I intend to move all-visible pages only.
If someone updates the buffer concurrently and then writes out the page
including invisible tuples, the PD_ALL_VISIBLE flag will be cleared,
because the updated tuples should not be visible to the transaction that
issued the P2P DMA.
Once the GPU meets a page with !PD_ALL_VISIBLE, it can return an error
status that tells the CPU to retry this page. In this case, the page is
likely already loaded into shared buffers, so the retry penalty is not so
large.

I'll try to investigate an implementation along these lines.
Please correct me if I misunderstand something (especially the treatment
of PD_ALL_VISIBLE).

I suppose there's no theoretical reason why the buffer couldn't go
from all-visible to not-all-visible and back to all-visible again all
during the time you are copying it.
The backend process that is copying the data to the GPU has a transaction
in progress (= not committed). Is it possible for the updated buffer page
to get back to the all-visible state again?
I expect that the in-progress transaction works as a blocker against
returning to all-visible. Right?
Honestly, I think trying to access buffers without going through
shared_buffers is likely to be very hard to make correct and probably
a loser.
No challenge, no outcome. ;-)
Copying the data into shared_buffers and then to the GPU is,
doubtless, at least somewhat slower. But I kind of doubt that it's
enough slower to make up for all of the problems you're going to have
with the approach you've chosen.
Honestly, I'm still uncertain whether it will work as well as I expect.
However, scan workloads on tables larger than main memory are a headache
for PG-Strom, so I'd like to try the ideas we can implement.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Sat, Feb 13, 2016 at 7:29 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
I suppose there's no theoretical reason why the buffer couldn't go
from all-visible to not-all-visible and back to all-visible again all
during the time you are copying it.

The backend process that is copying the data to the GPU has a transaction
in progress (= not committed). Is it possible for the updated buffer page
to get back to the all-visible state again?
I expect that the in-progress transaction works as a blocker against
returning to all-visible. Right?
Yeah, probably.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I found one other, but tiny, problem in implementing the SSD-to-GPU
direct data transfer feature under PostgreSQL storage.

An extension cannot know the raw file descriptor opened by smgr.
I expect the extension to issue an ioctl(2) on the special device file
provided by the special kernel driver, to control the P2P DMA.
This ioctl(2) packs the file descriptor of the DMA source together with
various other information (base position, range, destination device
pointer, and so on).
However, the raw file descriptor is wrapped by fd.c inside the File
handle, and thus is not visible to extensions. Oops...

The attached patch provides a way to obtain the raw file descriptor (and
relevant flags) of a particular File virtual file descriptor in
PostgreSQL. (Needless to say, the extension has to treat the raw
descriptor carefully so as not to have an adverse effect on the storage
manager.)

How about this tiny enhancement?
Attachments:
pgsql-v9.6-filegetrawdesc.1.patch (application/octet-stream)
src/backend/storage/file/fd.c | 18 ++++++++++++++++++
src/include/storage/fd.h | 1 +
2 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..4ef37df 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1577,6 +1577,24 @@ FilePathName(File file)
return VfdCache[file].fileName;
}
+/*
+ * Return the raw file descriptor with an open file, and relevant flags.
+ *
+ * The returned file descriptor remains valid until the file is closed; the
+ * caller must treat it carefully so as not to disturb the storage manager.
+ */
+int
+FileGetRawDesc(File file, int *f_flags, int *f_mode)
+{
+ Assert(FileIsValid(file));
+
+ if (f_flags)
+ *f_flags = VfdCache[file].fileFlags;
+ if (f_mode)
+ *f_mode = VfdCache[file].fileMode;
+
+ return VfdCache[file].fd;
+}
/*
* Make room for another allocatedDescs[] array entry if needed and possible.
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..1962160 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -75,6 +75,7 @@ extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
extern char *FilePathName(File file);
+extern int FileGetRawDesc(File file, int *f_flags, int *f_mode);
/* Operations that allow use of regular stdio --- USE WITH CAUTION */
extern FILE *AllocateFile(const char *name, const char *mode);
On Thu, Mar 3, 2016 at 8:54 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
I found one other, but tiny, problem in implementing the SSD-to-GPU
direct data transfer feature under PostgreSQL storage.

An extension cannot know the raw file descriptor opened by smgr.
I expect the extension to issue an ioctl(2) on the special device file
provided by the special kernel driver, to control the P2P DMA.
This ioctl(2) packs the file descriptor of the DMA source together with
various other information (base position, range, destination device
pointer, and so on).
However, the raw file descriptor is wrapped by fd.c inside the File
handle, and thus is not visible to extensions. Oops...

The attached patch provides a way to obtain the raw file descriptor (and
relevant flags) of a particular File virtual file descriptor in
PostgreSQL. (Needless to say, the extension has to treat the raw
descriptor carefully so as not to have an adverse effect on the storage
manager.)

How about this tiny enhancement?
Why not FileDescriptor(), FileFlags(), FileMode() as separate
functions like FilePathName()?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
Sent: Saturday, March 05, 2016 2:42 AM
To: Kaigai Kouhei(海外 浩平)
Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
Subject: Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

On Thu, Mar 3, 2016 at 8:54 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
I found one other, but tiny, problem in implementing the SSD-to-GPU
direct data transfer feature under PostgreSQL storage.

An extension cannot know the raw file descriptor opened by smgr.
I expect the extension to issue an ioctl(2) on the special device file
provided by the special kernel driver, to control the P2P DMA.
This ioctl(2) packs the file descriptor of the DMA source together with
various other information (base position, range, destination device
pointer, and so on).
However, the raw file descriptor is wrapped by fd.c inside the File
handle, and thus is not visible to extensions. Oops...

The attached patch provides a way to obtain the raw file descriptor (and
relevant flags) of a particular File virtual file descriptor in
PostgreSQL. (Needless to say, the extension has to treat the raw
descriptor carefully so as not to have an adverse effect on the storage
manager.)

How about this tiny enhancement?

Why not FileDescriptor(), FileFlags(), FileMode() as separate
functions like FilePathName()?
There is no deep reason. The attached patch adds three individual
functions.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
Attachments:
pgsql-v9.6-filegetrawdesc.2.patch (application/octet-stream)
src/backend/storage/file/fd.c | 32 ++++++++++++++++++++++++++++++++
src/include/storage/fd.h | 3 +++
2 files changed, 35 insertions(+)
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1b30100..a3019b3 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1577,6 +1577,38 @@ FilePathName(File file)
return VfdCache[file].fileName;
}
+/*
+ * Return the raw file descriptor of an opened file.
+ *
+ * The returned file descriptor remains valid until the file is closed; the
+ * caller must treat it carefully so as not to disturb the storage manager.
+ */
+int
+FileGetRawDesc(File file)
+{
+ Assert(FileIsValid(file));
+ return VfdCache[file].fd;
+}
+
+/*
+ * FileGetRawFlags - returns the file flags on open(2)
+ */
+int
+FileGetRawFlags(File file)
+{
+ Assert(FileIsValid(file));
+ return VfdCache[file].fileFlags;
+}
+
+/*
+ * FileGetRawMode - returns the mode bitmask passed to open(2)
+ */
+int
+FileGetRawMode(File file)
+{
+ Assert(FileIsValid(file));
+ return VfdCache[file].fileMode;
+}
/*
* Make room for another allocatedDescs[] array entry if needed and possible.
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 4a3fccb..6faa8ad 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -75,6 +75,9 @@ extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
extern char *FilePathName(File file);
+extern int FileGetRawDesc(File file);
+extern int FileGetRawFlags(File file);
+extern int FileGetRawMode(File file);
/* Operations that allow use of regular stdio --- USE WITH CAUTION */
extern FILE *AllocateFile(const char *name, const char *mode);
On Mon, Mar 7, 2016 at 4:32 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
Why not FileDescriptor(), FileFlags(), FileMode() as separate
functions like FilePathName()?

There is no deep reason. The attached patch adds three individual
functions.
This seems unobjectionable to me, so committed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company