Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5 |
|
#
67f45795 |
| 21-Sep-2023 |
Max Kellermann <max.kellermann@ionos.com> |
fs/pipe: move check to pipe_has_watch_queue()
[ Upstream commit b4bd6b4bac8edd61eb8f7b836969d12c0c6af165 ]
This declutters the code by reducing the number of #ifdefs and makes the watch_queue check
fs/pipe: move check to pipe_has_watch_queue()
[ Upstream commit b4bd6b4bac8edd61eb8f7b836969d12c0c6af165 ]
This declutters the code by reducing the number of #ifdefs and makes the watch_queue checks simpler. This has no runtime effect; the machine code is identical.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Message-Id: <20230921075755.1378787-2-max.kellermann@ionos.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Stable-dep-of: e95aada4cb93 ("pipe: wakeup wr_wait after setting max_usage") Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10 |
|
#
515c5046 |
| 01-Feb-2023 |
Luca Vizzarro <Luca.Vizzarro@arm.com> |
pipe: Pass argument of pipe_fcntl as int
The interface for fcntl expects the argument passed for the command F_SETPIPE_SZ to be of type int. The current code wrongly treats it as a long. In order to
pipe: Pass argument of pipe_fcntl as int
The interface for fcntl expects the argument passed for the command F_SETPIPE_SZ to be of type int. The current code wrongly treats it as a long. In order to avoid access to undefined bits, we should explicitly cast the argument to int.
Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Kevin Brodsky <Kevin.Brodsky@arm.com> Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: David Laight <David.Laight@ACULAB.com> Cc: Mark Rutland <Mark.Rutland@arm.com> Cc: linux-fsdevel@vger.kernel.org Cc: linux-morello@op-lists.linaro.org Signed-off-by: Luca Vizzarro <Luca.Vizzarro@arm.com> Message-Id: <20230414152459.816046-4-Luca.Vizzarro@arm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
247c8d2f |
| 16-May-2023 |
Arnd Bergmann <arnd@arndb.de> |
fs: pipe: reveal missing function protoypes
A couple of functions from fs/pipe.c are used both internally and for the watch queue code, but the declaration is only visible when the latter is enabled
fs: pipe: reveal missing function protoypes
A couple of functions from fs/pipe.c are used both internally and for the watch queue code, but the declaration is only visible when the latter is enabled:
fs/pipe.c:1254:5: error: no previous prototype for 'pipe_resize_ring' fs/pipe.c:758:15: error: no previous prototype for 'account_pipe_buffers' fs/pipe.c:764:6: error: no previous prototype for 'too_many_pipe_buffers_soft' fs/pipe.c:771:6: error: no previous prototype for 'too_many_pipe_buffers_hard' fs/pipe.c:777:6: error: no previous prototype for 'pipe_is_unprivileged_user'
Make the visible unconditionally to avoid these warnings.
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Message-Id: <20230516195629.551602-1-arnd@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
07073eb0 |
| 14-Feb-2023 |
David Howells <dhowells@redhat.com> |
splice: Add a func to do a splice from a buffered file without ITER_PIPE
Provide a function to do splice read from a buffered file, pulling the folios out of the pagecache directly by calling filema
splice: Add a func to do a splice from a buffered file without ITER_PIPE
Provide a function to do splice read from a buffered file, pulling the folios out of the pagecache directly by calling filemap_get_pages() to do any required reading and then pasting the returned folios into the pipe.
A helper function is provided to do the actual folio pasting and will handle multipage folios by splicing as many of the relevant subpages as will fit into the pipe.
The code is loosely based on filemap_read() and might belong in mm/filemap.c with that as it needs to use filemap_get_pages().
Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
show more ...
|
Revision tags: v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55, v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49, v5.15.48 |
|
#
12d426ab |
| 15-Jun-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
ITER_PIPE: fold data_start() and pipe_space_for_user() together
All their callers are next to each other; all of them want the total amount of pages and, possibly, the offset in the partial final bu
ITER_PIPE: fold data_start() and pipe_space_for_user() together
All their callers are next to each other; all of them want the total amount of pages and, possibly, the offset in the partial final buffer.
Combine into a new helper (pipe_npages()), fix the bogosity in pipe_space_for_user(), while we are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
Revision tags: v5.15.47 |
|
#
c3497fd0 |
| 12-Jun-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
fix short copy handling in copy_mc_pipe_to_iter()
Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can result in a short copy. In that case we need to trim the unused buffers, as wel
fix short copy handling in copy_mc_pipe_to_iter()
Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can result in a short copy. In that case we need to trim the unused buffers, as well as the length of partially filled one - it's not enough to set ->head, ->iov_offset and ->count to reflect how much had we copied. Not hard to fix, fortunately...
I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h, rather than iov_iter.c - it has nothing to do with iov_iter and having it will allow us to avoid an ugly kludge in fs/splice.c. We could put it into lib/iov_iter.c for now and move it later, but I don't see the point going that way...
Cc: stable@kernel.org # 4.19+ Fixes: ca146f6f091e "lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()" Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
Revision tags: v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39, v5.15.38, v5.15.37 |
|
#
f485922d |
| 29-Apr-2022 |
Kuniyuki Iwashima <kuniyu@amazon.co.jp> |
pipe: make poll_usage boolean and annotate its access
Patch series "Fix data-races around epoll reported by KCSAN."
This series suppresses a false positive KCSAN's message and fixes a real data-rac
pipe: make poll_usage boolean and annotate its access
Patch series "Fix data-races around epoll reported by KCSAN."
This series suppresses a false positive KCSAN's message and fixes a real data-race.
This patch (of 2):
pipe_poll() runs locklessly and assigns 1 to poll_usage. Once poll_usage is set to 1, it never changes in other places. However, concurrent writes of a value trigger KCSAN, so let's make KCSAN happy.
BUG: KCSAN: data-race in pipe_poll / pipe_poll
write to 0xffff8880042f6678 of 4 bytes by task 174 on cpu 3: pipe_poll (fs/pipe.c:656) ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853) do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234) __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
write to 0xffff8880042f6678 of 4 bytes by task 177 on cpu 1: pipe_poll (fs/pipe.c:656) ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853) do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234) __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 177 Comm: epoll_race Not tainted 5.17.0-58927-gf443e374ae13 #6 Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014
Link: https://lkml.kernel.org/r/20220322002653.33865-1-kuniyu@amazon.co.jp Link: https://lkml.kernel.org/r/20220322002653.33865-2-kuniyu@amazon.co.jp Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kuniyuki Iwashima <kuni1840@gmail.com> Cc: "Soheil Hassas Yeganeh" <soheil@google.com> Cc: "Sridhar Samudrala" <sridhar.samudrala@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
Revision tags: v5.15.36, v5.15.35, v5.15.34, v5.15.33, v5.15.32, v5.15.31, v5.17, v5.15.30, v5.15.29, v5.15.28, v5.15.27, v5.15.26, v5.15.25, v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20, v5.15.19, v5.15.18, v5.15.17 |
|
#
1998f193 |
| 22-Jan-2022 |
Luis Chamberlain <mcgrof@kernel.org> |
fs: move pipe sysctls to is own file
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain.
To help with this maintenance let's start
fs: move pipe sysctls to is own file
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places where they actually belong. The proc sysctl maintainers do not want to know what sysctl knobs you wish to add for your own piece of code, we just care about the core logic.
So move the pipe sysctls to its own file.
Link: https://lkml.kernel.org/r/20211129205548.605569-10-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Antti Palosaari <crope@iki.fi> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Lukas Middendorf <kernel@tuxforce.de> Cc: Stephen Kitt <steve@sk2.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
4228c037 |
| 12-Jun-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
fix short copy handling in copy_mc_pipe_to_iter()
commit c3497fd009ef2c59eea60d21c3ac22de3585ed7d upstream.
Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can result in a short cop
fix short copy handling in copy_mc_pipe_to_iter()
commit c3497fd009ef2c59eea60d21c3ac22de3585ed7d upstream.
Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can result in a short copy. In that case we need to trim the unused buffers, as well as the length of partially filled one - it's not enough to set ->head, ->iov_offset and ->count to reflect how much had we copied. Not hard to fix, fortunately...
I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h, rather than iov_iter.c - it has nothing to do with iov_iter and having it will allow us to avoid an ugly kludge in fs/splice.c. We could put it into lib/iov_iter.c for now and move it later, but I don't see the point going that way...
Cc: stable@kernel.org # 4.19+ Fixes: ca146f6f091e "lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()" Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
#
e6acf868 |
| 29-Apr-2022 |
Kuniyuki Iwashima <kuniyu@amazon.co.jp> |
pipe: make poll_usage boolean and annotate its access
commit f485922d8fe4e44f6d52a5bb95a603b7c65554bb upstream.
Patch series "Fix data-races around epoll reported by KCSAN."
This series suppresses
pipe: make poll_usage boolean and annotate its access
commit f485922d8fe4e44f6d52a5bb95a603b7c65554bb upstream.
Patch series "Fix data-races around epoll reported by KCSAN."
This series suppresses a false positive KCSAN's message and fixes a real data-race.
This patch (of 2):
pipe_poll() runs locklessly and assigns 1 to poll_usage. Once poll_usage is set to 1, it never changes in other places. However, concurrent writes of a value trigger KCSAN, so let's make KCSAN happy.
BUG: KCSAN: data-race in pipe_poll / pipe_poll
write to 0xffff8880042f6678 of 4 bytes by task 174 on cpu 3: pipe_poll (fs/pipe.c:656) ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853) do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234) __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
write to 0xffff8880042f6678 of 4 bytes by task 177 on cpu 1: pipe_poll (fs/pipe.c:656) ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853) do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234) __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 177 Comm: epoll_race Not tainted 5.17.0-58927-gf443e374ae13 #6 Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014
Link: https://lkml.kernel.org/r/20220322002653.33865-1-kuniyu@amazon.co.jp Link: https://lkml.kernel.org/r/20220322002653.33865-2-kuniyu@amazon.co.jp Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kuniyuki Iwashima <kuni1840@gmail.com> Cc: "Soheil Hassas Yeganeh" <soheil@google.com> Cc: "Sridhar Samudrala" <sridhar.samudrala@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
#
e6acf868 |
| 29-Apr-2022 |
Kuniyuki Iwashima <kuniyu@amazon.co.jp> |
pipe: make poll_usage boolean and annotate its access
commit f485922d8fe4e44f6d52a5bb95a603b7c65554bb upstream.
Patch series "Fix data-races around epoll reported by KCSAN."
This series suppresses
pipe: make poll_usage boolean and annotate its access
commit f485922d8fe4e44f6d52a5bb95a603b7c65554bb upstream.
Patch series "Fix data-races around epoll reported by KCSAN."
This series suppresses a false positive KCSAN's message and fixes a real data-race.
This patch (of 2):
pipe_poll() runs locklessly and assigns 1 to poll_usage. Once poll_usage is set to 1, it never changes in other places. However, concurrent writes of a value trigger KCSAN, so let's make KCSAN happy.
BUG: KCSAN: data-race in pipe_poll / pipe_poll
write to 0xffff8880042f6678 of 4 bytes by task 174 on cpu 3: pipe_poll (fs/pipe.c:656) ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853) do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234) __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
write to 0xffff8880042f6678 of 4 bytes by task 177 on cpu 1: pipe_poll (fs/pipe.c:656) ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853) do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234) __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 177 Comm: epoll_race Not tainted 5.17.0-58927-gf443e374ae13 #6 Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014
Link: https://lkml.kernel.org/r/20220322002653.33865-1-kuniyu@amazon.co.jp Link: https://lkml.kernel.org/r/20220322002653.33865-2-kuniyu@amazon.co.jp Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kuniyuki Iwashima <kuni1840@gmail.com> Cc: "Soheil Hassas Yeganeh" <soheil@google.com> Cc: "Sridhar Samudrala" <sridhar.samudrala@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v5.4.173, v5.15.16, v5.15.15, v5.16, v5.15.10, v5.15.9, v5.15.8, v5.15.7, v5.15.6, v5.15.5, v5.15.4, v5.15.3, v5.15.2, v5.15.1, v5.15, v5.14.14, v5.14.13, v5.14.12, v5.14.11, v5.14.10, v5.14.9, v5.14.8, v5.14.7, v5.14.6, v5.10.67, v5.10.66, v5.14.5, v5.14.4, v5.10.65, v5.14.3, v5.10.64, v5.14.2, v5.10.63, v5.14.1, v5.10.62, v5.14, v5.10.61, v5.10.60 |
|
#
3b844826 |
| 05-Aug-2021 |
Linus Torvalds <torvalds@linux-foundation.org> |
pipe: avoid unnecessary EPOLLET wakeups under normal loads
I had forgotten just how sensitive hackbench is to extra pipe wakeups, and commit 3a34b13a88ca ("pipe: make pipe writes always wake up read
pipe: avoid unnecessary EPOLLET wakeups under normal loads
I had forgotten just how sensitive hackbench is to extra pipe wakeups, and commit 3a34b13a88ca ("pipe: make pipe writes always wake up readers") ended up causing a quite noticeable regression on larger machines.
Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's not clear that this matters in real life all that much, but as Mel points out, it's used often enough when comparing kernels and so the performance regression shows up like a sore thumb.
It's easy enough to fix at least for the common cases where pipes are used purely for data transfer, and you never have any exciting poll usage at all. So set a special 'poll_usage' flag when there is polling activity, and make the ugly "EPOLLET has crazy legacy expectations" semantics explicit to only that case.
I would love to limit it to just the broken EPOLLET case, but the pipe code can't see the difference between epoll and regular select/poll, so any non-read/write waiting will trigger the extra wakeup behavior. That is sufficient for at least the hackbench case.
Apart from making the odd extra wakeup cases more explicitly about EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe write, not at the first write chunk. That is actually much saner semantics (as much as you can call any of the legacy edge-triggered expectations for EPOLLET "sane") since it means that you know the wakeup will happen once the write is done, rather than possibly in the middle of one.
[ For stable people: I'm putting a "Fixes" tag on this, but I leave it up to you to decide whether you actually want to backport it or not. It likely has no impact outside of synthetic benchmarks - Linus ]
Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/ Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers") Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Sandeep Patil <sspatil@android.com> Tested-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
e91da23c |
| 05-Aug-2021 |
Linus Torvalds <torvalds@linux-foundation.org> |
pipe: avoid unnecessary EPOLLET wakeups under normal loads
commit 3b844826b6c6affa80755254da322b017358a2f4 upstream.
I had forgotten just how sensitive hackbench is to extra pipe wakeups, and commi
pipe: avoid unnecessary EPOLLET wakeups under normal loads
commit 3b844826b6c6affa80755254da322b017358a2f4 upstream.
I had forgotten just how sensitive hackbench is to extra pipe wakeups, and commit 3a34b13a88ca ("pipe: make pipe writes always wake up readers") ended up causing a quite noticeable regression on larger machines.
Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's not clear that this matters in real life all that much, but as Mel points out, it's used often enough when comparing kernels and so the performance regression shows up like a sore thumb.
It's easy enough to fix at least for the common cases where pipes are used purely for data transfer, and you never have any exciting poll usage at all. So set a special 'poll_usage' flag when there is polling activity, and make the ugly "EPOLLET has crazy legacy expectations" semantics explicit to only that case.
I would love to limit it to just the broken EPOLLET case, but the pipe code can't see the difference between epoll and regular select/poll, so any non-read/write waiting will trigger the extra wakeup behavior. That is sufficient for at least the hackbench case.
Apart from making the odd extra wakeup cases more explicitly about EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe write, not at the first write chunk. That is actually much saner semantics (as much as you can call any of the legacy edge-triggered expectations for EPOLLET "sane") since it means that you know the wakeup will happen once the write is done, rather than possibly in the middle of one.
[ For stable people: I'm putting a "Fixes" tag on this, but I leave it up to you to decide whether you actually want to backport it or not. It likely has no impact outside of synthetic benchmarks - Linus ]
Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/ Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers") Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Sandeep Patil <sspatil@android.com> Tested-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v5.10.53, v5.10.52, v5.10.51, v5.10.50, v5.10.49, v5.13, v5.10.46, v5.10.43, v5.10.42, v5.10.41, v5.10.40, v5.10.39, v5.4.119, v5.10.36, v5.10.35, v5.10.34, v5.4.116, v5.10.33, v5.12, v5.10.32, v5.10.31, v5.10.30, v5.10.27, v5.10.26, v5.10.25, v5.10.24, v5.10.23, v5.10.22, v5.10.21, v5.10.20, v5.10.19, v5.4.101, v5.10.18, v5.10.17, v5.11, v5.10.16, v5.10.15, v5.10.14, v5.10, v5.8.17, v5.8.16, v5.8.15, v5.9, v5.8.14 |
|
#
472e5b05 |
| 01-Oct-2020 |
Linus Torvalds <torvalds@linux-foundation.org> |
pipe: remove pipe_wait() and fix wakeup race with splice
The pipe splice code still used the old model of waiting for pipe IO by using a non-specific "pipe_wait()" that waited for any pipe event to
pipe: remove pipe_wait() and fix wakeup race with splice
The pipe splice code still used the old model of waiting for pipe IO by using a non-specific "pipe_wait()" that waited for any pipe event to happen, which depended on all pipe IO being entirely serialized by the pipe lock. So by checking the state you were waiting for, and then adding yourself to the wait queue before dropping the lock, you were guaranteed to see all the wakeups.
Strictly speaking, the actual wakeups were not done under the lock, but the pipe_wait() model still worked, because since the waiter held the lock when checking whether it should sleep, it would always see the current state, and the wakeup was always done after updating the state.
However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing") split the single wait-queue into two, and in the process also made the "wait for event" code wait for _two_ wait queues, and that then showed a race with the wakers that were not serialized by the pipe lock.
It's only splice that used that "pipe_wait()" model, so the problem wasn't obvious, but Josef Bacik reports:
"I hit a hang with fstest btrfs/187, which does a btrfs send into /dev/null. This works by creating a pipe, the write side is given to the kernel to write into, and the read side is handed to a thread that splices into a file, in this case /dev/null.
The box that was hung had the write side stuck here [pipe_write] and the read side stuck here [splice_from_pipe_next -> pipe_wait].
[ more details about pipe_wait() scenario ]
The problem is we're doing the prepare_to_wait, which sets our state each time, however we can be woken up either with reads or writes. In the case above we race with the WRITER waking us up, and re-set our state to INTERRUPTIBLE, and thus never break out of schedule"
Josef had a patch that avoided the issue in pipe_wait() by just making it set the state only once, but the deeper problem is that pipe_wait() depends on a level of synchonization by the pipe mutex that it really shouldn't. And the whole "wait for any pipe state change" model really isn't very good to begin with.
So rather than trying to work around things in pipe_wait(), remove that legacy model of "wait for arbitrary pipe event" entirely, and actually create functions that wait for the pipe actually being readable or writable, and can do so without depending on the pipe lock serializing everything.
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing") Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/ Reported-by: Josef Bacik <josef@toxicpanda.com> Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
Revision tags: v5.8.13, v5.8.12, v5.8.11, v5.8.10, v5.8.9, v5.8.8, v5.8.7, v5.8.6, v5.4.62, v5.8.5, v5.8.4, v5.4.61, v5.8.3, v5.4.60, v5.8.2, v5.4.59, v5.8.1, v5.4.58, v5.4.57, v5.4.56, v5.8, v5.7.12, v5.4.55, v5.7.11, v5.4.54, v5.7.10, v5.4.53, v5.4.52, v5.7.9, v5.7.8, v5.4.51, v5.4.50, v5.7.7, v5.4.49, v5.7.6, v5.7.5, v5.4.48, v5.7.4, v5.7.3, v5.4.47, v5.4.46, v5.7.2, v5.4.45, v5.7.1, v5.4.44, v5.7, v5.4.43 |
|
#
c928f642 |
| 20-May-2020 |
Christoph Hellwig <hch@lst.de> |
fs: rename pipe_buf ->steal to ->try_steal
And replace the arcane return value convention with a simple bool where true means success and false means failure.
[AV: braino fix folded in]
Signed-off
fs: rename pipe_buf ->steal to ->try_steal
And replace the arcane return value convention with a simple bool where true means success and false means failure.
[AV: braino fix folded in]
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
#
b8d9e7f2 |
| 20-May-2020 |
Christoph Hellwig <hch@lst.de> |
fs: make the pipe_buf_operations ->confirm operation optional
Just return 0 for success if it is not present.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.
fs: make the pipe_buf_operations ->confirm operation optional
Just return 0 for success if it is not present.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
#
76887c25 |
| 20-May-2020 |
Christoph Hellwig <hch@lst.de> |
fs: make the pipe_buf_operations ->steal operation optional
Just return 1 for failure if it is not present.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.or
fs: make the pipe_buf_operations ->steal operation optional
Just return 1 for failure if it is not present.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
#
f6dd9755 |
| 20-May-2020 |
Christoph Hellwig <hch@lst.de> |
pipe: merge anon_pipe_buf*_ops
All the op vectors are exactly the same, they are just used to encode packet or nomerge behavior. There already is a flag for the packet behavior, so just add a new o
pipe: merge anon_pipe_buf*_ops
All the op vectors are exactly the same, they are just used to encode packet or nomerge behavior. There already is a flag for the packet behavior, so just add a new one to allow for merging. Inverting it vs the previous nomerge special casing actually allows for much nicer code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
Revision tags: v5.4.42, v5.4.41, v5.4.40, v5.4.39, v5.4.38, v5.4.37, v5.4.36, v5.4.35, v5.4.34, v5.4.33, v5.4.32, v5.4.31, v5.4.30, v5.4.29, v5.6, v5.4.28, v5.4.27, v5.4.26, v5.4.25, v5.4.24, v5.4.23, v5.4.22, v5.4.21, v5.4.20, v5.4.19, v5.4.18, v5.4.17, v5.4.16, v5.5, v5.4.15, v5.4.14, v5.4.13, v5.4.12 |
|
#
e7d553d6 |
| 14-Jan-2020 |
David Howells <dhowells@redhat.com> |
pipe: Add notification lossage handling
Add handling for loss of notifications by having read() insert a loss-notification message after it has read the pipe buffer that was last in the ring when th
pipe: Add notification lossage handling
Add handling for loss of notifications by having read() insert a loss-notification message after it has read the pipe buffer that was last in the ring when the loss occurred.
Lossage can come about either by running out of notification descriptors or by running out of space in the pipe ring.
Signed-off-by: David Howells <dhowells@redhat.com>
show more ...
|
#
8cfba763 |
| 14-Jan-2020 |
David Howells <dhowells@redhat.com> |
pipe: Allow buffers to be marked read-whole-or-error for notifications
Allow a buffer to be marked such that read() must return the entire buffer in one go or return ENOBUFS. Multiple buffers can b
pipe: Allow buffers to be marked read-whole-or-error for notifications
Allow a buffer to be marked such that read() must return the entire buffer in one go or return ENOBUFS. Multiple buffers can be amalgamated into a single read, but a short read will occur if the next "whole" buffer won't fit.
This is useful for watch queue notifications to make sure we don't split a notification across multiple reads, especially given that we need to fabricate an overrun record under some circumstances - and that isn't in the buffers.
Signed-off-by: David Howells <dhowells@redhat.com>
show more ...
|
#
c73be61c |
| 14-Jan-2020 |
David Howells <dhowells@redhat.com> |
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a standard pipe. Notifications are 'spliced' into the pipe and then read out. spl
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a standard pipe. Notifications are 'spliced' into the pipe and then read out. splice(), vmsplice() and sendfile() are forbidden on pipes used for notifications as post_one_notification() cannot take pipe->mutex. This means that notifications could be posted in between individual pipe buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the number of messages it wishes to be able to queue at once (this can only be set once):
pipe2(fds, O_NOTIFICATION_PIPE); ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data from the pipe. read() will return multiple notifications if the buffer is big enough, but it will not split a notification across buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the caller can split them up.
Each message has a header that describes it:
struct watch_notification { __u32 type:24; __u32 subtype:8; __u32 info; };
The type indicates the source (eg. mount tree changes, superblock events, keyring changes, block layer events) and the subtype indicates the event type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field indicates a number of things, including the entry length, an ID assigned to a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be attached in additional slots. The maximum message size is 127 bytes. Messages may not be padded or aligned, so there is no guarantee, for example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
show more ...
|
#
0bf999f9 |
| 09-Feb-2020 |
Randy Dunlap <rdunlap@infradead.org> |
linux/pipe_fs_i.h: fix kernel-doc warnings after @wait was split
Fix kernel-doc warnings in struct pipe_inode_info after @wait was split into @rd_wait and @wr_wait.
include/linux/pipe_fs_i.h:66:
linux/pipe_fs_i.h: fix kernel-doc warnings after @wait was split
Fix kernel-doc warnings in struct pipe_inode_info after @wait was split into @rd_wait and @wr_wait.
include/linux/pipe_fs_i.h:66: warning: Function parameter or member 'rd_wait' not described in 'pipe_inode_info' include/linux/pipe_fs_i.h:66: warning: Function parameter or member 'wr_wait' not described in 'pipe_inode_info'
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
Revision tags: v5.4.11, v5.4.10, v5.4.9, v5.4.8, v5.4.7, v5.4.6, v5.4.5, v5.4.4, v5.4.3 |
|
#
0ddad21d |
| 09-Dec-2019 |
Linus Torvalds <torvalds@linux-foundation.org> |
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting for readers and writers, avoiding a nasty thundering herd problem when there
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting for readers and writers, avoiding a nasty thundering herd problem when there are lots of readers waiting for data on a pipe (or, less commonly, lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a data transport" case, where you typically only have a single reader and a single writer process, there is one common special case: using a pipe as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way to limit parallelism, where each job consumes a token by reading a byte from the jobserver pipe, and releases the token by writing a byte back to the pipe.
This pattern is fairly traditional on Unix, and works very well, but will waste a lot of time waking up a lot of processes when only a single reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64 processes, and then pass a single token around between them (this test-case also intentionally passes another token that gets ignored to test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv) { int fd[2], counters[2];
pipe(fd); counters[0] = 0; counters[1] = -1; write(fd[1], counters, sizeof(counters));
/* 64 processes */ fork(); fork(); fork(); fork(); fork(); fork();
do { int i; read(fd[0], &i, sizeof(i)); if (i < 0) continue; counters[0] = i+1; write(fd[1], counters, (1+(i & 1)) *sizeof(int)); } while (counters[0] < 1000000); return 0; }
and in a perfect world, passing that token around should only cause one context switch per transfer, when the writer of a token causes a directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on my test box the above case causes more than an order of magnitude more scheduling: instead of the expected ~1M context switches, "perf stat" shows
231,852.37 msec task-clock # 15.857 CPUs utilized 11,250,961 context-switches # 0.049 M/sec 616,304 cpu-migrations # 0.003 M/sec 1,648 page-faults # 0.007 K/sec 1,097,903,998,514 cycles # 4.735 GHz 120,781,778,352 instructions # 0.11 insn per cycle 27,997,056,043 branches # 120.754 M/sec 283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user 3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized 1,212,233 context-switches # 0.232 M/sec 103,951 cpu-migrations # 0.020 M/sec 1,328 page-faults # 0.254 K/sec 21,307,456,166 cycles # 4.074 GHz 12,947,819,999 instructions # 0.61 insn per cycle 2,881,985,678 branches # 551.096 M/sec 64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user 0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a race condition in the make jobserver (in GNU make 4.2.1) for me. It's a long known bug that was fixed back in June 2017 by GNU make commit b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020, so a number of distributions may still have the buggy version. Some have backported the fix to their 4.2.1 release, though, and even without the fix it's quite timing-dependent whether the bug actually is hit. ]
Josh Triplett says: "I've been hammering on your pipe fix patch (switching to exclusive wait queues) for a month or so, on several different systems, and I've run into no issues with it. The patch *substantially* improves parallel build times on large (~100 CPU) systems, both with parallel make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable distributions) have versions of GNU make that no longer have the jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
a28c8b9d |
| 07-Dec-2019 |
Linus Torvalds <torvalds@linux-foundation.org> |
pipe: remove 'waiting_writers' merging logic
This code is ancient, and goes back to when we only had a single page for the pipe buffers. The exact history is hidden in the mists of time (ie "before
pipe: remove 'waiting_writers' merging logic
This code is ancient, and goes back to when we only had a single page for the pipe buffers. The exact history is hidden in the mists of time (ie "before git", and in fact predates the BK repository too).
At that long-ago point in time, it actually helped to try to merge big back-and-forth pipe reads and writes, and not limit pipe reads to the single pipe buffer in length just because that was all we had at a time.
However, since then we've expanded the pipe buffers to multiple pages, and this logic really doesn't seem to make sense. And a lot of it is somewhat questionable (ie "hmm, the user asked for a non-blocking read, but we see that there's a writer pending, so let's wait anyway to get the extra data that the writer will have").
But more importantly, it makes the "go to sleep" logic much less obvious, and considering the wakeup issues we've had, I want to make for less of those kinds of things.
Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
Revision tags: v5.3.15, v5.4.2, v5.4.1, v5.3.14, v5.4, v5.3.13, v5.3.12, v5.3.11, v5.3.10, v5.3.9, v5.3.8, v5.3.7 |
|
#
6718b6f8 |
| 16-Oct-2019 |
David Howells <dhowells@redhat.com> |
pipe: Allow pipes to have kernel-reserved slots
Split pipe->ring_size into two numbers:
(1) pipe->ring_size - indicates the hard size of the pipe ring.
(2) pipe->max_usage - indicates the maximu
pipe: Allow pipes to have kernel-reserved slots
Split pipe->ring_size into two numbers:
(1) pipe->ring_size - indicates the hard size of the pipe ring.
(2) pipe->max_usage - indicates the maximum number of pipe ring slots that userspace orchestrated events can fill.
This allows for a pipe that is both writable by the general kernel notification facility and by userspace, allowing plenty of ring space for notifications to be added whilst preventing userspace from being able to pin too much unswappable kernel space.
Signed-off-by: David Howells <dhowells@redhat.com>
show more ...
|