#
5ecf9d4f |
| 19-Jul-2023 |
Eric Dumazet <edumazet@google.com> |
tcp: annotate data-races around tp->keepalive_intvl
do_tcp_getsockopt() reads tp->keepalive_intvl while another cpu might change its value.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: E
tcp: annotate data-races around tp->keepalive_intvl
do_tcp_getsockopt() reads tp->keepalive_intvl while another cpu might change its value.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
4164245c |
| 19-Jul-2023 |
Eric Dumazet <edumazet@google.com> |
tcp: annotate data-races around tp->keepalive_time
do_tcp_getsockopt() reads tp->keepalive_time while another cpu might change its value.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eri
tcp: annotate data-races around tp->keepalive_time
do_tcp_getsockopt() reads tp->keepalive_time while another cpu might change its value.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
dd23c9f1 |
| 19-Jul-2023 |
Eric Dumazet <edumazet@google.com> |
tcp: annotate data-races around tp->tsoffset
do_tcp_getsockopt() reads tp->tsoffset while another cpu might change its value.
Fixes: 93be6ce0e91b ("tcp: set and get per-socket timestamp") Signed-of
tcp: annotate data-races around tp->tsoffset
do_tcp_getsockopt() reads tp->tsoffset while another cpu might change its value.
Fixes: 93be6ce0e91b ("tcp: set and get per-socket timestamp") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
348b81b6 |
| 19-Jul-2023 |
Eric Dumazet <edumazet@google.com> |
tcp: annotate data-races around tp->tcp_tx_delay
do_tcp_getsockopt() reads tp->tcp_tx_delay while another cpu might change its value.
Fixes: a842fe1425cb ("tcp: add optional per socket transmit del
tcp: annotate data-races around tp->tcp_tx_delay
do_tcp_getsockopt() reads tp->tcp_tx_delay while another cpu might change its value.
Fixes: a842fe1425cb ("tcp: add optional per socket transmit delay") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.1.39 |
|
#
dfa2f048 |
| 17-Jul-2023 |
Eric Dumazet <edumazet@google.com> |
tcp: get rid of sysctl_tcp_adv_win_scale
With modern NIC drivers shifting to full page allocations per received frame, we face the following issue:
TCP has one per-netns sysctl used to tweak how to
tcp: get rid of sysctl_tcp_adv_win_scale
With modern NIC drivers shifting to full page allocations per received frame, we face the following issue:
TCP has one per-netns sysctl used to tweak how to translate a memory use into an expected payload (RWIN), in RX path.
tcp_win_from_space() implementation is limited to few cases.
For hosts dealing with various MSS, we either under estimate or over estimate the RWIN we send to the remote peers.
For instance with the default sysctl_tcp_adv_win_scale value, we expect to store 50% of payload per allocated chunk of memory.
For the typical use of MTU=1500 traffic, and order-0 pages allocations by NIC drivers, we are sending too big RWIN, leading to potential tcp collapse operations, which are extremely expensive and source of latency spikes.
This patch makes sysctl_tcp_adv_win_scale obsolete, and instead uses a per socket scaling factor, so that we can precisely adjust the RWIN based on effective skb->len/skb->truesize ratio.
This patch alone can double TCP receive performance when receivers are too slow to drain their receive queue, or by allowing a bigger RWIN when MSS is close to PAGE_SIZE.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
50501936 |
| 17-Jul-2023 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge tag 'v6.4' into next
Sync up with mainline to bring in updates to shared infrastructure.
|
#
0791faeb |
| 17-Jul-2023 |
Mark Brown <broonie@kernel.org> |
ASoC: Merge v6.5-rc2
Get a similar baseline to my other branches, and fixes for people using the branch.
|
#
2f98e686 |
| 11-Jul-2023 |
Maxime Ripard <mripard@kernel.org> |
Merge v6.5-rc1 into drm-misc-fixes
Boris needs 6.5-rc1 in drm-misc-fixes to prevent a conflict.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
|
Revision tags: v6.1.38, v6.1.37 |
|
#
44f10dbe |
| 30-Jun-2023 |
Andrew Morton <akpm@linux-foundation.org> |
Merge branch 'master' into mm-hotfixes-stable
|
#
0a30901b |
| 30-Jun-2023 |
Andrew Morton <akpm@linux-foundation.org> |
Merge branch 'master' into mm-hotfixes-stable
|
#
3a8a670e |
| 28-Jun-2023 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Jakub Kicinski: "WiFi 7 and sendpage changes are the biggest pieces of work fo
Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Jakub Kicinski: "WiFi 7 and sendpage changes are the biggest pieces of work for this release. The latter will definitely require fixes but I think that we got it to a reasonable point.
Core:
- Rework the sendpage & splice implementations
Instead of feeding data into sockets page by page extend sendmsg handlers to support taking a reference on the data, controlled by a new flag called MSG_SPLICE_PAGES
Rework the handling of unexpected-end-of-file to invoke an additional callback instead of trying to predict what the right combination of MORE/NOTLAST flags is
Remove the MSG_SENDPAGE_NOTLAST flag completely
- Implement SCM_PIDFD, a new type of CMSG type analogous to SCM_CREDENTIALS, but it contains pidfd instead of plain pid
- Enable socket busy polling with CONFIG_RT
- Improve reliability and efficiency of reporting for ref_tracker
- Auto-generate a user space C library for various Netlink families
Protocols:
- Allow TCP to shrink the advertised window when necessary, prevent sk_rcvbuf auto-tuning from growing the window all the way up to tcp_rmem[2]
- Use per-VMA locking for "page-flipping" TCP receive zerocopy
- Prepare TCP for device-to-device data transfers, by making sure that payloads are always attached to skbs as page frags
- Make the backoff time for the first N TCP SYN retransmissions linear. Exponential backoff is unnecessarily conservative
- Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO)
- Avoid waking up applications using TLS sockets until we have a full record
- Allow using kernel memory for protocol ioctl callbacks, paving the way to issuing ioctls over io_uring
- Add nolocalbypass option to VxLAN, forcing packets to be fully encapsulated even if they are destined for a local IP address
- Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure in-kernel ECMP implementation (e.g. Open vSwitch) select the same link for all packets. Support L4 symmetric hashing in Open vSwitch
- PPPoE: make number of hash bits configurable
- Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client (ipconfig)
- Add layer 2 miss indication and filtering, allowing higher layers (e.g. ACL filters) to make forwarding decisions based on whether packet matched forwarding state in lower devices (bridge)
- Support matching on Connectivity Fault Management (CFM) packets
- Hide the "link becomes ready" IPv6 messages by demoting their printk level to debug
- HSR: don't enable promiscuous mode if device offloads the proto
- Support active scanning in IEEE 802.15.4
- Continue work on Multi-Link Operation for WiFi 7
BPF:
- Add precision propagation for subprogs and callbacks. This allows maintaining verification efficiency when subprograms are used, or in fact passing the verifier at all for complex programs, especially those using open-coded iterators
- Improve BPF's {g,s}setsockopt() length handling. Previously BPF assumed the length is always equal to the amount of written data. But some protos allow passing a NULL buffer to discover what the output buffer *should* be, without writing anything
- Accept dynptr memory as memory arguments passed to helpers
- Add routing table ID to bpf_fib_lookup BPF helper
- Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
- Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark maps as read-only)
- Show target_{obj,btf}_id in tracing link fdinfo
- Addition of several new kfuncs (most of the names are self-explanatory): - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(), bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size() and bpf_dynptr_clone(). - bpf_task_under_cgroup() - bpf_sock_destroy() - force closing sockets - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
Netfilter:
- Relax set/map validation checks in nf_tables. Allow checking presence of an entry in a map without using the value
- Increase ip_vs_conn_tab_bits range for 64BIT builds
- Allow updating size of a set
- Improve NAT tuple selection when connection is closing
Driver API:
- Integrate netdev with LED subsystem, to allow configuring HW "offloaded" blinking of LEDs based on link state and activity (i.e. packets coming in and out)
- Support configuring rate selection pins of SFP modules
- Factor Clause 73 auto-negotiation code out of the drivers, provide common helper routines
- Add more fool-proof helpers for managing lifetime of MDIO devices associated with the PCS layer
- Allow drivers to report advanced statistics related to Time Aware scheduler offload (taprio)
- Allow opting out of VF statistics in link dump, to allow more VFs to fit into the message
- Split devlink instance and devlink port operations
New hardware / drivers:
- Ethernet: - Synopsys EMAC4 IP support (stmmac) - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches - Marvell 88E6250 7 port switches - Microchip LAN8650/1 Rev.B0 PHYs - MediaTek MT7981/MT7988 built-in 1GE PHY driver
- WiFi: - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps - Realtek RTL8723DS (SDIO variant) - Realtek RTL8851BE
- CAN: - Fintek F81604
Drivers:
- Ethernet NICs: - Intel (100G, ice): - support dynamic interrupt allocation - use meta data match instead of VF MAC addr on slow-path - nVidia/Mellanox: - extend link aggregation to handle 4, rather than just 2 ports - spawn sub-functions without any features by default - OcteonTX2: - support HTB (Tx scheduling/QoS) offload - make RSS hash generation configurable - support selecting Rx queue using TC filters - Wangxun (ngbe/txgbe): - add basic Tx/Rx packet offloads - add phylink support (SFP/PCS control) - Freescale/NXP (enetc): - report TAPRIO packet statistics - Solarflare/AMD: - support matching on IP ToS and UDP source port of outer header - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6 - add devlink dev info support for EF10
- Virtual NICs: - Microsoft vNIC: - size the Rx indirection table based on requested configuration - support VLAN tagging - Amazon vNIC: - try to reuse Rx buffers if not fully consumed, useful for ARM servers running with 16kB pages - Google vNIC: - support TCP segmentation of >64kB frames
- Ethernet embedded switches: - Marvell (mv88e6xxx): - enable USXGMII (88E6191X) - Microchip: - lan966x: add support for Egress Stage 0 ACL engine - lan966x: support mapping packet priority to internal switch priority (based on PCP or DSCP)
- Ethernet PHYs: - Broadcom PHYs: - support for Wake-on-LAN for BCM54210E/B50212E - report LPI counter - Microsemi PHYs: support RGMII delay configuration (VSC85xx) - Micrel PHYs: receive timestamp in the frame (LAN8841) - Realtek PHYs: support optional external PHY clock - Altera TSE PCS: merge the driver into Lynx PCS which it is a variant of
- CAN: Kvaser PCIEcan: - support packet timestamping
- WiFi: - Intel (iwlwifi): - major update for new firmware and Multi-Link Operation (MLO) - configuration rework to drop test devices and split the different families - support for segmented PNVM images and power tables - new vendor entries for PPAG (platform antenna gain) feature - Qualcomm 802.11ax (ath11k): - Multiple Basic Service Set Identifier (MBSSID) and Enhanced MBSSID Advertisement (EMA) support in AP mode - support factory test mode - RealTek (rtw89): - add RSSI based antenna diversity - support U-NII-4 channels on 5 GHz band - RealTek (rtl8xxxu): - AP mode support for 8188f - support USB RX aggregation for the newer chips"
* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits) net: scm: introduce and use scm_recv_unix helper af_unix: Skip SCM_PIDFD if scm->pid is NULL. net: lan743x: Simplify comparison netlink: Add __sock_i_ino() for __netlink_diag_dump(). net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses Revert "af_unix: Call scm_recv() only after scm_set_cred()." phylink: ReST-ify the phylink_pcs_neg_mode() kdoc libceph: Partially revert changes to support MSG_SPLICE_PAGES net: phy: mscc: fix packet loss due to RGMII delays net: mana: use vmalloc_array and vcalloc net: enetc: use vmalloc_array and vcalloc ionic: use vmalloc_array and vcalloc pds_core: use vmalloc_array and vcalloc gve: use vmalloc_array and vcalloc octeon_ep: use vmalloc_array and vcalloc net: usb: qmi_wwan: add u-blox 0x1312 composition perf trace: fix MSG_SPLICE_PAGES build error ipvlan: Fix return value of ipvlan_queue_xmit() netfilter: nf_tables: fix underflow in chain reference counter netfilter: nf_tables: unbind non-anonymous set if rule construction fails ...
show more ...
|
Revision tags: v6.1.36 |
|
#
e80b5003 |
| 27-Jun-2023 |
Jiri Kosina <jkosina@suse.cz> |
Merge branch 'for-6.5/apple' into for-linus
- improved support for Keychron K8 keyboard (Lasse Brun)
|
#
5f004bca |
| 27-Jun-2023 |
Jason Gunthorpe <jgg@nvidia.com> |
Merge tag 'v6.4' into rdma.git for-next
Linux 6.4
Resolve conflicts between rdma rc and next in rxe_cq matching linux-next:
drivers/infiniband/sw/rxe/rxe_cq.c: https://lore.kernel.org/r/20230622
Merge tag 'v6.4' into rdma.git for-next
Linux 6.4
Resolve conflicts between rdma rc and next in rxe_cq matching linux-next:
drivers/infiniband/sw/rxe/rxe_cq.c: https://lore.kernel.org/r/20230622115246.365d30ad@canb.auug.org.au
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
show more ...
|
#
53ae158f |
| 27-Jun-2023 |
Russell King (Oracle) <rmk+kernel@armlinux.org.uk> |
Merge tag 'arm-vfp-refactor-for-rmk' of git://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux into devel-stable
Refactor VFP support code and reimplement in C
The VFP related changes to permit k
Merge tag 'arm-vfp-refactor-for-rmk' of git://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux into devel-stable
Refactor VFP support code and reimplement in C
The VFP related changes to permit kernel mode NEON in softirq context resulted in some issues regarding en/disabling of sofirqs from asm code, and this made it clear that it would be better to handle more of it from C code.
Given that we already have infrastructure that associates undefined instruction exceptions with handler code based on value/mask pairs, we can easily move the dispatch of VFP and NEON instructions to C code once we reimplement the actual VFP support routine (which reasons about how to deal with the exception and whether any emulation is needed) in C code first.
With those out of the way, we can drop the partial decoding logic in asm that reasons about which ISA is being used by user space, as the remaining cases are all 32-bit ARM only. This leaves a FPE specific routine with some iWMMXT logic that is easily duplicated in C as well, allowing us to move the FPE asm code into the FPE asm source file, and out of the shared entry code.
show more ...
|
#
a15b5137 |
| 26-Jun-2023 |
Takashi Iwai <tiwai@suse.de> |
Merge branch 'for-next' into for-linus
Pull the 6.5-devel branch for upstreaming.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
#
f121ab7f |
| 26-Jun-2023 |
Thomas Gleixner <tglx@linutronix.de> |
Merge tag 'irqchip-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates from Marc Zyngier:
- A number of Loogson/Loogarch fixes
- Allow th
Merge tag 'irqchip-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates from Marc Zyngier:
- A number of Loogson/Loogarch fixes
- Allow the core code to retrigger an interrupt that has fired while the same interrupt is being handled on another CPU, papering over a GICv3 architecture issue
- Work around an integration problem on ASR8601, where the CPU numbering isn't representable in the GIC implementation...
- Add some missing interrupt to the STM32 irqchip
- A bunch of warning squashing triggered by W=1 builds
Link: https://lore.kernel.org/r/20230623224345.3577134-1-maz@kernel.org
show more ...
|
Revision tags: v6.4 |
|
#
9ae440b8 |
| 24-Jun-2023 |
Jakub Kicinski <kuba@kernel.org> |
Merge branch 'splice-net-switch-over-users-of-sendpage-and-remove-it'
David Howells says:
==================== splice, net: Switch over users of sendpage() and remove it
Here's the final set of pa
Merge branch 'splice-net-switch-over-users-of-sendpage-and-remove-it'
David Howells says:
==================== splice, net: Switch over users of sendpage() and remove it
Here's the final set of patches towards the removal of sendpage. All the drivers that use sendpage() get switched over to using sendmsg() with MSG_SPLICE_PAGES.
The following changes are made:
(1) Make the protocol drivers behave according to MSG_MORE, not MSG_SENDPAGE_NOTLAST. The latter is restricted to turning on MSG_MORE in the sendpage() wrappers.
(2) Fix ocfs2 to allocate its global protocol buffers with folio_alloc() rather than kzalloc() so as not to invoke the !sendpage_ok warning in skb_splice_from_iter().
(3) Make ceph/rds, skb_send_sock, dlm, nvme, smc, ocfs2, drbd and iscsi use sendmsg(), not sendpage and make them specify MSG_MORE instead of MSG_SENDPAGE_NOTLAST.
(4) Kill off sendpage and clean up MSG_SENDPAGE_NOTLAST.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=51c78a4d532efe9543a4df019ff405f05c6157f6 # part 1 Link: https://lore.kernel.org/r/20230616161301.622169-1-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20230617121146.716077-1-dhowells@redhat.com/ # v2 Link: https://lore.kernel.org/r/20230620145338.1300897-1-dhowells@redhat.com/ # v3 ====================
Link: https://lore.kernel.org/r/20230623225513.2732256-1-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
dc97391e |
| 23-Jun-2023 |
David Howells <dhowells@redhat.com> |
sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)
Remove ->sendpage() and ->sendpage_locked(). sendmsg() with MSG_SPLICE_PAGES should be used instead. This allows multiple pages an
sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)
Remove ->sendpage() and ->sendpage_locked(). sendmsg() with MSG_SPLICE_PAGES should be used instead. This allows multiple pages and multipage folios to be passed through.
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-afs@lists.infradead.org cc: mptcp@lists.linux.dev cc: rds-devel@oss.oracle.com cc: tipc-discussion@lists.sourceforge.net cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20230623225513.2732256-16-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
2fe11c9d |
| 23-Jun-2023 |
Pavel Begunkov <asml.silence@gmail.com> |
net/tcp: optimise locking for blocking splice
Even when tcp_splice_read() reads all it was asked for, for blocking sockets it'll release and immediately regrab the socket lock, loop around and break
net/tcp: optimise locking for blocking splice
Even when tcp_splice_read() reads all it was asked for, for blocking sockets it'll release and immediately regrab the socket lock, loop around and break on the while check.
Check tss.len right after we adjust it, and return if we're done. That saves us one release_sock(); lock_sock(); pair per successful blocking splice read.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/80736a2cc6d478c383ea565ba825eaf4d1abd876.1687523671.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.1.35 |
|
#
de8a334f |
| 19-Jun-2023 |
Thomas Zimmermann <tzimmermann@suse.de> |
Merge drm/drm-next into drm-misc-next
Backmerging into drm-misc-next to get commit 2c1c7ba457d4 ("drm/amdgpu: support partition drm devices"), which is required to fix commit 0adec22702d4 ("drm: Rem
Merge drm/drm-next into drm-misc-next
Backmerging into drm-misc-next to get commit 2c1c7ba457d4 ("drm/amdgpu: support partition drm devices"), which is required to fix commit 0adec22702d4 ("drm: Remove struct drm_driver.gem_prime_mmap").
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
show more ...
|
#
cce3b573 |
| 19-Jun-2023 |
Dave Airlie <airlied@redhat.com> |
Backmerge tag 'v6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into drm-next
Linux 6.4-rc7
Need this to pull in the msm work.
Signed-off-by: Dave Airlie <airlied@redhat.c
Backmerge tag 'v6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into drm-next
Linux 6.4-rc7
Need this to pull in the msm work.
Signed-off-by: Dave Airlie <airlied@redhat.com>
show more ...
|
#
7a7f0946 |
| 16-Jun-2023 |
Arjun Roy <arjunroy@google.com> |
tcp: Use per-vma locking for receive zerocopy
Per-VMA locking allows us to lock a struct vm_area_struct without taking the process-wide mmap lock in read mode.
Consider a process workload where the
tcp: Use per-vma locking for receive zerocopy
Per-VMA locking allows us to lock a struct vm_area_struct without taking the process-wide mmap lock in read mode.
Consider a process workload where the mmap lock is taken constantly in write mode. In this scenario, all zerocopy receives are periodically blocked during that period of time - though in principle, the memory ranges being used by TCP are not touched by the operations that need the mmap write lock. This results in performance degradation.
Now consider another workload where the mmap lock is never taken in write mode, but there are many TCP connections using receive zerocopy that are concurrently receiving. These connections all take the mmap lock in read mode, but this does induce a lot of contention and atomic ops for this process-wide lock. This results in additional CPU overhead caused by contending on the cache line for this lock.
However, with per-vma locking, both of these problems can be avoided.
As a test, I ran an RPC-style request/response workload with 4KB payloads and receive zerocopy enabled, with 100 simultaneous TCP connections. I measured perf cycles within the find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and without per-vma locking enabled.
When using process-wide mmap semaphore read locking, about 1% of measured perf cycles were within this path. With per-VMA locking, this value dropped to about 0.45%.
Signed-off-by: Arjun Roy <arjunroy@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
c1362fd0 |
| 17-Jun-2023 |
Conor Dooley <conor.dooley@microchip.com> |
Merge patch series "Add Sipeed Lichee Pi 4A RISC-V board support"
Jisheng Zhang <jszhang@kernel.org> says:
Sipeed's Lichee Pi 4A development board uses Lichee Module 4A core module which is powered
Merge patch series "Add Sipeed Lichee Pi 4A RISC-V board support"
Jisheng Zhang <jszhang@kernel.org> says:
Sipeed's Lichee Pi 4A development board uses Lichee Module 4A core module which is powered by T-HEAD's TH1520 SoC. Add minimal device tree files for the core module and the development board.
Support basic uart/gpio/dmac drivers, so supports booting to a basic shell.
This also pulls in -rc2, because of some maintainers re-jigging that went on in the interim in commit 80e62bc8487b ("MAINTAINERS: re-sort all entries and fields").
Link: https://lore.kernel.org/r/20230617161529.2092-1-jszhang@kernel.org Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
show more ...
|
Revision tags: v6.1.34 |
|
#
e1d001fa |
| 09-Jun-2023 |
Breno Leitao <leitao@debian.org> |
net: ioctl: Use kernel memory on protocol ioctl callbacks
Most of the ioctls to net protocols operates directly on userspace argument (arg). Usually doing get_user()/put_user() directly in the ioctl
net: ioctl: Use kernel memory on protocol ioctl callbacks
Most of the ioctls to net protocols operates directly on userspace argument (arg). Usually doing get_user()/put_user() directly in the ioctl callback. This is not flexible, because it is hard to reuse these functions without passing userspace buffers.
Change the "struct proto" ioctls to avoid touching userspace memory and operate on kernel buffers, i.e., all protocol's ioctl callbacks is adapted to operate on a kernel memory other than on userspace (so, no more {put,get}_user() and friends being called in the ioctl callback).
This changes the "struct proto" ioctl format in the following way:
int (*ioctl)(struct sock *sk, int cmd, - unsigned long arg); + int *karg);
(Important to say that this patch does not touch the "struct proto_ops" protocols)
So, the "karg" argument, which is passed to the ioctl callback, is a pointer allocated to kernel space memory (inside a function wrapper). This buffer (karg) may contain input argument (copied from userspace in a prep function) and it might return a value/buffer, which is copied back to userspace if necessary. There is not one-size-fits-all format (that is I am using 'may' above), but basically, there are three type of ioctls:
1) Do not read from userspace, returns a result to userspace 2) Read an input parameter from userspace, and does not return anything to userspace 3) Read an input from userspace, and return a buffer to userspace.
The default case (1) (where no input parameter is given, and an "int" is returned to userspace) encompasses more than 90% of the cases, but there are two other exceptions. Here is a list of exceptions:
* Protocol RAW: * cmd = SIOCGETVIFCNT: * input and output = struct sioc_vif_req * cmd = SIOCGETSGCNT * input and output = struct sioc_sg_req * Explanation: for the SIOCGETVIFCNT case, userspace passes the input argument, which is struct sioc_vif_req. Then the callback populates the struct, which is copied back to userspace.
* Protocol RAW6: * cmd = SIOCGETMIFCNT_IN6 * input and output = struct sioc_mif_req6 * cmd = SIOCGETSGCNT_IN6 * input and output = struct sioc_sg_req6
* Protocol PHONET: * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE * input int (4 bytes) * Nothing is copied back to userspace.
For the exception cases, functions sock_sk_ioctl_inout() will copy the userspace input, and copy it back to kernel space.
The wrapper that prepare the buffer and put the buffer back to user is sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the callee now calls sk_ioctl(), which will handle all cases.
Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
db6da59c |
| 15-Jun-2023 |
Thomas Zimmermann <tzimmermann@suse.de> |
Merge drm/drm-next into drm-misc-next-fixes
Backmerging to sync drm-misc-next-fixes with drm-misc-next.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
|