Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8 |
|
#
3bc29c78 |
| 16-Dec-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Use c->unit_size to select target cache during free
[ Upstream commit 7ac5c53e00735d183a0f5e2cfce5eeb6c16319f2 ]
At present, bpf memory allocator uses check_obj_size() to ensure that ksize() o
bpf: Use c->unit_size to select target cache during free
[ Upstream commit 7ac5c53e00735d183a0f5e2cfce5eeb6c16319f2 ]
At present, bpf memory allocator uses check_obj_size() to ensure that ksize() of allocated pointer is equal with the unit_size of used bpf_mem_cache. Its purpose is to prevent bpf_mem_free() from selecting a bpf_mem_cache which has different unit_size compared with the bpf_mem_cache used for allocation. But as reported by lkp, the return value of ksize() or kmalloc_size_roundup() may change due to slab merge and it will lead to the warning report in check_obj_size().
The reported warning happened as follows: (1) in bpf_mem_cache_adjust_size(), kmalloc_size_roundup(96) returns the object_size of kmalloc-96 instead of kmalloc-cg-96. The object_size of kmalloc-96 is 96, so size_index for 96 is not adjusted accordingly. (2) the object_size of kmalloc-cg-96 is adjust from 96 to 128 due to slab merge in __kmem_cache_alias(). For SLAB, SLAB_HWCACHE_ALIGN is enabled by default for kmalloc slab, so align is 64 and size is 128 for kmalloc-cg-96. SLUB has a similar merge logic, but its object_size will not be changed, because its align is 8 under x86-64. (3) when unit_alloc() does kmalloc_node(96, __GFP_ACCOUNT, node), ksize() returns 128 instead of 96 for the returned pointer. (4) the warning in check_obj_size() is triggered.
Considering the slab merge can happen in anytime (e.g, a slab created in a new module), the following case is also possible: during the initialization of bpf_global_ma, there is no slab merge and ksize() for a 96-bytes object returns 96. But after that a new slab created by a kernel module is merged to kmalloc-cg-96 and the object_size of kmalloc-cg-96 is adjust from 96 to 128 (which is possible for x86-64 + CONFIG_SLAB, because its alignment requirement is 64 for 96-bytes slab). So soon or later, when bpf_global_ma frees a 96-byte-sized pointer which is allocated from bpf_mem_cache with unit_size=96, bpf_mem_free() will free the pointer through a bpf_mem_cache in which unit_size is 128, because the return value of ksize() changes. The warning for the mismatch will be triggered again.
A feasible fix is introducing similar APIs compared with ksize() and kmalloc_size_roundup() to return the actually-allocated size instead of size which may change due to slab merge, but it will introduce unnecessary dependency on the implementation details of mm subsystem.
As for now the pointer of bpf_mem_cache is saved in the 8-bytes area (or 4-bytes under 32-bit host) above the returned pointer, using unit_size in the saved bpf_mem_cache to select the target cache instead of inferring the size from the pointer itself. Beside no extra dependency on mm subsystem, the performance for bpf_mem_free_rcu() is also improved as shown below.
Before applying the patch, the performances of bpf_mem_alloc() and bpf_mem_free_rcu() on 8-CPUs VM with one producer are as follows:
kmalloc : alloc 11.69 ± 0.28M/s free 29.58 ± 0.93M/s percpu : alloc 14.11 ± 0.52M/s free 14.29 ± 0.99M/s
After apply the patch, the performance for bpf_mem_free_rcu() increases 9% and 146% for kmalloc memory and per-cpu memory respectively:
kmalloc: alloc 11.01 ± 0.03M/s free 32.42 ± 0.48M/s percpu: alloc 12.84 ± 0.12M/s free 35.24 ± 0.23M/s
After the fixes, there is no need to adjust size_index to fix the mismatch between allocation and free, so remove it as well. Also return NULL instead of ZERO_SIZE_PTR for zero-sized alloc in bpf_mem_alloc(), because there is no bpf_mem_cache pointer saved above ZERO_SIZE_PTR.
Fixes: 9077fc228f09 ("bpf: Use kmalloc_size_roundup() to adjust size_index") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/bpf/202310302113.9f8fe705-oliver.sang@intel.com Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231216131052.27621-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9 |
|
#
63ddf081 |
| 20-Oct-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Use pcpu_alloc_size() in bpf_mem_free{_rcu}()
[ Upstream commit 3f2189e4f77b7a3e979d143dc4ff586488c7e8a5 ]
For bpf_global_percpu_ma, the pointer passed to bpf_mem_free_rcu() is allocated by km
bpf: Use pcpu_alloc_size() in bpf_mem_free{_rcu}()
[ Upstream commit 3f2189e4f77b7a3e979d143dc4ff586488c7e8a5 ]
For bpf_global_percpu_ma, the pointer passed to bpf_mem_free_rcu() is allocated by kmalloc() and its size is fixed (16-bytes on x86-64). So no matter which cache allocates the dynamic per-cpu area, on x86-64 cache[2] will always be used to free the per-cpu area.
Fix the unbalance by checking whether the bpf memory allocator is per-cpu or not and use pcpu_alloc_size() instead of ksize() to find the correct cache for per-cpu free.
Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: 7ac5c53e0073 ("bpf: Use c->unit_size to select target cache during free") Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
62752b67 |
| 20-Oct-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Re-enable unit_size checking for global per-cpu allocator
[ Upstream commit baa8fdecd87bb8751237b45e3bcb5a179e5a12ca ]
With pcpu_alloc_size() in place, check whether or not the size of the dyn
bpf: Re-enable unit_size checking for global per-cpu allocator
[ Upstream commit baa8fdecd87bb8751237b45e3bcb5a179e5a12ca ]
With pcpu_alloc_size() in place, check whether or not the size of the dynamic per-cpu area is matched with unit_size.
Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: 7ac5c53e0073 ("bpf: Use c->unit_size to select target cache during free") Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
d9105720 |
| 10-Nov-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Add missed allocation hint for bpf_mem_cache_alloc_flags()
[ Upstream commit 75a442581d05edaee168222ffbe00d4389785636 ]
bpf_mem_cache_alloc_flags() may call __alloc() directly when there is no
bpf: Add missed allocation hint for bpf_mem_cache_alloc_flags()
[ Upstream commit 75a442581d05edaee168222ffbe00d4389785636 ]
bpf_mem_cache_alloc_flags() may call __alloc() directly when there is no free object in free list, but it doesn't initialize the allocation hint for the returned pointer. It may lead to bad memory dereference when freeing the pointer, so fix it by initializing the allocation hint.
Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231111043821.2258513-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.5.8, v6.5.7, v6.5.6 |
|
#
9077fc22 |
| 28-Sep-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Use kmalloc_size_roundup() to adjust size_index
Commit d52b59315bf5 ("bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZE") uses KMALLOC_MIN_SIZE to adjust size_index, but as repo
bpf: Use kmalloc_size_roundup() to adjust size_index
Commit d52b59315bf5 ("bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZE") uses KMALLOC_MIN_SIZE to adjust size_index, but as reported by Nathan, the adjustment is not enough, because __kmalloc_minalign() also decides the minimal alignment of slab object as shown in new_kmalloc_cache() and its value may be greater than KMALLOC_MIN_SIZE (e.g., 64 bytes vs 8 bytes under a riscv QEMU VM).
Instead of invoking __kmalloc_minalign() in bpf subsystem to find the maximal alignment, just using kmalloc_size_roundup() directly to get the corresponding slab object size for each allocation size. If these two sizes are unmatched, adjust size_index to select a bpf_mem_cache with unit_size equal to the object_size of the underlying slab cache for the allocation size.
Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/bpf/20230914181407.GA1000274@dev-arch.thelio-3990X/ Signed-off-by: Hou Tao <houtao1@huawei.com> Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com> Link: https://lore.kernel.org/r/20230928101558.2594068-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
Revision tags: v6.5.5, v6.5.4 |
|
#
dca7acd8 |
| 13-Sep-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Skip unit_size checking for global per-cpu allocator
For global per-cpu allocator, the size of free object in free list doesn't match with unit_size and now there is no way to get the size of p
bpf: Skip unit_size checking for global per-cpu allocator
For global per-cpu allocator, the size of free object in free list doesn't match with unit_size and now there is no way to get the size of per-cpu pointer saved in free object, so just skip the checking.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/bpf/20230913133436.0eeec4cb@canb.auug.org.au/ Signed-off-by: Hou Tao <houtao1@huawei.com> Tested-by: Biju Das <biju.das.jz@bp.renesas.com> Link: https://lore.kernel.org/r/20230913135943.3137292-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
Revision tags: v6.5.3 |
|
#
c9304725 |
| 08-Sep-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Ensure unit_size is matched with slab cache object size
Add extra check in bpf_mem_alloc_init() to ensure the unit_size of bpf_mem_cache is matched with the object_size of underlying slab cache
bpf: Ensure unit_size is matched with slab cache object size
Add extra check in bpf_mem_alloc_init() to ensure the unit_size of bpf_mem_cache is matched with the object_size of underlying slab cache. If these two sizes are unmatched, print a warning once and return -EINVAL in bpf_mem_alloc_init(), so the mismatch can be found early and the potential issue can be prevented.
Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230908133923.2675053-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
b1d53958 |
| 08-Sep-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Don't prefill for unused bpf_mem_cache
When the unit_size of a bpf_mem_cache is unmatched with the object_size of the underlying slab cache, the bpf_mem_cache will not be used, and the allocati
bpf: Don't prefill for unused bpf_mem_cache
When the unit_size of a bpf_mem_cache is unmatched with the object_size of the underlying slab cache, the bpf_mem_cache will not be used, and the allocation will be redirected to a bpf_mem_cache with a bigger unit_size instead, so there is no need to prefill for these unused bpf_mem_caches.
Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230908133923.2675053-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
d52b5931 |
| 08-Sep-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZE
The following warning was reported when running "./test_progs -a link_api -a linked_list" on a RISC-V QEMU VM:
------------[ cut
bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZE
The following warning was reported when running "./test_progs -a link_api -a linked_list" on a RISC-V QEMU VM:
------------[ cut here ]------------ WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 bpf_mem_refill Modules linked in: bpf_testmod(OE) CPU: 3 PID: 261 Comm: test_progs- ... 6.5.0-rc5-01743-gdcb152bb8328 #2 Hardware name: riscv-virtio,qemu (DT) epc : bpf_mem_refill+0x1fc/0x206 ra : irq_work_single+0x68/0x70 epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20 gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600 t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70 s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000 a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060 a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049 s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000 s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30 s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000 t5 : ff6000008fd28278 t6 : 0000000000040000 [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206 [<ffffffff8015fe84>] irq_work_single+0x68/0x70 [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36 [<ffffffff8015fefa>] irq_work_run+0x38/0x66 [<ffffffff8000828a>] handle_IPI+0x3a/0xb4 [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8 [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36 [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88 [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36 [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86 [<ffffffff812b6904>] do_irq+0x66/0x98 ---[ end trace 0000000000000000 ]---
The warning is due to WARN_ON_ONCE(tgt->unit_size != c->unit_size) in free_bulk(). The direct reason is that a object is allocated and freed by bpf_mem_caches with different unit_size.
The root cause is that KMALLOC_MIN_SIZE is 64 and there is no 96-bytes slab cache in the specific VM. When linked_list test allocates a 72-bytes object through bpf_obj_new(), bpf_global_ma will allocate it from a bpf_mem_cache with 96-bytes unit_size, but this bpf_mem_cache is backed by 128-bytes slab cache. When the object is freed, bpf_mem_free() uses ksize() to choose the corresponding bpf_mem_cache. Because the object is allocated from 128-bytes slab cache, ksize() returns 128, bpf_mem_free() chooses a 128-bytes bpf_mem_cache to free the object and triggers the warning.
A similar warning will also be reported when using CONFIG_SLAB instead of CONFIG_SLUB in a x86-64 kernel. Because CONFIG_SLUB defines KMALLOC_MIN_SIZE as 8 but CONFIG_SLAB defines KMALLOC_MIN_SIZE as 32.
An alternative fix is to use kmalloc_size_round() in bpf_mem_alloc() to choose a bpf_mem_cache which has the same unit_size with the backing slab cache, but it may introduce performance degradation, so fix the warning by adjusting the indexes in size_index according to the value of KMALLOC_MIN_SIZE just like setup_kmalloc_cache_index_table() does.
Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.") Reported-by: Björn Töpel <bjorn@kernel.org> Closes: https://lore.kernel.org/bpf/87jztjmmy4.fsf@all.your.base.are.belong.to.us Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230908133923.2675053-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
Revision tags: v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43 |
|
#
d1a02358 |
| 27-Jul-2023 |
YiFei Zhu <zhuyifei@google.com> |
bpf: Non-atomically allocate freelist during prefill
In internal testing of test_maps, we sometimes observed failures like: test_maps: test_maps.c:173: void test_hashmap_percpu(unsigned int, void
bpf: Non-atomically allocate freelist during prefill
In internal testing of test_maps, we sometimes observed failures like: test_maps: test_maps.c:173: void test_hashmap_percpu(unsigned int, void *): Assertion `bpf_map_update_elem(fd, &key, value, BPF_ANY) == 0' failed. where the errno is ENOMEM. After some troubleshooting and enabling the warnings, we saw: [ 91.304708] percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left [ 91.304716] CPU: 51 PID: 24145 Comm: test_maps Kdump: loaded Tainted: G N 6.1.38-smp-DEV #7 [ 91.304719] Hardware name: Google Astoria/astoria, BIOS 0.20230627.0-0 06/27/2023 [ 91.304721] Call Trace: [ 91.304724] <TASK> [ 91.304730] [<ffffffffa7ef83b9>] dump_stack_lvl+0x59/0x88 [ 91.304737] [<ffffffffa7ef83f8>] dump_stack+0x10/0x18 [ 91.304738] [<ffffffffa75caa0c>] pcpu_alloc+0x6fc/0x870 [ 91.304741] [<ffffffffa75ca302>] __alloc_percpu_gfp+0x12/0x20 [ 91.304743] [<ffffffffa756785e>] alloc_bulk+0xde/0x1e0 [ 91.304746] [<ffffffffa7566c02>] bpf_mem_alloc_init+0xd2/0x2f0 [ 91.304747] [<ffffffffa7547c69>] htab_map_alloc+0x479/0x650 [ 91.304750] [<ffffffffa751d6e0>] map_create+0x140/0x2e0 [ 91.304752] [<ffffffffa751d413>] __sys_bpf+0x5a3/0x6c0 [ 91.304753] [<ffffffffa751c3ec>] __x64_sys_bpf+0x1c/0x30 [ 91.304754] [<ffffffffa7ef847a>] do_syscall_64+0x5a/0x80 [ 91.304756] [<ffffffffa800009b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
This makes sense, because in atomic context, percpu allocation would not create new chunks; it would only create in non-atomic contexts. And if during prefill all precpu chunks are full, -ENOMEM would happen immediately upon next unit_alloc.
Prefill phase does not actually run in atomic context, so we can use this fact to allocate non-atomically with GFP_KERNEL instead of GFP_NOWAIT. This avoids the immediate -ENOMEM.
GFP_NOWAIT has to be used in unit_alloc when bpf program runs in atomic context. Even if bpf program runs in non-atomic context, in most cases, rcu read lock is enabled for the program so GFP_NOWAIT is still needed. This is often also the case for BPF_MAP_UPDATE_ELEM syscalls.
Signed-off-by: YiFei Zhu <zhuyifei@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230728043359.3324347-1-zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
Revision tags: v6.1.42 |
|
#
63e2da3b |
| 25-Jul-2023 |
Arnd Bergmann <arnd@arndb.de> |
bpf: work around -Wuninitialized warning
Splitting these out into separate helper functions means that we actually pass an uninitialized variable into another function call if dec_active() happens t
bpf: work around -Wuninitialized warning
Splitting these out into separate helper functions means that we actually pass an uninitialized variable into another function call if dec_active() happens to not be inlined, and CONFIG_PREEMPT_RT is disabled:
kernel/bpf/memalloc.c: In function 'add_obj_to_free_list': kernel/bpf/memalloc.c:200:9: error: 'flags' is used uninitialized [-Werror=uninitialized] 200 | dec_active(c, flags);
Avoid this by passing the flags by reference, so they either get initialized and dereferenced through a pointer, or the pointer never gets accessed at all.
Fixes: 18e027b1c7c6d ("bpf: Factor out inc/dec of active flag into helpers.") Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20230725202653.2905259-1-arnd@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
Revision tags: v6.1.41, v6.1.40, v6.1.39 |
|
#
4ed8b5bc |
| 05-Jul-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Add object leak check.
The object leak check is cheap. Do it unconditionally to spot difficult races in bpf_mem_alloc.
Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovo
bpf: Add object leak check.
The object leak check is cheap. Do it unconditionally to spot difficult races in bpf_mem_alloc.
Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230706033447.54696-15-alexei.starovoitov@gmail.com
show more ...
|
#
5af6807b |
| 05-Jul-2023 |
Alexei Starovoitov <ast@kernel.org> |
bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu(). Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into per-c
bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu(). Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into per-cpu free list the _rcu() flavor waits for RCU grace period and then moves objects into free_by_rcu_ttrace list where they are waiting for RCU task trace grace period to be freed into slab.
The life cycle of objects: alloc: dequeue free_llist free: enqeueu free_llist free_rcu: enqueue free_by_rcu -> waiting_for_gp free_llist above high watermark -> free_by_rcu_ttrace after RCU GP waiting_for_gp -> free_by_rcu_ttrace free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-13-alexei.starovoitov@gmail.com
show more ...
|
#
04fabf00 |
| 05-Jul-2023 |
Alexei Starovoitov <ast@kernel.org> |
bpf: Allow reuse from waiting_for_gp_ttrace list.
alloc_bulk() can reuse elements from free_by_rcu_ttrace. Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
Signed-off
bpf: Allow reuse from waiting_for_gp_ttrace list.
alloc_bulk() can reuse elements from free_by_rcu_ttrace. Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230706033447.54696-10-alexei.starovoitov@gmail.com
show more ...
|
#
822fb26b |
| 05-Jul-2023 |
Alexei Starovoitov <ast@kernel.org> |
bpf: Add a hint to allocated objects.
To address OOM issue when one cpu is allocating and another cpu is freeing add a target bpf_mem_cache hint to allocated objects and when local cpu free_llist ov
bpf: Add a hint to allocated objects.
To address OOM issue when one cpu is allocating and another cpu is freeing add a target bpf_mem_cache hint to allocated objects and when local cpu free_llist overflows free to that bpf_mem_cache. The hint addresses the OOM while maintaining the same performance for common case when alloc/free are done on the same cpu.
Note that do_call_rcu_ttrace() now has to check 'draining' flag in one more case, since do_call_rcu_ttrace() is called not only for current cpu.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-9-alexei.starovoitov@gmail.com
show more ...
|
#
d114dde2 |
| 05-Jul-2023 |
Alexei Starovoitov <ast@kernel.org> |
bpf: Change bpf_mem_cache draining process.
The next patch will introduce cross-cpu llist access and existing irq_work_sync() + drain_mem_cache() + rcu_barrier_tasks_trace() mechanism will not be en
bpf: Change bpf_mem_cache draining process.
The next patch will introduce cross-cpu llist access and existing irq_work_sync() + drain_mem_cache() + rcu_barrier_tasks_trace() mechanism will not be enough, since irq_work_sync() + drain_mem_cache() on cpu A won't guarantee that llist on cpu A are empty. The free_bulk() on cpu B might add objects back to llist of cpu A. Add 'bool draining' flag. The modified sequence looks like: for_each_cpu: WRITE_ONCE(c->draining, true); // do_call_rcu_ttrace() won't be doing call_rcu() any more irq_work_sync(); // wait for irq_work callback (free_bulk) to finish drain_mem_cache(); // free all objects rcu_barrier_tasks_trace(); // wait for RCU callbacks to execute
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-8-alexei.starovoitov@gmail.com
show more ...
|
#
74680482 |
| 05-Jul-2023 |
Alexei Starovoitov <ast@kernel.org> |
bpf: Further refactor alloc_bulk().
In certain scenarios alloc_bulk() might be taking free objects mainly from free_by_rcu_ttrace list. In such case get_memcg() and set_active_memcg() are redundant,
bpf: Further refactor alloc_bulk().
In certain scenarios alloc_bulk() might be taking free objects mainly from free_by_rcu_ttrace list. In such case get_memcg() and set_active_memcg() are redundant, but they show up in perf profile. Split the loop and only set memcg when allocating from slab. No performance difference in this patch alone, but it helps in combination with further patches.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-7-alexei.starovoitov@gmail.com
show more ...
|
#
18e027b1 |
| 05-Jul-2023 |
Alexei Starovoitov <ast@kernel.org> |
bpf: Factor out inc/dec of active flag into helpers.
Factor out local_inc/dec_return(&c->active) into helpers. No functional changes.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-b
bpf: Factor out inc/dec of active flag into helpers.
Factor out local_inc/dec_return(&c->active) into helpers. No functional changes.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-6-alexei.starovoitov@gmail.com
show more ...
|
#
05ae6865 |
| 05-Jul-2023 |
Alexei Starovoitov <ast@kernel.org> |
bpf: Refactor alloc_bulk().
Factor out inner body of alloc_bulk into separate helper. No functional changes.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <danie
bpf: Refactor alloc_bulk().
Factor out inner body of alloc_bulk into separate helper. No functional changes.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-5-alexei.starovoitov@gmail.com
show more ...
|
#
9de3e815 |
| 05-Jul-2023 |
Alexei Starovoitov <ast@kernel.org> |
bpf: Let free_all() return the number of freed elements.
Let free_all() helper return the number of freed elements. It's not used in this patch, but helps in debug/development of bpf_mem_alloc.
For
bpf: Let free_all() return the number of freed elements.
Let free_all() helper return the number of freed elements. It's not used in this patch, but helps in debug/development of bpf_mem_alloc.
For example this diff for __free_rcu(): - free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size); + printk("cpu %d freed %d objs after tasks trace\n", raw_smp_processor_id(), + free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size));
would show how busy RCU tasks trace is. In artificial benchmark where one cpu is allocating and different cpu is freeing the RCU tasks trace won't be able to keep up and the list of objects would keep growing from thousands to millions and eventually OOMing.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-4-alexei.starovoitov@gmail.com
show more ...
|
#
a80672d7 |
| 05-Jul-2023 |
Alexei Starovoitov <ast@kernel.org> |
bpf: Simplify code of destroy_mem_alloc() with kmemdup().
Use kmemdup() to simplify the code.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net
bpf: Simplify code of destroy_mem_alloc() with kmemdup().
Use kmemdup() to simplify the code.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-3-alexei.starovoitov@gmail.com
show more ...
|
#
12c8d0f4 |
| 05-Jul-2023 |
Alexei Starovoitov <ast@kernel.org> |
bpf: Rename few bpf_mem_alloc fields.
Rename: - struct rcu_head rcu; - struct llist_head free_by_rcu; - struct llist_head waiting_for_gp; - atomic_t call_rcu_in_progress; +
bpf: Rename few bpf_mem_alloc fields.
Rename: - struct rcu_head rcu; - struct llist_head free_by_rcu; - struct llist_head waiting_for_gp; - atomic_t call_rcu_in_progress; + struct llist_head free_by_rcu_ttrace; + struct llist_head waiting_for_gp_ttrace; + struct rcu_head rcu_ttrace; + atomic_t call_rcu_ttrace_in_progress; ... - static void do_call_rcu(struct bpf_mem_cache *c) + static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
to better indicate intended use.
The 'tasks trace' is shortened to 'ttrace' to reduce verbosity. No functional changes.
Later patches will add free_by_rcu/waiting_for_gp fields to be used with normal RCU.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-2-alexei.starovoitov@gmail.com
show more ...
|
Revision tags: v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33 |
|
#
aa7881fc |
| 05-Jun-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Factor out a common helper free_all()
Factor out a common helper free_all() to free all normal elements or per-cpu elements on a lock-less list.
Signed-off-by: Hou Tao <houtao1@huawei.com> Lin
bpf: Factor out a common helper free_all()
Factor out a common helper free_all() to free all normal elements or per-cpu elements on a lock-less list.
Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230606035310.4026145-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
Revision tags: v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22 |
|
#
e65a5c6e |
| 22-Mar-2023 |
Martin KaFai Lau <martin.lau@kernel.org> |
bpf: Add a few bpf mem allocator functions
This patch adds a few bpf mem allocator functions which will be used in the bpf_local_storage in a later patch.
bpf_mem_cache_alloc_flags(..., gfp_t flags
bpf: Add a few bpf mem allocator functions
This patch adds a few bpf mem allocator functions which will be used in the bpf_local_storage in a later patch.
bpf_mem_cache_alloc_flags(..., gfp_t flags) is added. When the flags == GFP_KERNEL, it will fallback to __alloc(..., GFP_KERNEL). bpf_local_storage knows its running context is sleepable (GFP_KERNEL) and provides a better guarantee on memory allocation.
bpf_local_storage has some uncommon cases that its selem cannot be reused immediately. It handles its own rcu_head and goes through a rcu_trace gp and then free it. bpf_mem_cache_raw_free() is added for direct free purpose without leaking the LLIST_NODE_SZ internal knowledge. During free time, the 'struct bpf_mem_alloc *ma' is no longer available. However, the caller should know if it is percpu memory or not and it can call different raw_free functions. bpf_local_storage does not support percpu value, so only the non-percpu 'bpf_mem_cache_raw_free()' is added in this patch.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230322215246.1675516-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
Revision tags: v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2 |
|
#
997849c4 |
| 15-Feb-2023 |
Hou Tao <houtao1@huawei.com> |
bpf: Zeroing allocated object from slab in bpf memory allocator
Currently the freed element in bpf memory allocator may be immediately reused, for htab map the reuse will reinitialize special fields
bpf: Zeroing allocated object from slab in bpf memory allocator
Currently the freed element in bpf memory allocator may be immediately reused, for htab map the reuse will reinitialize special fields in map value (e.g., bpf_spin_lock), but lookup procedure may still access these special fields, and it may lead to hard-lockup as shown below:
NMI backtrace for cpu 16 CPU: 16 PID: 2574 Comm: htab.bin Tainted: G L 6.1.0+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), RIP: 0010:queued_spin_lock_slowpath+0x283/0x2c0 ...... Call Trace: <TASK> copy_map_value_locked+0xb7/0x170 bpf_map_copy_value+0x113/0x3c0 __sys_bpf+0x1c67/0x2780 __x64_sys_bpf+0x1c/0x20 do_syscall_64+0x30/0x60 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ...... </TASK>
For htab map, just like the preallocated case, these is no need to initialize these special fields in map value again once these fields have been initialized. For preallocated htab map, these fields are initialized through __GFP_ZERO in bpf_map_area_alloc(), so do the similar thing for non-preallocated htab in bpf memory allocator. And there is no need to use __GFP_ZERO for per-cpu bpf memory allocator, because __alloc_percpu_gfp() does it implicitly.
Fixes: 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230215082132.3856544-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|