#
8668d0e2 |
| 24-May-2019 |
Eric Dumazet <edumazet@google.com> |
ipv6: no longer reference init_net in ip6_frags_ns_ctl_table[]
(struct net *)->ipv6.fqdir will soon be a pointer, so make sure ip6_frags_ns_ctl_table[] does not reference init_net.
ip6_frags_ns_ctl_register() can perform the needed initialization for all netns.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
803fdd99 |
| 24-May-2019 |
Eric Dumazet <edumazet@google.com> |
net: rename struct fqdir fields
Rename the @frags fields from structs netns_ipv4, netns_ipv6, netns_nf_frag and netns_ieee802154_lowpan to @fqdir.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
89fb9005 |
| 24-May-2019 |
Eric Dumazet <edumazet@google.com> |
net: rename inet_frags_exit_net() to fqdir_exit()
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6ce3b4dc |
| 24-May-2019 |
Eric Dumazet <edumazet@google.com> |
inet: rename netns_frags to fqdir
1) struct netns_frags is renamed to struct fqdir. This structure is really holding many frag queues in a hash table.
2) (struct inet_frag_queue)->net field is renamed to fqdir, since net is generally associated with a 'struct net' pointer in the networking stack.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
Revision tags: v5.1.4, v5.1.3, v5.1.2, v5.1.1, v5.0.14, v5.1, v5.0.13, v5.0.12, v5.0.11, v5.0.10, v5.0.9, v5.0.8, v5.0.7, v5.0.6, v5.0.5, v5.0.4, v5.0.3, v4.19.29, v5.0.2, v4.19.28, v5.0.1, v4.19.27, v5.0, v4.19.26 |
|
#
d8cf757f |
| 25-Feb-2019 |
Peter Oskolkov <posk@google.com> |
net: remove unused struct inet_frag_queue.fragments field
Now that all users of struct inet_frag_queue have been converted to use 'rb_fragments', remove the unused 'fragments' field.
Build with `make allyesconfig` succeeded. ip_defrag selftest passed.
Signed-off-by: Peter Oskolkov <posk@google.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
Revision tags: v4.19.25, v4.19.24, v4.19.23, v4.19.22, v4.19.21, v4.19.20, v4.19.19, v4.19.18, v4.19.17 |
|
#
d4289fcc |
| 22-Jan-2019 |
Peter Oskolkov <posk@google.com> |
net: IP6 defrag: use rbtrees for IPv6 defrag
Currently, IPv6 defragmentation code drops non-last fragments that are smaller than 1280 bytes: see commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")
This behavior is not specified in IPv6 RFCs and appears to break compatibility with some IPv6 implementations, as reported here: https://www.spinics.net/lists/netdev/msg543846.html
This patch re-uses common IP defragmentation queueing and reassembly code in IPv6, removing the 1280 byte restriction.
Signed-off-by: Peter Oskolkov <posk@google.com> Reported-by: Tom Herbert <tom@herbertland.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
Revision tags: v4.19.16, v4.19.15, v4.19.14 |
|
#
7f334a7e |
| 29-Dec-2018 |
Su Yanjun <suyj.fnst@cn.fujitsu.com> |
ipv6: fix typo in net/ipv6/reassembly.c
Signed-off-by: Su Yanjun <suyj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
Revision tags: v4.19.13, v4.19.12 |
|
#
d15f5ac8 |
| 20-Dec-2018 |
Herbert Xu <herbert@gondor.apana.org.au> |
ipv6: frags: Fix bogus skb->sk in reassembled packets
It was reported that IPsec would crash when it encounters an IPv6 reassembled packet because skb->sk is non-zero and not a valid pointer.
This is because skb->sk is now a union with ip_defrag_offset.
This patch fixes this by resetting skb->sk when exiting from the reassembly code.
Reported-by: Xiumei Mu <xmu@redhat.com> Fixes: 219badfaade9 ("ipv6: frags: get rid of ip6frag_skb_cb/...") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
Revision tags: v4.19.11, v4.19.10, v4.19.9, v4.19.8, v4.19.7 |
|
#
ebaf39e6 |
| 05-Dec-2018 |
Jiri Wiesner <jwiesner@suse.com> |
ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
The *_frag_reasm() functions are susceptible to miscalculating the byte count of packet fragments in case the truesize of a head buffer changes. The truesize member may be changed by the call to skb_unclone(), leaving the fragment memory limit counter unbalanced even if all fragments are processed. This miscalculation goes unnoticed as long as the network namespace which holds the counter is not destroyed.
Should an attempt be made to destroy a network namespace that holds an unbalanced fragment memory limit counter, the cleanup of the namespace never finishes. The thread handling the cleanup gets stuck in inet_frags_exit_net() waiting for the percpu counter to reach zero. The thread is usually in running state with a stacktrace similar to:
PID: 1073 TASK: ffff880626711440 CPU: 1 COMMAND: "kworker/u48:4"
 #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
 #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
 #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
 #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
 #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
#10 [ffff880621563e38] process_one_work at ffffffff81096f14
It is not possible to create new network namespaces, and processes that call unshare() end up being stuck in uninterruptible sleep state waiting to acquire the net_mutex.
The bug was observed in the IPv6 netfilter code by Per Sundstrom. I thank him for his analysis of the problem. The parts of this patch that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
Signed-off-by: Jiri Wiesner <jwiesner@suse.com> Reported-by: Per Sundstrom <per.sundstrom@redqube.se> Acked-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
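The accounting problem described above can be sketched in user-space C. This is only a simplified model under stated assumptions: `toy_fqdir`, `toy_buf` and the `toy_*` functions are hypothetical stand-ins, not kernel structures or APIs, and `toy_resize_accounted()` merely imitates an operation (such as skb_unclone()) that may change a buffer's truesize after it has been charged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-namespace fragment memory counter. */
struct toy_fqdir {
    long mem;
};

/* Hypothetical buffer; only the truesize field matters here. */
struct toy_buf {
    unsigned int truesize;
};

/* Charge a buffer's current truesize to the counter. */
static void toy_charge(struct toy_fqdir *f, const struct toy_buf *b)
{
    assert(f != NULL && b != NULL);
    f->mem += (long)b->truesize;
}

/* Release a buffer: subtract whatever its truesize is now. */
static void toy_uncharge(struct toy_fqdir *f, const struct toy_buf *b)
{
    assert(f != NULL && b != NULL);
    f->mem -= (long)b->truesize;
}

/*
 * Change the truesize of an already-charged buffer and apply the
 * difference to the counter, so a later toy_uncharge() balances out.
 */
static void toy_resize_accounted(struct toy_fqdir *f, struct toy_buf *b,
                                 unsigned int new_truesize)
{
    long delta = (long)new_truesize - (long)b->truesize;

    b->truesize = new_truesize;
    if (delta != 0)
        f->mem += delta;
}
```

If the delta were not applied, releasing the resized buffer would subtract a different amount than was charged, and the counter would never return to zero, which is the condition the namespace cleanup waits for.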
|
Revision tags: v4.19.6, v4.19.5, v4.19.4, v4.18.20, v4.19.3, v4.18.19, v4.19.2, v4.18.18, v4.18.17, v4.19.1, v4.19, v4.18.16, v4.18.15, v4.18.14, v4.18.13, v4.18.12, v4.18.11, v4.18.10 |
|
#
83619623 |
| 21-Sep-2018 |
Peter Oskolkov <posk@google.com> |
net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net
Currently, ip[6]frag_high_thresh sysctl values in new namespaces are hard-limited to those of the root/init ns.
There are at least two use cases when it would be desirable to set the high_thresh values higher in a child namespace vs the global hard limit:
- a security/ddos protection policy may lower the thresholds in the root/init ns but allow for a special exception in a child namespace
- testing: a test running in a namespace may want to set these thresholds higher in its namespace than what is in the root/init ns
The new behavior:
# ip netns add testns
# ip netns exec testns bash
# sysctl -w net.ipv4.ipfrag_high_thresh=9000000
net.ipv4.ipfrag_high_thresh = 9000000
# sysctl net.ipv4.ipfrag_high_thresh
net.ipv4.ipfrag_high_thresh = 9000000
# sysctl -w net.ipv6.ip6frag_high_thresh=9000000
net.ipv6.ip6frag_high_thresh = 9000000
# sysctl net.ipv6.ip6frag_high_thresh
net.ipv6.ip6frag_high_thresh = 9000000
The old behavior:
# ip netns add testns
# ip netns exec testns bash
# sysctl -w net.ipv4.ipfrag_high_thresh=9000000
net.ipv4.ipfrag_high_thresh = 9000000
# sysctl net.ipv4.ipfrag_high_thresh
net.ipv4.ipfrag_high_thresh = 4194304
# sysctl -w net.ipv6.ip6frag_high_thresh=9000000
net.ipv6.ip6frag_high_thresh = 9000000
# sysctl net.ipv6.ip6frag_high_thresh
net.ipv6.ip6frag_high_thresh = 4194304
Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2475f59c |
| 21-Sep-2018 |
Peter Oskolkov <posk@google.com> |
ipv6: discard IP frag queue on more errors
This is similar to how ipv4 now behaves: commit 0ff89efb5246 ("ip: fail fast on IP defrag errors").
Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
Revision tags: v4.18.9, v4.18.7, v4.18.6, v4.18.5, v4.17.18, v4.18.4, v4.18.3, v4.17.17, v4.18.2, v4.17.16, v4.17.15, v4.18.1, v4.18, v4.17.14, v4.17.13, v4.17.12 |
|
#
a8305bff |
| 29-Jul-2018 |
David S. Miller <davem@davemloft.net> |
net: Add and use skb_mark_not_on_list().
An SKB is not on a list if skb->next is NULL.
Codify this convention into a helper function and use it where we are dequeueing an SKB and need to mark it as such.
Signed-off-by: David S. Miller <davem@davemloft.net>
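The convention in this commit (a buffer whose next pointer is NULL is not on a list) can be sketched in user-space C as follows. The names `toy_skb`, `toy_mark_not_on_list()` and `toy_dequeue()` are hypothetical illustrations and do not reproduce the kernel's struct sk_buff or its helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a packet buffer; only the list link is modeled. */
struct toy_skb {
    struct toy_skb *next;
    int id;
};

/* Mark a buffer as "not on any list" by clearing its next pointer. */
static void toy_mark_not_on_list(struct toy_skb *skb)
{
    assert(skb != NULL);
    skb->next = NULL;
}

/* Pop the head of a singly linked list and mark it as detached. */
static struct toy_skb *toy_dequeue(struct toy_skb **head)
{
    struct toy_skb *skb = *head;

    if (skb == NULL)
        return NULL;
    *head = skb->next;
    toy_mark_not_on_list(skb);
    return skb;
}
```

Routing every dequeue through one helper keeps the "next == NULL means detached" rule in a single place instead of scattering open-coded assignments.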
|
#
0ed4229b |
| 02-Aug-2018 |
Florian Westphal <fw@strlen.de> |
ipv6: defrag: drop non-last frags smaller than min mtu
Don't bother with pathological cases, they only waste cycles. IPv6 requires a minimum MTU of 1280 so we should never see fragments smaller than this (except last frag).
v3: don't use awkward "-offset + len"
v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68). There were concerns that there could be even smaller frags generated by intermediate nodes, e.g. on radio networks.
Cc: Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
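The rule stated in this commit message can be sketched as a small predicate. This is an illustration only: `TOY_IPV6_MIN_MTU` and `toy_should_drop_fragment()` are hypothetical names, and exactly which length the kernel compares against 1280 is an assumption here (the sketch simply takes a packet length in bytes).

```c
#include <assert.h>
#include <stdbool.h>

/* IPv6 minimum MTU in bytes, as quoted in the commit message. */
#define TOY_IPV6_MIN_MTU 1280u

/*
 * Return true when a fragment should be dropped under this rule:
 * it is not the last fragment (more fragments follow) and it is
 * smaller than the IPv6 minimum MTU.
 */
static bool toy_should_drop_fragment(unsigned int pkt_len, bool more_fragments)
{
    assert(pkt_len > 0);
    return more_fragments && pkt_len < TOY_IPV6_MIN_MTU;
}
```

Note that the later entry d4289fcc in this history removes the 1280-byte restriction again, so the predicate reflects only the behavior introduced by this commit.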
|
#
fa0f5273 |
| 02-Aug-2018 |
Peter Oskolkov <posk@google.com> |
ip: use rb trees for IP frag queue.
Similar to TCP OOO RX queue, it makes sense to use rb trees to store IP fragments, so that OOO fragments are inserted faster.
Tested:
- a follow-up patch contains a rather comprehensive ip defrag self-test (functional)
- ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`:
  netstat --statistics
  Ip:
      282078937 total packets received
      0 forwarded
      0 incoming packets discarded
      946760 incoming packets delivered
      18743456 requests sent out
      101 fragments dropped after timeout
      282077129 reassemblies required
      944952 packets reassembled ok
      262734239 packet reassembles failed
(The numbers/stats above are somewhat better re: reassemblies vs a kernel without this patchset. More comprehensive performance testing TBD).
Reported-by: Jann Horn <jannh@google.com> Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
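The idea of keeping fragments in a tree keyed by offset, so that out-of-order arrivals do not require walking a linear list, can be sketched as below. This is a deliberately simplified, unbalanced binary search tree, not a red-black tree: the kernel's rb tree adds rebalancing on top of the same ordered-insert idea. `toy_frag`, `toy_frag_insert()` and `toy_frag_walk()` are hypothetical names, and overlap handling is reduced to rejecting duplicate offsets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fragment node keyed by its byte offset in the datagram. */
struct toy_frag {
    unsigned int offset;
    struct toy_frag *left;
    struct toy_frag *right;
};

/* Insert by offset; returns 0 on success, -1 if the offset already exists. */
static int toy_frag_insert(struct toy_frag **root, struct toy_frag *frag)
{
    struct toy_frag **link = root;

    assert(root != NULL && frag != NULL);
    frag->left = NULL;
    frag->right = NULL;
    while (*link != NULL) {
        if (frag->offset < (*link)->offset)
            link = &(*link)->left;
        else if (frag->offset > (*link)->offset)
            link = &(*link)->right;
        else
            return -1;
    }
    *link = frag;
    return 0;
}

/*
 * In-order walk: stores offsets in ascending order into out[] (up to cap
 * entries) starting at index n, and returns the new count.
 */
static size_t toy_frag_walk(const struct toy_frag *node, unsigned int *out,
                            size_t n, size_t cap)
{
    if (node == NULL)
        return n;
    n = toy_frag_walk(node->left, out, n, cap);
    if (n < cap)
        out[n++] = node->offset;
    return toy_frag_walk(node->right, out, n, cap);
}
```

Fragments inserted in any arrival order come back in offset order from the in-order walk, which is the order reassembly needs.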
|
Revision tags: v4.17.11, v4.17.10, v4.17.9, v4.17.8, v4.17.7 |
|
#
70b095c8 |
| 13-Jul-2018 |
Florian Westphal <fw@strlen.de> |
ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module
IPV6=m DEFRAG_IPV6=m CONNTRACK=y yields:
net/netfilter/nf_conntrack_proto.o: In function `nf_ct_netns_do_get':
net/netfilter/nf_conntrack_proto.c:802: undefined reference to `nf_defrag_ipv6_enable'
net/netfilter/nf_conntrack_proto.o:(.rodata+0x640): undefined reference to `nf_conntrack_l4proto_icmpv6'
Setting DEFRAG_IPV6=y causes undefined references to ip6_rhash_params, ip6_frag_init and ip6_expire_frag_queue, so it would be necessary to force IPV6=y too.
This patch gets rid of the 'followup linker error' by removing the dependency of ipv6.ko symbols from netfilter ipv6 defrag.
Shared code is placed into a header, then used from both.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
Revision tags: v4.17.6, v4.17.5, v4.17.4, v4.17.3, v4.17.2, v4.17.1, v4.17 |
|
#
415787d7 |
| 17-Apr-2018 |
Eric Dumazet <edumazet@google.com> |
ipv6: frags: fix a lockdep false positive
lockdep does not know that the locks used by IPv4 defrag and IPv6 reassembly units are of different classes.
It complains because of following chains :
1) sch_direct_xmit() (lock txq->_xmit_lock)
   dev_hard_start_xmit()
   xmit_one()
   dev_queue_xmit_nit()
   packet_rcv_fanout()
   ip_check_defrag()
   ip_defrag()
   spin_lock() (lock frag queue spinlock)
2) ip6_input_finish()
   ipv6_frag_rcv() (lock frag queue spinlock)
   ip6_frag_queue()
   icmpv6_param_prob() (lock txq->_xmit_lock at some point)
We could add lockdep annotations, but we also can make sure IPv6 calls icmpv6_param_prob() only after the release of the frag queue spinlock, since this naturally makes frag queue spinlock a leaf in lock hierarchy.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
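The approach chosen in this commit (decide under the lock, report only after releasing it) can be sketched in user-space C. Everything here is a hypothetical model: `toy_queue` uses a plain flag instead of a real spinlock, and `toy_report_error()` stands in for a reporting call that might take other locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queue with a flag standing in for its spinlock. */
struct toy_queue {
    bool locked;
    int reports;
};

static void toy_lock(struct toy_queue *q)
{
    assert(!q->locked);
    q->locked = true;
}

static void toy_unlock(struct toy_queue *q)
{
    assert(q->locked);
    q->locked = false;
}

/* Stand-in for an error reporter; it must never run with the lock held. */
static void toy_report_error(struct toy_queue *q)
{
    assert(!q->locked);
    q->reports++;
}

/*
 * Process one fragment. The decision to report is recorded while the
 * lock is held, but the report itself is issued after the unlock, so
 * the queue lock stays a leaf in the lock hierarchy.
 */
static int toy_queue_fragment(struct toy_queue *q, bool malformed)
{
    bool need_report = false;

    assert(q != NULL);
    toy_lock(q);
    if (malformed)
        need_report = true;
    toy_unlock(q);

    if (need_report) {
        toy_report_error(q);
        return -1;
    }
    return 0;
}
```

Because nothing else is acquired while the queue lock is held, no ordering between it and other locks can form, which removes the chain lockdep complained about without adding annotations.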
|
#
bdb7cc64 |
| 16-Apr-2018 |
Stephen Suryaputra <ssuryaextr@gmail.com> |
ipv6: Count interface receive statistics on the ingress netdev
The statistics such as InHdrErrors should be counted on the ingress netdev rather than on the dev from the dst, which is the egress.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3d234012 |
| 04-Apr-2018 |
Eric Dumazet <edumazet@google.com> |
inet: frags: fix ip6frag_low_thresh boundary
Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches, since linker might place next to it a non zero value preventing a change to ip6frag_low_thresh.
ip6frag_low_thresh is not used anymore in the kernel, but we do not want to prematurely break user scripts wanting to change it.
Since specifying a minimal value of 0 for proc_doulongvec_minmax() is moot, let's remove these zero values in all defrag units.
Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6e00f7dd |
| 01-Apr-2018 |
Eric Dumazet <edumazet@google.com> |
ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh
I forgot to change the ip6frag_low_thresh proc_handler from proc_dointvec_minmax to proc_doulongvec_minmax.
Fixes: 3e67f106f619 ("inet: frags: break the 2GB limit for frags storage") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
Revision tags: v4.16 |
|
#
219badfa |
| 31-Mar-2018 |
Eric Dumazet <edumazet@google.com> |
ipv6: frags: get rid of ip6frag_skb_cb/FRAG6_CB
ip6_frag_queue uses skb->cb[] to store the fragment offset, meaning that we could use two cache lines per skb when finding the insertion point, if for some reason inet6_skb_parm size is increased in the future.
By using skb->ip_defrag_offset instead of skb->cb[], we pack all the fields in a single cache line, matching what we did for IPv4.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
05c0b86b |
| 31-Mar-2018 |
Eric Dumazet <edumazet@google.com> |
ipv6: frags: rewrite ip6_expire_frag_queue()
Make it similar to IPv4 ip_expire(), and release the lock before calling icmp functions.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3e67f106 |
| 31-Mar-2018 |
Eric Dumazet <edumazet@google.com> |
inet: frags: break the 2GB limit for frags storage
Some users are willing to provision huge amounts of memory to be able to perform reassembly reasonably well under pressure.
Current memory tracking is using one atomic_t and integers.
Switch to atomic_long_t so that 64bit arches can use more than 2GB, without any cost for 32bit arches.
Note that this patch avoids an overflow error, if high_thresh was set to ~2GB, since this test in inet_frag_alloc() was never true :
if (... || frag_mem_limit(nf) > nf->high_thresh)
Tested:
$ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
<frag DDOS>
$ grep FRAG /proc/net/sockstat
FRAG: inuse 14705885 memory 16000002880
$ nstat -n ; sleep 1 ; nstat | grep Reas
IpReasmReqds 3317150 0.0
IpReasmFails 3317112 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
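Why the limit check could never fire with a threshold near 2GB can be shown with a small model. The sketch assumes a 32-bit signed counter and threshold, as implied by the commit's mention of atomic_t and integers; `toy_over_limit_32()` and `toy_over_limit_64()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * 32-bit model: the counter cannot hold a value above INT32_MAX, so
 * with a threshold at (or near) INT32_MAX the comparison is never true.
 */
static bool toy_over_limit_32(int32_t mem, int32_t high_thresh)
{
    assert(mem >= 0);
    return mem > high_thresh;
}

/* 64-bit model: counter and threshold can both exceed 2GB. */
static bool toy_over_limit_64(int64_t mem, int64_t high_thresh)
{
    assert(mem >= 0);
    return mem > high_thresh;
}
```

With the 64-bit variant, the values from the test run above (memory 16000002880 against a threshold of 16000000000) compare as over the limit, while the 32-bit variant cannot even represent them.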
|
#
2d44ed22 |
| 31-Mar-2018 |
Eric Dumazet <edumazet@google.com> |
inet: frags: remove inet_frag_maybe_warn_overflow()
This function is obsolete, after rhashtable addition to inet defrag.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
399d1404 |
| 31-Mar-2018 |
Eric Dumazet <edumazet@google.com> |
inet: frags: get rif of inet_frag_evicting()
This refactors ip_expire() since one indentation level is removed.
Note: in the future, we should try hard to avoid the skb_clone() since this is a serious performance cost. Under DDOS, the ICMP message won't be sent because of rate limits.
The fact that ip6_expire_frag_queue() does not use skb_clone() is disturbing too. Presumably IPv6 should have the same issue as the one we fixed in commit ec4fbd64751d ("inet: frag: release spinlock before calling icmp_send()")
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
648700f7 |
| 31-Mar-2018 |
Eric Dumazet <edumazet@google.com> |
inet: frags: use rhashtables for reassembly units
Some applications still rely on IP fragmentation, and to be fair linux reassembly unit is not working under any serious load.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
A work queue is supposed to garbage collect items when host is under memory pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild, occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU. A followup patch will even remove the refcount hold/release left from prior implementation and save a couple of atomic operations.
Before this patch, 16 cpus (16 RX queue NIC) could not handle more than 1 Mpps frags DDOS.
After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB of storage for the fragments (exact number depends on frags being evicted after timeout)
$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Florian Westphal <fw@strlen.de> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Alexander Aring <alex.aring@gmail.com> Cc: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|