1. 27 Jan, 2019 1 commit
  2. 26 Jan, 2019 2 commits
  3. 25 Jan, 2019 2 commits
    • tcp_bbr: adapt cwnd based on ack aggregation estimation · 78dc70eb
      Priyaranjan Jha committed
      Aggregation effects are extremely common with wifi, cellular, and cable
      modem link technologies, ACK decimation in middleboxes, and LRO and GRO
      in receiving hosts. The aggregation can happen in either direction,
      data or ACKs, but in either case the aggregation effect is visible
      to the sender in the ACK stream.
      
      Previously BBR's sending was often limited by cwnd under severe ACK
      aggregation/decimation because BBR sized the cwnd at 2*BDP. If packets
      were acked in bursts after long delays (e.g. one ACK acking 5*BDP after
      5*RTT), BBR's sending was halted after sending 2*BDP over 2*RTT, leaving
      the bottleneck idle for potentially long periods. Note that loss-based
      congestion control does not have this issue because when facing
      aggregation it continues increasing cwnd after bursts of ACKs, growing
      cwnd until the buffer is full.
      
      To achieve good throughput in the presence of aggregation effects, this
      algorithm allows the BBR sender to put extra data in flight to keep the
      bottleneck utilized during silences in the ACK stream that it has evidence
      to suggest were caused by aggregation.
      
      A summary of the algorithm: when a burst of packets is acked by a
      stretched ACK or a burst of ACKs or both, BBR first estimates the expected
      amount of data that should have been acked, based on its estimated
      bandwidth. Then the surplus ("extra_acked") is recorded in a windowed-max
      filter to estimate the recent level of observed ACK aggregation. Then cwnd
      is increased by the ACK aggregation estimate. The larger cwnd avoids BBR
      being cwnd-limited in the face of ACK silences that recent history suggests
      were caused by aggregation. As a sanity check, the ACK aggregation degree
      is upper-bounded by the cwnd (at the time of measurement) and a global max
      of BW * 100ms. The algorithm is further described by the following
      presentation:
      https://datatracker.ietf.org/meeting/101/materials/slides-101-iccrg-an-update-on-bbr-work-at-google-00
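
The steps summarized above can be sketched in userspace C. This is a simplified model and not the kernel code: the struct, the function names, the byte and microsecond units, and the use of a plain running maximum instead of a windowed-max filter are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-connection state; field names are illustrative. */
struct agg_state {
    uint64_t epoch_start_us;  /* when the current ACK sampling epoch began */
    uint64_t epoch_acked;     /* bytes acked since the epoch began */
    uint64_t max_extra_acked; /* running max; the real filter is windowed */
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/*
 * Process one ACK covering `acked` bytes at time `now_us`, given the
 * estimated bandwidth `bw` in bytes per second and the base cwnd in bytes.
 * Returns the cwnd after adding the aggregation allowance.
 */
static uint64_t agg_on_ack(struct agg_state *s, uint64_t now_us,
                           uint64_t acked, uint64_t bw, uint64_t cwnd)
{
    uint64_t expected, extra;

    assert(now_us >= s->epoch_start_us);
    /* Data we expected to be acked so far at the estimated bandwidth. */
    expected = bw * (now_us - s->epoch_start_us) / 1000000;

    /* ACKs are no longer ahead of the estimated rate: start a new epoch. */
    if (s->epoch_acked <= expected) {
        s->epoch_start_us = now_us;
        s->epoch_acked = 0;
        expected = 0;
    }

    s->epoch_acked += acked;
    extra = s->epoch_acked - expected; /* surplus over the estimate */
    extra = min_u64(extra, cwnd);      /* bounded by cwnd at measurement */
    extra = min_u64(extra, bw / 10);   /* bounded by BW * 100ms */

    if (extra > s->max_extra_acked)
        s->max_extra_acked = extra;

    return cwnd + s->max_extra_acked;
}
```

In this model, a burst that acks more than the bandwidth estimate predicts raises the recorded maximum, and later ACKs keep benefiting from the larger cwnd even after the burst has passed.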
      
      In our internal testing, we observed a significant increase in BBR
      throughput (measured using netperf), in a basic wifi setup.
      - Host1 (sender on ethernet) -> AP -> Host2 (receiver on wifi)
      - 2.4 GHz -> BBR before: ~73 Mbps; BBR after: ~102 Mbps; CUBIC: ~100 Mbps
      - 5.0 GHz -> BBR before: ~362 Mbps; BBR after: ~593 Mbps; CUBIC: ~601 Mbps
      
      Also, this code is running globally on YouTube TCP connections and produced
      significant bandwidth increases for YouTube traffic.
      
      This is based on Ian Swett's max_ack_height_ algorithm from the
      QUIC BBR implementation.
      Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_bbr: refactor bbr_target_cwnd() for general inflight provisioning · 232aa8ec
      Priyaranjan Jha committed
      Because bbr_target_cwnd() is really a general-purpose BBR helper for
      computing some volume of inflight data as a function of the estimated
      BDP, refactor it into the following helper functions:
      - bbr_bdp()
      - bbr_quantization_budget()
      - bbr_inflight()
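
One way to picture this three-way split is the userspace sketch below. It is an illustration only: the `sketch_` names, the percent-based gain, the packets-per-second bandwidth unit, and the specific budget adjustments (a few extra segments plus rounding up to an even count) are assumptions, not a copy of the kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* BDP in packets: ceil(bw * min_rtt * gain), with gain given in percent. */
static uint32_t sketch_bdp(uint64_t bw_pkts_per_sec, uint32_t min_rtt_us,
                           uint32_t gain_pct)
{
    uint64_t num = bw_pkts_per_sec * min_rtt_us * gain_pct;
    uint64_t den = 1000000ULL * 100ULL;

    assert(gain_pct > 0);
    return (uint32_t)((num + den - 1) / den);
}

/* Add headroom for segmentation offload and round up to an even count. */
static uint32_t sketch_quantization_budget(uint32_t cwnd, uint32_t tso_segs)
{
    cwnd += 3 * tso_segs;     /* assumed headroom: three offload bursts */
    cwnd = (cwnd + 1) & ~1U;  /* round up to the next even number */
    return cwnd;
}

/* Inflight target: the BDP at a given gain plus the quantization budget. */
static uint32_t sketch_inflight(uint64_t bw_pkts_per_sec, uint32_t min_rtt_us,
                                uint32_t gain_pct, uint32_t tso_segs)
{
    uint32_t bdp = sketch_bdp(bw_pkts_per_sec, min_rtt_us, gain_pct);

    return sketch_quantization_budget(bdp, tso_segs);
}
```

Splitting the pure BDP calculation from the quantization step lets a caller reuse either piece on its own, which is the point of the refactoring described above.
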
      Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 23 Jan, 2019 3 commits
    • bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() internals · a2e2ca3b
      Linus Lüssing committed
      With this patch the internal use of the skb_trimmed is reduced to
      the ICMPv6/IGMP checksum verification. And for the length checks
      the newly introduced helper functions are used instead of calculating
      and checking with skb->len directly.
      
      These changes should hopefully make it easier to verify that length
      checks are performed properly.
      Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() calls · ba5ea614
      Linus Lüssing committed
      This patch refactors ip_mc_check_igmp(), ipv6_mc_check_mld() and
      their callers (more precisely, the Linux bridge) to not rely on
      the skb_trimmed parameter anymore.
      
      An skb with its tail trimmed to the IP packet length was initially
      introduced for the following three reasons:
      
      1) To be able to verify the ICMPv6 checksum.
      2) To be able to distinguish the version of an IGMP or MLD query.
         They are distinguishable only by their size.
      3) To avoid parsing data for an IGMPv3 or MLDv2 report that is
         beyond the IP packet but still within the skb.
      
      The first case still uses a cloned and potentially trimmed skb to
      verify. However, there is no need to propagate it to the caller.
      For the second and third cases, explicit IP packet length checks were
      added.
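
As an illustration of an explicit length check for the second case, the sketch below classifies an IGMP query by the transport length derived from the IP header fields rather than from the buffer size. The function name and return codes are hypothetical; the sizes (8 bytes for an IGMPv1/v2 message, at least 12 bytes for an IGMPv3 query) come from the IGMP specifications, not from this commit.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_IGMP_HDR_LEN      8  /* size of an IGMPv1/v2 message */
#define SKETCH_IGMPV3_QUERY_MIN 12  /* minimum size of an IGMPv3 query */

/*
 * Classify a query by the length the IP header claims for its payload.
 * Returns 2 for an IGMPv1/v2 query, 3 for an IGMPv3 query, -1 if invalid.
 */
static int sketch_igmp_query_version(size_t ip_total_len, size_t ip_hdr_len)
{
    size_t transport_len;

    if (ip_total_len < ip_hdr_len)
        return -1;  /* malformed: total length shorter than the header */

    transport_len = ip_total_len - ip_hdr_len;

    if (transport_len == SKETCH_IGMP_HDR_LEN)
        return 2;
    if (transport_len >= SKETCH_IGMPV3_QUERY_MIN)
        return 3;
    return -1;      /* neither a v1/v2 nor a v3 query size */
}
```

Because the decision uses only the IP packet length, trailing bytes that sit in the buffer beyond the IP packet cannot change the result.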
      
      This hopefully makes ip_mc_check_igmp() and ipv6_mc_check_mld() easier
      to read and verify, as well as easier to use.
      Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: introduce a knob to control whether to inherit devconf config · 856c395c
      Cong Wang committed
      There have been many people complaining about the inconsistent
      behaviors of IPv4 and IPv6 devconf when creating new network
      namespaces.  Currently, for IPv4, we inherit all current settings
      from init_net, but for IPv6 we reset all settings to their defaults.
      
      This patch introduces a new /proc file
      /proc/sys/net/core/devconf_inherit_init_net to control
      whether the current sysctl settings are inherited from init_net.
      This file itself is only available in init_net.
      
      As demonstrated below:
      
      Initial setup in init_net:
       # cat /proc/sys/net/ipv4/conf/all/rp_filter
       2
       # cat /proc/sys/net/ipv6/conf/all/accept_dad
       1
      
      Default value 0 (current behavior):
       # ip netns del test
       # ip netns add test
       # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
       2
       # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
       0
      
      Set to 1 (inherit from init_net):
       # echo 1 > /proc/sys/net/core/devconf_inherit_init_net
       # ip netns del test
       # ip netns add test
       # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
       2
       # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
       1
      
      Set to 2 (reset to default):
       # echo 2 > /proc/sys/net/core/devconf_inherit_init_net
       # ip netns del test
       # ip netns add test
       # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
       0
       # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
       0
      
      Set to a value out of range (invalid):
       # echo 3 > /proc/sys/net/core/devconf_inherit_init_net
       -bash: echo: write error: Invalid argument
       # echo -1 > /proc/sys/net/core/devconf_inherit_init_net
       -bash: echo: write error: Invalid argument
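
The behavior demonstrated above can be condensed into a small decision function. This is a userspace sketch, not the kernel implementation: the enum, the function name, and the `is_ipv6` parameter are hypothetical, and only the mapping from knob value to behavior is taken from the transcript.

```c
#include <assert.h>
#include <errno.h>

/* Where a new namespace takes its devconf values from. */
enum sketch_conf_source {
    SKETCH_CONF_DEFAULTS, /* reset to built-in defaults */
    SKETCH_CONF_INIT_NET  /* copy the current settings of init_net */
};

/*
 * Map a devconf_inherit_init_net value to the source used for one
 * address family. Returns 0 on success, -EINVAL for out-of-range values.
 */
static int sketch_pick_conf_source(int knob, int is_ipv6,
                                   enum sketch_conf_source *out)
{
    switch (knob) {
    case 0: /* historical behavior: IPv4 inherits, IPv6 resets */
        *out = is_ipv6 ? SKETCH_CONF_DEFAULTS : SKETCH_CONF_INIT_NET;
        return 0;
    case 1: /* both families inherit from init_net */
        *out = SKETCH_CONF_INIT_NET;
        return 0;
    case 2: /* both families reset to defaults */
        *out = SKETCH_CONF_DEFAULTS;
        return 0;
    default:
        return -EINVAL;
    }
}
```
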
      Reported-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
      Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 20 Jan, 2019 3 commits
  6. 19 Jan, 2019 1 commit
  7. 18 Jan, 2019 21 commits
  8. 17 Jan, 2019 2 commits
  9. 16 Jan, 2019 3 commits
    • fou, fou6: do not assume linear skbs · 26fc181e
      Eric Dumazet committed
      Both gue_err() and gue6_err() incorrectly assume
      linear skbs. Fix them to use pskb_may_pull().
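
The idea behind the fix can be modeled in userspace: before reading a header byte, make sure that byte is present in the contiguous (linear) part of the packet, pulling it in from a fragment if needed. The sketch below is a toy model with hypothetical types and names; it only mimics the contract of pskb_may_pull() and does not use any kernel API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_LINEAR_CAP 64

/* Toy packet: a linear area plus a single trailing fragment. */
struct sketch_pkt {
    uint8_t linear[SKETCH_LINEAR_CAP];
    size_t linear_len;    /* valid bytes in the linear area */
    const uint8_t *frag;  /* remaining non-linear data */
    size_t frag_len;
};

/* Ensure the first `len` bytes are in the linear area. Returns 1 if so. */
static int sketch_may_pull(struct sketch_pkt *p, size_t len)
{
    size_t need;

    assert(p->linear_len <= SKETCH_LINEAR_CAP);
    if (len <= p->linear_len)
        return 1;

    need = len - p->linear_len;
    if (len > SKETCH_LINEAR_CAP || need > p->frag_len)
        return 0;  /* not enough data: the caller must not read */

    memcpy(p->linear + p->linear_len, p->frag, need);
    p->linear_len += need;
    p->frag += need;
    p->frag_len -= need;
    return 1;
}

/* Read one header byte only after the bounds check succeeds. */
static int sketch_read_hdr_byte(struct sketch_pkt *p, size_t off, uint8_t *out)
{
    if (!sketch_may_pull(p, off + 1))
        return -1;
    *out = p->linear[off];
    return 0;
}
```

Reading `p->linear[off]` without the check would touch bytes that were never copied in, which is the kind of uninitialized access the KMSAN report below describes.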
      
      BUG: KMSAN: uninit-value in gue6_err+0x475/0xc40 net/ipv6/fou6.c:101
      CPU: 0 PID: 18083 Comm: syz-executor1 Not tainted 5.0.0-rc1+ #7
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x173/0x1d0 lib/dump_stack.c:113
       kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:600
       __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:313
       gue6_err+0x475/0xc40 net/ipv6/fou6.c:101
       __udp6_lib_err_encap_no_sk net/ipv6/udp.c:434 [inline]
       __udp6_lib_err_encap net/ipv6/udp.c:491 [inline]
       __udp6_lib_err+0x18d0/0x2590 net/ipv6/udp.c:522
       udplitev6_err+0x118/0x130 net/ipv6/udplite.c:27
       icmpv6_notify+0x462/0x9f0 net/ipv6/icmp.c:784
       icmpv6_rcv+0x18ac/0x3fa0 net/ipv6/icmp.c:872
       ip6_protocol_deliver_rcu+0xb5a/0x23a0 net/ipv6/ip6_input.c:394
       ip6_input_finish net/ipv6/ip6_input.c:434 [inline]
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ip6_input+0x2b6/0x350 net/ipv6/ip6_input.c:443
       dst_input include/net/dst.h:450 [inline]
       ip6_rcv_finish+0x4e7/0x6d0 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ipv6_rcv+0x34b/0x3f0 net/ipv6/ip6_input.c:272
       __netif_receive_skb_one_core net/core/dev.c:4973 [inline]
       __netif_receive_skb net/core/dev.c:5083 [inline]
       process_backlog+0x756/0x10e0 net/core/dev.c:5923
       napi_poll net/core/dev.c:6346 [inline]
       net_rx_action+0x78b/0x1a60 net/core/dev.c:6412
       __do_softirq+0x53f/0x93a kernel/softirq.c:293
       do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1039
       </IRQ>
       do_softirq kernel/softirq.c:338 [inline]
       __local_bh_enable_ip+0x16f/0x1a0 kernel/softirq.c:190
       local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
       rcu_read_unlock_bh include/linux/rcupdate.h:696 [inline]
       ip6_finish_output2+0x1d64/0x25f0 net/ipv6/ip6_output.c:121
       ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:154
       NF_HOOK_COND include/linux/netfilter.h:278 [inline]
       ip6_output+0x5ca/0x710 net/ipv6/ip6_output.c:171
       dst_output include/net/dst.h:444 [inline]
       ip6_local_out+0x164/0x1d0 net/ipv6/output_core.c:176
       ip6_send_skb+0xfa/0x390 net/ipv6/ip6_output.c:1727
       udp_v6_send_skb+0x1733/0x1d20 net/ipv6/udp.c:1169
       udpv6_sendmsg+0x424e/0x45d0 net/ipv6/udp.c:1466
       inet_sendmsg+0x54a/0x720 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg net/socket.c:631 [inline]
       ___sys_sendmsg+0xdb9/0x11b0 net/socket.c:2116
       __sys_sendmmsg+0x580/0xad0 net/socket.c:2211
       __do_sys_sendmmsg net/socket.c:2240 [inline]
       __se_sys_sendmmsg+0xbd/0xe0 net/socket.c:2237
       __x64_sys_sendmmsg+0x56/0x70 net/socket.c:2237
       do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      RIP: 0033:0x457ec9
      Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f4a5204fc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 0000000000457ec9
      RDX: 00000000040001ab RSI: 0000000020000240 RDI: 0000000000000003
      RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4a520506d4
      R13: 00000000004c4ce5 R14: 00000000004d85d8 R15: 00000000ffffffff
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:205 [inline]
       kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:159
       kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
       kmsan_slab_alloc+0xe/0x10 mm/kmsan/kmsan_hooks.c:185
       slab_post_alloc_hook mm/slab.h:446 [inline]
       slab_alloc_node mm/slub.c:2754 [inline]
       __kmalloc_node_track_caller+0xe9e/0xff0 mm/slub.c:4377
       __kmalloc_reserve net/core/skbuff.c:140 [inline]
       __alloc_skb+0x309/0xa20 net/core/skbuff.c:208
       alloc_skb include/linux/skbuff.h:1012 [inline]
       alloc_skb_with_frags+0x1c7/0xac0 net/core/skbuff.c:5288
       sock_alloc_send_pskb+0xafd/0x10a0 net/core/sock.c:2091
       sock_alloc_send_skb+0xca/0xe0 net/core/sock.c:2108
       __ip6_append_data+0x42ed/0x5dc0 net/ipv6/ip6_output.c:1443
       ip6_append_data+0x3c2/0x650 net/ipv6/ip6_output.c:1619
       icmp6_send+0x2f5c/0x3c40 net/ipv6/icmp.c:574
       icmpv6_send+0xe5/0x110 net/ipv6/ip6_icmp.c:43
       ip6_link_failure+0x5c/0x2c0 net/ipv6/route.c:2231
       dst_link_failure include/net/dst.h:427 [inline]
       vti_xmit net/ipv4/ip_vti.c:229 [inline]
       vti_tunnel_xmit+0xf3b/0x1ea0 net/ipv4/ip_vti.c:265
       __netdev_start_xmit include/linux/netdevice.h:4382 [inline]
       netdev_start_xmit include/linux/netdevice.h:4391 [inline]
       xmit_one net/core/dev.c:3278 [inline]
       dev_hard_start_xmit+0x604/0xc40 net/core/dev.c:3294
       __dev_queue_xmit+0x2e48/0x3b80 net/core/dev.c:3864
       dev_queue_xmit+0x4b/0x60 net/core/dev.c:3897
       neigh_direct_output+0x42/0x50 net/core/neighbour.c:1511
       neigh_output include/net/neighbour.h:508 [inline]
       ip6_finish_output2+0x1d4e/0x25f0 net/ipv6/ip6_output.c:120
       ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:154
       NF_HOOK_COND include/linux/netfilter.h:278 [inline]
       ip6_output+0x5ca/0x710 net/ipv6/ip6_output.c:171
       dst_output include/net/dst.h:444 [inline]
       ip6_local_out+0x164/0x1d0 net/ipv6/output_core.c:176
       ip6_send_skb+0xfa/0x390 net/ipv6/ip6_output.c:1727
       udp_v6_send_skb+0x1733/0x1d20 net/ipv6/udp.c:1169
       udpv6_sendmsg+0x424e/0x45d0 net/ipv6/udp.c:1466
       inet_sendmsg+0x54a/0x720 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg net/socket.c:631 [inline]
       ___sys_sendmsg+0xdb9/0x11b0 net/socket.c:2116
       __sys_sendmmsg+0x580/0xad0 net/socket.c:2211
       __do_sys_sendmmsg net/socket.c:2240 [inline]
       __se_sys_sendmmsg+0xbd/0xe0 net/socket.c:2237
       __x64_sys_sendmmsg+0x56/0x70 net/socket.c:2237
       do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      
      Fixes: b8a51b38 ("fou, fou6: ICMP error handlers for FoU and GUE")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: allow MSG_ZEROCOPY transmission also in CLOSE_WAIT state · 13d7f463
      Willem de Bruijn committed
      TCP transmission with MSG_ZEROCOPY fails if the peer closes its end of
      the connection and so transitions this socket to CLOSE_WAIT state.
      
      Transmission in CLOSE_WAIT state is acceptable. Other similar tests in
      the stack (e.g., in FastOpen) accept both states. Relax this test, too.
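
A relaxed state test of this kind is commonly written as a bitmask check. The sketch below contrasts a strict test with a relaxed one; the enum, macro, and function names are hypothetical stand-ins, and only the numeric state values follow the Linux TCP state numbering.

```c
#include <assert.h>

/* Illustrative subset of TCP states (values match Linux numbering). */
enum sketch_tcp_state {
    SKETCH_ESTABLISHED = 1,
    SKETCH_FIN_WAIT1   = 4,
    SKETCH_CLOSE_WAIT  = 8,
    SKETCH_LAST_ACK    = 9
};

#define SKETCH_STATE_FLAG(s) (1u << (s))

/* Strict test: only ESTABLISHED may send with zerocopy. */
static int sketch_zerocopy_ok_strict(int state)
{
    return state == SKETCH_ESTABLISHED;
}

/* Relaxed test: ESTABLISHED or CLOSE_WAIT may send with zerocopy. */
static int sketch_zerocopy_ok_relaxed(int state)
{
    unsigned int allowed = SKETCH_STATE_FLAG(SKETCH_ESTABLISHED) |
                           SKETCH_STATE_FLAG(SKETCH_CLOSE_WAIT);

    assert(state > 0 && state < 32);
    return (SKETCH_STATE_FLAG(state) & allowed) != 0;
}
```

In CLOSE_WAIT the peer has closed its sending direction, but the local side can still transmit, so accepting that state does not change what a sender is allowed to do.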
      
      Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg276886.html
      Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg227390.html
      Fixes: f214f915 ("tcp: enable MSG_ZEROCOPY")
      Reported-by: Marek Majkowski <marek@cloudflare.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      CC: Yuchung Cheng <ycheng@google.com>
      CC: Neal Cardwell <ncardwell@google.com>
      CC: Soheil Hassas Yeganeh <soheil@google.com>
      CC: Alexey Kodanev <alexey.kodanev@oracle.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv4: Fix memory leak in network namespace dismantle · f97f4dd8
      Ido Schimmel committed
      IPv4 routing tables are flushed in two cases:
      
      1. In response to events in the netdev and inetaddr notification chains
      2. When a network namespace is being dismantled
      
      In both cases only routes associated with a dead nexthop group are
      flushed. However, a nexthop group will only be marked as dead in case it
      is populated with actual nexthops using a nexthop device. This is not
      the case when the route in question is an error route (e.g.,
      'blackhole', 'unreachable').
      
      Therefore, when a network namespace is being dismantled, such routes are
      not flushed and are leaked [1].
      
      To reproduce:
      # ip netns add blue
      # ip -n blue route add unreachable 192.0.2.0/24
      # ip netns del blue
      
      Fix this by not skipping error routes that are not marked with
      RTNH_F_DEAD when flushing the routing tables.
      
      To prevent the flushing of such routes in case #1, add a parameter to
      fib_table_flush() that indicates if the table is flushed as part of
      namespace dismantle or not.
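
The resulting flush decision can be sketched as a small predicate. This is a simplified userspace model: the struct, its fields, and the function name are hypothetical, and the real fib_table_flush() walks a trie and checks more conditions than shown here.

```c
#include <assert.h>

/* Toy route entry; the fields are illustrative. */
struct sketch_route {
    int dead;     /* nexthop group marked dead (RTNH_F_DEAD) */
    int is_error; /* error route such as 'blackhole' or 'unreachable' */
};

/*
 * Decide whether a route is flushed. `flush_all` is set only when the
 * table is flushed as part of namespace dismantle.
 */
static int sketch_should_flush(const struct sketch_route *r, int flush_all)
{
    assert(r != 0);
    if (r->dead)
        return 1;                 /* dead routes are always flushed */
    return flush_all && r->is_error;  /* error routes only on dismantle */
}
```

With `flush_all` set to 0, an error route that is not marked dead survives the notifier-driven flush, which preserves the existing behavior for case #1.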
      
      Note that this problem does not exist in IPv6 since error routes are
      associated with the loopback device.
      
      [1]
      unreferenced object 0xffff888066650338 (size 56):
        comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff  ..........ba....
          e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00  ...d............
        backtrace:
          [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
          [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
          [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
          [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
          [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
          [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
          [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
          [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
          [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
          [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000003a8b605b>] 0xffffffffffffffff
      unreferenced object 0xffff888061621c88 (size 48):
        comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
        hex dump (first 32 bytes):
          6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
          6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff  kkkkkkkk..&_....
        backtrace:
          [<00000000733609e3>] fib_table_insert+0x978/0x1500
          [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
          [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
          [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
          [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
          [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
          [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
          [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
          [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
          [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
          [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000003a8b605b>] 0xffffffffffffffff
      
      Fixes: 8cced9ef ("[NETNS]: Enable routing configuration in non-initial namespace.")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 12 Jan, 2019 2 commits
    • net: bpfilter: disallow to remove bpfilter module while being used · 71a85084
      Taehee Yoo committed
      The bpfilter.ko module can be removed while its functions are still
      executing, which can cause a panic. Locking is needed to prevent this.
      bpfilter_lock protects the routines in __bpfilter_process_sockopt(),
      but that is not enough because the __exit routine can run
      concurrently.
      
      With this patch, bpfilter_umh cannot run in parallel.
      As a result, the module is not removed while it is in use, and the
      UMH process is not created twice.
      The members of umh_info and bpfilter_umh_ops are protected by
      bpfilter_umh_ops.lock.
      
      test commands:
         while :
         do
      	iptables -I FORWARD -m string --string ap --algo kmp &
      	modprobe -rv bpfilter &
         done
      
      splat looks like:
      [  298.623435] BUG: unable to handle kernel paging request at fffffbfff807440b
      [  298.628512] #PF error: [normal kernel read fault]
      [  298.633018] PGD 124327067 P4D 124327067 PUD 11c1a3067 PMD 119eb2067 PTE 0
      [  298.638859] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [  298.638859] CPU: 0 PID: 2997 Comm: iptables Not tainted 4.20.0+ #154
      [  298.638859] RIP: 0010:__mutex_lock+0x6b9/0x16a0
      [  298.638859] Code: c0 00 00 e8 89 82 ff ff 80 bd 8f fc ff ff 00 0f 85 d9 05 00 00 48 8b 85 80 fc ff ff 48 bf 00 00 00 00 00 fc ff df 48 c1 e8 03 <80> 3c 38 00 0f 85 1d 0e 00 00 48 8b 85 c8 fc ff ff 49 39 47 58 c6
      [  298.638859] RSP: 0018:ffff88810e7777a0 EFLAGS: 00010202
      [  298.638859] RAX: 1ffffffff807440b RBX: ffff888111bd4d80 RCX: 0000000000000000
      [  298.638859] RDX: 1ffff110235ff806 RSI: ffff888111bd5538 RDI: dffffc0000000000
      [  298.638859] RBP: ffff88810e777b30 R08: 0000000080000002 R09: 0000000000000000
      [  298.638859] R10: 0000000000000000 R11: 0000000000000000 R12: fffffbfff168a42c
      [  298.638859] R13: ffff888111bd4d80 R14: ffff8881040e9a05 R15: ffffffffc03a2000
      [  298.638859] FS:  00007f39e3758700(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
      [  298.638859] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  298.638859] CR2: fffffbfff807440b CR3: 000000011243e000 CR4: 00000000001006f0
      [  298.638859] Call Trace:
      [  298.638859]  ? mutex_lock_io_nested+0x1560/0x1560
      [  298.638859]  ? kasan_kmalloc+0xa0/0xd0
      [  298.638859]  ? kmem_cache_alloc+0x1c2/0x260
      [  298.638859]  ? __alloc_file+0x92/0x3c0
      [  298.638859]  ? alloc_empty_file+0x43/0x120
      [  298.638859]  ? alloc_file_pseudo+0x220/0x330
      [  298.638859]  ? sock_alloc_file+0x39/0x160
      [  298.638859]  ? __sys_socket+0x113/0x1d0
      [  298.638859]  ? __x64_sys_socket+0x6f/0xb0
      [  298.638859]  ? do_syscall_64+0x138/0x560
      [  298.638859]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  298.638859]  ? __alloc_file+0x92/0x3c0
      [  298.638859]  ? init_object+0x6b/0x80
      [  298.638859]  ? cyc2ns_read_end+0x10/0x10
      [  298.638859]  ? cyc2ns_read_end+0x10/0x10
      [  298.638859]  ? hlock_class+0x140/0x140
      [  298.638859]  ? sched_clock_local+0xd4/0x140
      [  298.638859]  ? sched_clock_local+0xd4/0x140
      [  298.638859]  ? check_flags.part.37+0x440/0x440
      [  298.638859]  ? __lock_acquire+0x4f90/0x4f90
      [  298.638859]  ? set_rq_offline.part.89+0x140/0x140
      [ ... ]
      
      Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bpfilter: restart bpfilter_umh when error occurred · 61fbf593
      Taehee Yoo committed
      bpfilter_umh is stopped via __stop_umh() when a bpfilter error
      occurs.
      It could not be started again because there was no restart
      routine.
      
      The bpfilter_umh_{start/end} section is no longer .init.rodata,
      because this area must be reused by the restart routine. Hence
      the section name is changed to .bpfilter_umh.
      
      bpfilter_ops->start() is the restart callback. It is called when
      bpfilter_umh is stopped.
      The stop bit indicates that bpfilter_umh is stopped. This bit is set
      by both the start and stop routines.
      
      Before this patch,
      Test commands:
         $ iptables -vnL
         $ kill -9 <pid of bpfilter_umh>
         $ iptables -vnL
         [  480.045136] bpfilter: write fail -32
         $ iptables -vnL
      
      All iptables commands will fail.
      
      After this patch,
      Test commands:
         $ iptables -vnL
         $ kill -9 <pid of bpfilter_umh>
         $ iptables -vnL
         $ iptables -vnL
      
      Now, all iptables commands will work.
      
      Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>