1. 07 9月, 2022 1 次提交
  2. 24 8月, 2022 1 次提交
  3. 16 5月, 2022 1 次提交
  4. 11 5月, 2022 2 次提交
  5. 03 5月, 2022 2 次提交
    • T
      net: sysctl: introduce sysctl SYSCTL_THREE · 4c7f24f8
      Tonghao Zhang 提交于
      This patch introdues the SYSCTL_THREE.
      
      KUnit:
      [00:10:14] ================ sysctl_test (10 subtests) =================
      [00:10:14] [PASSED] sysctl_test_api_dointvec_null_tbl_data
      [00:10:14] [PASSED] sysctl_test_api_dointvec_table_maxlen_unset
      [00:10:14] [PASSED] sysctl_test_api_dointvec_table_len_is_zero
      [00:10:14] [PASSED] sysctl_test_api_dointvec_table_read_but_position_set
      [00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_positive
      [00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_negative
      [00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_positive
      [00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_negative
      [00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_less_int_min
      [00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_greater_int_max
      [00:10:14] =================== [PASSED] sysctl_test ===================
      
      ./run_kselftest.sh -c sysctl
      ...
      ok 1 selftests: sysctl: sysctl.sh
      
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Lorenz Bauer <lmb@cloudflare.com>
      Cc: Akhmat Karakotov <hmukos@yandex-team.ru>
      Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Reviewed-by: NSimon Horman <horms@verge.net.au>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      4c7f24f8
    • T
      net: sysctl: use shared sysctl macro · bd8a5367
      Tonghao Zhang 提交于
      This patch replace two, four and long_one to SYSCTL_XXX.
      
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Lorenz Bauer <lmb@cloudflare.com>
      Cc: Akhmat Karakotov <hmukos@yandex-team.ru>
      Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      bd8a5367
  6. 08 4月, 2022 1 次提交
    • J
      net: extract a few internals from netdevice.h · 6264f58c
      Jakub Kicinski 提交于
      There's a number of functions and static variables used
      under net/core/ but not from the outside. We currently
      dump most of them into netdevice.h. That bad for many
      reasons:
       - netdevice.h is very cluttered, hard to figure out
         what the APIs are;
       - netdevice.h is very long;
       - we have to touch netdevice.h more which causes expensive
         incremental builds.
      
      Create a header under net/core/ and move some declarations.
      
      The new header is also a bit of a catch-all but that's
      fine, if we create more specific headers people will
      likely over-think where their declaration fit best.
      And end up putting them in netdevice.h, again.
      
      More work should be done on splitting netdevice.h into more
      targeted headers, but that'd be more time consuming so small
      steps.
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      6264f58c
  7. 28 2月, 2022 1 次提交
  8. 31 1月, 2022 1 次提交
  9. 30 12月, 2021 1 次提交
  10. 23 10月, 2021 1 次提交
  11. 26 3月, 2021 1 次提交
  12. 24 3月, 2021 1 次提交
    • D
      net: make unregister netdev warning timeout configurable · 5aa3afe1
      Dmitry Vyukov 提交于
      netdev_wait_allrefs() issues a warning if refcount does not drop to 0
      after 10 seconds. While 10 second wait generally should not happen
      under normal workload in normal environment, it seems to fire falsely
      very often during fuzzing and/or in qemu emulation (~10x slower).
      At least it's not possible to understand if it's really a false
      positive or not. Automated testing generally bumps all timeouts
      to very high values to avoid flake failures.
      Add net.core.netdev_unregister_timeout_secs sysctl to make
      the timeout configurable for automated testing systems.
      Lowering the timeout may also be useful for e.g. manual bisection.
      The default value matches the current behavior.
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5aa3afe1
  13. 28 1月, 2021 1 次提交
  14. 28 8月, 2020 1 次提交
    • M
      net: add option to not create fall-back tunnels in root-ns as well · 316cdaa1
      Mahesh Bandewar 提交于
      The sysctl that was added  earlier by commit 79134e6c ("net: do
      not create fallback tunnels for non-default namespaces") to create
      fall-back only in root-ns. This patch enhances that behavior to provide
      option not to create fallback tunnels in root-ns as well. Since modules
      that create fallback tunnels could be built-in and setting the sysctl
      value after booting is pointless, so added a kernel cmdline options to
      change this default. The default setting is preseved for backward
      compatibility. The kernel command line option of fb_tunnels=initns will
      set the sysctl value to 1 and will create fallback tunnels only in initns
      while kernel cmdline fb_tunnels=none will set the sysctl value to 2 and
      fallback tunnels are skipped in every netns.
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Maciej Zenczykowski <maze@google.com>
      Cc: Jian Yang <jianyang@google.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      316cdaa1
  15. 09 7月, 2020 1 次提交
    • K
      bpf: Check correct cred for CAP_SYSLOG in bpf_dump_raw_ok() · 63960260
      Kees Cook 提交于
      When evaluating access control over kallsyms visibility, credentials at
      open() time need to be used, not the "current" creds (though in BPF's
      case, this has likely always been the same). Plumb access to associated
      file->f_cred down through bpf_dump_raw_ok() and its callers now that
      kallsysm_show_value() has been refactored to take struct cred.
      
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: bpf@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 7105e828 ("bpf: allow for correlation of maps and helpers in dump")
      Signed-off-by: NKees Cook <keescook@chromium.org>
      63960260
  16. 08 6月, 2020 1 次提交
  17. 17 5月, 2020 1 次提交
  18. 27 4月, 2020 1 次提交
  19. 20 12月, 2019 1 次提交
    • A
      net, sysctl: Fix compiler warning when only cBPF is present · 1148f9ad
      Alexander Lobakin 提交于
      proc_dointvec_minmax_bpf_restricted() has been firstly introduced
      in commit 2e4a3098 ("bpf: restrict access to core bpf sysctls")
      under CONFIG_HAVE_EBPF_JIT. Then, this ifdef has been removed in
      ede95a63 ("bpf: add bpf_jit_limit knob to restrict unpriv
      allocations"), because a new sysctl, bpf_jit_limit, made use of it.
      Finally, this parameter has become long instead of integer with
      fdadd049 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
      and thus, a new proc_dolongvec_minmax_bpf_restricted() has been
      added.
      
      With this last change, we got back to that
      proc_dointvec_minmax_bpf_restricted() is used only under
      CONFIG_HAVE_EBPF_JIT, but the corresponding ifdef has not been
      brought back.
      
      So, in configurations like CONFIG_BPF_JIT=y && CONFIG_HAVE_EBPF_JIT=n
      since v4.20 we have:
      
        CC      net/core/sysctl_net_core.o
      net/core/sysctl_net_core.c:292:1: warning: ‘proc_dointvec_minmax_bpf_restricted’ defined but not used [-Wunused-function]
        292 | proc_dointvec_minmax_bpf_restricted(struct ctl_table *table, int write,
            | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Suppress this by guarding it with CONFIG_HAVE_EBPF_JIT again.
      
      Fixes: fdadd049 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
      Signed-off-by: NAlexander Lobakin <alobakin@dlink.ru>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191218091821.7080-1-alobakin@dlink.ru
      1148f9ad
  20. 09 8月, 2019 1 次提交
    • E
      net: use listified RX for handling GRO_NORMAL skbs · 323ebb61
      Edward Cree 提交于
      When GRO decides not to coalesce a packet, in napi_frags_finish(), instead
       of passing it to the stack immediately, place it on a list in the napi
       struct.  Then, at flush time (napi_complete_done(), napi_poll(), or
       napi_busy_loop()), call netif_receive_skb_list_internal() on the list.
      We'd like to do that in napi_gro_flush(), but it's not called if
       !napi->gro_bitmask, so we have to do it in the callers instead.  (There are
       a handful of drivers that call napi_gro_flush() themselves, but it's not
       clear why, or whether this will affect them.)
      Because a full 64 packets is an inefficiently large batch, also consume the
       list whenever it exceeds gro_normal_batch, a new net/core sysctl that
       defaults to 8.
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      323ebb61
  21. 19 7月, 2019 1 次提交
  22. 15 6月, 2019 1 次提交
    • E
      net: add high_order_alloc_disable sysctl/static key · ce27ec60
      Eric Dumazet 提交于
      >From linux-3.7, (commit 5640f768 "net: use a per task frag
      allocator") TCP sendmsg() has preferred using order-3 allocations.
      
      While it gives good results for most cases, we had reports
      that heavy uses of TCP over loopback were hitting a spinlock
      contention in page allocations/freeing.
      
      This commits adds a sysctl so that admins can opt-in
      for order-0 allocations. Hopefully mm layer might optimize
      order-3 allocations in the future since it could give us
      a nice boost  (see 8 lines of following benchmark)
      
      The following benchmark shows a win when more than 8 TCP_STREAM
      threads are running (56 x86 cores server in my tests)
      
      for thr in {1..30}
      do
       sysctl -wq net.core.high_order_alloc_disable=0
       T0=`./super_netperf $thr -H 127.0.0.1 -l 15`
       sysctl -wq net.core.high_order_alloc_disable=1
       T1=`./super_netperf $thr -H 127.0.0.1 -l 15`
       echo $thr:$T0:$T1
      done
      
      1: 49979: 37267
      2: 98745: 76286
      3: 141088: 110051
      4: 177414: 144772
      5: 197587: 173563
      6: 215377: 208448
      7: 241061: 234087
      8: 267155: 263373
      9: 295069: 297402
      10: 312393: 335213
      11: 340462: 368778
      12: 371366: 403954
      13: 412344: 443713
      14: 426617: 473580
      15: 474418: 507861
      16: 503261: 538539
      17: 522331: 563096
      18: 532409: 567084
      19: 550824: 605240
      20: 525493: 641988
      21: 564574: 665843
      22: 567349: 690868
      23: 583846: 710917
      24: 588715: 736306
      25: 603212: 763494
      26: 604083: 792654
      27: 602241: 796450
      28: 604291: 797993
      29: 611610: 833249
      30: 577356: 841062
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce27ec60
  23. 24 3月, 2019 1 次提交
  24. 23 1月, 2019 1 次提交
    • C
      net: introduce a knob to control whether to inherit devconf config · 856c395c
      Cong Wang 提交于
      There have been many people complaining about the inconsistent
      behaviors of IPv4 and IPv6 devconf when creating new network
      namespaces.  Currently, for IPv4, we inherit all current settings
      from init_net, but for IPv6 we reset all setting to default.
      
      This patch introduces a new /proc file
      /proc/sys/net/core/devconf_inherit_init_net to control the
      behavior of whether to inhert sysctl current settings from init_net.
      This file itself is only available in init_net.
      
      As demonstrated below:
      
      Initial setup in init_net:
       # cat /proc/sys/net/ipv4/conf/all/rp_filter
       2
       # cat /proc/sys/net/ipv6/conf/all/accept_dad
       1
      
      Default value 0 (current behavior):
       # ip netns del test
       # ip netns add test
       # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
       2
       # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
       0
      
      Set to 1 (inherit from init_net):
       # echo 1 > /proc/sys/net/core/devconf_inherit_init_net
       # ip netns del test
       # ip netns add test
       # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
       2
       # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
       1
      
      Set to 2 (reset to default):
       # echo 2 > /proc/sys/net/core/devconf_inherit_init_net
       # ip netns del test
       # ip netns add test
       # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
       0
       # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
       0
      
      Set to a value out of range (invalid):
       # echo 3 > /proc/sys/net/core/devconf_inherit_init_net
       -bash: echo: write error: Invalid argument
       # echo -1 > /proc/sys/net/core/devconf_inherit_init_net
       -bash: echo: write error: Invalid argument
      Reported-by: NZhu Yanjun <Yanjun.Zhu@windriver.com>
      Reported-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      856c395c
  25. 12 12月, 2018 1 次提交
    • D
      bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K · fdadd049
      Daniel Borkmann 提交于
      Michael and Sandipan report:
      
        Commit ede95a63 introduced a bpf_jit_limit tuneable to limit BPF
        JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
        and is adjusted again at init time if MODULES_VADDR is defined.
      
        For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
        the compile-time default at boot-time, which is 0x9c400000 when
        using 64K page size. This overflows the signed 32-bit bpf_jit_limit
        value:
      
        root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
        -1673527296
      
        and can cause various unexpected failures throughout the network
        stack. In one case `strace dhclient eth0` reported:
      
        setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
                   16) = -1 ENOTSUPP (Unknown error 524)
      
        and similar failures can be seen with tools like tcpdump. This doesn't
        always reproduce however, and I'm not sure why. The more consistent
        failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
        host would time out on systemd/netplan configuring a virtio-net NIC
        with no noticeable errors in the logs.
      
      Given this and also given that in near future some architectures like
      arm64 will have a custom area for BPF JIT image allocations we should
      get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
      4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
      so therefore add another overridable bpf_jit_alloc_exec_limit() helper
      function which returns the possible size of the memory area for deriving
      the default heuristic in bpf_jit_charge_init().
      
      Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
      bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
      JIT memory provider, and therefore in case archs implement their custom
      module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
      vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
      
      Additionally, for archs supporting large page sizes, we should change
      the sysctl to be handled as long to not run into sysctl restrictions
      in future.
      
      Fixes: ede95a63 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
      Reported-by: NSandipan Das <sandipan@linux.ibm.com>
      Reported-by: NMichael Roth <mdroth@linux.vnet.ibm.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: NMichael Roth <mdroth@linux.vnet.ibm.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      fdadd049
  26. 26 10月, 2018 1 次提交
    • D
      bpf: add bpf_jit_limit knob to restrict unpriv allocations · ede95a63
      Daniel Borkmann 提交于
      Rick reported that the BPF JIT could potentially fill the entire module
      space with BPF programs from unprivileged users which would prevent later
      attempts to load normal kernel modules or privileged BPF programs, for
      example. If JIT was enabled but unsuccessful to generate the image, then
      before commit 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      we would always fall back to the BPF interpreter. Nowadays in the case
      where the CONFIG_BPF_JIT_ALWAYS_ON could be set, then the load will abort
      with a failure since the BPF interpreter was compiled out.
      
      Add a global limit and enforce it for unprivileged users such that in case
      of BPF interpreter compiled out we fail once the limit has been reached
      or we fall back to BPF interpreter earlier w/o using module mem if latter
      was compiled in. In a next step, fair share among unprivileged users can
      be resolved in particular for the case where we would fail hard once limit
      is reached.
      
      Fixes: 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64")
      Co-Developed-by: NRick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: LKML <linux-kernel@vger.kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      ede95a63
  27. 06 4月, 2018 1 次提交
    • R
      headers: untangle kmemleak.h from mm.h · 514c6032
      Randy Dunlap 提交于
      Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
      reason.  It looks like it's only a convenience, so remove kmemleak.h
      from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
      don't already #include it.  Also remove <linux/kmemleak.h> from source
      files that do not use it.
      
      This is tested on i386 allmodconfig and x86_64 allmodconfig.  It would
      be good to run it through the 0day bot for other $ARCHes.  I have
      neither the horsepower nor the storage space for the other $ARCHes.
      
      Update: This patch has been extensively build-tested by both the 0day
      bot & kisskb/ozlabs build farms.  Both of them reported 2 build failures
      for which patches are included here (in v2).
      
      [ slab.h is the second most used header file after module.h; kernel.h is
        right there with slab.h. There could be some minor error in the
        counting due to some #includes having comments after them and I didn't
        combine all of those. ]
      
      [akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
      Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
      Link: http://kisskb.ellerman.id.au/kisskb/head/13396/Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Reported-by: Michael Ellerman <mpe@ellerman.id.au>	[2 build failures]
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>	[2 build failures]
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Wei Yongjun <weiyongjun1@huawei.com>
      Cc: Luis R. Rodriguez <mcgrof@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: John Johansen <john.johansen@canonical.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      514c6032
  28. 28 3月, 2018 1 次提交
  29. 10 3月, 2018 1 次提交
    • E
      net: do not create fallback tunnels for non-default namespaces · 79134e6c
      Eric Dumazet 提交于
      fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
      ip6tnl0, ip6gre0) are automatically created when the corresponding
      module is loaded.
      
      These tunnels are also automatically created when a new network
      namespace is created, at a great cost.
      
      In many cases, netns are used for isolation purposes, and these
      extra network devices are a waste of resources. We are using
      thousands of netns per host, and hit the netns creation/delete
      bottleneck a lot. (Many thanks to Kirill for recent work on this)
      
      Add a new sysctl so that we can opt-out from this automatic creation.
      
      Note that these tunnels are still created for the initial namespace,
      to be the least intrusive for typical setups.
      
      Tested:
      lpk43:~# cat add_del_unshare.sh
      for i in `seq 1 40`
      do
       (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
      done
      wait
      
      lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      
      real	0m37.521s
      user	0m0.886s
      sys	7m7.084s
      lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      
      real	0m4.761s
      user	0m0.851s
      sys	1m8.343s
      lpk43:~#
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79134e6c
  30. 13 2月, 2018 1 次提交
  31. 20 1月, 2018 2 次提交
  32. 10 1月, 2018 1 次提交
    • A
      bpf: introduce BPF_JIT_ALWAYS_ON config · 290af866
      Alexei Starovoitov 提交于
      The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
      
      A quote from goolge project zero blog:
      "At this point, it would normally be necessary to locate gadgets in
      the host kernel code that can be used to actually leak data by reading
      from an attacker-controlled location, shifting and masking the result
      appropriately and then using the result of that as offset to an
      attacker-controlled address for a load. But piecing gadgets together
      and figuring out which ones work in a speculation context seems annoying.
      So instead, we decided to use the eBPF interpreter, which is built into
      the host kernel - while there is no legitimate way to invoke it from inside
      a VM, the presence of the code in the host kernel's text section is sufficient
      to make it usable for the attack, just like with ordinary ROP gadgets."
      
      To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
      option that removes interpreter from the kernel in favor of JIT-only mode.
      So far eBPF JIT is supported by:
      x64, arm64, arm32, sparc64, s390, powerpc64, mips64
      
      The start of JITed program is randomized and code page is marked as read-only.
      In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
      
      v2->v3:
      - move __bpf_prog_ret0 under ifdef (Daniel)
      
      v1->v2:
      - fix init order, test_bpf and cBPF (Daniel's feedback)
      - fix offloaded bpf (Jakub's feedback)
      - add 'return 0' dummy in case something can invoke prog->bpf_func
      - retarget bpf tree. For bpf-next the patch would need one extra hunk.
        It will be sent when the trees are merged back to net-next
      
      Considered doing:
        int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
      but it seems better to land the patch as-is and in bpf-next remove
      bpf_jit_enable global variable from all JITs, consolidate in one place
      and remove this jit_init() function.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      290af866
  33. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  34. 26 5月, 2017 1 次提交
    • R
      net: move somaxconn init from sysctl code · 7c3f1875
      Roman Kapl 提交于
      The default value for somaxconn is set in sysctl_core_net_init(), but this
      function is not called when kernel is configured without CONFIG_SYSCTL.
      
      This results in the kernel not being able to accept TCP connections,
      because the backlog has zero size. Usually, the user ends up with:
      "TCP: request_sock_TCP: Possible SYN flooding on port 7. Dropping request.  Check SNMP counters."
      If SYN cookies are not enabled the connection is rejected.
      
      Before ef547f2a (tcp: remove max_qlen_log), the effects were less
      severe, because the backlog was always at least eight slots long.
      Signed-off-by: NRoman Kapl <roman.kapl@sysgo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c3f1875
  35. 22 4月, 2017 1 次提交
  36. 25 3月, 2017 1 次提交
  37. 18 2月, 2017 1 次提交
    • D
      bpf: make jited programs visible in traces · 74451e66
      Daniel Borkmann 提交于
      Long standing issue with JITed programs is that stack traces from
      function tracing check whether a given address is kernel code
      through {__,}kernel_text_address(), which checks for code in core
      kernel, modules and dynamically allocated ftrace trampolines. But
      what is still missing is BPF JITed programs (interpreted programs
      are not an issue as __bpf_prog_run() will be attributed to them),
      thus when a stack trace is triggered, the code walking the stack
      won't see any of the JITed ones. The same for address correlation
      done from user space via reading /proc/kallsyms. This is read by
      tools like perf, but the latter is also useful for permanent live
      tracing with eBPF itself in combination with stack maps when other
      eBPF types are part of the callchain. See offwaketime example on
      dumping stack from a map.
      
      This work tries to tackle that issue by making the addresses and
      symbols known to the kernel. The lookup from *kernel_text_address()
      is implemented through a latched RB tree that can be read under
      RCU in fast-path that is also shared for symbol/size/offset lookup
      for a specific given address in kallsyms. The slow-path iteration
      through all symbols in the seq file done via RCU list, which holds
      a tiny fraction of all exported ksyms, usually below 0.1 percent.
      Function symbols are exported as bpf_prog_<tag>, in order to aide
      debugging and attribution. This facility is currently enabled for
      root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
      is active in any mode. The rationale behind this is that still a lot
      of systems ship with world read permissions on kallsyms thus addresses
      should not get suddenly exposed for them. If that situation gets
      much better in future, we always have the option to change the
      default on this. Likewise, unprivileged programs are not allowed
      to add entries there either, but that is less of a concern as most
      such programs types relevant in this context are for root-only anyway.
      If enabled, call graphs and stack traces will then show a correct
      attribution; one example is illustrated below, where the trace is
      now visible in tooling such as perf script --kallsyms=/proc/kallsyms
      and friends.
      
      Before:
      
        7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
      
      After:
      
        7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
        [...]
        7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74451e66