1. 30 1月, 2019 6 次提交
  2. 29 1月, 2019 2 次提交
  3. 26 1月, 2019 1 次提交
  4. 24 1月, 2019 1 次提交
  5. 23 1月, 2019 3 次提交
    • N
      devlink: Use DIV_ROUND_UP_ULL in DEVLINK_HEALTH_SIZE_TO_BUFFERS · 33a0efa4
      Nathan Chancellor 提交于
      When building this code on a 32-bit platform such as ARM, there is a
      link time error (lld error shown, happpens with ld.bfd too):
      
      ld.lld: error: undefined symbol: __aeabi_uldivmod
      >>> referenced by devlink.c
      >>>               net/core/devlink.o:(devlink_health_buffers_create) in archive built-in.a
      
      This happens when using a regular division symbol with a u64 dividend.
      Use DIV_ROUND_UP_ULL, which wraps do_div, to avoid this situation.
      
      Fixes: cb5ccfbe ("devlink: Add health buffer support")
      Signed-off-by: NNathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33a0efa4
    • Y
      devlink: Add missing check of nlmsg_put · ed175d9c
      YueHaibing 提交于
      nlmsg_put may fail, this fix add a check of its return value.
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed175d9c
    • C
      net: introduce a knob to control whether to inherit devconf config · 856c395c
      Cong Wang 提交于
      There have been many people complaining about the inconsistent
      behaviors of IPv4 and IPv6 devconf when creating new network
      namespaces.  Currently, for IPv4, we inherit all current settings
      from init_net, but for IPv6 we reset all setting to default.
      
      This patch introduces a new /proc file
      /proc/sys/net/core/devconf_inherit_init_net to control the
      behavior of whether to inhert sysctl current settings from init_net.
      This file itself is only available in init_net.
      
      As demonstrated below:
      
      Initial setup in init_net:
       # cat /proc/sys/net/ipv4/conf/all/rp_filter
       2
       # cat /proc/sys/net/ipv6/conf/all/accept_dad
       1
      
      Default value 0 (current behavior):
       # ip netns del test
       # ip netns add test
       # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
       2
       # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
       0
      
      Set to 1 (inherit from init_net):
       # echo 1 > /proc/sys/net/core/devconf_inherit_init_net
       # ip netns del test
       # ip netns add test
       # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
       2
       # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
       1
      
      Set to 2 (reset to default):
       # echo 2 > /proc/sys/net/core/devconf_inherit_init_net
       # ip netns del test
       # ip netns add test
       # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
       0
       # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
       0
      
      Set to a value out of range (invalid):
       # echo 3 > /proc/sys/net/core/devconf_inherit_init_net
       -bash: echo: write error: Invalid argument
       # echo -1 > /proc/sys/net/core/devconf_inherit_init_net
       -bash: echo: write error: Invalid argument
      Reported-by: NZhu Yanjun <Yanjun.Zhu@windriver.com>
      Reported-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      856c395c
  6. 20 1月, 2019 6 次提交
  7. 19 1月, 2019 8 次提交
  8. 18 1月, 2019 7 次提交
    • Y
      neighbour: Do not perturb drop profiles when neigh_probe · 87fff3ca
      Yang Wei 提交于
      Replace the kfree_skb() by consume_skb() to be drop monitor(dropwatch,
      perf) friendly.
      Signed-off-by: NYang Wei <yang.wei9@zte.com.cn>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      87fff3ca
    • P
      net: add a route cache full diagnostic message · 22c2ad61
      Peter Oskolkov 提交于
      In some testing scenarios, dst/route cache can fill up so quickly
      that even an explicit GC call occasionally fails to clean it up. This leads
      to sporadically failing calls to dst_alloc and "network unreachable" errors
      to the user, which is confusing.
      
      This patch adds a diagnostic message to make the cause of the failure
      easier to determine.
      Signed-off-by: NPeter Oskolkov <posk@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22c2ad61
    • Y
      bpf: fix SO_MAX_PACING_RATE to support TCP internal pacing · e224c390
      Yuchung Cheng 提交于
      If sch_fq packet scheduler is not used, TCP can fallback to
      internal pacing, but this requires sk_pacing_status to
      be properly set.
      
      Fixes: 8c4b4c7e ("bpf: Add setsockopt helper function to bpf")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      e224c390
    • P
      bpf: bpf_setsockopt: reset sock dst on SO_MARK changes · f4924f24
      Peter Oskolkov 提交于
      In sock_setsockopt() (net/core/sock.h), when SO_MARK option is used
      to change sk_mark, sk_dst_reset(sk) is called. The same should be
      done in bpf_setsockopt().
      
      Fixes: 8c4b4c7e ("bpf: Add setsockopt helper function to bpf")
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NPeter Oskolkov <posk@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Reviewed-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f4924f24
    • P
      net: Add extack argument to ndo_fdb_add() · 87b0984e
      Petr Machata 提交于
      Drivers may not be able to support certain FDB entries, and an error
      code is insufficient to give clear hints as to the reasons of rejection.
      
      In order to make it possible to communicate the rejection reason, extend
      ndo_fdb_add() with an extack argument. Adapt the existing
      implementations of ndo_fdb_add() to take the parameter (and ignore it).
      Pass the extack parameter when invoking ndo_fdb_add() from rtnl_fdb_add().
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      87b0984e
    • D
      net: introduce SO_BINDTOIFINDEX sockopt · f5dd3d0c
      David Herrmann 提交于
      This introduces a new generic SOL_SOCKET-level socket option called
      SO_BINDTOIFINDEX. It behaves similar to SO_BINDTODEVICE, but takes a
      network interface index as argument, rather than the network interface
      name.
      
      User-space often refers to network-interfaces via their index, but has
      to temporarily resolve it to a name for a call into SO_BINDTODEVICE.
      This might pose problems when the network-device is renamed
      asynchronously by other parts of the system. When this happens, the
      SO_BINDTODEVICE might either fail, or worse, it might bind to the wrong
      device.
      
      In most cases user-space only ever operates on devices which they
      either manage themselves, or otherwise have a guarantee that the device
      name will not change (e.g., devices that are UP cannot be renamed).
      However, particularly in libraries this guarantee is non-obvious and it
      would be nice if that race-condition would simply not exist. It would
      make it easier for those libraries to operate even in situations where
      the device-name might change under the hood.
      
      A real use-case that we recently hit is trying to start the network
      stack early in the initrd but make it survive into the real system.
      Existing distributions rename network-interfaces during the transition
      from initrd into the real system. This, obviously, cannot affect
      devices that are up and running (unless you also consider moving them
      between network-namespaces). However, the network manager now has to
      make sure its management engine for dormant devices will not run in
      parallel to these renames. Particularly, when you offload operations
      like DHCP into separate processes, these might setup their sockets
      early, and thus have to resolve the device-name possibly running into
      this race-condition.
      
      By avoiding a call to resolve the device-name, we no longer depend on
      the name and can run network setup of dormant devices in parallel to
      the transition off the initrd. The SO_BINDTOIFINDEX ioctl plugs this
      race.
      Reviewed-by: NTom Gundersen <teg@jklm.no>
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5dd3d0c
    • V
      Optimize sk_msg_clone() by data merge to end dst sg entry · fda497e5
      Vakul Garg 提交于
      Function sk_msg_clone has been modified to merge the data from source sg
      entry to destination sg entry if the cloned data resides in same page
      and is contiguous to the end entry of destination sk_msg. This improves
      kernel tls throughput to the tune of 10%.
      
      When the user space tls application calls sendmsg() with MSG_MORE, it leads
      to calling sk_msg_clone() with new data being cloned placed continuous to
      previously cloned data. Without this optimization, a new SG entry in
      the destination sk_msg i.e. rec->msg_plaintext in tls_clone_plaintext_msg()
      gets used. This leads to exhaustion of sg entries in rec->msg_plaintext
      even before a full 16K of allowable record data is accumulated. Hence we
      lose oppurtunity to encrypt and send a full 16K record.
      
      With this patch, the kernel tls can accumulate full 16K of record data
      irrespective of the size of data passed in sendmsg() with MSG_MORE.
      Signed-off-by: NVakul Garg <vakul.garg@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fda497e5
  9. 17 1月, 2019 2 次提交
  10. 10 1月, 2019 2 次提交
    • K
      net/core/neighbour: tell kmemleak about hash tables · 85704cb8
      Konstantin Khlebnikov 提交于
      This fixes false-positive kmemleak reports about leaked neighbour entries:
      
      unreferenced object 0xffff8885c6e4d0a8 (size 1024):
        comm "softirq", pid 0, jiffies 4294922664 (age 167640.804s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 20 2c f3 83 ff ff ff ff  ........ ,......
          08 c0 ef 5f 84 88 ff ff 01 8c 7d 02 01 00 00 00  ..._......}.....
        backtrace:
          [<00000000748509fe>] ip6_finish_output2+0x887/0x1e40
          [<0000000036d7a0d8>] ip6_output+0x1ba/0x600
          [<0000000027ea7dba>] ip6_send_skb+0x92/0x2f0
          [<00000000d6e2111d>] udp_v6_send_skb.isra.24+0x680/0x15e0
          [<000000000668a8be>] udpv6_sendmsg+0x18c9/0x27a0
          [<000000004bd5fa90>] sock_sendmsg+0xb3/0xf0
          [<000000008227b29f>] ___sys_sendmsg+0x745/0x8f0
          [<000000008698009d>] __sys_sendmsg+0xde/0x170
          [<00000000889dacf1>] do_syscall_64+0x9b/0x400
          [<0000000081cdb353>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000005767ed39>] 0xffffffffffffffff
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      85704cb8
    • Y
      bpf: correctly set initial window on active Fast Open sender · 31aa6503
      Yuchung Cheng 提交于
      The existing BPF TCP initial congestion window (TCP_BPF_IW) does not
      to work on (active) Fast Open sender. This is because it changes the
      (initial) window only if data_segs_out is zero -- but data_segs_out
      is also incremented on SYN-data.  This patch fixes the issue by
      proerly accounting for SYN-data additionally.
      
      Fixes: fc747810 ("bpf: Adds support for setting initial cwnd")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Reviewed-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      31aa6503
  11. 06 1月, 2019 1 次提交
    • M
      jump_label: move 'asm goto' support test to Kconfig · e9666d10
      Masahiro Yamada 提交于
      Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
      
      The jump label is controlled by HAVE_JUMP_LABEL, which is defined
      like this:
      
        #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
        # define HAVE_JUMP_LABEL
        #endif
      
      We can improve this by testing 'asm goto' support in Kconfig, then
      make JUMP_LABEL depend on CC_HAS_ASM_GOTO.
      
      Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
      match to the real kernel capability.
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      e9666d10
  12. 05 1月, 2019 1 次提交
    • D
      net, skbuff: do not prefer skb allocation fails early · f8c468e8
      David Rientjes 提交于
      Commit dcda9b04 ("mm, tree wide: replace __GFP_REPEAT by
      __GFP_RETRY_MAYFAIL with more useful semantic") replaced __GFP_REPEAT in
      alloc_skb_with_frags() with __GFP_RETRY_MAYFAIL when the allocation may
      directly reclaim.
      
      The previous behavior would require reclaim up to 1 << order pages for
      skb aligned header_len of order > PAGE_ALLOC_COSTLY_ORDER before failing,
      otherwise the allocations in alloc_skb() would loop in the page allocator
      looking for memory.  __GFP_RETRY_MAYFAIL makes both allocations failable
      under memory pressure, including for the HEAD allocation.
      
      This can cause, among many other things, write() to fail with ENOTCONN
      during RPC when under memory pressure.
      
      These allocations should succeed as they did previous to dcda9b04
      even if it requires calling the oom killer and additional looping in the
      page allocator to find memory.  There is no way to specify the previous
      behavior of __GFP_REPEAT, but it's unlikely to be necessary since the
      previous behavior only guaranteed that 1 << order pages would be reclaimed
      before failing for order > PAGE_ALLOC_COSTLY_ORDER.  That reclaim is not
      guaranteed to be contiguous memory, so repeating for such large orders is
      usually not beneficial.
      
      Removing the setting of __GFP_RETRY_MAYFAIL to restore the previous
      behavior, specifically not allowing alloc_skb() to fail for small orders
      and oom kill if necessary rather than allowing RPCs to fail.
      
      Fixes: dcda9b04 ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic")
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8c468e8