1. 05 Nov, 2019 1 commit
    • A
      net: of_get_phy_mode: Change API to solve int/unit warnings · 0c65b2b9
      Andrew Lunn committed
      Before this change, of_get_phy_mode() returned an enum,
      phy_interface_t. On error, a negative errno such as -ENODEV was
      returned. If the result of the function is stored in a variable of type
      phy_interface_t, and the compiler has decided to represent this as an
      unsigned int, a comparison with -ENODEV etc. is a signed vs unsigned
      comparison.
      
      Fix this problem by changing the API. Make the function return an
      error, or 0 on success, and pass a pointer, of type phy_interface_t,
      where the phy mode should be stored.
      
      v2:
      Return with *interface set to PHY_INTERFACE_MODE_NA on error.
      Add error checks to all users of of_get_phy_mode()
      Fixup a few reverse christmas tree errors
      Fixup a few slightly malformed reverse christmas trees
      
      v3:
      Fix 0-day reported errors.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
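The calling convention described in this commit message can be sketched in plain userspace C. This is only an illustration: `mock_get_phy_mode()` and the trimmed enum below are stand-ins invented here, not the kernel's definitions. The point is that the error code travels in a signed `int` return value, while the mode is stored through a pointer and is set to `PHY_INTERFACE_MODE_NA` on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Trimmed stand-in for the kernel's phy_interface_t (illustrative only). */
typedef enum {
	PHY_INTERFACE_MODE_NA,
	PHY_INTERFACE_MODE_MII,
	PHY_INTERFACE_MODE_RGMII,
} phy_interface_t;

/*
 * Hypothetical mock of the new-style API: returns 0 on success or a
 * negative errno, and stores the mode through the pointer. On error,
 * *interface is set to PHY_INTERFACE_MODE_NA, as the v2 note describes.
 */
static int mock_get_phy_mode(int have_property, phy_interface_t *interface)
{
	assert(interface != NULL);

	if (!have_property) {
		*interface = PHY_INTERFACE_MODE_NA;
		return -ENODEV;
	}

	*interface = PHY_INTERFACE_MODE_RGMII;
	return 0;
}
```

A caller would then check the signed return value (`err = mock_get_phy_mode(1, &mode); if (err) return err;`) instead of comparing an enum that may be unsigned against a negative errno.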
  2. 03 Nov, 2019 3 commits
    • D
      bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers · 6ae08ae3
      Daniel Borkmann committed
      The current bpf_probe_read() and bpf_probe_read_str() helpers are broken
      in that they assume they can be used for probing memory access for kernel
      space addresses /as well as/ user space addresses.
      
      However, plain use of probe_kernel_read() for both cases will always
      attempt to access the kernel address space, given that the access is
      performed under KERNEL_DS. Some archs in fact have overlapping address
      spaces where a kernel pointer and a user pointer would have the /same/
      address value, and therefore accessing application memory via
      bpf_probe_read{,_str}() would read garbage values.
      
      Let's fix the BPF side by making use of the recently added 3d708182 ("uaccess:
      Add non-pagefault user-space read functions"). Unfortunately, the only way
      to fix this status quo is to add dedicated bpf_probe_read_{user,kernel}()
      and bpf_probe_read_{user,kernel}_str() helpers. The bpf_probe_read{,_str}()
      helpers are kept as-is to retain their current behavior.
      
      The two *_user() variants always attempt the access with USER_DS set. The
      two *_kernel() variants will return -EFAULT when accessing user memory if
      the underlying architecture has non-overlapping address ranges, which also
      avoids throwing the kernel warning added by 00c42373 ("x86-64: add warning
      for non-canonical user access address dereferences").
      
      Fixes: a5e8c070 ("bpf: add bpf_probe_read_str helper")
      Fixes: 2541517c ("tracing, perf: Implement BPF programs attached to kprobes")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/796ee46e948bc808d54891a1108435f8652c6ca4.1572649915.git.daniel@iogearbox.net
    • D
      uaccess: Add strict non-pagefault kernel-space read function · 75a1a607
      Daniel Borkmann committed
      Add two new probe_kernel_read_strict() and strncpy_from_unsafe_strict()
      helpers which by default alias to the __probe_kernel_read() and the
      __strncpy_from_unsafe(), respectively, but can be overridden by archs
      which have non-overlapping address ranges for kernel space and user
      space in order to bail out with -EFAULT when attempting to probe user
      memory including non-canonical user access addresses [0]:
      
        4-level page tables:
          user-space mem: 0x0000000000000000 - 0x00007fffffffffff
          non-canonical:  0x0000800000000000 - 0xffff7fffffffffff
      
        5-level page tables:
          user-space mem: 0x0000000000000000 - 0x00ffffffffffffff
          non-canonical:  0x0100000000000000 - 0xfeffffffffffffff
      
      The idea is that these helpers are complementary to the probe_user_read()
      and strncpy_from_unsafe_user() which probe user-only memory. Both added
      helpers here do the same, but for kernel-only addresses.
      
      Both sets of helpers are going to be used for BPF tracing. They also
      explicitly avoid throwing the splat for non-canonical user addresses from
      00c42373 ("x86-64: add warning for non-canonical user access address
      dereferences").
      
      For compat, the current probe_kernel_read() and strncpy_from_unsafe() are
      left as-is.
      
        [0] Documentation/x86/x86_64/mm.txt
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: x86@kernel.org
      Link: https://lore.kernel.org/bpf/eefeefd769aa5a013531f491a71f0936779e916b.1572649915.git.daniel@iogearbox.net
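The 4-level page table ranges listed in this commit message can be turned into a small address classifier. The sketch below is a userspace illustration only: the function names are invented here, and the real architecture override may perform its check differently. It shows the idea of a strict kernel-only probe rejecting both user-space and non-canonical addresses.

```c
#include <assert.h>
#include <stdint.h>

enum addr_class { ADDR_USER, ADDR_NON_CANONICAL, ADDR_KERNEL };

/* Upper bounds for 4-level page tables, taken from the ranges quoted above. */
#define USER_END_4LVL      UINT64_C(0x00007fffffffffff)
#define NON_CANON_END_4LVL UINT64_C(0xffff7fffffffffff)

/* Classify a 64-bit virtual address according to the 4-level layout. */
static enum addr_class classify_addr_4lvl(uint64_t addr)
{
	if (addr <= USER_END_4LVL)
		return ADDR_USER;
	if (addr <= NON_CANON_END_4LVL)
		return ADDR_NON_CANONICAL;
	return ADDR_KERNEL;
}

/*
 * Hypothetical strict check: only addresses in the kernel half may be
 * probed; for anything else a strict helper would bail out with -EFAULT.
 */
static int strict_kernel_read_allowed(uint64_t addr)
{
	return classify_addr_4lvl(addr) == ADDR_KERNEL;
}
```

With this layout, `0x0000800000000000` is classified as non-canonical and is therefore refused before any access is attempted, which is what avoids the warning mentioned in the message.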
    • D
      uaccess: Add non-pagefault user-space write function · 1d1585ca
      Daniel Borkmann committed
      Commit 3d708182 ("uaccess: Add non-pagefault user-space read functions")
      did not add a probe write function. Therefore, factor out a
      probe_write_common() helper with most of the logic of probe_kernel_write()
      except setting KERNEL_DS, and add a new probe_user_write() helper so it
      can be used from the BPF side.
      
      Again, on some archs, the user address space and kernel address space can
      co-exist and be overlapping, so in such case, setting KERNEL_DS would mean
      that the given address is treated as being in kernel address space.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/bpf/9df2542e68141bfa3addde631441ee45503856a8.1572649915.git.daniel@iogearbox.net
  3. 02 Nov, 2019 2 commits
  4. 01 Nov, 2019 8 commits
  5. 31 Oct, 2019 16 commits
    • A
      bpf: Replace prog_raw_tp+btf_id with prog_tracing · f1b9509c
      Alexei Starovoitov committed
      The bpf program type raw_tp together with 'expected_attach_type'
      was the most appropriate api to indicate BTF-enabled raw_tp programs.
      But during development it became apparent that 'expected_attach_type'
      cannot be used and a new 'attach_btf_id' field had to be introduced.
      This means that the information is duplicated in two fields, one of
      which is ignored.
      
      Clean it up by introducing a new program type where both the
      'expected_attach_type' and 'attach_btf_id' fields have
      specific meaning.
      In the future 'expected_attach_type' will be extended
      with other attach points that have similar semantics to raw_tp.
      
      This patch replaces BTF-enabled BPF_PROG_TYPE_RAW_TRACEPOINT with
        prog_type = BPF_PROG_TYPE_TRACING
        expected_attach_type = BPF_TRACE_RAW_TP
        attach_btf_id = btf_id of raw tracepoint inside the kernel
      Future patches will add
        expected_attach_type = BPF_TRACE_FENTRY or BPF_TRACE_FEXIT
      where programs have the same input context and the same helpers,
      but different attach points.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191030223212.953010-2-ast@kernel.org
    • J
      efi/efi_test: Lock down /dev/efi_test and require CAP_SYS_ADMIN · 359efcc2
      Javier Martinez Canillas committed
      The driver exposes EFI runtime services to user-space through an IOCTL
      interface, calling the EFI services function pointers directly without
      using the efivar API.
      
      Disallow access to the /dev/efi_test character device when the kernel is
      locked down, to prevent arbitrary user-space from calling EFI runtime
      services.
      
      Also require CAP_SYS_ADMIN to open the chardev, to prevent unprivileged
      users from calling the EFI runtime services, instead of just relying on
      the chardev file mode bits for this.
      
      The main user of this driver is the fwts [0] tool that already checks if
      the effective user ID is 0 and fails otherwise. So this change shouldn't
      cause any regression to this tool.
      
      [0]: https://wiki.ubuntu.com/FirmwareTestSuite/Reference/uefivarinfo
      Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Laszlo Ersek <lersek@redhat.com>
      Acked-by: Matthew Garrett <mjg59@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: https://lkml.kernel.org/r/20191029173755.27149-7-ardb@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • K
      x86, efi: Never relocate kernel below lowest acceptable address · 220dd769
      Kairui Song committed
      Currently, the kernel fails to boot on some HyperV VMs when using EFI,
      and this is a potential issue on all x86 platforms.
      
      It is caused by broken kernel relocation on EFI systems when the
      following three conditions are met:
      
      1. Kernel image is not loaded to the default address (LOAD_PHYSICAL_ADDR)
         by the loader.
      2. There isn't enough room to contain the kernel, starting from the
         default load address (e.g. something else occupies part of the region).
      3. In the memmap provided by EFI firmware, there is a memory region
         that starts below LOAD_PHYSICAL_ADDR and is suitable for containing
         the kernel.
      
      The EFI stub will perform a kernel relocation when condition 1 is met.
      But due to condition 2, the EFI stub can't relocate the kernel to the
      preferred address, so it falls back to asking the EFI firmware to
      allocate the lowest usable memory region, gets the low region mentioned
      in condition 3, and relocates the kernel there.
      
      It's incorrect to relocate the kernel below LOAD_PHYSICAL_ADDR. This
      is the lowest acceptable kernel relocation address.
      
      The first thing that goes wrong is in arch/x86/boot/compressed/head_64.S.
      Kernel decompression will force the use of LOAD_PHYSICAL_ADDR as the
      output address if the kernel is located below it. Then the relocation
      before decompression, which moves the kernel to the end of the
      decompression buffer, will overwrite other memory regions, as there is
      not enough memory there.
      
      To fix it, just don't let the EFI stub relocate the kernel to any address
      lower than the lowest acceptable address.
      
      [ ardb: introduce efi_low_alloc_above() to reduce the scope of the change ]
      Signed-off-by: Kairui Song <kasong@redhat.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: https://lkml.kernel.org/r/20191029173755.27149-6-ardb@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
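The fix described above amounts to "pick the lowest suitable region, but never below a minimum address". The following userspace sketch illustrates that selection over a simplified memory map. It is not the kernel's efi_low_alloc_above(): the `mem_region` type and `low_alloc_above()` function are invented here, and alignment and EFI memory types are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified free-memory descriptor (hypothetical, not an EFI descriptor). */
struct mem_region {
	uint64_t start;
	uint64_t size;
};

/*
 * Find the lowest address >= min at which `size` bytes fit inside one of
 * the regions. Returns 0 and stores the address on success, -1 otherwise.
 */
static int low_alloc_above(const struct mem_region *map, size_t n,
			   uint64_t size, uint64_t min, uint64_t *addr)
{
	uint64_t best = 0;
	int found = 0;

	assert(map != NULL && addr != NULL);

	for (size_t i = 0; i < n; i++) {
		uint64_t start = map[i].start;
		uint64_t end = start + map[i].size; /* exclusive end */

		/* Never consider the part of a region that lies below min. */
		if (start < min)
			start = min;
		if (start >= end || end - start < size)
			continue;
		if (!found || start < best) {
			best = start;
			found = 1;
		}
	}

	if (!found)
		return -1;
	*addr = best;
	return 0;
}
```

In terms of the three conditions in the message, a low region below the minimum (condition 3) is simply skipped or clamped, so the fallback allocation can no longer place the kernel below the lowest acceptable address.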
    • V
      net: sched: update action implementations to support flags · e3822678
      Vlad Buslov committed
      Extend struct tc_action with new "tcfa_flags" field. Set the field in
      tcf_idr_create() function and provide new helper
      tcf_idr_create_from_flags() that derives 'cpustats' boolean from flags
      value. Update individual hardware-offloaded actions init() to pass their
      "flags" argument to new helper in order to skip percpu stats allocation
      when user requested it through flags.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      net: sched: extend TCA_ACT space with TCA_ACT_FLAGS · abbb0d33
      Vlad Buslov committed
      Extend TCA_ACT space with nla_bitfield32 flags. Add
      TCA_ACT_FLAGS_NO_PERCPU_STATS as the only allowed flag. Parse the flags in
      tcf_action_init_1() and pass resulting value as additional argument to
      a_o->init().
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      net: sched: modify stats helper functions to support regular stats · 5e174d5e
      Vlad Buslov committed
      Modify the stats update helper functions introduced in previous patches in
      this series to fall back to the regular tc_action->tcfa_{b|q}stats if cpu
      stats are not allocated for the action argument. If regular non-percpu
      allocated counters are in use, then obtain the action's tcfa_lock while
      modifying them.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
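The fallback described in this commit can be sketched in userspace as follows. Everything here is a simplified stand-in: `mock_action`, `mock_bstats` and `mock_bstats_update()` are names invented for illustration, and the lock is modelled by a counter rather than a real spinlock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mock_bstats {
	uint64_t bytes;
	uint64_t packets;
};

struct mock_action {
	struct mock_bstats *cpu_bstats; /* per-CPU array, or NULL if not allocated */
	struct mock_bstats bstats;      /* regular, shared counters */
	unsigned int lock_taken;        /* counts stand-in lock acquisitions */
};

/* Update per-CPU counters when present, else the locked regular counters. */
static void mock_bstats_update(struct mock_action *a, unsigned int cpu,
			       uint64_t bytes)
{
	assert(a != NULL);

	if (a->cpu_bstats) {
		/* Per-CPU slot: no lock needed in this model. */
		a->cpu_bstats[cpu].bytes += bytes;
		a->cpu_bstats[cpu].packets++;
		return;
	}

	a->lock_taken++; /* kernel code would take the action lock here */
	a->bstats.bytes += bytes;
	a->bstats.packets++;
	/* ...and release the lock here */
}
```

The design trade-off is memory against contention: skipping the per-CPU allocation saves memory per action, at the cost of serializing counter updates on a lock.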
    • V
      net: sched: don't expose action qstats to skb_tc_reinsert() · ef816f3c
      Vlad Buslov committed
      The previous commit introduced a helper function for updating qstats and
      refactored a set of actions to use the helpers instead of modifying qstats
      directly. However, one of the affected actions exposes its qstats to
      skb_tc_reinsert(), which then modifies it.
      
      Refactor skb_tc_reinsert() to return integer error code and don't increment
      overlimit qstats in case of error, and use the returned error code in
      tcf_mirred_act() to manually increment the overlimit counter with new
      helper function.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      net: sched: extract qstats update code into functions · 26b537a8
      Vlad Buslov committed
      Extract common code that increments cpu_qstats counters into standalone act
      API functions. Change hardware offloaded actions that use percpu counter
      allocation to use the new functions instead of accessing cpu_qstats
      directly.
      
      This commit doesn't change functionality.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      net: sched: extract bstats update code into function · 5e1ad95b
      Vlad Buslov committed
      Extract common code that increments cpu_bstats counter into standalone act
      API function. Change hardware offloaded actions that use percpu counter
      allocation to use the new function instead of incrementing cpu_bstats
      directly.
      
      This commit doesn't change functionality.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      net: sched: extract common action counters update code into function · c8ecebd0
      Vlad Buslov committed
      Currently, all implementations of tc_action_ops->stats_update() callback
      have almost exactly the same implementation of counters update
      code (besides gact which also updates drop counter). In order to simplify
      support for using both percpu-allocated and regular action counters
      depending on run-time flag in following patches, extract action counters
      update code into standalone function in act API.
      
      This commit doesn't change functionality.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: annotate lockless accesses to sk->sk_napi_id · ee8d153d
      Eric Dumazet committed
      We already annotated most accesses to sk->sk_napi_id.
      
      We missed sk_mark_napi_id() and sk_mark_napi_id_once(),
      which might be called without the socket lock held in the UDP stack.
      
      KCSAN reported:
      BUG: KCSAN: data-race in udpv6_queue_rcv_one_skb / udpv6_queue_rcv_one_skb
      
      write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 0:
       sk_mark_napi_id include/net/busy_poll.h:125 [inline]
       __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline]
       udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672
       udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689
       udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832
       __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913
       udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015
       ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409
       ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459
       dst_input include/net/dst.h:442 [inline]
       ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
      
      write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 1:
       sk_mark_napi_id include/net/busy_poll.h:125 [inline]
       __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline]
       udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672
       udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689
       udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832
       __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913
       udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015
       ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409
       ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459
       dst_input include/net/dst.h:442 [inline]
       ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 10890 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: e68b6e50 ("udp: enable busy polling for all sockets")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      flow_dissector: extract more ICMP information · 5dec597e
      Matteo Croce committed
      The ICMP flow dissector currently parses only the Type and Code fields.
      Some ICMP packets (echo, timestamp) have a 16 bit Identifier field which
      is used to correlate packets.
      Add such a field to flow_dissector_key_icmp and replace skb_flow_get_be16()
      with a more complex function which populates this field.
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
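As a rough illustration of what extracting the Identifier involves, the sketch below parses a raw ICMPv4 header in userspace. The struct and function names are invented here and this is not the flow dissector code. It assumes the standard ICMP layout (type, code, 16-bit checksum, then a 16-bit identifier for echo and timestamp messages) and returns the identifier in host byte order for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ICMPv4 message types that carry an Identifier field. */
enum {
	MOCK_ICMP_ECHOREPLY      = 0,
	MOCK_ICMP_ECHO           = 8,
	MOCK_ICMP_TIMESTAMP      = 13,
	MOCK_ICMP_TIMESTAMPREPLY = 14,
};

/* Hypothetical key, loosely modelled on the fields named in the message. */
struct mock_icmp_key {
	uint8_t type;
	uint8_t code;
	uint16_t id; /* 0 when the message type has no identifier */
};

static int mock_icmp_has_id(uint8_t type)
{
	switch (type) {
	case MOCK_ICMP_ECHO:
	case MOCK_ICMP_ECHOREPLY:
	case MOCK_ICMP_TIMESTAMP:
	case MOCK_ICMP_TIMESTAMPREPLY:
		return 1;
	default:
		return 0;
	}
}

/* Parse type, code and (if present) identifier. Returns 0, or -1 if short. */
static int mock_icmp_parse(const uint8_t *hdr, size_t len,
			   struct mock_icmp_key *key)
{
	assert(hdr != NULL && key != NULL);

	if (len < 8)
		return -1;

	key->type = hdr[0];
	key->code = hdr[1];
	/* Bytes 2-3 are the checksum; the identifier follows in bytes 4-5. */
	key->id = mock_icmp_has_id(key->type)
		? (uint16_t)((hdr[4] << 8) | hdr[5])
		: 0;
	return 0;
}
```

Two echo requests with the same type and code but different identifiers then produce different keys, which is what allows them to be correlated with their replies.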
    • M
      flow_dissector: add meaningful comments · 98298e6c
      Matteo Croce committed
      Document two pieces of code which can't be understood at a glance.
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: annotate accesses to sk->sk_incoming_cpu · 7170a977
      Eric Dumazet committed
      This socket field can be read and written by concurrent cpus.
      
      Use READ_ONCE() and WRITE_ONCE() annotations to document this,
      and avoid some compiler 'optimizations'.
      
      KCSAN reported:
      
      BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv
      
      write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
       sk_incoming_cpu_update include/net/sock.h:953 [inline]
       tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
       do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
       do_softirq kernel/softirq.c:329 [inline]
       __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
      
      read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
       sk_incoming_cpu_update include/net/sock.h:952 [inline]
       tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
       smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
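The annotation pattern named in this commit can be sketched in userspace with simplified macros. The `READ_ONCE()`/`WRITE_ONCE()` definitions below are minimal stand-ins built on a volatile cast and the GCC/Clang `__typeof__` extension, not the kernel's definitions, and `mock_sock` is a hypothetical struct.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins: force exactly one load or store. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct mock_sock {
	int incoming_cpu; /* may be read and written by concurrent CPUs */
};

/* Record the CPU that last processed the socket, writing only on change. */
static void mock_incoming_cpu_update(struct mock_sock *sk, int cpu)
{
	assert(sk != NULL);

	if (READ_ONCE(sk->incoming_cpu) != cpu)
		WRITE_ONCE(sk->incoming_cpu, cpu);
}
```

The macros do not add ordering or atomicity beyond a single access. They document that the field is accessed locklessly and keep the compiler from tearing, fusing or re-reading the access.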
    • J
      tipc: add smart nagle feature · c0bceb97
      Jon Maloy committed
      We introduce a feature that works like a combination of TCP_NAGLE and
      TCP_CORK, but without some of the weaknesses of those. In particular,
      we will not observe long delivery delays because of delayed acks, since
      the algorithm itself decides if and when acks are to be sent from the
      receiving peer.
      
      - The nagle property as such is determined by manipulating a new
        'maxnagle' field in struct tipc_sock. If certain conditions are met,
        'maxnagle' will define max size of the messages which can be bundled.
        If it is set to zero no messages are ever bundled, implying that the
        nagle property is disabled.
      - A socket with the nagle property enabled enters nagle mode when more
        than 4 messages have been sent out without receiving any data message
        from the peer.
      - A socket leaves nagle mode whenever it receives a data message from
        the peer.
      
      In nagle mode, messages smaller than 'maxnagle' are accumulated in the
      socket write queue. The last buffer in the queue is marked with a new
      'ack_required' bit, which forces the receiving peer to send a CONN_ACK
      message back to the sender upon reception.
      
      The accumulated contents of the write queue are transmitted when one of
      the following events or conditions occurs:
      
      - A CONN_ACK message is received from the peer.
      - A data message is received from the peer.
      - A SOCK_WAKEUP pseudo message is received from the link level.
      - The write queue contains more than 64 1k blocks of data.
      - The connection is being shut down.
      - There is no CONN_ACK message to expect. I.e., there is currently
        no outstanding message where the 'ack_required' bit was set. As a
        consequence, the first message added after we enter nagle mode
        is always sent directly with this bit set.
      
      This new feature gives a 50-100% improvement of throughput for small
      (i.e., less than MTU size) messages, while it might add up to one RTT
      to latency time when the socket is in nagle mode.
      Acked-by: Ying Xue <ying.xue@windreiver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
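The rules for entering and leaving nagle mode can be sketched as a small state machine. This is a userspace illustration only: apart from `maxnagle`, which the message names, the struct, its fields and the functions are hypothetical, and the flush conditions (CONN_ACK, queue size, shutdown) are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Threshold from the message: more than 4 sends without a received message. */
#define MOCK_NAGLE_START 4u

struct mock_tipc_sock {
	uint32_t maxnagle;       /* 0 disables the nagle property */
	uint32_t msgs_since_rcv; /* messages sent since the last data from peer */
	bool nagle_mode;
};

/* Called for every message sent out on the socket. */
static void mock_on_send(struct mock_tipc_sock *tsk)
{
	assert(tsk != NULL);

	if (!tsk->maxnagle)
		return;
	if (++tsk->msgs_since_rcv > MOCK_NAGLE_START)
		tsk->nagle_mode = true;
}

/* Called when a data message arrives from the peer: leave nagle mode. */
static void mock_on_data_rcv(struct mock_tipc_sock *tsk)
{
	assert(tsk != NULL);

	tsk->msgs_since_rcv = 0;
	tsk->nagle_mode = false;
}

/* In nagle mode, only messages smaller than maxnagle are accumulated. */
static bool mock_may_bundle(const struct mock_tipc_sock *tsk, uint32_t msg_size)
{
	assert(tsk != NULL);

	return tsk->nagle_mode && msg_size < tsk->maxnagle;
}
```

A one-directional bulk sender crosses the threshold quickly and starts bundling, while a request/response exchange keeps resetting the counter and never pays the extra latency.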
    • T
      SUNRPC: Destroy the back channel when we destroy the host transport · 669996ad
      Trond Myklebust committed
      When we're destroying the host transport mechanism, we should ensure
      that we do not leak memory by failing to release any back channel
      slots that might still exist.
      Reported-by: Neil Brown <neilb@suse.de>
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  6. 30 Oct, 2019 3 commits
  7. 29 Oct, 2019 5 commits
    • A
      net: dsa: Add support for devlink device parameters · 6b297524
      Andrew Lunn committed
      Add plumbing to allow DSA drivers to register parameters with devlink.
      
      In keeping with the abstraction, the DSA drivers pass the ds structure to
      these helpers, and the DSA core then translates that to the devlink
      structure associated with the device.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      net: fix sk_page_frag() recursion from memory reclaim · 20eb4f29
      Tejun Heo committed
      sk_page_frag() optimizes skb_frag allocations by using per-task
      skb_frag cache when it knows it's the only user.  The condition is
      determined by seeing whether the socket allocation mask allows
      blocking - if the allocation may block, it obviously owns the task's
      context and ergo exclusively owns current->task_frag.
      
      Unfortunately, this misses recursion through the memory reclaim path.
      Please take a look at the following backtrace.
      
       [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
           ...
           tcp_sendmsg+0x27/0x40
           sock_sendmsg+0x30/0x40
           sock_xmit.isra.24+0xa1/0x170 [nbd]
           nbd_send_cmd+0x1d2/0x690 [nbd]
           nbd_queue_rq+0x1b5/0x3b0 [nbd]
           __blk_mq_try_issue_directly+0x108/0x1b0
           blk_mq_request_issue_directly+0xbd/0xe0
           blk_mq_try_issue_list_directly+0x41/0xb0
           blk_mq_sched_insert_requests+0xa2/0xe0
           blk_mq_flush_plug_list+0x205/0x2a0
           blk_flush_plug_list+0xc3/0xf0
       [1] blk_finish_plug+0x21/0x2e
           _xfs_buf_ioapply+0x313/0x460
           __xfs_buf_submit+0x67/0x220
           xfs_buf_read_map+0x113/0x1a0
           xfs_trans_read_buf_map+0xbf/0x330
           xfs_btree_read_buf_block.constprop.42+0x95/0xd0
           xfs_btree_lookup_get_block+0x95/0x170
           xfs_btree_lookup+0xcc/0x470
           xfs_bmap_del_extent_real+0x254/0x9a0
           __xfs_bunmapi+0x45c/0xab0
           xfs_bunmapi+0x15/0x30
           xfs_itruncate_extents_flags+0xca/0x250
           xfs_free_eofblocks+0x181/0x1e0
           xfs_fs_destroy_inode+0xa8/0x1b0
           destroy_inode+0x38/0x70
           dispose_list+0x35/0x50
           prune_icache_sb+0x52/0x70
           super_cache_scan+0x120/0x1a0
           do_shrink_slab+0x120/0x290
           shrink_slab+0x216/0x2b0
           shrink_node+0x1b6/0x4a0
           do_try_to_free_pages+0xc6/0x370
           try_to_free_mem_cgroup_pages+0xe3/0x1e0
           try_charge+0x29e/0x790
           mem_cgroup_charge_skmem+0x6a/0x100
           __sk_mem_raise_allocated+0x18e/0x390
           __sk_mem_schedule+0x2a/0x40
       [0] tcp_sendmsg_locked+0x8eb/0xe10
           tcp_sendmsg+0x27/0x40
           sock_sendmsg+0x30/0x40
           ___sys_sendmsg+0x26d/0x2b0
           __sys_sendmsg+0x57/0xa0
           do_syscall_64+0x42/0x100
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      In [0], tcp_sendmsg_locked() was using current->task_frag when it
      called sk_wmem_schedule().  It had already calculated how many bytes
      fit into current->task_frag.  Due to memory pressure,
      sk_wmem_schedule() called into the memory reclaim path, which called
      into xfs and then the IO issue path.  Because the filesystem in question
      is backed by nbd, control goes back into the tcp layer - back into
      tcp_sendmsg_locked().
      
      nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC), which makes
      sense - it's in the process of freeing memory and wants to be able to,
      e.g., drop clean pages to make forward progress.  However, this
      confused sk_page_frag() called from [2].  Because it only tests
      whether the allocation allows blocking, which it does, it now thinks
      current->task_frag can be used again although it was already being
      used in [0].
      
      After [2] used current->task_frag, the offset would be increased by
      the used amount.  When control returns to [0],
      current->task_frag's offset is increased and the previously calculated
      number of bytes may now overrun the end of the allocated memory, leading
      to silent memory corruption.
      
      Fix it by adding gfpflags_normal_context() which tests sleepable &&
      !reclaim and use it to determine whether to use current->task_frag.
      
      v2: Eric didn't like gfp flags being tested twice.  Introduce a new
          helper gfpflags_normal_context() and combine the two tests.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
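The "sleepable && !reclaim" test described above can be sketched with mock flag bits. The bit values and the `MOCK_` names below are invented for illustration and do not match the kernel's gfp definitions. The sketch also assumes, as the nbd example in the message suggests, that the MEMALLOC bit is what marks an allocation made on behalf of memory reclaim.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock flag bits (illustrative values, not the kernel's). */
#define MOCK_GFP_DIRECT_RECLAIM 0x1u /* allocation may block */
#define MOCK_GFP_MEMALLOC       0x2u /* allocation made while reclaiming */
#define MOCK_GFP_IO             0x4u

/* Simplified compound masks for the three contexts discussed. */
#define MOCK_GFP_KERNEL (MOCK_GFP_DIRECT_RECLAIM | MOCK_GFP_IO)
#define MOCK_GFP_NOIO   (MOCK_GFP_DIRECT_RECLAIM)
#define MOCK_GFP_ATOMIC 0x0u

/* True only when the caller may sleep AND is not in the reclaim path. */
static bool mock_gfpflags_normal_context(unsigned int gfp_flags)
{
	return (gfp_flags & (MOCK_GFP_DIRECT_RECLAIM | MOCK_GFP_MEMALLOC)) ==
	       MOCK_GFP_DIRECT_RECLAIM;
}
```

Combining both conditions into one masked comparison reflects the v2 note: the flags are tested once, and only a result of true permits use of the per-task fragment cache.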
    • G
      net: Fix various misspellings of "connect" · e1b18549
      Geert Uytterhoeven committed
      Fix misspellings of "disconnect", "disconnecting", "connections", and
      "disconnected".
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Acked-by: Kalle Valo <kvalo@codeaurora.org>
      Acked-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • G
      net: Fix misspellings of "configure" and "configuration" · c199ce4f
      Geert Uytterhoeven committed
      Fix various misspellings of "configuration" and "configure".
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Acked-by: Kalle Valo <kvalo@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: add skb_queue_empty_lockless() · d7d16a89
      Eric Dumazet committed
      Some paths call skb_queue_empty() without holding
      the queue lock. We must use a barrier in order
      to not let the compiler do strange things, and avoid
      KCSAN splats.
      
      Adding a barrier in skb_queue_empty() might be overkill.
      I prefer adding a new helper to clearly identify
      points where the callers might be lockless. This might
      help us find real bugs.
      
      The corresponding WRITE_ONCE() should add zero cost
      for current compilers.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
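A lockless emptiness check paired with annotated writes might look roughly like the userspace sketch below. The list type and function names are hypothetical, and the `READ_ONCE()`/`WRITE_ONCE()` macros are simplified stand-ins (volatile casts using the GCC/Clang `__typeof__` extension), not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel macros. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Circular doubly linked list; an empty head points at itself. */
struct mock_list {
	struct mock_list *next;
	struct mock_list *prev;
};

static void mock_queue_init(struct mock_list *head)
{
	assert(head != NULL);
	head->next = head;
	head->prev = head;
}

/* Safe to call without the queue lock: performs one annotated load. */
static int mock_queue_empty_lockless(const struct mock_list *head)
{
	assert(head != NULL);
	return READ_ONCE(head->next) == head;
}

/* Writer side (lock assumed held): publish links with annotated stores. */
static void mock_queue_tail(struct mock_list *head, struct mock_list *node)
{
	struct mock_list *last;

	assert(head != NULL && node != NULL);
	last = head->prev;
	node->next = head;
	node->prev = last;
	WRITE_ONCE(last->next, node);
	WRITE_ONCE(head->prev, node);
}
```

A lockless caller can only treat the result as a hint, since the queue may change right after the load. Keeping the lockless variant as a separate helper makes such call sites easy to find.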
  8. 28 Oct, 2019 2 commits