You need to sign in or sign up before continuing.
  1. 23 3月, 2017 1 次提交
  2. 17 3月, 2017 1 次提交
    • I
      ipv4: fib_rules: Check if rule is a default rule · 3c71006d
      Ido Schimmel 提交于
      Currently, when non-default (custom) FIB rules are used, devices capable
      of layer 3 offloading flush their tables and let the kernel do the
      forwarding instead.
      
      When these devices' drivers are loaded they register to the FIB
      notification chain, which lets them know about the existence of any
      custom FIB rules. This is done by sending a RULE_ADD notification based
      on the value of 'net->ipv4.fib_has_custom_rules'.
      
      This approach is problematic when VRF offload is taken into account, as
      upon the creation of the first VRF netdev, a l3mdev rule is programmed
      to direct skbs to the VRF's table.
      
      Instead of merely reading the above value and sending a single RULE_ADD
      notification, we should iterate over all the FIB rules and send a
      detailed notification for each, thereby allowing offloading drivers to
      sanitize the rules they don't support and potentially flush their
      tables.
      
      While l3mdev rules are uniquely marked, the default rules are not.
      Therefore, when they are being notified they might invoke offloading
      drivers to unnecessarily flush their tables.
      
      Solve this by adding an helper to check if a FIB rule is a default rule.
      Namely, its selector should match all packets and its action should
      point to the local, main or default tables.
      
      As noted by David Ahern, uniquely marking the default rules is
      insufficient. When using VRFs, it's common to avoid false hits by moving
      the rule for the local table to just before the main table:
      
      Default configuration:
      $ ip rule show
      0:      from all lookup local
      32766:  from all lookup main
      32767:  from all lookup default
      
      Common configuration with VRFs:
      $ ip rule show
      1000:   from all lookup [l3mdev-table]
      32765:  from all lookup local
      32766:  from all lookup main
      32767:  from all lookup default
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c71006d
  3. 15 3月, 2017 1 次提交
  4. 14 3月, 2017 2 次提交
  5. 13 3月, 2017 1 次提交
  6. 10 3月, 2017 4 次提交
    • A
      tcp: rename *_sequence_number() to *_seq_and_tsoff() · a30aad50
      Alexey Kodanev 提交于
      The functions that are returning tcp sequence number also setup
      TS offset value, so rename them to better describe their purpose.
      
      No functional changes in this patch.
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a30aad50
    • D
      net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      David Howells 提交于
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      
      The theory lockdep comes up with is as follows:
      
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      
      	mmap_sem must be taken before sk_lock-AF_RXRPC
      
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
      
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      
      	sk_lock-AF_INET must be taken before mmap_sem
      
      However, lockdep's theory is wrong in this instance because it deals only
      with lock classes and not individual locks.  The AF_INET lock in (2) isn't
      really equivalent to the AF_INET lock in (3) as the former deals with a
      socket entirely internal to the kernel that never sees userspace.  This is
      a limitation in the design of lockdep.
      
      Fix the general case by:
      
       (1) Double up all the locking keys used in sockets so that one set are
           used if the socket is created by userspace and the other set is used
           if the socket is created by the kernel.
      
       (2) Store the kern parameter passed to sk_alloc() in a variable in the
           sock struct (sk_kern_sock).  This informs sock_lock_init(),
           sock_init_data() and sk_clone_lock() as to the lock keys to be used.
      
           Note that the child created by sk_clone_lock() inherits the parent's
           kern setting.
      
       (3) Add a 'kern' parameter to ->accept() that is analogous to the one
           passed in to ->create() that distinguishes whether kernel_accept() or
           sys_accept4() was the caller and can be passed to sk_alloc().
      
           Note that a lot of accept functions merely dequeue an already
           allocated socket.  I haven't touched these as the new socket already
           exists before we get the parameter.
      
           Note also that there are a couple of places where I've made the accepted
           socket unconditionally kernel-based:
      
      	irda_accept()
      	rds_rcp_accept_one()
      	tcp_accept_from_sock()
      
           because they follow a sock_create_kern() and accept off of that.
      
      Whilst creating this, I noticed that lustre and ocfs don't create sockets
      through sock_create_kern() and thus they aren't marked as for-kernel,
      though they appear to be internal.  I wonder if these should do that so
      that they use the new set of lock keys.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cdfbabfb
    • J
      ethtool: add CRC32 as an RSS hash function · abb521e3
      Jakub Kicinski 提交于
      CRC32 engines are usually easily available in hardware and generate
      OK spread for RSS hash.  Add CRC32 RSS hash function to ethtool API.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      abb521e3
    • P
      net/socket: use per af lockdep classes for sk queues · 581319c5
      Paolo Abeni 提交于
      Currently the sock queue's spin locks get their lockdep
      classes by the default init_spin_lock() initializer:
      all socket families get - usually, see below - a single
      class for rx, another specific class for tx, etc.
      This can lead to false positive lockdep splat, as
      reported by Andrey.
      Moreover there are two separate initialization points
      for the sock queues, one in sk_clone_lock() and one
      in sock_init_data(), so that e.g. the rx queue lock
      can get one of two possible, different classes, depending
      on the socket being cloned or not.
      This change tries to address the above, setting explicitly
      a per address family lockdep class for each queue's
      spinlock. Also, move the duplicated initialization code to a
      single location.
      
      v1 -> v2:
       - renamed the init helper
      
      rfc -> v1:
       - no changes, tested with several different workload
      Suggested-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      581319c5
  7. 09 3月, 2017 5 次提交
  8. 08 3月, 2017 2 次提交
  9. 03 3月, 2017 1 次提交
  10. 02 3月, 2017 6 次提交
    • I
      sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h> · f719ff9b
      Ingo Molnar 提交于
      But first update the code that uses these facilities with the
      new header.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      f719ff9b
    • I
      sched/headers: Prepare to move signal wakeup & sigpending methods from... · 174cd4b1
      Ingo Molnar 提交于
      sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
      
      Fix up affected files that include this signal functionality via sched.h.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      174cd4b1
    • I
      sched/headers: Prepare for new header dependencies before moving code to <linux/sched/user.h> · 8703e8a4
      Ingo Molnar 提交于
      We are going to split <linux/sched/user.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and a couple of .c files.
      
      Create a trivial placeholder <linux/sched/user.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      Include the new header in the files that are going to need it.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      8703e8a4
    • I
      sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> · 3f07c014
      Ingo Molnar 提交于
      We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and a couple of .c files.
      
      Create a trivial placeholder <linux/sched/signal.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      Include the new header in the files that are going to need it.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      3f07c014
    • E
      net: net_enable_timestamp() can be called from irq contexts · 13baa00a
      Eric Dumazet 提交于
      It is now very clear that silly TCP listeners might play with
      enabling/disabling timestamping while new children are added
      to their accept queue.
      
      Meaning net_enable_timestamp() can be called from BH context
      while current state of the static key is not enabled.
      
      Lets play safe and allow all contexts.
      
      The work queue is scheduled only under the problematic cases,
      which are the static key enable/disable transition, to not slow down
      critical paths.
      
      This extends and improves what we did in commit 5fa8bbda ("net: use
      a work queue to defer net_disable_timestamp() work")
      
      Fixes: b90e5794 ("net: dont call jump_label_dec from irq context")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      13baa00a
    • E
      net: solve a NAPI race · 39e6c820
      Eric Dumazet 提交于
      While playing with mlx4 hardware timestamping of RX packets, I found
      that some packets were received by TCP stack with a ~200 ms delay...
      
      Since the timestamp was provided by the NIC, and my probe was added
      in tcp_v4_rcv() while in BH handler, I was confident it was not
      a sender issue, or a drop in the network.
      
      This would happen with a very low probability, but hurting RPC
      workloads.
      
      A NAPI driver normally arms the IRQ after the napi_complete_done(),
      after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
      it.
      
      Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
      while IRQ are not disabled, we might have later an IRQ firing and
      finding this bit set, right before napi_complete_done() clears it.
      
      This can happen with busy polling users, or if gro_flush_timeout is
      used. But some other uses of napi_schedule() in drivers can cause this
      as well.
      
      thread 1                                 thread 2 (could be on same cpu, or not)
      
      // busy polling or napi_watchdog()
      napi_schedule();
      ...
      napi->poll()
      
      device polling:
      read 2 packets from ring buffer
                                                Additional 3rd packet is
      available.
                                                device hard irq
      
                                                // does nothing because
      NAPI_STATE_SCHED bit is owned by thread 1
                                                napi_schedule();
      
      napi_complete_done(napi, 2);
      rearm_irq();
      
      Note that rearm_irq() will not force the device to send an additional
      IRQ for the packet it already signaled (3rd packet in my example)
      
      This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
      can set if it could not grab NAPI_STATE_SCHED
      
      Then napi_complete_done() properly reschedules the napi to make sure
      we do not miss something.
      
      Since we manipulate multiple bits at once, use cmpxchg() like in
      sk_busy_loop() to provide proper transactions.
      
      In v2, I changed napi_watchdog() to use a relaxed variant of
      napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.
      
      In v3, I added more details in the changelog and clears
      NAPI_STATE_MISSED in busy_poll_stop()
      
      In v4, I added the ideas given by Alexander Duyck in v3 review
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      39e6c820
  11. 24 2月, 2017 1 次提交
  12. 22 2月, 2017 3 次提交
  13. 18 2月, 2017 3 次提交
    • D
      rtnl: don't account unused struct ifla_port_vsi in rtnl_port_size · 025331df
      Daniel Borkmann 提交于
      When allocating rtnl dump messages, struct ifla_port_vsi is never dumped,
      so we can save header plus payload in rtnl_port_size(). Infact, attribute
      IFLA_PORT_VSI_TYPE and struct ifla_port_vsi are not used anywhere in
      the kernel. We only need to keep the nla policy should applications in
      user space be filling this out. Same NLA_BINARY issue exists as was fixed
      in 364d5716 ("rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY")
      and others, but then again IFLA_PORT_VSI_TYPE is not used anywhere, so
      just add a comment that it's unused.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      025331df
    • D
      bpf: make jited programs visible in traces · 74451e66
      Daniel Borkmann 提交于
      Long standing issue with JITed programs is that stack traces from
      function tracing check whether a given address is kernel code
      through {__,}kernel_text_address(), which checks for code in core
      kernel, modules and dynamically allocated ftrace trampolines. But
      what is still missing is BPF JITed programs (interpreted programs
      are not an issue as __bpf_prog_run() will be attributed to them),
      thus when a stack trace is triggered, the code walking the stack
      won't see any of the JITed ones. The same for address correlation
      done from user space via reading /proc/kallsyms. This is read by
      tools like perf, but the latter is also useful for permanent live
      tracing with eBPF itself in combination with stack maps when other
      eBPF types are part of the callchain. See offwaketime example on
      dumping stack from a map.
      
      This work tries to tackle that issue by making the addresses and
      symbols known to the kernel. The lookup from *kernel_text_address()
      is implemented through a latched RB tree that can be read under
      RCU in fast-path that is also shared for symbol/size/offset lookup
      for a specific given address in kallsyms. The slow-path iteration
      through all symbols in the seq file done via RCU list, which holds
      a tiny fraction of all exported ksyms, usually below 0.1 percent.
      Function symbols are exported as bpf_prog_<tag>, in order to aide
      debugging and attribution. This facility is currently enabled for
      root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
      is active in any mode. The rationale behind this is that still a lot
      of systems ship with world read permissions on kallsyms thus addresses
      should not get suddenly exposed for them. If that situation gets
      much better in future, we always have the option to change the
      default on this. Likewise, unprivileged programs are not allowed
      to add entries there either, but that is less of a concern as most
      such programs types relevant in this context are for root-only anyway.
      If enabled, call graphs and stack traces will then show a correct
      attribution; one example is illustrated below, where the trace is
      now visible in tooling such as perf script --kallsyms=/proc/kallsyms
      and friends.
      
      Before:
      
        7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
      
      After:
      
        7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
        [...]
        7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74451e66
    • D
      bpf: mark all registered map/prog types as __ro_after_init · c78f8bdf
      Daniel Borkmann 提交于
      All map types and prog types are registered to the BPF core through
      bpf_register_map_type() and bpf_register_prog_type() during init and
      remain unchanged thereafter. As by design we don't (and never will)
      have any pluggable code that can register to that at any later point
      in time, lets mark all the existing bpf_{map,prog}_type_list objects
      in the tree as __ro_after_init, so they can be moved to read-only
      section from then onwards.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c78f8bdf
  14. 16 2月, 2017 1 次提交
    • M
      net: neigh: Fix netevent NETEVENT_DELAY_PROBE_TIME_UPDATE notification · 7627ae60
      Marcus Huewe 提交于
      When setting a neigh related sysctl parameter, we always send a
      NETEVENT_DELAY_PROBE_TIME_UPDATE netevent. For instance, when
      executing
      
      	sysctl net.ipv6.neigh.wlp3s0.retrans_time_ms=2000
      
      a NETEVENT_DELAY_PROBE_TIME_UPDATE netevent is generated.
      
      This is caused by commit 2a4501ae ("neigh: Send a
      notification when DELAY_PROBE_TIME changes"). According to the
      commit's description, it was intended to generate such an event
      when setting the "delay_first_probe_time" sysctl parameter.
      
      In order to fix this, only generate this event when actually
      setting the "delay_first_probe_time" sysctl parameter. This fix
      should not have any unintended side-effects, because all but one
      registered netevent callbacks check for other netevent event
      types (the registered callbacks were obtained by grepping for
      "register_netevent_notifier"). The only callback that uses the
      NETEVENT_DELAY_PROBE_TIME_UPDATE event is
      mlxsw_sp_router_netevent_event() (in
      drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c): in case
      of this event, it only accesses the DELAY_PROBE_TIME of the
      passed neigh_parms.
      
      Fixes: 2a4501ae ("neigh: Send a notification when DELAY_PROBE_TIME changes")
      Signed-off-by: NMarcus Huewe <suse-tux@gmx.de>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7627ae60
  15. 15 2月, 2017 1 次提交
  16. 14 2月, 2017 1 次提交
  17. 11 2月, 2017 6 次提交