1. 24 October 2017, 1 commit
    • tcp: Configure TFO without cookie per socket and/or per route · 71c02379
      Christoph Paasch authored
      We already allow enabling TFO without a cookie by setting the
      fastopen sysctl to TFO_SERVER_COOKIE_NOT_REQD (or
      TFO_CLIENT_NO_COOKIE).
      This is safe to do in certain environments where we know that there
      is no malicious host (e.g., data centers), or when the application
      protocol already provides an authentication mechanism in the first
      flight of data.
      
      A server, however, might provide multiple services or talk to both
      sides (the public Internet and the data center). Such a server would
      want to enable cookie-less TFO only for certain services and/or for
      connections that go to the data center.
      
      This patch exposes a socket option and a per-route attribute to
      enable such fine-grained configuration (a usage sketch follows this
      entry).
      Signed-off-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
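      A minimal sketch of the per-socket side, assuming the
      TCP_FASTOPEN_NO_COOKIE option name introduced by this patch (the
      fallback #define mirrors its uapi value; the per-route knob is the
      matching fastopen_no_cookie route attribute):

        /* Hypothetical usage sketch, not part of the patch itself. */
        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <stdio.h>
        #include <sys/socket.h>

        #ifndef TCP_FASTOPEN_NO_COOKIE
        #define TCP_FASTOPEN_NO_COOKIE 34
        #endif

        int main(void)
        {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            int one = 1;

            /* Opt this socket into cookie-less TFO. */
            if (setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_NO_COOKIE,
                           &one, sizeof(one)) < 0)
                perror("setsockopt(TCP_FASTOPEN_NO_COOKIE)");
            return 0;
        }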
  2. 22 October 2017, 5 commits
  3. 21 October 2017, 1 commit
  4. 20 October 2017, 9 commits
  5. 19 October 2017, 1 commit
  6. 18 October 2017, 8 commits
    • bpf: move knowledge about post-translation offsets out of verifier · 4f9218aa
      Jakub Kicinski authored
      Use the fact that verifier ops are now separate from program
      ops to define a separate set of callbacks for verification of
      already translated programs.
      
      Since we expect the analyzer ops to be defined only for a small
      subset of all program types, their array is initialized by hand
      rather than via linux/bpf_types.h (see the sketch after this entry).
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
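      A sketch of the hand-initialized table this describes; the two entry
      names are illustrative of the small subset covered, not a definitive
      listing:

        static const struct bpf_verifier_ops * const bpf_analyzer_ops[] = {
            [BPF_PROG_TYPE_XDP]       = &xdp_analyzer_ops,
            [BPF_PROG_TYPE_SCHED_CLS] = &tc_cls_act_analyzer_ops,
        };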
    • bpf: remove the verifier ops from program structure · 00176a34
      Jakub Kicinski authored
      Since the verifier ops don't have to be associated with the program
      for its entire lifetime, we can move them into the verifier's
      struct bpf_verifier_env (see the sketch after this entry).
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
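      A sketch of the resulting shape (field placement illustrative): the
      ops are looked up once when verification starts and live only in the
      verifier's environment:

        struct bpf_verifier_env {
            struct bpf_prog *prog;              /* program being verified */
            const struct bpf_verifier_ops *ops; /* moved out of bpf_prog */
            /* ... */
        };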
    • bpf: split verifier and program ops · 7de16e3a
      Jakub Kicinski authored
      struct bpf_verifier_ops contains both verifier ops and operations
      used later during the program's lifetime (test_run).  Split the
      runtime ops into a different structure.
      
      BPF_PROG_TYPE() will now append ## _prog_ops or ## _verifier_ops
      to the names (see the sketch after this entry).
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
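      A sketch of the split, following the naming rule stated above
      (member lists abbreviated):

        /* runtime ops: stay associated with the program */
        struct bpf_prog_ops {
            int (*test_run)(struct bpf_prog *prog,
                            const union bpf_attr *kattr,
                            union bpf_attr __user *uattr);
        };

        /* BPF_PROG_TYPE() now declares both tables per program type */
        #define BPF_PROG_TYPE(_id, _name) \
            extern const struct bpf_prog_ops _name ## _prog_ops; \
            extern const struct bpf_verifier_ops _name ## _verifier_ops;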
    • bpf: cpumap xdp_buff to skb conversion and allocation · 1c601d82
      Jesper Dangaard Brouer authored
      This patch makes cpumap functional by adding SKB allocation and
      invoking the network stack on the dequeuing CPU.
      
      To construct the SKB on the remote CPU, the xdp_buff is converted
      into a struct xdp_pkt, which is mapped into the top of the packet's
      headroom to avoid allocating separate memory (see the sketch after
      this entry).  For now, struct xdp_pkt is just a cpumap-internal data
      structure, with info carried from enqueue to dequeue.
      
      If a driver doesn't provide enough headroom, the frame is simply
      dropped with return code -EOVERFLOW.  This is picked up by the xdp
      tracepoint infrastructure, allowing users to catch it.
      
      V2: take into account xdp->data_meta
      
      V4:
       - Drop busypoll tricks, keeping it simpler.
       - Skip RPS and Generic-XDP recursive reinjection, as suggested by Alexei
      
      V5: correct RCU read protection around __netif_receive_skb_core.
      
      V6: Set TASK_RUNNING vs TASK_INTERRUPTIBLE based on a talk with Rik van Riel
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
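      A simplified sketch of the conversion: the cpumap-internal metadata
      is written into the frame's own headroom, so enqueue needs no
      separate allocation (field layout illustrative):

        struct xdp_pkt {                /* cpumap-internal only */
            void *data;
            u16 len;
            u16 headroom;
            struct net_device *dev_rx;
        };

        static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
        {
            struct xdp_pkt *pkt = xdp->data_hard_start; /* top of headroom */

            /* too little driver headroom: caller drops with -EOVERFLOW */
            if ((unsigned long)(xdp->data - xdp->data_hard_start) < sizeof(*pkt))
                return NULL;

            pkt->data = xdp->data;
            pkt->len  = xdp->data_end - xdp->data;
            return pkt;
        }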
    • bpf: XDP_REDIRECT enable use of cpumap · 9c270af3
      Jesper Dangaard Brouer authored
      This patch connects cpumap to the xdp_do_redirect_map infrastructure.
      
      Still, no SKB allocation is done yet.  The XDP frames are transferred
      to the other CPU, but they are simply refcnt-decremented on the
      remote CPU.  This served as a good benchmark for measuring the
      overhead of the remote refcnt decrement.  If the driver's page
      recycle cache is not efficient, this exposes a bottleneck in the
      page allocator.  (A BPF-side sketch follows this entry.)
      
      A shout-out to MST's ptr_ring, which is the secret behind
      transferring memory pointers between CPUs so efficiently, without
      constantly bouncing cache lines between CPUs.
      
      V3: Handle !CONFIG_BPF_SYSCALL pointed out by kbuild test robot.
      
      V4: Make Generic-XDP aware of the cpumap type, but don't allow redirect yet,
       as the implementation requires a separate upstream discussion.
      
      V5:
       - Fix a maybe-uninitialized warning pointed out by the kbuild test robot
       - Restrict bpf-prog side access to cpumap; open up when use-cases appear
       - Implement cpu_map_enqueue() as a simpler void-pointer enqueue
      
      V6:
       - Allow cpumap type for usage in helper bpf_redirect_map,
         general bpf-prog side restriction moved to earlier patch.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
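      A sketch of the BPF-program side in the samples/bpf style of the
      time; the map name, queue size, and CPU choice are illustrative:

        struct bpf_map_def SEC("maps") cpu_map = {
            .type        = BPF_MAP_TYPE_CPUMAP,
            .key_size    = sizeof(__u32),
            .value_size  = sizeof(__u32),   /* queue size per CPU */
            .max_entries = 64,
        };

        SEC("xdp")
        int xdp_redirect_cpu(struct xdp_md *ctx)
        {
            __u32 cpu = 2;  /* e.g., derived from RX queue or a flow hash */

            /* same helper as devmap redirect, now accepting a cpumap */
            return bpf_redirect_map(&cpu_map, cpu, 0);
        }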
    • bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP · 6710e112
      Jesper Dangaard Brouer authored
      The 'cpumap' is primarily used as a backend map for XDP BPF helper
      call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
      
      This patch implements the main part of the map.  It is not connected
      to the XDP redirect system yet, and no SKB allocation is done yet.
      (A user-space creation sketch follows this entry.)
      
      The main concern in this patch is to ensure the datapath can run
      without any locking.  This adds complexity to the setup and tear-down
      procedure, whose assumptions are extra carefully documented in the
      code comments.
      
      V2:
       - make sure the array isn't larger than NR_CPUS
       - make sure each CPU added is a valid possible CPU
      
      V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl>
      
      V5:
       - Restrict map allocation to root / CAP_SYS_ADMIN
       - WARN_ON_ONCE if queue is not empty on tear-down
       - Return -EPERM on memlock limit instead of -ENOMEM
       - Error code in __cpu_map_entry_alloc() also handles ptr_ring_cleanup()
       - Moved cpu_map_enqueue() to next patch
      
      V6: all noticed by Daniel Borkmann
       - Fix err return code in cpu_map_alloc() introduced in V5
       - Move cpu_possible() check after max_entries boundary check
       - Forbid usage initially in check_map_func_compatibility()
      
      V7:
       - Fix alloc error path spotted by Daniel Borkmann
       - Did stress test adding+removing CPUs from the map concurrently
       - Fixed refcnt issue on cpu_map_entry; kthread started too soon
       - Make sure packets are flushed during tear-down; this involved use of
         rcu_barrier() and having the kthread exit only after the queue is empty
       - Fix alloc error path in __cpu_map_entry_alloc() for ptr_ring
      
      V8:
       - Nitpicking comments and grammar fixes by Edward Cree
       - Fix missing semi-colon introduced in V7 due to rebasing
       - Move struct bpf_cpu_map_entry members cpu+map_id to tracepoint patch
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
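      A user-space sketch of creating such a map under the constraints
      above (key = CPU index, value = per-CPU queue size, max_entries
      bounded by NR_CPUS); a minimal raw-syscall version:

        #include <linux/bpf.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static int create_cpumap(void)
        {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.map_type    = BPF_MAP_TYPE_CPUMAP;
            attr.key_size    = sizeof(__u32);   /* CPU index */
            attr.value_size  = sizeof(__u32);   /* qsize for kthread ring */
            attr.max_entries = 64;              /* <= NR_CPUS */

            /* root/CAP_SYS_ADMIN only; -EPERM on memlock limit (V5) */
            return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
        }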
    • KEYS: Fix race between updating and finding a negative key · 363b02da
      David Howells authored
      Consolidate KEY_FLAG_INSTANTIATED, KEY_FLAG_NEGATIVE and the rejection
      error into one field such that:
      
       (1) The instantiation state can be modified/read atomically.
      
       (2) The error can be accessed atomically with the state.
      
       (3) The error isn't stored unioned with the payload pointers.
      
      This deals with the problem that the state is spread over three different
      objects (two bits and a separate variable) and reading or updating them
      atomically isn't practical, given that not only can uninstantiated keys
      change into instantiated or rejected keys, but rejected keys can also turn
      into instantiated keys - and someone accessing the key might not be using
      any locking.
      
      The main side effect of this problem is that what was held in the payload
      may change, depending on the state.  For instance, you might observe the
      key to be in the rejected state.  You then read the cached error, but if
      the key semaphore wasn't locked, the key might've become instantiated
      between the two reads - and you might now have something in hand that isn't
      actually an error code.
      
      The state is now KEY_IS_UNINSTANTIATED, KEY_IS_POSITIVE or a negative error
      code if the key is negatively instantiated.  The key_is_instantiated()
      function is replaced with key_is_positive() to avoid confusion as negative
      keys are also 'instantiated'.
      
      Additionally, barriering is included:
      
       (1) Order payload-set before state-set during instantiation.
      
       (2) Order state-read before payload-read when using the key.
      
      Further separate barriering is necessary if RCU is being used to
      access the payload content after reading the payload pointers (a
      barrier sketch follows this entry).
      
      Fixes: 146aa8b1 ("KEYS: Merge the type-specific data with the payload data")
      Cc: stable@vger.kernel.org # v4.4+
      Reported-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Eric Biggers <ebiggers@google.com>
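      A sketch of the paired barriers, using the accessor names described
      by the patch (details illustrative):

        static inline void mark_key_instantiated(struct key *key, int reject_error)
        {
            /* (1) publish the payload before the state that validates it */
            smp_store_release(&key->state,
                              (reject_error < 0) ? reject_error
                                                 : KEY_IS_POSITIVE);
        }

        static inline int key_read_state(const struct key *key)
        {
            /* (2) read the state before the payload it guards */
            return smp_load_acquire(&key->state);
        }

        static inline bool key_is_positive(const struct key *key)
        {
            return key_read_state(key) == KEY_IS_POSITIVE;
        }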
    • ethtool: add ethtool_intersect_link_masks · 5a6cd6de
      Alan Brady authored
      This function provides a way to intersect two link masks to find the
      common ground between them.  For example, in i40e the driver first
      generates link masks for what is supported by the PHY type.  The
      driver then gets the link masks for what the NVM supports.  The
      resulting intersection yields what can truly be supported (a usage
      sketch follows this entry).
      Signed-off-by: Alan Brady <alan.brady@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
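      A sketch of the intended use, following the i40e example above
      (variable names illustrative):

        struct ethtool_link_ksettings phy_caps, nvm_caps;

        /* driver fills phy_caps from the PHY type, nvm_caps from the NVM */

        /* bitwise AND of the link-mode masks: what both sides support */
        ethtool_intersect_link_masks(&phy_caps, &nvm_caps);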
  7. 17 October 2017, 1 commit
    • tun: call dev_get_valid_name() before register_netdevice() · 0ad646c8
      Cong Wang authored
      register_netdevice() could fail early when we have an invalid
      dev name, in which case ->ndo_uninit() is not called.  For the tun
      device this is a problem, because a timer etc. are already
      initialized and it expects ->ndo_uninit() to clean them up.
      
      We could move these initializations into ->ndo_init() so that
      register_netdevice() knows better; however, this is still complicated
      due to the logic in tun_detach().
      
      Therefore, I chose to just call dev_get_valid_name() before
      register_netdevice(), which is quicker and much easier to audit (a
      sketch follows this entry).  And for this specific case, it is
      already enough.
      
      Fixes: 96442e42 ("tuntap: choose the txq based on rxq")
      Reported-by: Dmitry Alexeev <avekceeb@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
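      A sketch of the fix's shape in tun_set_iff(): validate and reserve
      the name up front, so a bad name fails before any state that would
      need ->ndo_uninit() exists (error label illustrative):

        err = dev_get_valid_name(net, dev, name);
        if (err < 0)
            goto err_free_dev;  /* nothing to unwind yet: no timer etc. */

        /* ... tun_flow_init(), register_netdevice(), ... */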
  8. 15 October 2017, 1 commit
  9. 14 October 2017, 5 commits
  10. 13 October 2017, 1 commit
  11. 12 October 2017, 1 commit
    • bus: mbus: fix window size calculation for 4GB windows · 2bbbd963
      Jan Luebbe authored
      At least the Armada XP SoC supports 4GB on a single DRAM window. Because
      the size register values contain the actual size - 1, the MSB is set in
      that case. For example, the SDRAM window's control register's value is
      0xffffffe1 for 4GB (bits 31 to 24 contain the size).
      
      The MBUS driver reads back each window's size from registers and
      calculates the actual size as (control_reg | ~DDR_SIZE_MASK) + 1,
      which overflows for 32-bit values, resulting in other miscalculations
      further on (a bad RAM window for the CESA crypto engine calculated by
      mvebu_mbus_setup_cpu_target_nooverlap() in my case).
      
      This patch changes the type in 'struct mbus_dram_window' from u32 to
      u64, which allows us to keep using the same register-calculation code
      in most MBUS-using drivers (which calculate ->size - 1 again).  A
      demonstration of the overflow follows this entry.
      
      Fixes: fddddb52 ("bus: introduce an Marvell EBU MBus driver")
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Luebbe <jlu@pengutronix.de>
      Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
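      A small demonstration of the arithmetic, assuming DDR_SIZE_MASK
      covers bits 31..24 as described:

        #include <stdint.h>
        #include <stdio.h>

        #define DDR_SIZE_MASK 0xff000000u

        int main(void)
        {
            uint32_t ctrl   = 0xffffffe1;            /* 4GB SDRAM window */
            uint32_t masked = ctrl | ~DDR_SIZE_MASK; /* 0xffffffff = size-1 */
            uint32_t bad    = masked + 1;            /* wraps to 0 */
            uint64_t good   = (uint64_t)masked + 1;  /* 0x100000000 = 4GB */

            printf("u32: %u, u64: %llu\n", bad, (unsigned long long)good);
            return 0;
        }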
  12. 11 October 2017, 3 commits
  13. 10 October 2017, 3 commits
    • sched/core: Fix wake_affine() performance regression · d153b153
      Peter Zijlstra authored
      Eric reported a sysbench regression against commit:
      
        3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      
      Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
      against his v3.10 enterprise kernel.
      
      PRE (current tip/master):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
         5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
        10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
        20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
        40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
        80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
      
       hsw-ex NAS:
      
       OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
       OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
       OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
       lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
       lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
       lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
      
      POST (+patch):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
         5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
        10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
        20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
        40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
        80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
      
       hsw-ex NAS:
      
       lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
       lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
       lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
      
      This patch takes out all the shiny wake_affine() stuff and goes back
      to utter basics. Between the two CPUs involved with the wakeup (the
      CPU doing the wakeup and the CPU we ran on previously), pick the CPU
      we can run on _now_ (a simplified sketch follows this entry).
      
      This recovers much of the regression against the older kernels, but
      still leaves some ground in the overloaded case. The default-enabled
      WA_WEIGHT (which will be introduced in the next patch) is an attempt
      to address the overloaded situation.
      Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jinpuwang@gmail.com
      Cc: vcaputo@pengaru.com
      Fixes: 3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
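      A heavily simplified sketch of that heuristic; helper names follow
      the scheduler, but the real patch carries more detail:

        static int wake_affine_pick(int this_cpu, int prev_cpu, int sync)
        {
            /* the waking CPU is idle: we can run there right now */
            if (idle_cpu(this_cpu))
                return this_cpu;

            /* sync wakeup and the waker is the only runnable task:
             * it is about to sleep, so its CPU is effectively free */
            if (sync && cpu_rq(this_cpu)->nr_running == 1)
                return this_cpu;

            /* otherwise stay on the cache-warm previous CPU */
            return prev_cpu;
        }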
    • once: switch to new jump label API · cf4c950b
      Eric Biggers authored
      Switch the DO_ONCE() macro from the deprecated jump label API to the
      new one.  The new one is more readable, and for DO_ONCE() it also
      makes the generated code more icache-friendly: the one-time
      initialization code is now placed out of line at the jump target,
      rather than on the inline fallthrough path (see the sketch after
      this entry).
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
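      A sketch of the switch in DO_ONCE()'s terms (simplified; the real
      macro also tracks a done flag and disables the key afterwards;
      do_once_init() is a placeholder for the one-time body):

        /* old, deprecated API: init code sits on the inline fallthrough */
        static struct static_key ___once_key = STATIC_KEY_INIT_TRUE;

        if (static_key_true(&___once_key))
            do_once_init();

        /* new API: init code is emitted out of line at the jump target */
        static DEFINE_STATIC_KEY_TRUE(___once_key);

        if (static_branch_unlikely(&___once_key))
            do_once_init();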
    • qed: Add LL2 slowpath handling · 6f34a284
      Michal Kalderon authored
      For the iWARP unaligned MPA flow, a slowpath event for flushing an
      MPA connection that entered an unaligned state is required.
      The flush ramrod is received on the ll2 queue, and a pre-registered
      callback function is called to handle the flush event (a sketch
      follows this entry).
      Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
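      An illustrative sketch of pre-registering such a callback; the
      qed_ll2_cbs member name and the callback signature here are
      assumptions modeled on the driver's existing LL2 callback set:

        static void iwarp_ll2_slowpath(void *cxt, u8 connection_handle,
                                       u32 opaque_data_0, u32 opaque_data_1)
        {
            /* flush the MPA connection that entered an unaligned state */
        }

        struct qed_ll2_cbs cbs = {
            /* .rx_comp_cb, .tx_comp_cb, ... */
            .slowpath_cb = iwarp_ll2_slowpath,
        };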