1. 14 September 2018 (1 commit)
    • tun: switch to new type of msg_control · fe8dd45b
      Authored by Jason Wang
      This patch introduces a new tun/tap-specific msg_control:
      
      #define TUN_MSG_UBUF 1
      #define TUN_MSG_PTR  2
      struct tun_msg_ctl {
             int type;
             void *ptr;
      };
      
      This allows us to pass different kinds of msg_control through
      sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF), which is
      used by the existing vhost_net zerocopy code. The second is an XDP
      buffer (TUN_MSG_PTR), which allows vhost_net to pass XDP buffers to
      TUN. This will be used to implement accepting an array of XDP
      buffers from vhost_net in the following patches.
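      
      As a hedged illustration (the helper name and calling context are
      assumptions, not the actual vhost_net code), a sender wraps the
      pointer in a tun_msg_ctl on the stack and passes it as msg_control:
      
      	/* Hypothetical sketch: hand an XDP buffer to TUN via sendmsg().
      	 * tun_sendmsg() can dispatch on ctl->type instead of assuming
      	 * msg_control always carries a ubuf_info pointer. */
      	static int tun_send_xdp(struct socket *sock, struct msghdr *msg,
      				struct xdp_buff *xdp)
      	{
      		struct tun_msg_ctl ctl = {
      			.type = TUN_MSG_PTR,	/* payload is an XDP buff */
      			.ptr  = xdp,
      		};
      
      		msg->msg_control = &ctl;
      		msg->msg_controllen = sizeof(ctl);
      		return sock->ops->sendmsg(sock, msg, 0);
      	}
      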
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 13 September 2018 (7 commits)
  3. 11 September 2018 (3 commits)
  4. 06 September 2018 (10 commits)
    • qed*: Utilize FW 8.37.7.0 · a3f72307
      Authored by Denis Bolotin
      This patch adds a new qed firmware with fixes and support for new features.
      
      Fixes:
      - Fix a rare case of device crash with iWARP, iSCSI or FCoE offload.
      - Fix GRE tunneled traffic when iWARP offload is enabled.
      - Fix RoCE failure in ib_send_bw when using inline data.
      - Fix latency optimization flow for inline WQEs.
      - BigBear 100G fix
      
      RDMA:
      - Reduce task context size.
      - Support for application page sizes above 2GB.
      - Performance improvements.
      
      ETH:
      - Tenant DCB support.
      - Replace RSS indirection table update interface.
      
      Misc:
      - Debug Tools changes.
      Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com>
      Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: add sockopt to ignore outgoing packets · fa788d98
      Authored by Vincent Whitchurch
      Currently, the only way to ignore outgoing packets on a packet socket is
      via the BPF filter.  With MSG_ZEROCOPY, packets that are looped into
      AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
      if the filter run from packet_rcv() would reject them.  So the presence
      of a packet socket on the interface takes away the benefits of
      MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
      packets.  (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
      cloned, but the cost for that is much lower.)
      
      To solve this, add a socket option that allows AF_PACKET sockets to
      ignore outgoing packets. Note that the *BSDs already have something
      similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.
      
      The first intended user is lldpd.
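      
      A hedged userspace sketch of how a consumer such as lldpd might use
      it (assuming the final option name PACKET_IGNORE_OUTGOING and a
      kernel that defines it):
      
      	/* RX-only capture socket that asks the kernel to skip
      	 * looped-back outgoing packets. */
      	#include <sys/socket.h>
      	#include <linux/if_packet.h>
      	#include <linux/if_ether.h>
      	#include <arpa/inet.h>
      
      	int open_rx_only_socket(void)
      	{
      		int one = 1;
      		int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
      
      		if (fd < 0)
      			return -1;
      		/* Older kernels return ENOPROTOOPT; treat as non-fatal. */
      		setsockopt(fd, SOL_PACKET, PACKET_IGNORE_OUTGOING,
      			   &one, sizeof(one));
      		return fd;
      	}
      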
      Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Replace PTP clock lock from RW lock to seq lock · 64109f1d
      Authored by Shay Agroskin
      Change the "priv.clock.lock" lock from an rwlock to a seqlock
      in order to improve packet-rate performance.
      
      Tested on Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz.
      Sent 64b packets between two peers connected by ConnectX-5,
      and measured packet rate for the receiver in three modes:
      	no time-stamping (base rate)
      	time-stamping using rw_lock (old lock) for critical region
      	time-stamping using seq_lock (new lock) for critical region
      Only the receiver time stamped its packets.
      
      The measured packet rate improvements are:
      
      	Single flow (multiple TX rings to single RX ring):
      		without timestamping:	  4.26 (M packets)/sec
      		with rw-lock (old lock):  4.1  (M packets)/sec
      		with seq-lock (new lock): 4.16 (M packets)/sec
      		1.46% improvement
      
      	Multiple flows (multiple TX rings to six RX rings):
      		without timestamping: 	  22   (M packets)/sec
      		with rw-lock (old lock):  11.7 (M packets)/sec
      		with seq-lock (new lock): 21.3 (M packets)/sec
      		82.05% improvement
      
      The packet-rate improvement comes from the seqlock sparing the
      'readers' any atomic operations. Since 'readers' vastly outnumber
      'writers' contending on this lock, almost all atomic operations are
      saved, which results in a dramatic decrease in overall cache misses.
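      
      A minimal sketch of the read side, using the generic seqlock API
      (struct and field names are assumptions, not the exact mlx5 code):
      
      	struct clock_sketch {
      		seqlock_t lock;
      		u64 last_cycles;
      	};
      
      	static u64 read_clock_cycles(struct clock_sketch *clock)
      	{
      		unsigned int seq;
      		u64 cycles;
      
      		do {
      			seq = read_seqbegin(&clock->lock);  /* no atomic op */
      			cycles = clock->last_cycles;
      		} while (read_seqretry(&clock->lock, seq)); /* writer raced? */
      
      		return cycles;
      	}
      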
      Signed-off-by: Shay Agroskin <shayag@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Add flow counters idr · 12d6066c
      Authored by Vlad Buslov
      The previous patch in this series changed the flow counter storage
      structure from an rb_tree to a linked list in order to improve flow
      counter traversal performance. The drawback of that solution is that
      flow counter lookup by id becomes linear in complexity.
      
      Store pointers to flow counters in an idr in order to make lookup
      performance logarithmic again. The idr is a non-intrusive data
      structure and doesn't require extending the flow counter struct with
      new elements. This means the idr can be used for lookup while the
      linked list from the previous patch is used for traversal, and
      struct mlx5_fc stays <= 2 cache lines in size.
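      
      A hedged sketch of the two paths (helper names are assumptions): the
      idr maps counter id to struct mlx5_fc without embedding a node in
      the counter itself, so lookup is fast while the list keeps traversal
      cheap:
      
      	static int fc_insert(struct idr *idr, struct mlx5_fc *counter,
      			     u32 id)
      	{
      		/* reserve exactly 'id' and store the pointer */
      		return idr_alloc_u32(idr, counter, &id, id, GFP_KERNEL);
      	}
      
      	static struct mlx5_fc *fc_lookup(struct idr *idr, u32 id)
      	{
      		return idr_find(idr, id);   /* non-intrusive lookup */
      	}
      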
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Amir Vadai <amir@vadai.me>
      Reviewed-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Store flow counters in a list · 9aff93d7
      Authored by Vlad Buslov
      In order to improve the performance of the flow counter stats query
      loop that traverses all configured flow counters, replace the
      rb_tree with a doubly linked list. This improves traversal
      performance by removing the tree walk (profiling showed that the
      call to rb_next was the top CPU consumer).
      
      However, lookup of a flow counter in the list becomes linear instead
      of logarithmic. This is fixed by the next patch in the series, which
      adds an idr for fast lookup. An idr is used because it is not an
      intrusive data structure and doesn't require adding any new members
      to struct mlx5_fc, which allows its control data part to stay <= 1
      cache line in size.
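      
      A hedged sketch of the traversal this enables (member and helper
      names are assumptions): each step is one pointer dereference rather
      than an rb_next() tree walk:
      
      	static void fc_stats_query_all(struct list_head *counters)
      	{
      		struct mlx5_fc *counter;
      
      		list_for_each_entry(counter, counters, list)
      			fc_update_cache(counter);   /* hypothetical helper */
      	}
      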
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Amir Vadai <amir@vadai.me>
      Reviewed-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Add new list to store deleted flow counters · 6e5e2283
      Authored by Vlad Buslov
      In order to prevent the flow counter stats work function from
      traversing the whole flow counter tree while searching for deleted
      flow counters, add a new list that stores deleted flow counters to
      struct mlx5_fc_stats. The lockless NULL-terminated singly linked
      list data type is used for the following reasons:
       - This use case only needs to add a single element to the list and
       remove/iterate the whole list. A lockless list doesn't require any
       additional synchronization for these operations.
       - The first cache line of the flow counter data structure only has
       space for a single additional pointer, which precludes using a
       doubly linked list.
      
      Remove the flow counter 'deleted' flag, which is no longer needed.
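      
      A hedged sketch of the lockless-list pattern with the kernel llist
      API (member and helper names are assumptions): deleters push without
      a lock, and the stats worker atomically takes the whole batch:
      
      	static LLIST_HEAD(fc_dellist);
      
      	static void fc_delete(struct mlx5_fc *counter)
      	{
      		llist_add(&counter->dellist, &fc_dellist); /* lock-free push */
      	}
      
      	static void fc_stats_work(void)
      	{
      		struct llist_node *batch = llist_del_all(&fc_dellist);
      		struct mlx5_fc *counter, *tmp;
      
      		llist_for_each_entry_safe(counter, tmp, batch, dellist)
      			fc_release(counter);   /* hypothetical helper */
      	}
      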
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Amir Vadai <amir@vadai.me>
      Reviewed-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Change flow counters addlist type to single linked list · 83033688
      Authored by Vlad Buslov
      In order to prevent the flow counter stats work function from
      traversing the whole flow counter tree while searching for deleted
      flow counters, a new list that stores deleted flow counters will be
      added to struct mlx5_fc_stats. However, the flow counter structure
      itself has no space left in its first cache line to store any more
      data. To free the space needed for the additional list node, convert
      the current addlist from a doubly linked list (two pointers per
      node) to an atomic singly linked list (one pointer per node).
      
      The lockless NULL-terminated singly linked list data type doesn't
      require any additional external synchronization for the operations
      used by the flow counters module (add a single new element, remove
      all elements from the list, and traverse them). Remove the
      addlist_lock, which is no longer needed.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Amir Vadai <amir@vadai.me>
      Reviewed-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Use u16 for Work Queue buffer strides offset · a0903622
      Authored by Tariq Toukan
      The minimal stride size is 16. Hence, the number of strides in a
      fragment (of PAGE_SIZE) is <= PAGE_SIZE / 16, which is <= 4K even
      with 64KB pages (65536 / 16 = 4096).
      
      u16 is sufficient to represent this.
      
      Fixes: d7037ad7 ("net/mlx5: Fix QP fragmented buffer allocation")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Use u16 for Work Queue buffer fragment size · 8d71e818
      Authored by Tariq Toukan
      The minimal stride size is 16. Hence, the number of strides in a
      fragment (of PAGE_SIZE) is <= PAGE_SIZE / 16, which is <= 4K even
      with 64KB pages (65536 / 16 = 4096).
      
      u16 is sufficient to represent this.
      
      Fixes: 388ca8be ("IB/mlx5: Implement fragmented completion queue (CQ)")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Fix use-after-free in self-healing flow · 76d5581c
      Authored by Jack Morgenstein
      When the mlx5 health mechanism detects a problem while the driver
      is in the middle of init_one or remove_one, the driver needs to
      prevent the health mechanism from scheduling future work; if future
      work is scheduled, there is a use-after-free problem: the system WQ
      tries to run the work item (which has already been freed) at the
      scheduled future time.
      
      Prevent this by disabling work item scheduling in the health mechanism
      when the driver is in the middle of init_one() or remove_one().
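      
      A hedged sketch of the guard pattern (struct, flag, and lock names
      are assumptions, not the exact mlx5 code): scheduling and disabling
      serialize on one spinlock, so once the flag is set no new work item
      can be queued:
      
      	#define HEALTH_DROP_NEW_WORK	0	/* bit in ->flags */
      
      	struct health_sketch {
      		spinlock_t wq_lock;
      		unsigned long flags;
      		struct work_struct work;
      	};
      
      	static void health_queue_work(struct health_sketch *health)
      	{
      		spin_lock_irq(&health->wq_lock);
      		if (!test_bit(HEALTH_DROP_NEW_WORK, &health->flags))
      			schedule_work(&health->work);
      		spin_unlock_irq(&health->wq_lock);
      	}
      
      	static void health_disable_work(struct health_sketch *health)
      	{
      		spin_lock_irq(&health->wq_lock);
      		set_bit(HEALTH_DROP_NEW_WORK, &health->flags);
      		spin_unlock_irq(&health->wq_lock);
      		cancel_work_sync(&health->work); /* flush a running item */
      	}
      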
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Reviewed-by: Feras Daoud <ferasda@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
  5. 05 September 2018 (5 commits)
  6. 04 September 2018 (1 commit)
  7. 03 September 2018 (1 commit)
  8. 01 September 2018 (2 commits)
    • blkcg: delay blkg destruction until after writeback has finished · 59b57717
      Authored by Dennis Zhou (Facebook)
      Currently, blkcg destruction relies on a sequence of events:
        1. Destruction starts. blkcg_css_offline() is called and blkgs
           release their reference to the blkcg. This immediately destroys
           the cgwbs (writeback).
        2. With blkgs giving up their reference, the blkcg ref count should
           reach zero, eventually triggering blkcg_css_free(), which
           finally frees the blkcg.
      
      Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
      and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
      on the completion of all writeback associated with the blkcg. A count of
      the number of cgwbs is maintained and once that goes to zero, blkg
      destruction can follow. This should prevent premature blkg destruction
      related to writeback.
      
      The new process for blkcg cleanup is as follows:
        1. Destruction starts. blkcg_css_offline() is called which offlines
           writeback. Blkg destruction is delayed on the cgwb_refcnt count to
           avoid punting potentially large amounts of outstanding writeback
           to root while maintaining any ongoing policies. Here, the base
           cgwb_refcnt is put back.
        2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
           and handles destruction of blkgs (see the sketch after this
           list). This is where the css reference held by each blkg is
           released.
        3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
           This finally frees the blkcg.
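      
      A hedged sketch of the refcount handoff (close to, but not claimed
      to be verbatim, the final helpers): each cgwb pins the blkcg, and
      blkg destruction only begins once the base reference put by
      blkcg_css_offline() and all cgwb references have dropped:
      
      	static inline void blkcg_cgwb_get(struct blkcg *blkcg)
      	{
      		refcount_inc(&blkcg->cgwb_refcnt);
      	}
      
      	static inline void blkcg_cgwb_put(struct blkcg *blkcg)
      	{
      		if (refcount_dec_and_test(&blkcg->cgwb_refcnt))
      			blkcg_destroy_blkgs(blkcg);   /* step 2 above */
      	}
      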
      
      In the past, blk-throttle did some hard-to-follow things when taking
      data from a blkg while associating with current; the simplification
      and unification of what blk-throttle does surfaced this race.
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()" · 6b065462
      Authored by Dennis Zhou (Facebook)
      This reverts commit 4c699480.
      
      Destroying blkgs is tricky because of the nature of the relationship.
      A blkg should go away when either a blkcg or a request_queue goes
      away. However, blkgs pin the blkcg to ensure they remain valid. To
      break this cycle, when a blkcg is offlined, blkgs put back their css
      ref. This eventually lets css_free() get called, which frees the
      blkcg.
      
      The above commit (4c699480) breaks this order of events by trying to
      destroy blkgs in css_free(). As the blkgs still hold references to the
      blkcg, css_free() is never called.
      
      The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
      addressed in the following patch by delaying destruction of a blkg until
      all writeback associated with the blkcg has been finished.
      
      Fixes: 4c699480 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 31 August 2018 (4 commits)
  10. 30 August 2018 (4 commits)
    • arm/arm64: smccc-1.1: Handle function result as parameters · 755a8bf5
      Authored by Marc Zyngier
      If someone has the silly idea to write something along these lines:
      
      	extern u64 foo(void);
      
      	void bar(struct arm_smccc_res *res)
      	{
      		arm_smccc_1_1_smc(0xbad, foo(), res);
      	}
      
      they are in for a surprise, as this gets compiled as:
      
      	0000000000000588 <bar>:
      	 588:   a9be7bfd        stp     x29, x30, [sp, #-32]!
      	 58c:   910003fd        mov     x29, sp
      	 590:   f9000bf3        str     x19, [sp, #16]
      	 594:   aa0003f3        mov     x19, x0
      	 598:   aa1e03e0        mov     x0, x30
      	 59c:   94000000        bl      0 <_mcount>
      	 5a0:   94000000        bl      0 <foo>
      	 5a4:   aa0003e1        mov     x1, x0
      	 5a8:   d4000003        smc     #0x0
      	 5ac:   b4000073        cbz     x19, 5b8 <bar+0x30>
      	 5b0:   a9000660        stp     x0, x1, [x19]
      	 5b4:   a9010e62        stp     x2, x3, [x19, #16]
      	 5b8:   f9400bf3        ldr     x19, [sp, #16]
      	 5bc:   a8c27bfd        ldp     x29, x30, [sp], #32
      	 5c0:   d65f03c0        ret
      	 5c4:   d503201f        nop
      
      The call to foo "overwrites" the x0 register (which should hold the
      SMC function ID) with its return value, and we end up calling the
      wrong secure service.
      
      A solution is to evaluate all the parameters before assigning
      anything to specific registers, leading to the expected result:
      
      	0000000000000588 <bar>:
      	 588:   a9be7bfd        stp     x29, x30, [sp, #-32]!
      	 58c:   910003fd        mov     x29, sp
      	 590:   f9000bf3        str     x19, [sp, #16]
      	 594:   aa0003f3        mov     x19, x0
      	 598:   aa1e03e0        mov     x0, x30
      	 59c:   94000000        bl      0 <_mcount>
      	 5a0:   94000000        bl      0 <foo>
      	 5a4:   aa0003e1        mov     x1, x0
      	 5a8:   d28175a0        mov     x0, #0xbad
      	 5ac:   d4000003        smc     #0x0
      	 5b0:   b4000073        cbz     x19, 5bc <bar+0x34>
      	 5b4:   a9000660        stp     x0, x1, [x19]
      	 5b8:   a9010e62        stp     x2, x3, [x19, #16]
      	 5bc:   f9400bf3        ldr     x19, [sp, #16]
      	 5c0:   a8c27bfd        ldp     x29, x30, [sp], #32
      	 5c4:   d65f03c0        ret
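      
      A hedged sketch of the fix, simplified from the arm_smccc_1_1
      argument macros: evaluate every caller expression into a temporary
      before binding any register variable, so a function call in one
      argument can no longer clobber a register loaded for another:
      
      	#define sketch_declare_arg_1(a0, a1, res)			\
      		typeof(a0) __a0 = (a0);	/* foo() runs here, not after x0 */ \
      		typeof(a1) __a1 = (a1);					\
      		struct arm_smccc_res *___res = (res);			\
      		register unsigned long r0 asm("r0") = (unsigned long)__a0; \
      		register unsigned long r1 asm("r1") = (unsigned long)__a1; \
      		register unsigned long r2 asm("r2");			\
      		register unsigned long r3 asm("r3")
      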
      Reported-by: Julien Grall <julien.grall@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • ethtool: drop get_settings and set_settings callbacks · 9b300495
      Authored by Michal Kubecek
      The [gs]et_settings ethtool_ops callbacks have been deprecated since
      February 2016; all in-tree NIC drivers have been converted to provide
      [gs]et_link_ksettings(), and out-of-tree drivers have had enough time
      to do the same.
      
      Drop get_settings() and set_settings() and implement both ETHTOOL_[GS]SET
      and ETHTOOL_[GS]LINKSETTINGS only using [gs]et_link_ksettings().
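      
      For a driver, the surviving surface is just the ksettings pair; a
      hedged sketch (the callback names are placeholders):
      
      	static int sketch_get_link_ksettings(struct net_device *dev,
      				struct ethtool_link_ksettings *ks);
      	static int sketch_set_link_ksettings(struct net_device *dev,
      				const struct ethtool_link_ksettings *ks);
      
      	static const struct ethtool_ops sketch_ethtool_ops = {
      		/* the ethtool core now translates legacy
      		 * ETHTOOL_GSET/SSET ioctls onto these callbacks */
      		.get_link_ksettings = sketch_get_link_ksettings,
      		.set_link_ksettings = sketch_set_link_ksettings,
      	};
      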
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf/verifier: per-register parent pointers · 679c782d
      Authored by Edward Cree
      By giving each register its own liveness chain, we elide the
      skip_callee() logic. Instead, each register's parent is the state it
      inherits from; both check_func_call() and prepare_func_exit()
      automatically connect reg states to the correct chain, since when
      they copy the reg state across (r1-r5 into the callee as args, and
      r0 out as the return value) they also copy the parent pointer.
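      
      A hedged sketch with simplified types (not the verifier's exact
      structures): a read marks the whole parent chain live, with no
      skip_callee() special-casing:
      
      	struct reg_state_sketch {
      		u8 live;			   /* REG_LIVE_* flags */
      		struct reg_state_sketch *parent;   /* inherited-from state */
      	};
      
      	static void mark_reg_read_sketch(struct reg_state_sketch *reg)
      	{
      		struct reg_state_sketch *parent;
      
      		for (parent = reg->parent; parent; parent = parent->parent)
      			parent->live |= 1;   /* stand-in for REG_LIVE_READ */
      	}
      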
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • net: add napi_if_scheduled_mark_missed · 6c5c9581
      Authored by Magnus Karlsson
      The function napi_if_scheduled_mark_missed() checks whether the NAPI
      context is scheduled; if so, it sets NAPIF_STATE_MISSED and returns
      true. It is used by the AF_XDP zero-copy i40e Tx code implementation
      in order to make sure that IRQ affinity is honored by the NAPI
      context.
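      
      A hedged usage sketch (the wakeup helper is a placeholder shaped
      after the i40e user): only schedule NAPI on this CPU when it is not
      already scheduled; otherwise the MISSED flag makes the owning CPU
      re-run the poll, preserving IRQ affinity:
      
      	static void xsk_tx_wakeup_sketch(struct napi_struct *napi)
      	{
      		if (!napi_if_scheduled_mark_missed(napi))
      			napi_schedule(napi);   /* idle: run on this CPU */
      	}
      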
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  11. 29 August 2018 (2 commits)