1. 01 Oct 2017, 1 commit
  2. 30 Sep 2017, 1 commit
    • i40e: Enable VF to negotiate number of allocated queues · a3f5aa90
      Authored by Alan Brady
      Currently the PF allocates a default number of queues for each VF, and
      this number cannot be changed.  This patch enables the VF to request a
      different number of queues, and adds a new virtchnl op and capability
      flag to facilitate this negotiation.
      
      After the PF receives a request message, it will set a requested number
      of queues for that VF.  Then when the VF resets, its VSI will get a new
      number of queues allocated to it.
      
      This is a best-effort request and, since we only allocate a guaranteed
      default number, if the VF asks for more than the guaranteed number,
      there may not be enough in HW to accommodate it unless queues for
      other VFs are freed. It should also be noted that decreasing the
      number of queues allocated to a VF below the default will NOT enable
      the allocation of more than 32 VFs per PF and will not free queues
      guaranteed to each VF by default.
      Signed-off-by: Alan Brady <alan.brady@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
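      
      The VF side of this negotiation can be sketched roughly as below. The
      op and struct names follow the virtchnl interface this patch extends,
      but the send helper and the surrounding driver glue are assumptions
      for illustration, not code from the patch:
      
      /* hypothetical VF-side request; i40evf_send_pf_msg() stands in for
       * whatever the driver uses to deliver virtchnl messages to the PF */
      struct virtchnl_vf_res_request {
              u16 num_queue_pairs;    /* number of queues the VF asks for */
      };
      
      static int vf_request_queues(struct i40evf_adapter *adapter, u16 num)
      {
              struct virtchnl_vf_res_request vfres = {
                      .num_queue_pairs = num,
              };
      
              /* best effort: the PF records the request, and the new count
               * takes effect when the VF resets and its VSI is rebuilt */
              return i40evf_send_pf_msg(adapter, VIRTCHNL_OP_REQUEST_QUEUES,
                                        (u8 *)&vfres, sizeof(vfres));
      }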
  3. 29 Sep 2017, 3 commits
  4. 28 Sep 2017, 4 commits
  5. 27 Sep 2017, 3 commits
    • bpf: add meta pointer for direct access · de8f3a83
      Authored by Daniel Borkmann
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out such that we first point
      to data_hard_start, then data_meta directly prepended to data followed
      by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
      account whether we have meta data already prepended and if so, memmove()s
      this along with the given offset provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as a DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be a multiple of 4 bytes, up to 32 bytes. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from its
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
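      
      For illustration, a minimal XDP program using the new helper and the
      bounds-check pattern described above (this example is not part of the
      patch; section names and the metadata value are arbitrary):
      
      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>
      
      SEC("xdp")
      int xdp_store_meta(struct xdp_md *ctx)
      {
              void *data;
              __u32 *meta;
      
              /* reserve 4 bytes of metadata in front of the packet; the
               * meta area must stay a multiple of 4 bytes */
              if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u32)))
                      return XDP_PASS;  /* driver lacks data_meta support */
      
              meta = (void *)(long)ctx->data_meta;
              data = (void *)(long)ctx->data;
              if ((void *)(meta + 1) > data)  /* verifier bounds check */
                      return XDP_PASS;
      
              *meta = 0xcafe;  /* scratch value for a later clsact program */
              return XDP_PASS;
      }
      
      char _license[] SEC("license") = "GPL";
      
      A clsact ingress program on the same device can then read those bytes
      back through __sk_buff's data_meta/data window and, for example, set
      skb->mark from them.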
    • bpf: rename bpf_compute_data_end into bpf_compute_data_pointers · 6aaae2b6
      Authored by Daniel Borkmann
      Just do the rename into bpf_compute_data_pointers() as we'll add
      one more pointer here to recompute.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: iWARP - Add check for errors on a SYN packet · 1e99c497
      Authored by Michal Kalderon
      A SYN packet which arrives with errors from FW should be dropped.
      This required adding an additional field to the ll2
      rx completion data.
      Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 23 Sep 2017, 1 commit
  7. 22 Sep 2017, 2 commits
    • Input: uinput - avoid FF flush when destroying device · e8b95728
      Authored by Dmitry Torokhov
      Normally, when an input device supporting force-feedback effects is being
      destroyed, we try to "flush" currently playing effects, so that the
      physical device does not continue vibrating (or executing other effects).
      Unfortunately this does not work well for uinput, as flushing the effects
      deadlocks with the destroy action:
      
      - if the device is being destroyed because the file descriptor is being
        closed, then there is no one to even service FF requests;
      
      - if the device is being destroyed because userspace sent UI_DEV_DESTROY,
        then while it could theoretically be possible to service FF requests,
        userspace is unlikely to do so (it would need to make sure FF handling
        happens on a separate thread), even if the kernel solved the issue of
        FF ioctls deadlocking with the UI_DEV_DESTROY ioctl on udev->mutex.
      
      To avoid lockups like the one below, let's install a custom input device
      flush handler and avoid trying to flush force-feedback effects when we
      are destroying the device, instead relying on uinput to shut off the
      device properly.
      
      NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
      ...
       <<EOE>>  [<ffffffff817a0307>] _raw_spin_lock_irqsave+0x37/0x40
       [<ffffffff810e633d>] complete+0x1d/0x50
       [<ffffffffa00ba08c>] uinput_request_done+0x3c/0x40 [uinput]
       [<ffffffffa00ba587>] uinput_request_submit.part.7+0x47/0xb0 [uinput]
       [<ffffffffa00bb62b>] uinput_dev_erase_effect+0x5b/0x76 [uinput]
       [<ffffffff815d91ad>] erase_effect+0xad/0xf0
       [<ffffffff815d929d>] flush_effects+0x4d/0x90
       [<ffffffff815d4cc0>] input_flush_device+0x40/0x60
       [<ffffffff815daf1c>] evdev_cleanup+0xac/0xc0
       [<ffffffff815daf5b>] evdev_disconnect+0x2b/0x60
       [<ffffffff815d74ac>] __input_unregister_device+0xac/0x150
       [<ffffffff815d75f7>] input_unregister_device+0x47/0x70
       [<ffffffffa00bac45>] uinput_destroy_device+0xb5/0xc0 [uinput]
       [<ffffffffa00bb2de>] uinput_ioctl_handler.isra.9+0x65e/0x740 [uinput]
       [<ffffffff811231ab>] ? do_futex+0x12b/0xad0
       [<ffffffffa00bb3f8>] uinput_ioctl+0x18/0x20 [uinput]
       [<ffffffff81241248>] do_vfs_ioctl+0x298/0x480
       [<ffffffff81337553>] ? security_file_ioctl+0x43/0x60
       [<ffffffff812414a9>] SyS_ioctl+0x79/0x90
       [<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71
      Reported-by: Rodrigo Rivas Costa <rodrigorivascosta@gmail.com>
      Reported-by: Clément VUCHENER <clement.vuchener@gmail.com>
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=193741
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
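      
      A sketch of the custom flush handler described above, reconstructed
      from the commit text rather than copied from the patch
      (input_ff_flush() is the stock force-feedback flush helper):
      
      static int uinput_dev_flush(struct input_dev *dev, struct file *file)
      {
              /*
               * file == NULL means we are in device teardown (fd close or
               * UI_DEV_DESTROY) and nobody is left to service FF requests,
               * so skip the flush instead of deadlocking.
               */
              if (!file)
                      return 0;
      
              return input_ff_flush(dev, file);
      }
      
      /* hooked up when the uinput device is set up:
       *      dev->flush = uinput_dev_flush;
       */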
    • net: avoid a full fib lookup when rp_filter is disabled. · 6e617de8
      Authored by Paolo Abeni
      Since commit 1dced6a8 ("ipv4: Restore accept_local behaviour
      in fib_validate_source()") a full fib lookup is needed even if
      rp_filter is disabled, as long as accept_local is false - which is
      the default.
      
      What we really need in the above scenario is just to check that
      the source IP address is not local, and in most cases we can
      do that more cheaply by looking up the ifaddr hash table.
      
      This commit adds a helper for such a lookup, and uses it to
      validate the source address when rp_filter is disabled and no
      'local' routes have been created by user space in the relevant
      namespace.
      
      A new ipv4 netns flag is added to account for such routes.
      We need that to preserve the same behavior we had before this
      patch.
      
      It also drops the checks to bail out early from __fib_validate_source,
      added by commit 1dced6a8 ("ipv4: Restore accept_local
      behaviour in fib_validate_source()"): they do not give any
      measurable performance improvement, since if we have to do the
      full lookup we are on a slower path anyway.
      
      This improves UDP performance for unconnected sockets
      by 5% when rp_filter is disabled, and also gives a small but
      measurable performance improvement for TCP flood scenarios.
      
      v1 -> v2:
       - use the ifaddr lookup helper in __ip_dev_find(), as suggested
         by Eric
       - fall-back to full lookup if custom local routes are present
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
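      
      The gist of the cheap path, sketched under the assumption that the new
      helper performs an RCU lookup in the ifaddr hash table (the wrapper
      below is illustrative, not the patch's actual call site):
      
      /* accept the source address cheaply: if it does not hash to any
       * local ifaddr it cannot be local, so no fib lookup is needed;
       * callers fall back to the full lookup when user space has added
       * custom local routes (tracked by the new netns flag) */
      static int validate_source_cheap(struct net *net, __be32 src)
      {
              bool is_local;
      
              rcu_read_lock();
              is_local = inet_lookup_ifaddr_rcu(net, src) != NULL;
              rcu_read_unlock();
      
              return is_local ? -EINVAL : 0;
      }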
  8. 21 Sep 2017, 1 commit
    • bpf: one perf event close won't free bpf program attached by another perf event · ec9dd352
      Authored by Yonghong Song
      This patch fixes a bug exhibited by the following scenario:
        1. fd1 = perf_event_open with attr.config = ID1
        2. attach bpf program prog1 to fd1
        3. fd2 = perf_event_open with attr.config = ID1
           <this will be successful>
        4. user program closes fd2 and prog1 is detached from the tracepoint.
        5. the user program with fd1 no longer works properly, as the
           tracepoint produces no output any more.
      
      The issue happens at step 4. Multiple perf_event_open calls can succeed,
      but there is only one bpf prog pointer in the tp_event. In the
      current logic, any fd release for the same tp_event will free
      the tp_event->prog.
      
      The fix is to free tp_event->prog only when the closing fd
      corresponds to the one which registered the program.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
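      
      The shape of the fix, sketched from the description above (treat the
      identifiers as approximate): remember which perf_event registered the
      program, and let only that event's release free it.
      
      /* at attach time: */
      event->tp_event->prog = prog;
      event->tp_event->bpf_prog_owner = event;   /* remember the registrar */
      
      /* at fd release time: */
      prog = event->tp_event->prog;
      if (prog && event->tp_event->bpf_prog_owner == event) {
              /* only the fd that attached the program may free it */
              event->tp_event->prog = NULL;
              bpf_prog_put(prog);
      }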
  9. 20 Sep 2017, 1 commit
    • net: sk_buff rbnode reorg · bffa72cf
      Authored by Eric Dumazet
      skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
      
      Current uses (TCP receive ofo queue and netem) need to save/restore
      tstamp, while skb->dev is either NULL (TCP) or a constant for a given
      queue (netem).
      
      Since we plan to use an RB tree for the TCP retransmit queue to speed up
      SACK processing with large BDP, this patch exchanges skb->dev and
      skb->tstamp.
      
      This saves some overhead in both TCP and netem.
      
      v2: removes the swtstamp field from struct tcp_skb_cb
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
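      
      The resulting layout is roughly the following (fields abridged):
      skb->dev moves into the union with the list/rbtree linkage, while
      tstamp moves out of it, so rbtree users no longer clobber it.
      
      struct sk_buff {
              union {
                      struct {
                              struct sk_buff  *next;
                              struct sk_buff  *prev;
                              union {
                                      struct net_device *dev;
                                      unsigned long     dev_scratch;
                              };
                      };
                      struct rb_node  rbnode;  /* netem & TCP ofo queue */
              };
              /* ... */
              ktime_t         tstamp;  /* no longer overlaps rbnode */
              /* ... */
      };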
  10. 18 Sep 2017, 1 commit
  11. 15 Sep 2017, 4 commits
    • sched/wait: Add swq_has_sleeper() · 8cd641e3
      Authored by Davidlohr Bueso
      This is the equivalent of what we have in regular waitqueues.
      I'm not crazy about the name, but this also helps us get both
      APIs closer -- which, iirc, comes originally from the -net folks.
      
      We also duplicate the comments for the lockless swait_active(),
      from wait.h. Future users will make use of this interface.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
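      
      The helper itself is small; per the description it mirrors
      wq_has_sleeper(), pairing a full barrier with the lockless check:
      
      static inline bool swq_has_sleeper(struct swait_queue_head *wq)
      {
              /*
               * The barrier orders the waker's prior stores against the
               * load of the wait list, pairing with the barrier implied
               * by the waiter's enqueue under wq->lock.
               */
              smp_mb();
              return swait_active(wq);
      }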
    • vfs: constify path argument to kernel_read_file_from_path · 711aab1d
      Authored by Mimi Zohar
      This patch constifies the path argument to kernel_read_file_from_path().
      Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
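      
      For reference, the prototype after this change presumably reads as
      below (parameter list as in kernels of this era); only the path
      parameter gains the const qualifier:
      
      int kernel_read_file_from_path(const char *path, void **buf,
                                     loff_t *size, loff_t max_size,
                                     enum kernel_read_file_id id);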
    • sched/wait: Introduce wakeup bookmark in wake_up_page_bit · 11a19c7b
      Authored by Tim Chen
      Now that we have added breaks in the wait queue scan and allow a
      bookmark on the scan position, we put this logic in the wake_up_page_bit
      function.
      
      We can have very long page wait lists in large systems where multiple
      pages share the same wait list. We break the wake-up walk here to give
      other CPUs a chance to access the list, and to avoid disabling
      interrupts while traversing the list for too long.  This reduces
      interrupt and rescheduling latency, and excessive page wait queue lock
      hold time.
      
      [ v2: Remove bookmark_wake_function ]
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
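      
      A condensed sketch of the resulting function, using the bookmark
      machinery from the patch below ("sched/wait: Break up long wake list
      walk"); the page_match bookkeeping of the real code is omitted:
      
      static void wake_up_page_bit(struct page *page, int bit_nr)
      {
              wait_queue_head_t *q = page_waitqueue(page);
              struct wait_page_key key = { .page = page, .bit_nr = bit_nr };
              wait_queue_entry_t bookmark;
              unsigned long flags;
      
              bookmark.flags = 0;
              bookmark.private = NULL;
              bookmark.func = NULL;
              INIT_LIST_HEAD(&bookmark.entry);
      
              spin_lock_irqsave(&q->lock, flags);
              __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
      
              /* WQ_FLAG_BOOKMARK set means the walk was cut short */
              while (bookmark.flags & WQ_FLAG_BOOKMARK) {
                      /* drop the lock between batches so other CPUs can
                       * take it and interrupts are not held off for the
                       * whole walk */
                      spin_unlock_irqrestore(&q->lock, flags);
                      cpu_relax();
                      spin_lock_irqsave(&q->lock, flags);
                      __wake_up_locked_key_bookmark(q, TASK_NORMAL,
                                                    &key, &bookmark);
              }
              spin_unlock_irqrestore(&q->lock, flags);
      }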
    • sched/wait: Break up long wake list walk · 2554db91
      Authored by Tim Chen
      We encountered workloads that have very long wake-up lists on large
      systems. A waker takes a long time to traverse the entire wake list and
      execute all the wake functions.
      
      We saw page wait lists that are up to 3700+ entries long in tests of
      large 4- and 8-socket systems. It took 0.8 sec to traverse such a list
      during wake-up. Any other CPU that contends for the list spin lock will
      spin for a long time. This is a result of the NUMA-balancing migration
      of hot pages that are shared by many threads.
      
      Multiple waking CPUs are queued up behind the lock, and the last one
      queued has to wait until all CPUs have done all the wakeups.
      
      The page wait list is traversed with interrupts disabled, which caused
      various problems. This was the original cause that triggered the NMI
      watchdog timer in: https://patchwork.kernel.org/patch/9800303/ . Only
      extending the NMI watchdog timer there helped.
      
      This patch bookmarks the waker's scan position in the wake list and
      breaks up the wake-up walk, to allow access to the list before the
      waker resumes its walk down the rest of the wait list. It lowers the
      interrupt and rescheduling latency.
      
      This patch also provides a performance boost when combined with the
      next patch, which breaks up the page wakeup list walk. We saw a 22%
      improvement in the will-it-scale file pread2 test on a Xeon Phi system
      running 256 threads.
      
      [ v2: Merged in Linus' changes to remove the bookmark_wake_function,
        and simplify access to flags. ]
      Reported-by: Kan Liang <kan.liang@intel.com>
      Tested-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
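      
      The heart of the change in __wake_up_common(), condensed (the batch
      constant matches the patch; surrounding declarations are omitted):
      
      #define WAITQUEUE_WALK_BREAK_CNT 64
      
              list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
                      unsigned flags = curr->flags;
                      int ret = curr->func(curr, mode, wake_flags, key);
      
                      if (ret < 0)
                              break;
                      if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                              break;
      
                      /* after a batch of wakeups, park the bookmark at the
                       * next entry and bail out, so the caller can drop the
                       * lock and later resume from here */
                      if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
                          (&next->entry != &wq_head->head)) {
                              bookmark->flags = WQ_FLAG_BOOKMARK;
                              list_add_tail(&bookmark->entry, &next->entry);
                              break;
                      }
              }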
  12. 14 Sep 2017, 1 commit
    • mm: treewide: remove GFP_TEMPORARY allocation flag · 0ee931c4
      Authored by Michal Hocko
      GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived
      and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  Its
      primary motivation was to allow users to indicate that an allocation is
      short-lived, so that the allocator can try to place such allocations
      close together and prevent long-term fragmentation.  As much as this
      sounds like a reasonable semantic, it is much less clear when the
      high-level GFP_TEMPORARY allocation flag should actually be used.  How
      long is temporary? Can the context holding that memory sleep? Can it
      take locks? It seems there is no good answer to those questions.
      
      The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
      __GFP_RECLAIMABLE, which in itself is tricky because basically none of
      the existing callers provides a way to reclaim the allocated memory.  So
      this is rather misleading and hard to evaluate for any benefits.
      
      I have checked some random users and none of them has added the flag
      with a specific justification.  I suspect most of them just copied it
      from other existing users, and others just thought it might be a good
      idea to use without any measurement.  This suggests that GFP_TEMPORARY
      just invites cargo-cult usage without any reasoning.
      
      I believe that our gfp flags are quite complex already, and especially
      those with high-level semantics should be clearly defined to prevent
      confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
      replacing all existing users with plain GFP_KERNEL.  Please note that
      SLAB users with shrinkers will still get the __GFP_RECLAIMABLE heuristic,
      so they will be placed properly for memory fragmentation prevention.
      
      I can see reasons we might want some gfp flag to reflect short-term
      allocations, but I propose starting from a clear semantic definition and
      only then adding users with proper justification.
      
      This was brought up before LSF this year by Matthew [1], and it
      turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
      seems to be a heuristic without any measured advantage for most (if not
      all) of its current users.  The follow-up discussion revealed that
      opinions on what might count as a temporary allocation differ a lot
      between developers.  So rather than trying to tweak existing users into
      a semantic which they haven't expected, I propose to simply remove the
      flag and start from scratch if we really need a semantic for short-term
      allocations.
      
      [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
      
      [akpm@linux-foundation.org: fix typo]
      [akpm@linux-foundation.org: coding-style fixes]
      [sfr@canb.auug.org.au: drm/i915: fix up]
        Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
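      
      The treewide conversion itself is mechanical; as the text above notes,
      the flag expanded to GFP_KERNEL plus the reclaimable hint, so each
      call site simply becomes:
      
      /* before: effectively GFP_KERNEL | __GFP_RECLAIMABLE */
      buf = kmalloc(len, GFP_TEMPORARY);
      
      /* after: */
      buf = kmalloc(len, GFP_KERNEL);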
  13. 12 Sep 2017, 3 commits
  14. 11 Sep 2017, 1 commit
    • dax: remove the pmem_dax_ops->flush abstraction · c3ca015f
      Authored by Mikulas Patocka
      Commit abebfbe2 ("dm: add ->flush() dax operation support") is
      buggy. A DM device may be composed of multiple underlying devices and
      all of them need to be flushed. That commit just routes the flush
      request to the first device and ignores the other devices.
      
      It could be fixed by adding more complex logic to the device mapper. But
      there is only one implementation of the method pmem_dax_ops->flush - that
      is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
      don't need the pmem_dax_ops->flush abstraction at all; we can call
      arch_wb_cache_pmem() directly from dax_flush(), because dax_dev->ops->flush
      can't ever reach anything other than arch_wb_cache_pmem().
      
      It should also be pointed out that some uses of persistent memory need
      to flush only a very small amount of data (such as one cacheline), and
      going through the device mapper machinery for a single flushed cache
      line would be overkill.
      
      Fix this by removing the pmem_dax_ops->flush abstraction and call
      arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
      mapper code that forwards the flushes.
      
      Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
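      
      A sketch of dax_flush() after this change, per the description above
      (the config guard is an assumption about how the arch helper is made
      available):
      
      void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
      {
      #ifdef CONFIG_ARCH_HAS_PMEM_API
              /* the only thing ->flush could ever reach anyway */
              arch_wb_cache_pmem(addr, size);
      #endif
      }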
  15. 09 Sep 2017, 13 commits