1. 07 Sep, 2017 4 commits
    • R
      dax: move all DAX radix tree defs to fs/dax.c · 527b19d0
      Committed by Ross Zwisler
      Now that we no longer insert struct page pointers in DAX radix trees the
      page cache code no longer needs to know anything about DAX exceptional
      entries.  Move all the DAX exceptional entry definitions from dax.h to
      fs/dax.c.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-6-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      527b19d0
    • R
      dax: remove DAX code from page_cache_tree_insert() · d01ad197
      Committed by Ross Zwisler
      Now that we no longer insert struct page pointers in DAX radix trees we
      can remove the special casing for DAX in page_cache_tree_insert().
      
      This also allows us to make dax_wake_mapping_entry_waiter() local to
      fs/dax.c, removing it from dax.h.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d01ad197
    • R
      dax: use common 4k zero page for dax mmap reads · 91d25ba8
      Committed by Ross Zwisler
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.
      
      This has three major drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory. This is easily visible by looking at the overall
         memory consumption of the system or by looking at /proc/[pid]/smaps:
      
      	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:             1048576 kB
      	Pss:             1048576 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:   1048576 kB
      	Private_Dirty:         0 kB
      	Referenced:      1048576 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it. Here
         are the average latencies of dax_load_hole() as measured by ftrace on
         a random test box:
      
          Old method, using zeroed page cache pages:	3.4 us
          New method, using the common 4k zero page:	0.8 us
      
         This was the average latency over 1 GiB of sequential reads done by
         this simple fio script:
      
           [global]
           size=1G
           filename=/root/dax/data
           fallocate=none
           [io]
           rw=read
           ioengine=mmap
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      Solve these issues by following the lead of the DAX PMD code and using a
      common 4k zero page instead.  As with the PMD code we will now insert a
      DAX exceptional entry into the radix tree instead of a struct page
      pointer which allows us to remove all the special casing in the DAX
      code.
      
      Note that we do still pretty aggressively check for regular pages in the
      DAX radix tree, especially where we take action based on the bits set in
      the page.  If we ever find a regular page in our radix tree now that
      most likely means that someone besides DAX is inserting pages (which has
      happened lots of times in the past), and we want to find that out early
      and fail loudly.
      
      This solution also removes the extra memory consumption.  Here is that
      same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
      code:
      
      	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:                   0 kB
      	Pss:                   0 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:         0 kB
      	Private_Dirty:         0 kB
      	Referenced:            0 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      Overall system memory consumption is similarly improved.
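
The memory effect described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct hole_map`, `fault_old()` and `fault_new()` are hypothetical names invented for illustration and are not kernel interfaces. The old scheme allocates a private zeroed page per faulted offset, while the new scheme points every offset at one shared zero page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u

/* One shared page of zeroes; static storage is zero-initialized. */
static const unsigned char shared_zero_page[MODEL_PAGE_SIZE];

/* Hypothetical stand-in for the per-file mapping: one slot per 4k offset. */
struct hole_map {
	const unsigned char **slot;
	size_t nr_pages;
	size_t pages_allocated;	/* private zeroed pages allocated so far */
};

static int hole_map_init(struct hole_map *m, size_t nr_pages)
{
	m->slot = calloc(nr_pages, sizeof(*m->slot));
	m->nr_pages = nr_pages;
	m->pages_allocated = 0;
	return m->slot ? 0 : -1;
}

/* Old scheme: each read fault over a hole allocates and zeroes a new page. */
static const unsigned char *fault_old(struct hole_map *m, size_t idx)
{
	assert(idx < m->nr_pages);
	if (!m->slot[idx]) {
		m->slot[idx] = calloc(1, MODEL_PAGE_SIZE);
		if (m->slot[idx])
			m->pages_allocated++;
	}
	return m->slot[idx];
}

/* New scheme: each read fault over a hole maps the same shared zero page. */
static const unsigned char *fault_new(struct hole_map *m, size_t idx)
{
	assert(idx < m->nr_pages);
	m->slot[idx] = shared_zero_page;
	return m->slot[idx];
}

/* Free only the private pages; the shared zero page is never freed. */
static void hole_map_destroy(struct hole_map *m)
{
	for (size_t i = 0; i < m->nr_pages; i++)
		if (m->slot[i] != shared_zero_page)
			free((void *)m->slot[i]);
	free(m->slot);
}
```

In this model, reading N pages through `fault_old()` leaves N private pages allocated, while `fault_new()` allocates none, which mirrors the Rss difference in the two smaps listings.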
      
      Another major change is that we remove dax_pfn_mkwrite() from our fault
      flow, and instead rely on the page fault itself to make the PTE dirty
      and writeable.  The following description from the patch adding the
      vm_insert_mixed_mkwrite() call explains this a little more:
      
         "To be able to use the common 4k zero page in DAX we need to have our
          PTE fault path look more like our PMD fault path where a PTE entry
          can be marked as dirty and writeable as it is first inserted rather
          than waiting for a follow-up dax_pfn_mkwrite() =>
          finish_mkwrite_fault() call.
      
          Right now we can rely on having a dax_pfn_mkwrite() call because we
          can distinguish between these two cases in do_wp_page():
      
                  case 1: 4k zero page => writable DAX storage
                  case 2: read-only DAX storage => writeable DAX storage
      
          This distinction is made via vm_normal_page(). vm_normal_page()
          returns false for the common 4k zero page, though, just as it does
          for DAX ptes. Instead of special casing the DAX + 4k zero page case
          we will simplify our DAX PTE page fault sequence so that it matches
          our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
          We will instead use dax_iomap_fault() to handle write-protection
          faults.
      
          This means that insert_pfn() needs to follow the lead of
          insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
          'mkwrite' is set insert_pfn() will do the work that was previously
          done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91d25ba8
    • R
      mm: add vm_insert_mixed_mkwrite() · b2770da6
      Committed by Ross Zwisler
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.  This has three major
      drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory.
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it.
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      This series solves these issues by following the lead of the DAX PMD
      code and using a common 4k zero page instead.  This reduces memory usage
      and decreases latencies for some workloads, and it simplifies the DAX
      code, removing over 100 lines in total.
      
      This patch (of 5):
      
      To be able to use the common 4k zero page in DAX we need to have our PTE
      fault path look more like our PMD fault path where a PTE entry can be
      marked as dirty and writeable as it is first inserted rather than
      waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault()
      call.
      
      Right now we can rely on having a dax_pfn_mkwrite() call because we can
      distinguish between these two cases in do_wp_page():
      
      	case 1: 4k zero page => writable DAX storage
      	case 2: read-only DAX storage => writeable DAX storage
      
      This distinction is made via vm_normal_page().  vm_normal_page()
      returns false for the common 4k zero page, though, just as it does for
      DAX ptes.  Instead of special casing the DAX + 4k zero page case we will
      simplify our DAX PTE page fault sequence so that it matches our DAX PMD
      sequence, and get rid of the dax_pfn_mkwrite() helper.  We will instead
      use dax_iomap_fault() to handle write-protection faults.
      
      This means that insert_pfn() needs to follow the lead of
      insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag.  If 'mkwrite'
      is set insert_pfn() will do the work that was previously done by
      wp_page_reuse() as part of the dax_pfn_mkwrite() call path.
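
The effect of the 'mkwrite' flag can be sketched in a simplified userspace model. The PTE bit layout, `MODEL_PFN_SHIFT`, and the function `insert_pfn_model()` below are assumptions made for illustration; real PTE layouts are architecture-specific and the real insert_pfn() takes a VMA and an address. The model only shows the idea that, with 'mkwrite' set, the entry is marked young and dirty at insertion time, and writeable only if the mapping itself permits writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PTE bits for this model only. */
#define MODEL_PTE_PRESENT (1ull << 0)
#define MODEL_PTE_WRITE   (1ull << 1)
#define MODEL_PTE_DIRTY   (1ull << 2)
#define MODEL_PTE_YOUNG   (1ull << 3)
#define MODEL_PFN_SHIFT   12

typedef uint64_t model_pte_t;

/*
 * Build the PTE for a pfn. Without mkwrite the entry is simply present;
 * with mkwrite it is also marked young and dirty, and writeable when the
 * mapping allows writes, so no follow-up mkwrite fault is needed.
 */
static model_pte_t insert_pfn_model(uint64_t pfn, bool vma_writable, bool mkwrite)
{
	model_pte_t entry = (pfn << MODEL_PFN_SHIFT) | MODEL_PTE_PRESENT;

	if (mkwrite) {
		entry |= MODEL_PTE_YOUNG | MODEL_PTE_DIRTY;
		if (vma_writable)
			entry |= MODEL_PTE_WRITE;
	}
	return entry;
}
```

A read fault would correspond to `mkwrite == false`, and a write fault handled directly by the fault path to `mkwrite == true`.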
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2770da6
  2. 02 Sep, 2017 1 commit
  3. 01 Sep, 2017 9 commits
    • C
      ANDROID: binder: Add BINDER_GET_NODE_DEBUG_INFO ioctl · abcc6153
      Committed by Colin Cross
      The BINDER_GET_NODE_DEBUG_INFO ioctl will return debug info on
      a node.  Each successive call reusing the previous return value
      will return the next node.  The data will be used by
      libmemunreachable to mark the pointers with kernel references
      as reachable.
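
The iteration pattern described above (feed the previous result back in to get the next node) can be sketched with a userspace stand-in. Everything here is hypothetical: `struct node_info`, `get_next_node()` and `walk_nodes()` are illustrative names, the node table is a plain sorted array rather than anything obtained from the binder driver, and the convention that a zero `ptr` signals the end of the walk is an assumption of this model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-node debug record used by this model. */
struct node_info {
	uintptr_t ptr;
	uintptr_t cookie;
	uint32_t has_strong_ref;
	uint32_t has_weak_ref;
};

/*
 * Stand-in for one call: return the first node whose ptr is greater than
 * the ptr passed in, or an all-zero record when there is none. The nodes
 * array is assumed to be sorted by ascending ptr.
 */
static void get_next_node(const struct node_info *nodes, size_t n,
			  struct node_info *info)
{
	uintptr_t prev = info->ptr;

	memset(info, 0, sizeof(*info));
	for (size_t i = 0; i < n; i++) {
		if (nodes[i].ptr > prev) {
			*info = nodes[i];
			return;
		}
	}
}

/* Walk all nodes by reusing the previous result as the next input. */
static size_t walk_nodes(const struct node_info *nodes, size_t n,
			 uintptr_t *seen, size_t cap)
{
	struct node_info info;
	size_t count = 0;

	memset(&info, 0, sizeof(info));
	for (;;) {
		get_next_node(nodes, n, &info);
		if (info.ptr == 0)	/* assumed end-of-walk marker */
			break;
		if (count < cap)
			seen[count] = info.ptr;
		count++;
	}
	return count;
}
```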
      Signed-off-by: Colin Cross <ccross@android.com>
      Signed-off-by: Martijn Coenen <maco@android.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      abcc6153
    • J
      include/linux/compiler.h: don't perform compiletime_assert with -O0 · c03567a8
      Committed by Joe Stringer
      Commit c7acec71 ("kernel.h: handle pointers to arrays better in
      container_of()") made use of __compiletime_assert() from container_of()
      thus increasing the usage of this macro, allowing developers to notice
      type conflicts in usage of container_of() at compile time.
      
      However, the implementation of __compiletime_assert relies on compiler
      optimizations to report an error.  This means that if a developer uses
      "-O0" with any code that performs container_of(), the compiler will always
      report an error regardless of whether there is an actual problem in the
      code.
      
      This patch disables compiletime_assert when optimizations are disabled to
      allow such code to compile with CFLAGS="-O0".
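
The mechanism can be sketched outside the kernel as follows. This is a simplified approximation and not the kernel's macro: `my_compiletime_assert` is a hypothetical name, and the sketch assumes a GCC-compatible compiler that defines `__OPTIMIZE__` when optimizing and supports the `error` function attribute. With optimization on, a condition the compiler can prove true lets it delete the call to the error-attributed function; without optimization the call would survive, so the macro expands to an empty statement instead.

```c
#include <assert.h>

#ifdef __OPTIMIZE__
/*
 * Optimizing build: if the compiler can prove the condition true, the call
 * below is removed as dead code. If the call survives, the error attribute
 * turns it into a compile-time error carrying msg.
 */
# define my_compiletime_assert(condition, msg, name)			\
	do {								\
		extern void name(void) __attribute__((error(msg)));	\
		if (!(condition))					\
			name();						\
	} while (0)
#else
/* Non-optimizing build (-O0): the check is compiled out entirely. */
# define my_compiletime_assert(condition, msg, name) do { } while (0)
#endif

/* Example user: the condition is a compile-time constant that holds. */
static int int_is_wide_enough(void)
{
	my_compiletime_assert(sizeof(int) >= 2,
			      "int is narrower than 16 bits",
			      my_assert_int_width);
	return 1;
}
```

The empty `do { } while (0)` form keeps the macro usable as a single statement in both configurations.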
      
      Example compilation failure:
      
      ./include/linux/compiler.h:547:38: error: call to `__compiletime_assert_94' declared with attribute error: pointer type mismatch in container_of()
        _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                            ^
      ./include/linux/compiler.h:530:4: note: in definition of macro `__compiletime_assert'
          prefix ## suffix();    \
          ^~~~~~
      ./include/linux/compiler.h:547:2: note: in expansion of macro `_compiletime_assert'
        _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
        ^~~~~~~~~~~~~~~~~~~
      ./include/linux/build_bug.h:46:37: note: in expansion of macro `compiletime_assert'
       #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                           ^~~~~~~~~~~~~~~~~~
      ./include/linux/kernel.h:860:2: note: in expansion of macro `BUILD_BUG_ON_MSG'
        BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
        ^~~~~~~~~~~~~~~~
      
      [akpm@linux-foundation.org: use do{}while(0), per Michal]
      Link: http://lkml.kernel.org/r/20170829230114.11662-1-joe@ovn.org
      Fixes: c7acec71 ("kernel.h: handle pointers to arrays better in container_of()")
      Signed-off-by: Joe Stringer <joe@ovn.org>
      Cc: Ian Abbott <abbotti@mev.co.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c03567a8
    • J
      mm/mmu_notifier: kill invalidate_page · 5f32b265
      Committed by Jérôme Glisse
      The invalidate_page callback suffered from two pitfalls.  First, it used
      to happen after the page table lock was released, and thus a new page
      might have been set up before the call to invalidate_page() happened.
      
      This is in a weird way fixed by commit c7ab0d2f ("mm: convert
      try_to_unmap_one() to use page_vma_mapped_walk()") that moved the
      callback under the page table lock but this also broke several existing
      users of the mmu_notifier API that assumed they could sleep inside this
      callback.
      
      The second pitfall was that invalidate_page() was the only callback that
      did not take an address range to invalidate; it was given an address and
      a page instead.  Many callback implementers assumed the page could never
      be a THP and thus failed to invalidate the appropriate range for THP.
      
      By killing this callback we unify the mmu_notifier callback API to
      always take a virtual address range as input.
      
      Finally, this also simplifies life for end users, as there are now two
      clear choices:
        - invalidate_range_start()/end() callbacks (which allow you to sleep)
        - invalidate_range(), where you cannot sleep but which happens right
          after the page table update, under the page table lock
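
To make the THP pitfall concrete, here is a small worked example of the range arithmetic. It is a sketch under assumptions: `struct inval_range` and `range_for_mapping()` are hypothetical names, and the sizes assume 4 KiB base pages with 2 MiB transparent huge pages. A callback that receives only an address and assumes a 4 KiB page covers far less than the huge page that actually backs that address.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_BASE_PAGE_SIZE 4096ull
#define MODEL_HUGE_PAGE_SIZE (2ull * 1024 * 1024)	/* assumed THP size */

/* Half-open virtual address range [start, end). */
struct inval_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Compute the range that must be invalidated for the mapping containing
 * addr: one base page, or the whole huge page when the mapping is a THP.
 */
static struct inval_range range_for_mapping(uint64_t addr, int is_thp)
{
	uint64_t size = is_thp ? MODEL_HUGE_PAGE_SIZE : MODEL_BASE_PAGE_SIZE;
	struct inval_range r;

	r.start = addr & ~(size - 1);	/* round down to the mapping size */
	r.end = r.start + size;
	return r;
}
```

For an address such as 0x203123 the base-page range is 4 KiB long, while the THP range spans the full 2 MiB from 0x200000, which is the range a page-based callback would have missed.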
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f32b265
    • J
      dax: update to new mmu_notifier semantic · a4d1a885
      Committed by Jérôme Glisse
      Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
      and make sure it is bracketed by calls to *_invalidate_range_start()/end().
      
      Note that because we can not presume the pmd value or pte value we have
      to assume the worst and unconditionally report an invalidation as
      happening.
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4d1a885
    • B
      <linux/uaccess.h>: Fix copy_in_user() declaration · f58e76c1
      Committed by Bart Van Assche
      copy_in_user() copies data from user-space address @from to user-
      space address @to. Hence declare both @from and @to as user-space
      pointers.
      
      Fixes: commit d597580d ("generic ...copy_..._user primitives")
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f58e76c1
    • C
      annotate RWF_... flags · ddef7ed2
      Committed by Christoph Hellwig
      [AV: added missing annotations in syscalls.h/compat.h]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      ddef7ed2
    • A
    • J
      drivers: w1: add hwmon support structures · 2eb79548
      Committed by Jaghathiswari Rankappagounder Natarajan
      This patch has changes to w1.h/w1.c generic files to add (optional) hwmon
      support structures.
      Signed-off-by: Jaghathiswari Rankappagounder Natarajan <jaghu@google.com>
      Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
      Acked-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2eb79548
    • B
      usb: phy: Avoid unchecked dereference warning · eb3c74de
      Committed by Baolin Wang
      Move the USB phy NULL checking before issuing usb_phy_set_charger_current()
      to avoid unchecked dereference warning.
      Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb3c74de
  4. 31 Aug, 2017 26 commits
    • J
      genalloc: Fix an incorrect kerneldoc comment · a27bfcab
      Committed by Jonathan Corbet
      The kerneldoc comment for the genpool_algo_t typedef was incomplete and
      incorrectly formatted, leading to a raft of warnings during the docs build.
      Fix it appropriately.
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      a27bfcab
    • M
      irqchip/gic-v3: Advertise GICv4 support to KVM · 4bdf5025
      Committed by Marc Zyngier
      As KVM needs to know about the availability of GICv4 to enable
      direct injection of interrupts, let's advertise the feature in
      the gic_kvm_info structure.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      4bdf5025
    • M
      irqchip/gic-v4: Enable low-level GICv4 operations · 3d63cb53
      Committed by Marc Zyngier
      Get the show on the road...
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      3d63cb53
    • M
      irqchip/gic-v4: Add VLPI configuration interface · f2eac75d
      Committed by Marc Zyngier
      Add the required interfaces to map, unmap and update a VLPI.
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      f2eac75d
    • M
      irqchip/gic-v4: Add VPE command interface · eab84318
      Committed by Marc Zyngier
      Add the required interfaces to schedule a VPE and perform a
      VINVALL command.
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      eab84318
    • M
      irqchip/gic-v4: Add per-VM VPE domain creation · 7de5c0af
      Committed by Marc Zyngier
      When creating a VM, it is very convenient to have an irq domain
      containing all the doorbell interrupts associated with that VM
      (each interrupt representing a VPE).
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      7de5c0af
    • M
      irqchip/gic-v3-its: Set implementation defined bit to enable VLPIs · d51c4b4d
      Committed by Marc Zyngier
      A long time ago, GITS_CTLR[1] used to be called GITC_CTLR.EnableVLPI.
      It has been subsequently deprecated and is now an "Implementation
      Defined" bit that may or may not be set for GICv4. Brilliant.
      
      And the current crop of the FastModel requires that bit for VLPIs
      to be enabled. Oh well... Let's set it and find out what breaks.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      d51c4b4d
    • M
      irqchip/gic-v3-its: Add device proxy for VPE management if !DirectLpi · 20b3d54e
      Committed by Marc Zyngier
      When we don't have the DirectLPI feature, we must work around the
      architecture shortcomings to be able to perform the required
      maintenance (interrupt masking, clearing and injection).
      
      For this, we create a fake device whose sole purpose is to
      provide a way to issue commands as if we were dealing with LPIs
      coming from that device (while they actually originate from
      the ITS). This fake device doesn't have LPIs allocated to it,
      but instead uses the VPE LPIs.
      
      Of course, this could be a real bottleneck, and a naive
      implementation would require 6 commands to issue an invalidation.
      
      Instead, let's allocate at least one event per physical CPU
      (rounded up to the next power of 2), and opportunistically
      map the VPE doorbell to an event. This doorbell will be mapped
      until we roll over and need to reallocate this slot.
      
      This ensures that most of the time, we only need 2 commands
      to issue an INV, INT or CLEAR, making the performance a lot
      better, given that we always issue a CLEAR on entry, and
      an INV on each side of a trapped WFI.
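
The slot allocation described above can be sketched as a small model. This is an illustration under assumptions, not the driver code: `struct vpe_proxy`, `proxy_init()` and `proxy_map()` are hypothetical names, VPEs are identified by plain integers, and the model only tracks which VPE owns which event slot. It shows the power-of-two sizing, reuse of an existing mapping (the cheap path), and round-robin eviction when the slots roll over.

```c
#include <assert.h>

#define MODEL_MAX_EVENTS 64

/* Round n up to the next power of two (n must be in 1..2^31). */
static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n > 0 && n <= (1u << 31));
	while (p < n)
		p <<= 1;
	return p;
}

/* Hypothetical proxy device state: which VPE owns each event slot. */
struct vpe_proxy {
	unsigned int nr_events;
	unsigned int next_victim;	/* slot to reuse on the next new mapping */
	int owner[MODEL_MAX_EVENTS];	/* VPE id per slot, -1 when free */
};

static void proxy_init(struct vpe_proxy *p, unsigned int nr_cpus)
{
	p->nr_events = roundup_pow2(nr_cpus);
	assert(p->nr_events <= MODEL_MAX_EVENTS);
	p->next_victim = 0;
	for (unsigned int i = 0; i < p->nr_events; i++)
		p->owner[i] = -1;
}

/*
 * Make sure vpe_id has an event slot. Returns 0 if an existing mapping was
 * reused, 1 if the doorbell had to be mapped (possibly evicting the VPE
 * that previously owned the slot).
 */
static int proxy_map(struct vpe_proxy *p, int vpe_id, unsigned int *event)
{
	for (unsigned int i = 0; i < p->nr_events; i++) {
		if (p->owner[i] == vpe_id) {
			*event = i;
			return 0;
		}
	}
	*event = p->next_victim;
	p->owner[*event] = vpe_id;
	p->next_victim = (p->next_victim + 1) % p->nr_events;
	return 1;
}
```

With 6 CPUs the model allocates 8 slots, so a VPE keeps its slot until 8 other VPEs have been mapped after it.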
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      20b3d54e
    • M
      irqchip/gic-v3-its: Add VPE scheduling · e643d803
      Committed by Marc Zyngier
      When a VPE is scheduled to run, the corresponding redistributor must
      be told so, by setting VPROPBASER to the VM's property table, and
      VPENDBASER to the vcpu's pending table.
      
      When scheduled out, we preserve the IDAI and PendingLast bits. The
      latter is especially important, as it tells the hypervisor that
      there are pending interrupts for this vcpu.
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      e643d803
    • M
      irqchip/gic-v3-its: Add VPENDBASER/VPROPBASER accessors · 3ca63f36
      Committed by Marc Zyngier
      Since V{PEND,PROP}BASER are 64-bit registers, they need some ad-hoc
      accessors on 32-bit, especially given that VPENDBASER contains
      a Valid bit, which makes the access a bit convoluted.
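
Why the Valid bit complicates a split access can be shown with a userspace model. This is a sketch under assumptions: `struct reg64_model`, `model_write()` and `write_vpendbaser_32()` are hypothetical names, the register is modeled as two 32-bit halves in memory rather than MMIO, and Valid is taken to be the top bit (bit 63). The point of the model is the ordering: Valid is cleared through the high half first, so the low half is never changed while the register is still marked valid.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_VALID    (1ull << 63)			/* assumed Valid bit */
#define MODEL_VALID_HI ((uint32_t)(MODEL_VALID >> 32))

/* Model of a 64-bit register that can only be accessed 32 bits at a time. */
struct reg64_model {
	uint32_t half[2];		/* [0] = low word, [1] = high word */
	int low_written_while_valid;	/* set if the ordering rule is broken */
};

/* Single 32-bit write, recording writes to the low half under Valid. */
static void model_write(struct reg64_model *r, int idx, uint32_t v)
{
	if (idx == 0 && (r->half[1] & MODEL_VALID_HI))
		r->low_written_while_valid = 1;
	r->half[idx] = v;
}

static uint64_t model_read(const struct reg64_model *r)
{
	return ((uint64_t)r->half[1] << 32) | r->half[0];
}

/* 64-bit write built from 32-bit accesses: clear Valid, then low, then high. */
static void write_vpendbaser_32(struct reg64_model *r, uint64_t val)
{
	uint32_t hi = r->half[1];

	if (hi & MODEL_VALID_HI)
		model_write(r, 1, hi & ~MODEL_VALID_HI);

	model_write(r, 0, (uint32_t)val);
	model_write(r, 1, (uint32_t)(val >> 32));	/* may set Valid again */
}
```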
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      3ca63f36
    • M
      irqchip/gic-v3-its: Add GICv4 ITS command definitions · d7276b80
      Committed by Marc Zyngier
      Add the new GICv4 ITS command definitions, most of them being
      defined in terms of their physical counterparts.
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      d7276b80
    • M
      irqchip/gic-v4: Add management structure definitions · de29faa0
      Committed by Marc Zyngier
      Add a bunch of GICv4-specific data structures that will get used in
      subsequent patches.
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      de29faa0
    • M
      IB/core: Assign root to all drivers · 52427112
      Committed by Matan Barak
      In order to use the parsing tree, we need to assign the root
      to all drivers. Currently, we just assign the default parsing
      tree via ib_uverbs_add_one. The driver could override this by
      assigning a parsing tree prior to registering the device.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      52427112
    • M
      IB/core: Add completion queue (cq) object actions · 9ee79fce
      Committed by Matan Barak
      Adding CQ ioctl actions:
      1. create_cq
      2. destroy_cq
      
      This requires adding the following:
      1. A specification describing the method
         a. Handler
         b. Attributes specification
            Each attribute is one of the following:
            a. PTR_IN - input data
               Note: This could be encoded inline for data < 64bit
            b. PTR_OUT - response data
            c. IDR - idr based object
            d. FD - fd based object
            Blob attributes (clauses a and b) contain their type,
            while object specifications (clauses c and d) contain
            the expected object type (for example, the given id
            should be UVERBS_TYPE_PD) and the required access
            (READ, WRITE, NEW or DESTROY). If a NEW is required,
            the new object's id will be assigned to this attribute.
            All attributes could get a UA_FLAGS attribute.
            Currently we support stating that an attribute is
            mandatory or that the specification size corresponds
            to a lower bound (and that this attribute could be
            extended).
            We currently add both default attributes and the two
            generic UHW_IN and UHW_OUT driver specific attributes.
      2. Handler
         A handler gets a uverbs_attr_bundle. The handler developer uses
         uverbs_attr_get to fetch an attribute of a given id.
         Each of these attribute groups corresponds to the specification
         group defined in the action (clauses 1.b and 1.c respectively).
         The indices of these arrays correspond to the attribute ids
         declared in the specifications (clause 2).

         The handler is quite simple. It assumes the infrastructure fetched
         all objects and locked, created or destroyed them as required by
         the specification. Pointer (or blob) attributes were validated to
         match their required sizes. After the handler finishes, the
         infrastructure commits or rolls back the objects.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      9ee79fce
    • M
      IB/core: Add legacy driver's user-data · d70724f1
      Committed by Matan Barak
      In this phase, we don't want to change all the drivers to use
      flexible driver's specific attributes. Therefore, we add two default
      attributes: UHW_IN and UHW_OUT. These attributes are optional in some
      methods and they encode the driver specific command data. We add
      a function that extracts this data and creates the legacy udata over
      it.
      
      Driver's data should start from UVERBS_UDATA_DRIVER_DATA_FLAG. This
      turns on the first bit of the namespace, indicating this attribute
      belongs to the driver's namespace.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      d70724f1
    • M
      IB/core: Export ioctl enum types to user-space · 64b19e13
      Committed by Matan Barak
      Add a new ib_user_ioctl_verbs.h which exports all required ABI
      enums and structs to the user-space.
      Export the default types to user-space through this file.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      64b19e13
    • M
      IB/core: Explicitly destroy an object while keeping uobject · 4da70da2
      Matan Barak 提交于
      When some objects are destroyed, we need to extract their status at
      destruction. After object's destruction, this status
      (e.g. events_reported) relies in the uobject. In order to have the
      latest and correct status, the underlying object should be destroyed,
      but we should keep the uobject alive and read this information off the
      uobject. We introduce a rdma_explicit_destroy function. This function
      destroys the class type object (for example, the IDR class type which
      destroys the underlying object as well) and then convert the uobject
      to be of a null class type. This uobject will then be destroyed as any
      other uobject once uverbs_finalize_object[s] is called.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      4da70da2
    • M
      IB/core: Add macros for declaring methods and attributes · 35410306
      Matan Barak committed
      This patch adds macros for declaring objects, methods and
      attributes. These definitions are later used by downstream patches
      to declare some of the default types.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      35410306
    • M
      IB/core: Add uverbs merge trees functionality · 118620d3
      Matan Barak committed
      Different drivers support different features and even subsets of the
      common uverbs implementation. Currently, this is handled as a bitmask
      in every driver that represents which kind of methods it supports, but
      doesn't go down to attribute granularity. Moreover, drivers might
      want to add their specific types, methods and attributes to let
      their user-space counterparts be exposed to some more efficient
      abstractions. With this change, the existence of different features is
      validated syntactically via the parsing infrastructure rather than
      by complex in-handler logic.
      
      In order to do that, we allow defining features and abstractions
      as parsing trees. These per-feature parsing trees can be merged
      into an efficient (perfect-hash based) parsing tree, which is later
      used by the parsing infrastructure.
      
      To sum it up, this makes a parse tree unique for a device and
      represents only the features this particular device supports.
      This is done by having a root specification tree per feature.
      Before a device registers itself as an IB device, it merges
      all these trees into one parsing tree. This parsing tree
      is used to parse all user-space commands.
      
      A future user-space application could read this parse tree. This
      tree represents which objects, methods and attributes are
      supported by this device.
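As a rough sketch of the merge step, assume each per-feature tree is reduced to a small table of method bitmasks indexed by object id. That layout, and every ex_ name below, is a simplification invented for illustration; the actual infrastructure builds a perfect-hash based tree that also covers attributes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_MAX_OBJECTS 8

/* Hypothetical per-feature spec: bit m of methods[o] means object o has method m. */
struct ex_spec_tree {
	uint32_t methods[EX_MAX_OBJECTS];
};

/* Merge all feature trees into the single tree a device would register with. */
void ex_merge_trees(struct ex_spec_tree *out,
		    const struct ex_spec_tree *const *trees, size_t num_trees)
{
	assert(out != NULL && trees != NULL);
	for (size_t o = 0; o < EX_MAX_OBJECTS; o++)
		out->methods[o] = 0;
	for (size_t t = 0; t < num_trees; t++)
		for (size_t o = 0; o < EX_MAX_OBJECTS; o++)
			out->methods[o] |= trees[t]->methods[o];
}

/* A command is accepted only if the merged tree lists it. */
bool ex_is_supported(const struct ex_spec_tree *tree,
		     unsigned int obj, unsigned int method)
{
	if (obj >= EX_MAX_OBJECTS || method >= 32)
		return false;
	return (tree->methods[obj] >> method) & 1u;
}
```

Because support is decided by a table lookup in the merged tree, an unsupported object or method is rejected before any handler runs.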
      
      This is based on the idea of
      Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      118620d3
    • M
      IB/core: Add DEVICE object and root tree structure · 09e3ebf8
      Matan Barak committed
      This adds the DEVICE object. This object supports creating the context
      that all objects are created from. Moreover, it supports executing
      methods which are related to the device itself, such as QUERY_DEVICE.
      This is a singleton object (per file instance).
      
      All standard objects are put in the root structure. This root will later
      on be used in drivers as the source for their whole parsing tree.
      Later on, when new features are added, these drivers could mix this root
      with other customized objects.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      09e3ebf8
    • M
      IB/core: Declare an object instead of declaring only type attributes · 5009010f
      Matan Barak committed
      Switch all uverbs_type_attrs_xxxx with DECLARE_UVERBS_OBJECT
      macros. This will later be used in order to embed the object
      specific methods in the objects as well.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      5009010f
    • M
      IB/core: Add new ioctl interface · fac9658c
      Matan Barak committed
      In this ioctl interface, processing a command starts by reading the
      properties of the command and fetching the appropriate user objects
      before calling the handler.
      
      Parsing and validation are done according to a specifier declared by
      the driver's code. In the driver, all supported objects are declared.
      These objects are separated into different object namespaces. Dividing
      objects into namespaces is done at initialization by using the higher
      bits of the object ids. This initialization can mix objects declared
      in different places into one parsing tree used by this ioctl interface.
      
      For each object we list all supported methods. Like objects,
      methods are separated into method namespaces, and namespacing is done
      in the same way as for objects. This could be used in order to add
      methods to an existing object.
      
      Each method has a specific handler, which could be either a default
      handler or a driver-specific handler.
      Along with the handler, a set of attributes is specified as well.
      Like objects and methods, attributes are namespaced and hashed
      by their ids at initialization too. All supported attributes are
      subject to automatic fetching and validation. These attributes include
      the command, response and the method's related objects' ids.
      
      When these entities (objects, methods and attributes) are used, the
      high bits of the entity's id are used to calculate the hash
      bucket index. Then, these high bits are masked out in order to have a
      zero-based index. Since we use these high bits for both bucketing and
      namespacing, we get a compact representation and O(1) array access.
      This is mandatory for efficient dispatching.
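The id split described above can be illustrated with a small C sketch. The 16-bit id width, the 4-bit namespace field and all ex_ names are assumptions chosen for the example, not values taken from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: the top 4 bits of a 16-bit id select the namespace. */
#define EX_ID_NS_SHIFT 12
#define EX_ID_NS_MASK  0xF000u
#define EX_NUM_NS      2 /* 0 = default (core), 1 = driver */

struct ex_bucket {
	size_t num;
	const char *const *names;
};

/* High bits: which bucket (namespace) the id belongs to. */
unsigned int ex_id_ns(uint16_t id)
{
	return (id & EX_ID_NS_MASK) >> EX_ID_NS_SHIFT;
}

/* Low bits: zero-based index inside that bucket. */
unsigned int ex_id_index(uint16_t id)
{
	return id & ~EX_ID_NS_MASK;
}

/* O(1) dispatch: two array accesses, no searching. */
const char *ex_lookup(const struct ex_bucket *buckets, uint16_t id)
{
	unsigned int ns = ex_id_ns(id);
	unsigned int idx = ex_id_index(id);

	assert(buckets != NULL);
	if (ns >= EX_NUM_NS || idx >= buckets[ns].num)
		return NULL;
	return buckets[ns].names[idx];
}

static const char *const ex_core_methods[] = { "create", "destroy", "query" };
static const char *const ex_driver_methods[] = { "driver_op" };

const struct ex_bucket ex_sample_buckets[EX_NUM_NS] = {
	{ 3, ex_core_methods },
	{ 1, ex_driver_methods },
};
```

With this layout, id 0x0002 resolves to the third core entry and id 0x1000 to the first driver entry, while an id whose namespace or index is out of range yields NULL.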
      
      Each attribute has a type (PTR_IN, PTR_OUT, IDR and FD) and a length.
      Attributes can be validated against properties such as:
      (*) Minimum size / Exact size
      (*) Fops for FD
      (*) Object type for IDR
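A minimal sketch of such per-attribute validation follows. The structures, the flag and the function are hypothetical, and the fops check for FD attributes is reduced to the same type-tag comparison used for IDR.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum ex_attr_type { EX_ATTR_PTR_IN, EX_ATTR_PTR_OUT, EX_ATTR_IDR, EX_ATTR_FD };

/* Assumed flag: len is a minimum size instead of an exact size. */
#define EX_ATTR_F_MIN_SZ (1u << 0)

/* What the method declares it accepts. */
struct ex_attr_spec {
	enum ex_attr_type type;
	uint16_t len;
	unsigned int flags;
	uint16_t obj_type; /* expected object type for IDR/FD */
};

/* What user-space supplied (obj_type is that of the resolved object). */
struct ex_attr {
	uint16_t len;
	uint16_t obj_type;
};

/* Returns 0 if the attribute matches its spec, -EINVAL otherwise. */
int ex_validate_attr(const struct ex_attr_spec *spec, const struct ex_attr *attr)
{
	assert(spec != NULL && attr != NULL);
	switch (spec->type) {
	case EX_ATTR_PTR_IN:
	case EX_ATTR_PTR_OUT:
		if (spec->flags & EX_ATTR_F_MIN_SZ)
			return attr->len >= spec->len ? 0 : -EINVAL;
		return attr->len == spec->len ? 0 : -EINVAL;
	case EX_ATTR_IDR:
	case EX_ATTR_FD:
		return attr->obj_type == spec->obj_type ? 0 : -EINVAL;
	}
	return -EINVAL;
}
```

Keeping these checks in one generic function is what lets handlers assume that every attribute they receive already has a sane size and refers to the right kind of object.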
      
      If an IDR/fd attribute is specified, the kernel also states the object
      type and the required access (NEW, WRITE, READ or DESTROY).
      All uobject/fd management is done automatically by the infrastructure:
      it fails concurrent commands when at least one of them requires
      exclusive access (WRITE/DESTROY), synchronizes actions with device
      removals (dissociate context events) and takes care of reference
      counting (increase/decrease) for concurrently invoked actions.
      The reference counts on the actual kernel objects
      shall be handled by the handlers.
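One way to picture the "fail rather than wait" rule for conflicting access is a try-lock on a per-object counter. The sketch below uses C11 atomics in user space. The counter encoding (0 idle, positive for shared users, -1 for an exclusive holder) and the ex_ names are assumptions made for this example and are not quoted from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* usecnt: 0 = idle, >0 = number of shared (READ) users, -1 = exclusive holder. */
struct ex_uobject {
	atomic_int usecnt;
};

/* Never blocks: a conflicting command gets -EBUSY immediately. */
int ex_try_lock(struct ex_uobject *uobj, bool exclusive)
{
	int cur;

	assert(uobj != NULL);
	if (exclusive) {
		int expected = 0;

		/* WRITE/DESTROY: only succeeds when nobody else holds the object. */
		return atomic_compare_exchange_strong(&uobj->usecnt, &expected, -1)
			? 0 : -EBUSY;
	}
	/* READ: join the other readers unless an exclusive holder exists. */
	cur = atomic_load(&uobj->usecnt);
	while (cur != -1) {
		if (atomic_compare_exchange_weak(&uobj->usecnt, &cur, cur + 1))
			return 0;
	}
	return -EBUSY;
}

void ex_unlock(struct ex_uobject *uobj, bool exclusive)
{
	if (exclusive)
		atomic_store(&uobj->usecnt, 0);
	else
		atomic_fetch_sub(&uobj->usecnt, 1);
}
```

Two READ commands can hold the object together, whereas a WRITE or DESTROY that arrives while any other command holds it is rejected instead of being queued.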
      
       objects
      +--------+
      |        |
      |        |   methods                                                                +--------+
      |        |   ns         method      method_spec                           +-----+   |len     |
      +--------+  +------+[d]+-------+   +----------------+[d]+------------+    |attr1+-> |type    |
      | object +> |method+-> | spec  +-> +  attr_buckets  +-> |default_chain+--> +-----+   |idr_type|
      +--------+  +------+   |handler|   |                |   +------------+    |attr2|   |access  |
      |        |  |      |   +-------+   +----------------+   |driver chain|    +-----+   +--------+
      |        |  |      |                                    +------------+
      |        |  +------+
      |        |
      |        |
      |        |
      |        |
      |        |
      |        |
      |        |
      |        |
      |        |
      |        |
      +--------+
      
      [d] = Hash ids to groups using the high order bits
      
      The right types table is also chosen by using the high bits from
      the ids. Currently we have either default or driver specific groups.
      
      Once validation and object fetching (or creation) have completed, we
      call the handler:
      int (*handler)(struct ib_device *ib_dev, struct ib_uverbs_file *ufile,
                     struct uverbs_attr_bundle *ctx);
      
      ctx bundles attributes of different namespaces. Each element there
      is an array of attributes which corresponds to one namespace of
      attributes. For example, in the common case:
      
       ctx                               core
      +----------------------------+     +------------+
      | core:                      +---> | valid      |
      +----------------------------+     | cmd_attr   |
      | driver:                    |     +------------+
      +----------------------------+--+  | valid      |
                                      |  | cmd_attr   |
                                      |  +------------+
                                      |  | valid      |
                                      |  | obj_attr   |
                                      |  +------------+
                                      |
                                      |  drivers
                                      |  +------------+
                                      +> | valid      |
                                         | cmd_attr   |
                                         +------------+
                                         | valid      |
                                         | cmd_attr   |
                                         +------------+
                                         | valid      |
                                         | obj_attr   |
                                         +------------+
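To make the bundle layout concrete, here is a hedged sketch of a handler reading one required core attribute and one optional driver attribute. The structures are simplified stand-ins for the ones in the diagram (a per-attribute valid flag plus a single data word), and every ex_ name and id value is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_ID_NS_SHIFT 12
#define EX_ID_NS_MASK  0xF000u

struct ex_attr {
	bool valid;    /* set only if user-space supplied it and it validated */
	uint64_t data; /* stand-in for the cmd/obj payload */
};

/* One array of attributes per namespace (core, driver). */
struct ex_attr_array {
	size_t num_attrs;
	struct ex_attr *attrs;
};

struct ex_attr_bundle {
	size_t num_buckets;
	struct ex_attr_array *hash;
};

/* Returns the attribute if it exists and is valid, otherwise NULL. */
const struct ex_attr *ex_attr_get(const struct ex_attr_bundle *ctx, uint16_t id)
{
	size_t ns = (id & EX_ID_NS_MASK) >> EX_ID_NS_SHIFT;
	size_t idx = id & ~EX_ID_NS_MASK;

	assert(ctx != NULL);
	if (ns >= ctx->num_buckets || idx >= ctx->hash[ns].num_attrs)
		return NULL;
	return ctx->hash[ns].attrs[idx].valid ? &ctx->hash[ns].attrs[idx] : NULL;
}

/* Example handler: core attribute 0 is mandatory, driver attribute 0 optional. */
int ex_handler(const struct ex_attr_bundle *ctx, uint64_t *out)
{
	const struct ex_attr *cmd = ex_attr_get(ctx, 0x0000);
	const struct ex_attr *drv = ex_attr_get(ctx, 0x1000);

	if (cmd == NULL)
		return -EINVAL;
	*out = cmd->data + (drv != NULL ? drv->data : 0);
	return 0;
}
```

The handler never parses user memory itself; it only asks the bundle whether a given id is present and valid.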
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      fac9658c
    • A
      RDMA/vmw_pvrdma: Report network header type in WC · 72f9b089
      Aditya Sarwade committed
      We should report the network header type in the work completion so that
      the kernel can infer the right RoCE type headers.
      Reviewed-by: Bryan Tan <bryantan@vmware.com>
      Signed-off-by: Aditya Sarwade <asarwade@vmware.com>
      Signed-off-by: Adit Ranadive <aditr@vmware.com>
      Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      72f9b089
    • B
      pinctrl: Add sleep related state to indicate sleep related configs · 6606bc9d
      Baolin Wang committed
      In some scenarios, certain pins must be configured as
      input/output/pullup/pulldown for deep sleep mode. When the system then
      goes into deep sleep mode, these pins are set automatically by hardware.
      
      That means some pins are not controlled by any specific driver in the OS, but
      need to be controlled when entering sleep mode. Thus we introduce one sleep
      state config into pinconf-generic for users to configure.
      Signed-off-by: Baolin Wang <baolin.wang@spreadtrum.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      6606bc9d
    • H
      net/mlx5: Remove the flag MLX5_INTERFACE_STATE_SHUTDOWN · 10a8d007
      Huy Nguyen committed
      MLX5_INTERFACE_STATE_SHUTDOWN is not used in the code.
      
      Fixes: 5fc7197d ("net/mlx5: Add pci shutdown callback")
      Signed-off-by: Huy Nguyen <huyn@mellanox.com>
      Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      10a8d007
    • H
      net/mlx5: Skip mlx5_unload_one if mlx5_load_one fails · b3cb5388
      Huy Nguyen committed
      There is an issue when the firmware fails during mlx5_load_one:
      the health_care timer detects the failure and schedules a health_care
      call. Then mlx5_load_one detects the failure, cleans up and quits.
      After that, health_care runs and calls mlx5_unload_one to clean up
      resources that no longer exist, which causes a kernel panic.
      
      The root cause is that the bit MLX5_INTERFACE_STATE_DOWN is not set
      after mlx5_load_one fails. The solution is to remove the bit
      MLX5_INTERFACE_STATE_DOWN and to quit mlx5_unload_one early if the
      bit MLX5_INTERFACE_STATE_UP is not set. The bit MLX5_INTERFACE_STATE_DOWN
      is redundant and we can use MLX5_INTERFACE_STATE_UP instead.
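The guard described above can be sketched as follows. This is a user-space illustration with hypothetical ex_ names and a plain bit flag. The teardowns counter exists only to make the behavior observable, and the real driver's locking is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_INTERFACE_STATE_UP (1UL << 0)

struct ex_dev {
	unsigned long intf_state;
	bool resources_allocated;
	int teardowns; /* how many times a real teardown ran */
};

/* On failure the load path cleans up after itself and never sets UP. */
int ex_load_one(struct ex_dev *dev, bool fw_ok)
{
	assert(dev != NULL);
	dev->resources_allocated = true;
	if (!fw_ok) {
		dev->resources_allocated = false;
		return -EIO;
	}
	dev->intf_state |= EX_INTERFACE_STATE_UP;
	return 0;
}

/* Unload is a no-op unless the interface actually came up. */
int ex_unload_one(struct ex_dev *dev)
{
	assert(dev != NULL);
	if (!(dev->intf_state & EX_INTERFACE_STATE_UP))
		return 0; /* nothing to tear down */
	dev->intf_state &= ~EX_INTERFACE_STATE_UP;
	dev->resources_allocated = false;
	dev->teardowns++;
	return 0;
}
```

A single UP bit is enough here: "not up" covers both "never loaded" and "already unloaded", which is why a separate DOWN bit adds no information.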
      
      Fixes: 5fc7197d ("net/mlx5: Add pci shutdown callback")
      Signed-off-by: Huy Nguyen <huyn@mellanox.com>
      Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      b3cb5388