1. 18 January 2017, 4 commits
  2. 12 January 2017, 2 commits
  3. 08 January 2017, 1 commit
    • mm: workingset: fix use-after-free in shadow node shrinker · ea07b862
      Committed by Johannes Weiner
      Several people reported warnings about inconsistent radix tree
      nodes followed by crashes in the workingset code, which all looked
      like use-after-free accesses from the shadow node shrinker.
      
      Dave Jones managed to reproduce the issue with a debug patch applied,
      which confirmed that the radix tree shrinking indeed frees shadow nodes
      while they are still linked to the shadow LRU:
      
        WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
        CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
        Call Trace:
           delete_node+0x1e4/0x200
           __radix_tree_delete_node+0xd/0x10
           shadow_lru_isolate+0xe6/0x220
           __list_lru_walk_one.isra.4+0x9b/0x190
           list_lru_walk_one+0x23/0x30
           scan_shadow_nodes+0x2e/0x40
           shrink_slab.part.44+0x23d/0x5d0
           shrink_node+0x22c/0x330
           kswapd+0x392/0x8f0
      
      This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the
      inlined radix_tree_shrink().
      
      The problem is with 14b46879 ("mm: workingset: move shadow entry
      tracking to radix tree exceptional tracking"), which passes an update
      callback into the radix tree to link and unlink shadow leaf nodes when
      tree entries change, but forgot to pass the callback when reclaiming a
      shadow node.
      
      While the reclaimed shadow node itself is unlinked by the shrinker, its
      deletion from the tree can cause the left-most leaf node in the tree to
      be shrunk.  If that happens to be a shadow node as well, we don't unlink
      it from the LRU as we should.
      
      Consider this tree, where the s are shadow entries:
      
             root->rnode
                  |
             [0       n]
              |       |
           [s    ] [sssss]
      
      Now the shadow node shrinker reclaims the rightmost leaf node through
      the shadow node LRU:
      
             root->rnode
                  |
             [0        ]
              |
          [s     ]
      
      Because the parent of the deleted node is the first level below the
      root and has only one child in the left-most slot, the intermediate
      level is shrunk and the node containing the single shadow is put in
      its place:
      
             root->rnode
                  |
             [s        ]
      
      The shrinker again sees a single left-most slot in a first level node
      and thus decides to store the shadow in root->rnode directly and free
      the node - which is a leaf node on the shadow node LRU.
      
             root->rnode
                  |
                  s
      
      Without the update callback, the freed node remains on the shadow LRU,
      where it causes later shrinker runs to crash.
      
      Pass the node updater callback into __radix_tree_delete_node() in case
      the deletion causes the left-most branch in the tree to collapse too.
      
      Also add warnings when linked nodes are freed right away, rather
      than waiting for the use-after-free when the list is scanned much
      later.
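
      As a hedged sketch of the shape of the fix, paraphrased from the
      description above rather than copied from the patch: the deletion
      helper takes the same update callback the other tree operations
      already receive (assumed here to be the workingset_update_node
      helper referenced by the earlier commit), and the shrinker passes
      it in:

        void __radix_tree_delete_node(struct radix_tree_root *root,
                                      struct radix_tree_node *node,
                                      radix_tree_update_node_t update_node,
                                      void *private);

        /* in the shadow node shrinker: delete the reclaimed node and let
         * the callback unlink any left-most leaf that collapses with it */
        __radix_tree_delete_node(&mapping->page_tree, node,
                                 workingset_update_node, mapping);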
      
      Fixes: 14b46879 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking")
      Reported-by: Dave Chinner <david@fromorbit.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 07 January 2017, 1 commit
  5. 05 January 2017, 1 commit
    • vfio-mdev: fix non-standard ioctl return val causing i386 build fail · c6ef7fd4
      Committed by Paul Gortmaker
      What appears to be a copy-and-paste error from the line above gave
      the ioctl an ssize_t return value instead of the traditional "int".
      
      The associated sample code used "long" which meant it would compile
      for x86-64 but not i386, with the latter failing as follows:
      
        CC [M]  samples/vfio-mdev/mtty.o
      samples/vfio-mdev/mtty.c:1418:20: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
        .ioctl          = mtty_ioctl,
                          ^
      samples/vfio-mdev/mtty.c:1418:20: note: (near initialization for ‘mdev_fops.ioctl’)
      cc1: some warnings being treated as errors
      
      In this case, vfio is working with struct file_operations, which uses:
      
          long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
          long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
      
      ...and so here we just standardize on "long", rather than the plain
      "int" that user space typically sees and that "man ioctl" and
      similar documentation describe.
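
      A minimal before/after sketch (the handler name comes from the
      sample driver above; the argument list is illustrative):

        /* before: incompatible pointer type on i386, where ssize_t is not long */
        static ssize_t mtty_ioctl(struct mdev_device *mdev, unsigned int cmd,
                                  unsigned long arg);

        /* after: matches the long-returning file_operations handlers */
        static long mtty_ioctl(struct mdev_device *mdev, unsigned int cmd,
                               unsigned long arg);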
      
      Fixes: 9d1a546c ("docs: Sample driver to demonstrate how to use Mediated device framework.")
      Cc: Kirti Wankhede <kwankhede@nvidia.com>
      Cc: Neo Jia <cjia@nvidia.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  6. 31 December 2016, 1 commit
    • iio: accel: st_accel: fix LIS3LV02 reading and scaling · 65e4345c
      Committed by Linus Walleij
      The LIS3LV02 has a special bit that needs to be set to get the
      read values left-aligned. Before this patch we got gibberish
      like this:
      
      iio_generic_buffer -a -c10 -n lis3lv02dl_accel
      (...)
      0.000000 -0.010042 -0.642688 19155832931907
      0.000000 -0.010042 -0.642688 19155858751073
      
      This is because we read a raw value for 1g as 64, which is the
      nominal 1024 for 1g shifted 4 bits to the right by being
      right-aligned rather than left-aligned.
      
      Since all the other sensors are left-aligned, add some code to
      set the special DAS (data alignment setting) bit to 1 so that
      the correct value is read, like this:
      
      iio_generic_buffer -a -c10 -n lis3lv02dl_accel
      (...)
      0.000000 -0.147095 -10.120135 24761614364956
      -0.029419 -0.176514 -10.120135 24761631624540
      
      The scaling was weird as well: we had a gain of 1000 for 1g
      and 3000 for 6g. I don't even remember how I came up with the
      old values, but they are wrong.
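
      A hedged sketch of what setting that bit looks like; the register
      address, mask, and helper name below are assumptions for
      illustration, not lifted from the patch:

        #define LIS3LV02_CTRL_REG2  0x21  /* assumed DAS register */
        #define LIS3LV02_DAS_MASK   0x01  /* assumed DAS bit */

        /* DAS=1: deliver left-aligned samples like the other parts */
        err = st_sensors_write_data_with_mask(indio_dev,
                                              LIS3LV02_CTRL_REG2,
                                              LIS3LV02_DAS_MASK, 1);
        if (err < 0)
                return err;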
      
      Fixes: 3acddf74 ("iio: st-sensors: add support for lis3lv02d accelerometer")
      Cc: Lorenzo Bianconi <lorenzo.bianconi@st.com>
      Cc: Giuseppe Barba <giuseppe.barba@st.com>
      Cc: Denis Ciocca <denis.ciocca@st.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Jonathan Cameron <jic23@kernel.org>
  7. 30 December 2016, 5 commits
    • vfio-mdev: Make mdev_device private and abstract interfaces · 99e3123e
      Committed by Alex Williamson
      Abstract access to mdev_device so that we can define which interfaces
      are public rather than relying on comments in the structure.
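
      An illustrative sketch of the pattern; the accessor names here are
      assumptions, shown only to make "abstract access" concrete:

        struct mdev_device;                 /* now opaque to vendor drivers */

        struct device *mdev_dev(struct mdev_device *mdev);
        void *mdev_get_drvdata(struct mdev_device *mdev);
        void mdev_set_drvdata(struct mdev_device *mdev, void *data);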
      
      Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
      Cc: Zhi Wang <zhi.a.wang@intel.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Jike Song <jike.song@intel.com>
      Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
    • vfio-mdev: Make mdev_parent private · 9372e6fe
      Committed by Alex Williamson
      Rather than hoping for good behavior by marking some elements
      internal, enforce it by making the entire structure private and
      creating an accessor function for the one useful external field.
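
      A sketch of the single accessor the text describes (the function
      name is an assumption for illustration):

        /* the one useful external field: the parent struct device */
        struct device *mdev_parent_dev(struct mdev_device *mdev);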
      
      Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
      Cc: Zhi Wang <zhi.a.wang@intel.com>
      Cc: Jike Song <jike.song@intel.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
    • vfio-mdev: de-polute the namespace, rename parent_device & parent_ops · 42930553
      Committed by Alex Williamson
      Add an mdev_ prefix so we're not polluting the namespace so much.
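
      Concretely, the rename pattern from the subject line (assuming the
      mdev_ prefix lands on both structures):

        struct parent_device  ->  struct mdev_parent
        struct parent_ops     ->  struct mdev_parent_ops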
      
      Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
      Cc: Zhi Wang <zhi.a.wang@intel.com>
      Cc: Jike Song <jike.song@intel.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
    • net/mlx4_core: Fix raw qp flow steering rules under SRIOV · 10b1c04e
      Committed by Jack Morgenstein
      Demoting simple flow steering rule priority (for DPDK) was achieved by
      wrapping FW commands MLX4_QP_FLOW_STEERING_ATTACH/DETACH for the PF
      as well, and forcing the priority to MLX4_DOMAIN_NIC in the wrapper
      function for the PF and all VFs.
      
      In function mlx4_ib_create_flow(), this change caused the main rule
      creation for the PF to be wrapped, while it left the associated
      tunnel steering rule creation unwrapped for the PF.
      
      This mismatch caused rule deletion failures in mlx4_ib_destroy_flow()
      for the PF when the detach wrapper function did not find the associated
      tunnel-steering rule (since creation of that rule for the PF did not
      go through the wrapper function).
      
      Fix this by setting MLX4_QP_FLOW_STEERING_ATTACH/DETACH to be "native"
      (so that the PF invocation does not go through the wrapper), and perform
      the required priority demotion for the PF in the mlx4_ib_create_flow()
      code path.
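
      A hedged sketch of the attach side after this change; the argument
      list is abbreviated and illustrative, the point being only the FW
      command opcode and the MLX4_CMD_NATIVE flag:

        /* issued natively for PF and VFs alike, bypassing the wrapper;
         * priority demotion now happens in mlx4_ib_create_flow() itself */
        err = mlx4_cmd_imm(dev, mailbox->dma, &reg_id, size >> 2, 0,
                           MLX4_QP_FLOW_STEERING_ATTACH,
                           MLX4_CMD_TIME_CLASS_A, MLX4_CMD_NATIVE);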
      
      Fixes: 48564135 ("net/mlx4_core: Demote simple multicast and broadcast flow steering rules")
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mm: optimize PageWaiters bit use for unlock_page() · b91e1302
      Committed by Linus Torvalds
      In commit 62906027 ("mm: add PageWaiters indicating tasks are
      waiting for a page bit") Nick Piggin made our page locking no longer
      unconditionally touch the hashed page waitqueue, which not only helps
      performance in general, but is particularly helpful on NUMA machines
      where the hashed wait queues can bounce around a lot.
      
      However, the "clear lock bit atomically and then test the waiters bit"
      sequence turns out to be much more expensive than it needs to be,
      because you get a nasty stall when trying to access the same word that
      just got updated atomically.
      
      On architectures where locking is done with LL/SC, this would be trivial
      to fix with a new primitive that clears one bit and tests another
      atomically, but that ends up not working on x86, where the only atomic
      operations that return the result end up being cmpxchg and xadd.  The
      atomic bit operations return the old value of the same bit we changed,
      not the value of an unrelated bit.
      
      On x86, we could put the lock bit in the high bit of the byte, and use
      "xadd" with that bit (where the overflow ends up not touching other
      bits), and look at the other bits of the result.  However, an even
      simpler model is to just use a regular atomic "and" to clear the lock
      bit, and then the sign bit in eflags will indicate the resulting state
      of the unrelated bit #7.
      
      So by moving the PageWaiters bit up to bit #7, we can atomically clear
      the lock bit and test the waiters bit on x86 too.  And on
      architectures with LL/SC (all the usual RISC suspects), the
      particular bit doesn't matter, so they are fine with this approach
      too.
      
      This avoids the extra access to the same atomic word, and thus avoids
      the costly stall at page unlock time.
      
      The only downside is that the interface ends up being a bit odd and
      specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
      love the resulting name of the new primitive, but I'd rather make the
      name be descriptive and very clear about the limitation imposed by
      trying to work across all relevant architectures than make it be some
      generic thing that doesn't make the odd semantics explicit.
      
      So this introduces the new architecture primitive
      
          clear_bit_unlock_is_negative_byte();
      
      and adds the trivial implementation for x86.  We have a generic
      non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
      combination) which can be overridden by any architecture that can do
      better.  According to Nick, Power has the same hiccup x86 has, for
      example, but some other architectures may not even care.
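
      As a sketch, the generic fallback described above (paraphrased, not
      a verbatim copy of the patch):

        #ifndef clear_bit_unlock_is_negative_byte
        static inline bool clear_bit_unlock_is_negative_byte(long nr,
                                                volatile unsigned long *p)
        {
                clear_bit_unlock(nr, p);  /* release the lock bit */
                return test_bit(7, p);    /* sign bit of the low byte */
        }
        #endif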
      
      All these optimizations mean that my page locking stress-test (which is
      just executing a lot of small short-lived shell scripts: "make test" in
      the git source tree) no longer makes our page locking look horribly bad.
      Before all these optimizations, the unlock_page() costs alone were
      just over 3% of all CPU overhead on "make test".  After this, it's
      down to 0.66%, so just a quarter of what it used to be.
      
      (The difference on NUMA is bigger, but there this micro-optimization is
      likely less noticeable, since the big issue on NUMA was not the accesses
      to 'struct page', but the waitqueue accesses that were already removed
      by Nick's earlier commit).
      Acked-by: Nick Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 29 December 2016, 1 commit
  9. 28 December 2016, 1 commit
  10. 27 December 2016, 1 commit
    • mm: Invalidate DAX radix tree entries only if appropriate · c6dcf52c
      Committed by Jan Kara
      Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
      just delete all exceptional radix tree entries they find. For DAX this
      is not desirable as we track cache dirtiness in these entries and when
      they are evicted, we may not flush caches although it is necessary. This
      can for example manifest when we write to the same block both via mmap
      and via write(2) (to different offsets) and fsync(2) then does not
      properly flush CPU caches when modification via write(2) was the last
      one.
      
      Create appropriate DAX functions to handle invalidation of DAX entries
      for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
      wire them up into the corresponding mm functions.
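
      A hedged sketch of that wiring; the dax_* entry-invalidation name
      is an assumption for illustration, and the split is the point: the
      sync variant must not drop dirty entries behind fsync's back:

        /* in invalidate_inode_pages2_range(), per exceptional entry */
        if (dax_mapping(mapping))
                ret = dax_invalidate_mapping_entry_sync(mapping, index);
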
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  11. 26 December 2016, 5 commits
    • mm: add PageWaiters indicating tasks are waiting for a page bit · 62906027
      Committed by Nicholas Piggin
      Add a new page flag, PageWaiters, to indicate the page waitqueue has
      tasks waiting. This can be tested rather than testing waitqueue_active
      which requires another cacheline load.
      
      This bit is always set when the page has tasks on page_waitqueue(page),
      and is set and cleared under the waitqueue lock. It may be set when
      there are no tasks on the waitqueue, which will cause a harmless
      extra wakeup check that will clear the bit.
      
      The generic bit-waitqueue infrastructure is no longer used for pages.
      Instead, waitqueues are used directly with a custom key type. The
      generic code was not flexible enough to have PageWaiters manipulation
      under the waitqueue lock (which simplifies concurrency).
      
      This improves the performance of page lock intensive microbenchmarks by
      2-3%.
      
      Putting two bits in the same word opens the opportunity to remove the
      memory barrier between clearing the lock bit and testing the waiters
      bit, after some work on the arch primitives (e.g., ensuring memory
      operand widths match and cover both bits).
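
      A simplified sketch of the unlock side this enables (compound-page
      handling and the locked-page assertion omitted):

        void unlock_page(struct page *page)
        {
                clear_bit_unlock(PG_locked, &page->flags);
                smp_mb__after_atomic();
                /* only touch the hashed waitqueue when someone waits */
                if (PageWaiters(page))
                        wake_up_page_bit(page, PG_locked);
        }
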
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked · 6326fec1
      Committed by Nicholas Piggin
      A page is not added to the swap cache without being swap backed,
      so PageSwapBacked mappings can use PG_owner_priv_1 for PageSwapCache.
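
      A sketch of the resulting test, paraphrased from the description:

        static __always_inline int PageSwapCache(struct page *page)
        {
                /* PG_swapcache aliases PG_owner_priv_1 and is only
                 * meaningful on swap-backed pages */
                return PageSwapBacked(page) &&
                       test_bit(PG_swapcache, &page->flags);
        }
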
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ktime: Get rid of ktime_equal() · 1f3a8e49
      Committed by Thomas Gleixner
      No point in going through loops and hoops instead of just comparing the
      values.
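
      For example (variable names illustrative; ktime_t is a plain scalar
      after the union removal below):

        /* before */
        if (ktime_equal(next, expires))
                restart_timer();

        /* after */
        if (next == expires)
                restart_timer();
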
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • ktime: Cleanup ktime_set() usage · 8b0e1953
      Committed by Thomas Gleixner
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where the seconds and nanoseconds parts of a
      time value need to be converted. For anything where the seconds
      argument is 0, it is pointless and can be replaced with a simple
      assignment.
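
      For example (the timeout value is illustrative):

        /* before */
        kt = ktime_set(0, 10 * NSEC_PER_MSEC);

        /* after: the seconds argument was 0, so just assign */
        kt = 10 * NSEC_PER_MSEC;
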
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • ktime: Get rid of the union · 2456e855
      Committed by Thomas Gleixner
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64-bit machines and in an endianness-optimized
      timespec variant on 32-bit machines. The Y2038 cleanup removed the
      timespec variant and switched everything to scalar nanoseconds. The
      union remained, but became completely pointless.
      
      Get rid of the union and just keep ktime_t as simple typedef of type s64.
      
      The conversion was done with coccinelle and some manual mopping up.
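
      The resulting shape, with the tv64 member as it existed in the union:

        /* before */
        union ktime {
                s64     tv64;
        };
        typedef union ktime ktime_t;

        /* after */
        typedef s64 ktime_t;
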
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
  12. 25 December 2016, 9 commits
  13. 24 December 2016, 2 commits
  14. 23 December 2016, 2 commits
  15. 21 December 2016, 2 commits
  16. 20 December 2016, 2 commits