1. 11 January 2017 (1 commit)
  2. 30 December 2016 (1 commit)
    • mm: optimize PageWaiters bit use for unlock_page() · b91e1302
      By Linus Torvalds
      In commit 62906027 ("mm: add PageWaiters indicating tasks are
      waiting for a page bit") Nick Piggin made our page locking no longer
      unconditionally touch the hashed page waitqueue, which not only helps
      performance in general, but is particularly helpful on NUMA machines
      where the hashed wait queues can bounce around a lot.
      
      However, the "clear lock bit atomically and then test the waiters bit"
      sequence turns out to be much more expensive than it needs to be,
      because you get a nasty stall when trying to access the same word that
      just got updated atomically.
      
      On architectures where locking is done with LL/SC, this would be trivial
      to fix with a new primitive that clears one bit and tests another
      atomically, but that ends up not working on x86, where the only atomic
      operations that return the result end up being cmpxchg and xadd.  The
      atomic bit operations return the old value of the same bit we changed,
      not the value of an unrelated bit.
      
      On x86, we could put the lock bit in the high bit of the byte, and use
      "xadd" with that bit (where the overflow ends up not touching other
      bits), and look at the other bits of the result.  However, an even
      simpler model is to just use a regular atomic "and" to clear the lock
      bit, and then the sign bit in eflags will indicate the resulting state
      of the unrelated bit #7.
      
      So by moving the PageWaiters bit up to bit #7, we can atomically clear
      the lock bit and test the waiters bit on x86 too.  And on architectures
      with LL/SC (all the usual RISC suspects), the particular bit doesn't
      matter, so they are fine with this approach too.
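
      As a standalone illustration of that x86 idea (purely illustrative, not
      the kernel's actual code; it needs GCC 6+ or clang for the "=@ccs"
      flag-output constraint, and hard-codes the waiters bit at bit #7):

        #include <stdbool.h>
        #include <stdio.h>

        /* Atomically clear bit @nr in the low byte of @addr and report
         * whether bit #7 of that byte is set afterwards, read straight
         * from the sign flag that "lock andb" leaves behind. */
        static inline bool clear_bit_test_sign(long nr, volatile unsigned long *addr)
        {
                bool negative;
                __asm__ volatile("lock andb %2,%0"
                                 : "+m" (*(volatile char *)addr), "=@ccs" (negative)
                                 : "ir" ((char) ~(1 << nr))
                                 : "memory");
                return negative;
        }

        int main(void)
        {
                volatile unsigned long word = (1UL << 7) | (1UL << 0); /* waiters + lock */
                printf("waiters pending: %d\n", clear_bit_test_sign(0, &word));
                return 0;
        }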
      
      This avoids the extra access to the same atomic word, and thus avoids
      the costly stall at page unlock time.
      
      The only downside is that the interface ends up being a bit odd and
      specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
      love the resulting name of the new primitive, but I'd rather make the
      name be descriptive and very clear about the limitation imposed by
      trying to work across all relevant architectures than make it be some
      generic thing that doesn't make the odd semantics explicit.
      
      So this introduces the new architecture primitive
      
          clear_bit_unlock_is_negative_byte();
      
      and adds the trivial implementation for x86.  We have a generic
      non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
      combination) which can be overridden by any architecture that can do
      better.  According to Nick, Power has the same hiccup x86 has, for
      example, but some other architectures may not even care.
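
      A kernel-style sketch of that generic fallback, assuming the waiters bit
      sits at bit #7 as described above (clear_bit_unlock() and test_bit() are
      the kernel's existing bitops):

        /* Non-optimized fallback: the clear and the test are two separate
         * accesses to the same word, which is exactly the extra cost the
         * architecture-specific versions avoid. */
        #ifndef clear_bit_unlock_is_negative_byte
        static inline bool clear_bit_unlock_is_negative_byte(long nr,
                                                             volatile unsigned long *mem)
        {
                clear_bit_unlock(nr, mem);      /* release the lock bit */
                return test_bit(7, mem);        /* then test the waiters bit */
        }
        #endif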
      
      All these optimizations mean that my page locking stress-test (which is
      just executing a lot of small short-lived shell scripts: "make test" in
      the git source tree) no longer makes our page locking look horribly bad.
      Before all these optimizations, the unlock_page() costs alone were just
      over 3% of all CPU overhead on "make test".  After this, it's down to
      0.66%, about a quarter of the cost it used to be.
      
      (The difference on NUMA is bigger, but there this micro-optimization is
      likely less noticeable, since the big issue on NUMA was not the accesses
      to 'struct page', but the waitqueue accesses that were already removed
      by Nick's earlier commit).
      Acked-by: Nick Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 28 December 2016 (2 commits)
  4. 27 December 2016 (1 commit)
    • mm: Invalidate DAX radix tree entries only if appropriate · c6dcf52c
      By Jan Kara
      Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
      just delete all exceptional radix tree entries they find. For DAX this
      is not desirable as we track cache dirtiness in these entries and when
      they are evicted, we may not flush caches although it is necessary. This
      can manifest, for example, when we write to the same block both via mmap
      and via write(2) (to different offsets), and fsync(2) then does not
      properly flush CPU caches because the modification via write(2) was the
      last one.
      
      Create appropriate DAX functions to handle invalidation of DAX entries
      for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
      wire them up into the corresponding mm functions.
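
      A simplified sketch of the rule those new DAX functions implement,
      against the 4.9-era radix tree API (the helper name is illustrative, and
      the real code also handles entry locking):

        /* Delete a DAX radix tree entry only if it is clean; a dirty or
         * to-be-written entry still owes a CPU cache flush and must stay. */
        static int dax_try_invalidate_entry(struct address_space *mapping,
                                            pgoff_t index)
        {
                int ret = 0;

                spin_lock_irq(&mapping->tree_lock);
                if (radix_tree_lookup(&mapping->page_tree, index) &&
                    !radix_tree_tag_get(&mapping->page_tree, index,
                                        PAGECACHE_TAG_DIRTY) &&
                    !radix_tree_tag_get(&mapping->page_tree, index,
                                        PAGECACHE_TAG_TOWRITE)) {
                        radix_tree_delete(&mapping->page_tree, index);
                        ret = 1;
                }
                spin_unlock_irq(&mapping->tree_lock);
                return ret;
        }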
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  5. 26 December 2016 (5 commits)
    • mm: add PageWaiters indicating tasks are waiting for a page bit · 62906027
      By Nicholas Piggin
      Add a new page flag, PageWaiters, to indicate the page waitqueue has
      tasks waiting. This can be tested rather than testing waitqueue_active
      which requires another cacheline load.
      
      This bit is always set when the page has tasks on page_waitqueue(page),
      and is set and cleared under the waitqueue lock. It may be set when
      there are no tasks on the waitqueue, which will cause a harmless extra
      wakeup check that will clear the bit.
      
      The generic bit-waitqueue infrastructure is no longer used for pages.
      Instead, waitqueues are used directly with a custom key type. The
      generic code was not flexible enough to have PageWaiters manipulation
      under the waitqueue lock (which simplifies concurrency).
      
      This improves the performance of page lock intensive microbenchmarks by
      2-3%.
      
      Putting two bits in the same word opens the opportunity to remove the
      memory barrier between clearing the lock bit and testing the waiters
      bit, after some work on the arch primitives (e.g., ensuring memory
      operand widths match and cover both bits).
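
      Sketched roughly, the unlock path then becomes (simplified from the
      description above; note the barrier between clearing the lock bit and
      testing the waiters bit that the follow-up arch work can remove):

        void unlock_page(struct page *page)
        {
                page = compound_head(page);
                clear_bit_unlock(PG_locked, &page->flags);
                smp_mb__after_atomic();        /* order clear vs. waiters test */
                if (PageWaiters(page))         /* only then touch the waitqueue */
                        wake_up_page_bit(page, PG_locked);
        }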
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked · 6326fec1
      By Nicholas Piggin
      A page is not added to the swap cache without being swap backed,
      so PageSwapBacked mappings can use PG_owner_priv_1 for PageSwapCache.
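
      In code, the aliasing amounts to roughly this sketch (PageSwapBacked()
      must be checked first, since PG_owner_priv_1 only means "swap cache" for
      swap-backed pages):

        #define PG_swapcache PG_owner_priv_1   /* only valid with PG_swapbacked */

        static inline int PageSwapCache(struct page *page)
        {
                return PageSwapBacked(page) &&
                       test_bit(PG_swapcache, &page->flags);
        }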
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ktime: Get rid of ktime_equal() · 1f3a8e49
      By Thomas Gleixner
      No point in going through loops and hoops instead of just comparing the
      values.
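
      An illustrative call-site conversion (variable names are hypothetical):

        /* Before: */
        if (ktime_equal(next, expires))
                return;
        /* After: ktime_t is a scalar, so compare it directly */
        if (next == expires)
                return;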
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • ktime: Cleanup ktime_set() usage · 8b0e1953
      By Thomas Gleixner
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where a seconds and a nanoseconds part of a time
      value need to be converted. For anything where the seconds argument is
      0, this is pointless and can be replaced with a simple assignment.
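
      An illustrative before/after, assuming a ktime_t variable kt:

        /* Before: the seconds argument is 0 */
        kt = ktime_set(0, 10 * NSEC_PER_MSEC);
        /* After: a plain assignment of the nanoseconds value suffices */
        kt = 10 * NSEC_PER_MSEC;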
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • ktime: Get rid of the union · 2456e855
      By Thomas Gleixner
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64-bit machines and in an endianness-optimized
      timespec variant on 32-bit machines. The Y2038 cleanup removed the
      timespec variant and switched everything to scalar nanoseconds. The
      union remained, but became completely pointless.
      
      Get rid of the union and just keep ktime_t as simple typedef of type s64.
      
      The conversion was done with coccinelle and some manual mopping up.
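
      The change boils down to the following sketch of the type definition:

        /* Before: a union with a single remaining member */
        union ktime {
                s64 tv64;
        };
        typedef union ktime ktime_t;

        /* After: plain scalar nanoseconds */
        typedef s64 ktime_t;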
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
  6. 25 December 2016 (9 commits)
  7. 24 December 2016 (1 commit)
  8. 23 December 2016 (2 commits)
  9. 21 December 2016 (5 commits)
    • ACPI / osl: Remove deprecated acpi_get_table_with_size()/early_acpi_os_unmap_memory() · 8d3523fb
      By Lv Zheng
      Since all users have been cleaned up, remove the two deprecated APIs.
      As a Linux variable rather than an ACPICA variable, acpi_gbl_permanent_mmap
      is renamed to acpi_permanent_mmap to have a consistent coding style across
      entire Linux ACPI subsystem.
      Signed-off-by: Lv Zheng <lv.zheng@intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • ACPICA: Tables: Back port acpi_get_table_with_size() and early_acpi_os_unmap_memory() from Linux kernel · 174cc718
      By Lv Zheng
      
      ACPICA commit cac6790954d4d752a083e6122220b8a22febcd07
      
      This patch backports Linux acpi_get_table_with_size() and
      early_acpi_os_unmap_memory() into ACPICA upstream to reduce divergences.
      
      The two APIs have been used by Linux as table management APIs for a long
      time; they contain hidden logic that requires the tables mapped during
      the early stage to be unmapped before the early stage ends.
      
      During the early stage, tables are handled by the following sequence:
       acpi_get_table_with_size();
       parse the table
       early_acpi_os_unmap_memory();
      During the late stage, tables are handled by the following sequence:
       acpi_get_table();
       parse the table
      Linux uses acpi_gbl_permanent_mmap to distinguish the early stage and the
      late stage.
      
      The reasoning for introducing acpi_get_table_with_size() was this: ACPICA
      remembers the early mapped pointer in acpi_get_table(), and Linux cannot
      prevent ACPICA from using that stale early mapping during the late stage,
      because ACPICA provides no API acting as an inverse of acpi_get_table()
      to forget the early mapped pointer.
      
      But how can ACPICA satisfy the early/late stage requirement? Inside
      ACPICA, tables are guaranteed to remain in the "INSTALLED" state during
      the early stage, and are carefully not transitioned to the "VALIDATED"
      state until the late stage. So the same logic is in fact already
      implemented inside ACPICA, just in a different way. The only gap is that
      the feature is not exposed to OSPMs as an accessible external API.
      
      It then is possible to fix the gap by providing an inverse of
      acpi_get_table() from ACPICA, so that the two Linux sequences can be
      combined:
       acpi_get_table();
       parse the table
       acpi_put_table();
      To ease integration with the current Linux code, acpi_get_table() and
      acpi_put_table() are implemented in a usage-count based style:
       1. When the usage count of the table is increased from 0 to 1, table is
          mapped and .Pointer is set with the mapping address (VALIDATED);
       2. When the usage count of the table is decreased from 1 to 0, .Pointer
          is unset and the mapping address is unmapped (INVALIDATED).
      This allows us to deploy the new APIs to Linux with minimal effort, by
      just invoking acpi_get_table() in acpi_get_table_with_size() and
      acpi_put_table() in early_acpi_os_unmap_memory(). Lv Zheng.
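
      A minimal sketch of that usage-count model, with hypothetical names
      (struct table_desc, map_table() and unmap_table() are illustrative
      stand-ins, not ACPICA's actual symbols):

        struct table_desc {
                unsigned int validation_count;
                void *pointer;          /* mapped address, valid while count > 0 */
                unsigned long address;  /* physical address of the table */
                size_t length;
        };

        int table_get(struct table_desc *t, void **out)
        {
                if (t->validation_count == 0) {
                        t->pointer = map_table(t->address, t->length);
                        if (!t->pointer)
                                return -1;      /* stays INSTALLED */
                }
                t->validation_count++;          /* 0 -> 1 made it VALIDATED */
                *out = t->pointer;
                return 0;
        }

        void table_put(struct table_desc *t)
        {
                if (t->validation_count == 0)
                        return;                 /* unbalanced put, ignore */
                if (--t->validation_count == 0) {
                        unmap_table(t->pointer, t->length);
                        t->pointer = NULL;      /* INVALIDATED */
                }
        }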
      
      Link: https://github.com/acpica/acpica/commit/cac67909
      Signed-off-by: Lv Zheng <lv.zheng@intel.com>
      Signed-off-by: Bob Moore <robert.moore@intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • dt: bindings: net: use boolean dt properties for eee broken modes · 308d3165
      By Jerome Brunet
      The patches regarding eee-broken-modes were merged before all the people
      involved could reach an agreement on the best way to move forward.
      
      While we agreed on having a DT property to mark particular modes as
      broken, the value used for eee-broken-modes mapped the phy register in a
      very direct way. Because of this, the concern was that it could be used
      to implement configuration policies instead of describing broken hardware.
      
      In the end, having a boolean property for each mode seems to be preferred
      over one bit field value mapping the register (too) directly.
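
      As a hedged sketch, a PHY driver could collect such boolean properties
      into a mask like this (the property names follow the binding under
      discussion; the exact list and the mapping to MDIO_EEE_* bits from
      linux/mdio.h are illustrative):

        static u32 eee_broken_modes_from_dt(struct device_node *np)
        {
                u32 broken = 0;

                if (of_property_read_bool(np, "eee-broken-100tx"))
                        broken |= MDIO_EEE_100TX;
                if (of_property_read_bool(np, "eee-broken-1000t"))
                        broken |= MDIO_EEE_1000T;
                if (of_property_read_bool(np, "eee-broken-10gt"))
                        broken |= MDIO_EEE_10GT;

                return broken;
        }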
      
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ratelimit: fix WARN_ON_RATELIMIT return value · 1b011e2f
      By Jiri Slaby
      The macro is to be used similarly to WARN_ON, as in:
      
        if (WARN_ON_RATELIMIT(condition, state))
      	do_something();
      
      One would expect only 'condition' to affect the 'if', but
      WARN_ON_RATELIMIT internally does only:
      
        WARN_ON((condition) && __ratelimit(state))
      
      So the 'if' is affected by the ratelimiting state too.  Fix this by
      returning 'condition' in any case.
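
      The shape of the fix, as a statement-expression sketch:

        /* Before: the return value is gated by __ratelimit() too */
        #define WARN_ON_RATELIMIT(condition, state)             \
                WARN_ON((condition) && __ratelimit(state))

        /* After: the splat is still ratelimited, but 'condition' is
         * always handed back to the caller */
        #define WARN_ON_RATELIMIT(condition, state)     ({      \
                bool __rtn_cond = !!(condition);                \
                WARN_ON(__rtn_cond && __ratelimit(state));      \
                __rtn_cond;                                     \
        })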
      
      Note that nobody uses WARN_ON_RATELIMIT yet, so there is nothing to
      worry about.  But I was about to use it and was a bit surprised.
      
      Link: http://lkml.kernel.org/r/20161215093224.23126-1-jslaby@suse.cz
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ima: on soft reboot, save the measurement list · 7b8589cc
      By Mimi Zohar
      The TPM PCRs are only reset on a hard reboot.  In order to validate a
      TPM's quote after a soft reboot (e.g., kexec -e), the IMA measurement
      list of the running kernel must be saved and restored on boot.
      
      This patch uses the kexec buffer passing mechanism to pass the
      serialized IMA binary_runtime_measurements to the next kernel.
      
      Link: http://lkml.kernel.org/r/1480554346-29071-7-git-send-email-zohar@linux.vnet.ibm.com
      Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
      Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
      Cc: Andreas Steffen <andreas.steffen@strongswan.org>
      Cc: Josh Sklar <sklar@linux.vnet.ibm.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 20 December 2016 (2 commits)
  11. 19 December 2016 (1 commit)
  12. 18 December 2016 (7 commits)
  13. 17 December 2016 (2 commits)
  14. 16 December 2016 (1 commit)