1. 05 September 2015, 1 commit
    • mm: send one IPI per CPU to TLB flush all entries after unmapping pages · 72b252ae
      Committed by Mel Gorman
      An IPI is sent to flush remote TLBs when a page is unmapped that was
      potentially accessed by other CPUs.  There are many circumstances where
      this happens but the obvious one is kswapd reclaiming pages belonging to a
      running process as kswapd and the task are likely running on separate
      CPUs.
      
      On small machines this is not a significant problem, but as machines get
      larger, with more cores and more memory, the cost of these IPIs can be
      high.  This patch uses a simple structure that tracks CPUs that
      potentially have TLB entries for pages being unmapped.  When the unmapping
      is complete, the full TLB is flushed on the assumption that a refill cost
      is lower than flushing individual entries.
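      
      A minimal sketch of the batching idea (the structure, field and helper
      names below are illustrative assumptions, not the patch's exact code):
      
      	#include <linux/cpumask.h>
      	#include <linux/mm_types.h>
      	#include <linux/smp.h>
      	#include <asm/tlbflush.h>
      
      	/* Tracks CPUs that may still cache TLB entries for pages being unmapped. */
      	struct tlbflush_unmap_batch {
      		struct cpumask cpumask;
      		bool flush_required;
      	};
      
      	/* While unmapping: record candidate CPUs instead of IPI-ing per page. */
      	static void note_tlb_users(struct tlbflush_unmap_batch *batch,
      				   struct mm_struct *mm)
      	{
      		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
      		batch->flush_required = true;
      	}
      
      	static void flush_local_tlb_ipi(void *unused)
      	{
      		local_flush_tlb();	/* full flush of this CPU's TLB */
      	}
      
      	/* Once unmapping is complete: one IPI per recorded CPU, full flush each. */
      	static void try_to_unmap_flush_sketch(struct tlbflush_unmap_batch *batch)
      	{
      		if (!batch->flush_required)
      			return;
      		on_each_cpu_mask(&batch->cpumask, flush_local_tlb_ipi, NULL, true);
      		cpumask_clear(&batch->cpumask);
      		batch->flush_required = false;
      	}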
      
      Architectures wishing to do this must give the following guarantee.
      
              If a clean page is unmapped and not immediately flushed, the
              architecture must guarantee that a write to that linear address
              from a CPU with a cached TLB entry will trap a page fault.
      
      This is essentially what the kernel already depends on but the window is
      much larger with this patch applied and is worth highlighting.  The
      architecture should consider whether the cost of the full TLB flush is
      higher than sending an IPI to flush each individual entry.  An additional
      architecture helper called flush_tlb_local is required.  It's a trivial
      wrapper with some accounting in the x86 case.
      
      The impact of this patch depends on the workload as measuring any benefit
      requires both mapped pages co-located on the LRU and memory pressure.  The
      case with the biggest impact is multiple processes reading mapped pages
      taken from the vm-scalability test suite.  The test case uses NR_CPU
      readers of mapped files that consume 10*RAM.
      
      Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
      Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
      Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User          581.00       611.43
      System       5804.93      4111.76
      Elapsed       161.03       122.12
      
      This shows that the readers completed 24.40% faster with 29% less
      system CPU time.  From vmstats, it is known that the vanilla kernel was
      interrupted roughly 900K times per second during the steady phase of the
      test, while the patched kernel was interrupted roughly 180K times per second.
      
      The impact is lower on a single socket machine.
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)    20.38 ( 19.54%)
      Ops lru-file-mmap-read-time_range     0.91 (  0.00%)     1.44 (-58.24%)
      Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)     0.47 (-65.34%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User           58.09        57.64
      System        111.82        76.56
      Elapsed        27.29        22.55
      
      It's still a noticeable improvement with vmstat showing interrupts went
      from roughly 500K per second to 45K per second.
      
      The patch will have no impact on workloads with no memory pressure or that
      have relatively few mapped pages.  It will have an unpredictable impact on the
      workload running on the CPU being flushed as it'll depend on how many TLB
      entries need to be refilled and how long that takes.  Worst case, the TLB
      will be completely cleared of active entries when the target PFNs were not
      resident at all.
      
      [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 23 August 2015, 1 commit
  3. 22 August 2015, 3 commits
    • x86/kasan: Define KASAN_SHADOW_OFFSET per architecture · 920e277e
      Committed by Andrey Ryabinin
      The current definition of KASAN_SHADOW_OFFSET in
      include/linux/kasan.h will not work for the upcoming arm64 port, so move
      it to the arch header.
      Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Keitel <dkeitel@codeaurora.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yury <yury.norov@gmail.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1439444244-26057-2-git-send-email-ryabinin.a.a@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/asm/delay: Introduce an MWAITX-based delay with a configurable timer · b466bdb6
      Committed by Huang Rui
      MWAITX can enable a timer, with the timer value specified in SW P0
      clocks. The SW P0 frequency is the same as the TSC frequency. The timer
      provides an upper bound on how long the instruction waits before
      exiting.
      
      This way, a delay function in the kernel can leverage the MWAITX timer.
      
      When a CPU core executes MWAITX, it will be quiesced in a
      waiting phase, diminishing its power consumption. This way, we
      can save power in comparison to our default TSC-based delays.
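      
      A hedged sketch of such a delay loop, using the __monitorx()/__mwaitx()
      wrappers from the companion MONITORX/MWAITX patch below (simplified,
      not verbatim kernel code):
      
      	static void delay_mwaitx_sketch(u64 cycles)
      	{
      		u64 start = rdtsc_ordered();
      		unsigned long dummy;	/* any monitorable address is fine for the sketch */
      
      		for (;;) {
      			u64 end;
      			/* EBX is 32 bits wide, so clamp the per-iteration wait. */
      			u32 wait = min_t(u64, cycles, U32_MAX);
      
      			__monitorx(&dummy, 0, 0);	/* arm the monitor */
      			__mwaitx(0, wait, 0x2);		/* ECX[1] = 1: enable the timer */
      
      			end = rdtsc_ordered();
      			if (end - start >= cycles)
      				break;
      			cycles -= end - start;
      			start = end;
      		}
      	}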
      
      A simple test shows that:
      
      	$ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
      	$ sleep 10000s
      	$ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
      
      Results:
      
      	* TSC-based default delay:      485115 uWatts average power
      	* MWAITX-based delay:           252738 uWatts average power
      
      Thus, that's about 230 milliWatts less power consumption. The
      test method relies on the support of AMD CPU accumulated power
      algorithm in fam15h_power for which patches are forthcoming.
      Suggested-by: Andy Lutomirski <luto@amacapital.net>
      Suggested-by: Borislav Petkov <bp@suse.de>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Huang Rui <ray.huang@amd.com>
      [ Fix delay truncation. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andreas Herrmann <herrmann.der.user@gmail.com>
      Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Jacob Shin <jacob.w.shin@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Li <tony.li@amd.com>
      Link: http://lkml.kernel.org/r/1438744732-1459-3-git-send-email-ray.huang@amd.com
      Link: http://lkml.kernel.org/r/1439201994-28067-4-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/asm: Add MONITORX/MWAITX instruction support · f9675674
      Committed by Huang Rui
      AMD Carrizo processors (Family 15h, Models 60h-6fh) added a new
      feature called MWAITX (MWAIT with extensions) as an extension to
      MONITOR/MWAIT.
      
      This new instruction controls a configurable timer which causes
      the core to exit wait state on timer expiration, in addition to
      "normal" MWAIT condition of reading from a monitored VA.
      
      Compared to MONITOR/MWAIT, there are minor differences in opcode
      and input parameters:
      
      MWAITX ECX[1]: enable timer if set
      MWAITX EBX[31:0]: max wait time expressed in SW P0 clocks ==
      TSC. The software P0 frequency is the same as the TSC frequency.
      
                      MWAIT                           MWAITX
      opcode          0f 01 c9           |            0f 01 fb
      ECX[0]                  value of RFLAGS.IF seen by instruction
      ECX[1]          unused/#GP if set  |            enable timer if set
      ECX[31:2]                     unused/#GP if set
      EAX                           unused (reserve for hint)
      EBX[31:0]       unused             |            max wait time (SW P0 == TSC)
      
                      MONITOR                         MONITORX
      opcode          0f 01 c8           |            0f 01 fa
      EAX                     (logical) address to monitor
      ECX                     #GP if not zero
      
      Max timeout = EBX/(TSC frequency)
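      
      A sketch of inline-asm wrappers that emit the opcodes from the table
      above (illustrative; not guaranteed to match the patch verbatim):
      
      	static inline void __monitorx(const void *addr, unsigned long ecx,
      				      unsigned long edx)
      	{
      		/* MONITORX: 0f 01 fa; address in (r|e)ax, hints in ecx/edx */
      		asm volatile(".byte 0x0f, 0x01, 0xfa"
      			     :: "a" (addr), "c" (ecx), "d" (edx));
      	}
      
      	static inline void __mwaitx(unsigned long eax, unsigned long ebx,
      				    unsigned long ecx)
      	{
      		/* MWAITX: 0f 01 fb; EBX = max wait in SW P0 clocks, ECX[1] = timer enable */
      		asm volatile(".byte 0x0f, 0x01, 0xfb"
      			     :: "a" (eax), "b" (ebx), "c" (ecx));
      	}
      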
      Signed-off-by: Huang Rui <ray.huang@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andreas Herrmann <herrmann.der.user@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dirk Brandewie <dirk.j.brandewie@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Li <tony.li@amd.com>
      Link: http://lkml.kernel.org/r/1439201994-28067-3-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 21 August 2015, 1 commit
    • x86/asm/tsc: Add rdtscll() merge helper · 99770737
      Committed by Ingo Molnar
      Some in-flight code still makes use of the old rdtscll() (now removed); provide a
      wrapper for one kernel cycle to smooth the transition to rdtsc().
      
      ( We use the safest variant, rdtsc_ordered(), which has barriers - this adds another
        incentive to remove the wrapper in the future. )
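      
      ( The wrapper amounts to roughly the following one-liner -- a sketch based on
        the description above, not necessarily the exact hunk: )
      
      	#define rdtscll(now)	do { (now) = rdtsc_ordered(); } while (0)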
      
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm ML <kvm@vger.kernel.org>
      Link: http://lkml.kernel.org/r/dddbf98a2af53312e9aa73a5a2b1622fe5d6f52b.1434501121.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 20 August 2015, 6 commits
  6. 18 August 2015, 1 commit
  7. 17 August 2015, 1 commit
    • x86/smpboot: Remove APIC.wait_for_init_deassert and atomic init_deasserted · 656bba30
      Committed by Len Brown
      Both the per-APIC flag ".wait_for_init_deassert",
      and the global atomic_t "init_deasserted"
      are dead code -- remove them.
      
      For all APIC types, "wait_for_master()"
      prevents an AP from proceeding until the BSP has set
      cpu_callout_mask, making "init_deasserted" {unnecessary}:
      
      	BSP: <de-assert INIT>
      	...
      	BSP: {set init_deasserted}
      	AP: wait_for_master()
      		set cpu_initialized_mask
      		wait for cpu_callout_mask
      	BSP: test cpu_initialized_mask
      	BSP: set cpu_callout_mask
      	AP: test cpu_callout_mask
      	AP: {wait for init_deasserted}
      	...
      	AP: <touch APIC>
      
      Deleting the {dead code} above is necessary to enable
      some parallelism in a future patch.
      Signed-off-by: Len Brown <len.brown@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jan H. Schönherr <jschoenh@amazon.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Link: http://lkml.kernel.org/r/de4b3a9bab894735e285870b5296da25ee6a8a5a.1439739165.git.len.brown@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 15 August 2015, 1 commit
  9. 14 August 2015, 1 commit
  10. 13 August 2015, 5 commits
  11. 12 August 2015, 1 commit
  12. 06 August 2015, 2 commits
  13. 05 August 2015, 7 commits
  14. 04 August 2015, 3 commits
  15. 03 August 2015, 3 commits
    • sched/preempt: Fix cond_resched_lock() and cond_resched_softirq() · fe32d3cd
      Committed by Konstantin Khlebnikov
      These functions check should_resched() before unlocking a spinlock or
      re-enabling bottom halves, but at that point preempt_count is always
      non-zero, so should_resched() always returns false.  As a result,
      cond_resched_lock() only worked when spin_needbreak() was set.
      
      This patch adds a "preempt_offset" argument to should_resched() (see the
      sketch after the list).
      
      preempt_count offset constants for that:
      
        PREEMPT_DISABLE_OFFSET  - offset after preempt_disable()
        PREEMPT_LOCK_OFFSET     - offset after spin_lock()
        SOFTIRQ_DISABLE_OFFSET  - offset after local_bh_disable()
        SOFTIRQ_LOCK_OFFSET     - offset after spin_lock_bh()
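      
      A sketch of the resulting check and a caller (illustrative, modeled on the
      description above rather than quoted verbatim):
      
      	static __always_inline bool should_resched(int preempt_offset)
      	{
      		/* Reschedule only if nothing beyond the expected nesting is held. */
      		return unlikely(preempt_count() == preempt_offset &&
      				tif_need_resched());
      	}
      
      	/* e.g. cond_resched_lock() can now check against the lock's own offset: */
      	if (should_resched(PREEMPT_LOCK_OFFSET)) {
      		spin_unlock(lock);
      		__cond_resched();
      		spin_lock(lock);
      	}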
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: bdb43806 ("sched: Extract the basic add/sub preempt_count modifiers")
      Link: http://lkml.kernel.org/r/20150715095204.12246.98268.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/static_keys: Add a new static_key interface · 11276d53
      Committed by Peter Zijlstra
      There are various problems and short-comings with the current
      static_key interface:
      
       - static_key_{true,false}() read like a branch depending on the key
         value, instead of the actual likely/unlikely branch depending on
         init value.
      
       - static_key_{true,false}() are, as stated above, tied to the
         static_key init values STATIC_KEY_INIT_{TRUE,FALSE}.
      
       - we're limited to the 2 (out of 4) possible options that compile to
         a default NOP because that's what our arch_static_branch() assembly
         emits.
      
      So provide a new static_key interface:
      
        DEFINE_STATIC_KEY_TRUE(name);
        DEFINE_STATIC_KEY_FALSE(name);
      
      Which define a key of different types with an initial true/false
      value.
      
      Then allow:
      
         static_branch_likely()
         static_branch_unlikely()
      
      to take a key of either type and emit the right instruction for the
      case.
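      
      Illustrative usage of the new interface (the key and helper names here are
      made up for the example):
      
      	DEFINE_STATIC_KEY_FALSE(use_new_path);
      
      	void hot_path(void)
      	{
      		/* With a false-initialized key, this test compiles to a NOP by
      		 * default and is patched to a JMP once the key is enabled. */
      		if (static_branch_unlikely(&use_new_path))
      			new_path();	/* hypothetical feature path */
      		else
      			old_path();	/* hypothetical default path */
      	}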
      
      This means adding a second arch_static_branch_jump() assembly helper
      which emits a JMP per default.
      
      In order to determine the right instruction for the right state,
      encode the branch type in the LSB of jump_entry::key.
      
      This is the final step in removing the naming confusion that has led to
      a stream of avoidable bugs such as:
      
        a833581e ("x86, perf: Fix static_key bug in load_mm_cr4()")
      
      ... but it also allows new static key combinations that will give us
      performance enhancements in the subsequent patches.
      
      Tested-by: Rabin Vincent <rabin@rab.in> # arm
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> # ppc
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # s390
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking, arch: use WRITE_ONCE()/READ_ONCE() in smp_store_release()/smp_load_acquire() · 76695af2
      Committed by Andrey Konovalov
      Replace ACCESS_ONCE() macro in smp_store_release() and smp_load_acquire()
      with WRITE_ONCE() and READ_ONCE() on x86, arm, arm64, ia64, metag, mips,
      powerpc, s390, sparc and asm-generic since ACCESS_ONCE() does not work
      reliably on non-scalar types.
      
      WRITE_ONCE() and READ_ONCE() were introduced in the following commits:
      
        230fa253 ("kernel: Provide READ_ONCE and ASSIGN_ONCE")
        43239cbe ("kernel: Change ASSIGN_ONCE(val, x) to WRITE_ONCE(x, val)")
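      
      A simplified sketch of the resulting asm-generic pattern (the real macros
      also assert that the access is to a properly sized scalar type):
      
      	#define smp_store_release(p, v)			\
      	do {						\
      		smp_mb();				\
      		WRITE_ONCE(*(p), (v));			\
      	} while (0)
      
      	#define smp_load_acquire(p)			\
      	({						\
      		typeof(*(p)) ___v = READ_ONCE(*(p));	\
      		smp_mb();				\
      		___v;					\
      	})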
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
      Cc: Andre Przywara <andre.przywara@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/1438528264-714-1-git-send-email-andreyknvl@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  16. 31 July 2015, 3 commits