1. 09 Aug 2014 (5 commits)
    • kernel: build bin2c based on config option CONFIG_BUILD_BIN2C · de5b56ba
      Authored by Vivek Goyal
      Currently, bin2c is built only if CONFIG_IKCONFIG=y.  But bin2c will now be
      used by kexec too.  So make its compilation dependent on CONFIG_BUILD_BIN2C,
      a config option that can be selected by CONFIG_KEXEC and CONFIG_IKCONFIG.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de5b56ba
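
      How this wiring looks in Kconfig terms, as a rough sketch (option placement
      and prompt text here are assumptions, not the literal hunks of the patch):

          # A silent helper option that other options pull in via 'select'.
          config BUILD_BIN2C
                  bool
                  default n

          config IKCONFIG
                  tristate "Kernel .config support"
                  select BUILD_BIN2C

          config KEXEC
                  bool "kexec system call"
                  select BUILD_BIN2C

          # The host tool is then built only when something selected the option,
          # e.g. hostprogs-$(CONFIG_BUILD_BIN2C) += bin2c in scripts/basic/Makefile.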
    • shm: add memfd_create() syscall · 9183df25
      Authored by David Herrmann
      memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
      that you can pass to mmap().  It can support sealing and avoids any
      connection to user-visible mount-points.  Thus, it's not subject to quotas
      on mounted file-systems and can be used like malloc()'ed memory, but with
      a file-descriptor to it.
      
      memfd_create() returns the raw shmem file, so calls like ftruncate() can
      be used to modify the underlying inode.  Calls like fstat() will also
      return proper information and mark the file as a regular file.  If you want
      sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
      supported (like on all other regular files).
      
      Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
      subject to a filesystem size limit.  It is still properly accounted to
      memcg limits, though, and to the same overcommit or no-overcommit
      accounting as all user memory.
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9183df25
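
      A minimal user-space sketch of the usage described above (it assumes a libc
      that exposes a memfd_create() wrapper; on older systems the raw syscall(2)
      would be needed instead):

          #define _GNU_SOURCE
          #include <sys/mman.h>   /* memfd_create(), mmap(), MFD_* flags */
          #include <fcntl.h>      /* fcntl(), F_ADD_SEALS, F_SEAL_* */
          #include <unistd.h>     /* ftruncate() */
          #include <stdio.h>      /* perror() */

          int main(void)
          {
                  /* Anonymous, unlinked shmem file; sealing explicitly enabled. */
                  int fd = memfd_create("example", MFD_CLOEXEC | MFD_ALLOW_SEALING);
                  if (fd < 0) { perror("memfd_create"); return 1; }

                  /* Regular-file operations work on the returned descriptor. */
                  if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

                  void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                  if (p == MAP_FAILED) { perror("mmap"); return 1; }

                  /* Freeze the size so a consumer can map it without re-checking it. */
                  if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) < 0)
                          perror("F_ADD_SEALS");
                  return 0;
          }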
    • arch/x86: replace strict_strto calls · 164109e3
      Authored by Daniel Walter
      Replace obsolete strict_strto calls with appropriate kstrto calls.
      Signed-off-by: Daniel Walter <dwalter@google.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      164109e3
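
      The conversion pattern looks roughly like this (an illustrative kernel-style
      fragment, not a hunk from the patch; do_something_with() is a hypothetical
      consumer):

          #include <linux/kernel.h>   /* kstrtoul() and friends */

          static int parse_setting(const char *str)
          {
                  unsigned long val;
                  int ret;

                  /* Old style (removed): strict_strtoul(str, 0, &val);        */
                  /* New style: kstrto* returns 0 on success or a -errno code. */
                  ret = kstrtoul(str, 0, &val);
                  if (ret)
                          return ret;

                  return do_something_with(val);
          }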
    • arm64,ia64,ppc,s390,sh,tile,um,x86,mm: remove default gate area · a6c19dfe
      Authored by Andy Lutomirski
      The core mm code will provide a default gate area based on
      FIXADDR_USER_START and FIXADDR_USER_END if
      !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).
      
      This default is only useful for ia64.  arm64, ppc, s390, sh, tile, 64-bit
      UML, and x86_32 have their own code just to disable it.  arm, 32-bit UML,
      and x86_64 have gate areas, but they have their own implementations.
      
      This gets rid of the default and moves the code into ia64.
      
      This should save some code on architectures without a gate area: it's now
      possible to inline the gate_area functions in the default case.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Acked-by: Nathan Lynch <nathan_lynch@mentor.com>
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [in principle]
      Acked-by: Richard Weinberger <richard@nod.at> [for um]
      Acked-by: Will Deacon <will.deacon@arm.com> [for arm64]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Nathan Lynch <Nathan_Lynch@mentor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6c19dfe
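
      What "inline the gate_area functions in the default case" amounts to,
      roughly (a sketch of the generic !__HAVE_ARCH_GATE_AREA stubs; exact header
      placement is an assumption here):

          /* include/linux/mm.h, conceptually: with no arch-provided gate area, */
          /* the generic helpers collapse to trivial inlines.                   */
          #ifndef __HAVE_ARCH_GATE_AREA
          static inline struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
          {
                  return NULL;
          }
          static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
          {
                  return 0;
          }
          static inline int in_gate_area_no_mm(unsigned long addr)
          {
                  return 0;
          }
          #endif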
    • lib/scatterlist: make ARCH_HAS_SG_CHAIN an actual Kconfig · 308c09f1
      Authored by Laura Abbott
      Rather than have architectures #define ARCH_HAS_SG_CHAIN in an
      architecture specific scatterlist.h, make it a proper Kconfig option and
      use that instead.  At the same time, remove the header files that are now
      mostly useless and just include asm-generic/scatterlist.h.
      
      [sfr@canb.auug.org.au: powerpc files now need asm/dma.h]
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>			[x86]
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	[powerpc]
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      308c09f1
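
      The shape of the change, as a sketch (chain_tables() is a made-up caller;
      the point is only the CONFIG_ test replacing the old per-arch define):

          #include <linux/scatterlist.h>

          static void chain_tables(struct scatterlist *prv, unsigned int prv_nents,
                                   struct scatterlist *sgl)
          {
                  /* Generic code now keys off the generated CONFIG_ macro instead */
                  /* of an ARCH_HAS_SG_CHAIN define in asm/scatterlist.h.          */
          #ifdef CONFIG_ARCH_HAS_SG_CHAIN
                  sg_chain(prv, prv_nents, sgl);  /* link the two scatterlists */
          #else
                  BUG();  /* chaining not available on this architecture */
          #endif
          }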
  2. 07 Aug 2014 (3 commits)
  3. 06 Aug 2014 (2 commits)
    • x86: MCE: Add raw_lock conversion again · ed5c41d3
      Authored by Thomas Gleixner
      Commit ea431643 ("x86/mce: Fix CMCI preemption bugs") breaks RT by
      the completely unrelated conversion of the cmci_discover_lock to a
      regular (non raw) spinlock.  This lock was annotated in commit
      59d958d2 ("locking, x86: mce: Annotate cmci_discover_lock as raw")
      with a proper explanation why.
      
      The argument for converting the lock back to a regular spinlock was:
      
       - it does percpu ops without disabling preemption. Preemption is not
         disabled due to the mistaken use of a raw spinlock.
      
      Which is complete nonsense.  The raw_spinlock is disabling preemption in
      the same way as a regular spinlock.  In mainline spinlock maps to
      raw_spinlock, in RT spinlock becomes a "sleeping" lock.
      
      On RT, raw_spinlock has exactly the same semantics as in mainline.  And
      because this lock is taken in non-preemptible context, it must be raw on
      RT.
      
      Undo the locking brainfart.
      Reported-by: Clark Williams <williams@redhat.com>
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed5c41d3
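
      The pattern being restored is roughly the following (illustrative fragment,
      not the literal diff; cmci_example() is a made-up caller):

          #include <linux/spinlock.h>

          /* A raw_spinlock_t stays a true spinning lock on PREEMPT_RT, so it is */
          /* safe to take in non-preemptible context.                            */
          static DEFINE_RAW_SPINLOCK(cmci_discover_lock);

          static void cmci_example(void)
          {
                  unsigned long flags;

                  raw_spin_lock_irqsave(&cmci_discover_lock, flags);
                  /* ... touch per-CPU CMCI state with preemption and IRQs off ... */
                  raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
          }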
    • random: introduce getrandom(2) system call · c6e9d6f3
      Authored by Theodore Ts'o
      The getrandom(2) system call was requested by the LibreSSL Portable
      developers.  It is analogous to the getentropy(2) system call in
      OpenBSD.
      
      The rationale of this system call is to provide resilience against
      file descriptor exhaustion attacks, where the attacker consumes all
      available file descriptors, forcing the use of the fallback code where
      /dev/[u]random is not available.  Since the fallback code is often not
      well-tested, it is better to eliminate this potential failure mode
      entirely.
      
      The other feature provided by this new system call is the ability to
      request randomness from the /dev/urandom entropy pool, but to block
      until at least 128 bits of entropy has been accumulated in the
      /dev/urandom entropy pool.  Historically, the emphasis in the
      /dev/urandom development has been to ensure that urandom pool is
      initialized as quickly as possible after system boot, and preferably
      before the init scripts start execution.
      
      This is because changing /dev/urandom reads to block represents an
      interface change that could potentially break userspace, which is not
      acceptable.  In practice, on most x86 desktop and server systems, in
      general the entropy pool can be initialized before it is needed (and
      in modern kernels, we will printk a warning message if not).  However,
      on an embedded system, this may not be the case.  And so with this new
      interface, we can provide the functionality of blocking until the
      urandom pool has been initialized.  Any userspace program which uses
      this new functionality must take care to ensure that, if it is used
      during the boot process, it will not cause the init scripts or
      other portions of the system startup to hang indefinitely.
      
      SYNOPSIS
      	#include <linux/random.h>
      
      	int getrandom(void *buf, size_t buflen, unsigned int flags);
      
      DESCRIPTION
      	The system call getrandom() fills the buffer pointed to by buf
      	with up to buflen random bytes which can be used to seed user
      	space random number generators (i.e., DRBG's) or for other
      	cryptographic uses.  It should not be used for Monte Carlo
      	simulations or other programs/algorithms which are doing
      	probabilistic sampling.
      
      	If the GRND_RANDOM flags bit is set, then draw from the
      	/dev/random pool instead of the /dev/urandom pool.  The
      	/dev/random pool is limited based on the entropy that can be
      	obtained from environmental noise, so if there is insufficient
      	entropy, the requested number of bytes may not be returned.
      	If there is no entropy available at all, getrandom(2) will
      	either block, or return an error with errno set to EAGAIN if
      	the GRND_NONBLOCK bit is set in flags.
      
      	If the GRND_RANDOM bit is not set, then the /dev/urandom pool
      	will be used.  Unlike using read(2) to fetch data from
      	/dev/urandom, if the urandom pool has not been sufficiently
      	initialized, getrandom(2) will block (or return -1 with the
      	errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags).
      
      	The getentropy(2) system call in OpenBSD can be emulated using
      	the following function:
      
                  int getentropy(void *buf, size_t buflen)
                  {
                          int     ret;
      
                          if (buflen > 256)
                                  goto failure;
                          ret = getrandom(buf, buflen, 0);
                          if (ret < 0)
                                  return ret;
                          if (ret == buflen)
                                  return 0;
                  failure:
                          errno = EIO;
                          return -1;
                  }
      
      RETURN VALUE
             On success, the number of bytes that were filled in buf is
             returned.  This may not be all the bytes requested by the
             caller via buflen if insufficient entropy was present in the
             /dev/random pool, or if the system call was interrupted by a
             signal.
      
             On error, -1 is returned, and errno is set appropriately.
      
      ERRORS
      	EINVAL		An invalid flag was passed to getrandom(2)
      
      	EFAULT		buf is outside the accessible address space.
      
      	EAGAIN		The requested entropy was not available, and
      			getrandom(2) would have blocked if the
      			GRND_NONBLOCK flag was not set.
      
      	EINTR		While blocked waiting for entropy, the call was
      			interrupted by a signal handler; see the description
      			of how interrupted read(2) calls on "slow" devices
      			are handled with and without the SA_RESTART flag
      			in the signal(7) man page.
      
      NOTES
      	For small requests (buflen <= 256) getrandom(2) will not
      	return EINTR when reading from the urandom pool once the
      	entropy pool has been initialized, and it will return all of
      	the bytes that have been requested.  This is the recommended
      	way to use getrandom(2), and is designed for compatibility
      	with OpenBSD's getentropy() system call.
      
      	However, if you are using GRND_RANDOM, then getrandom(2) may
      	block until the entropy accounting determines that sufficient
      	environmental noise has been gathered such that getrandom(2)
      	will be operating as an NRBG instead of a DRBG for those people
      	who are working in the NIST SP 800-90 regime.  Since it may
      	block for a long time, these guarantees do *not* apply.  The
      	user may want to interrupt a hanging process using a signal,
      	so blocking until all of the requested bytes are returned
      	would be unfriendly.
      
      	For this reason, the user of getrandom(2) MUST always check
      	the return value, in case it returns some error, or if fewer
      	bytes than requested were returned.  In the case of
      	!GRND_RANDOM and a small request, the latter should never
      	happen, but careful userspace code (and all crypto code
      	should be careful) should check for this anyway!
      
      	Finally, unless you are doing long-term key generation (and
      	perhaps not even then), you probably shouldn't be using
      	GRND_RANDOM.  The cryptographic algorithms used for
      	/dev/urandom are quite conservative, and so should be
      	sufficient for all purposes.  The disadvantages of GRND_RANDOM
      	are that it can block, and that it adds the complexity required to
      	deal with partially fulfilled getrandom(2) requests.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Zach Brown <zab@zabbo.net>
      c6e9d6f3
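
      A user-space sketch of the "always check the return value" advice (it goes
      through syscall(2) and assumes headers that define SYS_getrandom; a libc
      wrapper is not assumed):

          #include <unistd.h>
          #include <errno.h>
          #include <stddef.h>
          #include <sys/syscall.h>

          /* Fill buf completely from the urandom pool, retrying on EINTR and  */
          /* short reads.  Returns 0 on success, -1 (with errno set) on error. */
          static int getrandom_exact(void *buf, size_t buflen)
          {
                  unsigned char *p = buf;

                  while (buflen) {
                          long n = syscall(SYS_getrandom, p, buflen, 0);

                          if (n < 0) {
                                  if (errno == EINTR)
                                          continue;   /* interrupted by a signal */
                                  return -1;          /* e.g. ENOSYS on old kernels */
                          }
                          p += n;
                          buflen -= n;
                  }
                  return 0;
          }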
  4. 03 Aug 2014 (3 commits)
    • x86/pmc_atom: Silence shift wrapping warnings in pmc_sleep_tmr_show() · 4c51cb00
      Authored by Dan Carpenter
      I don't know if we really need 64 bits here, but these variables are
      declared as u64, and it can't hurt to cast this so we prevent any shift
      wrapping.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Aubrey Li <aubrey.li@linux.intel.com>
      Link: http://lkml.kernel.org/r/20140801082715.GE28869@mwanda
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      4c51cb00
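
      The class of fix is the usual cast-before-shift pattern; a self-contained
      illustration (names here are made up, not from the driver):

          #include <stdint.h>
          #include <stdio.h>

          /* Shifting a 32-bit value left by 32 wraps (and is undefined) before */
          /* the result is widened; casting to 64 bits first keeps every bit.   */
          static uint64_t combine(uint32_t lo, uint32_t hi)
          {
                  return ((uint64_t)hi << 32) | lo;  /* ok: shift done in 64 bits */
                  /* buggy variant: return (hi << 32) | lo;  shift wraps in 32 bits */
          }

          int main(void)
          {
                  printf("%llx\n", (unsigned long long)combine(0x89abcdefu, 0x1234567u));
                  return 0;
          }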
    • net: filter: split 'struct sk_filter' into socket and bpf parts · 7ae457c1
      Authored by Alexei Starovoitov
      clean up names related to socket filtering and bpf in the following way:
      - everything that deals with sockets keeps 'sk_*' prefix
      - everything that is pure BPF is changed to 'bpf_*' prefix
      
      split 'struct sk_filter' into
      struct sk_filter {
      	atomic_t        refcnt;
      	struct rcu_head rcu;
      	struct bpf_prog *prog;
      };
      and
      struct bpf_prog {
              u32                     jited:1,
                                      len:31;
              struct sock_fprog_kern  *orig_prog;
              unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                                  const struct bpf_insn *filter);
              union {
                      struct sock_filter      insns[0];
                      struct bpf_insn         insnsi[0];
                      struct work_struct      work;
              };
      };
      so that 'struct bpf_prog' can be used independently of sockets, which
      cleans up the 'unattached' bpf use cases
      
      split SK_RUN_FILTER macro into:
          SK_RUN_FILTER to be used with 'struct sk_filter *' and
          BPF_PROG_RUN to be used with 'struct bpf_prog *'
      
      __sk_filter_release(struct sk_filter *) gains
      __bpf_prog_release(struct bpf_prog *) helper function
      
      also perform related renames for the functions that work
      with 'struct bpf_prog *', since they're along the same lines:
      
      sk_filter_size -> bpf_prog_size
      sk_filter_select_runtime -> bpf_prog_select_runtime
      sk_filter_free -> bpf_prog_free
      sk_unattached_filter_create -> bpf_prog_create
      sk_unattached_filter_destroy -> bpf_prog_destroy
      sk_store_orig_filter -> bpf_prog_store_orig_filter
      sk_release_orig_filter -> bpf_release_orig_filter
      __sk_migrate_filter -> bpf_migrate_filter
      __sk_prepare_filter -> bpf_prepare_filter
      
      API for attaching classic BPF to a socket stays the same:
      sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
      and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
      which is used by sockets, tun, af_packet
      
      API for 'unattached' BPF programs becomes:
      bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
      and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
      which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7ae457c1
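
      A rough sketch of the renamed "unattached" API described above (kernel-side;
      run_unattached_filter() is a made-up caller and the classic-BPF program is a
      trivial accept-everything placeholder):

          #include <linux/kernel.h>
          #include <linux/filter.h>

          static int run_unattached_filter(struct sk_buff *skb)
          {
                  /* Classic BPF: unconditionally accept up to 0xffff bytes. */
                  struct sock_filter insns[] = {
                          BPF_STMT(BPF_RET | BPF_K, 0xffff),
                  };
                  struct sock_fprog_kern fprog = {
                          .len    = ARRAY_SIZE(insns),
                          .filter = insns,
                  };
                  struct bpf_prog *prog;
                  u32 res;
                  int err;

                  err = bpf_prog_create(&prog, &fprog);  /* was sk_unattached_filter_create() */
                  if (err)
                          return err;

                  res = BPF_PROG_RUN(prog, skb);         /* runs the (possibly JITed) program */
                  bpf_prog_destroy(prog);                /* was sk_unattached_filter_destroy() */

                  return res;
          }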
    • net: filter: rename sk_convert_filter() -> bpf_convert_filter() · 8fb575ca
      Authored by Alexei Starovoitov
      to indicate that this function converts classic BPF into eBPF
      and is not related to sockets
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8fb575ca
  5. 31 Jul 2014 (8 commits)
    • x86/mm: Set TLB flush tunable to sane value (33) · a5102476
      Authored by Dave Hansen
      This has been run through Intel's LKP tests across a wide range
      of modern systems and workloads, and it wasn't shown to make a
      measurable performance difference, positive or negative.
      
      Now that we have some shiny new tracepoints, we can actually
      figure out what the heck is going on.
      
      During a kernel compile, 60% of the flush_tlb_mm_range() calls
      are for a single page.  It breaks down like this:
      
       size   percent  percent<=
        V        V        V
      GLOBAL:   2.20%   2.20% avg cycles:  2283
           1:  56.92%  59.12% avg cycles:  1276
           2:  13.78%  72.90% avg cycles:  1505
           3:   8.26%  81.16% avg cycles:  1880
           4:   7.41%  88.58% avg cycles:  2447
           5:   1.73%  90.31% avg cycles:  2358
           6:   1.32%  91.63% avg cycles:  2563
           7:   1.14%  92.77% avg cycles:  2862
           8:   0.62%  93.39% avg cycles:  3542
           9:   0.08%  93.47% avg cycles:  3289
          10:   0.43%  93.90% avg cycles:  3570
          11:   0.20%  94.10% avg cycles:  3767
          12:   0.08%  94.18% avg cycles:  3996
          13:   0.03%  94.20% avg cycles:  4077
          14:   0.02%  94.23% avg cycles:  4836
          15:   0.04%  94.26% avg cycles:  5699
          16:   0.06%  94.32% avg cycles:  5041
          17:   0.57%  94.89% avg cycles:  5473
          18:   0.02%  94.91% avg cycles:  5396
          19:   0.03%  94.95% avg cycles:  5296
          20:   0.02%  94.96% avg cycles:  6749
          21:   0.18%  95.14% avg cycles:  6225
          22:   0.01%  95.15% avg cycles:  6393
          23:   0.01%  95.16% avg cycles:  6861
          24:   0.12%  95.28% avg cycles:  6912
          25:   0.05%  95.32% avg cycles:  7190
          26:   0.01%  95.33% avg cycles:  7793
          27:   0.01%  95.34% avg cycles:  7833
          28:   0.01%  95.35% avg cycles:  8253
          29:   0.08%  95.42% avg cycles:  8024
          30:   0.03%  95.45% avg cycles:  9670
          31:   0.01%  95.46% avg cycles:  8949
          32:   0.01%  95.46% avg cycles:  9350
          33:   3.11%  98.57% avg cycles:  8534
          34:   0.02%  98.60% avg cycles: 10977
          35:   0.02%  98.62% avg cycles: 11400
      
      We get into diminishing returns pretty quickly.  On pre-IvyBridge
      CPUs, we used to set the limit at 8 pages, and it was set at 128
      on IvyBridge.  That 128 number looks pretty silly considering that
      less than 0.5% of the flushes are that large.
      
      The previous code tried to size this number based on the size of
      the TLB.  Good idea, but it's error-prone, needs maintenance
      (which it didn't get up to now), and probably would not matter
      much in practice.
      
      Setting it to 33 means that we cover the mallopt
      M_TRIM_THRESHOLD, which is the most universally common size to do
      flushes.
      
      That's the short version.  Here's the long one for why I chose 33:
      
      1. These numbers have a constant bias in the timestamps from the
         tracing.  Probably counts for a couple hundred cycles in each of
         these tests, but it should be fairly _even_ across all of them.
         The smallest delta between the tracepoints I have ever seen is
         335 cycles.  This is one reason the cycles/page cost goes down in
         general as the flushes get larger.  The true cost is nearer to
         100 cycles.
      2. A full flush is more expensive than a single invlpg, but not
         by much (single percentages).
      3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
         (~34 cycles).  At those rates, refilling the 512-entry dTLB takes
         22,000 cycles.
      4. 22,000 cycles is approximately the equivalent of doing 85
         invlpg operations.  But, the odds are that the TLB can
         actually be filled up faster than that because TLB misses that
         are close in time also tend to leverage the same caches.
      5. ~98% of flushes are <=33 pages.  There are a lot of flushes of
         33 pages, probably because libc's M_TRIM_THRESHOLD is set to
         128k (32 pages).
      6. I've found no consistent data to support changing the IvyBridge
         vs. SandyBridge tunable by a factor of 16.
      
      I used the performance counters on this hardware (IvyBridge i5-3320M)
      to figure out the tlb miss costs:
      
      ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush
      
           7,720,030,970      dtlb_load_misses_walk_duration                                    [57.13%]
             169,856,353      dtlb_load_misses_walk_completed                                    [57.15%]
             708,832,859      dtlb_store_misses_walk_duration                                    [57.17%]
              19,346,823      dtlb_store_misses_walk_completed                                    [57.17%]
           2,779,687,402      itlb_misses_walk_duration                                    [57.15%]
              82,241,148      itlb_misses_walk_completed                                    [57.13%]
                 770,717      itlb_itlb_flush                                              [57.11%]
      
      These show that a dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
      (~34 cycles).  At those rates, refilling the 512-entry dTLB takes
      22,000 cycles.  On a SandyBridge system with more cores and larger
      caches, those are dtlb=13.4ns and itlb=9.5ns.
      
      cat perf.stat.txt | perl -pe 's/,//g'
      	| awk '/itlb_misses_walk_duration/ { icyc+=$1 }
      		/itlb_misses_walk_completed/ { imiss+=$1 }
      		/dtlb_.*_walk_duration/ { dcyc+=$1 }
      		/dtlb_.*.*completed/ { dmiss+=$1 }
      		END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, "   -----    ", icyc,imiss, dcyc,dmiss }'
      
      On Westmere CPUs, the counters to use are: itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any
      
      The assumptions that this code went in under:
      https://lkml.org/lkml/2012/6/12/119 say that a flush and a refill are
      about 100ns.  Being generous, that is over by a factor of 6 on the
      refill side, although it is fairly close on the cost of an invlpg.
      Each additional invlpg operation seems to lengthen the flush
      range operation by about 200 cycles.  Here is one example of the data
      collected for flushing 10 and 11 pages (full data are below):
      
          10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
          11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145
      
      How to generate this table:
      
      	echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
      	echo x86-tsc > /sys/kernel/debug/tracing/trace_clock
      	echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter
      	echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable
      
      Pipe the trace output in to this script:
      
      	http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt
      
      Note that these data were gathered with the invlpg threshold set to
      150 pages.  Only data points with >=50 of samples were printed:
      
      Flush    % of     %<=
      in       flush    this
      pages      es     size
      ------------------------------------------------------------------------------
          -1:   2.20%   2.20% avg cycles:  2283 cycles/page: xxxx samples: 23960
           1:  56.92%  59.12% avg cycles:  1276 cycles/page: 1276 samples: 620895
           2:  13.78%  72.90% avg cycles:  1505 cycles/page:  752 samples: 150335
           3:   8.26%  81.16% avg cycles:  1880 cycles/page:  626 samples: 90131
           4:   7.41%  88.58% avg cycles:  2447 cycles/page:  611 samples: 80877
           5:   1.73%  90.31% avg cycles:  2358 cycles/page:  471 samples: 18885
           6:   1.32%  91.63% avg cycles:  2563 cycles/page:  427 samples: 14397
           7:   1.14%  92.77% avg cycles:  2862 cycles/page:  408 samples: 12441
           8:   0.62%  93.39% avg cycles:  3542 cycles/page:  442 samples: 6721
           9:   0.08%  93.47% avg cycles:  3289 cycles/page:  365 samples: 917
          10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
          11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145
          12:   0.08%  94.18% avg cycles:  3996 cycles/page:  333 samples: 864
          13:   0.03%  94.20% avg cycles:  4077 cycles/page:  313 samples: 289
          14:   0.02%  94.23% avg cycles:  4836 cycles/page:  345 samples: 236
          15:   0.04%  94.26% avg cycles:  5699 cycles/page:  379 samples: 390
          16:   0.06%  94.32% avg cycles:  5041 cycles/page:  315 samples: 643
          17:   0.57%  94.89% avg cycles:  5473 cycles/page:  321 samples: 6229
          18:   0.02%  94.91% avg cycles:  5396 cycles/page:  299 samples: 224
          19:   0.03%  94.95% avg cycles:  5296 cycles/page:  278 samples: 367
          20:   0.02%  94.96% avg cycles:  6749 cycles/page:  337 samples: 185
          21:   0.18%  95.14% avg cycles:  6225 cycles/page:  296 samples: 1964
          22:   0.01%  95.15% avg cycles:  6393 cycles/page:  290 samples: 83
          23:   0.01%  95.16% avg cycles:  6861 cycles/page:  298 samples: 61
          24:   0.12%  95.28% avg cycles:  6912 cycles/page:  288 samples: 1307
          25:   0.05%  95.32% avg cycles:  7190 cycles/page:  287 samples: 533
          26:   0.01%  95.33% avg cycles:  7793 cycles/page:  299 samples: 94
          27:   0.01%  95.34% avg cycles:  7833 cycles/page:  290 samples: 66
          28:   0.01%  95.35% avg cycles:  8253 cycles/page:  294 samples: 73
          29:   0.08%  95.42% avg cycles:  8024 cycles/page:  276 samples: 846
          30:   0.03%  95.45% avg cycles:  9670 cycles/page:  322 samples: 296
          31:   0.01%  95.46% avg cycles:  8949 cycles/page:  288 samples: 79
          32:   0.01%  95.46% avg cycles:  9350 cycles/page:  292 samples: 60
          33:   3.11%  98.57% avg cycles:  8534 cycles/page:  258 samples: 33936
          34:   0.02%  98.60% avg cycles: 10977 cycles/page:  322 samples: 268
          35:   0.02%  98.62% avg cycles: 11400 cycles/page:  325 samples: 177
          36:   0.01%  98.63% avg cycles: 11504 cycles/page:  319 samples: 161
          37:   0.02%  98.65% avg cycles: 11596 cycles/page:  313 samples: 182
          38:   0.02%  98.66% avg cycles: 11850 cycles/page:  311 samples: 195
          39:   0.01%  98.68% avg cycles: 12158 cycles/page:  311 samples: 128
          40:   0.01%  98.68% avg cycles: 11626 cycles/page:  290 samples: 78
          41:   0.04%  98.73% avg cycles: 11435 cycles/page:  278 samples: 477
          42:   0.01%  98.73% avg cycles: 12571 cycles/page:  299 samples: 74
          43:   0.01%  98.74% avg cycles: 12562 cycles/page:  292 samples: 78
          44:   0.01%  98.75% avg cycles: 12991 cycles/page:  295 samples: 108
          45:   0.01%  98.76% avg cycles: 13169 cycles/page:  292 samples: 78
          46:   0.02%  98.78% avg cycles: 12891 cycles/page:  280 samples: 261
          47:   0.01%  98.79% avg cycles: 13099 cycles/page:  278 samples: 67
          48:   0.01%  98.80% avg cycles: 13851 cycles/page:  288 samples: 77
          49:   0.01%  98.80% avg cycles: 13749 cycles/page:  280 samples: 66
          50:   0.01%  98.81% avg cycles: 13949 cycles/page:  278 samples: 73
          52:   0.00%  98.82% avg cycles: 14243 cycles/page:  273 samples: 52
          54:   0.01%  98.83% avg cycles: 15312 cycles/page:  283 samples: 87
          55:   0.01%  98.84% avg cycles: 15197 cycles/page:  276 samples: 109
          56:   0.02%  98.86% avg cycles: 15234 cycles/page:  272 samples: 208
          57:   0.00%  98.86% avg cycles: 14888 cycles/page:  261 samples: 53
          58:   0.01%  98.87% avg cycles: 15037 cycles/page:  259 samples: 59
          59:   0.01%  98.87% avg cycles: 15752 cycles/page:  266 samples: 63
          62:   0.00%  98.89% avg cycles: 16222 cycles/page:  261 samples: 54
          64:   0.02%  98.91% avg cycles: 17179 cycles/page:  268 samples: 248
          65:   0.12%  99.03% avg cycles: 18762 cycles/page:  288 samples: 1324
          85:   0.00%  99.10% avg cycles: 21649 cycles/page:  254 samples: 50
         127:   0.01%  99.18% avg cycles: 32397 cycles/page:  255 samples: 75
         128:   0.13%  99.31% avg cycles: 31711 cycles/page:  247 samples: 1466
         129:   0.18%  99.49% avg cycles: 33017 cycles/page:  255 samples: 1927
         181:   0.33%  99.84% avg cycles:  2489 cycles/page:   13 samples: 3547
         256:   0.05%  99.91% avg cycles:  2305 cycles/page:    9 samples: 550
         512:   0.03%  99.95% avg cycles:  2133 cycles/page:    4 samples: 304
        1512:   0.01%  99.99% avg cycles:  3038 cycles/page:    2 samples: 65
      
      Here are the tlb counters during a 10-second slice of a kernel compile
      for a SandyBridge system.  It's better than IvyBridge, probably
      due to the larger caches, since this was one of the 'X' extreme parts.
      
          10,873,007,282      dtlb_load_misses_walk_duration
             250,711,333      dtlb_load_misses_walk_completed
           1,212,395,865      dtlb_store_misses_walk_duration
              31,615,772      dtlb_store_misses_walk_completed
           5,091,010,274      itlb_misses_walk_duration
             163,193,511      itlb_misses_walk_completed
               1,321,980      itlb_itlb_flush
      
            10.008045158 seconds time elapsed
      
      # cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss/3.3, " dtlb cyc/miss: ", dcyc/dmiss/3.3, "   -----    ", icyc,imiss, dcyc,dmiss }'
      itlb ns/miss:  9.45338  dtlb ns/miss:  12.9716
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20140731154103.10C1115E@viggo.jf.intel.com
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      a5102476
    • x86/mm: New tunable for single vs full TLB flush · 2d040a1c
      Authored by Dave Hansen
      Most of the logic here is in the documentation file.  Please take
      a look at it.
      
      I know we've come full-circle here back to a tunable, but this
      new one is *WAY* simpler.  I challenge anyone to describe in one
      sentence how the old one worked.  Here's the way the new one
      works:
      
      	If we are flushing more pages than the ceiling, we use
      	the full flush, otherwise we use per-page flushes.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20140731154101.12B52CAF@viggo.jf.intel.com
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      2d040a1c
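
      In code, the new policy is essentially a single comparison; a paraphrase of
      the logic, not the exact hunk (flush_range_example() is a made-up wrapper,
      and the helper and ceiling-variable names are assumed, not quoted from the
      patch):

          static void flush_range_example(unsigned long start, unsigned long end)
          {
                  unsigned long nr_pages = (end - start) >> PAGE_SHIFT;

                  if (nr_pages > tlb_single_page_flush_ceiling) {
                          local_flush_tlb();                 /* one full flush */
                  } else {
                          unsigned long addr;

                          for (addr = start; addr < end; addr += PAGE_SIZE)
                                  __flush_tlb_single(addr);  /* one invlpg per page */
                  }
          }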
    • x86/mm: Add tracepoints for TLB flushes · d17d8f9d
      Authored by Dave Hansen
      We don't have any good way to figure out what kinds of flushes
      are being attempted.  Right now, we can try to use the vm
      counters, but those only tell us what we actually did with the
      hardware (one-by-one vs full) and don't tell us what was actually
      _requested_.
      
      This allows us to select out "interesting" TLB flushes that we
      might want to optimize (like the ranged ones) and ignore the ones
      that we have very little control over (the ones at context
      switch).
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20140731154059.4C96CBA5@viggo.jf.intel.com
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      d17d8f9d
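
      As a sketch, a flush path records what was requested at the point of
      decision (the header paths and the reason/event names below are assumptions
      about what this series adds, not verified identifiers):

          #include <linux/mm_types.h>    /* enum tlb_flush_reason (assumed) */
          #include <trace/events/tlb.h>  /* trace_tlb_flush() (assumed)     */

          static void local_flush_example(unsigned long nr_pages)
          {
                  /* Log the request itself, not just what the hardware ended up */
                  /* doing, so ranged flushes can be picked out and optimized.   */
                  trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, nr_pages);
          }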
    • x86/mm: Unify remote INVLPG code · a23421f1
      Authored by Dave Hansen
      There are currently three paths through the remote flush code:
      
      1. full invalidation
      2. single page invalidation using invlpg
      3. ranged invalidation using invlpg
      
      This takes 2 and 3 and combines them into a single path by
      making the single-page case simply pass a range whose end is the
      start plus a single page.  This makes placement of our tracepoint easier.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20140731154058.E0F90408@viggo.jf.intel.com
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      a23421f1
    • x86/mm: Fix missed global TLB flush stat · 9dfa6dee
      Authored by Dave Hansen
      If we take the
      
      	if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
      		local_flush_tlb();
      		goto out;
      	}
      
      path out of flush_tlb_mm_range(), we will have flushed the tlb,
      but not incremented NR_TLB_LOCAL_FLUSH_ALL.  This unifies the
      way out of the function so that we always take a single path when
      doing a full tlb flush.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20140731154056.FF763B76@viggo.jf.intel.com
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      9dfa6dee
    • x86/mm: Rip out complicated, out-of-date, buggy TLB flushing · e9f4e0a9
      Authored by Dave Hansen
      I think the flush_tlb_mm_range() code that tries to tune the
      flush sizes based on the CPU needs to get ripped out for
      several reasons:
      
      1. It is obviously buggy.  It uses mm->total_vm to judge the
         task's footprint in the TLB.  It should certainly be using
         some measure of RSS, *NOT* ->total_vm since only resident
         memory can populate the TLB.
      2. Haswell, and several other CPUs are missing from the
         intel_tlb_flushall_shift_set() function.  Thus, it has been
         demonstrated to bitrot quickly in practice.
      3. It is plain wrong in my vm:
      	[    0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
      	[    0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
      	[    0.037444] tlb_flushall_shift: 6
         Which leads it to never use invlpg.
      4. The assumptions about TLB refill costs are wrong:
      	http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com
          (more on this in later patches)
      5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59
         I believe the sample times were too short.  Running the
         benchmark in a loop yields times that vary quite a bit.
      
      Note that this leaves us with a static ceiling of 1 page.  This
      is a conservative, dumb setting, and will be revised in a later
      patch.
      
      This also removes the code which attempts to predict whether we
      are flushing data or instructions.  We expect instruction flushes
      to be relatively rare and not worth tuning for explicitly.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20140731154055.ABC88E89@viggo.jf.intel.com
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      e9f4e0a9
    • x86/mm: Clean up the TLB flushing code · 4995ab9c
      Authored by Dave Hansen
      The
      
      	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
      
      line of code is not exactly the easiest to audit, especially when
      it ends up at two different indentation levels.  This eliminates
      one of the copy-n-paste versions.  It also gives us a unified
      exit point for each path through this function.  We need this in
      a minute for our tracepoint.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20140731154054.44F1CDDC@viggo.jf.intel.com
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      4995ab9c
    • x86/kvm: Resolve shadow warnings in macro expansion · 42cbc04f
      Authored by Mark D Rustad
      Resolve shadow warnings that appear in W=2 builds. Instead of
      using ret to hold the return pointer, save the length in a new
      variable saved_len and compute the pointer on exit. This also
      resolves a very technical error, in that ret was declared as
      a const char *, when it really was a char * const.
      Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      42cbc04f
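
      The general shape of such a fix, as a generic illustration (this is not the
      kvm macro itself; SKIP_PREFIX() is invented purely to show the pattern):

          #include <stdio.h>

          /* Before: the macro declared its own 'ret', shadowing a 'ret' at the */
          /* call site under W=2.  After: keep only the length in a distinctly  */
          /* named temporary and compute the pointer once, on exit.             */
          #define SKIP_PREFIX(buf, pfxlen) ({          \
                  size_t saved_len = (pfxlen);         \
                  (buf) + saved_len;                   \
          })

          int main(void)
          {
                  const char *ret = SKIP_PREFIX("kvm: message", 5);  /* no shadow warning */
                  puts(ret);
                  return 0;
          }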
  6. 30 Jul 2014 (2 commits)
  7. 29 Jul 2014 (1 commit)
  8. 26 Jul 2014 (5 commits)
  9. 25 Jul 2014 (2 commits)
  10. 24 Jul 2014 (9 commits)