1. 09 4月, 2014 1 次提交
  2. 08 4月, 2014 36 次提交
    • M
      mm: create generic early_ioremap() support · 9e5c33d7
      Mark Salter 提交于
      This patch creates a generic implementation of early_ioremap() support
      based on the existing x86 implementation.  early_ioremp() is useful for
      early boot code which needs to temporarily map I/O or memory regions
      before normal mapping functions such as ioremap() are available.
      
      Some architectures have optional MMU.  In the no-MMU case, the remap
      functions simply return the passed in physical address and the unmap
      functions do nothing.
      Signed-off-by: NMark Salter <msalter@redhat.com>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9e5c33d7
    • J
      lglock: map to spinlock when !CONFIG_SMP · 64b47e8f
      Josh Triplett 提交于
      When the system has only one CPU, lglock is effectively a spinlock; map
      it directly to spinlock to eliminate the indirection and duplicate code.
      
      In addition to removing overhead, this drops 1.6k of code with a
      defconfig modified to have !CONFIG_SMP, and 1.1k with a minimal config.
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64b47e8f
    • C
      percpu: add preemption checks to __this_cpu ops · 188a8140
      Christoph Lameter 提交于
      We define a check function in order to avoid trouble with the include
      files.  Then the higher level __this_cpu macros are modified to invoke
      the preemption check.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Tested-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      188a8140
    • C
      vmstat: use raw_cpu_ops to avoid false positives on preemption checks · 293b6a4c
      Christoph Lameter 提交于
      vm counters are allowed to be racy.  Use raw_cpu_ops to avoid the
      local_irq_disable overhead and to avoid preemption checks which will be
      added to the __this_cpu operations.
      
      [akpm@linux-foundation.org: Add comment.  Again.]
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Reported-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      293b6a4c
    • C
      mm: use raw_cpu ops for determining current NUMA node · dc322a99
      Christoph Lameter 提交于
      With the preempt checking logic for __this_cpu_ops we will get false
      positives from locations in the code that use numa_node_id.
      
      Before the __this_cpu ops where introduced there were no checks for
      preemption present either.  smp_raw_processor_id() was used.  See
      
        http://www.spinics.net/lists/linux-numa/msg00641.html
      
      Therefore we need to use raw_cpu_read here to avoid false postives.
      
      Note that this issue has been discussed in prior years.  If the process
      changes nodes after retrieving the current numa node then that is
      acceptable since most uses of numa_node etc are for optimization and not
      for correctness.
      
      There were suggestions to implement a raw_numa_node_id in order to do
      preempt checks for numa_node_id as well.  But I think we better defer
      that to another patch since that would mean investigating how
      numa_node_id() is used throughout the kernel which would increase the
      scope of this patchset significantly.  After all preemption was never
      checked before when numa_node_id() was used.
      
      Some sample traces:
      
      __this_cpu_read operation in preemptible [00000000] code: login/1456
      caller is __this_cpu_preempt_check+0x2b/0x2d
      CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3b-dirty #185
      Call Trace:
        dump_stack+0x4e/0x82
        check_preemption_disabled+0xc5/0xe0
        __this_cpu_preempt_check+0x2b/0x2d
        get_task_policy+0x1d/0x49
        get_vma_policy+0x14/0x76
        alloc_pages_vma+0x35/0xff
        handle_mm_fault+0x290/0x73b
        __do_page_fault+0x3fe/0x44d
        do_page_fault+0x9/0xc
        page_fault+0x22/0x30
        generic_file_aio_read+0x38e/0x624
        do_sync_read+0x54/0x73
        vfs_read+0x9d/0x12a
        SyS_read+0x47/0x7e
        cstar_dispatch+0x7/0x23
      
      caller is __this_cpu_preempt_check+0x2b/0x2d
      CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3b-dirty #185
      Call Trace:
        dump_stack+0x4e/0x82
        check_preemption_disabled+0xc5/0xe0
        __this_cpu_preempt_check+0x2b/0x2d
        alloc_pages_current+0x8f/0xbc
        __page_cache_alloc+0xb/0xd
        __do_page_cache_readahead+0xf4/0x219
        ra_submit+0x1c/0x20
        ondemand_readahead+0x28c/0x2b4
        page_cache_sync_readahead+0x38/0x3a
        generic_file_aio_read+0x261/0x624
        do_sync_read+0x54/0x73
        vfs_read+0x9d/0x12a
        SyS_read+0x47/0x7e
        cstar_dispatch+0x7/0x23
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Alex Shi <alex.shi@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc322a99
    • C
      percpu: add raw_cpu_ops · b3ca1c10
      Christoph Lameter 提交于
      The kernel has never been audited to ensure that this_cpu operations are
      consistently used throughout the kernel.  The code generated in many
      places can be improved through the use of this_cpu operations (which
      uses a segment register for relocation of per cpu offsets instead of
      performing address calculations).
      
      The patch set also addresses various consistency issues in general with
      the per cpu macros.
      
      A. The semantics of __this_cpu_ptr() differs from this_cpu_ptr only
         because checks are skipped. This is typically shown through a raw_
         prefix. So this patch set changes the places where __this_cpu_ptr()
         is used to raw_cpu_ptr().
      
      B. There has been the long term wish by some that __this_cpu operations
         would check for preemption. However, there are cases where preemption
         checks need to be skipped. This patch set adds raw_cpu operations that
         do not check for preemption and then adds preemption checks to the
         __this_cpu operations.
      
      C. The use of __get_cpu_var is always a reference to a percpu variable
         that can also be handled via a this_cpu operation. This patch set
         replaces all uses of __get_cpu_var with this_cpu operations.
      
      D. We can then use this_cpu RMW operations in various places replacing
         sequences of instructions by a single one.
      
      E. The use of this_cpu operations throughout will allow other arches than
         x86 to implement optimized references and RMV operations to work with
         per cpu local data.
      
      F. The use of this_cpu operations opens up the possibility to
         further optimize code that relies on synchronization through
         per cpu data.
      
      The patch set works in a couple of stages:
      
      I. Patch 1 adds the additional raw_cpu operations and raw_cpu_ptr().
          Also converts the existing __this_cpu_xx_# primitive in the x86
          code to raw_cpu_xx_#.
      
      II. Patch 2-4 use the raw_cpu operations in places that would give
           us false positives once they are enabled.
      
      III. Patch 5 adds preemption checks to __this_cpu operations to allow
          checking if preemption is properly disabled when these functions
          are used.
      
      IV. Patches 6-20 are patches that simply replace uses of __get_cpu_var
         with this_cpu_ptr. They do not depend on any changes to the percpu
         code. No preemption tests are skipped if they are applied.
      
      V. Patches 21-46 are conversion patches that use this_cpu operations
         in various kernel subsystems/drivers or arch code.
      
      VI.  Patches 47/48 (not included in this series) remove no longer used
          functions (__this_cpu_ptr and __get_cpu_var).  These should only be
          applied after all the conversion patches have made it and after we
          have done additional passes through the kernel to ensure that none of
          the uses of these functions remain.
      
      This patch (of 46):
      
      The patches following this one will add preemption checks to __this_cpu
      ops so we need to have an alternative way to use this_cpu operations
      without preemption checks.
      
      raw_cpu_ops will be the basis for all other ops since these will be the
      operations that do not implement any checks.
      
      Primitive operations are renamed by this patch from __this_cpu_xxx to
      raw_cpu_xxxx.
      
      Also change the uses of the x86 percpu primitives in preempt.h.
      These depend directly on asm/percpu.h (header #include nesting issue).
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Alex Shi <alex.shi@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bryan Wu <cooloney@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Hedi Berriche <hedi@sgi.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wim Van Sebroeck <wim@iguana.be>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3ca1c10
    • V
      slub: rework sysfs layout for memcg caches · 9a41707b
      Vladimir Davydov 提交于
      Currently, we try to arrange sysfs entries for memcg caches in the same
      manner as for global caches.  Apart from turning /sys/kernel/slab into a
      mess when there are a lot of kmem-active memcgs created, it actually
      does not work properly - we won't create more than one link to a memcg
      cache in case its parent is merged with another cache.  For instance, if
      A is a root cache merged with another root cache B, we will have the
      following sysfs setup:
      
        X
        A -> X
        B -> X
      
      where X is some unique id (see create_unique_id()).  Now if memcgs M and
      N start to allocate from cache A (or B, which is the same), we will get:
      
        X
        X:M
        X:N
        A -> X
        B -> X
        A:M -> X:M
        A:N -> X:N
      
      Since B is an alias for A, we won't get entries B:M and B:N, which is
      confusing.
      
      It is more logical to have entries for memcg caches under the
      corresponding root cache's sysfs directory.  This would allow us to keep
      sysfs layout clean, and avoid such inconsistencies like one described
      above.
      
      This patch does the trick.  It creates a "cgroup" kset in each root
      cache kobject to keep its children caches there.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a41707b
    • V
      memcg, slab: do not destroy children caches if parent has aliases · b8529907
      Vladimir Davydov 提交于
      Currently we destroy children caches at the very beginning of
      kmem_cache_destroy().  This is wrong, because the root cache will not
      necessarily be destroyed in the end - if it has aliases (refcount > 0),
      kmem_cache_destroy() will simply decrement its refcount and return.  In
      this case, at best we will get a bunch of warnings in dmesg, like this
      one:
      
        kmem_cache_destroy kmalloc-32:0: Slab cache still has objects
        CPU: 1 PID: 7139 Comm: modprobe Tainted: G    B   W    3.13.0+ #117
        Call Trace:
          dump_stack+0x49/0x5b
          kmem_cache_destroy+0xdf/0xf0
          kmem_cache_destroy_memcg_children+0x97/0xc0
          kmem_cache_destroy+0xf/0xf0
          xfs_mru_cache_uninit+0x21/0x30 [xfs]
          exit_xfs_fs+0x2e/0xc44 [xfs]
          SyS_delete_module+0x198/0x1f0
          system_call_fastpath+0x16/0x1b
      
      At worst - if kmem_cache_destroy() will race with an allocation from a
      memcg cache - the kernel will panic.
      
      This patch fixes this by moving children caches destruction after the
      check if the cache has aliases.  Plus, it forbids destroying a root
      cache if it still has children caches, because each children cache keeps
      a reference to its parent.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8529907
    • V
      memcg, slab: separate memcg vs root cache creation paths · 794b1248
      Vladimir Davydov 提交于
      Memcg-awareness turned kmem_cache_create() into a dirty interweaving of
      memcg-only and except-for-memcg calls.  To clean this up, let's move the
      code responsible for memcg cache creation to a separate function.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      794b1248
    • V
      memcg, slab: cleanup memcg cache creation · 5722d094
      Vladimir Davydov 提交于
      This patch cleans up the memcg cache creation path as follows:
      
      - Move memcg cache name creation to a separate function to be called
        from kmem_cache_create_memcg().  This allows us to get rid of the mutex
        protecting the temporary buffer used for the name formatting, because
        the whole cache creation path is protected by the slab_mutex.
      
      - Get rid of memcg_create_kmem_cache().  This function serves as a proxy
        to kmem_cache_create_memcg().  After separating the cache name creation
        path, it would be reduced to a function call, so let's inline it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5722d094
    • U
      Kconfig: rename HAS_IOPORT to HAS_IOPORT_MAP · ce816fa8
      Uwe Kleine-König 提交于
      If the renamed symbol is defined lib/iomap.c implements ioport_map and
      ioport_unmap and currently (nearly) all platforms define the port
      accessor functions outb/inb and friend unconditionally.  So
      HAS_IOPORT_MAP is the better name for this.
      
      Consequently NO_IOPORT is renamed to NO_IOPORT_MAP.
      
      The motivation for this change is to reintroduce a symbol HAS_IOPORT
      that signals if outb/int et al are available.  I will address that at
      least one merge window later though to keep surprises to a minimum and
      catch new introductions of (HAS|NO)_IOPORT.
      
      The changes in this commit were done using:
      
      	$ git grep -l -E '(NO|HAS)_IOPORT' | xargs perl -p -i -e 's/\b((?:CONFIG_)?(?:NO|HAS)_IOPORT)\b/$1_MAP/'
      Signed-off-by: NUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce816fa8
    • J
      bug: Make BUG() always stop the machine · a4b5d580
      Josh Triplett 提交于
      When !CONFIG_BUG and !HAVE_ARCH_BUG, define the generic BUG() as an
      infinite loop rather than a no-op.  This avoids undefined behavior if
      execution ever actually reaches BUG(), and avoids warnings about code
      after BUG() (such as on non-void functions calling BUG() and then not
      returning).
      
      bloat-o-meter results:
      
        add/remove: 0/0 grow/shrink: 43/10 up/down: 235/-98 (137)
        function                             old     new   delta
        umount_collect                       119     138     +19
        notify_change                        306     324     +18
        xstate_enable_boot_cpu               252     269     +17
        kunmap                                54      70     +16
        balloon_page_dequeue                 112     126     +14
        mm_take_all_locks                    223     233     +10
        list_lru_walk_node                   143     152      +9
        vma_adjust                          1059    1067      +8
        pcpu_setup_first_chunk              1130    1138      +8
        mm_drop_all_locks                    143     151      +8
        ns_capable                            55      62      +7
        anon_transport_class_unregister        8      15      +7
        srcu_init_notifier_head               35      41      +6
        shrink_dcache_for_umount             174     180      +6
        kunmap_high                           99     105      +6
        end_page_writeback                    43      49      +6
        do_exit                             1339    1345      +6
        __kfifo_dma_out_prepare_r             86      92      +6
        __kfifo_dma_in_prepare_r              90      96      +6
        fixup_user_fault                     120     125      +5
        repair_env_string                     73      77      +4
        read_cache_pages_invalidate_page      56      60      +4
        isolate_lru_pages.isra               142     146      +4
        do_notify_parent_cldstop             255     259      +4
        cpu_init                             370     374      +4
        utimes_common                        270     272      +2
        tasklet_hi_action                     91      93      +2
        tasklet_action                        91      93      +2
        set_pte_vaddr                         46      48      +2
        find_get_pages_tag                   202     204      +2
        early_iounmap                        185     187      +2
        __native_set_fixmap                   36      38      +2
        __get_user_pages                     822     824      +2
        __early_ioremap                      299     301      +2
        yield_task_stop                        1       2      +1
        tick_resume                           37      38      +1
        switched_to_stop                       1       2      +1
        switched_to_idle                       1       2      +1
        prio_changed_stop                      1       2      +1
        prio_changed_idle                      1       2      +1
        pm_qos_power_read                    111     112      +1
        arch_cpu_idle_dead                     1       2      +1
        __insert_vmap_area                   140     141      +1
        sys_renameat                         614     612      -2
        mm_fault_error                       297     295      -2
        SyS_renameat                         614     612      -2
        sys_linkat                           416     413      -3
        SyS_linkat                           416     413      -3
        chmod_common                         129     122      -7
        proc_cap_handler                     240     225     -15
        __schedule                           849     831     -18
        sys_madvise                         1077    1054     -23
        SyS_madvise                         1077    1054     -23
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Reported-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a4b5d580
    • J
      bug: when !CONFIG_BUG, make WARN call no_printk to check format and args · 4e50ebde
      Josh Triplett 提交于
      The stub version of WARN for !CONFIG_BUG completely ignored its format
      string and subsequent arguments; make it check them instead, using
      no_printk.
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Reported-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e50ebde
    • J
    • J
      bug: when !CONFIG_BUG, simplify WARN_ON_ONCE and family · b607e70e
      Josh Triplett 提交于
      When !CONFIG_BUG, WARN_ON and family become simple passthroughs of their
      condition argument; however, WARN_ON_ONCE and family still have conditions
      and a boolean to detect one-time invocation, even though the warning
      they'd emit doesn't exist.  Make the existing definitions conditional on
      CONFIG_BUG, and add definitions for !CONFIG_BUG that map to the
      passthrough versions of WARN and WARN_ON.
      
      This saves 4.4k on a minimized configuration (smaller than allnoconfig),
      and 20.6k with defconfig plus CONFIG_BUG=n.
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b607e70e
    • A
      rapidio: rework device hierarchy and introduce mport class of devices · 2aaf308b
      Alexandre Bounine 提交于
      This patch removes an artificial RapidIO bus root device and establishes
      actual device hierarchy by providing reference to real parent devices.
      It also introduces device class for RapidIO controller devices (on-chip
      or an eternal bridge, known as "mport").
      
      Existing implementation was sufficient for SoC-based platforms that have
      a single RapidIO controller.  With introduction of devices using
      multiple RapidIO controllers and PCIe-to-RapidIO bridges the old scheme
      is very limiting or does not work at all.  The implemented changes allow
      to properly reference platform's local RapidIO mport devices and provide
      device details needed for upper layers.
      
      This change to RapidIO device hierarchy does not break any known
      existing kernel or user space interfaces.
      Signed-off-by: NAlexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
      Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
      Cc: Jerry Jacobs <jerry.jacobs@prodrive-technologies.com>
      Cc: Arno Tiemersma <arno.tiemersma@prodrive-technologies.com>
      Cc: Rob Landley <rob@landley.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2aaf308b
    • S
      idr: remove dead code · 90ae3ae5
      Stephen Hemminger 提交于
      Remove no longer used deprecated code, and make local functions
      static.
      Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
      Acked-by: NJean Delvare <jdelvare@suse.de>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: George Spelvin <linux@horizon.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90ae3ae5
    • R
      include/linux/crash_dump.h: add vmcore_cleanup() prototype · 82e0703b
      Rashika Kheria 提交于
      Eliminate the following warning in proc/vmcore.c:
      
        fs/proc/vmcore.c:1088:6: warning: no previous prototype for `vmcore_cleanup' [-Wmissing-prototypes]
      
      [akpm@linux-foundation.org: clean up powerpc, remove unneeded EXPORT_SYMBOL]
      Signed-off-by: NRashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82e0703b
    • O
      wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide EXIT_TRACE from user-space · ad86622b
      Oleg Nesterov 提交于
      get_task_state() uses the most significant bit to report the state to
      user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition
      can be noticed via /proc as Z -> X -> Z change.  Note that this was
      possible even before EXIT_TRACE was introduced.
      
      This is not really bad but imho it make sense to hide EXIT_TRACE from
      user-space completely.  So the patch simply swaps EXIT_ZOMBIE and
      EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad86622b
    • O
      wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transition · abd50b39
      Oleg Nesterov 提交于
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which were missed before: this transition
      can also race with reparent_leader() which doesn't reset >exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.  This was fixed by
      the previous commit, but it was the temporary hack.
      
      1. Add the new exit_state, EXIT_TRACE. It means that the task is the
         traced zombie, debugger is going to detach and notify its natural
         parent.
      
         This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we
         can avoid the changes in proc/kgdb code, get_task_state() still
         reports "X (dead)" in this case.
      
         Note: with or without this change userspace can see Z -> X -> Z
         transition. Not really bad, but probably makes sense to fix.
      
      2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD
         if we need to notify the ->real_parent.
      
      3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD
         is always the final state we can safely ignore such a task.
      
      4. Change wait_consider_task() to check EXIT_TRACE separately and kill
         the racy and no longer needed ptrace_reparented() case.
      
         If ptrace == T an EXIT_TRACE thread should be simply ignored, the
         owner of this state is going to ptrace_unlink() this task. We can
         pretend that it was already removed from ->ptraced list.
      
         Otherwise we should skip this thread too but clear ->notask_error,
         we must be the natural parent and debugger is going to untrace and
         notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace"
         even if the task was already untraced.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reported-by: NJan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: NMichal Schmidt <mschmidt@redhat.com>
      Tested-by: NMichal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      abd50b39
    • O
      exec: kill bprm->tcomm[], simplify the "basename" logic · 23aebe16
      Oleg Nesterov 提交于
      Starting from commit c4ad8f98 ("execve: use 'struct filename *' for
      executable name passing") bprm->filename can not go away after
      flush_old_exec(), so we do not need to save the binary name in
      bprm->tcomm[] added by 96e02d15 ("exec: fix use-after-free bug in
      setup_new_exec()").
      
      And there was never need for filename_to_taskname-like code, we can
      simply do set_task_comm(kbasename(filename).
      
      This patch has to change set_task_comm() and trace_task_rename() to
      accept "const char *", but I think this change is also good.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23aebe16
    • S
      numa: use LAST_CPUPID_SHIFT to calculate LAST_CPUPID_MASK · 834a964a
      Srikar Dronamraju 提交于
      LAST_CPUPID_MASK is calculated using LAST_CPUPID_WIDTH.  However
      LAST_CPUPID_WIDTH itself can be 0.  (when LAST_CPUPID_NOT_IN_PAGE_FLAGS is
      set).  In such a case LAST_CPUPID_MASK turns out to be 0.
      
      But with recent commit 1ae71d03: (mm: numa: bugfix for
      LAST_CPUPID_NOT_IN_PAGE_FLAGS) if LAST_CPUPID_MASK is 0,
      page_cpupid_xchg_last() and page_cpupid_reset_last() causes
      page->_last_cpupid to be set to 0.
      
      This causes performance regression. Its almost as if numa_balancing is
      off.
      
      Fix LAST_CPUPID_MASK by using LAST_CPUPID_SHIFT instead of
      LAST_CPUPID_WIDTH.
      
      Some performance numbers and perf stats with and without the fix.
      
      (3.14-rc6)
      ----------
      numa01
      
       Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
      
               12,27,462 cs                                                           [100.00%]
                2,41,957 migrations                                                   [100.00%]
             1,68,01,713 faults                                                       [100.00%]
          7,99,35,29,041 cache-misses
                  98,808 migrate:mm_migrate_pages                                     [100.00%]
      
          1407.690148814 seconds time elapsed
      
      numa02
      
       Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':
      
                  63,065 cs                                                           [100.00%]
                  14,364 migrations                                                   [100.00%]
                2,08,118 faults                                                       [100.00%]
            25,32,59,404 cache-misses
                      12 migrate:mm_migrate_pages                                     [100.00%]
      
            63.840827219 seconds time elapsed
      
      (3.14-rc6 with fix)
      -------------------
      numa01
      
       Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
      
                9,68,911 cs                                                           [100.00%]
                1,01,414 migrations                                                   [100.00%]
               88,38,697 faults                                                       [100.00%]
          4,42,92,51,042 cache-misses
                4,25,060 migrate:mm_migrate_pages                                     [100.00%]
      
           685.965331189 seconds time elapsed
      
      numa02
      
       Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':
      
                  17,543 cs                                                           [100.00%]
                   2,962 migrations                                                   [100.00%]
                1,17,843 faults                                                       [100.00%]
            11,80,61,644 cache-misses
                  12,358 migrate:mm_migrate_pages                                     [100.00%]
      
            20.380132343 seconds time elapsed
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      834a964a
    • Z
      madvise: correct the comment of MADV_DODUMP flag · 85892f19
      Zhang Yanfei 提交于
      s/MADV_NODUMP/MADV_DONTDUMP/
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      85892f19
    • F
      mm/readahead.c: inline ra_submit · 29f175d1
      Fabian Frederick 提交于
      Commit f9acc8c7 ("readahead: sanify file_ra_state names") left
      ra_submit with a single function call.
      
      Move ra_submit to internal.h and inline it to save some stack.  Thanks
      to Andrew Morton for commenting different versions.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29f175d1
    • M
      mm: remove unused arg of set_page_dirty_balance() · ed6d7c8e
      Miklos Szeredi 提交于
      There's only one caller of set_page_dirty_balance() and that will call it
      with page_mkwrite == 0.
      
      The page_mkwrite argument was unused since commit b827e496 "mm: close
      page_mkwrite races".
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed6d7c8e
    • M
      memcg: rename high level charging functions · d715ae08
      Michal Hocko 提交于
      mem_cgroup_newpage_charge is used only for charging anonymous memory so
      it is better to rename it to mem_cgroup_charge_anon.
      
      mem_cgroup_cache_charge is used for file backed memory so rename it to
      mem_cgroup_charge_file.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d715ae08
    • J
      memcg: get_mem_cgroup_from_mm() · df381975
      Johannes Weiner 提交于
      Instead of returning NULL from try_get_mem_cgroup_from_mm() when the mm
      owner is exiting, just return root_mem_cgroup.  This makes sense for all
      callsites and gets rid of some of them having to fallback manually.
      
      [fengguang.wu@intel.com: fix warnings]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df381975
    • K
      mm: use 'const char *' insted of 'char *' for reason in dump_page() · d230dec1
      Kirill A. Shutemov 提交于
      I tried to use 'dump_page(page, __func__)' for debugging, but it triggers
      warning:
      
        warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default]
      
      Let's convert 'reason' to 'const char *' in dump_page() and friends: we
      shouldn't modify it anyway.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d230dec1
    • D
      res_counter: remove interface for locked charging and uncharging · 539a13b4
      David Rientjes 提交于
      The res_counter_{charge,uncharge}_locked() variants are not used in the
      kernel outside of the resource counter code itself, so remove the
      interface.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      539a13b4
    • D
      mm, mempolicy: remove per-process flag · f0432d15
      David Rientjes 提交于
      PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
      There's no significant performance degradation to checking
      current->mempolicy rather than current->flags & PF_MEMPOLICY in the
      allocation path, especially since this is considered unlikely().
      
      Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
      64GB of memory and without a mempolicy:
      
      	threads		before		after
      	16		1249409		1244487
      	32		1281786		1246783
      	48		1239175		1239138
      	64		1244642		1241841
      	80		1244346		1248918
      	96		1266436		1254316
      	112		1307398		1312135
      	128		1327607		1326502
      
      Per-process flags are a scarce resource so we should free them up whenever
      possible and make them available.  We'll be using it shortly for memcg oom
      reserves.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0432d15
    • D
      mm, mempolicy: rename slab_node for clarity · 2a389610
      David Rientjes 提交于
      slab_node() is actually a mempolicy function, so rename it to
      mempolicy_slab_node() to make it clearer that it used for processes with
      mempolicies.
      
      At the same time, cleanup its code by saving numa_mem_id() in a local
      variable (since we require a node with memory, not just any node) and
      remove an obsolete comment that assumes the mempolicy is actually passed
      into the function.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a389610
    • D
      mm: per-thread vma caching · 615d6e87
      Davidlohr Bueso 提交于
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches were needed.  There are
      two things to consider when dealing with this, the cache hit rate and
      the latency of find_vma().  Improving the hit-rate does not necessarily
      translate in finding the vma any faster, as the overhead of any fancy
      caching schemes can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme does improve ~50% hit rate by just adding a few more slots to
         the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while baseline is just
         about non-existent.  The amounts of cycles can fluctuate between
         anywhere from ~60 to ~116 for the baseline scheme, but this approach
         reduces it considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NMichel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      615d6e87
    • K
      mm: implement ->map_pages for page cache · f1820361
      Kirill A. Shutemov 提交于
      filemap_map_pages() is generic implementation of ->map_pages() for
      filesystems who uses page cache.
      
      It should be safe to use filemap_map_pages() for ->map_pages() if
      filesystem use filemap_fault() for ->fault().
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1820361
    • K
      mm: introduce vm_ops->map_pages() · 8c6e50b0
      Kirill A. Shutemov 提交于
      Here's new version of faultaround patchset.  It took a while to tune it
      and collect performance data.
      
      First patch adds new callback ->map_pages to vm_operations_struct.
      
      ->map_pages() is called when VM asks to map easy accessible pages.
      Filesystem should find and map pages associated with offsets from
      "pgoff" till "max_pgoff".  ->map_pages() is called with page table
      locked and must not block.  If it's not possible to reach a page without
      blocking, filesystem should skip it.  Filesystem should use do_set_pte()
      to setup page table entry.  Pointer to entry associated with offset
      "pgoff" is passed in "pte" field in vm_fault structure.  Pointers to
      entries for other offsets should be calculated relative to "pte".
      
      Currently VM use ->map_pages only on read page fault path.  We try to
      map FAULT_AROUND_PAGES a time.  FAULT_AROUND_PAGES is 16 for now.
      Performance data for different FAULT_AROUND_ORDER is below.
      
      TODO:
       - implement ->map_pages() for shmem/tmpfs;
       - modify get_user_pages() to be able to use ->map_pages() and implement
         mmap(MAP_POPULATE|MAP_NONBLOCK) on top.
      
      =========================================================================
      Tested on 4-socket machine (120 threads) with 128GiB of RAM.
      
      Few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is
      somewhere between 3 and 5. Let's say 4 :)
      
      Linux build (make -j60)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		283,301,572	247,151,987	212,215,789	204,772,882	199,568,944	194,703,779	193,381,485
      	time, seconds		151.227629483	153.920996480	151.356125472	150.863792049	150.879207877	151.150764954	151.450962358
      Linux rebuild (make -j60)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		5,396,854	4,148,444	2,855,286	2,577,282	2,361,957	2,169,573	2,112,643
      	time, seconds		27.404543757	27.559725591	27.030057426	26.855045126	26.678618635	26.974523490	26.761320095
      Git test suite (make -j60 test)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		129,591,823	99,200,751	66,106,718	57,606,410	51,510,808	45,776,813	44,085,515
      	time, seconds		66.087215026	64.784546905	64.401156567	65.282708668	66.034016829	66.793780811	67.237810413
      
      Two synthetic tests: access every word in file in sequential/random order.
      It doesn't improve much after FAULT_AROUND_ORDER == 4.
      
      Sequential access 16GiB file
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
       1 thread
      	minor-faults		4,195,437	2,098,275	525,068		262,251		131,170		32,856		8,282
      	time, seconds		7.250461742	6.461711074	5.493859139	5.488488147	5.707213983	5.898510832	5.109232856
       8 threads
      	minor-faults		33,557,540	16,892,728	4,515,848	2,366,999	1,423,382	442,732		142,339
      	time, seconds		16.649304881	9.312555263	6.612490639	6.394316732	6.669827501	6.75078944	6.371900528
       32 threads
      	minor-faults		134,228,222	67,526,810	17,725,386	9,716,537	4,763,731	1,668,921	537,200
      	time, seconds		49.164430543	29.712060103	12.938649729	10.175151004	11.840094583	9.594081325	9.928461797
       60 threads
      	minor-faults		251,687,988	126,146,952	32,919,406	18,208,804	10,458,947	2,733,907	928,217
      	time, seconds		86.260656897	49.626551828	22.335007632	17.608243696	16.523119035	16.339489186	16.326390902
       120 threads
      	minor-faults		503,352,863	252,939,677	67,039,168	35,191,827	19,170,091	4,688,357	1,471,862
      	time, seconds		124.589206333	79.757867787	39.508707872	32.167281632	29.972989292	28.729834575	28.042251622
      Random access 1GiB file
       1 thread
      	minor-faults		262,636		132,743		34,369		17,299		8,527		3,451		1,222
      	time, seconds		15.351890914	16.613802482	16.569227308	15.179220992	16.557356122	16.578247824	15.365266994
       8 threads
      	minor-faults		2,098,948	1,061,871	273,690		154,501		87,110		25,663		7,384
      	time, seconds		15.040026343	15.096933500	14.474757288	14.289129964	14.411537468	14.296316837	14.395635804
       32 threads
      	minor-faults		8,390,734	4,231,023	1,054,432	528,847		269,242		97,746		26,881
      	time, seconds		20.430433109	21.585235358	22.115062928	14.872878951	14.880856305	14.883370649	14.821261690
       60 threads
      	minor-faults		15,733,258	7,892,809	1,973,393	988,266		594,789		164,994		51,691
      	time, seconds		26.577302548	25.692397770	18.728863715	20.153026398	21.619101933	17.745086260	17.613215273
       120 threads
      	minor-faults		31,471,111	15,816,616	3,959,209	1,978,685	1,008,299	264,635		96,010
      	time, seconds		41.835322703	40.459786095	36.085306105	35.313894834	35.814445675	36.552633793	34.289210594
      
      Touch only one page in page table in 16GiB file
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
       1 thread
      	minor-faults		8,372		8,324		8,270		8,260		8,249		8,239		8,237
      	time, seconds		0.039892712	0.045369149	0.051846126	0.063681685	0.079095975	0.17652406	0.541213386
       8 threads
      	minor-faults		65,731		65,681		65,628		65,620		65,608		65,599		65,596
      	time, seconds		0.124159196	0.488600638	0.156854426	0.191901957	0.242631486	0.543569456	1.677303984
       32 threads
      	minor-faults		262,388		262,341		262,285		262,276		262,266		262,257		263,183
      	time, seconds		0.452421421	0.488600638	0.565020946	0.648229739	0.789850823	1.651584361	5.000361559
       60 threads
      	minor-faults		491,822		491,792		491,723		491,711		491,701		491,691		491,825
      	time, seconds		0.763288616	0.869620515	0.980727360	1.161732354	1.466915814	3.04041448	9.308612938
       120 threads
      	minor-faults		983,466		983,655		983,366		983,372		983,363		984,083		984,164
      	time, seconds		1.595846553	1.667902182	2.008959376	2.425380942	2.941368804	5.977807890	18.401846125
      
      This patch (of 2):
      
      Introduce new vm_ops callback ->map_pages() and uses it for mapping easy
      accessible pages around fault address.
      
      On read page fault, if filesystem provides ->map_pages(), we try to map up
      to FAULT_AROUND_PAGES pages around page fault address in hope to reduce
      number of minor page faults.
      
      We call ->map_pages first and use ->fault() as fallback if page by the
      offset is not ready to be mapped (cold page cache or something).
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8c6e50b0
    • A
      mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE · a0715cc2
      Alex Thorlton 提交于
      Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMs.  It
      also adds a prctl control which allows us to set the THP disable bit in
      mm->def_flags so that VMs will pick up the setting as they are created.
      Signed-off-by: NAlex Thorlton <athorlton@sgi.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0715cc2
    • A
      sched: remove sleep_on() and friends · b8780c36
      Arnd Bergmann 提交于
      This is the final piece in the puzzle, as all patches to remove the
      last users of \(interruptible_\|\)sleep_on\(_timeout\|\) have made it
      into the 3.15 merge window. The work was long overdue, and this
      interface in particular should not have survived the BKL removal
      that was done a couple of years ago.
      
      Citing Jon Corbet from http://lwn.net/2001/0201/kernel.php3":
      
       "[...] it was suggested that the janitors look for and fix all code
        that calls sleep_on() [...] since (1) almost all such code is
        incorrect, and (2) Linus has agreed that those functions should
        be removed in the 2.5 development series".
      
      We haven't quite made it for 2.5, but maybe we can merge this for 3.15.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8780c36
  3. 05 4月, 2014 3 次提交