1. 16 October 2015 (2 commits)
    • cgroup: make cgroup->nr_populated count the number of populated css_sets · 0de0942d
      Committed by Tejun Heo
      Currently, cgroup->nr_populated counts whether the cgroup has any
      css_sets linked to it and the number of children which have non-zero
      ->nr_populated.  This works because a css_set's refcnt converges with
      the number of tasks linked to it and thus there's no css_set linked to
      a cgroup if it doesn't have any live tasks.
      
      To help tracking resource usage of zombie tasks, putting the ref of
      css_set will be separated from disassociating the task from the
      css_set which means that a cgroup may have css_sets linked to it even
      when it doesn't have any live tasks.
      
      This patch updates cgroup->nr_populated so that for the cgroup itself
      it counts the number of css_sets which have tasks associated with them
      so that empty css_sets don't skew the populated test.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      0de0942d
    • cgroup: remove an unused parameter from cgroup_task_migrate() · b309e5b7
      Committed by Tejun Heo
      cgroup_task_migrate() no longer uses @old_cgrp.  Remove it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b309e5b7
  2. 26 September 2015 (1 commit)
    • cgroup: fix too early usage of static_branch_disable() · a3e72739
      Committed by Tejun Heo
      49d1dc4b ("cgroup: implement static_key based
      cgroup_subsys_enabled() and cgroup_subsys_on_dfl()") converted cgroup
      enabled test to use static_key; however, cgroup_disable() is called
      before the static_key subsystem itself is initialized, which leads to
      the following warning when the "cgroup_disable=" parameter is specified.
      
       WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:99 static_key_slow_dec+0x44/0x60()
       static_key_slow_dec used before call to jump_label_init
       ...
       Call Trace:
        [<ffffffff813b18c2>] dump_stack+0x44/0x62
        [<ffffffff8108dd52>] warn_slowpath_common+0x82/0xc0
        [<ffffffff8108ddec>] warn_slowpath_fmt+0x5c/0x80
        [<ffffffff8119c054>] static_key_slow_dec+0x44/0x60
        [<ffffffff81d826b6>] cgroup_disable+0xaf/0xd6
        [<ffffffff81d5f9de>] unknown_bootoption+0x8c/0x194
        [<ffffffff810b0c03>] parse_args+0x273/0x4a0
        [<ffffffff81d5fd67>] start_kernel+0x205/0x4b8
       ...
      
      Fix it by making cgroup_disable() record the subsystems to disable
      in cgroup_disable_mask and moving the actual application to
      cgroup_init(), which is late enough and is where the enabled state is
      first used.
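      A rough sketch of the approach described above; cgroup_disable_mask
      comes from this description, while the loop body and key array are
      assumed from context rather than quoted from the patch:
      
          /* record requested disables early; jump labels aren't usable yet */
          static u16 cgroup_disable_mask __initdata;
      
          /* later, in cgroup_init(), after jump_label_init() has run: */
          for_each_subsys(ss, ssid)
                  if (cgroup_disable_mask & (1 << ssid))
                          static_branch_disable(cgroup_subsys_enabled_key[ssid]);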
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Andrey Wagin <avagin@gmail.com>
      Link: http://lkml.kernel.org/g/CANaxB-yFuS4SA2znSvcKrO9L_CbHciHYW+o9bN8sZJ8eR9FxYA@mail.gmail.com
      Fixes: 49d1dc4b
      a3e72739
  3. 23 September 2015 (5 commits)
    • cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically · 10265075
      Committed by Tejun Heo
      cgroup_update_dfl_csses() is responsible for migrating processes when
      controllers are enabled or disabled on the default hierarchy.  As the
      css association changes for all the processes in the affected cgroups,
      this involves migrating multiple processes.
      
      Up until now, it was implemented by migrating process-by-process until
      the source css_sets are empty; however, this means that if a process
      fails to migrate after others have already succeeded, recovery is very
      tricky.  This was considered okay as subsystems weren't allowed to
      reject process migration on the default hierarchy; unfortunately,
      enforcing this policy turned out to be problematic for certain types
      of resources - realtime slices for now.
      
      As such, the default hierarchy is going to allow restricted failures
      during migration, and to support that this patch makes
      cgroup_update_dfl_csses() migrate all target processes atomically
      rather than one-by-one.  The preceding patches made subsystems ready
      for multi-process migration and factored out taskset operations making
      this almost trivial.  All tasks of the target processes are put in the
      same taskset and the migration operations are performed once which
      either fails or succeeds for all.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      10265075
    • cgroup: separate out taskset operations from cgroup_migrate() · adaae5dc
      Committed by Tejun Heo
      Currently, cgroup_migrate() implements a large part of the migration
      logic inline, including building the target taskset and actually
      migrating it.  This patch separates out the following taskset
      operations.
      
       CGROUP_TASKSET_INIT()		: taskset initializer
       cgroup_taskset_add()		: add a task to a taskset
       cgroup_taskset_migrate()	: migrate a taskset to the destination cgroup
      
      This will be used to implement atomic multi-process migration in
      cgroup_update_dfl_csses().  This is pure reorganization which doesn't
      introduce any functional changes.
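      As a rough illustration of how the new helpers are meant to compose
      (leader and dst_cgrp are placeholders and locking is omitted; this is
      a sketch based on the description, not a quote of the patch):
      
          struct cgroup_taskset tset = CGROUP_TASKSET_INIT(tset);
          struct task_struct *task;
          int ret;
      
          /* collect every thread of the process into one taskset */
          for_each_thread(leader, task)
                  cgroup_taskset_add(task, &tset);
      
          /* apply the whole set at once; it fails or succeeds for all tasks */
          ret = cgroup_taskset_migrate(&tset, dst_cgrp);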
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      adaae5dc
    • cgroup: reorder cgroup_migrate()'s parameters · 9af2ec45
      Committed by Tejun Heo
      cgroup_migrate() has the destination cgroup as the first parameter
      while cgroup_task_migrate() has the destination cset as the last.
      Another migration function is scheduled to be added, which would make
      the discrepancy stand out further.  Let's reorder cgroup_migrate()'s
      parameters so that the destination cgroup is the last.
      
      This doesn't cause any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      9af2ec45
    • cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader() · 4530eddb
      Committed by Tejun Heo
      It wasn't explicitly documented but, when a process is being migrated,
      cpuset and memcg depend on cgroup_taskset_first() returning the
      threadgroup leader; however, this approach is brittle and
      would no longer work for the planned multi-process migration.
      
      This patch introduces an explicit cgroup_taskset_for_each_leader(),
      which iterates over only the threadgroup leaders, and replaces the
      cgroup_taskset_first() usages that access the leader with it.
      
      This prepares both memcg and cpuset for multi-process migration.  This
      patch also updates the documentation for cgroup_taskset_for_each() to
      clarify the iteration rules and removes comments mentioning task
      ordering in tasksets.
      
      v2: A previous patch which added threadgroup leader test was dropped.
          Patch updated accordingly.
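      A hedged sketch of the intended usage pattern inside a controller
      callback (the loop body is illustrative only):
      
          struct task_struct *leader;
      
          /* visit each migrating process exactly once, via its leader */
          cgroup_taskset_for_each_leader(leader, tset) {
                  /* per-process work, e.g. memory migration, goes here */
          }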
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      4530eddb
    • cpuset: migrate memory only for threadgroup leaders · 3df9ca0a
      Committed by Tejun Heo
      If the memory_migrate flag is set, cpuset migrates memory according to
      the destination css's nodemask.  The current implementation migrates
      memory whenever any thread of a process is migrated, making the behavior
      somewhat arbitrary.  Let's tie memory operations to the threadgroup
      leader so that memory is migrated only when the leader is migrated.
      
      While this is a behavior change, given the inherent fuzziness, this
      change is not too likely to be noticed and allows us to clearly define
      who owns the memory (always the leader) and helps the planned atomic
      multi-process migration.
      
      Note that we're currently migrating memory in the migration path proper
      while holding all the locks.  In the long term, this should be moved
      out to an async work item.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      3df9ca0a
  4. 19 September 2015 (7 commits)
    • cgroup: generalize obtaining the handles of and notifying cgroup files · 6f60eade
      Committed by Tejun Heo
      cgroup core handles creation and removal of cgroup interface files
      as described by cftypes.  There are cases where the handle for a given
      file instance is necessary, for example, to generate a file modified
      event.  Currently, this is handled by explicitly matching the callback
      method pointer and storing the file handle manually in
      cgroup_add_file().  While this simple approach works for cgroup core
      files, it can't work for controller interface files.
      
      This patch generalizes cgroup interface file handle handling.  struct
      cgroup_file is defined and each cftype can optionally tell cgroup core
      to store the file handle by setting ->file_offset.  A file handle
      remains accessible as long as the containing css is accessible.
      
      Both "cgroup.procs" and "cgroup.events" are converted to use the new
      generic mechanism instead of hooking directly into cgroup_add_file().
      Also, cgroup_file_notify() which takes a struct cgroup_file and
      generates a file modified event on it is added and replaces explicit
      kernfs_notify() invocations.
      
      This generalizes cgroup file handle handling and allows controllers to
      generate file modified notifications.
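      For a hypothetical controller, the mechanism would be used roughly as
      follows (struct my_css, the "events" cftype entry and my_events_show
      are made up for illustration):
      
          struct my_css {
                  struct cgroup_subsys_state css;
                  struct cgroup_file events_file; /* handle stored by cgroup core */
          };
      
          static struct cftype my_files[] = {
                  {
                          .name = "events",
                          .file_offset = offsetof(struct my_css, events_file),
                          .seq_show = my_events_show,
                  },
                  { }     /* terminate */
          };
      
          static void my_signal_event(struct my_css *mcss)
          {
                  /* generate a file modified event on the stored handle */
                  cgroup_file_notify(&mcss->events_file);
          }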
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      6f60eade
    • cgroup: restructure file creation / removal handling · 4df8dc90
      Committed by Tejun Heo
      The file creation / removal path has always been a bit icky and the
      planned notification update requires css during file creation.
      Restructure as follows.
      
      * cgroup_addrm_files() now takes both @css and @cgrp and is only
        called directly by other file handling functions.
      
      * cgroup_populate/clear_dir() are replaced with
        css_populate/clear_dir() taking @css and @cgrp_override.
        @cgrp_override is used only when files need to be created on /
        removed from a cgroup which isn't attached to @css, which happens
        during subsystem rebinds.  Subsystem loops are moved to the callers.
      
      * cgroup_add_file() now takes both @css and @cgrp.  @css isn't used
        yet but will be used by the planned notification update.
      
      This patch doesn't cause any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      4df8dc90
    • cgroup: cosmetic updates to rebind_subsystems() · 1ada4838
      Committed by Tejun Heo
      * Use local variables @scgrp and @dcgrp for @src_root->cgrp and
        @dst_root->cgrp respectively.
      
      * Use initializers to set @src_root and @css in the inner bind loop.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      1ada4838
    • cgroup: make cgroup_addrm_files() clean up after itself on failures · 6732ed85
      Committed by Tejun Heo
      After a file creation failure, cgroup_addrm_files() didn't remove
      the files which had already been created.  When cgroup_populate_dir()
      is the caller, this is fine as the caller performs cleanup; however,
      for other callers, this may leave unactivated dangling files behind.
      As kernfs directory removals are recursive, this doesn't lead to a
      permanent memory leak, but it can, for example, fail future attempts to
      create those files again.
      
      There's no point in keeping around this sort of subtlety and it gets
      in the way of planned updates to file handling.  This patch makes
      cgroup_addrm_files() clean up after itself on failures.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      6732ed85
    • cgroup: relocate cgroup_populate_dir() · ccdca218
      Committed by Tejun Heo
      Move it upwards so that it's right below cgroup_clear_dir() and the
      forward declaration is unnecessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      ccdca218
    • cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE · 7dbdb199
      Committed by Tejun Heo
      cftype->mode allows controllers to give arbitrary permissions to
      interface knobs.  Except for "cgroup.event_control", the existing uses
      are spurious.
      
      * Some explicitly specify S_IRUGO | S_IWUSR even though that's the
        default.
      
      * "cpuset.memory_pressure" specifies S_IRUGO while also setting a
        write callback which returns -EACCES.  All it needs to do is simply
        not set a write callback.
      
      "cgroup.event_control" uses cftype->mode to make the file
      world-writable.  It's a misdesigned interface and we don't want
      controllers to be tweaking interface file permissions in general.
      This patch removes cftype->mode and all its spurious uses and
      implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control" which is
      marked as compatibility-only.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      7dbdb199
    • cgroup: replace "cgroup.populated" with "cgroup.events" · 4a07c222
      Committed by Tejun Heo
      memcg already uses "memory.events" for event reporting and other
      controllers may need event reporting too.  Let's standardize on a
      "$SUBSYS.events" interface file for reporting events which don't
      happen too frequently and thus can share a single event notification.
      
      "cgroup.populated" is replaced with "populated" field in
      "cgroup.events" and documentation is updated accordingly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      4a07c222
  5. 18 September 2015 (3 commits)
    • cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() · 9e10a130
      Committed by Tejun Heo
      cgroup_on_dfl() tests whether the cgroup's root is the default
      hierarchy; however, an individual controller is only interested in
      whether the controller is attached to the default hierarchy and never
      tests a cgroup which doesn't belong to the hierarchy that the
      controller is attached to.
      
      This patch replaces cgroup_on_dfl() tests in controllers with the faster
      static_key based cgroup_subsys_on_dfl().  This leaves cgroup core as
      the only user of cgroup_on_dfl() and the function is moved from the
      header file to cgroup.c.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      9e10a130
    • cgroup: replace cgroup_subsys->disabled tests with cgroup_subsys_enabled() · fc5ed1e9
      Committed by Tejun Heo
      Replace cgroup_subsys->disabled tests in controllers with
      cgroup_subsys_enabled().  cgroup_subsys_enabled() requires a literal
      subsys name as its parameter and thus can't be used for cgroup core,
      which iterates through controllers.  For cgroup core, introduce and
      use cgroup_ssid_enabled(), which uses the slower static_key_enabled()
      test and can be indexed by subsys ID.
      
      This leaves cgroup_subsys->disabled unused.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      fc5ed1e9
    • cgroup: implement static_key based cgroup_subsys_enabled() and cgroup_subsys_on_dfl() · 49d1dc4b
      Committed by Tejun Heo
      Whether a subsys is enabled and attached to the default hierarchy
      seldom changes and may be tested in the hot paths.  This patch
      implements static_key based cgroup_subsys_enabled() and
      cgroup_subsys_on_dfl() tests.
      
      The following patches will update the users and remove duplicate
      mechanisms.
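      In rough terms the mechanism looks like the following simplified
      sketch (the exact macro and key names are inferred from the
      description, not quoted from the patch):
      
          /* one always-true static key per subsystem, flipped at boot if needed */
          #define SUBSYS(_x)                                              \
                  DEFINE_STATIC_KEY_TRUE(_x ## _cgrp_subsys_enabled_key); \
                  DEFINE_STATIC_KEY_TRUE(_x ## _cgrp_subsys_on_dfl_key);
          #include <linux/cgroup_subsys.h>
          #undef SUBSYS
      
          /* hot-path tests compile down to a single patched branch */
          #define cgroup_subsys_enabled(ss)                               \
                  static_branch_likely(&ss ## _enabled_key)
          #define cgroup_subsys_on_dfl(ss)                                \
                  static_branch_likely(&ss ## _on_dfl_key)
      
          /* usage, e.g.: if (cgroup_subsys_enabled(memory_cgrp_subsys)) ... */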
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      49d1dc4b
  6. 17 September 2015 (2 commits)
    • cgroup: simplify threadgroup locking · 3014dde7
      Committed by Tejun Heo
      Note: This commit was originally committed as b5ba75b5 but got
            reverted by f9f9e7b7 due to the performance regression from
            the percpu_rwsem write down/up operations added to the cgroup
            task migration path.  percpu_rwsem changes which alleviate the
            performance issue are pending for the v4.4-rc1 merge window.
            Re-apply.
      
      Now that threadgroup locking is made global, code paths around it can
      be simplified.
      
      * lock-verify-unlock-retry dancing removed from __cgroup_procs_write().
      
      * Race protection against de_thread() removed from
        cgroup_update_dfl_csses().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
      3014dde7
    • sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem · 1ed13287
      Committed by Tejun Heo
      Note: This commit was originally committed as d59cfc09 but got
            reverted by 0c986253 due to the performance regression from
            the percpu_rwsem write down/up operations added to the cgroup
            task migration path.  percpu_rwsem changes which alleviate the
            performance issue are pending for the v4.4-rc1 merge window.
            Re-apply.
      
      The cgroup side of threadgroup locking uses signal_struct->group_rwsem
      to synchronize against threadgroup changes.  This per-process rwsem
      adds a small overhead to thread creation, exit and exec paths, forces
      cgroup code paths to do a lock-verify-unlock-retry dance in a couple
      of places and makes it impossible to atomically perform operations across
      multiple processes.
      
      This patch replaces signal_struct->group_rwsem with a global
      percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
      side and contained in cgroups proper.  This patch converts one-to-one.
      
      This does make the writer side heavier and lowers the granularity;
      however, cgroup process migration is a fairly cold path, we do want to
      optimize thread operations over it, and cgroup migration operations
      don't take enough time for the lower granularity to matter.
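      The conversion boils down to something like the following sketch
      (helper names follow the description rather than the exact diff):
      
          /* one global percpu_rwsem instead of a rwsem per signal_struct */
          struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
      
          /* cheap reader side, wrapped around fork/exit/exec */
          static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
          {
                  percpu_down_read(&cgroup_threadgroup_rwsem);
          }
      
          static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
          {
                  percpu_up_read(&cgroup_threadgroup_rwsem);
          }
      
      The migration path then brackets the whole operation with
      percpu_down_write()/percpu_up_write() on the same rwsem, which is
      where the heavier writer side mentioned above comes from.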
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      1ed13287
  7. 16 September 2015 (2 commits)
    • Revert "sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem" · 0c986253
      Committed by Tejun Heo
      This reverts commit d59cfc09.
      
      d59cfc09 ("sched, cgroup: replace signal_struct->group_rwsem with
      a global percpu_rwsem") and b5ba75b5 ("cgroup: simplify
      threadgroup locking") changed how cgroup synchronizes against task
      forks and exits so that it uses a global percpu_rwsem instead of a
      per-process rwsem; unfortunately, the write [un]lock paths of
      percpu_rwsem always involve synchronize_rcu_expedited(), which turned
      out to be too expensive.
      
      Improvements for percpu_rwsem are scheduled to be merged in the coming
      v4.4-rc1 merge window, which will alleviate this issue.  For now, revert
      the two commits to restore per-process rwsem.  They will be re-applied
      for the v4.4-rc1 merge window.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org # v4.2+
      0c986253
    • Revert "cgroup: simplify threadgroup locking" · f9f9e7b7
      Committed by Tejun Heo
      This reverts commit b5ba75b5.
      
      d59cfc09 ("sched, cgroup: replace signal_struct->group_rwsem with
      a global percpu_rwsem") and b5ba75b5 ("cgroup: simplify
      threadgroup locking") changed how cgroup synchronizes against task
      forks and exits so that it uses a global percpu_rwsem instead of a
      per-process rwsem; unfortunately, the write [un]lock paths of
      percpu_rwsem always involve synchronize_rcu_expedited(), which turned
      out to be too expensive.
      
      Improvements for percpu_rwsem are scheduled to be merged in the coming
      v4.4-rc1 merge window, which will alleviate this issue.  For now, revert
      the two commits to restore per-process rwsem.  They will be re-applied
      for the v4.4-rc1 merge window.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org # v4.2+
      f9f9e7b7
  8. 12 September 2015 (1 commit)
    • sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Committed by Mathieu Desnoyers
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
       1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every thread of
      the process (as we currently do), we diminish the number of unnecessary
      wake-ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such a barrier anyway, because they are
      implied by the scheduler context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes that are injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                    are de facto in such a state). This covers threads from all processes
                    running on the system.  This command returns 0.
      
             The flags argument needs to be 0. For future extensions.
      
             All memory accesses performed  in  program  order  from  each  targeted
             thread is guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
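      A minimal user-space sketch consistent with the man page above (glibc
      provides no wrapper at this point, so the raw syscall is used):
      
          #include <linux/membarrier.h>
          #include <sys/syscall.h>
          #include <unistd.h>
          #include <stdio.h>
      
          static int membarrier(int cmd, int flags)
          {
                  return syscall(__NR_membarrier, cmd, flags);
          }
      
          int main(void)
          {
                  int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);
      
                  if (mask < 0 || !(mask & MEMBARRIER_CMD_SHARED)) {
                          fprintf(stderr, "membarrier not supported\n");
                          return 1;
                  }
                  /*
                   * Write side of a userspace-RCU-like scheme: upgrade the
                   * peers' compiler barrier()s to full barriers on all
                   * currently running threads.
                   */
                  membarrier(MEMBARRIER_CMD_SHARED, 0);
                  return 0;
          }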
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b25b13a
  9. 11 September 2015 (14 commits)
    • sysctl: fix int -> unsigned long assignments in INT_MIN case · 9a5bc726
      Committed by Ilya Dryomov
      The following
      
          if (val < 0)
              *lvalp = (unsigned long)-val;
      
      is incorrect because the compiler is free to assume -val to be positive
      and use a sign-extend instruction for extending the bit pattern.  This is
      a problem if val == INT_MIN:
      
          # echo -2147483648 >/proc/sys/dev/scsi/logging_level
          # cat /proc/sys/dev/scsi/logging_level
          -18446744071562067968
      
      Cast to unsigned long before negation - that way we first sign-extend and
      then negate an unsigned, which is well defined.  With this:
      
          # cat /proc/sys/dev/scsi/logging_level
          -2147483648
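      The fix therefore amounts to reordering the cast and the negation;
      the corrected conversion looks roughly like this:
      
          if (val < 0) {
                  *negp = true;
                  *lvalp = -(unsigned long)val; /* sign-extend, then negate as unsigned */
          } else {
                  *negp = false;
                  *lvalp = (unsigned long)val;
          }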
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Cc: Mikulas Patocka <mikulas@twibright.com>
      Cc: Robert Xiao <nneonneo@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a5bc726
    • kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo · 1303a27c
      Committed by Baoquan He
      On x86_64, since v2.6.26 KERNEL_IMAGE_SIZE has been 512M, and
      accordingly MODULES_VADDR was changed to 0xffffffffa0000000.  However,
      in v3.12 Kees Cook introduced kaslr to randomise the location of the
      kernel, and the kernel text mapping address space was enlarged from 512M
      to 1G.  That means KERNEL_IMAGE_SIZE is now variable: its value is 512M
      when kaslr support is not compiled in and 1G when it is.  Accordingly
      MODULES_VADDR was changed too, to be:
      
          #define MODULES_VADDR    (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
      
      So when kaslr is compiled in and enabled, the kernel text mapping
      address space and modules vaddr space need to be adjusted.  Otherwise
      makedumpfile will collapse since the address for some symbols is not
      correct.
      
      Hence KERNEL_IMAGE_SIZE needs to be exported to vmcoreinfo and picked
      up by makedumpfile to help calculate MODULES_VADDR.
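      The export itself is a one-liner in the x86 vmcoreinfo hook, along
      the lines of the following sketch (surrounding exports omitted):
      
          void arch_crash_save_vmcoreinfo(void)
          {
                  /* existing exports (phys_base, etc.) elided */
                  VMCOREINFO_NUMBER(KERNEL_IMAGE_SIZE);
          }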
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1303a27c
    • kexec: align crash_notes allocation to make it be inside one physical page · bbb78b8f
      Committed by Baoquan He
      People reported that crash_notes in /proc/vmcore were corrupted and
      this caused crash kdump failure.  With code debugging and logs we found
      the root cause: the percpu variable crash_notes is allocated across 2
      vmalloc pages.  Currently percpu is based on vmalloc by default, and
      vmalloc can't guarantee that 2 contiguous vmalloc pages are also on 2
      contiguous physical pages.  So when the 1st kernel exports the starting
      address and size of crash_notes through sysfs like below:
      
      /sys/devices/system/cpu/cpux/crash_notes
      /sys/devices/system/cpu/cpux/crash_notes_size
      
      The kdump kernel uses them to get the content of crash_notes.  However,
      the 2nd part may not be in the next neighbouring physical page as
      expected if crash_notes is allocated across 2 vmalloc pages.  That's why
      nhdr_ptr->n_namesz or nhdr_ptr->n_descsz could be huge in
      update_note_header_size_elf64() and cause note header merging failures
      or warnings.
      
      This patch changes the code to call __alloc_percpu(), passing in an
      align value obtained by rounding crash_notes_size up to the nearest
      power of two.  This makes sure crash_notes is allocated inside one
      physical page, since sizeof(note_buf_t) on all ARCHs is smaller than
      PAGE_SIZE.  Meanwhile, add a BUILD_BUG_ON to break the compile if the
      size is bigger than PAGE_SIZE, since crash_notes would then definitely
      span 2 pages; that needs to be avoided, and reported if it's unavoidable.
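      Concretely, the allocation becomes something like the following
      sketch (the warning message and exact structure are approximate):
      
          static int __init crash_notes_memory_init(void)
          {
                  size_t size = sizeof(note_buf_t);
                  size_t align = min_t(size_t, roundup_pow_of_two(size), PAGE_SIZE);
      
                  /* the per-cpu note must never straddle a page boundary */
                  BUILD_BUG_ON(size > PAGE_SIZE);
      
                  crash_notes = __alloc_percpu(size, align);
                  if (!crash_notes) {
                          pr_warn("Memory allocation for saving cpu register states failed\n");
                          return -ENOMEM;
                  }
                  return 0;
          }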
      
      [akpm@linux-foundation.org: use correct comment layout]
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbb78b8f
    • kexec: remove unnecessary test in kimage_alloc_crash_control_pages() · 04e9949b
      Committed by Minfei Huang
      Transforming a PFN (Page Frame Number) to a struct page never fails, so
      we can simplify the code to do the image->control_page assignment
      directly in the loop and remove the unnecessary conditional check.
      Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
      Acked-by: Dave Young <dyoung@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Simon Horman <horms@verge.net.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04e9949b
    • kexec: split kexec_load syscall from kexec core code · 2965faa5
      Committed by Dave Young
      There are two kexec load syscalls: kexec_load and kexec_file_load.
      kexec_file_load has already been split out into kernel/kexec_file.c.
      In this patch I split the kexec_load syscall code out to kernel/kexec.c.
      
      And add a new kconfig option KEXEC_CORE, so we can disable kexec_load
      and use kexec_file_load only, or vice versa.
      
      The original requirement is from Ted Ts'o: he wants the kexec kernel
      signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled, but
      kexec-tools using the kexec_load syscall can bypass the check.
      
      Vivek Goyal proposed to create a common kconfig option so users can
      compile in only one syscall for loading the kexec kernel.  KEXEC and
      KEXEC_FILE select KEXEC_CORE so that old config files still work.
      
      Because some general code needs CONFIG_KEXEC_CORE, I updated all the
      architecture Kconfigs with the new option KEXEC_CORE and made KEXEC
      select KEXEC_CORE in the arch Kconfigs.  Also updated the general
      kernel code related to the kexec_load syscall.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2965faa5
    • kexec: split kexec_file syscall code to kexec_file.c · a43cac0d
      Committed by Dave Young
      Split the kexec_file syscall related code into another file,
      kernel/kexec_file.c, so that the #ifdef CONFIG_KEXEC_FILE in kexec.c
      can be dropped.
      
      Shared variables and functions are moved to kernel/kexec_internal.h per
      suggestions from Vivek and Petr.
      
      [akpm@linux-foundation.org: fix bisectability]
      [akpm@linux-foundation.org: declare the various arch_kexec functions]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a43cac0d
    • kmod: handle UMH_WAIT_PROC from system unbound workqueue · bb304a5c
      Committed by Frederic Weisbecker
      The UMH_WAIT_PROC handler runs in its own thread in order to make sure
      that waiting for the exec kernel thread completion won't block other
      usermodehelper queued jobs.
      
      On older workqueue implementations, worklets couldn't sleep without
      blocking the rest of the queue.  But now the workqueue subsystem handles
      that.  Khelper still had the older limitation due to its singlethread
      properties, but we have replaced it with system unbound workqueues.
      
      Those are affine to the current node and can block up to some number of
      instances.
      
      They are a good candidate to handle UMH_WAIT_PROC assuming that we have
      enough system unbound workers to handle lots of parallel usermodehelper
      jobs.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb304a5c
    • kmod: use system_unbound_wq instead of khelper · 90f02303
      Committed by Frederic Weisbecker
      We need to launch the usermodehelper kernel threads with the widest
      affinity and this is partly why we use khelper.  This workqueue has
      unbound properties and thus a wide affinity inherited by all its children.
      
      Now khelper also has special properties that we aren't much interested in:
      ordered and singlethread.  There is really no need for ordering, as all
      we do is create kernel threads, and that can be done concurrently.  And
      singlethread is a useless limitation as well.
      
      The workqueue engine already provides generic unbound workqueues that
      don't share these useless properties and handle parallel jobs well.
      
      The only worrisome specific is their affinity to the node of the current
      CPU.  It's fine for creating the usermodehelper kernel threads, but they
      inherit this affinity for longer jobs such as requesting modules.
      
      This patch proposes to use these node-affine unbound workqueues, assuming
      that a node is sufficient to handle several parallel usermodehelper
      requests.
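      The core of the change is simply queueing the helper work on the
      generic unbound pool instead of khelper, roughly:
      
          /* before: queue_work(khelper_wq, &sub_info->work); */
          queue_work(system_unbound_wq, &sub_info->work);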
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90f02303
    • kmod: add up-to-date explanations on the purpose of each asynchronous levels · b639e86b
      Committed by Frederic Weisbecker
      There seems to be quite some confusion in the comments, likely due to
      changes that came after them.
      
      Now, since it's very non-obvious why we have 3 levels of asynchronous code
      to implement usermodehelpers, it's important to comment in detail on the
      reason for this layout.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b639e86b
    • kmod: remove unecessary explicit wide CPU affinity setting · d097c024
      Committed by Frederic Weisbecker
      Khelper is affine to all CPUs.  Now since it creates the
      call_usermodehelper_exec_[a]sync() kernel threads, those inherit the wide
      affinity.
      
      As such, explicitly forcing a wide affinity from those kernel threads
      is a no-op.
      
      Just remove it.  It's needless and it breaks CPU isolation for users who
      rely on workqueue affinity tuning.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d097c024
    • kmod: bunch of internal functions renames · b6b50a81
      Committed by Frederic Weisbecker
      This patchset does a bunch of cleanups and converts khelper to use system
      unbound workqueues.  The first 3 patches should be uncontroversial.  The
      last 2 patches are debatable.
      
      Kmod creates kernel threads that perform userspace jobs and we want those
      to have a large affinity in order not to contend with busy CPUs.  This is
      (partly) why we use khelper, which has a wide affinity that the kernel
      threads it creates can inherit from.  Now khelper is a dedicated workqueue
      that has singlethread properties which we aren't interested in.
      
      Hence those two debatable changes:
      
      _ We would like to use generic workqueues. System unbound workqueues are
        a very good candidate but they are not wide affine, only node affine.
        Now probably a node is enough to perform many parallel kmod jobs.
      
      _ We would like to remove the wait_for_helper kernel thread (UMH_WAIT_PROC
        handler) and use the workqueue instead.  It means that if the workqueue
        blocks, and no other worker can take the pending kmod request, we can
        be screwed.
        Now if we have 512 threads, this should be enough.
      
      This patch (of 5):
      
      Underscore prefixes on function names don't say much about the purpose
      of a function, and kmod has some interesting flavours of them.
      
      Let's rename the following functions:
      
      * __call_usermodehelper -> call_usermodehelper_exec_work
      * ____call_usermodehelper -> call_usermodehelper_exec_async
      * wait_for_helper -> call_usermodehelper_exec_sync
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6b50a81
    • kmod: correct documentation of return status of request_module · 60b61a6f
      Committed by NeilBrown
      If request_module() successfully runs modprobe, but modprobe exits with a
      non-zero status, then the return value from request_module() will be that
      (positive) error status.  So the return from request_module() can be:
      
       negative errno
       zero for success
       positive exit code.
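      A hedged example of a caller handling all three cases ("foo" is a
      placeholder module name):
      
          int ret = request_module("foo");
      
          if (ret < 0)
                  pr_err("request_module failed: %d\n", ret);       /* kernel-side error */
          else if (ret > 0)
                  pr_warn("modprobe exited with status %d\n", ret); /* userspace failure */
          /* ret == 0: modprobe ran and exited successfully */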
      Signed-off-by: NeilBrown <neilb@suse.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60b61a6f
    • kernel/cred.c: remove unnecessary kdebug atomic reads · 52aa8536
      Committed by Joe Perches
      Commit e0e81739 ("CRED: Add some configurable debugging [try #6]")
      added the kdebug mechanism to this file back in 2009.
      
      The kdebug macro calls no_printk(), which always evaluates its arguments.
      
      Most of the kdebug uses have an unnecessary call of
      	atomic_read(&cred->usage)
      
      Make the kdebug macro do nothing by defining it with
      	do { if (0) no_printk(...); } while (0)
      when not enabled.
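      Spelled out, the disabled-case definition is along these lines (a
      sketch; the real macro's format string and arguments differ):
      
          #define kdebug(FMT, ...)                                        \
          do {                                                            \
                  if (0)                                                  \
                          no_printk(KERN_DEBUG FMT "\n", ##__VA_ARGS__);  \
          } while (0)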
      
      $ size kernel/cred.o* (defconfig x86-64)
         text	   data	    bss	    dec	    hex	filename
         2748	    336	      8	   3092	    c14	kernel/cred.o.new
         2788	    336	      8	   3132	    c3c	kernel/cred.o.old
      
      Miscellanea:
      o Neaten the #define kdebug macros while there
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52aa8536
    • kernel/extable.c: remove duplicated include · 2307e1a3
      Committed by Wei Yongjun
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2307e1a3
  10. 10 September 2015 (2 commits)
  11. 09 September 2015 (1 commit)
    • mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Committed by Vlastimil Babka
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node(), that doesn't
      fallback to current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has led to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      It has been also considered to really provide a convenience function for
      allocations restricted to a node, but the major opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would be previously
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
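      For illustration, the intended division of labour after this patch
      (nid and the gfp flags are placeholders):
      
          struct page *page;
      
          /* optimized variant: the caller guarantees nid is a valid node id */
          page = __alloc_pages_node(nid, GFP_KERNEL, 0);
      
          /* general variant: nid may be NUMA_NO_NODE, falling back to the current node */
          page = alloc_pages_node(nid, GFP_KERNEL, 0);
      
          /* to actually restrict the allocation to nid, pass __GFP_THISNODE */
          page = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);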
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Robin Holt <robinmholt@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96db800f