1. 24 Jan, 2014 31 commits
    • H
      mm: prevent setting of a value less than 0 to min_free_kbytes · da8c757b
      Han Pingtian committed
      If you echo -1 > /proc/sys/vm/min_free_kbytes, the system will hang.  Changing
      proc_dointvec() to proc_dointvec_minmax() in the
      min_free_kbytes_sysctl_handler() can prevent this from happening.
      
      mhocko said:
      
      : You can still do echo $BIG_VALUE > /proc/sys/vm/min_free_kbytes and make
      : your machine unusable but I agree that proc_dointvec_minmax is more
      : suitable here as we already have:
      :
      : 	.proc_handler   = min_free_kbytes_sysctl_handler,
      : 	.extra1         = &zero,
      :
      : It used to work properly but then 6fce56ec ("sysctl: Remove references
      : to ctl_name and strategy from the generic sysctl table") has removed
      : sysctl_intvec strategy and so extra1 is ignored.
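The effect of a bounds-checking handler can be sketched in userspace C. This is a minimal model, not the kernel's proc_dointvec_minmax(): the toy_ctl structure and toy_write_minmax() are hypothetical names, and the min/max pointers only imitate the role that extra1/extra2 play in a sysctl table entry.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a sysctl table entry. */
struct toy_ctl {
    int *data;          /* value being tuned */
    const int *min;     /* lower bound, or NULL for none (role of extra1) */
    const int *max;     /* upper bound, or NULL for none (role of extra2) */
};

/* Parse buf as an int and store it only if it lies within [min, max]. */
static int toy_write_minmax(const struct toy_ctl *ctl, const char *buf)
{
    char *end;
    long v;

    assert(ctl != NULL && ctl->data != NULL && buf != NULL);
    errno = 0;
    v = strtol(buf, &end, 10);
    if (errno || end == buf || v < INT_MIN || v > INT_MAX)
        return -EINVAL;
    if (ctl->min && v < *ctl->min)
        return -EINVAL;         /* below the lower bound: reject, keep old value */
    if (ctl->max && v > *ctl->max)
        return -EINVAL;
    *ctl->data = (int)v;
    return 0;
}
```

With a lower bound of zero, writing "-1" is rejected and the previous value stays in place, while a large value is still accepted, which matches the remark quoted above.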
      Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: new_vma_page() cannot see NULL vma for hugetlb pages · cc81717e
      Michal Hocko committed
      Commit 11c731e8 ("mm/mempolicy: fix !vma in new_vma_page()") has
      removed BUG_ON(!vma) from new_vma_page which is partially correct
      because page_address_in_vma will return EFAULT for non-linear mappings
      and at least shared shmem might be mapped this way.
      
      The patch also tried to prevent NULL ptr for hugetlb pages which is not
      correct AFAICS because hugetlb pages cannot be mapped as VM_NONLINEAR
      and other conditions in page_address_in_vma seem to be legit and catch
      real bugs.
      
      This patch restores BUG_ON for PageHuge to catch potential issues when
      the to-be-migrated page is not set up properly.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • N
      mm/memory-failure.c: shift page lock from head page to tail page after thp split · 54b9dd14
      Naoya Horiguchi committed
      After the thp split in hwpoison_user_mappings(), we hold the page lock on
      the raw error page only around try_to_unmap(), so we are exposed to a race
      condition.
      
      I found in the RHEL7 MCE-relay testing that we have "bad page" error
      when a memory error happens on a thp tail page used by qemu-kvm:
      
        Triggering MCE exception on CPU 10
        mce: [Hardware Error]: Machine check events logged
        MCE exception done on CPU 10
        MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption
        MCE 0x38c535: dirty LRU page recovery: Recovered
        qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000]
        BUG: Bad page state in process qemu-kvm  pfn:38c400
        page:ffffea000e310000 count:0 mapcount:0 mapping:          (null) index:0x7ffae3c00
        page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked)
        Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ...
        CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G   M        --------------   3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1
        Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011
        Call Trace:
          dump_stack+0x19/0x1b
          bad_page.part.59+0xcf/0xe8
          free_pages_prepare+0x148/0x160
          free_hot_cold_page+0x31/0x140
          free_hot_cold_page_list+0x46/0xa0
          release_pages+0x1c1/0x200
          free_pages_and_swap_cache+0xad/0xd0
          tlb_flush_mmu.part.46+0x4c/0x90
          tlb_finish_mmu+0x55/0x60
          exit_mmap+0xcb/0x170
          mmput+0x67/0xf0
          vhost_dev_cleanup+0x231/0x260 [vhost_net]
          vhost_net_release+0x3f/0x90 [vhost_net]
          __fput+0xe9/0x270
          ____fput+0xe/0x10
          task_work_run+0xc4/0xe0
          do_exit+0x2bb/0xa40
          do_group_exit+0x3f/0xa0
          get_signal_to_deliver+0x1d0/0x6e0
          do_signal+0x48/0x5e0
          do_notify_resume+0x71/0xc0
          retint_signal+0x48/0x8c
      
      The cause of this bug is that a page fault happens before the head page
      is unlocked at the end of memory_failure().  This strange page fault
      tries to access address 0x20, and I'm not sure why qemu-kvm does this,
      but as a result the SIGSEGV makes qemu-kvm exit, and on the way we hit
      the bad page bug/warning because we try to free a locked page (the
      former head page).

      To fix this, this patch shifts the page lock from the head page to the
      tail page just after the thp split.  SIGSEGV still happens, but it
      affects only the error-affected VMs, not the whole system.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>        [3.9+] # a3e0f9e4 "mm/memory-failure.c: transfer page count from head page to tail page after split thp"
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      numa: add a sysctl for numa_balancing · 54a43d54
      Andi Kleen committed
      Add a working sysctl to enable/disable automatic numa memory balancing
      at runtime.
      
      This allows us to track down performance problems with this feature and
      is generally a good idea.
      
      This was possible earlier through debugfs, but only with special
      debugging options set.  Also fix the boot message.
      
      [akpm@linux-foundation.org: s/sched_numa_balancing/sysctl_numa_balancing/]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm: free memblock.memory in free_all_bootmem · 5e270e25
      Philipp Hachtmann committed
      When calling free_all_bootmem() the free areas under memblock's control
      are released to the buddy allocator.  Additionally the reserved list is
      freed if it was reallocated by memblock.  The same should apply for the
      memory list.
      Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm/nobootmem.c: add return value check in __alloc_memory_core_early() · 87379ec8
      Philipp Hachtmann committed
      When memblock_reserve() fails because memblock.reserved.regions cannot
      be resized, the caller (e.g.  alloc_bootmem()) is not informed of the
      failed allocation.  Therefore alloc_bootmem() silently returns the same
      pointer again and again.
      
      This patch adds a check for the return value of memblock_reserve() in
      __alloc_memory_core_early().
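The failure mode can be illustrated with a small userspace model. This is only a sketch under assumptions: toy_find_free(), toy_reserve() and toy_alloc() are hypothetical stand-ins for the find/reserve/allocate steps, and the fixed-size region table imitates a reserved-regions array that can no longer be resized.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REGIONS 2   /* hypothetical capacity of the region table */

struct toy_region { size_t base, size; };

static char toy_pool[64];
static struct toy_region toy_regions[TOY_MAX_REGIONS];
static size_t toy_nr_regions;

/* The first free offset is derived from the recorded reservations. */
static size_t toy_find_free(void)
{
    if (toy_nr_regions == 0)
        return 0;
    return toy_regions[toy_nr_regions - 1].base +
           toy_regions[toy_nr_regions - 1].size;
}

/* Record a reservation; fails once the region table is full. */
static int toy_reserve(size_t base, size_t size)
{
    if (toy_nr_regions == TOY_MAX_REGIONS)
        return -1;
    toy_regions[toy_nr_regions].base = base;
    toy_regions[toy_nr_regions].size = size;
    toy_nr_regions++;
    return 0;
}

/* Allocate from the pool, reporting failure if the reservation fails. */
static void *toy_alloc(size_t size)
{
    size_t base = toy_find_free();

    assert(size > 0);
    if (base > sizeof(toy_pool) || size > sizeof(toy_pool) - base)
        return NULL;
    /* Without this check, every later call would find the same base again. */
    if (toy_reserve(base, size) != 0)
        return NULL;
    return &toy_pool[base];
}
```

Once the table is full, the caller sees NULL instead of receiving a pointer that was already handed out.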
      Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      memcg: rework memcg_update_kmem_limit synchronization · d6441637
      Vladimir Davydov committed
      Currently we take both the memcg_create_mutex and the set_limit_mutex
      when we enable kmem accounting for a memory cgroup, which makes kmem
      activation events serialize with both memcg creations and other memcg
      limit updates (memory.limit, memory.memsw.limit).  However, there is no
      point in such strict synchronization rules there.
      
      First, the set_limit_mutex was introduced to keep the memory.limit and
      memory.memsw.limit values in sync.  Since memory.kmem.limit can be set
      independently of them, it is better to introduce a separate mutex to
      synchronize against concurrent kmem limit updates.
      
      Second, we take the memcg_create_mutex in order to make sure all
      children of this memcg will be kmem-active as well.  For achieving that,
      it is enough to hold this mutex only while checking if
      memcg_has_children() though.  This guarantees that if a child is added
      after we checked that the memcg has no children, the newly added cgroup
      will see its parent kmem-active (of course if the latter succeeded), and
      call kmem activation for itself.
      
      This patch simplifies the locking rules of memcg_update_kmem_limit()
      according to these considerations.
      
      [vdavydov@parallels.com: fix uninitialized var warning]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      memcg: remove KMEM_ACCOUNTED_ACTIVATED flag · 6de64beb
      Vladimir Davydov committed
      Currently we have two state bits in mem_cgroup::kmem_account_flags
      regarding kmem accounting activation, ACTIVATED and ACTIVE.  We start
      kmem accounting only if both flags are set (memcg_can_account_kmem()),
      plus throughout the code there are several places where we check only
      the ACTIVE flag, but we never check the ACTIVATED flag alone.  These
      flags are both set from memcg_update_kmem_limit() under the
      set_limit_mutex, the ACTIVE flag always being set after ACTIVATED, and
      they never get cleared.  That said, checking if both flags are set is
      equivalent to checking only for the ACTIVE flag, and since there are no
      ACTIVATED flag checks, we can safely remove the ACTIVATED flag, and
      nothing will change.
      
      Let's try to understand the reason for introducing these flags.
      The purpose of the ACTIVE flag is clear - it states that kmem should be
      accounted to the cgroup.  The only requirement for it is that it should
      be set after we have fully initialized kmem accounting bits for the
      cgroup and patched all static branches relating to kmem accounting.
      Since we always check if static branch is enabled before actually
      considering if we should account (otherwise we wouldn't benefit from
      static branching), this guarantees us that we won't skip a commit or
      uncharge after a charge due to an unpatched static branch.
      
      Now let's move on to the ACTIVATED bit.  As I proved in the beginning of
      this message, it is absolutely useless, and removing it will change
      nothing.  So what was the reason for introducing it?
      
      The ACTIVATED flag was introduced by commit a8964b9b ("memcg: use
      static branches when code not in use") in order to guarantee that
      static_key_slow_inc(&memcg_kmem_enabled_key) would be called only once
      for each memory cgroup when its kmem accounting was activated.  The
      point was that at that time the memcg_update_kmem_limit() function's
      work-flow looked like this:
      
              bool must_inc_static_branch = false;
      
              cgroup_lock();
              mutex_lock(&set_limit_mutex);
              if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
                      /* The kmem limit is set for the first time */
                      ret = res_counter_set_limit(&memcg->kmem, val);
      
                      memcg_kmem_set_activated(memcg);
                      must_inc_static_branch = true;
              } else
                      ret = res_counter_set_limit(&memcg->kmem, val);
              mutex_unlock(&set_limit_mutex);
              cgroup_unlock();
      
              if (must_inc_static_branch) {
                      /* We can't do this under cgroup_lock */
                      static_key_slow_inc(&memcg_kmem_enabled_key);
                      memcg_kmem_set_active(memcg);
              }
      
      So without the ACTIVATED flag we could race with other threads
      trying to set the limit and increment the static branching ref-counter
      more than once.  Today we call the whole memcg_update_kmem_limit()
      function under the set_limit_mutex and this race is impossible.
      
      Now that we understand why the ACTIVATED bit was introduced and why we
      don't need it anymore, and know that removing it will change nothing anyway,
      let's get rid of it.
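The equivalence argued in this message can be checked on a tiny model. The sketch below assumes only what the message states (ACTIVATED is set before ACTIVE, and neither bit is ever cleared); the TOY_* names and bit positions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    TOY_ACTIVATED = 1 << 0,     /* hypothetical bit positions */
    TOY_ACTIVE    = 1 << 1,
};

/* The only transitions: ACTIVATED first, then ACTIVE; nothing is cleared. */
static unsigned toy_set_activated(unsigned flags)
{
    return flags | TOY_ACTIVATED;
}

static unsigned toy_set_active(unsigned flags)
{
    assert(flags & TOY_ACTIVATED);      /* ordering invariant from the message */
    return flags | TOY_ACTIVE;
}

/* The old check: both bits must be set. */
static bool toy_both_set(unsigned flags)
{
    unsigned both = TOY_ACTIVATED | TOY_ACTIVE;

    return (flags & both) == both;
}

/* The simplified check: ACTIVE alone. */
static bool toy_active_only(unsigned flags)
{
    return (flags & TOY_ACTIVE) != 0;
}
```

In each of the three reachable states (no bits, ACTIVATED only, both bits) the two predicates agree, which is why dropping the ACTIVATED bit changes nothing.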
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      memcg, slab: RCU protect memcg_params for root caches · f8570263
      Vladimir Davydov committed
      We relocate root cache's memcg_params whenever we need to grow the
      memcg_caches array to accommodate all kmem-active memory cgroups.
      Currently on relocation we free the old version immediately, which can
      lead to use-after-free, because the memcg_caches array is accessed
      lock-free (see cache_from_memcg_idx()).  This patch fixes this by making
      memcg_params RCU-protected for root caches.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      slab: do not panic if we fail to create memcg cache · f717eb3a
      Vladimir Davydov committed
      There is no point in flooding logs with warnings or especially crashing
      the system if we fail to create a cache for a memcg.  In this case we
      will be accounting the memcg allocation to the root cgroup until we
      succeed in creating its own cache, but it isn't that critical.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      memcg: get rid of kmem_cache_dup() · 842e2873
      Vladimir Davydov committed
      kmem_cache_dup() is only called from memcg_create_kmem_cache().  The
      latter, in fact, does nothing besides this, so let's fold
      kmem_cache_dup() into memcg_create_kmem_cache().
      
      This patch also makes the memcg_cache_mutex private to
      memcg_create_kmem_cache(), because it is not used anywhere else.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      memcg, slab: fix races in per-memcg cache creation/destruction · 2edefe11
      Vladimir Davydov committed
      We obtain a per-memcg cache from a root kmem_cache by dereferencing an
      entry of the root cache's memcg_params::memcg_caches array.  If we find
      no cache for a memcg there on allocation, we initiate the memcg cache
      creation (see memcg_kmem_get_cache()).  The cache creation proceeds
      asynchronously in memcg_create_kmem_cache() in order to avoid lock
      clashes, so there can be several threads trying to create the same
      kmem_cache concurrently, but only one of them may succeed.  However, due
      to a race in the code, it is not always true.  The point is that the
      memcg_caches array can be relocated when we activate kmem accounting for
      a memcg (see memcg_update_all_caches(), memcg_update_cache_size()).  If
      memcg_update_cache_size() and memcg_create_kmem_cache() proceed
      concurrently as described below, we can leak a kmem_cache.
      
      Assume two threads schedule creation of the same kmem_cache.  One of them
      successfully creates it.  Another one should fail then, but if
      memcg_create_kmem_cache() interleaves with memcg_update_cache_size() as
      follows, it won't:
      
        memcg_create_kmem_cache()             memcg_update_cache_size()
        (called w/o mutexes held)             (called with slab_mutex,
                                               set_limit_mutex held)
        -------------------------             -------------------------
      
        mutex_lock(&memcg_cache_mutex)
      
                                              s->memcg_params=kzalloc(...)
      
        new_cachep=cache_from_memcg_idx(cachep,idx)
        // new_cachep==NULL => proceed to creation
      
                                              s->memcg_params->memcg_caches[i]
                                                  =cur_params->memcg_caches[i]
      
        // kmem_cache_create_memcg takes slab_mutex
        // so we will hang around until
        // memcg_update_cache_size finishes, but
        // nothing will prevent it from succeeding so
        // memcg_caches[idx] will be overwritten in
        // memcg_register_cache!
      
        new_cachep = kmem_cache_create_memcg(...)
        mutex_unlock(&memcg_cache_mutex)
      
      Let's fix this by moving the check for existence of the memcg cache to
      kmem_cache_create_memcg() to be called under the slab_mutex and make it
      return NULL if so.
      
      A similar race is possible when destroying a memcg cache (see
      kmem_cache_destroy()).  Since memcg_unregister_cache(), which clears the
      pointer in the memcg_caches array, is called w/o protection, we can race
      with memcg_update_cache_size() and omit clearing the pointer.  Therefore
      memcg_unregister_cache() should be moved before we release the
      slab_mutex.
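The shape of both fixes (check for an existing cache under the mutex, and clear the slot before the mutex is released) can be sketched in userspace with pthreads. This is a simplified model, not the kernel code: toy_slab_mutex, toy_slots, toy_create_cache() and toy_destroy_cache() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define TOY_NR_SLOTS 4

struct toy_cache { int idx; };

static pthread_mutex_t toy_slab_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct toy_cache *toy_slots[TOY_NR_SLOTS];

/* Create the cache for idx; return NULL if it already exists or on OOM. */
static struct toy_cache *toy_create_cache(int idx)
{
    struct toy_cache *c = NULL;

    assert(idx >= 0 && idx < TOY_NR_SLOTS);
    pthread_mutex_lock(&toy_slab_mutex);
    /* The existence check and the registration share one critical section. */
    if (toy_slots[idx] == NULL) {
        c = malloc(sizeof(*c));
        if (c != NULL) {
            c->idx = idx;
            toy_slots[idx] = c;
        }
    }
    pthread_mutex_unlock(&toy_slab_mutex);
    return c;
}

/* Unregister the cache before the mutex is dropped, then free it. */
static void toy_destroy_cache(struct toy_cache *c)
{
    assert(c != NULL);
    pthread_mutex_lock(&toy_slab_mutex);
    toy_slots[c->idx] = NULL;
    pthread_mutex_unlock(&toy_slab_mutex);
    free(c);
}
```

A second creator of the same index gets NULL instead of overwriting the slot, so no cache can be leaked this way in the model.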
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      memcg: fix possible NULL deref while traversing memcg_slab_caches list · 96403da2
      Vladimir Davydov committed
      All caches of the same memory cgroup are linked in the memcg_slab_caches
      list via kmem_cache::memcg_params::list.  This list is traversed, for
      example, when we read memory.kmem.slabinfo.
      
      Since the list actually consists of memcg_cache_params objects, we have
      to convert an element of the list to a kmem_cache object using
      memcg_params_to_cache(), which obtains the pointer to the cache from the
      memcg_params::memcg_caches array of the corresponding root cache.  That
      said, the pointer to a kmem_cache in its parent's memcg_params must be
      initialized before adding the cache to the list, and cleared only after
      it has been unlinked.  Currently the order is reversed, which can result
      in a NULL ptr dereference while traversing the memcg_slab_caches list.
      This patch restores the correct order.
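The required ordering can be modelled in a few lines of userspace C. This is a sketch under assumptions: the toy_* structures only imitate the relationship between the list element and the root cache's array, and locking is left out to keep the ordering visible.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_MEMCG 4

struct toy_params {
    struct toy_params *next;    /* link in the per-memcg list */
    int idx;                    /* slot in the root cache's array */
};

struct toy_cache { struct toy_params params; };

static struct toy_cache *toy_root_slots[TOY_NR_MEMCG];
static struct toy_params *toy_list_head;

/* Convert a list element to its cache through the root's array. */
static struct toy_cache *toy_params_to_cache(const struct toy_params *p)
{
    return toy_root_slots[p->idx];
}

static void toy_register(struct toy_cache *c, int idx)
{
    assert(c != NULL && idx >= 0 && idx < TOY_NR_MEMCG);
    c->params.idx = idx;
    toy_root_slots[idx] = c;            /* 1. set the pointer ... */
    c->params.next = toy_list_head;     /* 2. ... then link into the list */
    toy_list_head = &c->params;
}

static void toy_unregister(struct toy_cache *c)
{
    struct toy_params **pp = &toy_list_head;

    while (*pp != NULL && *pp != &c->params)
        pp = &(*pp)->next;
    if (*pp != NULL)
        *pp = c->params.next;           /* 1. unlink from the list ... */
    toy_root_slots[c->params.idx] = NULL;   /* 2. ... then clear the pointer */
}

/* Walk the list; return -1 if any element resolves to a NULL cache. */
static int toy_count_resolvable(void)
{
    const struct toy_params *p;
    int n = 0;

    for (p = toy_list_head; p != NULL; p = p->next) {
        if (toy_params_to_cache(p) == NULL)
            return -1;
        n++;
    }
    return n;
}
```

With this ordering, every element that is reachable through the list always resolves to a non-NULL cache.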
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      memcg, slab: fix barrier usage when accessing memcg_caches · 959c8963
      Vladimir Davydov committed
      Each root kmem_cache has pointers to per-memcg caches stored in its
      memcg_params::memcg_caches array.  Whenever we want to allocate a slab
      for a memcg, we access this array to get per-memcg cache to allocate
      from (see memcg_kmem_get_cache()).  The access must be lock-free for
      performance reasons, so we should use barriers to assert the kmem_cache
      is up-to-date.
      
      First, we should place a write barrier immediately before setting the
      pointer to it in the memcg_caches array in order to make sure nobody
      will see a partially initialized object.  Second, we should issue a read
      barrier before dereferencing the pointer to conform to the write
      barrier.
      
      However, currently the barrier usage looks rather strange.  We have a
      write barrier *after* setting the pointer and a read barrier *before*
      reading the pointer, which is incorrect.  This patch fixes this.
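The corrected barrier placement has a close analogue in C11 atomics. The sketch below is a userspace illustration, not the kernel code: a release fence plays the role of the write barrier placed before the pointer store, an acquire fence plays the role of the read barrier placed after the pointer load, and toy_slot, toy_publish() and toy_read_size() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_cache { int object_size; };

/* Lock-free slot, standing in for one entry of the memcg_caches array. */
static _Atomic(struct toy_cache *) toy_slot;

/* Writer: initialize fully, fence, and only then publish the pointer. */
static void toy_publish(struct toy_cache *c, int object_size)
{
    assert(c != NULL);
    c->object_size = object_size;
    atomic_thread_fence(memory_order_release);  /* barrier BEFORE the store */
    atomic_store_explicit(&toy_slot, c, memory_order_relaxed);
}

/* Reader: load the pointer, fence, and only then dereference it. */
static int toy_read_size(void)
{
    struct toy_cache *c =
        atomic_load_explicit(&toy_slot, memory_order_relaxed);

    if (c == NULL)
        return -1;                              /* nothing published yet */
    atomic_thread_fence(memory_order_acquire);  /* barrier AFTER the load */
    return c->object_size;
}
```

A reader that observes the pointer is then guaranteed to observe the initialized fields as well; placing the fences on the other side of the store and the load would not give that guarantee.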
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      memcg, slab: clean up memcg cache initialization/destruction · 1aa13254
      Vladimir Davydov committed
      Currently, we have a rather messy set of functions relating to per-memcg
      kmem cache initialization/destruction.
      
      Per-memcg caches are created in memcg_create_kmem_cache().  This
      function calls kmem_cache_create_memcg() to allocate and initialize a
      kmem cache and then "registers" the new cache in the
      memcg_params::memcg_caches array of the parent cache.
      
      During its work-flow, kmem_cache_create_memcg() executes the following
      memcg-related functions:
      
       - memcg_alloc_cache_params(), to initialize memcg_params of the newly
         created cache;
       - memcg_cache_list_add(), to add the new cache to the memcg_slab_caches
         list.
      
      On the other hand, kmem_cache_destroy() called on a cache destruction
      only calls memcg_release_cache(), which does all the work: it cleans the
      reference to the cache in its parent's memcg_params::memcg_caches,
      removes the cache from the memcg_slab_caches list, and frees
      memcg_params.
      
      Such an inconsistency between destruction and initialization paths makes
      the code difficult to read, so let's clean this up a bit.
      
      This patch moves all the code relating to registration of per-memcg
      caches (adding to memcg list, setting the pointer to a cache from its
      parent) to the newly created memcg_register_cache() and
      memcg_unregister_cache() functions making the initialization and
      destruction paths look symmetrical.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      memcg, slab: kmem_cache_create_memcg(): fix memleak on fail path · 363a044f
      Vladimir Davydov committed
      We do not free the cache's memcg_params if __kmem_cache_create fails.
      Fix this.
      
      Plus, rename memcg_register_cache() to memcg_alloc_cache_params(),
      because it actually does not register the cache anywhere, but simply
      initializes kmem_cache::memcg_params.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      slab: clean up kmem_cache_create_memcg() error handling · 3965fc36
      Vladimir Davydov committed
      Currently kmem_cache_create_memcg() backs off on failure inside
      conditionals, without using gotos.  This duplicates the rollback code,
      which makes the function look cumbersome even though on error we should
      only free the allocated cache.  Since in the next patch I am going to add
      yet another rollback function call on the error path there, let's employ
      labels instead of conditionals for undoing any changes on failure to keep
      things clean.
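The label-based rollback style referred to here looks roughly as follows. This is a generic userspace sketch, not the kernel function: toy_cache_create() and its fail_at parameter (used only to force each failure path) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_cache {
    char *name;
    void *params;
};

/*
 * Create a cache in three steps. fail_at (hypothetical test knob) forces
 * step 1, 2 or 3 to fail; 0 means no forced failure.
 */
static struct toy_cache *toy_cache_create(const char *name, int fail_at)
{
    struct toy_cache *s;
    size_t len;

    assert(name != NULL);
    s = (fail_at == 1) ? NULL : calloc(1, sizeof(*s));
    if (s == NULL)
        goto out;

    len = strlen(name) + 1;
    s->name = (fail_at == 2) ? NULL : malloc(len);
    if (s->name == NULL)
        goto out_free_cache;
    memcpy(s->name, name, len);

    s->params = (fail_at == 3) ? NULL : malloc(16);
    if (s->params == NULL)
        goto out_free_name;

    return s;

    /* Rollback runs in reverse order of setup; each label undoes one step. */
out_free_name:
    free(s->name);
out_free_cache:
    free(s);
out:
    return NULL;
}

static void toy_cache_destroy(struct toy_cache *s)
{
    free(s->params);
    free(s->name);
    free(s);
}
```

Adding one more setup step then costs one extra label rather than another copy of the whole rollback sequence.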
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE · 309381fe
      Sasha Levin committed
      Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
      one of these assertions fails we'll get a BUG_ON with a call stack and
      the registers.
      
      Based on recent requests to add a small piece of code that dumps the page
      at various VM_BUG_ON sites, I've noticed that the page dump is quite
      useful to people debugging issues in mm.

      This patch adds VM_BUG_ON_PAGE(cond, page), which, beyond doing what
      VM_BUG_ON() does, also dumps the page before executing the actual
      BUG_ON.
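The idea of dumping the object before failing can be sketched as a userspace macro. This is an illustration only, not the kernel's VM_BUG_ON_PAGE: TOY_BUG_ON_PAGE, toy_dump_page() and the replaceable toy_bug_handler are hypothetical, and the toy_page fields are a small subset chosen for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_page {
    unsigned long flags;
    int count;
    int mapcount;
};

/* Fatal by default; replaceable so the failure can be observed instead. */
static void (*toy_bug_handler)(void) = abort;

/* Non-fatal handler for experiments: counts how often a check fired. */
static int toy_bug_count;
static void toy_count_bug(void)
{
    toy_bug_count++;
}

static void toy_dump_page(const struct toy_page *page)
{
    assert(page != NULL);
    fprintf(stderr, "page:%p count:%d mapcount:%d flags:%#lx\n",
            (const void *)page, page->count, page->mapcount, page->flags);
}

/* Like a plain assertion, but dump the page state first. */
#define TOY_BUG_ON_PAGE(cond, page)         \
    do {                                    \
        if (cond) {                         \
            toy_dump_page(page);            \
            toy_bug_handler();              \
        }                                   \
    } while (0)
```

A call site such as TOY_BUG_ON_PAGE(page->count == 0, page) then reports the page state together with the failure, instead of only a stack trace.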
      
      [akpm@linux-foundation.org: fix up includes]
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • N
      fs/proc/page.c: add PageAnon check to surely detect thp · e3bba3c3
      Naoya Horiguchi committed
      stable_page_flags() checks !PageHuge && PageTransCompound && PageLRU to
      determine whether a specified page is a thp.  But sometimes this is not
      enough, and we fail to detect a thp while it is on a pagevec.  This happens
      only for a few seconds after LRU list operations, but it makes it difficult
      to control applications that depend on this flag.

      So this patch adds another check, PageAnon, to detect thps on a pagevec.  It
      might not extend to thp pagecache in the future, but it's OK at least for
      now.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      memcg: do not use vmalloc for mem_cgroup allocations · 8ff69e2c
      Vladimir Davydov committed
      The vmalloc was introduced by 33327948 ("memcgroup: use vmalloc for
      mem_cgroup allocation"), because at that time MAX_NUMNODES was used for
      defining the per-node array in the mem_cgroup structure so that the
      structure could be huge even if the system had only one NUMA node.
      
      The situation was significantly improved by commit 45cf7ebd ("memcg:
      reduce the size of struct memcg 244-fold"), which made the size of the
      mem_cgroup structure calculated dynamically depending on the real number
      of NUMA nodes installed on the system (nr_node_ids), so now there is no
      point in using vmalloc here: the structure is allocated rarely and on
      most systems its size is about 1K.
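Why runtime sizing keeps the structure small can be shown with a flexible array member. This is a userspace sketch with hypothetical names (toy_memcg, TOY_MAX_NUMNODES, toy_memcg_size()); the real structure has many more fields, and malloc/calloc merely stands in for the slab allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_NUMNODES 1024   /* hypothetical compile-time maximum */

struct toy_per_node { unsigned long stats[4]; };

struct toy_memcg {
    unsigned long usage;
    int nr_nodes;
    struct toy_per_node *nodeinfo[];    /* sized at allocation time */
};

/* Size of the structure for the number of nodes actually present. */
static size_t toy_memcg_size(int nr_nodes)
{
    assert(nr_nodes > 0);
    return sizeof(struct toy_memcg) +
           (size_t)nr_nodes * sizeof(struct toy_per_node *);
}

static struct toy_memcg *toy_memcg_alloc(int nr_nodes)
{
    struct toy_memcg *m = calloc(1, toy_memcg_size(nr_nodes));

    if (m != NULL)
        m->nr_nodes = nr_nodes;
    return m;
}
```

Sized for the nodes that exist, the structure is a small allocation that an ordinary allocator handles easily; sized for the compile-time maximum it would be far larger.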
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      mm: munlock: fix potential race with THP page split · 01cc2e58
      Vlastimil Babka committed
      Since commit ff6a6da6 ("mm: accelerate munlock() treatment of THP
      pages") munlock skips tail pages of a munlocked THP page.  There is some
      attempt to prevent bad consequences of racing with a THP page split, but
      code inspection indicates that there are two problems that may lead to a
      non-fatal, yet wrong outcome.
      
      First, __split_huge_page_refcount() copies flags including PageMlocked
      from the head page to the tail pages.  Clearing PageMlocked by
      munlock_vma_page() in the middle of this operation might result in part
      of tail pages left with PageMlocked flag.  As the head page still
      appears to be a THP page until all tail pages are processed,
      munlock_vma_page() might think it munlocked the whole THP page and skip
      all the former tail pages.  Before ff6a6da6, those pages would be
      cleared in further iterations of munlock_vma_pages_range(), but NR_MLOCK
      would still become undercounted (related to the next point).
      
      Second, NR_MLOCK accounting is based on call to hpage_nr_pages() after
      the PageMlocked is cleared.  The accounting might also become
      inconsistent due to race with __split_huge_page_refcount()
      
      - undercount when HUGE_PMD_NR is subtracted, but some tail pages are
        left with PageMlocked set and counted again (only possible before
        ff6a6da6)
      
      - overcount when hpage_nr_pages() sees a normal page (split has already
        finished), but the parallel split has meanwhile cleared PageMlocked from
        additional tail pages
      
      This patch prevents both problems via extending the scope of lru_lock in
      munlock_vma_page().  This is convenient because:
      
      - __split_huge_page_refcount() takes lru_lock for its whole operation
      
      - munlock_vma_page() typically takes lru_lock anyway for page isolation
      
      As this becomes a second function where page isolation is done with
      lru_lock already held, factor this out to a new
      __munlock_isolate_lru_page() function and clean up the code around.
      
      [akpm@linux-foundation.org: avoid a coding-style ugly]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01cc2e58
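      The locking idea in 01cc2e58 can be modelled in userspace.  This is
      only an analogy, not the kernel code: struct model_page,
      model_munlock() and the pthread mutex are hypothetical stand-ins, and
      512 is just an illustrative huge-page size.  The point is that the
      flag is cleared and the page count is read inside one critical
      section, so a splitter taking the same lock cannot run in between.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel definitions. */
struct model_page {
    int nr_pages;   /* 512 while "huge", 1 after a split (illustrative) */
    int mlocked;    /* stands in for the PageMlocked flag */
};

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;
static long nr_mlock;   /* stands in for the NR_MLOCK counter */

/*
 * Clear the flag and read the page count inside one critical section, so a
 * concurrent split (which would take the same lock) cannot change nr_pages
 * between the two steps.  Returns the number of pages uncounted.
 */
static int model_munlock(struct model_page *page)
{
    int nr = 0;

    assert(page != NULL);
    pthread_mutex_lock(&lru_lock);
    if (page->mlocked) {
        page->mlocked = 0;
        nr = page->nr_pages;
        nr_mlock -= nr;
    }
    pthread_mutex_unlock(&lru_lock);
    return nr;
}
```

      Calling model_munlock() a second time on the same page subtracts
      nothing, because the flag test and the counter update happen under
      the same lock.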
    • D
      mm: print more details for bad_page() · f0b791a3
      Dave Hansen committed
      bad_page() is cool in that it prints out a bunch of data about the page.
      But, I can never remember which page flags are good and which are bad,
      or whether ->index or ->mapping is required to be NULL.
      
      This patch allows bad/dump_page() callers to specify a string about why
      they are dumping the page and adds explanation strings to a number of
      places.  It also adds a 'bad_flags' argument to bad_page(), which it
      then dumps out separately from the flags which are actually set.
      
      This way, the messages will show specifically why the page was bad,
      *specifically* which flags it is complaining about, if it was a page
      flag combination which was the problem.
      
      [akpm@linux-foundation.org: switch to pr_alert]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0b791a3
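      A minimal sketch of the reporting pattern described in f0b791a3,
      written as ordinary userspace C.  The flag bits, demo_dump_page() and
      the message wording are all assumptions made for illustration; they
      are not the kernel's bad_page() or dump_page() implementation.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical flag bits for illustration; not the kernel's page flags. */
#define DEMO_FLAG_LOCKED  (1UL << 0)
#define DEMO_FLAG_DIRTY   (1UL << 1)
#define DEMO_FLAG_LRU     (1UL << 2)

/*
 * Report why a page is being dumped, print all flags that are set, then
 * separately print only the set flags that fall inside bad_mask.
 * Returns the offending subset so callers can act on it.
 */
static unsigned long demo_dump_page(unsigned long flags,
                                    unsigned long bad_mask,
                                    const char *reason)
{
    unsigned long bad = flags & bad_mask;

    assert(reason != NULL);
    fprintf(stderr, "page dumped because: %s\n", reason);
    fprintf(stderr, "flags: %#lx\n", flags);
    if (bad)
        fprintf(stderr, "bad because of flags: %#lx\n", bad);
    return bad;
}
```

      Printing the offending subset on its own line means a reader does not
      have to remember which of the set flags are legitimate.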
    • D
      mm/zswap.c: change params from hidden to ro · 12ab028b
      Dan Streetman committed
      The "compressor" and "enabled" params are currently hidden; this changes
      them to read-only, so userspace can tell if zswap is enabled or not and
      see what compressor is in use.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Vladimir Murzin <murzin.v@gmail.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      12ab028b
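      Once the parameters in 12ab028b are readable, userspace could inspect
      them with a helper along these lines.  read_param() is a hypothetical
      name, and the sketch assumes the values appear as one-line text files
      in the usual module parameter directory (for example
      /sys/module/zswap/parameters/enabled).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a parameter file into buf, stripping the trailing
 * newline.  Returns 0 on success and -1 if the file cannot be opened or read.
 */
static int read_param(const char *path, char *buf, size_t len)
{
    FILE *f;

    assert(path != NULL && buf != NULL && len > 0);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

      A return value of -1 covers both a kernel without zswap and one where
      the parameters are still hidden, so callers should treat it as
      "unknown" rather than "disabled".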
    • D
      mm: documentation: remove hopelessly out-of-date locking doc · 57ea8171
      Dave Hansen committed
      Documentation/vm/locking is a blast from the past.  In the entire git
      history, it has had precisely three modifications.  Two of those look to
      be pure renames, and the third was from 2005.
      
      The doc contains such gems as:
      
      > The page_table_lock is grabbed while holding the
      > kernel_lock spinning monitor.
      
      > Page stealers hold kernel_lock to protect against a bunch of
      > races.
      
      Or this which talks about mmap_sem:
      
      > 4. The exception to this rule is expand_stack, which just
      >    takes the read lock and the page_table_lock, this is ok
      >    because it doesn't really modify fields anybody relies on.
      
      expand_stack() doesn't take any locks any more directly, and the
      mmap_sem acquisition was long ago moved up in to the page fault code
      itself.
      
      It could be argued that we need to rewrite this, but it is dangerous to
      leave it as-is.  It will confuse more people than it helps.
      Signed-off-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57ea8171
    • M
      microblaze: extable: sort the exception table at build time · 372c7209
      Michal Simek committed
      Sort the exception table at build-time rather than during boot.
      
      Microblaze is in the same situation as AARCH64, which is why the
      EM_MICROBLAZE conditional check was added: it allows cross-compilation
      on machines which are not running the latest libc-dev.
      
      Inspired by AARCH64 commit adace895 ("arm64: extable: sort the
      exception table at build time").
      Signed-off-by: Michal Simek <michal.simek@xilinx.com>
      Acked-by: David Daney <david.daney@cavium.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      372c7209
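      Why a pre-sorted exception table matters can be shown with a short
      userspace sketch.  The entry layout and the names sort_table() and
      find_fixup() are assumptions for illustration, not the kernel's
      definitions: once the table is ordered by faulting address, each
      lookup is a binary search, and sorting at build time removes the
      one-off sort from boot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical entry layout: faulting instruction address and its fixup. */
struct ex_entry {
    unsigned long insn;
    unsigned long fixup;
};

static int cmp_entry(const void *a, const void *b)
{
    const struct ex_entry *x = a, *y = b;

    return (x->insn > y->insn) - (x->insn < y->insn);
}

/* Done once up front; the commit moves this step from boot to the build. */
static void sort_table(struct ex_entry *table, size_t n)
{
    assert(table != NULL || n == 0);
    qsort(table, n, sizeof(*table), cmp_entry);
}

/* Binary search, valid only because the table is already sorted by insn. */
static const struct ex_entry *find_fixup(const struct ex_entry *table,
                                         size_t n, unsigned long insn)
{
    struct ex_entry key = { .insn = insn, .fixup = 0 };

    return bsearch(&key, table, n, sizeof(*table), cmp_entry);
}
```

      find_fixup() returns NULL for an address with no entry, which in this
      model corresponds to a fault that has no registered fixup.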
    • G
      cris: provide {in,out}[wl]_p() · 3fdb38bd
      Geert Uytterhoeven committed
        drivers/staging/comedi/drivers/das6402.c: In function 'intr_handler':
        drivers/staging/comedi/drivers/das6402.c:164:3: error: implicit declaration of function 'outw_p' [-Werror=implicit-function-declaration]
        drivers/staging/speakup/speakup_dtlk.c: In function 'synth_probe':
        drivers/staging/speakup/speakup_dtlk.c:362:2: error: implicit declaration of function 'inw_p' [-Werror=implicit-function-declaration]
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fdb38bd
    • L
      Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · 90804ed6
      Linus Torvalds committed
      Pull UDF & jbd fixes from Jan Kara:
       "A cleanup of JBD log messages and UDF fix of a lockdep warning"
      
      * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        udf: Fix lockdep warning from udf_symlink()
        jbd: Revise KERN_EMERG error messages
      90804ed6
    • S
      assoc_array: remove global variable · 30b02c4b
      Stephen Hemminger committed
      The associative array code creates an unnecessary and potentially
      problematic global variable 'status'.  Remove it since it is never used.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      30b02c4b
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse · 5ee7a81a
      Linus Torvalds committed
      Pull fuse update from Miklos Szeredi:
       "This contains a fix for a potential use-after-module-unload bug
        noticed by Al and caching improvements for read-only fuse filesystems
        by Andrew Gallagher"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
        fuse: support clients that don't implement 'open'
        fuse: don't invalidate attrs when not using atime
        fuse: fix SetPageUptodate() condition in STORE
        fuse: fix pipe_buf_operations
      5ee7a81a
    • L
      Merge tag 'for-f2fs-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs · 0d90d638
      Linus Torvalds committed
      Pull f2fs updates from Jaegeuk Kim:
       "In this round, a couple of sysfs entries were introduced to tune the
        f2fs at runtime.
      
        In addition, f2fs starts to support inline_data and improves the
        read/write performance in some workloads by refactoring bio-related
        flows.
      
        This patch-set includes the following major enhancement patches.
         - support inline_data
         - refactor bio operations such as merge operations and rw type
           assignment
         - enhance the direct IO path
         - enhance bio operations
         - truncate a node page when it becomes obsolete
         - add sysfs entries: small_discards, max_victim_search, and
           in-place-update
         - add a sysfs entry to control max_victim_search
      
        The other bug fixes are as follows.
         - fix a bug in truncate_partial_nodes
         - avoid warnings during sparse and build process
         - fix error handling flows
         - fix potential bit overflows
      
        And, there are a bunch of cleanups"
      
      * tag 'for-f2fs-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (95 commits)
        f2fs: drop obsolete node page when it is truncated
        f2fs: introduce NODE_MAPPING for code consistency
        f2fs: remove the orphan block page array
        f2fs: add help function META_MAPPING
        f2fs: move a branch for code redability
        f2fs: call mark_inode_dirty to flush dirty pages
        f2fs: clean checkpatch warnings
        f2fs: missing REQ_META and REQ_PRIO when sync_meta_pages(META_FLUSH)
        f2fs: avoid f2fs_balance_fs call during pageout
        f2fs: add delimiter to seperate name and value in debug phrase
        f2fs: use spinlock rather than mutex for better speed
        f2fs: move alloc new orphan node out of lock protection region
        f2fs: move grabing orphan pages out of protection region
        f2fs: remove the needless parameter of f2fs_wait_on_page_writeback
        f2fs: update documents and a MAINTAINERS entry
        f2fs: add a sysfs entry to control max_victim_search
        f2fs: improve write performance under frequent fsync calls
        f2fs: avoid to read inline data except first page
        f2fs: avoid to left uninitialized data in page when read inline data
        f2fs: fix truncate_partial_nodes bug
        ...
      0d90d638
    • L
      Merge tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs · 1d32bdaf
      Linus Torvalds committed
      Pull xfs update from Ben Myers:
       "This is primarily bug fixes, many of which you already have.  New
        stuff includes a series to decouple the in-memory and on-disk log
        format, helpers in the area of inode clusters, and i_version handling.
      
        We decided to try to use more topic branches this release, so there
        are some merge commits in there on account of that.  I'm afraid I
        didn't do a good job of putting meaningful comments in the first
        couple of merges.  Sorry about that.  I think I have the hang of it
        now.
      
        For 3.14-rc1 there are fixes in the areas of remote attributes,
        discard, growfs, memory leaks in recovery, directory v2, quotas, the
        MAINTAINERS file, allocation alignment, extent list locking, and in
        xfs_bmapi_allocate.  There are cleanups in xfs_setsize_buftarg,
        removing unused macros, quotas, setattr, and freeing of inode
        clusters.  The in-memory and on-disk log format have been decoupled, a
        common helper to calculate the number of blocks in an inode cluster
        has been added, and handling of i_version has been pulled into the
        filesystems that use it.
      
         - cleanup in xfs_setsize_buftarg
         - removal of remaining unused flags for vop toss/flush/flushinval
         - fix for memory corruption in xfs_attrlist_by_handle
         - fix for out-of-date comment in xfs_trans_dqlockedjoin
         - fix for discard if range length is less than one block
         - fix for overrun of agfl buffer using growfs on v4 superblock
           filesystems
         - pull i_version handling out into the filesystems that use it
         - don't leak recovery items on error
         - fix for memory leak in xfs_dir2_node_removename
         - several cleanups for quotas
         - fix bad assertion in xfs_qm_vop_create_dqattach
         - cleanup for xfs_setattr_mode, and add xfs_setattr_time
         - fix quota assert in xfs_setattr_nonsize
         - fix an infinite loop when turning off group/project quota before
           user quota
         - fix for temporary buffer allocation failure in xfs_dir2_block_to_sf
           with large directory block sizes
         - fix Dave's email address in MAINTAINERS
         - cleanup calculation of freed inode cluster blocks
         - fix alignment of initial file allocations to match filesystem
           geometry
         - decouple in-memory and on-disk log format
         - introduce a common helper to calculate the number of filesystem
           blocks in an inode cluster
         - fixes for extent list locking
         - fix for off-by-one in xfs_attr3_rmt_verify
         - fix for missing destroy_work_on_stack in xfs_bmapi_allocate"
      
      * tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs: (51 commits)
        xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
        xfs: fix off-by-one error in xfs_attr3_rmt_verify
        xfs: assert that we hold the ilock for extent map access
        xfs: use xfs_ilock_attr_map_shared in xfs_attr_list_int
        xfs: use xfs_ilock_attr_map_shared in xfs_attr_get
        xfs: use xfs_ilock_data_map_shared in xfs_qm_dqiterate
        xfs: use xfs_ilock_data_map_shared in xfs_qm_dqtobp
        xfs: take the ilock around xfs_bmapi_read in xfs_zero_remaining_bytes
        xfs: reinstate the ilock in xfs_readdir
        xfs: add xfs_ilock_attr_map_shared
        xfs: rename xfs_ilock_map_shared
        xfs: remove xfs_iunlock_map_shared
        xfs: no need to lock the inode in xfs_find_handle
        xfs: use xfs_icluster_size_fsb in xfs_imap
        xfs: use xfs_icluster_size_fsb in xfs_ifree_cluster
        xfs: use xfs_icluster_size_fsb in xfs_ialloc_inode_init
        xfs: use xfs_icluster_size_fsb in xfs_bulkstat
        xfs: introduce a common helper xfs_icluster_size_fsb
        xfs: get rid of XFS_IALLOC_BLOCKS macros
        xfs: get rid of XFS_INODE_CLUSTER_SIZE macros
        ...
      1d32bdaf
  2. 23 Jan, 2014 9 commits
    • L
      Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux · 0dc3fd02
      Linus Torvalds committed
      Pull module updates from Rusty Russell.
      
      * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
        module: Add missing newline in printk call.
        module: fix coding style
        export: declare ksymtab symbols
        module.h: Remove unnecessary semicolon
        params: improve standard definitions
        Add Documentation/module-signing.txt file
      0dc3fd02
    • L
      Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux · 93b05cba
      Linus Torvalds committed
      Pull virtio update from Rusty Russell:
       "A few simple fixes.  Quiet cycle"
      
      * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
        drivers: virtio: Mark function virtballoon_migratepage() as static in virtio_balloon.c
        virtio-scsi: Fix hotcpu_notifier use-after-free with virtscsi_freeze
        virtio: pci: remove unnecessary pci_set_drvdata()
      93b05cba
    • L
      Merge tag 'stable/for-linus-3.14-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 84621c9b
      Linus Torvalds committed
      Pull Xen updates from Konrad Rzeszutek Wilk:
       "Two major features that Xen community is excited about:
      
        The first is event channel scalability by David Vrabel - we switch
        over from a two-level per-cpu bitmap of events (IRQs) - to a FIFO
        queue with priorities.  This lets us handle more events,
        have lower latency, and better scalability.  Good stuff.
      
        The other is PVH by Mukesh Rathor.  In short, PV is a mode where the
        kernel lets the hypervisor program page-tables, segments, etc.  With
        EPT/NPT capabilities in current processors, the overhead of doing this
        in an HVM (Hardware Virtual Machine) container is much lower than the
        hypervisor doing it for us.
      
        In short we let a PV guest run without doing page-table, segment,
        syscall, etc updates through the hypervisor - instead it is all done
        within the guest container.  It is a "hybrid" PV - hence the 'PVH'
        name - a PV guest within an HVM container.
      
        The major benefits are less code to deal with - for example we only
        use one function from the pv_mmu_ops (which has 39 function
        calls); faster performance for syscall (no context switches into the
        hypervisor); fewer traps on various operations; etc.
      
        It is still being baked - the ABI is not yet set in stone.  But it is
        pretty awesome and we are excited about it.
      
        Lastly, there are some changes to ARM code - you should get a simple
        conflict which has been resolved in #linux-next.
      
        In short, this pull has awesome features.
      
        Features:
         - FIFO event channels.  Key advantages: support for over 100,000
           events (2^17), 16 different event priorities, improved fairness in
           event latency through the use of FIFOs.
         - Xen PVH support.  "It’s a fully PV kernel mode, running with
           paravirtualized disk and network, paravirtualized interrupts and
           timers, no emulated devices of any kind (and thus no qemu), no BIOS
           or legacy boot — but instead of requiring PV MMU, it uses the HVM
           hardware extensions to virtualize the pagetables, as well as system
           calls and other privileged operations." (from "The
           Paravirtualization Spectrum, Part 2: From poles to a spectrum")
      
        Bug-fixes:
         - Fixes in balloon driver (refactor and make it work under ARM)
         - Allow xenfb to be used in HVM guests.
         - Allow xen_platform_pci=0 to work properly.
         - Refactors in event channels"
      
      * tag 'stable/for-linus-3.14-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (52 commits)
        xen/pvh: Set X86_CR0_WP and others in CR0 (v2)
        MAINTAINERS: add git repository for Xen
        xen/pvh: Use 'depend' instead of 'select'.
        xen: delete new instances of __cpuinit usage
        xen/fb: allow xenfb initialization for hvm guests
        xen/evtchn_fifo: fix error return code in evtchn_fifo_setup()
        xen-platform: fix error return code in platform_pci_init()
        xen/pvh: remove duplicated include from enlighten.c
        xen/pvh: Fix compile issues with xen_pvh_domain()
        xen: Use dev_is_pci() to check whether it is pci device
        xen/grant-table: Force to use v1 of grants.
        xen/pvh: Support ParaVirtualized Hardware extensions (v3).
        xen/pvh: Piggyback on PVHVM XenBus.
        xen/pvh: Piggyback on PVHVM for grant driver (v4)
        xen/grant: Implement an grant frame array struct (v3).
        xen/grant-table: Refactor gnttab_init
        xen/grants: Remove gnttab_max_grant_frames dependency on gnttab_init.
        xen/pvh: Piggyback on PVHVM for event channels (v2)
        xen/pvh: Update E820 to work with PVH (v2)
        xen/pvh: Secondary VCPU bringup (non-bootup CPUs)
        ...
      84621c9b
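      The FIFO event channel scheme summarised in 84621c9b (2^17 = 131072
      events, 16 priorities) can be modelled as one FIFO per priority.
      This is a toy sketch, not the Xen ABI: struct event_queues,
      queue_event(), next_event() and the queue depth are hypothetical, and
      it assumes priority 0 is the most urgent.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PRIORITIES 16   /* from the pull message */
#define QUEUE_DEPTH    8    /* arbitrary for this sketch */

/* Hypothetical model: one small ring buffer of event ports per priority. */
struct event_queues {
    unsigned int ring[NUM_PRIORITIES][QUEUE_DEPTH];
    unsigned int head[NUM_PRIORITIES];
    unsigned int count[NUM_PRIORITIES];
};

/* Append a port to the tail of its priority queue; returns -1 when full. */
static int queue_event(struct event_queues *q, unsigned int prio,
                       unsigned int port)
{
    unsigned int tail;

    assert(q != NULL);
    if (prio >= NUM_PRIORITIES || q->count[prio] == QUEUE_DEPTH)
        return -1;
    tail = (q->head[prio] + q->count[prio]) % QUEUE_DEPTH;
    q->ring[prio][tail] = port;
    q->count[prio]++;
    return 0;
}

/*
 * Pop the oldest event from the most urgent non-empty queue.
 * Assumption: priority 0 is the most urgent.  Returns -1 when all are empty.
 */
static int next_event(struct event_queues *q)
{
    assert(q != NULL);
    for (unsigned int prio = 0; prio < NUM_PRIORITIES; prio++) {
        if (q->count[prio]) {
            unsigned int port = q->ring[prio][q->head[prio]];

            q->head[prio] = (q->head[prio] + 1) % QUEUE_DEPTH;
            q->count[prio]--;
            return (int)port;
        }
    }
    return -1;
}
```

      Within one priority, events come out in arrival order, which is the
      fairness property the pull message attributes to the use of FIFOs.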
    • L
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 7ebd3faa
      Linus Torvalds committed
      Pull KVM updates from Paolo Bonzini:
       "First round of KVM updates for 3.14; PPC parts will come next week.
      
        Nothing major here, just bugfixes all over the place.  The most
        interesting part is the ARM guys' virtualized interrupt controller
        overhaul, which lets userspace get/set the state and thus enables
        migration of ARM VMs"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
        kvm: make KVM_MMU_AUDIT help text more readable
        KVM: s390: Fix memory access error detection
        KVM: nVMX: Update guest activity state field on L2 exits
        KVM: nVMX: Fix nested_run_pending on activity state HLT
        KVM: nVMX: Clean up handling of VMX-related MSRs
        KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
        KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
        KVM: nVMX: Leave VMX mode on clearing of feature control MSR
        KVM: VMX: Fix DR6 update on #DB exception
        KVM: SVM: Fix reading of DR6
        KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
        add support for Hyper-V reference time counter
        KVM: remove useless write to vcpu->hv_clock.tsc_timestamp
        KVM: x86: fix tsc catchup issue with tsc scaling
        KVM: x86: limit PIT timer frequency
        KVM: x86: handle invalid root_hpa everywhere
        kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub
        kvm: vfio: silence GCC warning
        KVM: ARM: Remove duplicate include
        arm/arm64: KVM: relax the requirements of VMA alignment for THP
        ...
      7ebd3faa
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial · bb1281f2
      Linus Torvalds committed
      Pull trivial tree updates from Jiri Kosina:
       "Usual rocket science stuff from trivial.git"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
        neighbour.h: fix comment
        sched: Fix warning on make htmldocs caused by wait.h
        slab: struct kmem_cache is protected by slab_mutex
        doc: Fix typo in USB Gadget Documentation
        of/Kconfig: Spelling s/one/once/
        mkregtable: Fix sscanf handling
        lp5523, lp8501: comment improvements
        thermal: rcar: comment spelling
        treewide: fix comments and printk msgs
        IXP4xx: remove '1 &&' from a condition check in ixp4xx_restart()
        Documentation: update /proc/uptime field description
        Documentation: Fix size parameter for snprintf
        arm: fix comment header and macro name
        asm-generic: uaccess: Spelling s/a ny/any/
        mtd: onenand: fix comment header
        doc: driver-model/platform.txt: fix a typo
        drivers: fix typo in DEVTMPFS_MOUNT Kconfig help text
        doc: Fix typo (acces_process_vm -> access_process_vm)
        treewide: Fix typos in printk
        drivers/gpu/drm/qxl/Kconfig: reformat the help text
        ...
      bb1281f2
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid · 4988abf1
      Linus Torvalds committed
      Pull HID updates from Jiri Kosina:
      
       - quite some work on hid-sony driver in order to have DualShock 4
         device properly supported, from Frank Praznik
      
       - fixed support for suspending I2C connected devices, from Mika
         Westerberg
      
       - regression fix for 0xff05 usage on Microsoft Ergonomy, from Jiri
         Kosina
      
       - support for Synaptics HD touchscreen, from AceLan Kao
      
       - workaround for USB 3.0 problem for logitech-dj connected devices,
         from Benjamin Tissoires
      
       - support for Logitech Dual Action pads, from Vitaly Katraew
      
       - quite a few other assorted fixes and device ID additions
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (33 commits)
        HID: sony: Use colors for the Dualshock 4 LED names
        HID: sony: Add annotated HID descriptor for the Dualshock 4
        HID: sony: Cache the output report for the Dualshock 4
        HID: sony: Map gyroscopes and accelerometers to axes
        HID: sony: Fix spacing in the device definitions.
        HID: sony: Use standard output reports instead of raw reports to send data to the Dualshock 4.
        HID: sony: Use separate identifiers for USB and Bluetooth connected Dualshock 4 controllers.
        HID: hid-holtek-mouse: add new a070 mouse
        HID: hid-sensor-hub: Fix buggy report descriptors
        HID: logitech-dj: Fix USB 3.0 issue
        HID: sony: Rename worker function
        HID: sony: Add LED controls for the Dualshock 4
        HID: sony: Add force-feedback support for the Dualshock 4
        HID: hidraw: make comment more accurate and nicer
        HID: sony: fix error return code
        HID: input: fix input sysfs path for hid devices
        HID: debug: add labels for some new buttons
        HID: remove SIS entries from hid_have_special_driver[]
        HID: microsoft: no fallthrough in MS ergonomy 0xff05 usage
        HID: add support for SiS multitouch panel in the touch monitor LG 23ET83V
        ...
      4988abf1
    • L
      Merge tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · fe41c2c0
      Linus Torvalds committed
      Pull device-mapper changes from Mike Snitzer:
       "A lot of attention was paid to improving the thin-provisioning
        target's handling of metadata operation failures and running out of
        space.  A new 'error_if_no_space' feature was added to allow users to
        error IOs rather than queue them when either the data or metadata
        space is exhausted.
      
        Additional fixes/features include:
         - a few fixes to properly support thin metadata device resizing
         - a solution for reliably waiting for a DM device's embedded kobject
           to be released before destroying the device
         - old dm-snapshot is updated to use the dm-bufio interface to take
           advantage of readahead capabilities that improve snapshot
           activation
         - new dm-cache target tunables to control how quickly data is
           promoted to the cache (fast) device
         - improved write efficiency of cluster mirror target by combining
           userspace flush and mark requests"
      
      * tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits)
        dm log userspace: allow mark requests to piggyback on flush requests
        dm space map metadata: fix bug in resizing of thin metadata
        dm cache: add policy name to status output
        dm thin: fix pool feature parsing
        dm sysfs: fix a module unload race
        dm snapshot: use dm-bufio prefetch
        dm snapshot: use dm-bufio
        dm snapshot: prepare for switch to using dm-bufio
        dm snapshot: use GFP_KERNEL when initializing exceptions
        dm cache: add block sizes and total cache blocks to status output
        dm btree: add dm_btree_find_lowest_key
        dm space map metadata: fix extending the space map
        dm space map common: make sure new space is used during extend
        dm: wait until embedded kobject is released before destroying a device
        dm: remove pointless kobject comparison in dm_get_from_kobject
        dm snapshot: call destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
        dm cache policy mq: introduce three promotion threshold tunables
        dm cache policy mq: use list_del_init instead of list_del + INIT_LIST_HEAD
        dm thin: fix set_pool_mode exposed pool operation races
        dm thin: eliminate the no_free_space flag
        ...
      fe41c2c0
    • L
      Merge tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 194e57fd
      Linus Torvalds committed
      Pull SCSI updates from James Bottomley:
       "This patch set is a lot of driver updates for qla4xxx, bfa, hpsa,
        qla2xxx.  It also removes the aic7xxx_old driver (which has been
        deprecated for nearly a decade) and adds support for deadlines in
        error handling"
      
      * tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (75 commits)
        [SCSI] hpsa: allow SCSI mid layer to handle unit attention
        [SCSI] hpsa: do not require board "not ready" status after hard reset
        [SCSI] hpsa: enable unit attention reporting
        [SCSI] hpsa: rename scsi prefetch field
        [SCSI] hpsa: use workqueue instead of kernel thread for lockup detection
        [SCSI] ipr: increase dump size in ipr driver
        [SCSI] mac_scsi: Fix crash on out of memory
        [SCSI] st: fix enlarge_buffer
        [SCSI] qla1280: Annotate timer on stack so object debug does not complain
        [SCSI] qla4xxx: Update driver version to 5.04.00-k3
        [SCSI] qla4xxx: Recreate chap data list during get chap operation
        [SCSI] qla4xxx: Add support for ISCSI_PARAM_LOCAL_IPADDR sysfs attr
        [SCSI] libiscsi: Add local_ipaddr parameter in iscsi_conn struct
        [SCSI] scsi_transport_iscsi: Export ISCSI_PARAM_LOCAL_IPADDR attr for iscsi_connection
        [SCSI] qla4xxx: Add host statistics support
        [SCSI] scsi_transport_iscsi: Add host statistics support
        [SCSI] qla4xxx: Added support for Diagnostics MBOX command
        [SCSI] bfa: Driver version upgrade to 3.2.23.0
        [SCSI] bfa: change FC_ELS_TOV to 20sec
        [SCSI] bfa: Observed auto D-port mode instead of manual
        ...
      194e57fd
    • L
      Merge tag 'pci-v3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · e1ba8459
      Linus Torvalds committed
      Pull PCI updates from Bjorn Helgaas:
       "PCI changes for the v3.14 merge window:
      
        Resource management
          - Change pci_bus_region addresses to dma_addr_t (Bjorn Helgaas)
          - Support 64-bit AGP BARs (Bjorn Helgaas, Yinghai Lu)
          - Add pci_bus_address() to get bus address of a BAR (Bjorn Helgaas)
          - Use pci_resource_start() for CPU address of AGP BARs (Bjorn Helgaas)
          - Enforce bus address limits in resource allocation (Yinghai Lu)
          - Allocate 64-bit BARs above 4G when possible (Yinghai Lu)
          - Convert pcibios_resource_to_bus() to take pci_bus, not pci_dev (Yinghai Lu)
      
        PCI device hotplug
          - Major rescan/remove locking update (Rafael J. Wysocki)
          - Make ioapic builtin only (not modular) (Yinghai Lu)
          - Fix release/free issues (Yinghai Lu)
          - Clean up pciehp (Bjorn Helgaas)
          - Announce pciehp slot info during enumeration (Bjorn Helgaas)
      
        MSI
          - Add pci_msi_vec_count(), pci_msix_vec_count() (Alexander Gordeev)
          - Add pci_enable_msi_range(), pci_enable_msix_range() (Alexander Gordeev)
          - Deprecate "tri-state" interfaces: fail/success/fail+info (Alexander Gordeev)
          - Export MSI mode using attributes, not kobjects (Greg Kroah-Hartman)
          - Drop "irq" param from *_restore_msi_irqs() (DuanZhenzhong)
      
        SR-IOV
          - Clear NumVFs when disabling SR-IOV in sriov_init() (ethan.zhao)
      
        Virtualization
          - Add support for save/restore of extended capabilities (Alex Williamson)
          - Add Virtual Channel to save/restore support (Alex Williamson)
          - Never treat a VF as a multifunction device (Alex Williamson)
          - Add pci_try_reset_function(), et al (Alex Williamson)
      
        AER
          - Ignore non-PCIe error sources (Betty Dall)
          - Support ACPI HEST error sources for domains other than 0 (Betty Dall)
          - Consolidate HEST error source parsers (Bjorn Helgaas)
          - Add a TLP header print helper (Borislav Petkov)
      
        Freescale i.MX6
          - Remove unnecessary code (Fabio Estevam)
          - Make reset-gpio optional (Marek Vasut)
          - Report "link up" only after link training completes (Marek Vasut)
          - Start link in Gen1 before negotiating for Gen2 mode (Marek Vasut)
          - Fix PCIe startup code (Richard Zhu)
      
        Marvell MVEBU
          - Remove duplicate of_clk_get_by_name() call (Andrew Lunn)
          - Drop writes to bridge Secondary Status register (Jason Gunthorpe)
          - Obey bridge PCI_COMMAND_MEM and PCI_COMMAND_IO bits (Jason Gunthorpe)
          - Support a bridge with no IO port window (Jason Gunthorpe)
          - Use max_t() instead of max(resource_size_t,) (Jingoo Han)
          - Remove redundant of_match_ptr (Sachin Kamat)
          - Call pci_ioremap_io() at startup instead of dynamically (Thomas Petazzoni)
      
        NVIDIA Tegra
          - Disable Gen2 for Tegra20 and Tegra30 (Eric Brower)
      
        Renesas R-Car
          - Add runtime PM support (Valentine Barshak)
          - Fix rcar_pci_probe() return value check (Wei Yongjun)
      
        Synopsys DesignWare
          - Fix crash in dw_msi_teardown_irq() (Bjørn Erik Nilsen)
          - Remove redundant call to pci_write_config_word() (Bjørn Erik Nilsen)
          - Fix missing MSI IRQs (Harro Haan)
          - Add dw_pcie prefix before cfg_read/write (Pratyush Anand)
          - Fix I/O transfers by using CPU (not realio) address (Pratyush Anand)
          - Whitespace cleanup (Jingoo Han)
      
        EISA
          - Call put_device() if device_register() fails (Levente Kurusa)
          - Revert EISA initialization breakage (Bjorn Helgaas)
      
        Miscellaneous
          - Remove unused code, including PCIe 3.0 interfaces (Stephen Hemminger)
          - Prevent bus conflicts while checking for bridge apertures (Bjorn Helgaas)
          - Stop clearing bridge Secondary Status when setting up I/O aperture (Bjorn Helgaas)
          - Use dev_is_pci() to identify PCI devices (Yijing Wang)
          - Deprecate DEFINE_PCI_DEVICE_TABLE (Joe Perches)
          - Update documentation 00-INDEX (Erik Ekman)"
      
      * tag 'pci-v3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (119 commits)
        Revert "EISA: Initialize device before its resources"
        Revert "EISA: Log device resources in dmesg"
        vfio-pci: Use pci "try" reset interface
        PCI: Check parent kobject in pci_destroy_dev()
        xen/pcifront: Use global PCI rescan-remove locking
        powerpc/eeh: Use global PCI rescan-remove locking
        PCI: Fix pci_check_and_unmask_intx() comment typos
        PCI: Add pci_try_reset_function(), pci_try_reset_slot(), pci_try_reset_bus()
        MPT / PCI: Use pci_stop_and_remove_bus_device_locked()
        platform / x86: Use global PCI rescan-remove locking
        PCI: hotplug: Use global PCI rescan-remove locking
        pcmcia: Use global PCI rescan-remove locking
        ACPI / hotplug / PCI: Use global PCI rescan-remove locking
        ACPI / PCI: Use global PCI rescan-remove locking in PCI root hotplug
        PCI: Add global pci_lock_rescan_remove()
        PCI: Cleanup pci.h whitespace
        PCI: Reorder so actual code comes before stubs
        PCI/AER: Support ACPI HEST AER error sources for PCI domains other than 0
        ACPICA: Add helper macros to extract bus/segment numbers from HEST table.
        PCI: Make local functions static
        ...