1. 11 Sep, 2015 15 commits
    • V
      include/linux/poison.h: fix LIST_POISON{1,2} offset · 8a5e5e02
      Vasily Kulikov committed
      Poison pointer values should be small enough to fit in
      non-mmap'able/hardly-mmap'able space.  E.g.  on x86 "poison pointer space"
      is located starting from 0x0.  Given that unprivileged users cannot mmap
      anything below mmap_min_addr, it should be safe to use poison pointers
      lower than mmap_min_addr.
      
      The current poison pointer values of LIST_POISON{1,2} might be too big for
      mmap_min_addr values equal to or less than 1 MB (a common case, e.g.  Ubuntu
      uses only 0x10000).  There is little point in using such a big value given
      that the "poison pointer space" below 1 MB is not yet exhausted.  Changing it
      to a smaller value solves the problem for small mmap_min_addr setups.
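
The comparison described above can be sketched in a few lines. This is an illustrative model only: the old and new constants are recalled from include/linux/poison.h around this change and should be treated as assumptions, POISON_POINTER_DELTA is taken to be 0, and the helper names are hypothetical.

```python
# Illustrative model only; the constants are assumptions, not quoted from the patch.
MIB = 1 << 20

# Assumed pre-patch and post-patch values, with POISON_POINTER_DELTA == 0.
OLD_POISON = {"LIST_POISON1": 0x00100100, "LIST_POISON2": 0x00200200}
NEW_POISON = {"LIST_POISON1": 0x100, "LIST_POISON2": 0x200}


def reachable_by_unprivileged_mmap(addr, mmap_min_addr):
    """Return True if an unprivileged mmap() could cover `addr`.

    Unprivileged users cannot map anything below mmap_min_addr, so a
    poison pointer is only safe when it lies below that limit.
    """
    return addr >= mmap_min_addr


def report(mmap_min_addr):
    """Map each poison name to (old_reachable, new_reachable)."""
    return {
        name: (
            reachable_by_unprivileged_mmap(OLD_POISON[name], mmap_min_addr),
            reachable_by_unprivileged_mmap(NEW_POISON[name], mmap_min_addr),
        )
        for name in OLD_POISON
    }


if __name__ == "__main__":
    # 0x10000 is the Ubuntu setting mentioned in the message.
    for name, (old, new) in sorted(report(0x10000).items()):
        print(name, "old reachable:", old, "new reachable:", new)
```

Under these assumptions the old values sit just above 1 MB and 2 MB, so any mmap_min_addr of 1 MB or less leaves LIST_POISON1 mappable, while the smaller values stay below even a 0x10000 limit.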
      
      The values are suggested by Solar Designer:
      http://www.openwall.com/lists/oss-security/2015/05/02/6
      
      Signed-off-by: Vasily Kulikov <segoon@openwall.com>
      Cc: Solar Designer <solar@openwall.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a5e5e02
    • W
      proc: change proc_subdir_lock to a rwlock · ecf1a3df
      Waiman Long committed
      The proc_subdir_lock spinlock is used to allow only one task at a time to
      change the proc directory structure or look up information in it.
      However, the lookup part can actually be entered by more than one task,
      as the pde_get() and pde_put() reference count update calls in the
      critical sections are atomic increment and decrement respectively and
      so are safe with concurrent updates.
      
      The x86 architecture already uses qrwlock, which is fair, and other
      architectures like ARM are in the process of switching to qrwlock.  So
      unfairness shouldn't be a concern in that conversion.
      
      This patch changed the proc_subdir_lock to a rwlock in order to enable
      concurrent lookup. The following functions were modified to take a
      write lock:
       - proc_register()
       - remove_proc_entry()
       - remove_proc_subtree()
      
      The following functions were modified to take a read lock:
       - xlate_proc_name()
       - proc_lookup_de()
       - proc_readdir_de()
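
The split between write-side and read-side callers can be illustrated with a small userspace model. This is only a sketch: ToyRWLock and ToyProcDir are hypothetical names, the lock is a simple reader-preference lock (so it is not fair, unlike qrwlock), and none of it reflects the kernel implementation.

```python
import threading
from contextlib import contextmanager


class ToyRWLock:
    """Simple reader-preference read/write lock (not fair, unlike qrwlock)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # number of tasks currently holding a read lock
        self._writer = False    # True while one task holds the write lock

    @contextmanager
    def read_lock(self):
        with self._cond:
            # Readers only wait for an active writer, never for each other.
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

    @contextmanager
    def write_lock(self):
        with self._cond:
            # A writer needs exclusive access: no writer and no readers.
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()


class ToyProcDir:
    """Hypothetical stand-in for a proc directory guarded by the lock."""

    def __init__(self):
        self._lock = ToyRWLock()
        self._entries = {}

    def register(self, name, entry):      # writer path, cf. proc_register()
        with self._lock.write_lock():
            self._entries[name] = entry

    def remove(self, name):               # writer path, cf. remove_proc_entry()
        with self._lock.write_lock():
            self._entries.pop(name, None)

    def lookup(self, name):               # reader path, cf. proc_lookup_de()
        with self._lock.read_lock():
            return self._entries.get(name)
```

With a plain mutex every lookup() would serialize; here any number of lookups can hold the read side at once, and only register() and remove() exclude everyone else.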
      
      A parallel /proc filesystem search with the "find" command (1000 threads)
      was run on a 4-socket Haswell-EX box (144 threads).  Before the patch, the
      parallel search took about 39s.  After the patch, the parallel find took
      only 25s, a saving of about 14s.
      
      The micro-benchmark that I used was artificial, but it was used to
      reproduce an exit hanging problem that I saw in a real application.  In
      fact, allowing only one task to do a lookup seems too limiting to me.
      
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecf1a3df
    • C
      procfs: always expose /proc/<pid>/map_files/ and make it readable · bdb4d100
      Calvin Owens committed
      Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
      only exposed if CONFIG_CHECKPOINT_RESTORE is set.
      
      Each mapped file region gets a symlink in /proc/<pid>/map_files/
      corresponding to the virtual address range at which it is mapped.  The
      symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
      to the backing file even if that backing file has been unlinked.
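
As a rough illustration of what the directory looks like from userspace, the sketch below lists the entries for a process and decodes their names. It assumes each entry is named "<hex start>-<hex end>" without a 0x prefix; the function names are hypothetical, and whether a link can be read depends on the permission rules described in this message.

```python
import os


def parse_map_files_name(name):
    """Split a map_files entry name into (start, end) addresses.

    Assumes the name has the form "<hex start>-<hex end>" without a 0x prefix.
    """
    start, end = name.split("-")
    return int(start, 16), int(end, 16)


def list_mapped_files(pid="self"):
    """Yield (start, end, target) per entry; target is None if unreadable."""
    base = "/proc/%s/map_files" % pid
    for name in sorted(os.listdir(base)):
        start, end = parse_map_files_name(name)
        try:
            target = os.readlink(os.path.join(base, name))
        except OSError:
            # The caller may lack the privileges needed to resolve the link.
            target = None
        yield start, end, target


if __name__ == "__main__":
    try:
        for start, end, target in list_mapped_files():
            print("%#x-%#x -> %s" % (start, end, target))
    except OSError as err:
        print("map_files not available:", err)
```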
      
      Currently, files which are mapped, unlinked, and closed are impossible to
      stat() from userspace.  Exposing /proc/<pid>/map_files/ closes this
      functionality "hole".
      
      Not being able to stat() such files makes noticing and explicitly
      accounting for the space they use on the filesystem impossible.  You can
      work around this by summing up the space used by every file in the
      filesystem and subtracting that total from what statfs() tells you, but
      that obviously isn't great, and it becomes unworkable once your filesystem
      becomes large enough.
      
      This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
      adjusts the permissions enforced on it as follows:
      
      * proc_map_files_lookup()
      * proc_map_files_readdir()
      * map_files_d_revalidate()
      
      	Remove the CAP_SYS_ADMIN restriction, leaving only the current
      	restriction requiring PTRACE_MODE_READ. The information made
      	available to userspace by these three functions is already
      	available in /proc/PID/maps with MODE_READ, so I don't see any
      	reason to limit them any further (see below for more detail).
      
      * proc_map_files_follow_link()
      
      	This stub has been added, and requires that the user have
      	CAP_SYS_ADMIN in order to follow the links in map_files/,
      	since there was concern on LKML both about the potential for
      	bypassing permissions on ancestor directories in the path to
      	files pointed to, and about what happens with more exotic
      	memory mappings created by some drivers (i.e. dma-buf).
      
      In older versions of this patch, I changed every permission check in
      the four functions above to enforce MODE_ATTACH instead of MODE_READ.
      This was an oversight on my part, and after revisiting the discussion
      it seems that nobody was concerned about anything outside of what is
      made possible by ->follow_link(). So in this version, I've left the
      checks for PTRACE_MODE_READ as-is.
      
      [akpm@linux-foundation.org: catch up with concurrent proc_pid_follow_link() changes]
      
      Signed-off-by: Calvin Owens <calvinowens@fb.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bdb4d100
    • V
      proc: add cond_resched to /proc/kpage* read/write loop · d3691d2c
      Vladimir Davydov committed
      Reading/writing a /proc/kpage* file may take a long time on machines with
      a lot of RAM installed.
      
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Suggested-by: Andres Lagar-Cavilla <andreslc@google.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3691d2c
    • V
      proc: export idle flag via kpageflags · f074a8f4
      Vladimir Davydov committed
      As noted by Minchan, a benefit of reading idle flag from /proc/kpageflags
      is that one can easily filter dirty and/or unevictable pages while
      estimating the size of unused memory.
      
      Note that idle flag read from /proc/kpageflags may be stale in case the
      page was accessed via a PTE, because it would be too costly to iterate
      over all page mappings on each /proc/kpageflags read to provide an
      up-to-date value.  To make sure the flag is up-to-date one has to read
      /sys/kernel/mm/page_idle/bitmap first.
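
The filtering mentioned above could look roughly like this. It is a sketch that reuses the bit positions from the script elsewhere in this series (18 for unevictable, 22 for huge, 25 for idle); the constant and function names are hypothetical, and the caller is assumed to have read /sys/kernel/mm/page_idle/bitmap first so that the idle flag is fresh.

```python
import struct

# Bit positions as used by the script in this series (assumed to stay stable).
KPF_UNEVICTABLE = 18
KPF_HUGE = 22
KPF_IDLE = 25


def decode_flags(flags):
    """Pick the flags of interest out of one 64-bit kpageflags value."""
    return {
        "unevictable": bool((flags >> KPF_UNEVICTABLE) & 1),
        "huge": bool((flags >> KPF_HUGE) & 1),
        "idle": bool((flags >> KPF_IDLE) & 1),
    }


def decode_record(raw):
    """Decode one 8-byte /proc/kpageflags record (native byte order)."""
    (flags,) = struct.unpack("Q", raw)
    return decode_flags(flags)


def counts_as_unused(flags):
    """Idle pages that are not unevictable count toward unused memory."""
    decoded = decode_flags(flags)
    return decoded["idle"] and not decoded["unevictable"]
```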
      
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f074a8f4
    • V
      mm: introduce idle page tracking · 33c3fc71
      Vladimir Davydov committed
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace by setting a bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the amount
      of pages that are not used by the workload.
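
To show what "the offset corresponding to the page" might mean in practice, here is a minimal sketch. It assumes the bitmap is an array of 8-byte words in native byte order, with the page at PFN i mapped to bit i % 64 of word i / 64; that layout and all helper names are assumptions for illustration, and the demo runs against an in-memory buffer rather than the real sysfs file.

```python
import io
import struct

BITS_PER_WORD = 64
WORD_BYTES = 8


def bitmap_position(pfn):
    """Return (byte_offset, bit_mask) locating `pfn` in the idle bitmap."""
    word, bit = divmod(pfn, BITS_PER_WORD)
    return word * WORD_BYTES, 1 << bit


def mark_idle(bitmap_file, pfn):
    """Write a word with only the bit for `pfn` set at the matching offset."""
    offset, mask = bitmap_position(pfn)
    bitmap_file.seek(offset)
    bitmap_file.write(struct.pack("Q", mask))


def is_idle(bitmap_file, pfn):
    """Read the word holding `pfn` and test its bit."""
    offset, mask = bitmap_position(pfn)
    bitmap_file.seek(offset)
    (word,) = struct.unpack("Q", bitmap_file.read(WORD_BYTES))
    return bool(word & mask)


def demo():
    # Two zeroed 64-bit words stand in for PFNs 0..127.  Unlike the real
    # file, a BytesIO buffer overwrites the word instead of OR-ing into it.
    fake = io.BytesIO(bytes(16))
    mark_idle(fake, 70)
    return is_idle(fake, 70), is_idle(fake, 71)
```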
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
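
The interplay of the two flags can be modelled with a toy page object. This is a simplified illustration of the behaviour described above, not kernel code: ToyPage and its methods are hypothetical, and a single boolean stands in for the Access bit of all page table entries pointing to the page.

```python
class ToyPage:
    """Toy model of one user page with an Access bit and Idle/Young flags."""

    def __init__(self):
        self.pte_accessed = False   # Access bit in the page table entry
        self.idle = False
        self.young = False

    def touch(self):
        """The workload accesses the page through its page table entry."""
        self.pte_accessed = True

    def mark_idle(self):
        """Userspace sets the page's bit in the bitmap file."""
        if self.pte_accessed:
            # Clearing the Access bit would hide the access from the
            # reclaimer, so remember it in the Young flag.
            self.pte_accessed = False
            self.young = True
        self.idle = True

    def is_idle(self):
        """Userspace reads the page's bit from the bitmap file."""
        if self.pte_accessed:
            self.pte_accessed = False
            self.young = True
            self.idle = False
        return self.idle

    def page_referenced(self):
        """Reclaimer-side check: count references, including a Young page."""
        refs = 0
        if self.pte_accessed:
            self.pte_accessed = False
            self.idle = False
            refs += 1
        if self.young:
            self.young = False
            refs += 1
        return refs
```

In this model a page that was touched and then marked idle still reports one reference to the reclaimer, which is the effect the Young flag is meant to achieve.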
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33c3fc71
    • V
      mmu-notifier: add clear_young callback · 1d7715c6
      Vladimir Davydov committed
      In the scope of the idle memory tracking feature, which is introduced by
      the following patch, we need to clear the referenced/accessed bit not only
      in primary, but also in secondary ptes.  The latter is required in order
      to estimate the working set size (wss) of KVM VMs.  At the same time we want
      to avoid flushing the TLB, because it is quite expensive and it won't really
      affect the final result.
      
      Currently, there is no function for clearing pte young bit that would meet
      our requirements, so this patch introduces one.  To achieve that we have
      to add a new mmu-notifier callback, clear_young, since there is no method
      for testing-and-clearing a secondary pte w/o flushing tlb.  The new method
      is not mandatory and currently only implemented by KVM.
      
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d7715c6
    • V
      proc: add kpagecgroup file · 80ae2fdc
      Vladimir Davydov committed
      /proc/kpagecgroup contains a 64-bit inode number of the memory cgroup each
      page is charged to, indexed by PFN.  Having this information is useful for
      estimating a cgroup working set size.
      
      The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
      
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80ae2fdc
    • V
      memcg: zap try_get_mem_cgroup_from_page · e993d905
      Vladimir Davydov committed
      It is only used in mem_cgroup_try_charge, so fold it in and zap it.
      
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e993d905
    • V
      hwpoison: use page_cgroup_ino for filtering by memcg · 94a59fb3
      Vladimir Davydov committed
      Hwpoison allows filtering pages by memory cgroup ino.  Currently, it
      calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
      then its ino using cgroup_ino, but now we have a helper method for
      that, page_cgroup_ino, so use it instead.
      
      This patch also loosens the hwpoison memcg filter dependency rules - it
      makes it depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP, because
      the hwpoison memcg filter does not require anything (nor did it ever)
      from the CONFIG_MEMCG_SWAP side.
      
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94a59fb3
    • V
      memcg: add page_cgroup_ino helper · 2fc04524
      Vladimir Davydov committed
      This patchset introduces a new user API for tracking user memory pages
      that have not been used for a given period of time.  The purpose of this
      is to provide the userspace with the means of tracking a workload's
      working set, i.e.  the set of pages that are actively used by the
      workload.  Knowing the working set size can be useful for partitioning the
      system more efficiently, e.g.  by tuning memory cgroup limits
      appropriately, or for job placement within a compute cluster.
      
      ==== USE CASES ====
      
      The unified cgroup hierarchy has memory.low and memory.high knobs, which
      are defined as the low and high boundaries for the workload working set
      size.  However, the working set size of a workload may be unknown or
      change over time.  With this patch set, one can periodically estimate the
      amount of memory unused by each cgroup and tune their memory.low and
      memory.high parameters accordingly, therefore optimizing the overall
      memory utilization.
      
      Another use case is balancing workloads within a compute cluster.  Knowing
      how much memory is not really used by a workload unit may help take a more
      optimal decision when considering migrating the unit to another node
      within the cluster.
      
      Also, as noted by Minchan, this would be useful for per-process reclaim
      (https://lwn.net/Articles/545668/). With idle tracking, a smart userspace
      memory manager could reclaim only idle pages.
      
      ==== USER API ====
      
      The user API consists of two new files:
      
       * /sys/kernel/mm/page_idle/bitmap.  This file implements a bitmap where each
         bit corresponds to a page, indexed by PFN. When the bit is set, the
         corresponding page is idle. A page is considered idle if it has not been
         accessed since it was marked idle. To mark a page idle one should set the
         bit corresponding to the page by writing to the file. A value written to the
         file is OR-ed with the current bitmap value. Only user memory pages can be
         marked idle, for other page types input is silently ignored. Writing to this
         file beyond max PFN results in the ENXIO error. Only available when
         CONFIG_IDLE_PAGE_TRACKING is set.
      
         This file can be used to estimate the amount of pages that are not
         used by a particular workload as follows:
      
         1. mark all pages of interest idle by setting corresponding bits in the
            /sys/kernel/mm/page_idle/bitmap
         2. wait until the workload accesses its working set
         3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set
      
       * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
         memory cgroup each page is charged to, indexed by PFN. Only available when
         CONFIG_MEMCG is set.
      
         This file can be used to find all pages (including unmapped file pages)
         accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
         can then estimate the cgroup working set size.
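
Looking up which cgroup a page is charged to can be sketched as follows, assuming one native-endian 64-bit record per PFN as described above. The helper names are hypothetical and the demo uses an in-memory buffer with made-up inode numbers instead of the real /proc/kpagecgroup.

```python
import io
import struct

ENTRY_BYTES = 8  # one 64-bit cgroup inode number per PFN


def read_cgroup_ino(kpagecgroup, pfn):
    """Return the cgroup inode recorded for `pfn`, or None past the end."""
    kpagecgroup.seek(pfn * ENTRY_BYTES)
    raw = kpagecgroup.read(ENTRY_BYTES)
    if len(raw) < ENTRY_BYTES:
        return None
    (ino,) = struct.unpack("Q", raw)
    return ino


def pfns_charged_to(kpagecgroup, target_ino, max_pfn):
    """List the PFNs below `max_pfn` that are charged to `target_ino`."""
    return [
        pfn
        for pfn in range(max_pfn)
        if read_cgroup_ino(kpagecgroup, pfn) == target_ino
    ]


def demo():
    # Four fake pages: PFNs 1 and 2 belong to the cgroup with inode 42.
    fake = io.BytesIO(struct.pack("4Q", 11, 42, 42, 11))
    return pfns_charged_to(fake, 42, 4), read_cgroup_ino(fake, 4)
```

The resulting PFN list can then be used to index the idle bitmap, which is how the two files combine into a per-cgroup estimate.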
      
      For an example of using these files for estimating the amount of unused
      memory pages per each memory cgroup, please see the script attached
      below.
      
      ==== REASONING ====
      
      The reason to introduce the new user API instead of using
      /proc/PID/{clear_refs,smaps} is that the latter has two serious
      drawbacks:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      The new API attempts to overcome them both. For more details on how it
      is achieved, please see the comment to patch 6.
      
      ==== PATCHSET STRUCTURE ====
      
      The patch set is organized as follows:
      
       - patch 1 adds page_cgroup_ino() helper for the sake of
         /proc/kpagecgroup and patches 2-3 do related cleanup
       - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
         charged to
       - patch 5 introduces a new mmu notifier callback, clear_young, which is
         a lightweight version of clear_flush_young; it is used in patch 6
       - patch 6 implements the idle page tracking feature, including the
         userspace API, /sys/kernel/mm/page_idle/bitmap
       - patch 7 exports idle flag via /proc/kpageflags
      
      ==== SIMILAR WORKS ====
      
      Originally, the patch for tracking idle memory was proposed back in 2011
      by Michel Lespinasse (see http://lwn.net/Articles/459269/).  The main
      difference between Michel's patch and this one is that Michel implemented
      a kernel space daemon for estimating idle memory size per cgroup while
      this patch only provides the userspace with the minimal API for doing the
      job, leaving the rest up to the userspace.  However, they both share the
      same idea of Idle/Young page flags to avoid affecting the reclaimer logic.
      
      ==== PERFORMANCE EVALUATION ====
      
      SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
      performance impact introduced by this patch set.  Three runs were carried
      out:
      
       - base: kernel without the patch
       - patched: patched kernel, the feature is not used
       - patched-active: patched kernel, 1 minute-period daemon is used for
         tracking idle memory
      
      For tracking idle memory, idlememstat utility was used:
      https://github.com/locker/idlememstat
      
      testcase            base            patched        patched-active
      
      compiler       537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
      compress       305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
      crypto         284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
      derby          411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
      mpegaudio      189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
      scimark.large   46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
      scimark.small  412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
      serial         204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
      startup         36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
      sunflow        115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
      xml            620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%
      
      composite      211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%
      
      time idlememstat:
      
      17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
      448inputs+40outputs (1major+36052minor)pagefaults 0swaps
      
      ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ====
      #! /usr/bin/python
      #
      
      import os
      import stat
      import errno
      import struct
      
      CGROUP_MOUNT = "/sys/fs/cgroup/memory"
      BUFSIZE = 8 * 1024  # must be multiple of 8
      
      def get_hugepage_size():
          with open("/proc/meminfo", "r") as f:
              for s in f:
                  k, v = s.split(":")
                  if k == "Hugepagesize":
                      return int(v.split()[0]) * 1024
      
      PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
      HUGEPAGE_SIZE = get_hugepage_size()
      
      def set_idle():
          f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE)
          while True:
              try:
                  f.write(struct.pack("Q", pow(2, 64) - 1))
              except IOError as err:
                  if err.errno == errno.ENXIO:
                      break
                  raise
          f.close()
      
      def count_idle():
          f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
          f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)
      
          with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f:
              while f.read(BUFSIZE): pass  # update idle flag
      
          idlememsz = {}
          while True:
              s1, s2 = f_flags.read(8), f_cgroup.read(8)
              if not s1 or not s2:
                  break
      
              flags, = struct.unpack('Q', s1)
              cgino, = struct.unpack('Q', s2)
      
              unevictable = (flags >> 18) & 1
              huge = (flags >> 22) & 1
              idle = (flags >> 25) & 1
      
              if idle and not unevictable:
                  idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                      (HUGEPAGE_SIZE if huge else PAGE_SIZE)
      
          f_flags.close()
          f_cgroup.close()
          return idlememsz
      
      if __name__ == "__main__":
          print "Setting the idle flag for each page..."
          set_idle()
      
          raw_input("Wait until the workload accesses its working set, "
                    "then press Enter")
      
          print "Counting idle pages..."
          idlememsz = count_idle()
      
          for dir, subdirs, files in os.walk(CGROUP_MOUNT):
              ino = os.stat(dir)[stat.ST_INO]
              print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
      ==== END SCRIPT ====
      
      This patch (of 8):
      
      Add page_cgroup_ino() helper to memcg.
      
      This function returns the inode number of the closest online ancestor of
      the memory cgroup a page is charged to.  It is required for exporting
      information about which page is charged to which cgroup to userspace,
      which will be introduced by a following patch.
      
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2fc04524
    • D
      zswap: update docs for runtime-changeable attributes · 9c4c5ef3
      Dan Streetman committed
      Change the Documentation/vm/zswap.txt doc to indicate that the "zpool" and
      "compressor" params are now changeable at runtime.
      
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c4c5ef3
    • D
      zswap: change zpool/compressor at runtime · 90b0fc26
      Dan Streetman committed
      Update the zpool and compressor parameters to be changeable at runtime.
      When changed, a new pool is created with the requested zpool/compressor,
      and added as the current pool at the front of the pool list.  Previous
      pools remain in the list only so that existing compressed pages can be
      removed from them.  The old pool(s) are removed once they become empty.
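
The pool list behaviour described above can be modelled in a few lines. This is a toy sketch, not the zswap implementation: ToyPool and ToyZswap are hypothetical names, entries are plain keys rather than compressed pages, and the zpool and compressor strings are only labels.

```python
class ToyPool:
    """One zpool/compressor pair plus the entries stored through it."""

    def __init__(self, zpool, compressor):
        self.zpool = zpool
        self.compressor = compressor
        self.entries = set()


class ToyZswap:
    """Toy model: the current pool is always at the front of the list."""

    def __init__(self, zpool, compressor):
        self.pools = [ToyPool(zpool, compressor)]

    @property
    def current(self):
        return self.pools[0]

    def change_params(self, zpool, compressor):
        """Create a new pool and make it current; keep old pools for now."""
        old = self.current
        self.pools.insert(0, ToyPool(zpool, compressor))
        self._drop_if_empty(old)

    def store(self, key):
        """New entries always go to the current pool."""
        self.current.entries.add(key)

    def invalidate(self, key):
        """Remove an entry from whichever pool holds it."""
        for pool in self.pools:
            if key in pool.entries:
                pool.entries.remove(key)
                self._drop_if_empty(pool)
                return

    def _drop_if_empty(self, pool):
        """Old pools go away once they are empty; the current one stays."""
        if pool is not self.current and not pool.entries:
            self.pools.remove(pool)
```

For example, after storing one entry, switching parameters, and then invalidating that entry, only the new pool remains in the list.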
      
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90b0fc26
    • D
      zswap: dynamic pool creation · f1c54846
      Dan Streetman committed
      Add dynamic creation of pools.  Move the static crypto compression per-cpu
      transforms into each pool.  Add to zswap_entry a pointer to the pool it's
      in.
      
      This is required by the following patch which enables changing the zswap
      zpool and compressor params at runtime.
      
      [akpm@linux-foundation.org: fix merge snafus]
      
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1c54846
    • D
      zpool: add zpool_has_pool() · 3f0e1312
      Dan Streetman committed
      This series makes creation of the zpool and compressor dynamic, so that
      they can be changed at runtime.  This makes using/configuring zswap
      easier, as before this zswap had to be configured at boot time, using boot
      params.
      
      This uses a single list to track both the zpool and compressor together,
      although Seth had mentioned an alternative which is to track the zpools
      and compressors using separate lists.  In the most common case, only a
      single zpool and single compressor, using one list is slightly simpler
      than using two lists, and for the uncommon case of multiple zpools and/or
      compressors, using one list is slightly less simple (and uses slightly
      more memory, probably) than using two lists.
      
      This patch (of 4):
      
      Add zpool_has_pool() function, indicating if the specified type of zpool
      is available (i.e.  zsmalloc or zbud).  This allows checking if a pool is
      available, without actually trying to allocate it, similar to
      crypto_has_alg().
      
      This is used by a following patch to zswap that enables the dynamic
      runtime creation of zswap zpools.
      
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f0e1312
  2. 09 Sep, 2015 25 commits
    • L
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma · 26d2177e
      Linus Torvalds committed
      Pull infiniband/rdma updates from Doug Ledford:
       "This is a fairly sizeable set of changes.  I've put them through a
        decent amount of testing prior to sending the pull request due to
        that.
      
        There are still a few fixups that I know are coming, but I wanted to
        go ahead and get the big, sizable chunk into your hands sooner rather
        than waiting for those last few fixups.
      
        Of note is the fact that this creates what is intended to be a
        temporary area in the drivers/staging tree specifically for some
        cleanups and additions that are coming for the RDMA stack.  We
        deprecated two drivers (ipath and amso1100) and are waiting to hear
        back if we can deprecate another one (ehca).  We also put Intel's new
        hfi1 driver into this area because it needs to be refactored and a
        transfer library created out of the factored out code, and then it and
        the qib driver and the soft-roce driver should all be modified to use
        that library.
      
        I expect drivers/staging/rdma to be around for three or four kernel
        releases and then to go away as all of the work is completed and final
        deletions of deprecated drivers are done.
      
        Summary of changes for 4.3:
      
         - Create drivers/staging/rdma
         - Move amso1100 driver to staging/rdma and schedule for deletion
         - Move ipath driver to staging/rdma and schedule for deletion
         - Add hfi1 driver to staging/rdma and set TODO for move to regular
           tree
         - Initial support for namespaces to be used on RDMA devices
         - Add RoCE GID table handling to the RDMA core caching code
         - Infrastructure to support handling of devices with differing read
           and write scatter gather capabilities
         - Various iSER updates
         - Kill off unsafe usage of global mr registrations
         - Update SRP driver
         - Misc mlx4 driver updates
         - Support for the mr_alloc verb
         - Support for a netlink interface between kernel and user space cache
           daemon to speed path record queries and route resolution
         - Initial support for safe hot removal of verbs devices"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
        IB/ipoib: Suppress warning for send only join failures
        IB/ipoib: Clean up send-only multicast joins
        IB/srp: Fix possible protection fault
        IB/core: Move SM class defines from ib_mad.h to ib_smi.h
        IB/core: Remove unnecessary defines from ib_mad.h
        IB/hfi1: Add PSM2 user space header to header_install
        IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
        mlx5: Fix incorrect wc pkey_index assignment for GSI messages
        IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
        IB/uverbs: reject invalid or unknown opcodes
        IB/cxgb4: Fix if statement in pick_local_ip6adddrs
        IB/sa: Fix rdma netlink message flags
        IB/ucma: HW Device hot-removal support
        IB/mlx4_ib: Disassociate support
        IB/uverbs: Enable device removal when there are active user space applications
        IB/uverbs: Explicitly pass ib_dev to uverbs commands
        IB/uverbs: Fix race between ib_uverbs_open and remove_one
        IB/uverbs: Fix reference counting usage of event files
        IB/core: Make ib_dealloc_pd return void
        IB/srp: Create an insecure all physical rkey only if needed
        ...
      26d2177e
    • L
      Merge tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi · a794b4f3
      Linus Torvalds committed
      Pull IPMI updates from Corey Minyard:
       "Most of these have been sitting in linux-next for more than a release,
        particularly commit 0fbcf4af ("ipmi: Convert the IPMI SI ACPI
        handling to a platform device") which is probably the most complex
        patch.
      
        That is also the one that changes drivers/acpi/acpi_pnp.c.  The change
        in that file is only removing IPMI from a "special platform devices"
        list, since I convert it to the standard PNP interface.  I posted this
        one to the ACPI list twice and got no response, and it seems to work
        well in my testing, so I'm hoping it's good.
      
        Hidehiro Kawai posted a set of changes that improves the panic time
        handling in the IPMI driver.
      
        The rest of the changes are minor bug fixes or cleanups and some
        documentation"
      
      * tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi:
        ipmi:ssif: Add a module parm to specify that SMBus alerts don't work
        ipmi: add of_device_id in MODULE_DEVICE_TABLE
        ipmi: Compensate for BMCs that wont set the irq enable bit
        ipmi: Don't call receive handler in the panic context
        ipmi: Avoid touching possible corrupted lists in the panic context
        ipmi: Don't flush messages in sender() in run-to-completion mode
        ipmi: Factor out message flushing procedure
        ipmi: Remove unneeded set_run_to_completion call
        ipmi: Make some data const that was only read
        ipmi: constify SSIF ACPI device ids
        ipmi: Delete an unnecessary check before the function call "cleanup_one_si"
        char:ipmi - Change 1 to true for bool type variables during initialization.
        impi:Remove unneeded setting of module owner to THIS_MODULE in the platform structure, powernv_ipmi_driver
        ipmi: Add a comment in how messages are delivered from the lower layer
        ipmi/powernv: Fix potential invalid pointer dereference
        ipmi: Convert the IPMI SI ACPI handling to a platform device
        ipmi: Add device tree bindings information
      a794b4f3
    • L
      Merge branch 'akpm' (patches from Andrew) · f6f7a636
      Linus Torvalds committed
      Merge second patch-bomb from Andrew Morton:
       "Almost all of the rest of MM.  There was an unusually large amount of
        MM material this time"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (141 commits)
        zpool: remove no-op module init/exit
        mm: zbud: constify the zbud_ops
        mm: zpool: constify the zpool_ops
        mm: swap: zswap: maybe_preload & refactoring
        zram: unify error reporting
        zsmalloc: remove null check from destroy_handle_cache()
        zsmalloc: do not take class lock in zs_shrinker_count()
        zsmalloc: use class->pages_per_zspage
        zsmalloc: consider ZS_ALMOST_FULL as migrate source
        zsmalloc: partial page ordering within a fullness_list
        zsmalloc: use shrinker to trigger auto-compaction
        zsmalloc: account the number of compacted pages
        zsmalloc/zram: introduce zs_pool_stats api
        zsmalloc: cosmetic compaction code adjustments
        zsmalloc: introduce zs_can_compact() function
        zsmalloc: always keep per-class stats
        zsmalloc: drop unused variable `nr_to_migrate'
        mm/memblock.c: fix comment in __next_mem_range()
        mm/page_alloc.c: fix type information of memoryless node
        memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
        ...
      f6f7a636
    • L
      Merge branch 'parisc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 839fe915
      Linus Torvalds committed
      Pull parisc updates from Helge Deller:
       "The most important changes in this patchset are:
      
         - re-enable 64bit PCI bus addresses which were temporarily disabled
           for PA-RISC in kernel 4.2
      
         - fix the 64bit CAS operation in the LWS path, which now lets us
           enable the 64bit gcc atomic builtins even on 32bit userspace with
           a 64bit kernel

         - fix a long-standing bug which sometimes crashed the kernel at bootup
           while the serial interrupt wasn't registered yet"
      
      * 'parisc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Use platform_device_register_simple("rtc-generic")
        parisc: Drop CONFIG_SMP around update_cr16_clocksource()
        parisc: Use double word condition in 64bit CAS operation
        parisc: Filter out spurious interrupts in PA-RISC irq handler
        parisc: Additionally check for in_atomic() in page fault handler
        PCI,parisc: Enable 64-bit bus addresses on PA-RISC
        parisc: Define ioremap_uc and ioremap_wc
      839fe915
    • L
      Merge tag 'linux-kselftest-4.3-rc1' of... · 54283aed
      Linus Torvalds committed
      Merge tag 'linux-kselftest-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull kselftest update from Shuah Khan:
       "This update adds new zram test and fixes to problems found during
        testing this new zram test.  In addition, there are a few bug fixes
        and kselftest improvement patches from Linaro developers.
      
        I will send another update later on this week to fix kselftest
        breakage due to commit 2bf9e0ab ("locking/static_keys: Provide a
        selftest") after the fix soaks in next for a couple of days"
      
      * tag 'linux-kselftest-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests/zram: Makefile fix
        selftests/zram: must be run as root
        selftests: breakpoints: fix installing error on the architecture except x86
        selftests: check before install
        selftests/zram: Adding zram tests
      54283aed
    • L
      Merge tag 'iommu-updates-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 9a9952bb
      Linus Torvalds committed
      Pull iommu updates from Joerg Roedel:
       "This time the IOMMU updates are mostly cleanups or fixes.  No big new
        features or drivers this time.  In particular the changes include:
      
         - Bigger cleanup of the Domain<->IOMMU data structures and the code
           that manages them in the Intel VT-d driver.  This makes the code
           easier to understand and maintain, and also easier to keep the data
           structures in sync.  It is also a preparation step to make use of
           default domains from the IOMMU core in the Intel VT-d driver.
      
         - Fixes for a couple of DMA-API misuses in ARM IOMMU drivers, namely
           in the ARM and Tegra SMMU drivers.
      
         - Fix for a potential buffer overflow in the OMAP iommu driver's
           debug code
      
         - A couple of smaller fixes and cleanups in various drivers
      
         - One small new feature: Report domain-id usage in the Intel VT-d
           driver to make it easier to detect bugs where these are leaked"
      
      * tag 'iommu-updates-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (83 commits)
        iommu/vt-d: Really use upper context table when necessary
        x86/vt-d: Fix documentation of DRHD
        iommu/fsl: Really fix init section(s) content
        iommu/io-pgtable-arm: Unmap and free table when overwriting with block
        iommu/io-pgtable-arm: Move init-fn declarations to io-pgtable.h
        iommu/msm: Use BUG_ON instead of if () BUG()
        iommu/vt-d: Access iomem correctly
        iommu/vt-d: Make two functions static
        iommu/vt-d: Use BUG_ON instead of if () BUG()
        iommu/vt-d: Return false instead of 0 in irq_remapping_cap()
        iommu/amd: Use BUG_ON instead of if () BUG()
        iommu/amd: Make a symbol static
        iommu/amd: Simplify allocation in irq_remapping_alloc()
        iommu/tegra-smmu: Parameterize number of TLB lines
        iommu/tegra-smmu: Factor out tegra_smmu_set_pde()
        iommu/tegra-smmu: Extract tegra_smmu_pte_get_use()
        iommu/tegra-smmu: Use __GFP_ZERO to allocate zeroed pages
        iommu/tegra-smmu: Remove PageReserved manipulation
        iommu/tegra-smmu: Convert to use DMA API
        iommu/tegra-smmu: smmu_flush_ptc() wants device addresses
        ...
      9a9952bb
    • L
      Merge tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap · e81b594c
      Linus Torvalds committed
      Pull regmap updates from Mark Brown:
       "This has been a busy release for regmap.
      
        By far the biggest set of changes here are those from Markus Pargmann
        which implement support for block transfers in smbus devices.  This
        required quite a bit of refactoring but leaves us better able to
        handle odd restrictions that controllers may have and with better
        performance on smbus.
      
        Other new features include:
      
         - Fix interactions with lockdep for nested regmaps (e.g., when a device
           using regmap is connected to a bus where the bus controller has a
           separate regmap).  Lockdep's default class identification is too
           crude to work without help.
      
         - Support for must write bitfield operations, useful for operations
           which require writing a bit to trigger them from Kuninori Morimoto.
      
         - Support for delaying during register patch application from Nariman
           Poushin.
      
         - Support for overriding cache state via the debugfs implementation
           from Richard Fitzgerald"
      
      * tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (25 commits)
        regmap: fix a NULL pointer dereference in __regmap_init
        regmap: Support bulk reads for devices without raw formatting
        regmap-i2c: Add smbus i2c block support
        regmap: Add raw_write/read checks for max_raw_write/read sizes
        regmap: regmap max_raw_read/write getter functions
        regmap: Introduce max_raw_read/write for regmap_bulk_read/write
        regmap: Add missing comments about struct regmap_bus
        regmap: No multi_write support if bus->write does not exist
        regmap: Split use_single_rw internally into use_single_read/write
        regmap: Fix regmap_bulk_write for bus writes
        regmap: regmap_raw_read return error on !bus->read
        regulator: core: Print at debug level on debugfs creation failure
        regmap: Fix regmap_can_raw_write check
        regmap: fix typos in regmap.c
        regmap: Fix integertypes for register address and value
        regmap: Move documentation to regmap.h
        regmap: Use different lockdep class for each regmap init call
        thermal: sti: Add parentheses around bridge->ops->regmap_init call
        mfd: vexpress: Add parentheses around bridge->ops->regmap_init call
        regmap: debugfs: Fix misuse of IS_ENABLED
        ...
      e81b594c
    • L
      Merge tag 'fbdev-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux · fa815580
      Linus Torvalds committed
      Pull fbdev updates from Tomi Valkeinen:
       "Minor fixes and cleanups"
      
      * tag 'fbdev-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
        video: fbdev: atmel_lcdfb: remove useless include
        video: fbdev: pxa168fb: Use devm_clk_get
        fbdev: ssd1307fb: fix error return code
        fbdev: fix snprintf() limit in show_bl_curve()
        video: fbdev: s3c-fb: Constify platform_device_id
        video: fbdev: atmel: fix warning for const return value
        video: fbdev: Drop owner assignment from platform_driver
        video: fbdev: Drop owner assignment from i2c_driver
        fbdev: remove unnecessary memset in vfb
        framebuffer: disable vgacon on microblaze arch
        fbdev: udlfb: remove unneeded initialization in few places
        fbdev: Allow compile test of GPIO consumers if !GPIOLIB
        fbdev: fix cea_modes array size
      fa815580
    • L
      Merge tag 'mmc-v4.3' of git://git.linaro.org/people/ulf.hansson/mmc · 85579ad7
      Linus Torvalds committed
      Pull MMC updates from Ulf Hansson:
       "MMC core:
         - Fix a race condition in the request handling
         - Skip trim commands for some buggy Kingston eMMCs
         - An optimization and a correction for erase groups
         - Set CMD23 quirk for some Sandisk cards
      
        MMC host:
         - sdhci: Give GPIO CD higher precedence and don't poll when it's used
         - sdhci: Fix DMA memory leakage
         - sdhci: Some updates for clock management
         - sdhci-of-at91: introduce driver for the Atmel SDMMC
         - sdhci-of-arasan: Add support for sdhci-5.1
         - sdhci-esdhc-imx: Add support for imx7d which also supports HS400
         - sdhci: A collection of fixes and improvements for various sdhci hosts
         - omap_hsmmc: Modernization of the regulator code
         - dw_mmc: A couple of fixes for DMA and PIO mode
         - usdhi6rol0: A few fixes and support probe deferral for regulators
         - pxamci: Convert to use dmaengine
         - sh_mmcif: Fix the suspend process as a short-term solution
         - tmio: Adjust timeout for commands
         - sunxi: Fix timeout while gating/ungating clock"
      
      * tag 'mmc-v4.3' of git://git.linaro.org/people/ulf.hansson/mmc: (67 commits)
        mmc: android-goldfish: remove incorrect __iomem annotation
        mmc: core: fix race condition in mmc_wait_data_done
        mmc: host: omap_hsmmc: remove CONFIG_REGULATOR check
        mmc: host: omap_hsmmc: use ios->vdd for setting vmmc voltage
        mmc: host: omap_hsmmc: use regulator_is_enabled to find pbias status
        mmc: host: omap_hsmmc: enable/disable vmmc_aux regulator based on previous state
        mmc: host: omap_hsmmc: don't use ->set_power to set initial regulator state
        mmc: host: omap_hsmmc: avoid pbias regulator enable on power off
        mmc: host: omap_hsmmc: add separate function to set pbias
        mmc: host: omap_hsmmc: add separate functions for enable/disable supply
        mmc: host: omap_hsmmc: return error if any of the regulator APIs fail
        mmc: host: omap_hsmmc: remove unnecessary pbias set_voltage
        mmc: host: omap_hsmmc: use mmc_host's vmmc and vqmmc
        mmc: host: omap_hsmmc: use the ocrmask provided by the vmmc regulator
        mmc: host: omap_hsmmc: cleanup omap_hsmmc_reg_get()
        mmc: host: omap_hsmmc: return on fatal errors from omap_hsmmc_reg_get
        mmc: host: omap_hsmmc: use devm_regulator_get_optional() for vmmc
        mmc: sdhci-of-at91: fix platform_no_drv_owner.cocci warnings
        mmc: sh_mmcif: Fix suspend process
        mmc: usdhi6rol0: fix error return code
        ...
      85579ad7
    • L
      Merge tag 'platform-drivers-x86-v4.3-1' of... · 3af6e98f
      Linus Torvalds committed
      Merge tag 'platform-drivers-x86-v4.3-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
      
      Pull x86 platform driver updates from Darren Hart:
       "Significant work on toshiba_acpi, including new hardware support,
        refactoring, and cleanups.  Extend device support for asus, ideapad,
        and acer systems.  New surface pro 3 buttons driver.  Misc minor
        cleanups for thinkpad and hp-wireless.
      
        acer-wmi:
         - No rfkill on HP Omen 15 wifi
      
        thinkpad_acpi:
         - Remove side effects from vdbg_printk -> no_printk macro
      
        surface pro 3:
         - Add support driver for Surface Pro 3 buttons
      
        hp-wireless:
         - remove unneeded goto/label in hpwl_init
      
        ideapad-laptop:
         - add alternative representation for Yoga 2 to DMI table
         - Add Lenovo Yoga 3 14 to no_hw_rfkill dmi list
      
        asus-laptop:
         - Add key found on Asus F3M
      
        MAINTAINERS:
         - Remove Toshiba Linux mailing list address
      
        toshiba_acpi:
         - Bump driver version to 0.23
         - Remove unnecessary checks and returns in HCI/SCI functions
         - Refactor *{get, set} functions return value
         - Remove "*not supported" feature prints
         - Change *available functions return type
         - Add set_fan_status function
         - Change some variables to avoid warnings from ninja-check
         - Reorder toshiba_acpi_alt_keymap entries
         - Remove unused wireless defines
         - Transflective backlight updates
         - Avoid registering input device on WMI event laptops
         - Add /dev/toshiba_acpi device
         - Adapt /proc/acpi/toshiba/keys to TOS1900 devices"
      
      * tag 'platform-drivers-x86-v4.3-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86: (21 commits)
        acer-wmi: No rfkill on HP Omen 15 wifi
        thinkpad_acpi: Remove side effects from vdbg_printk -> no_printk macro
        surface pro 3: Add support driver for Surface Pro 3 buttons
        hp-wireless: remove unneeded goto/label in hpwl_init
        ideapad-laptop: add alternative representation for Yoga 2 to DMI table
        asus-laptop: Add key found on Asus F3M
        MAINTAINERS: Remove Toshiba Linux mailing list address
        ideapad-laptop: Add Lenovo Yoga 3 14 to no_hw_rfkill dmi list
        toshiba_acpi: Bump driver version to 0.23
        toshiba_acpi: Remove unnecessary checks and returns in HCI/SCI functions
        toshiba_acpi: Refactor *{get, set} functions return value
        toshiba_acpi: Remove "*not supported" feature prints
        toshiba_acpi: Change *available functions return type
        toshiba_acpi: Add set_fan_status function
        toshiba_acpi: Change some variables to avoid warnings from ninja-check
        toshiba_acpi: Reorder toshiba_acpi_alt_keymap entries
        toshiba_acpi: Remove unused wireless defines
        toshiba_acpi: Transflective backlight updates
        toshiba_acpi: Avoid registering input device on WMI event laptops
        toshiba_acpi: Add /dev/toshiba_acpi device
        ...
      3af6e98f
    • L
      Merge branch 'i2c/for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · acceba59
      Linus Torvalds committed
      Pull i2c updates from Wolfram Sang:
       "Features:
      
         - new drivers: Renesas EMEV2, register based MUX, NXP LPC2xxx
         - core: scans DT and assigns wakeup interrupts.  no driver changes needed.
         - core: some refcounting issues fixed and better API for that
         - core: new helper function for best effort block read emulation
         - slave framework: proper DT bindings and userspace instantiation
         - some bigger work for xiic, pxa, omap drivers
      
        .. and quite a number of smaller driver fixes, cleanups, improvements"
      
      * 'i2c/for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (65 commits)
        i2c: mux: reg Change ioread endianness for readback
        i2c: mux: reg: fix compilation warnings
        i2c: mux: reg: simplify register size checking
        i2c: muxes: fix leaked i2c adapter device node references
        i2c: allow specifying separate wakeup interrupt in device tree
        of/irq: export of_get_irq_byname()
        i2c: xgene-slimpro: dma_mapping_error() doesn't return an error code
        i2c: Replace I2C_CROS_EC_TUNNEL dependency
        eeprom: at24: use i2c_smbus_read_i2c_block_data_or_emulated
        i2c: core: Add support for best effort block read emulation
        i2c: lpc2k: add driver
        i2c: mux: Add register-based mux i2c-mux-reg
        i2c: dt: describe generic bindings
        i2c: slave: print warning if slave flag not set
        i2c: support 10 bit and slave addresses in sysfs 'new_device'
        i2c: take address space into account when checking for used addresses
        i2c: apply DT flags when probing
        i2c: make address check indpendent from client struct
        i2c: rename address check functions
        i2c: apply address offset for slaves, too
        ...
      acceba59
    • L
      Merge tag 'rtc-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux · c1917615
      Linus Torvalds committed
      Pull RTC updates from Alexandre Belloni:
       "Core:
         - use is_visible() to control sysfs attributes
         - switch wakealarm attribute to DEVICE_ATTR_RW
         - make rtc_does_wakealarm() return boolean
         - properly manage lifetime of dev and cdev in rtc device
         - remove unnecessary device_get() in rtc_device_unregister
         - fix double free in rtc_register_device() error path
      
        New drivers:
         - NXP LPC24xx
         - Xilinx Zynq MP
         - Dialog DA9062
      
        Subsystem wide cleanups:
         - fix drivers that consider 0 as a valid IRQ in client->irq
         - Drop (un)likely before IS_ERR(_OR_NULL)
         - drop the remaining owner assignment for i2c_driver and
           platform_driver
         - module autoload fixes
      
        Drivers:
         - 88pm80x: add device tree support
         - abx80x: fix RTC write bit
         - ab8500: Add a sentinel to ab85xx_rtc_ids[]
         - armada38x: Align RTC set time procedure with the official errata
         - as3722: correct month value
         - at91sam9: cleanups
         - at91rm9200: get and use slow clock and cleanups
         - bq32k: remove redundant check
         - cmos: century support, proper fix for the spurious wakeup
         - ds1307: cleanups and wakeup irq support
         - ds1374: Remove unused variable
         - ds1685: Use module_platform_driver
         - ds3232: fix WARNING trace in resume function
         - gemini: fix ptr_ret.cocci warnings
         - mt6397: implement suspend/resume
         - omap: support internal and external clock enabling
         - opal: Enable alarms only when opal supports tpo
         - pcf2127: use OFS flag to detect unreliable date and warn the user
         - pl031: fix typo for author email
         - rx8025: huge cleanup and fixes
         - sa1100/pxa: share common code
         - s5m: fix to update ctrl register
         - s3c: fix clocks and wakeup, cleanup
         - sirfsoc: use regmap
         - nvram_read()/nvram_write() functions for cmos, ds1305, ds1307,
           ds1343, ds1511, ds1553, ds1742, m48t59, rp5c01, stk17ta8, tx4939
         - use rtc_valid_tm() error code when reading date/time instead of 0
           for isl12022, pcf2123, pcf2127"
      
      * tag 'rtc-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (90 commits)
        rtc: abx80x: fix RTC write bit
        rtc: ab8500: Add a sentinel to ab85xx_rtc_ids[]
        rtc: ds1374: Remove unused variable
        rtc: Fix module autoload for OF platform drivers
        rtc: Fix module autoload for rtc-{ab8500,max8997,s5m} drivers
        rtc: omap: Add external clock enabling support
        rtc: omap: Add internal clock enabling support
        ARM: dts: AM437x: Add the internal and external clock nodes for rtc
        rtc: s5m: fix to update ctrl register
        rtc: add xilinx zynqmp rtc driver
        devicetree: bindings: rtc: add bindings for xilinx zynqmp rtc
        rtc: as3722: correct month value
        ARM: config: Switch PXA27x platforms to use PXA RTC driver
        ARM: mmp: remove unused RTC register definitions
        ARM: sa1100: remove unused RTC register definitions
        rtc: sa1100/pxa: convert to run-time register mapping
        ARM: pxa: add memory resource to SA1100 RTC device
        rtc: pxa: convert to use shared sa1100 functions
        rtc: sa1100: prepare to share sa1100_rtc_ops
        rtc: ds3232: fix WARNING trace in resume function
        ...
      c1917615
    • D
      zpool: remove no-op module init/exit · df69f52d
      Dan Streetman committed
      Remove zpool_init() and zpool_exit(); they do nothing other than print
      "loaded" and "unloaded".
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df69f52d
    • K
      mm: zbud: constify the zbud_ops · c83db4f4
      Krzysztof Kozlowski committed
      The structure zbud_ops is not modified so make the pointer to it a
      pointer to const.
      Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
      Acked-by: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c83db4f4
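The constification described above (an ops structure that is never modified, reached through a pointer to const) can be sketched as follows. This is a self-contained userspace illustration, not the zbud code: `pool_ops`, `default_pool_ops`, `pool`, `pool_evict()` and `pool_ops_demo()` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback table, loosely modelled on an ops structure. */
struct pool_ops {
    int (*evict)(unsigned long handle);
};

/* Hypothetical callback that reports success for every handle. */
static int evict_always_ok(unsigned long handle)
{
    (void)handle;
    return 0;
}

/* The table is never written after initialization, so it is const. */
const struct pool_ops default_pool_ops = {
    .evict = evict_always_ok,
};

struct pool {
    /* Pointer to const: the table cannot be modified through it. */
    const struct pool_ops *ops;
};

/* Call the eviction callback if one is set; return -1 otherwise. */
int pool_evict(const struct pool *pool, unsigned long handle)
{
    if (pool->ops == NULL || pool->ops->evict == NULL)
        return -1;
    return pool->ops->evict(handle);
}

/* Usage example. */
void pool_ops_demo(void)
{
    struct pool with_ops = { .ops = &default_pool_ops };
    struct pool without_ops = { .ops = NULL };

    assert(pool_evict(&with_ops, 42UL) == 0);
    assert(pool_evict(&without_ops, 42UL) == -1);
}
```

With the pointer declared this way, an accidental write such as `pool->ops->evict = NULL;` is rejected at compile time, and the table itself can be placed in read-only data.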
    • K
      mm: zpool: constify the zpool_ops · 78672779
      Krzysztof Kozlowski committed
      The structure zpool_ops is not modified so make the pointer to it a
      pointer to const.
      Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
      Acked-by: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      78672779
    • D
      mm: swap: zswap: maybe_preload & refactoring · 5b999aad
      Dmitry Safonov committed
      zswap_get_swap_cache_page and read_swap_cache_async have pretty much the
      same code; the only significant differences are the return value and the
      use of swap_readpage.
      
      Introduce a helper __read_swap_cache_async() with the common code.  Behavior
      change: zswap_get_swap_cache_page will now use radix_tree_maybe_preload
      instead of radix_tree_preload.  It looks like this had not been changed
      before only because of the code duplication.
      Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b999aad
    • S
      zram: unify error reporting · 70864969
      Sergey Senozhatsky committed
      Make zram syslog error reporting more consistent. We have random
      error levels in some places. For example, critical errors like
        "Error allocating memory for compressed page"
      and
        "Unable to allocate temp memory"
      are reported as KERN_INFO messages.
      
      a) Reassign error levels
      
      Error messages that directly affect zram
      functionality -- pr_err():
      
       Error allocating zram address table
       Error creating memory pool
       Decompression failed! err=%d, page=%u
       Unable to allocate temp memory
       Compression failed! err=%d
       Error allocating memory for compressed page: %u, size=%zu
       Cannot initialise %s compressing backend
       Error allocating disk queue for device %d
       Error allocating disk structure for device %d
       Error creating sysfs group for device %d
       Unable to register zram-control class
       Unable to get major number
      
      Messages that do not affect functionality, but about which the user
      must be warned (because sysfs attrs will be removed in
      this particular case) -- pr_warn():
      
       %d (%s) Attribute %s (and others) will be removed. %s
      
      Messages that do not affect functionality and mostly are
      informative -- pr_info():
      
       Cannot change max compression streams
       Can't change algorithm for initialized device
       Cannot change disksize for initialized device
       Added device: %s
       Removed device: %s
      
      b) Update sysfs_create_group() error message
      
      First, it lacks a trailing newline; add it.  Second, every error message
      in zram_add() has a "for device %d" part, which makes errors more
      informative.  Add the missing part to the "Error creating sysfs group" message.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70864969
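The severity split described in the zram commit above (errors that affect functionality, warnings the user must see, purely informative messages) and the "for device %d" plus trailing newline convention can be sketched in userspace C. This is a hedged illustration only: `severity`, `format_device_report()` and `report_demo()` are hypothetical names, the numeric values are the conventional syslog ones, and the kernel's pr_err()/pr_warn()/pr_info() macros are not used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Severities with the numeric values used by syslog. */
enum severity {
    SEV_ERR = 3,   /* failure that directly affects functionality */
    SEV_WARN = 4,  /* no functional impact, but the user must be warned */
    SEV_INFO = 6   /* informative only */
};

/* Hypothetical helper: format a per-device report into buf.
 * Every message names the device and ends with a newline.
 * Returns the snprintf() result. */
int format_device_report(char *buf, size_t len, enum severity sev,
                         const char *text, int device_id)
{
    return snprintf(buf, len, "<%d>%s for device %d\n",
                    (int)sev, text, device_id);
}

/* Usage example. */
void report_demo(void)
{
    char buf[128];
    int n = format_device_report(buf, sizeof(buf), SEV_ERR,
                                 "Error creating sysfs group", 0);

    assert(n > 0 && (size_t)n < sizeof(buf));
    assert(buf[1] == '3');       /* error severity */
    assert(buf[n - 1] == '\n');  /* message ends with a newline */
}
```

Routing every message through one helper is a design choice of this sketch; it makes it harder to forget the device suffix or the newline on a single call site.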
    • S
      zsmalloc: remove null check from destroy_handle_cache() · cd10add0
      Sergey Senozhatsky committed
      We can pass a NULL cache pointer to kmem_cache_destroy(), because it
      NULL-checks its argument now.  Remove redundant test from
      destroy_handle_cache().
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd10add0
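
      The cleanup described in this commit can be sketched in userspace C.
      Everything below is an assumption made for illustration: `struct cache',
      cache_create(), cache_destroy() and destroy_handle_cache_sketch() are
      hypothetical stand-ins, not the kernel's kmem_cache API.  The point is
      only that a destroy routine which tolerates NULL lets callers drop
      their own NULL test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a slab cache; not the kernel's struct kmem_cache. */
struct cache {
    size_t object_size;
};

/* Counts caches that were really released; used for illustration only. */
static int caches_destroyed;

struct cache *cache_create(size_t object_size)
{
    struct cache *c;

    assert(object_size > 0);
    c = malloc(sizeof(*c));
    if (c)
        c->object_size = object_size;
    return c;
}

/* Like free(), the destroy routine accepts NULL and does nothing. */
void cache_destroy(struct cache *c)
{
    if (!c)
        return;
    free(c);
    caches_destroyed++;
}

/* Caller side after the cleanup: no "if (handle_cache)" test is needed. */
void destroy_handle_cache_sketch(struct cache *handle_cache)
{
    cache_destroy(handle_cache);
}
```

      Calling destroy_handle_cache_sketch(NULL) is a no-op in this sketch,
      which is the property the commit relies on.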
    • S
      zsmalloc: do not take class lock in zs_shrinker_count() · b3e237f1
      Sergey Senozhatsky committed
      We can avoid taking class ->lock around zs_can_compact() in
      zs_shrinker_count(), because the number that we return is, in the
      general case, outdated by design.  Several sources can change a
      class's state right after we return from zs_can_compact() --
      ongoing I/O operations, manually triggered compaction, or both
      happening simultaneously.
      
      We redo these calculations during compaction on a per-class basis
      anyway.
      
      zs_unregister_shrinker() will not return while we have an active
      shrinker, so classes won't unexpectedly disappear while
      zs_shrinker_count() iterates over them.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3e237f1
    • M
      zsmalloc: use class->pages_per_zspage · 6cbf16b3
      Minchan Kim committed
      There is no need to recalculate pages_per_zspage at runtime.  Just use
      class->pages_per_zspage to avoid unnecessary runtime overhead.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cbf16b3
    • M
      zsmalloc: consider ZS_ALMOST_FULL as migrate source · ad9d5e17
      Minchan Kim committed
      There is no reason not to select a ZS_ALMOST_FULL zspage as the
      migration source if we cannot find a source in ZS_ALMOST_EMPTY.
      
      With this patch, zs_can_compact() will return a more exact result.
      Signed-off-by: Minchan Kim <minchan.kim@lge.com>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad9d5e17
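
      The fallback described in this commit can be sketched as follows.
      This is a simplified userspace model, not the zsmalloc code: the
      types `struct zspage_src' and `struct class_src', the enum values and
      isolate_source_sketch() are hypothetical names, and each fullness
      list is reduced to a single slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fullness groups that may serve as a migration source. */
enum fg_sketch { FG_ALMOST_FULL = 0, FG_ALMOST_EMPTY = 1, FG_NR = 2 };

/* Hypothetical, simplified zspage. */
struct zspage_src {
    int inuse;
};

/* Hypothetical size class: one slot per fullness group instead of a list. */
struct class_src {
    struct zspage_src *fullness_list[FG_NR];
};

/*
 * Pick a migration source: prefer an almost-empty zspage, and fall back
 * to an almost-full one when no almost-empty zspage exists.
 */
struct zspage_src *isolate_source_sketch(struct class_src *cls)
{
    assert(cls != NULL);

    for (int i = FG_ALMOST_EMPTY; i >= FG_ALMOST_FULL; i--) {
        struct zspage_src *page = cls->fullness_list[i];

        if (!page)
            continue;
        cls->fullness_list[i] = NULL;   /* take it off its list */
        return page;
    }
    return NULL;                        /* nothing left to migrate from */
}
```

      In this model the almost-full zspage is returned only once the
      almost-empty slot has been drained.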
    • S
      zsmalloc: partial page ordering within a fullness_list · 58f17117
      Sergey Senozhatsky committed
      We want to see more ZS_FULL pages and fewer ZS_ALMOST_{FULL, EMPTY}
      pages.  Put a page with a higher ->inuse count first within its
      ->fullness_list, which gives us a better chance to fill up this page
      with new objects (find_get_zspage() returns the ->fullness_list head for
      new object allocation), so some zspages will become
      ZS_ALMOST_FULL/ZS_FULL quicker.
      
      It performs a trivial and cheap ->inuse compare which does not slow down
      zsmalloc and in the worst case keeps the list pages in no particular
      order.
      
      A more expensive solution could sort fullness_list by ->inuse count.
      
      [minchan@kernel.org: code adjustments]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58f17117
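
      A minimal userspace sketch of the single-compare insertion described
      in this commit is shown below.  It assumes a singly linked list and
      hypothetical names (`struct zspage_sketch', insert_partial()); the
      kernel code uses its own list primitives and page fields.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified zspage: only the fields this sketch needs. */
struct zspage_sketch {
    int inuse;                      /* number of allocated objects */
    struct zspage_sketch *next;
};

/*
 * Insert page into the list and return the new head.
 * Only one comparison against the current head is made, so the list is
 * partially ordered: at insertion time the head never has a lower
 * ->inuse than the inserted page, but the rest of the list stays in no
 * particular order.
 */
struct zspage_sketch *insert_partial(struct zspage_sketch *head,
                                     struct zspage_sketch *page)
{
    assert(page != NULL);

    if (!head) {
        page->next = NULL;
        return page;
    }
    if (page->inuse >= head->inuse) {
        /* Fuller page goes first, so new objects land in it. */
        page->next = head;
        return page;
    }
    /* Emptier page goes right behind the head. */
    page->next = head->next;
    head->next = page;
    return head;
}
```

      The cost is one integer compare per insertion, which matches the
      "trivial and cheap" claim above; a fully sorted list would need a
      walk over the list instead.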
    • S
      zsmalloc: use shrinker to trigger auto-compaction · ab9d306d
      Sergey Senozhatsky committed
      Perform automatic pool compaction by a shrinker when the system is
      getting tight on memory.
      
      User space has very little knowledge of zsmalloc fragmentation and
      basically has no mechanism to tell whether compaction will result in
      any memory gain.  Another issue is that user space is not always aware
      that the system is getting tight on memory.  This leads to very
      uncomfortable scenarios in which user space may start issuing
      compaction 'randomly' or from crontab (for example).  Fragmentation is
      not necessarily bad: allocated and unused objects, after all, may be
      filled with data later, without the need to allocate a new zspage.  On
      the other hand, we obviously don't want to waste memory when the system
      needs it.
      
      Compaction now has a relatively quick pool scan, so we can easily
      estimate the number of pages that will be freed, which makes it
      possible to call this function from a shrinker->count_objects()
      callback.  We also abort compaction as soon as we detect that we can't
      free any more pages, preventing wasteful object migrations.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab9d306d
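
      The estimate that a count_objects()-style callback reports could look
      roughly like the sketch below.  The commit message does not spell out
      the formula, so the arithmetic here is an assumption: unused object
      slots are converted into whole zspages, and those into pages.  All
      type, field and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical per-class counters; field names are illustrative. */
struct class_sketch {
    unsigned long obj_allocated;    /* object slots in all zspages */
    unsigned long obj_used;         /* slots that hold live objects */
    unsigned long objs_per_zspage;  /* slots in one zspage */
    unsigned long pages_per_zspage; /* pages backing one zspage */
};

/* Estimate how many pages compaction could free in one class. */
unsigned long can_compact_sketch(const struct class_sketch *c)
{
    unsigned long wasted, zspages;

    assert(c->objs_per_zspage > 0);
    assert(c->obj_used <= c->obj_allocated);

    wasted = c->obj_allocated - c->obj_used;
    zspages = wasted / c->objs_per_zspage;  /* only whole zspages count */
    return zspages * c->pages_per_zspage;
}

/* What a count callback could report: the sum over all classes. */
unsigned long shrinker_count_sketch(const struct class_sketch *classes, int n)
{
    unsigned long total = 0;

    for (int i = 0; i < n; i++)
        total += can_compact_sketch(&classes[i]);
    return total;
}
```

      Because the result is only an estimate, a scan callback built on top
      of it would still need to stop as soon as no further page can be
      freed, as the message above notes.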
    • S
      zsmalloc: account the number of compacted pages · 860c707d
      Sergey Senozhatsky committed
      Compaction returns to zram the number of migrated objects, which is
      quite uninformative -- objects have different sizes, so user space
      cannot obtain any valuable data from that number.  Change compaction to
      operate in terms of pages and return to the compaction issuer the
      number of pages that were freed during compaction.  So from now on we
      will export a more meaningful value in zram<id>/mm_stat -- the number
      of freed (compacted) pages.
      
      This requires:
       (a) a rename of `num_migrated' to `pages_compacted'
       (b) an internal API change -- return first_page's fullness_group from
           putback_zspage(), so we know when putback_zspage() did
           free_zspage().  It helps us to account compaction stats correctly.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      860c707d
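
      The accounting change in (b) can be illustrated with a small
      userspace model.  The names below (`enum fullness_sketch',
      `struct zspage_acc', putback_sketch(), account_compaction()) are
      hypothetical, and the half-full threshold that separates the two
      ALMOST groups is an arbitrary choice for this sketch.

```c
#include <assert.h>

/* Hypothetical fullness groups. */
enum fullness_sketch { FG_ALMOST_EMPTY, FG_ALMOST_FULL, FG_EMPTY, FG_FULL };

/* Hypothetical, simplified zspage. */
struct zspage_acc {
    int inuse;          /* live objects */
    int max_objects;    /* capacity */
    int freed;          /* set once the zspage has been released */
};

/*
 * Put a zspage back after migration and report its fullness group, so
 * the caller can tell that an empty zspage was freed.
 */
enum fullness_sketch putback_sketch(struct zspage_acc *z)
{
    assert(z->inuse >= 0 && z->inuse <= z->max_objects);

    if (z->inuse == 0) {
        z->freed = 1;   /* stands in for freeing the zspage */
        return FG_EMPTY;
    }
    if (z->inuse == z->max_objects)
        return FG_FULL;
    /* Arbitrary threshold for this sketch: at most half full. */
    return (z->inuse * 2 <= z->max_objects) ? FG_ALMOST_EMPTY
                                            : FG_ALMOST_FULL;
}

/* Count freed pages, not migrated objects. */
unsigned long account_compaction(struct zspage_acc *zspages, int n,
                                 unsigned long pages_per_zspage)
{
    unsigned long pages_compacted = 0;

    for (int i = 0; i < n; i++)
        if (putback_sketch(&zspages[i]) == FG_EMPTY)
            pages_compacted += pages_per_zspage;
    return pages_compacted;
}
```

      Returning the fullness group keeps the freeing decision inside the
      putback routine while still letting the caller update its statistics.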
    • S
      zsmalloc/zram: introduce zs_pool_stats api · 7d3f3938
      Sergey Senozhatsky committed
      `zs_compact_control' accounts the number of migrated objects but it has
      a limited lifespan -- we lose it as soon as zs_compaction() returns
      to zram.  It worked fine, because (a) zram had its own counter of
      migrated objects and (b) only zram could trigger compaction.  However,
      this does not work for automatic pool compaction (not issued by zram).
      To account objects migrated during auto-compaction (issued by the
      shrinker) we need to store this number in zs_pool.
      
      Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
      there.  It provides only `num_migrated', as of this writing, but it
      surely can be extended.
      
      A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats to the
      caller.
      
      Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d3f3938
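
      The idea of keeping the counter in the pool and copying it out on
      request can be sketched as below.  This is a userspace model with
      hypothetical names (`struct pool_stats_sketch', `struct pool_sketch',
      pool_compact_sketch(), pool_get_stats_sketch()); it does not
      reproduce the signature of the real zs_pool_stats().

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stats block; mirrors the single counter named above. */
struct pool_stats_sketch {
    unsigned long num_migrated;
};

/* Hypothetical pool that owns its statistics. */
struct pool_sketch {
    struct pool_stats_sketch stats;
};

/*
 * Record a compaction run.  The counter lives in the pool, so it is
 * updated no matter who triggered compaction (the pool's user or a
 * shrinker).
 */
void pool_compact_sketch(struct pool_sketch *pool, unsigned long migrated)
{
    assert(pool != NULL);
    pool->stats.num_migrated += migrated;
}

/* Export a snapshot of the pool's stats to the caller. */
void pool_get_stats_sketch(const struct pool_sketch *pool,
                           struct pool_stats_sketch *out)
{
    assert(pool != NULL && out != NULL);
    memcpy(out, &pool->stats, sizeof(*out));
}
```

      Copying the whole structure means new counters can be added to the
      stats block later without changing the accessor.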