1. 12 Oct, 2016 23 commits
  2. 11 Oct, 2016 2 commits
    • [btrfs] fix check_direct_IO() for non-iovec iterators · cd27e455
      Committed by Al Viro
      looking for duplicate ->iov_base makes sense only for
      iovec-backed iterators; for kvec-backed ones it's pointless,
      for bvec-backed ones it's pointless and broken on 32bit (we
      walk through an array of struct bio_vec accessing them as if
      they were struct iovec; works by accident on 64bit, but on
      32bit it'll blow up) and for pipe-backed ones it's pointless
      and ends up oopsing.
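The rule described above (only inspect the segment array when the iterator is really iovec-backed) can be sketched in userspace as follows. This is a simplified model, not the btrfs code: `iter_model`, `iter_kind_model` and `has_duplicate_iov_base()` are hypothetical names, and only `struct iovec` comes from a real header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical iterator flavours, modelled after the four named above. */
enum iter_kind_model { MODEL_IOVEC, MODEL_KVEC, MODEL_BVEC, MODEL_PIPE };

struct iter_model {
	enum iter_kind_model kind;
	const struct iovec *iov;   /* only meaningful when kind == MODEL_IOVEC */
	size_t nr_segs;
};

/* True only if an iovec-backed iterator repeats a base address. */
bool has_duplicate_iov_base(const struct iter_model *it)
{
	assert(it != NULL);

	/* Other flavours have no iovec array, so it->iov is never touched. */
	if (it->kind != MODEL_IOVEC)
		return false;

	for (size_t i = 1; i < it->nr_segs; i++)
		for (size_t j = 0; j < i; j++)
			if (it->iov[i].iov_base == it->iov[j].iov_base)
				return true;
	return false;
}
```

The early return is the whole point of the sketch: for a non-iovec iterator the array pointer is never dereferenced, which is what avoids walking foreign structures as if they were iovecs.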
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cd27e455
    • fix ITER_PIPE interaction with direct_IO · c3a69024
      Committed by Al Viro
      by making sure we call iov_iter_advance() on original
      iov_iter even if direct_IO (done on its copy) has returned 0.
      It's a no-op for old iov_iter flavours and does the right thing
      (== truncation of the stuff we'd allocated, but not filled) in
      ITER_PIPE case.  Failures (e.g. -EIO) get caught and dealt with
      by cleanup in generic_file_read_iter().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c3a69024
  3. 10 Oct, 2016 1 commit
  4. 08 Oct, 2016 14 commits
    • vfs: Remove {get,set,remove}xattr inode operations · fd50ecad
      Committed by Andreas Gruenbacher
      These inode operations are no longer used; remove them.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      fd50ecad
    • cred: simpler, 1D supplementary groups · 81243eac
      Committed by Alexey Dobriyan
      Current supplementary groups code can massively overallocate memory and
      is implemented in a way so that access to individual gid is done via 2D
      array.
      
      If number of gids is <= 32, memory allocation is more or less tolerable
      (140/148 bytes).  But if it is not, code allocates full page (!)
      regardless and, what's even more fun, doesn't reuse small 32-entry
      array.
      
      2D array means dependent shifts, loads and LEAs without possibility to
      optimize them (gid is never known at compile time).
      
      All of the above is unnecessary.  Switch to the usual
      trailing-zero-len-array scheme.  Memory is allocated with
      kmalloc/vmalloc() and only as much as needed.  Accesses become simpler
      (LEA 8(gi,idx,4) or even without displacement).
      
      Maximum number of gids is 65536 which translates to 256KB+8 bytes.  I
      think kernel can handle such allocation.
      
      On my usual desktop system with whole 9 (nine) aux groups, struct
      group_info shrinks from 148 bytes to 44 bytes, yay!
      
      Nice side effects:
      
       - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,
      
       - fix little mess in net/ipv4/ping.c
         should have been using GROUP_AT macro but this point becomes moot,
      
       - aux group allocation is persistent and should be accounted as such.
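The trailing-array scheme described above can be sketched in userspace like this. It is a model under assumptions, not the kernel definition: `group_info_model`, `groups_alloc_model()` and `groups_size_model()` are hypothetical names, `malloc()` stands in for the kernel allocators, and 4-byte `int`/`unsigned int` are assumed so that the sizes line up with the figures quoted in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int gid_model_t;   /* stand-in for the kernel's gid type */

/* Header followed directly by the gids: one allocation, one dimension. */
struct group_info_model {
	int usage;            /* stand-in for the reference count */
	int ngroups;
	gid_model_t gid[];    /* flexible array member, sized at allocation */
};

/* Bytes needed for a list holding ngroups entries. */
size_t groups_size_model(int ngroups)
{
	assert(ngroups >= 0);
	return sizeof(struct group_info_model) +
	       (size_t)ngroups * sizeof(gid_model_t);
}

/* Allocate only as much as needed; returns NULL on failure. */
struct group_info_model *groups_alloc_model(int ngroups)
{
	struct group_info_model *gi = malloc(groups_size_model(ngroups));

	if (gi == NULL)
		return NULL;
	gi->usage = 1;
	gi->ngroups = ngroups;
	return gi;
}
```

Under those size assumptions, nine groups need 8 + 9 * 4 = 44 bytes and 65536 groups need 256KB + 8 bytes, and an entry is reached with a plain `gi->gid[i]`.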
      
      Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.by
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Vasily Kulikov <segoon@openwall.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81243eac
    • mm, proc: fix region lost in /proc/self/smaps · 855af072
      Committed by Robert Ho
      Recently, Red Hat reported that the nvml test suite failed on QEMU/KVM.
      For more detailed information, please refer to:
      
         https://bugzilla.redhat.com/show_bug.cgi?id=1365721
      
      Actually, this bug is not specific to NVDIMM/DAX; it affects any other
      file system as well.  This simple test case, abstracted from nvml, can
      easily reproduce the bug in a common environment:
      
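      -------------------------- testcase.c -----------------------------
      
      #include <fcntl.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>
      
      /* Not defined in the original message; the values below are assumptions. */
      #define PROCMAXLEN 2048
      #define NTHREAD 8
      
      int
      is_pmem_proc(const void *addr, size_t len)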
      {
              const char *caddr = addr;
      
              FILE *fp;
              if ((fp = fopen("/proc/self/smaps", "r")) == NULL) {
                      printf("!/proc/self/smaps");
                      return 0;
              }
      
              int retval = 0;         /* assume false until proven otherwise */
              char line[PROCMAXLEN];  /* for fgets() */
              char *lo = NULL;        /* beginning of current range in smaps file */
              char *hi = NULL;        /* end of current range in smaps file */
              int needmm = 0;         /* looking for mm flag for current range */
              while (fgets(line, PROCMAXLEN, fp) != NULL) {
                      static const char vmflags[] = "VmFlags:";
                      static const char mm[] = " wr";
      
                      /* check for range line */
                      if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
                              if (needmm) {
                                      /* last range matched, but no mm flag found */
                                      printf("never found mm flag.\n");
                                      break;
                              } else if (caddr < lo) {
                                      /* never found the range for caddr */
                                      printf("#######no match for addr %p.\n", caddr);
                                      break;
                              } else if (caddr < hi) {
                                      /* start address is in this range */
                                      size_t rangelen = (size_t)(hi - caddr);
      
                                      /* remember that matching has started */
                                      needmm = 1;
      
                                      /* calculate remaining range to search for */
                                      if (len > rangelen) {
                                              len -= rangelen;
                                              caddr += rangelen;
                                              printf("matched %zu bytes in range "
                                                      "%p-%p, %zu left over.\n",
                                                              rangelen, lo, hi, len);
                                      } else {
                                              len = 0;
                                              printf("matched all bytes in range "
                                                              "%p-%p.\n", lo, hi);
                                      }
                              }
                      } else if (needmm && strncmp(line, vmflags,
                                              sizeof(vmflags) - 1) == 0) {
                              if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
                                      printf("mm flag found.\n");
                                      if (len == 0) {
                                              /* entire range matched */
                                              retval = 1;
                                              break;
                                      }
                                      needmm = 0;     /* saw what was needed */
                              } else {
                                      /* mm flag not set for some or all of range */
                                      printf("range has no mm flag.\n");
                                      break;
                              }
                      }
              }
      
              fclose(fp);
      
              printf("returning %d.\n", retval);
              return retval;
      }
      
      void *Addr;
      size_t Size;
      
      /*
       * worker -- the work each thread performs
       */
      static void *
      worker(void *arg)
      {
              int *ret = (int *)arg;
              *ret =  is_pmem_proc(Addr, Size);
              return NULL;
      }
      
      int main(int argc, char *argv[])
      {
              if (argc <  2 || argc > 3) {
                      printf("usage: %s file [env].\n", argv[0]);
                      return -1;
              }
      
              int fd = open(argv[1], O_RDWR);
      
              struct stat stbuf;
              fstat(fd, &stbuf);
      
              Size = stbuf.st_size;
              Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
      
              close(fd);
      
              pthread_t threads[NTHREAD];
              int ret[NTHREAD];
      
              /* kick off NTHREAD threads */
              for (int i = 0; i < NTHREAD; i++)
                      pthread_create(&threads[i], NULL, worker, &ret[i]);
      
              /* wait for all the threads to complete */
              for (int i = 0; i < NTHREAD; i++)
                      pthread_join(threads[i], NULL);
      
              /* verify that all the threads return the same value */
              for (int i = 1; i < NTHREAD; i++) {
                      if (ret[0] != ret[i]) {
                              printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
                                      ret[0], ret[i]);
                      }
              }
      
              printf("%d", ret[0]);
              return 0;
      }
      
      It fails because some threads cannot find the memory region in
      "/proc/self/smaps" that was allocated in the main process.
      
      The cause is that proc fs uses 'file->version' to record the last VMA
      that has already been handled by the read() system call.  When the next
      read() is issued, it uses 'version' to find that VMA, and the VMA after
      it is the one we want to handle.  The related code is as follows:
      
              if (last_addr) {
                      vma = find_vma(mm, last_addr);
                      if (vma && (vma = m_next_vma(priv, vma)))
                              return vma;
              }
      
      However, VMA will be lost if the last VMA is gone, e.g:
      
      The process VMA list is A->B->C->D
      
      CPU 0                                  CPU 1
      read() system call
         handle VMA B
         version = B
      return to userspace
      
                                         unmap VMA B
      
      issue read() again to continue to get
      the region info
         find_vma(version) will get VMA C
         m_next_vma(C) will get VMA D
         handle D
         !!! VMA C is lost !!!
      
      In order to fix this bug, we make 'file->version' indicate the end address
      of the current VMA.  m_start will then look up a vma with vma_start
      < last_vm_end and move on to the next vma if we found the same or an
      overlapping vma.  This will guarantee that we will not miss an exclusive
      vma but we can still miss one if the previous vma was shrunk.  This is
      acceptable because guaranteeing "never miss a vma" is simply not feasible.
      User has to cope with some inconsistencies if the file is not read in one
      go.
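The difference between the two cursors can be sketched with a small userspace model. It follows the description in this message rather than the actual fs/proc/task_mmu.c code: `vma_model`, `find_vma_model()`, `resume_by_start()` and `resume_by_end()` are hypothetical names, and the VMA list is simplified to a sorted array of [start, end) ranges.

```c
#include <assert.h>
#include <stddef.h>

struct vma_model { unsigned long start, end; };   /* half-open [start, end) */

/* Index of the first region whose end lies above addr, or -1 if none. */
int find_vma_model(const struct vma_model *v, int n, unsigned long addr)
{
	for (int i = 0; i < n; i++)
		if (addr < v[i].end)
			return i;
	return -1;
}

/* Old scheme: the cursor is the start of the last region reported. */
int resume_by_start(const struct vma_model *v, int n, unsigned long last_start)
{
	int i = find_vma_model(v, n, last_start);

	/* Always steps past whatever was found, even if it is a new region. */
	return (i >= 0 && i + 1 < n) ? i + 1 : -1;
}

/* New scheme: the cursor is the end address of the last region reported. */
int resume_by_end(const struct vma_model *v, int n, unsigned long last_end)
{
	assert(last_end > 0);

	int i = find_vma_model(v, n, last_end - 1);

	/* Step past it only if it is the same or an overlapping region. */
	if (i >= 0 && v[i].start < last_end)
		i = (i + 1 < n) ? i + 1 : -1;
	return i;
}
```

With regions A, B, C, D and B unmapped after it was reported, `resume_by_start()` lands on D (C is lost), whereas `resume_by_end()` lands on C.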
      
      [mhocko@suse.com: changelog fixes]
      Link: http://lkml.kernel.org/r/1475296958-27652-1-git-send-email-robert.hu@intel.com
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: Robert Hu <robert.hu@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      855af072
    • proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self · 4b2bd5fe
      Committed by John Stultz
      In changing from checking ptrace_may_access(p, PTRACE_MODE_ATTACH_FSCREDS)
      to capable(CAP_SYS_NICE), I missed that ptrace_may_access() succeeds when p
      == current, but the CAP_SYS_NICE check doesn't.
      
      Thus while the previous commit was intended to loosen the needed
      privileges to modify a process's timerslack, it needlessly restricted a
      task modifying its own timerslack via /proc/<tid>/timerslack_ns
      (which is permitted also via the PR_SET_TIMERSLACK method).
      
      This patch corrects this by checking if p == current before checking the
      CAP_SYS_NICE value.
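The ordering of the two checks can be sketched as a tiny userspace model. It is not the fs/proc code: `task_model` and `timerslack_may_modify()` are hypothetical names, and a boolean field stands in for the result of capable(CAP_SYS_NICE).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct task_model {
	int tid;
	bool cap_sys_nice;   /* stand-in for capable(CAP_SYS_NICE) */
};

/* Returns 0 if 'current' may change p's timerslack, -EPERM otherwise. */
int timerslack_may_modify(const struct task_model *current,
			  const struct task_model *p)
{
	assert(current != NULL && p != NULL);

	/* A task adjusting itself never needs the capability. */
	if (p == current)
		return 0;

	return current->cap_sys_nice ? 0 : -EPERM;
}
```

Checking identity first means an unprivileged task keeps the ability to adjust its own value, while changing another task still requires the capability.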
      
      This patch applies on top of my two previous patches currently in -mm
      
      Link: http://lkml.kernel.org/r/1471906870-28624-1-git-send-email-john.stultz@linaro.org
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Colin Cross <ccross@android.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Dmitry Shmidt <dimitrysh@google.com>
      Cc: Elliott Hughes <enh@google.com>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b2bd5fe
    • proc: add LSM hook checks to /proc/<tid>/timerslack_ns · 904763e1
      Committed by John Stultz
      As requested, this patch checks the existing LSM hooks
      task_getscheduler/task_setscheduler when reading or modifying the task's
      timerslack value.
      
      Previous versions added new get/settimerslack LSM hooks, but since they
      checked the same PROCESS__SET/GETSCHED values as existing hooks, it was
      suggested we just use the existing ones.
      
      Link: http://lkml.kernel.org/r/1469132667-17377-2-git-send-email-john.stultz@linaro.org
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Colin Cross <ccross@android.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Dmitry Shmidt <dimitrysh@google.com>
      Cc: Elliott Hughes <enh@google.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      904763e1
    • proc: relax /proc/<tid>/timerslack_ns capability requirements · 7abbaf94
      Committed by John Stultz
      When an interface to allow a task to change another task's timerslack was
      first proposed, it was suggested that something greater than
      CAP_SYS_NICE would be needed, as a task could be delayed further than
      what normally could be done with nice adjustments.
      
      So CAP_SYS_PTRACE was adopted instead for what became the
      /proc/<tid>/timerslack_ns interface.  However, for Android (where this
      feature originates), giving the system_server CAP_SYS_PTRACE would allow
      it to observe and modify all tasks' memory.  This is considered too high
      a privilege level for only needing to change the timerslack.
      
      After some discussion, it was realized that a CAP_SYS_NICE process can
      set a task as SCHED_FIFO, so they could fork some spinning processes and
      set them all SCHED_FIFO 99, in effect delaying all other tasks for an
      infinite amount of time.
      
      So as a CAP_SYS_NICE task can already cause trouble for other tasks,
      using it as a required capability for accessing and modifying
      /proc/<tid>/timerslack_ns seems sufficient.
      
      Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
      removes CAP_SYS_PTRACE, simplifying some of the code flow as well.
      
      This is technically an ABI change, but as the feature just landed in
      4.6, I suspect no one is yet using it.
      
      Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Reviewed-by: Nick Kralevich <nnk@google.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Colin Cross <ccross@android.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Dmitry Shmidt <dimitrysh@google.com>
      Cc: Elliott Hughes <enh@google.com>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7abbaf94
    • meminfo: break apart a very long seq_printf with #ifdefs · e16e2d8e
      Committed by Joe Perches
      Use a specific routine to emit most lines so that the code is easier to
      read and maintain.
      
      akpm:
         text    data     bss     dec     hex filename
         2976       8       0    2984     ba8 fs/proc/meminfo.o before
         2669       8       0    2677     a75 fs/proc/meminfo.o after
      
      Link: http://lkml.kernel.org/r/8fce7fdef2ba081a4ef531594e97da8a9feebb58.1470810406.git.joe@perches.com
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e16e2d8e
    • seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char · 75ba1d07
      Committed by Joe Perches
      Allow some seq_puts removals by taking a string instead of a single
      char.
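What a string delimiter buys can be sketched with a userspace model. It is not the kernel's seq_file implementation: `seq_model`, `seq_model_init()` and `seq_put_decimal_ull_model()` are hypothetical names, and `snprintf()` into a fixed buffer stands in for the real buffer handling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct seq_model {
	char buf[128];
	size_t count;   /* bytes used, excluding the terminating NUL */
};

void seq_model_init(struct seq_model *m)
{
	memset(m, 0, sizeof(*m));
}

/* Append the delimiter string, then num in decimal; -1 if it does not fit. */
int seq_put_decimal_ull_model(struct seq_model *m, const char *delimiter,
			      unsigned long long num)
{
	assert(m != NULL && delimiter != NULL);

	size_t room = sizeof(m->buf) - m->count;
	int n = snprintf(m->buf + m->count, room, "%s%llu", delimiter, num);

	if (n < 0 || (size_t)n >= room) {
		m->buf[m->count] = '\0';   /* discard the partial write */
		return -1;
	}
	m->count += (size_t)n;
	return 0;
}
```

With a single-character delimiter, a label such as "Uid:\t" needs a separate string write before the number; with a string delimiter, label and number go out in one call.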
      
      [akpm@linux-foundation.org: update vmstat_show(), per Joe]
      Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75ba1d07
    • proc: faster /proc/*/status · f7a5f132
      Committed by Alexey Dobriyan
      top(1) opens the following files for every PID:
      
      	/proc/*/stat
      	/proc/*/statm
      	/proc/*/status
      
      This patch switches /proc/*/status away from seq_printf().
      The result is a 13.5% speedup.
      
      The benchmark is open("/proc/self/status") + read + close, repeated 1,000,000 times.
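The source of the ./proc-self-status program is not included in this message. A loop of the kind described might look like the sketch below, where `proc_status_loop()` is a hypothetical name and the 4096-byte buffer is an assumption:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Open, read once and close 'path' the given number of times.
 * Returns the total number of bytes read, or -1 on the first failure.
 */
long long proc_status_loop(const char *path, long iterations)
{
	char buf[4096];
	long long total = 0;

	for (long i = 0; i < iterations; i++) {
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return -1;

		ssize_t n = read(fd, buf, sizeof(buf));

		close(fd);
		if (n < 0)
			return -1;
		total += n;
	}
	return total;
}
```

A small main() that calls `proc_status_loop("/proc/self/status", 1000000)` would then be run under `perf stat`, as in the measurements that follow.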
      
      				BEFORE
      $ perf stat -r 10 taskset -c 3 ./proc-self-status
      
       Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):
      
            10748.474301      task-clock (msec)         #    0.954 CPUs utilized            ( +-  0.91% )
                      12      context-switches          #    0.001 K/sec                    ( +-  1.09% )
                       1      cpu-migrations            #    0.000 K/sec
                     104      page-faults               #    0.010 K/sec                    ( +-  0.45% )
          37,424,127,876      cycles                    #    3.482 GHz                      ( +-  0.04% )
           8,453,010,029      stalled-cycles-frontend   #   22.59% frontend cycles idle     ( +-  0.12% )
           3,747,609,427      stalled-cycles-backend    #  10.01% backend cycles idle       ( +-  0.68% )
          65,632,764,147      instructions              #    1.75  insn per cycle
                                                        #    0.13  stalled cycles per insn  ( +-  0.00% )
          13,981,324,775      branches                  # 1300.773 M/sec                    ( +-  0.00% )
             138,967,110      branch-misses             #    0.99% of all branches          ( +-  0.18% )
      
            11.263885428 seconds time elapsed                                          ( +-  0.04% )
            ^^^^^^^^^^^^
      
      				AFTER
      $ perf stat -r 10 taskset -c 3 ./proc-self-status
      
       Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):
      
             9010.521776      task-clock (msec)         #    0.925 CPUs utilized            ( +-  1.54% )
                      11      context-switches          #    0.001 K/sec                    ( +-  1.54% )
                       1      cpu-migrations            #    0.000 K/sec                    ( +- 11.11% )
                     103      page-faults               #    0.011 K/sec                    ( +-  0.60% )
          32,352,310,603      cycles                    #    3.591 GHz                      ( +-  0.07% )
           7,849,199,578      stalled-cycles-frontend   #   24.26% frontend cycles idle     ( +-  0.27% )
           3,269,738,842      stalled-cycles-backend    #  10.11% backend cycles idle       ( +-  0.73% )
          56,012,163,567      instructions              #    1.73  insn per cycle
                                                        #    0.14  stalled cycles per insn  ( +-  0.00% )
          11,735,778,795      branches                  # 1302.453 M/sec                    ( +-  0.00% )
              98,084,459      branch-misses             #    0.84% of all branches          ( +-  0.28% )
      
             9.741247736 seconds time elapsed                                          ( +-  0.07% )
             ^^^^^^^^^^^
      
      Link: http://lkml.kernel.org/r/20160806125608.GB1187@p183.telecom.by
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7a5f132
    • mm: remove unnecessary condition in remove_inode_hugepages · 72e2936c
      Committed by zhong jiang
      When the huge page is added to the page cache (huge_add_to_page_cache),
      the page private flag will be cleared.  Since this code
      (remove_inode_hugepages) will only be called for pages in the page
      cache, PagePrivate(page) will always be false.
      
      The patch removes the code without any functional change.
      
      Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72e2936c
    • mm/hugetlb: introduce ARCH_HAS_GIGANTIC_PAGE · 461a7184
      Committed by Yisheng Xie
      Avoid the ifdef becoming unwieldy as more architectures support gigantic
      pages.  No functional change with this patch.
      
      Link: http://lkml.kernel.org/r/1475227569-63446-2-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      461a7184
    • mm: remove page_file_index · 8cd79788
      Committed by Huang Ying
      After using the offset of the swap entry as the key of the swap cache,
      page_index() becomes exactly the same as page_file_index().  So
      page_file_index() is removed and the callers are changed to use
      page_index() instead.
      
      Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Anna Schumaker <anna.schumaker@netapp.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cd79788
    • thp: reduce usage of huge zero page's atomic counter · 6fcb52a5
      Committed by Aaron Lu
      The global zero page is used to satisfy an anonymous read fault.  If
      THP(Transparent HugePage) is enabled then the global huge zero page is
      used.  The global huge zero page uses an atomic counter for reference
      counting and is allocated/freed dynamically according to its counter
      value.
      
      CPU time spent on that counter will greatly increase if there are a lot
      of processes doing anonymous read faults.  This patch proposes a way to
      reduce the access to the global counter so that the CPU load can be
      reduced accordingly.
      
      To do this, a new flag of the mm_struct is introduced:
      MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only needs to touch
      the global counter in two cases:
      
       1 The first time it uses the global huge zero page;
       2 The time when mm_users of its mm_struct reaches zero.
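The two cases can be sketched in userspace with C11 atomics. This is a model of the idea only, not the mm/huge_memory.c code: `mm_model`, `mm_get_huge_zero_page_model()`, `mm_put_huge_zero_page_model()` and the global counter name are hypothetical, and page allocation and freeing are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Global reference counter shared by every process (zero-initialized). */
atomic_int huge_zero_refcount_model;

struct mm_model {
	atomic_bool used_huge_zero_page;   /* stand-in for the per-mm flag */
};

void mm_model_init(struct mm_model *mm)
{
	atomic_init(&mm->used_huge_zero_page, false);
}

/* Called on every read fault; touches the global counter only once per mm. */
void mm_get_huge_zero_page_model(struct mm_model *mm)
{
	/* atomic_exchange returns the previous flag value. */
	if (atomic_exchange(&mm->used_huge_zero_page, true))
		return;   /* this mm already holds its reference */
	atomic_fetch_add(&huge_zero_refcount_model, 1);
}

/* Called once when the last user of the mm goes away. */
void mm_put_huge_zero_page_model(struct mm_model *mm)
{
	if (atomic_exchange(&mm->used_huge_zero_page, false))
		atomic_fetch_sub(&huge_zero_refcount_model, 1);
}

int huge_zero_refcount_read_model(void)
{
	return atomic_load(&huge_zero_refcount_model);
}
```

However many faults a process takes, the shared counter changes by at most one per mm, so the contended cache line is written rarely instead of on every fault.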
      
      Note that right now, the huge zero page is eligible to be freed as soon
      as its last use goes away.  With this patch, the page will not be
      eligible to be freed until the exit of the last process from which it
      was ever used.
      
      And with the use of mm_users, a kthread is not eligible to use the huge
      zero page either.  Since no kthread is using the huge zero page today, there
      is no difference after applying this patch.  But if that is not desired,
      I can change it to when mm_count reaches zero.
      
      Case used for test on Haswell EP:
      
        usemem -n 72 --readonly -j 0x200000 100G
      
      Which spawns 72 processes and each will mmap 100G anonymous space and
      then do read only access to that space sequentially with a step of 2MB.
      
        CPU cycles from perf report for base commit:
            54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
        CPU cycles from perf report for this commit:
             0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page
      
      Performance(throughput) of the workload for base commit: 1784430792
      Performance(throughput) of the workload for this commit: 4726928591
      164% increase.
      
      Runtime of the workload for base commit: 707592 us
      Runtime of the workload for this commit: 303970 us
      50% drop.
      
      Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6fcb52a5
    • fs/proc/task_mmu.c: make the task_mmu walk_page_range() limit in clear_refs_write() obvious · 0f30206b
      Committed by James Morse
      Trying to walk all of virtual memory requires architecture specific
      knowledge.  On x86_64, addresses must be sign extended from bit 48,
      whereas on arm64 the top VA_BITS of address space have their own set of
      page tables.
      
      clear_refs_write() calls walk_page_range() on the range 0 to ~0UL and
      provides a test_walk() callback that only expects to be walking over
      VMAs.  Currently walk_pmd_range() will skip memory regions that don't
      have a VMA, reporting them as a hole.
      
      As this call only expects to walk user address space, make it walk 0 to
      'highest_vm_end'.
      
      Link: http://lkml.kernel.org/r/1472655792-22439-1-git-send-email-james.morse@arm.com
      Signed-off-by: James Morse <james.morse@arm.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f30206b