1. 13 Sep, 2017 1 commit
    • R
      xfs: XFS_IS_REALTIME_INODE() should be false if no rt device present · b31ff3cd
      Committed by Richard Wareing
      If using a kernel with CONFIG_XFS_RT=y and we set the RTINHERIT flag on
      a directory in a filesystem that does not have a realtime device and
      create a new file in that directory, it gets marked as a realtime file.
      When data is written and an fsync is issued, the filesystem attempts to
      flush a non-existent rt device during the fsync process.
      
      This results in a crash dereferencing a null buftarg pointer in
      xfs_blkdev_issue_flush():
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: xfs_blkdev_issue_flush+0xd/0x20
        .....
        Call Trace:
          xfs_file_fsync+0x188/0x1c0
          vfs_fsync_range+0x3b/0xa0
          do_fsync+0x3d/0x70
          SyS_fsync+0x10/0x20
          do_syscall_64+0x4d/0xb0
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Setting RT inode flags does not require special privileges so any
      unprivileged user can cause this oops to occur.  To reproduce, confirm
      kernel is compiled with CONFIG_XFS_RT=y and run:
      
        # mkfs.xfs -f /dev/pmem0
        # mount /dev/pmem0 /mnt/test
        # mkdir /mnt/test/foo
        # xfs_io -c 'chattr +t' /mnt/test/foo
        # xfs_io -f -c 'pwrite 0 5m' -c fsync /mnt/test/foo/bar
      
      Or just run xfstests with MKFS_OPTIONS="-d rtinherit=1" and wait.
      
      Kernels built with CONFIG_XFS_RT=n are not exposed to this bug.
      
      Fixes: f538d4da ("[XFS] write barrier support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Richard Wareing <rwareing@fb.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 10 Sep, 2017 4 commits
  3. 09 Sep, 2017 22 commits
  4. 08 Sep, 2017 1 commit
  5. 07 Sep, 2017 12 commits
    • T
      NFS: Sync the correct byte range during synchronous writes · e973b1a5
      Committed by tarangg@amazon.com
      Since commit 18290650 ("NFS: Move buffered I/O locking into
      nfs_file_write()") nfs_file_write() has not flushed the correct byte
      range during synchronous writes.  generic_write_sync() expects that
      iocb->ki_pos points to the right edge of the range rather than the
      left edge.
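
      As a rough illustration of the range arithmetic described above, the
      sketch below models how a sync range can be derived from the file
      position once a write has completed.  The function name and its
      parameters are hypothetical and are not kernel APIs; the sketch only
      assumes, as stated above, that the position handed to the sync step is
      the right edge of the written range.

```python
from typing import Tuple


def sync_range(pos_after_write: int, bytes_written: int) -> Tuple[int, int]:
    """Return the inclusive (start, end) byte range that should be flushed.

    pos_after_write is assumed to be the right edge of the write, i.e. the
    offset of the first byte past the data that was just written.
    """
    if bytes_written <= 0:
        raise ValueError("nothing was written, so there is nothing to sync")
    start = pos_after_write - bytes_written
    end = pos_after_write - 1
    return start, end


# A 512-byte write at offset 4096 leaves the position at 4608.
correct = sync_range(4096 + 512, 512)
# Passing the left edge instead shifts the range to the bytes before the write.
stale = sync_range(4096, 512)
```

      With the left edge passed in, the same arithmetic selects the bytes
      that precede the write, so the newly written data is not covered.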
      
      To replicate the problem, open a file with O_DSYNC, have the client
      write at increasing offsets, and then print the successful offsets.
      Block port 2049 partway through that sequence, and observe that the
      client application indicates successful writes in advance of what the
      server received.
      
      Fixes: 18290650 ("NFS: Move buffered I/O locking into nfs_file_write()")
      Signed-off-by: Jacob Strauss <jsstraus@amazon.com>
      Signed-off-by: Tarang Gupta <tarangg@amazon.com>
      Tested-by: Tarang Gupta <tarangg@amazon.com>
      Cc: stable@vger.kernel.org # v4.8+
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • R
      mm,fork: introduce MADV_WIPEONFORK · d2cd9ede
      Committed by Rik van Riel
      Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
      in the child process after fork.  This differs from MADV_DONTFORK in one
      important way.
      
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      
      MADV_WIPEONFORK only works on private, anonymous VMAs.
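
      The behavior described above can be probed from userspace with a small
      experiment.  The sketch below is only an illustration: the helper name
      is hypothetical, it assumes a Python build that exposes
      mmap.MADV_WIPEONFORK and a kernel that accepts it, and it returns None
      when either assumption does not hold.

```python
import mmap
import os
from typing import Optional


def child_view_after_fork(secret: bytes = b"seed") -> Optional[bytes]:
    """Return what a forked child reads from a MADV_WIPEONFORK mapping.

    Returns None if the constant or the madvise call is unavailable.
    """
    wipe = getattr(mmap, "MADV_WIPEONFORK", None)
    if wipe is None or not hasattr(mmap.mmap, "madvise"):
        return None
    # Private, anonymous mapping, as the commit message requires.
    buf = mmap.mmap(-1, mmap.PAGESIZE,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[:len(secret)] = secret
        try:
            buf.madvise(wipe)
        except OSError:
            return None  # kernel does not support the advice value
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: report what it sees in the mapping, then exit at once.
            try:
                os.close(read_fd)
                os.write(write_fd, bytes(buf[:len(secret)]))
            finally:
                os._exit(0)
        os.close(write_fd)
        with os.fdopen(read_fd, "rb") as pipe:
            seen = pipe.read()
        os.waitpid(pid, 0)
        # The parent's copy of the data is left untouched.
        assert buf[:len(secret)] == secret
        return seen
    finally:
        buf.close()
```

      On a system where the advice is accepted, the child is expected to
      read zeroes rather than the parent's bytes, and the access itself does
      not fault.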
      
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      
      The security benefits of a forking server having a re-initialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      
      It would be better to have the kernel take care of this automatically.
      
      The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
      
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      
          https://man.openbsd.org/minherit.2
      
      [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
      Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Reported-by: Florian Weimer <fweimer@redhat.com>
      Reported-by: Colm MacCártaigh <colm@allcosts.net>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm: add /proc/pid/smaps_rollup · 493b0e9d
      Committed by Daniel Colascione
      /proc/pid/smaps_rollup is a new proc file that improves the performance
      of user programs that determine aggregate memory statistics (e.g., total
      PSS) of a process.
      
      Android regularly "samples" the memory usage of various processes in
      order to balance its memory pool sizes.  This sampling process involves
      opening /proc/pid/smaps and summing certain fields.  For very large
      processes, sampling memory use this way can take several hundred
      milliseconds, due mostly to the overhead of the seq_printf calls in
      task_mmu.c.
      
      smaps_rollup improves the situation.  It contains most of the fields of
      /proc/pid/smaps, but instead of a set of fields for each VMA,
      smaps_rollup contains one synthetic smaps-format entry representing
      the whole process.  In the single smaps_rollup synthetic entry, each
      field is the summation of the corresponding field in all of the
      real-smaps VMAs.  Using a common format for smaps_rollup and smaps
      allows userspace to reuse parsers meant for non-rollup smaps with
      smaps_rollup, and it allows userspace to switch between smaps_rollup
      and smaps at runtime (say, based on the availability of smaps_rollup
      in a given kernel) with minimal fuss.
      
      By using smaps_rollup instead of smaps, a caller can avoid the
      significant overhead of formatting, reading, and parsing each of a large
      process's potentially very numerous memory mappings.  For sampling
      system_server's PSS in Android, we measured a 12x speedup, representing
      a savings of several hundred milliseconds.
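
      To make the per-VMA summation concrete, here is a minimal sketch of
      the userspace work that smaps_rollup is meant to replace.  The helper
      name, the chosen fields, and the sample text are illustrative
      assumptions (the numbers are made up); only the "Name:  value kB" line
      shape is taken from the smaps format.

```python
import re
from typing import Dict, Iterable

# Matches smaps-style "Name:   123 kB" lines. VMA header lines and
# non-numeric fields such as VmFlags do not match.
_FIELD = re.compile(r"^(\w+):\s+(\d+) kB\s*$")


def sum_smaps_fields(smaps_text: str,
                     fields: Iterable[str] = ("Rss", "Pss")) -> Dict[str, int]:
    """Sum the selected kB fields over every VMA entry in smaps-format text."""
    wanted = set(fields)
    totals = {name: 0 for name in wanted}
    for line in smaps_text.splitlines():
        match = _FIELD.match(line)
        if match and match.group(1) in wanted:
            totals[match.group(1)] += int(match.group(2))
    return totals


# Made-up two-VMA sample in smaps format.
SAMPLE = """\
00400000-0048a000 r-xp 00000000 fd:03 960637 /bin/example
Size:                552 kB
Rss:                 460 kB
Pss:                 100 kB
VmFlags: rd ex mr mw me
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
Size:                132 kB
Rss:                  12 kB
Pss:                  12 kB
VmFlags: rd wr mr mw me
"""

totals = sum_smaps_fields(SAMPLE)
```

      Because smaps_rollup uses the same line format, the same parser should
      work on its single entry, which then needs only one pass over a handful
      of lines instead of one per VMA.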
      
      One alternative to a new per-process proc file would have been including
      PSS information in /proc/pid/status.  We considered this option but
      thought that PSS would be too expensive (by a few orders of magnitude)
      to collect relative to what's already emitted as part of
      /proc/pid/status, and slowing every user of /proc/pid/status for the
      sake of readers that happen to want PSS feels wrong.
      
      The code itself works by reusing the existing VMA-walking framework we
      use for regular smaps generation and keeping the mem_size_stats
      structure around between VMA walks instead of using a fresh one for each
      VMA.  In this way, summation happens automatically.  We let seq_file
      walk over the VMAs just as it does for regular smaps and just emit
      nothing to the seq_file until we hit the last VMA.
      
      Benchmarks:
      
          using smaps:
          iterations:1000 pid:1163 pss:220023808
          0m29.46s real 0m08.28s user 0m20.98s system
      
          using smaps_rollup:
          iterations:1000 pid:1163 pss:220702720
          0m04.39s real 0m00.03s user 0m04.31s system
      
      We're using the PSS samples we collect asynchronously for
      system-management tasks like fine-tuning oom_adj_score, memory use
      tracking for debugging, application-level memory-use attribution, and
      deciding whether we want to kill large processes during system idle
      maintenance windows.  Android has been using PSS for these purposes for
      a long time; as the average process VMA count has increased and
      devices become more efficiency-conscious, PSS-collection inefficiency
      has started to matter more.  IMHO, it'd be a lot safer to optimize the
      existing PSS-collection model, which has been fine-tuned over the years,
      instead of changing the memory tracking approach entirely to work around
      smaps-generation inefficiency.
      
      Tim said:
      
      : There are two main reasons why Android gathers PSS information:
      :
      : 1. Android devices can show the user the amount of memory used per
      :    application via the settings app.  This is a less important use case.
      :
      : 2. We log PSS to help identify leaks in applications.  We have found
      :    an enormous number of bugs (in the Android platform, in Google's own
      :    apps, and in third-party applications) using this data.
      :
      : To do this, system_server (the main process in Android userspace) will
      : sample the PSS of a process three seconds after it changes state (for
      : example, app is launched and becomes the foreground application) and about
      : every ten minutes after that.  The net result is that PSS collection is
      : regularly running on at least one process in the system (usually a few
      : times a minute while the screen is on, less when screen is off due to
      : suspend).  PSS of a process is an incredibly useful stat to track, and we
      : aren't going to get rid of it.  We've looked at some very hacky approaches
      : using RSS ("take the RSS of the target process, subtract the RSS of the
      : zygote process that is the parent of all Android apps") to reduce the
      : accounting time, but it regularly overestimated the memory used by 20+
      : percent.  Accordingly, I don't think that there's a good alternative to
      : using PSS.
      :
      : We started looking into PSS collection performance after we noticed random
      : frequency spikes while a phone's screen was off; occasionally, one of the
      : CPU clusters would ramp to a high frequency because there was 200-300ms of
      : constant CPU work from a single thread in the main Android userspace
      : process.  The work causing the spike (which is reasonable governor
      : behavior given the amount of CPU time needed) was always PSS collection.
      : As a result, Android is burning more power than we should be on PSS
      : collection.
      :
      : The other issue (and why I'm less sure about improving smaps as a
      : long-term solution) is that the number of VMAs per process has increased
      : significantly from release to release.  After trying to figure out why we
      : were seeing these 200-300ms PSS collection times on Android O but had not
      : noticed it in previous versions, we found that the number of VMAs in the
      : main system process increased by 50% from Android N to Android O (from
      : ~1800 to ~2700) and varying increases in every userspace process.  Android
      : M to N also had an increase in the number of VMAs, although not as much.
      : I'm not sure why this is increasing so much over time, but thinking about
      : ASLR and ways to make ASLR better, I expect that this will continue to
      : increase going forward.  I would not be surprised if we hit 5000 VMAs on
      : the main Android process (system_server) by 2020.
      :
      : If we assume that the number of VMAs is going to increase over time, then
      : doing anything we can do to reduce the overhead of each VMA during PSS
      : collection seems like the right way to go, and that means outputting an
      : aggregate statistic (to avoid whatever overhead there is per line in
      : writing smaps and in reading each line from userspace).
      
      Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
      Signed-off-by: Daniel Colascione <dancol@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      userfaultfd: provide pid in userfault msg - add feat union · a36985d3
      Committed by Andrea Arcangeli
      No ABI change, but this makes it more explicit to software that ptid
      is only available if requested by passing UFFD_FEATURE_THREAD_ID to
      UFFDIO_API.  Making it a union also self-documents that the presence
      of a ptid there shouldn't be taken for granted.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-7-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      userfaultfd: provide pid in userfault msg · 9d4ac934
      Committed by Alexey Perevalov
      This could be useful for calculating per-vCPU downtime during postcopy
      live migration.  A side observer, or the application itself, will be
      informed which task was sleeping during userfaultfd processing.

      The thread ID of the faulting task is provided when the user requests
      it by setting the UFFD_FEATURE_THREAD_ID bit in uffdio_api.features.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-6-aarcange@redhat.com
      Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm: userfaultfd: add feature to request for a signal delivery · 2d6d6f5a
      Committed by Prakash Sangappa
      In some cases, the userfaultfd mechanism should just deliver a SIGBUS
      signal to the faulting process instead of the page-fault event.
      Handling page-fault events with a monitor thread can be unnecessary
      overhead in these cases.  For example, applications such as databases
      could use the signaling mechanism for robustness.
      
      The database uses hugetlbfs for performance reasons.  Files on the
      hugetlbfs filesystem are created and huge pages allocated using the
      fallocate() API.  Pages are deallocated/freed using fallocate() hole
      punching support.
      These files are mmapped and accessed by many processes as shared memory.
      The database keeps track of which offsets in the hugetlbfs file have
      pages allocated.
      
      Any access to a mapped address over holes in the file, which can occur
      due to bugs in the application, is considered invalid, and the process
      is expected to simply receive a SIGBUS.  However, currently when a hole
      in the file is accessed via the mapped address, kernel/mm attempts to
      automatically allocate a page at page fault time, implicitly filling
      the hole in the file.  This may not be the desired behavior for
      applications like the database that want to explicitly manage page
      allocations of hugetlbfs files.
      
      Using the userfaultfd mechanism with this support to get a signal, a
      database application can prevent pages from being allocated implicitly
      when processes access a mapped address over holes in the file.

      This patch adds the UFFD_FEATURE_SIGBUS feature to the userfaultfd
      mechanism to request a SIGBUS signal.
      
      See the following for previous discussion about the database requirement
      leading to this proposal, as suggested by Andrea:
      
      http://www.spinics.net/lists/linux-mm/msg129224.html
      
      Link: http://lkml.kernel.org/r/1501552446-748335-2-git-send-email-prakash.sangappa@oracle.com
      Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: rename global_page_state to global_zone_page_state · c41f012a
      Committed by Michal Hocko
      global_page_state is error prone as a recent bug report pointed out [1].
      It only returns proper values for zone based counters as the enum it
      gets suggests.  We already have global_node_page_state so let's rename
      global_page_state to global_zone_page_state to be more explicit here.
      All existing users seem to be correct:
      
      $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
            2 NR_BOUNCE
            2 NR_FREE_CMA_PAGES
           11 NR_FREE_PAGES
            1 NR_KERNEL_STACK_KB
            1 NR_MLOCK
            2 NR_PAGETABLE
      
      This patch shouldn't introduce any functional change.
      
      [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
      
      Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      fs/sync.c: remove unnecessary NULL f_mapping check in sync_file_range · de23abd1
      Committed by Jeff Layton
      fsync codepath assumes that f_mapping can never be NULL, but
      sync_file_range has a check for that.
      
      Remove the one from sync_file_range as I don't see how you'd ever get a
      NULL pointer in here.
      
      Link: http://lkml.kernel.org/r/20170525110509.9434-1-jlayton@redhat.com
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      userfaultfd: report UFFDIO_ZEROPAGE as available for shmem VMAs · ce53e8e6
      Committed by Mike Rapoport
      Now that shmem VMAs can be filled with the zero page via userfaultfd, we
      can report that UFFDIO_ZEROPAGE is available for those VMAs.
      
      Link: http://lkml.kernel.org/r/1497939652-16528-7-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: remove nr_pages argument from pagevec_lookup{,_range}() · 397162ff
      Committed by Jan Kara
      All users of pagevec_lookup() and pagevec_lookup_range() now pass
      PAGEVEC_SIZE as a desired number of pages.
      
      Just drop the argument.
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      fs: use pagevec_lookup_range() in page_cache_seek_hole_data() · 8338141f
      Committed by Jan Kara
      We want only pages from the given range in page_cache_seek_hole_data().
      Use pagevec_lookup_range() instead of pagevec_lookup() and remove
      unnecessary code.

      Note that the check for getting fewer pages than desired can be removed
      because index gets updated by pagevec_lookup_range().
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-9-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      hugetlbfs: use pagevec_lookup_range() in remove_inode_hugepages() · 48f2301c
      Committed by Jan Kara
      We want only pages from the given range in remove_inode_hugepages().  Use
      pagevec_lookup_range() instead of pagevec_lookup().
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-8-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>