1. 09 Sep, 2017 4 commits
    • mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY · 2916ecc0
      Committed by Jérôme Glisse
      Introduce a new migration mode that allows the copy to be offloaded to a
      device DMA engine.  This changes the migration workflow, and not all
      address_space migratepage callbacks can support it.
      
      This is intended to be used by migrate_vma(), which is itself used for
      things like HMM (see include/linux/hmm.h).
      
      No additional per-filesystem migratepage testing is needed.  I disabled
      MIGRATE_SYNC_NO_COPY in all problematic migratepage() callbacks and
      added a comment in each explaining why (part of this patch).  The commit
      message is unclear; it should say that any callback that wishes to
      support this new mode needs to be aware of how its migration flow
      differs from the other modes.
      
      Some of these callbacks do extra locking while copying (aio, zsmalloc,
      balloon, ...), and for DMA to be effective you want to copy multiple
      pages in one DMA operation.  But in the problematic cases you cannot
      easily hold the extra lock across multiple calls to this callback.
      
      Usual flow is:
      
      For each page {
       1 - lock page
       2 - call migratepage() callback
       3 - (extra locking in some migratepage() callback)
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
       5 - copy page
       6 - (unlock any extra lock of migratepage() callback)
       7 - return from migratepage() callback
       8 - unlock page
      }
      
      The new mode MIGRATE_SYNC_NO_COPY:
       1 - lock multiple pages
      For each page {
       2 - call migratepage() callback
       3 - abort in all problematic migratepage() callback
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
      } // finished all calls to migratepage() callback
       5 - DMA copy multiple pages
       6 - unlock all the pages
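
      The difference between the two flows above can be sketched as a small
      user-space toy model.  This is only an illustration under stated
      assumptions: struct toy_page and both functions are hypothetical names,
      not kernel APIs, and the "copy" is a flag rather than a real DMA
      transfer.  It shows that the usual flow issues one copy per page, while
      the MIGRATE_SYNC_NO_COPY flow defers copying until every page state has
      been migrated and then issues one batched copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page; none of these names are kernel APIs. */
struct toy_page {
    bool locked;
    bool state_migrated;
    bool copied;
};

/*
 * Usual flow: each page is locked, has its state migrated, is copied,
 * and is unlocked before the next page is touched.
 * Returns the number of separate copy operations issued.
 */
size_t migrate_usual(struct toy_page *pages, size_t n)
{
    size_t copy_ops = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        pages[i].locked = true;          /* lock page */
        pages[i].state_migrated = true;  /* callback migrates page state */
        pages[i].copied = true;          /* callback copies the page */
        copy_ops++;
        pages[i].locked = false;         /* unlock page */
    }
    return copy_ops;
}

/*
 * MIGRATE_SYNC_NO_COPY-style flow: all pages are locked first, the
 * callbacks only migrate state, and one batched copy (standing in for
 * a single DMA operation) handles every page before unlocking.
 * Returns the number of separate copy operations issued.
 */
size_t migrate_sync_no_copy(struct toy_page *pages, size_t n)
{
    assert(pages != NULL || n == 0);
    if (n == 0)
        return 0;

    for (size_t i = 0; i < n; i++)
        pages[i].locked = true;          /* lock multiple pages */
    for (size_t i = 0; i < n; i++)
        pages[i].state_migrated = true;  /* callbacks skip the copy */
    for (size_t i = 0; i < n; i++)
        pages[i].copied = true;          /* one batched copy for all pages */
    for (size_t i = 0; i < n; i++)
        pages[i].locked = false;         /* unlock all the pages */
    return 1;
}
```

      In this model a callback that takes an extra lock around its own copy
      step has no place to do so in the second function, which is why such
      callbacks abort in the new mode.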
      
      To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
      new callback migratepages() (for instance) that deals with multiple
      pages in one transaction.
      
      Because the problematic cases are not important for current usage, I did
      not want to complicate this patchset any further for no good reason.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory · 5042db43
      Committed by Jérôme Glisse
      HMM (heterogeneous memory management) needs struct page to support
      migration from system main memory to device memory.  The reasons for HMM
      and for migration to device memory are explained in the HMM core patch.
      
      This patch deals with device memory that is un-addressable (i.e. the CPU
      cannot access it).  Hence we do not want those struct pages to be managed
      like regular memory.  That is why we extend ZONE_DEVICE to support
      different types of memory.
      
      A persistent memory type is defined for existing users of ZONE_DEVICE, and
      a new device un-addressable type is added for the un-addressable memory.
      There is a clear separation between what is expected from each memory
      type, and existing users of ZONE_DEVICE are unaffected by the new
      requirements and the new use of the un-addressable type.  All specific
      code paths are protected with a test against the memory type.
      
      Because the memory is un-addressable, we use a new special swap type for
      when a page is migrated to device memory (this reduces the maximum number
      of swap files).
      
      The two main additions to ZONE_DEVICE, besides the memory type, are two
      callbacks.  The first, page_free(), is called whenever the page refcount
      reaches 1 (which means the page is free, as a ZONE_DEVICE page never
      reaches a refcount of 0).  This allows the device driver to manage its
      memory and the associated struct page.
      
      The second callback, page_fault(), is invoked when the CPU accesses an
      address that is backed by a device page (which is un-addressable by the
      CPU).  This callback is responsible for migrating the page back to system
      main memory.  A device driver cannot block migration back to system
      memory; HMM makes sure that such pages cannot be pinned into device
      memory.
      
      If the device is in an error condition and cannot migrate memory back,
      then a CPU page fault to device memory should end with SIGBUS.
      
      [arnd@arndb.de: fix warning]
        Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: soft-dirty: keep soft-dirty bits over thp migration · ab6e3d09
      Committed by Naoya Horiguchi
      The soft-dirty bit is designed to stay tracked across page migration.
      This patch makes it work in the same manner for THP migration too.
      
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: check pmd migration entry in common path · 84c3fc4e
      Committed by Zi Yan
      When THP migration is being used, memory management code needs to handle
      pmd migration entries properly.  This patch uses !pmd_present() or
      is_swap_pmd() (depending on whether pmd_none() needs separate code or
      not) to check pmd migration entries at the places where a pmd entry is
      present.
      
      Since pmd-related code uses split_huge_page(), split_huge_pmd(),
      pmd_trans_huge(), pmd_trans_unstable(), or
      pmd_none_or_trans_huge_or_clear_bad(), this patch:
      
      1. adds pmd migration entry split code in split_huge_pmd(),
      
      2. takes care of pmd migration entries whenever pmd_trans_huge() is present,
      
      3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.
      
      Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
      is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
      them.
      
      Until this commit, a pmd entry should be:
      1. pointing to a pte page,
      2. is_swap_pmd(),
      3. pmd_trans_huge(),
      4. pmd_devmap(), or
      5. pmd_none().
      
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 08 Sep, 2017 1 commit
  3. 07 Sep, 2017 25 commits
    • mm,fork: introduce MADV_WIPEONFORK · d2cd9ede
      Committed by Rik van Riel
      Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
      in the child process after fork.  This differs from MADV_DONTFORK in one
      important way.
      
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      
      MADV_WIPEONFORK only works on private, anonymous VMAs.
      
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      
      The security benefits of a forking server having a re-initialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      
      It would be better to have the kernel take care of this automatically.
      
      The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
      
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      
          https://man.openbsd.org/minherit.2
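
      A minimal user-space sketch of the semantics described above, assuming a
      Linux kernel that includes this patch.  The function name
      child_sees_wiped_memory() is hypothetical, and the fallback #define uses
      the value found in the generic Linux uapi headers (some architectures,
      such as parisc, use a different number).  The parent fills a private
      anonymous page with a pattern, marks it MADV_WIPEONFORK, forks, and the
      child reports whether it reads zeroes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18  /* assumption: asm-generic value, for old libc headers */
#endif

/*
 * Returns 1 if the child saw zeroes in the marked range, 0 if it saw the
 * parent's data, and -1 if MADV_WIPEONFORK was rejected or a call failed.
 */
int child_sees_wiped_memory(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xAA, len);  /* stand-in for cached state, e.g. a PRNG seed */

    /* Only private, anonymous mappings are accepted. */
    if (madvise(buf, len, MADV_WIPEONFORK) != 0) {
        munmap(buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: the range is still mapped, so this read does not fault. */
        _exit(buf[0] == 0 ? 0 : 1);
    }

    int status = 0;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = (WEXITSTATUS(status) == 0) ? 1 : 0;

    assert(buf[0] == 0xAA);  /* the parent's copy is left untouched */
    munmap(buf, len);
    return result;
}
```

      A library could use the same pattern to detect a fork: keep a nonzero
      flag in such a page and treat a zero flag as "regenerate state".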
      
      [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
      Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Reported-by: Florian Weimer <fweimer@redhat.com>
      Reported-by: Colm MacCártaigh <colm@allcosts.net>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add /proc/pid/smaps_rollup · 493b0e9d
      Committed by Daniel Colascione
      /proc/pid/smaps_rollup is a new proc file that improves the performance
      of user programs that determine aggregate memory statistics (e.g., total
      PSS) of a process.
      
      Android regularly "samples" the memory usage of various processes in
      order to balance its memory pool sizes.  This sampling process involves
      opening /proc/pid/smaps and summing certain fields.  For very large
      processes, sampling memory use this way can take several hundred
      milliseconds, due mostly to the overhead of the seq_printf calls in
      task_mmu.c.
      
      smaps_rollup improves the situation.  It contains most of the fields of
      /proc/pid/smaps, but instead of a set of fields for each VMA,
      smaps_rollup instead contains one synthetic smaps-format entry
      representing the whole process.  In the single smaps_rollup synthetic
      entry, each field is the summation of the corresponding field in all of
      the real-smaps VMAs.  Using a common format for smaps_rollup and smaps
      lets userspace reuse parsers written for non-rollup smaps, and it lets
      userspace switch between smaps_rollup and smaps at runtime (say, based
      on the availability of smaps_rollup in a given kernel) with minimal
      fuss.
      
      By using smaps_rollup instead of smaps, a caller can avoid the
      significant overhead of formatting, reading, and parsing each of a large
      process's potentially very numerous memory mappings.  For sampling
      system_server's PSS in Android, we measured a 12x speedup, representing
      a savings of several hundred milliseconds.
      
      One alternative to a new per-process proc file would have been including
      PSS information in /proc/pid/status.  We considered this option but
      thought that PSS would be too expensive (by a few orders of magnitude)
      to collect relative to what's already emitted as part of
      /proc/pid/status, and slowing every user of /proc/pid/status for the
      sake of readers that happen to want PSS feels wrong.
      
      The code itself works by reusing the existing VMA-walking framework we
      use for regular smaps generation and keeping the mem_size_stats
      structure around between VMA walks instead of using a fresh one for each
      VMA.  In this way, summation happens automatically.  We let seq_file
      walk over the VMAs just as it does for regular smaps and just emit
      nothing to the seq_file until we hit the last VMA.
      
      Benchmarks:
      
          using smaps:
          iterations:1000 pid:1163 pss:220023808
          0m29.46s real 0m08.28s user 0m20.98s system
      
          using smaps_rollup:
          iterations:1000 pid:1163 pss:220702720
          0m04.39s real 0m00.03s user 0m04.31s system
      
      We're using the PSS samples we collect asynchronously for
      system-management tasks like fine-tuning oom_adj_score, memory use
      tracking for debugging, application-level memory-use attribution, and
      deciding whether we want to kill large processes during system idle
      maintenance windows.  Android has been using PSS for these purposes for
      a long time; as the average process VMA count has increased and
      devices become more efficiency-conscious, PSS-collection inefficiency
      has started to matter more.  IMHO, it'd be a lot safer to optimize the
      existing PSS-collection model, which has been fine-tuned over the years,
      instead of changing the memory tracking approach entirely to work around
      smaps-generation inefficiency.
      
      Tim said:
      
      : There are two main reasons why Android gathers PSS information:
      :
      : 1. Android devices can show the user the amount of memory used per
      :    application via the settings app.  This is a less important use case.
      :
      : 2. We log PSS to help identify leaks in applications.  We have found
      :    an enormous number of bugs (in the Android platform, in Google's own
      :    apps, and in third-party applications) using this data.
      :
      : To do this, system_server (the main process in Android userspace) will
      : sample the PSS of a process three seconds after it changes state (for
      : example, app is launched and becomes the foreground application) and about
      : every ten minutes after that.  The net result is that PSS collection is
      : regularly running on at least one process in the system (usually a few
      : times a minute while the screen is on, less when screen is off due to
      : suspend).  PSS of a process is an incredibly useful stat to track, and we
      : aren't going to get rid of it.  We've looked at some very hacky approaches
      : using RSS ("take the RSS of the target process, subtract the RSS of the
      : zygote process that is the parent of all Android apps") to reduce the
      : accounting time, but it regularly overestimated the memory used by 20+
      : percent.  Accordingly, I don't think that there's a good alternative to
      : using PSS.
      :
      : We started looking into PSS collection performance after we noticed random
      : frequency spikes while a phone's screen was off; occasionally, one of the
      : CPU clusters would ramp to a high frequency because there was 200-300ms of
      : constant CPU work from a single thread in the main Android userspace
      : process.  The work causing the spike (which is reasonable governor
      : behavior given the amount of CPU time needed) was always PSS collection.
      : As a result, Android is burning more power than we should be on PSS
      : collection.
      :
      : The other issue (and why I'm less sure about improving smaps as a
      : long-term solution) is that the number of VMAs per process has increased
      : significantly from release to release.  After trying to figure out why we
      : were seeing these 200-300ms PSS collection times on Android O but had not
      : noticed it in previous versions, we found that the number of VMAs in the
      : main system process increased by 50% from Android N to Android O (from
      : ~1800 to ~2700) and varying increases in every userspace process.  Android
      : M to N also had an increase in the number of VMAs, although not as much.
      : I'm not sure why this is increasing so much over time, but thinking about
      : ASLR and ways to make ASLR better, I expect that this will continue to
      : increase going forward.  I would not be surprised if we hit 5000 VMAs on
      : the main Android process (system_server) by 2020.
      :
      : If we assume that the number of VMAs is going to increase over time, then
      : doing anything we can do to reduce the overhead of each VMA during PSS
      : collection seems like the right way to go, and that means outputting an
      : aggregate statistic (to avoid whatever overhead there is per line in
      : writing smaps and in reading each line from userspace).
      
      Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
      Signed-off-by: Daniel Colascione <dancol@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: provide pid in userfault msg - add feat union · a36985d3
      Committed by Andrea Arcangeli
      No ABI change, but this will make it more explicit to software that ptid
      is only available if requested by passing UFFD_FEATURE_THREAD_ID to
      UFFDIO_API.  The fact that it's a union also self-documents that the
      presence of a ptid there shouldn't be taken for granted.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-7-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: provide pid in userfault msg · 9d4ac934
      Committed by Alexey Perevalov
      This could be useful for calculating per-vCPU downtime during postcopy
      live migration.  A side observer, or the application itself, will be
      informed about the proper task's sleep during userfaultfd processing.
      
      The thread id is provided when the user requests it by setting the
      UFFD_FEATURE_THREAD_ID bit in uffdio_api.features.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-6-aarcange@redhat.com
      Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: userfaultfd: add feature to request for a signal delivery · 2d6d6f5a
      Committed by Prakash Sangappa
      In some cases, the userfaultfd mechanism should just deliver a SIGBUS
      signal to the faulting process instead of the page-fault event.  Dealing
      with page-fault events using a monitor thread can be an overhead in these
      cases.  For example, applications like a database could use the
      signaling mechanism for robustness.
      
      The database uses hugetlbfs for performance reasons.  Files on hugetlbfs
      filesystem are created and huge pages allocated using fallocate() API.
      Pages are deallocated/freed using fallocate() hole punching support.
      These files are mmapped and accessed by many processes as shared memory.
      The database keeps track of which offsets in the hugetlbfs file have
      pages allocated.
      
      Any access to a mapped address over holes in the file, which can occur
      due to bugs in the application, is considered invalid, and the process is
      expected to simply receive a SIGBUS.  However, currently when a hole in
      the file is accessed via the mapped address, kernel/mm attempts to
      automatically allocate a page at page fault time, which implicitly fills
      the hole in the file.  This may not be the desired behavior for
      applications like the database that want to explicitly manage page
      allocations of hugetlbfs files.
      
      Using the userfaultfd mechanism with this support to get a signal, a
      database application can prevent pages from being allocated implicitly
      when processes access mapped addresses over holes in the file.
      
      This patch adds the UFFD_FEATURE_SIGBUS feature to the userfaultfd
      mechanism to request a SIGBUS signal.
      
      See following for previous discussion about the database requirement
      leading to this proposal as suggested by Andrea.
      
      http://www.spinics.net/lists/linux-mm/msg129224.html
      
      Link: http://lkml.kernel.org/r/1501552446-748335-2-git-send-email-prakash.sangappa@oracle.com
      Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename global_page_state to global_zone_page_state · c41f012a
      Committed by Michal Hocko
      global_page_state is error prone as a recent bug report pointed out [1].
      It only returns proper values for zone based counters as the enum it
      gets suggests.  We already have global_node_page_state so let's rename
      global_page_state to global_zone_page_state to be more explicit here.
      All existing users seem to be correct:
      
      $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
            2 NR_BOUNCE
            2 NR_FREE_CMA_PAGES
           11 NR_FREE_PAGES
            1 NR_KERNEL_STACK_KB
            1 NR_MLOCK
            2 NR_PAGETABLE
      
      This patch shouldn't introduce any functional change.
      
      [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
      
      Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/sync.c: remove unnecessary NULL f_mapping check in sync_file_range · de23abd1
      Committed by Jeff Layton
      fsync codepath assumes that f_mapping can never be NULL, but
      sync_file_range has a check for that.
      
      Remove the one from sync_file_range as I don't see how you'd ever get a
      NULL pointer in here.
      
      Link: http://lkml.kernel.org/r/20170525110509.9434-1-jlayton@redhat.com
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: report UFFDIO_ZEROPAGE as available for shmem VMAs · ce53e8e6
      Committed by Mike Rapoport
      Now that shmem VMAs can be filled with the zero page via userfaultfd, we
      can report that UFFDIO_ZEROPAGE is available for those VMAs.
      
      Link: http://lkml.kernel.org/r/1497939652-16528-7-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove nr_pages argument from pagevec_lookup{,_range}() · 397162ff
      Committed by Jan Kara
      All users of pagevec_lookup() and pagevec_lookup_range() now pass
      PAGEVEC_SIZE as a desired number of pages.
      
      Just drop the argument.
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: use pagevec_lookup_range() in page_cache_seek_hole_data() · 8338141f
      Committed by Jan Kara
      We want only pages from given range in page_cache_seek_hole_data().  Use
      pagevec_lookup_range() instead of pagevec_lookup() and remove
      unnecessary code.
      
      Note that the check for getting fewer pages than desired can be removed
      because index gets updated by pagevec_lookup_range().
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-9-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: use pagevec_lookup_range() in remove_inode_hugepages() · 48f2301c
      Committed by Jan Kara
      We want only pages from given range in remove_inode_hugepages().  Use
      pagevec_lookup_range() instead of pagevec_lookup().
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-8-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ext4: use pagevec_lookup_range() in writeback code · 2b85a617
      Committed by Jan Kara
      Both occurrences of pagevec_lookup() actually want only pages from a
      given range.  Use pagevec_lookup_range() for the lookup.
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-7-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ext4: use pagevec_lookup_range() in ext4_find_unwritten_pgoff() · dec0da7b
      Committed by Jan Kara
      Use pagevec_lookup_range() in ext4_find_unwritten_pgoff() since we are
      interested only in pages in the given range.  Simplify the logic as a
      result of not getting pages out of range and index getting automatically
      advanced.
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-6-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: fix performance regression in clean_bdev_aliases() · c10f778d
      Committed by Jan Kara
      Commit e64855c6 ("fs: Add helper to clean bdev aliases under a bh
      and use it") added a wrapper for clean_bdev_aliases() that invalidates
      bdev aliases underlying a single buffer head.
      
      However, this has caused a performance regression for the bonnie++
      benchmark on an ext4 filesystem when delayed allocation is turned off
      (ext3 mode) - average of 3 runs:
      
        Hmean SeqOut Char  164787.55 (  0.00%) 107189.06 (-34.95%)
        Hmean SeqOut Block 219883.89 (  0.00%) 168870.32 (-23.20%)
      
      The reason for this regression is that clean_bdev_aliases() is slower
      when called for a single block because pagevec_lookup() it uses will end
      up iterating through the radix tree until it finds a page (which may
      take a while) but we are only interested whether there's a page at a
      particular index.
      
      Fix the problem by using pagevec_lookup_range() instead which avoids the
      needless iteration.
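
The difference between the two lookups can be illustrated with a small userspace model. This is only a sketch: the radix tree is replaced by a flat array of slots, and `toy_tree`, `toy_lookup()` and `toy_lookup_range()` are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a sparse page-cache index: slot i is true if a page exists. */
struct toy_tree {
    const bool *present;
    size_t nslots;
};

/*
 * Scan forward from *index for the first populated slot no later than 'end'.
 * Returns true and stores the slot in *found on success.  *index is advanced
 * past the last slot examined and *visited counts the slots inspected.
 */
static bool toy_lookup_range(const struct toy_tree *t, size_t *index,
                             size_t end, size_t *found, size_t *visited)
{
    assert(t && index && found && visited);
    *visited = 0;
    while (*index < t->nslots && *index <= end) {
        size_t cur = (*index)++;

        (*visited)++;
        if (t->present[cur]) {
            *found = cur;
            return true;
        }
    }
    return false;
}

/* Unbounded variant: the same scan, limited only by the end of the tree. */
static bool toy_lookup(const struct toy_tree *t, size_t *index,
                       size_t *found, size_t *visited)
{
    return toy_lookup_range(t, index, t->nslots - 1, found, visited);
}
```

With a single populated slot far from the queried block, the unbounded scan walks every slot in between, while the ranged scan with start equal to end inspects exactly one slot and reports that nothing is there.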
      
      Fixes: e64855c6 ("fs: Add helper to clean bdev aliases under a bh and use it")
      Link: http://lkml.kernel.org/r/20170726114704.7626-5-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c10f778d
    • J
      mm: make pagevec_lookup() update index · d72dc8a2
      Jan Kara committed
      Make pagevec_lookup() (and underlying find_get_pages()) update index to
      the next page where iteration should continue.  Most callers want this
      and also pagevec_lookup_tag() already does this.
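
A minimal userspace sketch of why an index-updating lookup is convenient for callers. The flat array and the names `toy_batch_lookup()`, `toy_count_pages()` and `TOY_BATCH` are hypothetical stand-ins, not the kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BATCH 4

/*
 * Collect up to TOY_BATCH populated slots starting at *index.  On return,
 * *index points just past the last slot examined, so the next call resumes
 * where this one stopped.
 */
static size_t toy_batch_lookup(const bool *present, size_t nslots,
                               size_t *index, size_t out[TOY_BATCH])
{
    size_t n = 0;

    assert(present && index && out);
    while (*index < nslots && n < TOY_BATCH) {
        if (present[*index])
            out[n++] = *index;
        (*index)++;
    }
    return n;
}

/* Count all populated slots; the caller never recomputes the start index. */
static size_t toy_count_pages(const bool *present, size_t nslots)
{
    size_t index = 0, total = 0, n;
    size_t batch[TOY_BATCH];

    while ((n = toy_batch_lookup(present, nslots, &index, batch)) > 0)
        total += n;
    return total;
}
```

Without the update, each caller would have to derive the next start index from the last entry of the batch it just received, which is the kind of repeated bookkeeping this change removes.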
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d72dc8a2
    • J
      fscache: remove unused ->now_uncached callback · 26b433d0
      Jan Kara committed
      Patch series "Ranged pagevec lookup", v2.
      
      In this series I make pagevec_lookup() update the index (to be
      consistent with pagevec_lookup_tag() and also as a preparation for
      ranged lookups), provide ranged variant of pagevec_lookup() and use it
      in places where it makes sense.  This not only removes some common code
      but is also a measurable performance win for some use cases (see patch
      4/10) where the radix tree is sparse and searching for and grabbing a
      page after the end of the range has measurable overhead.
      
      This patch (of 10):
      
      The callback doesn't ever get called.  Remove it.
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26b433d0
    • J
      ocfs2: clean up some dead code · 964f14a0
      Jun Piao committed
      Clean up some unused functions and parameters.
      
      Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      964f14a0
    • J
      ocfs2: make ocfs2_set_acl() static · 01ffb56b
      Jan Kara committed
      The function is never called outside of fs/ocfs2/acl.c.
      
      Link: http://lkml.kernel.org/r/20170801141252.19675-2-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01ffb56b
    • N
      dax: initialize variable pfn before using it · 2f52074d
      Nicolas Iooss committed
      dax_pmd_insert_mapping() contains the following code:
      
              pfn_t pfn;
              if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
                  goto fallback;
              /* ... */
          fallback:
            trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
      
      When bdev_dax_pgoff() fails (that is, when the condition in the if
      statement is true), the function jumps to the fallback label and calls
      trace_dax_pmd_insert_mapping_fallback() with an uninitialized pfn value.
      
      This issue has been found while building the kernel with clang.  The
      compiler reported:
      
          fs/dax.c:1280:6: error: variable 'pfn' is used uninitialized
          whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
              if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          fs/dax.c:1310:60: note: uninitialized use occurs here
            trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
                                                                           ^~~
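
The flagged pattern, and one way to avoid it, can be sketched in userspace C. Everything here is hypothetical illustration (`toy_insert_mapping()`, `toy_trace()`, a plain `unsigned long` instead of `pfn_t`); it assumes the fix is to give the variable a defined value at its declaration and is not the kernel code itself.

```c
#include <assert.h>

/* Hypothetical stand-in for the value handed to the tracepoint. */
static unsigned long last_traced_pfn;

static void toy_trace(unsigned long pfn)
{
    last_traced_pfn = pfn;
}

/*
 * Returns 0 on success and -1 on fallback.  'fail_early' models the early
 * offset lookup failing before pfn has been filled in.
 */
static int toy_insert_mapping(int fail_early)
{
    /* Initialised at declaration, so the fallback path reads a defined value. */
    unsigned long pfn = 0;

    if (fail_early)
        goto fallback;

    pfn = 0x1234;   /* normally produced by a later lookup step */
    toy_trace(pfn);
    return 0;

fallback:
    toy_trace(pfn);
    return -1;
}
```

If the initialiser is dropped, the early `goto` reaches `toy_trace(pfn)` with an indeterminate value, which is the situation the clang diagnostic above describes.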
      
      Link: http://lkml.kernel.org/r/20170903083000.587-1-nicolas.iooss_linux@m4x.org
      Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f52074d
    • R
      dax: use PG_PMD_COLOUR instead of open coding · 917f3452
      Ross Zwisler committed
      Use ~PG_PMD_COLOUR in dax_entry_waitqueue() instead of open coding an
      equivalent page offset mask.
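
A small sketch of the equivalence, assuming 4k pages and 2 MiB PMDs (shift values 12 and 21). The `TOY_` macros and both helper functions are hypothetical names chosen for illustration; the exact open-coded expression in the original code may have differed in form.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12UL                     /* assumed: 4k pages */
#define TOY_PMD_SHIFT  21UL                     /* assumed: 2 MiB PMDs */
#define TOY_PMD_SIZE   (1UL << TOY_PMD_SHIFT)

/* Pages per PMD minus one: the low page-index bits inside one PMD. */
#define TOY_PG_PMD_COLOUR ((TOY_PMD_SIZE >> TOY_PAGE_SHIFT) - 1)

/* Open-coded mask: round a page index down to its PMD-aligned base. */
static unsigned long pmd_base_open_coded(unsigned long index)
{
    return index & ~((1UL << (TOY_PMD_SHIFT - TOY_PAGE_SHIFT)) - 1);
}

/* Same operation expressed through the named constant. */
static unsigned long pmd_base_colour(unsigned long index)
{
    return index & ~TOY_PG_PMD_COLOUR;
}
```

Under these assumptions the colour mask is 511, so page index 1000 rounds down to 512 with either form.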
      
      Link: http://lkml.kernel.org/r/20170822222436.18926-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Slusarz, Marcin" <marcin.slusarz@intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      917f3452
    • R
      dax: explain how read(2)/write(2) addresses are validated · a2e050f5
      Ross Zwisler committed
      Add a comment explaining how the user addresses provided to read(2) and
      write(2) are validated in the DAX I/O path.
      
      We call dax_copy_from_iter() or copy_to_iter() on these without calling
      access_ok() first in the DAX code, and there was a concern that the user
      might be able to read/write to arbitrary kernel addresses with this
      path.
      
      Link: http://lkml.kernel.org/r/20170816173615.10098-1-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2e050f5
    • R
      dax: move all DAX radix tree defs to fs/dax.c · 527b19d0
      Ross Zwisler committed
      Now that we no longer insert struct page pointers in DAX radix trees the
      page cache code no longer needs to know anything about DAX exceptional
      entries.  Move all the DAX exceptional entry definitions from dax.h to
      fs/dax.c.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-6-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      527b19d0
    • R
      dax: remove DAX code from page_cache_tree_insert() · d01ad197
      Ross Zwisler committed
      Now that we no longer insert struct page pointers in DAX radix trees we
      can remove the special casing for DAX in page_cache_tree_insert().
      
      This also allows us to make dax_wake_mapping_entry_waiter() local to
      fs/dax.c, removing it from dax.h.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d01ad197
    • R
      dax: use common 4k zero page for dax mmap reads · 91d25ba8
      Ross Zwisler committed
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.
      
      This has three major drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory. This is easily visible by looking at the overall
         memory consumption of the system or by looking at /proc/[pid]/smaps:
      
      	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:             1048576 kB
      	Pss:             1048576 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:   1048576 kB
      	Private_Dirty:         0 kB
      	Referenced:      1048576 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it. Here
         are the average latencies of dax_load_hole() as measured by ftrace on
         a random test box:
      
          Old method, using zeroed page cache pages:	3.4 us
          New method, using the common 4k zero page:	0.8 us
      
         This was the average latency over 1 GiB of sequential reads done by
         this simple fio script:
      
           [global]
           size=1G
           filename=/root/dax/data
           fallocate=none
           [io]
           rw=read
           ioengine=mmap
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      Solve these issues by following the lead of the DAX PMD code and using a
      common 4k zero page instead.  As with the PMD code we will now insert a
      DAX exceptional entry into the radix tree instead of a struct page
      pointer which allows us to remove all the special casing in the DAX
      code.
      
      Note that we do still pretty aggressively check for regular pages in the
      DAX radix tree, especially where we take action based on the bits set in
      the page.  If we ever find a regular page in our radix tree now that
      most likely means that someone besides DAX is inserting pages (which has
      happened lots of times in the past), and we want to find that out early
      and fail loudly.
      
      This solution also removes the extra memory consumption.  Here is that
      same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
      code:
      
      	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:                   0 kB
      	Pss:                   0 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:         0 kB
      	Private_Dirty:         0 kB
      	Referenced:            0 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      Overall system memory consumption is similarly improved.
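
The figures quoted above can be cross-checked with simple arithmetic. This is a back-of-the-envelope sketch: the function names are hypothetical, and the total fault time assumes exactly one fault per 4k page and counts only the measured dax_load_hole() latency.

```c
#include <assert.h>

/* Rss in kB when every hole page that is read gets its own zeroed page. */
static unsigned long long toy_rss_kb_private_pages(unsigned long long bytes_read,
                                                   unsigned long long page_size)
{
    assert(page_size > 0);
    unsigned long long pages = bytes_read / page_size;

    return pages * page_size / 1024;
}

/* Total fault-handling time in ns, assuming one fault per page. */
static unsigned long long toy_fault_time_ns(unsigned long long bytes_read,
                                            unsigned long long page_size,
                                            unsigned long long ns_per_fault)
{
    assert(page_size > 0);
    return bytes_read / page_size * ns_per_fault;
}
```

For 1 GiB of 4k reads the first helper gives 1048576 kB, matching the old Rss line. With the quoted latencies of 3.4 us and 0.8 us, the second gives roughly 891 ms versus 210 ms spent in the hole-fault path under the stated assumption.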
      
      Another major change is that we remove dax_pfn_mkwrite() from our fault
      flow, and instead rely on the page fault itself to make the PTE dirty
      and writeable.  The following description from the patch adding the
      vm_insert_mixed_mkwrite() call explains this a little more:
      
         "To be able to use the common 4k zero page in DAX we need to have our
          PTE fault path look more like our PMD fault path where a PTE entry
          can be marked as dirty and writeable as it is first inserted rather
          than waiting for a follow-up dax_pfn_mkwrite() =>
          finish_mkwrite_fault() call.
      
          Right now we can rely on having a dax_pfn_mkwrite() call because we
          can distinguish between these two cases in do_wp_page():
      
                  case 1: 4k zero page => writable DAX storage
                  case 2: read-only DAX storage => writeable DAX storage
      
          This distinction is made via vm_normal_page(). vm_normal_page()
          returns false for the common 4k zero page, though, just as it does
          for DAX ptes. Instead of special casing the DAX + 4k zero page case
          we will simplify our DAX PTE page fault sequence so that it matches
          our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
          We will instead use dax_iomap_fault() to handle write-protection
          faults.
      
          This means that insert_pfn() needs to follow the lead of
          insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
          'mkwrite' is set insert_pfn() will do the work that was previously
          done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
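
The role of the 'mkwrite' flag described in the quote can be modelled with a toy page-table entry. This is a heavily simplified sketch: `struct toy_pte` and `toy_insert_pfn()` are hypothetical, and the real insert_pfn() deals with locking, existing mappings and architecture-specific PTE helpers that are ignored here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page-table entry: only the bits relevant to this discussion. */
struct toy_pte {
    unsigned long pfn;
    bool present;
    bool writable;
    bool dirty;
};

/*
 * Install 'pfn' if nothing is mapped yet.  When 'mkwrite' is set, also mark
 * the entry dirty and writable in the same step, instead of relying on a
 * separate follow-up call to upgrade it.
 */
static void toy_insert_pfn(struct toy_pte *pte, unsigned long pfn, bool mkwrite)
{
    assert(pte);
    if (!pte->present) {
        pte->pfn = pfn;
        pte->present = true;
        pte->writable = false;
        pte->dirty = false;
    }
    if (mkwrite) {
        pte->writable = true;
        pte->dirty = true;
    }
}
```

A read fault would call this with `mkwrite` false and get a read-only entry; a write fault, or a later write-protection fault on the same entry, would pass `mkwrite` true and end up with a dirty, writable entry without a second helper.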
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91d25ba8
    • R
      dax: relocate some dax functions · e30331ff
      Ross Zwisler committed
      dax_load_hole() will soon need to call dax_insert_mapping_entry(), so it
      needs to be moved lower in dax.c so the definition exists.
      
      dax_wake_mapping_entry_waiter() will soon be removed from dax.h and be
      made static to dax.c, so we need to move its definition above all its
      callers.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-3-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e30331ff
  4. 05 Sep, 2017 4 commits
    • M
      media: get rid of removed DMX_GET_CAPS and DMX_SET_SOURCE leftovers · 4cd7d6c9
      Mauro Carvalho Chehab committed
      Those two ioctls were never used within the kernel. Still, there
      used to be compat32 code for them (and an #if 0 block in the core).
      
      Get rid of them.
      
      Fixes: 286fe1ca ("media: dmx.h: get rid of DMX_GET_CAPS")
      Fixes: 13adefbe ("media: dmx.h: get rid of DMX_SET_SOURCE")
      Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      4cd7d6c9
    • S
      cifs: Check for timeout on Negotiate stage · 76e75270
      Samuel Cabrero committed
      Some servers seem to accept connections while booting but never send
      the SMBNegotiate response nor close the connection, causing all
      processes accessing the share to hang in uninterruptible sleep state.
      
      This happens when the cifs_demultiplex_thread detects that the server
      is unresponsive, releases the socket and starts trying to reconnect.
      At some point, the faulty server will accept the socket and the TCP
      status will be set to NeedNegotiate. The first issued command accessing
      the share will start the negotiation (pid 5828 below), but the response
      will never arrive so other commands will be blocked waiting on the mutex
      (pid 55352).
      
      This patch also checks for unresponsive servers during the negotiate
      stage, releasing the socket and reconnecting if the response is not
      received, and checks the TCP state again once the mutex is acquired.
      
      PID: 55352  TASK: ffff880fd6cc02c0  CPU: 0   COMMAND: "ls"
       #0 [ffff880fd9add9f0] schedule at ffffffff81467eb9
       #1 [ffff880fd9addb38] __mutex_lock_slowpath at ffffffff81468fe0
       #2 [ffff880fd9addba8] mutex_lock at ffffffff81468b1a
       #3 [ffff880fd9addbc0] cifs_reconnect_tcon at ffffffffa042f905 [cifs]
       #4 [ffff880fd9addc60] smb_init at ffffffffa042faeb [cifs]
       #5 [ffff880fd9addca0] CIFSSMBQPathInfo at ffffffffa04360b5 [cifs]
       ....
      
      which is waiting on a mutex owned by:
      
      PID: 5828   TASK: ffff880fcc55e400  CPU: 0   COMMAND: "xxxx"
       #0 [ffff880fbfdc19b8] schedule at ffffffff81467eb9
       #1 [ffff880fbfdc1b00] wait_for_response at ffffffffa044f96d [cifs]
       #2 [ffff880fbfdc1b60] SendReceive at ffffffffa04505ce [cifs]
       #3 [ffff880fbfdc1bb0] CIFSSMBNegotiate at ffffffffa0438d79 [cifs]
       #4 [ffff880fbfdc1c50] cifs_negotiate_protocol at ffffffffa043b383 [cifs]
       #5 [ffff880fbfdc1c80] cifs_reconnect_tcon at ffffffffa042f911 [cifs]
       #6 [ffff880fbfdc1d20] smb_init at ffffffffa042faeb [cifs]
       #7 [ffff880fbfdc1d60] CIFSSMBQFSInfo at ffffffffa0434eb0 [cifs]
       ....
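
The idea of extending the unresponsiveness check to the negotiate stage can be sketched as follows. This is a toy model under stated assumptions: the enum, the function name and the "two echo intervals without a response" threshold are illustrative choices, not taken from the patch text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical connection states for the sketch. */
enum toy_tcp_status { TOY_GOOD, TOY_NEED_NEGOTIATE, TOY_NEED_RECONNECT };

/*
 * Returns true when no response has arrived for at least two echo intervals
 * while the connection is either established or still negotiating.  Times
 * are in arbitrary ticks; the threshold is an assumption of this sketch.
 */
static bool toy_server_unresponsive(enum toy_tcp_status status,
                                    unsigned long now,
                                    unsigned long last_response,
                                    unsigned long echo_interval)
{
    assert(now >= last_response);
    if (status != TOY_GOOD && status != TOY_NEED_NEGOTIATE)
        return false;
    return now - last_response >= 2 * echo_interval;
}
```

With a check like this covering the negotiating state, the task stuck in wait_for_response in the second trace would eventually see the connection torn down and retried, rather than holding the mutex indefinitely.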
      Signed-off-by: Samuel Cabrero <scabrero@suse.de>
      Reviewed-by: Aurélien Aptel <aaptel@suse.de>
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      76e75270
    • R
      cifs: Add support for writing attributes on SMB2+ · 5517554e
      Ronnie Sahlberg committed
      This adds support for writing extended attributes on SMB2+ shares.
      Attributes can be written using the setfattr command.
      
      RH-bz: 1110709
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      5517554e
    • R
      cifs: Add support for reading attributes on SMB2+ · 95907fea
      Ronnie Sahlberg committed
      SMB1 already has support to read attributes. This adds similar support
      to SMB2+.
      
      With this patch, tools such as 'getfattr' will now work with SMB2+ shares.
      
      RH-bz: 1110709
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      95907fea
  5. 04 Sep, 2017 2 commits
  6. 02 Sep, 2017 4 commits