1. 19 Feb, 2016 1 commit
    • mm/core: Do not enforce PKEY permissions on remote mm access · 1b2ee126
      Committed by Dave Hansen
      We try to enforce protection keys in software the same way that we
      do in hardware.  (See long example below).
      
      But, we only want to do this when accessing our *own* process's
      memory.  If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
      tried to PTRACE_POKE a target process which just happened to have
      some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
      debugger access to that memory.  PKRU is fundamentally a
      thread-local structure and we do not want to enforce it on access
      to _another_ thread's data.
      
      This gets especially tricky when we have workqueues or other
      delayed-work mechanisms that might run in a random process's context.
      We can check that we only enforce pkeys when operating on our *own* mm,
      but delayed work gets performed when a random user context is active.
      We might end up with a situation where a delayed-work gup fails when
      running randomly under its "own" task but succeeds when running under
      another process.  We want to avoid that.
      
      To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
      fault flag: FAULT_FLAG_REMOTE.  They indicate that we are
      walking an mm which is not guaranteed to be the same as
      current->mm and should not be subject to protection key
      enforcement.
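
As a rough user-space illustration of the rule described above (not the kernel's implementation), the check can be modelled as a function that skips protection-key enforcement whenever the walk is flagged as remote. The names `SKETCH_FOLL_REMOTE`, `sketch_pkru_access_disabled()` and `sketch_walk_permitted()` are hypothetical, and the flag value is an assumption chosen for this sketch; the only hardware fact relied on is that PKRU keeps the access-disable bit for key i at bit 2*i.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag value for this sketch; the kernel's FOLL_REMOTE differs. */
#define SKETCH_FOLL_REMOTE 0x1u

/* PKRU holds two bits per key; the access-disable (AD) bit is bit 2*pkey. */
static bool sketch_pkru_access_disabled(uint32_t pkru, unsigned int pkey)
{
	return (pkru >> (2 * pkey)) & 1u;
}

/*
 * Decide whether a software page-table walk may touch memory tagged with
 * pkey. Remote walks skip the check, because the PKRU value belongs to the
 * thread that is running, not to the mm that is being walked.
 */
static bool sketch_walk_permitted(uint32_t current_pkru, unsigned int pkey,
				  unsigned int gup_flags)
{
	if (gup_flags & SKETCH_FOLL_REMOTE)
		return true;
	return !sketch_pkru_access_disabled(current_pkru, pkey);
}
```

With PKRU[6].AD set in the caller, a walk without the remote flag is refused for pkey 6, while the same walk with the flag set goes through, which is the debugger case from the commit message.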
      
      Thanks to Jerome Glisse for pointing out this scenario.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: iommu@lists.linux-foundation.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 18 Feb, 2016 1 commit
    • mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys · 33a709b2
      Committed by Dave Hansen
      Today, for normal faults and page table walks, we check the VMA
      and/or PTE to ensure that it is compatible with the action.  For
      instance, if we get a write fault on a non-writeable VMA, we
      SIGSEGV.
      
      We try to do the same thing for protection keys.  Basically, we
      try to make sure that if a user does this:
      
      	mprotect(ptr, size, PROT_NONE);
      	*ptr = foo;
      
      they see the same effects with protection keys when they do this:
      
      	mprotect(ptr, size, PROT_READ|PROT_WRITE);
      	set_pkey(ptr, size, 4);
      	wrpkru(0xffffff3f); // access disable pkey 4
      	*ptr = foo;
      
      The state to do that checking is in the VMA, but we also
      sometimes have to do it on the page tables only, like when doing
      a get_user_pages_fast() where we have no VMA.
      
      We add two functions and expose them to generic code:
      
      	arch_pte_access_permitted(pte_flags, write)
      	arch_vma_access_permitted(vma, write)
      
      These are, of course, backed up in x86 arch code with checks
      against the PTE or VMA's protection key.
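
A minimal user-space sketch of what such a check might look like, assuming PKRU is passed in explicitly rather than read from the CPU. The struct and function names (`sketch_vma`, `sketch_pkru_allows()`, `sketch_vma_access_permitted()`) are hypothetical stand-ins, not the kernel's definitions. It relies on the PKRU layout of two bits per key: access-disable at bit 2*i and write-disable at bit 2*i+1.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PKRU_AD 0x1u /* access-disable bit within a key's 2-bit field */
#define SKETCH_PKRU_WD 0x2u /* write-disable bit within a key's 2-bit field */

/* Hypothetical, heavily reduced stand-in for a VMA: only the key matters here. */
struct sketch_vma {
	unsigned int pkey;
};

/* Check one access against the 2-bit PKRU field that belongs to pkey. */
static bool sketch_pkru_allows(uint32_t pkru, unsigned int pkey, bool write)
{
	uint32_t bits = (pkru >> (2 * pkey)) & 0x3u;

	if (bits & SKETCH_PKRU_AD)
		return false;		/* no access of any kind */
	if (write && (bits & SKETCH_PKRU_WD))
		return false;		/* reads allowed, writes denied */
	return true;
}

/* Model of a VMA-level check: look up the VMA's key, then consult PKRU. */
static bool sketch_vma_access_permitted(const struct sketch_vma *vma,
					uint32_t pkru, bool write)
{
	return sketch_pkru_allows(pkru, vma->pkey, write);
}
```

A PTE-level variant would differ only in where the key comes from (the page-table entry instead of the VMA), which is why both entry points can share one PKRU helper in this sketch.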
      
      But, there are also cases where we do not want to respect
      protection keys.  When we ptrace(), for instance, we do not want
      to apply the tracer's PKRU permissions to the PTEs from the
      process being traced.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20160212210219.14D5D715@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 16 Feb, 2016 1 commit
    • mm/gup: Introduce get_user_pages_remote() · 1e987790
      Committed by Dave Hansen
      For protection keys, we need to understand whether protections
      should be enforced in software or not.  In general, we enforce
      protections when working on our own task, but not when working on others.
      We call these "current" and "remote" operations.
      
      This patch introduces a new get_user_pages() variant:
      
              get_user_pages_remote()
      
      Which is a replacement for when get_user_pages() is called on
      non-current tsk/mm.
      
      We also introduce a new gup flag: FOLL_REMOTE which can be used
      for the "__" gup variants to get this new behavior.
      
      The uprobes is_trap_at_addr() location holds mmap_sem and
      calls get_user_pages(current->mm) on an instruction address.  This
      makes it a pretty unique gup caller.  Being an instruction access
      and also really originating from the kernel (vs. the app), I opted
      to consider this a 'remote' access where protection keys will not
      be enforced.
      
      Without protection keys, this patch should not change any behavior.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: jack@suse.cz
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 04 Feb, 2016 1 commit
  5. 01 Feb, 2016 1 commit
  6. 21 Jan, 2016 1 commit
  7. 16 Jan, 2016 12 commits
  8. 15 Jan, 2016 2 commits
    • mm: allow GFP_{FS,IO} for page_cache_read page cache allocation · c20cd45e
      Committed by Michal Hocko
      page_cache_read has been historically using page_cache_alloc_cold to
      allocate a new page.  This means that mapping_gfp_mask is used as the
      base for the gfp_mask.  Many filesystems are setting this mask to
      GFP_NOFS to prevent fs recursion issues.  page_cache_read is called
      from the vm_operations_struct::fault() context during the page fault.
      This context doesn't need the reclaim protection normally.
      
      ceph and ocfs2 which call filemap_fault from their fault handlers seem
      to be OK because they are not taking any fs lock before invoking generic
      implementation.  xfs which takes XFS_MMAPLOCK_SHARED is safe from the
      reclaim recursion POV because this lock serializes truncate and punch
      hole with the page faults and it doesn't get involved in the reclaim.
      
      There is simply no reason to deliberately use a weaker allocation
      context when a __GFP_FS | __GFP_IO can be used.  The GFP_NOFS protection
      might even be harmful.  There is a push to fail GFP_NOFS allocations
      rather than loop within the allocator indefinitely with a very limited
      reclaim ability.  Once we start failing those requests the OOM killer
      might be triggered prematurely because the page cache allocation failure
      is propagated up the page fault path and ends up in
      pagefault_out_of_memory.
      
      We cannot play with mapping_gfp_mask directly because that would be racy
      wrt. parallel page faults and it might interfere with other users who
      really rely on NOFS semantics from the stored gfp_mask.  The mask is also
      inode proper so it would even be a layering violation.  What we can do
      instead is to push the gfp_mask into struct vm_fault and allow fs layer
      to overwrite it should the callback need to be called with a different
      allocation context.
      
      Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
      because this should be safe from the page fault path normally.  Why do
      we care about mapping_gfp_mask at all then? Because this doesn't hold
      only reclaim protection flags but it also might contain zone and
      movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have
      to respect those.
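
The default described above can be illustrated with a small bitmask sketch. The flag values below are hypothetical placeholders chosen for this example (the kernel's real `__GFP_*` values differ and vary between versions), and `sketch_fault_gfp_mask()` is an assumed helper name; the point is only that OR-ing in the FS and IO bits leaves the mapping's zone and movability bits untouched.

```c
#include <stdint.h>

/* Hypothetical bit values for illustration only; not the kernel's. */
#define SKETCH_GFP_IO      0x01u
#define SKETCH_GFP_FS      0x02u
#define SKETCH_GFP_DMA32   0x04u
#define SKETCH_GFP_MOVABLE 0x08u

/*
 * Default mask for the page-fault path: start from the mapping's own mask,
 * so zone and movability restrictions survive, then allow FS and IO reclaim.
 */
static uint32_t sketch_fault_gfp_mask(uint32_t mapping_gfp_mask)
{
	return mapping_gfp_mask | SKETCH_GFP_FS | SKETCH_GFP_IO;
}
```

A mapping whose stored mask lacks the FS bit (a NOFS-style mask) still gets FS and IO in the fault path, while a DMA32 or movable restriction carried in the same mask is preserved.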
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Jan Kara <jack@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, shmem: add internal shmem resident memory accounting · eca56ff9
      Committed by Jerome Marchand
      Currently looking at /proc/<pid>/status or statm, there is no way to
      distinguish shmem pages from pages mapped to a regular file (shmem pages
      are mapped to /dev/zero), even though their implication in actual memory
      use is quite different.
      
      The internal accounting currently counts shmem pages together with
      regular files.  As a preparation to extend the userspace interfaces,
      this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
      shmem pages separately from MM_FILEPAGES.  The next patch will expose it
      to userspace - this patch doesn't change the exported values yet, by
      adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
      used before.  The only user-visible change after this patch is the OOM
      killer message that separates the reported "shmem-rss" from "file-rss".
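
The accounting split can be sketched in user space as follows. The enum and struct names are hypothetical, prefixed stand-ins modelled on the counters named above, not the kernel's definitions. The helper shows how a value that used to be read from the file counter alone stays unchanged for existing consumers once shmem pages are tracked separately.

```c
/* Hypothetical counter indices, modelled on the names in the text above. */
enum {
	SKETCH_MM_FILEPAGES,
	SKETCH_MM_ANONPAGES,
	SKETCH_MM_SWAPENTS,
	SKETCH_MM_SHMEMPAGES,
	SKETCH_NR_MM_COUNTERS
};

struct sketch_rss_stat {
	long count[SKETCH_NR_MM_COUNTERS];
};

/*
 * Value reported where the file counter was used before the split:
 * shmem pages are added back in, so the exported number does not change.
 */
static long sketch_exported_file_rss(const struct sketch_rss_stat *stat)
{
	return stat->count[SKETCH_MM_FILEPAGES] +
	       stat->count[SKETCH_MM_SHMEMPAGES];
}
```

Only code that wants the distinction, such as the OOM killer message mentioned above, would read the two counters individually.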
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 12 Jan, 2016 1 commit
    • mm: Add vm_insert_pfn_prot() · 1745cbc5
      Committed by Andy Lutomirski
      The x86 vvar vma contains pages with differing cacheability
      flags.  x86 currently implements this by manually inserting all
      the ptes using (io_)remap_pfn_range when the vma is set up.
      
      x86 wants to move to using .fault with VM_FAULT_NOPAGE to set up
      the mappings as needed.  The correct API to use to insert a pfn
      in .fault is vm_insert_pfn(), but vm_insert_pfn() can't override the
      vma's cache mode, and the HPET page in particular needs to be
      uncached despite the fact that the rest of the VMA is cached.
      
      Add vm_insert_pfn_prot() to support varying cacheability within
      the same non-COW VMA in a more sane manner.
      
      x86 could alternatively use multiple VMAs, but that's messy,
      would break CRIU, and would create unnecessary VMAs that would
      waste memory.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/d2938d1eb37be7a5e4f86182db646551f11e45aa.1451446564.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 19 Nov, 2015 1 commit
    • mm, dax: fix DAX deadlocks (COW fault) · 0df9d41a
      Committed by Yigal Korman
      DAX handling of COW faults has the wrong locking sequence:
      	dax_fault does i_mmap_lock_read
      	do_cow_fault does i_mmap_unlock_write
      
      Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
      commit[3].
      
      Original COW locking logic was introduced by Matthew here[4].
      
      This should be applied to v4.3 as well.
      
      [1] 0f90cc66 mm, dax: fix DAX deadlocks
      [2] 52a2b53f mm, dax: use i_mmap_unlock_write() in do_cow_fault()
      [3] 84317297 dax: fix race between simultaneous faults
      [4] 2e4cdab0 mm: allow page fault handlers to perform the COW
      
      Cc: <stable@vger.kernel.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Yigal Korman <yigal@plexistor.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  11. 17 Oct, 2015 1 commit
  12. 11 Sep, 2015 1 commit
  13. 09 Sep, 2015 5 commits
  14. 05 Sep, 2015 2 commits
  15. 10 Jul, 2015 1 commit
  16. 25 Jun, 2015 1 commit
    • mm, memcg: Try charging a page before setting page up to date · eb3c24f3
      Committed by Mel Gorman
      Historically memcg overhead was high even if memcg was unused.  This has
      improved a lot but it still showed up in a profile summary as being a
      problem.
      
      /usr/src/linux-4.0-vanilla/mm/memcontrol.c                           6.6441   395842
        mem_cgroup_try_charge                                                        2.950%   175781
        __mem_cgroup_count_vm_event                                                  1.431%    85239
        mem_cgroup_page_lruvec                                                       0.456%    27156
        mem_cgroup_commit_charge                                                     0.392%    23342
        uncharge_list                                                                0.323%    19256
        mem_cgroup_update_lru_size                                                   0.278%    16538
        memcg_check_events                                                           0.216%    12858
        mem_cgroup_charge_statistics.isra.22                                         0.188%    11172
        try_charge                                                                   0.150%     8928
        commit_charge                                                                0.141%     8388
        get_mem_cgroup_from_mm                                                       0.121%     7184
      
      That is showing that 6.64% of system CPU cycles were in memcontrol.c and
      dominated by mem_cgroup_try_charge.  The annotation shows that the bulk
      of the cost was checking PageSwapCache which is expected to be cache hot
      but is very expensive.  The problem appears to be that __SetPageUptodate
      is called just before the check which is a write barrier.  It is
      required to make sure struct page and page data is written before the
      PTE is updated and the data visible to userspace.  memcg charging does
      not require or need the barrier but gets unfairly hit with the cost so
      this patch attempts the charging before the barrier.  Aside from the
      accidental cost to memcg there is the added benefit that the barrier is
      avoided if the page cannot be charged.  When applied the relevant
      profile summary is as follows.
      
      /usr/src/linux-4.0-chargefirst-v2r1/mm/memcontrol.c                  3.7907   223277
        __mem_cgroup_count_vm_event                                                  1.143%    67312
        mem_cgroup_page_lruvec                                                       0.465%    27403
        mem_cgroup_commit_charge                                                     0.381%    22452
        uncharge_list                                                                0.332%    19543
        mem_cgroup_update_lru_size                                                   0.284%    16704
        get_mem_cgroup_from_mm                                                       0.271%    15952
        mem_cgroup_try_charge                                                        0.237%    13982
        memcg_check_events                                                           0.222%    13058
        mem_cgroup_charge_statistics.isra.22                                         0.185%    10920
        commit_charge                                                                0.140%     8235
        try_charge                                                                   0.131%     7716
      
      That brings the overhead down to 3.79% and leaves the memcg fault
      accounting to the root cgroup but it's an improvement.  The difference
      in headline performance of the page fault microbench is marginal as
      memcg is such a small component of it.
      
      pft faults
                                             4.0.0                  4.0.0
                                           vanilla            chargefirst
      Hmean    faults/cpu-1 1443258.1051 (  0.00%) 1509075.7561 (  4.56%)
      Hmean    faults/cpu-3 1340385.9270 (  0.00%) 1339160.7113 ( -0.09%)
      Hmean    faults/cpu-5  875599.0222 (  0.00%)  874174.1255 ( -0.16%)
      Hmean    faults/cpu-7  601146.6726 (  0.00%)  601370.9977 (  0.04%)
      Hmean    faults/cpu-8  510728.2754 (  0.00%)  510598.8214 ( -0.03%)
      Hmean    faults/sec-1 1432084.7845 (  0.00%) 1497935.5274 (  4.60%)
      Hmean    faults/sec-3 3943818.1437 (  0.00%) 3941920.1520 ( -0.05%)
      Hmean    faults/sec-5 3877573.5867 (  0.00%) 3869385.7553 ( -0.21%)
      Hmean    faults/sec-7 3991832.0418 (  0.00%) 3992181.4189 (  0.01%)
      Hmean    faults/sec-8 3987189.8167 (  0.00%) 3986452.2204 ( -0.02%)
      
      The improvement is only visible in the single-threaded case.  The overhead
      is still there at higher thread counts, but other factors dominate.
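
The bracketed percentages in the table above appear to be relative changes against the vanilla column. A small sketch of that arithmetic, using a hypothetical helper name `sketch_pct_change()`, is:

```c
/*
 * Relative change of a patched measurement against its baseline, in percent.
 * Positive values mean the patched kernel handled more faults.
 */
static double sketch_pct_change(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}
```

Applied to the faults/cpu-1 row (1443258.1051 versus 1509075.7561) this gives roughly 4.56, and for the faults/cpu-3 row roughly -0.09, in line with the values printed in the table.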
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 24 Jun, 2015 1 commit
  18. 19 May, 2015 1 commit
    • sched/preempt, mm/fault: Trigger might_sleep() in might_fault() with disabled pagefaults · 9ec23531
      Committed by David Hildenbrand
      Commit 662bbcb2 ("mm, sched: Allow uaccess in atomic with
      pagefault_disable()") removed might_sleep() checks for all user access
      code (that uses might_fault()).
      
      The reason was to disable wrong "sleep in atomic" warnings in the
      following scenario:
      
          pagefault_disable()
          rc = copy_to_user(...)
          pagefault_enable()
      
      Which is valid, as pagefault_disable() increments the preempt counter
      and therefore disables the pagefault handler. copy_to_user() will not
      sleep and return an error code if a page is not available.
      
      However, as all might_sleep() checks are removed,
      CONFIG_DEBUG_ATOMIC_SLEEP would no longer detect the following scenario:
      
          spin_lock(&lock);
          rc = copy_to_user(...)
          spin_unlock(&lock)
      
      If the kernel is compiled with preemption turned on, preempt_disable()
      will make in_atomic() detect disabled preemption. The fault handler would
      correctly never sleep on user access.
      However, with preemption turned off, preempt_disable() is usually a NOP
      (with !CONFIG_PREEMPT_COUNT), therefore in_atomic() will not be able to
      detect disabled preemption nor disabled pagefaults. The fault handler
      could sleep.
      We really want to enable CONFIG_DEBUG_ATOMIC_SLEEP checks for user access
      functions again, otherwise we can end up with horrible deadlocks.
      
      The root of all evil is that pagefault_disable() acts almost as
      preempt_disable(), depending on preemption being turned on/off.
      
      As we now have pagefault_disabled(), we can use it to distinguish
      whether user access functions might sleep.
      
      Convert might_fault() into a macro that calls __might_fault(), to
      allow proper file + line messages in case of a might_sleep() warning.
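
The decision described above can be modelled in user space with a dedicated pagefault-disable counter that is independent of the preempt count. This is only a sketch under that assumption: `struct sketch_task` and the `sketch_*` functions are hypothetical names, and the real kernel check involves more than is shown here.

```c
#include <stdbool.h>

/* Hypothetical per-task state for this sketch. */
struct sketch_task {
	int preempt_count;	/* may stay 0 under a lock if counting is compiled out */
	int pagefault_disabled;	/* dedicated counter, always maintained */
};

static void sketch_pagefault_disable(struct sketch_task *t)
{
	t->pagefault_disabled++;
}

static void sketch_pagefault_enable(struct sketch_task *t)
{
	t->pagefault_disabled--;
}

/*
 * Returns true when a might_fault()-style hook should go on to run the
 * might_sleep() debug check. With pagefaults disabled the user access
 * cannot sleep, so the check is skipped; in every other case it runs,
 * including under a spinlock, where a warning is exactly what is wanted.
 */
static bool sketch_might_fault_checks_sleep(const struct sketch_task *t)
{
	return t->pagefault_disabled == 0;
}
```

Because the decision no longer depends on the preempt count, the spin_lock() scenario above is checked even when preemption counting is compiled out, while the pagefault_disable() scenario stays free of false warnings.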
      Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David.Laight@ACULAB.COM
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: airlied@linux.ie
      Cc: akpm@linux-foundation.org
      Cc: benh@kernel.crashing.org
      Cc: bigeasy@linutronix.de
      Cc: borntraeger@de.ibm.com
      Cc: daniel.vetter@intel.com
      Cc: heiko.carstens@de.ibm.com
      Cc: herbert@gondor.apana.org.au
      Cc: hocko@suse.cz
      Cc: hughd@google.com
      Cc: mst@redhat.com
      Cc: paulus@samba.org
      Cc: ralf@linux-mips.org
      Cc: schwidefsky@de.ibm.com
      Cc: yang.shi@windriver.com
      Link: http://lkml.kernel.org/r/1431359540-32227-3-git-send-email-dahi@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  19. 16 Apr, 2015 3 commits
    • mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP · dd906184
      Committed by Boaz Harrosh
      This will allow an FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
      to get notified when access is a write to a read-only PFN.
      
      This can happen if we mmap() a file, then first mmap-read from it to
      page-in a read-only PFN, then we mmap-write to the same page.
      
      We need this functionality to fix a DAX bug, where in the scenario above
      we fail to set ctime/mtime though we modified the file.  An xfstest is
      attached to this patchset that shows the failure and the fix.  (A DAX
      patch will follow)
      
      This functionality is extra important for us, because upon dirtying of a
      pmem page we also want to RDMA the page to a remote cluster node.
      
      We define a new pfn_mkwrite and do not reuse page_mkwrite because
        1 - The name ;-)
        2 - But mainly because it would take a very long and tedious
            audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
            users to make sure they do not now CRASH. For example, the
            current DAX code (which this is for) would crash.
            If we wanted to reuse page_mkwrite, we would first need to
            patch all users so that they do not crash when there is no
            page, and only then enable this patch. But even if I did that
            I would not sleep so well at night.
            Adding a new vector is the safest thing to do, and is not that
            expensive: an extra pointer in a static function vector per
            driver. The new vector is also better for performance, because
            otherwise we would call all current kernel vectors only for
            them to check, find there is no page, do nothing and return.
      
      No need to call it from do_shared_fault because do_wp_page is called to
      change pte permissions anyway.
      Signed-off-by: Yigal Korman <yigal@plexistor.com>
      Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory: also print a_ops->readpage in print_bad_pte() · 2682582a
      Committed by Konstantin Khlebnikov
      A lot of filesystems use generic_file_mmap() and filemap_fault(), so
      f_op->mmap and vm_ops->fault aren't enough to identify the filesystem.
      
      This prints file name, vm_ops->fault, f_op->mmap and a_ops->readpage
      (which is almost always implemented and filesystem-specific).
      
      Example:
      
      [   23.676410] BUG: Bad page map in process sh  pte:1b7e6025 pmd:19bbd067
      [   23.676887] page:ffffea00006df980 count:4 mapcount:1 mapping:ffff8800196426c0 index:0x97
      [   23.677481] flags: 0x10000000000000c(referenced|uptodate)
      [   23.677896] page dumped because: bad pte
      [   23.678205] addr:00007f52fcb17000 vm_flags:00000075 anon_vma:          (null) mapping:ffff8800196426c0 index:97
      [   23.678922] file:libc-2.19.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:v9fs_vfs_readpage
      
      [akpm@linux-foundation.org: use pr_alert, per Kirill]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove rest of ACCESS_ONCE() usages · 4db0c3c2
      Committed by Jason Low
      We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
      tree since ACCESS_ONCE doesn't work reliably on non-scalar types.
      
      This patch removes the rest of the usages of ACCESS_ONCE, and uses the new
      READ_ONCE API for the read accesses.  This makes things cleaner, instead
      of using separate/multiple sets of APIs.
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 15 Apr, 2015 2 commits