1. 08 10月, 2016 24 次提交
  2. 06 10月, 2016 2 次提交
    • J
      mm: filemap: fix mapping->nrpages double accounting in fuse · 3ddf40e8
      Johannes Weiner 提交于
      Commit 22f2ac51 ("mm: workingset: fix crash in shadow node shrinker
      caused by replace_page_cache_page()") switched replace_page_cache() from
      raw radix tree operations to page_cache_tree_insert() but didn't take
      into account that the latter function, unlike the raw radix tree op,
      handles mapping->nrpages.  As a result, that counter is bumped for each
      page replacement rather than balanced out even.
      
      The mapping->nrpages counter is used to skip needless radix tree walks
      when invalidating, truncating, syncing inodes without pages, as well as
      statistics for userspace.  Since the error is positive, we'll do more
      page cache tree walks than necessary; we won't miss a necessary one.
      And we'll report more buffer pages to userspace than there are.  The
      error is limited to fuse inodes.
      
      Fixes: 22f2ac51 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ddf40e8
    • J
      mm: filemap: don't plant shadow entries without radix tree node · d3798ae8
      Johannes Weiner 提交于
      When the underflow checks were added to workingset_node_shadow_dec(),
      they triggered immediately:
      
        kernel BUG at ./include/linux/swap.h:276!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
         soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
        CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60b #1
        Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
        task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
        RIP: page_cache_tree_insert+0xf1/0x100
        Call Trace:
          __add_to_page_cache_locked+0x12e/0x270
          add_to_page_cache_lru+0x4e/0xe0
          mpage_readpages+0x112/0x1d0
          blkdev_readpages+0x1d/0x20
          __do_page_cache_readahead+0x1ad/0x290
          force_page_cache_readahead+0xaa/0x100
          page_cache_sync_readahead+0x3f/0x50
          generic_file_read_iter+0x5af/0x740
          blkdev_read_iter+0x35/0x40
          __vfs_read+0xe1/0x130
          vfs_read+0x96/0x130
          SyS_read+0x55/0xc0
          entry_SYSCALL_64_fastpath+0x13/0x8f
        Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
        RIP  page_cache_tree_insert+0xf1/0x100
      
      This is a long-standing bug in the way shadow entries are accounted in
      the radix tree nodes. The shrinker needs to know when radix tree nodes
      contain only shadow entries, no pages, so node->count is split in half
      to count shadows in the upper bits and pages in the lower bits.
      
      Unfortunately, the radix tree implementation doesn't know of this and
      assumes all entries are in node->count. When there is a shadow entry
      directly in root->rnode and the tree is later extended, the radix tree
      implementation will copy that entry into the new node and and bump its
      node->count, i.e. increases the page count bits. Once the shadow gets
      removed and we subtract from the upper counter, node->count underflows
      and triggers the warning. Afterwards, without node->count reaching 0
      again, the radix tree node is leaked.
      
      Limit shadow entries to when we have actual radix tree nodes and can
      count them properly. That means we lose the ability to detect refaults
      from files that had only the first page faulted in at eviction time.
      
      Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-and-tested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3798ae8
  3. 03 10月, 2016 1 次提交
  4. 01 10月, 2016 1 次提交
    • J
      mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page() · 22f2ac51
      Johannes Weiner 提交于
      Antonio reports the following crash when using fuse under memory pressure:
      
        kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: all of them
        CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
        Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
        task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
        RIP: shadow_lru_isolate+0x181/0x190
        Call Trace:
          __list_lru_walk_one.isra.3+0x8f/0x130
          list_lru_walk_one+0x23/0x30
          scan_shadow_nodes+0x34/0x50
          shrink_slab.part.40+0x1ed/0x3d0
          shrink_zone+0x2ca/0x2e0
          kswapd+0x51e/0x990
          kthread+0xd8/0xf0
          ret_from_fork+0x3f/0x70
      
      which corresponds to the following sanity check in the shadow node
      tracking:
      
        BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
      
      The workingset code tracks radix tree nodes that exclusively contain
      shadow entries of evicted pages in them, and this (somewhat obscure)
      line checks whether there are real pages left that would interfere with
      reclaim of the radix tree node under memory pressure.
      
      While discussing ways how fuse might sneak pages into the radix tree
      past the workingset code, Miklos pointed to replace_page_cache_page(),
      and indeed there is a problem there: it properly accounts for the old
      page being removed - __delete_from_page_cache() does that - but then
      does a raw raw radix_tree_insert(), not accounting for the replacement
      page.  Eventually the page count bits in node->count underflow while
      leaving the node incorrectly linked to the shadow node LRU.
      
      To address this, make sure replace_page_cache_page() uses the tracked
      page insertion code, page_cache_tree_insert().  This fixes the page
      accounting and makes sure page-containing nodes are properly unlinked
      from the shadow node LRU again.
      
      Also, make the sanity checks a bit less obscure by using the helpers for
      checking the number of pages and shadows in a radix tree node.
      
      Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check")
      Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NAntonio SJ Musumeci <trapexit@spawn.link>
      Debugged-by: NMiklos Szeredi <miklos@szeredi.hu>
      Cc: <stable@vger.kernel.org>	[3.15+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22f2ac51
  5. 29 9月, 2016 2 次提交
  6. 26 9月, 2016 1 次提交
    • L
      mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing · 38e08854
      Lorenzo Stoakes 提交于
      The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
      defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
      PMDs respectively as requiring balancing upon a subsequent page fault.
      User-defined PROT_NONE memory regions which also have this flag set will
      not normally invoke the NUMA balancing code as do_page_fault() will send
      a segfault to the process before handle_mm_fault() is even called.
      
      However if access_remote_vm() is invoked to access a PROT_NONE region of
      memory, handle_mm_fault() is called via faultin_page() and
      __get_user_pages() without any access checks being performed, meaning
      the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
      region.
      
      A simple means of triggering this problem is to access PROT_NONE mmap'd
      memory using /proc/self/mem which reliably results in the NUMA handling
      functions being invoked when CONFIG_NUMA_BALANCING is set.
      
      This issue was reported in bugzilla (issue 99101) which includes some
      simple repro code.
      
      There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
      added at commit c0e7cad9 to avoid accidentally provoking strange
      behaviour by attempting to apply NUMA balancing to pages that are in
      fact PROT_NONE.  The BUG_ON()'s are consistently triggered by the repro.
      
      This patch moves the PROT_NONE check into mm/memory.c rather than
      invoking BUG_ON() as faulting in these pages via faultin_page() is a
      valid reason for reaching the NUMA check with the PROT_NONE page table
      flag set and is therefore not always a bug.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101Reported-by: NTrevor Saunders <tbsaunde@tbsaunde.org>
      Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38e08854
  7. 25 9月, 2016 3 次提交
  8. 21 9月, 2016 1 次提交
    • L
      mm: usercopy: Check for module addresses · aa4f0601
      Laura Abbott 提交于
      While running a compile on arm64, I hit a memory exposure
      
      usercopy: kernel memory exposure attempt detected from fffffc0000f3b1a8 (buffer_head) (1 bytes)
      ------------[ cut here ]------------
      kernel BUG at mm/usercopy.c:75!
      Internal error: Oops - BUG: 0 [#1] SMP
      Modules linked in: ip6t_rpfilter ip6t_REJECT
      nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp
      llc ebtable_nat ip6table_security ip6table_raw ip6table_nat
      nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
      iptable_security iptable_raw iptable_nat nf_conntrack_ipv4
      nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
      ebtable_filter ebtables ip6table_filter ip6_tables vfat fat xgene_edac
      xgene_enet edac_core i2c_xgene_slimpro i2c_core at803x realtek xgene_dma
      mdio_xgene gpio_dwapb gpio_xgene_sb xgene_rng mailbox_xgene_slimpro nfsd
      auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c sdhci_of_arasan
      sdhci_pltfm sdhci mmc_core xhci_plat_hcd gpio_keys
      CPU: 0 PID: 19744 Comm: updatedb Tainted: G        W 4.8.0-rc3-threadinfo+ #1
      Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene Mustang Board, BIOS 3.06.12 Aug 12 2016
      task: fffffe03df944c00 task.stack: fffffe00d128c000
      PC is at __check_object_size+0x70/0x3f0
      LR is at __check_object_size+0x70/0x3f0
      ...
      [<fffffc00082b4280>] __check_object_size+0x70/0x3f0
      [<fffffc00082cdc30>] filldir64+0x158/0x1a0
      [<fffffc0000f327e8>] __fat_readdir+0x4a0/0x558 [fat]
      [<fffffc0000f328d4>] fat_readdir+0x34/0x40 [fat]
      [<fffffc00082cd8f8>] iterate_dir+0x190/0x1e0
      [<fffffc00082cde58>] SyS_getdents64+0x88/0x120
      [<fffffc0008082c70>] el0_svc_naked+0x24/0x28
      
      fffffc0000f3b1a8 is a module address. Modules may have compiled in
      strings which could get copied to userspace. In this instance, it
      looks like "." which matches with a size of 1 byte. Extend the
      is_vmalloc_addr check to be is_vmalloc_or_module_addr to cover
      all possible cases.
      Signed-off-by: NLaura Abbott <labbott@redhat.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      aa4f0601
  9. 20 9月, 2016 5 次提交