1. 07 Aug, 2015 1 commit
    • ipc: use private shmem or hugetlbfs inodes for shm segments. · e1832f29
      Stephen Smalley committed
      The shm implementation internally uses shmem or hugetlbfs inodes for shm
      segments.  As these inodes are never directly exposed to userspace and
      only accessed through the shm operations which are already hooked by
      security modules, mark the inodes with the S_PRIVATE flag so that inode
      security initialization and permission checking are skipped.
      
      This was motivated by the following lockdep warning:
      
        ======================================================
         [ INFO: possible circular locking dependency detected ]
         4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G        W
        -------------------------------------------------------
         httpd/1597 is trying to acquire lock:
         (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
         but task is already holding lock:
         (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
         which lock already depends on the new lock.
         the existing dependency chain (in reverse order) is:
         -> #3 (&mm->mmap_sem){++++++}:
              lock_acquire+0xc7/0x270
              __might_fault+0x7a/0xa0
              filldir+0x9e/0x130
              xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
              xfs_readdir+0x1b4/0x330 [xfs]
              xfs_file_readdir+0x2b/0x30 [xfs]
              iterate_dir+0x97/0x130
              SyS_getdents+0x91/0x120
              entry_SYSCALL_64_fastpath+0x12/0x76
         -> #2 (&xfs_dir_ilock_class){++++.+}:
              lock_acquire+0xc7/0x270
              down_read_nested+0x57/0xa0
              xfs_ilock+0x167/0x350 [xfs]
              xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
              xfs_attr_get+0xbd/0x190 [xfs]
              xfs_xattr_get+0x3d/0x70 [xfs]
              generic_getxattr+0x4f/0x70
              inode_doinit_with_dentry+0x162/0x670
              sb_finish_set_opts+0xd9/0x230
              selinux_set_mnt_opts+0x35c/0x660
              superblock_doinit+0x77/0xf0
              delayed_superblock_init+0x10/0x20
              iterate_supers+0xb3/0x110
              selinux_complete_init+0x2f/0x40
              security_load_policy+0x103/0x600
              sel_write_load+0xc1/0x750
              __vfs_write+0x37/0x100
              vfs_write+0xa9/0x1a0
              SyS_write+0x58/0xd0
              entry_SYSCALL_64_fastpath+0x12/0x76
        ...
      Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
      Reported-by: Morten Stevens <mstevens@fedoraproject.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 25 Jun, 2015 1 commit
  3. 16 Apr, 2015 2 commits
  4. 15 Apr, 2015 1 commit
    • page_writeback: clean up mess around cancel_dirty_page() · b9ea2515
      Konstantin Khlebnikov committed
      This patch replaces cancel_dirty_page() with a helper function
      account_page_cleaned() which only updates counters.  It's called from
      truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
      The page is locked in both cases; the page lock protects against concurrent
      dirtiers: see commit 2d6d7f98 ("mm: protect set_page_dirty() from
      ongoing truncation").
      
      delete_from_page_cache() shouldn't be called for dirty pages; they must
      be handled by the caller (either written or truncated).  This patch treats
      the final dirty accounting fixup at the end of __delete_from_page_cache() as
      a debug check and adds WARN_ON_ONCE() around it.  If something removes
      dirty pages without proper handling, that might be a bug and unwritten
      data might be lost.
      
      Hugetlbfs has no dirty page accounting, so ClearPageDirty() is enough
      here.
      
      cancel_dirty_page() in nfs_wb_page_cancel() is redundant.  This is a
      helper for nfs_invalidate_page() and it's called only in case of complete
      invalidation.
      
      The mess was started in v2.6.20 by commits 46d2277c ("Clean up
      and make try_to_free_buffers() not race with dirty pages") and
      3e67c098 ("truncate: clear page dirtiness before running
      try_to_free_buffers()").  The first was reverted right in v2.6.20 by commit
      ecdfc978 ("Resurrect 'try_to_free_buffers()' VM hackery"), the second in
      v2.6.25 by commit a2b34564 ("Fix dirty page accounting leak with ext3
      data=journal").
      
      Custom fixes were introduced between these points: one for NFS in v2.6.23,
      commit 1b3b4a1a ("NFS: Fix a write request leak in nfs_invalidate_page()"),
      and a kludge in __delete_from_page_cache() in v2.6.24, commit 3a692790 ("Do
      dirty page accounting when removing a page from the page cache").  Since
      v2.6.25 all of them are redundant.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 12 Apr, 2015 2 commits
  6. 21 Jan, 2015 2 commits
  7. 14 Dec, 2014 2 commits
  8. 05 Jun, 2014 4 commits
  9. 07 May, 2014 1 commit
    • hugetlb: ensure hugepage access is denied if hugepages are not supported · 457c1b27
      Nishanth Aravamudan committed
      Currently, I am seeing the following when I `mount -t hugetlbfs /none
      /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
      related to the fact that hugetlbfs is probably not correctly setting
      itself up in this state:
      
        Unable to handle kernel paging request for data at address 0x00000031
        Faulting instruction address: 0xc000000000245710
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        ....
      
      In KVM guests on Power, in a guest not backed by hugepages, we see the
      following:
      
        AnonHugePages:         0 kB
        HugePages_Total:       0
        HugePages_Free:        0
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:         64 kB
      
      HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
      are not supported at boot-time, but this is only checked in
      hugetlb_init().  Extract the check to a helper function, and use it in a
      few relevant places.
      
      This does make hugetlbfs not supported (not registered at all) in this
      environment.  I believe this is fine, as there are no valid hugepages
      and that won't change at runtime.
      
      [akpm@linux-foundation.org: use pr_info(), per Mel]
      [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
      Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 04 Apr, 2014 1 commit
  11. 25 Aug, 2013 1 commit
  12. 14 Aug, 2013 1 commit
    • hugetlb: fix lockdep splat caused by pmd sharing · b610ded7
      Michal Hocko committed
      Dave has reported the following lockdep splat:
      
        =================================
        [ INFO: inconsistent lock state ]
        3.11.0-rc1+ #9 Not tainted
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
         (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
        {RECLAIM_FS-ON-W} state was registered at:
           mark_held_locks+0x81/0xe7
           lockdep_trace_alloc+0x5e/0xbc
           __alloc_pages_nodemask+0x8b/0x9b6
           __get_free_pages+0x20/0x31
           get_zeroed_page+0x12/0x14
           __pmd_alloc+0x1c/0x6b
           huge_pmd_share+0x265/0x283
           huge_pte_alloc+0x5d/0x71
           hugetlb_fault+0x7c/0x64a
           handle_mm_fault+0x255/0x299
           __do_page_fault+0x142/0x55c
           do_page_fault+0xd/0x16
           error_code+0x6c/0x74
        irq event stamp: 3136917
        hardirqs last  enabled at (3136917):  _raw_spin_unlock_irq+0x27/0x50
        hardirqs last disabled at (3136916):  _raw_spin_lock_irq+0x15/0x78
        softirqs last  enabled at (3136180):  __do_softirq+0x137/0x30f
        softirqs last disabled at (3136175):  irq_exit+0xa8/0xaa
        other info that might help us debug this:
         Possible unsafe locking scenario:
               CPU0
               ----
          lock(&mapping->i_mmap_mutex);
          <Interrupt>
            lock(&mapping->i_mmap_mutex);
      
        *** DEADLOCK ***
        no locks held by kswapd0/49.
      
        stack backtrace:
        CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
        Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
        Call Trace:
          dump_stack+0x4b/0x79
          print_usage_bug+0x1d9/0x1e3
          mark_lock+0x1e0/0x261
          __lock_acquire+0x623/0x17f2
          lock_acquire+0x7d/0x195
          mutex_lock_nested+0x6c/0x3a7
          page_referenced+0x87/0x5e3
          shrink_page_list+0x3d9/0x947
          shrink_inactive_list+0x155/0x4cb
          shrink_lruvec+0x300/0x5ce
          shrink_zone+0x53/0x14e
          kswapd+0x517/0xa75
          kthread+0xa8/0xaa
          ret_from_kernel_thread+0x1b/0x28
      
      which is a false positive caused by the hugetlb pmd sharing code, which
      allocates a new pmd from within mapping->i_mmap_mutex.  If this
      allocation causes reclaim then the lockdep detector complains that we
      might self-deadlock.
      
      This is not correct though, because hugetlb pages are not reclaimable so
      their mapping will never be touched from the reclaim path.
      
      The patch tells lockdep that the hugetlb i_mmap_mutex is special by
      assigning it a separate lockdep class so it won't report possible
      deadlocks on unrelated mappings.
      
      [peterz@infradead.org: comment for annotation]
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 08 May, 2013 1 commit
  14. 18 Apr, 2013 1 commit
  15. 04 Mar, 2013 1 commit
    • fs: Limit sys_mount to only request filesystem modules. · 7f78e035
      Eric W. Biederman committed
      Modify the request_module to prefix the file system type with "fs-"
      and add aliases to all of the filesystems that can be built as modules
      to match.
      
      A common practice is to build all of the kernel code and leave code
      that is not commonly needed as modules, with the result that many
      users are exposed to any bug anywhere in the kernel.
      
      Looking for filesystems with a fs- prefix limits the pool of possible
      modules that can be loaded by mount to just filesystems, trivially
      making things safer with no real cost.
      
      Using aliases means user space can control the policy of which
      filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
      with blacklist and alias directives.  This allows simple, safe,
      well understood work-arounds to known problematic software.
      
      This also addresses a rare but unfortunate problem where the filesystem
      name is not the same as its module name and module auto-loading
      would not work.  While writing this patch I saw a handful of such
      cases.  The most significant is autofs, which lives in the module
      autofs4.
      
      This is relevant to user namespaces because we can reach
      request_module() in get_fs_type() without having any special permissions, and
      people get uncomfortable when a user specified string (in this case
      the filesystem type) goes all of the way to request_module().
      
      After having looked at this issue I don't think there is any
      particular reason to perform any filtering or permission checks beyond
      making it clear in the module request that we want a filesystem
      module.  The common pattern in the kernel is to call request_module()
      without regard to the user's permissions.  In general all a filesystem
      module does once loaded is call register_filesystem() and go to sleep,
      which means there is not much attack surface exposed by loading a
      filesystem module unless the filesystem is mounted.  In a user
      namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
      which most filesystems do not set today.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Reported-by: Kees Cook <keescook@google.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  16. 26 Feb, 2013 1 commit
  17. 23 Feb, 2013 2 commits
  18. 12 Dec, 2012 3 commits
    • mm: adjust address_space_operations.migratepage() return code · 78bd5209
      Rafael Aquini committed
      Memory fragmentation introduced by ballooning might significantly reduce
      the number of 2MB contiguous memory blocks that can be used within a
      guest, thus imposing performance penalties associated with the reduced
      number of transparent huge pages that could be used by the guest workload.
      
      This patch-set follows the main idea discussed at the 2012 LSFMMS session:
      "Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/
      to introduce the required changes to the virtio_balloon driver, as well as
      the changes to the core compaction & migration bits, in order to make
      those subsystems aware of ballooned pages and allow memory balloon pages
      to become movable within a guest, thus avoiding the aforementioned
      fragmentation issue.
      
      The following numbers show how this patch-set helps compaction become
      more effective in memory-ballooned guests.
      
      Results for the STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
      running on a 4GB RAM KVM guest which was ballooning 512MB RAM in 64MB
      chunks, at every minute (inflating/deflating), while the test was running:
      
      ===BEGIN stress-highalloc
      
      STRESS-HIGHALLOC
                       highalloc-3.7     highalloc-3.7
                           rc4-clean         rc4-patch
      Pass 1          55.00 ( 0.00%)    62.00 ( 7.00%)
      Pass 2          54.00 ( 0.00%)    62.00 ( 8.00%)
      while Rested    75.00 ( 0.00%)    80.00 ( 5.00%)
      
      MMTests Statistics: duration
                       3.7         3.7
                 rc4-clean   rc4-patch
      User         1207.59     1207.46
      System       1300.55     1299.61
      Elapsed      2273.72     2157.06
      
      MMTests Statistics: vmstat
                                      3.7         3.7
                                rc4-clean   rc4-patch
      Page Ins                    3581516     2374368
      Page Outs                  11148692    10410332
      Swap Ins                         80          47
      Swap Outs                      3641         476
      Direct pages scanned          37978       33826
      Kswapd pages scanned        1828245     1342869
      Kswapd pages reclaimed      1710236     1304099
      Direct pages reclaimed        32207       31005
      Kswapd efficiency               93%         97%
      Kswapd velocity             804.077     622.546
      Direct efficiency               84%         91%
      Direct velocity              16.703      15.682
      Percentage direct scans          2%          2%
      Page writes by reclaim        79252        9704
      Page writes file              75611        9228
      Page writes anon               3641         476
      Page reclaim immediate        16764       11014
      Page rescued immediate            0           0
      Slabs scanned               2171904     2152448
      Direct inode steals             385        2261
      Kswapd inode steals          659137      609670
      Kswapd skipped wait               1          69
      THP fault alloc                 546         631
      THP collapse alloc              361         339
      THP splits                      259         263
      THP fault fallback               98          50
      THP collapse fail                20          17
      Compaction stalls               747         499
      Compaction success              244         145
      Compaction failures             503         354
      Compaction pages moved       370888      474837
      Compaction move failure       77378       65259
      
      ===END stress-highalloc
      
      This patch:
      
      Introduce MIGRATEPAGE_SUCCESS as the default return code for the
      address_space_operations.migratepage() method and document the expected
      return code for the same method in failure cases.
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use vm_unmapped_area() in hugetlbfs · 08659355
      Michel Lespinasse committed
      Update the hugetlb_get_unmapped_area function to make use of
      vm_unmapped_area() instead of implementing a brute force search.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB · 42d7395f
      Andi Kleen committed
      There was some desire in large applications using MAP_HUGETLB or
      SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
      others.  This is useful together with NUMA policy: use 2MB interleaving
      on some mappings, but 1GB on local mappings.
      
      This patch extends the IPC/SHM syscall interfaces slightly to allow
      specifying the page size.
      
      It borrows some upper bits in the existing flag arguments and allows
      encoding the log of the desired page size in addition to the *_HUGETLB
      flag.  When 0 is specified the default size is used; this makes the
      change fully compatible.
      
      Extending the internal hugetlb code to handle this is straightforward.
      Instead of a single mount it just keeps an array of them and selects the
      right mount based on the specified page size.  When no page size is
      specified it uses the mount of the default page size.
      
      The change is not visible in /proc/mounts because internal mounts don't
      appear there.  It also has very little overhead: the additional mounts
      just consume a super block, but not more memory when not used.
      
      I also exported the new flags to the user headers (they were previously
      under __KERNEL__).  Right now only symbols for x86 and some other
      architectures for 1GB and 2MB are defined.  The interface should already
      work for all other architectures though.  Only architectures that define
      multiple hugetlb sizes actually need it (that is currently x86, tile,
      powerpc).  However tile and powerpc have user configurable hugetlb
      sizes, so it's not easy to add defines.  A program on those
      architectures would need to query sysfs and use the appropriate log2.
      
      [akpm@linux-foundation.org: cleanups]
      [rientjes@google.com: fix build]
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 06 Dec, 2012 1 commit
  20. 09 Oct, 2012 2 commits
    • mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse committed
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement for the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov committed
      A long time ago, in v2.4, VM_RESERVED kept the swapout process off the VMA.
      It has since lost its original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes the reserved_vm counter from mm_struct.  It seems nobody
      cares about it: it is not exported to userspace directly, it only
      reduces the total_vm shown in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or the pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 03 Oct, 2012 1 commit
  22. 21 Sep, 2012 1 commit
  23. 01 Aug, 2012 1 commit
  24. 14 Jul, 2012 1 commit
  25. 06 May, 2012 1 commit
  26. 26 Apr, 2012 1 commit
    • hugetlbfs: lockdep annotate root inode properly · 65ed7601
      Aneesh Kumar K.V committed
      This fixes the below reported false lockdep warning.  e096d0c7
      ("lockdep: Add helper function for dir vs file i_mutex annotation") added
      a similar annotation for every other inode in hugetlbfs but missed the
      root inode because it was allocated by a separate function.
      
      For HugeTLB fs we allow taking i_mutex in mmap.  HugeTLB fs doesn't
      support file write and its file read callback is modified in a05b0855
      ("hugetlbfs: avoid taking i_mutex from hugetlbfs_read()") to not take
      i_mutex.  Hence for HugeTLB fs with regular files we really don't take
      i_mutex with mmap_sem held.
      
       ======================================================
       [ INFO: possible circular locking dependency detected ]
       3.4.0-rc1+ #322 Not tainted
       -------------------------------------------------------
       bash/1572 is trying to acquire lock:
        (&mm->mmap_sem){++++++}, at: [<ffffffff810f1618>] might_fault+0x40/0x90
      
       but task is already holding lock:
        (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff81125f88>] vfs_readdir+0x56/0xa8
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (&sb->s_type->i_mutex_key#12){+.+.+.}:
              [<ffffffff810a09e5>] lock_acquire+0xd5/0xfa
              [<ffffffff816a2f5e>] __mutex_lock_common+0x48/0x350
              [<ffffffff816a3325>] mutex_lock_nested+0x2a/0x31
              [<ffffffff811fb8e1>] hugetlbfs_file_mmap+0x7d/0x104
              [<ffffffff810f859a>] mmap_region+0x272/0x47d
              [<ffffffff810f8a39>] do_mmap_pgoff+0x294/0x2ee
              [<ffffffff810f8b65>] sys_mmap_pgoff+0xd2/0x10e
              [<ffffffff8103d19e>] sys_mmap+0x1d/0x1f
              [<ffffffff816a5922>] system_call_fastpath+0x16/0x1b
      
       -> #0 (&mm->mmap_sem){++++++}:
              [<ffffffff810a0256>] __lock_acquire+0xa81/0xd75
              [<ffffffff810a09e5>] lock_acquire+0xd5/0xfa
              [<ffffffff810f1645>] might_fault+0x6d/0x90
              [<ffffffff81125d62>] filldir+0x6a/0xc2
              [<ffffffff81133a83>] dcache_readdir+0x5c/0x222
              [<ffffffff81125fa8>] vfs_readdir+0x76/0xa8
              [<ffffffff811260b6>] sys_getdents+0x79/0xc9
              [<ffffffff816a5922>] system_call_fastpath+0x16/0x1b
      
       other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&sb->s_type->i_mutex_key#12);
                                      lock(&mm->mmap_sem);
                                      lock(&sb->s_type->i_mutex_key#12);
         lock(&mm->mmap_sem);
      
        *** DEADLOCK ***
      
       1 lock held by bash/1572:
        #0:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff81125f88>] vfs_readdir+0x56/0xa8
      
       stack backtrace:
       Pid: 1572, comm: bash Not tainted 3.4.0-rc1+ #322
       Call Trace:
        [<ffffffff81699a3c>] print_circular_bug+0x1f8/0x209
        [<ffffffff810a0256>] __lock_acquire+0xa81/0xd75
        [<ffffffff810f38aa>] ? handle_pte_fault+0x5ff/0x614
        [<ffffffff8109e622>] ? mark_lock+0x2d/0x258
        [<ffffffff810f1618>] ? might_fault+0x40/0x90
        [<ffffffff810a09e5>] lock_acquire+0xd5/0xfa
        [<ffffffff810f1618>] ? might_fault+0x40/0x90
        [<ffffffff816a3249>] ? __mutex_lock_common+0x333/0x350
        [<ffffffff810f1645>] might_fault+0x6d/0x90
        [<ffffffff810f1618>] ? might_fault+0x40/0x90
        [<ffffffff81125d62>] filldir+0x6a/0xc2
        [<ffffffff81133a83>] dcache_readdir+0x5c/0x222
        [<ffffffff81125cf8>] ? sys_ioctl+0x74/0x74
        [<ffffffff81125cf8>] ? sys_ioctl+0x74/0x74
        [<ffffffff81125cf8>] ? sys_ioctl+0x74/0x74
        [<ffffffff81125fa8>] vfs_readdir+0x76/0xa8
        [<ffffffff811260b6>] sys_getdents+0x79/0xc9
        [<ffffffff816a5922>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. 06 Apr, 2012 1 commit
    • hugetlbfs: remove unregister_filesystem() when initializing module · 7563ec4c
      Hillf Danton committed
      It was introduced by d1d5e05f ("hugetlbfs: return error code when
      initializing module") but as Al pointed out, is a bad idea.
      
      Quoted comments from Al:
       "Note that unregister_filesystem() in module init is *always* wrong;
        it's not an issue here (it's done too early to care about and
        realistically the box is not going anywhere - it'll panic when attempt
        to exec /sbin/init fails, if not earlier), but it's a damn bad
        example.
      
        Consider a normal fs module.  Somebody loads it and in parallel with
        that we get a mount attempt on that fs type.  It comes between
        register and failure exits that causes unregister; at that point we
        are screwed since grabbing a reference to module as done by mount is
        enough to prevent exit, but not to prevent the failure of init.  As
        the result, module will get freed when init fails, mounted fs of that
        type be damned."
      
      So remove it.
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  28. 22 Mar, 2012 2 commits