1. 30 9月, 2008 18 次提交
  2. 29 9月, 2008 2 次提交
    • B
      mm owner: fix race between swapoff and exit · 31a78f23
      Balbir Singh 提交于
      There's a race between mm->owner assignment and swapoff, more easily
      seen when task slab poisoning is turned on.  The condition occurs when
      try_to_unuse() runs in parallel with an exiting task.  A similar race
      can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
      or ptrace or page migration.
      
      CPU0                                    CPU1
                                              try_to_unuse
                                              looks at mm = task0->mm
                                              increments mm->mm_users
      task 0 exits
      mm->owner needs to be updated, but no
      new owner is found (mm_users > 1, but
      no other task has task->mm = task0->mm)
      mm_update_next_owner() leaves
                                              mmput(mm) decrements mm->mm_users
      task0 freed
                                              dereferencing mm->owner fails
      
      The fix is to notify the subsystem via mm_owner_changed callback(),
      if no new owner is found, by specifying the new task as NULL.
      
      Jiri Slaby:
      mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
      must be set after that, so as not to pass NULL as old owner causing oops.
      
      Daisuke Nishimura:
      mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
      and its callers need to take account of this situation to avoid oops.
      
      Hugh Dickins:
      Lockdep warning and hang below exec_mmap() when testing these patches.
      exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
      so exec_mmap() now needs to do the same.  And with that repositioning,
      there's now no point in mm_need_new_owner() allowing for NULL mm.
      Reported-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NJiri Slaby <jirislaby@gmail.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31a78f23
    • L
      Fix NULL pointer dereference in proc_sys_compare · d0185c08
      Linus Torvalds 提交于
      The VFS interface for the 'd_compare()' is a bit special (read: 'odd'),
      because it really just essentially replaces a memcmp().  The filesystem
      is supposed to just compare the two names with whatever case-independent
      or other function.
      
      And when I say 'is supposed to', I obviously mean that 'procfs does odd
      things, and actually looks at the dentry that we don't even pass down,
      rather than just the name'.  Which results in problems, because we
      actually call d_compare before we have even verified that the dentry is
      still hashed at all.
      
      And that causes a problm since the inode that procfs looks at may have
      been free'd and the d_inode pointer is NULL.  procfs just assumes that
      all dentries are positive, since procfs itself never generates a
      negative one.  But memory pressure will still result in the dentry
      getting torn down, and as it is removed by RCU, it still remains visible
      on some lists - and to d_compare.
      
      If the filesystem just did a name comparison, we wouldn't care.  And we
      could just fix procfs to know about negative dentries too.  But rather
      than have the low-level filesystems know about internal VFS details,
      just move the check for a unhashed dentry up a bit, so that we will only
      call d_compare on dentries that are still active.
      
      The actual oops this caused didn't look like a NULL pointer dereference
      because procfs did a 'container_of(inode, struct proc_inode, vfs_inode)'
      to get at its internal proc_inode information from the inode pointer,
      and accessed a field below the inode. So the oops would look something
      like
      
      	BUG: unable to handle kernel paging request at fffffffffffffff0
      	IP: [<ffffffff802bc6c6>] proc_sys_compare+0x36/0x50
      
      and was seen on both x86-64 (Alexey Dobriyan and Hugh Dickins) and
      ppc64 (Hugh Dickins).
      Reported-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Reviewed-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-of-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0185c08
  3. 26 9月, 2008 2 次提交
    • L
      [XFS] Remove xfs_iext_irec_compact_full() · 71a8c87f
      Lachlan McIlroy 提交于
      Yet another bug was found in xfs_iext_irec_compact_full() and while the
      source of the bug was found it wasn't an easy task to track it down
      because the conditions are very difficult to reproduce.
      
      A HUGE thank-you goes to Russell Cattelan and Eric Sandeen for their
      significant effort in tracking down the source of this corruption.
      
      xfs_iext_irec_compact_full() and xfs_iext_irec_compact_pages() are almost
      identical - they both compact indirect extent lists by moving extents from
      subsequent buffers into earlier ones. xfs_iext_irec_compact_pages() only
      moves extents if all of the extents in the next buffer will fit into the
      empty space in the buffer before it. xfs_iext_irec_compact_full() will go
      a step further and move part of the next buffer if all the extents wont
      fit. It will then shift the remaining extents in the next buffer up to the
      start of the buffer. The bug here was that we did not update er_extoff and
      this caused extent list corruption.
      
      It does not appear that this extra functionality gains us much. Calling
      xfs_iext_irec_compact_pages() instead will do a good enough job at
      compacting the indirect list and will be quicker too.
      
      For the case in xfs_iext_indirect_to_direct() the total number of extents
      in the indirect list will fit into one buffer so we will never need the
      extra functionality of xfs_iext_irec_compact_full() there.
      
      Also xfs_iext_irec_compact_pages() doesn't need to do a memmove() (the
      buffers will never overlap) so we don't want the performance hit that can
      incur.
      
      SGI-PV: 987159
      
      SGI-Modid: xfs-linux-melb:xfs-kern:32166a
      Signed-off-by: NLachlan McIlroy <lachlan@sgi.com>
      Signed-off-by: NEric Sandeen <sandeen@sandeen.net>
      71a8c87f
    • L
      [XFS] Fix extent list corruption in xfs_iext_irec_compact_full(). · f1ccd295
      Lachlan McIlroy 提交于
      If we don't move all the records from the next buffer into the current
      buffer then we need to update the er_extoff field of the next buffer as we
      shift the remaining records to the start of the buffer.
      
      SGI-PV: 987159
      
      SGI-Modid: xfs-linux-melb:xfs-kern:32165a
      Signed-off-by: NLachlan McIlroy <lachlan@sgi.com>
      Signed-off-by: NEric Sandeen <sandeen@sandeen.net>
      Signed-off-by: NRussell Cattelan <cattelan@thebarn.com>
      f1ccd295
  4. 25 9月, 2008 1 次提交
  5. 18 9月, 2008 1 次提交
  6. 17 9月, 2008 10 次提交
  7. 14 9月, 2008 4 次提交
    • A
      rescan_partitions(): make device capacity errors non-fatal · 8d99f83b
      Andrew Morton 提交于
      Herton Krzesinski reports that the error-checking changes in
      04ebd4ae ("block/ioctl.c and
      fs/partition/check.c: check value returned by add_partition") cause his
      buggy USB camera to no longer mount.  "The camera is an Olympus X-840.
      The original issue comes from the camera itself: its format program
      creates a partition with an off by one error".
      
      Buggy devices happen.  It is better for the kernel to warn and to proceed
      with the mount.
      Reported-by: NHerton Ronaldo Krzesinski <herton@mandriva.com.br>
      Cc: Abdel Benamrouche <draconux@gmail.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: David Brownell <david-b@pacbell.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d99f83b
    • H
      mm: ifdef Quicklists in /proc/meminfo · d7a3e495
      Hugh Dickins 提交于
      A "Quicklists:          0 kB" line has just started appearing in
      /proc/meminfo, but most architectures (including x86) don't have
      them configured, so #ifdef it, like the highmem lines.
      
      And those architectures which do have quicklists configured are
      using them for page tables: so let's place it next to PageTables.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NChristoph Lameter <cl@linux-foundation.org>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7a3e495
    • E
      bfs: fix Lockdep warning · 1558182f
      Eric Sesterhenn 提交于
      This fixes:
      
        =============================================
        [ INFO: possible recursive locking detected ]
        2.6.27-rc5-00283-g70bb0896 #68
        ---------------------------------------------
        touch/6855 is trying to acquire lock:
         (&info->bfs_lock){--..}, at: [<c02262f5>] bfs_delete_inode+0x9e/0x18c
      
        but task is already holding lock:
         (&info->bfs_lock){--..}, at: [<c0226c00>] bfs_create+0x45/0x187
      
        other info that might help us debug this:
        2 locks held by touch/6855:
         #0:  (&type->i_mutex_dir_key#5){--..}, at: [<c018ad13>] do_filp_open+0x10b/0x62f
         #1:  (&info->bfs_lock){--..}, at: [<c0226c00>] bfs_create+0x45/0x187
      
        stack backtrace:
        Pid: 6855, comm: touch Not tainted 2.6.27-rc5-00283-g70bb0896 #68
         [<c013e769>] validate_chain+0x458/0x9f4
         [<c013bece>] ? trace_hardirqs_off+0xb/0xd
         [<c013f36b>] __lock_acquire+0x666/0x6e0
         [<c013f440>] lock_acquire+0x5b/0x77
         [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c
         [<c06aab74>] mutex_lock_nested+0xbc/0x234
         [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c
         [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c
         [<c02262f5>] bfs_delete_inode+0x9e/0x18c
         [<c0226257>] ? bfs_delete_inode+0x0/0x18c
         [<c01925e1>] generic_delete_inode+0x94/0xfe
         [<c019265d>] generic_drop_inode+0x12/0x12f
         [<c0191b7e>] iput+0x4b/0x4e
         [<c0226d1e>] bfs_create+0x163/0x187
         [<c0188b42>] vfs_create+0xa6/0x114
         [<c018adb5>] do_filp_open+0x1ad/0x62f
         [<c0107cdc>] ? native_sched_clock+0x82/0x96
         [<c06ac309>] ? _spin_unlock+0x27/0x3c
         [<c019379e>] ? alloc_fd+0xbf/0xc9
         [<c06ae2f4>] ? sub_preempt_count+0x9d/0xab
         [<c019379e>] ? alloc_fd+0xbf/0xc9
         [<c0180391>] do_sys_open+0x42/0xb8
         [<c041d564>] ? trace_hardirqs_on_thunk+0xc/0x10
         [<c0180449>] sys_open+0x1e/0x26
         [<c01038bd>] sysenter_do_call+0x12/0x31
         =======================
      
      The problem is that we don't unlock the bfs->lock mutex before calling
      iput (we do in the other cases).
      Signed-off-by: NEric Sesterhenn <snakebyte@gmx.de>
      Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1558182f
    • A
      proc: more debugging for "already registered" case · 665020c3
      Alexey Dobriyan 提交于
      Print parent directory name as well.
      
      The aim is to catch non-creation of parent directory when proc_mkdir will
      return NULL and all subsequent registrations go directly in /proc instead
      of intended directory.
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      [ Fixed insane printk string while at it.  - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      665020c3
  8. 10 9月, 2008 1 次提交
    • T
      ocfs2: Fix a bug in direct IO read. · 0e116227
      Tao Ma 提交于
      ocfs2 will become read-only if we try to read the bytes which pass
      the end of i_size. This can be easily reproduced by following steps:
      1. mkfs a ocfs2 volume with bs=4k cs=4k and nosparse.
      2. create a small file(say less than 100 bytes) and we will create the file
         which is allocated 1 cluster.
      3. read 8196 bytes from the kernel using O_DIRECT which exceeds the limit.
      4. The ocfs2 volume becomes read-only and dmesg shows:
      OCFS2: ERROR (device sda13): ocfs2_direct_IO_get_blocks:
      Inode 66010 has a hole at block 1
      File system is now read-only due to the potential of on-disk corruption.
      Please run fsck.ocfs2 once the file system is unmounted.
      
      So suppress the ERROR message.
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      0e116227
  9. 09 9月, 2008 1 次提交
    • C
      NFS: Restore missing hunk in NFS mount option parser · af904dea
      Chuck Lever 提交于
      Automounter maps can contain mount options valid for other NFS
      implementations but not for Linux.  The Linux automounter uses the
      mount command's "-s" command line option ("s" for "sloppy") so that
      mount requests containing such options are not rejected.
      
      Commit f45663ce attempted to address a
      known regression with text-based NFS mount option parsing.  Unrecognized
      mount options would cause mount requests to fail, even if the "-s"
      option was used on the mount command line.
      
      Unfortunately, this commit was not complete as submitted.  It adds a
      new mount option, "sloppy".  But it is missing a hunk, so it now allows
      NFS mounts with unrecognized mount options, even if the "sloppy" option
      is not present.  This could be a problem if a required critical mount
      option such as "sync" is misspelled, for example, and is considered a
      regression from 2.6.26.
      
      This patch restores the missing hunk.  Now, the default behavior of
      text-based NFS mount options is as before: any unrecognized mount option
      will cause the mount to fail.
      
      Please include this in 2.6.27-rc.
      
      Thanks to Neil Brown for reporting this.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Acked-by: NJ. Bruce Fields <bfields@citi.umich.edu>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af904dea