1. 14 Sep 2009, 10 commits
    • kill-the-BKL/reiserfs: release the write lock on flush_commit_list() · 6e3647ac
      Committed by Frederic Weisbecker
      flush_commit_list() uses ll_rw_block() to commit the pending log blocks.
      ll_rw_block() might sleep, and the BKL used to be released implicitly at
      this point, so we can relax the write lock here as well (a sketch of
      this pattern follows the entry).
      
      [ Impact: release the reiserfs write lock when it is not needed ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
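      For illustration, a minimal user-space sketch of the pattern this series
      applies: drop the per-superblock write lock around a call that may sleep,
      then retake it. The pthread mutex and the do_blocking_io() helper are
      stand-ins, not the reiserfs API.

      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

      /* Stand-in for a call such as ll_rw_block() that may sleep. */
      static void do_blocking_io(void)
      {
              usleep(1000);
      }

      static void commit_blocks_sketch(void)
      {
              pthread_mutex_lock(&write_lock);

              /* ... prepare the log blocks under the lock ... */

              /* The submission below may sleep: release the lock so other
               * writers and the committer can make progress meanwhile. */
              pthread_mutex_unlock(&write_lock);
              do_blocking_io();
              pthread_mutex_lock(&write_lock);

              /* ... finish up under the lock ... */
              pthread_mutex_unlock(&write_lock);
      }

      int main(void)
      {
              commit_blocks_sketch();
              printf("done\n");
              return 0;
      }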
    • kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block() · 4c5eface
      Committed by Frederic Weisbecker
      reiserfs_read_bitmap_block() uses sb_bread() to read the bitmap block.
      This helper might sleep.

      When the BKL was used, it was implicitly released at this point, so we
      can relax the write lock here as well.
      
      [ Impact: release the reiserfs write lock when it is not needed ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    • kill-the-BKL/reiserfs: release the write lock inside get_neighbors() · 148d3504
      Committed by Frederic Weisbecker
      get_neighbors() is used to get the blocks to the left and/or right of a
      given block in order to balance the tree.

      sb_bread() is used to read the buffers of these neighbor blocks, and it
      might sleep while waiting for the I/O.

      The BKL was released at this point, so we can also release the write
      lock before calling sb_bread().

      This is safe because if the filesystem is changed after this lock
      release, the function returns REPEAT_SEARCH (aka SCHEDULE_OCCURRED in
      the function header comments) in order to repeat the neighbor search.
      A sketch of this revalidation pattern follows the entry.
      
      [ Impact: release the reiserfs write lock when it is not needed ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
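      A user-space sketch of the revalidation idea (assumed names; a
      generation counter stands in for "the tree changed while we slept"):
      the lock is dropped around the sleeping read, and if anything changed
      in between, the caller is asked to repeat the search, mirroring the
      REPEAT_SEARCH return described above.

      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      #define REPEAT_SEARCH  1
      #define CARRY_ON       0

      static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;
      static unsigned long fs_generation;     /* bumped on every tree change */

      static void read_neighbor_buffer(void)  /* stand-in for sb_bread(), may sleep */
      {
              usleep(1000);
      }

      static int get_neighbors_sketch(void)
      {
              unsigned long gen;

              pthread_mutex_lock(&write_lock);
              gen = fs_generation;

              /* sb_bread() may sleep: drop the write lock around it. */
              pthread_mutex_unlock(&write_lock);
              read_neighbor_buffer();
              pthread_mutex_lock(&write_lock);

              if (gen != fs_generation) {
                      /* The tree changed while we slept: ask the caller to retry. */
                      pthread_mutex_unlock(&write_lock);
                      return REPEAT_SEARCH;
              }

              /* ... use the neighbor buffers under the lock ... */
              pthread_mutex_unlock(&write_lock);
              return CARRY_ON;
      }

      int main(void)
      {
              printf("get_neighbors_sketch() -> %d\n", get_neighbors_sketch());
              return 0;
      }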
    • kill-the-BKL/reiserfs: release write lock while rescheduling on prepare_for_delete_or_cut() · 5e69e3a4
      Committed by Frederic Weisbecker
      prepare_for_delete_or_cut() can process several types of items,
      including indirect items, i.e. items which contain no file data
      themselves but pointers to the unformatted nodes over which a file's
      data is scattered.

      In this case it has to zero out these pointers to unformatted node
      block numbers and release those blocks in the bitmap.

      This can take some time, so a reschedule is performed between each
      block processed. We can safely release the write lock across this
      reschedule, as the BKL did, because the code checks right afterwards
      whether the item has moved while we were sleeping (a sketch follows
      the entry).
      
      [ Impact: release the reiserfs write lock when it is not needed ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
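      A sketch of the same BKL-like relaxation inside a long loop, using
      sched_yield() in place of the kernel reschedule and a plain position
      variable in place of the "item moved" check; hypothetical names, not
      the reiserfs code.

      #include <pthread.h>
      #include <sched.h>
      #include <stdio.h>

      static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;
      static int item_position;               /* stand-in for where the item lives */

      static void zero_out_blocks_sketch(int nr_blocks)
      {
              int i, expected;

              pthread_mutex_lock(&write_lock);
              expected = item_position;

              for (i = 0; i < nr_blocks; i++) {
                      /* ... zero one pointer and free one block ... */

                      /* Long loop: drop the lock across the voluntary
                       * reschedule so others can run, as the BKL allowed. */
                      pthread_mutex_unlock(&write_lock);
                      sched_yield();
                      pthread_mutex_lock(&write_lock);

                      if (item_position != expected) {
                              /* The item moved while we slept: re-locate it
                               * before touching anything else. */
                              expected = item_position;
                      }
              }
              pthread_mutex_unlock(&write_lock);
      }

      int main(void)
      {
              zero_out_blocks_sketch(4);
              printf("done\n");
              return 0;
      }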
    • kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end() · e6950a4d
      Committed by Frederic Weisbecker
      When do_journal_end() copies data to the journal block buffers in
      memory, it reschedules if needed between each block copied and dirtied.

      We can also release the write lock at this rescheduling point, as the
      BKL did implicitly.
      
      [ Impact: release the reiserfs write lock when it is not needed ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    • kill-the-BKL/reiserfs: only acquire the write lock once in reiserfs_dirty_inode · dc8f6d89
      Committed by Frederic Weisbecker
      Impact: fix a deadlock
      
      reiserfs_dirty_inode() is the super_operations::dirty_inode() callback
      of reiserfs. It can be called from different contexts where the write
      lock can be already held.
      
      But this function also grabs the write lock (possibly recursively).
      A subsequent release of the lock before sleeping will not actually
      release it if the caller of mark_inode_dirty() (which in turn calls
      reiserfs_dirty_inode()) already owns the lock.
      
      A typical case:
      
      reiserfs_write_end() {
      	acquire_write_lock()
      	mark_inode_dirty() {
      		reiserfs_dirty_inode() {
      			reacquire_write_lock() {
      				journal_begin() {
      					do_journal_begin_r() {
      						/*
      						 * fail to release, still
      						 * one depth of lock
      						 */
      						release_write_lock()
      						reiserfs_wait_on_write_block() {
      							wait_event()
      
      The event is usually provided by something which itself needs the write
      lock, but that lock has not been released.

      We use reiserfs_write_lock_once() here to ensure the write lock is
      taken at only one level of depth.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@texware.it>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      LKML-Reference: <1239680065-25013-4-git-send-email-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • kill-the-BKL/reiserfs: lock only once in reiserfs_truncate_file · 22c963ad
      Committed by Frederic Weisbecker
      Impact: fix a deadlock
      
      reiserfs_truncate_file() can be called from multiple contexts, with the
      write lock either already held or not.

      This function also acquires the write lock (possibly recursively).
      Subsequent releases before sleeping will not actually release the lock
      because the lock depth may be greater than one.
      
      A typical case is:
      
      reiserfs_file_release {
      	acquire_the_lock()
      	reiserfs_truncate_file()
      		reacquire_the_lock()
      		journal_begin() {
      			do_journal_begin_r() {
      				reiserfs_wait_on_write_block() {
      					/*
      					 * Not released because still one
      					 * depth owned
      					 */
      					release_lock()
      					wait_for_event()
      
      At this stage the event never happens because the task that would
      provide it needs the write lock.
      
      We use reiserfs_write_lock_once() here to ensure that we don't acquire the
      write lock recursively.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@texware.it>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      LKML-Reference: <1239680065-25013-3-git-send-email-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • kill-the-BKL/reiserfs: provide a tool to lock only once the write lock · daf88c89
      Committed by Frederic Weisbecker
      Sometimes we don't want to hold the per-superblock write lock
      recursively, because we want to be sure it is actually released when we
      come to sleep.

      This patch introduces the necessary tools for that.

      reiserfs_write_lock_once() does the same job as reiserfs_write_lock(),
      except that it won't try to acquire the lock recursively if the current
      task already owns it. It also returns the lock depth as it was before
      the call.

      reiserfs_write_unlock_once() unlocks only if reiserfs_write_lock_once()
      returned a depth equal to -1, i.e. only if it actually took the lock.
      A sketch of these semantics follows the entry.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@texware.it>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      LKML-Reference: <1239680065-25013-2-git-send-email-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
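      A user-space model of the semantics described above (hypothetical
      names, not the reiserfs code): the lock records an owner and a depth,
      lock_once() refuses to add a recursion level when the caller already
      owns the lock and returns the previous depth, and unlock_once() only
      drops the lock when that previous depth was -1.

      #include <pthread.h>
      #include <stdio.h>

      struct sketch_lock {
              pthread_mutex_t mutex;
              pthread_t owner;
              int depth;                      /* -1 when unowned */
      };

      static struct sketch_lock sb_lock = {
              .mutex = PTHREAD_MUTEX_INITIALIZER,
              .depth = -1,
      };

      /* Returns the depth seen before the call: -1 means we really locked here. */
      static int sketch_write_lock_once(struct sketch_lock *lk)
      {
              if (lk->depth != -1 && pthread_equal(lk->owner, pthread_self()))
                      return lk->depth;       /* already owned: do not nest deeper */

              pthread_mutex_lock(&lk->mutex);
              lk->owner = pthread_self();
              lk->depth = 0;
              return -1;
      }

      static void sketch_write_unlock_once(struct sketch_lock *lk, int prev_depth)
      {
              if (prev_depth != -1)
                      return;                 /* an outer caller still owns the lock */
              lk->depth = -1;
              pthread_mutex_unlock(&lk->mutex);
      }

      int main(void)
      {
              int depth = sketch_write_lock_once(&sb_lock);

              printf("previous depth: %d (we locked)\n", depth);
              sketch_write_unlock_once(&sb_lock, depth);
              return 0;
      }

      Callers that may already hold the lock, such as reiserfs_dirty_inode()
      or reiserfs_truncate_file() above, can then sleep knowing that a single
      unlock really releases the mutex.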
    • reiserfs, kill-the-BKL: fix unsafe j_flush_mutex lock · a412f9ef
      Committed by Frederic Weisbecker
      Impact: fix a deadlock
      
      The j_flush_mutex is acquired safely in journal.c:
      if we can't take it, we release the reiserfs per-superblock lock
      and wait a bit.

      But there is one remaining place, in kupdate_transactions(), where
      j_flush_mutex is still acquired the traditional way. Thus the following
      scenario (reported by lockdep) can happen:
      
      A						B
      
      mutex_lock(&write_lock)			mutex_lock(&write_lock)
      	mutex_lock(&j_flush_mutex)	mutex_lock(&j_flush_mutex) //block
      	mutex_unlock(&write_lock)
      	sleep...
      	mutex_lock(&write_lock) //deadlock
      
      Fix this by using reiserfs_mutex_lock_safe() in kupdate_transactions().
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@texware.it>
      Cc: Jeff Mahoney <jeffm@suse.com>
      LKML-Reference: <1239660635-12940-1-git-send-email-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
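      One way to realize what the message describes, sketched in user space:
      if the inner mutex cannot be taken immediately, drop the outer write
      lock, wait a bit and retry. The names stand in for the per-superblock
      write lock and j_flush_mutex; this is not the reiserfs_mutex_lock_safe()
      implementation itself.

      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t flush_mutex = PTHREAD_MUTEX_INITIALIZER;

      /* If the inner mutex is busy, drop the write lock, wait a bit and retry,
       * so the task holding flush_mutex can take the write lock and finish. */
      static void lock_flush_mutex_safe(void)
      {
              while (pthread_mutex_trylock(&flush_mutex) != 0) {
                      pthread_mutex_unlock(&write_lock);
                      usleep(1000);                    /* "wait a bit" */
                      pthread_mutex_lock(&write_lock);
              }
      }

      int main(void)
      {
              pthread_mutex_lock(&write_lock);
              lock_flush_mutex_safe();                 /* both locks now held */
              printf("acquired flush_mutex without blocking under write_lock\n");
              pthread_mutex_unlock(&flush_mutex);
              pthread_mutex_unlock(&write_lock);
              return 0;
      }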
    • reiserfs: kill-the-BKL · 8ebc4232
      Committed by Frederic Weisbecker
      This patch is an attempt to remove the BKL-based locking scheme from
      reiserfs.
      
      It is partly inspired by an old attempt by Peter Zijlstra:
      
         http://lkml.indiana.edu/hypermail/linux/kernel/0704.2/2174.html
      
      The BKL is heavily used in this filesystem to prevent concurrent write
      accesses to the filesystem.
      
      Reiserfs makes deep use of the specific properties of the BKL:

      - It can be acquired recursively by the same task
      - It is released across schedule() calls and reacquired when schedule() returns
      
      The reiserfs write locking relies on these two properties, so it is
      very hard to simply replace the BKL with an ordinary mutex.

      - We need a lock that can be acquired recursively, unless we want to
        restructure several blocks of the code.
      - We need to identify the sites where the BKL was implicitly relaxed
        (schedule, wait, sync, etc.) so that we can in turn release and
        reacquire our new lock explicitly.
        Such implicit releases of the lock are often required to let other
        resource producers/consumers do their job, otherwise we can suffer
        unexpected starvation or deadlocks.
      
      So the new lock that replaces the BKL here is a per-superblock mutex
      with a specific property: it can be acquired recursively by the same
      task, like the BKL.

      For that purpose, we add a lock owner and a lock depth field to the
      superblock information structure (a sketch of this wrapper follows the
      entry).
      
      The first axis of this patch is to turn the reiserfs_write_(un)lock()
      functions into wrappers that manage this mutex. Some explicit calls to
      lock_kernel() have also been converted to the reiserfs_write_lock()
      helpers.

      The second axis is to find the important blocking sites (schedule...(),
      wait_on_buffer(), sync_dirty_buffer(), etc.) and then apply an explicit
      release of the write lock at these locations before blocking. Then we
      can safely wait for those who can give us resources or who need some.
      Typically this is a contest between the current writer, the reiserfs
      workqueue (aka the async committer) and the pdflush threads.
      
      The third axis is a consequence of the second. The write lock is
      usually at the top of a lock dependency chain which can include the
      journal lock, the flush lock or the commit lock. So it's dangerous to
      release the write lock and then try to reacquire it while we still hold
      other locks.
      
      This is fine with the BKL:
      
            T1                       T2
      
      lock_kernel()
          mutex_lock(A)
          unlock_kernel()
          // do something
                                  lock_kernel()
                                      mutex_lock(A) -> already locked by T1
                                      schedule() (and then unlock_kernel())
          lock_kernel()
          mutex_unlock(A)
          ....
      
      This is not fine with a mutex:
      
            T1                       T2
      
      mutex_lock(write)
          mutex_lock(A)
          mutex_unlock(write)
          // do something
                                 mutex_lock(write)
                                    mutex_lock(A) -> already locked by T1
                                    schedule()
      
          mutex_lock(write) -> already locked by T2
          deadlock
      
      The solution in this patch is to provide a helper which releases the
      write lock and sleeps a bit if we can't lock a mutex that depends on
      it. It's another emulation of the BKL behaviour.
      
      The last axis is to locate the fs callbacks that are called with the
      BKL held, according to Documentation/filesystems/Locking.
      
      Those are:
      
      - reiserfs_remount
      - reiserfs_fill_super
      - reiserfs_put_super
      
      Reiserfs didn't need to take the lock explicitly there because of the
      context in which these callbacks are called, but now we must take care
      of that with the new locking.
      
      After this patch, reiserfs suffers from a slight performance regression (for now).
      On UP, a high volume write with dd reports an average of 27 MB/s instead
      of 30 MB/s without the patch applied.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Bron Gondwana <brong@fastmail.fm>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      LKML-Reference: <1239070789-13354-1-git-send-email-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
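      A user-space sketch of the recursive per-superblock lock this patch
      describes (hypothetical names): an owner and a depth next to a plain
      mutex, with -1 meaning unowned, so the same task can nest acquisitions
      and only the outermost unlock releases the mutex.

      #include <pthread.h>
      #include <stdio.h>

      struct sb_write_lock {
              pthread_mutex_t mutex;
              pthread_t owner;
              int depth;                      /* -1 when unowned */
      };

      static struct sb_write_lock sb_lock = {
              .mutex = PTHREAD_MUTEX_INITIALIZER,
              .depth = -1,
      };

      static void write_lock_sketch(struct sb_write_lock *lk)
      {
              if (lk->depth != -1 && pthread_equal(lk->owner, pthread_self())) {
                      lk->depth++;            /* recursive acquisition, BKL-like */
                      return;
              }
              pthread_mutex_lock(&lk->mutex);
              lk->owner = pthread_self();
              lk->depth = 0;
      }

      static void write_unlock_sketch(struct sb_write_lock *lk)
      {
              if (lk->depth > 0) {
                      lk->depth--;            /* still owned at an outer level */
                      return;
              }
              lk->depth = -1;
              pthread_mutex_unlock(&lk->mutex);
      }

      int main(void)
      {
              write_lock_sketch(&sb_lock);
              write_lock_sketch(&sb_lock);    /* nested: depth goes to 1 */
              write_unlock_sketch(&sb_lock);  /* back to depth 0: mutex still held */
              printf("depth after one unlock: %d\n", sb_lock.depth);
              write_unlock_sketch(&sb_lock);  /* depth -1: mutex released */
              return 0;
      }

      This also shows why an unlock issued before sleeping only really drops
      the mutex at the outermost depth, which is what the lock_once helpers
      above work around.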
  2. 07 Sep 2009, 1 commit
  3. 06 Sep 2009, 2 commits
    • ext2: fix unbalanced kmap()/kunmap() · 9de6886e
      Committed by Nicolas Pitre
      In ext2_rename(), dir_page is acquired through ext2_dotdot().  It is
      then released through ext2_set_link() but only if old_dir != new_dir.
      Failing that, the pkmap reference count is never decremented and the
      page remains pinned forever. Repeat that a couple of times with highmem
      pages and all pkmap slots get exhausted; every further kmap() call then
      ends up stalling on the pkmap_map_wait queue, at which point the whole
      system comes to a halt (a sketch of the leak pattern follows the entry).
      Signed-off-by: Nicolas Pitre <nico@marvell.com>
      Acked-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
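      As a rough user-space illustration of the leak described above
      (hypothetical helpers standing in for kmap()/kunmap() and a tiny pool
      standing in for the pkmap slots; not the ext2 code), the point is that
      the mapping must be released on every path out of the function:

      #include <stdio.h>

      #define NR_SLOTS 4
      static int slots_in_use;                /* stand-in for the pkmap pool */

      static int map_page(void)               /* kmap()-like: takes a slot */
      {
              if (slots_in_use >= NR_SLOTS)
                      return -1;              /* pool exhausted: real kmap() would stall */
              slots_in_use++;
              return 0;
      }

      static void unmap_page(void)            /* kunmap()-like: releases a slot */
      {
              slots_in_use--;
      }

      /* Buggy shape: the slot is released on only one of the two paths, so
       * same-directory renames each leak a slot until the pool runs dry.
       * The fix is to call unmap_page() unconditionally before returning. */
      static void rename_sketch(int same_dir)
      {
              if (map_page() < 0) {
                      printf("no slots left, would stall forever\n");
                      return;
              }
              /* ... update the ".." entry ... */
              if (!same_dir)
                      unmap_page();
      }

      int main(void)
      {
              for (int i = 0; i < 6; i++)
                      rename_sketch(1);       /* same-dir renames leak a slot each */
              return 0;
      }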
    • exec: do not sleep in TASK_TRACED under ->cred_guard_mutex · a2a8474c
      Committed by Oleg Nesterov
      Tom Horsley reports that his debugger hangs when it tries to read
      /proc/pid_of_tracee/maps. This happens since
      
      	"mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
      	04b836cbf19e885f8366bccb2e4b0474346c02d
      
      commit in 2.6.31.
      
      But the root of the problem lies in the fact that the do_execve() path
      calls tracehook_report_exec(), which can stop if the tracer sets
      PT_TRACE_EXEC.
      
      The tracee must not sleep in TASK_TRACED holding this mutex.  Even if we
      remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
      another task doing PTRACE_ATTACH should not hang until it is killed or the
      tracee resumes.
      
      With this patch do_execve() does not use ->cred_guard_mutex directly and
      we do not hold it throughout, instead:
      
      	- introduce prepare_bprm_creds() helper, it locks the mutex
      	  and calls prepare_exec_creds() to initialize bprm->cred.
      
      	- install_exec_creds() drops the mutex after commit_creds(),
      	  and thus before tracehook_report_exec()->ptrace_stop().
      
      	  or, if exec fails,
      
      	  free_bprm() drops this mutex when bprm->cred != NULL which
      	  indicates install_exec_creds() was not called.
      Reported-by: Tom Horsley <tom.horsley@att.net>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
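      A user-space sketch of the ownership protocol the patch describes, with
      a pthread mutex and a toy bprm structure standing in for
      ->cred_guard_mutex and struct linux_binprm (hypothetical names, not the
      exec code): the mutex is taken when the creds are prepared and dropped
      either after they are installed or, on failure, when the structure is
      freed while creds are still attached.

      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>

      static pthread_mutex_t cred_guard_mutex = PTHREAD_MUTEX_INITIALIZER;

      struct bprm_sketch {
              void *cred;                      /* non-NULL while creds are "prepared" */
      };

      static int prepare_creds_sketch(struct bprm_sketch *bprm)
      {
              pthread_mutex_lock(&cred_guard_mutex);
              bprm->cred = malloc(1);          /* stand-in for prepare_exec_creds() */
              if (!bprm->cred) {
                      pthread_mutex_unlock(&cred_guard_mutex);
                      return -1;
              }
              return 0;
      }

      static void install_creds_sketch(struct bprm_sketch *bprm)
      {
              /* ... commit_creds()-like step happens here ... */
              free(bprm->cred);
              bprm->cred = NULL;               /* marks the mutex as already dropped */
              pthread_mutex_unlock(&cred_guard_mutex);
      }

      static void free_bprm_sketch(struct bprm_sketch *bprm)
      {
              if (bprm->cred) {                /* install was never reached: drop here */
                      free(bprm->cred);
                      pthread_mutex_unlock(&cred_guard_mutex);
              }
              /* ... free the rest of bprm ... */
      }

      int main(void)
      {
              struct bprm_sketch bprm = { 0 };

              if (prepare_creds_sketch(&bprm))
                      return 1;
              install_creds_sketch(&bprm);     /* success path: mutex dropped here */
              free_bprm_sketch(&bprm);         /* cred is NULL, nothing to drop */
              return 0;
      }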
  4. 05 Sep 2009, 1 commit
  5. 03 Sep 2009, 1 commit
  6. 02 Sep 2009, 1 commit
  7. 01 Sep 2009, 1 commit
  8. 31 Aug 2009, 1 commit
    • nilfs2: fix preempt count underflow in nilfs_btnode_prepare_change_key · b1f1b8ce
      Committed by Ryusuke Konishi
      This will fix the following preempt count underflow reported by
      users with the title "[NILFS users] segctord problem" (Message-ID:
      <949415.6494.qm@web58808.mail.re1.yahoo.com> and Message-ID:
      <debc30fc0908270825v747c1734xa59126623cfd5b05@mail.gmail.com>):
      
       WARNING: at kernel/sched.c:4890 sub_preempt_count+0x95/0xa0()
       Hardware name: HP Compaq 6530b (KR980UT#ABC)
       Modules linked in: bridge stp llc bnep rfcomm l2cap xfs exportfs nilfs2 cowloop loop vboxnetadp vboxnetflt vboxdrv btusb bluetooth uvcvideo videodev v4l1_compat v4l2_compat_ioctl32 arc4 snd_hda_codec_analog ecb iwlagn iwlcore rfkill lib80211 mac80211 snd_hda_intel snd_hda_codec ehci_hcd uhci_hcd usbcore snd_hwdep snd_pcm tg3 cfg80211 psmouse snd_timer joydev libphy ohci1394 snd_page_alloc hp_accel lis3lv02d ieee1394 led_class i915 drm i2c_algo_bit video backlight output i2c_core dm_crypt dm_mod
       Pid: 4197, comm: segctord Not tainted 2.6.30-gentoo-r4-64 #7
       Call Trace:
        [<ffffffff8023fa05>] ? sub_preempt_count+0x95/0xa0
        [<ffffffff802470f8>] warn_slowpath_common+0x78/0xd0
        [<ffffffff8024715f>] warn_slowpath_null+0xf/0x20
        [<ffffffff8023fa05>] sub_preempt_count+0x95/0xa0
        [<ffffffffa04ce4db>] nilfs_btnode_prepare_change_key+0x11b/0x190 [nilfs2]
        [<ffffffffa04d01ad>] nilfs_btree_assign_p+0x19d/0x1e0 [nilfs2]
        [<ffffffffa04d10ad>] nilfs_btree_assign+0xbd/0x130 [nilfs2]
        [<ffffffffa04cead7>] nilfs_bmap_assign+0x47/0x70 [nilfs2]
        [<ffffffffa04d9bc6>] nilfs_segctor_do_construct+0x956/0x20f0 [nilfs2]
        [<ffffffff805ac8e2>] ? _spin_unlock_irqrestore+0x12/0x40
        [<ffffffff803c06e0>] ? __up_write+0xe0/0x150
        [<ffffffff80262959>] ? up_write+0x9/0x10
        [<ffffffffa04ce9f3>] ? nilfs_bmap_test_and_clear_dirty+0x43/0x60 [nilfs2]
        [<ffffffffa04cd627>] ? nilfs_mdt_fetch_dirty+0x27/0x60 [nilfs2]
        [<ffffffffa04db5fc>] nilfs_segctor_construct+0x8c/0xd0 [nilfs2]
        [<ffffffffa04dc3dc>] nilfs_segctor_thread+0x15c/0x3a0 [nilfs2]
        [<ffffffffa04dbe20>] ? nilfs_construction_timeout+0x0/0x10 [nilfs2]
        [<ffffffff80252633>] ? add_timer+0x13/0x20
        [<ffffffff802370da>] ? __wake_up_common+0x5a/0x90
        [<ffffffff8025e960>] ? autoremove_wake_function+0x0/0x40
        [<ffffffffa04dc280>] ? nilfs_segctor_thread+0x0/0x3a0 [nilfs2]
        [<ffffffffa04dc280>] ? nilfs_segctor_thread+0x0/0x3a0 [nilfs2]
        [<ffffffff8025e556>] kthread+0x56/0x90
        [<ffffffff8020cdea>] child_rip+0xa/0x20
        [<ffffffff8025e500>] ? kthread+0x0/0x90
        [<ffffffff8020cde0>] ? child_rip+0x0/0x20
      
      This problem was caused by a missing radix_tree_preload() call in
      the retry path of the nilfs_btnode_prepare_change_key() function
      (the sketch after this entry illustrates the required pairing).
      Reported-by: Eric A <eric225125@yahoo.com>
      Reported-by: Jerome Poulin <jeromepoulin@gmail.com>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: Jerome Poulin <jeromepoulin@gmail.com>
      Cc: stable@kernel.org
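      radix_tree_preload() returns with preemption disabled and
      radix_tree_preload_end() re-enables it, so every retry iteration needs
      its own preload. A user-space model of that pairing, with a plain
      counter standing in for the preempt count (hypothetical names, not the
      nilfs2 code):

      #include <stdio.h>

      static int preempt_count;                /* models the kernel preempt count */

      static int preload(void)                 /* radix_tree_preload()-like */
      {
              preempt_count++;                 /* returns with "preemption disabled" */
              return 0;
      }

      static void preload_end(void)            /* radix_tree_preload_end()-like */
      {
              preempt_count--;
              if (preempt_count < 0)
                      printf("preempt count underflow!\n");
      }

      static int insert_with_retry(int retries)
      {
              if (preload())
                      return -1;
              while (retries-- > 0) {
                      /* ... attempt the insertion ... */
                      preload_end();           /* pairs with the preload above */
                      /* Retry path: a new attempt needs its own preload,
                       * otherwise the next preload_end() underflows the count. */
                      if (preload())
                              return -1;
              }
              preload_end();
              return 0;
      }

      int main(void)
      {
              insert_with_retry(2);
              printf("final preempt count: %d\n", preempt_count);
              return 0;
      }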
  9. 29 Aug 2009, 1 commit
  10. 28 Aug 2009, 4 commits
  11. 27 Aug 2009, 4 commits
  12. 25 Aug 2009, 2 commits
    • NFSv4: Fix an infinite looping problem with the nfs4_state_manager · 7111dc73
      Committed by Trond Myklebust
      Commit 76db6d95 (nfs41: add session setup
      to the state manager) introduces the possibility of an infinite loop in
      the NFSv4 state manager. By checking nfs4_has_session() before clearing
      the NFS4CLNT_SESSION_SETUP flag, it allows a situation where someone
      sets that flag but it never gets cleared, so the state manager loops
      (a sketch of this ordering pitfall follows the entry).
      
      In fact commit c3fad1b1 (nfs41: add session
      reset to state manager) causes this to happen every time we get a network
      partition error.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Tested-by: Daniel J Blueman <daniel.blueman@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
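      A schematic of the ordering problem in plain C (this is not the NFS
      code; the two booleans stand in for the NFS4CLNT_SESSION_SETUP flag and
      nfs4_has_session()): if the flag is only cleared when the session check
      passes, a client without session support never clears it and the loop
      never terminates.

      #include <stdbool.h>
      #include <stdio.h>

      static bool session_setup_flag = true;   /* someone requested session setup */
      static bool has_session;                 /* false: no session support here */

      /* Buggy ordering: the flag is only cleared when has_session is true, so
       * on a client without session support it stays set and the state
       * manager would loop forever. */
      static bool handle_flags_buggy(void)
      {
              if (session_setup_flag && has_session) {
                      session_setup_flag = false;
                      /* ... set up the session ... */
              }
              return session_setup_flag;       /* still set -> caller loops again */
      }

      /* Safer ordering: consume (clear) the flag first, then decide what to do. */
      static bool handle_flags_fixed(void)
      {
              if (session_setup_flag) {
                      session_setup_flag = false;
                      if (has_session) {
                              /* ... set up the session ... */
                      }
              }
              return session_setup_flag;
      }

      int main(void)
      {
              printf("buggy: flag still set? %d\n", handle_flags_buggy());
              session_setup_flag = true;
              printf("fixed: flag still set? %d\n", handle_flags_fixed());
              return 0;
      }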
    • mm: fix hugetlb bug due to user_shm_unlock call · 353d5c30
      Committed by Hugh Dickins
      2.6.30's commit 8a0bdec1 removed
      user_shm_lock() calls in hugetlb_file_setup() but left the
      user_shm_unlock call in shm_destroy().
      
      In detail:
      Assume that can_do_hugetlb_shm() returns true and hence user_shm_lock()
      is not called in hugetlb_file_setup(). However, user_shm_unlock() is
      called unconditionally in shm_destroy(); subsequently,
      atomic_dec_and_lock(&up->__count) is executed in free_uid(), and if
      up->__count reaches zero, cleanup_user_struct() is also scheduled.
      
      Note that sched_destroy_user() is empty if CONFIG_USER_SCHED is not set.
      However, the ref counter up->__count unexpectedly becomes non-positive
      and the corresponding structs are freed even though there are live
      references to them, resulting in a kernel oops after a lot of
      shmget(SHM_HUGETLB)/shmctl(IPC_RMID) cycles with CONFIG_USER_SCHED set.
      
      Hugh changed Stefan's suggested patch: can_do_hugetlb_shm() at the
      time of shm_destroy() may give a different answer than at the time
      of hugetlb_file_setup(). He also fixed newseg()'s no_id error path,
      which has been missing a user_shm_unlock() ever since it was
      introduced in 2.6.9.
      Reported-by: Stefan Huber <shuber2@gmail.com>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Tested-by: Stefan Huber <shuber2@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 24 Aug 2009, 3 commits
    • ext3: Improve error message that changing journaling mode on remount is not possible · 3c4cec65
      Committed by Jan Kara
      This patch makes the error message about changing journaling mode on remount
      more descriptive. Some people are going to hit this error now due to commit
      bbae8bcc if they configure a kernel to default
      to data=writeback mode. The problem happens if they have data=ordered set for
      the root filesystem in /etc/fstab but not in the kernel command line (and they
      don't use an initrd). Their filesystem then gets mounted as
      data=writeback by the kernel, but then their boot fails because the
      init scripts won't be able to remount the filesystem read-write. A
      better error message will hopefully make it easier for them to find the
      error in their setup and bother us less with error reports :).
      Signed-off-by: Jan Kara <jack@suse.cz>
    • ext3: Update Kconfig description of EXT3_DEFAULTS_TO_ORDERED · 6d418076
      Committed by Theodore Ts'o
      The old description for this configuration option was perhaps not
      completely balanced in terms of describing the tradeoffs of using a
      default of data=writeback vs. data=ordered.  Despite the fact that the
      old description very strongly recommended disabling this feature, all of
      the major distributions have elected to preserve the existing 'legacy'
      default, which is a strong hint that it perhaps wasn't telling the
      whole story.
      
      This revised description has been vetted by a number of ext3
      developers as being better at informing the user about the tradeoffs
      of enabling or disabling this configuration feature.
      
      Cc: linux-ext4@vger.kernel.org
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NJan Kara <jack@suse.cz>
      6d418076
    • kernel_read: redefine offset type · 6777d773
      Committed by Mimi Zohar
      vfs_read() offset is defined as loff_t, but kernel_read()
      offset is only defined as unsigned long. Redefine
      kernel_read() offset as loff_t.
      
      Cc: stable@kernel.org
      Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
      Signed-off-by: James Morris <jmorris@namei.org>
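      A small user-space illustration of why an unsigned long offset is too
      narrow (int64_t stands in for loff_t here; not the kernel code): on a
      32-bit build, where long is 32 bits, an offset past 4 GiB is silently
      truncated.

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              int64_t offset = 5LL * 1024 * 1024 * 1024;     /* 5 GiB, fits in a 64-bit offset */
              unsigned long narrow = (unsigned long)offset;  /* truncated if sizeof(long) == 4 */

              printf("64-bit offset: %lld\n", (long long)offset);
              printf("unsigned long: %lu (sizeof(long) = %zu)\n", narrow, sizeof(long));
              return 0;
      }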
  14. 22 Aug 2009, 1 commit
    • Re-introduce page mapping check in mark_buffer_dirty() · 8e9d78ed
      Committed by Linus Torvalds
      In commit a8e7d49a ("Fix race in
      create_empty_buffers() vs __set_page_dirty_buffers()"), I removed a test
      for a NULL page mapping unintentionally when some of the code inside
      __set_page_dirty() was moved to the callers.
      
      That removal generally didn't matter, since a filesystem would serialize
      truncation (which clears the page mapping) against writing (which marks
      the buffer dirty), so locking at a higher level (either per-page or an
      inode at a time) should mean that the buffer page would be stable.  And
      indeed, nothing bad seemed to happen.
      
      Except it turns out that apparently reiserfs does something odd when
      under load and writing out the journal, and we have a number of bugzilla
      entries that look similar:
      
      	http://bugzilla.kernel.org/show_bug.cgi?id=13556
      	http://bugzilla.kernel.org/show_bug.cgi?id=13756
      	http://bugzilla.kernel.org/show_bug.cgi?id=13876
      
      and it looks like reiserfs depended on that check (the common theme
      seems to be "data=journal", and a journal writeback during a truncate).
      
      I suspect reiserfs should have some additional locking, but in the
      meantime this should get us back to the pre-2.6.29 behavior.
      Pattern-pointed-out-by: Roland Kletzing <devzero@web.de>
      Cc: stable@kernel.org (2.6.29 and 2.6.30)
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 21 Aug 2009, 3 commits
  16. 19 Aug 2009, 3 commits
    • mm: revert "oom: move oom_adj value" · 0753ba01
      Committed by KOSAKI Motohiro
      Commit 2ff05b2b (oom: move oom_adj value) moved the oom_adj value to
      the mm_struct.  It was a very good first step toward sanitizing OOM
      handling.

      However, Paul Menage reported that the commit causes a regression in
      his job scheduler: the current OOM logic can kill an OOM_DISABLED
      process.
      
      Why? His program has code similar to the following.
      
      	...
      	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
      	...
      	if (vfork() == 0) {
      		set_oom_adj(0); /* Invoked child can be killed */
      		execve("foo-bar-cmd");
      	}
      	....
      
      A vfork() parent and child share the same mm_struct, so the
      set_oom_adj(0) above doesn't only change oom_adj for the vfork() child,
      it also changes oom_adj for the vfork() parent.  The vfork() parent
      (the job scheduler) then lost its OOM immunity and was killed.
      
      Actually, the fork-setting-exec idiom is very frequently used in
      userland programs; we must not break this assumption.

      This patch therefore reverts commit 2ff05b2b and the related commits.
      
      Reverted commit list
      ---------------------
      - commit 2ff05b2b (oom: move oom_adj value from task_struct to mm_struct)
      - commit 4d8b9135 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
      - commit 81236810 (oom: only oom kill exiting tasks with attached memory)
      - commit 933b787b (mm: copy over oom_adj value at fork time)
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: make get_sb_pseudo set s_maxbytes to value that can be cast to signed · 89a4eb4b
      Committed by Jeff Layton
      get_sb_pseudo sets s_maxbytes to ~0ULL which becomes negative when cast
      to a signed value.  Fix it to use MAX_LFS_FILESIZE which casts properly
      to a positive signed value.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Steve French <smfrench@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Robert Love <rlove@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
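      A user-space illustration of the signedness issue (LLONG_MAX stands in
      for MAX_LFS_FILESIZE; not the VFS code): ~0ULL has the top bit set, so
      reinterpreted as a signed 64-bit value on a two's-complement machine it
      reads as -1, while a proper limit stays positive.

      #include <limits.h>
      #include <stdio.h>

      int main(void)
      {
              unsigned long long all_ones = ~0ULL;
              /* Out-of-range conversion is implementation-defined; -1 on
               * the usual two's-complement targets Linux runs on. */
              long long as_signed = (long long)all_ones;

              printf("~0ULL as signed: %lld\n", as_signed);
              printf("LLONG_MAX:       %lld\n", LLONG_MAX);
              return 0;
      }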
    • nilfs2: fix oopses with doubly mounted snapshots · a9245860
      Committed by Ryusuke Konishi
      This will fix kernel oopses like the following:
      
       # mount -t nilfs2 -r -o cp=20 /dev/sdb1 /test1
       # mount -t nilfs2 -r -o cp=20 /dev/sdb1 /test2
       # umount /test1
       # umount /test2
      
      BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1069
      in_atomic(): 0, irqs_disabled(): 1, pid: 3886, name: umount.nilfs2
      1 lock held by umount.nilfs2/3886:
       #0:  (&type->s_umount_key#31){+.+...}, at: [<c10b398a>] deactivate_super+0x52/0x6c
      irq event stamp: 1219
      hardirqs last  enabled at (1219): [<c135c774>] __mutex_unlock_slowpath+0xf8/0x119
      hardirqs last disabled at (1218): [<c135c6d5>] __mutex_unlock_slowpath+0x59/0x119
      softirqs last  enabled at (1214): [<c1033316>] __do_softirq+0x1a5/0x1ad
      softirqs last disabled at (1205): [<c1033354>] do_softirq+0x36/0x5a
      Pid: 3886, comm: umount.nilfs2 Not tainted 2.6.31-rc6 #55
      Call Trace:
       [<c1023549>] __might_sleep+0x107/0x10e
       [<c13603c0>] do_page_fault+0x246/0x397
       [<c136017a>] ? do_page_fault+0x0/0x397
       [<c135e753>] error_code+0x6b/0x70
       [<c136017a>] ? do_page_fault+0x0/0x397
       [<c104f805>] ? __lock_acquire+0x91/0x12fd
       [<c1050a62>] ? __lock_acquire+0x12ee/0x12fd
       [<c1050a62>] ? __lock_acquire+0x12ee/0x12fd
       [<c1050b2b>] lock_acquire+0xba/0xdd
       [<d0d17d3f>] ? nilfs_detach_segment_constructor+0x2f/0x2fa [nilfs2]
       [<c135d4fe>] down_write+0x2a/0x46
       [<d0d17d3f>] ? nilfs_detach_segment_constructor+0x2f/0x2fa [nilfs2]
       [<d0d17d3f>] nilfs_detach_segment_constructor+0x2f/0x2fa [nilfs2]
       [<c104ea2c>] ? mark_held_locks+0x43/0x5b
       [<c104ecb1>] ? trace_hardirqs_on_caller+0x10b/0x133
       [<c104ece4>] ? trace_hardirqs_on+0xb/0xd
       [<d0d09ac1>] nilfs_put_super+0x2f/0xca [nilfs2]
       [<c10b3352>] generic_shutdown_super+0x49/0xb8
       [<c10b33de>] kill_block_super+0x1d/0x31
       [<c10e6599>] ? vfs_quota_off+0x0/0x12
       [<c10b398f>] deactivate_super+0x57/0x6c
       [<c10c4bc3>] mntput_no_expire+0x8c/0xb4
       [<c10c5094>] sys_umount+0x27f/0x2a4
       [<c10c50c6>] sys_oldumount+0xd/0xf
       [<c10031a4>] sysenter_do_call+0x12/0x38
       ...
      
      This turns out to be a bug introduced by an -rc1 patch ("nilfs2:
      simplify remaining sget() use").

      In that patch, a new "put resource" function, nilfs_put_sbinfo(),
      was introduced to delay freeing the nilfs_sb_info struct.

      But nilfs_put_sbinfo() mistakenly used the atomic_dec_and_test()
      function to check the reference count, which caused the nilfs_sb_info
      to be freed when a user mounted a snapshot twice.

      This bug also suggests there was an unseen memory leak in ordinary
      mount/umount operations for nilfs.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
  17. 18 Aug 2009, 1 commit