1. 11 9月, 2009 4 次提交
    • J
      writeback: get rid of pdflush completely · d0bceac7
      Jens Axboe 提交于
      It is now unused, so kill it off.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      d0bceac7
    • J
      writeback: switch to per-bdi threads for flushing data · 03ba3782
      Jens Axboe 提交于
      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
      pdflush writeout suffers from lack of locality and also requires more
      threads to handle the same workload, since it has to work in a
      non-blocking fashion against each queue. This also introduces lumpy
      behaviour and potential request starvation, since pdflush can be starved
      for queue access if others are accessing it. A sample ffsb workload that
      does random writes to files is about 8% faster here on a simple SATA drive
      during the benchmark phase. File layout also seems a LOT more smooth in
      vmstat:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
       0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
       1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
       0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
       0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
       0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
       0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
       0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
       0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
       0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45
      
      where vanilla tends to fluctuate a lot in the creation phase:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
       1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
       0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
       0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
       1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
       0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
       0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
       1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
       0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
       1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
       1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
       0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54
      
      A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
      SSD based writeback test on XFS performs over 20% better as well, with
      the throughput being very stable around 1GB/sec, where pdflush only
      manages 750MB/sec and fluctuates wildly while doing so. Random buffered
      writes to many files behave a lot better as well, as does random mmap'ed
      writes.
      
      A separate thread is added to sync the super blocks. In the long term,
      adding sync_supers_bdi() functionality could get rid of this thread again.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      03ba3782
    • J
      writeback: move dirty inodes from super_block to backing_dev_info · 66f3b8e2
      Jens Axboe 提交于
      This is a first step at introducing per-bdi flusher threads. We should
      have no change in behaviour, although sb_has_dirty_inodes() is now
      ridiculously expensive, as there's no easy way to answer that question.
      Not a huge problem, since it'll be deleted in subsequent patches.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      66f3b8e2
    • J
      writeback: get rid of generic_sync_sb_inodes() export · d8a8559c
      Jens Axboe 提交于
      This adds two new exported functions:
      
      - writeback_inodes_sb(), which only attempts to writeback dirty inodes on
        this super_block, for WB_SYNC_NONE writeout.
      - sync_inodes_sb(), which writes out all dirty inodes on this super_block
        and also waits for the IO to complete.
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      d8a8559c
  2. 10 9月, 2009 1 次提交
    • R
      binfmt_elf: fix PT_INTERP bss handling · 752015d1
      Roland McGrath 提交于
      In fs/binfmt_elf.c, load_elf_interp() calls padzero() for .bss even if
      the PT_LOAD has no PROT_WRITE and no .bss.  This generates EFAULT.
      
      Here is a small test case.  (Yes, there are other, useful PT_INTERP
      which have only .text and no .data/.bss.)
      
      	----- ptinterp.S
      	_start: .globl _start
      		 nop
      		 int3
      	-----
      	$ gcc -m32 -nostartfiles -nostdlib -o ptinterp ptinterp.S
      	$ gcc -m32 -Wl,--dynamic-linker=ptinterp -o hello hello.c
      	$ ./hello
      	Segmentation fault  # during execve() itself
      
      	After applying the patch:
      	$ ./hello
      	Trace trap  # user-mode execution after execve() finishes
      
      If the ELF headers are actually self-inconsistent, then dying is fine.
      But having no PROT_WRITE segment is perfectly normal and correct if
      there is no segment with p_memsz > p_filesz (i.e. bss).  John Reiser
      suggested checking for PROT_WRITE in the bss logic.  I think it makes
      most sense to simply apply the bss logic only when there is bss.
      
      This patch looks less trivial than it is due to some reindentation.
      It just moves the "if (last_bss > elf_bss) {" test up to include the
      partial-page bss logic as well as the more-pages bss logic.
      Reported-by: NJohn Reiser <jreiser@bitwagon.com>
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      752015d1
  3. 09 9月, 2009 7 次提交
  4. 07 9月, 2009 1 次提交
  5. 06 9月, 2009 2 次提交
    • N
      ext2: fix unbalanced kmap()/kunmap() · 9de6886e
      Nicolas Pitre 提交于
      In ext2_rename(), dir_page is acquired through ext2_dotdot().  It is
      then released through ext2_set_link() but only if old_dir != new_dir.
      Failing that, the pkmap reference count is never decremented and the
      page remains pinned forever.  Repeat that a couple times with highmem
      pages and all pkmap slots get exhausted, and every further kmap() calls
      end up stalling on the pkmap_map_wait queue at which point the whole
      system comes to a halt.
      Signed-off-by: NNicolas Pitre <nico@marvell.com>
      Acked-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9de6886e
    • O
      exec: do not sleep in TASK_TRACED under ->cred_guard_mutex · a2a8474c
      Oleg Nesterov 提交于
      Tom Horsley reports that his debugger hangs when it tries to read
      /proc/pid_of_tracee/maps, this happens since
      
      	"mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
      	04b836cbf19e885f8366bccb2e4b0474346c02d
      
      commit in 2.6.31.
      
      But the root of the problem lies in the fact that do_execve() path calls
      tracehook_report_exec() which can stop if the tracer sets PT_TRACE_EXEC.
      
      The tracee must not sleep in TASK_TRACED holding this mutex.  Even if we
      remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
      another task doing PTRACE_ATTACH should not hang until it is killed or the
      tracee resumes.
      
      With this patch do_execve() does not use ->cred_guard_mutex directly and
      we do not hold it throughout, instead:
      
      	- introduce prepare_bprm_creds() helper, it locks the mutex
      	  and calls prepare_exec_creds() to initialize bprm->cred.
      
      	- install_exec_creds() drops the mutex after commit_creds(),
      	  and thus before tracehook_report_exec()->ptrace_stop().
      
      	  or, if exec fails,
      
      	  free_bprm() drops this mutex when bprm->cred != NULL which
      	  indicates install_exec_creds() was not called.
      Reported-by: NTom Horsley <tom.horsley@att.net>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2a8474c
  6. 05 9月, 2009 1 次提交
  7. 03 9月, 2009 1 次提交
  8. 02 9月, 2009 1 次提交
  9. 01 9月, 2009 1 次提交
  10. 31 8月, 2009 1 次提交
    • R
      nilfs2: fix preempt count underflow in nilfs_btnode_prepare_change_key · b1f1b8ce
      Ryusuke Konishi 提交于
      This will fix the following preempt count underflow reported from
      users with the title "[NILFS users] segctord problem" (Message-ID:
      <949415.6494.qm@web58808.mail.re1.yahoo.com> and Message-ID:
      <debc30fc0908270825v747c1734xa59126623cfd5b05@mail.gmail.com>):
      
       WARNING: at kernel/sched.c:4890 sub_preempt_count+0x95/0xa0()
       Hardware name: HP Compaq 6530b (KR980UT#ABC)
       Modules linked in: bridge stp llc bnep rfcomm l2cap xfs exportfs nilfs2 cowloop loop vboxnetadp vboxnetflt vboxdrv btusb bluetooth uvcvideo videodev v4l1_compat v4l2_compat_ioctl32 arc4 snd_hda_codec_analog ecb iwlagn iwlcore rfkill lib80211 mac80211 snd_hda_intel snd_hda_codec ehci_hcd uhci_hcd usbcore snd_hwdep snd_pcm tg3 cfg80211 psmouse snd_timer joydev libphy ohci1394 snd_page_alloc hp_accel lis3lv02d ieee1394 led_class i915 drm i2c_algo_bit video backlight output i2c_core dm_crypt dm_mod
       Pid: 4197, comm: segctord Not tainted 2.6.30-gentoo-r4-64 #7
       Call Trace:
        [<ffffffff8023fa05>] ? sub_preempt_count+0x95/0xa0
        [<ffffffff802470f8>] warn_slowpath_common+0x78/0xd0
        [<ffffffff8024715f>] warn_slowpath_null+0xf/0x20
        [<ffffffff8023fa05>] sub_preempt_count+0x95/0xa0
        [<ffffffffa04ce4db>] nilfs_btnode_prepare_change_key+0x11b/0x190 [nilfs2]
        [<ffffffffa04d01ad>] nilfs_btree_assign_p+0x19d/0x1e0 [nilfs2]
        [<ffffffffa04d10ad>] nilfs_btree_assign+0xbd/0x130 [nilfs2]
        [<ffffffffa04cead7>] nilfs_bmap_assign+0x47/0x70 [nilfs2]
        [<ffffffffa04d9bc6>] nilfs_segctor_do_construct+0x956/0x20f0 [nilfs2]
        [<ffffffff805ac8e2>] ? _spin_unlock_irqrestore+0x12/0x40
        [<ffffffff803c06e0>] ? __up_write+0xe0/0x150
        [<ffffffff80262959>] ? up_write+0x9/0x10
        [<ffffffffa04ce9f3>] ? nilfs_bmap_test_and_clear_dirty+0x43/0x60 [nilfs2]
        [<ffffffffa04cd627>] ? nilfs_mdt_fetch_dirty+0x27/0x60 [nilfs2]
        [<ffffffffa04db5fc>] nilfs_segctor_construct+0x8c/0xd0 [nilfs2]
        [<ffffffffa04dc3dc>] nilfs_segctor_thread+0x15c/0x3a0 [nilfs2]
        [<ffffffffa04dbe20>] ? nilfs_construction_timeout+0x0/0x10 [nilfs2]
        [<ffffffff80252633>] ? add_timer+0x13/0x20
        [<ffffffff802370da>] ? __wake_up_common+0x5a/0x90
        [<ffffffff8025e960>] ? autoremove_wake_function+0x0/0x40
        [<ffffffffa04dc280>] ? nilfs_segctor_thread+0x0/0x3a0 [nilfs2]
        [<ffffffffa04dc280>] ? nilfs_segctor_thread+0x0/0x3a0 [nilfs2]
        [<ffffffff8025e556>] kthread+0x56/0x90
        [<ffffffff8020cdea>] child_rip+0xa/0x20
        [<ffffffff8025e500>] ? kthread+0x0/0x90
        [<ffffffff8020cde0>] ? child_rip+0x0/0x20
      
      This problem was caused due to a missing radix_tree_preload() call in
      the retry path of nilfs_btnode_prepare_change_key() function.
      Reported-by: NEric A <eric225125@yahoo.com>
      Reported-by: NJerome Poulin <jeromepoulin@gmail.com>
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: NJerome Poulin <jeromepoulin@gmail.com>
      Cc: stable@kernel.org
      b1f1b8ce
  11. 29 8月, 2009 1 次提交
  12. 28 8月, 2009 4 次提交
  13. 27 8月, 2009 4 次提交
  14. 25 8月, 2009 2 次提交
    • T
      NFSv4: Fix an infinite looping problem with the nfs4_state_manager · 7111dc73
      Trond Myklebust 提交于
      Commit 76db6d95 (nfs41: add session setup
      to the state manager) introduces an infinite loop possibility in the NFSv4
      state manager. By first checking nfs4_has_session() before clearing the
      NFS4CLNT_SESSION_SETUP flag, it allows for a situation where someone sets
      that flag, but it never gets cleared, and so the state manager loops.
      
      In fact commit c3fad1b1 (nfs41: add session
      reset to state manager) causes this to happen every time we get a network
      partition error.
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      Tested-by: NDaniel J Blueman <daniel.blueman@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7111dc73
    • H
      mm: fix hugetlb bug due to user_shm_unlock call · 353d5c30
      Hugh Dickins 提交于
      2.6.30's commit 8a0bdec1 removed
      user_shm_lock() calls in hugetlb_file_setup() but left the
      user_shm_unlock call in shm_destroy().
      
      In detail:
      Assume that can_do_hugetlb_shm() returns true and hence user_shm_lock()
      is not called in hugetlb_file_setup(). However, user_shm_unlock() is
      called in any case in shm_destroy() and in the following
      atomic_dec_and_lock(&up->__count) in free_uid() is executed and if
      up->__count gets zero, also cleanup_user_struct() is scheduled.
      
      Note that sched_destroy_user() is empty if CONFIG_USER_SCHED is not set.
      However, the ref counter up->__count gets unexpectedly non-positive and
      the corresponding structs are freed even though there are live
      references to them, resulting in a kernel oops after a lots of
      shmget(SHM_HUGETLB)/shmctl(IPC_RMID) cycles and CONFIG_USER_SCHED set.
      
      Hugh changed Stefan's suggested patch: can_do_hugetlb_shm() at the
      time of shm_destroy() may give a different answer from at the time
      of hugetlb_file_setup().  And fixed newseg()'s no_id error path,
      which has missed user_shm_unlock() ever since it came in 2.6.9.
      Reported-by: NStefan Huber <shuber2@gmail.com>
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Tested-by: NStefan Huber <shuber2@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      353d5c30
  15. 24 8月, 2009 3 次提交
    • J
      ext3: Improve error message that changing journaling mode on remount is not possible · 3c4cec65
      Jan Kara 提交于
      This patch makes the error message about changing journaling mode on remount
      more descriptive. Some people are going to hit this error now due to commit
      bbae8bcc if they configure a kernel to default
      to data=writeback mode. The problem happens if they have data=ordered set for
      the root filesystem in /etc/fstab but not in the kernel command line (and they
      don't use initrd). Their filesystem then gets mounted as data=writeback by
      kernel but then their boot fails because init scripts won't be able to remount
      the filesystem rw. Better error message will hopefully make it easier for them
      to find the error in their setup and bother us less with error reports :).
      Signed-off-by: NJan Kara <jack@suse.cz>
      3c4cec65
    • T
      ext3: Update Kconfig description of EXT3_DEFAULTS_TO_ORDERED · 6d418076
      Theodore Ts'o 提交于
      The old description for this configuration option was perhaps not
      completely balanced in terms of describing the tradeoffs of using a
      default of data=writeback vs. data=ordered.  Despite the fact that old
      description very strongly recomended disabling this feature, all of
      the major distributions have elected to preserve the existing 'legacy'
      default, which is a strong hint that it perhaps wasn't telling the
      whole story.
      
      This revised description has been vetted by a number of ext3
      developers as being better at informing the user about the tradeoffs
      of enabling or disabling this configuration feature.
      
      Cc: linux-ext4@vger.kernel.org
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NJan Kara <jack@suse.cz>
      6d418076
    • M
      kernel_read: redefine offset type · 6777d773
      Mimi Zohar 提交于
      vfs_read() offset is defined as loff_t, but kernel_read()
      offset is only defined as unsigned long. Redefine
      kernel_read() offset as loff_t.
      
      Cc: stable@kernel.org
      Signed-off-by: NMimi Zohar <zohar@us.ibm.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      6777d773
  16. 22 8月, 2009 1 次提交
    • L
      Re-introduce page mapping check in mark_buffer_dirty() · 8e9d78ed
      Linus Torvalds 提交于
      In commit a8e7d49a ("Fix race in
      create_empty_buffers() vs __set_page_dirty_buffers()"), I removed a test
      for a NULL page mapping unintentionally when some of the code inside
      __set_page_dirty() was moved to the callers.
      
      That removal generally didn't matter, since a filesystem would serialize
      truncation (which clears the page mapping) against writing (which marks
      the buffer dirty), so locking at a higher level (either per-page or an
      inode at a time) should mean that the buffer page would be stable.  And
      indeed, nothing bad seemed to happen.
      
      Except it turns out that apparently reiserfs does something odd when
      under load and writing out the journal, and we have a number of bugzilla
      entries that look similar:
      
      	http://bugzilla.kernel.org/show_bug.cgi?id=13556
      	http://bugzilla.kernel.org/show_bug.cgi?id=13756
      	http://bugzilla.kernel.org/show_bug.cgi?id=13876
      
      and it looks like reiserfs depended on that check (the common theme
      seems to be "data=journal", and a journal writeback during a truncate).
      
      I suspect reiserfs should have some additional locking, but in the
      meantime this should get us back to the pre-2.6.29 behavior.
      Pattern-pointed-out-by: NRoland Kletzing <devzero@web.de>
      Cc: stable@kernel.org (2.6.29 and 2.6.30)
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e9d78ed
  17. 21 8月, 2009 3 次提交
  18. 19 8月, 2009 2 次提交