1. 17 12月, 2010 2 次提交
    • T
      implement in-kernel gendisk events handling · 77ea887e
      Tejun Heo 提交于
      Currently, media presence polling for removeable block devices is done
      from userland.  There are several issues with this.
      
      * Polling is done by periodically opening the device.  For SCSI
        devices, the command sequence generated by such action involves a
        few different commands including TEST_UNIT_READY.  This behavior,
        while perfectly legal, is different from Windows which only issues
        single command, GET_EVENT_STATUS_NOTIFICATION.  Unfortunately, some
        ATAPI devices lock up after being periodically queried such command
        sequences.
      
      * There is no reliable and unintrusive way for a userland program to
        tell whether the target device is safe for media presence polling.
        For example, polling for media presence during an on-going burning
        session can make it fail.  The polling program can avoid this by
        opening the device with O_EXCL but then it risks making a valid
        exclusive user of the device fail w/ -EBUSY.
      
      * Userland polling is unnecessarily heavy and in-kernel implementation
        is lighter and better coordinated (workqueue, timer slack).
      
      This patch implements framework for in-kernel disk event handling,
      which includes media presence polling.
      
      * bdops->check_events() is added, which supercedes ->media_changed().
        It should check whether there's any pending event and return if so.
        Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
        DISK_EVENT_EJECT_REQUEST.  ->check_events() is guaranteed not to be
        called parallelly.
      
      * gendisk->events and ->async_events are added.  These should be
        initialized by block driver before passing the device to add_disk().
        The former contains the mask of all supported events and the latter
        the mask of all events which the device can report without polling.
        /sys/block/*/events[_async] export these to userland.
      
      * Kernel parameter block.events_dfl_poll_msecs controls the system
        polling interval (default is 0 which means disable) and
        /sys/block/*/events_poll_msecs control polling intervals for
        individual devices (default is -1 meaning use system setting).  Note
        that if a device can report all supported events asynchronously and
        its polling interval isn't explicitly set, the device won't be
        polled regardless of the system polling interval.
      
      * If a device is opened exclusively with write access, event checking
        is automatically disabled until all write exclusive accesses are
        released.
      
      * There are event 'clearing' events.  For example, both of currently
        defined events are cleared after the device has been successfully
        opened.  This information is passed to ->check_events() callback
        using @clearing argument as a hint.
      
      * Event checking is always performed from system_nrt_wq and timer
        slack is set to 25% for polling.
      
      * Nothing changes for drivers which implement ->media_changed() but
        not ->check_events().  Going forward, all drivers will be converted
        to ->check_events() and ->media_change() will be dropped.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      77ea887e
    • T
      block: move register_disk() and del_gendisk() to block/genhd.c · d2bf1b67
      Tejun Heo 提交于
      There's no reason for register_disk() and del_gendisk() to be in
      fs/partitions/check.c.  Move both to genhd.c.  While at it, collapse
      unlink_gendisk(), which was artificially in a separate function due to
      genhd.c / check.c split, into del_gendisk().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      d2bf1b67
  2. 15 11月, 2010 1 次提交
    • S
      GFS2: Fix inode deallocation race · 044b9414
      Steven Whitehouse 提交于
      This area of the code has always been a bit delicate due to the
      subtleties of lock ordering. The problem is that for "normal"
      alloc/dealloc, we always grab the inode locks first and the rgrp lock
      later.
      
      In order to ensure no races in looking up the unlinked, but still
      allocated inodes, we need to hold the rgrp lock when we do the lookup,
      which means that we can't take the inode glock.
      
      The solution is to borrow the technique already used by NFS to solve
      what is essentially the same problem (given an inode number, look up
      the inode carefully, checking that it really is in the expected
      state).
      
      We cannot do that directly from the allocation code (lock ordering
      again) so we give the job to the pre-existing delete workqueue and
      carry on with the allocation as normal.
      
      If we find there is no space, we do a journal flush (required anyway
      if space from a deallocation is to be released) which should block
      against the pending deallocations, so we should always get the space
      back.
      Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
      044b9414
  3. 13 11月, 2010 7 次提交
    • T
      ocfs2: Change some lock status member in ocfs2_lock_res to char. · 1c66b360
      Tao Ma 提交于
      Commit 83fd9c7f changes l_level, l_requested and l_blocking of
      ocfs2_lock_res from int to unsigned char. But actually it is
      initially as -1(ocfs2_lock_res_init_common) which
      correspoding to 255 for unsigned char. So the whole dlm lock
      mechanism doesn't work now which means a disaster to ocfs2.
      
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      1c66b360
    • T
      block: clean up blkdev_get() wrappers and their users · d4d77629
      Tejun Heo 提交于
      After recent blkdev_get() modifications, open_by_devnum() and
      open_bdev_exclusive() are simple wrappers around blkdev_get().
      Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
      
      blkdev_get_by_dev() is identical to open_by_devnum().
      blkdev_get_by_path() is slightly different in that it doesn't
      automatically add %FMODE_EXCL to @mode.
      
      All users are converted.  Most conversions are mechanical and don't
      introduce any behavior difference.  There are several exceptions.
      
      * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
        reason to OR it explicitly on blkdev_put().
      
      * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
        sb->s_mode.
      
      * With the above changes, sb->s_mode now always should contain
        FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
        errors.
      
      The new blkdev_get_*() functions are with proper docbook comments.
      While at it, add function description to blkdev_get() too.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Joern Engel <joern@lazybastard.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: xfs-masters@oss.sgi.com
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      d4d77629
    • T
      block: check bdev_read_only() from blkdev_get() · 75f1dc0d
      Tejun Heo 提交于
      bdev read-only status can be queried using bdev_read_only() and may
      change while the device is being opened.  Enforce it by checking it
      from blkdev_get() after open succeeds.
      
      This makes bdev_read_only() check in open_bdev_exclusive() and
      fsg_lun_open() unnecessary.  Drop them.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: David Brownell <dbrownell@users.sourceforge.net>
      Cc: linux-usb@vger.kernel.org
      75f1dc0d
    • T
      block: reorganize claim/release implementation · 6a027eff
      Tejun Heo 提交于
      With claim/release rolled into blkdev_get/put(), there's no reason to
      keep bd_abort/finish_claim(), __bd_claim() and bd_release() as
      separate functions.  It only makes the code difficult to follow.
      Collapse them into blkdev_get/put().  This will ease future changes
      around claim/release.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      6a027eff
    • T
      block: make blkdev_get/put() handle exclusive access · e525fd89
      Tejun Heo 提交于
      Over time, block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      
      * blkdev_get/put() are the primary open and close functions.
      
      * bd_claim/release() deal with exclusive open.
      
      * open/close_bdev_exclusive() are combination of open and claim and
        the other way around, respectively.
      
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      
      * open_by_devnum() wraps bdget() + blkdev_get().
      
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access as
      in-kernel open + claim sequence can disturb the existing exclusive
      open even before the block layer knows the current open if for another
      exclusive access.  Reorganize the interface such that,
      
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if is @FMODE_EXCL specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      
      Most of bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at the open time (as
      it should).  This cleans up code and removes some, both valid and
      invalid, but unnecessary all the same, corner cases.
      
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      
      Most conversions are straight-forward.  drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      same.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Brown <neilb@suse.de>
      Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Acked-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      e525fd89
    • T
      block: simplify holder symlink handling · e09b457b
      Tejun Heo 提交于
      Code to manage symlinks in /sys/block/*/{holders|slaves} are overly
      complex with multiple holder considerations, redundant extra
      references to all involved kobjects, unused generic kobject holder
      support and unnecessary mixup with bd_claim/release functionalities.
      
      Strip it down to what's necessary (single gendisk holder) and make it
      use a separate interface.  This is a step for cleaning up
      bd_claim/release.  This patch makes dm-table slightly more complex but
      it will be simplified again with further changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Brown <neilb@suse.de>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      e09b457b
    • T
      btrfs: close_bdev_exclusive() should use the same @flags as the matching open_bdev_exclusive() · 37004c42
      Tejun Heo 提交于
      In the failure path of __btrfs_open_devices(), close_bdev_exclusive()
      is called with @flags which doesn't match the one used during
      open_bdev_exclusive().  Fix it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      37004c42
  4. 12 11月, 2010 1 次提交
  5. 11 11月, 2010 11 次提交
  6. 10 11月, 2010 4 次提交
  7. 09 11月, 2010 8 次提交
    • S
      cifs: fix a memleak in cifs_setattr_nounix() · 3565bd46
      Suresh Jayaraman 提交于
      Andrew Hendry reported a kmemleak warning in 2.6.37-rc1 while editing a
      text file with gedit over cifs.
      
      unreferenced object 0xffff88022ee08b40 (size 32):
        comm "gedit", pid 2524, jiffies 4300160388 (age 2633.655s)
        hex dump (first 32 bytes):
          5c 2e 67 6f 75 74 70 75 74 73 74 72 65 61 6d 2d  \.goutputstream-
          35 42 41 53 4c 56 00 de 09 00 00 00 2c 26 78 ee  5BASLV......,&x.
        backtrace:
          [<ffffffff81504a4d>] kmemleak_alloc+0x2d/0x60
          [<ffffffff81136e13>] __kmalloc+0xe3/0x1d0
          [<ffffffffa0313db0>] build_path_from_dentry+0xf0/0x230 [cifs]
          [<ffffffffa031ae1e>] cifs_setattr+0x9e/0x770 [cifs]
          [<ffffffff8115fe90>] notify_change+0x170/0x2e0
          [<ffffffff81145ceb>] sys_fchmod+0x10b/0x140
          [<ffffffff8100c172>] system_call_fastpath+0x16/0x1b
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The commit 1025774c that removed inode_setattr() seems to have introduced this
      memleak by returning early without freeing 'full_path'.
      Reported-by: NAndrew Hendry <andrew.hendry@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NSuresh Jayaraman <sjayaraman@suse.de>
      Signed-off-by: NSteve French <sfrench@us.ibm.com>
      3565bd46
    • M
      sparc: fix openpromfs compile · 0e154825
      Meelis Roos 提交于
      Fix openpromfs compilation by adding a missing semicolon in
      fs/openpromfs/inode.c openprom_mount().
      Signed-off-by: NMeelis Roos <mroos@linux.ee>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e154825
    • J
      cifs: make cifs_ioctl handle NULL filp->private_data correctly · 61876395
      Jeff Layton 提交于
      Commit 13cfb733 made cifs_ioctl use the tlink attached to the
      cifsFileInfo for a filp. This ignores the case of an open directory
      however, which in CIFS can have a NULL private_data until a readdir
      is done on it.
      
      This patch re-adds the NULL pointer checks that were removed in commit
      50ae28f0 and moves the setting of tcon and "caps" variables lower.
      
      Long term, a better fix would be to establish a f_op->open routine for
      directories that populates that field at open time, but that requires
      some other changes to how readdir calls are handled.
      Reported-by: NKjell Rune Skaaraas <kjella79@yahoo.no>
      Reviewed-and-Tested-by: NSuresh Jayaraman <sjayaraman@suse.de>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NSteve French <sfrench@us.ibm.com>
      61876395
    • T
      ext4: Add new ext4 inode tracepoints · 7ff9c073
      Theodore Ts'o 提交于
      Add ext4_evict_inode, ext4_drop_inode, ext4_mark_inode_dirty, and
      ext4_begin_ordered_truncate()
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      7ff9c073
    • T
      ext4: Don't call sb_issue_discard() in ext4_free_blocks() · b56ff9d3
      Theodore Ts'o 提交于
      Commit 5c521830 (ext4: Support discard requests when running in
      no-journal mode) attempts to add sb_issue_discard() for data blocks
      (in data=writeback mode) and in no-journal mode.  Unfortunately, this
      no longer works, because in commit dd3932ed (block: remove
      BLKDEV_IFL_WAIT), sb_issue_discard() only presents a synchronous
      interface, and there are times when we call ext4_free_blocks() when we
      are are holding a spinlock, or are otherwise in an atomic context.
      
      For now, I've removed the call to sb_issue_discard() to prevent a
      deadlock or (if spinlock debugging is enabled) failures like this:
      
      BUG: scheduling while atomic: rc.sysinit/1376/0x00000002
      Pid: 1376, comm: rc.sysinit Not tainted 2.6.36-ARCH #1
      Call Trace:
      [<ffffffff810397ce>] __schedule_bug+0x5e/0x70
      [<ffffffff81403110>] schedule+0x950/0xa70
      [<ffffffff81060bad>] ? insert_work+0x7d/0x90
      [<ffffffff81060fbd>] ? queue_work_on+0x1d/0x30
      [<ffffffff81061127>] ? queue_work+0x37/0x60
      [<ffffffff8140377d>] schedule_timeout+0x21d/0x360
      [<ffffffff812031c3>] ? generic_make_request+0x2c3/0x540
      [<ffffffff81402680>] wait_for_common+0xc0/0x150
      [<ffffffff81041490>] ? default_wake_function+0x0/0x10
      [<ffffffff812034bc>] ? submit_bio+0x7c/0x100
      [<ffffffff810680a0>] ? wake_bit_function+0x0/0x40
      [<ffffffff814027b8>] wait_for_completion+0x18/0x20
      [<ffffffff8120a969>] blkdev_issue_discard+0x1b9/0x210
      [<ffffffff811ba03e>] ext4_free_blocks+0x68e/0xb60
      [<ffffffff811b1650>] ? __ext4_handle_dirty_metadata+0x110/0x120
      [<ffffffff811b098c>] ext4_ext_truncate+0x8cc/0xa70
      [<ffffffff810d713e>] ? pagevec_lookup+0x1e/0x30
      [<ffffffff81191618>] ext4_truncate+0x178/0x5d0
      [<ffffffff810eacbb>] ? unmap_mapping_range+0xab/0x280
      [<ffffffff810d8976>] vmtruncate+0x56/0x70
      [<ffffffff811925cb>] ext4_setattr+0x14b/0x460
      [<ffffffff811319e4>] notify_change+0x194/0x380
      [<ffffffff81117f80>] do_truncate+0x60/0x90
      [<ffffffff811e08fa>] ? security_inode_permission+0x1a/0x20
      [<ffffffff811eaec1>] ? tomoyo_path_truncate+0x11/0x20
      [<ffffffff81127539>] do_last+0x5d9/0x770
      [<ffffffff811278bd>] do_filp_open+0x1ed/0x680
      [<ffffffff8140644f>] ? page_fault+0x1f/0x30
      [<ffffffff81132bfc>] ? alloc_fd+0xec/0x140
      [<ffffffff81118db1>] do_sys_open+0x61/0x120
      [<ffffffff81118e8b>] sys_open+0x1b/0x20
      [<ffffffff81002e6b>] system_call_fastpath+0x16/0x1b
      
      https://bugzilla.kernel.org/show_bug.cgi?id=22302Reported-by: NMathias Burén <mathias.buren@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: jiayingz@google.com
      b56ff9d3
    • D
      ext4: do not try to grab the s_umount semaphore in ext4_quota_off · 87009d86
      Dmitry Monakhov 提交于
      It's not needed to sync the filesystem, and it fixes a lock_dep complaint.
      Signed-off-by: NDmitry Monakhov <dmonakhov@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      87009d86
    • T
      ext4: fix potential race when freeing ext4_io_page structures · 83668e71
      Theodore Ts'o 提交于
      Use an atomic_t and make sure we don't free the structure while we
      might still be submitting I/O for that page.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      83668e71
    • T
      ext4: handle writeback of inodes which are being freed · f7ad6d2e
      Theodore Ts'o 提交于
      The following BUG can occur when an inode which is getting freed when
      it still has dirty pages outstanding, and it gets deleted (in this
      because it was the target of a rename).  In ordered mode, we need to
      make sure the data pages are written just in case we crash before the
      rename (or unlink) is committed.  If the inode is being freed then
      when we try to igrab the inode, we end up tripping the BUG_ON at
      fs/ext4/page-io.c:146.
      
      To solve this problem, we need to keep track of the number of io
      callbacks which are pending, and avoid destroying the inode until they
      have all been completed.  That way we don't have to bump the inode
      count to keep the inode from being destroyed; an approach which
      doesn't work because the count could have already been dropped down to
      zero before the inode writeback has started (at which point we're not
      allowed to bump the count back up to 1, since it's already started
      getting freed).
      
      Thanks to Dave Chinner for suggesting this approach, which is also
      used by XFS.
      
        kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
        Call Trace:
         [<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
         [<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
         [<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
         [<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
         [<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
         [<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
         [<ffffffff81087910>] do_writepages+0x1c/0x25
         [<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
         [<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
         [<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
         [<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
         [<ffffffff810c14a3>] evict+0x22/0x92
         [<ffffffff810c1a3d>] iput+0x212/0x249
         [<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
         [<ffffffff810bdf6b>] d_kill+0x3d/0x5d
         [<ffffffff810be613>] dput+0x13a/0x147
         [<ffffffff810b990d>] sys_renameat+0x1b5/0x258
         [<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
         [<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
         [<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
         [<ffffffff810b99c6>] sys_rename+0x16/0x18
         [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
      Reported-by: NNick Bowler <nbowler@elliptictech.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Tested-by: NNick Bowler <nbowler@elliptictech.com>
      f7ad6d2e
  8. 06 11月, 2010 1 次提交
  9. 05 11月, 2010 2 次提交
  10. 04 11月, 2010 1 次提交
  11. 03 11月, 2010 2 次提交