1. 21 8月, 2012 1 次提交
    • T
      workqueue: deprecate system_nrt[_freezable]_wq · 3b07e9ca
      Tejun Heo 提交于
      system_nrt[_freezable]_wq are now spurious.  Mark them deprecated and
      convert all users to system[_freezable]_wq.
      
      If you're cc'd and wondering what's going on: Now all workqueues are
      non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
      Please use system[_freezable]_wq instead.
      
      This patch doesn't make any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-By: NLai Jiangshan <laijs@cn.fujitsu.com>
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: David Howells <dhowells@redhat.com>
      3b07e9ca
  2. 14 8月, 2012 1 次提交
    • T
      workqueue: use mod_delayed_work() instead of cancel + queue · 41f63c53
      Tejun Heo 提交于
      Convert delayed_work users doing cancel_delayed_work() followed by
      queue_delayed_work() to mod_delayed_work().
      
      Most conversions are straight-forward.  Ones worth mentioning are,
      
      * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
        use mod_delayed_work() and cancel loop in
        edac_mc_reset_delay_period() is dropped.
      
      * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
        watchdog is active or not.  @fan_watchdog_active and related code
        dropped.
      
      * drivers/power/charger-manager.c: Seemingly a lot of
        delayed_work_pending() abuse going on here.
        [delayed_]work_pending() are unsynchronized and racy when used like
        this.  I converted one instance in fullbatt_handler().  Please
        conver the rest so that it invokes workqueue APIs for the intended
        target state rather than trying to game work item pending state
        transitions.  e.g. if timer should be modified - call
        mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().
      
      * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
        simplified.  Note that round_jiffies() calls in this function are
        meaningless.  round_jiffies() work on absolute jiffies not delta
        delay used by delayed_work.
      
      v2: Tomi pointed out that __cancel_delayed_work() users can't be
          safely converted to mod_delayed_work().  They could be calling it
          from irq context and if that happens while delayed_work_timer_fn()
          is running, it could deadlock.  __cancel_delayed_work() users are
          dropped.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NHenrique de Moraes Holschuh <hmh@hmh.eng.br>
      Acked-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      Acked-by: NAnton Vorontsov <cbouatmailru@gmail.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Doug Thompson <dougthompson@xmission.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: "John W. Linville" <linville@tuxdriver.com>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      41f63c53
  3. 03 8月, 2012 1 次提交
    • J
      block: Don't use static to define "void *p" in show_partition_start() · 06768067
      Jianpeng Ma 提交于
      I met a odd prblem:read /proc/partitions may return zero.
      
      I wrote a file test.c:
      int main()
      {
      	char buff[4096];
      	int ret;
      	int fd;
      	printf("pid=%d\n",getpid());
      	while (1) {
      		fd = open("/proc/partitions", O_RDONLY);
      		if (fd < 0) {
      			printf("open error %s\n", strerror(errno));
      			return 0;
      		}
      		ret = read(fd, buff, 4096);
      		if (ret <= 0)
      			printf("ret=%d, %s, %ld\n", ret,
      				strerror(errno), lseek(fd,0,SEEK_CUR));
      		close(fd);
      	}
      	exit(0);
      }
      
      You can reproduce by:
      1:while true;do cat /proc/partitions > /dev/null ;done
      2:./test
      
      I reviewed the code and found:
      
      >> static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
      >> {
      >> 	static void *p;
      >>
      >> 	p = disk_seqf_start(seqf, pos);
      >> 	if (!IS_ERR_OR_NULL(p) && !*pos)
      >> 		seq_puts(seqf, "major minor  #blocks  name\n\n");
      >> 	return p;
      >> }
      		test								cat /proc/partitions
      	p = disk_seqf_start()(Not NULL)
      									p = disk_seqf_start()(NULL because pos)
      	if (!IS_ERR_OR_NULL(p) && !*pos)
      Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      06768067
  4. 01 8月, 2012 1 次提交
    • V
      block: add partition resize function to blkpg ioctl · c83f6bf9
      Vivek Goyal 提交于
      Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
      allows altering the size of an existing partition, even if it is currently
      in use.
      
      This patch converts hd_struct->nr_sects into sequence counter because
      One might extend a partition while IO is happening to it and update of
      nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
      can lead to issues like reading inconsistent size of a partition. Sequence
      counter have been used so that readers don't have to take bdev mutex lock
      as we call sector_in_part() very frequently.
      
      Now all the access to hd_struct->nr_sects should happen using sequence
      counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
      There is one exception though, set_capacity()/get_capacity(). I think
      theoritically race should exist there too but this patch does not
      modify set_capacity()/get_capacity() due to sheer number of call sites
      and I am afraid that change might break something. I have left that as a
      TODO item. We can handle it later if need be. This patch does not introduce
      any new races as such w.r.t set_capacity()/get_capacity().
      
      v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NPhillip Susi <psusi@ubuntu.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c83f6bf9
  5. 15 5月, 2012 1 次提交
    • T
      block: fix buffer overflow when printing partition UUIDs · 05c69d29
      Tejun Heo 提交于
      6d1d8050 "block, partition: add partition_meta_info to hd_struct"
      added part_unpack_uuid() which assumes that the passed in buffer has
      enough space for sprintfing "%pU" - 37 characters including '\0'.
      
      Unfortunately, b5af921e "init: add support for root devices
      specified by partition UUID" supplied 33 bytes buffer to the function
      leading to the following panic with stackprotector enabled.
      
        Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e
      
        [<ffffffff815e226b>] panic+0xba/0x1c6
        [<ffffffff81b14c7e>] ? printk_all_partitions+0x259/0x26xb
        [<ffffffff810566bb>] __stack_chk_fail+0x1b/0x20
        [<ffffffff81b15c7e>] printk_all_paritions+0x259/0x26xb
        [<ffffffff81aedfe0>] mount_block_root+0x1bc/0x27f
        [<ffffffff81aee0fa>] mount_root+0x57/0x5b
        [<ffffffff81aee23b>] prepare_namespace+0x13d/0x176
        [<ffffffff8107eec0>] ? release_tgcred.isra.4+0x330/0x30
        [<ffffffff81aedd60>] kernel_init+0x155/0x15a
        [<ffffffff81087b97>] ? schedule_tail+0x27/0xb0
        [<ffffffff815f4d24>] kernel_thread_helper+0x5/0x10
        [<ffffffff81aedc0b>] ? start_kernel+0x3c5/0x3c5
        [<ffffffff815f4d20>] ? gs_change+0x13/0x13
      
      Increase the buffer size, remove the dangerous part_unpack_uuid() and
      use snprintf() directly from printk_all_partitions().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NSzymon Gruszczynski <sz.gruszczynski@googlemail.com>
      Cc: Will Drewry <wad@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      05c69d29
  6. 02 3月, 2012 2 次提交
    • A
      Block: use a freezable workqueue for disk-event polling · 62d3c543
      Alan Stern 提交于
      This patch (as1519) fixes a bug in the block layer's disk-events
      polling.  The polling is done by a work routine queued on the
      system_nrt_wq workqueue.  Since that workqueue isn't freezable, the
      polling continues even in the middle of a system sleep transition.
      
      Obviously, polling a suspended drive for media changes and such isn't
      a good thing to do; in the case of USB mass-storage devices it can
      lead to real problems requiring device resets and even re-enumeration.
      
      The patch fixes things by creating a new system-wide, non-reentrant,
      freezable workqueue and using it for disk-events polling.
      Signed-off-by: NAlan Stern <stern@rowland.harvard.edu>
      CC: <stable@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NRafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      62d3c543
    • S
      block: fix __blkdev_get and add_disk race condition · 9f53d2fe
      Stanislaw Gruszka 提交于
      The following situation might occur:
      
      __blkdev_get:			add_disk:
      
      				register_disk()
      get_gendisk()
      
      disk_block_events()
      	disk->ev == NULL
      
      				disk_add_events()
      
      __disk_unblock_events()
      	disk->ev != NULL
      	--ev->block
      
      Then we unblock events, when they are suppose to be blocked. This can
      trigger events related block/genhd.c warnings, but also can crash in
      sd_check_events() or other places.
      
      I'm able to reproduce crashes with the following scripts (with
      connected usb dongle as sdb disk).
      
      <snip>
      DEV=/dev/sdb
      ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue
      
      function stop_me()
      {
      	for i in `jobs -p` ; do kill $i 2> /dev/null ; done
      	exit
      }
      
      trap stop_me SIGHUP SIGINT SIGTERM
      
      for ((i = 0; i < 10; i++)) ; do
      	while true; do fdisk -l $DEV  2>&1 > /dev/null ; done &
      done
      
      while true ; do
      echo 1 > $ENABLE
      sleep 1
      echo 0 > $ENABLE
      done
      </snip>
      
      I use the script to verify patch fixing oops in sd_revalidate_disk
      http://marc.info/?l=linux-scsi&m=132935572512352&w=2
      Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in
      sd_revalidate_disk" or this one, script easily crash kernel within
      a few seconds. With both patches applied I do not observe crash.
      Unfortunately after some time (dozen of minutes), script will hung in:
      
      [ 1563.906432]  [<c08354f5>] schedule_timeout_uninterruptible+0x15/0x20
      [ 1563.906437]  [<c04532d5>] msleep+0x15/0x20
      [ 1563.906443]  [<c05d60b2>] blk_drain_queue+0x32/0xd0
      [ 1563.906447]  [<c05d6e00>] blk_cleanup_queue+0xd0/0x170
      [ 1563.906454]  [<c06d278f>] scsi_free_queue+0x3f/0x60
      [ 1563.906459]  [<c06d7e6e>] __scsi_remove_device+0x6e/0xb0
      [ 1563.906463]  [<c06d4aff>] scsi_forget_host+0x4f/0x60
      [ 1563.906468]  [<c06cd84a>] scsi_remove_host+0x5a/0xf0
      [ 1563.906482]  [<f7f030fb>] quiesce_and_remove_host+0x5b/0xa0 [usb_storage]
      [ 1563.906490]  [<f7f03203>] usb_stor_disconnect+0x13/0x20 [usb_storage]
      
      Anyway I think this patch is some step forward.
      
      As drawback, I do not teardown on sysfs file create error, because I do
      not know how to nullify disk->ev (since it can be used). However add_disk
      error handling practically does not exist too, and things will work
      without this sysfs file, except events will not be exported to user
      space.
      Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9f53d2fe
  7. 04 1月, 2012 3 次提交
  8. 14 12月, 2011 1 次提交
    • T
      block: misc updates to blk_get_queue() · 09ac46c4
      Tejun Heo 提交于
      * blk_get_queue() is peculiar in that it returns 0 on success and 1 on
        failure instead of 0 / -errno or boolean.  Update it such that it
        returns %true on success and %false on failure.
      
      * Make sure the caller checks for the return value.
      
      * Separate out __blk_get_queue() which doesn't check whether @q is
        dead and put it in blk.h.  This will be used later.
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      09ac46c4
  9. 10 11月, 2011 1 次提交
  10. 24 10月, 2011 1 次提交
    • T
      block: make gendisk hold a reference to its queue · f992ae80
      Tejun Heo 提交于
      The following command sequence triggers an oops.
      
      # mount /dev/sdb1 /mnt
      # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
      # umount /mnt
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
       RIP: 0010:[<ffffffff810d0879>]  [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
      ...
       Call Trace:
        [<ffffffff810d2845>] lock_acquire+0x95/0x140
        [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
        [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
        [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
        [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
        [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
        [<ffffffff811c40df>] blkdev_put+0x5f/0x190
        [<ffffffff8118f18d>] kill_block_super+0x4d/0x80
        [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
        [<ffffffff8119003a>] deactivate_super+0x4a/0x70
        [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
        [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
        [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b
      
      This is because bdev holds on to disk but disk doesn't pin the
      associated queue.  If a SCSI device is removed while the device is
      still open, the sdev puts the base reference to the queue on release.
      When the bdev is finally released, the associated queue is already
      gone along with the bdi and bdev_inode_switch_bdi() ends up
      dereferencing already freed bdi.
      
      Even if it were not for this bug, disk not holding onto the associated
      queue is very unusual and error-prone.
      
      Fix it by making add_disk() take an extra reference to its queue and
      put it on disk_release() and ensuring that disk and its fops owner are
      put in that order after all accesses to the disk and queue are
      complete.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f992ae80
  11. 19 10月, 2011 1 次提交
    • T
      block: make gendisk hold a reference to its queue · 523e1d39
      Tejun Heo 提交于
      The following command sequence triggers an oops.
      
      # mount /dev/sdb1 /mnt
      # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
      # umount /mnt
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
       RIP: 0010:[<ffffffff810d0879>]  [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
      ...
       Call Trace:
        [<ffffffff810d2845>] lock_acquire+0x95/0x140
        [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
        [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
        [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
        [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
        [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
        [<ffffffff811c40df>] blkdev_put+0x5f/0x190
        [<ffffffff8118f18d>] kill_block_super+0x4d/0x80
        [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
        [<ffffffff8119003a>] deactivate_super+0x4a/0x70
        [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
        [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
        [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b
      
      This is because bdev holds on to disk but disk doesn't pin the
      associated queue.  If a SCSI device is removed while the device is
      still open, the sdev puts the base reference to the queue on release.
      When the bdev is finally released, the associated queue is already
      gone along with the bdi and bdev_inode_switch_bdi() ends up
      dereferencing already freed bdi.
      
      Even if it were not for this bug, disk not holding onto the associated
      queue is very unusual and error-prone.
      
      Fix it by making add_disk() take an extra reference to its queue and
      put it on disk_release() and ensuring that disk and its fops owner are
      put in that order after all accesses to the disk and queue are
      complete.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      523e1d39
  12. 29 8月, 2011 1 次提交
  13. 24 8月, 2011 1 次提交
    • T
      block: add GENHD_FL_NO_PART_SCAN · d27769ec
      Tejun Heo 提交于
      There are cases where suppressing partition scan is useful - e.g. for
      lo devices and pseudo SATA devices which advertise to be a disk but
      get upset on partition scan (some port multiplier control devices show
      such behavior).
      
      This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
      regardless of the number of possible partitions.  disk_partitionable()
      is renamed to disk_part_scan_enabled() as suppressing partition scan
      doesn't imply the device can't be partitioned using
      BLKPG_ADD/DEL_PARTITION calls from userland.  show_partition() now
      directly tests disk_max_parts() to maintain backward-compatibility.
      
      -v2: Updated to make it clear that only partition scan is suppressed
           not partitioning itself as suggested by Kay Sievers.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      d27769ec
  14. 02 8月, 2011 1 次提交
    • H
      block/genhd.c: remove useless cast in diskstats_show() · f95fe9cf
      Herbert Poetzl 提交于
      Remove the (unsigned long long) cast in diskstats_show() and adjusts the
      seq_printf() format string to 'unsigned long'
      
      diskstats_show() uses part_stat_read() to get the stats, which either
      accesses the specified field in the struct disk_stats directly (non SMP)
      or sums up the per CPU values in a variable of the same type as the field,
      so in any case the result will have the same type and range as the
      specified field which for all disk_stats entries is unsigned long
      
      Also, for unsigned long ranges the output of %lu should be identical to
      the one of %llu, so no change in the actual proc entry contents.
      Signed-off-by: NHerbert Poetzl <herbert@13thfloor.at>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      f95fe9cf
  15. 21 7月, 2011 1 次提交
  16. 01 7月, 2011 1 次提交
    • T
      block: flush MEDIA_CHANGE from drivers on close(2) · 85ef06d1
      Tejun Heo 提交于
      Currently, only open(2) is defined as the 'clearing' point.  It has
      two roles - first, it's an acknowledgement from userland indicating
      that the event has been received and kernel can clear pending states
      and proceed to generate more events.  Secondly, it's passed on to
      device drivers as a hint indicating that a synchronization point has
      been reached and it might want to take a deeper look at the device.
      
      The latter currently is only used by sr which uses two different
      mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
      to discover events, where the former is lighter weight and safe to be
      used repeatedly but may not provide full coverage.  Among other
      things, GET_EVENT can't detect media removal while TUR can.
      
      This patch makes close(2) - blkdev_put() - indicate clearing hint for
      MEDIA_CHANGE to drivers.  disk_check_events() is renamed to
      disk_flush_events() and updated to take @mask for events to flush
      which is or'd to ev->clearing and will be passed to the driver on the
      next ->check_events() invocation.
      
      This change makes sr generate MEDIA_CHANGE when media is ejected from
      userland - e.g. with eject(1).
      
      Note: Given the current usage, it seems @clearing hint is needlessly
      complex.  disk_clear_events() can simply clear all events and the hint
      can be boolean @flush.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      85ef06d1
  17. 13 6月, 2011 1 次提交
  18. 10 6月, 2011 3 次提交
  19. 27 5月, 2011 1 次提交
    • T
      block: always allocate genhd->ev if check_events is implemented · 75e3f3ee
      Tejun Heo 提交于
      9fd097b1 (block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe
      drivers) removed DISK_EVENT_MEDIA_CHANGE from legacy/fringe block
      drivers which have inadequate ->check_events().  Combined with earlier
      change 7c88a168 (block: don't propagate unlisted DISK_EVENTs to
      userland), this enables using ->check_events() for internal processing
      while avoiding enabling in-kernel block event polling which can lead
      to infinite event loop.
      
      Unfortunately, this made many drivers including floppy without any bit
      set in disk->events and ->async_events in which case disk_add_events()
      simply skipped allocation of disk->ev, which disables whole event
      handling.  As ->check_events() is still used during open processing
      for revalidation, this can lead to open failure.
      
      This patch always allocates disk->ev if ->check_events is implemented.
      In the long term, it would make sense to simply include the event
      structure inline into genhd as it's now used by virtually all block
      devices.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NOndrej Zary <linux@rainbow-software.org>
      Reported-by: NAlex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      75e3f3ee
  20. 22 4月, 2011 1 次提交
    • T
      block: don't propagate unlisted DISK_EVENTs to userland · 7c88a168
      Tejun Heo 提交于
      DISK_EVENT_MEDIA_CHANGE is used for both userland visible event and
      internal event for revalidation of removeable devices.  Some legacy
      drivers don't implement proper event detection and continuously
      generate events under certain circumstances.  For example, ide-cd
      generates media changed continuously if there's no media in the drive,
      which can lead to infinite loop of events jumping back and forth
      between the driver and userland event handler.
      
      This patch updates disk event infrastructure such that it never
      propagates events not listed in disk->events to userland.  Those
      events are processed the same for internal purposes but uevent
      generation is suppressed.
      
      This also ensures that userland only gets events which are advertised
      in the @events sysfs node lowering risk of confusion.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7c88a168
  21. 31 3月, 2011 1 次提交
  22. 10 3月, 2011 1 次提交
    • T
      block: Don't implicitly trigger event check on disk_unblock_events() · facc31dd
      Tejun Heo 提交于
      Currently, disk_unblock_events() implicitly kick event check if the
      block count reaches zero.  This behavior is not described in the
      comment and hinders with future changes.  Make the unblocker
      explicitly check events by calling disk_check_events() as necessary.
      
      This patch doesn't cause any behavior difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      facc31dd
  23. 03 3月, 2011 1 次提交
  24. 24 2月, 2011 1 次提交
    • N
      Fix over-zealous flush_disk when changing device size. · 93b270f7
      NeilBrown 提交于
      There are two cases when we call flush_disk.
      In one, the device has disappeared (check_disk_change) so any
      data will hold becomes irrelevant.
      In the oter, the device has changed size (check_disk_size_change)
      so data we hold may be irrelevant.
      
      In both cases it makes sense to discard any 'clean' buffers,
      so they will be read back from the device if needed.
      
      In the former case it makes sense to discard 'dirty' buffers
      as there will never be anywhere safe to write the data.  In the
      second case it *does*not* make sense to discard dirty buffers
      as that will lead to file system corruption when you simply enlarge
      the containing devices.
      
      flush_disk calls __invalidate_devices.
      __invalidate_device calls both invalidate_inodes and invalidate_bdev.
      
      invalidate_inodes *does* discard I_DIRTY inodes and this does lead
      to fs corruption.
      
      invalidate_bev *does*not* discard dirty pages, but I don't really care
      about that at present.
      
      So this patch adds a flag to __invalidate_device (calling it
      __invalidate_device2) to indicate whether dirty buffers should be
      killed, and this is passed to invalidate_inodes which can choose to
      skip dirty inodes.
      
      flusk_disk then passes true from check_disk_change and false from
      check_disk_size_change.
      
      dm avoids tripping over this problem by calling i_size_write directly
      rathher than using check_disk_size_change.
      
      md does use check_disk_size_change and so is affected.
      
      This regression was introduced by commit 608aeef1 which causes
      check_disk_size_change to call flush_disk, so it is suitable for any
      kernel since 2.6.27.
      
      Cc: stable@kernel.org
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Andrew Patterson <andrew.patterson@hp.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      93b270f7
  25. 07 1月, 2011 1 次提交
  26. 05 1月, 2011 1 次提交
    • J
      block: fix accounting bug on cross partition merges · 09e099d4
      Jerome Marchand 提交于
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      Also add a refcount to struct hd_struct to keep the partition in
      memory as long as users exist. We use kref_test_and_get() to ensure
      we don't add a reference to a partition which is going away.
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      09e099d4
  27. 17 12月, 2010 5 次提交
    • Y
      fs/block: type signature of major_to_index(int) to major_to_index(unsigned) · e61eb2e9
      Yang Zhang 提交于
      The major/minor device numbers are always defined and used as `unsigned'.
      Signed-off-by: NYang Zhang <kthreadd@gmail.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      e61eb2e9
    • Y
      block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p) · b9f985b6
      Yang Zhang 提交于
      Signed-off-by: NYang Zhang <kthreadd@gmail.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      b9f985b6
    • T
      implement in-kernel gendisk events handling · 77ea887e
      Tejun Heo 提交于
      Currently, media presence polling for removeable block devices is done
      from userland.  There are several issues with this.
      
      * Polling is done by periodically opening the device.  For SCSI
        devices, the command sequence generated by such action involves a
        few different commands including TEST_UNIT_READY.  This behavior,
        while perfectly legal, is different from Windows which only issues
        single command, GET_EVENT_STATUS_NOTIFICATION.  Unfortunately, some
        ATAPI devices lock up after being periodically queried such command
        sequences.
      
      * There is no reliable and unintrusive way for a userland program to
        tell whether the target device is safe for media presence polling.
        For example, polling for media presence during an on-going burning
        session can make it fail.  The polling program can avoid this by
        opening the device with O_EXCL but then it risks making a valid
        exclusive user of the device fail w/ -EBUSY.
      
      * Userland polling is unnecessarily heavy and in-kernel implementation
        is lighter and better coordinated (workqueue, timer slack).
      
      This patch implements framework for in-kernel disk event handling,
      which includes media presence polling.
      
      * bdops->check_events() is added, which supercedes ->media_changed().
        It should check whether there's any pending event and return if so.
        Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
        DISK_EVENT_EJECT_REQUEST.  ->check_events() is guaranteed not to be
        called parallelly.
      
      * gendisk->events and ->async_events are added.  These should be
        initialized by block driver before passing the device to add_disk().
        The former contains the mask of all supported events and the latter
        the mask of all events which the device can report without polling.
        /sys/block/*/events[_async] export these to userland.
      
      * Kernel parameter block.events_dfl_poll_msecs controls the system
        polling interval (default is 0 which means disable) and
        /sys/block/*/events_poll_msecs control polling intervals for
        individual devices (default is -1 meaning use system setting).  Note
        that if a device can report all supported events asynchronously and
        its polling interval isn't explicitly set, the device won't be
        polled regardless of the system polling interval.
      
      * If a device is opened exclusively with write access, event checking
        is automatically disabled until all write exclusive accesses are
        released.
      
      * There are event 'clearing' events.  For example, both of currently
        defined events are cleared after the device has been successfully
        opened.  This information is passed to ->check_events() callback
        using @clearing argument as a hint.
      
      * Event checking is always performed from system_nrt_wq and timer
        slack is set to 25% for polling.
      
      * Nothing changes for drivers which implement ->media_changed() but
        not ->check_events().  Going forward, all drivers will be converted
        to ->check_events() and ->media_change() will be dropped.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      77ea887e
    • T
      block: move register_disk() and del_gendisk() to block/genhd.c · d2bf1b67
      Tejun Heo 提交于
      There's no reason for register_disk() and del_gendisk() to be in
      fs/partitions/check.c.  Move both to genhd.c.  While at it, collapse
      unlink_gendisk(), which was artificially in a separate function due to
      genhd.c / check.c split, into del_gendisk().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      d2bf1b67
    • T
      block: kill genhd_media_change_notify() · dddd9dc3
      Tejun Heo 提交于
      There's no user of the facility.  Kill it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      dddd9dc3
  28. 25 10月, 2010 1 次提交
  29. 23 10月, 2010 1 次提交
    • A
      SYSFS: Allow boot time switching between deprecated and modern sysfs layout · e52eec13
      Andi Kleen 提交于
      I have some systems which need legacy sysfs due to old tools that are
      making assumptions that a directory can never be a symlink to another
      directory, and it's a big hazzle to compile separate kernels for them.
      
      This patch turns CONFIG_SYSFS_DEPRECATED into a run time option
      that can be switched on/off the kernel command line. This way
      the same binary can be used in both cases with just a option
      on the command line.
      
      The old CONFIG_SYSFS_DEPRECATED_V2 option is still there to set
      the default. I kept the weird name to not break existing
      config files.
      
      Also the compat code can be still completely disabled by undefining
      CONFIG_SYSFS_DEPRECATED_SWITCH -- just the optimizer takes
      care of this now instead of lots of ifdefs. This makes the code
      look nicer.
      
      v2: This is an updated version on top of Kay's patch to only
      handle the block devices. I tested it on my old systems
      and that seems to work.
      
      Cc: axboe@kernel.dk
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      e52eec13
  30. 19 10月, 2010 1 次提交
    • Y
      block: fix accounting bug on cross partition merges · 7681bfee
      Yasuaki Ishimatsu 提交于
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      When reloading partition tables, quiesce IO to ensure that no
      request references to the partition struct exists. When it is safe
      to free the partition table, the IO for that device is restarted
      again.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7681bfee
  31. 17 9月, 2010 1 次提交