1. 17 Dec, 2010 3 commits
    • implement in-kernel gendisk events handling · 77ea887e
      Committed by Tejun Heo
      Currently, media presence polling for removable block devices is done
      from userland.  There are several issues with this.
      
      * Polling is done by periodically opening the device.  For SCSI
        devices, the command sequence generated by such action involves a
        few different commands including TEST_UNIT_READY.  This behavior,
        while perfectly legal, is different from Windows, which only issues
        a single command, GET_EVENT_STATUS_NOTIFICATION.  Unfortunately, some
        ATAPI devices lock up after being periodically queried with such
        command sequences.
      
      * There is no reliable and unintrusive way for a userland program to
        tell whether the target device is safe for media presence polling.
        For example, polling for media presence during an on-going burning
        session can make it fail.  The polling program can avoid this by
        opening the device with O_EXCL but then it risks making a valid
        exclusive user of the device fail w/ -EBUSY.
      
      * Userland polling is unnecessarily heavy; an in-kernel implementation
        is lighter and better coordinated (workqueue, timer slack).
      
      This patch implements a framework for in-kernel disk event handling,
      which includes media presence polling.
      
      * bdops->check_events() is added, which supersedes ->media_changed().
        It should check whether there are any pending events and return
        them if so.  Currently, two events are defined: DISK_EVENT_MEDIA_CHANGE
        and DISK_EVENT_EJECT_REQUEST.  ->check_events() is guaranteed not to
        be called concurrently.
      
      * gendisk->events and ->async_events are added.  These should be
        initialized by the block driver before passing the device to add_disk().
        The former contains the mask of all supported events and the latter
        the mask of all events which the device can report without polling.
        /sys/block/*/events[_async] export these to userland.
      
      * Kernel parameter block.events_dfl_poll_msecs controls the system
        polling interval (default is 0, which disables polling) and
        /sys/block/*/events_poll_msecs controls polling intervals for
        individual devices (default is -1, meaning use system setting).  Note
        that if a device can report all supported events asynchronously and
        its polling interval isn't explicitly set, the device won't be
        polled regardless of the system polling interval.
      
      * If a device is opened exclusively with write access, event checking
        is automatically disabled until all write exclusive accesses are
        released.
      
      * Some operations clear pending events.  For example, both of the
        currently defined events are cleared after the device has been
        successfully opened.  This information is passed to the
        ->check_events() callback using the @clearing argument as a hint.
      
      * Event checking is always performed from system_nrt_wq and timer
        slack is set to 25% for polling.
      
      * Nothing changes for drivers which implement ->media_changed() but
        not ->check_events().  Going forward, all drivers will be converted
        to ->check_events() and ->media_changed() will be dropped.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
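The polling rules listed in the commit message above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct disk_model`, `effective_poll_msecs()` and the numeric values of the event bits are hypothetical names chosen for illustration, not the kernel's actual code, and the exclusive-write rule is folded into the same function for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Event names come from the commit message; the bit values are assumed. */
#define DISK_EVENT_MEDIA_CHANGE  (1u << 0)
#define DISK_EVENT_EJECT_REQUEST (1u << 1)

/* Hypothetical userspace model of a disk's event configuration. */
struct disk_model {
    unsigned int events;        /* mask of all supported events */
    unsigned int async_events;  /* events reported without polling */
    long poll_msecs;            /* per-device interval, -1 = use system setting */
    bool excl_write_open;       /* opened exclusively with write access */
};

/* Returns the effective polling interval in ms; 0 means "do not poll". */
long effective_poll_msecs(const struct disk_model *d, long dfl_poll_msecs)
{
    assert(d != NULL);
    if (d->excl_write_open)
        return 0;                   /* event checking is disabled */
    if (d->poll_msecs >= 0)
        return d->poll_msecs;       /* explicit per-device setting wins */
    if (d->events & ~d->async_events)
        return dfl_poll_msecs;      /* some event still needs polling */
    return 0;                       /* everything is reported asynchronously */
}
```

With a system default of 2000 ms, a device that supports both events and reports neither asynchronously would be polled every 2000 ms, while a device that reports all of them asynchronously would not be polled unless its own interval is set explicitly.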
    • block: move register_disk() and del_gendisk() to block/genhd.c · d2bf1b67
      Committed by Tejun Heo
      There's no reason for register_disk() and del_gendisk() to be in
      fs/partitions/check.c.  Move both to genhd.c.  While at it, collapse
      unlink_gendisk(), which was artificially in a separate function due to
      the genhd.c / check.c split, into del_gendisk().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: kill genhd_media_change_notify() · dddd9dc3
      Committed by Tejun Heo
      There's no user of the facility.  Kill it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  2. 25 Oct, 2010 1 commit
  3. 19 Oct, 2010 1 commit
    • block: fix accounting bug on cross partition merges · 7681bfee
      Committed by Yasuaki Ishimatsu
      /proc/diskstats would display strange output such as the following.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      The cause is incorrect accounting of hd_struct->in_flight when a bio is
      merged, by ELEVATOR_FRONT_MERGE, into a request that belongs to a different partition.
      
      The detailed root cause is as follows.
      
      Assume that there are two partitions, sda1 and sda2.
      
      1. A request for sda2 is in the request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belonging to sda1 is issued and is merged into the request mentioned in
         step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request changes
         from the sda2 region to the sda1 region. However, neither partition's
         hd_struct->in_flight is changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda1's hd_struct->in_flight, not sda2's, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      When reloading partition tables, quiesce IO to ensure that no
      request references to the partition struct exist. When it is safe
      to free the partition table, the IO for that device is restarted.
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
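The three steps in the commit message above can be reproduced with a small userspace model. This is a sketch under assumptions: the structures and function names below (`struct part`, `io_start()`, `io_done_buggy()`, `io_done_fixed()`) are hypothetical simplifications, not the kernel's definitions. The buggy completion path looks the partition up again from the request's current start sector, while the fixed path uses the partition cached in the request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for hd_struct and struct request. */
struct part { unsigned long start, nr_sects; int in_flight; };
struct request { unsigned long sector; struct part *part; };

/* Linear lookup of the partition that contains a sector. */
struct part *map_sector(struct part *parts, size_t n, unsigned long sector)
{
    for (size_t i = 0; i < n; i++)
        if (sector >= parts[i].start &&
            sector < parts[i].start + parts[i].nr_sects)
            return &parts[i];
    return NULL;
}

/* Account the start of a request and cache the partition in the request. */
void io_start(struct part *parts, size_t n, struct request *rq)
{
    rq->part = map_sector(parts, n, rq->sector);
    assert(rq->part != NULL);
    rq->part->in_flight++;
}

/* A front merge moves the first sector of the request to the bio's sector. */
void front_merge(struct request *rq, unsigned long bio_sector)
{
    rq->sector = bio_sector;
}

/* Buggy completion: a fresh lookup may hit a different partition. */
void io_done_buggy(struct part *parts, size_t n, struct request *rq)
{
    struct part *p = map_sector(parts, n, rq->sector);
    assert(p != NULL);
    p->in_flight--;
}

/* Fixed completion: decrement the partition that was incremented. */
void io_done_fixed(struct request *rq)
{
    rq->part->in_flight--;
}
```

In this model, starting a request in the second partition, front-merging a bio from the first, and completing it through the buggy path leaves the counters at -1 and 1, which matches the table in step 3. Printed as an unsigned value, -1 is the kind of very large number seen in the /proc/diskstats output.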
  4. 15 Sep, 2010 1 commit
    • block, partition: add partition_meta_info to hd_struct · 6d1d8050
      Committed by Will Drewry
      I'm reposting this patch series as v4 since there have been no additional
      comments, and I cleaned up one extra bit of unneeded code (in 3/3). The patches
      are against Linus's tree: 2bfc96a1
      (2.6.36-rc3).
      
      Would this patchset be suitable for inclusion in an mm branch?
      
      This change adds a partition_meta_info struct which itself contains a
      union of structures that provide partition table specific metadata.
      
      This change leaves the union empty. The subsequent patch includes an
      implementation for CONFIG_EFI_PARTITION-based metadata.
      Signed-off-by: Will Drewry <wad@chromium.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  5. 20 Aug, 2010 1 commit
  6. 16 Mar, 2010 1 commit
  7. 17 Feb, 2010 1 commit
  8. 11 Jan, 2010 1 commit
  9. 10 Nov, 2009 1 commit
  10. 07 Oct, 2009 1 commit
    • block: Seperate read and write statistics of in_flight requests v2 · 316d315b
      Committed by Nikanth Karthikesan
      Commit a9327cac added separate read
      and write statistics of in_flight requests, and exported the number
      of read and write requests in progress separately through sysfs.
      
      But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
      output from "iostat -kx 2". Global values for service time and
      utilization were garbage. For interval values, utilization was always
      100%, and service time was higher than normal.
      
      So this was reverted by commit 0f78ab98.
      
      The problem was in part_round_stats_single(), I missed the following:
              if (now == part->stamp)
                      return;
      
      -       if (part->in_flight) {
      +       if (part_in_flight(part)) {
                      __part_stat_add(cpu, part, time_in_queue,
                                      part_in_flight(part) * (now - part->stamp));
                      __part_stat_add(cpu, part, io_ticks, (now - part->stamp));
      
      With this chunk included, the reported regression gets fixed.
      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      
      --
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
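The missed hunk above can be illustrated with a userspace model. The sketch assumes that, after the split, `in_flight` is a two-element array (reads and writes) and that `part_in_flight()` returns their sum; the struct layout here is hypothetical. Under that assumption, testing the array itself in a condition is always true in C, so `io_ticks` would grow on every call, which is consistent with the reported 100% utilization.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the per-partition statistics involved. */
struct part_model {
    unsigned long stamp;          /* time of the last update */
    int in_flight[2];             /* [0] = reads, [1] = writes (assumed layout) */
    unsigned long time_in_queue;
    unsigned long io_ticks;
};

/* Total number of requests in flight, reads plus writes. */
int part_in_flight(const struct part_model *p)
{
    return p->in_flight[0] + p->in_flight[1];
}

/* Model of the fixed part_round_stats_single() logic. */
void part_round_stats_single(struct part_model *p, unsigned long now)
{
    assert(p != NULL);
    if (now == p->stamp)
        return;

    /* Only accumulate busy time while something is actually in flight. */
    if (part_in_flight(p)) {
        p->time_in_queue += part_in_flight(p) * (now - p->stamp);
        p->io_ticks += now - p->stamp;
    }
    p->stamp = now;
}
```

An idle partition advances only its `stamp`, while a partition with one read and two writes in flight for 5 ticks adds 5 to `io_ticks` and 15 to `time_in_queue`.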
  11. 05 Oct, 2009 1 commit
    • Revert "Seperate read and write statistics of in_flight requests" · 0f78ab98
      Committed by Jens Axboe
      This reverts commit a9327cac.
      
      Corrado Zoccolo <czoccolo@gmail.com> reports:
      
      "with 2.6.32-rc1 I started getting the following strange output from
      "iostat -kx 2":
      Linux 2.6.31bisect (et2) 	04/10/2009 	_i686_	(2 CPU)
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                10,70    0,00    3,16   15,75    0,00   70,38
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
      avgrq-sz avgqu-sz   await  svctm  %util
      sda              18,22     0,00    0,67    0,01    14,77     0,02
      43,94     0,01   10,53 39043915,03 2629219,87
      sdb              60,89     9,68   50,79    3,04  1724,43    50,52
      65,95     0,70   13,06 488437,47 2629219,87
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 2,72    0,00    0,74    0,00    0,00   96,53
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
      avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      sdb               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 6,68    0,00    0,99    0,00    0,00   92,33
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
      avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      sdb               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 4,40    0,00    0,73    1,47    0,00   93,40
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
      avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      sdb               0,00     4,00    0,00    3,00     0,00    28,00
      18,67     0,06   19,50 333,33 100,00
      
      Global values for service time and utilization are garbage. For
      interval values, utilization is always 100%, and service time is
      higher than normal.
      
      I bisected it down to:
      [a9327cac] Seperate read and write
      statistics of in_flight requests
      and verified that reverting just that commit indeed solves the issue
      on 2.6.32-rc1."
      
      So until this is debugged, revert the bad commit.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  12. 22 Sep, 2009 1 commit
  13. 20 Sep, 2009 1 commit
  14. 14 Sep, 2009 1 commit
  15. 16 Jun, 2009 1 commit
  16. 07 Jun, 2009 1 commit
    • partitions: add ->set_capacity block device method · db429e9e
      Committed by Bartlomiej Zolnierkiewicz
      * Add ->set_capacity block device method and use it in rescan_partitions()
        to attempt enabling native capacity of the device upon detecting a
        partition which exceeds device capacity.
      
      * Add GENHD_FL_NATIVE_CAPACITY flag to limit attempts at enabling
        native capacity during partition scan.
      
      Together with the subsequent patch implementing the ->set_capacity method in
      the ide-gd device driver, this allows automatic disabling of Host Protected Area
      (HPA) if any partitions overlapping HPA are detected.
      
      Cc: Robert Hancock <hancockrwd@gmail.com>
      Cc: Frans Pop <elendil@planet.nl>
      Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Emphatically-Acked-by: Alan Cox <alan@linux.intel.com>
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
  17. 23 May, 2009 1 commit
    • block: Export I/O topology for block devices and partitions · c72758f3
      Committed by Martin K. Petersen
      To support devices with physical block sizes bigger than 512 bytes we
      need to ensure proper alignment.  This patch adds support for exposing
      I/O topology characteristics as devices are stacked.
      
        logical_block_size is the smallest unit the device can address.
      
        physical_block_size indicates the smallest I/O the device can write
        without incurring a read-modify-write penalty.
      
        The io_min parameter is the smallest preferred I/O size reported by
        the device.  In many cases this is the same as the physical block
        size.  However, the io_min parameter can be scaled up when stacking
        (RAID5 chunk size > physical block size).
      
        The io_opt characteristic indicates the optimal I/O size reported by
        the device.  This is usually the stripe width for arrays.
      
        The alignment_offset parameter indicates the number of bytes the start
        of the device/partition is offset from the device's natural alignment.
        Partition tools and MD/DM utilities can use this to pad their offsets
        so filesystems start on proper boundaries.
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
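As an illustration of how a partitioning tool might use physical_block_size and alignment_offset, the sketch below computes how many bytes of padding a partition start needs to land on a physical block boundary. It is a hypothetical helper, not the kernel's formula: `padding_to_alignment()` and its parameters are names invented here, and it assumes 512-byte block-layer sectors and that the device's naturally aligned bytes are those congruent to its alignment offset modulo the physical block size.

```c
#include <assert.h>

#define SECTOR_SIZE 512ULL  /* assumed block-layer sector size in bytes */

/*
 * Bytes of padding needed so that a partition starting at start_sector
 * begins on a physical block boundary.  dev_alignment_offset is the byte
 * offset of the device's natural alignment (0 for most devices).
 */
unsigned int padding_to_alignment(unsigned long long start_sector,
                                  unsigned int physical_block_size,
                                  unsigned int dev_alignment_offset)
{
    assert(physical_block_size != 0);

    unsigned long long start = start_sector * SECTOR_SIZE;
    /* Shift the start so that aligned positions become multiples of the block size. */
    unsigned long long shift =
        physical_block_size - dev_alignment_offset % physical_block_size;
    unsigned long long misalign = (start + shift) % physical_block_size;

    return (unsigned int)((physical_block_size - misalign) % physical_block_size);
}
```

For a 4096-byte physical block size and no device offset, a partition starting at sector 63 would need 512 bytes of padding (moving it to sector 64), while one starting at sector 2048 needs none. On a device whose alignment offset is 3584 bytes, sector 63 is already aligned in this model.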
  18. 22 Apr, 2009 1 commit
  19. 24 Mar, 2009 2 commits
  20. 29 Dec, 2008 1 commit
    • block: add one-hit cache for disk partition lookup · a6f23657
      Committed by Jens Axboe
      disk_map_sector_rcu() returns a partition from a sector offset,
      which we use for IO statistics on a per-partition basis. The
      lookup itself is an O(N) list lookup, where N is the number of
      partitions. This actually hurts performance quite a bit, even
      on the lower end partitions. On higher numbered partitions,
      it can get pretty bad.
      
      Solve this by adding a one-hit cache for partition lookup.
      This makes the lookup O(1) for the case where we do most IO to
      one partition. Even for mixed partition workloads, amortized cost
      is pretty close to O(1) since the natural IO batching makes the
      one-hit cache last for lots of IOs.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
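The one-hit cache idea described above can be sketched in userspace C as follows. The names (`struct disk_model`, `last_lookup`, `map_sector()`, the `scans` counter) are hypothetical and chosen for illustration; the real lookup also runs under RCU, which this model ignores.

```c
#include <assert.h>
#include <stddef.h>

struct part { unsigned long start, nr_sects; };

/* Hypothetical model of a disk with a one-entry lookup cache. */
struct disk_model {
    struct part *parts;
    size_t nr_parts;
    struct part *last_lookup;   /* the one-hit cache */
    unsigned long scans;        /* number of O(N) scans, for illustration only */
};

static int in_range(const struct part *p, unsigned long sector)
{
    return sector >= p->start && sector < p->start + p->nr_sects;
}

/* Map a sector to its partition, trying the cached partition first. */
struct part *map_sector(struct disk_model *d, unsigned long sector)
{
    assert(d != NULL);

    /* Fast path: O(1) when consecutive IOs hit the same partition. */
    if (d->last_lookup && in_range(d->last_lookup, sector))
        return d->last_lookup;

    /* Slow path: O(N) scan, then remember the hit. */
    d->scans++;
    for (size_t i = 0; i < d->nr_parts; i++) {
        if (in_range(&d->parts[i], sector)) {
            d->last_lookup = &d->parts[i];
            return &d->parts[i];
        }
    }
    return NULL;
}
```

A run of lookups that all fall in the same partition costs a single scan in this model; a new scan happens only when the IO stream moves to another partition.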
  21. 18 Nov, 2008 1 commit
  22. 23 Oct, 2008 2 commits
  23. 09 Oct, 2008 14 commits
    • block: allow disk to have extended device number · 3e1a7ff8
      Committed by Tejun Heo
      Now that disk and partition handlings are mostly unified, it's easy to
      allow disk to have extended device number.  This patch makes
      add_disk() use extended device number if disk->minors is zero.  Both
      sd and ide-disk are updated to use this.
      
      * sd_format_disk_name() is implemented which can generically determine
        the drive name.  This removes disk number restriction stemming from
        limited device names.
      
      * If sd index goes over SD_MAX_DISKS (which can be increased now BTW),
        sd simply doesn't initialize minors letting block layer choose
        extended device number.
      
      * If CONFIG_DEBUG_EXT_DEVT is set, both sd and ide-disk always set
        minors to 0 and use extended device numbers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: replace @ext_minors with GENHD_FL_EXT_DEVT · 689d6fac
      Committed by Tejun Heo
      With previous changes, it's meaningless to limit the number of
      partitions.  Replace @ext_minors with GENHD_FL_EXT_DEVT such that
      setting the flag allows the disk to have maximum number of allowed
      partitions (only limited by the number of entries in parsed_partitions
      as determined by MAX_PART constant).
      
      This kills not-too-pretty alloc_disk_ext[_node]() functions and makes
      @minors parameter to alloc_disk[_node]() unnecessary.  The parameter
      is left alone to avoid disturbing the users.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: make partition array dynamic · 540eed56
      Committed by Tejun Heo
      disk->__part used to be statically allocated to the maximum possible
      number of partitions.  This patch makes partition array allocation
      dynamic.  The added overhead is minimal as the only real change is one
      memory dereference changed to RCU one.  This saves both a bit of
      memory and cpu cycles iterating through unoccupied slots and makes
      increasing partition limit easier.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: move stats from disk to part0 · 074a7aca
      Committed by Tejun Heo
      Move stats related fields - stamp, in_flight, dkstats - from disk to
      part0 and unify stat handling such that...
      
      * part_stat_*() now updates part0 together if the specified partition
        is not part0.  ie. part_stat_*() are now essentially all_stat_*().
      
      * {disk|all}_stat_*() are gone.
      
      * part_round_stats() is updated similarly.  It handles part0 stats
        automatically and disk_round_stats() is killed.
      
      * part_{inc|dec}_in_flight() is implemented which automatically updates
        part0 stats for parts other than part0.
      
      * disk_map_sector_rcu() is updated to return part0 if no part matches.
        Combined with the above changes, this makes NULL special case
        handling in callers unnecessary.
      
      * Separate stats show code paths for disk are collapsed into part
        stats show code paths.
      
      * Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
      
      While at it, reposition stat handling macros a bit and add missing
      parentheses around macro parameters.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
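The unified stat handling described in this commit can be sketched as a userspace model: a stat update on any partition other than part0 is mirrored onto part0, and a sector that matches no partition maps to part0, so callers never see NULL. All names below are hypothetical stand-ins, and a single `ios` counter stands in for the full set of per-partition statistics.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PARTS 3  /* parts[0] models part0, the whole disk */

struct part_model { int partno; unsigned long start, nr_sects, ios; };
struct disk_model { struct part_model parts[NR_PARTS]; };

/* Add to a partition's counter and mirror the update onto part0. */
void part_stat_add_ios(struct disk_model *d, struct part_model *p,
                       unsigned long n)
{
    assert(d != NULL && p != NULL);
    p->ios += n;
    if (p->partno)              /* skip the mirror when p is part0 itself */
        d->parts[0].ios += n;
}

/* Map a sector to a partition, falling back to part0 if nothing matches. */
struct part_model *map_sector(struct disk_model *d, unsigned long sector)
{
    for (size_t i = 1; i < NR_PARTS; i++) {
        struct part_model *p = &d->parts[i];
        if (sector >= p->start && sector < p->start + p->nr_sects)
            return p;
    }
    return &d->parts[0];
}
```

Because the lookup always returns a valid partition, a caller can chain the two calls without a NULL check, which is the simplification the commit message points out.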
    • block: kill GENHD_FL_FAIL and use part0->make_it_fail · eddb2e26
      Committed by Tejun Heo
      GENHD_FL_FAIL for disk is what make_it_fail is for parts.  Kill it and
      use part0->make_it_fail.  Sysfs node handling is unified too.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: always set bdev->bd_part · 0762b8bd
      Committed by Tejun Heo
      Till now, bdev->bd_part is set only if the bdev was for parts other
      than part0.  This patch makes bdev->bd_part always set so that code
      paths don't have to differentiate common handling.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: move holder_dir from disk to part0 · 4c46501d
      Committed by Tejun Heo
      Move disk->holder_dir to part0->holder_dir.  Kill now mostly
      superfluous bdev_get_holder().
      
      While at it, kill superfluous kobject_get/put() around holder_dir,
      slave_dir and cmd_filter creation and collapse
      disk_sysfs_add_subdirs() into register_disk().  These serve no purpose
      but obfuscating the code.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: move policy from disk to part0 · b7db9956
      Committed by Tejun Heo
      Move disk->policy to part0->policy.  Implement and use get_disk_ro().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: unify sysfs size node handling · e5610521
      Committed by Tejun Heo
      Now that capacity and __dev are moved to part0, part0 and others can
      share the same method.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: move __dev from disk to part0 · 548b10eb
      Committed by Tejun Heo
      Move disk->__dev to part0->__dev.  This simplifies bdget_disk() and
      lookup_devt() and allows common sysfs attributes to be unified.
      part_to_disk() is updated to handle part0 -> disk.
      
      Updated to include a fix from Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>,
      he writes:
      
      "part0 is a "special" partition and doesn't need to have capacity set - this
      fixes regression caused by "block: move __dev from disk to part0" commit."
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: move capacity from disk to part0 · 80795aef
      Committed by Tejun Heo
      Move disk->capacity to part0->nr_sects and convert all users who
      directly accessed the field to use {get|set}_capacity().  This is done
      early to allow the __dev field to be moved.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: introduce partition 0 · b5d0b9df
      Committed by Tejun Heo
      genhd and partition code handled disk and partitions separately.  All
      information about the whole disk was in struct genhd and partitions in
      struct hd_struct.  However, the whole disk (part0) and other
      partitions have a lot in common and the data structures end up having
      a good number of common fields and thus separate code paths doing the
      same thing.  Also, the partition array was indexed by partno - 1 which
      gets pretty confusing at times.
      
      This patch introduces partition 0 and makes the partition array
      indexed by partno.  Following patches will unify the handling of disk
      and parts piece-by-piece.
      
      This patch also implements disk_partitionable() which tests whether a
      disk is partitionable.  With the coming dynamic partition array change,
      the most common usage of disk_max_parts() will be testing whether a
      disk is partitionable and the number of max partitions will become
      much less important.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: implement and use {disk|part}_to_dev() · ed9e1982
      Committed by Tejun Heo
      Implement {disk|part}_to_dev() and use them to access generic device
      instead of directly dereferencing {disk|part}->dev.  To make sure no
      user is left behind, rename the generic device fields to __dev.
      
      This is in preparation of unifying partition 0 handling with other
      partitions.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: implement extended dev numbers · bcce3de1
      Committed by Tejun Heo
      Implement extended device numbers.  A block driver can tell block
      layer that it wants to use extended device numbers.  After the usual
      minor space is used up, block layer automatically allocates devt's
      from EXT_BLOCK_MAJOR.
      
      Currently only one major number is allocated for this but as the
      allocation is strictly on-demand, ~1mil minor space under it should
      suffice unless the system actually has more than ~1mil partitions and
      if that ever happens adding more majors to the extended devt area is
      easy.
      
      Due to internal implementation issues, the first partition can't be
      allocated on the extended area.  In other words, genhd->minors should
      at least be 1.  This limitation will be lifted by later changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
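The allocation policy described in this commit can be sketched as a userspace model. It assumes the kernel's 20-bit minor encoding (which is where the "~1mil" figure comes from, since 2^20 = 1048576), uses the name EXT_BLOCK_MAJOR from the commit message with an assumed value of 259, and replaces the real allocator with a simple counter. `struct disk_model` and `alloc_devt()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* 20-bit minor encoding, mirroring the kernel's internal dev_t layout. */
#define MINORBITS 20
#define MINORMASK ((1u << MINORBITS) - 1)
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))
#define MAJOR(dev)    ((dev) >> MINORBITS)
#define MINOR(dev)    ((dev) & MINORMASK)

#define EXT_BLOCK_MAJOR 259u  /* name from the commit message; value assumed */

/* Hypothetical model of a disk's statically reserved minor range. */
struct disk_model { unsigned int major, first_minor, minors; };

/* Simplistic on-demand allocator for the extended area. */
static unsigned int next_ext_minor;

/* Returns 0 and fills *devt on success, -1 if the extended area is full. */
int alloc_devt(const struct disk_model *d, unsigned int partno,
               unsigned int *devt)
{
    assert(d != NULL && devt != NULL);

    /* Partitions that fit in the usual minor space keep their fixed number. */
    if (partno < d->minors) {
        *devt = MKDEV(d->major, d->first_minor + partno);
        return 0;
    }

    /* Everything else is allocated strictly on demand from the extended major. */
    if (next_ext_minor > MINORMASK)
        return -1;
    *devt = MKDEV(EXT_BLOCK_MAJOR, next_ext_minor++);
    return 0;
}
```

For a disk with major 8 and 16 minors, partition 1 would get devt 8:1 in this model, while partition 16 would fall outside the reserved range and receive the first free minor under the extended major.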