1. 26 4月, 2017 1 次提交
    • D
      Revert "block: use DAX for partition table reads" · a41fe02b
      Dan Williams 提交于
      commit d1a5f2b4 ("block: use DAX for partition table reads") was
      part of a stalled effort to allow dax mappings of block devices. Since
      then the device-dax mechanism has filled the role of dax-mapping static
      device ranges.
      
      Now that we are moving ->direct_access() from a block_device operation
      to a dax_inode operation we would need block devices to map and carry
      their own dax_inode reference.
      
      Unless / until we decide to revive dax mapping of raw block devices
      through the dax_inode scheme, there is no need to carry
      read_dax_sector(). Its removal in turn allows for the removal of
      bdev_direct_access() and should have been included in commit
      22375701 ("block_dev: remove DAX leftovers").
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      a41fe02b
  2. 12 1月, 2017 1 次提交
    • D
      block: Rename blk_queue_zone_size and bdev_zone_size · f99e8648
      Damien Le Moal 提交于
      All block device data fields and functions returning a number of 512B
      sectors are by convention named xxx_sectors while names in the form
      xxx_size are generally used for a number of bytes. The blk_queue_zone_size
      and bdev_zone_size functions were not following this convention so rename
      them.
      
      No functional change is introduced by this patch.
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      
      Collapsed the two patches, they were nonsensically split and broke
      bisection.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f99e8648
  3. 01 12月, 2016 1 次提交
    • D
      block: Check partition alignment on zoned block devices · b02d8aae
      Damien Le Moal 提交于
      Both blkdev_report_zones and blkdev_reset_zones can operate on a partition of
      a zoned block device. However, the first and last zones reported for a
      partition make sense only if the partition start sector and size are aligned
      on the device zone size. The same applies for zone reset. Resetting the first
      or the last zone of a partition straddling zones may impact neighboring
      partitions. Finally, if a partition start sector is not at the beginning of a
      sequential zone, it will be impossible to write to the first sectors of the
      partition on a host-managed device.
      Avoid all these problems and incoherencies by ignoring partitions that are not
      zone aligned.
      
      Note: Even with CONFIG_BLK_DEV_ZONED disabled, bdev_is_zoned() will report the
      correct disk zoning type (host-aware, host-managed or none) but
      bdev_zone_size() will always return 0 for zoned block devices (i.e. the zone
      size is unknown). So test this as a way to ensure that a zoned block device is
      being handled as such. As a result, for a host-aware devices, unaligned zone
      partitions will be accepted with CONFIG_BLK_DEV_ZONED disabled. That is, the
      disk will be treated as a regular block device (as it should). If zoned block
      device support is enabled, only aligned partitions will be accepted.
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b02d8aae
  4. 14 6月, 2016 1 次提交
  5. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  6. 30 3月, 2016 1 次提交
  7. 16 3月, 2016 1 次提交
  8. 31 1月, 2016 1 次提交
  9. 26 11月, 2015 1 次提交
  10. 22 10月, 2015 1 次提交
    • M
      block: Inline blk_integrity in struct gendisk · 25520d55
      Martin K. Petersen 提交于
      Up until now the_integrity profile has been dynamically allocated and
      attached to struct gendisk after the disk has been made active.
      
      This causes problems because NVMe devices need to register the profile
      prior to the partition table being read due to a mandatory metadata
      buffer requirement. In addition, DM goes through hoops to deal with
      preallocating, but not initializing integrity profiles.
      
      Since the integrity profile is small (4 bytes + a pointer), Christoph
      suggested moving it to struct gendisk proper. This requires several
      changes:
      
       - Moving the blk_integrity definition to genhd.h.
      
       - Inlining blk_integrity in struct gendisk.
      
       - Removing the dynamic allocation code.
      
       - Adding helper functions which allow gendisk to set up and tear down
         the integrity sysfs dir when a disk is added/deleted.
      
       - Adding a blk_integrity_revalidate() callback for updating the stable
         pages bdi setting.
      
       - The calls that depend on whether a device has an integrity profile or
         not now key off of the bi->profile pointer.
      
       - Simplifying the integrity support routines in DM (Mike Snitzer).
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Reported-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      25520d55
  11. 17 7月, 2015 2 次提交
  12. 04 9月, 2014 1 次提交
  13. 08 4月, 2013 1 次提交
    • J
      Revert "loop: cleanup partitions when detaching loop device" · c2fccc1c
      Jens Axboe 提交于
      This reverts commit 8761a3dc.
      
      There are situations where the destruction path is called
      with the bdev->bd_mutex already held, which then deadlocks in
      loop_clr_fd(). The normal partition cleanup does a trylock()
      on the mutex, but it'd be nice to have a more bullet proof
      method in loop. So punt this more involved fix to the next
      merge window, and just back out this buggy fix for now.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c2fccc1c
  14. 23 3月, 2013 1 次提交
    • P
      loop: cleanup partitions when detaching loop device · 8761a3dc
      Phillip Susi 提交于
      Any partitions added by user space to the loop device were being
      left in place after detaching the loop device.  This was because
      the detach path issued a BLKRRPART to clean up partitions if
      LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto
      scanned on attach.  Replace this BLKRRPART with code that
      unconditionally cleans up partitions on detach instead.
      Signed-off-by: NPhillip Susi <psusi@ubuntu.com>
      
      Modified by Jens to export delete_partition().
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8761a3dc
  15. 28 2月, 2013 2 次提交
    • M
      block/partitions: optimize memory allocation in check_partition() · ac2e5327
      Ming Lei 提交于
      Currently, sizeof(struct parsed_partitions) may be 64KB in 32bit arch, so
      it is easy to trigger page allocation failure by check_partition,
      especially in hotplug block device situation(such as, USB mass storage,
      MMC card, ...), and Felipe Balbi has observed the failure.
      
      This patch does below optimizations on the allocation of struct
      parsed_partitions to try to address the issue:
      
      - make parsed_partitions.parts as pointer so that the pointed memory can
        fit in 32KB buffer, then approximate 32KB memory can be saved
      
      - vmalloc the buffer pointed by parsed_partitions.parts because 32KB is
        still a bit big for kmalloc
      
      - given that many devices have the partition count limit, so only
        allocate disk_max_parts() partitions instead of 256 partitions always
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Reported-by: NFelipe Balbi <balbi@ti.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac2e5327
    • T
      block: fix ext_devt_idr handling · 7b74e912
      Tomas Henzl 提交于
      While adding and removing a lot of disks disks and partitions this
      sometimes shows up:
      
        WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
        Hardware name:
        sysfs: cannot create duplicate filename '/dev/block/259:751'
        Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
        Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
        Call Trace:
          warn_slowpath_common+0x87/0xc0
          warn_slowpath_fmt+0x46/0x50
          sysfs_add_one+0xc9/0x130
          sysfs_do_create_link+0x12b/0x170
          sysfs_create_link+0x13/0x20
          device_add+0x317/0x650
          idr_get_new+0x13/0x50
          add_partition+0x21c/0x390
          rescan_partitions+0x32b/0x470
          sd_open+0x81/0x1f0 [sd_mod]
          __blkdev_get+0x1b6/0x3c0
          blkdev_get+0x10/0x20
          register_disk+0x155/0x170
          add_disk+0xa6/0x160
          sd_probe_async+0x13b/0x210 [sd_mod]
          add_wait_queue+0x46/0x60
          async_thread+0x102/0x250
          default_wake_function+0x0/0x20
          async_thread+0x0/0x250
          kthread+0x96/0xa0
          child_rip+0xa/0x20
          kthread+0x0/0xa0
          child_rip+0x0/0x20
      
      This most likely happens because dev_t is freed while the number is
      still used and idr_get_new() is not protected on every use.  The fix
      adds a mutex where it wasn't before and moves the dev_t free function so
      it is called after device del.
      Signed-off-by: NTomas Henzl <thenzl@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b74e912
  16. 01 8月, 2012 1 次提交
    • V
      block: add partition resize function to blkpg ioctl · c83f6bf9
      Vivek Goyal 提交于
      Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
      allows altering the size of an existing partition, even if it is currently
      in use.
      
      This patch converts hd_struct->nr_sects into sequence counter because
      One might extend a partition while IO is happening to it and update of
      nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
      can lead to issues like reading inconsistent size of a partition. Sequence
      counter have been used so that readers don't have to take bdev mutex lock
      as we call sector_in_part() very frequently.
      
      Now all the access to hd_struct->nr_sects should happen using sequence
      counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
      There is one exception though, set_capacity()/get_capacity(). I think
      theoritically race should exist there too but this patch does not
      modify set_capacity()/get_capacity() due to sheer number of call sites
      and I am afraid that change might break something. I have left that as a
      TODO item. We can handle it later if need be. This patch does not introduce
      any new races as such w.r.t set_capacity()/get_capacity().
      
      v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NPhillip Susi <psusi@ubuntu.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c83f6bf9
  17. 02 3月, 2012 1 次提交
  18. 04 1月, 2012 2 次提交
  19. 13 6月, 2011 1 次提交
  20. 30 5月, 2011 1 次提交
    • J
      Revert "block: Remove extra discard_alignment from hd_struct." · a1706ac4
      Jens Axboe 提交于
      It was not a good idea to start dereferencing disk->queue from
      the fs sysfs strategy for displaying discard alignment. We ran
      into first a NULL pointer deref, and after fixing that we sometimes
      see unvalid disk->queue pointer values.
      
      Since discard is the only one of the bunch actually looking into
      the queue, just revert the change.
      
      This reverts commit 23ceb5b7.
      
      Conflicts:
      	fs/partitions/check.c
      a1706ac4
  21. 27 5月, 2011 1 次提交
  22. 09 5月, 2011 1 次提交
    • J
      fs: fixup warning part_discard_alignment_show() · bbdd304c
      Jens Axboe 提交于
      Stephen reports:
      
      -----
      
      After merging the block tree, today's linux-next build (x86_64
      allmodconfig) produced this warning:
      
      fs/partitions/check.c: In function 'part_discard_alignment_show':
      fs/partitions/check.c:263: warning: format '%u' expects type 'unsigned int', but argument 3 has type 'long long unsigned int'
      
      Introduced by commit  ("block: Remove extra discard_alignment from
      hd_struct")
      
      -----
      
      Fix it up by just removing the cast, we return an int already.
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      bbdd304c
  23. 07 5月, 2011 1 次提交
  24. 31 3月, 2011 1 次提交
  25. 22 3月, 2011 1 次提交
  26. 07 1月, 2011 1 次提交
  27. 05 1月, 2011 1 次提交
    • J
      block: fix accounting bug on cross partition merges · 09e099d4
      Jerome Marchand 提交于
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      Also add a refcount to struct hd_struct to keep the partition in
      memory as long as users exist. We use kref_test_and_get() to ensure
      we don't add a reference to a partition which is going away.
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      09e099d4
  28. 17 12月, 2010 1 次提交
  29. 13 11月, 2010 1 次提交
    • T
      block: make blkdev_get/put() handle exclusive access · e525fd89
      Tejun Heo 提交于
      Over time, block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      
      * blkdev_get/put() are the primary open and close functions.
      
      * bd_claim/release() deal with exclusive open.
      
      * open/close_bdev_exclusive() are combination of open and claim and
        the other way around, respectively.
      
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      
      * open_by_devnum() wraps bdget() + blkdev_get().
      
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access as
      in-kernel open + claim sequence can disturb the existing exclusive
      open even before the block layer knows the current open if for another
      exclusive access.  Reorganize the interface such that,
      
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if is @FMODE_EXCL specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      
      Most of bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at the open time (as
      it should).  This cleans up code and removes some, both valid and
      invalid, but unnecessary all the same, corner cases.
      
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      
      Most conversions are straight-forward.  drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      same.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Brown <neilb@suse.de>
      Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Acked-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      e525fd89
  30. 11 11月, 2010 1 次提交
  31. 25 10月, 2010 1 次提交
  32. 23 10月, 2010 1 次提交
    • A
      SYSFS: Allow boot time switching between deprecated and modern sysfs layout · e52eec13
      Andi Kleen 提交于
      I have some systems which need legacy sysfs due to old tools that are
      making assumptions that a directory can never be a symlink to another
      directory, and it's a big hazzle to compile separate kernels for them.
      
      This patch turns CONFIG_SYSFS_DEPRECATED into a run time option
      that can be switched on/off the kernel command line. This way
      the same binary can be used in both cases with just a option
      on the command line.
      
      The old CONFIG_SYSFS_DEPRECATED_V2 option is still there to set
      the default. I kept the weird name to not break existing
      config files.
      
      Also the compat code can be still completely disabled by undefining
      CONFIG_SYSFS_DEPRECATED_SWITCH -- just the optimizer takes
      care of this now instead of lots of ifdefs. This makes the code
      look nicer.
      
      v2: This is an updated version on top of Kay's patch to only
      handle the block devices. I tested it on my old systems
      and that seems to work.
      
      Cc: axboe@kernel.dk
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      e52eec13
  33. 19 10月, 2010 1 次提交
    • Y
      block: fix accounting bug on cross partition merges · 7681bfee
      Yasuaki Ishimatsu 提交于
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      When reloading partition tables, quiesce IO to ensure that no
      request references to the partition struct exists. When it is safe
      to free the partition table, the IO for that device is restarted
      again.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7681bfee
  34. 15 9月, 2010 1 次提交
    • W
      block, partition: add partition_meta_info to hd_struct · 6d1d8050
      Will Drewry 提交于
      I'm reposting this patch series as v4 since there have been no additional
      comments, and I cleaned up one extra bit of unneeded code (in 3/3). The patches
      are against Linus's tree: 2bfc96a1
      (2.6.36-rc3).
      
      Would this patchset be suitable for inclusion in an mm branch?
      
      This changes adds a partition_meta_info struct which itself contains a
      union of structures that provide partition table specific metadata.
      
      This change leaves the union empty. The subsequent patch includes an
      implementation for CONFIG_EFI_PARTITION-based metadata.
      Signed-off-by: NWill Drewry <wad@chromium.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      6d1d8050
  35. 11 8月, 2010 1 次提交
  36. 15 6月, 2010 1 次提交
  37. 22 5月, 2010 1 次提交
    • T
      block: improve automatic native capacity unlocking · b403a98e
      Tejun Heo 提交于
      Currently, native capacity unlocking is initiated only when a
      recognized partition extends beyond the end of the disk.  However,
      there are several other unhandled cases where truncated capacity can
      lead to misdetection of partitions.
      
      * Partition table is fully beyond EOD.
      
      * Partition table is partially beyond EOD (daisy chained ones).
      
      * Recognized partition starts beyond EOD.
      
      This patch updates generic partition check code such that all the
      above three cases are handled too.  For the first two, @state tracks
      whether low level partition check code tried to read beyond EOD during
      partition scan and triggers native capacity unlocking accordingly.
      The third is now handled similarly to the original unlocking case.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      b403a98e