1. 17 12月, 2010 3 次提交
    • T
      implement in-kernel gendisk events handling · 77ea887e
      Tejun Heo 提交于
      Currently, media presence polling for removeable block devices is done
      from userland.  There are several issues with this.
      
      * Polling is done by periodically opening the device.  For SCSI
        devices, the command sequence generated by such action involves a
        few different commands including TEST_UNIT_READY.  This behavior,
        while perfectly legal, is different from Windows which only issues
        single command, GET_EVENT_STATUS_NOTIFICATION.  Unfortunately, some
        ATAPI devices lock up after being periodically queried such command
        sequences.
      
      * There is no reliable and unintrusive way for a userland program to
        tell whether the target device is safe for media presence polling.
        For example, polling for media presence during an on-going burning
        session can make it fail.  The polling program can avoid this by
        opening the device with O_EXCL but then it risks making a valid
        exclusive user of the device fail w/ -EBUSY.
      
      * Userland polling is unnecessarily heavy and in-kernel implementation
        is lighter and better coordinated (workqueue, timer slack).
      
      This patch implements framework for in-kernel disk event handling,
      which includes media presence polling.
      
      * bdops->check_events() is added, which supercedes ->media_changed().
        It should check whether there's any pending event and return if so.
        Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
        DISK_EVENT_EJECT_REQUEST.  ->check_events() is guaranteed not to be
        called parallelly.
      
      * gendisk->events and ->async_events are added.  These should be
        initialized by block driver before passing the device to add_disk().
        The former contains the mask of all supported events and the latter
        the mask of all events which the device can report without polling.
        /sys/block/*/events[_async] export these to userland.
      
      * Kernel parameter block.events_dfl_poll_msecs controls the system
        polling interval (default is 0 which means disable) and
        /sys/block/*/events_poll_msecs control polling intervals for
        individual devices (default is -1 meaning use system setting).  Note
        that if a device can report all supported events asynchronously and
        its polling interval isn't explicitly set, the device won't be
        polled regardless of the system polling interval.
      
      * If a device is opened exclusively with write access, event checking
        is automatically disabled until all write exclusive accesses are
        released.
      
      * There are event 'clearing' events.  For example, both of currently
        defined events are cleared after the device has been successfully
        opened.  This information is passed to ->check_events() callback
        using @clearing argument as a hint.
      
      * Event checking is always performed from system_nrt_wq and timer
        slack is set to 25% for polling.
      
      * Nothing changes for drivers which implement ->media_changed() but
        not ->check_events().  Going forward, all drivers will be converted
        to ->check_events() and ->media_change() will be dropped.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      77ea887e
    • T
      block: move register_disk() and del_gendisk() to block/genhd.c · d2bf1b67
      Tejun Heo 提交于
      There's no reason for register_disk() and del_gendisk() to be in
      fs/partitions/check.c.  Move both to genhd.c.  While at it, collapse
      unlink_gendisk(), which was artificially in a separate function due to
      genhd.c / check.c split, into del_gendisk().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      d2bf1b67
    • T
      block: kill genhd_media_change_notify() · dddd9dc3
      Tejun Heo 提交于
      There's no user of the facility.  Kill it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      dddd9dc3
  2. 13 12月, 2010 1 次提交
  3. 01 12月, 2010 2 次提交
  4. 16 11月, 2010 2 次提交
    • M
    • V
      blk-cgroup: Allow creation of hierarchical cgroups · bdc85df7
      Vivek Goyal 提交于
      o Allow hierarchical cgroup creation for blkio controller
      
      o Currently we disallow it as both the io controller policies (throttling
        as well as proportion bandwidth) do not support hierarhical accounting
        and control. But the flip side is that blkio controller can not be used with
        libvirt as libvirt creates a cgroup hierarchy deeper than 1 level.
      
        <top-level-cgroup-dir>/<controller>/libvirt/qemu/<virtual-machine-groups>
      
      o So this patch will allow creation of cgroup hierarhcy but at the backend
        everything will be treated as flat. So if somebody created a an hierarchy
        like as follows.
      
      			root
      			/  \
      		     test1 test2
      			|
      		     test3
      
        CFQ and throttling will practically treat all groups at same level.
      
      				pivot
      			     /  |   \  \
      			root  test1 test2  test3
      
      o Once we have actual support for hierarchical accounting and control
        then we can introduce another cgroup tunable file "blkio.use_hierarchy"
        which will be 0 by default but if user wants to enforce hierarhical
        control then it can be set to 1. This way there should not be any
        ABI problems down the line.
      
      o The only not so pretty part is introduction of extra file "use_hierarchy"
        down the line. Kame-san had mentioned that hierarhical accounting is
        expensive in memory controller hence they keep it off by default. I
        suspect same will be the case for IO controller also as for each IO
        completion we shall have to account IO through hierarchy up to the root.
        if yes, then it probably is not a very bad idea to introduce this extra
        file so that it will be used only when somebody needs it and some people
        might enable hierarchy only in part of the hierarchy.
      
      o This is how basically memory controller also uses "use_hierarhcy" and
        they also allowed creation of hierarchies when actual backend support
        was not available.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Reviewed-by: NGui Jianfeng <guijianfeng@cn.fujitsu.com>
      Reviewed-by: NCiju Rajan K <ciju@linux.vnet.ibm.com>
      Tested-by: NCiju Rajan K <ciju@linux.vnet.ibm.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      bdc85df7
  5. 13 11月, 2010 1 次提交
    • T
      block: make blkdev_get/put() handle exclusive access · e525fd89
      Tejun Heo 提交于
      Over time, block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      
      * blkdev_get/put() are the primary open and close functions.
      
      * bd_claim/release() deal with exclusive open.
      
      * open/close_bdev_exclusive() are combination of open and claim and
        the other way around, respectively.
      
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      
      * open_by_devnum() wraps bdget() + blkdev_get().
      
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access as
      in-kernel open + claim sequence can disturb the existing exclusive
      open even before the block layer knows the current open if for another
      exclusive access.  Reorganize the interface such that,
      
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if is @FMODE_EXCL specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      
      Most of bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at the open time (as
      it should).  This cleans up code and removes some, both valid and
      invalid, but unnecessary all the same, corner cases.
      
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      
      Most conversions are straight-forward.  drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      same.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Brown <neilb@suse.de>
      Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Acked-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      e525fd89
  6. 11 11月, 2010 1 次提交
  7. 10 11月, 2010 5 次提交
  8. 09 11月, 2010 1 次提交
  9. 08 11月, 2010 3 次提交
    • S
      cfq-iosched: don't idle if a deep seek queue is slow · 8e1ac665
      Shaohua Li 提交于
      If a deep seek queue slowly deliver requests but disk is much faster, idle
      for the queue just wastes disk throughput. If the queue delevers all requests
      before half its slice is used, the patch disable idle for it.
      In my test, application delivers 32 requests one time, the disk can accept
      128 requests at maxium and disk is fast. without the patch, the throughput
      is just around 30m/s, while with it, the speed is about 80m/s. The disk is
      a SSD, but is detected as a rotational disk. I can configure it as SSD, but
      I thought the deep seek queue logic should be fixed too, for example,
      considering a fast raid.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      8e1ac665
    • S
      cfq-iosched: schedule dispatch for noidle queue · d2d59e18
      Shaohua Li 提交于
      A queue is idle at cfq_dispatch_requests(), but it gets noidle later. Unless
      other task explictly does unplug or all requests are drained, we will not
      deliever requests to the disk even cfq_arm_slice_timer doesn't make the
      queue idle. For example, cfq_should_idle() returns true because of
      service_tree->count == 1, and then other queues are added. Note, I didn't
      see obvious performance impacts so far with the patch, but just thought
      this could be a problem.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      d2d59e18
    • S
      cfq-iosched: do cleanup · c1e44756
      Shaohua Li 提交于
      Some functions should return boolean.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      c1e44756
  10. 25 10月, 2010 1 次提交
  11. 24 10月, 2010 1 次提交
  12. 23 10月, 2010 1 次提交
    • A
      SYSFS: Allow boot time switching between deprecated and modern sysfs layout · e52eec13
      Andi Kleen 提交于
      I have some systems which need legacy sysfs due to old tools that are
      making assumptions that a directory can never be a symlink to another
      directory, and it's a big hazzle to compile separate kernels for them.
      
      This patch turns CONFIG_SYSFS_DEPRECATED into a run time option
      that can be switched on/off the kernel command line. This way
      the same binary can be used in both cases with just a option
      on the command line.
      
      The old CONFIG_SYSFS_DEPRECATED_V2 option is still there to set
      the default. I kept the weird name to not break existing
      config files.
      
      Also the compat code can be still completely disabled by undefining
      CONFIG_SYSFS_DEPRECATED_SWITCH -- just the optimizer takes
      care of this now instead of lots of ifdefs. This makes the code
      look nicer.
      
      v2: This is an updated version on top of Kay's patch to only
      handle the block devices. I tested it on my old systems
      and that seems to work.
      
      Cc: axboe@kernel.dk
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      e52eec13
  13. 22 10月, 2010 1 次提交
    • V
      cfq-iosched: Fix a gcc 4.5 warning and put some comments · b4627321
      Vivek Goyal 提交于
      - Andi encountedred following warning with gcc 4.5
      
        linux/block/cfq-iosched.c: In function ‘cfq_dispatch_requests’:
        linux/block/cfq-iosched.c:2156:3: warning: array subscript is above array
        bounds
      
      - Warning happens due to following code.
      
        slice = group_slice * count /
      		max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
      		cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));
      
        gcc is complaining about cfqg->busy_queues_avg[] being indexed by CFQ
        prio classes (RT, BE and IDLE) while the array size is only 2.
      
      - At run time, we never access cfqg->busy_queues_avg[IDLE] and return from
        function before this code hits.
      
      - To fix warning increase the array size though it will remain unused. This
        patch also puts some comments to clarify some of the confusions.
      
      - I have taken Jens's patch and modified it a bit.
      
      - Compile tested with gcc 4.4 and boot tested. I don't have gcc 4.5
        running, Andi can you please test it with gcc 4.5 to make sure it
        worked.
      Reported-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      b4627321
  14. 19 10月, 2010 1 次提交
    • Y
      block: fix accounting bug on cross partition merges · 7681bfee
      Yasuaki Ishimatsu 提交于
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      When reloading partition tables, quiesce IO to ensure that no
      request references to the partition struct exists. When it is safe
      to free the partition table, the IO for that device is restarted
      again.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7681bfee
  15. 15 10月, 2010 3 次提交
    • F
      [SCSI] bsg: fix incorrect device_status value · 47897160
      FUJITA Tomonori 提交于
      bsg incorrectly returns sg's masked_status value for device_status.
      
      [jejb: fix up expression logic]
      Reported-by: NDouglas Gilbert <dgilbert@interlog.com>
      Signed-off-by: NFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Stable Tree <stable@kernel.org>
      Signed-off-by: NJames Bottomley <James.Bottomley@suse.de>
      47897160
    • A
      llseek: automatically add .llseek fop · 6038f373
      Arnd Bergmann 提交于
      All file_operations should get a .llseek operation so we can make
      nonseekable_open the default for future file operations without a
      .llseek pointer.
      
      The three cases that we can automatically detect are no_llseek, seq_lseek
      and default_llseek. For cases where we can we can automatically prove that
      the file offset is always ignored, we use noop_llseek, which maintains
      the current behavior of not returning an error from a seek.
      
      New drivers should normally not use noop_llseek but instead use no_llseek
      and call nonseekable_open at open time.  Existing drivers can be converted
      to do the same when the maintainer knows for certain that no user code
      relies on calling seek on the device file.
      
      The generated code is often incorrectly indented and right now contains
      comments that clarify for each added line why a specific variant was
      chosen. In the version that gets submitted upstream, the comments will
      be gone and I will manually fix the indentation, because there does not
      seem to be a way to do that using coccinelle.
      
      Some amount of new code is currently sitting in linux-next that should get
      the same modifications, which I will do at the end of the merge window.
      
      Many thanks to Julia Lawall for helping me learn to write a semantic
      patch that does all this.
      
      ===== begin semantic patch =====
      // This adds an llseek= method to all file operations,
      // as a preparation for making no_llseek the default.
      //
      // The rules are
      // - use no_llseek explicitly if we do nonseekable_open
      // - use seq_lseek for sequential files
      // - use default_llseek if we know we access f_pos
      // - use noop_llseek if we know we don't access f_pos,
      //   but we still want to allow users to call lseek
      //
      @ open1 exists @
      identifier nested_open;
      @@
      nested_open(...)
      {
      <+...
      nonseekable_open(...)
      ...+>
      }
      
      @ open exists@
      identifier open_f;
      identifier i, f;
      identifier open1.nested_open;
      @@
      int open_f(struct inode *i, struct file *f)
      {
      <+...
      (
      nonseekable_open(...)
      |
      nested_open(...)
      )
      ...+>
      }
      
      @ read disable optional_qualifier exists @
      identifier read_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      expression E;
      identifier func;
      @@
      ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
      {
      <+...
      (
         *off = E
      |
         *off += E
      |
         func(..., off, ...)
      |
         E = *off
      )
      ...+>
      }
      
      @ read_no_fpos disable optional_qualifier exists @
      identifier read_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      @@
      ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
      {
      ... when != off
      }
      
      @ write @
      identifier write_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      expression E;
      identifier func;
      @@
      ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
      {
      <+...
      (
        *off = E
      |
        *off += E
      |
        func(..., off, ...)
      |
        E = *off
      )
      ...+>
      }
      
      @ write_no_fpos @
      identifier write_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      @@
      ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
      {
      ... when != off
      }
      
      @ fops0 @
      identifier fops;
      @@
      struct file_operations fops = {
       ...
      };
      
      @ has_llseek depends on fops0 @
      identifier fops0.fops;
      identifier llseek_f;
      @@
      struct file_operations fops = {
      ...
       .llseek = llseek_f,
      ...
      };
      
      @ has_read depends on fops0 @
      identifier fops0.fops;
      identifier read_f;
      @@
      struct file_operations fops = {
      ...
       .read = read_f,
      ...
      };
      
      @ has_write depends on fops0 @
      identifier fops0.fops;
      identifier write_f;
      @@
      struct file_operations fops = {
      ...
       .write = write_f,
      ...
      };
      
      @ has_open depends on fops0 @
      identifier fops0.fops;
      identifier open_f;
      @@
      struct file_operations fops = {
      ...
       .open = open_f,
      ...
      };
      
      // use no_llseek if we call nonseekable_open
      ////////////////////////////////////////////
      @ nonseekable1 depends on !has_llseek && has_open @
      identifier fops0.fops;
      identifier nso ~= "nonseekable_open";
      @@
      struct file_operations fops = {
      ...  .open = nso, ...
      +.llseek = no_llseek, /* nonseekable */
      };
      
      @ nonseekable2 depends on !has_llseek @
      identifier fops0.fops;
      identifier open.open_f;
      @@
      struct file_operations fops = {
      ...  .open = open_f, ...
      +.llseek = no_llseek, /* open uses nonseekable */
      };
      
      // use seq_lseek for sequential files
      /////////////////////////////////////
      @ seq depends on !has_llseek @
      identifier fops0.fops;
      identifier sr ~= "seq_read";
      @@
      struct file_operations fops = {
      ...  .read = sr, ...
      +.llseek = seq_lseek, /* we have seq_read */
      };
      
      // use default_llseek if there is a readdir
      ///////////////////////////////////////////
      @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier readdir_e;
      @@
      // any other fop is used that changes pos
      struct file_operations fops = {
      ... .readdir = readdir_e, ...
      +.llseek = default_llseek, /* readdir is present */
      };
      
      // use default_llseek if at least one of read/write touches f_pos
      /////////////////////////////////////////////////////////////////
      @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read.read_f;
      @@
      // read fops use offset
      struct file_operations fops = {
      ... .read = read_f, ...
      +.llseek = default_llseek, /* read accesses f_pos */
      };
      
      @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier write.write_f;
      @@
      // write fops use offset
      struct file_operations fops = {
      ... .write = write_f, ...
      +	.llseek = default_llseek, /* write accesses f_pos */
      };
      
      // Use noop_llseek if neither read nor write accesses f_pos
      ///////////////////////////////////////////////////////////
      
      @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read_no_fpos.read_f;
      identifier write_no_fpos.write_f;
      @@
      // write fops use offset
      struct file_operations fops = {
      ...
       .write = write_f,
       .read = read_f,
      ...
      +.llseek = noop_llseek, /* read and write both use no f_pos */
      };
      
      @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier write_no_fpos.write_f;
      @@
      struct file_operations fops = {
      ... .write = write_f, ...
      +.llseek = noop_llseek, /* write uses no f_pos */
      };
      
      @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read_no_fpos.read_f;
      @@
      struct file_operations fops = {
      ... .read = read_f, ...
      +.llseek = noop_llseek, /* read uses no f_pos */
      };
      
      @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      @@
      struct file_operations fops = {
      ...
      +.llseek = noop_llseek, /* no read or write fn */
      };
      ===== End semantic patch =====
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Julia Lawall <julia@diku.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      6038f373
    • M
      block: Fix double free in blk_integrity_unregister · e817bf3f
      Martin K. Petersen 提交于
      Commit 3839e4b2 introduced a kobject_put but failed to remove the
      kmem_cache_free beneath it, leading to a double free.
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      e817bf3f
  16. 14 10月, 2010 1 次提交
  17. 07 10月, 2010 1 次提交
    • J
      elevator: fix oops on early call to elevator_change() · 430c62fb
      Jens Axboe 提交于
      2.6.36 introduces an API for drivers to switch the IO scheduler
      instead of manually calling the elevator exit and init functions.
      This API was added since q->elevator must be cleared in between
      those two calls. And since we already have this functionality
      directly from use by the sysfs interface to switch schedulers
      online, it was prudent to reuse it internally too.
      
      But this API needs the queue to be in a fully initialized state
      before it is called, or it will attempt to unregister elevator
      kobjects before they have been added. This results in an oops
      like this:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000051
      IP: [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
      PGD 47ddfc067 PUD 47c6a1067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP
      last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.1/irq
      CPU 2
      Modules linked in: t(+) loop hid_apple usbhid ahci ehci_hcd uhci_hcd libahci usbcore nls_base igb
      
      Pid: 7319, comm: modprobe Not tainted 2.6.36-rc6+ #132 QSSC-S4R/QSSC-S4R
      RIP: 0010:[<ffffffff8116f15e>]  [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
      RSP: 0018:ffff88027da25d08  EFLAGS: 00010246
      RAX: ffff88047c68c528 RBX: 00000000fffffffe RCX: 0000000000000000
      RDX: 000000000000002f RSI: 000000000000002f RDI: ffff88047e196c88
      RBP: ffff88027da25d38 R08: 0000000000000000 R09: d84156c5635688c0
      R10: d84156c5635688c0 R11: 0000000000000000 R12: ffff88047e196c88
      R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047c68c528
      FS:  00007fcb0b26f6e0(0000) GS:ffff880287400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000051 CR3: 000000047e76e000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process modprobe (pid: 7319, threadinfo ffff88027da24000, task ffff88027d377090)
      Stack:
       ffff88027da25d58 ffff88047c68c528 00000000fffffffe ffff88047e196c88
      <0> ffff88047c68c528 ffff88047e05bd90 ffff88027da25d78 ffffffff8123fb77
      <0> ffff88047e05bd90 0000000000000000 ffff88047e196c88 ffff88047c68c528
      Call Trace:
       [<ffffffff8123fb77>] kobject_add_internal+0xe7/0x1f0
       [<ffffffff8123fd98>] kobject_add_varg+0x38/0x60
       [<ffffffff8123feb9>] kobject_add+0x69/0x90
       [<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
       [<ffffffff8103d48d>] ? sub_preempt_count+0x9d/0xe0
       [<ffffffff8143de20>] ? _raw_spin_unlock+0x30/0x50
       [<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
       [<ffffffff8116eff4>] ? sysfs_remove_dir+0x34/0xa0
       [<ffffffff81224204>] elv_register_queue+0x34/0xa0
       [<ffffffff81224aad>] elevator_change+0xfd/0x250
       [<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
       [<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
       [<ffffffffa007e0a8>] t_init+0xa8/0x361 [t]
       [<ffffffff810001de>] do_one_initcall+0x3e/0x170
       [<ffffffff8108c3fd>] sys_init_module+0xbd/0x220
       [<ffffffff81002f2b>] system_call_fastpath+0x16/0x1b
      Code: e5 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 85 ff 74 52 48 8b 47 18 49 c7 c5 00 46 61 81 48 85 c0 74 04 4c 8b 68 30 45 31 f6 <41> 80 7d 51 00 74 0e 49 8b 44 24 28 4c 89 e7 ff 50 20 49 89 c6
      RIP  [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
       RSP <ffff88027da25d08>
      CR2: 0000000000000051
      ---[ end trace a6541d3bf07945df ]---
      
      Fix this by adding a registered bit to the elevator queue, which is
      set when the sysfs kobjects have been registered.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      430c62fb
  18. 05 10月, 2010 1 次提交
    • A
      block: autoconvert trivial BKL users to private mutex · 2a48fc0a
      Arnd Bergmann 提交于
      The block device drivers have all gained new lock_kernel
      calls from a recent pushdown, and some of the drivers
      were already using the BKL before.
      
      This turns the BKL into a set of per-driver mutexes.
      Still need to check whether this is safe to do.
      
      file=$1
      name=$2
      if grep -q lock_kernel ${file} ; then
          if grep -q 'include.*linux.mutex.h' ${file} ; then
                  sed -i '/include.*<linux\/smp_lock.h>/d' ${file}
          else
                  sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file}
          fi
          sed -i ${file} \
              -e "/^#include.*linux.mutex.h/,$ {
                      1,/^\(static\|int\|long\)/ {
                           /^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);
      
      } }"  \
          -e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
          -e '/[      ]*cycle_kernel_lock();/d'
      else
          sed -i -e '/include.*\<smp_lock.h\>/d' ${file}  \
                      -e '/cycle_kernel_lock()/d'
      fi
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      2a48fc0a
  19. 02 10月, 2010 3 次提交
  20. 01 10月, 2010 7 次提交