1. 27 Jan, 2009 1 commit
  2. 26 Jan, 2009 1 commit
    • blktrace: add ftrace plugin · c71a8961
      Arnaldo Carvalho de Melo committed
      Impact: New way of using the blktrace infrastructure
      
      This drops the requirement of userspace utilities to use the blktrace
      facility.
      
      Configuration is done through sysfs, adding a "trace" directory to the
      partition directory where blktrace can be enabled for the associated
      request_queue.
      
      The same filters present in the IOCTL interface are present as sysfs
      device attributes.
      
      The /sys/block/sdX/sdXN/trace/enable file allows tracing without any
      filters.
      
      The other files in this directory: pid, act_mask, start_lba and end_lba
      can be used with the same meaning as with the IOCTL interface.
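A minimal sketch of how a userspace helper could address these attribute files, assuming the /sys/block/sdX/sdXN/trace/ layout described above. The functions blk_trace_attr_path() and blk_trace_attr_write() are hypothetical names for illustration, not kernel or libc APIs, and the value written to "enable" is likewise an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/sys/block/<disk>/<disk><partno>/trace/<attr>" into buf.
 * Returns the snprintf() result: the number of characters the full
 * path needs (excluding the terminating NUL).
 * Hypothetical helper, named here for illustration only.
 */
static int blk_trace_attr_path(char *buf, size_t len, const char *disk,
                               unsigned int partno, const char *attr)
{
    assert(buf != NULL && disk != NULL && attr != NULL);
    return snprintf(buf, len, "/sys/block/%s/%s%u/trace/%s",
                    disk, disk, partno, attr);
}

/*
 * Write a short string to one attribute file.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int blk_trace_attr_write(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    size_t len = strlen(value);
    int rc = (fwrite(value, 1, len, f) == len) ? 0 : -1;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```

With disk "sda", partition 1 and attribute "enable", the first helper produces /sys/block/sda/sda1/trace/enable. Which string the attribute accepts (for instance "1") is not stated in the commit message and should be checked against the kernel source.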
      
      Using the sysfs interface only sets up the request_queue->blk_trace
      fields; tracing takes place only when the "blk" tracer is selected
      via the ftrace interface.
      
      To see the trace, one can use the /d/tracing/trace file or the
      /d/tracing/trace_pipe file, with semantics defined in the ftrace
      documentation in Documentation/ftrace.txt.
      
      [root@f10-1 ~]# cat /t/trace
             kjournald-305   [000]  3046.491224:   8,1    A WBS 6367 + 8 <- (8,1) 6304
             kjournald-305   [000]  3046.491227:   8,1    Q   R 6367 + 8 [kjournald]
             kjournald-305   [000]  3046.491236:   8,1    G  RB 6367 + 8 [kjournald]
             kjournald-305   [000]  3046.491239:   8,1    P  NS [kjournald]
             kjournald-305   [000]  3046.491242:   8,1    I RBS 6367 + 8 [kjournald]
             kjournald-305   [000]  3046.491251:   8,1    D  WB 6367 + 8 [kjournald]
             kjournald-305   [000]  3046.491610:   8,1    U  WS [kjournald] 1
                <idle>-0     [000]  3046.511914:   8,1    C  RS 6367 + 8 [6367]
      [root@f10-1 ~]#
      
      The default line context (prefix) format is the one described in the ftrace
      documentation, with the blktrace specific bits using its existing format,
      described in blkparse(8).
      
      If one wants to have the classic blktrace formatting, this is possible by
      using:
      
      [root@f10-1 ~]# echo blk_classic > /t/trace_options
      [root@f10-1 ~]# cat /t/trace
        8,1    0  3046.491224   305  A WBS 6367 + 8 <- (8,1) 6304
        8,1    0  3046.491227   305  Q   R 6367 + 8 [kjournald]
        8,1    0  3046.491236   305  G  RB 6367 + 8 [kjournald]
        8,1    0  3046.491239   305  P  NS [kjournald]
        8,1    0  3046.491242   305  I RBS 6367 + 8 [kjournald]
        8,1    0  3046.491251   305  D  WB 6367 + 8 [kjournald]
        8,1    0  3046.491610   305  U  WS [kjournald] 1
        8,1    0  3046.511914     0  C  RS 6367 + 8 [6367]
      [root@f10-1 ~]#
      
      Using the ftrace standard format allows more flexibility, such
      as the ability of asking for backtraces via trace_options:
      
      [root@f10-1 ~]# echo noblk_classic > /t/trace_options
      [root@f10-1 ~]# echo stacktrace > /t/trace_options
      
      [root@f10-1 ~]# cat /t/trace
             kjournald-305   [000]  3318.826779:   8,1    A WBS 6375 + 8 <- (8,1) 6312
             kjournald-305   [000]  3318.826782:
       <= submit_bio
       <= submit_bh
       <= sync_dirty_buffer
       <= journal_commit_transaction
       <= kjournald
       <= kthread
       <= child_rip
             kjournald-305   [000]  3318.826836:   8,1    Q   R 6375 + 8 [kjournald]
             kjournald-305   [000]  3318.826837:
       <= generic_make_request
       <= submit_bio
       <= submit_bh
       <= sync_dirty_buffer
       <= journal_commit_transaction
       <= kjournald
       <= kthread
      
      Please read the ftrace documentation to use additional, standardized
      tracing filters such as /d/tracing/trace_cpumask, etc.
      
      See also /d/tracing/trace_mark to add comments in the trace stream,
      which is equivalent to the /d/block/sdaN/msg interface.
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. 07 Jan, 2009 2 commits
  4. 03 Jan, 2009 2 commits
  5. 29 Dec, 2008 22 commits
    • cfq-iosched: fix race between exiting queue and exiting task · 62c1fe9d
      Jens Axboe committed
      Original patch from Nikanth Karthikesan <knikanth@suse.de>
      
      When a queue exits the queue lock is taken and cfq_exit_queue() would free all
      the cic's associated with the queue.
      
      But when a task exits, cfq_exit_io_context() gets cic one by one and then
      locks the associated queue to call __cfq_exit_single_io_context. It looks like
      between getting a cic from the ioc and locking the queue, the queue might have
      exited on another cpu.
      
      Fix this by rechecking the cfq_io_context queue key inside the queue lock
      again, and not calling into __cfq_exit_single_io_context() if somebody
      beat us to it.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • Get rid of CONFIG_LSF · b3a6ffe1
      Jens Axboe committed
      We have two separate config entries for large devices/files. One
      is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF
      that handles large files. This doesn't make a lot of sense, you typically
      want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording
      to indicate that it covers both.
      Acked-by: Jean Delvare <khali@linux-fr.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: make blk_softirq_init() static · 3c18ce71
      Roel Kluin committed
      Sparse asked whether these could be static.
      Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: use min_not_zero in blk_queue_stack_limits · 18af8b2c
      FUJITA Tomonori committed
      zero is invalid for max_phys_segments, max_hw_segments, and
      max_segment_size. It's better to use min_not_zero instead of
      min. min() works though (because the commit 0e435ac2 makes sure that
      these values are set to the default values, non zero, if a queue is
      initialized properly).
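The difference between the two helpers can be sketched as follows. This is an illustrative version for unsigned int only; the function names min_not_zero_uint(), min_uint() and ex_stack_segment_size() are hypothetical, and the kernel's own min_not_zero is not reproduced here.

```c
#include <assert.h>

/*
 * Smaller of two limits, where 0 means "not set" rather than "zero".
 * If exactly one argument is 0, the other one wins.
 */
static unsigned int min_not_zero_uint(unsigned int a, unsigned int b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}

/* Plain minimum, shown for contrast: an unset (0) limit always wins. */
static unsigned int min_uint(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Combine the limit of an upper queue with that of the queue below it. */
static unsigned int ex_stack_segment_size(unsigned int top, unsigned int bottom)
{
    unsigned int stacked = min_not_zero_uint(top, bottom);

    /* The result is 0 only if neither queue had a limit set. */
    assert(stacked != 0 || (top == 0 && bottom == 0));
    return stacked;
}
```

With a top limit of 65536 and an unset bottom limit, min_uint() yields 0, which is invalid for these fields, while min_not_zero_uint() keeps 65536.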
      
      With this patch, blk_queue_stack_limits does almost the same thing
      that dm's combine_restrictions_low() does. I think that it's easy to
      remove dm's combine_restrictions_low.
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: add one-hit cache for disk partition lookup · a6f23657
      Jens Axboe committed
      disk_map_sector_rcu() returns a partition from a sector offset,
      which we use for IO statistics on a per-partition basis. The
      lookup itself is an O(N) list lookup, where N is the number of
      partitions. This actually hurts performance quite a bit, even
      on the lower end partitions. On higher numbered partitions,
      it can get pretty bad.
      
      Solve this by adding a one-hit cache for partition lookup.
      This makes the lookup O(1) for the case where we do most IO to
      one partition. Even for mixed partition workloads, amortized cost
      is pretty close to O(1) since the natural IO batching makes the
      one-hit cache last for lots of IOs.
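The one-hit cache idea can be sketched in plain C as below. This is a simplified userspace model, not the kernel implementation: the ex_ types and functions are hypothetical, there is no RCU or locking, and a miss returns NULL instead of falling back to a whole-disk partition.

```c
#include <assert.h>
#include <stddef.h>

struct ex_part {
    unsigned long long start;   /* first sector of the partition */
    unsigned long long nr;      /* number of sectors */
};

struct ex_disk {
    const struct ex_part *parts;        /* partition table */
    size_t nr_parts;
    const struct ex_part *last_lookup;  /* one-hit cache, NULL when empty */
    unsigned long scans;                /* slow-path walks, for illustration */
};

static int ex_sector_in_part(const struct ex_part *p, unsigned long long sector)
{
    return sector >= p->start && sector - p->start < p->nr;
}

/* Map a sector to its partition: check the cached hit first, then scan. */
static const struct ex_part *ex_map_sector(struct ex_disk *d,
                                           unsigned long long sector)
{
    assert(d != NULL);

    /* Fast path: O(1) when consecutive IOs hit the same partition. */
    if (d->last_lookup && ex_sector_in_part(d->last_lookup, sector))
        return d->last_lookup;

    /* Slow path: O(N) walk, then remember the hit for the next lookup. */
    d->scans++;
    for (size_t i = 0; i < d->nr_parts; i++) {
        if (ex_sector_in_part(&d->parts[i], sector)) {
            d->last_lookup = &d->parts[i];
            return d->last_lookup;
        }
    }
    return NULL;    /* sector is not inside any partition */
}
```

The scans counter exists only to make the effect visible: repeated lookups inside one partition leave it unchanged, and it grows only when the workload switches partitions.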
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • cfq-iosched: remove limit of dispatch depth of max 4 times quantum · 30e0dc28
      Jens Axboe committed
      This basically limits the hardware queue depth to 4*quantum at any
      point in time, which is 16 with the default settings. As CFQ uses
      other means to shrink the hardware queue when necessary in the first
      place, there's really no need for this extra heuristic. Additionally,
      it ends up hurting performance in some cases.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: get rid of elevator_t typedef · b374d18a
      Jens Axboe committed
      Just use struct elevator_queue everywhere instead.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: don't use plugging on SSD devices · a31a9738
      Jens Axboe committed
      We just want to hand the first bits of IO to the device as fast
      as possible. Gains a few percent on the IOPS rate.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: fix empty barrier on write-through w/ ordered tag · a185eb4b
      Tejun Heo committed
      Empty barrier on write-through (or no cache) w/ ordered tag has no
      command to execute and without any command to execute ordered tag is
      never issued to the device and the ordering is never achieved.  Force
      draining for such cases.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: simplify empty barrier implementation · 58eea927
      Tejun Heo committed
      Empty barrier required special handling in __elv_next_request() to
      complete it without letting the low level driver see it.
      
      With previous changes, barrier code is now flexible enough to skip the
      BAR step using the same barrier sequence selection mechanism.  Drop
      the special handling and mask off q->ordered from start_ordered().
      
      Remove blk_empty_barrier() test which now has no user.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: make barrier completion more robust · 8f11b3e9
      Tejun Heo committed
      Barrier completion had the following assumptions.
      
      * start_ordered() couldn't finish the whole sequence properly.  If all
        actions are to be skipped, q->ordseq is set correctly but the actual
        completion was never triggered thus hanging the barrier request.
      
      * Drain completion in elv_complete_request() assumed that there's
        always at least one request in the queue when drain completes.
      
      Both assumptions are true but these assumptions need to be removed to
      improve empty barrier implementation.  This patch makes the following
      changes.
      
      * Make start_ordered() use blk_ordered_complete_seq() to mark skipped
        steps complete and notify __elv_next_request() that it should fetch
        the next request if the whole barrier has completed inside
        start_ordered().
      
      * Make drain completion path in elv_complete_request() check whether
        the queue is empty.  Empty queue also indicates drain completion.
      
      * While at it, convert 0/1 return from blk_do_ordered() to false/true.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: make every barrier action optional · f671620e
      Tejun Heo committed
      In all barrier sequences, the barrier write itself was always assumed
      to be issued and thus didn't have a corresponding control flag.  This
      patch adds QUEUE_ORDERED_DO_BAR and unifies action mask handling in
      start_ordered() such that any barrier action can be skipped.
      
      This patch doesn't introduce any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: remove duplicate or unused barrier/discard error paths · a7384677
      Tejun Heo committed
      * Because barrier mode can be changed dynamically, whether barrier is
        supported or not can be determined only when actually issuing the
        barrier and there is no point in checking it earlier.  Drop barrier
        support check in generic_make_request() and __make_request(), and
        update comment around the support check in blk_do_ordered().
      
      * There is no reason to check discard support in both
        generic_make_request() and __make_request().  Drop the check in
        __make_request().  While at it, move error action block to the end
        of the function and add unlikely() to q existence test.
      
      * Barrier request, be it empty or not, is never passed to low level
        driver and thus it's meaningless to try to copy back req->sector to
        bio->bi_sector on error.  In addition, the notion of failed sector
        doesn't make any sense for empty barrier to begin with.  Drop the
        code block from __end_that_request_first().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: reorganize QUEUE_ORDERED_* constants · 313e4299
      Tejun Heo committed
      Separate out ordering type (drain,) and action masks (preflush,
      postflush, fua) from visible ordering mode selectors
      (QUEUE_ORDERED_*).  Ordering types are now named QUEUE_ORDERED_BY_*
      while action masks are named QUEUE_ORDERED_DO_*.
      
      This change is necessary to add QUEUE_ORDERED_DO_BAR and make it
      optional to improve empty barrier implementation.
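The split between ordering types and action masks can be pictured with a small bit-flag sketch. All EX_ names and numeric values below are assumptions chosen for illustration; they are not copied from the kernel headers.

```c
#include <assert.h>

/* Illustrative constants only: names and values are hypothetical. */
enum {
    /* How ordering is enforced. */
    EX_ORDERED_BY_DRAIN     = 0x01,
    EX_ORDERED_BY_TAG       = 0x02,

    /* Which extra steps the barrier sequence performs. */
    EX_ORDERED_DO_PREFLUSH  = 0x10,
    EX_ORDERED_DO_POSTFLUSH = 0x20,
    EX_ORDERED_DO_FUA       = 0x40,

    EX_ORDERED_TYPE_MASK    = 0x0f,
    EX_ORDERED_ACTION_MASK  = 0xf0,

    /* A visible mode selector is one type combined with a set of actions. */
    EX_ORDERED_DRAIN_FLUSH  = EX_ORDERED_BY_DRAIN |
                              EX_ORDERED_DO_PREFLUSH |
                              EX_ORDERED_DO_POSTFLUSH,
    EX_ORDERED_TAG_FUA      = EX_ORDERED_BY_TAG |
                              EX_ORDERED_DO_PREFLUSH |
                              EX_ORDERED_DO_FUA
};

/* Extract the ordering type bits from a mode selector. */
static unsigned int ex_ordering_type(unsigned int mode)
{
    assert((mode & ~0xffu) == 0);   /* only the bits defined above */
    return mode & EX_ORDERED_TYPE_MASK;
}

/* Non-zero if the mode selector requests the given action. */
static int ex_wants(unsigned int mode, unsigned int action)
{
    return (mode & action) != 0;
}
```

Keeping types and actions in separate bit ranges lets a further action flag be added without touching the type bits, which is the motivation the commit message gives.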
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: use cancel_work_sync() instead of kblockd_flush_work() · 64d01dc9
      Cheng Renquan committed
      After many improvements on kblockd_flush_work, it is now identical to
      cancel_work_sync, so a direct call to cancel_work_sync is suggested.
      
      The only difference is that cancel_work_sync is a GPL symbol,
      so no non-GPL modules anymore.
      Signed-off-by: Cheng Renquan <crquan@gmail.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: Supress Buffer I/O errors when SCSI REQ_QUIET flag set · 08bafc03
      Keith Mannthey committed
      Allow the scsi request REQ_QUIET flag to be propagated to the buffer
      file system layer. The basic idea is to pass the flag from the scsi
      request to the bio (block IO) and then to the buffer layer.  The buffer
      layer can then suppress needless printks.
      
      This patch declutters the kernel log by removing the 40-50 (per lun)
      buffer io error messages seen during a boot in my multipath setup. There
      is a good chance any real errors will be missed in the "noise" in the
      logs without this patch.
      
      During boot I see blocks of messages like
      "
      __ratelimit: 211 callbacks suppressed
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242847
      Buffer I/O error on device sdm, logical block 1
      Buffer I/O error on device sdm, logical block 5242878
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242872
      "
      in my logs.
      
      My disk environment is multipath fiber channel using the SCSI_DH_RDAC
      code and multipathd.  This topology includes an "active" and "ghost"
      path for each lun. IOs to the "ghost" path will never complete and the
      SCSI layer, via the scsi device handler rdac code, quickly returns the IOs
      to these paths and sets the REQ_QUIET scsi flag to suppress the scsi
      layer messages.
      
      I want to extend the QUIET behavior to include the buffer file
      system layer to deal with these errors as well. I have been running this
      patch for a while now on several boxes without issue.  A few runs of
      bonnie++ show no noticeable difference in performance in my setup.
      
      Thanks to John Stultz for the quiet_error finalization.
      Submitted-by: Keith Mannthey <kmannth@us.ibm.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: don't take lock on changing ra_pages · 7c239517
      Wu Fengguang committed
      There's no need to take queue_lock or kernel_lock when modifying
      bdi->ra_pages. So remove them. Also remove out of date comment for
      queue_max_sectors_store().
      Signed-off-by: Wu Fengguang <wfg@linux.intel.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block/blk-tag.c: cleanup kernel-doc · c6a06f70
      Qinghuang Feng committed
      There is no argument named @tags in blk_init_tags,
      remove its comment.
      Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • scsi-ioctl: use clock_t <> jiffies · 2b91bafc
      Milton Miller committed
      Convert the timeout ioctl scaling to use the clock_t functions
      which are much more accurate with some USER_HZ vs HZ combinations.
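Why the choice of scaling matters can be shown with a small arithmetic sketch. It assumes, purely for illustration, that the less accurate variant multiplies by a precomputed integer factor HZ / USER_HZ; the commit message does not spell out the previous code. The ex_ functions are hypothetical and much simpler than the kernel's clock_t conversion helpers.

```c
#include <assert.h>

/*
 * Scale user ticks to jiffies with a precomputed integer factor.
 * Any remainder of hz / user_hz is lost before the multiplication.
 */
static unsigned long ex_ticks_to_jiffies_factor(unsigned long ticks,
                                                unsigned long hz,
                                                unsigned long user_hz)
{
    assert(user_hz != 0);
    return ticks * (hz / user_hz);
}

/*
 * Multiply first, then divide, so the ratio is not truncated up front.
 * Sketch only: ticks * hz can overflow for very large tick counts.
 */
static unsigned long ex_ticks_to_jiffies_ratio(unsigned long ticks,
                                               unsigned long hz,
                                               unsigned long user_hz)
{
    assert(user_hz != 0);
    return ticks * hz / user_hz;
}
```

With hz = 250 and user_hz = 100, a timeout of 1000 ticks (10 seconds) becomes 2000 jiffies (8 seconds) with the integer factor but 2500 jiffies (10 seconds) with the ratio. When hz is an exact multiple of user_hz, both variants agree.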
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: leave the request timeout timer running even on an empty list · 70ed28b9
      Jens Axboe committed
      For sync IO, we'll often do them serialized. This means we'll be touching
      the queue timer for every IO, as opposed to only occasionally like we
      do for queued IO. Instead of deleting the timer when the last request
      is removed, just let it continue running. If a new request comes up soon
      we then don't have to re-add the timer. If no new requests arrive,
      the timer will expire without side effect later.
      
      This improves high iops sync IO by ~1%.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • J
    • block: optimizations in blk_rq_timed_out_timer() · 565e411d
      malahal@us.ibm.com committed
      Now the rq->deadline can't be zero if the request is in the
      timeout_list, so there is no need to have next_set. There is no need to
      access a request's deadline field if blk_rq_timed_out is called on it.
      Signed-off-by: Malahal Naineni <malahal@us.ibm.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  6. 26 Dec, 2008 1 commit
  7. 06 Dec, 2008 1 commit
  8. 04 Dec, 2008 2 commits
  9. 03 Dec, 2008 4 commits
    • block: fix setting of max_segment_size and seg_boundary mask · 0e435ac2
      Milan Broz committed
      Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
      devices.
      
      When stacking devices (LVM over MD over SCSI) some of the request queue
      parameters are not set up correctly in some cases by default, namely
      max_segment_size and seg_boundary mask.
      
      If you create an MD device over SCSI, these attributes are zeroed.
      
      The problem arises when another device-mapper mapping is stacked on top
      of this mapping - queue attributes are set in DM this way:
      
      request_queue   max_segment_size  seg_boundary_mask
      SCSI                65536             0xffffffff
      MD RAID1                0                      0
      LVM                 65536                 -1 (64bit)
      
      Unfortunately bio_add_page (resp.  bio_phys_segments) calculates the number
      of physical segments according to these parameters.
      
      During generic_make_request() the segment count is recalculated and can
      increase the bio->bi_phys_segments count over the allowed limit.  (After
      bio_clone() in stack operation.)
      
      This is especially a problem in the CCISS driver, where it produces an
      OOPS here
      
          BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);
      
      (MAXSGENTRIES is 31 by default.)
      
      Sometimes even this command is enough to cause oops:
      
        dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10
      
      This command generates bios with 250 sectors, allocated in 32 4k-pages
      (last page uses only 1024 bytes).
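The figures quoted for this dd command can be checked with a few lines of arithmetic. The sketch assumes 512-byte sectors, 4096-byte pages and a page-aligned buffer; the ex_ names are hypothetical.

```c
#include <assert.h>

enum { EX_SECTOR_SIZE = 512, EX_PAGE_SIZE = 4096 };

struct ex_io_shape {
    unsigned int sectors;           /* request size in 512-byte sectors */
    unsigned int pages;             /* pages touched by an aligned buffer */
    unsigned int last_page_bytes;   /* bytes used in the final page */
};

/* Describe how a page-aligned buffer of the given size is laid out. */
static struct ex_io_shape ex_shape(unsigned int bytes)
{
    struct ex_io_shape s;

    assert(bytes > 0 && bytes % EX_SECTOR_SIZE == 0);
    s.sectors = bytes / EX_SECTOR_SIZE;
    s.pages = (bytes + EX_PAGE_SIZE - 1) / EX_PAGE_SIZE;    /* round up */
    s.last_page_bytes = bytes - (s.pages - 1) * EX_PAGE_SIZE;
    return s;
}
```

For bs=128000 this gives 250 sectors spread over 32 pages, with 1024 bytes used in the last page, matching the numbers in the commit message. One page per segment would therefore exceed a 31-segment limit by exactly one.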
      
      For the LVM layer, it allocates a bio with 31 segments (still OK for CCISS);
      unfortunately, on the lower layer it is recalculated to 32 segments, which
      violates the CCISS restriction and triggers BUG_ON().
      
      The patch tries to fix it by:
      
       * initializing attributes above in queue request constructor
         blk_queue_make_request()
      
       * make sure that blk_queue_stack_limits() inherits setting
      
       (DM uses its own function to set the limits because
       blk_queue_stack_limits() was introduced later.  It should probably switch
       to use the generic stack limit function too.)
      
       * sets the default seg_boundary value in one place (blkdev.h)
      
       * use this mask as default in DM (instead of -1, which differs in 64bit)
      
      Bugs related to this:
      https://bugzilla.redhat.com/show_bug.cgi?id=471639
      http://bugzilla.kernel.org/show_bug.cgi?id=8672
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Reviewed-by: Alasdair G Kergon <agk@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Mike Miller <mike.miller@hp.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: internal dequeue shouldn't start timer · 53a08807
      Tejun Heo committed
      blkdev_dequeue_request() and elv_dequeue_request() are equivalent and
      both start the timeout timer.  Barrier code dequeues the original
      barrier request but doesn't pass the request itself to the lower level
      driver, only broken down proxy requests; however, the original
      barrier request goes through the same dequeue path, so the timeout timer
      is started on it.  If the barrier sequence takes long enough, this timer
      expires but the low level driver has no idea about this request and
      an oops follows.
      
      Timeout timer shouldn't have been started on the original barrier
      request as it never goes through actual IO.  This patch unexports
      elv_dequeue_request(), which has no external user anyway, and makes it
      operate on elevator proper w/o adding the timer and make
      blkdev_dequeue_request() call elv_dequeue_request() and add timer.
      Internal users which don't pass the request to driver - barrier code
      and end_that_request_last() - are converted to use
      elv_dequeue_request().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: set disk->node_id before it's being used · bf91db18
      Cheng Renquan committed
      disk->node_id will be referred to when allocating in disk_expand_part_tbl,
      so we should set it before disk->node_id is referred to.
      Signed-off-by: Cheng Renquan <crquan@gmail.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • When block layer fails to map iov, it calls bio_unmap_user to undo · 53cc0b29
      Petr Vandrovec committed
      mapping.  Which is good if pages were mapped - but if they were provided
      by someone else and just copied then bad things happen - pages are
      released once here, and once by caller, leading to user triggerable BUG
      at include/linux/mm.h:246.
      Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  10. 26 Nov, 2008 2 commits
  11. 18 Nov, 2008 2 commits
    • block: hold extra reference to bio in blk_rq_map_user_iov() · c26156b2
      Jens Axboe committed
      If the size passed in is OK but we end up mapping too many segments,
      we call the unmap path directly like from IO completion. But from IO
      completion we have an extra reference to the bio, so this error case
      goes OOPS when it attempts to free an already freed bio.
      
      Fix it by getting an extra reference to the bio before calling the
      unmap failure case.
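The reference-count imbalance can be modelled with a toy counter. This is a sketch under a stated assumption: the unmap path drops two references (the one held for IO completion and the mapping's own), which is how the message above is read here. The ex_ types and functions are hypothetical and do not mirror the kernel's bio API.

```c
#include <assert.h>

struct ex_bio {
    int refcount;   /* outstanding references */
    int freed;      /* set once the last reference is dropped */
};

static void ex_bio_get(struct ex_bio *b)
{
    assert(!b->freed);
    b->refcount++;
}

static void ex_bio_put(struct ex_bio *b)
{
    /* Dropping a reference on a freed object is the bug being avoided. */
    assert(!b->freed && b->refcount > 0);
    if (--b->refcount == 0)
        b->freed = 1;
}

/* Unmap path written for IO completion: assumed to drop two references. */
static void ex_unmap(struct ex_bio *b)
{
    ex_bio_put(b);
    ex_bio_put(b);
}

/*
 * Error path taken before any IO was issued: only one reference exists,
 * so take an extra one first to balance the two puts in ex_unmap().
 */
static void ex_map_failed(struct ex_bio *b)
{
    ex_bio_get(b);
    ex_unmap(b);
}
```

Starting from a single reference, ex_map_failed() ends with the object freed exactly once. Calling ex_unmap() directly in the same state would trip the assertion in ex_bio_put() on the second drop.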
      Reported-by: Petr Vandrovec <vandrove@vc.cvut.cz>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: fix boot failure with CONFIG_DEBUG_BLOCK_EXT_DEVT=y and nash · 561ec68e
      Zhang, Yanmin committed
      We run into system boot failure with kernel 2.6.28-rc. We found it on a
      couple of machines, including T61 notebook, nehalem machine, and another
      HPC NX6325 notebook.  All the machines use FedoraCore 8 or FedoraCore 9.
      With kernel prior to 2.6.28-rc, system boot doesn't fail.
      
      I debugged it and located the root cause. Please see
      http://bugzilla.kernel.org/show_bug.cgi?id=11899
      https://bugzilla.redhat.com/show_bug.cgi?id=471517
      
      As a matter of fact, there are 2 bugs.
      
      1) root=/dev/sda1, system boot randomly fails. Mostly, it fails once in
      5 boots. nash has a bug. Some of its functions misuse return
      value 0.  Sometimes, 0 means timeout and no uevent available. Sometimes,
      0 means nash gets a uevent, but the uevent isn't block-related (for
      example, usb). If by coincidence, kernel tells nash that uevents are
      available, but kernel also sets timeout, nash might stop collecting
      other uevents in queue if current uevent isn't block-related.  I worked
      out a patch for nash to fix it.
      http://bugzilla.kernel.org/attachment.cgi?id=18858
      
      2) root=LABEL=/, system always can't boot. initrd init reports
      switchroot fails. Here is an execution branch of nash when booting:
          (1) nash read /sys/block/sda/dev; Assume major is 8 (on my desktop)
          (2) nash query /proc/devices with the major number; It found line
      	"8 sd";
          (3) nash use 'sd' to search its own probe table to find device (DISK)
      	type for the device and add it to its own list;
          (4) Later on, it probes all devices in its list to get filesystem
      	labels; scsi register "8 sd" always.
      
      When major is 259, nash fails to find the device (DISK) type. I enabled
      CONFIG_DEBUG_BLOCK_EXT_DEVT=y when compiling kernel, so 259 is picked up
      for device /dev/sda1, which causes nash to fail to find device (DISK)
      type.
      
      To fix issue 2), I created a patch for nash and another patch for
      kernel.
      
      http://bugzilla.kernel.org/attachment.cgi?id=18859
      http://bugzilla.kernel.org/attachment.cgi?id=18837
      
      Below is the patch for kernel 2.6.28-rc4. It registers blkext, a new
      block device in /proc/devices.
      
      With 2 patches on nash and 1 patch on kernel, I booted my machines
      dozens of times without failure.
      
      Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>