1. 09 Feb, 2008 9 commits
    • aoe: only install new AoE device once · 6b9699bb
      Ed L. Cashin committed
      An aoe driver user who had about 70 AoE targets found that he was hitting a
      BUG in sysfs_create_file because the aoe driver was trying to tell the kernel
      about an AoE device more than once.  Each AoE device was reachable by several
      local network interfaces, and multiple ATA device identify responses were
      returning from that single device.
      
      This patch eliminates a race condition so that aoe always informs the block
      layer of a new AoE device once in the presence of multiple incoming ATA device
      identify responses.
      Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
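The commit message above describes a guard that lets only one of several identify responses install the device. A minimal userspace sketch of that idea follows, assuming a per-device flag that is tested and set atomically. The struct and function names are hypothetical, and this is not the driver's actual code (the commit message does not say how the race was closed).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-device state; not the real struct aoedev. */
struct aoe_dev_sketch {
    atomic_flag announced;   /* set once the block layer has been told */
    int installs;            /* counts how often the device was "installed" */
};

/*
 * Called for every incoming ATA identify response. Only the first caller
 * wins the test-and-set and performs the install; later responses, even
 * ones arriving concurrently on other interfaces, return false.
 */
bool on_identify_response(struct aoe_dev_sketch *d)
{
    if (atomic_flag_test_and_set(&d->announced))
        return false;        /* someone already announced this device */
    d->installs++;           /* stands in for registering the disk */
    return true;
}
```

With this shape, a second identify response for the same device becomes a no-op instead of a second registration attempt.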
    • aoe: dynamically allocate a capped number of skbs when necessary · 9bb237b6
      Ed L. Cashin committed
      What this Patch Does
      
        Even before this recent series of 12 patches to 2.6.22-rc4, the aoe
        driver was reusing a small set of skbs that were allocated once and
        were only used for outbound AoE commands.
      
        The network layer cannot be allowed to put_page on the data that is
        still associated with a bio we haven't returned to the block layer,
        so the aoe driver (even before the patch under discussion) is still
        the owner of skbs that have been handed to the network layer for
        transmission.  We need to keep track of these skbs so that we can
        free them, but by tracking them, we can also easily re-use them.
      
        The new patch was a response to the behavior of certain network
        drivers.  We cannot reuse an skb that the network driver still has
        in its transmit ring.  Network drivers can defer transmit ring
        cleanup and then use the state in the skb to determine how many data
        segments to clean up in its transmit ring.  The tg3 driver is one
        driver that behaves in this way.
      
        When the network driver defers cleanup of its transmit ring, the aoe
        driver can find itself in a situation where it would like to send an
        AoE command, and the AoE target is ready for more work, but the
        network driver still has all of the pre-allocated skbs.  In that
        case, the new patch just calls alloc_skb, as you'd expect.
      
        We don't want to get carried away, though.  We try not to do
        excessive allocation in the write path, so we cap the number of skbs
        we dynamically allocate.
      
        Probably calling it a "dynamic pool" is misleading.  We were already
        trying to use a small fixed-size set of pre-allocated skbs before
        this patch, and this patch just provides a little headroom (with a
        ceiling, though) to accommodate network drivers that hang onto skbs,
        by allocating when needed.  The d->skbpool_hd list of allocated skbs
        is necessary so that we can free them later.
      
        We didn't notice the need for this headroom until AoE targets got
        fast enough.
      
      Alternatives
      
        If the network layer never did a put_page on the pages in the bios
        we get from the block layer, then it would be possible for us to
        hand skbs to the network layer and forget about them, allowing the
        network layer to free skbs itself (and thereby calling our own
        skb->destructor callback function if we needed that).  In that case
        we could get rid of the pre-allocated skbs and also the
        d->skbpool_hd, instead just calling alloc_skb every time we wanted
        to transmit a packet.  The slab allocator would effectively maintain
        the list of skbs.
      
        Besides a loss of CPU cache locality, the main concern with that
        approach is the danger that it would increase the likelihood of
        deadlock when VM is trying to free pages by writing dirty data from
        the page cache through the aoe driver out to persistent storage on
        an AoE device.  Right now we have a situation where we have
        pre-allocation that corresponds to how much we use, which seems
        ideal.
      
        Of course, there's still the separate issue of receiving the packets
        that tell us that a write has successfully completed on the AoE
        target.  When memory is low and VM is using AoE to flush dirty data
        to free up pages, it would be perfect if there were a way for us to
        register a fast callback that could recognize write command
        completion responses.  But I don't think the current problems with
        the receive side of the situation are a justification for
        exacerbating the problem on the transmit side.
      Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
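The "reuse what is idle, allocate only up to a ceiling, keep everything on a list so it can be freed" policy described above can be sketched in userspace C. This is an illustration under assumptions: plain heap buffers stand in for skbs, the `in_flight` flag stands in for "the network driver still holds it", and all names are hypothetical rather than taken from the aoe driver.

```c
#include <stdlib.h>

/* Hypothetical buffer standing in for an skb. */
struct buf_sketch {
    int in_flight;               /* still owned by the "network driver" */
    struct buf_sketch *next;     /* link on the pool list so it can be freed */
};

struct pool_sketch {
    struct buf_sketch *head;     /* every buffer ever allocated */
    int nalloc;                  /* how many buffers are on the list */
    int cap;                     /* ceiling on total allocations */
};

/* Reuse an idle buffer if one exists; otherwise allocate, up to the cap. */
struct buf_sketch *pool_get(struct pool_sketch *p)
{
    struct buf_sketch *b;

    for (b = p->head; b; b = b->next) {
        if (!b->in_flight) {
            b->in_flight = 1;
            return b;
        }
    }
    if (p->nalloc >= p->cap)
        return NULL;             /* at the ceiling: caller must wait */
    b = calloc(1, sizeof(*b));
    if (!b)
        return NULL;
    b->in_flight = 1;
    b->next = p->head;
    p->head = b;
    p->nalloc++;
    return b;
}

/* The "network driver" is done with the buffer; it may be reused. */
void pool_put(struct buf_sketch *b)
{
    b->in_flight = 0;
}

/* Tracking every buffer on the list is what allows freeing them all later. */
void pool_destroy(struct pool_sketch *p)
{
    while (p->head) {
        struct buf_sketch *b = p->head;
        p->head = b->next;
        free(b);
    }
    p->nalloc = 0;
}
```

Returning NULL at the ceiling, rather than allocating without bound, mirrors the stated goal of avoiding excessive allocation in the write path.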
    • aoe: user can ask driver to forget previously detected devices · 262bf541
      Ed L. Cashin committed
      When an AoE device is detected, the kernel is informed, and a new block device
      is created.  If the device is unused, the block device corresponding to a remote
      device that is no longer available may be removed from the system by telling
      the aoe driver to "flush" its list of devices.
      
      Without this patch, software like GPFS and LVM may attempt to read from AoE
      devices that were discovered earlier but are no longer present, blocking until
      the I/O attempt times out.
      Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • aoe: eliminate goto and improve readability · cf446f0d
      Ed L. Cashin committed
      Adam Richter suggested eliminating this goto.
      Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • aoe: mac_addr: avoid 64-bit arch compiler warnings · 1eb0da4c
      Ed L. Cashin committed
      By returning unsigned long long, mac_addr does not generate compiler warnings
      on 64-bit architectures.
      Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
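A likely reading of the message above is that a fixed-width 64-bit type is `unsigned long` on some 64-bit architectures, so passing it to a `%llx` format draws a warning, while `unsigned long long` always matches. The sketch below illustrates that under assumptions: the byte-shifting conversion and both function names are hypothetical and are not the driver's `mac_addr` implementation.

```c
#include <stdio.h>

/* Pack a 6-byte MAC address into an unsigned long long, most significant
 * byte first. The return type matches %llx on every architecture. */
unsigned long long mac_addr_sketch(const unsigned char addr[6])
{
    unsigned long long n = 0;
    int i;

    for (i = 0; i < 6; i++)
        n = (n << 8) | addr[i];
    return n;
}

/* Format the address as 12 hex digits; no cast is needed for %llx. */
int format_mac_sketch(char *out, size_t len, const unsigned char addr[6])
{
    return snprintf(out, len, "%012llx", mac_addr_sketch(addr));
}
```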
    • aoe: handle multiple network paths to AoE device · 68e0d42f
      Ed L. Cashin committed
      A remote AoE device is something that can process ATA commands and is identified by
      an AoE shelf number and an AoE slot number.  Such a device might have more
      than one network interface, and it might be reachable by more than one local
      network interface.  This patch tracks the network paths available to
      each AoE device, allowing them to be used more efficiently.
      
      Andrew Morton asked about the call to msleep_interruptible in the revalidate
      function.  Yes, if a signal is pending, then msleep_interruptible will not
      return 0.  That means we will not loop but will call aoenet_xmit with a NULL
      skb, which is a noop.  If the system is too low on memory or the aoe driver is
      too low on frames, then the user can hit control-C to interrupt the attempt to
      do a revalidate.  I have added a comment to the code summarizing that.
      
      Andrew Morton asked whether the allocation performed inside addtgt could use a
      more relaxed allocation like GFP_KERNEL, but addtgt is called when the aoedev
      lock has been locked with spin_lock_irqsave.  It would be nice to allocate the
      memory under fewer restrictions, but targets are only added when the device is
      being discovered, and if the target can't be added right now, we can try again
      in a minute when the next AoE config query broadcast goes out.
      
      Andrew Morton pointed out that the "too many targets" message could be printed
      for failing GFP_ATOMIC allocations.  The last patch in this series makes the
      messages more specific.
      Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • aoe: bring driver version number to 47 · 8911ef4d
      Ed L. Cashin committed
      Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rd: support XIP · 75acb9cd
      Nick Piggin committed
      Support direct_access XIP method with brd.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rewrite rd · 9db5579b
      Nick Piggin committed
      This is a rewrite of the ramdisk block device driver.
      
      The old one is really difficult because it effectively implements a block
      device which serves data out of its own buffer cache.  It relies on the dirty
      bit being set to pin its backing store in cache; however, there are
      non-trivial paths which can clear the dirty bit (e.g. try_to_free_buffers()),
      which had recently led to data corruption.  And in general it is completely
      wrong for a block device driver to do this.
      
      The new one is more like a regular block device driver.  It has no idea about
      vm/vfs stuff.  Its backing store is similar to the buffer cache (a simple
      radix-tree of pages), but it doesn't know anything about page cache (the pages
      in the radix tree are not pagecache pages).
      
      There is one slight downside -- direct block device access and filesystem
      metadata access go through an extra copy and get stored in RAM twice.
      However, this downside is only slight, because the real buffercache of the
      device is now reclaimable (because we're not playing crazy games with it), so
      under memory intensive situations, footprint should effectively be the same --
      maybe even a slight advantage to the new driver because it can also reclaim
      buffer heads.
      
      The fact that it now goes through all the regular vm/fs paths makes it
      much more useful for testing, too.
      
         text    data     bss     dec     hex filename
         2837     849     384    4070     fe6 drivers/block/rd.o
         3528     371      12    3911     f47 drivers/block/brd.o
      
      Text is larger, but data and bss are smaller, making total size smaller.
      
      A few other nice things about it:
      - Similar structure and layout to the new loop device handling.
      - Dynamic ramdisk creation.
      - Runtime flexible buffer head size (because it is no longer part of the
        ramdisk code).
      - Boot / load time flexible ramdisk size, which could easily be extended
        to a per-ramdisk runtime changeable size (eg. with an ioctl).
      - Can use highmem for the backing store.
      
      [akpm@linux-foundation.org: fix build]
      [byron.bbradley@gmail.com: make rd_size non-static]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Byron Bradley <byron.bbradley@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
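The commit message above describes a backing store that is just a set of pages looked up by position. A small userspace sketch of that idea follows, under stated assumptions: a fixed array of lazily allocated pages replaces the radix tree, pages are assumed to be 4 KiB, sectors are 512 bytes, and every name is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define SECTOR_SHIFT_SK 9                 /* 512-byte sectors */
#define PAGE_SHIFT_SK   12                /* assumed 4 KiB pages */
#define PAGE_SECTORS_SK (1u << (PAGE_SHIFT_SK - SECTOR_SHIFT_SK))
#define NPAGES_SK       16                /* tiny disk for illustration */

/* Sparse store: a page exists only after something was written to it. */
struct ramdisk_sketch {
    unsigned char *pages[NPAGES_SK];
};

/* Copy one 512-byte sector into the page that backs it. */
int rd_write_sector(struct ramdisk_sketch *rd, unsigned long sector,
                    const unsigned char *src)
{
    unsigned long idx = sector >> (PAGE_SHIFT_SK - SECTOR_SHIFT_SK);
    unsigned long off = (sector & (PAGE_SECTORS_SK - 1)) << SECTOR_SHIFT_SK;

    if (idx >= NPAGES_SK)
        return -1;                        /* beyond the end of the disk */
    if (!rd->pages[idx]) {
        rd->pages[idx] = calloc(1, 1u << PAGE_SHIFT_SK);
        if (!rd->pages[idx])
            return -1;
    }
    memcpy(rd->pages[idx] + off, src, 1u << SECTOR_SHIFT_SK);
    return 0;
}

/* Read one sector; a page that was never written reads back as zeroes. */
int rd_read_sector(struct ramdisk_sketch *rd, unsigned long sector,
                   unsigned char *dst)
{
    unsigned long idx = sector >> (PAGE_SHIFT_SK - SECTOR_SHIFT_SK);
    unsigned long off = (sector & (PAGE_SECTORS_SK - 1)) << SECTOR_SHIFT_SK;

    if (idx >= NPAGES_SK)
        return -1;
    if (rd->pages[idx])
        memcpy(dst, rd->pages[idx] + off, 1u << SECTOR_SHIFT_SK);
    else
        memset(dst, 0, 1u << SECTOR_SHIFT_SK);
    return 0;
}
```

Because the store knows nothing about the page cache, every access is a plain copy in or out, which is the extra copy the message calls a slight downside.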
  2. 07 Feb, 2008 7 commits
  3. 04 Feb, 2008 7 commits
    • virtio_blk: implement naming for vda-vdz,vdaa-vdzz,vdaaa-vdzzz · d50ed907
      Christian Borntraeger committed
      On Friday, 1 February 2008, Christian Borntraeger wrote:
      > Right. I will fix that with an additional patch.
      
      This patch goes on top of the minor number patch. Please let me know if
      you want a merged patch:
      
      Currently virtio_blk creates the disk name by combining "vd" with 'a'++.
      This will give strange names after vdz. I have implemented names up to
      vdzzz - inspired by the sd.c code. That should be sufficient for now.
      
      There is one driver in the kernel (drivers/s390/block/dasd_genhd.c) that
      implements names from dasda-dasdzzzz allowing even more disks. Maybe
      a janitor can come up with a common implementation usable for all kinds
      of block device drivers.
      
      I have tested this patch with 100 disks - seems to work.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
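The naming scheme above (vda to vdz, then vdaa to vdzz, then vdaaa to vdzzz) is a base-26 numbering with no zero digit. The sketch below shows one way to generate such names. It is an illustration only: the function name is hypothetical and the loop is not the code from the patch or from sd.c.

```c
#include <stddef.h>

/*
 * Write the disk name for a zero-based index into buf.
 * Returns 0 on success, -1 if the index is past vdzzz or buf is too small.
 */
int vd_name_sketch(unsigned int index, char *buf, size_t len)
{
    char rev[3];                     /* suffix letters, last letter first */
    int n = 0, i;
    size_t pos = 0;

    for (;;) {
        if (n == 3)
            return -1;               /* would need a fourth letter */
        rev[n++] = (char)('a' + index % 26);
        index /= 26;
        if (index == 0)
            break;
        index--;                     /* no "zero" digit: 26 maps to "aa" */
    }
    if (len < (size_t)n + 3)         /* "vd" + letters + terminator */
        return -1;
    buf[pos++] = 'v';
    buf[pos++] = 'd';
    for (i = n - 1; i >= 0; i--)
        buf[pos++] = rev[i];
    buf[pos] = '\0';
    return 0;
}
```

Three letters give 26 + 676 + 17576 = 18278 distinct names, so index 18277 is the last one that fits.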
    • virtio_blk: Dont waste major numbers · 4f3bf19c
      Christian Borntraeger committed
      Rusty,
      
      currently virtio_blk uses one major number per device. While this works
      quite well on most systems it is wasteful and will exhaust major numbers
      on larger installations.
      
      This patch allocates a major number on init and will use 16 minor numbers
      for each disk. That will allow ~64k virtio_blk disks.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
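The "~64k disks" figure above follows from the minor number space: Linux device numbers carry 20 minor bits, and reserving 16 minors per disk leaves 2^20 / 16 = 65536 disks under a single major. A short sketch of the arithmetic, with hypothetical helper names:

```c
#define MINORBITS_SK 20   /* minor bits in a Linux dev_t */
#define PART_BITS_SK 4    /* 2^4 = 16 minors reserved per disk */

/* First minor number belonging to the given disk. */
unsigned int disk_first_minor(unsigned int disk_index)
{
    return disk_index << PART_BITS_SK;
}

/* Which disk a minor number belongs to. */
unsigned int minor_to_disk(unsigned int minor)
{
    return minor >> PART_BITS_SK;
}

/* Position within the disk's block of 16 minors (0 is the whole disk). */
unsigned int minor_to_partition(unsigned int minor)
{
    return minor & ((1u << PART_BITS_SK) - 1);
}

/* Number of disks that fit under one major number. */
unsigned int max_disks(void)
{
    return 1u << (MINORBITS_SK - PART_BITS_SK);
}
```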
    • virtio_blk: provide getgeo · 135da0b0
      Christian Borntraeger committed
      Rusty,
      
      I am currently trying to make my guest boot from a virtio root device
      without having an external kernel. Some of the tools that I tried
      expect HDIO_GETGEO to work. The most interesting value is likely
      the geo.start value to get the offset of a partition. This value
      is filled by block/ioctl.c if fops->getgeo is set. This patch also
      fills in some standard values for heads, sectors and cylinders.
      
      Makes sense?
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
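A rough sketch of what filling in a made-up geometry can look like is given below. Everything in it is an assumption for illustration: the struct only loosely mirrors the fields mentioned above, the 64 heads and 32 sectors per track are example values rather than values taken from the patch, and the `part_start` parameter stands in for the partition offset that the message says block/ioctl.c supplies.

```c
/* Hypothetical geometry record; field names follow the message above. */
struct geo_sketch {
    unsigned char heads;
    unsigned char sectors;
    unsigned int cylinders;
    unsigned long start;     /* partition offset in sectors */
};

/*
 * Derive a plausible geometry from the capacity in 512-byte sectors.
 * Heads and sectors per track are fixed example values; cylinders are
 * whatever makes heads * sectors * cylinders cover the capacity.
 */
void fill_geo_sketch(struct geo_sketch *geo,
                     unsigned long long capacity_sectors,
                     unsigned long part_start)
{
    geo->heads = 64;
    geo->sectors = 32;
    geo->cylinders = (unsigned int)(capacity_sectors / (64 * 32));
    geo->start = part_start;
}
```

For a 1 GiB disk (2097152 sectors) these example values yield 1024 cylinders.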
    • virtio: Put the virtio under the virtualization menu · 0ad07ec1
      Anthony Liguori committed
      This patch moves virtio under the virtualization menu and changes virtio
      devices to not claim to only be for lguest.
      Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • virtio: reset function · 6e5aa7ef
      Rusty Russell committed
      A reset function solves three problems:
      
      1) It allows us to renegotiate features, eg. if we want to upgrade a
         guest driver without rebooting the guest.
      
      2) It gives us a clean way of shutting down virtqueues: after a reset,
         we know that the buffers won't be used by the host, and
      
      3) It helps the guest recover from messed-up drivers.
      
      So we remove the ->shutdown hook, and the only way we now remove
      feature bits is via reset.
      
      We leave it to the driver to do the reset before it deletes queues:
      the balloon driver, for example, needs to chat to the host in its
      remove function.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • virtio: explicit enable_cb/disable_cb rather than callback return. · 18445c4d
      Rusty Russell committed
      It seems that virtio_net wants to disable callbacks (interrupts) before
      calling netif_rx_schedule(), so we can't use the return value to do so.
      
      Rename "restart" to "cb_enable" and introduce "cb_disable" hook: callback
      now returns void, rather than a boolean.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • virtio: simplify config mechanism. · a586d4f6
      Rusty Russell committed
      Previously we used a type/len pair within the config space, but this
      seems overkill.  We now simply define a structure which represents the
      layout in the config space: the config space can now only be extended
      at the end.
      
      The main driver-visible changes:
      1) We indicate what fields are present with an explicit feature bit.
      2) Virtqueues are explicitly numbered, and not in the config space.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
  4. 03 Feb, 2008 1 commit
  5. 02 Feb, 2008 1 commit
  6. 01 Feb, 2008 1 commit
  7. 30 Jan, 2008 2 commits
    • cciss: fix bug in overriding ->data_len before completion · e7d9dc9c
      Jens Axboe committed
      For BLOCK_PC requests, we need that length for completing the request.
      Andrew Vasquez <andrew.vasquez@qlogic.com> reported the following
      oops:
      
      Hitting a consistent BUG() with recent Linus' linux-2.6.git:
      
      	[   12.941428] ------------[ cut here ]------------
      	[   12.944874] kernel BUG at drivers/block/cciss.c:1260!
      	[   12.944874] invalid opcode: 0000 [1] SMP
      	[   12.944874] CPU 0
      	[   12.944874] Modules linked in:
      	[   12.944874] Pid: 0, comm: swapper Not tainted 2.6.24 #43
      	[   12.944874] RIP: 0010:[<ffffffff8039e43d>]  [<ffffffff8039e43d>] cciss_softirq_done+0xbc/0x1bf
      	[   12.944874] RSP: 0018:ffffffff8063aed0  EFLAGS: 00010202
      	[   12.944874] RAX: 0000000000000001 RBX: ffff8100cf800010 RCX: ffff81042f1253b0
      	[   12.944874] RDX: ffff81042de398f0 RSI: ffff81042de398f0 RDI: 0000000000000001
      	[   12.944874] RBP: ffff81042daa0000 R08: ffff81042f1253b0 R09: 0000000000000001
      	[   12.944874] R10: 00000000000000fe R11: 0000000000000000 R12: 0000000000000002
      	[   12.944874] R13: 0000000000000001 R14: ffff8100cf800000 R15: ffff81042de398f0
      	[   12.944874] FS:  0000000000000000(0000) GS:ffffffff805bb000(0000) knlGS:0000000000000000
      	[   12.944874] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      	[   12.944874] CR2: 00002afed7eea340 CR3: 000000042dbba000 CR4: 00000000000006e0
      	[   12.944874] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	[   12.944874] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      	[   12.944874] Process swapper (pid: 0, threadinfo ffffffff805f4000, task ffffffff805624a0)
      	[   12.944874] Stack:  0000000000000000 ffffffff8063af10 0000000000000001 ffffffff80632d60
      	[   12.944874]  0000000000000000 000000000000000a ffffffff805bb900 ffffffff8032038f
      	[   12.944874]  ffffffff8063af10 ffffffff8063af10 ffffffff805bb940 ffffffff802346b4
      	[   12.944874] Call Trace:
      	[   12.944874]  <IRQ>  [<ffffffff8032038f>] blk_done_softirq+0x69/0x78
      	[   12.944874]  [<ffffffff802346b4>] __do_softirq+0x6f/0xd8
      	[   12.944874]  [<ffffffff8020c45c>] call_softirq+0x1c/0x30
      	[   12.944874]  [<ffffffff8020e347>] do_softirq+0x30/0x80
      	[   12.944874]  [<ffffffff8020e409>] do_IRQ+0x72/0xd9
      	[   12.944874]  [<ffffffff8020a50a>] mwait_idle+0x0/0x46
      	[   12.944874]  [<ffffffff8020a3da>] default_idle+0x0/0x3d
      	[   12.944874]  [<ffffffff8020b7e1>] ret_from_intr+0x0/0xa
      	[   12.944874]  <EOI>  [<ffffffff8020a54c>] mwait_idle+0x42/0x46
      	[   12.944874]  [<ffffffff8020a481>] cpu_idle+0x6a/0xae
      	[   12.944874]
      	[   12.944874]
      	[   12.944874] Code: 0f 0b eb fe 48 8d 85 d8 c0 00 00 48 89 04 24 48 89 c7 e8 e5
      	[   12.944874] RIP  [<ffffffff8039e43d>] cciss_softirq_done+0xbc/0x1bf
      	[   12.944874]  RSP <ffffffff8063aed0>
      	[   12.944903] ---[ end trace e9c631603f90d22f ]---
      
      which is caused by blk_end_request() returning 'not done' for a request,
      since it gets asked to complete zero bytes.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
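The failure mode described above, a completion call that is asked to finish zero bytes and therefore reports "not done", can be modelled in a few lines. This is a simplified userspace model under assumptions, not the cciss or block layer code: `remaining` stands in for the block layer's own accounting, `data_len` for the field the driver overwrites with a residual count, and all names are hypothetical.

```c
/* Hypothetical request: two views of the same transfer length. */
struct request_sketch {
    unsigned int remaining;  /* bytes the block layer still expects */
    unsigned int data_len;   /* length field the driver may overwrite */
};

/* Complete nr_bytes; returns 1 ("not done") while bytes remain, else 0. */
int end_request_sketch(struct request_sketch *rq, unsigned int nr_bytes)
{
    if (nr_bytes > rq->remaining)
        nr_bytes = rq->remaining;
    rq->remaining -= nr_bytes;
    return rq->remaining != 0;
}

/* Buggy order: the length is replaced by the residual before it is used. */
int complete_buggy(struct request_sketch *rq, unsigned int residual)
{
    rq->data_len = residual;
    return end_request_sketch(rq, rq->data_len);
}

/* Fixed order: remember the length, then store the residual. */
int complete_fixed(struct request_sketch *rq, unsigned int residual)
{
    unsigned int len = rq->data_len;

    rq->data_len = residual;
    return end_request_sketch(rq, len);
}
```

With a residual of zero, the buggy order completes zero bytes and leaves the request unfinished, which is the condition the driver's BUG() check trips on.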
    • xsysace: end request handling fix · 9bf72259
      Jens Axboe committed
      In ace_fsm_dostate(), the variable 'i' was used only for passing the
      sector size of the request to end_that_request_first().
      So I removed it and changed the code to pass the size in bytes
      directly to __blk_end_request().
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  8. 28 Jan, 2008 12 commits