1. 14 Jan, 2016 1 commit
    • A
      null_blk: use sector_div instead of do_div · e93d12ae
      Arnd Bergmann authored
      Dividing a sector_t number should be done using sector_div rather than do_div
      to optimize the 32-bit sector_t case, and with the latest do_div optimizations,
      we now get a compile-time warning for this:
      
      arch/arm/include/asm/div64.h:32:95: note: expected 'uint64_t * {aka long long unsigned int *}' but argument is of type 'sector_t * {aka long unsigned int *}'
      drivers/block/null_blk.c:521:81: warning: comparison of distinct pointer types lacks a cast
      
      This changes the newly added code to use sector_div. It is a simplified version
      of the original patch, as Linus Torvalds pointed out that we should not be using
      an expensive division function in the first place.
      
      This version was suggested by Matias Bjorling.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Matias Bjorling <m@bjorling.me>
      Fixes: b2b7e001 ("null_blk: register as a LightNVM device")
      Signed-off-by: Jens Axboe <axboe@fb.com>
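The calling convention that do_div and sector_div share (divide in place, return the remainder) can be sketched in userspace as follows. This is only an illustration: `sector_example_t` and `sector_div_example` are hypothetical names, the 64-bit width is an assumption, and the real kernel macros pick their implementation according to the width of sector_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's sector_t; assumed 64 bits wide here. */
typedef uint64_t sector_example_t;

/*
 * Divide *sectors by divisor in place and return the remainder.
 * This mirrors the "quotient in place, remainder returned" convention,
 * not the kernel's actual implementation.
 */
static uint32_t sector_div_example(sector_example_t *sectors, uint32_t divisor)
{
    assert(divisor != 0);
    uint32_t remainder = (uint32_t)(*sectors % divisor);
    *sectors /= divisor;
    return remainder;
}
```

A caller passes a pointer to the sector count, reads the quotient back from the same variable, and uses the return value as the remainder.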
  2. 13 Jan, 2016 1 commit
    • J
      Merge branch 'stable/for-jens-4.5' of... · 038a75af
      Jens Axboe authored
      Merge branch 'stable/for-jens-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-4.5/drivers
      
      Konrad writes:
      
      The pull is based on converting the backend driver into a multiqueue
      driver and exposing more than one queue to the frontend. As such we had
      to modify the frontend and also fix a bunch of bugs around this.
      
      The original work is based on Arianna Avanzini's work as an OPW intern.
      Bob took over the work and had been massaging it for quite some time.
      
      Also included are features for 64KB page support on ARM and various
      bug fixes.
  3. 09 Jan, 2016 1 commit
  4. 05 Jan, 2016 19 commits
    • K
      xen/blkfront: Fix crash if backend doesn't follow the right states. · c31ecf6c
      Konrad Rzeszutek Wilk authored
      We have split the setting up of all the resources in two steps:
      1) talk_to_blkback  - which figures out the num_ring_pages (from
         the default value of zero), sets up shadow and so on
      2) blkfront_connect - does the real part of filling out the
         internal structures.

      The problem is if we bypass the 1) step and go straight to 2)
      and call blkfront_setup_indirect where we use the macro
      BLK_RING_SIZE - which returns a negative value (because
      sz is zero  - since num_ring_pages is zero - since it has never
      been set).
      
      We can fix this by making sure that we always have called
      talk_to_blkback before going to blkfront_connect.
      
      Or we could set in blkfront_probe info->nr_ring_pages = 1
      to have a default value. But that looks odd - as we haven't
      actually negotiated any ring size.
      
      This patch changes XenbusStateConnected state to detect if
      we haven't done the initial handshake - and if so continue
      on as if we were in XenbusStateInitWait state.
      
      We also roll the error recovery (freeing the structure) into
      talk_to_blkback error path - which is safe since that function
      is only called from blkback_changed.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
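The ordering guarantee described in this commit can be modelled with a small state handler. The sketch below is a simplified userspace model, not the driver code: every name ending in `_example` is hypothetical, and the ring size of one page is an assumed placeholder for whatever the real handshake would negotiate.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the backend states the frontend reacts to. */
enum backend_state_example { BACKEND_INIT_WAIT, BACKEND_CONNECTED };

struct frontend_example {
    bool negotiated;            /* true once the handshake step has run */
    unsigned int nr_ring_pages; /* stays zero until negotiated */
    bool connected;
};

/* Step 1: negotiate the ring size (assumed to be one page in this model). */
static int talk_to_backend_example(struct frontend_example *fe)
{
    fe->nr_ring_pages = 1;
    fe->negotiated = true;
    return 0;
}

/* Step 2: fill in internal structures; relies on a non-zero ring size. */
static void connect_example(struct frontend_example *fe)
{
    assert(fe->nr_ring_pages > 0);
    fe->connected = true;
}

static void backend_changed_example(struct frontend_example *fe,
                                    enum backend_state_example state)
{
    switch (state) {
    case BACKEND_INIT_WAIT:
        if (!fe->negotiated)
            talk_to_backend_example(fe);
        break;
    case BACKEND_CONNECTED:
        /* If the handshake was skipped, run it before connecting. */
        if (!fe->negotiated && talk_to_backend_example(fe) != 0)
            break;
        connect_example(fe);
        break;
    }
}
```

With this guard, a backend that jumps straight to the connected state still ends up with a negotiated ring size before step 2 runs.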
    • B
      xen/blkback: Fix two memory leaks. · 93bb277f
      Bob Liu authored
      This patch fixes two memory leaks:
        backtrace:
          [<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50
          [<ffffffff81205e3b>] kmem_cache_alloc+0xbb/0x1d0
          [<ffffffff81534028>] xen_blkbk_probe+0x58/0x230
          [<ffffffff8146adb6>] xenbus_dev_probe+0x76/0x130
          [<ffffffff81511716>] driver_probe_device+0x166/0x2c0
          [<ffffffff815119bc>] __device_attach_driver+0xac/0xb0
          [<ffffffff8150fa57>] bus_for_each_drv+0x67/0x90
          [<ffffffff81511ab7>] __device_attach+0xc7/0x120
          [<ffffffff81511b23>] device_initial_probe+0x13/0x20
          [<ffffffff8151059a>] bus_probe_device+0x9a/0xb0
          [<ffffffff8150f0a1>] device_add+0x3b1/0x5c0
          [<ffffffff8150f47e>] device_register+0x1e/0x30
          [<ffffffff8146a9e8>] xenbus_probe_node+0x158/0x170
          [<ffffffff8146abaf>] xenbus_dev_changed+0x1af/0x1c0
          [<ffffffff8146b1bb>] backend_changed+0x1b/0x20
          [<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160
      unreferenced object 0xffff880007ba8ef8 (size 224):
      
        backtrace:
          [<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50
          [<ffffffff81205c73>] __kmalloc+0xd3/0x1e0
          [<ffffffff81534d87>] frontend_changed+0x2c7/0x580
          [<ffffffff8146af12>] xenbus_otherend_changed+0xa2/0xb0
          [<ffffffff8146b2c0>] frontend_changed+0x10/0x20
          [<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160
          [<ffffffff810d3e97>] kthread+0xd7/0xf0
          [<ffffffff817c4a9f>] ret_from_fork+0x3f/0x70
          [<ffffffffffffffff>] 0xffffffffffffffff
      unreferenced object 0xffff8800048dcd38 (size 224):
      
      The first leak is caused by not calling put() on the be->blkif reference
      which we had gotten in xen_blkif_alloc(), while the second is
      us not freeing blkif->rings in the right place.
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • B
      xen/blkback: make st_ statistics per ring · db6fbc10
      Bob Liu authored
      Make st_* statistics per ring and the VBD sysfs would iterate over all the
      rings.
      
      Note: xenvbd_sysfs_delif() is called in xen_blkbk_remove() before all rings
      are torn down, so it's safe.
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      ---
      v2: Aligned the variables on the same column.
    • J
      xen/blkfront: Handle non-indirect grant with 64KB pages · 6cc56833
      Julien Grall authored
      The minimal size of a request in the block framework is always PAGE_SIZE.
      This means that when a 64KB guest is supported, a request will be at least
      64KB.

      However, if the backend doesn't support indirect descriptors (such as QDISK
      in QEMU), a ring request is only able to accommodate 11 segments of 4KB
      (i.e. 44KB).
      
      The current frontend is assuming that an I/O request will always fit in
      a ring request. This is not true any more when using 64KB page
      granularity and will therefore crash during boot.
      
      On ARM64, the ABI is completely neutral to the page granularity used by
      the domU. The guest has the choice between different page granularity
      supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB).
      This can't be enforced by the hypervisor and therefore it's possible to
      run guests using different page granularity.
      
      So we can't mandate the block backend to support indirect descriptor
      when the frontend is using 64KB page granularity and have to fix it
      properly in the frontend.
      
      The solution described below is based on directly modifying the frontend
      guest rather than asking the block framework to support smaller sizes
      (i.e. < PAGE_SIZE). This is because the changes in the block framework are
      not trivial, as everything seems to rely on a struct *page (see [1]).
      It may be possible that someone succeeds in doing this in the future,
      and we would then be able to use it.
      
      Given that a block request may not fit in a single ring request, a
      second request is introduced for the data that cannot fit in the first
      one. This means that the second ring request should never be used on
      Linux if the page size is smaller than 44KB.
      
      To achieve the support of the extra ring request, the block queue size
      is divided by two. Therefore, the ring will always contain enough space
      to accommodate 2 ring requests. While this will reduce the overall
      performance, it will make the implementation more contained. The way
      forward to get better performance is to implement in the backend either
      indirect descriptor or multiple grants ring.
      
      Note that the parameters of the blk_queue_max_* helpers haven't been updated.
      The block code will set the minimum size supported and we may be able
      to directly support any change in the block framework that lowers
      the minimal size of a request.

      [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
      Signed-off-by: Julien Grall <julien.grall@citrix.com>
      Acked-by: Roger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
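The arithmetic behind the second ring request can be checked with a short sketch. It assumes only the figures quoted in the message (4KB grants, 11 segments per ring request without indirect descriptors); the macro and function names are hypothetical and do not come from the driver.

```c
#include <assert.h>
#include <stddef.h>

#define GRANT_SIZE_EXAMPLE        4096u /* 4KB grant, as quoted in the message */
#define SEGS_PER_RING_REQ_EXAMPLE 11u   /* segments per ring request, as quoted */

/* Number of ring requests needed to carry one block request of this size. */
static size_t ring_requests_needed_example(size_t request_bytes)
{
    assert(request_bytes > 0);
    /* Round up to whole grants, then to whole ring requests. */
    size_t grants = (request_bytes + GRANT_SIZE_EXAMPLE - 1) / GRANT_SIZE_EXAMPLE;
    return (grants + SEGS_PER_RING_REQ_EXAMPLE - 1) / SEGS_PER_RING_REQ_EXAMPLE;
}
```

A 64KB request needs 16 grants, which is more than 11, so it spills into a second ring request; anything up to 44KB fits in one.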
    • J
      xen-blkfront: Introduce blkif_ring_get_request · 2e073969
      Julien Grall authored
      The code to get a request is always the same. Therefore we can factor
      it out into a single function.
      Signed-off-by: Julien Grall <julien.grall@citrix.com>
      Acked-by: Roger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • J
      xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule() · a6e7af12
      Jiri Kosina authored
      xen_blkif_schedule() kthread calls try_to_freeze() at the beginning of
      every attempt to purge the LRU. This operation can't ever succeed though,
      as the kthread hasn't marked itself as freezable.
      
      Before (hopefully eventually) kthread freezing gets converted to filesystem
      freezing, we'd rather mark xen_blkif_schedule() freezable (as it can
      generate I/O during suspend).
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • K
      xen/blkback: Free resources if connect_ring failed. · 2d0382fa
      Konrad Rzeszutek Wilk authored
      With the multi-queue support we could fail at setting up
      some of the rings and fail the connection. That meant that
      all resources tied to rings[0..n-1] (where n is the ring
      that failed to be set up) were left allocated. Eventually the
      frontend will switch states and we will call xen_blkif_disconnect.

      However we do not want to be at the mercy of the frontend
      deciding when to change states. This allows us to do the
      cleanup right away and free the resources.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • K
      xen/blocks: Return -EXX instead of -1 · bde21f73
      Konrad Rzeszutek Wilk authored
      Let's return sensible values instead of -1.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • B
      xen/blkback: make pool of persistent grants and free pages per-queue · d4bf0065
      Bob Liu authored
      Make pool of persistent grants and free pages per-queue/ring instead of
      per-device to get better scalability.
      
      Test was done based on null_blk driver:
      dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
      domu: v4.2-rc8 16vcpus 10GB
      
      [test]
      rw=read
      direct=1
      ioengine=libaio
      bs=4k
      time_based
      runtime=30
      filename=/dev/xvdb
      numjobs=16
      iodepth=64
      iodepth_batch=64
      iodepth_batch_complete=64
      group_reporting
      
      Results:
      iops1: After patch "xen/blkfront: make persistent grants per-queue".
      iops2: After this patch.
      
      Queues:         1     4            8            16
      Iops orig(k):   810   1064         780          700
      Iops1(k):       810   1230(~20%)   1024(~20%)   850(~20%)
      Iops2(k):       810   1410(~35%)   1354(~75%)   1440(~100%)
      
      With 4 queues after this commit we can get ~75% increase in IOPS, and
      performance won't drop if increasing queue numbers.
      
      Please find the respective chart in this link:
      https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • B
      xen/blkback: get the number of hardware queues/rings from blkfront · d62d8600
      Bob Liu authored
      The backend advertises "multi-queue-max-queues" to the frontend and reads the
      negotiated number from "multi-queue-num-queues", which is written by blkfront.
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • K
      xen/blkback: pseudo support for multi hardware queues/rings · 2fb1ef4f
      Konrad Rzeszutek Wilk authored
      Preparatory patch for multiple hardware queues (rings). The number of
      rings is unconditionally set to 1; a larger number will be enabled in
      "xen/blkback: get the number of hardware queues/rings from blkfront".
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      ---
      v2: Align variables in the structures.
    • B
      xen/blkback: separate ring information out of struct xen_blkif · 59795700
      Bob Liu authored
      Split per-ring information into a new structure "xen_blkif_ring", so that one vbd
      device can be associated with one or more rings/hardware queues.

      Introduce 'pers_gnts_lock' to protect the pool of persistent grants since we
      may have multiple backend threads.

      This patch is a preparation for supporting multiple hardware queues/rings.
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      ---
      v2: Align the variables in the structure.
    • P
      xen/blkfront: correct setting for xen_blkif_max_ring_order · 45fc8264
      Peng Fan authored
      According to this piece of code:
      "
           pr_info("Invalid max_ring_order (%d), will use default max: %d.\n",
                    xen_blkif_max_ring_order, XENBUS_MAX_RING_GRANT_ORDER);
      "
      if xen_blkif_max_ring_order is bigger than XENBUS_MAX_RING_GRANT_ORDER,
      we need to set xen_blkif_max_ring_order to XENBUS_MAX_RING_GRANT_ORDER,
      not 0.
      Signed-off-by: Peng Fan <van.freenix@gmail.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
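The corrected behaviour amounts to clamping an out-of-range value to the maximum rather than resetting it to zero. A minimal sketch follows; the function name is hypothetical and the maximum of 4 is an assumed value chosen for illustration, not taken from this commit.

```c
#include <assert.h>

/* Assumed stand-in for XENBUS_MAX_RING_GRANT_ORDER; the value is illustrative. */
#define MAX_RING_ORDER_EXAMPLE 4u

/* Clamp a requested ring order to the maximum instead of falling back to 0. */
static unsigned int clamp_ring_order_example(unsigned int requested)
{
    unsigned int order = requested;

    if (order > MAX_RING_ORDER_EXAMPLE)
        order = MAX_RING_ORDER_EXAMPLE;

    assert(order <= MAX_RING_ORDER_EXAMPLE);
    return order;
}
```

This keeps the behaviour consistent with the log message, which promises the default maximum when the requested order is invalid.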
    • B
      xen/blkfront: make persistent grants pool per-queue · 73716df7
      Bob Liu authored
      Make persistent grants per-queue/ring instead of per-device, so that we can
      drop the 'dev_lock' and get better scalability.
      
      Test was done based on null_blk driver:
      dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
      domu: v4.2-rc8 16vcpus 10GB
      
      [test]
      rw=read
      direct=1
      ioengine=libaio
      bs=4k
      time_based
      runtime=30
      filename=/dev/xvdb
      numjobs=16
      iodepth=64
      iodepth_batch=64
      iodepth_batch_complete=64
      group_reporting
      
      Queues:            1     4            8            16
      Iops orig(k):      810   1064         780          700
      Iops patched(k):   810   1230(~20%)   1024(~20%)   850(~20%)
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • B
      xen/blkfront: Remove duplicate setting of ->xbdev. · 75f070b3
      Bob Liu authored
      We do the same exact operations a bit earlier in the
      function.
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • K
    • B
      xen/blkfront: negotiate number of queues/rings to be used with backend · 28d949bc
      Bob Liu authored
      The max number of hardware queues for xen/blkfront is set by parameter
      'max_queues'(default 4), while it is also capped by the max value that the
      xen/blkback exposes through XenStore key 'multi-queue-max-queues'.
      
      The negotiated number is the smaller one and would be written back to xenstore
      as "multi-queue-num-queues", blkback needs to read this negotiated number.
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
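The negotiation rule (take the smaller of the two limits) can be sketched as below. The function name is hypothetical, and the fallback to a single queue when either side reports zero is an assumption added for the sketch rather than something stated in the message.

```c
#include <assert.h>

/*
 * Pick the number of queues both sides can handle: the smaller of the
 * frontend's 'max_queues' and the backend's advertised maximum.
 * Assumption for this sketch: at least one queue is always used.
 */
static unsigned int negotiate_queues_example(unsigned int frontend_max,
                                             unsigned int backend_max)
{
    unsigned int n = frontend_max < backend_max ? frontend_max : backend_max;

    if (n == 0)
        n = 1;

    assert(n >= 1);
    return n;
}
```

The result is the value the frontend would write back as "multi-queue-num-queues" for the backend to read.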
    • B
      xen/blkfront: split per device io_lock · 11659569
      Bob Liu authored
      After patch "xen/blkfront: separate per ring information out of device
      info", per-ring data is protected by a per-device lock ('io_lock').
      
      This is not a good approach and will affect scalability, so introduce a
      per-ring lock ('ring_lock').

      The old 'io_lock' is renamed to 'dev_lock' which protects the ->grants list and
      ->persistent_gnts_c which are shared by all rings.

      Note that in 'blkfront_probe' the 'blkfront_info' is set up via kzalloc
      so setting ->persistent_gnts_c to zero is not needed.
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • B
      xen/blkfront: pseudo support for multi hardware queues/rings · 3df0e505
      Bob Liu authored
      Preparatory patch for multiple hardware queues (rings). The number of
      rings is unconditionally set to 1; a larger number will be enabled in
      patch "xen/blkfront: negotiate number of queues/rings to be used with backend"
      so as to make review easier.

      Note that blkfront_gather_backend_features does not call
      blkfront_setup_indirect anymore (as that needs to be done per ring).
      That means that in blkif_recover/blkif_connect we have to do it in a loop
      (bounded by nr_rings).
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  5. 04 Jan, 2016 2 commits
  6. 31 Dec, 2015 8 commits
    • K
      bcache: Change refill_dirty() to always scan entire disk if necessary · 627ccd20
      Kent Overstreet authored
      Previously, it would only scan the entire disk if it was starting from
      the very start of the disk - i.e. if the previous scan got to the end.
      
      This was broken by refill_full_stripes(), which updates last_scanned so
      that refill_dirty was never triggering the searched_from_start path.
      
      But if we change refill_dirty() to always scan the entire disk if
      necessary, regardless of what last_scanned was, the code gets cleaner
      and we fix that bug too.
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
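The idea of scanning the whole disk regardless of where the last scan stopped can be illustrated with a circular scan over an array. This is a simplified model, not bcache's code: the array of dirty flags, the `want` budget, and the function name are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Scan `dirty` circularly, starting at *last_scanned, until `want` dirty
 * slots have been found or every slot has been visited exactly once.
 * Returns the number of dirty slots found and leaves *last_scanned at the
 * position where the next scan should resume.
 */
static size_t refill_scan_example(const bool *dirty, size_t nslots,
                                  size_t *last_scanned, size_t want)
{
    assert(nslots > 0 && *last_scanned < nslots);

    size_t found = 0;
    size_t pos = *last_scanned;

    for (size_t visited = 0; visited < nslots && found < want; visited++) {
        if (dirty[pos])
            found++;
        pos = (pos + 1) % nslots; /* wrap around to the start of the disk */
    }

    *last_scanned = pos;
    return found;
}
```

Because the loop is bounded by the number of slots rather than by the end of the array, a scan that starts in the middle still covers the slots before its starting point.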
    • S
      bcache: prevent crash on changing writeback_running · 8d16ce54
      Stefan Bader authored
      Added a safeguard in the shutdown case. At least while not being
      attached it is also possible to trigger a kernel bug by writing into
      writeback_running. This change adds the same check before trying to
      wake up the thread for that case.
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • G
      bcache: allows use of register in udev to avoid "device_busy" error. · d7076f21
      Gabriel de Perthuis authored
      Allows the use of register, rather than register_quiet, in udev to avoid the "device_busy" error.
      The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis
      <g2p.code@gmail.com> does not unlock the mutex and hangs the kernel.
      
      See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion.
      
      Cc: Denis Bychkov <manover@gmail.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Eric Wheeler <bcache@linux.ewheeler.net>
      Cc: Gabriel de Perthuis <g2p.code@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • Z
      bcache: unregister reboot notifier if bcache fails to unregister device · 2ecf0cdb
      Zheng Liu authored
      bcache_init() forgot to unregister the reboot notifier if
      bcache fails to unregister a block device.  This commit fixes this.
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Tested-by: Joshua Schmid <jschmid@suse.com>
      Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • A
      bcache: fix a leak in bch_cached_dev_run() · 4d4d8573
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Tested-by: Joshua Schmid <jschmid@suse.com>
      Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • Z
      bcache: clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device · fecaee6f
      Zheng Liu authored
      This bug can be reproduced by the following script:
      
        #!/bin/bash
      
        bcache_sysfs="/sys/fs/bcache"
      
        function clear_cache()
        {
        	if [ ! -e $bcache_sysfs ]; then
        		echo "no bcache sysfs"
        		exit
        	fi
      
        	cset_uuid=$(ls -l $bcache_sysfs|head -n 2|tail -n 1|awk '{print $9}')
        	sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach"
        	sleep 5
        	sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach"
        }
      
        for ((i=0;i<10;i++)); do
        	clear_cache
        done
      
      The warning messages look like below:
      [  275.948611] ------------[ cut here ]------------
      [  275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P        W
      ---------------   )
      [  275.979253] Hardware name: Tecal RH2285
      [  275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache'
      [  276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
      bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
      i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
      pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
      [  276.072643] Pid: 2765, comm: sh Tainted: P        W  ---------------    2.6.32 #1
      [  276.089315] Call Trace:
      [  276.105801]  [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
      [  276.122650]  [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
      [  276.139361]  [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0
      [  276.156012]  [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170
      [  276.172682]  [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20
      [  276.189282]  [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache]
      [  276.205993]  [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
      [  276.222794]  [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
      [  276.239680]  [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
      [  276.256594]  [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
      [  276.273364]  [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
      [  276.290133]  [<ffffffff811890b1>] ? sys_write+0x51/0x90
      [  276.306368]  [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
      [  276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]---
      [  276.338241] ------------[ cut here ]------------
      [  276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720
      bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P        W  ---------------   )
      [  276.386017] Hardware name: Tecal RH2285
      [  276.401430] Couldn't create device <-> cache set symlinks
      [  276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
      bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
      i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
      pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
      [  276.465477] Pid: 2765, comm: sh Tainted: P        W  ---------------    2.6.32 #1
      [  276.482169] Call Trace:
      [  276.498610]  [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
      [  276.515405]  [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
      [  276.532059]  [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache]
      [  276.548808]  [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
      [  276.565569]  [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
      [  276.582418]  [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
      [  276.599341]  [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
      [  276.616142]  [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
      [  276.632607]  [<ffffffff811890b1>] ? sys_write+0x51/0x90
      [  276.648671]  [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
      [  276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]---
      
      We forget to clear the BCACHE_DEV_UNLINK_DONE flag in the bcache_device_attach()
      function when we attach a backing device for the first time.  After detaching this
      backing device, this flag will be true and sysfs_remove_link() isn't called in
      bcache_device_unlink().  Then when we attach this backing device again,
      sysfs_create_link() will return an EEXIST error in bcache_device_link().

      So the fix is trivial: we clear this flag in bcache_device_link().
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Tested-by: Joshua Schmid <jschmid@suse.com>
      Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
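The effect of a stale "unlink done" flag can be reproduced in a small userspace model. This is a sketch under stated assumptions: the struct, the two booleans standing in for the sysfs link and for BCACHE_DEV_UNLINK_DONE, and both function names are hypothetical, and the flag is cleared in the attach helper purely to keep the model short.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct dev_example {
    bool link_exists; /* stands in for the sysfs symlink */
    bool unlink_done; /* stands in for BCACHE_DEV_UNLINK_DONE */
};

/*
 * Create the link. Returns 0 on success or -EEXIST if a link is still
 * present. `clear_flag` selects the fixed behaviour (flag reset on attach).
 */
static int attach_example(struct dev_example *d, bool clear_flag)
{
    if (clear_flag)
        d->unlink_done = false;
    if (d->link_exists)
        return -EEXIST;
    d->link_exists = true;
    return 0;
}

/* Remove the link only if the flag was not already set, then set it. */
static void detach_example(struct dev_example *d)
{
    if (!d->unlink_done) {
        d->unlink_done = true;
        d->link_exists = false;
    }
}
```

In this model, repeated attach/detach cycles without clearing the flag leave a link behind after the second detach, so the next attach fails with -EEXIST; clearing the flag on attach lets every cycle succeed.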
    • K
      bcache: Add a cond_resched() call to gc · c5f1e5ad
      Kent Overstreet authored
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • Z
      bcache: fix a livelock when we cause a huge number of cache misses · 2ef9ccbf
      Zheng Liu authored
      Subject :	[PATCH v2] bcache: fix a livelock in btree lock
      Date :	Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM)
      
      This commit tries to fix a livelock in bcache.  This livelock might
      happen when we cause a huge number of cache misses simultaneously.
      
      When we get a cache miss, bcache will execute the following path.
      
      ->cached_dev_make_request()
        ->cached_dev_read()
          ->cached_lookup()
            ->bch->btree_map_keys()
              ->btree_root()  <------------------------
                ->bch_btree_map_keys_recurse()        |
                  ->cache_lookup_fn()                 |
                    ->cached_dev_cache_miss()         |
                      ->bch_btree_insert_check_key() -|
                        [If btree->seq is not equal to seq + 1, we should return
                         EINTR and traverse btree again.]
      
      In the bch_btree_insert_check_key() function we first need to check the upgrade
      flag (op->lock == -1), and when this flag is true we need to release the
      read btree->lock and try to take the write btree->lock.  While taking and
      releasing this write lock, btree->seq is monotonically increased in
      order to prevent other threads from modifying it on a cache miss (see btree.h:74).
      But if there are cache misses caused by several requests, we could
      hit a livelock because btree->seq is always being changed by others.  Thus no
      one can make progress.

      This commit will try to take the write btree->lock if it encounters a race
      when we traverse the btree.  Although this sacrifices scalability, we
      can ensure that only one thread can modify the btree.
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Tested-by: Joshua Schmid <jschmid@suse.com>
      Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
      Cc: Joshua Schmid <jschmid@suse.com>
      Cc: Zhu Yanhai <zhu.yanhai@gmail.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 23 Dec, 2015 2 commits
  8. 26 Nov, 2015 6 commits