1. 18 Sep, 2014 28 commits
  2. 24 Aug, 2014 1 commit
    • Btrfs: fix task hang under heavy compressed write · 9e0af237
      Committed by Liu Bo
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs has migrated to the kernel workqueue, and that migration introduced
      this hang.
      
      Btrfs has a kind of work that is queued in an ordered way, which means that
      its ordered_func() must be processed in FIFO order, so it usually looks like:
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is rare. When we look for free space and get an uncached block
      group, we read its free space cache inode for free space information,
      so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happen
      to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readahead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      If work A is at the head of wq->ordered_list and more ordered works are
      queued after it, such as work B, A's memory can be freed before
      normal_work_helper() returns. The kernel workqueue code in worker_thread()
      then still has worker->current_work pointing to A->normal_work, i.e. arg's
      address.
      
      Meanwhile, work C is allocated after work A is freed, so C->normal_work
      and A->normal_work may share the same address (I confirmed this with
      ftrace output, so I'm not just guessing; it's rare though).
      
      When another kthread picks up C->normal_work and finds that our kthread
      appears to be processing it (see find_worker_executing_work()), it treats
      work C as a collision and skips it, so nobody ends up processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there are other cases that can lead to deadlock, but the real
      problem is that all btrfs workqueues share one work->func,
      normal_work_helper. So this patch gives each workqueue its own helper
      function, which is only a wrapper of normal_work_helper.
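
      The effect of per-workqueue helpers can be illustrated with a minimal,
      self-contained sketch. It assumes that the collision check compares both
      the work item's address and its function pointer, which is how
      find_worker_executing_work() is understood to behave. The structures are
      simplified stand-ins for the kernel types, and the names delalloc_helper,
      endio_helper, worker_is_executing and reused_address_collides are
      illustrative, not taken from the patch.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel types; not the real definitions. */
struct work_struct;
typedef void (*work_func_t)(struct work_struct *);

struct work_struct {
	work_func_t func;
};

struct worker {
	struct work_struct *current_work;	/* can dangle once the item is freed */
	work_func_t current_func;
};

/* Assumed collision rule: same address AND same function. */
bool worker_is_executing(const struct worker *w, const struct work_struct *work)
{
	return w->current_work == work && w->current_func == work->func;
}

void normal_work_helper(struct work_struct *arg) { (void)arg; }

/* Illustrative per-workqueue wrappers around the common helper. */
void delalloc_helper(struct work_struct *arg) { normal_work_helper(arg); }
void endio_helper(struct work_struct *arg) { normal_work_helper(arg); }

/*
 * Model "work C reuses work A's address": one storage slot first holds A
 * (which the worker records), then is overwritten with C.
 * Returns true if C would be mistaken for the work that is already running.
 */
bool reused_address_collides(work_func_t func_a, work_func_t func_c)
{
	struct work_struct slot = { .func = func_a };	/* work A */
	struct worker w = { .current_work = &slot, .current_func = slot.func };

	slot.func = func_c;				/* slot now holds work C */
	return worker_is_executing(&w, &slot);
}
```

      With a single shared helper, reused_address_collides(normal_work_helper,
      normal_work_helper) reports a collision. With distinct wrappers, such as
      reused_address_collides(delalloc_helper, endio_helper), it does not.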
      
      With this patch, I no longer hit the above hang.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  3. 19 Aug, 2014 6 commits
    • Btrfs: Fix wrong device size when we are resizing the device · 7df69d3e
      Committed by Miao Xie
      total_bytes of a device is just an in-memory variable that records the
      size of the device. It might be changed before we resize a device, and if
      the resize operation fails, it is rolled back. But some code used it to
      update the on-disk metadata of the device, which could leave the on-disk
      metadata of the devices inconsistent. We should use the other variable,
      disk_total_bytes, to update the on-disk metadata of the device, because
      that variable is updated only when the resize operation succeeds.
      Fix it.
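
      The distinction between the two fields can be sketched as follows. This
      is a simplified model, not the btrfs code: struct device_sizes and the
      functions resize_begin, resize_abort, resize_commit and
      size_for_disk_metadata are hypothetical names introduced only for
      illustration.

```c
#include <stdint.h>

/* Simplified stand-in for the two size fields of a btrfs device. */
struct device_sizes {
	uint64_t total_bytes;		/* in-memory size, may change before a resize completes */
	uint64_t disk_total_bytes;	/* updated only when a resize succeeds */
};

/* A resize attempt changes the in-memory size first. */
void resize_begin(struct device_sizes *dev, uint64_t new_size)
{
	dev->total_bytes = new_size;
}

/* On failure the in-memory size is rolled back to its old value. */
void resize_abort(struct device_sizes *dev, uint64_t old_size)
{
	dev->total_bytes = old_size;
}

/* Only a successful resize propagates the new size. */
void resize_commit(struct device_sizes *dev)
{
	dev->disk_total_bytes = dev->total_bytes;
}

/* Value to write into on-disk metadata: never the in-flight total_bytes. */
uint64_t size_for_disk_metadata(const struct device_sizes *dev)
{
	return dev->disk_total_bytes;
}
```

      Between resize_begin() and either resize_abort() or resize_commit(),
      total_bytes holds a value that may never become valid, which is why the
      metadata path in this sketch reads disk_total_bytes only.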
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: Fix the problem that the replace destroys the seed filesystem · ff61d17c
      Committed by Miao Xie
      The seed filesystem was destroyed by device replace. To reproduce:
       # mkfs.btrfs -f <dev0>
       # btrfstune -S 1 <dev0>
       # mount <dev0> <mnt>
       # btrfs device add <dev1> <mnt>
       # umount <mnt>
       # mount <dev1> <mnt>
       # btrfs replace start -f <dev0> <dev2> <mnt>
       # umount <mnt>
       # mount <dev0> <mnt>
      
      This happens because we erase the super block on the seed device. That is
      wrong; we should not change anything on the seed device.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix wrong missing device counter decrease · 3a7d55c8
      Committed by Miao Xie
      Missing devices are accounted by their own fs device. For example, the
      missing devices in a seed filesystem are accounted by the fs device of
      the seed filesystem, not by the new filesystem which is based on the seed
      filesystem. So when we remove a missing device in the seed filesystem, we
      should decrease the counter of its own fs device.
      Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs · 69611ac8
      Committed by Miao Xie
      We forgot to zero some members in fs_devices when creating a new
      fs_devices from that of the seed fs. As a result, we could get the wrong
      chunk profile when allocating chunks. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: check generation as replace duplicates devid+uuid · 77bdae4d
      Committed by Anand Jain
      When the FS is unmounted we need to check the generation number as well,
      since the devid+uuid combination could match the missing replaced disk
      when it reappears, and without this patch the FS might pair with the
      replaced disk again.
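
      A minimal sketch of such a check is shown below. It is not the code from
      the patch: struct known_device and scanned_device_matches are
      hypothetical, and the rule that a scanned device with an older
      generation is rejected is an assumption about how the comparison is
      applied.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for a device already on the in-memory device list. */
struct known_device {
	uint64_t devid;
	uint8_t uuid[16];
	uint64_t generation;	/* generation recorded for the listed device */
};

/*
 * Hypothetical predicate: should a scanned device be paired with an
 * existing list entry while the FS is unmounted?
 */
bool scanned_device_matches(const struct known_device *known,
			    uint64_t devid, const uint8_t uuid[16],
			    uint64_t found_generation)
{
	if (known->devid != devid || memcmp(known->uuid, uuid, 16) != 0)
		return false;
	/* devid+uuid alone is ambiguous after a replace; reject stale disks
	 * (assumed rule: an older generation means a stale disk). */
	return found_generation >= known->generation;
}
```

      In this model a reappearing replaced disk carries the same devid and
      uuid but an older generation, so it is not paired with the list entry.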
      
      device_list_add() is called in the following code paths:
      	mount device option
      	mount argument
      	ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan)
      	ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready <dev>)
      They have been unit tested to work fine with this patch.
      
      If users know what they are doing and really want to pair with the
      replaced disk (which is not a standard operation), they should first
      clear the in-memory kernel btrfs device list by unloading and reloading
      the module, and then mount with the -o device option.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: device_list_add() should not update list when mounted · b96de000
      Committed by Anand Jain
      device_list_add() is called when the user runs btrfs dev scan, which adds
      any btrfs device to the btrfs_fs_devices list.
      
      Now think of a mounted btrfs, and a new device which contains an SB
      from the mounted btrfs devices.
      
      In this situation, when the user runs btrfs dev scan, the current code
      just replaces the existing device with the new device.
      
      Note that the old device is neither closed nor gracefully removed from
      the btrfs.
      
      The FS is still operational with the old bdev; however, the device name
      in the btrfs_device is the new one provided by btrfs dev scan.
      
      reproducer:
      
      devmgt[1] detach /dev/sdc
      
      replace the missing disk /dev/sdc
      
      btrfs rep start -f 1 /dev/sde /btrfs
      Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120
              Total devices 2 FS bytes used 32.00KiB
              devid    1 size 958.94MiB used 115.88MiB path /dev/sde
              devid    2 size 958.94MiB used 103.88MiB path /dev/sdd
      
      make /dev/sdc reappear
      
      devmgt attach host2
      
      btrfs dev scan
      
      btrfs fi show -m
      Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120
              Total devices 2 FS bytes used 32.00KiB
              devid    1 size 958.94MiB used 115.88MiB path /dev/sdc <- Wrong.
              devid    2 size 958.94MiB used 103.88MiB path /dev/sdd
      
      Since /dev/sdc has been replaced with /dev/sde, /dev/sdc shouldn't be
      part of the btrfs-fsid when it reappears. If the user wants it to be part
      of it, the sysadmin should use btrfs device add instead.
      
      [1] github.com/anajain/devmgt.git
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  4. 20 Jul, 2014 1 commit
  5. 03 Jul, 2014 1 commit
    • btrfs: fix null pointer dereference in clone_fs_devices when name is null · e755f780
      Committed by Anand Jain
      When one of the device paths is missing, the btrfs_device name is null,
      so this patch checks for that.
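
      The kind of guard described here can be sketched in plain C. This is not
      the kernel code: struct dev_entry and clone_dev_name are hypothetical
      stand-ins, and the sketch uses malloc() where the kernel would use its
      own allocation helpers.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a device entry; name is NULL when the path is missing. */
struct dev_entry {
	char *name;
};

/*
 * Hypothetical clone helper: copies the name only when one exists, so a
 * missing device never reaches strlen() with a NULL pointer.
 * Returns 0 on success, -1 on allocation failure.
 */
int clone_dev_name(const struct dev_entry *orig, struct dev_entry *copy)
{
	size_t len;

	copy->name = NULL;
	if (!orig->name)
		return 0;	/* missing device: nothing to copy */

	len = strlen(orig->name) + 1;
	copy->name = malloc(len);
	if (!copy->name)
		return -1;
	memcpy(copy->name, orig->name, len);
	return 0;
}
```

      Without the NULL test, the strlen() call would dereference a null
      pointer, which corresponds to the strlen frame at the top of the stack
      trace in this commit message.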
      
      stack:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      IP: [<ffffffff812e18c0>] strlen+0x0/0x30
      [<ffffffffa01cd92a>] ? clone_fs_devices+0xaa/0x160 [btrfs]
      [<ffffffffa01cdcf7>] btrfs_init_new_device+0x317/0xca0 [btrfs]
      [<ffffffff81155bca>] ? __kmalloc_track_caller+0x15a/0x1a0
      [<ffffffffa01d6473>] btrfs_ioctl+0xaa3/0x2860 [btrfs]
      [<ffffffff81132a6c>] ? handle_mm_fault+0x48c/0x9c0
      [<ffffffff81192a61>] ? __blkdev_put+0x171/0x180
      [<ffffffff817a784c>] ? __do_page_fault+0x4ac/0x590
      [<ffffffff81193426>] ? blkdev_put+0x106/0x110
      [<ffffffff81179175>] ? mntput+0x35/0x40
      [<ffffffff8116d4b0>] do_vfs_ioctl+0x460/0x4a0
      [<ffffffff8115c72e>] ? ____fput+0xe/0x10
      [<ffffffff81068033>] ? task_work_run+0xb3/0xd0
      [<ffffffff8116d547>] SyS_ioctl+0x57/0x90
      [<ffffffff817a793e>] ? do_page_fault+0xe/0x10
      [<ffffffff817abe52>] system_call_fastpath+0x16/0x1b
      
      reproducer:
      mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
      btrfstune -S 1 /dev/sdg1
      modprobe -r btrfs && modprobe btrfs
      mount -o degraded /dev/sdg1 /btrfs
      btrfs dev add /dev/sdg3 /btrfs
      Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  6. 29 Jun, 2014 3 commits