1. 18 Sep, 2014 12 commits
  2. 24 Aug, 2014 1 commit
    • Btrfs: fix task hang under heavy compressed write · 9e0af237
      Liu Bo committed
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs has migrated to the kernel workqueue, but the migration introduced this hang.
      
      Btrfs has a kind of work that is queued in an ordered way, which means that its
      ordered_func() must be processed in FIFO order, so it usually looks like this:
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is rare. First, when we look for free space, we get an uncached block
      group; then we read its free space cache inode for free space information,
      so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happen
      to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readahead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      If work A has a high priority in wq->ordered_list and more ordered works are
      queued after it, such as B->ordered_func(), its memory could have been
      freed before normal_work_helper() returns, which means that the kernel workqueue
      code in worker_thread() still has worker->current_work pointing at work
      A->normal_work's address, i.e. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, so work C->normal_work
      and work A->normal_work can end up sharing the same address (I confirmed this
      with ftrace output, so I'm not just guessing; it's rare, though).
      
      When another kthread picks up work C->normal_work to process and finds that our
      kthread is processing it (see find_worker_executing_work()), it'll treat
      work C as a collision and skip it, so nobody ends up processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there are other cases that can lead to deadlock, but the real problem
      is that all btrfs workqueues share one work->func, normal_work_helper,
      so this patch gives each workqueue its own helper function, which is only a
      wrapper of normal_work_helper.
      
      With this patch, I no longer hit the above hang.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
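The address-reuse collision described in this commit can be modelled in user space. The sketch below is not kernel code: `struct work`, `struct worker`, `looks_busy()`, `collides_after_reuse()` and the two `queue_*_helper` wrappers are simplified, hypothetical stand-ins. It assumes that the busy check compares both the work item's address and its function, which is how we understand find_worker_executing_work() to behave; with a single shared helper the function comparison cannot tell a recycled address apart, while per-queue wrappers can.

```c
#include <assert.h>
#include <stdbool.h>

struct work;
typedef void (*work_func_t)(struct work *);

/* Simplified stand-in for struct work_struct. */
struct work {
    work_func_t func;
};

/* What a worker remembers about the item it is currently running. */
struct worker {
    struct work *current_work;
    work_func_t current_func;
};

/* One helper shared by every queue (models the pre-patch situation). */
void normal_work_helper(struct work *w) { (void)w; }

/* Per-queue wrappers (models the post-patch situation; names are made up). */
void queue_a_helper(struct work *w) { normal_work_helper(w); }
void queue_b_helper(struct work *w) { normal_work_helper(w); }

/* An item looks busy if a worker runs the same address with the same func. */
bool looks_busy(const struct worker *wk, const struct work *w)
{
    return wk->current_work == w && wk->current_func == w->func;
}

/*
 * Work A runs on a worker, is freed by its own ordered_free(), and work C
 * is then placed at the same address. Returns true if C is mistaken for
 * the item the worker is still running.
 */
bool collides_after_reuse(work_func_t func_a, work_func_t func_c)
{
    assert(func_a && func_c);

    struct work slot = { func_a };            /* the reused memory location */
    struct worker wk = { &slot, slot.func };  /* worker starts running A */

    slot.func = func_c;                       /* C now occupies A's old address */
    return looks_busy(&wk, &slot);
}
```

In this model, `collides_after_reuse(normal_work_helper, normal_work_helper)` reports a false collision, while giving A and C different wrappers makes the check fail as intended.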
  3. 19 Aug, 2014 6 commits
    • Btrfs: Fix wrong device size when we are resizing the device · 7df69d3e
      Miao Xie committed
      total_bytes of a device is just an in-memory variable that records
      the size of the device. It may be changed before we resize a device, and
      if the resize operation fails, it is rolled back. But some code used it
      to update the on-disk metadata of the device, which could leave the
      on-disk metadata of the devices inconsistent. We should use the other
      variable, disk_total_bytes, to update the on-disk metadata of the device,
      because that variable is updated only when the resize operation succeeds.
      Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: Fix the problem that the replace destroys the seed filesystem · ff61d17c
      Miao Xie committed
      The seed filesystem was destroyed by device replace. The steps to
      reproduce are:
       # mkfs.btrfs -f <dev0>
       # btrfstune -S 1 <dev0>
       # mount <dev0> <mnt>
       # btrfs device add <dev1> <mnt>
       # umount <mnt>
       # mount <dev1> <mnt>
       # btrfs replace start -f <dev0> <dev2> <mnt>
       # umount <mnt>
       # mount <dev0> <mnt>
      
      This is because we erase the super block on the seed device. That is wrong;
      we should not change anything on the seed device.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix wrong missing device counter decrease · 3a7d55c8
      Miao Xie committed
      Missing devices are accounted for by their own fs device. For example,
      the missing devices in a seed filesystem are accounted for by the fs device
      of the seed filesystem, not by the new filesystem that is based on
      the seed filesystem. So when we remove a missing device from the
      seed filesystem, we should decrease the counter of its own fs device.
      Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs · 69611ac8
      Miao Xie committed
      We forgot to zero some members of fs_devices when creating a new fs_devices
      from that of the seed fs. As a result, we got the wrong chunk profile
      when allocating chunks. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: check generation as replace duplicates devid+uuid · 77bdae4d
      Anand Jain committed
      When the FS is unmounted we need to check the generation number as well,
      since the devid+uuid combination could match the missing replaced
      disk when it reappears, and without this patch it might pair with
      the replaced disk again.
      
      The device_list_add() function is called in the following paths:
      	mount device option
      	mount argument
      	ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan)
      	ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready <dev>)
      They have been unit tested and work fine with this patch.
      
      If users know what they are doing and really want to pair with the
      replaced disk (which is not a standard operation), they should
      first clear the in-memory kernel btrfs device list by unloading and
      reloading the module, and then mount with the -o device option.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
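The idea of checking the generation in addition to devid+uuid can be sketched as follows. The commit message does not show the exact comparison used in the patch, so the rule below (a scanned device is paired only if its generation is not older than the registered one) is an assumption, and `struct dev_record` and `should_pair()` are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define UUID_SIZE 16

/* Identity read from a device's superblock (simplified). */
struct dev_record {
    uint64_t devid;
    unsigned char uuid[UUID_SIZE];
    uint64_t generation;
};

/*
 * Decide whether a freshly scanned device may be paired with a device
 * already on the in-memory list. devid+uuid alone would also match a
 * stale disk that was replaced earlier, so the generation is compared too.
 */
bool should_pair(const struct dev_record *known, const struct dev_record *scanned)
{
    assert(known != NULL && scanned != NULL);

    if (known->devid != scanned->devid)
        return false;
    if (memcmp(known->uuid, scanned->uuid, UUID_SIZE) != 0)
        return false;
    /* Assumed rule: reject a superblock older than the registered one. */
    return scanned->generation >= known->generation;
}
```

Under this assumed rule, a reappearing replaced disk that still carries the old generation is rejected even though its devid and uuid match.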
    • Btrfs: device_list_add() should not update list when mounted · b96de000
      Anand Jain committed
      device_list_add() is called when the user runs btrfs dev scan, which adds
      any btrfs device to the btrfs_fs_devices list.
      
      Now consider a mounted btrfs, and a new device that contains an SB
      from one of the mounted btrfs devices.
      
      In this situation, when the user runs btrfs dev scan, the current code
      just replaces the existing device with the new device.
      
      Note that the old device is neither closed nor gracefully
      removed from the btrfs.
      
      The FS is still operational with the old bdev; however, the device name
      in the btrfs_device is the new one provided by btrfs dev scan.
      
      reproducer:
      
      devmgt[1] detach /dev/sdc
      
      replace the missing disk /dev/sdc
      
      btrfs rep start -f 1 /dev/sde /btrfs
      Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120
              Total devices 2 FS bytes used 32.00KiB
              devid    1 size 958.94MiB used 115.88MiB path /dev/sde
              devid    2 size 958.94MiB used 103.88MiB path /dev/sdd
      
      make /dev/sdc reappear
      
      devmgt attach host2
      
      btrfs dev scan
      
      btrfs fi show -m
      Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120
              Total devices 2 FS bytes used 32.00KiB
              devid    1 size 958.94MiB used 115.88MiB path /dev/sdc <- Wrong.
              devid    2 size 958.94MiB used 103.88MiB path /dev/sdd
      
      Since /dev/sdc has been replaced with /dev/sde, /dev/sdc shouldn't be
      part of the btrfs-fsid when it reappears. If the user wants it to be part
      of it, the sysadmin should use btrfs device add instead.
      
      [1] github.com/anajain/devmgt.git
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  4. 20 Jul, 2014 1 commit
  5. 03 Jul, 2014 1 commit
    • btrfs: fix null pointer dereference in clone_fs_devices when name is null · e755f780
      Anand Jain committed
      When one of the device paths is missing, the btrfs_device name is NULL,
      so this patch checks for that.
      
      stack:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      IP: [<ffffffff812e18c0>] strlen+0x0/0x30
      [<ffffffffa01cd92a>] ? clone_fs_devices+0xaa/0x160 [btrfs]
      [<ffffffffa01cdcf7>] btrfs_init_new_device+0x317/0xca0 [btrfs]
      [<ffffffff81155bca>] ? __kmalloc_track_caller+0x15a/0x1a0
      [<ffffffffa01d6473>] btrfs_ioctl+0xaa3/0x2860 [btrfs]
      [<ffffffff81132a6c>] ? handle_mm_fault+0x48c/0x9c0
      [<ffffffff81192a61>] ? __blkdev_put+0x171/0x180
      [<ffffffff817a784c>] ? __do_page_fault+0x4ac/0x590
      [<ffffffff81193426>] ? blkdev_put+0x106/0x110
      [<ffffffff81179175>] ? mntput+0x35/0x40
      [<ffffffff8116d4b0>] do_vfs_ioctl+0x460/0x4a0
      [<ffffffff8115c72e>] ? ____fput+0xe/0x10
      [<ffffffff81068033>] ? task_work_run+0xb3/0xd0
      [<ffffffff8116d547>] SyS_ioctl+0x57/0x90
      [<ffffffff817a793e>] ? do_page_fault+0xe/0x10
      [<ffffffff817abe52>] system_call_fastpath+0x16/0x1b
      
      reproducer:
      mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
      btrfstune -S 1 /dev/sdg1
      modprobe -r btrfs && modprobe btrfs
      mount -o degraded /dev/sdg1 /btrfs
      btrfs dev add /dev/sdg3 /btrfs
      Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
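The kind of NULL check this commit describes can be shown with a user-space sketch. It is not the kernel code: `struct device` and `clone_device_name()` are hypothetical, and plain `malloc()`/`strcpy()` stand in for whatever string helpers the kernel uses. The point is only that the name is tested before `strlen()` touches it, which is where the oops in the stack trace occurred.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified device: name is NULL when the device path is missing. */
struct device {
    char *name;
};

/*
 * Copy the name of orig into copy.
 * Returns 0 on success (including the missing-device case) and -1 if
 * memory allocation fails.
 */
int clone_device_name(const struct device *orig, struct device *copy)
{
    assert(orig != NULL && copy != NULL);

    copy->name = NULL;
    if (!orig->name)        /* missing device: nothing to copy */
        return 0;

    copy->name = malloc(strlen(orig->name) + 1);
    if (!copy->name)
        return -1;
    strcpy(copy->name, orig->name);
    return 0;
}
```

A caller is expected to `free()` the copied name when it is no longer needed.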
  6. 29 Jun, 2014 3 commits
  7. 20 Jun, 2014 4 commits
    • Btrfs: fix wrong error handle when the device is missing or is not writeable · 8408c716
      Miao Xie committed
      The original bio might have been submitted, so we should increase bi_remaining
      to account for it when we handle the error that the device is missing or
      is not writeable; otherwise we would skip the endio handler.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix deadlock when mounting a degraded fs · c55f1396
      Miao Xie committed
      The deadlock happened when we mounted a degraded filesystem. The steps to
      reproduce are as follows:
       # mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1>
       # echo 1 > /sys/block/`basename <dev0>`/device/delete
       # mount -o degraded <dev1> <mnt>
      
      The reason was that the bi_remaining counter was wrong. If the missing
      or unwriteable device was the last device in the mapping array, we would
      not submit the original bio, so we shouldn't increase its bi_remaining
      in btrfs_end_bio(); otherwise we would skip the final endio handler.
      
      Fix this problem by adding a flag to the btrfs bio structure. If we submit
      the original bio, we set the flag and increase the bi_remaining counter;
      otherwise we don't.
      
      There is another way to fix it, which is to decrease the bi_remaining counter
      of the original bio once we are sure the original bio is not submitted, but
      that method needs more checks and is error-prone.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
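The counter problem in this commit can be followed with a toy model. It is a deliberately simplified sketch, not the block layer: `struct bio` keeps only a `remaining` counter and a `final_done` flag, and `device_completion()`, `final_endio()` and `complete_original()` are hypothetical helpers. The assumption is that the final handler runs only when the counter drops to zero, and that a bio that was really submitted has already consumed one count when the device completed it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy bio: 'remaining' models bi_remaining and starts at 1. */
struct bio {
    int remaining;
    bool final_done;    /* set once the caller's endio handler has run */
};

/* Device-level completion of a bio that was really submitted. */
void device_completion(struct bio *b)
{
    b->remaining--;
}

/* Final completion: the handler runs only when the counter reaches zero. */
void final_endio(struct bio *b)
{
    if (--b->remaining == 0)
        b->final_done = true;
}

/*
 * Complete the original bio.
 *   submitted   - whether the original bio was actually sent to a device
 *   honour_flag - true models the patched code, which bumps the counter
 *                 only when the "submitted" flag is set; false models the
 *                 old code, which bumped it unconditionally
 * Returns true if the final handler ran.
 */
bool complete_original(bool submitted, bool honour_flag)
{
    struct bio orig = { .remaining = 1, .final_done = false };

    if (submitted)
        device_completion(&orig);
    if (!honour_flag || submitted)
        orig.remaining++;           /* account for the earlier completion */
    final_endio(&orig);

    assert(orig.remaining >= 0);
    return orig.final_done;
}
```

In this model the old behaviour leaves the counter at 1 when the original bio was never submitted, so the final handler never runs, which corresponds to the hang at mount time.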
    • Btrfs: use bio_endio_nodec instead of open code · e990f167
      Miao Xie committed
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix NULL pointer crash when running balance and scrub concurrently · 298a8f9c
      Wang Shilong committed
      While running balance, scrub and fsstress concurrently, we hit the
      following kernel crash:
      
      [56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132
      [56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
      [56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
      [56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0
      [56561.524325] Oops: 0000 [#1] SMP
      [....]
      [56561.527237] Call Trace:
      [56561.527309]  [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs]
      [56561.527392]  [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0
      [56561.527476]  [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs]
      [56561.527561]  [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs]
      [56561.527639]  [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0
      [56561.527712]  [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60
      [56561.527788]  [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0
      [56561.527870]  [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0
      [56561.527941]  [<ffffffff815707f7>] tracesys+0xdd/0xe2
      [...]
      [56561.528304] RIP  [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
      [56561.528395]  RSP <ffff88004c0f5be8>
      [56561.528454] CR2: 0000000000000078
      
      This is because btrfs_relocate_chunk() frees @bdev directly while
      scrub may still hold the extent mapping and may then access freed memory.
      
      Fix this problem by moving the freeing of @bdev into free_extent_map(), which
      is reference counted.
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
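The reference-counting approach named in this commit can be sketched in user space. The functions below reuse the names `alloc_extent_map()` and `free_extent_map()` for readability, but they are simplified models written for this example, not the kernel implementations; `get_extent_map()`, the `payload` field (standing in for @bdev) and the `payloads_freed` counter are hypothetical. The payload is released only when the last holder drops its reference.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts payload releases so the behaviour can be observed in a test. */
int payloads_freed;

/* Toy extent map: 'payload' stands in for the data hanging off @bdev. */
struct extent_map {
    int refs;
    void *payload;
};

/* Allocate a map with one reference held by the caller. */
struct extent_map *alloc_extent_map(size_t payload_size)
{
    struct extent_map *em = malloc(sizeof(*em));

    if (!em)
        return NULL;
    em->payload = malloc(payload_size);
    if (!em->payload) {
        free(em);
        return NULL;
    }
    em->refs = 1;
    return em;
}

/* Take an extra reference, e.g. for a concurrent reader such as scrub. */
void get_extent_map(struct extent_map *em)
{
    assert(em != NULL && em->refs > 0);
    em->refs++;
}

/* Drop a reference; the payload is freed only with the last one. */
void free_extent_map(struct extent_map *em)
{
    if (!em)
        return;
    assert(em->refs > 0);
    if (--em->refs == 0) {
        free(em->payload);
        payloads_freed++;
        free(em);
    }
}
```

With this shape, a relocation path that drops its reference while scrub still holds one no longer frees memory out from under the reader.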
  8. 10 Jun, 2014 9 commits
  9. 08 Apr, 2014 1 commit
  10. 11 Mar, 2014 2 commits