1. 16 8月, 2017 18 次提交
  2. 24 7月, 2017 3 次提交
    • N
      btrfs: round down size diff when shrinking/growing device · 0e4324a4
      Nikolay Borisov 提交于
      Further testing showed that the fix introduced in 7dfb8be1 ("btrfs:
      Round down values which are written for total_bytes_size") was
      insufficient and it could still lead to discrepancies between the
      total_bytes in the super block and the device total bytes. So this patch
      also ensures that the difference between old/new sizes when
      shrinking/growing is also rounded down. This ensure that we won't be
      subtracting/adding a non-sectorsize multiples to the superblock/device
      total sizees.
      
      Fixes: 7dfb8be1 ("btrfs: Round down values which are written for total_bytes_size")
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0e4324a4
    • O
      Btrfs: fix early ENOSPC due to delalloc · 17024ad0
      Omar Sandoval 提交于
      If a lot of metadata is reserved for outstanding delayed allocations, we
      rely on shrink_delalloc() to reclaim metadata space in order to fulfill
      reservation tickets. However, shrink_delalloc() has a shortcut where if
      it determines that space can be overcommitted, it will stop early. This
      made sense before the ticketed enospc system, but now it means that
      shrink_delalloc() will often not reclaim enough space to fulfill any
      tickets, leading to an early ENOSPC. (Reservation tickets don't care
      about being able to overcommit, they need every byte accounted for.)
      
      Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
      all of the metadata it is supposed to. This fixes early ENOSPCs we were
      seeing when doing a btrfs receive to populate a new filesystem, as well
      as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.
      
      Fixes: 957780eb ("Btrfs: introduce ticketed enospc infrastructure")
      Tested-by: NChristoph Anton Mitterer <mail@christoph.anton.mitterer.name>
      Cc: stable@vger.kernel.org
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      17024ad0
    • J
      btrfs: fix lockup in find_free_extent with read-only block groups · 14443937
      Jeff Mahoney 提交于
      If we have a block group that is all of the following:
      1) uncached in memory
      2) is read-only
      3) has a disk cache state that indicates we need to recreate the cache
      
      AND the file system has enough free space fragmentation such that the
      request for an extent of a given size can't be honored;
      
      AND have a single CPU core;
      
      AND it's the block group with the highest starting offset such that
      there are no opportunities (like reading from disk) for the loop to
      yield the CPU;
      
      We can end up with a lockup.
      
      The root cause is simple.  Once we're in the position that we've read in
      all of the other block groups directly and none of those block groups
      can honor the request, there are no more opportunities to sleep.  We end
      up trying to start a caching thread which never gets run if we only have
      one core.  This *should* present as a hung task waiting on the caching
      thread to make some progress, but it doesn't.  Instead, it degrades into
      a busy loop because of the placement of the read-only check.
      
      During the first pass through the loop, block_group->cached will be set
      to BTRFS_CACHE_STARTED and have_caching_bg will be set.  Then we hit the
      read-only check and short circuit the loop.  We're not yet in
      LOOP_CACHING_WAIT, so we skip that loop back before going through the
      loop again for other raid groups.
      
      Then we move to LOOP_CACHING_WAIT state.
      
      During the this pass through the loop, ->cached will still be
      BTRFS_CACHE_STARTED, which means it's not cached, so we'll enter
      cache_block_group, do a lot of nothing, and return, and also set
      have_caching_bg again.  Then we hit the read-only check and short circuit
      the loop.  The same thing happens as before except now we DO trigger
      the LOOP_CACHING_WAIT && have_caching_bg check and loop back up to the
      top.  We do this forever.
      
      There are two fixes in this patch since they address the same underlying
      bug.
      
      The first is to add a cond_resched to the end of the loop to ensure
      that the caching thread always has an opportunity to run.  This will
      fix the soft lockup issue, but find_free_extent will still loop doing
      nothing until the thread has completed.
      
      The second is to move the read-only check to the top of the loop.  We're
      never going to return an allocation within a read-only block group so
      we may as well skip it early.  The check for ->cached == BTRFS_CACHE_ERROR
      would cause the same problem except that BTRFS_CACHE_ERROR is considered
      a "done" state and we won't re-set have_caching_bg again.
      
      Many thanks to Stephan Kulow <coolo@suse.de> for his excellent help in
      the testing process.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      14443937
  3. 20 7月, 2017 1 次提交
  4. 15 7月, 2017 3 次提交
  5. 14 7月, 2017 1 次提交
    • F
      Btrfs: fix write corruption due to bio cloning on raid5/6 · 6592e58c
      Filipe Manana 提交于
      The recent changes to make bio cloning faster (added in the 4.13 merge
      window) by using the bio_clone_fast() API introduced a regression on
      raid5/6 modes, because cloned bios have an invalid bi_vcnt field
      (therefore it can not be used) and the raid5/6 code uses the
      bio_for_each_segment_all() API to iterate the segments of a bio, and this
      API uses a bio's bi_vcnt field.
      
      The issue is very simple to trigger by doing for example a direct IO write
      against a raid5 or raid6 filesystem and then attempting to read what we
      wrote before:
      
        $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
        $ mount /dev/sdc /mnt
        $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
        $ od -t x1 /mnt/foobar
        od: /mnt/foobar: read error: Input/output error
      
      For that example, the following is also reported in dmesg/syslog:
      
        [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
        [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
        [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
        [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
      
      A scrub will also fail correcting bad copies, mentioning the following in
      dmesg/syslog:
      
        [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
        [18276.129617] BTRFS warning (device sdf): checksum error at logical 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, length 4096, links $
        [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
        [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        [18276.206059] BTRFS warning (device sdf): checksum error at logical 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, length 4096, links$
        [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        [18276.306552] BTRFS warning (device sdf): checksum error at logical 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, length 4096, links$
        [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
        [18276.394316] BTRFS warning (device sdf): checksum error at logical 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, length 4096, links$
        [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        [18276.434127] BTRFS warning (device sdf): checksum error at logical 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, length 4096, links$
        [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
        [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186477568 on dev /dev/sdd
        [18276.538400] BTRFS warning (device sdf): checksum error at logical 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, length 4096, links$
        [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
        [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186481664 on dev /dev/sdd
        [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186346496 on dev /dev/sde
        [18276.598306] BTRFS warning (device sdf): checksum error at logical 2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 131072, length 4096, links$
        [18276.598310] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
        [18276.598582] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186350592 on dev /dev/sde
        [18276.603455] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
        [18276.638362] BTRFS warning (device sdf): checksum error at logical 2186354688 on dev /dev/sde, sector 2116624, root 5, inode 257, offset 73728, length 4096, links $
        [18276.640445] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
        [18276.645942] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186354688 on dev /dev/sde
        [18276.657204] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186412032 on dev /dev/sde
        [18276.660563] BTRFS warning (device sdf): checksum error at logical 2186416128 on dev /dev/sde, sector 2116744, root 5, inode 257, offset 135168, length 4096, links$
        [18276.664609] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
        [18276.664609] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186358784 on dev /dev/sde
      
      So fix this by using the bio_for_each_segment() API and setting before
      the bio's bi_iter field to the value of the corresponding btrfs bio
      container's saved iterator if we are processing a cloned bio in the
      raid5/6 code (the same code processes both cloned and non-cloned bios).
      
      This incorrect iteration of cloned bios was also causing some occasional
      BUG_ONs when running fstest btrfs/064, which have a trace like the
      following:
      
        [ 6674.416156] ------------[ cut here ]------------
        [ 6674.416157] kernel BUG at fs/btrfs/raid56.c:1897!
        [ 6674.416159] invalid opcode: 0000 [#1] PREEMPT SMP
        [ 6674.416160] Modules linked in: dm_flakey dm_mod dax ppdev tpm_tis parport_pc tpm_tis_core evdev tpm psmouse sg i2c_piix4 pcspkr parport i2c_core serio_raw button s
        [ 6674.416184] CPU: 3 PID: 19236 Comm: kworker/u32:10 Not tainted 4.12.0-rc6-btrfs-next-44+ #1
        [ 6674.416185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
        [ 6674.416210] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
        [ 6674.416211] task: ffff880147f6c740 task.stack: ffffc90001fb8000
        [ 6674.416229] RIP: 0010:__raid_recover_end_io+0x1ac/0x370 [btrfs]
        [ 6674.416230] RSP: 0018:ffffc90001fbbb90 EFLAGS: 00010217
        [ 6674.416231] RAX: ffff8801ff4b4f00 RBX: 0000000000000002 RCX: 0000000000000001
        [ 6674.416232] RDX: ffff880099b045d8 RSI: ffffffff81a5f6e0 RDI: 0000000000000004
        [ 6674.416232] RBP: ffffc90001fbbbc8 R08: 0000000000000001 R09: 0000000000000001
        [ 6674.416233] R10: ffffc90001fbbac8 R11: 0000000000001000 R12: 0000000000000002
        [ 6674.416234] R13: ffff880099b045c0 R14: 0000000000000004 R15: ffff88012bff2000
        [ 6674.416235] FS:  0000000000000000(0000) GS:ffff88023f2c0000(0000) knlGS:0000000000000000
        [ 6674.416235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 6674.416236] CR2: 00007f28cf282000 CR3: 00000001000c6000 CR4: 00000000000006e0
        [ 6674.416239] Call Trace:
        [ 6674.416259]  __raid56_parity_recover+0xfc/0x16e [btrfs]
        [ 6674.416276]  raid56_parity_recover+0x157/0x16b [btrfs]
        [ 6674.416293]  btrfs_map_bio+0xe0/0x259 [btrfs]
        [ 6674.416310]  btrfs_submit_bio_hook+0xbf/0x147 [btrfs]
        [ 6674.416327]  end_bio_extent_readpage+0x27b/0x4a0 [btrfs]
        [ 6674.416331]  bio_endio+0x17d/0x1b3
        [ 6674.416346]  end_workqueue_fn+0x3c/0x3f [btrfs]
        [ 6674.416362]  btrfs_scrubparity_helper+0x1aa/0x3b8 [btrfs]
        [ 6674.416379]  btrfs_endio_helper+0xe/0x10 [btrfs]
        [ 6674.416381]  process_one_work+0x276/0x4b6
        [ 6674.416384]  worker_thread+0x1ac/0x266
        [ 6674.416386]  ? rescuer_thread+0x278/0x278
        [ 6674.416387]  kthread+0x106/0x10e
        [ 6674.416389]  ? __list_del_entry+0x22/0x22
        [ 6674.416391]  ret_from_fork+0x27/0x40
        [ 6674.416395] Code: 44 89 e2 be 00 10 00 00 ff 15 b0 ab ef ff eb 72 4d 89 e8 89 d9 44 89 e2 be 00 10 00 00 ff 15 a3 ab ef ff eb 5d 41 83 fc ff 74 02 <0f> 0b 49 63 97
        [ 6674.416432] RIP: __raid_recover_end_io+0x1ac/0x370 [btrfs] RSP: ffffc90001fbbb90
        [ 6674.416434] ---[ end trace 74d56ebe7489dd6a ]---
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      6592e58c
  6. 10 7月, 2017 1 次提交
    • G
      btrfs: nowait aio: Correct assignment of pos · ff0fa732
      Goldwyn Rodrigues 提交于
      Assigning pos for usage early messes up in append mode, where the pos is
      re-assigned in generic_write_checks(). Assign pos later to get the
      correct position to write from iocb->ki_pos.
      
      Since check_can_nocow also uses the value of pos, we shift
      generic_write_checks() before check_can_nocow(). Checks with IOCB_DIRECT
      are present in generic_write_checks(), so checking for IOCB_NOWAIT is
      enough.
      
      Also, put locking sequence in the fast path.
      
      This fixes a user visible bug, as reported:
      
      "apparently breaks several shell related features on my system.
      In zsh history stopped working, because no new entries are added
      anymore.
      I fist noticed the issue when I tried to build mplayer. It uses a shell
      script to generate a help_mp.h file:
      [...]
      
      Here is a simple testcase:
      
       % echo "foo" >> test
       % echo "foo" >> test
       % cat test
       foo
       %
      "
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      CC: Jens Axboe <axboe@kernel.dk>
      Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
      Link: https://lkml.kernel.org/r/20170704042306.GA274@x4Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ff0fa732
  7. 07 7月, 2017 2 次提交
    • F
      Btrfs: incremental send, fix invalid memory access · 24e52b11
      Filipe Manana 提交于
      When doing an incremental send, while processing an extent that changed
      between the parent and send snapshots and that extent was an inline extent
      in the parent snapshot, it's possible to access a memory region beyond
      the end of leaf if the inline extent is very small and it is the first
      item in a leaf.
      
      An example scenario is described below.
      
      The send snapshot has the following leaf:
      
       leaf 33865728 items 33 free space 773 generation 46 owner 5
       fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
       chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
              (...)
              item 14 key (335 EXTENT_DATA 0) itemoff 3052 itemsize 53
                      generation 36 type 1 (regular)
                      extent data disk byte 12791808 nr 4096
                      extent data offset 0 nr 4096 ram 4096
                      extent compression 0 (none)
              item 15 key (335 EXTENT_DATA 8192) itemoff 2999 itemsize 53
                      generation 36 type 1 (regular)
                      extent data disk byte 138170368 nr 225280
                      extent data offset 0 nr 225280 ram 225280
                      extent compression 0 (none)
              (...)
      
      And the parent snapshot has the following leaf:
      
       leaf 31272960 items 17 free space 17 generation 31 owner 5
       fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
       chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
              item 0 key (335 EXTENT_DATA 0) itemoff 3951 itemsize 44
                      generation 31 type 0 (inline)
                      inline extent data size 23 ram_bytes 613 compression 1 (zlib)
              (...)
      
      When computing the send stream, it is detected that the extent of inode
      335, at file offset 0, and at fs/btrfs/send.c:is_extent_unchanged() we
      grab the leaf from the parent snapshot and access the inline extent item.
      However, before jumping to the 'out' label, we access the 'offset' and
      'disk_bytenr' fields of the extent item, which should not be done for
      inline extents since the inlined data starts at the offset of the
      'disk_bytenr' field and can be very small. For example accessing the
      'offset' field of the file extent item results in the following trace:
      
      [  599.705368] general protection fault: 0000 [#1] PREEMPT SMP
      [  599.706296] Modules linked in: btrfs psmouse i2c_piix4 ppdev acpi_cpufreq serio_raw parport_pc i2c_core evdev tpm_tis tpm_tis_core sg pcspkr parport tpm button su$
      [  599.709340] CPU: 7 PID: 5283 Comm: btrfs Not tainted 4.10.0-rc8-btrfs-next-46+ #1
      [  599.709340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  599.709340] task: ffff88023eedd040 task.stack: ffffc90006658000
      [  599.709340] RIP: 0010:read_extent_buffer+0xdb/0xf4 [btrfs]
      [  599.709340] RSP: 0018:ffffc9000665ba00 EFLAGS: 00010286
      [  599.709340] RAX: db73880000000000 RBX: 0000000000000000 RCX: 0000000000000001
      [  599.709340] RDX: ffffc9000665ba60 RSI: db73880000000000 RDI: ffffc9000665ba5f
      [  599.709340] RBP: ffffc9000665ba30 R08: 0000000000000001 R09: ffff88020dc5e098
      [  599.709340] R10: 0000000000001000 R11: 0000160000000000 R12: 6db6db6db6db6db7
      [  599.709340] R13: ffff880000000000 R14: 0000000000000000 R15: ffff88020dc5e088
      [  599.709340] FS:  00007f519555a8c0(0000) GS:ffff88023f3c0000(0000) knlGS:0000000000000000
      [  599.709340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  599.709340] CR2: 00007f1411afd000 CR3: 0000000235f8e000 CR4: 00000000000006e0
      [  599.709340] Call Trace:
      [  599.709340]  btrfs_get_token_64+0x93/0xce [btrfs]
      [  599.709340]  ? printk+0x48/0x50
      [  599.709340]  btrfs_get_64+0xb/0xd [btrfs]
      [  599.709340]  process_extent+0x3a1/0x1106 [btrfs]
      [  599.709340]  ? btree_read_extent_buffer_pages+0x5/0xef [btrfs]
      [  599.709340]  changed_cb+0xb03/0xb3d [btrfs]
      [  599.709340]  ? btrfs_get_token_32+0x7a/0xcc [btrfs]
      [  599.709340]  btrfs_compare_trees+0x432/0x53d [btrfs]
      [  599.709340]  ? process_extent+0x1106/0x1106 [btrfs]
      [  599.709340]  btrfs_ioctl_send+0x960/0xe26 [btrfs]
      [  599.709340]  btrfs_ioctl+0x181b/0x1fed [btrfs]
      [  599.709340]  ? trace_hardirqs_on_caller+0x150/0x1ac
      [  599.709340]  vfs_ioctl+0x21/0x38
      [  599.709340]  ? vfs_ioctl+0x21/0x38
      [  599.709340]  do_vfs_ioctl+0x611/0x645
      [  599.709340]  ? rcu_read_unlock+0x5b/0x5d
      [  599.709340]  ? __fget+0x6d/0x79
      [  599.709340]  SyS_ioctl+0x57/0x7b
      [  599.709340]  entry_SYSCALL_64_fastpath+0x18/0xad
      [  599.709340] RIP: 0033:0x7f51945eec47
      [  599.709340] RSP: 002b:00007ffc21c13e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [  599.709340] RAX: ffffffffffffffda RBX: ffffffff81096459 RCX: 00007f51945eec47
      [  599.709340] RDX: 00007ffc21c13f20 RSI: 0000000040489426 RDI: 0000000000000004
      [  599.709340] RBP: ffffc9000665bf98 R08: 00007f519450d700 R09: 00007f519450d700
      [  599.709340] R10: 00007f519450d9d0 R11: 0000000000000202 R12: 0000000000000046
      [  599.709340] R13: ffffc9000665bf78 R14: 0000000000000000 R15: 00007f5195574040
      [  599.709340]  ? trace_hardirqs_off_caller+0x43/0xb1
      [  599.709340] Code: 29 f0 49 39 d8 4c 0f 47 c3 49 03 81 58 01 00 00 44 89 c1 4c 01 c2 4c 29 c3 48 c1 f8 03 49 0f af c4 48 c1 e0 0c 4c 01 e8 48 01 c6 <f3> a4 31 f6 4$
      [  599.709340] RIP: read_extent_buffer+0xdb/0xf4 [btrfs] RSP: ffffc9000665ba00
      [  599.762057] ---[ end trace fe00d7af61b9f49e ]---
      
      This is because the 'offset' field starts at an offset of 37 bytes
      (offsetof(struct btrfs_file_extent_item, offset)), has a length of 8
      bytes and therefore attemping to read it causes a 1 byte access beyond
      the end of the leaf, as the first item's content in a leaf is located
      at the tail of the leaf, the item size is 44 bytes and the offset of
      that field plus its length (37 + 8 = 45) goes beyond the item's size
      by 1 byte.
      
      So fix this by accessing the 'offset' and 'disk_bytenr' fields after
      jumping to the 'out' label if we are processing an inline extent. We
      move the reading operation of the 'disk_bytenr' field too because we
      have the same problem as for the 'offset' field explained above when
      the inline data is less then 8 bytes. The access to the 'generation'
      field is also moved but just for the sake of grouping access to all
      the fields.
      
      Fixes: e1cbfd7b ("Btrfs: send, fix file hole not being preserved due to inline extent")
      Cc: <stable@vger.kernel.org>  # v4.12+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      24e52b11
    • F
      Btrfs: incremental send, fix invalid path for link commands · f5962781
      Filipe Manana 提交于
      In some scenarios an incremental send stream can contain link commands
      with an invalid target path. Such scenarios happen after moving some
      directory inode A, renaming a regular file inode B into the old name of
      inode A and finally creating a new hard link for inode B at directory
      inode A.
      
      Consider the following example scenario where this issue happens.
      
      Parent snapshot:
      
        .                                                      (ino 256)
        |
        |--- dir1/                                             (ino 257)
        |      |--- dir2/                                      (ino 258)
        |             |--- dir3/                               (ino 259)
        |                   |--- file1                         (ino 261)
        |                   |--- dir4/                         (ino 262)
        |
        |--- dir5/                                             (ino 260)
      
      Send snapshot:
      
        .                                                      (ino 256)
        |
        |--- dir1/                                             (ino 257)
               |--- dir2/                                      (ino 258)
               |      |--- dir3/                               (ino 259)
               |            |--- dir4                          (ino 261)
               |
               |--- dir6/                                      (ino 263)
                      |--- dir44/                              (ino 262)
                             |--- file11                       (ino 261)
                             |--- dir55/                       (ino 260)
      
      When attempting to apply the corresponding incremental send stream, a
      link command contains an invalid target path which makes the receiver
      fail. The following is the verbose output of the btrfs receive command:
      
        receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
        utimes
        utimes dir1
        utimes dir1/dir2/dir3
        utimes
        rename dir1/dir2/dir3/dir4 -> o262-7-0
        link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
        link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
        ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
      
      The following steps happen during the computation of the incremental send
      stream the lead to this issue:
      
      1) When processing inode 261, we orphanize inode 262 due to a name/location
         collision with one of the new hard links for inode 261 (created in the
         second step below).
      
      2) We create one of the 2 new hard links for inode 261, the one whose
         location is at "dir1/dir2/dir3/dir4".
      
      3) We then attempt to create the other new hard link for inode 261, which
         has inode 262 as its parent directory. Because the path for this new
         hard link was computed before we started processing the new references
         (hard links), it reflects the old name/location of inode 262, that is,
         it does not account for the orphanization step that happened when
         we started processing the new references for inode 261, whence it is
         no longer valid, causing the receiver to fail.
      
      So fix this issue by recomputing the full path of new references if we
      ended up orphanizing other inodes which are directories.
      
      A test case for fstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      f5962781
  8. 06 7月, 2017 2 次提交
  9. 30 6月, 2017 9 次提交
    • Q
      btrfs: Remove false alert when fiemap range is smaller than on-disk extent · 848c23b7
      Qu Wenruo 提交于
      Commit 4751832d ("btrfs: fiemap: Cache and merge fiemap extent before
      submit it to user") introduced a warning to catch unemitted cached
      fiemap extent.
      
      However such warning doesn't take the following case into consideration:
      
      0			4K			8K
      |<---- fiemap range --->|
      |<----------- On-disk extent ------------------>|
      
      In this case, the whole 0~8K is cached, and since it's larger than
      fiemap range, it break the fiemap extent emit loop.
      This leaves the fiemap extent cached but not emitted, and caught by the
      final fiemap extent sanity check, causing kernel warning.
      
      This patch removes the kernel warning and renames the sanity check to
      emit_last_fiemap_cache() since it's possible and valid to have cached
      fiemap extent.
      Reported-by: NDavid Sterba <dsterba@suse.cz>
      Reported-by: NAdam Borowski <kilobyte@angband.pl>
      Fixes: 4751832d ("btrfs: fiemap: Cache and merge fiemap extent ...")
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      848c23b7
    • J
      btrfs: Don't clear SGID when inheriting ACLs · b7f8a09f
      Jan Kara 提交于
      When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
      set, DIR1 is expected to have SGID bit set (and owning group equal to
      the owning group of 'DIR0'). However when 'DIR0' also has some default
      ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
      'DIR1' to get cleared if user is not member of the owning group.
      
      Fix the problem by moving posix_acl_update_mode() out of
      __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
      called when inheriting ACLs which is what we want as it prevents SGID
      bit clearing and the mode has been properly set by posix_acl_create()
      anyway.
      
      Fixes: 07393101
      CC: stable@vger.kernel.org
      CC: linux-btrfs@vger.kernel.org
      CC: David Sterba <dsterba@suse.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b7f8a09f
    • C
      btrfs: fix integer overflow in calc_reclaim_items_nr · 6374e57a
      Chris Mason 提交于
      Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
      v4.12-rc6.  This was because commit 70e7af24 made it possible for
      calc_reclaim_items_nr() to return a negative number.  It's not really a
      bug in that commit, it just didn't go far enough down the stack to find
      all the possible 64->32 bit overflows.
      
      This switches calc_reclaim_items_nr() to return a u64 and changes everyone
      that uses the results of that math to u64 as well.
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Fixes: 70e7af24 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
      Signed-off-by: NChris Mason <clm@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6374e57a
    • D
      btrfs: scrub: fix target device intialization while setting up scrub context · ded56184
      David Sterba 提交于
      The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a
      helper but wrongly sets up the target device. Incidentally there's a
      local variable with the same name as a parameter in the previous
      function, so this got caught during runtime as crash in test btrfs/027.
      Reported-by: NChris Mason <clm@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ded56184
    • Q
      btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges · bc42bda2
      Qu Wenruo 提交于
      [BUG]
      For the following case, btrfs can underflow qgroup reserved space
      at an error path:
      (Page size 4K, function name without "btrfs_" prefix)
      
               Task A                  |             Task B
      ----------------------------------------------------------------------
      Buffered_write [0, 2K)           |
      |- check_data_free_space()       |
      |  |- qgroup_reserve_data()      |
      |     Range aligned to page      |
      |     range [0, 4K)          <<< |
      |     4K bytes reserved      <<< |
      |- copy pages to page cache      |
                                       | Buffered_write [2K, 4K)
                                       | |- check_data_free_space()
                                       | |  |- qgroup_reserved_data()
                                       | |     Range alinged to page
                                       | |     range [0, 4K)
                                       | |     Already reserved by A <<<
                                       | |     0 bytes reserved      <<<
                                       | |- delalloc_reserve_metadata()
                                       | |  And it *FAILED* (Maybe EQUOTA)
                                       | |- free_reserved_data_space()
                                            |- qgroup_free_data()
                                               Range aligned to page range
                                               [0, 4K)
                                               Freeing 4K
      (Special thanks to Chandan for the detailed report and analyse)
      
      [CAUSE]
      Above Task B is freeing reserved data range [0, 4K) which is actually
      reserved by Task A.
      
      And at writeback time, page dirty by Task A will go through writeback
      routine, which will free 4K reserved data space at file extent insert
      time, causing the qgroup underflow.
      
      [FIX]
      For btrfs_qgroup_free_data(), add @reserved parameter to only free
      data ranges reserved by previous btrfs_qgroup_reserve_data().
      So in above case, Task B will try to free 0 byte, so no underflow.
      Reported-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Tested-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bc42bda2
    • Q
      btrfs: qgroup: Introduce extent changeset for qgroup reserve functions · 364ecf36
      Qu Wenruo 提交于
      Introduce a new parameter, struct extent_changeset for
      btrfs_qgroup_reserved_data() and its callers.
      
      Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
      which range it reserved in current reserve, so it can free it in error
      paths.
      
      The reason we need to export it to callers is, at buffered write error
      path, without knowing what exactly which range we reserved in current
      allocation, we can free space which is not reserved by us.
      
      This will lead to qgroup reserved space underflow.
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      364ecf36
    • Q
      btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write... · a12b877b
      Qu Wenruo 提交于
      btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
      
      [BUG]
      Under the following case, we can underflow qgroup reserved space.
      
                  Task A                |            Task B
      ---------------------------------------------------------------
       Quota disabled                   |
       Buffered write                   |
       |- btrfs_check_data_free_space() |
       |  *NO* qgroup space is reserved |
       |  since quota is *DISABLED*     |
       |- All pages are copied to page  |
          cache                         |
                                        | Enable quota
                                        | Quota scan finished
                                        |
                                        | Sync_fs
                                        | |- run_delalloc_range
                                        | |- Write pages
                                        | |- btrfs_finish_ordered_io
                                        |    |- insert_reserved_file_extent
                                        |       |- btrfs_qgroup_release_data()
                                        |          Since no qgroup space is
                                                   reserved in Task A, we
                                                   underflow qgroup reserved
                                                   space
      This can be detected by fstest btrfs/104.
      
      [CAUSE]
      In insert_reserved_file_extent() we tell qgroup to release the @ram_bytes
      size of qgroup reserved_space in all cases.
      And btrfs_qgroup_release_data() will check if quotas are enabled.
      
      However in the above case, the buffered write happens before quota is
      enabled, so we don't have the reserved space for that range.
      
      [FIX]
      In insert_reserved_file_extent(), we tell qgroup to release the acctual
      byte number it released.
      In the above case, since we don't have the reserved space, we tell
      qgroups to release 0 byte, so the problem can be fixed.
      
      And thanks to the @reserved parameter introduced by the qgroup rework,
      and previous patch to return released bytes, the fix can be as small as
      10 lines.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog updates ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a12b877b
    • Q
      btrfs: qgroup: Return actually freed bytes for qgroup release or free data · 7bc329c1
      Qu Wenruo 提交于
      btrfs_qgroup_release/free_data() only returns 0 or a negative error
      number (ENOMEM is the only possible error).
      
      This is normally good enough, but sometimes we need the exact byte
      count it freed/released.
      
      Change it to return actually released/freed bytenr number instead of 0
      for success.
      And slightly modify related extent_changeset structure, since in btrfs
      one no-hole data extent won't be larger than 128M, so "unsigned int"
      is large enough for the use case.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7bc329c1
    • Q
      btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function · d1b8b94a
      Qu Wenruo 提交于
      Quite a lot of qgroup corruption happens due to wrong time of calling
      btrfs_qgroup_prepare_account_extents().
      
      Since the safest time is to call it just before
      btrfs_qgroup_account_extents(), there is no need to separate these 2
      functions.
      
      Merging them will make code cleaner and less bug prone.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog and comment adjustments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d1b8b94a