1. 31 May, 2018 12 commits
  2. 26 May, 2018 2 commits
    • proc: fix smaps and meminfo alignment · 6c04ab0e
      Committed by Hugh Dickins
      The 4.17-rc /proc/meminfo and /proc/<pid>/smaps look ugly: single-digit
      numbers (commonly 0) are misaligned.
      
      Remove seq_put_decimal_ull_width()'s leftover optimization for single
      digits: it's wrong now that num_to_str() takes care of the width.
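
A user-space sketch of the idea, assuming the formatting routine pads to the requested width itself, so a single-digit value needs no special case. The helper name `format_decimal_width` is hypothetical; it stands in for the kernel helpers named above and is not the kernel implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for a width-aware formatter. Right-aligns num in a
 * field of at least `width` characters, treating 0 exactly like any
 * multi-digit value. Returns the number of characters written, or -1 if
 * the output did not fit.
 */
static int format_decimal_width(char *out, size_t size,
                                unsigned long long num, int width)
{
        int n = snprintf(out, size, "%*llu", width, num);

        return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Calling it with a value of 0 and a width of 8 yields seven spaces followed by `0`, so columns of small and large numbers line up.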
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805241554210.1326@eggly.anvils
      Fixes: d1be35cb ("proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Andrei Vagin <avagin@openvz.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio" · 3373de20
      Committed by Changwei Ge
      This reverts commit ba16ddfb ("ocfs2/o2hb: check len for
      bio_add_page() to avoid getting incorrect bio").
      
      In my testing, this patch introduces a problem: mkfs cannot create
      more than 16 slots with a 4k block size.

      The original logic is actually safe in the situation that patch
      describes, so revert it.
      
      Attach test log:
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio failed, page ffffea0002d7ed40, len 0, vec_len 4096, vec_start 0,bi_sector 8192
        (mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5
        (mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5
        (mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5
      
      Link: http://lkml.kernel.org/r/SIXPR06MB0461721F398A5A92FC68C39ED5920@SIXPR06MB0461.apcprd06.prod.outlook.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 24 May, 2018 1 commit
    • Btrfs: fix error handling in btrfs_truncate() · d5014738
      Committed by Omar Sandoval
      Jun Wu at Facebook reported that an internal service was seeing a return
      value of 1 from ftruncate() on Btrfs in some cases. This is coming from
      the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().
      
      btrfs_truncate() uses two variables for error handling, ret and err.
      When btrfs_truncate_inode_items() returns non-zero, we set err to the
      return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
      only set err if ret is an error (i.e., negative).
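
A minimal sketch of the rule described above, in user-space C. The helper `record_error` is hypothetical and only models the ret/err bookkeeping; the value 1 for NEED_TRUNCATE_BLOCK is taken from the return value reported in this message.

```c
#include <assert.h>
#include <errno.h>

/* Non-error status, as reported above (ftruncate() leaked the value 1). */
#define NEED_TRUNCATE_BLOCK 1

/*
 * Hypothetical helper: fold the return value of one step into the error
 * that will be returned to the caller. Only negative values are errors;
 * positive status codes such as NEED_TRUNCATE_BLOCK must not be recorded.
 */
static int record_error(int err, int ret)
{
        if (ret < 0)
                err = ret;
        return err;
}
```

With this rule, `record_error(0, NEED_TRUNCATE_BLOCK)` stays 0, while a negative value such as `-ENOSPC` is propagated unchanged.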
      
      To reproduce the issue: mount a filesystem with -o compress-force=zstd
      and the following program will encounter a return value of 1 from
      ftruncate():
      
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      int main(void) {
              char buf[256] = { 0 };
              int ret;
              int fd;
      
              fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
              if (fd == -1) {
                      perror("open");
                      return EXIT_FAILURE;
              }
      
              if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                      perror("write");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              if (fsync(fd) == -1) {
                      perror("fsync");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              ret = ftruncate(fd, 128);
              if (ret) {
                      printf("ftruncate() returned %d\n", ret);
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              close(fd);
              return EXIT_SUCCESS;
      }
      
      Fixes: ddfae63c ("btrfs: move btrfs_truncate_block out of trans handle")
      CC: stable@vger.kernel.org # 4.15+
      Reported-by: Jun Wu <quark@fb.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 22 May, 2018 10 commits
    • aio: fix io_destroy(2) vs. lookup_ioctx() race · baf10564
      Committed by Al Viro
      kill_ioctx() used to have an explicit RCU delay between removing the
      reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
      At some point that delay had been removed, on the theory that
      percpu_ref_kill() itself contained an RCU delay.  Unfortunately, that was
      the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
      by lookup_ioctx().  As a result, we could get ctx freed right under
      lookup_ioctx().  Tejun has fixed that in a6d7cff4 ("fs/aio: Add explicit
      RCU grace period when freeing kioctx"); however, that fix is not enough.
      
      Suppose io_destroy() from one thread races with e.g. io_setup() from another;
      CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
      has picked it (under rcu_read_lock()).  Then CPU1 proceeds to drop the
      refcount, getting it to 0 and triggering a call of free_ioctx_users(),
      which proceeds to drop the secondary refcount and once that reaches zero
      calls free_ioctx_reqs().  That does
              INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
              queue_rcu_work(system_wq, &ctx->free_rwork);
      and schedules freeing the whole thing after RCU delay.
      
      In the meantime CPU2 has gotten around to percpu_ref_get(), bumping the
      refcount from 0 to 1 and returned the reference to io_setup().
      
      Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get
      freed until after percpu_ref_get().  Sure, we'd increment the counter before
      ctx can be freed.  Now we are out of rcu_read_lock() and there's nothing to
      stop freeing of the whole thing.  Unfortunately, CPU2 assumes that since it
      has grabbed the reference, ctx is *NOT* going away until it gets around to
      dropping that reference.
      
      The fix is obvious - use percpu_ref_tryget_live() and treat failure as a miss.
      It's no costlier than what we currently do in the normal case, it's safe to
      call since freeing *is* delayed and it closes the race window - either
      lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users
      won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx()
      fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see
      the object in question at all.
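
The "treat failure as a miss" pattern can be sketched in user-space C. This is not percpu_ref: it swaps in a simpler increment-unless-zero counter built on C11 atomics, and the names `ctx_ref`, `ctx_tryget` and `ctx_put` are hypothetical. As in the message above, it assumes the memory itself stays valid while a lookup is in progress.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference counter; 0 means the object is being torn down. */
struct ctx_ref {
        atomic_long users;
};

/*
 * Take a reference only if the count is still non-zero. A lookup that
 * loses the race with teardown gets false and must behave as if the
 * object was never found, instead of bumping the count from 0 to 1.
 */
static bool ctx_tryget(struct ctx_ref *r)
{
        long old = atomic_load(&r->users);

        while (old > 0) {
                /* On failure, old is reloaded with the current value. */
                if (atomic_compare_exchange_weak(&r->users, &old, old + 1))
                        return true;
        }
        return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool ctx_put(struct ctx_ref *r)
{
        return atomic_fetch_sub(&r->users, 1) == 1;
}
```

Either the lookup wins and the count cannot reach zero until the caller drops its reference, or the lookup fails and the count is left untouched.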
      
      Cc: stable@kernel.org
      Fixes: a6d7cff4 "fs/aio: Add explicit RCU grace period when freeing kioctx"
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • ext2: fix a block leak · 5aa1437d
      Committed by Al Viro
      Open a file, unlink it, then use ioctl(2) to make it immutable or
      append-only.  Now close it and watch the blocks *not* freed...

      Immutable/append-only checks belong in ->setattr().
      Note: the bug is old, and a backport to anything prior to 737f2e93
      ("ext2: convert to use the new truncate convention") will need
      these checks lifted into ext2_setattr().
      
      Cc: stable@kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed · 3819bb0d
      Committed by Al Viro
      That can (and does, on some filesystems) happen - ->mkdir() (and thus
      vfs_mkdir()) can legitimately leave its argument negative and just
      unhash it, counting upon the lookup to pick the object we'd created
      next time we try to look at that name.
      
      Some vfs_mkdir() callers forget about that possibility...
      Acked-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • cachefiles: vfs_mkdir() might succeed leaving dentry negative unhashed · 9c3e9025
      Committed by Al Viro
      That can (and does, on some filesystems) happen - ->mkdir() (and thus
      vfs_mkdir()) can legitimately leave its argument negative and just
      unhash it, counting upon the lookup to pick the object we'd created
      next time we try to look at that name.
      
      Some vfs_mkdir() callers forget about that possibility...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • unfuck sysfs_mount() · 7b745a4e
      Committed by Al Viro
      new_sb is left uninitialized in case of early failures in kernfs_mount_ns(),
      and while IS_ERR(root) is true in all such cases, using IS_ERR(root) || !new_sb
      is not a solution - IS_ERR(root) is true in some cases when new_sb is true.
      
      Make sure new_sb is initialized (and matches the reality) in all cases and
      fix the condition for dropping kobj reference - we want it done precisely
      in those situations where the reference has not been transferred into a new
      super_block instance.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • kernfs: deal with kernfs_fill_super() failures · 82382ace
      Committed by Al Viro
      Make sure that info->node is initialized early, so that kernfs_kill_sb()
      can list_del() it safely.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • cramfs: Fix IS_ENABLED typo · 08a8f308
      Committed by Joe Perches
      There's an extra C here...
      
      Fixes: 99c18ce5 ("cramfs: direct memory access support")
      Acked-by: Nicolas Pitre <nico@linaro.org>
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • befs_lookup(): use d_splice_alias() · f4e4d434
      Committed by Al Viro
      RTFS(Documentation/filesystems/nfs/Exporting) if you try to make
      something exportable.
      
      Fixes: ac632f5b "befs: add NFS export support"
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • affs_lookup: switch to d_splice_alias() · 87fbd639
      Committed by Al Viro
      Making something exportable takes more than providing ->s_export_ops.
      In particular, ->lookup() *MUST* use d_splice_alias() instead of
      d_add().
      
      Reading Documentation/filesystems/nfs/Exporting would've been a good idea;
      as it is, exporting AFFS is badly (and exploitably) broken.
      
      Partially-Fixes: ed4433d7 "fs/affs: make affs exportable"
      Acked-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • affs_lookup(): close a race with affs_remove_link() · 30da870c
      Committed by Al Viro
      We unlock the directory hash too early - if we are looking at a secondary
      link and the primary (in another directory) gets removed just as we unlock,
      we could have the old primary moved in place of the secondary, leaving
      us to look into a freed entry (and leaving our dentry with ->d_fsdata
      pointing to a freed entry).
      
      Cc: stable@vger.kernel.org # 2.4.4+
      Acked-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 19 May, 2018 1 commit
  6. 18 May, 2018 1 commit
    • proc: do not access cmdline nor environ from file-backed areas · 7f7ccc2c
      Committed by Willy Tarreau
      proc_pid_cmdline_read() and environ_read() directly access the target
      process' VM to retrieve the command line and environment. If this
      process remaps these areas onto a file via mmap(), the requesting
      process may experience various issues such as extra delays if the
      underlying device is slow to respond.
      
      Let's simply refuse to access file-backed areas in these functions.
      For this we add a new FOLL_ANON gup flag that is passed to all calls
      to access_remote_vm(). The code already takes care of such failures
      (including unmapped areas). Accesses via /proc/pid/mem were not
      changed though.
      
      This was assigned CVE-2018-1120.
      
      Note for stable backports: the patch may apply to kernels prior to 4.11
      but silently miss one location; it must be checked that no call to
      access_remote_vm() keeps zero as the last argument.
      Reported-by: Qualys Security Advisory <qsa@qualys.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 17 May, 2018 6 commits
    • btrfs: fix crash when trying to resume balance without the resume flag · 02ee654d
      Committed by Anand Jain
      We set the BTRFS_BALANCE_RESUME flag only in btrfs_recover_balance(),
      which isn't called during a remount. So when resuming a paused
      balance we hit this BUG:
      
       kernel: kernel BUG at fs/btrfs/volumes.c:3890!
       ::
       kernel:  balance_kthread+0x51/0x60 [btrfs]
       kernel:  kthread+0x111/0x130
       ::
       kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8
      
      Reproducer:
        On a mounted filesystem:
      
        btrfs balance start --full-balance /btrfs
        btrfs balance pause /btrfs
        mount -o remount,ro /dev/sdb /btrfs
        mount -o remount,rw /dev/sdb /btrfs
      
      To fix this set the BTRFS_BALANCE_RESUME flag in
      btrfs_resume_balance_async().
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Fix delalloc inodes invalidation during transaction abort · fe816d0f
      Committed by Nikolay Borisov
      When a transaction is aborted, btrfs_cleanup_transaction is called to
      clean up all the various in-flight bits and pieces which might be
      active. One of those is delalloc inodes - inodes which have dirty
      pages which haven't been persisted yet. Currently the process of
      freeing such delalloc inodes in exceptional circumstances such as
      transaction abort boils down to calling btrfs_invalidate_inodes, whose
      sole job is to invalidate the dentries for all inodes related to a
      root. This is in fact wrong and insufficient since such delalloc inodes
      will likely have pending pages or ordered-extents and will be linked to
      the sb->s_inode_list. This means that unmounting a btrfs instance with
      an aborted transaction could potentially leave inodes/their pages
      visible to the system long after their superblock has been freed. This
      in turn leads to a "use-after-free" situation once page shrink is
      triggered. This situation could be simulated by running generic/019,
      which would cause such inodes to be left hanging, followed by
      generic/176, which causes memory pressure and page eviction, which leads
      to touching the freed super block instance. This situation is
      additionally detected by the unmount code of VFS with the following
      message:
      
      "VFS: Busy inodes after unmount of Self-destruct in 5 seconds.  Have a nice day..."
      
      Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      in free_fs_root for the same reason.
      
      This patch aims to rectify the situation by doing the following:

      1. Change btrfs_destroy_delalloc_inodes so that it calls
      invalidate_inode_pages2 for every inode on the delalloc list. This
      ensures that all the pages of the inode are released. This function
      boils down to calling btrfs_releasepage. During testing I observed cases
      where inodes on the delalloc list had an i_count of 0, so this
      necessitates using igrab to be sure we are working on a non-freed inode.

      2. Since calling btrfs_releasepage might queue delayed iputs, move the
      call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
      calling run_delayed_iputs for the last time. This is necessary to ensure
      that delayed iputs are run.

      Note: this patch is tagged for 4.14 stable. The fix applies to older
      versions too, but needs to be backported manually due to conflicts.
      
      CC: stable@vger.kernel.org # 4.14.x: 2b877331: btrfs: Split btrfs_del_delalloc_inode into 2 functions
      CC: stable@vger.kernel.org # 4.14.x
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comment to igrab ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Split btrfs_del_delalloc_inode into 2 functions · 2b877331
      Committed by Nikolay Borisov
      This is in preparation of fixing delalloc inodes leakage on transaction
      abort. Also export the new function.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix reading stale metadata blocks after degraded raid1 mounts · 02a3307a
      Committed by Liu Bo
      If a btree block, aka. extent buffer, is not available in the extent
      buffer cache, it'll be read out from the disk instead, i.e.
      
      btrfs_search_slot()
        read_block_for_search()  # hold parent and its lock, go to read child
          btrfs_release_path()
          read_tree_block()  # read child
      
      Unfortunately, the parent lock got released before reading the child, so
      commit 5bdd3536 ("Btrfs: Fix block generation verification race") had
      used 0 as parent transid to read the child block.  This forces
      read_tree_block() not to check whether parent transid differs from the
      generation id of the child that it reads from disk.
      
      A simple PoC is included in btrfs/124,
      
      0. A two-disk raid1 btrfs,
      
      1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
      
      2. Mount this filesystem and put it in use; after a while, the device
         tree's root gets COWed but block A hasn't been allocated/overwritten yet.
      
      3. Umount it and reload the btrfs module to remove both disks from the
         global @fs_devices list.
      
      4. mount -odegraded dev1 and write some data, so now block A is allocated
         to be a leaf in checksum tree.  Note that only dev1 has the latest
         metadata of this filesystem.
      
      5. Umount it and mount it again normally (with both disks). Since raid1
         can pick a disk by the writer task's pid, if btrfs_search_slot()
         needs to read block A, dev2, which does NOT have the latest metadata,
         might be read for block A, and we get a stale block A.
      
      6. As parent transid is not checked, block A is marked as uptodate and
         put into the extent buffer cache, so the future search won't bother
         to read disk again, which means it'll make changes on this stale
         one and make it dirty and flush it onto disk.
      
      To avoid the problem, parent transid needs to be passed to
      read_tree_block().
      
      In order to get a valid parent transid, we need to hold the parent's
      lock until finishing reading child.
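
The check that a zero parent transid disables can be sketched as follows. This is a user-space illustration under stated assumptions, not the btrfs code: `struct tree_block` and `block_is_current` are hypothetical names, and only the generation comparison described above is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, minimal view of a tree block read from one mirror. */
struct tree_block {
        uint64_t generation;    /* transaction id that last wrote the block */
};

/*
 * Accept the block only if its generation matches what the parent
 * recorded. Passing 0 skips the check entirely, which is how a stale
 * copy from the out-of-date disk can be accepted as up to date.
 */
static bool block_is_current(const struct tree_block *blk,
                             uint64_t parent_transid)
{
        if (parent_transid == 0)
                return true;
        return blk->generation == parent_transid;
}
```

With a real parent transid, a stale copy is rejected and the other mirror can be tried; with 0 the same stale copy passes.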
      
      This patch needs to be slightly adapted for stable kernels, the
      &first_key parameter added to read_tree_block() is from 4.16+
      (581c1760). The fix is to replace 0 by 'gen'.
      
      Fixes: 5bdd3536 ("Btrfs: Fix block generation verification race")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: property: Set incompat flag if lzo/zstd compression is set · 1a63c198
      Committed by Misono Tomohiro
      Incompat flag of LZO/ZSTD compression should be set at:
      
       1. mount time (-o compress/compress-force)
       2. when defrag is done
       3. when property is set
      
      Currently 3. is missing; this commit adds it.

      This could lead to a filesystem that uses ZSTD but is not marked as
      such. If a kernel without ZSTD support encounters a ZSTD compressed
      extent, it will handle that but this could be confusing to the user.
      
      Typically the filesystem is mounted with the ZSTD option, but the
      discrepancy can arise when a filesystem is never mounted with ZSTD and
      then the property on some file is set (and some new extents are
      written). A simple mount with -o compress=zstd will fix that up on an
      unpatched kernel.
      
      Same goes for LZO, but this has been around for a very long time
      (2.6.37) so it's unlikely that a pre-LZO kernel would be used.
      
      Fixes: 5c1aab1d ("btrfs: Add zstd support")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add user visible impact ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix duplicate extents after fsync of file with prealloc extents · 31d11b83
      Committed by Filipe Manana
      In commit 471d557a ("Btrfs: fix loss of prealloc extents past i_size
      after fsync log replay"), on fsync, we started to always log all prealloc
      extents beyond an inode's i_size in order to avoid losing them after a
      power failure. However in some cases this can lead the log replay
      code to create duplicate extent items, with different lengths, in the
      extent tree. That happens because, as of that commit, we can now log
      extent items based on extent maps that are not on the "modified" list
      of extent maps of the inode's extent map tree. Logging extent items based
      on extent maps is used during the fast fsync path to save time and for
      this to work reliably it requires that the extent maps are not merged
      with other adjacent extent maps - having the extent maps in the list
      of modified extents gives such guarantee.
      
      Consider the following example, captured during a long run of fsstress,
      which illustrates this problem.
      
      We have inode 271, in the filesystem tree (root 5), to which all of the
      following operations and discussion apply.

      A buffered write starts at offset 312391 with a length of 933471 bytes
      (end offset at 1245862). At this point we have, for this inode, the
      following extent maps with their field values:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 376832, block_start 1106399232,
            block_len 376832, orig_block_len 376832
      em C, start 417792, orig_start 417792, len 782336, block_start
            18446744073709551613, block_len 0, orig_block_len 0
      em D, start 1200128, orig_start 1200128, len 835584, block_start
            1106776064, block_len 835584, orig_block_len 835584
      em E, start 2035712, orig_start 2035712, len 245760, block_start
            1107611648, block_len 245760, orig_block_len 245760
      
      Extent map A corresponds to a hole and extent maps D and E correspond to
      preallocated extents.
      
      Extent map D ends where extent map E begins (1106776064 + 835584 =
      1107611648), but these extent maps were not merged because they are in
      the inode's list of modified extent maps.
      
      An fsync against this inode is made, which triggers the fast path
      (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback
      of the data previously written using buffered IO, and when the respective
      ordered extent finishes, btrfs_drop_extents() is called against the
      (aligned) range 311296..1249279. This causes a split of extent map D at
      btrfs_drop_extent_cache(), replacing extent map D with a new extent map
      D', also added to the list of modified extents,  with the following
      values:
      
      em D', start 1249280, orig_start of 1200128,
             block_start 1106825216 (= 1106776064 + 1249280 - 1200128),
             orig_block_len 835584,
             block_len 786432 (835584 - (1249280 - 1200128))
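
The arithmetic for D' can be reproduced with a small user-space sketch. The struct and the helper `em_tail` are hypothetical and keep only the fields needed here; the shift-by-the-same-delta rule is an assumption that holds for the uncompressed extents in this example, and this is not the btrfs_drop_extent_cache() code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced extent map: file range plus on-disk range. */
struct em {
        uint64_t start;         /* file offset */
        uint64_t len;           /* length in the file */
        uint64_t block_start;   /* disk byte of the first mapped byte */
        uint64_t block_len;     /* length on disk */
};

/*
 * Keep only the part of an extent map at or after new_start. File offset
 * and disk offset advance by the same amount, and both lengths shrink by
 * that amount.
 */
static struct em em_tail(struct em em, uint64_t new_start)
{
        uint64_t diff = new_start - em.start;
        struct em tail = {
                .start = new_start,
                .len = em.len - diff,
                .block_start = em.block_start + diff,
                .block_len = em.block_len - diff,
        };

        return tail;
}
```

Applied to extent map D (start 1200128, block_start 1106776064, lengths 835584) with new_start 1249280, this gives the block_start and block_len listed for D'.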
      
      Then, during the fast fsync, btrfs_log_changed_extents() is called and
      extent maps D' and E are removed from the list of modified extents. The
      flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged
      clear_em_logging() is called on each of them, and that causes extent map E
      to be merged with extent map D' (try_merge_map()), resulting in D' being
      deleted and E adjusted to:
      
      em E, start 1249280, orig_start 1200128, len 1032192,
            block_start 1106825216, block_len 1032192,
            orig_block_len 245760
      
      A direct IO write at offset 1847296 and length of 360448 bytes (end offset
      at 2207744) starts, and at that moment the following extent maps exist for
      our inode:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
            block_len 270336, orig_block_len 376832
      em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
            block_len 937984, orig_block_len 937984
      em E (prealloc), start 1249280, orig_start 1200128, len 1032192,
            block_start 1106825216, block_len 1032192, orig_block_len 245760
      
      The dio write results in drop_extent_cache() being called twice. The first
      time for a range that starts at offset 1847296 and ends at offset 2035711
      (length of 188416), which results in a double split of extent map E,
      replacing it with two new extent maps:
      
      em F, start 1249280, orig_start 1200128, block_start 1106825216,
            block_len 598016, orig_block_len 598016
      em G, start 2035712, orig_start 1200128, block_start 1107611648,
            block_len 245760, orig_block_len 1032192
      
      It also creates a new extent map that represents a part of the requested
      IO (through create_io_em()):
      
      em H, start 1847296, len 188416, block_start 1107423232, block_len 188416
      
      The second call to drop_extent_cache() has a range with a start offset of
      2035712 and end offset of 2207743 (length of 172032). This leads to
      replacing extent map G with a new extent map I with the following values:
      
      em I, start 2207744, orig_start 1200128, block_start 1107783680,
            block_len 73728, orig_block_len 1032192
      
      It also creates a new extent map that represents the second part of the
      requested IO (through create_io_em()):
      
      em J, start 2035712, len 172032, block_start 1107611648, block_len 172032
      
      The dio write set the inode's i_size to 2207744 bytes.
      
      After the dio write the inode has the following extent maps:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
            block_len 270336, orig_block_len 376832
      em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
            block_len 937984, orig_block_len 937984
      em F, start 1249280, orig_start 1200128, len 598016,
            block_start 1106825216, block_len 598016, orig_block_len 598016
      em H, start 1847296, orig_start 1200128, len 188416,
            block_start 1107423232, block_len 188416, orig_block_len 835584
      em J, start 2035712, orig_start 2035712, len 172032,
            block_start 1107611648, block_len 172032, orig_block_len 245760
      em I, start 2207744, orig_start 1200128, len 73728,
            block_start 1107783680, block_len 73728, orig_block_len 1032192
      
      Now make some change to the file, such as adding an xattr, and then
      fsync it again. This triggers a fast fsync path, and as of commit
      471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync
      log replay"), we use the extent map I to log a file extent item because
      it's a prealloc extent and it starts at an offset matching the inode's
      i_size. However when we log it, we create a file extent item with a value
      for the disk byte location that is wrong, as can be seen from the
      following output of "btrfs inspect-internal dump-tree":
      
       item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53
           generation 22 type 2 (prealloc)
           prealloc data disk byte 1106776064 nr 1032192
           prealloc data offset 1007616 nr 73728
      
      Here the disk byte value corresponds to a calculation based on some fields
      from the extent map I:
      
        1106776064 = block_start (1107783680) - 1007616 (extent_offset)
        extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616
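
The two lines above can be checked with a short user-space sketch. The function name `logged_disk_bytenr` is hypothetical; it only reproduces the arithmetic shown here and is not the logging code itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Disk byte value derived from an extent map the way the calculation
 * above describes it: subtract the extent offset (start - orig_start)
 * from block_start.
 */
static uint64_t logged_disk_bytenr(uint64_t start, uint64_t orig_start,
                                   uint64_t block_start)
{
        uint64_t extent_offset = start - orig_start;

        return block_start - extent_offset;
}
```

For extent map I (start 2207744, orig_start 1200128, block_start 1107783680) the result is 1106776064, the value that appears in the logged file extent item.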
      
      The disk byte value of 1106776064 clashes with disk byte values of the
      file extent items at offsets 1249280 and 1847296 in the fs tree:
      
              item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53
                      generation 20 type 2 (prealloc)
                      prealloc data disk byte 1106776064 nr 835584
                      prealloc data offset 49152 nr 598016
              item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53
                      generation 20 type 1 (regular)
                      extent data disk byte 1106776064 nr 835584
                      extent data offset 647168 nr 188416 ram 835584
                      extent compression 0 (none)
              item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53
                      generation 20 type 1 (regular)
                      extent data disk byte 1107611648 nr 245760
                      extent data offset 0 nr 172032 ram 245760
                      extent compression 0 (none)
              item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53
                      generation 20 type 2 (prealloc)
                      prealloc data disk byte 1107611648 nr 245760
                      prealloc data offset 172032 nr 73728
      
      Instead of the disk byte value of 1106776064, the value of 1107611648
      should have been logged. Also the data offset value should have been
      172032 and not 1007616.
      
      After a log replay we end up getting two extent items in the extent tree
      with different lengths, one of 835584, which is correct and existed
      before the log replay, and another one of 1032192 which is wrong and is
      based on the logged file extent item:
      
       item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53
          refs 2 gen 15 flags DATA
          extent data backref root 5 objectid 271 offset 1200128 count 2
       item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53
          refs 1 gen 22 flags DATA
          extent data backref root 5 objectid 271 offset 1200128 count 1
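      
      As a rough illustration of why these two items conflict, the sketch
      below groups extent item keys by disk byte and flags any disk byte that
      appears with more than one length. The helper name and data layout are
      hypothetical; this is not how the filesystem checker is implemented.

```python
from collections import defaultdict

# (bytenr, length) keys of the two EXTENT_ITEMs shown above.
extent_items = [
    (1106776064, 835584),
    (1106776064, 1032192),
]


def find_conflicting_extents(items):
    """Return {bytenr: sorted lengths} for each bytenr seen with several lengths."""
    lengths_by_bytenr = defaultdict(set)
    for bytenr, length in items:
        lengths_by_bytenr[bytenr].add(length)
    return {
        bytenr: sorted(lengths)
        for bytenr, lengths in lengths_by_bytenr.items()
        if len(lengths) > 1
    }


conflicts = find_conflicting_extents(extent_items)
print(conflicts)
```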
      
      Obviously this leads to many problems and a filesystem check reports many
      errors:
      
       (...)
       checking extents
       Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1
       extent item 1106776064 has multiple extent items
       ref mismatch on [1106776064 835584] extent item 2, found 3
       Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680
       Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree
       Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70
       Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192
       backpointer mismatch on [1106776064 835584]
       checking free space cache
       block group 1103101952 has wrong amount of free space
       failed to load free space cache for block group 1103101952
       checking fs roots
       (...)
      
      So fix this by logging the prealloc extents beyond the inode's i_size
      based on searches in the subvolume tree instead of the extent maps.
      
      Fixes: 471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      31d11b83
  8. 14 May, 2018 7 commits
    • F
      Btrfs: fix xattr loss after power failure · 9a8fca62
      Filipe Manana committed
      If a file has xattrs and we fsync it (which clears the flags
      BTRFS_INODE_NEEDS_FULL_SYNC and BTRFS_INODE_COPY_EVERYTHING from its
      inode), the current transaction then commits, and we fsync it again
      (with neither of those bits set in its inode), we end up not logging
      all of its xattrs. This results in all xattrs being deleted when
      replaying the log after a power failure.
      
      Trivial reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ touch /mnt/foobar
        $ setfattr -n user.xa -v qwerty /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xab 0 64K" /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ getfattr --absolute-names --dump /mnt/foobar
        <empty output>
        $
      
      So fix this by making sure all xattrs are logged if we log a file's inode
      item and neither BTRFS_INODE_NEEDS_FULL_SYNC nor
      BTRFS_INODE_COPY_EVERYTHING was set in the inode.
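      
      The condition can be modelled with a toy Python sketch. The function
      names, the dictionary representation of xattrs and the replay rule
      (xattrs missing from the log are dropped) are simplifying assumptions
      based on the description above, not the actual btrfs code.

```python
def must_log_all_xattrs(logging_inode_item, needs_full_sync, copy_everything):
    """Hypothetical predicate: the case in which the fix logs every xattr."""
    return logging_inode_item and not (needs_full_sync or copy_everything)


def replay(on_disk_xattrs, logged_xattrs):
    """Toy log replay: any xattr that is absent from the log is dropped."""
    return {k: v for k, v in on_disk_xattrs.items() if k in logged_xattrs}


on_disk = {"user.xa": "qwerty"}

# Second fsync of the reproducer: the inode item is logged, both flags clear.
log_all = must_log_all_xattrs(True, False, False)
logged = dict(on_disk) if log_all else {}

after_replay = replay(on_disk, logged)
print(after_replay)
```

      In this model, logging nothing (the buggy case) leaves the file with no
      xattrs after replay, which mirrors the empty getfattr output in the
      reproducer.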
      
      Fixes: 36283bf7 ("Btrfs: fix fsync xattr loss in the fast fsync path")
      Cc: <stable@vger.kernel.org> # 4.2+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9a8fca62
    • R
      Btrfs: send, fix invalid access to commit roots due to concurrent snapshotting · 6f2f0b39
      Robbie Ko committed
      [BUG]
      A btrfs incremental send BUG happens when creating a snapshot of a
      snapshot that is being used by send.
      
      [REASON]
      The problem can happen if, while we are doing a send, one of the
      snapshots used (parent or send) is snapshotted, because snapshotting
      implies COWing the root of the source subvolume/snapshot.
      
      1. When doing an incremental send, the send process will get the commit
         roots from the parent and send snapshots, and add references to them
         through extent_buffer_get().
      
      2. When a snapshot/subvolume is snapshotted, its root node is COWed
         (transaction.c:create_pending_snapshot()).
      
      3. COWing releases the space used by the node immediately, through:
      
         __btrfs_cow_block()
         --btrfs_free_tree_block()
         ----btrfs_add_free_space(bytenr of node)
      
      4. Because send doesn't hold a transaction open, it's possible that
         the transaction used to create the snapshot commits, switches the
         commit root and the old space used by the previous root node gets
         assigned to some other node allocation. Allocation of a new node will
         use the existing extent buffer found in memory, to which we previously
         got a reference through extent_buffer_get(), and allows the extent
         buffer's content (pages) to be modified:
      
         btrfs_alloc_tree_block
         --btrfs_reserve_extent
         ----find_free_extent (get bytenr of old node)
         --btrfs_init_new_buffer (use bytenr of old node)
         ----btrfs_find_create_tree_block
         ------alloc_extent_buffer
         --------find_extent_buffer (get old node)
      
      5. So send can access invalid memory content and have unpredictable
         behaviour.
      
      [FIX]
      So we fix the problem by copying the commit roots of the send and
      parent snapshots and using those copies.
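      
      The difference between holding a reference and holding a copy can be
      sketched in Python. The ExtentBuffer class, the buffer cache and the
      byte number 4096 are hypothetical stand-ins chosen for illustration;
      they do not reflect the kernel's data structures.

```python
import copy


class ExtentBuffer:
    """Toy stand-in for an in-memory tree node."""

    def __init__(self, bytenr, keys):
        self.bytenr = bytenr
        self.keys = list(keys)


# Hypothetical in-memory cache of buffers, keyed by disk location.
buffer_cache = {}


def find_or_create_buffer(bytenr):
    """Return the cached buffer for bytenr, creating an empty one if needed."""
    if bytenr not in buffer_cache:
        buffer_cache[bytenr] = ExtentBuffer(bytenr, [])
    return buffer_cache[bytenr]


# The commit root of a snapshot that send is reading.
commit_root = find_or_create_buffer(4096)
commit_root.keys = [256, 257, 258]

shared_ref = commit_root                    # reference only: content can change
private_copy = copy.deepcopy(commit_root)   # private copy: content is stable

# The node is COWed, its space is freed, and a later allocation at the same
# bytenr finds the cached buffer and overwrites its content.
reused = find_or_create_buffer(4096)
reused.keys = [9001]

print(shared_ref.keys)
print(private_copy.keys)
```

      The reference observes the overwritten keys, while the private copy
      still holds the keys of the original commit root.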
      
      The call trace looks like this:
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ctree.c:1861!
       invalid opcode: 0000 [#1] SMP
       CPU: 6 PID: 24235 Comm: btrfs Tainted: P           O 3.10.105 #23721
       ffff88046652d680 ti: ffff88041b720000 task.ti: ffff88041b720000
       RIP: 0010:[<ffffffffa08dd0e8>] read_node_slot+0x108/0x110 [btrfs]
       RSP: 0018:ffff88041b723b68  EFLAGS: 00010246
       RAX: ffff88043ca6b000 RBX: ffff88041b723c50 RCX: ffff880000000000
       RDX: 000000000000004c RSI: ffff880314b133f8 RDI: ffff880458b24000
       RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88041b723c66
       R10: 0000000000000001 R11: 0000000000001000 R12: ffff8803f3e48890
       R13: ffff8803f3e48880 R14: ffff880466351800 R15: 0000000000000001
       FS:  00007f8c321dc8c0(0000) GS:ffff88047fcc0000(0000)
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007efd1006d000 CR3: 0000000213a24000 CR4: 00000000003407e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Stack:
       ffff88041b723c50 ffff8803f3e48880 ffff8803f3e48890 ffff8803f3e48880
       ffff880466351800 0000000000000001 ffffffffa08dd9d7 ffff88041b723c50
       ffff8803f3e48880 ffff88041b723c66 ffffffffa08dde85 a9ff88042d2c4400
       Call Trace:
       [<ffffffffa08dd9d7>] ? tree_move_down.isra.33+0x27/0x50 [btrfs]
       [<ffffffffa08dde85>] ? tree_advance+0xb5/0xc0 [btrfs]
       [<ffffffffa08e83d4>] ? btrfs_compare_trees+0x2d4/0x760 [btrfs]
       [<ffffffffa0982050>] ? finish_inode_if_needed+0x870/0x870 [btrfs]
       [<ffffffffa09841ea>] ? btrfs_ioctl_send+0xeda/0x1050 [btrfs]
       [<ffffffffa094bd3d>] ? btrfs_ioctl+0x1e3d/0x33f0 [btrfs]
       [<ffffffff81111133>] ? handle_pte_fault+0x373/0x990
       [<ffffffff8153a096>] ? atomic_notifier_call_chain+0x16/0x20
       [<ffffffff81063256>] ? set_task_cpu+0xb6/0x1d0
       [<ffffffff811122c3>] ? handle_mm_fault+0x143/0x2a0
       [<ffffffff81539cc0>] ? __do_page_fault+0x1d0/0x500
       [<ffffffff81062f07>] ? check_preempt_curr+0x57/0x90
       [<ffffffff8115075a>] ? do_vfs_ioctl+0x4aa/0x990
       [<ffffffff81034f83>] ? do_fork+0x113/0x3b0
       [<ffffffff812dd7d7>] ? trace_hardirqs_off_thunk+0x3a/0x6c
       [<ffffffff81150cc8>] ? SyS_ioctl+0x88/0xa0
       [<ffffffff8153e422>] ? system_call_fastpath+0x16/0x1b
       ---[ end trace 29576629ee80b2e1 ]---
      
      Fixes: 7069830a ("Btrfs: add btrfs_compare_trees function")
      CC: stable@vger.kernel.org # 3.6+
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6f2f0b39
    • D
      afs: Fix the non-encryption of calls · 4776cab4
      David Howells committed
      Some AFS servers refuse to accept unencrypted traffic, so can't be accessed
      with kAFS.  Set the AF_RXRPC security level to encrypt client calls to deal
      with this.
      
      Note that the security level of incoming service calls is set by the
      remote client and so isn't affected by this.
      
      This requires an AF_RXRPC patch to pass the value set by setsockopt to calls
      begun by the kernel.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      4776cab4
    • D
      afs: Fix CB.CallBack handling · 428edade
      David Howells committed
      The handling of CB.CallBack messages sent by the fileserver to the client
      is broken in that they are currently being processed after the reply has
      been transmitted.
      
      This is not what the fileserver expects, however.  It holds up change
      visibility until the reply comes so as to maintain cache coherency, and so
      expects the client to have to refetch the state on the affected files.
      
      Fix CB.CallBack handling to perform the callback break before sending the
      reply.
      
      The fileserver is free to hold up status fetches issued by other threads on
      the same client that occur in response to the callback until any pending
      changes have been committed.
      
      Fixes: d001648e ("rxrpc: Don't expose skbs to in-kernel users [ver #2]")
      Signed-off-by: David Howells <dhowells@redhat.com>
      428edade
    • D
      afs: Fix whole-volume callback handling · 68251f0a
      David Howells committed
      It's possible for an AFS file server to issue a whole-volume notification
      that callbacks on all the vnodes in the volume have been broken.  This is
      done for R/O and backup volumes (which don't have per-file callbacks) and
      for things like a volume being taken offline.
      
      Fix callback handling to detect whole-volume notifications, to track them
      across operations and to check them during inode validation.
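      
      One way to picture this is a per-volume break counter that each vnode
      samples when an operation starts and compares during validation. The
      sketch below is a simplified Python model under that assumption; the
      class, attribute and function names are hypothetical and are not taken
      from the kAFS source.

```python
class Volume:
    def __init__(self):
        # Incremented each time a whole-volume notification arrives.
        self.break_count = 0


class Vnode:
    def __init__(self, volume):
        self.volume = volume
        self.cb_valid = False
        self.seen_break_count = volume.break_count

    def begin_fetch(self):
        """Sample the volume counter before talking to the server."""
        return self.volume.break_count

    def end_fetch(self, break_count_at_start):
        """Record the callback together with the counter sampled at the start."""
        self.cb_valid = True
        self.seen_break_count = break_count_at_start

    def needs_revalidation(self):
        """True if there is no callback or the volume was broken since the fetch."""
        return (not self.cb_valid
                or self.seen_break_count != self.volume.break_count)


def whole_volume_break(volume):
    """Model a whole-volume notification from the file server."""
    volume.break_count += 1


vol = Volume()
vnode = Vnode(vol)

sampled = vnode.begin_fetch()
vnode.end_fetch(sampled)
print(vnode.needs_revalidation())   # callback recorded, no break yet

whole_volume_break(vol)
print(vnode.needs_revalidation())   # volume broken after the fetch
```

      Sampling the counter at the start of an operation means that a
      notification arriving while the operation is in flight is still
      detected at the next validation.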
      
      Fixes: c435ee34 ("afs: Overhaul the callback handling")
      Signed-off-by: David Howells <dhowells@redhat.com>
      68251f0a
    • M
      afs: Fix afs_find_server search loop · f9c1bba3
      Marc Dionne committed
      The code that looks up servers by addresses makes the assumption
      that the list of addresses for a server is sorted.  It exits the
      loop if it finds that the target address is larger than the
      current candidate.  As the list is not currently sorted, this
      can lead to a failure to find a matching server, which can cause
      callbacks from that server to be ignored.
      
      Remove the early exit case so that the complete list is searched.
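      
      The effect of the early exit on an unsorted list can be shown with a
      small Python sketch. Plain integers stand in for addresses and both
      function names are hypothetical; the early-exit rule follows the
      description above (stop once the target is larger than the current
      candidate).

```python
def find_with_early_exit(addresses, target):
    """Old behaviour: stop as soon as the target exceeds the candidate."""
    for candidate in addresses:
        if candidate == target:
            return True
        if target > candidate:
            # Only valid if the list is sorted so that no later entry
            # can match once this condition holds.
            break
    return False


def find_full_scan(addresses, target):
    """Fixed behaviour: always search the complete list."""
    for candidate in addresses:
        if candidate == target:
            return True
    return False


unsorted_addresses = [10, 30, 20]

print(find_with_early_exit(unsorted_addresses, 30))
print(find_full_scan(unsorted_addresses, 30))
```

      With the unsorted list, the early-exit version gives up at the first
      entry and misses the match that the full scan finds.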
      
      Fixes: d2ddc776 ("afs: Overhaul volume and server record caching and fileserver rotation")
      Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      f9c1bba3
    • D
      afs: Fix the handling of an unfound server in CM operations · a86b06d1
      David Howells committed
      If the client cache manager operations that need the server record
      (CB.CallBack, CB.InitCallBackState, and CB.InitCallBackState3) can't find
      the server record, they abort the call from the file server with
      RX_CALL_DEAD when they should return okay.
      
      Fixes: c35eccb1 ("[AFS]: Implement the CB.InitCallBackState3 operation.")
      Signed-off-by: David Howells <dhowells@redhat.com>
      a86b06d1