1. 02 Apr, 2020 1 commit
  2. 19 Feb, 2020 1 commit
  3. 15 Feb, 2020 3 commits
  4. 14 Feb, 2020 11 commits
  5. 13 Feb, 2020 10 commits
    • cifs: Fix mode output in debugging statements · f52aa79d
      Frank Sorenson committed
      A number of the debug statements output file or directory mode
      in hex.  Change these to print using octal.
      Signed-off-by: Frank Sorenson <sorenson@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
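To illustrate why octal is the friendlier notation for modes, here is a minimal userspace sketch. It is not the cifs code: `format_mode()` is a hypothetical helper, and the `"0x%x"` / `"0%o"` format strings are chosen for this example only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the cifs sources): render the same
 * mode value in hex and in octal so the two can be compared.
 */
static void format_mode(unsigned int mode, char *hex, char *oct, size_t len)
{
    assert(len > 0);
    snprintf(hex, len, "0x%x", mode); /* permission bits are hard to read */
    snprintf(oct, len, "0%o", mode);  /* matches the notation chmod uses */
}
```

For a regular file with permissions rw-r--r--, the octal string ends in the familiar `644`, while the hex string does not expose the permission triplets at all.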
    • io-wq: don't call kXalloc_node() with non-online node · 7563439a
      Jens Axboe committed
      Glauber reports a crash on init on a box he has:
      
       RIP: 0010:__alloc_pages_nodemask+0x132/0x340
       Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 <3b> 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
       RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
       RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
       RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002
       R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000
       R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0
       FS:  00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       PKRU: 55555554
       Call Trace:
        alloc_slab_page+0x46/0x320
        new_slab+0x9d/0x4e0
        ___slab_alloc+0x507/0x6a0
        ? io_wq_create+0xb4/0x2a0
        __slab_alloc+0x1c/0x30
        kmem_cache_alloc_node_trace+0xa6/0x260
        io_wq_create+0xb4/0x2a0
        io_uring_setup+0x97f/0xaa0
        ? io_remove_personalities+0x30/0x30
        ? io_poll_trigger_evfd+0x30/0x30
        do_syscall_64+0x5b/0x1c0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f4d116cb1ed
      
      which is due to the 'wqe' and 'worker' allocation being node affine.
      But it isn't valid to call the node affine allocation if the node isn't
      online.
      
      Set up structures even for offline nodes, as usual, but skip them in
      terms of thread setup to not waste resources. If the node isn't online,
      just alloc memory with NUMA_NO_NODE.
      Reported-by: Glauber Costa <glauber@scylladb.com>
      Tested-by: Glauber Costa <glauber@scylladb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
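The node selection described above can be sketched in userspace C. This is only an illustration under stated assumptions: the `node_is_online` table and the helpers `alloc_node_for()` and `should_start_workers()` are hypothetical stand-ins for the kernel's node_online() check, and no real allocation is performed.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1) /* same value the kernel uses for "no preference" */
#define MAX_NODES 4       /* hypothetical machine size for this sketch */

/* Hypothetical stand-in for the kernel's node_online() state. */
static const bool node_is_online[MAX_NODES] = { true, false, true, false };

/* Pick the node to pass to a node-affine allocator. */
static int alloc_node_for(int node)
{
    assert(node >= 0 && node < MAX_NODES);
    /* An offline node must not be used for a node-affine allocation. */
    return node_is_online[node] ? node : NUMA_NO_NODE;
}

/* Structures exist for every node, but only online nodes get threads. */
static bool should_start_workers(int node)
{
    assert(node >= 0 && node < MAX_NODES);
    return node_is_online[node];
}
```

The point of the split is that per-node structures are still created for every node, so later lookups by node index stay valid, while worker threads are only spawned where they can actually run.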
    • NFSv4: Fix revalidation of dentries with delegations · efeda80d
      Trond Myklebust committed
      If a dentry was not initially looked up while we were holding a
      delegation, then we do still need to revalidate that it still holds
      the same name. If there are multiple hard links to the same file,
      then all the hard links need validation.
      Reported-by: Benjamin Coddington <bcodding@redhat.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
      Tested-by: Benjamin Coddington <bcodding@redhat.com>
      [Anna: Put nfs_unset_verifier_delegated() under CONFIG_NFS_V4]
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • btrfs: sysfs, move device id directories to UUID/devinfo · 1b9867eb
      Anand Jain committed
      Originally it was planned to create device id directories under
      UUID/devinfo, but they ended up under UUID/devices by mistake. We really
      want them under devinfo so the bare device node names are not mixed with
      device ids and are easy to enumerate.
      
      Fixes: 668e48af ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs, add UUID/devinfo kobject · a013d141
      Anand Jain committed
      Create directory /sys/fs/btrfs/UUID/devinfo to hold device directories
      named by device id (unlike /devices).
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix race between shrinking truncate and fiemap · 28553fa9
      Filipe Manana committed
      When there is a fiemap executing in parallel with a shrinking truncate
      we can end up in a situation where we have extent maps for which we no
      longer have corresponding file extent items. This is generally harmless
      and at the moment the only consequences are missing file extent items
      representing holes after we expand the file size again after the
      truncate operation removed the prealloc extent items, and stale
      information for future fiemap calls (reporting extents that no longer
      exist or may have been reallocated to other files for example).
      
      Consider the following example:
      
      1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
         and a 1MiB prealloc extent at file offset 128KiB;
      
      2) Task A starts doing a shrinking truncate of our inode to reduce it to
         a size of 64KiB. Before it searches the subvolume tree for file
         extent items to delete, it drops all the extent maps in the range
         from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();
      
      3) Task B starts doing a fiemap against our inode. When looking up
         the inode's extent maps in the range from 128KiB to (u64)-1, it
         doesn't find any in the inode's extent map tree, since they were
         removed by task A.  Because it didn't find any in the extent map
         tree, it scans the inode's subvolume tree for file extent items, and
         it finds the 1MiB prealloc extent at file offset 128KiB, then it
         creates an extent map based on that file extent item and adds it to
         inode's extent map tree (this ends up being done by
         btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
         get_extent_skip_holes());
      
      4) Task A then drops the prealloc extent at file offset 128KiB and
         shrinks the 128KiB extent at file offset 0 to a length of 64KiB. The
         truncation operation finishes and we end up with an extent map
         representing a 1MiB prealloc extent at file offset 128KiB, even
         though that extent no longer exists;
      
      After this the two types of problems we have are:
      
      1) Future calls to fiemap always report that a 1MiB prealloc extent
         exists at file offset 128KiB. This is stale information, no longer
         correct;
      
      2) If the size of the file is increased, by a truncate operation that
         increases the file size or by a write into a file offset > 64KiB for
         example, we end up not inserting file extent items to represent holes
         for any range between 128KiB and 128KiB + 1MiB, since the hole
         expansion function, btrfs_cont_expand() will skip hole insertion for
         any range for which an extent map exists that represents a prealloc
         extent. This causes fsck to complain about missing file extent items
         when not using the NO_HOLES feature.
      
      The second issue could often be triggered by test case generic/561 from
      fstests, which runs fsstress and duperemove in parallel, and duperemove
      does frequent fiemap calls.
      
      Essentially the problem happens because fiemap does not acquire the
      inode's lock while truncate does, and fiemap locks the file range in the
      inode's iotree while truncate does not. So fix the issue by making
      btrfs_truncate_inode_items() lock the file range from the new file size
      to (u64)-1, so that it serializes with fiemap.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
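The interleaving in steps 1) to 4) can be replayed deterministically with a tiny model. This is a sketch under loud assumptions: `struct inode_model` and every function below are hypothetical, a single boolean stands for the prealloc extent's item in the subvolume tree and another for its cached extent map, and the "lock" is modelled simply by ordering the calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified model of one prealloc extent. */
struct inode_model {
    bool file_extent_item; /* item exists in the subvolume tree */
    bool extent_map;       /* cached map exists in the extent map tree */
};

/* Truncate, first half: drop cached extent maps beyond the new size. */
static void truncate_drop_cache(struct inode_model *in)
{
    in->extent_map = false;
}

/* Truncate, second half: delete file extent items beyond the new size. */
static void truncate_delete_items(struct inode_model *in)
{
    in->file_extent_item = false;
}

/* fiemap: on a cache miss, rebuild the extent map from the on-disk item. */
static void fiemap_lookup(struct inode_model *in)
{
    if (!in->extent_map && in->file_extent_item)
        in->extent_map = true;
}

/* Returns true if a stale extent map survives the truncate. */
static bool stale_map_after_truncate(bool serialized)
{
    struct inode_model in = { .file_extent_item = true, .extent_map = true };

    if (serialized) {
        /* Truncate holds the range lock: fiemap runs only afterwards. */
        truncate_drop_cache(&in);
        truncate_delete_items(&in);
        fiemap_lookup(&in);
    } else {
        /* The racy order from the example: fiemap slips in between. */
        truncate_drop_cache(&in);
        fiemap_lookup(&in);
        truncate_delete_items(&in);
    }
    assert(!in.file_extent_item); /* the item is gone either way */
    return in.extent_map;
}
```

In the racy order the model ends with a cached map for an extent that no longer exists, which is the stale state both listed problems stem from; in the serialized order no such map is left behind.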
    • btrfs: log message when rw remount is attempted with unclean tree-log · 10a3a3ed
      David Sterba committed
      A remount to a read-write filesystem is not safe when there's a tree-log
      to be replayed. Files that could be opened until now might be affected
      by the changes in the tree-log.
      
      A regular mount is needed to replay the log so the filesystem presents
      a consistent view with the pending changes included.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: print message when tree-log replay starts · e8294f2f
      David Sterba committed
      There's no logged information about tree-log replay although this is
      something that points to a previous unclean unmount. Other filesystems
      report that as well.
      Suggested-by: Chris Murphy <lists@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix race between using extent maps and merging them · ac05ca91
      Filipe Manana committed
      We have a few cases where we allow an extent map that is in an extent map
      tree to be merged with other extents in the tree. Such cases include the
      unpinning of an extent after the respective ordered extent completed or
      after logging an extent during a fast fsync. This can lead to subtle and
      dangerous problems because when doing the merge some other task might be
      using the same extent map and as a consequence see an inconsistent state of
      the extent map - for example, it sees the new length but has seen the old
      start offset.
      
      With luck this triggers a BUG_ON(), and not some silent bug, such as the
      following one in __do_readpage():
      
        $ cat -n fs/btrfs/extent_io.c
        3061  static int __do_readpage(struct extent_io_tree *tree,
        3062                           struct page *page,
        (...)
        3127                  em = __get_extent_map(inode, page, pg_offset, cur,
        3128                                        end - cur + 1, get_extent, em_cached);
        3129                  if (IS_ERR_OR_NULL(em)) {
        3130                          SetPageError(page);
        3131                          unlock_extent(tree, cur, end);
        3132                          break;
        3133                  }
        3134                  extent_offset = cur - em->start;
        3135                  BUG_ON(extent_map_end(em) <= cur);
        (...)
      
      Consider the following example scenario, where we end up hitting the
      BUG_ON() in __do_readpage().
      
      We have an inode with a size of 8KiB and 2 extent maps:
      
        extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
                  a previous transaction
      
        extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
                  persisted but writeback started for it already. The extent map
                  is pinned since there's writeback and an ordered extent in
                  progress, so it can not be merged with extent map A yet
      
      The following sequence of steps leads to the BUG_ON():
      
      1) The ordered extent for extent B completes, the respective page gets its
         writeback bit cleared and the extent map is unpinned, at that point it
         is not yet merged with extent map A because it's in the list of modified
         extents;
      
      2) Due to memory pressure, or some other reason, the MM subsystem releases
         the page corresponding to extent B - btrfs_releasepage() is called and
         returns 1, meaning the page can be released as it's not dirty, not under
         writeback anymore and the extent range is not locked in the inode's
         iotree. However the extent map is not released, either because we are
         not in a context that allows memory allocations to block or because the
         inode's size is smaller than 16MiB - in this case our inode has a size
         of 8KiB;
      
      3) Task B needs to read extent B and ends up in __do_readpage() through the
         btrfs_readpage() callback. At __do_readpage() it gets a reference to
         extent map B;
      
      4) Task A, doing a fast fsync, calls clear_em_logging() against extent map B
         while holding the write lock on the inode's extent map tree - this
         results in try_merge_map() being called and since it's possible to merge
         extent map B with extent map A now (the extent map B was removed from
         the list of modified extents), the merging begins - it sets extent map
         B's start offset to 0 (was 4KiB), but before it increments the map's
         length to 8KiB (4KiB + 4KiB), task B is at:
      
         BUG_ON(extent_map_end(em) <= cur);
      
         The call to extent_map_end() sees the extent map has a start of 0
         and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
         the BUG_ON() is triggered.
      
      So it's dangerous to modify an extent map that is in the tree, because some
      other task might have got a reference to it before and still be using it, and
      needs to see a consistent map while using it. Generally this is very rare
      since most paths that lookup and use extent maps also have the file range
      locked in the inode's iotree. The fsync path is pretty much the only
      exception where we don't do it to avoid serialization with concurrent
      reads.
      
      Fix this by not allowing an extent map to be merged if it's being used
      by tasks other than the one attempting to merge the extent map (when the
      reference count of the extent map is greater than 2).
      Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
      Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
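The arithmetic behind the BUG_ON() and the reference count guard can be checked with a small userspace model. Everything here is a hypothetical simplification: `struct extent_map_sketch`, `em_end()`, the two merge steps and `may_merge()` are names invented for this sketch, and the reading of "greater than 2" as "one reference for the tree, one for the merging task, anything more is another user" is an assumption rather than something the message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified extent map: only the fields the race needs. */
struct extent_map_sketch {
    uint64_t start;
    uint64_t len;
    int refs;
};

/* Same computation as extent_map_end() in the example: start + length. */
static uint64_t em_end(const struct extent_map_sketch *em)
{
    return em->start + em->len;
}

/* A merge is two separate stores; a reader may observe the gap. */
static void merge_step_set_start(struct extent_map_sketch *em, uint64_t start)
{
    em->start = start;
}

static void merge_step_grow_len(struct extent_map_sketch *em, uint64_t extra)
{
    em->len += extra;
}

/*
 * Guard in the spirit of the fix: refuse to merge when the map has more
 * than 2 references (assumed here: one for the tree, one for the merger).
 */
static bool may_merge(const struct extent_map_sketch *em)
{
    assert(em->refs >= 1);
    return em->refs <= 2;
}
```

With extent map B at start 4KiB and length 4KiB, a reader positioned at `cur` = 4KiB that runs between the two merge steps computes an end of 4KiB, which is exactly the condition the BUG_ON() rejects.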
    • btrfs: ref-verify: fix memory leaks · f311ade3
      Wenwen Wang committed
      In btrfs_ref_tree_mod(), 'ref' and 'ra' are allocated through kzalloc() and
      kmalloc(), respectively. In the following code, if an error occurs, the
      execution will be redirected to 'out' or 'out_unlock' and the function will
      be exited. However, on some of the paths, 'ref' and 'ra' are not
      deallocated, leading to memory leaks. For example, if 'action' is
      BTRFS_ADD_DELAYED_EXTENT, add_block_entry() will be invoked. If the return
      value indicates an error, the execution will be redirected to 'out'. But,
      'ref' is not deallocated on this path, causing a memory leak.
      
      To fix the above issues, deallocate both 'ref' and 'ra' before exiting from
      the function when an error is encountered.
      
      CC: stable@vger.kernel.org # 4.15+
      Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
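The leak pattern and its usual remedy, a single cleanup label that frees both allocations on every error path, can be sketched in userspace C. This is not the btrfs_ref_tree_mod() code: `build_entries()`, `tracked_alloc()`, `tracked_free()` and the `fail_midway` switch are hypothetical names introduced only to make the error path observable.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Counts allocations that have not been freed yet (sketch only). */
static int live_allocs;

static void *tracked_alloc(size_t size)
{
    void *p = calloc(1, size);

    if (p)
        live_allocs++;
    return p;
}

static void tracked_free(void *p)
{
    if (p) {
        live_allocs--;
        free(p);
    }
}

/*
 * Allocate two objects, as the function in the message does for 'ref'
 * and 'ra'. On any error both are released before returning; on success
 * ownership passes to the caller through the out parameters.
 */
static int build_entries(bool fail_midway, void **ref_out, void **ra_out)
{
    int ret;
    void *ref = tracked_alloc(64);
    void *ra = tracked_alloc(64);

    assert(ref_out && ra_out);
    *ref_out = NULL;
    *ra_out = NULL;

    if (!ref || !ra) {
        ret = -ENOMEM;
        goto out_free;
    }
    if (fail_midway) { /* stands in for a failing helper call */
        ret = -EINVAL;
        goto out_free;
    }

    *ref_out = ref;
    *ra_out = ra;
    return 0;

out_free:
    tracked_free(ref); /* freeing NULL is a no-op here, as with free() */
    tracked_free(ra);
    return ret;
}
```

Routing every failure through one label keeps the cleanup in a single place, so a newly added error path cannot silently skip one of the frees.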
  6. 12 Feb, 2020 3 commits
    • ceph: noacl mount option is effectively ignored · 3b20bc2f
      Xiubo Li committed
      With the old mount API, the module parameters parsing function is
      called in ceph_mount(), just after the default posix acl flag is set,
      so the mount option can enable or disable it.
      
      But with the new mount API, the module parameters parsing function is
      called before ceph_get_tree(), so the posix acl will always be
      enabled.
      
      Fixes: 82995cc6 ("libceph, rbd, ceph: convert to use the new mount API")
      Signed-off-by: Xiubo Li <xiubli@redhat.com>
      Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
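The ordering problem can be shown with one flag bit. This sketch assumes that the default is applied by OR-ing in an ACL bit and that "noacl" clears it; `SKETCH_POSIXACL`, `parse_options()`, `set_default_acl()` and `mount_flags()` are hypothetical names, not the ceph functions.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_POSIXACL (1u << 0) /* hypothetical stand-in for the ACL flag */

/* Option parsing: "noacl" clears the ACL bit, otherwise flags are kept. */
static unsigned int parse_options(unsigned int flags, bool noacl)
{
    return noacl ? (flags & ~SKETCH_POSIXACL) : flags;
}

/* Default setup: ACLs are on unless something turns them off later. */
static unsigned int set_default_acl(unsigned int flags)
{
    return flags | SKETCH_POSIXACL;
}

/* Apply both steps in the given order and return the resulting flags. */
static unsigned int mount_flags(bool default_first, bool noacl)
{
    unsigned int flags = 0;

    if (default_first) {
        flags = set_default_acl(flags);
        flags = parse_options(flags, noacl);
    } else {
        flags = parse_options(flags, noacl);
        flags = set_default_acl(flags);
    }
    assert((flags & ~SKETCH_POSIXACL) == 0); /* only one bit is modelled */
    return flags;
}
```

When the default is applied after parsing, a "noacl" request is overwritten, which matches the symptom in the title; applying the default first lets the option take effect.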
    • ceph: canonicalize server path in place · b27a939e
      Ilya Dryomov committed
      syzbot reported that 4fbc0c71 ("ceph: remove the extra slashes in
      the server path") had caused a regression where an allocation could be
      done under a spinlock -- compare_mount_options() is called by sget_fc()
      with sb_lock held.
      
      We don't really need the supplied server path, so canonicalize it
      in place and compare it directly.  To make this work, the leading
      slash is kept around and the logic in ceph_real_mount() to skip it
      is restored.  CEPH_MSG_CLIENT_SESSION now reports the same (i.e.
      canonicalized) path, with the leading slash of course.
      
      Fixes: 4fbc0c71 ("ceph: remove the extra slashes in the server path")
      Reported-by: syzbot+98704a51af8e3d9425a9@syzkaller.appspotmail.com
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
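Canonicalizing in place needs no allocation, which is what makes it safe to do under a spinlock. The following is a sketch of one way to do it, not the ceph implementation: `canonicalize_in_place()` and `same_server_path()` are hypothetical names, and the exact rules (collapse runs of slashes, keep the leading slash, drop a trailing one) are assumptions based on the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Collapse runs of '/' into one and drop a trailing '/', writing the
 * result over the input. The output is never longer than the input, so
 * no memory is allocated.
 */
static void canonicalize_in_place(char *path)
{
    char *r = path; /* read cursor */
    char *w = path; /* write cursor, always <= r */

    assert(path != NULL);
    while (*r) {
        *w++ = *r;
        if (*r == '/') {
            while (*r == '/') /* skip the rest of the slash run */
                r++;
        } else {
            r++;
        }
    }
    if (w - path > 1 && w[-1] == '/') /* keep a lone "/" intact */
        w--;
    *w = '\0';
}

/* Canonicalize the supplied path and compare it directly. */
static bool same_server_path(char *supplied, const char *canonical)
{
    canonicalize_in_place(supplied);
    return strcmp(supplied, canonical) == 0;
}
```

Because the leading slash survives, the canonical form of a path such as `//a///b//` is `/a/b`, and a path made only of slashes stays `/`.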
    • ceph: do not execute direct write in parallel if O_APPEND is specified · 8e4473bb
      Xiubo Li committed
      In O_APPEND & O_DIRECT mode, data from different writers may
      overlap, since the writers only take the shared lock.
      
      For example, both Writer1 and Writer2 are in O_APPEND and O_DIRECT
      mode:
      
                Writer1                         Writer2
      
           shared_lock()                   shared_lock()
           getattr(CAP_SIZE)               getattr(CAP_SIZE)
           iocb->ki_pos = EOF              iocb->ki_pos = EOF
           write(data1)
                                           write(data2)
           shared_unlock()                 shared_unlock()
      
      data2 will overlap data1, because both are written starting at the
      same file offset, the old EOF.
      
      Switch to exclusive lock instead when O_APPEND is specified.
      Signed-off-by: Xiubo Li <xiubli@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
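The Writer1/Writer2 timeline can be replayed deterministically in a small model. This is a sketch with hypothetical names (`struct file_model`, `get_eof()`, `write_at()`, `append_twice()`); locking is modelled only by the order in which each writer samples EOF and writes, and the payloads "AAAA" and "BBBB" are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory file: a buffer plus its current size. */
struct file_model {
    char data[64];
    size_t size;
};

static size_t get_eof(const struct file_model *f)
{
    return f->size;
}

static void write_at(struct file_model *f, size_t pos, const char *buf)
{
    size_t len = strlen(buf);

    assert(pos + len <= sizeof(f->data));
    memcpy(f->data + pos, buf, len);
    if (pos + len > f->size)
        f->size = pos + len;
}

/* Two appends of 4 bytes each; returns the resulting file size. */
static size_t append_twice(struct file_model *f, bool exclusive)
{
    if (exclusive) {
        /* Each writer samples EOF and writes before the next one starts. */
        write_at(f, get_eof(f), "AAAA");
        write_at(f, get_eof(f), "BBBB");
    } else {
        /* Shared lock: both writers sample the same old EOF first. */
        size_t pos1 = get_eof(f);
        size_t pos2 = get_eof(f);

        write_at(f, pos1, "AAAA");
        write_at(f, pos2, "BBBB");
    }
    return f->size;
}
```

In the shared case the file grows by only one write and the first payload is lost; in the exclusive case both payloads land back to back, which is the behavior O_APPEND is expected to give.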
  7. 10 Feb, 2020 7 commits
  8. 09 Feb, 2020 4 commits