1. 19 May 2020, 1 commit
    • fscrypt: support test_dummy_encryption=v2 · ed318a6c
      Authored by Eric Biggers
      v1 encryption policies are deprecated in favor of v2, and some new
      features (e.g. encryption+casefolding) are only being added for v2.
      
      Therefore, the "test_dummy_encryption" mount option (which is used for
      encryption I/O testing with xfstests) needs to support v2 policies.
      
      To do this, extend its syntax to be "test_dummy_encryption=v1" or
      "test_dummy_encryption=v2".  The existing "test_dummy_encryption" (no
      argument) also continues to be accepted, to specify the default setting
      -- currently v1, but the next patch changes it to v2.
      
      To cleanly support both v1 and v2 while also making it easy to support
      specifying other encryption settings in the future (say, accepting
      "$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
      pointer to the dummy fscrypt_context rather than using mount flags.
      
      To avoid concurrency issues, don't allow test_dummy_encryption to be set
      or changed during a remount.  (The former restriction is new, but
      xfstests doesn't run into it, so no one should notice.)
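      The extended syntax could be parsed along these lines; this is an illustrative userspace sketch, not the actual ext4/f2fs option-parsing code, and the function and macro names are hypothetical:

      ```c
      #include <assert.h>
      #include <string.h>

      /* Hypothetical default; the commit notes it is currently v1 and the
       * next patch changes it to v2. */
      #define DEFAULT_DUMMY_VERSION 1

      /* Returns the requested policy version (1 or 2), or -1 for an
       * invalid argument.  "arg" is the text after the '=' in
       * "test_dummy_encryption=...", or NULL/empty for the bare form. */
      int parse_test_dummy_encryption(const char *arg)
      {
          if (arg == NULL || *arg == '\0')
              return DEFAULT_DUMMY_VERSION; /* bare "test_dummy_encryption" */
          if (strcmp(arg, "v1") == 0)
              return 1;
          if (strcmp(arg, "v2") == 0)
              return 2;
          return -1; /* unknown argument, reject the mount option */
      }
      ```

      Keeping the version in a parsed structure (rather than a mount flag) is what leaves room for richer arguments such as "$contents_mode:$filenames_mode:v2" later.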
      
      Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'.  On ext4,
      there are two regressions, both of which are test bugs: ext4/023 and
      ext4/028 fail because they set an xattr and expect it to be stored
      inline, but the increase in size of the fscrypt_context from
      24 to 40 bytes causes this xattr to be spilled into an external block.
      
      Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
      Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Reviewed-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
  2. 16 May 2020, 1 commit
    • fscrypt: add fscrypt_add_test_dummy_key() · cdeb21da
      Authored by Eric Biggers
      Currently, the test_dummy_encryption mount option (which is used for
      encryption I/O testing with xfstests) uses v1 encryption policies, and
      it relies on userspace inserting a test key into the session keyring.
      
      We need test_dummy_encryption to support v2 encryption policies too.
      Requiring userspace to add the test key doesn't work well with v2
      policies, since v2 policies only support the filesystem keyring (not the
      session keyring), and keys in the filesystem keyring are lost when the
      filesystem is unmounted.  Hooking all test code that unmounts and
      re-mounts the filesystem would be difficult.
      
      Instead, let's make the filesystem automatically add the test key to its
      keyring when test_dummy_encryption is enabled.
      
      That puts the responsibility for choosing the test key on the kernel.
      We could just hard-code a key.  But out of paranoia, let's first try
      using a per-boot random key, to prevent this code from being misused.
      A per-boot key will work as long as no one expects dummy-encrypted files
      to remain accessible after a reboot.  (gce-xfstests doesn't.)
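      The per-boot key idea can be modelled in userspace as a key that is generated lazily on first use and then reused for the lifetime of the process (standing in for "per boot"). This is only an illustrative sketch; the kernel side would use its own once-only random-key machinery, and the names here are made up:

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdlib.h>
      #include <string.h>

      #define TEST_KEY_SIZE 64

      /* Returns a pointer to the process-lifetime test key, generating it
       * on the first call.  Every caller sees the same key until "reboot"
       * (here, process exit). */
      const uint8_t *get_test_dummy_key(void)
      {
          static uint8_t key[TEST_KEY_SIZE];
          static int initialized;

          if (!initialized) {
              for (int i = 0; i < TEST_KEY_SIZE; i++)
                  key[i] = (uint8_t)rand();  /* not cryptographic; sketch only */
              initialized = 1;
          }
          return key;
      }
      ```

      Because the key is stable within a boot but not across boots, dummy-encrypted files survive remounts (which is what the tests need) but not reboots (which, as noted, gce-xfstests does not expect).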
      
      Therefore, this patch adds a function fscrypt_add_test_dummy_key() which
      implements the above.  The next patch will use it.
      
      Link: https://lore.kernel.org/r/20200512233251.118314-3-ebiggers@kernel.org
      Reviewed-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
  3. 13 May 2020, 2 commits
  4. 10 May 2020, 1 commit
  5. 09 May 2020, 1 commit
  6. 08 May 2020, 3 commits
    • epoll: atomically remove wait entry on wake up · 412895f0
      Authored by Roman Penyaev
      This patch does two things:
      
       - fixes a lost wakeup introduced by commit 339ddb53 ("fs/epoll:
         remove unnecessary wakeups of nested epoll")
      
       - improves performance for events delivery.
      
      The description of the problem is the following: if N (>1) threads are
      waiting on ep->wq for new events and M (>1) events come, it is quite
      likely that >1 wakeups hit the same wait queue entry, because there is
      quite a big window between __add_wait_queue_exclusive() and the
      following __remove_wait_queue() calls in ep_poll() function.
      
      This can lead to lost wakeups, because the thread that was woken up
      may not handle all the events in ->rdllist.  (The problem is
      described in more detail here: https://lkml.org/lkml/2019/10/7/905)
      
      The idea of the current patch is to use init_wait() instead of
      init_waitqueue_entry().
      
      Internally init_wait() sets autoremove_wake_function as a callback,
      which removes the wait entry atomically (under the wq locks) from the
      list, so the next wakeup hits the next wait entry in the wait queue,
      preventing lost wakeups.
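      The behavioural difference can be sketched with a minimal single-threaded model: with an "autoremove" wake function, a wakeup unlinks the entry it hits, so M wakeups reach M distinct waiters instead of hitting the same entry repeatedly. This is purely illustrative; the real code uses wait_queue_entry_t and autoremove_wake_function() under the waitqueue lock:

      ```c
      #include <assert.h>
      #include <stddef.h>

      struct wait_entry {
          struct wait_entry *next;
          int woken;
      };

      struct wait_queue {
          struct wait_entry *head;
      };

      void enqueue(struct wait_queue *wq, struct wait_entry *e)
      {
          e->next = wq->head;
          e->woken = 0;
          wq->head = e;
      }

      /* Wake one waiter, removing its entry at wakeup time (modelled here
       * as a simple pop; the kernel does this under the wq lock).
       * Returns the entry woken, or NULL if the queue is empty. */
      struct wait_entry *wake_one_autoremove(struct wait_queue *wq)
      {
          struct wait_entry *e = wq->head;
          if (e) {
              wq->head = e->next;   /* entry leaves the queue atomically */
              e->woken = 1;
          }
          return e;
      }
      ```

      Without the autoremove step, both wakeups could land on the same still-queued entry, which is exactly the lost-wakeup window described above.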
      
      Problem is very well reproduced by the epoll60 test case [1].
      
      Wait entry removal on wakeup has also performance benefits, because
      there is no need to take a ep->lock and remove wait entry from the queue
      after the successful wakeup.  Here is the timing output of the epoll60
      test case:
      
        With explicit wakeup from ep_scan_ready_list() (the state of the
        code prior to 339ddb53):
      
          real    0m6.970s
          user    0m49.786s
          sys     0m0.113s
      
       After this patch:
      
         real    0m5.220s
         user    0m36.879s
         sys     0m0.019s
      
      The other testcase is the stress-epoll [2], where one thread consumes
      all the events and other threads produce many events:
      
        With explicit wakeup from ep_scan_ready_list() (the state of the
        code prior to 339ddb53):
      
          threads  events/ms  run-time ms
                8       5427         1474
               16       6163         2596
               32       6824         4689
               64       7060         9064
              128       6991        18309
      
       After this patch:
      
          threads  events/ms  run-time ms
                8       5598         1429
               16       7073         2262
               32       7502         4265
               64       7640         8376
              128       7634        16767
      
       (number of "events/ms" represents event bandwidth, thus higher is
        better; number of "run-time ms" represents overall time spent
        doing the benchmark, thus lower is better)
      
      [1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
      [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jason Baron <jbaron@akamai.com>
      Cc: Khazhismel Kumykov <khazhy@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Heiher <r@hev.cc>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • eventpoll: fix missing wakeup for ovflist in ep_poll_callback · 0c54a6a4
      Authored by Khazhismel Kumykov
      Before commit 339ddb53 ("fs/epoll: remove unnecessary wakeups of
      nested epoll"), when we added to ovflist we would be woken up by
      ep_scan_ready_list, and no wakeup was done in ep_poll_callback.
      
      With that wakeup removed, if we add to ovflist here, we may never wake
      up.  Rather than adding back the ep_scan_ready_list wakeup - which was
      resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
      
      We noticed that one of our workloads was missing wakeups starting with
      339ddb53 and upon manual inspection, this wakeup seemed missing to me.
      With this patch added, we no longer see missing wakeups.  I haven't yet
      tried to make a small reproducer, but the existing kselftests in
      filesystems/epoll passed for me with this patch.
      
      [khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman]
        Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com
      Fixes: 339ddb53 ("fs/epoll: remove unnecessary wakeups of nested epoll")
      Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Roman Penyaev <rpenyaev@suse.de>
      Cc: Heiher <r@hev.cc>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • io_uring: don't use 'fd' for openat/openat2/statx · 63ff8223
      Authored by Jens Axboe
      We currently make some guesses as to when to open this fd, but in
      reality we have no business (or need) to do so at all. In fact, it
      makes certain things fail, like O_PATH.
      
      Remove the fd lookup from these opcodes, we're just passing the 'fd' to
      generic helpers anyway. With that, we can also remove the special casing
      of fd values in io_req_needs_file(), and the 'fd_non_neg' check that
      we have. And we can ensure that we only read sqe->fd once.
      
      This fixes O_PATH usage with openat/openat2, and ditto statx path side
      oddities.
      
      Cc: stable@vger.kernel.org # v5.6
      Reported-by: Max Kellermann <mk@cm4all.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 07 May 2020, 2 commits
  8. 06 May 2020, 1 commit
  9. 05 May 2020, 3 commits
  10. 04 May 2020, 1 commit
  11. 01 May 2020, 8 commits
  12. 30 April 2020, 2 commits
    • fibmap: Warn and return an error in case of block > INT_MAX · b75dfde1
      Authored by Ritesh Harjani
      We had better warn the fibmap user, and not return a truncated and
      therefore incorrect block map address, if the block address returned
      by bmap() is greater than INT_MAX (the user supplies an integer
      pointer).
      
      It's better to pr_warn() all users of ioctl_fibmap() and return a
      proper error code than to silently let a FS corruption happen if the
      user tries to fiddle with the returned block map address.
      
      We fix this by returning an error code of -ERANGE and returning 0 as
      the block mapping address in case it is > INT_MAX.
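      A minimal sketch of that check, as a runnable userspace analogue (the function name and the -ERANGE stand-in are illustrative, not the actual fs/ioctl.c code, and the pr_warn() is elided):

      ```c
      #include <assert.h>
      #include <limits.h>
      #include <stdint.h>

      #define ERANGE_ERR (-34)  /* stand-in for the kernel's -ERANGE */

      /* If the 64-bit block address returned by bmap() does not fit in the
       * user's int, return 0 as the mapping and signal -ERANGE instead of
       * silently truncating. */
      int fibmap_check(uint64_t block, int *ures)
      {
          if (block > INT_MAX) {
              *ures = 0;          /* never hand back a truncated address */
              return ERANGE_ERR;
          }
          *ures = (int)block;     /* safe: fits in int */
          return 0;
      }
      ```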
      
      Now, iomap_bmap() can be called from either of two paths: when a user
      calls the ioctl_fibmap() interface to get the block mapping address,
      or by some filesystem via the bmap() internal kernel API.
      The bmap() kernel API is well equipped to handle u64 addresses.
      
      The WARN condition in iomap_bmap_actor() was mainly added to warn all
      the fibmap users. Now that we have added this warning directly for all
      fibmap users, and also made sure to return 0 as the block map address
      when addr > INT_MAX, we can remove this logic from iomap_bmap_actor().
      Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • btrfs: fix gcc-4.8 build warning for struct initializer · 9c6c723f
      Authored by Arnd Bergmann
      Some older compilers like gcc-4.8 warn about mismatched curly braces in
      an initializer:
      
      fs/btrfs/backref.c: In function 'is_shared_data_backref':
      fs/btrfs/backref.c:394:9: error: missing braces around
      initializer [-Werror=missing-braces]
        struct prelim_ref target = {0};
               ^
      fs/btrfs/backref.c:394:9: error: (near initialization for
      'target.rbnode') [-Werror=missing-braces]
      
      Use the GNU empty initializer extension to avoid this.
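      The difference can be shown with a simplified stand-in for struct prelim_ref (the struct below is illustrative, not the btrfs definition): for a struct whose first member is itself an aggregate, gcc-4.8 takes the 0 in "= {0}" as initializing the nested member's first field and fires -Wmissing-braces, while the GNU empty initializer "= {}" zeroes everything without the warning.

      ```c
      #include <assert.h>

      struct rb_node_like {            /* aggregate first member, as with
                                          prelim_ref's rbnode */
          void *left;
          void *right;
      };

      struct prelim_ref_like {
          struct rb_node_like rbnode;
          int level;
          unsigned long long wanted_disk_byte;
      };

      int empty_init_is_zero(void)
      {
          struct prelim_ref_like target = {};  /* GNU extension: zero-fills
                                                  with no brace warning */
          return target.rbnode.left == 0 && target.rbnode.right == 0 &&
                 target.level == 0 && target.wanted_disk_byte == 0;
      }
      ```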
      
      Fixes: ed58f2e6 ("btrfs: backref, don't add refs from shared block when resolving normal backref")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  13. 29 April 2020, 2 commits
    • Fix use after free in get_tree_bdev() · dd7bc815
      Authored by David Howells
      Commit 6fcf0c72, a fix to get_tree_bdev(), put a missing blkdev_put()
      in the wrong place: before a warnf() that displays the bdev under
      consideration, rather than after it.
      
      This results in a silent lockup in printk("%pg") called via warnf() from
      get_tree_bdev() under some circumstances when there's a race with the
      blockdev being frozen.  This can be caused by xfstests/tests/generic/085 in
      combination with Lukas Czerner's ext4 mount API conversion patchset.  It
      looks like it ought to occur with other users of get_tree_bdev() such as
      XFS, but apparently doesn't.
      
      Fix this by switching the order of the lines.
      
      Fixes: 6fcf0c72 ("vfs: add missing blkdev_put() in get_tree_bdev()")
      Reported-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Ian Kent <raven@themaw.net>
      cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION · dff58530
      Authored by Olga Kornievskaia
      Currently, if the client sends BIND_CONN_TO_SESSION with
      NFS4_CDFC4_FORE_OR_BOTH but only gets NFS4_CDFS4_FORE back, it ignores
      that it wasn't able to enable a backchannel.
      
      To make sure, the client sends BIND_CONN_TO_SESSION as the first
      operation on the connection (i.e., before any other session compounds
      have been sent), and if the client's request to bind the backchannel
      is not satisfied, it resets the connection and retries.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
  14. 28 April 2020, 2 commits
    • coredump: fix crash when umh is disabled · 3740d93e
      Authored by Luis Chamberlain
      Commit 64e90a8a ("Introduce STATIC_USERMODEHELPER to mediate
      call_usermodehelper()") added the option to disable all
      call_usermodehelper() calls by setting STATIC_USERMODEHELPER_PATH to
      an empty string. When this is done and a crashdump is triggered, it
      will crash on a NULL pointer dereference, since we make assumptions
      about what call_usermodehelper_exec() did.
      
      This has been reported by Sergey when one triggers a coredump
      with the following configuration:
      
      ```
      CONFIG_STATIC_USERMODEHELPER=y
      CONFIG_STATIC_USERMODEHELPER_PATH=""
      kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
      ```
      
      The way disabling the umh was designed was that call_usermodehelper_exec()
      would just return early, without an error. But coredump assumes
      certain variables are set up for us when this happens, and calls
      file_start_write(cprm.file) with a NULL file.
      
      [    2.819676] BUG: kernel NULL pointer dereference, address: 0000000000000020
      [    2.819859] #PF: supervisor read access in kernel mode
      [    2.820035] #PF: error_code(0x0000) - not-present page
      [    2.820188] PGD 0 P4D 0
      [    2.820305] Oops: 0000 [#1] SMP PTI
      [    2.820436] CPU: 2 PID: 89 Comm: a Not tainted 5.7.0-rc1+ #7
      [    2.820680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
      [    2.821150] RIP: 0010:do_coredump+0xd80/0x1060
      [    2.821385] Code: e8 95 11 ed ff 48 c7 c6 cc a7 b4 81 48 8d bd 28 ff
      ff ff 89 c2 e8 70 f1 ff ff 41 89 c2 85 c0 0f 84 72 f7 ff ff e9 b4 fe ff
      ff <48> 8b 57 20 0f b7 02 66 25 00 f0 66 3d 00 80 0f 84 9c 01 00 00 44
      [    2.822014] RSP: 0000:ffffc9000029bcb8 EFLAGS: 00010246
      [    2.822339] RAX: 0000000000000000 RBX: ffff88803f860000 RCX: 000000000000000a
      [    2.822746] RDX: 0000000000000009 RSI: 0000000000000282 RDI: 0000000000000000
      [    2.823141] RBP: ffffc9000029bde8 R08: 0000000000000000 R09: ffffc9000029bc00
      [    2.823508] R10: 0000000000000001 R11: ffff88803dec90be R12: ffffffff81c39da0
      [    2.823902] R13: ffff88803de84400 R14: 0000000000000000 R15: 0000000000000000
      [    2.824285] FS:  00007fee08183540(0000) GS:ffff88803e480000(0000) knlGS:0000000000000000
      [    2.824767] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    2.825111] CR2: 0000000000000020 CR3: 000000003f856005 CR4: 0000000000060ea0
      [    2.825479] Call Trace:
      [    2.825790]  get_signal+0x11e/0x720
      [    2.826087]  do_signal+0x1d/0x670
      [    2.826361]  ? force_sig_info_to_task+0xc1/0xf0
      [    2.826691]  ? force_sig_fault+0x3c/0x40
      [    2.826996]  ? do_trap+0xc9/0x100
      [    2.827179]  exit_to_usermode_loop+0x49/0x90
      [    2.827359]  prepare_exit_to_usermode+0x77/0xb0
      [    2.827559]  ? invalid_op+0xa/0x30
      [    2.827747]  ret_from_intr+0x20/0x20
      [    2.827921] RIP: 0033:0x55e2c76d2129
      [    2.828107] Code: 2d ff ff ff e8 68 ff ff ff 5d c6 05 18 2f 00 00 01
      c3 0f 1f 80 00 00 00 00 c3 0f 1f 80 00 00 00 00 e9 7b ff ff ff 55 48 89
      e5 <0f> 0b b8 00 00 00 00 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
      [    2.828603] RSP: 002b:00007fffeba5e080 EFLAGS: 00010246
      [    2.828801] RAX: 000055e2c76d2125 RBX: 0000000000000000 RCX: 00007fee0817c718
      [    2.829034] RDX: 00007fffeba5e188 RSI: 00007fffeba5e178 RDI: 0000000000000001
      [    2.829257] RBP: 00007fffeba5e080 R08: 0000000000000000 R09: 00007fee08193c00
      [    2.829482] R10: 0000000000000009 R11: 0000000000000000 R12: 000055e2c76d2040
      [    2.829727] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      [    2.829964] CR2: 0000000000000020
      [    2.830149] ---[ end trace ceed83d8c68a1bf1 ]---
      
      Cc: <stable@vger.kernel.org> # v4.11+
      Fixes: 64e90a8a ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()")
      BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199795
      Reported-by: Tony Vroon <chainsaw@gentoo.org>
      Reported-by: Sergey Kvachonok <ravenexp@gmail.com>
      Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Link: https://lore.kernel.org/r/20200416162859.26518-1-mcgrof@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • io_uring: statx must grab the file table for valid fd · 5b0bbee4
      Authored by Jens Axboe
      Clay reports that OP_STATX fails for a test case with a valid fd
      and empty path:
      
       -- Test 0: statx:fd 3: SUCCEED, file mode 100755
       -- Test 1: statx:path ./uring_statx: SUCCEED, file mode 100755
       -- Test 2: io_uring_statx:fd 3: FAIL, errno 9: Bad file descriptor
       -- Test 3: io_uring_statx:path ./uring_statx: SUCCEED, file mode 100755
      
      This is due to statx not grabbing the process file table, hence we can't
      lookup the fd in async context. If the fd is valid, ensure that we grab
      the file table so we can grab the file from async context.
      
      Cc: stable@vger.kernel.org # v5.6
      Reported-by: Clay Harris <bugs@claycon.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 27 April 2020, 4 commits
    • btrfs: transaction: Avoid deadlock due to bad initialization timing of fs_info::journal_info · fcc99734
      Authored by Qu Wenruo
      [BUG]
      One run of btrfs/063 triggered the following lockdep warning:
        ============================================
        WARNING: possible recursive locking detected
        5.6.0-rc7-custom+ #48 Not tainted
        --------------------------------------------
        kworker/u24:0/7 is trying to acquire lock:
        ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
      
        but task is already holding lock:
        ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(sb_internal#2);
          lock(sb_internal#2);
      
         *** DEADLOCK ***
      
         May be due to missing lock nesting notation
      
        4 locks held by kworker/u24:0/7:
         #0: ffff88817b495948 ((wq_completion)btrfs-endio-write){+.+.}, at: process_one_work+0x557/0xb80
         #1: ffff888189ea7db8 ((work_completion)(&work->normal_work)){+.+.}, at: process_one_work+0x557/0xb80
         #2: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
         #3: ffff888174ca4da8 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x83/0xd0 [btrfs]
      
        stack backtrace:
        CPU: 0 PID: 7 Comm: kworker/u24:0 Not tainted 5.6.0-rc7-custom+ #48
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
        Call Trace:
         dump_stack+0xc2/0x11a
         __lock_acquire.cold+0xce/0x214
         lock_acquire+0xe6/0x210
         __sb_start_write+0x14e/0x290
         start_transaction+0x66c/0x890 [btrfs]
         btrfs_join_transaction+0x1d/0x20 [btrfs]
         find_free_extent+0x1504/0x1a50 [btrfs]
         btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
         btrfs_alloc_tree_block+0x1ac/0x570 [btrfs]
         btrfs_copy_root+0x213/0x580 [btrfs]
         create_reloc_root+0x3bd/0x470 [btrfs]
         btrfs_init_reloc_root+0x2d2/0x310 [btrfs]
         record_root_in_trans+0x191/0x1d0 [btrfs]
         btrfs_record_root_in_trans+0x90/0xd0 [btrfs]
         start_transaction+0x16e/0x890 [btrfs]
         btrfs_join_transaction+0x1d/0x20 [btrfs]
         btrfs_finish_ordered_io+0x55d/0xcd0 [btrfs]
         finish_ordered_fn+0x15/0x20 [btrfs]
         btrfs_work_helper+0x116/0x9a0 [btrfs]
         process_one_work+0x632/0xb80
         worker_thread+0x80/0x690
         kthread+0x1a3/0x1f0
         ret_from_fork+0x27/0x50
      
      It's pretty hard to reproduce, only one hit so far.
      
      [CAUSE]
      This is because we're calling btrfs_join_transaction() without re-using
      the current running one:
      
      btrfs_finish_ordered_io()
      |- btrfs_join_transaction()		<<< Call #1
         |- btrfs_record_root_in_trans()
            |- btrfs_reserve_extent()
      	 |- btrfs_join_transaction()	<<< Call #2
      
      Normally such btrfs_join_transaction() call should re-use the existing
      one, without trying to re-start a transaction.
      
      But the problem is, in btrfs_join_transaction() call #1, we call
      btrfs_record_root_in_trans() before initializing current::journal_info.
      
      And in btrfs_join_transaction() call #2, we're relying on
      current::journal_info to avoid such deadlock.
      
      [FIX]
      Call btrfs_record_root_in_trans() after we have initialized
      current::journal_info.
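      The re-entrancy guard and the ordering fix can be modelled with a toy userspace analogue (names and structure are illustrative, not the btrfs code): join_transaction() reuses the running transaction only if journal_info is already published, so publishing it before any work that can re-enter join is what prevents the nested "start":

      ```c
      #include <assert.h>
      #include <stddef.h>

      static void *journal_info;       /* stands in for current->journal_info */
      static int transactions_started;

      void *join_transaction(void)
      {
          if (journal_info)
              return journal_info;     /* re-use, no nested "start" */
          transactions_started++;
          journal_info = &transactions_started; /* the fix: publish this
                                                   before re-entrant work */
          return journal_info;
      }

      /* Models btrfs_record_root_in_trans(), which can re-enter
       * join_transaction() via the extent allocation path. */
      void record_root_in_trans(void)
      {
          join_transaction();
      }

      int finish_ordered_io(void)
      {
          join_transaction();          /* call #1 sets journal_info first */
          record_root_in_trans();      /* call #2 now safely re-uses it */
          return transactions_started; /* 1 with the fix; the bug made the
                                          nested call try to start again */
      }
      ```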
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix partial loss of prealloc extent past i_size after fsync · f135cea3
      Authored by Filipe Manana
      When we have an inode with a prealloc extent that starts at an offset
      lower than the i_size and there is another prealloc extent that starts at
      an offset beyond i_size, we can end up losing part of the first prealloc
      extent (the part that starts at i_size) and have an implicit hole if we
      fsync the file and then have a power failure.
      
      Consider the following example with comments explaining how and why it
      happens.
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        # Create our test file with 2 consecutive prealloc extents, each with a
        # size of 128Kb, and covering the range from 0 to 256Kb, with a file
        # size of 0.
        $ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
        $ xfs_io -c "falloc -k 128K 128K" /mnt/foo
      
        # Fsync the file to record both extents in the log tree.
        $ xfs_io -c "fsync" /mnt/foo
      
        # Now do a redundant extent allocation for the range from 0 to 64Kb.
        # This will merely increase the file size from 0 to 64Kb. Instead we
        # could also do a truncate to set the file size to 64Kb.
        $ xfs_io -c "falloc 0 64K" /mnt/foo
      
        # Fsync the file, so we update the inode item in the log tree with the
        # new file size (64Kb). This also ends up setting the number of bytes
        # for the first prealloc extent to 64Kb. This is done by the truncation
        # at btrfs_log_prealloc_extents().
        # This means that if a power failure happens after this, a write into
        # the file range 64Kb to 128Kb will not use the prealloc extent and
        # will result in allocation of a new extent.
        $ xfs_io -c "fsync" /mnt/foo
      
        # Now set the file size to 256K with a truncate and then fsync the file.
        # Since no changes happened to the extents, the fsync only updates the
        # i_size in the inode item at the log tree. This results in an implicit
        # hole for the file range from 64Kb to 128Kb, something fsck will
        # complain about when not using the NO_HOLES feature, if we replay
        # the log after a power failure.
        $ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo
      
      So instead of always truncating the log to the inode's current i_size at
      btrfs_log_prealloc_extents(), check first if there's a prealloc extent
      that starts at an offset lower than the i_size and with a length that
      crosses the i_size - if there is one, just make sure we truncate to a
      size that corresponds to the end offset of that prealloc extent, so
      that we don't lose the part of that extent that starts at i_size if a
      power failure happens.
      
      A test case for fstests follows soon.
      
      Fixes: 31d11b83 ("Btrfs: fix duplicate extents after fsync of file with prealloc extents")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • propagate_one(): mnt_set_mountpoint() needs mount_lock · b0d3869c
      Authored by Al Viro
      ... to protect the modification of mp->m_count done by it.  Most of
      the places that modify that thing also have namespace_lock held,
      but not all of them can do so, so we really need mount_lock here.
      Kudos to Piotr Krysiuk <piotras@gmail.com>, who'd spotted a related
      bug in pivot_root(2) (fixed unnoticed in 5.3); search for other
      similar turds has caught out this one.
      
      Cc: stable@kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • configfs: fix config_item refcnt leak in configfs_rmdir() · 8aebfffa
      Authored by Xiyu Yang
      configfs_rmdir() invokes configfs_get_config_item(), which returns a
      reference of the specified config_item object to "parent_item" with
      increased refcnt.
      
      When configfs_rmdir() returns, local variable "parent_item" becomes
      invalid, so the refcount should be decreased to keep refcount balanced.
      
      The reference counting issue happens in one exception handling path of
      configfs_rmdir(). When down_write_killable() fails, the function forgets
      to decrease the refcnt increased by configfs_get_config_item(), causing
      a refcnt leak.
      
      Fix this issue by calling config_item_put() when down_write_killable()
      fails.
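      The shape of the leak and its fix can be sketched with a minimal refcount model (illustrative only; the names below are made up, not the configfs API): the "get" takes a reference that must be dropped on every exit path, including the early-error path where the lock cannot be taken.

      ```c
      #include <assert.h>

      struct item { int refcnt; };

      void item_get(struct item *it) { it->refcnt++; }
      void item_put(struct item *it) { it->refcnt--; }

      /* Returns 0 on success, -1 when the "down_write_killable" step fails.
       * With the fix, the reference taken at entry is released on both
       * paths, so the refcount is balanced either way. */
      int rmdir_like(struct item *parent, int lock_ok)
      {
          item_get(parent);            /* configfs_get_config_item() */
          if (!lock_ok) {
              item_put(parent);        /* the fix: drop the ref on the
                                          error path too */
              return -1;
          }
          /* ... perform the removal under the lock ... */
          item_put(parent);
          return 0;
      }
      ```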
      Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  16. 25 April 2020, 2 commits
  17. 24 April 2020, 4 commits