1. 04 Sep, 2021 14 commits
  2. 26 Aug, 2021 1 commit
    • pipe: do FASYNC notifications for every pipe IO, not just state changes · fe67f4dd
      Committed by Linus Torvalds
      It turns out that the SIGIO/FASYNC situation is almost exactly the same
      as the EPOLLET case was: user space really wants to be notified after
      every operation.
      
      Now, in a perfect world it should be sufficient to only notify user
      space on "state transitions" when the IO state changes (ie when a pipe
      goes from unreadable to readable, or from unwritable to writable).  User
      space should then do as much as possible - fully emptying the buffer or
      what not - and we'll notify it again the next time the state changes.
      
      But as with EPOLLET, we have at least one case (stress-ng) where the
      kernel sent SIGIO due to the pipe being marked for asynchronous
      notification, but the user space signal handler then didn't actually
      necessarily read it all before returning (it read more than what was
      written, but since there could be multiple writes, it could leave data
      pending).
      
      The user space code then expected to get another SIGIO for subsequent
      writes - even though the pipe had been readable the whole time - and
      would only then read more.
      
      This is arguably a user space bug - and Colin King already fixed the
      stress-ng code in question - but the kernel regression rules are clear:
      it doesn't matter if kernel people think that user space did something
      silly and wrong.  What matters is that it used to work.
      
      So if user space depends on specific historical kernel behavior, it's a
      regression when that behavior changes.  It's on us: we were silly to
      have that non-optimal historical behavior, and our old kernel behavior
      was what user space was tested against.
      
      Because of how the FASYNC notification was tied to wakeup behavior, this
      was first broken by commits f467a6a6 and 1b6b26ae ("pipe: fix
      and clarify pipe read/write wakeup logic"), but at the time it seems
      nobody noticed.  Probably because the stress-ng problem case ends up
      being timing-dependent too.
      
      It was then unwittingly fixed by commit 3a34b13a ("pipe: make pipe
      writes always wake up readers") only to be broken again by commit
      3b844826 ("pipe: avoid unnecessary EPOLLET wakeups under normal
      loads").

      And at that point the kernel test robot noticed the performance
      regression in the stress-ng.sigio.ops_per_sec case.  So the "Fixes" tag
      below is somewhat ad hoc, but it matches when the issue was noticed.
      
      Fix it for good (knock wood) by simply making the kill_fasync() case
      separate from the wakeup case.  FASYNC is quite rare, and we clearly
      shouldn't even try to use the "avoid unnecessary wakeups" logic for it.
      
      Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
      Fixes: 3b844826 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Tested-by: Oliver Sang <oliver.sang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
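The failure mode described above comes from a SIGIO handler that can return while data is still pending in the pipe. A minimal sketch of the more robust user-space pattern, assuming a non-blocking pipe read end: drain until `EAGAIN` on every signal, so the program works whether the kernel signals on every write or only on state transitions. The names `drain_fd`, `on_sigio` and `watch_pipe_with_sigio` are hypothetical helpers for illustration; they are not taken from stress-ng or from the kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Read from a non-blocking fd until it is empty.
 * Returns the number of bytes drained, or -1 on a real error. */
long drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            total += n;
            continue;               /* keep going: more data may be queued */
        }
        if (n == 0)
            return total;           /* every writer has closed the pipe */
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;           /* pipe is empty now */
        return -1;
    }
}

static int sigio_fd = -1;                    /* fd drained by the handler */
static volatile sig_atomic_t sigio_bytes;    /* running total, for inspection */

/* SIGIO handler: only calls read(), which is async-signal-safe. */
void on_sigio(int signo)
{
    int saved_errno = errno;
    long n;

    (void)signo;
    n = drain_fd(sigio_fd);
    if (n > 0)
        sigio_bytes += (sig_atomic_t)n;
    errno = saved_errno;
}

/* Ask the kernel to send SIGIO to this process for activity on fd. */
int watch_pipe_with_sigio(int fd)
{
    struct sigaction sa = {0};
    int flags;

    assert(fd >= 0);
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;

    sigio_fd = fd;
    if (fcntl(fd, F_SETOWN, getpid()) < 0)
        return -1;
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK);
}
```

A caller would create a pipe, pass the read end to `watch_pipe_with_sigio()`, and let writers on the other end trigger the handler. Because the handler loops until the pipe is empty, several writes that arrive before it runs are all consumed in one pass.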
  3. 25 Aug, 2021 3 commits
  4. 21 Aug, 2021 2 commits
  5. 19 Aug, 2021 1 commit
    • pipe: avoid unnecessary EPOLLET wakeups under normal loads · 3b844826
      Committed by Linus Torvalds
      I had forgotten just how sensitive hackbench is to extra pipe wakeups,
      and commit 3a34b13a ("pipe: make pipe writes always wake up
      readers") ended up causing a quite noticeable regression on larger
      machines.
      
      Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
      not clear that this matters in real life all that much, but as Mel
      points out, it's used often enough when comparing kernels and so the
      performance regression shows up like a sore thumb.
      
      It's easy enough to fix at least for the common cases where pipes are
      used purely for data transfer, and you never have any exciting poll
      usage at all.  So set a special 'poll_usage' flag when there is polling
      activity, and make the ugly "EPOLLET has crazy legacy expectations"
      semantics explicit to only that case.
      
      I would love to limit it to just the broken EPOLLET case, but the pipe
      code can't see the difference between epoll and regular select/poll, so
      any non-read/write waiting will trigger the extra wakeup behavior.  That
      is sufficient for at least the hackbench case.
      
      Apart from making the odd extra wakeup cases more explicitly about
      EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
      write, not at the first write chunk.  That is actually much saner
      semantics (as much as you can call any of the legacy edge-triggered
      expectations for EPOLLET "sane") since it means that you know the wakeup
      will happen once the write is done, rather than possibly in the middle
      of one.
      
      [ For stable people: I'm putting a "Fixes" tag on this, but I leave it
        up to you to decide whether you actually want to backport it or not.
        It likely has no impact outside of synthetic benchmarks  - Linus ]
      
      Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
      Fixes: 3a34b13a ("pipe: make pipe writes always wake up readers")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Tested-by: Sandeep Patil <sspatil@android.com>
      Tested-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
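The wakeup rule this commit describes can be sketched as a small user-space model. This is a toy, not the kernel code: `struct toy_pipe` and the `toy_pipe_*` functions are hypothetical, and only the `poll_usage` flag name comes from the commit message. The model assumes that a write wakes a reader when the pipe was empty beforehand, or when anything has ever polled the pipe, and that the decision is made once at the end of the write. It ignores blocking on a full pipe and all locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the kernel's pipe state. */
struct toy_pipe {
    size_t   buffered;    /* bytes currently queued in the pipe */
    bool     poll_usage;  /* set once select/poll/epoll has touched the pipe */
    unsigned wakeups;     /* number of reader wakeups issued so far */
};

/* Models the poll path: remember that someone is polling this pipe. */
void toy_pipe_poll(struct toy_pipe *p)
{
    p->poll_usage = true;
}

/* Models a write of len bytes that is copied in chunk-sized pieces.
 * The wakeup decision happens once, after the last chunk. */
void toy_pipe_write(struct toy_pipe *p, size_t len, size_t chunk)
{
    bool was_empty;

    assert(chunk > 0);
    if (len == 0)
        return;

    was_empty = (p->buffered == 0);
    while (len > 0) {
        size_t n = len < chunk ? len : chunk;
        p->buffered += n;
        len -= n;
    }

    /* Plain data transfer: wake only on the empty -> non-empty transition.
     * Polled pipes: wake on every write, for legacy EPOLLET users. */
    if (was_empty || p->poll_usage)
        p->wakeups++;
}
```

In this model a pipe that is only read and written gets one wakeup per empty-to-readable transition, which is the hackbench-friendly case, while a pipe that has been polled gets one wakeup per write.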
  6. 18 Aug, 2021 1 commit
  7. 16 Aug, 2021 1 commit
    • btrfs: prevent rename2 from exchanging a subvol with a directory from different parents · 3f79f6f6
      Committed by NeilBrown
      Cross-rename lacks a check that would prevent exchanging a
      directory and a subvolume from different parent subvolumes. This causes
      data inconsistencies and is caught before commit by tree-checker,
      turning the filesystem read-only.
      
      Calling renameat2 with the RENAME_EXCHANGE flag, like

        renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1))

      on two paths:

        namesrc = dir1/subvol1/dir2
        namedest = subvol2/subvol3

      will cause a key order problem, with the following write-time
      tree-checker report:
      
        [1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258
        [1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5
        [1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561
        [1194842.338772]        item 0 key (256 1 0) itemoff 16123 itemsize 160
        [1194842.338793]                inode generation 3 size 16 mode 40755
        [1194842.338801]        item 1 key (256 12 256) itemoff 16111 itemsize 12
        [1194842.338809]        item 2 key (256 84 2248503653) itemoff 16077 itemsize 34
        [1194842.338817]                dir oid 258 type 2
        [1194842.338823]        item 3 key (256 84 2363071922) itemoff 16043 itemsize 34
        [1194842.338830]                dir oid 257 type 2
        [1194842.338836]        item 4 key (256 96 2) itemoff 16009 itemsize 34
        [1194842.338843]        item 5 key (256 96 3) itemoff 15975 itemsize 34
        [1194842.338852]        item 6 key (257 1 0) itemoff 15815 itemsize 160
        [1194842.338863]                inode generation 6 size 8 mode 40755
        [1194842.338869]        item 7 key (257 12 256) itemoff 15801 itemsize 14
        [1194842.338876]        item 8 key (257 84 2505409169) itemoff 15767 itemsize 34
        [1194842.338883]                dir oid 256 type 2
        [1194842.338888]        item 9 key (257 96 2) itemoff 15733 itemsize 34
        [1194842.338895]        item 10 key (258 12 256) itemoff 15719 itemsize 14
        [1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption detected
        [1194842.339245] ------------[ cut here ]------------
        [1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs]
        [1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793
        [1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
        [1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
        [1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
        [1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282
        [1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978
        [1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970
        [1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003
        [1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
        [1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000
        [1194842.512092] FS:  0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000
        [1194842.512095] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0
        [1194842.512103] Call Trace:
        [1194842.512128]  ? run_one_async_free+0x10/0x10 [btrfs]
        [1194842.631729]  btree_csum_one_bio+0x1ac/0x1d0 [btrfs]
        [1194842.631837]  run_one_async_start+0x18/0x30 [btrfs]
        [1194842.631938]  btrfs_work_helper+0xd5/0x1d0 [btrfs]
        [1194842.647482]  process_one_work+0x262/0x5e0
        [1194842.647520]  worker_thread+0x4c/0x320
        [1194842.655935]  ? process_one_work+0x5e0/0x5e0
        [1194842.655946]  kthread+0x135/0x160
        [1194842.655953]  ? set_kthread_struct+0x40/0x40
        [1194842.655965]  ret_from_fork+0x1f/0x30
        [1194842.672465] irq event stamp: 1729
        [1194842.672469] hardirqs last  enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0
        [1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0
        [1194842.672482] softirqs last  enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a
        [1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0
      
      The corrupted data will not be written, and the filesystem can be
      unmounted and mounted again (all changes since the last commit will be
      lost).

      Add the missing check for new_ino so that all non-subvolumes must reside
      under the same parent subvolume. There's an exception that allows
      exchanging two subvolumes from any parents, as the directory representing
      a subvolume is only a logical link and does not have any other structures
      related to the parent subvolume, unlike files, directories etc., which
      are always in the inode namespace of the parent subvolume.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.7+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
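The rule stated in the commit message can be sketched as a pure function. This is a model built from the description, not the btrfs source: `struct toy_entry`, `toy_exchange_check_old` and `toy_exchange_check_new` are hypothetical names, and the "old" variant is only an assumption about what a check on the source inode alone would look like. The sketch also assumes that a subvolume is represented by a directory with the fixed inode number 256 (btrfs calls this `BTRFS_FIRST_FREE_OBJECTID`).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed inode number of the directory that represents a subvolume. */
#define TOY_SUBVOL_ROOT_INO 256ULL

/* Hypothetical description of one side of a RENAME_EXCHANGE. */
struct toy_entry {
    uint64_t parent_subvol;  /* id of the subvolume whose tree holds the entry */
    uint64_t ino;            /* inode number of the entry itself */
};

/* Assumed pre-fix behaviour: only the source side is inspected. */
int toy_exchange_check_old(const struct toy_entry *src,
                           const struct toy_entry *dst)
{
    assert(src && dst);
    if (src->ino != TOY_SUBVOL_ROOT_INO &&
        src->parent_subvol != dst->parent_subvol)
        return -EXDEV;
    return 0;
}

/* Rule from the commit message: across different parents, both sides must
 * be subvolumes; everything else must stay within one parent subvolume. */
int toy_exchange_check_new(const struct toy_entry *src,
                           const struct toy_entry *dst)
{
    assert(src && dst);
    if (src->parent_subvol != dst->parent_subvol &&
        (src->ino != TOY_SUBVOL_ROOT_INO ||
         dst->ino != TOY_SUBVOL_ROOT_INO))
        return -EXDEV;
    return 0;
}
```

With a subvolume on the source side and a plain directory from a different parent on the destination side, the source-only check lets the exchange through, while the symmetric check rejects it with `-EXDEV`. Two subvolumes from different parents pass both checks.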
  8. 15 Aug, 2021 1 commit
    • io_uring: only assign io_uring_enter() SQPOLL error in actual error case · 21f96522
      Committed by Jens Axboe
      If an SQPOLL based ring is newly created and an application issues an
      io_uring_enter(2) system call on it, then we can return a spurious
      -EOWNERDEAD error. This happens because there's nothing to submit, and
      if the caller doesn't specify any other action, the initial error
      assignment of -EOWNERDEAD never gets overwritten. This causes us to
      return it directly, even if it isn't valid.
      
      Move the error assignment into the actual failure case instead.
      
      Cc: stable@vger.kernel.org
      Fixes: d9d05217 ("io_uring: stop SQPOLL submit on creator's death")
      Reported-by: Sherlock Holo <sherlockya@gmail.com>
      Link: https://github.com/axboe/liburing/issues/413
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
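The control-flow problem described here is a general one: an error code assigned up front leaks out when no later step overwrites it. A minimal sketch of the before and after shapes, using hypothetical names (`struct toy_sqpoll_ring`, `toy_enter_before`, `toy_enter_after`) and a drastically reduced model of the SQPOLL path in which the only possible failure is a dead submission thread:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an SQPOLL ring: only tracks thread liveness. */
struct toy_sqpoll_ring {
    bool sq_thread_alive;
};

/* Before: the error is assigned first and survives when nothing is submitted. */
int toy_enter_before(const struct toy_sqpoll_ring *ring, unsigned to_submit)
{
    int ret = -EOWNERDEAD;   /* pessimistic default */
    int submitted = 0;

    assert(ring);
    if (!ring->sq_thread_alive)
        goto out;
    submitted = (int)to_submit;   /* the SQ thread will pick these up */
out:
    return submitted ? submitted : ret;
}

/* After: the error is assigned only in the branch that actually failed. */
int toy_enter_after(const struct toy_sqpoll_ring *ring, unsigned to_submit)
{
    int ret = 0;
    int submitted = 0;

    assert(ring);
    if (!ring->sq_thread_alive) {
        ret = -EOWNERDEAD;
        goto out;
    }
    submitted = (int)to_submit;
out:
    return submitted ? submitted : ret;
}
```

On a healthy ring with nothing to submit, the first variant returns `-EOWNERDEAD` and the second returns 0. Both still report the error when the submission thread is really gone.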
  9. 13 Aug, 2021 2 commits
  10. 10 Aug, 2021 11 commits
  11. 09 Aug, 2021 3 commits