1. 09 Feb, 2020 10 commits
  2. 07 Feb, 2020 3 commits
    • P
      io_uring: fix deferred req iovec leak · 1e95081c
      Pavel Begunkov committed
      After a defer, a request is prepared, which includes allocating an iovec
      if needed, and is then submitted through io_wq_submit_work() rather than
      a custom handler (e.g. io_rw_async()/io_sendrecv_async()). However, it
      leaks the iovec, because it runs in io-wq and the code goes as follows:
      
      io_read() {
      	if (!io_wq_current_is_worker())
      		kfree(iovec);
      }
      
      Put all deallocation logic in io_{read,write,send,recv}(), which leave
      the memory allocated only when going async with -EAGAIN.
      
      It also fixes a leak after failed io_alloc_async_ctx() in
      io_{recv,send}_msg().
      
      Cc: stable@vger.kernel.org # 5.5
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1e95081c
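The ownership rule described in the iovec-leak fix above can be sketched in userspace C. This is only an illustration under stated assumptions: `req_sketch`, `io_read_sketch()` and `req_cleanup_sketch()` are hypothetical names, not kernel APIs, and the `would_block` flag stands in for the real "operation returned -EAGAIN" condition.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/uio.h>

/* Hypothetical stand-in for a request's async context. */
struct req_sketch {
	struct iovec *saved_iovec;	/* owned by the request while a retry is pending */
};

/*
 * The handler frees the iovec on every exit path except -EAGAIN, where the
 * memory is handed over to the request so the async retry can reuse it.
 * Whether we run in a worker thread no longer influences the decision.
 */
static int io_read_sketch(struct req_sketch *req, bool would_block)
{
	struct iovec *iovec;

	assert(req != NULL);

	iovec = calloc(4, sizeof(*iovec));
	if (!iovec)
		return -ENOMEM;

	if (would_block) {
		req->saved_iovec = iovec;	/* leave the memory for the retry */
		return -EAGAIN;
	}

	free(iovec);			/* completed inline: always free */
	return 0;
}

/* Hypothetical helper: release whatever an earlier -EAGAIN left behind. */
static void req_cleanup_sketch(struct req_sketch *req)
{
	free(req->saved_iovec);
	req->saved_iovec = NULL;
}
```

Keeping the free next to the allocation makes the single exception (-EAGAIN) easy to audit, which is the point of moving the logic into the handlers themselves.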
    • R
      io_uring: fix 1-bit bitfields to be unsigned · e1d85334
      Randy Dunlap committed
      Make bitfields of size 1 bit be unsigned (since there is no room
      for the sign bit).
      This clears up the sparse warnings:
      
        CHECK   ../fs/io_uring.c
      ../fs/io_uring.c:207:50: error: dubious one-bit signed bitfield
      ../fs/io_uring.c:208:55: error: dubious one-bit signed bitfield
      ../fs/io_uring.c:209:63: error: dubious one-bit signed bitfield
      ../fs/io_uring.c:210:54: error: dubious one-bit signed bitfield
      ../fs/io_uring.c:211:57: error: dubious one-bit signed bitfield
      
      Found by sight and then verified with sparse.
      
      Fixes: 69b3e546 ("io_uring: change io_ring_ctx bool fields into bit fields")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: io-uring@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e1d85334
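Why a one-bit signed bitfield is "dubious" can be shown with a small userspace sketch. The struct and function names below are hypothetical; the result of storing 1 into a signed one-bit field is implementation-defined in C (common compilers yield -1), so the only portable guarantee is that the value read back is not 1.

```c
#include <assert.h>

/* A signed 1-bit field has no room for a value bit next to the sign bit. */
struct flags_signed_sketch {
	signed int flag : 1;
};

/* An unsigned 1-bit field can hold exactly 0 and 1. */
struct flags_unsigned_sketch {
	unsigned int flag : 1;
};

/* Store v (0 or 1) into the signed field and read it back. */
static int store_signed_sketch(int v)
{
	struct flags_signed_sketch s;

	assert(v == 0 || v == 1);
	s.flag = v;
	return s.flag;
}

/* Store v (0 or 1) into the unsigned field and read it back. */
static int store_unsigned_sketch(int v)
{
	struct flags_unsigned_sketch u;

	assert(v == 0 || v == 1);
	u.flag = v;
	return u.flag;
}
```

A test such as `if (ctx->flag == 1)` therefore silently fails with the signed variant, which is what the sparse warning guards against.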
    • P
      io_uring: get rid of delayed mm check · 1cb1edb2
      Pavel Begunkov committed
      Fail fast if we can't grab the mm, so that past that point requests
      always have an mm when required. This allows us to remove req->user
      altogether.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1cb1edb2
  3. 05 Feb, 2020 2 commits
    • J
      io_uring: cleanup fixed file data table references · 2faf852d
      Jens Axboe committed
      syzbot reports a use-after-free in io_ring_file_ref_switch() when it
      tries to switch back to percpu mode. When we put the final reference to
      the table by calling percpu_ref_kill_and_confirm(), we don't want the
      zero reference to queue async work for flushing the potentially queued
      up items. We currently do a few flush_work(), but they merely paper
      over the issue, since the work item may not have been queued yet
      depending on when the percpu-ref callback gets run.
      
      Coming into the file unregister, we know we have the ring quiesced.
      io_ring_file_ref_switch() can check whether the ref is dying, and not
      queue anything async at that point. Once the ref has been confirmed
      killed, flush any potential items manually.
      
      Reported-by: syzbot+7caeaea49c2c8a591e3d@syzkaller.appspotmail.com
      Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2faf852d
    • J
      io_uring: spin for sq thread to idle on shutdown · df069d80
      Jens Axboe committed
      As part of io_uring shutdown, we cancel work that is pending and won't
      necessarily complete on its own. That includes requests like poll
      commands and timeouts.
      
      If we're using SQPOLL for kernel side submission and we shut down the
      ring immediately after queueing such work, we can race with the sqthread
      doing the submission. This means we may miss cancelling some work, which
      results in the io_uring shutdown hanging forever.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      df069d80
  4. 04 Feb, 2020 16 commits
    • M
      treewide: remove redundant IS_ERR() before error code check · 45586c70
      Masahiro Yamada committed
      'PTR_ERR(p) == -E*' is a stronger condition than IS_ERR(p).
      Hence, IS_ERR(p) is unneeded.
      
      The semantic patch that generates this commit is as follows:
      
      // <smpl>
      @@
      expression ptr;
      constant error_code;
      @@
      -IS_ERR(ptr) && (PTR_ERR(ptr) == - error_code)
      +PTR_ERR(ptr) == - error_code
      // </smpl>
      
      Link: http://lkml.kernel.org/r/20200106045833.1725-1-masahiroy@kernel.org
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Julia Lawall <julia.lawall@lip6.fr>
      Acked-by: Stephen Boyd <sboyd@kernel.org> [drivers/clk/clk.c]
      Acked-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> [GPIO]
      Acked-by: Wolfram Sang <wsa@the-dreams.de> [drivers/i2c]
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [acpi/scan.c]
      Acked-by: Rob Herring <robh@kernel.org>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45586c70
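The claim that `PTR_ERR(p) == -E*` already implies `IS_ERR(p)` can be checked with a userspace re-creation of the error-pointer helpers. Everything below is a simplified sketch: the `*_sketch` names are hypothetical stand-ins modelled on the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() idea, with the error range assumed to be the top 4095 addresses.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095	/* assumed size of the error-pointer range */

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr_sketch(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode a pointer back into a (possibly meaningless) long value. */
static inline long ptr_err_sketch(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the reserved error range. */
static inline bool is_err_sketch(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

/* Old form: explicit IS_ERR() guard before comparing the code. */
static bool is_enoent_old_sketch(const void *p)
{
	return is_err_sketch(p) && ptr_err_sketch(p) == -ENOENT;
}

/* New form: the equality alone can only hold for an error pointer. */
static bool is_enoent_new_sketch(const void *p)
{
	return ptr_err_sketch(p) == -ENOENT;
}
```

Both forms agree for error pointers carrying the tested code, for error pointers carrying a different code, and for ordinary valid pointers, which is why the semantic patch can drop the guard.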
    • A
      proc: convert everything to "struct proc_ops" · 97a32539
      Alexey Dobriyan committed
      The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
      seq_file.h.
      
      Conversion rule is:
      
      	llseek		=> proc_lseek
      	unlocked_ioctl	=> proc_ioctl
      
      	xxx		=> proc_xxx
      
      	delete ".owner = THIS_MODULE" line
      
      [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
      [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
        Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97a32539
    • A
      proc: decouple proc from VFS with "struct proc_ops" · d56c0d45
      Alexey Dobriyan committed
      Currently core /proc code uses "struct file_operations" for custom hooks,
      however, VFS doesn't directly call them.  Every time VFS expands
      file_operations hook set, /proc code bloats for no reason.
      
      Introduce "struct proc_ops" which contains only those hooks which /proc
      allows to call into (open, release, read, write, ioctl, mmap, poll).  It
      doesn't contain a module pointer either.
      
      Save ~184 bytes per usage:
      
      	add/remove: 26/26 grow/shrink: 1/4 up/down: 1922/-6674 (-4752)
      	Function                                     old     new   delta
      	sysvipc_proc_ops                               -      72     +72
      				...
      	config_gz_proc_ops                             -      72     +72
      	proc_get_inode                               289     339     +50
      	proc_reg_get_unmapped_area                   110     107      -3
      	close_pdeo                                   227     224      -3
      	proc_reg_open                                289     284      -5
      	proc_create_data                              60      53      -7
      	rt_cpu_seq_fops                              256       -    -256
      				...
      	default_affinity_proc_fops                   256       -    -256
      	Total: Before=5430095, After=5425343, chg -0.09%
      
      Link: http://lkml.kernel.org/r/20191225172228.GA13378@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d56c0d45
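The size figures quoted in the "struct proc_ops" commit above are internally consistent, which a few lines of C can confirm. The helper names are hypothetical; the numbers are taken directly from the bloat-o-meter output in the message.

```c
#include <assert.h>

/* Net size change: bytes added ("up") plus bytes removed ("down", negative). */
static long net_delta_sketch(long up, long down)
{
	return up + down;
}

/* Bytes saved by replacing one old ops table with one new one. */
static long per_usage_saving_sketch(long old_size, long new_size)
{
	assert(old_size >= new_size);
	return old_size - new_size;
}

/* Relative change of the total, in percent. */
static double percent_change_sketch(long before, long after)
{
	return 100.0 * (double)(after - before) / (double)before;
}
```

With the reported values, each converted table shrinks from 256 to 72 bytes (184 saved), the net delta is 1922 - 6674 = -4752 bytes, and the total moves from 5430095 to 5425343, a change of roughly -0.09%.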
    • S
      mm: pagewalk: add 'depth' parameter to pte_hole · b7a16c7a
      Steven Price committed
      The pte_hole() callback is called at multiple levels of the page tables.
      Code dumping the kernel page tables needs to know at what depth the
      missing entry is.  Add this as an extra parameter to pte_hole().  When the
      depth isn't known (e.g.  processing a vma) then -1 is passed.
      
      The depth that is reported is the actual level where the entry is missing
      (ignoring any folding that is in place), i.e.  any levels where
      PTRS_PER_P?D is set to 1 are ignored.
      
      Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
      natural numbers as levels 2/3/4.
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Tested-by: Zong Li <zong.li@sifive.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7a16c7a
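The depth numbering described for pte_hole() can be written down as a small lookup. This is a sketch only: `pte_hole_depth_name_sketch()` is a hypothetical helper, and placing P4D at depth 1 is an assumption that follows from PGD being 0 and PUD being 2 on five-level page tables.

```c
#include <assert.h>
#include <string.h>

/*
 * Map the 'depth' argument to the page-table level it denotes.
 * -1 means the caller did not know the depth (e.g. when walking a vma).
 */
static const char *pte_hole_depth_name_sketch(int depth)
{
	switch (depth) {
	case 0:
		return "PGD";
	case 1:
		return "P4D";	/* assumed: the level between PGD and PUD */
	case 2:
		return "PUD";
	case 3:
		return "PMD";
	case 4:
		return "PTE";
	case -1:
		return "unknown";
	default:
		return "invalid";
	}
}
```

A page-table dumper could use such a mapping to label a hole with the level at which the entry is actually missing, independent of any folded levels.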
    • D
      fs/proc/page.c: allow inspection of last section and fix end detection · abec749f
      David Hildenbrand committed
      If max_pfn does not fall onto a section boundary, it is possible to
      inspect PFNs up to max_pfn, and PFNs above max_pfn, however, max_pfn
      itself can't be inspected.  We can have a valid (and online) memmap at and
      above max_pfn if max_pfn is not aligned to a section boundary.  The whole
      early section has a memmap and is marked online.  Being able to inspect
      the state of these PFNs is valuable for debugging, especially because
      max_pfn can change on memory hotplug and expose these memmaps.
      
      Also, querying page flags via "./page-types -r -a 0x144001,"
      (tools/vm/page-types.c) inside a x86-64 guest with 4160MB under QEMU
      results in an (almost) endless loop in user space, because the end is not
      detected properly when starting after max_pfn.
      
      Instead, let's allow inspecting all pages in the highest section and
      return 0 directly if we try to access pages above that section.
      
      While at it, check the count before adjusting it, to avoid masking user
      errors.
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abec749f
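The end-detection rule from the fs/proc/page.c commit above can be sketched as follows. The constants and function names are assumptions for illustration: a section is taken to be 0x8000 pages (128 MiB of 4 KiB pages), each PFN entry is taken to be 8 bytes, and max_pfn = 0x144000 is an illustrative value chosen to match the `0x144001` offset in the commit's page-types invocation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PAGES_PER_SECTION_SKETCH 0x8000ULL	/* assumed section size in pages */
#define ENTRY_SIZE_SKETCH 8ULL			/* assumed bytes per PFN entry */

/* First PFN past the section that contains max_pfn (round up to a section). */
static uint64_t max_dump_pfn_sketch(uint64_t max_pfn)
{
	return (max_pfn + PAGES_PER_SECTION_SKETCH - 1) /
	       PAGES_PER_SECTION_SKETCH * PAGES_PER_SECTION_SKETCH;
}

/*
 * Number of bytes a read at byte offset src may return.
 * The alignment check comes first so that user errors are not masked
 * by the later clamping; 0 signals end of file.
 */
static int64_t kpage_read_len_sketch(uint64_t src, uint64_t count,
				     uint64_t max_pfn)
{
	uint64_t end = max_dump_pfn_sketch(max_pfn) * ENTRY_SIZE_SKETCH;
	uint64_t remaining;

	if (src % ENTRY_SIZE_SKETCH || count % ENTRY_SIZE_SKETCH)
		return -EINVAL;
	if (src >= end)
		return 0;

	remaining = end - src;
	return (int64_t)(count < remaining ? count : remaining);
}
```

With these assumptions, max_pfn itself and the rest of its section (up to 0x148000) are readable, while any offset at or beyond the section end returns 0 immediately instead of leaving user space to loop.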
    • G
      ocfs2: fix oops when writing cloned file · 2d797e9f
      Gang He committed
      Writing a cloned file triggers a kernel oops and the user-space command
      process is also killed by the system.  The bug can be reproduced stably
      via:
      
      1) create a file under ocfs2 file system directory.
      
        journalctl -b > aa.txt
      
      2) create a cloned file for this file.
      
        reflink aa.txt bb.txt
      
      3) write the cloned file with dd command.
      
        dd if=/dev/zero of=bb.txt bs=512 count=1 conv=notrunc
      
      The dd command is killed by the kernel, then you can see the oops message
      via dmesg command.
      
      [  463.875404] BUG: kernel NULL pointer dereference, address: 0000000000000028
      [  463.875413] #PF: supervisor read access in kernel mode
      [  463.875416] #PF: error_code(0x0000) - not-present page
      [  463.875418] PGD 0 P4D 0
      [  463.875425] Oops: 0000 [#1] SMP PTI
      [  463.875431] CPU: 1 PID: 2291 Comm: dd Tainted: G           OE     5.3.16-2-default
      [  463.875433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      [  463.875500] RIP: 0010:ocfs2_refcount_cow+0xa4/0x5d0 [ocfs2]
      [  463.875505] Code: 06 89 6c 24 38 89 eb f6 44 24 3c 02 74 be 49 8b 47 28
      [  463.875508] RSP: 0018:ffffa2cb409dfce8 EFLAGS: 00010202
      [  463.875512] RAX: ffff8b1ebdca8000 RBX: 0000000000000001 RCX: ffff8b1eb73a9df0
      [  463.875515] RDX: 0000000000056a01 RSI: 0000000000000000 RDI: 0000000000000000
      [  463.875517] RBP: 0000000000000001 R08: ffff8b1eb73a9de0 R09: 0000000000000000
      [  463.875520] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
      [  463.875522] R13: ffff8b1eb922f048 R14: 0000000000000000 R15: ffff8b1eb922f048
      [  463.875526] FS:  00007f8f44d15540(0000) GS:ffff8b1ebeb00000(0000) knlGS:0000000000000000
      [  463.875529] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  463.875532] CR2: 0000000000000028 CR3: 000000003c17a000 CR4: 00000000000006e0
      [  463.875546] Call Trace:
      [  463.875596]  ? ocfs2_inode_lock_full_nested+0x18b/0x960 [ocfs2]
      [  463.875648]  ocfs2_file_write_iter+0xaf8/0xc70 [ocfs2]
      [  463.875672]  new_sync_write+0x12d/0x1d0
      [  463.875688]  vfs_write+0xad/0x1a0
      [  463.875697]  ksys_write+0xa1/0xe0
      [  463.875710]  do_syscall_64+0x60/0x1f0
      [  463.875743]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  463.875758] RIP: 0033:0x7f8f4482ed44
      [  463.875762] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00
      [  463.875765] RSP: 002b:00007fff300a79d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  463.875769] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8f4482ed44
      [  463.875771] RDX: 0000000000000200 RSI: 000055f771b5c000 RDI: 0000000000000001
      [  463.875774] RBP: 0000000000000200 R08: 00007f8f44af9c78 R09: 0000000000000003
      [  463.875776] R10: 000000000000089f R11: 0000000000000246 R12: 000055f771b5c000
      [  463.875779] R13: 0000000000000200 R14: 0000000000000000 R15: 000055f771b5c000
      
      This regression problem was introduced by commit e74540b2 ("ocfs2:
      protect extent tree in ocfs2_prepare_inode_for_write()").
      
      Link: http://lkml.kernel.org/r/20200121050153.13290-1-ghe@suse.com
      Fixes: e74540b2 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()").
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d797e9f
    • J
      aio: prevent potential eventfd recursion on poll · 01d7a356
      Jens Axboe committed
      If we have nested or circular eventfd wakeups, then we can deadlock if
      we run them inline from our poll waitqueue wakeup handler. It's also
      possible to have very long chains of notifications, to the extent where
      we could risk blowing the stack.
      
      Check the eventfd recursion count before calling eventfd_signal(). If
      it's non-zero, then punt the signaling to async context. This is always
      safe, as it takes us out-of-line in terms of stack and locking context.
      
      Cc: stable@vger.kernel.org # 4.19+
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      01d7a356
    • P
      io_uring: put the flag changing code in the same spot · 3e577dcd
      Pavel Begunkov committed
      Both iocb_flags() and kiocb_set_rw_flags() are inline and modify
      kiocb->ki_flags. Place them close together, so they can potentially be
      better optimised.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3e577dcd
    • P
      io_uring: iterate req cache backwards · 6c8a3134
      Pavel Begunkov committed
      Grab requests from the cache array starting at the end, so we can get by
      with only free_reqs.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6c8a3134
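Taking entries from the end of the cache array means a single counter doubles as the index of the next entry. A minimal userspace sketch, assuming a hypothetical `req_cache_sketch` structure (the field name `free_reqs` comes from the commit message, the rest is illustrative):

```c
#include <assert.h>
#include <stddef.h>

#define REQ_CACHE_SIZE_SKETCH 8

/* Hypothetical request cache: free_reqs counts the entries still cached. */
struct req_cache_sketch {
	void *reqs[REQ_CACHE_SIZE_SKETCH];
	unsigned int free_reqs;
};

/*
 * Pop from the back: after the decrement, free_reqs is exactly the index
 * of the entry to hand out, so no separate cursor is needed.
 */
static void *cache_get_sketch(struct req_cache_sketch *cache)
{
	assert(cache != NULL);

	if (!cache->free_reqs)
		return NULL;	/* empty: a real caller would refill here */

	cache->free_reqs--;
	return cache->reqs[cache->free_reqs];
}
```

Iterating forwards would need both a count and a "next" position; consuming backwards lets the count serve both roles.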
    • J
      io_uring: punt even fadvise() WILLNEED to async context · 3e69426d
      Jens Axboe committed
      Andres correctly points out that read-ahead can block, if it needs to
      read in meta data (or even just through the page cache page allocations).
      Play it safe for now and just ensure WILLNEED is also punted to async
      context.
      
      While in there, allow the file settings hints from non-blocking
      context. They don't need to start/do IO, and we can safely do them
      inline.
      
      Fixes: 4840e418 ("io_uring: add IORING_OP_FADVISE")
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3e69426d
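The inline-versus-async decision described in the fadvise commit above might look roughly like this in userspace C. It is a sketch under assumptions: `fadvise_issue_sketch()` is a hypothetical function, and treating NORMAL, RANDOM and SEQUENTIAL as the "file settings hints" is an interpretation of the commit message rather than something it spells out.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Return -EAGAIN when the advice must be punted to async context,
 * or 0 when it is safe to handle in the current context.
 */
static int fadvise_issue_sketch(int advice, bool force_nonblock)
{
	if (force_nonblock) {
		switch (advice) {
		case POSIX_FADV_NORMAL:
		case POSIX_FADV_RANDOM:
		case POSIX_FADV_SEQUENTIAL:
			/* Assumed to only adjust file settings: no IO needed. */
			break;
		default:
			/* WILLNEED and others may start IO or block. */
			return -EAGAIN;
		}
	}

	/* A real implementation would apply the advice here. */
	return 0;
}
```

The default branch errs on the side of punting, matching the "play it safe for now" stance of the commit.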
    • J
      io_uring: fix sporadic double CQE entry for close · 1a417f4e
      Jens Axboe committed
      We punt close to async for the final fput(), but we log the completion
      before that, even in the punted case. We rely on the request not having
      a files table assigned to detect what the final async close should do.
      However, if we punt the async queue to __io_queue_sqe(), we'll get
      ->files assigned and this makes io_close_finish() think it should both
      close the filp again (which does no harm) AND log a new CQE event for
      this request. This causes duplicate CQEs.
      
      Queue the request up for async manually so we don't grab files
      needlessly and trigger this condition.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1a417f4e
    • P
      io_uring: remove extra ->file check · 9250f9ee
      Pavel Begunkov committed
      It won't ever get into io_prep_rw() when req->file hasn't been set in
      io_req_set_file(), hence remove the check.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9250f9ee
    • J
      io_uring: don't map read/write iovec potentially twice · 5d204bcf
      Jens Axboe committed
      If we have a read/write that is deferred, we already set up the async IO
      context for that request, and mapped it. When we later try and execute
      the request and we get -EAGAIN, we don't want to attempt to re-map it.
      If we do, we end up with garbage in the iovec, which typically leads
      to an -EFAULT or -EINVAL completion.
      
      Cc: stable@vger.kernel.org # 5.5
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5d204bcf
    • J
      io_uring: use the proper helpers for io_send/recv · 0b7b21e4
      Jens Axboe committed
      Don't use the recvmsg/sendmsg helpers, use the same helpers that the
      recv(2) and send(2) system calls use.
      Reported-by: 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0b7b21e4
    • J
      io_uring: prevent potential eventfd recursion on poll · f0b493e6
      Jens Axboe committed
      If we have nested or circular eventfd wakeups, then we can deadlock if
      we run them inline from our poll waitqueue wakeup handler. It's also
      possible to have very long chains of notifications, to the extent where
      we could risk blowing the stack.
      
      Check the eventfd recursion count before calling eventfd_signal(). If
      it's non-zero, then punt the signaling to async context. This is always
      safe, as it takes us out-of-line in terms of stack and locking context.
      
      Cc: stable@vger.kernel.org # 5.1+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f0b493e6
    • J
      eventfd: track eventfd_signal() recursion depth · b5e683d5
      Jens Axboe committed
      eventfd use cases from aio and io_uring can deadlock due to circular
      or recursive calling, when eventfd_signal() tries to grab the waitqueue
      lock. On top of that, it's also possible to construct notification
      chains that are deep enough that we could blow the stack.
      
      Add a percpu counter that tracks the percpu recursion depth, warn if we
      exceed it. The counter is also exposed so that users of eventfd_signal()
      can do the right thing if it's non-zero in the context where it is
      called.
      
      Cc: stable@vger.kernel.org # 4.19+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b5e683d5
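The recursion guard shared by the eventfd, aio and io_uring commits above can be illustrated with a single-threaded userspace sketch. All names are hypothetical, and a plain global counter stands in for the per-CPU counter the kernel uses; "deferring" is modelled by counting instead of queueing real async work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the per-CPU recursion depth (single-threaded sketch). */
static int wake_depth_sketch;
static int inline_signals_sketch;	/* signals delivered inline */
static int deferred_signals_sketch;	/* signals punted to "async context" */

typedef void (*wake_fn_sketch)(void);

/* Callers check this before signalling from a wakeup handler. */
static bool signal_would_recurse_sketch(void)
{
	return wake_depth_sketch != 0;
}

/* Deliver a signal inline; on_wake models a waiter's wakeup callback. */
static void signal_inline_sketch(wake_fn_sketch on_wake)
{
	wake_depth_sketch++;
	inline_signals_sketch++;
	if (on_wake)
		on_wake();
	wake_depth_sketch--;
}

/* Signal inline when safe, otherwise punt out of line. */
static void signal_or_defer_sketch(wake_fn_sketch on_wake)
{
	if (signal_would_recurse_sketch()) {
		deferred_signals_sketch++;
		return;
	}
	signal_inline_sketch(on_wake);
}

/* A wakeup callback that itself tries to signal (the nested case). */
static void nested_waker_sketch(void)
{
	signal_or_defer_sketch(NULL);
}
```

Calling `signal_or_defer_sketch(nested_waker_sketch)` delivers the outer signal inline and defers the nested one, so the call chain never grows deeper than one level and no lock is re-taken from within its own wakeup path.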
  5. 03 Feb, 2020 2 commits
    • M
      ovl: fix lseek overflow on 32bit · a4ac9d45
      Miklos Szeredi committed
      ovl_lseek() is using ssize_t to return the value from vfs_llseek().  On a
      32-bit kernel ssize_t is a 32-bit signed int, which overflows above 2 GB.
      
      Assign the return value of vfs_llseek() to loff_t to fix this.
      Reported-by: Boris Gjenero <boris.gjenero@gmail.com>
      Fixes: 9e46b840 ("ovl: support stacked SEEK_HOLE/SEEK_DATA")
      Cc: <stable@vger.kernel.org> # v4.19
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      a4ac9d45
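The truncation behind the ovl_lseek() bug can be reproduced in userspace by forcing the return value through a 32-bit signed type, as ssize_t would be on a 32-bit kernel. The function names are hypothetical; the exact value produced by the narrowing conversion is implementation-defined, so the sketch only relies on it differing from the original offset.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy pattern: a 64-bit file offset stored in a 32-bit signed variable. */
static int64_t seek_result_via_32bit_sketch(int64_t offset)
{
	int32_t ret = (int32_t)offset;	/* loses the upper bits above 2 GiB */

	return ret;
}

/* Fixed pattern: keep the offset in a 64-bit type (loff_t in the kernel). */
static int64_t seek_result_via_64bit_sketch(int64_t offset)
{
	int64_t ret = offset;

	return ret;
}
```

An offset of 3 GiB survives the 64-bit path unchanged but cannot be represented in the 32-bit path, while small offsets behave identically in both, which is why the bug only shows up on files larger than 2 GB.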
    • J
      btrfs: do not zero f_bavail if we have available space · d55966c4
      Josef Bacik committed
      There was some logic added a while ago to clear out f_bavail in statfs()
      if we did not have enough free metadata space to satisfy our global
      reserve.  This was incorrect at the time, however didn't really pose a
      problem for normal file systems because we would often allocate chunks
      if we got this low on free metadata space, and thus wouldn't really hit
      this case unless we were actually full.
      
      Fast forward to today and now we are much better about not allocating
      metadata chunks all of the time.  Couple this with d792b0f1 ("btrfs:
      always reserve our entire size for the global reserve") which now means
      we'll easily have a larger global reserve than our free space, we are
      now more likely to trip over this while still having plenty of space.
      
      Fix this by skipping this logic if the global rsv's space_info is not
      full.  space_info->full is 0 unless we've attempted to allocate a chunk
      for that space_info and that has failed.  If this happens then the space
      for the global reserve is definitely sacred and we need to report
      b_avail == 0, but before then we can just use our calculated b_avail.
      Reported-by: Martin Steigerwald <martin@lichtvoll.de>
      Fixes: ca8a51b3 ("btrfs: statfs: report zero available if metadata are exhausted")
      CC: stable@vger.kernel.org # 4.5+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Tested-By: Martin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d55966c4
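The reporting rule described in the btrfs statfs fix above reduces to one extra condition. The sketch below is a simplification with hypothetical names and parameters; it omits any thresholds the real code may apply and only models "zero the value when metadata is short and the space_info is marked full".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide what to report as available space.
 * space_info_full models space_info->full: set only after a chunk
 * allocation for that space_info has been attempted and has failed.
 */
static uint64_t report_bavail_sketch(uint64_t calculated_bavail,
				     uint64_t free_meta,
				     uint64_t global_rsv_size,
				     bool space_info_full)
{
	/* Only a full space_info makes the global reserve untouchable. */
	if (space_info_full && free_meta < global_rsv_size)
		return 0;

	return calculated_bavail;
}
```

Before the fix, the second condition alone was enough to report zero, which is how a filesystem with plenty of unallocated space could still appear full.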
  6. 02 Feb, 2020 1 commit
    • A
      vfs: fix do_last() regression · 6404674a
      Al Viro committed
      Brown paperbag time: fetching ->i_uid/->i_mode really should've been
      done from nd->inode.  I even suggested that, but the reason for that has
      slipped through the cracks and I went for dir->d_inode instead - made
      for more "obvious" patch.
      
      Analysis:
      
       - at the entry into do_last() and all the way to step_into(): dir (aka
         nd->path.dentry) is known not to have been freed; so's nd->inode and
         it's equal to dir->d_inode unless we are already doomed to -ECHILD.
         inode of the file to get opened is not known.
      
       - after step_into(): inode of the file to get opened is known; dir
         might be pointing to freed memory/be negative/etc.
      
       - at the call of may_create_in_sticky(): guaranteed to be out of RCU
         mode; inode of the file to get opened is known and pinned; dir might
         be garbage.
      
      The last was the reason for the original patch.  Except that at the
      do_last() entry we can be in RCU mode and it is possible that
      nd->path.dentry->d_inode has already changed under us.
      
      In that case we are going to fail with -ECHILD, but we need to be
      careful; nd->inode is pointing to valid struct inode and it's the same
      as nd->path.dentry->d_inode in "won't fail with -ECHILD" case, so we
      should use that.
      Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com>
      Reported-by: syzbot+190005201ced78a74ad6@syzkaller.appspotmail.com
      Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@kernel.org
      Fixes: d0cb5018 ("do_last(): fetch directory ->i_mode and ->i_uid before it's too late")
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6404674a
  7. 01 Feb, 2020 6 commits