1. 28 2月, 2019 3 次提交
    • J
      io_uring: support for IO polling · def596e9
      Jens Axboe 提交于
      Add support for a polled io_uring instance. When a read or write is
      submitted to a polled io_uring, the application must poll for
      completions on the CQ ring through io_uring_enter(2). Polled IO may not
      generate IRQ completions, hence they need to be actively found by the
      application itself.
      
      To use polling, io_uring_setup() must be used with the
      IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
      polled and non-polled IO on an io_uring.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      def596e9
    • C
      io_uring: add fsync support · c992fe29
      Christoph Hellwig 提交于
      Add a new fsync opcode, which either syncs a range if one is passed,
      or the whole file if the offset and length fields are both cleared
      to zero.  A flag is provided to use fdatasync semantics, that is only
      force out metadata which is required to retrieve the file data, but
      not others like metadata.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c992fe29
    • J
      Add io_uring IO interface · 2b188cc1
      Jens Axboe 提交于
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.cReviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2b188cc1
  2. 24 2月, 2019 3 次提交
  3. 15 2月, 2019 5 次提交
  4. 07 2月, 2019 2 次提交
    • T
      nfsd: Fix error return values for nfsd4_clone_file_range() · e3fdc89c
      Trond Myklebust 提交于
      If the parameter 'count' is non-zero, nfsd4_clone_file_range() will
      currently clobber all errors returned by vfs_clone_file_range() and
      replace them with EINVAL.
      
      Fixes: 42ec3d4c ("vfs: make remap_file_range functions take and...")
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      e3fdc89c
    • T
      fs: ratelimit __find_get_block_slow() failure message. · 43636c80
      Tetsuo Handa 提交于
      When something let __find_get_block_slow() hit all_mapped path, it calls
      printk() for 100+ times per a second. But there is no need to print same
      message with such high frequency; it is just asking for stall warning, or
      at least bloating log files.
      
        [  399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
        [  399.873324][T15342] b_state=0x00000029, b_size=512
        [  399.878403][T15342] device loop0 blocksize: 4096
        [  399.883296][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
        [  399.890400][T15342] b_state=0x00000029, b_size=512
        [  399.895595][T15342] device loop0 blocksize: 4096
        [  399.900556][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
        [  399.907471][T15342] b_state=0x00000029, b_size=512
        [  399.912506][T15342] device loop0 blocksize: 4096
      
      This patch reduces frequency to up to once per a second, in addition to
      concatenating three lines into one.
      
        [  399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8, b_state=0x00000029, b_size=512, device loop0 blocksize: 4096
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      43636c80
  5. 06 2月, 2019 1 次提交
  6. 04 2月, 2019 3 次提交
    • D
      xfs: set buffer ops when repair probes for btree type · add46b3b
      Darrick J. Wong 提交于
      In xrep_findroot_block, we work out the btree type and correctness of a
      given block by calling different btree verifiers on root block
      candidates.  However, we leave the NULL b_ops while ->verify_read
      validates the block, which means that if the verifier calls
      xfs_buf_verifier_error it'll crash on the null b_ops.  Fix it to set
      b_ops before calling the verifier and unsetting it if the verifier
      fails.
      
      Furthermore, improve the documentation around xfs_buf_ensure_ops, which
      is the function that is responsible for cleaning up the b_ops state of
      buffers that go through xrep_findroot_block but don't match anything.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      add46b3b
    • B
      xfs: end sync buffer I/O properly on shutdown error · 465fa17f
      Brian Foster 提交于
      As of commit e339dd8d ("xfs: use sync buffer I/O for sync delwri
      queue submission"), the delwri submission code uses sync buffer I/O
      for sync delwri I/O. Instead of waiting on async I/O to unlock the
      buffer, it uses the underlying sync I/O completion mechanism.
      
      If delwri buffer submission fails due to a shutdown scenario, an
      error is set on the buffer and buffer completion never occurs. This
      can cause xfs_buf_delwri_submit() to deadlock waiting on a
      completion event.
      
      We could check the error state before waiting on such buffers, but
      that doesn't serialize against the case of an error set via a racing
      I/O completion. Instead, invoke I/O completion in the shutdown case
      regardless of buffer I/O type.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      465fa17f
    • B
      xfs: eof trim writeback mapping as soon as it is cached · aa6ee4ab
      Brian Foster 提交于
      The cached writeback mapping is EOF trimmed to try and avoid races
      between post-eof block management and writeback that result in
      sending cached data to a stale location. The cached mapping is
      currently trimmed on the validation check, which leaves a race
      window between the time the mapping is cached and when it is trimmed
      against the current inode size.
      
      For example, if a new mapping is cached by delalloc conversion on a
      blocksize == page size fs, we could cycle various locks, perform
      memory allocations, etc.  in the writeback codepath before the
      associated mapping is eventually trimmed to i_size. This leaves
      enough time for a post-eof truncate and file append before the
      cached mapping is trimmed. The former event essentially invalidates
      a range of the cached mapping and the latter bumps the inode size
      such the trim on the next writepage event won't trim all of the
      invalid blocks. fstest generic/464 reproduces this scenario
      occasionally and causes a lost writeback and stale delalloc blocks
      warning on inode inactivation.
      
      To work around this problem, trim the cached writeback mapping as
      soon as it is cached in addition to on subsequent validation checks.
      This is a minor tweak to tighten the race window as much as possible
      until a proper invalidation mechanism is available.
      
      Fixes: 40214d12 ("xfs: trim writepage mapping to within eof")
      Cc: <stable@vger.kernel.org> # v4.14+
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      aa6ee4ab
  7. 02 2月, 2019 4 次提交
  8. 01 2月, 2019 1 次提交
  9. 31 1月, 2019 7 次提交
    • S
      cifs: update internal module version number · b9b9378b
      Steve French 提交于
      To 2.17
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      b9b9378b
    • A
      CIFS: fix use-after-free of the lease keys · d339adc1
      Aurelien Aptel 提交于
      The request buffers are freed right before copying the pointers.
      Use the func args instead which are identical and still valid.
      
      Simple reproducer (requires KASAN enabled) on a cifs mount:
      
      echo foo > foo ; tail -f foo & rm foo
      
      Cc: <stable@vger.kernel.org> # 4.20
      Fixes: 179e44d4 ("smb3: add tracepoint for sending lease break responses to server")
      Signed-off-by: NAurelien Aptel <aaptel@suse.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Reviewed-by: NPaulo Alcantara <palcantara@suse.de>
      d339adc1
    • W
      fs/dcache: Track & report number of negative dentries · af0c9af1
      Waiman Long 提交于
      The current dentry number tracking code doesn't distinguish between
      positive & negative dentries.  It just reports the total number of
      dentries in the LRU lists.
      
      As excessive number of negative dentries can have an impact on system
      performance, it will be wise to track the number of positive and
      negative dentries separately.
      
      This patch adds tracking for the total number of negative dentries in
      the system LRU lists and reports it in the 5th field in the
      /proc/sys/fs/dentry-state file.  The number, however, does not include
      negative dentries that are in flight but not in the LRU yet as well as
      those in the shrinker lists which are on the way out anyway.
      
      The number of positive dentries in the LRU lists can be roughly found by
      subtracting the number of negative dentries from the unused count.
      
      Matthew Wilcox had confirmed that since the introduction of the
      dentry_stat structure in 2.1.60, the dummy array was there, probably for
      future extension.  They were not replacements of pre-existing fields.
      So no sane applications that read the value of /proc/sys/fs/dentry-state
      will do dummy thing if the last 2 fields of the sysctl parameter are not
      zero.  IOW, it will be safe to use one of the dummy array entry for
      negative dentry count.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af0c9af1
    • W
      fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb() · 1dbd449c
      Waiman Long 提交于
      The nr_dentry_unused per-cpu counter tracks dentries in both the LRU
      lists and the shrink lists where the DCACHE_LRU_LIST bit is set.
      
      The shrink_dcache_sb() function moves dentries from the LRU list to a
      shrink list and subtracts the dentry count from nr_dentry_unused.  This
      is incorrect as the nr_dentry_unused count will also be decremented in
      shrink_dentry_list() via d_shrink_del().
      
      To fix this double decrement, the decrement in the shrink_dcache_sb()
      function is taken out.
      
      Fixes: 4e717f5c ("list_lru: remove special case function list_lru_dispose_all."
      Cc: stable@kernel.org
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1dbd449c
    • E
      btrfs: On error always free subvol_name in btrfs_mount · 532b618b
      Eric W. Biederman 提交于
      The subvol_name is allocated in btrfs_parse_subvol_options and is
      consumed and freed in mount_subvol.  Add a free to the error paths that
      don't call mount_subvol so that it is guaranteed that subvol_name is
      freed when an error happens.
      
      Fixes: 312c89fb ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()")
      Cc: stable@vger.kernel.org # v4.19+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      532b618b
    • D
      btrfs: clean up pending block groups when transaction commit aborts · c7cc64a9
      David Sterba 提交于
      The fstests generic/475 stresses transaction aborts and can reveal
      space accounting or use-after-free bugs regarding block goups.
      
      In this case the pending block groups that remain linked to the
      structures after transaction commit aborts in the middle.
      
      The corrupted slabs lead to failures in following tests, eg. generic/476
      
        [ 8172.752887] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
        [ 8172.755799] #PF error: [normal kernel read fault]
        [ 8172.757571] PGD 661ae067 P4D 661ae067 PUD 3db8e067 PMD 0
        [ 8172.759000] Oops: 0000 [#1] PREEMPT SMP
        [ 8172.760209] CPU: 0 PID: 39 Comm: kswapd0 Tainted: G        W         5.0.0-rc2-default #408
        [ 8172.762495] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
        [ 8172.765772] RIP: 0010:shrink_page_list+0x2f9/0xe90
        [ 8172.770453] RSP: 0018:ffff967f00663b18 EFLAGS: 00010287
        [ 8172.771184] RAX: 0000000000000000 RBX: ffff967f00663c20 RCX: 0000000000000000
        [ 8172.772850] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8c0620ab20e0
        [ 8172.774629] RBP: ffff967f00663dd8 R08: 0000000000000000 R09: 0000000000000000
        [ 8172.776094] R10: ffff8c0620ab22f8 R11: ffff8c063f772688 R12: ffff967f00663b78
        [ 8172.777533] R13: ffff8c063f625600 R14: ffff8c063f625608 R15: dead000000000200
        [ 8172.778886] FS:  0000000000000000(0000) GS:ffff8c063d400000(0000) knlGS:0000000000000000
        [ 8172.780545] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 8172.781787] CR2: 0000000000000058 CR3: 000000004e962000 CR4: 00000000000006f0
        [ 8172.783547] Call Trace:
        [ 8172.784112]  shrink_inactive_list+0x194/0x410
        [ 8172.784747]  shrink_node_memcg.constprop.85+0x3a5/0x6a0
        [ 8172.785472]  shrink_node+0x62/0x1e0
        [ 8172.786011]  balance_pgdat+0x216/0x460
        [ 8172.786577]  kswapd+0xe3/0x4a0
        [ 8172.787085]  ? finish_wait+0x80/0x80
        [ 8172.787795]  ? balance_pgdat+0x460/0x460
        [ 8172.788799]  kthread+0x116/0x130
        [ 8172.789640]  ? kthread_create_on_node+0x60/0x60
        [ 8172.790323]  ret_from_fork+0x24/0x30
        [ 8172.794253] CR2: 0000000000000058
      
      or accounting errors at umount time:
      
        [ 8159.537251] WARNING: CPU: 2 PID: 19031 at fs/btrfs/extent-tree.c:5987 btrfs_free_block_groups+0x3d5/0x410 [btrfs]
        [ 8159.543325] CPU: 2 PID: 19031 Comm: umount Tainted: G        W         5.0.0-rc2-default #408
        [ 8159.545472] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
        [ 8159.548155] RIP: 0010:btrfs_free_block_groups+0x3d5/0x410 [btrfs]
        [ 8159.554030] RSP: 0018:ffff967f079cbde8 EFLAGS: 00010206
        [ 8159.555144] RAX: 0000000001000000 RBX: ffff8c06366cf800 RCX: 0000000000000000
        [ 8159.556730] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff8c06255ad800
        [ 8159.558279] RBP: ffff8c0637ac0000 R08: 0000000000000001 R09: 0000000000000000
        [ 8159.559797] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8c0637ac0108
        [ 8159.561296] R13: ffff8c0637ac0158 R14: 0000000000000000 R15: dead000000000100
        [ 8159.562852] FS:  00007f7f693b9fc0(0000) GS:ffff8c063d800000(0000) knlGS:0000000000000000
        [ 8159.564839] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 8159.566160] CR2: 00007f7f68fab7b0 CR3: 000000000aec7000 CR4: 00000000000006e0
        [ 8159.567898] Call Trace:
        [ 8159.568597]  close_ctree+0x17f/0x350 [btrfs]
        [ 8159.569628]  generic_shutdown_super+0x64/0x100
        [ 8159.570808]  kill_anon_super+0x14/0x30
        [ 8159.571857]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [ 8159.573063]  deactivate_locked_super+0x29/0x60
        [ 8159.574234]  cleanup_mnt+0x3b/0x70
        [ 8159.575176]  task_work_run+0x98/0xc0
        [ 8159.576177]  exit_to_usermode_loop+0x83/0x90
        [ 8159.577315]  do_syscall_64+0x15b/0x180
        [ 8159.578339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This fix is based on 2 Josef's patches that used sideefects of
      btrfs_create_pending_block_groups, this fix introduces the helper that
      does what we need.
      
      CC: stable@vger.kernel.org # 4.4+
      CC: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c7cc64a9
    • A
      btrfs: fix potential oops in device_list_add · 92900e51
      Al Viro 提交于
      alloc_fs_devices() can return ERR_PTR(-ENOMEM), so dereferencing its
      result before the check for IS_ERR() is a bad idea.
      
      Fixes: d1a63002 ("btrfs: add members to fs_devices to track fsid changes")
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      92900e51
  10. 30 1月, 2019 9 次提交
  11. 29 1月, 2019 1 次提交
    • Y
      nfs: Fix NULL pointer dereference of dev_name · 80ff0017
      Yao Liu 提交于
      There is a NULL pointer dereference of dev_name in nfs_parse_devname()
      
      The oops looks something like:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
        ...
        RIP: 0010:nfs_fs_mount+0x3b6/0xc20 [nfs]
        ...
        Call Trace:
         ? ida_alloc_range+0x34b/0x3d0
         ? nfs_clone_super+0x80/0x80 [nfs]
         ? nfs_free_parsed_mount_data+0x60/0x60 [nfs]
         mount_fs+0x52/0x170
         ? __init_waitqueue_head+0x3b/0x50
         vfs_kern_mount+0x6b/0x170
         do_mount+0x216/0xdc0
         ksys_mount+0x83/0xd0
         __x64_sys_mount+0x25/0x30
         do_syscall_64+0x65/0x220
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fix this by adding a NULL check on dev_name
      Signed-off-by: NYao Liu <yotta.liu@ucloud.cn>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      80ff0017
  12. 28 1月, 2019 1 次提交
    • J
      btrfs: don't end the transaction for delayed refs in throttle · 302167c5
      Josef Bacik 提交于
      Previously callers to btrfs_end_transaction_throttle() would commit the
      transaction if there wasn't enough delayed refs space.  This happens in
      relocation, and if the fs is relatively empty we'll run out of delayed
      refs space basically immediately, so we'll just be stuck in this loop of
      committing the transaction over and over again.
      
      This code existed because we didn't have a good feedback mechanism for
      running delayed refs, but with the delayed refs rsv we do now.  Delete
      this throttling code and let the btrfs_start_transaction() in relocation
      deal with putting pressure on the delayed refs infrastructure.  With
      this patch we no longer take 5 minutes to balance a metadata only fs.
      
      Qu has submitted a fstest to catch slow balance or excessive transaction
      commits. Steps to reproduce:
      
      * create subvolume
      * create many (eg. 16000) inlined files, of size 2KiB
      * iteratively snapshot and touch several files to trigger metadata
        updates
      * start balance -m
      Reported-by: NQu Wenruo <wqu@suse.com>
      Fixes: 64403612 ("btrfs: rework btrfs_check_space_for_delayed_refs")
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add tags and steps to reproduce ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      302167c5
新手
引导
客服 返回
顶部