1. 16 May, 2019 24 commits
    • L
      fs/writeback: Attach inode's wb to root if needed · dda65ea8
      Committed by luanshi
      There might be tons of files queued in the writeback, waiting to be
      written back. Unfortunately, the writeback's cgroup may already be
      dead. In this case, we try to reassociate the inode with another
      writeback, but we possibly can't because the writeback associated with
      the dead cgroup is the only valid one. As a result, a new writeback is
      allocated, initialized and associated with the inode over and over
      again until all data resident in the inode's page cache is flushed to
      disk. This causes unnecessarily high system load.
      
      Fix the issue by forcibly moving the inode to the root cgroup when the
      previously bound cgroup becomes dead. With this change, no more
      unnecessary writebacks are created and populated, and the system load
      decreased by about 6x in the test case we carried out:
          Without the patch: 30% system load
          With the patch:    5%  system load
      Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • J
      fs/writeback: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount · 6644f956
      Committed by Jiufei Xue
      synchronize_rcu() does not wait for call_rcu() callbacks, so an inode wb
      switch may not have been queued to the workqueue when synchronize_rcu()
      returns. Thus previously scheduled switches may not be finished even
      after flushing the workqueue, which causes the NULL pointer dereference
      shown below.
      
      VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds.  Have a nice day...
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000278
      [<ffffffff8126a303>] evict+0xb3/0x180
      [<ffffffff8126a760>] iput+0x1b0/0x230
      [<ffffffff8127c690>] inode_switch_wbs_work_fn+0x3c0/0x6a0
      [<ffffffff810a5b2e>] worker_thread+0x4e/0x490
      [<ffffffff810a5ae0>] ? process_one_work+0x410/0x410
      [<ffffffff810ac056>] kthread+0xe6/0x100
      [<ffffffff8173c199>] ret_from_fork+0x39/0x50
      
      Replace the synchronize_rcu() call with rcu_barrier() to wait for all
      pending callbacks to finish. Also increment isw_nr_in_flight after
      call_rcu() in inode_switch_wbs(), which makes more sense.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • J
      fs/writeback: fix double free of blkcg_css · 0b58738a
      Committed by Jiufei Xue
      We got a WARNING when releasing blkcg_css:
      
      [332489.681635] WARNING: CPU: 55 PID: 14859 at lib/list_debug.c:56 __list_del_entry+0x81/0xc0
      [332489.682191] list_del corruption, ffff883e6b94d450->prev is LIST_POISON2 (dead000000000200)
      ......
      [332489.683895] CPU: 55 PID: 14859 Comm: kworker/55:2 Tainted: G
      [332489.684477] Hardware name: Inspur SA5248M4/X10DRT-PS, BIOS 4.05A 10/11/2016
      [332489.685061] Workqueue: cgroup_destroy css_release_work_fn
      [332489.685654]  ffffc9001d92bd28 ffffffff81380042 ffffc9001d92bd78 0000000000000000
      [332489.686269]  ffffc9001d92bd68 ffffffff81088f8b 0000003800000000 ffff883e6b94d4a0
      [332489.686867]  ffff883e6b94d400 ffffffff81ce8fe0 ffff88375b24f400 ffff883e6b94d4a0
      [332489.687479] Call Trace:
      [332489.688078]  [<ffffffff81380042>] dump_stack+0x63/0x81
      [332489.688681]  [<ffffffff81088f8b>] __warn+0xcb/0xf0
      [332489.689276]  [<ffffffff8108900f>] warn_slowpath_fmt+0x5f/0x80
      [332489.689877]  [<ffffffff8139e7c1>] __list_del_entry+0x81/0xc0
      [332489.690481]  [<ffffffff81125552>] css_release_work_fn+0x42/0x140
      [332489.691090]  [<ffffffff810a2db9>] process_one_work+0x189/0x420
      [332489.691693]  [<ffffffff810a309e>] worker_thread+0x4e/0x4b0
      [332489.692293]  [<ffffffff810a3050>] ? process_one_work+0x420/0x420
      [332489.692905]  [<ffffffff810a9616>] kthread+0xe6/0x100
      [332489.693504]  [<ffffffff810a9530>] ? kthread_park+0x60/0x60
      [332489.694099]  [<ffffffff817184e1>] ret_from_fork+0x41/0x50
      [332489.694722] ---[ end trace 0cf869c4a5cfba87 ]---
      ......
      
      This is caused by calling css_get() after the css has been killed by
      another thread, as described below:
      
                 Thread 1                       Thread 2
      cgroup_rmdir
        -> kill_css
          -> percpu_ref_kill_and_confirm
            -> css_killed_ref_fn

      css_killed_work_fn
        -> css_put
          -> css_release
                                              wb_get_create
                                                -> find_blkcg_css
                                                  -> css_get
                                                -> css_put
                                                  -> css_release (double free)
          -> css_release_work_fn
            -> css_free_work_fn
              -> blkcg_css_free
      
      When the double free happens, it may free memory still used by other
      threads and cause a kernel panic.

      Fix this by using css_tryget_online() in find_blkcg_css(), which
      returns false if the css has been killed.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
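The race fixed by 0b58738a can be mimicked in userspace. The following is a minimal single-threaded sketch using C11 atomics, not kernel code: `struct obj`, `obj_get()`, `obj_tryget()` and `obj_put()` are hypothetical stand-ins for the css and its `css_get()`, `css_tryget_online()` and `css_put()` helpers, and the kernel's percpu_ref based implementation is more involved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a css-like refcounted object. */
struct obj {
    atomic_int refs;     /* live references; 0 means release has started */
    int release_calls;   /* how often release ran (must never exceed 1) */
};

void obj_init(struct obj *o)
{
    atomic_init(&o->refs, 1);   /* base reference held by the creator */
    o->release_calls = 0;
}

/* Unconditional get: wrongly revives an object whose count already hit 0. */
void obj_get(struct obj *o)
{
    atomic_fetch_add(&o->refs, 1);
}

/* Conditional get: succeeds only while at least one reference is live. */
bool obj_tryget(struct obj *o)
{
    int old = atomic_load(&o->refs);

    while (old > 0) {
        /* On failure, old is reloaded and the loop re-checks it. */
        if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
            return true;
    }
    return false;
}

/* Put: whoever drops the count to 0 triggers the release. */
void obj_put(struct obj *o)
{
    int prev = atomic_fetch_sub(&o->refs, 1);

    assert(prev > 0);
    if (prev == 1)
        o->release_calls++;   /* not atomic: fine for this one-thread demo */
}

void tryget_demo(void)
{
    struct obj a, b;

    obj_init(&a);
    obj_put(&a);                    /* base ref dropped: release #1 */
    obj_get(&a);                    /* late unconditional get on a dead object */
    obj_put(&a);                    /* release #2: the double free */
    assert(a.release_calls == 2);

    obj_init(&b);
    obj_put(&b);                    /* release #1 */
    assert(!obj_tryget(&b));        /* tryget refuses the dead object */
    assert(b.release_calls == 1);   /* so no second release can happen */
}
```

With the conditional variant the late caller simply sees a failure and can fall back to another object, which is the behavior the commit message relies on.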
    • J
      f27da2ad
    • J
      writeback: add memcg_blkcg_link tree · 60448d43
      Committed by Jiufei Xue
      Add a global radix tree that links the memcg and the blkcg to which the
      user attaches tasks when using cgroup v1. The tree is used for cgroup
      writeback.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • J
      ovl: check the capability before cred overridden · cc68ce6d
      Committed by Jiufei Xue
      We found that setting the IMMUTABLE_FL flag on a file in a docker
      container returns success even though the container does not have the
      CAP_LINUX_IMMUTABLE capability.
      
      Commits d1d04ef8 ("ovl: stack file ops") and dab5ca8f ("ovl: add
      lsattr/chattr support") implemented chattr operations on a regular
      overlay file. ovl_real_ioctl() overrides the current process's
      subjective credentials with ofs->creator_cred, which has the
      CAP_LINUX_IMMUTABLE capability, so the check in
      vfs_ioctl()->cap_capable() succeeds.
      
      Fix this by checking the capability before the credentials are
      overridden. Here we only care about APPEND_FL and IMMUTABLE_FL, so get
      this information from the inode.
      
      [SzM: move check and call to underlying fs inside inode locked region to
      prevent two such calls from racing with each other]
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
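The ordering that cc68ce6d describes (check the caller's capability first, then act with the creator's credentials) can be sketched in plain C. This is a simplified userspace model, not the overlayfs code: `struct cred`, `set_flags()`, the flag values and `CAP_IMMUTABLE` are all hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <errno.h>

#define FL_APPEND     0x1u   /* illustrative values, not the kernel's */
#define FL_IMMUTABLE  0x2u
#define CAP_IMMUTABLE 0x1u   /* stand-in for CAP_LINUX_IMMUTABLE */

struct cred {
    unsigned int caps;       /* capability bitmask of a task */
};

/*
 * Change the flags of a file. `caller` is the requesting task and
 * `creator` models the more privileged credentials of the mounter.
 */
int set_flags(const struct cred *caller, const struct cred *creator,
              unsigned int *inode_flags, unsigned int new_flags)
{
    /* Only changes to the append/immutable bits need the capability. */
    unsigned int changed =
        (*inode_flags ^ new_flags) & (FL_APPEND | FL_IMMUTABLE);

    /* Check against the caller's own credentials first. */
    if (changed && !(caller->caps & CAP_IMMUTABLE))
        return -EPERM;

    /* Only now would the operation run with the creator's credentials. */
    (void)creator;
    *inode_flags = new_flags;
    return 0;
}

void ovl_flags_demo(void)
{
    struct cred container = { .caps = 0 };
    struct cred mounter = { .caps = CAP_IMMUTABLE };
    unsigned int flags = 0;

    /* The unprivileged caller is rejected although the creator has the cap. */
    assert(set_flags(&container, &mounter, &flags, FL_IMMUTABLE) == -EPERM);
    assert(flags == 0);

    /* A caller that holds the capability succeeds. */
    assert(set_flags(&mounter, &mounter, &flags, FL_IMMUTABLE) == 0);
    assert(flags == FL_IMMUTABLE);
}
```

If the check ran after the switch to the creator's credentials, the first call would succeed, which is the bug the commit reports.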
    • J
      fbdev: fix WARNING in __alloc_pages_nodemask bug · b9761d35
      Committed by Jiufei Xue
      commit 8c40292be9169a9cbe19aadd1a6fc60cbd1af82f upstream.
      
      Syzkaller hit 'WARNING in __alloc_pages_nodemask' bug.
      
      WARNING: CPU: 1 PID: 1473 at mm/page_alloc.c:4377 __alloc_pages_nodemask+0x4da/0x2130
      Kernel panic - not syncing: panic_on_warn set ...
      
      Call Trace:
       alloc_pages_current+0xb1/0x1e0
       kmalloc_order+0x1f/0x60
       kmalloc_order_trace+0x1d/0x120
       fb_alloc_cmap_gfp+0x85/0x2b0
       fb_set_user_cmap+0xff/0x370
       do_fb_ioctl+0x949/0xa20
       fb_ioctl+0xdd/0x120
       do_vfs_ioctl+0x186/0x1070
       ksys_ioctl+0x89/0xa0
       __x64_sys_ioctl+0x74/0xb0
       do_syscall_64+0xc8/0x550
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is a warning about order >= MAX_ORDER, and the order comes from a
      userspace ioctl. Add the __GFP_NOWARN flag to silence this warning.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • S
      fbdev: fix divide error in fb_var_to_videomode · e7efd88b
      Committed by Shile Zhang
      commit cf84807f6dd0be5214378e66460cfc9187f532f9 upstream
      
      Fix the following divide-by-zero error found by Syzkaller:
      
        divide error: 0000 [#1] SMP PTI
        CPU: 7 PID: 8447 Comm: test Kdump: loaded Not tainted 4.19.24-8.al7.x86_64 #1
        Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        RIP: 0010:fb_var_to_videomode+0xae/0xc0
        Code: 04 44 03 46 78 03 4e 7c 44 03 46 68 03 4e 70 89 ce d1 ee 69 c0 e8 03 00 00 f6 c2 01 0f 45 ce 83 e2 02 8d 34 09 0f 45 ce 31 d2 <41> f7 f0 31 d2 f7 f1 89 47 08 f3 c3 66 0f 1f 44 00 00 0f 1f 44 00
        RSP: 0018:ffffb7e189347bf0 EFLAGS: 00010246
        RAX: 00000000e1692410 RBX: ffffb7e189347d60 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb7e189347c10
        RBP: ffff99972a091c00 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000100
        R13: 0000000000010000 R14: 00007ffd66baf6d0 R15: 0000000000000000
        FS:  00007f2054d11740(0000) GS:ffff99972fbc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f205481fd20 CR3: 00000004288a0001 CR4: 00000000001606a0
        Call Trace:
         fb_set_var+0x257/0x390
         ? lookup_fast+0xbb/0x2b0
         ? fb_open+0xc0/0x140
         ? chrdev_open+0xa6/0x1a0
         do_fb_ioctl+0x445/0x5a0
         do_vfs_ioctl+0x92/0x5f0
         ? __alloc_fd+0x3d/0x160
         ksys_ioctl+0x60/0x90
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x5b/0x190
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f20548258d7
        Code: 44 00 00 48 8b 05 b9 15 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 15 2d 00 f7 d8 64 89 01 48
      
      It can be triggered easily with the following test code:
      
        #include <linux/fb.h>
        #include <fcntl.h>
        #include <sys/ioctl.h>
        int main(void)
        {
                struct fb_var_screeninfo var = {.activate = 0x100, .pixclock = 60};
                int fd = open("/dev/fb0", O_RDWR);
                if (fd < 0)
                        return 1;
      
                if (ioctl(fd, FBIOPUT_VSCREENINFO, &var))
                        return 1;
      
                return 0;
        }
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Fredrik Noring <noring@nocrew.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
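The crash in e7efd88b comes from dividing by a total that user-supplied timing fields can make zero (the reproducer leaves every field except activate and pixclock at 0). Below is a minimal userspace sketch of a guard against that class of error. `refresh_hz()` and its parameters are hypothetical, and this is not the upstream patch itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: derive a refresh rate in Hz from a pixel clock
 * period in picoseconds and the total horizontal and vertical counts.
 * Returns 0 instead of dividing when any input would be a zero divisor.
 */
uint32_t refresh_hz(uint32_t pixclock_ps, uint32_t htotal, uint32_t vtotal)
{
    if (pixclock_ps == 0 || htotal == 0 || vtotal == 0)
        return 0;   /* user-supplied zeros must never reach a division */

    /* Period in ps -> frequency in kHz -> Hz, kept in 64 bits. */
    uint64_t pixclock_hz = (UINT64_C(1000000000) / pixclock_ps) * 1000;
    uint64_t hfreq = pixclock_hz / htotal;   /* lines per second */

    return (uint32_t)(hfreq / vtotal);       /* frames per second */
}

void refresh_demo(void)
{
    /* Values from the reproducer: pixclock = 60, all timings zero. */
    assert(refresh_hz(60, 0, 0) == 0);

    /* A 640x480-like timing: 39721 ps clock, 800 x 525 total. */
    assert(refresh_hz(39721, 800, 525) == 59);
}
```

The second call truncates at each integer division, so it yields 59 rather than the nominal 60 Hz.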
    • G
      net: kernel hookers service for toa module · 3c74cfbb
      Committed by George Zhang
      LVS fullnat will replace network traffic's source ip with its local ip,
      and thus the backend servers cannot obtain the real client ip.
      
      To solve this, LVS has introduced the tcp option address (TOA) to store
      the essential ip address information in the last tcp ack packet of the
      3-way handshake, and the backend servers need to retrieve it from the
      packet header.
      
      In this patch, we have introduced the sk_toa_data member in the sock
      structure to hold the TOA information. There used to be an in-tree
      module for managing TOA, but it is now maintained as a standalone
      module.
      
      In this case, the toa module should register its hook function(s) using
      the provided interfaces in the hookers module.
      
      TOA in sock structure:
      
      	__be32 sk_toa_data[16];
      
      The hookers module only provides the sk_toa_data placeholder, and the
      toa module can use this variable through the layout it needs.
      
      Hook interfaces:
      
      The hookers module replaces the kernel's syn_recv_sock and getname
      handler with a stub that chains the toa module's hook function(s) to the
      original handling function. The hookers module allows hook functions to
      be installed and uninstalled in any order.
      
      toa module:
      
      The external toa module will be provided in a separate RPM package.
      
      [xuyu@linux.alibaba.com: amend commit log]
      Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
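The hook chaining that 3c74cfbb describes (a stub that calls the original handler plus any installed hooks, with install and uninstall allowed in any order) can be sketched in userspace C. All names here (`struct hook_point`, `hook_install()`, `hook_uninstall()`, `hook_stub()`) are hypothetical, and calling the original handler before the hooks is an assumption made for the illustration, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HOOKS 8

typedef int (*handler_fn)(int pkt);           /* original handler type */
typedef void (*hook_fn)(int pkt, int *ret);   /* hook may adjust the result */

struct hook_point {
    handler_fn orig;             /* saved original handler */
    hook_fn hooks[MAX_HOOKS];    /* installed hooks, NULL marks a free slot */
};

/* Install a hook in the first free slot; returns -1 when the table is full. */
int hook_install(struct hook_point *hp, hook_fn fn)
{
    for (size_t i = 0; i < MAX_HOOKS; i++) {
        if (hp->hooks[i] == NULL) {
            hp->hooks[i] = fn;
            return 0;
        }
    }
    return -1;
}

/* Remove a hook wherever it sits, so uninstall order does not matter. */
int hook_uninstall(struct hook_point *hp, hook_fn fn)
{
    for (size_t i = 0; i < MAX_HOOKS; i++) {
        if (hp->hooks[i] == fn) {
            hp->hooks[i] = NULL;
            return 0;
        }
    }
    return -1;
}

/* The stub: run the original handler, then every installed hook. */
int hook_stub(struct hook_point *hp, int pkt)
{
    int ret = hp->orig(pkt);

    for (size_t i = 0; i < MAX_HOOKS; i++) {
        if (hp->hooks[i] != NULL)
            hp->hooks[i](pkt, &ret);
    }
    return ret;
}

/* Toy handler and hooks used only by the demo below. */
int double_it(int pkt) { return pkt * 2; }
void add_one(int pkt, int *ret) { (void)pkt; *ret += 1; }
void add_ten(int pkt, int *ret) { (void)pkt; *ret += 10; }

void hook_demo(void)
{
    struct hook_point hp = { .orig = double_it };

    assert(hook_install(&hp, add_one) == 0);
    assert(hook_install(&hp, add_ten) == 0);
    assert(hook_stub(&hp, 3) == 17);            /* 6 + 1 + 10 */

    assert(hook_uninstall(&hp, add_one) == 0);  /* first installed goes first */
    assert(hook_stub(&hp, 3) == 16);            /* 6 + 10 */
}
```

A fixed slot table keeps removal independent of insertion order; a real module would also need locking around the table.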
    • C
      virtio_blk: add discard and write zeroes support · 78c3b712
      Committed by Changpeng Liu
      commit 1f23816b8eb8fdc39990abe166c10a18c16f6b21 upstream.
      
      In commit 88c85538, "virtio-blk: add discard and write zeroes features
      to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio
      block specification has been extended to add VIRTIO_BLK_T_DISCARD and
      VIRTIO_BLK_T_WRITE_ZEROES commands.  This patch enables support for
      discard and write zeroes in the virtio-blk driver when the device
      advertises the corresponding features, VIRTIO_BLK_F_DISCARD and
      VIRTIO_BLK_F_WRITE_ZEROES.
      Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
      Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
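Commit 78c3b712 enables discard and write zeroes only when the device advertises the matching feature bits. A minimal sketch of that kind of feature test follows. The bit numbers 13 and 14 are the ones the virtio specification assigns to these two features, while `struct blk_caps`, `has_feature()` and `blk_probe_caps()` are hypothetical names and not the driver's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_BLK_F_DISCARD      13   /* device supports discard */
#define VIRTIO_BLK_F_WRITE_ZEROES 14   /* device supports write zeroes */

/* Hypothetical summary of what the driver would enable. */
struct blk_caps {
    bool discard;
    bool write_zeroes;
};

/* Test a single bit in the 64-bit feature word offered by the device. */
bool has_feature(uint64_t features, unsigned int bit)
{
    return ((features >> bit) & 1u) != 0;
}

/* Enable each optional command only if its feature bit is offered. */
struct blk_caps blk_probe_caps(uint64_t device_features)
{
    struct blk_caps caps = {
        .discard = has_feature(device_features, VIRTIO_BLK_F_DISCARD),
        .write_zeroes =
            has_feature(device_features, VIRTIO_BLK_F_WRITE_ZEROES),
    };
    return caps;
}

void virtio_caps_demo(void)
{
    /* A device that offers discard but not write zeroes. */
    struct blk_caps caps =
        blk_probe_caps(UINT64_C(1) << VIRTIO_BLK_F_DISCARD);

    assert(caps.discard);
    assert(!caps.write_zeroes);
}
```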
    • J
      kconfig: Disable x86 clocksource watchdog · cfd1e254
      Committed by Jiufei Xue
      An unstable TSC triggers the clocksource watchdog, which disables the
      TSC clocksource. As a result, another clocksource is elected as the
      current clocksource, which causes performance issues on our servers.

      RHEL7 also disabled this feature because of some issues; see its changelog:
      [x86] disable clocksource watchdog (Prarit Bhargava) [914709]
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • J
      Revert "x86/tsc: Prepare warp test for TSC adjustment" · 727cae00
      Committed by Jiufei Xue
      This reverts commit 76d3b851.
      
      The return value of check_tsc_warp() is unused now, so remove it.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • J
      Revert "x86/tsc: Try to adjust TSC if sync test fails" · 1a33c395
      Committed by Jiufei Xue
      This reverts commit cc4db268.
      
      When we hot-add and enable a vCPU, the time inside the VM jumps and
      then the VM gets stuck.
      The dmesg output looks like this:
      [   48.402948] CPU2 has been hot-added
      [   48.413774] smpboot: Booting Node 0 Processor 2 APIC 0x2
      [   48.415155] kvm-clock: cpu 2, msr 6b615081, secondary cpu clock
      [   48.453690] TSC ADJUST compensate: CPU2 observed 139318776350 warp.  Adjust: 139318776350
      [  102.060874] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
      [  102.060874] clocksource:                       'kvm-clock' wd_now: 1cb1cfc4bf8 wd_last: 1be9588f1fe mask: ffffffffffffffff
      [  102.060874] clocksource:                       'tsc' cs_now: 207d794f7e cs_last: 205a32697a mask: ffffffffffffffff
      [  102.060874] tsc: Marking TSC unstable due to clocksource watchdog
      [  102.070188] KVM setup async PF for cpu 2
      [  102.071461] kvm-stealtime: cpu 2, msr 13ba95000
      [  102.074530] Will online and init hotplugged CPU: 2
      
      This is because the TSC of the newly added vCPU is initialized to 0
      while the others are ahead. The guest then performs the TSC ADJUST
      compensation, which causes the time jump.

      Commit bd8fab39 ("KVM: x86: fix maintaining of kvm_clock stability
      on guest CPU hotplug") can fix this problem. However, the host kernel
      version may be older, so do not adjust the TSC if the sync test fails;
      just mark it unstable.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • J
      block-throttle: enable hierarchical throttling even on traditional hierarchy · 8e010885
      Committed by Joseph Qi
      ECI may have a use case in which the throttling policy of each device
      mapper disk is configured directly under the root blkio cgroup, while
      the disks are actually used in different containers.
      Hierarchical throttling is currently only supported on cgroup v2, but
      ECI uses cgroup v1, so we have to enable hierarchical throttling on
      cgroup v1.
      This is ported from Red Hat 7u, and a year ago Jiufei already ported it
      to alikernel 4.9 as well, so this change should be acceptable.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • E
      eci: drivers/virtio: add vring_force_dma_api boot param · ab5dcc8f
      Committed by Eryu Guan
      Prior to the xdragon platform 20181230 release (e.g. the 0930 release),
      vring_use_dma_api() is required to return 'true' unconditionally.

      Introduce a new kernel boot parameter called "vring_force_dma_api" to
      control this behavior. Boot xdragon hosts with "vring_force_dma_api" on
      the kernel command line to make ENI hotplug work; normal ECS hosts keep
      the original behavior.
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
    • A
      boot: give rdrand some credit · 6cf71d7e
      Committed by Arjan van de Ven
      Cherry-pick from clear-linux patches:
      https://github.com/clearlinux-pkgs/linux-kvm/0104-give-rdrand-some-credit.patch
      
      try to credit rdrand/rdseed with some entropy
      
      In VMs but even modern hardware, we're super starved for entropy, and while we can
      and do wear a tin foil hat, it's very hard to argue that
      rdrand and rdtsc add zero entropy.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • J
    • A
      NEMU: Compile in evged always · 133cf0e6
      Committed by Arjan van de Ven
      Cherry-pick from kata-container patches:
      https://github.com/kata-containers/packaging/tree/master/kernel/patches/0002-Compile-in-evged-always.patch
      
      We need evged for NEMU (and in general for hw reduced)
      
      The config option cannot be set normally since it breaks all
      regular systems, and hardware reduced is really a runtime choice.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • E
      ext4: fix reserved cluster accounting at page invalidation time · 46631b20
      Committed by Eric Whitney
      commit f456767d3391e9f7d9d25a2e7241d75676dc19da upstream.
      
      Add new code to count canceled pending cluster reservations on bigalloc
      file systems and to reduce the cluster reservation count on all file
      systems using delayed allocation.  This replaces old code in
      ext4_da_page_release_reservations that was incorrect.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • E
      ext4: adjust reserved cluster count when removing extents · c4cdb449
      Committed by Eric Whitney
      commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream.
      
      Modify ext4_ext_remove_space() and the code it calls to correct the
      reserved cluster count for pending reservations (delayed allocated
      clusters shared with allocated blocks) when a block range is removed
      from the extent tree.  Pending reservations may be found for the clusters
      at the ends of written or unwritten extents when a block range is removed.
      If a physical cluster at the end of an extent is freed, it's necessary
      to increment the reserved cluster count to maintain correct accounting
      if the corresponding logical cluster is shared with at least one
      delayed and unwritten extent as found in the extents status tree.
      
      Add a new function, ext4_rereserve_cluster(), to reapply a reservation
      on a delayed allocated cluster sharing blocks with a freed allocated
      cluster.  To avoid ENOSPC on reservation, a flag is applied to
      ext4_free_blocks() to briefly defer updating the freeclusters counter
      when an allocated cluster is freed.  This prevents another thread
      from allocating the freed block before the reservation can be reapplied.
      
      Redefine the partial cluster object as a struct to carry more state
      information and to clarify the code using it.
      
      Adjust the conditional code structure in ext4_ext_remove_space to
      reduce the indentation level in the main body of the code to improve
      readability.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • E
      ext4: reduce reserved cluster count by number of allocated clusters · 1b7e8112
      Committed by Eric Whitney
      commit b6bf9171ef5c37b66d446378ba63af5339a56a97 upstream.
      
      Ext4 does not always reduce the reserved cluster count by the number
      of clusters allocated when mapping a delayed extent.  It sometimes
      adds back one or more clusters after allocation if delalloc blocks
      adjacent to the range allocated by ext4_ext_map_blocks() share the
      clusters newly allocated for that range.  However, this overcounts
      the number of clusters needed to satisfy future mapping requests
      (holding one or more reservations for clusters that have already been
      allocated) and premature ENOSPC and quota failures, etc., result.
      
      Ext4 also does not reduce the reserved cluster count when allocating
      clusters for non-delayed allocated writes that have previously been
      reserved for delayed writes.  This also results in overcounts.
      
      To make it possible to handle reserved cluster accounting for
      fallocated regions in the same manner as used for other non-delayed
      writes, do the reserved cluster accounting for them at the time of
      allocation.  In the current code, this is only done later when a
      delayed extent sharing the fallocated region is finally mapped.
      
      Address comment correcting handling of unsigned long long constant
      from Jan Kara's review of RFC version of this patch.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • E
      ext4: fix reserved cluster accounting at delayed write time · 90ec6904
      Committed by Eric Whitney
      commit 0b02f4c0d6d9e2c611dfbdd4317193e9dca740e6 upstream.
      
      The code in ext4_da_map_blocks sometimes reserves space for more
      delayed allocated clusters than it should, resulting in premature
      ENOSPC, exceeded quota, and inaccurate free space reporting.
      
      Fix this by checking for written and unwritten blocks shared in the
      same cluster with the newly delayed allocated block.  A cluster
      reservation should not be made for a cluster for which physical space
      has already been allocated.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • E
      ext4: add new pending reservation mechanism · fd3ebfb8
      Committed by Eric Whitney
      commit 1dc0aa46e74a3366e12f426b7caaca477853e9c3 upstream.
      
      Add new pending reservation mechanism to help manage reserved cluster
      accounting.  Its primary function is to avoid the need to read extents
      from the disk when invalidating pages as a result of a truncate, punch
      hole, or collapse range operation.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • E
      ext4: generalize extents status tree search functions · 8402581c
      Committed by Eric Whitney
      commit ad431025aecda85d3ebef5e4a3aca5c1c681d0c7 upstream.
      
      Ext4 contains a few functions that are used to search for delayed
      extents or blocks in the extents status tree.  Rather than duplicate
      code to add new functions to search for extents with different status
      values, such as written or a combination of delayed and unwritten,
      generalize the existing code to search for caller-specified extents
      status values.  Also, move this code into extents_status.c where it
      is better associated with the data structures it operates upon, and
      where it can be more readily used to implement new extents status tree
      functions that might want a broader scope for i_es_lock.
      
      Three missing static specifiers in RFC version of patch reported and
      fixed by Fengguang Wu <fengguang.wu@intel.com>.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
  2. 15 May, 2019 16 commits