  1. 01 Jun 2019: 1 commit
    • splice: don't read more than available pipe space · 67c14e68
      Committed by Darrick J. Wong
      commit 17614445576b6af24e9cf36607c6448164719c96 upstream.
      
      In commit 4721a601099, we tried to fix a problem wherein directio reads
      into a splice pipe will bounce EFAULT/EAGAIN all the way out to
      userspace by simulating a zero-byte short read.  This happens because
      some directio read implementations (xfs) will call
      bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
      reads, but as soon as we run out of pipe buffers that _get_pages call
      returns EFAULT, which the splice code translates to EAGAIN and bounces
      out to userspace.
      
      In that commit, the iomap code catches the EFAULT and simulates a
      zero-byte read, but that causes assertion errors on regular splice reads
      because xfs doesn't allow short directio reads.
      
      The brokenness is compounded by splice_direct_to_actor immediately
      bailing on do_splice_to returning <= 0 without ever calling ->actor
      (which empties out the pipe), so if userspace calls back we'll EFAULT
      again on the full pipe, and nothing ever gets copied.
      
      Therefore, teach splice_direct_to_actor to clamp its requests to the
      amount of free space in the pipe and remove the simulated short read
      from the iomap directio code (the clamping idea is sketched after this
      entry).
      
      Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
      Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
      Ranted-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      67c14e68
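      The clamping idea above can be pictured with a small userspace sketch
      (illustrative only, not the kernel change itself; the helper name and the
      fixed slot geometry are made up): the request is trimmed to the number of
      free pipe slots times the slot size, so splice never asks for more pages
      than the pipe can absorb.

        #include <stddef.h>
        #include <stdio.h>

        /* Clamp a requested splice length to the bytes still free in the pipe. */
        static size_t clamp_to_pipe_space(size_t len, unsigned int ring_slots,
                                          unsigned int used_slots, size_t slot_size)
        {
                size_t free_bytes = (size_t)(ring_slots - used_slots) * slot_size;

                return len < free_bytes ? len : free_bytes;
        }

        int main(void)
        {
                /* 16 slots of 4 KiB with 14 in use: a 1 MiB request becomes 8 KiB. */
                printf("%zu\n", clamp_to_pipe_space(1 << 20, 16, 14, 4096));
                return 0;
        }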
  2. 31 May 2019: 3 commits
    • xfs: don't overflow xattr listent buffer · e8dbd741
      Committed by Darrick J. Wong
      commit 3b50086f0c0d78c144d9483fa292c1509c931b70 upstream.
      
      For VFS listxattr calls, xfs_xattr_put_listent calls
      __xfs_xattr_put_listent twice if it sees an attribute
      "trusted.SGI_ACL_FILE": once for that name, and again for
      "system.posix_acl_access".  Unfortunately, if we happen to run out of
      buffer space while emitting the first name, we set count to -1 (so that
      we can feed ERANGE to the caller).  The second invocation doesn't check that
      the context parameters make sense and overwrites the byte before the
      buffer, triggering a KASAN report:
      
      ==================================================================
      BUG: KASAN: slab-out-of-bounds in strncpy+0xb3/0xd0
      Write of size 1 at addr ffff88807fbd317f by task syz/1113
      
      CPU: 3 PID: 1113 Comm: syz Not tainted 5.0.0-rc6-xfsx #rc6
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       dump_stack+0xcc/0x180
       print_address_description+0x6c/0x23c
       kasan_report.cold.3+0x1c/0x35
       strncpy+0xb3/0xd0
       __xfs_xattr_put_listent+0x1a9/0x2c0 [xfs]
       xfs_attr_list_int_ilocked+0x11af/0x1800 [xfs]
       xfs_attr_list_int+0x20c/0x2e0 [xfs]
       xfs_vn_listxattr+0x225/0x320 [xfs]
       listxattr+0x11f/0x1b0
       path_listxattr+0xbd/0x130
       do_syscall_64+0x139/0x560
      
      While we're at it, we add an assert to the other put_listent to avoid
      this sort of thing ever happening to the attrlist_by_handle code (see
      the sketch after this entry).
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      e8dbd741
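      The overflow and the fix can be sketched in plain userspace C (names are
      hypothetical; the real logic lives in __xfs_xattr_put_listent): once a
      previous call has flagged the buffer as full by setting count to -1,
      later calls must write nothing at all.

        #include <stdio.h>
        #include <string.h>

        struct list_ctx {
                char *buf;
                int bufsize;
                int count;      /* bytes used, or -1 once the buffer overflowed */
        };

        static void put_listent(struct list_ctx *ctx, const char *name)
        {
                int len = (int)strlen(name) + 1;

                if (ctx->count < 0)     /* the fix: bail after a prior overflow */
                        return;

                if (ctx->count + len > ctx->bufsize) {
                        ctx->count = -1;  /* caller turns this into -ERANGE */
                        return;
                }

                memcpy(ctx->buf + ctx->count, name, len);
                ctx->count += len;
        }

        int main(void)
        {
                char buf[16];
                struct list_ctx ctx = { .buf = buf, .bufsize = sizeof(buf) };

                put_listent(&ctx, "trusted.SGI_ACL_FILE");    /* overflows, count = -1 */
                put_listent(&ctx, "system.posix_acl_access"); /* must write nothing    */
                printf("%d\n", ctx.count);
                return 0;
        }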
    • ovl: relax WARN_ON() for overlapping layers use case · ece2e461
      Committed by Amir Goldstein
      commit acf3062a7e1ccf67c6f7e7c28671a6708fde63b0 upstream.
      
      This nasty little syzbot repro:
      https://syzkaller.appspot.com/x/repro.syz?x=12c7a94f400000
      
      Creates overlay mounts where the same directory is both in upper and lower
      layers. Simplified example:
      
        mkdir foo work
        mount -t overlay none foo -o"lowerdir=.,upperdir=foo,workdir=work"
      
      The repro runs several threads in parallel that attempt to chdir into foo
      and attempt to symlink/rename/exec/mkdir the file bar.
      
      The repro hits a WARN_ON() I placed in ovl_instantiate(), which suggests
      that an overlay inode already exists in cache and is hashed by the pointer
      of the real upper dentry that ovl_create_real() has just created. At the
      point of the WARN_ON(), both the overlay dir inode lock and the upper dir
      inode lock are held, so at first, I did not see how this was possible.
      
      On a closer look, I see that after ovl_create_real(), because of the
      overlapping upper and lower layers, a lookup by another thread can find the
      file foo/bar that was just created in upper layer, at overlay path
      foo/foo/bar and hash an overlay inode with the new real dentry as lower
      dentry. This is possible because the overlay directory foo/foo is not
      locked and the upper dentry foo/bar is in dcache, so ovl_lookup() can find
      it without taking upper dir inode shared lock.
      
      Overlapping layers are considered a wrong setup which would result in
      unexpected behavior, but it shouldn't crash the kernel and it shouldn't
      trigger WARN_ON() either, so relax this WARN_ON() and leave a pr_warn()
      instead to cover all cases of failure to get an overlay inode.
      
      The error returned from failure to insert new inode to cache with
      inode_insert5() was changed to -EEXIST, to distinguish from the error
      -ENOMEM returned on failure to get/allocate inode with iget5_locked().
      
      Reported-by: syzbot+9c69c282adc4edd2b540@syzkaller.appspotmail.com
      Fixes: 01b39dcc ("ovl: use inode_insert5() to hash a newly...")
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      ece2e461
    • net/tcp: Support tunable tcp timeout value in TIME-WAIT state · f354337f
      Committed by George Zhang
      By default the tcp_tw_timeout value is 60 seconds. The minimum is
      1 second and the maximum is 600. This setting is useful on systems under
      heavy tcp load (a short usage sketch follows this entry).
      
      NOTE: setting tcp_tw_timeout below 60 seconds violates the "quiet time"
      restriction and puts your system at risk of having some old data accepted
      as new, or new data rejected as old duplicates, by some receivers.
      
      Link: http://web.archive.org/web/20150102003320/http://tools.ietf.org/html/rfc793
      Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      f354337f
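      A hedged usage sketch (the /proc path below is an assumption based on
      where TCP sysctls normally live; the same effect should be achievable
      with "sysctl -w net.ipv4.tcp_tw_timeout=5"):

        #include <stdio.h>

        int main(void)
        {
                /* Assumed sysctl path; accepted values are 1..600 seconds. */
                FILE *f = fopen("/proc/sys/net/ipv4/tcp_tw_timeout", "w");

                if (!f) {
                        perror("tcp_tw_timeout");
                        return 1;
                }
                fprintf(f, "5\n");      /* shorten TIME-WAIT to 5 seconds */
                return fclose(f) ? 1 : 0;
        }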
  3. 28 May 2019: 1 commit
    • block: Fix a NULL pointer dereference in generic_make_request() · b8cff60d
      Committed by Guilherme G. Piccoli
      Commit 37f9579f ("blk-mq: Avoid that submitting a bio concurrently
      with device removal triggers a crash") introduced a NULL pointer
      dereference in generic_make_request(). The patch sets q to NULL and
      enter_succeeded to false; right after, there's an 'if (enter_succeeded)'
      which is not taken, and then the 'else' will dereference q in
      blk_queue_dying(q).
      
      This patch just moves the 'q = NULL' to a point at which it won't trigger
      the oops, although the semantics of this NULLification remain untouched.
      
      A simple test case/reproducer is as follows:
      a) Build kernel v5.2-rc1 with CONFIG_BLK_CGROUP=n.
      
      b) Create a raid0 md array with 2 NVMe devices as members, and mount it
      with an ext4 filesystem.
      
      c) Run the following oneliner (supposing the raid0 is mounted in /mnt):
      (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3;
      echo 1 > /sys/block/nvme0n1/device/device/remove
      (where nvme0n1 is the 2nd array member)
      
      This will trigger the following oops:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      RIP: 0010:generic_make_request+0x32b/0x400
      Call Trace:
       submit_bio+0x73/0x140
       ext4_io_submit+0x4d/0x60
       ext4_writepages+0x626/0xe90
       do_writepages+0x4b/0xe0
      [...]
      
      This patch has no functional changes and preserves the md/raid0 behavior
      when a member is removed before kernel v4.17.
      
      Cc: stable@vger.kernel.org # v4.17
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Tested-by: Eric Ren <renzhengeek@gmail.com>
      Fixes: 37f9579f ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash")
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Eric Ren <renzhen@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      b8cff60d
  4. 25 May 2019: 1 commit
  5. 24 May 2019: 2 commits
  6. 23 May 2019: 7 commits
  7. 18 May 2019: 1 commit
  8. 16 May 2019: 24 commits
    • NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled · a1b43c22
      Committed by Yihao Wu
      When a waiter is woken by CB_NOTIFY_LOCK, it will retry
      nfs4_proc_setlk(). The waiter may fail in nfs4_proc_setlk() and sleep
      again. However, the waiter is already removed from clp->cl_lock_waitq
      when handling CB_NOTIFY_LOCK in nfs4_wake_lock_waiter(), so any
      subsequent CB_NOTIFY_LOCK won't wake this waiter anymore. We should
      put the waiter back on clp->cl_lock_waitq before retrying.
      
      Cc: stable@vger.kernel.org #4.9+
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      a1b43c22
    • NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter · e1843f01
      Committed by Yihao Wu
      Commit b7dbcc0e "NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter"
      found this bug. However, it didn't fix it.
      
      This commit replaces schedule_timeout() with wait_woken() and
      default_wake_function() with woken_wake_function() in
      nfs4_retry_setlk() and nfs4_wake_lock_waiter(). wait_woken() uses
      memory barriers in its implementation to avoid a potential race
      condition when putting a process to sleep and then waking it up.
      
      Fixes: a1d617d8 ("nfs: allow blocking locks to be awoken by lock callbacks")
      Cc: stable@vger.kernel.org #4.9+
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      e1843f01
    • writeback: introduce cgwb_v1 boot param · 769519ba
      Committed by Jiufei Xue
      So far, writeback control is supported for the cgroup v1 interface.
      However, it also has some restrictions, so introduce a new kernel boot
      parameter to control the behavior, which is disabled by default. Users
      can enable writeback control for cgroup v1 with the "cgwb_v1" kernel
      command line parameter.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      769519ba
    • fs/writeback: Attach inode's wb to root if needed · dda65ea8
      Committed by luanshi
      There might be tons of files queued in writeback, awaiting writeback.
      Unfortunately, the writeback's cgroup may already be dead. In this case,
      we try to reassociate the inode with another writeback, but we possibly
      can't because the writeback associated with the dead cgroup is the only
      valid one. In that case, a new writeback is allocated, initialized and
      associated with the inode over and over again until all data resident in
      the inode's page cache is flushed to disk. This causes unnecessarily
      high system load.
      
      This fixes the issue by forcing the inode to move to the root cgroup when
      its previously bound cgroup becomes dead. With it, no more unnecessary
      writebacks are created or populated, and the system load decreased by
      about 6x in the test case we carried out:
          Without the patch: 30% system load
          With the patch:    5%  system load
      Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      dda65ea8
    • fs/writeback: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount · 6644f956
      Committed by Jiufei Xue
      synchronize_rcu() doesn't wait for call_rcu() callbacks, so an inode wb
      switch may not have been queued on the workqueue after synchronize_rcu()
      returns. Thus previously scheduled switches are not finished even after
      flushing the workqueue, which causes the NULL pointer dereference shown
      below.
      
      VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds.  Have a nice day...
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000278
      [<ffffffff8126a303>] evict+0xb3/0x180
      [<ffffffff8126a760>] iput+0x1b0/0x230
      [<ffffffff8127c690>] inode_switch_wbs_work_fn+0x3c0/0x6a0
      [<ffffffff810a5b2e>] worker_thread+0x4e/0x490
      [<ffffffff810a5ae0>] ? process_one_work+0x410/0x410
      [<ffffffff810ac056>] kthread+0xe6/0x100
      [<ffffffff8173c199>] ret_from_fork+0x39/0x50
      
      Replace the synchronize_rcu() call with rcu_barrier() to wait for all
      pending callbacks to finish. Also increment isw_nr_in_flight after
      call_rcu() in inode_switch_wbs() so the accounting makes more sense.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      6644f956
    • fs/writeback: fix double free of blkcg_css · 0b58738a
      Committed by Jiufei Xue
      We have gotten a WARNING when releasing blkcg_css:
      
      [332489.681635] WARNING: CPU: 55 PID: 14859 at lib/list_debug.c:56 __list_del_entry+0x81/0xc0
      [332489.682191] list_del corruption, ffff883e6b94d450->prev is LIST_POISON2 (dead000000000200)
      ......
      [332489.683895] CPU: 55 PID: 14859 Comm: kworker/55:2 Tainted: G
      [332489.684477] Hardware name: Inspur SA5248M4/X10DRT-PS, BIOS 4.05A 10/11/2016
      [332489.685061] Workqueue: cgroup_destroy css_release_work_fn
      [332489.685654]  ffffc9001d92bd28 ffffffff81380042 ffffc9001d92bd78 0000000000000000
      [332489.686269]  ffffc9001d92bd68 ffffffff81088f8b 0000003800000000 ffff883e6b94d4a0
      [332489.686867]  ffff883e6b94d400 ffffffff81ce8fe0 ffff88375b24f400 ffff883e6b94d4a0
      [332489.687479] Call Trace:
      [332489.688078]  [<ffffffff81380042>] dump_stack+0x63/0x81
      [332489.688681]  [<ffffffff81088f8b>] __warn+0xcb/0xf0
      [332489.689276]  [<ffffffff8108900f>] warn_slowpath_fmt+0x5f/0x80
      [332489.689877]  [<ffffffff8139e7c1>] __list_del_entry+0x81/0xc0
      [332489.690481]  [<ffffffff81125552>] css_release_work_fn+0x42/0x140
      [332489.691090]  [<ffffffff810a2db9>] process_one_work+0x189/0x420
      [332489.691693]  [<ffffffff810a309e>] worker_thread+0x4e/0x4b0
      [332489.692293]  [<ffffffff810a3050>] ? process_one_work+0x420/0x420
      [332489.692905]  [<ffffffff810a9616>] kthread+0xe6/0x100
      [332489.693504]  [<ffffffff810a9530>] ? kthread_park+0x60/0x60
      [332489.694099]  [<ffffffff817184e1>] ret_from_fork+0x41/0x50
      [332489.694722] ---[ end trace 0cf869c4a5cfba87 ]---
      ......
      
      This is caused by calling css_get after the css has been killed by
      another thread, as described below:
      
                 Thread 1                       Thread 2
      cgroup_rmdir
        -> kill_css
          -> percpu_ref_kill_and_confirm
            -> css_killed_ref_fn
      
      css_killed_work_fn
        -> css_put
          -> css_release
                                              wb_get_create
      					  -> find_blkcg_css
      					    -> css_get
      					  -> css_put
      					    -> css_release (double free)
          -> css_release_workfn
            -> css_free_work_fn
             -> blkcg_css_free
      
      When the double free happens, it may free memory that is still used by
      other threads and cause a kernel panic.
      
      Fix this by using css_tryget_online in find_blkcg_css, which will return
      false if the css has been killed (the tryget idea is sketched after this
      entry).
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      0b58738a
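      A single-threaded userspace sketch of the tryget idea the fix relies on
      (simplified and illustrative; struct ref and its helpers are made up, only
      the "refuse a reference once killed" behaviour mirrors css_tryget_online):

        #include <stdbool.h>
        #include <stdio.h>

        struct ref {
                long count;
                bool online;
        };

        /* Take a reference only while the object is still online (not killed). */
        static bool ref_tryget_online(struct ref *r)
        {
                if (!r->online)
                        return false;
                r->count++;
                return true;
        }

        static void ref_put(struct ref *r)
        {
                if (--r->count == 0)
                        printf("released once\n");  /* must happen exactly once */
        }

        int main(void)
        {
                struct ref css = { .count = 1, .online = true };

                css.online = false;             /* cgroup_rmdir kills the css  */
                if (ref_tryget_online(&css))    /* wb_get_create: now refused  */
                        ref_put(&css);          /* so no unbalanced extra put  */
                ref_put(&css);                  /* the final, legitimate put   */
                return 0;
        }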
    • f27da2ad
    • writeback: add memcg_blkcg_link tree · 60448d43
      Committed by Jiufei Xue
      Here we add a global radix tree to link the memcg and blkcg that the
      user attaches tasks to when using cgroup v1; it is used for cgroup
      writeback.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      60448d43
    • ovl: check the capability before cred overridden · cc68ce6d
      Committed by Jiufei Xue
      We found that setting the IMMUTABLE_FL flag on a file in docker returns
      success even though the container doesn't have the CAP_LINUX_IMMUTABLE
      capability.
      
      Commits d1d04ef8 ("ovl: stack file ops") and dab5ca8f ("ovl: add
      lsattr/chattr support") implemented chattr operations on a regular
      overlay file. ovl_real_ioctl() overrides the current process's subjective
      credentials with ofs->creator_cred, which has the CAP_LINUX_IMMUTABLE
      capability, so that vfs_ioctl()->cap_capable() returns success.
      
      Fix this by checking the capability before the creds are overridden. Here
      we only care about APPEND_FL and IMMUTABLE_FL, so get this information
      from the inode (the check is sketched after this entry).
      
      [SzM: move check and call to underlying fs inside inode locked region to
      prevent two such calls from racing with each other]
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      cc68ce6d
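      The capability check can be pictured with a small userspace sketch (the
      helper and the capability flag variable are made up; the FS_*_FL values
      match include/uapi/linux/fs.h): a request that flips the append or
      immutable bits is rejected unless the caller holds CAP_LINUX_IMMUTABLE.

        #include <stdbool.h>
        #include <stdio.h>

        #define FS_IMMUTABLE_FL 0x00000010
        #define FS_APPEND_FL    0x00000020

        /* Stand-in for capable(CAP_LINUX_IMMUTABLE) on the caller's own creds. */
        static bool caller_has_linux_immutable;

        static int check_setflags(unsigned int old_flags, unsigned int new_flags)
        {
                unsigned int changed = old_flags ^ new_flags;

                if ((changed & (FS_IMMUTABLE_FL | FS_APPEND_FL)) &&
                    !caller_has_linux_immutable)
                        return -1;      /* the kernel would return -EPERM */
                return 0;
        }

        int main(void)
        {
                printf("%d\n", check_setflags(0, FS_IMMUTABLE_FL));  /* denied  */
                caller_has_linux_immutable = true;
                printf("%d\n", check_setflags(0, FS_IMMUTABLE_FL));  /* allowed */
                return 0;
        }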
    • fbdev: fix WARNING in __alloc_pages_nodemask bug · b9761d35
      Committed by Jiufei Xue
      commit 8c40292be9169a9cbe19aadd1a6fc60cbd1af82f upstream.
      
      Syzkaller hit 'WARNING in __alloc_pages_nodemask' bug.
      
      WARNING: CPU: 1 PID: 1473 at mm/page_alloc.c:4377 __alloc_pages_nodemask+0x4da/0x2130
      Kernel panic - not syncing: panic_on_warn set ...
      
      Call Trace:
       alloc_pages_current+0xb1/0x1e0
       kmalloc_order+0x1f/0x60
       kmalloc_order_trace+0x1d/0x120
       fb_alloc_cmap_gfp+0x85/0x2b0
       fb_set_user_cmap+0xff/0x370
       do_fb_ioctl+0x949/0xa20
       fb_ioctl+0xdd/0x120
       do_vfs_ioctl+0x186/0x1070
       ksys_ioctl+0x89/0xa0
       __x64_sys_ioctl+0x74/0xb0
       do_syscall_64+0xc8/0x550
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is a warning about order >= MAX_ORDER, and the order comes from a
      userspace ioctl. Add the __GFP_NOWARN flag to silence this warning.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      b9761d35
    • fbdev: fix divide error in fb_var_to_videomode · e7efd88b
      Committed by Shile Zhang
      commit cf84807f6dd0be5214378e66460cfc9187f532f9 upstream
      
      To fix the following divide-by-zero error found by Syzkaller (a sketch of
      the zero-check guard follows this entry):
      
        divide error: 0000 [#1] SMP PTI
        CPU: 7 PID: 8447 Comm: test Kdump: loaded Not tainted 4.19.24-8.al7.x86_64 #1
        Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        RIP: 0010:fb_var_to_videomode+0xae/0xc0
        Code: 04 44 03 46 78 03 4e 7c 44 03 46 68 03 4e 70 89 ce d1 ee 69 c0 e8 03 00 00 f6 c2 01 0f 45 ce 83 e2 02 8d 34 09 0f 45 ce 31 d2 <41> f7 f0 31 d2 f7 f1 89 47 08 f3 c3 66 0f 1f 44 00 00 0f 1f 44 00
        RSP: 0018:ffffb7e189347bf0 EFLAGS: 00010246
        RAX: 00000000e1692410 RBX: ffffb7e189347d60 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb7e189347c10
        RBP: ffff99972a091c00 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000100
        R13: 0000000000010000 R14: 00007ffd66baf6d0 R15: 0000000000000000
        FS:  00007f2054d11740(0000) GS:ffff99972fbc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f205481fd20 CR3: 00000004288a0001 CR4: 00000000001606a0
        Call Trace:
         fb_set_var+0x257/0x390
         ? lookup_fast+0xbb/0x2b0
         ? fb_open+0xc0/0x140
         ? chrdev_open+0xa6/0x1a0
         do_fb_ioctl+0x445/0x5a0
         do_vfs_ioctl+0x92/0x5f0
         ? __alloc_fd+0x3d/0x160
         ksys_ioctl+0x60/0x90
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x5b/0x190
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f20548258d7
        Code: 44 00 00 48 8b 05 b9 15 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 15 2d 00 f7 d8 64 89 01 48
      
      It can be triggered easily with the following test code:
      
        #include <linux/fb.h>
        #include <fcntl.h>
        #include <sys/ioctl.h>
        int main(void)
        {
                struct fb_var_screeninfo var = {.activate = 0x100, .pixclock = 60};
                int fd = open("/dev/fb0", O_RDWR);
                if (fd < 0)
                        return 1;
      
                if (ioctl(fd, FBIOPUT_VSCREENINFO, &var))
                        return 1;
      
                return 0;
        }
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Fredrik Noring <noring@nocrew.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      e7efd88b
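      A userspace sketch of the defensive pattern behind the fix (field names
      follow struct fb_var_screeninfo, but the helper and the exact formula are
      illustrative): only derive a refresh rate when pixclock and the totals
      are all non-zero, so zero-filled values from an ioctl never reach a
      division.

        #include <stdint.h>
        #include <stdio.h>

        struct var {
                uint32_t pixclock;      /* pixel clock period in picoseconds */
                uint32_t htotal;        /* total pixels per line             */
                uint32_t vtotal;        /* total lines per frame             */
        };

        static uint32_t refresh_hz(const struct var *v)
        {
                if (!v->pixclock || !v->htotal || !v->vtotal)
                        return 0;       /* leave refresh unset, don't divide */

                /* pixels/second = 1e12 / pixclock (ps), then divide per frame. */
                return (uint32_t)(1000000000000ULL / v->pixclock
                                  / v->htotal / v->vtotal);
        }

        int main(void)
        {
                struct var bad = { .pixclock = 60 };    /* htotal/vtotal == 0 */
                struct var ok  = { .pixclock = 39721, .htotal = 800, .vtotal = 525 };

                printf("%u %u\n", refresh_hz(&bad), refresh_hz(&ok));
                return 0;
        }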
    • net: kernel hookers service for toa module · 3c74cfbb
      Committed by George Zhang
      LVS fullnat will replace network traffic's source ip with its local ip,
      and thus the backend servers cannot obtain the real client ip.
      
      To solve this, LVS has introduced the tcp option address (TOA) to store
      the essential ip address information in the last tcp ack packet of the
      3-way handshake, and the backend servers need to retrieve it from the
      packet header.
      
      In this patch, we introduce the sk_toa_data member in the sock
      structure to hold the TOA information. There used to be an in-tree
      module for managing TOA, but it is now maintained as a standalone
      module.
      
      In this case, the toa module should register its hook function(s) using
      the provided interfaces in the hookers module.
      
      TOA in sock structure:
      
      	__be32 sk_toa_data[16];
      
      The hookers module only provides the sk_toa_data placeholder, and the
      toa module can use this variable through the layout it needs.
      
      Hook interfaces:
      
      The hookers module replaces the kernel's syn_recv_sock and getname
      handlers with a stub that chains the toa module's hook function(s) to
      the original handling function (illustrated after this entry). The
      hookers module allows hook functions to be installed and uninstalled in
      any order.
      
      toa module:
      
      The external toa module will be provided in a separate RPM package.
      
      [xuyu@linux.alibaba.com: amend commit log]
      Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
      3c74cfbb
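      The chaining described under "Hook interfaces" can be pictured with a
      userspace sketch (all names are illustrative, not the hookers module's
      real interface): the stub runs whatever hook is installed, then always
      falls through to the original handler.

        #include <stdio.h>

        typedef void (*handler_t)(const char *pkt);

        static void original_handler(const char *pkt)
        {
                printf("original handling of: %s\n", pkt);
        }

        static void toa_hook(const char *pkt)
        {
                printf("toa hook inspects options of: %s\n", pkt);
        }

        static handler_t installed_hook;                /* set by the toa module  */
        static handler_t original = original_handler;   /* saved original handler */

        static void stub_handler(const char *pkt)
        {
                if (installed_hook)
                        installed_hook(pkt);            /* run the registered hook */
                original(pkt);                          /* then the original logic */
        }

        int main(void)
        {
                handler_t handler = stub_handler;       /* what now gets called   */

                handler("final ACK of handshake");      /* only the original runs */
                installed_hook = toa_hook;              /* toa registers its hook */
                handler("final ACK of handshake");      /* hook + original run    */
                return 0;
        }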
    • virtio_blk: add discard and write zeroes support · 78c3b712
      Committed by Changpeng Liu
      commit 1f23816b8eb8fdc39990abe166c10a18c16f6b21 upstream.
      
      In commit 88c85538, "virtio-blk: add discard and write zeroes features
      to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio
      block specification has been extended to add VIRTIO_BLK_T_DISCARD and
      VIRTIO_BLK_T_WRITE_ZEROES commands.  This patch enables support for
      discard and write zeroes in the virtio-blk driver when the device
      advertises the corresponding features, VIRTIO_BLK_F_DISCARD and
      VIRTIO_BLK_F_WRITE_ZEROES.
      Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
      Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      78c3b712
    • kconfig: Disable x86 clocksource watchdog · cfd1e254
      Committed by Jiufei Xue
      An unstable tsc will trigger the clocksource watchdog and disable
      itself; as a result, another clocksource will be elected as the current
      clocksource, which results in performance issues on our servers.
      
      RHEL7 also disabled this feature for some issues, see changelog:
      [x86] disable clocksource watchdog (Prarit Bhargava) [914709]
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      cfd1e254
    • Revert "x86/tsc: Prepare warp test for TSC adjustment" · 727cae00
      Committed by Jiufei Xue
      This reverts commit 76d3b851.
      
      The return value of check_tsc_warp() is useless now, so remove it.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      727cae00
    • Revert "x86/tsc: Try to adjust TSC if sync test fails" · 1a33c395
      Committed by Jiufei Xue
      This reverts commit cc4db268.
      
      When we hot-add and enable a vCPU, the time inside the VM jumps and
      then the VM gets stuck.
      The dmesg shows the following:
      [   48.402948] CPU2 has been hot-added
      [   48.413774] smpboot: Booting Node 0 Processor 2 APIC 0x2
      [   48.415155] kvm-clock: cpu 2, msr 6b615081, secondary cpu clock
      [   48.453690] TSC ADJUST compensate: CPU2 observed 139318776350 warp.  Adjust: 139318776350
      [  102.060874] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
      [  102.060874] clocksource:                       'kvm-clock' wd_now: 1cb1cfc4bf8 wd_last: 1be9588f1fe mask: ffffffffffffffff
      [  102.060874] clocksource:                       'tsc' cs_now: 207d794f7e cs_last: 205a32697a mask: ffffffffffffffff
      [  102.060874] tsc: Marking TSC unstable due to clocksource watchdog
      [  102.070188] KVM setup async PF for cpu 2
      [  102.071461] kvm-stealtime: cpu 2, msr 13ba95000
      [  102.074530] Will online and init hotplugged CPU: 2
      
      This is because the TSC for the newly added vCPU is initialized to 0
      while the others are ahead. The guest will then do the TSC ADJUST
      compensation and cause the time to jump.
      
      Commit bd8fab39 ("KVM: x86: fix maintaining of kvm_clock stability
      on guest CPU hotplug") can fix this problem.  However, the host kernel
      version may be older, so do not adjust the TSC if the sync test fails,
      just mark it unstable.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      1a33c395
    • block-throttle: enable hierarchical throttling even on traditional hierarchy · 8e010885
      Committed by Joseph Qi
      ECI may have a use case of configuring each device-mapper disk's
      throttling policy just under the root blkio cgroup, while actually using
      the disks in different containers.
      Since hierarchical throttling is currently only supported on cgroup v2
      and ECI uses cgroup v1, we have to enable hierarchical throttling on
      cgroup v1.
      This is ported from redhat 7u, and a year ago Jiufei already ported it
      to alikernel 4.9 as well, so I think this change should be acceptable.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      8e010885
    • eci: drivers/virtio: add vring_force_dma_api boot param · ab5dcc8f
      Committed by Eryu Guan
      Prior to the xdragon platform 20181230 release (e.g. the 0930 release),
      vring_use_dma_api() is required to return 'true' unconditionally.
      
      Introduce a new kernel boot parameter called "vring_force_dma_api" to
      control the behavior: boot the xdragon host with "vring_force_dma_api"
      on the command line to make ENI hotplug work, while normal ECS hosts
      keep the original behavior.
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
      ab5dcc8f
    • boot: give rdrand some credit · 6cf71d7e
      Committed by Arjan van de Ven
      Cherry-pick from clear-linux patches:
      https://github.com/clearlinux-pkgs/linux-kvm/0104-give-rdrand-some-credit.patch
      
      Try to credit rdrand/rdseed with some entropy.
      
      In VMs, and even on modern hardware, we're super starved for entropy,
      and while we can and do wear a tin foil hat, it's very hard to argue
      that rdrand and rdtsc add zero entropy.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      6cf71d7e
    • NEMU: Compile in evged always · 133cf0e6
      Committed by Arjan van de Ven
      Cherry-pick from kata-container patches:
      https://github.com/kata-containers/packaging/tree/master/kernel/patches/0002-Compile-in-evged-always.patch
      
      We need evged for NEMU (and in general for hw reduced)
      
      The config option cannot be set normally since it breaks all
      regular systems, and hardware reduced is really a runtime choice.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      133cf0e6
    • ext4: fix reserved cluster accounting at page invalidation time · 46631b20
      Committed by Eric Whitney
      commit f456767d3391e9f7d9d25a2e7241d75676dc19da upstream.
      
      Add new code to count canceled pending cluster reservations on bigalloc
      file systems and to reduce the cluster reservation count on all file
      systems using delayed allocation.  This replaces old code in
      ext4_da_page_release_reservations that was incorrect.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      46631b20
    • ext4: adjust reserved cluster count when removing extents · c4cdb449
      Committed by Eric Whitney
      commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream.
      
      Modify ext4_ext_remove_space() and the code it calls to correct the
      reserved cluster count for pending reservations (delayed allocated
      clusters shared with allocated blocks) when a block range is removed
      from the extent tree.  Pending reservations may be found for the clusters
      at the ends of written or unwritten extents when a block range is removed.
      If a physical cluster at the end of an extent is freed, it's necessary
      to increment the reserved cluster count to maintain correct accounting
      if the corresponding logical cluster is shared with at least one
      delayed and unwritten extent as found in the extents status tree.
      
      Add a new function, ext4_rereserve_cluster(), to reapply a reservation
      on a delayed allocated cluster sharing blocks with a freed allocated
      cluster.  To avoid ENOSPC on reservation, a flag is applied to
      ext4_free_blocks() to briefly defer updating the freeclusters counter
      when an allocated cluster is freed.  This prevents another thread
      from allocating the freed block before the reservation can be reapplied.
      
      Redefine the partial cluster object as a struct to carry more state
      information and to clarify the code using it.
      
      Adjust the conditional code structure in ext4_ext_remove_space to
      reduce the indentation level in the main body of the code to improve
      readability.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      c4cdb449
    • ext4: reduce reserved cluster count by number of allocated clusters · 1b7e8112
      Committed by Eric Whitney
      commit b6bf9171ef5c37b66d446378ba63af5339a56a97 upstream.
      
      Ext4 does not always reduce the reserved cluster count by the number
      of clusters allocated when mapping a delayed extent.  It sometimes
      adds back one or more clusters after allocation if delalloc blocks
      adjacent to the range allocated by ext4_ext_map_blocks() share the
      clusters newly allocated for that range.  However, this overcounts
      the number of clusters needed to satisfy future mapping requests
      (holding one or more reservations for clusters that have already been
      allocated) and premature ENOSPC and quota failures, etc., result.
      
      Ext4 also does not reduce the reserved cluster count when allocating
      clusters for non-delayed allocated writes that have previously been
      reserved for delayed writes.  This also results in overcounts.
      
      To make it possible to handle reserved cluster accounting for
      fallocated regions in the same manner as used for other non-delayed
      writes, do the reserved cluster accounting for them at the time of
      allocation.  In the current code, this is only done later when a
      delayed extent sharing the fallocated region is finally mapped.
      
      Address comment correcting handling of unsigned long long constant
      from Jan Kara's review of RFC version of this patch.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      1b7e8112