1. 31 July 2019, 17 commits
  2. 28 July 2019, 23 commits
    • Linux 4.19.62 · 64f46940
      Committed by Greg Kroah-Hartman
      64f46940
    • net: sched: verify that q!=NULL before setting q->flags · 60e9babf
      Committed by Vlad Buslov
      commit 503d81d428bd598430f7f9d02021634e1a8139a0 upstream.
      
      In tc_new_tfilter() the q pointer can be NULL when adding a filter on a
      shared block. With the recent change that resets TCQ_F_CAN_BYPASS after
      filter creation, the following NULL pointer dereference happens when the
      parent block is shared:
      
      [  212.925060] BUG: kernel NULL pointer dereference, address: 0000000000000010
      [  212.925445] #PF: supervisor write access in kernel mode
      [  212.925709] #PF: error_code(0x0002) - not-present page
      [  212.925965] PGD 8000000827923067 P4D 8000000827923067 PUD 827924067 PMD 0
      [  212.926302] Oops: 0002 [#1] SMP KASAN PTI
      [  212.926539] CPU: 18 PID: 2617 Comm: tc Tainted: G    B             5.2.0+ #512
      [  212.926938] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
      [  212.927364] RIP: 0010:tc_new_tfilter+0x698/0xd40
      [  212.927633] Code: 74 0d 48 85 c0 74 08 48 89 ef e8 03 aa 62 00 48 8b 84 24 a0 00 00 00 48 8d 78 10 48 89 44 24 18 e8 4d 0c 6b ff 48 8b 44 24 18 <83> 60 10 fb 48 85 ed 0f 85 3d fe ff ff e9 4f fe ff ff e8 81 26 f8
      [  212.928607] RSP: 0018:ffff88884fd5f5d8 EFLAGS: 00010296
      [  212.928905] RAX: 0000000000000000 RBX: 0000000000000000 RCX: dffffc0000000000
      [  212.929201] RDX: 0000000000000007 RSI: 0000000000000004 RDI: 0000000000000297
      [  212.929402] RBP: ffff88886bedd600 R08: ffffffffb91d4b51 R09: fffffbfff7616e4d
      [  212.929609] R10: fffffbfff7616e4c R11: ffffffffbb0b7263 R12: ffff88886bc61040
      [  212.929803] R13: ffff88884fd5f950 R14: ffffc900039c5000 R15: ffff88835e927680
      [  212.929999] FS:  00007fe7c50b6480(0000) GS:ffff88886f980000(0000) knlGS:0000000000000000
      [  212.930235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  212.930394] CR2: 0000000000000010 CR3: 000000085bd04002 CR4: 00000000001606e0
      [  212.930588] Call Trace:
      [  212.930682]  ? tc_del_tfilter+0xa40/0xa40
      [  212.930811]  ? __lock_acquire+0x5b5/0x2460
      [  212.930948]  ? find_held_lock+0x85/0xa0
      [  212.931081]  ? tc_del_tfilter+0xa40/0xa40
      [  212.931201]  rtnetlink_rcv_msg+0x4ab/0x5f0
      [  212.931332]  ? rtnl_dellink+0x490/0x490
      [  212.931454]  ? lockdep_hardirqs_on+0x260/0x260
      [  212.931589]  ? netlink_deliver_tap+0xab/0x5a0
      [  212.931717]  ? match_held_lock+0x1b/0x240
      [  212.931844]  netlink_rcv_skb+0xd0/0x200
      [  212.931958]  ? rtnl_dellink+0x490/0x490
      [  212.932079]  ? netlink_ack+0x440/0x440
      [  212.932205]  ? netlink_deliver_tap+0x161/0x5a0
      [  212.932335]  ? lock_downgrade+0x360/0x360
      [  212.932457]  ? lock_acquire+0xe5/0x210
      [  212.932579]  netlink_unicast+0x296/0x350
      [  212.932705]  ? netlink_attachskb+0x390/0x390
      [  212.932834]  ? _copy_from_iter_full+0xe0/0x3a0
      [  212.932976]  netlink_sendmsg+0x394/0x600
      [  212.937998]  ? netlink_unicast+0x350/0x350
      [  212.943033]  ? move_addr_to_kernel.part.0+0x90/0x90
      [  212.948115]  ? netlink_unicast+0x350/0x350
      [  212.953185]  sock_sendmsg+0x96/0xa0
      [  212.958099]  ___sys_sendmsg+0x482/0x520
      [  212.962881]  ? match_held_lock+0x1b/0x240
      [  212.967618]  ? copy_msghdr_from_user+0x250/0x250
      [  212.972337]  ? lock_downgrade+0x360/0x360
      [  212.976973]  ? rwlock_bug.part.0+0x60/0x60
      [  212.981548]  ? __mod_node_page_state+0x1f/0xa0
      [  212.986060]  ? match_held_lock+0x1b/0x240
      [  212.990567]  ? find_held_lock+0x85/0xa0
      [  212.994989]  ? do_user_addr_fault+0x349/0x5b0
      [  212.999387]  ? lock_downgrade+0x360/0x360
      [  213.003713]  ? find_held_lock+0x85/0xa0
      [  213.007972]  ? __fget_light+0xa1/0xf0
      [  213.012143]  ? sockfd_lookup_light+0x91/0xb0
      [  213.016165]  __sys_sendmsg+0xba/0x130
      [  213.020040]  ? __sys_sendmsg_sock+0xb0/0xb0
      [  213.023870]  ? handle_mm_fault+0x337/0x470
      [  213.027592]  ? page_fault+0x8/0x30
      [  213.031316]  ? lockdep_hardirqs_off+0xbe/0x100
      [  213.034999]  ? mark_held_locks+0x24/0x90
      [  213.038671]  ? do_syscall_64+0x1e/0xe0
      [  213.042297]  do_syscall_64+0x74/0xe0
      [  213.045828]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  213.049354] RIP: 0033:0x7fe7c527c7b8
      [  213.052792] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 54
      [  213.060269] RSP: 002b:00007ffc3f7908a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [  213.064144] RAX: ffffffffffffffda RBX: 000000005d34716f RCX: 00007fe7c527c7b8
      [  213.068094] RDX: 0000000000000000 RSI: 00007ffc3f790910 RDI: 0000000000000003
      [  213.072109] RBP: 0000000000000000 R08: 0000000000000001 R09: 00007fe7c5340cc0
      [  213.076113] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000080
      [  213.080146] R13: 0000000000480640 R14: 0000000000000080 R15: 0000000000000000
      [  213.084147] Modules linked in: act_gact cls_flower sch_ingress nfsv3 nfs_acl nfs lockd grace fscache bridge stp llc sunrpc intel_rapl_msr intel_rapl_common sb_edac rdma_ucm rdma_cm x86_pkg_temp_thermal iw_cm intel_powerclamp ib_cm coretemp kvm_intel kvm irqbypass mlx5_ib ib_uverbs ib_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mlx5_core intel_cstate intel_uncore iTCO_wdt igb iTCO_vendor_support mlxfw mei_me ptp ses intel_rapl_perf mei pcspkr ipmi_ssif i2c_i801 joydev enclosure pps_core lpc_ich ioatdma wmi dca ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper drm mpt3sas raid_class scsi_transport_sas
      [  213.112326] CR2: 0000000000000010
      [  213.117429] ---[ end trace adb58eb0a4ee6283 ]---
      
      Verify that the q pointer is not NULL before setting the 'flags' field.
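      
      A minimal sketch of that guard, assuming the tc_new_tfilter() context in
      net/sched/cls_api.c (not necessarily the verbatim upstream hunk):
      
      	/* q is NULL when the filter was added on a shared block, so only
      	 * clear TCQ_F_CAN_BYPASS when a qdisc is actually attached.
      	 */
      	if (q)
      		q->flags &= ~TCQ_F_CAN_BYPASS;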
      
      Fixes: 3f05e6886a59 ("net_sched: unset TCQ_F_CAN_BYPASS when adding filters")
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      60e9babf
    • mm: vmscan: scan anonymous pages on file refaults · c1d98b76
      Committed by Kuo-Hsin Yang
      commit 2c012a4ad1a2cd3fb5a0f9307b9d219f84eda1fa upstream.
      
      When file refaults are detected and there are many inactive file pages,
      the system never reclaims anonymous pages; the file pages are dropped
      aggressively even when there are still a lot of cold anonymous pages, and
      the system thrashes.  This issue impacts the performance of applications
      with large executables, e.g.  chrome.
      
      With this patch, when file refault is detected, inactive_list_is_low()
      always returns true for file pages in get_scan_count() to enable
      scanning anonymous pages.
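      
      A simplified sketch of that behaviour, assuming the mm/vmscan.c helpers
      below (signature and surrounding logic abridged; not the verbatim
      upstream change):
      
      static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
      				 struct scan_control *sc, bool trace)
      {
      	unsigned long inactive, active, refaults, gb, inactive_ratio;
      
      	inactive = lruvec_lru_size(lruvec, file ? LRU_INACTIVE_FILE :
      				   LRU_INACTIVE_ANON, MAX_NR_ZONES);
      	active = lruvec_lru_size(lruvec, file ? LRU_ACTIVE_FILE :
      				 LRU_ACTIVE_ANON, MAX_NR_ZONES);
      
      	/*
      	 * File refaults mean a new working set is forming: stop protecting
      	 * the active file list so get_scan_count() also scans anon pages.
      	 */
      	refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE);
      	if (file && lruvec->refaults != refaults) {
      		inactive_ratio = 0;
      	} else {
      		gb = (inactive + active) >> (30 - PAGE_SHIFT);
      		inactive_ratio = gb ? int_sqrt(10 * gb) : 1;
      	}
      
      	return inactive * inactive_ratio < active;
      }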
      
      The problem can be reproduced by the following test program.
      
      ---8<---
      /* Headers needed to build the reproducer standalone. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <sys/time.h>
      #include <unistd.h>
      
      void fallocate_file(const char *filename, off_t size)
      {
      	struct stat st;
      	int fd;
      
      	if (!stat(filename, &st) && st.st_size >= size)
      		return;
      
      	fd = open(filename, O_WRONLY | O_CREAT, 0600);
      	if (fd < 0) {
      		perror("create file");
      		exit(1);
      	}
      	if (posix_fallocate(fd, 0, size)) {
      		perror("fallocate");
      		exit(1);
      	}
      	close(fd);
      }
      
      long *alloc_anon(long size)
      {
      	long *start = malloc(size);
      	memset(start, 1, size);
      	return start;
      }
      
      long access_file(const char *filename, long size, long rounds)
      {
      	int fd, i;
      	volatile char *start1, *end1, *start2;
      	const int page_size = getpagesize();
      	long sum = 0;
      
      	fd = open(filename, O_RDONLY);
      	if (fd == -1) {
      		perror("open");
      		exit(1);
      	}
      
      	/*
      	 * Some applications, e.g. chrome, use a lot of executable file
      	 * pages, map some of the pages with PROT_EXEC flag to simulate
      	 * the behavior.
      	 */
      	start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
      		      fd, 0);
      	if (start1 == MAP_FAILED) {
      		perror("mmap");
      		exit(1);
      	}
      	end1 = start1 + size / 2;
      
      	start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
      	if (start2 == MAP_FAILED) {
      		perror("mmap");
      		exit(1);
      	}
      
      	for (i = 0; i < rounds; ++i) {
      		struct timeval before, after;
      		volatile char *ptr1 = start1, *ptr2 = start2;
      		gettimeofday(&before, NULL);
      		for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
      			sum += *ptr1 + *ptr2;
      		gettimeofday(&after, NULL);
      		printf("File access time, round %d: %f (sec)
      ", i,
      		       (after.tv_sec - before.tv_sec) +
      		       (after.tv_usec - before.tv_usec) / 1000000.0);
      	}
      	return sum;
      }
      
      int main(int argc, char *argv[])
      {
      	const long MB = 1024 * 1024;
      	long anon_mb, file_mb, file_rounds;
      	const char filename[] = "large";
      	long *ret1;
      	long ret2;
      
      	if (argc != 4) {
      		printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS
      ");
      		exit(0);
      	}
      	anon_mb = atoi(argv[1]);
      	file_mb = atoi(argv[2]);
      	file_rounds = atoi(argv[3]);
      
      	fallocate_file(filename, file_mb * MB);
      	printf("Allocate %ld MB anonymous pages
      ", anon_mb);
      	ret1 = alloc_anon(anon_mb * MB);
      	printf("Access %ld MB file pages
      ", file_mb);
      	ret2 = access_file(filename, file_mb * MB, file_rounds);
      	printf("Print result to prevent optimization: %ld
      ",
      	       *ret1 + ret2);
      	return 0;
      }
      ---8<---
      
      Running the test program on a 2GB RAM VM with kernel 5.2.0-rc5, the program
      fills RAM with 2048 MB of anonymous memory and accesses a 200 MB file 10
      times.  Without this patch, the file cache is dropped aggressively and
      every access to the file comes from disk.
      
        $ ./thrash 2048 200 10
        Allocate 2048 MB anonymous pages
        Access 200 MB file pages
        File access time, round 0: 2.489316 (sec)
        File access time, round 1: 2.581277 (sec)
        File access time, round 2: 2.487624 (sec)
        File access time, round 3: 2.449100 (sec)
        File access time, round 4: 2.420423 (sec)
        File access time, round 5: 2.343411 (sec)
        File access time, round 6: 2.454833 (sec)
        File access time, round 7: 2.483398 (sec)
        File access time, round 8: 2.572701 (sec)
        File access time, round 9: 2.493014 (sec)
      
      With this patch, these file pages can be cached.
      
        $ ./thrash 2048 200 10
        Allocate 2048 MB anonymous pages
        Access 200 MB file pages
        File access time, round 0: 2.475189 (sec)
        File access time, round 1: 2.440777 (sec)
        File access time, round 2: 2.411671 (sec)
        File access time, round 3: 1.955267 (sec)
        File access time, round 4: 0.029924 (sec)
        File access time, round 5: 0.000808 (sec)
        File access time, round 6: 0.000771 (sec)
        File access time, round 7: 0.000746 (sec)
        File access time, round 8: 0.000738 (sec)
        File access time, round 9: 0.000747 (sec)
      
      Checking the swap-out stats during the test [1]: 19006 pages were swapped
      out with this patch, 3418 pages without it.  There is more swap-out, but I
      think it's within a reasonable range when the file-backed data set doesn't
      fit into memory.
      
      $ ./thrash 2000 100 2100 5 1 # ANON_MB FILE_EXEC FILE_NOEXEC ROUNDS PROCESSES
      Allocate 2000 MB anonymous pages
      active_anon: 1613644, inactive_anon: 348656, active_file: 892, inactive_file: 1384 (kB)
      pswpout: 7972443, pgpgin: 478615246
      Access 100 MB executable file pages
      Access 2100 MB regular file pages
      File access time, round 0: 12.165, (sec)
      active_anon: 1433788, inactive_anon: 478116, active_file: 17896, inactive_file: 24328 (kB)
      File access time, round 1: 11.493, (sec)
      active_anon: 1430576, inactive_anon: 477144, active_file: 25440, inactive_file: 26172 (kB)
      File access time, round 2: 11.455, (sec)
      active_anon: 1427436, inactive_anon: 476060, active_file: 21112, inactive_file: 28808 (kB)
      File access time, round 3: 11.454, (sec)
      active_anon: 1420444, inactive_anon: 473632, active_file: 23216, inactive_file: 35036 (kB)
      File access time, round 4: 11.479, (sec)
      active_anon: 1413964, inactive_anon: 471460, active_file: 31728, inactive_file: 32224 (kB)
      pswpout: 7991449 (+ 19006), pgpgin: 489924366 (+ 11309120)
      
      With 4 processes accessing non-overlapping parts of a large file, 30316
      pages were swapped out with this patch, 5152 pages without it.  The
      swap-out number is small compared to pgpgin.
      
      [1]: https://github.com/vovo/testing/blob/master/mem_thrash.c
      
      Link: http://lkml.kernel.org/r/20190701081038.GA83398@google.com
      Fixes: e9868505 ("mm,vmscan: only evict file pages when we have plenty")
      Fixes: 7c5bd705 ("mm: memcg: only evict file pages when we have plenty")
      Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [backported to 4.14.y, 4.19.y, 5.1.y: adjust context]
      Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c1d98b76
    • KVM: nVMX: Clear pending KVM_REQ_GET_VMCS12_PAGES when leaving nested · 7560e333
      Committed by Jan Kiszka
      commit cf64527bb33f6cec2ed50f89182fc4688d0056b6 upstream.
      
      Letting this pend may cause nested_get_vmcs12_pages to run against an
      invalid state, corrupting the effective vmcs of L1.
      
      This was triggerable in QEMU after a guest corruption in L2, followed by
      an L1 reset.
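      
      A hedged sketch of the fix described (vmx nested teardown context assumed;
      the exact function and backport context may differ):
      
      static void free_nested(struct kvm_vcpu *vcpu)
      {
      	struct vcpu_vmx *vmx = to_vmx(vcpu);
      
      	/* Drop any pending request to reload the vmcs12 pages before the
      	 * nested state is torn down, so it cannot run against freed state.
      	 */
      	kvm_clear_request(KVM_REQ_GET_VMCS12_PAGES, vcpu);
      
      	/* ... existing teardown of vmcs02, shadow VMCS and cached pages ... */
      }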
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 7f7f1ba3 ("KVM: x86: do not load vmcs12 pages while still in SMM")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7560e333
    • KVM: nVMX: do not use dangling shadow VMCS after guest reset · 967bc679
      Committed by Paolo Bonzini
      commit 88dddc11a8d6b09201b4db9d255b3394d9bc9e57 upstream.
      
      If a KVM guest is reset while running a nested guest, free_nested will
      disable the shadow VMCS execution control in the vmcs01.  However,
      on the next KVM_RUN vmx_vcpu_run would nevertheless try to sync
      the VMCS12 to the shadow VMCS which has since been freed.
      
      This causes a vmptrld of a NULL pointer on my machine, but Jan reports
      that the host hangs altogether.  Let's see how much this trivial patch fixes.
      Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      967bc679
    • ext4: allow directory holes · 3a17ca86
      Committed by Theodore Ts'o
      commit 4e19d6b65fb4fc42e352ce9883649e049da14743 upstream.
      
      The largedir feature was intended to allow ext4 directories to have
      unmapped directory blocks (e.g., directory holes).  And so the
      released e2fsprogs no longer enforces this for largedir file systems;
      however, the corresponding change to the kernel-side code was not made.
      
      This commit fixes this oversight.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a17ca86
    • ext4: use jbd2_inode dirty range scoping · caa4e082
      Committed by Ross Zwisler
      commit 73131fbb003b3691cfcf9656f234b00da497fcd6 upstream.
      
      Use the newly introduced jbd2_inode dirty range scoping to prevent us
      from waiting forever when trying to complete a journal transaction.
      Signed-off-by: Ross Zwisler <zwisler@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      caa4e082
    • jbd2: introduce jbd2_inode dirty range scoping · af3812b6
      Committed by Ross Zwisler
      commit 6ba0e7dc64a5adcda2fbe65adc466891795d639e upstream.
      
      Currently both journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() operate on the entire address space
      of each of the inodes associated with a given journal entry.  The
      consequence of this is that if we have an inode where we are constantly
      appending dirty pages we can end up waiting for an indefinite amount of
      time in journal_finish_inode_data_buffers() while we wait for all the
      pages under writeback to be written out.
      
      The easiest way to cause this type of workload is to just dd from
      /dev/zero to a file until it fills the entire filesystem.  This can
      cause journal_finish_inode_data_buffers() to wait for the duration of
      the entire dd operation.
      
      We can improve this situation by scoping each of the inode dirty ranges
      associated with a given transaction.  We do this via the jbd2_inode
      structure so that the scoping is contained within jbd2 and so that it
      follows the lifetime and locking rules for that structure.
      
      This allows us to limit the writeback & wait in
      journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() respectively to the dirty range for
      a given struct jbd2_inode, keeping us from waiting forever if the inode
      in question is still being appended to.
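      
      A hedged sketch of the scoped writeback and wait this enables, assuming
      jbd2_inode gains i_dirty_start/i_dirty_end fields (names assumed, not the
      verbatim upstream diff):
      
      static int journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
      {
      	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
      
      	/* Write back only the byte range this transaction dirtied. */
      	return filemap_fdatawrite_range(mapping, jinode->i_dirty_start,
      					jinode->i_dirty_end);
      }
      
      static int journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
      {
      	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
      
      	/* Wait only on that range, keeping errors for the caller to see. */
      	return filemap_fdatawait_range_keep_errors(mapping,
      						   jinode->i_dirty_start,
      						   jinode->i_dirty_end);
      }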
      Signed-off-by: Ross Zwisler <zwisler@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      af3812b6
    • mm: add filemap_fdatawait_range_keep_errors() · 4becd6c1
      Committed by Ross Zwisler
      commit aa0bfcd939c30617385ffa28682c062d78050eba upstream.
      
      In the spirit of filemap_fdatawait_range() and
      filemap_fdatawait_keep_errors(), introduce
      filemap_fdatawait_range_keep_errors() which both takes a range upon
      which to wait and does not clear errors from the address space.
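      
      Presumably the new helper just combines the existing mm/filemap.c building
      blocks; a hedged sketch:
      
      int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
      					loff_t start_byte, loff_t end_byte)
      {
      	/* Wait for writeback in the range, but do not clear AS_EIO/AS_ENOSPC
      	 * from the mapping; only report errors recorded so far.
      	 */
      	__filemap_fdatawait_range(mapping, start_byte, end_byte);
      	return filemap_check_and_keep_errors(mapping);
      }
      EXPORT_SYMBOL(filemap_fdatawait_range_keep_errors);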
      Signed-off-by: Ross Zwisler <zwisler@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4becd6c1
    • ext4: enforce the immutable flag on open files · c9ea4620
      Committed by Theodore Ts'o
      commit 02b016ca7f99229ae6227e7b2fc950c4e140d74a upstream.
      
      According to the chattr man page, "a file with the 'i' attribute
      cannot be modified..."  Historically, this was only enforced when the
      file was opened, per the rest of the description, "... and the file
      can not be opened in write mode".
      
      There is general agreement that we should standardize all file systems
      to prevent modifications even for files that were opened at the time
      the immutable flag is set.  Eventually, a change to enforce this at
      the VFS layer should be landing in mainline.  Until then, enforce this
      at the ext4 level to prevent xfstests generic/553 from failing.
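      
      A hedged sketch of the kind of check this adds in the ext4 write path
      (exact placement differs between mainline and the stable backport):
      
      	/* Reject writes through descriptors that were opened before the
      	 * inode was marked immutable.
      	 */
      	if (unlikely(IS_IMMUTABLE(file_inode(iocb->ki_filp))))
      		return -EPERM;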
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9ea4620
    • ext4: don't allow any modifications to an immutable file · 29171e82
      Committed by Darrick J. Wong
      commit 2e53840362771c73eb0a5ff71611507e64e8eecd upstream.
      
      Don't allow any modifications to a file that's marked immutable, which
      means that we have to flush all the writable pages to make them read-only
      and we have to check the setattr/setflags parameters more closely.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      29171e82
    • perf/core: Fix race between close() and fork() · 4a5cc64d
      Committed by Peter Zijlstra
      commit 1cf8dfe8a661f0462925df943140e9f6d1ea5233 upstream.
      
      Syzkaller reported the following use-after-free bug:
      
      	close()						clone()
      
      							  copy_process()
      							    perf_event_init_task()
      							      perf_event_init_context()
      							        mutex_lock(parent_ctx->mutex)
      								inherit_task_group()
      								  inherit_group()
      								    inherit_event()
      								      mutex_lock(event->child_mutex)
      								      // expose event on child list
      								      list_add_tail()
      								      mutex_unlock(event->child_mutex)
      							        mutex_unlock(parent_ctx->mutex)
      
      							    ...
      							    goto bad_fork_*
      
      							  bad_fork_cleanup_perf:
      							    perf_event_free_task()
      
      	  perf_release()
      	    perf_event_release_kernel()
      	      list_for_each_entry()
      		mutex_lock(ctx->mutex)
      		mutex_lock(event->child_mutex)
      		// event is from the failing inherit
      		// on the other CPU
      		perf_remove_from_context()
      		list_move()
      		mutex_unlock(event->child_mutex)
      		mutex_unlock(ctx->mutex)
      
      							      mutex_lock(ctx->mutex)
      							      list_for_each_entry_safe()
      							        // event already stolen
      							      mutex_unlock(ctx->mutex)
      
      							    delayed_free_task()
      							      free_task()
      
      	     list_for_each_entry_safe()
      	       list_del()
      	       free_event()
      	         _free_event()
      		   // and so event->hw.target
      		   // is the already freed failed clone()
      		   if (event->hw.target)
      		     put_task_struct(event->hw.target)
      		       // WHOOPSIE, already quite dead
      
      Which puts the lie to the comment on perf_event_free_task():
      'unexposed, unused context' not so much.
      
      Which is a 'fun' confluence of fail; copy_process() doing an
      unconditional free_task() and not respecting refcounts, and perf having
      creative locking. In particular:
      
        82d94856 ("perf/core: Fix lock inversion between perf,trace,cpuhp")
      
      seems to have overlooked this 'fun' parade.
      
      Solve it by using the fact that detached events still have a reference
      count on their (previous) context. With this perf_event_free_task()
      can detect when events have escaped and wait for their destruction.
      Debugged-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Reported-by: syzbot+a24c397a29ad22d86c98@syzkaller.appspotmail.com
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 82d94856 ("perf/core: Fix lock inversion between perf,trace,cpuhp")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a5cc64d
    • perf/core: Fix exclusive events' grouping · 75100ec5
      Committed by Alexander Shishkin
      commit 8a58ddae23796c733c5dfbd717538d89d036c5bd upstream.
      
      So far, we tried to disallow grouping exclusive events, for fear of the
      complications they would cause with moving between contexts. Specifically,
      moving a software group to a hardware context would violate the exclusivity
      rules if both groups contain matching exclusive events.
      
      This attempt was, however, unsuccessful: the check that we have in the
      perf_event_open() syscall is both wrong (looks at the wrong PMU) and
      insufficient (group leader may still be exclusive), as can be illustrated
      by running:
      
        $ perf record -e '{intel_pt//,cycles}' uname
        $ perf record -e '{cycles,intel_pt//}' uname
      
      ultimately successfully.
      
      Furthermore, we are completely free to trigger the exclusivity violation
      by:
      
         perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'
      
      even though the helpful perf record will not allow that, the ABI will.
      
      The warning later in the perf_event_open() path will also not trigger, because
      it's also wrong.
      
      Fix all this by validating the original group before moving, getting rid
      of broken safeguards and placing a useful one in perf_install_in_context().
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mathieu.poirier@linaro.org
      Cc: will.deacon@arm.com
      Fixes: bed5b25a ("perf: Add a pmu capability for "exclusive" events")
      Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      75100ec5
    • MIPS: lb60: Fix pin mappings · 0e6ef184
      Committed by Paul Cercueil
      commit 1323c3b72a987de57141cabc44bf9cd83656bc70 upstream.
      
      The pin mappings introduced in commit 636f8ba6
      ("MIPS: JZ4740: Qi LB60: Add pinctrl configuration for several drivers")
      are completely wrong. The pinctrl driver name is incorrect, and the
      function and group fields are swapped.
      
      Fixes: 636f8ba6 ("MIPS: JZ4740: Qi LB60: Add pinctrl configuration for several drivers")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Paul Cercueil <paul@crapouillou.net>
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Paul Burton <paul.burton@mips.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: od@zcrc.me
      Cc: linux-mips@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0e6ef184
    • gpio: davinci: silence error prints in case of EPROBE_DEFER · dd5994ab
      Committed by Keerthy
      commit 541e4095f388c196685685633c950cb9b97f8039 upstream.
      
      Silence error prints in case of EPROBE_DEFER. This avoids
      multiple/duplicate defer prints during boot.
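      
      A hedged sketch of the idiom, with a hypothetical probe-time IRQ lookup
      standing in for the driver's actual calls (pdev is the probed
      platform_device):
      
      	ret = platform_get_irq(pdev, 0);
      	if (ret < 0) {
      		/* Deferral is expected and will be retried; stay quiet. */
      		if (ret != -EPROBE_DEFER)
      			dev_err(&pdev->dev, "failed to get IRQ: %d\n", ret);
      		return ret;
      	}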
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Keerthy <j-keerthy@ti.com>
      Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dd5994ab
    • dma-buf: Discard old fence_excl on retrying get_fences_rcu for realloc · c947cf3e
      Committed by Chris Wilson
      commit f5b07b04e5f090a85d1e96938520f2b2b58e4a8e upstream.
      
      If we have to drop the seqcount & rcu lock to perform a krealloc, we
      have to restart the loop. In doing so, be careful not to lose track of
      the already acquired exclusive fence.
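      
      A hedged sketch of the point above, with names assumed from
      reservation_object_get_fences_rcu() (simplified; not the verbatim diff):
      
      		rcu_read_unlock();
      
      		/* The retry re-acquires the exclusive fence from scratch,
      		 * so drop the reference taken on the previous pass first.
      		 */
      		dma_fence_put(fence_excl);	/* NULL-safe */
      		fence_excl = NULL;
      
      		nshared = krealloc(shared, sz, GFP_KERNEL);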
      
      Fixes: fedf5413 ("dma-buf: Restart reservation_object_get_fences_rcu() after writes")
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: stable@vger.kernel.org #v4.10
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190604125323.21396-1-chris@chris-wilson.co.uk
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c947cf3e
    • dma-buf: balance refcount inbalance · 95ee55ca
      Committed by Jérôme Glisse
      commit 5e383a9798990c69fc759a4930de224bb497e62c upstream.
      
      The debugfs code takes references on fences without dropping them.
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: linux-media@vger.kernel.org
      Cc: dri-devel@lists.freedesktop.org
      Cc: linaro-mm-sig@lists.linaro.org
      Cc: Stéphane Marchesin <marcheu@chromium.org>
      Cc: stable@vger.kernel.org
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
      Link: https://patchwork.freedesktop.org/patch/msgid/20181206161840.6578-1-jglisse@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      95ee55ca
    • net: bridge: stp: don't cache eth dest pointer before skb pull · b72fb8de
      Committed by Nikolay Aleksandrov
      [ Upstream commit 2446a68ae6a8cee6d480e2f5b52f5007c7c41312 ]
      
      Don't cache eth dest pointer before calling pskb_may_pull.
      
      Fixes: cf0f02d0 ("[BRIDGE]: use llc for receiving STP packets")
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b72fb8de
    • net: bridge: don't cache ether dest pointer on input · 78701843
      Committed by Nikolay Aleksandrov
      [ Upstream commit 3d26eb8ad1e9b906433903ce05f775cf038e747f ]
      
      We used to cache the ether dst pointer on input in br_handle_frame_finish,
      but after the neigh suppress code that could lead to a stale pointer, since
      both the ipv4 and ipv6 suppress code do pskb_may_pull.  This means we
      always have to reload it after the suppress code, so there's no point in
      caching it; just retrieve it directly.
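      
      A hedged sketch of the pattern (br_handle_frame_finish() context assumed,
      simplified):
      
      	/* pskb_may_pull() can reallocate skb->head, so (re)read the
      	 * destination MAC from eth_hdr(skb) only after any code path that
      	 * may pull, rather than caching it at function entry.
      	 */
      	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
      		return 0;
      	dest = eth_hdr(skb)->h_dest;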
      
      Fixes: 057658cb ("bridge: suppress arp pkts on BR_NEIGH_SUPPRESS ports")
      Fixes: ed842fae ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports")
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      78701843
    • net: bridge: mcast: fix stale ipv6 hdr pointer when handling v6 query · 41a8df71
      Committed by Nikolay Aleksandrov
      [ Upstream commit 3b26a5d03d35d8f732d75951218983c0f7f68dff ]
      
      We get a pointer to the ipv6 hdr in br_ip6_multicast_query, but we may
      call pskb_may_pull afterwards and end up using a stale pointer.
      So use the header directly; it's needed in just one place.
      
      Fixes: 08b202b6 ("bridge br_multicast: IPv6 MLD support.")
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Tested-by: Martin Weinelt <martin@linuxlounge.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      41a8df71
    • net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling · caf4488f
      Committed by Nikolay Aleksandrov
      [ Upstream commit e57f61858b7cf478ed6fa23ed4b3876b1c9625c4 ]
      
      When handling igmp3 we take a pointer to grec prior to calling
      pskb_may_pull and use it afterwards to get nsrcs, so record nsrcs before
      the pull.  When handling mld2 we take a pointer to nsrcs and then call
      pskb_may_pull, which again could lead to reading 2 bytes out-of-bounds.
      
       ==================================================================
       BUG: KASAN: use-after-free in br_multicast_rcv+0x480c/0x4ad0 [bridge]
       Read of size 2 at addr ffff8880421302b4 by task ksoftirqd/1/16
      
       CPU: 1 PID: 16 Comm: ksoftirqd/1 Tainted: G           OE     5.2.0-rc6+ #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
       Call Trace:
        dump_stack+0x71/0xab
        print_address_description+0x6a/0x280
        ? br_multicast_rcv+0x480c/0x4ad0 [bridge]
        __kasan_report+0x152/0x1aa
        ? br_multicast_rcv+0x480c/0x4ad0 [bridge]
        ? br_multicast_rcv+0x480c/0x4ad0 [bridge]
        kasan_report+0xe/0x20
        br_multicast_rcv+0x480c/0x4ad0 [bridge]
        ? br_multicast_disable_port+0x150/0x150 [bridge]
        ? ktime_get_with_offset+0xb4/0x150
        ? __kasan_kmalloc.constprop.6+0xa6/0xf0
        ? __netif_receive_skb+0x1b0/0x1b0
        ? br_fdb_update+0x10e/0x6e0 [bridge]
        ? br_handle_frame_finish+0x3c6/0x11d0 [bridge]
        br_handle_frame_finish+0x3c6/0x11d0 [bridge]
        ? br_pass_frame_up+0x3a0/0x3a0 [bridge]
        ? virtnet_probe+0x1c80/0x1c80 [virtio_net]
        br_handle_frame+0x731/0xd90 [bridge]
        ? select_idle_sibling+0x25/0x7d0
        ? br_handle_frame_finish+0x11d0/0x11d0 [bridge]
        __netif_receive_skb_core+0xced/0x2d70
        ? virtqueue_get_buf_ctx+0x230/0x1130 [virtio_ring]
        ? do_xdp_generic+0x20/0x20
        ? virtqueue_napi_complete+0x39/0x70 [virtio_net]
        ? virtnet_poll+0x94d/0xc78 [virtio_net]
        ? receive_buf+0x5120/0x5120 [virtio_net]
        ? __netif_receive_skb_one_core+0x97/0x1d0
        __netif_receive_skb_one_core+0x97/0x1d0
        ? __netif_receive_skb_core+0x2d70/0x2d70
        ? _raw_write_trylock+0x100/0x100
        ? __queue_work+0x41e/0xbe0
        process_backlog+0x19c/0x650
        ? _raw_read_lock_irq+0x40/0x40
        net_rx_action+0x71e/0xbc0
        ? __switch_to_asm+0x40/0x70
        ? napi_complete_done+0x360/0x360
        ? __switch_to_asm+0x34/0x70
        ? __switch_to_asm+0x40/0x70
        ? __schedule+0x85e/0x14d0
        __do_softirq+0x1db/0x5f9
        ? takeover_tasklets+0x5f0/0x5f0
        run_ksoftirqd+0x26/0x40
        smpboot_thread_fn+0x443/0x680
        ? sort_range+0x20/0x20
        ? schedule+0x94/0x210
        ? __kthread_parkme+0x78/0xf0
        ? sort_range+0x20/0x20
        kthread+0x2ae/0x3a0
        ? kthread_create_worker_on_cpu+0xc0/0xc0
        ret_from_fork+0x35/0x40
      
       The buggy address belongs to the page:
       page:ffffea0001084c00 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0
       flags: 0xffffc000000000()
       raw: 00ffffc000000000 ffffea0000cfca08 ffffea0001098608 0000000000000000
       raw: 0000000000000000 0000000000000003 00000000ffffff7f 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
       ffff888042130180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff888042130200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       > ffff888042130280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                           ^
       ffff888042130300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff888042130380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ==================================================================
       Disabling lock debugging due to kernel taint
      
      Fixes: bc8c20ac ("bridge: multicast: treat igmpv3 report with INCLUDE and no sources as a leave")
      Reported-by: Martin Weinelt <martin@linuxlounge.net>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Tested-by: Martin Weinelt <martin@linuxlounge.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      caf4488f
    • sctp: not bind the socket in sctp_connect · bc9a2f36
      Committed by Xin Long
      [ Upstream commit 9b6c08878e23adb7cc84bdca94d8a944b03f099e ]
      
      Now when sctp_connect() is called with a wrong sa_family, it binds
      to a port but doesn't set bp->port, then sctp_get_af_specific will
      return NULL and sctp_connect() returns -EINVAL.
      
      Then if sctp_bind() is called to bind to another port, the last port it
      has bound will leak, because bp->port is NULL by then.
      
      sctp_connect() doesn't need to bind ports, as later __sctp_connect
      will do it if bp->port is NULL. So remove it from sctp_connect().
      While at it, remove the unnecessary sockaddr.sa_family len check
      as it's already done in sctp_inet_connect.
      
      Fixes: 644fbdea ("sctp: fix the issue that flags are ignored when using kernel_connect")
      Reported-by: syzbot+079bf326b38072f849d9@syzkaller.appspotmail.com
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc9a2f36
    • net/tls: make sure offload also gets the keys wiped · fde351ae
      Committed by Jakub Kicinski
      [ Upstream commit acd3e96d53a24d219f720ed4012b62723ae05da1 ]
      
      Commit 86029d10 ("tls: zero the crypto information from tls_context
      before freeing") added memzero_explicit() calls to clear the key material
      before freeing struct tls_context, but it missed that tls_device.c has its
      own way of freeing this structure. Replace the missing free.
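      
      A hedged sketch of the fix (net/tls/tls_device.c style; the surrounding
      teardown is elided and this is not the verbatim diff):
      
      static void tls_device_free_ctx(struct tls_context *ctx)
      {
      	/* ... free the offload-specific TX/RX state first ... */
      
      	/* Go through tls_ctx_free() so the crypto_send/crypto_recv key
      	 * material is memzero_explicit()ed, instead of a bare kfree(ctx).
      	 */
      	tls_ctx_free(ctx);
      }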
      
      Fixes: 86029d10 ("tls: zero the crypto information from tls_context before freeing")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fde351ae