1. 26 Jan, 2019 40 commits
    • iwlwifi: mvm: Send LQ command as async when necessary · d9bcbcb7
      Avraham Stern committed
      commit 3baf7528d6f832b28622d1ddadd2e47f6c2b5e08 upstream.
      
      The parameter that indicated whether the LQ command should be sent
      as sync or async was removed, causing the LQ command to be sent as
      sync from interrupt context (e.g. from the RX path). This resulted
      in a kernel warning: "scheduling while atomic" and failing to send
      the LQ command, which ultimately leads to a queue hang.
      
      Fix it by adding back the required parameter to send the command as
      sync only when it is allowed.
      
      Fixes: d94c5a82 ("iwlwifi: mvm: open BA session only when sta is authorized")
      Signed-off-by: Avraham Stern <avraham.stern@intel.com>
      Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps · 0d73e773
      Michal Hocko committed
      [ Upstream commit 7550c6079846a24f30d15ac75a941c8515dbedfb ]
      
      Patch series "THP eligibility reporting via proc".
      
      This series of three patches aims at making THP eligibility reporting much
      more robust and long term sustainable.  The trigger for the change is a
      regression report [2] and the long follow up discussion.  In short the
      specific application didn't have a good API to query whether a particular
      mapping can be backed by THP so it has used VMA flags to work around that.
      These flags represent a deep internal state of VMAs and as such they
      should be used by userspace with a great deal of caution.
      
      A similar thing happened for [3] when users complained that VM_MIXEDMAP is
      no longer set on DAX mappings.  Again a lack of a proper API led to an
      abuse.
      
      The first patch in the series tries to emphasise that the semantics of
      flags might change and any application consuming those should be really
      careful.
      
      The remaining two patches provide a more suitable interface to address [2]
      and provide a consistent API to query the THP status both for each VMA and
      process wide as well.
      
      [1] http://lkml.kernel.org/r/20181120103515.25280-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
      [3] http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
      
      This patch (of 3):
      
      Even though vma flags exported via /proc/<pid>/smaps are explicitly
      documented to be not guaranteed for future compatibility the warning
      doesn't go far enough because it doesn't mention semantic changes to those
      flags.  And they are important as well because these flags are a deep
      implementation internal to the MM code and the semantic might change at
      any time.
      
      Let's consider two recent examples:
      http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
      : commit e1fb4a08 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
      : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
      : mean time certain customer of ours started poking into /proc/<pid>/smaps
      : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
      : flags, the application just fails to start complaining that DAX support is
      : missing in the kernel.
      
      http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
      : Commit 18600332 ("mm: make PR_SET_THP_DISABLE immediately active")
      : introduced a regression in that userspace cannot always determine the set
      : of vmas where thp is ineligible.
      : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
      : to determine if a vma is eligible to be backed by hugepages.
      : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
      : be disabled and emit "nh" as a flag for the corresponding vmas as part of
      : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
      : flag and "nh" is not emitted.
      : This causes smaps parsing libraries to assume a vma is eligible for thp
      : and ends up puzzling the user on why its memory is not backed by thp.
      
      In both cases userspace was relying on a semantic of a specific VMA flag.
      The primary reason why that happened is a lack of a proper interface.
      While this has been worked on and it will be fixed properly, it seems that
      our wording could see some refinement and be more vocal about semantic
      aspect of these flags as well.
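The fragile pattern described in the two examples can be sketched in user-space C. This is only an illustration of how a consumer might test for a flag such as "nh" in a "VmFlags:" line; the helper name `vmflags_has` is hypothetical and is not part of any kernel or library API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the two-letter flag (for example "nh") appears as a whole
 * token in a "VmFlags:" line taken from /proc/<pid>/smaps.
 */
bool vmflags_has(const char *line, const char *flag)
{
    static const char prefix[] = "VmFlags:";
    size_t flen = strlen(flag);
    const char *p;

    assert(flen == 2);
    if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
        return false;

    p = line + sizeof(prefix) - 1;
    while (*p != '\0') {
        size_t tok;

        /* skip separators between tokens */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        if (*p == '\0')
            break;
        tok = strcspn(p, " \t\n");
        if (tok == flen && strncmp(p, flag, flen) == 0)
            return true;
        p += tok;
    }
    return false;
}
```

As the second quoted report shows, a missing "nh" token does not prove that a VMA is eligible for THP, so a result of false from a check like this should not be read as a guarantee.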
      
      Link: http://lkml.kernel.org/r/20181211143641.3503-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • userfaultfd: clear flag if remap event not enabled · 2011eb74
      Peter Xu committed
      [ Upstream commit 3cfd22be0ad663248fadfc8f6ffa3e255c394552 ]
      
      When the process being tracked does mremap() without
      UFFD_FEATURE_EVENT_REMAP on the corresponding tracking uffd file handle,
      we should not generate the remap event, and at the same time we should
      clear all the uffd flags on the new VMA.  Without this patch, we can still
      have the VM_UFFD_MISSING|VM_UFFD_WP flags on the new VMA even though the
      fault handling process does not even know of the existence of the VMA.
      
      Link: http://lkml.kernel.org/r/20181211053409.20317-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Pravin Shedge <pravin.shedge4linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/swap: use nr_node_ids for avail_lists in swap_info_struct · b0cd52e6
      Aaron Lu committed
      [ Upstream commit 66f71da9dd38af17dc17209cdde7987d4679a699 ]
      
      Since a2468cc9 ("swap: choose swap device according to numa node"),
      the avail_lists field of swap_info_struct has been an array with
      MAX_NUMNODES elements.  This increased the size of swap_info_struct to
      40KiB, so an order-4 page is needed to hold it.
      
      This is not optimal in that:
      1 Most systems have far fewer than MAX_NUMNODES (1024) nodes, so it
        is a waste of memory;
      2 It could cause swapon to fail if the swap device is swapped on
        after the system has been running for a while, because no order-4
        page is available, as pointed out by Vasily Averin.
      
      Solve the above two issues by using nr_node_ids (the actual number of
      possible nodes on the running system) for avail_lists instead of
      MAX_NUMNODES.
      
      nr_node_ids is unknown at compile time, so it can't be used directly
      when declaring this array.  What I did here is to declare avail_lists
      as a zero-element array and allocate space for it when allocating space
      for swap_info_struct.  The reason for keeping an array rather than a
      pointer is that plist_for_each_entry needs the field to be part of the
      struct, so a pointer will not work.
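The allocation pattern described above can be sketched in user-space C. This is a minimal sketch under stated assumptions, not the kernel change itself: it uses a C99 flexible array member in place of the zero-element array, `calloc` in place of the kernel allocator, and the names `swap_info_sketch`, `node_list`, `swap_info_size` and `swap_info_alloc` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's per-node list head. */
struct node_list {
    struct node_list *prev, *next;
};

/* Simplified stand-in for swap_info_struct: the array must stay last. */
struct swap_info_sketch {
    unsigned long flags;
    int prio;
    struct node_list avail_lists[];  /* sized at allocation time */
};

/* Bytes needed for one structure with nr_nodes trailing list heads. */
size_t swap_info_size(size_t nr_nodes)
{
    return sizeof(struct swap_info_sketch) +
           nr_nodes * sizeof(struct node_list);
}

/* Allocate a zeroed structure for a system with nr_nodes possible nodes. */
struct swap_info_sketch *swap_info_alloc(size_t nr_nodes)
{
    assert(nr_nodes > 0);
    return calloc(1, swap_info_size(nr_nodes));
}
```

Because the array is embedded at the end of the structure, its elements remain part of the struct, which is the property the commit message says a separate pointer would not give.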
      
      This patch is on top of Vasily Averin's fix commit.  I think the use of
      kvzalloc for swap_info_struct is still needed in case nr_node_ids is
      really big on some systems.
      
      Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/page-writeback.c: don't break integrity writeback on ->writepage() error · dc15e3fd
      Brian Foster committed
      [ Upstream commit 3fa750dcf29e8606e3969d13d8e188cc1c0f511d ]
      
      write_cache_pages() is used in both background and integrity writeback
      scenarios by various filesystems.  Background writeback is mostly
      concerned with cleaning a certain number of dirty pages based on various
      mm heuristics.  It may not write the full set of dirty pages or wait for
      I/O to complete.  Integrity writeback is responsible for persisting a set
      of dirty pages before the writeback job completes.  For example, an
      fsync() call must perform integrity writeback to ensure data is on disk
      before the call returns.
      
      write_cache_pages() unconditionally breaks out of its processing loop in
      the event of a ->writepage() error.  This is fine for background
      writeback, which has no strict requirements and will eventually come
      around again.  This can cause problems for integrity writeback on
      filesystems that might need to clean up state associated with failed page
      writeouts.  For example, XFS performs internal delayed allocation
      accounting before returning a ->writepage() error, where applicable.  If
      the current writeback happens to be associated with an unmount and
      write_cache_pages() completes the writeback prematurely due to error, the
      filesystem is unmounted in an inconsistent state if dirty+delalloc pages
      still exist.
      
      To handle this problem, update write_cache_pages() to always process the
      full set of pages for integrity writeback regardless of ->writepage()
      errors.  Save the first encountered error and return it to the caller once
      complete.  This facilitates XFS (or any other fs that expects integrity
      writeback to process the entire set of dirty pages) to clean up its
      internal state completely in the event of persistent mapping errors.
      Background writeback continues to exit on the first error encountered.
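The control flow described above can be sketched in plain C. This is an illustrative loop over an array of page numbers, not the kernel's write_cache_pages(); the names `write_pages_sketch`, `writepage_fn` and `fail_odd_pages` are hypothetical, and the error values are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page callback: returns 0 on success, negative on error. */
typedef int (*writepage_fn)(int page, void *data);

/*
 * Walk pages[0..n). In integrity mode keep going after an error and report
 * the first one at the end; in background mode stop at the first error.
 * *visited receives the number of pages handed to the callback.
 */
int write_pages_sketch(const int *pages, size_t n, bool integrity,
                       writepage_fn writepage, void *data, size_t *visited)
{
    int ret = 0;
    size_t i;

    assert(writepage != NULL && visited != NULL);
    *visited = 0;
    for (i = 0; i < n; i++) {
        int err = writepage(pages[i], data);

        (*visited)++;
        if (err == 0)
            continue;
        if (ret == 0)
            ret = err;      /* remember only the first error */
        if (!integrity)
            break;          /* background writeback gives up early */
    }
    return ret;
}

/* Example callback: every odd-numbered page fails with minus its number. */
int fail_odd_pages(int page, void *data)
{
    (void)data;
    return (page % 2) ? -page : 0;
}
```

With pages {2, 3, 4, 5} and the example callback, the integrity walk visits all four pages and still reports the error from page 3, while the background walk stops after two pages.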
      
      [akpm@linux-foundation.org: fix typo in comment]
      Link: http://lkml.kernel.org/r/20181116134304.32440-1-bfoster@redhat.com
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • ocfs2: fix panic due to unrecovered local alloc · 5a404f39
      Junxiao Bi committed
      [ Upstream commit 532e1e54c8140188e192348c790317921cb2dc1c ]
      
      mount.ocfs2 ignores the inconsistency in which the journal is clean but
      the local alloc is unrecovered.  After mount, the local alloc is not
      empty, so reserving clusters does not allocate a new local alloc window
      and the reservation map stays empty
      (ocfs2_reservation_map.m_bitmap_len = 0), which triggers the following
      panic.
      
      This issue was reported at
      
        https://oss.oracle.com/pipermail/ocfs2-devel/2015-May/010854.html
      
      and the advice was to fix it during mount.  But this is a very unusual
      inconsistent state; usually the journal dirty flag should be cleared at
      the last stage of umount, after everything else has gone right.  We may
      need to do further debugging to check that.  In any case, to avoid
      possible further corruption, mount should be aborted and fsck should be
      run.
      
        (mount.ocfs2,1765,1):ocfs2_load_local_alloc:353 ERROR: Local alloc hasn't been recovered!
        found = 6518, set = 6518, taken = 8192, off = 15912372
        ocfs2: Mounting device (202,64) on (node 0, slot 3) with ordered data mode.
        o2dlm: Joining domain 89CEAC63CC4F4D03AC185B44E0EE0F3F ( 0 1 2 3 4 5 6 8 ) 8 nodes
        ocfs2: Mounting device (202,80) on (node 0, slot 3) with ordered data mode.
        o2hb: Region 89CEAC63CC4F4D03AC185B44E0EE0F3F (xvdf) is now a quorum device
        o2net: Accepted connection from node yvwsoa17p (num 7) at 172.22.77.88:7777
        o2dlm: Node 7 joins domain 64FE421C8C984E6D96ED12C55FEE2435 ( 0 1 2 3 4 5 6 7 8 ) 9 nodes
        o2dlm: Node 7 joins domain 89CEAC63CC4F4D03AC185B44E0EE0F3F ( 0 1 2 3 4 5 6 7 8 ) 9 nodes
        ------------[ cut here ]------------
        kernel BUG at fs/ocfs2/reservations.c:507!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: ocfs2 rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs fscache lockd grace ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ovmapi ppdev parport_pc parport xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea acpi_cpufreq pcspkr i2c_piix4 i2c_core sg ext4 jbd2 mbcache2 sr_mod cdrom xen_blkfront pata_acpi ata_generic ata_piix floppy dm_mirror dm_region_hash dm_log dm_mod
        CPU: 0 PID: 4349 Comm: startWebLogic.s Not tainted 4.1.12-124.19.2.el6uek.x86_64 #2
        Hardware name: Xen HVM domU, BIOS 4.4.4OVM 09/06/2018
        task: ffff8803fb04e200 ti: ffff8800ea4d8000 task.ti: ffff8800ea4d8000
        RIP: 0010:[<ffffffffa05e96a8>]  [<ffffffffa05e96a8>] __ocfs2_resv_find_window+0x498/0x760 [ocfs2]
        Call Trace:
          ocfs2_resmap_resv_bits+0x10d/0x400 [ocfs2]
          ocfs2_claim_local_alloc_bits+0xd0/0x640 [ocfs2]
          __ocfs2_claim_clusters+0x178/0x360 [ocfs2]
          ocfs2_claim_clusters+0x1f/0x30 [ocfs2]
          ocfs2_convert_inline_data_to_extents+0x634/0xa60 [ocfs2]
          ocfs2_write_begin_nolock+0x1c6/0x1da0 [ocfs2]
          ocfs2_write_begin+0x13e/0x230 [ocfs2]
          generic_perform_write+0xbf/0x1c0
          __generic_file_write_iter+0x19c/0x1d0
          ocfs2_file_write_iter+0x589/0x1360 [ocfs2]
          __vfs_write+0xb8/0x110
          vfs_write+0xa9/0x1b0
          SyS_write+0x46/0xb0
          system_call_fastpath+0x18/0xd7
        Code: ff ff 8b 75 b8 39 75 b0 8b 45 c8 89 45 98 0f 84 e5 fe ff ff 45 8b 74 24 18 41 8b 54 24 1c e9 56 fc ff ff 85 c0 0f 85 48 ff ff ff <0f> 0b 48 8b 05 cf c3 de ff 48 ba 00 00 00 00 00 00 00 10 48 85
        RIP   __ocfs2_resv_find_window+0x498/0x760 [ocfs2]
         RSP <ffff8800ea4db668>
        ---[ end trace 566f07529f2edf3c ]---
        Kernel panic - not syncing: Fatal exception
        Kernel Offset: disabled
      
      Link: http://lkml.kernel.org/r/20181121020023.3034-2-junxiao.bi@oracle.com
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Acked-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • iomap: don't search past page end in iomap_is_partially_uptodate · c9dcb871
      Eric Sandeen committed
      [ Upstream commit 3cc31fa65d85610574c0f6a474e89f4c419923d5 ]
      
      iomap_is_partially_uptodate() is intended to check whether blocks within
      the selected range of a not-uptodate page are uptodate; if the range we
      care about is up to date, it's an optimization.
      
      However, the iomap implementation continues to check all blocks up to
      from+count, which is beyond the page, and can even be well beyond the
      iop->uptodate bitmap.
      
      I think the worst that will happen is that we may eventually find a zero
      bit and return "not partially uptodate" when it would have otherwise
      returned true, and skip the optimization.  Still, it's clearly an invalid
      memory access that must be fixed.
      
      So: fix this by limiting the search to within the page as is done in the
      non-iomap variant, block_is_partially_uptodate().
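The clamping step can be sketched in plain C. This is an illustration of limiting the block search to the page, written with explicit sizes instead of kernel types; the function name `clamp_block_range` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Clamp a byte range [from, from + count) to a page of page_size bytes and
 * return the index of the last block to inspect. block_size must divide
 * page_size. first_out receives the index of the first block.
 */
size_t clamp_block_range(size_t from, size_t count, size_t page_size,
                         size_t block_size, size_t *first_out)
{
    size_t len;

    assert(block_size > 0 && page_size % block_size == 0);
    assert(from < page_size && count > 0);

    /* never look past the end of the page */
    len = count < page_size - from ? count : page_size - from;

    *first_out = from / block_size;
    return (from + len - 1) / block_size;
}
```

For a 64k page with 512-byte blocks, a request that starts 512 bytes before the end of the page and asks for 4096 bytes is reduced to the single last block of the page rather than running on to blocks that do not exist.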
      
      Zorro noticed this when KASAN went off for 512 byte blocks on a 64k
      page system:
      
       BUG: KASAN: slab-out-of-bounds in iomap_is_partially_uptodate+0x1a0/0x1e0
       Read of size 8 at addr ffff800120c3a318 by task fsstress/22337
      Reported-by: Zorro Lang <zlang@redhat.com>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: megaraid: fix out-of-bound array accesses · 00886ceb
      Qian Cai committed
      [ Upstream commit c7a082e4242fd8cd21a441071e622f87c16bdacc ]
      
      UBSAN reported the following with MegaRAID SAS-3 3108:
      
      [   77.467308] UBSAN: Undefined behaviour in drivers/scsi/megaraid/megaraid_sas_fp.c:117:32
      [   77.475402] index 255 is out of range for type 'MR_LD_SPAN_MAP [1]'
      [   77.481677] CPU: 16 PID: 333 Comm: kworker/16:1 Not tainted 4.20.0-rc5+ #1
      [   77.488556] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.50 06/01/2018
      [   77.495791] Workqueue: events work_for_cpu_fn
      [   77.500154] Call trace:
      [   77.502610]  dump_backtrace+0x0/0x2c8
      [   77.506279]  show_stack+0x24/0x30
      [   77.509604]  dump_stack+0x118/0x19c
      [   77.513098]  ubsan_epilogue+0x14/0x60
      [   77.516765]  __ubsan_handle_out_of_bounds+0xfc/0x13c
      [   77.521767]  mr_update_load_balance_params+0x150/0x158 [megaraid_sas]
      [   77.528230]  MR_ValidateMapInfo+0x2cc/0x10d0 [megaraid_sas]
      [   77.533825]  megasas_get_map_info+0x244/0x2f0 [megaraid_sas]
      [   77.539505]  megasas_init_adapter_fusion+0x9b0/0xf48 [megaraid_sas]
      [   77.545794]  megasas_init_fw+0x1ab4/0x3518 [megaraid_sas]
      [   77.551212]  megasas_probe_one+0x2c4/0xbe0 [megaraid_sas]
      [   77.556614]  local_pci_probe+0x7c/0xf0
      [   77.560365]  work_for_cpu_fn+0x34/0x50
      [   77.564118]  process_one_work+0x61c/0xf08
      [   77.568129]  worker_thread+0x534/0xa70
      [   77.571882]  kthread+0x1c8/0x1d0
      [   77.575114]  ret_from_fork+0x10/0x1c
      
      [   89.240332] UBSAN: Undefined behaviour in drivers/scsi/megaraid/megaraid_sas_fp.c:117:32
      [   89.248426] index 255 is out of range for type 'MR_LD_SPAN_MAP [1]'
      [   89.254700] CPU: 16 PID: 95 Comm: kworker/u130:0 Not tainted 4.20.0-rc5+ #1
      [   89.261665] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.50 06/01/2018
      [   89.268903] Workqueue: events_unbound async_run_entry_fn
      [   89.274222] Call trace:
      [   89.276680]  dump_backtrace+0x0/0x2c8
      [   89.280348]  show_stack+0x24/0x30
      [   89.283671]  dump_stack+0x118/0x19c
      [   89.287167]  ubsan_epilogue+0x14/0x60
      [   89.290835]  __ubsan_handle_out_of_bounds+0xfc/0x13c
      [   89.295828]  MR_LdRaidGet+0x50/0x58 [megaraid_sas]
      [   89.300638]  megasas_build_io_fusion+0xbb8/0xd90 [megaraid_sas]
      [   89.306576]  megasas_build_and_issue_cmd_fusion+0x138/0x460 [megaraid_sas]
      [   89.313468]  megasas_queue_command+0x398/0x3d0 [megaraid_sas]
      [   89.319222]  scsi_dispatch_cmd+0x1dc/0x8a8
      [   89.323321]  scsi_request_fn+0x8e8/0xdd0
      [   89.327249]  __blk_run_queue+0xc4/0x158
      [   89.331090]  blk_execute_rq_nowait+0xf4/0x158
      [   89.335449]  blk_execute_rq+0xdc/0x158
      [   89.339202]  __scsi_execute+0x130/0x258
      [   89.343041]  scsi_probe_and_add_lun+0x2fc/0x1488
      [   89.347661]  __scsi_scan_target+0x1cc/0x8c8
      [   89.351848]  scsi_scan_channel.part.3+0x8c/0xc0
      [   89.356382]  scsi_scan_host_selected+0x130/0x1f0
      [   89.361002]  do_scsi_scan_host+0xd8/0xf0
      [   89.364927]  do_scan_async+0x9c/0x320
      [   89.368594]  async_run_entry_fn+0x138/0x420
      [   89.372780]  process_one_work+0x61c/0xf08
      [   89.376793]  worker_thread+0x13c/0xa70
      [   89.380546]  kthread+0x1c8/0x1d0
      [   89.383778]  ret_from_fork+0x10/0x1c
      
      This is because, when populating the Driver Map from the firmware raid map,
      all non-existing VDs have their ldTgtIdToLd set to 0xff, so they can be
      skipped later.
      
      From drivers/scsi/megaraid/megaraid_sas_base.c ,
      memset(instance->ld_ids, 0xff, MEGASAS_MAX_LD_IDS);
      
      From drivers/scsi/megaraid/megaraid_sas_fp.c ,
      /* For non existing VDs, iterate to next VD*/
      if (ld >= (MAX_LOGICAL_DRIVES_EXT - 1))
      	continue;
      
      However, there are a few places that failed to skip those non-existing VDs
      due to off-by-one errors. Then, those 0xff leaked into MR_LdRaidGet(0xff,
      map) and triggered the out-of-bound accesses.
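The off-by-one can be illustrated with a small user-space sketch. The constants and helper names below (`SKETCH_MAX_LD`, `SKETCH_LD_INVALID`, `ld_is_valid`, `count_valid_lds`) are hypothetical stand-ins chosen so that the invalid marker 0xff is exactly one below the table size, mirroring the check quoted above; this is not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_LD      256   /* hypothetical table size */
#define SKETCH_LD_INVALID  0xff  /* marker for a non-existing VD */

/* True if ld names a real logical drive and is safe to use as an index. */
bool ld_is_valid(unsigned char ld)
{
    /* comparing against SKETCH_MAX_LD alone would let 0xff through */
    return ld < SKETCH_MAX_LD - 1;
}

/* Count the valid entries in a target-id-to-LD table of n entries. */
size_t count_valid_lds(const unsigned char *tgt_to_ld, size_t n)
{
    size_t i, valid = 0;

    assert(tgt_to_ld != NULL);
    for (i = 0; i < n; i++) {
        if (!ld_is_valid(tgt_to_ld[i]))
            continue;   /* skip non-existing VDs */
        valid++;
    }
    return valid;
}
```

A table that is first filled with the marker and then given two real entries yields a count of two; with a bound of 256 instead of 255, every marker entry would have been treated as a usable index.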
      
      Fixes: 51087a86 ("megaraid_sas : Extended VD support")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Sumit Saxena <sumit.saxena@broadcom.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: smartpqi: call pqi_free_interrupts() in pqi_shutdown() · d640fb10
      Yanjiang Jin committed
      [ Upstream commit e57b2945aa654e48f85a41e8917793c64ecb9de8 ]
      
      We must free all irqs during shutdown, else kexec's 2nd kernel would hang
      in pqi_wait_for_completion_io() as below:
      
      Call trace:
      
       pqi_wait_for_completion_io
       pqi_submit_raid_request_synchronous.constprop.78+0x23c/0x310 [smartpqi]
       pqi_configure_events+0xec/0x1f8 [smartpqi]
       pqi_ctrl_init+0x814/0xca0 [smartpqi]
       pqi_pci_probe+0x400/0x46c [smartpqi]
       local_pci_probe+0x48/0xb0
       pci_device_probe+0x14c/0x1b0
       really_probe+0x218/0x3fc
       driver_probe_device+0x70/0x140
       __driver_attach+0x11c/0x134
       bus_for_each_dev+0x70/0xc8
       driver_attach+0x30/0x38
       bus_add_driver+0x1f0/0x294
       driver_register+0x74/0x12c
       __pci_register_driver+0x64/0x70
       pqi_init+0xd0/0x10000 [smartpqi]
       do_one_initcall+0x60/0x1d8
       do_init_module+0x64/0x1f8
       load_module+0x10ec/0x1350
       __se_sys_finit_module+0xd4/0x100
       __arm64_sys_finit_module+0x28/0x34
       el0_svc_handler+0x104/0x160
       el0_svc+0x8/0xc
      
      This happens only in the following combinations:
      
      1. smartpqi is built as module, not built-in;
      2. We have a disk connected to smartpqi card;
      3. Both kexec's 1st and 2nd kernels use this disk as Rootfs' mount point.
      Signed-off-by: Yanjiang Jin <yanjiang.jin@hxt-semitech.com>
      Acked-by: Don Brace <don.brace@microsemi.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • ath10k: fix peer stats null pointer dereference · dd619b90
      Zhi Chen committed
      [ Upstream commit 2d3b55853b123c177037cf534c5aaa2650310094 ]
      
      There was a race condition on SMP systems in which an ath10k_peer was
      created but its member sta was still null. The following are the procedures
      of ath10k_peer creation and member sta access in the peer statistics path.
      
          1. Peer creation:
              ath10k_peer_create()
                  =>ath10k_wmi_peer_create()
                      =>ath10k_wait_for_peer_created()
                      ...
      
              # another kernel path, RX from firmware
              ath10k_htt_t2h_msg_handler()
              =>ath10k_peer_map_event()
                      =>wake_up()
                      # ar->peer_map[id] = peer //add peer to map
      
              #wake up original path from waiting
                      ...
                      # peer->sta = sta //sta assignment
      
          2.  RX path of statistics
              ath10k_htt_t2h_msg_handler()
                  =>ath10k_update_per_peer_tx_stats()
                      =>ath10k_htt_fetch_peer_stats()
                      # peer->sta //sta accessing
      
      Any access of peer->sta after peer was added to peer_map but before sta was
      assigned could cause a null pointer issue. And because these two steps are
      asynchronous, no proper lock can protect them. So both peer and sta need to
      be checked before access.
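The defensive check described above can be sketched in plain C. This single-threaded sketch only shows the two null checks; it does not model the race or any kernel locking, and the names `peer_sketch`, `sta_sketch`, `SKETCH_MAX_PEERS` and `update_peer_stats` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sta_sketch  { unsigned long tx_packets; };
struct peer_sketch { struct sta_sketch *sta; };

#define SKETCH_MAX_PEERS 4

/*
 * Update statistics for peer_id. Both the map entry and its sta pointer may
 * still be NULL while the peer is being set up, so check each before use.
 * Returns 0 on success and -1 if the update was skipped.
 */
int update_peer_stats(struct peer_sketch *const map[], size_t peer_id,
                      unsigned long pkts)
{
    struct peer_sketch *peer;

    assert(map != NULL);
    if (peer_id >= SKETCH_MAX_PEERS)
        return -1;
    peer = map[peer_id];
    if (peer == NULL || peer->sta == NULL)
        return -1;

    peer->sta->tx_packets += pkts;
    return 0;
}
```

A peer that is already in the map but has no sta yet is skipped instead of dereferenced, which corresponds to the window between the map insertion and the sta assignment in the sequence above.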
      
      Tested: QCA9984 with firmware ver 10.4-3.9.0.1-00005
      Signed-off-by: Zhi Chen <zhichen@codeaurora.org>
      Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: smartpqi: correct lun reset issues · ca8ad9bc
      Kevin Barnett committed
      [ Upstream commit 2ba55c9851d74eb015a554ef69ddf2ef061d5780 ]
      
      Problem:
      The Linux kernel takes a logical volume offline after a LUN reset.  This is
      generally accompanied by this message in the dmesg output:
      
      Device offlined - not ready after error recovery
      
      Root Cause:
      The root cause is a "quirk" in the timeout handling in the Linux SCSI
      layer. The Linux kernel places a 30-second timeout on most media access
      commands (reads and writes) that it sends to device drivers.  When a media
      access command times out, the Linux kernel goes into error recovery mode
      for the LUN that was the target of the command that timed out. Every
      command that timed out is kept on a list inside of the Linux kernel to be
      retried later. The kernel attempts to recover the command(s) that timed out
      by issuing a LUN reset followed by a TEST UNIT READY. If the LUN reset and
      TEST UNIT READY commands are successful, the kernel retries the command(s)
      that timed out.
      
      Each SCSI command issued by the kernel has a result field associated with
      it. This field indicates the final result of the command (success or
      error). When a command times out, the kernel places a value in this result
      field indicating that the command timed out.
      
      The "quirk" is that after the LUN reset and TEST UNIT READY commands are
      completed, the kernel checks each command on the timed-out command list
      before retrying it. If the result field is still "timed out", the kernel
      treats that command as not having been successfully recovered for a
      retry. If the number of commands that are in this state is greater than
      two, the kernel takes the LUN offline.
      
      Fix:
      When our RAIDStack receives a LUN reset, it simply waits until all
      outstanding commands complete. Generally, all of these outstanding commands
      complete successfully. Therefore, the fix in the smartpqi driver is to
      always set the command result field to indicate success when a request
      completes successfully. This normally isn’t necessary because the result
      field is always initialized to success when the command is submitted to the
      driver. So when the command completes successfully, the result field is
      left untouched. But in this case, the kernel changes the result field
      behind the driver’s back and then expects the field to be changed by the
      driver as the commands that timed-out complete.
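The interaction described above can be sketched in plain C. The types and names here (`cmd_sketch`, `cmd_result`, `midlayer_mark_timed_out`, `driver_complete`) are hypothetical and only model the result field; they are not the SCSI mid layer or smartpqi interfaces.

```c
#include <assert.h>
#include <stddef.h>

enum cmd_result { RESULT_OK = 0, RESULT_TIME_OUT = 1, RESULT_ERROR = 2 };

struct cmd_sketch { enum cmd_result result; };

/* What the upper layer does when its timer fires, per the description. */
void midlayer_mark_timed_out(struct cmd_sketch *cmd)
{
    assert(cmd != NULL);
    cmd->result = RESULT_TIME_OUT;
}

/*
 * Driver completion handler. Writing RESULT_OK explicitly on success
 * overwrites a stale "timed out" value left by the upper layer.
 */
void driver_complete(struct cmd_sketch *cmd, int hw_status_ok)
{
    assert(cmd != NULL);
    cmd->result = hw_status_ok ? RESULT_OK : RESULT_ERROR;
}
```

If the completion handler left the field untouched on success, a command that was marked as timed out and later completed normally would still read as timed out when the upper layer inspects it.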
      Reviewed-by: Dave Carroll <david.carroll@microsemi.com>
      Reviewed-by: Scott Teel <scott.teel@microsemi.com>
      Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
      Signed-off-by: Don Brace <don.brace@microsemi.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: mpt3sas: fix memory ordering on 64bit writes · 868152e4
      Stephan Günther committed
      [ Upstream commit 23c3828aa2f84edec7020c7397a22931e7a879e1 ]
      
      With commit 09c2f95a ("scsi: mpt3sas: Swap I/O memory read value back
      to cpu endianness"), 64bit writes in _base_writeq() were rewritten to use
      __raw_writeq() instead of writeq().
      
      This introduced a bug apparent on powerpc64 systems such as the Raptor
      Talos II that causes the HBA to drop from the PCIe bus under heavy load and
      to be reinitialized after a couple of seconds.
      
      It can easily be triggered on affected systems by using something like
      
        fio --name=random-write --iodepth=4 --rw=randwrite --bs=4k --direct=0 \
          --size=128M --numjobs=64 --end_fsync=1
        fio --name=random-write --iodepth=4 --rw=randwrite --bs=64k --direct=0 \
          --size=128M --numjobs=64 --end_fsync=1
      
      a couple of times. In my case I tested it on both a ZFS raidz2 and a btrfs
      raid6 using LSI 9300-8i and 9400-8i controllers.
      
      The fix consists in mimicking the write ordering of writeq() by adding a
      mandatory write memory barrier before device access and a compiler barrier
      afterwards. The additional MMIO barrier is superfluous.
      Signed-off-by: Stephan Günther <moepi@moepi.net>
      Reported-by: Matt Corallo <linux@bluematt.me>
      Acked-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • IB/usnic: Fix potential deadlock · 6fa75685
      Parvi Kaustubhi committed
      [ Upstream commit 8036e90f92aae2784b855a0007ae2d8154d28b3c ]
      
      Acquiring the rtnl lock while holding usdev_lock could result in a
      deadlock.
      
      For example:
      
      usnic_ib_query_port()
      | mutex_lock(&us_ibdev->usdev_lock)
       | ib_get_eth_speed()
        | rtnl_lock()
      
      rtnl_lock()
      | usnic_ib_netdevice_event()
       | mutex_lock(&us_ibdev->usdev_lock)
      
      This commit moves the usdev_lock acquisition after the rtnl lock has been
      released.
      
      This is safe to do because usdev_lock is not protecting anything being
      accessed in ib_get_eth_speed(). Hence, the correct order of holding locks
      (rtnl -> usdev_lock) is not violated.
      Signed-off-by: Parvi Kaustubhi <pkaustub@cisco.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sysfs: Disable lockdep for driver bind/unbind files · a13daf03
      Daniel Vetter committed
      [ Upstream commit 4f4b374332ec0ae9c738ff8ec9bed5cd97ff9adc ]
      
      This is the much more correct fix for my earlier attempt at:
      
      https://lkml.org/lkml/2018/12/10/118
      
      Short recap:
      
      - There's not actually a locking issue, it's just lockdep being a bit
        too eager to complain about a possible deadlock.
      
      - Contrary to what I claimed the real problem is recursion on
        kn->count. Greg pointed me at sysfs_break_active_protection(), used
        by the scsi subsystem to allow a sysfs file to unbind itself. That
        would be a real deadlock, which isn't what's happening here. Also,
        breaking the active protection means we'd need to manually handle
        all the lifetime fun.
      
      - With Rafael we discussed the task_work approach, which kinda works,
        but has two downsides: It's a functional change for a lockdep
        annotation issue, and it won't work for the bind file (which needs
        to get the errno from the driver load function back to userspace).
      
      - Greg also asked why this never showed up: To hit this you need to
        unregister a 2nd driver from the unload code of your first driver. I
        guess only gpus do that. The bug has always been there, but only
        with a recent patch series did we add more locks so that lockdep
        built a chain from unbinding the snd-hda driver to the
        acpi_video_unregister call.
      
      Full lockdep splat:
      
      [12301.898799] ============================================
      [12301.898805] WARNING: possible recursive locking detected
      [12301.898811] 4.20.0-rc7+ #84 Not tainted
      [12301.898815] --------------------------------------------
      [12301.898821] bash/5297 is trying to acquire lock:
      [12301.898826] 00000000f61c6093 (kn->count#39){++++}, at: kernfs_remove_by_name_ns+0x3b/0x80
      [12301.898841] but task is already holding lock:
      [12301.898847] 000000005f634021 (kn->count#39){++++}, at: kernfs_fop_write+0xdc/0x190
      [12301.898856] other info that might help us debug this:
      [12301.898862]  Possible unsafe locking scenario:
      [12301.898867]        CPU0
      [12301.898870]        ----
      [12301.898874]   lock(kn->count#39);
      [12301.898879]   lock(kn->count#39);
      [12301.898883] *** DEADLOCK ***
      [12301.898891]  May be due to missing lock nesting notation
      [12301.898899] 5 locks held by bash/5297:
      [12301.898903]  #0: 00000000cd800e54 (sb_writers#4){.+.+}, at: vfs_write+0x17f/0x1b0
      [12301.898915]  #1: 000000000465e7c2 (&of->mutex){+.+.}, at: kernfs_fop_write+0xd3/0x190
      [12301.898925]  #2: 000000005f634021 (kn->count#39){++++}, at: kernfs_fop_write+0xdc/0x190
      [12301.898936]  #3: 00000000414ef7ac (&dev->mutex){....}, at: device_release_driver_internal+0x34/0x240
      [12301.898950]  #4: 000000003218fbdf (register_count_mutex){+.+.}, at: acpi_video_unregister+0xe/0x40
      [12301.898960] stack backtrace:
      [12301.898968] CPU: 1 PID: 5297 Comm: bash Not tainted 4.20.0-rc7+ #84
      [12301.898974] Hardware name: Hewlett-Packard HP EliteBook 8460p/161C, BIOS 68SCF Ver. F.01 03/11/2011
      [12301.898982] Call Trace:
      [12301.898989]  dump_stack+0x67/0x9b
      [12301.898997]  __lock_acquire+0x6ad/0x1410
      [12301.899003]  ? kernfs_remove_by_name_ns+0x3b/0x80
      [12301.899010]  ? find_held_lock+0x2d/0x90
      [12301.899017]  ? mutex_spin_on_owner+0xe4/0x150
      [12301.899023]  ? find_held_lock+0x2d/0x90
      [12301.899030]  ? lock_acquire+0x90/0x180
      [12301.899036]  lock_acquire+0x90/0x180
      [12301.899042]  ? kernfs_remove_by_name_ns+0x3b/0x80
      [12301.899049]  __kernfs_remove+0x296/0x310
      [12301.899055]  ? kernfs_remove_by_name_ns+0x3b/0x80
      [12301.899060]  ? kernfs_name_hash+0xd/0x80
      [12301.899066]  ? kernfs_find_ns+0x6c/0x100
      [12301.899073]  kernfs_remove_by_name_ns+0x3b/0x80
      [12301.899080]  bus_remove_driver+0x92/0xa0
      [12301.899085]  acpi_video_unregister+0x24/0x40
      [12301.899127]  i915_driver_unload+0x42/0x130 [i915]
      [12301.899160]  i915_pci_remove+0x19/0x30 [i915]
      [12301.899169]  pci_device_remove+0x36/0xb0
      [12301.899176]  device_release_driver_internal+0x185/0x240
      [12301.899183]  unbind_store+0xaf/0x180
      [12301.899189]  kernfs_fop_write+0x104/0x190
      [12301.899195]  __vfs_write+0x31/0x180
      [12301.899203]  ? rcu_read_lock_sched_held+0x6f/0x80
      [12301.899209]  ? rcu_sync_lockdep_assert+0x29/0x50
      [12301.899216]  ? __sb_start_write+0x13c/0x1a0
      [12301.899221]  ? vfs_write+0x17f/0x1b0
      [12301.899227]  vfs_write+0xb9/0x1b0
      [12301.899233]  ksys_write+0x50/0xc0
      [12301.899239]  do_syscall_64+0x4b/0x180
      [12301.899247]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [12301.899253] RIP: 0033:0x7f452ac7f7a4
      [12301.899259] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00 00 8b 05 aa f0 2c 00 48 63 ff 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 55 53 48 89 d5 48 89 f3 48 83
      [12301.899273] RSP: 002b:00007ffceafa6918 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [12301.899282] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f452ac7f7a4
      [12301.899288] RDX: 000000000000000d RSI: 00005612a1abf7c0 RDI: 0000000000000001
      [12301.899295] RBP: 00005612a1abf7c0 R08: 000000000000000a R09: 00005612a1c46730
      [12301.899301] R10: 000000000000000a R11: 0000000000000246 R12: 000000000000000d
      [12301.899308] R13: 0000000000000001 R14: 00007f452af4a740 R15: 000000000000000d
      
      Looking around I've noticed that usb and i2c already handle similar
      recursion problems, where a sysfs file can unbind the same type of
      sysfs somewhere else in the hierarchy. Relevant commits are:
      
      commit 356c05d5
      Author: Alan Stern <stern@rowland.harvard.edu>
      Date:   Mon May 14 13:30:03 2012 -0400
      
          sysfs: get rid of some lockdep false positives
      
      commit e9b526fe
      Author: Alexander Sverdlin <alexander.sverdlin@nsn.com>
      Date:   Fri May 17 14:56:35 2013 +0200
      
          i2c: suppress lockdep warning on delete_device
      
      Implement the same trick for driver bind/unbind.
      
      v2: Put the macro into bus.c (Greg).
      Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Ramalingam C <ramalingam.c@intel.com>
      Cc: Arend van Spriel <aspriel@gmail.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Bartosz Golaszewski <brgl@bgdev.pl>
      Cc: Heikki Krogerus <heikki.krogerus@linux.intel.com>
      Cc: Vivek Gautam <vivek.gautam@codeaurora.org>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a13daf03
    • T
      ALSA: bebob: fix model-id of unit for Apogee Ensemble · 959bf5c1
      Takashi Sakamoto committed
      [ Upstream commit 644b2e97405b0b74845e1d3c2b4fe4c34858062b ]
      
      This commit replaces the hard-coded model-id for the Apogee Ensemble unit
      with the correct value. This unit uses the DM1500 ASIC produced by
      ArchWave AG (formerly known as BridgeCo AG).
      
      I note that this model supports three modes for the number of data channels
      in tx/rx streams: 8 ch pairs, 10 ch pairs, and 18 ch pairs. The mode is
      switched by a vendor-dependent AV/C command, like:
      
      $ cd linux-firewire-utils
      $ ./firewire-request /dev/fw1 fcp 0x00ff000003dbeb0600000000 (8ch pairs)
      $ ./firewire-request /dev/fw1 fcp 0x00ff000003dbeb0601000000 (10ch pairs)
      $ ./firewire-request /dev/fw1 fcp 0x00ff000003dbeb0602000000 (18ch pairs)
      
      When switching between modes, the unit disappears from the IEEE 1394
      bus, then reappears on the bus with a different combination of stream
      formats. In the 18 ch pairs mode, the available sampling rate is up to
      96.0 kHz; otherwise it is up to 192.0 kHz.
      
      $ ./hinawa-config-rom-printer /dev/fw1
      { 'bus-info': { 'adj': False,
                      'bmc': True,
                      'chip_ID': 21474898341,
                      'cmc': True,
                      'cyc_clk_acc': 100,
                      'generation': 2,
                      'imc': True,
                      'isc': True,
                      'link_spd': 2,
                      'max_ROM': 1,
                      'max_rec': 512,
                      'name': '1394',
                      'node_vendor_ID': 987,
                      'pmc': False},
        'root-directory': [ ['HARDWARE_VERSION', 19],
                            [ 'NODE_CAPABILITIES',
                              { 'addressing': {'64': True, 'fix': True, 'prv': False},
                                'misc': {'int': False, 'ms': False, 'spt': True},
                                'state': { 'atn': False,
                                           'ded': False,
                                           'drq': True,
                                           'elo': False,
                                           'init': False,
                                           'lst': True,
                                           'off': False},
                                'testing': {'bas': False, 'ext': False}}],
                            ['VENDOR', 987],
                            ['DESCRIPTOR', 'Apogee Electronics'],
                            ['MODEL', 126702],
                            ['DESCRIPTOR', 'Ensemble'],
                            ['VERSION', 5297],
                            [ 'UNIT',
                              [ ['SPECIFIER_ID', 41005],
                                ['VERSION', 65537],
                                ['MODEL', 126702],
                                ['DESCRIPTOR', 'Ensemble']]],
                            [ 'DEPENDENT_INFO',
                              [ ['SPECIFIER_ID', 2037],
                                ['VERSION', 1],
                                [(58, 'IMMEDIATE'), 16777159],
                                [(59, 'IMMEDIATE'), 1048576],
                                [(60, 'IMMEDIATE'), 16777159],
                                [(61, 'IMMEDIATE'), 6291456]]]]}
      Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      959bf5c1
    • R
      Bluetooth: btusb: Add support for Intel bluetooth device 8087:0029 · c5e68453
      Raghuram Hegde committed
      [ Upstream commit 2da711bcebe81209a9f2f90e145600eb1bae2b71 ]
      
      Include the new USB product ID for the Intel Bluetooth device 22260
      family (CcPeak).
      
      The /sys/kernel/debug/usb/devices portion for this device is:
      
      T:  Bus=01 Lev=01 Prnt=01 Port=02 Cnt=02 Dev#=  2 Spd=12   MxCh= 0
      D:  Ver= 2.00 Cls=e0(wlcon) Sub=01 Prot=01 MxPS=64 #Cfgs=  1
      P:  Vendor=8087 ProdID=0029 Rev= 0.01
      C:* #Ifs= 2 Cfg#= 1 Atr=e0 MxPwr=100mA
      I:* If#= 0 Alt= 0 #EPs= 3 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=81(I) Atr=03(Int.) MxPS=  64 Ivl=1ms
      E:  Ad=02(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
      E:  Ad=82(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
      I:* If#= 1 Alt= 0 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=   0 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=   0 Ivl=1ms
      I:  If#= 1 Alt= 1 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=   9 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=   9 Ivl=1ms
      I:  If#= 1 Alt= 2 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  17 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  17 Ivl=1ms
      I:  If#= 1 Alt= 3 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  25 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  25 Ivl=1ms
      I:  If#= 1 Alt= 4 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  33 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  33 Ivl=1ms
      I:  If#= 1 Alt= 5 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  49 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  49 Ivl=1ms
      I:  If#= 1 Alt= 6 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  63 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  63 Ivl=1ms
      Signed-off-by: Raghuram Hegde <raghuram.hegde@intel.com>
      Signed-off-by: Chethan T N <chethan.tumkur.narayan@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c5e68453
    • M
      dm: Check for device sector overflow if CONFIG_LBDAF is not set · 887b1c9a
      Milan Broz committed
      [ Upstream commit ef87bfc24f9b8da82c89aff493df20f078bc9cb1 ]
      
      A reference to a device in a device-mapper table contains an offset in
      sectors.
      
      If sector_t is a 32-bit integer (CONFIG_LBDAF is not set), then
      several device-mapper targets can overflow this offset; the validity
      check is then performed on the wrong offset and a wrong table is activated.
      
      See for example (on 32bit without CONFIG_LBDAF) this overflow:
      
        # dmsetup create test --table "0 2048 linear /dev/sdg 4294967297"
        # dmsetup table test
        0 2048 linear 8:96 1
      
      This patch adds an explicit overflow check where the offset is of type sector_t.
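
      In the dmsetup example above, 4294967297 is 2^32 + 1, so storing it in a
      32-bit type leaves 1. A minimal userspace sketch of that kind of check
      follows; sector32_t and parse_offset() are hypothetical names used only
      for illustration, and the real targets parse their arguments differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a 32-bit sector_t (CONFIG_LBDAF not set). */
typedef uint32_t sector32_t;

/*
 * Parse a decimal offset into a wide temporary, then accept it only if
 * it survives the round trip through the narrower sector type.
 * Returns true and stores the offset on success; returns false on
 * malformed input or when truncation would change the value.
 */
static bool parse_offset(const char *arg, sector32_t *out)
{
	char *end;
	unsigned long long tmp;

	assert(arg != NULL && out != NULL);

	errno = 0;
	tmp = strtoull(arg, &end, 10);
	if (errno || end == arg || *end != '\0')
		return false;

	if (tmp != (sector32_t)tmp)	/* value does not fit in 32 bits */
		return false;

	*out = (sector32_t)tmp;
	return true;
}
```

      Without the comparison against the truncated value, "4294967297" would be
      accepted and silently become offset 1, which is what the table dump above
      shows.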
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      887b1c9a
    • Y
      clocksource/drivers/integrator-ap: Add missing of_node_put() · decca9bc
      Yangtao Li committed
      [ Upstream commit 5eb73c831171115d3b4347e1e7124a5a35d8086c ]
      
      The function of_find_node_by_path() acquires a reference to the node
      returned by it and that reference needs to be dropped by its caller.
      
      integrator_ap_timer_init_of() doesn't do that.  The pri_node and the
      sec_node are used only as identifiers to compare against the current
      node, so we can drop the refcount directly after getting the node from
      the path, as it is never dereferenced as a pointer.
      
      By dropping the refcount right after getting it, a single variable is
      needed instead of two.
      
      Fix this by using a single variable and dropping the refcount right after
      of_find_node_by_path().
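
      The "get, put, then compare the address" pattern can be sketched with a
      toy refcounted node in userspace. struct node, find_node_by_path(),
      node_put() and is_node_at_path() are hypothetical stand-ins; only the
      shape mirrors of_find_node_by_path() and of_node_put().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct node {
	const char *path;
	int refcount;
};

/* Like of_find_node_by_path(): returns the match with its refcount raised. */
static struct node *find_node_by_path(struct node *nodes, size_t n,
				      const char *path)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(nodes[i].path, path) == 0) {
			nodes[i].refcount++;
			return &nodes[i];
		}
	}
	return NULL;
}

/* Like of_node_put(): drops one reference; NULL is ignored. */
static void node_put(struct node *np)
{
	if (!np)
		return;
	assert(np->refcount > 0);
	np->refcount--;
}

/*
 * The looked-up node is used only as an identifier, so the reference is
 * dropped immediately and a single local variable is enough.
 */
static bool is_node_at_path(struct node *nodes, size_t n, const char *path,
			    const struct node *current)
{
	struct node *alias = find_node_by_path(nodes, n, path);

	node_put(alias);
	return alias != NULL && alias == current;
}
```

      After any number of is_node_at_path() calls the refcounts are back where
      they started, which is the property the missing of_node_put() broke.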
      Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      decca9bc
    • J
      quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls. · 876b79b9
      Javier Barrio committed
      [ Upstream commit 41c4f85cdac280d356df1f483000ecec4a8868be ]
      
      Commit 1fa5efe3 (ext4: Use generic helpers for quotaon
      and quotaoff) made it possible to call quotactl(Q_XQUOTAON/OFF) on ext4 filesystems
      with sysfile quota support. This leads to calling dquot_enable/disable without s_umount
      held in exclusive mode, because quotactl_cmd_onoff checks only for Q_QUOTAON/OFF.
      
      The following WARN_ON_ONCE triggers (in this case for dquot_enable, ext4, latest Linus' tree):
      
      [  117.807056] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: quota,prjquota
      
      [...]
      
      [  155.036847] WARNING: CPU: 0 PID: 2343 at fs/quota/dquot.c:2469 dquot_enable+0x34/0xb9
      [  155.036851] Modules linked in: quota_v2 quota_tree ipv6 af_packet joydev mousedev psmouse serio_raw pcspkr i2c_piix4 intel_agp intel_gtt e1000 ttm drm_kms_helper drm agpgart fb_sys_fops syscopyarea sysfillrect sysimgblt i2c_core input_leds kvm_intel kvm irqbypass qemu_fw_cfg floppy evdev parport_pc parport button crc32c_generic dm_mod ata_generic pata_acpi ata_piix libata loop ext4 crc16 mbcache jbd2 usb_storage usbcore sd_mod scsi_mod
      [  155.036901] CPU: 0 PID: 2343 Comm: qctl Not tainted 4.20.0-rc6-00025-gf5d582777bcb #9
      [  155.036903] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [  155.036911] RIP: 0010:dquot_enable+0x34/0xb9
      [  155.036915] Code: 41 56 41 55 41 54 55 53 4c 8b 6f 28 74 02 0f 0b 4d 8d 7d 70 49 89 fc 89 cb 41 89 d6 89 f5 4c 89 ff e8 23 09 ea ff 85 c0 74 0a <0f> 0b 4c 89 ff e8 8b 09 ea ff 85 db 74 6a 41 8b b5 f8 00 00 00 0f
      [  155.036918] RSP: 0018:ffffb09b00493e08 EFLAGS: 00010202
      [  155.036922] RAX: 0000000000000001 RBX: 0000000000000008 RCX: 0000000000000008
      [  155.036924] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9781b67cd870
      [  155.036926] RBP: 0000000000000002 R08: 0000000000000000 R09: 61c8864680b583eb
      [  155.036929] R10: ffffb09b00493e48 R11: ffffffffff7ce7d4 R12: ffff9781b7ee8d78
      [  155.036932] R13: ffff9781b67cd800 R14: 0000000000000004 R15: ffff9781b67cd870
      [  155.036936] FS:  00007fd813250b88(0000) GS:ffff9781ba000000(0000) knlGS:0000000000000000
      [  155.036939] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  155.036942] CR2: 00007fd812ff61d6 CR3: 000000007c882000 CR4: 00000000000006b0
      [  155.036951] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  155.036953] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  155.036955] Call Trace:
      [  155.037004]  dquot_quota_enable+0x8b/0xd0
      [  155.037011]  kernel_quotactl+0x628/0x74e
      [  155.037027]  ? do_mprotect_pkey+0x2a6/0x2cd
      [  155.037034]  __x64_sys_quotactl+0x1a/0x1d
      [  155.037041]  do_syscall_64+0x55/0xe4
      [  155.037078]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  155.037105] RIP: 0033:0x7fd812fe1198
      [  155.037109] Code: 02 77 0d 48 89 c1 48 c1 e9 3f 75 04 48 8b 04 24 48 83 c4 50 5b c3 48 83 ec 08 49 89 ca 48 63 d2 48 63 ff b8 b3 00 00 00 0f 05 <48> 89 c7 e8 c1 eb ff ff 5a c3 48 63 ff b8 bb 00 00 00 0f 05 48 89
      [  155.037112] RSP: 002b:00007ffe8cd7b050 EFLAGS: 00000206 ORIG_RAX: 00000000000000b3
      [  155.037116] RAX: ffffffffffffffda RBX: 00007ffe8cd7b148 RCX: 00007fd812fe1198
      [  155.037119] RDX: 0000000000000000 RSI: 00007ffe8cd7cea9 RDI: 0000000000580102
      [  155.037121] RBP: 00007ffe8cd7b0f0 R08: 000055fc8eba8a9d R09: 0000000000000000
      [  155.037124] R10: 00007ffe8cd7b074 R11: 0000000000000206 R12: 00007ffe8cd7b168
      [  155.037126] R13: 000055fc8eba8897 R14: 0000000000000000 R15: 0000000000000000
      [  155.037131] ---[ end trace 210f864257175c51 ]---
      
      and then the syscall proceeds without s_umount locking.
      
      This patch locks the superblock ->s_umount sem. in exclusive mode for all Q_XQUOTAON/OFF
      quotactls too in addition to Q_QUOTAON/OFF.
      
      AFAICT, other than ext4, only xfs and ocfs2 are affected by this change.
      The VFS will now call into the xfs_quota_* functions with s_umount held, which wasn't the case
      before. This looks good to me, but I cannot say for sure. Ext4 and ocfs2 were already
      being called with s_umount held exclusively via quota_quotaon/off, which is basically the same.
      Signed-off-by: Javier Barrio <javier.barrio.mart@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      876b79b9
    • A
      perf tools: Add missing open_memstream() prototype for systems lacking it · 77f14a49
      Arnaldo Carvalho de Melo committed
      [ Upstream commit d7a8c4a6a055097a67ccfa3ca7c9ff1b64603a70 ]
      
      Some systems, such as the Android NDK at API level 24, have the
      open_memstream() function but don't provide a prototype, adding noise
      to the build:
      
        builtin-timechart.c: In function 'cat_backtrace':
        builtin-timechart.c:486:2: warning: implicit declaration of function 'open_memstream' [-Wimplicit-function-declaration]
          FILE *f = open_memstream(&p, &p_len);
          ^
        builtin-timechart.c:486:2: warning: nested extern declaration of 'open_memstream' [-Wnested-externs]
        builtin-timechart.c:486:12: warning: initialization makes pointer from integer without a cast
          FILE *f = open_memstream(&p, &p_len);
                    ^
      
      Add a LACKS_OPEN_MEMSTREAM_PROTOTYPE define so that code needing it
      can get a prototype.
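
      A minimal sketch of how such a guard can look in a file that calls
      open_memstream(). The parameter names in the fallback prototype and the
      format_value() helper are assumptions made for illustration; the define
      is expected to be set by the build system only on affected platforms.

```c
#define _POSIX_C_SOURCE 200809L	/* ask libc headers for open_memstream() */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifdef LACKS_OPEN_MEMSTREAM_PROTOTYPE
/* Fallback for libcs that ship the symbol but not the declaration. */
FILE *open_memstream(char **ptr, size_t *sizeloc);
#endif

/*
 * Format a value into a heap buffer through a memory stream.
 * Returns a NUL-terminated string the caller must free(), or NULL.
 */
static char *format_value(int value)
{
	char *buf = NULL;
	size_t len = 0;
	FILE *f = open_memstream(&buf, &len);

	if (!f)
		return NULL;

	fprintf(f, "value=%d", value);

	/* buf and len are only guaranteed final after fclose(). */
	if (fclose(f) != 0) {
		free(buf);
		return NULL;
	}

	assert(buf != NULL && strlen(buf) == len);
	return buf;
}
```

      Without a visible prototype the compiler assumes an int return value,
      which is where the "makes pointer from integer" warning quoted above
      comes from.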
      
      Checked in the bionic git repo to be available since level 23:
      
      https://android.googlesource.com/platform/bionic/+/master/libc/include/stdio.h#241
      
        FILE* open_memstream(char** __ptr, size_t* __size_ptr) __INTRODUCED_IN(23);
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-343ashae97e5bq6vizusyfno@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      77f14a49
    • A
      perf tools: Add missing sigqueue() prototype for systems lacking it · e2a1f8d6
      Arnaldo Carvalho de Melo committed
      [ Upstream commit 748fe0889c1ff12d378946bd5326e8ee8eacf5cf ]
      
      Some systems, such as the Android NDK at API level 24, have the
      sigqueue() function but don't provide a prototype, adding noise to the
      build:
      
        util/evlist.c: In function 'perf_evlist__prepare_workload':
        util/evlist.c:1494:4: warning: implicit declaration of function 'sigqueue' [-Wimplicit-function-declaration]
            if (sigqueue(getppid(), SIGUSR1, val))
            ^
        util/evlist.c:1494:4: warning: nested extern declaration of 'sigqueue' [-Wnested-externs]
      
      Add a LACKS_SIGQUEUE_PROTOTYPE define so that code needing it can
      get a prototype.
      
      Checked in the bionic git repo to be available since level 23:
      
      https://android.googlesource.com/platform/bionic/+/master/libc/include/signal.h#123
      
        int sigqueue(pid_t __pid, int __signal, const union sigval __value) __INTRODUCED_IN(23);
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-lmhpev1uni9kdrv7j29glyov@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e2a1f8d6
    • L
      perf cs-etm: Correct packets swapping in cs_etm__flush() · 4bc4b575
      Leo Yan committed
      [ Upstream commit 43fd56669c28cd354e9228bdb58e4bca1c1a8b66 ]
      
      The structure cs_etm_queue uses 'prev_packet' to point to the previous
      packet, which can be combined with the newly arriving packet to generate
      samples.
      
      The function cs_etm__flush() swaps packets only when the flag
      'etm->synth_opts.last_branch' is true. This means it does not swap
      packets when the option '--itrace=il' (generate last branch entries) is
      absent; in that case 'prev_packet' doesn't point to the correct
      previous packet, and the stale packet is still used to generate the
      following sample.  Thus, when dumping the trace with the 'perf script'
      command, we can see an incorrect flow with the stale packet's address info.
      
      This patch corrects packet swapping in cs_etm__flush(): besides the
      flag 'etm->synth_opts.last_branch' it also checks the flag
      'etm->sample_branches', and if either flag is true it swaps packets so
      that the correct content is saved to 'prev_packet'.  This fixes the wrong
      program flow dumping issue.
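
      The corrected condition can be modelled in isolation as follows. This is
      a simplified sketch, not the perf source: struct packet, struct queue,
      struct opts and flush_swap() are hypothetical, reduced to the two flags
      and two pointers the description mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct packet {
	unsigned long addr;
};

struct queue {
	struct packet *prev_packet;	/* last packet already processed */
	struct packet *packet;		/* packet currently being filled */
};

struct opts {
	bool last_branch;	/* models etm->synth_opts.last_branch */
	bool sample_branches;	/* models etm->sample_branches */
};

/*
 * Swap the packet pointers on flush when either consumer of
 * 'prev_packet' is enabled, not only when last_branch is set.
 */
static void flush_swap(struct queue *q, const struct opts *o)
{
	assert(q != NULL && o != NULL);

	if (o->sample_branches || o->last_branch) {
		struct packet *tmp = q->packet;

		q->packet = q->prev_packet;
		q->prev_packet = tmp;
	}
}
```

      With only sample_branches set, the old check would have left
      'prev_packet' pointing at the stale packet; here it ends up pointing at
      the packet that was current before the flush.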
      
      The patch also includes a minor refactoring to use 'etm->synth_opts.last_branch'
      instead of 'etmq->etm->synth_opts.last_branch' for condition checking,
      which is consistent with what is done in cs_etm__sample().
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mike Leach <mike.leach@linaro.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Robert Walker <robert.walker@arm.com>
      Cc: coresight@lists.linaro.org
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/1544513908-16805-2-git-send-email-leo.yan@linaro.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4bc4b575
    • N
      dm snapshot: Fix excessive memory usage and workqueue stalls · 9e5be33b
      Nikos Tsironis committed
      [ Upstream commit 721b1d98fb517ae99ab3b757021cf81db41e67be ]
      
      kcopyd has no upper limit to the number of jobs one can allocate and
      issue. Under certain workloads this can lead to excessive memory usage
      and workqueue stalls. One example is creating multiple dm-snapshot
      targets with a 4K chunk size and then writing to the origin through the
      page cache. Syncing the page cache causes a large number of BIOs to be
      issued to the dm-snapshot origin target, which itself issues an even
      larger (because of the BIO splitting taking place) number of kcopyd
      jobs.
      
      Running the following test, from the device mapper test suite [1],
      
        dmtest run --suite snapshot -n many_snapshots_of_same_volume_N
      
      , with 8 active snapshots, results in the kcopyd job slab cache growing
      to 10G. Depending on the available system RAM this can lead to the OOM
      killer killing user processes:
      
      [463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
                    nodemask=(null), order=1, oom_score_adj=0
      [463.492894] kthreadd cpuset=/ mems_allowed=0
      [463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
      [463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      [463.492952] Call Trace:
      [463.492964]  dump_stack+0x7d/0xbb
      [463.492973]  dump_header+0x6b/0x2fc
      [463.492987]  ? lockdep_hardirqs_on+0xee/0x190
      [463.493012]  oom_kill_process+0x302/0x370
      [463.493021]  out_of_memory+0x113/0x560
      [463.493030]  __alloc_pages_slowpath+0xf40/0x1020
      [463.493055]  __alloc_pages_nodemask+0x348/0x3c0
      [463.493067]  cache_grow_begin+0x81/0x8b0
      [463.493072]  ? cache_grow_begin+0x874/0x8b0
      [463.493078]  fallback_alloc+0x1e4/0x280
      [463.493092]  kmem_cache_alloc_node+0xd6/0x370
      [463.493098]  ? copy_process.part.31+0x1c5/0x20d0
      [463.493105]  copy_process.part.31+0x1c5/0x20d0
      [463.493115]  ? __lock_acquire+0x3cc/0x1550
      [463.493121]  ? __switch_to_asm+0x34/0x70
      [463.493129]  ? kthread_create_worker_on_cpu+0x70/0x70
      [463.493135]  ? finish_task_switch+0x90/0x280
      [463.493165]  _do_fork+0xe0/0x6d0
      [463.493191]  ? kthreadd+0x19f/0x220
      [463.493233]  kernel_thread+0x25/0x30
      [463.493235]  kthreadd+0x1bf/0x220
      [463.493242]  ? kthread_create_on_cpu+0x90/0x90
      [463.493248]  ret_from_fork+0x3a/0x50
      [463.493279] Mem-Info:
      [463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
      [463.493285]  active_file:80216 inactive_file:80107 isolated_file:435
      [463.493285]  unevictable:0 dirty:51266 writeback:109372 unstable:0
      [463.493285]  slab_reclaimable:31191 slab_unreclaimable:3483521
      [463.493285]  mapped:526 shmem:4903 pagetables:1759 bounce:0
      [463.493285]  free:33623 free_pcp:2392 free_cma:0
      ...
      [463.493489] Unreclaimable slab info:
      [463.493513] Name                      Used          Total
      [463.493522] bio-6                   1028KB       1028KB
      [463.493525] bio-5                   1028KB       1028KB
      [463.493528] dm_snap_pending_exception     236783KB     243789KB
      [463.493531] dm_exception              41KB         42KB
      [463.493534] bio-4                   1216KB       1216KB
      [463.493537] bio-3                 439396KB     439396KB
      [463.493539] kcopyd_job           6973427KB    6973427KB
      ...
      [463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
      [463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
      [463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      Moreover, issuing a large number of kcopyd jobs results in kcopyd
      hogging the CPU, while processing them. As a result, processing of work
      items, queued for execution on the same CPU as the currently running
      kcopyd thread, is stalled for long periods of time, hurting performance.
      Running the aforementioned test we get, in dmesg, messages like the
      following:
      
      [67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
      [67501.195586] Showing busy workqueues and worker pools:
      [67501.195591] workqueue events: flags=0x0
      [67501.195597]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195611]     pending: cache_reap
      [67501.195641] workqueue mm_percpu_wq: flags=0x8
      [67501.195645]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195656]     pending: vmstat_update
      [67501.195682] workqueue kblockd: flags=0x18
      [67501.195687]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
      [67501.195698]     pending: blk_timeout_work
      [67501.195753] workqueue kcopyd: flags=0x8
      [67501.195757]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195768]     pending: do_work [dm_mod]
      [67501.195802] workqueue kcopyd: flags=0x8
      [67501.195806]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195817]     pending: do_work [dm_mod]
      [67501.195834] workqueue kcopyd: flags=0x8
      [67501.195838]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195848]     pending: do_work [dm_mod]
      [67501.195881] workqueue kcopyd: flags=0x8
      [67501.195885]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195896]     pending: do_work [dm_mod]
      [67501.195920] workqueue kcopyd: flags=0x8
      [67501.195924]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
      [67501.195935]     in-flight: 67:do_work [dm_mod]
      [67501.195945]     pending: do_work [dm_mod]
      [67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765
      
      The root cause of these issues is the way dm-snapshot uses kcopyd, in
      particular the lack of an explicit or implicit limit on the maximum
      number of in-flight COW jobs. The merging path is not affected because
      it implicitly limits the in-flight kcopyd jobs to one.
      
      Fix these issues by using a semaphore to limit the maximum number of
      in-flight kcopyd jobs. We grab the semaphore before allocating a new
      kcopyd job in start_copy() and start_full_bio() and release it after the
      job finishes in copy_callback().
      
      The initial semaphore value is configurable through a module parameter,
      to allow fine tuning the maximum number of in-flight COW jobs. Setting
      this parameter to zero initializes the semaphore to INT_MAX.
      
      A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
      value was decided experimentally as a trade-off between memory
      consumption, stalling the kernel's workqueues and maintaining a high
      enough throughput.
      
      Re-running the aforementioned test:
      
        * Workqueue stalls are eliminated
        * kcopyd's job slab cache uses a maximum of 130MB
        * The time taken by the test to write to the snapshot-origin target is
          reduced from 05m20.48s to 03m26.38s
      
      [1] https://github.com/jthornber/device-mapper-test-suite
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9e5be33b
    • A
      tools lib subcmd: Don't add the kernel sources to the include path · d9513fdb
      Arnaldo Carvalho de Melo committed
      [ Upstream commit ece9804985b57e1ccd83b1fb6288520955a29d51 ]
      
      At some point we decided not to directly include kernel source files
      when building tools/perf/, but when tools/lib/subcmd/ was forked from
      tools/perf it somehow ended up adding them via these two lines in its
      Makefile:
      
        CFLAGS += -I$(srctree)/include/uapi
        CFLAGS += -I$(srctree)/include
      
      Here $(srctree) points to the kernel sources.
      
      Removing those lines and keeping just:
      
        CFLAGS += -I$(srctree)/tools/include/
      
      Is enough to build tools/perf and tools/objtool.
      
      This fixes the build when building from the sources in environments such
      as the Android NDK crossbuilding from a fedora:26 system:
      
        subcmd-util.h:11:15: error: expected ',' or ';' before 'void'
         static inline void report(const char *prefix, const char *err, va_list params)
                       ^
        In file included from /git/perf/include/uapi/linux/stddef.h:2:0,
                         from /git/perf/include/uapi/linux/posix_types.h:5,
                         from /opt/android-ndk-r12b/platforms/android-24/arch-arm/usr/include/sys/types.h:36,
                         from /opt/android-ndk-r12b/platforms/android-24/arch-arm/usr/include/unistd.h:33,
                         from run-command.c:2:
        subcmd-util.h:18:17: error: '__no_instrument_function__' attribute applies only to functions
      
      The /opt/android-ndk-r12b/platforms/android-24/arch-arm/usr/include/sys/types.h
      file that includes linux/posix_types.h ends up getting the one in the kernel
      sources causing the breakage. Fix it.
      
      Test built tools/objtool/ too.
      Reported-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Fixes: 4b6ab94e ("perf subcmd: Create subcmd library")
      Link: https://lkml.kernel.org/n/tip-5lhaoecrj12t0bqwvpiu14sm@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d9513fdb
    • M
      perf stat: Avoid segfaults caused by negated options · 8603cac2
      Michael Petlan committed
      [ Upstream commit 51433ead1460fb3f46e1c34f68bb22fd2dd0f5d0 ]
      
      Some 'perf stat' options make no sense when negated (event, cgroup),
      and some have no negated path implemented (metrics). It is therefore
      better to disable the "no-" prefix for them; otherwise the later
      option parsing segfaults.
      
      Before:
      
        $ perf stat --no-metrics -- ls
        Segmentation fault (core dumped)
      
      After:
      
        $ perf stat --no-metrics -- ls
         Error: option `no-metrics' isn't available
         Usage: perf stat [<options>] [<command>]
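
      The idea of refusing the "no-" prefix for selected options can be
      sketched as follows. This is a minimal stand-alone model, not perf's
      option parser: struct long_opt, OPT_NO_NEGATE and match_long_opt() are
      hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OPT_NO_NEGATE 0x1	/* hypothetical flag: option has no "no-" form */

struct long_opt {
	const char *name;
	unsigned int flags;
};

/*
 * Returns 1 if arg sets the option, 0 if it negates it, and -1 if the
 * argument is unknown or tries to negate an option marked OPT_NO_NEGATE.
 */
static int match_long_opt(const struct long_opt *opts, size_t n, const char *arg)
{
	int negated = strncmp(arg, "no-", 3) == 0;
	const char *name = negated ? arg + 3 : arg;
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(opts[i].name, name) != 0)
			continue;
		if (negated && (opts[i].flags & OPT_NO_NEGATE))
			return -1;	/* report an error instead of crashing later */
		return negated ? 0 : 1;
	}
	return -1;
}
```

      Rejecting the negated form at match time means the option's handler is
      never entered with state it does not expect.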
      Signed-off-by: Michael Petlan <mpetlan@redhat.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      LPU-Reference: 1485912065.62416880.1544457604340.JavaMail.zimbra@redhat.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8603cac2
    • N
      dm kcopyd: Fix bug causing workqueue stalls · cbd257f3
      Nikos Tsironis committed
      [ Upstream commit d7e6b8dfc7bcb3f4f3a18313581f67486a725b52 ]
      
      When using kcopyd to run callbacks through dm_kcopyd_do_callback() or
      submitting copy jobs with a source size of 0, the jobs are pushed
      directly to the complete_jobs list, which could be under processing by
      the kcopyd thread. As a result, the kcopyd thread can continue running
      completed jobs indefinitely, without releasing the CPU, as long as
      someone keeps submitting new completed jobs through the aforementioned
      paths. Processing of work items, queued for execution on the same CPU as
      the currently running kcopyd thread, is thus stalled for excessive
      amounts of time, hurting performance.
      
      Running the following test from the device mapper test suite [1] with
      8 active snapshots:

        dmtest run --suite snapshot -n parallel_io_to_many_snaps_N

      produces dmesg messages like the following:
      
      [68899.948523] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 95s!
      [68899.949282] Showing busy workqueues and worker pools:
      [68899.949288] workqueue events: flags=0x0
      [68899.949295]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
      [68899.949306]     pending: vmstat_shepherd, cache_reap
      [68899.949331] workqueue mm_percpu_wq: flags=0x8
      [68899.949337]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949345]     pending: vmstat_update
      [68899.949387] workqueue dm_bufio_cache: flags=0x8
      [68899.949392]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
      [68899.949400]     pending: work_fn [dm_bufio]
      [68899.949423] workqueue kcopyd: flags=0x8
      [68899.949429]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949437]     pending: do_work [dm_mod]
      [68899.949452] workqueue kcopyd: flags=0x8
      [68899.949458]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
      [68899.949466]     in-flight: 13:do_work [dm_mod]
      [68899.949474]     pending: do_work [dm_mod]
      [68899.949487] workqueue kcopyd: flags=0x8
      [68899.949493]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949501]     pending: do_work [dm_mod]
      [68899.949515] workqueue kcopyd: flags=0x8
      [68899.949521]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949529]     pending: do_work [dm_mod]
      [68899.949541] workqueue kcopyd: flags=0x8
      [68899.949547]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949555]     pending: do_work [dm_mod]
      [68899.949568] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=95s workers=4 idle: 27130 27223 1084
      
      Fix this by splitting the complete_jobs list into two parts: A user
      facing part, named callback_jobs, and one used internally by kcopyd,
      retaining the name complete_jobs. dm_kcopyd_do_callback() and
      dispatch_job() now push their jobs to the callback_jobs list, which is
      spliced to the complete_jobs list once, every time the kcopyd thread
      wakes up. This prevents kcopyd from hogging the CPU indefinitely and
      causing workqueue stalls.
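
      The effect of the split can be illustrated with a small user-space
      model. This is a sketch under stated assumptions, not the dm-kcopyd
      code: it uses a hand-rolled singly linked queue, no locking, and the
      helper names (job_list_*, worker_wakeup) are hypothetical. Only the
      list names callback_jobs and complete_jobs come from the text above.

```c
#include <assert.h>
#include <stddef.h>

struct job {
	struct job *next;
};

struct job_list {
	struct job *head;
	struct job **tail;	/* points at the last next pointer */
};

struct kcopyd_model {
	struct job_list callback_jobs;	/* filled by submitters */
	struct job_list complete_jobs;	/* drained only by the worker */
};

static void job_list_init(struct job_list *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

static void job_list_push(struct job_list *l, struct job *j)
{
	j->next = NULL;
	*l->tail = j;
	l->tail = &j->next;
}

static struct job *job_list_pop(struct job_list *l)
{
	struct job *j = l->head;

	if (!j)
		return NULL;
	l->head = j->next;
	if (!l->head)
		l->tail = &l->head;
	return j;
}

/* Move every job from src to the end of dst, leaving src empty. */
static void job_list_splice_tail(struct job_list *dst, struct job_list *src)
{
	if (!src->head)
		return;
	*dst->tail = src->head;
	dst->tail = src->tail;
	job_list_init(src);
}

/*
 * One worker wakeup: splice once, then drain complete_jobs. With
 * resubmit set, each finished job is queued again as a new completed
 * job, imitating a submitter that never stops.
 */
static int worker_wakeup(struct kcopyd_model *m, int resubmit)
{
	struct job *j;
	int done = 0;

	job_list_splice_tail(&m->complete_jobs, &m->callback_jobs);
	while ((j = job_list_pop(&m->complete_jobs)) != NULL) {
		done++;
		if (resubmit)
			job_list_push(&m->callback_jobs, j);
	}
	return done;
}
```

      Because jobs resubmitted during a pass land on callback_jobs, each
      wakeup handles a bounded batch and then returns, even when submitters
      keep feeding it. Had the resubmitted jobs gone straight onto
      complete_jobs, the drain loop in this model would never terminate.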
      
      Re-running the aforementioned test:
      
        * Workqueue stalls are eliminated
        * The maximum writing time among all targets is reduced from 09m37.10s
          to 06m04.85s and the total run time of the test is reduced from
          10m43.591s to 7m19.199s
      
      [1] https://github.com/jthornber/device-mapper-test-suite

      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      cbd257f3
    • A
      dm crypt: use u64 instead of sector_t to store iv_offset · 4e26ee31
      AliOS system security committed
      [ Upstream commit 8d683dcd65c037efc9fb38c696ec9b65b306e573 ]
      
      The iv_offset in the mapping table of the crypt target is a 64-bit
      number when the IV algorithm is plain64, plain64be, essiv or benbi. It
      is assigned to iv_offset of struct crypt_config, cc_sector of struct
      convert_context and iv_sector of struct dm_crypt_request. These
      structure members are defined as sector_t, but sector_t is 32 bits wide
      when CONFIG_LBDAF is not set in a 32-bit kernel. In that case sector_t
      is not big enough to store the 64-bit iv_offset.
      
      Here is a reproducer.
      Prepare test image and device (loop is automatically allocated by cryptsetup):
      
        # dd if=/dev/zero of=tst.img bs=1M count=1
        # echo "tst"|cryptsetup open --type plain -c aes-xts-plain64 \
        --skip 500000000000000000 tst.img test
      
      On a 32-bit system with CONFIG_LBDAF off (using an IV offset value that
      needs 64 bits), the table entry and the device checksum are wrong:
      
        # dmsetup table test --showkeys
        0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 3551657984 7:0 0
      
        # sha256sum /dev/mapper/test
        533e25c09176632b3794f35303488c4a8f3f965dffffa6ec2df347c168cb6c19 /dev/mapper/test
      
      On a 64-bit system (and on a 32-bit system with the patch), the table and checksum are correct:
      
        # dmsetup table test --showkeys
        0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 500000000000000000 7:0 0
      
        # sha256sum /dev/mapper/test
        5d16160f9d5f8c33d8051e65fdb4f003cc31cd652b5abb08f03aa6fce0df75fc /dev/mapper/test
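
      The wrong table value in the reproducer is exactly what remains when
      the 64-bit offset is squeezed into 32 bits. The following user-space
      sketch checks that arithmetic; sector32_t and the two helper functions
      are hypothetical stand-ins, not kernel types or APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a 32-bit sector_t (32-bit kernel without CONFIG_LBDAF). */
typedef uint32_t sector32_t;

/* What survives when a 64-bit iv_offset is stored in a 32-bit sector_t. */
static sector32_t store_iv_offset_32(uint64_t iv_offset)
{
	return (sector32_t)iv_offset;	/* keeps only the low 32 bits */
}

/* With the field widened to u64, the value round-trips unchanged. */
static uint64_t store_iv_offset_64(uint64_t iv_offset)
{
	return iv_offset;
}
```

      Passing 500000000000000000 through the 32-bit variant yields
      3551657984, the value shown by dmsetup on the unpatched 32-bit system.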
      Signed-off-by: AliOS system security <alios_sys_security@linux.alibaba.com>
      Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4e26ee31
    • H
      x86/topology: Use total_cpus for max logical packages calculation · a4772e8b
      Hui Wang committed
      [ Upstream commit aa02ef099cff042c2a9109782ec2bf1bffc955d4 ]
      
      nr_cpu_ids can be limited on the command line via nr_cpus=. This can break the
      logical package management because it results in a smaller number of packages
      while in kdump kernel.
      
      Consider the following case: a two-socket system where each socket has
      8 cores, i.e. 16 logical CPUs per socket with HT turned on.
      
       0  1  2  3  4  5  6  7     |    16 17 18 19 20 21 22 23
       cores on socket 0               threads on socket 0
       8  9 10 11 12 13 14 15     |    24 25 26 27 28 29 30 31
       cores on socket 1               threads on socket 1
      
      Suppose a panic is triggered on one of the CPUs 24-31, e.g. 26, and the
      kdump kernel is started with the command line option nr_cpus=16. The
      online CPUs will then be 1-15 and 26 (CPU 0 is disabled in kdump), ncpus
      will be 16 and __max_logical_packages will be 1, although two packages
      were actually booted.

      This issue can be reproduced by setting the kdump option
      nr_cpus=<real physical core numbers> and then triggering a panic on a
      thread of the last socket, for example:

      taskset -c 26 echo c > /proc/sysrq-trigger

      Use total_cpus, which is not limited by the nr_cpus command line option,
      to calculate the value of __max_logical_packages.
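
      The numbers in the example can be checked with a short sketch. It
      assumes the package count is a round-up division of the CPU count by
      the number of logical CPUs per package, which matches the figures
      quoted above; the function names are hypothetical.

```c
#include <assert.h>

/* Same rounding as the kernel's DIV_ROUND_UP() macro. */
static unsigned int div_round_up(unsigned int n, unsigned int d)
{
	return (n + d - 1) / d;
}

/*
 * cpu_count: number of CPUs considered (nr_cpu_ids before, total_cpus after).
 * ncpus:     logical CPUs per package (cores per package times SMT threads).
 */
static unsigned int max_logical_packages(unsigned int cpu_count, unsigned int ncpus)
{
	assert(ncpus > 0);
	return div_round_up(cpu_count, ncpus);
}
```

      With nr_cpus=16 capping the CPU count, max_logical_packages(16, 16)
      gives 1; with all 32 CPUs counted, max_logical_packages(32, 16) gives
      the expected 2.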
      Signed-off-by: Hui Wang <john.wanghui@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: <guijianfeng@huawei.com>
      Cc: <wencongyang2@huawei.com>
      Cc: <douliyang1@huawei.com>
      Cc: <qiaonuohan@huawei.com>
      Link: https://lkml.kernel.org/r/20181107023643.22174-1-john.wanghui@huawei.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a4772e8b
    • T
      netfilter: ipt_CLUSTERIP: fix deadlock in netns exit routine · 9d51378a
      Taehee Yoo committed
      [ Upstream commit 5a86d68bcf02f2d1e9a5897dd482079fd5f75e7f ]
      
      When a network namespace is destroyed, cleanup_net() is called.
      cleanup_net() holds pernet_ops_rwsem and then calls each ->exit
      callback, so clusterip_tg_destroy() is called by cleanup_net(), and
      clusterip_tg_destroy() calls unregister_netdevice_notifier().

      But both cleanup_net() and clusterip_tg_destroy() take the same lock
      (pernet_ops_rwsem), hence a deadlock occurs.

      After this patch, only one notifier is registered when the module is
      inserted, and all configs are added to a per-net list.
      
      test commands:
         %ip netns add vm1
         %ip netns exec vm1 bash
         %ip link set lo up
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	-j CLUSTERIP --new --hashmode sourceip \
      	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
         %exit
         %ip netns del vm1
      
      splat looks like:
      [  341.809674] ============================================
      [  341.809674] WARNING: possible recursive locking detected
      [  341.809674] 4.19.0-rc5+ #16 Tainted: G        W
      [  341.809674] --------------------------------------------
      [  341.809674] kworker/u4:2/87 is trying to acquire lock:
      [  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x8c/0x460
      [  341.809674]
      [  341.809674] but task is already holding lock:
      [  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
      [  341.809674]
      [  341.809674] other info that might help us debug this:
      [  341.809674]  Possible unsafe locking scenario:
      [  341.809674]
      [  341.809674]        CPU0
      [  341.809674]        ----
      [  341.809674]   lock(pernet_ops_rwsem);
      [  341.809674]   lock(pernet_ops_rwsem);
      [  341.809674]
      [  341.809674]  *** DEADLOCK ***
      [  341.809674]
      [  341.809674]  May be due to missing lock nesting notation
      [  341.809674]
      [  341.809674] 3 locks held by kworker/u4:2/87:
      [  341.809674]  #0: 00000000d9df6c92 ((wq_completion)"%s""netns"){+.+.}, at: process_one_work+0xafe/0x1de0
      [  341.809674]  #1: 00000000c2cbcee2 (net_cleanup_work){+.+.}, at: process_one_work+0xb60/0x1de0
      [  341.809674]  #2: 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
      [  341.809674]
      [  341.809674] stack backtrace:
      [  341.809674] CPU: 1 PID: 87 Comm: kworker/u4:2 Tainted: G        W         4.19.0-rc5+ #16
      [  341.809674] Workqueue: netns cleanup_net
      [  341.809674] Call Trace:
      [ ... ]
      [  342.070196]  down_write+0x93/0x160
      [  342.070196]  ? unregister_netdevice_notifier+0x8c/0x460
      [  342.070196]  ? down_read+0x1e0/0x1e0
      [  342.070196]  ? sched_clock_cpu+0x126/0x170
      [  342.070196]  ? find_held_lock+0x39/0x1c0
      [  342.070196]  unregister_netdevice_notifier+0x8c/0x460
      [  342.070196]  ? register_netdevice_notifier+0x790/0x790
      [  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
      [  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
      [  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
      [  342.070196]  ? trace_hardirqs_on+0x93/0x210
      [  342.070196]  ? __bpf_trace_preemptirq_template+0x10/0x10
      [  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
      [  342.123094]  clusterip_tg_destroy+0x3ad/0x650 [ipt_CLUSTERIP]
      [  342.123094]  ? clusterip_net_init+0x3d0/0x3d0 [ipt_CLUSTERIP]
      [  342.123094]  ? cleanup_match+0x17d/0x200 [ip_tables]
      [  342.123094]  ? xt_unregister_table+0x215/0x300 [x_tables]
      [  342.123094]  ? kfree+0xe2/0x2a0
      [  342.123094]  cleanup_entry+0x1d5/0x2f0 [ip_tables]
      [  342.123094]  ? cleanup_match+0x200/0x200 [ip_tables]
      [  342.123094]  __ipt_unregister_table+0x9b/0x1a0 [ip_tables]
      [  342.123094]  iptable_filter_net_exit+0x43/0x80 [iptable_filter]
      [  342.123094]  ops_exit_list.isra.10+0x94/0x140
      [  342.123094]  cleanup_net+0x45b/0x900
      [ ... ]
      
      Fixes: 202f59af ("netfilter: ipt_CLUSTERIP: do not hold dev")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9d51378a
    • T
      netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine · bb7b6c49
      Taehee Yoo committed
      [ Upstream commit b12f7bad5ad3724d19754390a3e80928525c0769 ]
      
      When a network namespace is destroyed, both clusterip_tg_destroy() and
      clusterip_net_exit() are called, and clusterip_net_exit() is called
      before clusterip_tg_destroy().
      Hence the cleanup check code in clusterip_net_exit() doesn't make sense.
      
      test commands:
         %ip netns add vm1
         %ip netns exec vm1 bash
         %ip link set lo up
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	-j CLUSTERIP --new --hashmode sourceip \
      	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
         %exit
         %ip netns del vm1
      
      splat looks like:
      [  341.184508] WARNING: CPU: 1 PID: 87 at net/ipv4/netfilter/ipt_CLUSTERIP.c:840 clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
      [  341.184850] Modules linked in: ipt_CLUSTERIP nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp iptable_filter bpfilter ip_tables x_tables
      [  341.184850] CPU: 1 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc5+ #16
      [  341.227509] Workqueue: netns cleanup_net
      [  341.227509] RIP: 0010:clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
      [  341.227509] Code: 0f 85 7f fe ff ff 48 c7 c2 80 64 2c c0 be a8 02 00 00 48 c7 c7 a0 63 2c c0 c6 05 18 6e 00 00 01 e8 bc 38 ff f5 e9 5b fe ff ff <0f> 0b e9 33 ff ff ff e8 4b 90 50 f6 e9 2d fe ff ff 48 89 df e8 de
      [  341.227509] RSP: 0018:ffff88011086f408 EFLAGS: 00010202
      [  341.227509] RAX: dffffc0000000000 RBX: 1ffff1002210de85 RCX: 0000000000000000
      [  341.227509] RDX: 1ffff1002210de85 RSI: ffff880110813be8 RDI: ffffed002210de58
      [  341.227509] RBP: ffff88011086f4d0 R08: 0000000000000000 R09: 0000000000000000
      [  341.227509] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1002210de81
      [  341.227509] R13: ffff880110625a48 R14: ffff880114cec8c8 R15: 0000000000000014
      [  341.227509] FS:  0000000000000000(0000) GS:ffff880116600000(0000) knlGS:0000000000000000
      [  341.227509] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  341.227509] CR2: 00007f11fd38e000 CR3: 000000013ca16000 CR4: 00000000001006e0
      [  341.227509] Call Trace:
      [  341.227509]  ? __clusterip_config_find+0x460/0x460 [ipt_CLUSTERIP]
      [  341.227509]  ? default_device_exit+0x1ca/0x270
      [  341.227509]  ? remove_proc_entry+0x1cd/0x390
      [  341.227509]  ? dev_change_net_namespace+0xd00/0xd00
      [  341.227509]  ? __init_waitqueue_head+0x130/0x130
      [  341.227509]  ops_exit_list.isra.10+0x94/0x140
      [  341.227509]  cleanup_net+0x45b/0x900
      [ ... ]
      
      Fixes: 613d0776 ("netfilter: exit_net cleanup check added")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      bb7b6c49
    • T
      netfilter: ipt_CLUSTERIP: check MAC address when duplicate config is set · 744383c8
      Taehee Yoo committed
      [ Upstream commit 06aa151ad1fc74a49b45336672515774a678d78d ]
      
      If a config for the same destination IP address already exists, that
      config is simply reused. The MAC address should also be the same, but
      there is no MAC address check.
      This patch adds one.
      
      test commands:
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	   -j CLUSTERIP --new --hashmode sourceip \
      	   --clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	   -j CLUSTERIP --new --hashmode sourceip \
      	   --clustermac 01:00:5e:00:00:21 --total-nodes 2 --local-node 1
      
      After this patch, above commands are disallowed.
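
      The kind of check described above can be sketched as follows. This is
      a simplified user-space model: struct cip_config, cip_config_check()
      and the specific error codes are assumptions made for illustration and
      are not taken from the ipt_CLUSTERIP source.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical, simplified config entry keyed by cluster IP. */
struct cip_config {
	uint32_t clusterip;
	uint8_t clustermac[MAC_LEN];
};

/*
 * Returns 0 if cfg can serve a new rule for (ip, mac), -ENOENT if the IP
 * does not match, and -EINVAL if the IP matches but the MAC differs.
 */
static int cip_config_check(const struct cip_config *cfg, uint32_t ip,
			    const uint8_t mac[MAC_LEN])
{
	if (cfg->clusterip != ip)
		return -ENOENT;
	if (memcmp(cfg->clustermac, mac, MAC_LEN) != 0)
		return -EINVAL;
	return 0;
}
```

      In this model, a rule for 192.168.0.5 with MAC 01:00:5e:00:00:21 is
      refused once a config for the same IP with MAC 01:00:5e:00:00:20
      exists, mirroring the test commands.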
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      744383c8
    • A
      perf vendor events intel: Fix Load_Miss_Real_Latency on SKL/SKX · bd1040e6
      Andi Kleen committed
      [ Upstream commit 91b2b97025097ce7ca7536bc87eba2bf14760fb4 ]
      
      Fix incorrect event names for the Load_Miss_Real_Latency metric for
      Skylake and Skylake Server.
      
      Fixes https://github.com/andikleen/pmu-tools/issues/158
      
      Before:
      
        % perf stat -M Load_Miss_Real_Latency true
        event syntax error: '..ss.pending,mem_load_retired.l1_miss_ps,mem_load_retired.fb_hit_ps}:W'
                                          \___ parser error
      
         Usage: perf stat [<options>] [<command>]
      
            -M, --metrics <metric/metric group list>
                                  monitor specified metrics or metric groups (separated by ,)
      
      After:
      
        % perf stat -M Load_Miss_Real_Latency true
      
         Performance counter stats for 'true':
      
                   279,204      l1d_pend_miss.pending     #     14.0 Load_Miss_Real_Latency
                     4,784      mem_load_uops_retired.l1_miss
                    15,188      mem_load_uops_retired.hit_lfb
      
               0.000899640 seconds time elapsed
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Link: http://lkml.kernel.org/r/20181120050635.4215-1-andi@firstfloor.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      bd1040e6
    • A
      perf parse-events: Fix unchecked usage of strncpy() · 58c67a0b
      Arnaldo Carvalho de Melo committed
      [ Upstream commit bd8d57fb7e25e9fcf67a9eef5fa13aabe2016e07 ]
      
      The strncpy() function may leave the destination string buffer
      unterminated; better to use strlcpy(), for which we have a __weak
      fallback implementation on systems without it.
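
      The difference matters when the source is at least as long as the
      destination. The sketch below shows a strlcpy()-style copy that always
      terminates the destination; bounded_copy() and is_terminated() are
      hypothetical helper names, used here so the example does not depend on
      the C library providing strlcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Local strlcpy()-style helper (named bounded_copy here to avoid clashing
 * with a libc that already provides strlcpy). Copies at most size - 1
 * bytes, always terminates dst when size > 0, and returns strlen(src).
 */
static size_t bounded_copy(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size > 0) {
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/* Returns 1 if buf holds a NUL within its first size bytes. */
static int is_terminated(const char *buf, size_t size)
{
	return memchr(buf, '\0', size) != NULL;
}
```

      A return value greater than or equal to the buffer size tells the
      caller that the copy was truncated, which strncpy() does not report.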
      
      This fixes this warning on an Alpine Linux Edge system with gcc 8.2:
      
        util/parse-events.c: In function 'print_symbol_events':
        util/parse-events.c:2465:4: error: 'strncpy' specified bound 100 equals destination size [-Werror=stringop-truncation]
            strncpy(name, syms->symbol, MAX_NAME_LEN);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        In function 'print_symbol_events.constprop',
            inlined from 'print_events' at util/parse-events.c:2508:2:
        util/parse-events.c:2465:4: error: 'strncpy' specified bound 100 equals destination size [-Werror=stringop-truncation]
            strncpy(name, syms->symbol, MAX_NAME_LEN);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        In function 'print_symbol_events.constprop',
            inlined from 'print_events' at util/parse-events.c:2511:2:
        util/parse-events.c:2465:4: error: 'strncpy' specified bound 100 equals destination size [-Werror=stringop-truncation]
            strncpy(name, syms->symbol, MAX_NAME_LEN);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        cc1: all warnings being treated as errors
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Fixes: 947b4ad1 ("perf list: Fix max event string size")
      Link: https://lkml.kernel.org/n/tip-b663e33bm6x8hrkie4uxh7u2@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      58c67a0b
    • A
      perf svghelper: Fix unchecked usage of strncpy() · b332b4cd
      Arnaldo Carvalho de Melo committed
      [ Upstream commit 2f5302533f306d5ee87bd375aef9ca35b91762cb ]
      
      The strncpy() function may leave the destination string buffer
      unterminated; better to use strlcpy(), for which we have a __weak
      fallback implementation on systems without it.

      In this specific case this would only happen if fgets() was buggy, as
      its man page states that it should read one less byte than the size of
      the destination buffer, so that it can put the nul byte at the end of
      it, so it would never copy 255 non-nul chars, as fgets reads into the
      orig buffer at most 254 non-nul chars and terminates it. But let's just
      switch to strlcpy to keep the original intent and silence the gcc 8.2
      warning.
      
      This fixes this warning on an Alpine Linux Edge system with gcc 8.2:
      
        In function 'cpu_model',
            inlined from 'svg_cpu_box' at util/svghelper.c:378:2:
        util/svghelper.c:337:5: error: 'strncpy' output may be truncated copying 255 bytes from a string of length 255 [-Werror=stringop-truncation]
             strncpy(cpu_m, &buf[13], 255);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Fixes: f48d55ce ("perf: Add a SVG helper library file")
      Link: https://lkml.kernel.org/n/tip-xzkoo0gyr56gej39ltivuh9g@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b332b4cd
    • F
      perf tests ARM: Disable breakpoint tests 32-bit · f54fc4c2
      Florian Fainelli committed
      [ Upstream commit 24f967337f6d6bce931425769c0f5ff5cf2d212e ]
      
      The breakpoint tests on the ARM 32-bit kernel are broken in several
      ways.
      
      The breakpoint length requested does not necessarily match whether the
      function address has the Thumb bit (bit 0) set or not, and this does
      matter to the ARM kernel hw_breakpoint infrastructure. See [1] for
      background.
      
      [1]: https://lkml.org/lkml/2018/11/15/205
      
      As Will indicated, the overflow handling would require single-stepping
      which is not supported at the moment. Just disable those tests for the
      ARM 32-bit platforms and update the comment above to explain these
      limitations.
      Co-developed-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20181203191138.2419-1-f.fainelli@gmail.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f54fc4c2
    • A
      perf intel-pt: Fix error with config term "pt=0" · c3e8c335
      Adrian Hunter committed
      [ Upstream commit 1c6f709b9f96366cc47af23c05ecec9b8c0c392d ]
      
      Users should never use 'pt=0', but if they do it may give a meaningless
      error:
      
      	$ perf record -e intel_pt/pt=0/u uname
      	Error:
      	The sys_perf_event_open() syscall returned with 22 (Invalid argument) for
      	event (intel_pt/pt=0/u).
      
      Fix that by forcing 'pt=1'.
      
      Committer testing:
      
        # perf record -e intel_pt/pt=0/u uname
        Error:
        The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (intel_pt/pt=0/u).
        /bin/dmesg | grep -i perf may provide additional information.
      
        # perf record -e intel_pt/pt=0/u uname
        pt=0 doesn't make sense, forcing pt=1
        Linux
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.020 MB perf.data ]
        #
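
      The "force pt=1" behaviour can be sketched like this. It is a
      stand-alone model that assumes the 'pt' config term corresponds to bit
      0 of the event's config word; PT_BIT and force_pt() are hypothetical
      names, and only the warning text is taken from the output above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PT_BIT 0x1ULL	/* assumption: the 'pt' term maps to bit 0 of config */

/* Returns the config to use; warns and sets the pt bit when it is clear. */
static uint64_t force_pt(uint64_t config)
{
	if (!(config & PT_BIT)) {
		fprintf(stderr, "pt=0 doesn't make sense, forcing pt=1\n");
		config |= PT_BIT;
	}
	return config;
}
```

      Correcting the value and warning keeps the command usable, instead of
      passing a config the kernel rejects with EINVAL.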
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: http://lkml.kernel.org/r/b7c5b4e5-9497-10e5-fd43-5f3e4a0fe51d@intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c3e8c335
    • S
      tty/serial: do not free trasnmit buffer page under port lock · f74fc96e
      Sergey Senozhatsky committed
      [ Upstream commit d72402145ace0697a6a9e8e75a3de5bf3375f78d ]
      
      LKP has hit yet another circular locking dependency between uart
      console drivers and debugobjects [1]:
      
           CPU0                                    CPU1
      
                                                  rhltable_init()
                                                   __init_work()
                                                    debug_object_init
           uart_shutdown()                          /* db->lock */
            /* uart_port->lock */                    debug_print_object()
             free_page()                              printk()
                                                       call_console_drivers()
              debug_check_no_obj_freed()                /* uart_port->lock */
               /* db->lock */
                debug_print_object()
      
      So there are two dependency chains:
      	uart_port->lock -> db->lock
      And
      	db->lock -> uart_port->lock
      
      This particular circular locking dependency can be addressed in several
      ways:
      
      a) One way would be to move debug_print_object() out of db->lock scope
         and, thus, break the db->lock -> uart_port->lock chain.
      b) Another one would be to free() the transmit buffer page outside of
         uart_port->lock scope in UART code; which is what this patch does.
      
      It makes sense to apply a) and b) independently: there are too many things
      going on behind free(), none of which depend on uart_port->lock.
      
      The patch fixes transmit buffer page free() in uart_shutdown() and,
      additionally, in uart_port_startup() (as was suggested by Dmitry Safonov).
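
      The pattern is to detach the buffer pointer while holding the lock and
      free it only after the lock is dropped. Below is a single-threaded
      user-space sketch of that ordering; struct port_model and its helpers
      are hypothetical, and the 'locked' flag merely models uart_port->lock
      so that a free under the lock can be detected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a UART port; 'locked' models uart_port->lock. */
struct port_model {
	bool locked;
	char *xmit_buf;
	bool freed_under_lock;	/* set if a free ever happens with the lock held */
};

static void port_free_buf(struct port_model *p, char *buf)
{
	if (p->locked)
		p->freed_under_lock = true;
	free(buf);
}

/* Detach the buffer under the lock, free it only after unlocking. */
static void port_shutdown(struct port_model *p)
{
	char *buf;

	p->locked = true;	/* lock */
	buf = p->xmit_buf;
	p->xmit_buf = NULL;
	p->locked = false;	/* unlock */

	if (buf)
		port_free_buf(p, buf);
}
```

      Other users of the port never see a dangling pointer, because the
      field is cleared under the lock, while whatever the free path does
      internally runs without the port lock held.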
      
      [1] https://lore.kernel.org/lkml/20181211091154.GL23332@shao2-debian/T/#u

      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Dmitry Safonov <dima@arista.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f74fc96e
    • J
      btrfs: improve error handling of btrfs_add_link · 310f8296
      Johannes Thumshirn committed
      [ Upstream commit 1690dd41e0cb1dade80850ed8a3eb0121b96d22f ]
      
      In the error handling block, err holds the return value of either
      btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked
      since its introduction in commit fe66a05a (Btrfs: improve error
      handling for btrfs_insert_dir_item callers) in 2012.
      
      If the error handling in the error handling fails, there's not much left
      to do and the abort either happened earlier in the callees or is
      necessary here.
      
      So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort
      the transaction, but still return the original code of the failure
      stored in 'ret' as this will be reported to the user.
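
      The control flow described above can be modelled in a few lines. This
      is a sketch with hypothetical names (struct txn_model, txn_abort(),
      add_link_model()); the two integer arguments stand in for the results
      of the real insert and undo calls.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical transaction handle; only records whether it was aborted. */
struct txn_model {
	bool aborted;
	int abort_code;
};

static void txn_abort(struct txn_model *t, int err)
{
	t->aborted = true;
	t->abort_code = err;
}

/*
 * insert_ret: result of the main operation (0 or negative errno).
 * undo_ret:   result of undoing the earlier reference insertion.
 * Returns insert_ret; aborts the transaction if the undo step fails too.
 */
static int add_link_model(struct txn_model *t, int insert_ret, int undo_ret)
{
	int ret = insert_ret;
	int err;

	if (!ret)
		return 0;

	err = undo_ret;		/* stands in for the cleanup call's result */
	if (err)
		txn_abort(t, err);

	return ret;		/* the original failure is what the caller sees */
}
```

      Keeping 'ret' and 'err' separate is what lets the cleanup failure
      trigger the abort without masking the error that started the unwind.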
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      310f8296
    • A
      btrfs: fix use-after-free due to race between replace start and cancel · 38b17eee
      Anand Jain committed
      [ Upstream commit d189dd70e2556181732598956d808ea53cc8774e ]
      
      The device replace cancel thread can race with the replace start thread
      and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will
      fail to stop the scrub thread.
      
      The scrub thread continues with the scrub for replace, which then tries
      to write to the target device that has already been freed by the cancel
      thread.
      
      scrub_setup_ctx() warns as tgtdev is NULL.
      
        struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
        {
        ...
      	  if (is_dev_replace) {
      		  WARN_ON(!fs_info->dev_replace.tgtdev);  <===
      		  sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
      		  sctx->wr_tgtdev = fs_info->dev_replace.tgtdev;
      		  sctx->flush_all_writes = false;
      	  }
      
        [ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started
        [ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled
        [ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.432970] Call Trace:
        [ 6852.433202]  btrfs_scrub_dev+0x19b/0x5c0 [btrfs]
        [ 6852.433471]  btrfs_dev_replace_start+0x48c/0x6a0 [btrfs]
        [ 6852.433800]  btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs]
        [ 6852.434097]  btrfs_ioctl+0x2476/0x2d20 [btrfs]
        [ 6852.434365]  ? do_sigaction+0x7d/0x1e0
        [ 6852.434623]  do_vfs_ioctl+0xa9/0x6c0
        [ 6852.434865]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435124]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435387]  ksys_ioctl+0x60/0x90
        [ 6852.435663]  __x64_sys_ioctl+0x16/0x20
        [ 6852.435907]  do_syscall_64+0x50/0x180
        [ 6852.436150]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Further, as the replace thread enters scrub_write_page_to_dev_replace()
      without the target device it panics:
      
        static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
      				      struct scrub_page *spage)
        {
        ...
      	bio_set_dev(bio, sbio->dev->bdev); <======
      
        [ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
        ..
        [ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs]
        [ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260 [btrfs]
        ..
        [ 6929.721430] Call Trace:
        [ 6929.721663]  scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs]
        [ 6929.721975]  scrub_bio_end_io_worker+0x1af/0x490 [btrfs]
        [ 6929.722277]  normal_work_helper+0xf0/0x4c0 [btrfs]
        [ 6929.722552]  process_one_work+0x1f4/0x520
        [ 6929.722805]  ? process_one_work+0x16e/0x520
        [ 6929.723063]  worker_thread+0x46/0x3d0
        [ 6929.723313]  kthread+0xf8/0x130
        [ 6929.723544]  ? process_one_work+0x520/0x520
        [ 6929.723800]  ? kthread_delayed_work_timer_fn+0x80/0x80
        [ 6929.724081]  ret_from_fork+0x3a/0x50
      
      Fix this by letting btrfs_dev_replace_finishing() do the job of
      cleaning up after the cancel, including freeing the target device.
      btrfs_dev_replace_finishing() is called when btrfs_scrub_dev() returns,
      and it receives the scrub return status.
      
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      38b17eee
    • H
      btrfs: alloc_chunk: fix more DUP stripe size handling · 720b86a5
      Hans van Kranenburg committed
      [ Upstream commit baf92114c7e6dd6124aa3d506e4bc4b694da3bc3 ]
      
      Commit 92e222df "btrfs: alloc_chunk: fix DUP stripe size handling"
      fixed calculating the stripe_size for a new DUP chunk.
      
      However, the same calculation reappears a bit later, and that one was
      not changed yet. The resulting bug that is exposed is that the newly
      allocated device extents ('stripes') can have a few MiB overlap with the
      next thing stored after them, which is another device extent or the end
      of the disk.
      
      The scenario in which this can happen is:
      * The block device for the filesystem is less than 10GiB in size.
      * The amount of contiguous free unallocated disk space chosen to use for
        chunk allocation is 20% of the total device size, or a few MiB more or
        less.
      
      An example:
      - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
      - There's 1578MiB unallocated raw disk space left in one contiguous
        piece.
      
      In this case stripe_size is first calculated as 789MiB (half of
      1578MiB).
      
      Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
      enter the if block. Now stripe_size value is immediately overwritten
      while calculating an adjusted value based on max_chunk_size, which ends
      up as 788MiB.
      
      Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
      actually more than the value we had before. However, the last comparison
      fails to detect this, because it's comparing the value with the total
      amount of free space, which is about twice the size of stripe_size.
      
      In the example above, this means that the resulting raw disk space being
      allocated is 1600MiB, while only a gap of 1578MiB has been found. The
      second device extent object for this DUP chunk will overlap for 22MiB
      with whatever comes next.
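      
      The arithmetic of this example can be replayed with a small standalone
      C sketch. It is not the kernel code: dup_stripe_size(), struct
      dup_result and the keep_original switch are hypothetical names, all
      sizes are plain MiB counts, and the "kept" variant is only one
      assumed way of comparing against the per-copy value instead of the
      whole free hole.

```c
#include <assert.h>
#include <stdint.h>

/* All sizes are in MiB to keep the numbers readable. */
#define ALIGN_MIB  16u /* rounding boundary named in the text */
#define DUP_COPIES 2u  /* DUP places two device extents on one device */

struct dup_result {
	uint64_t stripe_size; /* size of each device extent */
	uint64_t raw_needed;  /* stripe_size * DUP_COPIES */
	uint64_t overlap;     /* raw space requested beyond the free hole */
};

static uint64_t round_up_to(uint64_t value, uint64_t align)
{
	assert(align > 0);
	return (value + align - 1) / align * align;
}

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * hole:           contiguous unallocated space found on the device
 * max_chunk_size: upper bound for the chunk
 * keep_original:  0 replays the overlap described in the text,
 *                 1 compares against the per-copy value computed first
 */
static struct dup_result dup_stripe_size(uint64_t hole,
					 uint64_t max_chunk_size,
					 int keep_original)
{
	const uint64_t data_stripes = 1; /* DUP holds one stripe of data */
	uint64_t stripe_size = hole / DUP_COPIES;
	struct dup_result res;

	if (stripe_size * data_stripes > max_chunk_size) {
		uint64_t capped = round_up_to(max_chunk_size / data_stripes,
					      ALIGN_MIB);

		if (keep_original)
			stripe_size = min_u64(capped, stripe_size);
		else
			/* compares with the whole hole, about twice too large */
			stripe_size = min_u64(capped, hole);
	}

	res.stripe_size = stripe_size;
	res.raw_needed = stripe_size * DUP_COPIES;
	res.overlap = res.raw_needed > hole ? res.raw_needed - hole : 0;
	return res;
}
```

      With hole = 1578 and max_chunk_size = 788, keep_original = 0 gives a
      stripe_size of 800, 1600 raw and an overlap of 22, matching the
      numbers above. keep_original = 1 gives 789, 1578 raw and no overlap.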
      
      The underlying problem here is that the stripe_size is reused all the
      time for different things. So, when entering the code in the if block,
      stripe_size is immediately overwritten with something else. If later we
      decide we want to have the previous value back, then the logic to
      compute it was copy pasted in again.
      
      With this change, the value in stripe_size is not unnecessarily
      destroyed, so the duplicated calculation is not needed any more.
      
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      720b86a5