1. 05 6月, 2020 4 次提交
    • D
      mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · aa218795
      David Hildenbrand 提交于
      virtio-mem wants to allow to offline memory blocks of which some parts
      were unplugged (allocated via alloc_contig_range()), especially, to later
      offline and remove completely unplugged memory blocks. The important part
      is that PageOffline() has to remain set until the section is offline, so
      these pages will never get accessed (e.g., when dumping). The pages should
      not be handed back to the buddy (which would require clearing PageOffline()
      and result in issues if offlining fails and the pages are suddenly in the
      buddy).
      
      Let's allow to do that by allowing to isolate any PageOffline() page
      when offlining. This way, we can reach the memory hotplug notifier
      MEM_GOING_OFFLINE, where the driver can signal that he is fine with
      offlining this page by dropping its reference count. PageOffline() pages
      with a reference count of 0 can then be skipped when offlining the
      pages (like if they were free, however they are not in the buddy).
      
      Anybody who uses PageOffline() pages and does not agree to offline them
      (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
      decrement the reference count and make offlining fail when trying to
      migrate such an unmovable page. So there should be no observable change.
      Same applies to balloon compaction users (movable PageOffline() pages), the
      pages will simply be migrated.
      
      Note 1: If offlining fails, a driver has to increment the reference
      	count again in MEM_CANCEL_OFFLINE.
      
      Note 2: A driver that makes use of this has to be aware that re-onlining
      	the memory block has to be handled by hooking into onlining code
      	(online_page_callback_t), resetting the page PageOffline() and
      	not giving them to the buddy.
      Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      aa218795
    • D
      virtio-mem: Allow to specify an ACPI PXM as nid · f2af6d39
      David Hildenbrand 提交于
      We want to allow to specify (similar as for a DIMM), to which node a
      virtio-mem device (and, therefore, its memory) belongs. Add a new
      virtio-mem feature flag and export pxm_to_node, so it can be used in kernel
      module context.
      
      Acked-by: Michal Hocko <mhocko@suse.com> # for the export
      Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> # for the export
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Tested-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Len Brown <lenb@kernel.org>
      Cc: linux-acpi@vger.kernel.org
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-4-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      f2af6d39
    • D
      virtio-mem: Paravirtualized memory hotplug · 5f1f79bb
      David Hildenbrand 提交于
      Each virtio-mem device owns exactly one memory region. It is responsible
      for adding/removing memory from that memory region on request.
      
      When the device driver starts up, the requested amount of memory is
      queried and then plugged to Linux. On request, further memory can be
      plugged or unplugged. This patch only implements the plugging part.
      
      On x86-64, memory can currently be plugged in 4MB ("subblock") granularity.
      When required, a new memory block will be added (e.g., usually 128MB on
      x86-64) in order to plug more subblocks. Only x86-64 was tested for now.
      
      The online_page callback is used to keep unplugged subblocks offline
      when onlining memory - similar to the Hyper-V balloon driver. Unplugged
      pages are marked PG_offline, to tell dump tools (e.g., makedumpfile) to
      skip them.
      
      User space is usually responsible for onlining the added memory. The
      memory hotplug notifier is used to synchronize virtio-mem activity
      against memory onlining/offlining.
      
      Each virtio-mem device can belong to a NUMA node, which allows us to
      easily add/remove small chunks of memory to/from a specific NUMA node by
      using multiple virtio-mem devices. Something that works even when the
      guest has no idea about the NUMA topology.
      
      One way to view virtio-mem is as a "resizable DIMM" or a DIMM with many
      "sub-DIMMS".
      
      This patch directly introduces the basic infrastructure to implement memory
      unplug. Especially the memory block states and subblock bitmaps will be
      heavily used there.
      
      Notes:
      - In case memory is to be onlined by user space, we limit the amount of
        offline memory blocks, to not run out of memory. This is esp. an
        issue if memory is added faster than it is getting onlined.
      - Suspend/Hibernate is not supported due to the way virtio-mem devices
        behave. Limited support might be possible in the future.
      - Reloading the device driver is not supported.
      Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Tested-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: linux-acpi@vger.kernel.org
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-2-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      5f1f79bb
    • J
      vdpa: introduce get_vq_notification method · c25a26e6
      Jason Wang 提交于
      This patch introduces a new method in the vdpa_config_ops which
      reports the physical address and the size of the doorbell for a
      specific virtqueue.
      
      This will be used by the future patches that maps doorbell to
      userspace.
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20200529080303.15449-4-jasowang@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      c25a26e6
  2. 02 6月, 2020 2 次提交
  3. 29 5月, 2020 3 次提交
  4. 28 5月, 2020 2 次提交
    • J
      RDMA/core: Fix double destruction of uobject · c85f4abe
      Jason Gunthorpe 提交于
      Fix use after free when user user space request uobject concurrently for
      the same object, within the RCU grace period.
      
      In that case, remove_handle_idr_uobject() is called twice and we will have
      an extra put on the uobject which cause use after free.  Fix it by leaving
      the uobject write locked after it was removed from the idr.
      
      Call to rdma_lookup_put_uobject with UVERBS_LOOKUP_DESTROY instead of
      UVERBS_LOOKUP_WRITE will do the work.
      
        refcount_t: underflow; use-after-free.
        WARNING: CPU: 0 PID: 1381 at lib/refcount.c:28 refcount_warn_saturate+0xfe/0x1a0
        Kernel panic - not syncing: panic_on_warn set ...
        CPU: 0 PID: 1381 Comm: syz-executor.0 Not tainted 5.5.0-rc3 #8
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x94/0xce
         panic+0x234/0x56f
         __warn+0x1cc/0x1e1
         report_bug+0x200/0x310
         fixup_bug.part.11+0x32/0x80
         do_error_trap+0xd3/0x100
         do_invalid_op+0x31/0x40
         invalid_op+0x1e/0x30
        RIP: 0010:refcount_warn_saturate+0xfe/0x1a0
        Code: 0f 0b eb 9b e8 23 f6 6d ff 80 3d 6c d4 19 03 00 75 8d e8 15 f6 6d ff 48 c7 c7 c0 02 55 bd c6 05 57 d4 19 03 01 e8 a2 58 49 ff <0f> 0b e9 6e ff ff ff e8 f6 f5 6d ff 80 3d 42 d4 19 03 00 0f 85 5c
        RSP: 0018:ffffc90002df7b98 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: ffff88810f6a193c RCX: ffffffffba649009
        RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88811b0283cc
        RBP: 0000000000000003 R08: ffffed10236060e3 R09: ffffed10236060e3
        R10: 0000000000000001 R11: ffffed10236060e2 R12: ffff88810f6a193c
        R13: ffffc90002df7d60 R14: 0000000000000000 R15: ffff888116ae6a08
         uverbs_uobject_put+0xfd/0x140
         __uobj_perform_destroy+0x3d/0x60
         ib_uverbs_close_xrcd+0x148/0x170
         ib_uverbs_write+0xaa5/0xdf0
         __vfs_write+0x7c/0x100
         vfs_write+0x168/0x4a0
         ksys_write+0xc8/0x200
         do_syscall_64+0x9c/0x390
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x465b49
        Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
        RSP: 002b:00007f759d122c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        RAX: ffffffffffffffda RBX: 000000000073bfa8 RCX: 0000000000465b49
        RDX: 000000000000000c RSI: 0000000020000080 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000246 R12: 00007f759d1236bc
        R13: 00000000004ca27c R14: 000000000070de40 R15: 00000000ffffffff
        Dumping ftrace buffer:
           (ftrace buffer empty)
        Kernel Offset: 0x39400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      
      Fixes: 7452a3c7 ("IB/uverbs: Allow RDMA_REMOVE_DESTROY to work concurrently with disassociate")
      Link: https://lore.kernel.org/r/20200527135534.482279-1-leon@kernel.orgSigned-off-by: NMaor Gottlieb <maorg@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      c85f4abe
    • A
      fanotify: turn off support for FAN_DIR_MODIFY · f1793699
      Amir Goldstein 提交于
      FAN_DIR_MODIFY has been enabled by commit 44d705b0 ("fanotify:
      report name info for FAN_DIR_MODIFY event") in 5.7-rc1. Now we are
      planning further extensions to the fanotify API and during that we
      realized that FAN_DIR_MODIFY may behave slightly differently to be more
      consistent with extensions we plan. So until we finalize these
      extensions, let's not bind our hands with exposing FAN_DIR_MODIFY to
      userland.
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      f1793699
  5. 27 5月, 2020 6 次提交
  6. 26 5月, 2020 2 次提交
    • V
      net/tls: fix race condition causing kernel panic · 0cada332
      Vinay Kumar Yadav 提交于
      tls_sw_recvmsg() and tls_decrypt_done() can be run concurrently.
      // tls_sw_recvmsg()
      	if (atomic_read(&ctx->decrypt_pending))
      		crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
      	else
      		reinit_completion(&ctx->async_wait.completion);
      
      //tls_decrypt_done()
        	pending = atomic_dec_return(&ctx->decrypt_pending);
      
        	if (!pending && READ_ONCE(ctx->async_notify))
        		complete(&ctx->async_wait.completion);
      
      Consider the scenario tls_decrypt_done() is about to run complete()
      
      	if (!pending && READ_ONCE(ctx->async_notify))
      
      and tls_sw_recvmsg() reads decrypt_pending == 0, does reinit_completion(),
      then tls_decrypt_done() runs complete(). This sequence of execution
      results in wrong completion. Consequently, for next decrypt request,
      it will not wait for completion, eventually on connection close, crypto
      resources freed, there is no way to handle pending decrypt response.
      
      This race condition can be avoided by having atomic_read() mutually
      exclusive with atomic_dec_return(),complete().Intoduced spin lock to
      ensure the mutual exclution.
      
      Addressed similar problem in tx direction.
      
      v1->v2:
      - More readable commit message.
      - Corrected the lock to fix new race scenario.
      - Removed barrier which is not needed now.
      
      Fixes: a42055e8 ("net/tls: Add support for async encryption of records for performance")
      Signed-off-by: NVinay Kumar Yadav <vinay.yadav@chelsio.com>
      Reviewed-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0cada332
    • P
      netfilter: nf_conntrack_pptp: prevent buffer overflows in debug code · 4c559f15
      Pablo Neira Ayuso 提交于
      Dan Carpenter says: "Smatch complains that the value for "cmd" comes
      from the network and can't be trusted."
      
      Add pptp_msg_name() helper function that checks for the array boundary.
      
      Fixes: f09943fe ("[NETFILTER]: nf_conntrack/nf_nat: add PPTP helper port")
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      4c559f15
  7. 25 5月, 2020 1 次提交
  8. 23 5月, 2020 3 次提交
    • E
      net/mlx5: Avoid processing commands before cmdif is ready · f7936ddd
      Eran Ben Elisha 提交于
      When driver is reloading during recovery flow, it can't get new commands
      till command interface is up again. Otherwise we may get to null pointer
      trying to access non initialized command structures.
      
      Add cmdif state to avoid processing commands while cmdif is not ready.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: NEran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      f7936ddd
    • E
      net/mlx5: Fix a race when moving command interface to events mode · d43b7007
      Eran Ben Elisha 提交于
      After driver creates (via FW command) an EQ for commands, the driver will
      be informed on new commands completion by EQE. However, due to a race in
      driver's internal command mode metadata update, some new commands will
      still be miss-handled by driver as if we are in polling mode. Such commands
      can get two non forced completion, leading to already freed command entry
      access.
      
      CREATE_EQ command, that maps EQ to the command queue must be posted to the
      command queue while it is empty and no other command should be posted.
      
      Add SW mechanism that once the CREATE_EQ command is about to be executed,
      all other commands will return error without being sent to the FW. Allow
      sending other commands only after successfully changing the driver's
      internal command mode metadata.
      We can safely return error to all other commands while creating the command
      EQ, as all other commands might be sent from the user/application during
      driver load. Application can rerun them later after driver's load was
      finished.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: NEran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      d43b7007
    • M
      net/mlx5: Add command entry handling completion · 17d00e83
      Moshe Shemesh 提交于
      When FW response to commands is very slow and all command entries in
      use are waiting for completion we can have a race where commands can get
      timeout before they get out of the queue and handled. Timeout
      completion on uninitialized command will cause releasing command's
      buffers before accessing it for initialization and then we will get NULL
      pointer exception while trying access it. It may also cause releasing
      buffers of another command since we may have timeout completion before
      even allocating entry index for this command.
      Add entry handling completion to avoid this race.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: NEran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      17d00e83
  9. 22 5月, 2020 1 次提交
    • S
      net: don't return invalid table id error when we fall back to PF_UNSPEC · 41b4bd98
      Sabrina Dubroca 提交于
      In case we can't find a ->dumpit callback for the requested
      (family,type) pair, we fall back to (PF_UNSPEC,type). In effect, we're
      in the same situation as if userspace had requested a PF_UNSPEC
      dump. For RTM_GETROUTE, that handler is rtnl_dump_all, which calls all
      the registered RTM_GETROUTE handlers.
      
      The requested table id may or may not exist for all of those
      families. commit ae677bbb ("net: Don't return invalid table id
      error when dumping all families") fixed the problem when userspace
      explicitly requests a PF_UNSPEC dump, but missed the fallback case.
      
      For example, when we pass ipv6.disable=1 to a kernel with
      CONFIG_IP_MROUTE=y and CONFIG_IP_MROUTE_MULTIPLE_TABLES=y,
      the (PF_INET6, RTM_GETROUTE) handler isn't registered, so we end up in
      rtnl_dump_all, and listing IPv6 routes will unexpectedly print:
      
        # ip -6 r
        Error: ipv4: MR table does not exist.
        Dump terminated
      
      commit ae677bbb introduced the dump_all_families variable, which
      gets set when userspace requests a PF_UNSPEC dump. However, we can't
      simply set the family to PF_UNSPEC in rtnetlink_rcv_msg in the
      fallback case to get dump_all_families == true, because some messages
      types (for example RTM_GETRULE and RTM_GETNEIGH) only register the
      PF_UNSPEC handler and use the family to filter in the kernel what is
      dumped to userspace. We would then export more entries, that userspace
      would have to filter. iproute does that, but other programs may not.
      
      Instead, this patch removes dump_all_families and updates the
      RTM_GETROUTE handlers to check if the family that is being dumped is
      their own. When it's not, which covers both the intentional PF_UNSPEC
      dumps (as dump_all_families did) and the fallback case, ignore the
      missing table id error.
      
      Fixes: cb167893 ("net: Plumb support for filtering ipv4 and ipv6 multicast route dumps")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      41b4bd98
  10. 20 5月, 2020 1 次提交
  11. 19 5月, 2020 1 次提交
  12. 17 5月, 2020 1 次提交
  13. 15 5月, 2020 5 次提交
    • B
      x86: Fix early boot crash on gcc-10, third try · a9a3ed1e
      Borislav Petkov 提交于
      ... or the odyssey of trying to disable the stack protector for the
      function which generates the stack canary value.
      
      The whole story started with Sergei reporting a boot crash with a kernel
      built with gcc-10:
      
        Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
        CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b3 #139
        Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013
        Call Trace:
          dump_stack
          panic
          ? start_secondary
          __stack_chk_fail
          start_secondary
          secondary_startup_64
        -—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary
      
      This happens because gcc-10 tail-call optimizes the last function call
      in start_secondary() - cpu_startup_entry() - and thus emits a stack
      canary check which fails because the canary value changes after the
      boot_init_stack_canary() call.
      
      To fix that, the initial attempt was to mark the one function which
      generates the stack canary with:
      
        __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused)
      
      however, using the optimize attribute doesn't work cumulatively
      as the attribute does not add to but rather replaces previously
      supplied optimization options - roughly all -fxxx options.
      
      The key one among them being -fno-omit-frame-pointer and thus leading to
      not present frame pointer - frame pointer which the kernel needs.
      
      The next attempt to prevent compilers from tail-call optimizing
      the last function call cpu_startup_entry(), shy of carving out
      start_secondary() into a separate compilation unit and building it with
      -fno-stack-protector, was to add an empty asm("").
      
      This current solution was short and sweet, and reportedly, is supported
      by both compilers but we didn't get very far this time: future (LTO?)
      optimization passes could potentially eliminate this, which leads us
      to the third attempt: having an actual memory barrier there which the
      compiler cannot ignore or move around etc.
      
      That should hold for a long time, but hey we said that about the other
      two solutions too so...
      Reported-by: NSergei Trofimovich <slyfox@gentoo.org>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Tested-by: NKalle Valo <kvalo@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
      a9a3ed1e
    • G
      i2c: mux: Replace zero-length array with flexible-array · 8695e0b1
      Gustavo A. R. Silva 提交于
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kind of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that, dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      sizeof(flexible-array-member) triggers a warning because flexible array
      members have incomplete type[1]. There are some instances of code in
      which the sizeof operator is being incorrectly/erroneously applied to
      zero-length arrays and the result is zero. Such instances may be hiding
      some bugs. So, this work (flexible-array member conversions) will also
      help to get completely rid of those sorts of issues.
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>
      Reviewed-by: NPeter Rosin <peda@axentia.se>
      Signed-off-by: NWolfram Sang <wsa@kernel.org>
      8695e0b1
    • K
      net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810 · cc8a677a
      Kevin Lo 提交于
      Set the correct bit when checking for PHY_BRCM_DIS_TXCRXC_NOENRGY on the
      BCM54810 PHY.
      
      Fixes: 0ececcfc ("net: phy: broadcom: Allow BCM54810 to use bcm54xx_adjust_rxrefclk()")
      Signed-off-by: NKevin Lo <kevlo@kevlo.org>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc8a677a
    • A
      security: Fix the default value of secid_to_secctx hook · 625236ba
      Anders Roxell 提交于
      security_secid_to_secctx is called by the bpf_lsm hook and a successful
      return value (i.e 0) implies that the parameter will be consumed by the
      LSM framework. The current behaviour return success when the pointer
      isn't initialized when CONFIG_BPF_LSM is enabled, with the default
      return from kernel/bpf/bpf_lsm.c.
      
      This is the internal error:
      
      [ 1229.341488][ T2659] usercopy: Kernel memory exposure attempt detected from null address (offset 0, size 280)!
      [ 1229.374977][ T2659] ------------[ cut here ]------------
      [ 1229.376813][ T2659] kernel BUG at mm/usercopy.c:99!
      [ 1229.378398][ T2659] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      [ 1229.380348][ T2659] Modules linked in:
      [ 1229.381654][ T2659] CPU: 0 PID: 2659 Comm: systemd-journal Tainted: G    B   W         5.7.0-rc5-next-20200511-00019-g864e0c6319b8-dirty #13
      [ 1229.385429][ T2659] Hardware name: linux,dummy-virt (DT)
      [ 1229.387143][ T2659] pstate: 80400005 (Nzcv daif +PAN -UAO BTYPE=--)
      [ 1229.389165][ T2659] pc : usercopy_abort+0xc8/0xcc
      [ 1229.390705][ T2659] lr : usercopy_abort+0xc8/0xcc
      [ 1229.392225][ T2659] sp : ffff000064247450
      [ 1229.393533][ T2659] x29: ffff000064247460 x28: 0000000000000000
      [ 1229.395449][ T2659] x27: 0000000000000118 x26: 0000000000000000
      [ 1229.397384][ T2659] x25: ffffa000127049e0 x24: ffffa000127049e0
      [ 1229.399306][ T2659] x23: ffffa000127048e0 x22: ffffa000127048a0
      [ 1229.401241][ T2659] x21: ffffa00012704b80 x20: ffffa000127049e0
      [ 1229.403163][ T2659] x19: ffffa00012704820 x18: 0000000000000000
      [ 1229.405094][ T2659] x17: 0000000000000000 x16: 0000000000000000
      [ 1229.407008][ T2659] x15: 0000000000000000 x14: 003d090000000000
      [ 1229.408942][ T2659] x13: ffff80000d5b25b2 x12: 1fffe0000d5b25b1
      [ 1229.410859][ T2659] x11: 1fffe0000d5b25b1 x10: ffff80000d5b25b1
      [ 1229.412791][ T2659] x9 : ffffa0001034bee0 x8 : ffff00006ad92d8f
      [ 1229.414707][ T2659] x7 : 0000000000000000 x6 : ffffa00015eacb20
      [ 1229.416642][ T2659] x5 : ffff0000693c8040 x4 : 0000000000000000
      [ 1229.418558][ T2659] x3 : ffffa0001034befc x2 : d57a7483a01c6300
      [ 1229.420610][ T2659] x1 : 0000000000000000 x0 : 0000000000000059
      [ 1229.422526][ T2659] Call trace:
      [ 1229.423631][ T2659]  usercopy_abort+0xc8/0xcc
      [ 1229.425091][ T2659]  __check_object_size+0xdc/0x7d4
      [ 1229.426729][ T2659]  put_cmsg+0xa30/0xa90
      [ 1229.428132][ T2659]  unix_dgram_recvmsg+0x80c/0x930
      [ 1229.429731][ T2659]  sock_recvmsg+0x9c/0xc0
      [ 1229.431123][ T2659]  ____sys_recvmsg+0x1cc/0x5f8
      [ 1229.432663][ T2659]  ___sys_recvmsg+0x100/0x160
      [ 1229.434151][ T2659]  __sys_recvmsg+0x110/0x1a8
      [ 1229.435623][ T2659]  __arm64_sys_recvmsg+0x58/0x70
      [ 1229.437218][ T2659]  el0_svc_common.constprop.1+0x29c/0x340
      [ 1229.438994][ T2659]  do_el0_svc+0xe8/0x108
      [ 1229.440587][ T2659]  el0_svc+0x74/0x88
      [ 1229.441917][ T2659]  el0_sync_handler+0xe4/0x8b4
      [ 1229.443464][ T2659]  el0_sync+0x17c/0x180
      [ 1229.444920][ T2659] Code: aa1703e2 aa1603e1 910a8260 97ecc860 (d4210000)
      [ 1229.447070][ T2659] ---[ end trace 400497d91baeaf51 ]---
      [ 1229.448791][ T2659] Kernel panic - not syncing: Fatal exception
      [ 1229.450692][ T2659] Kernel Offset: disabled
      [ 1229.452061][ T2659] CPU features: 0x240002,20002004
      [ 1229.453647][ T2659] Memory Limit: none
      [ 1229.455015][ T2659] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Rework the so the default return value is -EOPNOTSUPP.
      
      There are likely other callbacks such as security_inode_getsecctx() that
      may have the same problem, and that someone that understand the code
      better needs to audit them.
      
      Thank you Arnd for helping me figure out what went wrong.
      
      Fixes: 98e828a0 ("security: Refactor declaration of LSM hooks")
      Signed-off-by: NAnders Roxell <anders.roxell@linaro.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJames Morris <jamorris@linux.microsoft.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/bpf/20200512174607.9630-1-anders.roxell@linaro.org
      625236ba
    • Y
      mm, memcg: fix inconsistent oom event behavior · 04fd61a4
      Yafang Shao 提交于
      A recent commit 9852ae3f ("mm, memcg: consider subtrees in
      memory.events") changed the behavior of memcg events, which will now
      consider subtrees in memory.events.
      
      But oom_kill event is a special one as it is used in both cgroup1 and
      cgroup2.  In cgroup1, it is displayed in memory.oom_control.  The file
      memory.oom_control is in both root memcg and non root memcg, that is
      different with memory.event as it only in non-root memcg.  That commit
      is okay for cgroup2, but it is not okay for cgroup1 as it will cause
      inconsistent behavior between root memcg and non-root memcg.
      
      Here's an example on why this behavior is inconsistent in cgroup1.
      
             root memcg
             /
          memcg foo
           /
        memcg bar
      
      Suppose there's an oom_kill in memcg bar, then the oon_kill will be
      
             root memcg : memory.oom_control(oom_kill)  0
             /
          memcg foo : memory.oom_control(oom_kill)  1
           /
        memcg bar : memory.oom_control(oom_kill)  1
      
      For the non-root memcg, its memory.oom_control(oom_kill) includes its
      descendants' oom_kill, but for root memcg, it doesn't include its
      descendants' oom_kill.  That means, memory.oom_control(oom_kill) has
      different meanings in different memcgs.  That is inconsistent.  Then the
      user has to know whether the memcg is root or not.
      
      If we can't fully support it in cgroup1, for example by adding
      memory.events.local into cgroup1 as well, then let's don't touch its
      original behavior.
      
      Fixes: 9852ae3f ("mm, memcg: consider subtrees in memory.events")
      Reported-by: NRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NChris Down <chris@chrisdown.name>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200502141055.7378-1-laoar.shao@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04fd61a4
  14. 14 5月, 2020 4 次提交
    • A
      usb: raw-gadget: support stalling/halting/wedging endpoints · c61769bd
      Andrey Konovalov 提交于
      Raw Gadget is currently unable to stall/halt/wedge gadget endpoints,
      which is required for proper emulation of certain USB classes.
      
      This patch adds a few more ioctls:
      
      - USB_RAW_IOCTL_EP0_STALL allows to stall control endpoint #0 when
        there's a pending setup request for it.
      - USB_RAW_IOCTL_SET/CLEAR_HALT/WEDGE allow to set/clear halt/wedge status
        on non-control non-isochronous endpoints.
      
      Fixes: f2c2e717 ("usb: gadget: add raw-gadget interface")
      Signed-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NFelipe Balbi <balbi@kernel.org>
      c61769bd
    • A
      usb: raw-gadget: fix gadget endpoint selection · 97df5e57
      Andrey Konovalov 提交于
      Currently automatic gadget endpoint selection based on required features
      doesn't work. Raw Gadget tries iterating over the list of available
      endpoints and finding one that has the right direction and transfer type.
      Unfortunately selecting arbitrary gadget endpoints (even if they satisfy
      feature requirements) doesn't work, as (depending on the UDC driver) they
      might have fixed addresses, and one also needs to provide matching
      endpoint addresses in the descriptors sent to the host.
      
      The composite framework deals with this by assigning endpoint addresses
      in usb_ep_autoconfig() before enumeration starts. This approach won't work
      with Raw Gadget as the endpoints are supposed to be enabled after a
      set_configuration/set_interface request from the host, so it's too late to
      patch the endpoint descriptors that had already been sent to the host.
      
      For Raw Gadget we take another approach. Similarly to GadgetFS, we allow
      the user to make the decision as to which gadget endpoints to use.
      
      This patch adds another Raw Gadget ioctl USB_RAW_IOCTL_EPS_INFO that
      exposes information about all non-control endpoints that a currently
      connected UDC has. This information includes endpoints addresses, as well
      as their capabilities and limits to allow the user to choose the most
      fitting gadget endpoint.
      
      The USB_RAW_IOCTL_EP_ENABLE ioctl is updated to use the proper endpoint
      validation routine usb_gadget_ep_match_desc().
      
      These changes affect the portability of the gadgets that use Raw Gadget
      when running on different UDCs. Nevertheless, as long as the user relies
      on the information provided by USB_RAW_IOCTL_EPS_INFO to dynamically
      choose endpoint addresses, UDC-agnostic gadgets can still be written with
      Raw Gadget.
      
      Fixes: f2c2e717 ("usb: gadget: add raw-gadget interface")
      Signed-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NFelipe Balbi <balbi@kernel.org>
      97df5e57
    • A
      usb: raw-gadget: improve uapi headers comments · 17ff3b72
      Andrey Konovalov 提交于
      Fix typo "trasferred" => "transferred".
      
      Don't call USB requests URBs.
      
      Fix comment style.
      Signed-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NFelipe Balbi <balbi@kernel.org>
      17ff3b72
    • P
      efi: cper: Add support for printing Firmware Error Record Reference · 3d8c11ef
      Punit Agrawal 提交于
      While debugging a boot failure, the following unknown error record was
      seen in the boot logs.
      
          <...>
          BERT: Error records from previous boot:
          [Hardware Error]: event severity: fatal
          [Hardware Error]:  Error 0, type: fatal
          [Hardware Error]:   section type: unknown, 81212a96-09ed-4996-9471-8d729c8e69ed
          [Hardware Error]:   section length: 0x290
          [Hardware Error]:   00000000: 00000001 00000000 00000000 00020002  ................
          [Hardware Error]:   00000010: 00020002 0000001f 00000320 00000000  ........ .......
          [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
          [Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
          <...>
      
      On further investigation, it was found that the error record with
      UUID (81212a96-09ed-4996-9471-8d729c8e69ed) has been defined in the
      UEFI Specification at least since v2.4 and has recently had additional
      fields defined in v2.7 Section N.2.10 Firmware Error Record Reference.
      
      Add support for parsing and printing the defined fields to give users
      a chance to figure out what went wrong.
      Signed-off-by: NPunit Agrawal <punit1.agrawal@toshiba.co.jp>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: James Morse <james.morse@arm.com>
      Cc: linux-acpi@vger.kernel.org
      Cc: linux-efi@vger.kernel.org
      Link: https://lore.kernel.org/r/20200512045502.3810339-1-punit1.agrawal@toshiba.co.jpSigned-off-by: NArd Biesheuvel <ardb@kernel.org>
      3d8c11ef
  15. 13 5月, 2020 3 次提交
    • S
      x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up · 59566b0b
      Steven Rostedt (VMware) 提交于
      Booting one of my machines, it triggered the following crash:
      
       Kernel/User page tables isolation: enabled
       ftrace: allocating 36577 entries in 143 pages
       Starting tracer 'function'
       BUG: unable to handle page fault for address: ffffffffa000005c
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0003) - permissions violation
       PGD 2014067 P4D 2014067 PUD 2015063 PMD 7b253067 PTE 7b252061
       Oops: 0003 [#1] PREEMPT SMP PTI
       CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-test+ #24
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
       RIP: 0010:text_poke_early+0x4a/0x58
       Code: 34 24 48 89 54 24 08 e8 bf 72 0b 00 48 8b 34 24 48 8b 4c 24 08 84 c0 74 0b 48 89 df f3 a4 48 83 c4 10 5b c3 9c 58 fa 48 89 df <f3> a4 50 9d 48 83 c4 10 5b e9 d6 f9 ff ff
      0 41 57 49
       RSP: 0000:ffffffff82003d38 EFLAGS: 00010046
       RAX: 0000000000000046 RBX: ffffffffa000005c RCX: 0000000000000005
       RDX: 0000000000000005 RSI: ffffffff825b9a90 RDI: ffffffffa000005c
       RBP: ffffffffa000005c R08: 0000000000000000 R09: ffffffff8206e6e0
       R10: ffff88807b01f4c0 R11: ffffffff8176c106 R12: ffffffff8206e6e0
       R13: ffffffff824f2440 R14: 0000000000000000 R15: ffffffff8206eac0
       FS:  0000000000000000(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffffffa000005c CR3: 0000000002012000 CR4: 00000000000006b0
       Call Trace:
        text_poke_bp+0x27/0x64
        ? mutex_lock+0x36/0x5d
        arch_ftrace_update_trampoline+0x287/0x2d5
        ? ftrace_replace_code+0x14b/0x160
        ? ftrace_update_ftrace_func+0x65/0x6c
        __register_ftrace_function+0x6d/0x81
        ftrace_startup+0x23/0xc1
        register_ftrace_function+0x20/0x37
        func_set_flag+0x59/0x77
        __set_tracer_option.isra.19+0x20/0x3e
        trace_set_options+0xd6/0x13e
        apply_trace_boot_options+0x44/0x6d
        register_tracer+0x19e/0x1ac
        early_trace_init+0x21b/0x2c9
        start_kernel+0x241/0x518
        ? load_ucode_intel_bsp+0x21/0x52
        secondary_startup_64+0xa4/0xb0
      
      I was able to trigger it on other machines, when I added to the kernel
      command line of both "ftrace=function" and "trace_options=func_stack_trace".
      
      The cause is the "ftrace=function" would register the function tracer
      and create a trampoline, and it will set it as executable and
      read-only. Then the "trace_options=func_stack_trace" would then update
      the same trampoline to include the stack tracer version of the function
      tracer. But since the trampoline already exists, it updates it with
      text_poke_bp(). The problem is that text_poke_bp() called while
      system_state == SYSTEM_BOOTING, it will simply do a memcpy() and not
      the page mapping, as it would think that the text is still read-write.
      But in this case it is not, and we take a fault and crash.
      
      Instead, lets keep the ftrace trampolines read-write during boot up,
      and then when the kernel executable text is set to read-only, the
      ftrace trampolines get set to read-only as well.
      
      Link: https://lkml.kernel.org/r/20200430202147.4dc6e2de@oasis.local.home
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: stable@vger.kernel.org
      Fixes: 768ae440 ("x86/ftrace: Use text_poke()")
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      59566b0b
    • E
      tcp: fix SO_RCVLOWAT hangs with fat skbs · 24adbc16
      Eric Dumazet 提交于
      We autotune rcvbuf whenever SO_RCVLOWAT is set to account for 100%
      overhead in tcp_set_rcvlowat()
      
      This works well when skb->len/skb->truesize ratio is bigger than 0.5
      
      But if we receive packets with small MSS, we can end up in a situation
      where not enough bytes are available in the receive queue to satisfy
      RCVLOWAT setting.
      As our sk_rcvbuf limit is hit, we send zero windows in ACK packets,
      preventing remote peer from sending more data.
      
      Even autotuning does not help, because it only triggers at the time
      user process drains the queue. If no EPOLLIN is generated, this
      can not happen.
      
      Note poll() has a similar issue, after commit
      c7004482 ("tcp: Respect SO_RCVLOWAT in tcp_poll().")
      
      Fixes: 03f45c88 ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      24adbc16
    • J
      ptp: fix struct member comment for do_aux_work · 2c864c78
      Jacob Keller 提交于
      The do_aux_work callback had documentation in the structure comment
      which referred to it as "do_work".
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Acked-by: NRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c864c78
  16. 12 5月, 2020 1 次提交