1. 05 Jun, 2020 2 commits
    • D
      mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · aa218795
      David Hildenbrand committed
      virtio-mem wants to allow to offline memory blocks of which some parts
      were unplugged (allocated via alloc_contig_range()), especially, to later
      offline and remove completely unplugged memory blocks. The important part
      is that PageOffline() has to remain set until the section is offline, so
      these pages will never get accessed (e.g., when dumping). The pages should
      not be handed back to the buddy (which would require clearing PageOffline()
      and result in issues if offlining fails and the pages are suddenly in the
      buddy).
      
      Let's enable that by allowing any PageOffline() page to be isolated
      when offlining. This way, we can reach the memory hotplug notifier
      MEM_GOING_OFFLINE, where the driver can signal that it is fine with
      offlining this page by dropping its reference count. PageOffline() pages
      with a reference count of 0 can then be skipped when offlining the
      pages (as if they were free, although they are not in the buddy).
      
      Anybody who uses PageOffline() pages and does not agree to offline them
      (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
      decrement the reference count and make offlining fail when trying to
      migrate such an unmovable page. So there should be no observable change.
      Same applies to balloon compaction users (movable PageOffline() pages), the
      pages will simply be migrated.
      
      Note 1: If offlining fails, a driver has to increment the reference
      	count again in MEM_CANCEL_OFFLINE.
      
      Note 2: A driver that makes use of this has to be aware that re-onlining
      	the memory block has to be handled by hooking into onlining code
      	(online_page_callback_t), setting PageOffline() on the pages again
      	and not giving them to the buddy.
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
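The reference-count handshake described in commit aa218795 can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct model_page`, `model_driver_notify()` and `model_can_offline()` are hypothetical names, not kernel APIs, and the `offline` flag merely stands in for PageOffline().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a page; not the kernel's struct page. */
struct model_page {
    bool offline;   /* stands in for PageOffline() */
    int refcount;   /* stands in for the page reference count */
};

/* Hypothetical stand-ins for MEM_GOING_OFFLINE and MEM_CANCEL_OFFLINE. */
enum model_event { MODEL_GOING_OFFLINE, MODEL_CANCEL_OFFLINE };

/* Driver side: drop the reference when offlining starts and take it
 * back if offlining is cancelled. Only PageOffline()-like pages are touched. */
void model_driver_notify(struct model_page *pages, size_t n,
                         enum model_event ev)
{
    for (size_t i = 0; i < n; i++) {
        if (!pages[i].offline)
            continue;
        if (ev == MODEL_GOING_OFFLINE)
            pages[i].refcount--;
        else
            pages[i].refcount++;
    }
}

/* Core side: offlining may skip a PageOffline()-like page only when its
 * reference count is 0; any other count makes offlining fail. */
bool model_can_offline(const struct model_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (pages[i].offline && pages[i].refcount != 0)
            return false;
    return true;
}
```

A driver that never drops its reference (the balloon drivers named in the message) keeps `model_can_offline()` returning false, which matches the "no observable change" claim.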
    • J
      vdpa: introduce get_vq_notification method · c25a26e6
      Jason Wang committed
      This patch introduces a new method in the vdpa_config_ops which
      reports the physical address and the size of the doorbell for a
      specific virtqueue.
      
      This will be used by future patches that map the doorbell to
      userspace.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20200529080303.15449-4-jasowang@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
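To make the shape of such a method concrete, here is a minimal userspace sketch of a config-ops callback that reports a doorbell address and size per virtqueue. Every name below (`model_notify_area`, `model_config_ops`, the base/stride layout) is an assumption made for illustration; it does not reproduce the kernel's vdpa definitions, and treating the method as optional is also an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one virtqueue doorbell. */
struct model_notify_area {
    uint64_t addr;  /* physical address of the doorbell */
    uint64_t size;  /* size of the doorbell region in bytes */
};

struct model_vdpa_dev;

/* Hypothetical subset of a config-ops table; only the new method is shown. */
struct model_config_ops {
    struct model_notify_area (*get_vq_notification)(struct model_vdpa_dev *dev,
                                                    uint16_t qid);
};

struct model_vdpa_dev {
    const struct model_config_ops *ops;
    uint64_t doorbell_base;    /* assumed: doorbells laid out back to back */
    uint64_t doorbell_stride;
};

/* Device side: compute the doorbell location for virtqueue qid. */
struct model_notify_area model_get_vq_notification(struct model_vdpa_dev *dev,
                                                   uint16_t qid)
{
    struct model_notify_area area = {
        .addr = dev->doorbell_base + (uint64_t)qid * dev->doorbell_stride,
        .size = dev->doorbell_stride,
    };
    return area;
}

/* Caller side: in this sketch the method is optional, so check it first. */
int model_query_doorbell(struct model_vdpa_dev *dev, uint16_t qid,
                         struct model_notify_area *out)
{
    if (!dev->ops || !dev->ops->get_vq_notification)
        return -1;
    *out = dev->ops->get_vq_notification(dev, qid);
    return 0;
}
```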
  2. 02 Jun, 2020 1 commit
    • M
      virtio: force spec specified alignment on types · a865e420
      Michael S. Tsirkin committed
      The ring element addresses are passed between components with different
      alignment assumptions. Thus, if guest/userspace selects a pointer and
      the host then gets and dereferences it, we might need to decrease the
      compiler-selected alignment to prevent the compiler on the host from
      assuming the pointer is aligned.
      
      This actually triggers on ARM with -mabi=apcs-gnu - which is a
      deprecated configuration, but it seems safer to handle this
      generally.
      
      Note that userspace that allocates the memory is actually OK and does
      not need to be fixed, but userspace that gets it from guest or another
      process does need to be fixed. The latter doesn't generally talk to the
      kernel, so while it might be buggy, it's not talking to the kernel in the
      buggy way - it's just using the header in the buggy way - so fixing
      the header and asking userspace to recompile is the best we can do.
      
      I verified that the produced kernel binary on x86 is exactly identical
      before and after the change.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
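The idea of forcing a specific alignment on a type can be illustrated with the GCC/Clang `aligned` attribute, which may lower the alignment when applied to a typedef. The sketch below is not the actual change to the virtio headers; `u64_align4` and `struct model_ring_elem` are hypothetical names chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: on a typedef, aligned() can lower the alignment,
 * so the compiler stops assuming 8-byte alignment for this 64-bit type. */
typedef uint64_t __attribute__((aligned(4))) u64_align4;

/* Hypothetical ring element using the reduced-alignment type. */
struct model_ring_elem {
    uint32_t id;
    u64_align4 addr;   /* placed directly after id, without padding */
};

/* Helpers that expose what the compiler decided. */
size_t model_addr_align(void)
{
    return _Alignof(u64_align4);
}

size_t model_addr_offset(void)
{
    return offsetof(struct model_ring_elem, addr);
}
```

With the alignment pinned in the type itself, code that dereferences a pointer to it cannot rely on a stricter ABI default, which is the property the commit message is after.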
  3. 29 May, 2020 2 commits
  4. 28 May, 2020 1 commit
    • A
      fanotify: turn off support for FAN_DIR_MODIFY · f1793699
      Amir Goldstein committed
      FAN_DIR_MODIFY has been enabled by commit 44d705b0 ("fanotify:
      report name info for FAN_DIR_MODIFY event") in 5.7-rc1. Now we are
      planning further extensions to the fanotify API and during that we
      realized that FAN_DIR_MODIFY may behave slightly differently to be more
      consistent with extensions we plan. So until we finalize these
      extensions, let's not bind our hands with exposing FAN_DIR_MODIFY to
      userland.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
  5. 27 May, 2020 2 commits
  6. 26 May, 2020 1 commit
  7. 25 May, 2020 1 commit
  8. 23 May, 2020 3 commits
    • E
      net/mlx5: Avoid processing commands before cmdif is ready · f7936ddd
      Eran Ben Elisha committed
      When the driver is reloading during the recovery flow, it can't accept new
      commands until the command interface is up again. Otherwise we may hit a NULL
      pointer dereference when accessing uninitialized command structures.
      
      Add cmdif state to avoid processing commands while cmdif is not ready.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
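The "cmdif state" guard from commit f7936ddd can be pictured as a simple state check in front of command execution. This is a hedged sketch only: `model_cmdif_state`, `struct model_cmd` and the `-EIO` return value are assumptions for illustration and are not taken from the mlx5 driver.

```c
#include <errno.h>

/* Hypothetical command-interface states. */
enum model_cmdif_state { MODEL_CMDIF_DOWN, MODEL_CMDIF_UP };

struct model_cmd {
    enum model_cmdif_state state;
    int executed;   /* number of commands that reached the interface */
};

/* Reject commands while the interface is not ready, so no command
 * structure is touched before it has been (re)initialized. */
int model_cmd_exec(struct model_cmd *cmd)
{
    if (cmd->state != MODEL_CMDIF_UP)
        return -EIO;   /* error code chosen for this sketch */
    cmd->executed++;
    return 0;
}
```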
    • E
      net/mlx5: Fix a race when moving command interface to events mode · d43b7007
      Eran Ben Elisha committed
      After the driver creates (via FW command) an EQ for commands, the driver will
      be informed of new command completions by EQE. However, due to a race in
      the driver's internal command mode metadata update, some new commands will
      still be mishandled by the driver as if we were in polling mode. Such commands
      can get two non-forced completions, leading to access to an already freed
      command entry.
      
      CREATE_EQ command, that maps EQ to the command queue must be posted to the
      command queue while it is empty and no other command should be posted.
      
      Add SW mechanism that once the CREATE_EQ command is about to be executed,
      all other commands will return error without being sent to the FW. Allow
      sending other commands only after successfully changing the driver's
      internal command mode metadata.
      We can safely return error to all other commands while creating the command
      EQ, as all other commands might be sent from the user/application during
      driver load. Application can rerun them later after driver's load was
      finished.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • M
      net/mlx5: Add command entry handling completion · 17d00e83
      Moshe Shemesh committed
      When the FW response to commands is very slow and all command entries in
      use are waiting for completion, we can have a race where commands time
      out before they get out of the queue and are handled. A timeout
      completion on an uninitialized command will release the command's
      buffers before they are accessed for initialization, and then we will get
      a NULL pointer exception while trying to access them. It may also release
      the buffers of another command, since we may have a timeout completion
      before even allocating an entry index for this command.
      Add entry handling completion to avoid this race.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
  9. 17 May, 2020 1 commit
  10. 15 May, 2020 5 commits
    • B
      x86: Fix early boot crash on gcc-10, third try · a9a3ed1e
      Borislav Petkov committed
      ... or the odyssey of trying to disable the stack protector for the
      function which generates the stack canary value.
      
      The whole story started with Sergei reporting a boot crash with a kernel
      built with gcc-10:
      
        Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
        CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5-00235-gfffb08b3 #139
        Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M-D3H, BIOS F12 11/14/2013
        Call Trace:
          dump_stack
          panic
          ? start_secondary
          __stack_chk_fail
          start_secondary
          secondary_startup_64
        ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
      
      This happens because gcc-10 tail-call optimizes the last function call
      in start_secondary() - cpu_startup_entry() - and thus emits a stack
      canary check which fails because the canary value changes after the
      boot_init_stack_canary() call.
      
      To fix that, the initial attempt was to mark the one function which
      generates the stack canary with:
      
        __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused)
      
      however, using the optimize attribute doesn't work cumulatively
      as the attribute does not add to but rather replaces previously
      supplied optimization options - roughly all -fxxx options.
      
      The key one among them is -fno-omit-frame-pointer; losing it leaves the
      function without a frame pointer, which the kernel needs.
      
      The next attempt to prevent compilers from tail-call optimizing
      the last function call cpu_startup_entry(), shy of carving out
      start_secondary() into a separate compilation unit and building it with
      -fno-stack-protector, was to add an empty asm("").
      
      This current solution was short and sweet, and reportedly, is supported
      by both compilers but we didn't get very far this time: future (LTO?)
      optimization passes could potentially eliminate this, which leads us
      to the third attempt: having an actual memory barrier there which the
      compiler cannot ignore or move around etc.
      
      That should hold for a long time, but hey we said that about the other
      two solutions too so...
      Reported-by: Sergei Trofimovich <slyfox@gentoo.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Tested-by: Kalle Valo <kvalo@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
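The third attempt described in commit a9a3ed1e, placing a real memory barrier after the final call, can be sketched in userspace as follows. The macro and function names are hypothetical, and `__sync_synchronize()` (a GCC/Clang builtin that issues a full barrier) is used here as a stand-in for whatever barrier the kernel patch uses.

```c
/* Hypothetical macro name; expands to a full memory barrier. */
#define model_prevent_tail_call() __sync_synchronize()

int model_entered;   /* counts calls, so the effect is observable */

/* Stand-in for the last function called during secondary CPU startup. */
void model_idle_entry(void)
{
    model_entered++;
}

void model_start_secondary(void)
{
    /* ... per-CPU setup, including the stack canary, would go here ... */
    model_idle_entry();
    /* Work that must run after the call returns: the call above is no
     * longer the function's final action, so it cannot be emitted as a
     * tail call (a plain jump). */
    model_prevent_tail_call();
}
```

Unlike an empty `asm("")`, a barrier has a defined effect that optimization passes have to preserve, which is the reasoning the message gives for preferring it.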
    • G
      i2c: mux: Replace zero-length array with flexible-array · 8695e0b1
      Gustavo A. R. Silva committed
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kind of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that, dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      sizeof(flexible-array-member) triggers a warning because flexible array
      members have incomplete type[1]. There are some instances of code in
      which the sizeof operator is being incorrectly/erroneously applied to
      zero-length arrays and the result is zero. Such instances may be hiding
      some bugs. So, this work (flexible-array member conversions) will also
      help to get completely rid of those sorts of issues.
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Reviewed-by: Peter Rosin <peda@axentia.se>
      Signed-off-by: Wolfram Sang <wsa@kernel.org>
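As a usage sketch for the flexible array member shown in commit 8695e0b1, the following userspace code allocates a structure together with its trailing array. The `count` member and the `foo_alloc()` helper are additions made for this example and are not part of the commit; overflow checking of the size computation is omitted for brevity.

```c
#include <stddef.h>
#include <stdlib.h>

struct boo {
    int value;
};

struct foo {
    int stuff;
    size_t count;         /* added for this sketch: number of elements */
    struct boo array[];   /* flexible array member, must be last */
};

/* Allocate a struct foo with room for count trailing elements.
 * sizeof(*f) does not include the flexible array, so its storage is
 * added explicitly. */
struct foo *foo_alloc(size_t count)
{
    struct foo *f = malloc(sizeof(*f) + count * sizeof(f->array[0]));

    if (!f)
        return NULL;
    f->stuff = 0;
    f->count = count;
    for (size_t i = 0; i < count; i++)
        f->array[i].value = (int)i;
    return f;
}
```

Because `sizeof(*f)` excludes the array, the allocation size stays the same as with the zero-length array form, which is why the message says dynamic allocations are unaffected.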
    • K
      net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810 · cc8a677a
      Kevin Lo committed
      Set the correct bit when checking for PHY_BRCM_DIS_TXCRXC_NOENRGY on the
      BCM54810 PHY.
      
      Fixes: 0ececcfc ("net: phy: broadcom: Allow BCM54810 to use bcm54xx_adjust_rxrefclk()")
      Signed-off-by: Kevin Lo <kevlo@kevlo.org>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      security: Fix the default value of secid_to_secctx hook · 625236ba
      Anders Roxell committed
      security_secid_to_secctx is called by the bpf_lsm hook and a successful
      return value (i.e., 0) implies that the parameter will be consumed by the
      LSM framework. With CONFIG_BPF_LSM enabled, the current behaviour is to
      return success even when the pointer isn't initialized, because of the
      default return value from kernel/bpf/bpf_lsm.c.
      
      This is the internal error:
      
      [ 1229.341488][ T2659] usercopy: Kernel memory exposure attempt detected from null address (offset 0, size 280)!
      [ 1229.374977][ T2659] ------------[ cut here ]------------
      [ 1229.376813][ T2659] kernel BUG at mm/usercopy.c:99!
      [ 1229.378398][ T2659] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      [ 1229.380348][ T2659] Modules linked in:
      [ 1229.381654][ T2659] CPU: 0 PID: 2659 Comm: systemd-journal Tainted: G    B   W         5.7.0-rc5-next-20200511-00019-g864e0c6319b8-dirty #13
      [ 1229.385429][ T2659] Hardware name: linux,dummy-virt (DT)
      [ 1229.387143][ T2659] pstate: 80400005 (Nzcv daif +PAN -UAO BTYPE=--)
      [ 1229.389165][ T2659] pc : usercopy_abort+0xc8/0xcc
      [ 1229.390705][ T2659] lr : usercopy_abort+0xc8/0xcc
      [ 1229.392225][ T2659] sp : ffff000064247450
      [ 1229.393533][ T2659] x29: ffff000064247460 x28: 0000000000000000
      [ 1229.395449][ T2659] x27: 0000000000000118 x26: 0000000000000000
      [ 1229.397384][ T2659] x25: ffffa000127049e0 x24: ffffa000127049e0
      [ 1229.399306][ T2659] x23: ffffa000127048e0 x22: ffffa000127048a0
      [ 1229.401241][ T2659] x21: ffffa00012704b80 x20: ffffa000127049e0
      [ 1229.403163][ T2659] x19: ffffa00012704820 x18: 0000000000000000
      [ 1229.405094][ T2659] x17: 0000000000000000 x16: 0000000000000000
      [ 1229.407008][ T2659] x15: 0000000000000000 x14: 003d090000000000
      [ 1229.408942][ T2659] x13: ffff80000d5b25b2 x12: 1fffe0000d5b25b1
      [ 1229.410859][ T2659] x11: 1fffe0000d5b25b1 x10: ffff80000d5b25b1
      [ 1229.412791][ T2659] x9 : ffffa0001034bee0 x8 : ffff00006ad92d8f
      [ 1229.414707][ T2659] x7 : 0000000000000000 x6 : ffffa00015eacb20
      [ 1229.416642][ T2659] x5 : ffff0000693c8040 x4 : 0000000000000000
      [ 1229.418558][ T2659] x3 : ffffa0001034befc x2 : d57a7483a01c6300
      [ 1229.420610][ T2659] x1 : 0000000000000000 x0 : 0000000000000059
      [ 1229.422526][ T2659] Call trace:
      [ 1229.423631][ T2659]  usercopy_abort+0xc8/0xcc
      [ 1229.425091][ T2659]  __check_object_size+0xdc/0x7d4
      [ 1229.426729][ T2659]  put_cmsg+0xa30/0xa90
      [ 1229.428132][ T2659]  unix_dgram_recvmsg+0x80c/0x930
      [ 1229.429731][ T2659]  sock_recvmsg+0x9c/0xc0
      [ 1229.431123][ T2659]  ____sys_recvmsg+0x1cc/0x5f8
      [ 1229.432663][ T2659]  ___sys_recvmsg+0x100/0x160
      [ 1229.434151][ T2659]  __sys_recvmsg+0x110/0x1a8
      [ 1229.435623][ T2659]  __arm64_sys_recvmsg+0x58/0x70
      [ 1229.437218][ T2659]  el0_svc_common.constprop.1+0x29c/0x340
      [ 1229.438994][ T2659]  do_el0_svc+0xe8/0x108
      [ 1229.440587][ T2659]  el0_svc+0x74/0x88
      [ 1229.441917][ T2659]  el0_sync_handler+0xe4/0x8b4
      [ 1229.443464][ T2659]  el0_sync+0x17c/0x180
      [ 1229.444920][ T2659] Code: aa1703e2 aa1603e1 910a8260 97ecc860 (d4210000)
      [ 1229.447070][ T2659] ---[ end trace 400497d91baeaf51 ]---
      [ 1229.448791][ T2659] Kernel panic - not syncing: Fatal exception
      [ 1229.450692][ T2659] Kernel Offset: disabled
      [ 1229.452061][ T2659] CPU features: 0x240002,20002004
      [ 1229.453647][ T2659] Memory Limit: none
      [ 1229.455015][ T2659] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Rework this so that the default return value is -EOPNOTSUPP.
      
      There are likely other callbacks such as security_inode_getsecctx() that
      may have the same problem, and someone who understands the code
      better needs to audit them.
      
      Thank you Arnd for helping me figure out what went wrong.
      
      Fixes: 98e828a0 ("security: Refactor declaration of LSM hooks")
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/bpf/20200512174607.9630-1-anders.roxell@linaro.org
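Why the default return value matters in commit 625236ba can be shown with a small model of a hook chain. The types and functions below are hypothetical and greatly simplified; the only detail taken from the message is that the default becomes -EOPNOTSUPP, so a hook that does nothing no longer looks like success.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook signature: fill *ctx for secid, return 0 on success. */
typedef int (*model_secid_hook)(uint32_t secid, const char **ctx);

/* A do-nothing hook, like a default stub: it leaves *ctx untouched and
 * therefore must not report success. */
int model_default_hook(uint32_t secid, const char **ctx)
{
    (void)secid;
    (void)ctx;
    return -EOPNOTSUPP;
}

/* A hook that really produces a context string. */
int model_real_hook(uint32_t secid, const char **ctx)
{
    (void)secid;
    *ctx = "model_label";
    return 0;
}

/* Walk the hooks; the first one that returns something other than the
 * default decides the result. Callers may use *ctx only after a 0 return. */
int model_secid_to_secctx(const model_secid_hook *hooks, size_t n,
                          uint32_t secid, const char **ctx)
{
    for (size_t i = 0; i < n; i++) {
        int rc = hooks[i](secid, ctx);

        if (rc != -EOPNOTSUPP)
            return rc;
    }
    return -EOPNOTSUPP;
}
```

If the stub returned 0 instead, the caller would go on to copy from an uninitialized pointer, which is the usercopy failure shown in the log above.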
    • Y
      mm, memcg: fix inconsistent oom event behavior · 04fd61a4
      Yafang Shao committed
      A recent commit 9852ae3f ("mm, memcg: consider subtrees in
      memory.events") changed the behavior of memcg events, which will now
      consider subtrees in memory.events.
      
      But the oom_kill event is a special one as it is used in both cgroup1 and
      cgroup2.  In cgroup1, it is displayed in memory.oom_control.  The file
      memory.oom_control exists in both the root memcg and non-root memcgs, which
      differs from memory.events, as that exists only in non-root memcgs.  That
      commit is okay for cgroup2, but it is not okay for cgroup1 as it will cause
      inconsistent behavior between root memcg and non-root memcg.
      
      Here's an example on why this behavior is inconsistent in cgroup1.
      
             root memcg
             /
          memcg foo
           /
        memcg bar
      
      Suppose there's an oom_kill in memcg bar; the oom_kill counts will then be
      
             root memcg : memory.oom_control(oom_kill)  0
             /
          memcg foo : memory.oom_control(oom_kill)  1
           /
        memcg bar : memory.oom_control(oom_kill)  1
      
      For the non-root memcg, its memory.oom_control(oom_kill) includes its
      descendants' oom_kill, but for root memcg, it doesn't include its
      descendants' oom_kill.  That means, memory.oom_control(oom_kill) has
      different meanings in different memcgs.  That is inconsistent.  Then the
      user has to know whether the memcg is root or not.
      
      If we can't fully support it in cgroup1, for example by adding
      memory.events.local into cgroup1 as well, then let's not touch its
      original behavior.
      
      Fixes: 9852ae3f ("mm, memcg: consider subtrees in memory.events")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200502141055.7378-1-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
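The inconsistency described in commit 04fd61a4 can be reproduced with a tiny model of event propagation. `struct model_memcg` and `model_oom_kill_event()` are hypothetical; the `hierarchical` flag stands for "count the event in ancestors too", and the loop stopping before the root is an assumption chosen to match the counts in the message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical memcg node; parent is NULL for the root. */
struct model_memcg {
    struct model_memcg *parent;
    unsigned long oom_kill;
};

/* Record one oom_kill event in memcg. With hierarchical == true the
 * event is also counted in every non-root ancestor; with false it is
 * counted only locally. The root itself is never incremented here. */
void model_oom_kill_event(struct model_memcg *memcg, bool hierarchical)
{
    while (memcg && memcg->parent) {
        memcg->oom_kill++;
        if (!hierarchical)
            break;
        memcg = memcg->parent;
    }
}
```

With the root/foo/bar tree from the message, a hierarchical event in bar yields root 0, foo 1, bar 1 (the inconsistent picture), while a local-only event yields root 0, foo 0, bar 1.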
  11. 14 May, 2020 1 commit
    • P
      efi: cper: Add support for printing Firmware Error Record Reference · 3d8c11ef
      Punit Agrawal committed
      While debugging a boot failure, the following unknown error record was
      seen in the boot logs.
      
          <...>
          BERT: Error records from previous boot:
          [Hardware Error]: event severity: fatal
          [Hardware Error]:  Error 0, type: fatal
          [Hardware Error]:   section type: unknown, 81212a96-09ed-4996-9471-8d729c8e69ed
          [Hardware Error]:   section length: 0x290
          [Hardware Error]:   00000000: 00000001 00000000 00000000 00020002  ................
          [Hardware Error]:   00000010: 00020002 0000001f 00000320 00000000  ........ .......
          [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
          [Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
          <...>
      
      On further investigation, it was found that the error record with
      UUID (81212a96-09ed-4996-9471-8d729c8e69ed) has been defined in the
      UEFI Specification at least since v2.4 and has recently had additional
      fields defined in v2.7 Section N.2.10 Firmware Error Record Reference.
      
      Add support for parsing and printing the defined fields to give users
      a chance to figure out what went wrong.
      Signed-off-by: Punit Agrawal <punit1.agrawal@toshiba.co.jp>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: James Morse <james.morse@arm.com>
      Cc: linux-acpi@vger.kernel.org
      Cc: linux-efi@vger.kernel.org
      Link: https://lore.kernel.org/r/20200512045502.3810339-1-punit1.agrawal@toshiba.co.jp
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
  12. 13 May, 2020 2 commits
    • S
      x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up · 59566b0b
      Steven Rostedt (VMware) committed
      Booting one of my machines, it triggered the following crash:
      
       Kernel/User page tables isolation: enabled
       ftrace: allocating 36577 entries in 143 pages
       Starting tracer 'function'
       BUG: unable to handle page fault for address: ffffffffa000005c
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0003) - permissions violation
       PGD 2014067 P4D 2014067 PUD 2015063 PMD 7b253067 PTE 7b252061
       Oops: 0003 [#1] PREEMPT SMP PTI
       CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-test+ #24
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
       RIP: 0010:text_poke_early+0x4a/0x58
       Code: 34 24 48 89 54 24 08 e8 bf 72 0b 00 48 8b 34 24 48 8b 4c 24 08 84 c0 74 0b 48 89 df f3 a4 48 83 c4 10 5b c3 9c 58 fa 48 89 df <f3> a4 50 9d 48 83 c4 10 5b e9 d6 f9 ff ff
      0 41 57 49
       RSP: 0000:ffffffff82003d38 EFLAGS: 00010046
       RAX: 0000000000000046 RBX: ffffffffa000005c RCX: 0000000000000005
       RDX: 0000000000000005 RSI: ffffffff825b9a90 RDI: ffffffffa000005c
       RBP: ffffffffa000005c R08: 0000000000000000 R09: ffffffff8206e6e0
       R10: ffff88807b01f4c0 R11: ffffffff8176c106 R12: ffffffff8206e6e0
       R13: ffffffff824f2440 R14: 0000000000000000 R15: ffffffff8206eac0
       FS:  0000000000000000(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffffffa000005c CR3: 0000000002012000 CR4: 00000000000006b0
       Call Trace:
        text_poke_bp+0x27/0x64
        ? mutex_lock+0x36/0x5d
        arch_ftrace_update_trampoline+0x287/0x2d5
        ? ftrace_replace_code+0x14b/0x160
        ? ftrace_update_ftrace_func+0x65/0x6c
        __register_ftrace_function+0x6d/0x81
        ftrace_startup+0x23/0xc1
        register_ftrace_function+0x20/0x37
        func_set_flag+0x59/0x77
        __set_tracer_option.isra.19+0x20/0x3e
        trace_set_options+0xd6/0x13e
        apply_trace_boot_options+0x44/0x6d
        register_tracer+0x19e/0x1ac
        early_trace_init+0x21b/0x2c9
        start_kernel+0x241/0x518
        ? load_ucode_intel_bsp+0x21/0x52
        secondary_startup_64+0xa4/0xb0
      
      I was able to trigger it on other machines when I added both
      "ftrace=function" and "trace_options=func_stack_trace" to the kernel
      command line.
      
      The cause is that "ftrace=function" registers the function tracer
      and creates a trampoline, which it sets as executable and
      read-only. Then "trace_options=func_stack_trace" updates
      the same trampoline to include the stack tracer version of the function
      tracer. But since the trampoline already exists, it updates it with
      text_poke_bp(). The problem is that when text_poke_bp() is called while
      system_state == SYSTEM_BOOTING, it simply does a memcpy() and does not
      change the page mapping, as it assumes that the text is still read-write.
      But in this case it is not, and we take a fault and crash.
      
      Instead, let's keep the ftrace trampolines read-write during boot up,
      and then when the kernel executable text is set to read-only, the
      ftrace trampolines get set to read-only as well.
      
      Link: https://lkml.kernel.org/r/20200430202147.4dc6e2de@oasis.local.home
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: stable@vger.kernel.org
      Fixes: 768ae440 ("x86/ftrace: Use text_poke()")
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • J
      ptp: fix struct member comment for do_aux_work · 2c864c78
      Jacob Keller committed
      The do_aux_work callback had documentation in the structure comment
      which referred to it as "do_work".
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 10 May, 2020 1 commit
  14. 08 May, 2020 1 commit
  15. 07 May, 2020 2 commits
  16. 06 May, 2020 2 commits
  17. 05 May, 2020 6 commits
  18. 01 May, 2020 2 commits
    • K
      security: Fix the default value of fs_context_parse_param hook · 54261af4
      KP Singh committed
      security_fs_context_parse_param is called by vfs_parse_fs_param and
      a successful return value (i.e., 0) implies that a parameter will be
      consumed by the LSM framework. This stops all further parsing of the
      parameter by VFS. Furthermore, if an LSM hook returns a success, the
      remaining LSM hooks are not invoked for the parameter.
      
      The current default behavior of returning success means that all the
      parameters are expected to be parsed by the LSM hook and none of them
      end up being populated by vfs in fs_context.
      
      This was noticed when lsm=bpf is supplied on the command line before any
      other LSM. As the bpf lsm uses this default value to implement a default
      hook, this resulted in a failure to parse any fs_context parameters and
      a failure to mount the root filesystem.
      
      Fixes: 98e828a0 ("security: Refactor declaration of LSM hooks")
      Reported-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: James Morris <jmorris@namei.org>
    • P
      mptcp: move option parsing into mptcp_incoming_options() · cfde141e
      Paolo Abeni committed
      The mptcp_options_received structure carries several per
      packet flags (mp_capable, mp_join, etc.). Such fields must
      be cleared on each packet, even on dropped ones or packet
      not carrying any MPTCP options, but the current mptcp
      code clears them only on TCP option reset.
      
      In several races/corner cases we end up with stray bits in
      incoming options, leading to WARN_ON splats, e.g.:
      
      [  171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713
      [  171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata
      [  171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95
      [  171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
      [  171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe <0f> 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c
      [  171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282
      [  171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [  171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e
      [  171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955
      [  171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca
      [  171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020
      [  171.228460] FS:  00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000
      [  171.230065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0
      [  171.232586] Call Trace:
      [  171.233109]  <IRQ>
      [  171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691)
      [  171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832)
      [  171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1))
      [  171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217)
      [  171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822)
      [  171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730)
      [  171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009)
      [  171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1))
      [  171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232)
      [  171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252)
      [  171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539)
      [  171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135)
      [  171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640)
      [  171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293)
      [  171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083)
      [  171.282358]  </IRQ>
      
      We could address the issue by explicitly clearing the relevant fields
      in several places - tcp_parse_options(), tcp_fast_parse_options(),
      possibly others.
      
      Instead we move the MPTCP option parsing into the already existing
      mptcp ingress hook, so that we need to clear the fields in a single
      place.
      
      This allows us to drop an MPTCP hook from the TCP code and
      to remove the quite large mptcp_options_received from the tcp_sock
      struct. On the flip side, MPTCP sockets will traverse the
      option space twice (in tcp_parse_options() and in
      mptcp_incoming_options()). That looks acceptable: we already
      do that for syn and 3rd ack packets, plain TCP sockets will
      benefit from it, and even MPTCP sockets will experience better
      code locality, reducing the jumps between TCP and MPTCP code.
      
      v1 -> v2:
       - rebased on current '-net' tree
      
      Fixes: 648ef4b8 ("mptcp: Implement MPTCP receive path")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfde141e
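
The "clear the fields in a single place" idea can be illustrated with a small userspace sketch. This is not the kernel code: `struct mp_opts` and `parse_mptcp_options()` are hypothetical stand-ins for `mptcp_options_received` and the ingress hook, and only the DSS flags byte is extracted. The only protocol facts assumed are that MPTCP uses TCP option kind 30 and that the DSS subtype is 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OPT_EOL   0
#define OPT_NOP   1
#define OPT_MPTCP 30  /* TCP option kind used by MPTCP */
#define SUB_DSS   2   /* MPTCP DSS subtype */

/* Hypothetical, heavily reduced stand-in for mptcp_options_received. */
struct mp_opts {
    int has_dss;
    uint8_t dss_flags;
};

/* Single entry point: stale state from an earlier packet is wiped here,
 * so no other code path has to remember to clear it. */
static void parse_mptcp_options(const uint8_t *opt, size_t len,
                                struct mp_opts *out)
{
    assert(opt != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    size_t i = 0;
    while (i < len) {
        uint8_t kind = opt[i];
        if (kind == OPT_EOL)
            break;
        if (kind == OPT_NOP) {
            i++;
            continue;
        }
        if (i + 1 >= len)
            break;                      /* truncated option header */
        uint8_t olen = opt[i + 1];
        if (olen < 2 || i + olen > len)
            break;                      /* malformed length */
        if (kind == OPT_MPTCP && olen >= 4 &&
            (opt[i + 2] >> 4) == SUB_DSS) {
            out->has_dss = 1;
            out->dss_flags = opt[i + 3] & 0x1f;
        }
        i += olen;
    }
}
```

Because the reset happens inside the parser, a caller that reuses the same `struct mp_opts` for a packet without MPTCP options sees `has_dss == 0` rather than leftovers from the previous packet.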
  19. 30 Apr, 2020 2 commits
  20. 29 Apr, 2020 2 commits
    • O
      NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION · dff58530
      Olga Kornievskaia authored
      Currently, if the client sends BIND_CONN_TO_SESSION with
      NFS4_CDFC4_FORE_OR_BOTH but only gets NFS4_CDFS4_FORE back, it ignores
      the fact that it wasn't able to enable a backchannel.

      To make sure the backchannel gets enabled, the client sends
      BIND_CONN_TO_SESSION as the first operation on the connection (i.e.,
      no other session compounds have been sent before), and if the client's
      request to bind the backchannel is not satisfied, it resets the
      connection and retries.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      dff58530
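
The retry rule described in this commit can be sketched as a small userspace model. It is an illustration under stated assumptions, not the NFS client code: `backchannel_bind_failed()`, `bind_conn_with_retry()` and `fake_server()` are hypothetical names, and the connection reset is only marked by a comment. The direction values are the ones RFC 5661 defines for `channel_dir_from_client4` and `channel_dir_from_server4`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Direction values as defined in RFC 5661. */
enum { CDFC4_FORE = 0x1, CDFC4_BACK = 0x2,
       CDFC4_FORE_OR_BOTH = 0x3, CDFC4_BACK_OR_BOTH = 0x7 };
enum { CDFS4_FORE = 0x1, CDFS4_BACK = 0x2, CDFS4_BOTH = 0x3 };

/* The case from the commit message: both channels were requested,
 * but the server did not bind both. */
static bool backchannel_bind_failed(unsigned int requested,
                                    unsigned int granted)
{
    return requested == CDFC4_FORE_OR_BOTH && granted != CDFS4_BOTH;
}

/* Hypothetical transport hook: sends the request, returns the granted dir. */
typedef unsigned int (*bind_fn)(unsigned int requested, void *ctx);

/* Returns the number of attempts used, or -1 if all attempts failed. */
static int bind_conn_with_retry(bind_fn send_bind, void *ctx,
                                int max_attempts)
{
    assert(send_bind != NULL);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        unsigned int granted = send_bind(CDFC4_FORE_OR_BOTH, ctx);
        if (!backchannel_bind_failed(CDFC4_FORE_OR_BOTH, granted))
            return attempt;
        /* A real client would reset the connection here before retrying. */
    }
    return -1;
}

/* Fake server for demonstration: grants only the fore channel for the
 * first *ctx requests, then grants both. */
static unsigned int fake_server(unsigned int requested, void *ctx)
{
    int *fore_only_left = ctx;
    (void)requested;
    if (*fore_only_left > 0) {
        (*fore_only_left)--;
        return CDFS4_FORE;
    }
    return CDFS4_BOTH;
}
```

Bounding the number of attempts is a choice made for this sketch so that it terminates; the commit message itself does not say how many times the client retries.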
    • N
      SUNRPC: defer slow parts of rpc_free_client() to a workqueue. · 7c4310ff
      NeilBrown authored
      The rpciod workqueue is on the write-out path for freeing dirty memory,
      so it is important that it never block waiting for memory to be
      allocated - this can lead to a deadlock.
      
      rpc_execute() - which is often called by an rpciod work item - calls
      rpc_task_release_client() which can lead to rpc_free_client().
      
      rpc_free_client() makes two calls which could potentially block waiting
      for memory allocation.
      
      rpc_clnt_debugfs_unregister() calls into debugfs and will block while
      any of the debugfs files are being accessed.  In particular it can block
      while any of the 'open' methods are being called and all of these use
      malloc for one thing or another.  So this can deadlock if the memory
      allocation waits for NFS to complete some writes via rpciod.
      
      rpc_clnt_remove_pipedir() can take the inode_lock() and while it isn't
      obvious that memory allocations can happen while the lock is held, it is
      safer to assume they might and to not let rpciod call
      rpc_clnt_remove_pipedir().
      
      So this patch moves these two calls (together with the final kfree() and
      rpciod_down()) into a work-item to be run from the system work-queue.
      rpciod can continue its important work, and the final stages of the free
      can happen whenever they happen.
      
      I have seen this deadlock on a 4.12 based kernel where debugfs used
      synchronize_srcu() when removing objects.  synchronize_srcu() requires a
      workqueue and there were no free worker threads and none could be
      allocated.  While debugfs no longer uses SRCU, I believe the deadlock
      is still possible.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      7c4310ff
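
The deferral pattern this commit describes (the latency-critical path only queues a work item, and the slow teardown runs later from another context) can be modelled in plain userspace C. This is a sketch under assumptions, not the SUNRPC code: `struct work`, `queue_work_item()`, `client_release()` and `run_pending_work()` are hypothetical names, and the pending list stands in for the system workqueue. In the kernel, the equivalent would presumably be built on `INIT_WORK()` and `schedule_work()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal work item, embedded in the object it tears down. */
struct work {
    void (*fn)(struct work *w);
    struct work *next;
};

struct client {
    int id;
    struct work free_work;
};

static struct work *pending;        /* stand-in for the system workqueue */
static int slow_teardowns_done;     /* counts completed deferred frees */

static void queue_work_item(struct work *w)
{
    w->next = pending;
    pending = w;
}

/* Slow part: in the kernel this is where the debugfs and pipedir
 * cleanup and the final kfree() would happen. */
static void client_free_work(struct work *w)
{
    /* Recover the enclosing client from the embedded work item. */
    struct client *c =
        (struct client *)((char *)w - offsetof(struct client, free_work));
    slow_teardowns_done++;
    free(c);
}

/* Fast part: safe to call from the latency-critical path, never blocks. */
static void client_release(struct client *c)
{
    c->free_work.fn = client_free_work;
    queue_work_item(&c->free_work);
}

/* Run from some other context; returns the number of items processed. */
static int run_pending_work(void)
{
    int n = 0;
    while (pending) {
        struct work *w = pending;
        pending = w->next;          /* unlink before fn() frees the item */
        w->fn(w);
        n++;
    }
    return n;
}

static struct client *client_new(int id)
{
    struct client *c = malloc(sizeof(*c));
    assert(c != NULL);
    c->id = id;
    return c;
}
```

Embedding the work item in the client means the release path needs no allocation of its own, which matters when the caller must not wait for memory.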