1. 07 3月, 2019 3 次提交
  2. 28 2月, 2019 1 次提交
  3. 26 2月, 2019 1 次提交
    • L
      Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses" · 53a41cb7
      Linus Torvalds 提交于
      This reverts commit 9da3f2b7.
      
      It was well-intentioned, but wrong.  Overriding the exception tables for
      instructions for random reasons is just wrong, and that is what the new
      code did.
      
      It caused problems for tracing, and it caused problems for strncpy_from_user(),
      because the new checks made perfectly valid use cases break, rather than
      catch things that did bad things.
      
      Unchecked user space accesses are a problem, but that's not a reason to
      add invalid checks that then people have to work around with silly flags
      (in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
      odd way to say "this commit was wrong" and was sprinked into random
      places to hide the wrongness).
      
      The real fix to unchecked user space accesses is to get rid of the
      special "let's not check __get_user() and __put_user() at all" logic.
      Make __{get|put}_user() be just aliases to the regular {get|put}_user()
      functions, and make it impossible to access user space without having
      the proper checks in places.
      
      The raison d'être of the special double-underscore versions used to be
      that the range check was expensive, and if you did multiple user
      accesses, you'd do the range check up front (like the signal frame
      handling code, for example).  But SMAP (on x86) and PAN (on ARM) have
      made that optimization pointless, because the _real_ expense is the "set
      CPU flag to allow user space access".
      
      Do let's not break the valid cases to catch invalid cases that shouldn't
      even exist.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53a41cb7
  4. 24 2月, 2019 1 次提交
    • L
      net: phy: realtek: Dummy IRQ calls for RTL8366RB · 4c8e0459
      Linus Walleij 提交于
      This fixes a regression introduced by
      commit 0d2e778e
      "net: phy: replace PHY_HAS_INTERRUPT with a check for
      config_intr and ack_interrupt".
      
      This assumes that a PHY cannot trigger interrupt unless
      it has .config_intr() or .ack_interrupt() implemented.
      A later patch makes the code assume both need to be
      implemented for interrupts to be present.
      
      But this PHY (which is inside a DSA) will happily
      fire interrupts without either callback.
      
      Implement dummy callbacks for .config_intr() and
      .ack_interrupt() in the phy header to fix this.
      
      Tested on the RTL8366RB on D-Link DIR-685.
      
      Fixes: 0d2e778e ("net: phy: replace PHY_HAS_INTERRUPT with a check for config_intr and ack_interrupt")
      Cc: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: NLinus Walleij <linus.walleij@linaro.org>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c8e0459
  5. 22 2月, 2019 1 次提交
  6. 17 2月, 2019 1 次提交
  7. 16 2月, 2019 7 次提交
    • A
      efi/arm: Revert "Defer persistent reservations until after paging_init()" · 582a32e7
      Ard Biesheuvel 提交于
      This reverts commit eff89628, which
      deferred the processing of persistent memory reservations to a point
      where the memory may have already been allocated and overwritten,
      defeating the purpose.
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190215123333.21209-3-ard.biesheuvel@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      582a32e7
    • A
      arm64, mm, efi: Account for GICv3 LPI tables in static memblock reserve table · 8a5b403d
      Ard Biesheuvel 提交于
      In the irqchip and EFI code, we have what basically amounts to a quirk
      to work around a peculiarity in the GICv3 architecture, which permits
      the system memory address of LPI tables to be programmable only once
      after a CPU reset. This means kexec kernels must use the same memory
      as the first kernel, and thus ensure that this memory has not been
      given out for other purposes by the time the ITS init code runs, which
      is not very early for secondary CPUs.
      
      On systems with many CPUs, these reservations could overflow the
      memblock reservation table, and this was addressed in commit:
      
        eff89628 ("efi/arm: Defer persistent reservations until after paging_init()")
      
      However, this turns out to have made things worse, since the allocation
      of page tables and heap space for the resized memblock reservation table
      itself may overwrite the regions we are attempting to reserve, which may
      cause all kinds of corruption, also considering that the ITS will still
      be poking bits into that memory in response to incoming MSIs.
      
      So instead, let's grow the static memblock reservation table on such
      systems so it can accommodate these reservations at an earlier time.
      This will permit us to revert the above commit in a subsequent patch.
      
      [ mingo: Minor cleanups. ]
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190215123333.21209-2-ard.biesheuvel@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8a5b403d
    • W
      net: validate untrusted gso packets without csum offload · d5be7f63
      Willem de Bruijn 提交于
      Syzkaller again found a path to a kernel crash through bad gso input.
      By building an excessively large packet to cause an skb field to wrap.
      
      If VIRTIO_NET_HDR_F_NEEDS_CSUM was set this would have been dropped in
      skb_partial_csum_set.
      
      GSO packets that do not set checksum offload are suspicious and rare.
      Most callers of virtio_net_hdr_to_skb already pass them to
      skb_probe_transport_header.
      
      Move that test forward, change it to detect parse failure and drop
      packets on failure as those cleary are not one of the legitimate
      VIRTIO_NET_HDR_GSO types.
      
      Fixes: bfd5f4a3 ("packet: Add GSO/csum offload support.")
      Fixes: f43798c2 ("tun: Allow GSO using virtio_net_hdr")
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d5be7f63
    • H
      net: Fix for_each_netdev_feature on Big endian · 3b89ea9c
      Hauke Mehrtens 提交于
      The features attribute is of type u64 and stored in the native endianes on
      the system. The for_each_set_bit() macro takes a pointer to a 32 bit array
      and goes over the bits in this area. On little Endian systems this also
      works with an u64 as the most significant bit is on the highest address,
      but on big endian the words are swapped. When we expect bit 15 here we get
      bit 47 (15 + 32).
      
      This patch converts it more or less to its own for_each_set_bit()
      implementation which works on 64 bit integers directly. This is then
      completely in host endianness and should work like expected.
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      Signed-off-by: NHauke Mehrtens <hauke.mehrtens@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b89ea9c
    • D
      keys: Fix dependency loop between construction record and auth key · 822ad64d
      David Howells 提交于
      In the request_key() upcall mechanism there's a dependency loop by which if
      a key type driver overrides the ->request_key hook and the userspace side
      manages to lose the authorisation key, the auth key and the internal
      construction record (struct key_construction) can keep each other pinned.
      
      Fix this by the following changes:
      
       (1) Killing off the construction record and using the auth key instead.
      
       (2) Including the operation name in the auth key payload and making the
           payload available outside of security/keys/.
      
       (3) The ->request_key hook is given the authkey instead of the cons
           record and operation name.
      
      Changes (2) and (3) allow the auth key to naturally be cleaned up if the
      keyring it is in is destroyed or cleared or the auth key is unlinked.
      
      Fixes: 7ee02a316600 ("keys: Fix dependency loop between construction record and auth key")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NJames Morris <james.morris@microsoft.com>
      822ad64d
    • M
      include/linux/module.h: copy __init/__exit attrs to init/cleanup_module · a6e60d84
      Miguel Ojeda 提交于
      The upcoming GCC 9 release extends the -Wmissing-attributes warnings
      (enabled by -Wall) to C and aliases: it warns when particular function
      attributes are missing in the aliases but not in their target.
      
      In particular, it triggers for all the init/cleanup_module
      aliases in the kernel (defined by the module_init/exit macros),
      ending up being very noisy.
      
      These aliases point to the __init/__exit functions of a module,
      which are defined as __cold (among other attributes). However,
      the aliases themselves do not have the __cold attribute.
      
      Since the compiler behaves differently when compiling a __cold
      function as well as when compiling paths leading to calls
      to __cold functions, the warning is trying to point out
      the possibly-forgotten attribute in the alias.
      
      In order to keep the warning enabled, we decided to silence
      this case. Ideally, we would mark the aliases directly
      as __init/__exit. However, there are currently around 132 modules
      in the kernel which are missing __init/__exit in their init/cleanup
      functions (either because they are missing, or for other reasons,
      e.g. the functions being called from somewhere else); and
      a section mismatch is a hard error.
      
      A conservative alternative was to mark the aliases as __cold only.
      However, since we would like to eventually enforce __init/__exit
      to be always marked,  we chose to use the new __copy function
      attribute (introduced by GCC 9 as well to deal with this).
      With it, we copy the attributes used by the target functions
      into the aliases. This way, functions that were not marked
      as __init/__exit won't have their aliases marked either,
      and therefore there won't be a section mismatch.
      
      Note that the warning would go away marking either the extern
      declaration, the definition, or both. However, we only mark
      the definition of the alias, since we do not want callers
      (which only see the declaration) to be compiled as if the function
      was __cold (and therefore the paths leading to those calls
      would be assumed to be unlikely).
      
      Link: https://lore.kernel.org/lkml/20190123173707.GA16603@gmail.com/
      Link: https://lore.kernel.org/lkml/20190206175627.GA20399@gmail.com/Suggested-by: NMartin Sebor <msebor@gcc.gnu.org>
      Acked-by: NJessica Yu <jeyu@kernel.org>
      Signed-off-by: NMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      a6e60d84
    • M
      Compiler Attributes: add support for __copy (gcc >= 9) · c0d9782f
      Miguel Ojeda 提交于
      From the GCC manual:
      
        copy
        copy(function)
      
          The copy attribute applies the set of attributes with which function
          has been declared to the declaration of the function to which
          the attribute is applied. The attribute is designed for libraries
          that define aliases or function resolvers that are expected
          to specify the same set of attributes as their targets. The copy
          attribute can be used with functions, variables, or types. However,
          the kind of symbol to which the attribute is applied (either
          function or variable) must match the kind of symbol to which
          the argument refers. The copy attribute copies only syntactic and
          semantic attributes but not attributes that affect a symbol’s
          linkage or visibility such as alias, visibility, or weak.
          The deprecated attribute is also not copied.
      
        https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
      
      The upcoming GCC 9 release extends the -Wmissing-attributes warnings
      (enabled by -Wall) to C and aliases: it warns when particular function
      attributes are missing in the aliases but not in their target, e.g.:
      
          void __cold f(void) {}
          void __alias("f") g(void);
      
      diagnoses:
      
          warning: 'g' specifies less restrictive attribute than
          its target 'f': 'cold' [-Wmissing-attributes]
      
      Using __copy(f) we can copy the __cold attribute from f to g:
      
          void __cold f(void) {}
          void __copy(f) __alias("f") g(void);
      
      This attribute is most useful to deal with situations where an alias
      is declared but we don't know the exact attributes the target has.
      
      For instance, in the kernel, the widely used module_init/exit macros
      define the init/cleanup_module aliases, but those cannot be marked
      always as __init/__exit since some modules do not have their
      functions marked as such.
      Suggested-by: NMartin Sebor <msebor@gcc.gnu.org>
      Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: NMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      c0d9782f
  8. 15 2月, 2019 1 次提交
  9. 11 2月, 2019 2 次提交
    • J
      perf/x86: Add check_period PMU callback · 81ec3f3c
      Jiri Olsa 提交于
      Vince (and later on Ravi) reported crashes in the BTS code during
      fuzzing with the following backtrace:
      
        general protection fault: 0000 [#1] SMP PTI
        ...
        RIP: 0010:perf_prepare_sample+0x8f/0x510
        ...
        Call Trace:
         <IRQ>
         ? intel_pmu_drain_bts_buffer+0x194/0x230
         intel_pmu_drain_bts_buffer+0x160/0x230
         ? tick_nohz_irq_exit+0x31/0x40
         ? smp_call_function_single_interrupt+0x48/0xe0
         ? call_function_single_interrupt+0xf/0x20
         ? call_function_single_interrupt+0xa/0x20
         ? x86_schedule_events+0x1a0/0x2f0
         ? x86_pmu_commit_txn+0xb4/0x100
         ? find_busiest_group+0x47/0x5d0
         ? perf_event_set_state.part.42+0x12/0x50
         ? perf_mux_hrtimer_restart+0x40/0xb0
         intel_pmu_disable_event+0xae/0x100
         ? intel_pmu_disable_event+0xae/0x100
         x86_pmu_stop+0x7a/0xb0
         x86_pmu_del+0x57/0x120
         event_sched_out.isra.101+0x83/0x180
         group_sched_out.part.103+0x57/0xe0
         ctx_sched_out+0x188/0x240
         ctx_resched+0xa8/0xd0
         __perf_event_enable+0x193/0x1e0
         event_function+0x8e/0xc0
         remote_function+0x41/0x50
         flush_smp_call_function_queue+0x68/0x100
         generic_smp_call_function_single_interrupt+0x13/0x30
         smp_call_function_single_interrupt+0x3e/0xe0
         call_function_single_interrupt+0xf/0x20
         </IRQ>
      
      The reason is that while event init code does several checks
      for BTS events and prevents several unwanted config bits for
      BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows
      to create BTS event without those checks being done.
      
      Following sequence will cause the crash:
      
      If we create an 'almost' BTS event with precise_ip and callchains,
      and it into a BTS event it will crash the perf_prepare_sample()
      function because precise_ip events are expected to come
      in with callchain data initialized, but that's not the
      case for intel_pmu_drain_bts_buffer() caller.
      
      Adding a check_period callback to be called before the period
      is changed via PERF_EVENT_IOC_PERIOD. It will deny the change
      if the event would become BTS. Plus adding also the limit_period
      check as well.
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190204123532.GA4794@kravaSigned-off-by: NIngo Molnar <mingo@kernel.org>
      81ec3f3c
    • W
      bpf: only adjust gso_size on bytestream protocols · b90efd22
      Willem de Bruijn 提交于
      bpf_skb_change_proto and bpf_skb_adjust_room change skb header length.
      For GSO packets they adjust gso_size to maintain the same MTU.
      
      The gso size can only be safely adjusted on bytestream protocols.
      Commit d02f51cb ("bpf: fix bpf_skb_adjust_net/bpf_skb_proto_xlat
      to deal with gso sctp skbs") excluded SKB_GSO_SCTP.
      
      Since then type SKB_GSO_UDP_L4 has been added, whose contents are one
      gso_size unit per datagram. Also exclude these.
      
      Move from a blacklist to a whitelist check to future proof against
      additional such new GSO types, e.g., for fraglist based GRO.
      
      Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      b90efd22
  10. 08 2月, 2019 2 次提交
    • Z
      mmc: block: handle complete_work on separate workqueue · dcf6e2e3
      Zachary Hays 提交于
      The kblockd workqueue is created with the WQ_MEM_RECLAIM flag set.
      This generates a rescuer thread for that queue that will trigger when
      the CPU is under heavy load and collect the uncompleted work.
      
      In the case of mmc, this creates the possibility of a deadlock when
      there are multiple partitions on the device as other blk-mq work is
      also run on the same queue. For example:
      
      - worker 0 claims the mmc host to work on partition 1
      - worker 1 attempts to claim the host for partition 2 but has to wait
        for worker 0 to finish
      - worker 0 schedules complete_work to release the host
      - rescuer thread is triggered after time-out and collects the dangling
        work
      - rescuer thread attempts to complete the work in order starting with
        claim host
      - the task to release host is now blocked by a task to claim it and
        will never be called
      
      The above results in multiple hung tasks that lead to failures to
      mount partitions.
      
      Handling complete_work on a separate workqueue avoids this by keeping
      the work completion tasks separate from the other blk-mq work. This
      allows the host to be released without getting blocked by other tasks
      attempting to claim the host.
      Signed-off-by: NZachary Hays <zhays@lexmark.com>
      Fixes: 81196976 ("mmc: block: Add blk-mq support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NUlf Hansson <ulf.hansson@linaro.org>
      dcf6e2e3
    • J
      blktrace: Show requests without sector · 0803de78
      Jan Kara 提交于
      Currently, blktrace will not show requests that don't have any data as
      rq->__sector is initialized to -1 which is out of device range and thus
      discarded by act_log_check(). This is most notably the case for cache
      flush requests sent to the device. Fix the problem by making
      blk_rq_trace_sector() return 0 for requests without initialized sector.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0803de78
  11. 02 2月, 2019 3 次提交
    • J
      x86/resctrl: Avoid confusion over the new X86_RESCTRL config · e6d42931
      Johannes Weiner 提交于
      "Resource Control" is a very broad term for this CPU feature, and a term
      that is also associated with containers, cgroups etc. This can easily
      cause confusion.
      
      Make the user prompt more specific. Match the config symbol name.
      
       [ bp: In the future, the corresponding ARM arch-specific code will be
         under ARM_CPU_RESCTRL and the arch-agnostic bits will be carved out
         under the CPU_RESCTRL umbrella symbol. ]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Babu Moger <Babu.Moger@amd.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190130195621.GA30653@cmpxchg.org
      e6d42931
    • Q
      mm/hotplug: invalid PFNs from pfn_to_online_page() · b13bc351
      Qian Cai 提交于
      On an arm64 ThunderX2 server, the first kmemleak scan would crash [1]
      with CONFIG_DEBUG_VM_PGFLAGS=y due to page_to_nid() found a pfn that is
      not directly mapped (MEMBLOCK_NOMAP).  Hence, the page->flags is
      uninitialized.
      
      This is due to the commit 9f1eb38e ("mm, kmemleak: little
      optimization while scanning") starts to use pfn_to_online_page() instead
      of pfn_valid().  However, in the CONFIG_MEMORY_HOTPLUG=y case,
      pfn_to_online_page() does not call memblock_is_map_memory() while
      pfn_valid() does.
      
      Historically, the commit 68709f45 ("arm64: only consider memblocks
      with NOMAP cleared for linear mapping") causes pages marked as nomap
      being no long reassigned to the new zone in memmap_init_zone() by
      calling __init_single_page().
      
      Since the commit 2d070eab ("mm: consider zone which is not fully
      populated to have holes") introduced pfn_to_online_page() and was
      designed to return a valid pfn only, but it is clearly broken on arm64.
      
      Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can
      handle nomap thanks to the commit f52bb98f ("arm64: mm: always
      enable CONFIG_HOLES_IN_ZONE"), while it will be optimized away on
      architectures where have no HOLES_IN_ZONE.
      
      [1]
        Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006
        Mem abort info:
          ESR = 0x96000005
          Exception class = DABT (current EL), IL = 32 bits
          SET = 0, FnV = 0
          EA = 0, S1PTW = 0
        Data abort info:
          ISV = 0, ISS = 0x00000005
          CM = 0, WnR = 0
        Internal error: Oops: 96000005 [#1] SMP
        CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8
        pstate: 60400009 (nZCv daif +PAN -UAO)
        pc : page_mapping+0x24/0x144
        lr : __dump_page+0x34/0x3dc
        sp : ffff00003a5cfd10
        x29: ffff00003a5cfd10 x28: 000000000000802f
        x27: 0000000000000000 x26: 0000000000277d00
        x25: ffff000010791f56 x24: ffff7fe000000000
        x23: ffff000010772f8b x22: ffff00001125f670
        x21: ffff000011311000 x20: ffff000010772f8b
        x19: fffffffffffffffe x18: 0000000000000000
        x17: 0000000000000000 x16: 0000000000000000
        x15: 0000000000000000 x14: ffff802698b19600
        x13: ffff802698b1a200 x12: ffff802698b16f00
        x11: ffff802698b1a400 x10: 0000000000001400
        x9 : 0000000000000001 x8 : ffff00001121a000
        x7 : 0000000000000000 x6 : ffff0000102c53b8
        x5 : 0000000000000000 x4 : 0000000000000003
        x3 : 0000000000000100 x2 : 0000000000000000
        x1 : ffff000010772f8b x0 : ffffffffffffffff
        Process kmemleak (pid: 1408, stack limit = 0x(____ptrval____))
        Call trace:
         page_mapping+0x24/0x144
         __dump_page+0x34/0x3dc
         dump_page+0x28/0x4c
         kmemleak_scan+0x4ac/0x680
         kmemleak_scan_thread+0xb4/0xdc
         kthread+0x12c/0x13c
         ret_from_fork+0x10/0x18
        Code: d503201f f9400660 36000040 d1000413 (f9400661)
        ---[ end trace 4d4bd7f573490c8e ]---
        Kernel panic - not syncing: Fatal exception
        SMP: stopping secondary CPUs
        Kernel Offset: disabled
        CPU features: 0x002,20000c38
        Memory Limit: none
        ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Link: http://lkml.kernel.org/r/20190122132916.28360-1-cai@lca.pw
      Fixes: 9f1eb38e ("mm, kmemleak: little optimization while scanning")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b13bc351
    • T
      oom, oom_reaper: do not enqueue same task twice · 9bcdeb51
      Tetsuo Handa 提交于
      Arkadiusz reported that enabling memcg's group oom killing causes
      strange memcg statistics where there is no task in a memcg despite the
      number of tasks in that memcg is not 0.  It turned out that there is a
      bug in wake_oom_reaper() which allows enqueuing same task twice which
      makes impossible to decrease the number of tasks in that memcg due to a
      refcount leak.
      
      This bug existed since the OOM reaper became invokable from
      task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
      
        T1@P1     |T2@P1     |T3@P1     |OOM reaper
        ----------+----------+----------+------------
                                         # Processing an OOM victim in a different memcg domain.
                              try_charge()
                                mem_cgroup_out_of_memory()
                                  mutex_lock(&oom_lock)
                   try_charge()
                     mem_cgroup_out_of_memory()
                       mutex_lock(&oom_lock)
        try_charge()
          mem_cgroup_out_of_memory()
            mutex_lock(&oom_lock)
                                  out_of_memory()
                                    oom_kill_process(P1)
                                      do_send_sig_info(SIGKILL, @P1)
                                      mark_oom_victim(T1@P1)
                                      wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                                  mutex_unlock(&oom_lock)
                       out_of_memory()
                         mark_oom_victim(T2@P1)
                         wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                       mutex_unlock(&oom_lock)
            out_of_memory()
              mark_oom_victim(T1@P1)
              wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
            mutex_unlock(&oom_lock)
                                         # Completed processing an OOM victim in a different memcg domain.
                                         spin_lock(&oom_reaper_lock)
                                         # T1P1 is dequeued.
                                         spin_unlock(&oom_reaper_lock)
      
      but memcg's group oom killing made it easier to trigger this bug by
      calling wake_oom_reaper() on the same task from one out_of_memory()
      request.
      
      Fix this bug using an approach used by commit 855b0183 ("oom,
      oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
      side effect of this patch, this patch also avoids enqueuing multiple
      threads sharing memory via task_will_free_mem(current) path.
      
      Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
      Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
      Fixes: af8e15cc ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: NArkadiusz Miskiewicz <arekm@maven.pl>
      Tested-by: NArkadiusz Miskiewicz <arekm@maven.pl>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <asarai@suse.de>
      Cc: Jay Kamat <jgkamat@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9bcdeb51
  12. 01 2月, 2019 1 次提交
  13. 31 1月, 2019 8 次提交
    • J
      ide: ensure atapi sense request aren't preempted · 9a6d5488
      Jens Axboe 提交于
      There's an issue with how sense requests are handled in IDE. If ide-cd
      encounters an error, it queues a sense request. With how IDE request
      handling is done, this is the next request we need to handle. But it's
      impossible to guarantee this, as another request could come in between
      the sense being queued, and ->queue_rq() being run and handling it. If
      that request ALSO fails, then we attempt to doubly queue the single
      sense request we have.
      
      Since we only support one active request at the time, defer request
      processing when a sense request is queued.
      
      Fixes: 60033520 "ide: convert to blk-mq"
      Reported-by: NHe Zhe <zhe.he@windriver.com>
      Tested-by: NHe Zhe <zhe.he@windriver.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9a6d5488
    • Z
      irqchip/gic-v3-its: Fix ITT_entry_size accessor · 56841070
      Zenghui Yu 提交于
      According to ARM IHI 0069C (ID070116), we should use GITS_TYPER's
      bits [7:4] as ITT_entry_size instead of [8:4]. Although this is
      pretty annoying, it only results in a potential over-allocation
      of memory, and nothing bad happens.
      
      Fixes: 3dfa576b ("irqchip/gic-v3-its: Add probing for VLPI properties")
      Signed-off-by: NZenghui Yu <yuzenghui@huawei.com>
      [maz: massaged subject and commit message]
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      56841070
    • J
      net: stmmac: Fallback to Platform Data clock in Watchdog conversion · 4ec5302f
      Jose Abreu 提交于
      If we don't have DT then stmmac_clk will not be available. Let's add a
      new Platform Data field so that we can specify the refclk by this mean.
      
      This way we can still use the coalesce command in PCI based setups.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ec5302f
    • D
      ipvlan, l3mdev: fix broken l3s mode wrt local routes · d5256083
      Daniel Borkmann 提交于
      While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
      I ran into the issue that while l3 mode is working fine, l3s mode
      does not have any connectivity to kube-apiserver and hence all pods
      end up in Error state as well. The ipvlan master device sits on
      top of a bond device and hostns traffic to kube-apiserver (also running
      in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
      where the latter is the address of the bond0. While in l3 mode, a
      curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
      works fine from hostns, neither of them do in case of l3s. In the
      latter only a curl to https://127.0.0.1:37573 appeared to work where
      for local addresses of bond0 I saw kernel suddenly starting to emit
      ARP requests to query HW address of bond0 which remained unanswered
      and neighbor entries in INCOMPLETE state. These ARP requests only
      happen while in l3s.
      
      Debugging this further, I found the issue is that l3s mode is piggy-
      backing on l3 master device, and in this case local routes are using
      l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
      f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev
      if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be
      a loopback"). I found that reverting them back into using the
      net->loopback_dev fixed ipvlan l3s connectivity and got everything
      working for the CNI.
      
      Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the
      l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
      on l3 master device is to get the l3mdev_ip_rcv() receive hook for
      setting the dst entry of the input route without adding its own
      ipvlan specific hacks into the receive path, however, any l3 domain
      semantics beyond just that are breaking l3s operation. Note that
      ipvlan also has the ability to dynamically switch its internal
      operation from l3 to l3s for all ports via ipvlan_set_port_mode()
      at runtime. In any case, l3 vs l3s soley distinguishes itself by
      'de-confusing' netfilter through switching skb->dev to ipvlan slave
      device late in NF_INET_LOCAL_IN before handing the skb to L4.
      
      Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
      if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
      without any additional l3mdev semantics on top. This should also have
      minimal impact since dev->priv_flags is already hot in cache. With
      this set, l3s mode is working fine and I also get things like
      masquerading pod traffic on the ipvlan master properly working.
      
        [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Martynas Pumputis <m@lambda.lt>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d5256083
    • V
      PM-runtime: Fix deadlock with ktime_get() · 15efb47d
      Vincent Guittot 提交于
      A deadlock has been seen when swicthing clocksources which use
      PM-runtime.  The call path is:
      
      change_clocksource
          ...
          write_seqcount_begin
          ...
          timekeeping_update
              ...
              sh_cmt_clocksource_enable
                  ...
                  rpm_resume
                      pm_runtime_mark_last_busy
                          ktime_get
                              do
                                  read_seqcount_begin
                              while read_seqcount_retry
          ....
          write_seqcount_end
      
      Although we should be safe because we haven't yet changed the
      clocksource at that time, we can't do that because of seqcount
      protection.
      
      Use ktime_get_mono_fast_ns() instead which is lock safe for such
      cases.
      
      With ktime_get_mono_fast_ns, the timestamp is not guaranteed to be
      monotonic across an update and as a result can goes backward.
      According to update_fast_timekeeper() description: "In the worst
      case, this can result is a slightly wrong timestamp (a few
      nanoseconds)". For PM-runtime autosuspend, this means only that
      the suspend decision may be slightly suboptimal.
      
      Fixes: 8234f673 ("PM-runtime: Switch autosuspend over to using hrtimers")
      Reported-by: NBiju Das <biju.das@bp.renesas.com>
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: NUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      15efb47d
    • W
      fs/dcache: Track & report number of negative dentries · af0c9af1
      Waiman Long 提交于
      The current dentry number tracking code doesn't distinguish between
      positive & negative dentries.  It just reports the total number of
      dentries in the LRU lists.
      
      As excessive number of negative dentries can have an impact on system
      performance, it will be wise to track the number of positive and
      negative dentries separately.
      
      This patch adds tracking for the total number of negative dentries in
      the system LRU lists and reports it in the 5th field in the
      /proc/sys/fs/dentry-state file.  The number, however, does not include
      negative dentries that are in flight but not in the LRU yet as well as
      those in the shrinker lists which are on the way out anyway.
      
      The number of positive dentries in the LRU lists can be roughly found by
      subtracting the number of negative dentries from the unused count.
      
      Matthew Wilcox had confirmed that since the introduction of the
      dentry_stat structure in 2.1.60, the dummy array was there, probably for
      future extension.  They were not replacements of pre-existing fields.
      So no sane applications that read the value of /proc/sys/fs/dentry-state
      will do dummy thing if the last 2 fields of the sysctl parameter are not
      zero.  IOW, it will be safe to use one of the dummy array entry for
      negative dentry count.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af0c9af1
    • W
      fs: Don't need to put list_lru into its own cacheline · 7d10f70f
      Waiman Long 提交于
      The list_lru structure is essentially just a pointer to a table of
      per-node LRU lists.  Even if CONFIG_MEMCG_KMEM is defined, the list
      field is just used for LRU list registration and shrinker_id is set at
      initialization.  Those fields won't need to be touched that often.
      
      So there is no point to make the list_lru structures to sit in their own
      cachelines.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d10f70f
    • J
      cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM · b284909a
      Josh Poimboeuf 提交于
      With the following commit:
      
        73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      
      ... the hotplug code attempted to detect when SMT was disabled by BIOS,
      in which case it reported SMT as permanently disabled.  However, that
      code broke a virt hotplug scenario, where the guest is booted with only
      primary CPU threads, and a sibling is brought online later.
      
      The problem is that there doesn't seem to be a way to reliably
      distinguish between the HW "SMT disabled by BIOS" case and the virt
      "sibling not yet brought online" case.  So the above-mentioned commit
      was a bit misguided, as it permanently disabled SMT for both cases,
      preventing future virt sibling hotplugs.
      
      Going back and reviewing the original problems which were attempted to
      be solved by that commit, when SMT was disabled in BIOS:
      
        1) /sys/devices/system/cpu/smt/control showed "on" instead of
           "notsupported"; and
      
        2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
      
      I'd propose that we instead consider #1 above to not actually be a
      problem.  Because, at least in the virt case, it's possible that SMT
      wasn't disabled by BIOS and a sibling thread could be brought online
      later.  So it makes sense to just always default the smt control to "on"
      to allow for that possibility (assuming cpuid indicates that the CPU
      supports SMT).
      
      The real problem is #2, which has a simple fix: change vmx_vm_init() to
      query the actual current SMT state -- i.e., whether any siblings are
      currently online -- instead of looking at the SMT "control" sysfs value.
      
      So fix it by:
      
        a) reverting the original "fix" and its followup fix:
      
           73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
           bc2d8d26 ("cpu/hotplug: Fix SMT supported evaluation")
      
           and
      
        b) changing vmx_vm_init() to query the actual current SMT state --
           instead of the sysfs control value -- to determine whether the L1TF
           warning is needed.  This also requires the 'sched_smt_present'
           variable to exported, instead of 'cpu_smt_control'.
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: NIgor Mammedov <imammedo@redhat.com>
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
      b284909a
  14. 29 1月, 2019 1 次提交
  15. 26 1月, 2019 1 次提交
  16. 25 1月, 2019 2 次提交
  17. 23 1月, 2019 3 次提交
  18. 22 1月, 2019 1 次提交
    • D
      libnvdimm/security: Require nvdimm_security_setup_events() to succeed · 1cd73865
      Dan Williams 提交于
      The following warning:
      
          ACPI0012:00: security event setup failed: -19
      
      ...is meant to capture exceptional failures of sysfs_get_dirent(),
      however it will also fail in the common case when security support is
      disabled. A few issues:
      
      1/ A dev_warn() report for a common case is too chatty
      2/ The setup of this notifier is generic, no need for it to be driven
         from the nfit driver, it can exist completely in the core.
      3/ If it fails for any reason besides security support being disabled,
         that's fatal and should abort DIMM activation. Userspace may hang if
         it never gets overwrite notifications.
      4/ The dirent needs to be released.
      
      Move the call to the core 'dimm' driver, make it conditional on security
      support being active, make it fatal for the exceptional case, add the
      missing sysfs_put() at device disable time.
      
      Fixes: 7d988097 ("...Add security DSM overwrite support")
      Reviewed-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      1cd73865