1. 06 12月, 2018 10 次提交
    • S
      function_graph: Make ftrace_push_return_trace() static · 72c33b23
      Steven Rostedt (VMware) 提交于
      commit d125f3f866df88da5a85df00291f88f0baa89f7c upstream.
      
      As all architectures now call function_graph_enter() to do the entry work,
      no architecture should ever call ftrace_push_return_trace(). Make it static.
      
      This is needed to prepare for a fix of a design bug on how the curr_ret_stack
      is used.
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      72c33b23
    • S
      function_graph: Create function_graph_enter() to consolidate architecture code · 67d7bec3
      Steven Rostedt (VMware) 提交于
      commit 8114865ff82e200b383e46821c25cb0625b842b5 upstream.
      
      Currently all the architectures do basically the same thing in preparing the
      function graph tracer on entry to a function. This code can be pulled into a
      generic location and then this will allow the function graph tracer to be
      fixed, as well as extended.
      
      Create a new function graph helper function_graph_enter() that will call the
      hook function (ftrace_graph_entry) and the shadow stack operation
      (ftrace_push_return_trace), and remove the need of the architecture code to
      manage the shadow stack.
      
      This is needed to prepare for a fix of a design bug on how the curr_ret_stack
      is used.
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      67d7bec3
    • T
      x86/speculation: Add prctl() control for indirect branch speculation · 238ba6e7
      Thomas Gleixner 提交于
      commit 9137bb27e60e554dab694eafa4cca241fa3a694f upstream
      
      Add the PR_SPEC_INDIRECT_BRANCH option for the PR_GET_SPECULATION_CTRL and
      PR_SET_SPECULATION_CTRL prctls to allow fine grained per task control of
      indirect branch speculation via STIBP and IBPB.
      
      Invocations:
       Check indirect branch speculation status with
       - prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
      
       Enable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
      
       Disable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
      
       Force disable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
      
      See Documentation/userspace-api/spec_ctrl.rst.
      Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.866780996@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      238ba6e7
    • T
      ptrace: Remove unused ptrace_may_access_sched() and MODE_IBRS · aecb9969
      Thomas Gleixner 提交于
      commit 46f7ecb1e7359f183f5bbd1e08b90e10e52164f9 upstream
      
      The IBPB control code in x86 removed the usage. Remove the functionality
      which was introduced for this.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.559149393@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aecb9969
    • T
      x86/speculation: Rework SMT state change · f55e301e
      Thomas Gleixner 提交于
      commit a74cfffb03b73d41e08f84c2e5c87dec0ce3db9f upstream
      
      arch_smt_update() is only called when the sysfs SMT control knob is
      changed. This means that when SMT is enabled in the sysfs control knob the
      system is considered to have SMT active even if all siblings are offline.
      
      To allow finegrained control of the speculation mitigations, the actual SMT
      state is more interesting than the fact that siblings could be enabled.
      
      Rework the code, so arch_smt_update() is invoked from each individual CPU
      hotplug function, and simplify the update function while at it.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.521974984@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f55e301e
    • T
      sched/smt: Expose sched_smt_present static key · 340693ee
      Thomas Gleixner 提交于
      commit 321a874a7ef85655e93b3206d0f36b4a6097f948 upstream
      
      Make the scheduler's 'sched_smt_present' static key globaly available, so
      it can be used in the x86 speculation control code.
      
      Provide a query function and a stub for the CONFIG_SMP=n case.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      340693ee
    • J
      x86/speculation: Apply IBPB more strictly to avoid cross-process data leak · cacd9385
      Jiri Kosina 提交于
      commit dbfe2953f63c640463c630746cd5d9de8b2f63ae upstream
      
      Currently, IBPB is only issued in cases when switching into a non-dumpable
      process, the rationale being to protect such 'important and security
      sensitive' processess (such as GPG) from data leaking into a different
      userspace process via spectre v2.
      
      This is however completely insufficient to provide proper userspace-to-userpace
      spectrev2 protection, as any process can poison branch buffers before being
      scheduled out, and the newly scheduled process immediately becomes spectrev2
      victim.
      
      In order to minimize the performance impact (for usecases that do require
      spectrev2 protection), issue the barrier only in cases when switching between
      processess where the victim can't be ptraced by the potential attacker (as in
      such cases, the attacker doesn't have to bother with branch buffers at all).
      
      [ tglx: Split up PTRACE_MODE_NOACCESS_CHK into PTRACE_MODE_SCHED and
        PTRACE_MODE_IBPB to be able to do ptrace() context tracking reasonably
        fine-grained ]
      
      Fixes: 18bf3c3e ("x86/speculation: Use Indirect Branch Prediction Barrier in context switch")
      Originally-by: NTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc:  "WoodhouseDavid" <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc:  "SchauflerCasey" <casey.schaufler@intel.com>
      Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251437340.15880@cbobk.fhfr.pmSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      cacd9385
    • E
      tcp: defer SACK compression after DupThresh · aaa7e45c
      Eric Dumazet 提交于
      [ Upstream commit 86de5921 ]
      
      Jean-Louis reported a TCP regression and bisected to recent SACK
      compression.
      
      After a loss episode (receiver not able to keep up and dropping
      packets because its backlog is full), linux TCP stack is sending
      a single SACK (DUPACK).
      
      Sender waits a full RTO timer before recovering losses.
      
      While RFC 6675 says in section 5, "Algorithm Details",
      
         (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
             indicating at least three segments have arrived above the current
             cumulative acknowledgment point, which is taken to indicate loss
             -- go to step (4).
      ...
         (4) Invoke fast retransmit and enter loss recovery as follows:
      
      there are old TCP stacks not implementing this strategy, and
      still counting the dupacks before starting fast retransmit.
      
      While these stacks probably perform poorly when receivers implement
      LRO/GRO, we should be a little more gentle to them.
      
      This patch makes sure we do not enable SACK compression unless
      3 dupacks have been sent since last rcv_nxt update.
      
      Ideally we should even rearm the timer to send one or two
      more DUPACK if no more packets are coming, but that will
      be work aiming for linux-4.21.
      
      Many thanks to Jean-Louis for bisecting the issue, providing
      packet captures and testing this patch.
      
      Fixes: 5d9f4262 ("tcp: add SACK compression")
      Reported-by: NJean-Louis Dupond <jean-louis@dupond.be>
      Tested-by: NJean-Louis Dupond <jean-louis@dupond.be>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aaa7e45c
    • T
      net/dim: Update DIM start sample after each DIM iteration · b8e07695
      Tal Gilboa 提交于
      [ Upstream commit 0211dda68a4f6531923a2f72d8e8959207f59fba ]
      
      On every iteration of net_dim, the algorithm may choose to
      check for the system state by comparing current data sample
      with previous data sample. After each of these comparison,
      regardless of the action taken, the sample used as baseline
      is needed to be updated.
      
      This patch fixes a bug that causes DIM to take wrong decisions,
      due to never updating the baseline sample for comparison between
      iterations. This way, DIM always compares current sample with
      zeros.
      
      Although this is a functional fix, it also improves and stabilizes
      performance as the algorithm works properly now.
      
      Performance:
      Tested single UDP TX stream with pktgen:
      samples/pktgen/pktgen_sample03_burst_single_flow.sh -i p4p2 -d 1.1.1.1
      -m 24:8a:07:88:26:8b -f 3 -b 128
      
      ConnectX-5 100GbE packet rate improved from 15-19Mpps to 19-20Mpps.
      Also, toggling between profiles is less frequent with the fix.
      
      Fixes: 8115b750 ("net/dim: use struct net_dim_sample as arg to net_dim")
      Signed-off-by: NTal Gilboa <talgi@mellanox.com>
      Reviewed-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b8e07695
    • W
      packet: copy user buffers before orphan or clone · f2a67e68
      Willem de Bruijn 提交于
      [ Upstream commit 5cd8d46ea1562be80063f53c7c6a5f40224de623 ]
      
      tpacket_snd sends packets with user pages linked into skb frags. It
      notifies that pages can be reused when the skb is released by setting
      skb->destructor to tpacket_destruct_skb.
      
      This can cause data corruption if the skb is orphaned (e.g., on
      transmit through veth) or cloned (e.g., on mirror to another psock).
      
      Create a kernel-private copy of data in these cases, same as tun/tap
      zerocopy transmission. Reuse that infrastructure: mark the skb as
      SKBTX_ZEROCOPY_FRAG, which will trigger copy in skb_orphan_frags(_rx).
      
      Unlike other zerocopy packets, do not set shinfo destructor_arg to
      struct ubuf_info. tpacket_destruct_skb already uses that ptr to notify
      when the original skb is released and a timestamp is recorded. Do not
      change this timestamp behavior. The ubuf_info->callback is not needed
      anyway, as no zerocopy notification is expected.
      
      Mark destructor_arg as not-a-uarg by setting the lower bit to 1. The
      resulting value is not a valid ubuf_info pointer, nor a valid
      tpacket_snd frame address. Add skb_zcopy_.._nouarg helpers for this.
      
      The fix relies on features introduced in commit 52267790 ("sock:
      add MSG_ZEROCOPY"), so can be backported as is only to 4.14.
      
      Tested with from `./in_netns.sh ./txring_overwrite` from
      http://github.com/wdebruij/kerneltools/tests
      
      Fixes: 69e3c75f ("net: TX_RING and packet mmap")
      Reported-by: NAnand H. Krishnan <anandhkrishnan@gmail.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f2a67e68
  2. 01 12月, 2018 5 次提交
  3. 27 11月, 2018 4 次提交
  4. 21 11月, 2018 4 次提交
  5. 14 11月, 2018 8 次提交
    • H
      media: hdmi.h: rename ADOBE_RGB to OPRGB and ADOBE_YCC to OPYCC · 3f7987f8
      Hans Verkuil 提交于
      commit 463659a0 upstream.
      
      These names have been renamed in the CTA-861 standard due to trademark
      issues. Replace them here as well so they are in sync with the standard.
      Signed-off-by: NHans Verkuil <hansverk@cisco.com>
      Cc: stable@vger.kernel.org
      Acked-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: NMauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3f7987f8
    • M
      TC: Set DMA masks for devices · b0b62843
      Maciej W. Rozycki 提交于
      commit 3f2aa244 upstream.
      
      Fix a TURBOchannel support regression with commit 205e1b7f
      ("dma-mapping: warn when there is no coherent_dma_mask") that caused
      coherent DMA allocations to produce a warning such as:
      
      defxx: v1.11 2014/07/01  Lawrence V. Stefani and others
      tc1: DEFTA at MMIO addr = 0x1e900000, IRQ = 20, Hardware addr = 08-00-2b-a3-a3-29
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 1 at ./include/linux/dma-mapping.h:516 dfx_dev_register+0x670/0x678
      Modules linked in:
      CPU: 0 PID: 1 Comm: swapper Not tainted 4.19.0-rc6 #2
      Stack : ffffffff8009ffc0 fffffffffffffec0 0000000000000000 ffffffff80647650
              0000000000000000 0000000000000000 ffffffff806f5f80 ffffffffffffffff
              0000000000000000 0000000000000000 0000000000000001 ffffffff8065d4e8
              98000000031b6300 ffffffff80563478 ffffffff805685b0 ffffffffffffffff
              0000000000000000 ffffffff805d6720 0000000000000204 ffffffff80388df8
              0000000000000000 0000000000000009 ffffffff8053efd0 ffffffff806657d0
              0000000000000000 ffffffff803177f8 0000000000000000 ffffffff806d0000
              9800000003078000 980000000307b9e0 000000001e900000 ffffffff80067940
              0000000000000000 ffffffff805d6720 0000000000000204 ffffffff80388df8
              ffffffff805176c0 ffffffff8004dc78 0000000000000000 ffffffff80067940
              ...
      Call Trace:
      [<ffffffff8004dc78>] show_stack+0xa0/0x130
      [<ffffffff80067940>] __warn+0x128/0x170
      ---[ end trace b1d1e094f67f3bb2 ]---
      
      This is because the TURBOchannel bus driver fails to set the coherent
      DMA mask for devices enumerated.
      
      Set the regular and coherent DMA masks for TURBOchannel devices then,
      observing that the bus protocol supports a 34-bit (16GiB) DMA address
      space, by interpreting the value presented in the address cycle across
      the 32 `ad' lines as a 32-bit word rather than byte address[1].  The
      architectural size of the TURBOchannel DMA address space exceeds the
      maximum amount of RAM any actual TURBOchannel system in existence may
      have, hence both masks are the same.
      
      This removes the warning shown above.
      
      References:
      
      [1] "TURBOchannel Hardware Specification", EK-369AA-OD-007B, Digital
          Equipment Corporation, January 1993, Section "DMA", pp. 1-15 -- 1-17
      Signed-off-by: NMaciej W. Rozycki <macro@linux-mips.org>
      Signed-off-by: NPaul Burton <paul.burton@mips.com>
      Patchwork: https://patchwork.linux-mips.org/patch/20835/
      Fixes: 205e1b7f ("dma-mapping: warn when there is no coherent_dma_mask")
      Cc: stable@vger.kernel.org # 4.16+
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b0b62843
    • J
      fsnotify: Fix busy inodes during unmount · 778af261
      Jan Kara 提交于
      commit 721fb6fb upstream.
      
      Detaching of mark connector from fsnotify_put_mark() can race with
      unmounting of the filesystem like:
      
        CPU1				CPU2
      fsnotify_put_mark()
        spin_lock(&conn->lock);
        ...
        inode = fsnotify_detach_connector_from_object(conn)
        spin_unlock(&conn->lock);
      				generic_shutdown_super()
      				  fsnotify_unmount_inodes()
      				    sees connector detached for inode
      				      -> nothing to do
      				  evict_inode()
      				    barfs on pending inode reference
        iput(inode);
      
      Resulting in "Busy inodes after unmount" message and possible kernel
      oops. Make fsnotify_unmount_inodes() properly wait for outstanding inode
      references from detached connectors.
      
      Note that the accounting of outstanding inode references in the
      superblock can cause some cacheline contention on the counter. OTOH it
      happens only during deletion of the last notification mark from an inode
      (or during unlinking of watched inode) and that is not too bad. I have
      measured time to create & delete inotify watch 100000 times from 64
      processes in parallel (each process having its own inotify group and its
      own file on a shared superblock) on a 64 CPU machine. Average and
      standard deviation of 15 runs look like:
      
      	Avg		Stddev
      Vanilla	9.817400	0.276165
      Fixed	9.710467	0.228294
      
      So there's no statistically significant difference.
      
      Fixes: 6b3f05d2 ("fsnotify: Detach mark from object list when last reference is dropped")
      CC: stable@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      778af261
    • E
      signal: Guard against negative signal numbers in copy_siginfo_from_user32 · 04eb7194
      Eric W. Biederman 提交于
      commit a3670058 upstream.
      
      While fixing an out of bounds array access in known_siginfo_layout
      reported by the kernel test robot it became apparent that the same bug
      exists in siginfo_layout and affects copy_siginfo_from_user32.
      
      The straight forward fix that makes guards against making this mistake
      in the future and should keep the code size small is to just take an
      unsigned signal number instead of a signed signal number, as I did to
      fix known_siginfo_layout.
      
      Cc: stable@vger.kernel.org
      Fixes: cc731525 ("signal: Remove kernel interal si_code magic")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      04eb7194
    • N
      scsi: sched/wait: Add wait_event_lock_irq_timeout for TASK_UNINTERRUPTIBLE usage · fd594155
      Nicholas Bellinger 提交于
      commit 25ab0bc3 upstream.
      
      Short of reverting commit 00d909a1 ("scsi: target: Make the session
      shutdown code also wait for commands that are being aborted") for v4.19,
      target-core needs a wait_event_t macro can be executed using
      TASK_UNINTERRUPTIBLE to function correctly with existing fabric drivers that
      expect to run with signals pending during session shutdown and active se_cmd
      I/O quiesce.
      
      The most notable is iscsi-target/iser-target, while ibmvscsi_tgt invokes
      session shutdown logic from userspace via configfs attribute that could also
      potentially have signals pending.
      
      So go ahead and introduce wait_event_lock_irq_timeout() to achieve this, and
      update + rename __wait_event_lock_irq_timeout() to make it accept 'state' as a
      parameter.
      
      Fixes: 00d909a1 ("scsi: target: Make the session shutdown code also wait for commands that are being aborted")
      Cc: <stable@vger.kernel.org> # v4.19+
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NNicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      Reviewed-by: NBryant G. Ly <bly@catalogicsoftware.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fd594155
    • W
      signal: Introduce COMPAT_SIGMINSTKSZ for use in compat_sys_sigaltstack · 5445a4b0
      Will Deacon 提交于
      [ Upstream commit 22839869f21ab3850fbbac9b425ccc4c0023926f ]
      
      The sigaltstack(2) system call fails with -ENOMEM if the new alternative
      signal stack is found to be smaller than SIGMINSTKSZ. On architectures
      such as arm64, where the native value for SIGMINSTKSZ is larger than
      the compat value, this can result in an unexpected error being reported
      to a compat task. See, for example:
      
        https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904385
      
      This patch fixes the problem by extending do_sigaltstack to take the
      minimum signal stack size as an additional parameter, allowing the
      native and compat system call entry code to pass in their respective
      values. COMPAT_SIGMINSTKSZ is just defined as SIGMINSTKSZ if it has not
      been defined by the architecture.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Reported-by: NSteve McIntyre <steve.mcintyre@arm.com>
      Tested-by: NSteve McIntyre <93sam@debian.org>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5445a4b0
    • J
      nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/O · 4b445d47
      James Smart 提交于
      [ Upstream commit 783f4a4408e1251d17f333ad56abac24dde988b9 ]
      
      When an io is rejected by nvmf_check_ready() due to validation of the
      controller state, the nvmf_fail_nonready_command() will normally return
      BLK_STS_RESOURCE to requeue and retry.  However, if the controller is
      dying or the I/O is marked for NVMe multipath, the I/O is failed so that
      the controller can terminate or so that the io can be issued on a
      different path.  Unfortunately, as this reject point is before the
      transport has accepted the command, blk-mq ends up completing the I/O
      and never calls nvme_complete_rq(), which is where multipath may preserve
      or re-route the I/O. The end result is, the device user ends up seeing an
      EIO error.
      
      Example: single path connectivity, controller is under load, and a reset
      is induced.  An I/O is received:
      
        a) while the reset state has been set but the queues have yet to be
           stopped; or
        b) after queues are started (at end of reset) but before the reconnect
           has completed.
      
      The I/O finishes with an EIO status.
      
      This patch makes the following changes:
      
        - Adds the HOST_PATH_ERROR pathing status from TP4028
        - Modifies the reject point such that it appears to queue successfully,
          but actually completes the io with the new pathing status and calls
          nvme_complete_rq().
        - nvme_complete_rq() recognizes the new status, avoids resetting the
          controller (likely was already done in order to get this new status),
          and calls the multipather to clear the current path that errored.
          This allows the next command (retry or new command) to select a new
          path if there is one.
      Signed-off-by: NJames Smart <jsmart2021@gmail.com>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4b445d47
    • D
      bpf: fix partial copy of map_ptr when dst is scalar · d93080cd
      Daniel Borkmann 提交于
      commit 0962590e553331db2cc0aef2dc35c57f6300dbbe upstream.
      
      ALU operations on pointers such as scalar_reg += map_value_ptr are
      handled in adjust_ptr_min_max_vals(). Problem is however that map_ptr
      and range in the register state share a union, so transferring state
      through dst_reg->range = ptr_reg->range is just buggy as any new
      map_ptr in the dst_reg is then truncated (or null) for subsequent
      checks. Fix this by adding a raw member and use it for copying state
      over to dst_reg.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Edward Cree <ecree@solarflare.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d93080cd
  6. 18 10月, 2018 2 次提交
    • L
      mremap: properly flush TLB before releasing the page · eb66ae03
      Linus Torvalds 提交于
      Jann Horn points out that our TLB flushing was subtly wrong for the
      mremap() case.  What makes mremap() special is that we don't follow the
      usual "add page to list of pages to be freed, then flush tlb, and then
      free pages".  No, mremap() obviously just _moves_ the page from one page
      table location to another.
      
      That matters, because mremap() thus doesn't directly control the
      lifetime of the moved page with a freelist: instead, the lifetime of the
      page is controlled by the page table locking, that serializes access to
      the entry.
      
      As a result, we need to flush the TLB not just before releasing the lock
      for the source location (to avoid any concurrent accesses to the entry),
      but also before we release the destination page table lock (to avoid the
      TLB being flushed after somebody else has already done something to that
      page).
      
      This also makes the whole "need_flush" logic unnecessary, since we now
      always end up flushing the TLB for every valid entry.
      Reported-and-tested-by: NJann Horn <jannh@google.com>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Tested-by: NIngo Molnar <mingo@kernel.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb66ae03
    • M
      tracepoint: Fix tracepoint array element size mismatch · 9c0be3f6
      Mathieu Desnoyers 提交于
      commit 46e0c9be ("kernel: tracepoints: add support for relative
      references") changes the layout of the __tracepoint_ptrs section on
      architectures supporting relative references. However, it does so
      without turning struct tracepoint * const into const int elsewhere in
      the tracepoint code, which has the following side-effect:
      
      Setting mod->num_tracepoints is done in by module.c:
      
          mod->tracepoints_ptrs = section_objs(info, "__tracepoints_ptrs",
                                               sizeof(*mod->tracepoints_ptrs),
                                               &mod->num_tracepoints);
      
      Basically, since sizeof(*mod->tracepoints_ptrs) is a pointer size
      (rather than sizeof(int)), num_tracepoints is erroneously set to half the
      size it should be on 64-bit arch. So a module with an odd number of
      tracepoints misses the last tracepoint due to effect of integer
      division.
      
      So in the module going notifier:
      
              for_each_tracepoint_range(mod->tracepoints_ptrs,
                      mod->tracepoints_ptrs + mod->num_tracepoints,
                      tp_module_going_check_quiescent, NULL);
      
      the expression (mod->tracepoints_ptrs + mod->num_tracepoints) actually
      evaluates to something within the bounds of the array, but miss the
      last tracepoint if the number of tracepoints is odd on 64-bit arch.
      
      Fix this by introducing a new typedef: tracepoint_ptr_t, which
      is either "const int" on architectures that have PREL32 relocations,
      or "struct tracepoint * const" on architectures that does not have
      this feature.
      
      Also provide a new tracepoint_ptr_defer() static inline to
      encapsulate deferencing this type rather than duplicate code and
      ugly idefs within the for_each_tracepoint_range() implementation.
      
      This issue appears in 4.19-rc kernels, and should ideally be fixed
      before the end of the rc cycle.
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: NJessica Yu <jeyu@kernel.org>
      Link: http://lkml.kernel.org/r/20181013191050.22389-1-mathieu.desnoyers@efficios.com
      Link: http://lkml.kernel.org/r/20180704083651.24360-7-ard.biesheuvel@linaro.org
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morris <james.morris@microsoft.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      9c0be3f6
  7. 12 10月, 2018 1 次提交
  8. 11 10月, 2018 2 次提交
    • S
      net: ipv4: update fnhe_pmtu when first hop's MTU changes · af7d6cce
      Sabrina Dubroca 提交于
      Since commit 5aad1de5 ("ipv4: use separate genid for next hop
      exceptions"), exceptions get deprecated separately from cached
      routes. In particular, administrative changes don't clear PMTU anymore.
      
      As Stefano described in commit e9fa1495 ("ipv6: Reflect MTU changes
      on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
      the local MTU change can become stale:
       - if the local MTU is now lower than the PMTU, that PMTU is now
         incorrect
       - if the local MTU was the lowest value in the path, and is increased,
         we might discover a higher PMTU
      
      Similarly to what commit e9fa1495 did for IPv6, update PMTU in those
      cases.
      
      If the exception was locked, the discovered PMTU was smaller than the
      minimal accepted PMTU. In that case, if the new local MTU is smaller
      than the current PMTU, let PMTU discovery figure out if locking of the
      exception is still needed.
      
      To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
      notifier. By the time the notifier is called, dev->mtu has been
      changed. This patch adds the old MTU as additional information in the
      notifier structure, and a new call_netdevice_notifiers_u32() function.
      
      Fixes: 5aad1de5 ("ipv4: use separate genid for next hop exceptions")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af7d6cce
    • T
      net/mlx5: WQ, fixes for fragmented WQ buffers API · 37fdffb2
      Tariq Toukan 提交于
      mlx5e netdevice used to calculate fragment edges by a call to
      mlx5_wq_cyc_get_frag_size(). This calculation did not give the correct
      indication for queues smaller than a PAGE_SIZE, (broken by default on
      PowerPC, where PAGE_SIZE == 64KB).  Here it is replaced by the correct new
      calls/API.
      
      Since (TX/RX) Work Queues buffers are fragmented, here we introduce
      changes to the API in core driver, so that it gets a stride index and
      returns the index of last stride on same fragment, and an additional
      wrapping function that returns the number of physically contiguous
      strides that can be written contiguously to the work queue.
      
      This obsoletes the following API functions, and their buggy
      usage in EN driver:
      * mlx5_wq_cyc_get_frag_size()
      * mlx5_wq_cyc_ctr2fragix()
      
      The new API improves modularity and hides the details of such
      calculation for mlx5e netdevice and mlx5_ib rdma drivers.
      
      New calculation is also more efficient, and improves performance
      as follows:
      
      Packet rate test: pktgen, UDP / IPv4, 64byte, single ring, 8K ring size.
      
      Before: 16,477,619 pps
      After:  17,085,793 pps
      
      3.7% improvement
      
      Fixes: 3a2f7033 ("net/mlx5: Use order-0 allocations for all WQ types")
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Reviewed-by: NEran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      37fdffb2
  9. 10 10月, 2018 1 次提交
    • S
      gpio: Assign gpio_irq_chip::parents to non-stack pointer · 3e779a2e
      Stephen Boyd 提交于
      gpiochip_set_cascaded_irqchip() is passed 'parent_irq' as an argument
      and then the address of that argument is assigned to the gpio chips
      gpio_irq_chip 'parents' pointer shortly thereafter. This can't ever
      work, because we've just assigned some stack address to a pointer that
      we plan to dereference later in gpiochip_irq_map(). I ran into this
      issue with the KASAN report below when gpiochip_irq_map() tried to setup
      the parent irq with a total junk pointer for the 'parents' array.
      
      BUG: KASAN: stack-out-of-bounds in gpiochip_irq_map+0x228/0x248
      Read of size 4 at addr ffffffc0dde472e0 by task swapper/0/1
      
      CPU: 7 PID: 1 Comm: swapper/0 Not tainted 4.14.72 #34
      Call trace:
      [<ffffff9008093638>] dump_backtrace+0x0/0x718
      [<ffffff9008093da4>] show_stack+0x20/0x2c
      [<ffffff90096b9224>] __dump_stack+0x20/0x28
      [<ffffff90096b91c8>] dump_stack+0x80/0xbc
      [<ffffff900845a350>] print_address_description+0x70/0x238
      [<ffffff900845a8e4>] kasan_report+0x1cc/0x260
      [<ffffff900845aa14>] __asan_report_load4_noabort+0x2c/0x38
      [<ffffff900897e098>] gpiochip_irq_map+0x228/0x248
      [<ffffff900820cc08>] irq_domain_associate+0x114/0x2ec
      [<ffffff900820d13c>] irq_create_mapping+0x120/0x234
      [<ffffff900820da78>] irq_create_fwspec_mapping+0x4c8/0x88c
      [<ffffff900820e2d8>] irq_create_of_mapping+0x180/0x210
      [<ffffff900917114c>] of_irq_get+0x138/0x198
      [<ffffff9008dc70ac>] spi_drv_probe+0x94/0x178
      [<ffffff9008ca5168>] driver_probe_device+0x51c/0x824
      [<ffffff9008ca6538>] __device_attach_driver+0x148/0x20c
      [<ffffff9008ca14cc>] bus_for_each_drv+0x120/0x188
      [<ffffff9008ca570c>] __device_attach+0x19c/0x2dc
      [<ffffff9008ca586c>] device_initial_probe+0x20/0x2c
      [<ffffff9008ca18bc>] bus_probe_device+0x80/0x154
      [<ffffff9008c9b9b4>] device_add+0x9b8/0xbdc
      [<ffffff9008dc7640>] spi_add_device+0x1b8/0x380
      [<ffffff9008dcbaf0>] spi_register_controller+0x111c/0x1378
      [<ffffff9008dd6b10>] spi_geni_probe+0x4dc/0x6f8
      [<ffffff9008cab058>] platform_drv_probe+0xdc/0x130
      [<ffffff9008ca5168>] driver_probe_device+0x51c/0x824
      [<ffffff9008ca59cc>] __driver_attach+0x100/0x194
      [<ffffff9008ca0ea8>] bus_for_each_dev+0x104/0x16c
      [<ffffff9008ca58c0>] driver_attach+0x48/0x54
      [<ffffff9008ca1edc>] bus_add_driver+0x274/0x498
      [<ffffff9008ca8448>] driver_register+0x1ac/0x230
      [<ffffff9008caaf6c>] __platform_driver_register+0xcc/0xdc
      [<ffffff9009c4b33c>] spi_geni_driver_init+0x1c/0x24
      [<ffffff9008084cb8>] do_one_initcall+0x240/0x3dc
      [<ffffff9009c017d0>] kernel_init_freeable+0x378/0x468
      [<ffffff90096e8240>] kernel_init+0x14/0x110
      [<ffffff9008086fcc>] ret_from_fork+0x10/0x18
      
      The buggy address belongs to the page:
      page:ffffffbf037791c0 count:0 mapcount:0 mapping:          (null) index:0x0
      flags: 0x4000000000000000()
      raw: 4000000000000000 0000000000000000 0000000000000000 00000000ffffffff
      raw: ffffffbf037791e0 ffffffbf037791e0 0000000000000000 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffffffc0dde47180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffffffc0dde47200: f1 f1 f1 f1 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f2 f2
      >ffffffc0dde47280: f2 f2 00 00 00 00 00 00 00 00 00 00 f3 f3 f3 f3
                                                             ^
       ffffffc0dde47300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffffffc0dde47380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      Let's leave around one unsigned int in the gpio_irq_chip struct for the
      single parent irq case and repoint the 'parents' array at it. This way
      code is left mostly intact to setup parents and we waste an extra few
      bytes per structure of which there should be only a handful in a system.
      
      Cc: Evan Green <evgreen@chromium.org>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Fixes: e0d89728 ("gpio: Implement tighter IRQ chip integration")
      Signed-off-by: NStephen Boyd <swboyd@chromium.org>
      Signed-off-by: NLinus Walleij <linus.walleij@linaro.org>
      3e779a2e
  10. 09 10月, 2018 1 次提交
  11. 06 10月, 2018 1 次提交
    • M
      mm: migration: fix migration of huge PMD shared pages · 017b1660
      Mike Kravetz 提交于
      The page migration code employs try_to_unmap() to try and unmap the source
      page.  This is accomplished by using rmap_walk to find all vmas where the
      page is mapped.  This search stops when page mapcount is zero.  For shared
      PMD huge pages, the page map count is always 1 no matter the number of
      mappings.  Shared mappings are tracked via the reference count of the PMD
      page.  Therefore, try_to_unmap stops prematurely and does not completely
      unmap all mappings of the source page.
      
      This problem can result is data corruption as writes to the original
      source page can happen after contents of the page are copied to the target
      page.  Hence, data is lost.
      
      This problem was originally seen as DB corruption of shared global areas
      after a huge page was soft offlined due to ECC memory errors.  DB
      developers noticed they could reproduce the issue by (hotplug) offlining
      memory used to back huge pages.  A simple testcase can reproduce the
      problem by creating a shared PMD mapping (note that this must be at least
      PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
      migrate_pages() to migrate process pages between nodes while continually
      writing to the huge pages being migrated.
      
      To fix, have the try_to_unmap_one routine check for huge PMD sharing by
      calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
      mapping it will be 'unshared' which removes the page table entry and drops
      the reference on the PMD page.  After this, flush caches and TLB.
      
      mmu notifiers are called before locking page tables, but we can not be
      sure of PMD sharing until page tables are locked.  Therefore, check for
      the possibility of PMD sharing before locking so that notifiers can
      prepare for the worst possible case.
      
      Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
      [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
        Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
      Fixes: 39dde65c ("shared page table for hugetlb page")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      017b1660
  12. 05 10月, 2018 1 次提交