1. 15 3月, 2019 3 次提交
    • G
      net: kernel hookers service for toa module · 33576e37
      George Zhang 提交于
      LVS fullnat will replace network traffic's source ip with its local ip,
      and thus the backend servers cannot obtain the real client ip.
      
      To solve this, LVS has introduced the tcp option address (TOA) to store
      the essential ip address information in the last tcp ack packet of the
      3-way handshake, and the backend servers need to retrieve it from the
      packet header.
      
      In this patch, we have introduced the sk_toa_data member in the sock
      structure to hold the TOA information. There used to be an in-tree
      module for TOA managing, whereas it has now been maintained as an
      standalone module.
      
      In this case, the toa module should register its hook function(s) using
      the provided interfaces in the hookers module.
      
      TOA in sock structure:
      
      	__be32 sk_toa_data[16];
      
      The hookers module only provides the sk_toa_data placeholder, and the
      toa module can use this variable through the layout it needs.
      
      Hook interfaces:
      
      The hookers module replaces the kernel's syn_recv_sock and getname
      handler with a stub that chains the toa module's hook function(s) to the
      original handling function. The hookers module allows hook functions to
      be installed and uninstalled in any order.
      
      toa module:
      
      The external toa module will be provided in separate RPM package.
      
      [xuyu@linux.alibaba.com: amend commit log]
      Signed-off-by: NGeorge Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      33576e37
    • J
      ext4: fix NULL pointer dereference while journal is aborted · 0262b2d2
      Jiufei Xue 提交于
      We see the following NULL pointer dereference while running xfstests
      generic/475:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      PGD 8000000c84bad067 P4D 8000000c84bad067 PUD c84e62067 PMD 0
      Oops: 0000 [#1] SMP PTI
      CPU: 7 PID: 9886 Comm: fsstress Kdump: loaded Not tainted 5.0.0-rc8 #10
      RIP: 0010:ext4_do_update_inode+0x4ec/0x760
      ...
      Call Trace:
      ? jbd2_journal_get_write_access+0x42/0x50
      ? __ext4_journal_get_write_access+0x2c/0x70
      ? ext4_truncate+0x186/0x3f0
      ext4_mark_iloc_dirty+0x61/0x80
      ext4_mark_inode_dirty+0x62/0x1b0
      ext4_truncate+0x186/0x3f0
      ? unmap_mapping_pages+0x56/0x100
      ext4_setattr+0x817/0x8b0
      notify_change+0x1df/0x430
      do_truncate+0x5e/0x90
      ? generic_permission+0x12b/0x1a0
      
      This is triggered because the NULL pointer handle->h_transaction was
      dereferenced in function ext4_update_inode_fsync_trans().
      I found that the h_transaction was set to NULL in jbd2__journal_restart
      but failed to attached to a new transaction while the journal is aborted.
      
      Fix this by checking the handle before updating the inode.
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      0262b2d2
    • A
      perf probe: Fix getting the kernel map · 38fd2b68
      Adrian Hunter 提交于
      Since commit 4d99e413 ("perf machine: Workaround missing maps for
      x86 PTI entry trampolines"), perf tools has been creating more than one
      kernel map, however 'perf probe' assumed there could be only one.
      
      Fix by using machine__kernel_map() to get the main kernel map.
      Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com>
      Tested-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Cc: Xu Yu <xuyu@linux.alibaba.com>
      Fixes: 4d99e413 ("perf machine: Workaround missing maps for x86 PTI entry trampolines")
      Fixes: d83212d5 ("kallsyms, x86: Export addresses of PTI entry trampolines")
      Link: http://lkml.kernel.org/r/2ed432de-e904-85d2-5c36-5897ddc5b23b@intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      38fd2b68
  2. 14 3月, 2019 1 次提交
  3. 12 3月, 2019 1 次提交
  4. 23 2月, 2019 1 次提交
    • M
      net: crypto set sk to NULL when af_alg_release. · f363be65
      Mao Wenan 提交于
      commit 9060cb719e61b685ec0102574e10337fa5f445ea upstream.
      
      KASAN has found use-after-free in sockfs_setattr.
      The existed commit 6d8c50dc ("socket: close race condition between sock_close()
      and sockfs_setattr()") is to fix this simillar issue, but it seems to ignore
      that crypto module forgets to set the sk to NULL after af_alg_release.
      
      KASAN report details as below:
      BUG: KASAN: use-after-free in sockfs_setattr+0x120/0x150
      Write of size 4 at addr ffff88837b956128 by task syz-executor0/4186
      
      CPU: 2 PID: 4186 Comm: syz-executor0 Not tainted xxx + #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       dump_stack+0xca/0x13e
       print_address_description+0x79/0x330
       ? vprintk_func+0x5e/0xf0
       kasan_report+0x18a/0x2e0
       ? sockfs_setattr+0x120/0x150
       sockfs_setattr+0x120/0x150
       ? sock_register+0x2d0/0x2d0
       notify_change+0x90c/0xd40
       ? chown_common+0x2ef/0x510
       chown_common+0x2ef/0x510
       ? chmod_common+0x3b0/0x3b0
       ? __lock_is_held+0xbc/0x160
       ? __sb_start_write+0x13d/0x2b0
       ? __mnt_want_write+0x19a/0x250
       do_fchownat+0x15c/0x190
       ? __ia32_sys_chmod+0x80/0x80
       ? trace_hardirqs_on_thunk+0x1a/0x1c
       __x64_sys_fchownat+0xbf/0x160
       ? lockdep_hardirqs_on+0x39a/0x5e0
       do_syscall_64+0xc8/0x580
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x462589
      Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89
      f7 48 89 d6 48 89
      ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3
      48 c7 c1 bc ff ff
      ff f7 d8 64 89 01 48
      RSP: 002b:00007fb4b2c83c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000104
      RAX: ffffffffffffffda RBX: 000000000072bfa0 RCX: 0000000000462589
      RDX: 0000000000000000 RSI: 00000000200000c0 RDI: 0000000000000007
      RBP: 0000000000000005 R08: 0000000000001000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb4b2c846bc
      R13: 00000000004bc733 R14: 00000000006f5138 R15: 00000000ffffffff
      
      Allocated by task 4185:
       kasan_kmalloc+0xa0/0xd0
       __kmalloc+0x14a/0x350
       sk_prot_alloc+0xf6/0x290
       sk_alloc+0x3d/0xc00
       af_alg_accept+0x9e/0x670
       hash_accept+0x4a3/0x650
       __sys_accept4+0x306/0x5c0
       __x64_sys_accept4+0x98/0x100
       do_syscall_64+0xc8/0x580
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 4184:
       __kasan_slab_free+0x12e/0x180
       kfree+0xeb/0x2f0
       __sk_destruct+0x4e6/0x6a0
       sk_destruct+0x48/0x70
       __sk_free+0xa9/0x270
       sk_free+0x2a/0x30
       af_alg_release+0x5c/0x70
       __sock_release+0xd3/0x280
       sock_close+0x1a/0x20
       __fput+0x27f/0x7f0
       task_work_run+0x136/0x1b0
       exit_to_usermode_loop+0x1a7/0x1d0
       do_syscall_64+0x461/0x580
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Syzkaller reproducer:
      r0 = perf_event_open(&(0x7f0000000000)={0x0, 0x70, 0x0, 0x0, 0x0, 0x0,
      0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
      0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
      0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @perf_config_ext}, 0x0, 0x0,
      0xffffffffffffffff, 0x0)
      r1 = socket$alg(0x26, 0x5, 0x0)
      getrusage(0x0, 0x0)
      bind(r1, &(0x7f00000001c0)=@ALG={0x26, 'hash\x00', 0x0, 0x0,
      'sha256-ssse3\x00'}, 0x80)
      r2 = accept(r1, 0x0, 0x0)
      r3 = accept4$unix(r2, 0x0, 0x0, 0x0)
      r4 = dup3(r3, r0, 0x0)
      fchownat(r4, &(0x7f00000000c0)='\x00', 0x0, 0x0, 0x1000)
      
      Fixes: 6d8c50dc ("socket: close race condition between sock_close() and sockfs_setattr()")
      Signed-off-by: NMao Wenan <maowenan@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f363be65
  5. 21 2月, 2019 15 次提交
  6. 20 2月, 2019 19 次提交
    • G
      Linux 4.19.24 · f287634f
      Greg Kroah-Hartman 提交于
      f287634f
    • S
      mm: proc: smaps_rollup: fix pss_locked calculation · dd5f4d06
      Sandeep Patil 提交于
      commit 27dd768ed8db48beefc4d9e006c58e7a00342bde upstream.
      
      The 'pss_locked' field of smaps_rollup was being calculated incorrectly.
      It accumulated the current pss everytime a locked VMA was found.  Fix
      that by adding to 'pss_locked' the same time as that of 'pss' if the vma
      being walked is locked.
      
      Link: http://lkml.kernel.org/r/20190203065425.14650-1-sspatil@android.com
      Fixes: 493b0e9d ("mm: add /proc/pid/smaps_rollup")
      Signed-off-by: NSandeep Patil <sspatil@android.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NJoel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: <stable@vger.kernel.org>	[4.14.x, 4.19.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dd5f4d06
    • J
      drm/i915: Prevent a race during I915_GEM_MMAP ioctl with WC set · 6a204bd5
      Joonas Lahtinen 提交于
      commit 2e7bd10e05afb866b5fb13eda25095c35d7a27cc upstream.
      
      Make sure the underlying VMA in the process address space is the
      same as it was during vm_mmap to avoid applying WC to wrong VMA.
      
      A more long-term solution would be to have vm_mmap_locked variant
      in linux/mmap.h for when caller wants to hold mmap_sem for an
      extended duration.
      
      v2:
      - Refactor the compare function
      
      Fixes: 1816f923 ("drm/i915: Support creation of unbound wc user mappings for objects")
      Reported-by: NAdam Zabrocki <adamza@microsoft.com>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: <stable@vger.kernel.org> # v4.0+
      Cc: Akash Goel <akash.goel@intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Adam Zabrocki <adamza@microsoft.com>
      Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> #v1
      Link: https://patchwork.freedesktop.org/patch/msgid/20190207085454.10598-1-joonas.lahtinen@linux.intel.com
      (cherry picked from commit 5c4604e757ba9b193b09768d75a7d2105a5b883f)
      Signed-off-by: NJani Nikula <jani.nikula@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a204bd5
    • L
      drm/i915: Block fbdev HPD processing during suspend · 4631e0b4
      Lyude Paul 提交于
      commit e8a8fedd57fdcebf0e4f24ef0fc7e29323df8e66 upstream.
      
      When resuming, we check whether or not any previously connected
      MST topologies are still present and if so, attempt to resume them. If
      this fails, we disable said MST topologies and fire off a hotplug event
      so that userspace knows to reprobe.
      
      However, sending a hotplug event involves calling
      drm_fb_helper_hotplug_event(), which in turn results in fbcon doing a
      connector reprobe in the caller's thread - something we can't do at the
      point in which i915 calls drm_dp_mst_topology_mgr_resume() since
      hotplugging hasn't been fully initialized yet.
      
      This currently causes some rather subtle but fatal issues. For example,
      on my T480s the laptop dock connected to it usually disappears during a
      suspend cycle, and comes back up a short while after the system has been
      resumed. This guarantees pretty much every suspend and resume cycle,
      drm_dp_mst_topology_mgr_set_mst(mgr, false); will be caused and in turn,
      a connector hotplug will occur. Now it's Rute Goldberg time: when the
      connector hotplug occurs, i915 reprobes /all/ of the connectors,
      including eDP. However, eDP probing requires that we power on the panel
      VDD which in turn, grabs a wakeref to the appropriate power domain on
      the GPU (on my T480s, this is the PORT_DDI_A_IO domain). This is where
      things start breaking, since this all happens before
      intel_power_domains_enable() is called we end up leaking the wakeref
      that was acquired and never releasing it later. Come next suspend/resume
      cycle, this causes us to fail to shut down the GPU properly, which
      causes it not to resume properly and die a horrible complicated death.
      
      (as a note: this only happens when there's both an eDP panel and MST
      topology connected which is removed mid-suspend. One or the other seems
      to always be OK).
      
      We could try to fix the VDD wakeref leak, but this doesn't seem like
      it's worth it at all since we aren't able to handle hotplug detection
      while resuming anyway. So, let's go with a more robust solution inspired
      by nouveau: block fbdev from handling hotplug events until we resume
      fbdev. This allows us to still send sysfs hotplug events to be handled
      later by user space while we're resuming, while also preventing us from
      actually processing any hotplug events we receive until it's safe.
      
      This fixes the wakeref leak observed on the T480s and as such, also
      fixes suspend/resume with MST topologies connected on this machine.
      
      Changes since v2:
      * Don't call drm_fb_helper_hotplug_event() under lock, do it after lock
        (Chris Wilson)
      * Don't call drm_fb_helper_hotplug_event() in
        intel_fbdev_output_poll_changed() under lock (Chris Wilson)
      * Always set ifbdev->hpd_waiting (Chris Wilson)
      Signed-off-by: NLyude Paul <lyude@redhat.com>
      Fixes: 0e32b39c ("drm/i915: add DP 1.2 MST support (v0.7)")
      Cc: Todd Previte <tprevite@gmail.com>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Imre Deak <imre.deak@intel.com>
      Cc: intel-gfx@lists.freedesktop.org
      Cc: <stable@vger.kernel.org> # v3.17+
      Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190129191001.442-2-lyude@redhat.com
      (cherry picked from commit fe5ec65668cdaa4348631d8ce1766eed43b33c10)
      Signed-off-by: NJani Nikula <jani.nikula@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4631e0b4
    • R
      drm/vkms: Fix license inconsistent · de48b5f3
      Rodrigo Siqueira 提交于
      commit 7fd56e0260a22c0cfaf9adb94a2427b76e239dd0 upstream.
      
      Fixes license inconsistent related to the VKMS driver and remove the
      redundant boilerplate comment.
      
      Fixes: 854502fa ("drm/vkms: Add basic CRTC initialization")
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NRodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
      Acked-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190206140116.7qvy2lpwbcd7wds6@smtp.gmail.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      de48b5f3
    • M
      drm: Use array_size() when creating lease · 3312e0ae
      Matthew Wilcox 提交于
      commit 69ef943dbc14b21987c79f8399ffea08f9a1b446 upstream.
      
      Passing an object_count of sufficient size will make
      object_count * 4 wrap around to be very small, then a later function
      will happily iterate off the end of the object_ids array.  Using
      array_size() will saturate at SIZE_MAX, the kmalloc() will fail and
      we'll return an -ENOMEM to the norty userspace.
      
      Fixes: 62884cd3 ("drm: Add four ioctls for managing drm mode object leases [v7]")
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      Cc: <stable@vger.kernel.org> # v4.15+
      Signed-off-by: NDave Airlie <airlied@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3312e0ae
    • N
      dm thin: fix bug where bio that overwrites thin block ignores FUA · 029a38f8
      Nikos Tsironis 提交于
      commit 4ae280b4ee3463fa57bbe6eede26b97daff8a0f1 upstream.
      
      When provisioning a new data block for a virtual block, either because
      the block was previously unallocated or because we are breaking sharing,
      if the whole block of data is being overwritten the bio that triggered
      the provisioning is issued immediately, skipping copying or zeroing of
      the data block.
      
      When this bio completes the new mapping is inserted in to the pool's
      metadata by process_prepared_mapping(), where the bio completion is
      signaled to the upper layers.
      
      This completion is signaled without first committing the metadata.  If
      the bio in question has the REQ_FUA flag set and the system crashes
      right after its completion and before the next metadata commit, then the
      write is lost despite the REQ_FUA flag requiring that I/O completion for
      this request must only be signaled after the data has been committed to
      non-volatile storage.
      
      Fix this by deferring the completion of overwrite bios, with the REQ_FUA
      flag set, until after the metadata has been committed.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NNikos Tsironis <ntsironis@arrikto.com>
      Acked-by: NJoe Thornber <ejt@redhat.com>
      Acked-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      029a38f8
    • M
      dm crypt: don't overallocate the integrity tag space · 3ec93eb3
      Mikulas Patocka 提交于
      commit ff0c129d3b5ecb3df7c8f5e2236582bf745b6c5f upstream.
      
      bio_sectors() returns the value in the units of 512-byte sectors (no
      matter what the real sector size of the device).  dm-crypt multiplies
      bio_sectors() by on_disk_tag_size to calculate the space allocated for
      integrity tags.  If dm-crypt is running with sector size larger than
      512b, it allocates more data than is needed.
      
      Device Mapper trims the extra space when passing the bio to
      dm-integrity, so this bug didn't result in any visible misbehavior.
      But it must be fixed to avoid wasteful memory allocation for the block
      integrity payload.
      
      Fixes: ef43aa38 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
      Cc: stable@vger.kernel.org # 4.12+
      Reported-by: NMilan Broz <mbroz@redhat.com>
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3ec93eb3
    • B
      x86/a.out: Clear the dump structure initially · b2778ef8
      Borislav Petkov 提交于
      commit 10970e1b4be9c74fce8ab6e3c34a7d718f063f2c upstream.
      
      dump_thread32() in aout_core_dump() does not clear the user32 structure
      allocated on the stack as the first thing on function entry.
      
      As a result, the dump.u_comm, dump.u_ar0 and dump.signal which get
      assigned before the clearing, get overwritten.
      
      Rename that function to fill_dump() to make it clear what it does and
      call it first thing.
      
      This was caught while staring at a patch by Derek Robson
      <robsonde@gmail.com>.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Derek Robson <robsonde@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Matz <matz@suse.de>
      Cc: x86@kernel.org
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20190202005512.3144-1-robsonde@gmail.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2778ef8
    • N
      md/raid1: don't clear bitmap bits on interrupted recovery. · ddf96641
      Nate Dailey 提交于
      commit dfcc34c99f3ebc16b787b118763bf9cb6b1efc7a upstream.
      
      sync_request_write no longer submits writes to a Faulty device. This has
      the unfortunate side effect that bitmap bits can be incorrectly cleared
      if a recovery is interrupted (previously, end_sync_write would have
      prevented this). This means the next recovery may not copy everything
      it should, potentially corrupting data.
      
      Add a function for doing the proper md_bitmap_end_sync, called from
      end_sync_write and the Faulty case in sync_request_write.
      
      backport note to 4.14: s/md_bitmap_end_sync/bitmap_end_sync
      Cc: stable@vger.kernel.org 4.14+
      Fixes: 0c9d5b12 ("md/raid1: avoid reusing a resync bio after error handling.")
      Reviewed-by: NJack Wang <jinpu.wang@cloud.ionos.com>
      Tested-by: NJack Wang <jinpu.wang@cloud.ionos.com>
      Signed-off-by: NNate Dailey <nate.dailey@stratus.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ddf96641
    • E
      signal: Restore the stop PTRACE_EVENT_EXIT · a2b3e2c0
      Eric W. Biederman 提交于
      commit cf43a757fd49442bc38f76088b70c2299eed2c2f upstream.
      
      In the middle of do_exit() there is there is a call
      "ptrace_event(PTRACE_EVENT_EXIT, code);" That call places the process
      in TACKED_TRACED aka "(TASK_WAKEKILL | __TASK_TRACED)" and waits for
      for the debugger to release the task or SIGKILL to be delivered.
      
      Skipping past dequeue_signal when we know a fatal signal has already
      been delivered resulted in SIGKILL remaining pending and
      TIF_SIGPENDING remaining set.  This in turn caused the
      scheduler to not sleep in PTACE_EVENT_EXIT as it figured
      a fatal signal was pending.  This also caused ptrace_freeze_traced
      in ptrace_check_attach to fail because it left a per thread
      SIGKILL pending which is what fatal_signal_pending tests for.
      
      This difference in signal state caused strace to report
      strace: Exit of unknown pid NNNNN ignored
      
      Therefore update the signal handling state like dequeue_signal
      would when removing a per thread SIGKILL, by removing SIGKILL
      from the per thread signal mask and clearing TIF_SIGPENDING.
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Reported-by: NIvan Delalande <colona@arista.com>
      Cc: stable@vger.kernel.org
      Fixes: 35634ffa1751 ("signal: Always notice exiting tasks")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a2b3e2c0
    • J
      scsi: sd: fix entropy gathering for most rotational disks · 0396cf55
      James Bottomley 提交于
      commit e4a056987c86f402f1286e050b1dee3f4ce7c7eb upstream.
      
      The problem is that the default for MQ is not to gather entropy, whereas
      the default for the legacy queue was always to gather it.  The original
      attempt to fix entropy gathering for rotational disks under MQ added an
      else branch in sd_read_block_characteristics().  Unfortunately, the entire
      check isn't reached if the device has no characteristics VPD page.  Since
      this page was only introduced in SBC-3 and its optional anyway, most less
      expensive rotational disks don't have one, meaning they all stopped
      gathering entropy when we made MQ the default.  In a wholly unrelated
      change, openssl and openssh won't function until the random number
      generator is initialised, meaning lots of people have been seeing large
      delays before they could log into systems with default MQ kernels due to
      this lack of entropy, because it now can take tens of minutes to initialise
      the kernel random number generator.
      
      The fix is to set the non-rotational and add-randomness flags
      unconditionally early on in the disk initialization path, so they can be
      reset only if the device actually reports being non-rotational via the VPD
      page.
      Reported-by: NMikael Pettersson <mikpelinux@gmail.com>
      Fixes: 83e32a59 ("scsi: sd: Contribute to randomness when running rotational device")
      Cc: stable@vger.kernel.org
      Signed-off-by: NJames Bottomley <James.Bottomley@HansenPartnership.com>
      Reviewed-by: NJens Axboe <axboe@kernel.dk>
      Reviewed-by: NXuewei Zhang <xueweiz@google.com>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0396cf55
    • H
      x86/platform/UV: Use efi_runtime_lock to serialise BIOS calls · cdc35685
      Hedi Berriche 提交于
      commit f331e766c4be33f4338574f3c9f7f77e98ab4571 upstream.
      
      Calls into UV firmware must be protected against concurrency, expose the
      efi_runtime_lock to the UV platform, and use it to serialise UV BIOS
      calls.
      Signed-off-by: NHedi Berriche <hedi.berriche@hpe.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Reviewed-by: NRuss Anderson <rja@hpe.com>
      Reviewed-by: NDimitri Sivanich <sivanich@hpe.com>
      Reviewed-by: NMike Travis <mike.travis@hpe.com>
      Cc: Andy Shevchenko <andy@infradead.org>
      Cc: Bhupesh Sharma <bhsharma@redhat.com>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-efi <linux-efi@vger.kernel.org>
      Cc: platform-driver-x86@vger.kernel.org
      Cc: stable@vger.kernel.org # v4.9+
      Cc: Steve Wahl <steve.wahl@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190213193413.25560-5-hedi.berriche@hpe.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cdc35685
    • A
      tracing/uprobes: Fix output for multiple string arguments · 45649b99
      Andreas Ziegler 提交于
      commit 0722069a5374b904ec1a67f91249f90e1cfae259 upstream.
      
      When printing multiple uprobe arguments as strings the output for the
      earlier arguments would also include all later string arguments.
      
      This is best explained in an example:
      
      Consider adding a uprobe to a function receiving two strings as
      parameters which is at offset 0xa0 in strlib.so and we want to print
      both parameters when the uprobe is hit (on x86_64):
      
      $ echo 'p:func /lib/strlib.so:0xa0 +0(%di):string +0(%si):string' > \
          /sys/kernel/debug/tracing/uprobe_events
      
      When the function is called as func("foo", "bar") and we hit the probe,
      the trace file shows a line like the following:
      
        [...] func: (0x7f7e683706a0) arg1="foobar" arg2="bar"
      
      Note the extra "bar" printed as part of arg1. This behaviour stacks up
      for additional string arguments.
      
      The strings are stored in a dynamically growing part of the uprobe
      buffer by fetch_store_string() after copying them from userspace via
      strncpy_from_user(). The return value of strncpy_from_user() is then
      directly used as the required size for the string. However, this does
      not take the terminating null byte into account as the documentation
      for strncpy_from_user() cleary states that it "[...] returns the
      length of the string (not including the trailing NUL)" even though the
      null byte will be copied to the destination.
      
      Therefore, subsequent calls to fetch_store_string() will overwrite
      the terminating null byte of the most recently fetched string with
      the first character of the current string, leading to the
      "accumulation" of strings in earlier arguments in the output.
      
      Fix this by incrementing the return value of strncpy_from_user() by
      one if we did not hit the maximum buffer size.
      
      Link: http://lkml.kernel.org/r/20190116141629.5752-1-andreas.ziegler@fau.de
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 5baaa59e ("tracing/probes: Implement 'memory' fetch method for uprobes")
      Acked-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: NAndreas Ziegler <andreas.ziegler@fau.de>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      45649b99
    • H
      s390/zcrypt: fix specification exception on z196 during ap probe · 88e1e66a
      Harald Freudenberger 提交于
      commit 8f9aca0c45322a807a343fc32f95f2500f83b9ae upstream.
      
      The older machines don't have the QCI instruction available.
      With support for up to 256 crypto cards the probing of each
      card has been extended to check card ids from 0 up to 255.
      For machines with QCI support there is a filter limiting the
      range of probed cards. The older machines (z196 and older)
      don't have this filter and so since support for 256 cards is
      in the driver all cards are probed. However, these machines
      also require to have the card id fit into 6 bits. Exceeding
      this limit results in a specification exception which happens
      on every kernel startup even when there is no crypto configured
      and used at all.
      
      This fix limits the range of probed crypto cards to 64 if
      there is no QCI instruction available to obey to the older
      ap architecture and so fixes the specification exceptions
      on z196 machines.
      
      Cc: stable@vger.kernel.org # v4.17+
      Fixes: af4a7227 ("s390/zcrypt: Support up to 256 crypto adapters.")
      Signed-off-by: NHarald Freudenberger <freude@linux.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      88e1e66a
    • M
      alpha: Fix Eiger NR_IRQS to 128 · a3fadeff
      Meelis Roos 提交于
      commit bfc913682464f45bc4d6044084e370f9048de9d5 upstream.
      
      Eiger machine vector definition has nr_irqs 128, and working 2.6.26
      boot shows SCSI getting IRQ-s 64 and 65. Current kernel boot fails
      because Symbios SCSI fails to request IRQ-s and does not find the disks.
      It has been broken at least since 3.18 - the earliest I could test with
      my gcc-5.
      
      The headers have moved around and possibly another order of defines has
      worked in the past - but since 128 seems to be correct and used, fix
      arch/alpha/include/asm/irq.h to have NR_IRQS=128 for Eiger.
      
      This fixes 4.19-rc7 boot on my Force Flexor A264 (Eiger subarch).
      
      Cc: stable@vger.kernel.org # v3.18+
      Signed-off-by: NMeelis Roos <mroos@linux.ee>
      Signed-off-by: NMatt Turner <mattst88@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a3fadeff
    • S
      alpha: fix page fault handling for r16-r18 targets · c56eef69
      Sergei Trofimovich 提交于
      commit 491af60ffb848b59e82f7c9145833222e0bf27a5 upstream.
      
      Fix page fault handling code to fixup r16-r18 registers.
      Before the patch code had off-by-two registers bug.
      This bug caused overwriting of ps,pc,gp registers instead
      of fixing intended r16,r17,r18 (see `struct pt_regs`).
      
      More details:
      
      Initially Dmitry noticed a kernel bug as a failure
      on strace test suite. Test passes unmapped userspace
      pointer to io_submit:
      
      ```c
          #include <err.h>
          #include <unistd.h>
          #include <sys/mman.h>
          #include <asm/unistd.h>
          int main(void)
          {
              unsigned long ctx = 0;
              if (syscall(__NR_io_setup, 1, &ctx))
                  err(1, "io_setup");
              const size_t page_size = sysconf(_SC_PAGESIZE);
              const size_t size = page_size * 2;
              void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (MAP_FAILED == ptr)
                  err(1, "mmap(%zu)", size);
              if (munmap(ptr, size))
                  err(1, "munmap");
              syscall(__NR_io_submit, ctx, 1, ptr + page_size);
              syscall(__NR_io_destroy, ctx);
              return 0;
          }
      ```
      
      Running this test causes kernel to crash when handling page fault:
      
      ```
          Unable to handle kernel paging request at virtual address ffffffffffff9468
          CPU 3
          aio(26027): Oops 0
          pc = [<fffffc00004eddf8>]  ra = [<fffffc00004edd5c>]  ps = 0000    Not tainted
          pc is at sys_io_submit+0x108/0x200
          ra is at sys_io_submit+0x6c/0x200
          v0 = fffffc00c58e6300  t0 = fffffffffffffff2  t1 = 000002000025e000
          t2 = fffffc01f159fef8  t3 = fffffc0001009640  t4 = fffffc0000e0f6e0
          t5 = 0000020001002e9e  t6 = 4c41564e49452031  t7 = fffffc01f159c000
          s0 = 0000000000000002  s1 = 000002000025e000  s2 = 0000000000000000
          s3 = 0000000000000000  s4 = 0000000000000000  s5 = fffffffffffffff2
          s6 = fffffc00c58e6300
          a0 = fffffc00c58e6300  a1 = 0000000000000000  a2 = 000002000025e000
          a3 = 00000200001ac260  a4 = 00000200001ac1e8  a5 = 0000000000000001
          t8 = 0000000000000008  t9 = 000000011f8bce30  t10= 00000200001ac440
          t11= 0000000000000000  pv = fffffc00006fd320  at = 0000000000000000
          gp = 0000000000000000  sp = 00000000265fd174
          Disabling lock debugging due to kernel taint
          Trace:
          [<fffffc0000311404>] entSys+0xa4/0xc0
      ```
      
      Here `gp` has invalid value. `gp is s overwritten by a fixup for the
      following page fault handler in `io_submit` syscall handler:
      
      ```
          __se_sys_io_submit
          ...
              ldq     a1,0(t1)
              bne     t0,4280 <__se_sys_io_submit+0x180>
      ```
      
      After a page fault `t0` should contain -EFALUT and `a1` is 0.
      Instead `gp` was overwritten in place of `a1`.
      
      This happens due to a off-by-two bug in `dpf_reg()` for `r16-r18`
      (aka `a0-a2`).
      
      I think the bug went unnoticed for a long time as `gp` is one
      of scratch registers. Any kernel function call would re-calculate `gp`.
      
      Dmitry tracked down the bug origin back to 2.1.32 kernel version
      where trap_a{0,1,2} fields were inserted into struct pt_regs.
      And even before that `dpf_reg()` contained off-by-one error.
      
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reported-and-reviewed-by: N"Dmitry V. Levin" <ldv@altlinux.org>
      Cc: stable@vger.kernel.org # v2.1.32+
      Bug: https://bugs.gentoo.org/672040Signed-off-by: NSergei Trofimovich <slyfox@gentoo.org>
      Signed-off-by: NMatt Turner <mattst88@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c56eef69
    • D
      Revert "mm: slowly shrink slabs with a relatively small number of objects" · 657fbf79
      Dave Chinner 提交于
      commit a9a238e83fbb0df31c3b9b67003f8f9d1d1b6c96 upstream.
      
      This reverts commit 172b06c3 ("mm: slowly shrink slabs with a
      relatively small number of objects").
      
      This change changes the agressiveness of shrinker reclaim, causing small
      cache and low priority reclaim to greatly increase scanning pressure on
      small caches.  As a result, light memory pressure has a disproportionate
      affect on small caches, and causes large caches to be reclaimed much
      faster than previously.
      
      As a result, it greatly perturbs the delicate balance of the VFS caches
      (dentry/inode vs file page cache) such that the inode/dentry caches are
      reclaimed much, much faster than the page cache and this drives us into
      several other caching imbalance related problems.
      
      As such, this is a bad change and needs to be reverted.
      
      [ Needs some massaging to retain the later seekless shrinker
        modifications.]
      
      Link: http://lkml.kernel.org/r/20190130041707.27750-3-david@fromorbit.com
      Fixes: 172b06c3 ("mm: slowly shrink slabs with a relatively small number of objects")
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Cc: Wolfgang Walter <linux@stwm.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Spock <dairinin@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      657fbf79
    • D
      Revert "mm: don't reclaim inodes with many attached pages" · 8d485d3a
      Dave Chinner 提交于
      commit 69056ee6a8a3d576ed31e38b3b14c70d6c74edcc upstream.
      
      This reverts commit a76cf1a474d7d ("mm: don't reclaim inodes with many
      attached pages").
      
      This change causes serious changes to page cache and inode cache
      behaviour and balance, resulting in major performance regressions when
      combining worklaods such as large file copies and kernel compiles.
      
        https://bugzilla.kernel.org/show_bug.cgi?id=202441
      
      This change is a hack to work around the problems introduced by changing
      how agressive shrinkers are on small caches in commit 172b06c3 ("mm:
      slowly shrink slabs with a relatively small number of objects").  It
      creates more problems than it solves, wasn't adequately reviewed or
      tested, so it needs to be reverted.
      
      Link: http://lkml.kernel.org/r/20190130041707.27750-2-david@fromorbit.com
      Fixes: a76cf1a474d7d ("mm: don't reclaim inodes with many attached pages")
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Cc: Wolfgang Walter <linux@stwm.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Spock <dairinin@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8d485d3a