1. 28 Jan 2017 (3 commits)
  2. 27 Jan 2017 (1 commit)
    • Reinstate "drm/probe-helpers: Drop locking from poll_enable" · c4d79c22
      Authored by Dave Airlie
      This reverts commit 54a07c7b,
      and reinstates the original.
      
      [airlied: this might be a bad plan for git].
      
      commit 3846fd9b
      Author: Daniel Vetter <daniel.vetter@ffwll.ch>
      Date:   Wed Jan 11 10:01:17 2017 +0100
      
          drm/probe-helpers: Drop locking from poll_enable
      
          It was only needed to protect the connector_list walking, see
      
          commit 8c4ccc4a
          Author: Daniel Vetter <daniel.vetter@ffwll.ch>
          Date:   Thu Jul 9 23:44:26 2015 +0200
      
              drm/probe-helper: Grab mode_config.mutex in poll_init/enable
      
          Unfortunately the commit message of that patch fails to mention that
          the new locking check was for the connector_list.
      
          But that requirement disappeared in
      
          commit c36a3254
          Author: Daniel Vetter <daniel.vetter@ffwll.ch>
          Date:   Thu Dec 15 16:58:43 2016 +0100
      
              drm: Convert all helpers to drm_connector_list_iter
      
          and so we can drop this again.
      
          This fixes a locking inversion on nouveau, where the rpm code needs to
          re-enable polling, but in other places the rpm_get() calls are nested
          within the big modeset locks.
      
          While at it, also improve the kerneldoc for these two functions a
          notch.
      
          v2: Update the kerneldoc even more to explain that these functions
          can't be called concurrently, or bad things happen (Chris).
  3. 26 Jan 2017 (1 commit)
  4. 25 Jan 2017 (3 commits)
    • mm, page_alloc: fix check for NULL preferred_zone · ea57485a
      Authored by Vlastimil Babka
      Patch series "fix premature OOM regression in 4.7+ due to cpuset races".
      
      This is v2 of my attempt to fix the recent report based on LTP cpuset
      stress test [1].  The intention is to go to stable 4.9 LTSS with this,
      as triggering repeated OOMs is not nice.  That's why the patches try
      not to be too intrusive.
      
      Unfortunately, while investigating I found that modifying the testcase
      to use per-VMA policies instead of per-task policies brings the OOMs
      back, but that seems to be a much older and harder-to-fix problem.  I
      have posted an RFC [2], but I believe that fixing the recent
      regressions has a higher priority.
      
      Longer-term we might try to think how to fix the cpuset mess in a
      better and less error-prone way.  I was, for example, very surprised
      to learn that cpuset updates change not only task->mems_allowed, but
      also the nodemask of mempolicies.  Until now I expected the parameter
      to alloc_pages_nodemask() to be stable.  I wonder why we then treat
      cpusets specially in get_page_from_freelist() and distinguish HARDWALL
      etc., when there's an unconditional intersection between mempolicy and
      cpuset.  I would expect the nodemask adjustment to save overhead in
      g_p_f(), but that clearly doesn't happen in the current form.  So we
      have both crazy complexity and overhead, AFAICS.
      
      [1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com
      [2] https://lkml.kernel.org/r/7c459f26-13a6-a817-e508-b65b903a8378@suse.cz
      
      This patch (of 4):
      
      Since commit c33d6c06 ("mm, page_alloc: avoid looking up the first
      zone in a zonelist twice") we have had a wrong check for a NULL
      preferred_zone, which can theoretically happen due to concurrent
      cpuset modification.  We check the zoneref pointer, which is never
      NULL, when we should check the zone pointer.  Also document this in
      the first_zones_zonelist() comment, per Michal Hocko.
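
      A minimal sketch of the shape of the fix (illustrative; the actual
      check lives in the page allocator's fast path):

         /* The zoneref returned by first_zones_zonelist() points into
          * the zonelist array and is never NULL; what can be NULL is
          * the zone it resolves to. */
         ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
                                 ac.high_zoneidx, ac.nodemask);
         if (!ac.preferred_zoneref->zone) {  /* was: !ac.preferred_zoneref */
                 page = NULL;
                 goto no_zone;
         }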
      
      Fixes: c33d6c06 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
      Link: http://lkml.kernel.org/r/20170120103843.24587-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/watchdog: prevent false hardlockup on overloaded system · b94f5118
      Authored by Don Zickus
      On an overloaded system, it is possible that a change in the watchdog
      threshold can be delayed long enough to trigger a false positive.
      
      This can easily be achieved by having a cpu spin indefinitely on a
      task while another cpu updates the watchdog threshold.

      What happens is that while trying to park the watchdog threads, the
      hrtimers on the other cpus fire and reprogram themselves with the
      new slower watchdog threshold.  Meanwhile, the nmi watchdog is still
      programmed with the old faster threshold.
      
      Because the one cpu is blocked, it prevents the thread parking on
      the other cpus from completing, which is needed to shut down the
      nmi watchdog and reprogram it correctly.  As a result, the nmi
      watchdog reports a false positive.
      
      Fix this by setting a park_in_progress flag to block all lockups until
      the parking is complete.
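
      A rough sketch of the idea, assuming the flag name used by the fix
      (the handler below is illustrative, not the exact upstream code):

         atomic_t watchdog_park_in_progress = ATOMIC_INIT(0);

         /* NMI watchdog callback: ignore apparent lockups while the
          * watchdog threads are being reparked onto the new threshold. */
         static void watchdog_overflow_callback(struct perf_event *event,
                                                struct perf_sample_data *data,
                                                struct pt_regs *regs)
         {
                 if (atomic_read(&watchdog_park_in_progress) != 0)
                         return;
                 /* ... normal hardlockup detection ... */
         }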
      
      Fix provided by Ulrich Obergfell.
      
      [akpm@linux-foundation.org: s/park_in_progress/watchdog_park_in_progress/]
      Link: http://lkml.kernel.org/r/1481041033-192236-1-git-send-email-dzickus@redhat.com
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory_hotplug: make zone_can_shift() return a boolean value · 8a1f780e
      Authored by Yasuaki Ishimatsu
      online_{kernel|movable} is used to change the memory zone to
      ZONE_{NORMAL|MOVABLE} and online the memory.
      
      To check whether the memory zone can be changed, zone_can_shift() is
      used.  Currently the function returns a negative integer, a positive
      integer, or 0.  When the function returns a negative or positive
      value, it means that the memory zone can be changed to
      ZONE_{NORMAL|MOVABLE}.
      
      But when the function returns 0, it has two possible meanings.

      One meaning is that the memory zone does not need to be changed.
      For example, when memory is in ZONE_NORMAL and onlined by
      online_kernel, the memory zone does not need to be changed.
      
      The other meaning is that the memory zone cannot be changed.  When
      memory is in ZONE_NORMAL and onlined by online_movable, the memory
      zone may not be changed to ZONE_MOVABLE due to memory online
      limitations (see Documentation/memory-hotplug.txt).  In this case,
      the memory must not be onlined.
      
      The patch changes the return type of zone_can_shift() so that the
      memory online operation fails when the memory zone cannot be
      changed, as follows:
      
      Before applying patch:
         # grep -A 35 "Node 2" /proc/zoneinfo
         Node 2, zone   Normal
         <snip>
            node_scanned  0
                 spanned  8388608
                 present  7864320
                 managed  7864320
         # echo online_movable > memory4097/state
         # grep -A 35 "Node 2" /proc/zoneinfo
         Node 2, zone   Normal
         <snip>
            node_scanned  0
                 spanned  8388608
                 present  8388608
                 managed  8388608
      
         The online_movable operation succeeded, but the memory was onlined
         as ZONE_NORMAL, not ZONE_MOVABLE.
      
      After applying patch:
         # grep -A 35 "Node 2" /proc/zoneinfo
         Node 2, zone   Normal
         <snip>
            node_scanned  0
                 spanned  8388608
                 present  7864320
                 managed  7864320
         # echo online_movable > memory4097/state
         bash: echo: write error: Invalid argument
         # grep -A 35 "Node 2" /proc/zoneinfo
         Node 2, zone   Normal
         <snip>
            node_scanned  0
                 spanned  8388608
                 present  7864320
                 managed  7864320
      
         The online_movable operation failed because the memory zone could
         not be changed from ZONE_NORMAL to ZONE_MOVABLE.
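
      A sketch of the reworked interface and its caller, assuming the
      signature implied by the commit (illustrative only):

         /* Returns true and fills *zone_shift if the zone can be changed
          * (or needs no change); returns false otherwise. */
         bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
                             enum zone_type target, int *zone_shift);

         /* Caller in the online path: fail the request outright. */
         if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, &zone_shift))
                 return -EINVAL;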
      
      Fixes: df429ac0 ("memory-hotplug: more general validation of zone during online")
      Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 19 Jan 2017 (2 commits)
  6. 18 Jan 2017 (5 commits)
  7. 17 Jan 2017 (2 commits)
    • scsi: libfc: Fix variable name in fc_set_wwpn · 67d35e70
      Authored by Fam Zheng
      The parameter name should be wwpn instead of wwnn.
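
      The change is a one-line rename; a sketch of the inline setter
      after the fix (body assumed):

         /* Before, the parameter was misleadingly named wwnn (the
          * world-wide *node* name), although it sets the port name. */
         static inline void fc_set_wwpn(struct fc_lport *lport, u64 wwpn)
         {
                 lport->wwpn = wwpn;
         }
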
      Signed-off-by: Fam Zheng <famz@redhat.com>
      Acked-by: Johannes Thumshirn <jth@kernel.org>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    • bpf: rework prog_digest into prog_tag · f1f7714e
      Authored by Daniel Borkmann
      Commit 7bd509e3 ("bpf: add prog_digest and expose it via
      fdinfo/netlink") was recently discussed, partially due to the
      admittedly suboptimal name of "prog_digest" in combination with the
      use of a sha1 hash; thus, concerns about its security in terms of
      collision resistance were inevitably and rightfully raised with
      regard to potential use-cases.
      
      The intended use cases are debugging and introspection only: providing
      a stable "tag" over the instruction sequence that both kernel and user
      space can calculate independently.  It's not usable at all for making
      a security relevant decision.  So collisions, where two different
      instruction sequences generate the same tag, can happen, but ideally
      at a rather low rate.  The "tag" will be dumped in hex and is short
      enough to introspect in tracepoints or kallsyms output along with
      other data such as stack traces, etc.  Thus, this patch performs a
      rename into prog_tag and truncates the tag to a short output (64 bits)
      to make it obvious it's not collision-free.
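
      A sketch of the truncation (BPF_TAG_SIZE matches the 64-bit tag the
      text describes; the digest helper named below is hypothetical):

         #define BPF_TAG_SIZE 8   /* 64 bits, deliberately short */

         /* Compute the full SHA-1 over the instruction sequence, then
          * keep only the first 8 bytes as the program tag. */
         u8 digest[SHA_DIGEST_WORDS * sizeof(u32)];
         bpf_prog_calc_digest_into(digest, prog);  /* hypothetical helper */
         memcpy(prog->tag, digest, BPF_TAG_SIZE);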
      
      Should a hash or facility with a security-relevant focus be needed in
      the future, then we can think about the requirements, constraints,
      etc. that would fit that situation.  For now, rework the exposed
      parts for the current use cases as long as nothing has been released
      yet.  Tested on x86_64 and s390x.
      
      Fixes: 7bd509e3 ("bpf: add prog_digest and expose it via fdinfo/netlink")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 16 Jan 2017 (1 commit)
  9. 15 Jan 2017 (2 commits)
    • rcu: Narrow early boot window of illegal synchronous grace periods · 52d7e48b
      Authored by Paul E. McKenney
      The current preemptible RCU implementation goes through three phases
      during bootup.  In the first phase, there is only one CPU that is
      running with preemption disabled, so that a no-op is a synchronous
      grace period.  In the second, mid-boot phase, the scheduler is
      running, but RCU has not yet gotten its kthreads spawned (and, for
      expedited grace periods, workqueues are not yet running).  During
      this time, any attempt to do a synchronous grace period will hang the
      system (or complain bitterly, depending).  In the third and final
      phase, RCU is fully operational and everything works normally.
      
      This has been OK for some time, but some synchronous grace periods
      have recently been showing up during the second, mid-boot phase.
      This code worked "by accident" for a while, but started failing as
      soon as expedited RCU grace periods switched over to workqueues in
      commit 8b355e3b ("rcu: Drive expedited grace periods from
      workqueue").  Note that the code was buggy even before this commit,
      as it was subject to failure on real-time systems that forced all
      expedited grace periods to run as normal grace periods (for example,
      using the rcu_normal ksysfs parameter).  The callchain from the
      failure case is as follows:
      
      early_amd_iommu_init()
      |-> acpi_put_table(ivrs_base);
      |-> acpi_tb_put_table(table_desc);
      |-> acpi_tb_invalidate_table(table_desc);
      |-> acpi_tb_release_table(...)
      |-> acpi_os_unmap_memory
      |-> acpi_os_unmap_iomem
      |-> acpi_os_map_cleanup
      |-> synchronize_rcu_expedited
      
      The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
      which caused the code to try using workqueues before they were
      initialized, which did not go well.
      
      This commit therefore reworks RCU to permit synchronous grace periods
      to proceed during this mid-boot phase.  It is a fix for a regression
      introduced in v4.9, and is thus being put forward post-merge-window
      in v4.10.
      
      This commit sets a flag from the existing rcu_scheduler_starting()
      function which causes all synchronous grace periods to take the expedited
      path.  The expedited path now checks this flag, using the requesting task
      to drive the expedited grace period forward during the mid-boot phase.
      Finally, this flag is updated by a core_initcall() function named
      rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.
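
      A condensed sketch of the phase handling (function and constant
      names follow the commit text; details are simplified):

         /* Boot phases: inactive -> init (mid-boot) -> running. */
         int rcu_scheduler_active __read_mostly = RCU_SCHEDULER_INACTIVE;

         void __init rcu_scheduler_starting(void)
         {
                 rcu_scheduler_active = RCU_SCHEDULER_INIT; /* expedite all */
         }

         /* Runtime codepaths become legal once core initcalls run. */
         static int __init rcu_exp_runtime_mode(void)
         {
                 rcu_scheduler_active = RCU_SCHEDULER_RUNNING;
                 return 0;
         }
         core_initcall(rcu_exp_runtime_mode);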
      
      Note that this arrangement assumes that tasks are not sent POSIX signals
      (or anything similar) from the time that the first task is spawned
      through core_initcall() time.
      
      Fixes: 8b355e3b ("rcu: Drive expedited grace periods from workqueue")
      Reported-by: N"Zheng, Lv" <lv.zheng@intel.com>
      Reported-by: NBorislav Petkov <bp@alien8.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: NStan Kain <stan.kain@gmail.com>
      Tested-by: NIvan <waffolz@hotmail.com>
      Tested-by: NEmanuel Castelo <emanuel.castelo@gmail.com>
      Tested-by: NBruno Pesavento <bpesavento@infinito.it>
      Tested-by: NBorislav Petkov <bp@suse.de>
      Tested-by: NFrederic Bezies <fredbezies@gmail.com>
      Cc: <stable@vger.kernel.org> # 4.9.0-
    • coredump: Ensure proper size of sparse core files · 4d22c75d
      Authored by Dave Kleikamp
      If the last section of a core file ends with an unmapped or zero
      page, the size of the file does not correspond with the last
      dump_skip() call.  gdb complains that the file is truncated, which
      can be confusing to users.
      
      After all of the vma sections are written, make sure that the file size
      is no smaller than the current file position.
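
      As a rough userspace analog of the idea (sketch, assuming a file
      descriptor fd that has been written sparsely with seeks):

         /* After seeking past trailing holes, extend the file so its
          * size matches the final write position. */
         off_t pos = lseek(fd, 0, SEEK_CUR);
         struct stat st;

         if (fstat(fd, &st) == 0 && st.st_size < pos)
                 ftruncate(fd, pos);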
      
      This problem can be demonstrated with gdb's bigcore testcase on the
      sparc architecture.
      Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  10. 14 Jan 2017 (4 commits)
    • efi/x86: Prune invalid memory map entries and fix boot regression · 0100a3e6
      Authored by Peter Jones
      Some machines, such as the Lenovo ThinkPad W541 with firmware GNET80WW
      (2.28), include memory map entries with phys_addr=0x0 and num_pages=0.
      
      These machines fail to boot after the following commit,
      
        commit 8e80632f ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()")
      
      Fix this by removing such bogus entries from the memory map.
      
      Furthermore, currently the log output for this case (with efi=debug)
      looks like:
      
       [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |  |  |  |  |  ] range=[0x0000000000000000-0xffffffffffffffff] (0MB)
      
      This is clearly wrong, and also not as informative as it could be.
      This patch changes it so that if we find obviously invalid memory map
      entries, we print an error and skip those entries.  It also detects
      overflow in the displayed address range calculation, so the new
      output is:
      
       [    0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
       [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |   |  |  |  |  ] range=[0x0000000000000000-0x0000000000000000] (invalid)
      
      It also detects memory map sizes that would overflow the physical
      address, for example phys_addr=0xfffffffffffff000 and
      num_pages=0x0200000000000001, and prints:
      
       [    0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
       [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |   |  |  |  |  ] range=[phys_addr=0xfffffffffffff000-0x20ffffffffffffffff] (invalid)
      
      It then removes these entries from the memory map.
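
      A sketch of the validity test, along the lines the text describes
      (simplified; the upstream helper checks more than this):

         /* Reject descriptors with no pages or with a size that wraps
          * past the end of the physical address space. */
         static bool efi_memmap_entry_valid(const efi_memory_desc_t *md)
         {
                 u64 end = md->phys_addr +
                           (md->num_pages << EFI_PAGE_SHIFT) - 1;

                 if (md->num_pages == 0)
                         return false;
                 if (end < md->phys_addr)      /* overflowed */
                         return false;
                 return true;
         }
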
      Signed-off-by: Peter Jones <pjones@redhat.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      [ardb: refactor for clarity with no functional changes, avoid PAGE_SHIFT]
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      [Matt: Include bugzilla info in commit log]
      Cc: <stable@vger.kernel.org> # v4.9+
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=191121
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Account interrupts for PEBS errors · 475113d9
      Authored by Jiri Olsa
      It's possible to set up PEBS events to get only errors and not any
      data, as on SNB-X (model 45) and IVB-EP (model 62), via 2 perf
      commands running simultaneously:
      
          taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10
      
      This leads to a soft lockup, because the error path of
      intel_pmu_drain_pebs_nhm() does not account event->hw.interrupts
      for error PEBS interrupts, so in case you're getting ONLY errors
      you don't have a way to stop the event when it's over the
      max_samples_per_tick limit:
      
        NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
        ...
        RIP: 0010:[<ffffffff81159232>]  [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
        ...
        Call Trace:
         ? trace_hardirqs_on_caller+0xf5/0x1b0
         ? perf_cgroup_attach+0x70/0x70
         perf_install_in_context+0x199/0x1b0
         ? ctx_resched+0x90/0x90
         SYSC_perf_event_open+0x641/0xf90
         SyS_perf_event_open+0x9/0x10
         do_syscall_64+0x6c/0x1f0
         entry_SYSCALL64_slow_path+0x25/0x25
      
      Add perf_event_account_interrupt() which does the interrupt
      and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
      error path.
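
      A sketch of the split, using the helper name from the text (the
      accounting internals are elided):

         /* Shared throttle/frequency accounting, now callable from the
          * PEBS error path as well as from __perf_event_overflow(). */
         static int __perf_event_account_interrupt(struct perf_event *event,
                                                   int throttle)
         {
                 /* ... hwc->interrupts++, throttle and freq checks ... */
                 return 0;
         }

         int perf_event_account_interrupt(struct perf_event *event)
         {
                 return __perf_event_account_interrupt(event, 1);
         }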
      
      We keep the pending_kill and pending_wakeup logic only in the
      __perf_event_overflow() path, because they make sense only if
      there's any data to deliver.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • block: add blk_rq_payload_bytes · 2e3258ec
      Authored by Christoph Hellwig
      Add a helper to calculate the actual data transfer size for special
      payload requests.
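
      A sketch of what such a helper plausibly looks like (assuming the
      RQF_SPECIAL_PAYLOAD flag and special_vec field):

         /* Bytes actually transferred: for special-payload requests this
          * is the payload vector, not blk_rq_bytes(). */
         static inline unsigned int blk_rq_payload_bytes(struct request *rq)
         {
                 if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
                         return rq->special_vec.bv_len;
                 return blk_rq_bytes(rq);
         }
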
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • tcp: fix tcp_fastopen unaligned access complaints on sparc · 003c9410
      Authored by Shannon Nelson
      Fix up a data alignment issue on sparc by swapping the order
      of the cookie byte array field with the length field in
      struct tcp_fastopen_cookie, and making it a proper union
      to clean up the typecasting.
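
      A sketch of the reworked layout, per the description above (field
      types assumed):

         /* The length now follows the cookie bytes, and the union lets
          * the cookie be read as a properly aligned value. */
         struct tcp_fastopen_cookie {
                 union {
                         u8              val[TCP_FASTOPEN_COOKIE_MAX];
         #if IS_ENABLED(CONFIG_IPV6)
                         struct in6_addr addr;
         #endif
                 };
                 s8      len;
                 bool    exp;    /* RFC6994 experimental option format */
         };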
      
      This addresses log complaints like these:
          log_unaligned: 113 callbacks suppressed
          Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
          Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
          Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
          Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
          Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 13 Jan 2017 (4 commits)
  12. 12 Jan 2017 (4 commits)
  13. 11 Jan 2017 (8 commits)
    • cfg80211: consider VHT opmode on station update · 06f7c88c
      Authored by Beni Lev
      Currently, this attribute is only fetched on station addition, but
      not on station change. Since this info is only present in the assoc
      request, with full station state support in the driver it cannot be
      present when the station is added.
      
      Thus, add support for changing the VHT opmode on station update if
      done before (or while) the station is marked as associated. After
      this, ignore it, since it used to be ignored.
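
      A sketch of the change-station path, assuming the existing
      NL80211_ATTR_OPMODE_NOTIF attribute is simply parsed there as well:

         /* Honor the VHT opmode notification on station *change*, not
          * just on station add (illustrative fragment). */
         if (info->attrs[NL80211_ATTR_OPMODE_NOTIF]) {
                 params.opmode_notif_used = true;
                 params.opmode_notif =
                         nla_get_u8(info->attrs[NL80211_ATTR_OPMODE_NOTIF]);
         }
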
      Signed-off-by: Beni Lev <beni.lev@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • timerfd: export defines to userspace · 575b1967
      Authored by Mike Frysinger
      Since userspace is expected to call timerfd syscalls directly with these
      flags/ioctls, make sure we export them so they don't have to duplicate
      the values themselves.
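
      A minimal userspace sketch using the now-exported names (flag and
      ioctl names as created by this patch; error handling omitted):

         #include <linux/timerfd.h>  /* TFD_NONBLOCK, TFD_IOC_SET_TICKS */
         #include <sys/syscall.h>
         #include <sys/ioctl.h>
         #include <unistd.h>
         #include <stdint.h>
         #include <time.h>

         int fd = syscall(SYS_timerfd_create, CLOCK_MONOTONIC,
                          TFD_NONBLOCK | TFD_CLOEXEC);
         uint64_t ticks = 1;
         ioctl(fd, TFD_IOC_SET_TICKS, &ticks);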
      
      Link: http://lkml.kernel.org/r/20161219064052.7196-1-vapier@gentoo.org
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: support anonymous stable page · f0571429
      Authored by Minchan Kim
      During development of zram-swap asynchronous writeback, I found
      strange corruption of a compressed page, resulting in:
      
        Modules linked in: zram(E)
        CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G            E   4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        task: ffff88007620b840 task.stack: ffff880078090000
        RIP: set_freeobj.part.43+0x1c/0x1f
        RSP: 0018:ffff880078093ca8  EFLAGS: 00010246
        RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
        RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
        RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
        R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
        R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
        FS:  0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
        Call Trace:
          obj_malloc+0x22b/0x260
          zs_malloc+0x1e4/0x580
          zram_bvec_rw+0x4cd/0x830 [zram]
          page_requests_rw+0x9c/0x130 [zram]
          zram_thread+0xe6/0x173 [zram]
          kthread+0xca/0xe0
          ret_from_fork+0x25/0x30
      
      Investigation reveals that stable pages currently don't cover
      anonymous pages.  IOW, reuse_swap_page() can reuse the page without
      waiting for writeback completion, so it can overwrite a page zram is
      compressing.
      
      Unfortunately, zram has used the per-cpu stream feature since v4.7.
      It aims to increase the cache hit ratio of the scratch buffer used
      for compression.  The downside of that approach is that zram must
      ask for memory space for the compressed page in a per-cpu context,
      which requires a strict gfp flag that can fail.  If it fails, zram
      retries the allocation out of the per-cpu context, where it may get
      memory this time, compresses the data again, and copies it to the
      new memory space.
      
      In this scenario, zram assumes the data never changes, but that is
      not true without stable page support.  So, if the data changes under
      us, zram can cause a buffer overrun: the second compression size can
      be bigger than the one we got in the previous trial, and blindly
      copying the bigger object into the smaller buffer overruns it.  The
      overrun breaks zsmalloc's free object chaining, so the system
      crashes as above.
      
      I think the following is the same problem:
      https://bugzilla.suse.com/show_bug.cgi?id=997574
      
      Unfortunately, reuse_swap_page() must be atomic, so we cannot wait
      on writeback there; the approach in this patch is to simply return
      false if we find that a stable page is needed.  Although this
      increases the memory footprint temporarily, it happens rarely and
      the page should be easily reclaimed afterwards.  It is also better
      than waiting for IO completion, which is on the critical path for
      application latency.
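
      A sketch of the check, assuming the flag introduced for this purpose
      (simplified from the description above):

         /* In reuse_swap_page(): refuse to reuse a page still under
          * writeback to a device that requires stable pages. */
         if (PageWriteback(page)) {
                 struct swap_info_struct *sis = page_swap_info(page);

                 if (sis->flags & SWP_STABLE_WRITES)
                         return false;  /* data must not change in flight */
         }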
      
      Fixes: da9556a2 ("zram: user per-cpu compression streams")
      Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
      Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Hyeoncheol Lee <cheol.lee@lge.com>
      Cc: <yjay.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: <stable@vger.kernel.org> [4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename __page_frag functions to __page_frag_cache, drop order from drain · 2976db80
      Authored by Alexander Duyck
      This patch does two things.
      
      First it goes through and renames the __page_frag prefixed functions to
      __page_frag_cache so that we can be clear that we are draining or
      refilling the cache, not the frags themselves.
      
      Second, we drop the order parameter from __page_frag_cache_drain
      since we don't actually need to pass it: all fragments are either
      order 0 or must be a compound page.
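
      The resulting signature, as implied above (sketch):

         /* Was: __page_frag_drain(page, order, count) */
         void __page_frag_cache_drain(struct page *page, unsigned int count);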
      
      Link: http://lkml.kernel.org/r/20170104023954.13451.5678.stgit@localhost.localdomain
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free · 8c2dd3e4
      Authored by Alexander Duyck
      Patch series "Page fragment updates", v4.
      
      This patch series takes care of a few cleanups for the page fragments
      API.
      
      First we do some renames so that things are much more consistent: we
      move the page_frag_ portion of the name to the front of the function
      names.  Secondly, we split out the cache-specific functions from the
      other page fragment functions by adding the word "cache" to the name.
      
      Finally I added a bit of documentation that will hopefully help to
      explain some of this.  I plan to revisit this later as we get things
      more ironed out in the near future with the changes planned for the DMA
      setup to support eXpress Data Path.
      
      This patch (of 3):
      
      This patch renames the page frag functions to be more consistent
      with other APIs.  Specifically, we place page_frag first in the name
      and then append either alloc or free as the suffix.  This makes the
      naming a bit clearer.

      In addition, we drop the leading double underscores since these are
      technically no longer a backing interface but instead the front end
      that is called from the networking APIs.
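
      In short (usage sketch; the gfp flag choice is illustrative):

         /* __alloc_page_frag() -> page_frag_alloc()
          * __free_page_frag()  -> page_frag_free() */
         void *buf = page_frag_alloc(&nc, fragsz, GFP_ATOMIC);
         if (buf)
                 page_frag_free(buf);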
      
      Link: http://lkml.kernel.org/r/20170104023854.13451.67390.stgit@localhost.localdomain
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: fix the active list aging for lowmem requests when memcg is enabled · b4536f0c
      Authored by Michal Hocko
      Nils Holland and Klaus Ethgen have reported unexpected OOM killer
      invocations with 32b kernels starting with 4.8:
      
      	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
      	kworker/u4:5 cpuset=/ mems_allowed=0
      	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
      	[...]
      	Mem-Info:
      	active_anon:58685 inactive_anon:90 isolated_anon:0
      	 active_file:274324 inactive_file:281962 isolated_file:0
      	 unevictable:0 dirty:649 writeback:0 unstable:0
      	 slab_reclaimable:40662 slab_unreclaimable:17754
      	 mapped:7382 shmem:202 pagetables:351 bounce:0
      	 free:206736 free_pcp:332 free_cma:0
      	Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
      	DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
      	lowmem_reserve[]: 0 813 3474 3474
      	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
      	lowmem_reserve[]: 0 0 21292 21292
      	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
      
      The oom killer is clearly premature because there is still a lot of
      page cache in the zone Normal which should satisfy this lowmem
      request.  Further debugging has shown that the reclaim cannot make
      any forward progress because the page cache is hidden in the active
      list, which doesn't get rotated because inactive_list_is_low is not
      memcg aware.
      
      The code simply subtracts per-zone highmem counters from the
      respective memcg's lru sizes, which doesn't make any sense.  We can
      simply end up always seeing the resulting active and inactive counts
      as 0 and return false.  This issue is not limited to 32b kernels,
      but in practice the effect on systems without CONFIG_HIGHMEM would
      be much harder to notice because we do not invoke the OOM killer for
      allocation requests targeting < ZONE_NORMAL.
      
      Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
      and subtract per-memcg highmem counts when memcg is enabled.  Introduce
      helper lruvec_zone_lru_size which redirects to either zone counters or
      mem_cgroup_get_zone_lru_size when appropriate.
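
      A sketch of that helper, with the shape assumed from the description
      (the memcg branch reads the per-zone counts now kept in
      mem_cgroup_per_node):

         unsigned long lruvec_zone_lru_size(struct lruvec *lruvec,
                                            enum lru_list lru, int zone_idx)
         {
                 if (!mem_cgroup_disabled())
                         return mem_cgroup_get_zone_lru_size(lruvec, lru,
                                                             zone_idx);

                 return zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zone_idx],
                                        NR_ZONE_LRU_BASE + lru);
         }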
      
      We lose the empty-LRU-but-nonzero-lru-size detection introduced by
      ca707239 ("mm: update_lru_size warn and reset bad lru_size") because
      of the inherent zone vs. node discrepancy.
      
      Fixes: f8d1a311 ("mm: consider whether to decivate based on eligible zones inactive ratio")
      Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Nils Holland <nholland@tisys.org>
      Tested-by: Nils Holland <nholland@tisys.org>
      Reported-by: Klaus Ethgen <Klaus@Ethgen.de>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signal: protect SIGNAL_UNKILLABLE from unintentional clearing. · 2d39b3cd
      Authored by Jamie Iles
      Since commit 00cd5c37 ("ptrace: permit ptracing of /sbin/init") we
      can now trace init processes.  init is initially protected with
      SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
      there are a number of paths during tracing where SIGNAL_UNKILLABLE can
      be implicitly cleared.
      
      This can result in init becoming stoppable/killable after tracing.  For
      example, running:
      
        while true; do kill -STOP 1; done &
        strace -p 1
      
      and then stopping strace and the kill loop will result in init being
      left in state TASK_STOPPED.  Sending SIGCONT to init will resume it, but
      init will now respond to future SIGSTOP signals rather than ignoring
      them.
      
      Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
      that we don't clear SIGNAL_UNKILLABLE.
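
      The guard plausibly takes the form of a small helper (sketch; the
      mask is assumed from the flags named above):

         #define SIGNAL_STOP_MASK (SIGNAL_CLD_MASK | SIGNAL_STOP_STOPPED | \
                                   SIGNAL_STOP_CONTINUED)

         /* Update only the stop-related bits; SIGNAL_UNKILLABLE survives. */
         static inline void signal_set_stop_flags(struct signal_struct *sig,
                                                  unsigned int flags)
         {
                 sig->flags = (sig->flags & ~SIGNAL_STOP_MASK) | flags;
         }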
      
      Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
      Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: get rid of __GFP_OTHER_NODE · 41b6167e
      Authored by Michal Hocko
      The flag was introduced by commit 78afd561 ("mm: add
      __GFP_OTHER_NODE flag") to allow proper accounting of remote node
      allocations done by kernel daemons on behalf of a process - e.g.
      khugepaged.
      
      After "mm: fix remote numa hits statistics" we do not need and actually
      use the flag so we can safely remove it because all allocations which
      are satisfied from their "home" node are accounted properly.
      
      [mhocko@suse.com: fix build]
      Link: http://lkml.kernel.org/r/20170106122225.GK5556@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170102153057.9451-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>