1. 25 Jan 2017, 1 commit
    • nfs: Don't increment lock sequence ID after NFS4ERR_MOVED · 059aa734
      Committed by Chuck Lever
      Xuan Qi reports that the Linux NFSv4 client failed to lock a file
      that was migrated. The steps he observed on the wire:
      
      1. The client sent a LOCK request to the source server
      2. The source server replied NFS4ERR_MOVED
      3. The client switched to the destination server
      4. The client sent the same LOCK request to the destination
         server with a bumped lock sequence ID
      5. The destination server rejected the LOCK request with
         NFS4ERR_BAD_SEQID
      
      RFC 3530 section 8.1.5 provides a list of NFS errors which do not
      bump a lock sequence ID.
      
      However, RFC 3530 is now obsoleted by RFC 7530. In RFC 7530 section
      9.1.7, this list has been updated by the addition of NFS4ERR_MOVED.
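
      As an illustration of that rule (not the kernel's actual helper — the function name here is invented, and the numeric codes are the RFC-assigned values), a seqid-bump decision could look like this:

          /* Illustrative only: a hypothetical helper mirroring the RFC 7530
           * section 9.1.7 rule described above.  The numeric values come
           * from the RFC; the helper name is made up for this sketch. */
          #include <stdbool.h>

          #define NFS4ERR_RESOURCE        10018
          #define NFS4ERR_MOVED           10019
          #define NFS4ERR_NOFILEHANDLE    10020
          #define NFS4ERR_STALE_CLIENTID  10022
          #define NFS4ERR_STALE_STATEID   10023
          #define NFS4ERR_BAD_STATEID     10025
          #define NFS4ERR_BAD_SEQID       10026
          #define NFS4ERR_BADXDR          10036

          /* Return true if the lock sequence ID should be bumped for this reply. */
          static bool lock_seqid_should_bump(int nfs4_error)
          {
                  switch (nfs4_error) {
                  case NFS4ERR_STALE_CLIENTID:
                  case NFS4ERR_STALE_STATEID:
                  case NFS4ERR_BAD_STATEID:
                  case NFS4ERR_BAD_SEQID:
                  case NFS4ERR_BADXDR:
                  case NFS4ERR_RESOURCE:
                  case NFS4ERR_NOFILEHANDLE:
                  case NFS4ERR_MOVED:     /* added by RFC 7530 section 9.1.7 */
                          return false;
                  default:
                          return true;
                  }
          }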
      Reported-by: Xuan Qi <xuan.qi@oracle.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Cc: stable@vger.kernel.org # v3.7+
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      059aa734
  2. 18 Jan 2017, 1 commit
  3. 17 Jan 2017, 1 commit
    • bpf: rework prog_digest into prog_tag · f1f7714e
      Committed by Daniel Borkmann
      Commit 7bd509e3 ("bpf: add prog_digest and expose it via
      fdinfo/netlink") was recently discussed, partially due to the
      admittedly suboptimal name of "prog_digest" in combination
      with its use of a SHA-1 hash; this inevitably and rightfully
      raised concerns about collision resistance for security-relevant
      use cases.
      
      The intended use cases are debugging and introspection only,
      providing a stable "tag" over the instruction sequence that both
      kernel and user space can calculate independently.
      It's not usable at all for making a security-relevant decision.
      So collisions where two different instruction sequences generate
      the same tag can happen, but ideally at a rather low rate. The
      "tag" will be dumped in hex and is short enough to introspect
      in tracepoints or kallsyms output along with other data such
      as stack trace, etc. Thus, this patch performs a rename into
      prog_tag and truncates the tag to a short output (64 bits) to
      make it obvious it's not collision-free.
      
      Should in future a hash or facility be needed with a security
      relevant focus, then we can think about requirements, constraints,
      etc that would fit to that situation. For now, rework the exposed
      parts for the current use cases as long as nothing has been
      released yet. Tested on x86_64 and s390x.
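
      For reference, a minimal userspace sketch of how the resulting tag can be inspected via fdinfo (the fd number below is just a placeholder, and the exact formatting of the line may differ):

          /* Minimal sketch: print the prog_tag line that fdinfo exposes for a
           * loaded BPF program fd.  The fd number (3) is just a placeholder. */
          #include <stdio.h>
          #include <string.h>

          int main(void)
          {
                  char line[256];
                  FILE *f = fopen("/proc/self/fdinfo/3", "r"); /* hypothetical prog fd */

                  if (!f)
                          return 1;
                  while (fgets(line, sizeof(line), f))
                          if (!strncmp(line, "prog_tag:", 9))
                                  fputs(line, stdout);  /* e.g. "prog_tag:  <16 hex chars>" */
                  fclose(f);
                  return 0;
          }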
      
      Fixes: 7bd509e3 ("bpf: add prog_digest and expose it via fdinfo/netlink")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f1f7714e
  4. 16 Jan 2017, 1 commit
  5. 15 Jan 2017, 2 commits
    • rcu: Narrow early boot window of illegal synchronous grace periods · 52d7e48b
      Committed by Paul E. McKenney
      The current preemptible RCU implementation goes through three phases
      during bootup.  In the first phase, there is only one CPU that is running
      with preemption disabled, so that a no-op is a synchronous grace period.
      In the second mid-boot phase, the scheduler is running, but RCU has
      not yet gotten its kthreads spawned (and, for expedited grace periods,
      workqueues are not yet running).  During this time, any attempt to do
      a synchronous grace period will hang the system (or complain bitterly,
      depending).  In the third and final phase, RCU is fully operational and
      everything works normally.
      
      This has been OK for some time, but there have recently been some
      synchronous grace periods showing up during the second mid-boot phase.
      This code worked "by accident" for a while, but started failing as soon
      as expedited RCU grace periods switched over to workqueues in commit
      8b355e3b ("rcu: Drive expedited grace periods from workqueue").
      Note that the code was buggy even before this commit, as it was subject
      to failure on real-time systems that forced all expedited grace periods
      to run as normal grace periods (for example, using the rcu_normal ksysfs
      parameter).  The callchain from the failure case is as follows:
      
      early_amd_iommu_init()
      |-> acpi_put_table(ivrs_base);
      |-> acpi_tb_put_table(table_desc);
      |-> acpi_tb_invalidate_table(table_desc);
      |-> acpi_tb_release_table(...)
      |-> acpi_os_unmap_memory
      |-> acpi_os_unmap_iomem
      |-> acpi_os_map_cleanup
      |-> synchronize_rcu_expedited
      
      The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
      which caused the code to try using workqueues before they were
      initialized, which did not go well.
      
      This commit therefore reworks RCU to permit synchronous grace periods
      to proceed during this mid-boot phase.  This commit is therefore a
      fix to a regression introduced in v4.9, and is therefore being put
      forward post-merge-window in v4.10.
      
      This commit sets a flag from the existing rcu_scheduler_starting()
      function which causes all synchronous grace periods to take the expedited
      path.  The expedited path now checks this flag, using the requesting task
      to drive the expedited grace period forward during the mid-boot phase.
      Finally, this flag is updated by a core_initcall() function named
      rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.
      
      Note that this arrangement assumes that tasks are not sent POSIX signals
      (or anything similar) from the time that the first task is spawned
      through core_initcall() time.
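
      A purely illustrative userspace model of the three phases and the flag handoff described above (all names invented for this sketch — this is not kernel code):

          /* Illustrative model only: how a synchronous grace-period request is
           * handled in each boot phase described above. */
          #include <stdio.h>

          enum rcu_boot_phase { SINGLE_CPU, MID_BOOT, RUNTIME };
          static enum rcu_boot_phase phase = SINGLE_CPU;

          static void model_rcu_scheduler_starting(void) { phase = MID_BOOT; } /* scheduler up    */
          static void model_rcu_exp_runtime_mode(void)   { phase = RUNTIME;  } /* core_initcall() */

          static void model_synchronize_rcu(void)
          {
                  if (phase == SINGLE_CPU)
                          return;         /* one CPU, no preemption: a no-op is a GP */
                  if (phase == MID_BOOT) {
                          /* no kthreads/workqueues yet: the requesting task itself
                           * drives an expedited grace period */
                          printf("expedited GP driven by the caller\n");
                          return;
                  }
                  printf("normal runtime grace period\n");
          }

          int main(void)
          {
                  model_synchronize_rcu();        /* phase 1 */
                  model_rcu_scheduler_starting();
                  model_synchronize_rcu();        /* phase 2 (mid-boot) */
                  model_rcu_exp_runtime_mode();
                  model_synchronize_rcu();        /* phase 3 */
                  return 0;
          }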
      
      Fixes: 8b355e3b ("rcu: Drive expedited grace periods from workqueue")
      Reported-by: N"Zheng, Lv" <lv.zheng@intel.com>
      Reported-by: NBorislav Petkov <bp@alien8.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: NStan Kain <stan.kain@gmail.com>
      Tested-by: NIvan <waffolz@hotmail.com>
      Tested-by: NEmanuel Castelo <emanuel.castelo@gmail.com>
      Tested-by: NBruno Pesavento <bpesavento@infinito.it>
      Tested-by: NBorislav Petkov <bp@suse.de>
      Tested-by: NFrederic Bezies <fredbezies@gmail.com>
      Cc: <stable@vger.kernel.org> # 4.9.0-
      52d7e48b
    • coredump: Ensure proper size of sparse core files · 4d22c75d
      Committed by Dave Kleikamp
      If the last section of a core file ends with an unmapped or zero page,
      the size of the file does not correspond with the last dump_skip() call.
      gdb complains that the file is truncated, which can be confusing to users.
      
      After all of the vma sections are written, make sure that the file size
      is no smaller than the current file position.
      
      This problem can be demonstrated with gdb's bigcore testcase on the
      sparc architecture.
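
      The underlying idea can be shown with a small userspace sketch: after seeking past a trailing hole, extend the file so its size covers the final position (the filename is just an example):

          /* Userspace demo of the idea above: after "skipping" trailing data by
           * seeking, extend the file so its size covers the final position. */
          #include <fcntl.h>
          #include <unistd.h>

          int main(void)
          {
                  int fd = open("sparse-core-demo", O_CREAT | O_RDWR | O_TRUNC, 0600);
                  off_t pos;

                  if (fd < 0)
                          return 1;
                  write(fd, "hdr", 3);                 /* some real data        */
                  pos = lseek(fd, 1 << 20, SEEK_CUR);  /* skip a 1 MiB hole     */
                  /* Without this, the file ends at offset 3 and looks truncated: */
                  if (ftruncate(fd, pos) < 0)
                          return 1;
                  close(fd);
                  return 0;
          }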
      Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      4d22c75d
  6. 14 Jan 2017, 4 commits
    • efi/x86: Prune invalid memory map entries and fix boot regression · 0100a3e6
      Committed by Peter Jones
      Some machines, such as the Lenovo ThinkPad W541 with firmware GNET80WW
      (2.28), include memory map entries with phys_addr=0x0 and num_pages=0.
      
      These machines fail to boot after the following commit,
      
        commit 8e80632f ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()")
      
      Fix this by removing such bogus entries from the memory map.
      
      Furthermore, currently the log output for this case (with efi=debug)
      looks like:
      
       [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |  |  |  |  |  ] range=[0x0000000000000000-0xffffffffffffffff] (0MB)
      
      This is clearly wrong, and also not as informative as it could be.  This
      patch changes it so that if we find obviously invalid memory map
      entries, we print an error and skip those entries.  It also detects
      overflow in the displayed address range calculation, so the new output is:
      
       [    0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
       [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |   |  |  |  |  ] range=[0x0000000000000000-0x0000000000000000] (invalid)
      
      It also detects memory map sizes that would overflow the physical
      address, for example phys_addr=0xfffffffffffff000 and
      num_pages=0x0200000000000001, and prints:
      
       [    0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
       [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |   |  |  |  |  ] range=[phys_addr=0xfffffffffffff000-0x20ffffffffffffffff] (invalid)
      
      It then removes these entries from the memory map.
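
      A standalone sketch of the validity rules described above, assuming the usual 4 KiB EFI page size; this is illustrative, not the patch's code:

          /* An entry is bogus if it has no pages or if its end address would
           * overflow 64 bits.  EFI_PAGE_SHIFT of 12 (4 KiB) is assumed here. */
          #include <stdbool.h>
          #include <stdint.h>

          #define EFI_PAGE_SHIFT 12

          static bool efi_entry_is_valid(uint64_t phys_addr, uint64_t num_pages)
          {
                  uint64_t size;

                  if (num_pages == 0)
                          return false;                           /* e.g. the W541 entries */
                  if (num_pages > (UINT64_MAX >> EFI_PAGE_SHIFT))
                          return false;                           /* size itself overflows */
                  size = num_pages << EFI_PAGE_SHIFT;
                  if (phys_addr > UINT64_MAX - (size - 1))
                          return false;                           /* end address overflows */
                  return true;
          }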
      Signed-off-by: Peter Jones <pjones@redhat.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      [ardb: refactor for clarity with no functional changes, avoid PAGE_SHIFT]
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      [Matt: Include bugzilla info in commit log]
      Cc: <stable@vger.kernel.org> # v4.9+
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=191121
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0100a3e6
    • perf/x86/intel: Account interrupts for PEBS errors · 475113d9
      Committed by Jiri Olsa
      It's possible to set up PEBS events to get only errors and not
      any data, like on SNB-X (model 45) and IVB-EP (model 62)
      via 2 perf commands running simultaneously:
      
          taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10
      
      This leads to a soft lockup, because the error path of
      intel_pmu_drain_pebs_nhm() does not account event->hw.interrupts
      for error PEBS interrupts, so in case you're getting ONLY
      errors you don't have a way to stop the event when it's over
      the max_samples_per_tick limit:
      
        NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
        ...
        RIP: 0010:[<ffffffff81159232>]  [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
        ...
        Call Trace:
         ? trace_hardirqs_on_caller+0xf5/0x1b0
         ? perf_cgroup_attach+0x70/0x70
         perf_install_in_context+0x199/0x1b0
         ? ctx_resched+0x90/0x90
         SYSC_perf_event_open+0x641/0xf90
         SyS_perf_event_open+0x9/0x10
         do_syscall_64+0x6c/0x1f0
         entry_SYSCALL64_slow_path+0x25/0x25
      
      Add perf_event_account_interrupt() which does the interrupt
      and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
      error path.
      
      We keep the pending_kill and pending_wakeup logic only in the
      __perf_event_overflow() path, because they make sense only if
      there's any data to deliver.
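
      Roughly, the interrupt accounting boils down to a check of this shape (a simplified sketch, not the actual perf_event_account_interrupt()):

          /* Count the interrupt and report when the per-tick limit is exceeded
           * so the caller can throttle the event. */
          #include <stdbool.h>

          struct demo_hw_event { unsigned long interrupts; };

          static bool demo_account_interrupt(struct demo_hw_event *hwc,
                                             unsigned long max_samples_per_tick)
          {
                  hwc->interrupts++;
                  return hwc->interrupts >= max_samples_per_tick;  /* true => throttle */
          }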
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      475113d9
    • block: add blk_rq_payload_bytes · 2e3258ec
      Committed by Christoph Hellwig
      Add a helper to calculate the actual data transfer size for special
      payload requests.
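
      A hedged sketch of what such a helper presumably looks like (kernel context; not quoted from the patch):

          /* For requests carrying a special payload, report the payload length;
           * otherwise fall back to the regular request byte count. */
          #include <linux/blkdev.h>

          static inline unsigned int blk_rq_payload_bytes(struct request *rq)
          {
                  if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
                          return rq->special_vec.bv_len;
                  return blk_rq_bytes(rq);
          }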
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2e3258ec
    • tcp: fix tcp_fastopen unaligned access complaints on sparc · 003c9410
      Committed by Shannon Nelson
      Fix up a data alignment issue on sparc by swapping the order
      of the cookie byte array field with the length field in
      struct tcp_fastopen_cookie, and making it a proper union
      to clean up the typecasting.
      
      This addresses log complaints like these:
          log_unaligned: 113 callbacks suppressed
          Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
          Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
          Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
          Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
          Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      003c9410
  7. 13 Jan 2017, 2 commits
  8. 12 Jan 2017, 2 commits
  9. 11 Jan 2017, 11 commits
    • timerfd: export defines to userspace · 575b1967
      Committed by Mike Frysinger
      Since userspace is expected to call timerfd syscalls directly with these
      flags/ioctls, make sure we export them so they don't have to duplicate
      the values themselves.
      
      Link: http://lkml.kernel.org/r/20161219064052.7196-1-vapier@gentoo.org
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      575b1967
    • mm: support anonymous stable page · f0571429
      Committed by Minchan Kim
      During development of zram-swap asynchronous writeback, I found strange
      corruption of a compressed page, resulting in:
      
        Modules linked in: zram(E)
        CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G            E   4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        task: ffff88007620b840 task.stack: ffff880078090000
        RIP: set_freeobj.part.43+0x1c/0x1f
        RSP: 0018:ffff880078093ca8  EFLAGS: 00010246
        RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
        RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
        RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
        R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
        R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
        FS:  0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
        Call Trace:
          obj_malloc+0x22b/0x260
          zs_malloc+0x1e4/0x580
          zram_bvec_rw+0x4cd/0x830 [zram]
          page_requests_rw+0x9c/0x130 [zram]
          zram_thread+0xe6/0x173 [zram]
          kthread+0xca/0xe0
          ret_from_fork+0x25/0x30
      
      Investigation revealed that stable pages currently do not cover
      anonymous pages.  IOW, reuse_swap_page can reuse the page without
      waiting for writeback completion, so it can overwrite a page zram is
      still compressing.
      
      Unfortunately, zram has used the per-cpu stream feature since v4.7.
      It aims to increase the cache hit ratio of the scratch buffer used
      for compression.  The downside of that approach is that zram has to
      ask for memory space for the compressed page in per-cpu context,
      which requires a restricted gfp flag and can fail.  If it fails,
      zram retries the allocation outside of per-cpu context, where it can
      get the memory this time, compresses the data again and copies the
      result into that space.
      
      In this scenario, zram assumes the data never changes, but that is
      not true without stable page support.  So, if the data is changed
      under us, zram can overrun the buffer, because the second compression
      result can be bigger than the one from the previous trial, and it
      blindly copies the bigger object into the smaller buffer.  The
      overrun breaks zsmalloc's free object chaining, so the system
      crashes as shown above.
      
      I think the report below is the same problem:
      https://bugzilla.suse.com/show_bug.cgi?id=997574
      
      Unfortunately, reuse_swap_page should be atomic, so we cannot wait on
      writeback there; the approach in this patch is to simply return false if
      we find the page needs stable-page treatment.  Although this increases
      the memory footprint temporarily, it happens rarely and the memory
      should be reclaimed easily afterwards.  It is also better than waiting
      for IO completion, which is on the critical path for application latency.
      
      Fixes: da9556a2 ("zram: user per-cpu compression streams")
      Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
      Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Hyeoncheol Lee <cheol.lee@lge.com>
      Cc: <yjay.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: <stable@vger.kernel.org> [4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0571429
    • mm: rename __page_frag functions to __page_frag_cache, drop order from drain · 2976db80
      Committed by Alexander Duyck
      This patch does two things.
      
      First it goes through and renames the __page_frag prefixed functions to
      __page_frag_cache so that we can be clear that we are draining or
      refilling the cache, not the frags themselves.
      
      Second, we drop the order parameter from __page_frag_cache_drain since we
      don't actually need to pass it: all fragments are either order 0 or
      must be a compound page.
      
      Link: http://lkml.kernel.org/r/20170104023954.13451.5678.stgit@localhost.localdomain
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2976db80
    • mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free · 8c2dd3e4
      Committed by Alexander Duyck
      Patch series "Page fragment updates", v4.
      
      This patch series takes care of a few cleanups for the page fragments
      API.
      
      First we do some renames so that things are much more consistent: we
      move the page_frag_ portion of the name to the front of the function
      names.  Secondly we split out the cache-specific functions from the
      other page fragment functions by adding the word "cache" to the name.
      
      Finally I added a bit of documentation that will hopefully help to
      explain some of this.  I plan to revisit this later as we get things
      more ironed out in the near future with the changes planned for the DMA
      setup to support eXpress Data Path.
      
      This patch (of 3):
      
      This patch renames the page frag functions to be more consistent with
      other APIs.  Specifically, we place page_frag first in the name and
      append either alloc or free as the suffix.  This makes the naming a
      bit clearer.
      
      In addition we drop the leading double underscores since we are
      technically no longer a backing interface and instead the front end that
      is called from the networking APIs.
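
      A short kernel-context usage sketch of the renamed front end (illustrative only; the fragment size and gfp flags here are arbitrary):

          #include <linux/gfp.h>

          static void frag_demo(struct page_frag_cache *nc)
          {
                  void *buf = page_frag_alloc(nc, 256, GFP_ATOMIC);  /* 256-byte fragment */

                  if (!buf)
                          return;
                  /* ... fill and use the fragment ... */
                  page_frag_free(buf);   /* counterpart of the old __free_page_frag() */
          }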
      
      Link: http://lkml.kernel.org/r/20170104023854.13451.67390.stgit@localhost.localdomain
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c2dd3e4
    • mm, memcg: fix the active list aging for lowmem requests when memcg is enabled · b4536f0c
      Committed by Michal Hocko
      Nils Holland and Klaus Ethgen have reported unexpected OOM killer
      invocations on 32b kernels starting with 4.8:
      
      	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
      	kworker/u4:5 cpuset=/ mems_allowed=0
      	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
      	[...]
      	Mem-Info:
      	active_anon:58685 inactive_anon:90 isolated_anon:0
      	 active_file:274324 inactive_file:281962 isolated_file:0
      	 unevictable:0 dirty:649 writeback:0 unstable:0
      	 slab_reclaimable:40662 slab_unreclaimable:17754
      	 mapped:7382 shmem:202 pagetables:351 bounce:0
      	 free:206736 free_pcp:332 free_cma:0
      	Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
      	DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
      	lowmem_reserve[]: 0 813 3474 3474
      	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
      	lowmem_reserve[]: 0 0 21292 21292
      	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
      
      The OOM killer is clearly premature, because there is still a lot
      of page cache in the zone Normal which should satisfy this lowmem
      request.  Further debugging has shown that the reclaim cannot make any
      forward progress because the page cache is hidden in the active list
      which doesn't get rotated because inactive_list_is_low is not memcg
      aware.
      
      The code simply subtracts per-zone highmem counters from the respective
      memcg's lru sizes, which doesn't make any sense.  We can simply end up
      always seeing the resulting active and inactive counts as 0 and returning
      false.  This issue is not limited to 32b kernels, but in practice the
      effect on systems without CONFIG_HIGHMEM would be much harder to notice
      because we do not invoke the OOM killer for allocation requests
      targeting < ZONE_NORMAL.
      
      Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
      and subtract per-memcg highmem counts when memcg is enabled.  Introduce
      helper lruvec_zone_lru_size which redirects to either zone counters or
      mem_cgroup_get_zone_lru_size when appropriate.
      
      We are losing empty LRU but non-zero lru size detection introduced by
      ca707239 ("mm: update_lru_size warn and reset bad lru_size") because
      of the inherent zone vs. node discrepancy.
      
      Fixes: f8d1a311 ("mm: consider whether to decivate based on eligible zones inactive ratio")
      Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Nils Holland <nholland@tisys.org>
      Tested-by: Nils Holland <nholland@tisys.org>
      Reported-by: Klaus Ethgen <Klaus@Ethgen.de>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4536f0c
    • signal: protect SIGNAL_UNKILLABLE from unintentional clearing. · 2d39b3cd
      Committed by Jamie Iles
      Since commit 00cd5c37 ("ptrace: permit ptracing of /sbin/init") we
      can now trace init processes.  init is initially protected with
      SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
      there are a number of paths during tracing where SIGNAL_UNKILLABLE can
      be implicitly cleared.
      
      This can result in init becoming stoppable/killable after tracing.  For
      example, running:
      
        while true; do kill -STOP 1; done &
        strace -p 1
      
      and then stopping strace and the kill loop will result in init being
      left in state TASK_STOPPED.  Sending SIGCONT to init will resume it, but
      init will now respond to future SIGSTOP signals rather than ignoring
      them.
      
      Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
      that we don't clear SIGNAL_UNKILLABLE.
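
      The idea can be sketched in a few lines (the constants and names below are invented for the demo, not the kernel's real flag values): replace only the stop-related bits so an "unkillable" bit in the same word survives:

          #include <assert.h>

          #define DEMO_STOP_STOPPED    0x1u
          #define DEMO_STOP_CONTINUED  0x2u
          #define DEMO_STOP_MASK       (DEMO_STOP_STOPPED | DEMO_STOP_CONTINUED)
          #define DEMO_UNKILLABLE      0x4u

          static unsigned int set_stop_flags(unsigned int flags, unsigned int stop)
          {
                  return (flags & ~DEMO_STOP_MASK) | stop;   /* keep DEMO_UNKILLABLE */
          }

          int main(void)
          {
                  unsigned int flags = DEMO_UNKILLABLE;

                  flags = set_stop_flags(flags, DEMO_STOP_STOPPED);
                  assert(flags & DEMO_UNKILLABLE);           /* still protected */
                  return 0;
          }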
      
      Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
      Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d39b3cd
    • mm: get rid of __GFP_OTHER_NODE · 41b6167e
      Committed by Michal Hocko
      The flag was introduced by commit 78afd561 ("mm: add
      __GFP_OTHER_NODE flag") to allow proper accounting of remote node
      allocations done by kernel daemons on behalf of a process - e.g.
      khugepaged.
      
      After "mm: fix remote numa hits statistics" we do not need and actually
      use the flag so we can safely remove it because all allocations which
      are satisfied from their "home" node are accounted properly.
      
      [mhocko@suse.com: fix build]
      Link: http://lkml.kernel.org/r/20170106122225.GK5556@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170102153057.9451-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41b6167e
    • mm, slab: make sure that KMALLOC_MAX_SIZE will fit into MAX_ORDER · bb1107f7
      Committed by Michal Hocko
      Andrey Konovalov has reported the following warning triggered by the
      syzkaller fuzzer.
      
        WARNING: CPU: 1 PID: 9935 at mm/page_alloc.c:3511 __alloc_pages_nodemask+0x159c/0x1e20
        Kernel panic - not syncing: panic_on_warn set ...
        CPU: 1 PID: 9935 Comm: syz-executor0 Not tainted 4.9.0-rc7+ #34
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
          __alloc_pages_slowpath mm/page_alloc.c:3511
          __alloc_pages_nodemask+0x159c/0x1e20 mm/page_alloc.c:3781
          alloc_pages_current+0x1c7/0x6b0 mm/mempolicy.c:2072
          alloc_pages include/linux/gfp.h:469
          kmalloc_order+0x1f/0x70 mm/slab_common.c:1015
          kmalloc_order_trace+0x1f/0x160 mm/slab_common.c:1026
          kmalloc_large include/linux/slab.h:422
          __kmalloc+0x210/0x2d0 mm/slub.c:3723
          kmalloc include/linux/slab.h:495
          ep_write_iter+0x167/0xb50 drivers/usb/gadget/legacy/inode.c:664
          new_sync_write fs/read_write.c:499
          __vfs_write+0x483/0x760 fs/read_write.c:512
          vfs_write+0x170/0x4e0 fs/read_write.c:560
          SYSC_write fs/read_write.c:607
          SyS_write+0xfb/0x230 fs/read_write.c:599
          entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      The issue is caused by a lack of size check for the request size in
      ep_write_iter which should be fixed.  It, however, points to another
      problem: SLUB defines KMALLOC_MAX_SIZE too large, because its
      KMALLOC_SHIFT_MAX is (MAX_ORDER + PAGE_SHIFT) which means that the
      resulting page allocator request might be MAX_ORDER which is too large
      (see __alloc_pages_slowpath).
      
      The same applies to the SLOB allocator which allows even larger sizes.
      Make sure that they are capped properly and never request more than
      MAX_ORDER order.
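
      Worked arithmetic with typical x86-64 values (MAX_ORDER = 11 and PAGE_SHIFT = 12 are assumed for the demo): capping the shift at MAX_ORDER + PAGE_SHIFT - 1 keeps the largest kmalloc request at page-allocator order MAX_ORDER - 1.

          #include <stdio.h>

          #define MAX_ORDER   11
          #define PAGE_SHIFT  12

          int main(void)
          {
                  unsigned int shift_max = MAX_ORDER + PAGE_SHIFT - 1;   /* 22    */
                  unsigned long max_size = 1UL << shift_max;             /* 4 MiB */
                  unsigned int order     = shift_max - PAGE_SHIFT;       /* 10    */

                  printf("KMALLOC_MAX_SIZE = %lu bytes (order %u)\n", max_size, order);
                  return 0;
          }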
      
      Link: http://lkml.kernel.org/r/20161220130659.16461-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb1107f7
    • dax: wrprotect pmd_t in dax_mapping_entry_mkclean · f729c8c9
      Committed by Ross Zwisler
      Currently dax_mapping_entry_mkclean() fails to clean and write protect
      the pmd_t of a DAX PMD entry during an *sync operation.  This can result
      in data loss in the following sequence:
      
      1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
         pmd_t dirty and writeable
      2) fsync, flushing out PMD data and cleaning the radix tree entry. We
         currently fail to mark the pmd_t as clean and write protected.
      3) more mmap writes to the PMD.  These don't cause any page faults since
         the pmd_t is dirty and writeable.  The radix tree entry remains clean.
      4) fsync, which fails to flush the dirty PMD data because the radix tree
         entry was clean.
      5) crash - dirty data that should have been fsync'd as part of 4) could
         still have been in the processor cache, and is lost.
      
      Fix this by marking the pmd_t clean and write protected in
      dax_mapping_entry_mkclean(), which is called as part of the fsync
      operation 2).  This will cause the writes in step 3) above to generate
      page faults where we'll re-dirty the PMD radix tree entry, resulting in
      flushes in the fsync that happens in step 4).
      
      Fixes: 4b4bb46d ("dax: clear dirty entry tags on cache flush")
      Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f729c8c9
    • mm: add follow_pte_pmd() · 09796395
      Committed by Ross Zwisler
      Patch series "Write protect DAX PMDs in *sync path".
      
      Currently dax_mapping_entry_mkclean() fails to clean and write protect
      the pmd_t of a DAX PMD entry during an *sync operation.  This can result
      in data loss, as detailed in patch 2.
      
      This series is based on Dan's "libnvdimm-pending" branch, which is the
      current home for Jan's "dax: Page invalidation fixes" series.  You can
      find a working tree here:
      
        https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_clean
      
      This patch (of 2):
      
      Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a
      huge page PMD leaf to be found and returned.
      
      Link: http://lkml.kernel.org/r/1482272586-21177-2-git-send-email-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09796395
    • gro: Disable frag0 optimization on IPv6 ext headers · 57ea52a8
      Committed by Herbert Xu
      The GRO fast path caches the frag0 address.  This address becomes
      invalid if frag0 is modified by pskb_may_pull or its variants.
      So whenever that happens we must disable the frag0 optimization.
      
      This is usually done through the combination of gro_header_hard
      and gro_header_slow, however, the IPv6 extension header path did
      the pulling directly and would continue to use the GRO fast path
      incorrectly.
      
      This patch fixes it by disabling the fast path when we enter the
      IPv6 extension header path.
      
      Fixes: 78a478d0 ("gro: Inline skb_gro_header and cache frag0 virtual address")
      Reported-by: Slava Shwartsman <slavash@mellanox.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      57ea52a8
  10. 08 Jan 2017, 1 commit
    • mm: workingset: fix use-after-free in shadow node shrinker · ea07b862
      Committed by Johannes Weiner
      Several people report seeing warnings about inconsistent radix tree
      nodes followed by crashes in the workingset code, which all looked like
      use-after-free access from the shadow node shrinker.
      
      Dave Jones managed to reproduce the issue with a debug patch applied,
      which confirmed that the radix tree shrinking indeed frees shadow nodes
      while they are still linked to the shadow LRU:
      
        WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
        CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
        Call Trace:
           delete_node+0x1e4/0x200
           __radix_tree_delete_node+0xd/0x10
           shadow_lru_isolate+0xe6/0x220
           __list_lru_walk_one.isra.4+0x9b/0x190
           list_lru_walk_one+0x23/0x30
           scan_shadow_nodes+0x2e/0x40
           shrink_slab.part.44+0x23d/0x5d0
           shrink_node+0x22c/0x330
           kswapd+0x392/0x8f0
      
      This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the
      inlined radix_tree_shrink().
      
      The problem is with 14b46879 ("mm: workingset: move shadow entry
      tracking to radix tree exceptional tracking"), which passes an update
      callback into the radix tree to link and unlink shadow leaf nodes when
      tree entries change, but forgot to pass the callback when reclaiming a
      shadow node.
      
      While the reclaimed shadow node itself is unlinked by the shrinker, its
      deletion from the tree can cause the left-most leaf node in the tree to
      be shrunk.  If that happens to be a shadow node as well, we don't unlink
      it from the LRU as we should.
      
      Consider this tree, where the s are shadow entries:
      
             root->rnode
                  |
             [0       n]
              |       |
           [s    ] [sssss]
      
      Now the shadow node shrinker reclaims the rightmost leaf node through
      the shadow node LRU:
      
             root->rnode
                  |
             [0        ]
              |
          [s     ]
      
      Because the parent of the deleted node is the first level below the
      root and has only one child in the left-most slot, the intermediate
      level is shrunk and the node containing the single shadow is put in
      its place:
      
             root->rnode
                  |
             [s        ]
      
      The shrinker again sees a single left-most slot in a first level node
      and thus decides to store the shadow in root->rnode directly and free
      the node - which is a leaf node on the shadow node LRU.
      
        root->rnode
             |
             s
      
      Without the update callback, the freed node remains on the shadow LRU,
      where it causes later shrinker runs to crash.
      
      Pass the node updater callback into __radix_tree_delete_node() in case
      the deletion causes the left-most branch in the tree to collapse too.
      
      Also add warnings when linked nodes are freed right away, rather than
      waiting for the use-after-free when the list is scanned much later.
      
      Fixes: 14b46879 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking")
      Reported-by: Dave Chinner <david@fromorbit.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea07b862
  11. 07 Jan 2017, 2 commits
    • x86/efi: Don't allocate memmap through memblock after mm_init() · 20b1e22d
      Committed by Nicolai Stange
      With the following commit:
      
        4bc9f92e ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")
      
      ...  efi_bgrt_init() calls into the memblock allocator through
      efi_mem_reserve() => efi_arch_mem_reserve() *after* mm_init() has been called.
      
      Indeed, KASAN reports a bad read access later on in efi_free_boot_services():
      
        BUG: KASAN: use-after-free in efi_free_boot_services+0xae/0x24c
                  at addr ffff88022de12740
        Read of size 4 by task swapper/0/0
        page:ffffea0008b78480 count:0 mapcount:-127
        mapping:          (null) index:0x1 flags: 0x5fff8000000000()
        [...]
        Call Trace:
         dump_stack+0x68/0x9f
         kasan_report_error+0x4c8/0x500
         kasan_report+0x58/0x60
         __asan_load4+0x61/0x80
         efi_free_boot_services+0xae/0x24c
         start_kernel+0x527/0x562
         x86_64_start_reservations+0x24/0x26
         x86_64_start_kernel+0x157/0x17a
         start_cpu+0x5/0x14
      
      The instruction at the given address is the first read from the memmap's
      memory, i.e. the read of md->type in efi_free_boot_services().
      
      Note that the writes earlier in efi_arch_mem_reserve() don't splat because
      they're done through early_memremap()ed addresses.
      
      So, after memblock is gone, allocations should be done through the "normal"
      page allocator. Introduce a helper, efi_memmap_alloc() for this. Use
      it from efi_arch_mem_reserve(), efi_free_boot_services() and, for the sake
      of consistency, from efi_fake_memmap() as well.
      
      Note that for the latter, the memmap allocations cease to be page aligned.
      This isn't needed though.
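
      The dispatch idea reads roughly like the sketch below (illustrative only; the real helper and the exact allocator calls/headers of that era may differ in detail):

          #include <linux/memblock.h>
          #include <linux/slab.h>

          /* Pick an allocator for the memmap depending on how far boot has gone. */
          static phys_addr_t demo_memmap_alloc(unsigned long size)
          {
                  if (slab_is_available()) {
                          /* "normal" page allocator once mm is up */
                          struct page *p = alloc_pages(GFP_KERNEL, get_order(size));

                          return p ? page_to_phys(p) : 0;
                  }
                  /* early boot: memblock is still the allocator of record */
                  return memblock_alloc(size, 0);
          }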
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: <stable@vger.kernel.org> # v4.9
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Fixes: 4bc9f92e ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")
      Link: http://lkml.kernel.org/r/20170105125130.2815-1-nicstange@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      20b1e22d
    • swiotlb: Export swiotlb_max_segment to users · 7453c549
      Committed by Konrad Rzeszutek Wilk
      So users can figure out the optimal number of pages that can be
      contiguously stitched together without fear of the bounce buffer.
      
      We also expose a mechanism for sub-users of the SWIOTLB API, such
      as Xen-SWIOTLB, to set the max segment value.  And lastly,
      if swiotlb=force is set (which mandates we bounce buffer everything)
      we set max_segment so at least we can bounce buffer one 4K page
      instead of a giant 512KB one for which we may not have space.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reported-and-Tested-by: Juergen Gross <jgross@suse.com>
      7453c549
  12. 05 Jan 2017, 1 commit
    • vfio-mdev: fix non-standard ioctl return val causing i386 build fail · c6ef7fd4
      Committed by Paul Gortmaker
      What appears to be a copy-and-paste error from the line above gave
      the ioctl a ssize_t return value instead of the traditional "int".
      
      The associated sample code used "long" which meant it would compile
      for x86-64 but not i386, with the latter failing as follows:
      
        CC [M]  samples/vfio-mdev/mtty.o
      samples/vfio-mdev/mtty.c:1418:20: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
        .ioctl          = mtty_ioctl,
                          ^
      samples/vfio-mdev/mtty.c:1418:20: note: (near initialization for ‘mdev_fops.ioctl’)
      cc1: some warnings being treated as errors
      
      Since in this case vfio is working with struct file_operations, which has:
      
          long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
          long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
      
      ...and so here we just standardize on long vs. the normal int that user
      space typically sees and documents as per "man ioctl" and similar.
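
      A minimal sketch of a callback declared with the standardized return type (the function name is hypothetical):

          #include <linux/mdev.h>
          #include <linux/errno.h>

          /* An mdev ioctl callback using "long" so it matches on i386 and
           * x86-64 alike. */
          static long demo_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd,
                                      unsigned long arg)
          {
                  return -ENOTTY;         /* no commands handled in this sketch */
          }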
      
      Fixes: 9d1a546c ("docs: Sample driver to demonstrate how to use Mediated device framework.")
      Cc: Kirti Wankhede <kwankhede@nvidia.com>
      Cc: Neo Jia <cjia@nvidia.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      c6ef7fd4
  13. 31 Dec 2016, 1 commit
    • iio: accel: st_accel: fix LIS3LV02 reading and scaling · 65e4345c
      Committed by Linus Walleij
      The LIS3LV02 has a special bit that needs to be set to get the
      read values left aligned. Before this patch we get gibberish
      like this:
      
      iio_generic_buffer -a -c10 -n lis3lv02dl_accel
      (...)
      0.000000 -0.010042 -0.642688 19155832931907
      0.000000 -0.010042 -0.642688 19155858751073
      
      This is because we read the raw value for 1g as 64, which is
      the nominal 1024 for 1g shifted 4 bits to the right, since the value
      is right-aligned rather than left-aligned.
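
      The 4-bit arithmetic, worked out in a tiny standalone demo:

          /* The nominal 1g code is 1024; reading it right-aligned where
           * left-aligned data is expected makes it appear 16x (4 bits) smaller. */
          #include <stdio.h>

          int main(void)
          {
                  int nominal_1g = 1024;
                  int as_read    = nominal_1g >> 4;   /* 4-bit misalignment */

                  printf("expected 1g code: %d, observed: %d\n", nominal_1g, as_read);
                  return 0;
          }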
      
      Since all other sensors are left aligned, add some code to
      set the special DAS (data alignment setting) bit to 1 so that
      the right value is now read like this:
      
      iio_generic_buffer -a -c10 -n lis3lv02dl_accel
      (...)
      0.000000 -0.147095 -10.120135 24761614364956
      -0.029419 -0.176514 -10.120135 24761631624540
      
      The scaling was weird as well: we have a gain of 1000 for 1g
      and 3000 for 6g. I don't even remember how I came up with the
      old values but they are wrong.
      
      Fixes: 3acddf74 ("iio: st-sensors: add support for lis3lv02d accelerometer")
      Cc: Lorenzo Bianconi <lorenzo.bianconi@st.com>
      Cc: Giuseppe Barba <giuseppe.barba@st.com>
      Cc: Denis Ciocca <denis.ciocca@st.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Jonathan Cameron <jic23@kernel.org>
      65e4345c
  14. 30 Dec 2016, 6 commits
    • vfio-mdev: Make mdev_device private and abstract interfaces · 99e3123e
      Committed by Alex Williamson
      Abstract access to mdev_device so that we can define which interfaces
      are public rather than relying on comments in the structure.
      
      Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
      Cc: Zhi Wang <zhi.a.wang@intel.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Jike Song <jike.song@intel.com>
      Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
      99e3123e
    • vfio-mdev: Make mdev_parent private · 9372e6fe
      Committed by Alex Williamson
      Rather than hoping for good behavior by marking some elements
      internal, enforce it by making the entire structure private and
      creating an accessor function for the one useful external field.
      
      Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
      Cc: Zhi Wang <zhi.a.wang@intel.com>
      Cc: Jike Song <jike.song@intel.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
      9372e6fe
    • vfio-mdev: de-polute the namespace, rename parent_device & parent_ops · 42930553
      Committed by Alex Williamson
      Add an mdev_ prefix so we're not polluting the namespace so much.
      
      Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
      Cc: Zhi Wang <zhi.a.wang@intel.com>
      Cc: Jike Song <jike.song@intel.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
      42930553
    • Revert "remoteproc: Merge table_ptr and cached_table pointers" · a0c10687
      Committed by Bjorn Andersson
      Following any fw_rsc_vdev entries in the resource table are two variable
      length arrays: the first one references vring resources and the second
      one is the virtio config space.  The virtio config space is used by
      virtio to communicate status and configuration changes and must as such
      be shared with the remote.
      
      The reverted commit incorrectly made any changes to the virtio config
      space only affect the local copy, in an attempt to allowing memory
      protection of the shared resource table.
      
      This reverts commit cda85293.
      Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
      a0c10687
    • net/mlx4_core: Fix raw qp flow steering rules under SRIOV · 10b1c04e
      Committed by Jack Morgenstein
      Demoting simple flow steering rule priority (for DPDK) was achieved by
      wrapping FW commands MLX4_QP_FLOW_STEERING_ATTACH/DETACH for the PF
      as well, and forcing the priority to MLX4_DOMAIN_NIC in the wrapper
      function for the PF and all VFs.
      
      In function mlx4_ib_create_flow(), this change caused the main rule
      creation for the PF to be wrapped, while it left the associated
      tunnel steering rule creation unwrapped for the PF.
      
      This mismatch caused rule deletion failures in mlx4_ib_destroy_flow()
      for the PF when the detach wrapper function did not find the associated
      tunnel-steering rule (since creation of that rule for the PF did not
      go through the wrapper function).
      
      Fix this by setting MLX4_QP_FLOW_STEERING_ATTACH/DETACH to be "native"
      (so that the PF invocation does not go through the wrapper), and perform
      the required priority demotion for the PF in the mlx4_ib_create_flow()
      code path.
      
      Fixes: 48564135 ("net/mlx4_core: Demote simple multicast and broadcast flow steering rules")
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10b1c04e
    • mm: optimize PageWaiters bit use for unlock_page() · b91e1302
      Committed by Linus Torvalds
      In commit 62906027 ("mm: add PageWaiters indicating tasks are
      waiting for a page bit") Nick Piggin made our page locking no longer
      unconditionally touch the hashed page waitqueue, which not only helps
      performance in general, but is particularly helpful on NUMA machines
      where the hashed wait queues can bounce around a lot.
      
      However, the "clear lock bit atomically and then test the waiters bit"
      sequence turns out to be much more expensive than it needs to be,
      because you get a nasty stall when trying to access the same word that
      just got updated atomically.
      
      On architectures where locking is done with LL/SC, this would be trivial
      to fix with a new primitive that clears one bit and tests another
      atomically, but that ends up not working on x86, where the only atomic
      operations that return the result end up being cmpxchg and xadd.  The
      atomic bit operations return the old value of the same bit we changed,
      not the value of an unrelated bit.
      
      On x86, we could put the lock bit in the high bit of the byte, and use
      "xadd" with that bit (where the overflow ends up not touching other
      bits), and look at the other bits of the result.  However, an even
      simpler model is to just use a regular atomic "and" to clear the lock
      bit, and then the sign bit in eflags will indicate the resulting state
      of the unrelated bit #7.
      
      So by moving the PageWaiters bit up to bit #7, we can atomically clear
      the lock bit and test the waiters bit on x86 too.  And architectures
      with LL/SC (which is all the usual RISC suspects), the particular bit
      doesn't matter, so they are fine with this approach too.
      
      This avoids the extra access to the same atomic word, and thus avoids
      the costly stall at page unlock time.
      
      The only downside is that the interface ends up being a bit odd and
      specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
      love the resulting name of the new primitive, but I'd rather make the
      name be descriptive and very clear about the limitation imposed by
      trying to work across all relevant architectures than make it be some
      generic thing that doesn't make the odd semantics explicit.
      
      So this introduces the new architecture primitive
      
          clear_bit_unlock_is_negative_byte();
      
      and adds the trivial implementation for x86.  We have a generic
      non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
      combination) which can be overridden by any architecture that can do
      better.  According to Nick, Power has the same hickup x86 has, for
      example, but some other architectures may not even care.
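
      Written out, the generic fallback described above is roughly (kernel-context sketch, renamed for clarity):

          /* Clear the lock bit with release semantics, then test bit 7
           * (PG_waiters) of the same word. */
          #include <linux/bitops.h>
          #include <linux/page-flags.h>

          static inline bool demo_clear_bit_unlock_is_negative_byte(long nr,
                                          volatile unsigned long *mem)
          {
                  clear_bit_unlock(nr, mem);
                  return test_bit(PG_waiters, mem);
          }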
      
      All these optimizations mean that my page locking stress-test (which is
      just executing a lot of small short-lived shell scripts: "make test" in
      the git source tree) no longer makes our page locking look horribly bad.
      Before all these optimizations, just the unlock_page() costs were just
      over 3% of all CPU overhead on "make test".  After this, it's down to
      0.66%, so just a quarter of the cost it used to be.
      
      (The difference on NUMA is bigger, but there this micro-optimization is
      likely less noticeable, since the big issue on NUMA was not the accesses
      to 'struct page', but the waitqueue accesses that were already removed
      by Nick's earlier commit).
      Acked-by: Nick Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b91e1302
  15. 29 Dec 2016, 1 commit
  16. 28 Dec 2016, 1 commit
  17. 27 Dec 2016, 1 commit
    • mm: Invalidate DAX radix tree entries only if appropriate · c6dcf52c
      Committed by Jan Kara
      Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
      just delete all exceptional radix tree entries they find. For DAX this
      is not desirable as we track cache dirtiness in these entries and when
      they are evicted, we may not flush caches although it is necessary. This
      can, for example, manifest when we write to the same block both via mmap
      and via write(2) (to different offsets), and fsync(2) then does not
      properly flush CPU caches when the modification via write(2) was the last
      one.
      
      Create appropriate DAX functions to handle invalidation of DAX entries
      for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
      wire them up into the corresponding mm functions.
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      c6dcf52c
  18. 26 Dec 2016, 1 commit
    • mm: add PageWaiters indicating tasks are waiting for a page bit · 62906027
      Committed by Nicholas Piggin
      Add a new page flag, PageWaiters, to indicate the page waitqueue has
      tasks waiting. This can be tested rather than testing waitqueue_active
      which requires another cacheline load.
      
      This bit is always set when the page has tasks on page_waitqueue(page),
      and is set and cleared under the waitqueue lock. It may be set when
      there are no tasks on the waitqueue, which will cause a harmless extra
      wakeup check that will clear the bit.
      
      The generic bit-waitqueue infrastructure is no longer used for pages.
      Instead, waitqueues are used directly with a custom key type. The
      generic code was not flexible enough to have PageWaiters manipulation
      under the waitqueue lock (which simplifies concurrency).
      
      This improves the performance of page lock intensive microbenchmarks by
      2-3%.
      
      Putting two bits in the same word opens the opportunity to remove the
      memory barrier between clearing the lock bit and testing the waiters
      bit, after some work on the arch primitives (e.g., ensuring memory
      operand widths match and cover both bits).
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62906027