1. 27 4月, 2016 1 次提交
  2. 13 3月, 2016 1 次提交
  3. 10 3月, 2016 4 次提交
    • Z
      dma-mapping: avoid oops when parameter cpu_addr is null · d6b7eaeb
      Zhen Lei 提交于
      To keep consistent with kfree, which tolerate ptr is NULL.  We do this
      because sometimes we may use goto statement, so that success and failure
      case can share parts of the code.  But unfortunately, dma_free_coherent
      called with parameter cpu_addr is null will cause oops, such as showed
      below:
      
        Unable to handle kernel paging request at virtual address ffffffc020d3b2b8
        pgd = ffffffc083a61000
        [ffffffc020d3b2b8] *pgd=0000000000000000, *pud=0000000000000000
        CPU: 4 PID: 1489 Comm: malloc_dma_1 Tainted: G           O    4.1.12 #1
        Hardware name: ARM64 (DT)
        PC is at __dma_free_coherent.isra.10+0x74/0xc8
        LR is at __dma_free+0x9c/0xb0
        Process malloc_dma_1 (pid: 1489, stack limit = 0xffffffc0837fc020)
        [...]
        Call trace:
          __dma_free_coherent.isra.10+0x74/0xc8
          __dma_free+0x9c/0xb0
          malloc_dma+0x104/0x158 [dma_alloc_coherent_mtmalloc]
          kthread+0xec/0xfc
      Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6b7eaeb
    • M
      kasan: add functions to clear stack poison · e3ae1163
      Mark Rutland 提交于
      Functions which the compiler has instrumented for ASAN place poison on
      the stack shadow upon entry and remove this poison prior to returning.
      
      In some cases (e.g. hotplug and idle), CPUs may exit the kernel a
      number of levels deep in C code.  If there are any instrumented
      functions on this critical path, these will leave portions of the idle
      thread stack shadow poisoned.
      
      If a CPU returns to the kernel via a different path (e.g. a cold
      entry), then depending on stack frame layout subsequent calls to
      instrumented functions may use regions of the stack with stale poison,
      resulting in (spurious) KASAN splats to the console.
      
      Contemporary GCCs always add stack shadow poisoning when ASAN is
      enabled, even when asked to not instrument a function [1], so we can't
      simply annotate functions on the critical path to avoid poisoning.
      
      Instead, this series explicitly removes any stale poison before it can
      be hit.  In the common hotplug case we clear the entire stack shadow in
      common code, before a CPU is brought online.
      
      On architectures which perform a cold return as part of cpu idle may
      retain an architecture-specific amount of stack contents.  To retain the
      poison for this retained context, the arch code must call the core KASAN
      code, passing a "watermark" stack pointer value beyond which shadow will
      be cleared.  Architectures which don't perform a cold return as part of
      idle do not need any additional code.
      
      This patch (of 3):
      
      Functions which the compiler has instrumented for KASAN place poison on
      the stack shadow upon entry and remove this poision prior to returning.
      
      In some cases (e.g.  hotplug and idle), CPUs may exit the kernel a number
      of levels deep in C code.  If there are any instrumented functions on this
      critical path, these will leave portions of the stack shadow poisoned.
      
      If a CPU returns to the kernel via a different path (e.g.  a cold entry),
      then depending on stack frame layout subsequent calls to instrumented
      functions may use regions of the stack with stale poison, resulting in
      (spurious) KASAN splats to the console.
      
      To avoid this, we must clear stale poison from the stack prior to
      instrumented functions being called.  This patch adds functions to the
      KASAN core for removing poison from (portions of) a task's stack.  These
      will be used by subsequent patches to avoid problems with hotplug and
      idle.
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e3ae1163
    • D
      list: kill list_force_poison() · d77a117e
      Dan Williams 提交于
      Given we have uninitialized list_heads being passed to list_add() it
      will always be the case that those uninitialized values randomly trigger
      the poison value.  Especially since a list_add() operation will seed the
      stack with the poison value for later stack allocations to trip over.
      
      For example, see these two false positive reports:
      
        list_add attempted on force-poisoned entry
        WARNING: at lib/list_debug.c:34
        [..]
        NIP [c00000000043c390] __list_add+0xb0/0x150
        LR [c00000000043c38c] __list_add+0xac/0x150
        Call Trace:
          __list_add+0xac/0x150 (unreliable)
          __down+0x4c/0xf8
          down+0x68/0x70
          xfs_buf_lock+0x4c/0x150 [xfs]
      
        list_add attempted on force-poisoned entry(0000000000000500),
         new->next == d0000000059ecdb0, new->prev == 0000000000000500
        WARNING: at lib/list_debug.c:33
        [..]
        NIP [c00000000042db78] __list_add+0xa8/0x140
        LR [c00000000042db74] __list_add+0xa4/0x140
        Call Trace:
          __list_add+0xa4/0x140 (unreliable)
          rwsem_down_read_failed+0x6c/0x1a0
          down_read+0x58/0x60
          xfs_log_commit_cil+0x7c/0x600 [xfs]
      
      Fixes: commit 5c2c2587 ("mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup")
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Reported-by: NEryu Guan <eguan@redhat.com>
      Tested-by: NEryu Guan <eguan@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d77a117e
    • S
      tracing: Fix check for cpu online when event is disabled · dc17147d
      Steven Rostedt (Red Hat) 提交于
      Commit f3775549 ("tracepoints: Do not trace when cpu is offline") added
      a check to make sure that tracepoints only get called when the cpu is
      online, as it uses rcu_read_lock_sched() for protection.
      
      Commit 3a630178 ("tracing: generate RCU warnings even when tracepoints
      are disabled") added lockdep checks (including rcu checks) for events that
      are not enabled to catch possible RCU issues that would only be triggered if
      a trace event was enabled. Commit f3775549 only stopped the warnings
      when the trace event was enabled but did not prevent warnings if the trace
      event was called when disabled.
      
      To fix this, the cpu online check is moved to where the condition is added
      to the trace event. This will place the cpu online check in all places that
      it may be used now and in the future.
      
      Cc: stable@vger.kernel.org # v3.18+
      Fixes: f3775549 ("tracepoints: Do not trace when cpu is offline")
      Fixes: 3a630178 ("tracing: generate RCU warnings even when tracepoints are disabled")
      Reported-by: NSudeep Holla <sudeep.holla@arm.com>
      Tested-by: NSudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      dc17147d
  4. 05 3月, 2016 1 次提交
  5. 04 3月, 2016 8 次提交
  6. 03 3月, 2016 1 次提交
  7. 01 3月, 2016 1 次提交
  8. 28 2月, 2016 3 次提交
    • R
      dax: move writeback calls into the filesystems · 7f6d5b52
      Ross Zwisler 提交于
      Previously calls to dax_writeback_mapping_range() for all DAX filesystems
      (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
      
      dax_writeback_mapping_range() needs a struct block_device, and it used
      to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
      mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
      block devices and for XFS real-time files.
      
      Instead, call dax_writeback_mapping_range() directly from the filesystem
      ->writepages function so that it can supply us with a valid block
      device.  This also fixes DAX code to properly flush caches in response
      to sync(2).
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f6d5b52
    • R
      dax: give DAX clearing code correct bdev · 20a90f58
      Ross Zwisler 提交于
      dax_clear_blocks() needs a valid struct block_device and previously it
      was using inode->i_sb->s_bdev in all cases.  This is correct for normal
      inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
      DAX raw block devices and for XFS real-time devices.
      
      Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
      its arguments to take a bdev and a sector instead of an inode and a
      block.  This better reflects what the function does, and it allows the
      filesystem and raw block device code to pass in an appropriate struct
      block_device.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20a90f58
    • D
      drivers: char: random: add get_random_long() · ec9ee4ac
      Daniel Cashman 提交于
      Commit d07e2259 ("mm: mmap: add new /proc tunable for mmap_base
      ASLR") added the ability to choose from a range of values to use for
      entropy count in generating the random offset to the mmap_base address.
      
      The maximum value on this range was set to 32 bits for 64-bit x86
      systems, but this value could be increased further, requiring more than
      the 32 bits of randomness provided by get_random_int(), as is already
      possible for arm64.  Add a new function: get_random_long() which more
      naturally fits with the mmap usage of get_random_int() but operates
      exactly the same as get_random_int().
      
      Also, fix the shifting constant in mmap_rnd() to be an unsigned long so
      that values greater than 31 bits generate an appropriate mask without
      overflow.  This is especially important on x86, as its shift instruction
      uses a 5-bit mask for the shift operand, which meant that any value for
      mmap_rnd_bits over 31 acts as a no-op and effectively disables mmap_base
      randomization.
      
      Finally, replace calls to get_random_int() with get_random_long() where
      appropriate.
      
      This patch (of 2):
      
      Add get_random_long().
      Signed-off-by: NDaniel Cashman <dcashman@android.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec9ee4ac
  9. 26 2月, 2016 1 次提交
    • H
      libata: Align ata_device's id on a cacheline · 4ee34ea3
      Harvey Hunt 提交于
      The id buffer in ata_device is a DMA target, but it isn't explicitly
      cacheline aligned. Due to this, adjacent fields can be overwritten with
      stale data from memory on non coherent architectures. As a result, the
      kernel is sometimes unable to communicate with an ATA device.
      
      Fix this by ensuring that the id buffer is cacheline aligned.
      
      This issue is similar to that fixed by Commit 84bda12a
      ("libata: align ap->sector_buf").
      Signed-off-by: NHarvey Hunt <harvey.hunt@imgtec.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 2.6.18
      Signed-off-by: NTejun Heo <tj@kernel.org>
      4ee34ea3
  10. 25 2月, 2016 2 次提交
    • P
      perf: Fix race between event install and jump_labels · 9107c89e
      Peter Zijlstra 提交于
      perf_install_in_context() relies upon the context switch hooks to have
      scheduled in events when the IPI misses its target -- after all, if
      the task has moved from the CPU (or wasn't running at all), it will
      have to context switch to run elsewhere.
      
      This however doesn't appear to be happening.
      
      It is possible for the IPI to not happen (task wasn't running) only to
      later observe the task running with an inactive context.
      
      The only possible explanation is that the context switch hooks are not
      called. Therefore put in a sync_sched() after toggling the jump_label
      to guarantee all CPUs will have them enabled before we install an
      event.
      
      A simple if (0->1) sync_sched() will not in fact work, because any
      further increment can race and complete before the sync_sched().
      Therefore we must jump through some hoops.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160224174947.980211985@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9107c89e
    • P
      perf: Fix cloning · a69b0ca4
      Peter Zijlstra 提交于
      Alexander reported that when the 'original' context gets destroyed, no
      new clones happen.
      
      This can happen irrespective of the ctx switch optimization, any task
      can die, even the parent, and we want to continue monitoring the task
      hierarchy until we either close the event or no tasks are left in the
      hierarchy.
      
      perf_event_init_context() will attempt to pin the 'parent' context
      during clone(). At that point current is the parent, and since current
      cannot have exited while executing clone(), its context cannot have
      passed through perf_event_exit_task_context(). Therefore
      perf_pin_task_context() cannot observe ctx->task == TASK_TOMBSTONE.
      
      However, since inherit_event() does:
      
      	if (parent_event->parent)
      		parent_event = parent_event->parent;
      
      it looks at the 'original' event when it does: is_orphaned_event().
      This can return true if the context that contains the this event has
      passed through perf_event_exit_task_context(). And thus we'll fail to
      clone the perf context.
      
      Fix this by adding a new state: STATE_DEAD, which is set by
      perf_release() to indicate that the filedesc (or kernel reference) is
      dead and there are no observers for our data left.
      
      Only for STATE_DEAD will is_orphaned_event() be true and inhibit
      cloning.
      
      STATE_EXIT is otherwise preserved such that is_event_hup() remains
      functional and will report when the observed task hierarchy becomes
      empty.
      Reported-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Tested-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Fixes: c6e5b732 ("perf: Synchronously clean up child events")
      Link: http://lkml.kernel.org/r/20160224174947.919845295@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a69b0ca4
  11. 24 2月, 2016 1 次提交
  12. 22 2月, 2016 2 次提交
  13. 20 2月, 2016 2 次提交
    • D
      libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing · 747ffe11
      Dan Williams 提交于
      Use the output length specified in the command to size the receive
      buffer rather than the arbitrary 4K limit.
      
      This bug was hiding the fact that the ndctl implementation of
      ndctl_bus_cmd_new_ars_status() was not specifying an output buffer size.
      
      Cc: <stable@vger.kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      747ffe11
    • N
      net: make netdev_for_each_lower_dev safe for device removal · cfdd28be
      Nikolay Aleksandrov 提交于
      When I used netdev_for_each_lower_dev in commit bad53162 ("vrf:
      remove slave queue and private slave struct") I thought that it acts
      like netdev_for_each_lower_private and can be used to remove the current
      device from the list while walking, but unfortunately it acts more like
      netdev_for_each_lower_private_rcu and doesn't allow it. The difference
      is where the "iter" points to, right now it points to the current element
      and that makes it impossible to remove it. Change the logic to be
      similar to netdev_for_each_lower_private and make it point to the "next"
      element so we can safely delete the current one. VRF is the only such
      user right now, there's no change for the read-only users.
      
      Here's what can happen now:
      [98423.249858] general protection fault: 0000 [#1] SMP
      [98423.250175] Modules linked in: vrf bridge(O) stp llc nfsd auth_rpcgss
      oid_registry nfs_acl nfs lockd grace sunrpc crct10dif_pclmul
      crc32_pclmul crc32c_intel ghash_clmulni_intel jitterentropy_rng
      sha256_generic hmac drbg ppdev aesni_intel aes_x86_64 glue_helper lrw
      gf128mul ablk_helper cryptd evdev serio_raw pcspkr virtio_balloon
      parport_pc parport i2c_piix4 i2c_core virtio_console acpi_cpufreq button
      9pnet_virtio 9p 9pnet fscache ipv6 autofs4 ext4 crc16 mbcache jbd2 sg
      virtio_blk virtio_net sr_mod cdrom e1000 ata_generic ehci_pci uhci_hcd
      ehci_hcd usbcore usb_common virtio_pci ata_piix libata floppy
      virtio_ring virtio scsi_mod [last unloaded: bridge]
      [98423.255040] CPU: 1 PID: 14173 Comm: ip Tainted: G           O
      4.5.0-rc2+ #81
      [98423.255386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS 1.8.1-20150318_183358- 04/01/2014
      [98423.255777] task: ffff8800547f5540 ti: ffff88003428c000 task.ti:
      ffff88003428c000
      [98423.256123] RIP: 0010:[<ffffffff81514f3e>]  [<ffffffff81514f3e>]
      netdev_lower_get_next+0x1e/0x30
      [98423.256534] RSP: 0018:ffff88003428f940  EFLAGS: 00010207
      [98423.256766] RAX: 0002000100000004 RBX: ffff880054ff9000 RCX:
      0000000000000000
      [98423.257039] RDX: ffff88003428f8b8 RSI: ffff88003428f950 RDI:
      ffff880054ff90c0
      [98423.257287] RBP: ffff88003428f940 R08: 0000000000000000 R09:
      0000000000000000
      [98423.257537] R10: 0000000000000001 R11: 0000000000000000 R12:
      ffff88003428f9e0
      [98423.257802] R13: ffff880054a5fd00 R14: ffff88003428f970 R15:
      0000000000000001
      [98423.258055] FS:  00007f3d76881700(0000) GS:ffff88005d000000(0000)
      knlGS:0000000000000000
      [98423.258418] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [98423.258650] CR2: 00007ffe5951ffa8 CR3: 0000000052077000 CR4:
      00000000000406e0
      [98423.258902] Stack:
      [98423.259075]  ffff88003428f960 ffffffffa0442636 0002000100000004
      ffff880054ff9000
      [98423.259647]  ffff88003428f9b0 ffffffff81518205 ffff880054ff9000
      ffff88003428f978
      [98423.260208]  ffff88003428f978 ffff88003428f9e0 ffff88003428f9e0
      ffff880035b35f00
      [98423.260739] Call Trace:
      [98423.260920]  [<ffffffffa0442636>] vrf_dev_uninit+0x76/0xa0 [vrf]
      [98423.261156]  [<ffffffff81518205>]
      rollback_registered_many+0x205/0x390
      [98423.261401]  [<ffffffff815183ec>] unregister_netdevice_many+0x1c/0x70
      [98423.261641]  [<ffffffff8153223c>] rtnl_delete_link+0x3c/0x50
      [98423.271557]  [<ffffffff815335bb>] rtnl_dellink+0xcb/0x1d0
      [98423.271800]  [<ffffffff811cd7da>] ? __inc_zone_state+0x4a/0x90
      [98423.272049]  [<ffffffff815337b4>] rtnetlink_rcv_msg+0x84/0x200
      [98423.272279]  [<ffffffff810cfe7d>] ? trace_hardirqs_on+0xd/0x10
      [98423.272513]  [<ffffffff8153370b>] ? rtnetlink_rcv+0x1b/0x40
      [98423.272755]  [<ffffffff81533730>] ? rtnetlink_rcv+0x40/0x40
      [98423.272983]  [<ffffffff8155d6e7>] netlink_rcv_skb+0x97/0xb0
      [98423.273209]  [<ffffffff8153371a>] rtnetlink_rcv+0x2a/0x40
      [98423.273476]  [<ffffffff8155ce8b>] netlink_unicast+0x11b/0x1a0
      [98423.273710]  [<ffffffff8155d2f1>] netlink_sendmsg+0x3e1/0x610
      [98423.273947]  [<ffffffff814fbc98>] sock_sendmsg+0x38/0x70
      [98423.274175]  [<ffffffff814fc253>] ___sys_sendmsg+0x2e3/0x2f0
      [98423.274416]  [<ffffffff810d841e>] ? do_raw_spin_unlock+0xbe/0x140
      [98423.274658]  [<ffffffff811e1bec>] ? handle_mm_fault+0x26c/0x2210
      [98423.274894]  [<ffffffff811e19cd>] ? handle_mm_fault+0x4d/0x2210
      [98423.275130]  [<ffffffff81269611>] ? __fget_light+0x91/0xb0
      [98423.275365]  [<ffffffff814fcd42>] __sys_sendmsg+0x42/0x80
      [98423.275595]  [<ffffffff814fcd92>] SyS_sendmsg+0x12/0x20
      [98423.275827]  [<ffffffff81611bb6>] entry_SYSCALL_64_fastpath+0x16/0x7a
      [98423.276073] Code: c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66
      90 48 8b 06 55 48 81 c7 c0 00 00 00 48 89 e5 48 8b 00 48 39 f8 74 09 48
      89 06 <48> 8b 40 e8 5d c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66
      [98423.279639] RIP  [<ffffffff81514f3e>] netdev_lower_get_next+0x1e/0x30
      [98423.279920]  RSP <ffff88003428f940>
      
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Vlad Yasevich <vyasevic@redhat.com>
      Fixes: bad53162 ("vrf: remove slave queue and private slave struct")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfdd28be
  14. 19 2月, 2016 1 次提交
    • J
      Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread" · 13d34ac6
      Jeff Layton 提交于
      This reverts commit c510eff6 ("fsnotify: destroy marks with
      call_srcu instead of dedicated thread").
      
      Eryu reported that he was seeing some OOM kills kick in when running a
      testcase that adds and removes inotify marks on a file in a tight loop.
      
      The above commit changed the code to use call_srcu to clean up the
      marks.  While that does (in principle) work, the srcu callback job is
      limited to cleaning up entries in small batches and only once per jiffy.
      It's easily possible to overwhelm that machinery with too many call_srcu
      callbacks, and Eryu's reproduer did just that.
      
      There's also another potential problem with using call_srcu here.  While
      you can obviously sleep while holding the srcu_read_lock, the callbacks
      run under local_bh_disable, so you can't sleep there.
      
      It's possible when putting the last reference to the fsnotify_mark that
      we'll end up putting a chain of references including the fsnotify_group,
      uid, and associated keys.  While I don't see any obvious ways that that
      could occurs, it's probably still best to avoid using call_srcu here
      after all.
      
      This patch reverts the above patch.  A later patch will take a different
      approach to eliminated the dedicated thread here.
      Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
      Reported-by: NEryu Guan <guaneryu@gmail.com>
      Tested-by: NEryu Guan <guaneryu@gmail.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13d34ac6
  15. 18 2月, 2016 3 次提交
  16. 17 2月, 2016 2 次提交
    • H
      net/mlx4_core: Set UAR page size to 4KB regardless of system page size · 85743f1e
      Huy Nguyen 提交于
      problem description:
      
      The current code sets UAR page size equal to system page size.
      The ConnectX-3 and ConnectX-3 Pro HWs require minimum 128 UAR pages.
      The mlx4 kernel drivers are not loaded if there is less than 128 UAR pages.
      
      solution:
      
      Always set UAR page to 4KB. This allows more UAR pages if the OS
      has PAGE_SIZE larger than 4KB. For example, PowerPC kernel use 64KB
      system page size, with 4MB uar region, there are 4MB/2/64KB = 32
      uars (half for uar, half for blueflame). This does not meet minimum 128
      UAR pages requirement. With 4KB UAR page, there are 4MB/2/4KB = 512 uars
      which meet the minimum requirement.
      
      Note that only codes in mlx4_core that deal with firmware know that uar
      page size is 4KB. Codes that deal with usr page in cq and qp context
      (mlx4_ib, mlx4_en and part of mlx4_core) still have the same assumption
      that uar page size equals to system page size.
      
      Note that with this implementation, on 64KB system page size kernel, there
      are 16 uars per system page but only one uars is used. The other 15
      uars are ignored because of the above assumption.
      
      Regarding SR-IOV, mlx4_core in hypervisor will set the uar page size
      to 4KB and mlx4_core code in virtual OS will obtain the uar page size from
      firmware.
      
      Regarding backward compatibility in SR-IOV, if hypervisor has this new code,
      the virtual OS must be updated. If hypervisor has old code, and the virtual
      OS has this new code, the new code will be backward compatible with the
      old code. If the uar size is big enough, this new code in VF continues to
      work with 64 KB uar page size (on PowerPc kernel). If the uar size does not
      meet 128 uars requirement, this new code not loaded in VF and print the same
      error message as the old code in Hypervisor.
      Signed-off-by: NHuy Nguyen <huyn@mellanox.com>
      Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      85743f1e
    • M
      net/mlx5: Use offset based reserved field names in the IFC header file · b4ff3a36
      Matan Barak 提交于
      mlx5_ifc.h is a header file representing the API and ABI between
      the driver to the firmware and hardware. This file is used from
      both the mlx5_ib and mlx5_core drivers.
      
      Previously, this file used incrementing counter to indicate
      reserved fields, for example:
      
      struct mlx5_ifc_odp_per_transport_service_cap_bits {
              u8         send[0x1];
              u8         receive[0x1];
              u8         write[0x1];
              u8         read[0x1];
              u8         reserved_0[0x1];
              u8         srq_receive[0x1];
              u8         reserved_1[0x1a];
      };
      
      If one developer implements through net-next feature A that uses
      reserved_0, they replace it with featureA and renames reserved_1 to
      reserved_0. In the same kernel cycle, a 2nd developer could implement
      feature B through the rdma tree, that uses reserved_1 and split it to
      featureB and a smaller reserved_1 field. This will cause a conflict
      when the two trees are merged.
      
      The source of this conflict is that the 1st developer changed *all*
      reserved fields.
      
      As Linus suggested, we change the layout of structs to:
      
      struct mlx5_ifc_odp_per_transport_service_cap_bits {
      	u8         send[0x1];
      	u8         receive[0x1];
      	u8         write[0x1];
      	u8         read[0x1];
      	u8         reserved_at_4[0x1];
      	u8         srq_receive[0x1];
      	u8         reserved_at_6[0x1a];
      };
      
      This makes the conflicts much more rare and preserves the locality of
      changes.
      Signed-off-by: NMatan Barak <matanb@mellanox.com>
      Signed-off-by: NAlaa Hleihel <alaa@mellanox.com>
      Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4ff3a36
  17. 16 2月, 2016 2 次提交
    • A
      tracing: Fix freak link error caused by branch tracer · b33c8ff4
      Arnd Bergmann 提交于
      In my randconfig tests, I came across a bug that involves several
      components:
      
      * gcc-4.9 through at least 5.3
      * CONFIG_GCOV_PROFILE_ALL enabling -fprofile-arcs for all files
      * CONFIG_PROFILE_ALL_BRANCHES overriding every if()
      * The optimized implementation of do_div() that tries to
        replace a library call with an division by multiplication
      * code in drivers/media/dvb-frontends/zl10353.c doing
      
              u32 adc_clock = 450560; /* 45.056 MHz */
              if (state->config.adc_clock)
                      adc_clock = state->config.adc_clock;
              do_div(value, adc_clock);
      
      In this case, gcc fails to determine whether the divisor
      in do_div() is __builtin_constant_p(). In particular, it
      concludes that __builtin_constant_p(adc_clock) is false, while
      __builtin_constant_p(!!adc_clock) is true.
      
      That in turn throws off the logic in do_div() that also uses
      __builtin_constant_p(), and instead of picking either the
      constant- optimized division, and the code in ilog2() that uses
      __builtin_constant_p() to figure out whether it knows the answer at
      compile time. The result is a link error from failing to find
      multiple symbols that should never have been called based on
      the __builtin_constant_p():
      
      dvb-frontends/zl10353.c:138: undefined reference to `____ilog2_NaN'
      dvb-frontends/zl10353.c:138: undefined reference to `__aeabi_uldivmod'
      ERROR: "____ilog2_NaN" [drivers/media/dvb-frontends/zl10353.ko] undefined!
      ERROR: "__aeabi_uldivmod" [drivers/media/dvb-frontends/zl10353.ko] undefined!
      
      This patch avoids the problem by changing __trace_if() to check
      whether the condition is known at compile-time to be nonzero, rather
      than checking whether it is actually a constant.
      
      I see this one link error in roughly one out of 1600 randconfig builds
      on ARM, and the patch fixes all known instances.
      
      Link: http://lkml.kernel.org/r/1455312410-1058841-1-git-send-email-arnd@arndb.deAcked-by: NNicolas Pitre <nico@linaro.org>
      Fixes: ab3c9c68 ("branch tracer, intel-iommu: fix build with CONFIG_BRANCH_TRACER=y")
      Cc: stable@vger.kernel.org # v2.6.30+
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      b33c8ff4
    • S
      tracepoints: Do not trace when cpu is offline · f3775549
      Steven Rostedt (Red Hat) 提交于
      The tracepoint infrastructure uses RCU sched protection to enable and
      disable tracepoints safely. There are some instances where tracepoints are
      used in infrastructure code (like kfree()) that get called after a CPU is
      going offline, and perhaps when it is coming back online but hasn't been
      registered yet.
      
      This can probuce the following warning:
      
       [ INFO: suspicious RCU usage. ]
       4.4.0-00006-g0fe53e8-dirty #34 Tainted: G S
       -------------------------------
       include/trace/events/kmem.h:141 suspicious rcu_dereference_check() usage!
      
       other info that might help us debug this:
      
       RCU used illegally from offline CPU!  rcu_scheduler_active = 1, debug_locks = 1
       no locks held by swapper/8/0.
      
       stack backtrace:
        CPU: 8 PID: 0 Comm: swapper/8 Tainted: G S              4.4.0-00006-g0fe53e8-dirty #34
        Call Trace:
        [c0000005b76c78d0] [c0000000008b9540] .dump_stack+0x98/0xd4 (unreliable)
        [c0000005b76c7950] [c00000000010c898] .lockdep_rcu_suspicious+0x108/0x170
        [c0000005b76c79e0] [c00000000029adc0] .kfree+0x390/0x440
        [c0000005b76c7a80] [c000000000055f74] .destroy_context+0x44/0x100
        [c0000005b76c7b00] [c0000000000934a0] .__mmdrop+0x60/0x150
        [c0000005b76c7b90] [c0000000000e3ff0] .idle_task_exit+0x130/0x140
        [c0000005b76c7c20] [c000000000075804] .pseries_mach_cpu_die+0x64/0x310
        [c0000005b76c7cd0] [c000000000043e7c] .cpu_die+0x3c/0x60
        [c0000005b76c7d40] [c0000000000188d8] .arch_cpu_idle_dead+0x28/0x40
        [c0000005b76c7db0] [c000000000101e6c] .cpu_startup_entry+0x50c/0x560
        [c0000005b76c7ed0] [c000000000043bd8] .start_secondary+0x328/0x360
        [c0000005b76c7f90] [c000000000008a6c] start_secondary_prolog+0x10/0x14
      
      This warning is not a false positive either. RCU is not protecting code that
      is being executed while the CPU is offline.
      
      Instead of playing "whack-a-mole(TM)" and adding conditional statements to
      the tracepoints we find that are used in this instance, simply add a
      cpu_online() test to the tracepoint code where the tracepoint will be
      ignored if the CPU is offline.
      
      Use of raw_smp_processor_id() is fine, as there should never be a case where
      the tracepoint code goes from running on a CPU that is online and suddenly
      gets migrated to a CPU that is offline.
      
      Link: http://lkml.kernel.org/r/1455387773-4245-1-git-send-email-kda@linux-powerpc.orgReported-by: NDenis Kirjanov <kda@linux-powerpc.org>
      Fixes: 97e1c18e ("tracing: Kernel Tracepoints")
      Cc: stable@vger.kernel.org # v2.6.28+
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      f3775549
  18. 15 2月, 2016 1 次提交
  19. 12 2月, 2016 2 次提交
  20. 11 2月, 2016 1 次提交
    • A
      libata: fix HDIO_GET_32BIT ioctl · 287e6611
      Arnd Bergmann 提交于
      As reported by Soohoon Lee, the HDIO_GET_32BIT ioctl does not
      work correctly in compat mode with libata.
      
      I have investigated the issue further and found multiple problems
      that all appeared with the same commit that originally introduced
      HDIO_GET_32BIT handling in libata back in linux-2.6.8 and presumably
      also linux-2.4, as the code uses "copy_to_user(arg, &val, 1)" to copy
      a 'long' variable containing either 0 or 1 to user space.
      
      The problems with this are:
      
      * On big-endian machines, this will always write a zero because it
        stores the wrong byte into user space.
      
      * In compat mode, the upper three bytes of the variable are updated
        by the compat_hdio_ioctl() function, but they now contain
        uninitialized stack data.
      
      * The hdparm tool calling this ioctl uses a 'static long' variable
        to store the result. This means at least the upper bytes are
        initialized to zero, but calling another ioctl like HDIO_GET_MULTCOUNT
        would fill them with data that remains stale when the low byte
        is overwritten. Fortunately libata doesn't implement any of the
        affected ioctl commands, so this would only happen when we query
        both an IDE and an ATA device in the same command such as
        "hdparm -N -c /dev/hda /dev/sda"
      
      * The libata code for unknown reasons started using ATA_IOC_GET_IO32
        and ATA_IOC_SET_IO32 as aliases for HDIO_GET_32BIT and HDIO_SET_32BIT,
        while the ioctl commands that were added later use the normal
        HDIO_* names. This is harmless but rather confusing.
      
      This addresses all four issues by changing the code to use put_user()
      on an 'unsigned long' variable in HDIO_GET_32BIT, like the IDE subsystem
      does, and by clarifying the names of the ioctl commands.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Reported-by: NSoohoon Lee <Soohoon.Lee@f5.com>
      Tested-by: NSoohoon Lee <Soohoon.Lee@f5.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NTejun Heo <tj@kernel.org>
      287e6611