1. 07 7月, 2014 5 次提交
    • M
      fuse: avoid scheduling while atomic · c55a01d3
      Miklos Szeredi 提交于
      As reported by Richard Sharpe, an attempt to use fuse_notify_inval_entry()
      triggers complains about scheduling while atomic:
      
        BUG: scheduling while atomic: fuse.hf/13976/0x10000001
      
      This happens because fuse_notify_inval_entry() attempts to allocate memory
      with GFP_KERNEL, holding "struct fuse_copy_state" mapped by kmap_atomic().
      
      Introduced by commit 58bda1da "fuse/dev: use atomic maps"
      
      Fix by moving the map/unmap to just cover the actual memcpy operation.
      
      Original patch from Maxim Patlasov <mpatlasov@parallels.com>
      Reported-by: NRichard Sharpe <realrichardsharpe@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: <stable@vger.kernel.org> # v3.15+
      c55a01d3
    • M
      fuse: handle large user and group ID · 233a01fa
      Miklos Szeredi 提交于
      If the number in "user_id=N" or "group_id=N" mount options was larger than
      INT_MAX then fuse returned EINVAL.
      
      Fix this to handle all valid uid/gid values.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
      233a01fa
    • H
      fuse: inode: drop cast · 7b3d8bf7
      Himangi Saraogi 提交于
      This patch removes the cast on data of type void * as it is not needed.
      The following Coccinelle semantic patch was used for making the change:
      
      @r@
      expression x;
      void* e;
      type T;
      identifier f;
      @@
      
      (
        *((T *)e)
      |
        ((T *)x)[...]
      |
        ((T *)x)->f
      |
      - (T *)
        e
      )
      Signed-off-by: NHimangi Saraogi <himangi774@gmail.com>
      Acked-by: NJulia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      7b3d8bf7
    • A
      fuse: ignore entry-timeout on LOOKUP_REVAL · 154210cc
      Anand Avati 提交于
      The following test case demonstrates the bug:
      
        sh# mount -t glusterfs localhost:meta-test /mnt/one
      
        sh# mount -t glusterfs localhost:meta-test /mnt/two
      
        sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; echo stuff > /mnt/one/file
        bash: /mnt/one/file: Stale file handle
      
        sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; sleep 1; echo stuff > /mnt/one/file
      
      On the second open() on /mnt/one, FUSE would have used the old
      nodeid (file handle) trying to re-open it. Gluster is returning
      -ESTALE. The ESTALE propagates back to namei.c:filename_lookup()
      where lookup is re-attempted with LOOKUP_REVAL. The right
      behavior now, would be for FUSE to ignore the entry-timeout and
      and do the up-call revalidation. Instead FUSE is ignoring
      LOOKUP_REVAL, succeeding the revalidation (because entry-timeout
      has not passed), and open() is again retried on the old file
      handle and finally the ESTALE is going back to the application.
      
      Fix: if revalidation is happening with LOOKUP_REVAL, then ignore
      entry-timeout and always do the up-call.
      Signed-off-by: NAnand Avati <avati@redhat.com>
      Reviewed-by: NNiels de Vos <ndevos@redhat.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
      154210cc
    • M
      fuse: timeout comparison fix · 126b9d43
      Miklos Szeredi 提交于
      As suggested by checkpatch.pl, use time_before64() instead of direct
      comparison of jiffies64 values.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: <stable@vger.kernel.org>
      126b9d43
  2. 09 6月, 2014 2 次提交
  3. 08 6月, 2014 4 次提交
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · c593e897
      Linus Torvalds 提交于
      Pull btrfs fix from Chris Mason:
       "I had this in my 3.16 merge window queue, but it is small and obvious
        enough for 3.15.  I cherry-picked and retested against current rc8"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: send, fix corrupted path strings for long paths
      c593e897
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · 052e5c7e
      Linus Torvalds 提交于
      Pull SCSI target fixes from Nicholas Bellinger:
       "Here are the remaining fixes for v3.15.
      
        This series includes:
      
         - iser-target fix for ImmediateData exception reference count bug
           (Sagi + nab)
         - iscsi-target fix for MC/S login + potential iser-target MRDSL
           buffer overrun (Santosh + Roland)
         - iser-target fix for v3.15-rc multi network portal shutdown
           regression (nab)
         - target fix for allowing READ_CAPCITY during ALUA Standby access
           state (Chris + nab)
         - target fix for NULL pointer dereference of alua_access_state for
           un-configured devices (Chris + nab)"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
        target: Fix alua_access_state attribute OOPs for un-configured devices
        target: Allow READ_CAPACITY opcode in ALUA Standby access state
        iser-target: Fix multi network portal shutdown regression
        iscsi-target: Fix wrong buffer / buffer overrun in iscsi_change_param_value()
        iser-target: Add missing target_put_sess_cmd for ImmedateData failure
      052e5c7e
    • L
      Merge branch 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 813895f8
      Linus Torvalds 提交于
      Pull x86 fixes from Peter Anvin:
       "A significantly larger than I'd like set of patches for just below the
        wire.  All of these, however, fix real problems.
      
        The one thing that is genuinely scary in here is the change of SMP
        initialization, but that *does* fix a confirmed hang when booting
        virtual machines.
      
        There is also a patch to actually do the right thing about not
        offlining a CPU when there are not enough interrupt vectors available
        in the system; the accounting was done incorrectly.  The worst case
        for that patch is that we fail to offline CPUs when we should (the new
        code is strictly more conservative than the old), so is not
        particularly risky.
      
        Most of the rest is minor stuff; the EFI patches are all about
        exporting correct information to boot loaders and kexec"
      
      * 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/boot: EFI_MIXED should not prohibit loading above 4G
        x86/smpboot: Initialize secondary CPU only if master CPU will wait for it
        x86/smpboot: Log error on secondary CPU wakeup failure at ERR level
        x86: Fix list/memory corruption on CPU hotplug
        x86: irq: Get correct available vectors for cpu disable
        x86/efi: Do not export efi runtime map in case old map
        x86/efi: earlyprintk=efi,keep fix
      813895f8
    • M
      x86/boot: EFI_MIXED should not prohibit loading above 4G · 745c5167
      Matt Fleming 提交于
      commit 7d453eee ("x86/efi: Wire up CONFIG_EFI_MIXED") introduced a
      regression for the functionality to load kernels above 4G. The relevant
      (incorrect) reasoning behind this change can be seen in the commit
      message,
      
        "The xloadflags field in the bzImage header is also updated to reflect
        that the kernel supports both entry points by setting both of
        XLF_EFI_HANDOVER_32 and XLF_EFI_HANDOVER_64 when CONFIG_EFI_MIXED=y.
        XLF_CAN_BE_LOADED_ABOVE_4G is disabled so that the kernel text is
        guaranteed to be addressable with 32-bits."
      
      This is obviously bogus since 32-bit EFI loaders will never place the
      kernel above the 4G mark. So this restriction is entirely unnecessary.
      
      But things are worse than that - since we want to encourage people to
      always compile with CONFIG_EFI_MIXED=y so that their kernels work out of
      the box for both 32-bit and 64-bit firmware, commit 7d453eee
      effectively disables XLF_CAN_BE_LOADED_ABOVE_4G completely.
      
      Remove the overzealous and superfluous restriction and restore the
      XLF_CAN_BE_LOADED_ABOVE_4G functionality.
      
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      Link: http://lkml.kernel.org/r/1402140380-15377-1-git-send-email-matt@console-pimps.orgSigned-off-by: NH. Peter Anvin <hpa@zytor.com>
      745c5167
  4. 07 6月, 2014 3 次提交
    • N
      mm: add !pte_present() check on existing hugetlb_entry callbacks · d4c54919
      Naoya Horiguchi 提交于
      The age table walker doesn't check non-present hugetlb entry in common
      path, so hugetlb_entry() callbacks must check it.  The reason for this
      behavior is that some callers want to handle it in its own way.
      
      [ I think that reason is bogus, btw - it should just do what the regular
        code does, which is to call the "pte_hole()" function for such hugetlb
        entries  - Linus]
      
      However, some callers don't check it now, which causes unpredictable
      result, for example when we have a race between migrating hugepage and
      reading /proc/pid/numa_maps.  This patch fixes it by adding !pte_present
      checks on buggy callbacks.
      
      This bug exists for years and got visible by introducing hugepage
      migration.
      
      ChangeLog v2:
      - fix if condition (check !pte_present() instead of pte_present())
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org> [3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      [ Backported to 3.15.  Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d4c54919
    • F
      Btrfs: send, fix corrupted path strings for long paths · 01a9a8a9
      Filipe Manana 提交于
      If a path has more than 230 characters, we allocate a new buffer to
      use for the path, but we were forgotting to copy the contents of the
      previous buffer into the new one, which has random content from the
      kmalloc call.
      
      Test:
      
          mkfs.btrfs -f /dev/sdd
          mount /dev/sdd /mnt
      
          TEST_PATH="/mnt/fdmanana/.config/google-chrome-mysetup/Default/Pepper_Data/Shockwave_Flash/WritableRoot/#SharedObjects/JSHJ4ZKN/s.wsj.net/[[IMPORT]]/players.edgesuite.net/flash/plugins/osmf/advanced-streaming-plugin/v2.7/osmf1.6/Ak#"
          mkdir -p $TEST_PATH
          echo "hello world" > $TEST_PATH/amaiAdvancedStreamingPlugin.txt
      
          btrfs subvolume snapshot -r /mnt /mnt/mysnap1
          btrfs send /mnt/mysnap1 -f /tmp/1.snap
      
      A test for xfstests follows.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Cc: Marc Merlin <marc@merlins.org>
      Tested-by: NMarc MERLIN <marc@merlins.org>
      Signed-off-by: NChris Mason <clm@fb.com>
      01a9a8a9
    • L
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d54d14bf
      Linus Torvalds 提交于
      Pull scheduler fixes from Ingo Molnar:
       "Four misc fixes: each was deemed serious enough to warrant v3.15
        inclusion"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock
        sched/dl: Fix race in dl_task_timer()
        sched: Fix sched_policy < 0 comparison
        sched/numa: Fix use of spin_{un}lock_irq() when interrupts are disabled
      d54d14bf
  5. 06 6月, 2014 10 次提交
    • A
      mm: rmap: fix use-after-free in __put_anon_vma · 624483f3
      Andrey Ryabinin 提交于
      While working address sanitizer for kernel I've discovered
      use-after-free bug in __put_anon_vma.
      
      For the last anon_vma, anon_vma->root freed before child anon_vma.
      Later in anon_vma_free(anon_vma) we are referencing to already freed
      anon_vma->root to check rwsem.
      
      This fixes it by freeing the child anon_vma before freeing
      anon_vma->root.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> # v3.0+
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      624483f3
    • N
      target: Fix alua_access_state attribute OOPs for un-configured devices · f1453773
      Nicholas Bellinger 提交于
      This patch fixes a OOPs where an attempt to write to the per-device
      alua_access_state configfs attribute at:
      
        /sys/kernel/config/target/core/$HBA/$DEV/alua/$TG_PT_GP/alua_access_state
      
      results in an NULL pointer dereference when the backend device has not
      yet been configured.
      
      This patch adds an explicit check for DF_CONFIGURED, and fails with
      -ENODEV to avoid this case.
      Reported-by: NChris Boot <crb@tiger-computing.co.uk>
      Reported-by: NPhilip Gaw <pgaw@darktech.org.uk>
      Cc: Chris Boot <crb@tiger-computing.co.uk>
      Cc: Philip Gaw <pgaw@darktech.org.uk>
      Cc: stable@vger.kernel.org # 3.8+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      f1453773
    • N
      target: Allow READ_CAPACITY opcode in ALUA Standby access state · e7810c2d
      Nicholas Bellinger 提交于
      This patch allows READ_CAPACITY + SAI_READ_CAPACITY_16 opcode
      processing to occur while the associated ALUA group is in Standby
      access state.
      
      This is required to avoid host side LUN probe failures during the
      initial scan if an ALUA group has already implicitly changed into
      Standby access state.
      
      This addresses a bug reported by Chris + Philip using dm-multipath
      + ESX hosts configured with ALUA multipath.
      Reported-by: NChris Boot <crb@tiger-computing.co.uk>
      Reported-by: NPhilip Gaw <pgaw@darktech.org.uk>
      Cc: Chris Boot <crb@tiger-computing.co.uk>
      Cc: Philip Gaw <pgaw@darktech.org.uk>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      e7810c2d
    • H
      Merge tag 'efi-urgent' into x86/urgent · 17787542
      H. Peter Anvin 提交于
       * Fix earlyprintk=efi,keep support by switching to an ioremap() mapping
         of the framebuffer when early_ioremap() is no longer available and
         dropping __init from functions that may be invoked after
         free_initmem() - Dave Young
      
       * We shouldn't be exporting the EFI runtime map in sysfs if not using
         the new 1:1 EFI mapping code since in that case the mappings are not
         static across a kexec reboot - Dave Young
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      17787542
    • L
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 951e2730
      Linus Torvalds 提交于
      Pull perf fixes from Ingo Molnar:
       "Two last minute tooling fixes"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf probe: Fix perf probe to find correct variable DIE
        perf probe: Fix a segfault if asked for variable it doesn't find
      951e2730
    • L
      Merge branch 'futex-fixes' (futex fixes from Thomas Gleixner) · 1c5aefb5
      Linus Torvalds 提交于
      Merge futex fixes from Thomas Gleixner:
       "So with more awake and less futex wreckaged brain, I went through my
        list of points again and came up with the following 4 patches.
      
        1) Prevent pi requeueing on the same futex
      
           I kept Kees check for uaddr1 == uaddr2 as a early check for private
           futexes and added a key comparison to both futex_requeue and
           futex_wait_requeue_pi.
      
           Sebastian, sorry for the confusion yesterday night.  I really
           misunderstood your question.
      
           You are right the check is pointless for shared futexes where the
           same physical address is mapped to two different virtual addresses.
      
        2) Sanity check atomic acquisiton in futex_lock_pi_atomic
      
           That's basically what Darren suggested.
      
           I just simplified it to use futex_top_waiter() to find kernel
           internal state.  If state is found return -EINVAL and do not bother
           to fix up the user space variable.  It's corrupted already.
      
        3) Ensure state consistency in futex_unlock_pi
      
           The code is silly versus the owner died bit.  There is no point to
           preserve it on unlock when the user space thread owns the futex.
      
           What's worse is that it does not update the user space value when
           the owner died bit is set.  So the kernel itself creates observable
           inconsistency.
      
           Another "optimization" is to retry an atomic unlock.  That's
           pointless as in a sane environment user space would not call into
           that code if it could have unlocked it atomically.  So we always
           check whether there is kernel state around and only if there is
           none, we do the unlock by setting the user space value to 0.
      
        4) Sanitize lookup_pi_state
      
           lookup_pi_state is ambigous about TID == 0 in the user space value.
      
           This can be a valid state even if there is kernel state on this
           uaddr, but we miss a few corner case checks.
      
           I tried to come up with a smaller solution hacking the checks into
           the current cruft, but it turned out to be ugly as hell and I got
           more confused than I was before.  So I rewrote the sanity checks
           along the state documentation with awful lots of commentry"
      
      * emailed patches from Thomas Gleixner <tglx@linutronix.de>:
        futex: Make lookup_pi_state more robust
        futex: Always cleanup owner tid in unlock_pi
        futex: Validate atomic acquisition in futex_lock_pi_atomic()
        futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)
      1c5aefb5
    • T
      futex: Make lookup_pi_state more robust · 54a21788
      Thomas Gleixner 提交于
      The current implementation of lookup_pi_state has ambigous handling of
      the TID value 0 in the user space futex.  We can get into the kernel
      even if the TID value is 0, because either there is a stale waiters bit
      or the owner died bit is set or we are called from the requeue_pi path
      or from user space just for fun.
      
      The current code avoids an explicit sanity check for pid = 0 in case
      that kernel internal state (waiters) are found for the user space
      address.  This can lead to state leakage and worse under some
      circumstances.
      
      Handle the cases explicit:
      
             Waiter | pi_state | pi->owner | uTID      | uODIED | ?
      
        [1]  NULL   | ---      | ---       | 0         | 0/1    | Valid
        [2]  NULL   | ---      | ---       | >0        | 0/1    | Valid
      
        [3]  Found  | NULL     | --        | Any       | 0/1    | Invalid
      
        [4]  Found  | Found    | NULL      | 0         | 1      | Valid
        [5]  Found  | Found    | NULL      | >0        | 1      | Invalid
      
        [6]  Found  | Found    | task      | 0         | 1      | Valid
      
        [7]  Found  | Found    | NULL      | Any       | 0      | Invalid
      
        [8]  Found  | Found    | task      | ==taskTID | 0/1    | Valid
        [9]  Found  | Found    | task      | 0         | 0      | Invalid
        [10] Found  | Found    | task      | !=taskTID | 0/1    | Invalid
      
       [1] Indicates that the kernel can acquire the futex atomically. We
           came came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.
      
       [2] Valid, if TID does not belong to a kernel thread. If no matching
           thread is found then it indicates that the owner TID has died.
      
       [3] Invalid. The waiter is queued on a non PI futex
      
       [4] Valid state after exit_robust_list(), which sets the user space
           value to FUTEX_WAITERS | FUTEX_OWNER_DIED.
      
       [5] The user space value got manipulated between exit_robust_list()
           and exit_pi_state_list()
      
       [6] Valid state after exit_pi_state_list() which sets the new owner in
           the pi_state but cannot access the user space value.
      
       [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.
      
       [8] Owner and user space value match
      
       [9] There is no transient state which sets the user space TID to 0
           except exit_robust_list(), but this is indicated by the
           FUTEX_OWNER_DIED bit. See [4]
      
      [10] There is no transient state which leaves owner and user space
           TID out of sync.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54a21788
    • T
      futex: Always cleanup owner tid in unlock_pi · 13fbca4c
      Thomas Gleixner 提交于
      If the owner died bit is set at futex_unlock_pi, we currently do not
      cleanup the user space futex.  So the owner TID of the current owner
      (the unlocker) persists.  That's observable inconsistant state,
      especially when the ownership of the pi state got transferred.
      
      Clean it up unconditionally.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13fbca4c
    • T
      futex: Validate atomic acquisition in futex_lock_pi_atomic() · b3eaa9fc
      Thomas Gleixner 提交于
      We need to protect the atomic acquisition in the kernel against rogue
      user space which sets the user space futex to 0, so the kernel side
      acquisition succeeds while there is existing state in the kernel
      associated to the real owner.
      
      Verify whether the futex has waiters associated with kernel state.  If
      it has, return -EINVAL.  The state is corrupted already, so no point in
      cleaning it up.  Subsequent calls will fail as well.  Not our problem.
      
      [ tglx: Use futex_top_waiter() and explain why we do not need to try
        	restoring the already corrupted user space state. ]
      Signed-off-by: NDarren Hart <dvhart@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3eaa9fc
    • T
      futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in... · e9c243a5
      Thomas Gleixner 提交于
      futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)
      
      If uaddr == uaddr2, then we have broken the rule of only requeueing from
      a non-pi futex to a pi futex with this call.  If we attempt this, then
      dangling pointers may be left for rt_waiter resulting in an exploitable
      condition.
      
      This change brings futex_requeue() in line with futex_wait_requeue_pi()
      which performs the same check as per commit 6f7b0a2a ("futex: Forbid
      uaddr == uaddr2 in futex_wait_requeue_pi()")
      
      [ tglx: Compare the resulting keys as well, as uaddrs might be
        	different depending on the mapping ]
      
      Fixes CVE-2014-3153.
      
      Reported-by: Pinkie Pie
      Signed-off-by: NWill Drewry <wad@chromium.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDarren Hart <dvhart@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9c243a5
  6. 05 6月, 2014 10 次提交
    • I
      x86/smpboot: Initialize secondary CPU only if master CPU will wait for it · 3e1a878b
      Igor Mammedov 提交于
      Hang is observed on virtual machines during CPU hotplug,
      especially in big guests with many CPUs. (It reproducible
      more often if host is over-committed).
      
      It happens because master CPU gives up waiting on
      secondary CPU and allows it to run wild. As result
      AP causes locking or crashing system. For example
      as described here:
      
         https://lkml.org/lkml/2014/3/6/257
      
      If master CPU have sent STARTUP IPI successfully,
      and AP signalled to master CPU that it's ready
      to start initialization, make master CPU wait
      indefinitely till AP is onlined.
      To ensure that AP won't ever run wild, make it
      wait at early startup till master CPU confirms its
      intention to wait for AP. If AP doesn't respond in 10
      seconds, the master CPU will timeout and cancel
      AP onlining.
      Signed-off-by: NIgor Mammedov <imammedo@redhat.com>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1401975765-22328-4-git-send-email-imammedo@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3e1a878b
    • I
      x86/smpboot: Log error on secondary CPU wakeup failure at ERR level · feef1e8e
      Igor Mammedov 提交于
      If system is running without debug level logging,
      it will not log error if do_boot_cpu() failed to
      wakeup AP. It may lead to silent AP bringup
      failures at boot time.
      Change message level to KERN_ERR to make error
      visible to user as it's done on other architectures.
      Signed-off-by: NIgor Mammedov <imammedo@redhat.com>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1401975765-22328-3-git-send-email-imammedo@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      feef1e8e
    • I
      x86: Fix list/memory corruption on CPU hotplug · 89f898c1
      Igor Mammedov 提交于
      currently if AP wake up is failed, master CPU marks AP as not
      present in do_boot_cpu() by calling set_cpu_present(cpu, false).
      That leads to following list corruption on the next physical CPU
      hotplug:
      
      [  418.107336] WARNING: CPU: 1 PID: 45 at lib/list_debug.c:33 __list_add+0xbe/0xd0()
      [  418.115268] list_add corruption. prev->next should be next (ffff88003dc57600), but was ffff88003e20c3a0. (prev=ffff88003e20c3a0).
      [  418.123693] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT ipt_REJECT cfg80211 xt_conntrack rfkill ee
      [  418.138979] CPU: 1 PID: 45 Comm: kworker/u10:1 Not tainted 3.14.0-rc6+ #387
      [  418.149989] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      [  418.165750] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
      [  418.166433]  0000000000000021 ffff880038ca7988 ffffffff8159b22d 0000000000000021
      [  418.176460]  ffff880038ca79d8 ffff880038ca79c8 ffffffff8106942c ffff880038ca79e8
      [  418.177453]  ffff88003e20c3a0 ffff88003dc57600 ffff88003e20c3a0 00000000ffffffea
      [  418.178445] Call Trace:
      [  418.185811]  [<ffffffff8159b22d>] dump_stack+0x49/0x5c
      [  418.186440]  [<ffffffff8106942c>] warn_slowpath_common+0x8c/0xc0
      [  418.187192]  [<ffffffff81069516>] warn_slowpath_fmt+0x46/0x50
      [  418.191231]  [<ffffffff8136ef51>] ? acpi_ns_get_node+0xb7/0xc7
      [  418.193889]  [<ffffffff812f796e>] __list_add+0xbe/0xd0
      [  418.196649]  [<ffffffff812e2aa9>] kobject_add_internal+0x79/0x200
      [  418.208610]  [<ffffffff812e2e18>] kobject_add_varg+0x38/0x60
      [  418.213831]  [<ffffffff812e2ef4>] kobject_add+0x44/0x70
      [  418.229961]  [<ffffffff813e2c60>] device_add+0xd0/0x550
      [  418.234991]  [<ffffffff813f0e95>] ? pm_runtime_init+0xe5/0xf0
      [  418.250226]  [<ffffffff813e32be>] device_register+0x1e/0x30
      [  418.255296]  [<ffffffff813e82a3>] register_cpu+0xe3/0x130
      [  418.266539]  [<ffffffff81592be5>] arch_register_cpu+0x65/0x150
      [  418.285845]  [<ffffffff81355c0d>] acpi_processor_hotadd_init+0x5a/0x9b
      ...
      Which is caused by the fact that generic_processor_info() allocates
      logical CPU id by calling:
      
       cpu = cpumask_next_zero(-1, cpu_present_mask);
      
      which returns id of previously failed to wake up CPU, since its
      bit is cleared by do_boot_cpu() and as result register_cpu()
      tries to register another CPU with the same id as already
      present but failed to be onlined CPU.
      
      Taking in account that AP will not do anything if master CPU
      failed to wake it up, there is no reason to mark that AP as not
      present and break next cpu hotplug attempts. As a side effect of
      not marking AP as not present, user would be allowed to online
      it again later.
      
      Also fix memory corruption in acpi_unmap_lsapic()
      
      if during CPU hotplug master CPU failed to wake up AP
      it set percpu x86_cpu_to_apicid to BAD_APICID=0xFFFF for AP.
      
      However following attempt to unplug that CPU will lead to
      out of bound write access to __apicid_to_node[] which is
      32768 items long on x86_64 kernel.
      
      So with above fix of cpu_present_mask make sure that a present
      CPU has a valid APIC ID by not setting x86_cpu_to_apicid
      to BAD_APICID in do_boot_cpu() on failure and allow
      acpi_processor_remove()->acpi_unmap_lsapic() cleanly remove CPU.
      Signed-off-by: NIgor Mammedov <imammedo@redhat.com>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1401975765-22328-2-git-send-email-imammedo@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      89f898c1
    • R
      sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock · 09dc4ab0
      Roman Gushchin 提交于
      tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
      force the period timer restart. It's not safe, because
      can lead to deadlock, described in commit 927b54fc:
      "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
      waiting for the hrtimer to finish. However, if sched_cfs_period_timer
      runs for another loop iteration, the hrtimer can attempt to take
      rq->lock, resulting in deadlock."
      
      Three CPUs must be involved:
      
        CPU0               CPU1                         CPU2
        take rq->lock      period timer fired
        ...                take cfs_b lock
        ...                ...                          tg_set_cfs_bandwidth()
        throttle_cfs_rq()  release cfs_b lock           take cfs_b lock
        ...                distribute_cfs_runtime()     timer_active = 0
        take cfs_b->lock   wait for rq->lock            ...
        __start_cfs_bandwidth()
        {wait for timer callback
         break if timer_active == 1}
      
      So, CPU0 and CPU1 are deadlocked.
      
      Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
      wait for period timer callbacks (ignoring cfs_b->timer_active) and
      restart the timer explicitly.
      Signed-off-by: NRoman Gushchin <klamm@yandex-team.ru>
      Reviewed-by: NBen Segall <bsegall@google.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/87wqdi9g8e.wl\%klamm@yandex-team.ru
      Cc: pjt@google.com
      Cc: chris.j.arges@canonical.com
      Cc: gregkh@linuxfoundation.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      09dc4ab0
    • K
      sched/dl: Fix race in dl_task_timer() · 0f397f2c
      Kirill Tkhai 提交于
      Throttled task is still on rq, and it may be moved to other cpu
      if user is playing with sched_setaffinity(). Therefore, unlocked
      task_rq() access makes the race.
      
      Juri Lelli reports he got this race when dl_bandwidth_enabled()
      was not set.
      
      Other thing, pointed by Peter Zijlstra:
      
         "Now I suppose the problem can still actually happen when
          you change the root domain and trigger a effective affinity
          change that way".
      
      To fix that we do the same as made in __task_rq_lock(). We do not
      use __task_rq_lock() itself, because it has a useful lockdep check,
      which is not correct in case of dl_task_timer(). We do not need
      pi_lock locked here. This case is an exception (PeterZ):
      
         "The only reason we don't strictly need ->pi_lock now is because
          we're guaranteed to have p->state == TASK_RUNNING here and are
          thus free of ttwu races".
      Signed-off-by: NKirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> # v3.14+
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/3056991400578422@web14g.yandex.ruSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0f397f2c
    • R
      sched: Fix sched_policy < 0 comparison · b14ed2c2
      Richard Weinberger 提交于
      attr.sched_policy is u32, therefore a comparison against < 0 is never true.
      Fix this by casting sched_policy to int.
      
      This issue was reported by coverity CID 1219934.
      
      Fixes: dbdb2275 ("sched: Disallow sched_attr::sched_policy < 0")
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1401741514-7045-1-git-send-email-richard@nod.atSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b14ed2c2
    • S
      sched/numa: Fix use of spin_{un}lock_irq() when interrupts are disabled · e9dd685c
      Steven Rostedt 提交于
      As Peter Zijlstra told me, we have the following path:
      
      do_exit()
        exit_itimers()
          itimer_delete()
            spin_lock_irqsave(&timer->it_lock, &flags);
            timer_delete_hook(timer);
              kc->timer_del(timer) := posix_cpu_timer_del()
                put_task_struct()
                  __put_task_struct()
                    task_numa_free()
                      spin_lock(&grp->lock);
      
      Which means that task_numa_free() can be called with interrupts
      disabled, which means that we should not be using spin_lock_irq() but
      spin_lock_irqsave() instead. Otherwise we are enabling interrupts while
      holding an interrupt unsafe lock!
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner<tglx@linutronix.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20140527182541.GH11096@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e9dd685c
    • I
      Merge tag 'perf-urgent-for-mingo' of... · 22c91aa2
      Ingo Molnar 提交于
      Merge tag 'perf-urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf into perf/urgent
      
      Pull perf/urgent fixes from Jiri Olsa:
      
       * Fix perf probe to find correct variable DIE (Masami Hiramatsu)
      
       * Fix a segfault in perf probe if asked for variable it doesn't find (Masami Hiramatsu)
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      22c91aa2
    • L
      Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu · 54539cd2
      Linus Torvalds 提交于
      Pull percpu fix from Tejun Heo:
       "It is very late but this is an important percpu-refcount fix from
        Sebastian Ott.
      
        The problem is that percpu_ref_*() used __this_cpu_*() instead of
        this_cpu_*().  The difference between the two is that the latter is
        atomic on the local cpu while the former is not.  this_cpu_inc() is
        guaranteed to increment the percpu counter on the cpu that the
        operation is executed on without any synchronization; however,
        __this_cpu_inc() doesn't and if the local cpu invokes the function
        from different contexts (e.g.  process and irq) of the same CPU, it's
        not guaranteed to actually increment as it may be implemented as rmw.
      
        This bug existed from the get-go but it hasn't been noticed earlier
        probably because on x86 __this_cpu_inc() is equivalent to
        this_cpu_inc() as both get translated into single instruction;
        however, s390 uses the generic rmw implementation and gets affected by
        the bug.  Kudos to Sebastian and Heiko for diagnosing it.
      
        The change is very low risk and fixes a critical issue on the affected
        architectures, so I think it's a good candidate for inclusion although
        it's very late in the devel cycle.  On the other hand, this has been
        broken since v3.11, so backporting it through -stable post -rc1 won't
        be the end of the world.
      
        I'll ping Christoph whether __this_cpu_*() ops can be better annotated
        so that it can trigger lockdep warning when used from multiple
        contexts"
      
      * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
        percpu-refcount: fix usage of this_cpu_ops
      54539cd2
    • S
      percpu-refcount: fix usage of this_cpu_ops · 0c36b390
      Sebastian Ott 提交于
      The percpu-refcount infrastructure uses the underscore variants of
      this_cpu_ops in order to modify percpu reference counters.
      (e.g. __this_cpu_inc()).
      
      However the underscore variants do not atomically update the percpu
      variable, instead they may be implemented using read-modify-write
      semantics (more than one instruction).  Therefore it is only safe to
      use the underscore variant if the context is always the same (process,
      softirq, or hardirq). Otherwise it is possible to lose updates.
      
      This problem is something that Sebastian has seen within the aio
      subsystem which uses percpu refcounters both in process and softirq
      context leading to reference counts that never dropped to zeroes; even
      though the number of "get" and "put" calls matched.
      
      Fix this by using the non-underscore this_cpu_ops variant which
      provides correct per cpu atomic semantics and fixes the corrupted
      reference counts.
      
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: <stable@vger.kernel.org> # v3.11+
      Reported-by: NSebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      References: http://lkml.kernel.org/g/alpine.LFD.2.11.1406041540520.21183@denkbrett
      0c36b390
  7. 04 6月, 2014 6 次提交
    • L
      Merge tag 'pm-3.15-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · c717d156
      Linus Torvalds 提交于
      Pull intel pstate fixes from Rafael Wysocki:
       "Final power management fixes for 3.15
      
         - Taking non-idle time into account when calculating core busy time
           was a mistake and led to a performance regression.  Since the
           problem it was supposed to address is now taken care of in a
           different way, we don't need to do it any more, so drop the
           non-idle time tracking from intel_pstate.  Dirk Brandewie.
      
         - Changing to fixed point math throughout the busy calculation
           introduced rounding errors that adversely affect the accuracy of
           intel_pstate's computations.  Fix from Dirk Brandewie.
      
         - The PID controller algorithm used by intel_pstate assumes that the
           time interval between two adjacent samples will always be the same
           which is not the case for deferable timers (used by intel_pstate)
           when the system is idle.  This leads to inaccurate predictions and
           artificially increases convergence times for the minimum P-state.
           Fix from Dirk Brandewie.
      
         - intel_pstate carries out computations using 32-bit variables that
           may overflow for large enough values of APERF/MPERF.  Switch to
           using 64-bit variables for computations, from Doug Smythies"
      
      * tag 'pm-3.15-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        intel_pstate: Improve initial busy calculation
        intel_pstate: add sample time scaling
        intel_pstate: Correct rounding in busy calculation
        intel_pstate: Remove C0 tracking
      c717d156
    • L
      Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux · 9e9a928e
      Linus Torvalds 提交于
      Pull drm fixes from Dave Airlie:
       "All fairly small: radeon stability and a panic path fix.
      
        Mostly radeon fixes, suspend/resume fix, stability on the CIK
        chipsets, along with a locking check avoidance patch for panic times
        regression"
      
      * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
        drm/radeon: use the CP DMA on CIK
        drm/radeon: sync page table updates
        drm/radeon: fix vm buffer size estimation
        drm/crtc-helper: skip locking checks in panicking path
        drm/radeon/dpm: resume fixes for some systems
      9e9a928e
    • M
      perf probe: Fix perf probe to find correct variable DIE · 082f96a9
      Masami Hiramatsu 提交于
      Fix perf probe to find correct variable DIE which has location or
      external instance by tracking down the lexical blocks.
      
      Current die_find_variable() expects that the all variable DIEs
      which has DW_TAG_variable have a location. However, since recent
      dwarf information may have declaration variable DIEs at the
      entry of function (subprogram), die_find_variable() returns it.
      
      To solve this problem, it must track down the DIE tree to find
      a DIE which has an actual location or a reference for external
      instance.
      
      e.g. finding a DIE which origin is <0xdc73>;
      
       <1><11496>: Abbrev Number: 95 (DW_TAG_subprogram)
          <11497>   DW_AT_abstract_origin: <0xdc42>
          <1149b>   DW_AT_low_pc      : 0x1850
      [...]
       <2><114cc>: Abbrev Number: 119 (DW_TAG_variable) <- this is a declaration
          <114cd>   DW_AT_abstract_origin: <0xdc73>
       <2><114d1>: Abbrev Number: 119 (DW_TAG_variable)
      [...]
       <3><115a7>: Abbrev Number: 105 (DW_TAG_lexical_block)
          <115a8>   DW_AT_ranges      : 0xaa0
       <4><115ac>: Abbrev Number: 96 (DW_TAG_variable) <- this has a location
          <115ad>   DW_AT_abstract_origin: <0xdc73>
          <115b1>   DW_AT_location    : 0x486c        (location list)
      Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Tested-by: NArnaldo Carvalho de Melo <acme@kernel.org>
      Acked-by: NArnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: http://lkml.kernel.org/r/20140529121930.30879.87092.stgit@ltc230.yrl.intra.hitachi.co.jpSigned-off-by: NJiri Olsa <jolsa@kernel.org>
      082f96a9
    • M
      perf probe: Fix a segfault if asked for variable it doesn't find · 0c188a07
      Masami Hiramatsu 提交于
      Fix a segfault bug by asking for variable it doesn't find.
      Since the convert_variable() didn't handle error code returned
      from convert_variable_location(), it just passed an incomplete
      variable field and then a segfault was occurred when formatting
      the field.
      
      This fixes that bug by handling success code correctly in
      convert_variable(). Other callers of convert_variable_location()
      are correctly checking the return code.
      
      This bug was introduced by following commit. But another hidden
      erroneous error handling has been there previously (-ENOMEM case).
      
       commit 3d918a12Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Reported-by: NArnaldo Carvalho de Melo <acme@kernel.org>
      Tested-by: NArnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: http://lkml.kernel.org/r/20140529105232.28251.30447.stgit@ltc230.yrl.intra.hitachi.co.jpSigned-off-by: NJiri Olsa <jolsa@kernel.org>
      0c188a07
    • Y
      x86: irq: Get correct available vectors for cpu disable · ac2a5539
      Yinghai Lu 提交于
      check_irq_vectors_for_cpu_disable() can overestimate the number of
      available interrupt vectors, so the check for cpu down succeeds, but
      the actual cpu removal fails.
      
      It iterates from FIRST_EXTERNAL_VECTOR to NR_VECTORS, which is wrong
      because the systems vectors are not taken into account.
      
      Limit the search to first_system_vector instead of NR_VECTORS.
      
      The second indicator for vector availability the used_vectors bitmap
      is not taken into account at all. So system vectors,
      e.g. IA32_SYSCALL_VECTOR (0x80) and IRQ_MOVE_CLEANUP_VECTOR (0x20),
      are accounted as available.
      
      Add a check for the used_vectors bitmap and do not account vectors
      which are marked there.
      
      [ tglx: Simplified code. Rewrote changelog and code comments. ]
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NPrarit Bhargava <prarit@redhat.com>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Elliott, Robert (Server Storage)" <Elliott@hp.com>
      Cc: x86@kernel.org
      Link: http://lkml.kernel.org/r/1400160305-17774-2-git-send-email-prarit@redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      ac2a5539
    • D
      Merge branch 'drm-fixes-3.15' of git://people.freedesktop.org/~deathsimple/linux into drm-fixes · 0a4ae727
      Dave Airlie 提交于
      The first one is a one liner fixing a stupid typo in the VM handling code and is only relevant if play with one of the VM defines.
      
      The other two switches CIK to use the CPDMA instead of the SDMA for buffer moves, as it turned out the SDMA is still sometimes not 100% reliable.
      
      * 'drm-fixes-3.15' of git://people.freedesktop.org/~deathsimple/linux:
        drm/radeon: use the CP DMA on CIK
        drm/radeon: sync page table updates
        drm/radeon: fix vm buffer size estimation
      0a4ae727