1. 07 1月, 2012 6 次提交
    • J
      PCI: Introduce INTx check & mask API · a2e27787
      Jan Kiszka 提交于
      These new PCI services allow to probe for 2.3-compliant INTx masking
      support and then use the feature from PCI interrupt handlers. The
      services are properly synchronized with concurrent config space access
      via sysfs or on device reset.
      
      This enables generic PCI device drivers like uio_pci_generic or KVM's
      device assignment to implement the necessary kernel-side IRQ handling
      without any knowledge about device-specific interrupt status and control
      registers.
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
      a2e27787
    • J
      PCI: Rework config space blocking services · fb51ccbf
      Jan Kiszka 提交于
      pci_block_user_cfg_access was designed for the use case that a single
      context, the IPR driver, temporarily delays user space accesses to the
      config space via sysfs. This assumption became invalid by the time
      pci_dev_reset was added as locking instance. Today, if you run two loops
      in parallel that reset the same device via sysfs, you end up with a
      kernel BUG as pci_block_user_cfg_access detect the broken assumption.
      
      This reworks the pci_block_user_cfg_access to a sleeping service
      pci_cfg_access_lock and an atomic-compatible variant called
      pci_cfg_access_trylock. The former not only blocks user space access as
      before but also waits if access was already locked. The latter service
      just returns false in this case, allowing the caller to resolve the
      conflict instead of raising a BUG.
      
      Adaptions of the ipr driver were originally written by Brian King.
      Acked-by: NBrian King <brking@linux.vnet.ibm.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
      fb51ccbf
    • A
      PCI: Fix PCI_EXP_TYPE_RC_EC value · 1830ea91
      Alex Williamson 提交于
      Spec shows this as 1010b = 0xa
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
      1830ea91
    • M
      PCI: Rework ASPM disable code · 10f6dc7e
      Matthew Garrett 提交于
      Right now we forcibly clear ASPM state on all devices if the BIOS indicates
      that the feature isn't supported. Based on the Microsoft presentation
      "PCI Express In Depth for Windows Vista and Beyond", I'm starting to think
      that this may be an error. The implication is that unless the platform
      grants full control via _OSC, Windows will not touch any PCIe features -
      including ASPM. In that case clearing ASPM state would be an error unless
      the platform has granted us that control.
      
      This patch reworks the ASPM disabling code such that the actual clearing
      of state is triggered by a successful handoff of PCIe control to the OS.
      The general ASPM code undergoes some changes in order to ensure that the
      ability to clear the bits isn't overridden by ASPM having already been
      disabled. Further, this theoretically now allows for situations where
      only a subset of PCIe roots hand over control, leaving the others in the
      BIOS state.
      
      It's difficult to know for sure that this is the right thing to do -
      there's zero public documentation on the interaction between all of these
      components. But enough vendors enable ASPM on platforms and then set this
      bit that it seems likely that they're expecting the OS to leave them alone.
      
      Measured to save around 5W on an idle Thinkpad X220.
      Signed-off-by: NMatthew Garrett <mjg@redhat.com>
      Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
      10f6dc7e
    • A
      PCI: Fix PRI and PASID consistency · cfa4d8cc
      Alex Williamson 提交于
      These are extended capabilities, rename and move to proper
      group for consistency.
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
      cfa4d8cc
    • N
      PCI/sysfs: add per pci device msi[x] irq listing (v5) · da8d1c8b
      Neil Horman 提交于
      This patch adds a per-pci-device subdirectory in sysfs called:
      /sys/bus/pci/devices/<device>/msi_irqs
      
      This sub-directory exports the set of msi vectors allocated by a given
      pci device, by creating a numbered sub-directory for each vector beneath
      msi_irqs.  For each vector various attributes can be exported.
      Currently the only attribute is called mode, which tracks the
      operational mode of that vector (msi vs. msix)
      Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
      da8d1c8b
  2. 04 1月, 2012 1 次提交
  3. 26 12月, 2011 1 次提交
    • J
      KVM: Don't automatically expose the TSC deadline timer in cpuid · 4d25a066
      Jan Kiszka 提交于
      Unlike all of the other cpuid bits, the TSC deadline timer bit is set
      unconditionally, regardless of what userspace wants.
      
      This is broken in several ways:
       - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
         deadline timer feature, a guest that uses the feature will break
       - live migration to older host kernels that don't support the TSC deadline
         timer will cause the feature to be pulled from under the guest's feet;
         breaking it
       - guests that are broken wrt the feature will fail.
      
      Fix by not enabling the feature automatically; instead report it to userspace.
      Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
      will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
      KVM_GET_SUPPORTED_CPUID.
      
      Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.
      
      [avi: add the KVM_CAP + documentation]
      Reported-by: NAlexey Zaytsev <alexey.zaytsev@gmail.com>
      Tested-by: NAlexey Zaytsev <alexey.zaytsev@gmail.com>
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      4d25a066
  4. 22 12月, 2011 1 次提交
    • S
      VFS: Fix race between CPU hotplug and lglocks · e30e2fdf
      Srivatsa S. Bhat 提交于
      Currently, the *_global_[un]lock_online() routines are not at all synchronized
      with CPU hotplug. Soft-lockups detected as a consequence of this race was
      reported earlier at https://lkml.org/lkml/2011/8/24/185. (Thanks to Cong Meng
      for finding out that the root-cause of this issue is the race condition
      between br_write_[un]lock() and CPU hotplug, which results in the lock states
      getting messed up).
      
      Fixing this race by just adding {get,put}_online_cpus() at appropriate places
      in *_global_[un]lock_online() is not a good option, because, then suddenly
      br_write_[un]lock() would become blocking, whereas they have been kept as
      non-blocking all this time, and we would want to keep them that way.
      
      So, overall, we want to ensure 3 things:
      1. br_write_lock() and br_write_unlock() must remain as non-blocking.
      2. The corresponding lock and unlock of the per-cpu spinlocks must not happen
         for different sets of CPUs.
      3. Either prevent any new CPU online operation in between this lock-unlock, or
         ensure that the newly onlined CPU does not proceed with its corresponding
         per-cpu spinlock unlocked.
      
      To achieve all this:
      (a) We introduce a new spinlock that is taken by the *_global_lock_online()
          routine and released by the *_global_unlock_online() routine.
      (b) We register a callback for CPU hotplug notifications, and this callback
          takes the same spinlock as above.
      (c) We maintain a bitmap which is close to the cpu_online_mask, and once it is
          initialized in the lock_init() code, all future updates to it are done in
          the callback, under the above spinlock.
      (d) The above bitmap is used (instead of cpu_online_mask) while locking and
          unlocking the per-cpu locks.
      
      The callback takes the spinlock upon the CPU_UP_PREPARE event. So, if the
      br_write_lock-unlock sequence is in progress, the callback keeps spinning,
      thus preventing the CPU online operation till the lock-unlock sequence is
      complete. This takes care of requirement (3).
      
      The bitmap that we maintain remains unmodified throughout the lock-unlock
      sequence, since all updates to it are managed by the callback, which takes
      the same spinlock as the one taken by the lock code and released only by the
      unlock routine. Combining this with (d) above, satisfies requirement (2).
      
      Overall, since we use a spinlock (mentioned in (a)) to prevent CPU hotplug
      operations from racing with br_write_lock-unlock, requirement (1) is also
      taken care of.
      
      By the way, it is to be noted that a CPU offline operation can actually run
      in parallel with our lock-unlock sequence, because our callback doesn't react
      to notifications earlier than CPU_DEAD (in order to maintain our bitmap
      properly). And this means, since we use our own bitmap (which is stale, on
      purpose) during the lock-unlock sequence, we could end up unlocking the
      per-cpu lock of an offline CPU (because we had locked it earlier, when the
      CPU was online), in order to satisfy requirement (2). But this is harmless,
      though it looks a bit awkward.
      Debugged-by: NCong Meng <mc@linux.vnet.ibm.com>
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      e30e2fdf
  5. 19 12月, 2011 1 次提交
  6. 17 12月, 2011 1 次提交
    • E
      iommu: Export intel_iommu_enabled to signal when iommu is in use · 8bc1f85c
      Eugeni Dodonov 提交于
      In i915 driver, we do not enable either rc6 or semaphores on SNB when dmar
      is enabled. The new 'intel_iommu_enabled' variable signals when the
      iommu code is in operation.
      
      Cc: Ted Phelps <phelps@gnusto.com>
      Cc: Peter <pab1612@gmail.com>
      Cc: Lukas Hejtmanek <xhejtman@fi.muni.cz>
      Cc: Andrew Lutomirski <luto@mit.edu>
      CC: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Eugeni Dodonov <eugeni.dodonov@intel.com>
      Signed-off-by: NKeith Packard <keithp@keithp.com>
      8bc1f85c
  7. 13 12月, 2011 1 次提交
    • L
      linux/log2.h: Fix rounddown_pow_of_two(1) · 13c07b02
      Linus Torvalds 提交于
      Exactly like roundup_pow_of_two(1), the rounddown version was buggy for
      the case of a compile-time constant '1' argument.  Probably because it
      originated from the same code, sharing history with the roundup version
      from before the bugfix (for that one, see commit 1a06a52e: "Fix
      roundup_pow_of_two(1)").
      
      However, unlike the roundup version, the fix for rounddown is to just
      remove the broken special case entirely.  It's simply not needed - the
      generic code
      
          1UL << ilog2(n)
      
      does the right thing for the constant '1' argment too.  The only reason
      roundup needed that special case was because rounding up does so by
      subtracting one from the argument (and then adding one to the result)
      causing the obvious problems with "ilog2(0)".
      
      But rounddown doesn't do any of that, since ilog2() naturally truncates
      (ie "rounds down") to the right rounded down value.  And without the
      ilog2(0) case, there's no reason for the special case that had the wrong
      value.
      
      tl;dr: rounddown_pow_of_two(1) should be 1, not 0.
      Acked-by: NDmitry Torokhov <dtor@vmware.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13c07b02
  8. 11 12月, 2011 1 次提交
  9. 09 12月, 2011 1 次提交
  10. 07 12月, 2011 1 次提交
    • A
      fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API · 02125a82
      Al Viro 提交于
      __d_path() API is asking for trouble and in case of apparmor d_namespace_path()
      getting just that.  The root cause is that when __d_path() misses the root
      it had been told to look for, it stores the location of the most remote ancestor
      in *root.  Without grabbing references.  Sure, at the moment of call it had
      been pinned down by what we have in *path.  And if we raced with umount -l, we
      could have very well stopped at vfsmount/dentry that got freed as soon as
      prepend_path() dropped vfsmount_lock.
      
      It is safe to compare these pointers with pre-existing (and known to be still
      alive) vfsmount and dentry, as long as all we are asking is "is it the same
      address?".  Dereferencing is not safe and apparmor ended up stepping into
      that.  d_namespace_path() really wants to examine the place where we stopped,
      even if it's not connected to our namespace.  As the result, it looked
      at ->d_sb->s_magic of a dentry that might've been already freed by that point.
      All other callers had been careful enough to avoid that, but it's really
      a bad interface - it invites that kind of trouble.
      
      The fix is fairly straightforward, even though it's bigger than I'd like:
      	* prepend_path() root argument becomes const.
      	* __d_path() is never called with NULL/NULL root.  It was a kludge
      to start with.  Instead, we have an explicit function - d_absolute_root().
      Same as __d_path(), except that it doesn't get root passed and stops where
      it stops.  apparmor and tomoyo are using it.
      	* __d_path() returns NULL on path outside of root.  The main
      caller is show_mountinfo() and that's precisely what we pass root for - to
      skip those outside chroot jail.  Those who don't want that can (and do)
      use d_path().
      	* __d_path() root argument becomes const.  Everyone agrees, I hope.
      	* apparmor does *NOT* try to use __d_path() or any of its variants
      when it sees that path->mnt is an internal vfsmount.  In that case it's
      definitely not mounted anywhere and dentry_path() is exactly what we want
      there.  Handling of sysctl()-triggered weirdness is moved to that place.
      	* if apparmor is asked to do pathname relative to chroot jail
      and __d_path() tells it we it's not in that jail, the sucker just calls
      d_absolute_path() instead.  That's the other remaining caller of __d_path(),
      BTW.
              * seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway -
      the normal seq_file logics will take care of growing the buffer and redoing
      the call of ->show() just fine).  However, if it gets path not reachable
      from root, it returns SEQ_SKIP.  The only caller adjusted (i.e. stopped
      ignoring the return value as it used to do).
      Reviewed-by: NJohn Johansen <john.johansen@canonical.com>
      ACKed-by: NJohn Johansen <john.johansen@canonical.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      02125a82
  11. 06 12月, 2011 2 次提交
  12. 05 12月, 2011 1 次提交
    • P
      perf: Fix loss of notification with multi-event · 10c6db11
      Peter Zijlstra 提交于
      When you do:
              $ perf record -e cycles,cycles,cycles noploop 10
      
      You expect about 10,000 samples for each event, i.e., 10s at
      1000samples/sec. However, this is not what's happening. You
      get much fewer samples, maybe 3700 samples/event:
      
      $ perf report -D | tail -15
      Aggregated stats:
                 TOTAL events:      10998
                  MMAP events:         66
                  COMM events:          2
                SAMPLE events:      10930
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      cycles stats:
                 TOTAL events:       3642
                SAMPLE events:       3642
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      
      On a Intel Nehalem or even AMD64, there are 4 counters capable
      of measuring cycles, so there is plenty of space to measure those
      events without multiplexing (even with the NMI watchdog active).
      And even with multiplexing, we'd expect roughly the same number
      of samples per event.
      
      The root of the problem was that when the event that caused the buffer
      to become full was not the first event passed on the cmdline, the user
      notification would get lost. The notification was sent to the file
      descriptor of the overflowed event but the perf tool was not polling
      on it.  The perf tool aggregates all samples into a single buffer,
      i.e., the buffer of the first event. Consequently, it assumes
      notifications for any event will come via that descriptor.
      
      The seemingly straight forward solution of moving the waitq into the
      ringbuffer object doesn't work because of life-time issues. One could
      perf_event_set_output() on a fd that you're also blocking on and cause
      the old rb object to be freed while its waitq would still be
      referenced by the blocked thread -> FAIL.
      
      Therefore link all events to the ringbuffer and broadcast the wakeup
      from the ringbuffer object to all possible events that could be waited
      upon. This is rather ugly, and we're open to better solutions but it
      works for now.
      Reported-by: NStephane Eranian <eranian@google.com>
      Finished-by: NStephane Eranian <eranian@google.com>
      Reviewed-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111126014731.GA7030@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      10c6db11
  13. 04 12月, 2011 1 次提交
  14. 29 11月, 2011 4 次提交
  15. 24 11月, 2011 2 次提交
  16. 23 11月, 2011 4 次提交
  17. 19 11月, 2011 1 次提交
    • D
      hugetlb: remove dummy definitions of HPAGE_MASK and HPAGE_SIZE · a5c86e98
      David Rientjes 提交于
      Dummy, non-zero definitions for HPAGE_MASK and HPAGE_SIZE were added in
      51c6f666 ("mm: ZAP_BLOCK causes redundant work") to avoid a divide
      by zero in generic kernel code.
      
      That code has since been removed, but probably should never have been
      added in the first place: we don't want HPAGE_SIZE to act like PAGE_SIZE
      for code that is working with hugepages, for example, when the
      dependency on CONFIG_HUGETLB_PAGE has not been fulfilled.
      
      Because hugepage size can differ from architecture to architecture, each
      is required to have their own definitions for both HPAGE_MASK and
      HPAGE_SIZE.  This is always done in arch/*/include/asm/page.h.
      
      So, just remove the dummy and dangerous definitions since they are no
      longer needed and reveals the correct dependencies.  Tested on
      architectures using the definitions with allyesconfig: x86 (even with
      thp), hppa, mips, powerpc, s390, sh3, sh4, sparc, and sparc64, and with
      defconfig on ia64.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5c86e98
  18. 18 11月, 2011 2 次提交
    • K
      pstore: pass allocated memory region back to caller · f6f82851
      Kees Cook 提交于
      The buf_lock cannot be held while populating the inodes, so make the backend
      pass forward an allocated and filled buffer instead. This solves the following
      backtrace. The effect is that "buf" is only ever used to notify the backends
      that something was written to it, and shouldn't be used in the read path.
      
      To replace the buf_lock during the read path, isolate the open/read/close
      loop with a separate mutex to maintain serialized access to the backend.
      
      Note that is is up to the pstore backend to cope if the (*write)() path is
      called in the middle of the read path.
      
      [   59.691019] BUG: sleeping function called from invalid context at .../mm/slub.c:847
      [   59.691019] in_atomic(): 0, irqs_disabled(): 1, pid: 1819, name: mount
      [   59.691019] Pid: 1819, comm: mount Not tainted 3.0.8 #1
      [   59.691019] Call Trace:
      [   59.691019]  [<810252d5>] __might_sleep+0xc3/0xca
      [   59.691019]  [<810a26e6>] kmem_cache_alloc+0x32/0xf3
      [   59.691019]  [<810b53ac>] ? __d_lookup_rcu+0x6f/0xf4
      [   59.691019]  [<810b68b1>] alloc_inode+0x2a/0x64
      [   59.691019]  [<810b6903>] new_inode+0x18/0x43
      [   59.691019]  [<81142447>] pstore_get_inode.isra.1+0x11/0x98
      [   59.691019]  [<81142623>] pstore_mkfile+0xae/0x26f
      [   59.691019]  [<810a2a66>] ? kmem_cache_free+0x19/0xb1
      [   59.691019]  [<8116c821>] ? ida_get_new_above+0x140/0x158
      [   59.691019]  [<811708ea>] ? __init_rwsem+0x1e/0x2c
      [   59.691019]  [<810b67e8>] ? inode_init_always+0x111/0x1b0
      [   59.691019]  [<8102127e>] ? should_resched+0xd/0x27
      [   59.691019]  [<8137977f>] ? _cond_resched+0xd/0x21
      [   59.691019]  [<81142abf>] pstore_get_records+0x52/0xa7
      [   59.691019]  [<8114254b>] pstore_fill_super+0x7d/0x91
      [   59.691019]  [<810a7ff5>] mount_single+0x46/0x82
      [   59.691019]  [<8114231a>] pstore_mount+0x15/0x17
      [   59.691019]  [<811424ce>] ? pstore_get_inode.isra.1+0x98/0x98
      [   59.691019]  [<810a8199>] mount_fs+0x5a/0x12d
      [   59.691019]  [<810b9174>] ? alloc_vfsmnt+0xa4/0x14a
      [   59.691019]  [<810b9474>] vfs_kern_mount+0x4f/0x7d
      [   59.691019]  [<810b9d7e>] do_kern_mount+0x34/0xb2
      [   59.691019]  [<810bb15f>] do_mount+0x5fc/0x64a
      [   59.691019]  [<810912fb>] ? strndup_user+0x2e/0x3f
      [   59.691019]  [<810bb3cb>] sys_mount+0x66/0x99
      [   59.691019]  [<8137b537>] sysenter_do_call+0x12/0x26
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      f6f82851
    • R
      PM Sleep: Do not extend wakeup paths to devices with ignore_children set · 8b258cc8
      Rafael J. Wysocki 提交于
      Commit 4ca46ff3 (PM / Sleep: Mark
      devices involved in wakeup signaling during suspend) introduced
      the power.wakeup_path field in struct dev_pm_info to mark devices
      whose children are enabled to wake up the system from sleep states,
      so that power domains containing the parents that provide their
      children with wakeup power and/or relay their wakeup signals are not
      turned off.  Unfortunately, that introduced a PM regression on SH7372
      whose power consumption in the system "memory sleep" state increased
      as a result of it, because it prevented the power domain containing
      the I2C controller from being turned off when some children of that
      controller were enabled to wake up the system, although the
      controller was not necessary for them to signal wakeup.
      
      To fix this issue use the observation that devices whose
      power.ignore_children flag is set for runtime PM should be treated
      analogously during system suspend.  Namely, they shouldn't be
      included in wakeup paths going through their children.  Since the
      SH7372 I2C controller's power.ignore_children flag is set, doing so
      will restore the previous behavior of that SOC.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
      8b258cc8
  19. 17 11月, 2011 4 次提交
  20. 16 11月, 2011 4 次提交