1. 07 1月, 2012 5 次提交
    • J
      PCI: Rework config space blocking services · fb51ccbf
      Jan Kiszka 提交于
      pci_block_user_cfg_access was designed for the use case that a single
      context, the IPR driver, temporarily delays user space accesses to the
      config space via sysfs. This assumption became invalid by the time
      pci_dev_reset was added as locking instance. Today, if you run two loops
      in parallel that reset the same device via sysfs, you end up with a
      kernel BUG as pci_block_user_cfg_access detect the broken assumption.
      
      This reworks the pci_block_user_cfg_access to a sleeping service
      pci_cfg_access_lock and an atomic-compatible variant called
      pci_cfg_access_trylock. The former not only blocks user space access as
      before but also waits if access was already locked. The latter service
      just returns false in this case, allowing the caller to resolve the
      conflict instead of raising a BUG.
      
      Adaptions of the ipr driver were originally written by Brian King.
      Acked-by: NBrian King <brking@linux.vnet.ibm.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
      fb51ccbf
    • A
      PCI: Fix PCI_EXP_TYPE_RC_EC value · 1830ea91
      Alex Williamson 提交于
      Spec shows this as 1010b = 0xa
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
      1830ea91
    • M
      PCI: Rework ASPM disable code · 10f6dc7e
      Matthew Garrett 提交于
      Right now we forcibly clear ASPM state on all devices if the BIOS indicates
      that the feature isn't supported. Based on the Microsoft presentation
      "PCI Express In Depth for Windows Vista and Beyond", I'm starting to think
      that this may be an error. The implication is that unless the platform
      grants full control via _OSC, Windows will not touch any PCIe features -
      including ASPM. In that case clearing ASPM state would be an error unless
      the platform has granted us that control.
      
      This patch reworks the ASPM disabling code such that the actual clearing
      of state is triggered by a successful handoff of PCIe control to the OS.
      The general ASPM code undergoes some changes in order to ensure that the
      ability to clear the bits isn't overridden by ASPM having already been
      disabled. Further, this theoretically now allows for situations where
      only a subset of PCIe roots hand over control, leaving the others in the
      BIOS state.
      
      It's difficult to know for sure that this is the right thing to do -
      there's zero public documentation on the interaction between all of these
      components. But enough vendors enable ASPM on platforms and then set this
      bit that it seems likely that they're expecting the OS to leave them alone.
      
      Measured to save around 5W on an idle Thinkpad X220.
      Signed-off-by: NMatthew Garrett <mjg@redhat.com>
      Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
      10f6dc7e
    • A
      PCI: Fix PRI and PASID consistency · cfa4d8cc
      Alex Williamson 提交于
      These are extended capabilities, rename and move to proper
      group for consistency.
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
      cfa4d8cc
    • N
      PCI/sysfs: add per pci device msi[x] irq listing (v5) · da8d1c8b
      Neil Horman 提交于
      This patch adds a per-pci-device subdirectory in sysfs called:
      /sys/bus/pci/devices/<device>/msi_irqs
      
      This sub-directory exports the set of msi vectors allocated by a given
      pci device, by creating a numbered sub-directory for each vector beneath
      msi_irqs.  For each vector various attributes can be exported.
      Currently the only attribute is called mode, which tracks the
      operational mode of that vector (msi vs. msix)
      Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
      da8d1c8b
  2. 04 1月, 2012 1 次提交
  3. 31 12月, 2011 1 次提交
  4. 30 12月, 2011 1 次提交
    • A
      procfs: do not confuse jiffies with cputime64_t · 34845636
      Andreas Schwab 提交于
      Commit 2a95ea6c ("procfs: do not overflow get_{idle,iowait}_time
      for nohz") did not take into account that one some architectures jiffies
      and cputime use different units.
      
      This causes get_idle_time() to return numbers in the wrong units, making
      the idle time fields in /proc/stat wrong.
      
      Instead of converting the usec value returned by
      get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function
      usecs_to_cputime64 to convert it to the correct unit of cputime64_t.
      Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Artem S. Tashkinov" <t.artem@mailcity.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34845636
  5. 26 12月, 2011 1 次提交
    • J
      KVM: Don't automatically expose the TSC deadline timer in cpuid · 4d25a066
      Jan Kiszka 提交于
      Unlike all of the other cpuid bits, the TSC deadline timer bit is set
      unconditionally, regardless of what userspace wants.
      
      This is broken in several ways:
       - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
         deadline timer feature, a guest that uses the feature will break
       - live migration to older host kernels that don't support the TSC deadline
         timer will cause the feature to be pulled from under the guest's feet;
         breaking it
       - guests that are broken wrt the feature will fail.
      
      Fix by not enabling the feature automatically; instead report it to userspace.
      Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
      will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
      KVM_GET_SUPPORTED_CPUID.
      
      Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.
      
      [avi: add the KVM_CAP + documentation]
      Reported-by: NAlexey Zaytsev <alexey.zaytsev@gmail.com>
      Tested-by: NAlexey Zaytsev <alexey.zaytsev@gmail.com>
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      4d25a066
  6. 23 12月, 2011 2 次提交
    • E
      net: relax rcvbuf limits · 0fd7bac6
      Eric Dumazet 提交于
      skb->truesize might be big even for a small packet.
      
      Its even bigger after commit 87fb4b7b (net: more accurate skb
      truesize) and big MTU.
      
      We should allow queueing at least one packet per receiver, even with a
      low RCVBUF setting.
      Reported-by: NMichal Simek <monstr@monstr.eu>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0fd7bac6
    • E
      net: introduce DST_NOPEER dst flag · e688a604
      Eric Dumazet 提交于
      Chris Boot reported crashes occurring in ipv6_select_ident().
      
      [  461.457562] RIP: 0010:[<ffffffff812dde61>]  [<ffffffff812dde61>]
      ipv6_select_ident+0x31/0xa7
      
      [  461.578229] Call Trace:
      [  461.580742] <IRQ>
      [  461.582870]  [<ffffffff812efa7f>] ? udp6_ufo_fragment+0x124/0x1a2
      [  461.589054]  [<ffffffff812dbfe0>] ? ipv6_gso_segment+0xc0/0x155
      [  461.595140]  [<ffffffff812700c6>] ? skb_gso_segment+0x208/0x28b
      [  461.601198]  [<ffffffffa03f236b>] ? ipv6_confirm+0x146/0x15e
      [nf_conntrack_ipv6]
      [  461.608786]  [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77
      [  461.614227]  [<ffffffff81271d64>] ? dev_hard_start_xmit+0x357/0x543
      [  461.620659]  [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111
      [  461.626440]  [<ffffffffa0379745>] ? br_parse_ip_options+0x19a/0x19a
      [bridge]
      [  461.633581]  [<ffffffff812722ff>] ? dev_queue_xmit+0x3af/0x459
      [  461.639577]  [<ffffffffa03747d2>] ? br_dev_queue_push_xmit+0x72/0x76
      [bridge]
      [  461.646887]  [<ffffffffa03791e3>] ? br_nf_post_routing+0x17d/0x18f
      [bridge]
      [  461.653997]  [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77
      [  461.659473]  [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge]
      [  461.665485]  [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111
      [  461.671234]  [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge]
      [  461.677299]  [<ffffffffa0379215>] ?
      nf_bridge_update_protocol+0x20/0x20 [bridge]
      [  461.684891]  [<ffffffffa03bb0e5>] ? nf_ct_zone+0xa/0x17 [nf_conntrack]
      [  461.691520]  [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge]
      [  461.697572]  [<ffffffffa0374812>] ? NF_HOOK.constprop.8+0x3c/0x56
      [bridge]
      [  461.704616]  [<ffffffffa0379031>] ?
      nf_bridge_push_encap_header+0x1c/0x26 [bridge]
      [  461.712329]  [<ffffffffa037929f>] ? br_nf_forward_finish+0x8a/0x95
      [bridge]
      [  461.719490]  [<ffffffffa037900a>] ?
      nf_bridge_pull_encap_header+0x1c/0x27 [bridge]
      [  461.727223]  [<ffffffffa0379974>] ? br_nf_forward_ip+0x1c0/0x1d4 [bridge]
      [  461.734292]  [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77
      [  461.739758]  [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge]
      [  461.746203]  [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111
      [  461.751950]  [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge]
      [  461.758378]  [<ffffffffa037533a>] ? NF_HOOK.constprop.4+0x56/0x56
      [bridge]
      
      This is caused by bridge netfilter special dst_entry (fake_rtable), a
      special shared entry, where attaching an inetpeer makes no sense.
      
      Problem is present since commit 87c48fa3 (ipv6: make fragment
      identifications less predictable)
      
      Introduce DST_NOPEER dst flag and make sure ipv6_select_ident() and
      __ip_select_ident() fallback to the 'no peer attached' handling.
      Reported-by: NChris Boot <bootc@bootc.net>
      Tested-by: NChris Boot <bootc@bootc.net>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e688a604
  7. 22 12月, 2011 2 次提交
    • S
      VFS: Fix race between CPU hotplug and lglocks · e30e2fdf
      Srivatsa S. Bhat 提交于
      Currently, the *_global_[un]lock_online() routines are not at all synchronized
      with CPU hotplug. Soft-lockups detected as a consequence of this race was
      reported earlier at https://lkml.org/lkml/2011/8/24/185. (Thanks to Cong Meng
      for finding out that the root-cause of this issue is the race condition
      between br_write_[un]lock() and CPU hotplug, which results in the lock states
      getting messed up).
      
      Fixing this race by just adding {get,put}_online_cpus() at appropriate places
      in *_global_[un]lock_online() is not a good option, because, then suddenly
      br_write_[un]lock() would become blocking, whereas they have been kept as
      non-blocking all this time, and we would want to keep them that way.
      
      So, overall, we want to ensure 3 things:
      1. br_write_lock() and br_write_unlock() must remain as non-blocking.
      2. The corresponding lock and unlock of the per-cpu spinlocks must not happen
         for different sets of CPUs.
      3. Either prevent any new CPU online operation in between this lock-unlock, or
         ensure that the newly onlined CPU does not proceed with its corresponding
         per-cpu spinlock unlocked.
      
      To achieve all this:
      (a) We introduce a new spinlock that is taken by the *_global_lock_online()
          routine and released by the *_global_unlock_online() routine.
      (b) We register a callback for CPU hotplug notifications, and this callback
          takes the same spinlock as above.
      (c) We maintain a bitmap which is close to the cpu_online_mask, and once it is
          initialized in the lock_init() code, all future updates to it are done in
          the callback, under the above spinlock.
      (d) The above bitmap is used (instead of cpu_online_mask) while locking and
          unlocking the per-cpu locks.
      
      The callback takes the spinlock upon the CPU_UP_PREPARE event. So, if the
      br_write_lock-unlock sequence is in progress, the callback keeps spinning,
      thus preventing the CPU online operation till the lock-unlock sequence is
      complete. This takes care of requirement (3).
      
      The bitmap that we maintain remains unmodified throughout the lock-unlock
      sequence, since all updates to it are managed by the callback, which takes
      the same spinlock as the one taken by the lock code and released only by the
      unlock routine. Combining this with (d) above, satisfies requirement (2).
      
      Overall, since we use a spinlock (mentioned in (a)) to prevent CPU hotplug
      operations from racing with br_write_lock-unlock, requirement (1) is also
      taken care of.
      
      By the way, it is to be noted that a CPU offline operation can actually run
      in parallel with our lock-unlock sequence, because our callback doesn't react
      to notifications earlier than CPU_DEAD (in order to maintain our bitmap
      properly). And this means, since we use our own bitmap (which is stale, on
      purpose) during the lock-unlock sequence, we could end up unlocking the
      per-cpu lock of an offline CPU (because we had locked it earlier, when the
      CPU was online), in order to satisfy requirement (2). But this is harmless,
      though it looks a bit awkward.
      Debugged-by: NCong Meng <mc@linux.vnet.ibm.com>
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      e30e2fdf
    • S
      net: Add a flow_cache_flush_deferred function · c0ed1c14
      Steffen Klassert 提交于
      flow_cach_flush() might sleep but can be called from
      atomic context via the xfrm garbage collector. So add
      a flow_cache_flush_deferred() function and use this if
      the xfrm garbage colector is invoked from within the
      packet path.
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      Acked-by: NTimo Teräs <timo.teras@iki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0ed1c14
  8. 20 12月, 2011 1 次提交
  9. 19 12月, 2011 2 次提交
  10. 18 12月, 2011 1 次提交
  11. 17 12月, 2011 1 次提交
    • E
      iommu: Export intel_iommu_enabled to signal when iommu is in use · 8bc1f85c
      Eugeni Dodonov 提交于
      In i915 driver, we do not enable either rc6 or semaphores on SNB when dmar
      is enabled. The new 'intel_iommu_enabled' variable signals when the
      iommu code is in operation.
      
      Cc: Ted Phelps <phelps@gnusto.com>
      Cc: Peter <pab1612@gmail.com>
      Cc: Lukas Hejtmanek <xhejtman@fi.muni.cz>
      Cc: Andrew Lutomirski <luto@mit.edu>
      CC: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Eugeni Dodonov <eugeni.dodonov@intel.com>
      Signed-off-by: NKeith Packard <keithp@keithp.com>
      8bc1f85c
  12. 15 12月, 2011 1 次提交
  13. 14 12月, 2011 1 次提交
  14. 13 12月, 2011 1 次提交
    • L
      linux/log2.h: Fix rounddown_pow_of_two(1) · 13c07b02
      Linus Torvalds 提交于
      Exactly like roundup_pow_of_two(1), the rounddown version was buggy for
      the case of a compile-time constant '1' argument.  Probably because it
      originated from the same code, sharing history with the roundup version
      from before the bugfix (for that one, see commit 1a06a52e: "Fix
      roundup_pow_of_two(1)").
      
      However, unlike the roundup version, the fix for rounddown is to just
      remove the broken special case entirely.  It's simply not needed - the
      generic code
      
          1UL << ilog2(n)
      
      does the right thing for the constant '1' argment too.  The only reason
      roundup needed that special case was because rounding up does so by
      subtracting one from the argument (and then adding one to the result)
      causing the obvious problems with "ilog2(0)".
      
      But rounddown doesn't do any of that, since ilog2() naturally truncates
      (ie "rounds down") to the right rounded down value.  And without the
      ilog2(0) case, there's no reason for the special case that had the wrong
      value.
      
      tl;dr: rounddown_pow_of_two(1) should be 1, not 0.
      Acked-by: NDmitry Torokhov <dtor@vmware.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13c07b02
  15. 11 12月, 2011 2 次提交
  16. 09 12月, 2011 1 次提交
  17. 07 12月, 2011 1 次提交
    • A
      fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API · 02125a82
      Al Viro 提交于
      __d_path() API is asking for trouble and in case of apparmor d_namespace_path()
      getting just that.  The root cause is that when __d_path() misses the root
      it had been told to look for, it stores the location of the most remote ancestor
      in *root.  Without grabbing references.  Sure, at the moment of call it had
      been pinned down by what we have in *path.  And if we raced with umount -l, we
      could have very well stopped at vfsmount/dentry that got freed as soon as
      prepend_path() dropped vfsmount_lock.
      
      It is safe to compare these pointers with pre-existing (and known to be still
      alive) vfsmount and dentry, as long as all we are asking is "is it the same
      address?".  Dereferencing is not safe and apparmor ended up stepping into
      that.  d_namespace_path() really wants to examine the place where we stopped,
      even if it's not connected to our namespace.  As the result, it looked
      at ->d_sb->s_magic of a dentry that might've been already freed by that point.
      All other callers had been careful enough to avoid that, but it's really
      a bad interface - it invites that kind of trouble.
      
      The fix is fairly straightforward, even though it's bigger than I'd like:
      	* prepend_path() root argument becomes const.
      	* __d_path() is never called with NULL/NULL root.  It was a kludge
      to start with.  Instead, we have an explicit function - d_absolute_root().
      Same as __d_path(), except that it doesn't get root passed and stops where
      it stops.  apparmor and tomoyo are using it.
      	* __d_path() returns NULL on path outside of root.  The main
      caller is show_mountinfo() and that's precisely what we pass root for - to
      skip those outside chroot jail.  Those who don't want that can (and do)
      use d_path().
      	* __d_path() root argument becomes const.  Everyone agrees, I hope.
      	* apparmor does *NOT* try to use __d_path() or any of its variants
      when it sees that path->mnt is an internal vfsmount.  In that case it's
      definitely not mounted anywhere and dentry_path() is exactly what we want
      there.  Handling of sysctl()-triggered weirdness is moved to that place.
      	* if apparmor is asked to do pathname relative to chroot jail
      and __d_path() tells it we it's not in that jail, the sucker just calls
      d_absolute_path() instead.  That's the other remaining caller of __d_path(),
      BTW.
              * seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway -
      the normal seq_file logics will take care of growing the buffer and redoing
      the call of ->show() just fine).  However, if it gets path not reachable
      from root, it returns SEQ_SKIP.  The only caller adjusted (i.e. stopped
      ignoring the return value as it used to do).
      Reviewed-by: NJohn Johansen <john.johansen@canonical.com>
      ACKed-by: NJohn Johansen <john.johansen@canonical.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      02125a82
  18. 06 12月, 2011 11 次提交
  19. 05 12月, 2011 1 次提交
    • P
      perf: Fix loss of notification with multi-event · 10c6db11
      Peter Zijlstra 提交于
      When you do:
              $ perf record -e cycles,cycles,cycles noploop 10
      
      You expect about 10,000 samples for each event, i.e., 10s at
      1000samples/sec. However, this is not what's happening. You
      get much fewer samples, maybe 3700 samples/event:
      
      $ perf report -D | tail -15
      Aggregated stats:
                 TOTAL events:      10998
                  MMAP events:         66
                  COMM events:          2
                SAMPLE events:      10930
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      cycles stats:
                 TOTAL events:       3642
                SAMPLE events:       3642
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      
      On a Intel Nehalem or even AMD64, there are 4 counters capable
      of measuring cycles, so there is plenty of space to measure those
      events without multiplexing (even with the NMI watchdog active).
      And even with multiplexing, we'd expect roughly the same number
      of samples per event.
      
      The root of the problem was that when the event that caused the buffer
      to become full was not the first event passed on the cmdline, the user
      notification would get lost. The notification was sent to the file
      descriptor of the overflowed event but the perf tool was not polling
      on it.  The perf tool aggregates all samples into a single buffer,
      i.e., the buffer of the first event. Consequently, it assumes
      notifications for any event will come via that descriptor.
      
      The seemingly straight forward solution of moving the waitq into the
      ringbuffer object doesn't work because of life-time issues. One could
      perf_event_set_output() on a fd that you're also blocking on and cause
      the old rb object to be freed while its waitq would still be
      referenced by the blocked thread -> FAIL.
      
      Therefore link all events to the ringbuffer and broadcast the wakeup
      from the ringbuffer object to all possible events that could be waited
      upon. This is rather ugly, and we're open to better solutions but it
      works for now.
      Reported-by: NStephane Eranian <eranian@google.com>
      Finished-by: NStephane Eranian <eranian@google.com>
      Reviewed-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111126014731.GA7030@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      10c6db11
  20. 04 12月, 2011 1 次提交
  21. 02 12月, 2011 1 次提交
  22. 01 12月, 2011 1 次提交