1. 05 Feb 2015, 1 commit
  2. 04 Feb 2015, 3 commits
  3. 02 Feb 2015, 1 commit
    • sched: don't cause task state changes in nested sleep debugging · 00845eb9
      Committed by Linus Torvalds
      Commit 8eb23b9f ("sched: Debug nested sleeps") added code to report
      on nested sleep conditions, which we generally want to avoid because the
      inner sleeping operation can re-set the thread state to TASK_RUNNING,
      but that will then cause the outer sleep loop to not actually sleep
      when it calls schedule().
      
      However, that's actually valid traditional behavior, with the inner
      sleep being some fairly rare case (like taking a sleeping lock that
      normally doesn't actually need to sleep).
      
      And the debug code would actually change the state of the task to
      TASK_RUNNING internally, which makes that kind of traditional and
      working code not work at all: now the nested sleep doesn't just
      sometimes cause the outer one to not block, it causes it to not block
      every single time.
      
      In particular, it will cause the cardbus kernel daemon (pccardd) to
      basically busy-loop doing scheduling, converting a laptop into a heater,
      as reported by Bruno Prémont.  But there may be other legacy uses of
      that nested sleep model in other drivers that are also likely to never
      get converted to the new model.
      
      This fixes both cases:
      
       - don't set TASK_RUNNING when the nested condition happens (note: even
         if WARN_ONCE() only _warns_ once, the return value isn't whether the
         warning happened, but whether the condition for the warning was true.
         So despite the warning only happening once, the "if (WARN_ON(..))"
         would trigger for every nested sleep.)
      
       - in the cases where we knowingly disable the warning by using
         "sched_annotate_sleep()", don't change the task state (that is used
         for all core scheduling decisions), instead use '->task_state_change'
         that is used for the debugging decision itself.
      
      (Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
      avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
      suggested change to use 'task_state_change' as part of the test)
      Reported-and-bisected-by: Bruno Prémont <bonbons@linux-vserver.org>
      Tested-by: Rafael J Wysocki <rjw@rjwysocki.net>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ilya Dryomov <ilya.dryomov@inktank.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00845eb9
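      For context, a minimal sketch of the traditional nested-sleep pattern
      this commit keeps working; "condition" and may_sleep_briefly() are
      placeholders for illustration, not symbols from the patch:

        set_current_state(TASK_INTERRUPTIBLE);
        while (!condition) {
                /* inner op (e.g. a sleeping lock) may briefly sleep and
                 * reset the state to TASK_RUNNING */
                may_sleep_briefly();
                if (!condition)
                        schedule();     /* may or may not block -- historically fine */
                set_current_state(TASK_INTERRUPTIBLE);
        }
        __set_current_state(TASK_RUNNING);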
  4. 30 Jan 2015, 1 commit
    • vm: add VM_FAULT_SIGSEGV handling support · 33692f27
      Committed by Linus Torvalds
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it's about just tying that new return
      value to the existing code, but it's all a bit annoying.
      
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that
      cleanup.
      
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
      Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
      Tested-by: Jan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33692f27
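      As an illustration of what each architecture's fault handler gains, a
      hedged sketch of the error dispatch; the goto labels vary per
      architecture and are illustrative here:

        fault = handle_mm_fault(mm, vma, address, flags);
        if (unlikely(fault & VM_FAULT_ERROR)) {
                if (fault & VM_FAULT_OOM)
                        goto out_of_memory;
                else if (fault & VM_FAULT_SIGSEGV)  /* new: deliver SIGSEGV */
                        goto bad_area;
                else if (fault & VM_FAULT_SIGBUS)
                        goto do_sigbus;
                BUG();
        }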
  5. 28 Jan 2015, 2 commits
    • perf: Tighten (and fix) the grouping condition · c3c87e77
      Committed by Peter Zijlstra
      The fix from 9fc81d87 ("perf: Fix events installation during
      moving group") was incomplete in that it failed to recognise that
      creating a group with events for different CPUs is semantically
      broken -- they cannot be co-scheduled.
      
      Furthermore, it leads to real breakage where, when we create an event
      for CPU Y and then migrate it to form a group on CPU X, the code gets
      confused about where the counter is programmed -- I also triggered this
      in practice via the perf fuzzer.
      
      Fix this by tightening the rules for creating groups. Only allow
      grouping of counters that can be co-scheduled in the same context.
      This means for the same task and/or the same cpu.
      
      Fixes: 9fc81d87 ("perf: Fix events installation during moving group")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20150123125834.090683288@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c3c87e77
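      A hedged user-space illustration of the case that is now rejected: a
      group whose leader is bound to one CPU and whose sibling is bound to
      another can never be co-scheduled (the exact errno isn't spelled out
      in the changelog, so treat the failure mode as an assumption):

        #include <linux/perf_event.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <string.h>
        #include <stdio.h>

        int main(void)
        {
                struct perf_event_attr attr;
                memset(&attr, 0, sizeof(attr));
                attr.size   = sizeof(attr);
                attr.type   = PERF_TYPE_HARDWARE;
                attr.config = PERF_COUNT_HW_CPU_CYCLES;

                /* leader: all tasks on CPU 0; sibling: all tasks on CPU 1 */
                int leader  = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
                int sibling = syscall(__NR_perf_event_open, &attr, -1, 1, leader, 0);
                if (sibling < 0)
                        perror("cross-CPU group rejected");  /* expected after this fix */
                return 0;
        }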
    • quota: Switch ->get_dqblk() and ->set_dqblk() to use bytes as space units · 14bf61ff
      Committed by Jan Kara
      Currently ->get_dqblk() and ->set_dqblk() use struct fs_disk_quota which
      tracks space limits and usage in 512-byte blocks. However VFS quotas
      track usage in bytes (as some filesystems require that) and we need to
      somehow pass this information. Up to now it wasn't a problem because we
      didn't do any unit conversion (thus VFS quota routines happily stuck the
      number of bytes into the d_bcount field of struct fs_disk_quota). Only if
      you tried to use Q_XGETQUOTA or Q_XSETQLIM for VFS quotas (or Q_GETQUOTA
      / Q_SETQUOTA for XFS quotas) did you get bogus results. Hardly anyone
      tried this but reportedly some Samba users hit the problem in practice.
      So if we want the interfaces to be compatible we need to fix this.
      
      We bite the bullet and define another quota structure used for passing
      information from/to ->get_dqblk()/->set_dqblk(). It's somewhat sad we
      have to have more conversion routines in fs/quota/quota.c, and another
      copying of the quota structure slows down getting quota information by
      about 2%, but it seems cleaner than overloading e.g. the units of
      d_bcount to bytes.
      
      CC: stable@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      14bf61ff
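      A hedged sketch of the direction described above: a separate,
      byte-based structure for the ->get_dqblk()/->set_dqblk() interface.
      Structure and field names here are illustrative, not necessarily the
      ones the patch introduces:

        /* limits and usage carried in bytes, so the VFS quota code no
         * longer has to abuse 512-byte-block fields */
        struct vfs_qc_dqblk {
                u64 d_spc_hardlimit;    /* absolute limit on used space, bytes */
                u64 d_spc_softlimit;    /* preferred limit on used space, bytes */
                u64 d_space;            /* current space usage, bytes */
                u64 d_ino_hardlimit;    /* maximum number of inodes */
                u64 d_ino_softlimit;    /* preferred inode limit */
                u64 d_ino_count;        /* current number of inodes */
        };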
  6. 27 Jan 2015, 3 commits
  7. 22 Jan 2015, 1 commit
  8. 20 Jan 2015, 2 commits
    • module: remove mod arg from module_free, rename module_memfree(). · be1f221c
      Committed by Rusty Russell
      Nothing needs the module pointer any more, and the next patch will
      call it from RCU, where the module itself might no longer exist.
      Removing the arg is the safest approach.
      
      This just codifies the use of the module_alloc/module_free pattern
      which ftrace and bpf use.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: x86@kernel.org
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: linux-cris-kernel@axis.com
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mips@linux-mips.org
      Cc: nios2-dev@lists.rocketboards.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: sparclinux@vger.kernel.org
      Cc: netdev@vger.kernel.org
      be1f221c
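      A hedged before/after sketch of the interface change; the generic
      implementation is assumed to stay a plain vfree of the region:

        /* before: the module pointer was required but unused */
        void module_free(struct module *mod, void *module_region);

        /* after: only the region is needed, so RCU callbacks that outlive
         * the module itself can still free it */
        void __weak module_memfree(void *module_region)
        {
                vfree(module_region);
        }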
    • module_arch_freeing_init(): new hook for archs before module->module_init freed. · d453cded
      Committed by Rusty Russell
      Archs have been abusing module_free() to clean up their arch-specific
      allocations.  Since module_free() is also (ab)used by BPF and trace code,
      let's keep it to simple allocations, and provide a hook called before
      that.
      
      This means that avr32, ia64, parisc and s390 no longer need to implement
      their own module_free() at all.  avr32 doesn't need module_finalize()
      either.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      d453cded
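      A hedged sketch of the new hook and its ordering relative to freeing
      the init region; the weak default and the call site are simplified:

        /* generic no-op that archs like ia64, parisc or s390 can override */
        void __weak module_arch_freeing_init(struct module *mod)
        {
        }

        /* freeing the init sections: arch cleanup first, then the memory */
        module_arch_freeing_init(mod);
        module_memfree(mod->module_init);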
  9. 19 Jan 2015, 1 commit
  10. 17 Jan 2015, 3 commits
    • genetlink: synchronize socket closing and family removal · ee1c2442
      Committed by Johannes Berg
      In addition to the problem Jeff Layton reported, I looked at the code
      and reproduced the same warning by subscribing and removing the genl
      family with a socket still open. This is a fairly tricky race which
      originates in the fact that generic netlink allows the family to go
      away while sockets are still open - unlike regular netlink which has
      a module refcount for every open socket so in general this cannot be
      triggered.
      
      Trying to resolve this issue with the obvious locking isn't possible as
      it will result in deadlocks between unregistration and group unbind
      notification (which, incidentally, lockdep doesn't find due to the
      home-grown locking in the netlink table).
      
      To really resolve this, introduce a "closing socket" reference counter
      (for generic netlink only, as it's the only affected family) in the
      core netlink code and use that in generic netlink to wait for all the
      sockets that are being closed at the same time as a generic netlink
      family is removed.
      
      This fixes the race: when a socket is closed, it should call unbind, but
      if the family is removed at the same time the unbind will not find it,
      leading to the warning. The real problem though is that in this case the
      unbind could actually find a new family that is registered to have a
      multicast group with the same ID, and call its mcast_unbind(), leading
      to confusion.
      
      Also remove the warning since it would still trigger, but is now no
      longer a problem.
      
      This also moves the code in af_netlink.c to before unreferencing the
      module to avoid having the same problem in the normal non-genl case.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee1c2442
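      A hedged sketch of the mechanism described above; the counter and
      waitqueue names are illustrative, not necessarily those in the patch:

        static atomic_t genl_sk_closing_cnt = ATOMIC_INIT(0);
        static DECLARE_WAIT_QUEUE_HEAD(genl_sk_closing_waitq);

        /* netlink core: a genl socket has started closing */
        atomic_inc(&genl_sk_closing_cnt);

        /* ... once its unbind work has run ... */
        if (atomic_dec_and_test(&genl_sk_closing_cnt))
                wake_up(&genl_sk_closing_waitq);

        /* genl_unregister_family(): wait out sockets closing concurrently */
        wait_event(genl_sk_closing_waitq,
                   atomic_read(&genl_sk_closing_cnt) == 0);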
    • PCI: Add pci_claim_bridge_resource() to clip window if necessary · 8505e729
      Committed by Yinghai Lu
      Add pci_claim_bridge_resource() to claim a PCI-PCI bridge window.  This is
      like regular pci_claim_resource(), except that if we fail to claim the
      window, we check to see if we can reduce the size of the window and try
      again.
      
      This is for scenarios like this:
      
        pci_bus 0000:00: root bus resource [mem 0xc0000000-0xffffffff]
        pci 0000:00:01.0:   bridge window [mem 0xbdf00000-0xddefffff 64bit pref]
        pci 0000:01:00.0: reg 0x10: [mem 0xc0000000-0xcfffffff pref]
      
      The 00:01.0 window is illegal: it starts before the host bridge window, so
      we have to assume the [0xbdf00000-0xbfffffff] region is inaccessible.  We
      can make it legal by clipping it to [mem 0xc0000000-0xddefffff 64bit pref].
      
      Previously we discarded the 00:01.0 window and tried to reassign that part
      of the hierarchy from scratch.  That is a problem because Linux doesn't
      always assign things optimally.  For example, in this case, BIOS put the
      01:00.0 device in a prefetchable window below 4GB, but after 5b285415,
      Linux puts the prefetchable window above 4GB where the 32-bit 01:00.0
      device can't use it.
      
      Clipping the 00:01.0 window is less intrusive than completely reassigning
      things and is sufficient to let us use most of the BIOS configuration.  Of
      course, it's possible that devices below 00:01.0 will no longer fit.  If
      that's the case, we'll have to reassign things.  But that's a separate
      problem.
      
      [bhelgaas: changelog, split into separate patch]
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=85491
      Reported-by: Marek Kordik <kordikmarek@gmail.com>
      Fixes: 5b285415 ("PCI: Restrict 64-bit prefetchable bridge windows to 64-bit resources")
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      CC: stable@vger.kernel.org	# v3.16+
      8505e729
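      A hedged sketch of the clipping step; this is a simplification, not
      the exact pci_claim_bridge_resource(), which also has to distinguish
      the I/O and memory windows:

        static int claim_or_clip(struct pci_dev *bridge, int i,
                                 struct resource *parent)
        {
                struct resource *res = &bridge->resource[i];

                if (pci_claim_resource(bridge, i) == 0)
                        return 0;       /* window already fits as-is */

                /* clip the window into the upstream resource and retry */
                res->start = max(res->start, parent->start);
                res->end   = min(res->end, parent->end);
                return pci_claim_resource(bridge, i);
        }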
    • PCI: Add flag for devices where we can't use bus reset · f331a859
      Committed by Alex Williamson
      Enable a mechanism for devices to be quirked as not behaving correctly
      across a PCI bus reset.  We require a modest level of spec-compliant
      behavior in order to do a reset: for instance, the device should come
      out of reset without throwing errors and PCI config space should be
      accessible after reset.  This is too much to ask for some devices.
      
      Link: http://lkml.kernel.org/r/20140923210318.498dacbd@dualc.maya.org
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      CC: stable@vger.kernel.org	# v3.14+
      f331a859
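      A hedged sketch of how such a device would be quirked; the flag name
      follows the patch's intent, while the vendor/device IDs below are
      placeholders, not a real quirk entry:

        static void quirk_no_bus_reset(struct pci_dev *dev)
        {
                dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
        }
        /* hypothetical IDs for illustration only */
        DECLARE_PCI_FIXUP_HEADER(0x1234, 0x5678, quirk_no_bus_reset);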
  11. 15 Jan 2015, 1 commit
  12. 14 Jan 2015, 2 commits
  13. 12 Jan 2015, 1 commit
    • mmc: sdhci: Disable re-tuning for HS400 · b5540ce1
      Committed by Adrian Hunter
      Re-tuning for HS400 mode must be done in HS200 mode. Currently there is
      no support for that. That needs to be reflected in the code.
      Specifically, if tuning is executed in HS400 mode then return an error,
      and do not start the tuning timer if HS200 tuning is being done prior
      to switching to HS400.
      
      Note that periodic re-tuning is not expected to be needed for HS400,
      but re-tuning is still needed after the host controller has lost power.
      In the case of suspend/resume that is not necessary because the card is
      fully re-initialised. That just leaves runtime suspend/resume with no
      support for HS400 re-tuning.
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      b5540ce1
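      A hedged sketch of the two behaviours described above; only the timing
      constant is certain, the HS400-tuning flag name is an assumption:

        /* refuse tuning while already in HS400 mode */
        if (host->mmc->ios.timing == MMC_TIMING_MMC_HS400)
                return -EINVAL;

        /* HS200 tuning done only as a step towards HS400: don't arm the
         * periodic re-tuning timer (flag name assumed for illustration) */
        if (host->flags & SDHCI_HS400_TUNING)
                tuning_count = 0;       /* tuning_count drives the timer here */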
  14. 09 Jan 2015, 6 commits
  15. 08 Jan 2015, 7 commits
    • blk-mq: Allow requests to never expire · 5b3f25fc
      Committed by Keith Busch
      Some types of requests may be started that are not guaranteed to ever
      complete. This adds a request flag that a driver can use to mark the
      request as such.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5b3f25fc
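      A hedged driver-side sketch; the flag name is an assumption based on
      the changelog, not confirmed here:

        /* tell the block timeout handler this request may legitimately
         * never complete (flag name assumed) */
        req->cmd_flags |= REQ_NO_TIMEOUT;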
    • blk-mq: Add helper to abort requeued requests · 1885b24d
      Committed by Jens Axboe
      Adds a helper function a driver can use to abort requeued requests in
      case any are pending when h/w queues are being removed.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1885b24d
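      A hedged usage sketch; the calling context (a driver's hot-removal
      path) is an assumption, only the helper is from this change:

        blk_mq_abort_requeue_list(q);   /* end requests still parked for requeue */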
    • blk-mq: Let drivers cancel requeue_work · c68ed59f
      Committed by Keith Busch
      Kicking requeued requests will start h/w queues in a work_queue, which
      may alter the driver's requested state to temporarily stop them. This
      patch exports a method to cancel the q->requeue_work so a driver can be
      assured stopped h/w queues won't be started up before it is ready.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c68ed59f
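      A hedged usage sketch; the pairing with stopping the hardware queues
      is an assumption about a typical driver suspend path:

        blk_mq_cancel_requeue_work(q);  /* no requeue kick can restart queues now */
        blk_mq_stop_hw_queues(q);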
    • blk-mq: Export if requests were started · 973c0191
      Committed by Keith Busch
      Drivers can iterate over all allocated request tags, but their callback
      needs a way to know if the driver started the request in the first place.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      973c0191
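      A hedged sketch of a tag-iterator callback using the new export; the
      callback shape and name are illustrative, only blk_mq_request_started()
      is the interface described above:

        static void my_cancel_rq(struct request *rq, void *data)
        {
                if (!blk_mq_request_started(rq))
                        return;         /* tag allocated but never handed to the driver */
                /* ... fail or requeue the in-flight request ... */
        }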
    • libata: Whitelist SSDs that are known to properly return zeroes after TRIM · e61f7d1c
      Committed by Martin K. Petersen
      As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
      Zero After Trim) flags in the ATA Command Set are unreliable in the
      sense that they only define what happens if the device successfully
      executed the DSM TRIM command. TRIM is only advisory, however, and the
      device is free to silently ignore all or parts of the request.
      
      In practice this renders the DRAT and RZAT flags completely useless, and
      because the results are unpredictable we decided to disable discard in
      MD for 3.18 to avoid the risk of data corruption.
      
      Hardware vendors in the real world obviously need better guarantees than
      what the standards bodies provide. Unfortunately those guarantees are
      encoded in product requirements documents rather than somewhere we can
      key off of them programmatically. So we are compelled to disable
      discard_zeroes_data for all devices unless we explicitly have data to
      support whitelisting them.
      
      This patch whitelists SSDs from a few of the main vendors. None of the
      whitelists are based on written guarantees. They are purely based on
      empirical evidence collected from internal and external users that have
      tested or qualified these drives in RAID deployments.
      
      The whitelist is only meant as a starting point and is by no means
      comprehensive:
      
         - All Intel SSD models except for 510
         - Micron M5?0/M600
         - Samsung SSDs
         - Seagate SSDs
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e61f7d1c
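      A hedged sketch of what a whitelist entry looks like in libata's device
      table; the model globs and the horkage flag name here are illustrative,
      not copied from the patch:

        /* illustrative entries only */
        { "INTEL*SSDSC2*",   NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
        { "Samsung SSD 8*",  NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },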
    • time: settimeofday: Validate the values of tv from user · 6ada1fc0
      Committed by Sasha Levin
      An unvalidated user input is multiplied by a constant, which can result
      in undefined behaviour for large values. While this is validated later,
      we should avoid triggering undefined behaviour.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      [jstultz: include trivial millisecond->microsecond correction noticed
      by Andy]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      6ada1fc0
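      A hedged sketch of the kind of early check described; the bounds are
      written out explicitly here, while the actual patch may use an existing
      helper such as timeval_valid():

        if (tv) {
                if (tv->tv_usec < 0 || tv->tv_usec >= USEC_PER_SEC)
                        return -EINVAL;         /* reject before any multiply */
        }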
    • blk-mq: get rid of ->cmd_size in the hardware queue · 17ded320
      Committed by Jens Axboe
      We store it in the tag set, we don't need it in the hardware queue.
      While removing cmd_size, place ->queue_num further down to avoid
      a hole on 64-bit archs. It's not used in any fast paths, so we
      can safely move it.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      17ded320
  16. 07 Jan 2015, 1 commit
    • mm: propagate error from stack expansion even for guard page · fee7e49d
      Committed by Linus Torvalds
      Jay Foad reports that the address sanitizer test (asan) sometimes gets
      confused by a stack pointer that ends up being outside the stack vma
      that is reported by /proc/maps.
      
      This happens due to an interaction between RLIMIT_STACK and the guard
      page: when we do the guard page check, we ignore the potential error
      from the stack expansion, which effectively results in a missing guard
      page, since the expected stack expansion won't have been done.
      
      And since /proc/maps explicitly ignores the guard page (commit
      d7824370: "mm: fix up some user-visible effects of the stack guard
      page"), the stack pointer ends up being outside the reported stack area.
      
      This is the minimal patch: it just propagates the error.  It also
      effectively makes the guard page part of the stack limit, which in turn
      means that the actual real stack is one page less than the stack limit.
      
      Let's see if anybody notices.  We could teach acct_stack_growth() to
      allow an extra page for a grow-up/grow-down stack in the rlimit test,
      but I don't want to add more complexity if it isn't needed.
      Reported-and-tested-by: Jay Foad <jay.foad@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fee7e49d
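      A hedged sketch of the propagation, simplified to the grows-down case
      and with the surrounding fault-handling context omitted:

        /* previously the expansion result was ignored here, silently losing
         * the guard page when expansion failed (e.g. RLIMIT_STACK hit) */
        if ((vma->vm_flags & VM_GROWSDOWN) && address == vma->vm_start)
                return expand_downwards(vma, address - PAGE_SIZE);
        return 0;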
  17. 06 Jan 2015, 2 commits
  18. 30 Dec 2014, 1 commit
    • mm: get rid of radix tree gfp mask for pagecache_get_page · 45f87de5
      Committed by Michal Hocko
      Commit 2457aec6 ("mm: non-atomically mark page accessed during page
      cache allocation where possible") has added a separate parameter for
      specifying gfp mask for radix tree allocations.
      
      Not only is this less than optimal from the API point of view because it
      is error prone, it is also currently buggy because
      grab_cache_page_write_begin is using GFP_KERNEL for the radix tree, and
      if fgp_flags doesn't contain FGP_NOFS (mostly controlled by the fs via
      the AOP_FLAG_NOFS flag) but the mapping_gfp_mask has __GFP_FS cleared,
      then the radix tree allocation wouldn't obey the restriction and might
      recurse into the filesystem and cause deadlocks.  This is the case for
      most filesystems unfortunately because only ext4 and gfs2 are using
      AOP_FLAG_NOFS.
      
      Let's simply remove the radix_gfp_mask parameter because the allocation
      context is the same for both the page cache and the radix tree.  Just
      make sure that the radix tree gets only the sane subset of the mask
      (e.g. do not pass __GFP_WRITE).
      
      Long term it is more preferable to convert remaining users of
      AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this
      interface even further.
      Reported-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45f87de5
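      A hedged sketch of the resulting call pattern inside
      pagecache_get_page(), simplified; GFP_RECLAIM_MASK stands in for the
      "sane subset" mentioned above:

        page = __page_cache_alloc(gfp_mask);
        if (!page)
                return NULL;
        /* same mask for the radix tree insertion, minus bits that only make
         * sense for the page itself (e.g. __GFP_WRITE) */
        err = add_to_page_cache_lru(page, mapping, offset,
                                    gfp_mask & GFP_RECLAIM_MASK);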
  19. 27 Dec 2014, 1 commit
    • netlink/genetlink: pass network namespace to bind/unbind · 023e2cfa
      Committed by Johannes Berg
      Netlink families can exist in multiple namespaces, and for the most
      part multicast subscriptions are per network namespace. Thus it only
      makes sense to have bind/unbind notifications per network namespace.
      
      To achieve this, pass the network namespace of a given client socket
      to the bind/unbind functions.
      
      Also do this in generic netlink, and there also make sure that any
      bind for multicast groups that only exist in init_net is rejected.
      This isn't really a problem if it is accepted, since a client in a
      different namespace will never receive any notifications from such
      a group, but it can confuse the family if not rejected.  (It would
      also be possible to silently, i.e. without telling the family,
      accept it, but then it would also have to be ignored on unbind so
      that families which take any kind of action on bind/unbind don't do
      unnecessary work for invalid clients like that.)
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      023e2cfa
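      A hedged sketch of the widened callback signatures in
      struct netlink_kernel_cfg (previously these took only the group):

        /* fields unrelated to bind/unbind omitted for brevity */
        struct netlink_kernel_cfg {
                int  (*bind)(struct net *net, int group);
                void (*unbind)(struct net *net, int group);
        };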