1. 19 Jun, 2013 6 commits
    • FS-Cache: Fix object state machine to have separate work and wait states · caaef690
      David Howells committed
      Fix object state machine to have separate work and wait states as that makes
      it easier to envision.
      
      There are now three kinds of state:
      
       (1) Work state.  This is an execution state.  No event processing is performed
           by a work state.  The function attached to a work state returns a pointer
           indicating the next state to which the OSM should transition.  Returning
           NO_TRANSIT repeats the current state, but goes back to the scheduler
           first.
      
       (2) Wait state.  This is an event processing state.  No execution is
           performed by a wait state.  Wait states are just tables of "if event X
           occurs, clear it and transition to state Y".  The dispatcher returns to
           the scheduler if none of the events in which the wait state has an
           interest are currently pending.
      
       (3) Out-of-band state.  This is a special work state.  Transitions to normal
           states can be overridden when an unexpected event occurs (eg. I/O error).
           Instead the dispatcher disables and clears the OOB event and transits to
           the specified work state.  This then acts as an ordinary work state,
           though object->state points to the overridden destination.  Returning
           NO_TRANSIT resumes the overridden transition.
      
      In addition, the states have names in their definitions, so there's no need for
      tables of state names.  Further, the EV_REQUEUE event is no longer necessary as
      that is automatic for work states.
      
      Since the states are now separate structs rather than values in an enum, it's
      not possible to use comparisons other than (non-)equality between them, so use
      some object->flags to indicate what phase an object is in.
      
      The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one
      (EV_KILL).  An object flag now carries the information about retirement.
      
      Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged
      into a KILL_OBJECT state and additional states have been added for handling
      waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS).
      
      A state has also been added for synchronising with parent object initialisation
      (WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY).
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
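The work/wait split described in the commit above can be illustrated with a small user-space sketch. This is not the fs/fscache code: the struct layouts, the `st_*` state objects, the `dispatch()` helper and the AVAILABLE state are all hypothetical, chosen only to show a work state returning its successor and a wait state acting as an event table. Out-of-band states are left out.

```c
#include <assert.h>
#include <stddef.h>

struct object;
struct state;

#define NO_TRANSIT ((const struct state *)NULL)
#define MAX_TRANSITIONS 4

enum { EV_PARENT_READY = 1 << 0, EV_KILL = 1 << 1 };

/* One row of a wait state's table: "if event X occurs, go to state Y". */
struct transition {
    unsigned long events;
    const struct state *transit_to;
};

struct state {
    const char *name;                                /* name lives in the definition */
    const struct state *(*work)(struct object *obj); /* non-NULL marks a work state */
    struct transition transitions[MAX_TRANSITIONS];  /* wait states only, zero-terminated */
};

struct object {
    const struct state *state;
    unsigned long events;   /* pending events */
    int lookups;            /* side effect recorded by the look-up work state */
};

static const struct state st_dead = { .name = "DEAD" };

static const struct state *do_kill(struct object *obj)
{
    (void)obj;
    return &st_dead;
}

static const struct state st_kill_object = { .name = "KILL_OBJECT", .work = do_kill };

static const struct state st_available = {
    .name = "AVAILABLE",
    .transitions = { { EV_KILL, &st_kill_object } },
};

static const struct state *do_look_up(struct object *obj)
{
    obj->lookups++;
    return &st_available;
}

static const struct state st_look_up = { .name = "LOOK_UP", .work = do_look_up };

static const struct state st_wait_for_parent = {
    .name = "WAIT_FOR_PARENT",
    .transitions = {
        { EV_PARENT_READY, &st_look_up },
        { EV_KILL, &st_kill_object },
    },
};

/* Run the object until it would return to the scheduler; returns transitions made. */
static int dispatch(struct object *obj)
{
    int steps = 0;

    for (;;) {
        const struct state *s = obj->state;
        const struct state *next = NO_TRANSIT;

        if (s->work) {
            next = s->work(obj);          /* work state: execute, no event processing */
        } else {
            for (const struct transition *t = s->transitions; t->events; t++) {
                if (obj->events & t->events) {
                    obj->events &= ~t->events;   /* clear the event, then transit */
                    next = t->transit_to;
                    break;
                }
            }
        }
        if (next == NO_TRANSIT)
            return steps;                 /* nothing to do: back to the scheduler */
        obj->state = next;
        steps++;
    }
}
```

Because each state is a separate struct, states can only be compared for (in)equality by address, which matches the commit message's remark about needing flags to record an object's phase.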
    • FS-Cache: Wrap checks on object state · 493f7bc1
      David Howells committed
      Wrap checks on object state (mostly outside of fs/fscache/object.c) with
      inline functions so that the mechanism can be replaced.
      
      Some of the state checks within object.c are left as-is as they will be
      replaced.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
    • FS-Cache: Uninline fscache_object_init() · 610be24e
      David Howells committed
      Uninline fscache_object_init() so as not to expose some of the FS-Cache
      internals to the cache backend.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
    • FS-Cache: Don't sleep in page release if __GFP_FS is not set · 0c59a95d
      David Howells committed
      Don't sleep in __fscache_maybe_release_page() if __GFP_FS is not set.  This
      goes some way towards mitigating fscache deadlocking against ext4 by way of
      the allocator, eg:
      
      INFO: task flush-8:0:24427 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      flush-8:0       D ffff88003e2b9fd8     0 24427      2 0x00000000
       ffff88003e2b9138 0000000000000046 ffff880012e3a040 ffff88003e2b9fd8
       0000000000011c80 ffff88003e2b9fd8 ffffffff81a10400 ffff880012e3a040
       0000000000000002 ffff880012e3a040 ffff88003e2b9098 ffffffff8106dcf5
      Call Trace:
       [<ffffffff8106dcf5>] ? __lock_is_held+0x31/0x53
       [<ffffffff81219b61>] ? radix_tree_lookup_element+0xf4/0x12a
       [<ffffffff81454bed>] schedule+0x60/0x62
       [<ffffffffa01d349c>] __fscache_wait_on_page_write+0x8b/0xa5 [fscache]
       [<ffffffff810498a8>] ? __init_waitqueue_head+0x4d/0x4d
       [<ffffffffa01d393a>] __fscache_maybe_release_page+0x30c/0x324 [fscache]
       [<ffffffffa01d369a>] ? __fscache_maybe_release_page+0x6c/0x324 [fscache]
       [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
       [<ffffffffa01fd7b2>] nfs_fscache_release_page+0x68/0x94 [nfs]
       [<ffffffffa01ef73e>] nfs_release_page+0x7e/0x86 [nfs]
       [<ffffffff810aa553>] try_to_release_page+0x32/0x3b
       [<ffffffff810b6c70>] shrink_page_list+0x535/0x71a
       [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
       [<ffffffff810b7352>] shrink_inactive_list+0x20a/0x2dd
       [<ffffffff81071a13>] ? mark_held_locks+0xbe/0xea
       [<ffffffff810b7a65>] shrink_lruvec+0x34c/0x3eb
       [<ffffffff810b7bd3>] do_try_to_free_pages+0xcf/0x355
       [<ffffffff810b7fc8>] try_to_free_pages+0x9a/0xa1
       [<ffffffff810b08d2>] __alloc_pages_nodemask+0x494/0x6f7
       [<ffffffff810d9a07>] kmem_getpages+0x58/0x155
       [<ffffffff810dc002>] fallback_alloc+0x120/0x1f3
       [<ffffffff8106db23>] ? trace_hardirqs_off+0xd/0xf
       [<ffffffff810dbed3>] ____cache_alloc_node+0x177/0x186
       [<ffffffff81162a6c>] ? ext4_init_io_end+0x1c/0x37
       [<ffffffff810dc403>] kmem_cache_alloc+0xf1/0x176
       [<ffffffff810b17ac>] ? test_set_page_writeback+0x101/0x113
       [<ffffffff81162a6c>] ext4_init_io_end+0x1c/0x37
       [<ffffffff81162ce4>] ext4_bio_write_page+0x20f/0x3af
       [<ffffffff8115cc02>] mpage_da_submit_io+0x26e/0x2f6
       [<ffffffff811088e5>] ? __find_get_block_slow+0x38/0x133
       [<ffffffff81161348>] mpage_da_map_and_submit+0x3a7/0x3bd
       [<ffffffff81161a60>] ext4_da_writepages+0x30d/0x426
       [<ffffffff810b3359>] do_writepages+0x1c/0x2a
       [<ffffffff81102f4d>] __writeback_single_inode+0x3e/0xe5
       [<ffffffff81103995>] writeback_sb_inodes+0x1bd/0x2f4
       [<ffffffff81103b3b>] __writeback_inodes_wb+0x6f/0xb4
       [<ffffffff81103c81>] wb_writeback+0x101/0x195
       [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
       [<ffffffff811043aa>] ? wb_do_writeback+0xaa/0x173
       [<ffffffff8110434a>] wb_do_writeback+0x4a/0x173
       [<ffffffff81071bbc>] ? trace_hardirqs_on+0xd/0xf
       [<ffffffff81038554>] ? del_timer+0x4b/0x5b
       [<ffffffff811044e0>] bdi_writeback_thread+0x6d/0x147
       [<ffffffff81104473>] ? wb_do_writeback+0x173/0x173
       [<ffffffff81048fbc>] kthread+0xd0/0xd8
       [<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e
       [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
       [<ffffffff81456aac>] ret_from_fork+0x7c/0xb0
       [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
      2 locks held by flush-8:0/24427:
       #0:  (&type->s_umount_key#41){.+.+..}, at: [<ffffffff810e3b73>] grab_super_passive+0x4c/0x76
       #1:  (jbd2_handle){+.+...}, at: [<ffffffff81190d81>] start_this_handle+0x475/0x4ea
      
      
      The problem here is that another thread, which is attempting to write the
      to-be-stored NFS page to the on-ext4 cache file is waiting for the journal
      lock, eg:
      
      INFO: task kworker/u:2:24437 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kworker/u:2     D ffff880039589768     0 24437      2 0x00000000
       ffff8800395896d8 0000000000000046 ffff8800283bf040 ffff880039589fd8
       0000000000011c80 ffff880039589fd8 ffff880039f0b040 ffff8800283bf040
       0000000000000006 ffff8800283bf6b8 ffff880039589658 ffffffff81071a13
      Call Trace:
       [<ffffffff81071a13>] ? mark_held_locks+0xbe/0xea
       [<ffffffff81455e73>] ? _raw_spin_unlock_irqrestore+0x3a/0x50
       [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
       [<ffffffff81071bbc>] ? trace_hardirqs_on+0xd/0xf
       [<ffffffff81454bed>] schedule+0x60/0x62
       [<ffffffff81190c23>] start_this_handle+0x317/0x4ea
       [<ffffffff810498a8>] ? __init_waitqueue_head+0x4d/0x4d
       [<ffffffff81190fcc>] jbd2__journal_start+0xb3/0x12e
       [<ffffffff81176606>] __ext4_journal_start_sb+0xb2/0xc6
       [<ffffffff8115f137>] ext4_da_write_begin+0x109/0x233
       [<ffffffff810a964d>] generic_file_buffered_write+0x11a/0x264
       [<ffffffff811032cf>] ? __mark_inode_dirty+0x2d/0x1ee
       [<ffffffff810ab1ab>] __generic_file_aio_write+0x2a5/0x2d5
       [<ffffffff810ab24a>] generic_file_aio_write+0x6f/0xd0
       [<ffffffff81159a2c>] ext4_file_write+0x38c/0x3c4
       [<ffffffff810e0915>] do_sync_write+0x91/0xd1
       [<ffffffffa00a17f0>] cachefiles_write_page+0x26f/0x310 [cachefiles]
       [<ffffffffa01d470b>] fscache_write_op+0x21e/0x37a [fscache]
       [<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e
       [<ffffffffa01d2479>] fscache_op_work_func+0x78/0xd7 [fscache]
       [<ffffffff8104455a>] process_one_work+0x232/0x3a8
       [<ffffffff810444ff>] ? process_one_work+0x1d7/0x3a8
       [<ffffffff81044ee0>] worker_thread+0x214/0x303
       [<ffffffff81044ccc>] ? manage_workers+0x245/0x245
       [<ffffffff81048fbc>] kthread+0xd0/0xd8
       [<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e
       [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
       [<ffffffff81456aac>] ret_from_fork+0x7c/0xb0
       [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
      4 locks held by kworker/u:2/24437:
       #0:  (fscache_operation){.+.+.+}, at: [<ffffffff810444ff>] process_one_work+0x1d7/0x3a8
       #1:  ((&op->work)){+.+.+.}, at: [<ffffffff810444ff>] process_one_work+0x1d7/0x3a8
       #2:  (sb_writers#14){.+.+.+}, at: [<ffffffff810ab22c>] generic_file_aio_write+0x51/0xd0
       #3:  (&sb->s_type->i_mutex_key#19){+.+.+.}, at: [<ffffffff810ab236>] generic_file_aio_write+0x5b/0x
      
      fscache already tries to cancel pending stores, but it can't cancel a write
      for which I/O is already in progress.
      
      An alternative would be to accept writing garbage to the cache under extreme
      circumstances and to kill the afflicted cache object if we have to do this.
      However, we really need to know how strapped the allocator is before deciding
      to do that.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
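The idea of refusing to sleep when the caller has not allowed filesystem recursion can be sketched in user-space C. Everything here is a stand-in: `SK_GFP_FS`, `struct sk_page` and `sk_maybe_release_page()` are hypothetical names, and the real check in `__fscache_maybe_release_page()` may test more than this one flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag bit; the real __GFP_FS value lives in the kernel's gfp headers. */
#define SK_GFP_FS 0x2u

struct sk_page {
    bool being_written;   /* a store to the cache is in progress for this page */
};

/*
 * Returns true if the page may be released now.  If a write is in flight and
 * the caller did not pass SK_GFP_FS, report "busy" instead of sleeping, since
 * the wait could depend on filesystem locks the caller may already hold.
 */
static bool sk_maybe_release_page(struct sk_page *page, unsigned int gfp,
                                  bool (*wait_for_write)(struct sk_page *))
{
    if (!page->being_written)
        return true;
    if (!(gfp & SK_GFP_FS))
        return false;             /* must not sleep here */
    return wait_for_write(page);  /* sleeping is permitted */
}

/* Toy waiter used for demonstration: pretends the write completed. */
static bool sk_wait_ok(struct sk_page *page)
{
    page->being_written = false;
    return true;
}
```

Returning false leaves the page in place, so the allocator simply moves on to another page instead of blocking behind the journal as in the traces above.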
    • CacheFiles: name i_mutex lock class explicitly · 6bd5e82b
      J. Bruce Fields committed
      Just some cleanup.
      
      (And note the caller of this function may, for example, call vfs_unlink
      on a child, so the "1" (I_MUTEX_PARENT) really was what was intended
      here.)
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
    • fs/fscache: remove spin_lock() from the condition in while() · ee8be57b
      Sebastian Andrzej Siewior committed
      The spin_lock() within the condition in while() will cause a compile error
      if it is not a function. This is not a problem on mainline, but it does not
      look pretty and there is no reason to do it that way.
      This patch writes it a little differently and avoids the double condition.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
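A minimal sketch of why a lock call inside a `while()` condition is fragile, and one way to restructure the loop. The `toy_lock()`/`toy_unlock()` macros and the queue type are hypothetical; they only mimic a lock primitive implemented as a statement macro, which cannot appear in a comma expression such as `while (toy_lock(&l), cond)`.

```c
#include <assert.h>

struct toy_spinlock {
    int held;
    int acquisitions;
};

/* Statement-style macros: fine as statements, a compile error inside an expression. */
#define toy_lock(l)   do { (l)->held = 1; (l)->acquisitions++; } while (0)
#define toy_unlock(l) do { (l)->held = 0; } while (0)

struct toy_queue {
    struct toy_spinlock lock;
    int items;
};

/*
 * Drain the queue.  The lock is taken as a plain statement at the top of each
 * iteration, the emptiness check is a single condition, and each item is
 * processed with the lock dropped.
 */
static int toy_drain(struct toy_queue *q)
{
    int processed = 0;

    for (;;) {
        toy_lock(&q->lock);
        if (q->items == 0)
            break;               /* leave the loop with the lock still held */
        q->items--;
        toy_unlock(&q->lock);
        processed++;             /* work happens outside the lock */
    }
    toy_unlock(&q->lock);
    return processed;
}
```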
  2. 15 May, 2013 3 commits
    • Add wait_on_atomic_t() and wake_up_atomic_t() · cb65537e
      David Howells committed
      Add wait_on_atomic_t() and wake_up_atomic_t() to indicate became-zero events on
      atomic_t types.  This uses the bit-wake waitqueue table.  The key is set to a
      value outside of the number of bits in a long so that wait_on_bit() won't be
      woken up accidentally.
      
      What I'm using this for is: in a following patch I add a counter to struct
      fscache_cookie to count the number of outstanding operations that need access
      to netfs data.  The way this works is:
      
       (1) When a cookie is allocated, the counter is initialised to 1.
      
       (2) When an operation wants to access netfs data, it calls atomic_inc_unless()
           to increment the counter before it does so.  If it was 0, then the counter
           isn't incremented, the operation isn't permitted to access the netfs data
           (which might by this point no longer exist) and the operation aborts in
           some appropriate manner.
      
       (3) When an operation finishes with the netfs data, it decrements the counter
           and if it reaches 0, calls wake_up_atomic_t() on it - the assumption being
           that it was the last blocker.
      
       (4) When a cookie is released, the counter is decremented and the releaser
           uses wait_on_atomic_t() to wait for the counter to become 0 - which should
           indicate no one is using the netfs data any longer.  The netfs data can
           then be destroyed.
      
      There are some alternatives that I have thought of and that have been suggested
      by Tejun Heo:
      
       (A) Using wait_on_bit() to wait on a bit in the counter.  This doesn't work
           because if that bit happens to be 0 then the wait won't happen - even if
           the counter is non-zero.
      
       (B) Using wait_on_bit() to wait on a flag elsewhere which is cleared when the
           counter reaches 0.  Such a flag would be redundant and would add
           complexity.
      
       (C) Adding a waitqueue to fscache_cookie - this would expand that struct by
           several words for an event that happens just once in each cookie's
           lifetime.  Further, cookies are generally per-file so there are likely to
           be a lot of them.
      
       (D) Similar to (C), but add a pointer to a waitqueue in the cookie instead of
           a waitqueue.  This would add a single word per cookie and so would be less
           of an expansion - but still an expansion.
      
       (E) Adding a static waitqueue to the fscache module.  Generally this would be
           fine, but under certain circumstances many cookies will all get added at
           the same time (eg. NFS umount, cache withdrawal) thereby presenting
           scaling issues.  Note that the wait may be significant as disk I/O may be
           in progress.
      
      So, I think reusing the wait_on_bit() waitqueue set is reasonable.  I don't
      make much use of the waitqueue I need on a per-cookie basis, but sometimes I
      have a huge flood of the cookies to deal with.
      
      I also don't want to add a whole new set of global waitqueue tables
      specifically for the dec-to-0 event if I can reuse the bit tables.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
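The four-step counter protocol from the commit above can be modelled in user space with C11 atomics. This is only a sketch: `struct sk_cookie` and the `sk_*` helpers are hypothetical, the `wakeups` field merely records where `wake_up_atomic_t()` would be called, and no real sleeping or waking takes place.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sk_cookie {
    atomic_int n_active;  /* outstanding users of the netfs data, plus one for the cookie */
    int wakeups;          /* counts the points where a waiter would be woken */
};

/* Step 1: the counter starts at 1 when the cookie is allocated. */
static void sk_cookie_init(struct sk_cookie *c)
{
    atomic_init(&c->n_active, 1);
    c->wakeups = 0;
}

/* Step 2: increment unless the counter is already 0; false means "abort the operation". */
static bool sk_begin_access(struct sk_cookie *c)
{
    int old = atomic_load(&c->n_active);

    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&c->n_active, &old, old + 1))
            return true;
    }
    return false;
}

/*
 * Steps 3 and 4 share the decrement.  Returns true if the counter reached 0,
 * which is where the last user would wake the releaser.
 */
static bool sk_put(struct sk_cookie *c)
{
    if (atomic_fetch_sub(&c->n_active, 1) == 1) {
        c->wakeups++;
        return true;
    }
    return false;
}
```

In this model the releaser calls `sk_put()` once to drop the initial reference; if that returns false, it would have to wait until some operation's `sk_put()` brings the counter to 0.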
    • Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · b973425c
      Linus Torvalds committed
      Pull ext4 update from Ted Ts'o:
       "Fixed regressions (two stability regressions and a performance
        regression) introduced during the 3.10-rc1 merge window.
      
        Also included is a bug fix relating to allocating blocks after
        resizing an ext3 file system when using the ext4 file system driver"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        jbd,jbd2: fix oops in jbd2_journal_put_journal_head()
        ext4: revert "ext4: use io_end for multiple bios"
        ext4: limit group search loop for non-extent files
        ext4: fix fio regression
    • Merge branch 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · 7fb30d2b
      Linus Torvalds committed
      Pull workqueue fix from Tejun Heo:
       "A fix for a workqueue_congested() regression that broke fscache"
      
      * 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: workqueue_congested() shouldn't translate WORK_CPU_UNBOUND into node number
  3. 14 May, 2013 31 commits
    • Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc · a2c7a54f
      Linus Torvalds committed
      Pull powerpc fixes from Benjamin Herrenschmidt:
       "This is mostly bug fixes (some of them regressions, some of them I
        deemed worth merging now) along with some patches from Li Zhong
        hooking up the new context tracking stuff (for the new full NO_HZ)"
      
      * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (25 commits)
        powerpc: Set show_unhandled_signals to 1 by default
        powerpc/perf: Fix setting of "to" addresses for BHRB
        powerpc/pmu: Fix order of interpreting BHRB target entries
        powerpc/perf: Move BHRB code into CONFIG_PPC64 region
        powerpc: select HAVE_CONTEXT_TRACKING for pSeries
        powerpc: Use the new schedule_user API on userspace preemption
        powerpc: Exit user context on notify resume
        powerpc: Exception hooks for context tracking subsystem
        powerpc: Syscall hooks for context tracking subsystem
        powerpc/booke64: Fix kernel hangs at kernel_dbg_exc
        powerpc: Fix irq_set_affinity() return values
        powerpc: Provide __bswapdi2
        powerpc/powernv: Fix starting of secondary CPUs on OPALv2 and v3
        powerpc/powernv: Detect OPAL v3 API version
        powerpc: Fix MAX_STACK_TRACE_ENTRIES too low warning again
        powerpc: Make CONFIG_RTAS_PROC depend on CONFIG_PROC_FS
        powerpc: Bring all threads online prior to migration/hibernation
        powerpc/rtas_flash: Fix validate_flash buffer overflow issue
        powerpc/kexec: Fix kexec when using VMX optimised memcpy
        powerpc: Fix build errors STRICT_MM_TYPECHECKS
        ...
    • powerpc: Set show_unhandled_signals to 1 by default · e34166ad
      Benjamin Herrenschmidt committed
      Just like other architectures
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/perf: Fix setting of "to" addresses for BHRB · 69123184
      Michael Neuling committed
      Currently we only set the "to" address in the branch stack when the CPU
      explicitly gives us a value.  Unfortunately it only does this for XL form
      branches (eg blr, bctr, bctar) and not I and B form branches (eg b, bc).
      
      Fortunately if we read the instruction from memory we can extract the offset of
      a branch and calculate the target address.
      
      This adds a function power_pmu_bhrb_to() to calculate the target/to address of
      the corresponding I and B form branches.  It handles branches in both user and
      kernel spaces.  It also plumbs this into the perf BHRB reading code.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
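The target calculation for I-form and B-form branches can be sketched as below. This is an illustration of the instruction decoding only, not the kernel's `power_pmu_bhrb_to()`: the function name `sk_branch_target()` is hypothetical, and reading the instruction from user or kernel memory is not shown.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given a 32-bit instruction word and the address it was fetched from, return
 * the branch target.  Returns 0 if the word is not an I-form (primary opcode
 * 18) or B-form (primary opcode 16) branch.
 */
static uint64_t sk_branch_target(uint32_t instr, uint64_t addr)
{
    uint32_t opcode = instr >> 26;
    int64_t offset;

    if (opcode == 18) {                /* I-form: b, ba, bl, bla */
        offset = instr & 0x03fffffc;   /* LI field, already scaled by 4 */
        if (offset & 0x02000000)
            offset -= 0x04000000;      /* sign-extend the 26-bit displacement */
    } else if (opcode == 16) {         /* B-form: bc and its variants */
        offset = instr & 0x0000fffc;   /* BD field, already scaled by 4 */
        if (offset & 0x8000)
            offset -= 0x10000;         /* sign-extend the 16-bit displacement */
    } else {
        return 0;
    }

    if (instr & 0x2)                   /* AA bit set: displacement is absolute */
        return (uint64_t)offset;
    return addr + (uint64_t)offset;    /* otherwise relative to the branch itself */
}
```

XL-form branches such as blr (primary opcode 19) take their target from a register, which is why the hardware has to supply the address for those.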
    • powerpc/pmu: Fix order of interpreting BHRB target entries · 506e70d1
      Michael Neuling committed
      The current Branch History Rolling Buffer (BHRB) code misinterprets the order
      of entries in the hardware buffer.  It assumes that a branch target address
      will be read _after_ its corresponding branch.  In reality the branch target
      comes before (lower mfbhrb entry) its corresponding branch.
      
      This is a rewrite of the code to take this into account.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/perf: Move BHRB code into CONFIG_PPC64 region · d52f2dc4
      Michael Neuling committed
      The new Branch History Rolling buffer (BHRB) code is only useful on 64bit
      processors, so move it into the #ifdef CONFIG_PPC64 region.
      
      This avoids code bloat on 32bit systems.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: select HAVE_CONTEXT_TRACKING for pSeries · a1797b2f
      Li Zhong committed
      Start context tracking support from pSeries.
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Use the new schedule_user API on userspace preemption · 5d1c5745
      Li Zhong committed
      This patch corresponds to
      [PATCH] x86: Use the new schedule_user API on userspace preemption
        commit 0430499c
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Exit user context on notify resume · 106ed886
      Li Zhong committed
      This patch allows RCU usage in do_notify_resume, e.g. signal handling.
      It corresponds to
      [PATCH] x86: Exit RCU extended QS on notify resume
        commit edf55fda
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Exception hooks for context tracking subsystem · ba12eede
      Li Zhong committed
      These are the exception hooks for the context tracking subsystem, covering
      data access, program check, single step, instruction breakpoint, machine check,
      alignment, fp unavailable, altivec assist and unknown exception, whose handlers
      might use RCU.
      
      This patch corresponds to
      [PATCH] x86: Exception hooks for userspace RCU extended QS
        commit 6ba3c97a
      
      However, now that exception handling has moved to generic code, with some
      changes in the following two commits:
      56dd9470
        context_tracking: Move exception handling to generic code
      6c1e0256
        context_tracking: Restore correct previous context state on exception exit
      
      the exception hooks can use the generic code above instead of a
      redundant arch implementation.
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Syscall hooks for context tracking subsystem · 22ecbe8d
      Li Zhong committed
      This is the syscall slow path hooks for context tracking subsystem,
      corresponding to
      [PATCH] x86: Syscall hooks for userspace RCU extended QS
        commit bf5a3c13
      
      TIF_MEMDIE is moved to the second 16 bits (with value 17), as no asm code
      appears to use it. TIF_NOHZ is added to _TIF_SYSCALL_T_OR_A, so it is better
      kept in the same 16 bits as the other flags in the group, so that andi. with
      this group works in the asm code.
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/booke64: Fix kernel hangs at kernel_dbg_exc · 6cecf76b
      Scott Wood committed
      MSR_DE is not cleared on entry to the kernel, and we don't clear it
      explicitly outside of debug code.  If we have MSR_DE set in
      prime_debug_regs(), and the new thread has events enabled in DBCR0
      (e.g.  ICMP is set in thread->dbsr0, even though it was cleared in the
      real DBCR0 when the thread got scheduled out), we'll end up taking a
      debug exception in the kernel when DBCR0 is loaded.  DSRR0 will not
      point to an exception vector, and the kernel ends up hanging at
      kernel_dbg_exc.  Fix this by always clearing MSR_DE when we load new
      debug state.
      
      Another observed source of kernel_dbg_exc hangs is with the branch
      taken event.  If this event is active, but we take a non-debug trap
      (e.g. a TLB miss or an asynchronous interrupt) before the next branch,
      we end up taking a branch-taken debug exception on the initial branch
      instruction of the exception vector, but because the debug exception is
      DBSR_BT rather than DBSR_IC we branch to kernel_dbg_exc before even
      checking the DSRR0 address.  Fix this by checking for DBSR_BT as well
      as DBSR_IC, which is what 32-bit does and what the comments suggest was
      intended in the 64-bit code as well.
      Signed-off-by: Scott Wood <scottwood@freescale.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Provide __bswapdi2 · ca9d7aea
      David Woodhouse committed
      Some versions of GCC apparently expect this to be provided by libgcc.
      
      Updates from Mikey to fix 32 bit version and adding "r" to registers.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
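For readers unfamiliar with the helper, `__bswapdi2` reverses the byte order of a 64-bit value. The portable C sketch below shows the operation under a hypothetical name, `sk_bswapdi2()`, to avoid clashing with the libgcc symbol; it is not the implementation from the commit, whose message mentions registers and so suggests assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the eight bytes of x using three swap stages. */
static uint64_t sk_bswapdi2(uint64_t x)
{
    /* Swap adjacent bytes. */
    x = ((x & 0x00ff00ff00ff00ffULL) << 8)  | ((x >> 8)  & 0x00ff00ff00ff00ffULL);
    /* Swap adjacent 16-bit halves. */
    x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
    /* Swap the two 32-bit words. */
    return (x << 32) | (x >> 32);
}
```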
    • powerpc/powernv: Fix starting of secondary CPUs on OPALv2 and v3 · b2b48584
      Benjamin Herrenschmidt committed
      The current code fails to handle kexec on OPALv2. This fixes it
      and adds code to improve the situation on OPALv3 where we can
      query the CPU status from the firmware and decide what to do
      based on that.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/powernv: Detect OPAL v3 API version · 75b93da4
      Benjamin Herrenschmidt committed
      Future firmwares will support that new version
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Fix MAX_STACK_TRACE_ENTRIES too low warning again · af945cf4
      Li Zhong committed
      Saw this warning again, and this time from the ret_from_fork path.
      
      It seems we could clear the back chain earlier in copy_thread(), which
      could cover both paths, and also fix potential lockdep usage in
      schedule_tail(), or an exception occurring before we clear the back chain.
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Make CONFIG_RTAS_PROC depend on CONFIG_PROC_FS · b80ec3dc
      Michael Ellerman committed
      We are getting build errors with CONFIG_PROC_FS=n:
      
      arch/powerpc/kernel/rtas_flash.c
         In function 'rtas_flash_init':
        745:33: error: unused variable 'f' [-Werror=unused-variable]
      
      But rtas_flash.c should not be built when CONFIG_PROC_FS=n, because all
      it does is provide a /proc interface to the RTAS flash routines.
      
      CONFIG_RTAS_FLASH already depends on CONFIG_RTAS_PROC, to indicate that
      it depends on the RTAS proc support, but CONFIG_RTAS_PROC does not
      depend on CONFIG_PROC_FS. So fix that.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Bring all threads online prior to migration/hibernation · 120496ac
      Robert Jennings committed
      This patch brings online all threads which are present but not online
      prior to migration/hibernation.  After migration/hibernation those
      threads are taken back offline.
      
      During migration/hibernation all online CPUs must call H_JOIN, this is
      required by the hypervisor.  Without this patch, threads that are offline
      (H_CEDE'd) will not be woken to make the H_JOIN call and the OS will be
      deadlocked (all threads either JOIN'd or CEDE'd).
      
      Cc: <stable@kernel.org>
      Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/rtas_flash: Fix validate_flash buffer overflow issue · a94a1472
      Vasant Hegde committed
      The ibm,validate-flash-image RTAS call output buffer contains 150 - 200
      bytes of data on the latest systems. Presently the output buffer
      size is 64 bytes and we use sprintf to copy data from the
      RTAS buffer to the local buffer. This causes a kernel oops (see the
      call trace below).
      
      This patch increases local buffer size to 256 and also uses
      snprintf instead of sprintf to copy data from RTAS buffer.
      
      Kernel call trace :
      -------------------
      Oops: Kernel access of bad area, sig: 11 [#1]
      SMP NR_CPUS=1024 NUMA pSeries
      Modules linked in: nfs fscache lockd auth_rpcgss nfs_acl sunrpc fuse loop dm_mod ipv6 ipv6_lib usb_storage ehea(X) sr_mod qlge ses cdrom enclosure st be2net sg ext3 jbd mbcache usbhid hid ohci_hcd ehci_hcd usbcore qla2xxx usb_common sd_mod crc_t10dif scsi_dh_hp_sw scsi_dh_rdac scsi_dh_alua scsi_dh_emc scsi_dh lpfc scsi_transport_fc scsi_tgt ipr(X) libata scsi_mod
      Supported: Yes
      NIP: 4520323031333130 LR: 4520323031333130 CTR: 0000000000000000
      REGS: c0000001b91779b0 TRAP: 0400   Tainted: G            X  (3.0.13-0.27-ppc64)
      MSR: 8000000040009032 <EE,ME,IR,DR>  CR: 44022488  XER: 20000018
      TASK = c0000001bca1aba0[4736] 'cat' THREAD: c0000001b9174000 CPU: 36
      GPR00: 4520323031333130 c0000001b9177c30 c000000000f87c98 000000000000009b
      GPR04: c0000001b9177c4a 000000000000000b 3520323031333130 2032303133313031
      GPR08: 3133313031350a4d 000000000000009b 0000000000000000 c0000000003664a4
      GPR12: 0000000022022448 c000000003ee6c00 0000000000000002 00000000100e8a90
      GPR16: 00000000100cb9d8 0000000010093370 000000001001d310 0000000000000000
      GPR20: 0000000000008000 00000000100fae60 000000000000005e 0000000000000000
      GPR24: 0000000010129350 46573738302e3030 2046573738302e30 300a4d4720323031
      GPR28: 333130313520554e 4b4e4f574e0a4d47 2032303133313031 3520323031333130
      NIP [4520323031333130] 0x4520323031333130
      LR [4520323031333130] 0x4520323031333130
      Call Trace:
      [c0000001b9177c30] [4520323031333130] 0x4520323031333130 (unreliable)
      Instruction dump:
      XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
      XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
      Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
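The difference between sprintf and snprintf that the fix above relies on can be shown with a short user-space sketch. The helper `sk_copy_validate_output()` and the constant name are hypothetical; only the 256-byte size comes from the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SK_VALIDATE_BUF_SIZE 256   /* local buffer size named in the commit message */

/*
 * Copy a NUL-terminated firmware message into a fixed-size buffer.  snprintf
 * never writes more than dst_size bytes (terminator included), so an oversized
 * message is truncated instead of overrunning the buffer as sprintf would.
 * Returns 1 if the message was truncated or an error occurred, 0 otherwise.
 */
static int sk_copy_validate_output(char *dst, size_t dst_size, const char *src)
{
    int n = snprintf(dst, dst_size, "%s", src);

    return n < 0 || (size_t)n >= dst_size;
}
```

The NIP and LR values in the oops above are printable ASCII bytes, which is the typical signature of a return address overwritten by string data.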
    • powerpc/kexec: Fix kexec when using VMX optimised memcpy · 79c66ce8
      Anton Blanchard committed
      commit b3f271e8 (powerpc: POWER7 optimised memcpy using VMX and
      enhanced prefetch) uses VMX when it is safe to do so (ie not in
      interrupt). It also looks at the task struct to decide if we have to
      save the current task's VMX state.
      
      kexec calls memcpy() at a point where the task struct may have been
      overwritten by the new kexec segments. If it has been overwritten
      then when memcpy -> enable_altivec looks up current->thread.regs->msr
      we get a cryptic oops or lockup.
      
      I also notice we aren't initialising thread_info->cpu, which means
      smp_processor_id is broken. Fix that too.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Cc: <stable@vger.kernel.org> # 3.6+
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • A
      powerpc/mm: Use the correct mask value when looking at pgtable address · 613e60a6
      Aneesh Kumar K.V committed
      Our page tables are 2*sizeof(pte_t)*PTRS_PER_PTE bytes, which is PTE_FRAG_SIZE.
      Instead of depending on the fragment size, mask with PMD_MASKED_BITS.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • L
      Merge tag 'fixes-for-3.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen · 674825d0
      Linus Torvalds committed
      Pull Xen/arm fixes from Stefano Stabellini:
       "This contains a couple of Xen on ARM initialization fixes and a patch
        to improve error handling"
      
      * tag 'fixes-for-3.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen:
        xen/arm: rename xen_secondary_init and run it on every online cpu
        xen/arm: do not handle VCPUOP_register_vcpu_info failures
        xen/arm: initialize pm functions later
    • L
      Merge branch 'parisc-for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · c83bb885
      Linus Torvalds committed
      Pull parisc update from Helge Deller:
       "The second round of parisc updates for 3.10 includes build fixes and
        enhancements to utilize irq stacks, fixes SMP races when updating PTE
        and TLB entries by proper locking and makes the search for the correct
        cross compiler more robust on Debian and Gentoo."
      
      * 'parisc-for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: make default cross compiler search more robust (v3)
        parisc: fix SMP races when updating PTE and TLB entries in entry.S
        parisc: implement irq stacks - part 2 (v2)
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · dbbffe68
      Linus Torvalds committed
      Pull networking fixes from David Miller:
       "Several small bug fixes all over:
      
         1) be2net driver uses wrong payload length when submitting MAC list
            get requests to the chip.  From Sathya Perla.
      
         2) Fix mwifiex memory leak on driver unload, from Amitkumar Karwar.
      
         3) Prevent random memory access in batman-adv, from Marek Lindner.
      
         4) batman-adv doesn't check for pskb_trim_rcsum() errors, also from
            Marek Lindner.
      
         5) Fix fec crashes on rapid link up/down, from Frank Li.
      
         6) Fix inner protocol grovelling in GSO, from Pravin B Shelar.
      
         7) Link event validation fix in qlcnic from Rajesh Borundia.
      
         8) Not all FEC chips can support checksum offload, fix from Shawn
            Guo.
      
         9) EXPORT_SYMBOL + inline doesn't make any sense, from Denis Efremov.
      
        10) Fix race in passthru mode during device removal in macvlan, from
            Jiri Pirko.
      
        11) Fix RCU hash table lookup socket state race in ipv6, leading to
            NULL pointer derefs, from Eric Dumazet.
      
        12) Add several missing HAS_DMA kconfig dependencies, from Geert
            Uytterhoeven.
      
        13) Fix bogus PCI resource management in 3c59x driver, from Sergei
            Shtylyov.
      
        14) Fix info leak in ipv6 GRE tunnel driver, from Amerigo Wang.
      
        15) Fix device leak in ipv6 IPSEC policy layer, from Cong Wang.
      
        16) DMA mapping leak fix in qlge from Thadeu Lima de Souza Cascardo.
      
        17) Missing iounmap on probe failure in bna driver, from Wei Yongjun."
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (40 commits)
        bna: add missing iounmap() on error in bnad_init()
        qlge: fix dma map leak when the last chunk is not allocated
        xfrm6: release dev before returning error
        ipv6,gre: do not leak info to user-space
        virtio_net: use default napi weight by default
        emac: Fix EMAC soft reset on 460EX/GT
        3c59x: fix PCI resource management
        caif: CAIF_VIRTIO should depend on HAS_DMA
        net/ethernet: MACB should depend on HAS_DMA
        net/ethernet: ARM_AT91_ETHER should depend on HAS_DMA
        net/wireless: ATH9K should depend on HAS_DMA
        net/ethernet: STMMAC_ETH should depend on HAS_DMA
        net/ethernet: NET_CALXEDA_XGMAC should depend on HAS_DMA
        ipv6: do not clear pinet6 field
        macvlan: fix passthru mode race between dev removal and rx path
        ipv4: ip_output: remove inline marking of EXPORT_SYMBOL functions
        net/mlx4: Strengthen VLAN tags/priorities enforcement in VST mode
        net/mlx4_core: Add missing report on VST and spoof-checking dev caps
        net: fec: enable hardware checksum only on imx6q-fec
        qlcnic: Fix validation of link event command.
        ...
    • H
      parisc: make default cross compiler search more robust (v3) · 6880b015
      Helge Deller committed
      People/distros vary in how they prefix the toolchain name for 64bit builds.
      Rather than enforce one convention over another, add a for loop which
      searches through all the common prefixes.
      
      For 64bit builds, we now search for (in order):
      	hppa64-unknown-linux-gnu
      	hppa64-linux-gnu
      	hppa64-linux
      
      For 32bit builds, we look for:
      	hppa-unknown-linux-gnu
      	hppa-linux-gnu
      	hppa-linux
      	hppa2.0-unknown-linux-gnu
      	hppa2.0-linux-gnu
      	hppa2.0-linux
      	hppa1.1-unknown-linux-gnu
      	hppa1.1-linux-gnu
      	hppa1.1-linux
      
      This patch was initiated by Mike Frysinger, with feedback from Jeroen
      Roovers, John David Anglin and Helge Deller.
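
The search order listed above can be sketched as a simple loop. This is a Python illustration rather than the Makefile change itself: the function name `find_cross_prefix`, the `have_tool` callback, and the choice to probe for `<prefix>-gcc` on `PATH` are assumptions made for the example.

```python
import shutil
from typing import Callable, Optional

# Prefix lists copied from the commit message, in search order.
PREFIXES_64 = ["hppa64-unknown-linux-gnu", "hppa64-linux-gnu", "hppa64-linux"]
PREFIXES_32 = [
    "hppa-unknown-linux-gnu", "hppa-linux-gnu", "hppa-linux",
    "hppa2.0-unknown-linux-gnu", "hppa2.0-linux-gnu", "hppa2.0-linux",
    "hppa1.1-unknown-linux-gnu", "hppa1.1-linux-gnu", "hppa1.1-linux",
]


def find_cross_prefix(
    is_64bit: bool,
    have_tool: Optional[Callable[[str], bool]] = None,
) -> Optional[str]:
    """Return the first prefix whose compiler exists, with a trailing '-'."""
    if have_tool is None:
        # Assumed probe: look for the compiler binary on PATH.
        have_tool = lambda name: shutil.which(name) is not None
    for prefix in (PREFIXES_64 if is_64bit else PREFIXES_32):
        if have_tool(prefix + "-gcc"):
            return prefix + "-"
    return None  # no known toolchain prefix was found
```

Passing a custom `have_tool` makes the order easy to check without installing any toolchain, for example `find_cross_prefix(True, {"hppa64-linux-gcc"}.__contains__)`.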
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: Jeroen Roovers <jer@gentoo.org>
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
    • W
      bna: add missing iounmap() on error in bnad_init() · ba21fc69
      Wei Yongjun committed
      Add the missing iounmap() before returning from bnad_init()
      in the error handling case.
      Introduced by commit 01b54b14 (bna: tx rx cleanup fix).
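
The general shape of this kind of bug, a mapping that is set up early and must be undone on every later failure, can be sketched as follows. This is a Python stand-in, not the driver code: `FakeDevice`, `init_device` and the return values are hypothetical.

```python
class FakeDevice:
    """Stand-in for a device whose register window is mapped during init."""

    def __init__(self, fail_setup: bool):
        self.fail_setup = fail_setup
        self.mapped = False

    def map_registers(self) -> None:
        self.mapped = True

    def unmap_registers(self) -> None:
        self.mapped = False

    def setup(self) -> None:
        if self.fail_setup:
            raise OSError("setup failed")


def init_device(dev: FakeDevice) -> int:
    """Return 0 on success or -1 on failure, leaving no mapping on failure."""
    dev.map_registers()
    try:
        dev.setup()
    except OSError:
        dev.unmap_registers()  # the step an error path can easily miss
        return -1
    return 0
```

After a failed `init_device()` call, `dev.mapped` is `False` again, which is the property the fix restores for the error path.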
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      qlge: fix dma map leak when the last chunk is not allocated · ef380794
      Thadeu Lima de Souza Cascardo committed
      qlge allocates chunks from a page that it maps, and unmaps that page when
      the last chunk is released. When the driver is unloaded or the card is
      removed, all chunks are released and the page is unmapped for the last
      chunk.
      
      However, when the last chunk of a page is not allocated and the device
      is removed, that page is not unmapped. In fact, its last reference is
      not put and there's also a page leak. This bug prevents a device from
      being properly hotplugged.
      
      When the DMA API debug option is enabled, the messages below show
      the pending DMA allocation after the driver is removed.
      
      This patch fixes the bug by unmapping and putting the page from the ring
      if its last chunk has not been allocated.
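
The leak and the fix can be reproduced in a toy model. This is a Python sketch with a single page, not the driver's data structures: `PageRing`, `remove_device`, the `apply_fix` flag and the `live_mappings` counter (standing in for what the DMA debug facility tracks) are all hypothetical.

```python
from typing import List


class PageRing:
    """One mapped page carved into chunks that are handed out in order."""

    def __init__(self, chunks_per_page: int):
        self.chunks_per_page = chunks_per_page
        self.next_chunk = 0       # index of the next chunk to hand out
        self.page_mapped = False
        self.live_mappings = 0    # mappings that have not been undone yet

    def alloc_chunk(self) -> int:
        if self.next_chunk == 0:
            # The first chunk of a page triggers the mapping.
            self.page_mapped = True
            self.live_mappings += 1
        idx = self.next_chunk
        self.next_chunk = (idx + 1) % self.chunks_per_page
        return idx

    def release_chunk(self, idx: int) -> None:
        # Only releasing the last chunk of the page unmaps it.
        if idx == self.chunks_per_page - 1:
            self._unmap_page()

    def _unmap_page(self) -> None:
        if self.page_mapped:
            self.page_mapped = False
            self.live_mappings -= 1

    def teardown(self, apply_fix: bool) -> None:
        # With the fix, a page whose last chunk was never handed out
        # is unmapped here instead of being left behind.
        if apply_fix and self.page_mapped:
            self._unmap_page()


def remove_device(ring: PageRing, outstanding: List[int], apply_fix: bool) -> int:
    """Release all outstanding chunks, tear down, and report leaked mappings."""
    for idx in outstanding:
        ring.release_chunk(idx)
    ring.teardown(apply_fix)
    return ring.live_mappings
```

With four chunks per page, allocating only two and then removing the device leaves one live mapping unless `apply_fix` is set, which mirrors the `count=1` reported in the log.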
      
      pci 0005:98:00.0: DMA-API: device driver has pending DMA allocations while released from device [count=1]
      One of leaked entries details: [device address=0x0000000060a80000] [size=65536 bytes] [mapped with DMA_FROM_DEVICE] [mapped as page]
      ------------[ cut here ]------------
      WARNING: at lib/dma-debug.c:746
      Modules linked in: qlge(-) rpadlpar_io rpaphp pci_hotplug fuse [last unloaded: qlge]
      NIP: c0000000003fc3ec LR: c0000000003fc3e8 CTR: c00000000054de60
      REGS: c0000003ee9c74e0 TRAP: 0700   Tainted: G           O  (3.7.2)
      MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 28002424  XER: 00000001
      SOFTE: 1
      CFAR: c0000000007a39c8
      TASK = c0000003ee8d5c90[8406] 'rmmod' THREAD: c0000003ee9c4000 CPU: 31
      GPR00: c0000000003fc3e8 c0000003ee9c7760 c000000000c789f8 00000000000000ee
      GPR04: 0000000000000000 00000000000000ef 0000000000004000 0000000000010000
      GPR08: 00000000000000be c000000000b22088 c000000000c4c218 00000000007c0000
      GPR12: 0000000028002422 c00000000ff26c80 0000000000000000 000001001b0f1b40
      GPR16: 00000000100cb9d8 0000000010093088 c000000000cdf910 0000000000000001
      GPR20: 0000000000000000 c000000000dbfc00 0000000000000000 c000000000dbfb80
      GPR24: c0000003fafc9d80 0000000000000001 000000000001ff80 c0000003f38f7888
      GPR28: c000000000ddfc00 0000000000000400 c000000000bd7790 c000000000ddfb80
      NIP [c0000000003fc3ec] .dma_debug_device_change+0x22c/0x2b0
      LR [c0000000003fc3e8] .dma_debug_device_change+0x228/0x2b0
      Call Trace:
      [c0000003ee9c7760] [c0000000003fc3e8] .dma_debug_device_change+0x228/0x2b0 (unreliable)
      [c0000003ee9c7840] [c00000000079a098] .notifier_call_chain+0x78/0xf0
      [c0000003ee9c78e0] [c0000000000acc20] .__blocking_notifier_call_chain+0x70/0xb0
      [c0000003ee9c7990] [c0000000004a9580] .__device_release_driver+0x100/0x140
      [c0000003ee9c7a20] [c0000000004a9708] .driver_detach+0x148/0x150
      [c0000003ee9c7ac0] [c0000000004a8144] .bus_remove_driver+0xc4/0x150
      [c0000003ee9c7b60] [c0000000004aa58c] .driver_unregister+0x8c/0xe0
      [c0000003ee9c7bf0] [c0000000004090b4] .pci_unregister_driver+0x34/0xf0
      [c0000003ee9c7ca0] [d000000002231194] .qlge_exit+0x1c/0x34 [qlge]
      [c0000003ee9c7d20] [c0000000000e36d8] .SyS_delete_module+0x1e8/0x290
      [c0000003ee9c7e30] [c0000000000098d4] syscall_exit+0x0/0x94
      Instruction dump:
      7f26cb78 e818003a e87e81a0 e8f80028 e9180030 796b1f24 78001f24 7d6a5a14
      7d2a002a e94b0020 483a7595 60000000 <0fe00000> 2fb80000 40de0048 80120050
      ---[ end trace 4294f9abdb01031d ]---
      Mapped at:
       [<d000000002222f54>] .ql_update_lbq+0x384/0x580 [qlge]
       [<d000000002227bd0>] .ql_clean_inbound_rx_ring+0x300/0xc60 [qlge]
       [<d0000000022288cc>] .ql_napi_poll_msix+0x39c/0x5a0 [qlge]
       [<c0000000006b3c50>] .net_rx_action+0x170/0x300
       [<c000000000081840>] .__do_softirq+0x170/0x300
      Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
      Acked-by: Jitendra Kalsaria <Jitendra.kalsaria@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      xen/arm: rename xen_secondary_init and run it on every online cpu · 3cc8e40e
      Stefano Stabellini committed
      Rename xen_secondary_init to xen_percpu_init.
      Run xen_percpu_init on each online CPU, reusing the current on_each_cpu call.
      Merge xen_percpu_enable_events into xen_percpu_init.
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
    • S
      xen/arm: do not handle VCPUOP_register_vcpu_info failures · d7266d78
      Stefano Stabellini committed
      We expect VCPUOP_register_vcpu_info to succeed, so do not try to handle
      failures.
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Acked-by: Ian Campbell <ian.campbell@citrix.com>
    • S
      xen/arm: initialize pm functions later · 1aa3d8d9
      Stefano Stabellini committed
      If we are running in dom0, we have to wait for the arch-specific code to
      complete the initialization in order for us to successfully reset the
      power_off and pm_restart functions.
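
The ordering problem amounts to "the last writer wins". The Python sketch below is only a model: the step names `arch_init` and `xen_pm_init`, the `boot` helper, and the idea that the arch code assigns both handlers during its own initialization are assumptions made for illustration.

```python
from typing import Dict, List


def boot(order: List[str]) -> Dict[str, str]:
    """Run the named init steps in order and return the final handler owners."""
    handlers = {"power_off": "none", "restart": "none"}

    def arch_init() -> None:
        # Assumed behaviour: arch code installs its own handlers.
        handlers["power_off"] = "arch"
        handlers["restart"] = "arch"

    def xen_pm_init() -> None:
        # Xen wants its handlers to be the ones left in place.
        handlers["power_off"] = "xen"
        handlers["restart"] = "xen"

    steps = {"arch_init": arch_init, "xen_pm_init": xen_pm_init}
    for name in order:
        steps[name]()
    return handlers
```

Running `boot(["xen_pm_init", "arch_init"])` leaves the arch handlers in place, while `boot(["arch_init", "xen_pm_init"])` leaves the Xen ones, which is the reason for deferring the Xen step.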
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>