1. 03 July 2013, 1 commit
    • sync: don't block the flusher thread waiting on IO · 7747bd4b
      Committed by Dave Chinner
      When sync does its WB_SYNC_ALL writeback, it issues data IO and
      then immediately waits for IO completion. This is done in the
      context of the flusher thread, and hence completely ties up the
      flusher thread for the backing device until all the dirty inodes
      have been synced. On filesystems that are dirtying inodes constantly
      and quickly, this means the flusher thread can be tied up for
      minutes per sync call and hence badly affect system level write IO
      performance as the page cache cannot be cleaned quickly.
      
      We already have a wait loop for IO completion for sync(2), so cut
      this out of the flusher thread and delegate it to wait_sb_inodes().
      Hence we can do rapid IO submission, and then wait for it all to
      complete.
      
      Effect of sync on fsmark before the patch:
      
      FSUse%        Count         Size    Files/sec     App Overhead
      .....
           0       640000         4096      35154.6          1026984
           0       720000         4096      36740.3          1023844
           0       800000         4096      36184.6           916599
           0       880000         4096       1282.7          1054367
           0       960000         4096       3951.3           918773
           0      1040000         4096      40646.2           996448
           0      1120000         4096      43610.1           895647
           0      1200000         4096      40333.1           921048
      
      And a single sync pass took:
      
        real    0m52.407s
        user    0m0.000s
        sys     0m0.090s
      
      After the patch, there is no impact on fsmark results, and each
      individual sync(2) operation run concurrently with the same fsmark
      workload takes roughly 7s:
      
        real    0m6.930s
        user    0m0.000s
        sys     0m0.039s
      
      IOWs, sync is 7-8x faster on a busy filesystem and does not have an
      adverse impact on ongoing async data write operations.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7747bd4b
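      The submit-everything-then-wait idea can be illustrated from user space with
      sync_file_range(2). This is only an analogy for the pattern the patch applies
      inside the flusher thread, not the kernel change itself:

          /* Pattern illustration: kick off writeback for every file first,
           * then wait, instead of writing and waiting one file at a time. */
          #define _GNU_SOURCE
          #include <fcntl.h>

          static void sync_many(int *fds, int nr)
          {
                  int i;

                  /* Pass 1: submit writeback for all files without waiting. */
                  for (i = 0; i < nr; i++)
                          sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);

                  /* Pass 2: wait for all of the IO issued above to complete. */
                  for (i = 0; i < nr; i++)
                          sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WAIT_AFTER);
          }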
  2. 01 July 2013, 1 commit
    • jbd2: invalidate handle if jbd2_journal_restart() fails · 41a5b913
      Committed by Theodore Ts'o
      If jbd2_journal_restart() fails the handle will have been disconnected
      from the current transaction.  In this situation, the handle must not
      be used for any jbd2 function other than jbd2_journal_stop().
      Enforce this by treating a handle which has a NULL transaction
      pointer as an aborted handle, and issue a kernel warning if
      jbd2_journal_extend(), jbd2_journal_get_write_access(),
      jbd2_journal_dirty_metadata(), etc. is called with an invalid handle.
      
      This commit also fixes a bug where jbd2_journal_stop() would trip over
      a kernel jbd2 assertion check when trying to free an invalid handle.
      
      Also move the responsibility of setting current->journal_info to
      start_this_handle(), simplifying the three users of this function.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reported-by: Younger Liu <younger.liu@huawei.com>
      Cc: Jan Kara <jack@suse.cz>
      41a5b913
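      A minimal sketch of the rule being enforced; the struct and helper names
      below are made up for illustration and are not the actual jbd2 code:

          /* Illustrative only: a handle whose transaction pointer has been
           * cleared (e.g. after a failed restart) is treated exactly like an
           * aborted handle, and anything other than stopping it is refused
           * with a warning. */
          struct handle {
                  struct transaction *h_transaction;   /* NULL => invalidated */
                  int                 h_aborted;
          };

          static int handle_is_valid(const struct handle *h)
          {
                  return h->h_transaction != NULL && !h->h_aborted;
          }

          static int journal_op(struct handle *h)
          {
                  if (WARN_ON(!handle_is_valid(h)))
                          return -EROFS;        /* caller misused the handle */
                  /* ... normal journalling work ... */
                  return 0;
          }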
  3. 29 June 2013, 7 commits
  4. 27 June 2013, 1 commit
    • net: fix kernel deadlock with interface rename and netdev name retrieval. · 5dbe7c17
      Committed by Nicolas Schichan
      When the kernel (compiled with CONFIG_PREEMPT=n) is performing the
      rename of a network interface, it can end up waiting for a workqueue
      to complete. If userland is able to invoke a SIOCGIFNAME ioctl or a
      SO_BINDTODEVICE getsockopt in between, the kernel will deadlock due to
      the fact that read_seqcount_begin() will spin forever waiting for the
      writer process (the one doing the interface rename) to update the
      devnet_rename_seq sequence.
      
      This patch fixes the problem by adding a helper (netdev_get_name())
      and using it in the code handling the SIOCGIFNAME ioctl and
      SO_BINDTODEVICE getsockopt.
      
      The netdev_get_name() helper uses raw_seqcount_begin() to avoid
      spinning forever, waiting for devnet_rename_seq->sequence to become
      even. cond_resched() is used in the contended case, before retrying
      the access to give the writer process a chance to finish.
      
      The use of raw_seqcount_begin() will incur some unneeded work in the
      reader process in the contended case, but this is better than
      deadlocking the system.
      Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5dbe7c17
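      Roughly, the helper follows the retry pattern below (condensed from the
      description above; the real netdev_get_name() lives in net/core/dev.c):

          /* Condensed sketch of the lockless name read.  raw_seqcount_begin()
           * never spins on an odd (write-in-progress) count, so an in-flight
           * rename cannot block the reader; cond_resched() gives the renaming
           * task a chance to finish before the read is retried. */
          static int get_name(struct net *net, char *name, int ifindex)
          {
                  struct net_device *dev;
                  unsigned int seq;

          retry:
                  seq = raw_seqcount_begin(&devnet_rename_seq);
                  rcu_read_lock();
                  dev = dev_get_by_index_rcu(net, ifindex);
                  if (!dev) {
                          rcu_read_unlock();
                          return -ENODEV;
                  }
                  strcpy(name, dev->name);
                  rcu_read_unlock();

                  if (read_seqcount_retry(&devnet_rename_seq, seq)) {
                          cond_resched();       /* let the writer finish */
                          goto retry;
                  }
                  return 0;
          }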
  5. 26 June 2013, 1 commit
  6. 25 June 2013, 2 commits
  7. 20 June 2013, 3 commits
  8. 19 June 2013, 6 commits
    • FS-Cache: The retrieval remaining-pages counter needs to be atomic_t · 1bb4b7f9
      Committed by David Howells
      struct fscache_retrieval contains a count of the number of pages that still
      need some processing (n_pages).  This is decremented as the pages are
      processed.
      
      However, this needs to be atomic as fscache_retrieval_complete() (I think) just
      occasionally may be called from cachefiles_read_backing_file() and
      cachefiles_read_copier() simultaneously.
      
      This happens when an fscache_read_or_alloc_pages() request containing a lot of
      pages (say a couple of hundred) is being processed.  The read on each backing
      page is dispatched individually because we need to insert a monitor into the
      waitqueue to catch when the read completes.  However, under low-memory
      conditions, we might be forced to wait in the allocator - and this gives the
      I/O on the backing page a chance to complete first.
      
      When the I/O completes, fscache_enqueue_retrieval() chucks the retrieval onto
      the workqueue without waiting for the operation to finish the initial I/O
      dispatch (we want to release any pages we can as soon as we can), thus both can
      end up running simultaneously and potentially attempting to partially complete
      the retrieval simultaneously (ENOMEM may occur, backing pages may already be in
      the page cache).
      
      This was demonstrated by parallelling the non-atomic counter with an atomic
      counter and printing both of them when the assertion fails.  At this point, the
      atomic counter has reached zero, but the non-atomic counter has not.
      
      To fix this, make the counter an atomic_t.
      
      This results in the following bug appearing
      
      	FS-Cache: Assertion failed
      	3 == 5 is false
      	------------[ cut here ]------------
      	kernel BUG at fs/fscache/operation.c:421!
      
      or
      
      	FS-Cache: Assertion failed
      	3 == 5 is false
      	------------[ cut here ]------------
      	kernel BUG at fs/fscache/operation.c:414!
      
      With a backtrace like the following:
      
      RIP: 0010:[<ffffffffa0211b1d>] fscache_put_operation+0x1ad/0x240 [fscache]
      Call Trace:
       [<ffffffffa0213185>] fscache_retrieval_work+0x55/0x270 [fscache]
       [<ffffffffa0213130>] ? fscache_retrieval_work+0x0/0x270 [fscache]
       [<ffffffff81090b10>] worker_thread+0x170/0x2a0
       [<ffffffff81096d10>] ? autoremove_wake_function+0x0/0x40
       [<ffffffff810909a0>] ? worker_thread+0x0/0x2a0
       [<ffffffff81096966>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810968d0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-and-tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      1bb4b7f9
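      The underlying problem is an ordinary lost-update race on a non-atomic
      counter. A stand-alone C11 illustration (not the fscache code) of why a
      counter decremented from two contexts must be atomic:

          /* Two threads each "complete" half of 200 pages.  The plain counter
           * can lose decrements and never be seen reaching zero; the atomic
           * counter reaches zero exactly once.  Build with: cc -pthread ... */
          #include <pthread.h>
          #include <stdatomic.h>
          #include <stdio.h>

          static int plain_pages = 200;              /* racy */
          static atomic_int atomic_pages = 200;      /* safe */

          static void *complete_pages(void *arg)
          {
                  for (int i = 0; i < 100; i++) {
                          plain_pages--;                          /* not atomic */
                          if (atomic_fetch_sub(&atomic_pages, 1) == 1)
                                  puts("atomic counter hit zero");
                  }
                  return NULL;
          }

          int main(void)
          {
                  pthread_t a, b;

                  pthread_create(&a, NULL, complete_pages, NULL);
                  pthread_create(&b, NULL, complete_pages, NULL);
                  pthread_join(a, NULL);
                  pthread_join(b, NULL);
                  printf("plain=%d atomic=%d\n", plain_pages,
                         atomic_load(&atomic_pages));
                  return 0;
          }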
    • FS-Cache: Simplify cookie retention for fscache_objects, fixing oops · 1362729b
      Committed by David Howells
      Simplify the way fscache cache objects retain their cookie.  The way I
      implemented the cookie storage handling made synchronisation a pain (ie. the
      object state machine can't rely on the cookie actually still being there).
      
      Instead of the object being detached from the cookie and the cookie being
      freed in __fscache_relinquish_cookie(), we defer both operations:
      
       (*) The detachment of the object from the list in the cookie now takes place
           in fscache_drop_object() and is thus governed by the object state machine
           (fscache_detach_from_cookie() has been removed).
      
       (*) The release of the cookie is now in fscache_object_destroy() - which is
           called by the cache backend just before it frees the object.
      
      This means that the fscache_cookie struct is now available to the cache all the
      way through from ->alloc_object() to ->drop_object() and ->put_object() -
      meaning that it's no longer necessary to take object->lock to guarantee access.
      
      However, __fscache_relinquish_cookie() doesn't wait for the object to go all
      the way through to destruction before letting the netfs proceed.  That would
      massively slow down the netfs.  Since __fscache_relinquish_cookie() leaves the
      cookie around, it must therefore break all attachments to the netfs - which
      includes ->def, ->netfs_data and any outstanding page read/writes.
      
      To handle this, struct fscache_cookie now has an n_active counter:
      
       (1) This starts off initialised to 1.
      
       (2) Any time the cache needs to get at the netfs data, it calls
           fscache_use_cookie() to increment it - if it is not zero.  If it was zero,
           then access is not permitted.
      
       (3) When the cache has finished with the data, it calls fscache_unuse_cookie()
           to decrement it.  This does a wake-up on it if it reaches 0.
      
       (4) __fscache_relinquish_cookie() decrements n_active and then waits for it to
           reach 0.  The initialisation to 1 in step (1) ensures that we only get
           wake ups when we're trying to get rid of the cookie.
      
      This leaves __fscache_relinquish_cookie() a lot simpler.
      
      
      ***
      This fixes a problem in the current code whereby if fscache_invalidate() is
      followed sufficiently quickly by fscache_relinquish_cookie() then it is
      possible for __fscache_relinquish_cookie() to have detached the cookie from the
      object and cleared the pointer before a thread is dispatched to process the
      invalidation state in the object state machine.
      
      Since the pending write clearance was deferred to the invalidation state to
      make it asynchronous, we need to either wait in relinquishment for the stores
      tree to be cleared in the invalidation state or we need to handle the clearance
      in relinquishment.
      
      Further, if the relinquishment code does clear the tree, then the invalidation
      state needs to make the clearance contingent on still having the cookie to hand
      (since that's where the tree is rooted) and we have to prevent the cookie from
      disappearing for the duration.
      
      This can lead to an oops like the following:
      
      BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
      ...
      RIP: 0010:[<ffffffff8151023e>] _spin_lock+0xe/0x30
      ...
      CR2: 000000000000000c ...
      ...
      Process kslowd002 (...)
      ....
      Call Trace:
       [<ffffffffa01c3278>] fscache_invalidate_writes+0x38/0xd0 [fscache]
       [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
       [<ffffffff8105e759>] ? find_busiest_queue+0x69/0x150
       [<ffffffff8110ddd4>] ? slow_work_enqueue+0x104/0x180
       [<ffffffffa01c1303>] fscache_object_slow_work_execute+0x5e3/0x9d0 [fscache]
       [<ffffffff81096b67>] ? bit_waitqueue+0x17/0xd0
       [<ffffffff8110e233>] slow_work_execute+0x233/0x310
       [<ffffffff8110e515>] slow_work_thread+0x205/0x360
       [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffff8110e310>] ? slow_work_thread+0x0/0x360
       [<ffffffff81096936>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810968a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      
      The parameter to fscache_invalidate_writes() was object->cookie which is NULL.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      1362729b
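      A compressed sketch of the n_active protocol in steps (1)-(4); the names
      and the wait mechanism below are illustrative rather than the exact
      fscache implementation:

          /* Illustrative sketch of the use/unuse/relinquish protocol. */
          struct cookie {
                  atomic_t          n_active;    /* starts at 1, see step (1) */
                  wait_queue_head_t wq;
          };

          static bool cookie_use(struct cookie *c)
          {
                  /* Fails once relinquishment has dropped the count to zero. */
                  return atomic_inc_not_zero(&c->n_active);
          }

          static void cookie_unuse(struct cookie *c)
          {
                  if (atomic_dec_and_test(&c->n_active))
                          wake_up(&c->wq);
          }

          static void cookie_relinquish(struct cookie *c)
          {
                  /* Drop the initial reference from step (1), then wait for
                   * every in-flight user to call cookie_unuse(). */
                  cookie_unuse(c);
                  wait_event(c->wq, atomic_read(&c->n_active) == 0);
                  /* ... now safe to break the netfs attachments ... */
          }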
    • FS-Cache: Fix object state machine to have separate work and wait states · caaef690
      Committed by David Howells
      Fix object state machine to have separate work and wait states as that makes
      it easier to envision.
      
      There are now three kinds of state:
      
       (1) Work state.  This is an execution state.  No event processing is performed
           by a work state.  The function attached to a work state returns a pointer
           indicating the next state to which the OSM should transition.  Returning
           NO_TRANSIT repeats the current state, but goes back to the scheduler
           first.
      
       (2) Wait state.  This is an event processing state.  No execution is
           performed by a wait state.  Wait states are just tables of "if event X
           occurs, clear it and transition to state Y".  The dispatcher returns to
           the scheduler if none of the events in which the wait state has an
           interest are currently pending.
      
       (3) Out-of-band state.  This is a special work state.  Transitions to normal
           states can be overridden when an unexpected event occurs (eg. I/O error).
           Instead the dispatcher disables and clears the OOB event and transits to
           the specified work state.  This then acts as an ordinary work state,
           though object->state points to the overridden destination.  Returning
           NO_TRANSIT resumes the overridden transition.
      
      In addition, the states have names in their definitions, so there's no need for
      tables of state names.  Further, the EV_REQUEUE event is no longer necessary as
      that is automatic for work states.
      
      Since the states are now separate structs rather than values in an enum, it's
      not possible to use comparisons other than (non-)equality between them, so use
      some object->flags to indicate what phase an object is in.
      
      The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one
      (EV_KILL).  An object flag now carries the information about retirement.
      
      Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged
      into a KILL_OBJECT state and additional states have been added for handling
      waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS).
      
      A state has also been added for synchronising with parent object initialisation
      (WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY).
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      caaef690
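      The split between the three kinds of state can be pictured with a small
      table-driven sketch. This is a toy model of the structure described above,
      not the fscache_object code:

          /* Toy model: a work state is a function returning the next state
           * (NULL meaning NO_TRANSIT, i.e. requeue and stay put); a wait
           * state is a table of "if event X is pending, clear it and go to
           * state Y" entries. */
          struct state;

          struct transition {
                  unsigned long       events;   /* event mask to test/clear */
                  const struct state *to;       /* destination state        */
          };

          struct state {
                  const char           *name;                     /* named in the definition */
                  const struct state *(*work)(void *obj);         /* NULL for wait states    */
                  struct transition     waits[4];                 /* used by wait states     */
          };

          static const struct state *dispatch(const struct state *s, void *obj,
                                              unsigned long *events)
          {
                  const struct transition *t;

                  for (;;) {
                          if (s->work) {                          /* work state */
                                  const struct state *next = s->work(obj);

                                  if (!next)                      /* NO_TRANSIT */
                                          return s;               /* requeue later */
                                  s = next;
                                  continue;
                          }

                          /* wait state: scan the event table */
                          for (t = s->waits; t->to; t++)
                                  if (*events & t->events)
                                          break;
                          if (!t->to)
                                  return s;                       /* nothing pending: sleep */
                          *events &= ~t->events;                  /* clear the event */
                          s = t->to;
                  }
          }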
    • FS-Cache: Wrap checks on object state · 493f7bc1
      Committed by David Howells
      Wrap checks on object state (mostly outside of fs/fscache/object.c) with
      inline functions so that the mechanism can be replaced.
      
      Some of the state checks within object.c are left as-is as they will be
      replaced.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      493f7bc1
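      For example, a call site tests an inline predicate instead of comparing
      object->state directly, so the state representation can later change
      without touching callers (the struct and flag below are hypothetical):

          /* Hypothetical wrapper in the spirit of the change: the check is a
           * function of the object, not of a particular enum value. */
          struct my_object {
                  unsigned long flags;
          #define MY_OBJECT_AVAILABLE 0
          };

          static inline bool object_is_available(const struct my_object *obj)
          {
                  return test_bit(MY_OBJECT_AVAILABLE, &obj->flags);
          }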
    • FS-Cache: Uninline fscache_object_init() · 610be24e
      Committed by David Howells
      Uninline fscache_object_init() so as not to expose some of the FS-Cache
      internals to the cache backend.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      610be24e
    • tracing/context-tracking: Add preempt_schedule_context() for tracing · 29bb9e5a
      Committed by Steven Rostedt
      Dave Jones hit the following bug report:
      
       ===============================
       [ INFO: suspicious RCU usage. ]
       3.10.0-rc2+ #1 Not tainted
       -------------------------------
       include/linux/rcupdate.h:771 rcu_read_lock() used illegally while idle!
       other info that might help us debug this:
       RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0
       RCU used illegally from extended quiescent state!
       2 locks held by cc1/63645:
        #0:  (&rq->lock){-.-.-.}, at: [<ffffffff816b39fd>] __schedule+0xed/0x9b0
        #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff8109d645>] cpuacct_charge+0x5/0x1f0
      
       CPU: 1 PID: 63645 Comm: cc1 Not tainted 3.10.0-rc2+ #1 [loadavg: 40.57 27.55 13.39 25/277 64369]
       Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
        0000000000000000 ffff88010f78fcf8 ffffffff816ae383 ffff88010f78fd28
        ffffffff810b698d ffff88011c092548 000000000023d073 ffff88011c092500
        0000000000000001 ffff88010f78fd60 ffffffff8109d7c5 ffffffff8109d645
       Call Trace:
        [<ffffffff816ae383>] dump_stack+0x19/0x1b
        [<ffffffff810b698d>] lockdep_rcu_suspicious+0xfd/0x130
        [<ffffffff8109d7c5>] cpuacct_charge+0x185/0x1f0
        [<ffffffff8109d645>] ? cpuacct_charge+0x5/0x1f0
        [<ffffffff8108dffc>] update_curr+0xec/0x240
        [<ffffffff8108f528>] put_prev_task_fair+0x228/0x480
        [<ffffffff816b3a71>] __schedule+0x161/0x9b0
        [<ffffffff816b4721>] preempt_schedule+0x51/0x80
        [<ffffffff816b4800>] ? __cond_resched_softirq+0x60/0x60
        [<ffffffff816b6824>] ? retint_careful+0x12/0x2e
        [<ffffffff810ff3cc>] ftrace_ops_control_func+0x1dc/0x210
        [<ffffffff816be280>] ftrace_call+0x5/0x2f
        [<ffffffff816b681d>] ? retint_careful+0xb/0x2e
        [<ffffffff816b4805>] ? schedule_user+0x5/0x70
        [<ffffffff816b4805>] ? schedule_user+0x5/0x70
        [<ffffffff816b6824>] ? retint_careful+0x12/0x2e
       ------------[ cut here ]------------
      
      What happened was that the function tracer traced the schedule_user() code
      that tells RCU that the system is coming back from userspace, and to
      add the CPU back to the RCU monitoring.
      
      Because the function tracer does preempt_disable/enable_notrace() calls,
      the preempt_enable_notrace() checks the NEED_RESCHED flag. If it is set,
      then preempt_schedule() is called. But this is called before the user_exit()
      function can inform the kernel that the CPU is no longer in user mode and
      needs to be accounted for by RCU.
      
      The fix is to create a new preempt_schedule_context() that checks if
      the kernel is still in user mode and if so to switch it to kernel mode
      before calling schedule. It also switches back to user mode coming back
      from schedule if need be.
      
      The only user of this currently is the preempt_enable_notrace(), which is
      only used by the tracing subsystem.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1369423420.6828.226.camel@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      29bb9e5a
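      The new helper is approximately the following (paraphrased from the
      description above; the real preempt_schedule_context() sits next to the
      context-tracking code):

          /* Leave any "user" context-tracking state before scheduling so RCU
           * sees the CPU as active in the kernel, then restore the previous
           * state afterwards. */
          asmlinkage void __sched notrace preempt_schedule_context(void)
          {
                  enum ctx_state prev_ctx;

                  if (likely(!preemptible()))
                          return;

                  prev_ctx = exception_enter();   /* switch to kernel context */
                  preempt_schedule();
                  exception_exit(prev_ctx);       /* back to user mode if need be */
          }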
  9. 18 June 2013, 9 commits
  10. 15 June 2013, 1 commit
    • smp.h: Use local_irq_{save,restore}() in !SMP version of on_each_cpu(). · f21afc25
      Committed by David Daney
      Thanks to commit f91eb62f ("init: scream bloody murder if interrupts
      are enabled too early"), "bloody murder" is now being screamed.
      
      With a MIPS OCTEON config, we use on_each_cpu() in our
      irq_chip.irq_bus_sync_unlock() function.  This gets called early as a
      result of the time_init() call.  Because the !SMP version of
      on_each_cpu() unconditionally enables irqs, we get:
      
          WARNING: at init/main.c:560 start_kernel+0x250/0x410()
          Interrupts were enabled early
          CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0-rc5-Cavium-Octeon+ #801
          Call Trace:
            show_stack+0x68/0x80
            warn_slowpath_common+0x78/0xb0
            warn_slowpath_fmt+0x38/0x48
            start_kernel+0x250/0x410
      
      Suggested fix: Do what we already do in the SMP version of
      on_each_cpu(), and use local_irq_save/local_irq_restore.  Because we
      need a flags variable, make it a static inline to avoid name space
      issues.
      
      [ Change from v1: Convert on_each_cpu to a static inline function, add
        #include <linux/irqflags.h> to avoid build breakage on some files.
      
        on_each_cpu_mask() and on_each_cpu_cond() suffer the same problem as
        on_each_cpu(), but they are not causing !SMP bugs for me, so I will
        defer changing them to a less urgent patch. ]
      Signed-off-by: David Daney <david.daney@cavium.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f21afc25
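      The resulting !SMP helper is essentially the following sketch (drawn from
      the description above; the real definition is in include/linux/smp.h):

          /* !SMP on_each_cpu() after the change: interrupts are saved and
           * restored around the call instead of being unconditionally
           * re-enabled, so early boot callers no longer turn irqs on. */
          #include <linux/irqflags.h>

          static inline int on_each_cpu(void (*func)(void *info), void *info,
                                        int wait)
          {
                  unsigned long flags;

                  local_irq_save(flags);
                  func(info);
                  local_irq_restore(flags);
                  return 0;
          }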
  11. 13 June 2013, 8 commits
    • jbd2: use a single printk for jbd_debug() · 169f1a2a
      Committed by Paul Gortmaker
      Since jbd_debug() is implemented with two separate printk()
      calls, it can lead to corrupted and misleading debug output like
      the following (see lines marked with "*"):
      
      [  290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
      [  290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
      [  290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
      [* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
      [* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
      [* 290.339382] JBD2: starting commit of transaction 42104
      [  290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
      [  290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079
      
      i.e. the debug output from log_wait_commit and journal_commit_transaction
      has become interleaved.  The output should have been:
      
      (fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
      (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104
      
      It is expected that this is not easy to replicate -- I was only able
      to cause it on preempt-rt kernels, and even then only under heavy
      I/O load.
      Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Suggested-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      169f1a2a
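      The fix amounts to formatting the whole line and emitting it with a single
      printk(), roughly along these lines (the real helper is __jbd2_debug(),
      using the kernel's %pV/struct va_format mechanism):

          /* Emit file, line, function and message in one printk() so that
           * concurrent callers can no longer interleave their halves. */
          void __jbd2_debug(int level, const char *file, const char *func,
                            unsigned int line, const char *fmt, ...)
          {
                  struct va_format vaf;
                  va_list args;

                  if (level > jbd2_journal_enable_debug)
                          return;

                  va_start(args, fmt);
                  vaf.fmt = fmt;
                  vaf.va = &args;
                  printk(KERN_DEBUG "(%s, %u): %s: %pV\n", file, line, func, &vaf);
                  va_end(args);
          }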
    • jbd/jbd2: relocate bit_spinlock header to jbd_common · c9b3a8cc
      Committed by Paul Gortmaker
      The bit_spinlock functions are only used for the jbd_lock_bh_state
      functions (and friends) in jbd_common.h and are not directly used
      by either the jbd.h or the jbd2.h content.
      
      The jbd_common file is new as of commit 44606672 ("jbd/jbd2: factor
      out common functions from the jbd[2] header files") but common
      (and isolated) headers were not considered for factoring at that time.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      c9b3a8cc
    • ext4: fix data integrity for ext4_sync_fs · 06a407f1
      Committed by Dmitry Monakhov
      Inode data or non-journaled quota may be written without the journal, so we
      _must_ send a barrier at the end of ext4_sync_fs. But it can be
      skipped if journal commit will do it for us.
      
      Also fix data integrity for nojournal mode.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      06a407f1
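      Conceptually the decision reduces to the sketch below; the helper name
      journal_commit_will_flush() is invented here purely for illustration, and
      the real logic lives in ext4_sync_fs():

          /* Heavily simplified sketch of "who issues the flush": if a journal
           * commit will issue the barrier for us, skip it; otherwise send it
           * explicitly so data written outside the journal (or in nojournal
           * mode) is made durable. */
          static int ext4_sync_fs_sketch(struct super_block *sb, int wait)
          {
                  bool needs_barrier = test_opt(sb, BARRIER);

                  if (journal_commit_will_flush(sb))   /* illustrative helper */
                          needs_barrier = false;

                  if (wait && needs_barrier)
                          return blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
                  return 0;
          }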
    • jbd2: optimize jbd2_journal_force_commit · 9ff86446
      Committed by Dmitry Monakhov
      The current implementation of jbd2_journal_force_commit() is suboptimal
      because it results in empty and useless commits. But callers just want to
      force and wait for any unfinished commits. We already have
      jbd2_journal_force_commit_nested(), which does exactly what we want, except
      that here we are guaranteed that we do not hold a journal transaction open.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      9ff86446
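      The desired behaviour ("force and wait for any unfinished commit, without
      creating an empty one") boils down to something like this sketch, built
      from existing jbd2 primitives rather than copied from the patch:

          /* Illustrative sketch: only kick a commit if a transaction actually
           * exists, then wait for that specific tid, so forcing a commit can
           * never create an empty, useless transaction. */
          static int force_commit_sketch(journal_t *journal)
          {
                  tid_t tid;

                  if (!jbd2_journal_start_commit(journal, &tid))
                          return 0;                    /* nothing to commit */

                  return jbd2_log_wait_commit(journal, tid);
          }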
    • include/linux/math64.h: add div64_ul() · c2853c8d
      Committed by Alex Shi
      There is div64_long() to handle the s64/long division, but no macro for
      u64/ul division.  It is necessary in some scenarios, so add this
      function.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2853c8d
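      The addition mirrors the existing div64_long() and is essentially one
      definition per word size (sketched from the description; see
      include/linux/math64.h for the real macro):

          /* Sketch: native division on 64-bit, div_u64() fallback on 32-bit,
           * exactly as div64_long() already does for the signed case. */
          #if BITS_PER_LONG == 64
          #define div64_ul(x, y)   ((x) / (y))
          #else
          #define div64_ul(x, y)   div_u64((x), (y))
          #endif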
    • mm: migration: add migrate_entry_wait_huge() · 30dad309
      Committed by Naoya Horiguchi
      When we have a page fault for the address which is backed by a hugepage
      under migration, the kernel can't wait correctly and busy-loops on the
      hugepage fault until the migration finishes.  As a result, users who try
      to kick hugepage migration (via soft offlining, for example) occasionally
      experience long delay or soft lockup.
      
      This is because pte_offset_map_lock() can't get a correct migration entry
      or a correct page table lock for hugepage.  This patch introduces
      migration_entry_wait_huge() to solve this.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>	[2.6.35+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      30dad309
    • kmsg: honor dmesg_restrict sysctl on /dev/kmsg · 637241a9
      Committed by Kees Cook
      The dmesg_restrict sysctl currently covers the syslog method for accessing
      dmesg; however, /dev/kmsg isn't covered by the same protections.  Most
      people haven't noticed because older versions of util-linux dmesg(1)
      default to using the syslog method for access.  Newer versions of
      util-linux dmesg(1) default to reading directly from /dev/kmsg.
      
      To fix /dev/kmsg, let's compare the existing interfaces and what they
      allow:
      
       - /proc/kmsg allows:
        - open (SYSLOG_ACTION_OPEN) if CAP_SYSLOG since it uses a destructive
          single-reader interface (SYSLOG_ACTION_READ).
        - everything, after an open.
      
       - syslog syscall allows:
        - anything, if CAP_SYSLOG.
        - SYSLOG_ACTION_READ_ALL and SYSLOG_ACTION_SIZE_BUFFER, if
          dmesg_restrict==0.
        - nothing else (EPERM).
      
      The use-cases were:
       - dmesg(1) needs to do non-destructive SYSLOG_ACTION_READ_ALLs.
       - sysklog(1) needs to open /proc/kmsg, drop privs, and still issue the
         destructive SYSLOG_ACTION_READs.
      
      AIUI, dmesg(1) is moving to /dev/kmsg, and systemd-journald doesn't
      clear the ring buffer.
      
      Based on the comments in devkmsg_llseek, it sounds like actions besides
      reading aren't going to be supported by /dev/kmsg (i.e.
      SYSLOG_ACTION_CLEAR), so we have a strict subset of the non-destructive
      syslog syscall actions.
      
      To this end, move the check as Josh had done, but also rename the
      constants to reflect their new uses (SYSLOG_FROM_CALL becomes
      SYSLOG_FROM_READER, and SYSLOG_FROM_FILE becomes SYSLOG_FROM_PROC).
      SYSLOG_FROM_READER allows non-destructive actions, and SYSLOG_FROM_PROC
      allows destructive actions after a capabilities-constrained
      SYSLOG_ACTION_OPEN check.
      
       - /dev/kmsg allows:
        - open if CAP_SYSLOG or dmesg_restrict==0
        - reading/polling, after open
      
      Addresses https://bugzilla.redhat.com/show_bug.cgi?id=903192
      
      [akpm@linux-foundation.org: use pr_warn_once()]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Christian Kujau <lists@nerdbynature.de>
      Tested-by: Josh Boyer <jwboyer@redhat.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      637241a9
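      The enforcement point is the open of /dev/kmsg; a condensed sketch of the
      check described above (the real code is devkmsg_open() in kernel/printk.c):

          /* Opening /dev/kmsg is now subject to the same non-destructive-read
           * permission check as the syslog syscall, so dmesg_restrict==1
           * requires CAP_SYSLOG here too. */
          static int devkmsg_open(struct inode *inode, struct file *file)
          {
                  int err;

                  err = check_syslog_permissions(SYSLOG_ACTION_READ_ALL,
                                                 SYSLOG_FROM_READER);
                  if (err)
                          return err;

                  /* ... allocate the per-reader state and continue as before ... */
                  return 0;
          }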
    • CPU hotplug: provide a generic helper to disable/enable CPU hotplug · 16e53dbf
      Committed by Srivatsa S. Bhat
      There are instances in the kernel where we would like to disable CPU
      hotplug (from sysfs) during some important operation.  Today the freezer
      code depends on this and the code to do it was kinda tailor-made for
      that.
      
      Restructure the code and make it generic enough to be useful for other
      usecases too.
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Robin Holt <holt@sgi.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Shawn Guo <shawn.guo@linaro.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16e53dbf
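      The generic helpers amount to toggling the existing cpu_hotplug_disabled
      flag under the hotplug-maps lock; a sketch based on the description above
      (the real cpu_hotplug_disable()/cpu_hotplug_enable() live in kernel/cpu.c):

          /* Any path (not just the freezer) can now fence off sysfs-initiated
           * CPU hotplug for the duration of a critical operation. */
          void cpu_hotplug_disable(void)
          {
                  cpu_maps_update_begin();
                  cpu_hotplug_disabled = 1;     /* makes _cpu_up()/_cpu_down() bail */
                  cpu_maps_update_done();
          }

          void cpu_hotplug_enable(void)
          {
                  cpu_maps_update_begin();
                  cpu_hotplug_disabled = 0;
                  cpu_maps_update_done();
          }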