1. 14 Dec 2011, 21 commits
    • block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq · a612fddf
      Committed by Tejun Heo
      Most of icq management is about to be moved out of cfq into blk-ioc.
      This patch prepares for it.
      
      * Move cfqd->icq_list to request_queue->icq_list
      
      * Make request explicitly point to icq instead of through elevator
        private data.  ->elevator_private[3] is replaced with sub struct elv
        which contains icq pointer and priv[2].  cfq is updated accordingly.
      
      * Meaningless clearing of ->elevator_private[0] removed from
        elv_set_request().  At that point in code, the field was guaranteed
        to be %NULL anyway.
      
      This patch doesn't introduce any functional change.
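      
      A minimal sketch of the new request layout implied by the above
      (surrounding members elided; exact placement in the real struct may
      differ):
      
        /* ->elevator_private[3] becomes a sub struct elv */
        struct request {
                /* ... */
                struct {
                        struct io_cq    *icq;           /* associated icq */
                        void            *priv[2];       /* elevator-private data */
                } elv;
                /* ... */
        };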
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: reorganize cfq_io_context into generic and cfq specific parts · c5869807
      Committed by Tejun Heo
      Currently, io_context and cfq logic are mixed without a clear
      boundary.  Most of io_context is independent from cfq, but the
      cfq_io_context handling logic is dispersed between generic ioc code
      and cfq.
      
      cfq_io_context represents association between an io_context and a
      request_queue, which is a concept useful outside of cfq, but it also
      contains fields which are useful only to cfq.
      
      This patch takes out the generic part and puts it into io_cq (io
      context-queue) and the rest into cfq_io_cq (the cic moniker remains
      the same), which contains io_cq.  The following changes are made
      together.
      
      * cfq_ttime and cfq_io_cq now live in cfq-iosched.c.
      
      * All related fields, functions and constants are renamed accordingly.
      
      * ioc->ioc_data is now "struct io_cq *" instead of "void *" and
        renamed to icq_hint.
      
      This prepares for io_context API cleanup.  Documentation is currently
      sparse.  It will be added later.
      
      Changes in this patch are mechanical and don't cause functional
      change.
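      
      A hedged sketch of the resulting split (field lists abbreviated;
      the embedding is the point, since it lets cfq convert between the
      two with container_of()):
      
        /* generic: one io_cq per (io_context, request_queue) association */
        struct io_cq {
                struct request_queue    *q;
                struct io_context       *ioc;
                /* ... linkage into ioc and q ... */
        };
        
        /* cfq-specific, private to cfq-iosched.c; "cic" moniker kept */
        struct cfq_io_cq {
                struct io_cq            icq;    /* embedded generic part */
                /* ... cfq_ttime and other cfq-only fields ... */
        };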
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: remove elevator_queue->ops · 22f746e2
      Committed by Tejun Heo
      elevator_queue->ops points to the same ops struct that
      ->elevator_type.ops points to.  The only effect of caching it in
      elevator_queue is shorter notation - it doesn't save any indirect
      dereference.
      
      Relocate elevator_type->list, which is used only during module
      init/exit, to the end of the structure, rename
      elevator_queue->elevator_type to ->type, and replace
      elevator_queue->ops with elevator_queue->type.ops.
      
      This doesn't introduce any functional difference.
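      
      In code terms the change is just one level of notation (illustrative
      call site; elevator_merge_fn is a real elevator_ops member, the
      surrounding code is made up):
      
        /* before: cached copy */
        e->ops->elevator_merge_fn(q, &rq, bio);
        
        /* after: go through ->type - same number of dereferences */
        e->type->ops.elevator_merge_fn(q, &rq, bio);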
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: reorder elevator switch sequence · f8fc877d
      Committed by Tejun Heo
      The elevator switch sequence first attached the new elevator, then
      tried registering it (sysfs), and, if that failed, attached the old
      elevator back.  However, sysfs registration doesn't require the
      elevator to be attached, so there is no reason for the "detach,
      attach new, register, maybe re-attach old" sequence.  It can simply
      do "register, detach, attach".
      
      * elevator_init_queue() is updated to set ->elevator_data directly and
        return 0 / -errno.  This allows elevator_exit() on an unattached
        elevator.
      
      * __elv_unregister_queue() which was necessary to unregister
        unattached q is removed in favor of __elv_register_queue() which can
        register unattached q.
      
      * elevator_attach() becomes a single assignment and obscures more than
        it helps.  Dropped.
      
      This will help cleaning up io_context handling across elevator switch.
      
      This patch doesn't introduce visible behavior change.
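      
      A rough sketch of the reordered sequence (not the verbatim kernel
      code; error handling abbreviated):
      
        /* "register, detach, attach" */
        err = elevator_init_queue(q, new_e);    /* sets ->elevator_data, 0/-errno */
        if (err)
                goto fail;
        
        err = __elv_register_queue(q, new_e);   /* sysfs; works unattached */
        if (err) {
                elevator_exit(new_e);           /* legal on an unattached elevator */
                goto fail;
        }
        
        old = q->elevator;
        elv_unregister_queue(q);                /* detach the old one */
        q->elevator = new_e;                    /* attach: a single assignment */
        elevator_exit(old);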
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: replace current_io_context() with create_io_context() · f2dbd76a
      Committed by Tejun Heo
      When called under queue_lock, current_io_context() triggers lockdep
      warning if it hits allocation path.  This is because io_context
      installation is protected by task_lock which is not IRQ safe, so it
      triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
      deadlock warning.
      
      Given the restriction, accessor + creator rolled into one doesn't work
      too well.  Drop current_io_context() and let the users access
      task->io_context directly inside queue_lock combined with explicit
      creation using create_io_context().
      
      Future ioc updates will further consolidate ioc access and the create
      interface will be unexported.
      
      While at it, relocate ioc internal interface declarations in blk.h and
      add section comments before and after.
      
      This patch does not introduce functional change.
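      
      The resulting usage pattern looks roughly like this (hedged sketch;
      create_io_context() may need to drop queue_lock to allocate):
      
        spin_lock_irq(q->queue_lock);
        ioc = current->io_context;
        if (unlikely(!ioc)) {
                spin_unlock_irq(q->queue_lock);
                create_io_context(current, gfp_mask, q->node);  /* may sleep */
                spin_lock_irq(q->queue_lock);
                ioc = current->io_context;      /* may still be NULL on OOM */
        }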
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: kill cic->key · 1238033c
      Committed by Tejun Heo
      Now that the lazy paths are removed, cfqd_dead_key() is meaningless
      and cic->q can be used wherever cic->key is used.  Kill cic->key.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: kill ioc_gone · b50b636b
      Committed by Tejun Heo
      Now that cic's are immediately unlinked under both locks, there's no
      need to count and drain cic's before module unload.  RCU callback
      completion is waited for with rcu_barrier().
      
      While at it, remove residual RCU operations on cic_list.
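      
      Module unload then reduces to flushing pending RCU callbacks before
      the slab caches disappear (sketch; surrounding unload steps elided):
      
        static void __exit cfq_exit(void)
        {
                elv_unregister(&iosched_cfq);
                /* cic's are unlinked eagerly now; no ioc_gone draining,
                 * just wait for in-flight RCU frees */
                rcu_barrier();
                cfq_slab_kill();
        }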
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: remove delayed unlink · b9a19208
      Committed by Tejun Heo
      Now that all cic's are immediately unlinked from both ioc and queue,
      lazy dropping from the lookup path and trimming on elevator
      unregister are unnecessary.  Kill them and remove the now-unused
      elevator_ops->trim().
      
      This also leaves call_for_each_cic() without any user.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: unlink cfq_io_context's immediately · b2efa052
      Committed by Tejun Heo
      A cic is an association between an io_context and a request_queue.
      A cic is linked from both ioc and q and should be destroyed when
      either one goes away.  As ioc and q both have their own locks,
      locking becomes a bit complex - both orders work for removal from
      one side but not from the other.
      
      Currently, cfq tries to circumvent this locking order issue with RCU.
      ioc->lock nests inside queue_lock but the radix tree and cic's are
      also protected by RCU allowing either side to walk their lists without
      grabbing lock.
      
      This rather unconventional use of RCU quickly devolves into
      extremely fragile convolution.  e.g. the following oops is from cfqd
      going away too soon after ioc and q exits raced.
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
       [   88.503444]
       Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
       RIP: 0010:[<ffffffff81397628>]  [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
       ...
       Call Trace:
        [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
        [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
        [<ffffffff81389130>] exit_io_context+0x100/0x140
        [<ffffffff81098a29>] do_exit+0x579/0x850
        [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
        [<ffffffff81098de7>] sys_exit_group+0x17/0x20
        [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b
      
      The only real hot path here is cic lookup during request
      initialization and avoiding extra locking requires very confined use
      of RCU.  This patch makes cic removal from both ioc and request_queue
      perform double-locking and unlink immediately.
      
      * From q side, the change is almost trivial as ioc->lock nests inside
        queue_lock.  It just needs to grab each ioc->lock as it walks
        cic_list and unlink it.
      
      * From the ioc side, it's a bit more difficult because of the
        inverted lock order.  ioc needs its lock to walk its cic_list but
        can't grab the matching queue_lock, so it has to perform an
        unlock-relock dance.
      
        Unlinking is now done wholly from put_io_context() and the fast
        path is optimized by using the queue_lock the caller already
        holds, which is by far the most common case.  If the ioc accessed
        multiple devices, it tries with trylock.  In the unlikely case of
        fast path failure, it falls back to the full double-locking dance
        from a workqueue (see the pseudocode below).
      
      Double-locking isn't the prettiest thing in the world but it's *far*
      simpler and more understandable than RCU trick without adding any
      meaningful overhead.
      
      This still leaves a lot of now unnecessary RCU logics.  Future patches
      will trim them.
      
      -v2: Vivek pointed out that cic->q was being dereferenced after
           cic->release() was called.  Updated to use local variable @this_q
           instead.
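      
      Pseudocode of the unlink loop described above (names follow the
      changelog - @this_q, ->release(); the workqueue plumbing is an
      assumption here, not the committed code):
      
        spin_lock_irqsave(&ioc->lock, flags);
        while (!hlist_empty(&ioc->cic_list)) {
                struct cfq_io_context *cic = hlist_entry(ioc->cic_list.first,
                                struct cfq_io_context, cic_list);
                struct request_queue *this_q = cic->q;  /* local copy, see -v2 */
        
                if (this_q != locked_q && !spin_trylock(this_q->queue_lock)) {
                        /* unlikely: fast path failed, punt the full
                         * double-locking dance to the workqueue */
                        schedule_work(&ioc->release_work);
                        break;
                }
                cic->exit(cic);
                cic->release(cic);
                if (this_q != locked_q)
                        spin_unlock(this_q->queue_lock);
        }
        spin_unlock_irqrestore(&ioc->lock, flags);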
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: fix cic lookup locking · f1a4f4d3
      Committed by Tejun Heo
      * cfq_cic_lookup() may be called without queue_lock and multiple tasks
        can execute it simultaneously for the same shared ioc.  Nothing
        prevents them from racing each other and trying to drop the same
        dead cic entry multiple times.
      
      * smp_wmb() in cfq_exit_cic() doesn't really do anything and nothing
        prevents cfq_cic_lookup() from seeing a stale cic->key.  This
        usually doesn't blow up because by the time the cic is exited, all
        requests have been drained and new requests are terminated before
        going through the elevator.  However, it can still be triggered by
        the plug merge path, which doesn't grab queue_lock and thus can't
        check the DEAD state reliably.
      
      This patch updates lookup locking such that:
      
      * Lookup is always performed under queue_lock.  This doesn't add any
        more locking.  The only issue is cfq_allow_merge() which can be
        called from plug merge path without holding any lock.  For now, this
        is worked around by using cic of the request to merge into, which is
        guaranteed to have the same ioc.  For longer term, I think it would
        be best to separate out plug merge method from regular one.
      
      * Spurious ioc->lock locking around cic lookup hint assignment
        dropped.
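      
      For the cfq_allow_merge() workaround the idea is roughly this
      (hedged sketch, not the verbatim change):
      
        /* plug merge path holds no queue_lock, so don't look up our own
         * cic; the request being merged into shares the same ioc */
        static int cfq_allow_merge(struct request_queue *q, struct request *rq,
                                   struct bio *bio)
        {
                struct cfq_io_context *cic = RQ_CIC(rq); /* merge target's cic */
        
                /* ... compare the cfqq the bio would go to against
                 * RQ_CFQQ(rq), as before ... */
        }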
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: fix race condition in cic creation path and tighten locking · 216284c3
      Committed by Tejun Heo
      cfq_get_io_context() would fail if multiple tasks raced to insert
      cic's for the same association.  This patch restructures
      cfq_get_io_context() such that the slow path insertion race is
      handled properly (see the sketch below).
      
      Note that the restructuring also makes cfq_get_io_context() be
      called under queue_lock and perform both ioc and cfqd insertions
      while holding both the ioc and queue locks.  This is part of
      ongoing locking tightening and will be used to simplify the
      synchronization rules.
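      
      The slow path follows the usual "allocate, try to insert, defer to
      the winner" shape (hedged sketch):
      
        spin_lock_irq(q->queue_lock);
        spin_lock(&ioc->lock);
        
        ret = radix_tree_insert(&ioc->radix_root, q->id, cic);
        if (likely(!ret))
                hlist_add_head(&cic->cic_list, &ioc->cic_list);
        else if (ret == -EEXIST)
                ret = 0;        /* lost the race: use the other task's cic */
        
        spin_unlock(&ioc->lock);
        spin_unlock_irq(q->queue_lock);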
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move ioc ioprio/cgroup changed handling to cic · dc86900e
      Committed by Tejun Heo
      An ioprio/cgroup change was handled by marking the changed state in
      the ioc and, on the following access to the ioc, performing an
      RCU-protected iteration through all cic's, grabbing the matching
      queue_lock.
      
      This patch moves the changed state to each cic.  When the ioprio or
      cgroup changes, the respective bit is set on all cic's of the ioc,
      and when each of those cic's (not the ioc) is accessed, the change
      is applied for that specific ioc-queue pair (see the sketch below).
      
      This also fixes the following two race conditions between setting and
      clearing of changed states.
      
      * Missing barrier between assign/load of ioprio and ioprio_changed
        allowed applying old ioprio.
      
      * Change requests could happen between application of change and
        clearing of changed variables.
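      
      Conceptually (hedged sketch; the CIC_*_CHANGED bit names follow the
      description rather than guaranteed identifiers):
      
        /* setter: mark every cic hanging off the ioc */
        struct cfq_io_context *cic;
        struct hlist_node *n;
        unsigned long flags;
        
        spin_lock_irqsave(&ioc->lock, flags);
        ioc->ioprio = ioprio;
        hlist_for_each_entry(cic, n, &ioc->cic_list, cic_list)
                set_bit(CIC_IOPRIO_CHANGED, &cic->changed);
        spin_unlock_irqrestore(&ioc->lock, flags);
        
        /* reader: change applied per ioc-queue pair when that cic is used */
        if (unlikely(cic->changed)) {
                if (test_and_clear_bit(CIC_IOPRIO_CHANGED, &cic->changed))
                        changed_ioprio(cic);
                /* likewise for CIC_CGROUP_CHANGED */
        }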
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: misc updates to cfq_io_context · 283287a5
      Committed by Tejun Heo
      Make the following changes to prepare for ioc/cic management cleanup.
      
      * Add cic->q so that ioc can determine the associated queue without
        querying cfq.  This will eventually replace ->key.
      
      * Factor out cfq_release_cic() from cic_free_func().  This function
        assumes that the caller handled locking.
      
      * Rename __cfq_exit_single_io_context() to cfq_exit_cic() and make it
        take only @cic.
      
      * Restructure cfq_cic_link() for future updates.
      
      This patch doesn't introduce any functional changes.
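      
      The new field is simply (sketch; neighboring members elided):
      
        struct cfq_io_context {
                /* ... */
                struct request_queue *q; /* lets ioc find the queue without cfq */
                /* ... */
        };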
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: misc updates to blk_get_queue() · 09ac46c4
      Committed by Tejun Heo
      * blk_get_queue() is peculiar in that it returns 0 on success and 1 on
        failure instead of 0 / -errno or boolean.  Update it such that it
        returns %true on success and %false on failure.
      
      * Make sure the caller checks for the return value.
      
      * Separate out __blk_get_queue() which doesn't check whether @q is
        dead and put it in blk.h.  This will be used later.
      
      This patch doesn't introduce any functional changes.
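      
      The resulting interface, roughly as the description implies:
      
        /* blk.h: take a reference without the dead-queue check */
        static inline void __blk_get_queue(struct request_queue *q)
        {
                kobject_get(&q->kobj);
        }
        
        /* %true if a reference was taken, %false if @q is dead */
        bool blk_get_queue(struct request_queue *q)
        {
                if (likely(!blk_queue_dead(q))) {
                        __blk_get_queue(q);
                        return true;
                }
                return false;
        }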
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: make ioc get/put interface more conventional and fix race on allocation · 6e736be7
      Committed by Tejun Heo
      Ignoring copy_io() during fork, io_context can be allocated from two
      places - current_io_context() and set_task_ioprio().  The former is
      always called from the local task while the latter can be called
      from a different task.  The synchronization between them is peculiar
      and dubious.
      
      * current_io_context() doesn't grab task_lock() and assumes that if it
        saw %NULL ->io_context, it would stay that way until allocation and
        assignment is complete.  It has smp_wmb() between alloc/init and
        assignment.
      
      * set_task_ioprio() grabs task_lock() for assignment and does
        smp_read_barrier_depends() between "ioc = task->io_context" and "if
        (ioc)".  Unfortunately, this doesn't achieve anything - the latter
        is not a dependent load of the former.  I.e., if ioc itself were
        being dereferenced ("ioc->xxx"), it would mean something (though
        it's not clear what), but as the code currently stands, the
        dependent read barrier is a noop.
      
      As only one of the two test-assignment sequences is task_lock()
      protected, task_lock() can't do much about the race between the two.
      Nothing prevents current_io_context() and set_task_ioprio() from
      each allocating its own ioc for the same task and overwriting the
      other's.
      
      Also, set_task_ioprio() can race with an exiting task and create a
      new ioc after exit_io_context() is finished.
      
      ioc get/put doesn't have any reason to be complex.  The only hot path
      is accessing the existing ioc of %current, which is simple to achieve
      given that ->io_context is never destroyed as long as the task is
      alive.  All other paths can happily go through task_lock() like all
      other task sub structures without impacting anything.
      
      This patch updates ioc get/put so that it becomes more conventional.
      
      * alloc_io_context() is replaced with get_task_io_context().  This is
        the only interface which can acquire access to ioc of another task.
        On return, the caller has an explicit reference to the object which
        should be put using put_io_context() afterwards.
      
      * The functionality of current_io_context() remains the same but when
        creating a new ioc, it shares the code path with
        get_task_io_context() and always goes through task_lock().
      
      * get_io_context() now means incrementing ref on an ioc which the
        caller already has access to (be that an explicit refcnt or implicit
        %current one).
      
      * PF_EXITING inhibits creation of new io_context and once
        exit_io_context() is finished, it's guaranteed that both ioc
        acquisition functions return %NULL.
      
      * All users are updated.  Most are trivial but
        smp_read_barrier_depends() removal from cfq_get_io_context() needs a
        bit of explanation.  I suppose the original intention was to ensure
        ioc->ioprio is visible when set_task_ioprio() allocates new
        io_context and installs it; however, this wouldn't have worked
        because set_task_ioprio() doesn't have wmb between init and install.
        There are other problems with this which will be fixed in another
        patch.
      
      * While at it, use NUMA_NO_NODE instead of -1 for wildcard node
        specification.
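      
      A hedged sketch of the acquisition side (the creation helper's name
      and return convention here are assumptions):
      
        struct io_context *get_task_io_context(struct task_struct *task,
                                               gfp_t gfp_flags, int node)
        {
                struct io_context *ioc;
        
                do {
                        task_lock(task);
                        ioc = task->io_context;
                        if (likely(ioc)) {
                                get_io_context(ioc); /* explicit ref for caller */
                                task_unlock(task);
                                return ioc;
                        }
                        task_unlock(task);
                } while (create_task_io_context(task, gfp_flags, node) == 0);
        
                return NULL;    /* PF_EXITING or allocation failure */
        }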
      
      -v2: Vivek spotted contamination from debug patch.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: misc ioc cleanups · 42ec57a8
      Committed by Tejun Heo
      * int return from put_io_context() wasn't used by anybody.  Make it
        return void like other put functions and docbook-fy the function
        comment.
      
      * Reorder dummy declarations for !CONFIG_BLOCK case a bit.
      
      * Make alloc_io_context() use __GFP_ZERO allocation, take the init
        out of the if block and drop the 0'ing (see the sketch below).
      
      * Docbook-fy current_io_context() comment.
      
      This patch doesn't introduce any functional change.
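      
      The allocation change is the usual __GFP_ZERO pattern (illustrative):
      
        /* before: allocate, then zero fields one by one in the if block */
        ioc = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
        
        /* after: zeroed by the allocator; only non-zero init remains */
        ioc = kmem_cache_alloc_node(iocontext_cachep, gfp_flags | __GFP_ZERO,
                                    node);
        if (ioc)
                atomic_long_set(&ioc->refcount, 1);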
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move cfqd->cic_index to q->id · a73f730d
      Committed by Tejun Heo
      cfq allocates a per-queue id using ida and uses it to index the cic
      radix tree from io_context.  Move it to q->id, allocating on queue
      init and freeing on queue release.  This simplifies cfq a bit and
      will allow further improvements to io context life-cycle management.
      
      This patch doesn't introduce any functional difference.
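      
      Allocation and release, per the description (error handling elided;
      ida_simple_get()/ida_simple_remove() were the current helpers at the
      time):
      
        static DEFINE_IDA(blk_queue_ida);
        
        /* blk_alloc_queue_node() */
        q->id = ida_simple_get(&blk_queue_ida, 0, 0, GFP_KERNEL);
        if (q->id < 0)
                goto fail_q;
        
        /* blk_release_queue() */
        ida_simple_remove(&blk_queue_ida, q->id);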
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add missing blk_queue_dead() checks · 8ba61435
      Committed by Tejun Heo
      blk_insert_cloned_request(), blk_execute_rq_nowait() and
      blk_flush_plug_list() either didn't check whether the queue was dead
      or did it without holding queue_lock.  Update them so that dead state
      is checked while holding queue_lock.
      
      AFAICS, this plugs all holes (requeue doesn't matter as the request is
      transitioning atomically from in_flight to queued).
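      
      Each of these paths now follows the same shape (sketch):
      
        spin_lock_irqsave(q->queue_lock, flags);
        if (unlikely(blk_queue_dead(q))) {
                spin_unlock_irqrestore(q->queue_lock, flags);
                /* fail the request, e.g. rq->errors = -ENXIO + complete */
                return;
        }
        /* dead test and insertion are now atomic w.r.t. queue_lock */
        __elv_add_request(q, rq, where);
        spin_unlock_irqrestore(q->queue_lock, flags);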
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix drain_all condition in blk_drain_queue() · 481a7d64
      Committed by Tejun Heo
      When trying to drain all requests, blk_drain_queue() checked only
      q->rq.count[]; however, this only tracks REQ_ALLOCED requests.  This
      patch updates blk_drain_queue() such that it looks at all the counters
      and queues so that request_queue is actually empty on completion.
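      
      The check now ORs together every place a request can live
      (abbreviated sketch of the loop body):
      
        bool drain = false;
        int i;
        
        drain |= q->rq.elvpriv;         /* not just REQ_ALLOCED counts */
        if (drain_all) {
                drain |= !list_empty(&q->queue_head);
                for (i = 0; i < 2; i++) {       /* sync and async */
                        drain |= q->rq.count[i];
                        drain |= q->in_flight[i];
                        drain |= !list_empty(&q->flush_queue[i]);
                }
        }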
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add blk_queue_dead() · 34f6055c
      Committed by Tejun Heo
      There are a number of QUEUE_FLAG_DEAD tests.  Add blk_queue_dead()
      macro and use it.
      
      This patch doesn't introduce any functional difference.
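      
      The helper is a one-liner (per the description):
      
        #define blk_queue_dead(q)  test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)
        
        /* open-coded tests like this ... */
        if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))
                return;
        /* ... become */
        if (blk_queue_dead(q))
                return;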
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, sx8: kill blk_insert_request() · 1ba64ede
      Committed by Tejun Heo
      The only user left for blk_insert_request() is sx8, and it can be
      trivially switched to use blk_execute_rq_nowait() - special requests
      aren't included in io stats and sx8 doesn't use block layer tagging.
      Switch sx8 and kill blk_insert_request().
      
      This patch doesn't introduce any functional difference.
      
      Only compile tested.
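      
      The conversion boils down to something like this (hedged sketch; the
      completion callback name is a placeholder, not the driver's actual
      symbol):
      
        /* before: special insertion helper */
        blk_insert_request(host->oob_q, rq, 1, crq);
        
        /* after: plain at-head execution, no io-stat or tagging involved */
        rq->end_io_data = crq;
        blk_execute_rq_nowait(host->oob_q, NULL, rq, true, carm_oob_done);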
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jeff Garzik <jgarzik@pobox.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 10 Dec 2011, 9 commits
  3. 09 Dec 2011, 10 commits
    • sys_getppid: add missing rcu_dereference · 031af165
      Committed by Mandeep Singh Baines
      In order to safely dereference current->real_parent inside an
      rcu_read_lock, we need an rcu_dereference.
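      
      The fix is the canonical pattern:
      
        SYSCALL_DEFINE0(getppid)
        {
                int pid;
        
                rcu_read_lock();
                pid = task_tgid_vnr(rcu_dereference(current->real_parent));
                rcu_read_unlock();
        
                return pid;
        }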
      Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rapidio/tsi721: modify PCIe capability settings · 1cee22b7
      Committed by Alexandre Bounine
      Modify the initialization of PCIe capability registers in the Tsi721
      mport driver:
       - change the Completion Timeout value to avoid unexpected data
         transfer aborts during intensive traffic.
       - replace the hardcoded offset of the PCIe capability block with the
         common lookup function (see the sketch below).
      
      This patch is applicable to kernel versions starting from 3.2-rc1.
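      
      A hedged sketch of the capability-offset change (the register
      manipulation here is illustrative, not the driver's actual setting):
      
        /* locate the PCIe capability instead of hardcoding its offset */
        int pos = pci_pcie_cap(pdev);   /* 0 if no PCIe capability */
        u16 devctl2;
        
        if (pos) {
                pci_read_config_word(pdev, pos + PCI_EXP_DEVCTL2, &devctl2);
                devctl2 &= ~0xf;        /* completion timeout value field */
                pci_write_config_word(pdev, pos + PCI_EXP_DEVCTL2, devctl2);
        }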
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rapidio/tsi721: fix mailbox resource reporting · b439e66f
      Committed by Alexandre Bounine
      Bug fix for the Tsi721 RapidIO mport driver: Tsi721 supports four
      RapidIO mailboxes (MBOX0 - MBOX3) as defined by the RapidIO
      specification.  Mailbox resources have to be properly reported to
      allow use of all available mailboxes (the initial version reports
      only MBOX0).
      
      This patch is applicable to kernel versions starting from 3.2-rc1.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rapidio/tsi721: switch to dma_zalloc_coherent · ceb96398
      Committed by Alexandre Bounine
      Replace the pair dma_alloc_coherent()+memset() with the new
      dma_zalloc_coherent() added by Andrew Morton for kernel version 3.2.
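      
      The conversion pattern (names illustrative):
      
        /* before */
        buf = dma_alloc_coherent(&pdev->dev, size, &buf_dma, GFP_KERNEL);
        if (buf)
                memset(buf, 0, size);
        
        /* after: allocation and zeroing in one call */
        buf = dma_zalloc_coherent(&pdev->dev, size, &buf_dma, GFP_KERNEL);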
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • procfs: do not overflow get_{idle,iowait}_time for nohz · 2a95ea6c
      Committed by Michal Hocko
      Since commit a25cac51 ("proc: Consider NO_HZ when printing idle and
      iowait times") we also report idle/io_wait time while a CPU is
      tickless.  We rely on the get_{idle,iowait}_time functions to
      retrieve the proper data.
      
      These functions, however, use usecs_to_cputime to translate
      microseconds to cputime64_t.  This is just an alias for
      usecs_to_jiffies, which reduces the data type from u64 to unsigned
      int and also checks whether the given parameter overflows
      jiffies_to_usecs(MAX_JIFFY_OFFSET), returning MAX_JIFFY_OFFSET in
      that case.
      
      Where we overflow depends on CONFIG_HZ, but especially for
      CONFIG_HZ_300 it is quite low (1431649781), so we get
      MAX_JIFFY_OFFSET for >3000s until we overflow the unsigned int.
      Just for reference, CONFIG_HZ_100 has an overflow window of around
      20s, CONFIG_HZ_250 ~8s and CONFIG_HZ_1000 ~2s.
      
      This resulted in a bug where people saw [h]top going mad, reporting
      100% CPU usage even though there was basically no CPU load.  The
      reason was simply that /proc/stat stopped reporting idle/io_wait
      changes (and reported MAX_JIFFY_OFFSET), so the only change
      happening was to user/system time.
      
      Let's use nsecs_to_jiffies64 instead, which doesn't reduce the
      precision to a 32b type and is much more appropriate for cumulative
      time values (unlike usecs_to_jiffies, which is intended for timeout
      calculations).
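      
      The swap in get_idle_time()/get_iowait_time() is essentially this
      (sketch; variable names illustrative):
      
        /* before: truncates to unsigned int, saturates at MAX_JIFFY_OFFSET */
        idle = usecs_to_cputime(idle_time_us);
        
        /* after: full 64-bit conversion, fine for cumulative counters */
        idle = nsecs_to_jiffies64(1000 * idle_time_us); /* usecs -> nsecs */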
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Artem S. Tashkinov <t.artem@mailcity.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmalloc: check for page allocation failure before vmlist insertion · 1368edf0
      Committed by Mel Gorman
      Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
      /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist
      after they are fully initialised.  Unfortunately, it did not check
      that __vmalloc_area_node() successfully populated the area.  In the
      event of allocation failure, the vmalloc area is freed but the
      pointer to the freed memory is inserted into the vmlist, leading to
      a crash later in get_vmalloc_info().
      
      This patch adds a check for __vmalloc_area_node() failure within
      __vmalloc_node_range().  It does not use "goto fail" as in the
      previous error path, as a warning was already displayed by
      __vmalloc_area_node() before it called vfree in its failure path.
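      
      The added check, per the description above (sketch):
      
        addr = __vmalloc_area_node(area, gfp_mask, prot, node, caller);
        if (!addr)
                return NULL; /* area already freed, warning already printed */
        
        /* only a fully populated area makes it onto the vmlist */
        insert_vmalloc_vmlist(area);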
      
      Credit goes to Luciano Chavez for doing all the real work of identifying
      exactly where the problem was.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
      Tested-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>		[3.1.x+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: Ensure that pfn_valid() is called once per pageblock when reserving pageblocks · d0215638
      Committed by Michal Hocko
      setup_zone_migrate_reserve() expects that zone->start_pfn starts at
      a pageblock_nr_pages-aligned pfn; otherwise we could access beyond
      an existing memblock, resulting in the following panic if
      CONFIG_HOLES_IN_ZONE is not configured and we do not check
      pfn_valid:
      
        IP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180
        *pdpt = 0000000000000000 *pde = f000ff53f000ff53
        Oops: 0000 [#1] SMP
        Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
        EIP: 0060:[<c02d331d>] EFLAGS: 00010006 CPU: 0
        EIP is at setup_zone_migrate_reserve+0xcd/0x180
        EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
        ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
        DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
        Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
        Call Trace:
        [<c02d389c>] __setup_per_zone_wmarks+0xec/0x160
        [<c02d3a1f>] setup_per_zone_wmarks+0xf/0x20
        [<c08a771c>] init_per_zone_wmark_min+0x27/0x86
        [<c020111b>] do_one_initcall+0x2b/0x160
        [<c086639d>] kernel_init+0xbe/0x157
        [<c05cae26>] kernel_thread_helper+0x6/0xd
        Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 <2b> 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
        EIP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
        CR2: 00000000000012b4
      
      We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
      highstart_pfn = 0x36ffe.
      
      The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
      in a pageblock is reserved before marking it MIGRATE_RESERVE").
      
      Make sure that start_pfn is always aligned to pageblock_nr_pages so
      that pfn_valid() is always called at the start of each pageblock
      (see the sketch below).  Architectures with holes in pageblocks will
      be correctly handled by pfn_valid_within() in
      pageblock_is_reserved().
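      
      The alignment fix itself (sketch of setup_zone_migrate_reserve()):
      
        start_pfn = zone->zone_start_pfn;
        end_pfn = start_pfn + zone->spanned_pages;
        start_pfn = roundup(start_pfn, pageblock_nr_pages);     /* the fix */
        
        for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
                if (!pfn_valid(pfn))    /* once per pageblock now suffices */
                        continue;
                /* ... */
        }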
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Dang Bo <bdang@vmware.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[3.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate.c: pair unlock_page() and lock_page() when migrating huge pages · 09761333
      Committed by Hillf Danton
      Avoid unlocking an unlocked page if we failed to lock it.
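      
      In the huge-page migration path the lock and unlock now pair up
      (hedged sketch, not the verbatim diff):
      
        if (!trylock_page(hpage)) {
                if (!force)
                        goto out;       /* never locked: must not unlock */
                lock_page(hpage);
        }
        /* ... migration work ... */
        unlock_page(hpage);             /* pairs with the successful lock */
        out:
        put_page(hpage);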
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: set compound tail page _count to zero · 58a84aa9
      Committed by Youquan Song
      Commit 70b50f94 ("mm: thp: tail page refcounting fix") keeps all
      page_tail->_count at zero at all times.  But the current kernel does
      not set page_tail->_count to zero if a 1GB page is utilized.  So
      when an IOMMU 1GB page is used by KVM, it will result in a kernel
      oops because a tail page's _count does not equal zero.
      
        kernel BUG at include/linux/mm.h:386!
        invalid opcode: 0000 [#1] SMP
        Call Trace:
          gup_pud_range+0xb8/0x19d
          get_user_pages_fast+0xcb/0x192
          ? trace_hardirqs_off+0xd/0xf
          hva_to_pfn+0x119/0x2f2
          gfn_to_pfn_memslot+0x2c/0x2e
          kvm_iommu_map_pages+0xfd/0x1c1
          kvm_iommu_map_memslots+0x7c/0xbd
          kvm_iommu_map_guest+0xaa/0xbf
          kvm_vm_ioctl_assigned_device+0x2ef/0xa47
          kvm_vm_ioctl+0x36c/0x3a2
          do_vfs_ioctl+0x49e/0x4e4
          sys_ioctl+0x5a/0x7c
          system_call_fastpath+0x16/0x1b
        RIP  gup_huge_pud+0xf2/0x159
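      
      The fix mirrors what regular compound-page setup already does for
      tail pages (sketch of prep_compound_gigantic_page() in mm/hugetlb.c;
      set_page_count() is the usual helper for this):
      
        for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
                __SetPageTail(p);
                set_page_count(p, 0);   /* keep tail _count at zero */
                p->first_page = page;
        }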
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: add compound tail page _mapcount when mapped · b6999b19
      Committed by Youquan Song
      With the 3.2-rc kernel, IOMMU 2M pages in KVM work.  But when I
      tried to use IOMMU 1GB pages in KVM, I encountered an oops and the
      1GB page failed to be used.
      
      The root cause is that 1GB page allocation calls gup_huge_pud()
      while a 2M page calls gup_huge_pmd().  If compound pages are used
      and the page is a tail page, gup_huge_pmd() increases _mapcount to
      record that the tail page is mapped, while gup_huge_pud() does not
      do that.
      
      So when the mapped page is released, it will result in a kernel oops
      because the page is not marked mapped.
      
      This patch adds tail processing for compound pages to the 1GB huge
      page path, keeping it the same as the 2M page path (see the sketch
      below).
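      
      The 1GB path gains the same tail handling the 2M path already has
      (hedged sketch of the gup_huge_pud() loop):
      
        do {
                VM_BUG_ON(compound_head(page) != head);
                pages[*nr] = page;
                if (PageTail(page))
                        get_huge_page_tail(page);  /* bump tail _mapcount */
                (*nr)++;
                page++;
                refs++;
        } while (addr += PAGE_SIZE, addr != end);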
      
      Reproduce like:
      1. Add grub boot option: hugepagesz=1G hugepages=8
      2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages
      3. qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages
      	-net none -device pci-assign,host=07:00.1
      
        kernel BUG at mm/swap.c:114!
        invalid opcode: 0000 [#1] SMP
        Call Trace:
          put_page+0x15/0x37
          kvm_release_pfn_clean+0x31/0x36
          kvm_iommu_put_pages+0x94/0xb1
          kvm_iommu_unmap_memslots+0x80/0xb6
          kvm_assign_device+0xba/0x117
          kvm_vm_ioctl_assigned_device+0x301/0xa47
          kvm_vm_ioctl+0x36c/0x3a2
          do_vfs_ioctl+0x49e/0x4e4
          sys_ioctl+0x5a/0x7c
          system_call_fastpath+0x16/0x1b
        RIP  put_compound_page+0xd4/0x168
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>