1. 14 12月, 2011 11 次提交
    • T
      block, cfq: kill ioc_gone · b50b636b
      Tejun Heo 提交于
      Now that cic's are immediately unlinked under both locks, there's no
      need to count and drain cic's before module unload.  RCU callback
      completion is waited with rcu_barrier().
      
      While at it, remove residual RCU operations on cic_list.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b50b636b
    • T
      block, cfq: remove delayed unlink · b9a19208
      Tejun Heo 提交于
      Now that all cic's are immediately unlinked from both ioc and queue,
      lazy dropping from lookup path and trimming on elevator unregister are
      unnecessary.  Kill them and remove now unused elevator_ops->trim().
      
      This also leaves call_for_each_cic() without any user.  Removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b9a19208
    • T
      block, cfq: unlink cfq_io_context's immediately · b2efa052
      Tejun Heo 提交于
      cic is association between io_context and request_queue.  A cic is
      linked from both ioc and q and should be destroyed when either one
      goes away.  As ioc and q both have their own locks, locking becomes a
      bit complex - both orders work for removal from one but not from the
      other.
      
      Currently, cfq tries to circumvent this locking order issue with RCU.
      ioc->lock nests inside queue_lock but the radix tree and cic's are
      also protected by RCU allowing either side to walk their lists without
      grabbing lock.
      
      This rather unconventional use of RCU quickly devolves into extremely
      fragile convolution.  e.g. The following is from cfqd going away too
      soon after ioc and q exits raced.
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
       [   88.503444]
       Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
       RIP: 0010:[<ffffffff81397628>]  [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
       ...
       Call Trace:
        [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
        [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
        [<ffffffff81389130>] exit_io_context+0x100/0x140
        [<ffffffff81098a29>] do_exit+0x579/0x850
        [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
        [<ffffffff81098de7>] sys_exit_group+0x17/0x20
        [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b
      
      The only real hot path here is cic lookup during request
      initialization and avoiding extra locking requires very confined use
      of RCU.  This patch makes cic removal from both ioc and request_queue
      perform double-locking and unlink immediately.
      
      * From q side, the change is almost trivial as ioc->lock nests inside
        queue_lock.  It just needs to grab each ioc->lock as it walks
        cic_list and unlink it.
      
      * From ioc side, it's a bit more difficult because of inversed lock
        order.  ioc needs its lock to walk its cic_list but can't grab the
        matching queue_lock and needs to perform unlock-relock dancing.
      
        Unlinking is now wholly done from put_io_context() and fast path is
        optimized by using the queue_lock the caller already holds, which is
        by far the most common case.  If the ioc accessed multiple devices,
        it tries with trylock.  In unlikely cases of fast path failure, it
        falls back to full double-locking dance from workqueue.
      
      Double-locking isn't the prettiest thing in the world but it's *far*
      simpler and more understandable than RCU trick without adding any
      meaningful overhead.
      
      This still leaves a lot of now unnecessary RCU logics.  Future patches
      will trim them.
      
      -v2: Vivek pointed out that cic->q was being dereferenced after
           cic->release() was called.  Updated to use local variable @this_q
           instead.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b2efa052
    • T
      block, cfq: move ioc ioprio/cgroup changed handling to cic · dc86900e
      Tejun Heo 提交于
      ioprio/cgroup change was handled by marking the changed state in ioc
      and, on the following access to the ioc, performing RCU-protected
      iteration through all cic's grabbing the matching queue_lock.
      
      This patch moves the changed state to each cic.  When ioprio or cgroup
      changes, the respective bit is set on all cic's of the ioc and when
      each of those cic (not ioc) is accessed, change is applied for that
      specific ioc-queue pair.
      
      This also fixes the following two race conditions between setting and
      clearing of changed states.
      
      * Missing barrier between assign/load of ioprio and ioprio_changed
        allowed applying old ioprio.
      
      * Change requests could happen between application of change and
        clearing of changed variables.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      dc86900e
    • T
      block, cfq: misc updates to cfq_io_context · 283287a5
      Tejun Heo 提交于
      Make the following changes to prepare for ioc/cic management cleanup.
      
      * Add cic->q so that ioc can determine the associated queue without
        querying cfq.  This will eventually replace ->key.
      
      * Factor out cfq_release_cic() from cic_free_func().  This function
        assumes that the caller handled locking.
      
      * Rename __cfq_exit_single_io_context() to cfq_exit_cic() and make it
        take only @cic.
      
      * Restructure cfq_cic_link() for future updates.
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      283287a5
    • T
      block: misc updates to blk_get_queue() · 09ac46c4
      Tejun Heo 提交于
      * blk_get_queue() is peculiar in that it returns 0 on success and 1 on
        failure instead of 0 / -errno or boolean.  Update it such that it
        returns %true on success and %false on failure.
      
      * Make sure the caller checks for the return value.
      
      * Separate out __blk_get_queue() which doesn't check whether @q is
        dead and put it in blk.h.  This will be used later.
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      09ac46c4
    • T
      block: make ioc get/put interface more conventional and fix race on alloction · 6e736be7
      Tejun Heo 提交于
      Ignoring copy_io() during fork, io_context can be allocated from two
      places - current_io_context() and set_task_ioprio().  The former is
      always called from local task while the latter can be called from
      different task.  The synchornization between them are peculiar and
      dubious.
      
      * current_io_context() doesn't grab task_lock() and assumes that if it
        saw %NULL ->io_context, it would stay that way until allocation and
        assignment is complete.  It has smp_wmb() between alloc/init and
        assignment.
      
      * set_task_ioprio() grabs task_lock() for assignment and does
        smp_read_barrier_depends() between "ioc = task->io_context" and "if
        (ioc)".  Unfortunately, this doesn't achieve anything - the latter
        is not a dependent load of the former.  ie, if ioc itself were being
        dereferenced "ioc->xxx", it would mean something (not sure what tho)
        but as the code currently stands, the dependent read barrier is
        noop.
      
      As only one of the the two test-assignment sequences is task_lock()
      protected, the task_lock() can't do much about race between the two.
      Nothing prevents current_io_context() and set_task_ioprio() allocating
      its own ioc for the same task and overwriting the other's.
      
      Also, set_task_ioprio() can race with exiting task and create a new
      ioc after exit_io_context() is finished.
      
      ioc get/put doesn't have any reason to be complex.  The only hot path
      is accessing the existing ioc of %current, which is simple to achieve
      given that ->io_context is never destroyed as long as the task is
      alive.  All other paths can happily go through task_lock() like all
      other task sub structures without impacting anything.
      
      This patch updates ioc get/put so that it becomes more conventional.
      
      * alloc_io_context() is replaced with get_task_io_context().  This is
        the only interface which can acquire access to ioc of another task.
        On return, the caller has an explicit reference to the object which
        should be put using put_io_context() afterwards.
      
      * The functionality of current_io_context() remains the same but when
        creating a new ioc, it shares the code path with
        get_task_io_context() and always goes through task_lock().
      
      * get_io_context() now means incrementing ref on an ioc which the
        caller already has access to (be that an explicit refcnt or implicit
        %current one).
      
      * PF_EXITING inhibits creation of new io_context and once
        exit_io_context() is finished, it's guaranteed that both ioc
        acquisition functions return %NULL.
      
      * All users are updated.  Most are trivial but
        smp_read_barrier_depends() removal from cfq_get_io_context() needs a
        bit of explanation.  I suppose the original intention was to ensure
        ioc->ioprio is visible when set_task_ioprio() allocates new
        io_context and installs it; however, this wouldn't have worked
        because set_task_ioprio() doesn't have wmb between init and install.
        There are other problems with this which will be fixed in another
        patch.
      
      * While at it, use NUMA_NO_NODE instead of -1 for wildcard node
        specification.
      
      -v2: Vivek spotted contamination from debug patch.  Removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6e736be7
    • T
      block: misc ioc cleanups · 42ec57a8
      Tejun Heo 提交于
      * int return from put_io_context() wasn't used by anybody.  Make it
        return void like other put functions and docbook-fy the function
        comment.
      
      * Reorder dummy declarations for !CONFIG_BLOCK case a bit.
      
      * Make alloc_ioc_context() use __GFP_ZERO allocation, take init out of
        if block and drop 0'ing.
      
      * Docbook-fy current_io_context() comment.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      42ec57a8
    • T
      block, cfq: move cfqd->cic_index to q->id · a73f730d
      Tejun Heo 提交于
      cfq allocates per-queue id using ida and uses it to index cic radix
      tree from io_context.  Move it to q->id and allocate on queue init and
      free on queue release.  This simplifies cfq a bit and will allow for
      further improvements of io context life-cycle management.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a73f730d
    • T
      block: add blk_queue_dead() · 34f6055c
      Tejun Heo 提交于
      There are a number of QUEUE_FLAG_DEAD tests.  Add blk_queue_dead()
      macro and use it.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      34f6055c
    • T
      block, sx8: kill blk_insert_request() · 1ba64ede
      Tejun Heo 提交于
      The only user left for blk_insert_request() is sx8 and it can be
      trivially switched to use blk_execute_rq_nowait() - special requests
      aren't included in io stat and sx8 doesn't use block layer tagging.
      Switch sx8 and kill blk_insert_requeset().
      
      This patch doesn't introduce any functional difference.
      
      Only compile tested.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJeff Garzik <jgarzik@pobox.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1ba64ede
  2. 09 12月, 2011 1 次提交
  3. 07 12月, 2011 1 次提交
    • A
      fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API · 02125a82
      Al Viro 提交于
      __d_path() API is asking for trouble and in case of apparmor d_namespace_path()
      getting just that.  The root cause is that when __d_path() misses the root
      it had been told to look for, it stores the location of the most remote ancestor
      in *root.  Without grabbing references.  Sure, at the moment of call it had
      been pinned down by what we have in *path.  And if we raced with umount -l, we
      could have very well stopped at vfsmount/dentry that got freed as soon as
      prepend_path() dropped vfsmount_lock.
      
      It is safe to compare these pointers with pre-existing (and known to be still
      alive) vfsmount and dentry, as long as all we are asking is "is it the same
      address?".  Dereferencing is not safe and apparmor ended up stepping into
      that.  d_namespace_path() really wants to examine the place where we stopped,
      even if it's not connected to our namespace.  As the result, it looked
      at ->d_sb->s_magic of a dentry that might've been already freed by that point.
      All other callers had been careful enough to avoid that, but it's really
      a bad interface - it invites that kind of trouble.
      
      The fix is fairly straightforward, even though it's bigger than I'd like:
      	* prepend_path() root argument becomes const.
      	* __d_path() is never called with NULL/NULL root.  It was a kludge
      to start with.  Instead, we have an explicit function - d_absolute_root().
      Same as __d_path(), except that it doesn't get root passed and stops where
      it stops.  apparmor and tomoyo are using it.
      	* __d_path() returns NULL on path outside of root.  The main
      caller is show_mountinfo() and that's precisely what we pass root for - to
      skip those outside chroot jail.  Those who don't want that can (and do)
      use d_path().
      	* __d_path() root argument becomes const.  Everyone agrees, I hope.
      	* apparmor does *NOT* try to use __d_path() or any of its variants
      when it sees that path->mnt is an internal vfsmount.  In that case it's
      definitely not mounted anywhere and dentry_path() is exactly what we want
      there.  Handling of sysctl()-triggered weirdness is moved to that place.
      	* if apparmor is asked to do pathname relative to chroot jail
      and __d_path() tells it we it's not in that jail, the sucker just calls
      d_absolute_path() instead.  That's the other remaining caller of __d_path(),
      BTW.
              * seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway -
      the normal seq_file logics will take care of growing the buffer and redoing
      the call of ->show() just fine).  However, if it gets path not reachable
      from root, it returns SEQ_SKIP.  The only caller adjusted (i.e. stopped
      ignoring the return value as it used to do).
      Reviewed-by: NJohn Johansen <john.johansen@canonical.com>
      ACKed-by: NJohn Johansen <john.johansen@canonical.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      02125a82
  4. 06 12月, 2011 11 次提交
  5. 05 12月, 2011 1 次提交
    • P
      perf: Fix loss of notification with multi-event · 10c6db11
      Peter Zijlstra 提交于
      When you do:
              $ perf record -e cycles,cycles,cycles noploop 10
      
      You expect about 10,000 samples for each event, i.e., 10s at
      1000samples/sec. However, this is not what's happening. You
      get much fewer samples, maybe 3700 samples/event:
      
      $ perf report -D | tail -15
      Aggregated stats:
                 TOTAL events:      10998
                  MMAP events:         66
                  COMM events:          2
                SAMPLE events:      10930
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      cycles stats:
                 TOTAL events:       3642
                SAMPLE events:       3642
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      
      On a Intel Nehalem or even AMD64, there are 4 counters capable
      of measuring cycles, so there is plenty of space to measure those
      events without multiplexing (even with the NMI watchdog active).
      And even with multiplexing, we'd expect roughly the same number
      of samples per event.
      
      The root of the problem was that when the event that caused the buffer
      to become full was not the first event passed on the cmdline, the user
      notification would get lost. The notification was sent to the file
      descriptor of the overflowed event but the perf tool was not polling
      on it.  The perf tool aggregates all samples into a single buffer,
      i.e., the buffer of the first event. Consequently, it assumes
      notifications for any event will come via that descriptor.
      
      The seemingly straight forward solution of moving the waitq into the
      ringbuffer object doesn't work because of life-time issues. One could
      perf_event_set_output() on a fd that you're also blocking on and cause
      the old rb object to be freed while its waitq would still be
      referenced by the blocked thread -> FAIL.
      
      Therefore link all events to the ringbuffer and broadcast the wakeup
      from the ringbuffer object to all possible events that could be waited
      upon. This is rather ugly, and we're open to better solutions but it
      works for now.
      Reported-by: NStephane Eranian <eranian@google.com>
      Finished-by: NStephane Eranian <eranian@google.com>
      Reviewed-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111126014731.GA7030@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      10c6db11
  6. 04 12月, 2011 1 次提交
  7. 02 12月, 2011 1 次提交
  8. 01 12月, 2011 1 次提交
  9. 29 11月, 2011 4 次提交
  10. 27 11月, 2011 4 次提交
  11. 24 11月, 2011 3 次提交
  12. 23 11月, 2011 1 次提交
    • A
      regulator: TPS65910: Fix VDD1/2 voltage selector count · 780dc9ba
      Afzal Mohammed 提交于
      Count of selector voltage is required for regulator_set_voltage
      to work via set_voltage_sel. VDD1/2 currently have it as zero,
      so regulator_set_voltage won't work for VDD1/2.
      Update count (n_voltages) for VDD1/2.
      
      Output Voltage = (step value * 12.5 mV + 562.5 mV) * gain
      
      With above expr, number of voltages that can be selected is
      step value count * gain count
      
      constant for gain count will be called VDD1_2_NUM_VOLT_COARSE
      
      existing constant for step value count is VDD1_2_NUM_VOLTS,
      use VDD1_2_NUM_VOLT_FINE instead to make clear that step value
      is not the only component in deciding selectable voltage count
      Signed-off-by: NAfzal Mohammed <afzal@ti.com>
      Signed-off-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
      780dc9ba