1. 22 10月, 2010 9 次提交
    • A
      include/linux/libata.h: fix typo · 89692c03
      Andrea Gelmini 提交于
      Signed-off-by: NAndrea Gelmini <andrea.gelmini@gelma.net>
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      89692c03
    • R
      libata: reorder ata_queued_cmd to remove alignment padding on 64 bit builds · b34e9042
      Richard Kennedy 提交于
      Reorder structure ata_queued_cmd to remove 8 bytes of alignment padding
      on 64 bit builds & therefore reduce the size of structure ata_port by
      256 bytes.
      
      Overall this will have little impact, other than reducing the amount of
      memory that is cleared when allocating ata_ports.
      Signed-off-by: NRichard Kennedy <richard@rsk.demon.co.uk>
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      b34e9042
    • T
      libata: implement cross-port EH exclusion · c0c362b6
      Tejun Heo 提交于
      In libata, the non-EH code paths should always take and release
      ap->lock explicitly when accessing hardware or shared data structures.
      However, once EH is active, it's assumed that the port is owned by EH
      and EH methods don't explicitly take ap->lock unless race from irq
      handler or other code paths are expected.  However, libata EH didn't
      guarantee exclusion among EHs for ports of the same host.  IOW,
      multiple EHs may execute in parallel on multiple ports of the same
      controller.
      
      In many cases, especially in SATA, the ports are completely
      independent of each other and this doesn't cause problems; however,
      there are cases where different ports share the same resource, which
      lead to obscure timing related bugs such as the one fixed by commit
      213373cf (ata_piix: fix locking around SIDPR access).
      
      This patch implements exclusion among EHs of the same host.  When EH
      begins, it acquires per-host EH ownership by calling ata_eh_acquire().
      When EH finishes, the ownership is released by calling
      ata_eh_release().  EH ownership is also released whenever the EH
      thread goes to sleep from ata_msleep() or explicitly and reacquired
      after waking up.
      
      This ensures that while EH is actively accessing the hardware, it has
      exclusive access to it while allowing EHs to interleave and progress
      in parallel as they hit waiting stages, which dominate the time spent
      in EH.  This achieves cross-port EH exclusion without pervasive and
      fragile changes while still allowing parallel EH for the most part.
      
      This was first reported by yuanding02@gmail.com more than three years
      ago in the following bugzilla.  :-)
      
        https://bugzilla.kernel.org/show_bug.cgi?id=8223Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Reported-by: yuanding02@gmail.com
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      c0c362b6
    • T
      libata: add @ap to ata_wait_register() and introduce ata_msleep() · 97750ceb
      Tejun Heo 提交于
      Add optional @ap argument to ata_wait_register() and replace msleep()
      calls with ata_msleep() which take optional @ap in addition to the
      duration.  These will be used to implement EH exclusion.
      
      This patch doesn't cause any behavior difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      97750ceb
    • T
      libata: reimplement link power management · 6b7ae954
      Tejun Heo 提交于
      The current LPM implementation has the following issues.
      
      * Operation order isn't well thought-out.  e.g. HIPM should be
        configured after IPM in SControl is properly configured.  Not the
        other way around.
      
      * Suspend/resume paths call ata_lpm_enable/disable() which must only
        be called from EH context directly.  Also, ata_lpm_enable/disable()
        were called whether LPM was in use or not.
      
      * Implementation is per-port when it should be per-link.  As a result,
        it can't be used for controllers with slave links or PMP.
      
      * LPM state isn't managed consistently.  After a link reset for
        whatever reason including suspend/resume the actual LPM state would
        be reset leaving ap->lpm_policy inconsistent.
      
      * Generic/driver-specific logic boundary isn't clear.  Currently,
        libahci has to mangle stuff which libata EH proper should be
        handling.  This makes the implementation unnecessarily complex and
        fragile.
      
      * Tied to ALPM.  Doesn't consider DIPM only cases and doesn't check
        whether the device allows HIPM.
      
      * Error handling isn't implemented.
      
      Given the extent of mismatch with the rest of libata, I don't think
      trying to fix it piecewise makes much sense.  This patch reimplements
      LPM support.
      
      * The new implementation is per-link.  The target policy is still
        port-wide (ap->target_lpm_policy) but all the mechanisms and states
        are per-link and integrate well with the rest of link abstraction
        and can work with slave and PMP links.
      
      * Core EH has proper control of LPM state.  LPM state is reconfigured
        when and only when reconfiguration is necessary.  It makes sure that
        LPM state is reset when probing for new device on the link.
        Controller agnostic logic is now implemented in libata EH proper and
        driver implementation only has to deal with controller specifics.
      
      * Proper error handling.  LPM config failure is attributed to the
        device on the link and LPM is disabled for the link if it fails
        repeatedly.
      
      * ops->enable/disable_pm() are replaced with single ops->set_lpm()
        which takes @policy and @hints.  This simplifies driver specific
        implementation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      6b7ae954
    • T
      libata: implement sata_link_scr_lpm() and make ata_dev_set_feature() global · 1152b261
      Tejun Heo 提交于
      Link power management is about to be reimplemented.  Prepare for it.
      
      * Implement sata_link_scr_lpm().
      
      * Drop static from ata_dev_set_feature() and make it available to
        other libata files.
      
      * Trivial whitespace adjustments.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      1152b261
    • T
      libata: clean up lpm related symbols and sysfs show/store functions · c93b263e
      Tejun Heo 提交于
      Link power management related symbols are in confusing state w/ mixed
      usages of lpm, ipm and pm.  This patch cleans up lpm related symbols
      and sysfs show/store functions as follows.
      
      * lpm states - NOT_AVAILABLE, MIN_POWER, MAX_PERFORMANCE and
        MEDIUM_POWER are renamed to ATA_LPM_UNKNOWN and
        ATA_LPM_{MIN|MAX|MED}_POWER.
      
      * Pre/postfixes are unified to lpm.
      
      * sysfs show/store functions for link_power_management_policy were
        curiously named get/put and unnecessarily complex.  Renamed to
        show/store and simplified.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      c93b263e
    • G
      [libata] support for > 512 byte sectors (e.g. 4K Native) · 295124dc
      Grant Grundler 提交于
      This change enables my x86 machine to recognize and talk to a
      "Native 4K" SATA device.
      
      When I started working on this, I didn't know Matthew Wilcox had
      posted a similar patch 2 years ago:
        http://git.kernel.org/?p=linux/kernel/git/willy/ata.git;a=shortlog;h=refs/heads/ata-large-sectors
      
      Gwendal Grignou pointed me at the the above code and small portions of
      this patch include Matthew's work. That's why Mathew is first on the
      "Signed-off-by:". I've NOT included his use of a bitmap to determine
      512 vs Native for ATA command block size - just used a simple table.
      And bugs are almost certainly mine.
      
      Lastly, the patch has been tested with a native 4K 'Engineering
      Sample' drive provided by Hitachi GST.
      Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: NGrant Grundler <grundler@google.com>
      Reviewed-by: NGwendal Grignou <gwendal@google.com>
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      295124dc
    • G
      [libata] Add ATA transport class · d9027470
      Gwendal Grignou 提交于
      This is a scheleton for libata transport class.
      All information is read only, exporting information from libata:
      - ata_port class: one per ATA port
      - ata_link class: one per ATA port or 15 for SATA Port Multiplier
      - ata_device class: up to 2 for PATA link, usually one for SATA.
      Signed-off-by: NGwendal Grignou <gwendal@google.com>
      Reviewed-by: NGrant Grundler <grundler@google.com>
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      d9027470
  2. 21 10月, 2010 13 次提交
  3. 19 10月, 2010 13 次提交
    • Y
      block: fix accounting bug on cross partition merges · 7681bfee
      Yasuaki Ishimatsu 提交于
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      When reloading partition tables, quiesce IO to ensure that no
      request references to the partition struct exists. When it is safe
      to free the partition table, the IO for that device is restarted
      again.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7681bfee
    • V
      sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time · b52bfee4
      Venkatesh Pallipadi 提交于
      s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
      the fine granularity accounting of user, system, hardirq, softirq times.
      Adding that option on archs like x86 will be challenging however, given the
      state of TSC reliability on various platforms and also the overhead it will
      add in syscall entry exit.
      
      Instead, add a lighter variant that only does finer accounting of
      hardirq and softirq times, providing precise irq times (instead of timer tick
      based samples). This accounting is added with a new config option
      CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
      interested in paying the perf penalty.
      
      This accounting is based on sched_clock, with the code being generic.
      So, other archs may find it useful as well.
      
      This patch just adds the core logic and does not enable this logic yet.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b52bfee4
    • V
      sched: Add a PF flag for ksoftirqd identification · 6cdd5199
      Venkatesh Pallipadi 提交于
      To account softirq time cleanly in scheduler, we need to identify whether
      softirq is invoked in ksoftirqd context or softirq at hardirq tail context.
      Add PF_KSOFTIRQD for that purpose.
      
      As all PF flag bits are currently taken, create space by moving one of the
      infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along
      with some other state fields.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      6cdd5199
    • V
      sched: Consolidate account_system_vtime extern declaration · e1e10a26
      Venkatesh Pallipadi 提交于
      Just a minor cleanup patch that makes things easier to the following patches.
      No functionality change in this patch.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e1e10a26
    • V
      sched: Fix softirq time accounting · 75e1056f
      Venkatesh Pallipadi 提交于
      Peter Zijlstra found a bug in the way softirq time is accounted in
      VIRT_CPU_ACCOUNTING on this thread:
      
         http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
      
      The problem is, softirq processing uses local_bh_disable internally. There
      is no way, later in the flow, to differentiate between whether softirq is
      being processed or is it just that bh has been disabled. So, a hardirq when bh
      is disabled results in time being wrongly accounted as softirq.
      
      Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
      as well. As account_system_time() in normal tick based accouting also uses
      softirq_count, which will be set even when not in softirq with bh disabled.
      
      Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count
      for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
      processing. The patch below does that and adds API in_serving_softirq() which
      returns whether we are currently processing softirq or not.
      
      Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
      to in_serving_softirq.
      
      Looks like many usages of in_softirq really want in_serving_softirq. Those
      changes can be made individually on a case by case basis.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      75e1056f
    • P
      jump_label: Add COND_STMT(), reducer wrappery · ebf31f50
      Peter Zijlstra 提交于
      The use of the JUMP_LABEL() construct ends up creating endless silly
      wrappers, create a higher level construct to reduce this clutter.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ebf31f50
    • P
      perf: Optimize sw events · 7e54a5a0
      Peter Zijlstra 提交于
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      7e54a5a0
    • P
      perf: Use jump_labels to optimize the scheduler hooks · 82cd6def
      Peter Zijlstra 提交于
      Trades a call + conditional + ret for an unconditional jmp.
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101014203625.501657727@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      82cd6def
    • P
      jump_label: Add atomic_t interface · 8b92538d
      Peter Zijlstra 提交于
      Add an interface to allow usage of jump_labels with atomic counters.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <20101014203625.501657727@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8b92538d
    • P
      jump_label: Use more consistent naming · 3b6e901f
      Peter Zijlstra 提交于
      Now that there's still only a few users around, rename things to make
      them more consistent.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101014203625.448565169@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3b6e901f
    • P
      perf, hw_breakpoint: Fix crash in hw_breakpoint creation · d580ff86
      Peter Zijlstra 提交于
      hw_breakpoint creation needs to account stuff per-task to ensure there
      is always sufficient hardware resources to back these things due to
      ptrace.
      
      With the perf per pmu context changes the event initialization no
      longer has access to the event context, for the simple reason that we
      need to first find the pmu (result of initialization) before we can
      find the context.
      
      This makes hw_breakpoints unhappy, because it can no longer do per
      task accounting, cure this by frobbing a task pointer in the event::hw
      bits for now...
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <20101014203625.391543667@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d580ff86
    • P
      irq_work: Add generic hardirq context callbacks · e360adbe
      Peter Zijlstra 提交于
      Provide a mechanism that allows running code in IRQ context. It is
      most useful for NMI code that needs to interact with the rest of the
      system -- like wakeup a task to drain buffers.
      
      Perf currently has such a mechanism, so extract that and provide it as
      a generic feature, independent of perf so that others may also
      benefit.
      
      The IRQ context callback is generated through self-IPIs where
      possible, or on architectures like powerpc the decrementer (the
      built-in timer facility) is set to generate an interrupt immediately.
      
      Architectures that don't have anything like this get to do with a
      callback from the timer tick. These architectures can call
      irq_work_run() at the tail of any IRQ handlers that might enqueue such
      work (like the perf IRQ handler) to avoid undue latencies in
      processing the work.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NKyle McMartin <kyle@mcmartin.ca>
      Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      [ various fixes ]
      Signed-off-by: NHuang Ying <ying.huang@intel.com>
      LKML-Reference: <1287036094.7768.291.camel@yhuang-dev>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e360adbe
    • H
      lockdep: Add improved subclass caching · 62016250
      Hitoshi Mitake 提交于
      Current lockdep_map only caches one class with subclass == 0,
      and looks up hash table of classes when subclass != 0.
      
      It seems that this has no problem because the case of
      subclass != 0 is rare. But locks of struct rq are
      acquired with subclass == 1 when task migration is executed.
      Task migration is high frequent event, so I modified lockdep
      to cache subclasses.
      
      I measured the score of perf bench sched messaging.
      This patch has slightly but certain (order of milli seconds
      or 10 milli seconds) effect when lots of tasks are running.
      I'll show the result in the tail of this description.
      
      NR_LOCKDEP_CACHING_CLASSES specifies how many classes can be
      cached in the instances of lockdep_map.
      I discussed with Peter Zijlstra in LinuxCon Japan about
      this approach and he taught me that caching every subclasses(8)
      is cleary waste of memory. So number of cached classes
      should be configurable.
      
      === Score comparison of benchmarks ===
      # "min" means best score, and "max" means worst score
      
      for i in `seq 1 10`; do ./perf bench -f simple sched messaging; done
      
      before: min: 0.565000, max: 0.583000, avg: 0.572500
      after:  min: 0.559000, max: 0.568000, avg: 0.563300
      
      # with more processes
      for i in `seq 1 10`; do ./perf bench -f simple sched messaging -g 40; done
      
      before: min: 2.274000, max: 2.298000, avg: 2.286300
      after:  min: 2.242000, max: 2.270000, avg: 2.259700
      Signed-off-by: NHitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286269311-28336-2-git-send-email-mitake@dcl.info.waseda.ac.jp>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      62016250
  4. 17 10月, 2010 5 次提交