1. 22 December 2012 (4 commits)
    • [media] vivi: Optimize precalculate_line() · d40fbf8d
      Committed by Kirill Smelkov
      precalculate_line() is not very high in the profile, but it calls the
      expensive gen_twopix(), so let's polish it too:
          call gen_twopix() only once for every color bar and then distribute
          the result.
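      In other words, something along these lines (a hedged sketch of the
      reworked loop, not the verbatim patch; the bar-boundary bookkeeping is
      simplified, and gen_twopix()'s signature follows its use elsewhere in
      the driver):
          int colorpos, w;

          for (colorpos = 0; colorpos < 8; colorpos++) {
                  u8 pix[8];  /* one even + one odd pixel, <= 4 bytes each */
                  int wstart = colorpos * dev->width / 8;
                  int wend   = (colorpos + 1) * dev->width / 8;

                  /* expensive format conversion, now done once per bar ... */
                  gen_twopix(dev, &pix[0], colorpos, 0);
                  gen_twopix(dev, &pix[dev->pixelsize], colorpos, 1);

                  /* ... then cheaply distributed across the whole bar */
                  for (w = wstart; w < wend; w += 2)
                          memcpy(dev->line + w * dev->pixelsize, pix,
                                 2 * dev->pixelsize);
          }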
      before:
          # cmdline : /home/kirr/local/perf/bin/perf record -g -a sleep 20
          #
          # Samples: 46K of event 'cycles'
          # Event count (approx.): 15574200568
          #
          # Overhead          Command         Shared Object
          # ........  ...............  ....................
          #
              27.99%             rawv  libc-2.13.so          [.] __memcpy_ssse3
              23.29%           vivi-*  [kernel.kallsyms]     [k] memcpy
              10.30%             Xorg  [unknown]             [.] 0xa75c98f8
               5.34%           vivi-*  [vivi]                [k] gen_text.constprop.6
               4.61%             rawv  [vivi]                [k] gen_twopix
               2.64%             rawv  [vivi]                [k] precalculate_line
               1.37%          swapper  [kernel.kallsyms]     [k] read_hpet
      after:
          # cmdline : /home/kirr/local/perf/bin/perf record -g -a sleep 20
          #
          # Samples: 45K of event 'cycles'
          # Event count (approx.): 15561769214
          #
          # Overhead          Command         Shared Object
          # ........  ...............  ....................
          #
              30.73%             rawv  libc-2.13.so          [.] __memcpy_ssse3
              26.78%           vivi-*  [kernel.kallsyms]     [k] memcpy
              10.68%             Xorg  [unknown]             [.] 0xa73015e9
               5.55%           vivi-*  [vivi]                [k] gen_text.constprop.6
               1.36%          swapper  [kernel.kallsyms]     [k] read_hpet
               0.96%             Xorg  [kernel.kallsyms]     [k] read_hpet
               ...
               0.16%             rawv  [vivi]                [k] precalculate_line
               ...
               0.14%             rawv  [vivi]                [k] gen_twopix
      (i.e. gen_twopix and precalculate_line overheads are almost gone)
      Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
      Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • [media] vivi: Move computations out of vivi_fillbuff linecopy loop · 13908f33
      Committed by Kirill Smelkov
      The "dev->mvcount % wmax" thing was showing high in profiles (we do it
      for each line which ~ 500 per frame)
                 │     000010c0 <vivi_fillbuff>:
                       ...
            0,39 │ 70:┌─→mov    0x3ff4(%edi),%esi
            0,22 │ 76:│  mov    0x2a0(%edi),%eax
            0,30 │    │  mov    -0x84(%ebp),%ebx
            0,35 │    │  mov    %eax,%edx
            0,04 │    │  mov    -0x7c(%ebp),%ecx
            0,35 │    │  sar    $0x1f,%edx
            0,44 │    │  idivl  -0x7c(%ebp)
           21,68 │    │  imul   %esi,%ecx
            0,70 │    │  imul   %esi,%ebx
            0,52 │    │  add    -0x88(%ebp),%ebx
            1,65 │    │  mov    %ebx,%eax
            0,22 │    │  imul   %edx,%esi
            0,04 │    │  lea    0x3f4(%edi,%esi,1),%edx
            2,18 │    │→ call   vivi_fillbuff+0xa6
            0,74 │    │  addl   $0x1,-0x80(%ebp)
           62,69 │    │  mov    -0x7c(%ebp),%edx
            1,18 │    │  mov    -0x80(%ebp),%ecx
            0,35 │    │  add    %edx,-0x84(%ebp)
            0,61 │    │  cmp    %ecx,-0x8c(%ebp)
            0,22 │    └──jne    70
      Since all these variables stay the same across iterations, let's move
      the computations out of the loop: the above-mentioned division, and the
      "width*pixelsize" multiplication too.
      before:
          # cmdline : /home/kirr/local/perf/bin/perf record -g -a sleep 20
          #
          # Samples: 49K of event 'cycles'
          # Event count (approx.): 16475832370
          #
          # Overhead          Command           Shared Object
          # ........  ...............  ......................
          #
              29.07%             rawv  libc-2.13.so            [.] __memcpy_ssse3
              20.57%           vivi-*  [kernel.kallsyms]       [k] memcpy
              10.20%             Xorg  [unknown]               [.] 0xa7301494
               5.16%           vivi-*  [vivi]                  [k] gen_text.constprop.6
               4.43%             rawv  [vivi]                  [k] gen_twopix
               4.36%           vivi-*  [vivi]                  [k] vivi_fillbuff
               2.42%             rawv  [vivi]                  [k] precalculate_line
               1.33%          swapper  [kernel.kallsyms]       [k] read_hpet
      after:
          # cmdline : /home/kirr/local/perf/bin/perf record -g -a sleep 20
          #
          # Samples: 46K of event 'cycles'
          # Event count (approx.): 15574200568
          #
          # Overhead          Command         Shared Object
          # ........  ...............  ....................
          #
              27.99%             rawv  libc-2.13.so          [.] __memcpy_ssse3
              23.29%           vivi-*  [kernel.kallsyms]     [k] memcpy
              10.30%             Xorg  [unknown]             [.] 0xa75c98f8
               5.34%           vivi-*  [vivi]                [k] gen_text.constprop.6
               4.61%             rawv  [vivi]                [k] gen_twopix
               2.64%             rawv  [vivi]                [k] precalculate_line
               1.37%          swapper  [kernel.kallsyms]     [k] read_hpet
               0.79%             Xorg  [kernel.kallsyms]     [k] read_hpet
               0.64%             Xorg  [kernel.kallsyms]     [k] unix_poll
               0.45%             Xorg  [kernel.kallsyms]     [k] fget_light
               0.43%             rawv  libxcb.so.1.1.0       [.] 0x0000aae9
               0.40%            runsv  [kernel.kallsyms]     [k] ext2_try_to_allocate
               0.36%             Xorg  [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
               0.31%           vivi-*  [vivi]                [k] vivi_fillbuff
      (i.e. vivi_fillbuff's own overhead is almost gone)
      Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
      Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • [media] vivi: vivi_dev->line[] was not aligned · 10ce8441
      Committed by Kirill Smelkov
      Though dev->line[] is a u8 array, we work with it as u16, u24 or u32
      pixels, and we also pass it to memcpy(), so it's better to align it to
      at least 4 bytes.
      Before the patch, on x86, offsetof(vivi_dev, line) was 1003; after the
      patch it is 1004.
      There is a slight performance increase, and I think it is slight only
      because we do not start copying from line[0]:
          ---- 8< ---- drivers/media/platform/vivi.c
          static void vivi_fillbuff(struct vivi_dev *dev, struct vivi_buffer *buf)
          {
                  ...
                  for (h = 0; h < hmax; h++)
                          memcpy(vbuf + h * wmax * dev->pixelsize,
                                 dev->line + (dev->mv_count % wmax) * dev->pixelsize,
                                 wmax * dev->pixelsize);
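      The fix implied by the above is essentially a one-line attribute on the
      array (a sketch; the array size shown here is illustrative):
          struct vivi_dev {
                  ...
                  /* accessed as u16/u24/u32 pixels and fed to memcpy() */
                  u8 line[MAX_WIDTH * 8] __attribute__((__aligned__(4)));
                  ...
          };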
      before:
          # cmdline : /home/kirr/local/perf/bin/perf record -g -a sleep 20
          #
          # Samples: 49K of event 'cycles'
          # Event count (approx.): 16799780016
          #
          # Overhead          Command         Shared Object
          # ........  ...............  ....................
          #
              27.51%             rawv  libc-2.13.so          [.] __memcpy_ssse3
              23.77%           vivi-*  [kernel.kallsyms]     [k] memcpy
               9.96%             Xorg  [unknown]             [.] 0xa76f5e12
               4.94%           vivi-*  [vivi]                [k] gen_text.constprop.6
               4.44%             rawv  [vivi]                [k] gen_twopix
               3.17%           vivi-*  [vivi]                [k] vivi_fillbuff
               2.45%             rawv  [vivi]                [k] precalculate_line
               1.20%          swapper  [kernel.kallsyms]     [k] read_hpet
          23.77%           vivi-*  [kernel.kallsyms]     [k] memcpy
                           |
                           --- memcpy
                              |
                              |--99.28%-- vivi_fillbuff
                              |          vivi_thread
                              |          kthread
                              |          ret_from_kernel_thread
                               --0.72%-- [...]
      after:
          # cmdline : /home/kirr/local/perf/bin/perf record -g -a sleep 20
          #
          # Samples: 49K of event 'cycles'
          # Event count (approx.): 16475832370
          #
          # Overhead          Command           Shared Object
          # ........  ...............  ......................
          #
              29.07%             rawv  libc-2.13.so            [.] __memcpy_ssse3
              20.57%           vivi-*  [kernel.kallsyms]       [k] memcpy
              10.20%             Xorg  [unknown]               [.] 0xa7301494
               5.16%           vivi-*  [vivi]                  [k] gen_text.constprop.6
               4.43%             rawv  [vivi]                  [k] gen_twopix
               4.36%           vivi-*  [vivi]                  [k] vivi_fillbuff
               2.42%             rawv  [vivi]                  [k] precalculate_line
               1.33%          swapper  [kernel.kallsyms]       [k] read_hpet
      Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
      Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • [media] vivi: Optimize gen_text() · e3a8b4d2
      Committed by Kirill Smelkov
      I've noticed that vivi takes a lot of CPU to produce its frames.
      For example, with 8 devices and 8 simple programs running, where each
      captures YUY2 640x480 and displays it to X via SDL, the profile timing
      is as follows:
          # cmdline : /home/kirr/local/perf/bin/perf record -g -a sleep 20
          # Samples: 82K of event 'cycles'
          # Event count (approx.): 31551930117
          #
          # Overhead          Command         Shared Object                                                           Symbol
          # ........  ...............  ....................
          #
              49.48%           vivi-*  [vivi]                [k] gen_twopix
              10.79%           vivi-*  [kernel.kallsyms]     [k] memcpy
              10.02%             rawv  libc-2.13.so          [.] __memcpy_ssse3
               8.35%           vivi-*  [vivi]                [k] gen_text.constprop.6
               5.06%             Xorg  [unknown]             [.] 0xa73015f8
               2.32%             rawv  [vivi]                [k] gen_twopix
               1.22%             rawv  [vivi]                [k] precalculate_line
               1.20%           vivi-*  [vivi]                [k] vivi_fillbuff
          (rawv is the display program; vivi-* is the combination of vivi-000 through vivi-007)
      so a lot of time is spent in gen_twopix(), which, as the following
      call-graph profile shows ...
          49.48%           vivi-*  [vivi]                [k] gen_twopix
                           |
                           --- gen_twopix
                              |
                              |--96.30%-- gen_text.constprop.6
                              |          vivi_fillbuff
                              |          vivi_thread
                              |          kthread
                              |          ret_from_kernel_thread
                              |
                               --3.70%-- vivi_fillbuff
                                         vivi_thread
                                         kthread
                                         ret_from_kernel_thread
      ... is called mostly from gen_text().
      If we look at gen_text()'s inner loop, we see
          if (chr & (1 << (7 - i)))
                  gen_twopix(dev, pos + j * dev->pixelsize, WHITE, (x+y) & 1);
          else
                  gen_twopix(dev, pos + j * dev->pixelsize, TEXT_BLACK, (x+y) & 1);
      which calls gen_twopix() for every character pixel, and that is very
      expensive, because gen_twopix() branches several times.
      Now, note that we operate on only two colors - WHITE and TEXT_BLACK -
      and that the pixels for those colors can be precomputed, with
      gen_twopix() moved out of the inner loop. Also note that for the black
      and white colors even/odd makes no difference in any supported pixel
      format, so we can stop playing that `odd` gen_twopix() parameter game.
      So the first thing we do here is
          1) move the gen_twopix() calls out of gen_text() into vivi_fillbuff(),
             to pregenerate the black and white pixels just before printing
             starts.
      Next, gen_text's font-rendering loop, even with the gen_twopix() calls
      moved out, was still inefficient and branchy, so let's
          2) rewrite the gen_text() loop so it uses fewer variables + unroll
             the char horizontal-rendering loop + instantiate 3 code paths for
             pixelsizes 2, 3 and 4, so that in the inner loops we don't have
             to branch or make indirections (*); see the sketch right after
             this list.
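      A hedged sketch of the resulting pixelsize==2 path, reconstructed from
      this description and the annotated assembly below (gen_text_2() and its
      parameters are illustrative names, not the verbatim patch):
          /* render text with an 8x16 font using two precomputed u16 pixels:
           * one conditional store per font bit, no per-pixel format branches */
          static void gen_text_2(u16 *pos, unsigned stride /* in pixels */,
                                 const u8 *font8x16, const char *text,
                                 u16 white, u16 black)
          {
                  unsigned y;
                  const char *s;

                  for (y = 0; y < 16; y++) {
                          u16 *row = pos + y * stride;

                          for (s = text; *s; s++, row += 8) {
                                  u8 chr = font8x16[16 * (u8)*s + y];

                                  row[0] = (chr & 0x80) ? white : black;
                                  row[1] = (chr & 0x40) ? white : black;
                                  row[2] = (chr & 0x20) ? white : black;
                                  row[3] = (chr & 0x10) ? white : black;
                                  row[4] = (chr & 0x08) ? white : black;
                                  row[5] = (chr & 0x04) ? white : black;
                                  row[6] = (chr & 0x02) ? white : black;
                                  row[7] = (chr & 0x01) ? white : black;
                          }
                  }
          }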
      With all the above reworks done, gen_text() compiles to nice,
      non-branchy, streamlined code (showing the loop for pixelsize=2):
                 │       cmp    $0x2,%eax
                 │     ↓ jne    26
                 │       mov    -0x18(%ebp),%eax
                 │       mov    -0x20(%ebp),%edi
                 │       imul   -0x20(%ebp),%eax
                 │       movzwl 0x3ffc(%ebx),%esi
            0,08 │       movzwl 0x4000(%ebx),%ecx
            0,04 │       add    %edi,%edi
                 │       mov    0x0,%ebx
            0,51 │       mov    %edi,-0x1c(%ebp)
                 │       mov    %ebx,-0x14(%ebp)
                 │       movl   $0x0,-0x10(%ebp)
                 │       lea    0x20(%edx,%eax,2),%eax
                 │       mov    %eax,-0x18(%ebp)
                 │       xchg   %ax,%ax
            0,04 │ a0:   mov    0x8(%ebp),%ebx
                 │       mov    -0x18(%ebp),%eax
            0,04 │       movzbl (%ebx),%edx
            0,16 │       test   %dl,%dl
            0,04 │     ↓ je     128
            0,08 │       lea    0x0(%esi),%esi
            1,61 │ b0:┌─→shl    $0x4,%edx
            1,02 │    │  mov    -0x14(%ebp),%edi
            2,04 │    │  add    -0x10(%ebp),%edx
            2,24 │    │  lea    0x1(%ebx),%ebx
            0,27 │    │  movzbl (%edi,%edx,1),%edx
            9,92 │    │  mov    %esi,%edi
            0,39 │    │  test   %dl,%dl
            2,04 │    │  cmovns %ecx,%edi
            4,63 │    │  test   $0x40,%dl
            0,55 │    │  mov    %di,(%eax)
            3,76 │    │  mov    %esi,%edi
            0,71 │    │  cmove  %ecx,%edi
            3,41 │    │  test   $0x20,%dl
            0,75 │    │  mov    %di,0x2(%eax)
            2,43 │    │  mov    %esi,%edi
            0,59 │    │  cmove  %ecx,%edi
            4,59 │    │  test   $0x10,%dl
            0,67 │    │  mov    %di,0x4(%eax)
            2,55 │    │  mov    %esi,%edi
            0,78 │    │  cmove  %ecx,%edi
            4,31 │    │  test   $0x8,%dl
            0,67 │    │  mov    %di,0x6(%eax)
            5,76 │    │  mov    %esi,%edi
            1,80 │    │  cmove  %ecx,%edi
            4,20 │    │  test   $0x4,%dl
            0,86 │    │  mov    %di,0x8(%eax)
            2,98 │    │  mov    %esi,%edi
            1,37 │    │  cmove  %ecx,%edi
            4,67 │    │  test   $0x2,%dl
            0,20 │    │  mov    %di,0xa(%eax)
            2,78 │    │  mov    %esi,%edi
            0,75 │    │  cmove  %ecx,%edi
            3,92 │    │  and    $0x1,%edx
            0,75 │    │  mov    %esi,%edx
            2,59 │    │  mov    %di,0xc(%eax)
            0,59 │    │  cmove  %ecx,%edx
            3,10 │    │  mov    %dx,0xe(%eax)
            2,39 │    │  add    $0x10,%eax
            0,51 │    │  movzbl (%ebx),%edx
            2,86 │    │  test   %dl,%dl
            2,31 │    └──jne    b0
            0,04 │128:   addl   $0x1,-0x10(%ebp)
            4,00 │       mov    -0x1c(%ebp),%eax
            0,04 │       add    %eax,-0x18(%ebp)
            0,08 │       cmpl   $0x10,-0x10(%ebp)
                 │     ↑ jne    a0
      and this code almost goes away from the profile:
          # cmdline : /home/kirr/local/perf/bin/perf record -g -a sleep 20
          # Samples: 49K of event 'cycles'
          # Event count (approx.): 16799780016
          #
          # Overhead          Command         Shared Object                                                           Symbol
          # ........  ...............  ....................
          #
              27.51%             rawv  libc-2.13.so          [.] __memcpy_ssse3
              23.77%           vivi-*  [kernel.kallsyms]     [k] memcpy
               9.96%             Xorg  [unknown]             [.] 0xa76f5e12
               4.94%           vivi-*  [vivi]                [k] gen_text.constprop.6
               4.44%             rawv  [vivi]                [k] gen_twopix
               3.17%           vivi-*  [vivi]                [k] vivi_fillbuff
               2.45%             rawv  [vivi]                [k] precalculate_line
               1.20%          swapper  [kernel.kallsyms]     [k] read_hpet
      i.e. gen_twopix() overhead dropped from 49% to 4%, gen_text() loops
      dropped from ~8% to ~4%, and the overall cycle count dropped from
      31551930117 to 16799780016, which is a ~1.9x whole-workload speedup.
      (*) For RGB24 rendering I've introduced x24, which can be thought of as
          a synthetic u24 to simplify the code. That's done because when
          memcpy() is used for the conditional assignment, gcc generates
          suboptimal code with more indirections.
          Fortunately, in C struct assignment is built in, and that's all we
          need from the pixel type for font rendering.
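      A minimal sketch of that idea (field names are illustrative):
          /* synthetic 3-byte "u24": struct assignment compiles to plain
           * loads/stores, where a 3-byte memcpy() would not */
          typedef struct { u8 b0, b1, b2; } x24;

          x24 *row = (x24 *)pos;
          row[0] = (chr & 0x80) ? white24 : black24;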
      Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
      Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
  2. 21 December 2012 (1 commit)
  3. 26 November 2012 (1 commit)
  4. 14 November 2012 (1 commit)
  5. 29 October 2012 (1 commit)
  6. 27 September 2012 (1 commit)
  7. 26 September 2012 (1 commit)
  8. 15 September 2012 (1 commit)
  9. 16 August 2012 (1 commit)
  10. 12 August 2012 (3 commits)
  11. 31 July 2012 (2 commits)
  12. 07 July 2012 (4 commits)
  13. 12 June 2012 (1 commit)
  14. 15 May 2012 (2 commits)
  15. 14 May 2012 (1 commit)
  16. 20 April 2012 (1 commit)
  17. 11 April 2012 (1 commit)
  18. 27 March 2012 (1 commit)
  19. 15 February 2012 (2 commits)
  20. 27 January 2012 (1 commit)
  21. 24 January 2012 (1 commit)
  22. 04 November 2011 (1 commit)
  23. 21 September 2011 (1 commit)
  24. 07 September 2011 (3 commits)
  25. 28 July 2011 (3 commits)
    • [media] vivi: Fix sleep-in-atomic-context · 1de5be5e
      Committed by Hans Verkuil
      Fix sleep-in-atomic-context bug in vivi:
      
      Jun 28 18:14:39 tschai kernel: [   80.970478] BUG: sleeping function called from invalid context at kernel/mutex.c:271
      Jun 28 18:14:39 tschai kernel: [   80.970483] in_atomic(): 0, irqs_disabled(): 1, pid: 2854, name: vivi-000
      Jun 28 18:14:39 tschai kernel: [   80.970485] INFO: lockdep is turned off.
      Jun 28 18:14:39 tschai kernel: [   80.970486] irq event stamp: 0
      Jun 28 18:14:39 tschai kernel: [   80.970487] hardirqs last  enabled at (0): [<          (null)>]           (null)
      Jun 28 18:14:39 tschai kernel: [   80.970490] hardirqs last disabled at (0): [<ffffffff8109a90b>] copy_process+0x61b/0x1440
      Jun 28 18:14:39 tschai kernel: [   80.970495] softirqs last  enabled at (0): [<ffffffff8109a90b>] copy_process+0x61b/0x1440
      Jun 28 18:14:39 tschai kernel: [   80.970498] softirqs last disabled at (0): [<          (null)>]           (null)
      Jun 28 18:14:39 tschai kernel: [   80.970502] Pid: 2854, comm: vivi-000 Tainted: P            3.0.0-rc1-tschai #372
      Jun 28 18:14:39 tschai kernel: [   80.970504] Call Trace:
      Jun 28 18:14:39 tschai kernel: [   80.970509]  [<ffffffff81089be3>] __might_sleep+0xf3/0x130
      Jun 28 18:14:39 tschai kernel: [   80.970512]  [<ffffffff8176967f>] mutex_lock_nested+0x2f/0x60
      Jun 28 18:14:39 tschai kernel: [   80.970517]  [<ffffffffa0acee3e>] vivi_fillbuff+0x20e/0x3f0 [vivi]
      Jun 28 18:14:39 tschai kernel: [   80.970520]  [<ffffffff81407004>] ? do_raw_spin_lock+0x54/0x150
      Jun 28 18:14:39 tschai kernel: [   80.970524]  [<ffffffff8104ef5e>] ? read_tsc+0xe/0x20
      Jun 28 18:14:39 tschai kernel: [   80.970528]  [<ffffffff810c9d87>] ? getnstimeofday+0x57/0xe0
      Jun 28 18:14:39 tschai kernel: [   80.970531]  [<ffffffffa0acf1b1>] vivi_thread+0x191/0x2f0 [vivi]
      Jun 28 18:14:39 tschai kernel: [   80.970534]  [<ffffffff81093aa0>] ? try_to_wake_up+0x2d0/0x2d0
      Jun 28 18:14:39 tschai kernel: [   80.970537]  [<ffffffffa0acf020>] ? vivi_fillbuff+0x3f0/0x3f0 [vivi]
      Jun 28 18:14:39 tschai kernel: [   80.970541]  [<ffffffff810bff46>] kthread+0xb6/0xc0
      Jun 28 18:14:39 tschai kernel: [   80.970544]  [<ffffffff817743e4>] kernel_thread_helper+0x4/0x10
      Jun 28 18:14:39 tschai kernel: [   80.970547]  [<ffffffff8176b4d4>] ? retint_restore_args+0x13/0x13
      Jun 28 18:14:39 tschai kernel: [   80.970550]  [<ffffffff810bfe90>] ? __init_kthread_worker+0x70/0x70
      Jun 28 18:14:39 tschai kernel: [   80.970552]  [<ffffffff817743e0>] ? gs_change+0x13/0x13
      
      This bug was introduced in 2.6.39.
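      The trace boils down to the classic anti-pattern behind this bug class
      (a generic illustration, not the actual vivi code): vivi_fillbuff() ends
      up in mutex_lock_nested() while interrupts are disabled, i.e. something
      like
          spin_lock_irqsave(&dev->slock, flags); /* atomic: must not sleep */
          ...
          mutex_lock(&dev->mutex);               /* BUG: mutex_lock() may sleep */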
      Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • [media] v4l2-event/ctrls/fh: allocate events per fh and per type instead of just per-fh · f1e393de
      Committed by Hans Verkuil
      The driver had to decide how many events to allocate when the v4l2_fh struct
      was created. It was possible to add more events afterwards, but there was no
      way to ensure that you wouldn't miss important events if the event queue for
      that filehandle filled up.
      
      In addition, once there were no more free events, any new events were simply
      dropped on the floor.
      
      For the control event in particular this made life very difficult, since
      control status/value changes could simply be missed if the number of
      allocated events, and the speed at which the application read events,
      were too low to keep up with the number of generated events. The
      application would have no idea what the latest state of a control was,
      since it could have missed the latest control change.
      
      So this patch makes some major changes in how events are allocated. Instead
      of allocating events per filehandle, they are now allocated when subscribing
      to an event: for that particular event type, N events (with N determined by
      the driver) are allocated and reserved for that event type. This ensures
      that you will not miss events of a particular type altogether.
      
      In addition, if there are N events in use and a new event is raised, then
      the oldest event is dropped and the new one is added. So the latest event
      is always available.
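      A minimal sketch of that scheme (not the actual v4l2 structures; the
      names here are illustrative):
          /* each subscription owns a fixed ring of n_elems events;
           * raising into a full ring overwrites the oldest entry */
          struct sub_ring {
                  struct v4l2_event *events; /* n_elems preallocated entries */
                  unsigned n_elems;          /* N, chosen by the driver */
                  unsigned first;            /* index of the oldest event */
                  unsigned in_use;           /* number of queued events */
          };

          static void sub_ring_raise(struct sub_ring *r,
                                     const struct v4l2_event *ev)
          {
                  if (r->in_use == r->n_elems) {
                          /* full: drop the oldest so the newest always fits */
                          r->first = (r->first + 1) % r->n_elems;
                          r->in_use--;
                  }
                  r->events[(r->first + r->in_use) % r->n_elems] = *ev;
                  r->in_use++;
          }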
      
      This can be further improved by adding the ability to merge the state of
      two events together, ensuring that no data is lost at all. This will be
      added in the next patch.
      
      This also makes it possible to allow the user to determine the number of
      events that will be allocated. This is not implemented at the moment, but
      would be trivial.
      Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>