bug-hunting.rst 8.1 KB
Newer Older
1 2
Bug hunting
+++++++++++
3 4 5 6 7 8 9 10 11 12 13 14 15 16

Last updated: 20 December 2005

Introduction
============

Always try the latest kernel from kernel.org and build from source. If you are
not confident in doing that please report the bug to your distribution vendor
instead of to a kernel developer.

Finding bugs is not always easy. Have a go though. If you can't find it don't
give up. Report as much as you have found to the relevant maintainer. See
MAINTAINERS for who that is for the subsystem you have worked on.

17 18
Before you submit a bug report read
:ref:`Documentation/REPORTING-BUGS <reportingbugs>`.
19 20 21 22 23 24 25 26 27 28 29 30

Devices not appearing
=====================

Often this is caused by udev. Check that first before blaming it on the
kernel.

Finding patch that caused a bug
===============================



31 32
Finding using ``git-bisect``
----------------------------
33

34 35
Using the provided tools with ``git`` makes finding bugs easy provided the bug
is reproducible.
36 37

Steps to do it:
38

39
- start using git for the kernel source
40
- read the man page for ``git-bisect``
41 42 43 44 45
- have fun

Finding it the old way
----------------------

L
Linus Torvalds 已提交
46 47
[Sat Mar  2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)]

48
This is how to track down a bug if you know nothing about kernel hacking.
L
Linus Torvalds 已提交
49 50 51 52
It's a brute force approach but it works pretty well.

You need:

53 54
        - A reproducible bug - it has to happen predictably (sorry)
        - All the kernel tar files from a revision that worked to the
L
Linus Torvalds 已提交
55 56 57 58
          revision that doesn't

You will then do:

59 60
        - Rebuild a revision that you believe works, install, and verify that.
        - Do a binary search over the kernels to figure out which one
61
          introduced the bug.  I.e., suppose 1.3.28 didn't have the bug, but
L
Linus Torvalds 已提交
62 63 64
          you know that 1.3.69 does.  Pick a kernel in the middle and build
          that, like 1.3.50.  Build & test; if it works, pick the mid point
          between .50 and .69, else the mid point between .28 and .50.
65
        - You'll narrow it down to the kernel that introduced the bug.  You
66
          can probably do better than this but it gets tricky.
L
Linus Torvalds 已提交
67

68
        - Narrow it down to a subdirectory
L
Linus Torvalds 已提交
69 70 71 72 73 74 75

          - Copy kernel that works into "test".  Let's say that 3.62 works,
            but 3.63 doesn't.  So you diff -r those two kernels and come
            up with a list of directories that changed.  For each of those
            directories:

                Copy the non-working directory next to the working directory
76
                as "dir.63".
L
Linus Torvalds 已提交
77
                One directory at time, try moving the working directory to
78
                "dir.62" and mv dir.63 dir"time, try::
L
Linus Torvalds 已提交
79 80 81 82 83 84

                        mv dir dir.62
                        mv dir.63 dir
                        find dir -name '*.[oa]' -print | xargs rm -f

                And then rebuild and retest.  Assuming that all related
85 86
                changes were contained in the sub directory, this should
                isolate the change to a directory.
L
Linus Torvalds 已提交
87 88

                Problems: changes in header files may have occurred; I've
89
                found in my case that they were self explanatory - you may
L
Linus Torvalds 已提交
90 91
                or may not want to give up when that happens.

92
        - Narrow it down to a file
L
Linus Torvalds 已提交
93 94

          - You can apply the same technique to each file in the directory,
95 96
            hoping that the changes in that file are self contained.

97
        - Narrow it down to a routine
L
Linus Torvalds 已提交
98 99

          - You can take the old file and the new file and manually create
100
            a merged file that has::
L
Linus Torvalds 已提交
101 102 103 104 105 106 107 108 109 110 111 112 113 114

                #ifdef VER62
                routine()
                {
                        ...
                }
                #else
                routine()
                {
                        ...
                }
                #endif

            And then walk through that file, one routine at a time and
115
            prefix it with::
L
Linus Torvalds 已提交
116 117 118 119 120 121 122 123 124

                #define VER62
                /* both routines here */
                #undef VER62

            Then recompile, retest, move the ifdefs until you find the one
            that makes the difference.

Finally, you take all the info that you have, kernel revisions, bug
125
description, the extent to which you have narrowed it down, and pass
L
Linus Torvalds 已提交
126 127 128 129 130 131 132 133 134 135 136 137
that off to whomever you believe is the maintainer of that section.
A post to linux.dev.kernel isn't such a bad idea if you've done some
work to narrow it down.

If you get it down to a routine, you'll probably get a fix in 24 hours.

My apologies to Linus and the other kernel hackers for describing this
brute force approach, it's hardly what a kernel hacker would do.  However,
it does work and it lets non-hackers help fix bugs.  And it is cool
because Linux snapshots will let you do this - something that you can't
do with vendor supplied releases.

138 139 140 141 142 143 144 145 146 147
Fixing the bug
==============

Nobody is going to tell you how to fix bugs. Seriously. You need to work it
out. But below are some hints on how to use the tools.

To debug a kernel, use objdump and look for the hex offset from the crash
output to find the valid line of code/assembler. Without debug symbols, you
will see the assembler code for the routine shown, but if your kernel has
debug symbols the C code will also be available. (Debug symbols can be enabled
148
in the kernel hacking menu of the menu configuration.) For example::
149 150 151

    objdump -r -S -l --disassemble net/dccp/ipv4.o

152 153 154 155
.. note::

   You need to be at the top level of the kernel tree for this to pick up
   your C files.
156 157

If you don't have access to the code you can also debug on some crash dumps
158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186
e.g. crash dump output as shown by Dave Miller::

     EIP is at ip_queue_xmit+0x14/0x4c0
      ...
     Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00
     00 00 55 57  56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08
     <8b> 83 3c 01 00 00 89 44  24 14 8b 45 28 85 c0 89 44 24 18 0f 85

     Put the bytes into a "foo.s" file like this:

            .text
            .globl foo
     foo:
            .byte  .... /* bytes from Code: part of OOPS dump */

     Compile it with "gcc -c -o foo.o foo.s" then look at the output of
     "objdump --disassemble foo.o".

     Output:

     ip_queue_xmit:
         push       %ebp
         push       %edi
         push       %esi
         push       %ebx
         sub        $0xbc, %esp
         mov        0xd0(%esp), %ebp        ! %ebp = arg0 (skb)
         mov        0x8(%ebp), %ebx         ! %ebx = skb->sk
         mov        0x13c(%ebx), %eax       ! %eax = inet_sk(sk)->opt
187

188
In addition, you can use GDB to figure out the exact file and line
189 190 191
number of the OOPS from the ``vmlinux`` file. If you have
``CONFIG_DEBUG_INFO`` enabled, you can simply copy the EIP value from the
OOPS::
192 193 194

 EIP:    0060:[<c021e50e>]    Not tainted VLI

195
And use GDB to translate that to human-readable form::
196 197 198 199

  gdb vmlinux
  (gdb) l *0xc021e50e

200 201
If you don't have ``CONFIG_DEBUG_INFO`` enabled, you use the function
offset from the OOPS::
202 203 204

 EIP is at vt_ioctl+0xda8/0x1482

205
And recompile the kernel with ``CONFIG_DEBUG_INFO`` enabled::
206 207 208 209 210

  make vmlinux
  gdb vmlinux
  (gdb) p vt_ioctl
  (gdb) l *(0x<address of vt_ioctl> + 0xda8)
211 212 213

or, as one command::

214 215
  (gdb) l *(vt_ioctl + 0xda8)

216 217 218 219 220 221 222 223
If you have a call trace, such as::

     Call Trace:
      [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5
      [<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e
      [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee
      ...

224
this shows the problem in the :jbd: module. You can load that module in gdb
225 226
and list the relevant code::

227 228 229
  gdb fs/jbd/jbd.ko
  (gdb) p log_wait_commit
  (gdb) l *(0x<address> + 0xa3)
230 231 232

or::

233 234
  (gdb) l *(log_wait_commit + 0xa3)

235

236 237 238
Another very useful option of the Kernel Hacking section in menuconfig is
Debug memory allocations. This will help you see whether data has been
initialised and not set before use etc. To see the values that get assigned
239 240 241
with this look at ``mm/slab.c`` and search for ``POISON_INUSE``. When using
this an Oops will often show the poisoned data instead of zero which is the
default.
242 243 244 245 246

Once you have worked out a fix please submit it upstream. After all open
source is about sharing what you do and don't you want to be recognised for
your genius?

247 248
Please do read :ref:`Documentation/SubmittingPatches <submittingpatches>`
though to help your code get accepted.