diff --git a/Documentation/coccinelle.txt b/Documentation/dev-tools/coccinelle.rst similarity index 56% rename from Documentation/coccinelle.txt rename to Documentation/dev-tools/coccinelle.rst index 01fb1dae3163ca569045b2ce2a3c2b395bb7d217..4a64b4c69d3f9590733b8b5e23899449748fcabf 100644 --- a/Documentation/coccinelle.txt +++ b/Documentation/dev-tools/coccinelle.rst @@ -1,10 +1,18 @@ -Copyright 2010 Nicolas Palix -Copyright 2010 Julia Lawall -Copyright 2010 Gilles Muller +.. Copyright 2010 Nicolas Palix +.. Copyright 2010 Julia Lawall +.. Copyright 2010 Gilles Muller +.. highlight:: none - Getting Coccinelle -~~~~~~~~~~~~~~~~~~~~ +Coccinelle +========== + +Coccinelle is a tool for pattern matching and text transformation that has +many uses in kernel development, including the application of complex, +tree-wide patches and detection of problematic programming patterns. + +Getting Coccinelle +------------------- The semantic patches included in the kernel use features and options which are provided by Coccinelle version 1.0.0-rc11 and above. @@ -22,24 +30,23 @@ of many distributions, e.g. : - NetBSD - FreeBSD - You can get the latest version released from the Coccinelle homepage at http://coccinelle.lip6.fr/ Information and tips about Coccinelle are also provided on the wiki pages at http://cocci.ekstranet.diku.dk/wiki/doku.php -Once you have it, run the following command: +Once you have it, run the following command:: ./configure make -as a regular user, and install it with +as a regular user, and install it with:: sudo make install - Supplemental documentation -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Supplemental documentation +--------------------------- For supplemental documentation refer to the wiki: @@ -47,49 +54,52 @@ https://bottest.wiki.kernel.org/coccicheck The wiki documentation always refers to the linux-next version of the script. 
- Using Coccinelle on the Linux kernel -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Using Coccinelle on the Linux kernel +------------------------------------ A Coccinelle-specific target is defined in the top level -Makefile. This target is named 'coccicheck' and calls the 'coccicheck' -front-end in the 'scripts' directory. +Makefile. This target is named ``coccicheck`` and calls the ``coccicheck`` +front-end in the ``scripts`` directory. -Four basic modes are defined: patch, report, context, and org. The mode to -use is specified by setting the MODE variable with 'MODE='. +Four basic modes are defined: ``patch``, ``report``, ``context``, and +``org``. The mode to use is specified by setting the MODE variable with +``MODE=``. -'patch' proposes a fix, when possible. +- ``patch`` proposes a fix, when possible. -'report' generates a list in the following format: +- ``report`` generates a list in the following format: file:line:column-column: message -'context' highlights lines of interest and their context in a -diff-like style.Lines of interest are indicated with '-'. +- ``context`` highlights lines of interest and their context in a + diff-like style. Lines of interest are indicated with ``-``. -'org' generates a report in the Org mode format of Emacs. +- ``org`` generates a report in the Org mode format of Emacs. Note that not all semantic patches implement all modes. For easy use of Coccinelle, the default mode is "report". Two other modes provide some common combinations of these modes. -'chain' tries the previous modes in the order above until one succeeds. +- ``chain`` tries the previous modes in the order above until one succeeds. + +- ``rep+ctxt`` runs successively the report mode and the context mode. + It should be used with the C option (described later) + which checks the code on a file basis. -'rep+ctxt' runs successively the report mode and the context mode. - It should be used with the C option (described later) - which checks the code on a file basis. 
+Examples +~~~~~~~~ -Examples: - To make a report for every semantic patch, run the following command: +To make a report for every semantic patch, run the following command:: make coccicheck MODE=report - To produce patches, run: +To produce patches, run:: make coccicheck MODE=patch The coccicheck target applies every semantic patch available in the -sub-directories of 'scripts/coccinelle' to the entire Linux kernel. +sub-directories of ``scripts/coccinelle`` to the entire Linux kernel. For each semantic patch, a commit message is proposed. It gives a description of the problem being checked by the semantic patch, and @@ -99,15 +109,15 @@ As any static code analyzer, Coccinelle produces false positives. Thus, reports must be carefully checked, and patches reviewed. -To enable verbose messages set the V= variable, for example: +To enable verbose messages set the V= variable, for example:: make coccicheck MODE=report V=1 - Coccinelle parallelization -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Coccinelle parallelization +--------------------------- By default, coccicheck tries to run as parallel as possible. To change -the parallelism, set the J= variable. For example, to run across 4 CPUs: +the parallelism, set the J= variable. For example, to run across 4 CPUs:: make coccicheck MODE=report J=4 @@ -115,44 +125,47 @@ As of Coccinelle 1.0.2 Coccinelle uses Ocaml parmap for parallelization, if support for this is detected you will benefit from parmap parallelization. When parmap is enabled coccicheck will enable dynamic load balancing by using -'--chunksize 1' argument, this ensures we keep feeding threads with work +``--chunksize 1`` argument, this ensures we keep feeding threads with work one by one, so that we avoid the situation where most work gets done by only a few threads. With dynamic load balancing, if a thread finishes early we keep feeding it more work. 
When parmap is enabled, if an error occurs in Coccinelle, this error -value is propagated back, the return value of the 'make coccicheck' +value is propagated back, the return value of the ``make coccicheck`` captures this return value. - Using Coccinelle with a single semantic patch -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Using Coccinelle with a single semantic patch +--------------------------------------------- The optional make variable COCCI can be used to check a single semantic patch. In that case, the variable must be initialized with the name of the semantic patch to apply. -For instance: +For instance:: make coccicheck COCCI= MODE=patch -or + +or:: + make coccicheck COCCI= MODE=report - Controlling Which Files are Processed by Coccinelle -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Controlling Which Files are Processed by Coccinelle +--------------------------------------------------- + By default the entire kernel source tree is checked. -To apply Coccinelle to a specific directory, M= can be used. -For example, to check drivers/net/wireless/ one may write: +To apply Coccinelle to a specific directory, ``M=`` can be used. +For example, to check drivers/net/wireless/ one may write:: make coccicheck M=drivers/net/wireless/ To apply Coccinelle on a file basis, instead of a directory basis, the -following command may be used: +following command may be used:: make C=1 CHECK="scripts/coccicheck" -To check only newly edited code, use the value 2 for the C flag, i.e. +To check only newly edited code, use the value 2 for the C flag, i.e.:: make C=2 CHECK="scripts/coccicheck" @@ -166,8 +179,8 @@ semantic patch as shown in the previous section. The "report" mode is the default. You can select another one with the MODE variable explained above. 
- Debugging Coccinelle SmPL patches -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Debugging Coccinelle SmPL patches +--------------------------------- Using coccicheck is best as it provides in the spatch command line include options matching the options used when we compile the kernel. @@ -177,8 +190,8 @@ manually run Coccinelle with debug options added. Alternatively you can debug running Coccinelle against SmPL patches by asking for stderr to be redirected to stderr, by default stderr is redirected to /dev/null, if you'd like to capture stderr you -can specify the DEBUG_FILE="file.txt" option to coccicheck. For -instance: +can specify the ``DEBUG_FILE="file.txt"`` option to coccicheck. For +instance:: rm -f cocci.err make coccicheck COCCI=scripts/coccinelle/free/kfree.cocci MODE=report DEBUG_FILE=cocci.err @@ -186,7 +199,7 @@ instance: You can use SPFLAGS to add debugging flags, for instance you may want to add both --profile --show-trying to SPFLAGS when debugging. For instance -you may want to use: +you may want to use:: rm -f err.log export COCCI=scripts/coccinelle/misc/irqf_oneshot.cocci @@ -198,24 +211,24 @@ work. DEBUG_FILE support is only supported when using coccinelle >= 1.2. 
- .cocciconfig support -~~~~~~~~~~~~~~~~~~~~~~ +.cocciconfig support +-------------------- Coccinelle supports reading .cocciconfig for default Coccinelle options that should be used every time spatch is spawned, the order of precedence for variables for .cocciconfig is as follows: - o Your current user's home directory is processed first - o Your directory from which spatch is called is processed next - o The directory provided with the --dir option is processed last, if used +- Your current user's home directory is processed first +- Your directory from which spatch is called is processed next +- The directory provided with the --dir option is processed last, if used Since coccicheck runs through make, it naturally runs from the kernel proper dir, as such the second rule above would be implied for picking up a -.cocciconfig when using 'make coccicheck'. +.cocciconfig when using ``make coccicheck``. -'make coccicheck' also supports using M= targets.If you do not supply +``make coccicheck`` also supports using M= targets. If you do not supply any M= target, it is assumed you want to target the entire kernel. -The kernel coccicheck script has: +The kernel coccicheck script has:: if [ "$KBUILD_EXTMOD" = "" ] ; then OPTIONS="--dir $srctree $COCCIINCLUDE" @@ -235,12 +248,12 @@ override any of the kernel's .coccicheck's settings using SPFLAGS. We help Coccinelle when used against Linux with a set of sensible defaults options for Linux with our own Linux .cocciconfig. This hints to coccinelle -git can be used for 'git grep' queries over coccigrep. A timeout of 200 +git can be used for ``git grep`` queries over coccigrep. A timeout of 200 seconds should suffice for now. 
The options picked up by coccinelle when reading a .cocciconfig do not appear as arguments to spatch processes running on your system, to confirm what -options will be used by Coccinelle run: +options will be used by Coccinelle run:: spatch --print-options-only @@ -252,219 +265,227 @@ carries its own .cocciconfig, you will need to use SPFLAGS to use idutils if desired. See below section "Additional flags" for more details on how to use idutils. - Additional flags -~~~~~~~~~~~~~~~~~~ +Additional flags +---------------- Additional flags can be passed to spatch through the SPFLAGS variable. This works as Coccinelle respects the last flags -given to it when options are in conflict. +given to it when options are in conflict. :: make SPFLAGS=--use-glimpse coccicheck Coccinelle supports idutils as well but requires coccinelle >= 1.0.6. When no ID file is specified coccinelle assumes your ID database file is in the file .id-utils.index on the top level of the kernel, coccinelle -carries a script scripts/idutils_index.sh which creates the database with +carries a script scripts/idutils_index.sh which creates the database with:: mkid -i C --output .id-utils.index If you have another database filename you can also just symlink with this -name. +name. :: make SPFLAGS=--use-idutils coccicheck Alternatively you can specify the database filename explicitly, for -instance: +instance:: make SPFLAGS="--use-idutils /full-path/to/ID" coccicheck -See spatch --help to learn more about spatch options. +See ``spatch --help`` to learn more about spatch options. -Note that the '--use-glimpse' and '--use-idutils' options +Note that the ``--use-glimpse`` and ``--use-idutils`` options require external tools for indexing the code. None of them is thus active by default. However, by indexing the code with one of these tools, and according to the cocci file used, spatch could proceed the entire code base more quickly. 
- SmPL patch specific options -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +SmPL patch specific options +--------------------------- SmPL patches can have their own requirements for options passed to Coccinelle. SmPL patch specific options can be provided by -providing them at the top of the SmPL patch, for instance: +providing them at the top of the SmPL patch, for instance:: -// Options: --no-includes --include-headers + // Options: --no-includes --include-headers - SmPL patch Coccinelle requirements -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +SmPL patch Coccinelle requirements +---------------------------------- As Coccinelle features get added some more advanced SmPL patches may require newer versions of Coccinelle. If an SmPL patch requires at least a version of Coccinelle, this can be specified as follows, -as an example if requiring at least Coccinelle >= 1.0.5: +as an example if requiring at least Coccinelle >= 1.0.5:: -// Requires: 1.0.5 + // Requires: 1.0.5 - Proposing new semantic patches -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Proposing new semantic patches +------------------------------- New semantic patches can be proposed and submitted by kernel developers. For sake of clarity, they should be organized in the -sub-directories of 'scripts/coccinelle/'. +sub-directories of ``scripts/coccinelle/``. - Detailed description of the 'report' mode -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Detailed description of the ``report`` mode +------------------------------------------- + +``report`` generates a list in the following format:: -'report' generates a list in the following format: file:line:column-column: message -Example: +Example +~~~~~~~ -Running +Running:: make coccicheck MODE=report COCCI=scripts/coccinelle/api/err_cast.cocci -will execute the following part of the SmPL script. 
+will execute the following part of the SmPL script:: - -@r depends on !context && !patch && (org || report)@ -expression x; -position p; -@@ + + @r depends on !context && !patch && (org || report)@ + expression x; + position p; + @@ - ERR_PTR@p(PTR_ERR(x)) + ERR_PTR@p(PTR_ERR(x)) -@script:python depends on report@ -p << r.p; -x << r.x; -@@ + @script:python depends on report@ + p << r.p; + x << r.x; + @@ -msg="ERR_CAST can be used with %s" % (x) -coccilib.report.print_report(p[0], msg) - + msg="ERR_CAST can be used with %s" % (x) + coccilib.report.print_report(p[0], msg) + This SmPL excerpt generates entries on the standard output, as -illustrated below: +illustrated below:: -/home/user/linux/crypto/ctr.c:188:9-16: ERR_CAST can be used with alg -/home/user/linux/crypto/authenc.c:619:9-16: ERR_CAST can be used with auth -/home/user/linux/crypto/xts.c:227:9-16: ERR_CAST can be used with alg + /home/user/linux/crypto/ctr.c:188:9-16: ERR_CAST can be used with alg + /home/user/linux/crypto/authenc.c:619:9-16: ERR_CAST can be used with auth + /home/user/linux/crypto/xts.c:227:9-16: ERR_CAST can be used with alg - Detailed description of the 'patch' mode -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Detailed description of the ``patch`` mode +------------------------------------------ -When the 'patch' mode is available, it proposes a fix for each problem +When the ``patch`` mode is available, it proposes a fix for each problem identified. -Example: +Example +~~~~~~~ + +Running:: -Running make coccicheck MODE=patch COCCI=scripts/coccinelle/api/err_cast.cocci -will execute the following part of the SmPL script. 
+will execute the following part of the SmPL script:: - -@ depends on !context && patch && !org && !report @ -expression x; -@@ + + @ depends on !context && patch && !org && !report @ + expression x; + @@ -- ERR_PTR(PTR_ERR(x)) -+ ERR_CAST(x) - + - ERR_PTR(PTR_ERR(x)) + + ERR_CAST(x) + This SmPL excerpt generates patch hunks on the standard output, as -illustrated below: +illustrated below:: -diff -u -p a/crypto/ctr.c b/crypto/ctr.c ---- a/crypto/ctr.c 2010-05-26 10:49:38.000000000 +0200 -+++ b/crypto/ctr.c 2010-06-03 23:44:49.000000000 +0200 -@@ -185,7 +185,7 @@ static struct crypto_instance *crypto_ct + diff -u -p a/crypto/ctr.c b/crypto/ctr.c + --- a/crypto/ctr.c 2010-05-26 10:49:38.000000000 +0200 + +++ b/crypto/ctr.c 2010-06-03 23:44:49.000000000 +0200 + @@ -185,7 +185,7 @@ static struct crypto_instance *crypto_ct alg = crypto_attr_alg(tb[1], CRYPTO_ALG_TYPE_CIPHER, CRYPTO_ALG_TYPE_MASK); if (IS_ERR(alg)) -- return ERR_PTR(PTR_ERR(alg)); -+ return ERR_CAST(alg); - + - return ERR_PTR(PTR_ERR(alg)); + + return ERR_CAST(alg); + /* Block size must be >= 4 bytes. */ err = -EINVAL; - Detailed description of the 'context' mode -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Detailed description of the ``context`` mode +-------------------------------------------- -'context' highlights lines of interest and their context +``context`` highlights lines of interest and their context in a diff-like style. -NOTE: The diff-like output generated is NOT an applicable patch. The - intent of the 'context' mode is to highlight the important lines - (annotated with minus, '-') and gives some surrounding context + **NOTE**: The diff-like output generated is NOT an applicable patch. The + intent of the ``context`` mode is to highlight the important lines + (annotated with minus, ``-``) and gives some surrounding context lines around. This output can be used with the diff mode of Emacs to review the code. 
-Example: +Example +~~~~~~~ + +Running:: -Running make coccicheck MODE=context COCCI=scripts/coccinelle/api/err_cast.cocci -will execute the following part of the SmPL script. +will execute the following part of the SmPL script:: - -@ depends on context && !patch && !org && !report@ -expression x; -@@ + + @ depends on context && !patch && !org && !report@ + expression x; + @@ -* ERR_PTR(PTR_ERR(x)) - + * ERR_PTR(PTR_ERR(x)) + This SmPL excerpt generates diff hunks on the standard output, as -illustrated below: +illustrated below:: -diff -u -p /home/user/linux/crypto/ctr.c /tmp/nothing ---- /home/user/linux/crypto/ctr.c 2010-05-26 10:49:38.000000000 +0200 -+++ /tmp/nothing -@@ -185,7 +185,6 @@ static struct crypto_instance *crypto_ct + diff -u -p /home/user/linux/crypto/ctr.c /tmp/nothing + --- /home/user/linux/crypto/ctr.c 2010-05-26 10:49:38.000000000 +0200 + +++ /tmp/nothing + @@ -185,7 +185,6 @@ static struct crypto_instance *crypto_ct alg = crypto_attr_alg(tb[1], CRYPTO_ALG_TYPE_CIPHER, CRYPTO_ALG_TYPE_MASK); if (IS_ERR(alg)) -- return ERR_PTR(PTR_ERR(alg)); - + - return ERR_PTR(PTR_ERR(alg)); + /* Block size must be >= 4 bytes. */ err = -EINVAL; - Detailed description of the 'org' mode -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Detailed description of the ``org`` mode +---------------------------------------- + +``org`` generates a report in the Org mode format of Emacs. -'org' generates a report in the Org mode format of Emacs. +Example +~~~~~~~ -Example: +Running:: -Running make coccicheck MODE=org COCCI=scripts/coccinelle/api/err_cast.cocci -will execute the following part of the SmPL script. 
+will execute the following part of the SmPL script:: - -@r depends on !context && !patch && (org || report)@ -expression x; -position p; -@@ + + @r depends on !context && !patch && (org || report)@ + expression x; + position p; + @@ - ERR_PTR@p(PTR_ERR(x)) + ERR_PTR@p(PTR_ERR(x)) -@script:python depends on org@ -p << r.p; -x << r.x; -@@ + @script:python depends on org@ + p << r.p; + x << r.x; + @@ -msg="ERR_CAST can be used with %s" % (x) -msg_safe=msg.replace("[","@(").replace("]",")") -coccilib.org.print_todo(p[0], msg_safe) - + msg="ERR_CAST can be used with %s" % (x) + msg_safe=msg.replace("[","@(").replace("]",")") + coccilib.org.print_todo(p[0], msg_safe) + This SmPL excerpt generates Org entries on the standard output, as -illustrated below: +illustrated below:: -* TODO [[view:/home/user/linux/crypto/ctr.c::face=ovl-face1::linb=188::colb=9::cole=16][ERR_CAST can be used with alg]] -* TODO [[view:/home/user/linux/crypto/authenc.c::face=ovl-face1::linb=619::colb=9::cole=16][ERR_CAST can be used with auth]] -* TODO [[view:/home/user/linux/crypto/xts.c::face=ovl-face1::linb=227::colb=9::cole=16][ERR_CAST can be used with alg]] + * TODO [[view:/home/user/linux/crypto/ctr.c::face=ovl-face1::linb=188::colb=9::cole=16][ERR_CAST can be used with alg]] + * TODO [[view:/home/user/linux/crypto/authenc.c::face=ovl-face1::linb=619::colb=9::cole=16][ERR_CAST can be used with auth]] + * TODO [[view:/home/user/linux/crypto/xts.c::face=ovl-face1::linb=227::colb=9::cole=16][ERR_CAST can be used with alg]] diff --git a/Documentation/dev-tools/gcov.rst b/Documentation/dev-tools/gcov.rst new file mode 100644 index 0000000000000000000000000000000000000000..19eedfea8800db38adb5382c571bb951e3206c4a --- /dev/null +++ b/Documentation/dev-tools/gcov.rst @@ -0,0 +1,256 @@ +Using gcov with the Linux kernel +================================ + +gcov profiling kernel support enables the use of GCC's coverage testing +tool gcov_ with the Linux kernel. 
Coverage data of a running kernel +is exported in gcov-compatible format via the "gcov" debugfs directory. +To get coverage data for a specific file, change to the kernel build +directory and use gcov with the ``-o`` option as follows (requires root):: + + # cd /tmp/linux-out + # gcov -o /sys/kernel/debug/gcov/tmp/linux-out/kernel spinlock.c + +This will create source code files annotated with execution counts +in the current directory. In addition, graphical gcov front-ends such +as lcov_ can be used to automate the process of collecting data +for the entire kernel and provide coverage overviews in HTML format. + +Possible uses: + +* debugging (has this line been reached at all?) +* test improvement (how do I change my test to cover these lines?) +* minimizing kernel configurations (do I need this option if the + associated code is never run?) + +.. _gcov: http://gcc.gnu.org/onlinedocs/gcc/Gcov.html +.. _lcov: http://ltp.sourceforge.net/coverage/lcov.php + + +Preparation +----------- + +Configure the kernel with:: + + CONFIG_DEBUG_FS=y + CONFIG_GCOV_KERNEL=y + +select the gcc's gcov format, default is autodetect based on gcc version:: + + CONFIG_GCOV_FORMAT_AUTODETECT=y + +and to get coverage data for the entire kernel:: + + CONFIG_GCOV_PROFILE_ALL=y + +Note that kernels compiled with profiling flags will be significantly +larger and run slower. Also CONFIG_GCOV_PROFILE_ALL may not be supported +on all architectures. + +Profiling data will only become accessible once debugfs has been +mounted:: + + mount -t debugfs none /sys/kernel/debug + + +Customization +------------- + +To enable profiling for specific files or directories, add a line +similar to the following to the respective kernel Makefile: + +- For a single file (e.g. 
main.o):: + + GCOV_PROFILE_main.o := y + +- For all files in one directory:: + + GCOV_PROFILE := y + +To exclude files from being profiled even when CONFIG_GCOV_PROFILE_ALL +is specified, use:: + + GCOV_PROFILE_main.o := n + +and:: + + GCOV_PROFILE := n + +Only files which are linked to the main kernel image or are compiled as +kernel modules are supported by this mechanism. + + +Files +----- + +The gcov kernel support creates the following files in debugfs: + +``/sys/kernel/debug/gcov`` + Parent directory for all gcov-related files. + +``/sys/kernel/debug/gcov/reset`` + Global reset file: resets all coverage data to zero when + written to. + +``/sys/kernel/debug/gcov/path/to/compile/dir/file.gcda`` + The actual gcov data file as understood by the gcov + tool. Resets file coverage data to zero when written to. + +``/sys/kernel/debug/gcov/path/to/compile/dir/file.gcno`` + Symbolic link to a static data file required by the gcov + tool. This file is generated by gcc when compiling with + option ``-ftest-coverage``. + + +Modules +------- + +Kernel modules may contain cleanup code which is only run during +module unload time. The gcov mechanism provides a means to collect +coverage data for such code by keeping a copy of the data associated +with the unloaded module. This data remains available through debugfs. +Once the module is loaded again, the associated coverage counters are +initialized with the data from its previous instantiation. + +This behavior can be deactivated by specifying the gcov_persist kernel +parameter:: + + gcov_persist=0 + +At run-time, a user can also choose to discard data for an unloaded +module by writing to its data file or the global reset file. + + +Separated build and test machines +--------------------------------- + +The gcov kernel profiling infrastructure is designed to work out-of-the +box for setups where kernels are built and run on the same machine. 
In +cases where the kernel runs on a separate machine, special preparations +must be made, depending on where the gcov tool is used: + +a) gcov is run on the TEST machine + + The gcov tool version on the test machine must be compatible with the + gcc version used for kernel build. Also the following files need to be + copied from build to test machine: + + from the source tree: + - all C source files + headers + + from the build tree: + - all C source files + headers + - all .gcda and .gcno files + - all links to directories + + It is important to note that these files need to be placed into the + exact same file system location on the test machine as on the build + machine. If any of the path components is symbolic link, the actual + directory needs to be used instead (due to make's CURDIR handling). + +b) gcov is run on the BUILD machine + + The following files need to be copied after each test case from test + to build machine: + + from the gcov directory in sysfs: + - all .gcda files + - all links to .gcno files + + These files can be copied to any location on the build machine. gcov + must then be called with the -o option pointing to that directory. + + Example directory setup on the build machine:: + + /tmp/linux: kernel source tree + /tmp/out: kernel build directory as specified by make O= + /tmp/coverage: location of the files copied from the test machine + + [user@build] cd /tmp/out + [user@build] gcov -o /tmp/coverage/tmp/out/init main.c + + +Troubleshooting +--------------- + +Problem + Compilation aborts during linker step. + +Cause + Profiling flags are specified for source files which are not + linked to the main kernel or which are linked by a custom + linker procedure. + +Solution + Exclude affected source files from profiling by specifying + ``GCOV_PROFILE := n`` or ``GCOV_PROFILE_basename.o := n`` in the + corresponding Makefile. + +Problem + Files copied from sysfs appear empty or incomplete. 
+ +Cause + Due to the way seq_file works, some tools such as cp or tar + may not correctly copy files from sysfs. + +Solution + Use ``cat`` to read ``.gcda`` files and ``cp -d`` to copy links. + Alternatively use the mechanism shown in Appendix B. + + +Appendix A: gather_on_build.sh +------------------------------ + +Sample script to gather coverage meta files on the build machine +(see 6a):: + + #!/bin/bash + + KSRC=$1 + KOBJ=$2 + DEST=$3 + + if [ -z "$KSRC" ] || [ -z "$KOBJ" ] || [ -z "$DEST" ]; then + echo "Usage: $0 " >&2 + exit 1 + fi + + KSRC=$(cd $KSRC; printf "all:\n\t@echo \${CURDIR}\n" | make -f -) + KOBJ=$(cd $KOBJ; printf "all:\n\t@echo \${CURDIR}\n" | make -f -) + + find $KSRC $KOBJ \( -name '*.gcno' -o -name '*.[ch]' -o -type l \) -a \ + -perm /u+r,g+r | tar cfz $DEST -P -T - + + if [ $? -eq 0 ] ; then + echo "$DEST successfully created, copy to test system and unpack with:" + echo " tar xfz $DEST -P" + else + echo "Could not create file $DEST" + fi + + +Appendix B: gather_on_test.sh +----------------------------- + +Sample script to gather coverage data files on the test machine +(see 6b):: + + #!/bin/bash -e + + DEST=$1 + GCDA=/sys/kernel/debug/gcov + + if [ -z "$DEST" ] ; then + echo "Usage: $0 " >&2 + exit 1 + fi + + TEMPDIR=$(mktemp -d) + echo Collecting data.. 
+ find $GCDA -type d -exec mkdir -p $TEMPDIR/\{\} \; + find $GCDA -name '*.gcda' -exec sh -c 'cat < $0 > '$TEMPDIR'/$0' {} \; + find $GCDA -name '*.gcno' -exec sh -c 'cp -d $0 '$TEMPDIR'/$0' {} \; + tar czf $DEST -C $TEMPDIR sys + rm -rf $TEMPDIR + + echo "$DEST successfully created, copy to build system and unpack with:" + echo " tar xfz $DEST" diff --git a/Documentation/gdb-kernel-debugging.txt b/Documentation/dev-tools/gdb-kernel-debugging.rst similarity index 73% rename from Documentation/gdb-kernel-debugging.txt rename to Documentation/dev-tools/gdb-kernel-debugging.rst index 7050ce8794b9a4b3dd93b76dd9e2a6d708b468ee..5e93c9bc6619da5d75a75179d858d60004e6f843 100644 --- a/Documentation/gdb-kernel-debugging.txt +++ b/Documentation/dev-tools/gdb-kernel-debugging.rst @@ -1,3 +1,5 @@ +.. highlight:: none + Debugging kernel and modules via gdb ==================================== @@ -13,54 +15,58 @@ be transferred to the other gdb stubs as well. Requirements ------------ - o gdb 7.2+ (recommended: 7.4+) with python support enabled (typically true - for distributions) +- gdb 7.2+ (recommended: 7.4+) with python support enabled (typically true + for distributions) Setup ----- - o Create a virtual Linux machine for QEMU/KVM (see www.linux-kvm.org and - www.qemu.org for more details). For cross-development, - http://landley.net/aboriginal/bin keeps a pool of machine images and - toolchains that can be helpful to start from. +- Create a virtual Linux machine for QEMU/KVM (see www.linux-kvm.org and + www.qemu.org for more details). For cross-development, + http://landley.net/aboriginal/bin keeps a pool of machine images and + toolchains that can be helpful to start from. - o Build the kernel with CONFIG_GDB_SCRIPTS enabled, but leave - CONFIG_DEBUG_INFO_REDUCED off. If your architecture supports - CONFIG_FRAME_POINTER, keep it enabled. +- Build the kernel with CONFIG_GDB_SCRIPTS enabled, but leave + CONFIG_DEBUG_INFO_REDUCED off. 
If your architecture supports + CONFIG_FRAME_POINTER, keep it enabled. - o Install that kernel on the guest. +- Install that kernel on the guest. + Alternatively, QEMU allows to boot the kernel directly using -kernel, + -append, -initrd command line switches. This is generally only useful if + you do not depend on modules. See QEMU documentation for more details on + this mode. - Alternatively, QEMU allows to boot the kernel directly using -kernel, - -append, -initrd command line switches. This is generally only useful if - you do not depend on modules. See QEMU documentation for more details on - this mode. +- Enable the gdb stub of QEMU/KVM, either - o Enable the gdb stub of QEMU/KVM, either - at VM startup time by appending "-s" to the QEMU command line - or + + or + - during runtime by issuing "gdbserver" from the QEMU monitor console - o cd /path/to/linux-build +- cd /path/to/linux-build - o Start gdb: gdb vmlinux +- Start gdb: gdb vmlinux - Note: Some distros may restrict auto-loading of gdb scripts to known safe - directories. In case gdb reports to refuse loading vmlinux-gdb.py, add + Note: Some distros may restrict auto-loading of gdb scripts to known safe + directories. In case gdb reports to refuse loading vmlinux-gdb.py, add:: add-auto-load-safe-path /path/to/linux-build - to ~/.gdbinit. See gdb help for more details. + to ~/.gdbinit. See gdb help for more details. + +- Attach to the booted guest:: - o Attach to the booted guest: (gdb) target remote :1234 Examples of using the Linux-provided gdb helpers ------------------------------------------------ - o Load module (and main kernel) symbols: +- Load module (and main kernel) symbols:: + (gdb) lx-symbols loading vmlinux scanning for modules in /home/user/linux/build @@ -72,17 +78,20 @@ Examples of using the Linux-provided gdb helpers ... 
loading @0xffffffffa0000000: /home/user/linux/build/drivers/ata/ata_generic.ko - o Set a breakpoint on some not yet loaded module function, e.g.: +- Set a breakpoint on some not yet loaded module function, e.g.:: + (gdb) b btrfs_init_sysfs Function "btrfs_init_sysfs" not defined. Make breakpoint pending on future shared library load? (y or [n]) y Breakpoint 1 (btrfs_init_sysfs) pending. - o Continue the target +- Continue the target:: + (gdb) c - o Load the module on the target and watch the symbols being loaded as well as - the breakpoint hit: +- Load the module on the target and watch the symbols being loaded as well as + the breakpoint hit:: + loading @0xffffffffa0034000: /home/user/linux/build/lib/libcrc32c.ko loading @0xffffffffa0050000: /home/user/linux/build/lib/lzo/lzo_compress.ko loading @0xffffffffa006e000: /home/user/linux/build/lib/zlib_deflate/zlib_deflate.ko @@ -91,7 +100,8 @@ Examples of using the Linux-provided gdb helpers Breakpoint 1, btrfs_init_sysfs () at /home/user/linux/fs/btrfs/sysfs.c:36 36 btrfs_kset = kset_create_and_add("btrfs", NULL, fs_kobj); - o Dump the log buffer of the target kernel: +- Dump the log buffer of the target kernel:: + (gdb) lx-dmesg [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu @@ -102,19 +112,22 @@ Examples of using the Linux-provided gdb helpers [ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved .... 
- o Examine fields of the current task struct: +- Examine fields of the current task struct:: + (gdb) p $lx_current().pid $1 = 4998 (gdb) p $lx_current().comm $2 = "modprobe\000\000\000\000\000\000\000" - o Make use of the per-cpu function for the current or a specified CPU: +- Make use of the per-cpu function for the current or a specified CPU:: + (gdb) p $lx_per_cpu("runqueues").nr_running $3 = 1 (gdb) p $lx_per_cpu("runqueues", 2).nr_running $4 = 0 - o Dig into hrtimers using the container_of helper: +- Dig into hrtimers using the container_of helper:: + (gdb) set $next = $lx_per_cpu("hrtimer_bases").clock_base[0].active.next (gdb) p *$container_of($next, "struct hrtimer", "node") $5 = { @@ -144,7 +157,7 @@ List of commands and functions ------------------------------ The number of commands and convenience functions may evolve over the time, -this is just a snapshot of the initial version: +this is just a snapshot of the initial version:: (gdb) apropos lx function lx_current -- Return current task diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst new file mode 100644 index 0000000000000000000000000000000000000000..f7a18f2743576ab1e96f6ddc6de2a329f1d40c16 --- /dev/null +++ b/Documentation/dev-tools/kasan.rst @@ -0,0 +1,173 @@ +The Kernel Address Sanitizer (KASAN) +==================================== + +Overview +-------- + +KernelAddressSANitizer (KASAN) is a dynamic memory error detector. It provides +a fast and comprehensive solution for finding use-after-free and out-of-bounds +bugs. + +KASAN uses compile-time instrumentation for checking every memory access, +therefore you will need a GCC version 4.9.2 or later. GCC 5.0 or later is +required for detection of out-of-bounds accesses to stack or global variables. + +Currently KASAN is supported only for the x86_64 and arm64 architectures. 
+
+Usage
+-----
+
+To enable KASAN, configure the kernel with::
+
+    CONFIG_KASAN = y
+
+and choose between CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE. Outline and
+inline are compiler instrumentation types. The former produces a smaller
+binary, while the latter is 1.1 to 2 times faster. Inline instrumentation
+requires GCC version 5.0 or later.
+
+KASAN works with both SLUB and SLAB memory allocators.
+For better bug detection and nicer reporting, enable CONFIG_STACKTRACE.
+
+To disable instrumentation for specific files or directories, add a line
+similar to the following to the respective kernel Makefile:
+
+- For a single file (e.g. main.o)::
+
+        KASAN_SANITIZE_main.o := n
+
+- For all files in one directory::
+
+        KASAN_SANITIZE := n
+
+Error reports
+~~~~~~~~~~~~~
+
+A typical out of bounds access report looks like this::
+
+    ==================================================================
+    BUG: AddressSanitizer: out of bounds access in kmalloc_oob_right+0x65/0x75 [test_kasan] at addr ffff8800693bc5d3
+    Write of size 1 by task modprobe/1689
+    =============================================================================
+    BUG kmalloc-128 (Not tainted): kasan error
+    -----------------------------------------------------------------------------
+
+    Disabling lock debugging due to kernel taint
+    INFO: Allocated in kmalloc_oob_right+0x3d/0x75 [test_kasan] age=0 cpu=0 pid=1689
+     __slab_alloc+0x4b4/0x4f0
+     kmem_cache_alloc_trace+0x10b/0x190
+     kmalloc_oob_right+0x3d/0x75 [test_kasan]
+     init_module+0x9/0x47 [test_kasan]
+     do_one_initcall+0x99/0x200
+     load_module+0x2cb3/0x3b20
+     SyS_finit_module+0x76/0x80
+     system_call_fastpath+0x12/0x17
+    INFO: Slab 0xffffea0001a4ef00 objects=17 used=7 fp=0xffff8800693bd728 flags=0x100000000004080
+    INFO: Object 0xffff8800693bc558 @offset=1368 fp=0xffff8800693bc720
+
+    Bytes b4 ffff8800693bc548: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
+    Object ffff8800693bc558: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
kkkkkkkkkkkkkkkk + Object ffff8800693bc568: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk + Object ffff8800693bc578: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk + Object ffff8800693bc588: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk + Object ffff8800693bc598: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk + Object ffff8800693bc5a8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk + Object ffff8800693bc5b8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk + Object ffff8800693bc5c8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk. + Redzone ffff8800693bc5d8: cc cc cc cc cc cc cc cc ........ + Padding ffff8800693bc718: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ + CPU: 0 PID: 1689 Comm: modprobe Tainted: G B 3.18.0-rc1-mm1+ #98 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 + ffff8800693bc000 0000000000000000 ffff8800693bc558 ffff88006923bb78 + ffffffff81cc68ae 00000000000000f3 ffff88006d407600 ffff88006923bba8 + ffffffff811fd848 ffff88006d407600 ffffea0001a4ef00 ffff8800693bc558 + Call Trace: + [] dump_stack+0x46/0x58 + [] print_trailer+0xf8/0x160 + [] ? kmem_cache_oob+0xc3/0xc3 [test_kasan] + [] object_err+0x35/0x40 + [] ? kmalloc_oob_right+0x65/0x75 [test_kasan] + [] kasan_report_error+0x38a/0x3f0 + [] ? kasan_poison_shadow+0x2f/0x40 + [] ? kasan_unpoison_shadow+0x14/0x40 + [] ? kasan_poison_shadow+0x2f/0x40 + [] ? kmem_cache_oob+0xc3/0xc3 [test_kasan] + [] __asan_store1+0x75/0xb0 + [] ? kmem_cache_oob+0x1d/0xc3 [test_kasan] + [] ? kmalloc_oob_right+0x65/0x75 [test_kasan] + [] kmalloc_oob_right+0x65/0x75 [test_kasan] + [] init_module+0x9/0x47 [test_kasan] + [] do_one_initcall+0x99/0x200 + [] ? __vunmap+0xec/0x160 + [] load_module+0x2cb3/0x3b20 + [] ? 
m_show+0x240/0x240
+     [] SyS_finit_module+0x76/0x80
+     [] system_call_fastpath+0x12/0x17
+    Memory state around the buggy address:
+     ffff8800693bc300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
+     ffff8800693bc380: fc fc 00 00 00 00 00 00 00 00 00 00 00 00 00 fc
+     ffff8800693bc400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
+     ffff8800693bc480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
+     ffff8800693bc500: fc fc fc fc fc fc fc fc fc fc fc 00 00 00 00 00
+    >ffff8800693bc580: 00 00 00 00 00 00 00 00 00 00 03 fc fc fc fc fc
+                                                     ^
+     ffff8800693bc600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
+     ffff8800693bc680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
+     ffff8800693bc700: fc fc fc fc fb fb fb fb fb fb fb fb fb fb fb fb
+     ffff8800693bc780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
+     ffff8800693bc800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
+    ==================================================================
+
+The header of the report describes what kind of bug happened and what kind of
+access caused it. It's followed by the description of the accessed slub object
+(see 'SLUB Debug output' section in Documentation/vm/slub.txt for details) and
+the description of the accessed memory page.
+
+In the last section the report shows the memory state around the accessed
+address. Reading this part requires some understanding of how KASAN works.
+
+The state of each 8 aligned bytes of memory is encoded in one shadow byte.
+Those 8 bytes can be accessible, partially accessible, freed or be a redzone.
+We use the following encoding for each shadow byte: 0 means that all 8 bytes
+of the corresponding memory region are accessible; number N (1 <= N <= 7) means
+that the first N bytes are accessible, and other (8 - N) bytes are not;
+any negative value indicates that the entire 8-byte word is inaccessible.
+We use different negative values to distinguish between different kinds of
+inaccessible memory like redzones or freed memory (see mm/kasan/kasan.h).
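
The shadow byte encoding described above can be sketched in a few lines of C.
This is a minimal userspace illustration, not the kernel's implementation: the
helper names ``shadow_accessible_bytes()`` and ``shadow_byte_ok()`` are
hypothetical, and the sketch assumes the shadow byte is read as a signed 8-bit
value, so that markers such as 0xfc and 0xfb from the report compare as
negative.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): number of leading accessible
 * bytes in the 8-byte block described by one shadow byte.
 */
static inline int shadow_accessible_bytes(int8_t shadow)
{
	if (shadow == 0)
		return 8;	/* all 8 bytes are accessible */
	if (shadow >= 1 && shadow <= 7)
		return shadow;	/* only the first N bytes are accessible */
	return 0;		/* negative (redzone, freed, ...) or unexpected */
}

/*
 * Hypothetical helper: is the byte at `offset` (0..7) inside the 8-byte
 * block accessible according to its shadow byte?
 */
static inline int shadow_byte_ok(int8_t shadow, unsigned int offset)
{
	return offset < (unsigned int)shadow_accessible_bytes(shadow);
}
```

Applied to the report, the faulting address ffff8800693bc5d3 sits at offset 3
of its 8-byte block (0xd3 & 7), and a shadow byte of 03 only allows offsets 0
to 2, so ``shadow_byte_ok(3, 3)`` would be false.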
+
+In the report above the arrows point to the shadow byte 03, which means that
+the accessed address is partially accessible.
+
+
+Implementation details
+----------------------
+
+From a high level, our approach to memory error detection is similar to that
+of kmemcheck: use shadow memory to record whether each byte of memory is safe
+to access, and use compile-time instrumentation to check shadow memory on each
+memory access.
+
+AddressSanitizer dedicates 1/8 of kernel memory to its shadow memory
+(e.g. 16TB to cover 128TB on x86_64) and uses direct mapping with a scale and
+offset to translate a memory address to its corresponding shadow address.
+
+Here is the function which translates an address to its corresponding shadow
+address::
+
+    static inline void *kasan_mem_to_shadow(const void *addr)
+    {
+        return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
+                + KASAN_SHADOW_OFFSET;
+    }
+
+where ``KASAN_SHADOW_SCALE_SHIFT = 3``.
+
+Compile-time instrumentation is used for checking memory accesses. The compiler
+inserts function calls (__asan_load*(addr), __asan_store*(addr)) before each
+memory access of size 1, 2, 4, 8 or 16. These functions check whether the
+memory access is valid by checking the corresponding shadow memory.
+
+GCC 5.0 can also perform inline instrumentation. Instead of making function
+calls, GCC directly inserts the code to check the shadow memory.
+This option significantly enlarges the kernel, but it gives a 1.1x to 2x
+performance boost over an outline-instrumented kernel.
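
To make the address translation and the inserted checks more concrete, here is
a small userspace model. It is only a sketch under stated assumptions: every
``model_*`` name is hypothetical, "memory" is a 64-byte range addressed by
offset, and the shadow offset is simply the base of an 8-byte array rather than
the kernel's ``KASAN_SHADOW_OFFSET``. The check shows the kind of decision an
outline ``__asan_store1()``-style call has to make; it is not the kernel's
actual code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_SHADOW_SCALE_SHIFT 3	/* one shadow byte per 8 bytes */
#define MODEL_MEM_SIZE 64		/* hypothetical memory size */

/* Shadow memory: 1/8 of the modelled memory. */
static int8_t model_shadow[MODEL_MEM_SIZE >> MODEL_SHADOW_SCALE_SHIFT];

/* Map a memory offset (0..63) to its shadow byte. */
static inline int8_t *model_mem_to_shadow(size_t addr)
{
	return &model_shadow[addr >> MODEL_SHADOW_SCALE_SHIFT];
}

/* Mark all memory inaccessible, using 0xfc as in the report above. */
static inline void model_poison_all(void)
{
	memset(model_shadow, 0xfc, sizeof(model_shadow));
}

/*
 * Mark [start, start + size) accessible. start must be 8-byte aligned.
 * A trailing partial block gets shadow value N = size % 8.
 */
static inline void model_unpoison(size_t start, size_t size)
{
	size_t i;

	for (i = 0; i + 8 <= size; i += 8)
		*model_mem_to_shadow(start + i) = 0;
	if (size & 7)
		*model_mem_to_shadow(start + i) = (int8_t)(size & 7);
}

/* Return 1 if a 1-byte access at addr is valid, 0 otherwise. */
static inline int model_access1_ok(size_t addr)
{
	int8_t shadow = *model_mem_to_shadow(addr);

	if (shadow == 0)
		return 1;	/* whole block accessible */
	if (shadow < 0)
		return 0;	/* redzone or freed memory */
	return (int8_t)(addr & 7) < shadow;	/* partially accessible */
}
```

For example, after ``model_poison_all()`` and ``model_unpoison(8, 11)``, the
bytes at offsets 8 to 18 are accepted, while offset 19 falls into the
inaccessible tail of a block whose shadow byte is 3 and is rejected.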
diff --git a/Documentation/kcov.txt b/Documentation/dev-tools/kcov.rst similarity index 78% rename from Documentation/kcov.txt rename to Documentation/dev-tools/kcov.rst index 779ff4ab1c1da09b98bed78b5df76436ad8d316a..aca0e27ca19727938423c30dea59ee9724f72904 100644 --- a/Documentation/kcov.txt +++ b/Documentation/dev-tools/kcov.rst @@ -12,38 +12,38 @@ To achieve this goal it does not collect coverage in soft/hard interrupts and instrumentation of some inherently non-deterministic parts of kernel is disbled (e.g. scheduler, locking). -Usage: -====== +Usage +----- -Configure kernel with: +Configure the kernel with:: CONFIG_KCOV=y CONFIG_KCOV requires gcc built on revision 231296 or later. -Profiling data will only become accessible once debugfs has been mounted: +Profiling data will only become accessible once debugfs has been mounted:: mount -t debugfs none /sys/kernel/debug -The following program demonstrates kcov usage from within a test program: - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#define KCOV_INIT_TRACE _IOR('c', 1, unsigned long) -#define KCOV_ENABLE _IO('c', 100) -#define KCOV_DISABLE _IO('c', 101) -#define COVER_SIZE (64<<10) - -int main(int argc, char **argv) -{ +The following program demonstrates kcov usage from within a test program:: + + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + + #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long) + #define KCOV_ENABLE _IO('c', 100) + #define KCOV_DISABLE _IO('c', 101) + #define COVER_SIZE (64<<10) + + int main(int argc, char **argv) + { int fd; unsigned long *cover, n, i; @@ -83,24 +83,24 @@ int main(int argc, char **argv) if (close(fd)) perror("close"), exit(1); return 0; -} - -After piping through addr2line output of the program looks as follows: - -SyS_read -fs/read_write.c:562 -__fdget_pos -fs/file.c:774 -__fget_light -fs/file.c:746 -__fget_light -fs/file.c:750 -__fget_light 
-fs/file.c:760 -__fdget_pos -fs/file.c:784 -SyS_read -fs/read_write.c:562 + } + +After piping through addr2line output of the program looks as follows:: + + SyS_read + fs/read_write.c:562 + __fdget_pos + fs/file.c:774 + __fget_light + fs/file.c:746 + __fget_light + fs/file.c:750 + __fget_light + fs/file.c:760 + __fdget_pos + fs/file.c:784 + SyS_read + fs/read_write.c:562 If a program needs to collect coverage from several threads (independently), it needs to open /sys/kernel/debug/kcov in each thread separately. diff --git a/Documentation/dev-tools/kmemcheck.rst b/Documentation/dev-tools/kmemcheck.rst new file mode 100644 index 0000000000000000000000000000000000000000..7f3d1985de743f00860e69033564043aa145ad39 --- /dev/null +++ b/Documentation/dev-tools/kmemcheck.rst @@ -0,0 +1,733 @@ +Getting started with kmemcheck +============================== + +Vegard Nossum + + +Introduction +------------ + +kmemcheck is a debugging feature for the Linux Kernel. More specifically, it +is a dynamic checker that detects and warns about some uses of uninitialized +memory. + +Userspace programmers might be familiar with Valgrind's memcheck. The main +difference between memcheck and kmemcheck is that memcheck works for userspace +programs only, and kmemcheck works for the kernel only. The implementations +are of course vastly different. Because of this, kmemcheck is not as accurate +as memcheck, but it turns out to be good enough in practice to discover real +programmer errors that the compiler is not able to find through static +analysis. + +Enabling kmemcheck on a kernel will probably slow it down to the extent that +the machine will not be usable for normal workloads such as e.g. an +interactive desktop. kmemcheck will also cause the kernel to use about twice +as much memory as normal. For this reason, kmemcheck is strictly a debugging +feature. + + +Downloading +----------- + +As of version 2.6.31-rc1, kmemcheck is included in the mainline kernel. 
+ + +Configuring and compiling +------------------------- + +kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of +configuration variables must have specific settings in order for the kmemcheck +menu to even appear in "menuconfig". These are: + +- ``CONFIG_CC_OPTIMIZE_FOR_SIZE=n`` + This option is located under "General setup" / "Optimize for size". + + Without this, gcc will use certain optimizations that usually lead to + false positive warnings from kmemcheck. An example of this is a 16-bit + field in a struct, where gcc may load 32 bits, then discard the upper + 16 bits. kmemcheck sees only the 32-bit load, and may trigger a + warning for the upper 16 bits (if they're uninitialized). + +- ``CONFIG_SLAB=y`` or ``CONFIG_SLUB=y`` + This option is located under "General setup" / "Choose SLAB + allocator". + +- ``CONFIG_FUNCTION_TRACER=n`` + This option is located under "Kernel hacking" / "Tracers" / "Kernel + Function Tracer" + + When function tracing is compiled in, gcc emits a call to another + function at the beginning of every function. This means that when the + page fault handler is called, the ftrace framework will be called + before kmemcheck has had a chance to handle the fault. If ftrace then + modifies memory that was tracked by kmemcheck, the result is an + endless recursive page fault. + +- ``CONFIG_DEBUG_PAGEALLOC=n`` + This option is located under "Kernel hacking" / "Memory Debugging" + / "Debug page memory allocations". + +In addition, I highly recommend turning on ``CONFIG_DEBUG_INFO=y``. This is also +located under "Kernel hacking". With this, you will be able to get line number +information from the kmemcheck warnings, which is extremely valuable in +debugging a problem. This option is not mandatory, however, because it slows +down the compilation process and produces a much bigger kernel image. 
+ +Now the kmemcheck menu should be visible (under "Kernel hacking" / "Memory +Debugging" / "kmemcheck: trap use of uninitialized memory"). Here follows +a description of the kmemcheck configuration variables: + +- ``CONFIG_KMEMCHECK`` + This must be enabled in order to use kmemcheck at all... + +- ``CONFIG_KMEMCHECK_``[``DISABLED`` | ``ENABLED`` | ``ONESHOT``]``_BY_DEFAULT`` + This option controls the status of kmemcheck at boot-time. "Enabled" + will enable kmemcheck right from the start, "disabled" will boot the + kernel as normal (but with the kmemcheck code compiled in, so it can + be enabled at run-time after the kernel has booted), and "one-shot" is + a special mode which will turn kmemcheck off automatically after + detecting the first use of uninitialized memory. + + If you are using kmemcheck to actively debug a problem, then you + probably want to choose "enabled" here. + + The one-shot mode is mostly useful in automated test setups because it + can prevent floods of warnings and increase the chances of the machine + surviving in case something is really wrong. In other cases, the one- + shot mode could actually be counter-productive because it would turn + itself off at the very first error -- in the case of a false positive + too -- and this would come in the way of debugging the specific + problem you were interested in. + + If you would like to use your kernel as normal, but with a chance to + enable kmemcheck in case of some problem, it might be a good idea to + choose "disabled" here. When kmemcheck is disabled, most of the run- + time overhead is not incurred, and the kernel will be almost as fast + as normal. + +- ``CONFIG_KMEMCHECK_QUEUE_SIZE`` + Select the maximum number of error reports to store in an internal + (fixed-size) buffer. Since errors can occur virtually anywhere and in + any context, we need a temporary storage area which is guaranteed not + to generate any other page faults when accessed. 
The queue will be + emptied as soon as a tasklet may be scheduled. If the queue is full, + new error reports will be lost. + + The default value of 64 is probably fine. If some code produces more + than 64 errors within an irqs-off section, then the code is likely to + produce many, many more, too, and these additional reports seldom give + any more information (the first report is usually the most valuable + anyway). + + This number might have to be adjusted if you are not using serial + console or similar to capture the kernel log. If you are using the + "dmesg" command to save the log, then getting a lot of kmemcheck + warnings might overflow the kernel log itself, and the earlier reports + will get lost in that way instead. Try setting this to 10 or so on + such a setup. + +- ``CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT`` + Select the number of shadow bytes to save along with each entry of the + error-report queue. These bytes indicate what parts of an allocation + are initialized, uninitialized, etc. and will be displayed when an + error is detected to help the debugging of a particular problem. + + The number entered here is actually the logarithm of the number of + bytes that will be saved. So if you pick for example 5 here, kmemcheck + will save 2^5 = 32 bytes. + + The default value should be fine for debugging most problems. It also + fits nicely within 80 columns. + +- ``CONFIG_KMEMCHECK_PARTIAL_OK`` + This option (when enabled) works around certain GCC optimizations that + produce 32-bit reads from 16-bit variables where the upper 16 bits are + thrown away afterwards. + + The default value (enabled) is recommended. This may of course hide + some real errors, but disabling it would probably produce a lot of + false positives. + +- ``CONFIG_KMEMCHECK_BITOPS_OK`` + This option silences warnings that would be generated for bit-field + accesses where not all the bits are initialized at the same time. This + may also hide some real bugs. 
+ + This option is probably obsolete, or it should be replaced with + the kmemcheck-/bitfield-annotations for the code in question. The + default value is therefore fine. + +Now compile the kernel as usual. + + +How to use +---------- + +Booting +~~~~~~~ + +First some information about the command-line options. There is only one +option specific to kmemcheck, and this is called "kmemcheck". It can be used +to override the default mode as chosen by the ``CONFIG_KMEMCHECK_*_BY_DEFAULT`` +option. Its possible settings are: + +- ``kmemcheck=0`` (disabled) +- ``kmemcheck=1`` (enabled) +- ``kmemcheck=2`` (one-shot mode) + +If SLUB debugging has been enabled in the kernel, it may take precedence over +kmemcheck in such a way that the slab caches which are under SLUB debugging +will not be tracked by kmemcheck. In order to ensure that this doesn't happen +(even though it shouldn't by default), use SLUB's boot option ``slub_debug``, +like this: ``slub_debug=-`` + +In fact, this option may also be used for fine-grained control over SLUB vs. +kmemcheck. For example, if the command line includes +``kmemcheck=1 slub_debug=,dentry``, then SLUB debugging will be used only +for the "dentry" slab cache, and with kmemcheck tracking all the other +caches. This is advanced usage, however, and is not generally recommended. + + +Run-time enable/disable +~~~~~~~~~~~~~~~~~~~~~~~ + +When the kernel has booted, it is possible to enable or disable kmemcheck at +run-time. WARNING: This feature is still experimental and may cause false +positive warnings to appear. Therefore, try not to use this. If you find that +it doesn't work properly (e.g. you see an unreasonable amount of warnings), I +will be happy to take bug reports. + +Use the file ``/proc/sys/kernel/kmemcheck`` for this purpose, e.g.:: + + $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck + +The numbers are the same as for the ``kmemcheck=`` command-line option. 
+ + +Debugging +~~~~~~~~~ + +A typical report will look something like this:: + + WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) + 80000000000000000000000000000000000000000088ffff0000000000000000 + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u + ^ + + Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A + RIP: 0010:[] [] __dequeue_signal+0xc8/0x190 + RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002 + RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 + RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84 + RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000 + R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e + R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8 + FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000 + CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 + CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0 + DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 + DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 + [] dequeue_signal+0x8e/0x170 + [] get_signal_to_deliver+0x98/0x390 + [] do_notify_resume+0xad/0x7d0 + [] int_signal+0x12/0x17 + [] 0xffffffffffffffff + +The single most valuable information in this report is the RIP (or EIP on 32- +bit) value. This will help us pinpoint exactly which instruction that caused +the warning. + +If your kernel was compiled with ``CONFIG_DEBUG_INFO=y``, then all we have to do +is give this address to the addr2line program, like this:: + + $ addr2line -e vmlinux -i ffffffff8104ede8 + arch/x86/include/asm/string_64.h:12 + include/asm-generic/siginfo.h:287 + kernel/signal.c:380 + kernel/signal.c:410 + +The "``-e vmlinux``" tells addr2line which file to look in. **IMPORTANT:** +This must be the vmlinux of the kernel that produced the warning in the +first place! If not, the line number information will almost certainly be +wrong. 
+ +The "``-i``" tells addr2line to also print the line numbers of inlined +functions. In this case, the flag was very important, because otherwise, +it would only have printed the first line, which is just a call to +``memcpy()``, which could be called from a thousand places in the kernel, and +is therefore not very useful. These inlined functions would not show up in +the stack trace above, simply because the kernel doesn't load the extra +debugging information. This technique can of course be used with ordinary +kernel oopses as well. + +In this case, it's the caller of ``memcpy()`` that is interesting, and it can be +found in ``include/asm-generic/siginfo.h``, line 287:: + + 281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from) + 282 { + 283 if (from->si_code < 0) + 284 memcpy(to, from, sizeof(*to)); + 285 else + 286 /* _sigchld is currently the largest know union member */ + 287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld)); + 288 } + +Since this was a read (kmemcheck usually warns about reads only, though it can +warn about writes to unallocated or freed memory as well), it was probably the +"from" argument which contained some uninitialized bytes. Following the chain +of calls, we move upwards to see where "from" was allocated or initialized, +``kernel/signal.c``, line 380:: + + 359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info) + 360 { + ... + 367 list_for_each_entry(q, &list->list, list) { + 368 if (q->info.si_signo == sig) { + 369 if (first) + 370 goto still_pending; + 371 first = q; + ... + 377 if (first) { + 378 still_pending: + 379 list_del_init(&first->list); + 380 copy_siginfo(info, &first->info); + 381 __sigqueue_free(first); + ... + 392 } + 393 } + +Here, it is ``&first->info`` that is being passed on to ``copy_siginfo()``. The +variable ``first`` was found on a list -- passed in as the second argument to +``collect_signal()``. 
We continue our journey through the stack, to figure out +where the item on "list" was allocated or initialized. We move to line 410:: + + 395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask, + 396 siginfo_t *info) + 397 { + ... + 410 collect_signal(sig, pending, info); + ... + 414 } + +Now we need to follow the ``pending`` pointer, since that is being passed on to +``collect_signal()`` as ``list``. At this point, we've run out of lines from the +"addr2line" output. Not to worry, we just paste the next addresses from the +kmemcheck stack dump, i.e.:: + + [] dequeue_signal+0x8e/0x170 + [] get_signal_to_deliver+0x98/0x390 + [] do_notify_resume+0xad/0x7d0 + [] int_signal+0x12/0x17 + + $ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \ + ffffffff8100b87d ffffffff8100c7b5 + kernel/signal.c:446 + kernel/signal.c:1806 + arch/x86/kernel/signal.c:805 + arch/x86/kernel/signal.c:871 + arch/x86/kernel/entry_64.S:694 + +Remember that since these addresses were found on the stack and not as the +RIP value, they actually point to the _next_ instruction (they are return +addresses). This becomes obvious when we look at the code for line 446:: + + 422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) + 423 { + ... + 431 signr = __dequeue_signal(&tsk->signal->shared_pending, + 432 mask, info); + 433 /* + 434 * itimer signal ? + 435 * + 436 * itimers are process shared and we restart periodic + 437 * itimers in the signal delivery path to prevent DoS + 438 * attacks in the high resolution timer case. This is + 439 * compliant with the old way of self restarting + 440 * itimers, as the SIGALRM is a legacy signal and only + 441 * queued once. Changing the restart behaviour to + 442 * restart the timer in the signal dequeue path is + 443 * reducing the timer noise on heavy loaded !highres + 444 * systems too. + 445 */ + 446 if (unlikely(signr == SIGALRM)) { + ... 
+ 489 } + +So instead of looking at 446, we should be looking at 431, which is the line +that executes just before 446. Here we see that what we are looking for is +``&tsk->signal->shared_pending``. + +Our next task is now to figure out which function that puts items on this +``shared_pending`` list. A crude, but efficient tool, is ``git grep``:: + + $ git grep -n 'shared_pending' kernel/ + ... + kernel/signal.c:828: pending = group ? &t->signal->shared_pending : &t->pending; + kernel/signal.c:1339: pending = group ? &t->signal->shared_pending : &t->pending; + ... + +There were more results, but none of them were related to list operations, +and these were the only assignments. We inspect the line numbers more closely +and find that this is indeed where items are being added to the list:: + + 816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, + 817 int group) + 818 { + ... + 828 pending = group ? &t->signal->shared_pending : &t->pending; + ... + 851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && + 852 (is_si_special(info) || + 853 info->si_code >= 0))); + 854 if (q) { + 855 list_add_tail(&q->list, &pending->list); + ... + 890 } + +and:: + + 1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group) + 1310 { + .... + 1339 pending = group ? &t->signal->shared_pending : &t->pending; + 1340 list_add_tail(&q->list, &pending->list); + .... + 1347 } + +In the first case, the list element we are looking for, ``q``, is being +returned from the function ``__sigqueue_alloc()``, which looks like an +allocation function. 
Let's take a look at it:: + + 187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, + 188 int override_rlimit) + 189 { + 190 struct sigqueue *q = NULL; + 191 struct user_struct *user; + 192 + 193 /* + 194 * We won't get problems with the target's UID changing under us + 195 * because changing it requires RCU be used, and if t != current, the + 196 * caller must be holding the RCU readlock (by way of a spinlock) and + 197 * we use RCU protection here + 198 */ + 199 user = get_uid(__task_cred(t)->user); + 200 atomic_inc(&user->sigpending); + 201 if (override_rlimit || + 202 atomic_read(&user->sigpending) <= + 203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur) + 204 q = kmem_cache_alloc(sigqueue_cachep, flags); + 205 if (unlikely(q == NULL)) { + 206 atomic_dec(&user->sigpending); + 207 free_uid(user); + 208 } else { + 209 INIT_LIST_HEAD(&q->list); + 210 q->flags = 0; + 211 q->user = user; + 212 } + 213 + 214 return q; + 215 } + +We see that this function initializes ``q->list``, ``q->flags``, and +``q->user``. It seems that now is the time to look at the definition of +``struct sigqueue``, e.g.:: + + 14 struct sigqueue { + 15 struct list_head list; + 16 int flags; + 17 siginfo_t info; + 18 struct user_struct *user; + 19 }; + +And, you might remember, it was a ``memcpy()`` on ``&first->info`` that +caused the warning, so this makes perfect sense. It also seems reasonable +to assume that it is the caller of ``__sigqueue_alloc()`` that has the +responsibility of filling out (initializing) this member. + +But just which fields of the struct were uninitialized? Let's look at +kmemcheck's report again:: + + WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) + 80000000000000000000000000000000000000000088ffff0000000000000000 + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u + ^ + +These first two lines are the memory dump of the memory object itself, and +the shadow bytemap, respectively. 
The memory object itself is in this case +``&first->info``. Just beware that the start of this dump is NOT the start +of the object itself! The position of the caret (^) corresponds with the +address of the read (ffff88003e4a2024). + +The shadow bytemap dump legend is as follows: + +- i: initialized +- u: uninitialized +- a: unallocated (memory has been allocated by the slab layer, but has not + yet been handed off to anybody) +- f: freed (memory has been allocated by the slab layer, but has been freed + by the previous owner) + +In order to figure out where (relative to the start of the object) the +uninitialized memory was located, we have to look at the disassembly. For +that, we'll need the RIP address again:: + + RIP: 0010:[] [] __dequeue_signal+0xc8/0x190 + + $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8: + ffffffff8104edc8: mov %r8,0x8(%r8) + ffffffff8104edcc: test %r10d,%r10d + ffffffff8104edcf: js ffffffff8104ee88 <__dequeue_signal+0x168> + ffffffff8104edd5: mov %rax,%rdx + ffffffff8104edd8: mov $0xc,%ecx + ffffffff8104eddd: mov %r13,%rdi + ffffffff8104ede0: mov $0x30,%eax + ffffffff8104ede5: mov %rdx,%rsi + ffffffff8104ede8: rep movsl %ds:(%rsi),%es:(%rdi) + ffffffff8104edea: test $0x2,%al + ffffffff8104edec: je ffffffff8104edf0 <__dequeue_signal+0xd0> + ffffffff8104edee: movsw %ds:(%rsi),%es:(%rdi) + ffffffff8104edf0: test $0x1,%al + ffffffff8104edf2: je ffffffff8104edf5 <__dequeue_signal+0xd5> + ffffffff8104edf4: movsb %ds:(%rsi),%es:(%rdi) + ffffffff8104edf5: mov %r8,%rdi + ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free> + +As expected, it's the "``rep movsl``" instruction from the ``memcpy()`` +that causes the warning. We know about ``REP MOVSL`` that it uses the register +``RCX`` to count the number of remaining iterations. 
By taking a look at the +register dump again (from the kmemcheck report), we can figure out how many +bytes were left to copy:: + + RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 + +By looking at the disassembly, we also see that ``%ecx`` is being loaded +with the value ``$0xc`` just before (ffffffff8104edd8), so we are very +lucky. Keep in mind that this is the number of iterations, not bytes. And +since this is a "long" operation, we need to multiply by 4 to get the +number of bytes. So this means that the uninitialized value was encountered +at 4 * (0xc - 0x9) = 12 bytes from the start of the object. + +We can now try to figure out which field of ``struct siginfo`` +was not initialized. This is the beginning of the struct:: + + 40 typedef struct siginfo { + 41 int si_signo; + 42 int si_errno; + 43 int si_code; + 44 + 45 union { + .. + 92 } _sifields; + 93 } siginfo_t; + +On 64-bit, the int is 4 bytes long, so it must be the union member that has +not been initialized. We can verify this using gdb:: + + $ gdb vmlinux + ... + (gdb) p &((struct siginfo *) 0)->_sifields + $1 = (union {...} *) 0x10 + +Actually, it seems that the union member is located at offset 0x10 -- which +means that gcc has inserted 4 bytes of padding between the members ``si_code`` +and ``_sifields``. We can now get a fuller picture of the memory dump:: + + _----------------------------=> si_code + / _--------------------=> (padding) + | / _------------=> _sifields(._kill._pid) + | | / _----=> _sifields(._kill._uid) + | | | / + -------|-------|-------|-------| + 80000000000000000000000000000000000000000088ffff0000000000000000 + i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u + +This allows us to realize another important fact: ``si_code`` contains the +value 0x80. Remember that x86 is little endian, so the first 4 bytes +"80000000" are really the number 0x00000080.
With a bit of research, we +find that this is actually the constant ``SI_KERNEL`` defined in +``include/asm-generic/siginfo.h``:: + + 144 #define SI_KERNEL 0x80 /* sent by the kernel from somewhere */ + +This macro is used in exactly one place in the x86 kernel: In ``send_signal()`` +in ``kernel/signal.c``:: + + 816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, + 817 int group) + 818 { + ... + 828 pending = group ? &t->signal->shared_pending : &t->pending; + ... + 851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && + 852 (is_si_special(info) || + 853 info->si_code >= 0))); + 854 if (q) { + 855 list_add_tail(&q->list, &pending->list); + 856 switch ((unsigned long) info) { + ... + 865 case (unsigned long) SEND_SIG_PRIV: + 866 q->info.si_signo = sig; + 867 q->info.si_errno = 0; + 868 q->info.si_code = SI_KERNEL; + 869 q->info.si_pid = 0; + 870 q->info.si_uid = 0; + 871 break; + ... + 890 } + +Not only does this match with the ``.si_code`` member, it also matches the place +we found earlier when looking for where siginfo_t objects are enqueued on the +``shared_pending`` list. + +So to sum up: It seems that it is the padding introduced by the compiler +between two struct fields that is uninitialized, and this gets reported when +we do a ``memcpy()`` on the struct. This means that we have identified a false +positive warning. + +Normally, kmemcheck will not report uninitialized accesses in ``memcpy()`` calls +when both the source and destination addresses are tracked. (Instead, we copy +the shadow bytemap as well). In this case, the destination address clearly +was not tracked. We can dig a little deeper into the stack trace from above:: + + arch/x86/kernel/signal.c:805 + arch/x86/kernel/signal.c:871 + arch/x86/kernel/entry_64.S:694 + +And we clearly see that the destination siginfo object is located on the +stack:: + + 782 static void do_signal(struct pt_regs *regs) + 783 { + 784 struct k_sigaction ka; + 785 siginfo_t info; + ... 
+ 804 signr = get_signal_to_deliver(&info, &ka, regs, NULL); + ... + 854 } + +And this ``&info`` is what eventually gets passed to ``copy_siginfo()`` as the +destination argument. + +Now, even though we didn't find an actual error here, the example is still a +good one, because it shows how one would go about finding out what the report +was all about. + + +Annotating false positives +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There are a few different ways to make annotations in the source code that +will keep kmemcheck from checking and reporting certain allocations. Here +they are: + +- ``__GFP_NOTRACK_FALSE_POSITIVE`` + This flag can be passed to ``kmalloc()`` or ``kmem_cache_alloc()`` + (therefore also to other functions that end up calling one of + these) to indicate that the allocation should not be tracked + because it would lead to a false positive report. This is a "big + hammer" way of silencing kmemcheck; after all, even if the false + positive pertains to a particular field in a struct, for example, we + will now lose the ability to find (real) errors in other parts of + the same struct. + + Example:: + + /* No warnings will ever trigger on accessing any part of x */ + x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE); + +- ``kmemcheck_bitfield_begin(name)``/``kmemcheck_bitfield_end(name)`` and + ``kmemcheck_annotate_bitfield(ptr, name)`` + The first two of these three macros can be used inside struct + definitions to signal, respectively, the beginning and end of a + bitfield. Additionally, this will assign the bitfield a name, which + is given as an argument to the macros. + + Having used these markers, one can later use + ``kmemcheck_annotate_bitfield()`` at the point of allocation, to indicate + which parts of the allocation are part of a bitfield.
+ + Example:: + + struct foo { + int x; + + kmemcheck_bitfield_begin(flags); + int flag_a:1; + int flag_b:1; + kmemcheck_bitfield_end(flags); + + int y; + }; + + struct foo *x = kmalloc(sizeof *x, GFP_KERNEL); + + /* No warnings will trigger on accessing the bitfield of x */ + kmemcheck_annotate_bitfield(x, flags); + + Note that ``kmemcheck_annotate_bitfield()`` can be used even before the + return value of ``kmalloc()`` is checked -- in other words, passing NULL + as the first argument is legal (and will do nothing). + + +Reporting errors +---------------- + +As we have seen, kmemcheck will produce false positive reports. Therefore, it +is not very wise to blindly post kmemcheck warnings to mailing lists and +maintainers. Instead, I encourage maintainers and developers to find errors +in their own code. If you get a warning, you can try to work around it, try +to figure out if it's a real error or not, or simply ignore it. Most +developers know their own code and will quickly and efficiently determine the +root cause of a kmemcheck report. This is therefore also the most efficient +way to work with kmemcheck. + +That said, we (the kmemcheck maintainers) will always be on the lookout for +false positives that we can annotate and silence. So whatever you find, +please drop us a note privately! Kernel configs and steps to reproduce (if +available) are of course a great help too. + +Happy hacking! + + +Technical description +--------------------- + +kmemcheck works by marking memory pages non-present. This means that whenever +somebody attempts to access the page, a page fault is generated. The page +fault handler notices that the page was in fact only hidden, and so it calls +on the kmemcheck code to make further investigations. + +When the investigations are completed, kmemcheck "shows" the page by marking +it present (as it would be under normal circumstances). This way, the +interrupted code can continue as usual.
+ +But after the instruction has been executed, we should hide the page again, so +that we can catch the next access too! Now kmemcheck makes use of a debugging +feature of the processor, namely single-stepping. When the processor has +finished the one instruction that generated the memory access, a debug +exception is raised. From here, we simply hide the page again and continue +execution, this time with the single-stepping feature turned off. + +kmemcheck requires some assistance from the memory allocator in order to work. +The memory allocator needs to + + 1. Tell kmemcheck about newly allocated pages and pages that are about to + be freed. This allows kmemcheck to set up and tear down the shadow memory + for the pages in question. The shadow memory stores the status of each + byte in the allocation proper, e.g. whether it is initialized or + uninitialized. + + 2. Tell kmemcheck which parts of memory should be marked uninitialized. + There are actually a few more states, such as "not yet allocated" and + "recently freed". + +If a slab cache is set up using the SLAB_NOTRACK flag, it will never return +memory that can take page faults because of kmemcheck. + +If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still +request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags. +This does not prevent the page faults from occurring, however, but marks the +object in question as being initialized so that no warnings will ever be +produced for this object. + +Currently, the SLAB and SLUB allocators are supported by kmemcheck. 
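The shadow-byte legend and the ``rep movsl`` offset arithmetic used in the walkthrough can be sketched in plain userspace C. This is only an illustration under stated assumptions: the names ``shadow_state``, ``first_bad_byte()`` and ``rep_movsl_offset()`` are made up here and are not kmemcheck's actual API, and the shadow bytemap is represented as the legend characters printed in a report rather than as the kernel's internal encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-byte states, named after the report legend (i/u/a/f). */
enum shadow_state {
	SHADOW_INITIALIZED   = 'i',
	SHADOW_UNINITIALIZED = 'u',
	SHADOW_UNALLOCATED   = 'a',
	SHADOW_FREED         = 'f',
};

/*
 * Return the index of the first byte in shadow[0..len) that is not marked
 * initialized, or -1 if every byte is initialized. The input is the legend
 * line of a report with the spaces removed.
 */
static long first_bad_byte(const char *shadow, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (shadow[i] != SHADOW_INITIALIZED)
			return (long)i;
	return -1;
}

/*
 * Byte offset reached by a "rep movsl" copy, given the iteration count
 * originally loaded into %ecx and the count still left in RCX when the
 * report was produced. Each iteration copies 4 bytes.
 */
static size_t rep_movsl_offset(size_t initial, size_t remaining)
{
	assert(remaining <= initial);
	return 4 * (initial - remaining);
}
```

With the values from the report above, ``rep_movsl_offset(0xc, 0x9)`` gives 12, and ``first_bad_byte("iiiiuuuuiiiiiiii", 16)`` gives 4, the position of the first ``u`` in the shadow line of the dump.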
diff --git a/Documentation/kmemleak.txt b/Documentation/dev-tools/kmemleak.rst similarity index 73% rename from Documentation/kmemleak.txt rename to Documentation/dev-tools/kmemleak.rst index 18e24abb3ecf61b1f6a214af921af8bd138b27e4..1788722d549503c3164e43cdbe9b26e091d2f919 100644 --- a/Documentation/kmemleak.txt +++ b/Documentation/dev-tools/kmemleak.rst @@ -1,15 +1,12 @@ Kernel Memory Leak Detector =========================== -Introduction ------------- - Kmemleak provides a way of detecting possible kernel memory leaks in a way similar to a tracing garbage collector (https://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29#Tracing_garbage_collectors), with the difference that the orphan objects are not freed but only reported via /sys/kernel/debug/kmemleak. A similar method is used by the -Valgrind tool (memcheck --leak-check) to detect the memory leaks in +Valgrind tool (``memcheck --leak-check``) to detect the memory leaks in user-space applications. Kmemleak is supported on x86, arm, powerpc, sparc, sh, microblaze, ppc, mips, s390, metag and tile. @@ -19,20 +16,20 @@ Usage CONFIG_DEBUG_KMEMLEAK in "Kernel hacking" has to be enabled. A kernel thread scans the memory every 10 minutes (by default) and prints the number of new unreferenced objects found. To display the details of all -the possible memory leaks: +the possible memory leaks:: # mount -t debugfs nodev /sys/kernel/debug/ # cat /sys/kernel/debug/kmemleak -To trigger an intermediate memory scan: +To trigger an intermediate memory scan:: # echo scan > /sys/kernel/debug/kmemleak -To clear the list of all current possible memory leaks: +To clear the list of all current possible memory leaks:: # echo clear > /sys/kernel/debug/kmemleak -New leaks will then come up upon reading /sys/kernel/debug/kmemleak +New leaks will then come up upon reading ``/sys/kernel/debug/kmemleak`` again. 
Note that the orphan objects are listed in the order they were allocated @@ -40,22 +37,31 @@ and one object at the beginning of the list may cause other subsequent objects to be reported as orphan. Memory scanning parameters can be modified at run-time by writing to the -/sys/kernel/debug/kmemleak file. The following parameters are supported: - - off - disable kmemleak (irreversible) - stack=on - enable the task stacks scanning (default) - stack=off - disable the tasks stacks scanning - scan=on - start the automatic memory scanning thread (default) - scan=off - stop the automatic memory scanning thread - scan= - set the automatic memory scanning period in seconds - (default 600, 0 to stop the automatic scanning) - scan - trigger a memory scan - clear - clear list of current memory leak suspects, done by - marking all current reported unreferenced objects grey, - or free all kmemleak objects if kmemleak has been disabled. - dump= - dump information about the object found at - -Kmemleak can also be disabled at boot-time by passing "kmemleak=off" on +``/sys/kernel/debug/kmemleak`` file. The following parameters are supported: + +- off + disable kmemleak (irreversible) +- stack=on + enable the task stacks scanning (default) +- stack=off + disable the tasks stacks scanning +- scan=on + start the automatic memory scanning thread (default) +- scan=off + stop the automatic memory scanning thread +- scan= + set the automatic memory scanning period in seconds + (default 600, 0 to stop the automatic scanning) +- scan + trigger a memory scan +- clear + clear list of current memory leak suspects, done by + marking all current reported unreferenced objects grey, + or free all kmemleak objects if kmemleak has been disabled. +- dump= + dump information about the object found at + +Kmemleak can also be disabled at boot-time by passing ``kmemleak=off`` on the kernel command line. 
Memory may be allocated or freed before kmemleak is initialised and @@ -63,13 +69,14 @@ these actions are stored in an early log buffer. The size of this buffer is configured via the CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE option. If CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF is enabled, kmemleak is -disabled by default. Passing "kmemleak=on" on the kernel command +disabled by default. Passing ``kmemleak=on`` on the kernel command line enables the function. Basic Algorithm --------------- -The memory allocations via kmalloc, vmalloc, kmem_cache_alloc and +The memory allocations via :c:func:`kmalloc`, :c:func:`vmalloc`, +:c:func:`kmem_cache_alloc` and friends are traced and the pointers, together with additional information like size and stack trace, are stored in a rbtree. The corresponding freeing function calls are tracked and the pointers @@ -113,13 +120,13 @@ when doing development. To work around these situations you can use the you can find new unreferenced objects; this should help with testing specific sections of code. -To test a critical section on demand with a clean kmemleak do: +To test a critical section on demand with a clean kmemleak do:: # echo clear > /sys/kernel/debug/kmemleak ... test your kernel or modules ... # echo scan > /sys/kernel/debug/kmemleak -Then as usual to get your report with: +Then as usual to get your report with:: # cat /sys/kernel/debug/kmemleak @@ -131,7 +138,7 @@ disabled by the user or due to a fatal error, internal kmemleak objects won't be freed when kmemleak is disabled, and those objects may occupy a large part of physical memory. -In this situation, you may reclaim memory with: +In this situation, you may reclaim memory with:: # echo clear > /sys/kernel/debug/kmemleak @@ -140,20 +147,20 @@ Kmemleak API See the include/linux/kmemleak.h header for the function prototypes.
-kmemleak_init - initialize kmemleak -kmemleak_alloc - notify of a memory block allocation -kmemleak_alloc_percpu - notify of a percpu memory block allocation -kmemleak_free - notify of a memory block freeing -kmemleak_free_part - notify of a partial memory block freeing -kmemleak_free_percpu - notify of a percpu memory block freeing -kmemleak_update_trace - update object allocation stack trace -kmemleak_not_leak - mark an object as not a leak -kmemleak_ignore - do not scan or report an object as leak -kmemleak_scan_area - add scan areas inside a memory block -kmemleak_no_scan - do not scan a memory block -kmemleak_erase - erase an old value in a pointer variable -kmemleak_alloc_recursive - as kmemleak_alloc but checks the recursiveness -kmemleak_free_recursive - as kmemleak_free but checks the recursiveness +- ``kmemleak_init`` - initialize kmemleak +- ``kmemleak_alloc`` - notify of a memory block allocation +- ``kmemleak_alloc_percpu`` - notify of a percpu memory block allocation +- ``kmemleak_free`` - notify of a memory block freeing +- ``kmemleak_free_part`` - notify of a partial memory block freeing +- ``kmemleak_free_percpu`` - notify of a percpu memory block freeing +- ``kmemleak_update_trace`` - update object allocation stack trace +- ``kmemleak_not_leak`` - mark an object as not a leak +- ``kmemleak_ignore`` - do not scan or report an object as leak +- ``kmemleak_scan_area`` - add scan areas inside a memory block +- ``kmemleak_no_scan`` - do not scan a memory block +- ``kmemleak_erase`` - erase an old value in a pointer variable +- ``kmemleak_alloc_recursive`` - as kmemleak_alloc but checks the recursiveness +- ``kmemleak_free_recursive`` - as kmemleak_free but checks the recursiveness Dealing with false positives/negatives -------------------------------------- diff --git a/Documentation/sparse.txt b/Documentation/dev-tools/sparse.rst similarity index 82% rename from Documentation/sparse.txt rename to Documentation/dev-tools/sparse.rst index 
eceab1308a8c2fbde6722232db18bbb57a6e7f2e..8c250e8a2105555b99fb6e647852c3300a9f61cf 100644 --- a/Documentation/sparse.txt +++ b/Documentation/dev-tools/sparse.rst @@ -1,11 +1,20 @@ -Copyright 2004 Linus Torvalds -Copyright 2004 Pavel Machek -Copyright 2006 Bob Copeland +.. Copyright 2004 Linus Torvalds +.. Copyright 2004 Pavel Machek +.. Copyright 2006 Bob Copeland + +Sparse +====== + +Sparse is a semantic checker for C programs; it can be used to find a +number of potential problems with kernel code. See +https://lwn.net/Articles/689907/ for an overview of sparse; this document +contains some kernel-specific sparse information. + Using sparse for typechecking -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +----------------------------- -"__bitwise" is a type attribute, so you have to do something like this: +"__bitwise" is a type attribute, so you have to do something like this:: typedef int __bitwise pm_request_t; @@ -20,13 +29,13 @@ but in this case we really _do_ want to force the conversion). And because the enum values are all the same type, now "enum pm_request" will be that type too. -And with gcc, all the __bitwise/__force stuff goes away, and it all ends -up looking just like integers to gcc. +And with gcc, all the "__bitwise"/"__force" stuff goes away, and it all +ends up looking just like integers to gcc. Quite frankly, you don't need the enum there. The above all really just boils down to one special "int __bitwise" type. -So the simpler way is to just do +So the simpler way is to just do:: typedef int __bitwise pm_request_t; @@ -50,7 +59,7 @@ __bitwise - noisy stuff; in particular, __le*/__be* are that. We really don't want to drown in noise unless we'd explicitly asked for it. Using sparse for lock checking -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +------------------------------ The following macros are undefined for gcc and defined during a sparse run to use the "context" tracking feature of sparse, applied to @@ -69,22 +78,22 @@ annotation is needed.
The three annotations above are for cases where sparse would otherwise report a context imbalance. Getting sparse -~~~~~~~~~~~~~~ +-------------- You can get the latest released versions from the Sparse homepage at https://sparse.wiki.kernel.org/index.php/Main_Page Alternatively, you can get snapshots of the latest development version -of sparse using git to clone.. +of sparse using git to clone:: git://git.kernel.org/pub/scm/devel/sparse/sparse.git -DaveJ has hourly generated tarballs of the git tree available at.. +DaveJ has hourly generated tarballs of the git tree available at:: http://www.codemonkey.org.uk/projects/git-snapshots/sparse/ -Once you have it, just do +Once you have it, just do:: make make install @@ -92,7 +101,7 @@ Once you have it, just do as a regular user, and it will install sparse in your ~/bin directory. Using sparse -~~~~~~~~~~~~ +------------ Do a kernel make with "make C=1" to run sparse on all the C files that get recompiled, or use "make C=2" to run sparse on the files whether they need to @@ -101,7 +110,7 @@ have already built it. The optional make variable CF can be used to pass arguments to sparse. The build system passes -Wbitwise to sparse automatically. To perform endianness -checks, you may define __CHECK_ENDIAN__: +checks, you may define __CHECK_ENDIAN__:: make C=2 CF="-D__CHECK_ENDIAN__" diff --git a/Documentation/dev-tools/tools.rst b/Documentation/dev-tools/tools.rst new file mode 100644 index 0000000000000000000000000000000000000000..824ae8e54dd5b421d434aba7e95d82d42955a2c6 --- /dev/null +++ b/Documentation/dev-tools/tools.rst @@ -0,0 +1,25 @@ +================================ +Development tools for the kernel +================================ + +This document is a collection of documents about development tools that can +be used to work on the kernel. For now, the documents have been pulled +together without any significant effort to integrate them into a coherent +whole; patches welcome! + +..
class:: toc-title + + Table of contents + +.. toctree:: + :maxdepth: 2 + + coccinelle + sparse + kcov + gcov + kasan + ubsan + kmemleak + kmemcheck + gdb-kernel-debugging diff --git a/Documentation/ubsan.txt b/Documentation/dev-tools/ubsan.rst similarity index 78% rename from Documentation/ubsan.txt rename to Documentation/dev-tools/ubsan.rst index f58215ef57976d112aec47905bdf1280bd059c71..655e6b63c2273e5e2d2e5629ac69f06f2eb0a843 100644 --- a/Documentation/ubsan.txt +++ b/Documentation/dev-tools/ubsan.rst @@ -1,7 +1,5 @@ -Undefined Behavior Sanitizer - UBSAN - -Overview --------- +The Undefined Behavior Sanitizer - UBSAN +======================================== UBSAN is a runtime undefined behaviour checker. @@ -10,11 +8,13 @@ Compiler inserts code that perform certain kinds of checks before operations that may cause UB. If check fails (i.e. UB detected) __ubsan_handle_* function called to print error message. -GCC has that feature since 4.9.x [1] (see -fsanitize=undefined option and -its suboptions). GCC 5.x has more checkers implemented [2]. +GCC has that feature since 4.9.x [1_] (see ``-fsanitize=undefined`` option and +its suboptions). GCC 5.x has more checkers implemented [2_]. Report example ---------------- +-------------- + +:: ================================================================================ UBSAN: Undefined behaviour in ../include/linux/bitops.h:110:33 @@ -47,29 +47,33 @@ Report example Usage ----- -To enable UBSAN configure kernel with: +To enable UBSAN configure kernel with:: CONFIG_UBSAN=y -and to check the entire kernel: +and to check the entire kernel:: CONFIG_UBSAN_SANITIZE_ALL=y To enable instrumentation for specific files or directories, add a line similar to the following to the respective kernel Makefile: - For a single file (e.g. main.o): - UBSAN_SANITIZE_main.o := y +- For a single file (e.g. 
main.o):: + + UBSAN_SANITIZE_main.o := y - For all files in one directory: - UBSAN_SANITIZE := y +- For all files in one directory:: + + UBSAN_SANITIZE := y To exclude files from being instrumented even if -CONFIG_UBSAN_SANITIZE_ALL=y, use: +``CONFIG_UBSAN_SANITIZE_ALL=y``, use:: + + UBSAN_SANITIZE_main.o := n + +and:: - UBSAN_SANITIZE_main.o := n - and: - UBSAN_SANITIZE := n + UBSAN_SANITIZE := n Detection of unaligned accesses controlled through the separate option - CONFIG_UBSAN_ALIGNMENT. It's off by default on architectures that support @@ -80,5 +84,5 @@ reports. References ---------- -[1] - https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Debugging-Options.html -[2] - https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html +.. _1: https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Debugging-Options.html +.. _2: https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html diff --git a/Documentation/gcov.txt b/Documentation/gcov.txt deleted file mode 100644 index 7b727783db7ed4f87a7c68b44b52054c62f48e85..0000000000000000000000000000000000000000 --- a/Documentation/gcov.txt +++ /dev/null @@ -1,257 +0,0 @@ -Using gcov with the Linux kernel -================================ - -1. Introduction -2. Preparation -3. Customization -4. Files -5. Modules -6. Separated build and test machines -7. Troubleshooting -Appendix A: sample script: gather_on_build.sh -Appendix B: sample script: gather_on_test.sh - - -1. Introduction -=============== - -gcov profiling kernel support enables the use of GCC's coverage testing -tool gcov [1] with the Linux kernel. Coverage data of a running kernel -is exported in gcov-compatible format via the "gcov" debugfs directory. -To get coverage data for a specific file, change to the kernel build -directory and use gcov with the -o option as follows (requires root): - -# cd /tmp/linux-out -# gcov -o /sys/kernel/debug/gcov/tmp/linux-out/kernel spinlock.c - -This will create source code files annotated with execution counts -in the current directory. 
In addition, graphical gcov front-ends such -as lcov [2] can be used to automate the process of collecting data -for the entire kernel and provide coverage overviews in HTML format. - -Possible uses: - -* debugging (has this line been reached at all?) -* test improvement (how do I change my test to cover these lines?) -* minimizing kernel configurations (do I need this option if the - associated code is never run?) - --- - -[1] http://gcc.gnu.org/onlinedocs/gcc/Gcov.html -[2] http://ltp.sourceforge.net/coverage/lcov.php - - -2. Preparation -============== - -Configure the kernel with: - - CONFIG_DEBUG_FS=y - CONFIG_GCOV_KERNEL=y - -select the gcc's gcov format, default is autodetect based on gcc version: - - CONFIG_GCOV_FORMAT_AUTODETECT=y - -and to get coverage data for the entire kernel: - - CONFIG_GCOV_PROFILE_ALL=y - -Note that kernels compiled with profiling flags will be significantly -larger and run slower. Also CONFIG_GCOV_PROFILE_ALL may not be supported -on all architectures. - -Profiling data will only become accessible once debugfs has been -mounted: - - mount -t debugfs none /sys/kernel/debug - - -3. Customization -================ - -To enable profiling for specific files or directories, add a line -similar to the following to the respective kernel Makefile: - - For a single file (e.g. main.o): - GCOV_PROFILE_main.o := y - - For all files in one directory: - GCOV_PROFILE := y - -To exclude files from being profiled even when CONFIG_GCOV_PROFILE_ALL -is specified, use: - - GCOV_PROFILE_main.o := n - and: - GCOV_PROFILE := n - -Only files which are linked to the main kernel image or are compiled as -kernel modules are supported by this mechanism. - - -4. Files -======== - -The gcov kernel support creates the following files in debugfs: - - /sys/kernel/debug/gcov - Parent directory for all gcov-related files. - - /sys/kernel/debug/gcov/reset - Global reset file: resets all coverage data to zero when - written to. 
- - /sys/kernel/debug/gcov/path/to/compile/dir/file.gcda - The actual gcov data file as understood by the gcov - tool. Resets file coverage data to zero when written to. - - /sys/kernel/debug/gcov/path/to/compile/dir/file.gcno - Symbolic link to a static data file required by the gcov - tool. This file is generated by gcc when compiling with - option -ftest-coverage. - - -5. Modules -========== - -Kernel modules may contain cleanup code which is only run during -module unload time. The gcov mechanism provides a means to collect -coverage data for such code by keeping a copy of the data associated -with the unloaded module. This data remains available through debugfs. -Once the module is loaded again, the associated coverage counters are -initialized with the data from its previous instantiation. - -This behavior can be deactivated by specifying the gcov_persist kernel -parameter: - - gcov_persist=0 - -At run-time, a user can also choose to discard data for an unloaded -module by writing to its data file or the global reset file. - - -6. Separated build and test machines -==================================== - -The gcov kernel profiling infrastructure is designed to work out-of-the -box for setups where kernels are built and run on the same machine. In -cases where the kernel runs on a separate machine, special preparations -must be made, depending on where the gcov tool is used: - -a) gcov is run on the TEST machine - -The gcov tool version on the test machine must be compatible with the -gcc version used for kernel build. Also the following files need to be -copied from build to test machine: - -from the source tree: - - all C source files + headers - -from the build tree: - - all C source files + headers - - all .gcda and .gcno files - - all links to directories - -It is important to note that these files need to be placed into the -exact same file system location on the test machine as on the build -machine. 
If any of the path components is symbolic link, the actual -directory needs to be used instead (due to make's CURDIR handling). - -b) gcov is run on the BUILD machine - -The following files need to be copied after each test case from test -to build machine: - -from the gcov directory in sysfs: - - all .gcda files - - all links to .gcno files - -These files can be copied to any location on the build machine. gcov -must then be called with the -o option pointing to that directory. - -Example directory setup on the build machine: - - /tmp/linux: kernel source tree - /tmp/out: kernel build directory as specified by make O= - /tmp/coverage: location of the files copied from the test machine - - [user@build] cd /tmp/out - [user@build] gcov -o /tmp/coverage/tmp/out/init main.c - - -7. Troubleshooting -================== - -Problem: Compilation aborts during linker step. -Cause: Profiling flags are specified for source files which are not - linked to the main kernel or which are linked by a custom - linker procedure. -Solution: Exclude affected source files from profiling by specifying - GCOV_PROFILE := n or GCOV_PROFILE_basename.o := n in the - corresponding Makefile. - -Problem: Files copied from sysfs appear empty or incomplete. -Cause: Due to the way seq_file works, some tools such as cp or tar - may not correctly copy files from sysfs. -Solution: Use 'cat' to read .gcda files and 'cp -d' to copy links. - Alternatively use the mechanism shown in Appendix B. 
- - -Appendix A: gather_on_build.sh -============================== - -Sample script to gather coverage meta files on the build machine -(see 6a): -#!/bin/bash - -KSRC=$1 -KOBJ=$2 -DEST=$3 - -if [ -z "$KSRC" ] || [ -z "$KOBJ" ] || [ -z "$DEST" ]; then - echo "Usage: $0 " >&2 - exit 1 -fi - -KSRC=$(cd $KSRC; printf "all:\n\t@echo \${CURDIR}\n" | make -f -) -KOBJ=$(cd $KOBJ; printf "all:\n\t@echo \${CURDIR}\n" | make -f -) - -find $KSRC $KOBJ \( -name '*.gcno' -o -name '*.[ch]' -o -type l \) -a \ - -perm /u+r,g+r | tar cfz $DEST -P -T - - -if [ $? -eq 0 ] ; then - echo "$DEST successfully created, copy to test system and unpack with:" - echo " tar xfz $DEST -P" -else - echo "Could not create file $DEST" -fi - - -Appendix B: gather_on_test.sh -============================= - -Sample script to gather coverage data files on the test machine -(see 6b): - -#!/bin/bash -e - -DEST=$1 -GCDA=/sys/kernel/debug/gcov - -if [ -z "$DEST" ] ; then - echo "Usage: $0 " >&2 - exit 1 -fi - -TEMPDIR=$(mktemp -d) -echo Collecting data.. 
-find $GCDA -type d -exec mkdir -p $TEMPDIR/\{\} \; -find $GCDA -name '*.gcda' -exec sh -c 'cat < $0 > '$TEMPDIR'/$0' {} \; -find $GCDA -name '*.gcno' -exec sh -c 'cp -d $0 '$TEMPDIR'/$0' {} \; -tar czf $DEST -C $TEMPDIR sys -rm -rf $TEMPDIR - -echo "$DEST successfully created, copy to build system and unpack with:" -echo " tar xfz $DEST" diff --git a/Documentation/index.rst b/Documentation/index.rst index a15f81855b39b1f270335feb5e7793bc4c86d827..05eded59820ecdc0341d821e5c12d0ed2e7f95fe 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -12,6 +12,7 @@ Contents: :maxdepth: 2 kernel-documentation + dev-tools/tools media/index gpu/index diff --git a/Documentation/kasan.txt b/Documentation/kasan.txt deleted file mode 100644 index d167220324c471ffea657c6b3b8df5e6740f62ea..0000000000000000000000000000000000000000 --- a/Documentation/kasan.txt +++ /dev/null @@ -1,171 +0,0 @@ -KernelAddressSanitizer (KASAN) -============================== - -0. Overview -=========== - -KernelAddressSANitizer (KASAN) is a dynamic memory error detector. It provides -a fast and comprehensive solution for finding use-after-free and out-of-bounds -bugs. - -KASAN uses compile-time instrumentation for checking every memory access, -therefore you will need a GCC version 4.9.2 or later. GCC 5.0 or later is -required for detection of out-of-bounds accesses to stack or global variables. - -Currently KASAN is supported only for x86_64 and arm64 architecture. - -1. Usage -======== - -To enable KASAN configure kernel with: - - CONFIG_KASAN = y - -and choose between CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE. Outline and -inline are compiler instrumentation types. The former produces smaller binary -the latter is 1.1 - 2 times faster. Inline instrumentation requires a GCC -version 5.0 or later. - -KASAN works with both SLUB and SLAB memory allocators. -For better bug detection and nicer reporting, enable CONFIG_STACKTRACE. 
- -To disable instrumentation for specific files or directories, add a line -similar to the following to the respective kernel Makefile: - - For a single file (e.g. main.o): - KASAN_SANITIZE_main.o := n - - For all files in one directory: - KASAN_SANITIZE := n - -1.1 Error reports -================= - -A typical out of bounds access report looks like this: - -================================================================== -BUG: AddressSanitizer: out of bounds access in kmalloc_oob_right+0x65/0x75 [test_kasan] at addr ffff8800693bc5d3 -Write of size 1 by task modprobe/1689 -============================================================================= -BUG kmalloc-128 (Not tainted): kasan error ------------------------------------------------------------------------------ - -Disabling lock debugging due to kernel taint -INFO: Allocated in kmalloc_oob_right+0x3d/0x75 [test_kasan] age=0 cpu=0 pid=1689 - __slab_alloc+0x4b4/0x4f0 - kmem_cache_alloc_trace+0x10b/0x190 - kmalloc_oob_right+0x3d/0x75 [test_kasan] - init_module+0x9/0x47 [test_kasan] - do_one_initcall+0x99/0x200 - load_module+0x2cb3/0x3b20 - SyS_finit_module+0x76/0x80 - system_call_fastpath+0x12/0x17 -INFO: Slab 0xffffea0001a4ef00 objects=17 used=7 fp=0xffff8800693bd728 flags=0x100000000004080 -INFO: Object 0xffff8800693bc558 @offset=1368 fp=0xffff8800693bc720 - -Bytes b4 ffff8800693bc548: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ -Object ffff8800693bc558: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk -Object ffff8800693bc568: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk -Object ffff8800693bc578: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk -Object ffff8800693bc588: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk -Object ffff8800693bc598: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk -Object ffff8800693bc5a8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk -Object 
ffff8800693bc5b8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk -Object ffff8800693bc5c8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk. -Redzone ffff8800693bc5d8: cc cc cc cc cc cc cc cc ........ -Padding ffff8800693bc718: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ -CPU: 0 PID: 1689 Comm: modprobe Tainted: G B 3.18.0-rc1-mm1+ #98 -Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 - ffff8800693bc000 0000000000000000 ffff8800693bc558 ffff88006923bb78 - ffffffff81cc68ae 00000000000000f3 ffff88006d407600 ffff88006923bba8 - ffffffff811fd848 ffff88006d407600 ffffea0001a4ef00 ffff8800693bc558 -Call Trace: - [] dump_stack+0x46/0x58 - [] print_trailer+0xf8/0x160 - [] ? kmem_cache_oob+0xc3/0xc3 [test_kasan] - [] object_err+0x35/0x40 - [] ? kmalloc_oob_right+0x65/0x75 [test_kasan] - [] kasan_report_error+0x38a/0x3f0 - [] ? kasan_poison_shadow+0x2f/0x40 - [] ? kasan_unpoison_shadow+0x14/0x40 - [] ? kasan_poison_shadow+0x2f/0x40 - [] ? kmem_cache_oob+0xc3/0xc3 [test_kasan] - [] __asan_store1+0x75/0xb0 - [] ? kmem_cache_oob+0x1d/0xc3 [test_kasan] - [] ? kmalloc_oob_right+0x65/0x75 [test_kasan] - [] kmalloc_oob_right+0x65/0x75 [test_kasan] - [] init_module+0x9/0x47 [test_kasan] - [] do_one_initcall+0x99/0x200 - [] ? __vunmap+0xec/0x160 - [] load_module+0x2cb3/0x3b20 - [] ? 
m_show+0x240/0x240 - [] SyS_finit_module+0x76/0x80 - [] system_call_fastpath+0x12/0x17 -Memory state around the buggy address: - ffff8800693bc300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc - ffff8800693bc380: fc fc 00 00 00 00 00 00 00 00 00 00 00 00 00 fc - ffff8800693bc400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc - ffff8800693bc480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc - ffff8800693bc500: fc fc fc fc fc fc fc fc fc fc fc 00 00 00 00 00 ->ffff8800693bc580: 00 00 00 00 00 00 00 00 00 00 03 fc fc fc fc fc - ^ - ffff8800693bc600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc - ffff8800693bc680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc - ffff8800693bc700: fc fc fc fc fb fb fb fb fb fb fb fb fb fb fb fb - ffff8800693bc780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb - ffff8800693bc800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb -================================================================== - -The header of the report describes what kind of bug happened and what kind of -access caused it. It's followed by the description of the accessed slub object -(see 'SLUB Debug output' section in Documentation/vm/slub.txt for details) and -the description of the accessed memory page. - -In the last section the report shows the memory state around the accessed address. -Reading this part requires some understanding of how KASAN works. - -The state of each 8 aligned bytes of memory is encoded in one shadow byte. -Those 8 bytes can be accessible, partially accessible, freed or be a redzone. -We use the following encoding for each shadow byte: 0 means that all 8 bytes -of the corresponding memory region are accessible; number N (1 <= N <= 7) means -that the first N bytes are accessible, and other (8 - N) bytes are not; -any negative value indicates that the entire 8-byte word is inaccessible. -We use different negative values to distinguish between different kinds of -inaccessible memory like redzones or freed memory (see mm/kasan/kasan.h).
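The shadow byte encoding described above can be sketched in C. This is a minimal illustration, not the kernel's implementation: the helper name shadow_allows_access() and its parameters are hypothetical, and the sketch assumes that the access does not cross an 8-byte boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): decide whether an access of
 * 'size' bytes starting 'offset' bytes into an aligned 8-byte region is
 * permitted by that region's shadow byte. Assumes offset + size <= 8.
 */
static bool shadow_allows_access(int8_t shadow, unsigned int offset,
				 unsigned int size)
{
	if (shadow == 0)
		return true;	/* all 8 bytes are accessible */
	if (shadow < 0)
		return false;	/* redzone, freed memory, ... */
	/* 1 <= shadow <= 7: only the first 'shadow' bytes are accessible */
	return offset + size <= (unsigned int)shadow;
}
```

For a shadow byte of 03, such as the one marked in the report above, shadow_allows_access(3, 0, 1) is true while shadow_allows_access(3, 3, 1) is false, because only the first three bytes of that 8-byte region are accessible.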
- -In the report above the arrows point to the shadow byte 03, which means that -the accessed address is partially accessible. - - -2. Implementation details -========================= - -From a high level, our approach to memory error detection is similar to that -of kmemcheck: use shadow memory to record whether each byte of memory is safe -to access, and use compile-time instrumentation to check shadow memory on each -memory access. - -AddressSanitizer dedicates 1/8 of kernel memory to its shadow memory -(e.g. 16TB to cover 128TB on x86_64) and uses direct mapping with a scale and -offset to translate a memory address to its corresponding shadow address. - -Here is the function which translates an address to its corresponding shadow -address: - -static inline void *kasan_mem_to_shadow(const void *addr) -{ - return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT) - + KASAN_SHADOW_OFFSET; -} - -where KASAN_SHADOW_SCALE_SHIFT = 3. - -Compile-time instrumentation is used for checking memory accesses. The compiler inserts -function calls (__asan_load*(addr), __asan_store*(addr)) before each memory -access of size 1, 2, 4, 8 or 16. These functions check whether a memory access is -valid by checking the corresponding shadow memory. - -GCC 5.0 can perform inline instrumentation. Instead of making -function calls, GCC directly inserts the code to check the shadow memory. -This option significantly enlarges the kernel, but it gives a x1.1-x2 performance -boost over an outline-instrumented kernel. diff --git a/Documentation/kmemcheck.txt b/Documentation/kmemcheck.txt deleted file mode 100644 index 80aae85d8da6c1b8476fd6824553ae7070e5c508..0000000000000000000000000000000000000000 --- a/Documentation/kmemcheck.txt +++ /dev/null @@ -1,754 +0,0 @@ -GETTING STARTED WITH KMEMCHECK -============================== - -Vegard Nossum - - -Contents -======== -0. Introduction -1. Downloading -2. Configuring and compiling -3. How to use -3.1. Booting -3.2. Run-time enable/disable -3.3. 
Debugging -3.4. Annotating false positives -4. Reporting errors -5. Technical description - - -0. Introduction -=============== - -kmemcheck is a debugging feature for the Linux Kernel. More specifically, it -is a dynamic checker that detects and warns about some uses of uninitialized -memory. - -Userspace programmers might be familiar with Valgrind's memcheck. The main -difference between memcheck and kmemcheck is that memcheck works for userspace -programs only, and kmemcheck works for the kernel only. The implementations -are of course vastly different. Because of this, kmemcheck is not as accurate -as memcheck, but it turns out to be good enough in practice to discover real -programmer errors that the compiler is not able to find through static -analysis. - -Enabling kmemcheck on a kernel will probably slow it down to the extent that -the machine will not be usable for normal workloads such as e.g. an -interactive desktop. kmemcheck will also cause the kernel to use about twice -as much memory as normal. For this reason, kmemcheck is strictly a debugging -feature. - - -1. Downloading -============== - -As of version 2.6.31-rc1, kmemcheck is included in the mainline kernel. - - -2. Configuring and compiling -============================ - -kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of -configuration variables must have specific settings in order for the kmemcheck -menu to even appear in "menuconfig". These are: - - o CONFIG_CC_OPTIMIZE_FOR_SIZE=n - - This option is located under "General setup" / "Optimize for size". - - Without this, gcc will use certain optimizations that usually lead to - false positive warnings from kmemcheck. An example of this is a 16-bit - field in a struct, where gcc may load 32 bits, then discard the upper - 16 bits. kmemcheck sees only the 32-bit load, and may trigger a - warning for the upper 16 bits (if they're uninitialized). 
- - o CONFIG_SLAB=y or CONFIG_SLUB=y - - This option is located under "General setup" / "Choose SLAB - allocator". - - o CONFIG_FUNCTION_TRACER=n - - This option is located under "Kernel hacking" / "Tracers" / "Kernel - Function Tracer" - - When function tracing is compiled in, gcc emits a call to another - function at the beginning of every function. This means that when the - page fault handler is called, the ftrace framework will be called - before kmemcheck has had a chance to handle the fault. If ftrace then - modifies memory that was tracked by kmemcheck, the result is an - endless recursive page fault. - - o CONFIG_DEBUG_PAGEALLOC=n - - This option is located under "Kernel hacking" / "Memory Debugging" - / "Debug page memory allocations". - -In addition, I highly recommend turning on CONFIG_DEBUG_INFO=y. This is also -located under "Kernel hacking". With this, you will be able to get line number -information from the kmemcheck warnings, which is extremely valuable in -debugging a problem. This option is not mandatory, however, because it slows -down the compilation process and produces a much bigger kernel image. - -Now the kmemcheck menu should be visible (under "Kernel hacking" / "Memory -Debugging" / "kmemcheck: trap use of uninitialized memory"). Here follows -a description of the kmemcheck configuration variables: - - o CONFIG_KMEMCHECK - - This must be enabled in order to use kmemcheck at all... - - o CONFIG_KMEMCHECK_[DISABLED | ENABLED | ONESHOT]_BY_DEFAULT - - This option controls the status of kmemcheck at boot-time. "Enabled" - will enable kmemcheck right from the start, "disabled" will boot the - kernel as normal (but with the kmemcheck code compiled in, so it can - be enabled at run-time after the kernel has booted), and "one-shot" is - a special mode which will turn kmemcheck off automatically after - detecting the first use of uninitialized memory. 
- - If you are using kmemcheck to actively debug a problem, then you - probably want to choose "enabled" here. - - The one-shot mode is mostly useful in automated test setups because it - can prevent floods of warnings and increase the chances of the machine - surviving in case something is really wrong. In other cases, the one- - shot mode could actually be counter-productive because it would turn - itself off at the very first error -- in the case of a false positive - too -- and this would come in the way of debugging the specific - problem you were interested in. - - If you would like to use your kernel as normal, but with a chance to - enable kmemcheck in case of some problem, it might be a good idea to - choose "disabled" here. When kmemcheck is disabled, most of the run- - time overhead is not incurred, and the kernel will be almost as fast - as normal. - - o CONFIG_KMEMCHECK_QUEUE_SIZE - - Select the maximum number of error reports to store in an internal - (fixed-size) buffer. Since errors can occur virtually anywhere and in - any context, we need a temporary storage area which is guaranteed not - to generate any other page faults when accessed. The queue will be - emptied as soon as a tasklet may be scheduled. If the queue is full, - new error reports will be lost. - - The default value of 64 is probably fine. If some code produces more - than 64 errors within an irqs-off section, then the code is likely to - produce many, many more, too, and these additional reports seldom give - any more information (the first report is usually the most valuable - anyway). - - This number might have to be adjusted if you are not using serial - console or similar to capture the kernel log. If you are using the - "dmesg" command to save the log, then getting a lot of kmemcheck - warnings might overflow the kernel log itself, and the earlier reports - will get lost in that way instead. Try setting this to 10 or so on - such a setup. 
- - o CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT - - Select the number of shadow bytes to save along with each entry of the - error-report queue. These bytes indicate what parts of an allocation - are initialized, uninitialized, etc. and will be displayed when an - error is detected to help the debugging of a particular problem. - - The number entered here is actually the logarithm of the number of - bytes that will be saved. So if you pick for example 5 here, kmemcheck - will save 2^5 = 32 bytes. - - The default value should be fine for debugging most problems. It also - fits nicely within 80 columns. - - o CONFIG_KMEMCHECK_PARTIAL_OK - - This option (when enabled) works around certain GCC optimizations that - produce 32-bit reads from 16-bit variables where the upper 16 bits are - thrown away afterwards. - - The default value (enabled) is recommended. This may of course hide - some real errors, but disabling it would probably produce a lot of - false positives. - - o CONFIG_KMEMCHECK_BITOPS_OK - - This option silences warnings that would be generated for bit-field - accesses where not all the bits are initialized at the same time. This - may also hide some real bugs. - - This option is probably obsolete, or it should be replaced with - the kmemcheck-/bitfield-annotations for the code in question. The - default value is therefore fine. - -Now compile the kernel as usual. - - -3. How to use -============= - -3.1. Booting -============ - -First some information about the command-line options. There is only one -option specific to kmemcheck, and this is called "kmemcheck". It can be used -to override the default mode as chosen by the CONFIG_KMEMCHECK_*_BY_DEFAULT -option. Its possible settings are: - - o kmemcheck=0 (disabled) - o kmemcheck=1 (enabled) - o kmemcheck=2 (one-shot mode) - -If SLUB debugging has been enabled in the kernel, it may take precedence over -kmemcheck in such a way that the slab caches which are under SLUB debugging -will not be tracked by kmemcheck. 
In order to ensure that this doesn't happen -(even though it shouldn't by default), use SLUB's boot option "slub_debug", -like this: slub_debug=- - -In fact, this option may also be used for fine-grained control over SLUB vs. -kmemcheck. For example, if the command line includes "kmemcheck=1 -slub_debug=,dentry", then SLUB debugging will be used only for the "dentry" -slab cache, and with kmemcheck tracking all the other caches. This is advanced -usage, however, and is not generally recommended. - - -3.2. Run-time enable/disable -============================ - -When the kernel has booted, it is possible to enable or disable kmemcheck at -run-time. WARNING: This feature is still experimental and may cause false -positive warnings to appear. Therefore, try not to use this. If you find that -it doesn't work properly (e.g. you see an unreasonable amount of warnings), I -will be happy to take bug reports. - -Use the file /proc/sys/kernel/kmemcheck for this purpose, e.g.: - - $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck - -The numbers are the same as for the kmemcheck= command-line option. - - -3.3. 
Debugging -============== - -A typical report will look something like this: - -WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) -80000000000000000000000000000000000000000088ffff0000000000000000 - i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u - ^ - -Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A -RIP: 0010:[] [] __dequeue_signal+0xc8/0x190 -RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002 -RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 -RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84 -RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000 -R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e -R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8 -FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000 -CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 -CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0 -DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 -DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 - [] dequeue_signal+0x8e/0x170 - [] get_signal_to_deliver+0x98/0x390 - [] do_notify_resume+0xad/0x7d0 - [] int_signal+0x12/0x17 - [] 0xffffffffffffffff - -The single most valuable information in this report is the RIP (or EIP on 32- -bit) value. This will help us pinpoint exactly which instruction that caused -the warning. - -If your kernel was compiled with CONFIG_DEBUG_INFO=y, then all we have to do -is give this address to the addr2line program, like this: - - $ addr2line -e vmlinux -i ffffffff8104ede8 - arch/x86/include/asm/string_64.h:12 - include/asm-generic/siginfo.h:287 - kernel/signal.c:380 - kernel/signal.c:410 - -The "-e vmlinux" tells addr2line which file to look in. IMPORTANT: This must -be the vmlinux of the kernel that produced the warning in the first place! If -not, the line number information will almost certainly be wrong. 
- -The "-i" tells addr2line to also print the line numbers of inlined functions. -In this case, the flag was very important, because otherwise, it would only -have printed the first line, which is just a call to memcpy(), which could be -called from a thousand places in the kernel, and is therefore not very useful. -These inlined functions would not show up in the stack trace above, simply -because the kernel doesn't load the extra debugging information. This -technique can of course be used with ordinary kernel oopses as well. - -In this case, it's the caller of memcpy() that is interesting, and it can be -found in include/asm-generic/siginfo.h, line 287: - -281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from) -282 { -283 if (from->si_code < 0) -284 memcpy(to, from, sizeof(*to)); -285 else -286 /* _sigchld is currently the largest know union member */ -287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld)); -288 } - -Since this was a read (kmemcheck usually warns about reads only, though it can -warn about writes to unallocated or freed memory as well), it was probably the -"from" argument which contained some uninitialized bytes. Following the chain -of calls, we move upwards to see where "from" was allocated or initialized, -kernel/signal.c, line 380: - -359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info) -360 { -... -367 list_for_each_entry(q, &list->list, list) { -368 if (q->info.si_signo == sig) { -369 if (first) -370 goto still_pending; -371 first = q; -... -377 if (first) { -378 still_pending: -379 list_del_init(&first->list); -380 copy_siginfo(info, &first->info); -381 __sigqueue_free(first); -... -392 } -393 } - -Here, it is &first->info that is being passed on to copy_siginfo(). The -variable "first" was found on a list -- passed in as the second argument to -collect_signal(). 
We continue our journey through the stack, to figure out -where the item on "list" was allocated or initialized. We move to line 410: - -395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask, -396 siginfo_t *info) -397 { -... -410 collect_signal(sig, pending, info); -... -414 } - -Now we need to follow the "pending" pointer, since that is being passed on to -collect_signal() as "list". At this point, we've run out of lines from the -"addr2line" output. Not to worry, we just paste the next addresses from the -kmemcheck stack dump, i.e.: - - [] dequeue_signal+0x8e/0x170 - [] get_signal_to_deliver+0x98/0x390 - [] do_notify_resume+0xad/0x7d0 - [] int_signal+0x12/0x17 - - $ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \ - ffffffff8100b87d ffffffff8100c7b5 - kernel/signal.c:446 - kernel/signal.c:1806 - arch/x86/kernel/signal.c:805 - arch/x86/kernel/signal.c:871 - arch/x86/kernel/entry_64.S:694 - -Remember that since these addresses were found on the stack and not as the -RIP value, they actually point to the _next_ instruction (they are return -addresses). This becomes obvious when we look at the code for line 446: - -422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) -423 { -... -431 signr = __dequeue_signal(&tsk->signal->shared_pending, -432 mask, info); -433 /* -434 * itimer signal ? -435 * -436 * itimers are process shared and we restart periodic -437 * itimers in the signal delivery path to prevent DoS -438 * attacks in the high resolution timer case. This is -439 * compliant with the old way of self restarting -440 * itimers, as the SIGALRM is a legacy signal and only -441 * queued once. Changing the restart behaviour to -442 * restart the timer in the signal dequeue path is -443 * reducing the timer noise on heavy loaded !highres -444 * systems too. -445 */ -446 if (unlikely(signr == SIGALRM)) { -... 
-489 } - -So instead of looking at 446, we should be looking at 431, which is the line -that executes just before 446. Here we see that what we are looking for is -&tsk->signal->shared_pending. - -Our next task is now to figure out which function that puts items on this -"shared_pending" list. A crude, but efficient tool, is git grep: - - $ git grep -n 'shared_pending' kernel/ - ... - kernel/signal.c:828: pending = group ? &t->signal->shared_pending : &t->pending; - kernel/signal.c:1339: pending = group ? &t->signal->shared_pending : &t->pending; - ... - -There were more results, but none of them were related to list operations, -and these were the only assignments. We inspect the line numbers more closely -and find that this is indeed where items are being added to the list: - -816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, -817 int group) -818 { -... -828 pending = group ? &t->signal->shared_pending : &t->pending; -... -851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && -852 (is_si_special(info) || -853 info->si_code >= 0))); -854 if (q) { -855 list_add_tail(&q->list, &pending->list); -... -890 } - -and: - -1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group) -1310 { -.... -1339 pending = group ? &t->signal->shared_pending : &t->pending; -1340 list_add_tail(&q->list, &pending->list); -.... -1347 } - -In the first case, the list element we are looking for, "q", is being returned -from the function __sigqueue_alloc(), which looks like an allocation function. 
-Let's take a look at it: - -187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, -188 int override_rlimit) -189 { -190 struct sigqueue *q = NULL; -191 struct user_struct *user; -192 -193 /* -194 * We won't get problems with the target's UID changing under us -195 * because changing it requires RCU be used, and if t != current, the -196 * caller must be holding the RCU readlock (by way of a spinlock) and -197 * we use RCU protection here -198 */ -199 user = get_uid(__task_cred(t)->user); -200 atomic_inc(&user->sigpending); -201 if (override_rlimit || -202 atomic_read(&user->sigpending) <= -203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur) -204 q = kmem_cache_alloc(sigqueue_cachep, flags); -205 if (unlikely(q == NULL)) { -206 atomic_dec(&user->sigpending); -207 free_uid(user); -208 } else { -209 INIT_LIST_HEAD(&q->list); -210 q->flags = 0; -211 q->user = user; -212 } -213 -214 return q; -215 } - -We see that this function initializes q->list, q->flags, and q->user. It seems -that now is the time to look at the definition of "struct sigqueue", e.g.: - -14 struct sigqueue { -15 struct list_head list; -16 int flags; -17 siginfo_t info; -18 struct user_struct *user; -19 }; - -And, you might remember, it was a memcpy() on &first->info that caused the -warning, so this makes perfect sense. It also seems reasonable to assume that -it is the caller of __sigqueue_alloc() that has the responsibility of filling -out (initializing) this member. - -But just which fields of the struct were uninitialized? Let's look at -kmemcheck's report again: - -WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) -80000000000000000000000000000000000000000088ffff0000000000000000 - i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u - ^ - -These first two lines are the memory dump of the memory object itself, and the -shadow bytemap, respectively. The memory object itself is in this case -&first->info. 
Just beware that the start of this dump is NOT the start of the -object itself! The position of the caret (^) corresponds with the address of -the read (ffff88003e4a2024). - -The shadow bytemap dump legend is as follows: - - i - initialized - u - uninitialized - a - unallocated (memory has been allocated by the slab layer, but has not - yet been handed off to anybody) - f - freed (memory has been allocated by the slab layer, but has been freed - by the previous owner) - -In order to figure out where (relative to the start of the object) the -uninitialized memory was located, we have to look at the disassembly. For -that, we'll need the RIP address again: - -RIP: 0010:[] [] __dequeue_signal+0xc8/0x190 - - $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8: - ffffffff8104edc8: mov %r8,0x8(%r8) - ffffffff8104edcc: test %r10d,%r10d - ffffffff8104edcf: js ffffffff8104ee88 <__dequeue_signal+0x168> - ffffffff8104edd5: mov %rax,%rdx - ffffffff8104edd8: mov $0xc,%ecx - ffffffff8104eddd: mov %r13,%rdi - ffffffff8104ede0: mov $0x30,%eax - ffffffff8104ede5: mov %rdx,%rsi - ffffffff8104ede8: rep movsl %ds:(%rsi),%es:(%rdi) - ffffffff8104edea: test $0x2,%al - ffffffff8104edec: je ffffffff8104edf0 <__dequeue_signal+0xd0> - ffffffff8104edee: movsw %ds:(%rsi),%es:(%rdi) - ffffffff8104edf0: test $0x1,%al - ffffffff8104edf2: je ffffffff8104edf5 <__dequeue_signal+0xd5> - ffffffff8104edf4: movsb %ds:(%rsi),%es:(%rdi) - ffffffff8104edf5: mov %r8,%rdi - ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free> - -As expected, it's the "rep movsl" instruction from the memcpy() that causes -the warning. We know about REP MOVSL that it uses the register RCX to count -the number of remaining iterations. 
By taking a look at the register dump -again (from the kmemcheck report), we can figure out how many bytes were left -to copy: - -RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 - -By looking at the disassembly, we also see that %ecx is being loaded with the -value $0xc just before (ffffffff8104edd8), so we are very lucky. Keep in mind -that this is the number of iterations, not bytes. And since this is a "long" -operation, we need to multiply by 4 to get the number of bytes. So this means -that the uninitialized value was encountered at 4 * (0xc - 0x9) = 12 bytes -from the start of the object. - -We can now try to figure out which field of the "struct siginfo" was not -initialized. This is the beginning of the struct: - -40 typedef struct siginfo { -41 int si_signo; -42 int si_errno; -43 int si_code; -44 -45 union { -.. -92 } _sifields; -93 } siginfo_t; - -On 64-bit, the int is 4 bytes long, so it must be the union member that has -not been initialized. We can verify this using gdb: - - $ gdb vmlinux - ... - (gdb) p &((struct siginfo *) 0)->_sifields - $1 = (union {...} *) 0x10 - -Actually, it seems that the union member is located at offset 0x10 -- which -means that gcc has inserted 4 bytes of padding between the members si_code -and _sifields. We can now get a fuller picture of the memory dump: - - _----------------------------=> si_code - / _--------------------=> (padding) - | / _------------=> _sifields(._kill._pid) - | | / _----=> _sifields(._kill._uid) - | | | / --------|-------|-------|-------| -80000000000000000000000000000000000000000088ffff0000000000000000 - i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u - -This allows us to realize another important fact: si_code contains the value -0x80. Remember that x86 is little endian, so the first 4 bytes "80000000" are -really the number 0x00000080. 
With a bit of research, we find that this is -actually the constant SI_KERNEL defined in include/asm-generic/siginfo.h: - -144 #define SI_KERNEL 0x80 /* sent by the kernel from somewhere */ - -This macro is used in exactly one place in the x86 kernel: In send_signal() -in kernel/signal.c: - -816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, -817 int group) -818 { -... -828 pending = group ? &t->signal->shared_pending : &t->pending; -... -851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && -852 (is_si_special(info) || -853 info->si_code >= 0))); -854 if (q) { -855 list_add_tail(&q->list, &pending->list); -856 switch ((unsigned long) info) { -... -865 case (unsigned long) SEND_SIG_PRIV: -866 q->info.si_signo = sig; -867 q->info.si_errno = 0; -868 q->info.si_code = SI_KERNEL; -869 q->info.si_pid = 0; -870 q->info.si_uid = 0; -871 break; -... -890 } - -Not only does this match with the .si_code member, it also matches the place -we found earlier when looking for where siginfo_t objects are enqueued on the -"shared_pending" list. - -So to sum up: It seems that it is the padding introduced by the compiler -between two struct fields that is uninitialized, and this gets reported when -we do a memcpy() on the struct. This means that we have identified a false -positive warning. - -Normally, kmemcheck will not report uninitialized accesses in memcpy() calls -when both the source and destination addresses are tracked. (Instead, we copy -the shadow bytemap as well). In this case, the destination address clearly -was not tracked. We can dig a little deeper into the stack trace from above: - - arch/x86/kernel/signal.c:805 - arch/x86/kernel/signal.c:871 - arch/x86/kernel/entry_64.S:694 - -And we clearly see that the destination siginfo object is located on the -stack: - -782 static void do_signal(struct pt_regs *regs) -783 { -784 struct k_sigaction ka; -785 siginfo_t info; -... -804 signr = get_signal_to_deliver(&info, &ka, regs, NULL); -... 
-854 } - -And this &info is what eventually gets passed to copy_siginfo() as the -destination argument. - -Now, even though we didn't find an actual error here, the example is still a -good one, because it shows how one would go about finding out what the report -was all about. - - -3.4. Annotating false positives -=============================== - -There are a few different ways to make annotations in the source code that -will keep kmemcheck from checking and reporting certain allocations. Here -they are: - - o __GFP_NOTRACK_FALSE_POSITIVE - - This flag can be passed to kmalloc() or kmem_cache_alloc() (therefore - also to other functions that end up calling one of these) to indicate - that the allocation should not be tracked because it would lead to - a false positive report. This is a "big hammer" way of silencing - kmemcheck; after all, even if the false positive pertains to - a particular field in a struct, for example, we will now lose the - ability to find (real) errors in other parts of the same struct. - - Example: - - /* No warnings will ever trigger on accessing any part of x */ - x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE); - - o kmemcheck_bitfield_begin(name)/kmemcheck_bitfield_end(name) and - kmemcheck_annotate_bitfield(ptr, name) - - The first two of these three macros can be used inside struct - definitions to signal, respectively, the beginning and end of a - bitfield. Additionally, this will assign the bitfield a name, which - is given as an argument to the macros. - - Having used these markers, one can later use - kmemcheck_annotate_bitfield() at the point of allocation, to indicate - which parts of the allocation are part of a bitfield. 
- - Example: - - struct foo { - int x; - - kmemcheck_bitfield_begin(flags); - int flag_a:1; - int flag_b:1; - kmemcheck_bitfield_end(flags); - - int y; - }; - - struct foo *x = kmalloc(sizeof *x); - - /* No warnings will trigger on accessing the bitfield of x */ - kmemcheck_annotate_bitfield(x, flags); - - Note that kmemcheck_annotate_bitfield() can be used even before the - return value of kmalloc() is checked -- in other words, passing NULL - as the first argument is legal (and will do nothing). - - -4. Reporting errors -=================== - -As we have seen, kmemcheck will produce false positive reports. Therefore, it -is not very wise to blindly post kmemcheck warnings to mailing lists and -maintainers. Instead, I encourage maintainers and developers to find errors -in their own code. If you get a warning, you can try to work around it, try -to figure out if it's a real error or not, or simply ignore it. Most -developers know their own code and will quickly and efficiently determine the -root cause of a kmemcheck report. This is therefore also the most efficient -way to work with kmemcheck. - -That said, we (the kmemcheck maintainers) will always be on the lookout for -false positives that we can annotate and silence. So whatever you find, -please drop us a note privately! Kernel configs and steps to reproduce (if -available) are of course a great help too. - -Happy hacking! - - -5. Technical description -======================== - -kmemcheck works by marking memory pages non-present. This means that whenever -somebody attempts to access the page, a page fault is generated. The page -fault handler notices that the page was in fact only hidden, and so it calls -on the kmemcheck code to make further investigations. - -When the investigations are completed, kmemcheck "shows" the page by marking -it present (as it would be under normal circumstances). This way, the -interrupted code can continue as usual. 
- -But after the instruction has been executed, we should hide the page again, so -that we can catch the next access too! Now kmemcheck makes use of a debugging -feature of the processor, namely single-stepping. When the processor has -finished the one instruction that generated the memory access, a debug -exception is raised. From here, we simply hide the page again and continue -execution, this time with the single-stepping feature turned off. - -kmemcheck requires some assistance from the memory allocator in order to work. -The memory allocator needs to - - 1. Tell kmemcheck about newly allocated pages and pages that are about to - be freed. This allows kmemcheck to set up and tear down the shadow memory - for the pages in question. The shadow memory stores the status of each - byte in the allocation proper, e.g. whether it is initialized or - uninitialized. - - 2. Tell kmemcheck which parts of memory should be marked uninitialized. - There are actually a few more states, such as "not yet allocated" and - "recently freed". - -If a slab cache is set up using the SLAB_NOTRACK flag, it will never return -memory that can take page faults because of kmemcheck. - -If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still -request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags. -This does not prevent the page faults from occurring, however, but marks the -object in question as being initialized so that no warnings will ever be -produced for this object. - -Currently, the SLAB and SLUB allocators are supported by kmemcheck. 
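Note on the removed "Technical description" text above: the shadow-memory bookkeeping it describes can be illustrated with a small userspace sketch. This is not kmemcheck's implementation. The real code intercepts accesses through page faults and single-stepping, whereas this sketch uses explicit accessor functions. Every name here (pool, shadow, shadow_mark_allocated, shadow_mark_freed, tracked_write, tracked_read, and the state enum) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-byte states, loosely following the ones the text names. */
enum shadow_state {
	SHADOW_UNALLOCATED = 0,	/* "not yet allocated" */
	SHADOW_UNINITIALIZED,	/* allocated, never written */
	SHADOW_INITIALIZED,	/* allocated and written */
	SHADOW_FREED,		/* "recently freed" */
};

#define POOL_SIZE 64

static unsigned char pool[POOL_SIZE];	/* the tracked memory itself */
static unsigned char shadow[POOL_SIZE];	/* one state byte per pool byte */

/* Precondition helper: the range [off, off + len) lies inside the pool. */
static int in_range(size_t off, size_t len)
{
	return off <= POOL_SIZE && len <= POOL_SIZE - off;
}

/* Allocator hook: freshly allocated bytes start out uninitialized. */
static void shadow_mark_allocated(size_t off, size_t len)
{
	assert(in_range(off, len));
	memset(shadow + off, SHADOW_UNINITIALIZED, len);
}

/* Allocator hook: freed bytes must no longer be read. */
static void shadow_mark_freed(size_t off, size_t len)
{
	assert(in_range(off, len));
	memset(shadow + off, SHADOW_FREED, len);
}

/* A write stores the data and marks the written bytes initialized. */
static void tracked_write(size_t off, const void *src, size_t len)
{
	assert(in_range(off, len));
	memcpy(pool + off, src, len);
	memset(shadow + off, SHADOW_INITIALIZED, len);
}

/*
 * A read copies the data out and returns how many of the bytes read were
 * not in the initialized state. A real checker would emit a report here.
 */
static size_t tracked_read(void *dst, size_t off, size_t len)
{
	size_t i, bad = 0;

	assert(in_range(off, len));
	for (i = 0; i < len; i++)
		if (shadow[off + i] != SHADOW_INITIALIZED)
			bad++;
	memcpy(dst, pool + off, len);
	return bad;
}
```

In this sketch, reading a range that was allocated but only partly written returns the number of bytes never written. This is the same situation as the struct-padding false positive discussed earlier in the removed text.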
diff --git a/MAINTAINERS b/MAINTAINERS index 20bb1d00098c70dacad7a9c778087f9319b0c5c6..810723537aa5e3dd71261a18bb11d536ca00fa5f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3124,7 +3124,7 @@ L: cocci@systeme.lip6.fr (moderated for non-subscribers) T: git git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild.git misc W: http://coccinelle.lip6.fr/ S: Supported -F: Documentation/coccinelle.txt +F: Documentation/dev-tools/coccinelle.rst F: scripts/coccinelle/ F: scripts/coccicheck @@ -5118,7 +5118,7 @@ GCOV BASED KERNEL PROFILING M: Peter Oberparleiter S: Maintained F: kernel/gcov/ -F: Documentation/gcov.txt +F: Documentation/dev-tools/gcov.rst GDT SCSI DISK ARRAY CONTROLLER DRIVER M: Achim Leubner @@ -6587,7 +6587,7 @@ L: kasan-dev@googlegroups.com S: Maintained F: arch/*/include/asm/kasan.h F: arch/*/mm/kasan_init* -F: Documentation/kasan.txt +F: Documentation/dev-tools/kasan.rst F: include/linux/kasan*.h F: lib/test_kasan.c F: mm/kasan/ @@ -6803,7 +6803,7 @@ KMEMCHECK M: Vegard Nossum M: Pekka Enberg S: Maintained -F: Documentation/kmemcheck.txt +F: Documentation/dev-tools/kmemcheck.rst F: arch/x86/include/asm/kmemcheck.h F: arch/x86/mm/kmemcheck/ F: include/linux/kmemcheck.h @@ -6812,7 +6812,7 @@ F: mm/kmemcheck.c KMEMLEAK M: Catalin Marinas S: Maintained -F: Documentation/kmemleak.txt +F: Documentation/dev-tools/kmemleak.rst F: include/linux/kmemleak.h F: mm/kmemleak.c F: mm/kmemleak-test.c