1. 06 Nov 2015, 15 commits
  2. 04 Nov 2015, 8 commits
    • audit: make audit_log_common_recv_msg() a void function · 233a6866
      Paul Moore authored
      It always returns zero, and no caller checks the return value.
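
      A minimal sketch of the change's shape (the signature is abbreviated;
      treat the details as illustrative rather than verbatim):

          /* before: the result was always 0 and nobody looked at it */
          static int audit_log_common_recv_msg(struct audit_buffer **ab,
                                               u16 msg_type);

          /* after: drop the meaningless return value */
          static void audit_log_common_recv_msg(struct audit_buffer **ab,
                                                u16 msg_type);
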
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • audit: removing unused variable · c5ea6efd
      Saurabh Sengar authored
      The variable rc is not required: it is only used, unchanged, for the
      return value, and the function always returns 0.
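
      The shape of the cleanup, as a sketch (the function name here is
      hypothetical):

          static int audit_sketch(void)
          {
              int rc = 0;      /* never modified below */
              /* ... work that cannot fail ... */
              return rc;       /* equivalent to: return 0; */
          }
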
      Signed-off-by: Saurabh Sengar <saurabh.truth@gmail.com>
      [PM: fixed spelling errors in description]
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • audit: fix comment block whitespace · 725131ef
      Scott Matheina authored
      Signed-off-by: Scott Matheina <scott@matheina.com>
      [PM: fixed subject line]
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • audit: audit_tree_match can be boolean · 6f1b5d7a
      Yaowei Bai authored
      This patch makes audit_tree_match() return bool to improve
      readability, since the function only ever returns one or zero.
      
      No functional change.
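
      A sketch of the conversion (the body is an assumption about the
      function's actual logic; the same shape applies to the
      audit_string_contains_control() change below):

          /* true iff the chunk has an owner entry for this tree */
          bool audit_tree_match(struct audit_chunk *chunk,
                                struct audit_tree *tree)
          {
              int n;

              for (n = 0; n < chunk->count; n++)
                  if (chunk->owners[n].owner == tree)
                      return true;    /* was: return 1 */
              return false;           /* was: return 0 */
          }
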
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      [PM: tweaked the subject line]
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • audit: audit_string_contains_control can be boolean · 9fcf836b
      Yaowei Bai authored
      This patch makes audit_string_contains_control() return bool to
      improve readability, since the function only ever returns one or
      zero.
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      [PM: tweaked subject line]
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • audit: try harder to send to auditd upon netlink failure · 32a1dbae
      Richard Guy Briggs authored
      There are several reports of the kernel losing contact with auditd when
      it is, in fact, still running.  When this happens, kernel syslogs show:
      	"audit: *NO* daemon at audit_pid=<pid>"
      although auditd is still running, and is apparently happy, listening on
      the netlink socket. The pid in the "*NO* daemon" message matches the pid
      of the running auditd process.  Restarting auditd solves this.
      
      The problem appears to happen randomly, and doesn't seem to be strongly
      correlated to the rate of audit events being logged.  It occurs fairly
      regularly (every few days), but has not yet been reproduced on demand.
      
      On production kernels, BUG_ON() is a no-op, so any error will trigger
      this.
      
      Commit 34eab0a7 ("audit: prevent an older auditd shutdown from
      orphaning a newer auditd startup") eliminates one possible cause.  This
      isn't the case here, since the PID in the error message and the PID of
      the running auditd match.
      
      The primary expected cause of error here is -ECONNREFUSED when the audit
      daemon goes away, when netlink_getsockbyportid() can't find the auditd
      portid entry in the netlink audit table (or there is no receive
      function).  If -EPERM is returned, that situation isn't likely to be
      resolved in a timely fashion without administrator intervention.  In
      both cases, reset the audit_pid.  This does not rule out a race
      condition.  SELinux is expected to return zero since this isn't an INET
      or INET6 socket.  Other LSMs may have other return codes.  Log the error
      code for better diagnosis in the future.
      
      In the case of -ENOMEM, the situation could be temporary, based on local
      or general availability of buffers.  -EAGAIN should never happen since
      the netlink audit (kernel) socket is set to MAX_SCHEDULE_TIMEOUT.
      -ERESTARTSYS and -EINTR are not expected since this kernel thread is not
      expected to receive signals.  In these cases (or any other unexpected
      ones for now), report the error and re-schedule the thread, retrying up
      to 5 times.
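
      A sketch of the decision logic described above (names and structure are
      assumptions; skb lifetime handling is deliberately omitted):

          /* returns 1 if the caller should retry the send */
          static int auditd_send_err_sketch(int err, int *attempts)
          {
              if (err == -ECONNREFUSED || err == -EPERM) {
                  /* auditd is gone, or will never accept: reset the pid */
                  pr_err("audit: *NO* daemon at audit_pid=%d, error=%d\n",
                         audit_pid, err);
                  audit_pid = 0;
                  return 0;
              }
              if (err < 0 && ++(*attempts) <= 5) {
                  /* possibly transient (e.g. -ENOMEM): log and retry */
                  pr_warn("audit: re-scheduling kauditd, error=%d\n", err);
                  schedule();
                  return 1;
              }
              return 0;
          }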
      
      v2:
      	Removed BUG_ON().
      	Moved comma in pr_*() statements.
      	Removed audit_strerror() text.
      Reported-by: Vipin Rathor <v.rathor@gmail.com>
      Reported-by: <ctcard@hotmail.com>
      Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
      [PM: applied rgb's fixup patch to correct audit_log_lost() format issues]
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • atomic: remove all traces of READ_ONCE_CTRL() and atomic*_read_ctrl() · 105ff3cb
      Linus Torvalds authored
      This seems to be a mis-reading of how alpha memory ordering works, and
      is not backed up by the alpha architecture manual.  The helper functions
      don't do anything special on any other architectures, and the arguments
      that support them being safe on other architectures also argue that they
      are safe on alpha.
      
      Basically, the "control dependency" is between a previous read and a
      subsequent write that is dependent on the value read.  Even if the
      subsequent write is actually done speculatively, there is no way that
      such a speculative write could be made visible to other CPUs until it
      has been committed, which requires validating the speculation.
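
      In kernel C, the read-to-write control dependency being discussed looks
      like this (a generic sketch, not code from the patch):

          /* the store cannot become visible to other CPUs before the
           * load resolves: committing a speculative store would mean
           * committing unvalidated speculation -- on alpha too */
          if (READ_ONCE(flag) == 1)
              WRITE_ONCE(data, 42);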
      
      Note that most weakly ordered architectures (very much including alpha)
      do not guarantee any ordering relationship between two loads that depend
      on each other on a control dependency:
      
          read A
          if (val == 1)
              read B
      
      because the conditional may be predicted, and the "read B" may be
      speculatively moved up to before reading the value A.  So we require the
      user to insert a smp_rmb() between the two accesses to be correct:
      
          read A;
          if (A == 1)
              smp_rmb()
              read B
      
      Alpha is further special in that it can break that ordering even if the
      *address* of B depends on the read of A, because the cacheline that is
      read later may be stale unless you have a memory barrier in between the
      pointer read and the read of the value behind a pointer:
      
          read ptr
          read offset(ptr)
      
      whereas all other weakly ordered architectures guarantee that the data
      dependency (as opposed to just a control dependency) will order the two
      accesses.  As a result, alpha needs a "smp_read_barrier_depends()" in
      between those two reads for them to be ordered.
      
      The control dependency that "READ_ONCE_CTRL()" and "atomic_read_ctrl()"
      had was a control dependency to a subsequent *write*, however, and
      nobody can finalize such a subsequent write without having actually done
      the read.  And were you to write such a value to a "stale" cacheline
      (the way the unordered reads came to be), that would seem to lose the
      write entirely.
      
      So the things that make alpha able to re-order reads even more
      aggressively than other weak architectures do not seem to be relevant
      for a subsequent write.  Alpha memory ordering may be strange, but
      there's no real indication that it is *that* strange.
      
      Also, the alpha architecture reference manual very explicitly talks
      about the definition of "Dependence Constraints" in section 5.6.1.7,
      where a preceding read dominates a subsequent write.
      
      Such a dependence constraint admittedly does not impose a BEFORE (alpha
      architecture term for globally visible ordering), but it does guarantee
      that there can be no "causal loop".  I don't see how you could avoid
      such a loop if another cpu could see the stored value and then impact
      the value of the first read.  Put another way: the read and the write
      could not be seen as being out of order wrt other cpus.
      
      So I do not see how these "x_ctrl()" functions can currently be necessary.
      
      I may have to eat my words at some point, but in the absence of clear
      proof that alpha actually needs this, or indeed even an explanation of
      how alpha could _possibly_ need it, I do not believe these functions are
      called for.
      
      And if it turns out that alpha really _does_ need a barrier for this
      case, that barrier still should not be "smp_read_barrier_depends()".
      We'd have to make up some new speciality barrier just for alpha, along
      with the documentation for why it really is necessary.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul E McKenney <paulmck@us.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bpf, verifier: annotate verbose printer with __printf · 1d056d9c
      Daniel Borkmann authored
      The verbose() printer dumps the verifier state to user space, so let gcc
      check calls to verbose() for (future) format-string errors. A build with
      W=1 correctly suggests: function might be possible candidate for
      'gnu_printf' format attribute [-Wsuggest-attribute=format].
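
      A minimal sketch of the annotation (in the kernel, __printf(a, b)
      expands to __attribute__((format(printf, a, b)))):

          /* argument 1 is the format string; variadic args start at 2 */
          __printf(1, 2) static void verbose(const char *fmt, ...);
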
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 03 Nov 2015, 5 commits
    • bpf: add support for persistent maps/progs · b2197755
      Daniel Borkmann authored
      This work adds support for "persistent" eBPF maps/programs.
      "Persistent" here means that maps/programs have a facility that lets
      them survive process termination. This is desired by various eBPF
      subsystem users.
      
      Just to name one example: tc classifier/action. Whenever tc parses
      the ELF object, extracts and loads maps/progs into the kernel, these
      file descriptors will be out of reach after the tc instance exits.
      So a subsequent tc invocation won't be able to access/relocate on this
      resource, and therefore maps cannot easily be shared, f.e. between the
      ingress and egress networking data path.
      
      The current workaround is that Unix domain sockets (UDS) need to be
      instrumented in order to pass the created eBPF map/program file
      descriptors to a third party management daemon through UDS' socket
      passing facility. This makes it a bit complicated to deploy shared
      eBPF maps or programs (programs f.e. for tail calls) among various
      processes.
      
      We've been brainstorming on how to tackle this issue, and various
      approaches have been tried out so far; they can be read up on further
      in the reference below.
      
      The architecture we eventually ended up with is a minimal file system
      that can hold map/prog objects. The file system is a per mount namespace
      singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
      mounts within a given namespace will point to the same instance. The
      file system allows for creating a user-defined directory structure.
      The objects for maps/progs are created/fetched through bpf(2) with
      two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor
      along with a pathname is being passed to bpf(2) that in turn creates
      (we call it eBPF object pinning) the file system nodes. Only the pathname
      is being passed to bpf(2) for getting a new BPF file descriptor to an
      existing node. The user can use that to access maps and progs later on,
      through bpf(2). Removal of file system nodes is being managed through
      normal VFS functions such as unlink(2), etc. The file system code is
      kept to a very minimum and can be further extended later on.
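
      A hedged userspace sketch of pinning and re-fetching an object (error
      handling elided; attr field names follow the bpf(2) UAPI):

          #include <linux/bpf.h>
          #include <stdint.h>
          #include <string.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          static int bpf_obj_pin(int fd, const char *path)
          {
              union bpf_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.pathname = (uint64_t)(uintptr_t)path;
              attr.bpf_fd = fd;
              return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
          }

          static int bpf_obj_get(const char *path)    /* returns a new fd */
          {
              union bpf_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.pathname = (uint64_t)(uintptr_t)path;
              return syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
          }

          /* e.g.: bpf_obj_pin(map_fd, "/sys/fs/bpf/my_map"); */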
      
      The next step I'm working on is to add dump commands for eBPF
      maps/progs to bpf(2), so that a specification can be retrieved from a
      given file descriptor. This can be used by things like CRIU, but
      applications can also inspect the metadata after calling BPF_OBJ_GET.
      
      Big thanks also to Alexei and Hannes, who contributed significantly
      to the design discussion that eventually led us to this architecture.
      
      Reference: https://lkml.org/lkml/2015/10/15/925
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: consolidate bpf_prog_put{, _rcu} dismantle paths · e9d8afa9
      Daniel Borkmann authored
      We currently have duplicated cleanup code in the bpf_prog_put() and
      bpf_prog_put_rcu() dismantle paths. Back then we decided that it was
      not worth making it a common helper called by both, but with the
      recent addition of resource charging, we could have avoided the fix
      in commit ac00737f ("bpf: Need to call bpf_prog_uncharge_memlock
      from bpf_prog_put") if we had had only a single, common path. We can
      simplify it further by assigning aux->prog only once, during
      allocation.
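
      A sketch of the consolidated path (close to, but not necessarily
      verbatim, the patch):

          static void __bpf_prog_put_rcu(struct rcu_head *rcu)
          {
              struct bpf_prog_aux *aux;

              aux = container_of(rcu, struct bpf_prog_aux, rcu);
              free_used_maps(aux);
              bpf_prog_uncharge_memlock(aux->prog);  /* charge always undone */
              bpf_prog_free(aux->prog);              /* aux->prog set at alloc */
          }

          void bpf_prog_put(struct bpf_prog *prog)
          {
              if (atomic_dec_and_test(&prog->aux->refcnt))
                  call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
          }
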
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: align and clean bpf_{map,prog}_get helpers · c2101297
      Daniel Borkmann authored
      Add a bpf_map_get() function that we're going to use later on, and
      align/clean the remaining helpers a bit so that they are more
      consistent:
      
        - __bpf_map_get() and __bpf_prog_get() that both work on the fd
          struct, check whether the descriptor is eBPF and return the
          pointer to the map/prog stored in the private data.
      
          Also, we can return f.file->private_data directly, the function
          signature is enough of a documentation already.
      
        - bpf_map_get() and bpf_prog_get() that both work on u32 user fd,
          call their respective __bpf_map_get()/__bpf_prog_get() variants,
          and take a reference.
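
      A sketch of the first helper's shape under these conventions (assumed,
      not verbatim):

          static struct bpf_map *__bpf_map_get(struct fd f)
          {
              if (!f.file)
                  return ERR_PTR(-EBADF);
              if (f.file->f_op != &bpf_map_fops) {
                  fdput(f);
                  return ERR_PTR(-EINVAL);
              }
              /* the signature is documentation enough */
              return f.file->private_data;
          }
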
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: abstract anon_inode_getfd invocations · aa79781b
      Daniel Borkmann authored
      Since we're going to use anon_inode_getfd() invocations in more than just
      the current places, make a helper function for both, so that we only need
      to pass a map/prog pointer to the helper itself in order to get a fd. The
      new helpers are called bpf_map_new_fd() and bpf_prog_new_fd().
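
      A sketch mirroring that description (the flags are an assumption):

          int bpf_map_new_fd(struct bpf_map *map)
          {
              return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
                                      O_RDWR | O_CLOEXEC);
          }
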
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: convert hashtab lock to raw lock · ac00881f
      Yang Shi authored
      When running bpf samples on an RT kernel, the following warning is
      reported:
      
      BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
      in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
      Preemption disabled at:[<ffff80000017db58>] kprobe_perf_func+0x30/0x228
      
      CPU: 3 PID: 477 Comm: ping Not tainted 4.1.10-rt8 #4
      Hardware name: Freescale Layerscape 2085a RDB Board (DT)
      Call trace:
      [<ffff80000008a5b0>] dump_backtrace+0x0/0x128
      [<ffff80000008a6f8>] show_stack+0x20/0x30
      [<ffff8000007da90c>] dump_stack+0x7c/0xa0
      [<ffff8000000e4830>] ___might_sleep+0x188/0x1a0
      [<ffff8000007e2200>] rt_spin_lock+0x28/0x40
      [<ffff80000018bf9c>] htab_map_update_elem+0x124/0x320
      [<ffff80000018c718>] bpf_map_update_elem+0x40/0x58
      [<ffff800000187658>] __bpf_prog_run+0xd48/0x1640
      [<ffff80000017ca6c>] trace_call_bpf+0x8c/0x100
      [<ffff80000017db58>] kprobe_perf_func+0x30/0x228
      [<ffff80000017dd84>] kprobe_dispatcher+0x34/0x58
      [<ffff8000007e399c>] kprobe_handler+0x114/0x250
      [<ffff8000007e3bf4>] kprobe_breakpoint_handler+0x1c/0x30
      [<ffff800000085b80>] brk_handler+0x88/0x98
      [<ffff8000000822f0>] do_debug_exception+0x50/0xb8
      Exception stack(0xffff808349687460 to 0xffff808349687580)
      7460: 4ca2b600 ffff8083 4a3a7000 ffff8083 49687620 ffff8083 0069c5f8 ffff8000
      7480: 00000001 00000000 007e0628 ffff8000 496874b0 ffff8083 007e1de8 ffff8000
      74a0: 496874d0 ffff8083 0008e04c ffff8000 00000001 00000000 4ca2b600 ffff8083
      74c0: 00ba2e80 ffff8000 49687528 ffff8083 49687510 ffff8083 000e5c70 ffff8000
      74e0: 00c22348 ffff8000 00000000 ffff8083 49687510 ffff8083 000e5c74 ffff8000
      7500: 4ca2b600 ffff8083 49401800 ffff8083 00000001 00000000 00000000 00000000
      7520: 496874d0 ffff8083 00000000 00000000 00000000 00000000 00000000 00000000
      7540: 2f2e2d2c 33323130 00000000 00000000 4c944500 ffff8083 00000000 00000000
      7560: 00000000 00000000 008751e0 ffff8000 00000001 00000000 124e2d1d 00107b77
      
      Convert the hashtab lock to a raw lock to avoid this warning.
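
      The shape of the conversion, as a sketch (struct and field names are
      assumptions):

          struct bpf_htab_sketch {
              struct hlist_head *buckets;
              raw_spinlock_t lock;  /* was spinlock_t: on PREEMPT_RT a plain
                                     * spinlock becomes a sleeping rtmutex,
                                     * illegal in kprobe/atomic context */
          };

          static void htab_update_sketch(struct bpf_htab_sketch *htab)
          {
              unsigned long flags;

              raw_spin_lock_irqsave(&htab->lock, flags);
              /* ... update the hash bucket ... */
              raw_spin_unlock_irqrestore(&htab->lock, flags);
          }
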
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 30 Oct 2015, 1 commit
    • blktrace: re-write setting q->blk_trace · cdea01b2
      Davidlohr Bueso authored
      This is really about simplifying the double xchg patterns into
      a single cmpxchg, with the same logic. Other than the immediate
      cleanup, there are some subtleties this change deals with:
      
      (i) While the load of the old bt is fully ordered wrt everything,
      ie:
      
              old_bt = xchg(&q->blk_trace, bt);             [barrier]
              if (old_bt)
      	     (void) xchg(&q->blk_trace, old_bt);    [barrier]
      
      blk_trace could still be changed between the xchg and the old_bt
      load. Note that this race is merely theoretical and the window,
      afaict, very small, but doing everything in a single context with
      cmpxchg closes it.
      
      (ii) Ordering guarantees are obviously kept with cmpxchg.
      
      (iii) Gets rid of the hacky-by-nature (void)xchg pattern.
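
      A sketch of the resulting pattern (reusing the variable names from the
      quoted code; the error label is hypothetical):

          /* publish bt only if no trace is installed: one atomic step */
          if (cmpxchg(&q->blk_trace, NULL, bt) != NULL) {
              ret = -EBUSY;   /* someone else installed a trace first */
              goto free_bts;
          }
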
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  5. 29 Oct 2015, 1 commit
    • cgroup: fix race condition around termination check in css_task_iter_next() · d5745675
      Tejun Heo authored
      css_task_iter_next() checked @it->cur_task before grabbing
      css_set_lock and assumed that the result wouldn't change afterwards;
      however, tasks could leave the cgroup being iterated, terminating the
      iterator before css_set_lock is acquired.  If this happens,
      css_task_iter_next() tries to calculate the current task from a NULL
      cg_list pointer, leading to the following oops.
      
       BUG: unable to handle kernel paging request at fffffffffffff7d0
       IP: [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80
       ...
       CPU: 4 PID: 6391 Comm: JobQDisp2 Not tainted 4.0.9-22_fbk4_rc3_81616_ge8d9cb6 #1
       Hardware name: Quanta Freedom/Winterfell, BIOS F03_3B08 03/04/2014
       task: ffff880868e46400 ti: ffff88083404c000 task.ti: ffff88083404c000
       RIP: 0010:[<ffffffff810d5f22>]  [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80
       RSP: 0018:ffff88083404fd28  EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffff88083404fd68 RCX: ffff8804697fb8b0
       RDX: fffffffffffff7c0 RSI: ffff8803b7dff800 RDI: ffffffff822c0278
       RBP: ffff88083404fd38 R08: 0000000000017160 R09: ffff88046f4070c0
       R10: ffffffff810d61f7 R11: 0000000000000293 R12: ffff880863bf8400
       R13: ffff88046b87fd80 R14: 0000000000000000 R15: ffff88083404fe58
       FS:  00007fa0567e2700(0000) GS:ffff88046f900000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: fffffffffffff7d0 CR3: 0000000469568000 CR4: 00000000001406e0
       Stack:
        0000000000000246 0000000000000000 ffff88083404fde8 ffffffff810d6248
        ffff88083404fd68 0000000000000000 ffff8803b7dff800 000001ef000001ee
        0000000000000000 0000000000000000 ffff880863bf8568 0000000000000000
       Call Trace:
        [<ffffffff810d6248>] cgroup_pidlist_start+0x258/0x550
        [<ffffffff810cf66d>] cgroup_seqfile_start+0x1d/0x20
        [<ffffffff8121f8ef>] kernfs_seq_start+0x5f/0xa0
        [<ffffffff811cab76>] seq_read+0x166/0x380
        [<ffffffff812200fd>] kernfs_fop_read+0x11d/0x180
        [<ffffffff811a7398>] __vfs_read+0x18/0x50
        [<ffffffff811a745d>] vfs_read+0x8d/0x150
        [<ffffffff811a756f>] SyS_read+0x4f/0xb0
        [<ffffffff818d4772>] system_call_fastpath+0x12/0x17
      
      Fix it by moving the termination condition check inside css_set_lock.
      @it->cur_task is now cleared after being put and @it->task_pos is
      tested for termination instead of @it->cset_pos as they indicate the
      same condition and @it->task_pos is what's being dereferenced.
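
      The shape of the fix, as a sketch (lock flavor and helper names are
      assumptions based on the description above):

          struct task_struct *css_task_iter_next(struct css_task_iter *it)
          {
              struct task_struct *res = NULL;

              spin_lock_bh(&css_set_lock);
              /* termination is now tested under the lock, and on the
               * pointer that actually gets dereferenced */
              if (it->task_pos) {
                  res = list_entry(it->task_pos, struct task_struct,
                                   cg_list);
                  css_task_iter_advance(it);
              }
              spin_unlock_bh(&css_set_lock);
              return res;
          }
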
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Calvin Owens <calvinowens@fb.com>
      Fixes: ed27b9f7 ("cgroup: don't hold css_set_rwsem across css task iteration")
      Acked-by: Zefan Li <lizefan@huawei.com>
  6. 28 Oct 2015, 1 commit
    • seccomp, ptrace: add support for dumping seccomp filters · f8e529ed
      Tycho Andersen authored
      This patch adds support for dumping a process' (classic BPF) seccomp
      filters via ptrace.
      
      PTRACE_SECCOMP_GET_FILTER allows the tracer to dump the tracee's classic
      BPF seccomp filters. addr should be an integer which represents the ith
      seccomp filter (0 is the most recently installed filter). data should be
      a struct sock_filter * with enough room for the ith filter, or NULL, in
      which case the filter is not saved. The return value for this command is
      the number of BPF instructions the program represents, or negative in
      the case of errors. Command-specific errors are ENOENT, which indicates
      that there is no ith filter in this seccomp tree, and EMEDIUMTYPE, which
      indicates that the ith filter was not installed as a classic BPF filter.
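
      A hedged userspace usage sketch (error handling elided;
      PTRACE_SECCOMP_GET_FILTER comes from the new kernel UAPI headers):

          #include <stdlib.h>
          #include <sys/types.h>
          #include <sys/ptrace.h>
          #include <linux/filter.h>

          /* dump the tracee's most recently installed filter (index 0) */
          static struct sock_filter *dump_filter(pid_t pid, long *len)
          {
              struct sock_filter *insns;

              /* data == NULL: just ask how many BPF instructions exist */
              *len = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, NULL);
              if (*len < 0)
                  return NULL;    /* e.g. ENOENT or EMEDIUMTYPE */

              insns = calloc(*len, sizeof(*insns));
              ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, insns);
              return insns;
          }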
      
      A caveat with this approach is that there is no way to explicitly get
      at the hierarchy of seccomp filters, and users need to memcmp() filters
      to decide which are inherited. This means that a task which installs
      two of the same filter can potentially confuse users of this interface.
      
      v2: * make save_orig const
          * check that the orig_prog exists (not necessary right now, but
             it will be once this grows eBPF support)
          * s/n/filter_off and make it an unsigned long to match ptrace
          * count "down" the tree instead of "up" when passing a filter offset
      
      v3: * don't take the current task's lock for inspecting its seccomp mode
          * use a 0x42** constant for the ptrace command value
      
      v4: * don't copy to userspace while holding spinlocks
      
      v5: * add another condition to WARN_ON
      
      v6: * rebase on net-next
      Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      CC: Will Drewry <wad@chromium.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      CC: Andy Lutomirski <luto@amacapital.net>
      CC: Pavel Emelyanov <xemul@parallels.com>
      CC: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      CC: Alexei Starovoitov <ast@kernel.org>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 27 Oct 2015, 3 commits
  8. 26 Oct 2015, 1 commit
  9. 23 Oct 2015, 2 commits
  10. 22 Oct 2015, 3 commits