1. 10 Jul 2019, 1 commit
    • cpuset: restore sanity to cpuset_cpus_allowed_fallback() · b7747ecb
      Joel Savitz authored
      [ Upstream commit d477f8c202d1f0d4791ab1263ca7657bbe5cf79e ]
      
      In the case that a process is constrained by taskset(1) (i.e.
      sched_setaffinity(2)) to a subset of available cpus, and all of those are
      subsequently offlined, the scheduler will set tsk->cpus_allowed to
      the current value of task_cs(tsk)->effective_cpus.
      
      This is done via a call to do_set_cpus_allowed() in the context of
      cpuset_cpus_allowed_fallback() made by the scheduler when this case is
      detected. This is the only call made to cpuset_cpus_allowed_fallback()
      in the latest mainline kernel.
      
      However, this is not sane behavior.
      
      I will demonstrate this on a system running the latest upstream kernel
      with the following initial configuration:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	ffffffff,ffffffff
      	Cpus_allowed_list:	0-63
      
      (Where cpus 32-63 are provided via smt.)
      
      If we limit our current shell process to cpu2 only and then offline it
      and reonline it:
      
      	# taskset -p 4 $$
      	pid 2272's current affinity mask: ffffffffffffffff
      	pid 2272's new affinity mask: 4
      
      	# echo off > /sys/devices/system/cpu/cpu2/online
      	# dmesg | tail -3
      	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
      	[ 2195.872700] IRQ 114: no longer affine to CPU2
      	[ 2195.879128] smpboot: CPU 2 is now offline
      
      	# echo on > /sys/devices/system/cpu/cpu2/online
      	# dmesg | tail -1
      	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4
      
      We see that our current process now has an affinity mask containing
      every cpu available on the system _except_ the one we originally
      constrained it to:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:   ffffffff,fffffffb
      	Cpus_allowed_list:      0-1,3-63
      
      This is not sane behavior, as the scheduler can now not only place the
      process on previously forbidden cpus, it can't even schedule it on
      the cpu it was originally constrained to!
      
      Other cases result in even more exotic affinity masks. Take for instance
      a process with an affinity mask containing only cpus provided by smt at
      the moment that smt is toggled, in a configuration such as the following:
      
      	# taskset -p f000000000 $$
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	000000f0,00000000
      	Cpus_allowed_list:	36-39
      
      A double toggle of smt results in the following behavior:
      
      	# echo off > /sys/devices/system/cpu/smt/control
      	# echo on > /sys/devices/system/cpu/smt/control
      	# grep -i cpus /proc/$$/status
      	Cpus_allowed:	ffffff00,ffffffff
      	Cpus_allowed_list:	0-31,40-63
      
      This is even less sane than the previous case, as the new affinity mask
      excludes all smt-provided cpus with ids less than those that were
      previously in the affinity mask, as well as those that were actually in
      the mask.
      
      With this patch applied, both of these cases end in the following state:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	ffffffff,ffffffff
      	Cpus_allowed_list:	0-63
      
      The original policy is discarded. Though not ideal, it is the simplest way
      to restore sanity to this fallback case without reinventing the cpuset
      wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
      for the previous affinity mask to be restored in this fallback case can use
      that mechanism instead.
      
      This patch modifies scheduler behavior by instead resetting the mask to
      task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
      mode. I tested the cases above on both modes.
      
      Note that the scheduler uses this fallback mechanism if and only if
      _every_ other valid avenue has been traveled, and it is the last resort
      before calling BUG().
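
      A minimal sketch of the reworked fallback, assuming the mainline
      is_in_v2_mode() helper (older trees may spell the check differently):

        void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
        {
                rcu_read_lock();
                do_set_cpus_allowed(tsk, is_in_v2_mode() ?
                        task_cs(tsk)->cpus_allowed : cpu_possible_mask);
                rcu_read_unlock();
        }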
      Suggested-by: Waiman Long <longman@redhat.com>
      Suggested-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Joel Savitz <jsavitz@redhat.com>
      Acked-by: Phil Auld <pauld@redhat.com>
      Acked-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 03 Jul 2019, 5 commits
    • bpf: fix unconnected udp hooks · 613bc37f
      Daniel Borkmann authored
      commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream.
      
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparent to the application,
      this needs reverse translation on recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both, connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
      way to call into the BPF program whenever a non-NULL msg->msg_name was
      passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
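
      A hedged BPF-side sketch of such a reverse-NAT program; the map name,
      value layout and BTF-style map definition are illustrative, not taken
      from the Cilium source:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct revnat_val {
                __u32 ip;       /* original service IPv4, network order */
                __u32 port;     /* original service port, network order */
        };

        struct {
                __uint(type, BPF_MAP_TYPE_LRU_HASH);
                __uint(max_entries, 65536);
                __type(key, __u64);             /* socket cookie */
                __type(value, struct revnat_val);
        } revnat SEC(".maps");

        SEC("cgroup/recvmsg4")
        int rev_nat_recvmsg4(struct bpf_sock_addr *ctx)
        {
                __u64 cookie = bpf_get_socket_cookie(ctx);
                struct revnat_val *val = bpf_map_lookup_elem(&revnat, &cookie);

                if (val) {
                        /* restore the service address the application expects */
                        ctx->user_ip4 = val->ip;
                        ctx->user_port = val->port;
                }
                return 1;       /* recvmsg hooks may only return 1 */
        }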
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: fix nested bpf tracepoints with per-cpu data · a7177b94
      Matt Mullins authored
      commit 9594dc3c7e71b9f52bee1d7852eb3d4e3aea9e99 upstream.
      
      BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
      they do not increment bpf_prog_active while executing.
      
      This enables three levels of nesting, to support
        - a kprobe or raw tp or perf event,
        - another one of the above that irq context happens to call, and
        - another one in nmi context
      (at most one of which may be a kprobe or perf event).
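
      A hedged sketch of the approach taken by the fix: a per-CPU nesting
      counter selects one of three preallocated perf_sample_data buffers,
      so nested invocations no longer clobber each other (names follow the
      upstream patch loosely):

        struct bpf_trace_sample_data {
                struct perf_sample_data sds[3]; /* one per nesting level */
        };
        static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds);
        static DEFINE_PER_CPU(int, bpf_trace_nest_level);

        /* in the output helper, on entry: */
        int nest_level = this_cpu_inc_return(bpf_trace_nest_level);
        struct perf_sample_data *sd;

        if (WARN_ON_ONCE(nest_level > 3)) {
                this_cpu_dec(bpf_trace_nest_level);
                return -EBUSY;          /* nested too deep, drop the event */
        }
        sd = &this_cpu_ptr(&bpf_trace_sds)->sds[nest_level - 1];
        /* ... fill sd and emit the event, then on exit: */
        this_cpu_dec(bpf_trace_nest_level);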
      
      Fixes: 20b9d7ac ("bpf: avoid excessive stack usage for perf_sample_data")
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: lpm_trie: check left child of last leftmost node for NULL · 4992d4af
      Jonathan Lemon authored
      commit da2577fdd0932ea4eefe73903f1130ee366767d2 upstream.
      
      If the leftmost parent node of the tree does not have a child
      on the left side, then trie_get_next_key (and bpftool map dump) will
      not look at the child on the right.  This leads to the traversal
      missing elements.
      
      Lookup is not affected.
      
      Update selftest to handle this case.
      
      Reproducer:
      
       bpftool map create /sys/fs/bpf/lpm type lpm_trie key 6 \
           value 1 entries 256 name test_lpm flags 1
       bpftool map update pinned /sys/fs/bpf/lpm key  8 0 0 0  0   0 value 1
       bpftool map update pinned /sys/fs/bpf/lpm key 16 0 0 0  0 128 value 2
       bpftool map dump   pinned /sys/fs/bpf/lpm
      
      Returns only 1 element. (2 expected)
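
      A hedged sketch of the corrected leftmost descent in
      trie_get_next_key(); intermediate (IM) nodes always have two
      children, so for a real node with a missing left child the walk must
      continue into the right child instead of stopping early:

        for (node = search_root; node;) {
                if (node->flags & LPM_TREE_NODE_FLAG_IM) {
                        node = rcu_dereference(node->child[0]);
                } else {
                        next_node = node;
                        node = rcu_dereference(node->child[0]);
                        if (!node)      /* the previously missing check */
                                node = rcu_dereference(next_node->child[1]);
                }
        }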
      
      Fixes: b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE")
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cpu/speculation: Warn on unsupported mitigations= parameter · b78ad216
      Geert Uytterhoeven authored
      commit 1bf72720281770162c87990697eae1ba2f1d917a upstream.
      
      Currently, if the user specifies an unsupported mitigation strategy on the
      kernel command line, it will be ignored silently.  The code will fall back
      to the default strategy, possibly leaving the system more vulnerable than
      expected.
      
      This may happen due to e.g. a simple typo, or, for a stable kernel release,
      because not all mitigation strategies have been backported.
      
      Inform the user by printing a message.
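
      The change amounts to an extra else branch in the command-line
      parser; a sketch close to the upstream diff:

        static int __init mitigations_parse_cmdline(char *arg)
        {
                if (!strcmp(arg, "off"))
                        cpu_mitigations = CPU_MITIGATIONS_OFF;
                else if (!strcmp(arg, "auto"))
                        cpu_mitigations = CPU_MITIGATIONS_AUTO;
                else if (!strcmp(arg, "auto,nosmt"))
                        cpu_mitigations = CPU_MITIGATIONS_AUTO_NOSMT;
                else
                        pr_crit("Unsupported mitigations=%s, system may still be vulnerable\n",
                                arg);

                return 0;
        }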
      
      Fixes: 98af8452945c5565 ("cpu/speculation: Add 'mitigations=' cmdline option")
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190516070935.22546-1-geert@linux-m68k.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Revert "x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP" · fec1a13b
      Sasha Levin authored
      This reverts commit 1a3188d7, which was
      upstream commit 4a6c91fbdef846ec7250b82f2eeeb87ac5f18cf9.
      
      On Tue, Jun 25, 2019 at 09:39:45AM +0200, Sebastian Andrzej Siewior wrote:
      >Please backport commit e74deb11931ff682b59d5b9d387f7115f689698e to
      >stable _or_ revert the backport of commit 4a6c91fbdef84 ("x86/uaccess,
      >ftrace: Fix ftrace_likely_update() vs. SMAP"). It uses
      >user_access_{save|restore}() which has been introduced in the following
      >commit.
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. 25 Jun 2019, 1 commit
    • tracing: Silence GCC 9 array bounds warning · c493ead3
      Miguel Ojeda authored
      commit 0c97bf863efce63d6ab7971dad811601e6171d2f upstream.
      
      Starting with GCC 9, -Warray-bounds detects cases when memset is called
      starting on a member of a struct but the size to be cleared ends up
      writing over further members.
      
      Such a call happens in the trace code to clear, at once, all members
      after and including `seq` on struct trace_iterator:
      
          In function 'memset',
              inlined from 'ftrace_dump' at kernel/trace/trace.c:8914:3:
          ./include/linux/string.h:344:9: warning: '__builtin_memset' offset
          [8505, 8560] from the object at 'iter' is out of the bounds of
          referenced subobject 'seq' with type 'struct trace_seq' at offset
          4368 [-Warray-bounds]
            344 |  return __builtin_memset(p, c, size);
                |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      In order to avoid GCC complaining about it, we compute the address
      ourselves by adding the offsetof distance instead of referring
      directly to the member.
      
      Since there are two places doing this clear (trace.c and trace_kdb.c),
      take the chance to move the workaround into a single place in
      the internal header.
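
      A sketch of the shared helper (the upstream patch adds a
      trace_iterator_reset() along these lines):

        static inline void trace_iterator_reset(struct trace_iterator *iter)
        {
                const size_t offset = offsetof(struct trace_iterator, seq);

                memset((char *)iter + offset, 0, sizeof(*iter) - offset);
                iter->pos = -1;
        }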
      
      Link: http://lkml.kernel.org/r/20190523124535.GA12931@gmail.com
      Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      [ Removed unnecessary parenthesis around "iter" ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 22 Jun 2019, 3 commits
    • perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data · 991ea848
      Peter Zijlstra authored
      [ Upstream commit 4d839dd9e4356bbacf3eb0ab13a549b83b008c21 ]
      
      We must use {READ,WRITE}_ONCE() on rb->user_page data such that
      concurrent usage will see whole values. A few key sites were missing
      this.
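
      A hedged illustration of the pattern:

        /* publish the new head as one whole store ... */
        WRITE_ONCE(rb->user_page->data_head, head);

        /* ... and fetch the consumer's tail as one whole load */
        tail = READ_ONCE(rb->user_page->data_tail);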
      Suggested-by: Yabin Cui <yabinc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mark.rutland@arm.com
      Cc: namhyung@kernel.org
      Fixes: 7b732a75 ("perf_counter: new output ABI - part 1")
      Link: http://lkml.kernel.org/r/20190517115418.394192145@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf/ring_buffer: Add ordering to rb->nest increment · c133c9db
      Peter Zijlstra authored
      [ Upstream commit 3f9fbe9bd86c534eba2faf5d840fd44c6049f50e ]
      
      Similar to how decrementing rb->next too early can cause data_head to
      (temporarily) be observed to go backward, so too can this happen when
      we increment too late.
      
      This barrier() ensures the rb->head load happens after the increment,
      both the one in the 'goto again' path, as the one from
      perf_output_get_handle() -- albeit very unlikely to matter for the
      latter.
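
      A sketch close to the patched perf_output_get_handle(), showing the
      added compiler barrier:

        static void perf_output_get_handle(struct perf_output_handle *handle)
        {
                struct ring_buffer *rb = handle->rb;

                preempt_disable();
                local_inc(&rb->nest);
                barrier();      /* order the rb->head load after the increment */
                handle->wakeup = local_read(&rb->wakeup);
        }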
      Suggested-by: Yabin Cui <yabinc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mark.rutland@arm.com
      Cc: namhyung@kernel.org
      Fixes: ef60777c ("perf: Optimize the perf_output() path by removing IRQ-disables")
      Link: http://lkml.kernel.org/r/20190517115418.309516009@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf/ring_buffer: Fix exposing a temporarily decreased data_head · cca19ab2
      Yabin Cui authored
      [ Upstream commit 1b038c6e05ff70a1e66e3e571c2e6106bdb75f53 ]
      
      In perf_output_put_handle(), an IRQ/NMI can happen in below location and
      write records to the same ring buffer:
      
      	...
      	local_dec_and_test(&rb->nest)
      	...                          <-- an IRQ/NMI can happen here
      	rb->user_page->data_head = head;
      	...
      
      In this case, a value A is written to data_head in the IRQ, then a value
      B is written to data_head after the IRQ. And A > B. As a result,
      data_head is temporarily decreased from A to B. And a reader may see
      data_head < data_tail if it reads the buffer frequently enough, which
      creates unexpected behaviors.
      
      This can be fixed by moving dec(&rb->nest) to after updating data_head,
      which prevents the IRQ/NMI above from updating data_head.
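
      A hedged before/after fragment of the exit path in
      perf_output_put_handle():

        /* before: a nested IRQ between these two statements can publish a
         * newer head, which the plain store below then rewinds to 'head' */
        local_dec_and_test(&rb->nest);
        rb->user_page->data_head = head;

        /* after: publish first, only then admit nested writers again */
        WRITE_ONCE(rb->user_page->data_head, head);
        barrier();
        local_set(&rb->nest, 0);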
      
      [ Split up by peterz. ]
      Signed-off-by: Yabin Cui <yabinc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mark.rutland@arm.com
      Fixes: ef60777c ("perf: Optimize the perf_output() path by removing IRQ-disables")
      Link: http://lkml.kernel.org/r/20190517115418.224478157@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  5. 19 Jun 2019, 5 commits
  6. 15 Jun 2019, 4 commits
  7. 11 Jun 2019, 1 commit
    • x86/power: Fix 'nosmt' vs hibernation triple fault during resume · 4d166206
      Jiri Kosina authored
      commit ec527c318036a65a083ef68d8ba95789d2212246 upstream.
      
      As explained in
      
      	0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      
      we always, no matter what, have to bring up x86 HT siblings during boot at
      least once in order to avoid first MCE bringing the system to its knees.
      
      That means that whenever 'nosmt' is supplied on the kernel command-line,
      all the HT siblings are as a result sitting in mwait or cpuidle after
      going through the online-offline cycle at least once.
      
      This causes a serious issue though when a kernel, which saw 'nosmt' on its
      commandline, is going to perform resume from hibernation: if the resume
      from the hibernated image is successful, cr3 is flipped in order to point
      to the address space of the kernel that is being resumed, which in turn
      means that all the HT siblings are all of a sudden mwaiting on address
      which is no longer valid.
      
      That results in triple fault shortly after cr3 is switched, and machine
      reboots.
      
      Fix this by always waking up all the SMT siblings before initiating the
      'restore from hibernation' process; this guarantees that all the HT
      siblings will be properly carried over to the resumed kernel waiting in
      resume_play_dead(), and acted upon accordingly afterwards, based on the
      target kernel configuration.
      
      Symmetrically, the resumed kernel has to push the SMT siblings to mwait
      again in case it has SMT disabled; this means it has to online all
      the siblings when resuming (so that they come out of hlt) and offline
      them again to let them reach mwait.
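
      A hedged sketch of the resume-side hook added by the patch:

        int arch_resume_nosmt(void)
        {
                int ret = 0;

                cpu_hotplug_enable();
                if (cpu_smt_control == CPU_SMT_DISABLED ||
                    cpu_smt_control == CPU_SMT_FORCE_DISABLED) {
                        enum cpuhp_smt_control old = cpu_smt_control;

                        ret = cpuhp_smt_enable();       /* out of stale hlt/mwait */
                        if (!ret)
                                ret = cpuhp_smt_disable(old); /* park them again */
                }
                cpu_hotplug_disable();
                return ret;
        }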
      
      Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
      Debugged-by: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  8. 09 Jun 2019, 2 commits
  9. 04 Jun 2019, 1 commit
  10. 31 May 2019, 15 commits
    • rcuperf: Fix cleanup path for invalid perf_type strings · 506b28fb
      Paul E. McKenney authored
      [ Upstream commit ad092c027713a68a34168942a5ef422e42e039f4 ]
      
      If the specified rcuperf.perf_type is not in the rcu_perf_init()
      function's perf_ops[] array, rcuperf prints some console messages and
      then invokes rcu_perf_cleanup() to set state so that a future torture
      test can run.  However, rcu_perf_cleanup() also attempts to end the
      test that didn't actually start, and in doing so relies on the value
      of cur_ops, a value that is not particularly relevant in this case.
      This can result in confusing output or even follow-on failures due to
      attempts to use facilities that have not been properly initialized.
      
      This commit therefore sets the value of cur_ops to NULL in this case and
      inserts a check near the beginning of rcu_perf_cleanup(), thus avoiding
      relying on an irrelevant cur_ops value.
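
      The guard itself is small; a sketch (the rcutorture fix below takes
      the same shape):

        static void rcu_perf_cleanup(void)
        {
                /* init never picked an ops table: nothing to tear down */
                if (!cur_ops) {
                        torture_cleanup_end();
                        return;
                }
                /* ... regular teardown using cur_ops ... */
        }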
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • rcutorture: Fix cleanup path for invalid torture_type strings · aa7919e3
      Paul E. McKenney authored
      [ Upstream commit b813afae7ab6a5e91b4e16cc567331d9c2ae1f04 ]
      
      If the specified rcutorture.torture_type is not in the rcu_torture_init()
      function's torture_ops[] array, rcutorture prints some console messages
      and then invokes rcu_torture_cleanup() to set state so that a future
      torture test can run.  However, rcu_torture_cleanup() also attempts to
      end the test that didn't actually start, and in doing so relies on the
      value of cur_ops, a value that is not particularly relevant in this case.
      This can result in confusing output or even follow-on failures due to
      attempts to use facilities that have not been properly initialized.
      
      This commit therefore sets the value of cur_ops to NULL in this case
      and inserts a check near the beginning of rcu_torture_cleanup(),
      thus avoiding relying on an irrelevant cur_ops value.
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • timekeeping: Force upper bound for setting CLOCK_REALTIME · dc0f37b7
      Thomas Gleixner authored
      [ Upstream commit 7a8e61f8478639072d402a26789055a4a4de8f77 ]
      
      Several people reported testing failures after setting CLOCK_REALTIME close
      to the limits of the kernel internal representation in nanoseconds,
      i.e. year 2262.
      
      The failures are exposed in subsequent operations, i.e. when arming timers
      or when the advancing CLOCK_MONOTONIC makes the calculation of
      CLOCK_REALTIME overflow into negative space.
      
      Now people start to paper over the underlying problem by clamping
      calculations to the valid range, but that's just wrong because such
      workarounds will prevent detection of real issues as well.
      
      It is reasonable to force an upper bound for the various methods of setting
      CLOCK_REALTIME. Year 2262 is the absolute upper bound. Assume a maximum
      uptime of 30 years which is plenty enough even for esoteric embedded
      systems. That results in an upper bound of year 2232 for setting the time.
      
      Once that limit is reached in reality this limit is only a small part of
      the problem space. But until then this stops people from trying to paper
      over the problem at the wrong places.
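
      A hedged sketch of the bound (constant names are illustrative):

        /* assume at most ~30 years of uptime after the time is set */
        #define UPTIME_SEC_MAX  (30LL * 365 * 24 * 60 * 60)
        #define SETTOD_SEC_MAX  (KTIME_SEC_MAX - UPTIME_SEC_MAX)

        static inline bool timespec64_valid_settod(const struct timespec64 *ts)
        {
                return timespec64_valid(ts) && ts->tv_sec <= SETTOD_SEC_MAX;
        }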
      Reported-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reported-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903231125480.2157@nanos.tec.linutronix.de
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP · 1a3188d7
      Peter Zijlstra authored
      [ Upstream commit 4a6c91fbdef846ec7250b82f2eeeb87ac5f18cf9 ]
      
      For CONFIG_TRACE_BRANCH_PROFILING=y the likely/unlikely things get
      overloaded and generate callouts to this code, and thus also when
      AC=1.
      
      Make it safe.
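
      The fix wraps the branch-profiling callout in
      user_access_save()/user_access_restore(); a sketch:

        void ftrace_likely_update(struct ftrace_likely_data *f, int val,
                                  int expect, int is_constant)
        {
                unsigned long flags = user_access_save();

                /* ... existing branch-counter updates ... */

                user_access_restore(flags);
        }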
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • irq_work: Do not raise an IPI when queueing work on the local CPU · afee27f3
      Nicholas Piggin authored
      [ Upstream commit 471ba0e686cb13752bc1ff3216c54b69a2d250ea ]
      
      The QEMU PowerPC/PSeries machine model was not expecting a self-IPI,
      and it may be a bit surprising thing to do, so have irq_work_queue_on
      do local queueing when target is the current CPU.
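
      A sketch of the reworked queueing path, loosely following the patch:

        bool irq_work_queue_on(struct irq_work *work, int cpu)
        {
                /* only queue if not already pending */
                if (!irq_work_claim(work))
                        return false;

                preempt_disable();
                if (cpu != smp_processor_id()) {
                        if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
                                arch_send_call_function_single_ipi(cpu);
                } else {
                        __irq_work_queue_local(work);   /* no self-IPI */
                }
                preempt_enable();

                return true;
        }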
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Cédric Le Goater <clg@kaod.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190409093403.20994-1-npiggin@gmail.com
      [ Simplified the preprocessor comments.
        Fixed unbalanced curly brackets pointed out by Thomas. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/core: Handle overflow in cpu_shares_write_u64 · 355673f8
      Konstantin Khlebnikov authored
      [ Upstream commit 5b61d50ab4ef590f5e1d4df15cd2cea5f5715308 ]
      
      Bit shift in scale_load() could overflow shares. This patch saturates
      it to MAX_SHARES, matching what sched_group_set_shares() does.
      
      Example:
      
       # echo 9223372036854776832 > cpu.shares
       # cat cpu.shares
      
      Before patch: 1024
      After patch: 262144
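
      A sketch of the clamp in the write handler:

        static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
                                        struct cftype *cftype, u64 shareval)
        {
                if (shareval > scale_load_down(ULONG_MAX))
                        shareval = MAX_SHARES;  /* saturate, don't overflow */
                return sched_group_set_shares(css_tg(css), scale_load(shareval));
        }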
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125501891.293431.3345233332801109696.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/rt: Check integer overflow at usec to nsec conversion · 7053046e
      Konstantin Khlebnikov authored
      [ Upstream commit 1a010e29cfa00fee2888fd2fd4983f848cbafb58 ]
      
      Example of unhandled overflows:
      
       # echo 18446744073709651 > cpu.rt_runtime_us
       # cat cpu.rt_runtime_us
       99
      
       # echo 18446744073709900 > cpu.rt_period_us
       # cat cpu.rt_period_us
       348
      
      After this patch they will fail with -EINVAL.
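
      A sketch of the guarded conversion:

        int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
        {
                u64 rt_runtime, rt_period;

                if (rt_runtime_us < 0)
                        rt_runtime = RUNTIME_INF;
                else if ((u64)rt_runtime_us > U64_MAX / NSEC_PER_USEC)
                        return -EINVAL; /* the multiply below would wrap */
                else
                        rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;

                rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);
                return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
        }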
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125501739.293431.5252197504404771496.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/core: Check quota and period overflow at usec to nsec conversion · 925275d0
      Konstantin Khlebnikov authored
      [ Upstream commit 1a8b4540db732ca16c9e43ac7c08b1b8f0b252d8 ]
      
      Large values could overflow u64 and pass following sanity checks.
      
       # echo 18446744073750000 > cpu.cfs_period_us
       # cat cpu.cfs_period_us
       40448
      
       # echo 18446744073750000 > cpu.cfs_quota_us
       # cat cpu.cfs_quota_us
       40448
      
      After this patch they will fail with -EINVAL.
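
      The guard mirrors the rt_runtime_us sketch above; e.g. for the quota:

        if (cfs_quota_us < 0)
                quota = RUNTIME_INF;
        else if ((u64)cfs_quota_us > U64_MAX / NSEC_PER_USEC)
                return -EINVAL;
        else
                quota = (u64)cfs_quota_us * NSEC_PER_USEC;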
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125502079.293431.3947497929372138600.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock · 4e4d5cea
      Roman Gushchin authored
      [ Upstream commit 4dcabece4c3a9f9522127be12cc12cc120399b2f ]
      
      The number of descendant cgroups and the number of dying
      descendant cgroups are currently synchronized using the cgroup_mutex.
      
      The number of descendant cgroups will be required by the cgroup v2
      freezer, which will use it to determine if a cgroup is frozen
      (depending on total number of descendants and number of frozen
      descendants). It's not always acceptable to grab the cgroup_mutex,
      especially from quite hot paths (e.g. exit()).
      
      To avoid this, let's additionally synchronize these counters using
      the css_set_lock.
      
      So, it's safe to read these counters with either cgroup_mutex or
      css_set_lock locked; to change them, both locks must be acquired.
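
      A hedged fragment of the writer side under both locks:

        /* writers: cgroup_mutex already held */
        lockdep_assert_held(&cgroup_mutex);

        spin_lock_irq(&css_set_lock);
        for (tcgrp = cgroup_parent(cgrp); tcgrp; tcgrp = cgroup_parent(tcgrp))
                tcgrp->nr_descendants++;
        spin_unlock_irq(&css_set_lock);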
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • audit: fix a memory leak bug · 6c21fa84
      Wenwen Wang authored
      [ Upstream commit 70c4cf17e445264453bc5323db3e50aa0ac9e81f ]
      
      In audit_rule_change(), audit_data_to_entry() is firstly invoked to
      translate the payload data to the kernel's rule representation. In
      audit_data_to_entry(), depending on the audit field type, an audit tree may
      be created in audit_make_tree(), which eventually invokes kmalloc() to
      allocate the tree.  Since this tree is a temporary tree, it will be then
      freed in the following execution, e.g., audit_add_rule() if the message
      type is AUDIT_ADD_RULE or audit_del_rule() if the message type is
      AUDIT_DEL_RULE. However, if the message type is neither AUDIT_ADD_RULE nor
      AUDIT_DEL_RULE, i.e., the default case of the switch statement, this
      temporary tree is not freed.
      
      To fix this issue, only allocate the tree when the type is AUDIT_ADD_RULE
      or AUDIT_DEL_RULE.
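
      A hedged sketch of the reordering in audit_rule_change():

        switch (type) {
        case AUDIT_ADD_RULE:
        case AUDIT_DEL_RULE:
                break;
        default:
                return -EINVAL;         /* bail before anything is allocated */
        }

        entry = audit_data_to_entry(data, datasz);  /* may allocate a tree */
        if (IS_ERR(entry))
                return PTR_ERR(entry);

        err = (type == AUDIT_ADD_RULE) ? audit_add_rule(entry)
                                       : audit_del_rule(entry);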
      Signed-off-by: Wenwen Wang <wang6495@umn.edu>
      Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs · 07da741d
      Nicholas Piggin authored
      [ Upstream commit 9b019acb72e4b5741d88e8936d6f200ed44b66b2 ]
      
      The NOHZ idle balancer runs on the lowest idle CPU. This can
      interfere with isolated CPUs, so confine it to HK_FLAG_MISC
      housekeeping CPUs.
      
      HK_FLAG_SCHED is not used for this because it is not set anywhere
      at the moment. This could be folded into HK_FLAG_SCHED once that
      option is fixed.
      
      The problem was observed with increased jitter on an application
      running on CPU0, caused by NOHZ idle load balancing being run on
      CPU1 (an SMT sibling).
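
      The chooser boils down to one masked iteration; a sketch matching the
      upstream change:

        static inline int find_new_ilb(void)
        {
                int ilb;

                for_each_cpu_and(ilb, nohz.idle_cpus_mask,
                                 housekeeping_cpumask(HK_FLAG_MISC)) {
                        if (idle_cpu(ilb))
                                return ilb;
                }

                return nr_cpu_ids;
        }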
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190412042613.28930-1-npiggin@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • x86/modules: Avoid breaking W^X while loading modules · 8715ce03
      Nadav Amit authored
      [ Upstream commit f2c65fb3221adc6b73b0549fc7ba892022db9797 ]
      
      When modules and BPF filters are loaded, there is a time window in
      which some memory is both writable and executable. An attacker that has
      already found another vulnerability (e.g., a dangling pointer) might be
      able to exploit this behavior to overwrite kernel code. Prevent having
      writable executable PTEs in this stage.
      
      In addition, avoiding having W+X mappings can also slightly simplify the
      patching of modules code on initialization (e.g., by alternatives and
      static-key), as would be done in the next patch. This was actually the
      main motivation for this patch.
      
      To avoid having W+X mappings, set them initially as RW (NX) and after
      they are set as RO set them as X as well. Setting them as executable is
      done as a separate step to avoid one core in which the old PTE is cached
      (hence writable), and another which sees the updated PTE (executable),
      which would break the W^X protection.
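
      A hedged fragment of the two-step permission flip:

        /* 1) map module text writable but non-executable while loading */
        set_memory_nx(addr, numpages);

        /* ... relocations, alternatives, static keys applied here ... */

        /* 2) drop write access first, only then grant execute */
        set_memory_ro(addr, numpages);
        set_memory_x(addr, numpages);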
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <akpm@linux-foundation.org>
      Cc: <ard.biesheuvel@linaro.org>
      Cc: <deneen.t.dock@intel.com>
      Cc: <kernel-hardening@lists.openwall.com>
      Cc: <kristen@linux.intel.com>
      Cc: <linux_dti@icloud.com>
      Cc: <will.deacon@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: https://lkml.kernel.org/r/20190426001143.4983-12-namit@vmware.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • acct_on(): don't mess with freeze protection · 7c2bcb3c
      Al Viro authored
      commit 9419a3191dcb27f24478d288abaab697228d28e6 upstream.
      
      What happens there is that we are replacing file->path.mnt of
      a file we'd just opened with a clone and we need the write
      count contribution to be transferred from original mount to
      new one.  That's it.  We do *NOT* want any kind of freeze
      protection for the duration of switchover.
      
      IOW, we should just use __mnt_{want,drop}_write() for that
      switchover; no need to bother with mnt_{want,drop}_write()
      there.
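
      A hedged fragment of the switchover ('internal' being the freshly
      cloned mount; variable names illustrative):

        err = __mnt_want_write(internal);       /* count only, no freezing */
        if (!err) {
                original = file->f_path.mnt;
                file->f_path.mnt = internal;
                __mnt_drop_write(original);     /* give back the old count */
                mntput(original);
        }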
      Tested-by: Amir Goldstein <amir73il@gmail.com>
      Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: devmap: fix use-after-free Read in __dev_map_entry_free · 003e2d74
      Eric Dumazet authored
      commit 2baae3545327632167c0180e9ca1d467416f1919 upstream.
      
      synchronize_rcu() is fine when the rcu callbacks only need
      to free memory (kfree_rcu() or a direct kfree() in the rcu callbacks).
      
      __dev_map_entry_free() is a bit more complex, so we need to make
      sure that queued __dev_map_entry_free() callbacks have completed.
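
      The distinction the fix relies on, as a minimal illustration:

        synchronize_rcu();  /* waits for a grace period; callbacks already
                             * queued with call_rcu() may still be pending */
        rcu_barrier();      /* waits until all previously queued callbacks,
                             * such as __dev_map_entry_free(), have run */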
      
      syzbot report:
      
      BUG: KASAN: use-after-free in dev_map_flush_old kernel/bpf/devmap.c:365
      [inline]
      BUG: KASAN: use-after-free in __dev_map_entry_free+0x2a8/0x300
      kernel/bpf/devmap.c:379
      Read of size 8 at addr ffff8801b8da38c8 by task ksoftirqd/1/18
      
      CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.17.0+ #39
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x1b9/0x294 lib/dump_stack.c:113
        print_address_description+0x6c/0x20b mm/kasan/report.c:256
        kasan_report_error mm/kasan/report.c:354 [inline]
        kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
        __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
        dev_map_flush_old kernel/bpf/devmap.c:365 [inline]
        __dev_map_entry_free+0x2a8/0x300 kernel/bpf/devmap.c:379
        __rcu_reclaim kernel/rcu/rcu.h:178 [inline]
        rcu_do_batch kernel/rcu/tree.c:2558 [inline]
        invoke_rcu_callbacks kernel/rcu/tree.c:2818 [inline]
        __rcu_process_callbacks kernel/rcu/tree.c:2785 [inline]
        rcu_process_callbacks+0xe9d/0x1760 kernel/rcu/tree.c:2802
        __do_softirq+0x2e0/0xaf5 kernel/softirq.c:284
        run_ksoftirqd+0x86/0x100 kernel/softirq.c:645
        smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
        kthread+0x345/0x410 kernel/kthread.c:240
        ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
      
      Allocated by task 6675:
        save_stack+0x43/0xd0 mm/kasan/kasan.c:448
        set_track mm/kasan/kasan.c:460 [inline]
        kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
        kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
        kmalloc include/linux/slab.h:513 [inline]
        kzalloc include/linux/slab.h:706 [inline]
        dev_map_alloc+0x208/0x7f0 kernel/bpf/devmap.c:102
        find_and_alloc_map kernel/bpf/syscall.c:129 [inline]
        map_create+0x393/0x1010 kernel/bpf/syscall.c:453
        __do_sys_bpf kernel/bpf/syscall.c:2351 [inline]
        __se_sys_bpf kernel/bpf/syscall.c:2328 [inline]
        __x64_sys_bpf+0x303/0x510 kernel/bpf/syscall.c:2328
        do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 26:
        save_stack+0x43/0xd0 mm/kasan/kasan.c:448
        set_track mm/kasan/kasan.c:460 [inline]
        __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
        kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
        __cache_free mm/slab.c:3498 [inline]
        kfree+0xd9/0x260 mm/slab.c:3813
        dev_map_free+0x4fa/0x670 kernel/bpf/devmap.c:191
        bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:262
        process_one_work+0xc64/0x1b70 kernel/workqueue.c:2153
        worker_thread+0x181/0x13a0 kernel/workqueue.c:2296
        kthread+0x345/0x410 kernel/kthread.c:240
        ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
      
      The buggy address belongs to the object at ffff8801b8da37c0
        which belongs to the cache kmalloc-512 of size 512
      The buggy address is located 264 bytes inside of
        512-byte region [ffff8801b8da37c0, ffff8801b8da39c0)
      The buggy address belongs to the page:
      page:ffffea0006e368c0 count:1 mapcount:0 mapping:ffff8801da800940
      index:0xffff8801b8da3540
      flags: 0x2fffc0000000100(slab)
      raw: 02fffc0000000100 ffffea0007217b88 ffffea0006e30cc8 ffff8801da800940
      raw: ffff8801b8da3540 ffff8801b8da3040 0000000100000004 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
        ffff8801b8da3780: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
        ffff8801b8da3800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      > ffff8801b8da3880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                     ^
        ffff8801b8da3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff8801b8da3980: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
      
      Fixes: 546ac1ff ("bpf: add devmap, a map for storing net device references")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot+457d3e2ffbcf31aee5c0@syzkaller.appspotmail.com
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: add bpf_jit_limit knob to restrict unpriv allocations · 43caa29c
      Daniel Borkmann authored
      commit ede95a63b5e84ddeea6b0c473b36ab8bfd8c6ce3 upstream.
      
      Rick reported that the BPF JIT could potentially fill the entire module
      space with BPF programs from unprivileged users which would prevent later
      attempts to load normal kernel modules or privileged BPF programs, for
      example. If JIT was enabled but unsuccessful to generate the image, then
      before commit 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      we would always fall back to the BPF interpreter. Nowadays in the case
      where the CONFIG_BPF_JIT_ALWAYS_ON could be set, then the load will abort
      with a failure since the BPF interpreter was compiled out.
      
      Add a global limit and enforce it for unprivileged users such that in case
      of BPF interpreter compiled out we fail once the limit has been reached
      or we fall back to BPF interpreter earlier w/o using module mem if latter
      was compiled in. In a next step, fair share among unprivileged users can
      be resolved in particular for the case where we would fail hard once limit
      is reached.
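
      A hedged sketch of the charging pair (names loosely follow the
      patch):

        static atomic_long_t bpf_jit_current;

        static int bpf_jit_charge_modmem(u32 pages)
        {
                if (atomic_long_add_return(pages, &bpf_jit_current) >
                    (bpf_jit_limit >> PAGE_SHIFT)) {
                        if (!capable(CAP_SYS_ADMIN)) {
                                atomic_long_sub(pages, &bpf_jit_current);
                                return -EPERM;
                        }
                }
                return 0;
        }

        static void bpf_jit_uncharge_modmem(u32 pages)
        {
                atomic_long_sub(pages, &bpf_jit_current);
        }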
      
      Fixes: 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64")
      Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: LKML <linux-kernel@vger.kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  11. 26 May 2019, 2 commits
    • bpf, lru: avoid messing with eviction heuristics upon syscall lookup · 107e215c
      Daniel Borkmann authored
      commit 50b045a8c0ccf44f76640ac3eea8d80ca53979a3 upstream.
      
      One of the biggest issues we face right now with picking LRU map over
      regular hash table is that a map walk out of user space, for example,
      to just dump the existing entries or to remove certain ones, will
      completely mess up LRU eviction heuristics and wrong entries such
      as just created ones will get evicted instead. The reason for this
      is that we mark an entry as "in use" via bpf_lru_node_set_ref() from
      system call lookup side as well. Thus upon walk, all entries are
      being marked, so information of actual least recently used ones
      are "lost".
      
      In case of Cilium where it can be used (besides others) as a BPF
      based connection tracker, this current behavior causes disruption
      upon control plane changes that need to walk the map from user space
      to evict certain entries. Discussion result from bpfconf [0] was that
      we should simply just remove marking from system call side as no
      good use case could be found where it's actually needed there.
      Therefore this patch removes marking for regular LRU and per-CPU
      flavor. If there ever should be a need in future, the behavior could
      be selected via map creation flag, but due to mentioned reason we
      avoid this here.
      
        [0] http://vger.kernel.org/bpfconf.html
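
      A sketch of the syscall-only lookup for the LRU hash, close to the
      upstream patch:

        static void *htab_lru_map_lookup_elem_sys(struct bpf_map *map, void *key)
        {
                struct htab_elem *l = __htab_map_lookup_elem(map, key);

                /* unlike the data-path lookup, deliberately no
                 * bpf_lru_node_set_ref() here */
                if (l)
                        return l->key + round_up(map->key_size, 8);

                return NULL;
        }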
      
      Fixes: 29ba732a ("bpf: Add BPF_MAP_TYPE_LRU_HASH")
      Fixes: 8f844938 ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: add map_lookup_elem_sys_only for lookups from syscall side · 2bb3c547
      Daniel Borkmann authored
      commit c6110222c6f49ea68169f353565eb865488a8619 upstream.
      
      Add a callback map_lookup_elem_sys_only() that map implementations
      could use over map_lookup_elem() from system call side in case the
      map implementation needs to handle the latter differently than from
      the BPF data path. If map_lookup_elem_sys_only() is set, this will
      be preferred pick for map lookups out of user space. This hook is
      used in a follow-up fix for LRU map, but once development window
      opens, we can convert other map types from map_lookup_elem() (here,
      the one called upon BPF_MAP_LOOKUP_ELEM cmd is meant) over to use
      the callback to simplify and clean up the latter.
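
      A hedged fragment of the dispatch on the syscall path:

        if (map->ops->map_lookup_elem_sys_only)
                ptr = map->ops->map_lookup_elem_sys_only(map, key);
        else
                ptr = map->ops->map_lookup_elem(map, key);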
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>