1. 05 7月, 2019 11 次提交
  2. 03 7月, 2019 5 次提交
    • D
      bpf: fix unconnected udp hooks · 613bc37f
      Daniel Borkmann 提交于
      commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream.
      
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparent to the application,
      this needs reverse translation on recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both, connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
      way to call into the BPF program whenever a non-NULL msg->msg_name was
      passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NMartynas Pumputis <m@lambda.lt>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      613bc37f
    • M
      bpf: fix nested bpf tracepoints with per-cpu data · a7177b94
      Matt Mullins 提交于
      commit 9594dc3c7e71b9f52bee1d7852eb3d4e3aea9e99 upstream.
      
      BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
      they do not increment bpf_prog_active while executing.
      
      This enables three levels of nesting, to support
        - a kprobe or raw tp or perf event,
        - another one of the above that irq context happens to call, and
        - another one in nmi context
      (at most one of which may be a kprobe or perf event).
      
      Fixes: 20b9d7ac ("bpf: avoid excessive stack usage for perf_sample_data")
      Signed-off-by: NMatt Mullins <mmullins@fb.com>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a7177b94
    • J
      bpf: lpm_trie: check left child of last leftmost node for NULL · 4992d4af
      Jonathan Lemon 提交于
      commit da2577fdd0932ea4eefe73903f1130ee366767d2 upstream.
      
      If the leftmost parent node of the tree has does not have a child
      on the left side, then trie_get_next_key (and bpftool map dump) will
      not look at the child on the right.  This leads to the traversal
      missing elements.
      
      Lookup is not affected.
      
      Update selftest to handle this case.
      
      Reproducer:
      
       bpftool map create /sys/fs/bpf/lpm type lpm_trie key 6 \
           value 1 entries 256 name test_lpm flags 1
       bpftool map update pinned /sys/fs/bpf/lpm key  8 0 0 0  0   0 value 1
       bpftool map update pinned /sys/fs/bpf/lpm key 16 0 0 0  0 128 value 2
       bpftool map dump   pinned /sys/fs/bpf/lpm
      
      Returns only 1 element. (2 expected)
      
      Fixes: b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE")
      Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4992d4af
    • G
      cpu/speculation: Warn on unsupported mitigations= parameter · b78ad216
      Geert Uytterhoeven 提交于
      commit 1bf72720281770162c87990697eae1ba2f1d917a upstream.
      
      Currently, if the user specifies an unsupported mitigation strategy on the
      kernel command line, it will be ignored silently.  The code will fall back
      to the default strategy, possibly leaving the system more vulnerable than
      expected.
      
      This may happen due to e.g. a simple typo, or, for a stable kernel release,
      because not all mitigation strategies have been backported.
      
      Inform the user by printing a message.
      
      Fixes: 98af8452945c5565 ("cpu/speculation: Add 'mitigations=' cmdline option")
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190516070935.22546-1-geert@linux-m68k.orgSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b78ad216
    • S
      Revert "x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP" · fec1a13b
      Sasha Levin 提交于
      This reverts commit 1a3188d7, which was
      upstream commit 4a6c91fbdef846ec7250b82f2eeeb87ac5f18cf9.
      
      On Tue, Jun 25, 2019 at 09:39:45AM +0200, Sebastian Andrzej Siewior wrote:
      >Please backport commit e74deb11931ff682b59d5b9d387f7115f689698e to
      >stable _or_ revert the backport of commit 4a6c91fbdef84 ("x86/uaccess,
      >ftrace: Fix ftrace_likely_update() vs. SMAP"). It uses
      >user_access_{save|restore}() which has been introduced in the following
      >commit.
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      fec1a13b
  3. 25 6月, 2019 1 次提交
    • M
      tracing: Silence GCC 9 array bounds warning · c493ead3
      Miguel Ojeda 提交于
      commit 0c97bf863efce63d6ab7971dad811601e6171d2f upstream.
      
      Starting with GCC 9, -Warray-bounds detects cases when memset is called
      starting on a member of a struct but the size to be cleared ends up
      writing over further members.
      
      Such a call happens in the trace code to clear, at once, all members
      after and including `seq` on struct trace_iterator:
      
          In function 'memset',
              inlined from 'ftrace_dump' at kernel/trace/trace.c:8914:3:
          ./include/linux/string.h:344:9: warning: '__builtin_memset' offset
          [8505, 8560] from the object at 'iter' is out of the bounds of
          referenced subobject 'seq' with type 'struct trace_seq' at offset
          4368 [-Warray-bounds]
            344 |  return __builtin_memset(p, c, size);
                |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      In order to avoid GCC complaining about it, we compute the address
      ourselves by adding the offsetof distance instead of referring
      directly to the member.
      
      Since there are two places doing this clear (trace.c and trace_kdb.c),
      take the chance to move the workaround into a single place in
      the internal header.
      
      Link: http://lkml.kernel.org/r/20190523124535.GA12931@gmail.comSigned-off-by: NMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      [ Removed unnecessary parenthesis around "iter" ]
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c493ead3
  4. 22 6月, 2019 3 次提交
    • P
      perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data · 991ea848
      Peter Zijlstra 提交于
      [ Upstream commit 4d839dd9e4356bbacf3eb0ab13a549b83b008c21 ]
      
      We must use {READ,WRITE}_ONCE() on rb->user_page data such that
      concurrent usage will see whole values. A few key sites were missing
      this.
      Suggested-by: NYabin Cui <yabinc@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mark.rutland@arm.com
      Cc: namhyung@kernel.org
      Fixes: 7b732a75 ("perf_counter: new output ABI - part 1")
      Link: http://lkml.kernel.org/r/20190517115418.394192145@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      991ea848
    • P
      perf/ring_buffer: Add ordering to rb->nest increment · c133c9db
      Peter Zijlstra 提交于
      [ Upstream commit 3f9fbe9bd86c534eba2faf5d840fd44c6049f50e ]
      
      Similar to how decrementing rb->next too early can cause data_head to
      (temporarily) be observed to go backward, so too can this happen when
      we increment too late.
      
      This barrier() ensures the rb->head load happens after the increment,
      both the one in the 'goto again' path, as the one from
      perf_output_get_handle() -- albeit very unlikely to matter for the
      latter.
      Suggested-by: NYabin Cui <yabinc@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mark.rutland@arm.com
      Cc: namhyung@kernel.org
      Fixes: ef60777c ("perf: Optimize the perf_output() path by removing IRQ-disables")
      Link: http://lkml.kernel.org/r/20190517115418.309516009@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      c133c9db
    • Y
      perf/ring_buffer: Fix exposing a temporarily decreased data_head · cca19ab2
      Yabin Cui 提交于
      [ Upstream commit 1b038c6e05ff70a1e66e3e571c2e6106bdb75f53 ]
      
      In perf_output_put_handle(), an IRQ/NMI can happen in below location and
      write records to the same ring buffer:
      
      	...
      	local_dec_and_test(&rb->nest)
      	...                          <-- an IRQ/NMI can happen here
      	rb->user_page->data_head = head;
      	...
      
      In this case, a value A is written to data_head in the IRQ, then a value
      B is written to data_head after the IRQ. And A > B. As a result,
      data_head is temporarily decreased from A to B. And a reader may see
      data_head < data_tail if it read the buffer frequently enough, which
      creates unexpected behaviors.
      
      This can be fixed by moving dec(&rb->nest) to after updating data_head,
      which prevents the IRQ/NMI above from updating data_head.
      
      [ Split up by peterz. ]
      Signed-off-by: NYabin Cui <yabinc@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mark.rutland@arm.com
      Fixes: ef60777c ("perf: Optimize the perf_output() path by removing IRQ-disables")
      Link: http://lkml.kernel.org/r/20190517115418.224478157@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      cca19ab2
  5. 19 6月, 2019 5 次提交
  6. 15 6月, 2019 4 次提交
  7. 11 6月, 2019 1 次提交
    • J
      x86/power: Fix 'nosmt' vs hibernation triple fault during resume · 4d166206
      Jiri Kosina 提交于
      commit ec527c318036a65a083ef68d8ba95789d2212246 upstream.
      
      As explained in
      
      	0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      
      we always, no matter what, have to bring up x86 HT siblings during boot at
      least once in order to avoid first MCE bringing the system to its knees.
      
      That means that whenever 'nosmt' is supplied on the kernel command-line,
      all the HT siblings are as a result sitting in mwait or cpudile after
      going through the online-offline cycle at least once.
      
      This causes a serious issue though when a kernel, which saw 'nosmt' on its
      commandline, is going to perform resume from hibernation: if the resume
      from the hibernated image is successful, cr3 is flipped in order to point
      to the address space of the kernel that is being resumed, which in turn
      means that all the HT siblings are all of a sudden mwaiting on address
      which is no longer valid.
      
      That results in triple fault shortly after cr3 is switched, and machine
      reboots.
      
      Fix this by always waking up all the SMT siblings before initiating the
      'restore from hibernation' process; this guarantees that all the HT
      siblings will be properly carried over to the resumed kernel waiting in
      resume_play_dead(), and acted upon accordingly afterwards, based on the
      target kernel configuration.
      
      Symmetricaly, the resumed kernel has to push the SMT siblings to mwait
      again in case it has SMT disabled; this means it has to online all
      the siblings when resuming (so that they come out of hlt) and offline
      them again to let them reach mwait.
      
      Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
      Debugged-by: NThomas Gleixner <tglx@linutronix.de>
      Fixes: 0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d166206
  8. 09 6月, 2019 2 次提交
  9. 04 6月, 2019 1 次提交
  10. 31 5月, 2019 7 次提交