1. 22 Dec 2017, 4 commits
  2. 21 Dec 2017, 3 commits
    • net: sock: replace sk_state_load with inet_sk_state_load and remove sk_state_store · 986ffdfd
      Authored by Yafang Shao
      sk_state_load is only used by AF_INET/AF_INET6, so rename it to
      inet_sk_state_load and move it into inet_sock.h.
      
      sk_state_store is removed as it is not used any more.
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
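      A minimal sketch of what the renamed helper plausibly looks like in
      inet_sock.h, assuming it keeps the acquire semantics of the old
      sk_state_load():

          static inline int inet_sk_state_load(const struct sock *sk)
          {
                  /* paired with the release in inet_sk_state_store() */
                  return smp_load_acquire(&sk->sk_state);
          }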
    • net: tracepoint: replace tcp_set_state tracepoint with inet_sock_set_state tracepoint · 563e0bb0
      Authored by Yafang Shao
      Since sk_state is a common field of struct sock, the state
      transition tracepoint should not be a TCP-specific feature.
      It currently traces all AF_INET state transitions, so rename the
      tracepoint to inet_sock_set_state, with some minor changes, and move it
      into trace/events/sock.h.
      We don't need to create a file named trace/events/inet_sock.h for this
      one single tracepoint.
      
      Two helpers are introduced to trace sk_state transitions:
          - void inet_sk_state_store(struct sock *sk, int newstate);
          - void inet_sk_set_state(struct sock *sk, int state);
      Since trace headers should not be included in other header files,
      they are defined in sock.c.
      
      Protocols such as SCTP may be compiled as modules (.ko), hence
      export inet_sk_set_state().
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
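      A hedged sketch of the two helpers in net/core/sock.c; the exact bodies
      are an assumption based on the description above:

          void inet_sk_set_state(struct sock *sk, int state)
          {
                  trace_inet_sock_set_state(sk, sk->sk_state, state);
                  sk->sk_state = state;
          }
          EXPORT_SYMBOL(inet_sk_set_state);   /* SCTP may be a module */

          void inet_sk_state_store(struct sock *sk, int newstate)
          {
                  trace_inet_sock_set_state(sk, sk->sk_state, newstate);
                  smp_store_release(&sk->sk_state, newstate);
          }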
    • tcp: Export to userspace the TCP state names for the trace events · d7b850a7
      Authored by Steven Rostedt (VMware)
      The TCP trace events (specifically tcp_set_state) map enums to symbol
      names via __print_symbolic(). But this only works when reading trace
      events from the tracefs trace files. If perf or trace-cmd record these
      events, the event format file does not convert the enum names into
      numbers, and you get something like:
      
      __print_symbolic(REC->oldstate,
          { TCP_ESTABLISHED, "TCP_ESTABLISHED" },
          { TCP_SYN_SENT, "TCP_SYN_SENT" },
          { TCP_SYN_RECV, "TCP_SYN_RECV" },
          { TCP_FIN_WAIT1, "TCP_FIN_WAIT1" },
          { TCP_FIN_WAIT2, "TCP_FIN_WAIT2" },
          { TCP_TIME_WAIT, "TCP_TIME_WAIT" },
          { TCP_CLOSE, "TCP_CLOSE" },
          { TCP_CLOSE_WAIT, "TCP_CLOSE_WAIT" },
          { TCP_LAST_ACK, "TCP_LAST_ACK" },
          { TCP_LISTEN, "TCP_LISTEN" },
          { TCP_CLOSING, "TCP_CLOSING" },
          { TCP_NEW_SYN_RECV, "TCP_NEW_SYN_RECV" })
      
      Where trace-cmd and perf do not know the values of those enums.
      
      Use the TRACE_DEFINE_ENUM() macro, which has the trace events convert
      the enum names into their values at system boot. This allows perf and
      trace-cmd to see actual numbers and not enums:
      
      __print_symbolic(REC->oldstate,
          { 1, "TCP_ESTABLISHED" },
          { 2, "TCP_SYN_SENT" },
          { 3, "TCP_SYN_RECV" },
          { 4, "TCP_FIN_WAIT1" },
          { 5, "TCP_FIN_WAIT2" },
          { 6, "TCP_TIME_WAIT" },
          { 7, "TCP_CLOSE" },
          { 8, "TCP_CLOSE_WAIT" },
          { 9, "TCP_LAST_ACK" },
          { 10, "TCP_LISTEN" },
          { 11, "TCP_CLOSING" },
          { 12, "TCP_NEW_SYN_RECV" })
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
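      The conversion itself is mechanical; a sketch of the boot-time export in
      the trace header:

          /* make the numeric values visible to perf/trace-cmd */
          TRACE_DEFINE_ENUM(TCP_ESTABLISHED);
          TRACE_DEFINE_ENUM(TCP_SYN_SENT);
          /* ... one TRACE_DEFINE_ENUM() per TCP state ... */
          TRACE_DEFINE_ENUM(TCP_NEW_SYN_RECV);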
  3. 19 Dec 2017, 4 commits
    • net: Introduce NETIF_F_GRO_HW. · fb1f5f79
      Authored by Michael Chan
      Introduce the NETIF_F_GRO_HW feature flag for NICs that support hardware
      GRO.  With this flag, we can now independently turn hardware GRO on or
      off when GRO is on.  Previously, drivers used NETIF_F_GRO to control
      hardware GRO, so it could not be turned on or off independently
      without affecting GRO.
      
      Hardware GRO (just like GRO) guarantees that packets can be re-segmented
      by TSO/GSO to reconstruct the original packet stream.  Logically,
      GRO_HW should depend on GRO since it is a subset, but we will let
      individual drivers enforce this dependency as they see fit.
      
      Since NETIF_F_GRO is not propagated between upper and lower devices,
      NETIF_F_GRO_HW should follow suit since it is a subset of GRO.  In other
      words, a lower device can independently have GRO/GRO_HW enabled or
      disabled and no feature propagation is required.  This preserves the
      current GRO behavior.  It can be changed later if we decide to propagate
      GRO/GRO_HW/RXCSUM from upper to lower devices.
      
      Cc: Ariel Elior <Ariel.Elior@cavium.com>
      Cc: everest-linux-l2@cavium.com
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
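      A sketch of how a driver might enforce the GRO dependency in its
      .ndo_fix_features hook (illustrative only; the name foo_fix_features is
      hypothetical):

          static netdev_features_t
          foo_fix_features(struct net_device *dev, netdev_features_t features)
          {
                  /* GRO_HW is a subset of GRO: drop it when GRO is off */
                  if (!(features & NETIF_F_GRO))
                          features &= ~NETIF_F_GRO_HW;
                  return features;
          }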
    • sock: Hide unused variable when !CONFIG_PROC_FS. · 398b841e
      Authored by Tonghao Zhang
      When CONFIG_PROC_FS is disabled, the prot_inuse counter is unused.
      This adds an #ifdef to hide the variable definition in that case.
      This is not a bugfix, but it saves bytes when there are many network
      namespaces.
      
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com>
      Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
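      The resulting definition is plausibly along these lines (a sketch of the
      #ifdef guard described above):

          struct netns_core {
                  /* ... */
          #ifdef CONFIG_PROC_FS
                  struct prot_inuse __percpu *prot_inuse; /* /proc readers only */
          #endif
          };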
    • sock: Move the socket inuse to namespace. · 648845ab
      Authored by Tonghao Zhang
      In some cases, we want to know how many sockets are in use in
      different _net_ namespaces. It's a key resource metric.

      This patch adds a member to struct netns_core: a counter for
      sockets in use in the _net_ namespace. The counter is
      incremented/decremented in sk_alloc, sk_clone_lock and __sk_free.

      This patch does not count sockets created in the kernel, since
      it's not very useful for userspace to know how many kernel
      sockets were created.

      The main reasons for doing this are:

      1. When Linux calls 'do_exit' for a process to exit, the functions
      'exit_task_namespaces' and 'exit_task_work' are called sequentially.
      'exit_task_namespaces' may have destroyed the _net_ namespace, but
      'sock_release', called in 'exit_task_work', would still use the _net_
      namespace if we counted socket-inuse in sock_release.

      2. socket and sock come in pairs. More importantly, sock holds the
      _net_ namespace. We count socket-inuse in sock to avoid holding the
      _net_ namespace again in socket. It's an easier way to maintain the code.
      Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com>
      Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
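      A sketch of the counting helper this implies (the helper name
      sock_inuse_add is an assumption):

          static inline void sock_inuse_add(struct net *net, int val)
          {
          #ifdef CONFIG_PROC_FS
                  this_cpu_add(*net->core.sock_inuse, val);
          #endif
          }

          /* sk_alloc()/sk_clone_lock() would call sock_inuse_add(net, 1),
           * __sk_free() would call sock_inuse_add(net, -1) */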
    • sock: Change the netns_core member name. · 08fc7f81
      Authored by Tonghao Zhang
      Changing the member name makes the code more readable.
      It will be used by the next patch.
      Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com>
      Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 18 Dec 2017, 6 commits
    • bpf: x64: add JIT support for multi-function programs · 1c2a088a
      Authored by Alexei Starovoitov
      A typical JIT does several passes over bpf instructions to
      compute the total size and the relative offsets of jumps and calls.
      With multiple bpf functions calling each other, all relative calls
      will have invalid offsets initially, therefore we need an additional
      last pass over the program to emit calls with correct offsets.
      For example, in the case of three bpf functions:
      main:
        call foo
        call bpf_map_lookup
        exit
      foo:
        call bar
        exit
      bar:
        exit
      
      We will call bpf_int_jit_compile() independently for main(), foo() and
      bar().  The x64 JIT typically does 4-5 passes to converge.
      After these initial passes the images for these 3 functions
      will be good except for the call targets, since the start addresses of
      foo() and bar() are unknown while we are JITing main()
      (note that the call to bpf_map_lookup will be resolved properly
      during the initial passes).
      Once the start addresses of all 3 functions are known, we patch
      call_insn->imm to point to the right functions and call
      bpf_int_jit_compile() again, which needs only one pass.
      Additional safety checks are done to make sure this
      last pass doesn't produce an image that is larger or smaller
      than the previous pass.
      
      When constant blinding is on, it's applied to all functions
      at the first pass, since doing it again at the last
      pass could change the size of the JITed code.
      
      Tested on x64 and arm64 hardware with JIT on/off and blinding on/off.
      x64 JITs bpf-to-bpf calls correctly while arm64 falls back to the
      interpreter.  All other JITs that support normal BPF_CALL will behave
      the same way, since a bpf-to-bpf call is equivalent to a bpf-to-kernel
      call from the JIT's point of view.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
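      Conceptually, the flow is something like this rough sketch (not the
      actual JIT code; helper names are illustrative):

          /* pass 1: JIT every function; call offsets are still wrong */
          for (i = 0; i < func_cnt; i++)
                  jit_compile(func[i]);

          /* now all start addresses are known */
          for (i = 0; i < func_cnt; i++) {
                  patch_call_imms(func[i]); /* point calls at real targets */
                  jit_compile(func[i]);     /* one extra pass */
                  /* safety: image size must not change in this pass */
          }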
    • bpf: fix net.core.bpf_jit_enable race · 60b58afc
      Authored by Alexei Starovoitov
      The global bpf_jit_enable variable is tested multiple times in the JITs,
      blinding and verifier core. A malicious root user can try to toggle
      it while programs are being loaded. This race condition was accounted
      for and there should be no issues, but it's safer to avoid
      it altogether.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
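      A sketch of one way to avoid such a race: latch the sysctl once per
      program at allocation time (treat the field name jit_requested as an
      assumption):

          /* in bpf_prog_alloc(): read the knob exactly once */
          fp->jit_requested = ebpf_jit_enabled();

          /* later code tests prog->jit_requested instead of re-reading
           * the global bpf_jit_enable */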
    • bpf: add support for bpf_call to interpreter · 1ea47e01
      Authored by Alexei Starovoitov
      Though bpf_call is still the same call instruction, and the
      calling convention for 'bpf to bpf' and 'bpf to helper' is the same,
      the interpreter has to operate on 'struct bpf_insn *'.
      To distinguish these two cases, add a kernel-internal opcode and
      mark call insns with it.
      This opcode is seen by the interpreter only; JITs will never see it.
      Also add a tiny bit of debug code to aid interpreter debugging.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
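      A sketch of the marking (the internal opcode name and value are an
      assumption):

          /* kernel-internal opcode, never exposed to JITs or user space */
          #define BPF_CALL_ARGS   0xe0

          /* during fixup: mark bpf-to-bpf calls for the interpreter */
          if (insn->src_reg == BPF_PSEUDO_CALL)
                  insn->code = BPF_JMP | BPF_CALL_ARGS;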
    • bpf: teach verifier to recognize zero initialized stack · cc2b14d5
      Authored by Alexei Starovoitov
      Programs with function calls often pass various
      pointers via the stack. When all calls are inlined, llvm
      flattens stack accesses and optimizes away extra branches.
      When functions are not inlined, it becomes the job of
      the verifier to recognize zero-initialized stack to avoid
      exploring paths that the program will not take.
      The following program would fail otherwise:
      
      ptr = &buffer_on_stack;
      *ptr = 0;
      ...
      func_call(.., ptr, ...) {
        if (..)
          *ptr = bpf_map_lookup();
      }
      ...
      if (*ptr != 0) {
        // Access (*ptr)->field is valid.
        // Without stack_zero tracking such (*ptr)->field access
        // will be rejected
      }
      
      Since stack slots are no longer uniform (invalid | spill | misc),
      add liveness marking to all slots, but do it in 8-byte chunks.
      So if nothing was read or written in the [fp-16, fp-9] range,
      it will be marked as LIVE_NONE.
      If any byte in that range was read, it will be marked LIVE_READ
      and the stacksafe() check will perform byte-by-byte verification.
      If all bytes in the range were written, the slot will be
      marked as LIVE_WRITTEN.
      This significantly speeds up state equality comparison
      and reduces the total number of states processed.
      
                          before   after
      bpf_lb-DLB_L3.o       2051    2003
      bpf_lb-DLB_L4.o       3287    3164
      bpf_lb-DUNKNOWN.o     1080    1080
      bpf_lxc-DDROP_ALL.o   24980   12361
      bpf_lxc-DUNKNOWN.o    34308   16605
      bpf_netdev.o          15404   10962
      bpf_overlay.o         7191    6679
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
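      A sketch of the per-slot liveness states described above (names are
      illustrative):

          /* tracked per 8-byte stack slot */
          enum {
                  LIVE_NONE,      /* never read or written on this path */
                  LIVE_READ,      /* some byte read: verify byte-by-byte */
                  LIVE_WRITTEN,   /* all 8 bytes written: cheap compare */
          };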
    • bpf: introduce function calls (verification) · f4d7e40a
      Authored by Alexei Starovoitov
      Allow arbitrary function calls from one bpf function to another.
      
      To recognize such a set of bpf functions, the verifier:
      1. runs control flow analysis to detect function boundaries
      2. proceeds with verification of all functions, starting from the main
      (root) function. It recognizes that the stack of the caller can be
      accessed by the callee (if the caller passed a pointer to its stack to
      the callee) and that the callee can store map_value and other pointers
      into the stack of the caller
      3. keeps track of the stack_depth of each function to make sure that the
      total stack depth is still less than 512 bytes
      4. disallows pointers to the callee stack to be stored into the caller
      stack, since they will be invalid as soon as the callee returns
      5. to reuse all of the existing state-pruning logic, each function call
      is considered an independent call from the verifier's point of view.
      The verifier pretends to inline every function call it sees.
      It stores the callsite instruction index as part of the state to make
      sure that two calls to the same callee from two different places in the
      caller will be different from the state-pruning point of view
      6. adds more safety checks to the liveness analysis
      
      Implementation details:
      . struct bpf_verifier_state now consists of all stack frames that
        led to this function
      . struct bpf_func_state represents one stack frame. It consists of the
        registers in the given frame and its stack
      . the propagate_liveness() logic had a premature optimization where
        mark_reg_read() and mark_stack_slot_read() were manually inlined
        with a loop iterating over parents for each register or stack slot.
        Undo this optimization to reuse the more complex mark_*_read() logic
      . the skip_callee() logic is not necessary from a safety point of view,
        but without it the mark_*_read() markings become too conservative,
        since after returning from the function call a read of r6-r9
        would incorrectly propagate the read marks into the callee, causing
        inefficient pruning later
      . the mark_*_read() logic is now aware of control flow, which makes it
        more complex. In the future the plan is to rewrite liveness
        to be hierarchical, so that liveness can be done within a
        basic block only and control flow will be responsible for
        propagation of liveness information along the cfg and between calls
      . tail_calls and ld_abs insns are not allowed in programs with
        bpf-to-bpf calls
      . returning stack pointers to the caller or storing them into the stack
        frame of the caller is not allowed
      
      Testing:
      . no difference in cilium processed_insn numbers
      . large number of tests follows in next patches
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
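      A sketch of the frame layout from the implementation details above
      (close to the description, but treat the field details as assumptions):

          struct bpf_func_state {
                  struct bpf_reg_state regs[MAX_BPF_REG]; /* this frame's regs */
                  int callsite;   /* call insn that created this frame */
                  /* ... this frame's stack slots ... */
          };

          struct bpf_verifier_state {
                  /* all frames that led here; frame[0] is main() */
                  struct bpf_func_state *frame[MAX_CALL_FRAMES];
                  u32 curframe;
          };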
    • bpf: introduce function calls (function boundaries) · cc8b0b92
      Authored by Alexei Starovoitov
      Allow arbitrary function calls from one bpf function to another.
      
      Since the beginning of bpf, all bpf programs have been represented as a
      single function, and program authors were forced to use always_inline
      for all functions in their C code. That caused llvm to unnecessarily
      inflate the code size and forced developers to move code into header
      files with little code reuse.

      With a bit of additional complexity, teach the verifier to recognize
      arbitrary function calls from one bpf function to another, as long as
      all of the functions are presented to the verifier as a single bpf
      program.
      New program layout:
      r6 = r1    // some code
      ..
      r1 = ..    // arg1
      r2 = ..    // arg2
      call pc+1  // function call pc-relative
      exit
      .. = r1    // access arg1
      .. = r2    // access arg2
      ..
      call pc+20 // second level of function call
      ...
      
      This allows for better-optimized code and finally allows the
      introduction of core bpf libraries that can be reused in different
      projects, since programs are no longer limited to a single elf file.
      With function calls, bpf code can be compiled into multiple .o files.
      
      This patch is the first step. It detects programs that contain
      multiple functions and checks that calls between them are valid.
      It splits the sequence of bpf instructions (one program) into a set
      of bpf functions that call each other. Only calls to known
      functions are allowed. In the future the verifier may allow
      calls to unresolved functions and do dynamic linking.
      This logic supports statically linked bpf functions only.

      Such function boundary detection could have been done as part of
      control flow graph building in check_cfg(), but it's cleaner to
      separate function boundary detection from the control flow checks
      within a subprogram (function) as logically independent steps.
      Follow-up patches may split check_cfg() further, but not
      check_subprogs().

      bpf-to-bpf calls are allowed for root only and for non-hw-offloaded
      programs.  These restrictions can be relaxed in the future.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
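      A sketch of the boundary detection (simplified; add_subprog() is the
      assumed helper name):

          /* bpf-to-bpf calls carry BPF_PSEUDO_CALL in src_reg */
          for (i = 0; i < insn_cnt; i++) {
                  if (insn[i].code != (BPF_JMP | BPF_CALL) ||
                      insn[i].src_reg != BPF_PSEUDO_CALL)
                          continue;
                  /* record the pc-relative callee entry point */
                  add_subprog(env, i + insn[i].imm + 1);
          }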
  5. 16 Dec 2017, 14 commits
  6. 15 Dec 2017, 6 commits
  7. 14 Dec 2017, 3 commits
    • drm: rework delayed connector cleanup in connector_iter · ea497bb9
      Authored by Daniel Vetter
      PROBE_DEFER also uses system_wq to reprobe drivers, which means that
      when reprobing fails again and we try to flush the overall system_wq
      (to get all the delayed connector cleanup work_structs completed), we
      deadlock.

      Fix this by using just a single cleanup work, so that we only need to
      flush that one and don't block on anything else. That means a free
      list plus locking, a standard pattern.
      
      v2:
      - Correctly free connectors only on last ref. Oops (Chris).
      - use llist_head/node (Chris).
      
      v3:
      - Add init_llist_head (Chris).
      
      Fixes: a703c550 ("drm: safely free connectors from connector_iter")
      Fixes: 613051da ("drm: locking&new iterators for connector_list")
      Cc: Ben Widawsky <ben@bwidawsk.net>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Sean Paul <seanpaul@chromium.org>
      Cc: <stable@vger.kernel.org> # v4.11+: 613051da ("drm: locking&new iterators for connector_list")
      Cc: <stable@vger.kernel.org> # v4.11+
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Gustavo Padovan <gustavo@padovan.org>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Javier Martinez Canillas <javier@dowhile0.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Guillaume Tucker <guillaume.tucker@collabora.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Kevin Hilman <khilman@baylibre.com>
      Cc: Matt Hart <matthew.hart@linaro.org>
      Cc: Thierry Escande <thierry.escande@collabora.co.uk>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Enric Balletbo i Serra <enric.balletbo@collabora.com>
      Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20171213124936.17914-1-daniel.vetter@ffwll.ch
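      The free-list pattern sketched with llist, as noted in v2/v3 (field and
      helper names are assumptions):

          /* push the connector on a lock-free list, kick the one worker */
          llist_add(&connector->free_node,
                    &config->connector_free_list);
          schedule_work(&config->connector_free_work);

          /* the single work item drains the list with llist_del_all()
           * and frees every connector on it */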
    • bpf/tracing: fix kernel/events/core.c compilation error · f4e2298e
      Authored by Yonghong Song
      Commit f371b304 ("bpf/tracing: allow user space to
      query prog array on the same tp") introduced a perf
      ioctl command to query the prog array attached to the
      same perf tracepoint. The commit introduced a
      compilation error under certain config conditions, e.g.:
        (1) CONFIG_BPF_SYSCALL is not defined, or
        (2) CONFIG_TRACING is defined but neither CONFIG_UPROBE_EVENTS
            nor CONFIG_KPROBE_EVENTS is defined.
      
      Error message:
        kernel/events/core.o: In function `perf_ioctl':
        core.c:(.text+0x98c4): undefined reference to `bpf_event_query_prog_array'
      
      This patch fixes the error by guarding the real definition under
      CONFIG_BPF_EVENTS and providing a static inline dummy function
      when CONFIG_BPF_EVENTS is not defined.
      It renames the function from bpf_event_query_prog_array to
      perf_event_query_prog_array and moves the definition from linux/bpf.h
      to linux/trace_events.h so the definition is in proximity to
      other prog_array-related functions.
      
      Fixes: f371b304 ("bpf/tracing: allow user space to query prog array on the same tp")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
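      The guard pattern described above, sketched for linux/trace_events.h:

          #ifdef CONFIG_BPF_EVENTS
          int perf_event_query_prog_array(struct perf_event *event,
                                          void __user *info);
          #else
          static inline int
          perf_event_query_prog_array(struct perf_event *event,
                                      void __user *info)
          {
                  return -EOPNOTSUPP;
          }
          #endif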
    • net: phy: phylink: Allow setting a custom link state callback · 1ac63e39
      Authored by Florian Fainelli
      phylink_get_fixed_state() currently consults an optional "link_gpio"
      GPIO descriptor; expand this mechanism to allow specifying a custom
      callback. This is necessary to support out-of-band link notification
      (e.g. from an interrupt driven by an MMIO register).
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
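      A sketch of how a MAC driver might use this (driver names foo_* are
      hypothetical; treat the exact registration signature as an assumption):

          static void foo_fixed_state(struct net_device *ndev,
                                      struct phylink_link_state *state)
          {
                  /* out-of-band link status, e.g. read from MMIO */
                  state->link = foo_mac_link_up(netdev_priv(ndev));
          }

          /* at probe time */
          phylink_fixed_state_cb(priv->phylink, foo_fixed_state);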