1. 08 Dec 2016, 2 commits
    • D
      bpf: fix loading of BPF_MAXINSNS sized programs · ef0915ca
      Authored by Daniel Borkmann
      The general assumption is that a single program can hold up to
      BPF_MAXINSNS, that is, 4096 instructions. That is the case with cBPF,
      and the limit was carried over to eBPF. When recently testing digest, I
      noticed that it's actually not possible to feed 4096 instructions
      via bpf(2).
      
      The check for > BPF_MAXINSNS was added back then to bpf_check() in
      cbd35700 ("bpf: verifier (add ability to receive verification log)").
      However, 09756af4 ("bpf: expand BPF syscall with program load/unload")
      added yet another check that comes before that into bpf_prog_load(),
      but this time bails out already in case of >= BPF_MAXINSNS.
      
      Fix it up and perform the check early in bpf_prog_load(), so we can drop
      the second one in bpf_check(). This makes sense because a 0-instruction
      program is useless as well, and we don't want to waste any resources
      doing work up to the bpf_check() point. The existing bpf(2) man page
      documents E2BIG as the official error for such cases, so just stick
      with it as well.
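
      A minimal sketch of how the early sanity check in bpf_prog_load() might
      look after this change (simplified; insn_cnt is the instruction count
      field of union bpf_attr, surrounding context omitted):

        if (attr->insn_cnt == 0 || attr->insn_cnt > BPF_MAXINSNS)
            return -E2BIG;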
      
      Fixes: 09756af4 ("bpf: expand BPF syscall with program load/unload")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef0915ca
    • M
      clocksource: export the clocks_calc_mult_shift to use by timestamp code · 5304121a
      Authored by Murali Karicheri
      The CPSW CPTS driver is capable of timestamping tx/rx packets and
      needs to know the mult and shift factors for timestamp conversion from
      raw values to nanoseconds (ptp clock). Currently these mult and shift
      factors are calculated manually and provided through DT, which makes it
      very hard to support a large number of platforms, especially if the CPTS
      refclk is not the same for some boards and depends on efuse settings
      (Keystone 2 platforms). Hence, export clocks_calc_mult_shift() to allow
      drivers like CPSW CPTS (and other PTP drivers) to benefit from automatic
      calculation of the mult and shift factors.
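
      A hedged usage sketch of the now-exported helper as a driver might call
      it (refclk_freq and cycles are illustrative variables, not from the patch):

        u32 mult, shift;
        u64 ns;

        /* derive factors that convert raw counter cycles at refclk_freq Hz
         * to nanoseconds, accurate for intervals of up to 4 seconds */
        clocks_calc_mult_shift(&mult, &shift, refclk_freq, NSEC_PER_SEC, 4);

        ns = ((u64)cycles * mult) >> shift;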
      
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5304121a
  2. 06 Dec 2016, 2 commits
    • D
      bpf: add prog_digest and expose it via fdinfo/netlink · 7bd509e3
      Authored by Daniel Borkmann
      When loading a BPF program via bpf(2), calculate the digest over
      the program's instruction stream and store it in struct bpf_prog's
      digest member. This is done at a point in time before any instructions
      are rewritten by the verifier. Any unstable map file descriptor
      number that is part of the imm field will be zeroed for the hash.
      
      fdinfo example output for progs:
      
        # cat /proc/1590/fdinfo/5
        pos:          0
        flags:        02000002
        mnt_id:       11
        prog_type:    1
        prog_jited:   1
        prog_digest:  b27e8b06da22707513aa97363dfb11c7c3675d28
        memlock:      4096
      
      When programs are pinned and retrieved by an ELF loader, the loader
      can check the program's digest through fdinfo and compare it against
      one that was generated over the ELF file's program section to see
      if the program needs to be reloaded. Furthermore, this can also be
      exposed through other means, such as netlink in the case of a tc
      cls/act dump (or xdp in the future), but also through tracepoints or
      other facilities to identify the program. Beyond that, the digest can
      also serve as a base name for the work-in-progress kallsyms support
      for programs. The digest doesn't depend on or select the crypto layer,
      since we need to keep dependencies to a minimum. iproute2 will get
      support for this facility.
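
      A hedged user-space sketch (the helper below is hypothetical, not part
      of the patch) of how a loader could read the digest back from fdinfo
      for comparison:

        #include <stdio.h>
        #include <string.h>

        /* copy the hex digest of prog fd into buf (>= 64 bytes); 0 on success */
        static int read_prog_digest(int fd, char *buf)
        {
            char path[64], line[256];
            FILE *f;

            buf[0] = '\0';
            snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
            f = fopen(path, "r");
            if (!f)
                return -1;
            while (fgets(line, sizeof(line), f)) {
                if (!strncmp(line, "prog_digest:", 12)) {
                    sscanf(line + 12, "%63s", buf);
                    break;
                }
            }
            fclose(f);
            return buf[0] ? 0 : -1;
        }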
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7bd509e3
    • G
      bpf: Preserve const register type on const OR alu ops · 3c839744
      Authored by Gianluca Borello
      Occasionally, clang (e.g. version 3.8.1) translates a sum between two
      constant operands using a BPF_OR instead of a BPF_ADD. The verifier is
      currently not handling this scenario, and the destination register type
      becomes UNKNOWN_VALUE even if it's still storing a constant. As a result,
      the destination register cannot be used as an argument to a helper function
      expecting an ARG_CONST_STACK_*, limiting some use cases.
      
      Modify the verifier to handle this case, and add a few tests to make sure
      all combinations are supported, and stack boundaries are still verified
      even with BPF_OR.
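
      A hedged sketch of the idea (not the exact hunk): when the destination
      register already holds a known constant, a BPF_OR with another constant
      can keep it in the constant state instead of demoting it to UNKNOWN_VALUE:

        /* inside the verifier's ALU handling, heavily simplified */
        if (BPF_OP(insn->code) == BPF_OR &&
            BPF_SRC(insn->code) == BPF_K &&
            dst_reg->type == CONST_IMM) {
            /* constant | constant is still a constant */
            dst_reg->imm |= insn->imm;
        } else {
            mark_reg_unknown_value(regs, insn->dst_reg);
        }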
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3c839744
  3. 03 Dec 2016, 2 commits
  4. 02 Dec 2016, 2 commits
    • T
      bpf: BPF for lightweight tunnel infrastructure · 3a0af8fd
      Authored by Thomas Graf
      Registers new BPF program types which correspond to the LWT hooks:
        - BPF_PROG_TYPE_LWT_IN   => dst_input()
        - BPF_PROG_TYPE_LWT_OUT  => dst_output()
        - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
      
      The separate program types are required to differentiate between the
      capabilities each LWT hook allows:
      
       * Programs attached to dst_input() or dst_output() are restricted and
         may only read the data of an skb. This prevents modification and
         possible invalidation of already validated packet headers on receive,
         and the construction of illegal headers while the IP headers are
         still being assembled.
      
       * Programs attached to lwtunnel_xmit() are allowed to modify packet
         content as well as to prepend an L2 header via the newly introduced
         helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
         invoked after the IP header has been assembled completely.
      
      All BPF programs receive an skb with L3 headers attached and may return
      one of the following error codes:
      
       BPF_OK - Continue routing as per nexthop
       BPF_DROP - Drop skb and return EPERM
       BPF_REDIRECT - Redirect skb to device as per redirect() helper.
                      (Only valid in lwtunnel_xmit() context)
      
      The return codes are binary compatible with their TC_ACT_
      relatives to ease compatibility.
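
      A minimal sketch of a program for the dst_input() hook (the section name
      follows the iproute2 convention of the time and is an assumption here):

        #include <linux/bpf.h>

        /* loaded as BPF_PROG_TYPE_LWT_IN: read-only access, returns a verdict */
        __attribute__((section("lwt_in"), used))
        int lwt_len_cap(struct __sk_buff *skb)
        {
            /* drop oversized packets, continue routing for the rest */
            return skb->len > 1500 ? BPF_DROP : BPF_OK;
        }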
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3a0af8fd
    • W
      audit: remove useless synchronize_net() · 60602982
      Authored by WANG Cong
      The netlink kernel socket is protected by a refcount, not RCU, and
      its rcv path is not protected by RCU either. So the synchronize_net()
      is just pointless.
      
      Cc: Richard Guy Briggs <rgb@redhat.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      60602982
  5. 01 Dec 2016, 1 commit
  6. 30 Nov 2016, 2 commits
  7. 28 Nov 2016, 3 commits
  8. 26 Nov 2016, 2 commits
    • D
      bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands · f4324551
      Authored by Daniel Mack
      Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
      BPF_PROG_DETACH which allow attaching and detaching eBPF programs
      to a target.
      
      On the API level, the target could be anything that has an fd in
      userspace, hence the field in union bpf_attr is called 'target_fd'.
      
      When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
      expected to be a valid file descriptor of a cgroup v2 directory which
      has the bpf controller enabled. These are the only use-cases
      implemented by this patch at this point, but more can be added.
      
      If a program of the given type already exists in the given cgroup,
      the program is swapped atomically, so userspace does not have to drop
      an existing program before installing a new one, which would
      otherwise leave a gap in which no program is attached.
      
      For more information on the propagation logic to subcgroups, please
      refer to the bpf cgroup controller implementation.
      
      The API is guarded by CAP_NET_ADMIN.
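
      A hedged user-space sketch of attaching a loaded program to a cgroup v2
      directory (the attach type name follows the final UAPI enum; error
      handling omitted):

        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        static int cgroup_attach(int cgroup_fd, int prog_fd)
        {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.target_fd     = cgroup_fd;  /* fd of the cgroup directory */
            attr.attach_bpf_fd = prog_fd;    /* fd of the loaded program   */
            attr.attach_type   = BPF_CGROUP_INET_INGRESS;

            return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
        }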
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f4324551
    • D
      cgroup: add support for eBPF programs · 30070984
      Authored by Daniel Mack
      This patch adds two sets of eBPF program pointers to struct cgroup:
      one for programs that are directly pinned to a cgroup, and one for
      programs that are effective for it.
      
      To illustrate the logic behind that, assume the following example
      cgroup hierarchy.
      
        A - B - C
              \ D - E
      
      If only B has a program attached, it will be effective for B, C, D
      and E. If D then attaches a program itself, that will be effective for
      both D and E, and the program in B will only affect B and C. Only one
      program of a given type is effective for a cgroup.
      
      Attaching and detaching programs will be done through the bpf(2)
      syscall. For now, ingress and egress inet socket filtering are the
      only supported use-cases.
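
      A hedged sketch of the propagation rule described above (field and
      helper names are illustrative, not the patch's actual identifiers):

        /* the effective program of a given type is the one attached to the
         * closest ancestor (including the cgroup itself) that has one */
        static struct bpf_prog *effective_prog(struct cgroup *cgrp,
                                               enum bpf_attach_type type)
        {
            for (; cgrp; cgrp = cgroup_parent(cgrp))
                if (cgrp->bpf.prog[type])
                    return cgrp->bpf.prog[type];
            return NULL;
        }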
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      30070984
  9. 22 Nov 2016, 3 commits
  10. 21 Nov 2016, 1 commit
  11. 19 Nov 2016, 1 commit
  12. 18 Nov 2016, 1 commit
    • A
      netns: make struct pernet_operations::id unsigned int · c7d03a00
      Authored by Alexey Dobriyan
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into a zero-based array and is thus
      an unsigned entity. Using a negative value is an out-of-bound
      access by definition.
      
      2)
      On x86_64, unsigned 32-bit data that is mixed with pointers via
      array indexing, or via offsets added to or subtracted from pointers,
      is preferred to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is a 3-byte instruction which isn't necessary if the variable is
      unsigned, because x86_64 zero-extends 32-bit results by default.
      
      Now, there is the net_generic() function which, you guessed it, uses
      "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
      
      The patch snipes ~1730 bytes off an allyesconfig kernel (without all the
      junk messing with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately some functions actually grow bigger.
      This is a seemingly random artefact of code generation, with the
      register allocator being used differently. gcc decides that some
      variable needs to live in the new r8+ registers and every access now
      requires a REX prefix. Or it is shifted into r12, so the [r12+0]
      addressing mode has to be used, which is longer than [r8].
      
      However, overall balance is in negative direction:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7d03a00
  13. 17 Nov 2016, 2 commits
  14. 16 Nov 2016, 5 commits
    • M
      bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH · 8f844938
      Authored by Martin KaFai Lau
      Provide an LRU version of the existing BPF_MAP_TYPE_PERCPU_HASH.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8f844938
    • M
      bpf: Add BPF_MAP_TYPE_LRU_HASH · 29ba732a
      Authored by Martin KaFai Lau
      Provide an LRU version of the existing BPF_MAP_TYPE_HASH.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29ba732a
    • M
      bpf: Refactor codes handling percpu map · fd91de7b
      Authored by Martin KaFai Lau
      Refactor the code that populates the value of a htab_elem in a
      BPF_MAP_TYPE_PERCPU_HASH typed bpf_map.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd91de7b
    • M
      bpf: Add percpu LRU list · 961578b6
      Authored by Martin KaFai Lau
      Instead of having a common LRU list, this patch allows a
      percpu LRU list which can be selected by specifying a map
      attribute.  The map attribute will be added in a later
      patch.
      
      While the common use case for an LRU is #reads >> #updates,
      a percpu LRU list allows a bpf prog to absorb an unusual number of
      updates in pathological cases (e.g. an external traffic facing machine
      which could be under attack).
      
      Each percpu LRU is isolated from the others.  The LRU nodes (including
      free nodes) cannot be moved across different LRU lists.
      
      Here is the update performance comparison between the common LRU list
      and the percpu LRU list (the test code is in the last patch):
      
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
       1 cpus: 2934082 updates
       4 cpus: 7391434 updates
       8 cpus: 6500576 updates
      
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 32 $i | awk '{r += $3}END{print r " updates"}'; done
        1 cpus: 2896553 updates
        4 cpus: 9766395 updates
        8 cpus: 17460553 updates
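
      As a rough illustration, selecting the percpu LRU variant from user
      space might look as follows (assuming the map-flags attribute added
      later in the series, BPF_F_NO_COMMON_LRU in the final UAPI, and the
      usual <linux/bpf.h> and <sys/syscall.h> includes):

        union bpf_attr attr = {
            .map_type    = BPF_MAP_TYPE_LRU_HASH,
            .key_size    = sizeof(__u32),
            .value_size  = sizeof(__u64),
            .max_entries = 4096,
            .map_flags   = BPF_F_NO_COMMON_LRU,  /* one LRU list per CPU */
        };
        int map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));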
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      961578b6
    • M
      bpf: LRU List · 3a08c2fd
      Authored by Martin KaFai Lau
      Introduce bpf_lru_list, which will provide LRU capability to
      the bpf_htab in a later patch.
      
      * General Thoughts:
      1. Target use case.  Read is more often than update.
         (i.e. bpf_lookup_elem() is more often than bpf_update_elem()).
         If bpf_prog does a bpf_lookup_elem() first and then an in-place
         update, it still counts as a read operation as far as the LRU list
         is concerned.
      2. It may be useful to think of it as an LRU cache.
      3. Optimize the read case
         3.1 No lock in read case
         3.2 The LRU maintenance is only done during bpf_update_elem()
      4. If there is a percpu LRU list, it will lose the system-wide LRU
         property.  A completely isolated percpu LRU list has the best
         performance, but the memory utilization is not ideal considering
         that the workload may be imbalanced.
      5. Hence, this patch starts the LRU implementation with a global LRU
         list, with batched operations before accessing the global LRU list.
         As an LRU cache where #read >> #update/#insert operations, it will
         work well.
      6. There is a local list (for each cpu) which is named
         'struct bpf_lru_locallist'.  This local list is not used to sort
         the LRU property.  Instead, the local list is to batch enough
         operations before acquiring the lock of the global LRU list.  More
         details on this later.
      7. A later patch allows a percpu LRU list, selected by specifying a
         map attribute, for scalability reasons and for use cases that need
         to prepare for the worst (and pathological) case like a DoS attack.
         The percpu LRU lists are completely isolated from each other and the
         LRU nodes (including free nodes) cannot be moved across lists.  The
         following description is for the global LRU list but mostly applies
         to the percpu LRU list as well.
      
      * Global LRU List:
      1. It has three sub-lists: active-list, inactive-list and free-list.
      2. The two list idea, active and inactive, is borrowed from the
         page cache.
      3. All nodes are pre-allocated and all sit at the free-list (of the
         global LRU list) at the beginning.  The pre-allocation reasoning
         is similar to the existing BPF_MAP_TYPE_HASH.  However,
         opting out of prealloc (BPF_F_NO_PREALLOC) is not supported in
         the LRU map.
      
      * Active/Inactive List (of the global LRU list):
      1. The active list, as its name says, maintains the active set of
         nodes.  We can think of it as the working set or the more frequently
         accessed nodes.  The access frequency is approximated by a ref-bit.
         The ref-bit is set during bpf_lookup_elem().
      2. The inactive list, as its name says, maintains a less
         active set of nodes.  They are the candidates to be removed
         from the bpf_htab when we are running out of free nodes.
      3. The ordering of these two lists acts as a rough clock.
         The tail of the inactive list holds the older nodes, which
         should be released first if the bpf_htab needs a free element.
      
      * Rotating the Active/Inactive List (of the global LRU list):
      1. It is the basic operation to maintain the LRU property of
         the global list.
      2. The active list is only rotated when the inactive list is running
         low.  This idea is similar to the current page cache.
         Inactive running low is currently defined as
         "# of inactive < # of active".
      3. The active list rotation always starts from the tail.  It moves
         nodes without the ref-bit set to the head of the inactive list.
         It moves nodes with the ref-bit set back to the head of the active
         list and then clears their ref-bit.
      4. The inactive rotation is pretty simple.
         It walks the inactive list and moves nodes back to the head of the
         active list if their ref-bit is set; the ref-bit is cleared after
         moving to the active list.
         If a node does not have the ref-bit set, it is left where it is,
         because it is already in the inactive list.
      
      * Shrinking the Inactive List (of the global LRU list):
      1. Shrinking is the operation to get free nodes when the bpf_htab is
         full.
      2. It usually only shrinks the inactive list to get free nodes.
      3. During shrinking, it will walk the inactive list from the tail and
         delete the nodes without the ref-bit set from the bpf_htab.
      4. If no free node is found after step (3), it will forcefully take
         one node from the tail of the inactive or active list.  "Forcefully"
         means that it ignores the ref-bit.
      
      * Local List:
      1. Each CPU has a 'struct bpf_lru_locallist'.  The purpose is to
         batch enough operations before acquiring the lock of the
         global LRU.
      2. A local list has two sub-lists, free-list and pending-list.
      3. During bpf_update_elem(), it will try to get a node from the
         free-list of the current CPU's local list.
      4. If the local free-list is empty, it will acquire nodes from the
         global LRU list.  The global LRU list can satisfy the request
         either from its global free-list or by shrinking the global
         inactive list.  Since the global LRU list lock has been acquired,
         it will try to move up to LOCAL_FREE_TARGET elements
         into the local free-list.
      5. When a new element is added to the bpf_htab, it will
         first sit on the pending-list (of the local list).
         The pending-list is flushed to the global LRU list the next
         time free nodes need to be acquired from the global list.
      
      * Lock Consideration:
      The LRU list has a lock (lru_lock).  Each bucket of the htab has a
      lock (buck_lock).  If both locks need to be acquired together,
      the lock order is always lru_lock -> buck_lock, and this only
      happens in the bpf_lru_list.c logic.
      
      In hashtab.c, the two locks are never acquired together (i.e. one
      lock is always released before the other is acquired).
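
      A hedged sketch of the data layout described above (names and fields
      are illustrative approximations, not a verbatim copy of bpf_lru_list.h):

        /* global LRU: three sub-lists protected by one lru_lock */
        struct bpf_lru_list {
            struct list_head active;    /* frequently accessed nodes   */
            struct list_head inactive;  /* eviction candidates         */
            struct list_head free;      /* pre-allocated, unused nodes */
            raw_spinlock_t   lock;      /* lru_lock, taken before buck_lock */
        };

        /* per-cpu staging area to batch work before taking lru_lock */
        struct bpf_lru_locallist {
            struct list_head free;      /* up to LOCAL_FREE_TARGET nodes     */
            struct list_head pending;   /* new elems, flushed on next refill */
            raw_spinlock_t   lock;
        };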
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3a08c2fd
  15. 15 Nov 2016, 5 commits
    • D
      perf/core: Do not set cpuctx->cgrp for unscheduled cgroups · 864c2357
      Authored by David Carrillo-Cisneros
      Commit:
      
        db4a8356 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
      
      failed to verify that event->cgrp is actually the scheduled cgroup
      in a CPU before setting cpuctx->cgrp. This patch fixes that.
      
      Now that there is a different path for scheduled and unscheduled
      cgroups, add a warning to catch when cpuctx->cgrp is still set after
      the last cgroup event has been unscheduled.
      
      To verify the bug:
      
        # Create 2 cgroups.
        mkdir /dev/cgroups/devices/g1
        mkdir /dev/cgroups/devices/g2
      
        # launch a task, bind it to a cpu and move it to g1
        CPU=2
        while :; do : ; done &
        P=$!
      
        taskset -pc $CPU $P
        echo $P > /dev/cgroups/devices/g1/tasks
      
        # monitor g2 (it runs no tasks) and observe output
        perf stat -e cycles -I 1000 -C $CPU -G g2
      
        #           time             counts unit events
           1.000091408          7,579,527      cycles                    g2
           2.000350111      <not counted>      cycles                    g2
           3.000589181      <not counted>      cycles                    g2
           4.000771428      <not counted>      cycles                    g2
      
        # note the first line, which shows that a task ran in g2 despite
        # g2 having no tasks. This is because cpuctx->cgrp was wrongly
        # set when the context of the new event was installed.
        # After applying the fix we obtain the correct output:
      
        perf stat -e cycles -I 1000 -C $CPU -G g2
        #           time             counts unit events
           1.000119615      <not counted>      cycles                    g2
           2.000389430      <not counted>      cycles                    g2
           3.000590962      <not counted>      cycles                    g2
      Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Link: http://lkml.kernel.org/r/1478026378-86083-1-git-send-email-davidcc@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      864c2357
    • S
      ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records · 546fece4
      Authored by Steven Rostedt (Red Hat)
      When a module is first loaded and its function ip records are added to the
      ftrace list of functions to modify, they are set to DISABLED, as their text
      is still in a read-only state. When the module is fully loaded and can be
      updated, the flag is cleared, and if there are any functions that should be
      tracing them, they are updated at that moment.
      
      But there are several locations that do record accounting and should ignore
      records that are marked as disabled, or they can cause issues.
      
      Alexei already fixed one location, but others need to be addressed.
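
      The recurring pattern in those locations is a simple guard while
      iterating the dyn_ftrace records; a hedged sketch:

        struct dyn_ftrace *rec;

        do_for_each_ftrace_rec(pg, rec) {
            /* module text is not writable yet; skip accounting for it */
            if (rec->flags & FTRACE_FL_DISABLED)
                continue;
            /* ... normal record accounting ... */
        } while_for_each_ftrace_rec();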
      
      Cc: stable@vger.kernel.org
      Fixes: b7ffffbb "ftrace: Add infrastructure for delayed enabling of module functions"
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      546fece4
    • A
      ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace records · 977c1f9c
      Authored by Alexei Starovoitov
      ftrace_shutdown() checks the sanity of ftrace records,
      and if dyn_ftrace->flags is not zero, it will warn.
      It can happen that 'flags' is set to FTRACE_FL_DISABLED at this point,
      since some module was loaded, but before ftrace_module_enable()
      cleared the flags for that module.
      
      In other words the module.c is doing:
      ftrace_module_init(mod); // calls ftrace_update_code() that sets flags=FTRACE_FL_DISABLED
      ... // here ftrace_shutdown() is called that warns, since
      err = prepare_coming_module(mod); // didn't have a chance to clear FTRACE_FL_DISABLED
      
      Fix it by ignoring disabled records.
      It's similar to what __ftrace_hash_rec_update() is already doing.
      
      Link: http://lkml.kernel.org/r/1478560460-3818619-1-git-send-email-ast@fb.com
      
      Cc: stable@vger.kernel.org
      Fixes: b7ffffbb "ftrace: Add infrastructure for delayed enabling of module functions"
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      977c1f9c
    • M
      bpf: Use u64_to_user_ptr() · 535e7b4b
      Authored by Mickaël Salaün
      Replace the custom u64_to_ptr() function with the u64_to_user_ptr()
      macro.
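
      For illustration, a hedged usage sketch of the generic macro as bpf
      syscall code might use it (field names from union bpf_attr; context
      simplified):

        void __user *ukey = u64_to_user_ptr(attr->key);

        if (copy_from_user(key, ukey, map->key_size))
            return -EFAULT;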
      Signed-off-by: Mickaël Salaün <mic@digikod.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      535e7b4b
    • L
      Revert "printk: make reading the kernel log flush pending lines" · f5c9f9c7
      Authored by Linus Torvalds
      This reverts commit bfd8d3f2.
      
      It turns out that this flushes things much too aggressively, and causes
      lines to break up when the system logger races with new continuation
      lines being printed.
      
      There's a pending patch to make printk() flushing much more
      straightforward, but it's too invasive for 4.9, so in the meantime let's
      just not make the system message logging flush continuation lines.
      They'll be flushed by the final newline anyway.
      Suggested-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5c9f9c7
  16. 13 Nov 2016, 1 commit
  17. 12 Nov 2016, 1 commit
    • H
      Revert "console: don't prefer first registered if DT specifies stdout-path" · c6c7d83b
      Authored by Hans de Goede
      This reverts commit 05fd007e ("console: don't prefer first
      registered if DT specifies stdout-path").
      
      The reverted commit changes existing behavior on which many ARM boards
      rely.  Many ARM small-board computers, e.g. the Raspberry Pi, have
      both a video output and a serial console.  Depending on whether the user
      is using the device as a more regular computer or as a headless device,
      we need to have the console on either one or the other.
      
      Many users rely on the kernel behavior of the console being present on
      both outputs.  Before the reverted commit, the console setup with no
      console= kernel arguments on an ARM board which sets stdout-path in DT
      would look like this:
      
        [root@localhost ~]# cat /proc/consoles
        ttyS0                -W- (EC p a)    4:64
        tty0                 -WU (E  p  )    4:1
      
      Whereas after the reverted commit, it looks like this:
      
        [root@localhost ~]# cat /proc/consoles
        ttyS0                -W- (EC p a)    4:64
      
      This commit reverts commit 05fd007e ("console: don't prefer first
      registered if DT specifies stdout-path") restoring the original
      behavior.
      
      Fixes: 05fd007e ("console: don't prefer first registered if DT specifies stdout-path")
      Link: http://lkml.kernel.org/r/20161104121135.4780-2-hdegoede@redhat.com
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Thorsten Leemhuis <regressions@leemhuis.info>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6c7d83b
  18. 10 Nov 2016, 1 commit
  19. 08 Nov 2016, 3 commits
    • T
      genirq: Use irq type from irqdata instead of irqdesc · 7ee7e87d
      Authored by Thomas Gleixner
      The type flags in the irq descriptor are there for historical reasons and
      only updated via irq_modify_status() or irq_set_type(). Both functions also
      update the type flags in irqdata. __setup_irq() is the only left over user
      of the type flags in the irq descriptor.
      
      If __setup_irq() is called with empty irq type flags, then the type flags
      are retrieved from irqdata. If an interrupt is shared, then the type flags
      are compared with the type flags stored in the irq descriptor. 
      
      On x86 the ioapic does not have an irq_set_type() callback because the type
      is defined in the BIOS tables and cannot be changed. The type is stored in
      irqdata at setup time without updating the type data in the irq
      descriptor. As a result the comparison described above fails.
      
      There is no point in updating the irq descriptor flags because the only
      relevant storage is irqdata. Use the type flags from irqdata for both
      retrieval and comparison in __setup_irq() instead.
      
      Aside from that, the printout in the case of non-matching type flags has
      the old and new type flag arguments flipped. Fix that as well.
      
      For correctness sake the flags stored in the irq descriptor should be
      removed, but this is beyond the scope of this bugfix and will be done in a
      later patch.
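
      A hedged sketch of the retrieval and comparison in __setup_irq() using
      irqdata as the single source of truth (simplified from the actual hunk):

        unsigned int oldtype;

        /* take the trigger type from irqdata, not from the irq descriptor */
        oldtype = irqd_get_trigger_type(&desc->irq_data);

        if (!((old->flags & new->flags) & IRQF_SHARED) ||
            (oldtype != (new->flags & IRQF_TRIGGER_MASK)))
            goto mismatch;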
      
      Fixes: 4b357dae ("genirq: Look-up trigger type if not specified by caller")
      Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Jon Hunter <jonathanh@nvidia.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611072020360.3501@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      7ee7e87d
    • D
      bpf: fix map not being uncharged during map creation failure · 20b2b24f
      Authored by Daniel Borkmann
      In map_create(), we first find and create the map, then once that
      has succeeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
      a new anon fd through anon_inode_getfd(). The problem is that once the
      latter fails, e.g. due to the RLIMIT_NOFILE limit, we only destruct
      the map via map->ops->map_free(), but without uncharging the previously
      locked memory first. That means that the user_struct allocation is
      leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
      Make the label names in the fix consistent with bpf_prog_load().
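
      A hedged sketch of the corrected error path in map_create() (labels are
      illustrative, following the note about matching bpf_prog_load()):

        err = bpf_map_charge_memlock(map);
        if (err)
            goto free_map_nouncharge;

        err = anon_inode_getfd("bpf-map", &bpf_map_fops, map, O_RDWR | O_CLOEXEC);
        if (err < 0)
            goto free_map;

        return err;

      free_map:
        bpf_map_uncharge_memlock(map);  /* undo the RLIMIT_MEMLOCK charge */
      free_map_nouncharge:
        map->ops->map_free(map);
        return err;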
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      20b2b24f
    • D
      bpf: fix htab map destruction when extra reserve is in use · 483bed2b
      Authored by Daniel Borkmann
      Commit a6ed3ea6 ("bpf: restore behavior of bpf_map_update_elem")
      added an extra per-cpu reserve to the hash table map to restore the old
      behaviour from pre-prealloc times. When non-prealloc is in use for a
      map, the issue is that once a hash table extra element has been
      linked into the hash table, and the hash table is then destroyed due
      to the refcount dropping to zero, htab_map_free() -> delete_all_elements()
      will walk the whole hash table and drop all elements via htab_elem_free().
      The problem is that the element from the extra reserve is first fed
      to the wrong backend allocator and eventually freed twice.
      
      Fixes: a6ed3ea6 ("bpf: restore behavior of bpf_map_update_elem")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      483bed2b