1. 10 1月, 2018 2 次提交
    • A
      bpf: introduce BPF_JIT_ALWAYS_ON config · 290af866
      Alexei Starovoitov 提交于
      The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
      
      A quote from goolge project zero blog:
      "At this point, it would normally be necessary to locate gadgets in
      the host kernel code that can be used to actually leak data by reading
      from an attacker-controlled location, shifting and masking the result
      appropriately and then using the result of that as offset to an
      attacker-controlled address for a load. But piecing gadgets together
      and figuring out which ones work in a speculation context seems annoying.
      So instead, we decided to use the eBPF interpreter, which is built into
      the host kernel - while there is no legitimate way to invoke it from inside
      a VM, the presence of the code in the host kernel's text section is sufficient
      to make it usable for the attack, just like with ordinary ROP gadgets."
      
      To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
      option that removes interpreter from the kernel in favor of JIT-only mode.
      So far eBPF JIT is supported by:
      x64, arm64, arm32, sparc64, s390, powerpc64, mips64
      
      The start of JITed program is randomized and code page is marked as read-only.
      In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
      
      v2->v3:
      - move __bpf_prog_ret0 under ifdef (Daniel)
      
      v1->v2:
      - fix init order, test_bpf and cBPF (Daniel's feedback)
      - fix offloaded bpf (Jakub's feedback)
      - add 'return 0' dummy in case something can invoke prog->bpf_func
      - retarget bpf tree. For bpf-next the patch would need one extra hunk.
        It will be sent when the trees are merged back to net-next
      
      Considered doing:
        int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
      but it seems better to land the patch as-is and in bpf-next remove
      bpf_jit_enable global variable from all JITs, consolidate in one place
      and remove this jit_init() function.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      290af866
    • D
      bpf: avoid false sharing of map refcount with max_entries · be95a845
      Daniel Borkmann 提交于
      In addition to commit b2157399 ("bpf: prevent out-of-bounds
      speculation") also change the layout of struct bpf_map such that
      false sharing of fast-path members like max_entries is avoided
      when the maps reference counter is altered. Therefore enforce
      them to be placed into separate cachelines.
      
      pahole dump after change:
      
        struct bpf_map {
              const struct bpf_map_ops  * ops;                 /*     0     8 */
              struct bpf_map *           inner_map_meta;       /*     8     8 */
              void *                     security;             /*    16     8 */
              enum bpf_map_type          map_type;             /*    24     4 */
              u32                        key_size;             /*    28     4 */
              u32                        value_size;           /*    32     4 */
              u32                        max_entries;          /*    36     4 */
              u32                        map_flags;            /*    40     4 */
              u32                        pages;                /*    44     4 */
              u32                        id;                   /*    48     4 */
              int                        numa_node;            /*    52     4 */
              bool                       unpriv_array;         /*    56     1 */
      
              /* XXX 7 bytes hole, try to pack */
      
              /* --- cacheline 1 boundary (64 bytes) --- */
              struct user_struct *       user;                 /*    64     8 */
              atomic_t                   refcnt;               /*    72     4 */
              atomic_t                   usercnt;              /*    76     4 */
              struct work_struct         work;                 /*    80    32 */
              char                       name[16];             /*   112    16 */
              /* --- cacheline 2 boundary (128 bytes) --- */
      
              /* size: 128, cachelines: 2, members: 17 */
              /* sum members: 121, holes: 1, sum holes: 7 */
        };
      
      Now all entries in the first cacheline are read only throughout
      the life time of the map, set up once during map creation. Overall
      struct size and number of cachelines doesn't change from the
      reordering. struct bpf_map is usually first member and embedded
      in map structs in specific map implementations, so also avoid those
      members to sit at the end where it could potentially share the
      cacheline with first map values e.g. in the array since remote
      CPUs could trigger map updates just as well for those (easily
      dirtying members like max_entries intentionally as well) while
      having subsequent values in cache.
      
      Quoting from Google's Project Zero blog [1]:
      
        Additionally, at least on the Intel machine on which this was
        tested, bouncing modified cache lines between cores is slow,
        apparently because the MESI protocol is used for cache coherence
        [8]. Changing the reference counter of an eBPF array on one
        physical CPU core causes the cache line containing the reference
        counter to be bounced over to that CPU core, making reads of the
        reference counter on all other CPU cores slow until the changed
        reference counter has been written back to memory. Because the
        length and the reference counter of an eBPF array are stored in
        the same cache line, this also means that changing the reference
        counter on one physical CPU core causes reads of the eBPF array's
        length to be slow on other physical CPU cores (intentional false
        sharing).
      
      While this doesn't 'control' the out-of-bounds speculation through
      masking the index as in commit b2157399, triggering a manipulation
      of the map's reference counter is really trivial, so lets not allow
      to easily affect max_entries from it.
      
      Splitting to separate cachelines also generally makes sense from
      a performance perspective anyway in that fast-path won't have a
      cache miss if the map gets pinned, reused in other progs, etc out
      of control path, thus also avoids unintentional false sharing.
      
        [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.htmlSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      be95a845
  2. 09 1月, 2018 1 次提交
    • A
      bpf: prevent out-of-bounds speculation · b2157399
      Alexei Starovoitov 提交于
      Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
      memory accesses under a bounds check may be speculated even if the
      bounds check fails, providing a primitive for building a side channel.
      
      To avoid leaking kernel data round up array-based maps and mask the index
      after bounds check, so speculated load with out of bounds index will load
      either valid value from the array or zero from the padded area.
      
      Unconditionally mask index for all array types even when max_entries
      are not rounded to power of 2 for root user.
      When map is created by unpriv user generate a sequence of bpf insns
      that includes AND operation to make sure that JITed code includes
      the same 'index & index_mask' operation.
      
      If prog_array map is created by unpriv user replace
        bpf_tail_call(ctx, map, index);
      with
        if (index >= max_entries) {
          index &= map->index_mask;
          bpf_tail_call(ctx, map, index);
        }
      (along with roundup to power 2) to prevent out-of-bounds speculation.
      There is secondary redundant 'if (index >= max_entries)' in the interpreter
      and in all JITs, but they can be optimized later if necessary.
      
      Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
      cannot be used by unpriv, so no changes there.
      
      That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
      all architectures with and without JIT.
      
      v2->v3:
      Daniel noticed that attack potentially can be crafted via syscall commands
      without loading the program, so add masking to those paths as well.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      b2157399
  3. 07 1月, 2018 2 次提交
  4. 06 1月, 2018 3 次提交
  5. 05 1月, 2018 8 次提交
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · f737be8d
      David S. Miller 提交于
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for your net tree,
      they are:
      
      1) Fix chain filtering when dumping rules via nf_tables_dump_rules().
      
      2) Fix accidental change in NF_CT_STATE_UNTRACKED_BIT through uapi,
         introduced when removing the untracked conntrack object, from
         Florian Westphal.
      
      3) Fix potential nul-dereference when releasing dump filter in
         nf_tables_dump_obj_done(), patch from Hangbin Liu.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f737be8d
    • H
      uapi/if_ether.h: prevent redefinition of struct ethhdr · 6926e041
      Hauke Mehrtens 提交于
      Musl provides its own ethhdr struct definition. Add a guard to prevent
      its definition of the appropriate musl header has already been included.
      
      glibc does not implement this header, but when glibc will implement this
      they can just define __UAPI_DEF_ETHHDR 0 to make it work with the
      kernel.
      Signed-off-by: NHauke Mehrtens <hauke@hauke-m.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6926e041
    • W
      ipv6: fix general protection fault in fib6_add() · 7bbfe00e
      Wei Wang 提交于
      In fib6_add(), pn could be NULL if fib6_add_1() failed to return a fib6
      node. Checking pn != fn before accessing pn->leaf makes sure pn is not
      NULL.
      This fixes the following GPF reported by syzkaller:
      general protection fault: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 3201 Comm: syzkaller001778 Not tainted 4.15.0-rc5+ #151
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:fib6_add+0x736/0x15a0 net/ipv6/ip6_fib.c:1244
      RSP: 0018:ffff8801c7626a70 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000000000020 RCX: ffffffff84794465
      RDX: 0000000000000004 RSI: ffff8801d38935f0 RDI: 0000000000000282
      RBP: ffff8801c7626da0 R08: 1ffff10038ec4c35 R09: 0000000000000000
      R10: ffff8801c7626c68 R11: 0000000000000000 R12: 00000000fffffffe
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000009
      FS:  0000000000000000(0000) GS:ffff8801db200000(0063) knlGS:0000000009b70840
      CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
      CR2: 0000000020be1000 CR3: 00000001d585a006 CR4: 00000000001606f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1006
       ip6_route_multipath_add+0xd14/0x16c0 net/ipv6/route.c:3833
       inet6_rtm_newroute+0xdc/0x160 net/ipv6/route.c:3957
       rtnetlink_rcv_msg+0x733/0x1020 net/core/rtnetlink.c:4411
       netlink_rcv_skb+0x21e/0x460 net/netlink/af_netlink.c:2408
       rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4423
       netlink_unicast_kernel net/netlink/af_netlink.c:1275 [inline]
       netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1301
       netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1864
       sock_sendmsg_nosec net/socket.c:636 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:646
       sock_write_iter+0x31a/0x5d0 net/socket.c:915
       call_write_iter include/linux/fs.h:1772 [inline]
       do_iter_readv_writev+0x525/0x7f0 fs/read_write.c:653
       do_iter_write+0x154/0x540 fs/read_write.c:932
       compat_writev+0x225/0x420 fs/read_write.c:1246
       do_compat_writev+0x115/0x220 fs/read_write.c:1267
       C_SYSC_writev fs/read_write.c:1278 [inline]
       compat_SyS_writev+0x26/0x30 fs/read_write.c:1274
       do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline]
       do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389
       entry_SYSENTER_compat+0x54/0x63 arch/x86/entry/entry_64_compat.S:125
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Fixes: 66f5d6ce ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7bbfe00e
    • M
      RDS: null pointer dereference in rds_atomic_free_op · 7d11f77f
      Mohamed Ghannam 提交于
      set rm->atomic.op_active to 0 when rds_pin_pages() fails
      or the user supplied address is invalid,
      this prevents a NULL pointer usage in rds_atomic_free_op()
      Signed-off-by: NMohamed Ghannam <simo.ghannam@gmail.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d11f77f
    • S
      sh_eth: fix TSU resource handling · dfe8266b
      Sergei Shtylyov 提交于
      When switching  the driver to the managed device API,  I managed to break
      the  case of a  dual Ether devices sharing a single TSU: the 2nd Ether port
      wouldn't probe. Iwamatsu-san has tried to fix this but his patch was buggy
      and he then dropped the ball...
      
      The solution is to  limit calling devm_request_mem_region() to the first
      of  the two  ports  sharing the same TSU, so devm_ioremap_resource() can't
      be used anymore for the TSU resource...
      
      Fixes: d5e07e69 ("sh_eth: use managed device API")
      Reported-by: NNobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com>
      Signed-off-by: NSergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dfe8266b
    • J
      net: stmmac: enable EEE in MII, GMII or RGMII only · 879626e3
      Jerome Brunet 提交于
      Note in the databook - Section 4.4 - EEE :
      " The EEE feature is not supported when the MAC is configured to use the
      TBI, RTBI, SMII, RMII or SGMII single PHY interface. Even if the MAC
      supports multiple PHY interfaces, you should activate the EEE mode only
      when the MAC is operating with GMII, MII, or RGMII interface."
      
      Applying this restriction solves a stability issue observed on Amlogic
      gxl platforms operating with RMII interface and the internal PHY.
      
      Fixes: 83bf79b6 ("stmmac: disable at run-time the EEE if not supported")
      Signed-off-by: NJerome Brunet <jbrunet@baylibre.com>
      Tested-by: NArnaud Patard <arnaud.patard@rtp-net.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      879626e3
    • A
      rtnetlink: give a user socket to get_target_net() · f428fe4a
      Andrei Vagin 提交于
      This function is used from two places: rtnl_dump_ifinfo and
      rtnl_getlink. In rtnl_getlink(), we give a request skb into
      get_target_net(), but in rtnl_dump_ifinfo, we give a response skb
      into get_target_net().
      The problem here is that NETLINK_CB() isn't initialized for the response
      skb. In both cases we can get a user socket and give it instead of skb
      into get_target_net().
      
      This bug was found by syzkaller with this call-trace:
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 1 PID: 3149 Comm: syzkaller140561 Not tainted 4.15.0-rc4-mm1+ #47
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:__netlink_ns_capable+0x8b/0x120 net/netlink/af_netlink.c:868
      RSP: 0018:ffff8801c880f348 EFLAGS: 00010206
      RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8443f900
      RDX: 000000000000007b RSI: ffffffff86510f40 RDI: 00000000000003d8
      RBP: ffff8801c880f360 R08: 0000000000000000 R09: 1ffff10039101e4f
      R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff86510f40
      R13: 000000000000000c R14: 0000000000000004 R15: 0000000000000011
      FS:  0000000001a1a880(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020151000 CR3: 00000001c9511005 CR4: 00000000001606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
        netlink_ns_capable+0x26/0x30 net/netlink/af_netlink.c:886
        get_target_net+0x9d/0x120 net/core/rtnetlink.c:1765
        rtnl_dump_ifinfo+0x2e5/0xee0 net/core/rtnetlink.c:1806
        netlink_dump+0x48c/0xce0 net/netlink/af_netlink.c:2222
        __netlink_dump_start+0x4f0/0x6d0 net/netlink/af_netlink.c:2319
        netlink_dump_start include/linux/netlink.h:214 [inline]
        rtnetlink_rcv_msg+0x7f0/0xb10 net/core/rtnetlink.c:4485
        netlink_rcv_skb+0x21e/0x460 net/netlink/af_netlink.c:2441
        rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4540
        netlink_unicast_kernel net/netlink/af_netlink.c:1308 [inline]
        netlink_unicast+0x4be/0x6a0 net/netlink/af_netlink.c:1334
        netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1897
      
      Cc: Jiri Benc <jbenc@redhat.com>
      Fixes: 79e1ad14 ("rtnetlink: use netnsid to query interface")
      Signed-off-by: NAndrei Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f428fe4a
    • P
      MAINTAINERS: Update my email address. · fb32dd3a
      Pravin B Shelar 提交于
      Signed-off-by: NPravin Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fb32dd3a
  6. 04 1月, 2018 24 次提交