1. 04 May, 2018 1 commit
    • D
      bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier · 93731ef0
      Daniel Borkmann committed
      Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. Reason
      is that the eBPF tests from test_bpf module do not go via BPF verifier
      and therefore any instruction rewrites from verifier cannot take place.
      
      Therefore, move them into test_verifier which runs out of user space,
      so that the verifier can rewrite LD_ABS/LD_IND internally in upcoming
      patches. It will have the same effect since runtime tests are also
      performed from there. This also finally allows unexporting
      bpf_skb_vlan_{push,pop}_proto and keeping it internal to the core kernel.
      
      Additionally, add further cBPF LD_ABS/LD_IND test coverage to the
      test_bpf.ko suite.

      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 02 May, 2018 1 commit
  3. 01 May, 2018 2 commits
    • C
      netns: restrict uevents · a3498436
      Christian Brauner committed
      commit 07e98962 ("kobject: Send hotplug events in all network namespaces")
      enabled sending hotplug events into all network namespaces back in 2010.
      Over time the set of uevents that get sent into all network namespaces has
      shrunk. We have now reached the point where hotplug events for all devices
      that carry a namespace tag are filtered according to that namespace.
      Specifically, they are filtered whenever the namespace tag of the kobject
      does not match the namespace tag of the netlink socket.
      Currently, only network devices carry namespace tags (i.e. network
      namespace tags). Hence, uevents for network devices only show up in the
      network namespace such devices are created in or moved to.
      
      However, any uevent for a kobject that does not have a namespace tag
      associated with it will not be filtered and we will broadcast it into all
      network namespaces. This behavior stopped making sense when user namespaces
      were introduced.
      
      This patch simplifies and fixes a couple of things:
      - Split codepath for sending uevents by kobject namespace tags:
        1. Untagged kobjects - uevent_net_broadcast_untagged():
           Untagged kobjects will be broadcast into all uevent sockets recorded
           in uevent_sock_list, i.e. into all network namespaces owned by the
           initial user namespace.
        2. Tagged kobjects - uevent_net_broadcast_tagged():
           Tagged kobjects will only be broadcast into the network namespace they
           were tagged with.
        Handling of tagged kobjects in 2. does not cause any semantic changes.
        This is just splitting out the filtering logic that was handled by
        kobj_bcast_filter() before.
        Handling of untagged kobjects in 1. will cause a semantic change. The
        reasons why this is needed and ok have been discussed in [1]. Here is a
        short summary:
        - Userspace ignores uevents from network namespaces that are not owned by
          the initial user namespace:
          Uevents are filtered by userspace in a user namespace because the
          received uid != 0. Instead the uid associated with the event will be
          65534 == "nobody" because the global root uid is not mapped.
          This means we can safely and without introducing regressions modify the
          kernel to not send uevents into all network namespaces whose owning
          user namespace is not the initial user namespace because we know that
          userspace will ignore the message because of the uid anyway.
          I have verified a) that this is true for every udev implementation
          out there and b) that this behavior has been present in all udev
          implementations from the very beginning.
        - Thundering herd:
          Broadcasting uevents into all network namespaces introduces significant
          overhead.
          All processes that listen to uevents running in non-initial user
          namespaces will end up responding to uevents that will be meaningless
          to them. Mainly, because non-initial user namespaces cannot easily
          manage devices unless they have a privileged host-process helping them
          out. This means that there will be a thundering herd of activity when
          there shouldn't be any.
        - Removing needless overhead/Increasing performance:
          Currently, the uevent socket for each network namespace is added to the
          global variable uevent_sock_list. The list itself needs to be protected
          by a mutex. So every time a uevent is generated the mutex is taken on
          the list. The mutex is held *from the creation of the uevent (memory
          allocation, string creation etc.) until all uevent sockets have been
          handled*. This is aggravated by the fact that for each uevent socket
          that has listeners the mc_list must be walked as well which means we're
          talking O(n^2) here. Given that a standard Linux workload usually has
          quite a lot of network namespaces and - in the face of containers - a
          lot of user namespaces this quickly becomes a performance problem (see
          "Thundering herd" above). By just recording uevent sockets of network
          namespaces that are owned by the initial user namespace we
          significantly increase performance in this codepath.
        - Injecting uevents:
          There's a valid argument that containers might be interested in
          receiving device events especially if they are delegated to them by a
          privileged userspace process. One prime example is SR-IOV enabled
          devices that are explicitly designed to be handed off to other users
          such as VMs or containers.
          This use-case can now be correctly handled since
          commit 692ec06d ("netns: send uevent messages"). This commit
          introduced the ability to send uevents from userspace. As such we can
          let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
          namespace of the network namespace of the netlink socket) userspace
          process make a decision what uevents should be sent. This removes the
          need to blindly broadcast uevents into all user namespaces and provides
          a performant and safe solution to this problem.
        - Filtering logic:
          This patch filters by *owning user namespace of the network namespace a
          given task resides in* and not by user namespace of the task per se.
          This means if the user namespace of a given task is unshared but the
          network namespace is kept and is owned by the initial user namespace a
          listener that is opening the uevent socket in that network namespace
          can still listen to uevents.
      - Fix permission for tagged kobjects:
        Network devices that are created or moved into a network namespace that
        is owned by a non-initial user namespace are currently sent with
        INVALID_{G,U}ID in their credentials. This means that all current udev
        implementations in userspace will ignore the uevent they receive for
        them. This has led to weird bugs whereby new devices showing up in such
        network namespaces were not recognized and did not get IPs assigned etc.
        This patch adjusts the permission to the appropriate {g,u}id in the
        respective user namespace. This way udevd is able to correctly handle
        such devices.
      - Simplify filtering logic:
        do_one_broadcast() already ensures that only listeners in mc_list receive
        uevents that have the same network namespace as the uevent socket itself.
        So the filtering logic in kobj_bcast_filter is not needed (see [3]). This
        patch therefore removes kobj_bcast_filter() and replaces
        netlink_broadcast_filtered() with the simpler netlink_broadcast()
        everywhere.
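
      The split between tagged and untagged kobjects described above can be
      pictured with a small model. This is only a sketch of the routing
      decision, not kernel code: the classes UserNs and NetNs and the function
      uevent_targets() are hypothetical names invented for illustration, and
      the real logic lives in uevent_net_broadcast_tagged() and
      uevent_net_broadcast_untagged().

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UserNs:
    """Hypothetical stand-in for a user namespace."""
    name: str
    is_initial: bool = False


@dataclass(frozen=True)
class NetNs:
    """Hypothetical stand-in for a network namespace and its owning user namespace."""
    name: str
    owner: UserNs


def uevent_targets(kobj_tag: Optional[NetNs], all_netns: List[NetNs]) -> List[NetNs]:
    """Return the network namespaces that would receive a uevent.

    Tagged kobject: only the network namespace it is tagged with.
    Untagged kobject: every network namespace owned by the initial
    user namespace (the model's equivalent of uevent_sock_list).
    """
    if kobj_tag is not None:
        return [kobj_tag]
    return [ns for ns in all_netns if ns.owner.is_initial]


# Example topology (illustrative only).
init_user = UserNs("init_user_ns", is_initial=True)
container_user = UserNs("container_user_ns")

host_net = NetNs("host", owner=init_user)
vpn_net = NetNs("vpn", owner=init_user)               # unshared by host root
container_net = NetNs("container", owner=container_user)

namespaces = [host_net, vpn_net, container_net]

# An untagged kobject (for example a block device) skips the container.
untagged = uevent_targets(None, namespaces)

# A network device moved into the container is delivered only there.
tagged = uevent_targets(container_net, namespaces)
```

      In this model the container's network namespace never sees uevents for
      untagged kobjects, which matches the semantic change described for case
      1, while delivery for tagged kobjects is unchanged.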
      
      [1]: https://lkml.org/lkml/2018/4/4/739
      [2]: https://lkml.org/lkml/2018/4/26/767
      [3]: https://lkml.org/lkml/2018/4/26/738

      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • C
      uevent: add alloc_uevent_skb() helper · 26045a7b
      Christian Brauner committed
      This patch adds alloc_uevent_skb() in preparation for follow-up patches.

      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 27 April, 2018 1 commit
    • M
      errseq: Always report a writeback error once · b4678df1
      Matthew Wilcox committed
      The errseq_t infrastructure assumes that errors which occurred before
      the file descriptor was opened are of no interest to the application.
      This turns out to be a regression for some applications, notably Postgres.
      
      Before errseq_t, a writeback error would be reported exactly once (as
      long as the inode remained in memory), so Postgres could open a file,
      call fsync() and find out whether there had been a writeback error on
      that file from another process.
      
      This patch changes the errseq infrastructure to report errors to all
      file descriptors which are opened after the error occurred, but before
      it was reported to any file descriptor.  This restores the user-visible
      behaviour.
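
      The "report once" rule can be illustrated with a simplified model of the
      errseq value. This is a sketch written for this summary, not the kernel
      implementation: it assumes the layout used by lib/errseq.c (errno in the
      low 12 bits, one "seen" flag bit, a counter above it) and ignores
      atomicity and counter wraparound. The Python function names mirror the
      kernel helpers but their signatures are simplified.

```python
ERRNO_MASK = 0xFFF      # low 12 bits hold the (positive) errno
SEEN = 1 << 12          # set once some file descriptor has reported the error
CTR_INC = 1 << 13       # the counter lives above the flag bit


def errseq_set(eseq: int, err: int) -> int:
    """Record a writeback error; bump the counter only if the old value was seen."""
    new = (eseq & ~(ERRNO_MASK | SEEN)) | err
    if eseq & SEEN:
        new += CTR_INC
    return new


def errseq_sample(eseq: int) -> int:
    """Take the starting point for a newly opened file.

    Behaviour after the patch: if the current error has not been seen by
    anyone yet, return 0 so that the first check on the new file reports it.
    """
    return eseq if eseq & SEEN else 0


def errseq_check_and_advance(eseq: int, since: int):
    """Return (new_eseq, new_since, err); err is 0 if nothing new happened."""
    if eseq == since:
        return eseq, since, 0
    new = eseq | SEEN               # mark the error as reported
    return new, new, new & ERRNO_MASK


EIO = 5

# A writeback error occurs before anyone has the file open.
mapping = errseq_set(0, EIO)

# A process opens the file afterwards and calls fsync() twice.
fd_since = errseq_sample(mapping)
mapping, fd_since, first = errseq_check_and_advance(mapping, fd_since)
mapping, fd_since, second = errseq_check_and_advance(mapping, fd_since)

# A later opener samples the already-seen value and gets no stale report.
late_since = errseq_sample(mapping)
_, _, late = errseq_check_and_advance(mapping, late_since)
```

      Here the first fsync() on the file opened after the error returns EIO,
      the second returns 0, and a file opened after the error was reported
      sees nothing, which is the user-visible behaviour the patch restores.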
      
      Cc: stable@vger.kernel.org
      Fixes: 5660e13d ("fs: new infrastructure for writeback error handling and reporting")
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
  5. 25 April, 2018 3 commits
  6. 23 April, 2018 2 commits
  7. 17 April, 2018 1 commit
  8. 14 April, 2018 1 commit
  9. 13 April, 2018 1 commit
  10. 12 April, 2018 9 commits
  11. 11 April, 2018 1 commit
  12. 10 April, 2018 1 commit
  13. 06 April, 2018 3 commits
  14. 01 April, 2018 1 commit
  15. 31 March, 2018 4 commits
  16. 30 March, 2018 1 commit
  17. 28 March, 2018 2 commits
  18. 27 March, 2018 1 commit
  19. 26 March, 2018 4 commits