1. 19 5月, 2021 14 次提交
  2. 17 5月, 2021 3 次提交
    • K
      f18ba26d
    • K
      libbpf: Add low level TC-BPF management API · 715c5ce4
      Kumar Kartikeya Dwivedi 提交于
      This adds functions that wrap the netlink API used for adding, manipulating,
      and removing traffic control filters.
      
      The API summary:
      
      A bpf_tc_hook represents a location where a TC-BPF filter can be attached.
      This means that creating a hook leads to creation of the backing qdisc,
      while destruction either removes all filters attached to a hook, or destroys
      qdisc if requested explicitly (as discussed below).
      
      The TC-BPF API functions operate on this bpf_tc_hook to attach, replace,
      query, and detach tc filters. All functions return 0 on success, and a
      negative error code on failure.
      
      bpf_tc_hook_create - Create a hook
      Parameters:
      	@hook - Cannot be NULL, ifindex > 0, attach_point must be set to
      		proper enum constant. Note that parent must be unset when
      		attach_point is one of BPF_TC_INGRESS or BPF_TC_EGRESS. Note
      		that as an exception BPF_TC_INGRESS|BPF_TC_EGRESS is also a
      		valid value for attach_point.
      
      		Returns -EOPNOTSUPP when hook has attach_point as BPF_TC_CUSTOM.
      
      bpf_tc_hook_destroy - Destroy a hook
      Parameters:
      	@hook - Cannot be NULL. The behaviour depends on value of
      		attach_point. If BPF_TC_INGRESS, all filters attached to
      		the ingress hook will be detached. If BPF_TC_EGRESS, all
      		filters attached to the egress hook will be detached. If
      		BPF_TC_INGRESS|BPF_TC_EGRESS, the clsact qdisc will be
      		deleted, also detaching all filters. As before, parent must
      		be unset for these attach_points, and set for BPF_TC_CUSTOM.
      
      		It is advised that if the qdisc is operated on by many programs,
      		then the program at least check that there are no other existing
      		filters before deleting the clsact qdisc. An example is shown
      		below:
      
      		DECLARE_LIBBPF_OPTS(bpf_tc_hook, .ifindex = if_nametoindex("lo"),
      				    .attach_point = BPF_TC_INGRESS);
      		/* set opts as NULL, as we're not really interested in
      		 * getting any info for a particular filter, but just
      	 	 * detecting its presence.
      		 */
      		r = bpf_tc_query(&hook, NULL);
      		if (r == -ENOENT) {
      			/* no filters */
      			hook.attach_point = BPF_TC_INGRESS|BPF_TC_EGREESS;
      			return bpf_tc_hook_destroy(&hook);
      		} else {
      			/* failed or r == 0, the latter means filters do exist */
      			return r;
      		}
      
      		Note that there is a small race between checking for no
      		filters and deleting the qdisc. This is currently unavoidable.
      
      		Returns -EOPNOTSUPP when hook has attach_point as BPF_TC_CUSTOM.
      
      bpf_tc_attach - Attach a filter to a hook
      Parameters:
      	@hook - Cannot be NULL. Represents the hook the filter will be
      		attached to. Requirements for ifindex and attach_point are
      		same as described in bpf_tc_hook_create, but BPF_TC_CUSTOM
      		is also supported.  In that case, parent must be set to the
      		handle where the filter will be attached (using BPF_TC_PARENT).
      		E.g. to set parent to 1:16 like in tc command line, the
      		equivalent would be BPF_TC_PARENT(1, 16).
      
      	@opts - Cannot be NULL. The following opts are optional:
      		* handle   - The handle of the filter
      		* priority - The priority of the filter
      			     Must be >= 0 and <= UINT16_MAX
      		Note that when left unset, they will be auto-allocated by
      		the kernel. The following opts must be set:
      		* prog_fd - The fd of the loaded SCHED_CLS prog
      		The following opts must be unset:
      		* prog_id - The ID of the BPF prog
      		The following opts are optional:
      		* flags - Currently only BPF_TC_F_REPLACE is allowed. It
      			  allows replacing an existing filter instead of
      			  failing with -EEXIST.
      		The following opts will be filled by bpf_tc_attach on a
      		successful attach operation if they are unset:
      		* handle   - The handle of the attached filter
      		* priority - The priority of the attached filter
      		* prog_id  - The ID of the attached SCHED_CLS prog
      		This way, the user can know what the auto allocated values
      		for optional opts like handle and priority are for the newly
      		attached filter, if they were unset.
      
      		Note that some other attributes are set to fixed default
      		values listed below (this holds for all bpf_tc_* APIs):
      		protocol as ETH_P_ALL, direct action mode, chain index of 0,
      		and class ID of 0 (this can be set by writing to the
      		skb->tc_classid field from the BPF program).
      
      bpf_tc_detach
      Parameters:
      	@hook - Cannot be NULL. Represents the hook the filter will be
      		detached from. Requirements are same as described above
      		in bpf_tc_attach.
      
      	@opts - Cannot be NULL. The following opts must be set:
      		* handle, priority
      		The following opts must be unset:
      		* prog_fd, prog_id, flags
      
      bpf_tc_query
      Parameters:
      	@hook - Cannot be NULL. Represents the hook where the filter lookup will
      		be performed. Requirements are same as described above in
      		bpf_tc_attach().
      
      	@opts - Cannot be NULL. The following opts must be set:
      		* handle, priority
      		The following opts must be unset:
      		* prog_fd, prog_id, flags
      		The following fields will be filled by bpf_tc_query upon a
      		successful lookup:
      		* prog_id
      
      Some usage examples (using BPF skeleton infrastructure):
      
      BPF program (test_tc_bpf.c):
      
      	#include <linux/bpf.h>
      	#include <bpf/bpf_helpers.h>
      
      	SEC("classifier")
      	int cls(struct __sk_buff *skb)
      	{
      		return 0;
      	}
      
      Userspace loader:
      
      	struct test_tc_bpf *skel = NULL;
      	int fd, r;
      
      	skel = test_tc_bpf__open_and_load();
      	if (!skel)
      		return -ENOMEM;
      
      	fd = bpf_program__fd(skel->progs.cls);
      
      	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex =
      			    if_nametoindex("lo"), .attach_point =
      			    BPF_TC_INGRESS);
      	/* Create clsact qdisc */
      	r = bpf_tc_hook_create(&hook);
      	if (r < 0)
      		goto end;
      
      	DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .prog_fd = fd);
      	r = bpf_tc_attach(&hook, &opts);
      	if (r < 0)
      		goto end;
      	/* Print the auto allocated handle and priority */
      	printf("Handle=%u", opts.handle);
      	printf("Priority=%u", opts.priority);
      
      	opts.prog_fd = opts.prog_id = 0;
      	bpf_tc_detach(&hook, &opts);
      end:
      	test_tc_bpf__destroy(skel);
      
      This is equivalent to doing the following using tc command line:
        # tc qdisc add dev lo clsact
        # tc filter add dev lo ingress bpf obj foo.o sec classifier da
        # tc filter del dev lo ingress handle <h> prio <p> bpf
      ... where the handle and priority can be found using:
        # tc filter show dev lo ingress
      
      Another example replacing a filter (extending prior example):
      
      	/* We can also choose both (or one), let's try replacing an
      	 * existing filter.
      	 */
      	DECLARE_LIBBPF_OPTS(bpf_tc_opts, replace_opts, .handle =
      			    opts.handle, .priority = opts.priority,
      			    .prog_fd = fd);
      	r = bpf_tc_attach(&hook, &replace_opts);
      	if (r == -EEXIST) {
      		/* Expected, now use BPF_TC_F_REPLACE to replace it */
      		replace_opts.flags = BPF_TC_F_REPLACE;
      		return bpf_tc_attach(&hook, &replace_opts);
      	} else if (r < 0) {
      		return r;
      	}
      	/* There must be no existing filter with these
      	 * attributes, so cleanup and return an error.
      	 */
      	replace_opts.prog_fd = replace_opts.prog_id = 0;
      	bpf_tc_detach(&hook, &replace_opts);
      	return -1;
      
      To obtain info of a particular filter:
      
      	/* Find info for filter with handle 1 and priority 50 */
      	DECLARE_LIBBPF_OPTS(bpf_tc_opts, info_opts, .handle = 1,
      			    .priority = 50);
      	r = bpf_tc_query(&hook, &info_opts);
      	if (r == -ENOENT)
      		printf("Filter not found");
      	else if (r < 0)
      		return r;
      	printf("Prog ID: %u", info_opts.prog_id);
      	return 0;
      Signed-off-by: NKumar Kartikeya Dwivedi <memxor@gmail.com>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> # libbpf API design
      [ Daniel: also did major patch cleanup ]
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210512103451.989420-3-memxor@gmail.com
      715c5ce4
    • K
      libbpf: Add various netlink helpers · 8bbb77b7
      Kumar Kartikeya Dwivedi 提交于
      This change introduces a few helpers to wrap open coded attribute
      preparation in netlink.c. It also adds a libbpf_netlink_send_recv() that
      is useful to wrap send + recv handling in a generic way. Subsequent patch
      will also use this function for sending and receiving a netlink response.
      The libbpf_nl_get_link() helper has been removed instead, moving socket
      creation into the newly named libbpf_netlink_send_recv().
      
      Every nested attribute's closure must happen using the helper
      nlattr_end_nested(), which sets its length properly. NLA_F_NESTED is
      enforced using nlattr_begin_nested() helper. Other simple attributes
      can be added directly.
      
      The maxsz parameter corresponds to the size of the request structure
      which is being filled in, so for instance with req being:
      
        struct {
      	struct nlmsghdr nh;
      	struct tcmsg t;
      	char buf[4096];
        } req;
      
      Then, maxsz should be sizeof(req).
      
      This change also converts the open coded attribute preparation with these
      helpers. Note that the only failure the internal call to nlattr_add()
      could result in the nested helper would be -EMSGSIZE, hence that is what
      we return to our caller.
      
      The libbpf_netlink_send_recv() call takes care of opening the socket,
      sending the netlink message, receiving the response, potentially invoking
      callbacks, and return errors if any, and then finally close the socket.
      This allows users to avoid identical socket setup code in different places.
      The only user of libbpf_nl_get_link() has been converted to make use of it.
      __bpf_set_link_xdp_fd_replace() has also been refactored to use it.
      Signed-off-by: NKumar Kartikeya Dwivedi <memxor@gmail.com>
      [ Daniel: major patch cleanup ]
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210512103451.989420-2-memxor@gmail.com
      8bbb77b7
  3. 15 5月, 2021 1 次提交
  4. 14 5月, 2021 2 次提交
  5. 12 5月, 2021 6 次提交
  6. 07 5月, 2021 11 次提交
  7. 06 5月, 2021 3 次提交
    • P
      selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages · e44605a8
      Pavel Tatashin 提交于
      When pages are pinned they can be faulted in userland and migrated, and
      they can be faulted right in kernel without migration.
      
      In either case, the pinned pages must end-up being pinnable (not
      movable).
      
      Add a new test to gup_test, to help verify that the gup/pup
      (get_user_pages() / pin_user_pages()) behavior with respect to pinnable
      and movable pages is reasonable and correct.  Specifically, provide a
      way to:
      
      1) Verify that only "pinnable" pages are pinned.  This is checked
         automatically for you.
      
      2) Verify that gup/pup performance is reasonable.  This requires
         comparing benchmarks between doing gup/pup on pages that have been
         pre-faulted in from user space, vs.  doing gup/pup on pages that are
         not faulted in until gup/pup time (via FOLL_TOUCH).  This decision is
         controlled with the new -z command line option.
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-15-pasha.tatashin@soleen.comSigned-off-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e44605a8
    • P
      selftests/vm: gup_test: fix test flag · 79dbf135
      Pavel Tatashin 提交于
      In gup_test both gup_flags and test_flags use the same flags field.
      This is broken.
      
      Farther, in the actual gup_test.c all the passed gup_flags are erased
      and unconditionally replaced with FOLL_WRITE.
      
      Which means that test_flags are ignored, and code like this always
      performs pin dump test:
      
      155  			if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
      156  				nr = pin_user_pages(addr, nr, gup->flags,
      157  						    pages + i, NULL);
      158  			else
      159  				nr = get_user_pages(addr, nr, gup->flags,
      160  						    pages + i, NULL);
      161  			break;
      
      Add a new test_flags field, to allow raw gup_flags to work.  Add a new
      subcommand for DUMP_USER_PAGES_TEST to specify that pin test should be
      performed.
      
      Remove unconditional overwriting of gup_flags via FOLL_WRITE.  But,
      preserve the previous behaviour where FOLL_WRITE was the default flag,
      and add a new option "-W" to unset FOLL_WRITE.
      
      Rename flags with gup_flags.
      
      With the fix, dump works like this:
      
        root@virtme:/# gup_test  -c
        ---- page #0, starting from user virt addr: 0x7f8acb9e4000
        page:00000000d3d2ee27 refcount:2 mapcount:1 mapping:0000000000000000
        index:0x0 pfn:0x100bcf
        anon flags: 0x300000000080016(referenced|uptodate|lru|swapbacked)
        raw: 0300000000080016 ffffd0e204021608 ffffd0e208df2e88 ffff8ea04243ec61
        raw: 0000000000000000 0000000000000000 0000000200000000 0000000000000000
        page dumped because: gup_test: dump_pages() test
        DUMP_USER_PAGES_TEST: done
      
        root@virtme:/# gup_test  -c -p
        ---- page #0, starting from user virt addr: 0x7fd19701b000
        page:00000000baed3c7d refcount:1025 mapcount:1 mapping:0000000000000000
        index:0x0 pfn:0x108008
        anon flags: 0x300000000080014(uptodate|lru|swapbacked)
        raw: 0300000000080014 ffffd0e204200188 ffffd0e205e09088 ffff8ea04243ee71
        raw: 0000000000000000 0000000000000000 0000040100000000 0000000000000000
        page dumped because: gup_test: dump_pages() test
        DUMP_USER_PAGES_TEST: done
      
      Refcount shows the difference between pin vs no-pin case.
      Also change type of nr from int to long, as it counts number of pages.
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-14-pasha.tatashin@soleen.comSigned-off-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79dbf135
    • A
      userfaultfd/selftests: add test exercising minor fault handling · f0fa9433
      Axel Rasmussen 提交于
      Fix a dormant bug in userfaultfd_events_test(), where we did `return
      faulting_process(0)` instead of `exit(faulting_process(0))`.  This
      caused the forked process to keep running, trying to execute any further
      test cases after the events test in parallel with the "real" process.
      
      Add a simple test case which exercises minor faults.  In short, it does
      the following:
      
      1. "Sets up" an area (area_dst) and a second shared mapping to the same
         underlying pages (area_dst_alias).
      
      2. Register one of these areas with userfaultfd, in minor fault mode.
      
      3. Start a second thread to handle any minor faults.
      
      4. Populate the underlying pages with the non-UFFD-registered side of
         the mapping. Basically, memset() each page with some arbitrary
         contents.
      
      5. Then, using the UFFD-registered mapping, read all of the page
         contents, asserting that the contents match expectations (we expect
         the minor fault handling thread can modify the page contents before
         resolving the fault).
      
      The minor fault handling thread, upon receiving an event, flips all the
      bits (~) in that page, just to prove that it can modify it in some
      arbitrary way.  Then it issues a UFFDIO_CONTINUE ioctl, to setup the
      mapping and resolve the fault.  The reading thread should wake up and
      see this modification.
      
      Currently the minor fault test is only enabled in hugetlb_shared mode,
      as this is the only configuration the kernel feature supports.
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-7-axelrasmussen@google.comSigned-off-by: NAxel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: NPeter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0fa9433