1. 18 7月, 2017 1 次提交
  2. 14 7月, 2017 1 次提交
    • R
      kvm: x86: hyperv: make VP_INDEX managed by userspace · d3457c87
      Roman Kagan 提交于
      Hyper-V identifies vCPUs by Virtual Processor Index, which can be
      queried via HV_X64_MSR_VP_INDEX msr.  It is defined by the spec as a
      sequential number which can't exceed the maximum number of vCPUs per VM.
      APIC ids can be sparse and thus aren't a valid replacement for VP
      indices.
      
      Current KVM uses its internal vcpu index as VP_INDEX.  However, to make
      it predictable and persistent across VM migrations, the userspace has to
      control the value of VP_INDEX.
      
      This patch achieves that, by storing vp_index explicitly on vcpu, and
      allowing HV_X64_MSR_VP_INDEX to be set from the host side.  For
      compatibility it's initialized to KVM vcpu index.  Also a few variables
      are renamed to make clear distinction betweed this Hyper-V vp_index and
      KVM vcpu_id (== APIC id).  Besides, a new capability,
      KVM_CAP_HYPERV_VP_INDEX, is added to allow the userspace to skip
      attempting msr writes where unsupported, to avoid spamming error logs.
      Signed-off-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      d3457c87
  3. 13 7月, 2017 4 次提交
    • R
      kvm: x86: hyperv: add KVM_CAP_HYPERV_SYNIC2 · efc479e6
      Roman Kagan 提交于
      There is a flaw in the Hyper-V SynIC implementation in KVM: when message
      page or event flags page is enabled by setting the corresponding msr,
      KVM zeroes it out.  This is problematic because on migration the
      corresponding MSRs are loaded on the destination, so the content of
      those pages is lost.
      
      This went unnoticed so far because the only user of those pages was
      in-KVM hyperv synic timers, which could continue working despite that
      zeroing.
      
      Newer QEMU uses those pages for Hyper-V VMBus implementation, and
      zeroing them breaks the migration.
      
      Besides, in newer QEMU the content of those pages is fully managed by
      QEMU, so zeroing them is undesirable even when writing the MSRs from the
      guest side.
      
      To support this new scheme, introduce a new capability,
      KVM_CAP_HYPERV_SYNIC2, which, when enabled, makes sure that the synic
      pages aren't zeroed out in KVM.
      Signed-off-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      efc479e6
    • M
      include/linux/sem.h: correctly document sem_ctime · 2cd648c1
      Manfred Spraul 提交于
      sem_ctime is initialized to the semget() time and then updated at every
      semctl() that changes the array.
      
      Thus it does not represent the time of the last change.
      
      Especially, semop() calls are only stored in sem_otime, not in
      sem_ctime.
      
      This is already described in ipc/sem.c, I just overlooked that there is
      a comment in include/linux/sem.h and man semctl(2) as well.
      
      So: Correct wrong comments.
      
      Link: http://lkml.kernel.org/r/20170515171912.6298-4-manfred@colorfullife.comSigned-off-by: NManfred Spraul <manfred@colorfullife.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: <1vier1@web.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2cd648c1
    • C
      kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files · 0791e364
      Cyrill Gorcunov 提交于
      With current epoll architecture target files are addressed with
      file_struct and file descriptor number, where the last is not unique.
      Moreover files can be transferred from another process via unix socket,
      added into queue and closed then so we won't find this descriptor in the
      task fdinfo list.
      
      Thus to checkpoint and restore such processes CRIU needs to find out
      where exactly the target file is present to add it into epoll queue.
      For this sake one can use kcmp call where some particular target file
      from the queue is compared with arbitrary file passed as an argument.
      
      Because epoll target files can have same file descriptor number but
      different file_struct a caller should explicitly specify the offset
      within.
      
      To test if some particular file is matching entry inside epoll one have
      to
      
       - fill kcmp_epoll_slot structure with epoll file descriptor,
         target file number and target file offset (in case if only
         one target is present then it should be 0)
      
       - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
          - the kernel fetch file pointer matching file descriptor @fd of pid1
          - lookups for file struct in epoll queue of pid2 and returns traditional
            0,1,2 result for sorting purpose
      
      Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.comSigned-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: NAndrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0791e364
    • G
      KVM: s390: Fix KVM_S390_GET_CMMA_BITS ioctl definition · 949c0336
      Gleb Fotengauer-Malinovskiy 提交于
      In case of KVM_S390_GET_CMMA_BITS, the kernel does not only read struct
      kvm_s390_cmma_log passed from userspace (which constitutes _IOC_WRITE),
      it also writes back a return value (which constitutes _IOC_READ) making
      this an _IOWR ioctl instead of _IOW.
      
      Fixes: 4036e387 ("KVM: s390: ioctls to get and set guest storage attributes")
      Signed-off-by: NGleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
      Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      949c0336
  4. 11 7月, 2017 1 次提交
    • L
      Fix up over-eager 'wait_queue_t' renaming · 7cee9384
      Linus Torvalds 提交于
      Commit ac6424b9 ("sched/wait: Rename wait_queue_t =>
      wait_queue_entry_t") had scripted the renaming incorrectly, and didn't
      actually check that the 'wait_queue_t' was a full token.
      
      As a result, it also triggered on 'wait_queue_token', and renamed that
      to 'wait_queue_entry_token' entry in the autofs4 packet structure
      definition too.  That was entirely incorrect, and not intended.
      
      The end result built fine when building just the kernel - because
      everything had been renamed consistently there - but caused problems in
      user space because the "struct autofs_packet_missing" type is exported
      as part of the uapi.
      
      This scripts it all back again:
      
          git grep -lw wait_queue_entry_token |
              xargs sed -i 's/wait_queue_entry_token/wait_queue_token/g'
      
      and checks the end result.
      Reported-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Fixes: ac6424b9 ("sched/wait: Rename wait_queue_t => wait_queue_entry_t")
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cee9384
  5. 08 7月, 2017 1 次提交
  6. 07 7月, 2017 6 次提交
  7. 03 7月, 2017 4 次提交
  8. 02 7月, 2017 9 次提交
    • L
      bpf: Adds support for setting sndcwnd clamp · 13bf9641
      Lawrence Brakmo 提交于
      Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_SNDCWND_CLAMP, which
      sets the initial congestion window. It is useful to limit the sndcwnd
      when the host are close to each other (small RTT).
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      13bf9641
    • L
      bpf: Adds support for setting initial cwnd · fc747810
      Lawrence Brakmo 提交于
      Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_IW, which sets the
      initial congestion window. This can be used when the hosts are far
      apart (large RTTs) and it is safe to start with a large inital cwnd.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc747810
    • L
      bpf: Add support for changing congestion control · 91b5b21c
      Lawrence Brakmo 提交于
      Added support for changing congestion control for SOCK_OPS bpf
      programs through the setsockopt bpf helper function. It also adds
      a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
      congestion controls, like dctcp, that need to enable ECN in the
      SYN packets.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      91b5b21c
    • L
      bpf: Add TCP connection BPF callbacks · 9872a4bd
      Lawrence Brakmo 提交于
      Added callbacks to BPF SOCK_OPS type program before an active
      connection is intialized and after a passive or active connection is
      established.
      
      The following patch demostrates how they can be used to set send and
      receive buffer sizes.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9872a4bd
    • L
      bpf: Add setsockopt helper function to bpf · 8c4b4c7e
      Lawrence Brakmo 提交于
      Added support for calling a subset of socket setsockopts from
      BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
      than making the changes to call the socket setsockopt function because
      the changes required would have been larger.
      
      The ops supported are:
        SO_RCVBUF
        SO_SNDBUF
        SO_MAX_PACING_RATE
        SO_PRIORITY
        SO_RCVLOWAT
        SO_MARK
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c4b4c7e
    • L
      bpf: Support for setting initial receive window · 13d3b1eb
      Lawrence Brakmo 提交于
      This patch adds suppport for setting the initial advertized window from
      within a BPF_SOCK_OPS program. This can be used to support larger
      initial cwnd values in environments where it is known to be safe.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      13d3b1eb
    • L
      bpf: Support for per connection SYN/SYN-ACK RTOs · 8550f328
      Lawrence Brakmo 提交于
      This patch adds support for setting a per connection SYN and
      SYN_ACK RTOs from within a BPF_SOCK_OPS program. For example,
      to set small RTOs when it is known both hosts are within a
      datacenter.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8550f328
    • L
      bpf: BPF support for sock_ops · 40304b2a
      Lawrence Brakmo 提交于
      Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
      struct that allows BPF programs of this type to access some of the
      socket's fields (such as IP addresses, ports, etc.). It uses the
      existing bpf cgroups infrastructure so the programs can be attached per
      cgroup with full inheritance support. The program will be called at
      appropriate times to set relevant connections parameters such as buffer
      sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
      as IP addresses, port numbers, etc.
      
      Alghough there are already 3 mechanisms to set parameters (sysctls,
      route metrics and setsockopts), this new mechanism provides some
      distinct advantages. Unlike sysctls, it can set parameters per
      connection. In contrast to route metrics, it can also use port numbers
      and information provided by a user level program. In addition, it could
      set parameters probabilistically for evaluation purposes (i.e. do
      something different on 10% of the flows and compare results with the
      other 90% of the flows). Also, in cases where IPv6 addresses contain
      geographic information, the rules to make changes based on the distance
      (or RTT) between the hosts are much easier than route metric rules and
      can be global. Finally, unlike setsockopt, it oes not require
      application changes and it can be updated easily at any time.
      
      Although the bpf cgroup framework already contains a sock related
      program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
      (BPF_PROG_TYPE_SOCK_OPS) beccause the existing type expects to be called
      only once during the connections's lifetime. In contrast, the new
      program type will be called multiple times from different places in the
      network stack code.  For example, before sending SYN and SYN-ACKs to set
      an appropriate timeout, when the connection is established to set
      congestion control, etc. As a result it has "op" field to specify the
      type of operation requested.
      
      The purpose of this new program type is to simplify setting connection
      parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
      easy to use facebook's internal IPv6 addresses to determine if both hosts
      of a connection are in the same datacenter. Therefore, it is easy to
      write a BPF program to choose a small SYN RTO value when both hosts are
      in the same datacenter.
      
      This patch only contains the framework to support the new BPF program
      type, following patches add the functionality to set various connection
      parameters.
      
      This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
      and a new bpf syscall command to load a new program of this type:
      BPF_PROG_LOAD_SOCKET_OPS.
      
      Two new corresponding structs (one for the kernel one for the user/BPF
      program):
      
      /* kernel version */
      struct bpf_sock_ops_kern {
              struct sock *sk;
              __u32  op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
      };
      
      /* user version
       * Some fields are in network byte order reflecting the sock struct
       * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
       * convert them to host byte order.
       */
      struct bpf_sock_ops {
              __u32 op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
              __u32 family;
              __u32 remote_ip4;     /* In network byte order */
              __u32 local_ip4;      /* In network byte order */
              __u32 remote_ip6[4];  /* In network byte order */
              __u32 local_ip6[4];   /* In network byte order */
              __u32 remote_port;    /* In network byte order */
              __u32 local_port;     /* In host byte horder */
      };
      
      Currently there are two types of ops. The first type expects the BPF
      program to return a value which is then used by the caller (or a
      negative value to indicate the operation is not supported). The second
      type expects state changes to be done by the BPF program, for example
      through a setsockopt BPF helper function, and they ignore the return
      value.
      
      The reply fields of the bpf_sockt_ops struct are there in case a bpf
      program needs to return a value larger than an integer.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      40304b2a
    • N
      sctp: Add peeloff-flags socket option · 2cb5c8e3
      Neil Horman 提交于
      Based on a request raised on the sctp devel list, there is a need to
      augment the sctp_peeloff operation while specifying the O_CLOEXEC and
      O_NONBLOCK flags (simmilar to the socket syscall).  Since modifying the
      SCTP_SOCKOPT_PEELOFF socket option would break user space ABI for existing
      programs, this patch creates a new socket option
      SCTP_SOCKOPT_PEELOFF_FLAGS, which accepts a third flags parameter to
      allow atomic assignment of the socket descriptor flags.
      
      Tested successfully by myself and the requestor
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: Vlad Yasevich <vyasevich@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Andreas Steinmetz <ast@domdv.de>
      CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2cb5c8e3
  9. 01 7月, 2017 3 次提交
  10. 30 6月, 2017 1 次提交
  11. 28 6月, 2017 2 次提交
    • L
      switchtec: Add "running" status flag to fw partition info ioctl · 079e3bc5
      Logan Gunthorpe 提交于
      This flag lets userspace know which firmware partitions are currently in
      use as opposed to just active.  "Active" means they will be in use for the
      next reboot, whereas "running" means they are currently in use.
      
      If an old kernel is in use, or the firmware doesn't support these fields,
      the new flag will not be set in the output.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: NKurt Schwemmer <kurt.schwemmer@microsemi.com>
      079e3bc5
    • J
      fs: add fcntl() interface for setting/getting write life time hints · c75b1d94
      Jens Axboe 提交于
      Define a set of write life time hints:
      
      RWH_WRITE_LIFE_NOT_SET	No hint information set
      RWH_WRITE_LIFE_NONE	No hints about write life time
      RWH_WRITE_LIFE_SHORT	Data written has a short life time
      RWH_WRITE_LIFE_MEDIUM	Data written has a medium life time
      RWH_WRITE_LIFE_LONG	Data written has a long life time
      RWH_WRITE_LIFE_EXTREME	Data written has an extremely long life time
      
      The intent is for these values to be relative to each other, no
      absolute meaning should be attached to these flag names.
      
      Add an fcntl interface for querying these flags, and also for
      setting them as well:
      
      F_GET_RW_HINT		Returns the read/write hint set on the
      			underlying inode.
      
      F_SET_RW_HINT		Set one of the above write hints on the
      			underlying inode.
      
      F_GET_FILE_RW_HINT	Returns the read/write hint set on the
      			file descriptor.
      
      F_SET_FILE_RW_HINT	Set one of the above write hints on the
      			file descriptor.
      
      The user passes in a 64-bit pointer to get/set these values, and
      the interface returns 0/-1 on success/error.
      
      Sample program testing/implementing basic setting/getting of write
      hints is below.
      
      Add support for storing the write life time hint in the inode flags
      and in struct file as well, and pass them to the kiocb flags. If
      both a file and its corresponding inode has a write hint, then we
      use the one in the file, if available. The file hint can be used
      for sync/direct IO, for buffered writeback only the inode hint
      is available.
      
      This is in preparation for utilizing these hints in the block layer,
      to guide on-media data placement.
      
      /*
       * writehint.c: get or set an inode write hint
       */
       #include <stdio.h>
       #include <fcntl.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <stdbool.h>
       #include <inttypes.h>
      
       #ifndef F_GET_RW_HINT
       #define F_LINUX_SPECIFIC_BASE	1024
       #define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
       #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
       #endif
      
      static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
      			"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
      			"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };
      
      int main(int argc, char *argv[])
      {
      	uint64_t hint;
      	int fd, ret;
      
      	if (argc < 2) {
      		fprintf(stderr, "%s: file <hint>\n", argv[0]);
      		return 1;
      	}
      
      	fd = open(argv[1], O_RDONLY);
      	if (fd < 0) {
      		perror("open");
      		return 2;
      	}
      
      	if (argc > 2) {
      		hint = atoi(argv[2]);
      		ret = fcntl(fd, F_SET_RW_HINT, &hint);
      		if (ret < 0) {
      			perror("fcntl: F_SET_RW_HINT");
      			return 4;
      		}
      	}
      
      	ret = fcntl(fd, F_GET_RW_HINT, &hint);
      	if (ret < 0) {
      		perror("fcntl: F_GET_RW_HINT");
      		return 3;
      	}
      
      	printf("%s: hint %s\n", argv[1], str[hint]);
      	close(fd);
      	return 0;
      }
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c75b1d94
  12. 24 6月, 2017 3 次提交
    • D
      fscrypt: add support for AES-128-CBC · b7e7cf7a
      Daniel Walter 提交于
      fscrypt provides facilities to use different encryption algorithms which
      are selectable by userspace when setting the encryption policy. Currently,
      only AES-256-XTS for file contents and AES-256-CBC-CTS for file names are
      implemented. This is a clear case of kernel offers the mechanism and
      userspace selects a policy. Similar to what dm-crypt and ecryptfs have.
      
      This patch adds support for using AES-128-CBC for file contents and
      AES-128-CBC-CTS for file name encryption. To mitigate watermarking
      attacks, IVs are generated using the ESSIV algorithm. While AES-CBC is
      actually slightly less secure than AES-XTS from a security point of view,
      there is more widespread hardware support. Using AES-CBC gives us the
      acceptable performance while still providing a moderate level of security
      for persistent storage.
      
      Especially low-powered embedded devices with crypto accelerators such as
      CAAM or CESA often only support AES-CBC. Since using AES-CBC over AES-XTS
      is basically thought of a last resort, we use AES-128-CBC over AES-256-CBC
      since it has less encryption rounds and yields noticeable better
      performance starting from a file size of just a few kB.
      Signed-off-by: NDaniel Walter <dwalter@sigma-star.at>
      [david@sigma-star.at: addressed review comments]
      Signed-off-by: NDavid Gstir <david@sigma-star.at>
      Reviewed-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      b7e7cf7a
    • J
      xdp: add reporting of offload mode · ce158e58
      Jakub Kicinski 提交于
      Extend the XDP_ATTACHED_* values to include offloaded mode.
      Let drivers report whether program is installed in the driver
      or the HW by changing the prog_attached field from bool to
      u8 (type of the netlink attribute).
      
      Exploit the fact that the value of XDP_ATTACHED_DRV is 1,
      therefore since all drivers currently assign the mode with
      double negation:
             mode = !!xdp_prog;
      no drivers have to be modified.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce158e58
    • J
      xdp: add HW offload mode flag for installing programs · ee5d032f
      Jakub Kicinski 提交于
      Add an installation-time flag for requesting that the program
      be installed only if it can be offloaded to HW.
      
      Internally new command for ndo_xdp is added, this way we avoid
      putting checks into drivers since they all return -EINVAL on
      an unknown command.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee5d032f
  13. 22 6月, 2017 3 次提交
  14. 21 6月, 2017 1 次提交