1. 25 Feb, 2016 1 commit
    • D
      bpf: fix csum setting for bpf_set_tunnel_key · 2da897e5
      Daniel Borkmann committed
      The fix in 35e2d115 ("tunnels: Allow IPv6 UDP checksums to be correctly
      controlled.") changed behavior for bpf_set_tunnel_key() when in use with
      IPv6 and thus uncovered a bug that TUNNEL_CSUM needed to be set but wasn't.
      As a result, the stack dropped ingress vxlan IPv6 packets that had been
      sent via eBPF through collect metadata mode, because the checksum was now
      zero.
      
      After LCO, the IPv4 checksum is enabled by default, so make this analogous
      and only provide a flag, BPF_F_ZERO_CSUM_TX, for the user to turn it off
      in the IPv4 case.
      
      Fixes: 35e2d115 ("tunnels: Allow IPv6 UDP checksums to be correctly controlled.")
      Fixes: c6c33454 ("bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
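The flag handling described in this commit can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not the kernel code: the helper name `tunnel_flags_for` is hypothetical, and the numeric values of the constants are illustrative stand-ins for the kernel definitions.

```c
#include <stdint.h>

/* Illustrative values only; the real definitions live in kernel headers. */
#define TUNNEL_CSUM         0x01u
#define TUNNEL_KEY          0x04u
#define BPF_F_TUNINFO_IPV6  (1ULL << 0)
#define BPF_F_ZERO_CSUM_TX  (1ULL << 1)

/* Hypothetical helper: tunnel flags chosen for a given set of helper flags. */
static uint16_t tunnel_flags_for(uint64_t helper_flags)
{
    /* Checksumming is on by default for both address families. */
    uint16_t tun_flags = TUNNEL_KEY | TUNNEL_CSUM;

    /* Only the IPv4 path lets the caller opt out of the TX checksum. */
    if (!(helper_flags & BPF_F_TUNINFO_IPV6) &&
        (helper_flags & BPF_F_ZERO_CSUM_TX))
        tun_flags &= (uint16_t)~TUNNEL_CSUM;

    return tun_flags;
}
```

In this model, passing `BPF_F_ZERO_CSUM_TX` together with `BPF_F_TUNINFO_IPV6` leaves `TUNNEL_CSUM` set, which matches the commit's statement that the opt-out applies to the IPv4 case only.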
  2. 31 Jan, 2016 1 commit
    • D
      block: revert runtime dax control of the raw block device · 9f4736fe
      Dan Williams committed
      Dynamically enabling DAX requires that the page cache first be flushed
      and invalidated.  This must occur atomically with the change of DAX mode
      otherwise we confuse the fsync/msync tracking and violate data
      durability guarantees.  Eliminate the possibility of DAX-disabled to
      DAX-enabled transitions for now and revisit this for the next cycle.
      
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  3. 21 Jan, 2016 1 commit
    • J
      epoll: add EPOLLEXCLUSIVE flag · df0108c5
      Jason Baron committed
      Currently, epoll file descriptors or epfds (the fd returned from
      epoll_create[1]()) that are added to a shared wakeup source are always
      added in a non-exclusive manner.  This means that when we have multiple
      epfds attached to a shared fd source they are all woken up.  This creates
      thundering herd type behavior.
      
      Introduce a new 'EPOLLEXCLUSIVE' flag that can be passed as part of the
      'event' argument during an epoll_ctl() EPOLL_CTL_ADD operation.  This new
      flag allows for exclusive wakeups when there are multiple epfds attached
      to a shared fd event source.
      
      The implementation walks the list of exclusive waiters, and queues an
      event to each epfd, until it finds the first waiter that has threads
      blocked on it via epoll_wait().  The idea is to search for threads which
      are idle and ready to process the wakeup events.  Thus, we queue an event
      to at least 1 epfd, but may still potentially queue an event to all epfds
      that are attached to the shared fd source.
      
      Performance testing was done by Madars Vitolins using a modified version
      of Enduro/X.  The use of the 'EPOLLEXCLUSIVE' flag reduced the run time of
      this particular workload from 860s down to 24s.
      
      Sample epoll_ctl text:
      
      EPOLLEXCLUSIVE
      
        Sets an exclusive wakeup mode for the epfd file descriptor that is
        being attached to the target file descriptor, fd.  Thus, when an event
        occurs and multiple epfd file descriptors are attached to the same
        target file using EPOLLEXCLUSIVE, one or more epfds will receive an
        event with epoll_wait(2).  The default in this scenario (when
        EPOLLEXCLUSIVE is not set) is for all epfds to receive an event.
        EPOLLEXCLUSIVE may only be specified with the op EPOLL_CTL_ADD.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Tested-by: Madars Vitolins <m@silodev.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
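A minimal userspace sketch of the flag in use, assuming a kernel and libc that support it. The function names `add_exclusive` and `count_notified_epfds` are hypothetical, and the fallback `#define` is only there for older libc headers. The demo attaches two epfds to one eventfd, signals it once, and counts how many epfds report the event.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)   /* fallback for older libc headers */
#endif

/* Attach fd to epfd for readability events in exclusive wakeup mode. */
static int add_exclusive(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/*
 * Hypothetical demo: attach two epfds to one eventfd, signal it once and
 * count how many epfds report the event. Returns -1 on setup failure.
 */
static int count_notified_epfds(void)
{
    int result = -1;
    int src = eventfd(0, EFD_NONBLOCK);
    int ep1 = epoll_create1(0);
    int ep2 = epoll_create1(0);

    if (src >= 0 && ep1 >= 0 && ep2 >= 0 &&
        add_exclusive(ep1, src) == 0 && add_exclusive(ep2, src) == 0) {
        uint64_t one = 1;
        struct epoll_event out;

        if (write(src, &one, sizeof(one)) == (ssize_t)sizeof(one)) {
            /* Timeout 0: poll each epfd without blocking. */
            result = (epoll_wait(ep1, &out, 1, 0) == 1) +
                     (epoll_wait(ep2, &out, 1, 0) == 1);
        }
    }
    if (src >= 0) close(src);
    if (ep1 >= 0) close(ep1);
    if (ep2 >= 0) close(ep2);
    return result;
}
```

Because no thread is blocked in epoll_wait() when the event fires in this demo, the commit's description implies that at least one epfd, and possibly both, will report the event. The benefit of the flag shows up when worker threads are already waiting.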
  4. 14 Jan, 2016 1 commit
  5. 12 Jan, 2016 5 commits
    • M
      lightnvm: introduce factory reset · 8b4970c4
      Matias Bjørling committed
      Now that a device can be managed using the system blocks, a method to
      reset the device is necessary as well. This patch introduces logic to
      reset the device easily to factory state and exposes it through an
      ioctl.
      
      The ioctl takes the following flags:
      
        NVM_FACTORY_ERASE_ONLY_USER
            By default all blocks, except host-reserved blocks are erased upon
            factory reset. Instead of this, only erase host-reserved blocks.
        NVM_FACTORY_RESET_HOST_BLKS
            Mark host-reserved blocks to be erased and set their type to free.
        NVM_FACTORY_RESET_GRWN_BBLKS
            Mark "grown bad blocks" to be erased and set their type to free.
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • M
      lightnvm: introduce ioctl to initialize device · 55696154
      Matias Bjørling committed
      Based on the previous patch, we now introduce an ioctl to initialize the
      device using nvm_init_sysblock and create the necessary system blocks.
      The user may specify the media manager that they wish to instantiate on
      top. Default from user-space will be "gennvm".
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • M
      lightnvm: core on-disk initialization · e3eb3799
      Matias Bjørling committed
      An Open-Channel SSD shall be initialized before use. To initialize, we
      define an on-disk format, that keeps a small set of metadata to bring up
      the media manager on top of the device.
      
      The initial step is introduced to allow a user to format the disks for a
      given media manager. During format, a system block is stored on one to
      three separate luns on the device. Each lun has the system block
      duplicated. During initialization, the system block can be retrieved and
      the appropriate media manager can be initialized.
      
      The on-disk format currently covers (struct nvm_system_block):
      
       - Magic value "NVMS".
       - Monotonic increasing sequence number.
       - The physical block erase count.
       - Version of the system block format.
       - Media manager type.
       - Media manager superblock physical address.
      
      The interface provides three functions to manage the system block:
      
       int nvm_init_sysblock(struct nvm_dev *, struct nvm_sb_info *)
       int nvm_get_sysblock(struct nvm_dev *, struct nvm_sb_info *)
       int nvm_update_sysblock(struct nvm_dev *, struct nvm_sb_info *)
      
      Each implements part of the logic to manage the system block. The
      initialization creates the first system blocks and marks them on the
      device. Get retrieves the latest system block by scanning all pages in
      the associated system blocks. Update writes new metadata and allocates
      a new block if necessary.
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
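The "get" step (scan all pages, keep the latest system block) can be illustrated with a small userspace sketch. Everything here is an assumption for illustration: the struct `sysblock`, its field names and widths, and the function `latest_sysblock` are hypothetical and do not reproduce the real struct nvm_system_block layout or the kernel's scanning code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of one system block page; field names and
 * widths are illustrative, not the real struct nvm_system_block layout. */
struct sysblock {
    char     magic[4];     /* "NVMS" */
    uint32_t seqnr;        /* monotonically increasing sequence number */
    uint32_t erase_cnt;    /* physical block erase count */
    uint16_t version;      /* system block format version */
    char     mmtype[8];    /* media manager type, e.g. "gennvm" */
    uint64_t fs_ppa;       /* media manager superblock physical address */
};

/* Return the valid entry with the highest sequence number, or NULL. */
static const struct sysblock *latest_sysblock(const struct sysblock *pages,
                                              size_t n)
{
    const struct sysblock *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (memcmp(pages[i].magic, "NVMS", 4) != 0)
            continue;   /* unwritten or corrupt page */
        if (best == NULL || pages[i].seqnr > best->seqnr)
            best = &pages[i];
    }
    return best;
}
```

The magic value filters out pages that were never written, and the monotonic sequence number decides which of the remaining copies is the most recent.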
    • D
      bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key · c6c33454
      Daniel Borkmann committed
      After IPv6 support has recently been added to metadata dst and related
      encaps, add support for populating/reading it from an eBPF program.
      
      Commit d3aa45ce ("bpf: add helpers to access tunnel metadata") started
      with initial IPv4-only support back then (due to IPv6 metadata support
      not being available yet).
      
      To stay compatible with older programs, we need to test for the passed
      structure size. Also TOS and TTL support from the ip_tunnel_info key has
      been added. Tested with vxlan devs in collect meta data mode with IPv4,
      IPv6 and in compat mode over different network namespaces.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
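Testing the passed structure size to stay compatible with older callers can be sketched as follows. The two layouts `key_old` and `key_new` and the function `normalize_key` are hypothetical stand-ins, not the real struct bpf_tunnel_key or the kernel helper. The idea shown is only that a known older size is accepted and the fields it lacks are zero-filled.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts for illustration; not the real struct bpf_tunnel_key. */
struct key_old {                 /* what older programs pass: IPv4 only */
    uint32_t tunnel_id;
    uint32_t remote_ipv4;
};

struct key_new {                 /* extended with IPv6, TOS and TTL */
    uint32_t tunnel_id;
    union {
        uint32_t remote_ipv4;
        uint32_t remote_ipv6[4];
    };
    uint8_t tos;
    uint8_t ttl;
};

/* Copy a caller-supplied key of either known size into the current layout. */
static int normalize_key(struct key_new *out, const void *in, size_t size)
{
    if (size != sizeof(struct key_new) && size != sizeof(struct key_old))
        return -EINVAL;            /* unknown layout */

    memset(out, 0, sizeof(*out));  /* fields the old layout lacks stay zero */
    memcpy(out, in, size);
    return 0;
}
```

This works because the old layout is a prefix of the new one, so the rest of the code can operate on a single structure type.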
    • D
      bpf: export helper function flags and reject invalid ones · 781c53bc
      Daniel Borkmann committed
      Export flags used by eBPF helper functions through UAPI, so they can be
      used by programs (instead of them redefining all flags each time or just
      using the hard-coded values). It also gives a better overview of which
      flags are used where, and we can further get rid of the extra macros
      defined in filter.c. Moreover, reject invalid flags.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
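Rejecting invalid flags usually comes down to masking against the set of known bits. The sketch below shows that pattern. The `EX_F_*` names and `example_helper` are hypothetical and are not the flags exported by this commit.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag set for one helper; names and values are illustrative. */
#define EX_F_RECOMPUTE_CSUM  (1ULL << 0)
#define EX_F_INVALIDATE_HASH (1ULL << 1)
#define EX_F_ALL             (EX_F_RECOMPUTE_CSUM | EX_F_INVALIDATE_HASH)

/* Accept any combination of known flags, reject everything else. */
static int example_helper(uint64_t flags)
{
    if (flags & ~EX_F_ALL)
        return -EINVAL;
    return 0;
}
```

Rejecting unknown bits up front keeps them available for future extensions, since no existing program can have been relying on them being silently ignored.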
  6. 11 Jan, 2016 16 commits
  7. 09 Jan, 2016 2 commits
    • D
      block: enable dax for raw block devices · 5a023cdb
      Dan Williams committed
      If an application wants exclusive access to all of the persistent memory
      provided by an NVDIMM namespace it can use this raw-block-dax facility
      to forgo establishing a filesystem.  This capability is targeted
      primarily to hypervisors wanting to provision persistent memory for
      guests.  It can be disabled / enabled dynamically via the new BLKDAXSET
      ioctl.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • T
      fs: clean up the flags definition in uapi/linux/fs.h · 68ce7bfc
      Theodore Ts'o committed
      Add an explanation for the flags used by FS_IOC_[GS]ETFLAGS and remind
      people that changes should be reviewed by linux-fsdevel and linux-api.
      
      Add flags that are used on-disk for ext4, and remove FS_DIRECTIO_FL
      since it was used only by gfs2 and support was removed in 2008 in
      commit c9f6a6bb ("The ability to mark files for direct i/o access
      when opened normally is both unused and pointless, so this patch
      removes support for that feature.")  Now we have _two_ remaining flags
      left.  But since we want to discourage people from assigning new
      flags, that's OK.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  8. 08 Jan, 2016 2 commits
  9. 07 Jan, 2016 1 commit
  10. 06 Jan, 2016 2 commits
    • D
      drivers: md: use ktime_get_real_seconds() · 9ebc6ef1
      Deepa Dinamani committed
      get_seconds() API is not y2038 safe on 32 bit systems and the API
      is deprecated. Replace it with calls to ktime_get_real_seconds()
      API instead. Change mddev structure types to time64_t accordingly.
      
      32 bit signed timestamps will overflow in the year 2038.
      
      Change the user interface mdu_array_info_s structure timestamps:
      ctime and utime values used in ioctls GET_ARRAY_INFO and
      SET_ARRAY_INFO to unsigned int. This will extend the field to last
      until the year 2106.
      The long term plan is to get rid of ctime and utime values in
      this structure as this information can be read from the on-disk
      meta data directly.
      
      Clamp the time64_t timestamps to positive values with a max of U32_MAX
      when returning from GET_ARRAY_INFO ioctl to accommodate above changes
      in the data type of timestamps to unsigned int.
      
      v0.90 on disk meta data uses u32 for maintaining time stamps.
      So this will also last until year 2106.
      Assumption is that the usage of v0.90 will be deprecated by
      year 2106.
      
      Timestamp fields in the on disk meta data for v1.0 version already
      use 64 bit data types. Remove the truncation of the bits while
      writing to or reading from these from the disk.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NeilBrown <neilb@suse.com>
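The clamping described above can be written as a small pure function. This is a userspace sketch with a hypothetical name, `clamp_time_to_u32`, using `int64_t` in place of the kernel's time64_t.

```c
#include <stdint.h>

/* Clamp a signed 64-bit timestamp into the range of an unsigned 32-bit field. */
static uint32_t clamp_time_to_u32(int64_t t)
{
    if (t < 0)
        return 0;               /* negative values clamp to zero */
    if (t > (int64_t)UINT32_MAX)
        return UINT32_MAX;      /* values past the 32-bit range saturate */
    return (uint32_t)t;
}
```

An unsigned 32-bit count of seconds since 1970 covers roughly 136 years, which is where the year 2106 comes from. The signed variant covers about 68 years and overflows in 2038.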
    • X
      include/uapi/linux/sockios.h: mark SIOCRTMSG unused · 2fbf5758
      xypron.glpk@gmx.de committed
      IOCTL SIOCRTMSG does nothing but return EINVAL.
      
      So comment it as unused.
      
      SIOCRTMSG is only used in:
      * net/ipv4/af_inet.c
      * include/uapi/linux/sockios.h
      
      inet_ioctl calls ip_rt_ioctl.
      ip_rt_ioctl only handles SIOCADDRT and SIOCDELRT and returns -EINVAL
      otherwise.
      Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 05 Jan, 2016 1 commit
  12. 04 Jan, 2016 3 commits
    • D
      xfs: introduce per-inode DAX enablement · 58f88ca2
      Dave Chinner committed
      Rather than just being able to turn DAX on and off via a mount
      option, some applications may only want to enable DAX for certain
      performance critical files in a filesystem.
      
      This patch introduces a new inode flag to enable DAX in the v3 inode
      di_flags2 field. It adds support for setting and clearing flags in
      the di_flags2 field via the XFS_IOC_FSSETXATTR ioctl, and sets the
      S_DAX inode flag appropriately when it is seen.
      
      When this flag is set on a directory, it acts as an "inherit flag".
      That is, inodes created in the directory will automatically inherit
      the on-disk inode DAX flag, enabling administrators to set up
      directory hierarchies that automatically use DAX. Setting this flag
      on an empty root directory will make the entire filesystem use DAX
      by default.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
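From userspace, toggling the per-inode flag could look roughly like the sketch below. It assumes kernel headers that provide `struct fsxattr`, `FS_XFLAG_DAX` and the generic `FS_IOC_FSGETXATTR` / `FS_IOC_FSSETXATTR` names in linux/fs.h. The commit message itself only names XFS_IOC_FSSETXATTR, and the function `set_inode_dax` is hypothetical. Whether the flag takes effect depends on the filesystem and kernel, which this sketch does not check.

```c
#include <linux/fs.h>
#include <sys/ioctl.h>

/* Set or clear the per-inode DAX flag on an open file or directory.
 * Returns 0 on success, -1 with errno set on failure. */
static int set_inode_dax(int fd, int enable)
{
    struct fsxattr fsx;

    /* Read the current extended attributes so other flags are preserved. */
    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;

    if (enable)
        fsx.fsx_xflags |= FS_XFLAG_DAX;
    else
        fsx.fsx_xflags &= ~FS_XFLAG_DAX;

    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
```

Calling this on a directory file descriptor corresponds to the "inherit flag" case described above, where newly created inodes pick up the flag.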
    • D
      fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion · 334e580a
      Dave Chinner committed
      Hoist the ioctl definitions for the XFS_IOC_FS[SG]ETXATTR API from
      fs/xfs/libxfs/xfs_fs.h to include/uapi/linux/fs.h so that the ioctls
      can be used by all filesystems, not just XFS. This enables
      (initially) ext4 to use the ioctl to set project IDs on inodes.
      
      Based-on-patch-from: Li Xi <lixi@ddn.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
    • P
      netfilter: nft_limit: allow to invert matching criteria · c7862a5f
      Pablo Neira Ayuso committed
      This patch allows you to invert the ratelimit matching criteria, so you
      can match packets over the ratelimit. This is required to support what
      hashlimit does.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  13. 01 Jan, 2016 1 commit
  14. 31 Dec, 2015 1 commit
  15. 22 Dec, 2015 2 commits
    • A
      vfio: Include No-IOMMU mode · 03a76b60
      Alex Williamson committed
      There is really no way to safely give a user full access to a DMA
      capable device without an IOMMU to protect the host system.  There is
      also no way to provide DMA translation, for use cases such as device
      assignment to virtual machines.  However, there are still those users
      that want userspace drivers even under those conditions.  The UIO
      driver exists for this use case, but does not provide the degree of
      device access and programming that VFIO has.  In an effort to avoid
      code duplication, this introduces a No-IOMMU mode for VFIO.
      
      This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
      the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
      should make it very clear that this mode is not safe.  Additionally,
      CAP_SYS_RAWIO privileges are necessary to work with groups and
      containers using this mode.  Groups making use of this support are
      named /dev/vfio/noiommu-$GROUP and can only make use of the special
      VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
      binding a device without a native IOMMU group to a VFIO bus driver
      will taint the kernel and should therefore not be considered
      supported.  This patch includes no-iommu support for the vfio-pci bus
      driver only.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
    • A
      vfio: Add explicit alignments in vfio_iommu_spapr_tce_create · 77d6bd47
      Alexey Kardashevskiy committed
      The vfio_iommu_spapr_tce_create struct has 4x32bit and 2x64bit fields
      which should have resulted in sizeof(vfio_iommu_spapr_tce_create) equal
      to 32 bytes. However, due to gcc's default alignment, the actual
      size of this struct is 40 bytes.
      
      This fills gaps with __resv1/2 fields.
      
      This should not cause any change in behavior.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
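The padding effect can be reproduced with two plain C structs. This is a sketch: the struct names and the member names and ordering are assumptions for illustration (the commit only names the `__resv1/2` fields, written here without leading underscores). The implicit padding noted in the comments applies to ABIs that align 64-bit integers to 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* 4 x 32-bit and 2 x 64-bit members, with gaps left to the compiler. */
struct create_implicit {
    uint32_t argsz;
    uint32_t flags;
    uint32_t page_shift;
    /* 4 bytes of compiler-inserted padding on ABIs that align u64 to 8 */
    uint64_t window_size;
    uint32_t levels;
    /* 4 more bytes of padding before the next u64 */
    uint64_t start_addr;
};

/* Same layout with the gaps spelled out as reserved fields. */
struct create_explicit {
    uint32_t argsz;
    uint32_t flags;
    uint32_t page_shift;
    uint32_t resv1;
    uint64_t window_size;
    uint32_t levels;
    uint32_t resv2;
    uint64_t start_addr;
};

_Static_assert(sizeof(struct create_explicit) == 40,
               "explicit layout is 40 bytes");
_Static_assert(offsetof(struct create_explicit, window_size) == 16,
               "u64 member stays 8-byte aligned");
```

With the reserved fields in place, the size no longer depends on the compiler's alignment rules, which is why the commit expects no change in behavior on the affected platform.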