1. 26 Nov, 2015 1 commit
    • kvm/x86: Hyper-V synthetic interrupt controller · 5c919412
      Andrey Smetanin committed
      SynIC (synthetic interrupt controller) is a lapic extension,
      which is controlled via MSRs and maintains for each vCPU
       - 16 synthetic interrupt "lines" (SINT's); each can be configured to
         trigger a specific interrupt vector optionally with auto-EOI
         semantics
       - a message page in the guest memory with 16 256-byte per-SINT message
         slots
       - an event flag page in the guest memory with 16 2048-bit per-SINT
         event flag areas
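
The per-vCPU layout listed above can be sketched with a few constants. This is only an illustration of the sizes stated in the commit message; the macro and function names below are hypothetical and are not taken from the KVM sources.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes quoted from the description above; all names are hypothetical. */
#define SYNIC_SINT_COUNT       16    /* synthetic interrupt lines per vCPU */
#define SYNIC_MSG_SLOT_BYTES   256   /* one message slot per SINT */
#define SYNIC_EVENT_FLAG_BITS  2048  /* one event flag area per SINT */
#define SYNIC_EVENT_FLAG_BYTES (SYNIC_EVENT_FLAG_BITS / 8)

/* Byte offset of a SINT's message slot within the message page. */
size_t synic_msg_slot_offset(unsigned int sint)
{
    assert(sint < SYNIC_SINT_COUNT);
    return (size_t)sint * SYNIC_MSG_SLOT_BYTES;
}

/* Byte offset of a SINT's event flag area within the event flag page. */
size_t synic_event_flags_offset(unsigned int sint)
{
    assert(sint < SYNIC_SINT_COUNT);
    return (size_t)sint * SYNIC_EVENT_FLAG_BYTES;
}

/* Total bytes covered by all 16 message slots. */
size_t synic_msg_page_bytes(void)
{
    return (size_t)SYNIC_SINT_COUNT * SYNIC_MSG_SLOT_BYTES;
}

/* Total bytes covered by all 16 event flag areas. */
size_t synic_event_page_bytes(void)
{
    return (size_t)SYNIC_SINT_COUNT * SYNIC_EVENT_FLAG_BYTES;
}
```

Both totals come to 4096 bytes, so each of the two structures fills exactly one 4 KiB guest page.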
      
      The host triggers a SINT whenever it delivers a new message to the
      corresponding slot or flips an event flag bit in the corresponding area.
      The guest informs the host that it can try delivering a message by
      explicitly asserting EOI in lapic or writing to End-Of-Message (EOM)
      MSR.
      
      The userspace (qemu) triggers interrupts and receives EOM notifications
      via irqfd with resampler; for that, a GSI is allocated for each
      configured SINT, and irq_routing api is extended to support GSI-SINT
      mapping.
      
      Changes v4:
      * added activation of SynIC by vcpu KVM_ENABLE_CAP
      * added per SynIC active flag
      * added deactivation of APICv upon SynIC activation
      
      Changes v3:
      * added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into
      docs
      
      Changes v2:
      * do not use posted interrupts for Hyper-V SynIC AutoEOI vectors
      * add Hyper-V SynIC vectors into EOI exit bitmap
      * Hyper-V SynIC SINT msr write logic simplified
      Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 05 Nov, 2015 1 commit
    • vfio: Include No-IOMMU mode · 033291ec
      Alex Williamson committed
      There is really no way to safely give a user full access to a DMA
      capable device without an IOMMU to protect the host system.  There is
      also no way to provide DMA translation, for use cases such as device
      assignment to virtual machines.  However, there are still those users
      that want userspace drivers even under those conditions.  The UIO
      driver exists for this use case, but does not provide the degree of
      device access and programming that VFIO has.  In an effort to avoid
      code duplication, this introduces a No-IOMMU mode for VFIO.
      
      This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
      the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
      should make it very clear that this mode is not safe.  Additionally,
      CAP_SYS_RAWIO privileges are necessary to work with groups and
      containers using this mode.  Groups making use of this support are
      named /dev/vfio/noiommu-$GROUP and can only make use of the special
      VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
      binding a device without a native IOMMU group to a VFIO bus driver,
      will taint the kernel and should therefore not be considered
      supported.  This patch includes no-iommu support for the vfio-pci bus
      driver only.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
  3. 03 Nov, 2015 1 commit
    • bpf: add support for persistent maps/progs · b2197755
      Daniel Borkmann committed
      This work adds support for "persistent" eBPF maps/programs. Here,
      "persistent" means that maps/programs have a facility that lets them
      survive process termination. This is desired by various eBPF subsystem
      users.
      
      Just to name one example: tc classifier/action. Whenever tc parses
      the ELF object, extracts and loads maps/progs into the kernel, these
      file descriptors will be out of reach after the tc instance exits.
      So a subsequent tc invocation won't be able to access/relocate on this
      resource, and therefore maps cannot easily be shared, f.e. between the
      ingress and egress networking data path.
      
      The current workaround is that Unix domain sockets (UDS) need to be
      instrumented in order to pass the created eBPF map/program file
      descriptors to a third party management daemon through UDS' socket
      passing facility. This makes it a bit complicated to deploy shared
      eBPF maps or programs (programs f.e. for tail calls) among various
      processes.
      
      We've been brainstorming on how we could tackle this issue and various
      approaches have been tried out so far, which are described further in
      the reference below.
      
      The architecture we eventually ended up with is a minimal file system
      that can hold map/prog objects. The file system is a per mount namespace
      singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
      mounts within a given namespace will point to the same instance. The
      file system allows for creating a user-defined directory structure.
      The objects for maps/progs are created/fetched through bpf(2) with
      two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor
      along with a pathname is being passed to bpf(2) that in turn creates
      (we call it eBPF object pinning) the file system nodes. Only the pathname
      is being passed to bpf(2) for getting a new BPF file descriptor to an
      existing node. The user can use that to access maps and progs later on,
      through bpf(2). Removal of file system nodes is being managed through
      normal VFS functions such as unlink(2), etc. The file system code is
      kept to a bare minimum and can be further extended later on.
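
The pin/get flow described above can be sketched from user space as follows. This is a minimal illustration rather than the patch's own test code: it assumes kernel headers that already define BPF_OBJ_PIN and BPF_OBJ_GET, and the wrapper names `pin_bpf_object` and `get_pinned_bpf_object` are hypothetical.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw bpf(2) system call. */
static long sys_bpf(int cmd, union bpf_attr *attr)
{
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/*
 * Pin an existing map/prog file descriptor at pathname, which must live
 * inside a mounted bpf file system. Returns 0, or -1 with errno set.
 */
int pin_bpf_object(int fd, const char *pathname)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)pathname;
    attr.bpf_fd = (uint32_t)fd;
    return (int)sys_bpf(BPF_OBJ_PIN, &attr);
}

/*
 * Obtain a new file descriptor for a previously pinned object. Only the
 * pathname is passed. Returns the fd, or -1 with errno set.
 */
int get_pinned_bpf_object(const char *pathname)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)pathname;
    return (int)sys_bpf(BPF_OBJ_GET, &attr);
}
```

A loader would call `pin_bpf_object(map_fd, "/sys/fs/bpf/my_map")` once (the path here is only an example), and a later process would call `get_pinned_bpf_object()` with the same path to regain access; removing the node is a plain unlink(2).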
      
      The next step I'm working on is to add dump eBPF map/prog commands
      to bpf(2), so that a specification from a given file descriptor can
      be retrieved. This can be used by things like CRIU but also applications
      can inspect the meta data after calling BPF_OBJ_GET.
      
      Big thanks also to Alexei and Hannes who significantly contributed
      in the design discussion that eventually let us end up with this
      architecture here.
      
      Reference: https://lkml.org/lkml/2015/10/15/925
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 01 Nov, 2015 2 commits
  5. 30 Oct, 2015 3 commits
  6. 29 Oct, 2015 1 commit
    • lightnvm: Support for Open-Channel SSDs · cd9e9808
      Matias Bjørling committed
      Open-channel SSDs are devices that share responsibilities with the host
      in order to implement and maintain features that typical SSDs keep
      strictly in firmware. These include (i) the Flash Translation Layer
      (FTL), (ii) bad block management, and (iii) hardware units such as the
      flash controller, the interface controller, and large amounts of flash
      chips. In this way, Open-channel SSDs expose direct access to their
      physical flash storage, while keeping a subset of the internal features
      of SSDs.

      LightNVM is a specification that gives support to Open-channel SSDs.
      LightNVM allows the host to manage data placement, garbage collection,
      and parallelism. Device specific responsibilities such as bad block
      management, FTL extensions to support atomic IOs, or metadata
      persistence are still handled by the device.
      
      The implementation of LightNVM consists of two parts: core and
      (multiple) targets. The core implements functionality shared across
      targets, such as initialization, teardown, and statistics. The targets
      implement the interface that exposes physical flash to user-space
      applications. Examples of such targets include key-value store,
      object-store, as well as traditional block devices, which can be
      application-specific.
      
      Contributions in this patch from:
      
        Javier Gonzalez <jg@lightnvm.io>
        Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
        Jesper Madsen <jmad@itu.dk>
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 28 Oct, 2015 2 commits
    • seccomp, ptrace: add support for dumping seccomp filters · f8e529ed
      Tycho Andersen committed
      This patch adds support for dumping a process' (classic BPF) seccomp
      filters via ptrace.
      
      PTRACE_SECCOMP_GET_FILTER allows the tracer to dump the user's classic BPF
      seccomp filters. addr should be an integer which represents the ith seccomp
      filter (0 is the most recently installed filter). data should be a struct
      sock_filter * with enough room for the ith filter, or NULL, in which case
      the filter is not saved. The return value for this command is the number of
      BPF instructions the program represents, or negative in the case of errors.
      Command specific errors are ENOENT, which indicates that there is no ith
      filter in this seccomp tree, and EMEDIUMTYPE, which indicates that the ith
      filter was not installed as a classic BPF filter.
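
A tracer could use the command in two steps, first asking for the instruction count with a NULL buffer and then fetching the program. The sketch below assumes the tracee is already attached and stopped; the helper name `dump_seccomp_filter` is hypothetical, and the fallback constant is only used if the C library headers do not define the request yet.

```c
#include <linux/filter.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>

#ifndef PTRACE_SECCOMP_GET_FILTER
#define PTRACE_SECCOMP_GET_FILTER 0x420c
#endif

/*
 * Dump the index-th classic BPF seccomp filter of an attached, stopped
 * tracee (index 0 is the most recently installed filter). On success,
 * returns the instruction count and stores a malloc()ed array in *out;
 * on failure, returns -1 with errno set and leaves *out untouched.
 */
long dump_seccomp_filter(pid_t pid, unsigned long index,
                         struct sock_filter **out)
{
    struct sock_filter *insns;
    long count;

    /* A NULL data pointer only reports the number of instructions. */
    count = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, (void *)index, NULL);
    if (count <= 0)
        return -1;

    insns = calloc((size_t)count, sizeof(*insns));
    if (!insns)
        return -1;

    /* The second call copies the instructions into the buffer. */
    if (ptrace(PTRACE_SECCOMP_GET_FILTER, pid, (void *)index, insns) < 0) {
        free(insns);
        return -1;
    }

    *out = insns;
    return count;
}
```

Calling the helper with increasing index values until it fails with ENOENT walks all filters of the task, most recent first.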
      
      A caveat with this approach is that there is no way to get explicitly at
      the hierarchy of seccomp filters, and users need to memcmp() filters to
      decide which are inherited. This means that a task which installs two of
      the same filter can potentially confuse users of this interface.
      
      v2: * make save_orig const
          * check that the orig_prog exists (not necessary right now, but when
             seccomp grows eBPF support it will be)
          * s/n/filter_off and make it an unsigned long to match ptrace
          * count "down" the tree instead of "up" when passing a filter offset
      
      v3: * don't take the current task's lock for inspecting its seccomp mode
          * use a 0x42** constant for the ptrace command value
      
      v4: * don't copy to userspace while holding spinlocks
      
      v5: * add another condition to WARN_ON
      
      v6: * rebase on net-next
      Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      CC: Will Drewry <wad@chromium.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      CC: Andy Lutomirski <luto@amacapital.net>
      CC: Pavel Emelyanov <xemul@parallels.com>
      CC: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      CC: Alexei Starovoitov <ast@kernel.org>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Input: add userio module · 5523662e
      Stephen Chandler Paul committed
      Debugging input devices, specifically laptop touchpads, can be tricky
      without having the physical device handy. Here we try to remedy that
      with userio. This module allows an application to connect to a character
      device provided by the kernel, and emulate any serio device. In
      combination with userspace programs that can record PS/2 devices and
      replay them through the /dev/userio device, this allows developers to
      debug driver issues on the PS/2 level with devices simply by requesting
      a recording from the user experiencing the issue without having to have
      the physical hardware in front of them.
      Signed-off-by: Stephen Chandler Paul <cpaul@redhat.com>
      Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
  8. 27 Oct, 2015 5 commits
    • NFC: netlink: Add missing NFC_ATTR comments · be73c2cb
      Christophe Ricard committed
      NFC_CMD_ACTIVATE_TARGET and NFC_ATTR_SE_PARAMS comments are missing.
      Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
      Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
    • btrfs: extend balance filter usage to take minimum and maximum · bc309467
      David Sterba committed
      Similar to the 'limit' filter, we can enhance the 'usage' filter to
      accept a range. The change is backward compatible; the range is applied
      only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag.

      We don't have a use case yet; the current syntax has been sufficient. The
      enhancement should provide parity with other range-like filters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: add balance filter for stripes · dee32d0a
      Gabríel Arthúr Pétursson committed
      Balance block groups which have the given number of stripes, defined by
      a range min..max. This is useful to selectively rebalance only chunks
      that do not span enough devices, and applies to RAID0/10/5/6.
      Signed-off-by: Gabríel Arthúr Pétursson <gabriel@system.is>
      [ renamed bargs members, added to the UAPI, wrote the changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: extend balance filter limit to take minimum and maximum · 12907fc7
      David Sterba committed
      The 'limit' filter is underdesigned; it should have been a range for
      [min,max], with some relaxed semantics when one of the bounds is
      missing. Besides that, using a full u64 for a single value is a waste of
      bytes.

      Let's fix both by extending the use of the u64 bytes for the [min,max]
      range. This can be done in a backward compatible way; the range will be
      interpreted only if the appropriate flag is set
      (BTRFS_BALANCE_ARGS_LIMIT_RANGE).
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Input: evdev - add event-mask API · 06a16293
      David Herrmann committed
      Hardware manufacturers group keys in the weirdest way possible. This may
      cause a power-key to be grouped together with normal keyboard keys and
      thus be reported on the same kernel interface.
      
      However, user-space is often only interested in specific sets of events.
      For instance, daemons dealing with system-reboot (like systemd-logind)
      listen for KEY_POWER, but are not interested in any main keyboard keys.
      Usually, power keys are reported via separate interfaces; however,
      some i8042 boards report them in the AT matrix. To avoid waking up those
      system daemons on each key-press, we had two ideas:
       - split off KEY_POWER into a separate interface unconditionally
       - allow filtering a specific set of events on evdev FDs
      
      Splitting off KEY_POWER is a rather weird way to deal with this and may
      break backwards-compatibility. It is also specific to KEY_POWER and might
      be required for other stuff, too. Moreover, we might end up with a huge
      set of input-devices just to have them properly split.
      
      Hence, this patchset implements the second idea: An event-mask to specify
      which events you're interested in. Two ioctls allow setting this mask for
      each event-type. If not set, all events are reported. The type==0 entry is
      used the same way as in EVIOCGBIT to set the actual EV_* mask of filtered
      events. This way, you have a two-level filter.
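
As an illustration of the per-type mask, the sketch below restricts the EV_KEY events delivered on one evdev file descriptor to KEY_POWER. It assumes kernel headers that provide EVIOCSMASK and struct input_mask; the helper name `evdev_only_key_power` and the two bitmap macros are hypothetical.

```c
#include <linux/input.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helpers for sizing a bitmap made of unsigned longs. */
#define NBITS_LONG      (8 * sizeof(unsigned long))
#define LONGS_FOR(bits) (((bits) + NBITS_LONG - 1) / NBITS_LONG)

/*
 * Ask the kernel to deliver only KEY_POWER among the EV_KEY events on
 * this evdev fd. Returns 0 on success, or -1 with errno set.
 */
int evdev_only_key_power(int fd)
{
    unsigned long codes[LONGS_FOR(KEY_CNT)];
    struct input_mask mask;

    /* Start with every key code filtered out, then enable KEY_POWER. */
    memset(codes, 0, sizeof(codes));
    codes[KEY_POWER / NBITS_LONG] |= 1UL << (KEY_POWER % NBITS_LONG);

    memset(&mask, 0, sizeof(mask));
    mask.type = EV_KEY;                  /* mask applies to key codes */
    mask.codes_size = sizeof(codes);     /* size of the bitmap in bytes */
    mask.codes_ptr = (uint64_t)(uintptr_t)codes;

    return ioctl(fd, EVIOCSMASK, &mask);
}
```

A daemon such as the one in the example above would open the event device, call this helper once, and then read events as usual; other readers of the same device are unaffected because the mask is per file descriptor.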
      
      We are heavily forward-compatible to new event-types and event-codes. So
      new user-space will be able to run on an old kernel which doesn't know the
      given event-codes or event-types.
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
  9. 26 Oct, 2015 1 commit
    • mmc: block: Add new ioctl to send multi commands · a5f5774c
      Jon Hunter committed
      Certain eMMC devices allow vendor specific device information to be read
      via a sequence of vendor commands. These vendor commands must be issued
      in sequence and in an atomic fashion. One way to support this would be to
      add an ioctl function for sending a sequence of commands to the device
      atomically as proposed here. These multi commands are a simple array of
      the existing mmc_ioc_cmd structure.
      
      The structure passed via the ioctl uses a __u64 type to specify the number
      of commands (so that the structure is aligned on a 64-bit boundary) and a
      zero length array as a header for the list of commands to be issued. The
      maximum number of commands that can be sent is determined by
      MMC_IOC_MAX_CMDS (which defaults to 255 and should be more than
      sufficient).
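
A caller has to allocate the header and the command array as one buffer. The sketch below shows one way to do that, assuming kernel headers that already ship struct mmc_ioc_multi_cmd and MMC_IOC_MAX_CMDS in linux/mmc/ioctl.h; the helper name `mmc_multi_cmd_alloc` is hypothetical, and rejecting a count of zero is a choice made for this example.

```c
#include <linux/mmc/ioctl.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Allocate a zeroed multi-command request with room for n commands.
 * Returns NULL if n is 0, exceeds MMC_IOC_MAX_CMDS, or memory runs out.
 * The caller fills in multi->cmds[0..n-1] and frees the buffer.
 */
struct mmc_ioc_multi_cmd *mmc_multi_cmd_alloc(size_t n)
{
    struct mmc_ioc_multi_cmd *multi;

    if (n == 0 || n > MMC_IOC_MAX_CMDS)
        return NULL;

    /* Header plus n trailing mmc_ioc_cmd entries in one allocation. */
    multi = calloc(1, sizeof(*multi) + n * sizeof(struct mmc_ioc_cmd));
    if (multi)
        multi->num_of_cmds = n;
    return multi;
}
```

After filling in each entry, the request would be submitted with `ioctl(fd, MMC_IOC_MULTI_CMD, multi)` on an open mmc block device.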
      
      This is based upon work by Seshagiri Holi <sholi@nvidia.com>.
      Signed-off-by: Seshagiri Holi <sholi@nvidia.com>
      Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
  10. 24 Oct, 2015 5 commits
    • raid5: add basic stripe log · f6bed0ef
      Shaohua Li committed
      This introduces a simple log for raid5. Data/parity written to the raid
      array is first written to the log, then written to the raid array disks.
      If a crash happens, we can recover data from the log. This can speed up
      raid resync and fix the write hole issue.
      
      The log structure is pretty simple. Data/meta data is stored in block
      unit, which is 4k generally. It has only one type of meta data block.
      The meta data block can track 3 types of data, stripe data, stripe
      parity and flush block. MD superblock will point to the last valid
      meta data block. Each meta data block has checksum/seq number, so
      recovery can scan the log correctly. We store a checksum of stripe
      data/parity to the metadata block, so meta data and stripe data/parity
      can be written to the log disk together. Otherwise, the meta data write
      must wait until the stripe data/parity write is finished.
      
      For stripe data, meta data block will record stripe data sector and
      size. Currently the size is always 4k. This meta data record can be made
      simpler if we just fix write hole (eg, we can record data of a stripe's
      different disks together), but this format can be extended to support
      caching in the future, which must record data address/size.
      
      For stripe parity, meta data block will record stripe sector. Its
      size should be 4k (for raid5) or 8k (for raid6). We always store p
      parity first. This format should work for caching too.
      
      A flush block indicates a stripe is in raid array disks. Fixing write
      hole doesn't need this type of meta data; it's for the caching extension.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md: override md superblock recovery_offset for journal device · 3069aa8d
      Shaohua Li committed
      A journal device stores data in a log structure. We need to record the log
      start. Here we override md superblock recovery_offset for this purpose.
      This field of a journal device is meaningless otherwise.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • MD: add a new disk role to present write journal device · bac624f3
      Song Liu committed
      The next patches will use a disk for raid5/6 journaling. We need a new disk
      role to represent the journal device and add MD_FEATURE_JOURNAL to
      feature_map for backward compatibility.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • MD: replace special disk roles with macros · c4d4c91b
      Song Liu committed
      Add the following two macros for special roles: spare and faulty
      
      MD_DISK_ROLE_SPARE	0xffff
      MD_DISK_ROLE_FAULTY	0xfffe
      
      Add MD_DISK_ROLE_MAX	0xff00 as the maximal possible regular role,
      and minimal value of special role.
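
How such macros might be used to classify a role value can be sketched as below. The three values are quoted from the message above; the enum, the function name, and the decision to treat 0xff00 itself as a special value are assumptions made for this illustration, not the md driver's code.

```c
#include <stdint.h>

/* Values quoted from the commit message above. */
#define MD_DISK_ROLE_SPARE   0xffff
#define MD_DISK_ROLE_FAULTY  0xfffe
#define MD_DISK_ROLE_MAX     0xff00

/* Hypothetical classification, used only for this illustration. */
enum disk_role_kind {
    ROLE_REGULAR,        /* an ordinary raid slot number */
    ROLE_SPARE,
    ROLE_FAULTY,
    ROLE_OTHER_SPECIAL   /* in the special range, but not named here */
};

enum disk_role_kind classify_disk_role(uint16_t role)
{
    switch (role) {
    case MD_DISK_ROLE_SPARE:
        return ROLE_SPARE;
    case MD_DISK_ROLE_FAULTY:
        return ROLE_FAULTY;
    default:
        /* Assumption: roles below MD_DISK_ROLE_MAX are regular slots. */
        return role < MD_DISK_ROLE_MAX ? ROLE_REGULAR : ROLE_OTHER_SPECIAL;
    }
}
```

Replacing the bare constants with names keeps comparisons such as the one in the default branch readable and leaves room for further special roles in the range above the boundary.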
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • i2c-dev: Fix typo in ioctl name reference · c57d3e7a
      Jean Delvare committed
      The ioctl is named I2C_RDWR for "I2C read/write". But references to it
      were misspelled "rdrw". Fix them.
      Signed-off-by: Jean Delvare <jdelvare@suse.de>
      Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
  11. 23 Oct, 2015 1 commit
  12. 22 Oct, 2015 4 commits
  13. 21 Oct, 2015 4 commits
  14. 20 Oct, 2015 2 commits
    • perf: Add PERF_SAMPLE_BRANCH_CALL · c229bf9d
      Stephane Eranian committed
      Add a new branch sample type to cover only call branches (function calls).
      The current ANY_CALL included direct, indirect calls and far jumps.
      
      We want to be able to differentiate indirect from direct calls. Therefore
      we introduce PERF_SAMPLE_BRANCH_CALL. The implementation is up to each
      architecture.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: khandual@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/1444720151-10275-2-git-send-email-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Fix time_shift in perf_event_mmap_page · b9511cd7
      Adrian Hunter committed
      Commit:
      
        b20112ed ("perf/x86: Improve accuracy of perf/sched clock")
      
      allowed the time_shift value in perf_event_mmap_page to be as much
      as 32.  Unfortunately the documented algorithms for using time_shift
      have it shifting an integer, whereas to work correctly with the value
      32, the type must be u64.
      
      In the case of perf tools, Intel PT decodes correctly but the timestamps
      that are output (for example by perf script) have lost 32-bits of
      granularity so they look like they are not changing at all.
      
      Fix by limiting the shift to 31 and adjusting the multiplier accordingly.
      
      Also update the documentation of perf_event_mmap_page so that new code
      based on it will be more future-proof.
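
The type issue can be made concrete with a cycles-to-nanoseconds conversion written with 64-bit arithmetic throughout. This is a sketch in the style of the documented algorithm, not a copy of the kernel documentation; it assumes the time_mult and time_offset fields that accompany time_shift in perf_event_mmap_page, and the function name `perf_cyc_to_ns` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a cycle count to nanoseconds using the mmap page parameters.
 * The constant 1 is widened to 64 bits before shifting, so the mask is
 * computed correctly for any shift below 64, including 32.
 */
uint64_t perf_cyc_to_ns(uint64_t cyc, uint16_t time_shift,
                        uint32_t time_mult, uint64_t time_offset)
{
    uint64_t quot, rem;

    assert(time_shift < 64);

    /* Split cyc so that the multiplication is less likely to overflow. */
    quot = cyc >> time_shift;
    rem = cyc & (((uint64_t)1 << time_shift) - 1);

    return time_offset + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}
```

Writing the mask as `(1 << time_shift) - 1` with a plain int would be undefined for a shift of 32, which matches the symptom described above where the output timestamps appear not to change.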
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: b20112ed ("perf/x86: Improve accuracy of perf/sched clock")
      Link: http://lkml.kernel.org/r/1445001845-13688-2-git-send-email-adrian.hunter@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. 19 Oct, 2015 1 commit
  16. 18 Oct, 2015 1 commit
  17. 17 Oct, 2015 3 commits
  18. 16 Oct, 2015 2 commits