1. 20 Oct, 2017 1 commit
    • M
      membarrier: Provide register expedited private command · a961e409
      Mathieu Desnoyers committed
      This introduces a "register private expedited" membarrier command which
      allows eventual removal of important memory barrier constraints on the
      scheduler fast-paths. It changes how the "private expedited" membarrier
      command (new to 4.14) is used from user-space.
      
      This new command allows processes to register their intent to use the
      private expedited command.  This affects how the expedited private
      command introduced in 4.14-rc is meant to be used, and should be merged
      before 4.14 final.
      
      Processes are now required to register before using
      MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
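
The register-then-use sequence described above could look roughly like the following from user space. This is a minimal sketch, not code from the patch: the local `CMD_*` names and the helper functions are hypothetical, and the numeric values are assumed to mirror the 4.14 uapi enum in `<linux/membarrier.h>` (they are redefined locally so the sketch builds against older headers).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed to mirror the 4.14 uapi enum values (hypothetical local names). */
enum {
    CMD_QUERY                      = 0,
    CMD_PRIVATE_EXPEDITED          = 1 << 3,
    CMD_REGISTER_PRIVATE_EXPEDITED = 1 << 4,
};

/* Thin wrapper around the raw system call. */
static int membarrier(int cmd, int flags)
{
    return (int)syscall(__NR_membarrier, cmd, flags);
}

/*
 * Register the process's intent to use the private expedited command.
 * Ideally called early, before additional threads are spawned.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int register_private_expedited(void)
{
    int supported = membarrier(CMD_QUERY, 0);

    if (supported < 0)
        return -1;                  /* e.g. ENOSYS on kernels without membarrier */
    if (!(supported & CMD_REGISTER_PRIVATE_EXPEDITED)) {
        errno = EINVAL;             /* kernel predates the register command */
        return -1;
    }
    return membarrier(CMD_REGISTER_PRIVATE_EXPEDITED, 0);
}

/* Issue the barrier; fails with EPERM if the process never registered. */
static int private_expedited_barrier(void)
{
    return membarrier(CMD_PRIVATE_EXPEDITED, 0);
}
```

A caller would invoke `register_private_expedited()` once at startup and `private_expedited_barrier()` wherever the expedited barrier is needed afterwards.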
      
      This fixes a problem that arose when designing requested extensions to
      sys_membarrier() to allow JITs to efficiently flush old code from
      instruction caches.  Several potential algorithms are much less painful
      if the user registers intent to use this functionality early on, for
      example, before the process spawns the second thread.  Registering at
      this time removes the need to interrupt each and every thread in that
      process at the first expedited sys_membarrier() system call.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 09 Oct, 2017 1 commit
    • S
      netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1' · 98589a09
      Shmulik Ladkani committed
      Commit 2c16d603 ("netfilter: xt_bpf: support ebpf") introduced
      support for attaching an eBPF object by an fd, with the
      'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
      IPT_SO_SET_REPLACE call.
      
      However this breaks subsequent iptables calls:
      
       # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
       # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
       iptables: Invalid argument. Run `dmesg' for more information.
      
      That's because iptables works by loading existing rules using
      IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
      the replacement set.
      
      However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
      (from the initial "iptables -m bpf" invocation) - so when the 2nd
      invocation occurs, userspace passes a bogus fd number, which causes
      'bpf_mt_check_v1' to fail.
      
      One suggested solution [1] was to hack iptables userspace, to perform an
      "entries fixup" immediately after IPT_SO_GET_ENTRIES, by opening a new,
      process-local fd for every 'xt_bpf_info_v1' entry seen.
      
      However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested
      deprecating the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.
      
      This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
      '.fd' and instead perform an in-kernel lookup for the bpf object given
      the provided '.path'.
      
      It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
      XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
      expected to provide the path of the pinned object.
      
      Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
      
      References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
                  [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
      Reported-by: Rafael Buchbinder <rafi@rbk.ms>
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  3. 04 Oct, 2017 1 commit
  4. 25 Sep, 2017 1 commit
    • M
      dm ioctl: fix alignment of event number in the device list · 62e08243
      Mikulas Patocka committed
      The size of struct dm_name_list is different on 32-bit and 64-bit
      kernels (so "(nl + 1)" differs between 32-bit and 64-bit kernels).
      
      This mismatch caused a harmless difference in padding between 32-bit
      and 64-bit kernels. Commit 23d70c5e ("dm ioctl: report event number in
      DM_LIST_DEVICES") added reporting event number in the output of
      DM_LIST_DEVICES_CMD. This difference in padding makes it impossible for
      userspace to determine the location of the event number (the location
      would be different when running on 32-bit and 64-bit kernels).
      
      Fix the padding by using offsetof(struct dm_name_list, name) instead of
      sizeof(struct dm_name_list) to determine the location of entries.
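
The difference between the two expressions can be seen with a small mirror of the structure. This is only a sketch: `struct name_list_sketch` is a hypothetical stand-in for `struct dm_name_list`, and placing the event number at the next 8-byte boundary after the NUL-terminated name is an assumption about the layout, not something stated in the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of struct dm_name_list (field layout assumed). */
struct name_list_sketch {
    uint64_t dev;
    uint32_t next;   /* offset to the next record */
    char     name[]; /* NUL-terminated device name follows */
};

/* Round a byte count up to the next multiple of 8. */
static size_t align8(size_t v)
{
    return (v + 7) & ~(size_t)7;
}

/*
 * Offset of the event number from the start of a record.
 * offsetof(..., name) is 12 on both 32-bit and 64-bit ABIs, whereas
 * sizeof(struct ...) may include trailing padding on ABIs that align
 * uint64_t to 8 bytes, which is what made the old location ambiguous.
 */
static size_t event_nr_offset(size_t name_len)
{
    return align8(offsetof(struct name_list_sketch, name) + name_len + 1);
}
```

Userspace that parses the device list would compute the same offset after checking that the reported ioctl version is high enough for the event number to be present.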
      
      Also, the ioctl version number is incremented to 37 so that userspace
      can use the version number to determine that the event number is present
      and correctly located.
      
      In addition, a global event is now raised when a DM device is created,
      removed, renamed or when a table is swapped, so that the user can monitor
      for device changes.
      Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
      Fixes: 23d70c5e ("dm ioctl: report event number in DM_LIST_DEVICES")
      Cc: stable@vger.kernel.org # 4.13
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  5. 22 Sep, 2017 1 commit
    • F
      net: ethtool: Add back transceiver type · 19cab887
      Florian Fainelli committed
      Commit 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
      deprecated the ethtool_cmd::transceiver field, which was fine in
      premise, except that the PHY library was actually using it to report the
      type of transceiver: internal or external.
      
      Use the first word of the reserved field to put this __u8 transceiver
      field back in. It is made read-only, and we don't expect the
      ETHTOOL_xLINKSETTINGS API to be doing anything with this anyway, so this
      is mostly for the legacy path where we do:
      
      ethtool_get_settings()
      -> dev->ethtool_ops->get_link_ksettings()
         -> convert_link_ksettings_to_legacy_settings()
      
      to have no information loss compared to the legacy get_settings API.
      
      Fixes: 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 19 Sep, 2017 1 commit
  7. 09 Sep, 2017 4 commits
  8. 07 Sep, 2017 6 commits
  9. 05 Sep, 2017 16 commits
  10. 04 Sep, 2017 3 commits
    • P
      netlink: add NLM_F_NONREC flag for deletion requests · 2335ba70
      Pablo Neira Ayuso committed
      In the last NFWS in Faro, Portugal, we discussed that netlink is lacking
      the semantics to request non-recursive deletions, i.e. do not delete an
      object iff it has child objects that hang from this parent object that
      the user requests to be deleted.
      
      We need this new flag to solve a problem for the iptables-compat
      backward compatibility utility, that runs iptables commands using the
      existing nf_tables netlink interface. Specifically, custom chains in
      iptables cannot be deleted if there are rules in them, whereas nf_tables
      allows removing any chain that is populated with content. To sort out
      this asymmetry, iptables-compat userspace sets this new NLM_F_NONREC
      flag to obtain the same semantics that iptables provides.
      
      This new flag should only be used for deletion requests. Note this new
      flag value overlaps with the existing:
      
      * NLM_F_ROOT for get requests.
      * NLM_F_REPLACE for new requests.
      
      However, those flags should not ever be used in deletion requests.
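
The overlap described above can be illustrated with a short sketch that builds the flags for a deletion request. The helper `delete_request_flags()` is hypothetical; the fallback value 0x100 for NLM_F_NONREC is an assumption that it matches the uapi header, provided only so the sketch builds against headers that predate the flag.

```c
#include <linux/netlink.h>
#include <stdint.h>

/* Fallback for older headers; assumed to match the uapi definition. */
#ifndef NLM_F_NONREC
#define NLM_F_NONREC 0x100
#endif

/*
 * Build nlmsg_flags for a deletion request (hypothetical helper).
 * NLM_F_NONREC shares its bit with NLM_F_ROOT (get requests) and
 * NLM_F_REPLACE (new requests), so it is only meaningful when the
 * message type is a delete.
 */
static uint16_t delete_request_flags(int non_recursive)
{
    uint16_t flags = NLM_F_REQUEST | NLM_F_ACK;

    if (non_recursive)
        flags |= NLM_F_NONREC;
    return flags;
}
```

The returned value would be stored in the `nlmsg_flags` field of the `struct nlmsghdr` for the delete message.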
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • P
      netfilter: nft_limit: add stateful object type · a6912055
      Pablo M. Bermudo Garay committed
      Register a new limit stateful object type into the stateful object
      infrastructure.
      Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • V
      netfilter: xt_hashlimit: add rate match mode · bea74641
      Vishwanath Pai committed
      This patch adds a new feature to hashlimit that allows matching on the
      current packet/byte rate without rate limiting. This can be enabled
      with a new flag --hashlimit-rate-match. The match returns true if the
      current rate of packets is above/below the user specified value.
      
      The main difference between the existing algorithm and the new one is
      that the existing algorithm rate-limits the flow whereas the new
      algorithm does not. Instead it *classifies* the flow based on whether
      it is above or below a certain rate. I will demonstrate this with an
      example below. Let us assume this rule:
      
      iptables -A INPUT -m hashlimit --hashlimit-above 10/s -j new_chain
      
      If the packet rate is 15/s, the existing algorithm would ACCEPT 10
      packets every second and send 5 packets to "new_chain".
      
      But with the new algorithm, as long as the rate of 15/s is sustained,
      all packets will continue to match and every packet is sent to new_chain.
      
      This new functionality will let us classify different flows based on
      their current rate, so that further decisions can be made on them based on
      what the current rate is.
      
      This is how the new algorithm works:
      We divide time into intervals of 1 (sec/min/hour) as specified by
      the user. We keep track of the number of packets/bytes processed in the
      current interval. After each interval we reset the counter to 0.
      
      When we receive a packet for match, we look at the packet rate
      during the current interval and the previous interval to make a
      decision:
      
      if [ prev_rate < user and cur_rate < user ]
              return Below
      else
              return Above
      
      Where cur_rate is the number of packets/bytes seen in the current
      interval, prev_rate is the number of packets/bytes seen in the previous
      interval and 'user' is the rate specified by the user.
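
The classification rule above can be sketched in C as follows. This is an illustration under stated assumptions rather than the kernel implementation: `struct rate_window` and both functions are hypothetical names, the current packet is counted before the comparison, and a gap of more than one interval is assumed to clear both counters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-flow state: counts for the previous and current interval. */
struct rate_window {
    uint64_t interval_start; /* start time of the current interval */
    uint64_t interval_len;   /* interval length in the same time unit, nonzero */
    uint64_t prev_count;     /* packets/bytes seen in the previous interval */
    uint64_t cur_count;      /* packets/bytes seen so far in this interval */
};

/* Roll the window forward if 'now' has moved past the current interval. */
static void rate_window_advance(struct rate_window *w, uint64_t now)
{
    uint64_t elapsed = now - w->interval_start;
    uint64_t steps;

    if (elapsed < w->interval_len)
        return;
    steps = elapsed / w->interval_len;
    /* Assumption: idle intervals in between count as a rate of zero. */
    w->prev_count = (steps == 1) ? w->cur_count : 0;
    w->cur_count = 0;
    w->interval_start += steps * w->interval_len;
}

/*
 * Account for 'units' packets/bytes arriving at 'now' and classify the flow.
 * Returns true for "Above", false for "Below": the flow is Below only when
 * both the previous and the current interval stayed under the user's rate.
 */
static bool rate_is_above(struct rate_window *w, uint64_t now,
                          uint64_t units, uint64_t user_rate)
{
    rate_window_advance(w, now);
    w->cur_count += units;
    return !(w->prev_count < user_rate && w->cur_count < user_rate);
}
```

With a user rate of 10 per interval and a sustained 15 packets per interval, only the first nine packets of the very first interval are classified as Below; every later packet is classified as Above, because the previous interval's count keeps the flow above the threshold.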
      
      We also provide flexibility to the user for choosing the time
      interval using the option --hashlimit-rate-interval. For example the user can
      keep a low rate like x/hour but still keep the interval as small as 1
      second.
      
      To preserve backwards compatibility we have to add this feature in a new
      revision, so I've created revision 3 for hashlimit. The two new options
      we add are:
      
      --hashlimit-rate-match
      --hashlimit-rate-interval
      
      I have updated the help text to add these new options. Also added a few
      tests for the new options.
      Suggested-by: Igor Lubashev <ilubashe@akamai.com>
      Reviewed-by: Josh Hunt <johunt@akamai.com>
      Signed-off-by: Vishwanath Pai <vpai@akamai.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  11. 02 Sep, 2017 3 commits
    • I
      tcp_diag: report TCP MD5 signing keys and addresses · c03fa9bc
      Ivan Delalande committed
      Report TCP MD5 (RFC2385) signing keys, addresses and address prefixes to
      processes with CAP_NET_ADMIN requesting INET_DIAG_INFO. Currently it is
      not possible to retrieve these from the kernel once they have been
      configured on sockets.
      Signed-off-by: Ivan Delalande <colona@arista.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      fsmap: fix documentation of FMR_OF_LAST · d897246d
      Darrick J. Wong committed
      The FMR_OF_LAST flag is set on the last fsmap record being returned for
      the dataset requested, contrary to what the header file says.  Fix the
      docs to reflect the behavior of all fsmap implementations.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • S
      Introduce v3 namespaced file capabilities · 8db6c34f
      Serge E. Hallyn committed
      Root in a non-initial user ns cannot be trusted to write a traditional
      security.capability xattr.  If it were allowed to do so, then any
      unprivileged user on the host could map his own uid to root in a private
      namespace, write the xattr, and execute the file with privilege on the
      host.
      
      However supporting file capabilities in a user namespace is very
      desirable.  Not doing so means that any programs designed to run with
      limited privilege must continue to support other methods of gaining and
      dropping privilege.  For instance a program installer must detect
      whether file capabilities can be assigned, and assign them if so but set
      setuid-root otherwise.  The program in turn must know how to drop
      partial capabilities, and do so only if setuid-root.
      
      This patch introduces v3 of the security.capability xattr.  It builds a
      vfs_ns_cap_data struct by appending a uid_t rootid to struct
      vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
      namespace which mounted the filesystem, usually init_user_ns) of the
      root id in whose namespaces the file capabilities may take effect.
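
The relationship between the two xattr layouts can be sketched as below. The struct and macro names are hypothetical local mirrors, the revision constants are assumed to match the uapi capability header, and the sketch ignores the little-endian on-disk encoding (it assumes a little-endian host) as well as the uid translation between namespaces.

```c
#include <stdint.h>

/* Assumed to mirror the uapi revision constants (hypothetical local names). */
#define CAP_REVISION_MASK 0xFF000000u
#define CAP_REVISION_2    0x02000000u
#define CAP_REVISION_3    0x03000000u
#define CAP_U32           2

/* Hypothetical mirror of the v2 layout (struct vfs_cap_data). */
struct cap_data_v2 {
    uint32_t magic_etc; /* revision in the top byte, flags below */
    struct {
        uint32_t permitted;
        uint32_t inheritable;
    } data[CAP_U32];
};

/* Hypothetical mirror of the v3 layout: the v2 fields plus a root uid. */
struct cap_data_v3 {
    uint32_t magic_etc;
    struct {
        uint32_t permitted;
        uint32_t inheritable;
    } data[CAP_U32];
    uint32_t rootid; /* uid of the namespace root the caps apply to */
};

/*
 * Rewrite a v2 blob as v3 by bumping the revision and appending rootid.
 * Returns 0 on success, -1 if the input is not a v2 blob.
 */
static int cap_v2_to_v3(const struct cap_data_v2 *in, uint32_t rootid,
                        struct cap_data_v3 *out)
{
    if ((in->magic_etc & CAP_REVISION_MASK) != CAP_REVISION_2)
        return -1;
    out->magic_etc = CAP_REVISION_3 | (in->magic_etc & ~CAP_REVISION_MASK);
    for (int i = 0; i < CAP_U32; i++) {
        out->data[i].permitted = in->data[i].permitted;
        out->data[i].inheritable = in->data[i].inheritable;
    }
    out->rootid = rootid;
    return 0;
}
```

Under these assumptions a v2 blob occupies 20 bytes and a v3 blob 24 bytes, the extra four bytes being the appended root uid.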
      
      When a task asks to write a v2 security.capability xattr, if it is
      privileged with respect to the userns which mounted the filesystem, then
      nothing should change.  Otherwise, the kernel will transparently rewrite
      the xattr as a v3 with the appropriate rootid.  This is done during the
      execution of setxattr() to catch user-space-initiated capability writes.
      Subsequently, any task executing the file which has the noted kuid as
      its root uid, or which is in a descendant user_ns of such a user_ns,
      will run the file with capabilities.
      
      Similarly when asking to read file capabilities, a v3 capability will
      be presented as v2 if it applies to the caller's namespace.
      
      If a task writes a v3 security.capability, then it can provide a uid for
      the xattr so long as the uid is valid in its own user namespace, and it
      is privileged with CAP_SETFCAP over its namespace.  The kernel will
      translate that rootid to an absolute uid, and write that to disk.  After
      this, a task in the writer's namespace will not be able to use those
      capabilities (unless rootid was 0), but a task in a namespace where the
      given uid is root will.
      
      Only a single security.capability xattr may exist at a time for a given
      file.  A task may overwrite an existing xattr so long as it is
      privileged over the inode.  Note this is a departure from previous
      semantics, which required privilege to remove a security.capability
      xattr.  This check can be re-added if deemed useful.
      
      This allows a simple setxattr to work, allows tar/untar to work, and
      allows us to tar in one namespace and untar in another while preserving
      the capability, without risking leaking privilege into a parent
      namespace.
      
      Example using tar:
      
       $ cp /bin/sleep sleepx
       $ mkdir b1 b2
       $ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1
       $ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2
       $ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx
       $ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar
       $ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx
         b2/sleepx = cap_sys_admin+ep
       # /opt/ltp/testcases/bin/getv3xattr b2/sleepx
         v3 xattr, rootid is 100001
      
      A patch to linux-test-project adding a new set of tests for this
      functionality is in the nsfscaps branch at github.com/hallyn/ltp
      
      Changelog:
         Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
         Nov 07 2016: convert rootid from and to fs user_ns
         (From ebiederm: Mar 28 2017)
           commoncap.c: fix typos - s/v4/v3
           get_vfs_caps_from_disk: clarify the fs_ns root access check
           nsfscaps: change the code split for cap_inode_setxattr()
         Apr 09 2017:
            . don't return v3 cap for caps owned by current root.
            . return a v2 cap for a true v2 cap in non-init ns
         Apr 18 2017:
            . Change the flow of fscap writing to support s_user_ns writing.
            . Remove refuse_fcap_overwrite().  The value of the previous
              xattr doesn't matter.
         Apr 24 2017:
            . incorporate Eric's incremental diff
            . move cap_convert_nscap to setxattr and simplify its usage
         May 8, 2017:
            . fix leaking dentry refcount in cap_inode_getsecurity
      Signed-off-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  12. 01 Sep, 2017 2 commits