1. 19 Jan, 2019 8 commits
  2. 18 Jan, 2019 1 commit
    • net: introduce SO_BINDTOIFINDEX sockopt · f5dd3d0c
      David Herrmann committed
      This introduces a new generic SOL_SOCKET-level socket option called
      SO_BINDTOIFINDEX. It behaves similarly to SO_BINDTODEVICE, but takes a
      network interface index as argument, rather than the network interface
      name.
      
      User-space often refers to network-interfaces via their index, but has
      to temporarily resolve it to a name for a call into SO_BINDTODEVICE.
      This might pose problems when the network-device is renamed
      asynchronously by other parts of the system. When this happens, the
      SO_BINDTODEVICE might either fail, or worse, it might bind to the wrong
      device.
      
      In most cases user-space only ever operates on devices which it
      either manages itself, or for which it otherwise has a guarantee that
      the device name will not change (e.g., devices that are UP cannot be
      renamed).
      However, particularly in libraries this guarantee is non-obvious and it
      would be nice if that race-condition would simply not exist. It would
      make it easier for those libraries to operate even in situations where
      the device-name might change under the hood.
      
      A real use-case that we recently hit is trying to start the network
      stack early in the initrd but make it survive into the real system.
      Existing distributions rename network-interfaces during the transition
      from initrd into the real system. This, obviously, cannot affect
      devices that are up and running (unless you also consider moving them
      between network-namespaces). However, the network manager now has to
      make sure its management engine for dormant devices will not run in
      parallel to these renames. In particular, when you offload operations
      like DHCP into separate processes, these might set up their sockets
      early, and thus have to resolve the device-name, possibly running into
      this race-condition.
      
      By avoiding a call to resolve the device-name, we no longer depend on
      the name and can run network setup of dormant devices in parallel to
      the transition off the initrd. The SO_BINDTOIFINDEX socket option plugs
      this race.
      Reviewed-by: Tom Gundersen <teg@jklm.no>
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
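As a rough illustration of the option described in the commit above, the sketch below wraps the call in a small helper. The helper name `bind_to_ifindex` is hypothetical, and the fallback value 62 for `SO_BINDTOIFINDEX` is an assumption that matches the generic socket header; some architectures use a different number, so prefer the value from your system headers.

```c
#include <errno.h>
#include <sys/socket.h>

/* Assumption: older libc headers may not define the option yet. 62 is the
 * asm-generic value; a few architectures number it differently. */
#ifndef SO_BINDTOIFINDEX
#define SO_BINDTOIFINDEX 62
#endif

/* Hypothetical helper: bind socket fd to the interface with the given
 * index. Returns 0 on success, -1 with errno set on failure. No interface
 * name is resolved, so a concurrent rename cannot redirect the bind. */
static int bind_to_ifindex(int fd, int ifindex)
{
        return setsockopt(fd, SOL_SOCKET, SO_BINDTOIFINDEX,
                          &ifindex, sizeof(ifindex));
}
```

A caller would typically learn the index once (for example from netlink or `if_nametoindex()`) and pass it straight through. Depending on kernel version, the call may require CAP_NET_RAW.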
  3. 09 Jan, 2019 1 commit
  4. 08 Jan, 2019 1 commit
  5. 06 Jan, 2019 2 commits
    • fscrypt: add Adiantum support · 8094c3ce
      Eric Biggers committed
      Add support for the Adiantum encryption mode to fscrypt.  Adiantum is a
      tweakable, length-preserving encryption mode with security provably
      reducible to that of XChaCha12 and AES-256, subject to a security bound.
      It's also a true wide-block mode, unlike XTS.  See the paper
      "Adiantum: length-preserving encryption for entry-level processors"
      (https://eprint.iacr.org/2018/720.pdf) for more details.  Also see
      commit 059c2a4d ("crypto: adiantum - add Adiantum support").
      
      On sufficiently long messages, Adiantum's bottlenecks are XChaCha12 and
      the NH hash function.  These algorithms are fast even on processors
      without dedicated crypto instructions.  Adiantum makes it feasible to
      enable storage encryption on low-end mobile devices that lack AES
      instructions; currently such devices are unencrypted.  On ARM Cortex-A7,
      on 4096-byte messages Adiantum encryption is about 4 times faster than
      AES-256-XTS encryption; decryption is about 5 times faster.
      
      In fscrypt, Adiantum is suitable for encrypting both file contents and
      names.  With filenames, it fixes a known weakness: when two filenames in
      a directory share a common prefix of >= 16 bytes, with CTS-CBC their
      encrypted filenames share a common prefix too, leaking information.
      Adiantum does not have this problem.
      
      Since Adiantum also accepts long tweaks (IVs), it's also safe to use the
      master key directly for Adiantum encryption rather than deriving
      per-file keys, provided that the per-file nonce is included in the IVs
      and the master key isn't used for any other encryption mode.  This
      configuration saves memory and improves performance.  A new fscrypt
      policy flag is added to allow users to opt-in to this configuration.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
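The opt-in configuration described in the commit above could be requested from user space roughly as follows. This is a sketch under assumptions: the commit text does not name the constants, so `FS_ENCRYPTION_MODE_ADIANTUM` (9) and `FS_POLICY_FLAG_DIRECT_KEY` (0x04) are taken from what the upstream UAPI header is believed to define, with fallbacks for older headers. The helper names are hypothetical.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Assumed values, only used if the installed headers predate Adiantum. */
#ifndef FS_ENCRYPTION_MODE_ADIANTUM
#define FS_ENCRYPTION_MODE_ADIANTUM 9
#endif
#ifndef FS_POLICY_FLAG_DIRECT_KEY
#define FS_POLICY_FLAG_DIRECT_KEY 0x04
#endif

/* Hypothetical helper: fill a v1 policy that uses Adiantum for both file
 * contents and filenames. direct_key != 0 opts in to using the master key
 * directly instead of deriving per-file keys. */
static void fill_adiantum_policy(struct fscrypt_policy *policy,
                                 const unsigned char descriptor[8],
                                 int direct_key)
{
        memset(policy, 0, sizeof(*policy));
        policy->version = 0;
        policy->contents_encryption_mode = FS_ENCRYPTION_MODE_ADIANTUM;
        policy->filenames_encryption_mode = FS_ENCRYPTION_MODE_ADIANTUM;
        policy->flags = FS_POLICY_FLAGS_PAD_32;
        if (direct_key)
                policy->flags |= FS_POLICY_FLAG_DIRECT_KEY;
        memcpy(policy->master_key_descriptor, descriptor, 8);
}

/* Hypothetical helper: apply the policy to an empty directory. */
static int set_dir_policy(int dirfd, const struct fscrypt_policy *policy)
{
        return ioctl(dirfd, FS_IOC_SET_ENCRYPTION_POLICY, policy);
}
```

The master key itself still has to be provided separately (for example through the kernel keyring); the sketch only covers the policy.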
    • arch: remove stale comments "UAPI Header export list" · d4ce5458
      Masahiro Yamada committed
      These comments are leftovers of commit fcc8487d ("uapi: export all
      headers under uapi directories").
      
      Prior to that commit, exported headers had to be explicitly added to
      header-y. Now, all headers under the uapi/ directories are exported.
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
  6. 05 Jan, 2019 6 commits
  7. 01 Jan, 2019 1 commit
  8. 30 Dec, 2018 2 commits
    • csky: define syscall_get_arch() · d770b256
      Dmitry V. Levin committed
      syscall_get_arch() is required to be implemented on all architectures
      in order to extend the generic ptrace API with PTRACE_GET_SYSCALL_INFO
      request.
      
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Elvira Khabirova <lineprinter@altlinux.org>
      Cc: Eugene Syromyatnikov <esyr@redhat.com>
      Cc: linux-audit@redhat.com
      Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: Guo Ren <guoren@kernel.org>
      
       arch/csky/include/asm/syscall.h | 7 +++++++
       include/uapi/linux/audit.h      | 1 +
       2 files changed, 8 insertions(+)
    • elf-em.h: add EM_CSKY · 077b930a
      Dmitry V. Levin committed
      The uapi/linux/audit.h header is going to use EM_CSKY in order
      to define AUDIT_ARCH_CSKY which is needed to implement
      syscall_get_arch() which in turn is required to extend
      the generic ptrace API with PTRACE_GET_SYSCALL_INFO request.
      
      The value for EM_CSKY has been taken from arch/csky/include/asm/elf.h
      and confirmed by binutils:include/elf/common.h
      
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Elvira Khabirova <lineprinter@altlinux.org>
      Cc: Eugene Syromyatnikov <esyr@redhat.com>
      Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: Guo Ren <guoren@kernel.org>
  9. 21 Dec, 2018 5 commits
    • vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver · 7f928917
      Alexey Kardashevskiy committed
      POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
      pluggable PCIe devices but still have PCIe links which are used
      for config space and MMIO. In addition to that the GPUs have 6 NVLinks
      which are connected to other GPUs and the POWER9 CPU. POWER9 chips
      have a special unit on a die called an NPU which is an NVLink2 host bus
      adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
      These systems also support ATS (address translation services) which is
      a part of the NVLink2 protocol. Such GPUs also share on-board RAM
      (16GB or 32GB) to the system via the same NVLink2 so a CPU has
      cache-coherent access to a GPU RAM.
      
      This exports GPU RAM to the userspace as a new VFIO device region. This
      preregisters the new memory as device memory as it might be used for DMA.
      This inserts pfns from the fault handler because the GPU memory is not
      onlined until the vendor driver has loaded and trained the NVLinks. Doing
      this earlier causes low-level errors, which we fence in the firmware so
      that they do not hurt the host system but which are still better avoided.
      For the same reason this does not map GPU RAM into the host kernel (which
      is otherwise the usual thing for emulated access).
      
      This exports an ATSD (Address Translation Shootdown) register of NPU which
      allows TLB invalidations inside GPU for an operating system. The register
      conveniently occupies a single 64k page. It is also presented to
      the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
      each of them can be used for TLB invalidation in a GPU linked to this NPU.
      This allocates one ATSD register per NVLink bridge, allowing up to 6
      registers to be passed. Due to a host firmware bug (just recently fixed),
      only 1 ATSD register per NPU was actually advertised to the host system,
      so this passes that lone register via the first NVLink bridge device in
      the group. That is still enough, as QEMU collects them all back and
      presents them to the guest via vPHB to mimic the emulated NPU PHB on the
      host.
      
      In order to provide the userspace with the information about GPU-to-NVLink
      connections, this exports an additional capability called "tgt"
      (which is an abbreviated host system bus address). The "tgt" property
      tells the GPU its own system address and allows the guest driver to
      conglomerate the routing information so each GPU knows how to get directly
      to the other GPUs.
      
      For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
      know LPID (a logical partition ID or a KVM guest hardware ID in other
      words) and PID (a memory context ID of a userspace process, not to be
      confused with a linux pid). This assigns a GPU to LPID in the NPU and
      this is why this adds a listener for KVM on an IOMMU group. A PID comes
      via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
      
      This requires coherent memory and ATSD to be available on the host as
      the GPU vendor only supports configurations with both features enabled
      and other configurations are known not to work. Because of this and
      because of the ways the features are advertised to the host system
      (which is a device tree with very platform specific properties),
      this requires the POWERNV platform to be enabled.
      
      The V100 GPUs do not advertise any of these capabilities via the config
      space and there are more than just one device ID so this relies on
      the platform to tell whether these GPUs have special abilities such as
      NVLinks.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • IB/core: Move query port to ioctl · 641d1207
      Michael Guralnik committed
      Add a method for query port under the uverbs global methods.  Current
      ib_port_attr struct is passed as a single attribute and port_cap_flags2 is
      added as a new attribute to the function.
      Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/nldev: Expose port_cap_flags2 · 4fa2813d
      Michael Guralnik committed
      port_cap_flags2 represents IBTA PortInfo:CapabilityMask2.
      
      The field safely extends the RDMA_NLDEV_ATTR_CAP_FLAGS operand as it was
      exported as 64 bit to allow this kind of extension.
      Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • fbdev: make FB_BACKLIGHT a tristate · b4a1ed0c
      Rob Clark committed
      BACKLIGHT_CLASS_DEVICE is already tristate, but a dependency
      FB_BACKLIGHT prevents it from being built as a module.  There
      doesn't seem to be any particularly good reason for this, so
      switch FB_BACKLIGHT over to tristate.
      Signed-off-by: Rob Clark <robdclark@gmail.com>
      Tested-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Simon Horman <horms+renesas@verge.net.au>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Ulf Magnusson <ulfalizer@gmail.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Hans de Goede <j.w.r.degoede@gmail.com>
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
    • vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled · e262e32d
      David Howells committed
      Only the mount namespace code that implements mount(2) should be using the
      MS_* flags.  Suppress them inside the kernel unless uapi/linux/mount.h is
      included.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: David Howells <dhowells@redhat.com>
  10. 20 Dec, 2018 3 commits
  11. 19 Dec, 2018 7 commits
    • binder: implement binderfs · 3ad20fe3
      Christian Brauner committed
      As discussed at Linux Plumbers Conference 2018 in Vancouver [1] this is the
      implementation of binderfs.
      
      /* Abstract */
      binderfs is a backwards-compatible filesystem for Android's binder ipc
      mechanism. Each ipc namespace will mount a new binderfs instance. Mounting
      binderfs multiple times at different locations in the same ipc namespace
      will not cause a new super block to be allocated and hence it will be the
      same filesystem instance.
      Each new binderfs mount will have its own set of binder devices only
      visible in the ipc namespace it has been mounted in. All devices in a new
      binderfs mount will follow the scheme binder%d and numbering will always
      start at 0.
      
      /* Backwards compatibility */
      Devices requested in the Kconfig via CONFIG_ANDROID_BINDER_DEVICES for the
      initial ipc namespace will work as before. They will be registered via
      misc_register() and appear in the devtmpfs mount. Specifically, the
      standard devices binder, hwbinder, and vndbinder will all appear in their
      standard locations in /dev. Mounting or unmounting the binderfs mount in
      the initial ipc namespace will have no effect on these devices, i.e. they
      will neither show up in the binderfs mount nor will they disappear when the
      binderfs mount is gone.
      
      /* binder-control */
      Each new binderfs instance comes with a binder-control device. No other
      devices will be present at first. The binder-control device can be used to
      dynamically allocate binder devices. All requests operate on the binderfs
      mount the binder-control device resides in.
      Assuming a new instance of binderfs has been mounted at /dev/binderfs
      via mount -t binderfs binderfs /dev/binderfs. Then a request to create a
      new binder device can be made as illustrated in [2].
      Binderfs devices can simply be removed via unlink().
      
      /* Implementation details */
      - dynamic major number allocation:
        When binderfs is registered as a new filesystem it will dynamically
        allocate a new major number. The allocated major number will be returned
        in struct binderfs_device when a new binder device is allocated.
      - global minor number tracking:
        Minor are tracked in a global idr struct that is capped at
        BINDERFS_MAX_MINOR. The minor number tracker is protected by a global
        mutex. This is the only point of contention between binderfs mounts.
      - struct binderfs_info:
        Each binderfs super block has its own struct binderfs_info that tracks
        specific details about a binderfs instance:
        - ipc namespace
        - dentry of the binder-control device
        - root uid and root gid of the user namespace the binderfs instance
          was mounted in
      - mountable by user namespace root:
        binderfs can be mounted by user namespace root in a non-initial user
        namespace. The devices will be owned by user namespace root.
      - binderfs binder devices without misc infrastructure:
        New binder devices associated with a binderfs mount do not use the
        full misc_register() infrastructure.
        The misc_register() infrastructure can only create new devices in the
        host's devtmpfs mount. binderfs does however only make devices appear
        under its own mountpoint and thus allocates new character device nodes
        from the inode of the root dentry of the super block. This will have
        the side-effect that binderfs specific device nodes do not appear in
        sysfs. This behavior is similar to devpts allocated pts devices and
        has no effect on the functionality of the ipc mechanism itself.
      
      [1]: https://goo.gl/JL2tfX
      [2]: program to allocate a new binderfs binder device:
      
           #define _GNU_SOURCE
           #include <errno.h>
           #include <fcntl.h>
           #include <stdio.h>
           #include <stdlib.h>
           #include <string.h>
           #include <sys/ioctl.h>
           #include <sys/stat.h>
           #include <sys/types.h>
           #include <unistd.h>
           #include <linux/android/binder_ctl.h>
      
           int main(int argc, char *argv[])
           {
                   int fd, ret, saved_errno;
                   size_t len;
                   struct binderfs_device device = { 0 };
      
                   if (argc < 2)
                           exit(EXIT_FAILURE);
      
                   len = strlen(argv[1]);
                   if (len > BINDERFS_MAX_NAME)
                           exit(EXIT_FAILURE);
      
                   memcpy(device.name, argv[1], len);
      
                   fd = open("/dev/binderfs/binder-control", O_RDONLY | O_CLOEXEC);
                   if (fd < 0) {
                           printf("%s - Failed to open binder-control device\n",
                                  strerror(errno));
                           exit(EXIT_FAILURE);
                   }
      
                   ret = ioctl(fd, BINDER_CTL_ADD, &device);
                   saved_errno = errno;
                   close(fd);
                   errno = saved_errno;
                   if (ret < 0) {
                           printf("%s - Failed to allocate new binder device\n",
                                  strerror(errno));
                           exit(EXIT_FAILURE);
                   }
      
                   printf("Allocated new binder device with major %d, minor %d, and "
                          "name %s\n", device.major, device.minor,
                          device.name);
      
                   exit(EXIT_SUCCESS);
           }
      
      Cc: Martijn Coenen <maco@android.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Acked-by: Todd Kjos <tkjos@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: Use __kernel_clockid_t in uapi net_stamp.h · e2c4cf7f
      Davide Caratti committed
      Herton reports the following error when building a userspace program that
      includes net_tstamp.h:
      
       In file included from foo.c:2:
       /usr/include/linux/net_tstamp.h:158:2: error: unknown type name
       ‘clockid_t’
         clockid_t clockid; /* reference clockid */
         ^~~~~~~~~
      
      Fix it by using __kernel_clockid_t in place of clockid_t.
      
      Fixes: 80b14dee ("net: Add a new socket option for a future transmit time.")
      Cc: Timothy Redaelli <tredaelli@redhat.com>
      Reported-by: Herton R. Krzesinski <herton@redhat.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Tested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
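To illustrate why the fix in the commit above works, the sketch below declares a structure with the same two-field shape using only kernel UAPI types, so it compiles without <time.h> or any other libc header that provides clockid_t. The struct and function names are hypothetical stand-ins, not the real definition from net_tstamp.h.

```c
#include <linux/types.h>        /* __u32 */
#include <linux/posix_types.h>  /* __kernel_clockid_t */

/* Hypothetical stand-in for a UAPI struct carrying a clock id. Using
 * __kernel_clockid_t keeps the header free of libc type dependencies. */
struct example_txtime {
        __kernel_clockid_t clockid;     /* reference clockid */
        __u32 flags;
};

/* Hypothetical helper: build an initialized value. */
static struct example_txtime example_txtime_init(__kernel_clockid_t clockid,
                                                 __u32 flags)
{
        struct example_txtime t = { .clockid = clockid, .flags = flags };
        return t;
}
```

Note that the block deliberately avoids including <time.h>; that is the situation in which the original clockid_t declaration failed to compile.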
    • bpf: sockmap, metadata support for reporting size of msg · 3bdbd022
      John Fastabend committed
      This adds metadata to sk_msg_md for BPF programs to read the sk_msg
      size.
      
      When the SK_MSG program is running under an application that is using
      sendfile the data is not copied into sk_msg buffers by default. Rather
      the BPF program uses sk_msg_pull_data to read the bytes in. This
      avoids doing the costly memcopy instructions when they are not in
      fact needed. However, if we don't know the size of the sk_msg we
      have to guess if needed bytes are available by doing a pull request
      which may fail. By including the size of the sk_msg BPF programs can
      check the size before issuing sk_msg_pull_data requests.
      
      Additionally, the same applies for sendmsg calls when the application
      provides multiple iovs. Here the BPF program needs to pull in data
      to update data pointers but it's not clear where the data ends without
      a size parameter. In many cases "guessing" is not easy to do
      and results in multiple calls to pull and without bounded loops
      everything gets fairly tricky.
      
      Clean this up by including a u32 size field. Note, all writes into
      sk_msg_md are rejected already from sk_msg_is_valid_access so nothing
      additional is needed there.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • IB/uverbs: Add support to advise_mr · ad8a4496
      Moni Shoua committed
      Add new ioctl method for the MR object - ADVISE_MR.
      
      This command can be used by users to give advice or directions to the
      kernel about an address range that belongs to memory regions.
      
      A new ib_device callback, advise_mr(), is introduced here to support the
      new command. This command takes the following arguments:
      
      - pd:		The protection domain to which all memory regions belong
      - advice: 	The type of the advice
      	  	* IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH - Pre-fetch a range of
      		an on-demand paging MR
      	  	* IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_WRITE - Pre-fetch a range
      		of an on-demand paging MR with write intention
      - flags:	The properties of the advice
      		* IB_UVERBS_ADVISE_MR_FLAG_FLUSH - Operation must end before
      		return to the caller
      - sg_list:	The list of memory ranges
      - num_sge:	The number of memory ranges in the list
      - attrs:	More attributes to be parsed by the provider
      Signed-off-by: Moni Shoua <monis@mellanox.com>
      Reviewed-by: Guy Levi <guyle@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/uverbs: Add an ioctl method to destroy an object · bbc13cda
      Parav Pandit committed
      Add an ioctl method to destroy the PD, MR, MW, AH, flow, RWQ indirection
      table and XRCD objects by handle which doesn't require any output response
      during destruction.
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/uverbs: Add a method to introspect handles in a context · 149d3845
      Jason Gunthorpe committed
      Introduce a helper function gather_objects_handle() to copy object handles
      under a spin lock.
      
      Expose these object handles via the uverbs ioctl interface.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
    • RDMA/uverbs: Implement an ioctl that can call write and write_ex handlers · 4785860e
      Jason Gunthorpe committed
      Now that the handlers do not process their own udata we can make a
      sensible ioctl that wraps them. The ioctl follows the same format as
      the write_ex() and has the user explicitly specify the core and driver
      in/out opaque structures and a command number.
      
      This works for all forms of write commands.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
  12. 18 Dec, 2018 3 commits
    • nl80211: Add support to notify radar event info received from STA · 30c63115
      Sriram R committed
      Currently radar detection and corresponding channel switch is handled
      at the AP device. STA ignores these detected radar events since the
      radar signal can be seen mostly by the AP as well. But in scenarios where
      a radar signal is seen only at STA, notifying this event to the AP which
      can trigger a channel switch can be useful.
      Stations can report such radar events autonomously through a Spectrum
      management (Measurement Report) action frame to their AP. On processing
      the report, userspace can notify the kernel using the newly added
      NL80211_CMD_NOTIFY_RADAR to indicate the detected event, which in turn
      adds the reported channel to the NOL.
      Signed-off-by: Sriram R <srirrama@codeaurora.org>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • cfg80211: clarify LCI/civic location documentation · 30db641e
      Johannes Berg committed
      The older code and current userspace assumed that this data
      is the content of the Measurement Report element, starting
      with the Measurement Token. Clarify this in the documentation.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • bpf: btf: fix struct/union/fwd types with kind_flag · 9d5f9f70
      Yonghong Song committed
      This patch fixed two issues with BTF. One is related to
      struct/union bitfield encoding and the other is related to
      forward type.
      
      Issue #1 and solution:
      ======================
      
      Current btf encoding of bitfield follows what pahole generates.
      For each bitfield, pahole will duplicate the type chain and
      put the bitfield size at the final int or enum type.
      Since the BTF enum type cannot encode bit size,
      pahole works around the issue by generating
      an int type whenever the enum bit size is not 32.
      
      For example,
        -bash-4.4$ cat t.c
        typedef int ___int;
        enum A { A1, A2, A3 };
        struct t {
          int a[5];
          ___int b:4;
          volatile enum A c:4;
        } g;
        -bash-4.4$ gcc -c -O2 -g t.c
      The current kernel supports the following BTF encoding:
        $ pahole -JV t.o
        [1] TYPEDEF ___int type_id=2
        [2] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED
        [3] ENUM A size=4 vlen=3
              A1 val=0
              A2 val=1
              A3 val=2
        [4] STRUCT t size=24 vlen=3
              a type_id=5 bits_offset=0
              b type_id=9 bits_offset=160
              c type_id=11 bits_offset=164
        [5] ARRAY (anon) type_id=2 index_type_id=2 nr_elems=5
        [6] INT sizetype size=8 bit_offset=0 nr_bits=64 encoding=(none)
        [7] VOLATILE (anon) type_id=3
        [8] INT int size=1 bit_offset=0 nr_bits=4 encoding=(none)
        [9] TYPEDEF ___int type_id=8
        [10] INT (anon) size=1 bit_offset=0 nr_bits=4 encoding=SIGNED
        [11] VOLATILE (anon) type_id=10
      
      Two issues are in the above:
        . by changing enum type to int, we lost the original
          type information and this will not be ideal later
          when we try to convert BTF to a header file.
        . the type duplication for bitfields will cause
          BTF bloat. Duplicated types cannot be deduplicated
          later if the bitfield size is different.
      
      To fix this issue, this patch implemented a compatible
      change for BTF struct type encoding:
        . the bit 31 of struct_type->info, previously reserved,
          now is used to indicate whether bitfield_size is
          encoded in btf_member or not.
        . if bit 31 of struct_type->info is set,
          btf_member->offset will encode like:
            bit 0 - 23: bit offset
            bit 24 - 31: bitfield size
          if bit 31 is not set, the old behavior is preserved:
            bit 0 - 31: bit offset
      
      So if the struct contains a bit field, the maximum bit offset
      will be reduced to (2^24 - 1) instead of MAX_UINT. The maximum
      bitfield size will be 255, which is enough for today as the maximum
      bitfield in a compiler can be 128 where the int128 type is supported.
      
      This kernel patch intends to support the new BTF encoding:
        $ pahole -JV t.o
        [1] TYPEDEF ___int type_id=2
        [2] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED
        [3] ENUM A size=4 vlen=3
              A1 val=0
              A2 val=1
              A3 val=2
        [4] STRUCT t kind_flag=1 size=24 vlen=3
              a type_id=5 bitfield_size=0 bits_offset=0
              b type_id=1 bitfield_size=4 bits_offset=160
              c type_id=7 bitfield_size=4 bits_offset=164
        [5] ARRAY (anon) type_id=2 index_type_id=2 nr_elems=5
        [6] INT sizetype size=8 bit_offset=0 nr_bits=64 encoding=(none)
        [7] VOLATILE (anon) type_id=3
      
      Issue #2 and solution:
      ======================
      
      Current forward type in BTF does not specify whether the original
      type is struct or union. This will not work for type pretty print
      and BTF-to-header-file conversion as struct/union must be specified.
        $ cat tt.c
        struct t;
        union u;
        int foo(struct t *t, union u *u) { return 0; }
        $ gcc -c -g -O2 tt.c
        $ pahole -JV tt.o
        [1] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED
        [2] FWD t type_id=0
        [3] PTR (anon) type_id=2
        [4] FWD u type_id=0
        [5] PTR (anon) type_id=4
      
      To fix this issue, similar to issue #1, type->info bit 31
      is used. If the bit is set, it is union type. Otherwise, it is
      a struct type.
      
        $ pahole -JV tt.o
        [1] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED
        [2] FWD t kind_flag=0 type_id=0
        [3] PTR (anon) kind_flag=0 type_id=2
        [4] FWD u kind_flag=1 type_id=0
        [5] PTR (anon) kind_flag=0 type_id=4
      
      Pahole/LLVM change:
      ===================
      
      The new kind_flag functionality has been implemented in pahole
      and llvm:
        https://github.com/yonghong-song/pahole/tree/bitfield
        https://github.com/yonghong-song/llvm/tree/bitfield
      
      Note that pahole hasn't implemented func/func_proto kind
      and .BTF.ext. So to print function signature with bpftool,
      the llvm compiler should be used.
      
      Fixes: 69b693f0 ("bpf: btf: Introduce BPF Type Format (BTF)")
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
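The bit layout described in the commit above can be sketched with a few helpers. The function names are hypothetical and this is not the kernel's own code. It only mirrors the stated layout: bit 31 of the type's info word is the kind_flag, and when it is set a member's offset word holds the bit offset in bits 0-23 and the bitfield size in bits 24-31.

```c
#include <stdint.h>

/* Hypothetical helper: pack a member offset word for a struct/union whose
 * kind_flag is set. Out-of-range inputs are truncated to their fields. */
static uint32_t member_offset_encode(uint32_t bitfield_size, uint32_t bit_offset)
{
        return ((bitfield_size & 0xffu) << 24) | (bit_offset & 0xffffffu);
}

/* Bits 24-31: bitfield size (0 means the member is not a bitfield). */
static uint32_t member_bitfield_size(uint32_t offset_word)
{
        return offset_word >> 24;
}

/* Bits 0-23: bit offset of the member within the struct. */
static uint32_t member_bit_offset(uint32_t offset_word)
{
        return offset_word & 0xffffffu;
}

/* Bit 31 of the info word: for struct/union it says whether members use
 * the packed encoding above; for a forward declaration it is 1 for a
 * union and 0 for a struct. */
static int info_kind_flag(uint32_t info)
{
        return (int)(info >> 31);
}
```

For member `b` in the example struct (bitfield_size=4, bits_offset=160), the packed word is `(4 << 24) | 160`, that is 0x040000A0.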