1. 21 Feb, 2020 2 commits
    • sched: Provide cant_migrate() · 4e139c77
      Committed by Thomas Gleixner
      Some code paths rely on preempt_disable() to prevent migration on a non RT
      enabled kernel. These preempt_disable/enable() pairs are substituted by
      migrate_disable/enable() pairs or other forms of RT specific protection. On
      RT these protections prevent migration but not preemption. Obviously a
      cant_sleep() check in such a section will trigger on RT because preemption
      is not disabled.
      
      Provide a cant_migrate() macro which maps to cant_sleep() on a non RT
      kernel and an empty placeholder for RT for now. The placeholder will be
      changed to a proper debug check along with the RT specific migration
      protection mechanism.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20200214161503.070487511@linutronix.de
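The mapping described in the cant_migrate() commit can be sketched in plain C. This is a minimal userspace illustration under stated assumptions, not the kernel's implementation: `preempt_count_stub`, `cant_sleep_violations`, and the stub body of `cant_sleep()` are hypothetical stand-ins, and `CONFIG_PREEMPT_RT` is assumed to be undefined at compile time.

```c
/* Hypothetical userspace stand-ins for kernel state; not kernel code. */
static int preempt_count_stub;      /* > 0 models "preemption disabled" */
static int cant_sleep_violations;   /* counts failed debug checks */

/* Stub check: record a violation when called with preemption enabled. */
static void cant_sleep(void)
{
	if (preempt_count_stub == 0)
		cant_sleep_violations++;
}

#ifdef CONFIG_PREEMPT_RT
/* RT: empty placeholder for now, as the commit message describes. */
# define cant_migrate()	do { } while (0)
#else
/* Non-RT: migration protection implies preemption is disabled. */
# define cant_migrate()	cant_sleep()
#endif
```

With the non-RT mapping, a call made while the stub counter is zero is flagged, and a call made while it is non-zero passes silently.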
    • sched/rt: Provide migrate_disable/enable() inlines · 66630058
      Committed by Thomas Gleixner
      Code which solely needs to prevent migration of a task uses
      preempt_disable()/enable() pairs. This is the only reliable way to do so
      as setting the task affinity to a single CPU can be undone by a
      setaffinity operation from a different task/process.
      
      RT provides a separate migrate_disable/enable() mechanism which does not
      disable preemption to achieve the semantic requirements of an (almost) fully
      preemptible kernel.
      
      As it is unclear from looking at a given code path whether the intention is
      to disable preemption or migration, introduce migrate_disable/enable()
      inline functions which can be used to annotate code which merely needs to
      disable migration. Map them to preempt_disable/enable() for now. The RT
      substitution will be provided later.
      
      Code which is annotated that way documents that it has no requirement to
      protect against reentrancy of a preempting task. Either this is not
      required at all or the call sites are already serialized by other means.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://lore.kernel.org/r/878slclv1u.fsf@nanos.tec.linutronix.de
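The "annotate now, substitute later" approach from the migrate_disable/enable() commit can be illustrated with a small userspace sketch. It assumes a hypothetical counter `preempt_count_stub` in place of the kernel's preemption state, and `count_hit_on_this_cpu()` is an invented call site used only to show the annotation.

```c
/* Hypothetical userspace stand-in for the kernel's preempt counter. */
static int preempt_count_stub;

static inline void preempt_disable(void) { preempt_count_stub++; }
static inline void preempt_enable(void)  { preempt_count_stub--; }

/* Annotation-only wrappers: for now they behave like preempt_disable/enable(). */
static inline void migrate_disable(void) { preempt_disable(); }
static inline void migrate_enable(void)  { preempt_enable(); }

/* Hypothetical call site that merely needs to stay on one CPU. */
static int per_cpu_hits_stub;

static inline void count_hit_on_this_cpu(void)
{
	migrate_disable();	/* documents: migration must not happen here */
	per_cpu_hits_stub++;
	migrate_enable();
}
```

Because the call site names its real requirement, a later RT-specific substitution could change the wrappers without touching the caller.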
  2. 14 Feb, 2020 3 commits
    • netdevice.h: fix all kernel-doc and Sphinx warnings · a1fa83bd
      Committed by Randy Dunlap
      Eliminate all kernel-doc and Sphinx warnings in
      <linux/netdevice.h>.  Fixes these warnings:
      
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'gso_partial_features' not described in 'net_device'
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'l3mdev_ops' not described in 'net_device'
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'xfrmdev_ops' not described in 'net_device'
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'tlsdev_ops' not described in 'net_device'
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'name_assign_type' not described in 'net_device'
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'ieee802154_ptr' not described in 'net_device'
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'mpls_ptr' not described in 'net_device'
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'xdp_prog' not described in 'net_device'
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'gro_flush_timeout' not described in 'net_device'
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'xdp_bulkq' not described in 'net_device'
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'xps_cpus_map' not described in 'net_device'
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'xps_rxqs_map' not described in 'net_device'
      ../include/linux/netdevice.h:2100: warning: Function parameter or member 'qdisc_hash' not described in 'net_device'
      ../include/linux/netdevice.h:3552: WARNING: Inline emphasis start-string without end-string.
      ../include/linux/netdevice.h:3552: WARNING: Inline emphasis start-string without end-string.
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • icmp: introduce helper for nat'd source address in network device context · 0b41713b
      Committed by Jason A. Donenfeld
      This introduces a helper function to be called only by network drivers
      that wraps calls to icmp[v6]_send in a conntrack transformation, in case
      NAT has been used. We don't want to pollute the non-driver path, though,
      so we introduce this as a helper to be called by places that actually
      make use of this, as suggested by Florian.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/flow_dissector: remove unexist field description · 6ee2deb6
      Committed by Hangbin Liu
      @thoff has moved to struct flow_dissector_key_control.
      
      Fixes: 42aecaa9 ("net: Get skb hash over flow_keys structure")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 13 Feb, 2020 2 commits
  4. 12 Feb, 2020 1 commit
  5. 11 Feb, 2020 2 commits
    • ACPI: PM: s2idle: Avoid possible race related to the EC GPE · e3728b50
      Committed by Rafael J. Wysocki
      It is theoretically possible for the ACPI EC GPE to be set after the
      s2idle_ops->wake() called from s2idle_loop() has returned and before
      the subsequent pm_wakeup_pending() check is carried out.  If that
      happens, the resulting wakeup event will cause the system to resume
      even though it may be a spurious one.
      
      To avoid that race, first make the ->wake() callback in struct
      platform_s2idle_ops return a bool value indicating whether or not
      to let the system resume and rearrange s2idle_loop() to use that
      value instead of the direct pm_wakeup_pending() call if ->wake() is
      present.
      
      Next, rework acpi_s2idle_wake() to process EC events and check
      pm_wakeup_pending() before re-arming the SCI for system wakeup
      to prevent it from triggering prematurely and add comments to
      that function to explain the rationale for the new code flow.
      
      Fixes: 56b99184 ("PM: sleep: Simplify suspend-to-idle control flow")
      Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
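The reworked decision in the s2idle commit, where a present ->wake() callback decides whether to resume and pm_wakeup_pending() is only the fallback, can be sketched as follows. This is a reduced userspace model, not the kernel code: `struct s2idle_ops_sketch`, `s2idle_should_resume()`, `wakeup_pending_stub`, and the two sample callbacks are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced model of struct platform_s2idle_ops. */
struct s2idle_ops_sketch {
	bool (*wake)(void);	/* returns true to let the system resume */
};

/* Stub for pm_wakeup_pending(); callers set the flag directly. */
static bool wakeup_pending_stub;
static bool pm_wakeup_pending(void) { return wakeup_pending_stub; }

/* One loop iteration's decision: should the system resume? */
static bool s2idle_should_resume(const struct s2idle_ops_sketch *ops)
{
	if (ops && ops->wake)
		return ops->wake();	/* the platform callback decides */
	return pm_wakeup_pending();	/* fallback when ->wake() is absent */
}

/* Sample callbacks modeling a spurious and a genuine wakeup. */
static bool wake_spurious(void) { return false; }
static bool wake_genuine(void)  { return true; }
```

In this model a spurious event reported by the callback does not resume the system even if the pending flag was set in the meantime, which is the window the commit message describes.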
    • tracing: Consolidate trace() functions · 7276531d
      Committed by Tom Zanussi
      Move the checking, buffer reserve and buffer commit code in
      synth_event_trace_start/end() into inline functions
      __synth_event_trace_start/end() so they can also be used by
      synth_event_trace() and synth_event_trace_array(), and then have all
      those functions use them.
      
      Also, change synth_event_trace_state.enabled to disabled so it only
      needs to be set if the event is disabled, which is not normally the
      case.
      
      Link: http://lkml.kernel.org/r/b1f3108d0f450e58192955a300e31d0405ab4149.1581374549.git.zanussi@kernel.org
      Signed-off-by: Tom Zanussi <zanussi@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  6. 09 Feb, 2020 1 commit
    • pipe: use exclusive waits when reading or writing · 0ddad21d
      Committed by Linus Torvalds
      This makes the pipe code use separate wait-queues and exclusive waiting
      for readers and writers, avoiding a nasty thundering herd problem when
      there are lots of readers waiting for data on a pipe (or, less commonly,
      lots of writers waiting for a pipe to have space).
      
      While this isn't a common occurrence in the traditional "use a pipe as a
      data transport" case, where you typically only have a single reader and
      a single writer process, there is one common special case: using a pipe
      as a source of "locking tokens" rather than for data communication.
      
      In particular, the GNU make jobserver code ends up using a pipe as a way
      to limit parallelism, where each job consumes a token by reading a byte
      from the jobserver pipe, and releases the token by writing a byte back
      to the pipe.
      
      This pattern is fairly traditional on Unix, and works very well, but
      will waste a lot of time waking up a lot of processes when only a single
      reader needs to be woken up when a writer releases a new token.
      
      A simplified test-case of just this pipe interaction is to create 64
      processes, and then pass a single token around between them (this
      test-case also intentionally passes another token that gets ignored to
      test the "wake up next" logic too, in case anybody wonders about it):
      
          #include <unistd.h>
      
          int main(int argc, char **argv)
          {
              int fd[2], counters[2];
      
              pipe(fd);
              counters[0] = 0;
              counters[1] = -1;
              write(fd[1], counters, sizeof(counters));
      
              /* 64 processes */
              fork(); fork(); fork(); fork(); fork(); fork();
      
              do {
                      int i;
                      read(fd[0], &i, sizeof(i));
                      if (i < 0)
                              continue;
                      counters[0] = i+1;
                      write(fd[1], counters, (1+(i & 1)) *sizeof(int));
              } while (counters[0] < 1000000);
              return 0;
          }
      
      and in a perfect world, passing that token around should only cause one
      context switch per transfer, when the writer of a token causes a
      directed wakeup of just a single reader.
      
      But with the "writer wakes all readers" model we traditionally had, on
      my test box the above case causes more than an order of magnitude more
      scheduling: instead of the expected ~1M context switches, "perf stat"
      shows
      
              231,852.37 msec task-clock                #   15.857 CPUs utilized
              11,250,961      context-switches          #    0.049 M/sec
                 616,304      cpu-migrations            #    0.003 M/sec
                   1,648      page-faults               #    0.007 K/sec
       1,097,903,998,514      cycles                    #    4.735 GHz
         120,781,778,352      instructions              #    0.11  insn per cycle
          27,997,056,043      branches                  #  120.754 M/sec
             283,581,233      branch-misses             #    1.01% of all branches
      
            14.621273891 seconds time elapsed
      
             0.018243000 seconds user
             3.611468000 seconds sys
      
      before this commit.
      
      After this commit, I get
      
                5,229.55 msec task-clock                #    3.072 CPUs utilized
               1,212,233      context-switches          #    0.232 M/sec
                 103,951      cpu-migrations            #    0.020 M/sec
                   1,328      page-faults               #    0.254 K/sec
          21,307,456,166      cycles                    #    4.074 GHz
          12,947,819,999      instructions              #    0.61  insn per cycle
           2,881,985,678      branches                  #  551.096 M/sec
              64,267,015      branch-misses             #    2.23% of all branches
      
             1.702148350 seconds time elapsed
      
             0.004868000 seconds user
             0.110786000 seconds sys
      
      instead. Much better.
      
      [ Note! This kernel improvement seems to be very good at triggering a
        race condition in the make jobserver (in GNU make 4.2.1) for me. It's
        a long known bug that was fixed back in June 2017 by GNU make commit
        b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
        avoid hangs.").
      
        But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
        so a number of distributions may still have the buggy version. Some
        have backported the fix to their 4.2.1 release, though, and even
        without the fix it's quite timing-dependent whether the bug actually
        is hit. ]
      
      Josh Triplett says:
       "I've been hammering on your pipe fix patch (switching to exclusive
        wait queues) for a month or so, on several different systems, and I've
        run into no issues with it. The patch *substantially* improves
        parallel build times on large (~100 CPU) systems, both with parallel
        make and with other things that use make's pipe-based jobserver.
      
        All current distributions (including stable and long-term stable
        distributions) have versions of GNU make that no longer have the
        jobserver bug"
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 08 Feb, 2020 13 commits
  8. 07 Feb, 2020 6 commits
    • mac80211: use more bits for ack_frame_id · f2b18bac
      Committed by Johannes Berg
      It turns out that this wasn't a good idea, I hit a test failure in
      hwsim due to this. That particular failure was easily worked around,
      but it raised questions: if an AP needs to, for example, send action
      frames to each connected station, the current limit is nowhere near
      enough (especially if those stations are sleeping and the frames are
      queued for a while.)
      
      Shuffle around some bits to make more room for ack_frame_id to allow
      up to 8192 queued up frames, that's enough for queueing 4 frames to
      each connected station, even at the maximum of 2007 stations on a
      single AP.
      
      We take the bits from band (which currently needs only 2, but I leave 3
      in case we add another band) and from hw_queue, which needs only 4
      since it has a limit of 16 queues.
      
      Fixes: 6912daed ("mac80211: Shrink the size of ack_frame_id to make room for tx_time_est")
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/r/20200115122549.b9a4ef9f4980.Ied52ed90150220b83a280009c590b65d125d087c@changeid
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
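The bit budget in the mac80211 commit can be checked with a little arithmetic. The sketch below is an illustration only: `struct tx_info_bits_sketch` is a hypothetical layout, the 13-bit width for ack_frame_id is derived here from the stated limit of 8192 frames, and the helper names are invented.

```c
#include <stdint.h>

/* Hypothetical layout sketch; widths follow the commit message. */
struct tx_info_bits_sketch {
	unsigned int band:3;		/* 2 needed today, 3 kept as headroom */
	unsigned int ack_frame_id:13;	/* 2^13 = 8192 distinct values */
	unsigned int hw_queue:4;	/* enough for a limit of 16 queues */
};

/* Number of distinct values representable in 'bits' bits (bits < 32). */
static uint32_t ids_for_bits(unsigned int bits)
{
	return UINT32_C(1) << bits;
}

/* Frames that must be tracked when each station has some frames queued. */
static uint32_t frames_needed(uint32_t stations, uint32_t frames_per_station)
{
	return stations * frames_per_station;
}
```

Four frames for each of 2007 stations is 8028 frames, which fits under the 8192 values that 13 bits can represent.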
    • fs: New zonefs file system · 8dcc1a9d
      Committed by Damien Le Moal
      zonefs is a very simple file system exposing each zone of a zoned block
      device as a file. Unlike a regular file system with zoned block device
      support (e.g. f2fs), zonefs does not hide the sequential write
      constraint of zoned block devices from the user. Files representing
      sequential write zones of the device must be written sequentially
      starting from the end of the file (append only writes).
      
      As such, zonefs is in essence closer to a raw block device access
      interface than to a full featured POSIX file system. The goal of zonefs
      is to simplify the implementation of zoned block device support in
      applications by replacing raw block device file accesses with a richer
      file API, avoiding relying on direct block device file ioctls which may
      be more obscure to developers. One example of this approach is the
      implementation of LSM (log-structured merge) tree structures (such as
      used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
      to be stored in a zone file similarly to a regular file system rather
      than as a range of sectors of a zoned device. The introduction of the
      higher level construct "one file is one zone" can help reduce the
      amount of changes needed in the application as well as introducing
      support for different application programming languages.
      
      Zonefs on-disk metadata is reduced to an immutable super block to
      persistently store a magic number and optional feature flags and
      values. On mount, zonefs uses blkdev_report_zones() to obtain the device
      zone configuration and populates the mount point with a static file tree
      solely based on this information. E.g. file sizes come from the device
      zone type and write pointer offset managed by the device itself.
      
      The zone files created on mount have the following characteristics.
      1) Files representing zones of the same type are grouped together
         under a common sub-directory:
           * For conventional zones, the sub-directory "cnv" is used.
           * For sequential write zones, the sub-directory "seq" is used.
         These two directories are the only directories that exist in zonefs.
         Users cannot create other directories and cannot rename nor delete
         the "cnv" and "seq" sub-directories.
      2) The name of zone files is the number of the file within the zone
         type sub-directory, in order of increasing zone start sector.
      3) The size of conventional zone files is fixed to the device zone size.
         Conventional zone files cannot be truncated.
      4) The size of sequential zone files represents the file's zone write
         pointer position relative to the zone start sector. Truncating these
         files is allowed only down to 0, in which case, the zone is reset to
         rewind the zone write pointer position to the start of the zone, or
         up to the zone size, in which case the file's zone is transitioned
         to the FULL state (finish zone operation).
      5) Read and write operations beyond the file zone size are not
         allowed. Any access exceeding the zone size fails with the
         -EFBIG error.
      6) Creating, deleting, renaming or modifying any attribute of files and
         sub-directories is not allowed.
      7) There are no restrictions on the type of read and write operations
         that can be issued to conventional zone files. Buffered, direct and
         mmap read & write operations are accepted. For sequential zone files,
         there are no restrictions on read operations, but all write
         operations must be direct IO append writes. mmap write of sequential
         files is not allowed.
      
      Several optional features of zonefs can be enabled at format time.
      * Conventional zone aggregation: ranges of contiguous conventional
        zones can be aggregated into a single larger file instead of the
        default one file per zone.
      * File ownership: The owner UID and GID of zone files is by default 0
        (root) but can be changed to any valid UID/GID.
      * File access permissions: the default 640 access permissions can be
        changed.
      
      The mkzonefs tool is used to format zoned block devices for use with
      zonefs. This tool is available on GitHub at:
      
      git@github.com:damien-lemoal/zonefs-tools.git.
      
      zonefs-tools also includes a test suite which can be run against any
      zoned block device, including null_blk block device created with zoned
      mode.
      
      Example: the following formats a 15TB host-managed SMR HDD with 256 MB
      zones with the conventional zones aggregation feature enabled.
      
      $ sudo mkzonefs -o aggr_cnv /dev/sdX
      $ sudo mount -t zonefs /dev/sdX /mnt
      $ ls -l /mnt/
      total 0
      dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
      dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
      
      The size of the zone files sub-directories indicates the number of files
      existing for each type of zones. In this example, there is only one
      conventional zone file (all conventional zones are aggregated under a
      single file).
      
      $ ls -l /mnt/cnv
      total 137101312
      -rw-r----- 1 root root 140391743488 Nov 25 13:23 0
      
      This aggregated conventional zone file can be used as a regular file.
      
      $ sudo mkfs.ext4 /mnt/cnv/0
      $ sudo mount -o loop /mnt/cnv/0 /data
      
      The "seq" sub-directory grouping files for sequential write zones has
      in this example 55356 zones.
      
      $ ls -lv /mnt/seq
      total 14511243264
      -rw-r----- 1 root root 0 Nov 25 13:23 0
      -rw-r----- 1 root root 0 Nov 25 13:23 1
      -rw-r----- 1 root root 0 Nov 25 13:23 2
      ...
      -rw-r----- 1 root root 0 Nov 25 13:23 55354
      -rw-r----- 1 root root 0 Nov 25 13:23 55355
      
      For sequential write zone files, the file size changes as data is
      appended at the end of the file, similarly to any regular file system.
      
      $ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
      1+0 records in
      1+0 records out
      4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s
      
      $ ls -l /mnt/seq/0
      -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
      
      The written file can be truncated to the zone size, preventing any
      further write operation.
      
      $ truncate -s 268435456 /mnt/seq/0
      $ ls -l /mnt/seq/0
      -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
      
      Truncation to 0 size frees the file zone storage space and allows
      append writes to the file to restart.
      
      $ truncate -s 0 /mnt/seq/0
      $ ls -l /mnt/seq/0
      -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
      
      Since files are statically mapped to zones on the disk, the number of
      blocks of a file as reported by stat() and fstat() indicates the size
      of the file zone.
      
      $ stat /mnt/seq/0
        File: /mnt/seq/0
        Size: 0       Blocks: 524288     IO Block: 4096   regular empty file
      Device: 870h/2160d      Inode: 50431       Links: 1
      Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/  root)
      Access: 2019-11-25 13:23:57.048971997 +0900
      Modify: 2019-11-25 13:52:25.553805765 +0900
      Change: 2019-11-25 13:52:25.553805765 +0900
       Birth: -
      
      The number of blocks of the file ("Blocks") in units of 512B blocks
      gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
      to the device zone size in this example. Of note is that the "IO block"
      field always indicates the minimum IO size for writes and corresponds
      to the device physical sector size.
      
      This code contains contributions from:
      * Johannes Thumshirn <jthumshirn@suse.de>,
      * Darrick J. Wong <darrick.wong@oracle.com>,
      * Christoph Hellwig <hch@lst.de>,
      * Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> and
      * Ting Yao <tingyao@hust.edu.cn>.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
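A few of the zonefs rules stated in the commit message (truncation of sequential zone files, the -EFBIG limit, and the 512-byte block arithmetic from the stat() example) can be modeled as small pure functions. This is a userspace sketch of the rules as described, not zonefs code; the function names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Rule 4: a sequential zone file may only be truncated to 0 or to the zone size. */
static bool seq_truncate_allowed(uint64_t new_size, uint64_t zone_size)
{
	return new_size == 0 || new_size == zone_size;
}

/* Rule 5: accesses must not extend past the zone size; returns 0 or -EFBIG. */
static int zone_access_check(uint64_t offset, uint64_t len, uint64_t zone_size)
{
	if (offset > zone_size || len > zone_size - offset)
		return -EFBIG;
	return 0;
}

/* stat() reports "Blocks" in 512-byte units. */
static uint64_t blocks_to_bytes(uint64_t blocks)
{
	return blocks * 512;
}
```

For the example device, 524288 blocks of 512 bytes give 268435456 bytes, the same value passed to `truncate -s` in the commit message.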
    • fold struct fs_parameter_enum into struct constant_table · 5eede625
      Committed by Al Viro
      no real difference now
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs_parse: get rid of ->enums · 2710c957
      Committed by Al Viro
      Don't do a single array; attach them to fsparam_enum() entry
      instead.  And don't bother trying to embed the names into those -
      it actually loses memory, with no real speedup worth mentioning.
      
      Simplifies validation as well.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Pass consistent param->type to fs_parse() · 0f89589a
      Committed by Al Viro
      As it is, vfs_parse_fs_string() makes "foo" and "foo=" indistinguishable;
      both get fs_value_is_string for ->type and NULL for ->string.  To make
      it even more unpleasant, that combination is impossible to produce with
      fsconfig().
      
      Much saner rules would be
              "foo"           => fs_value_is_flag, NULL
              "foo="          => fs_value_is_string, ""
              "foo=bar"       => fs_value_is_string, "bar"
      All cases are distinguishable, all results are expressible by fsconfig(),
      ->has_value checks are much simpler that way (to the point of the field
      being useless) and quite a few regressions go away (gfs2 has no business
      accepting -o nodebug=, for example).
      
      Partially based upon patches from Miklos.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
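The three parsing rules proposed in the fs_parse() commit ("foo", "foo=", "foo=bar") can be expressed as a tiny classifier. This is a standalone sketch, not the kernel's vfs_parse_fs_string(): `enum value_type_sketch` and `classify_option()` are hypothetical names modeled on fs_value_is_flag and fs_value_is_string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names modeled on fs_value_is_flag / fs_value_is_string. */
enum value_type_sketch { VALUE_IS_FLAG, VALUE_IS_STRING };

/*
 * Classify one mount option. On return, *value is NULL for a bare flag,
 * or points at the text after the first '=' (possibly the empty string).
 */
static enum value_type_sketch classify_option(const char *opt, const char **value)
{
	const char *eq = strchr(opt, '=');

	if (!eq) {
		*value = NULL;		/* "foo" */
		return VALUE_IS_FLAG;
	}
	*value = eq + 1;		/* "foo=" or "foo=bar" */
	return VALUE_IS_STRING;
}
```

With these rules the three inputs produce three distinct (type, value) pairs, so a caller can reject a bare `nodebug=` where only a flag is expected.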
    • net/mlx5: Deprecate usage of generic TLS HW capability bit · 61c00cca
      Committed by Tariq Toukan
      Deprecate the generic TLS cap bit, use the new TX-specific
      TLS cap bit instead.
      
      Fixes: a12ff35e ("net/mlx5: Introduce TLS TX offload hardware bits and structures")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
  9. 06 Feb, 2020 2 commits
    • skbuff: fix a data race in skb_queue_len() · 86b18aaa
      Committed by Qian Cai
      sk_buff.qlen can be accessed concurrently as noticed by KCSAN,
      
       BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg
      
       read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
        unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
      				 net/unix/af_unix.c:1761
        ____sys_sendmsg+0x33e/0x370
        ___sys_sendmsg+0xa6/0xf0
        __sys_sendmsg+0x69/0xf0
        __x64_sys_sendmsg+0x51/0x70
        do_syscall_64+0x91/0xb47
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
        __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
        __skb_try_recv_datagram+0xbe/0x220
        unix_dgram_recvmsg+0xee/0x850
        ____sys_recvmsg+0x1fb/0x210
        ___sys_recvmsg+0xa2/0xf0
        __sys_recvmsg+0x66/0xf0
        __x64_sys_recvmsg+0x51/0x70
        do_syscall_64+0x91/0xb47
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Because only the read side is lockless, load tearing could introduce a
      logic bug in unix_recvq_full(). Fix it by adding lockless variants of
      skb_queue_len() and unix_recvq_full() that use READ_ONCE() on the read,
      with WRITE_ONCE() on the write, similar to commit d7d16a89 ("net: add
      skb_queue_empty_lockless()").
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: David S. Miller <davem@davemloft.net>
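The pairing of a marked read with a marked write, as described in the skb_queue_len() commit, can be sketched in userspace C. The macros below are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(), restricted to a 32-bit field, and `struct queue_head_sketch` with its helpers is hypothetical. The sketch only shows the access pattern; it makes no claim about full memory-model guarantees.

```c
#include <stdint.h>

/* Simplified stand-ins for READ_ONCE()/WRITE_ONCE() on a 32-bit field. */
#define READ_ONCE_U32(x)	(*(const volatile uint32_t *)&(x))
#define WRITE_ONCE_U32(x, v)	(*(volatile uint32_t *)&(x) = (v))

/* Hypothetical, reduced queue head carrying only the length field. */
struct queue_head_sketch {
	uint32_t qlen;
};

/* Lockless reader: a single volatile load of the length. */
static inline uint32_t queue_len_lockless(const struct queue_head_sketch *q)
{
	return READ_ONCE_U32(q->qlen);
}

/* Writer side (assumed to hold the queue lock): a single volatile store. */
static inline void queue_len_inc(struct queue_head_sketch *q)
{
	WRITE_ONCE_U32(q->qlen, q->qlen + 1);
}
```

The volatile accesses ask the compiler to emit exactly one load or store per marked access rather than splitting or repeating it.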
    • of: clk: Make <linux/of_clk.h> self-contained · 5df86714
      Committed by Geert Uytterhoeven
      Depending on include order:
      
          include/linux/of_clk.h:11:45: warning: ‘struct device_node’ declared inside parameter list will not be visible outside of this definition or declaration
           unsigned int of_clk_get_parent_count(struct device_node *np);
      						 ^~~~~~~~~~~
          include/linux/of_clk.h:12:43: warning: ‘struct device_node’ declared inside parameter list will not be visible outside of this definition or declaration
           const char *of_clk_get_parent_name(struct device_node *np, int index);
      					       ^~~~~~~~~~~
          include/linux/of_clk.h:13:31: warning: ‘struct of_device_id’ declared inside parameter list will not be visible outside of this definition or declaration
           void of_clk_init(const struct of_device_id *matches);
      				   ^~~~~~~~~~~~
      
      Fix this by adding forward declarations for struct device_node and
      struct of_device_id.
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Link: https://lkml.kernel.org/r/20200205194649.31309-1-geert+renesas@glider.be
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
  10. 05 Feb, 2020 3 commits
    • bonding/alb: properly access headers in bond_alb_xmit() · 38f88c45
      Committed by Eric Dumazet
      syzbot managed to send an IPX packet through bond_alb_xmit()
      and af_packet and triggered a use-after-free.
      
      First, bond_alb_xmit() was using ipx_hdr() helper to reach
      the IPX header, but ipx_hdr() was using the transport offset
      instead of the network offset. In the particular syzbot
      report, the transport offset was 0xFFFF.
      
      This patch removes ipx_hdr() since it was only (mis)used from bonding.
      
      Then we need to make sure IPv4/IPv6/IPX headers are pulled
      in skb->head before dereferencing anything.
      
      BUG: KASAN: use-after-free in bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
      Read of size 2 at addr ffff8801ce56dfff by task syz-executor.2/18108
       (if (ipx_hdr(skb)->ipx_checksum != IPX_NO_CHECKSUM) ...)
      
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       [<ffffffff8441fc42>] __dump_stack lib/dump_stack.c:17 [inline]
       [<ffffffff8441fc42>] dump_stack+0x14d/0x20b lib/dump_stack.c:53
       [<ffffffff81a7dec4>] print_address_description+0x6f/0x20b mm/kasan/report.c:282
       [<ffffffff81a7e0ec>] kasan_report_error mm/kasan/report.c:380 [inline]
       [<ffffffff81a7e0ec>] kasan_report mm/kasan/report.c:438 [inline]
       [<ffffffff81a7e0ec>] kasan_report.cold+0x8c/0x2a0 mm/kasan/report.c:422
       [<ffffffff81a7dc4f>] __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:469
       [<ffffffff82c8c00a>] bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
       [<ffffffff82c60c74>] __bond_start_xmit drivers/net/bonding/bond_main.c:4199 [inline]
       [<ffffffff82c60c74>] bond_start_xmit+0x4f4/0x1570 drivers/net/bonding/bond_main.c:4224
       [<ffffffff83baa558>] __netdev_start_xmit include/linux/netdevice.h:4525 [inline]
       [<ffffffff83baa558>] netdev_start_xmit include/linux/netdevice.h:4539 [inline]
       [<ffffffff83baa558>] xmit_one net/core/dev.c:3611 [inline]
       [<ffffffff83baa558>] dev_hard_start_xmit+0x168/0x910 net/core/dev.c:3627
       [<ffffffff83bacf35>] __dev_queue_xmit+0x1f55/0x33b0 net/core/dev.c:4238
       [<ffffffff83bae3a8>] dev_queue_xmit+0x18/0x20 net/core/dev.c:4278
       [<ffffffff84339189>] packet_snd net/packet/af_packet.c:3226 [inline]
       [<ffffffff84339189>] packet_sendmsg+0x4919/0x70b0 net/packet/af_packet.c:3252
       [<ffffffff83b1ac0c>] sock_sendmsg_nosec net/socket.c:673 [inline]
       [<ffffffff83b1ac0c>] sock_sendmsg+0x12c/0x160 net/socket.c:684
       [<ffffffff83b1f5a2>] __sys_sendto+0x262/0x380 net/socket.c:1996
       [<ffffffff83b1f700>] SYSC_sendto net/socket.c:2008 [inline]
       [<ffffffff83b1f700>] SyS_sendto+0x40/0x60 net/socket.c:2004
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
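The principle behind the bonding fix, verifying that a header lies inside the available data before dereferencing it, can be shown with a userspace analogy. This is not the kernel's skb API: `struct pkt_sketch`, `header_in_bounds()`, and `read_u16_checked()` are hypothetical names, and the packet is modeled as one linear buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced packet: a linear buffer and its length. */
struct pkt_sketch {
	const uint8_t *data;
	size_t len;
};

/* True only if [offset, offset + hdr_len) lies inside the buffer. */
static bool header_in_bounds(const struct pkt_sketch *p, size_t offset,
			     size_t hdr_len)
{
	return offset <= p->len && hdr_len <= p->len - offset;
}

/* Read a 16-bit field only after the bounds check succeeds. */
static bool read_u16_checked(const struct pkt_sketch *p, size_t offset,
			     uint16_t *out)
{
	if (!header_in_bounds(p, offset, sizeof(*out)))
		return false;
	memcpy(out, p->data + offset, sizeof(*out));
	return true;
}
```

An offset such as the 0xFFFF seen in the syzbot report is rejected by the check instead of being used to read past the buffer.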
    • net: dsa: microchip: Platform data shan't include kernel.h · 8b7a07c7
      Committed by Andy Shevchenko
      Replace with appropriate types.h.
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: b53: Platform data shan't include kernel.h · e22e0790
      Committed by Andy Shevchenko
      Replace with appropriate types.h.
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 04 Feb, 2020 5 commits