1. 02 Oct, 2020 1 commit
    • L
      pipe: remove pipe_wait() and fix wakeup race with splice · 472e5b05
      Linus Torvalds committed
      The pipe splice code still used the old model of waiting for pipe IO by
      using a non-specific "pipe_wait()" that waited for any pipe event to
      happen, which depended on all pipe IO being entirely serialized by the
      pipe lock.  So by checking the state you were waiting for, and then
      adding yourself to the wait queue before dropping the lock, you were
      guaranteed to see all the wakeups.
      
      Strictly speaking, the actual wakeups were not done under the lock, but
      the pipe_wait() model still worked, because since the waiter held the
      lock when checking whether it should sleep, it would always see the
      current state, and the wakeup was always done after updating the state.
      
      However, commit 0ddad21d ("pipe: use exclusive waits when reading or
      writing") split the single wait-queue into two, and in the process also
      made the "wait for event" code wait for _two_ wait queues, and that then
      showed a race with the wakers that were not serialized by the pipe lock.
      
      It's only splice that used that "pipe_wait()" model, so the problem
      wasn't obvious, but Josef Bacik reports:
      
       "I hit a hang with fstest btrfs/187, which does a btrfs send into
        /dev/null. This works by creating a pipe, the write side is given to
        the kernel to write into, and the read side is handed to a thread that
        splices into a file, in this case /dev/null.
      
        The box that was hung had the write side stuck here [pipe_write] and
        the read side stuck here [splice_from_pipe_next -> pipe_wait].
      
        [ more details about pipe_wait() scenario ]
      
        The problem is we're doing the prepare_to_wait, which sets our state
        each time, however we can be woken up either with reads or writes. In
        the case above we race with the WRITER waking us up, and re-set our
        state to INTERRUPTIBLE, and thus never break out of schedule"
      
      Josef had a patch that avoided the issue in pipe_wait() by just making
      it set the state only once, but the deeper problem is that pipe_wait()
      depends on a level of synchronization by the pipe mutex that it really
      shouldn't.  And the whole "wait for any pipe state change" model really
      isn't very good to begin with.
      
      So rather than trying to work around things in pipe_wait(), remove that
      legacy model of "wait for arbitrary pipe event" entirely, and actually
      create functions that wait for the pipe actually being readable or
      writable, and can do so without depending on the pipe lock serializing
      everything.
      
      Fixes: 0ddad21d ("pipe: use exclusive waits when reading or writing")
      Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
      Reported-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      472e5b05
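
The "wait for the pipe actually being readable or writable" model from the commit above can be illustrated in user space. This is only a minimal sketch under stated assumptions, not the kernel's code: it uses POSIX threads, and `struct chan`, `chan_put()` and `chan_get()` are hypothetical names. Readers and writers sleep on separate condition variables, and each waiter re-checks its own predicate after every wakeup instead of waiting for "any event". Unlike the kernel fix, the sketch still serializes all state changes with a mutex; it only shows the per-condition waiting.

```c
#include <pthread.h>

#define CHAN_SLOTS 4            /* hypothetical capacity, a power of two */

struct chan {
        pthread_mutex_t lock;
        pthread_cond_t  rd_wait;    /* readers sleep here until data arrives */
        pthread_cond_t  wr_wait;    /* writers sleep here until space appears */
        int             slots[CHAN_SLOTS];
        unsigned int    head, tail; /* free-running producer/consumer counters */
};

#define CHAN_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, \
                    PTHREAD_COND_INITIALIZER, {0}, 0, 0 }

static int chan_readable(const struct chan *c)
{
        return c->head != c->tail;
}

static int chan_writable(const struct chan *c)
{
        return c->head - c->tail < CHAN_SLOTS;
}

/* Block until there is space, store one value, then wake one reader. */
void chan_put(struct chan *c, int v)
{
        pthread_mutex_lock(&c->lock);
        while (!chan_writable(c))           /* wait for *this* condition only */
                pthread_cond_wait(&c->wr_wait, &c->lock);
        c->slots[c->head++ % CHAN_SLOTS] = v;
        pthread_mutex_unlock(&c->lock);
        pthread_cond_signal(&c->rd_wait);
}

/* Block until there is data, take one value, then wake one writer. */
int chan_get(struct chan *c)
{
        int v;

        pthread_mutex_lock(&c->lock);
        while (!chan_readable(c))           /* wait for *this* condition only */
                pthread_cond_wait(&c->rd_wait, &c->lock);
        v = c->slots[c->tail++ % CHAN_SLOTS];
        pthread_mutex_unlock(&c->lock);
        pthread_cond_signal(&c->wr_wait);
        return v;
}
```

A channel would be declared as `struct chan c = CHAN_INIT;` and shared between producer and consumer threads.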
  2. 21 May, 2020 3 commits
  3. 19 May, 2020 3 commits
    • D
      pipe: Add notification lossage handling · e7d553d6
      David Howells committed
      Add handling for loss of notifications by having read() insert a
      loss-notification message after it has read the pipe buffer that was last
      in the ring when the loss occurred.
      
      Lossage can come about either by running out of notification descriptors or
      by running out of space in the pipe ring.
      Signed-off-by: David Howells <dhowells@redhat.com>
      e7d553d6
    • D
      pipe: Allow buffers to be marked read-whole-or-error for notifications · 8cfba763
      David Howells committed
      Allow a buffer to be marked such that read() must return the entire buffer
      in one go or return ENOBUFS.  Multiple buffers can be amalgamated into a
      single read, but a short read will occur if the next "whole" buffer won't
      fit.
      
      This is useful for watch queue notifications to make sure we don't split a
      notification across multiple reads, especially given that we need to
      fabricate an overrun record under some circumstances - and that isn't in
      the buffers.
      Signed-off-by: David Howells <dhowells@redhat.com>
      8cfba763
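
The read-whole-or-error rule described in the commit above can be modelled in a few lines. This is a hedged sketch of the stated behaviour, not the kernel implementation: `whole_read_model()` is a hypothetical helper that takes the sizes of the queued "whole" buffers and the size of the caller's read buffer, and reports what a single read() would return according to the description.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Model only: lens[] holds the sizes of the queued "whole" buffers,
 * oldest first, and cap is the size of the user's read buffer.
 * Returns the number of bytes one read() would consume, or -ENOBUFS
 * if even the first buffer does not fit.  An empty queue returns 0
 * here; a real pipe read would block or return EAGAIN instead.
 */
ssize_t whole_read_model(const size_t *lens, size_t nbufs, size_t cap)
{
        size_t total = 0;

        for (size_t i = 0; i < nbufs; i++) {
                if (lens[i] > cap - total) {
                        if (total == 0)
                                return -ENOBUFS;  /* cannot split a buffer */
                        break;                    /* short read: next one won't fit */
                }
                total += lens[i];                 /* amalgamate whole buffers */
        }
        return (ssize_t)total;
}
```

With queued buffers of 16, 24 and 40 bytes, a 50-byte read consumes the first two buffers (40 bytes) and leaves the third for the next call.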
    • D
      pipe: Add general notification queue support · c73be61c
      David Howells committed
      Make it possible to have a general notification queue built on top of a
      standard pipe.  Notifications are 'spliced' into the pipe and then read
      out.  splice(), vmsplice() and sendfile() are forbidden on pipes used for
      notifications as post_one_notification() cannot take pipe->mutex.  This
      means that notifications could be posted in between individual pipe
      buffers, making iov_iter_revert() difficult to effect.
      
      The way the notification queue is used is:
      
       (1) An application opens a pipe with a special flag and indicates the
           number of messages it wishes to be able to queue at once (this can
           only be set once):
      
      	pipe2(fds, O_NOTIFICATION_PIPE);
      	ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
      
       (2) The application then uses poll() and read() as normal to extract data
           from the pipe.  read() will return multiple notifications if the
           buffer is big enough, but it will not split a notification across
           buffers - rather it will return a short read or EMSGSIZE.
      
           Notification messages include a length in the header so that the
           caller can split them up.
      
      Each message has a header that describes it:
      
      	struct watch_notification {
      		__u32	type:24;
      		__u32	subtype:8;
      		__u32	info;
      	};
      
      The type indicates the source (eg. mount tree changes, superblock events,
      keyring changes, block layer events) and the subtype indicates the event
      type (eg. mount, unmount; EIO, EDQUOT; link, unlink).  The info field
      indicates a number of things, including the entry length, an ID assigned to
      a watchpoint contributing to this buffer and type-specific flags.
      
      Supplementary data, such as the key ID that generated an event, can be
      attached in additional slots.  The maximum message size is 127 bytes.
      Messages may not be padded or aligned, so there is no guarantee, for
      example, that the notification type will be on a 4-byte boundary.
      Signed-off-by: David Howells <dhowells@redhat.com>
      c73be61c
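
Because read() may return several notifications at once and messages need not be aligned, a reader has to walk the buffer using the length stored in each header. The following is a minimal sketch of that walk. It assumes that the entry length occupies the low 7 bits of `info` (consistent with the 127-byte maximum stated above, but an assumption here), it uses `uint32_t` in place of `__u32`, and `count_notifications()` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct watch_notification {
        uint32_t type:24;
        uint32_t subtype:8;
        uint32_t info;
};

/* Assumption: the entry length sits in the low 7 bits of 'info'. */
#define INFO_LENGTH_MASK 0x7fu

/*
 * Count the complete notifications in buf[0..len).
 * Returns -1 if a header is truncated or a length is implausible.
 */
int count_notifications(const unsigned char *buf, size_t len)
{
        size_t off = 0;
        int n = 0;

        while (off < len) {
                struct watch_notification hdr;
                size_t mlen;

                if (len - off < sizeof(hdr))
                        return -1;
                /* Messages may be unaligned, so copy the header out first. */
                memcpy(&hdr, buf + off, sizeof(hdr));
                mlen = hdr.info & INFO_LENGTH_MASK;
                if (mlen < sizeof(hdr) || mlen > len - off)
                        return -1;
                off += mlen;    /* skip header plus any supplementary data */
                n++;
        }
        return n;
}
```

A real consumer would dispatch on `hdr.type` and `hdr.subtype` inside the loop instead of only counting.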
  4. 03 Apr, 2020 1 commit
  5. 19 Feb, 2020 1 commit
    • L
      pipe: make sure to wake up everybody when the last reader/writer closes · 6551d5c5
      Linus Torvalds committed
      Andrei Vagin reported that commit 0ddad21d ("pipe: use exclusive
      waits when reading or writing") broke one of the CRIU tests.  He even
      has a trivial reproducer:
      
          #include <unistd.h>
          #include <sys/types.h>
          #include <sys/wait.h>
      
          int main()
          {
                  int p[2];
                  pid_t p1, p2;
                  int status;
      
                  if (pipe(p) == -1)
                          return 1;
      
                  p1 = fork();
                  if (p1 == 0) {
                          close(p[1]);
                          read(p[0], &status, sizeof(status));
                          return 0;
                  }
                  p2 = fork();
                  if (p2 == 0) {
                          close(p[1]);
                          read(p[0], &status, sizeof(status));
                          return 0;
                  }
                  sleep(1);
                  close(p[1]);
                  wait(&status);
                  wait(&status);
      
                  return 0;
          }
      
      and the problem - once he points it out - is obvious.  We use these nice
      exclusive waits, but when the last writer goes away, it then needs to
      wake up _every_ reader (and conversely, the last reader disappearing
      needs to wake every writer, of course).
      
      In fact, when going through this, we had several small oddities around
      how to wake things.  We did in fact wake every reader when we changed
      the size of the pipe buffers.  But that's entirely pointless, since that
      just acts as a possible source of new space - no new data to read.
      
      And when we change the size of the buffer, we don't need to wake all
      writers even when we add space - that case acts just as if somebody made
      space by reading, and any writer that finds itself not filling it up
      entirely will wake the next one.
      
      On the other hand, on the exit path, we tried to limit the wakeups with
      the proper poll keys etc, which is entirely pointless, because at that
      point we obviously need to wake up everybody.  So don't do that: just
      wake up everybody - but only do that if the counts changed to zero.
      
      So fix those non-IO wakeups to be more proper: space change doesn't add
      any new data, but it might make room for writers, so it wakes up a
      writer.  And the actual changes to reader/writer counts should wake up
      everybody, since everybody is affected (ie readers will all see EOF if
      the writers have gone away, and writers will all get EPIPE if all
      readers have gone away).
      
      Fixes: 0ddad21d ("pipe: use exclusive waits when reading or writing")
      Reported-and-tested-by: Andrei Vagin <avagin@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6551d5c5
  6. 09 Feb, 2020 1 commit
    • L
      pipe: use exclusive waits when reading or writing · 0ddad21d
      Linus Torvalds committed
      This makes the pipe code use separate wait-queues and exclusive waiting
      for readers and writers, avoiding a nasty thundering herd problem when
      there are lots of readers waiting for data on a pipe (or, less commonly,
      lots of writers waiting for a pipe to have space).
      
      While this isn't a common occurrence in the traditional "use a pipe as a
      data transport" case, where you typically only have a single reader and
      a single writer process, there is one common special case: using a pipe
      as a source of "locking tokens" rather than for data communication.
      
      In particular, the GNU make jobserver code ends up using a pipe as a way
      to limit parallelism, where each job consumes a token by reading a byte
      from the jobserver pipe, and releases the token by writing a byte back
      to the pipe.
      
      This pattern is fairly traditional on Unix, and works very well, but
      will waste a lot of time waking up a lot of processes when only a single
      reader needs to be woken up when a writer releases a new token.
      
      A simplified test-case of just this pipe interaction is to create 64
      processes, and then pass a single token around between them (this
      test-case also intentionally passes another token that gets ignored to
      test the "wake up next" logic too, in case anybody wonders about it):
      
          #include <unistd.h>
      
          int main(int argc, char **argv)
          {
              int fd[2], counters[2];
      
              pipe(fd);
              counters[0] = 0;
              counters[1] = -1;
              write(fd[1], counters, sizeof(counters));
      
              /* 64 processes */
              fork(); fork(); fork(); fork(); fork(); fork();
      
              do {
                      int i;
                      read(fd[0], &i, sizeof(i));
                      if (i < 0)
                              continue;
                      counters[0] = i+1;
                      write(fd[1], counters, (1+(i & 1)) *sizeof(int));
              } while (counters[0] < 1000000);
              return 0;
          }
      
      and in a perfect world, passing that token around should only cause one
      context switch per transfer, when the writer of a token causes a
      directed wakeup of just a single reader.
      
      But with the "writer wakes all readers" model we traditionally had, on
      my test box the above case causes more than an order of magnitude more
      scheduling: instead of the expected ~1M context switches, "perf stat"
      shows
      
              231,852.37 msec task-clock                #   15.857 CPUs utilized
              11,250,961      context-switches          #    0.049 M/sec
                 616,304      cpu-migrations            #    0.003 M/sec
                   1,648      page-faults               #    0.007 K/sec
       1,097,903,998,514      cycles                    #    4.735 GHz
         120,781,778,352      instructions              #    0.11  insn per cycle
          27,997,056,043      branches                  #  120.754 M/sec
             283,581,233      branch-misses             #    1.01% of all branches
      
            14.621273891 seconds time elapsed
      
             0.018243000 seconds user
             3.611468000 seconds sys
      
      before this commit.
      
      After this commit, I get
      
                5,229.55 msec task-clock                #    3.072 CPUs utilized
               1,212,233      context-switches          #    0.232 M/sec
                 103,951      cpu-migrations            #    0.020 M/sec
                   1,328      page-faults               #    0.254 K/sec
          21,307,456,166      cycles                    #    4.074 GHz
          12,947,819,999      instructions              #    0.61  insn per cycle
           2,881,985,678      branches                  #  551.096 M/sec
              64,267,015      branch-misses             #    2.23% of all branches
      
             1.702148350 seconds time elapsed
      
             0.004868000 seconds user
             0.110786000 seconds sys
      
      instead. Much better.
      
      [ Note! This kernel improvement seems to be very good at triggering a
        race condition in the make jobserver (in GNU make 4.2.1) for me. It's
        a long known bug that was fixed back in June 2017 by GNU make commit
        b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
        avoid hangs.").
      
        But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
        so a number of distributions may still have the buggy version. Some
        have backported the fix to their 4.2.1 release, though, and even
        without the fix it's quite timing-dependent whether the bug actually
        is hit. ]
      
      Josh Triplett says:
       "I've been hammering on your pipe fix patch (switching to exclusive
        wait queues) for a month or so, on several different systems, and I've
        run into no issues with it. The patch *substantially* improves
        parallel build times on large (~100 CPU) systems, both with parallel
        make and with other things that use make's pipe-based jobserver.
      
        All current distributions (including stable and long-term stable
        distributions) have versions of GNU make that no longer have the
        jobserver bug"
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ddad21d
  7. 23 Dec, 2019 1 commit
    • J
      pipe: fix empty pipe check in pipe_write() · 0dd1e377
      Jan Stancek committed
      LTP pipeio_1 test is hanging with v5.5-rc2-385-gb8e382a1,
      with read side observing empty pipe and sleeping and write
      side running out of space and then sleeping as well. In this
      scenario there are 5 writers and 1 reader.
      
      Problem is that after pipe_write() reacquires pipe lock, it
      re-checks for empty pipe with potentially stale 'head' and
      doesn't wake up read side anymore. pipe->tail can advance
      beyond 'head', because there are multiple writers.
      
      Use pipe->head for empty pipe check after reacquiring lock
      to observe current state.
      
      Testing: With patch, LTP pipeio_1 ran successfully in loop for 1 hour.
               Without patch it hung within a minute.
      
      Fixes: 1b6b26ae ("pipe: fix and clarify pipe write wakeup logic")
      Reported-by: Rachel Sibley <rasibley@redhat.com>
      Signed-off-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0dd1e377
  8. 12 Dec, 2019 1 commit
    • L
      pipe: simplify signal handling in pipe_read() and add comments · d1c6a2aa
      Linus Torvalds committed
      There's no need to separately check for signals while inside the locked
      region, since we're going to do "wait_event_interruptible()" right
      afterwards anyway, and the error handling is much simpler there.
      
      The check for whether we had already read anything was also redundant,
      since we no longer do the odd merging of reads when there are pending
      writers.
      
      But perhaps more importantly, this adds commentary about why we still
      need to wake up possible writers even though we didn't read any data,
      and why we can skip all the finishing touches now if we get a signal (or
      had a signal pending) while waiting for more data.
      
      [ This is a split-out cleanup from my "make pipe IO use exclusive wait
        queues" thing, which I can't apply because it triggers a nasty bug in
        the GNU make jobserver   - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d1c6a2aa
  9. 08 Dec, 2019 5 commits
    • L
      pipe: don't use 'pipe_wait() for basic pipe IO · 85190d15
      Linus Torvalds committed
      pipe_wait() may be simple, but since it relies on the pipe lock, it
      means that we have to do the wakeup while holding the lock.  That's
      unfortunate, because the very first thing the waked entity will want to
      do is to get the pipe lock for itself.
      
      So get rid of the pipe_wait() usage by simply releasing the pipe lock,
      doing the wakeup (if required) and then using wait_event_interruptible()
      to wait on the right condition instead.
      
      wait_event_interruptible() handles races on its own by comparing the
      wakeup condition before and after adding itself to the wait queue, so
      you can use an optimistic unlocked condition for it.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      85190d15
    • L
      pipe: remove 'waiting_writers' merging logic · a28c8b9d
      Linus Torvalds committed
      This code is ancient, and goes back to when we only had a single page
      for the pipe buffers.  The exact history is hidden in the mists of time
      (ie "before git", and in fact predates the BK repository too).
      
      At that long-ago point in time, it actually helped to try to merge big
      back-and-forth pipe reads and writes, and not limit pipe reads to the
      single pipe buffer in length just because that was all we had at a time.
      
      However, since then we've expanded the pipe buffers to multiple pages,
      and this logic really doesn't seem to make sense.  And a lot of it is
      somewhat questionable (ie "hmm, the user asked for a non-blocking read,
      but we see that there's a writer pending, so let's wait anyway to get
      the extra data that the writer will have").
      
      But more importantly, it makes the "go to sleep" logic much less
      obvious, and considering the wakeup issues we've had, I want to make for
      less of those kinds of things.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a28c8b9d
    • L
      pipe: fix and clarify pipe read wakeup logic · f467a6a6
      Linus Torvalds committed
      This is the read side version of the previous commit: it simplifies the
      logic to only wake up waiting writers when necessary, and makes sure to
      use a synchronous wakeup.  This time not so much for GNU make jobserver
      reasons (that pipe never fills up), but simply to get the writer going
      quickly again.
      
      A bit less verbose commentary this time, if only because I assume that
      the write side commentary isn't going to be ignored if you touch this
      code.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f467a6a6
    • L
      pipe: fix and clarify pipe write wakeup logic · 1b6b26ae
      Linus Torvalds committed
      The pipe rework ends up having been extra painful, partly because of
      actual bugs with ordering and caching of the pipe state, but also
      because of subtle performance issues.
      
      In particular, the pipe rework caused the kernel build to inexplicably
      slow down.
      
      The reason turns out to be that the GNU make jobserver (which limits the
      parallelism of the build) uses a pipe to implement a "token" system: a
      parallel submake will read a character from the pipe to get the job
      token before starting a new job, and will write a character back to the
      pipe when it is done.  The overall job limit is thus easily controlled
      by just writing the appropriate number of initial token characters into
      the pipe.
      
      But to work well, that really means that the old behavior of write
      wakeups being synchronous (WF_SYNC) is very important - when the pipe
      writer wakes up a reader, we want the reader to actually get scheduled
      immediately.  Otherwise you lose the parallelism of the build.
      
      The pipe rework lost that synchronous wakeup on write, and we had
      clearly all forgotten the reasons and rules for it.
      
      This rewrites the pipe write wakeup logic to do the required WF_SYNC
      wakeups, but also clarifies the logic and avoids extraneous wakeups.
      
      It also ends up adding a number of comments about what it does and why,
      so that we hopefully don't end up forgetting about this next time we
      change this code.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b6b26ae
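
The jobserver token scheme described in the commit above is easy to reproduce in isolation. The sketch below is not GNU make's code: `tokens_init()`, `tokens_try_acquire()` and `tokens_release()` are hypothetical helpers, and the acquire step is made non-blocking here purely so that the "no token left" case can be observed without hanging.

```c
#include <fcntl.h>
#include <unistd.h>

/* Create the token pipe and preload it with 'ntokens' one-byte tokens. */
int tokens_init(int fds[2], int ntokens)
{
        if (pipe(fds) == -1)
                return -1;
        for (int i = 0; i < ntokens; i++)
                if (write(fds[1], "+", 1) != 1)
                        return -1;
        return 0;
}

/* Try to take one token: 0 on success, -1 if none is available. */
int tokens_try_acquire(int fds[2])
{
        char c;
        ssize_t n;
        int flags = fcntl(fds[0], F_GETFL);

        if (flags == -1 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1)
                return -1;
        n = read(fds[0], &c, 1);        /* consuming a byte takes a token */
        fcntl(fds[0], F_SETFL, flags);  /* restore the original flags */
        return n == 1 ? 0 : -1;
}

/* Give a token back by writing one byte into the pipe. */
int tokens_release(int fds[2])
{
        return write(fds[1], "+", 1) == 1 ? 0 : -1;
}
```

The overall job limit is simply the number of bytes preloaded by `tokens_init()`; every worker pairs one successful acquire with one release.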
    • L
      pipe: fix poll/select race introduced by the pipe rework · ad910e36
      Linus Torvalds committed
      The kernel wait queues have a basic rule to them: you add yourself to
      the wait-queue first, and then you check the things that you're going to
      wait on.  That avoids the races with the event you're waiting for.
      
      The same goes for poll/select logic: the "poll_wait()" goes first, and
      then you check the things you're polling for.
      
      Of course, if you use locking, the ordering doesn't matter since the
      lock will serialize with anything that changes the state you're looking
      at. That's not the case here, though.
      
      So move the poll_wait() first in pipe_poll(), before you start looking
      at the pipe state.
      
      Fixes: 8cefc107 ("pipe: Use head and tail pointers for the ring, not cursor and length")
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad910e36
  10. 06 Dec, 2019 2 commits
    • D
      pipe: Fix missing mask update after pipe_wait() · 8f868d68
      David Howells committed
      Fix pipe_write() to not cache the ring index mask and max_usage as their
      values are invalidated by calling pipe_wait() because the latter
      function drops the pipe lock, thereby allowing F_SETPIPE_SZ to change them.
      Without this, pipe_write() may subsequently miscalculate the array
      indices and pipe fullness, leading to an oops like the following:
      
        BUG: KASAN: slab-out-of-bounds in pipe_write+0xc25/0xe10 fs/pipe.c:481
        Write of size 8 at addr ffff8880771167a8 by task syz-executor.3/7987
        ...
        CPU: 1 PID: 7987 Comm: syz-executor.3 Not tainted 5.4.0-rc2-syzkaller #0
        ...
        Call Trace:
          pipe_write+0xc25/0xe10 fs/pipe.c:481
          call_write_iter include/linux/fs.h:1895 [inline]
          new_sync_write+0x3fd/0x7e0 fs/read_write.c:483
          __vfs_write+0x94/0x110 fs/read_write.c:496
          vfs_write+0x18a/0x520 fs/read_write.c:558
          ksys_write+0x105/0x220 fs/read_write.c:611
          __do_sys_write fs/read_write.c:623 [inline]
          __se_sys_write fs/read_write.c:620 [inline]
          __x64_sys_write+0x6e/0xb0 fs/read_write.c:620
          do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is not a problem for pipe_read() as the mask is recalculated on
      each pass of the loop, after pipe_wait() has been called.
      
      Fixes: 8cefc107 ("pipe: Use head and tail pointers for the ring, not cursor and length")
      Reported-by: syzbot+838eb0878ffd51f27c41@syzkaller.appspotmail.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      [ Changed it to use a temporary variable 'mask' to avoid long lines -Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f868d68
    • D
      pipe: Remove assertion from pipe_poll() · 8c7b8c34
      David Howells committed
      An assertion check was added to pipe_poll() to make sure that the ring
      occupancy isn't seen to overflow the ring size.  However, since no locks
      are held when the three values are read, it is possible for F_SETPIPE_SZ
      to intervene and muck up the calculation, thereby causing the oops.
      
      Fix this by simply removing the assertion and accepting that the
      calculation might be approximate.
      
      Note that the previous code also had a similar issue, though there was
      no assertion check, since the occupancy counter and the ring size were
      not read with a lock held, so it's possible that the poll check might
      have malfunctioned then too.
      
      Also wake up all the waiters so that they can reissue their checks if
      there was a competing read or write.
      
      Fixes: 8cefc107 ("pipe: Use head and tail pointers for the ring, not cursor and length")
      Reported-by: syzbot+d37abaade33a934f16f2@syzkaller.appspotmail.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Eric Biggers <ebiggers@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c7b8c34
  11. 26 Nov, 2019 1 commit
    • L
      vfs: mark pipes and sockets as stream-like file descriptors · d8e464ec
      Linus Torvalds committed
      In commit 3975b097 ("convert stream-like files -> stream_open, even
      if they use noop_llseek") Kirill used a coccinelle script to change
      "nonseekable_open()" to "stream_open()", which changed the trivial cases
      of stream-like file descriptors to the new model with FMODE_STREAM.
      
      However, the two big cases - sockets and pipes - don't actually have
      that trivial pattern at all, and were thus never converted to
      FMODE_STREAM even though it makes lots of sense to do so.
      
      That's particularly true when looking forward to the next change:
      getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
      decide whether f_pos updates are needed or not.  And if they are, we'll
      always do them atomically.
      
      This came up because KCSAN (correctly) noted that the non-locked f_pos
      updates are data races: they are clearly benign for the case where we
      don't care, but it would be good to just not have that issue exist at
      all.
      
      Note that the reason we used FMODE_ATOMIC_POS originally is that only
      doing it for the minimal required case is "safer" in that it's possible
      that the f_pos locking can cause unnecessary serialization across the
      whole write() call.  And in the worst case, that kind of serialization
      can cause deadlock issues: think writers that need readers to empty the
      state using the same file descriptor.
      
      [ Note that the locking is per-file descriptor - because it protects
        "f_pos", which is obviously per-file descriptor - so it only affects
        cases where you literally use the same file descriptor to both read
        and write.
      
        So a regular pipe that has separate reading and writing file
        descriptors doesn't really have this situation even though it's the
        obvious case of "reader empties what a big writer concurrently fills"

        But we want to mark pipes as being stream-like anyway, because we
        don't want the unnecessary overhead of locking, and because a named
        pipe can be (ab-)used by reading and writing to the same file
        descriptor. ]
      
      There are likely a lot of other cases that might want FMODE_STREAM, and
      looking for ".llseek = no_llseek" users and other cases that don't have
      an lseek file operation at all and making them use "stream_open()" might
      be a good idea.  But pipes and sockets are likely to be the two main
      cases.
      
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8e464ec
  12. 16 Nov, 2019 8 commits
  13. 31 Oct, 2019 1 commit
    • D
      pipe: Use head and tail pointers for the ring, not cursor and length · 8cefc107
      David Howells committed
      Convert pipes to use head and tail pointers for the buffer ring rather than
      pointer and length as the latter requires two atomic ops to update (or a
      combined op) whereas the former only requires one.
      
       (1) The head pointer is the point at which production occurs and points to
           the slot in which the next buffer will be placed.  This is equivalent
           to pipe->curbuf + pipe->nrbufs.
      
           The head pointer belongs to the write-side.
      
       (2) The tail pointer is the point at which consumption occurs.  It points
           to the next slot to be consumed.  This is equivalent to pipe->curbuf.
      
           The tail pointer belongs to the read-side.
      
       (3) head and tail are allowed to run to UINT_MAX and wrap naturally.  They
           are only masked off when the array is being accessed, e.g.:
      
      	pipe->bufs[head & mask]
      
           This means that it is not necessary to have a dead slot in the ring as
           head == tail isn't ambiguous.
      
       (4) The ring is empty if "head == tail".
      
           A helper, pipe_empty(), is provided for this.
      
       (5) The occupancy of the ring is "head - tail".
      
           A helper, pipe_occupancy(), is provided for this.
      
       (6) The number of free slots in the ring is "pipe->ring_size - occupancy".
      
           A helper, pipe_space_for_user() is provided to indicate how many slots
           userspace may use.
      
       (7) The ring is full if "head - tail >= pipe->ring_size".
      
           A helper, pipe_full(), is provided for this.
      Signed-off-by: David Howells <dhowells@redhat.com>
      8cefc107
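
The head/tail arithmetic listed in the commit above relies on unsigned wrap-around, which is easy to check in isolation. The helpers below are a stand-alone sketch that mirrors the listed rules. They are not the kernel's definitions: the `ring_*` names are hypothetical, and `ring_size` is assumed to be a power of two so that masking works.

```c
#include <limits.h>
#include <stdbool.h>

/* Occupancy is "head - tail"; unsigned subtraction survives wrap-around. */
unsigned int ring_occupancy(unsigned int head, unsigned int tail)
{
        return head - tail;
}

/* The ring is empty if "head == tail". */
bool ring_empty(unsigned int head, unsigned int tail)
{
        return head == tail;
}

/* The ring is full if "head - tail >= ring_size". */
bool ring_full(unsigned int head, unsigned int tail, unsigned int ring_size)
{
        return ring_occupancy(head, tail) >= ring_size;
}

/* Indices are masked only when the array is accessed (power-of-two size). */
unsigned int ring_slot(unsigned int pos, unsigned int ring_size)
{
        return pos & (ring_size - 1);
}
```

For example, with `tail == UINT_MAX - 1` and three buffers queued, `head` has wrapped to 1, yet `ring_occupancy(head, tail)` still yields 3, so no dead slot is needed to tell full from empty.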
  14. 26 May, 2019 2 commits
    • D
      vfs: Convert pipe to use the new mount API · 4fa7ec5d
      David Howells committed
      Convert the pipe filesystem to the new internal mount API as the old
      one will be obsoleted and removed.  This allows greater flexibility in
      communication of mount parameters between userspace, the VFS and the
      filesystem.
      
      See Documentation/filesystems/mount_api.txt for more information.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • A
      mount_pseudo(): drop 'name' argument, switch to d_make_root() · 1f58bb18
      Al Viro committed
      Once upon a time we used to set ->d_name of e.g. pipefs root
      so that d_path() on pipes would work.  These days it's
      completely pointless - dentries of pipes are not even connected
      to pipefs root.  However, mount_pseudo() had set the root
      dentry name (passed as the second argument) and callers
      kept inventing names to pass to it.  Including those that
      didn't *have* any non-root dentries to start with...
      
      All of that had been pointless for about 8 years now; it's
      time to get rid of that cargo-culting...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  15. 15 Apr, 2019 1 commit
  16. 06 Mar, 2019 1 commit
  17. 01 Feb, 2019 2 commits
    • J
      pipe: stop using ->can_merge · 01e7187b
      Jann Horn committed
      Al Viro pointed out that since there is only one pipe buffer type to which
      new data can be appended, it isn't necessary to have a ->can_merge field in
      struct pipe_buf_operations, we can just check for a magic type.
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • J
      splice: don't merge into linked buffers · a0ce2f0a
      Jann Horn committed
      Before this patch, it was possible for two pipes to affect each other after
      data had been transferred between them with tee():
      
      ============
      $ cat tee_test.c
      #define _GNU_SOURCE
      #include <err.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void) {
        int pipe_a[2];
        if (pipe(pipe_a)) err(1, "pipe");
        int pipe_b[2];
        if (pipe(pipe_b)) err(1, "pipe");
        if (write(pipe_a[1], "abcd", 4) != 4) err(1, "write");
        if (tee(pipe_a[0], pipe_b[1], 2, 0) != 2) err(1, "tee");
        if (write(pipe_b[1], "xx", 2) != 2) err(1, "write");
      
        char buf[5];
        if (read(pipe_a[0], buf, 4) != 4) err(1, "read");
        buf[4] = 0;
        printf("got back: '%s'\n", buf);
      }
      $ gcc -o tee_test tee_test.c
      $ ./tee_test
      got back: 'abxx'
      $
      ============
      
      As suggested by Al Viro, fix it by creating a separate type for
      non-mergeable pipe buffers, then changing the types of buffers in
      splice_pipe_to_pipe() and link_pipe().
      
      Cc: <stable@vger.kernel.org>
      Fixes: 7c77f0b3 ("splice: implement pipe to pipe splicing")
      Fixes: 70524490 ("[PATCH] splice: add support for sys_tee()")
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
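The two commits above rely on one idea: whether a write may append to an existing buffer is decided by the buffer's type, and a buffer shared through tee() or splice() is switched to a separate non-mergeable type. A rough userspace sketch of that idea follows. Every name in it (`struct buf_ops`, `struct buf`, `anon_ops`, `anon_nomerge_ops`, `buf_can_merge()`, `buf_mark_unmergeable()`) is a hypothetical stand-in chosen for illustration, not the kernel's definition.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified buffer "type" descriptor. */
struct buf_ops {
    const char *name;
};

/* The one type that new data may be appended to. */
static const struct buf_ops anon_ops = { "anon" };
/* Same behaviour otherwise, but never a merge target. */
static const struct buf_ops anon_nomerge_ops = { "anon-nomerge" };

struct buf {
    const struct buf_ops *ops;
    size_t len;
};

/* No per-type can_merge flag: compare against the single mergeable type. */
static bool buf_can_merge(const struct buf *b)
{
    return b->ops == &anon_ops;
}

/* Called when a buffer becomes shared between two pipes. */
static void buf_mark_unmergeable(struct buf *b)
{
    if (b->ops == &anon_ops)
        b->ops = &anon_nomerge_ops;
}
```

With this shape, the "abxx" result from the tee_test.c run cannot occur in the model: once a buffer has been marked, a later write has to take a fresh buffer instead of appending to the shared one.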
  18. 12 Jul, 2018 3 commits
  19. 11 Jul, 2018 1 commit
  20. 29 Jun, 2018 1 commit
    • L
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds committed
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>