1. 11 Aug 2016, 1 commit
  2. 10 Aug 2016, 1 commit
  3. 09 Aug 2016, 14 commits
    •
      RDS: add __printf format attribute to error reporting functions · 6cdaf03f
      Committed by Nicolas Iooss
      This is helpful for detecting format-string errors at compile time.
      Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6cdaf03f
    •
      net/sched/sch_hfsc.c: remove unused cl_myfadj · 37088f61
      Committed by Michal Soltys
      The code using this variable has been commented out in the past as it
      was causing issues in upperlimited link-sharing scenarios.
      Signed-off-by: Michal Soltys <soltys@ziu.info>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      37088f61
    •
      net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug · 678a6241
      Committed by Michal Soltys
      This patch simplifies how we update fsc and calculate vt from it - while
      keeping the expected functionality identical to how hfsc behaves
      currently. It also fixes a certain issue introduced with
      a very old patch.
      
      The idea is that instead of correcting cl_vt before the fsc curve update
      (rtsc_min) and correcting cl_vt after the calculation (rtsc_y2x) to keep
      cl_vt local to the current period - we can simply rely on virtual times
      and curve values always being in sync - analogously to how rsc and usc
      function, except that we use virtual time here.
      
      Why hasn't it been done this way since the beginning? The likely reason
      (judging by the code trying to correct curves whenever possible) was to
      keep the virtual times as small as possible - as they have a tendency to
      "gallop" forward whenever their siblings and other fair-sharing
      subtrees are idling. On top of that, the current code is subtly bugged, so
      the cumulative time (without any corrections) is always kept and used in
      init_vf() when a new backlog period begins (using cl_cvtoff).
      
      Is the cumulative value safe? Generally yes, though corner cases are easy
      to create. For example, consider:
      
      1gbit interface
      some 100kbit leaf, everything else idle
      
      With the current tick (64ns) 1s is 15625000 ticks, but the leaf is alone
      and this is virtual time, so in reality it advances 10000 times faster. IOW,
      38 bits are needed to hold 1 second, 54 - 1 day, 59 - 1 month, 63 - 1 year
      (all logarithms rounded up). It's getting somewhat dangerous, but it also
      requires a setup that excuses such values, not to mention a class
      permanently backlogged for a year. In the near-most-extreme case (10gbit,
      10kbit leaf), we have "enough" to hold ~13.6 days in 64 bits.
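
      For a sanity check of those bit widths, the following stand-alone
      calculation (not part of the patch, just the numbers from the example
      above) reproduces the rounded-up logarithms:

      #include <math.h>
      #include <stdio.h>

      int main(void)
      {
              /* 64ns tick => 15,625,000 ticks per real second; a lone 100kbit
               * leaf on a 1gbit link sees virtual time advance 10000x faster */
              double vt_per_sec = (1e9 / 64) * 10000;

              printf("1 second: %.0f bits\n", ceil(log2(vt_per_sec)));              /* 38 */
              printf("1 day:    %.0f bits\n", ceil(log2(vt_per_sec * 86400)));      /* 54 */
              printf("1 month:  %.0f bits\n", ceil(log2(vt_per_sec * 86400 * 30))); /* 59 */
              printf("1 year:   %.0f bits\n", ceil(log2(vt_per_sec * 86400 * 365))); /* 63 */
              return 0;                       /* build with: gcc calc.c -lm */
      }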
      
      Well, the issue remains mostly theoretical and cl_cvtoff has been
      working fine for all those years. Sensible configurations are de facto
      immune to this issue, and not-so-sensible ones can solve it with a cronjob
      whose period is inversely proportional to the insanity of such a setup =)
      
      Now let's explain the subtle bug mentioned earlier.
      
      The issue is related to how offsets are kept and how we calculate
      virtual times and update the fair service curve(s). The issue itself is
      subtle, but easy to observe with long m1 segments. It was introduced in
      a rather old patch:
      
      Commit 99296150c7: "[NET_SCHED]: O(1) children vtoff adjustment
      in HFSC scheduler"
      
      (available in git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git)
      
      Originally when a new backlog period was started, cl_vtoff of each
      sibling was updated with cl_cvtmax from the past period - naturally moving
      all cl_vt to the proper starting point. That patch adjusted it so the
      cumulative offset is kept in the parent, and there is no need for
      traversing the list (as any subsequent child activation derives the new
      vt from the already active sibling(s)).
      
      But with this change, cl_vtoff (of each sibling) is no longer persistent
      across the inactivity periods, as it's calculated from parent's
      cl_cvtoff on a new backlog period, conflicting with the following curve
      correction from the previous period:
      
      if (cl->cl_virtual.x == vt) {
              cl->cl_virtual.x -= cl->cl_vtoff;
              cl->cl_vtoff = 0;
      }
      
      This essentially tries to keep the curve as if it were local to the period
      and resets cl_vtoff (the cumulative vt offset of the class) to 0 when
      possible (read: when we have an intersection or if the new curve is below
      the old one). But then it's recalculated from cl_cvtoff on the next active
      period. The rtsc_min() call preceding the above if() then doesn't really
      do what we expect it to do in such a scenario - as it calculates the
      minimum of the corrected curve (from the previous backlog period) and the
      new uncorrected curve (with an offset derived from cl_cvtoff).
      
      Example:
      
      tc class add dev $ife parent 1:0 classid 1:1  hfsc ls m2 100mbit ul m2 100mbit
      tc class add dev $ife parent 1:1 classid 1:10 hfsc ls m1 80mbit d 10s m2 20mbit
      tc class add dev $ife parent 1:1 classid 1:11 hfsc ls m2 20mbit
      
      start B (1:11), keep it backlogged, let it run 6s (30s worth of vt as A is idle)
      pause B briefly to force a cl_cvtoff update in the parent (whole 1:1 going idle)
      start A (1:10), let it run 10s
      pause A briefly to force rtsc_min()
      
      At this point we would expect A to continue at 20mbit after a brief
      moment of 80mbit. But instead A will use 80mbit for the full 10s again.
      It's the effect of first correcting A (during 'start A'), and then - after
      unpausing - calculating rtsc_min() from the old corrected and the new
      uncorrected curve.
      
      The patch fixes this bug and keeps vt and fsc in sync (virtual times
      are cumulative, not local to the backlog period).
      Signed-off-by: Michal Soltys <soltys@ziu.info>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      678a6241
    •
      net/multicast: should not send source list records when have filter mode change · a052517a
      Committed by Hangbin Liu
      Based on RFC3376 5.1 and RFC3810 6.1
      
         If the per-interface listening change that triggers the new report is
         a filter mode change, then the next [Robustness Variable] State
         Change Reports will include a Filter Mode Change Record.  This
         applies even if any number of source list changes occur in that
         period.
      
         Old State         New State         State Change Record Sent
         ---------         ---------         ------------------------
         INCLUDE (A)       EXCLUDE (B)       TO_EX (B)
         EXCLUDE (A)       INCLUDE (B)       TO_IN (B)
      
      So we should not send source-list change records if there is a filter-mode
      change.
      
      Here are two scenarios:
      1. Group deleted and the filter mode is EXCLUDE, which means we need to
         send a TO_IN { }.
      2. Group not deleted, but pmc->crcount is set, which means we need to
         send a normal filter-mode-change record.
      
      At the same time, if the record type is ALLOW or BLOCK and psf->sf_crcount
      is set, we stop adding the record and decrease sf_crcount directly.
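
      In pseudocode, the decision boils down to something like the sketch
      below (illustrative only - names and signature are made up here, this is
      not the actual add_grec() change):

      enum rec_type { ALLOW_NEW_SOURCES, BLOCK_OLD_SOURCES, TO_IN, TO_EX };

      /* Return 1 if an ALLOW/BLOCK (source-list change) record must be
       * suppressed because a filter-mode change is still being reported for
       * the group; in that case only the per-source retransmit counter is
       * consumed. */
      static int suppress_source_list_record(enum rec_type type,
                                             int deleted_exclude_group,
                                             int pending_mode_change,
                                             int *sf_crcount)
      {
              if (type != ALLOW_NEW_SOURCES && type != BLOCK_OLD_SOURCES)
                      return 0;
              if (!deleted_exclude_group && !pending_mode_change)
                      return 0;
              if (*sf_crcount)
                      (*sf_crcount)--;        /* decrease sf_crcount directly */
              return 1;                       /* do not add the record */
      }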
      
      Reference: https://www.ietf.org/mail-archive/web/magma/current/msg01274.html
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a052517a
    •
      net: ipconfig: drop inter-device timeout · e0688534
      Committed by Uwe Kleine-König
      Now that ipconfig learned to handle "delayed replies" in the previous
      commit, there is no longer any reason to delay sending the first request
      on each device.
      Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e0688534
    •
      net: ipconfig: Support using "delayed" DHCP replies · 2647cffb
      Committed by Uwe Kleine-König
      The dhcp code only waits 1s between sending DHCP requests on different
      devices and only accepts an answer for the device that sent out the last
      request. Only the timeout at the end of a loop is increased iteratively,
      which favours only the last device. This makes it impossible to work
      with a dhcp server that takes a little more than 1s and is connected to a
      device that is not the last one.
      
      Instead of also increasing the inter-device timeout, teach the code to
      handle delayed replies.
      
      To accomplish that, make *ic_dev track the current ic_device instead of
      the current net_device and adapt all users accordingly. The relevant
      change then is to reset d to ic_dev on a reply, to ensure that the
      follow-up request goes through the right device.
      Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2647cffb
    •
      net: ipconfig: Add device name to debug messages · 22fc5388
      Committed by Uwe Kleine-König
      This makes it easier to understand what happens when there is more than
      one device.
      Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22fc5388
    •
      neigh: allow admin to set NUD_STALE · 0e7bbcc1
      Committed by Julian Anastasov
      The admin should be able to set any state. Currently, this fails
      when lladdr is not changed and the state is changed from
      NUD_CONNECTED to NUD_STALE:
      
      ip neigh add 192.168.8.1 lladdr 00:11:22:33:44:55 nud perm dev wlan0
      ip neigh show to 192.168.8.1
      192.168.8.1 dev wlan0 lladdr 00:11:22:33:44:55 PERMANENT
      ip neigh change 192.168.8.1 lladdr 00:11:22:33:44:55 nud stale dev wlan0
      ip neigh show to 192.168.8.1
      192.168.8.1 dev wlan0 lladdr 00:11:22:33:44:55 PERMANENT
      
      The problem may date back to the 2.1.X days.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Reviewed-by: Chunhui He <hchunhui@mail.ustc.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0e7bbcc1
    •
      sctp: use event->chunk when it's valid · 1fe323aa
      Committed by Xin Long
      Commit 52253db9 ("sctp: also point GSO head_skb to the sk when
      it's available") used event->chunk->head_skb to get the head_skb in
      sctp_ulpevent_set_owner().
      
      But at that moment, the event->chunk was NULL, as it cloned the skb
      in sctp_ulpevent_make_rcvmsg(). Therefore, that patch didn't really
      work.
      
      This patch is to move the event->chunk initialization before calling
      sctp_ulpevent_receive_data() so that it uses event->chunk when it's
      valid.
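
      Roughly, the ordering becomes (sketch of the idea only; the surrounding
      code of sctp_ulpevent_make_rcvmsg() is omitted and abridged here):

      /* assign the chunk before handing the event on, so that
       * sctp_ulpevent_set_owner() can follow event->chunk->head_skb */
      event->chunk = chunk;
      sctp_ulpevent_receive_data(event, asoc);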
      
      Fixes: 52253db9 ("sctp: also point GSO head_skb to the sk when it's available")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1fe323aa
    •
      bpf: fix checksum for vlan push/pop helper · 8065694e
      Committed by Daniel Borkmann
      For skbs on ingress with CHECKSUM_COMPLETE, tc BPF programs don't push the
      mac header's rcsum back in before the BPF run and pull it back out again
      afterwards, as opposed to some other subsystems (ovs, for example).
      
      For cases like q-in-q, meaning when a vlan tag for offloading is already
      present and we're about to push another one, skb_vlan_push() pushes the
      inner one into the skb, increasing the mac header and skb_postpush_rcsum()'ing
      the 4-byte vlan header diff. Likewise, for the reverse operation in
      skb_vlan_pop(), in the case where the vlan header needs to be pulled out of
      the skb, we're decreasing the mac header and skb_postpull_rcsum()'ing the
      4-byte rcsum of the vlan header that was removed.
      
      However, mangling the rcsum here will lead to hw csum failures for the BPF
      case, since we're pulling or pushing data that was not part of the current
      rcsum. Changing tc BPF programs in general to push/pull the rcsum around
      BPF_PROG_RUN() is not really an option either, since the current behaviour
      is ABI by now; apart from that, it would also mean doing quite a bit of
      useless work, in the sense that usually 12 bytes would need to be rcsum
      pushed/pulled even when we don't touch this vlan-related corner case. One
      way to fix it is to push the necessary rcsum fixup down into the vlan
      helpers, which are (mostly) slow-path anyway.
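
      In kernel-style pseudocode the approach looks roughly like this (helper
      names and placement are assumptions for illustration, not a quote of
      the patch):

      /* bracket the vlan operation with a mac header rcsum push/pull so that
       * CHECKSUM_COMPLETE stays consistent on ingress */
      static void bpf_push_mac_rcsum(struct sk_buff *skb)
      {
              if (skb->ip_summed == CHECKSUM_COMPLETE)
                      skb_postpush_rcsum(skb, skb_mac_header(skb), skb->mac_len);
      }

      static void bpf_pull_mac_rcsum(struct sk_buff *skb)
      {
              if (skb->ip_summed == CHECKSUM_COMPLETE)
                      skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
      }

      /* ... and inside the vlan push helper: */
      bpf_push_mac_rcsum(skb);
      ret = skb_vlan_push(skb, vlan_proto, vlan_tci);
      bpf_pull_mac_rcsum(skb);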
      
      Fixes: 4e10df9a ("bpf: introduce bpf_skb_vlan_push/pop() helpers")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8065694e
    •
      bpf: fix checksum fixups on bpf_skb_store_bytes · 479ffccc
      Committed by Daniel Borkmann
      bpf_skb_store_bytes() invocations above L2 header need BPF_F_RECOMPUTE_CSUM
      flag for updates, so that CHECKSUM_COMPLETE will be fixed up along the way.
      Where we ran into an issue with bpf_skb_store_bytes() is when we did a
      single-byte update on the IPv6 hoplimit despite using BPF_F_RECOMPUTE_CSUM
      flag; simple ping via ICMPv6 triggered a hw csum failure as a result. The
      underlying issue has been tracked down to a buffer alignment issue.
      
      Meaning, the csum_partial() computations invoked via the
      skb_postpull_rcsum() and skb_postpush_rcsum() pair had a wrong result,
      since they operated on an odd address for the hoplimit while other
      computations were done on an even address. This mix doesn't work as-is
      with the skb_postpull_rcsum(), skb_postpush_rcsum() pair, as it always
      expects at least half-word alignment of input buffers, which is normally
      the case. Thus, instead of these helpers using csum_sub() and (implicitly)
      csum_add(), we need to use csum_block_sub() and csum_block_add(),
      respectively. For unaligned offsets, they rotate the sum to align it to a
      half-word boundary again; otherwise they work the same as csum_sub() and
      csum_add().
      
      Adding __skb_postpull_rcsum(), __skb_postpush_rcsum() variants that take the
      offset as an input and adapting bpf_skb_store_bytes() to them fixes the hw
      csum failures again. The skb_postpull_rcsum(), skb_postpush_rcsum() helpers
      use a 0 constant for offset so that the compiler optimizes the offset & 1
      test away and generates the same code as with csum_sub()/_add().
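
      The alignment subtlety is easy to demonstrate in userspace; the toy
      program below (simplified stand-ins, not the kernel helpers) shows that
      a partial sum taken from an odd offset only folds to the right checksum
      when it is rotated first, which is what csum_block_add() does:

      #include <stdint.h>
      #include <stdio.h>

      static uint32_t csum_add(uint32_t a, uint32_t b)
      {
              uint64_t s = (uint64_t)a + b;
              return (uint32_t)(s + (s >> 32));       /* end-around carry */
      }

      static uint32_t csum_partial(const uint8_t *buf, int len)
      {
              uint32_t sum = 0;
              int i;

              for (i = 0; i + 1 < len; i += 2)
                      sum = csum_add(sum, (uint32_t)buf[i] | (uint32_t)buf[i + 1] << 8);
              if (i < len)
                      sum = csum_add(sum, buf[i]);    /* odd trailing byte */
              return sum;
      }

      static uint32_t csum_block_add(uint32_t csum, uint32_t csum2, int offset)
      {
              if (offset & 1)                         /* re-align an odd-offset block */
                      csum2 = (csum2 >> 8) | (csum2 << 24);
              return csum_add(csum, csum2);
      }

      static uint16_t csum_fold(uint32_t sum)
      {
              sum = (sum & 0xffff) + (sum >> 16);
              sum = (sum & 0xffff) + (sum >> 16);
              return (uint16_t)~sum;
      }

      int main(void)
      {
              uint8_t pkt[] = { 0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc };
              uint32_t head = csum_partial(pkt, 1);
              uint32_t tail = csum_partial(pkt + 1, 5);       /* starts at odd offset 1 */

              printf("whole buffer:   %#06x\n", csum_fold(csum_partial(pkt, 6)));
              printf("csum_add:       %#06x\n", csum_fold(csum_add(head, tail)));       /* wrong */
              printf("csum_block_add: %#06x\n", csum_fold(csum_block_add(head, tail, 1)));
              return 0;
      }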
      
      Fixes: 608cd71a ("tc: bpf: generalize pedit action")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      479ffccc
    •
      bpf: also call skb_postpush_rcsum on xmit occasions · a2bfe6bf
      Committed by Daniel Borkmann
      Follow-up to commit f8ffad69 ("bpf: add skb_postpush_rcsum and fix
      dev_forward_skb occasions") to fix an issue for dev_queue_xmit() redirect
      locations which need CHECKSUM_COMPLETE fixups on ingress.
      
      For the same reasons as described in f8ffad69 already, we of course
      also need this here, since dev_queue_xmit() on a veth device will let us
      end up in the dev_forward_skb() helper again to cross namespaces.
      
      The latter then calls into skb_postpull_rcsum() to pull out the L2 header,
      so that netif_rx_internal() sees CHECKSUM_COMPLETE as expected - that
      is, CHECKSUM_COMPLETE on ingress covering the L2 _payload_, not L2 headers.
      
      Also here we have to address bpf_redirect() and bpf_clone_redirect().
      
      Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper")
      Fixes: 27b29f63 ("bpf: add bpf_redirect() helper")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2bfe6bf
    •
      sctp_diag: Respect ss adding TCPF_CLOSE to idiag_states · 1ba8d77f
      Committed by Phil Sutter
      Since 'ss' always adds TCPF_CLOSE to the idiag_states flags, sctp_diag
      can't rely on the TCPF_LISTEN flag alone being present when listening
      sockets are requested.
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ba8d77f
    •
      sctp_diag: Fix T3_rtx timer export · 12474e8e
      Committed by Phil Sutter
      The asoc's timer value is not kept in the asoc->timeouts array but in its
      primary transport instead.
      
      Furthermore, we must export the timer only if it is pending, otherwise
      the value will underrun when stored in an unsigned variable and
      user space will only see a very large timeout value.
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      12474e8e
  4. 06 Aug 2016, 3 commits
    •
      ipv4: panic in leaf_walk_rcu due to stale node pointer · 94d9f1c5
      Committed by David Forster
      A panic occurs when issuing "cat /proc/net/route" while
      populating the FIB with > 1M routes.
      
      Use of a cached node pointer in fib_route_get_idx() is unsafe.
      
       BUG: unable to handle kernel paging request at ffffc90001630024
       IP: [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
       PGD 11b08d067 PUD 11b08e067 PMD dac4b067 PTE 0
       Oops: 0000 [#1] SMP
       Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscac
       snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep virti
       acpi_cpufreq button parport_pc ppdev lp parport autofs4 ext4 crc16 mbcache jbd
      tio_ring virtio floppy uhci_hcd ehci_hcd usbcore usb_common libata scsi_mod
       CPU: 1 PID: 785 Comm: cat Not tainted 4.2.0-rc8+ #4
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
       task: ffff8800da1c0bc0 ti: ffff88011a05c000 task.ti: ffff88011a05c000
       RIP: 0010:[<ffffffff814cf6a0>]  [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
       RSP: 0018:ffff88011a05fda0  EFLAGS: 00010202
       RAX: ffff8800d8a40c00 RBX: ffff8800da4af940 RCX: ffff88011a05ff20
       RDX: ffffc90001630020 RSI: 0000000001013531 RDI: ffff8800da4af950
       RBP: 0000000000000000 R08: ffff8800da1f9a00 R09: 0000000000000000
       R10: ffff8800db45b7e4 R11: 0000000000000246 R12: ffff8800da4af950
       R13: ffff8800d97a74c0 R14: 0000000000000000 R15: ffff8800d97a7480
       FS:  00007fd3970e0700(0000) GS:ffff88011fd00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: ffffc90001630024 CR3: 000000011a7e4000 CR4: 00000000000006e0
       Stack:
        ffffffff814d00d3 0000000000000000 ffff88011a05ff20 ffff8800da1f9a00
        ffffffff811dd8b9 0000000000000800 0000000000020000 00007fd396f35000
        ffffffff811f8714 0000000000003431 ffffffff8138dce0 0000000000000f80
       Call Trace:
        [<ffffffff814d00d3>] ? fib_route_seq_start+0x93/0xc0
        [<ffffffff811dd8b9>] ? seq_read+0x149/0x380
        [<ffffffff811f8714>] ? fsnotify+0x3b4/0x500
        [<ffffffff8138dce0>] ? process_echoes+0x70/0x70
        [<ffffffff8121cfa7>] ? proc_reg_read+0x47/0x70
        [<ffffffff811bb823>] ? __vfs_read+0x23/0xd0
        [<ffffffff811bbd42>] ? rw_verify_area+0x52/0xf0
        [<ffffffff811bbe61>] ? vfs_read+0x81/0x120
        [<ffffffff811bcbc2>] ? SyS_read+0x42/0xa0
        [<ffffffff81549ab2>] ? entry_SYSCALL_64_fastpath+0x16/0x75
       Code: 48 85 c0 75 d8 f3 c3 31 c0 c3 f3 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00
      a 04 89 f0 33 02 44 89 c9 48 d3 e8 0f b6 4a 05 49 89
       RIP  [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
        RSP <ffff88011a05fda0>
       CR2: ffffc90001630024
      Signed-off-by: Dave Forster <dforster@brocade.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      94d9f1c5
    •
      rxrpc: Fix races between skb free, ACK generation and replying · 372ee163
      Committed by David Howells
      Inside the kafs filesystem it is possible to occasionally have a call
      processed and terminated before we've had a chance to check whether we need
      to clean up the rx queue for that call because afs_send_simple_reply() ends
      the call when it is done, but this is done in a workqueue item that might
      happen to run to completion before afs_deliver_to_call() completes.
      
      Further, it is possible for rxrpc_kernel_send_data() to be called to send a
      reply before the last request-phase data skb is released.  The rxrpc skb
      destructor is where the ACK processing is done and the call state is
      advanced upon release of the last skb.  ACK generation is also deferred to
      a work item because it's possible that the skb destructor is not called in
      a context where kernel_sendmsg() can be invoked.
      
      To this end, the following changes are made:
      
       (1) kernel_rxrpc_data_consumed() is added.  This should be called whenever
           an skb is emptied so as to crank the ACK and call states.  This does
           not release the skb, however.  kernel_rxrpc_free_skb() must now be
           called to achieve that.  These together replace
           rxrpc_kernel_data_delivered().
      
       (2) kernel_rxrpc_data_consumed() is wrapped by afs_data_consumed().
      
           This makes afs_deliver_to_call() easier to work as the skb can simply
           be discarded unconditionally here without trying to work out what the
           return value of the ->deliver() function means.
      
           The ->deliver() functions can, via afs_data_complete(),
           afs_transfer_reply() and afs_extract_data() mark that an skb has been
           consumed (thereby cranking the state) without the need to
           conditionally free the skb to make sure the state is correct on an
           incoming call for when the call processor tries to send the reply.
      
       (3) rxrpc_recvmsg() now has to call kernel_rxrpc_data_consumed() when it
           has finished with a packet and MSG_PEEK isn't set.
      
       (4) rxrpc_packet_destructor() no longer calls rxrpc_hard_ACK_data().
      
           Because of this, we no longer need to clear the destructor and put the
           call before we free the skb in cases where we don't want the ACK/call
           state to be cranked.
      
       (5) The ->deliver() call-type callbacks are made to return -EAGAIN rather
           than 0 if they expect more data (afs_extract_data() returns -EAGAIN to
           the delivery function already), and the caller is now responsible for
           producing an abort if that was the last packet.
      
       (6) There are many bits of unmarshalling code where:
      
       		ret = afs_extract_data(call, skb, last, ...);
      		switch (ret) {
      		case 0:		break;
      		case -EAGAIN:	return 0;
      		default:	return ret;
      		}
      
           is to be found.  As -EAGAIN can now be passed back to the caller, we
           now just return if ret < 0:
      
       		ret = afs_extract_data(call, skb, last, ...);
      		if (ret < 0)
      			return ret;
      
       (7) Checks for trailing data and empty final data packets have been
           consolidated as afs_data_complete().  So:
      
      		if (skb->len > 0)
      			return -EBADMSG;
      		if (!last)
      			return 0;
      
           becomes:
      
      		ret = afs_data_complete(call, skb, last);
      		if (ret < 0)
      			return ret;
      
       (8) afs_transfer_reply() now checks the amount of data it has against the
           amount of data desired and the amount of data in the skb and returns
           an error to induce an abort if we don't get exactly what we want.
      
      Without these changes, the following oops can occasionally be observed,
      particularly if some printks are inserted into the delivery path:
      
      general protection fault: 0000 [#1] SMP
      Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
      CPU: 0 PID: 1305 Comm: kworker/u8:3 Tainted: G            E   4.7.0-fsdevel+ #1303
      Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
      Workqueue: kafsd afs_async_workfn [kafs]
      task: ffff88040be041c0 ti: ffff88040c070000 task.ti: ffff88040c070000
      RIP: 0010:[<ffffffff8108fd3c>]  [<ffffffff8108fd3c>] __lock_acquire+0xcf/0x15a1
      RSP: 0018:ffff88040c073bc0  EFLAGS: 00010002
      RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: ffff88040d29a710
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88040d29a710
      RBP: ffff88040c073c70 R08: 0000000000000001 R09: 0000000000000001
      R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff88040be041c0 R15: ffffffff814c928f
      FS:  0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa4595f4750 CR3: 0000000001c14000 CR4: 00000000001406f0
      Stack:
       0000000000000006 000000000be04930 0000000000000000 ffff880400000000
       ffff880400000000 ffffffff8108f847 ffff88040be041c0 ffffffff81050446
       ffff8803fc08a920 ffff8803fc08a958 ffff88040be041c0 ffff88040c073c38
      Call Trace:
       [<ffffffff8108f847>] ? mark_held_locks+0x5e/0x74
       [<ffffffff81050446>] ? __local_bh_enable_ip+0x9b/0xa1
       [<ffffffff8108f9ca>] ? trace_hardirqs_on_caller+0x16d/0x189
       [<ffffffff810915f4>] lock_acquire+0x122/0x1b6
       [<ffffffff810915f4>] ? lock_acquire+0x122/0x1b6
       [<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
       [<ffffffff81609dbf>] _raw_spin_lock_irqsave+0x35/0x49
       [<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
       [<ffffffff814c928f>] skb_dequeue+0x18/0x61
       [<ffffffffa009aa92>] afs_deliver_to_call+0x344/0x39d [kafs]
       [<ffffffffa009ab37>] afs_process_async_call+0x4c/0xd5 [kafs]
       [<ffffffffa0099e9c>] afs_async_workfn+0xe/0x10 [kafs]
       [<ffffffff81063a3a>] process_one_work+0x29d/0x57c
       [<ffffffff81064ac2>] worker_thread+0x24a/0x385
       [<ffffffff81064878>] ? rescuer_thread+0x2d0/0x2d0
       [<ffffffff810696f5>] kthread+0xf3/0xfb
       [<ffffffff8160a6ff>] ret_from_fork+0x1f/0x40
       [<ffffffff81069602>] ? kthread_create_on_node+0x1cf/0x1cf
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      372ee163
    •
      OVS: Ignore negative headroom value · 5ef9f289
      Committed by Ian Wienand
      net_device->ndo_set_rx_headroom (introduced in
      871b642a) says
      
        "Setting a negtaive value reset the rx headroom
         to the default value".
      
      It seems that the OVS implementation in
      3a927bc7 overlooked this and sets
      dev->needed_headroom unconditionally.
      
      This doesn't have an immediate effect, but can mess up later
      LL_RESERVED_SPACE calculations, such as the one done in
      net/ipv6/mcast.c:mld_newpack(). For reference, this issue was found
      via a skb_panic raised there after the length calculations had given
      the wrong result.
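
      The guard itself is a one-liner; a sketch in the style of the other
      ndo_set_rx_headroom implementations (treat the body below as
      illustrative, not as a quote of the patch):

      static void internal_set_rx_headroom(struct net_device *dev, int new_hr)
      {
              /* a negative value means "reset to default"; never copy it into
               * needed_headroom where it would skew LL_RESERVED_SPACE later */
              dev->needed_headroom = new_hr < 0 ? 0 : new_hr;
      }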
      
      Note the other current users of this interface
      (drivers/net/tun.c:tun_set_headroom and
      drivers/net/veth.c:veth_set_rx_headroom) are both checking this
      correctly thus need no modification.
      
      Thanks to Ben for some pointers from the crash dumps!
      
      Cc: Benjamin Poirier <bpoirier@suse.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1361414
      Signed-off-by: Ian Wienand <iwienand@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5ef9f289
  5. 05 Aug 2016, 4 commits
  6. 04 Aug 2016, 1 commit
  7. 03 Aug 2016, 1 commit
  8. 02 Aug 2016, 2 commits
  9. 31 Jul 2016, 6 commits
  10. 28 Jul 2016, 6 commits
  11. 27 Jul 2016, 1 commit
    •
      af_unix: charge buffers to kmemcg · 3aa9799e
      Committed by Vladimir Davydov
      Unix sockets can consume a significant amount of system memory, hence
      they should be accounted to kmemcg.
      
      Since unix socket buffers are always allocated from process context, all
      we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
      sock->sk_allocation mask.
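
      In code terms that is a single assignment at socket creation time
      (placement sketched from the description above, assumed to sit in
      af_unix's sock setup path):

      sk->sk_allocation = GFP_KERNEL_ACCOUNT;         /* GFP_KERNEL | __GFP_ACCOUNT */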
      
      Eric asked:
      
      > 1) What happens when a buffer, allocated from socket <A> lands in a
      > different socket <B>, maybe owned by another user/process.
      >
      > Who owns it now, in term of kmemcg accounting ?
      
      We never move memcg charges.  E.g.  if two processes from different
      cgroups are sharing a memory region, each page will be charged to the
      process which touched it first.  Or if two processes are working with
      the same directory tree, inodes and dentries will be charged to the
      first user.  The same is fair for unix socket buffers - they will be
      charged to the sender.
      
      > 2) Has performance impact been evaluated ?
      
      I ran netperf STREAM_STREAM with default options in a kmemcg on a 4 core
      x2 HT box.  The results are below:
      
       # clients            bandwidth (10^6bits/sec)
                          base              patched
               1      67643 +-  725      64874 +-  353    - 4.0 %
               4     193585 +- 2516     186715 +- 1460    - 3.5 %
               8     194820 +-  377     187443 +- 1229    - 3.7 %
      
      So the accounting doesn't come for free - it takes ~4% of performance.
      I believe we could optimize it by using per cpu batching not only on
      charge, but also on uncharge in memcg core, but that's beyond the scope
      of this patch set - I'll take a look at this later.
      
      Anyway, if performance impact is found to be unacceptable, it is always
      possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
      or not use memory cgroups at runtime at all (thanks to jump labels
      there'll be no overhead even if they are compiled in).
      
      Link: http://lkml.kernel.org/r/fcfe6cae27a59fbc5e40145664b3cf085a560c68.1464079538.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3aa9799e