[5.12,242/296] net: sched: fix packet stuck problem for lockless qdisc

From: Yunsheng Lin <linyunsheng@huawei.com>

[ Upstream commit a90c57f2cedd52a511f739fb55e6244e22e1a2fb ]

Lockless qdisc has the following concurrency problem:
    cpu0                 cpu1
     .                     .
q->enqueue                 .
     .                     .
qdisc_run_begin()          .
     .                     .
dequeue_skb()              .
     .                     .
sch_direct_xmit()          .
     .                     .
     .                q->enqueue
     .             qdisc_run_begin()
     .            return and do nothing
     .                     .
qdisc_run_end()            .

cpu1 enqueues an skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() returns false
for cpu1 in qdisc_run_begin(). cpu0 does not see the skb
enqueued by cpu1 when calling dequeue_skb() because cpu1 may
enqueue the skb after cpu0 calls dequeue_skb() and before
cpu0 calls qdisc_run_end().
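
The interleaving above can be replayed deterministically in a small
userspace model. This is only a sketch: struct model_qdisc,
model_run_begin(), model_run_end() and replay_first_race() are
hypothetical names, and an atomic_flag stands in for the qdisc's
seqlock. It is not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a lockless qdisc (not kernel code). */
struct model_qdisc {
	atomic_flag seqlock;	/* models the lock taken by spin_trylock() */
	int qlen;		/* number of packets currently queued */
};

/* Pre-fix behaviour: one trylock, and a failed attempt leaves no trace. */
static bool model_run_begin(struct model_qdisc *q)
{
	return !atomic_flag_test_and_set(&q->seqlock);
}

static void model_run_end(struct model_qdisc *q)
{
	atomic_flag_clear(&q->seqlock);
}

/* Replays the two-CPU diagram; returns the packets left with no runner. */
static int replay_first_race(void)
{
	struct model_qdisc q = { ATOMIC_FLAG_INIT, 0 };
	bool cpu0_runs, cpu1_runs;

	q.qlen++;				/* cpu0: q->enqueue */
	cpu0_runs = model_run_begin(&q);	/* cpu0: takes the lock */
	q.qlen--;				/* cpu0: dequeue_skb() */
						/* cpu0: sch_direct_xmit() */
	q.qlen++;				/* cpu1: q->enqueue */
	cpu1_runs = model_run_begin(&q);	/* cpu1: trylock fails */
	assert(cpu0_runs && !cpu1_runs);
	model_run_end(&q);			/* cpu0: qdisc_run_end() */

	return q.qlen;				/* cpu1's packet is stuck */
}
```

In this model replay_first_race() returns 1: the packet from cpu1 stays
queued until some later enqueue happens to trigger another run.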

Lockless qdisc has another concurrency problem when
tx_action is involved:

cpu0(serving tx_action)     cpu1             cpu2
          .                   .                .
          .              q->enqueue            .
          .            qdisc_run_begin()       .
          .              dequeue_skb()         .
          .                   .            q->enqueue
          .                   .                .
          .             sch_direct_xmit()      .
          .                   .         qdisc_run_begin()
          .                   .       return and do nothing
          .                   .                .
 clear __QDISC_STATE_SCHED    .                .
 qdisc_run_begin()            .                .
 return and do nothing        .                .
          .                   .                .
          .            qdisc_run_end()         .

This patch fixes the above data races by:
1. If the first spin_trylock() returns false and STATE_MISSED is
   not set, set STATE_MISSED and retry spin_trylock() once, in
   case the other CPU does not see STATE_MISSED after it releases
   the lock.
2. Reschedule if STATE_MISSED is set after the lock is released
   at the end of qdisc_run_end().
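
The two steps above can be sketched in userspace with C11 atomics.
This is a model under stated assumptions, not the patched kernel code:
struct model_qdisc and the model_*() helpers are hypothetical, an
atomic_bool stands in for the STATE_MISSED bit, and a counter stands
in for __netif_schedule().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a lockless qdisc (not kernel code). */
struct model_qdisc {
	atomic_flag seqlock;	/* models the lock taken by spin_trylock() */
	atomic_bool missed;	/* models the STATE_MISSED bit */
	int resched;		/* counts modelled __netif_schedule() calls */
};

static bool model_trylock(struct model_qdisc *q)
{
	return !atomic_flag_test_and_set(&q->seqlock);
}

static bool model_run_begin(struct model_qdisc *q)
{
	if (model_trylock(q))
		return true;

	/* A miss is already recorded, so the holder will reschedule. */
	if (atomic_load(&q->missed))
		return false;

	/* Step 1: record the miss, then retry once in case the holder
	 * released the lock without observing the flag.
	 */
	atomic_store(&q->missed, true);
	return model_trylock(q);
}

static void model_run_end(struct model_qdisc *q)
{
	atomic_flag_clear(&q->seqlock);

	/* Step 2: after unlocking, reschedule if a contender was turned
	 * away. The model tests and clears the flag in one exchange, so
	 * the flag is already clear when the reschedule is counted.
	 */
	if (atomic_exchange(&q->missed, false))
		q->resched++;
}

/* Replays the contended case; returns the number of reschedules. */
static int replay_with_fix(void)
{
	struct model_qdisc q = { ATOMIC_FLAG_INIT, false, 0 };
	bool cpu0_runs = model_run_begin(&q);	/* cpu0: takes the lock */
	bool cpu1_runs = model_run_begin(&q);	/* cpu1: fails, sets missed */

	assert(cpu0_runs && !cpu1_runs);
	model_run_end(&q);			/* cpu0: sees missed */
	return q.resched;
}
```

In this model replay_with_fix() returns 1: the failed attempt by cpu1
is no longer lost, because cpu0 reschedules when it releases the lock.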

For the tx_action case, STATE_MISSED is also set when cpu1 is at
the end of qdisc_run_end(), so tx_action will be rescheduled
to dequeue the skb enqueued by cpu2.

Clear STATE_MISSED before retrying a dequeue when dequeuing
returns NULL, in order to reduce the overhead of the second
spin_trylock() and of calling __netif_schedule().
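
A minimal sketch of the clear-and-retry idea on an empty dequeue,
again as a userspace model rather than the kernel implementation.
struct model_queue, model_pop(), model_dequeue() and
demo_empty_dequeue() are hypothetical names; the sequentially
consistent C11 atomics stand in for whatever ordering the real code
needs between clearing the flag and dequeuing again.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_QLEN 4

/* Hypothetical single-consumer queue standing in for the qdisc. */
struct model_queue {
	int pkts[MODEL_QLEN];
	size_t head, tail;	/* head == tail means empty */
	atomic_bool missed;	/* models the STATE_MISSED bit */
};

static bool model_pop(struct model_queue *q, int *pkt)
{
	if (q->head == q->tail)
		return false;
	*pkt = q->pkts[q->head++ % MODEL_QLEN];
	return true;
}

/* Dequeue one packet; on an empty queue with the missed flag set,
 * clear the flag and look once more before giving up.
 */
static bool model_dequeue(struct model_queue *q, int *pkt)
{
	bool need_retry = true;

	for (;;) {
		if (model_pop(q, pkt))
			return true;
		if (!need_retry || !atomic_load(&q->missed))
			return false;

		/* Clear first, then retry: a packet enqueued by the CPU
		 * that set the flag is picked up here, so the lock holder
		 * does not need to reschedule for it later.
		 */
		atomic_store(&q->missed, false);
		need_retry = false;
	}
}

/* Empty queue with the flag set: the dequeue fails, the flag is cleared. */
static bool demo_empty_dequeue(void)
{
	struct model_queue q = { {0}, 0, 0, true };
	int pkt;
	bool got = model_dequeue(&q, &pkt);

	assert(!got);
	return !atomic_load(&q.missed);
}
```

With the flag already cleared inside the dequeue path, the lock holder
finds nothing to act on when it releases the lock, which is the
overhead reduction the paragraph above describes.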

Also clear STATE_MISSED before calling __netif_schedule()
at the end of qdisc_run_end() to avoid doing another round of
dequeuing in pfifo_fast_dequeue().

The performance impact of this patch, tested using pktgen and
a dummy netdev with the pfifo_fast qdisc attached:

 threads  without this patch   with this patch      delta
    1        2.61Mpps            2.60Mpps           -0.3%
    2        3.97Mpps            3.82Mpps           -3.7%
    4        5.62Mpps            5.59Mpps           -0.5%
    8        2.78Mpps            2.77Mpps           -0.3%
   16        2.22Mpps            2.22Mpps           -0.0%

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 include/net/sch_generic.h | 35 ++++++++++++++++++++++++++++++++++-
 net/sched/sch_generic.c   | 19 +++++++++++++++++++
 2 files changed, 53 insertions(+), 1 deletion(-)

Message ID	20210531130711.926935292@linuxfoundation.org
State	Superseded
From	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To	linux-kernel@vger.kernel.org
Cc	Greg Kroah-Hartman <gregkh@linuxfoundation.org>, stable@vger.kernel.org, Jakub Kicinski <kuba@kernel.org>, Juergen Gross <jgross@suse.com>, Yunsheng Lin <linyunsheng@huawei.com>, "David S. Miller" <davem@davemloft.net>, Sasha Levin <sashal@kernel.org>
Subject	[PATCH 5.12 242/296] net: sched: fix packet stuck problem for lockless qdisc
Date	Mon, 31 May 2021 15:14:57 +0200
In-Reply-To	<20210531130703.762129381@linuxfoundation.org>