[RFC] irq_work: wakeup irq_workd when queued first rt_lazy work

Message ID 20220711172314.603717-1-schspa@gmail.com
State New
Series [RFC] irq_work: wakeup irq_workd when queued first rt_lazy work

Commit Message

Schspa Shi July 11, 2022, 5:23 p.m. UTC
Commit b4c6f86ec2f64
("irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT")
moves all irq_work without the IRQ_WORK_HARD_IRQ flag onto the lazy_list
on PREEMPT_RT. But this kind of irq_work still differs from irq_work
with IRQ_WORK_LAZY set. The differences are as follows:

- With IRQ_WORK_LAZY (type1):
  This kind of work is executed after the next timer tick, which wakes
  up irq_workd, so there is additional scheduling delay.

- Without IRQ_WORK_LAZY and without IRQ_WORK_HARD_IRQ (type2):
  This kind of irq_work gets a faster response, because irq_workd is
  woken up directly from the IPI interrupt.
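
For reference, the queueing logic after that commit looks roughly like
this (simplified excerpt of __irq_work_queue_local(); the same lines
appear as context in the diff at the end of this mail):

work_flags = atomic_read(&work->node.a_flags);
if (work_flags & IRQ_WORK_LAZY)
	lazy_work = true;
else if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
	 !(work_flags & IRQ_WORK_HARD_IRQ))
	rt_lazy_work = true;

/* Both type1 and type2 end up on the same per-CPU lazy_list. */
if (lazy_work || rt_lazy_work)
	list = this_cpu_ptr(&lazy_list);
else
	list = this_cpu_ptr(&raised_list);

if (!llist_add(&work->node.llist, list))
	return;

/* If the work is "lazy", handle it from next tick if any */
if (!lazy_work || tick_nohz_tick_stopped())
	arch_irq_work_raise();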

I want to know if this difference is by design.

If this is by design, we have a problem: the irq_work of type2 will
not execute as quickly as expected; it may be delayed by the irq_work
of type1.

Please consider the following scenario:

If the CPU queues a type1 irq_work A, and then a type2 irq_work B,
B will not be executed quickly, because we never issue the IPI to
wake up irq_workd (the llist_add() call returns false).

This PATCH issues IPI_IRQ_WORK to make B execute quickly.
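
Walking through that scenario against the queueing logic above
(annotated sketch; A and B are illustrative names, not real code):

/* CPU queues A (type1, IRQ_WORK_LAZY set): */
llist_add(&A->node.llist, &lazy_list);	/* returns true: list was empty */
/* lazy_work == true and the tick is running -> no arch_irq_work_raise() */

/* The same CPU then queues B (type2, neither flag set): */
llist_add(&B->node.llist, &lazy_list);	/* returns false: A is already there */
return;					/* early return -> no IPI for B either */

So B sits behind A on lazy_list until the next tick finally wakes
irq_workd.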

One thing that needs to be improved: we now have lazy_list.node.llist
and lazy_work_raised, which need to be updated atomically; this patch
disables the local CPU IRQs to make that atomic. There should be a
better way to update these two variables atomically, and I can dig
deeper if this behavior is not by design and needs to be fixed.
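
The window the IRQ-off section closes looks like this (illustrative
sketch only, not the patch code; an interrupt between the two steps is
the problem):

llist_add(&work->node.llist, list);	/* step 1: list becomes non-empty */
/*
 * An interrupt here may queue another work and read lazy_work_raised
 * before step 2 below has recorded whether an IPI was raised.
 */
xchg(this_cpu_ptr(&lazy_work_raised), raised);	/* step 2: record it */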

If these two types of irq_work should have the same priority, maybe
we should change

if (!lazy_work || tick_nohz_tick_stopped()) {
	arch_irq_work_raise();
}

to

if (!(lazy_work || rt_lazy_work) || tick_nohz_tick_stopped()) {
	arch_irq_work_raise();
}

I'm uploading this patch just to illustrate the problem; please don't
pay too much attention to the ugly changes below.

Signed-off-by: Schspa Shi <schspa@gmail.com>
---
 kernel/irq_work.c | 40 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 3 deletions(-)

Comments

Sebastian Andrzej Siewior Aug. 18, 2022, 4:28 p.m. UTC | #1
On 2022-07-12 01:23:15 [+0800], Schspa Shi wrote:
> I want to know if this difference is by design.

Yes. type1 (LAZY) does not need immediate action but can't be scheduled
regularly like a workqueue.

> If this is by design, we have a problem: the irq_work of type2 will
> not execute as quickly as expected; it may be delayed by the irq_work
> of type1.
> 
> Please consider the following scenario:
> 
> If the CPU queues a type1 irq_work A, and then a type2 irq_work B,
> B will not be executed quickly, because we never issue the IPI to
> wake up irq_workd (the llist_add() call returns false).

But those two are different lists. So adding type1 to list1 does not
affect type2 on list2.

> This PATCH issues IPI_IRQ_WORK to make B execute quickly.
> 
> One thing that needs to be improved: we now have lazy_list.node.llist
> and lazy_work_raised, which need to be updated atomically; this patch
> disables the local CPU IRQs to make that atomic. There should be a
> better way to update these two variables atomically, and I can dig
> deeper if this behavior is not by design and needs to be fixed.
> 
> If these two types of irq_work should have the same priority, maybe
> we should change
> 
> if (!lazy_work || tick_nohz_tick_stopped()) {
> 	arch_irq_work_raise();
> }
> 
> to
> 
> if (!(lazy_work || rt_lazy_work) || tick_nohz_tick_stopped()) {
> 	arch_irq_work_raise();
> }

but we wait for the timer for the lazy-work. RT has more LAZY items
compared to !RT. So if there is an error then it should be visible
there, too.

Is there a problem with this? Adding (as you call it) a type1 item does
not affect type2 items. They will be processed asap.

Sebastian
Schspa Shi Aug. 18, 2022, 7:56 p.m. UTC | #2
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2022-07-12 01:23:15 [+0800], Schspa Shi wrote:
>> I want to know if this difference is by design.
>
> Yes. type1 (LAZY) does not need immediate action but can't be scheduled
> regularly like a workqueue.
>
>> If this is by design, we have a problem: the irq_work of type2 will
>> not execute as quickly as expected; it may be delayed by the irq_work
>> of type1.
>> 
>> Please consider the following scenario:
>> 
>> If the CPU queues a type1 irq_work A, and then a type2 irq_work B,
>> B will not be executed quickly, because we never issue the IPI to
>> wake up irq_workd (the llist_add() call returns false).
>
> But those two are different lists. So adding type1 to list1 does not
> affect type2 on list2.
>

No, both will be added to the same list (lazy_list).
All irq_work without the IRQ_WORK_HARD_IRQ flag is added to lazy_list.
Maybe my description of type2 was not clear: type2 irq_work means that
neither the IRQ_WORK_LAZY flag nor the IRQ_WORK_HARD_IRQ flag is set.
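
To summarize the mapping on PREEMPT_RT as I understand it:

  flags set                      list used    executed by
  IRQ_WORK_HARD_IRQ              raised_list  hard irq context
  IRQ_WORK_LAZY         (type1)  lazy_list    irq_workd, after next tick
  neither flag          (type2)  lazy_list    irq_workd, woken by IPI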

>> This PATCH issues IPI_IRQ_WORK to make B execute quickly.
>> 
>> One thing that needs to be improved: we now have lazy_list.node.llist
>> and lazy_work_raised, which need to be updated atomically; this patch
>> disables the local CPU IRQs to make that atomic. There should be a
>> better way to update these two variables atomically, and I can dig
>> deeper if this behavior is not by design and needs to be fixed.
>> 
>> If these two types of irq_work should have the same priority, maybe
>> we should change
>> 
>> if (!lazy_work || tick_nohz_tick_stopped()) {
>> 	arch_irq_work_raise();
>> }
>> 
>> to
>> 
>> if (!(lazy_work || rt_lazy_work) || tick_nohz_tick_stopped()) {
>> 	arch_irq_work_raise();
>> }
>
> but we wait for the timer for the lazy-work. RT has more LAZY items
> compared to !RT. So if there is an error then it should be visible
> there, too.
>

Since type2 work and type1 work are both added to lazy_list, type2
work can be delayed and effectively has the same priority as type1.

> Is there a problem with this? Adding (as you call it) a type1 item does
> not affect type2 items. They will be processed asap.
>

I noticed this because there is a bug on kernels without commit
b4c6f86ec2f6 ("irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT")
applied, which makes a task hang during CPU hotplug.

On some RT branches, lazy work is queued for softirq context via commit
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/kernel/irq_work.c?h=linux-5.10.y-rt&id=c1ecdc62c514c2d541490026c312ec614ebd35aa
c1ecdc62c5 ("irqwork: push most work into softirq context").

In that case the irq_work is never executed: arch_irq_work_raise() is
not called, and raise_softirq(TIMER_SOFTIRQ) is not reached either.

If no timer exists on the current CPU, it hangs forever.
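
irq_work_sync() then spins forever waiting for the work to finish.
A rough sketch of the v5.10 implementation (modulo RT-specific
changes):

void irq_work_sync(struct irq_work *work)
{
	lockdep_assert_irqs_enabled();

	/* Never terminates if the work is never executed. */
	while (atomic_read(&work->flags) & IRQ_WORK_BUSY)
		cpu_relax();
}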

Log as follows:

[32987.846092] INFO: task core_ctl:749 blocked for more than 120 seconds.
[32987.846106]       Tainted: G           O      5.10.59-rt52-g19228cd9c280-dirty #24
[32987.846117] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[32987.846125] task:core_ctl        state:D stack:    0 pid:  749 ppid:     2 flags:0x00000028
[32987.846149] Call trace:
[32987.846155]  __switch_to+0x164/0x17c
[32987.846175]  __schedule+0x4cc/0x5c0
[32987.846190]  schedule+0x7c/0xcc
[32987.846205]  schedule_timeout+0x34/0xdc
[32987.846338]  do_wait_for_common+0xa0/0x12c
[32987.846360]  wait_for_common+0x44/0x68
[32987.846376]  wait_for_completion+0x18/0x24
[32987.846391]  __cpuhp_kick_ap+0x58/0x68
[32987.846408]  cpuhp_kick_ap+0x38/0x94
[32987.846423]  _cpu_down+0xbc/0x1f8
[32987.846443]  cpu_down_maps_locked+0x20/0x34
[32987.846461]  cpu_device_down+0x24/0x40
[32987.846477]  cpu_subsys_offline+0x10/0x1c
[32987.846496]  device_offline+0x6c/0xbc
[32987.846514]  remove_cpu+0x24/0x40
[32987.846530]  do_core_ctl+0x44/0x88 [cpuhp_qos]
[32987.846563]  try_core_ctl+0x90/0xb0 [cpuhp_qos]
[32987.846587]  kthread+0x114/0x124
[32987.846604]  ret_from_fork+0x10/0x30


Please note this patch is only meant to explain the problem; don't
try to compile it.

> Sebastian
Schspa Shi Aug. 18, 2022, 8:42 p.m. UTC | #3
Schspa Shi <schspa@gmail.com> writes:

> Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
>
>> On 2022-07-12 01:23:15 [+0800], Schspa Shi wrote:
>>> I want to know if this difference is by design.
>>
>> Yes. type1 (LAZY) does not need immediate action but can't be scheduled
>> regularly like a workqueue.
>>
>>> If this is by design, we have a problem: the irq_work of type2 will
>>> not execute as quickly as expected; it may be delayed by the irq_work
>>> of type1.
>>> 
>>> Please consider the following scenario:
>>> 
>>> If the CPU queues a type1 irq_work A, and then a type2 irq_work B,
>>> B will not be executed quickly, because we never issue the IPI to
>>> wake up irq_workd (the llist_add() call returns false).
>>
>> But those two are different lists. So adding type1 to list1 does not
>> affect type2 on list2.
>>
>
> No, both will be added to the same list (lazy_list).
> All irq_work without the IRQ_WORK_HARD_IRQ flag is added to lazy_list.
> Maybe my description of type2 was not clear: type2 irq_work means that
> neither the IRQ_WORK_LAZY flag nor the IRQ_WORK_HARD_IRQ flag is set.
>
>>> This PATCH issues IPI_IRQ_WORK to make B execute quickly.
>>> 
>>> One thing that needs to be improved: we now have lazy_list.node.llist
>>> and lazy_work_raised, which need to be updated atomically; this patch
>>> disables the local CPU IRQs to make that atomic. There should be a
>>> better way to update these two variables atomically, and I can dig
>>> deeper if this behavior is not by design and needs to be fixed.
>>> 
>>> If these two types of irq_work should have the same priority, maybe
>>> we should change
>>> 
>>> if (!lazy_work || tick_nohz_tick_stopped()) {
>>> 	arch_irq_work_raise();
>>> }
>>> 
>>> to
>>> 
>>> if (!(lazy_work || rt_lazy_work) || tick_nohz_tick_stopped()) {
>>> 	arch_irq_work_raise();
>>> }
>>
>> but we wait for the timer for the lazy-work. RT has more LAZY items
>> compared to !RT. So if there is an error then it should be visible
>> there, too.
>>
>
> Since type2 work and type1 work are both added to lazy_list, type2
> work can be delayed and effectively has the same priority as type1.
>
>> Is there a problem with this? Adding (as you call it) a type1 item does
>> not affect type2 items. They will be processed asap.
>>
>
> I noticed this because there is a bug on kernels without commit
> b4c6f86ec2f6 ("irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT")
> applied, which makes a task hang during CPU hotplug.
>
> On some RT branches, lazy work is queued for softirq context via commit
> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/kernel/irq_work.c?h=linux-5.10.y-rt&id=c1ecdc62c514c2d541490026c312ec614ebd35aa
> c1ecdc62c5 ("irqwork: push most work into softirq context").
>
> In that case the irq_work is never executed: arch_irq_work_raise() is
> not called, and raise_softirq(TIMER_SOFTIRQ) is not reached either.
>
> If no timer exists on the current CPU, it hangs forever.
>
> Log as follows:
>
> [32987.846092] INFO: task core_ctl:749 blocked for more than 120 seconds.
> [32987.846106]       Tainted: G           O      5.10.59-rt52-g19228cd9c280-dirty #24
> [32987.846117] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [32987.846125] task:core_ctl        state:D stack:    0 pid:  749 ppid:     2 flags:0x00000028
> [32987.846149] Call trace:
> [32987.846155]  __switch_to+0x164/0x17c
> [32987.846175]  __schedule+0x4cc/0x5c0
> [32987.846190]  schedule+0x7c/0xcc
> [32987.846205]  schedule_timeout+0x34/0xdc
> [32987.846338]  do_wait_for_common+0xa0/0x12c
> [32987.846360]  wait_for_common+0x44/0x68
> [32987.846376]  wait_for_completion+0x18/0x24
> [32987.846391]  __cpuhp_kick_ap+0x58/0x68
> [32987.846408]  cpuhp_kick_ap+0x38/0x94
> [32987.846423]  _cpu_down+0xbc/0x1f8
> [32987.846443]  cpu_down_maps_locked+0x20/0x34
> [32987.846461]  cpu_device_down+0x24/0x40
> [32987.846477]  cpu_subsys_offline+0x10/0x1c
> [32987.846496]  device_offline+0x6c/0xbc
> [32987.846514]  remove_cpu+0x24/0x40
> [32987.846530]  do_core_ctl+0x44/0x88 [cpuhp_qos]
> [32987.846563]  try_core_ctl+0x90/0xb0 [cpuhp_qos]
> [32987.846587]  kthread+0x114/0x124
> [32987.846604]  ret_from_fork+0x10/0x30
>

Adding the missing log to make this hang issue clearer:

[32987.953814] NMI backtrace for cpu 3
[32987.953829] CPU: 3 PID: 30 Comm: cpuhp/3 Tainted: G           O      5.10.59-rt52-g19228cd9c280-dirty #24
[32987.953849] Hardware name: Horizon Robotics Journey 5 DVB (DT)
[32987.953859] pstate: 60c00009 (nZCv daif +PAN +UAO -TCO BTYPE=--)
[32987.953876] pc : irq_work_sync+0x8/0x1c
[32987.953900] lr : cpufreq_dbs_governor_stop+0x50/0x7c
[32987.953923] sp : ffff80001280bd20
[32987.953930] pmr_save: 000000e0
[32987.953937] x29: ffff80001280bd20 x28: 0000000000000000 
[32987.953960] x27: 0000000000000000 x26: 0000000000000000 
[32987.953980] x25: 0000000000000000 x24: 0000000000000001 
[32987.953999] x23: ffff800011403000 x22: ffff000183bf1800 
[32987.954019] x21: ffff80001155e8e0 x20: 0000000000000080 
[32987.954040] x19: ffff00018aafd400 x18: 0000000000000000 
[32987.954059] x17: 0000000000000000 x16: 0000000000000000 
[32987.954079] x15: 000000000000000a x14: 000000000000001e 
[32987.954098] x13: ffff800012e2703c x12: ffffffffffffffff 
[32987.954118] x11: ffffffffffffffff x10: 0000000000000a20 
[32987.954138] x9 : ffff80001280bab0 x8 : ffff000180274480 
[32987.954158] x7 : 00000000ffffffff x6 : ffff800012e255da 
[32987.954177] x5 : ffff80001280bd00 x4 : 0000000000000000 
[32987.954196] x3 : ffff80001280bd00 x2 : 0000000000000000 
[32987.954216] x1 : 0000000000000023 x0 : ffff00018aafd448 
[32987.954235] Call trace:
[32987.954243]  irq_work_sync+0x8/0x1c
[32987.954263]  cpufreq_stop_governor+0x6c/0x80
[32987.954282]  cpufreq_offline+0xc4/0x1ec
[32987.954300]  cpuhp_cpufreq_offline+0x10/0x20
[32987.954317]  cpuhp_invoke_callback+0xc0/0x1b0
[32987.954334]  cpuhp_thread_fun+0x124/0x170
[32987.954350]  smpboot_thread_fn+0x1e8/0x1ec
[32987.954367]  kthread+0x114/0x124
[32987.954384]  ret_from_fork+0x10/0x30

>
> Please note this patch is only meant to explain the problem; don't
> try to compile it.
>
>> Sebastian

Patch

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 7afa40fe5cc43..d5d0b720fac15 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -25,6 +25,7 @@ 
 static DEFINE_PER_CPU(struct llist_head, raised_list);
 static DEFINE_PER_CPU(struct llist_head, lazy_list);
 static DEFINE_PER_CPU(struct task_struct *, irq_workd);
+static DEFINE_PER_CPU(bool, lazy_work_raised);
 
 static void wake_irq_workd(void)
 {
@@ -81,6 +82,7 @@  static void __irq_work_queue_local(struct irq_work *work)
 	bool rt_lazy_work = false;
 	bool lazy_work = false;
 	int work_flags;
+	unsigned long flags;
 
 	work_flags = atomic_read(&work->node.a_flags);
 	if (work_flags & IRQ_WORK_LAZY)
@@ -94,12 +96,35 @@  static void __irq_work_queue_local(struct irq_work *work)
 	else
 		list = this_cpu_ptr(&raised_list);
 
-	if (!llist_add(&work->node.llist, list))
+	local_irq_save(flags);
+	if (!llist_add(&work->node.llist, list)) {
+		bool irq_raised;
+		/*
+		 * On PREEMPT_RT, if a lazy work was added to the list
+		 * before, the IPI may not have been raised. We need an
+		 * extra check for PREEMPT_RT.
+		 */
+		irq_raised = !xchg(this_cpu_ptr(&lazy_work_raised), true);
+		local_irq_restore(flags);
+		if (unlikely(!irq_raised))
+			arch_irq_work_raise();
+
 		return;
+	}
 
 	/* If the work is "lazy", handle it from next tick if any */
-	if (!lazy_work || tick_nohz_tick_stopped())
+	if (!lazy_work || tick_nohz_tick_stopped()) {
+		(void) xchg(this_cpu_ptr(&lazy_work_raised), true);
+		local_irq_restore(flags);
 		arch_irq_work_raise();
+	} else if (lazy_work || rt_lazy_work) {
+		/*
+		 * The first irq work added did not raise an IPI; we need
+		 * to raise one when the next irq work is added.
+		 */
+		(void) xchg(this_cpu_ptr(&lazy_work_raised), false);
+		local_irq_restore(flags);
+	}
 }
 
 /* Enqueue the irq work @work on the current CPU */
@@ -151,9 +176,18 @@  bool irq_work_queue_on(struct irq_work *work, int cpu)
 		 */
 		if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
 		    !(atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ)) {
+			unsigned long flags;
+
+			local_irq_save(flags);
+			if (!llist_add(&work->node.llist, &per_cpu(lazy_list, cpu))) {
+				if (!xchg(this_cpu_ptr(&lazy_work_raised), true))
+					arch_irq_work_raise();
 
-			if (!llist_add(&work->node.llist, &per_cpu(lazy_list, cpu)))
+				local_irq_restore(flags);
 				goto out;
+			}
+			(void) xchg(this_cpu_ptr(&lazy_work_raised), true);
+			local_irq_restore(flags);
 
 			work = &per_cpu(irq_work_wakeup, cpu);
 			if (!irq_work_claim(work))