diff mbox series

[v3] usb: dwc3: gadget: Propagate core init errors to UDC during pullup

Message ID 20230618120949.14868-1-quic_kriskura@quicinc.com
State New
Series [v3] usb: dwc3: gadget: Propagate core init errors to UDC during pullup

Commit Message

Krishna Kurapati June 18, 2023, 12:09 p.m. UTC
In scenarios where pullup relies on resume (get sync) to initialize
the controller and set the run stop bit, core_init is followed by
gadget_resume, which will eventually set the run stop bit.

But in cases where the core_init fails, the return value is not sent
back to udc appropriately. So according to UDC the controller has
started but in reality we never set the run stop bit.

On systems like Android, there are uevents sent to HAL depending on
whether the configfs_bind / configfs_disconnect were invoked. In the
above mentioned scenario, if the core init fails, the run stop bit won't
be set, so the cable plug-out won't generate a disconnect event,
userspace will never get a uevent for the plug-out, and pullup(0) is
never called again. Furthermore, none of the subsequent plug-ins or
plug-outs would be known to configfs.

Return the appropriate result to UDC so that userspace/configfs know
that the pullup failed and can take appropriate action.

Fixes: 77adb8bdf422 ("usb: dwc3: gadget: Allow runtime suspend if UDC unbinded")
Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com>
Acked-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com>
---
Changes in v3: Added changelog missing in v2
Changes in v2: Added Fixes tag

 drivers/usb/dwc3/gadget.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Krishna Kurapati June 20, 2023, 4:53 a.m. UTC | #1
On 6/19/2023 8:36 PM, Johan Hovold wrote:
> On Mon, Jun 19, 2023 at 06:20:43PM +0530, Krishna Kurapati PSSNV wrote:
>> On 6/19/2023 12:36 PM, Johan Hovold wrote:
>>> On Sun, Jun 18, 2023 at 05:39:49PM +0530, Krishna Kurapati wrote:
> 
>>>> @@ -2747,7 +2747,9 @@ static int dwc3_gadget_pullup(struct usb_gadget *g, int is_on)
>>>>    	ret = pm_runtime_get_sync(dwc->dev);
>>>>    	if (!ret || ret < 0) {
>>>>    		pm_runtime_put(dwc->dev);
>>>> -		return 0;
>>>> +		if (ret < 0)
>>>> +			pm_runtime_set_suspended(dwc->dev);
>>>
>>> This bit is broken and is also not mentioned or explained in the commit
>>> message. What are you trying to achieve here?
>>>
>>> You cannot set the state like this after runtime PM is enabled and the
>>> above call will always fail.
> 
>> The reason why I am returning ret is because, when the first get_sync
>> fails because of core_init failure and we return 0 instead of ret, the
>> UDC thinks that controller has started successfully but we never set the
>> run stop bit.
> 
> That bit is clear.
> 
>> So when we plug out the cable, the disconnect event won't be
>> generated, and on systems like Android user space will never clear
>> the UDC upon disconnect. It's a sort of mismatch between controller
>> and UDC.
> 
> Ok, but the controller is in an error state after the resume failure. And
> here you rely on user space to retry gadget activation in order to
> eventually detect the disconnect event?
>   
>> Also once the first get_sync fails, the dwc->dev->power.runtime_error
>> flag is set and successive calls to get_sync always return -EINVAL. In
>> this situation even if UDC/configfs retry pullup, resume_common will
>> never be called and we never actually start the controller or resume
>> dwc->dev.
>>
>> By calling set_suspended, I am trying to clear the runtime_error flag so
>> that the next retry to pullup will call resume_common and retry
>> core_init and set run_stop.
> 
> Ok, thanks, that's the bit I was missing in the commit message.
> 
> First, I perhaps mistakenly thought pm_runtime_set_suspended() may only
> be called with PM runtime disabled, but it appears it may indeed be
> valid to call also after an error but with the caveat that the device
> must then actually be in the suspended state.
> 
> The documentation and implementation are inconsistent here as the kernel
> doc for pm_runtime_set_suspended() clearly states:
> 
> 	It is not valid to call this function for devices with runtime
> 	PM enabled.
> 
> and it also looks like we'd end up with an active-child counter
> imbalance if anyone actually tries to do so.
> 
> But either way, it also seems like the controller is not guaranteed to
> be suspended here as pm_runtime_get_sync() may also fail after
> previous errors that have left the controller in the active state?
> 
> Also, what kind of errors would cause core_init and resume to fail here?
> 
Hi Johan,

   As per the comment just above the get_sync during pullup, the
resume_common path is used to resume the controller and start peripheral
mode in case the dwc3 was in suspended state. So if we enter
gadget_resume, we are in the suspended state the first time it is called.

Regarding the errors that might leave the controller in active state, I
have faced an issue in core init, and that function cleans up when
something fails. So the controller was actually not in active state
after the cleanup was done. In resume common, if core_init_for_resume
fails, we clean up everything initialized up until that point.

The scenario you mentioned would be applicable in case gadget_resume
fails. We do not check its return value there, and I am not sure what
the side effects of that missing check are. Either way, since it was
failing in core init, I went ahead and made this patch.

As for the reason for the failure in core init, the following is what
was happening at the customer's end:

1. Cable plug-in
2. get_sync calls resume common which in turn calls core_init
3. core soft reset fails in core init, we clean up and return -110
4. After applying this patch, the -110 was propagated to UDC properly
5. We got a second call to pullup via connect_control and this time 
reset was successful.
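
The sequence above can be modelled in user space roughly as follows. This is a stand-alone sketch, not kernel code: every `fake_*` name and `struct fake_dev` are hypothetical, the sticky `runtime_error` flag only mirrors the behaviour described earlier in this thread, and the "soft reset fails once, then passes" counter is an assumption taken from the steps listed. On Linux, -110 is -ETIMEDOUT.

```c
/*
 * User-space model of the plug-in sequence; NOT kernel code.
 * All fake_* names and struct fake_dev are hypothetical.
 */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_dev {
	bool runtime_error;    /* models dev->power.runtime_error (sticky) */
	bool active;           /* true once the controller has resumed */
	int  soft_reset_fails; /* number of resume attempts that will time out */
};

/* Models the resume callback: core init with a soft reset that may time out. */
static int fake_resume(struct fake_dev *dev)
{
	if (dev->soft_reset_fails > 0) {
		dev->soft_reset_fails--;
		return -ETIMEDOUT; /* -110 on Linux */
	}
	dev->active = true;
	return 0;
}

/* Models pm_runtime_get_sync(): 1 if already active, 0 if resumed, <0 on error. */
static int fake_get_sync(struct fake_dev *dev)
{
	int ret;

	assert(dev != NULL);
	if (dev->runtime_error)
		return -EINVAL; /* an earlier failure blocks further resumes */
	if (dev->active)
		return 1;
	ret = fake_resume(dev);
	if (ret < 0)
		dev->runtime_error = true;
	return ret;
}

/* Models pm_runtime_set_suspended() as used in the patch: clears the error. */
static void fake_set_suspended(struct fake_dev *dev)
{
	dev->runtime_error = false;
	dev->active = false;
}

/* Models the early part of pullup(1); 'patched' selects the new behaviour. */
static int fake_pullup(struct fake_dev *dev, bool patched)
{
	int ret = fake_get_sync(dev);

	if (ret <= 0) {
		if (!patched)
			return 0; /* old code: the error is hidden from the UDC */
		if (ret < 0)
			fake_set_suspended(dev);
		return ret;
	}
	return 0; /* already active: the real driver continues with run/stop handling */
}
```

With `soft_reset_fails = 1`, the patched model returns -ETIMEDOUT from the first `fake_pullup()` and 0 from the second. The unpatched model returns 0 both times while the device never becomes active, which is the controller/UDC mismatch discussed above.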

The behavior was similar to [1]. There as well, on all Gen-2 targets,
I see the soft reset fail on the first attempt and pass once the retry
happens.

[1]: https://lore.kernel.org/all/20230510075252.31023-2-quic_kriskura@quicinc.com/

Regards,
Krishna

> If this is something that you see during normal operation then this
> seems to suggest that something is wrong with the runtime pm
> implementation.
> 
> Note that virtually all drivers treat resume failures as fatal errors
> and do not implement any recovery from that.
> 
> In fact, the only other example of this kind of usage that I could find
> is also for a Qualcomm driver...
> 
> Johan

Patch

diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 578804dc29ca..27cb671e18e3 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -2747,7 +2747,9 @@  static int dwc3_gadget_pullup(struct usb_gadget *g, int is_on)
 	ret = pm_runtime_get_sync(dwc->dev);
 	if (!ret || ret < 0) {
 		pm_runtime_put(dwc->dev);
-		return 0;
+		if (ret < 0)
+			pm_runtime_set_suspended(dwc->dev);
+		return ret;
 	}
 
 	if (dwc->pullups_connected == is_on) {
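
The early-return branch in this hunk can be read as a three-way split on the return value of pm_runtime_get_sync(): negative on failure, 0 when the device was suspended and has just been resumed (the resume path then handles the run stop bit, as the commit message describes), and positive when the device was already active. The sketch below only illustrates that split. `early_return()` and its `patched` parameter are hypothetical and do not exist in the driver.

```c
/*
 * Illustration only: early_return() is a hypothetical stand-alone helper,
 * not a function in drivers/usb/dwc3/gadget.c.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * get_sync_ret models the value returned by pm_runtime_get_sync():
 *   < 0  resume failed
 *     0  device was suspended and has just been resumed
 *   > 0  device was already active
 *
 * Returns true when pullup returns early and stores its return value in *out.
 * Returns false (leaving *out untouched) when pullup carries on.
 */
static bool early_return(int get_sync_ret, bool patched, int *out)
{
	assert(out != NULL);

	if (get_sync_ret > 0)
		return false; /* fall through to the run/stop handling */

	/* Before the patch, this branch always reported success. */
	*out = patched ? get_sync_ret : 0;
	return true;
}
```

For a failed resume (for example -110), the unpatched variant stores 0 and the patched variant stores -110. For a successful resume, both store 0.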