Author Topic: Picking up an endpoind from peer of extmodule crashes Yate  (Read 13669 times)

andr04

  • Newbie
  • *
  • Posts: 18
    • View Profile
I'm trying to fix the bug during last days, but without success. :(

If I have a sip endpoint which is connected to an external plugin (in playrec mode) a try to picking it up (disconnect extmodule and connect sip endpoints each other) crashes Yate at all in 90% of cases.

The log which I see:
Code: [Select]
2017-05-01_23:17:24.413847 <INFO> Choosing offered 'audio' format 'alaw' [0x7fb840011bc0]
2017-05-01_23:17:24.413857 <sip/3:NOTE> Formats for 'audio' changed to 'alaw' [0x7fb840010990]
2017-05-01_23:17:24.413902 >>> DataTranslator::detachChain(0x7fb83c006dd0,0x7fb84c00fac0)
2017-05-01_23:17:27.000093   <engine:MILD> Creating new message dispatching thread (1 running)
2017-05-01_23:17:29.000097   <engine:MILD> Creating new message dispatching thread (2 running)
...

The normal state (if I disconnecting tone endpoint):
Code: [Select]
2017-05-01_22:36:17.464148 <yrtp:INFO> RTP starting format 'alaw' payload 8 [0x7fb840016b40]
2017-05-01_22:36:17.464172 >>> DataTranslator::detachChain(0x7fb840017190,0x7fb8300040a0)
2017-05-01_22:36:17.464177 <<< DataTranslator::detachChain
2017-05-01_22:36:17.464185 <ALL> DataTranslator::attachChain [0x7fb840017190] 'alaw' -> [0x7fb8300040a0] 'alaw' succeeded
2017-05-01_22:36:17.464190 >>> DataTranslator::detachChain(0x7fb830003ec0,0x7fb840017320)
2017-05-01_22:36:17.464193 <<< DataTranslator::detachChain
2017-05-01_22:36:17.464198 <ALL> DataTranslator::attachChain [0x7fb830003ec0] 'alaw' -> [0x7fb840017320] 'alaw' succeeded
2017-05-01_22:36:17.464270 <yrtp:INFO> Wrapper opened a hole in firewall/NAT [0x7fb840016b40]

marian

  • Hero Member
  • *****
  • Posts: 513
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #1 on: May 02, 2017, 12:41:14 AM »
Hi,
Can you provide a larger log?
A backtrace would help a lot.
You may start yate with -C command line option.
Please see http://docs.yate.ro/wiki/Bug_reporting on how to build a backtrace.

andr04

  • Newbie
  • *
  • Posts: 18
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #2 on: May 02, 2017, 02:59:04 AM »
See attachments (suppose it doesn't contain a private data; if no notify me please). It's not a crash in true understanding but it produces extra threads without success and after that (if run Yate as a service) it restarted by supervisor process. Without a supervisor process it hangs up.

Log contains some extra test outputs (from extmodule.cpp and external modules as well).

The scenario was:
1. Call from 'home' to 'mobile'.
2. Hangup 'home' leg => 'mobile' leg connects to 'noise.py' (it's a channel module).
3. Call from 'home' to 'mobile' again => not happy to picking up 'mobile' leg from 'noise.py'.
4. Since not happy hangup 'home' and 'mobile'.

To manage the call logic I use 'monitor.py' (it's a global module). I believe the external modules ('monitor.py' and 'noise.py') is correct because if 'monitor.py' runs 'call.execute' with 'callto'='tone/milliwatt' (instead of 'external/playrec/noise.py') no any bugs are happend.

Also it seems external modules don't killed in a proper way. For example on restarting external module I see the following:
Code: [Select]
2017-05-02_08:36:57.353068 <TEST> !!! ExtModReceiver::die
2017-05-02_08:36:57.353114 <ALL> ExtModReceiver::die() pid=10175 dead=no [0x191e4f0]
Unloading external module 'monitor.py' ''
2017-05-02_08:36:57.353130 <ALL> ExtModReceiver::die() waiting for pid=10175 to die [0x191e4f0]
2017-05-02_08:36:57.360740 <INFO> ExtModReceiver::die() pid=10175 did not exit? [0x191e4f0]
2017-05-02_08:36:57.360768 <TEST> !!! ExtModReceiver::flush
2017-05-02_08:36:57.361942 <TEST> !!! ExtModReceiver::cleanup
2017-05-02_08:36:57.361999 <WARN> Process 10175 has not exited on closing stdin - we'll kill it
2017-05-02_08:36:57.362118 <WARN> Process 10175 has still not exited yet?
2017-05-02_08:36:57.365940 <ALL> ExtModReceiver::destruct() pid=0 [0x191e4f0]
2017-05-02_08:36:57.365961 <TEST> !!! ExtModReceiver::die
2017-05-02_08:36:57.365966 <ALL> ExtModReceiver::die() pid=0 dead=yes [0x191e4f0]
2017-05-02_08:36:57.366000 <ALL> ExtModReceiver::ExtModReceiver("monitor.py","(null)") [0x7f36840010f0]
Loading external module 'monitor.py' ''

'monitor.py' handles stdin and sigint in a proper way. I don't understand in the code why we killing external (call closeOut) in 2 places: in ExtModReceiver::die and ExtModReceiver::cleanup . And in both cases we don't wait for a reasonable time (because an external module can be complicated and requires a time for proper terminating). Another question is why we don't kill by sigint signal (sigterm only in a last chance)? Since we terminate in 2 different places I don't know for sure how to fix it in a proper way as well.
« Last Edit: May 02, 2017, 03:30:59 AM by andr04 »

marian

  • Hero Member
  • *****
  • Posts: 513
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #3 on: May 02, 2017, 08:49:30 AM »
What version of yate svn are you using?
Did you made any other changes (except for debug messages) in extmodule?

andr04

  • Newbie
  • *
  • Posts: 18
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #4 on: May 02, 2017, 04:01:19 PM »
Yes, I've make some changes.

First, I use release version. When I haven't reach a success I replaced extmodule.cpp from trunk. Then I add ability to use CallEndpoint by extmodules which are started by the global one: it's only one line of new code ExtModReceiver::processLine:
Code: [Select]
    if (m_setdata) {
CallEndpoint* c_chan = locateChan(m->getValue("channelid"),m->getBoolValue("channelid_peer"));
m->userData(c_chan ? c_chan : m_chan);
            }
After that I started to solve the described bug and tried to add to ExtModSource methods cleanup and attached (from wavefile.cpp) - it also doesn't help.

My latest version of extmodule.cpp is attached.

marian

  • Hero Member
  • *****
  • Posts: 513
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #5 on: May 03, 2017, 01:16:16 AM »
These are the last lines in log:
2017-05-02_07:56:25.278737 >>> ExtModChan::~ExtModChan() [0x7fd738008fa0]
2017-05-02_07:56:25.278743   <TEST> !!! ExtModReceiver::die
2017-05-02_07:56:25.278748   <ALL> ExtModReceiver::die() pid=9754 dead=no [0x7fd738010970]
2017-05-02_07:56:25.278754   <ALL> ExtModReceiver::die() waiting for pid=9754 to die [0x7fd738010970]
2017-05-02_07:56:25.278763   <extmodule:ALL> ExtModSource cleanup, total=56000, chan=(nil) [0x7fd738010820]
2017-05-02_07:56:25.278770   <ALL> ExtModSource::~ExtModSource() [0x7fd738010820] total=56000

You overridden ExtModSource::cleanup() to reset m_chan. The ExtModSource destructor calls m_chan->setRunning(...). Backtrace indicates it crashed there (this=0x0)
Remember:
ExtModSource::cleanup() is called while m_chan is in destructor: m_chan->ref() will fail (refcount is already is already 0)
You shouldn't reset m_chan there: it is not a dead pointer, it's a dying pointer

andr04

  • Newbie
  • *
  • Posts: 18
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #6 on: May 03, 2017, 05:48:40 AM »
Let's try again from scratch.

1. Use release version Yate 5.5.0.
2. Apply only the patch extmodule.cpp.diff .
3. Recompile, run with yate -vvvvvvvvvv -C -l yate.log -c /home/yate/conf .
4. Reproduce the problem (not from 1st time; it occurs in ≈60% of cases) and wait for 30 seconds.
5. Kill yate with -6 (SIGABORT) signal to make core dump.
6. Generate backtrace: echo -e 'thread apply all bt\nquit' | gdb yate core &> yate.bt

That's it. All files is attached.

And last words about restarting the global external module. I believe it's not the right way not to wait reasonable time for terminating:
Code: [Select]
2017-05-03_11:43:30.561179 <ALL> ExtModReceiver::die() pid=11961 dead=no [0xd21c90]
Unloading external module 'monitor.py' ''
2017-05-03_11:43:30.561241 <ALL> ExtModReceiver::die() waiting for pid=11961 to die [0xd21c90]
2017-05-03_11:43:30.568744 <INFO> ExtModReceiver::die() pid=11961 did not exit? [0xd21c90]
2017-05-03_11:43:30.573032 <ExtModule:WARN> Read error 9 on 0x7fad94000940 [0xd21c90]
2017-05-03_11:43:30.573137 <WARN> Process 11961 has not exited on closing stdin - we'll kill it
2017-05-03_11:43:30.573238 <WARN> Process 11961 has still not exited yet?
2017-05-03_11:43:30.573868 <ALL> ExtModReceiver::~ExtModReceiver() pid=0 [0xd21c90]
2017-05-03_11:43:30.573880 <ALL> ExtModReceiver::die() pid=0 dead=yes [0xd21c90]
2017-05-03_11:43:30.573899 <ALL> ExtModReceiver::ExtModReceiver("monitor.py","(null)") [0x7fad7c001070]

marian

  • Hero Member
  • *****
  • Posts: 513
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #7 on: May 03, 2017, 06:25:31 AM »
Seems that ExtModConsumer::Consume() is not returning fast enough to release source mutex (needed to be locked when consumer is detached from source).

Please start yate with -C -Ddam
Please start message sniffer.
Please post log and backtrace after termination: when the mutex lock will time out yate will crash.


andr04

  • Newbie
  • *
  • Posts: 18
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #8 on: May 03, 2017, 08:01:12 AM »
Yes, it happened! Since the 2nd recall (in log you can see what it was in the normal case).

See attachments.

marian

  • Hero Member
  • *****
  • Posts: 513
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #9 on: May 03, 2017, 08:27:59 AM »
Indeed, ExtModConsumer::Consume is blocking the source's mutex.
You may try to debug it: add debug message when entering and exiting from ExtModConsumer::Consume().
This way you'll see in log when entered and exited (if exited!) and compute the duration.


andr04

  • Newbie
  • *
  • Posts: 18
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #10 on: May 03, 2017, 09:37:26 AM »
I made the following changes:
Code: [Select]
unsigned long ExtModConsumer::Consume(const DataBlock& data, unsigned long timestamp, unsigned long flags)
{
    if ((m_str) && !data.null()) {
Debug(DebugTest,"ExtModConsumer::Consume ENTER");
m_str->writeData(data);
m_total += data.length();
Debug(DebugTest,"ExtModConsumer::Consume EXIT");
return invalidStamp();
    }
    return 0;
}

The log is:
Code: [Select]
2017-05-03_15:31:13.431325 <TEST> ExtModConsumer::Consume ENTER
2017-05-03_15:31:13.431346 <TEST> ExtModConsumer::Consume EXIT
2017-05-03_15:31:13.466999 <TEST> ExtModConsumer::Consume ENTER
2017-05-03_15:31:13.467037 <TEST> ExtModConsumer::Consume EXIT
2017-05-03_15:31:13.467045 <TEST> ExtModConsumer::Consume ENTER
2017-05-03_15:31:16.057263 <sip:INFO> 'udp:0.0.0.0:5060' received 1351 bytes SIP message from 85.142.148.80:5060 [0x9b2dd0]
------
INVITE sip:200@37.140.188.160:5060 SIP/2.0
Via: SIP/2.0/UDP 85.142.148.80:5060;rport;branch=z9hG4bKPjb7b30d52-e99d-4776-9c6b-9c29aadb0b35
...
...
2017-05-03_15:31:16.063444 <INFO> Could not route call to '200' in context 'default', wasted 169 usec
!!! call_route: {'sip_user-agent': 'ruVoIP.net PBX', 'ip_host': '85.142.148.80', 'billid': '1493825410-3', 'ip_port': '5060', 'transport': 'RTP/AVP', 'antiloop': '19', 'media_video': 'yes', 'sip_content-type': 'application/sdp', 'sip_x-cid': 'KcffpUKcxxDt37oI2L77-hrHzsb7Moa-', 'sdp_maxptime': '20', 'sip_callid': '42c9652c-980e-4a17-bb77-8dbb8f5dcb5f', 'sip_from': 'sip:laptop@talk37.ru', 'callername': 'laptop', 'sip_allow': 'OPTIONS, SUBSCRIBE, NOTIFY, PUBLISH, INVITE, ACK, BYE, CANCEL, UPDATE, PRACK, REGISTER, MESSAGE, REFER', 'rtp_addr': '85.142.148.80', 'sdp_sendrecv': '', 'rtp_port_video': '25680', 'direction': 'incoming', 'transport_video': 'RTP/AVP', 'module': 'sip', 'handlers': 'regexroute:100,sip:100,regexroute:100', 'answered': 'false', 'device': 'ruVoIP.net PBX', 'sip_min-se': '90', 'sip_session-expires': '1800', 'formats_video': '', 'sip_contact': '<sip:asterisk@85.142.148.80:5060>', 'sip_supported': '100rel, timer, replaces, norefersub', 'rtp_forward': 'possible', 'formats': 'alaw', 'ip_transport': 'UDP', 'id': 'sip/4', 'rtp_rfc2833': '101', 'caller': 'laptop', 'address': '85.142.148.80:5060', 'media': 'yes', 'sip_uri': 'sip:200@37.140.188.160:5060', 'called': '200', 'sip_to': '<sip:200@37.140.188.160>', 'callid': 'sip/42c9652c-980e-4a17-bb77-8dbb8f5dcb5f/cd69bf93-880a-4da6-9a74-754ae71f7408/', 'status': 'incoming', 'rtp_port': '25486', 'sdp_video_sendrecv': ''} | {('200', 'laptop'): 'sip/2'} | {'sip/2': ('200', 'laptop')}
2017-05-03_15:31:16.067878 <INFO> Choosing offered 'audio' format 'alaw' [0x7fc588012b10]
2017-05-03_15:31:16.067888 <sip/4:NOTE> Formats for 'audio' changed to 'alaw' [0x7fc588012090]
2017-05-03_15:31:16.067924 >>> DataTranslator::detachChain(0x7fc584006e00,0x7fc59400fae0)
2017-05-03_15:31:19.000091   <engine:MILD> Creating new message dispatching thread (1 running)
2017-05-03_15:31:21.000020   <engine:MILD> Creating new message dispatching thread (2 running)
2017-05-03_15:31:23.000059   <engine:MILD> Creating new message dispatching thread (3 running)
2017-05-03_15:31:25.000087   <engine:WARN> Creating new message dispatching thread (4 running)
2017-05-03_15:31:27.000143   <engine:WARN> Creating new message dispatching thread (5 running)
2017-05-03_15:31:29.000109   <engine:WARN> Creating new message dispatching thread (6 running)
Supervisor: killing unresponsive child 12900
Yate (12935) is starting Wed May  3 15:31:30 2017

As you can see no EXIT before Yate hangs up. And if the problem is not happened the log is:
Code: [Select]
2017-05-03_15:31:04.590200 <TEST> ExtModConsumer::Consume ENTER
2017-05-03_15:31:04.590227 <TEST> ExtModConsumer::Consume EXIT
2017-05-03_15:31:04.590236 <TEST> ExtModConsumer::Consume ENTER
2017-05-03_15:31:04.590245 <TEST> ExtModConsumer::Consume EXIT
2017-05-03_15:31:04.601540 <sip:INFO> 'udp:0.0.0.0:5060' received 1349 bytes SIP message from 85.142.148.80:5060 [0x9b2dd0]
------
INVITE sip:200@37.140.188.160:5060 SIP/2.0
Via: SIP/2.0/UDP 85.142.148.80:5060;rport;branch=z9hG4bKPjab09198c-6c90-4b01-821d-5dfd05972ad3

I.e. EXIT is appeared. Sure, I really don't know why it wasn't EXITed for 2.6 seconds when Yate hangs up...
« Last Edit: May 03, 2017, 12:47:26 PM by andr04 »

marian

  • Hero Member
  • *****
  • Posts: 513
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #11 on: May 04, 2017, 01:20:29 AM »
It didn't returned for 16 seconds!
2017-05-03_15:31:13.467045 <TEST> ExtModConsumer::Consume ENTER
2017-05-03_15:31:29.000109   <engine:WARN> Creating new message dispatching thread (6 running)

Check your script to see how received data is handled.
If you are storing data check the storage for errors (this may indicate a slow file system).
You may consider using wavefile module to store data

andr04

  • Newbie
  • *
  • Posts: 18
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #12 on: May 04, 2017, 07:42:58 AM »
Yes, there is no problems if I read and lose input audio data from external module, or use 'play' mode instead of 'playrec'. Thank you!

I suppose, I can make a circular buffer on an external module side; maybe it will be better.

And what about killing external module? How to catch termination in external module correctly or to fix extmodule to wait for a reasonable time for termination? I suppose the following is not a good solution:
Code: [Select]
Unloading external module 'monitor.py' ''
2017-05-04_13:41:08.492771 <INFO> ExtModReceiver::die() pid=15024 did not exit? [0x7fc580001070]
2017-05-04_13:41:08.496951 <ExtModule:WARN> Read error 9 on 0x7fc590003d10 [0x7fc580001070]
2017-05-04_13:41:08.497051 <WARN> Process 15024 has not exited on closing stdin - we'll kill it
2017-05-04_13:41:08.497540 <WARN> Process 15024 has still not exited yet?

marian

  • Hero Member
  • *****
  • Posts: 513
    • View Profile
Re: Picking up an endpoind from peer of extmodule crashes Yate
« Reply #13 on: May 04, 2017, 09:01:01 AM »
I'm not very familiar with external modules.
I don't know if it's a good idea to wait more for external module termination.