#432: SIP errors with CW on a Flash-HDD (Fixed)

Jul 01 2008 * 20:51
Reported by: Release: 1.2
Priority: Normal Milestone:
Component: Assigned to:

When I install CW on a Flash-HDD (in my case a 1GB IDE Flash Module from Transcend, with Debian/Lenny as the OS), I get these SIP errors when I cancel a ringing call:

Jun 30 22:46:53 ERROR3044608912: chan_sip.c:14516 sipsock_read: We could NOT get the channel lock for SIP/10-812a – Call ID b3c8e0424b094405@192.168.116.102!
Jun 30 22:46:53 ERROR3044608912: chan_sip.c:14517 sipsock_read: SIP MESSAGE JUST IGNORED: ACK

For testing I also installed an old Asterisk 1.2.13 (from the Debian source) on the same system, and there I don’t get these errors.

OK, then I saw that the sipsock_read function retries 100 times (unsigned int lockretry = 100;) before it gives the error, so I just set lockretry to 100000, and then it worked. I added a counter and can see that after ~20000-40000 retries (which is ~40ms) the

if (p->owner && cw_mutex_trylock(&p->owner->lock))

in sipsock_read no longer blocks, because after 40ms p->owner is finally NULL. But it should be NULL on the first attempt, as it is on a normal HDD. So the question is: why does it need ~40ms to get p->owner to NULL when we cancel a ringing call on CW@Flash-HDD?
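To make the retry behaviour easier to reason about, here is a minimal, single-threaded sketch of it. It is not the chan_sip.c code: demo_channel, demo_pvt and demo_sipsock_retry are hypothetical stand-ins, plain pthread mutexes replace cw_mutex_trylock, and the hangup path is simulated by clearing the owner pointer after a fixed number of failed attempts.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the channel and the SIP private structure. */
struct demo_channel { pthread_mutex_t lock; };
struct demo_pvt { struct demo_channel *owner; };

/*
 * Simulate one cancelled call. The owner lock stays held (as if the hangup
 * path had it) and the owner pointer is cleared after `hangup_done_after`
 * failed attempts. Returns the attempt that got through, or 0 if all
 * `lockretry` attempts failed.
 */
unsigned int demo_sipsock_retry(unsigned int lockretry,
                                unsigned int hangup_done_after)
{
    struct demo_channel chan;
    struct demo_pvt pvt = { &chan };
    unsigned int attempt, result = 0;
    int rc;

    rc = pthread_mutex_init(&chan.lock, NULL);
    assert(rc == 0);
    rc = pthread_mutex_lock(&chan.lock);   /* "hangup" holds the channel lock */
    assert(rc == 0);
    (void)rc;

    for (attempt = 1; attempt <= lockretry; attempt++) {
        /* Same shape as the test quoted in the ticket. */
        if (pvt.owner && pthread_mutex_trylock(&pvt.owner->lock)) {
            if (attempt == hangup_done_after)
                pvt.owner = NULL;          /* simulated: hangup has finished */
            continue;
        }
        result = attempt;                  /* owner is gone, nothing to lock */
        break;
    }

    pthread_mutex_unlock(&chan.lock);
    pthread_mutex_destroy(&chan.lock);
    return result;
}
```

With a hangup that needs 30000 attempts to finish, demo_sipsock_retry(100, 30000) returns 0 (the error case), while demo_sipsock_retry(100000, 30000) returns 30001.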

Changelog:

Modified by Jul 01 2008 * 23:26

I tested this with the old RC5… no error with RC5. Any ideas what the influencing change could be?

Modified by Jul 02 2008 * 02:03

Tested all the changes: nothing. Then I found that it must be one of the Debian configure settings, because I always build debs with dpkg-buildpackage, and when I do a plain install I don’t get the errors. Now I am testing each of the Debian configure settings.

Modified by Jul 02 2008 * 17:23

After testing all configure options from the rules file in /debian step by step, I found out that even one simple configure option for the path, like:

./configure '--prefix=/usr/'

brings the error back.

Only a plain ./configure gives me a CallWeaver without these errors. Crazy! I also just tested that passing some additional configure options such as CFLAGS, like:

./configure CHOST='i686-pc-linux-gnu' CFLAGS='

is not a problem. So a single simple path option is enough to give the error.

Modified by Jul 02 2008 * 21:48

The error is in libcallweaver.so, because when I simply replace libcallweaver.so with one that was compiled with a plain ./configure, it works without the error.

Modified by Jul 04 2008 * 00:42

After long, long debugging (but I really learned a lot about the internal structure of CW) I found that the error occurs because the cw_hangup function in channel.c needs too long to (re)close the CDR. Then we get a race condition, so that the

if (p->owner && cw_mutex_trylock(&p->owner->lock))

check in sipsock_read (chan_sip) keeps failing to get the lock, retries 100 times and then gives us the error.

I still don’t know why closing the CDR takes longer with CW@Flash-HDD, but this is where the delay starts.

Modified by Jul 04 2008 * 13:19

OK, it’s cdr_sqlite3! When the cw_hangup function in channel.c tells the CDR backends it is closing time, cdr_sqlite3 needs too long to write to cdr.db on a Flash-HDD. That’s it. So I just noload it. It should not be a hard job to code a lock/wait until cdr_sqlite3 or the other CDR backends are ready, but this job can be done, um, well… later :)

Modified by Jul 04 2008 * 20:15

More details: cw_hangup tells all CDR backends to close via post_cdr in cdr.c, and res_sqlite is also called to close there. Both res_sqlite and cdr_sqlite3 write slowly on a Flash-HDD, so the whole cw_hangup process takes an unnaturally long time (up to 40ms).

For me this is not a problem. I don’t need either of them, so I just unload them. But perhaps it would be an idea to set a mark that cw_hangup is in progress; then sipsock_read could check whether the channel owner is being hung up, would not need to lock it anymore, and would not need to throw error messages at us.
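The "mark" idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the hangup_in_progress field, the demo_* names and the result codes do not exist in CallWeaver, and plain pthread mutexes plus a C11 atomic flag stand in for the real locking.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_result { DEMO_LOCKED, DEMO_NO_OWNER, DEMO_HANGUP_PENDING, DEMO_GAVE_UP };

/* Hypothetical channel with an extra "hangup in progress" mark. */
struct demo_channel {
    pthread_mutex_t lock;
    atomic_bool hangup_in_progress;
};

/* Reader side: decide what to do with an incoming message for `owner`. */
enum demo_result demo_lock_owner(struct demo_channel *owner,
                                 unsigned int lockretry)
{
    unsigned int attempt;

    if (!owner)
        return DEMO_NO_OWNER;
    for (attempt = 0; attempt < lockretry; attempt++) {
        if (atomic_load(&owner->hangup_in_progress))
            return DEMO_HANGUP_PENDING;   /* skip locking, no error needed */
        if (pthread_mutex_trylock(&owner->lock) == 0)
            return DEMO_LOCKED;           /* caller must unlock */
    }
    return DEMO_GAVE_UP;                  /* the case that logs the error */
}

/* Scenario: the hangup path holds the lock, with or without the mark set. */
enum demo_result demo_scenario(bool set_mark)
{
    struct demo_channel chan;
    enum demo_result res;
    int rc;

    rc = pthread_mutex_init(&chan.lock, NULL);
    assert(rc == 0);
    (void)rc;
    atomic_init(&chan.hangup_in_progress, set_mark);
    pthread_mutex_lock(&chan.lock);       /* hangup is busy, e.g. with CDRs */

    res = demo_lock_owner(&chan, 100);

    pthread_mutex_unlock(&chan.lock);
    pthread_mutex_destroy(&chan.lock);
    return res;
}
```

Here demo_scenario(true) returns DEMO_HANGUP_PENDING on the first attempt, while demo_scenario(false) burns all 100 retries and returns DEMO_GAVE_UP. A real implementation would also have to make sure the channel cannot be freed while the reader is looking at the mark.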

Modified by Jul 08 2008 * 22:36

The sqlite backends are probably doing fsyncs as part of their commit handling (they should be anyway). Not all filesystems, block device drivers and hard disks do proper write flushing and SSDs so far tend to have poor write performance, especially when they get small writes flushed separately. So the time to flush CDRs can vary a lot.
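One rough way to see how much a flushed small write costs on a given disk is to time it directly. The sketch below is only a measuring aid, not anything taken from the sqlite backends: demo_time_synced_writes and the record contents are made up, and it simply does a few small writes, each followed by fsync, on a scratch file that it removes afterwards.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Do `count` small writes to `path`, each followed by fsync, and return the
 * elapsed time in milliseconds, or -1.0 on any error. The file is removed.
 */
double demo_time_synced_writes(const char *path, int count)
{
    static const char record[] = "demo cdr record\n";
    const ssize_t len = (ssize_t)(sizeof(record) - 1);
    struct timespec start, end;
    int i, fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < count; i++) {
        /* One small record, flushed on its own, like a per-call commit. */
        if (write(fd, record, (size_t)len) != len || fsync(fd) != 0) {
            close(fd);
            unlink(path);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    close(fd);
    unlink(path);
    return (double)(end.tv_sec - start.tv_sec) * 1000.0
         + (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}
```

Calling it with a path on the flash module and again with a path on a normal disk gives two numbers to compare; the result depends entirely on the filesystem and device, so no typical value is claimed here.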

Then there’s the way the legacy * code (ab)uses locks. That isn’t going to change in a hurry and almost certainly not in the stable 1.2 branch.

What you can try is to use “batch = yes” in cdr.conf. That causes CDRs to be queued and written out in a separate background thread. I think the original intention was to batch updates to optimize database updates but a useful side-effect is that channel handling is no longer at the mercy of whatever the CDR backends are doing.
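The effect of batching can be sketched as a queue with a background writer thread. This is not the cdr.c implementation; demo_batch, demo_post and demo_run_batch are hypothetical names, and the "backend write" is reduced to a counter. The point is only that the posting side returns as soon as the record is queued.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct demo_cdr { int id; struct demo_cdr *next; };

struct demo_batch {
    pthread_mutex_t lock;
    pthread_cond_t wake;
    struct demo_cdr *head, *tail;   /* pending records, oldest first */
    bool stopping;
    int written;                    /* touched only by the writer thread */
    pthread_t writer;
};

/* Background thread: take the whole pending list, write it without the lock. */
static void *demo_writer(void *arg)
{
    struct demo_batch *b = arg;

    for (;;) {
        struct demo_cdr *list, *next;
        bool stop;

        pthread_mutex_lock(&b->lock);
        while (!b->head && !b->stopping)
            pthread_cond_wait(&b->wake, &b->lock);
        list = b->head;
        b->head = b->tail = NULL;
        stop = b->stopping;
        pthread_mutex_unlock(&b->lock);

        for (; list; list = next) {  /* a slow backend write would go here */
            next = list->next;
            b->written++;
            free(list);
        }
        if (stop)
            return NULL;
    }
}

/* Channel side: queue the record and return at once. */
int demo_post(struct demo_batch *b, int id)
{
    struct demo_cdr *cdr = malloc(sizeof(*cdr));

    if (!cdr)
        return -1;
    cdr->id = id;
    cdr->next = NULL;
    pthread_mutex_lock(&b->lock);
    if (b->tail)
        b->tail->next = cdr;
    else
        b->head = cdr;
    b->tail = cdr;
    pthread_cond_signal(&b->wake);
    pthread_mutex_unlock(&b->lock);
    return 0;
}

/* Post `count` records, stop the writer, and report how many were written. */
int demo_run_batch(int count)
{
    struct demo_batch b = { .stopping = false };
    int i, rc;

    pthread_mutex_init(&b.lock, NULL);
    pthread_cond_init(&b.wake, NULL);
    rc = pthread_create(&b.writer, NULL, demo_writer, &b);
    assert(rc == 0);
    (void)rc;
    for (i = 0; i < count; i++)
        if (demo_post(&b, i) != 0)
            break;
    pthread_mutex_lock(&b.lock);
    b.stopping = true;               /* writer drains the queue, then exits */
    pthread_cond_signal(&b.wake);
    pthread_mutex_unlock(&b.lock);
    pthread_join(b.writer, NULL);
    pthread_mutex_destroy(&b.lock);
    pthread_cond_destroy(&b.wake);
    return b.written;
}
```

demo_run_batch(50) posts 50 records and returns the number the writer thread handled before shutdown; each demo_post holds the lock only long enough to link the record in, so the caller never waits for the backend.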

Modified by Jul 09 2008 * 00:44

Thanks! It is working with batch = yes.

I have also found another solution, which was to do the:

cw_cdr_detach(chan->cdr)

just a little bit later in cw_hangup, after:

if (chan->tech->hangup)
       res = chan->tech->hangup(chan);

That was also working for me.
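As a sketch of that reordering (with stand-in types only; the real cw_hangup in channel.c does much more, and demo_channel, demo_tech and demo_cdr_detach are made-up names), the technology hangup runs first and the CDR is detached afterwards:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_channel;
struct demo_tech { int (*hangup)(struct demo_channel *chan); };
struct demo_cdr { int posted; };
struct demo_channel {
    const struct demo_tech *tech;
    struct demo_cdr *cdr;
    char trace[32];                 /* records the order of the two steps */
};

/* Stand-in for the channel driver's hangup (e.g. clearing p->owner). */
static int demo_tech_hangup(struct demo_channel *chan)
{
    strcat(chan->trace, "T");
    return 0;
}

/* Stand-in for detaching and posting the CDR (the slow part on flash). */
static void demo_cdr_detach(struct demo_channel *chan)
{
    if (chan->cdr) {
        chan->cdr->posted = 1;
        chan->cdr = NULL;
        strcat(chan->trace, "C");
    }
}

/* Reordered hangup: driver hangup first, CDR detach afterwards. */
int demo_hangup(struct demo_channel *chan)
{
    int res = 0;

    assert(chan != NULL && chan->tech != NULL);
    if (chan->tech->hangup)
        res = chan->tech->hangup(chan);
    demo_cdr_detach(chan);
    return res;
}

/* Run one hangup and return the recorded order of steps. */
const char *demo_hangup_order(void)
{
    static struct demo_cdr cdr;
    static const struct demo_tech tech = { demo_tech_hangup };
    static struct demo_channel chan;

    chan.tech = &tech;
    chan.cdr = &cdr;
    chan.trace[0] = '\0';
    demo_hangup(&chan);
    return chan.trace;
}
```

demo_hangup_order() returns "TC": the driver has already let go of the channel by the time the slow CDR write starts, which is presumably why the retry loop in sipsock_read no longer runs out.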

Modified by Jun 05 2010 * 15:28
  • Status: changed from Open to Fixed