Tuesday, October 26, 2010

Breaking RFC compliance to improve monitoring

A colleague came to me today with a troubling issue. He's using sipsak and Nagios to monitor some SIP endpoints. Pretty standard so far, right? He noticed that when using UDP and checking on an endpoint that was completely offline, sipsak would take over 30 seconds to finally return with an error. Meanwhile Nagios would block and wait for sipsak to return...

Without a simple command line option in sipsak that appeared to change this behavior, we had to enter the semi-complicated world of SIP timers. I feared that to change this behavior we'd have to do some things that might not necessarily be RFC compliant...

What's this? For once I'm actually suggesting you do something against the better advice of an RFC?

That's right, I am.

RFC3261 defines multiple timers and timeouts for messages and transactions. It says things like:

"If there is no final response for the original request in 64*T1 seconds"

"The UAC core considers the INVITE transaction completed 64*T1 seconds after the reception of the first 2xx response."

"The 2xx response is passed to the transport with an interval that starts at T1 seconds and doubles for each retransmission until it reaches T2 seconds"

Without even knowing what "T1" is you can start to see that it's a pretty important timing parameter and (more or less) serves as the father of all timeouts in SIP. Let's look at section 17 to find out what T1 is:

"The default value for T1 is 500 ms. T1 is an estimate of the RTT between the client and server transactions. Elements MAY (though it is NOT RECOMMENDED) use smaller values of T1 within closed, private networks that do not permit general Internet connection. T1 MAY be chosen larger, and this is RECOMMENDED if it is known in advance (such as on high latency access links) that the RTT is larger. Whatever the value of T1, the exponential backoffs on retransmissions described in this section MUST be used."

T1 is essentially a variable for RTT between two endpoints that serves as a multiplier for other timeouts. Unless we know better T1 should default to 500ms, which is quite high. Some implementations (such as Asterisk with the SIP peer qualify option) automatically send OPTIONS requests to endpoints in an effort to better determine RTT instead of using the RFC default of 500ms.

In reading through the sipsak source code it appeared to be RFC compliant for timing, using a default T1 value of 500ms and a transaction timeout value of 64*T1. This is why it was taking over 30 seconds (32 seconds to be exact) for sipsak to finally time out and return the status code to Nagios. This comes directly from the RFC:

"For any transport, the client transaction MUST start timer B with a value of 64*T1 seconds (Timer B controls transaction timeouts)."

This is all well and good but what happens when you don't have a way to dynamically determine T1 and you can't wait T1*64 (32s) for your results like my sipsak/nagios check earlier? Simple: you go renegade, throw out the RFC, and hack the sipsak source yourself!
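For reference, the 32 second figure falls out of the RFC timers directly. The Python sketch below is mine, not taken from sipsak: it computes the 64*T1 transaction timeout and the UDP retransmission schedule RFC 3261 describes for a non-INVITE request such as OPTIONS, where the interval starts at T1, doubles each time, and is capped at T2. It assumes the RFC defaults of T1 = 500 ms and T2 = 4 s, and the function names are made up for illustration.

```python
def transaction_timeout_ms(t1_ms, multiplier=64):
    """Transaction timeout: the client gives up after multiplier * T1."""
    return multiplier * t1_ms


def retransmit_times_ms(t1_ms, t2_ms, timeout_ms):
    """Elapsed times (ms) at which a non-INVITE request is resent over UDP.

    The interval starts at T1, doubles after every retransmission, and is
    capped at T2. Retransmissions stop once the transaction timeout fires.
    """
    times = []
    interval = t1_ms
    elapsed = interval
    while elapsed < timeout_ms:
        times.append(elapsed)
        interval = min(2 * interval, t2_ms)
        elapsed += interval
    return times


if __name__ == "__main__":
    for t1 in (500, 150):
        timeout = transaction_timeout_ms(t1)
        resends = retransmit_times_ms(t1, t2_ms=4000, timeout_ms=timeout)
        print(f"T1={t1} ms: timeout after {timeout} ms, "
              f"{len(resends)} retransmissions, last at {resends[-1]} ms")
```

With the defaults a dead endpoint costs the full 32 seconds. Dropping T1 to 150 ms while keeping the 64x multiplier would still mean a 9.6 second wait, which is why T1 alone isn't the whole story.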

So I had three options:

1) Change the default value of T1.
2) Change the transaction timeout (64*T1, Timer B in the RFC) by changing the multiplier or setting a static timeout.
3) Some combination of both.

I decided to go with option #3 (RFC be damned). Why?

1) 500ms is crazy high for most of our endpoints. At a glance 100ms would be fine for ~90% of them. I'll pick 150ms.
2) I don't need that many retransmits. If the latency and/or packet loss is that bad I'm not going to wait (my RTP certainly isn't) and I just want to know about it that much quicker.

So I ended up with a quick easy patch to sipsak:

diff -urN sipsak-0.9.6.orig/sipsak.h sipsak-0.9.6/sipsak.h
--- sipsak-0.9.6.orig/sipsak.h 2006-01-28 16:11:50.000000000 -0500
+++ sipsak-0.9.6/sipsak.h 2010-10-26 18:38:45.000000000 -0400
@@ -102,11 +102,7 @@
# define FQDN_SIZE 100
#endif

-#ifdef HAVE_CONFIG_H
-# define SIP_T1 DEFAULT_TIMEOUT
-#else
-# define SIP_T1 500
-#endif
+#define SIP_T1 150

#define SIP_T2 8*SIP_T1

diff -urN sipsak-0.9.6.orig/transport.c sipsak-0.9.6/transport.c
--- sipsak-0.9.6.orig/transport.c 2006-01-28 16:11:34.000000000 -0500
+++ sipsak-0.9.6/transport.c 2010-10-26 18:38:51.000000000 -0400
@@ -286,7 +286,7 @@
}
}
senddiff = deltaT(&(srt->starttime), &(srt->recvtime));
- if (senddiff > (float)64 * (float)SIP_T1) {
+ if (senddiff > inv_final) {
if (timing == 0) {
if (verbose>0)
printf("*** giving up, no final response after %.3f ms\n", senddiff);

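This changes the value of T1 to 150ms (more reasonable for most networks) and lets you set the total timeout with -D on the sipsak command line. Judging by the output below, the value given to -D acts as a multiple of T1: sipsak gives up at the first retransmission timeout once that much time has elapsed.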

kkmac:sipsak-0.9.6-build kris$ ./sipsak -p 10.16.0.3 -s sip:ext_callqual@asterisk -D1 -v
** timeout after 150 ms**
*** giving up, no final response after 150.334 ms

kkmac:sipsak-0.9.6-build kris$ ./sipsak -p 10.16.0.3 -s sip:ext_callqual@asterisk -D2 -v
** timeout after 150 ms**
** timeout after 300 ms**
*** giving up, no final response after 460.612 ms

kkmac:sipsak-0.9.6-build kris$ ./sipsak -p 10.16.0.3 -s sip:ext_callqual@asterisk -D4 -v
** timeout after 150 ms**
** timeout after 300 ms**
** timeout after 600 ms**
*** giving up, no final response after 1071.137 ms

kkmac:sipsak-0.9.6-build kris$
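
The numbers above fit a simple model: after each retransmission timer expires, the elapsed time is compared against the -D value multiplied by T1, and sipsak gives up once that threshold has been passed. The Python sketch below is my reading of the output rather than a copy of the sipsak logic. It assumes the interval doubles from T1 and is capped at SIP_T2 (8*SIP_T1 in sipsak.h), and it ignores scheduling jitter, which is why the real runs come in a few milliseconds higher. The function name is hypothetical.

```python
SIP_T1_MS = 150            # patched value from sipsak.h
SIP_T2_MS = 8 * SIP_T1_MS  # retransmit interval cap, as defined in sipsak.h


def give_up_after_ms(factor, t1_ms=SIP_T1_MS, t2_ms=SIP_T2_MS):
    """Return (per-attempt timeouts, total elapsed ms) for a dead endpoint.

    `factor` plays the role of the -D argument: the final timeout is
    assumed to be factor * T1.
    """
    final_timeout = factor * t1_ms
    timeouts = []
    interval = t1_ms
    elapsed = 0
    while True:
        elapsed += interval
        timeouts.append(interval)
        # Real elapsed time always lands slightly above the nominal value, so
        # the strict ">" in the C source acts like ">=" in this idealised model.
        if elapsed >= final_timeout:
            return timeouts, elapsed
        interval = min(2 * interval, t2_ms)


if __name__ == "__main__":
    for factor in (1, 2, 4):
        timeouts, total = give_up_after_ms(factor)
        print(f"-D{factor}: timeouts {timeouts} ms, gives up after ~{total} ms")
```

For -D1, -D2, and -D4 the model yields roughly 150, 450, and 1050 ms, in line with the 150.334, 460.612, and 1071.137 ms measured above.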

Needless to say our monitoring situation is much improved.

Thursday, August 5, 2010

A ClueCon Update

ClueCon is going very well this year... I spoke the first day and have spent the rest of my time here enjoying the presentations and interacting with the community.

A few highlights:

  • Perfect wireless provided by Meraki. I've never been to a tech conference where the wifi has kept up with the crowd. Well done.
  • The Trump Tower. Phenomenal.
  • FreeSWITCH HA support in Sofia! This is worthy of its own post and it will have one when I get back and play with it. In the meantime my guy Jay Binks has been working to document this exciting new feature.
  • Chicago. I just LOVE this town.
More later... I've got to get back to the conference!

Friday, May 21, 2010

A ClueCon Preview...

A while back I saw a preview for the new A-Team movie. While the movie itself looks horrible I was reminded of the original TV series with its many interesting characters and catch phrases. Among my personal favorites?

I love it when a plan comes together.

That's exactly how I feel with one of my "pet projects" from the past couple of months. Much like Hannibal and the A-Team I was up against formidable issues in trying to accomplish my task: implementing a flexible (very flexible), reasonably high performance LCR server that could be added to my existing architecture.

First I needed to select an LCR "engine". Multiple possibilities were considered but I left the final recommendation up to the DB and billing teams I work with. They selected mod_lcr from FreeSWITCH. While I was certain droute from OpenSIPS (or something similar) would have higher performance I accepted their recommendation. After playing with mod_lcr a bit I can also see its potential.

So now the question was: can FreeSWITCH respond with the proper SIP signaling (300 Multiple Choices)? Using the redirect application from mod_dptools it could not. I created a bounty to add multiple Contact/300 Multiple Choices functionality to FreeSWITCH. Tony had it implemented that day.

With the ability to respond properly I now had to get the data. Mod_lcr looked nice but it certainly wasn't designed for this application. All of the default syntax, tables, etc showed it being used with FreeSWITCH for FreeSWITCH. The tables and code used several bridge specific syntax examples. I hacked mod_lcr to return data to mod_dptools/redirect properly. I created a JIRA issue with my patch and a couple of days later Rupa had it committed.

So now FreeSWITCH could be a route server. All I needed to do was make sure OpenSIPS could route from what FreeSWITCH returned. Turns out it could not. RFC 3261 (section 21.3.1) states "...the SIP response MAY contain several Contact fields or a list of addresses in a Contact field." The Sofia stack from FreeSWITCH used multiple Contact headers, each with its own URI. OpenSIPS would only parse the first one returned. Sofia couldn't be changed easily so OpenSIPS would need to be changed (it was non-compliant anyway). Without this change there is no ability to handle multiple contacts and only the first would be used. It could be worse but obviously this wasn't good enough.
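
To illustrate the two layouts the RFC allows, here is a rough Python sketch that collects contacts from either form and orders them by q value. It is not the OpenSIPS parser. The function name and the example URIs are hypothetical, treating a missing q as 1.0 is an assumption of the sketch, and splitting on commas is a deliberate simplification that would break on a quoted display name containing a comma.

```python
import re

# Hypothetical responses: the same two routes, laid out both ways.
MULTI_HEADER = (
    "SIP/2.0 300 Multiple Choices\r\n"
    "Contact: <sip:15551234567@gw1.example.com>;q=0.5\r\n"
    "Contact: <sip:15551234567@gw2.example.com>;q=0.9\r\n"
    "\r\n"
)
SINGLE_HEADER = (
    "SIP/2.0 300 Multiple Choices\r\n"
    "Contact: <sip:15551234567@gw1.example.com>;q=0.5, "
    "<sip:15551234567@gw2.example.com>;q=0.9\r\n"
    "\r\n"
)


def parse_contacts(response_text):
    """Return (uri, q) pairs from every Contact header, highest q first."""
    contacts = []
    for line in response_text.split("\r\n"):
        name, sep, value = line.partition(":")
        # "m" is the compact form of the Contact header name.
        if not sep or name.strip().lower() not in ("contact", "m"):
            continue
        # One header may carry several comma-separated addresses.
        for entry in value.split(","):
            uri_match = re.search(r"<([^>]+)>", entry)
            if not uri_match:
                continue
            q_match = re.search(r";\s*q=([0-9.]+)", entry[uri_match.end():])
            q = float(q_match.group(1)) if q_match else 1.0  # assumed default
            contacts.append((uri_match.group(1), q))
    # sorted() is stable, so equal q values keep their original order.
    return sorted(contacts, key=lambda contact: contact[1], reverse=True)


if __name__ == "__main__":
    for label, text in (("multiple headers", MULTI_HEADER),
                        ("single header", SINGLE_HEADER)):
        print(label, parse_contacts(text))
```

A parser that stops at the first Contact header would see only gw1 in the first layout, which is the behavior I was running into.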

I contacted Bogdan from OpenSIPS to see what it would take to update the parser to handle multiple Contact headers. He indicated it would take four hours or so. Once he got back to me I had an OpenSIPS system that would handle multiple contact headers and create new branches from a failure route as desired.

So how did it all turn out? Well, you have two ways to hear the end of this story:

1) Attend ClueCon at the Trump Hotel in Chicago, IL in early August.
2) Wait until mid-August for an update here.

I'll make sure to post all of my materials - conference presentation, sipp scenarios for testing, OpenSIPS configuration, FreeSWITCH configuration, DB tweaks, etc.

Too late to make it to ClueCon this year? Just make sure to register next year, I'm sure I'll be there.

Wednesday, May 19, 2010

I've said it before but I'll say it again...

FreeSWITCH rocks!

Earlier today I wanted to play with the possibility of using FreeSWITCH as a route/LCR server for another platform. FreeSWITCH already has mod_lcr and redirect. Using these two features FreeSWITCH could be made to respond with a 302 and a single SIP URI in the Contact field.

I wanted more. I wanted a way to respond with multiple routes.

The standard way to do this (using SIP, of course) is to respond to incoming INVITEs with a 300 Multiple Choices. This response should contain a Contact header (or multiple Contact headers) with a list of SIP URIs (along with optional q values, etc) for the original system to route the call to.
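
As a rough illustration, the Python sketch below formats the status line and Contact headers of such a response from a list of routes, one Contact header per route. The gateway hostnames and the function name are made up, and the headers a real response copies from the request (Via, From, To, Call-ID, CSeq) are left out for brevity.

```python
def multiple_choices_headers(routes):
    """Format the status line plus one Contact header per (uri, q) route."""
    lines = ["SIP/2.0 300 Multiple Choices"]
    for uri, q in routes:
        # q values range from 0 to 1; a higher q means a more preferred route.
        lines.append(f"Contact: <{uri}>;q={q:.1f}")
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    routes = [
        ("sip:15551234567@gw1.example.com", 0.9),
        ("sip:15551234567@gw2.example.com", 0.5),
    ]
    print(multiple_choices_headers(routes))
```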

As usual I wrote the FreeSWITCH-Users mailing list to make sure this functionality didn't already exist somewhere. It did not and it was suggested I create a bounty.

Creating a bounty is always tough... I don't deal with the source code of FreeSWITCH all that often. I don't know how much work this is going to take. I don't know how much C programmers make. So I did my best to come up with something that seemed fair: $250.

Less than two hours later the feature was coded, committed to FreeSWITCH, tested by me, and paid for.

Once again, Open Source for the win!

Wednesday, May 12, 2010

Another SIP gotcha: Cisco

Another quick and dirty SIP interop post.

A while back I was tasked to interface a FreeSWITCH server and a Cisco Unified Communications Manager system. Once the SIP trunk was configured on the Call Manager/CUCM side they sent an INVITE over. It didn't have an SDP.

It appeared that we needed to enable 3pcc (third party call control) in FreeSWITCH. No problem. I enabled 3pcc and interop continued.

Problems arose, however, when we needed to send the Cisco ringback. Whether it be a 180 or 183 (with or without SDP for either) this was going to be tough because with 3pcc enabled the dialog looked like so:

<-- Cisco
--> FreeSWITCH
INVITE (without SDP) <--
100 Trying -->
200 OK (with SDP) -->
ACK (with SDP) <--

So... There was no opportunity to signal progress as long as we 200 OKd the call almost immediately. Sure I probably could generate some ringback after the 200 but that would just be wrong!

As I like to say, the internet to the rescue. Not having much experience with CUCM I thought I'd ask on VoiceOps. Within a few minutes a very nice gentleman by the name of Mark Holloway mentioned "Media Termination Point Required" as a CUCM configuration option. These were the magic words. After some research it turned out that was the configuration option I needed*. Thanks Mark!

Once "Media Termination Point Required" was enabled on the Cisco side I disabled 3pcc in FreeSWITCH and all was good. Users even get ringback now!

I also brought the issue up on the FreeSWITCH-Users mailing list and found out this has been bothering people for some time. MC from FreeSWITCH was even nice enough to start a wiki page for me to document all of this there.

Sometimes with SIP it's all about the SIMPLE achievements ;).

* That research also brought up another possibility: enabling PRACK/100rel on the CallManager side instead of "MTP Required". Of course the trouble with PRACK is there are a lot of SIP implementations (Asterisk) that don't support it. FreeSWITCH does but can crash. Many SIP implementations don't support the default CUCM configuration (INVITE w/o SDP). I was looking for the most canonical, compatible configuration possible.

Friday, February 5, 2010

(High Quality) VoIP on the iPhone

(Regular readers will note that my excessive use of parentheses has now spilled into my titles and first sentences)!

Ahhh Apple... Ahh the iPhone. Regardless of how you feel about this company or their product you can't doubt the market impact they've made over the last couple of years (decades perhaps?). Multitouch (NOT multitasking). App Store. iTunes. There are countless other blogs that discuss these topics so I don't need to. As usual I'm here to talk about VoIP.

For the last ten months or so I've been involved (part time) in another local venture. Voalte (pronounced volt) is a startup here in Sarasota, FL founded by Trey Lauderdale. When the Apple iPhone was announced Trey was working in sales for Emergin, a healthcare IT middleware provider. Trey noticed how incredibly arcane the mobile devices used in healthcare are when compared to this new device from Apple. Once the iPhone SDK was announced Trey knew he had to develop an iPhone application for healthcare. This application became Voalte One.

Voalte One is an iPhone application that provides voice, alarms, and text for healthcare point of care providers (that's nurses to you and me). The complete Voalte One solution consists of the following parts:

- iPhone
- Voalte Server (XMPP, LDAP, etc)
- Voalte Voice Server (FreeSWITCH using SIP + event socket)
- An overall excellent customer/user experience (also new to healthcare)

Text messaging and alarm integration are cool but as I've already said, I do VoIP. If you'd like to know more about iPhone development, XMPP, LDAP, etc let me know and I can point you in the right direction.

VoIP in our application is interesting. It's a softphone, technically, but unlike any you've ever seen before. As everyone knows the iPhone cannot run multiple applications. It can't background applications. These are just two of the many challenges introduced when developing a user-friendly, always available, reliable non-GSM phone experience for the iPhone. Simply downloading an off the shelf softphone and installing it on the iPhone is not enough.

We're a startup and we get to do cool things. For example, one of the big differentiators for VoIP/voice with Voalte One on the iPhone is the voice quality. We use G.722 wideband at 16kHz as our standard voice codec. Why? Because one Saturday (after a long night out) Trey and I were having lunch. I asked him if he thought we should set ourselves apart on something as basic as sample rate. After a little explanation on my part we quickly decided - why not?

As cool as G.722 is it introduces some interesting challenges:

- The iPhone. How are we going to get 16kHz audio from the hardware?
- PJSIP (our SIP stack). Does it support G.722? How does it interface with the audio hardware?
- Hospital PBXs. Voalte One interfaces with the hospital PBX as an ordinary extension. Most of them probably don't support G.722. How/where do we resample to the standard 8kHz used in G.711?

After looking through PJSIP and the available audio drivers for the iPhone we decided we needed to write our own. There were legal and technical reasons and I'm glad we did it. Especially because I didn't have to do most of the work! ;) We also confirmed PJSIP supports G.722.

Voalte has an amazing iPhone developer - Robbie Hanson. Robbie, Ben (Voalte CTO), and I were able to look over the available audio frameworks on the iPhone and pick the best. Not only is it the best overall (it supports echo cancellation, etc), it would also provide us the sampling rate of 16kHz we knew we needed.

After working with PJSIP and AudioUnit for a while Robbie was able to write an iPhone audio driver (using AudioUnit, of course) for PJSIP. While working on the audio driver Robbie (along with another contributor) also wrote an Objective-C wrapper for PJSIP. These are the raw ingredients of a high quality VoIP experience on the iPhone.

In the months leading up to release we had to deal with a plethora of other issues: push notifications, local ringback, wifi, etc, etc. I won't (and probably can't) describe these issues in detail.

The good news is Voalte has done the right thing and released the core components of this solution as open source.

I'm proud to work with companies that "get it" and are willing to actively participate in the free software ecosystem.

Tuesday, February 2, 2010

Upcoming Review: Building Telephony Systems with OpenSIPS 1.6

Packt Publishing has once again asked me to review their latest work in the OpenSIPS series: Building Telephony Systems with OpenSIPS 1.6.

My review of the previous edition goes all the way back to when OpenSIPS was called OpenSER. I have another post discussing that topic...

Anyways, I should be receiving the book this week and I should have a review up by next week.

Wednesday, January 20, 2010

AstLinux 0.7 released and more!

The AstLinux team (of which I'm an occasional member) has released AstLinux 0.7. Darrick, Philip, Lonnie, and the rest of the community have done a great job getting this release out there. I couldn't be happier with how my little project has grown up!

In addition to getting this release out, they've also taken the time to focus on documentation and a new website.

Well done guys!

Testing with SIPP

A quick one, I promise...

I'd been having some issues testing Asterisk with sipp. It turns out there is a fairly well known issue with sipp when using five digit port numbers for RTP. A quick Google search found a solution pretty quickly.

Just in case that link ever goes dead, here's the diff:

diff -urb sipp.svn_orig/call.cpp sipp.svn_fixed/call.cpp
--- sipp.svn_orig/call.cpp 2008-12-19 13:14:51.000000000 +0300
+++ sipp.svn_fixed/call.cpp 2008-12-19 13:16:34.000000000 +0300
@@ -192,7 +192,7 @@
/* m=audio not found */
return 0;
}
- begin += strlen(pattern) - 1;
+ begin += strlen(pattern);
end = strstr(begin, "\r\n");
if (!end)
ERROR("get_remote_port_media: no CRLF found");
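
Why would one byte matter only for five-digit ports? My understanding, which I haven't verified against the sipp source, is that the port text gets copied into a small fixed-size buffer before being converted to a number. The Python sketch below mimics that under the assumption of a 6-byte buffer (five characters plus the terminating NUL). The helper names and the SDP lines are invented for illustration.

```python
import re

PATTERN = "m=audio "
BUF_CHARS = 5  # assumption: a 6-byte C buffer holds 5 characters plus the NUL


def atoi(text):
    """Mimic C atoi: skip leading whitespace, read leading digits, else 0."""
    match = re.match(r"\s*([+-]?\d+)", text)
    return int(match.group(1)) if match else 0


def remote_media_port(sdp, off_by_one):
    """Extract the audio port, optionally reproducing the `- 1` offset."""
    start = sdp.find(PATTERN)
    if start < 0:
        return 0  # m=audio not found
    start += len(PATTERN) - (1 if off_by_one else 0)
    # With the `- 1`, the copy begins on the space before the port, so a
    # five-digit port loses its last digit to the buffer limit.
    return atoi(sdp[start:start + BUF_CHARS])


if __name__ == "__main__":
    for port in (8000, 12345):
        sdp = f"v=0\r\nm=audio {port} RTP/AVP 0\r\n"
        print(port,
              "before patch:", remote_media_port(sdp, off_by_one=True),
              "after patch:", remote_media_port(sdp, off_by_one=False))
```

Under that assumption a four-digit port survives either way, while 12345 comes back as 1234 until the offset is fixed.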


More on sipp later!