Wednesday, November 2, 2011

Building a Startup

(Continued from Starting a Startup)
After several days of meetings in Sarasota we determined:

1)  I was moving to Sarasota to start a company with Norm and Joe.
2)  We were going to utilize open source software wherever possible (including AstLinux, obviously).
3)  The Internet was the only ubiquitous, high quality network on which to build a nationwide platform.
4)  The Internet would only get more ubiquitous, more reliable, and faster in the coming months, years, and decades.
5)  We were going to take advantage of as much of this as possible.

These were some pretty lofty goals.  Remember, this is early 2006.  Gmail was still invitation-only beta.  Google Docs didn’t exist.  Amazon EC2 didn’t exist.  “Cloud computing” hadn’t come back into fashion yet.  The term itself didn’t exist.  The Internet was considered (by many) to be “best effort”, “inherently unreliable”, and “unsuitable” for critical communications (such as real time business telephony).  There were many naysayers who were confident this would be a miserable failure.  As it turns out, they were almost right.

We thought the “secret sauce” to business grade voice over the internet was monitoring and management.  If one could monitor and manage the internet connection, business grade voice should be possible.  Of course this is very ambiguous but it led to several great hires.  We hired

Joe had already deployed several embedded Asterisk systems to various businesses in the Sarasota area.  They used an embedded version of Linux he patched together and a third party (unnamed) “carrier” to connect to the PSTN.  The first step was upgrading these machines and getting them on AstLinux.  Once this was accomplished we felt confident enough to proceed with our plan.  This was Star2Star Communications and in the beginning of 2006 it looked something like this:

1)  Soekris net4801 machines running AstLinux on the customer premises.
2)  Grandstream GXP-2000 phones at each desk.
3)  Connectivity to a third party “ITSP”.
4)  Management/monitoring systems (check IP connectivity, phone availability, ITSP reliability, local LAN, etc).
5)  Central provisioning of AstLinux systems, phones, etc.

This was Star2Star and there was something I really liked about it - it was simple.  Anyone who knows me or knows of my projects (AstLinux, for example) has to know I favor simplicity whenever possible.  Keep it simple, keep it simple, keep it simple (stupid).

As time went on we started to learn that maybe this was too simple.  We didn’t have enough control.  Our monitoring wasn’t as mature as it should have been.  We didn’t pick the right IP phones.  These could be easily fixed.  However, we soon realized our biggest mistake was architecture (or lack thereof).  This wasn’t going to be an easy fix.

We couldn’t find an ITSP that offered a level of quality we considered to be acceptable.  Very few ITSPs had any more experience with VoIP, SIP, and the internet than we did.  More disturbing, however, was an almost complete lack of focus on quality and reliability.  No process.

What we (quickly) discovered is the extremely low barrier to entry for ITSPs, especially back then.  Virtually anyone could install Asterisk on a $100/mo box in a colo somewhere, buy dialtone from someone (who knows) and call themselves an ITSP.  After going through several of these we discovered we needed to do it ourselves.

Even assuming we could solve the PSTN connectivity problem we discovered yet another issue.  All of the monitoring and management in the world cannot make up for a terrible last mile.  If the copper in the ground is rotting and the DSL modem can only negotiate 128kbps/128kbps that’s all you’re going to get.  To make matters worse, in the event of a cut or outage the customer would be down completely.  While that may have always happened with the PSTN and an on-premises PBX, we considered this to be unacceptable.

So then, in the eleventh hour, just before launch I met with the original founders and posed a radical idea - scrap almost everything.  There was a better way.
(Continued in Building a Startup (the right way))

Tuesday, October 25, 2011

Starting a Startup


I know I’ve apologized for being quiet in the past.  This is not one of those times because (as you’ll soon find out) I’ve been hard at work and only now can I finally talk about it.

Six years ago I was spending most of my time working with Asterisk and AstLinux.  I spent a lot of time promoting both - working the conference circuit, blogging, magazines, books, etc.  Conferences are a great way to network and meet new people.  I did just that.  With each conference I attended came new business opportunities.  Sure, not all of them were a slam dunk and eventually I started to pick and choose which conferences I considered worthy of the time and investment.

For anyone involved with Asterisk, Astricon is certainly worthy of your time and energy - the mecca of the Asterisk community.  Astricon was always a whirlwind and 2005 was no exception.  We were in Anaheim, California and embedded Asterisk was starting to really heat up.  I announced my port of AstLinux to Gumstix as the “World’s Smallest PBX”, leading to an interview and story in LinuxDevices.  I worked a free community booth (thanks Astricon) with Dave Taht and was introduced to Captain Crunch (that’s another post for another day).

It was at Astricon in 2005 that I also met one of my soon to be business partners (although I certainly didn’t know it at the time).  While I was promoting embedded Asterisk and AstLinux I met a man from Florida named Joe Rhem.  Joe had come up with the idea of using embedded Asterisk systems as the cornerstone of a new way to provide business grade telephone services.  Joe and I met for a few minutes and discussed the merits of embedded Asterisk.  Unfortunately (and everyone already knows this) I don’t remember meeting with Joe.  Like I said Astricon was always a whirlwind and I had these conversations with dozens if not hundreds of people at each show.  I made my way through Astricon, made a pit stop in Santa Clara for (the now defunct) ISPCon and then returned home to Lake Geneva, WI with a stack of business cards, a few new stories, and a lot of work to finish (or start, depending on your perspective).

A couple of months later I received an e-mail from Joe Rhem discussing how he’d like to move forward with what we discussed in Anaheim.  Joe had recruited another partner to lead the new venture.  Norm Worthington was a successful serial entrepreneur and his offer to lead the company was the equivalent of “having General Patton lead your war effort”.  After some catch up I was intrigued with Joe’s idea.  A few hours on the phone later everyone was pretty comfortable with how this could work.

Now I just needed to fly to Sarasota, FL (where’s that - sounds nice, I thought) to meet with everyone, discuss terms, plan a relocation, and (most importantly) start putting the company, product, and technology together.

A short time later I found myself arriving in Sarasota.  It was early January and, coming from Wisconsin, I couldn’t believe how nice it was.  Looking back on it I’m sure Norm and Joe were very confident I’d be joining them in Sarasota.  Working with technology I love “in paradise”, how could I resist?

(Continued in Building a Startup)

Tuesday, October 26, 2010

Breaking RFC compliance to improve monitoring

A colleague came to me today with a troubling issue. He's using sipsak and Nagios to monitor some SIP endpoints. Pretty standard so far, right? He noticed that when using UDP and checking on an endpoint that was completely offline, sipsak would take over 30 seconds to finally return with an error. Meanwhile Nagios would block and wait for sipsak to return...

Without a simple command line option in sipsak that appeared to change this behavior, we had to enter the semi-complicated world of SIP timers. I feared that to change this behavior we'd have to do some things that might not necessarily be RFC compliant...

What's this? For once I'm actually suggesting you do something against the better advice of an RFC?

That's right, I am.

RFC 3261 defines multiple timers and timeouts for messages and transactions. It says things like:

"If there is no final response for the original request in 64*T1 seconds"

"The UAC core considers the INVITE transaction completed 64*T1 seconds after the reception of the first 2xx response."

"The 2xx response is passed to the transport with an interval that starts at T1 seconds and doubles for each retransmission until it reaches T2 seconds"

Without even knowing what "T1" is you can start to see that it's a pretty important timing parameter and (more or less) serves as the father of all timeouts in SIP. Let's look at section 17 to find out what T1 is:

"The default value for T1 is 500 ms. T1 is an estimate of the RTT between the client and server transactions. Elements MAY (though it is NOT RECOMMENDED) use smaller values of T1 within closed, private networks that do not permit general Internet connection. T1 MAY be chosen larger, and this is RECOMMENDED if it is known in advance (such as on high latency access links) that the RTT is larger. Whatever the value of T1, the exponential backoffs on retransmissions described in this section MUST be used."

T1 is essentially a variable for RTT between two endpoints that serves as a multiplier for other timeouts. Unless we know better T1 should default to 500ms, which is quite high. Some implementations (such as Asterisk with the SIP peer qualify option) automatically send OPTIONS requests to endpoints in an effort to better determine RTT instead of using the RFC default of 500ms.

In reading through the sipsak source code it appeared to be RFC compliant for timing, using a default T1 value of 500ms and a transaction timeout value of 64*T1. This is why it was taking over 30 seconds (32 seconds to be exact) for sipsak to finally time out and return the status code to Nagios. This comes directly from the RFC:

"For any transport, the client transaction MUST start timer B with a value of 64*T1 seconds (Timer B controls transaction timeouts)."
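As a rough illustration of where the 32 seconds comes from, here is a minimal Python sketch of the arithmetic. It assumes the RFC default T1 of 500ms and models the INVITE-over-UDP case, where the retransmit interval starts at T1 and doubles each time until Timer B (64*T1) fires. The function names are made up for this example and are not taken from sipsak.

```python
def timer_b_ms(t1_ms=500):
    """Transaction timeout (Timer B) in milliseconds: 64 * T1."""
    return 64 * t1_ms


def invite_send_times_ms(t1_ms=500):
    """Offsets (ms) at which an unanswered INVITE is sent over UDP.

    The first copy goes out at t=0, the retransmit interval starts at T1
    and doubles after every retransmission, and sending stops once the
    Timer B deadline is reached.
    """
    deadline = timer_b_ms(t1_ms)
    times = []
    t = 0
    interval = t1_ms
    while t < deadline:
        times.append(t)
        t += interval
        interval *= 2
    return times


if __name__ == "__main__":
    print(timer_b_ms())            # 64 * 500 ms
    print(invite_send_times_ms())  # send offsets with the default T1
```

With the default T1 the request goes out seven times before the transaction gives up at the 32 second mark, which is exactly the wait a blocking monitoring check has to sit through.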

This is all well and good but what happens when you don't have a way to dynamically determine T1 and you can't wait T1*64 (32s) for your results like my sipsak/nagios check earlier? Simple: you go renegade, throw out the RFC, and hack the sipsak source yourself!

So I had three options:

1) Change the default value of T1.
2) Change the transaction timeout (the 64*T1 value the RFC calls Timer B) by changing the multiplier or setting a static timeout.
3) Some combination of both.

I decided to go with option #3 (RFC be damned). Why?

1) 500ms is crazy high for most of our endpoints. At a glance 100ms would be fine for ~90% of them. I'll pick 150ms.
2) I don't need that many retransmits. If the latency and/or packet loss is that bad I'm not going to wait (my RTP certainly isn't) and I just want to know about it that much quicker.

So I ended up with a quick easy patch to sipsak:

diff -urN sipsak-0.9.6.orig/sipsak.h sipsak-0.9.6/sipsak.h
--- sipsak-0.9.6.orig/sipsak.h 2006-01-28 16:11:50.000000000 -0500
+++ sipsak-0.9.6/sipsak.h 2010-10-26 18:38:45.000000000 -0400
@@ -102,11 +102,7 @@
# define FQDN_SIZE 100
#endif

-#ifdef HAVE_CONFIG_H
-# define SIP_T1 DEFAULT_TIMEOUT
-#else
-# define SIP_T1 500
-#endif
+#define SIP_T1 150

#define SIP_T2 8*SIP_T1

diff -urN sipsak-0.9.6.orig/transport.c sipsak-0.9.6/transport.c
--- sipsak-0.9.6.orig/transport.c 2006-01-28 16:11:34.000000000 -0500
+++ sipsak-0.9.6/transport.c 2010-10-26 18:38:51.000000000 -0400
@@ -286,7 +286,7 @@
}
}
senddiff = deltaT(&(srt->starttime), &(srt->recvtime));
- if (senddiff > (float)64 * (float)SIP_T1) {
+ if (senddiff > inv_final) {
if (timing == 0) {
if (verbose>0)
printf("*** giving up, no final response after %.3f ms\n", senddiff);

This changes the value of T1 to 150ms (more reasonable for most networks) and allows you to specify the number of retransmits (and thus the total timeout) using -D on the sipsak command line:

kkmac:sipsak-0.9.6-build kris$ ./sipsak -p 10.16.0.3 -s sip:ext_callqual@asterisk -D1 -v
** timeout after 150 ms**
*** giving up, no final response after 150.334 ms

kkmac:sipsak-0.9.6-build kris$ ./sipsak -p 10.16.0.3 -s sip:ext_callqual@asterisk -D2 -v
** timeout after 150 ms**
** timeout after 300 ms**
*** giving up, no final response after 460.612 ms

kkmac:sipsak-0.9.6-build kris$ ./sipsak -p 10.16.0.3 -s sip:ext_callqual@asterisk -D4 -v
** timeout after 150 ms**
** timeout after 300 ms**
** timeout after 600 ms**
*** giving up, no final response after 1071.137 ms

kkmac:sipsak-0.9.6-build kris$

Needless to say our monitoring situation is much improved.
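
The timings in the sipsak output above can be reproduced with a small model. This is only a sketch of how the patched check appears to behave, not sipsak's actual code. It assumes the value passed to -D acts as a multiplier of T1 for the give-up threshold, that the retransmit interval starts at T1 (150ms) and doubles up to T2 (8*T1), and that the give-up check runs after each interval expires. The function name is hypothetical.

```python
def patched_give_up(factor, t1_ms=150, t2_multiplier=8):
    """Model the retransmit loop after the patch.

    Returns (intervals, elapsed): the per-attempt timeouts in ms and the
    total time waited before giving up. `factor` stands in for the number
    passed to -D (assumed to be a multiplier of T1).
    """
    t2_ms = t2_multiplier * t1_ms
    final_ms = factor * t1_ms
    interval = t1_ms
    elapsed = 0
    intervals = []
    while True:
        elapsed += interval
        intervals.append(interval)
        # The real check compares measured time, which always lands a
        # little past the nominal value, so >= is used here.
        if elapsed >= final_ms:
            return intervals, elapsed
        interval = min(interval * 2, t2_ms)


if __name__ == "__main__":
    for factor in (1, 2, 4):
        print(factor, patched_give_up(factor))
```

The nominal totals for -D1, -D2 and -D4 come out at 150ms, 450ms and 1050ms, which lines up with the measured 150.334, 460.612 and 1071.137 ms once scheduling overhead is added.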

Thursday, August 5, 2010

A ClueCon Update

ClueCon is going very well this year... I spoke the first day and have spent the rest of my time here enjoying the presentations and interacting with the community.

A few highlights:

  • Perfect wireless provided by Meraki. I've never been to a tech conference where the wifi has kept up with the crowd. Well done.
  • The Trump Tower. Phenomenal.
  • FreeSWITCH HA support in Sofia! This is worthy of its own post and it will have one when I get back and play with it. In the meantime my guy Jay Binks has been working to document this exciting new feature.
  • Chicago. I just LOVE this town.
More later... I've got to get back to the conference!

Friday, May 21, 2010

A ClueCon Preview...

A while back I saw a preview for the new A-Team movie. While the movie itself looks horrible I was reminded of the original TV series with its many interesting characters and catch phrases. Among my personal favorites?

I love it when a plan comes together.

That's exactly how I feel with one of my "pet projects" from the past couple of months. Much like Hannibal and the A-Team I was up against formidable issues in trying to accomplish my task: implementing a flexible (very flexible), reasonably high performance LCR server that could be added to my existing architecture.

First I needed to select an LCR "engine". Multiple possibilities were considered but I left the final recommendation up to the DB and billing teams I work with. They selected mod_lcr from FreeSWITCH. While I was certain droute from OpenSIPS (or something similar) would have higher performance I accepted their recommendation. After playing with mod_lcr a bit I can also see its potential.

So now the question was: can FreeSWITCH respond with the proper SIP signaling (300 Multiple Choices)? Using the redirect application from mod_dptools it could not. I created a bounty to add multiple Contact/300 Multiple Choices functionality to FreeSWITCH. Tony had it implemented that day.

With the ability to respond properly I now had to get the data. Mod_lcr looked nice but it certainly wasn't designed for this application. All of the default syntax, tables, etc showed it being used with FreeSWITCH for FreeSWITCH. The tables and code used several bridge specific syntax examples. I hacked mod_lcr to return data to mod_dptools/redirect properly. I created a JIRA issue with my patch and a couple of days later Rupa had it committed.

So now FreeSWITCH could be a route server. All I needed to do was make sure OpenSIPS could route from what FreeSWITCH returned. Turns out it could not. RFC 3261 (section 21.3.1) states "...the SIP response MAY contain several Contact fields or a list of addresses in a Contact field." The Sofia stack from FreeSWITCH used multiple Contact headers, each with its own URI. OpenSIPS would only parse the first one returned. Sofia couldn't be changed easily so OpenSIPS would need to be changed (it was non-compliant anyway). Without this change there is no ability to handle multiple contacts and only the first would be used. It could be worse but obviously this wasn't good enough.
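
To make the two forms from the RFC quote concrete, here is a minimal Python sketch (not the OpenSIPS parser) that collects contacts from a response whether they arrive as several Contact headers or as one comma-separated Contact header. The hostnames and function names are invented for the example, and the parsing is deliberately simplified: it only protects commas inside angle brackets and double quotes.

```python
def split_top_level_commas(value):
    """Split a header value on commas outside <...> and "..."."""
    parts, current = [], []
    in_angle = in_quote = False
    for ch in value:
        if ch == '"' and not in_angle:
            in_quote = not in_quote
        elif ch == "<" and not in_quote:
            in_angle = True
        elif ch == ">" and not in_quote:
            in_angle = False
        if ch == "," and not in_angle and not in_quote:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def collect_contacts(headers):
    """headers: list of (name, value) tuples in message order.

    Every Contact header is visited (including the compact form "m"),
    not just the first one.
    """
    contacts = []
    for name, value in headers:
        if name.lower() in ("contact", "m"):
            contacts.extend(split_top_level_commas(value))
    return contacts


if __name__ == "__main__":
    several_headers = [
        ("Contact", "<sip:1000@gw1.example.com>;q=0.9"),
        ("Contact", "<sip:1000@gw2.example.com>;q=0.5"),
    ]
    one_header = [
        ("Contact",
         "<sip:1000@gw1.example.com>;q=0.9, <sip:1000@gw2.example.com>;q=0.5"),
    ]
    print(collect_contacts(several_headers))
    print(collect_contacts(one_header))
```

Both inputs yield the same two contacts. A parser that stops after the first Contact header would see only gw1 in the first case, which is the behavior described above.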

I contacted Bogdan from OpenSIPS to see what it would take to update the parser to handle multiple Contact headers. He indicated it would take four hours or so. Once he got back to me I had an OpenSIPS system that would handle multiple contact headers and create new branches from a failure route as desired.

So how did it all turn out? Well, you have two ways to hear the end of this story:

1) Attend ClueCon at the Trump Hotel in Chicago, IL in early August.
2) Wait until mid-August for an update here.

I'll make sure to post all of my materials - conference presentation, sipp scenarios for testing, OpenSIPS configuration, FreeSWITCH configuration, DB tweaks, etc.

Too late to make it to ClueCon this year? Just make sure to register next year, I'm sure I'll be there.

Wednesday, May 19, 2010

I've said it before but I'll say it again...

FreeSWITCH rocks!

Earlier today I wanted to play with the possibility of using FreeSWITCH as a route/LCR server for another platform. FreeSWITCH already has mod_lcr and redirect. Using these two features FreeSWITCH could be made to respond with a 302 and a single SIP URI in the Contact field.

I wanted more. I wanted a way to respond with multiple routes.

The standard way to do this (using SIP, of course) is to respond to incoming INVITEs with a 300 Multiple Choices. This response should contain a Contact header (or multiple Contact headers) with a list of SIP URIs (along with optional q values, etc) for the original system to route the call to.
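
For illustration, this is roughly what such a response looks like on the wire. The Python sketch below builds one from a list of (URI, q value) routes, with one Contact header per route and the highest q value first. It is a simplified assumption-laden example, not FreeSWITCH output: the gateway names, header values and function name are invented, and a real response would also add a tag to the To header.

```python
def multiple_choices(request_headers, routes):
    """Build a bare-bones 300 Multiple Choices response as text.

    request_headers: dict with the Via, From, To, Call-ID and CSeq values
    copied from the incoming INVITE.
    routes: iterable of (sip_uri, q_value) pairs.
    """
    lines = ["SIP/2.0 300 Multiple Choices"]
    for name in ("Via", "From", "To", "Call-ID", "CSeq"):
        lines.append(f"{name}: {request_headers[name]}")
    # One Contact header per route, most preferred (highest q) first.
    for uri, q in sorted(routes, key=lambda route: route[1], reverse=True):
        lines.append(f"Contact: <{uri}>;q={q:.1f}")
    lines.append("Content-Length: 0")
    return "\r\n".join(lines) + "\r\n\r\n"


if __name__ == "__main__":
    invite = {
        "Via": "SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK776asdhds",
        "From": "<sip:2000@proxy.example.com>;tag=1928301774",
        "To": "<sip:1000@lcr.example.com>",
        "Call-ID": "a84b4c76e66710@192.0.2.10",
        "CSeq": "1 INVITE",
    }
    routes = [
        ("sip:1000@gw2.example.com", 0.5),
        ("sip:1000@gw1.example.com", 0.9),
    ]
    print(multiple_choices(invite, routes))
```

The system that sent the INVITE can then try each Contact in turn, starting with the highest q value.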

As usual I wrote to the FreeSWITCH-Users mailing list to make sure this functionality didn't already exist somewhere. It did not and it was suggested I create a bounty.

Creating a bounty is always tough... I don't deal with the source code of FreeSWITCH all that often. I don't know how much work this is going to take. I don't know how much C programmers make. So I did my best to come up with something that seemed fair: $250.

Less than two hours later the feature was coded, committed to FreeSWITCH, tested by me, and paid for.

Once again, Open Source for the win!

Wednesday, May 12, 2010

Another SIP gotcha: Cisco

Another quick and dirty SIP interop post.

A while back I was tasked to interface a FreeSWITCH server and a Cisco Unified Communications Manager system. Once the SIP trunk was configured on the Call Manager/CUCM side they sent an INVITE over. It didn't have an SDP.

It appeared that we needed to enable 3pcc (third party call control) in FreeSWITCH. No problem. I enabled 3pcc and interop continued.

Problems arose, however, when we needed to send the Cisco ringback. Whether it be a 180 or 183 (with or without SDP for either) this was going to be tough because with 3pcc enabled the dialog looked like so:

<-- Cisco
--> FreeSWITCH
INVITE (without SDP) <--
100 Trying -->
200 OK (with SDP) -->
ACK (with SDP) <--

So... There was no opportunity to signal progress as long as we 200 OKd the call almost immediately. Sure I probably could generate some ringback after the 200 but that would just be wrong!

As I like to say, the internet to the rescue. Not having much experience with CUCM I thought I'd ask on VoiceOps. Within a few minutes a very nice gentleman by the name of Mark Holloway mentioned "Media Termination Point Required" as a CUCM configuration option. These were the magic words. After some research it turned out that was the configuration option I needed*. Thanks Mark!

Once "Media Termination Point Required" was enabled on the Cisco side I disabled 3pcc in FreeSWITCH and all was good. Users even get ringback now!

I also brought the issue up on the FreeSWITCH-Users mailing list and found out this has been bothering people for some time. MC from FreeSWITCH was even nice enough to start a wiki page for me to document all of this there.

Sometimes with SIP it's all about the SIMPLE achievements ;).

* That research also brought up another possibility: enabling PRACK/100rel on the CallManager side instead of "MTP Required". Of course the trouble with PRACK is there are a lot of SIP implementations (Asterisk) that don't support it. FreeSWITCH does but can crash. Many SIP implementations don't support the default CUCM configuration (INVITE w/o SDP). I was looking for the most canonical, compatible configuration possible.