Tuesday, December 20, 2011

Performance Testing (Part 1)

Over the past few years (like many other people in this business) I’ve needed to do performance testing.  Open source software is great but this is one place where you need to do your own leg work.  This conundrum first presented itself in the Asterisk community.  There are literally thousands of variables that can affect the performance of Asterisk, FreeSWITCH, or any other software solution.  In no particular order:

- Configuration.  Which modules do you have loaded?  How are they configured?  If you’re using Kamailio, do you do hundreds of huge, slow, nasty DB queries for each call setup?  How is your logging configured?  Maybe you use Asterisk or FreeSWITCH and make several system calls, DB lookups, Lua script invocations, etc. per call?  Even the slightest misstep in configuration (synchronous syslogging with Kamailio, for example) can reduce your performance by 90%.

- Features in use.  Paging groups (unicast) are notorious for destroying performance on standard hardware - every call needs to be set up individually, you need to handle RTP, and some audio mixing is involved.  Hardware that can’t do 10 members in a page group using Asterisk or FreeSWITCH may be capable of hundreds of sessions using Kamailio with no media.

- Standard performance metrics.  “Thousands of calls” you say?  How many calls per second?  Are you transcoding?  Maybe you’re not handling any media at all?  What is the delay in call setup?

- Hardware.  This may seem obvious (MORE HERTZ) but even then there are issues...  If you’re handling RTP, what are you using for timing?  If you have lots of RTP, which network card are you using?  Do the card and your kernel support MSI or MSI-X for better interrupt handling?  Can you load balance IRQs across cores (see the sketch after this list)?  How efficient (or buggy) is the driver (Realtek I’m looking at you)?!?

- The “guano” effect.  As features are added to the underlying toolkit (Asterisk, FreeSWITCH, etc) and to your configuration, how is performance affected over time?  Add a feature here, and a feature there - and repeat.  Over the months and years (even with faster hardware) you may find that each “little” feature reduced call capacity by 5%.  Or maybe your calls per second went down by two each time.  Not a big deal individually, yet over time it adds up - assuming no other optimizations, ten “minor” changes like that can cut your call capacity by 40-50%.  It adds up - it really does.
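To expand on the IRQ point above: on Linux you can at least see where your interrupts are landing and pin them by hand.  Below is a minimal sketch of what I mean - the interface name and IRQ number are placeholders and will obviously differ on your hardware:

# Which IRQ is the NIC using and which cores are servicing it?
grep eth0 /proc/interrupts

# Pin that IRQ to CPU1 (bitmask 2).  Stop irqbalance first if it's running,
# otherwise it will happily move things around behind your back.
echo 2 > /proc/irq/24/smp_affinity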

Even after pointing out all of these issues, you’d be surprised how often I’m still faced with the question “Well yeah, but how many calls can I handle on my dual core Dell server?”.

In almost every case the best answer is “Get your hardware, develop your solution, run sipp against it and see what happens”.  That’s really about as good as we can do.

SIPP is a great example of a typical, high quality open source tool.  In true “Unix philosophy” it does one thing and it does it well: SIP performance testing.  SIPP can be configured to initiate (or receive) just about any conceivable SIP scenario - from simple INVITE call handling to full SIMPLE test cases.  In these tests SIPP will tell you call setup time, messages received, successful dialogs, etc.
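To give you an idea of how simple it is to get started, here’s a minimal sketch of a basic load test using SIPP’s built-in UAC scenario (the target address, call rate, and call counts are placeholders, not recommendations):

# 10 new calls per second, 20 second call duration, stop after 1000 calls,
# dump periodic statistics to a CSV file
sipp -sn uac -r 10 -d 20000 -m 1000 -trace_stat 192.0.2.10:5060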

SIPP even goes a step further and includes some support for RTP.  SIPP has the ability to echo RTP from the remote end or even replay RTP from a PCAP file you have saved to disk.  This is where SIPP starts to show some deficiencies.  Again, you can’t blame SIPP because SIPP is a SIP performance testing tool - it does that and it does it well.  RTP testing leaves a lot to be desired.  First of all, you’re on your own when it comes to manipulating any of the PCAP parameters.  Length, content, codec, payload types, etc, etc need to be configured separately.  This isn’t a problem, necessarily, as there are various open source tools to help you with some of these tasks.  I won’t get into all of them here but they too leave something to be desired.
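For what it’s worth, SIPP ships with an embedded uac_pcap scenario that replays a G.711 capture toward the answering side, and custom scenario XML can do the same with the play_pcap_audio action.  A rough sketch (the target is a placeholder, and the pcap/g711a.pcap file from the SIPP distribution needs to be in your working directory):

# Replay G.711 RTP from a PCAP during each established call
sipp -sn uac_pcap -r 5 -m 100 192.0.2.10:5060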

What about analyzing the quality of the RTP streams?  SIPP provides mechanisms to measure various SIP “quality” metrics - SIP response times, SIP retransmits, etc.  With RTP you’re on your own.  Once again, sure, you could set up tshark on a SPAN port (or something) to do RTP stream analysis on every stream but this would be tedious and (once again) subject you to some of the harsh realities of processing a tremendous number of small packets in software on commodity hardware.
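If you’re curious, that tshark approach looks something like this (a rough sketch - the capture file name is a placeholder and your tshark build needs the RTP statistics tap):

# Summarize every RTP stream in a capture: packets, lost, max delta, jitter
tshark -q -z rtp,streams -r span-capture.pcap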

Let’s face it - for a typical B2BUA handling RTP the numbers add up very quickly - let’s assume 20ms packetization for the following:

Single RTP stream = 50 packets per second (pps)
Bi-directional RTP stream = 100 pps
A-leg bi-directional RTP stream = 100 pps
B-leg bi-directional RTP stream = 100 pps

A leg + B leg = 200 pps PER CALL

What does this look like with 10,000 channels (using g711u)?

952 Mbit/s (close to Gigabit wire speed) in each direction
1,000,000 (total) packets per second
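For anyone checking my math, here’s the rough back-of-the-envelope (assuming full Ethernet framing overhead, which is why the number creeps so close to wire speed):

160 bytes of G.711 payload + 12 (RTP) + 8 (UDP) + 20 (IP) + ~38 (Ethernet header, CRC, preamble, inter-frame gap) = ~238 bytes = ~1,904 bits per packet
~1,904 bits x 50 pps = ~95 kbit/s per stream, per direction
~95 kbit/s x 10,000 streams = ~950 Mbit/s in each direction
100 pps x 10,000 bi-directional streams = 1,000,000 pps total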

Open source software is great - it provides us with the tools to (ultimately) build services and businesses.  Many of us choose what to focus on (our core competency).  At Star2Star we provide business grade communication services and we spend a lot of time and energy to build these services because it’s what we do.  We don’t sell, manufacture, or support testing platforms.

At this point some of you may be getting an idea...  Why don’t I build/design an open source testing solution?  It’s a good question and while I don’t want to crush your dreams there are some harsh realities:

1)  This gets insanely complicated, quickly.  Anyone who follows this blog knows SIP itself is complicated enough.
2)  Scaling becomes a concern (as noted above).
3)  Who would use it?

The last question is probably the most serious - who really needs the ability to initiate 10,000 SIP channels at 100 calls per second while monitoring RTP stream quality, etc?  SIP carriers?  SIP equipment manufacturers?  A few SIP software developers?  How large is the market?  What kind of investment would be required to even get the project off the ground?  What does the competition look like?  While I don’t have the answers to most of these questions I can answer the last one.

Commercial SIP testing equipment is available from a few vendors:

Spirent
Empirix
Ixia

...and I’m sure others.  We evaluated a few of these solutions and I’ll be talking more about them in a follow-up post in the near future.
 
Stay tuned because this series is going to be good!

Friday, December 2, 2011

Star2Star Gets Noticed

Just a quick one today (and some shameless self promotion on my part)...  Star2Star has been recognized on a few "lists" this year, check it out:

Inc 500
Forbes 100 "Most Promising"

I'm lucky enough to tell people the same story all of the time - when I was a little kid I played with all of this stuff because I thought it was fun and I loved it.  Only later did I realize that one day I'd be getting paid for it.  I certainly never thought it could come to this!

Ok, enough of that for now.  I'll be getting back to some tech stuff soon...

Tuesday, November 15, 2011

Building a Startup (the right way)

(Continued from Building a Startup)
 
Our way wasn’t working.  To put it mildly, our “business grade” solution didn’t perform much better than Vonage.  We came to exemplify the worst of VoIP - jittery calls, dropped calls, one way calls, etc, etc, etc.  Most of this was because of the lack of quality ITSPs at that time.  Either way, our customers didn’t care.  It was us.  If we went to market with what we had the first time around we were going to lose.

The problem was that the other predominant architecture at the time was “hosted”.  Someone hosts a PBX for you and ships you some phones.  You plug them in behind your router and magically you have a phone system.  They weren’t doing much better.  Sure, their sales looked good, but even then it was becoming obvious customer churn was quite high.  People didn’t like hosted either, and for good reason.  Typically they had even less control over the call than we did.

As I’ve alluded to before, I thought there was a better way.  We needed to host the voice applications where it made the most “sense”.  We were primarily using Asterisk and with a little creative provisioning, a kick-ass SIP proxy, and enough Asterisk machines we could build the perfect business PBX - even if that meant virtually none of it existed at the customer premise.  Or maybe all of it did.  That flexibility was key.  After a lot of discussions, whiteboard sessions, and late nights everyone agreed.  We needed a do-over.

So we got to work and slowly our new architecture began to take shape.  We added a kick-ass SIP proxy (OpenSER).  OpenSER would power the core routing between various Asterisk servers each meeting different needs - IVR/Auto Attendant, Conferencing, Voicemail, remote phones (for “hosted” phones/softphones), etc.  The beauty was the SIP proxy could route between all of these different systems including the original AstLinux system at the customer premise.  Customer needs to call voicemail?  No problem - the AstLinux system at the CPE fires an INVITE off to the proxy and the proxy figures out where their voicemail server is.  The call is connected and the media goes directly between the two endpoints.  Same thing for calls between any two points on the network - AstLinux CPE to AstLinux CPE, PSTN to voicemail, IVR to conference.

This is a good time to take a break and acknowledge what really made this all possible - OpenSER.  While it’s difficult to explain the exact history and family tree of any piece of SER-derived software, I can tell you one thing - this company would not be possible without it.  There is no question in my mind.  It’s now 2011 and whether you select Kamailio or OpenSIPS for your SIP project you will not be sorry.  Even after five years you will not find a more capable, flexible, scalable piece of SIP server software.  It was one of the best decisions we ever made.

Need to add another server to meet demand for IVR?  No problem, bring another server online, add the IP to a table and presto - you’re now taking calls on your new IVR.  Eventually a new IVR led to several new IVRs, voicemail servers, conference systems, web portals, mail servers, various monitoring systems, etc.

What about infrastructure?  Given our small scale, regional footprint, and focus on quality, we began buying our own PRIs and running them on a couple of Cisco AS5350XM gateways.  This got us past our initial issues with questionable ITSPs.  Bandwidth was becoming another problem...  We had an excellent colocation provider that offered blended bandwidth, but we still needed more control.  Here came BGP, ARIN, AS numbers, a pair of Cisco 7206VXRs w/ G2s, iBGP, multiple upstream providers, etc.

At times I would wonder - whatever happened to spending my time worrying about cross compilers?  Looking back I’m not sure which was worse - GNU autoconf cross-compiling hell or SIP interop, BGP, etc.  It’s fairly safe to say I’m a sadomasochist either way.

Even with all of the pain, missteps, and work we finally had an architecture to take to market.  It would be the architecture that would serve us well for several years.  Of course there was more work to be done...

Wednesday, November 2, 2011

Building a Startup

(Continued from Starting a Startup)
After several days of meetings in Sarasota we determined:

1)  I was moving to Sarasota to start a company with Norm and Joe.
2)  We were going to utilize open source software wherever possible (including AstLinux, obviously).
3)  The Internet was the only ubiquitous, high quality network on which to build a nationwide platform.
4)  The Internet was only getting more ubiquitous, more reliable, and faster in the coming months/years/decades.
5)  We were going to take advantage of as much of this as possible.

These were some pretty lofty goals.  Remember, this is early 2006.  Gmail was still an invitation-only beta.  Google Docs didn’t exist.  Amazon EC2 didn’t exist.  “Cloud computing” hadn’t come back into fashion yet - the term itself didn’t exist.  The Internet was considered (by many) to be “best effort”, “inherently unreliable”, and “unsuitable” for critical communications (such as real time business telephony).  There were many naysayers who were confident this would be a miserable failure.  As it turns out, they were almost right.

We thought the “secret sauce” to business grade voice over the internet was monitoring and management.  If one could monitor and manage the internet connection, business grade voice should be possible.  Of course this is very ambiguous, but it led to several great hires.

Joe had already deployed several embedded Asterisk systems to various businesses in the Sarasota area.  They used an embedded version of Linux he patched together and a third party (unnamed) “carrier” to connect to the PSTN.  The first step was upgrading these machines and getting them on AstLinux.  Once this was accomplished we felt confident enough to proceed with our plan.  This was Star2Star Communications and in the beginning of 2006 it looked something like this:

1)  Soekris net4801 machines running AstLinux on the customer premise.
2)  Grandstream GXP-2000 phones at each desk.
3)  Connectivity to a third party “ITSP”.
4)  Management/monitoring systems (check IP connectivity, phone availability, ITSP reliability, local LAN, etc).
5)  Central provisioning of AstLinux systems, phones, etc.

This was Star2Star and there was something I really liked about it - it was simple.  Anyone who knows me or knows of my projects (AstLinux, for example) has to know I favor simplicity whenever possible.  Keep it simple, keep it simple, keep it simple (stupid).

As time went on we started to learn that maybe this was too simple.  We didn’t have enough control.  Our monitoring wasn’t as mature as it should have been.  We didn’t pick the right IP phones.  These could be easily fixed.  However, we soon realized our biggest mistake was architecture (or lack thereof).  This wasn’t going to be an easy fix.

We couldn’t find an ITSP that offered a level of quality we considered to be acceptable.  Very few ITSPs had any more experience with VoIP, SIP, and the internet than we did.  More disturbing, however, was an almost complete lack of focus on quality and reliability.  No process.

What we (quickly) discovered was the extremely low barrier to entry for ITSPs, especially back then.  Virtually anyone could install Asterisk on a $100/mo box in a colo somewhere, buy dialtone from someone (who knows) and call themselves an ITSP.  After going through several of these we discovered we needed to do it ourselves.

Even assuming we could solve the PSTN connectivity problem we discovered yet another issue.  All of the monitoring and management in the world cannot make up for a terrible last mile.  If the copper in the ground is rotting and the DSL modem can only negotiate 128kbps/128kbps, that’s all you’re going to get.  To make matters worse, in the event of a cut or outage the customer would be down completely.  While that may have always happened with the PSTN and an on premise PBX, we considered this to be unacceptable.

So then, in the eleventh hour, just before launch I met with the original founders and posed a radical idea - scrap almost everything.  There was a better way.
(Continued in Building a Startup (the right way))

Tuesday, October 25, 2011

Starting a Startup


I know I’ve apologized for being quiet in the past.  This is not one of those times because (as you’ll soon find out) I’ve been hard at work and only now can I finally talk about it.

Six years ago I was spending most of my time working with Asterisk and AstLinux.  I spent a lot of time promoting both - working the conference circuit, blogging, magazines, books, etc.  Conferences are a great way to network and meet new people.  I did just that.  With each conference I attended came new business opportunities.  Sure, not all of them were a slam dunk and eventually I started to pick and choose which conferences I considered worthy of the time and investment.

For anyone involved with Asterisk, Astricon is certainly worthy of your time and energy - it’s the mecca of the Asterisk community.  Astricon was always a whirlwind and 2005 was no exception.  We were in Anaheim, California and embedded Asterisk was starting to really heat up.  I announced my port of AstLinux to Gumstix - the “World’s Smallest PBX” - leading to an interview and story in LinuxDevices.  I worked a free community booth (thanks Astricon) with Dave Taht and was introduced to Captain Crunch (that’s another post for another day).

It was at Astricon in 2005 that I also met one of my soon to be business partners (although I certainly didn’t know it at the time).  While I was promoting embedded Asterisk and AstLinux I met a man from Florida named Joe Rhem.  Joe had come up with the idea of using embedded Asterisk systems as the cornerstone of a new way to provide business grade telephone services.  Joe and I met for a few minutes and discussed the merits of embedded Asterisk.  Unfortunately (and everyone already knows this) I don’t remember meeting with Joe.  Like I said Astricon was always a whirlwind and I had these conversations with dozens if not hundreds of people at each show.  I made my way through Astricon, made a pit stop in Santa Clara for (the now defunct) ISPCon and then returned home to Lake Geneva, WI with a stack of business cards, a few new stories, and a lot of work to finish (or start, depending on your perspective).

A couple of months later I received an e-mail from Joe Rhem discussing how he’d like to move forward with what we discussed in Anaheim.  Joe had recruited another partner to lead the new venture.  Norm Worthington was a successful serial entrepreneur and his offer to lead the company was the equivalent of “having General Patton lead your war effort”.  After some catching up I was intrigued by Joe’s idea.  A few hours on the phone later, everyone was pretty comfortable with how this could work.

Now I just needed to fly to Sarasota, FL (where’s that - sounds nice, I thought) to meet with everyone, discuss terms, plan a relocation, and (most importantly) start putting the company, product, and technology together.

A short time later I found myself arriving in Sarasota.  It was early January and, coming from Wisconsin, I couldn’t believe how nice it was.  Looking back on it I’m sure Norm and Joe were very confident I’d be joining them in Sarasota.  Working with technology I love “in paradise” - how could I resist?

(Continued in Building a Startup)

Tuesday, October 26, 2010

Breaking RFC compliance to improve monitoring

A colleague came to me today with a troubling issue. He's using sipsak and Nagios to monitor some SIP endpoints. Pretty standard so far, right? He noticed that when using UDP and checking an endpoint that was completely offline, sipsak would take over 30 seconds to finally return with an error. Meanwhile Nagios would block and wait for sipsak to return...

Without a simple command line option in sipsak that appeared to change this behavior, we had to enter the semi-complicated world of SIP timers. I feared that to change this behavior we'd have to do some things that might not necessarily be RFC compliant...

What's this? For once I'm actually suggesting you do something against the better advice of an RFC?

That's right, I am.

RFC3261 defines multiple timers and timeouts for messages and transactions. It says things like:

"If there is no final response for the original request in 64*T1 seconds"

"The UAC core considers the INVITE transaction completed 64*T1 seconds after the reception of the first 2xx response."

"The 2xx response is passed to the transport with an interval that starts at T1 seconds and doubles for each retransmission until it reaches T2 seconds"

Without even knowing what "T1" is you can start to see that it's a pretty important timing parameter and (more or less) serves as the father of all timeouts in SIP. Let's look at section 17 to find out what T1 is:

"The default value for T1 is 500 ms. T1 is an estimate of the RTT between the client and server transactions. Elements MAY (though it is NOT RECOMMENDED) use smaller values of T1 within closed, private networks that do not permit general Internet connection. T1 MAY be chosen larger, and this is RECOMMENDED if it is known in advance (such as on high latency access links) that the RTT is larger. Whatever the value of T1, the exponential backoffs on retransmissions described in this section MUST be used."

T1 is essentially a variable for RTT between two endpoints that serves as a multiplier for other timeouts. Unless we know better T1 should default to 500ms, which is quite high. Some implementations (such as Asterisk with the SIP peer qualify option) automatically send OPTIONS requests to endpoints in an effort to better determine RTT instead of using the RFC default of 500ms.
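As a point of reference, in Asterisk that measurement is driven by the qualify option on the peer. Something like this is what I mean (a minimal sketch - the peer name is a placeholder and the reported latency is made up):

# With qualify=yes set on the sip.conf peer, Asterisk periodically sends
# OPTIONS and tracks the measured round trip time to the peer
asterisk -rx "sip show peer 1234" | grep Status
#   Status       : OK (23 ms)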

In reading through the sipsak source code it appeared to be RFC compliant for timing, using a default T1 value of 500ms and a transaction timeout of 64*T1. This is why it was taking over 30 seconds (32 seconds to be exact) for sipsak to finally time out and return the status code to Nagios. This comes directly from the RFC:

"For any transport, the client transaction MUST start timer B with a value of 64*T1 seconds (Timer B controls transaction timeouts)."

This is all well and good but what happens when you don't have a way to dynamically determine T1 and you can't wait T1*64 (32s) for your results like my sipsak/nagios check earlier? Simple: you go renegade, throw out the RFC, and hack the sipsak source yourself!

So I had three options:

1) Change the default value of T1.
2) Change the transaction timeout (Timer B) by changing the multiplier or setting a static timeout.
3) Some combination of both.

I decided to go with option #3 (RFC be damned). Why?

1) 500ms is crazy high for most of our endpoints. At a glance 100ms would be fine for ~90% of them. I'll pick 150ms.
2) I don't need that many retransmits. If the latency and/or packet loss is that bad I'm not going to wait (my RTP certainly isn't) and I just want to know about it that much quicker.

So I ended up with a quick easy patch to sipsak:

diff -urN sipsak-0.9.6.orig/sipsak.h sipsak-0.9.6/sipsak.h
--- sipsak-0.9.6.orig/sipsak.h 2006-01-28 16:11:50.000000000 -0500
+++ sipsak-0.9.6/sipsak.h 2010-10-26 18:38:45.000000000 -0400
@@ -102,11 +102,7 @@
# define FQDN_SIZE 100
#endif

-#ifdef HAVE_CONFIG_H
-# define SIP_T1 DEFAULT_TIMEOUT
-#else
-# define SIP_T1 500
-#endif
+#define SIP_T1 150

#define SIP_T2 8*SIP_T1

diff -urN sipsak-0.9.6.orig/transport.c sipsak-0.9.6/transport.c
--- sipsak-0.9.6.orig/transport.c 2006-01-28 16:11:34.000000000 -0500
+++ sipsak-0.9.6/transport.c 2010-10-26 18:38:51.000000000 -0400
@@ -286,7 +286,7 @@
}
}
senddiff = deltaT(&(srt->starttime), &(srt->recvtime));
- if (senddiff > (float)64 * (float)SIP_T1) {
+ if (senddiff > inv_final) {
if (timing == 0) {
if (verbose>0)
printf("*** giving up, no final response after %.3f ms\n", senddiff);

This changes the value of T1 to 150ms (more reasonable for most networks) and allows you to specify the number of retransmits (and thus the total timeout) using -D on the sipsak command line:

kkmac:sipsak-0.9.6-build kris$ ./sipsak -p 10.16.0.3 -s sip:ext_callqual@asterisk -D1 -v
** timeout after 150 ms**
*** giving up, no final response after 150.334 ms

kkmac:sipsak-0.9.6-build kris$ ./sipsak -p 10.16.0.3 -s sip:ext_callqual@asterisk -D2 -v
** timeout after 150 ms**
** timeout after 300 ms**
*** giving up, no final response after 460.612 ms

kkmac:sipsak-0.9.6-build kris$ ./sipsak -p 10.16.0.3 -s sip:ext_callqual@asterisk -D4 -v
** timeout after 150 ms**
** timeout after 300 ms**
** timeout after 600 ms**
*** giving up, no final response after 1071.137 ms

kkmac:sipsak-0.9.6-build kris$

Needless to say our monitoring situation is much improved.
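In case you're wondering how this plugs into Nagios: the check simply runs the patched binary and reads the exit code, so with -D2 a dead endpoint now flips to CRITICAL in under half a second instead of 32 seconds. A rough sketch (the target URI and path are placeholders):

# Hypothetical check command as Nagios would run it
/usr/local/bin/sipsak -s sip:monitor@10.16.0.3 -D2 ; echo "exit code: $?"
# (if I recall correctly, adding -N makes the exit codes Nagios-compliant)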

Thursday, August 5, 2010

A ClueCon Update

ClueCon is going very well this year... I spoke the first day and have spent the rest of my time here enjoying the presentations and interacting with the community.

A few highlights:

  • Perfect wireless provided by Meraki. I've never been to a tech conference where the wifi has kept up with the crowd. Well done.
  • The Trump Tower. Phenomenal.
  • FreeSWITCH HA support in Sofia! This is worthy of its own post and it will have one when I get back and play with it. In the meantime my guy Jay Binks has been working to document this exciting new feature.
  • Chicago. I just LOVE this town.

More later... I've got to get back to the conference!