Thursday, February 5, 2009
UPDATE: Any updates on this and other SIP/RTP issues can be found here.
In my last post, over a month ago, I ranted on and on (big surprise, right?) about some issues we were experiencing with Sonus equipment. After learning more, I should elaborate on what I mean by "Sonus equipment".
Like many other manufacturers, Sonus has multiple products. We'll be talking about their NBS SBC. Many providers use the NBS SBC in conjunction with GSX gateways and PSX route servers. I have no comments about GSX gateways or PSX route servers; that equipment is largely transparent to us "end users". My gripes are with the NBS SBC.
Providers that use Sonus NBS:
- Level(3) (w/ GSX)
- XO (w/ PSX & GSX)
- Global Crossing
- Broadvox
- Many others
If you are using these carriers for SIP services, be aware.
Last time I was talking about timestamps. This time it's far more insidious...
Apparently (as relayed to me by Level(3) engineers) Sonus has a DSP buffer limitation for RTP packet handling. If there is ever more than a 100ms gap in RTP (my experience has shown the threshold to be much lower), Sonus will, in technical terms, "freak out".
We have now identified four RTP interop issues with Sonus equipment:
1) Sonus requires all RTP packets (events or voice) to have unique timestamps. The RFCs specifically state that not only is it valid to use the same timestamp for multiple RTP packets, it is ideal in some cases (events, for example).
2) The RFC 2833 events generated by Sonus equipment are goofy, to put it lightly. The event duration increments do not match the packetization of the voice stream as stated in RFC 2833 and elaborated on in RFC 4733. Specifically, Sonus equipment increments the RFC 2833 duration 80 samples at a time, as if the voice stream were packetized at 10 ms (regardless of what it actually is). I don't know of any other implementation that does this. Even when the audio stream is *clearly* 20 ms (in the SDP, too), Sonus will continue to increment 80 samples at a time (see the sketch after this list).
3) The most recent (and biggest problem) has been caused by the Sonus (seemingly arbitrary) requirement that there never be greater than 100ms gaps in RTP. This is inherently broken behavior for robustness in IP networks.
4) Sonus has yet another issue with RTP timing and sequencing... If a call is brought up with an endpoint that clocks its own RTP stream (an IVR server, for example) everything will be fine. Until the IVR server (or whatever) bridges that channel to another device that also clocks its own RTP. Sonus (probably related to #3 above) will lose sync and drop audio for up to several seconds while it catches up to the new RTP stream. This forces those of us who work with Sonus equipment to rewrite all timestamps and sequence numbers on our equipment, which has the adverse effect of less-than-optimal jitter buffering (which should ideally be done at each far endpoint).
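To make issue #2 concrete, here's a minimal sketch (Python, assuming a G.711 stream at 8 kHz; not a real RTP stack) of how the RFC 2833/4733 duration field should grow for a held digit versus how Sonus grows it:

```python
# Illustrative sketch: RFC 2833 "duration" growth for a held DTMF digit.
# Assumes a G.711 stream at 8000 Hz. Not a real RTP stack.

SAMPLE_RATE = 8000          # samples per second
PTIME_MS = 20               # packetization advertised in the SDP
SAMPLES_PER_PACKET = SAMPLE_RATE * PTIME_MS // 1000   # 160 for 20 ms

def rfc_durations(num_packets):
    """Per RFC 2833/4733: duration grows by the packetization interval."""
    return [SAMPLES_PER_PACKET * i for i in range(1, num_packets + 1)]

def sonus_durations(num_packets):
    """Observed Sonus behavior: always 80 samples (10 ms) per packet,
    regardless of the actual packetization."""
    return [80 * i for i in range(1, num_packets + 1)]

print(rfc_durations(4))    # [160, 320, 480, 640]
print(sonus_durations(4))  # [80, 160, 240, 320]
```

At 20 ms packetization the duration should advance 160 samples per packet; Sonus advances 80 no matter what the SDP says.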
Asterisk is largely ok with all of these issues, believe it or not. The one that still causes problems is #3. If you are using Asterisk and Sonus gateways, make DAMN SURE that you are using Packet2Packet bridging and that your devices (whatever they may be) implement RFC 2833 the Sonus way. If not...
NO DTMF FOR YOU!
If you are not using Packet2Packet bridging and your events need to traverse the Asterisk core (for features, fixup, or anything else) there will be a variable-length RTP gap that often exceeds the Sonus DSP buffer requirement. With gaps in RTP...
NO DTMF FOR YOU!
FreeSWITCH is also ok as long as you avoid #4. FreeSWITCH provides a configuration option to rewrite timestamps (at the cost of breaking jitter buffering). If you are using Sonus gateways you should enable it, otherwise...
NO DTMF FOR YOU!
All of this makes me wish I was around back in the old days when there was one telco and all DTMF was inband!
Wednesday, January 7, 2009
Heads up!
UPDATE: Any developments on this and other SIP/RTP issues can be found here.
Some serious issues for all of you in SIP land:
There is a pretty serious RTP problem with Sonus equipment that has been making the rounds...
Simply put, Sonus equipment will not accept two RTP packets with the same timestamp, even if the sequence number has been properly incremented. According to various RFCs (namely 1889 and 2833) this is perfectly valid and in some cases (like video) desired.
A few slight problems... Many implementations (including Asterisk AND FreeSWITCH) will (did; more on this later) send out RFC 2833 DTMF events with the same timestamp as the last voice RTP packet. This is perfectly valid according to the RFCs mentioned above.
It appears (after my own testing) that Sonus will actually drop BOTH the voice RTP packet and the event packet. After some testing against Sonus gear it was pretty clear that no audio was being passed for as long as the DTMF event occurred. This makes sense because, per RFC 2833, a variable-length DTMF event must keep the same timestamp, increment the sequence counter, and increase the duration when it is resent. DO NOT change the timestamp. Oh, Sonus.
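To illustrate, here's a rough sketch of the packets involved. The field names and values are invented; the point is that the event packets legitimately share the voice packet's timestamp:

```python
# Illustrative sketch of the RTP packets for one DTMF digit per RFC 2833.
# Field names and values are invented; this is not a real RTP stack.

voice_pkt = {"seq": 100, "timestamp": 16000, "payload": "G.711 audio"}

# Event packets for a digit pressed at the instant the voice packet above
# was sent: the sequence number keeps incrementing, the duration keeps
# growing, but the timestamp stays fixed at the event's start.
event_pkts = [
    {"seq": 101, "timestamp": 16000, "event": "1", "duration": 160},
    {"seq": 102, "timestamp": 16000, "event": "1", "duration": 320},
    {"seq": 103, "timestamp": 16000, "event": "1", "duration": 480, "end": True},
]

# Sonus reportedly drops both voice_pkt and the first event packet here,
# because they share timestamp 16000 -- even though that is RFC-valid.
```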
Both Asterisk and FreeSWITCH have implemented workarounds to address this. They are similar, but there is one key difference. Asterisk now (as of SVN, 12/15/2008 or so) will always use a unique timestamp for every RTP packet. I guess that solves that problem. FreeSWITCH is slightly smarter about it (as of SVN at about the same time, interestingly enough) but I'm worried...
FreeSWITCH will parse the SDP to find the originator line (o=). If it is equal to "Sonus_UAC" FreeSWITCH activates a specific workaround to always send RTP packets with different timestamps. This seems more elegant but I am worried they will have to expand this hack for other equipment in the future (requiring a code change and recompile).
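For the curious, the detection amounts to something like the following sketch. The sample SDP is made up; as I understand it, only the "Sonus_UAC" username in the o= line is the actual trigger:

```python
# Sketch of SDP origin-line sniffing, FreeSWITCH-style. The sample SDP is
# made up; "Sonus_UAC" in the o= username field is the actual trigger.

sdp = """v=0
o=Sonus_UAC 12345 67890 IN IP4 192.0.2.10
s=SIP Media Capabilities
c=IN IP4 192.0.2.10
m=audio 16384 RTP/AVP 0 101
"""

def needs_sonus_workaround(sdp_text):
    for line in sdp_text.splitlines():
        if line.startswith("o="):
            username = line[2:].split()[0]
            return username == "Sonus_UAC"
    return False

print(needs_sonus_workaround(sdp))  # True -> always bump RTP timestamps
```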
One could argue that Sonus has gotten this far with their current implementation and expected behavior. While it is valid (per the RFCs) to use the same timestamp, it is more /compatible/ to always use different timestamps. That appears to be what most equipment does.
This issue is what (apparently) caused so many issues for Teliax a while back when they switched from Asterisk to FreeSWITCH. At least that's what I heard. What doesn't make any sense is that Asterisk had the same behavior as FreeSWITCH: they both sent voice and event RTP packets with identical timestamps.
Also, one would like to think that when you provide voice services (which are pretty important to your customers) you would *test* something like DTMF when you were completely switching platforms. I discovered these issues while testing Star2Star with Level(3), for example. I'm glad I was paying attention. Our customers would have been upset with broken DTMF while we updated all of our Asterisk machines (several hundred).
I'm surprised no one noticed this until mid-December or so. It will be interesting to see what other things pop out of this mess...
Monday, December 22, 2008
Introducing Recqual
I've been waiting to talk about this one for a while.
Several months ago Star2Star was having problems with one of our upstream SIP carriers. We were starting to notice a large increase in the number of one-way audio calls our customers were reporting.
When most people think of one-way calls their first reaction is to blame SIP. Must be NAT! Must be a firewall! SIP sucks! Etc, etc.
I knew that wasn't the case. I just had to prove it.
I was convinced the problem wasn't SIP/UDP/IP related at all. We had multiple pcaps where we were sending RTP to the appropriate gateway. It just wasn't getting to the PSTN. Where was it going? When was this happening? Which gateways (out of hundreds) were the most problematic? We needed to know and we needed to know quickly.
I came up with (and "wrote") recqual over a couple of days. After a few runs we noticed patterns among problematic RTP endpoint IP addresses. Long story short, once these were identified we worked with the carrier to replace various bits of equipment (DSPs, line cards, etc). The one-way audio problem has largely disappeared and we continue to run recqual. If this starts happening again we should know /BEFORE/ our customers do.
Of course I'm using Asterisk to place the calls. The best part of using Asterisk is its multi-protocol flexibility. You should be able to test just about any combination of voice technologies - G.729a, G.711, GSM, SIP, IAX, PRI, FXS, FXO, gtalk/jabber/jingle, Skype, etc. The possibilities boggle the mind.
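If you want to roll a crude version of this yourself, the test calls can be originated with Asterisk call files. This is a generic sketch, not recqual's actual internals; the trunk name, number, and dialplan context are placeholders:

```python
# Sketch: drop an Asterisk call file to originate a test call.
# The trunk name, number, and dialplan context are placeholders.
import os
import shutil
import tempfile

CALLFILE = """Channel: SIP/my-carrier/18005551212
MaxRetries: 0
WaitTime: 30
Context: rtp-test
Extension: s
Priority: 1
"""

# Write the file elsewhere first, then move it into the outgoing spool;
# the move is atomic, so Asterisk never sees a half-written call file.
fd, path = tempfile.mkstemp(suffix=".call")
with os.fdopen(fd, "w") as f:
    f.write(CALLFILE)
shutil.move(path, "/var/spool/asterisk/outgoing/")
```

A matching rtp-test context would answer, play a known prompt, record what comes back, and flag silent recordings - roughly the pattern, though again, not recqual's actual code.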
I've just been too busy to get it together and release this to the community - until now.
Tarball with instructions here.
Questions? Comments? Suggestions? Drop me a line.
Sunday, December 21, 2008
Consulting Time Available
I haven't blogged in a while but there is some good news...
I have made some time available for consulting work!
You might not be as excited about it as I am but this is a good thing. I'm looking for interesting projects, people, and companies to work with.
If you or anyone you know might be interested please contact me. Resume, references, etc available on request.
I'll be offering bonus time, discounts, and a few other potential incentives to anyone that lets me blog about my projects and/or release any work under a liberal (read: FOSS) license.
Between my change in schedule and some (hopefully) fun new projects you can all expect to see much more frequent blogging soon!
Monday, November 17, 2008
SBCs are Killing SIP
Wow... Over a month since my last post! My how time flies.
No time to reminisce or catch up. I've got a rant that needs to get out - NOW.
SBCs (Session Border Controllers) are killing SIP. Breaking SIP. Smothering SIP. Especially when used by "carriers". Carriers and their SBCs I tells ya.
SBCs, technically, are pretty cool devices. While I certainly understand their purpose they tend to be overused, misconfigured, and misunderstood. Many entities deploy SBCs without any idea of the other components (I'm looking at you, proxies) that make up a well designed SIP network.
Why do I hate SBCs so much?
1) SIP is cool because it is end to end and designed with intelligent endpoints in mind (endpoints that can think for themselves).
2) SIP is very flexible, especially with regards to handling media.
3) Ubiquity.
SBCs (especially when misconfigured) break many of these features:
1) SBCs (by design) hide endpoints from one another. Both endpoints support G.722? The SBC doesn't, and it's going to rewrite the SDP with its own capabilities. Too bad (see the sketch after this list).
2) SBCs (by design) handle media. While this can be good, oftentimes it isn't, and there are other, less drastic ways to ensure media quality.
3) When the only tool you have is a hammer, every problem starts to look like a nail.
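Here's a rough illustration of point #1. The payload type numbers (0 = PCMU, 8 = PCMA, 9 = G.722) are the standard static assignments; the rest of the example is made up:

```python
# Sketch: an SBC stripping codecs it can't handle from an SDP offer.
# Payload types: 0 = PCMU, 8 = PCMA, 9 = G.722 (static RTP/AVP numbers).

OFFERED = "m=audio 16384 RTP/AVP 9 0 8"   # endpoint offers G.722 first
SBC_SUPPORTED = {"0", "8"}                 # this SBC only does G.711

def sbc_rewrite(m_line, supported):
    parts = m_line.split()
    header, payloads = parts[:3], parts[3:]
    kept = [pt for pt in payloads if pt in supported]
    return " ".join(header + kept)

print(sbc_rewrite(OFFERED, SBC_SUPPORTED))
# m=audio 16384 RTP/AVP 0 8 -> both ends spoke G.722; neither gets to use it
```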
My biggest concerns with SBCs relate to the last point. I swear, there are many providers, enterprises, etc that have deployed SIP in some capacity using ONLY SBCs and simple UACs and UASs. They've never heard of a proxy. Or a registrar. Heck, I'd even go for a signalling-only B2BUA and call it a compromise. Chances are they've never heard of that either.
I have dealt with several devices that break down, utterly fall apart when used with a proxy. I've covered it on this blog before. I'm just too mad to look up the link now. Again, $MANUFACTURER designs and markets a SIP device. They only test it against SBCs and they've (apparently) never heard of a proxy. Guess what happens...
Some poor soul like myself tries to deploy said device in what I consider to be a well designed SIP network. Unfortunately for me, this call path might not involve an SBC. Guess what happens? The device doesn't understand traversing proxies (Record-Route, Via, etc) and does something silly like parse the Contact header when trying to send a response. Call failure and all kinds of brokenness ensue.
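For reference, a crude sketch of what RFC 3261 actually asks of a UAS: responses follow the Via stack, in-dialog requests follow the Route set learned from Record-Route, and the Contact is only the remote target at the end of the path. All header values below are invented:

```python
# Sketch of RFC 3261 response/request routing from a UAS's perspective.
# All header values are invented for illustration.

request = {
    "Via": ["SIP/2.0/UDP proxy.example.com;branch=z9hG4bK776",
            "SIP/2.0/UDP ua1.example.com;branch=z9hG4bK123"],
    "Record-Route": ["<sip:proxy.example.com;lr>"],
    "Contact": "<sip:alice@ua1.example.com>",
}

# Responses go back through the topmost Via -- NOT the Contact.
response_next_hop = request["Via"][0]

# New in-dialog requests use the Route set built from Record-Route,
# with the Contact as the remote target at the end of the path.
route_set = list(request["Record-Route"])
remote_target = request["Contact"]

print(response_next_hop)  # proxy.example.com, never the Contact host
```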
So... I talk to $MANUFACTURER and get the standard "We've deployed this device thousands of times and never seen this problem before". Let's assume that's true. I don't know what's more depressing: the fact that they skipped over multiple sections of a basic SIP RFC like 3261, or the fact that no one noticed it for this long because (apparently) no one uses proxies anymore. Ugh. Gross.
It's not just device manufacturers. Carriers do this too. Oftentimes the actual issue lies with their SBC. Many carriers (especially those using ACME SBCs, it seems) parse To: instead of the Request-URI. Probably because their customers are using SBCs too, so the Request-URI and To: match. Not so with a proxy. I don't blame the carrier's use of an SBC. This makes sense. That's what they were designed for. However, please test your device and configuration against something other than another SBC.
What happens if your Request-URI and To: don't match? They send a 404! Yet another RFC 3261 violation. Section 8.2.2.1 allows a UAS to route based off To (although it doesn't sound preferred). However, for the love of God, if you are going to deny a request because of the content of a To header, please send a 403 as specified in the RFC. Your 404s are confusing and ignorant. Was it really not found, or are you just routing based off To instead of the Request-URI? Once again I blame SBCs and a world where it's becoming common for SBCs to talk to each other (and nothing else).
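Roughly, the difference looks like this (example URIs invented; section 8.2.2.1 of RFC 3261 is the basis for the 403):

```python
# Sketch contrasting the observed SBC behavior with what RFC 3261
# section 8.2.2.1 asks for. Example URIs are invented; a proxy in the
# path may legitimately retarget the Request-URI, so a mismatch with
# the To header is normal.

request_uri = "sip:bob@gw1.example.net"   # retargeted by a proxy
to_uri      = "sip:bob@example.net"       # what the caller dialed

def broken_sbc(request_uri, to_uri):
    # Routes on To and, on a mismatch, claims the resource doesn't exist.
    return "404 Not Found" if request_uri != to_uri else "200 OK"

def per_the_rfc(request_uri, to_uri):
    # Rejecting because of the To header's contents is a policy
    # decision: 403 Forbidden, not 404.
    return "403 Forbidden" if request_uri != to_uri else "200 OK"

print(broken_sbc(request_uri, to_uri))   # 404 -- confusing
print(per_the_rfc(request_uri, to_uri))  # 403 -- at least honest
```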
This is yet another situation where assumptions are made based on the behavior of SBCs. It's bad. Please stop.
Thursday, October 9, 2008
Submit Your SIP
Ever since I started blogging and talking about SIP, people have come out of the woodwork with SIP interop problems.
After giving a talk about SIP at Astricon 2008 I received several e-mails from audience members with specific SIP issues. I LOVE getting these e-mails.
Why? I love working on SIP issues. With all of the devices using SIP there is no shortage of interop problems. Just today a guy on the Asterisk mailing list had a problem with his Cisco AS5300 and Asterisk 1.2. Usually that wouldn't be a problem at all - many people (including myself) use this combination of hardware with great success.
Why was he having problems? His AS5300 was configured for GTD and Asterisk 1.2 (apparently) doesn't handle multipart SIP bodies very well. I was able to find a patch to Asterisk 1.4 to improve multipart body parsing. That was a fun one.
I got to thinking... There should be a place where people can exchange specific SIP interop tips and notes. Otherwise how are we supposed to get anything to work!?!?
I came up with such a place and it's called SubmitYourSip.com. I've started to fill it in a little, but hopefully (with time) it will become somewhat of a SIP wiki (with a focus on interop, of course).
I'm just getting started on it but I'll be working on my MediaWiki syntax and going back through my e-mail to dig out some of these examples.
Friday, September 19, 2008
A preview: performance tests
I'm headed out the door for some sushi but I thought I'd drop in to give you an idea of what I'm working on for my next blog post. I'm hungry so let's keep this short and sweet: receive interrupt mitigation and its effects on Linux media applications.
In general I'm a big fan of receive interrupt mitigation. I'll trade some delay for a substantial decrease in system CPU time spent servicing interrupts, resulting in the ability to handle more calls. I didn't just come to this conclusion one day; I've done tests in the past to verify it.
However, I've never done a large-scale test on regular, server-class hardware. Usually just Asterisk on an embedded system, where it usually works out well. This is why NAPI is enabled by default in AstLinux for every Ethernet adapter that supports it.
The folks at TransNexus spend a fair amount of time testing OpenSER/Kamailio/OpenSips performance. Today Jim Dalton posted the results of another test to the Kamailio User's mailing list. I replied to his post with so many questions I figured it might be time for me to lab this up myself and test my theories about interrupt handling (in Linux, specifically).
If those brightly colored rolls of fish weren't so distracting and delicious, I'd promise to think about all of this over dinner. Unfortunately it will have to wait until tomorrow...