Friday, September 20, 2013

Apple's new Facetime - a SIP Perspective

A colleague of mine recently sent over a PCAP file containing an Apple Facetime session between two iPhone devices running just-released iOS 7.  Knowing I'm a protocol junkie, he thought I might be interested in seeing it.  Clearly that’s the case!

As always (because I’m lazy) I looked around the internet to see what other research people had done on the “Facetime protocol”.  I found several excellent (though now somewhat dated) articles.  Without giving it away just yet it looks like Apple has been very busy over the last three years (like you needed me to tell you that).  As we usually do here, let’s start looking at packets!

At first glance Facetime 2013 resembles any normal SIP capture:





Then some interesting details begin to emerge…  What are these unknown packets?  What’s up with this INVITE?

Let’s talk about the unknown UDP packets.  We’ll select one and force Wireshark to decode it as RTP:





That worked and we can see this is RTP using a dynamic payload type.  Our RTP packets are now correctly dissected but Wireshark seems to have confused the SIP and STUN packets for RTP.  Why?

Forcing Wireshark to decode a given stream (where “stream” is a SRC IP:PORT pair and a DST IP:PORT pair) attempts to force a decode of that protocol type on all packets belonging to that stream.  So what happened here?

If we look closer at the UDP layer we can see that STUN, SIP, and RTP all appear to be using the same port number on each endpoint (16402 in this case).  We also have no idea what codec payload type 104 is (we’d need to see the SDP for that).  Now’s probably a good time to look at the SIP signaling a bit closer.
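For the curious, this is roughly what Wireshark reads out of the first 12 bytes once you tell it a packet is RTP. It's a minimal Python sketch of the fixed RTP header from RFC 3550. The function name is my own, and the sample packet is built by hand for illustration rather than taken from the capture.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte RTP header (RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("an RTP packet needs at least 12 bytes")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,           # always 2 for RTP
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,    # 96-127 are dynamic types
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }

# Hypothetical packet: version 2, payload type 104, made-up sequence/timestamp/SSRC.
sample = struct.pack("!BBHII", 0x80, 104, 1, 480, 0x12345678)
header = parse_rtp_header(sample)
print(header["version"], header["payload_type"])  # 2 104
```

Nothing in the header says which codec payload type 104 is. That mapping only ever lives in the signaling, which is why the SDP matters so much here.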





Apple is using compact SIP headers (interesting).  The User-Agent uses the codename for Facetime and GK more than likely implies the use of Apple’s GameKit.  That SDP, however, looks a little strange.  Certainly not the simple, printable ASCII we’re used to seeing in SIP bodies!

However, not all hope is lost.  SIP compact header “e” maps to Content-Encoding.  Encoding “deflate” is pretty standard HTTP 1.1 compression: deflate data (RFC 1951) inside a zlib wrapper (RFC 1950).  Wireshark clearly doesn’t support this with SIP so we’ll have to do a little more work…
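If compact headers are new to you, the mapping is just a lookup table. Here's a small Python sketch using the compact forms listed in RFC 3261. The helper name and the sample header lines are made up for illustration.

```python
# Compact header forms defined in RFC 3261.
COMPACT_HEADERS = {
    "i": "Call-ID",
    "m": "Contact",
    "e": "Content-Encoding",
    "l": "Content-Length",
    "c": "Content-Type",
    "f": "From",
    "s": "Subject",
    "k": "Supported",
    "t": "To",
    "v": "Via",
}

def expand_header(line: str) -> str:
    """Rewrite a compact SIP header line in its long form; leave other lines alone."""
    name, sep, value = line.partition(":")
    if not sep:
        return line
    full_name = COMPACT_HEADERS.get(name.strip().lower(), name.strip())
    return f"{full_name}: {value.strip()}"

# Illustrative header lines, not copied from the capture.
for raw in ["e: deflate", "c: application/sdp", "User-Agent: example"]:
    print(expand_header(raw))
```

Each compact name saves only a handful of bytes, but across every header in an INVITE it adds up.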

I saved the SIP message bodies (SDPs) from the INVITE and 200 OK into separate files.  Here’s what hexdump had to say about the INVITE:


0000000 da78 4f75 6e41 3083 bc10 f123 3f07 4da0
0000010 36bc e021 0f95 2886 d46d 4224 8a26 057a
0000020 554a 20d4 9010 bfbc a6eb 7352 d641 3d6a
0000030 3b33 f5e3 734d ebdf cbf4 b9db aa6b fd3a
0000040 a62a 1ebc 746e 9c65 eece 76c8 c059 1620
0000050 080b 85a3 b090 8800 6f7c 6dd4 3657 da97
0000060 2af7 373d 6a53 2b93 39c1 4fe5 c29a af7c
0000070 dbd0 8e7d 6b67 c714 bdc3 6a59 00a0 28dd
0000080 721e 0cf5 3fb8 4c59 624d 4c52 92ad dfb8
0000090 323a ee23 0555 ac68 960a 49f2 032e b77c
00000a0 22e8 8737 bac4 9a9e f7cc 5d5a 3f5c 8e9a
00000b0 1841 c170 29ec 9a5b c673 d380 7c7a 1545
00000c0 98b2 057e b082 5410 9ce0 54c3 eaf5 e1d7
00000d0 67d0 f53b 98ca e594 db45 ea5f ab31 e487
00000e0 55d2 2cdf f888 ba7d 6ddf 8494 8e28 5ae5
00000f0 d2c4 c571 8555 75ab effc 2f77 e58e e580
0000100 f3a3 594f 2acd 1b21 0aeb 5467 97da 07d4
0000110 770c 03fc 5b01 3c74                    
0000118

Oh, you can’t read that?  Yeah, I can’t either.  We’ll have to inflate these.  Unfortunately no standard utility (gunzip, etc) seemed to want to uncompress these binary blobs so I had to hack up some PHP (first thing I could find with Google, whatever):

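#!/usr/bin/php
<?php
// Inflate a zlib-compressed SIP message body that was saved to a file.
$filename = "./my-sdp";
$contents = file_get_contents($filename);
if ($contents === false) {
    fwrite(STDERR, "Could not read $filename\n");
    exit(1);
}
// gzuncompress() handles zlib-format data, unlike gunzip, which expects a gzip header.
$uncompressed = gzuncompress($contents);
if ($uncompressed === false) {
    fwrite(STDERR, "Could not inflate $filename\n");
    exit(1);
}
echo $uncompressed;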
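The same inflate step can be sketched in Python. The hexdump output starts with da78, which is the byte pair 78 da once you undo hexdump's 16-bit word swapping, and 0x78 is the usual first byte of a zlib stream. That would explain why gunzip refused it: gunzip wants a gzip header. The sketch below tries zlib, raw deflate, and gzip framing in turn. The function name and the sample SDP are illustrative.

```python
import zlib

def inflate_sip_body(body: bytes) -> bytes:
    """Inflate a compressed SIP body, trying zlib, raw deflate, then gzip framing."""
    framings = (
        zlib.MAX_WBITS,        # zlib wrapper (RFC 1950)
        -zlib.MAX_WBITS,       # raw deflate stream (RFC 1951)
        zlib.MAX_WBITS | 16,   # gzip wrapper (RFC 1952)
    )
    for wbits in framings:
        try:
            return zlib.decompress(body, wbits)
        except zlib.error:
            continue
    raise ValueError("body is not zlib, raw deflate, or gzip data")

# Round-trip a made-up SDP fragment to show the zlib case.
sdp = b"v=0\r\ns=mobile\r\nt=0 0\r\n"
blob = zlib.compress(sdp)
print(hex(blob[0]))                   # 0x78
print(inflate_sip_body(blob) == sdp)  # True
```

To run it against a saved body, read the file in binary mode and pass the bytes to `inflate_sip_body`.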

The PHP script gave me a legible SDP offer from the INVITE:

v=0
o=GKVoiceChatService 0 0 IN IP4 192.168.231.118
s=mobile
c=IN IP4 192.168.231.118
b=AS:2000
t=0 0
a=FLS;VRA:0;MVRA:0;RVRA1:1;AS:2;MS:-1;LTR;CABAC;CR:3;LF:-1;PR;CH:4;AR:4/3,3/4;XR;
a=DMBR
a=CAP
m=audio 16402 RTP/AVP 104 105 106 9 0 124 122 121
a=rtcp:16402
a=fmtp:AAC SamplesPerBlock 480
a=rtpID:3189937293
a=au:65792
a=fmtp:104 sbr;block 480
a=fmtp:105 sbr;block 480
a=fmtp:106 sec;sbr;block 480
a=fmtp:122 sec
a=fmtp:121 sec

Now we can begin to analyze what’s actually happening here.  Our INVITE contains a perfectly valid audio offer advertising support for PCMU (payload type 0), 16kHz G722 (payload 9), and six dynamic payload types.  The IANA has defined static payload types for RTP.  Simply put, we know payload type 0 is PCMU but anything between 96-127 SHOULD (RFC 4566) have a corresponding rtpmap line to map RTP payload type to a “media encoding name”.  Looking at this trace alone I don’t know what these RTP payload types are.
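To make the "missing rtpmap" point concrete, here's a short Python sketch that lists the dynamic payload types in an SDP body that have no rtpmap line. The function name is my own, and the input is trimmed to a few lines from the offer above.

```python
def unmapped_dynamic_payloads(sdp: str) -> list[int]:
    """Return dynamic payload types (96-127) offered without an a=rtpmap line."""
    offered = []
    mapped = set()
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m="):
            # m=<media> <port> <proto> <fmt> <fmt> ...
            offered += [int(pt) for pt in line.split()[3:]]
        elif line.startswith("a=rtpmap:"):
            mapped.add(int(line[len("a=rtpmap:"):].split()[0]))
    return [pt for pt in offered if 96 <= pt <= 127 and pt not in mapped]

offer = """m=audio 16402 RTP/AVP 104 105 106 9 0 124 122 121
a=rtcp:16402
a=fmtp:104 sbr;block 480"""

print(unmapped_dynamic_payloads(offer))  # [104, 105, 106, 124, 122, 121]
```

Payload types 9 and 0 drop out because they are static assignments. All six dynamic types come back unmapped.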

However, it’s very likely that the payload types have stayed the same even though Apple has now removed the suggested rtpmap lines.  More than likely Apple has “hardcoded” these payload type codec maps internally.  We’ll get to why in just a bit.

The other iPhone 5 running iOS 7 responds with the following SDP answer in the 200 OK:

v=0
o=GKVoiceChatService 0 0 IN IP4 192.168.231.100
s=mobile
c=IN IP4 192.168.231.100
b=AS:2000:2000
t=0 0
a=FLS;VRA:0;MVRA:0;RVRA1:1;AS:2;MS:-1;LTR;CABAC;CR:3;LF:-1;PR;CH:4;AR:4/3,3/4;XR;
a=DMBR
a=CAP
m=audio 16402 RTP/AVP 104 106 121 122
a=rtcp:16402
a=fmtp:AAC SamplesPerBlock 480
a=rtpID:3770747611
a=au:65792
a=fmtp:104 sbr;block 480
a=fmtp:106 sec;sbr;block 480
a=fmtp:121 sec
a=fmtp:122 sec

Using what we know from the 2010 analysis (complete with rtpmap lines) it seems we have agreed to use the AAC_ELD codec at 24kHz and 16kHz sample rates.  However, absent analysis of an RTP stream with payloads 121 or 122 (and missing rtpmap lines) it’s hard for me to say what those other payload types represent.
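Comparing the two m= lines side by side shows what survived the negotiation. This is a quick Python sketch using only the m=audio lines from the offer and answer above; the function name is illustrative.

```python
def audio_payloads(sdp: str) -> list[int]:
    """Return the payload types listed on the first m=audio line."""
    for line in sdp.splitlines():
        if line.startswith("m=audio"):
            return [int(pt) for pt in line.split()[3:]]
    return []

offer = "m=audio 16402 RTP/AVP 104 105 106 9 0 124 122 121"
answer = "m=audio 16402 RTP/AVP 104 106 121 122"

offered = audio_payloads(offer)
accepted = audio_payloads(answer)
dropped = [pt for pt in offered if pt not in accepted]

print(accepted)  # [104, 106, 121, 122]
print(dropped)   # [105, 9, 0, 124]
```

The answer is a strict subset of the offer, and both static payload types (PCMU and G722) were dropped by the answering side.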

A casual reader may have read this far and thought to themselves: at the end of the day we’re ending up with 24kHz AAC_ELD audio just like we were in 2010.  Apple has gone through all of this work for nothing.  Oh and by the way, how/why are SIP, STUN, RTP, and RTCP using the same UDP port?

It’s called port multiplexing and it has become all the rage these days (it’s standard in WebRTC, for example - although not to this extent).  Unfortunately I’m not able to determine if port multiplexing is new to Facetime 2013 (I’m sure someone will chime in here).  Either way it’s an important technical distinction.  Typically a SIP session of this type would require at least three UDP ports:

- SIP signaling
- RTP for audio
- RTCP for well, RTCP

What’s wrong with three ports?  Three ports make it much more difficult (if not impossible) to cross some types of NAT devices and firewalls.  Three ports and three possibly bad interactions with some firewall or other device.  Three times as much exposure to Murphy's law.  With port multiplexing Apple has greatly increased the chances that Facetime will work through challenging network environments.  With this aggressive (I’ve never seen it before) use of UDP multiplexing for all of these standard protocols Apple has virtually guaranteed that if the signalling works the media and everything else will too.
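With everything on one port, the receiver has to sort datagrams by looking at their first few bytes. I don't know how Apple's stack actually does this, so treat the following Python as a sketch of one plausible approach: the STUN magic cookie from RFC 5389, the RTP version bits, the RTP/RTCP payload-type heuristic that follows from RFC 5761, and the "SIP/2.0" token on the first line. The function name and sample datagrams are invented for illustration.

```python
STUN_MAGIC_COOKIE = b"\x21\x12\xa4\x42"  # fixed value in bytes 4-7 of a STUN header

def classify(datagram: bytes) -> str:
    """Guess which protocol a datagram on a shared UDP port belongs to."""
    if not datagram:
        return "unknown"
    first = datagram[0]
    # STUN: top two bits are zero and the magic cookie is present.
    if (first & 0xC0) == 0 and len(datagram) >= 20 and datagram[4:8] == STUN_MAGIC_COOKIE:
        return "stun"
    # RTP and RTCP: version field (top two bits) equals 2.
    if (first & 0xC0) == 0x80 and len(datagram) >= 8:
        payload_type = datagram[1] & 0x7F
        # RTCP packet types 200-204 land in 72-76 once the top bit is masked off.
        return "rtcp" if 64 <= payload_type <= 95 else "rtp"
    # SIP is text: requests end their first line with SIP/2.0, responses start with it.
    first_line = datagram.split(b"\r\n", 1)[0]
    if first_line.startswith(b"SIP/2.0 ") or first_line.endswith(b" SIP/2.0"):
        return "sip"
    return "unknown"

# Hand-built examples, not taken from the capture.
print(classify(b"\x00\x01\x00\x00" + STUN_MAGIC_COOKIE + bytes(12)))  # stun
print(classify(bytes([0x80, 104]) + bytes(10)))                      # rtp
print(classify(bytes([0x80, 200]) + bytes(6)))                       # rtcp
print(classify(b"INVITE sip:user@example.invalid SIP/2.0\r\n"))     # sip
```

Dynamic payload types such as 104 stay well clear of the 64-95 range, so RTP and RTCP can share the port without ambiguity.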

UPDATE: The always on-point Olle Johansson has postulated that Apple might be multiplexing everything over a single port for greater compatibility with carrier grade NAT (CGN) implementations, which place specific limits on the number of ports used per client.  Thanks Olle!

Any SIP engineer will tell you NAT is the bane of our existence.  That’s why (even from 2010) Apple has made use of protocols and features such as STUN, ICE, TURN, etc.  These are discussed all over the internet so I won’t get into them here but in summary they are all technologies used to traverse NAT devices and network firewalls.

It is very clear Apple has made Facetime (finally) ready for primetime.  Let me explain why and how I came to that conclusion.  First let’s talk strategy.

Without going into all of the details it is my opinion that Skype became as popular as it is because (like many successes) “it just works”.  The main reason Skype “just works” is its almost-magical NAT traversal.  The creators of Skype learned a lot from their previous gig defeating firewalls for peer to peer music sharing with Kazaa.  Both Skype and Kazaa have an almost legendary reputation for NAT and firewall traversal.  If there is a way through a NAT device or firewall they will probably figure it out.  It’s with this technology (and timing, codecs, etc) that Skype became so popular.  When nothing else worked, Skype would (and still does).  It just works (and as we’ve seen that’s hugely valuable).

The more endpoints that “just work” the more Metcalfe’s law comes into effect.  If Apple can succeed in making as many endpoints as possible “just work” they have a hugely valuable real-time communications network with Facetime.

A skeptic might point out that Apple is already using STUN, TURN, and ICE. They’ve already got NAT “figured out”.  Generally speaking, yes.  However, they’ve now taken some extraordinary steps to take their NAT handling (and by extension the “value” of Facetime) to the next level.  Apple wants Facetime to work on as many networks as possible and they’ve spent a lot of time making sure of that.  Let’s look at the changes from Facetime of 2010 to now:

- SIP, STUN, RTP, RTCP port multiplexing (while not including the a=rtcp-mux attribute)
- Compact SIP headers
- SDP minimization (removing rtpmap lines, etc)
- SDP compression

This clearly isn’t an off the shelf SIP and RTP based solution anymore, but how does it all add up?  Also, why are three of these efforts focused on minimizing packet size (even if it means violating standards)?

I know this has been a long and winding road.  Hopefully you’re still with me!  To understand the value of minimizing packet size you need to peek into another little-known area of the internet: IP fragments.  I’m going to butcher a lot of this but at this point you just need to get the broad strokes.

Each network link type is configured for a maximum transmission unit (MTU).  For Ethernet this is typically 1500 bytes.  This means that the largest IP packet a single Ethernet frame can carry is 1500 bytes.  In many cases, however, the various links and links inside of links (encapsulation), etc mean that the effective end-to-end MTU is significantly smaller than that (we won’t get into ATM, etc).  There are two ways this can be addressed:

- PMTU (Path MTU discovery)
- IP fragmentation

With these two technologies who cares about packet size?  Firewalls and NAT devices, that’s who.  First, many firewall administrators or vendors carelessly block all ICMP packets.  That means PMTU is out.  If Facetime sent packets larger than the end-to-end MTU and depended on functioning PMTU there would be many instances where it would not work.

IP fragmentation poses another problem.  Once again, many network vendors and firewall administrators outright block IP fragments.  To add insult to injury (or is that injury to insult?) many of these devices have broken and/or buggy support for IP fragment reassembly.  This especially goes for UDP (which to be fair is not a “connection oriented” protocol).  I have personally witnessed many instances where firewall devices could successfully reassemble IP+TCP fragments but not IP+UDP fragments.
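The arithmetic behind all of this is simple enough to sketch in Python. The assumptions are a 20-byte IPv4 header with no options, an 8-byte UDP header, and a 20-byte TCP header with no options. The function names are my own.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8
TCP_HEADER = 20   # bytes, assuming no TCP options

def max_unfragmented_payload(mtu: int, transport_header: int) -> int:
    """Largest application payload that fits in a single IPv4 packet."""
    return mtu - IPV4_HEADER - transport_header

def ipv4_fragments(udp_payload: int, mtu: int = 1500) -> int:
    """Number of IPv4 packets needed to carry one UDP datagram."""
    datagram = UDP_HEADER + udp_payload
    room = mtu - IPV4_HEADER          # data bytes one IP packet can carry
    if datagram <= room:
        return 1
    per_fragment = room // 8 * 8      # all but the last fragment carry a multiple of 8 bytes
    return 1 + math.ceil((datagram - room) / per_fragment)

print(max_unfragmented_payload(1500, UDP_HEADER))  # 1472
print(max_unfragmented_payload(1500, TCP_HEADER))  # 1460
print(ipv4_fragments(1472), ipv4_fragments(1473))  # 1 2
```

One byte over the limit is all it takes to turn a single packet into two fragments, and the smaller the path MTU, the sooner that happens.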

This leads to an almost unwinnable situation.  Apple could use TCP for the signalling and UDP for media but then they’d lose the benefits of single port multiplexing.  They could use TCP over a single port for everything but that would just be crazy.  In either case TCP has a larger header anyway (larger header = larger packet for the same amount of data).

This has posed such a longstanding problem that some network operators and academics study the behavior of MTU interactions and IP fragmentation on the internet.  It came up most recently on NANOG a few weeks ago.  Let’s look at the results of Emile Aben’s mini-study:

Results:
size = ICMP packet size, add 20 for IPv4 packet size
fail% = % of vantage points where 5 packets where sent, 0 where received.
#size   fail%   vantage points
100     0.88    2963
300     0.77    3614
500     0.88    1133
700     1.07    3258
900     1.13    3614
1000    1.04    770
1100    2.04    3525
1200    1.91    3303
1300    1.76    681
1400    2.06    3014
1450    2.53    3597
1470    3.01    2192
1470    3.12    3592
1473    4.96    3566
1475    4.96    3387
1480    6.04    679
1480    4.93    3492 [*]
1481    9.86    3489
1482    9.81    3567
1483    9.94    3118

From these results we can see that the failure rate passes 1% at around 700 bytes and roughly doubles to about 2% once packet size is greater than 1000 bytes.

The Facetime 2010 INVITE packet was 1093 bytes.  The Facetime 2013 INVITE packet is 714 bytes.  In any case there is a greater chance that the 714 byte packet will reach its intended destination.  From these results we can also see that you start to run into real trouble once you reach 1400 bytes.  Sure that’s twice the size of our current packet but who knows what the future holds for the capabilities of Facetime?  Screen sharing, multi-party conferencing, file transfer, etc all depend on larger SDP descriptions.  With the changes Apple has made they’ve not only increased the robustness of Facetime today, they’ve given themselves “room to grow” in the future.
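For a rough feel of what the size reduction buys, here's a Python sketch that looks up the nearest measured size in the results above for each INVITE. The rows are copied from the table; where a size appears twice I kept the row with more vantage points. The table measures ICMP packets rather than SIP over UDP, so treat this as a ballpark comparison only.

```python
# (ICMP size in bytes, fail %) copied from the results table.
RESULTS = [
    (100, 0.88), (300, 0.77), (500, 0.88), (700, 1.07), (900, 1.13),
    (1000, 1.04), (1100, 2.04), (1200, 1.91), (1300, 1.76), (1400, 2.06),
    (1450, 2.53), (1470, 3.12), (1473, 4.96), (1475, 4.96), (1480, 4.93),
    (1481, 9.86), (1482, 9.81), (1483, 9.94),
]

def nearest_row(size: int) -> tuple[int, float]:
    """Return the measured (size, fail %) row closest to the given size."""
    return min(RESULTS, key=lambda row: abs(row[0] - size))

print(nearest_row(714))   # (700, 1.07)
print(nearest_row(1093))  # (1100, 2.04)

# Sizes whose IPv4 packet (size + 20) no longer fits a 1500-byte MTU.
print([size for size, _ in RESULTS if size + 20 > 1500])  # [1481, 1482, 1483]
```

The last line is worth a second look: the failure rate roughly doubles at exactly the sizes that can no longer fit in a single 1500-byte packet.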

Apple has spent some time in the trenches over the last few years and found out how difficult real-time communications can be in the real world.  Facetime isn't playing around anymore and Apple is becoming a serious networking (services?) company that's poised to take on current best-of-breed solutions.

If you're interested to see how this might be implemented in various open source projects, check out the update!

9 comments:

Anonymous said...

Very interesting, thanks!

Unknown said...

Great write up! Thank you for taking the time to analyze this.

Neill Wilkinson said...

Good Analysis. Some serious food for thought for the rest of us... Reminiscent of IAX2 putting signalling and media over the same UDP port.

Unknown said...

The SDP compression not only saves bytes, but also helps preventing "smart" ALGs messing around with the port numbers (until the ALGs deflate on the fly as well).

carlosj said...

Great article.

What I don't understand is where is the point in using protocols as complex as SIP/SDP and not being 100% compatible...

Unknown said...

what, they are not using sigcomp for compression?... dissapointed... :-)

hellt said...

Would you mind if I translate your post (with original link of course) in Russian language and post it to the tech blog resource?

hellt said...

Would you mind if I translate your post in Russian language and post it to tech blog resource (with original link of course)?

Anonymous said...

Amusing how like Digium's IAX2 facetime has become, binary, all on one port, multiplexed. Perhaps Mark was more right than we knew.