Over the last week or so, I’ve learned a ton about voice over IP and audio codecs, and what’s really bananas is how easy it actually is to write your own voice over IP client. This is technology that’s been around for a few decades now, so it’s hardly revolutionary, but the fact that it’s so accessible to create your own is pretty wild.
At work, we rely heavily on Twilio for a lot of the services they provide, and quite frankly they’re pretty awesome. There’s a readily accessible API that lets you do pretty much any sort of programmatic action with voice calls and texts, which in the real world means that you can do some fairly powerful things. Notably though, for the straightforward and simple use case of just calling an individual, the costs can really start to add up. We’re talking something like pennies per minute, which is indeed cheap, but it turns out that writing your own VOIP client is even cheaper, and otherwise really fun to work on and figure out. The audio quality is also possibly better (although I’m not quite sure).
High Level Concepts
It’s not hard to find out how VOIP works; there are lots of resources on the internets. But my mind was actually pretty blown to find out that VOIP is technically not real time. I assumed that audio was encoded on the fly and streamed over the wire continuously, but it actually works by taking very short samples of audio (which are still very large by computer standards) and encoding each sample individually. The associated lag is imperceptible to a human.
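To make the "one small sample at a time" idea concrete, here is a minimal sketch of cutting a stream of PCM values into fixed-length frames. The function name is made up for illustration, and the 8,000 hertz and 40 ms defaults are simply the values used later in this post, not anything a codec requires.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=40):
    """Split a list of PCM samples into fixed-duration frames.

    A trailing partial frame is dropped in this sketch; a real client
    would hold on to it until more audio arrives.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # 8000 Hz * 40 ms = 320 samples
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


one_second = [0.0] * 8000            # one second of silence at 8,000 Hz
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]))   # 25 frames of 320 samples each
```

Each of those frames is what gets encoded and shipped as its own packet, which is why one second of audio at a 40 ms frame size works out to 25 packets.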
So this is the gist of how it works:
- Sample audio in fixed intervals that can be encoded, compressed, and shipped off to a destination endpoint. This can be anywhere from something like 15 ms to 120 or so ms.
- Take raw audio from the input microphone. This should be Pulse Code Modulation (PCM) audio, where raw float values correspond to the amplitude of the sound wave. To cover the full sound spectrum perceptible to a human, the audio is recorded at 44,100 hertz (see Nyquist’s Theorem).
- Since we only care about voice, we don’t actually need the full sound spectrum. We only need the frequencies that the human voice falls into. Downsample the audio to 8,000 hertz (this means that the raw audio size will be much smaller).
- If you want to get really wild and crazy to conserve bandwidth, this is where you might add some extra steps, like applying a voice band-pass filter (which some audio codecs already apply, but perhaps your needs are specific), or checking whether the input audio is white noise and discarding packets accordingly.
- Before we send the data across the internets and take up precious bandwidth, somehow compress the audio to minimize the data footprint.
- The simplest codec to integrate with is also apparently the best. Opus encapsulates Skype’s open source codec “Silk” which is both very high quality and minimal in size. Integrate the codec (which is probably a static library written in C) into your project and encode each data packet.
- Transmit each packet across the network to the receiving end. This can be done in a number of ways, but I used PubNub which I’ll explain in greater detail further down.
- On the receiving end, the small packets sent across the network are not guaranteed to arrive in the same order that they were transmitted. You need to re-order them based on the order in which they were initially created on the sending side. This means that you need to add a buffering layer that delays playback for some arbitrary time (I did about 250ms to be extra safe). This is called a “Jitter Buffer”.
- Read packets from the jitter buffer, which should now be ordered correctly, and decode them in the opposite way they were encoded
- Play back the audio at the same rate that it was sampled. For me, I just have a timer in an iOS app that fires every 40 milliseconds, dequeues from the jitter buffer, and presses play.
- At this point you can get into advanced concepts like acoustic echo cancellation to mitigate feedback loops and allow for full-duplex communication.
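The jitter buffer step is the one that is easiest to get subtly wrong, so here is a minimal sketch of one. It assumes the sender stamps every packet with an increasing sequence number, which the steps above imply but do not spell out. The class and method names are hypothetical, and this is not the code from my app, just the same idea in a few lines.

```python
import heapq


class JitterBuffer:
    """Re-orders packets by sequence number before playback (illustrative sketch)."""

    def __init__(self, frame_ms=40, delay_ms=250):
        # Number of frames to collect before playback starts: ceil(250 / 40) = 7.
        self.prefill = -(-delay_ms // frame_ms)
        self._heap = []        # min-heap of (sequence number, payload)
        self._next_seq = None  # sequence number of the next frame to play
        self._started = False

    def push(self, seq, payload):
        # A packet whose slot has already been played is too late to be useful.
        if self._next_seq is not None and seq < self._next_seq:
            return
        heapq.heappush(self._heap, (seq, payload))

    def pop(self):
        """Call once per frame interval; returns a payload, or None for a gap."""
        if not self._started:
            if len(self._heap) < self.prefill:
                return None  # still buffering
            self._started = True
            self._next_seq = self._heap[0][0]
        # Throw away duplicates of frames that were already played.
        while self._heap and self._heap[0][0] < self._next_seq:
            heapq.heappop(self._heap)
        seq = self._next_seq
        self._next_seq += 1
        if self._heap and self._heap[0][0] == seq:
            return heapq.heappop(self._heap)[1]
        return None  # packet was lost or has not arrived yet


jb = JitterBuffer(frame_ms=40, delay_ms=80)  # short delay to keep the demo small
jb.push(1, b"second")                        # arrives out of order
jb.push(0, b"first")
print(jb.pop(), jb.pop(), jb.pop())          # b'first' b'second' None
```

The playback timer would call `pop()` every 40 milliseconds and treat `None` as silence. A larger delay tolerates more reordering at the cost of more lag, which is the trade-off behind picking something like 250 ms.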
A short history of open source codecs before you start searching all over the internets: Speex used to run the open source game of audio codecs, but then Skype open sourced Silk, which outperformed Speex as an audio codec. Silk is integrated into Opus, which is an all-encompassing library that picks among various codecs based on the input audio. It’s the most robust and pretty much all you need for audio encoding. However, Speex is more than just an audio codec; there are some additional functions that might be useful, such as acoustic echo cancellation, which you’ll need in order to eliminate feedback loops if you have a full-duplex conversation. Opus, however, recommends WebRTC for AEC.
I found a project set up for Opus here on GitHub, which was all I ended up needing to get up and running. If you have VLC installed, there’s likely already a copy of libopus.a on your computer, but you’ll also need the headers to go with your static library.
Opus is otherwise well documented and you just need to patiently read through the docs to get something up and working.
3rd Party Services
For Voice Over IP in particular, once I understood how things worked, it was incredibly easy to actually create the point-to-point connection thanks to PubNub. I wrote an earlier blog post about how to set it up for Python, but in all reality, Python is not what you’d want to use to integrate with them beyond testing. Python’s asynchronous code execution is clunky at best.
With PubNub, you can get a sample project up and running in less than 10 minutes for extremely cheap. Their service plans come to $1 per million messages. In the context of voice over IP, if I’m sampling at 40 milliseconds, this means I’m sending 25 messages per second. Under their pricing plan, that works out to about 11 hours of voice for $1. This is hard to put in perspective because I don’t know anyone who cares about just voice these days. If we compare this to Twilio (who also provides a completely awesome service and a stellar API, so I’m not trying to bash them at all), calls are 1.5 cents per minute (also extremely cheap because there’s no additional overhead). This comes to $9.90 for the same duration of time. That’s a difference of a factor of 10 to 1! At scale, this can be a huge money saver. This also leaves plenty of room to increase the sample rate if we wanted and still maintain really cheap prices.
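For anyone who wants to check the math, here is the arithmetic as a small sketch. The prices are the ones quoted above, the function names are made up for illustration, and it only counts the messages published by one speaker.

```python
def hours_of_voice_per_dollar(frame_ms=40, usd_per_million_messages=1.0):
    """Hours of one-way audio that one dollar of per-message pricing buys."""
    messages_per_second = 1000 / frame_ms                     # 40 ms frames -> 25 per second
    messages_per_dollar = 1_000_000 / usd_per_million_messages
    return messages_per_dollar / messages_per_second / 3600


def per_minute_cost(hours, usd_per_minute=0.015):
    """Cost of the same duration at a flat per-minute price."""
    return hours * 60 * usd_per_minute


hours = hours_of_voice_per_dollar()
print(f"{hours:.2f} hours per dollar")                         # 11.11 hours per dollar
print(f"${per_minute_cost(11):.2f} for 11 hours per minute")   # $9.90 for 11 hours per minute
```

Rounding down to 11 hours is where the $9.90 figure comes from; without rounding, the per-minute price for the same duration lands at almost exactly $10.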
Conceptually, the steps described above can be implemented in any language, but I used the Swift programming language, which turns out to be really awesome even beyond the context of user-interface-heavy applications. C code can be interfaced with directly, which was handy for this project, and the code can be made to run really fast if you need it to. You can take an array and treat it as a pointer, so a block of bytes can be reinterpreted as another type right away (which was a necessity for transmitting an array of PCM values represented as floats). I used memcpy once more, wielding one of the modern marvels of human techmology. And there’s type safety all over the place, which comes as a luxury in a world where I mostly use Python.
Also, as a tangent, Swift’s explicitly nullable or non-nullable variables are incredibly helpful for type safety, and the compiler is able to check for tons of errors as a result. Multi-threading is a forefront consideration of the language, and immutability helps in that respect as well.
But I digress. My point is that with Swift, I was able to use the iOS frameworks to capture audio immediately with Apple’s AVAudioEngine, and to play it back as well. Everything in between is what I described above (and it runs quickly in a low-level compiled language), and by using Swift it was simple to interface with a C library.
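The float-to-bytes step is easy to show outside of Swift too. Here is a rough Python equivalent using the standard `struct` module rather than pointer casts and memcpy. The function names are hypothetical, and the choice of little-endian 32-bit floats is an assumption for the sketch, not necessarily the layout my app sends.

```python
import struct


def floats_to_bytes(samples):
    """Pack PCM float samples as little-endian 32-bit floats for the wire."""
    return struct.pack(f"<{len(samples)}f", *samples)


def bytes_to_floats(payload):
    """Unpack a payload from floats_to_bytes; ignores any trailing partial float."""
    count = len(payload) // 4
    return list(struct.unpack(f"<{count}f", payload[:count * 4]))


frame = [0.0, 0.5, -0.25, 1.0]
wire = floats_to_bytes(frame)
print(len(wire))                       # 16 bytes: four samples at 4 bytes each
print(bytes_to_floats(wire) == frame)  # True
```

These particular sample values survive the round trip exactly because they are representable as 32-bit floats; arbitrary Python floats would come back rounded to single precision.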
In the age of ubiquitous telephony, the application I wrote itself is underwhelming from a physical standpoint. If you imagine me talking to a phone and then hearing the audio on the other end, that’s basically what the application does.