Monday
Sep012014

What's the right way to calculate tanh?

I'm working on improving the synchronization of M-PSK phase-locked loops, which we call the Costas loop in GNU Radio. I'll post later about what I'm doing and why, but the results are looking very satisfying. However, during this work, I need to calculate a tanh. In my simulation proving the algorithm, I did it all in Python and wasn't concerned about speed. Moving this to a GNU Radio block and the thought of calling a trigonometric function from libm gave me chills. So I thought that I'd look into faster approximations of the tanh function.

Spoiler alert: this is a lesson in optimization and never trusting your gut. The standard C version of tanh is amazingly fast these days and the recommended function to use.

So what are our options for calculating a tanh? I'll go through four here, some of which I had help from the #gnuradio IRC chatroom to explore.

  1. Calling tanhf(beta) straight from libm.
  2. Calculating from an exponential: (expf(2*beta)+1) / (expf(2*beta)-1)
  3. Using a look-up table (LUT)
  4. An expansion approximation from stackexchange (the last equation from July 18, 2013)

There are, I'm sure, other algorithms that will calculate a tanh. I know that we can use the CORDIC algorithm in hyperbolic mode, though my tests of the CORDIC algorithm, while appropriate in hardware, are not good in software. However, we'll see that we don't really need to go to great lengths here before we can be satisfied.

The code I used to test is rather simple. I time using GNU Radio's high_res_timer to get the start and end time  of a loop. The loop itself runs for N floating point samples, and each input sample is calculated using the standard random() function. I want to make sure each number into the function under test is different so we don't benefit from caching. Just putting in 0's makes the test run almost instantaneously, because of the known special case that tanh(0) is 0.

Within each loop iteration, it calculates a new value of beta, the input sample, and one of the four methods listed above. The tests are performed on an Intel i7 870 processor running at 2.93 GHz as well as an ARMv7 Cortex-A9 (using an Odroid-U3) at 1.7 GHz. Tests were run with 100 million floats and compiled using G++ with -O3 optimization. Download the code here.

Results 1

In the following tables of results, the "none" category is the baseline for just calculating the random samples for beta; there is no tanh calculation involved. All values are in seconds.

Intel i7 870

  • none: 0.682541
  • libm: 0.700341
  • lut:  0.93626
  • expf: 0.764546
  • series approx:  1.39389

ARMv7 Cortex-A9

  • none: 6.12575
  • libm: 6.12661
  • lut:  8.55903
  • expf: 7.3092
  • series approx: 8.71788

Results 2

I realized after collecting the data that each of the three methods I developed, the LUT, exponential, and series approximation method were all done as function calls. I'm not sure how well the compiler might do with inlining each of these, so I decided to inline each function myself and see how that changed the results.

Intel i7 870

  • none: 0.682541
  • libm: 0.700341
  • lut:  0.720641
  • expf: 0.81447
  • series approx:  0.695086

ARMv7 Cortex-A9

  • none: 6.12575
  • libm: 6.12661
  • lut:  6.12733
  • expf: 7.31014
  • series approx:  6.12655

Discussion

The inlined results are much better. Each of the approximation methods now starts to give us some decent results. The LUT is only 256 values, so the approximation is off by up to about 0.01 for different inputs, and it still doesn't actually beat libm's implementation. Calculating with expf was never going to work, though it is surprisingly not bad for having that divide in it. Raw comparisons show that libm's expf is actually just by itself slower than tanh, so we were never going to get a win here.

Finally, the series approximation that we use here is pretty close, maybe even somewhat faster. I didn't run these multiple times to get an average or minimum, though, so we can only say that this value is on par with the libm version. I also haven't looked at how far off the approximation is because the libm version is so fast comparatively that I don't see the need, and this is true for both Intel and ARM processors.

I'm surprised, though pleasantly so, that for all of this work, we find that the standard C library implementation of the "float tanhf(float x)" function is about as fast as we can expect. On an Intel processor, we computed 100 million floats, which added 17.8ms to the calculation of just the random numbers, which means that each tanh calculation only took 178 ps to compute. I actually have a hard time wrapping my head around that number. I'm sure that the compiler is doing some serious loop optimization seeing as this is the only thing going on inside here. Still, until I see this being a problem inside of a real algorithm, I'm satisfied with the performance to just use the standard C library's tanhf function.

Further Discussion

One of the big problems with benchmarking and providing results for general purpose processors is that you know that the results are so directly tied to your processor version and generation. Using a Macbook Pro from 2014 with a newer i7 processor gave similar relative results but with much better speeds, even though it only runs at 2.6 GHz instead of 2.93 GHz. But that's technology for you. The important lesson here is that the processor and compiler designers are doing some tremendous work making functions we used to never even think about using wonderfully fast.

But the other thing to note is that YMMV. It used to be common knowledge that we want to use look-up tables wherever possible with GPPs. That seems to have been true once, but the processor technology has improved so much over memory and bus speeds that any memory access, even from cache, is too slow. But this means that if you are running an older processor, you might suffer a bit extra based on these results off more current computers (then again, the Intel tests are from a processor that's roughly three years old). What is really noteworthy is that the ARM showed similar trends with the results, meaning we don't have to think about doing anything differently between ARM and Intel platforms. Still, you might want to bechmark your own system, just in case.

We could also think about using VOLK to code up each of these methods and let the VOLK profiler take care of finding the right solution. I'm against this right now only in that we don't have good measurements or even concerns about the approximations, like with the LUT method. We are not only trading off speed, but also a huge amount of accuracy. In my application, I wasn't concerned about that level of accuracy, but VOLK will and should for other applications.

Finally, one thing that I didn't test much is compiler flags. The tests above used -O3 for everything, but the only other setting I tested was using the new -Ofast, which also applies the -ffast-math. I found that some versions of the algorithm improved in speed by about 10ms and others reduced in speed by about 10ms. So it didn't do a whole lot for us in this instance. There may, of course, be other flags that we could look into for more gains. Similar mixed results were found using Clang's C++ compiler.

I'd like to thank Brian Padalino, Macus Leech, Tim O'Shea, and Rick Farina for chatting about this on IRC and poking me with new and different things to try.

Wednesday
Jul232014

Some Data with Almost No Insights

I just pushed to our GNU Radio master branch an update that allows us to set a few new build types. With our work in optimization, profiling, VOLK SIMD work, etc. etc., I added new build types to make setting up different testing scenarios a little easier by specifying the CMAKE_BUILD_TYPE.

  • NoOptWithASM: sets '-g -O0 -save-temps' to add debug symbols, do no compiler optimization, and save all temporary build files, including the assembly .s files. This allows us to study the assembly output when using VOLK with SIMD intrinsics to make sure we're doing "the right thing."
  • O2WithASM: sets '-g -O2 -save-temps' to also keep the assembly files but do some compiler optimization.
  • O3WithASM: sets '-g -O3 -save-temps' to add the next stage of compiler optimization.

There are the cmake default build types that are always available: None, Debug, Release, RelWithDebInfo, and MinSizeRel.

These are just some build profiles you can use to set up the environments easily for different scenarios you might want to test. Just as an example, I used the new build types to see how optimization affected the overall time of the ctest:

  • NoOptWithASM: 124.97 sec
  • O2WithASM: 115.41 sec
  • O3WithASM: 111.59 sec

So each stage of optimization shaves off some time. Of course, there's a lot of time taken here in the setup and tear down of each test. These numbers let us know that, sure, optimization is better, but why would anyone be surprised? Hopefully, no one. I just used this as a way to test that each build profile worked right and did what it was supposed to. I figured I might as well put the numbers out there, anyways, since I had them.

Wednesday
Apr232014

Peer Review of a DySPAN Paper

One of the technology papers at DySPAN that caught my attention was called "Reconfigurable Wireless Platforms for Spectrally Agile Coexistence" by Rohan Grover, Samuel J. MacMullan, and Alex Wyglinski. Their interest is in providing OFDM waveforms with subcarriers cut out in such a way that the resulting spectrum hole is still deep enough to allow for another radio to use it. Normally, just nulling out subcarriers leaves a lot of power from the side-lobes of the other carriers. So what they suggested instead was the use of IIR filters to provide cheap, sharp notches at these nulled-out subcarriers. The paper explains the development of the IIR filters, in which they have a subset of pre-built and stable filters to meet their performance requirements. They select the a set of filters to use and combine them to provide band notching. Read the paper for more details about what, why, and how.

My interest was to see if this scheme would really work and how well. I figured that this would be relatively easy to replicate in GNU Radio, so I went to work. The main problem that I had was that we don't focus on IIR filters in GNU Radio. IIR filters provide too much phase distortion and the lack of SIMD versions of the filters plus the fact that FIR filters are easy to pipeline and implement with FFTs means that we get very good performance and filtering just using FIR filters in software. However, for this, I was going to need an IIR filter that takes complex inputs and outputs, which we didn't have. GNU Radio only had a float in and float out IIR filter. So I went in and fixed this. We now have more IIR filter options for dealing with complex data types and taps. Aside from struggling with C++ templates (because we end up having to specialize pretty much everything), this wasn't that hard to do.

I then put together a simple simulation with our OFDM transmitter and receiver blocks. I put the IIR filter on the output of the transmitter and popped open our gr_filter_design tool. The paper doesn't give exact specs for the IIR filters except that they were trying to create a 12-tap filter, but now having the actual specs doesn't exactly matter here. So I designed my filter as an elliptic high pass filter with the end of the stop band at 0.1, the start of the pass band at 0.125, a max loss in the pass band of 0.1 dB, and an out-of-band attenuation of 100 dB. These frequency values are normalized to a sample rate of 1.

The 12-tap filter looks like this in frequency magnitude and phase:

It was the phase response of the IIR filter that first gave me pause as a way to go about filtering OFDM signals since it would distort the phase of the subcarriers throughout. I then realized that the OFDM's equalizer should take care of that, no problem, but I still wanted to test the idea.

The paper puts an IIR filter on both the output of the transmitter and the input of the receiver, both to block transmitted signals in that band to minimize the interference caused to other uses as well as to filter out external signals being generated. I just put this on the output of my transmitter to test the basic concept. Here's the flowgraph I used for testing.

Notice here that I include a fading model as well as our hardware impairments model. Paul Sutton wanted to know what would happen to the filtered signal once it passed through real radio hardware -- would the IIR filter really still have the same effect? Below is the PSD of the signal at three different stages. The blue is the output of the original OFDM signal with 128 subcarriers with subcarriers -10 to +10 turned off. The red line shows the output after the IIR filter, where we can see it notching out those middle 20 subcarriers dramatically. And the magenta is after going through the hardware impairments model doing the minimal amount of signal distortion possible. We can see that even with a near perfect hardware model that the middle subcarriers are no longer as suppressed as they originally were.

So that's really our best case scenario when dealing with a real radio. Next, I turned up the IIP3, IIP2, and phase noise distortions just a bit (we'd have to calculate what "a bit" really means for a real hardware system; right now, we just have ratio values to adjust and play with). This brings the out-of-band (OOB) emissions level back up near to the original. But notice that we are still ahead of the game, at least, and our receiver is receiving the packets just fine.

I then added the channel model with some noise and a Rayleigh fading model. Here we can see that the noise is dominating in this case, but again, my receiver showed that it was still receiving packets.

So conceptually, this is certainly possible and should provide some measure of performance improvement. Given the results shown here, it's not much of a leap to think about the IIR filter being applied to the receiver, which would cause a huge notch in any received signal in those frequencies. So from the point of view of the receiver, we can use this to avoid receiving interference on those subcarriers. With the hardware impairments model, we'd need better translation of the values used to real-world radios. So let's take a look at this with a real radio.

Over-the-Air

I took the same OFDM signal and transmitted it using a USRP N210 with a WBX daughterboard. I'm using a quiet piece of spectrum near me around 900 MHz and kept the transmission power low to avoid any transmission non-linearities. Without the IIR filter, this is what I received using another USRP N210 wth WBX:

Now here we have the same setup only with the added IIR high pass filter:

I have to say that that is much better than expected. We basically brought the signal down to near the noise floor. We have a DC term that's generated in the receiver, but that's to be expected and wouldn't interfere with another signal as is the purpose of this idea.

Finally, becaues of the success above, I decided to put another IIR filter on to the signal, but this time as a low pass filter to get rid fo the high OOB signals on the outside of this spectrum. Again I used the gr_filter_design tool to create an elliptic IIR low pass filter with the end of the pass band at 0.80, the start of the stop band at 0.90, a 0.1 dB max pass band loss, and a 100 dB out of band attenuation. This produced a 9-tap filter that I put in-line with the other filter on the OFDM transmitter. The resulting spectrum provides quite a bit of OOB suppression:

Wrap-up

This was a fun little project, and I was pleased that I could so easily take a paper and reproduce it in GNU Radio to prove the basics. It looks like the idea presented here should provide some good OOB suppression and produce more usable spectrum around otherwise hostile OFDM signals.

The use of the hardware impairments model was a nice way to see how different radio problems could affect the concept here, too. Without accounting for these effects, the simulations are not entirely meaningful, and we can then see how much it will take when building a radio to meet the specifications to cancel out effects of the filtering stages. On the other hand, the WBX daughterboard with an Ettus USRP N210 showed great performance with this signal and the added filters. I was able to free up quite a lot of spectrum by filtering at the transmitter using these radios. Perhaps lesser radios wouldn't have behaved so well, but that's for another time.

 

 

Thursday
Feb272014

To Use or Not to Use FFT Filters

I've talked in various presentations about the merits of fast convolution, which we implement in GNU Radio as the fft_filter. When you have enough taps in your filter, and this is architecture dependent, it is computationally cheaper to use the fft_filter over the normal fir_filters. The cross-over point tends to be somewhere between 10 and 30 taps depending on your machine. On my AVX-enabled system, it's down around 10 taps.

However, Sylvain Munaut pointed out decreasing performance of the FFT filters over normal FIR filters when decimating a high rates. The cause was pretty obvious. In the FIR filter, we use a polyphase implementation where we downsample the input before filtering. However, in the FFT filter's overlap-and-save algorithm, we filter the input first and then downsample on the output, which means we're always running the FFT filter at full rate regardless of how much or little data we're actually getting out of it.

GNU Radio also has a pfb_decimator block that works as a down-sampling filter and also does channel selection. Like the FIR filter, this uses the concept of polyphase filtering and has the same efficiencies from that perspective. The difference is that the FIR filter will only give you the baseband channel out while this PFB filter allows us to select any one of the Nyquist zone channels to extract. It does so by multiplying each arm of the filterbank by a complex exponential that constructively sums all of the aliases from our desired channel together while destructively cancelling the rest.

After the discussion about the FIR vs. FFT implementation, I went into the guts of the PFB decimating filter to work on two things. First, the internal filters in the filterbank could be done using either normal FIR filter kernels or FFT filter kernels. Likewise, the complex exponential rotation can be realized by simply multiplying each channel with a complex number and summing the results, or it could be accomplished using an FFT. I wanted to know which implementations were better.

Typically with these things, like the cross-over point in the number of taps between a FIR and FFT filter, there are going to be certain situations where different methods perform better. So I outfitted the PFB decimating filter with the ability to select which fitler and which rotation structures to use. You pass these in as flags to the constructor of the block as:

  • False, False: FIR filters with complex exponential rotation
  • True, False: FIR filters with the FFT rotator
  • False, True: FFT filters with the exponential rotator
  • True, True: FFT filters with the FFT rotator

This means we get to pick the best combination of methods depending on whatever influences we might have on how each performs. Typically, given an architecture, we'll have to play with this to understand the trade-offs based on the amount of decimation and size of the filters.

I created a script that uses our Performance Counters to give us the total time spent in the work function of each of these filters given the same input data and taps. It runs through a large number of situations for different number of channels (or decimation) and different number of taps per channel (the total filter size is really the taps len times the number of channels). Here I'll show just a handful of results to give an idea what the trade-off space looks like for the given processor I tested on (Intel i7-2620M @ 2.7 GHz, dual core with hyper threading; 8 GB DDR3 RAM). This used GNU Radio 3.7.3 (not released, yet) with GCC 4.8.1 using the build type RelWithDebInfo (release mode for full optimization that also includes debug symbols).

Here are a few select graphs from the data I collected for various numbers of channels and filter sizes. Note that the FFT filter is not always represented. For some reason that I haven't nailed down, yet, the timing information for the FFT filters was bogus for large filters, so I removed it entirely. Yet, I was able to independently test the FFT filters for different situations like those here and they performed fine; not sure why the timing was failing in these tests.

We see that the FIR filters and FFT filters almost always win out, but they are doing far fewer operations. The PFB decimator is going through the rotation stage, so of course it will never be as fast as the normal FIR filter. But within the space of the PFB decimating filters, we see that generally the FFT filter version is better while the selection between the exponential rotator and FFT rotator is not as clear-cut. Sometimes one is better than the other, which I am assuming is due to different performance levels of the FFT for a given number of channels. You can see the full data set here in OpenOffice format.

Filtering and Channelizing

A second script looks at a more interesting scenario where the PFB decimator might be useful over the FIR filter. Here, instead of just taking the baseband channel, we use the ability of the PFB decimator to select any given channel. To duplicate this result, the input to the FIR filter must first be shifted in frequency to baseband the correct channel and then filtered. To do this, we add a signal generator and complex multiply block to handle the frequency shift, so the resulting time value displayed here is the sum of the time spent in each of those blocks. The same is true for the FFT filters.

Finally, we add another block to the experiment. We have a freq_xlating_fir_filter that does the frequency translation, filtering, and decimation all in one block. So we can compare all these methods to see how each stacks up.

What this tells us is that the standard method of down shifting a signal and filtering it is not the optimal choice. However, the best selection of filter technique really depends on the number of channels (e.g., the decimation factor) and the number of taps in the filter. For large channels and taps, the FFT/FFT version of the PFB decimating filter is the best here, but there are times when the frequency xlating filter is really the best choice. Here is the full data set for the channelizing experiments.

One question that came to mind after collecting the data and looking at it is what optimizations FFTW might have in it. I know it does a lot of SIMD optimization, but I also remember a time when the default binary install (via Ubuntu apt-get) did not take advantage of AVX processors. Instead, I would have to recompile FFTW with AVX turned on, which might also make a difference since many of the blocks in GNU Radio use VOLK for SIMD optimization, including AVX that my processor supports. That might change things somewhat. But the important thing to look at here are not absolute numbers but general trends and trying to get a feeling for what's the best for your given scenario and hardware. Because these can change, I provided the scripts in this post so that anyone else can use them to experiment with, too.

 

Thursday
Feb272014

Working with GRC Busports

Busports are a fairly new addition to the GNU Radio Companion (GRC). They allow us to group block ports together to make connecting many ports together much easier. While most flowgraphs we work with don't require this, we are moving to more complex structures that require handling many channels of data, which can get graphically tricky in GRC. Enter busports.

This post walks through two setups using busports to help explain how to work with them and a few things to look out for. Also, some of these concepts are brand new to GNU Radio and will not be made available until version 3.7.3.

Connecting Null Sources and Sinks

Many of the cases we'll come to involve the need to sink a number of channels or outputs to null sinks so that we can ignore those while focusing on the important ones. Previously, we would have to have dropped many null sinks into the flowgraph and connect each line individually. Well, I have now outfitted the null sinks with the ability to sink multiple data streams and to be able to control the bus ports connections in GRC.

By default, if we have a block with multiple ports on one side or the other, we can toggle busports that group all ports into a single bus (right-click on the block and use either "Toggle Source Bus" or "Toggle Sink Bus" for whichever version is right for the block). For example, if our null sink has three sink ports, we toggle the sink bus on, which looks like this:

However, for the null_sink and null_source blocks, I have instrumented the ability to selectively break up the bus ports to arbitrary busses. Let's take the example of a source block that has 10 source ports with 4 output groupings: Ports 0-2, 3-5, 6-7, and 8-9. We handle these groupings by specifying the "Bus Connection" parameter of the null source block.

The Bus Connections parameter is a vector of vectors. Underneath, it is translated using the XML tag "<bus_structure_source> that I put into the block's XML file. Again, it is a list of lists. Each element of the outer list allows us to specify which ports are connected to that one source port. The internal lists are the list of ports to connect. Given our specification above for our 4 groupings of the 10 ports, we would use:

Bus Connections: [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]

Now, when we toggle Toggle Source Bus on for this block, it will have 4 bus ports.

Let's now connect three null sinks to these four ports. The first two sinks will each connect to one bus port and the third null sink will sink the last two bus ports. For the first two null sinks, we only have to specify the number of input ports and the bus connections is simply "[[0,1,2],]" or alternatively "[range(3),]". The third null sink takes in 4 connections  in 2 busports, so the bus connections parameter is slightly more complicated as "[[0,1], [2,3]]". This creates two input ports that each take 2 connections. We then toggle the sink bus for each of the null sinks and create a graph that looks like this:

Obviously, this flowgraph doesn't do anything really interesting, but I think it useful to understand how to work with busport connections. Notice the numbering tells us which bus port it is and how many internal connections each bus has. When connecting busports together, GRC will check the cardinality and only connect ports that have the same number of connections. So we couldn't, for instance, connect Bus0 of the source to bus0 of the third null sink.

WARNING: There is a bug in GRC that can really screw up a flowgraph if we try and change the bus connections parameter when a connection already exists. Until we fix this, I would highly recommend that you disconnect the bus connection before making any modifications to the number of ports or the busport connections. If you forget and your canvas all of a sudden goes blank, DO NOT SAVE and instead just close GRC and reopen it.

Grab the flowgraph example here (remember, this will require GNU Radio 3.7.3 or later to run).

Using Busports with a Channelizer

The null sinks and sources are instructive but don't actually do anything. So I've made a more complex example that channelizes five sine waves of different frequencies. The flowgraph looks like this:

The signal generators from top to bottom generate sine waves with frequencies:

  • 1 kHz
  • 22 kHz
  • 44 kHz
  • -23 kHz
  • -45 kHz

These are added together with a sample rate of 100 kHz (so we have a spectrum from -50 to +50 kHz). Since we're splitting this spectrum into 5 equal sections, we have the following channels: 

  • Channel 0: -10 to 10 kHz
  • Channel 1: 10 to 30 kHz
  • Channel 2: 30 to 50 kHz
  • Channel 3: -50 to -30 kHz
  • Channel 4: -30 to -10 kHz

What that means is that when we channelize them, the signals in these bandwidths are moved to baseband. So we get output signals at 1, 2, 4, -5, and -3 kHz on the output channels.

The flowgraph shows us using two channelizers. The first one on top sends all five channels to a single frequency sink to display all the baseband channels together. We use busports to keep the connection in the GRC canvas clean. The second channelizers says we only care about 3 of the 5 channels, so we'll split the busports output into 2 and send channels 0, 2, and 4 to the plotter and channels 1 and 3 to a null sink to ignore them. The busports connection for this channelizer looks like:

Bus Connections: [[0,2,4], [1,3]]

So in the output of the first channelizers, we'll see a single sine wave at the already specified frequencies on each channel. The output of the second channelizer will only show three signals with frequencies 1, 4, and -3 kHz. In the following figure showing the output, the top display is the input to the channelizer, the bottom left is the first channelizer with all 5 channels connected, and the bottom right is the second channelizer with just the 3 connections.

You can get the example script here.

 Wrap-up

I hope this gives some better ideas how to work with the new busports features in GRC. I didn't really want to go overboard with a huge number of connections just to show them off, but these examples should give you some understanding about where we would want to use busports in the future. And again, be careful updating the bus connections or number of ports when the connections already exist.