Sunday
Sep 15, 2013

Explaining the GNU Radio Scheduler

The most important and distinguishing piece of GNU Radio is the underlying scheduler and framework for implementing and connecting the signal processing blocks. While there are many other libraries of DSP blocks available, it's GNU Radio's scheduler that provides a common, fast, and robust platform to research, develop, and share new ideas. On the other hand, the scheduler is the most mysterious and complicated part of the GNU Radio code base. We often warn when people start talking about modifying the scheduler that, to use the cliche, "There be dragons!" Since the scheduler is so complicated and any change (and therefore bug) introduced to it will affect everything else, we have to be careful with what we do inside of it.

Still, we may want to add new features, improvements, or changes to the scheduler, and to do so, we need to understand the entire scheduler to make sure that our changes fit and don't cause problems elsewhere. The scheduler has a number of responsibilities, and within each responsibility there are checks, balances, and performance issues to consider. But we've never really documented the code, and only a few people have gone in and really analyzed and understood the scheduler. So I've put together this presentation to break the scheduler down into its pieces and help improve the overall understanding of what goes on inside. Hopefully, this will demystify it to some extent.

Overview of the GNU Radio Scheduler

Tuesday
Jul 02, 2013

London Meetup

I'll be in London next week for a summer school event on Cognitive Radio at King's College, so I thought I would take the time to get to know some local GNU Radio users over there.

The meetup will occur on July 9 (next Tuesday from when I'm posting this) at 7pm.

I haven't picked out a location just yet, but I am zeroing in on a couple of nice-looking pubs. Partly, this will depend on how many people we get to attend.

 

Wednesday
Jun 12, 2013

Nearly 50 Minutes of Volk!

I want to announce that a slide show plus audio of me talking about our Volk library has been published by the IEEE Signal Processing Society here:

http://www.brainshark.com/SPS/vu?pi=zH8zQcV8dzAXPbz0

This is based on a paper and presentation to the Wireless Innovation Forum's annual Software Radio Conference. It goes over the motivation and background of Volk and into how to use it in your own projects. This should extend easily into using it with GNU Radio blocks (and there are plenty of examples of GNU Radio blocks using Volk, as well). It does not explain how to build new kernels in Volk, however. But at 50 minutes, it seemed like it was already going too long.

 

Wednesday
May 08, 2013

Performance Counter Performance

Most of my time these past few months has been taken up almost completely with preparations for releasing GNU Radio 3.7. It's a huge task and has needed a lot of work and careful scrutiny.

But I managed to find some time today to look into a question that's been bugging me for a bit now. A few months ago, we introduced Performance Counters into GNU Radio. These are a way of internally probing statistics about the running radio system. We currently define five PCs for GNU Radio blocks, and each time a block's work function is called, the instantaneous value, average, and variance of all five PCs are calculated and stored. The PCs are really useful for performance analysis of the radio, possibly even leading to an understanding of how to dynamically adjust the flowgraph to alleviate performance issues. But the question that keeps coming up is: what is the performance cost of calculating the performance counters?
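The actual counters are computed in C++ inside the block executor, but the kind of bookkeeping involved looks roughly like this sketch. The exponentially weighted update and the smoothing constant here are my own illustration of one common way to track a running average and variance, not GNU Radio's exact code:

```python
class PerfCounter:
    """Sketch of one performance-counter statistic: tracks the
    instantaneous value plus exponentially weighted estimates of
    the average and variance. Updated once per work() call."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha   # smoothing constant (illustrative value)
        self.inst = 0.0
        self.avg = 0.0
        self.var = 0.0

    def update(self, x):
        self.inst = x
        delta = x - self.avg
        self.avg += self.alpha * delta
        # EWMA of the squared deviation approximates the variance
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)


# Example: feed in three "work time" samples
pc = PerfCounter(alpha=0.5)
for v in [1.0, 2.0, 3.0]:
    pc.update(v)
print(pc.inst, pc.avg, pc.var)
```

The point of using an exponentially weighted form is that each update is a handful of multiplies and adds per counter, which is why the cost per work() call is so small to begin with.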

So I sat down to figure that out, along with a little side project I was also interested in. My methodology was simply to create a finite flowgraph that would take some time to process. I would then compare the run time of the flowgraph with the performance counters disabled at compile time, disabled at runtime, and enabled. Disabling at compile time uses #ifdef statements to remove the calls from the scheduler completely, so it's as if the PCs aren't there at all (in fact, they aren't). Disabling at runtime means that the PCs are compiled into GNU Radio but can be turned off with a configuration file or environment variable; in that case, a simple runtime if() check on the configuration value determines whether we calculate the counters or not.

The hypothesis here is that disabling at compile time is the baseline control where we add no extra computation. Compiling the PCs in but turning them off at runtime through the config file should take a small hit because of the extra time to perform the if check. Finally, actually computing the PCs should take the most time because of all of the calculations required for the three statistics of each of the five PCs for every block.

The flowgraph used is shown here and is a very simple script:

-------------------------------------------------------------------------------

from gnuradio import gr, blocks, filter
import scipy

def main():
    N = 1e9                           # total number of items to process
    taps = scipy.random.random(100)   # 100 arbitrary filter taps

    src = blocks.null_source(gr.sizeof_gr_complex)
    hed = blocks.head(gr.sizeof_gr_complex, int(N))
    op  = filter.fir_filter_ccf(1, taps)
    snk = blocks.null_sink(gr.sizeof_gr_complex)

    # Pin the FIR filter to its own core; the cheap blocks share core 6.
    op.set_processor_affinity([7,])
    src.set_processor_affinity([6,])
    hed.set_processor_affinity([6,])
    #snk.set_processor_affinity([6,])

    tb = gr.top_block()
    tb.connect(src, hed, op, snk)
    tb.run()

if __name__ == "__main__":
    main()

-------------------------------------------------------------------------------

It sets up a null source and sink with a FIR filter of 100 arbitrary taps between them. The flowgraph is run for N items (1 billion here). I then used the Linux time command to run and time the flowgraph, keeping the real time for each of 10 runs.


Run times in seconds (real time, 10 runs):

              PCs Disabled   PCs On    PCs Off   Affinity Off
              54.610         54.632    55.141    62.648
              54.350         54.655    54.451    61.206
              54.443         55.319    54.388    61.787
              54.337         54.718    54.348    62.355
              54.309         54.415    55.467    61.729
              55.055         54.318    54.314    61.526
              54.281         54.651    54.581    62.036
              54.935         54.487    55.226    61.788
              54.316         54.595    54.283    61.817
              54.956         54.785    54.593    62.255

Avg.          54.5592        54.6575   54.6792   61.9147
Min.          54.281         54.318    54.283    61.206
Max.          55.055         55.319    55.467    62.648

% Diff. Avg   0.000          0.180     0.220     13.482
% Diff. Min   0.000          0.068     0.004     12.758
% Diff. Max   0.000          0.480     0.748     13.792
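The percent-difference rows are just each column's statistic relative to the compile-time-disabled baseline; for example, the averages work out as:

```python
# Recompute the "% Diff. Avg" row from the column averages above.
baseline = 54.5592  # seconds, PCs disabled at compile time
averages = {"PCs On": 54.6575, "PCs Off": 54.6792, "Affinity Off": 61.9147}

for name, avg in averages.items():
    pct = 100.0 * (avg - baseline) / baseline
    print("%-12s %6.3f%%" % (name, pct))
```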

 

The results were just about as predicted, with one surprise, though a good one. As predicted, having the PCs compiled out of the scheduler was in fact the fastest. If we look only at the minimum run times of the 10, then turning the PCs off at runtime was the next fastest, and doing the computations was the slowest. But what's nice to see here is that we're talking much less than 1% difference in speed. Somewhat surprisingly, though, on average and looking at the max values, runtime disabling performed worse than enabling the PCs. I can only gather that there was some poor branch prediction going on here.

Whatever the reasons for all of this, the take-away is clear: the Performance Counters barely impact the performance of a flowgraph, at least for small graphs. But if we're calculating the PCs on all blocks all the time, what happens when we increase the number of blocks in the flowgraph? I'll address that in a second.

First, I wanted to look at what we can do with the thread affinity concept that was also recently introduced. I have an 8-core machine, and the source, sink, and head block take very little processing power. So I pushed all of those onto one core and gave the FIR filter its own core to use. All of the PC tests above were done with this thread affinity set. I then turned thread affinity off and reran the compile-time-disabled experiment. The results are fairly shocking: without affinity, the flowgraph runs around 13% slower. Simply forcing the OS to keep the threads locked to specific cores buys back that performance. We still need to study this more, such as what happens when we have more blocks than cores and how best to map the blocks to the cores, but the evidence of its benefits here is pretty exciting.

Now, back to the many-block problem. For this, I generate 20 FIR filters, each with different taps (so the taps can't all hide in cache). Because we'd then have more blocks than I have cores, I'm not going to study the thread affinity concept here; I'll save that for later. Also, I reduced N to 100 million to save a bit of time, since the results have been pretty consistent.


Run times in seconds (real time, 10 runs):

              PCs Disabled   PCs Off   PCs On
              29.670         29.868    29.725
              29.754         29.721    29.787
              29.702         29.735    29.800
              29.708         29.700    29.767
              29.592         29.932    30.023
              29.595         29.877    29.784
              29.564         29.964    29.873
              29.685         29.926    29.794
              29.719         29.973    29.902
              29.646         29.881    29.861

Avg.          29.664         29.858    29.832
Min.          29.564         29.700    29.725
Max.          29.754         29.973    30.023

% Diff. Avg   0.000          0.655     0.567
% Diff. Min   0.000          0.460     0.545
% Diff. Max   0.000          0.736     0.904

 

These results show us that even for a fairly complex flowgraph with over 20 blocks, the performance penalty is around half a percent on average, and close to one percent in the worst case.

My takeaway from this is that we are probably being a bit over-zealous by having both the compile-time and run-time configuration of the performance counters. It also looks as though the runtime toggle of the PCs almost hurts more than it helps. This might make me change the defaults so that the PCs are enabled at compile time instead of being disabled. Users who want to get that half a percent more efficiency in their graphs can go through the trouble of disabling them manually.

Sunday
Jan 13, 2013

DARPA Spectrum Challenge

Announced in the middle of last week, DARPA will be sponsoring a Spectrum Challenge. This is a huge opportunity for the radio field. For years, we have been researching issues of spectrum sharing (or Dynamic Spectrum Access (DSA)). While we've generated a lot of good ideas, we haven't yet seen these ideas take shape in real, provable systems. Too often, the work is a simulation or an experimental test bed with too many controlled parameters.

DARPA, through their Spectrum Challenge, has a chance to change this. Forcing teams to develop and compete against each other as well as other interfering radios means that we have to think about real, unexpected challenges to our ideas. We will have to develop both robust algorithms and robust systems. What will result will almost certainly be a large number of advances in the science, technology, and understanding of the coexistence of radios.

From my perspective as the maintainer of GNU Radio, this is a great opportunity for us, too. While DARPA is not mandating the use of GNU Radio, they are requiring that teams demonstrate competency in our software. And since the final challenge will be done using USRPs, I hope that many teams will continue to use GNU Radio as their platform of choice. As I said, the challenge will not only involve developing robust spectrum sharing algorithms, it will also demand robust platforms. GNU Radio is well-known, well-tested, and has an active, educated community of users, and so is a perfect platform to build upon.

As the head of GNU Radio, I will not be participating directly in the competition. I hope to be able to advise and help all teams as I am able, and I do not want to be biased by any stake I have. Personally, my stake in this competition is the advancement of the science and technology of DSA as well as the opportunity it provides for GNU Radio.

For more details of the challenge, visit DARPA's website.

Their Q&A page is a good, quick read over the main aspects of the challenge to get up to speed.