Wednesday, May 18, 2011

A different approach to analyzing software flow.

Analyzing how software runs is an everyday task, whether done actively or passively. There are many ways to do it; here are the most common ones:
  • Logging: Nice for analyzing problems when the software runs in a place that is not directly accessible with development tools. Mostly used to observe components over a long time. The output is useful to the developer and, to a limited extent, to the user of the software. It is not usable in time-critical parts, as logging has an impact on the runtime, and it produces large data collections that must (usually) be analyzed by humans.
  • Debugging: Primarily intended to find bugs, but I use it every day to check whether my implementations work as expected. It does not require additional source code to get insight into the component under test. Like logging, it is not a good tool for analyzing e.g. time-critical loops or longer sequences of code.
  • Profiling: Usually a late step and a really nice tool to find the bottlenecks inside the software. But it has trouble with problems like the one I will present below.
The idea for the approach I present here was born when one of my work projects required me to measure the performance of a data transfer in relation to a group of buffers that acted as the source of the transfer. To make things a bit more complicated, several threads were part of this process and everything ran in a higher speed domain (multiple megabytes per second). Logging was not possible in this scenario (too much output and too much impact on performance) and debugging was impossible, so profiling was the only option. But profilers usually average the runtime of function calls over time, which is fine if the runtime is not affected by parameters, data or global state. If the runtime varies, however, it would be nice to see how it differs in comparison to e.g. buffer levels. As I work in the hardware development domain, I know how useful logic analyzers can be.
I started by writing a little class that collects floating point values, together with the time of sampling, in different buffers. The output was then formatted to act as input for gnuplot, to create diagrams that show time on the X axis and the various data lines on the Y axis. This was nice at the beginning, but as the data collections started to get larger, it was not what I wanted.
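In essence, that first gnuplot-based version boiled down to something like the following sketch (a simplified illustration only; the real collector handles multiple buffers and is thread-safe, as described below):

  #include <fstream>
  #include <string>
  #include <vector>
  #include <boost/date_time/posix_time/posix_time.hpp>

  // One buffer of (timestamp, value) pairs per measured quantity.
  class SampleBuffer
  {
  public:
    explicit SampleBuffer(const std::string& name) : m_name(name) {}

    // Store a value together with the current time.
    void AddSample(float value)
    {
      Sample s;
      s.time = boost::posix_time::microsec_clock::universal_time();
      s.value = value;
      m_samples.push_back(s);
    }

    // Write "seconds value" rows that gnuplot can plot directly, e.g.
    //   plot "level.dat" using 1:2 with lines
    void WriteGnuplotData(const std::string& fileName,
                          const boost::posix_time::ptime& start) const
    {
      std::ofstream out(fileName.c_str());
      for (std::size_t i = 0; i < m_samples.size(); ++i)
      {
        const double seconds =
          (m_samples[i].time - start).total_microseconds() / 1e6;
        out << seconds << ' ' << m_samples[i].value << '\n';
      }
    }

  private:
    struct Sample
    {
      boost::posix_time::ptime time;
      float value;
    };

    std::string m_name;
    std::vector<Sample> m_samples;
  };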
In the end, I wrote an application that shows the collected data interactively, which allows more options, e.g. showing the exact time difference between two samples. As I'm not really creative at naming things, I called it TimeLineVisualizer.

To give a better understanding of its usefulness, I've created two different examples. The first one is quite simple: it shows how the runtime of std::vector's push_back() method behaves compared to its size() and capacity(). For those who are not familiar with it, std::vector is a dynamically sized data container, which increases its internal buffer at specific sizes. More details here.
Similar to logging, the code must be instrumented with some inline functions. This is not as convenient as the automatic instrumentation that profilers use, but it has one positive aspect: the developer has full control over how and what data is measured. Here's the code for this example:
 1 #define ENABLE_TLV_ANALYZER
 2 #include <tlvanalyzer.h>
 3 
 4 #include <vector>
 5 using namespace std;
 6 
 7 int main()
 8 {
 9   const size_t MaxInsertions = 100000;
10 
11   size_t _insert_time_id = _AddSampleBuffer("Insert time");
12   size_t _level_id = _AddSampleBuffer("Level", false, 
13     static_cast<float>(MaxInsertions));
14   size_t _capacity_id = _AddSampleBuffer("Capacity", true);
15 
16   vector<int> vec;
17   for (size_t i = 0; i < MaxInsertions; ++i)
18   {
19     _AddFlank(_insert_time_id, true);
20     vec.push_back(0);
21     _AddFlank(_insert_time_id, false);
22 
23     _AddSample(_level_id, static_cast<float>(vec.size()));
24     _AddSample(_capacity_id, static_cast<float>(vec.capacity()));
25   }
26 
27   return 0;
28 }
First of all, the define ENABLE_TLV_ANALYZER must be placed before including tlvanalyzer.h. Commenting out this define will disable the sample collection (all calls are then defined as empty inline functions, which compilers remove, at least in optimized builds).
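The pattern behind such a compile-time switch looks roughly like the following sketch (simplified, not the exact contents of tlvanalyzer.h; the default arguments are inferred from the example above):

  #ifdef ENABLE_TLV_ANALYZER

  // Real implementations, provided by the collector's source file.
  size_t _AddSampleBuffer(const char* name,
                          bool autoScale = true,
                          float maxValue = 0.0f);
  void _AddSample(size_t bufferId, float value);
  void _AddFlank(size_t bufferId, bool up);

  #else

  // Empty inline stubs; optimizing compilers remove the calls entirely.
  inline size_t _AddSampleBuffer(const char*, bool = true, float = 0.0f)
  { return 0; }
  inline void _AddSample(size_t, float) {}
  inline void _AddFlank(size_t, bool) {}

  #endif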
Each sample line is created using _AddSampleBuffer(). The returned id is then used in all subsequent calls to add samples to the buffer. The parameters are (a small sketch of the resulting scaling follows the list):
  • name: A descriptive name that is used when displaying the data.
  • autoScale: If true, the output is scaled between the smallest and largest value sampled into this buffer.
  • maxValue: If autoScale is disabled, data is scaled between 0 and this value. Useful to display e.g. the level of a statically sized buffer.
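To make the effect of the two parameters concrete, the mapping of a raw sample into its display band can be pictured like this (a simplified illustration of the scaling, not code taken from the viewer):

  // How a sample ends up in the 0..1 display band of its line
  // (illustrative only, not the viewer's actual code).
  float NormalizeSample(float value, bool autoScale,
                        float minSeen, float maxSeen, float maxValue)
  {
    if (autoScale)
    {
      // Scale between the smallest and largest value seen in this buffer.
      const float range = maxSeen - minSeen;
      return range > 0.0f ? (value - minSeen) / range : 0.0f;
    }

    // Fixed scale between 0 and the configured maximum, e.g. the level
    // of a buffer whose size is known in advance.
    return maxValue > 0.0f ? value / maxValue : 0.0f;
  }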
There are (at the moment) two ways to add a sample to a buffer. _AddFlank() acts a bit like a digital line in a logic analyzer: as the name implies, it inserts an up or down flank. _AddSample() lets the user insert any float value, which can be used to visualize data in various ways. In the example above, we use the flank version to show the runtime of push_back() and the sample version to record size() and capacity() of the vector.
The collected data is dumped to an XML file when the application finishes. It can then be loaded in TimeLineVisualizer, which shows the data a little bit like a logic analyzer would and allows the data to be analyzed interactively, e.g. by zooming, scrolling and inspecting samples by placing the mouse over them:


We can now see three sample lines. The one at the top is the capacity. Compared to the second line, the size of the vector, it is clear that the capacity does not grow linearly: it is increased from time to time, in ever larger steps. The last line is just a blue bar, the runtime of the push_back() calls. As we have measured 100000 calls, the data density is very high here, with 200000 flanks visible. To inspect this line we have to zoom in, which is easily done by holding the Ctrl key and rotating the mouse wheel. Using the mouse wheel without holding any key pans the view horizontally; holding Shift while panning takes larger steps. The slider at the bottom can be used to scroll as well.

The next image shows a zoomed area; the position is at the last time the vector increases its capacity:


We can now see that the last sample line contains usable data as well. Data insertions are really fast, except at the moment the vector does an internal resize of its capacity (buffer allocation, copy and deallocation of the old buffer). The light rectangle in the upper part highlights the time segment between two calls to _AddFlank() (lines 19 and 21 in the code above). Detailed information about this area can be found in the lower right corner. It gives us two important pieces of information:
  • The vector increases its capacity from 92170 to 138255 entries (a growth factor of exactly 1.5 in this STL implementation).
  • This took 781 microseconds.
Isn't that really useful? To me, it is.
This is quite a simple example, and the average developer does not normally inspect the runtime of library functions. But it shows how powerful it is to see data in relation to runtime.

By the way: the sample collector is thread-safe, so we can analyze the runtime of multiple threads and see whether they handle data in the intended way. Here's another example that outlines this:

Assume that we have a work queue that is fed with data from some asynchronous source. Each data element must be handled in a time-consuming process. We use multiple threads that work on this queue (aka worker threads), which is a sensible approach now that most computers have multiple cores. Each worker thread does the following in an endless loop:
  • Check whether the _exit flag is set; terminate (break) in this case.
  • Check whether there is at least one object to handle (inside the _objects container); if so, take it and handle it.
  • If there is nothing to handle, wait 1 millisecond and try again (so as not to eat up the whole computing power of a single core).
 1 /// Thread body of workers.
 2 void operator()(int threadId)
 3 {
 4   ostringstream ss;
 5   ss << "Thread #" << threadId;
 6   size_t _thread_specific_id = _AddSampleBuffer(ss.str().c_str());
 7 
 8   {
 9     lock_guard<mutex> l(_mutex);
10     cout << ss.str() << " started." << endl;
11   }
12 
13   for (;;)
14   {
15     bool nothingToDo = true;
16 
17     {
18       lock_guard<mutex> l(_mutex);
19 
20       // Check for exit
21       if (_exit) break;
22 
23       // Are there objects to handle ?
24       if (_objects.size())
25       {
26         // Yes, get one and remove it.
27         boost::shared_ptr<SampleObject> object = _objects.front();
28         _objects.pop_front();
29 
30         // Update count of active threads, add it to analyze.
31         ++_threadsActive;
32         _AddSample(_threads_active_id, 
33           static_cast<float>(_threadsActive));
34 
35         // Handle the object.
36         _AddFlank(_thread_specific_id, true);
37         HandleSampleObject(object);
38         _AddFlank(_thread_specific_id, false);
39 
40         // Remove from active.
41         --_threadsActive;
42 
43         // Prevent sleep at the end of this loop.
44         nothingToDo = false;
45       }
46     }
47 
48     /// If there's nothing to handle, wait 1 ms to free CPU.
49     if (nothingToDo)
50       this_thread::sleep(
51         microsec_clock::universal_time() + milliseconds(1));
52   }
53 }
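For completeness: the snippet uses _mutex, _objects, _exit, _threadsActive and _threads_active_id without showing their definitions. They are shared state of the worker class, roughly along these lines (a simplified sketch with an assumed buffer name; the complete example is part of the download at the end of the post):

  #define ENABLE_TLV_ANALYZER
  #include <tlvanalyzer.h>

  #include <deque>
  #include <boost/shared_ptr.hpp>
  #include <boost/thread.hpp>

  class SampleObject;

  // Functor class whose operator()(int) is the thread body shown above.
  // All workers share the queue, the exit flag and the "active threads"
  // sample buffer through this one object.
  class Workers
  {
  public:
    Workers()
      : _exit(false)
      , _threadsActive(0)
      , _threads_active_id(_AddSampleBuffer("Active threads")) // name assumed
    {
    }

    void operator()(int threadId);   // thread body, see the listing above

  private:
    boost::mutex _mutex;
    std::deque<boost::shared_ptr<SampleObject> > _objects;
    bool _exit;
    int _threadsActive;
    size_t _threads_active_id;       // sample line for the active thread count
  };

The threads themselves would then be started with something like boost::thread(boost::ref(workers), i), so that all of them operate on the same shared instance instead of a copy.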
OK, maybe not the best way to implement it, but it should clarify what information we get from the instrumentation. Each thread gets its own sample line to show when it is handling an object. Besides that, we sample how many threads are active in parallel. The result will look like this:


Some attentive readers may already have spotted the problem in the code above: a typical scenario where one thread blocks the others, because the scopes have been placed wrongly (the mutex stays locked during data handling). There is always only one active thread, which can be seen in the thread count sample line as well as in the runtime of each thread.
It may be easy to find the issue in this little example using a debugger, logging or anything else, but in complex applications it has to be measured somehow. In this case, my approach shows it clearly and simply.

OK, let's try to fix this:
 1 /// Thread body of workers.
 2 void operator()(int threadId)
 3 {
 4   ostringstream ss;
 5   ss << "Thread #" << threadId;
 6   size_t _thread_specific_id = _AddSampleBuffer(ss.str().c_str());
 7 
 8   {
 9     lock_guard<mutex> l(_mutex);
10     cout << ss.str() << " started." << endl;
11   }
12 
13   for (;;)
14   {
15     boost::shared_ptr<SampleObject> object;
16 
17     {
18       lock_guard<mutex> l(_mutex);
19 
20       // Check for exit
21       if (_exit) break;
22 
23       // Are there objects to handle ?
24       if (_objects.size())
25       {
26         // Yes, get one and remove it.
27         object.swap(_objects.front());
28         _objects.pop_front();
29       }
30     }
31 
32     if (0 != object.get())
33     {
34       // Update count of active threads and add it to analyze.
35       {
36         lock_guard<mutex> l(_mutex);
37         ++_threadsActive;
38         _AddSample(_threads_active_id, 
39           static_cast<float>(_threadsActive));
40       }
41 
42       // Handle the object.
43       _AddFlank(_thread_specific_id, true);
44       HandleSampleObject(object);
45       _AddFlank(_thread_specific_id, false);
46 
47       // Remove from active.
48       {
49         lock_guard<mutex> l(_mutex);
50         --_threadsActive;
51       }
52     }
53     else
54       this_thread::sleep(
55         microsec_clock::universal_time() + milliseconds(1));
56   }
57 }
The resulting measurement may look like this (it can vary a bit from run to run):


It shows that all threads are used and the work is evenly distributed, just as we expect!

License: The examples and the analyzer/collector are released under the BSD license. The TimeLineVisualizer application is closed source and free for now; I'm not sure how I will distribute it in the future.

Everything can be downloaded here: TimeLineVisualizer1.0.0.0.zip.

Enjoy it,
hobBIT

4 comments:

  1. This comment has been removed by the author.

  2. Hi hobBIT,
    Thank you for a good posting.
    Can I ask which source code formatter you use in your posts?
    I'm using SyntaxHighlighter, but I don't know how to adjust its font size; your source code looks so clear.
    Thank you.

  3. Hi,

    thanks, it seems you are one of the first people to have discovered my little blog :-) ... I use this tool for my source code: http://www.andre-simon.de/doku/highlight/en/highlight_screenshots.html. It has many options and is free. If I remember correctly, I chose HTML output, but I had to remove the html and body tag pairs and the header section before the blog accepted the code. In addition, I think I chose to embed the CSS.

    Just play with it, you'll find the settings you like,

    best regards

  4. Thanks. I've just downloaded it.
    It shows an error message while compiling the GUI code on my MacBook. I think it's because of the configuration of my Qt library. I'll look at the detailed error message and get it to compile on my Mac.
    Thank you again.
