Wednesday, May 18, 2011

A different approach to analyze software flow.

Analyzing how software runs is an everyday task, either active or passive. There are many ways to do that, here are the most common one:
  • Logging: Nice to analyze problems when running software at a place that is not directly accessible using development tools. Most times used to analyze components over a long time. The output is useful to the developer and, to a limited extend, to the user of the software. It's not usable in time critical parts, as logging has an impact to the runtime plus this will produce heavy data collections that must be analyzed by humans (usually).
  • Debugging: Primary intended to find bugs, but I use it every day to check if my implementations work as expected. It does not require additional source code to have insight into the component under test. Like logging, it is not a good tool to analyze e.g. time critical loops or longer sequences of code.
  • Profiling: Usually a late step and a really nice tool to find the bottlenecks inside the software. But has issues to solve problems that I will present below.
The idea to the approach I will present here was born when one of my work projects requires that I have to measure the performance of a data transfer in contrast to a group of buffers that act as source to the transfer. To make things a bit more complicated, several threads were part of this process and all things run in a higher speed domain (multiple megabytes per second). Logging was not possible in this scenario, too much output and affecting to the performance, debugging impossible, so profiling was the only option. But profilers usually average the runtime of function calls over time, which is nice if the runtime is not affected from parameters, data or global elements. But if the runtime varies it would be nice to see how it differs in comparison to e.g. buffer levels. As I'm working in the hardware development domain I know how useful logic analyzers can be.
I started to write a little class that collects floating point values in combination to the time of sampling in different buffers. The output was than formatted to act as input for gnuplot, to create diagrams that show the runtime in X and the various data lines in the Y axis. This was nice at the beginning, but as data collections start to get larger, this was not what I wanted.
At the end, I wrote an application to show the collected data interactive, which allows more options, e.g. the exact time difference between two samples. As I'm not really creative naming things, I called it TimeLineVisualizer.

To give a better understanding of the usefulness, I've created 2 different examples. The first one is quit simple. It should show how a std::vector's push_back method acts in runtime compared to its size() and capacity(). For those who are not familiar with it, std::vector is a dynamic sized data container, which increases its internal buffer at specific sizes. More details here.
Similar to logging, the code must be instrumented with some inline functions. It is not just fine like automatic instrumentation that profilers use, but this has one positive aspect: The developer has full control how and what data is measured. Here's the code for this example:
 1 #define ENABLE_TLV_ANALYZER
 2 #include <tlvanalyzer.h>
 3 
 4 #include <vector>
 5 using namespace std;
 6 
 7 int main()
 8 {
 9   const size_t MaxInsertions = 100000;
10 
11   size_t _insert_time_id = _AddSampleBuffer("Insert time");
12   size_t _level_id = _AddSampleBuffer("Level", false, 
13     static_cast<float>(MaxInsertions));
14   size_t _capacity_id = _AddSampleBuffer("Capacity", true);
15 
16   vector<int> vec;
17   for (size_t i = 0; i < MaxInsertions; ++i)
18   {
19     _AddFlank(_insert_time_id, true);
20     vec.push_back(0);
21     _AddFlank(_insert_time_id, false);
22 
23     _AddSample(_level_id, static_cast<float>(vec.size()));
24     _AddSample(_capacity_id, static_cast<float>(vec.capacity()));
25   }
26 
27   return 0;
28 }
First of all, the define ENABLE_TLV_ANALYZER must be placed before including tlvanalyzer.h. Commenting out this define will disable the sample collection (all calls than are defined as empty inline functions, which compilers remove at least in optimized builds).
Each sample line is created using _AddSampleBuffer(). The returned id is than used in all subsequent calls to add samples to the buffer. Parameters are:
  • name: A descriptive name which is used in displaying the data.
  • autoScale: If true, the output is scaled between the smallest and largest value sampled to this buffer.
  • maxValue: If autoScale is disabled, data is scaled between 0 and this value. Useful to display e.g. the level of a static sized buffer.
There are two possible ways to add a sample to a buffer (at the moment). _AddFlank() will act a bit like a digital line in a logic analyzer, as the name implies, it inserts a up or down flank. _AddSample() let the user insert any float value, which can be used to visualize data in various ways. In the example above, we use the flank version to show the runtime of push_back() and the sample version to add size() and capacity() of the vector.
The collected data will be dumped to an XML file when the application has been finished. It can than be loaded in TimeLineVisualizer, which shows the data a little bit like logic analyzers will do and allows to analyze the data in an interactive way, e.g. zooming, scrolling and inspection of samples by placing the mouse above it:


We can now see three sample lines. The one at the top is the capacity. Compared to the second line, the size of the vector, it is clear that the capacity does not have the same linearity. It shows that the capacity is increased from time to time in always larger steps. The last line is just a blue bar, the runtime of the push_back() method call. As we have measured 100000 calls, data density is very high here, 200000 flanks are visible. To inspect this line, we have to zoom in, easily done holding the Ctrl-key and rotating the mouse wheel. Using the mouse wheel, without holding any key, pans the view in horizontal direction. Holding shift while panning will do larger steps. The slider at the bottom can be used to scroll as well.

The next image shows a zoomed area, the position is at the last time the vector increases it's capacity:


We can now see that the last sample line contains usable data as well. Data insertions are really fast except the moment the vector does an internal resize of its capacity (buffer allocation, copy and old buffer deallocation). The light rectangle in the upper part highlights a time segment between to calls to _AddFlank() (line 24 in the above code). Detailed information to this area can be found in the right lower area. It gives use two important informations:
  • The vector increases its capacity from 92170 entries to 138255.
  • This took 781 microseconds.
Isn't that really useful ? To me, it is.
This is quite a simple example and the common developer does not inspect the runtime of library functions. But it shows the power of the possibility to see data in contrast to runtime.

By the way: The sample collector is thread safe, so we can analyze the runtime of multiple threads and see if they handle data in the designated way. Here's another example that outlines this:

Assume that we've a working queue that is fed with data from any asynchronous source. Each data element must be handled in a time consuming process. We use multiple threads that work on this queue (aka worker threads), which is a useful way in times were most computers have multiple cores. Each worker thread does the following in an endless loop:
  • Check if the _exit flag is set, terminate (break) in this case.
  • Check if there's at least one object to handle (inside the _objects container), if so, get and handle it.
  • If nothing to handle is available, wait 1 millisecond and try again (to do not eat up the whole computing power of a single core).
 1 /// Thread body of workers.
 2 void operator()(int threadId)
 3 {
 4   ostringstream ss;
 5   ss << "Thread #" << threadId;
 6   size_t _thread_specific_id = _AddSampleBuffer(ss.str().c_str());
 7 
 8   {
 9     lock_guard<mutex> l(_mutex);
10     cout << ss.str() << " started." << endl;
11   }
12 
13   for (;;)
14   {
15     bool nothingToDo = true;
16 
17     {
18       lock_guard<mutex> l(_mutex);
19 
20       // Check for exit
21       if (_exit) break;
22 
23       // Are there objects to handle ?
24       if (_objects.size())
25       {
26         // Yes, get one and remove it.
27         boost::shared_ptr<SampleObject> object = _objects.front();
28         _objects.pop_front();
29 
30         // Update count of active threads, add it to analyze.
31         ++_threadsActive;
32         _AddSample(_threads_active_id, 
33           static_cast<float>(_threadsActive));
34 
35         // Handle the object.
36         _AddFlank(_thread_specific_id, true);
37         HandleSampleObject(object);
38         _AddFlank(_thread_specific_id, false);
39 
40         // Remove from active.
41         --_threadsActive;
42 
43         // Prevent sleep at the end of this loop.
44         nothingToDo = false;
45       }
46     }
47 
48     /// If there's nothing to handle, wait 1 ms to free CPU.
49     if (nothingToDo)
50       this_thread::sleep(
51         microsec_clock::universal_time() + milliseconds(1));
52   }
53 }
OK, maybe not the best way to implement it, but it should clarify what information we got using the instrumentation. Each thread got its own id to show when it handles one object. Besides that, we sample how many threads are active in parallel. The result will look like this:


Some attentive readers may already have seen the problem in the code above. A typical scenario where one thread blocks others, because the scopes have been placed wrong (the mutex lock is active during data handling). There's always only one active thread, which can be detected using the thread count sample line as well as the the runtime of each thread.
It may be easy to find the issue in this little example using a debugger, logging or anything else, but in complex applications, this must be measured in any way. In this case, my approach shows it clear and simple.

OK, lets try to fix this:
 1 /// Thread body of workers.
 2 void operator()(int threadId)
 3 {
 4   ostringstream ss;
 5   ss << "Thread #" << threadId;
 6   size_t _thread_specific_id = _AddSampleBuffer(ss.str().c_str());
 7 
 8   {
 9     lock_guard<mutex> l(_mutex);
10     cout << ss.str() << " started." << endl;
11   }
12 
13   for (;;)
14   {
15     boost::shared_ptr<SampleObject> object;
16 
17     {
18       lock_guard<mutex> l(_mutex);
19 
20       // Check for exit
21       if (_exit) break;
22 
23       // Are there objects to handle ?
24       if (_objects.size())
25       {
26         // Yes, get one and remove it.
27         object.swap(_objects.front());
28         _objects.pop_front();
29       }
30     }
31 
32     if (0 != object.get())
33     {
34       // Update count of active threads and add it to analyze.
35       {
36         lock_guard<mutex> l(_mutex);
37         ++_threadsActive;
38         _AddSample(_threads_active_id, 
39           static_cast<float>(_threadsActive));
40       }
41 
42       // Handle the object.
43       _AddFlank(_thread_specific_id, true);
44       HandleSampleObject(object);
45       _AddFlank(_thread_specific_id, false);
46 
47       // Remove from active.
48       {
49         lock_guard<mutex> l(_mutex);
50         --_threadsActive;
51       }
52     }
53     else
54       this_thread::sleep(
55         microsec_clock::universal_time() + milliseconds(1));
56   }
57 }
The resulting measurement may look like this (may vary more or less):


It shows that all threads are used, evenly distributed, like we expect it!

License: Examples and analyzer / collector are released under the BSD license. The application TimeLineVisualizer is temporary free closed source, I'm not sure how I distribute it in the future.

All things can be downloaded here: TimeLineVisualizer1.0.0.0.zip.

Enjoy it,
hobBIT

Monday, May 9, 2011

Use STL, Boost and C++0x lambda expressions to increase your productivity when working with data collections.

Working with data collections is something that many developers will be confronted with very often. Increasing the productivity in this area has a positive impact on the overall productivity. In my daily work in C++, C# and Python, I've noticed that C++ has a clear deficit here. Even if the STL has a really nice feature set and for sure the highest performance (if used correctly), using STL collections the recommended way requires much more source code compared to C# and Python. As I'm always try to express often used elements as short as possible, I was really glad when I first encountered C++0x lambda expressions.
I will show a really simple example of data processing in all three languages, trying to form the C++ version to compete with both C# and Python. 

I've tried to find a task that can't be done with a simple loop over a linear collection of data. The requirement of sorting the collection in the middle of the exercise make things a bit more tricky. So here it is:
Lets assume, we have a collection of employees. Each employee's data is stored in a class instance, while, to simplify this example, every instance contains these four elements:
  • Forename (string)
  • Surname (string)
  • Birthday (string, YYYY-MM-DD)
  • Children (int, count of children)
We assume too, the collection is verified, we do not encounter invalid formatted or empty values (None, null, nullptr). The encoding of the birthday as string was intended, so we do not have to handle different date type implementations and we need a string operation one each element during sort. The requirement is:
  • Filter all employees that have at least one child.
  • Sort the candidates by month/day of there birthday.
  • Create a string collection formatted: 'Forename, Surname: MM-DD'.
The sorting should only be done on the already filtered candidates (In a real life scenario, this is the typically step, always process as few data as possible) !
The element access in the class in Python, as well in C# are done using properties (something I really miss in C++). The C++ class has inline methods in the form:
 1 std::string Forename() const { return _forename; }
OK, lets start with the Python version. The employees are stored in a simple list: employees = [ ... ]. There are two usual ways to accomplish it, either form I choose, the expression is really short:
Using filter, sorted, map and lambda's:
 1 result = filter(lambda e: e.Children > 0, employees)
 2 result = sorted(result, key=lambda e: e.Birthday[5:])
 3 result = map(lambda e: '%s, %s: %s' % 
 4   (e.Forename, e.Surname, e.Birthday[5:]), result)
Or using list comprehensions:
 1 result = [ e for e in employees if (e.Children > 0) ]
 2 result.sort(key=lambda e: e.Birthday[5:])
 3 result = [ '%s, %s: %s' % 
 4   (e.Forename, e.Surname, e.Birthday[5:]) 
 5   for e in result ]

The next candidate is C#. Employee instances are stored in a generic: List<Employee> employees;. We use LINQ here, an innovation that has changed my thinking of software development when I first encountered it. To me, this the easiest to read version of all three languages:
 1 var res =
 2   from e in employees
 3   where e.Children > 0
 4   orderby e.Birthday.Substring(5)
 5   select string.Format("{0}, {1}: {2}",
 6     e.Forename, e.Surname, e.Birthday.Substring(5));
Clear and very expressive.

We've seen how easy this can be done in the previous languages. Now take a look how to do this in C++ using the recommended way by using STL algorithms. Employee instances are stored in std::vector<Employee>. I've chosen to prevent the use of std::vector<std::shared_ptr<Employee> >, as this would make things a bit more complicated than necessary, even if speed is C++'s primary domain and copy construction is a complement to this !
The first part are definitions of three functor classes, the first for filtering, the second to sort the remaining candidates and the third to format the output. After the functor definitions, the true processing takes its place and there's one important thing. Algorithm copy_if is not part of the standard, I just assume that the STL implementation in use has it implemented. Otherwise it is not that difficult to implement one. OK, here's the source of the C++ version:
 1 struct FilterFunctor : unary_function<const Employee&, bool> { 
 2   bool operator()(const Employee &e) const { 
 3     return e.Children() > 0;
 4   }
 5 };
 6 
 7 struct SortFunctor :  
 8   binary_function<const Employee &, const Employee &, bool> {
 9   bool operator()(const Employee &lhs, const Employee &rhs) const {
10     return lhs.Birthday().substr(5) < rhs.Birthday().substr(5);
11     }
12 };
13 
14 struct FormatFunctor : unary_function<string, const Employee &> {
15   string operator()(const Employee &e) {
16     ostringstream os; 
17     os << e.Forename() << ", " << e.Surname() << 
18       ": " << e.Birthday().substr(5);
19     return os.str();
20   }
21 }; 
22 
23 vector<Employee> filtered;
24 copy_if(employees.begin(), employees.end(), 
25   back_inserter(filtered), FilterFunctor());
26 sort(filtered.begin(), filtered.end(), SortFunctor());
27 vector<string> result(filtered.size());
28 transform(filtered.begin(), filtered.end(), 
29   result.begin(), FormatFunctor());
I think that everybody can see that this implementation has a higher amount of source code, far more ! Besides that, separating the vector processing calls and the element processing procedures (functors) is a two sided sword. This can reduce code bloat, but on the other hand, a reader of the source code has to examine code in different locations to understand a single context. But later more on this.

Each algorithm in the STL requires at least one pair of iterators as specifiers for the processing range. This is a nice and really flexible feature, as this opens a wide range of possibilities. But looking back to my usage of the STL, around 95% of my algorithm calls work between begin() and end() of a collection. Why is there no algorithm that expects the container instance as range specification ? It seems that i'm not the only person that asks this question. There's already a solution to this: Boost range algorithms. Besides many useful features, each STL algorithm has an equivalent in namespace boost, that reduces the algorithm call range specification to the container's instance name. A simple example, the call to:
 1 copy(employees.begin(), employees.end(), employeeClone.begin());
Can be reduced to:
 1 copy(employees, employeeClone.begin());
By just a simple #include and using namespace directive !
As mentioned above, the off-site specification of the functors has (at least to me) two downsides:
  • Split the context into different source code parts.
  • Heavy overhead of the amount of source code required to type in.
The whole definition of functor FilterFunctor above is expressed in C# using the simple term e.Children > 0, really unfair for the C++ community, isn't it ?
With the introduction of lambda expression in C++0x, these problems are a thing of the past. They allow the C++ user to embed the whole functor body at the place where it is required, the functor parameter on the call-site. Using this and boost range, the compete C++ code above can be written as follows (boost range does not implement copy_if, so we have to use the STL version here):
 1 vector<Employee> filtered;
 2 copy_if(employees.begin(), employees.end(), back_inserter(filtered), 
 3   [](const Employee &e) { return e.Children() > 0; });
 4 
 5 sort(filtered, [](const Employee &lhs, const Employee &rhs) { 
 6   return lhs.Birthday().substr(5) < rhs.Birthday().substr(5); });
 7 
 8 vector<string> result(filtered.size());
 9 transform(filtered, result.begin(), [](const Employee &e) {
10   return (format("%s, %s: %s") % e.Forename() % 
11     e.Surname() % e.Birthday().substr(5)).str(); });
I have to admit, it is not that clear to read like the C# code, but to people that know C++ and have an eye to the sometimes cryptic look of C++, this version goes into the direction (including the source code amount) of the Python implementation. The lambda specification looks a bit scary the first time, even for C++ developers. Three pairs of different brackets in one single term is cryptic. But using it for a while will train the eyes and lambdas will become an everyday tool. By the way, lambda expressions work well in Visual Studio 2010 and g++ 4.5+ (when using the compiler options -std=c++0x or -std=gnu++0x).

Hope this will useful to you,
hobBIT

Btw: I'm very interested to see implementations in other languages, especially Haskell and Ruby !

Friday, May 6, 2011

XbmcMovieHelper

To move this blog away from being total useless, here one of my tools, may it be useful to you !

I wrote it to got full control on the movie scraping process of XBMC. Several movies, when searched on IMDB or OFDB, often end in multiple results. So the scraper has two possibilities, either ask the user, which is very annoying when scanning a large movie collection, or just choose the best fit. Both ways are not what I really like, so I've searched for workarounds.
I found some nice tools, including Ember Media Manager, which do a really nice job, writing all movie related information to an .nfo file, which is namely bound to the respective movie file. The scraper from XMBC recognizes this file, and loads information from this file, instead from any internet source.

After a while using it, I've detected two flaws of it:
  • The data is completely stored on my hard drive. Updating information or switching to another media information source is not as easy this way. In addition, including fan-art, cover image and .nfo, I have at least 3 additional files per movie. Not very disk space consuming, but annoying when browsing the folder.
  • Adding new movies (when done in groups of many entries) is boring. After each selection, the user must select fan-art and cover, plus wait for all movie related downloads, which can be 10 seconds or more, even on my fast internet connection.
Another possibility was to just store the IMDB-ID in this .nfo file. The XBMC-scraper than has an exact identification of the movie, but without the drawbacks above:
  • The scraper can do always fresh updates using this ID, from any source that has some form of IMDB binding. All without loosing the movie identification and without user interaction !
  • The information in question is just a short ID, which is known at the moment the movie is identified and stored in really no time.
  • One additional benefit is the fact that the Moving Pictures plugin of Media Portal can use this information as well, so using the movie collection when switching to this nice media framework does not require additional work, just let the software scan.
To create the .nfo files with the IMDB-ID, I wrote XbmcMovieHelper, coded in C# and one of my first experiments with Windows Presentation Foundation (WPF) / XAML. Images are often more expressive than words, so here's a screenshot:


The first thing to do is to open the options dialog and configure the only two options:
  • Choose the right site: IMDB has sites for most languages, i.e. http://imdb.de for German.
  • Verify movie extensions: Check if your preferred movie format(s) are part of the list of file extensions, otherwise extend it.
Using it is quit simple. The first task is to select a movie root, this is the directory where movie files are stored.
All recognized movies inside this directory are than listed in the lower window pane. If .nfo files are already present, the IMDB-ID is scanned from it (if possible) and listed in front of the respective movie. These movies can now be tagged with their IMDB-ID.

Selecting a movie will result in one of two scenarios:
  • If an IMDB-ID is already available, the movie information site on IMDB is shown in the browser area above. Pressing the 'Re-Search Movie' button allows the modification of an already present ID.
  • Otherwise an IMDB-search is started, showing the results in the browser pane. The user has than the possibility to navigate inside the browser to the correct movie. If the movie is identified, a click on 'Use IMDB ID' will store the ID of the current visible movie to the currently selected in the lower pane. This will silently override an already existing .nfo file !
VoilĂ , thats it. This allows a really fast handling of large movie collections ! The 'Play Movie'-button is very self explanatory, isn't it ?
The reason of 'Toggle Search Lock' can be best read in the forum, were I first post this tool: http://forum.xbmc.org/showthread.php?t=85194#6

XbmcMovieHelper is licensed under GPLv3.
The binary can be downloaded from here: Windows Binary
The source code here: Source Code

First words ...

My intention to create and write to this blog is just to give the internet something back that I got from it over the last years: information. Whenever I have a problem, the internet offers me a wide range of possibilities and solutions. Nearly all of them are free to me, so why not pay something back this way ?!

In addition, I want to offer some tools that I wrote in the past and write in the upcoming time. Maybe there are users that find them useful.

Just for information: English is not my native language, but I've chosen to write my blog in this language, as English has become the number one in the online world and most users are aware of it. Besides that, I assume that most automatic site translator software is happy with an English source.
So please be merciful if my posts are not 100% perfect English, I will try to enhance my knowledge :-)

Enjoy my blog,
hobBIT

PS: My nickname 'hobbit' originates to the early 90's, were me and many of my classmates became fans of J.R.R.Tolkiens work. As there is a tonal similarity to my surname they gave me this nickname. I adapted it by choosing the writing style 'hobBIT' to highlight my addiction to the digital world :)
The blog name BITtleEarth was just the result when searching for a world where real 'hobBIT''s would live in.