Monday, May 9, 2011

Use STL, Boost and C++0x lambda expressions to increase your productivity when working with data collections.

Working with data collections is something that many developers will be confronted with very often. Increasing the productivity in this area has a positive impact on the overall productivity. In my daily work in C++, C# and Python, I've noticed that C++ has a clear deficit here. Even if the STL has a really nice feature set and for sure the highest performance (if used correctly), using STL collections the recommended way requires much more source code compared to C# and Python. As I'm always try to express often used elements as short as possible, I was really glad when I first encountered C++0x lambda expressions.
I will show a really simple example of data processing in all three languages, trying to form the C++ version to compete with both C# and Python. 

I've tried to find a task that can't be done with a simple loop over a linear collection of data. The requirement of sorting the collection in the middle of the exercise make things a bit more tricky. So here it is:
Lets assume, we have a collection of employees. Each employee's data is stored in a class instance, while, to simplify this example, every instance contains these four elements:
  • Forename (string)
  • Surname (string)
  • Birthday (string, YYYY-MM-DD)
  • Children (int, count of children)
We assume too, the collection is verified, we do not encounter invalid formatted or empty values (None, null, nullptr). The encoding of the birthday as string was intended, so we do not have to handle different date type implementations and we need a string operation one each element during sort. The requirement is:
  • Filter all employees that have at least one child.
  • Sort the candidates by month/day of there birthday.
  • Create a string collection formatted: 'Forename, Surname: MM-DD'.
The sorting should only be done on the already filtered candidates (In a real life scenario, this is the typically step, always process as few data as possible) !
The element access in the class in Python, as well in C# are done using properties (something I really miss in C++). The C++ class has inline methods in the form:
 1 std::string Forename() const { return _forename; }
OK, lets start with the Python version. The employees are stored in a simple list: employees = [ ... ]. There are two usual ways to accomplish it, either form I choose, the expression is really short:
Using filter, sorted, map and lambda's:
 1 result = filter(lambda e: e.Children > 0, employees)
 2 result = sorted(result, key=lambda e: e.Birthday[5:])
 3 result = map(lambda e: '%s, %s: %s' % 
 4   (e.Forename, e.Surname, e.Birthday[5:]), result)
Or using list comprehensions:
 1 result = [ e for e in employees if (e.Children > 0) ]
 2 result.sort(key=lambda e: e.Birthday[5:])
 3 result = [ '%s, %s: %s' % 
 4   (e.Forename, e.Surname, e.Birthday[5:]) 
 5   for e in result ]

The next candidate is C#. Employee instances are stored in a generic: List<Employee> employees;. We use LINQ here, an innovation that has changed my thinking of software development when I first encountered it. To me, this the easiest to read version of all three languages:
 1 var res =
 2   from e in employees
 3   where e.Children > 0
 4   orderby e.Birthday.Substring(5)
 5   select string.Format("{0}, {1}: {2}",
 6     e.Forename, e.Surname, e.Birthday.Substring(5));
Clear and very expressive.

We've seen how easy this can be done in the previous languages. Now take a look how to do this in C++ using the recommended way by using STL algorithms. Employee instances are stored in std::vector<Employee>. I've chosen to prevent the use of std::vector<std::shared_ptr<Employee> >, as this would make things a bit more complicated than necessary, even if speed is C++'s primary domain and copy construction is a complement to this !
The first part are definitions of three functor classes, the first for filtering, the second to sort the remaining candidates and the third to format the output. After the functor definitions, the true processing takes its place and there's one important thing. Algorithm copy_if is not part of the standard, I just assume that the STL implementation in use has it implemented. Otherwise it is not that difficult to implement one. OK, here's the source of the C++ version:
 1 struct FilterFunctor : unary_function<const Employee&, bool> { 
 2   bool operator()(const Employee &e) const { 
 3     return e.Children() > 0;
 4   }
 5 };
 6 
 7 struct SortFunctor :  
 8   binary_function<const Employee &, const Employee &, bool> {
 9   bool operator()(const Employee &lhs, const Employee &rhs) const {
10     return lhs.Birthday().substr(5) < rhs.Birthday().substr(5);
11     }
12 };
13 
14 struct FormatFunctor : unary_function<string, const Employee &> {
15   string operator()(const Employee &e) {
16     ostringstream os; 
17     os << e.Forename() << ", " << e.Surname() << 
18       ": " << e.Birthday().substr(5);
19     return os.str();
20   }
21 }; 
22 
23 vector<Employee> filtered;
24 copy_if(employees.begin(), employees.end(), 
25   back_inserter(filtered), FilterFunctor());
26 sort(filtered.begin(), filtered.end(), SortFunctor());
27 vector<string> result(filtered.size());
28 transform(filtered.begin(), filtered.end(), 
29   result.begin(), FormatFunctor());
I think that everybody can see that this implementation has a higher amount of source code, far more ! Besides that, separating the vector processing calls and the element processing procedures (functors) is a two sided sword. This can reduce code bloat, but on the other hand, a reader of the source code has to examine code in different locations to understand a single context. But later more on this.

Each algorithm in the STL requires at least one pair of iterators as specifiers for the processing range. This is a nice and really flexible feature, as this opens a wide range of possibilities. But looking back to my usage of the STL, around 95% of my algorithm calls work between begin() and end() of a collection. Why is there no algorithm that expects the container instance as range specification ? It seems that i'm not the only person that asks this question. There's already a solution to this: Boost range algorithms. Besides many useful features, each STL algorithm has an equivalent in namespace boost, that reduces the algorithm call range specification to the container's instance name. A simple example, the call to:
 1 copy(employees.begin(), employees.end(), employeeClone.begin());
Can be reduced to:
 1 copy(employees, employeeClone.begin());
By just a simple #include and using namespace directive !
As mentioned above, the off-site specification of the functors has (at least to me) two downsides:
  • Split the context into different source code parts.
  • Heavy overhead of the amount of source code required to type in.
The whole definition of functor FilterFunctor above is expressed in C# using the simple term e.Children > 0, really unfair for the C++ community, isn't it ?
With the introduction of lambda expression in C++0x, these problems are a thing of the past. They allow the C++ user to embed the whole functor body at the place where it is required, the functor parameter on the call-site. Using this and boost range, the compete C++ code above can be written as follows (boost range does not implement copy_if, so we have to use the STL version here):
 1 vector<Employee> filtered;
 2 copy_if(employees.begin(), employees.end(), back_inserter(filtered), 
 3   [](const Employee &e) { return e.Children() > 0; });
 4 
 5 sort(filtered, [](const Employee &lhs, const Employee &rhs) { 
 6   return lhs.Birthday().substr(5) < rhs.Birthday().substr(5); });
 7 
 8 vector<string> result(filtered.size());
 9 transform(filtered, result.begin(), [](const Employee &e) {
10   return (format("%s, %s: %s") % e.Forename() % 
11     e.Surname() % e.Birthday().substr(5)).str(); });
I have to admit, it is not that clear to read like the C# code, but to people that know C++ and have an eye to the sometimes cryptic look of C++, this version goes into the direction (including the source code amount) of the Python implementation. The lambda specification looks a bit scary the first time, even for C++ developers. Three pairs of different brackets in one single term is cryptic. But using it for a while will train the eyes and lambdas will become an everyday tool. By the way, lambda expressions work well in Visual Studio 2010 and g++ 4.5+ (when using the compiler options -std=c++0x or -std=gnu++0x).

Hope this will useful to you,
hobBIT

Btw: I'm very interested to see implementations in other languages, especially Haskell and Ruby !

No comments:

Post a Comment