The Joy of Algorithms & Data Structures - part 1

In this series of blogs I'd like to tell you something about algorithms and data structures and how they can make your life easier!

I like puzzles. The kind that challenge your brain.
I like elegance. The kind that a beautiful Mathematics or Physics equation demonstrates.
And I'm a lazy guy. I'd rather write a computer program that solves 1000 Sudoku puzzles in one second, than solve 10 Sudoku puzzles manually. (In fact, I wrote that program one day :) )
Not surprisingly, when I first saw a book about algorithms and data structures, I was hooked! Algorithms and data structures can be powerful, elegant, simple and smart all at the same time.

In this series of blogs I'd like to tell you something about algorithms and data structures and how they can make your life easier. I will do this by giving a couple of examples from my personal experience. These are all real application examples, not made up situations. The goal of this blog is just entertainment, I'd like to share with you how fun it can be to use algorithms in your "everyday" code. If you are interested in learning more about this subject, I recommend reading some of the many good books about algorithms and data structures (for example "Introduction to Algorithms" by Cormen, Leierson, Rivest and Stein) and try to apply them yourself! 

I will use the "big O notation" in these blogs quite regularly. If you are unfamiliar with this concept you might want to read up on this first

Part 1: Hash sets

Let's start off with something simple, a hash set. That is, simple to use if available in your code library, not necessarily simple to build yourself!

In one of my work projects I was asked to rebuild a backup system that safely copies a collection of millions of images from a source storage system to a target storage system. Every day, the images on the target storage had to be synchronized with the source: new images had to be added to the target and deleted images had to be deleted from the target (existing images were never changed).
In this system, transferring data between the storage systems was time consuming, so the backup system should avoid copying data for images that already exist in the target storage.
I chose the following strategy to synchronize both storages:

  1. take a snapshot of the list of all image file paths in the source: the source list;
  2. take a snapshot of the list of all image file paths in the target: the target list;
  3. find all file paths that are in the source list but are not in the target list. These are the images that have to be copied from source to target;
  4. find all file paths that are in the target list but are not in the source list. These are the images that have to be deleted from the target;
  5. copy the files from source to target that were found in step 3;
  6. delete the files from target that were found in step 4.
To keep things simple in this example I will ignore the changes that are made to the source during the evaluation of this process.

First approach - Using arrays

Let's represent every image file path as a unique string. 
Now at step 3 and 4, there are multiple ways to determine these "lists of differences" between two lists.
For example, one could use an array of strings as a data structure for both the source list and the target list.
Using these two arrays, we could perform step 3 in the following way:

  • start with an empty result list. These will be the images to be copied from source to target.
  • for each element in source array
  • look at all elements in target array
  • if source element and target element are equal, then the target already contains the source image
  • if no target element equal to the source element is found, then add the source image to the result list
.NET code example:

Or, exactly the same algorithm with the same running time, but fewer lines of code (using LINQ):

This simple algorithm would definitely work! But it is slow. If both source and target contain millions of elements then this algorithm will compare trillions of strings, which takes a long time for a normal computer to do. In more mathematical terms, if N is the maximum of the number of images in the source and the number of images in the target, then this algorithm will do O(N²) string comparisons. And N² is quite a big number if N is in the millions.

A faster way - Using a hash set

We can do better of course. For example by using a hash set instead of an array as our data structure.

A hash set is a set based on a hash table (or hash map). The cool thing about hash tables is that they can perform lookups and additions in constant time, O(1). For more information about hash tables, see for example: What is a Hash Table?.

Using a hash set for our target list, we can look up whether a source image exists in the target in constant time, O(1). In other words, we don't have to loop over all elements in the target to check if the target contains the image. Obviously this is a big improvement compared our previous approach: each lookup is done in O(1) instead of O(N). Thereby we reduce the total number of operations to get our result to O(N) instead of O(N²), which is a huge improvement if N is in the millions.

or the same again using LINQ:

It's important to note that we have to check that the gain of using a hash set at step 3 is not nullified by a loss in another part of the algorithm. For example, if building the Hash set would be very time consuming, then our total gain could be zero. Fortunately, adding elements to a hash set is done in contant time, so building the hash set can be done in O(N), just like building an array, and no significant performance is lost.

Comparison of the two approaches

The following numbers give an indication of the running time for both approaches.

N HashSet Array HashSet vs Array
  Running time (mm:ss) Running time (mm:ss)  
1000 00:00.0011866 00:00.0160698 13 times faster
10.000 00:00.0008660 00:00.8357707 965 times faster
100.000 00:00.0096915 01:37.9626172 10.108 times faster
1.000.000 00:00.1776648 very very long many times faster

Other solutions

Another fast solution is to apply an ordering on both the source and target list (taking up O(N log N) time for each list) and then looping through both lists at the same time. This total O(N log N) algorithm is a little bit more complex to write, but it has the advantage that no hashing or hash sets are required.
Or - if good hashing is difficult - using a data structure based on a balanced binary tree (like a SortedSet in .NET or a TreeSet in Java) instead of a hash set would also result in an O(N log N) algorithm.

We can compare all these algorithms to the ones in this article, where they are applied to perform joins in databases.

Conclusion

In this very simple example we have seen how using a hash set as an underlying data structure instead of an array greatly improved the running time of a backup system. This illustrates how using the right data structure can hugely improve performance - also for your program!

In the next blog, we will look at an example of how a smart algorithm can do the same!

Many thanks go to Freek Paans, Marie Louise van Dorp and Wolf Hijlkema for giving feedback.

About Erik: I'm a software developer at Infi in the Netherlands. We mainly create web and mobile applications for our business clients. In my spare time I like to work on small, fun, side software projects. I mostly use .NET / C#.

Gezocht: ondernemende nerds!

› Wil jij je hersens bij ons laten kraken?

Wil je iets waarmaken met Infi?

Wil jij een eigen webapplicatie of mobiele app waarmee jij het bij anderen maakt?

Waargemaakt door de nerds van Infi.
Nerds met liefde voor softwareontwikkeling en die kunnen communiceren. En heel belangrijk: wat we doen, doen we met veel lol!

Wij willen het fixen. Laat jij van je horen?

Voor wie heb je een vraag?