Blog-Archiv

Mittwoch, 14. Oktober 2015

Iterator in Java and JS

Although I think that JavaScript is nearer to C than to Java, I would like to make a juxtaposition of the two languages in this Blog. It won't be so much about what the one can do and the other can't, it'll be more about how to implement a Java class using JavaScript.

Java Iterator

A nice small example would be the implementation of an array Iterator. This is a stateful object that always holds the index of the latest accessed array member.

A stateful object is one that does not look the same all the time, it changes during its lifetime.
For example a Java List could change, members may be added or removed, or the order could be changed. You could call it mutable.
On the other hand a Java Integer Object is stateless, it does not ever change, it always stays the number it is. New numbers could evolve from add- or subtract-operations with numbers. You could call this immmutable or final.

The Iterator tells you whether there is another member that has not yet been accessed, and it gives you that next member when so. As soon as you retrieve the next member, its internal index will skip to the next if present. Once the iterator has been consumed it is not usable any more, it must be instantiated (constructed) again for a new iteration.

The iterator abstracts the nature of the iterated collection, which could be a tree, a graph, a list, a table. You do not need to implement any loop counter when using Iterator. Iterator is a design pattern, very recommendable for achieving reusable code with high abstraction level.

So here is a simple Java implementation of iterator:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
public class Iterator
{
    private final Object [] list;
    private int index;
    
    public Iterator(Object [] list) {
        this.list = list;
    }

    public boolean hasNext() {
        return index < list.length;
    }

    public Object next() {
        int i = index;
        index++;
        return list[i];
    }
}

This should be easy to read. There is a class Iterator that has a constructor that requires an array to iterate (would make no sense without). It stores the given array into a private final member field (private: not visible outside the class, final: can not be modified any more from inside the class). The second private field is the index that will be used to iterate through the array. The class exposes two public methods to the outer world, one returns whether there is another member available, the other returns the next member.

A perfect capsule, binding together data and functions like the OO idea demands.

JavaScript Iterator

To achieve the same in JavaScript we have to know a lot about this language, especially about closures. Simply spoken a closure is a block of code that remembers its environment, but this is a little bit too general. In JavaScript, closures are instantiated function bodies. In Java until version 8, a closure could be realized for example by implementing a Runnable, whereby all environmental variables had to be final to be accessible by the Runnable body. In Java 8, closures have been named lambdas, and there is a special syntax for them.

But before going too much into details, here is the JavaScript iterator.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
var newIterator = function(list)
{
    var index = 0;
    var that = {};
    
    that.hasNext = function() {
        return index < list.length;
    };
    
    that.next = function() {
        var i = index;
        index++;
        return list[i];
    };
    
    return that;
};

The first line declares a function called "newIterator". I could also have called it "iteratorFactory", but I wanted to make clear that this function resembles a Java class constructor.

This function exists in the global context (use AMD when you want to avoid such). For a browser, the global context would be the window object.

The closure starts with the opening { brace of the function, and with the closing } brace it ends. All declared parameters of the function, and all its local var variables, will be visible and accessible from inside the function, and from any function or object declared inside this function, down to any depth. And even when there will be new objects generated from inside and returned to the outer world, these would all see the same parameter values and local variables, even when the function gets called several times with different parameter values. Parameter values would get copied, and local variables would get instantiated newly, on every function call.

So there is an invisible JS mechanism that binds parameters and variables of a certain function call to the code blocks existing inside. This invisible environment is called closure (or lambda) in JS. A function call generates a closure, and that closure copies the current values of parameters, and the closure will survive the function call when it is referenced by some function or object generated inside the function and returned to the outer world.
Mind that a normal { block of code } is NOT a closure in JS! Only functions can provide them.

Using the closure mechanism we can declare the index as local variable within the body of the newIterator function. It will be visible and accessible from the returned that object and its hasNext() and next() functions. The same applies to the list parameter. Any call of newIterator() will generate a new closure, binding together the list, the index and the that object.

Thus we can use factory-functions to imitate both Java classes and constructors at the same time in JS. We can then return an object that uses a closure which holds references to all parameters and variables of the factory function. This might look a little bit unusual for OO developers, but it works fine.


Here they are once more, side by side, for the eye to get used to it, left JavaScript, right Java.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
var newIterator = function(list)
{
    var index = 0;
    var that = {};
    
    that.hasNext = function() {
        return index < list.length;
    };
    
    that.next = function() {
        var i = index;
        index++;
        return list[i];
    };
    
    return that;
};
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
public class Iterator
{
    private final Object [] list;
    private int index;
    
    public Iterator(Object [] list) {
        this.list = list;
    }

    public boolean hasNext() {
        return index < list.length;
    }

    public Object next() {
        int i = index;
        index++;
        return list[i];
    }
}



Dienstag, 13. Oktober 2015

JS Natural Sort Order

Here is a port of the natural-sort-order to JavaScript (as an example that JS implementations can also result in readable code :-)
It is directly derived from the Java code of my latest Blog article.

Test Area

The area below is for trying out the natural sort order. Enter names in the text-area, each in a single line, then press "Sort" to see the sorted result. I've already provided an example list.

Case Sensitive
Ignore Spaces

You find the JS code working here in the following.

JS Sorting

JavaScript provides a sort() function on arrays. This sorts alphabetically, no matter whether it is an array of numbers or an array of strings. But as optional argument you can pass a compare-function, this is a function that receives two parameters and is expected to return an integer being ...

  • negative when first parameter should be sorted to the left of second parameter
  • zero when the parameters are equal
  • positive when the first parameter should be sorted to the right of second parameter (swap them).

Here is an example:

    var array = [ 4, 2, 6, 5, 3, 12, 9 ];
    var comparator = function(a, b) {
        return a - b;
    };
    array.sort(comparator);
    // yields [ 2, 3, 4, 5, 6, 9, 12 ]

JS NaturalSortComparator

Using the JS sort() implementation I just need to provide the comparator function. For a discussion how natural sorting works please refer to my latest Java Blogs. Wherever it was possible I tried to keep identifiers and objects like they were in the Java code.

By passing one or two booleans to function newNaturalSortComparator() you could determine whether the comparator ignores upper/lower case or spaces.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
"use strict";

/**
 * Implements a natural sort order where numbers within names
 * are detected and evaluated as numbers, not strings.
 * Thus "a1", "a10", "a2" would be sorted as "a1", "a2", "a10".
 */
var newNaturalSortComparator = function(caseSensitive, ignoreSpaces)
{
    // parameter default definition
    ignoreSpaces = (ignoreSpaces === undefined) ? true : ignoreSpaces;
    
    
    /** Inner class Part implements Comparator. */
    var newPart = function(isNumber, content) {
        var i = 0;
        if (isNumber)   // remove and count leading zeros when number
            while (i < content.length && content.charAt(i) == '0')
                i++;
                
        content = (i == 0) ? content : content.substring(i);
        content = caseSensitive ? content : content.toLowerCase();
        
        var part = {
            content: content,
            leadingZeros: i,
            number: isNumber ? parseInt(content) : undefined,
            
            compareTo: function(other) {
                var a, b;
                if (part.number !== undefined && other.number !== undefined) {
                    a = part.number;
                    b = other.number;
                }
                else {
                    a = part.content;
                    b = other.content;
                }
                return (a < b) ? -1 : (a > b) ? +1 : 0;
            }
        };
        
        return part;
    };
    
    var compare = function(string1, string2) {
        var iterator1 = iterator(split(string1));
        var iterator2 = iterator(split(string2));
        
        while (iterator1.hasNext() || iterator2.hasNext()) {
            // first has no more parts -> comes first
            if ( ! iterator1.hasNext() && iterator2.hasNext())
                return -1;
            
            // second has no more parts -> comes first
            if (iterator1.hasNext() && ! iterator2.hasNext())
                return 1;

            // compare next parts
            var result = iterator1.next().compareTo(iterator2.next());
            if (result !== 0)    // found difference
                return result;
        }
        
        return 0;
    };
    
    
    var cache = {};
    var sb = ""; // string buffer
    
    var split = function(string) {
        var cachedList = cache[string];
        if (cachedList) // cache results to be fast
            return cachedList;
        
        var list = [];
        var digits = false;
        
        for (var i = 0; i < string.length; i++) {
            var c = string.charAt(i);
            var ignore = (ignoreSpaces && (c === ' ' || c === '\t' || c === '\r' || c === '\n'));
            var isDigit = ( ! ignore && c >= '0' && c <= '9' );
            
            if (isDigit !== digits)    { // state change
                closeCurrentPart(list, digits);
                digits = isDigit;
            }
            
            if ( ! ignore )
                sb += c;
        }
        closeCurrentPart(list, digits);
        
        cache[string] = list;
        return list;
    };
    
    var closeCurrentPart = function(list, digits) {
        if (sb.length > 0)    { // close current string part
            list.push(newPart(digits, sb));
            sb = "";
        }
    };
    
    var iterator = function(list) {
        var index = 0;
        
        return {
            hasNext: function() {
                return index < list.length;
            },
            next: function() {
                var i = index;
                index++;
                return list[i];
            }
        };
    };
    
    
    return compare;
};

Use this like the following:

array.sort(newNaturalSortComparator());

Unit Test

Here is another usage example, actually representing a test.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
(function() {

    var test = function(array, sortedArray, comparator) {
        array.sort(comparator);
        
        if (array.length !== sortedArray.length)
            throw "Array has length "+array.length+", sorted array has length "+sortedArray.length;
                
        for (var i = 0; i < array.length; i++)
            if (array[i] !== sortedArray[i])
                throw "Element at index "+i+", expected "+sortedArray[i]+" but was "+array[i];
    };
    
    (function() {
        var array = [
            "a15 b",
            "a21",
            "12b",
            "11b",
            "2b",
            "1b",
            "a11b",
            "a2b",
            "a 14 b",
            "a22",
            "3 4",
            "a11",
            "a1",
            "3 3",
            "a2",
            "a 13b",
            "a1b",
            "a12b",
            "a12",
            "a21"
        ];
        var sortedArray = [ 
            "1b",
            "2b",
            "3 3",
            "3 4",
            "11b",
            "12b",
            "a1",
            "a1b",
            "a2",
            "a2b",
            "a11",
            "a11b",
            "a12",
            "a12b",
            "a 13b",
            "a 14 b",
            "a15 b",
            "a21",
            "a21",
            "a22"
        ];
        test(array, sortedArray, newNaturalSortComparator());
    })();

})();




Sonntag, 11. Oktober 2015

Natural Sort in Java, Performance Tuned

How to performance-tune? How can we make some Java code run faster?
You guess right, there is no simple answer to this. Here are my tips.

Caching nearly always helps, and you might gain incredible performance boost from it with very few work. But it also can cause big problems. If caches are not released in time, they will contain wrong information, and thus generate errors. So you need to manage your caches in a safe way.
In case of natural sorting, caching stateful Iterator instances is not a good idea, but caching the ready-made part-lists is.

Avoid loops. Avoid loops in loops. Avoid loops in loops in loops ....
This is not an easy task, because big hierarchies of method calls might hide a laborious program flow. Avoiding loops also means avoiding doing the same thing again and again, like splitting or parsing the same text or other input repeatedly, thus it is strongly related with caching.

Make things short. Break loops when continuing does not make sense any more. A classical example is the short circuit condition, built into most modern programming languages.
In case of natural sorting, in my first performance tuning I break the splitting of strings as soon as a difference occurs.

Remove unneeded things. Generally this refers to the agile You ain't gonna need it principle.
In case of natural sorting, distinguishing between separators and letters may be dispensable.

Look for synchronized sections. Thread-safe code has its drawbacks, it is slower than normal code, because it needs operating system resources for synchronization.

In old times it was also important to avoid the new operator, because this is an operating system call and thus may take time. But I do not believe this to be a significant any more today.

Generally I want to mention that performance tuning is nothing you should do when implementing new software. You should do that afterwards, and just where needed. You save much efforts by such a strategy. Do not forget that

PREMATURE OPTIMIZATION IS THE ROOT OF ALL EVIL

So how successful has my second natural sort tuning been? And which strategy helped?

  • Tip: you find the full Java source code of the NaturalSortComparator4 on bottom of this page in a fold.

Caching

Caching brought my first naive sorter implementation instantly to an even better performance than the NaturalOrderComparator (that I named PoolPaourComparator here, after its authors Martin Pool and Pierre-Luc Paour). And that with just a few lines of code, using the Java Hashtable class.

    private Map<String,List<String>> cache = new Hashtable<>();
    
    /** Splits given string into a list of names and numbers. */
    private List<String> split(String string) {
        final List<String> cachedList = cache.get(string);
        if (cachedList != null)
            return cachedList;

        final List<String> list = new ArrayList<String>();

        ..... // scan into list

        
        cache.put(string, list);
        
        return list;
    }

You could also use HashMap for this. I use Hashtable when I know that there will be no null values in that map, in particular, when I want to detect putting null values, because this wouldn't make sense here. Thus using Hashtable is an implicit assertion.

Binding the cache to the Comparator instance makes it safe against error-prone usage. Precondition is that comparators are not cached and are always constructed newly (is a stateful comparator!).

Here is the benchmark after I applied this:

Alphabetical comparator yields 44 millis
PoolPaourComparator yields 708 millis
NaturalSortComparator yields 566 millis
NaturalSortComparator2 yields 1637 millis

It went down from 2.5 seconds to just 0.5 seconds!

Making Things Short

When I want to break loops, I need to implement dynamic iterators that do not read more than needed. This is what I did in my first performance tuning attempt. And it made the sorting two times faster.

Now the challenge appears to combine this with caching. When I tried it out, I found that caching stateful objects like Iterator instances is not recommendable in many situations. I got following error message from Java's TimSort:

Comparison method violates its general contract

This happened when a string was several times in the list to sort, and the second instance of the string got the same character-Iterator instance from cache as the first, and these two were compared to each other. As you can imagine, using the same iterator for two string representations at the same time does not work.

I tried to fix this bug by using System.identityHashCode(string) as hashing key. But then I dropped this strategy, because it turned out that caching the string-parts made up 99 % of the performance benefit. Splitting them only on demand and caching the Iterators made the code much more complicated and did not give a justifying result.

Remove Unneeded Things

I do not find much sense in distinguishing between letters and separators in a natural-sort comparator. Maybe I haven't seen certain use cases. But for now, I removed any code that made a distinction between separators and letters. I just scan for numbers and treat them according to their nature. Anything else is done alfabetically.

As a special tuning action I tried to not convert strings to numbers as long as possible. Fact is that you can compare numbers to each other by just comparing their length (assuming you removed leading zeros), and this does not require the (expensive) conversion of string to number.
But I also gave up this strategy. Again it made the code much more complex, and after some investigations I found out that although this optimization was used in fact, finally always the number was parsed anyway, so why not parse it right from start.

Result: NaturalSortComparator4

Following comparator can be configured to treat spaces to be significant, default it doesn't (ignores spaces). And it can be configured to compare case-sensitive (default is case-insensitive).

All compare-logic now is done in the Part class implementation (representing a string-split-part). It implements Java's Comparable interface, and it holds a numeric representation of any number-string for a fast comparison. Splitting strings is done in first place, because caching gained much more performance than doing the splits lazily in long and complicated code.

Here it is. It is a non-static inner class, to have access to the caseSensitive constructor parameter of its enclosing Comparator instance.

    /**
     * String part that implements a comparison according to its nature.
     */
    private class Part implements Comparable<Part>
    {
        private final String content;
        private final int leadingZeros;
        private final Long number;
        
        Part(boolean isNumber, String content) {
            int i = 0;
            if (isNumber)   // remove and count leading zeros when number
                while (i < content.length() && content.charAt(i) == '0')
                    i++;

            this.content = (i == 0) ? content : content.substring(i);
            this.leadingZeros = i;
            this.number = isNumber ? Long.valueOf(content) : null;
        }
        
        /**
         * Compares this part with given one, according to their nature
         * either as numbers or others. This has to be fast.
         */
        @Override
        public int compareTo(Part other) {
            if (number != null && other.number != null)  {
                final int result = number.compareTo(other.number);
                if (result != 0)
                    return result;
                
                return Integer.compare(leadingZeros, other.leadingZeros);
            }
            
            return caseSensitive
                    ? content.compareTo(other.content)
                    : content.compareToIgnoreCase(other.content);
        }
    }

The number field is evaluated in constructor when the Part is considered to be a number. Long.valueOf(content) would throw an exception when not. The count of leading zeros decides about a comparison when the numbers are equal: the more zeros are present, the more the string goes to end of list. The number then is used in compareTo() implementation of interface Comparator.
The inner class sees the field caseSensitive of its enclosing class, so it can do an according compareTo() or compareToIgnoreCase() call when the part is not a number.

Here is its enclosing class, called NaturalSortComparator4. All it needs to import comes from java.util.*.

public class NaturalSortComparator4 implements Comparator<String>
{
    
    ..... // inner class Part goes here
    
    
    private final boolean caseSensitive;
    private boolean ignoreSpaces;
    
    private Map<String,List<Part>> cache = new Hashtable<>();
    
    private final StringBuilder sb = new StringBuilder();

    
    /** Comparator that compares case-insensitive. */
    public NaturalSortComparator4() {
        this(false);
    }
    
    /** Comparator that treats case-sensitivity according to given parameter. */
    public NaturalSortComparator4(boolean caseSensitive) {
        this(caseSensitive, true);
    }
    
    /** Comparator that treats case-sensitivity according to given parameter, and optionally does not ignore spaces. */
    public NaturalSortComparator4(boolean caseSensitive, boolean ignoreSpaces) {
        this.caseSensitive = caseSensitive;
        this.ignoreSpaces = ignoreSpaces;
    }
    
    
    /**
     * Splits the given strings and compares them
     * by delegating to the Comparable parts.
     */
    @Override
    public int compare(String string1, String string2) {
        final Iterator<Part> iterator1 = split(string1).iterator();
        final Iterator<Part> iterator2 = split(string2).iterator();
        
        while (iterator1.hasNext() || iterator2.hasNext()) {
            // first has no more parts -> comes first
            if ( ! iterator1.hasNext() && iterator2.hasNext())
                return -1;
            
            // second has no more parts -> comes first
            if (iterator1.hasNext() && ! iterator2.hasNext())
                return 1;

            // compare next parts
            int result = iterator1.next().compareTo(iterator2.next());
            if (result != 0)    // found difference
                return result;
        }
        
        return 0;   // are equal
    }

    
    /** Splits given string into a list of numbers and other parts. */
    private List<Part> split(String string) {
        final List<Part> cachedList = cache.get(string);
        if (cachedList != null) // cache results to be fast
            return cachedList;
        
        final List<Part> list = new ArrayList<>();
        boolean digits = false;
        
        for (int i = 0; i < string.length(); i++) {
            final char c = string.charAt(i);
            final boolean ignore = (ignoreSpaces && Character.isWhitespace(c));
            final boolean isDigit = (ignore == false && Character.isDigit(c));
            
            if (isDigit != digits)    { // state change
                closeCurrentPart(list, digits);
                digits = isDigit;
            }
            
            if (ignore == false)
                sb.append(c);
        }
        closeCurrentPart(list, digits);
        
        cache.put(string, list);
        return list;
    }

    private void closeCurrentPart(List<Part> list, boolean digits) {
        if (sb.length() > 0)    { // close current string part
            list.add(new Part(digits, sb.toString()));
            sb.setLength(0);
        }
    }

}

The compare() method contains just the handling of the two involved iterators, nothing else. All comparison logic moved to class Part, and should be maintained there.

The split() method has become even simpler than it was before, but it now uses caching. It does not distinguish between letters and separators any more (add it there when you need it).

Here is my final benchmark:

Alphabetical comparator yields 44 millis
PoolPaourComparator yields 708 millis
NaturalSortComparator4 yields 339 millis

As copy & paste convenience I added the fold below.

Click on left-side arrow to see the full source code of the tuned natural-sort comparator.