This is about performance tuning of the natural-sort class I started in my previous article. Simple and comprehensible code is rarely the fastest, and sorting is always about speed. The challenge is to achieve a fast sorter without hacking on long methods that use countless variables with abbreviated names no one understands. In other words, I want to prove that simplicity and good performance are not contradictory.
But before talking about a way to make the sorter faster I need to specify how I
perceive that something is fast. I wrote a unit test for that.
This test does not assert anything, it only outputs its benchmark at end.
It measures in nanoseconds, 1000000 nanoseconds make up a millisecond, 1000 millis make up a second.
It performs several repeats and calculates an average result then.
It uses the NaturalSortTest.names
data of the unit test from previous Blog
and blows them up a little.
import org.junit.Test; /** * Compares different implementations of natural sort orders. */ public class NaturalSortPerformanceTest { @Test public void testPerformance() { final int REPEATS = 5; long time0 = 0; long time1 = 0; long time2 = 0; long time3 = 0; for (int i = 0; i < REPEATS; i++) { time0 += performance(null); time1 += performance(new PoolPaourComparator()); time2 += performance(new NaturalSortComparator2()); time3 += performance(new NaturalSortComparator()); } System.err.println("Alphabetical comparator"+" yields "+(time0 / REPEATS)+" millis"); System.err.println(PoolPaourComparator.class.getSimpleName()+" yields "+(time1 / REPEATS)+" millis"); System.err.println(NaturalSortComparator2.class.getSimpleName()+" yields "+(time2 / REPEATS)+" millis"); System.err.println(NaturalSortComparator.class.getSimpleName()+" yields "+(time3 / REPEATS)+" millis"); } private long performance(Comparator<String> comparator) { final int FACTOR = 10000; final List<String> toSort = new ArrayList<>(FACTOR * NaturalSortTest.names.size()); for (int i = 0; i < FACTOR; i++) { for (int j = 0; j < NaturalSortTest.names.size(); j++) { String name = NaturalSortTest.names.get(j); final int MULTIPLY = 3; for (int k = 0; k < MULTIPLY; k++) name += "/"+name; toSort.add(name); } } final long start = System.nanoTime(); Collections.sort(toSort, comparator); final long end = System.nanoTime(); return (end - start) / 1000000l; } }
This will tell me the truth about all sorters I want to throw into the game.
As you can see there are already three comparators in the race. The null comparator is the alfabetical order. This will clearly be the fastest. As it does not produce natural sort order, it is here just to show the gap.
The PoolPaourComparator
you can download from its
GitHub. Might not be what you are looking for,
but you can fix it, it is not so hard to understand. At least it is fast.
The NaturalSortComparator2
is subject of the following article.
It is a rewrite of the naive NaturalSortComparator
from my previous Blog.
Here is the result of one of the test runs:
Alphabetical comparator yields 45 millis PoolPaourComparator yields 560 millis NaturalSortComparator2 yields 1367 millis NaturalSortComparator yields 2597 millis
As you see my NaturalSortComparator
is far behind PoolPaourComparator
.
So this article might have another follower that tries to achieve even better performance.
But at least it is two times faster than the latest.
First Optimized Natural Sorter
What were the weaknesses of the first NaturalSortComparator
?
It had to read all strings in the list entirely, as many times as they were
compared to each other, so it had to scan lots of strings several times repeatedly.
The new comparator just reads until the first difference is detected, not the entire string.
Nevertheless it reads as far as a letter- or digit-sequence lasts, so it also does more than needed.
And this is the reason why it still significantly slower than a comparator
that reads just characters, not string parts. Such a comparator
could be implemented as finite automaton.
Might be the next NaturalSortComparator
Blog:-)
I must rewrite the split(String string)
method to return an Iterator
,
and I must implement this iterator to never read more characters than needed.
Here is what I've done to integrate that Iterator
into the old code.
public class NaturalSortComparator2 implements Comparator<String> { private enum Type { SPACE, DIGIT, LETTER, SEPARATOR; } private static class Part { final Type type; final String content; Part(Type type, String content) { this.type = type; this.content = content; } } private final boolean caseSensitive; /** Comparator that compares case-insensitive. */ public NaturalSortComparator2() { this(false); } /** Comparator that treats case-sensitivity according to given parameter. */ public NaturalSortComparator2(boolean caseSensitive) { this.caseSensitive = caseSensitive; } /** * Splits the given strings and then compares them, * according to their nature, either as numbers or others. */ @Override public int compare(String string1, String string2) { final Iterator<Part> iterator1 = split(string1); final Iterator<Part> iterator2 = split(string2); while (iterator1.hasNext() || iterator2.hasNext()) { // first has no more parts -> comes first if ( ! iterator1.hasNext() && iterator2.hasNext()) return -1; // second has no more parts -> comes first if (iterator1.hasNext() && ! iterator2.hasNext()) return 1; // get parts to compare final Part part1 = iterator1.next(); final Part part2 = iterator2.next(); int result; // if part is a number if (part1.type == Type.DIGIT && part2.type == Type.DIGIT) { result = Long.compare(Long.valueOf(part1.content), Long.valueOf(part2.content)); // if numbers are equal, then shorter comes first if (result == 0) result = Integer.compare(part1.content.length(), part2.content.length()); } else { // part is name or separator result = caseSensitive ? part1.content.compareTo(part2.content) : part1.content.compareToIgnoreCase(part2.content); } if (result != 0) // found difference return result; } return 0; } ...... }
This is mostly the same as the old Comparator, with following differences:
- it uses an inner class
Part
to represent the split results, and it tags any such part with aType
(enum) that expresses its nature (number, word, separator) - the
Part
instances come from asplit()
method that returns anIterator<Part>
which should never read more than needed - it distinguishes between spaces and separators
- it ignores spaces completely
The rest is the same as it was before. The shorter string wins. When not both are numbers, the parts are compared like they were strings.
Now here is the missing rewritten split(String)
method that produces a "lazy" iterator.
/** Splits given string into a list of names, numbers and separators (others). */ private Iterator<Part> split(final String string) { return new Iterator<Part>() { private char currentChar; private int position; private final int length = string.length(); private final StringBuilder sb = new StringBuilder(); @Override public boolean hasNext() { // read away spaces while (position < length && getState(currentChar = string.charAt(position)) == Type.SPACE) position++; return position < length; } @Override public Part next() { if (hasNext() == false) // read away spaces throw new IllegalStateException("Iterator called when no next item exists!"); final Type initialState = getState(currentChar); Type state = initialState; while (position < length && state == initialState) { sb.append(currentChar); position++; if (position < length) state = getState(currentChar = string.charAt(position)); } final String part = sb.toString(); sb.setLength(0); return new Part(initialState, part); } private Type getState(char c) { return Character.isWhitespace(c) ? Type.SPACE : Character.isDigit(c) ? Type.DIGIT : Character.isLetter(c) ? Type.LETTER : Type.SEPARATOR; } }; }
An anonymous inner class implements the Iterator
interface to lazily return split parts of a string.
The final string
parameter is always visible and available for the inner implementation.
The hasNext()
method reads away spaces and returns false when no next split-part is available.
The next()
method delivers another split-part when the caller demands it, not earlier.
This makes the sorter two times faster than the old version.
The next()
implementation also determines the type of the part, whether it is a number,
a separator or a word. Spaces are ignored completely here.
The next performance tuning attempt will be to rewrite this to always compare characters, not string parts, unless a digit sequence is starting on both input strings. Let's see how fast this will evolve :-)
Keine Kommentare:
Kommentar veröffentlichen