Blog-Archiv

Montag, 5. Oktober 2015

Natural Sort Order in Java

Our usual sort order is the alphabetical. It would turn

  • a10
  • a2
  • a1
to
  • a1
  • a10
  • a2

One of the unusual things is that

human beings found out that this quite often is not what human beings would expect
(although it was implemented by human beings :-)

These people donated us natural sort order, which would turn the list above to

  • a1
  • a2
  • a10

... because 2 comes before 10.
Digits within the names are read as numbers and compared according to their nature.

Of course a lot of special cases are imaginable here: what about leading zeros, what about decimal point numbers, what about separators within the name. In this Blog I want to do a reference implementation of the natural sort operation in Java. This will not be sufficient for big lists, because it will be slow, but it should solve the problem. Performance tuning I do, as usual best practice, afterwards, and only when needed.

Find inspirations about natural sorting on following links:

Specification

In times of test-driven development I start to specify my implementation by a unit test. I will try to breed good test data there, and think about as many special cases as possible.

For now I will provide two arrays, one with names containing numbers, and one with numbers only. Both of the arrays will have peer arrays of same length which contain the items as expected by a natural sort order. These peer arrays will serve to assert the sort result.

import org.junit.Test;

/**
 * Contains examples of natural sort orders.
 */
public class NaturalSortTest
{
    final static List<String> names = Arrays.asList(new String[] {
        "a15 b",
        "a21",
        "12b",
        "11b",
        "2b",
        "1b",
        "a11b",
        "a2b",
        "a 14 b",
        "a22",
        "3 4",
        "a11",
        "a1",
        "3 3",
        "a2",
        "a 13b",
        "a1b",
        "a12b",
        "a12",
        "a21",
    });
    private final static List<String> sortedNames = Arrays.asList(new String[] {
        "1b",
        "2b",
        "3 3",
        "3 4",
        "11b",
        "12b",
        "a1",
        "a1b",
        "a2",
        "a2b",
        "a11",
        "a11b",
        "a12",
        "a12b",
        "a 13b",
        "a 14 b",
        "a15 b",
        "a21",
        "a21",
        "a22",
    });
    
    final List<String> numbers = Arrays.asList(new String[] {
        "07",
        "007",
        "02",
        "30",
        "1",
        "10",
    });
    final List<String> sortedNumbers = Arrays.asList(new String[] {
        "1",
        "02",
        "07",
        "007",
        "10",
        "30",
    });

}

You see that I already weaved in some oddities: leading numbers, trailing numbers, numbers in-between, and leading zeros. I defined List instances instead of arrays because these are easier to handle (will need the contains() method).

What is missing in this unit test is the test-method that actually sorts the arrays and compares them with the expected result. And, as I might modify the arrays quite often, I want to assert that any array has the same items as its peer containing the expected items.

public class NaturalSortTest
{
    .....
    .....

    @Test
    public void testSortOrders() {
        testComparator(new NaturalSortComparator());
        .....
    }
    
    private void testComparator(Comparator<String> comparator) {
        assertTestDataSanity("Name", names, sortedNames);
        assertTestDataSanity("Number", numbers, sortedNumbers);
        
        Collections.sort(names, comparator);

        for (int i = 0; i < names.size(); i++)
            assertEquals(comparator.getClass().getSimpleName()+" failed at name "+(i + 1)+", ", sortedNames.get(i), names.get(i));
            
        Collections.sort(numbers, comparator);

        for (int i = 0; i < numbers.size(); i++)
            assertEquals(comparator.getClass().getSimpleName()+" failed at number "+(i + 1)+", ", sortedNumbers.get(i), numbers.get(i));
    }

    private void assertTestDataSanity(String arrayName, List<String> names, List<String> sortedNames) {
        assertEquals("For "+arrayName, names.size(), sortedNames.size());
        for (String name : names)
            assertTrue(arrayName+" '"+name+"' is not in sorted list!", sortedNames.contains(name));
    }

}

Now I must think about how to implement the NaturalSortComparator, this is not part of the Java runtime libraries, I must implement it.

Design

The NaturalSortComparator will implement Comparator<String>. This is an interface, provided by the Java runtime libraries, that enables me to use the sort algorithm of the Java runtime, whatever this is, by simply calling the static method Collections.sort(numbers, new HyperIntelligentAndFastComparator()).

Actually I don't care whether Java uses QuickSort or TimSort, I know that it will always be the fastest available algorithm as long as Java is a "living language" (long live Java !-)

So here is the outline of what is needed:

/**
 * Implements a natural sort order where numbers within names
 * are detected and evaluated as numbers, not strings.
 * Thus "a1", "a10", "a2" would be sorted as "a1", "a2", "a10".
 * See unit test for more examples.
 */
public class NaturalSortComparator implements Comparator<String>
{
    private final boolean caseSensitive;
    
    /** Comparator that compares case-insensitive. */
    public NaturalSortComparator() {
        this(false);
    }
    
    /** Comparator that treats case-sensitivity according to given parameter. */
    public NaturalSortComparator(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    ....

}

Now it is time to do real work.

Implementation

The Comparator interface requires us to implement just the compare(String, String) method. Thus the "real work" must do the following:

  • split each of the given strings to compare with each other into parts, either words (letter sequences), numbers, or others (separators of any kind)
  • compare, in order, each split part, according to its nature, with its peer part, whereby a number comparison is done only when both parts are numbers
  • return -1 when first string parameter should go to head, or +1 when to tail
  • when all parts are equal, return 0 for equality.

For generating return values I have following Java runtime library methods available (assumed part1 and part2 are of type String): Long.compare(Long.valueOf(part1), Long.valueOf(part2)), part1.compareTo(part2). These will do the "real work" for us :-)

So here is my reference implementation. It is slow, because all strings must be split entirely before the evaluation of their parts can start. For a performant implementation I would try to avoid that.

public class NaturalSortComparator implements Comparator<String>
{
    .....
    
    /**
     * Splits the given strings and then compares them,
     * according to their nature, either as numbers or others.
     */
    @Override
    public int compare(String string1, String string2) {
        final Iterator<String> iterator1 = split(string1).iterator();
        final Iterator<String> iterator2 = split(string2).iterator();
        
        while (iterator1.hasNext() || iterator2.hasNext()) {
            // first has no more parts -> comes first
            if ( ! iterator1.hasNext() && iterator2.hasNext())
                return -1;
            
            // second has no more parts -> comes first
            if (iterator1.hasNext() && ! iterator2.hasNext())
                return 1;

            // get parts to compare
            final String part1 = iterator1.next();
            final String part2 = iterator2.next();
            
            int result;
            // if part is a number
            if (Character.isDigit(part1.charAt(0)) && Character.isDigit(part2.charAt(0))) {
                result = Long.compare(Long.valueOf(part1), Long.valueOf(part2));
                
                // if numbers are equal, then shorter comes first
                if (result == 0)
                    result = Integer.compare(part1.length(), part2.length());
            }
            else    {   // part is name or separator
                result = caseSensitive ? part1.compareTo(part2) : part1.compareToIgnoreCase(part2);
            }

            if (result != 0)    // found difference
                return result;
        }
        
        return 0;
    }

    /** Splits given string into a list of names, numbers and separators (others). */
    private List<String> split(String string) {
        final List<String> cachedList = cache.get(string);
        if (cachedList != null)
            return cachedList;
        
        final List<String> list = new ArrayList<String>();
        final StringBuilder sb = new StringBuilder();
        
        boolean digits = false;
        boolean letters = false;
        boolean others = false;
        
        for (int i = 0; i < string.length(); i++) {
            final char c = string.charAt(i);
            
            boolean isWhitespace = Character.isWhitespace(c);
            boolean isDigit = ! isWhitespace && Character.isDigit(c);
            boolean isLetter = ! isWhitespace && ! isDigit && Character.isLetter(c);
            boolean isOther = ! isWhitespace && ! isDigit && ! isLetter;
            
            if (isWhitespace || isDigit && ! digits || isLetter && ! letters || isOther && ! others)    {
                if (sb.length() > 0)    { // close current string part
                    list.add(sb.toString());
                    sb.setLength(0);
                }
                digits = isDigit;
                letters = isLetter;
            }
            
            if (isWhitespace == false)
                sb.append(c);
        }
        
        if (sb.length() > 0) // do not lose last part
            list.add(sb.toString());
        
        cache.put(string, list);
        
        return list;
    }

}

Next will be the performance tuning. I am already curious if it will require a lot of code changes. At least the Iterator returning the string-parts will have to be implemented explicitly.

Another to-do here would be locale-independent sorting, i.e. applying Collation.




Keine Kommentare: