Blog-Archiv

Samstag, 30. Juli 2016

JS Document Text Search on Client Side

Did you know that ....

  • Cognitio est aurum futurorum (Latin)
  • Scio estas la oro de la estonteco (Esperanto)
  • Knowledge is the gold of the future (English)
  • Wissen ist das Gold der Zukunft (German)

Forget gold. Forget oil. We are going Computer:-)

To get knowledge you need tools, primarily search tools. For example, for me it would be important to find articles in my Blog archive that contain a certain search pattern. From anywhere, online.


Now, I don't have access to the server where my Blog resides, so I can't install a web-service for this purpose there. (It would be a server-side indexing of documents like Lucene provides, and a web-server searching these indexes and responding with according document links.)

So I want to have a client-side Blog-search-tool that I can start from anywhere, and that does not need a web-server doing the search-work. Is this possible?

Surely not for big data, because the browser itself has to carry out such a search. On the other hand, why not provide it for small data like my Blog, and for HTML documents with relative URLs only? Here is a concept:

  1. for every search launch, retrieve all linked documents, using asynchronous AJAX (the browser would cache the results, so the next search would be much faster)

  2. for each document, parse its content by temporaryDocument = document.implementation.createHTMLDocument() and setting temporaryDocument.body.innerHTML = ajaxResponseText

  3. remove all script-elements from result document, then retrieve plain text (without HTML markup) using temporaryDocument.body.textContent

  4. search the pattern in resulting plain text

Mind that AJAX will deny to load URLs that are not of same origin as the calling page.

The script introduced by this article is written in pure JavaScript, and it does not use any external library. I do not raise any legal claim on this source code, it is free to be used anywhere.

HTML

Here is the outline of my test page.

<!DOCTYPE html>
<html>
  <head>
    <meta name="viewport" content="initial-scale=1"/>
    <meta charset="UTF-8"/>
    
    <title>Document Text Search on Client Side</title>
    
  </head>
  
  <body>
    <!-- search input fields -->
    <div>
      <input id="search-text" type="text" placeholder="Full Text Search" />
      <input id="search-button" type="button" value="Go!"/>
    </div>

    <!-- linked documents -->
    <table id="result">
      <tr><td>106</td><td><a href='Refactoring_JS_List_Filters__Part_One.html'>Refactoring JS List Filters, Part One</a></td><td>2016-06-21</td></tr>
      ....

      <tr><td>1</td><td><a href='Things_Are_Changing.html'>Things Are Changing</a></td><td>2008-02-26</td></tr>
    </table>

    <script type="text/javascript">
    (function() {
      // JavaScript that executes full text search in linked documents
    })();
    </script>
       
  </body>
  
</html>

On top are a text-field and a button to start a search. The table element contains all linked documents that must be full-text-searched. Any row holding a link that does not contain the search pattern must be set invisible by the script. Is there no search pattern, all rows will be set visible again.

This page should look like this:


Script

Here comes the script, to be inserted where "// JavaScript that executes full text search in linked documents" is, part by part. Of course you will also need to provide documents that can be searched.

Mind that this is a raw JS implementation, not encapsulated in modules. Mind further that this search will work for HTML documents only, not for PDF, JPG, MP3, ....

Network

      var getAjaxResponse = function(ajax, url) {
        if (ajax.readyState === 4)
          if (ajax.status === 0 || ajax.status === 200) /* 0 occurs with file protocol */
            return ajax.responseText;
          else
            throw "Error status for "+url+" is "+ajax.status+", message: "+ajax.statusText;
  
        return undefined;
      };
      
      var fetchResourceAsync = function(url, toCallOnLoad) {
        var ajax = new XMLHttpRequest();
        
        ajax.onreadystatechange = function() {
          var responseText = getAjaxResponse(ajax, url);
          if (responseText)
            toCallOnLoad(responseText);
        };
  
        ajax.open("GET", url, true);
        ajax.send();
      };

This code should run on any modern browser (not IE below 9). The fetchResourceAsync() function receives an URL and returns, asynchronously, its content. To process the content, the caller must provide a callback-function toCallOnLoad.

The getAjaxResponse() function is responsible to check for invalid HTTP responses.

Text Processing

      var getContent = function(htmlText) {
        htmlText = htmlText.replace(/<br\/?>/gim, ' ');
        /* replace <br> tags by space, browser does not do that, that leads to word joins */
        
        var tmpDoc = document.implementation.createHTMLDocument();
        tmpDoc.body.innerHTML = htmlText;
        
        removeScripts(tmpDoc.body);
        
        var text = "";
        for (var i = 0; i < tmpDoc.body.children.length; i++)
          text += tmpDoc.body.children[i].textContent+"\n";

        return text;
      };
      
      var removeScripts = function(element) {
        var toRemove = [];
        for (var i = 0; i < element.children.length; i++) {
          var child = element.children[i];
          if (child.tagName === "SCRIPT")
            toRemove.push(child);
          else
            removeScripts(child);
        }
        for (var i = 0; i < toRemove.length; i++) {
          element.removeChild(toRemove[i]);
        }
      };

That's the way how the URL content is processed as soon as the toCallOnLoad AJAX callback function is called: it will call getContent() to get plain text. This function constructs a new HTML document and sets the received text into it. It then removes all SCRIPT tags and returns plain text.

The removeScripts() function is a recursive DOM-traversal that removes all SCRIPT elements from a given DOM element.

Search

      var searchPatternInResource = function(resource, regExp) {
        fetchResourceAsync(resource, function(text) {
          text = getContent(text);
          var match = regExp.exec(text);
          if (match)
            displayResult(resource);
        });
      };
      
      var searchPatternInResources = function(searchPattern, resources) {
        for (var i = 0; i < resources.length; i++)
          searchPatternInResource(resources[i], new RegExp(searchPattern, "im"));
      };
      
      var search = function() {
        var searchPattern = getSearchTextElement().value;
        if ( ! searchPattern || ! searchPattern.trim() ) {
          resetResultList();
        }
        else {
          clearResultList();
          searchPatternInResources(searchPattern, getResources());
        }
      };

The search() function is the event-callback for the search-button (or ENTER in text-field). It fetches the search-pattern from text-field, resets the document-list when nothing was entered, or starts the search when the text-input was not empty.

It calls the searchInResources() function which loops over all linked documents, calling searchInResource() with each. It passes them a regular expression built from the search-pattern. Mind that this enables regular expressions, so you should know what characters you may use on testing! The RegExp is constructed with "im", the "i" means "ignore case", the "m" means "multiline search".

The searchPatternInResource() function finally starts the asynchronous AJAX call to retrieve the document. It passes an anonymous function to fetchResourceAsync() which turns the AJAX-result into plain text by calling getContent(), then executes the regular expression over the result. It sets the document link to displayed when the text matches the regular expression. The function regExp.exec() returns null when no match was found, and JS would evaluate null to false. Mind that null is not undefined.

DOM Access

Until here the script will apply on any page. Below comes the part that most likely you will have to rewrite for your own page. It contains the DOM access, that means the way how the script works together with the surrounding HTML page.

      var getResultList = function() {
        return document.getElementById("result");
      };
      
      var getResources = function() {
        var linkList = getResultList().querySelectorAll("a");
        var resources = [];
        for (var i = 0; i < linkList.length; i++)
          resources.push(linkList[i].getAttribute("href"));
        
        return resources;
      };
      
      var displayResult = function(resource) {
        var resultList = getResultList().children[0].children;
        for (var i = 0; i < resultList.length; i++)
          if (resultList[i].children[1].children[0].getAttribute("href") === resource)
            resultList[i].style.display = "";
      };
      
      var clearResultList = function() {
        resetResultList(true);
      };
      
      var resetResultList = function(clear) {
        var resultList = getResultList().children[0].children;
        for (var i = 0; i < resultList.length; i++)
          resultList[i].style.display = clear ? "none" : "";
      };
      
      var getSearchTextElement = function() {
        return document.getElementById("search-text");
      };
      
      
      /* Initialization, listener installation */
      
      getSearchTextElement().addEventListener("keydown", function(event) {
        if ((event.which || event.keyCode) === 13)
          search();
      });
      
      document.getElementById("search-button").addEventListener("click", search);

As you can see, the getResultList() function refers to the table with the according id, containing all document links. The functions getResources(), displayResult(), clearResultList(), resetResultList() use it accordingly to manage the list of linked documents. These implementations are definitely page-specific. Mind that the statement element.style.display = ""; will set the element visible.

The getSearchTextElement() function refers to the text-field where you input the search-pattern.

In the trailing initialization an event-listener is installed on ENTER keypress of the text-field, and on click of the search-button. In both cases, the search() function is installed as callback.

Summary

Although this is a quite restricted search, you could blow it up to control

  • case-sensitivity
  • word-bounds
  • turn off regular expressions (should be default!)
  • search optionally also in HTML code, or STYLE or SCRIPT elements.

When you want to see and test this, you can go to the Blog-search on my homepage. When you expand the title, you see two different searches, the lower one being the document full-text search.




Sonntag, 24. Juli 2016

JS goes ES6

You may have asked yourself why the IT business uses acronyms so much. Or should I say abuse? There is no more justification for abbreviations and acronyms in a world of Terabyte disks with 12 GB memory on personal computers. So please pardon my conformity, here is the title again:

JavaScript is going to be EcmaScript 6

ECMA originally was the acronym for European Computer Manufacturers Association, but "the name is no longer considered an acronym and no longer uses full capitalization". So we have a meaningless acronym more. Didn't you get used to it yet?


This Blog is a short look into the coming ES-6 world, syntactically sugared with my own opinion :-)

If you're still on a year 2015 desktop-browser, you could update to e.g. Firefox 47 and try out ES-6 code, most of it is already working. Except that nasty little thing called "dependency management", until now solved by CommonJS and AMD, but going to be a browser responsibility. We have to wait for that? The browser support page doesn't even mention it.

And then it may take 2 more years until Internet users will be able to consume ES-6 web applications. So there's no reason to hurry. On mobile browsers, ES-6 support is missing widely.

ES-6 Links

You may want to read the facts first.

Here are articles about the urgently needed dependency management:

ES-6 is much more complex than JS. Like Scala, it adopts techniques of functional languages.

It has constants. It has block-scoped variables and functions. It has classes. It has arrow-functions (lambdas, closures). It has typed arrays, maps, sets, even weak ones. String and Number support improved, internationalization basics are also provided.

Long Missed Spice

Encapsulation

Comes as export statements and modules. The top-level variables and functions of a module are not global in the window context. To make a function or constant globally available you need to export it. No more global namespace pollution.

Within a module, the this object is undefined, and the mode is always "strict", that means we do not need "use strict"; any more.

Modules are not scripts. Scripts still can not have imports or exports, just modules can. There will be a new type of script-tag for defining modules within HTML: <script type="module">.

Dependency Management

The import statement uses relative URLs as you would use on an <IMG> tag. There is a one-to-one mapping between modules and files: a module must be in a file.js, and a file is a module.

You can then locally rename the things you imported. You can import selectively, i.e. just one function or constant, not the whole module. You can not change any imported variable, it will be a constant.

A Promise based module loader API can be used to load modules programmatically in scripts.

Classes

So now JS actually is going to be object-oriented. What I saw on the features-page convinced me. The keywords new, this and super work as expected. We'll see what gotchas are behind it.

What did not convince me is the multiple-inheritance example. Multiple inheritance is unsafe, confusing, and mostly not needed. For example, Java doesn't have it, except on interfaces, but here you inherit just responsibilities, no concrete implementations. Moreover the example is too complex, and seems to be not generic. Showing muscles :-?

Block Scope

We can forget about variable hoisting and keep our code refactorable. The var keyword is dead, long live the let keyword.
Although - is this intuitive?

Constants

Why did this last so long? Constants definitely are needed in any programming language, not only the functional ones. In JS everything is public, thus we want to protect certain immutable values from being changed unintentionally.

Default Values for Parameters

In the parameter list of a function, you can assign default-values to the parameters, for the case that the caller does not define them. This makes the error-prone value = value || "(not given)"; statements dispensable.

Sugar

I do not mention the other ES-6 features because I do not consider them to be important. Moreover they are perfectly introduced on their web page. E.g. the arrow-function is just a shorthand for an anonymous function, and in JS this was a solid closure. So why promote copy & paste code in lambdas against anonymous functions that we could quickly refactor to be named and reusable?

Generally I think that ES-6 will make JS code harder to read. Look at the Generators and Iterators. I bet I'm not the only one that skipped these chapters.

Writing source code may be fun, but is just 20% of the real efforts related to a product that you sell to customers. The 80% maintenance start after the code has been written virtuoso, using all possible tricks, ignoring any convenience. Some temporarily hired high-paid expert then will look at those uncommented tirades and try to fix the bugs within them.

Code written in a simple and restrictive programming language is much easier to maintain. "Freedom of expression" refers to human social behaviour, not to communication with machines. Programming languages should not provide a playground. Here the idea of "reducing to bare minimum" (see below) would be valuable: reduce syntactical sugar!

So, you might know what it means to consume too much sugar :-) ES-6 is rich of it. This is going to be an expert language, inducing even more per-developer-dialects than JS. We will need further books with titles like "The Useful Parts of ...".

Finally a Semicolon :-)

ES-6 encourages you to leave out the semicolon:

.... But this era is gone ... perfectly support automatic semicolon inference/insertion ... coders nowadays can get rid of nearly all semicolons and remove clutter from their source code. Ralf S. Engelschall is a strong promoter of reducing source code to its bare minimum. Hence, in his personal opinion ECMAScript 6 should be coded with as less syntactic sugar as possible and hence without semicolons. ....

... clutter ... reducing code to its bare minimum ... All human expressions need a context, there is no meaning without context. When you read source code, you know the rules of the programming language. A semicolon denotes the end of a context. Although it's not a statement terminator in JS, it's still a statement separator. It makes the code more readable, and it also makes lots of other things easier.

For example I use JavaScript in articles of this Blog. When I back-up all my Blog articles, I get a big XML file, where the HTML of one article is packed into a single line of text, with all newlines removed. Without semicolons, the JS of that article would not work any more when I unpack and view the HTML. Yeah, we could file an issue to the Blog exporter ....

No, I don't want to take part in any semicolon discussion. Thank god it's still optional in ES-6, and that's what I'll use. But I'm not sure whether I will use the arrow-function, it's too sweet :-)




Sonntag, 17. Juli 2016

Refactoring JS List Filters, Part Two

Refactoring my JS list filters has led to more than just encapsulating functions into modules. I found that it makes sense to integrate some general search aspects to have really reusable source code.

What are these aspects? A filter input field ...

  • ... could refer to a certain column of the table to filter, e.g. the "Title" column, or the "Creation Date" column
  • ... could denote an alternative or a restriction (resulting in logical AND / OR conditions)
  • ... must be combinable with others arbitrarily (resulting in parentheses around AND / OR conditions)

A fully fledged implementation of search-filters would end in a "logical expressions" library, in other words, you could write field-conditions using comparators like ==, <, <=, >, >=, contains, ..., attribute them by logical operator NOT, and then concatenate them by using logical operators AND and OR, whereby you would express precedences by enclosing certain parts into parentheses (see examples below).

But I don't want to go that far with my Blog article filters. I just want to be able to perform following ...

Use Cases

I want to ...

... find articles about at least one of several predefined semantics (checkboxes)
... revert that, i.e. display articles that are NOT about any of several predefined semantics
... restrict to articles contained in a "Best Of" list
... restrict to titles containing a search-term, entered at runtime in a text-field
... restrict to a creation date-range (not referring title but creation-date)

Examples would be:

  1. "Find articles about 'HTML' or 'Java', restricted to "Best Of" articles created in year 2015."
    ( article.searchWords.contains("HTML") OR article.searchWords.contains("Java") ) AND bestOfList.contains(article.identity) AND article.creationDate.contains("2015")

  2. "Find articles about none of the predefined search-terms ('HTML', 'CSS', 'JavaScript', 'Java', 'LINUX'), combined with those containing 'LINUX', restricted to articles with titles containing 'script'
    ( NOT ( article.searchWords.contains("HTML") OR article.searchWords.contains("CSS") OR .... ) OR article.searchWords.contains("LINUX") ) AND article.title.contains("script")

Design Decisions

In particular it's even more subtle, because when I turn on the "JavaScript" checkbox, I also must search "JS" and "jQuery" in titles. That means, each such checkbox carries several search terms, not just one.

But anyway, I will not provide parentheses but hardcode them, because in the examples above you always see the same pattern:

  1. collect articles using the alternative filters (OR), in case none were activated, provide all articles
  2. shorten that collection by applying restricting filters (AND)

For every article title, I will check it against all restrictions (if any), and then, if it passes, check against all alternatives (if any).

User Interface Prototype

Title Search

In first layout-row there are alternatives, in second restrictions.
Mind that not all checkboxes are OR-filters, the "Best Of" checkbox is an AND-filter.

You can try out these filters, and see all source code, on my Blog Archive.

JS Implementations

There are two abstract modules, a number of concrete extensions for filter-fields, and a string utility (on bottom of this article).

Filter Input Field Wrappers

Abstract Wrapper

This module encapsulates exactly one

  • INPUT field,
  • its context (table column),
  • its working mode (case-sensitivity),
  • its logical operator (restricting = AND, not restricting = OR),
  • the fact whether it should use the word-array (labelWords) for matching titles
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
    /**
     * Wraps a filter input field and its current state.
     */
    var abstractFilterField = function(inputField, column, caseSensitive, restricting, usesWordArray)
    {
      var self = {};
      
      self.addChangeListener = function(callbackFunction) {
        inputField.addEventListener('change', callbackFunction);
      };
      
      self.matchesOne = function(labelText, labelWords) {
        var searchWords = self.getSearchWords(column);
        
        for (var i = 0; i < searchWords.length; i++) {
          var searchWord = searchWords[i];
          if (usesWordArray && searchWord.indexOf(' ') < 0) {
            if (labelWords.indexOf(searchWord) >= 0)
              return true;
          }
          else if (labelText.indexOf(searchWord) >= 0) {
            return true;
          }
        }
        return false;
      };
      
      self.getSearchWords = function() {
        throw 'Extension must implement getSearchWords() !';
      };
      
      self.isValid = function() {
        throw 'Extension must implement isValid() !';
      };
      
      self.isRestricting = function() {
        return (restricting === undefined) ? false : restricting;
      };
      
      self.isCaseSensitive = function() {
        return (caseSensitive === undefined) ? false : caseSensitive;
      };
      
      self.getColumn = function() {
        return column; 
      };
      
      return self;
    };

Remember that a JS module is a function. We can call that function, and then enrich the returned object with concrete function implementations.

For all INPUT fields, adding a change-listener and matching some article-title is solved here. Mind that matchesOne() implementation does not check isValid(), so the caller must take care of this.

Concrete Wrappers

We have two types of HTML elements: checkboxes and text fields. For checkboxes, we have the alternatives and the "Others" checkbox (that inverts their meaning).

The text-filter can be easily implemented by deriving the abstractFilterField module. I set it to be restricting = true by default, when no such parameter was given on construction.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
    var textFilter = function(inputField, column, caseSensitive, restricting, usesWordArray)
    {
      restricting = (restricting === undefined) ? true : restricting;
      
      var self = abstractFilterField(inputField, column, caseSensitive, restricting, usesWordArray);
      
      self.getSearchWords = function() {
        return inputField.value ?
            stringUtil.toArrayWithoutEmpty(inputField.value, ',', ! caseSensitive) :
            [];
      };
      
      self.isValid = function() {
        return inputField.value && inputField.value.trim().length > 0;
      };
      
      return self;
    };

This module overwrites the exception-throwing functions of the abstraction. Mind that you can also concatenate search words by ',' here, which would make such fields also using OR conditions.

You will see how to use this module soon, first lets implement the checkboxes. There is also an abstraction for the two types of them.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
    var abstractCheckboxFilter = function(inputField, column, caseSensitive, restricting)
    {
      var self = abstractFilterField(inputField, column, caseSensitive, restricting, true);
      
      self.isValid = function() {
        return inputField.checked;
      };
      
      return self;
    };

This encapsulates the fact that a checkbox provides its state in the checked attribute, and hardcodes the usesWordArray parameter to true. Here come concrete derivations.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
    var checkboxFilter = function(checkbox, column, caseSensitive, restricting, SEARCH_WORDS_ATTRIBUTE_NAME)
    {
      SEARCH_WORDS_ATTRIBUTE_NAME = SEARCH_WORDS_ATTRIBUTE_NAME || 'search-words';
      
      var cachedSearchWords = stringUtil.toArrayWithoutEmpty(
          checkbox.getAttribute(SEARCH_WORDS_ATTRIBUTE_NAME),
          ',',
          ! caseSensitive);
      
      var self = abstractCheckboxFilter(checkbox, column, caseSensitive, restricting);
      
      self.getSearchWords = function() {
        return cachedSearchWords;
      };
      
      return self;
    };

This module reads its search words on construction from given attribute of the INPUT element. The getSearchWords() implementation returns them.

Next is the "Others" checkbox, which needs to know all other alternative-fields to invert their result, thus it receives them on construction.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
    var othersCheckboxFilter = function(checkbox, column, otherCheckboxFilters)
    {
      var self = abstractCheckboxFilter(checkbox, column);
      
      var otherSearchWords = [];
      
      for (var c = 0; c < otherCheckboxFilters.length; c++) {
        var searchWords = otherCheckboxFilters[c].getSearchWords();
        for (var w = 0; w < searchWords.length; w++)
            otherSearchWords.push(searchWords[w]);
      }
      
      var superMatchesOne = self.matchesOne;
      
      self.matchesOne = function(labelText, labelWords) {
        return ! superMatchesOne(labelText, labelWords);
      };
      
      self.getSearchWords = function() {
        return otherSearchWords;
      };
      
      return self;
    };

This implementation extracts the search-terms from all given otherCheckboxFilters and stores them to a private variable. These are returned by getSearchWords(), which is called by matchesOne(). The overwrite of matchesOne() calls superMatchesOne, and then simply reverts the result. That's the power of overriding, it makes this complicated looking functionality very simple!

Mind that all these wrappers will be separate object instances, with private field instances, because abstractFilterField always returns a new object when called.

You might wonder how the "Best Of" field can be implemented by these wrappers, because obviously it got no specific module. We can wrap this field by constructing it with restricting = true, and pointing its column to the first table-column, which carries the order number of the article. The search-words of this field then must be a comma-separated list of numbers of "Best Of" articles. That means, the page author must write these numbers.

Remember that this still is a client-side search, we do not access the linked document in any way by AJAX.

List Filter

This is the heart piece. We are going to process conditions from filter fields now. There will be one list filter instance per table to filter, holding all its filter fields. There is only an abstract module, each page must provide adapter code to manage HTML elements of the table or list to filter. For clearness I will show that adapter code now first.

      <div >Title Search</div>
      
      <div>
        <label><input type='checkbox' class="title-alternative" search-words='JS, JavaScript, jQuery'/>JavaScript </label>
        <label><input type='checkbox' class="title-alternative" search-words='CSS, Cascading Style Sheets'/>CSS </label>
        <label><input type='checkbox' class="title-alternative" search-words='HTML, Page'/>HTML </label>
        <label><input type='checkbox' class="title-alternative" search-words='Java, Constructors'/>Java </label>
        <label><input type='checkbox' class="title-alternative" search-words='LINUX, UNIX, vi, AWK, GIMP'/>LINUX </label>
        <label><input type='checkbox' id='others'/>Others </label>
        <br>
        <label><input type='checkbox' id='best-of' search-words='101,94,92,90,79,78,77,75,70,61,58,52,47,45,43,40,37,36,29,23,20,12,6'/>Best Of </label>
        <label style="padding-left: 2em;"> Title <input type='text' id="title-column-filter" placeholder="JS, CSS, HTML" style="width: 8em;"/></label>
        <label> Created <input type='text' id="date-column-filter" placeholder="2015, 2016" style="width: 6em;"/></label>
      </div>
      
      <table>
        <tr><td>106</td><td><a href='Refactoring_JS_List_Filters__Part_One.html'>Refactoring JS List Filters, Part One</a></td><td>2016-06-21</td></tr>
        ....
        <tr><td>1</td><td><a href='Things_Are_Changing.html'>Things Are Changing</a></td><td>2008-02-26</td></tr>
      </table>

This is a fragment of the HTML page. You see how synonyms for search-terms are defined, and how the "Best Of" filter is written. The table below is the filter target, exposing the column structure.

Following JS code, wrapped in a <script type="text/javascript"> tag, makes use of the ids and classes defined in that HTML.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
    (function() {
      /** Concrete filter-implementation for this page */

      /* retrieving filter fields */
      var checkboxes = document.body.querySelectorAll('input.title-alternative');
      var others = document.getElementById('others');
      var bestOf = document.getElementById('best-of');
      var titleTextField = document.getElementById('title-column-filter');
      var dateTextField = document.getElementById('date-column-filter');
      
      /* map fields to list-columns */
      var columns = [ "number", "title", "creation-date" ];
      var filters = [];
      
      for (var i = 0; i < checkboxes.length; i++)
        filters.push(checkboxFilter(checkboxes[i], columns[1]));
      
      filters.push(othersCheckboxFilter(others, columns[1], filters));
      
      filters.push(checkboxFilter(bestOf, columns[0], false, true));
      filters.push(textFilter(titleTextField, columns[1], false));
      filters.push(textFilter(dateTextField, columns[2], false));
      
      /* build a list-filter */
      var listFilter = abstractListFilter(filters);

      /* tell list-filter where the rows to filter are */
      listFilter.getRows = function() {
        var table = document.getElementsByTagName('table')[0];
        var tbody = table.children[0];
        return tbody.children;
      };

      /* tell list-filter how to get text from a row and column */
      listFilter.getColumns = function() {
        return columns;
      };
      
      /* tell list-filter how to get text from a row and column */
      listFilter.getText = function(row, column) {
        if (column === columns[0])
          return row.children[0].childNodes[0].textContent;
        
        if (column === columns[1])
          return row.children[1].children[0].childNodes[0].textContent;
        
        if (column === columns[2])
          return row.children[2].childNodes[0].textContent;
        
        throw "Not implemented column: "+column;
      };
      
    }());

Starting at line 5, the filter fields are retrieved. This implementation anticipates that there is just one table to filter on this page.

Then, on line 12, columns of the table are named. These names are arbitrary, but they must be used at construction of filter fields (lines 16-22) and the getColumns() and getText() overrides (line 35 and 40) consistently.

From line 13-22, fields are wrapped by constructing adequate objects for them. They receive their input-field, column, and restricting flag.

A list of them is then passed to the construction of an abstract list-filter on line 25. That filter is concretized by overriding three of its functions (that most likely would throw an exception). The getColumns() function returns the list of column names, the getRows() function knows how to retrieve table rows from the HTML document, and finally the getText function can return a text for a specific table cell, addressed by row and column-name.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
    /**
     * Abstract implementation of a list filter.
     * @param filters list of fields representing abstractFilterField instances.
     */
    var abstractListFilter = function(filters)
    {
      /* private */
      
      var matchFilters = function(restricting, filters, labelText, labelTextUpperCase, labelWords, labelWordsUpperCase) {
        for (var f = 0; f < filters.length; f++) {
          var filter = filters[f];
          var labelTextParam = filter.isCaseSensitive() ? labelText : labelTextUpperCase;
          var labelWordsParam = filter.isCaseSensitive() ? labelWords : labelWordsUpperCase;

          if (filter.matchesOne(labelTextParam, labelWordsParam)) {
            if ( ! restricting )
              return true; /* short OR-cycle */
          }
          else if (restricting) {
            return false; /* short AND-cycle */
          }
        }
        return restricting;
      };
      
      var matchRowColumn = function(restricting, row, column, validColumnFilters) {
        var labelText = self.getText(row, column).trim();
        var labelTextUpperCase = labelText.toUpperCase();
        var labelWords = stringUtil.toArrayWithoutEmpty(labelText, ' ');
        var labelWordsUpperCase = stringUtil.toArrayWithoutEmpty(labelTextUpperCase, ' ');
        
        return matchFilters(restricting, validColumnFilters, labelText, labelTextUpperCase, labelWords, labelWordsUpperCase);
      };
      
      var isValidFilter = function(filter, restricting, column) {
        return filter.isValid() &&
                filter.isRestricting() === restricting &&
                filter.getColumn() === column;
      };
      
      var getValidFilters = function(restricting, column) {
        var validColumnFilters = [];
        for (var f = 0; f < filters.length; f++)
          if (isValidFilter(filters[f], restricting, column))
            validColumnFilters.push(filters[f]);
        return validColumnFilters;
      };
      
      var matchRowColumns = function(restricting, row, columns) {
        var filtersExist = false;
        
        for (var c = 0; c < columns.length; c++) {
          var validColumnFilters = getValidFilters(restricting, columns[c]);
          
          if (validColumnFilters.length > 0) {
            filtersExist = true;
            
            if (matchRowColumn(restricting, row, columns[c], validColumnFilters)) {
              if ( ! restricting )
                return true; /* short OR-cycle */
            }
            else if (restricting) {
              return false; /* short AND-cycle */
            }
          }
        }
        return filtersExist ? restricting : true; /* no filters means: accept */
      };
      
      var filterRow = function(row, columns) {
        var displayed = true;
        
        if ( ! matchRowColumns(true, row, columns) ||
             ! matchRowColumns(false, row, columns) )
          /* one denied, or none agreed */
          displayed = false;
        
        self.setDisplayed(row, displayed);
      };
      
      var self = {};

      /* public */
      
      /**
       * The filter-listener callback, carried out when the user
       * clicks a checkbox or leaves a textfield, to be installed
       * by concrete list-filter implementations.
       */
      self.filter = function() {
        var rows = self.getRows();
        var columns = self.getColumns();
        
        for (var r = 0; r < rows.length; r++)
          filterRow(rows[r], columns);
      };
      
      /**
       * Concrete sub-classes CAN override this to support several table columns.
       * @return array of table column identifiers. This implementation
       *    returns an array containing abstractListFilter.DUMMY_COLUMN.
       */
      self.getColumns = function() {
        return [ abstractListFilter.DUMMY_COLUMN ];
      };
      
      /**
       * Concrete sub-classes MUST implement this.
       * @returns array of rows to filter.
       */
      self.getRows = function() {
        throw 'Extension must implement getRows() !';
      };

      /**
       * Concrete sub-classes MUST implement this.
       * @param row required, a row as returned from getRows().
       * @param column optional, a column of the row to get text from,
       *    as returned from getColumns().
       * @returns the text from given row in given column,
       *    used to compare with search-words.
       */
      self.getText = function(row, column) {
        throw 'Extension must implement getText() !';
      };

      /**
       * Concrete sub-classes CAN override this.
       * This implementation sets the *display* style-property of the row.
       * @param row required, the TABLE row to set filtered-out or filtered-in.
       * @param displayed required, when true, the row matches the filters, false when not.
       */
      self.setDisplayed = function(row, displayed) {
        row.style.display = displayed ? '' : 'none';
      };
      
      
      for (var i = 0; i < filters.length; i++)
        filters[i].addChangeListener(self.filter);
      
      
      return self;
    };

    /** The single-column-identifier when getColumns() was not overridden. */
    abstractListFilter.DUMMY_COLUMN = "dummy-column";

Read this from bottom to top. On bottom you find the public part, while private implementations are on top.

On the very bottom (line 138) you see that an event listener function is installed into all filter fields. The filter() function will be called by them all.

The public part contains three functions that are meant to be overwritten: getColumns(), getRows(), getText(). The getColumns() provides a default implementation in case there is just one column, but the other two MUST Be overwritten, else exceptions will be thrown.

The setDisplayed() is public for convenience-overrides that want e.g. to gray-out non-matching rows instead of making them invisible.

The private part contains the real complexity. The public filter() function, called on each change-event, retrieves all table rows and loops over them, calling filterRow(). The responsibility of that private function is to set the row either displayed or invisible by calling setDisplayed().

To find out whether a row is displayed or not, the filterRow() function calls matchRowColumns() twice, once asking for all restricting filters, once for all non-restricting. Should none of them match, the row is regarded to be not displayed.

The matchRowColumns() function lopps over column-names. It retrieves all valid filters assigned to one column and conforming to the restricting flag, and delegates to matchRowColumn() to find out whether the filters match.
Should one column match, and restricting is false, it performs a "short cycle" and immediatley returns true.
Should one column not match, and restricting is true, it also immediatley returns false.
Further this function finds out whether filters existed for any column, and returns true when not. Thus all rows would be displayed if no filter is present or valid.

The getValidFilters() and isValidFilter() functions are responsible for retrieving filter fields connected to a column-name and according to a restricting-flag.

The matchRowColumn() column now decides for exactly one table cell whether its filters match it. It retrieves the text from the cell, provides different representations of it, and then passes that to matchFilters().

There, finally, the filter fields are looped to match the cell text, in one of its representations. Here the matchesOne() function of abstractFilterField is called. Again a short cycle is performed when one of the match-results allows an immediate return.

String Util

Here are the string functions used in above implementations, wrapped into a singleton module to be easily reachable.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
    /**
     * String utility singleton.
     */
    var stringUtil = (function()
    {
      var PUNCTUATION_CHARACTERS = ',:?!.;()';

      /** @returns true when given character is a punctuation character. */
      var isPunctuation = function(character) {
        for (var i = 0; i < PUNCTUATION_CHARACTERS.length; i++)
          if (PUNCTUATION_CHARACTERS.charAt(i) === character)
            return true;
        return false;
      };

      /** 
       * @param the string from which a character should be read.
       * @param first true when to read first character, false for last.
       * @returns the first or last character from given string,
       *    undefined when string was empty.
       */
      var getTrimChar = function(string, first) {
        if ( ! string )
          return undefined;
        return string.charAt(first ? 0 : string.length - 1);
      };

      /** 
       * @param the string from which a character should be removed.
       * @param first true when to remove first character, false for last.
       * @returns the trimmed string, undefined when string was empty.
       */
      var removeTrimChar = function(string, first) {
        if ( ! string )
          return undefined;
        return first ? string.substring(1) : string.substring(0, string.length - 1);
      };
      
      var self = {};
      
      /**
       * @param toSplit the string containing words to scan.
       * @param splitChar the character that separates the words in toSplit.
       * @param toUpper optional, when true, all strings are converted to upper-case.
       * @returns an array of words scanned from given toSplit,
       *    trimmed, punctuations removed, upper-case, no empty strings being in array.
       */
      self.toArrayWithoutEmpty = function(toSplit, splitChar, toUpper) {
        var rawWords = toSplit.split(splitChar);
        var words = [];

        for (var i = 0; i < rawWords.length; i++) {
          var rawWord = rawWords[i].trim();

          for (var l = 0, locations = [ true, false ]; l < locations.length; l++) {
            var location = locations[l];
            while (rawWord.length > 0 && isPunctuation(getTrimChar(rawWord, location)))
              rawWord = removeTrimChar(rawWord, location);
          }

          rawWord = rawWord.trim();
          if (rawWord.length > 0)
            words.push(toUpper ? rawWord.toUpperCase() : rawWord);
        }
        return words;
      };
     
      return self;
      
    }());

Summary

You can watch this working in my Blog Archive on my homepage, expand the title, then expand "Semantic Search". If you use your browser's "View Page Source" context menu item, or press Ctrl-U, you will see all source code that was described here.