
Saturday, 30 July 2016

JS Document Text Search on Client Side

Did you know that ....

  • Cognitio est aurum futurorum (Latin)
  • Scio estas la oro de la estonteco (Esperanto)
  • Knowledge is the gold of the future (English)
  • Wissen ist das Gold der Zukunft (German)

Forget gold. Forget oil. We are going Computer:-)

To get knowledge you need tools, primarily search tools. For example, for me it would be important to find articles in my Blog archive that contain a certain search pattern. From anywhere, online.


Now, I don't have access to the server where my Blog resides, so I can't install a web-service for this purpose there. (That would mean server-side indexing of documents, like Lucene provides, and a web-server that searches these indexes and responds with the matching document links.)

So I want to have a client-side Blog-search-tool that I can start from anywhere, and that does not need a web-server doing the search-work. Is this possible?

Surely not for big data, because the browser itself has to carry out such a search. On the other hand, why not provide it for small data like my Blog, and for HTML documents with relative URLs only? Here is a concept:

  1. for every search launch, retrieve all linked documents, using asynchronous AJAX (the browser would cache the results, so the next search would be much faster)

  2. for each document, parse its content by temporaryDocument = document.implementation.createHTMLDocument() and setting temporaryDocument.body.innerHTML = ajaxResponseText

  3. remove all script-elements from result document, then retrieve plain text (without HTML markup) using temporaryDocument.body.textContent

  4. search the pattern in resulting plain text

Mind that AJAX will refuse to load URLs that are not of the same origin as the calling page (unless the other server explicitly permits it via CORS headers).
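To make the same-origin restriction concrete, here is a minimal sketch of a check that could be used to skip links the browser would refuse anyway. The function name isSameOrigin and the example URLs are assumptions for illustration, not part of the script below. It relies on the standard URL constructor, which older browsers such as IE do not provide, and the comparison is only meaningful for http and https pages.

```javascript
// Hypothetical helper: tells whether a link (relative or absolute)
// points to the same origin (scheme + host + port) as the page.
var isSameOrigin = function(href, pageUrl) {
  var target = new URL(href, pageUrl); // resolves relative links against the page
  var page = new URL(pageUrl);
  return target.origin === page.origin;
};

// In a browser, pageUrl would typically be window.location.href.
var page = "https://example.org/blog/index.html"; // assumed example URL
console.log(isSameOrigin("Things_Are_Changing.html", page));
console.log(isSameOrigin("https://other.example.com/x.html", page));
```

Relative links always resolve to the page's own origin, which is why the concept restricts itself to documents with relative URLs.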

The script introduced by this article is written in pure JavaScript and does not use any external library. I do not raise any legal claim on this source code; it is free to be used anywhere.

HTML

Here is the outline of my test page.

<!DOCTYPE html>
<html>
  <head>
    <meta name="viewport" content="initial-scale=1"/>
    <meta charset="UTF-8"/>
    
    <title>Document Text Search on Client Side</title>
    
  </head>
  
  <body>
    <!-- search input fields -->
    <div>
      <input id="search-text" type="text" placeholder="Full Text Search" />
      <input id="search-button" type="button" value="Go!"/>
    </div>

    <!-- linked documents -->
    <table id="result">
      <tr><td>106</td><td><a href='Refactoring_JS_List_Filters__Part_One.html'>Refactoring JS List Filters, Part One</a></td><td>2016-06-21</td></tr>
      ....

      <tr><td>1</td><td><a href='Things_Are_Changing.html'>Things Are Changing</a></td><td>2008-02-26</td></tr>
    </table>

    <script type="text/javascript">
    (function() {
      // JavaScript that executes full text search in linked documents
    })();
    </script>
       
  </body>
  
</html>

On top are a text-field and a button to start a search. The table element contains all linked documents that must be full-text-searched. The script must hide any row whose linked document does not contain the search pattern. When there is no search pattern, all rows are made visible again.

This page should look like this:


Script

Here comes the script, part by part. It is to be inserted where "// JavaScript that executes full text search in linked documents" is. Of course you will also need to provide documents that can be searched.

Mind that this is a raw JS implementation, not encapsulated in modules. Mind further that this search will work for HTML documents only, not for PDF, JPG, MP3, ....

Network

      var getAjaxResponse = function(ajax, url) {
        if (ajax.readyState === 4) { /* 4 means the request has finished */
          if (ajax.status === 0 || ajax.status === 200) /* 0 occurs with file protocol */
            return ajax.responseText;
          else
            throw "Error status for "+url+" is "+ajax.status+", message: "+ajax.statusText;
        }
  
        return undefined; /* response is not yet complete */
      };
      
      var fetchResourceAsync = function(url, toCallOnLoad) {
        var ajax = new XMLHttpRequest();
        
        ajax.onreadystatechange = function() {
          var responseText = getAjaxResponse(ajax, url);
          if (responseText)
            toCallOnLoad(responseText);
        };
  
        ajax.open("GET", url, true);
        ajax.send();
      };

This code should run on any modern browser (not IE below 9). The fetchResourceAsync() function receives a URL and delivers its content asynchronously. To process the content, the caller must provide a callback function toCallOnLoad.

The getAjaxResponse() function is responsible for checking for invalid HTTP responses.
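The concept relies on the browser cache to speed up repeated searches. Where that cache does not help, an in-memory cache around the fetch function is one possible alternative. The following is only a sketch: createCachingFetcher is a hypothetical name, and it assumes a fetch function with the same shape as fetchResourceAsync(url, toCallOnLoad). A stand-in fetch function is used here so that the example runs on its own.

```javascript
// Hypothetical wrapper: remembers the response text per URL, so that a
// second search does not trigger a second request for the same document.
var createCachingFetcher = function(fetchFn) {
  var cache = {}; // url -> response text
  return function(url, toCallOnLoad) {
    if (Object.prototype.hasOwnProperty.call(cache, url)) {
      toCallOnLoad(cache[url]); // served from memory, no request
      return;
    }
    fetchFn(url, function(responseText) {
      cache[url] = responseText;
      toCallOnLoad(responseText);
    });
  };
};

// Stand-in for fetchResourceAsync(), counting how often it is called.
var requestCount = 0;
var fakeFetch = function(url, toCallOnLoad) {
  requestCount++;
  toCallOnLoad("<p>content of " + url + "</p>");
};

var cachedFetch = createCachingFetcher(fakeFetch);
cachedFetch("a.html", function(text) { console.log(text); });
cachedFetch("a.html", function(text) { console.log(text); });
console.log("requests sent: " + requestCount);
```

In the page script, the wrapper would be created once as createCachingFetcher(fetchResourceAsync). Mind that a cached response reaches the callback synchronously, while a fresh one arrives asynchronously.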

Text Processing

      var getContent = function(htmlText) {
        /* replace <br> tags (also <br/> and <br />) by space; textContent does not do that, which would join words */
        htmlText = htmlText.replace(/<br\s*\/?>/gi, ' ');
        
        var tmpDoc = document.implementation.createHTMLDocument();
        tmpDoc.body.innerHTML = htmlText;
        
        removeScripts(tmpDoc.body);
        
        var text = "";
        for (var i = 0; i < tmpDoc.body.children.length; i++)
          text += tmpDoc.body.children[i].textContent+"\n";

        return text;
      };
      
      var removeScripts = function(element) {
        var toRemove = [];
        for (var i = 0; i < element.children.length; i++) {
          var child = element.children[i];
          if (child.tagName === "SCRIPT")
            toRemove.push(child);
          else
            removeScripts(child);
        }
        for (var i = 0; i < toRemove.length; i++) {
          element.removeChild(toRemove[i]);
        }
      };

This is how the URL content is processed as soon as the toCallOnLoad AJAX callback function is called: the callback calls getContent() to get plain text. This function constructs a new HTML document and sets the received text into it. It then removes all SCRIPT elements and returns the plain text.

The removeScripts() function is a recursive DOM-traversal that removes all SCRIPT elements from a given DOM element.
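The text returned by textContent still contains the line breaks and indentation of the HTML source, so a phrase whose words are separated by a newline in the source would not match a pattern containing a plain space. One possible remedy, sketched here under the assumption that whitespace differences do not matter for the search, is to collapse whitespace before searching. The name normalizeWhitespace is hypothetical.

```javascript
// Hypothetical helper: collapses every run of whitespace (spaces, tabs,
// line breaks, non-breaking spaces) into one space and trims both ends.
var normalizeWhitespace = function(text) {
  return text.replace(/\s+/g, " ").trim();
};

var raw = "Document full\n      text\tsearch on client side ";
var normalized = normalizeWhitespace(raw);
console.log(normalized);
console.log(/full text search/i.test(raw));
console.log(/full text search/i.test(normalized));
```

Such a call could be applied to the return value of getContent(). Mind that after normalization the text is a single line, so the "^" and "$" anchors no longer match at the original line boundaries.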

Search

      var searchPatternInResource = function(resource, regExp) {
        fetchResourceAsync(resource, function(text) {
          text = getContent(text);
          var match = regExp.exec(text);
          if (match)
            displayResult(resource);
        });
      };
      
      var searchPatternInResources = function(searchPattern, resources) {
        for (var i = 0; i < resources.length; i++)
          searchPatternInResource(resources[i], new RegExp(searchPattern, "im"));
      };
      
      var search = function() {
        var searchPattern = getSearchTextElement().value;
        if ( ! searchPattern || ! searchPattern.trim() ) {
          resetResultList();
        }
        else {
          clearResultList();
          searchPatternInResources(searchPattern, getResources());
        }
      };

The search() function is the event-callback for the search-button (or ENTER in the text-field). It fetches the search-pattern from the text-field, then either resets the document-list when nothing was entered, or starts the search when the text-input was not empty.

It calls the searchPatternInResources() function, which loops over all linked documents and calls searchPatternInResource() with each. It passes a regular expression built from the search-pattern. Mind that this enables regular expressions, so you should know which characters have a special meaning when testing! The RegExp is constructed with the flags "im": "i" means "ignore case", and "m" means "multiline", that is, "^" and "$" also match at the start and end of each line.
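Because the user's input is compiled as a regular expression, an incomplete pattern such as "(" makes the RegExp constructor throw a SyntaxError. A minimal sketch of how this could be caught follows. The name compileSearchPattern is hypothetical, and returning null for an invalid pattern is just one possible convention.

```javascript
// Hypothetical helper: compiles the search pattern with the same flags as
// the script ("im") and returns null when the pattern is not valid.
var compileSearchPattern = function(searchPattern) {
  try {
    return new RegExp(searchPattern, "im");
  }
  catch (error) {
    if (error instanceof SyntaxError)
      return null; // invalid regular expression entered by the user
    throw error;
  }
};

console.log(compileSearchPattern("(") === null);             // unterminated group
console.log(compileSearchPattern("fo+").test("Some FOOD"));  // "i": case is ignored
console.log(compileSearchPattern("^second").test("first\nsecond")); // "m": ^ matches after a line break
```

The search() function could call such a helper once, before the loop, and skip the search when it returns null.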

The searchPatternInResource() function finally starts the asynchronous AJAX call to retrieve the document. It passes an anonymous function to fetchResourceAsync() which turns the AJAX result into plain text by calling getContent(), then executes the regular expression on the result. When the text matches the regular expression, the document link is displayed. The function regExp.exec() returns null when no match was found, and JS treats null as false in a condition. Mind that null is not undefined.
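When regExp.exec() does find a match, it returns an array whose first element is the matched text and whose index property is the position of the match. As an illustration only, that result could be used to cut out some context around the hit. The name getMatchContext and the radius parameter are assumptions, and the sketch expects a regular expression without the "g" flag, like the one the script builds.

```javascript
// Hypothetical helper: returns the matched text plus "radius" characters
// on each side, or null when the pattern does not occur in the text.
var getMatchContext = function(text, regExp, radius) {
  var match = regExp.exec(text);
  if (match === null)
    return null; // exec() signals "no match" by null, not undefined
  var start = Math.max(0, match.index - radius);
  var end = Math.min(text.length, match.index + match[0].length + radius);
  return text.substring(start, end);
};

var text = "Knowledge is the gold of the future";
console.log(getMatchContext(text, new RegExp("gold", "im"), 4));
console.log(getMatchContext(text, new RegExp("silver", "im"), 4));
```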

DOM Access

Up to here, the script works on any page. Below comes the part that you will most likely have to rewrite for your own page. It contains the DOM access, that is, the way the script works together with the surrounding HTML page.

      var getResultList = function() {
        return document.getElementById("result");
      };
      
      var getResources = function() {
        var linkList = getResultList().querySelectorAll("a");
        var resources = [];
        for (var i = 0; i < linkList.length; i++)
          resources.push(linkList[i].getAttribute("href"));
        
        return resources;
      };
      
      var displayResult = function(resource) {
        var resultList = getResultList().children[0].children;
        for (var i = 0; i < resultList.length; i++)
          if (resultList[i].children[1].children[0].getAttribute("href") === resource)
            resultList[i].style.display = "";
      };
      
      var clearResultList = function() {
        resetResultList(true);
      };
      
      var resetResultList = function(clear) {
        var resultList = getResultList().children[0].children;
        for (var i = 0; i < resultList.length; i++)
          resultList[i].style.display = clear ? "none" : "";
      };
      
      var getSearchTextElement = function() {
        return document.getElementById("search-text");
      };
      
      
      /* Initialization, listener installation */
      
      getSearchTextElement().addEventListener("keydown", function(event) {
        if ((event.which || event.keyCode) === 13)
          search();
      });
      
      document.getElementById("search-button").addEventListener("click", search);

As you can see, the getResultList() function refers to the table with the corresponding id, which contains all document links. The functions getResources(), displayResult(), clearResultList() and resetResultList() use it to manage the list of linked documents. These implementations are definitely page-specific. Mind that getResultList().children[0] is the TBODY element that the HTML parser inserts into the table automatically, and that the statement element.style.display = ""; makes the element visible again.
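The displayResult() function depends on the link being the first child of the second cell of each row. A variant that is less tied to this layout could look up rows and links through querySelectorAll() and querySelector(). This is only a sketch: displayResultRow is a hypothetical name, and the table is passed as a parameter so that the function does not depend on the page.

```javascript
// Hypothetical variant of displayResult(): finds the link of each row by
// selector instead of by child position, then shows the matching row.
var displayResultRow = function(table, resource) {
  var rows = table.querySelectorAll("tr");
  for (var i = 0; i < rows.length; i++) {
    var link = rows[i].querySelector("a");
    if (link !== null && link.getAttribute("href") === resource)
      rows[i].style.display = ""; // empty string restores the default display
  }
};
```

On the test page the call would be displayResultRow(getResultList(), resource). Rows without a link, such as a header row, are simply left untouched.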

The getSearchTextElement() function refers to the text-field where you input the search-pattern.

In the trailing initialization, event listeners are installed for the ENTER key (keydown) in the text-field and for a click on the search-button. In both cases, the search() function is the callback.

Summary

Although this is a quite restricted search, you could extend it with options for

  • case-sensitivity
  • word boundaries
  • turning off regular expressions (should be the default!)
  • optionally searching also in HTML code, or in STYLE or SCRIPT elements.
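The first three of these options could all be handled at the point where the regular expression is built. The following is a minimal sketch, not the implementation used on my homepage. The names buildSearchRegExp and escapeRegExp and the option names are hypothetical.

```javascript
// Hypothetical helper: puts a backslash before every character that has a
// special meaning in a regular expression, so the text is searched literally.
var escapeRegExp = function(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
};

// Hypothetical builder; options: regularExpression, wholeWord, caseSensitive.
// All options are off by default, so the default is a literal, case-ignoring search.
var buildSearchRegExp = function(searchPattern, options) {
  options = options || {};
  var source = options.regularExpression ? searchPattern : escapeRegExp(searchPattern);
  if (options.wholeWord)
    source = "\\b(?:" + source + ")\\b"; // mind: \b knows ASCII word characters only
  return new RegExp(source, options.caseSensitive ? "m" : "im");
};

console.log(buildSearchRegExp("a.b").test("axb"));                              // literal dot
console.log(buildSearchRegExp("a.b", { regularExpression: true }).test("axb")); // dot as wildcard
console.log(buildSearchRegExp("gold", { wholeWord: true }).test("golden"));
console.log(buildSearchRegExp("Gold", { caseSensitive: true }).test("gold"));
```

In searchPatternInResources(), such a builder would take the place of the direct new RegExp(searchPattern, "im") call, with the options read from checkboxes next to the text-field.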

When you want to see and test this, you can go to the Blog-search on my homepage. When you expand the title, you see two different searches, the lower one being the document full-text search.



