Paul Buis' Blog

Monday, May 2, 2016

JavaScript programming with ES6 in Node.JS: Sets and Generators

I enjoy working on problems involving words. For example, I like the Jumble puzzle in the newspaper. To solve these, one can use a word list file such as the ones provided by Peter Norvig that go with his chapter Natural Language Corpus Data in the Book Beautiful Data that simply contain one word per line.

Sets of Strings

In ES6 JavaScript, I read this file with the following code:

"use strict";

const fs = require("fs");
/**
 * @returns {object} representung set of lexicon words,
 *                   lexicon[word]===true
 *                   if and only if word is in lexicon
 */
function readLexicon() {
    const lines = fs.readFileSync("words.txt", "utf-8").split("\n");
    var line; 
    const lexicon = {};
    for (line of lines) {
        line = line.trim();
        if (line.length > 0) {
            lexicon[line] = true;
        } 
    }
    // could just return lexicon, but freezing object should improve
    // performance and make returned object immutable
    return Object.freeze(lexicon);
}

The file is read in and split into an array of lines. The object returned from this function has a property with the value true for each word in the dictionary. For every other string, querying corresponding property returns undefined, which is treated just like false when used as the condition in of an if statement. This gives us a simple way to check for membership in a set of strings.

Iterators

Iterators over a set of values in JavaScript are objects with a method next(), that returns an object with two properties: value and done. As long as done is false, value has a value. When the iteration is complete, done becomes true and the object returned by next does not have a value property. Normally, iterators are used in the JavaScript for ( ... of ...) loop construct. So for example, for (x of ['a', 'b', 'c']) will create an iterator that iterates over the array setting x to next().value for successive values in the array.

To convert an iterable object into an iterator, one can use the function* and yeild* keywords as seen in the following Node.js interactive session (using node v5.4.1):

 function* foo() {
... yield* ['a', 'b', 'c'];
... }
undefined
> const it = foo()
undefined
> it.next()
{ value: 'a', done: false }
> it.next()
{ value: 'b', done: false }
> it.next()
{ value: 'c', done: false }
> it.next()
{ value: undefined, done: true }

Generators

So, what kind of a strange function is that with an asterisk in the previous code snippet? That is what is called a "generator." Here is a slightly more enlightening example using a plain yeild rather than a *yield:

function* f() {
  yield 2;
  yield 3;
  yield 4;
}

The mental model is that f() is invoked over and over. When a yield statement is encountered it returns a value and the function is suspended. The next time the function is invoked, it resumes where it had previously been suspended. This model is not what really happens under the hood. A more accurate model might be that the yielded values are placed in a concurrent queue where they wait until next(), dequeues them (or next() waits for them to be produced). Again, since JavaScript doesn't have concurrency, this model isn't quite right either. The exact mechanism used under the hood is unimportant to us. The difference between yield and yeild* is that yeild generates a single value, and yeild* generates a sequence. However, yeild* is only usable inside a function*.

Putting this all together to solve the Jumble is the following program (to be placed in a file along with the readLexicon() function:

function* f() {
function* permutations(str) {
    
    function* permute(prefix, str) {
        const n = str.length;
        if (n == 0) {
            yield prefix;
        }
        let i = 0;
        for (; i < n; i += 1) {
            yield* permute(prefix + str.charAt(i), str.substring(0,i) + str.substring(i+1, n) );
            
        }
    }
    
    yield* permute("", str);
}

const lexicon = readLexicon();
var s;
for (s of permutations("ourf")) {
    if (lexicon[s]) {
        console.log(s);
    }
}

The program should print out all permutations of "ourf" that are in the word list file. This should include the common word "four," but may also include the uncommon word "furo" depending on how complete the word list file is. The problem with large lexicons is that they tend to include more uncommon words along with uncommon variations on common words.

Monday, June 16, 2014

Higher Education: "Ivory Tower" Movie

It looks like the new movie "Ivory Tower" is yet another source of the incorrect information bombarding potential college students and their parents with the notion that college costs are rising much faster than inflation. At Ball State, it has clearly been shown that college costs are going down after adjusting for inflation. Of course, "cost" is not what parents and students are really interested in. They care about "price." The difference is state government subsidy for in-state students. The legislature has slashed subsidies, so the price is going up while the cost is going down. If you are looking for someone to blame, blame the politicians who want to cut your taxes and/or blame yourself for being unwilling to pay taxes that are lower than in any other part of the civilized world.

Politics, good financial decision making, and honest analysis tend not to go hand in hand. Folks got upset when prisoners were being given college educations for free, so we stopped educating prisoners. Now, they are just that much more likely to come back to prison when the get released. The net cost in real dollars: way more than the cost of a college education (not including the direct and indirect costs of crime, just the cost of keeping them in prison). Do basic math and make rational decisions instead of getting all worked up about how bad you've got it.

Friday, June 13, 2014

Programming Language: Pet Peeve - Confusing "static" with "strong" typing and type declaratinons

One of my pet peeves is folks who claim this or that programming language is "strongly typed" when they really mean it is "statically typed." In a statically typed programming language, the types of variables and expressions is determined prior to run time, i.e., language processing time a.k.a. "compile time." This is in contrast to "dynamically typed" languages where the types are not known until run time. In such languages, the same variable might even have different types at different points of run time. In a dynamically typed system applying the "+" operator to a number and a string won't cause problems at language processing time, but at run time three things might happen: an error, a conversion of the number to a string resulting in a concatenation, or an attempt to convert the string to a number resulting in numerical addition.

Strongly typed languages never have run-time type errors. This is a lot stricter than simply static type checking. This also means at least well-defined behavior for things like an index out of bounds issue on an array or dereferencing an invalid pointer such as a "null." The ML language and some of its derivatives achieve strong typing by simply not supporting arrays or pointers. C is statically typed, but can't be considered strongly typed because it may silently error at run-time with these kinds of errors. Java throws exceptions for these kinds of errors which is well defined behavior and can be considered distinct from an error. I'm not convinced Java is strongly typed and throwing exceptions seems like a less desirable thing than simply preventing such problems at language processing time.

Strong typing also implies that even explicit type casts can't have undefined behavior and certainly means that implicit type casts won't be attempted if information might be lost. So, if your language implicitly converts from a 32-bit integer to a 32-bit IEEE floating point value, it probably isn't strongly typed. If you can downcast in an OO hierarchy your language might not be strongly typed either.

Yet another confusion with static typing is the notion that static typing requires explicitly declaring types of variables. Again, ML is the classical counterexample where type inference is used rather than explicit type declaration. Newer language systems such as C++11, D, and Swift all use type inference to at least reduce the amount of explicit type declaration that needs to be done. Type inference requires more sophisticated language processors and increases the amount of time needed to compile a program. Hence, while the technology to do type inference is fairly old, many language designers avoided this feature until Moore's law made it more practical. In type inference, if one sees an expression like "i+1" the type of i can be inferred to be the same as the type of 1, probably an integer. When using complicated types whose declarations require a lot of typing and are often non-obvious, type inference makes the programmer's job much easier.

Thursday, June 12, 2014

Programming Languages: Discriminated Unions are called Enums with Attributes in Swift

In C, there was this scary thing called a union which was declared like a struct, but the fields of a union all shared the same address instead of being a separate offesets from the beginning of a struct. The fields could be of different types which made type errors easy to commit. Normally, one wrapped a union in a struct that had a "tag" field that would let the programmer tell which field name to use (and hence avoid type errors). Typically, one would use a switch on the tag and then have a case of for each of the possible tag values.

In more advanced languages, the compiler would make sure that when a specific tag value was present, only the associated field name of a union could be referenced. E.g., the field reference needed to be nested inside an "if" or a "switch/case" that had checked on the tag. If a "switch" was being used in such a language for this purpose, all possible tag values needed to have associated "case" labels. In these languages the data structure was referred to as a "discriminated" union and it was considered to be type safe.

Microsoft's F#, having descended from ML, has such a feature with this name. Interestingly, when accessing an F# discriminated union from C#, one sees the discriminated union type as an abstract base class with no data value, and each of the fields as an inner subclass of the abstract base class. In Apple's new Swift programming language, "enums with attributes" look like they get used with the "enum" part playing the role of a tag and each tag's attributes playing the role of field values in a discriminated union. Since enums are more common now than unions, this should make explaining them easier, but without an understanding of unions and how they are used, it that explanation will need to be done in a vacuum. Java also has enums that have attributes, but each possible enum value ends up having the same attribute type(s), so one has much less flexibility and Java enums should not be considered equivalent to discriminated unions. Swift's allowance for each enum value to have a different attribute types gives its enum types a different spin than Java's enums and makes them equivalent to ML/F# discriminated unions.

If one has both many (N) possible field values and many (M) possible operations one wants to perform you see two different patterns emerge if you have designed this kind of thing with discriminated unions or with a flat inheritance hierarchy with virtual methods. With discriminated unions, you end up with M functions, each "fat" with a big switch statement checking N possible field values. With OO inheritance, you have N classes, each with M virtual functions override a pure virtual function in the base class. Unless one field value is much more common then the rest and you check it at the top of each switch statement the discriminated union approach of sequentially having N machine level branches will probably be slower than virtual function dispatch which takes longer than a single branch, but can still be done in constant time.

The question of which is easier to maintain and expand, however, depends on what happens after the early design and implementation phase. Each time one adds a new potential field value, the discriminated union approach requires that each fat function be modified to insert a new case branch, and the compiler will complain if the new case branch isn't added. Each time one adds a new potential field value to the OO design, one needs a whole new class with a new virtual function for each operation. The compiler should you if you've failed to override all the pure virtuals from the base class as long as the inheriting subclass is not itself abstract. One way a single large chunk of code is inserted, in the other a bunch of small pieces have to be inserted scattered.

Maintenance and expansion goes is the other way around if you are adding a new operation to the design. This time, with the discriminated union approach, you need to add one big fat function including a switch statement and a long chain of cases. With the OO design, you need to add one virtual to each subclass. Again, the difference is between a single large chunk of code and a bunch of scattered pieces, but this time the what you have to do is flip-flopped the other way around between the two design approaches.

In most realistically complex problems, it could be argued that the OO design is really not a flat hierarchy and many subclasses in the system would not need to override the behavior of their super-classes. However, in that case the discriminated union approach becomes a set of nested discriminated unions and each function needs to cover fewer cases, so they both simplify. It just isn't clear to me that one approach is fundamentally better than the other, although my OO-loving colleagues would, I'm sure, try to tell me discriminated unions are bad because, for them, axiomatically OO="good."

Programming Languages: Swift

With all the buzz about Apple's new Swift programming language for iOS to go along with Objective-C, I had to read the big thick manual they put out and several other bloggers' opinion pieces. It should come as no surprise that developers are pleased with Swift, since Objective-C is an aging language.

Right away, you should note that at this point in time you need XCode 6 to program in Swift. XCode 6 is still just in beta and only available on MacOS. When Microsoft came out with C#, one similarly could only program in C# using Visual Studio in Windows. Now, C# is also available in using the open source, cross-platform Mono development system. One can hope that something similar emerges for Swift down the road.

Swift comes out of the box with type inference, pattern matching, generics, closures/lambdas, hashes/dictionaries, tuples, and string interpolation. Many of these features have been added to languages such as Java, C#, and D either long after the programming language was introduced or via a runtime library API. The runtime library API approach never makes data structures like hashes and tuples feel quite the same as built-in support to parse literals of those types.

Optionals
A lot of fuss is made about Swift's support for "optionals." Such types are similar to F# options or the C# generic "Nullable" class. Again, Swift handles this with a built-in type rather than a library API class. As a built-in type, the ? and ! syntax for Swift optionals make them read more smoothly than an API, although not all that different than the F# discriminated union implementation of the same idea.

Pattern Matching
I'm fond of F#'s pattern matching (which comes with its roots in ML), and I'm really happy to see pattern matching in Swift both with tuples and enumerations with associated values. The Swift language definition doesn't use the words "pattern matching," but the bindings that happen when using the keyword combination "case let" in a switch statement cause the same kind of variable binding to occur while doing a successful control flow definition as with F#'s "match" statement.

Access Control
One criticism I read was that Swift didn't contain any support for access control with keywords like "public" or "private." The same can be said of JavaScript, but like JavaScript it doesn't seem like Swift should need a "private" keyword because it supports both constructors and closures. If one can nest the definition of a method inside the definition of a constructor, the method gets access to the local variables of the constructor. Like private data members in languages with a special keyword, these local variables can only be accessed by the nested member functions. Hence, this criticism would seem to be unfounded and based on a lack of understanding of how closures work.

Concurrency
A larger concern of mine would be the seeming lack of support for concurrency. Swift does not seem to support anything similar to the "synchronized" keyword in Java or C# or an API class supporting futures. Apple's devices are multi-core and even a single-core device with single threaded applications can benefit performance-wise from concurrency. It seems a shame that some kind of basic, limited support for concurrency isn't built into the initial design of a modern language.