Wednesday, 22 February 2017

Three More Letterboxd Lists

After completing my Monsters of the Movies list, I decided to curate three more lists on letterboxd.

" Top 50 Horror Movies (2016)" is fairly self-explanatory, providing that you realise that they're horror movies from the 2016 TSPDT list, not just movies released in 2016! So far, I've seen forty of them.

"99 Years, 99 Movies" has one film from each year 1901 to 1999 inclusive. Mostly, they're critically-acclaimed movies, but for the sparse years at the beginning of the century, I selected films that are popular on letterboxd. I've currently only seen 39 of the 99.

Finally, "Twentieth Century Movies by Duration" is a quirky list of 100 films ordered by duration in minutes from 61 minutes to 160 minutes (as reported by letterboxd). Again, it includes classic movies with some oddballs to fill the gaps. I've seen 68 of them.

Needless to say, I'm going to be ploughing through these lists over the next few months, desperately trying to improve my "scores".

STOP PRESS: TSPDT has released their 2017 list, but I don't believe any of the lists above will have changed as a result.

Computus 4

I noticed a few days ago that Dr J R Stockton's web pages on Computus are now longer available. The last impression of the site in the Wayback Machine was 2015-09-07. In it he lists a number of JavaScript algorithms for the calculation of the date of Easter Sunday in the Gregorian calendar. The result of the following functions are "the day of March" such that 1 indicates March 1st, 32 indicates April 1st, etc.

The first version is my own humble attempt (here translated to C/C++):

  int EasterDayOfMarch(int year) {
    // 25 ops = 2 div, 3 rem, 2 mul, 5 shift, 8 add, 5 sub
    int a = year % 19;
    int b = year >> 2;
    int c = (b / 25) + 1;
    int d = (c * 3) >> 2;
    int e = ((c << 3) + 5) / 25;
    int f = (a * 19 + d - e + 15) % 30;
    int g = f + ((29578 - a - (f << 5)) >> 10);
    int h = g - ((year + b - d + g + 2) % 7);
    return h;

This is derived from Gauss's algorithm and subsequent revisions. However, the calculation of 'g' is my own work and chosen such that it can be computed on a 16-bit machine efficiently (though it works for all integer sizes).

There are other minor variations outlined on Dr Stockton's page:

    // Lichtenberg
    int g = 28 + f - ((f + a / 11) / 29);


    // Hutchins
    int g = 28 + f - ((a + f * 11) / 319);  

There is also a C implementation which is the equivalent of:

    // Pemberton
    int g = 28 + f - (int)((f + g + (int)(a > 10)) > 28);

As mentioned previously, Al Petrofsky came up with another variation on the theme:

  int EasterDayOfMarchPetrofsky(int year) {
    // 21 ops = 4 div, 3 rem, 4 mul, 1 shift, 6 add, 3 sub
    int a = (year % 19) * 6060;
    int b = year >> 2;
    int c = b / 25;
    int p = (c * 2267) - ((b / 100) * 6775) + 3411;
    int q = ((a + ((p / 25) * 319) - 1) % 9570) / 330;
    int r = 28 + q - (year + b + p + q) % 7;
    return r;

This cleverly uses fewer operations, though requires integer sizes greater than sixteen bits for larger years (not a great imposition, these days!)

To convert a "day of March" 'h' to an actual date, the following code can be used:

  void Easter(int year, int* month, int* day) {
    int a = year % 19;
    int b = year >> 2;
    int c = (b / 25) + 1;
    int d = (c * 3) >> 2;
    int e = ((c << 3) + 5) / 25;
    int f = (a * 19 + d - e + 15) % 30;
    int g = f + ((29578 - a - (f << 5)) >> 10);
    int h = g - ((year + b - d + g + 2) % 7);
    int i = h >> 5; // is it April?
    *month = i + 3;
    *day = (h & 31) + i;

Sunday, 27 November 2016

Monsters of the Movies - Achievement Unlocked

After almost forty years, I've finally managed to see all the films listed in Monsters of the Movies. I've been tracking my progress in a spreadsheet and a letterboxd list and over the last few months, I've been slowly filling my gaps by downloading the films from or watching the films online. The exception was Mad Love (1935), which I had to get through Amazon as a Spanish-subtitled DVD and then watch with the subtitles turned off!

Amongst the recently-watched films, the two stand-out gems for me were The Manster (1959) a.k.a. The Split:

and The Abominable Dr. Phibes (1971):

I kept the latter until the very end ... and I wasn't disappointed.

Lexicon 2

To finish off the discussion of my Lexicon demo, I'll outline the in-memory data structure used to store the dictionary.

In the first part, I described how the lexicon is stored trivially-compressed for transmission "over the wire". The words are then expanded in memory into a form that makes searching simple and efficient.

The main data structure is a four-dimensional array of JavaScript objects named "Lexicon.byRaritySubsetLength":

  1. The first index is an integer rarity between one and five inclusive: one means "not very rare", whereas five means "esoteric".
  2. The second index is an integer bit-field (one to seven inclusive) which encodes which dictionaries the corresponding list is part of: TWL06, SOWPODS and SCOWL.
  3. The third index is the integer length of the word (two to fifteen inclusive).
  4. The fourth index is an integer index into a list sorted alphabetically (starting from zero).
By default, the demo looks at all rarities and word lengths for the SOWPODS lexicon. This means that without changing the criteria, when matching words, the client must look through 5 * 4 * 14 = 70 lists of words.

Each word entry is a JavaScript object of the following form:

    word : string
    rarity : integer
    subset : integer
    mask : integer

The first three fields contain the uppercase word, rarity index and lexicon subset mask. The last field is used when searching for anagrams. It is a bit-field that encodes which letters are contained within the word: 1 for "A", 2 for "B", 4 for "C", 8 for "D", etc.

For instance, to find anagrams for "WORDS" in SOWPODS, the client needs to search through only 20 lists of words: five rarities (1 to 5), four lists for SOWPODS and one word length (5). These lists contain 12,478 entries which can be scanned quickly by looking for those that have a "mask" with the bits for "D", "O", "R", "S" and "W" all set. It will typically come up with the solution ("SWORD" and "DROWS") within a couple of milliseconds.

Tuesday, 11 October 2016

Mario T-Shirt

From an original idea by David Gavilan Ruiz

Sunday, 18 September 2016

Mortality Statistics

The Office for National Statistics collects mortality data for the United Kingdom. There's lots of information hidden within the data, but some of it is quite difficult to extract and/or visualise.

The underlying data for questions of life expectancy and mortality rates would seem to be a statistic named "Qx" (raw data). According to their guide, this is defined by the ONS as:
The mortality rate between age x and (x +1); that is, the probability that a person aged x exactly will die before reaching age (x +1).
It is sometimes expressed as a probability multiplied by 100,000. You can plot this for males as a surface with one axis being the age, "x", and the other being the year of observation of "Qx":
The noise towards the "back" of the graph is due to the fact that concrete data beyond the age of 100 is very sparse, so probabilities are difficult to calculate accurately. The next derived value is "Lx" which the ONS defines as:
The number of survivors to exact age x of 100,000 live births of the same sex who are assumed to be subject throughout their lives to the mortality rates experienced in the 3-year period to which the national life table relates.
I've approximated that as a normalised probability:

L[0] = 1
L[i] = L[i-1] * (1 - Q[i-1])

This gives a monotonically decreasing graph that reminds me a bit of a zero curve from fixed-income finance.
You can already see that the shape of the "Lx" curve has changed appreciably between 1850 and 2000. But the graphs (which I prepared using the excellent plotly) become easier to interpret if we add more colour levels:
You can start to see that there are strange "striations" in the surface. We now look at "Dx", defined as:
The number dying between exact age x and (x +1) described similarly to Lx.
I used the following formula:

D[i] = L[i] - L[i+1]

This gives a surface that has a few interesting features clearly visible:

  1. The infant mortality rate (defined here as the number of deaths between 0 and 1 years, i.e. D[0]) has dropped from over 15% in 1850 to less than 0.5% after 2000.
  2. The adult life expectancy has improved; particularly after about 1950. This is indicated by the increase in height of the hump near the "front" of the surface, along with a gradual shift of the highest point towards greater ages. I suggest this is due to the introduction of the National Heath Service.
  3. There is a curious small green hump in the plateau near Year=1918, Age=20.
If, instead of graphing "Dx", we graph "log10(Dx)", we see even more features:
We can see that there is a second hump a couple of decades later that the original. Obviously these are due to the two World Wars (1914-1918 and 1939-1945 for Britain). However, the humps are present (though to a lesser extent) in the female graph too:
I was somewhat confused by this but it appears that some UK "life tables" (as actuaries optimistically call them) do not include deaths of service men and women on active duty. So perhaps the humps account for only civilian casualties: e.g. bombing victims and deaths due to general privations caused by the wars.

Another feature I found quite striking was the "male late-teen spike" that is seen better in this view:
There is a relatively sudden increase in male deaths in late-teens. This is well-known phenomenon that is nowhere near as pronounced in the female data:
In fact, after World War II, for females older than one year, log10(Dx) becomes amazingly linear until it peaks at 88 years old (for 2013 data).

One final feature, that I originally thought was a data or plotting bug, is the diagonal cleft you can just see in the green peak to the far right of the male (and female) Dx surfaces:
The diagonal cleft is aligned at exactly forty-five degrees:
Year=2000, Age=81 approx
Year=1950, Age=31 approx
Year=1940, Age=21 approx
Year=1930, Age=11 approx
Year=1920, Age=1 approx
My first theory is that the cleft is due to deaths in World War II. If a large percentage of young people are killed in the space of a few years, then this age range will be "under-represented" in the death statistics in subsequent years.

My second theory involves not the death rate, but the live birth rate. There was a noticeable dip in live births during World War I:

This dip wasn't seen during Second World II. Essentially, if there were fewer babies born in 1917-1919, there will be relatively fewer eight-one-year-olds in the year 2000.

So which theory is correct? I honestly don't know. Perhaps neither, Perhaps both. It may just be due to an accident of history that the drop in birth rate during World War I coincides with the tragic loss of men and women in their early twenties during World War II.

Interactive graphs here. Data courtesy of ONS.

Sunday, 11 September 2016

Lexicon 1

For a long time, I've been thinking about implementing a word puzzle dictionary to help solve (or cheat at, if you prefer) crosswords, codewords, Scrabble, and the rest. You know the sort of thing:
"One across, nine letters: C-blank-O-blank-S-blank-blank-R-blank" *
I've always assumed that running this on the client-side in JavaScript would be far too slow. It turns out I was very wrong; if you're careful with your coding, you can get it running in a browser on a low-end tablet or phone with memory and CPU cycles to spare.

My first problem was getting hold of a good lexicon or word list. The obvious avenue to explore was the "official" Scrabble word list used in competitions, but it quickly became apparent that there are at least two such lists:
These are simply lists of words that are considered "valid"; the former is over 178,000 words and the latter over 267,000 words. They do not contain definitions nor any metadata about the words such as frequency.

In contrast, SCOWL (Spell Checker Oriented Word Lists) contains over 423,000 words with information such as relative frequency and regional variations (American/Canadian/British). Some of the words are suspect (domain-specific abbreviations and bad optical character recognition are two likely sources of suspicious "words") but the frequency information makes it easy to cull outliers.

Instead of choosing just one word list, I decided to go for all three and divided the superset of words into five levels of "rarity" based on the SCOWL frequency metadata. It turns out that SOWPODS is a superset of TWL06, but contains some words not found in SCOWL.

After a lot of curation (in Excel, of all tools!) I ended up with a list of 432,852 words split into five levels of increasing "rarity". Each word is implicitly tagged with the word lists in which it is found. I limited the words to those between two and fifteen letters (inclusive) where each "letter" is one of the twenty-six Roman letters "A" to "Z" (no digits, apostrophes, hyphens or accents are allowed).

I could have stored the lexicon as a simple text file or as a JSON array of strings, but this would have taken up several megabytes, so I decided to have a go at trivial compression. I know, I know ... most HTTP servers will serve up compressed files out of the box and there exist any number of JavaScript libraries that do client-side decompression of ZIP files, but I felt like re-inventing a wheel. Give me a break!

The main constraint I gave myself was to try to get the lexicon to the client-side in a usable form as quickly as possible. This meant a trade-off between file size and time taken to decompress and store in a suitable data structure. One thing you don't want to do is sort half a million words into alphabetical order on the client. That means the lexicon (or parts thereof) needs to be pre-sorted.

Sorted word lists have a lot of redundancy at the beginning of each entry. For example, imagine a list starting like this:
  • ...
If we replace the beginning of each word with a digit ('0' to '9') specifying how many letters from the previous word we should replicate, the list becomes:
  • 8S
  • 4WOLF
  • 7VES
  • ...
Another technique is to substitute common letter sequences with ASCII characters outside the ranges '0' to '9' and 'A' to 'Z'. It turns out that the most common four-letter sequences near the end of the words are:
For three-letter sequences, the list is:
While the most common near-end two-letter sequences are:
Using these two simple techniques, the 432,852 words can be compressed into a single ASCII string of just over 1.5 million characters. This string can be efficiently converted into a high-performance in-memory data structure within a second on a typical PC.

The full demo can be found here, whilst the JavaScript file that decompresses the word list is lexicond.js.

* "CLOISTERS" obviously