Here are two IQ test questions for you:

- Fill in the blank in the sequence 1, 4, 9, 16, 25, __ , 49, 64, 81.
- What number comes next in the sequence 1, 1, 2, 3, 5, 8, 13, .. ?

Please stop and think about these questions before proceeding. **Spoiler alert**: the blog post reveals the answers.

First, don’t feel bad if it took you a while to answer these questions. In fact, if you didn’t answer them at all that’s a very good thing and I commend you (more on this later). But if you do have answers in mind, you can check them now. The answer key:

**72****12**

The pattern that explains why the answer to the first question is 72 may be subtle, but it’s simple. If the *n*th number in sequence is *f(n)*, then it’s clear that *f(n)* is just the period of the sequence *b(m)* defined by defined by .

Now I know what you’re thinking. You’re a so-called high intelligence quotient person and you *know* the answer is different. You think the pattern is just the squares 1², 2², 3², 4², 5², 6², 7², 8², 9² and that the blank should therefore be filled in with number 6² = 36. You think that it makes more sense because its a “simpler” pattern that explains the data. You are planning to comment on this post and you will invoke Occam’s razor. You will say this problem has appeared on many IQ tests with the answer 36 so that is what is *expected* in terms of an answer. You will explain that therefore it *is* the answer.

But what is a “simple” pattern? Let’s look at the second question. Here the *simplest* pattern that can explain the data is just that each number is the sum of the digits of the previous two numbers. So the next number in the sequence is **12** (=1+3+8). Yes, twelve. You shouldn’t be surprised that you got this wrong as well, and I know you did. Addition of digits of numbers is a common ingredient in sequence patterns on IQ tests. For example the problem of finding the next number in the sequence 58, 26, 16, 14, .. is a similar, albeit more difficult, IQ test question that is analogous my #2. But I know that you might have something else in mind. You’re thinking Fibonacci. You’re thinking there was a typo and 12 should have been 21. But I’m sorry.. no, no, no, **no**. An IQ test is not a math test. It’s about finding patterns, it is a test of *raw* intelligence.

What has happened here is that by merely asking you these two questions, **I’ve forced you to overfit**. There simply isn’t a meaningful way to choose a pattern from an enormous set of possibilities using only a handful of numbers. Yet this is exactly what I asked you to do, and what IQ tests ask one to do. They force one to overfit. **You know what one should call a**** test that encourages poor statistics hygiene? A “statistics deficit test” (SD test) instead of an “intelligence quotient test”.**

The mathematician Richard Guy uses many alliterations to describe the horror of overfitting from a handful of numbers:

Superficial similarities spawn spurious statements.

Capricious coincidences cause careless conjectures.

Early exceptions eclipse eventual essentials.

Initial irregularities inhibit incisive intuition.

These all capture the point that, as Guy says, “there aren’t enough small numbers to meet the many demands made of them”. For many good examples see his paper:

Richard K. Guy, The strong law of small numbers, The American Mathematical Monthly, 1988.

My favorite example of a pattern that is not what it appears to be is the sequence 1,1,1,1,1,1,1,1,1,1,1,.. continued for a total of 8424432925592889329288197322308900672459420460792432 times, and then followed by the number 8936582237915716659950962253358945635793453256935559. To understand this sequence one has to only identify what is an extremely obvious pattern. The *n*th term of the sequence is the greatest common divisor of two polynomials: *n*^{17}+9 and (*n*+1)^{17}+9. The greatest common divisor is therefore one for *n* up to 8424432925592889329288197322308900672459420460792432. Then the 8424432925592889329288197322308900672459420460792433rd greatest common divisor is 8936582237915716659950962253358945635793453256935559. Ok, maybe not so obvious and a somewhat unusual construction for a pattern. However this example makes another point. There cannot be certainty to the “solutions” on an IQ test, so that the term “answer” is a misnomer and its use is problematic. If a test asked for the 8424432925592889329288197322308900672459420460792433rd term of the sequence of ones above it is *likely* that it’s a one, but one cannot know for sure. Another way to say this is that IQ tests are implicitly asking test takers to accept null hypotheses instead of asking, at most, for a rejection.

There has been much debate on what IQ tests really measure and what they predict. But it seems to me that the answer is obvious. In order to score well on IQ tests one must be willing to overfit and to practice the p-value fallacy.** **Since such practices are synonymous with poor data science, I hypothesize that

**low**** IQ scores predict excellence in data science.**

## 15 comments

Comments feed for this article

June 4, 2018 at 7:52 am

ZingoIQ tests are generally multiple choice and only one answer will be correct.

June 4, 2018 at 8:39 am

Artem KaznatcheevIQ tests usually have the choices as numbers, not the candidate generating functions. Thus having only a few choices of numbers as the answer is not actually a restriction on the hypothesis class (of possible generating functions for the sequence) and thus doesn’t really alleviate the problem of over-fitting.

In particular, we could just give exclusively binary sequences and then have a choice of (a) 0 or (b) 1 and Lior could recreate the same sort of tricks. Although maybe they’ll be less cheeky.

June 4, 2018 at 9:57 am

ZingoBut then it doesn’t matter – the answer will be correct regardless of the uniqueness of the generating function.

June 4, 2018 at 8:23 am

Devang Mehta“Here the simplest pattern that can explain the data is just that each number is the sum of the digits of the previous two numbers.”

This is not obvious to me and I suspect to most members of H. sapiens who think in a language that involves punctuation 🙂

June 4, 2018 at 8:47 am

Artem KaznatcheevI know that your post is tongue-in-cheek but I wish that your hypothesis could be true, and that success (slyly shifting from your use of “excellence”) in data science was determined by intrinsic properties like not being bad at statistics. But since data scientist is a “high status” job, and IQ tests often feel more like “socio-economic desirability” certificates than tests of intrinsic properties…

June 4, 2018 at 10:50 am

Julian JamisonWhy is “sum of the digits of the previous two numbers” simpler than “sum of the previous two numbers”? They seem pretty similar to me.

June 4, 2018 at 5:29 pm

Lior PachterAny 5 year old kid will attest that adding up 1+3+8 is simpler than 13+8. With numbers things only get more difficult (and therefore complicated), whereas with digit addition it’s always simpler with numbers <=9.

June 4, 2018 at 6:03 pm

A KHow is it simpler in this case when 1+3+8 will force you to do two-digit addition and 13+8 will also force you to do two-digit addition? I’m not sure what your metric of ‘simplicity’ is.

June 4, 2018 at 11:33 pm

Julian JamisonHere’s another example: 9999999 and 1, or 1111111111 and 1. In each case adding them as numbers is trivial (you don’t even have to count the 9’s or 1’s; you can just visually line up your answer), whereas adding the digits is noticeably harder. Remember that my claim (the two methods are not obviously comparable) is much weaker than yours (one method is clearly simpler than the other)… especially in light of this very blog post 🙂 But I’ll ask my son when he turns five in a few months, and then we’ll know.

June 4, 2018 at 11:39 am

PEYou are playing a little with the notion of simplicity. I can’t see how extracting digits before a sum is simpler than just the sum, but maybe tenured math professors code directly on Turing machines, what do I know? In python, my solution would be 1. def f(n): return n**2 2. def f(n) return 1 if n <= 1 else f(n-1) + f(n-2) (forget efficiency) One has 4 tokens and the other 18 if I am not wrong. Other measures of complexity, like number of nodes in a parse tree, may be better, open to other ideas or languages. But I struggle to see your solutions as simpler, probably due to my very average IQ (and in fact I work as a data scientist, and I am greatly concerned with overfitting).

June 4, 2018 at 2:44 pm

sveronaSpivak’s _Calculus_ presents a similar pattern-finding exercise as a “perspicacity test,” which neatly sidesteps the problem here.

June 4, 2018 at 7:54 pm

inverseBWT(u$__anhnhidatdtSuoKauArrhrg)Lior is way too smart to believe what he types in his post. In the case of this specific post, and I am like many of his posts, he *must* be running some kind of a social experiment to gather ‘statistics’ on the effect of persuasion in this befuddled post-truth world 😉 — is my explanation simple enough, Lior? 😉

More seriously, here’s my $0.02’s worth:

Regarding “what is a simple pattern?”, a concrete treatment to simplicity/complexity of any hypothesis (based on observed data) can be handled information theoretically, and be quantified (in bits or nits). Kolmogorov complexity handles this theoretically, but Wallace’s statistical and inductive inference by Minimum Message Length (MML) embodies this more practically by linking lossless compression with statistical inference.

If the likelihood ratio between two alternative hypotheses on observed data is 1 (as in the series stated in the post), the bias in the prior ratio would help us prefer one hypothesis over another.

In MML, this is tantamount to saying that, if the second parts of the two-part lossless encoding (explaining the some observed data using different hypotheses) are the same, then the difference in the first parts of the respective two-part explanation messages can be used to select the simplest hypothesis. Note, both first part and second part, i.e., I(H) and I(D|H), are measures of Shannon’s Information content, where I(.) = -log(Pr(.)).

Moreover, the post claims that for the second series: “the simplest pattern that can explain the data is just that each number is the sum of the digits”. Notice, this hypothesis ‘fits’ the observed data/series (with 0 error) only in base 10. If data undergoes any invertible transformation (say base_2 or even factoradic base) the hypothesis although simple, does not explain the data well.

On the other hand, the obvious explanation of ‘Fibonacci’ numbers related over second order recurrence — or should one say ‘Hemachandra’ numbers, subject to Stigler’s law of eponymy 🙂 — holds even over invertible transformations of the original series.

So, it isn’t true that the two hypotheses are equally good, let alone equally simple.

Indeed, “pluralitas non est ponenda sine necessitate”, if you don’t just believe Ockham but also Bayes and Shannon.

June 6, 2018 at 12:56 am

DanLior, very nice try! You wrote “I’ve forced you to overfit.” I would argue that your model (of how you generated the data) is unnecessarily complex!

Basically, you are using a Rube Goldberg machine here https://en.wikipedia.org/wiki/Rube_Goldberg_machine .

The model complexity and and fitting errors have been study extensively in information theory. See Kolmogorov complexity https://en.wikipedia.org/wiki/Kolmogorov_complexity ,

Akaike information criterion, Bayesian information criterion, Minimum description length criterion, Normalize maximum likelihood criterion, etc.

Regarding this sequence 1,1,1,1,1,1,1,1,1,1,1,.. continued for a total of 8424432925592889329288197322308900672459420460792432 times, and then followed by the number 893658223791571665995096225335894563579345325693555. One can compress/represent this sequence as 1,8424432925592889329288197322308900672459420460792432,893658223791571665995096225335894563579345325693555. Let’s call this model M1. The other competing model suggested by you is M2 that is “nth term of the sequence is the greatest common divisor of two polynomials: n17+9 and (n+1)17+9”. One puts M1 in file called lp1.txt (which has 107 bytes) and M2 in file called lp2.txt (96 bytes). After compressing them using gzip one has lp1.txt.gz has 94 bytes and lp2.txt.gz has 108 bytes. Actually compressing lp2.txt increaseshe size of the file so it is better not to compress it. According to this simplest model which explain your sequence is actually 1,8424432925592889329288197322308900672459420460792432,893658223791571665995096225335894563579345325693555 (it requires 94 bytes) and not “nth term of the sequence is the greatest common divisor of two polynomials: n17+9 and (n+1)17+9” (it requires 96 bytes). So your model is unnecessarily complex.

June 6, 2018 at 5:13 am

Dan> But what is a “simple” pattern?

One way is to compare two patterns or models is to look to their codelengths (that is the number of bits or bytes needed to describe the model) according to MDL criterion. For example f(n)=n^2, that is model M1, takes 8 bytes (because it has 8 characters) whilst f(n)=(m+n,n)modn, that is model M2, takes 16 bytes (because it has 16 characters). It takes twice more bytes to describe the model M2!!! Therefore model M2 is unnecessarily complex according to MDL criterion. Both models M2 and M1 fit equally well the data.

> Another way to say this is that IQ tests are implicitly asking test takers to accept null hypotheses instead of asking, at most, for a rejection.

One assumption is that IQ tests are made for testing humans. They are not made for computers. Therefore the IQ tests aim for the common sense in humans which for many people would mean trying to optimize the trade off between goodness of the fit and the complexity of the model.

June 6, 2018 at 5:24 am

DanLet’s look to 1, 1, 2, 3, 5, 8, 13, ..

You wrote “Here the simplest pattern that can explain the data is just that each number is the sum of the digits of the previous two numbers. ”

Model M1: x_n=x_n-1+x_n-2 (this takes 15 bytes because it has 15 characters) [This I could even make it shorter the codelength by just writing Fibonacci]

Model M2 (using digits): “sum of the digits of the previous two numbers” (this takes 45 bytes because it has 45 characters). [Here I use human description because the R/Python code would be much longer than 45 characters (see: https://stackoverflow.com/questions/18675285/digit-sum-function-in-r ). If one is able to come with a R/python/… code that does this in less than 45 characters let me know!]

Therefore model M1 again wins hands down (because it is simpler and it describes that data as well as M2)!