Fighting bad typography research

The pre-match hype

Fans of good typography are like any other fans – they love a good fight and take sides easily. Sometimes the comment thread under my review of serif/sans-serif legibility heats up as if people were arguing about religion, politics or even climate change.

Now, some commenters on the thread have claimed they’ve found research showing undeniable evidence of a massive difference in legibility between serif and sans-serif fonts, despite the overwhelming body of evidence that there is either no difference or, if there is one, it is too small to worry about.

The research they’re talking about is from Colin Wheildon’s 1984 report – “Communicating or just making pretty shapes” (here reprinted in 1990). The report later formed part of his 1995 book “Type and layout: How typography and design can get your message across–or get in your way”.

When you hear claims which are radically different from the established body of research, you should rightly be sceptical, especially when they haven’t been published in a peer-reviewed scholarly journal. Nevertheless, being sceptical means examining the merits of any research even if it goes against the consensus view…

Round 1: Down but not out

A few years after Wheildon’s book came out, it was savaged by researcher Ole Lund in a book review and a PhD thesis, in attacks that at times looked personal. But Lund failed to mention the basic problem with the study – that it is very badly designed, and that the conclusions drawn from it are not credible.

Round 2: It’s a knockout?

Let’s take a closer look at the study as it is described in the copy of the report I have…

The set-up

In Wheildon’s experiment, people were shown a newspaper article set in a sans-serif font, asked questions to test their comprehension of the article and invited to comment on any difficulty they had in reading it; they were then shown an article set in a serif font and asked the same questions.

Immediately we can see a problem – the purpose of the experiment is revealed after one test condition but before the other, biasing the second condition. People taking part in research studies are notoriously open to bias and leading questions, so the volunteers may simply have been saying what they thought the experimenters wanted to hear – this alone has the potential to invalidate the test (see the Hawthorne effect).

It also looks like there may have been only one or two rounds of testing, which isn’t enough to produce a valid result either. If the same article content was shown each time, then obviously it will be easier to read the second time round. If you solve that by randomising the article order, you’re going to need many more rounds of testing. If you solve it by using different article content, how do you make sure the articles are equal in reading difficulty, so that the effect of the fonts is isolated?
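
For what it’s worth, the usual fix is a counterbalanced (crossover) design. Here is a minimal sketch in Python – the article names, the participant count and the four orderings are my own illustrative assumptions, not anything described in Wheildon’s report – showing how you would randomise who reads which article in which typeface, and in which order, so that neither content nor order is confounded with the typeface:

```python
import random

# Minimal sketch of a counterbalanced (crossover) assignment - illustrative only,
# not Wheildon's actual procedure. Each participant reads both articles, one per
# typeface; the four possible orderings are dealt out evenly and at random, so
# article content and reading order are not confounded with the typeface.

PARTICIPANTS = [f"P{i:03d}" for i in range(1, 225)]  # e.g. 224 volunteers

ORDERINGS = [
    [("Article A", "serif"), ("Article B", "sans-serif")],
    [("Article A", "sans-serif"), ("Article B", "serif")],
    [("Article B", "serif"), ("Article A", "sans-serif")],
    [("Article B", "sans-serif"), ("Article A", "serif")],
]

def assign(participants, seed=42):
    """Shuffle participants, then deal them round-robin into the four orderings."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {person: ORDERINGS[i % len(ORDERINGS)] for i, person in enumerate(shuffled)}

plan = assign(PARTICIPANTS)
for person in PARTICIPANTS[:3]:
    print(person, plan[person])
```

Matching the two articles for reading difficulty is still a problem, but at least order and content effects average out across the groups.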

Measuring the results

There is a table listing comprehension levels but we don’t get to see the questions. Why is comprehension the only measure used?

What about these other recognised measures of legibility or readability?

  • speed of reading
  • speed of perception
  • fatigue in reading
  • backtracking and other eye movements
  • perceptibility at a distance
  • perceptibility in peripheral vision

A lot of space is given to comments from people about how they felt, the actions they performed or what they thought they understood when reading, but these are only anecdotal claims – there were no objective observations made with stopwatches or eye trackers to see what really happened.
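
By way of contrast, an objective measure is cheap to collect. Here is a toy sketch in Python – entirely hypothetical, nothing like it appears in the report – that records reading time and a scored comprehension result per participant instead of relying on their impressions:

```python
import time

# Toy sketch of recording objective data per trial - hypothetical, not taken from
# the report: time the reading directly and score the answers, rather than asking
# readers how the text felt to read.

def run_trial(participant_id, passage, questions, answer_key):
    print(passage)
    start = time.perf_counter()
    input("Press Enter when you have finished reading...")
    reading_time = time.perf_counter() - start

    correct = 0
    for question, expected in zip(questions, answer_key):
        reply = input(question + " ")
        correct += reply.strip().lower() == expected.strip().lower()

    return {
        "participant": participant_id,
        "reading_time_s": round(reading_time, 2),
        "comprehension": correct / len(questions),
    }
```

Eye tracking needs hardware, but reading time and scored answers need nothing more than this.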

The final bell

The research is poorly designed, wide open to unintentional (and perhaps intentional) bias, and doesn’t provide credible objective data.

However, perhaps I’ve got it wrong – the write-up I’m working from describes the study only sketchily. If anyone can send me what they consider to be the definitive version of the report, I’d be glad to take another look.

But for now, I hope these 28-year-old rogue claims about serifs are finally out for the count.

27 responses

  1. I’ve read and considered Wheildon’s work in the broader context of legibility and readability studies since I first came across it in about 1990. I have worked in publishing and printing since the early 1970s, being just old enough to have direct experience of letterpress [trained as a compositor, ho, ho, ho] and then to see the complete transformation of my craft over the next twenty years or so. I advise large numbers of thesis writers about the use of production technologies, and help them actually design their theses to communicate their findings.
    I disagree with those who say that Wheildon’s study was poorly designed. He used over 200 subjects to test five main hypotheses, sufficient to demonstrate the necessary content and construct validity. The material he was using [text and pages from a popular automobile club magazine] was at an appropriate level for the audience. He only measured comprehension, no other fashionable concepts, which is what makes it such a powerful study. When I advise thesis students, who are writing material far above that used by Wheildon but for an expert audience, I treat his work as a classic empirical study. Very few other studies are as clearly defined, and very few have findings as clear-cut. Too many are compromised at the outset by theory. The purpose of typography, invisible as it should be, is to let the meaning of the words flow effortlessly to the reader.
    Objections based on literary theory founder, in my view, on Wheildon’s empirical rocks. I have lost count of, and lost patience with, people who say things like ‘I hate Times New Roman’. TNR was designed to be legible from 4 points to 196 points; it is the quintessential ‘legible’ font. When students give me a page of Arial [or Garamond], I convert it to Times and print it for comparison, and say ‘what do you think now that you see it on the page?’: grey versus black, a function of ascenders, descenders and counters. To paraphrase Wheildon: black is the best colour, the blacker the better.
    Michael McBain
    University of Melbourne, Australia

    • Hi Michael,

      Thanks for taking the time to comment.

      It’s good that he used so many test subjects (224 to be exact), but if he presented the test conditions as described in the report – i.e. probably not randomised, and maybe with leading questions – then that number is irrelevant: it simply isn’t a fair test and the results cannot be taken seriously.

      But as I said, the copy of the report I have perhaps isn’t the most complete one – but it sounds like you have access to the full write-up. If you post a scan or link here or send it to me by email, I’d be glad to update my post.

  2. And where is your research – other than a series of generalisations – that indicates Wheildon is wrong?

    • Hi there. The criticisms are very specific about the methodology of the research. Do you have a copy of the original research so I can check these points definitively?

  3. excellent post – i am a typographic nut (yes, early bauhaus lack of capitalisation here).

    what are your thoughts of the interesting, albeit self-admitted lack of methodology, work just done by Errol Morris?

    http://opinionator.blogs.nytimes.com/2012/08/08/hear-all-ye-people-hearken-o-earth/

    i think it was very “fun” of him to do his study in such a non-academic manner and that his results speak to something – although i am not certain what and thus the question to you.

    thanks! =)

  4. From my memory of reading Wheildon’s full study, the test was objective as to comprehension. Would subjects mask their comprehension of a sans-serif typeface? Maybe, but it seems unlikely. Still, it would be interesting to re-run the study. Michael, would you have the resources to do it? It could be done without telling students it was to test typefaces, which would be simple.

  5. Typography and design are battlefields strewn with the wreckage left behind by the life-and-death struggles of many theorists. Wheildon’s study does seem to get the hackles up for a lot of people. I like Karen Schriver’s book for its well-referenced studies of meaning and rhetorical purpose, and even though it is now getting a little bit old, I still draw on it on the three or four occasions a year when I speak about the value of design. Schriver doesn’t like Wheildon, either.
    Partly because of the enmity Wheildon invokes, I’ve read most of what has been written about design, rhetorical purpose and measures of communicative value. As I said in my April comment, much of the wheildonschmerz is based on an incorrect pre-judgement of what such studies should be discovering; they make quite large assumptions about the design and purpose of Wheildon’s study. It makes me wonder if they ever had access to the original study, which admittedly was published privately by a New South Wales industry organisation–it doesn’t get much more obscure than that.

    What makes Wheildon’s study so seminal for me is that he eschewed theory. He took his 224 volunteers, and mixed up the fonts, the leading, the justification, the capitalisation and the measures, and then tested comprehension–that’s all. “Read this, and then answer these questions about the thing you’ve just read”. It’s a classic independent/dependent variable study, and virtually unencumbered by theory. The number of variables being tweaked, and the number of participants tested, is well within reasonable limits; it’s a sound and robust experimental design.

    • Hello again Michael,
      Do you have a copy of the original study, or a version of it that describes the study design in more detail than the version I have access to?

  6. Has anyone presented a link to the actual study yet? I am curious to read it as verbatim as possible. I have seen many other studies that were inconclusive in the other aspects of readability between fonts with and without serifs.

    • No, every time I ask for a copy, these ardent supporters suddenly shy away from sharing the actual study…I wonder why?

  7. Based on the comments I’ve seen here, the reason they aren’t sending you a copy is that they don’t have one either, hence their inability to respond to your comments about experimental design.

  8. I believe you have misread the research description in the 1995 book.

    Page 187 indicates that two articles were prepared. Each article was prepared in two formats. The research subjects were divided into two groups. Both groups received the same articles, but in different formats. They were allowed to read the article for a fixed time, then asked 10 questions about the content. They were NOT told about the objective for the questions.

    The groups were then reversed with the second article.

    The order of asking questions was randomized.

    This process was repeated multiple times over a period of several years (so there were multiple articles).

    The study author asked all the questions, so the study is not double-blind (which in my mind is the single greatest weakness of the study, and neither you nor Lund have addressed this weakness).

    There are no variance measurements reported in the study, so we cannot estimate any statistical reliability for the conclusions. That’s another major weakness. But it’s hard to imagine that the variance is large enough to account for the differences observed.

    You fault the study for focusing on comprehension. But in my mind, that’s the most important feature of body text. It doesn’t matter how much I like reading it, or how pretty it is, or how much my eyes want to dwell on it. It matters if I understand it. At least for the writing I care most about.

    I wish that I had access to the raw data and to more details about the experiments. I wish I had the articles and the questions. I don’t. It’s possible that Wheildon rigged the experiments to get the results he wanted.

    It seems like one who wants to debunk Wheildon’s work ought to just try to falsify it by repeating the experiment and achieving different results. That’s the way science is usually done. And there is no evidence on this blog or in Lund’s dissertation of actually repeating the test that is believed to be flawed. That is disappointing to me.

    I’m thinking of trying to duplicate Wheildon’s work with a small sample of writing of my own. Not long term, not replicated, but with randomized group assignments and a double-blind experimental format. It seems to me that repeating his work is the best way to confirm or refute it, not arguing about the potential problems with his methodology (since we don’t really have a peer-review-acceptable description of his methodology).

    The details of the number of

    • This comment has been truncated somehow. I sent the author an email a while back asking them to resend the whole message, and am waiting for a reply.

  9. I’ve only just found this discussion.

    I have Wheildon’s original paper. I obviously can’t upload it for copyright reasons, but I endorse Michael McBain’s comments. That said, I do consider the study flawed insofar as it doesn’t cater for a number of other variables, notably the extent of the text (only about a page in this case), the balance between leading and measure, lighting, the specific fonts, and the effects of paper stock on print clarity (ink spread especially). (I always taught my students that a design doesn’t succeed or fail until the ink hits the paper.) In contrast, Karen Schriver’s study (as reported in Dynamics in Document Design) only reflects participants’ subjective preference – no test of comprehension. Let’s further remember that TNR was designed for narrow measures on newsprint, not for quality printing, and it provides a particularly poor leading/measure balance in most applications.

    • Where was the paper published? Several commenters have claimed to have access to it–yourself included–without providing any detail other than to claim that Alex Poole has misunderstood it. Even if it were published by an obscure press in a distant and beautiful land, giving those of us who are interested in this topic the opportunity to track it down would be helpful.

    • I, too, would be delighted to see a properly-controlled repeat of Wheildon’s study, not least because the results were so startling. The difference in comprehension was 80% vs 30% for the serif/sans serif split, and similar for justified/ragged-right. Those are big differences, but in hindsight it would have been great to see some more detail of the design and associated statistics. Colin was the editor of the NRMA members’ magazine [NRMA is like the RAC and AAA], not an academic researcher. He was frustrated by the lack of rigour of contemporary studies of font and design effects, and he wanted hard data on which to base his own font and design decisions.

      I also disagree with Michael Lewis in a professional sense about Times New Roman. It is not a ‘beautiful’ font that a modern book designer would reach for, but it is a wonderfully legible ‘working’ font with a healthy x-height which lends itself to tight or loose leading but above all is very ‘black’ on the page. For academic theses, now the only area where I exercise my design and typographic skills, it is the base typeface against which others have to compete. “Why this font rather than Times?” And my measure of comparison is: comprehension.

  10. Not that it’s in any way direct proof but I do know Wheildon’s work was done under the guidance of the late Henry Mayer, former Professor of Government at Sydney University, and in his retirement, editor of a magazine specialising in media. As a former student, I can assure you Henry was notorious for his insistence on empirical research properly designed and structured. I think some comments above already indicate that that was the case in this instance, contradicting the suggestion of crude bias in design.

    I’m interested in the hostility aroused by his research, which seems to arise from the “revolutionary” idea that he might test for comprehension instead of the somewhat peripheral characteristics that the “readability” proponents advocate, such as the recognition of individual letters of the alphabet. It’s as though the effectiveness of a diet can only be studied by testing whether the food was considered tasty rather than whether it provides effective nourishment.

    Complaints about controls for light levels and the quality of the paper the documents were printed on strike me as minor irrelevancies, given that very few people have even bothered to try repeating these tests for comprehension in the three decades since the original research. One might have expected that that would be more relevant than fiddling with light levels.
    Are we now expected to dismiss all paper-based university examination results, for example, on the grounds that light levels may have varied in different examination rooms (as I recall, they did in my case)? Give me strength!
    If opponents of the Wheildon hypothesis want to challenge it, why have they not attempted to replicate the tests of comprehension (first on paper, then on screen)?
    It may well be that the critics are correct. But they impress me not one whit if they continue to insist on testing peripheral components of the comprehension environment which they claim are important, while failing to show their connection to the ultimate result:
    Does the reader understand the text in a reasonably normal reading environment?
