
Introducing MatlabTFCE

A new package for fMRI multiple comparison correction

Functional magnetic resonance imaging (fMRI) is one of the most powerful tools currently available for understanding living human brains. However, the method raises a number of statistical difficulties which researchers have struggled to address, such as temporal autocorrelation, complex spatial dependency, and aligning brains of diverse sizes and shapes. Perhaps the most troublesome statistical issue that fMRI presents is the multiple comparison problem. The traditional null hypothesis significance testing framework breaks down when many hypotheses are tested at the same time. Under ordinary circumstances, setting one's threshold to the customary α = .05 (meaning that p-values must be less than .05 to be "statistically significant") should fix the chance of a type I error (false rejection of the null hypothesis) at 5%, assuming the assumptions of the test in question are met. In other words, a result is only declared significant when the probability of observing data as (or more) extreme than what was actually seen, given the null hypothesis, is less than .05.

However, in the frequentist framework this error risk refers to a (usually imaginary) series of tests of a single hypothesis. If one tests multiple hypotheses at once, one's risk of incorrectly rejecting at least one null hypothesis increases accordingly. The risk of falsely rejecting at least one hypothesis in a set is known as the family-wise error rate (FWER). In the case of wholebrain fMRI analyses, where hundreds of thousands of voxels may be tested at once, one is virtually guaranteed to falsely reject the null hypothesis in some voxels if one sticks to an uncorrected α = .05 (because FWER = 1-(.95)^100000 ≈ 1). Using a priori regions of interest (ROIs) can reduce the number of tests, but what does one do when a wholebrain analysis is truly called for?

There are many ways to deal with the problem of multiple comparisons. The most obvious approach is the Bonferroni correction, in which one simply divides one's α by the number of tests conducted. This can work passably when only a handful of comparisons are considered, but is disastrously conservative in the context of fMRI. Instead, popular correction techniques tend to capitalize on a particular feature of fMRI data: spatial dependency. Adjacent voxels in the brain are not independent of one another, meaning that assuming "# of tests == # of voxels" is too pessimistic. The main modern approaches to fMRI multiple comparison correction fall under three broad headings, listed below, but all take advantage of this spatial dependency.
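To put numbers on the two preceding paragraphs, here is a minimal sketch of the uncorrected family-wise error rate and the corresponding Bonferroni threshold; the voxel count is just the illustrative figure used above, and the dependence between voxels is ignored:

```matlab
% Uncorrected FWER across many (independent) tests, and the Bonferroni fix
alpha  = 0.05;
ntests = 100000;                              % illustrative wholebrain voxel count
fwer   = 1 - (1 - alpha)^ntests;              % chance of at least one false positive (~1)
alpha_bonf = alpha / ntests;                  % per-voxel threshold of 5e-7
fprintf('Uncorrected FWER: %.4f, Bonferroni alpha: %g\n', fwer, alpha_bonf);
```

A per-voxel α on the order of 5e-7 is what makes Bonferroni so punishing at wholebrain scale.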

  • Gaussian random field theory: GFT was one of the first approaches devised, and as a result became the default in many statistical packages such as SPM and FSL. It relies on Gaussian random field theory, a parametric model of spatial dependency. Like any parametric test, from t-tests on up, GFT corrections provide accurate results only under certain assumptions. Unfortunately, the assumptions of GFT are frequently violated by real fMRI data, which is usually insufficiently smooth overall and whose spatial autocorrelation is non-Gaussian, with a long tail (but cf. here). When it works, however, GFT provides p-values associated with particular clusters of voxels, sacrificing spatial specificity within the cluster (relative to Bonferroni) for greater power.
  • Monte Carlo simulation: simulating fMRI data, ideally with parameters estimated from one's actual data, is a slightly less assumption-dependent approach to multiple comparison correction. In this approach, one simulates the results of many (1000s of) experiments like the one actually conducted, and observes how large a cluster one needs (paired with some voxelwise threshold) to limit the FWER to .05. The simulation-based approach does not rely on the assumption that one's statistical maps approximate a Gaussian random field, but it still assumes that the simulated data sufficiently resemble one's real data.
  • Maximal statistic permutation: permutation testing relies on randomization of one's actual data. As a result, it has the fewest assumptions of any approach (though it still has some, such as the exchangeability of observations). Using permutation testing to correct multiple comparisons relies on aggregating the "maximal statistic" across the brain (see here for the classic primer). That statistic can be the maximal t- or F-statistic of individual voxels, the voxel extent of clusters, or something else (that something else being the topic of this post). A minimal sketch of the basic procedure follows this list.
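The sketch below illustrates maximal statistic permutation correction for a one-sample (random effects) test using sign flipping. The variable names (data, nperm, and so on) are hypothetical placeholders for illustration; this is not MatlabTFCE's interface.

```matlab
% data: subjects x voxels matrix of contrast estimates (hypothetical input)
nperm = 1000;                                       % number of permutations
[nsub, nvox] = size(data);
tobs = mean(data) ./ (std(data) ./ sqrt(nsub));     % observed t-statistic per voxel
maxt = zeros(nperm, 1);
for p = 1:nperm
    flips = 2 * (rand(nsub, 1) > 0.5) - 1;          % random +/-1 sign flip per subject
    pdata = bsxfun(@times, data, flips);            % sign-flipped (permuted) data
    tperm = mean(pdata) ./ (std(pdata) ./ sqrt(nsub));
    maxt(p) = max(tperm);                           % record the maximal statistic brainwide
end
% FWER-corrected p-value per voxel: how often the permutation maximum
% meets or exceeds the observed statistic
pcorr = mean(bsxfun(@ge, maxt, tobs), 1);
```

TFCE slots into this scheme by transforming each statistical map (observed and permuted) before the maximum is taken, as described further below.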

For completeness I'll also mention FDR - false discovery rate. FDR is an alternative to FWER itself rather than to the methods used to achieve FWER control listed above; indeed, those same methods can be adapted to control FDR instead of FWER. Rather than limiting the risk of any false positive to .05, FDR limits the proportion of positive results that are false positives to .05 (i.e. it aims to produce a statistical map in which at most 5% of significant results are type I errors). As you might expect, FDR is never more strict than FWER. How liberal FDR is in comparison to FWER depends on how many significant voxels or clusters there are in the brain: when there are very few, FDR approaches FWER, but when there are many significant voxels, FDR is considerably more liberal than FWER.
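For concreteness, here is a minimal sketch of the Benjamini-Hochberg procedure, one standard way of controlling FDR (the choice of BH here is mine; the post doesn't single out a particular FDR algorithm). pvals is a hypothetical vector of voxelwise p-values:

```matlab
% Benjamini-Hochberg: find the largest i such that p_(i) <= i*q/m
q = 0.05;                                     % target false discovery rate
pvals = pvals(:);
m = numel(pvals);
[psorted, order] = sort(pvals);               % ascending p-values
crit = (1:m)' * q / m;                        % BH critical values
k = find(psorted <= crit, 1, 'last');         % largest index passing its critical value
sig = false(m, 1);
if ~isempty(k)
    sig(order(1:k)) = true;                   % all voxels up to that index are significant
end
```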

In my experience, cluster corrections (whether GFT, simulation, or permutation based) are vastly more popular than voxelwise corrections in the literature. This isn't terribly surprising given that cluster corrections are almost always much more liberal than voxelwise approaches, and it is rarely valuable to make claims about specific voxels rather than clusters. However, cluster corrections do possess an undesirable feature: instability. Depending on how one sets one's voxelwise threshold, the consequent cluster-extent threshold can change a great deal. Indeed, if you walk through a number of different voxelwise thresholds, clusters can *pop* in and out of significance. This is clearly not ideal behavior. Among other things, it creates an unnecessary temptation for researchers to "try" many different thresholds and report only the "best" one. This is a form of p-hacking, and thus quite inappropriate. However, you can probably understand the vexation of a researcher seeing a cluster with a huge voxel extent sitting just under the a priori voxelwise threshold they set.

A solution to this problem came in 2009 with Smith & Nichols' invention of threshold free cluster enhancement (TFCE). TFCE is an image transform that accentuates the "clusteriness" in data. One feeds in a statistical map (e.g. of t- or F-statistics) and TFCE re-weights every point based on those around it. The technically-minded amongst you can check out the function for this re-weighting in the linked paper, but for everyone else here's an intuitive description: imagine the statistical map as mountainous terrain with peaks and valleys depending on the value of the statistic at each voxel. TFCE visits each point on the landscape and replaces the raw altitude with a weighted sum of the mass of the mountain under that point. By doing this, TFCE in some sense equalizes broad but low "hills" with tall but narrow "mountains." When combined with maximal statistic permutation testing (where the maximal statistics are now TFCE-transformed) this can produce a "smooth" (i.e. continuous) map of corrected p-values that takes into account the spatial dependency in the data. One can threshold this map at .05 to achieve control over FWER without the instability of cluster correction.
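For the technically inclined, here is a minimal sketch of the transform itself under the paper's default parameters (H = 2, E = 0.5). tmap stands for a hypothetical 3D map of positive statistics, and bwconncomp (from the Image Processing Toolbox) is simply one convenient way to find clusters; this illustrates the idea rather than reproducing MatlabTFCE's internal code.

```matlab
% TFCE: for each voxel, integrate (cluster extent)^E * (height)^H over thresholds h
H = 2; E = 0.5; dh = 0.1;                     % defaults from Smith & Nichols (2009)
tfce = zeros(size(tmap));
for h = dh:dh:max(tmap(:))
    cc = bwconncomp(tmap >= h, 26);           % 26-connected clusters surviving threshold h
    for c = 1:cc.NumObjects
        vox = cc.PixelIdxList{c};             % linear indices of this cluster's voxels
        tfce(vox) = tfce(vox) + numel(vox)^E * h^H * dh;  % this threshold's contribution
    end
end
```

In the full procedure this transform is applied to the observed map and to every permuted map, and the maximum TFCE value in each permuted map forms the null distribution used for correction.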

TFCE has gotten a bad reputation in certain quarters, primarily for being "too strict." Indeed, I've heard it referred to facetiously as "Total Fucking Cluster Elimination." This reputation is undeserved. It stems in part from the association of TFCE with permutation testing. The pairing isn't necessary - in theory TFCE could be combined with Monte Carlo simulations, too - but in practice the two are almost invariably paired. So are permutation approaches unnecessarily strict? They do tend to produce more conservative results than GFT and Monte Carlo approaches, but one should ask whether this is because permutations are too conservative or because the other approaches are too liberal. Increasing evidence seems to suggest the latter, but of course we're strongly motivated not to believe it, because if we adopted the stricter practice we would be less likely to get publishable results. There are principled arguments in favor of less conservative approaches, but they should be adopted knowingly and with a full understanding of the properties of the corrections employed, rather than by implicit consensus to use approaches that feel right based on published results.

Another reason some consider maximal statistic permutation tests harmful is actually justified, under some circumstances. The problem is that for some test statistics, permutation of a large effect can produce an empirical null distribution with a large variance. This isn't a problem for individual permutation tests, but it becomes problematic when one aggregates across many such tests for the purposes of multiple comparison correction. If one has a large effect in one's data (in magnitude or spatial extent), it can effectively "suppress" smaller effects which would otherwise have been statistically significant. Fortunately, there's an easy solution to this problem: use standardized statistics. Although conducting a permutation correction directly on mean differences leads to suppression, applying the same correction to t-statistics (standardized mean differences) obviates the issue.

A related issue is the question of sequential or step-down testing procedures. If you're familiar with the Bonferroni correction, you may also have heard of its cousin, the Holm-Bonferroni method. Holm-Bonferroni is slightly more complicated conceptually, but it has no additional assumptions beyond those of the Bonferroni procedure and is uniformly more powerful than it. Thus Holm-Bonferroni should always be preferred to Bonferroni in cases where the latter is applicable. Holm-Bonferroni gains its additional power through a step-down procedure. It starts by comparing the smallest p-value (out of the family of tests under consideration) to the full Bonferroni-corrected α. If this test turns out significant, the next test is slightly less strict: the 2nd smallest p-value is compared to a corrected α = α_raw/(# of tests - 1). The process proceeds from there, comparing each p-value to a less and less strict α until all tests have been corrected, or (more typically) until it hits a test which does not pass threshold. In the latter case, that test and all subsequent tests (i.e. those with larger p-values) are non-significant. This process might seem suspicious to the wary, but it is provably correct (see link above).
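Here is a minimal sketch of the Holm-Bonferroni procedure, assuming a hypothetical vector of p-values pvals and a family-wise α of .05:

```matlab
% Holm-Bonferroni step-down: compare sorted p-values to progressively laxer thresholds
alpha = 0.05;
pvals = pvals(:);
m = numel(pvals);
[psorted, order] = sort(pvals);               % smallest p-value first
sig = false(m, 1);
for k = 1:m
    if psorted(k) <= alpha / (m - k + 1)      % full Bonferroni at k = 1, then alpha/(m-1), ...
        sig(order(k)) = true;                 % reject this null hypothesis
    else
        break;                                % this and all larger p-values are non-significant
    end
end
```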

Interestingly, the same step-down logic described above can be carried over to maximal statistic permutation corrections. Indeed, this was pointed out by Andrew Holmes 20 years ago, back when imaging was still in its infancy. At the time, step-down procedures weren't computationally viable. They are now, but unfortunately it turns out they have pretty minimal advantages in the fMRI context. Even when using data specifically simulated to benefit from such a step-down, I've found that the method confers virtually no advantage over unadulterated maximal statistic permutation testing with TFCE.

Based on what we know now, maximal statistic permutation testing with TFCE seems like the best approach for correcting wholebrain analyses for multiple comparisons. It does embody certain assumptions about the spatial distribution of brain activity, but no more so than other methods, and otherwise it is stable and requires remarkably few assumptions. Perhaps one day we'll all be running wholebrain analyses using hierarchical Bayesian spatial time-series mixture models (I'm not holding my breath), but until then I doubt we'll see anything much better than TFCE.

To my knowledge, the first (and until recently, only) implementation of maximal statistic permutation testing with TFCE was FSL's randomise. An SPM-compatible Matlab implementation of maximal statistic permutation testing has long been available in the form of SnPM, but this package allowed only voxelwise and clusterwise corrections, not TFCE. TFCE support has recently been implemented in the CoSMoMVPA package for Matlab, though it's integrated with the rest of the package, and so may require some cajoling if the correction is all you want. Anders Eklund's awesome BROCCOLI package has a Matlab wrapper for ordinary use, but the base code is programmed in OpenCL (it's super fast because it relies on GPUs), which renders it accessible to considerably fewer researchers. There may be other Matlab options out there, but if so I haven't been able to locate them.

To help fill this space, I've put together a small package I'm calling MatlabTFCE. I wrote it to provide a very simple way to conduct maximal statistic permutation testing with TFCE for the most common random effects analyses (t-tests, correlations, and ANOVAs) in Matlab. The package is pure Matlab, works directly with NIfTI images (.img/.hdr or .nii format), and has no external dependencies (so no SPM/FSL/AFNI/BrainVoyager/etc. required). It's more limited than randomise - in that it does not provide an option for custom design matrices - and does not provide surface-based TFCE like CoSMoMVPA, but it should nonetheless suffice for a high percentage of real-world use cases. However, if speed is what you're looking for - and permutation tests do take a while - then BROCCOLI is undeniably a better choice.

The first proper release of MatlabTFCE is available on my github page (linked above). It provides a (very minimal) GUI for those who like such things, but all functionality can be accessed from the command line through the matlab_tfce.m function (the documentation for which describes how to request the various implemented tests). I've validated the TFCE transformation itself, as well as the results of the one-sample and paired t-tests, directly against FSL. There's also an included demo which demonstrates that all tests achieve proper FWER control under trivial conditions (i.e. using simulated data that doesn't resemble real fMRI data). I've tested the package on Matlab versions back to 2011b and on a variety of systems (Windows 7/10, Linux cluster). However, there's a limit to how much testing one person can do, so I very much welcome additional validation/testing, bug reports, and pull requests! Any suggestions for improving the (memory and speed) performance of the ANOVAs would be particularly appreciated. This is really my first attempt to produce software for general consumption (outside of web applications), so apologies in advance for any remaining issues that may crop up or not-so-best practices I inadvertently followed.