20140219

My interaction is not significant, but the simple slopes are...

Oftentimes, people are confused by a statistical interaction pattern they find - or rather, by the lack of one. Let's say you have two potential explanatory variables X1 and X2 and a dependent variable Y. You estimate a regression model with X1, X2, and their product X1*X2 as predictors and Y as the criterion. You find that the p value associated with the interaction term is far from the 'holy hurdle' of p = .05: not significant by a long shot. So the decision is clear: I cannot claim that an interaction of the size found is far enough from zero to assertively claim an interaction effect (at least not with a type I error rate of .05 and reasonable power 1 - β). So it is more reasonable to consider X1 and X2 as having individual, additive effects on Y - in other words, X1 and X2 only have true main effects on Y.
Let's have an example (the R code for this example is available at https://raw.githubusercontent.com/johannjacoby/interaction_and_slopes/master/interaction%20&%20slopes.R):

#get example data with predictors X1 and X2 and a dependent variable Y
ds <- as.data.frame(
  read.table(
    "https://raw.github.com/johannjacoby/interaction_and_slopes/master/no_interaction_differing_slopes.dat",
    header=T, sep="\t"))

#center both predictors
ds$centered.X1 <- scale(ds$X1, center=T, scale=F)
ds$centered.X2 <- scale(ds$X2, center=T, scale=F)

#shift X2 to obtain the simple slope of X1 @ 1SD below the mean of X2
ds$centered.X2.lo <- ds$centered.X2 + sd(ds$X2)
#shift X2 to obtain the simple slope of X1 @ 1SD above the mean of X2
ds$centered.X2.hi <- ds$centered.X2 - sd(ds$X2)
#### yes, it is correct: to get ds$centered.X2.hi you have to subtract 1 SD,
#### and in order to get ds$centered.X2.lo you have to add 1 SD.

# now estimate the basic regression model to see whether X1 and X2 interact:
model0 <- lm(Y~centered.X1*centered.X2, ds)
# and the two regression models in order to obtain the simple slopes of X1
# at X2 = 1SD below the mean and at X2 = 1SD above the mean:
model.lo <- lm(Y~centered.X1*centered.X2.lo, ds)
model.hi <- lm(Y~centered.X1*centered.X2.hi, ds)

# show the model estimates
summary(model0); summary(model.lo); summary(model.hi)

# print the interaction and the simple slopes of X1 @ X2=-1SD and X2=+1SD
# (summary(model)[[4]] is the coefficient matrix; the linear indices pick out estimates, t values, and p values)
results0 <- summary(model0)[[4]]
results.lo <- summary(model.lo)[[4]]
results.hi <- summary(model.hi)[[4]]
cat("\n",
  "Interaction X1 * X2: b=",results0[4],", t=",results0[12],", p=",
    sprintf("%5.4f",results0[16]),ifelse(results0[16] < .05," *",""),"\n",
  "Slope of X1 @ X2 = -1SD: b=",results.lo[2],", t=",results.lo[10],", p=",
    sprintf("%5.4f",results.lo[14]),ifelse(results.lo[14] < .05," *",""),"\n",
  "Slope of X1 @ X2 = +1SD: b=",results.hi[2],", t=",results.hi[10],", p=",
    sprintf("%5.4f",results.hi[14]),ifelse(results.hi[14] < .05," *",""),"\n",
  "absolute diff(p) = |",results.hi[14]," - ",results.lo[14],"| = ",
    abs(results.hi[14] - results.lo[14]),"\n",
  "diff(b) = ",results.hi[2]," - ",results.lo[2]," = ",results.hi[2] - results.lo[2],"\n",
  sep="")

These are the results:

Interaction X1 * X2: b=-0.5400003, t=-1.062382, p=0.2921
Slope of X1 @ X2 = -1SD: b=1.33277, t=2.525593, p=0.0140 *
Slope of X1 @ X2 = +1SD: b=0.3723435, t=0.5394312, p=0.5915
absolute diff(p) = |0.5914612 - 0.0140371| = 0.5774241
diff(b) = 0.3723435 - 1.33277 = -0.9604268

Clearly, the results indicate that the interaction term is small and not significant, so it could easily be explained by random fluctuation around a true interaction of zero. This essentially means: there is no interaction of X1 and X2; the slopes of one predictor at different values of the other do not systematically differ as a function of that other predictor. But the slope of X1 @ X2 = +1SD (from now on: the "high slope") is not significant, with a p value of .5915, while the slope of X1 @ X2 = -1SD (from now on: the "low slope") is significant, with p = .0140! So it appears that the low slope and the high slope are different after all - the former is significant, the latter is not. So we might be tempted to ignore the non-significant interaction and simply claim that we found differential effects of X1 on Y, conditional on the value of X2. But I argue that the implicit reasoning behind this is fundamentally flawed, and I will elaborate on this argument.

Notice how above, we chose +1SD and -1SD as the conditional values of X2 at which we test the slope of X1. But these values of X2 are rather arbitrary. We could just as well consider the slope of X1 @ X2 = mean of X2 (from now on: "mean slope") vs. the low slope. We have the latter from above:

Slope of X1 @ X2 = -1SD: b=1.33277, t=2.525593, p=0.0140 *

and we obtain the mean slope from the original basic regression model that we used to test the interaction in the first place:

# simple slope of X1 @ X2 = mean of X2
cat("Slope of X1 @ X2 = mean of X2: b=",results0[2],", t=",results0[10],", p=",
    sprintf("%5.4f",results0[14]), ifelse(results0[14] < .05," *",""),"\n")

yielding the result:

Slope of X1 @ X2 = mean of X2: b= 0.8525569 , t= 2.048904 , p= 0.0446 *

So, comparing the mean slope vs. the low slope, we see that they are both significant. So, according to the mere comparison of significance decisions, they are "the same" - the simple slopes do not differ. One might argue that one of the p values is smaller than the other, but p values and their differences are a very poor indicator of an actual difference, as they are not a linear function of the effect size. The difference between p = .30 and p = .40 does not correspond to the same difference in effect size as the difference between p = .11 and p = .01, so gauging whether two effects differ by eyeballing two p values and making a guess as to whether they are "very different", "not so much, but still different", or "not really different at all" will not cut it as a reproducible and transparent decision rule for the scientific test of a theory. In addition, p values are by definition highly dependent on sample size. So a difference of the same magnitude between two slopes (i.e., a difference in bs) might look like a huge difference with N = 300 (because the p values are far apart), but with N = 120, the difference in p values might not look so big anymore. This is of course true for any p value from a statistical test, but the problem is exacerbated if you look at differences between effects.
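
To make the sample size point concrete, here is a minimal sketch with purely hypothetical numbers (not taken from the example data): the very same slope estimate corresponds to quite different p values depending only on N.

# hypothetical illustration: a slope of b = 0.2 with a standardized predictor and
# a residual SD of 1, so that the standard error of b is roughly 1/sqrt(N)
b <- 0.2
p.for.N <- function(N) 2 * pt(-abs(b * sqrt(N)), df = N - 2)
p.for.N(120)   # approx. .03   -> looks "significant"
p.for.N(300)   # approx. .0006 -> looks "highly significant", same effect size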

So, if we compare the low and high slopes by a rather haphazard guess about the difference between p values (absolute diff(p) = .577 in that comparison), we get the result "the slopes differ"; but if we compare the mean slope and the low slope, we get the result "they don't differ, they are both significant". One might argue that this is unfair, because the difference in X2 between the mean slope and the low slope is much smaller than the difference in X2 between the low and high slopes. That is unquestionably true, but who is to say which difference in X2 is the "right one"? Also, we can choose another set of values of X2 on which to condition the slope of X1 and obtain a similar result that the slopes are not really different (according to simple p value comparison). We can estimate the simple slopes at any values of X2 we wish, so we could pick X2 = -2SD of X2 and X2 = mean of X2. These two conditional values are exactly as far apart as +1SD and -1SD above (i.e., 2 SDs), so the comparison of these two simple slopes is just as "fair" toward finding a difference between the slopes as that of the high and low slopes above:

# simple slopes of X1 @ X2 = -2SD and @ X2 = mean of X2
ds$centered.X2.minus2SD <- ds$centered.X2 + 2*sd(ds$X2)
model.lo.other <- lm(Y~centered.X1*centered.X2.minus2SD, ds)
summary(model.lo.other)
results.lo.other <- summary(model.lo.other)[[4]]
# the upper slope of this comparison is the slope at the mean of X2,
# which we already have from the basic model (model0):
results.hi.other <- results0
cat(
  "Slope of X1 @ X2 = -2SD: b=",results.lo.other[2],", t=",results.lo.other[10],", p=",
    sprintf("%5.4f",results.lo.other[14]),ifelse(results.lo.other[14] < .05," *",""),"\n",
  "Slope of X1 @ X2 = mean of X2: b=",results0[2],", t=",results0[10],", p=",
    sprintf("%5.4f",results0[14]),ifelse(results0[14] < .05," *",""),"\n",
  "abs.diff(p) = |",results.hi.other[14]," - ",results.lo.other[14],"| = ",
    abs(results.hi.other[14] - results.lo.other[14]),"\n",
  "diff(b) = ",results.hi.other[2]," - ",results.lo.other[2]," = ",
    results.hi.other[2] - results.lo.other[2],"\n",
  sep="")

The slope of X1 @ X2 = -2SD (i.e., the "-2 slope") and the slope of X1 @ X2 = mean of X2 (i.e., the "mean slope") are both significant:

Slope of X1 @ X2 = -2SD: b=1.812984, t=2.036622, p=0.0458 *
Slope of X1 @ X2 = mean of X2: b=0.8525569, t=2.048904, p=0.0446 *
abs.diff(p) = |0.04457648 - 0.04582954| = 0.001253059
diff(b) = 0.8525569 - 1.812984 = -0.9604268

And their p values only differ by a minuscule .0013. That is surely not a difference in simple slopes that should be taken seriously.

Thus, even if we compare two simple slopes that are as far apart in X2 as the low and high slopes, we now have to come to the conclusion (if we use the crude p value comparison) that the slopes are not different: they are both significant. The comparison between the p values again is not a big help: this time the difference between the p values of the slopes is diff(p) = .0013, while in the comparison between the low and high slopes it was .577. How is one to determine whether these differences in p are the same or different? That's right: it will become a mess to continue comparing p values by approximate visual inspection and making transparent scientific decisions based on those comparisons.

Things are, however, different if we look at the difference in b, diff(b), in the two cases. In the comparison of the low and high slopes above, we obtained a difference of diff(b) = -0.96, and in the comparison of the -2 slope vs. the mean slope we obtained exactly the same difference: -0.96. It appears more sensible to use the difference in b as a representation of the difference in slopes - at least the difference in slopes thus indicated is the same whenever the difference in the conditional values of X2 is the same. But how can we test whether this difference in b is substantial, or allows us - based on a clear and transparent criterion - to say: "the two slopes at two different values of the moderator X2 are different"? After all, we have no p value for the difference in b! Or do we? Yes, we do. Not a p value specifically for the comparison between the low and high slopes, or for that between the -2 and mean slopes, but one for any pair of simple slopes that are one unit of X2 apart: the p value of the interaction term in the very first basic regression model. The interaction coefficient is exactly the amount by which the simple slope of X1 changes when X2 increases by one unit, so its test tells us whether two simple slopes that are exactly one unit of X2 apart differ. And remember that the interaction is linear: the increase in the simple slope of X1 as X2 becomes larger (or the decrease as X2 becomes smaller, or, if the interaction is negative, the reverse) is the same no matter which two conditional values of X2 one might choose, as long as they are equally far apart. The difference between any two simple slopes is therefore just the interaction coefficient multiplied by the distance between the two conditional values of X2; the low and high slopes are exactly 2 SDs of X2 apart, so the test of the interaction term is also the test of whether they differ.
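
We can check this identity directly on the example data (a minimal sketch, assuming ds, model0, and the results objects from above are still in the workspace): the interaction coefficient multiplied by the distance between the two conditional values of X2 reproduces diff(b).

# interaction coefficient times the distance between the conditional values (here 2 SDs of X2)
results0[4] * 2 * sd(ds$X2)       # = -0.9604268, i.e., diff(b) from the low/high comparison
results.hi[2] - results.lo[2]     # the same number, obtained from the two shifted models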

In sum: comparing two p values in order to assess whether simple slopes are different is not to be recommended, as it gives different comparison values for slopes that are the same distance apart in X2, and p values are not easy to compare in the first place. In contrast, the p value of the original interaction allows us to make a decision for or against the hypothesis of "differing slopes" that will be the same for any pair of conditional X2 values that are the same distance apart. This p value of the interaction is a focused and concise basis for such a decision and does not require an additional judgment about whether two p values are close, far apart, or basically the same. The interaction term and its statistical properties are what the comparison of simple slopes should be based on, not the isolated slopes or pairs of them. If the interaction term is not significant, then by the logic of significance testing the slopes are not different. If the interaction term is significant, then any pair of slopes whose conditional values of the moderator are one unit apart are different - no matter what their individual p values are.

There are two more general points that can be taken away from this:

  1. Comparing things

    In general, comparing two effects (or any other statistics) by just laying them side by side and approximately assessing a difference in p values is never a good idea. What counts is whether the difference between the two is significant, not the individual significance decisions for the two statistics to be compared. This general principle is very clear to anyone if we look at the comparison of two means. Suppose you want to compare two groups regarding a dependent variable that ranges from -16 to +16. So you do a t-test within each group and test whether the mean of the dependent variable differs from zero:

    exampledata.group.means <- as.data.frame(
      read.table("https://raw.github.com/johannjacoby/interaction_and_slopes/master/group.mean.comparison.dat",
                 header=T, sep="\t", quote="", stringsAsFactors=F))
    #comparing group means individually against 0
    test1 <- t.test(exampledata.group.means[which(exampledata.group.means$group==1),]$dv)
    test2 <- t.test(exampledata.group.means[which(exampledata.group.means$group==2),]$dv)
    cat(
      "Group 1: t = ",test1$statistic,", p = ", sprintf("%10.9f",test1$p.value),"\n",
      "Group 2: t = ",test2$statistic,", p = ", sprintf("%10.9f",test2$p.value),"\n",
      sep="")

    You get the following results for the two groups:

    Group 1: t = 0.02738598, p = 0.978270349
    Group 2: t = 2.605806, p = 0.012717168

    So the group means are different, right? The group mean in Group 1 is not significantly different from zero, but the mean in Group 2 is nice and significant. And the graph confirms this - one of the means is essentially zero, the other one is significantly different from zero:

    library(gplots)
    dg <- barplot2(
      tapply(exampledata.group.means$dv, exampledata.group.means$group, mean),
      width=c(1,1),
      names.arg = c("Group 1", "Group 2"),
      xlim=c(0,3),
      ylim = c(min(c(test1$conf.int[1], test2$conf.int[1]))-1,
               max(c(test1$conf.int[2], test2$conf.int[2]))+1),
      plot.ci=TRUE,
      ci.l=c(test1$conf.int[1], test2$conf.int[1]),
      ci.u=c(test1$conf.int[2], test2$conf.int[2]),
      ci.width=.1)
    title(sub=expression(paste("Error bars denote 95% confidence intervals | * = significant at ",
                               alpha, " < .05", sep="")),
          cex.sub=.7, adj=0)
    text(dg[1], test1$conf.int[2]+.5,
         paste("M = ", sprintf("%3.2f", test1$estimate), " ",
               ifelse(test1$p.value < .05, "*", "n.s."), sep=""))
    text(dg[2], test2$conf.int[2]+.5,
         paste("M = ", sprintf("%3.2f", test2$estimate), " ",
               ifelse(test2$p.value < .05, "*", "n.s."), sep=""))

    Right? Of course not. Nobody in their right mind would accept such a comparison of two group means by looking at the difference in p values. The correct way to compare two means is not to look at the difference between the p values associated with testing each group mean against zero, but to test the difference between the means itself:

    #comparing the difference between the two group means against 0
    #(independent-samples t-test via the formula interface)
    test.both <- t.test(dv ~ group, data=exampledata.group.means)
    cat("Group comparison [M(Group 1) - M(Group2) against zero]: t = ", test.both$statistic,
        ", p = ", sprintf("%5.4f",test.both$p.value), "\n", sep="")

    This gives us a test of whether the difference between the means is different from zero:

    Group comparison [M(Group 1) - M(Group2) against zero]: t = -1.219916, p = 0.2257

    So, even though the group means appear to be different, based on their individual p values, they are not different: the difference between them is not significantly different from zero, and in the logic of significance testing the means should be considered not different. The best you can do to characterize the means is to take the same estimate for both: the grand mean. Any deviations from that grand mean can plausibly be attributed to chance.

    Of course nobody would proceed as described here (i.e., visually comparing p values for individual group means and drawing conclusions regarding the difference between them). But to the same degree that this strategy appears nonsensical for two group means, it also should appear nonsensical for the comparison of two slopes, two interactions, two structural equation models or the comparison of a total and direct effect in mediation analysis:

    • If you want to compare two means, test their difference against zero (instead of testing each individually against zero and then visually inspecting a difference between the two results). For group mean comparison, this test of the difference is achieved by an independent-samples t-test.
    • The same goes for simple slopes: test their difference against zero (instead of testing each individually against zero and then visually inspecting a difference between the two results). This test of the difference between slopes is achieved by the statistical test of the interaction effect.
    • And it goes on: if you want to know whether two two-way interactions differ, do not conduct two two-way analyses and then visually compare their results - instead, look at the three-way interaction; it tells you whether the two two-way interactions differ.
    • If you want to compare two structural equation models, do not estimate each of them and then visually inspect the individual RMSEA values (or any other fit indices) - instead, test the difference in fit between the models by testing the difference in χ².
    • If you want to compare two correlations between X and Y within different groups of Z, do not compute each correlation within the groups of Z and then visually inspect differences between the two correlations - rather, enter X, Z, and their product (the interaction term) as predictors into a regression model predicting Y; the interaction will then tell you whether the association between X and Y differs depending on Z (see the sketch after this list).
    • If you conduct a mediation analysis based on the recommendations formulated most prominently in Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173-1182, you might be tempted to compare the c and c' paths (the total effect and the direct effect that remains after statistically controlling for M in the prediction of Y, respectively). It is still widespread to draw substantive conclusions about "complete" or "partial" mediation from the comparison of these two paths. In this logic, mediation is "complete" if the total effect is significant but the direct effect is not. Imagine a case where c is associated with a p value of p = .04 (significant) and c' has a p value of p = .06 (not significant). According to the "complete"/"partial" mediation logic, such a pattern would have to qualify as "complete" mediation. On the other hand, consider a case where the total effect has p = .06 (not significant) and a positive sign, and the direct effect also has p = .06 but a negative sign (so it is also not significant, but points in the opposite direction): even though the indirect effect is much, much larger than in the former example, one would have to conclude "no indirect effect/no mediation" here. And finally, the total effect may be very large with a minuscule p value, and the direct effect may also be significant with a larger p value (but still smaller than .05); despite a huge difference between these two effects, you would have to conclude "only" "partial" mediation. One reason for this mess is that the distinction between "partial" and "complete" mediation relies on comparing individual p values instead of simply basing decisions on a test of the difference. This difference, in normal cases with continuous variables, is exactly the indirect effect, i.e., a × b, and it can be tested in a simple, elegant fashion without the visual inspection of p values that poorly represent effect sizes and differences between effects. More on this in Hayes, A. F. (2009). Beyond Baron and Kenny: Statistical mediation analysis in the new millennium. Communication Monographs, 76, 408-420, and Rucker, D. D., Preacher, K. J., Tormala, Z. L., & Petty, R. E. (2011). Mediation analysis in social psychology: Current practices and new recommendations. Social and Personality Psychology Compass, 5(6), 359-371.
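
    As a minimal sketch of the correlation comparison point above (the data frame mydata and the variables X, Y, and Z are hypothetical placeholders, not part of the example data in this post):

    mydata$centered.X <- scale(mydata$X, center=T, scale=F)
    mydata$centered.Z <- scale(mydata$Z, center=T, scale=F)
    # the centered.X:centered.Z interaction tests whether the association between X and Y
    # differs as a function of Z - this replaces eyeballing two separately computed
    # correlations and their p values
    fit <- lm(Y ~ centered.X * centered.Z, mydata)
    summary(fit)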

  2. Other combinations of interaction and slope significance decisions

    Of course, it also goes the other way around: you might find yourself in a situation where the X1×X2 interaction term is significant, but the simple slopes you compute do not seem to differ - they are both significant, e.g., one has a p value of p = .03 and the other one of p = .001. This case is just the other side of the coin of the issue discussed above. Of course the simple slopes at two conditional values of X2 that are one unit apart can differ and still both be significant, with one of them simply larger than the other. Remember that the p value does not give you a good idea of how large an effect is; it essentially tells you whether you should be wary that an effect of the size you found might plausibly be explained as a random deviation from an actual null effect. So again, if you want to know the difference between effects, you need to look at their difference and test that difference, rather than 'looking' at whether the difference between p values gives you a tingly gut feeling. And if the interaction is significant, two individually and very clearly significant simple slopes at X2 = mean of X2 and X2 = mean of X2 + 1 will also be statistically different. They are both so far from zero that you may deem it implausible for either of them to come from a true null distribution, but one of them may simply be larger than the other.

    And finally, you may come across cases where none of the simple slopes you elect to compute is significant, but the interaction is. This may of course happen because you chose to examine slopes at conditional values of X2 that are quite close together, but it also happens in cases where the conditional values of X2 (for the simple slopes of X1) are reasonably far apart. An example:

    example2 <- as.data.frame(
      read.table(
        "https://raw.github.com/johannjacoby/interaction_and_slopes/master/interaction_insignificant_slopes.dat",
        header=T, sep="\t"))
    example2$centered.X1 <- scale(example2$X1, center=T, scale=F)
    example2$centered.X2 <- scale(example2$X2, center=T, scale=F)
    example2$centered.X2.lo <- example2$centered.X2 + sd(example2$X2)
    example2$centered.X2.hi <- example2$centered.X2 - sd(example2$X2)
    model0.2 <- lm(Y~centered.X1*centered.X2, example2)
    model.lo.2 <- lm(Y~centered.X1*centered.X2.lo, example2)
    model.hi.2 <- lm(Y~centered.X1*centered.X2.hi, example2)
    results0.2 <- summary(model0.2)[[4]]
    results.lo.2 <- summary(model.lo.2)[[4]]
    results.hi.2 <- summary(model.hi.2)[[4]]
    cat("\n",
      "Interaction X1 * X2: b=",results0.2[4],", t=",results0.2[12],", p=",
        sprintf("%5.4f",results0.2[16]),ifelse(results0.2[16] < .05," *",""),"\n",
      "Slope of X1 @ X2 = -1SD: b=",results.lo.2[2],", t=",results.lo.2[10],", p=",
        sprintf("%5.4f",results.lo.2[14]),ifelse(results.lo.2[14] < .05," *",""),"\n",
      "Slope of X1 @ X2 = +1SD: b=",results.hi.2[2],", t=",results.hi.2[10],", p=",
        sprintf("%5.4f",results.hi.2[14]),ifelse(results.hi.2[14] < .05," *",""),"\n",
      "absolute diff(p) = |",results.hi.2[14]," - ",results.lo.2[14],"| = ",
        abs(results.hi.2[14] - results.lo.2[14]),"\n",
      "diff(b) = ",results.hi.2[2]," - ",results.lo.2[2]," = ",
        results.hi.2[2] - results.lo.2[2],"\n",
      sep="")

    In this particular data set, the following estimates and test results are obtained:

    Interaction X1 * X2: b=0.6179825, t=2.037498, p=0.0457 *
    Slope of X1 @ X2 = -1SD: b=-0.4550505, t=-1.238916, p=0.2199
    Slope of X1 @ X2 = +1SD: b=0.4935985, t=1.156491, p=0.2518
    absolute diff(p) = |0.251779 - 0.2199011| = 0.03187793
    diff(b) = 0.4935985 - -0.4550505 = 0.948649

    The interaction is significant, so the slopes differ. The simple slopes at X2 = -1SD and X2 = +1SD, however, are both not significant. But look at their signs: they are opposite. So even though neither simple slope is, strictly speaking, significantly different from zero, they differ from each other very clearly - after all, the difference in b is much larger than either simple slope coefficient by itself. The interaction term picks up this difference and is clearly significant.

2 comments:

Anonymous said...

Hi, thank you for this post. My question is: what if the moderator is a dichotomous variable such as sex? Then the two levels of the moderator are not just any two randomly selected levels; they are widely recognized social categories. In my analysis I find that even though the interaction of sex with the IV is not significant, the conditional effect or simple slope is significant for males but not for females. This is consistent with my hypothesis that the X-Y relationship will be significant for males but not for females. Any suggestions? Thank you for your help.

johann said...

Hello Mahmut,

thanks for your comment, which alerted me that this should be clarified.
The bottom line is: the argument I made in the post also applies to cases where the moderator is dichotomous, as in the example you mention, gender with two levels. If the interaction is not significant, there is no moderation, and the simple slopes must be regarded as two instances of the distribution of the same (main) effect, not as different.
Someone arguing that the simple slopes of a predictor should be interpreted as different, conditional on different values of another predictor (= moderator candidate), even though the interaction term of these two predictors is not significant, can exploit the freedom of choosing arbitrary (not necessarily random) values of the moderator at which to probe the simple slopes, and thereby find almost any combination of significant/non-significant simple slopes they desire.
But, as I argue, even in this case where one has a choice of moderator candidate values, the test of whether the simple slopes are different at different values of the moderator still corresponds to the test of the interaction term (the interaction term tests whether two slopes, i.e., generalized differences, differ from each other, rather than whether two slopes lead to different judgments about whether each of them separately differs from zero). It does not matter what configuration of significance decisions one finds for the separate slopes.
In the case of a dichotomous moderator, such a choice of arbitrary moderator values cannot even be made, as there are only two values. The argument thus does not hinge on whether the moderator candidate is restricted to two values or not. If you want to test whether the effect of a predictor differs depending on the values of the moderator, the general argument still applies: the test of the interaction effect tells you whether the slopes are different. If there is no interaction, the slopes are not statistically different, even though the two slopes you obtain may each have different p values leading to different significance decisions ("Is this slope different from zero?") about each of them separately.
Two different significance decisions for two regression coefficients are simply not the same as significance of the difference between the two coefficients. So even though your simple slopes (for women and men, respectively) may have different p values (on two different sides of .05, or on the same side, it does not matter), they are not statistically different, as evidenced by the absence of an interaction.
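
To make this concrete with a minimal sketch (the data frame d and the 0/1 coding of sex are hypothetical placeholders):

d$centered.X <- scale(d$X, center=T, scale=F)
# effect-code the dichotomous moderator so that the centered.X coefficient is the slope
# halfway between the two groups and the interaction is the difference between the two
# sex-specific slopes
d$coded.sex <- ifelse(d$sex == 1, .5, -.5)
fit <- lm(Y ~ centered.X * coded.sex, d)
summary(fit)  # the centered.X:coded.sex row tests whether the two slopes differ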

Best,
Johann