PLoS ONE
Home Inference in skew generalized t-link models for clustered binary outcome via a parameter-expanded EM algorithm
Inference in skew generalized t-link models for clustered binary outcome via a parameter-expanded EM algorithm
Inference in skew generalized t-link models for clustered binary outcome via a parameter-expanded EM algorithm

Competing Interests: The authors have declared that no competing interests exist.

Article Type: Research Article Article History
Abstract

Binary Generalized Linear Mixed Model (GLMM) is the most common method used by researchers to analyze clustered binary data in biological and social sciences. The traditional approach to GLMMs causes substantial bias in estimates due to steady shape of logistic and normal distribution assumptions thereby resulting into wrong and misleading decisions. This study brings forward an approach governed by skew generalized t distributions that belong to a class of potentially skewed and heavy tailed distributions. Interestingly, both the traditional logistic and probit mixed models, as well as other available methods can be utilized within the skew generalized t-link model (SGTLM) frame. We have taken advantage of the Expectation-Maximization algorithm accelerated via parameter-expansion for model fitting. We evaluated the performance of this approach to GLMMs through a simulation experiment by varying sample size and data distribution. Our findings indicated that the proposed methodology outperforms competing approaches in estimating population parameters and predicting random effects, when the traditional link and normality assumptions are violated. In addition, empirical standard errors and information criteria proved useful for detecting spurious skewness and avoiding complex models for probit data. An application with respiratory infection data points out to the superiority of the SGTLM which turns to be the most adequate model. In future, studies should focus on integrating the demonstrated flexibility in other generalized linear mixed models to enhance robust modeling.

Tovissodé,Diop,Glèlè Kakaï,and Citi: Inference in skew generalized t-link models for clustered binary outcome via a parameter-expanded EM algorithm

Introduction

Binary outcomes are prominent in many applied sciences, including but not limited to biological and social sciences. Moreover, in cross sectional as well as panel studies, dichotomous responses are often naturally grouped by sampling techniques or some properties of the sampling units [1]. The preferred modern method to analyze clustered binary data is through the Generalized Linear Mixed Model (GLMM) framework [2]. Indeed, when a binary outcome has been recorded repeatedly or in the presence of latent factors, GLMMs allow accounting explicitly for over-dispersion and correlation within clusters using random effects.

Let Yij denote the binary outcome (0 or 1) of the jth measurement (j = 1, 2, ⋯, ni) and Yi the collection of responses from the ith cluster, i.e. Yi=(Yi1,,Yini), i = 1, 2, ⋯, n. In terms of an underlying latent continuous random vector Zi=(Zi1,,Zini) and random effects bi = (bi1, ⋯, biq), the mixed probit model (PM) assumes that Yij are conditionally independent and given as [3]:

where IA(x) is the indicator function which equals to 1 if xA and 0 otherwise; ηi is the nivector of linear predictors, ηi=(ηi1,,ηini)=Xiβ+Wibi; β is the pvector of fixed effects; Xi=(Xi1,,Xini) and Wi=(Wi1,,Wini) are respectively known ni × p and ni × q matrices of covariates with Xij = (Xij1, ⋯, Xijp) and Wij = (Wij1, ⋯, Wijq); Ini is the ni × ni identity matrix and Nq(0,D) denotes the q-variate normal distribution, with null mean vector and variance-covariance an unknown q × q matrix D, meant to capture the dependence structure of Yi. The latent variable Zij serves for a convenient stochastic representation of the conditional outcome Yij. Equivalently, one may write P(Yij = 1|bi) = Φ(ηij) with Φ(⋅) the cumulative distribution function (cdf) of the standard normal distribution, standing as the inverse link function mapping the linear predictor ηij and the predicted probabilities of the outcome Yij. Combined with the normality assumption on random effects, the systematic use of this link and the well known alternative, the logit link, is somewhat controversial [4, 5].

The link function indeed has a critical role in GLMMs since it heavily impacts estimates, predictions and consequently interpretations [4, 6]. As a result, in the binary generalized linear model literature, aside the logistic and probit models based on the steady shape logistic and normal distributions respectively, there has been increasing efforts to render the link function flexible. Many works have considered heavy tailed link functions, for instance the Semi-Nonparametric [7], Student-t [8] and generalized t [9] distributions, and elliptical scale mixtures [10, 11]. Indeed, the maximum likelihood estimators of logistic and probit regression models are not robust to outliers [7]. Heavy tailed links are not sensitive to outliers and thus allow outlier-robust inference. In particular, the links functions based on the Student t distribution incorporate observation-specific stochastic weights which can be used for outlier detection [7, 12]. Similarly, skew-probit [13], skew generalized t [9], asymmetric logistic [14], loglog and complementary loglog, power symmetric and reciprocal power symmetric [15] links were used among others to handle situations where the probability of a given binary response approaches zero at a different rate than it approaches one. Skew logistic distributions have also been developed (see e.g. [16]) and may be used with the same aim in mind. For example [9], discussed a prostate cancer study where the outcome variable Y represents the presence or the penetration of cancer in or near the prostate capsule of patients. The rate at which the probability of “Y = 1” approaches one is expected to be very different (slower) from the rate at which it approaches zero [9]. For this study, the skew generalized t-link fits best the data [9], indicating that the simultaneous use of skewed and heavy tailed link functions can lead to more effective modelling of binary data.

Furthermore, although random effects are traditionally assumed to be normally distributed in GLMMs, this may not be realistic [17, 18]. Therefore, huge efforts have been devoted to making the random effects distribution in GLMMs flexible, replacing the normal distribution with, for instance Semi-Nonparametric [19], probability integral transformation of normal [20], skew normal [21], log-normal [22, 23], Student-t [24] and scale mixtures of normal [25] distributions.

The above background demonstrates the extent to which the number of possible approaches for fitting a flexible GLMM to correlated binary outcomes goes, although none of these approaches attempts to explicitly account for skewness and tail behavior of the link function as well as the random effects distribution simultaneously. However, the misspecification of the link function or the random effects distribution can introduce substantial bias and reduce the accuracy of the mean response as well as heterogeneity estimates [6, 18]. Standing in a fully parametric frame, we propose a unifying approach based on skew generalized t (SGT) distributions [26], that is the class of models including among others the normal, the skew normal and the Student t models. The use of a skew generalized t family instead of the Student t family allows to rescale fixed effects so that they have the same interpretation as in the mixed probit model in Eq (1).

Our contributions include i) an extension of the flexible generalized t-link model built for independent binary samples proposed by [9] to deal with dependent binary samples (mixed model); ii) a parameter-expanded EM algorithm [27] for computing the maximum likelihood of skew generalized t-link models for correlated binary data, extending the EM algorithm of [24] for t-link models; and iii) empirical Bayes estimators of skew t distributed random effects in mixed models for binary data.

The organization of the paper is as follows. Section 2 presents preliminary results on the SGT distributions and the truncated SGT distributions and their first two moments. Section 3 specifies the SGT-link model (SGTLM) and describes maximum likelihood estimation and cluster-specific inference based on random effects and weights. A simulation study assessing the relative performance of the SGTLM relative to existing methods and the application of the modelling approach to a real respiratory infection data are presented in Section 4. Concluding remarks are given in section 5.

Preliminary results

In this section, we present some useful properties of the skew generalized t distributions.

Multivariate skew generalized t distributions

Multivariate skew generalized t (SGT) distributions are special cases of multivariate skew scale mixture of normal (SSMN) distributions [28] (pages 102-103) which we first introduce. A random variable Z is said to follow a pvariate SSMN distribution with location μ, scale Ω, and shape λ, if it can be represented as [29] (page 20, Eq 3.12):

where U, called scale mixing variable, is a positive random variable with cdf FU(⋅|ν) indexed by a parameter vector ν, HN(0,1) is the standard half normal distribution; Z0, X and U are independent; Ω¯=Ωδδ and δ = (1 + λ λ)−1/2 Ω1/2 λ. Different choices of the scale mixing distribution FU(⋅|ν) result in various sub-classes of skew elliptical distributions, for instance, the skew normal when P(U = 1) = 1 [28] (page 103); the skew contaminated normal when ν = (ν1, ν2), 0 < ν1 < 1, 0 < ν1 ≤ 1 and U is discrete and takes the values U = 1 with probability 1 − ν1 and U = ν2 with probability ν1 [30] (page 308); the skew slash when UBeta(ν,1), ν > 0 [30] (page 307); and the skew generalized t when ν = (ν, ν0), ν > 0, ν0 > 0, UGamma(ν/2,ν0/2) [28] (page 105). The following result states conditions for the identifiability of SSMN distributions, a requirement for reliable inference using this class of distributions.

Lemma 1 (see S1 Appendix for a proof) The free parameters (μ, δ, Ω and ν) of a SSMN distribution with the representation in Eq (2) are identifiable if and only if i) the scale mixing distribution FU(⋅|ν) is identifiable and ii) FU(⋅|ν) does not satisfy FU(u|ν)=H(uνk|νk) for any element νk of ν and any distribution function H(⋅|νk) where νk is the vector ν without the element νk. If U has a probability density function (pdf) fU(u|ν) for all u > 0, then the condition ii) is equivalent to fU(⋅|ν) does not satisfy fU(u|ν)=1νkhU(uνk|νk) for any pdf hU(⋅|νk).

On setting c=2/π and defining the expectations U˜t=EU{Ut/2}, and assuming that U˜t< for the required expectations, the first two central moments of a SSMN vector Z are given by [28] (pages 109-110):

The ability of the SSMN distributions to capture more data structure than the normal, the skew normal or the scale mixture of normals is reflected in the expressions for skewness (γ1k) and kurtosis (γ2k) indices given for the kth marginal of Z as [31]:
where δk is the kth element of δ and σk2 is the kth diagonal element of the covariance matrix given in Eq (4).

We notice from the expressions for skewness Eq (5) and kutosis Eq (6) indices that the parameter λ controls the shape of the distribution only through the working shape parameter δ = (1 + λ λ)−1/2 Ω1/2 λ. This quantity is invariant under marginalization, i.e. by the stochastic representation in Eq (2), for any arbitrary subset of Z, the working shape parameter is the corresponding subset of δ. It is worth noticing however that the quantity δ cannot be specified unrestrictedly, independently of Ω. Indeed, we observe that δ = (1 + λ λ)−1/2 Ω1/2 λ implies that λ = (1 + λ λ)1/2 Ω−1/2 δ. This in turn gives λ λ = (1 + λ λ)δ Ω−1 δ which yields λ λ(1−δ Ω−1 δ) = δ Ω−1 δ. We then get λλ=δΩ1δ1δΩ1δ so that 1+λλ=1(1δΩ1δ), i.e. 1 + λ λ = (1 − δ Ω−1 δ)−1 provided δ Ω−1 δ ≠ 1. Therefore, λ is recorvered from δ and Ω under the constraint δ Ω−1 δ < 1 as:

It is nevertheless possible to unrestrictedly specify δ and Ω¯ (positive definite). In this case, Ω is recovered as Ω=Ω¯+δδ. Using the Sherman-Morrison identity [32] (page 121, Eq 3.1), we have Ω1=(Ω¯+δδ)1=Ω¯1Ω¯1δδΩ¯11+δΩ¯1δ from which we get δΩ1δ=δΩ¯1δδΩ¯1δδΩ¯1δ1+δΩ¯1δ that simplifies as δΩ1δ=δΩ¯1δ1+δΩ¯1δ. We thus have 1δΩ1δ=11+δΩ¯1δ hence
In the binary data modeling framework, we shall consider δ and Ω¯ as model parameters as they turn to be easier to estimate by the mean of the EM algorithm. For the multivariate Skew Generalized t (SGT) distribution, the mixing variable U is gamma distributed, i.e. UGamma(ν/2,ν0/2) with pdf [33] (page 1, Eq 1):
The pvariate SGT distribution, denoted SGTp(μ,Ω,λ,ν) with ν = (ν, ν0) has pdf for zRp [28] (page 106):
where
is the pdf of the pvariate Generalized t (GT) distribution, z0 = Ω−1/2 (zμ), α = λ z0, and T(⋅|ν) is the cdf of the standard univariate t distribution with ν degrees of freedom. For SGT distributions, the expectations U˜t required for computing moments given in Eqs (3)–(6) have for t < ν the form
It is worthwhile noticing that the gamma mixing pdf fG(⋅|ν/2, ν0/2) satisfies the condition i) of Lemma 1 but not the condition ii). The SGT ditribution with ν as a parameter is thus not identifiable. However, restricting ν0 to a fixed value (so that only ν is considered as a parameter) is sufficient to ensure identifiability of the SGT family of distribution. When ν0 = ν, the pvariate SGT distribution SGTp(μ,Ω,λ,ν) reduces to the pvariate Skew t (ST) distribution [28] (page 106), denoted STp(μ,Ω,λ,ν) which is thus identifiable with pdf Stp(⋅|μ, Ω, λ, ν) and cdf Stp(⋅|μ, Ω, λ, ν). If λ = 0, the SGT distribution reduces to the GT distribution which equals the Student t distribution for ν0 = ν. The following lemma formalizes the relationship between skew generalized t and skew t distributions.

Lemma 2 (see S2 Appendix for a proof) Let us consider the SGT distribution SGTp(μ,Ω,λ,ν) with ν = (ν, ν0) and pdf in Eq (10). Set Ω*=ν0νΩ. The following statements hold:

    SGtp(z|μ, Ω, λ, ν) = Stp(z|μ, Ω*, λ, ν);

    If ZSGTp(μ,Ω,λ,ν) then ZSTp(μ,Ω*,λ,ν).

Lemma 2 indicates that any SGT vector is a rescaled version of a ST vector. However, in the frame of binary data models, the use of a SGT distribution instead of a simple ST distribution as link function allows to control the scale of the link function through the parameter ν0 [9]. Specifically, a skew generalized t-link allows to define a skewed and heavy-tailled binary mixed model where fixed effects have the same scale and interpretation as in the mixed probit model in Eq (1). Interestingly, the popular logit and probit links for binary data can be recast as special cases of the cdf of the SGT class of distributions. Indeed, the normal distribution is a limiting case of SGT distributions when ν0 = ν → ∞ and λ = 0. Moreover, the logistic distribution is well appoximated by a rescaled Student t distribution with appropriate degrees of freedom [8] (page 228). These constatations make the SGT distributions appropriate for extending the traditional probit and logistic GLMMs, accounting for skewness and heavy tail behaviors. To this end, we present in the next section some results on truncated multivariate SGT distributions since binary data can reflect truncation of latent continuous variables.

Truncated multivariate skew generalized t distributions

As seen for the mixed probit model in Eq (1), models for binary data can be defined by truncating latent variables following continuous distributions. We define in this section a class of truncated multivariate skew generalized t distributions which are useful for a latent variable representation of skew generalized t-link binary data models. We also give expressions to evaluate some joint moments of a truncated multivariate skew generalized t distribution and a gamma distribution, as they prove useful in the implementation of the EM algorithm for the skew generalized t-link model.

Let TSGTp(μ,Ω,λ,ν,A) represent a pvariate skew generalized t (SGT) vector restricted to a p-dimensional hyperplane A; with μ a pvector (location), Ω a p × p positive definite matrix (scale), λ a pvector (shape) and ν = (ν, ν0) a vector of positive scalars (degrees of freedom). The pdf of ZTSGTp(μ,Ω,λ,ν,A) is:

where SGtp(⋅|μ, Ω, λ, ν) is the pdf in Eq (10) and αst=ASGtp(z|μ,Ω,λ,ν)dz serves for normalization. The cdf of Z is denoted TSGTp(·|μ,Ω,λ,ν,A). When λ = 0, we obtain a truncated generalized t distribution denoted TGTp(μ,Ω,ν,A) with pdf TGtp(·|μ,Ω,ν,A) and cdf TGTp(·|μ,Ω,ν,A). When ν0 = ν, we get a truncated ST distribution denoted TSTp(μ,Ω,λ,ν,A) with pdf TStp(·|μ,Ω,λ,ν,A) and cdf TSTp(·|μ,Ω,λ,ν,A). If both λ = 0 and ν0 = ν, the truncated multivariate SGT distribution is reduced to a truncated multivariate t distribution [34] denoted TTp(μ,Ω,ν,A) with pdf Ttp(·|μ,Ω,ν,A) and cdf TTp(·|μ,Ω,ν,A).

In the frame of correlated binary data models, the truncation region A typically has the form A=A1×A2××Ap where Ak are real intervals of the form Ak=(,ak] or Ak=(ak,), for akR (k = 1, 2, ⋯, p). Let us consider for instance a vector Y of three binary outcomes obtained by truncating the elements of a 3−variate SGT vector ZSGT3(μ,Ω,λ,ν): Yk = 0 if Zk ≤ 0 and Yk = 1 if Zk > 0. In practice, however, only the binary outcomes (Y) are observable whereas the latent outcome Z is unobservable. Suppose one observes the binary outcomes y = (1, 0, 1). This implies that the corresponding value z of the latent vector Z satisfies z1 > 0, z2 ≤ 0, and z3 > 0. The conditional distribution of Z given Y = y (required for maximum likelihood estimation using the EM algorithm) is thus SGT3(μ,Ω,λ,ν) truncated to the region Ae.g.=(0,)×(,0]×(0,), i.e. Z|Y=yTSGTp(μ,Ω,λ,ν,Ae.g.) as defined in Eq (13).

We shall use the simplified notation TSGTp(μ,Ω,λ,ν,a) with aRp to denote a truncated SGT distribution TSGTp(μ,Ω,λ,ν,A) when A is the right truncated hyperplane A={zRp|z1a1,,zpap}. In this case, αst = SGTp(a|μ, Ω, λ, ν) with SGTp(⋅|μ, Ω, λ, ν) the cdf of the pvariate ST distribution. This corresponds for instance to the situation where all binary outcomes are zeros. When λ = 0, the right truncated SGT distribution TSGTp(μ,Ω,0,ν,a) is a right truncated GT distribution denoted TGTp(μ,Ω,ν,a) whose pdf and cdf are respectively denoted TGtp(⋅|μ, Ω, ν, a) and TGTp(⋅|μ, Ω, ν, a). When ν0 = ν, the right truncated SGT distribution is a right truncated ST distribution denoted TSTp(μ,Ω,λ,ν,a) with pdf TStp(⋅|μ, Ω, λ, ν, a) and cdf TSTp(⋅|μ, Ω, λ, ν, a). If both λ = 0 and ν0 = ν, the distribution is reduced to a right truncated t distribution denoted TTp(μ,Ω,ν,a) with pdf Ttp(⋅|μ, Ω, ν, a) and cdf TTp(⋅|μ, Ω, ν, a). In the above example, if y = (0, 0, 0), then the truncation region becomes Ae.g.=(,0]×(,0]×(,0]. Since all truncation points are zeros, we shall write in this case Z|Y=yTSGTp(μ,Ω,λ,ν,a) with a = (0, 0, 0) using the above simplified notation.

The implementation of an EM algorithm for a SGT distribution based binary data model requires joint moments of the form ur¯=E{Ur/2}, urzs¯=E{Ur/2Z(s)}, τr¯=E{Ur/2ζ1(U1/2α)} and τrzs¯=E{Ur/2ζ1(U1/2α)Z(s)} for s ∈ {1, 2}, Z(1) = Z and Z(2) = Z Z, UGamma(ν/2,ν0/2), ZTSGTp(μ,Ω,λ,ν,A), α = λ Ω−1/2(Zμ), ζ1(x) = ϕ(x)/Φ(x) with ϕ(⋅) the pdf of the standard normal distribution, and A is an hyperplane of the form A=A1××Ap with Ak{(,ak],(ak,)}, (ak, ∞)} for a = (a1, ⋯, ap). Proposition 1 hereafter will be useful for the derivation of ur¯, urzs¯, τr¯ and τrzs¯.

Proposition 1 (see S3 Appendix for a proof) Let ZTSGTp(μ,Ω,λ,ν,A) with ν = (ν, ν0), UGamma(ν/2,ν0/2) and set α = λ Ω−1/2(Zμ). Then, for any real r > − ν and an integrable function g(⋅) of Z:

where Cr(ν)=(2ν0)r/2Γ(ν+r2)Γ(ν/2), αst=AStp(z|μ,ν0νΩ,λ,ν)dz, ατ,r=Atp(z|μ,Ω¯*,ν+r)dz, and αu,r=AStp(z|μ,Ω*,λ,ν+r)dz; c=2/π, Zu,rTSTp(μ,Ω*,λ,ν+r,A), Zτ,rTTp(μ,Ω¯*,ν+r,A), M=(1+δΩ¯1δ)1/2 with δ=(ν0ν)1/2(1+λλ)1/2Ω1/2λ, Ω¯=ν0νΩδδ, Ω¯*=νν+rΩ¯ and Ω*=ν0ν+rΩ.

By the mean of a simple linear transformation, we obtain from Proposition 1 the joint expectations ur¯, urzs¯, τr¯ and τrzs¯ in terms of moments of a truncated multivariate skew t distribution.

Corollary 1 (see S4 Appendix for a proof) Let ZTSGTp(μ,Ω,λ,ν,A) with ν = (ν, ν0), UGamma(ν/2,ν0/2), aRp and A=A1××Ap with Ak=(,ak] or Ak=(ak,). Then, on setting δ=(ν0ν)1/2(1+λλ)1/2Ω1/2λ, Ω¯=ν0νΩδδ, A = diag(A1, ⋯, Ap) with Ak = 1 if Ak=(,ak] and Ak = −1 if Ak=(ak,), a* = Aa, μ* = A μ, Ω*=ν0νAΩA, Ω¯*=AΩ¯A, λ* = and αst = STp(a*|μ*, Ω*, λ*, ν),

where Xu,rTSTp(μ*,νν+rΩ*,λ*,ν+r,a*), Xτ,rTTp(μ*,νν+rΩ¯*,ν+r,a*), and we have set αu,r=STp(a*|μ*,νν+rΩ*,λ*,ν+r) and ατ,r=Tp(a*|μ*,νν+rΩ¯*,ν+r).

For a practical use of Corollary 1, the cumulative multivariate skew t distribution is required. To this end, the function pmst of the package sn [35] in R freeware [36] is an appropriate routine.

Moments of truncated multivariate skew generalized t distributions

The evaluation of expectations E{Xu,r(s)} involved in Corollary 1 calls for general expressions for the first and second order moments of truncated multivariate SGT distributions. These moments are required in the implementation of an EM algorithm for a SGT distribution based binary data model. The moments have been derived for truncated multivariate t distributions by [34] and were used by [24] in their implementation of the EM algorithm for a t-link GLMM. We present in this section the expressions for the first two moments of the multivariate SGT distributions, relying on the Theorem 1 of [37] and the moments of truncated multivariate t distributions available from [34] (see also [38]).

Let ZTSGTp(μ,Ω,λ,ν,a) with ν = (ν, ν0) and aRp, i.e. a pvariate SGT vector restricted to the right truncated hyperplane A={xRp|x1a1,,xpap}. The pdf of Z is:

where αst = SGTp(a|μ, Ω, λ, ν), SGTp(⋅|μ, Ω, λ, ν) is the cdf of the pvariate SGT distribution. If μ = 0, Ω is a correlation matrix (Ω = R) and ν0 = ν, then ZTSTp(0,R,λ,ν,a). In this case, the first two moments of Z can be evaluated using the following proposition which simply combines Theorem 3 in [34] with Theorem 1 in [37].

Proposition 2 (see S5 Appendix for a proof) Let ZTSTp(0,R,λ,ν,a) with R a correlation matrix. Then,

where C0*=Γ(ν12)Γ(ν22)(νπ)1/2Tp(a|0,νν1(Rδδ),ν1) with Tp(⋅|μ, Ω, ν) the cdf of the pvariate t distribution with location μ, scale Ω and degrees of freedom ν; q*(a)Rp with ith element qi*(ai)=ν2νt(ν2νai,ν2)Tp(a¯2(i)|aiR¯12(i),ν+ai2ν1R¯22.1(i),ν1), t(⋅) being the pdf of the standard Student t distribution; αst*=Tp+1(a¯|0,νν2R*,ν2) with R*=(σ02σ0δσ0δR), σ0=1+λλ and δ=σ01R1/2λ; H0*Rp with ith element H0i*=νν4t2(a1(0i)|0,νν4R11(0i),ν4)Tp1(a¯2(0i)|μ¯2.1(0i),ν+α0iν2R¯22.1(0i),ν2), α0i=ai2(1δi2); H* is the p × p matrix with diagonal elements Hii*=0 and off diagonal elements defined as
with αij=ai22ρijaiaj+aj2(1ρij2); D0*=δH0*; D* is the p × p diagonal matrix with diagonal elements Dii*=δiH0i*aiqi*(ai)RiH*i, H*i denoting the ith column of H*; a¯=(0,a), R11(0i)=(1δiδi1), R11(ij)=(1ρijρij1), R¯=(1δδR), with δi the ith element of δ, ρij the (ij)th element of R; Hi the ith column of H; a¯2(i) the vector a with its (i + 1)th element (i.e.ai) deleted; R¯12(i) the (i + 1)th column of R¯ with its (i + 1)th element (i.e. 1) deleted; R¯22.1(i)=R¯22(i)R¯12(i)R¯12(i)T, R¯22(i) being R¯ with its (i + 1)th row and column deleted; a1(ij)=(ai,aj); a¯2(ij) the vector a¯ with its (i + 1)th and (j + 1)th elements (i.e.ai and aj) deleted; R¯22.1(ij)=R¯22(ij)R¯12(ij)[R11(ij)]1R¯12(ij)T, R¯22(ij) being R¯ with its (i + 1)th and (j + 1)th rows and columns deleted; R¯12(ij) being the matrix R¯ with its (i + 1)th and (j + 1)th columns deleted, and only its (i + 1)th and (j + 1)th rows kept; μ¯2.1(ij)=R¯12(ij)[R11(ij)]1a1(ij); a1(0i)=(0,ai); a¯2(0i) the vector a with its ith element (i.e.ai) deleted; R¯22.1(0i)=R¯22(0i)R¯12(0i)[R11(0i)]1R¯12(0i)T, R¯22(0i) being R with its ith row and column deleted; R¯12(0i) being the matrix R¯ with its first and (i + 1)th columns deleted, and only its first and (i + 1)th rows kept; and μ¯2.1(0i)=R¯12(0i)[R11(0i)]1a1(0i).

The following corollary gives the first two moments of a general right truncated SGT vector ZTSGTp(μ,Ω,λ,ν,a) with ν = (ν, ν0).

Corollary 2 Let ZTSGTp(μ,Ω,λ,ν,a) with ν = (ν, ν0). Then,

where Λ=ν0/νdiag(ω1,,ωp), ωi2 is the ith diagonal element of Ω, XTSTp(0,R,λ*,ν,a*), R is the correlation matrix from Ω, λ*=ν0/νR1/2Λ1Ω1/2λ, a* = Λ−1(aμ) and E{X} and E{XX} are available from Proposition 2.

When ν → ∞, the truncated multivariate SGT family has the truncated multivariate skew normal family as a limiting case (see S5 Appendix for a definition and formulas for computing the first two moments).

Skew generalized t-link mixed binomial model

This section defines the skew generalized t-link model (SGTLM) and describes an Expectation-Maximization (EM) algorithm [39] accelerated using parameter expansion [27] for likelihood inference. Empirical Bayes estimators of random effects and weights are also obtained for cluster specific inference.

Model specification and marginal log-likelihood

The skew generalized t-link GLMM (SGTLM) considered in this work is defined as:

where Yij is the binary outcome of the jth measurement (j = 1, 2, ⋯, ni) on the ith cluster (i = 1, 2, ⋯, n), Zi is a latent continuous outcome which determines the observable Yi=(Yi1,,Yini), and bi is a vector of q random effects associated to the cluster i. In Eq (27), ηi = Xi β + Wi bi, β, Xi and Wi are as in Eq (1); c=2/π, U˜1=(ν2)1/2Γ(ν12)Γ(ν/2), δεR, Jni is the nivector of all ones, Ωεi=Ini+δε2JniJni, ν=(ν,νυ02) with υ0 > 0 and ν > 2; D=D¯+δδ and λ=(1+δD¯1δ)1/2D1/2δ with δRp and D¯ a q × q positive define matrix.

In the SGTLM, the distribution of a single latent outcome Zij is SGT(ηijcU˜1υ0δε,ωε2,δε,ν) where ωε2=1+δε2 and SGT(μ,ω2,λ,ν) denotes a univariate SGT distribution with location μ, scale ω2, shape λ and degrees of freedom ν. Therefore, on denoting SGT(⋅|μ, ω2, δ, ν) the cdf of a scalar SGT distribution SGT(μ,ω2,δ,ν), the success probability of an outcome Yij given the random effects bi is P(Yij=1|bi)=SGT(0|ηijcU˜1υ0δε,ω2,δε,ν). Unlike in the common probit model (PM) (see Eq (1)), for a given cluster i, the ni latent outcomes Zij are not independent given the random effects bi. Indeed, on using Eq (4) and setting U˜2=νν2, the variance-covariance matrix of Zi given bi is

so that the correlation coefficient between two elements Zij and Zik of Zi with kj is ρ0=(U˜2cU˜12)δε2U˜2+(U˜2cU˜12)δε2. Thus, conditional on random effects, the ni latent outcomes in Zi are uncorrelated only when δε = 0, i.e. a skewed link function implies correlated latent outcomes within a cluster i.

The positive constant υ0 in the SGTLM controls the scale of the latent variable Zij and thus the scale of the model link function. Indeed, from Eq (28), the conditional variance of Zij is Var{Zij|bi}=υ02[U˜2+(U˜2c2U˜12)δε2]. Setting υ0 = 1 would yield a skew t-link model (i.e. ν = (ν, ν)). However, to make fixed effects in the SGTLM comparable with fixed effects in the common probit model (PM) characterized by a link function with a unit scale (i.e. Var{Zij|bi} = 1), we have set

The application of Eq (3) to bi shows that, as in the PM, random effects in the SGTLM have null mean vector E{bi} = 0. Using Eq (4), the variance-covariance matrix of random effects is given by
When δε = 0 and δ = 0, the SGTLM is reduced to the t-link model in [24] except υ0 = 1 therein. As ν → ∞ (so that Ui = 1), the STGLM has as limiting case the mixed skew-probit model (SPM) which reduces to the PM for δε = 0 and δ = 0.

By Eq (2), the SGTLM has the stochastic representation

where Ui and Vi are independent. In this representation of the SGTLM in terms of more common distributions, the ni binary outcomes Yij of a cluster i are independent given the random effects bi, the scale mixing variable Ui and the half normal variable Vi. Note that given Ui and Vi, Zi|bi and bi are normally distributed and share the same Ui and Vi. As a result, the joint distribution of Zi and bi belongs to the class of SGT distributions. From the stochastic representation in Eq (31), we obtain the unconditional distributions of Zi and Yi=(Yi1,,Yini) as follows.

Proposition 3 (see S6 Appendix for a proof). Let us consider the latent vector Zi and the binary variable Yij in Eq (27) and define μi=XiβcU˜1Δi with Ωi=Ω¯i+ΔiΔi, Ω¯i=υ02Ini+WiD¯Wi, Δi=υ0δεJni+Wiδ, and the related shape parameter λi=Ωi1/2Δi(1+ΔiΩ¯i1Δi)1/2. Then ZiSTni(μi,Ωi,λi,ν). Furthermore, the vector of binary outcomes Yi has a multivariate Bernoulli distribution with joint probability mass function,

and Yij has a Bernoulli distribution with success probability P(Yij=1)=ST(μij|0,ωij2,λij,ν) and probability mass function,
where Ai=diag(Ai1,,Aini), Aij = 1 − 2yij, ωij2 is the jth diagonal element of Ωi, λij=Δij/ω¯ij, and ω¯ij2=ωij2Δij2 with Δij and μij the jth elements of Δi and μi respectively.

Eq (32) conveniently expresses for a value yi of Yi, the likelihood as a cumulative probability of a ST distribution whose location, scale and shape parameters depend on yi, using the identities P(Zij > zij) = P(−Zij < −zij) and sign(Zij) = 2Yij − 1 where sign(⋅) returns the sign of its argument. On using Eq (4) on the distribution of Zi given in Proposition 3, the variance-covariance matrix of the outcomes at the latent scale is

Thus, in a model with a cluster-specific random intercept (q = 1) with δ = 0 and D¯=σ¯b2, the latent intra-class correlation coefficient (the proportion of variance explained by clustering at latent scale) is given by
The joint distribution of Zi and bi (i.e. (Zi,bi)) is STni+q(μ˜i,Ω˜i,λ˜i,ν) where μ˜i=(XiβcU˜1ΔicU˜1δ), Ω˜i=Ω˜¯i+δ˜iδ˜i, Ω˜¯i=(Ω¯iWiD¯D¯WiD¯), δ˜i=(Δiδ) and λ˜i=Ω˜i1/2δ˜i(1+δ˜iΩ˜¯i1δ˜i)1/2. Thus, for j = 1, 2, ⋯, ni and k = 1, 2, ⋯, q, the correlation between Zij and bk is ρijk=σijkσijσk with σk2=U˜2D¯k2+(U˜2c2U˜12)δk2 the variance of bk, σij2=U˜2(υ0+WijD¯Wij)+(U˜2c2U˜12)(υ0δε+Wijδ)2 the variance of Zij and σijk=U˜2WijD¯k+(U˜2c2U˜12)(υ0δε+Wijδ)δk is the covariance between Zij and bk, Wij is the jth row of Wi and D¯k is the kth column of D¯, D¯k2 is the kth diagonal element of D¯ and δk is the kth element of δ.

The parameters of the SGTLM include β, δε, δ, vech(D¯) and ν where the vech(⋅) operator returns the lower triangle elements of its matrix argument. In order to avoid non-regular likelihood problems occurring in models based on the Student distribution and its extensions (in particular when ν is close to zero) [40], we follow some recent related works [24, 41] and first consider ν as known, focusing on θ=(β,δε,δ,vech(D¯)). Classical inference on θ is based on the marginal likelihood of the observed data y. Using Proposition 3, the joint marginal log-likelihood of n independent clusters yi (i = 1, 2, ⋯, n) is:

From Eq (36), an optimization routine like the R function optim can be used for inference on θ. We however develop an EM algorithm to circumvent the ni-dimensional integral in Eq (32) when estimating θ.

Model identifiability

Estimations in the skew generalized t-link model (SGTLM) may produce inconsistent results which would induce unreliable and misleading conclusions, if the model is not identifiable. It is thus of great importance to check whether different points in the parameter space can be distinguished from observations yi. We analyse in this section the identifiability of the SGTLM and indicate when it is necessary to restrict the parameter space to ensure reliable inference from observed data. We restrict attention to the case υ0 = 1 (ignoring Eq (29)) since υ0 is an artificial device only included to ensure a unit variance in the conditional link function as in the traditional probit mixed model.

Although not sufficient, the identifiability of the SGTLM requires both the marginal random effects distribution and the conditional model given random effects to be identifiable. The identifiability of the random effects distribution follows from the identifiability of multivariate skew t distributions. We survey the identifiability of the conditional model before turning to the marginal model.

Conditional identifiability

Conditional on the random effects bi and for fixed degrees of freedom parameter ν, the identifiability of the SGTLM reduces to the identifiability of the fixed effects skew-probit model. This follows because the skew t inverse link function is an average of the skew-probit inverse link function with respect to the gamma mixing distribution. The identifiability of the skew-probit model with one covariate has been recently shown to depend on the nature (binary/continuous) of the covariate in the model [42]. Indeed, the fixed effect skew-probit model is not identifiable in the absence of any covariate (i.e. each Xi is a column of ni ones) [42] (page 1624, Proposition 2.1) or in the presence of a binary covariate [42] (page 1626, Proposition 2.2). On the other hand, the fixed effect skew-probit model is identifiable when the covariate is continuous [42] (page 1627, Proposition 2.3). Extension to the case of multiple covariates is straightforwardly obtained by requiring the covariate matrix X=(X1,X2,,Xn) to be of full column rank as in the classical linear regression model context. Whenever binary covariates are considered or no covariate is considered, we advocate to set δϵ = 0 so that the conditional model reduces to a classical probit model.

Marginal identifiability

From Proposition 3, it appears that when υ0 = 1 the paramaters δε and δ enter the marginal distribution of Yi only through the marginal working shape Δi=δεJni+Wiδ whose jth element can be written Δij=δϵ+Wijδ. As a result, caution is required when learning the model parameters from some realizations yi of Yi. Indeed, if the model includes a random intercept term, i.e.Wij has the form Wij=(1,W1ij) where W1ijRq1, we can partition the random effects working shape as δ=(δ0,δ1) with δ1Rq1 so that the jth element of the marginal working shape reads Δij=δϵ+δ0+W1ijδ1. Therefore, only the sum δϵ + δ0 could be estimated and it would not be possible to distinguish in the observed skewness the part due to the random intercept from the part due to the conditional link function. This confounding issues may actually be avoided by considering more complex models based for instance on the fundamental skew distributions [43]. Recall that skewness is introduced in the SGTLM through a hidden standard half normal variable, namely Vi. As opposed to the unique standard half normal variable Vi used for both Zi and bi in the SGTLM, the use of two different standard half normal hidden variables for Zi and bi [44] (page 667 eq 5-6) or two different standard half multivariate normal hidden vectors [45] (page 420 eq 2.2) remove the confounding problem.

Fortunately, the non identifiability of the skewness of conditional link function and random intercept does not affect the success probability of the response since this only depends on Δi, but not on the individual values of δϵ and δ0. However, since the conditional link scale depends on δϵ through υ0, the confounding problem affects the scale and thus the interpretation of fixed effects. Moreover, inference on the random intercept is affected since the random intercept variance and skewness depend on δ0. For example, δϵ + δ0 = 0 only indicates null marginal skewness, and in no way absence of link function and random intercept skewness which could be equally strong but of opposite signs. Thus, only a lower bound can be given to the random intercept variance: U˜2D¯11+(U˜2c2U˜12)δ02U˜2D¯11 where D¯11 is the first diagonal element of D¯. To rule out this peculiar situation where the model is not marginally fully identifiable, some previous works on skew normal/skew t distributions have considered the restriction δϵ = 0 (regardless of the presence of a random intercep term) for instance in the context of linear mixed effects models [30, 46, 47] (page 1492 eq 2, page 4100 eq 4, and page 309 eq 3.2 respectively), multivariate measurement error models [48] (page 35, Eq 4.11) and non linear mixed effect models [49] (page 7 eq 10), but no argument was given to support this choice.

In the very common situation where the mixed model includes a random intercept term, prior information on the shape of the link function and/or the random intercept is required to place a meaningful restriction on the parameter space by setting for instance δϵ = 0 or δ0 = 0 or δϵ = δ0. In the absence of such information, we advocate to consider the restriction δ0 = 0 because the success probability of a response may exibit skewness, irrespective of the presence of random effects. This restriction thus allows to recover a fixed effects skew generalized t-link model when no random effect is considered. Overall, the restriction δ0 = 0 simply expresses the unability of the SGTLM to capture any additional skewness structure from the data through the inclusion of only random intercept. For completeness, we develop in the next section an estimation procedure for the full model in Eq (27), since the introduction of any equality restriction on δϵ and δ0 can be straightforwardly reduced to δ0 = 0.

The two restrictions discussed in this section are related to the structure of the quantity Δi and are required only for some specific data structures (models including a binary covariate or models with a random intercept only). However, even when a restriction is required on δε or the first element of δ, the quantity Δi itself can remain unrestricted. When a restriction is required, it forces the skewness from a data to be summarized either by δε or elements of δ. Overall, the SGT-link function is always allowed to be skewed (unconditional link). But some designs do not allow to distinguish skewness in the conditional link function from skewness in the distribution of random effects.

Maximum likelihood inference

Estimation via the EM algorithm

The choice of the value of υ0 in Eq (29) is in line with one of our purposes: rescale fixed effects so that they have the same interpretation as in the mixed probit model. There is no need to define υ0 depending on situations, because in our proposal, υ0 is fixed. However, since υ0 is simply a scaling factor, it may be given any positive value during estimation, as long as the estimates are rescaled after the convergence of the estimation procedure so that υ0 is finally given by Eq (29). Indeed, because routines are basically written for the skew t distributions, we used υ0 = 1 in the EM algorithm and rescaled the estimates at the end of the procedure. Let us consider the complete data ycomi={yi,zi,bi,ui,vi}. Because yi only retains the signs of elements of zi, the joint density of yi and zi is fYZ(yi,zi)=fZ(zi)×IAi(zi) where Ai=Ai1××Aini with Aij=(,0] if yij = 0 and Aij=(0,) if yij = 1. The density of ycomi is thus fycom(ycomi)=fZ,b,U,V(zi,bi,ui,vi)×IAi(zi). Hence by Bayes’s rule and in light of Eq (27) with υ0 = 1, the density of ycomi is:

where fV(vi) = 2ϕ(vi) I(0,∞)(vi) and fG(⋅|ν/2, ν/2) is given in Eq (9). By Eq (37) and on setting N=i=1nni and C=N+n(q+1)2log(2π)+nlog2, the complete data log-likelihood is:
where γi=ziWibiδε(viui1/2cU˜1)Jni, and tr(⋅) is the trace operator. Let θ^(k) the estimate of θ at the kth EM iteration. The E-step of the (k + 1)th iteration finds the expectation Q(·|θ^(k)) of (⋅|ycom) given the observed data y and the current parameter estimate θ^(k)=(β^(k),δ^ε(k),δ^(k),vech(D¯^(k))):
where u2γ^i(k)=u2z^i(k)Wiu2b^i(k)δε(vu^i(k)cU˜1u2^i(k))Jni and
so that we have
Note that the conditional expectation of logIAi(zi) is 0 since given yi, IAi(zi|yi)=1. The E-step thus reduces to the computation of the conditional expectations u2b^i(k)=E{Uibi,θ^(k)}, vub^i(k)=E{ViUi1/2bi|yi,θ^(k)}, u2bz^i(k)=E{UibiZi|yi,θ^(k)}, u2b2^i(k)=E{Uibibi|yi,θ^(k)}, vu^i(k)=E{ViUi1/2|yi,θ^(k)}, v2^i(k)=E{Vi2|yi,θ^(k)}, u2^i(k)=E{Ui|yi,θ^(k)}, u2z^i(k)=E{UiZi|yi,θ^(k)}, vuz^i(k)=E{ViUi1/2Zi|yi,θ^(k)}, u2z2^i(k)=E{UiZiZi|yi,θ^(k)} and logu2^i(k)=E{logUi|yi,θ^(k)}. The expressions for these expectations (except logu2^i(k)) are given in the following result where we have dropped the supraindex (k) for simplicity.

Proposition 4 (see S7 Appendix for a proof). Consider the random variables Yij, Zi, bi, Ui, and Vi as defined in Eq (27) with υ0 = 1, and an update θ^=(β^,δ^ε,δ^,vech(D¯^)) of the model parameter θ. Let Ω¯^i=Ini+WiD¯^Wi, r^i=D¯^WiΩ¯^i1, Λ^i=(Iqr^iWi)D¯^, Δ^i=δ^εJni+Wiδ^, s^i=δ^r^iΔ^i, μ^i=Xiβ^cU˜1Δ^i, M^i=(1+Δ^iΩ¯^i1Δ^i)1/2, Ω^i=Ω¯^i+Δ^iΔ^i, λ^i=M^i1Ω^i1/2Δ^i, Z0i=Ω^i1/2(Ziμ^i). Then:

where u2α^i=M^iΔ^iΩ¯^i1(u2z^iu2^iμ^i), τα^i=M^iΔ^iΩ¯^i1(τz^iτ^iμ^i), u2αz^i=M^i(u2z2^iu2z^iμ^i)Ω¯^i1Δ^i, u2α2^i=M^i2[u2^i(Δ^iΩ¯^i1μ^i)2+Δ^iΩ¯^i1(u2z2^i2u2z^iμ^i)Ω¯^i1Δ^i], and the expectations u2z^i=E{UiZi|yi,θ^}, u2z2^i=E{UiZiZi|yi,θ^}, τ^i=E{Ui1/2ζ1(Ui1/2λZ0i)|yi,θ^}, τz^i=E{Ui1/2ζ1(Ui1/2λZ0i)Zi|yi,θ^} and u2^i=E{Ui|yi,θ^} are to be evaluated directly using Corollary 1 applied to the conditional latent vector Zi|Yi=yi,θ=θ^TSTni(μ^i,Ω^i,λ^i,ν^,Ai), Ai=Ai1××Aini with Aij=(,0] if yij = 0 and Aij=(0,) if yij = 1.

The M-step jointly maximizes Q(·|θ^(k)) over θ. This yields the following updating expressions for θ.

Proposition 5 (see S8 Appendix for a proof) Consider an identifiable SGTLM as defined in Eq (27) with υ0 = 1; and an estimate θ^(k) of θ. Set S1^i(k)=u2z^i(k)Wiu2b^i(k), S2^i(k)=vu^i(k)cU˜1u2^i(k), S3^i(k)=v2^i(k)2cU˜1vu^i(k)+c2U˜12(k)u2^i(k), S4^i(k)=vuz^i(k)Wivub^i(k), T^1(k)=[i=1nu2^i(k)XiXi]1 and T^2(k)=i=1nS2^i(k)JniXi. At EM iteration (k + 1), the updates of β, δε, δ and D¯ are given by:

At convergence of the EM algorithm, we obtain the estimate θ˜(ν) of θ. The corresponding estimate of the variance-covariance matrix of random effects is Σ^b=U˜2D¯^+(U˜2c2U˜12)δ^δ^. More generally, when ν is not actually known, the M-step of the EM algorithm can be extended to include a profiled marginal log-likelihood maximization step. Indeed, at EM iteration k, we notice that the estimate θ˜(k)(ν) of θ depends on ν only through U˜1 and the profiled marginal log-likelihood Lν(⋅) for ν can be obtained by simply substituting θ˜(k)(ν) for θ in Eq (36). We can thus find ν^(k+1) using a one dimensional optimization routine (e.g. optimize in R) to maximize Lν(⋅). Then, the update of the parameter θ becomes θ^(k+1)=θ˜(k)(ν^(k+1)). The use of the profiled marginal log-likelihood instead of a profiled version of Q(·|θ^(k)) can provide substantial gain of efficiency [50] and mostly helps bypass the calculation of logu2^i(k) which does not have any known closed form.

It is worthwhile noticing however that the inclusion of a profiled marginal log-likelihood maximization step would prevent the convergence of the whole estimation procedure if the marginal log-likelihood in Eq (36) as a function of ν is unbounded for a particular dataset. This issue especially when ν is close to zero. Another challenge associated to the estimation of ν is time. The use of this strategy requires a very fast routine to compute cumulative probabilities of the skew t distributions. As an alternative route to estimate ν, we point out the model selection approach of [51] (page 893). It consists in setting a grid of feasible values of ν and obtaining a sequence of estimates θ˜(ν) of θ. Then, the couple ν and θ˜(ν) maximizing the marginal log-like-lihood in Eq (36) is taken as the estimates of ν and θ.

Accelerating EM via parameter-expansion

Besides its attractiveness and stability for handling incomplete data models, the EM algorithm sometimes experiences slow convergence, which has motivated many methods to accelerate its linear convergence speed. Among popular EM accelerators, the so-called parameter-expanded (PX) EM algorithm was proposed by [27] to speed up convergence. Let us consider a complete data model F(ycom|θ). The PX-EM algorithm expands F(ycom|θ) to a larger model FX(ycom|Θ) parameterized by Θ=(θ,α) where θ plays in FX(ycom|Θ) the role of θ in F(ycom|θ) and α is a working parameter. The use of the PX-EM algorithm requires that (1) α admits a value α0 that preserves the original complete data model and (ii) the observed-data model is preserved by a many-to-one reduction function R: Θ ↦ θ = R(Θ) which allows an unambigious recovering of θ from Θ. We refer to [27] for more details. For the SGTLM, let us consider the following expanded complete data model obtained by including a working q × q scale matrix α into the linear predictor as ηi = Xi β + Wi α bi:

This expanded model equals the STGLM in Eq (27) when α takes the value α0 = Iq, and has expanded parameter Θ=(β,δε,δ,vech(D¯),vec(α)) where vec is the usual operator which stacks the columns of its matrix argument. Under this model, the marginal distribution of Yi remains as given in Eq (32) with Ω¯i=Ini+WiαD¯αWi and Δi=δεJni+Wiαδ so that Θ reduces as θ=(β,δε,δα,vech(αD¯α)). As the observed-data model is preserved whatever the value of α, we fix α = Iq at each E-step of the EM procedure. Therefore, the E-step of the PX-EM algorithm uses Proposition 4 to obtain conditional expectations required in Eq (39) as for the classical EM algorithm. At the M-step, the estimates of δ and D¯ are still given by Eqs (49)–(50) respectively whereas the estimates of δε, β and vec(α) are:
where ξ(k)=(i=1nS2^i(k)XiJnii=1n(WiJni)(vub^i(k)cU˜1u2b^i(k))), ϒ(k)=(i=1nXiu2z^i(k)vec(i=1nWiu2bz^i(k))), and Ψ(k)=(i=1nXiXiu2^i(k)i=1nXi(u2b^i(k)Wi)i=1n(u2b^i(k)Wi)Xii=1nu2b2^i(k)(WiWi))1 with ⊗ the direct product operator. Using the reduction function, the original model parameter estimates can be recovered as β^(k+1)=β^(k+1), δ^ε(k+1)=δ^ε(k+1), δ^(k+1)=α^(k+1)δ^(k+1) and D¯^(k+1)=α^(k+1)D¯^(k+1)α^(k+1). In the neighbourhood of the ML estimate of θ, the working scale estimate α^ becomes close to α0 = Iq [27] so that the advantage of the PX-EM algorithm over the classical EM algorithm disappears. We thus propose to stop the PX acceleration once |λmax| < ϵ with λmax the dominant eigen value of α^(k+1)Iq and ϵ a pre-specified tolerance value (e.g. ϵ = 10−2).

Summary of the estimation procedure

The estimation procedure starts with a parameter θ^(0), k = 0 and iterates the following six steps until convergence.

    E-step: compute conditional expectations defined by Eqs (40)–(46) with θ^(k).

    PX M-step: obtain θ^(k+1) and α^(k+1) using Eqs (49)–(52) and the reduction function: β^(k+1)=β^(k+1), δ^ε(k+1)=δ^ε(k+1), δ^(k+1)=α^(k+1)δ^(k+1) and D¯^(k+1)=α^(k+1)D¯^(k+1)α^(k+1).

    Test: compute λmax the dominant eigen value of α^(k+1)Iq. If |λmax| < 10−2 then compute the marginal likelihood (θ^(k+1)|y) using Eq (36) and go to 4) with k = k + 1, otherwise return to 1) with k = k + 1.

    E-step: compute conditional expectations defined by Eqs (40)–(46) with θ^(k).

    M-step: obtain θ^(k+1) using Eqs (47)–(50).

    Test: compute the marginal likelihood (θ^(k+1)|y) using Eqs (36). If |(θ^(k+1)|y)(θ^(k)|y)|/|(θ^(k)|y)|<106 then go to 7), otherwise return to 4) with k = k + 1.

    Rescaling: compute υ0 using (29) and rescale the estimates as β^υ0β^(k), δ^υ0δ^(k) and D¯^υ02D¯^(k). Return θ^=(β^,δ^ϵ,δ^,vech(D¯^)).

Approximate standard errors

With a view to allow asymptotic inference in SGTLM, we follow the empirical information-based method of [52] (pages 132-133) to compute the asymptotic variance-covariance matrix of the ML estimate θ^ of θ under some general regularity conditions. The observed information matrix is defined to be Io(θ^|y)=i=1ng^ig^i where g^i=gi(θ^), gi(θ)=Qi(θ|θ^)θ, Qi(θ|θ^) being the contribution of the single observation yi to the expected complete data log-likelihood in Eq (39). On setting vub¯^i=vub^icU˜1u2b^i, the elements Qi(θ|θ^)β,Qi(θ|θ^)δε,Qi(θ|θ^)δ,vech(Qi(θ|θ^)D¯) of the score gi can be explicitly evaluated using:

Afterwards, the standard errors of estimated model parameters are approximated by square roots of diagonal elements of [Io(θ^|y)]1 and confidence intervals can be built assuming asymptotic normality.

Empirical Bayes estimators of random effects and weights

In this section, we provide the empirical Bayes estimators of cluster specific random effects and weights that are useful for evaluating individual intercepts and slopes as well as detecting outlying individuals. From Eq (27), the distribution of bi conditional on Zi = zi, Ui = ui and Vi = vi is multivariate normal with mean ri(ziXiβ)+(viui1/2cU˜1)si and covariance matrix ui1Λi where ri=D¯WiΩ¯i1, si = δri Δi, Δi=υ0δεJni+Wiδ, Ω¯i=υ02Ini+WiD¯Wi, and Λi=(IqriWi)D¯. The conditional mean of bi given Yi = yi is thus:

where vu1¯i=Mi(uα¯i+τ1¯i), uα¯i=MiΔiΩ¯i1(uz¯iu¯iμi), μi=XiβcU˜1Δi and the quantities u¯i=E{Ui1/2|yi,θ}, uz¯i=E{Ui1/2Zi|yi,θ}, τ1¯i=E{Ui1/2ζ1(Ui1/2λZ0i)|yi} and z¯i=E{Zi|yi,θ} are to be evaluated using Corollary 1 applied to Zi|Yi=yiTSTni(μi,Ωi,λi,ν,Ai) with Ωi=Ω¯i+ΔiΔi, λi=(1+ΔiΩ¯i1Δi)1/2Ωi1/2Δi, Ai=Ai1××Aini, Aij=(,0] if yij = 0 and Aij=(0,) if yij = 1. The empirical Bayes estimators of bi can then be obtained as b^i=b¯i(θ^).

For outlying individuals detection, individual weights Ui are predicted by u2¯i=E{Ui|yi,θ} [53] which is given by Eq (16) in Corollary 1 applied to Zi|Yi = yi. The empirical Bayes estimators of Ui are thus given by u2^i=E{Ui|yi,θ^}. Relatively low weights (< 1) are indicative of outlying individuals.

Applications

This section presents a simulation study for assessing performance of SGTLM, and an application of the modeling approach to a real dataset.

Simulation study

We conducted a simulation to evaluate the proposed approach to the analysis of correlated binary data. The simulation experiment targeted four specific objectives. First, it assessed for different sample sizes, the abilities of the probit (PM), the skew-probit (SPM), the generalized t-link (GTLM) and the skew generalized t-link (SGTLM) models to recover population parameters when the common normality assumption for the link function is either violated or not. The widely used logistic model was not investigated as the logistic distribution can be considered as a special case of the Student t distribution [8] hence the logistic model is a special case of GTLM. Second, the experiment evaluated the extent to which asymptotic 95% confidence intervals (CI95%) can detect the presence of spurious skewness. Third, the experiment evaluated the ability of empirical Bayes estimators of random effects to predict true random effects. Finally, the simulation study assessed the ability of Akaike’s information criterion (AIC), Schwarz’s Bayesian information criterion (BIC) and Hannan-Quinn criterion (HQ) to select the correct model fit. All computations were performed in R.

Simulation design

Mimicking the structure of the simulation model studied in [24] (page 1116), we considered the following GLMM:

where ηi = (ηi1, ⋯, ηi6), ηij = β0 + β1 X1i + b0i + b1i W1ij, bi = (b0i, b1i); X1 is a dichotomous covariate (Bernoulli distribution with sucess probability 0.5) and W1 is a continuous occasion-varying random covariate (standard normal distribution); β0 is an intercept, i.e. the general mean of the linear predictor ηij and β1 is the fixed-effects associated to the covariate X1 with values arbitrarily fixed to β0 = 1 and β1 = −1; b0i is a random intercept associated to the cluster i, b1i is the random slope associated to W1i; D¯ is a 2 × 2 scale matrix with diagonal elements 0.5 and 1, and off diagonal element 0.25; FU(ν) is a positive distribution with finite first two negative moments, i.e. U˜t=E{Uit}< for t = 1, 2; and υ0=[U˜2+(U˜2c2U˜12)δε2]1/2. We considered δ = (0, δ1), i.e. a null random intercept skewness to ensure the identifiability of the model.

Under this general class of SSMN latent models, we considered two data models. The first is the probit data model where U = 1, δ1 = 0 and δε = 0 (probit link, υ0 = 1). The second is the skew generalized t-link data model with δε = −2 and δ1 = 2 and UGamma(ν/2,ν/2) with ν = 5 (υ0 = 0.4598). We considered for each data model, sample sizes (n) of 100, 500 and 1000 and thus generated three sets of covariates which were used for all simulations involving each of the two data models. Under each of the six resulting simulation settings, we generated 250 datasets to which we fitted the four fitting models under evaluation (PM, SPM, GTLM, SGTLM), considering the model degrees of freedom as known and equals to ν = 5 for the SGTLM. Fixed effects (β0 and β1) and skewness parameters (δϵ and δ1) were initialized to zero whereas the scale matrix D¯ was initialized to the 2×2 identity matrix.

Performance measures

In addition to estimates of fixed effects (β^k, k = 0, 1) and skewness (δ^l for SPM and SGTLM, l = ϵ, 1) and related empirical standard errors and CI95%, we recorded random effects variances (σ12 of b0i and σ22 of b1i) and covariance (σ12) and their approximate standard errors derived using the delta method [54] as implemented in the R package car [55], empirical Bayes estimates of individual random effetcs (b^i), and the AIC, BIC and HQ criteria defined as: AIC=2^+2Np, BIC=2^+NplogN, and HQ=2^+2Nplog(logN) where ^ is the maximized log-likelihood value, N is the total number of observations and Np is the number of estimated model parameters. These data were used to compute various performance measures (Table 1) including the relative bias (%Bias) and the root mean square error (RMSE) in β^k and σ^l; the standard deviations (SD) of β^k; the quadratic mean (SE¯) of standard errors of β^k; the coverage probabilities (CP¯) of β^k and δ^l, i.e. the proportion of times the CI95% for βk or δl included the true value; the arithmetic mean (R2¯(bl,b^l)) of the square of Pearson’s correlation (coefficient of determination) between simulated and Bayes estimates of subject random effects; and the arithmetic means of information criteria AIC (AIC¯), BIC (BIC¯) and HQ (HQ¯) across the 250 simulated datasets per simulation setting.

Table 1
Measures of the performance of binomial fitting models.
MeasureFormulaRole
alternativesalternativesUnbiasedness of alternatives
alternativesalternativesAccuracy of alternatives
alternativesalternativesUnbiasedness of alternatives
alternativesalternativesAccuracy of alternatives
alternativesalternativesPrecision of alternatives
alternativesalternativesReliability of estimated SE
alternativesalternativesCoverage probability of alternatives
alternativesalternativesCoverage probability of alternatives
alternativesalternativesPredictive power
alternativesalternativesModel selection
alternativesalternativesModel selection
alternativesalternativesModel selection

Notes: SE = standard error, β^¯k=r=1250β^k(r)250; bi(r)=(bi0(r),bi1(r)) is the rth simulated vector of random effects for the subject i and b^i(r) is the empirical Bayes estimates for bi(r); R2(b^l(r),bl(r)) is the square of Pearson’s coefficient of correlation between b^l(r) and bl(r); se(β^k(r)) is the empirical information-based standard error for β^k(r) and CI(k, r) is the 95% confidence interval for βk from the rth generated dataset. The same applies for δl and σl.

Simulation results

Simulation results presented in Tables 2 and 3 show that under the probit data generation mechanism, the probit, the skew-probit, the generalized t (GT)-link and the skew generalized t (SGT)-link models recovered the population parameter values. Indeed, the percentage of bias was below 5% at all levels for fixed effects, whereas for variance components, the percentage of bias was below 20%. We particularly noticed a high relative bias in the variance component (%Bias = 17.33) under probit fit to probit data (assuming the true model) with small sample size (n = 100). This may be explained by the maximum likelihood estimation method which is known for providing biased variance components [56]. Nevertheless, this can be improved by opting for residual maximum likelihood estimation procedure [56]. However, it is worth noticing that the estimation improves as the sample size increases and the empirical standard error estimates agree with the standard deviations from the simulations. The results for n = 100 and n = 500 are consistent with findings in [24] where empirical information based standard errors approached Monte Carlo standard errors. Moreover, the 95% confidence interval for the skewness parameters allows to detect spurious skewness in the skew-probit and the SGT-link models with coverage probabilities of 100%. This result can be explained by the underlined high accuracy of information based standard errors in this type of model. The power in predicting random effects varied from R2 = 0.45 to R2 = 0.47 for random intercepts and from R2 = 0.23 to R2 = 0.28 for random slopes, but was comparable for the three fitting models. Finally, it appears that on average all model selection criteria correctly considered probit fitting model as the parsimonious model.

Table 2
Results based on 250 replications of probit samples: Probit and skew-probit fits.
ParametersMeasuresn = 100n = 500n = 1000
ProbitSkew probitProbitSkew probitProbitSkew probit
β0(−1)Mean-1.04 (0.25)-1.03 (0.30)-1.01 (0.12)-1.04 (0.31)-1.00 (0.08)-1.00 (0.11)
%Bias-3.61-3.01-0.64-4.03-0.32-0.23
alternatives0.25 [0.23]0.27 [0.26]0.11 [0.12]0.29 [0.24]0.08 [0.08]0.09 [0.11]
CP0.960.970.950.950.970.97
β1(1)Mean1.03 (0.24)1.02 (0.41)1.01 (0.10)1.00 (0.41)1.00 (0.07)1.00 (0.42)
%Bias3.211.960.590.460.130.16
alternatives0.23 [0.22]0.43 [0.43]0.1 [0.10]0.48 [0.47]0.07 [0.07]0.08 [0.07]
CP0.950.960.970.970.970.97
alternativesMean0.54 (0.24)0.51 (0.29)0.51 (0.09)0.53 (0.29)0.50 (0.06)0.51 (0.07)
%Bias7.982.241.392.350.121.06
σ12(0.25)Mean0.30 (0.34)0.23 (0.48)0.25 (0.13)0.22 (0.48)0.25 (0.09)0.24 (0.49)
%Bias18.48-6.521.84-12.021.55-2.02
alternativesMean1.17 (0.91)1.10 (0.95)1.02 (0.34)1.11 (0.94)1.02 (0.22)1.00 (0.47)
%Bias17.339.942.4210.981.960.99
δε(0)Mean-0.02 (0.65)-0.02 (0.46)-0.00 (0.46)
alternatives-0.61 [0.62]-0.71 [0.51]-0.61 [0.46]
CP-1.00-1.00-1.00
δ1(0)Mean--0.02 (0.37)--0.02 (0.38)--0.00 (0.25)
alternatives-2.00 [1.84]-2.00 [1.72]-2.00 [0.22]
CP-1.00-1.00-1.00
alternativesMean0.450.460.460.460.460.46
alternativesMean0.230.240.260.260.270.28
AICMean548.29552.143732.113733.807522.507524.04
BICMean570.27582.923762.143775.857556.007570.94
HQMean556.85564.123742.913748.927534.137540.33

Mean value and root mean square error (in parentheses), percentage of bias (%Bias), mean standard error (SE) and sample standard deviation (in square brackets), coverage probability (CP) of estimates, coefficient of determination (R2) and averages of AIC, BIC and HQ criteria for sample sizes (n) of 100, 500 and 1000.

Notes: In the first column (Parameters), model parameters are followed in parentheses by their respective true values;—means “not applicable”; the coverage probability (CP) is the probability that an approximate confidence interval (assuming asymptotic normality for the model parameter) contains the true parameter value.

Table 3
Results based on 250 replications of probit samples: Generalized t (GT)-link and skew generalized t (SGT)-link fits.
ParametersMeasuresn = 100n = 500n = 1000
GT-linkSGT-linkGT-linkSGT-linkGT-linkSGT-link
β0(−1)Mean-1.05 (0.26)-1.05 (0.24)-1.01 (0.11)-1.03 (0.11)-.1.00 (0.10)-1.00 (0.08)
%Bias-4.95-4.61-1.06-2.94-0.17-0.17
alternatives0.25 [0.24]0.27 [0.25]0.15 [0.12]0.11 [0.12]0.08 [0.08]0.06 [0.05]
CP0.980.970.970.980.970.97
β1(1)Mean1.03 (0.32)1.03 (0.23)1.01 (0.14)1.03 (0.10)1.00 (0.12)1.00 (0.07)
%Bias3.163.281.622.900.080.13
alternatives0.23 [0.20]0.25 [0.22]0.13 [0.13]0.1 [0.10]0.07 [0.07]0.07 [0.07]
CP0.970.960.980.970.980.98
alternativesMean0.47 (0.12)0.47 (0.10)0.49 (0.11)0.50 (0.09)0.50 (0.05)0.50 (0.03)
%Bias-5.36-5.57-1.760.010.170.12
σ12(0.25)Mean0.26 (0.53)0.24 (0.40)0.24 (0.22)0.23 (0.02)0.25 (0.19)0.25 (0.02)
%Bias4.00-4.82-3.96-6.26-1.681.55
alternativesMean1.09 (.62)0.99 (0.49)1.00 (0.18)0.99 (0.12)0.99 (0.11)0.99 (0.12)
%Bias8.98-1.12-0.78-1.13-0.87-0.91
δε(0)Mean-0.00 (0.56)-0.01 (0.54)-0.00 (0.41)
alternatives-0.44 [0.56]-0.46 [0.56]-0.48 [0.44]
CP-1.00-1.00-1.00
δ1(0)Mean-0.00 (0.18)-0.00 [0.14]-0.00 [0.14]
alternatives-1.22 [0.18]-0.99 [0.16]-0.62 [0.17]
CP-1.00-1.00-1.00
alternativesMean0.450.460.470.460.470.47
alternativesMean0.260.270.270.270.270.28
AICMean552.09554.373733.103736.217523.127525.21
BICMean581.43589.553762.433784.267568.957573.26
HQMean561.97568.063744.113764.907538.917542.49

Mean value and root mean square error (in parentheses), percentage of bias (%Bias), mean standard error (SE) and sample standard deviation (in square brackets), coverage probability (CP) of estimates, coefficient of determination (R2) and averages of AIC, BIC and HQ criteria for sample sizes (n) of 100, 500 and 1000.

Notes: In the first column (Parameters), model parameters are followed in parentheses by their respective true values;—means “not applicable”; the coverage probability (CP) is the probability that an approximate confidence interval (assuming asymptotic normality for the model parameter) contains the true parameter value.

Under a SGT-link data generation mechanism, the probit model performed poorly, showing large relative fixed effects bias values which decreased from 46% for samples of size n = 100 to 18% for samples of size n = 1000 (Table 4). The SGT-link model estimates were the less biased (%Bias < 12) as well as the most accurate with the lowest root mean square errors across all levels (Table 5). The same observations apply to variance components which were highly biased downward for probit, skew-probit and GT-link models (%Bias up to 90) relative to the SGT-link model (%Bias < 7). Regarding estimates of skewness parameters, the coverage probability was low (53% to 94%) for small sample size (n = 100) and approached nominal (95%) value for larger sample sizes (n = 500, 1000). The skew-probit model estimates (coverage probability down to 53%) were less reliable than estimates from the SGT-link model (coverage probability above 90%).

Table 4
Results based on 250 replications of skew generalized t-link samples (probit and skew-probit fits).
ParametersMeasuresn = 100n = 500n = 1000
ProbitSkew probitProbitSkew probitProbitSkew probit
β0(−1)Mean-1.46 (1.53)-1.33 (1.30)-1.28 (1.29)-1.28 (1.31)-1.18 (1.27)-1.22 (0.92)
%Bias-46.18-33.41-28.30-28.13-18.27-22.08
alternatives0.95 [0.99]0.89 [0.86]0.96 [1.02]0.89 [0.94]0.97 [0.99]0.92 [0.89]
CP0.930.950.950.960.960.96
β1(1)Mean0.95 (0.46)0.97 (0.47)0.95 (0.41)0.98 (0.42)0.95 (0.31)0.98 (0.37)
%Bias-5.11-3.31-5.18-2.11-5.21-2.11
alternatives0.51 [0.48]0.44 [0.48]0.26 [0.39]0.41 [0.43]0.31 [0.29]0.40 [0.41]
CP0.950.960.970.970.970.97
alternativesMean0.34 (0.94)0.61 (0.39)0.41 (0.91)0.67 (0.38)0.56 (0.76)0.73 (0.34)
%Bias-59.04-26.51-50.60-19.28-32.53-12.05
σ12(0.42)Mean0.80 (0.64)0.53 (0.58)0.81 (0.53)0.42 (0.38)0.75 (0.50)0.40 (0.39)
%Bias90.4826.1992.860.0978.57-4.76
alternativesMean2.07 (1.19)2.85 (1.08)2.22 (1.44)2.97 (1.99)2.20 (1.32)2.88 (1.70)
%Bias-56.24-39.75-53.07-37.21-53.49-39.11
δε(−2)Mean--0.48 (1.75)--0.91 (1.25)--0.90 (1.04)
%Bias-76.01-54.50-54.99
alternatives-0.60 [0.72]-1.23 [1.22]-1.03 [1.04]
CP-0.87-0.91-0.95
δ1(2)Mean-1.03 (0.97)-1.13 (0.97)-1.16 (0.92)
%Bias--48.51--43.49--42.00
alternatives-1.01 [1.00]-1.04 [1.01]-0.99 [0.96]
CP-0.53-0.61-0.66
alternativesMean0.350.380.380.420.380.46
alternativesMean0.140.260.160.310.260.33
AICMean578.22576.343824.313814.727662.077713.55
BICMean600.20607.123854.343856.777695.567681.25
HQMean586.78588.323835.113829.847673.707678.562

Mean value and root mean square error (in parentheses), percentage of bias (%Bias), mean standard error (SE) and sample standard deviation (in square brackets), coverage probability (CP) of estimates, coefficient of determination (R2) and averages of AIC, BIC and HQ criteria for sample sizes (n) of 100, 500 and 1000.

Notes: In the first column (Parameters), model parameters are followed by their respective true values in parentheses;—means “not applicable”; the coverage probability (CP) is the probability that an approximate confidence interval (assuming asymptotic normality for the model parameter) contains the true parameter value.

Table 5
Results based on 250 replications of skew generalized t-link samples (generalized t-link and skew generalized t-link fits).
ParametersMeasuresn = 100n = 500n = 1000
GT-linkSGT-linkGT-linkSGT-linkGT-linkSGT-link
β0(−1)Mean-1.40 (1.54)-1.17 (1.31)-1.25 (1.22)-1.11 (1.09)-1.11-1.02 (0.09)
%Bias-39.17-17.33-24.56-11.411.41-2.09
alternatives0.96 [0.89]0.64 [0.55]0.83 [0.92]0.60 [0.51]0.89 [0.86]0.40[0.37]
CP0.930.960.950.980.960.98
β1(1)Mean0.98 (0.37)1.01 (0.33)0.99 (0.36)1.01 (0.16)0.98 (0.22)1.00 (0.09)
%Bias-2.191.10-1.141.23-1.980.37
alternatives0.32 [0.29]0.28 [0.32]0.23 [0.28]0.11 [0.16]0.28 [0.21]0.09 [0.10]
CP0.950.960.970.970.970.98
alternativesMean0.37 (0.55)0.86 (0.16)0.40 (0.56)0.84 (0.09)0.53 (0.63)0.84 (0.07)
%Bias-55.453.61-51.801.20-36.141.20
σ12(0.42)Mean0.68 (0.61)0.36 (0.31)0.71 (0.56)0.38 (0.23)0.66 (0.56)0.38 (0.09)
%Bias61.91-14.2969.05-9.5257.14-9.52
alternativesMean2.29 (1.23)4.56 (1.51)2.33 (1.17)4.55 (1.29)2.34 (1.11)4.58 (1.31)
%Bias51.59-3.59-50.74-3.81-50.53-3.17
δε(−2)Mean--2.18 (1.06)--2.12 (1.04)--2.06 (0.99)
%Bias--9.00--6.02--3.04
alternatives-1.44 [1.04]-1.04 [1.05]-0.88 [0.98]
CP-0.94-0.96-0.97
δ1(2)Mean-1.64 (1.18)-1.69 [1.16]-1.84 [1.04]
%Bias--18.13--15.50--8.10
alternatives-1.13 [1.18]-1.16 [1.17]-1.02 [1.03]
CP-.90-0.94-0.96
alternativesMean0.340.530.380.560.400.58
alternativesMean0.220.490.260.520.260.52
AICMean578.04524.063820.103804.127657.137652.56
BICMean601.18597.243853.793852.177689.127671.17
HQMean588.01575.753834.303821.407667.667542.49

Mean value and root mean square error (in parentheses), percentage of bias (%Bias), mean standard error (SE) and sample standard deviation (in square brackets), coverage probability (CP) of estimates, coefficient of determination (R2) and averages of AIC, BIC and HQ criteria for sample sizes (n) of 100, 500 and 1000.

Notes: In the first column (Parameters), model parameters are followed by their respective true values in parentheses;—means “not applicable”; the coverage probability (CP) is the probability that an approximate confidence interval (assuming asymptotic normality for the model parameter) contains the true parameter value.

Clearly, the SGT-link model adjusted better with non normal data and accordingly, random effects prediction is better with SGT-link model (R2 ≥ 0.49) than with the probit or the skew-probit model. Moreover, all the considered model selection criteria namely AIC, BIC and HQ on average correctly selected the SGT-link model as the preferred model.

Application to the respiratory infection data

To demonstrate the usefulness of the proposed approach to correlated binary data modeling, we revisited the respiratory illness data (available in geepack package [57] in R) which was used by [24] to illustrate their t-link GLMM. The respiratory illness data was obtained from a clinical study of the effect of a treatment on 111 patients with respiratory illness, recruited from two different clinical centers. The patients were examined and their respiratory state (categorized as 1 = good, 0 = poor) determined (baseline). They were then randomized to receive either placebo or an active treatment. The goal of the study was to determine whether the treatment induced a better respiratory state in treated patients. The outcome is the respiratory state measured at four visits for each patient as good (y = 1) or poor (y = 0). In addition to the treatment (treat = 0 for placebo group (P) and treat = 1 for treated group (A)), the following fixed covariates were included: the clinical center (center = 0 for the first center and center = 1 for the second center), the baseline (respiratory state at the first visit), gender (sex = 0 for female (F) and sex = 1 for male (M)) and the interaction of treatment and gender. Following [24], we assumed that the age effect is patient-specific (random slope) and thus considered the patient age centered around its median (31 years) as a random covariate. Since the fixed covariates included binary variables (treat, gender, center and baseline), a conditional skew-probit model is not identifiable given random effects and we thus set δϵ = 0 to ensure identifiability.

For the purpose of comparison, we fitted the probit, skew-probit, GT-link and SGT-link models. We initialized fixed effects β and the random slope skewness parameter δ to zero whereas the random slope scale was initialized to one. For the GT-link and the SGT-link models, we considered the model selection approach of [51] with degrees of freedom ν = 2.5, 2.6, …, 15.

As depicted in Fig 1, the profiled marginal log-likelihood for the GT-link model is unbounded, with smaller ν corresponding to better fit in accordance with the t-link model fits in [24] (ν ≤ 4). We thus set ν = 2.5 for the t-link model. For the SGT-link fit, Fig 1 indicates that the log-likelihood is bounded with a maximum at ν = 3.7, suggesting heavy tail link function and random slope distributions. The difference in the behaviours of the GT-link and SGT-link models may be explained by the implication of ν in the location of the skew t-link model through U˜1 (see Eq (32)).

Fitting the generalized t-link and the skew generalized t models to the respiratory infection data: Plot of the marginal log-likelihood profiled for the degrees of freedom ν.
Fig 1

Fitting the generalized t-link and the skew generalized t models to the respiratory infection data: Plot of the marginal log-likelihood profiled for the degrees of freedom ν.

The maximum likelihood (ML) estimates under the probit, skew-probit, GT-link and SGT-link models (Table 6) are somewhat close for the four fitted models which all show that respiratory illness is associated to clinical center, baseline state and treatment, with the treatment effect varying with gender. The SGT-link fit additionally indicates that, irrespective of the treatment, the respiratory state is poorer for male patients (Table 6, β^3<0) than females. We notice for this dataset, that the intercept coefficient estimate increases with model complexity and estimates of fixed effects and their respective standard errors are shrunk toward zero for the skew-probit model relative to the probit one, and for the skew-probit model relative to the SGT-link one. The skew-probit model fit also gave a higher skewness (δ^=0.0411) as compared with the SGT-link model fit (δ^=0.0262). Although the estimated skewness is relatively low for both skew-probit and SGT-link models, the use of a skewed and heavy tail link clearly improved, not only the precision of estimates but also the adequacy between data and model. Indeed, the asymptotic 95% confidence interval for δ under the SGT-link includes zero (CI95% = [−0.0010, 0.0534]), but we noticed from the simulation results that asymptotic CI95% for skewness parameters becomes reliable only in large samples (n ≥ 500), whereas information criteria are reliable for all tested sample sizes. Thus, based on the AIC, BIC and HQ criteria in Table 6, the SGT-link fit is the best for the respiratory illness data. The estimate of the variance of the random slope of age is σ^2=0.0118 for the SGT-link fit, with close values under probit and skew-probit models. From the SGT-link fit, it appears that the treatment induced an overall better respiratory state for treated patients (with a negative coefficient β4 = −2.0398 for the placebo group). Moreover, the treatment has on average a better effect on female patients than on male patients (with a positive coefficient, β5 = 1.3425 for male patients in the placebo group). However, as noted by [24], new studies are required to check this latter trend because of the highly unbalanced proportion of males (79%) and females (21%) in the data.

Table 6
Maximum likelihood fits of probit, skew-probit, Generalized T (GT)-link and Skew Generalized T (SGT) -link models to the respiratory infection data.
Parameter (variable)ProbitSkew probitGT-link (ν = 2.5)SGT-link (ν = 3.7)
estimatesez valueestimatesez valueestimatesez valueestimatesez value
β0 (Intercept)0.34290.49940.68660.45270.50120.90310.35700.54910.65020.53830.46591.1555
β1 (center = 2)0.62920.26392.38460.68900.26112.63910.67670.21012.94100.59930.23952.5026
β2 (baseline)1.66890.29005.75441.63120.28375.74991.81000.41184.39521.51520.25845.8627
β3 (sex = M)-0.82280.5218-1.5768-1.01700.5293-1.9216-0.88330.5274-1.6748-1.01290.4944-2.0489
β4 (treat = P)-2.11980.6018-3.5224-2.09800.5958-3.5215-2.26340.7910-2.8615-2.03980.5366-3.8010
β5 (M×P)1.38400.64612.14191.40550.64032.19501.32510.63002.10331.34250.58672.2884
δ (age)---0.04110.02161.9011---0.02620.01391.8906
σ 2 (age)0.01170.0050-0.01130.0049-0.01430.0061-0.01180.0059-
AIC455.6305--453.7766--453.2140--451.4658--
BIC484.3013--486.5432--485.9399--484.2324--
HQ466.9370--466.6983--465.7514--464.3875--

Notes: M = male patient; P = placebo; M×P is a shortcut for sex = M×treat = P; σ2 and δ are the variance and the skewness parameter respectively for the patient-specific slope of median centered age (year); se = standard error; z value = estimate/se (a z value ≥1.96 roughly indicates 5% significance assuming asymptotic normality of z values);—means “not applicable”

Conclusion

This work has considered the skew generalized t class of distributions for both link and random effects distributions in mixed models for binary data. The objective was to improve the exploitation of binary data bearing oddities such as skewness and tails thicker/thinner than the normal distribution. To allow inference in such models, we developped a maximum likelihood estimation procedure based on the EM algorithm. We combined results from [34] and [37] to obtained expressions for computing moments of truncated multivariate skew t distributions. The computation used existing R functions for the multivariate skew t cumulative distribution function. Our simulation experiment showed that, irrespective of sample size, the SGT-link model outperforms the probit GLMM when the underlying data generation mechanism is not normal. We also demonstrated that the skew generalized-link model performed better than the skew-probit and the generalized t-link GLMMs, when the underlying data is both skewed and heavy tailed.

An important finding is that when the model degrees of freedom ν is small and very large values are assumed (fitting probit and skew-probit models), the estimates of fixed effects are biased, whereas when ν is large but small values are assumed, the estimates of fixed effects are not biased. Moreover, asymptotic inference using information based standard errors proved highest ability accuracy in detecting spurious skewness in large samples (n ≥ 500) and information criteria on average selected the correct model fit for all tested sample sizes (n = 100, 500, 1000). These findings extend results in [24] on t-link GLMM to SGT-link GLMM, asserting that information criteria are reliable for selecting the best model for a particular dataset.

However, the simulation experiments revealed that the EM algorithm has a high computational cost. For instance, in a model with q = 2 random effects, n = 100 clusters and ni = 6 observations per cluster, the mean running time for the SGT-link model fit was 4.76 minutes which is almost 135 times the time required by the probit model fit (2.12 seconds). Our implementation relies on the pmst function of the R package sn [35] to compute the cumulative probabilities of skew t distributions. This function uses the one dimensional routine integral of R on the multivariate normal cumulative distribution function. The use of the EM algorithm for large q values (e.g. q = 10, 15) requires the prior development of a faster routine for the computation of cumulative probabilities of skew t distributions. This will make the EM algorithm scalable for large q + ni. On multicore plateformes, parallel computing can also substantially speed computations up. The expressions provided for computing moments of truncated multivariate skew t distributions is limited to work for models with ν > 2. The use of formulae given in [38] will extend our EM algorithm to very small degrees of freedom (1 < ν ≤ 2).

Binary data related to very rare events often require special treatment and are generally analysed using zero inflated models [58]. The development of a skew generalized t-link model with zero inflation can significantly improve the exploitation of such data. In addition to binary data, GLMMs handle other data types like count, proportional and ordinal outcomes. From the good performance demonstrated in this work and in previous related ones [9, 24], we believe that the simultaneous introduction of flexible links and random effects distributions in GLMM would benefit knowledge extraction from observed data in applied research fields where advances rely on modeling capacity.

Acknowledgements

The authors wish to thank the editor and two referees for their relevant comments and suggestions. They are also grateful to Matthews Lazaro (Kamuzu College of Nursing, Lilongwe, Malawi) for the time he devoted to edit the manuscript for language usage, spelling, and grammar.

References

El-Saeiti IN. Performance of mixed effects for clustered binary data models. In: AIP Conference Proceedings. vol. 1643. AIP; 2015. p. 80–85.

JANelder, RWWedderburn. Generalized linear models. Journal of the Royal Statistical Society: Series A (General). 1972;135(3):370384. 10.2307/2344614

CEMcCulloch. Maximum likelihood variance components estimation for binary data. Journal of the American Statistical Association. 1994;89(425):330335. 10.1080/01621459.1994.10476474

MHChen. Skewed link models for categorical response data. In: Skew-Elliptical Distributions and Their Applications. Chapman and Hall/CRC; 2004. p. 151172.

CEMcCulloch, JMNeuhaus. Misspecifying the shape of a random effects distribution: why getting it wrong may not matter. Statistical science. 2011;28(3):388402.

CCzado, TJSantner. The effect of link misspecification on binary regression inference. Journal of statistical planning and inference. 1992;33(2):213231. 10.1016/0378-3758(92)90069-5

MBStewart. Semi-nonparametric estimation of extended ordered probit models. Stata Journal. 2004;4(1):2739. 10.1177/1536867X0100400102

CLiu. Robit regression: a simple robust alternative to logistic and probit regression. In: AGelman, XLMeng, editors. Applied Bayesian Modeling and Casual Inference from Incomplete-Data Perspectives. England: Wiley London; 2004. p. 227238.

SKim, MHChen, DKDey. Flexible generalized t-link models for binary response data. Biometrika. 2008;95(1):93106. 10.1093/biomet/asm079

10 

CAAbanto-Valle, DKDey. State space mixed models for binary responses with scale mixture of normal distributions links. Computational Statistics & Data Analysis. 2014;71:274287. 10.1016/j.csda.2013.01.009

11 

SBasu, SMukhopadhyay. Binary response regression with normal scale mixture links. BIOSTATISTICS-BASEL-. 2000;5:231242.

12 

JCPinheiro, CLiu, YNWu. Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t distribution. Journal of Computational and Graphical Statistics. 2001;10(2):249276. 10.1198/10618600152628059

13 

MHChen, DKDey, QMShao. A new skewed link model for dichotomous quantal response data. Journal of the American Statistical Association. 1999;94(448):11721186. 10.1080/01621459.1999.10473872

14 

OKomori, SEguchi, SIkeda, HOkamura, MIchinokawa, SNakayama. An asymmetric logistic regression model for ecological data. Methods in Ecology and Evolution. 2016;7(2):249260. 10.1111/2041-210X.12473

15 

AJLemonte, JLBazán. New links for binary regression: an application to coca cultivation in Peru. Test. 2018;27(3):597617. 10.1007/s11749-017-0563-1

16 

AAsgharzadeh, LEsmaeili, SNadarajah, SShih. A generalized skew logistic distribution. REVSTAT–Statistical Journal. 2013;11(3):317338.

17 

JBCarlin, RWolfe, CHBrown, AGelman. A case study on the choice, interpretation and checking of multilevel models for longitudinal binary outcomes. Biostatistics. 2001;2(4):397416. 10.1093/biostatistics/2.4.397

18 

AAgresti, BCaffo, POhman-Strickland. Examples in which misspecification of a random effects distribution reduces efficiency, and possible remedies. Computational Statistics & Data Analysis. 2004;47(3):639653. 10.1016/j.csda.2003.12.009

19 

JChen, DZhang, MDavidian. A Monte Carlo EM algorithm for generalized linear mixed models with flexible random effects distribution. Biostatistics. 2002;3(3):347360. 10.1093/biostatistics/3.3.347

20 

KPNelson, SRLipsitz, GMFitzmaurice, JIbrahim, MParzen, RStrawderman. Use of the probability integral transformation to fit nonlinear mixed-effects models with nonnormal random effects. Journal of Computational and Graphical Statistics. 2006;15(1):3957. 10.1198/106186006X96854

21 

FHosseini, JEidsvik, MMohammadzadeh. Approximate Bayesian inference in spatial GLMM with skew normal latent variables. Computational Statistics & Data Analysis. 2011;55(4):17911806. 10.1016/j.csda.2010.11.011

22 

GBroström, HHolmberg. Generalized linear models with clustered data: Fixed and random effects models. Computational Statistics & Data Analysis. 2011;55(12):31233134. 10.1016/j.csda.2011.06.011

23 

AMGad, RBEl Kholy. Generalized linear mixed models for longitudinal data. International Journal of Probability and Statistics. 2012;1(3):4147. 10.5923/j.ijps.20120103.03

24 

MOPrates, DRCosta, VHLachos. Generalized linear mixed models for correlated binary data with t-link. Statistics and Computing. 2014;24(6):11111123. 10.1007/s11222-013-9423-3

25 

CCSantos, RHLoschi. EM-Type algorithms for heavy-tailed logistic mixed models. Journal of Statistical Computation and Simulation. 2017;87(15):29402961. 10.1080/00949655.2017.1350678

26 

AAzzalini, ACapitanio. Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2003;65(2):367389. 10.1111/1467-9868.00391

27 

CLiu, DBRubin, YNWu. Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika. 1998;85(4):755770. 10.1093/biomet/85.4.755

28 

MDBranco, DKDey. A general class of multivariate skew-elliptical distributions. Journal of Multivariate Analysis. 2001;79(1):99113. 10.1006/jmva.2000.1960

29 

LDVHugo, CRBCabral. Scale Mixtures of Skew-Normal Distributions. In: LDVHugo, CRBCabral, CBZeller, editors. Finite Mixture of Skewed Distributions. Switzerland: Springer International Publishing; 2018. p. 1536.

30 

VHLachos, PGhosh, RBArellano-Valle. Likelihood based inference for skew-normal independent linear mixed models. Statistica Sinica. 2010;20:303322.

31 

Capitanio A. On the canonical form of scale mixtures of skew-normal distributions; 2012. Available from: https://arxiv.org/abs/1207.0797.

32 

GKéri. The Sherman-Morrison formula for the determinant and its application for optimizing quadratic functions on condition sets given by extreme generators. In: FGiannessi, PPardalos, RT, editors. Optimization Theory. Boston: Springer; 2001. p. 119138.

33 

AAhmed, JReshi, KMir. Structural properties of size biased Gamma distribution. IOSR J Mathem. 2013;5:5561. 10.9790/5728-0525561

34 

HJHo, TILin, HYChen, WLWang. Some results on the truncated multivariate t distribution. Journal of Statistical Planning and Inference. 2012;142(1):2540. 10.1016/j.jspi.2011.06.006

35 

Azzalini A. The R package sn: The Skew-Normal and Related Distributions such as the Skew-t (version 1.5-2).; 2018. Available from: http://azzalini.stat.unipd.it/SN.

36 

R Core Team. R: A Language and Environment for Statistical Computing; 2019. Available from: https://www.R-project.org/.

37 

Galarza CE, Matos LA, Lachos VH. Moments of the doubly truncated selection elliptical distributions with emphasis on the unified multivariate skew-t distribution. arXiv preprint arXiv:200714980. 2020.

38 

CEGalarza, TILin, WLWang, VHLachos. On moments of folded and truncated multivariate Student-t distributions based on recurrence relations. Metrika. 2021; p. 126.

39 

APDempster, NMLaird, DBRubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society Series B (methodological). 1977;39(1):122. 10.1111/j.2517-6161.1977.tb01600.x

40 

CFernandez, MFSteel. Multivariate Student-t regression models: Pitfalls and inference. Biometrika. 1999;86(1):153167. 10.1093/biomet/86.1.153

41 

Ada Silva Braga, GMCordeiro, EMOrtega, GOSilva. The Odd Log-Logistic Student t Distribution: Theory and Applications. Journal of Agricultural, Biological and Environmental Statistics. 2017;22(4):615639. 10.1007/s13253-017-0301-x

42 

DLee, SSinha. Identifiability and bias reduction in the skew-probit model for a binary response. Journal of Statistical Computation and Simulation. 2019;89(9):16211648. 10.1080/00949655.2019.1590579

43 

RBArellano-Valle, MGGenton. Fundamental skew distributions. Journal of Multivariate Analysis. 2005;96:93116. 10.1016/j.jmva.2004.10.002

44 

RArellano-Valle, HBolfarine, VLachos. Bayesian inference for skew-normal linear mixed models. Journal of Applied Statistics. 2007;34(6):663682. 10.1080/02664760701236905

45 

RArellano-Valle, HBolfarine, VLachos. Skew-normal linear mixed models. Journal of data science. 2005;3(4):415438.

46 

TILin, JCLee. Estimation and prediction in linear mixed models with skew-normal random effects for longitudinal data. Statistics in medicine. 2008;27(9):14901507. 10.1002/sim.3026

47 

VHLachos, DKDey, VGCancho. Robust linear mixed models with skew-normal independent distributions from a Bayesian perspective. Journal of Statistical Planning and Inference. 2009;139(12):40984110. 10.1016/j.jspi.2009.05.040

48 

VHLachos, FVLabra, PGhosh. Multivariate skew-normal/independent distributions: properties and inference. Pro Mathematica. 2014;28(56):1153.

49 

MAAPereira, CMRusso. Nonlinear mixed-effects models with scale mixture of skew-normal distributions. Journal of Applied Statistics. 2019;46(9):16021620. 10.1080/02664763.2018.1557122

50 

CLiu, DBRubin. The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika. 1994;81(4):633648. 10.1093/biomet/81.4.633

51 

KLLange, RJLittle, JMTaylor. Robust statistical modeling using the t distribution. Journal of the American Statistical Association. 1989;84(408):881896. 10.2307/2290063

52 

IMeilijson. A fast improvement to the EM algorithm on its own terms. Journal of the Royal Statistical Society Series B (Methodological). 1989;51(1):127138. 10.1111/j.2517-6161.1989.tb01754.x

53 

CMeza, FOsorio, RDe la Cruz. Estimation in nonlinear mixed-effects models using heavy-tailed distributions. Statistics and Computing. 2012;22(1):121139. 10.1007/s11222-010-9212-1

54 

CCox. Delta method. Encyclopedia of biostatistics. 2005;2. 10.1002/0470011815.b2a15029

55 

JFox, SWeisberg. An R Companion to Applied Regression. 3rd ed. Thousand Oaks CA: Sage; 2019. Available from: https://socialsciences.mcmaster.ca/jfox/Books/Companion/.

56 

CMeza, FJaffrézic, JLFoulley. Estimation in the probit normal model for binary outcomes using the SAEM algorithm. Computational Statistics & Data Analysis. 2009;53(4):13501360. 10.1016/j.csda.2008.11.024

57 

JYan. geepack: Yet Another Package for Generalized Estimating Equations. R-News. 2002;2/3:1214.

58 

DBHall. Zero-Inflated Poisson and Binomial Regression with random effects: A Case Study. Biometrics. 2000;56(4):10301039. 10.1111/j.0006-341X.2000.01030.x