Nonoverlap proportion and the representation of point-biserial variation

PLoS ONE

Stanley Luck

Competing Interests: The author, Stanley Luck, is a member of Vector Analytics, LLC, which is a science consulting company. This affiliation also does not alter our adherence to PLOS ONE policies on sharing data and materials. There are no competing interests connected with our consulting work at Vector Analytics, LLC. This work is not associated with any patents or commercial products.

https://doi.org/10.1371/journal.pone.0244517, Volume: 15, Issue: 12, Pages: 1-17

Article Type: Research Article

Publisher: Public Library of Science

Table of Contents

1 Introduction
2 Methods
3 Data analysis and results
4 Discussion

Abstract

We consider the problem of constructing a complete set of parameters that account for all of the degrees of freedom for point-biserial variation. We devise an algorithm where sort, as an intrinsic property of both numbers and labels, is used to generate the parameters. Algebraically, point-biserial variation is represented by a Cartesian product of statistical parameters for two sets of $R^{1}$ data, and the difference between mean values (δ) corresponds to the representation of variation in the center of mass coordinates, (δ, μ). The existence of alternative effect size measures is explained by the fact that mathematical considerations alone do not specify a preferred coordinate system for the representation of point-biserial variation. We develop a novel algorithm for estimating the nonoverlap proportion (ρ_pb) of two sets of $R^{1}$ data. ρ_pb is obtained by sorting the labeled $R^{1}$ data and analyzing the induced order in the categorical data using a diagonally symmetric 2 × 2 contingency table. We examine the correspondence between ρ_pb and point-biserial correlation (r_pb) for uniform and normal distributions. We identify the $R^{2}$, $P^{1}$, and $S_{+}^{1}$ representations for Pearson product-moment correlation, Cohen’s d, and r_pb. We compare the performance of r_pb versus ρ_pb and the sample size proportion corrected correlation (r_pbd), confirm that invariance with respect to the sample size proportion is important in the formulation of the effect size, and give an example where three parameters (r_pbd, μ, ρ_pb) are needed to distinguish different forms of point-biserial variation in CART regression tree analysis. We discuss the importance of providing an assessment of cost-benefit trade-offs between relevant system parameters because ‘substantive significance’ is specified by mapping functional or engineering requirements into the effect size coordinates. Distributions and confidence intervals for the statistical parameters are obtained using Monte Carlo methods.

Luck and Hutson: Nonoverlap proportion and the representation of point-biserial variation

1 Introduction

This work began when we noticed that results from classification and regression tree (CART) analyses did not correspond well with statistical associations in genome-wide association studies (GWAS) [1]. Then, we discovered the extensive research literature discussing confounding properties of effect size measures used in our analyses. Statistical components of our bioinformatics system came from open source software packages that are widely used for research. In data analysis, there are two important requirements for obtaining reproducible results. First, statistics methodology is subject to the general physical principle that it is necessary to account for all of the degrees of freedom when studying a quantitative phenomenon. Second, analysis protocols must correct for dependence on data acquisition parameters including unbalanced sample sizes, in order to obtain interpretable results for effect size. Our work on proportional variation and the phi coefficient for 2 × 2 contingency tables was recently published in this journal; we refer to this as Paper1 [2]. There, we demonstrate that odds-ratio or relative risk as standalone effect size measures, do not account for all of the degrees of freedom and are therefore subject to ambiguity. Using matrix factorization for the marginal sums, we identified the four alternative forms of proportional variation which serve as the basis for specifying the effect size. There is also an elementary discussion of projective geometry for fractional variation that might be helpful to the reader. Here, we study similar problems in the formulation of effect size for point-biserial variation and the associated correlation coefficient, r_pb. First, the term ‘point-biserial’ comes from psychology statistics, and we explain its use as a general reference for the two groups data analysis problem. 
The difference between mean values for two sets of $R^{1}$ data, $δ = {\bar{y}}_{A} - {\bar{y}}_{B}$, serves as the basis for specifying effect size for system response to perturbation. Statistically, analysis of δ corresponds to measuring the relation or association between a continuous variable and a binary categorical variable obtained by individually labeling the $R^{1}$ data. The standard procedure is to replace the labels with numeric {0, 1} indicators. The Pearson product moment correlation coefficient (r) calculated from these numeric data is known as the point-biserial correlation coefficient (r_pb) [3]. This connection between r_pb and δ explains our use of the term ‘point-biserial’. It is standard terminology in the effect size literature. We provide a short discussion of the literature which gave us much inspiration, and note that there are several books on effect size methods as well [4, 5]. In their discussion of physical principles in the formulation of effect size, Kelley & Preacher recommend that an effect size should serve as a sample size independent estimate of a system parameter [6]. The existence of alternative effect size measures, and their classification as relationship, group difference, and group overlap is discussed by Huberty [7]. A recently proposed group overlap measure is nonparametric but requires the use of kernel density estimators to produce an approximate representation of the unknown densities [8]. McGrath and Meyer give a nice review of research into the limitations of r_pb, and point out that different measures can “lead to different conclusions about the size or importance” of an effect [3]. Various researchers have already noted that there are two complications that can limit the range of r_pb. The first difficulty arises from the definition of r_pb, requiring the {0, 1} representation to allow the calculation of r.
The {0, 1} representation corresponds to binary groupings of the data, comprising a pair of many-to-one mappings. The latter are incompatible with r as a measure of the degree to which two variables are linearly related [9], which raises questions about the interpretation of r_pb. It has been shown that when the {y_A, y_B} data are obtained by a dichotomy of a normal distribution, r_pb has a maximum value of 0.79 [3, 10]. In contrast, when each $R^{1}$ set corresponds to a normal distribution, r_pb still ranges from −1.0 to 1.0 [11, 12], with the proviso that the extremal values are reached in the limit as |δ| approaches infinity. Secondly, r_pb is subject to confounding from unbalanced sample sizes for the {y_A, y_B} data; in the effect size literature, the sample size proportions are usually referred to as ‘base rates’. Then, variation in the sampling proportions between data sets leads to irreproducibility, which complicates the interpretation of r_pb. The machine learning community has rediscovered the problems associated with unbalanced sample sizes, creating the new term “classification imbalance” [13].

It is accepted practice to report a single effect size such as Cohen’s d as the basis for deciding the outcome of an experiment. However, d is associated with an implicit parameterization that does not account for all of the degrees of freedom for point-biserial variation, which results in ambiguity. Consequently, our objective is to construct a computational framework for a complete parameterization of the variation (v_pb). We use an inductive approach based on connections between r_pb, Cohen’s d, and the mean squared error information gain (IG_MSE). These measures play an important role because of their connections with elementary statistical concepts. We show that Cohen’s d is a perspective function of center of mass coordinates (δ, μ) for the mean value vector $({\bar{y}}_{A}, {\bar{y}}_{B})$. We also identify a novel association measure, ρ_pb, which measures the degree of nonoverlap between two sets of $R^{1}$ data. ρ_pb is calculated directly from the data and is therefore nonparametric because the underlying densities are unspecified. A particular goal is to examine the dependence of r_pb on unbalanced sample sizes because of concerns about the effect on reproducibility. We address other problems as well including the use of Monte Carlo methods to estimate the joint distribution for statistical parameters. As in Paper1, we use CART association graphs to compare the performance of various effect size measures. However, in this work we are particularly interested in the case where the target variable is a quantitative variable, which corresponds to the regression tree implementation (rCART) [14]. We show that ρ_pb and the sample size proportion corrected correlation (r_pbd) serve as effect size measures for rCART while avoiding complications associated with r_pb.
The main novel contributions of this work are as follows: 1) a computational model for generating statistical parameters for point-biserial variation v_pb, which corresponds to the Cartesian product of parameters for two sets of $R^{1}$ data, and identification of the fact that pure mathematics alone is not sufficient to specify a preferred effect size, 2) a sorting algorithm to estimate the nonoverlap proportion, ρ_pb, of two sets of $R^{1}$ data using a diagonally symmetric 2 × 2 contingency table, 3) identification of the $R^{2}$ , $P^{1}$ , and $S_{+}^{1}$ representations for Pearson correlation, 4) demonstration of the equivalence between r_pb and IG_MSE, and 5) demonstration of the importance of adjusting for unbalanced sample sizes in impurity measures in rCART analysis.

2 Methods

The specification of a complete set of parameters for point-biserial variation, v_pb, is a prerequisite for the rigorous formulation of effect size. Then, a measure for effect size is associated with a perspective function of v_pb. We begin with an examination of limitations of r_pb in section 2.1. Then, we use an inductive approach to construct an algebraic framework for point-biserial variation in four sections 2.2–2.5.

2.1 The effect of unbalanced sample sizes on r_pb

The derivation and limitations of r_pb are reviewed by McGrath and Meyer [3]. Two sets, $y_{A} \in R^{N_{A}}$ and $y_{B} \in R^{N_{B}}$, are combined to form a set of paired values, $\{(c_{i}, y_{i}) \mid c_{i} = \text{‘A’} \lor \text{‘B’}, y_{i} \in R^{1}, 1 \leq i \leq N, N = N_{A} + N_{B}\}$, where c_i is a group membership label, and the {(c_i, y_i)} data correspond to the vectors, (c, y). The standard practice is to invoke a numeric {0, 1} representation for c to obtain an indicator vector, $I_{c} \in R^{N}$. Then, application of the Pearson product-moment formula produces the point-biserial correlation coefficient [3]

$r_{pb} = \sqrt{p_{A} p_{B}} d / \sqrt{p_{A} p_{B} d^{2} + 1}$,

where p_A = N_A/(N_A + N_B) and p_B = 1 − p_A are sample size proportions, Cohen’s d is defined as

$d = ({\bar{y}}_{A} - {\bar{y}}_{B}) / S_{p}$,

and the pooled variance is the weighted average of the sample variances,

$S_{p}^{2} = p_{A} S_{A}^{2} + p_{B} S_{B}^{2}$.

Thus, |r_pb| approaches unity as |d| → ∞ [, ] for 0 < p_A < 1. Rearranging the expression for r_pb, we obtain the quadratic relation

$p_{A} p_{B} d^{2} (1 - r_{pb}^{2}) = r_{pb}^{2}$.

For a fixed value of r_pb, there is a range of (d, p_A) values (Fig 1). Alternatively, the variation in (r_pb, p_A) for fixed d becomes a source of irreproducibility in r_pb because p_A can vary between experiments depending on the data acquisition protocol. This ambiguity explains why researchers have expressed concern about the confounding effect of unbalanced sample sizes on r_pb, and effect size in general [, ]. Furthermore, the binomial p_A p_B dependence originates from the covariance

$\mathrm{Cov}(I_{c}, y) = p_{A} p_{B} ({\bar{y}}_{A} - {\bar{y}}_{B})$

and variance, Var(I_c) = p_A p_B. Therefore, the criticism about p_A p_B dependence applies more broadly to the use of the numeric {0, 1} indicator variable. Various researchers have already recommended that the proportions should be equalized, p_A = p_B = 1/2, in the expression for r_pb to give []

$r_{pbd} = d / \sqrt{d^{2} + 4}$.

This ‘attenuation-corrected’ coefficient is denoted as r_c in []. The r_pb and r_pbd curves in Fig 2 provide an illustration of this correction. The one-to-one projective relation between r_pbd and Cohen’s d is discussed in section 2.4, and the application of r_pbd in rCART is discussed in section 2.5.
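The dependence of r_pb on the sample size proportion can be checked numerically. The following Python sketch is illustrative only: it assumes NumPy, codes ‘A’ as 1 in the indicator vector, uses population (ddof = 0) variances in the pooled variance, and takes r_pb = √(p_A p_B) d / √(p_A p_B d² + 1), which follows from Cov(I_c, y) = p_A p_B δ and Var(I_c) = p_A p_B. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def r_pb_from_d(d, p_a):
    """Point-biserial correlation implied by Cohen's d and the proportion p_A."""
    w = np.sqrt(p_a * (1.0 - p_a)) * d
    return w / np.sqrt(w**2 + 1.0)

def r_pbd_from_d(d):
    """Correlation with equalized proportions, p_A = p_B = 1/2."""
    return d / np.sqrt(d**2 + 4.0)

def cohen_d(y_a, y_b):
    """Cohen's d using the proportion-weighted pooled variance (ddof = 0)."""
    p_a = len(y_a) / (len(y_a) + len(y_b))
    s2_p = p_a * np.var(y_a) + (1.0 - p_a) * np.var(y_b)
    return (np.mean(y_a) - np.mean(y_b)) / np.sqrt(s2_p)

# Simulated unbalanced data (sample sizes as in the Fig 2 caption).
rng = np.random.default_rng(0)
y_a = rng.normal(1.0, 1.0, size=15000)
y_b = rng.normal(0.0, 1.0, size=500)

y = np.concatenate([y_a, y_b])
indicator = np.concatenate([np.ones(len(y_a)), np.zeros(len(y_b))])
p_a = len(y_a) / len(y)
d = cohen_d(y_a, y_b)

r_direct = np.corrcoef(indicator, y)[0, 1]   # Pearson r on the {0, 1} coding
r_formula = r_pb_from_d(d, p_a)              # same value from (d, p_A)
r_balanced = r_pbd_from_d(d)                 # proportion-corrected value

print(f"d = {d:.3f}, r_pb = {r_direct:.3f}, r_pbd = {r_balanced:.3f}")
```

With these definitions the direct and formula values of r_pb agree to rounding error, and r_pbd is larger in magnitude than r_pb whenever p_A ≠ 1/2.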

Fig 1

Quadratic dependence of the point-biserial correlation coefficient, r_pb.

For the fixed value r_pb = 0.2, there is a range for Cohen’s d and the sample size proportion, p_A. This ambiguity complicates the interpretation of r_pb as an effect size measure.

Fig 2

Nonoverlap proportion and point-biserial correlation.

Theoretical curves and estimated values for point-biserial correlation, r_pb, nonoverlap proportion, ρ_pb, and sample size adjusted correlation, r_pbd, for simulated data with unequal sample sizes (N_A : N_B = 15000 : 500) and the difference between mean values, ${\bar{y}}_{A} - {\bar{y}}_{B}$ . Compared to r_pbd, r_pb is attenuated due to the confounding effect of the binomial sampling factor. A: Uniform unit width $(σ = 1 / \sqrt{12})$ distributions. B: Standard normal (σ = 1) distributions.

2.2 Statistical parameters for point-biserial variation

In this section, we consider the question of how to generate a set of parameters for statistical variation in point-biserial data. The fact that r_pb is subject to confounding effects suggests that replacing categorical labels with {0, 1} numeric values is an improper procedure, because the labels acquire arithmetic properties in an ad-hoc way. Instead, we propose a new framework where sort is used as an intrinsic property of both numbers and labels. Suppose there is a machine which generates numbers with labels, (c_i, y_i), in no particular order, placing them in a data table to produce a point-biserial data set. Then, the table can be sorted using either c or y, to obtain orderings denoted as y_c and c_y, respectively. As we discuss next, these orderings are associated with statistical parameters, v_c and v_y, respectively. However, there is no rule that specifies which parameterization, v_c or v_y, might be preferred. Therefore, we make the following proposition,

Proposition 1. Point-biserial variation is parameterized by the Cartesian product of statistical parameters for the y_c and c_y orderings,

$v_{pb} = v_{c} \times v_{y}$.

The y_c ordering corresponds to sorting the y data into two sets, y_c ↔ {y_A, y_B}. Then, the statistical parameters for the two sets are associated with a two-component Cartesian product structure, yielding the familiar effect size measures, Cohen’s d and r_pb as discussed in section 2.3. The c_y ordering is associated with a new nonoverlap measure, ρ_pb. The two types of y-sort, ascending or descending, produce orderings where either {(c_i, y_i)|y_i ≤ y_{i+1}} or {(c_i, y_i)|y_i ≥ y_{i+1}}, respectively. Then, the c-column corresponds to a y-ordered string, c_y. The induced order from the y-sorting is reflected in the degree of mixing of As and Bs in c_y. Next, we sort the data with respect to c, obtaining a maximally ordered string, c_M, where the As and Bs are completely separated. c_M corresponds to the condition where y_A and y_B are disjoint in $R^{1}$, which has been characterized as “perfect correlation” []. Our c_y-sorting algorithm requires equal sample sizes, N_A = N_B. When the sample sizes are unequal, a preprocessing step is required. Suppose N_B < N_A. Then, the y_B data are replicated to create a new data set, y_Brep, such that N_Brep = N_A. If the difference in sample size is small, 0 < N_A − N_B < N_B, then a subset of y_B uniformly spaced by rank is replicated. The y_Brep and y_A data are combined to obtain the (c_y, c_M) strings. They constitute a set of joint observations for two categorical variables, which are summarized in a diagonally symmetric 2 × 2 contingency table of the form [[a, b], [b, a]]. The symmetric form results from the equal sample size condition, which requires that the rows and columns each sum to N_A. Then, the nonoverlap proportion is given by the difference in proportions

$ρ_{pb} = p_{a} - p_{b}$,

where $p_{a} = a / (a + b)$ and p_b = 1 − p_a. When y_A and y_B are disjoint, |ρ_pb| = 1. The sign of ρ_pb is arbitrary because the order of the columns (or rows) of the 2 × 2 table depends on the direction of the sort in y or c_M. In our implementation, the sign is chosen to be consistent with Cohen’s d. The ρ_pb values in Fig 2 were obtained using this sort algorithm. The overlap between uniform unit width $(σ = 1 / \sqrt{12})$ distributions is an important pedagogical case because the expressions for Cohen’s d, r_pbd, and ρ_pb take a simple form. Geometrically, the overlap (θ_U) is given by a rectangle with area θ_U = 1 − δ for the difference between mean values, with 0 ≤ δ ≤ 1, and θ_U = 0 for δ > 1. The nonoverlap is given by ρ_pbU = 1 − θ_U = δ, with 0 ≤ δ ≤ 1. Similarly,

$d_{U} = δ / σ = \sqrt{12} δ$ and $r_{pbdU} = d_{U} / \sqrt{d_{U}^{2} + 4} = \sqrt{3} δ / \sqrt{3 δ^{2} + 1}$.

For the overlap of standard normal (σ = 1) distributions, we obtain

$ρ_{pbN} = 1 - 2 Φ(-δ / 2) = 2 Φ(δ / 2) - 1$,

where Φ is the cumulative normal distribution function []. In Fig 2, we observe that at a large enough δ, r_pbd is attenuated compared to ρ_pb, as expected []. However, for small δ, the inequality is reversed, i.e., r_pbd > ρ_pb. Nevertheless, there is close correspondence between r_pbd and ρ_pb for both the uniform and normal distributions. This is particularly true for highly correlated data where both r_pbd and ρ_pb are near 1, and are therefore equivalent. However, in section 3 we demonstrate that when the data are not well correlated, both r_pbd and ρ_pb are needed in order to distinguish different forms of point-biserial variation. We conclude that r_pbd and Cohen’s d serve as measures of the nonoverlap of distributions but are not necessarily equivalent to ρ_pb.
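The sorting procedure described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `nonoverlap_proportion` is hypothetical, the replication of the smaller set uses one simple rank-uniform rule (each sorted value is repeated so that both sets reach the larger size), ties in y are broken arbitrarily by a stable sort, and the sign convention makes ρ_pb positive when the ‘A’ values lie above the ‘B’ values.

```python
import math
import numpy as np

def nonoverlap_proportion(y_a, y_b):
    """Estimate rho_pb from the labels induced by sorting the pooled y values."""
    y_a = np.sort(np.asarray(y_a, dtype=float))
    y_b = np.sort(np.asarray(y_b, dtype=float))
    n = max(len(y_a), len(y_b))
    # Assumed replication rule: repeat rank-ordered values uniformly so that
    # both sets have n members (identity mapping when the sizes already match).
    y_a = y_a[(np.arange(n) * len(y_a)) // n]
    y_b = y_b[(np.arange(n) * len(y_b)) // n]

    y = np.concatenate([y_a, y_b])
    is_a = np.concatenate([np.ones(n, dtype=bool), np.zeros(n, dtype=bool)])
    c_y = is_a[np.argsort(y, kind="stable")]  # labels in ascending y order

    # c_M is the maximally ordered string: n 'B' labels followed by n 'A' labels.
    a = int(np.count_nonzero(c_y[n:]))  # 'A' observed where c_M also has 'A'
    b = n - a                           # 'B' observed where c_M has 'A'
    return (a - b) / (a + b)            # p_a - p_b for the table [[a, b], [b, a]]

rng = np.random.default_rng(1)
n = 20000
delta = 0.5

# Uniform unit-width distributions shifted by delta: expected value near delta.
rho_uniform = nonoverlap_proportion(rng.uniform(delta, 1 + delta, n),
                                    rng.uniform(0.0, 1.0, n))

# Standard normal distributions with means 1 and 0: compare with 2*Phi(1/2) - 1.
rho_normal = nonoverlap_proportion(rng.normal(1.0, 1.0, n),
                                   rng.normal(0.0, 1.0, n))
rho_normal_theory = math.erf(0.5 / math.sqrt(2.0))  # equals 2*Phi(0.5) - 1

print(rho_uniform, rho_normal, rho_normal_theory)
```

Counting only the upper half of the sorted string is sufficient because the equal-size condition forces the table to be diagonally symmetric.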

2.3 Coordinates for a two-component system of distributed effects

In this section, we discuss the fact that d and ρ_pb are only two elements of a minimal set of parameters for representing point-biserial variation. The one-to-one correspondence, d ↔ r_pbd, will be discussed in section 2.4. Algebraically, v_c corresponds to the Cartesian product of statistical parameters for two sets of $R^{1}$ data, $v_{c} = ({\bar{y}}_{A}, S_{A}^{2}, N_{A}) \times ({\bar{y}}_{B}, S_{B}^{2}, N_{B})$. Introducing the center of mass parameter, $μ = ({\bar{y}}_{A} + {\bar{y}}_{B}) / 2$, the mean values vector is expressed as

$({\bar{y}}_{A}, {\bar{y}}_{B}) = μ (1, 1) + (δ / 2) (1, -1)$,

where (1, 1) and (1, −1) comprise the center of mass basis. We note that the generalization for a weighted average is straightforward. A similar decomposition holds for variances

$(S_{A}^{2}, S_{B}^{2}) = S_{μ}^{2} (1, 1) + (S_{δ}^{2} / 2) (1, -1)$,

where $S_{μ}^{2} = (S_{A}^{2} + S_{B}^{2}) / 2$ and $S_{δ}^{2} = S_{A}^{2} - S_{B}^{2}$. A further reduction is obtained if the variances are homoscedastic, $S_{A}^{2} = S_{B}^{2}$, yielding $S_{p}^{2} = S_{μ}^{2}$ and $S_{δ}^{2} = 0$. Finally, we obtain

$v_{pb} = (δ, μ, S_{p}^{2}, N_{A}, N_{B}, ρ_{pb})$

as a minimal set of parameters for point-biserial variation. However, we observe that v_pb is not unique because functions of the components, {f_i(v_pb,i)}, including linear fractional transformations can be introduced to obtain alternative representations. Mathematics alone is not sufficient to specify a preferred vector basis, which explains why there are alternative effect size measures [, ]. Furthermore, r_pb and Cohen’s d correspond to perspective functions [] of v_pb and do not account for all of the degrees of freedom. Consequently, the practice of using one of these measures to serve as a one-parameter summary of experimental results will be subject to irreproducibility.
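As a small worked illustration of the center of mass coordinates, the following Python sketch converts a pair of group values to (μ, δ) and back, and shows that Cohen’s d, read as the ratio δ/S_p, is unchanged when δ and S_p are rescaled together. The helper names and the numeric values are hypothetical and chosen only for the example.

```python
def center_of_mass_coords(v_a, v_b):
    """Return (mu, delta) such that (v_a, v_b) = mu*(1, 1) + (delta/2)*(1, -1)."""
    return (v_a + v_b) / 2.0, v_a - v_b

def from_center_of_mass(mu, delta):
    """Invert the decomposition to recover the original pair."""
    return mu + delta / 2.0, mu - delta / 2.0

def perspective(u, t):
    """Perspective function P(u, t) = u / t, e.g. d = delta / S_p."""
    return u / t

# Group means (3, 1) give mu = 2 and delta = 2.
mu, delta = center_of_mass_coords(3.0, 1.0)

# The same decomposition applied to variances (1.5, 0.5) gives
# S_mu^2 = 1 and S_delta^2 = 1; S_delta^2 vanishes for equal variances.
s2_mu, s2_delta = center_of_mass_coords(1.5, 0.5)

# Rescaling (delta, S_p) by a common factor t leaves d unchanged, so d alone
# cannot recover the remaining parameters such as mu.
s_p = 1.0
d = perspective(delta, s_p)
d_rescaled = perspective(3.0 * delta, 3.0 * s_p)

print(mu, delta, s2_mu, s2_delta, d, d_rescaled)
```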

The term ‘substantive significance’ has been used to refer to the magnitude of an effect that would be regarded as practically important in a given application [6]. Suppose functional or engineering requirements are expressed in terms of a vector, h, of system parameters. Then, the utility of an effect would be specified as a mapping, $u : h \mapsto R^{1}$ . The specification of u(h) would account for differences in cost-benefit trade-offs for variation in the {h_i} components. The substantive significance for the effect size would be determined by the mapping, u(h) → u(v_pb). Without this information, it is difficult to reach a consensus on the merits of an effect size. This explains the criticism of Cohen’s thresholds for small, medium, and large effects as “somewhat arbitrary” [16] and suggestions that the significance of the magnitude of an effect size depends on the research question [3, 17, 18].

A fundamental limitation arises from the fact that the (δ, μ) center of mass decomposition does not extend to higher dimensions in a straightforward way. Consider the group means vector for three sets, i.e., $({\bar{y}}_{A}^{}, {\bar{y}}_{B}^{}, {\bar{y}}_{C}^{})$ . The default center of mass parameter is defined as $μ = ({\bar{y}}_{A}^{} + {\bar{y}}_{B}^{} + {\bar{y}}_{C}^{}) / 3$ . However, there is no standard procedure for choosing the two additional deviation parameters needed to specify a complete basis. Consequently, the formulation of an effect size measure for multiple group variation is not a well-posed problem, i.e., there is no unique solution [19]. This explains why Cohen’s d does not generalize to schemes involving more than two groups [20] and provides support for previous recommendations to break down ‘complicated hypotheses’, p. 526 [21], and ‘reduce any multiple-level or multiple-variable relationship’ into a set of two-variable effect size relationships [17]. This provides the raison d'être for the development of exploratory methodologies such as CART in high-dimensional data analytics [22, 23].

2.4 Homogeneous coordinates for Pearson correlation

In the effect size literature, it is accepted practice to distinguish three different types of effect size measure, ‘relationship’, ‘group difference’, and ‘group overlap’ [3, 7]. In this section, we discuss the fact that this classification is misleading. We have already discussed the fact that Cohen’s d, r_pbd and ρ_pb all serve as measures of nonoverlap (section 2.2). Now, we point out that r_pbd and Cohen’s d are two sides of the same coin because relationship and group difference correspond to different coordinate systems for representing fractional variation. Such correspondences are quite useful in exploring statistical dependence in high-dimensional data. Consider a vector $(a, b) \in R^{2}$. Division by the y-component produces the ratio vector, $\{ α = (α, 1) \in P^{1} \mid α = a / b, b \neq 0 \}$. Ratios can be distinguished by their representations as points in the projective line, $P^{1}$. However, normalization of a ratio vector by the Euclidean length, $‖ α ‖ = \sqrt{α^{2} + 1}$, produces the unit vector $\hat{α}$, which is a point in the positive half-circle $S_{+}^{1}$. Thus, a fractional quantity can be represented as a point in either $P^{1}$ or $S_{+}^{1}$. Algebraically, the $P^{1}$ and $S_{+}^{1}$ representations are related by linear fractional transformations. In the terminology of projective geometry, a ratio corresponds to a perspective function, P(u, t) = u/t, for vector u [15]. The scaling invariance property of α is represented by the equivalence relation

$(α, 1) \sim (a, b) \sim (a, b) t$

with t ≠ 0. Geometrically, this relation specifies points on the line passing through the origin, (a, b) and (α, 1). The points, (a, b)t, constitute the homogeneous coordinates [] for the line. The homogeneous coordinates concept shows that there is a natural correspondence between ‘relationship’ and ‘group difference’ effect size. Expressing the Pearson product-moment correlation coefficient as the rescaled covariance []

$r = \mathrm{Cov}(x, y) / \sqrt{\mathrm{Var}(x) \mathrm{Var}(y)}$,

the corresponding projective geometric structure is as summarized in Table 1. Vector representations for r_pb and r_pbd are also listed, and a geometric visualization for r_pb is shown in Fig 3. Consequently, r_pbd, Cohen’s d, and ρ_pb each possess $P^{1}$ and $S_{+}^{1}$ representations and serve as measures of group overlap, as described in section 2.2. Therefore, we conclude that the general classification of effect size as a ‘relationship’, ‘group difference’, or ‘group overlap’ index is misleading. We also observe that the question of the merits of Cohen’s d versus r_pb in [] is complicated by the fact that these measures correspond to points in different spaces, $P^{1}$ and $S_{+}^{1}$, respectively. The limitations of r_pb are more easily understood by considering its representation as the vector, $(\sqrt{p_{A} p_{B}} d, 1) \in P^{1}$. The binomial factor has a confounding effect, particularly since base rates are determined by the experimental protocol. This is analogous to the confounding effect of the marginal sums on the ϕ coefficient for a 2 × 2 contingency table (Paper1). Therefore, neither r_pb nor ϕ meets the criterion for a well-behaved effect size of serving to quantify ‘some phenomenon that addresses a question of interest’ []. In section 2.5, we give an example where r_pb gives nonintuitive results in rCART analysis.
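The correspondence between the $S_{+}^{1}$ and $P^{1}$ representations can be made concrete with a short Python sketch. It assumes NumPy and |r| < 1; the function names are hypothetical, and the values of d and p_A are arbitrary examples.

```python
import numpy as np

def to_half_circle(r):
    """Point (r, sqrt(1 - r^2)) on the positive half-circle S^1_+."""
    return np.array([r, np.sqrt(1.0 - r**2)])

def to_projective_line(r):
    """Point (r / sqrt(1 - r^2), 1) on the projective line P^1 (needs |r| < 1)."""
    return np.array([r / np.sqrt(1.0 - r**2), 1.0])

def from_homogeneous(u, t):
    """Correlation represented by homogeneous coordinates (u, t), t != 0."""
    alpha = u / t                      # perspective function P(u, t) = u / t
    return alpha / np.sqrt(alpha**2 + 1.0)

d, p_a = 0.8, 0.2                      # example values only
alpha = np.sqrt(p_a * (1.0 - p_a)) * d # first component of (sqrt(pA pB) d, 1)
r_pb = from_homogeneous(alpha, 1.0)
r_pbd = from_homogeneous(d / 2.0, 1.0) # balanced case, equals d / sqrt(d^2 + 4)

print(to_half_circle(r_pb))            # unit-length vector
print(to_projective_line(r_pb))        # recovers (alpha, 1)
print([from_homogeneous(alpha * t, t) for t in (0.5, 2.0, -3.0)])  # same r_pb
```

Every nonzero multiple of (α, 1) maps to the same correlation, which is the scaling invariance expressed by the homogeneous coordinates.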

Fig 3

Projective spaces for the representation of point-biserial correlation.

The point-biserial correlation coefficient, r_pb, corresponds to the point $(r_{pb}^{}, \sqrt{1 - r_{pb}^{2}})$ on the positive half-circle, $S_{+}^{1}$ , and the point $(\sqrt{p_{A} p_{B}} d, 1)$ on the projective line, $P^{1}$ . The homogeneous coordinates $(\sqrt{p_{A} p_{B}} d, 1) t$ for $t \in R^{1}$ correspond to points on the line through the origin. {p_A, p_B}: sample size proportions, d: Cohen’s d.

Table 1

Homogeneous coordinates for Pearson correlation.

Effect size
Pearson correlation
Point-biserial correlation
r_pbd

The representations for Pearson product-moment correlation as homogeneous coordinates in $R^{2}$, the vector $(r, \sqrt{1 - r^{2}}) \in S_{+}^{1}$, and the vector $(r / \sqrt{1 - r^{2}}, 1) \in P^{1}$. Corresponding representations for the point-biserial correlation, r_pb, and sample size adjusted correlation, r_pbd, are also listed. Cohen’s $d = ({\bar{y}}_{A} - {\bar{y}}_{B}) / S_{p}$, $\{{\bar{y}}_{A}, {\bar{y}}_{B}\}$: mean values, S_p: pooled standard deviation, {p_A, p_B}: sample size proportions for ‘A’ and ‘B’ data, $t \in R^{1}$.

2.5 Point-biserial variation in regression tree analysis

The CART association graph was introduced in Paper1 as a new method for analyzing statistical association in point-biserial data. In this section, we investigate the role of point-biserial variation in rCART, particularly the connection between IG_MSE and r_pb, and introduce the rCART graph as a new method for analyzing association for (x, y) data. The CART decision tree algorithm creates a decision tree by recursive partitioning of the association between response and independent variables [2, 14]. Each node of the tree corresponds to a binary partition of the range of an independent variable. In standard implementations, the partition parameters for a node are determined by maximizing the information gain (IG) for the response variable in an exhaustive search of associations over all independent variables. The rCART implementation is of particular interest because it involves the analysis of point-biserial variation. In each iteration, the set of statistics obtained for partitions of an independent variable constitutes a CART association graph [2]. For the partition value $x_{j} \in R^{1}$ , the data for a node (V) are divided into two subsets, i.e., V_A = {(x_i, y_i)|x_i ≤ x_j} and V_B = {(x_i, y_i)|x_i > x_j}, from which data vectors {y_A, y_B} are obtained. Alternatively, if x_j is categorical, the subsets are specified using matching criteria V_A = {(x_i, y_i)|x_i = x_j} and V_B = {(x_i, y_i)|x_i ≠ x_j}. The standard rCART impurity measure is the mean square error for the response, $MSE (y) = \sum_{i} {(\bar{y} - y_{i})}^{2} / N_{V}^{}$ , where N_V is the sample size and $\bar{y}$ is the mean [14]. Then, IG is defined as the parent node impurity minus the weighted impurities for the subsets

$IG_{MSE}(y_{A}, y_{B}) = MSE(y) - p_{A} MSE(y_{A}) - p_{B} MSE(y_{B})$,

where p_A and p_B are the sample size proportions. Partitioning the sum of squares, MSE(y), gives [, ]

$MSE(y) = p_{A} MSE(y_{A}) + p_{B} MSE(y_{B}) + p_{A} p_{B} {({\bar{y}}_{A} - {\bar{y}}_{B})}^{2}$.

Substitution for MSE(y) in the definition of IG_MSE gives

$IG_{MSE}(y_{A}, y_{B}) = p_{A} p_{B} {({\bar{y}}_{A} - {\bar{y}}_{B})}^{2}$.

Thus, IG_MSE(y_A, y_B) is equivalent to $r_{pb}^{2}$ with S_p = 1 (Table 1); IG_MSE does not account for the variation in S_p. To the best of our knowledge, this connection between IG_MSE and r_pb has not been reported previously. We conclude that the analysis of point-biserial variation serves as the basis for rCART, and we use the terms ‘effect size’ and ‘information gain’ interchangeably. The x_j partition produces subsets with sample sizes, j and N_V − j for $x_{j} \in R^{1}$. An association graph is obtained by searching over all partitions where the sample size proportions, p_j and (1 − p_j), vary over their entire range, producing a large parabolic variation in the p_j(1 − p_j) factor. Thus, an association graph is a convenient way to compare the sample size proportion dependence of effect size measures. In the next section, we demonstrate that r_pb gives misleading results in rCART, while r_pbd and ρ_pb produce more intuitive results. However, when the (x, y) data are highly correlated and Pearson r(x, y) → 1, the rCART graph becomes a horizontal line or nearly so, because r_pbd ≈ ρ_pb ≈ 1 for all x_j partitions. Then, the rCART graph and Pearson r are equivalent representations. Thus, CART methodology is most useful when the data are poorly correlated, which includes population studies where system performance is determined by trade-offs between multiple factors. Typical applications include GWAS, and other high-dimensional search problems such as nursing home performance as discussed in the next section.
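The relation between the information gain and the difference between mean values can be verified numerically, and the same loop yields a simple association graph over all partitions of a continuous x. The sketch below is an assumption-laden illustration rather than the rCART implementation used in this work: it assumes NumPy, population (ddof = 0) variances, synthetic data, and hypothetical names such as `association_graph`; it also assumes that the within-group variance is nonzero for every partition.

```python
import numpy as np

def mse(y):
    """Mean square error about the mean (the rCART node impurity)."""
    return np.mean((y - np.mean(y)) ** 2)

def association_graph(x, y, min_size=2):
    """Rows of (split value, p_A, IG_MSE, p_A p_B delta^2, r_pb, r_pbd)."""
    order = np.argsort(x, kind="stable")
    x, y = x[order], y[order]
    n = len(y)
    rows = []
    for j in range(min_size, n - min_size + 1):
        if x[j - 1] == x[j]:
            continue                      # no valid split between tied x values
        y_a, y_b = y[:j], y[j:]           # V_A: x <= x_j, V_B: x > x_j
        p_a = j / n
        p_b = 1.0 - p_a
        delta = y_a.mean() - y_b.mean()
        ig = mse(y) - p_a * mse(y_a) - p_b * mse(y_b)
        s_p = np.sqrt(p_a * mse(y_a) + p_b * mse(y_b))
        d = delta / s_p
        w = np.sqrt(p_a * p_b) * d
        r_pb = w / np.sqrt(w**2 + 1.0)
        r_pbd = d / np.sqrt(d**2 + 4.0)
        rows.append((x[j - 1], p_a, ig, p_a * p_b * delta**2, r_pb, r_pbd))
    return np.array(rows)

# Synthetic, weakly associated data with a skewed x (illustration only).
rng = np.random.default_rng(2)
x = rng.exponential(size=300)
y = 0.3 * x + rng.normal(size=300)
graph = association_graph(x, y)

# Split values chosen by the two criteria.
best_ig = graph[np.argmax(graph[:, 2]), 0]
best_r_pbd = graph[np.argmax(np.abs(graph[:, 5])), 0]
print(best_ig, best_r_pbd)
```

Under these definitions the third and fourth columns coincide, and the squared r_pb column equals IG_MSE divided by MSE(y), so plotting the columns against p_A shows directly how much of the variation in IG_MSE and r_pb follows the p_A p_B factor.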

3 Data analysis and results

In Paper1, we used the publicly accessible Nursing Home Compare (NHC) data [25] in CART analysis to demonstrate the importance of adjusting for the dependence on marginal sums for 2 × 2 contingency tables [2]. In this section, we use a similar NHC data set for a discussion of point-biserial variation and the rCART algorithm. Our objective is to provide a practical demonstration of the limitations of r_pb due to the confounding effect of unbalanced sample sizes and to compare the behaviors of r_pbd and ρ_pb. We also discuss the importance of accounting for three degrees of freedom, (r_pbd, μ, ρ_pb), and the use of Monte Carlo methods to estimate the joint distribution of statistical parameters.

3.1 rCART association graphs for NHC quality measures

NHC data of the fourth quarter of 2018 were retrieved for 20 quality measures (Q_i) for 15341 nursing homes; detailed descriptions of these continuous variables can be found on the NHC website [26]. A histogram of the nursing home occupancy is shown in Fig 4A. Since performance estimates for nursing homes with low occupancy would be less reliable, a minimum occupancy criterion of at least 50 ‘Average number of residents per day’ was applied to obtain a restricted data set of 11053 nursing homes for further analysis [27]. Pearson correlation coefficients, r(Q_i, Q_j), and association graphs were calculated for all pairs of quality measures, {(Q_i, Q_j)|i ≠ j}. On average, the information gain for the rCART partition is larger when the (Q_i, Q_j) variables are highly correlated (Fig 5A); the r(Q_i, Q_j) correlations are distributed with 95% less than 0.16 and a maximum of 0.65. The distribution for ‘Number of outpatient emergency department visits per 1000 long-stay resident days’ (‘Emergency visits’) versus ‘Number of hospitalizations per 1000 long-stay resident days’ (‘Hospitalizations’) with correlation r = 0.37 is skewed, with a long tail towards larger values (Fig 4B). rCART association graphs are shown for the ‘Hospitalizations’ response and ‘Emergency visits’ partition variables (Fig 6A and 6B), and for the reverse, i.e., ‘Emergency visits’ response and ‘Hospitalizations’ partition variables (Fig 6C and 6D). The high correlation between r_pb and $\sqrt{p_{A} p_{B}}$ (r = 0.99) is typical and indicates that variation in the binomial sampling factor overrides the smaller variation in Cohen’s d (Eq 2). We also note that the graphs for r_pb and IG_MSE (not shown) are superimposable, as expected from Eq 20 and because the variation in S_p is small. Thus, r_pb and IG_MSE mainly correspond to the variation in sample size proportion. 
In general, we observe that the association curves for r_pbd and ρ_pb can be categorized as monotonically increasing or decreasing, or even U-shaped (concave up), depending on how the (Q_i, Q_j) data are distributed. Here, the U-shaped dependence of r_pbd correlates well with δ (r = 0.999) and contrasts sharply with the concave down variation for r_pb. Consequently, r_pb and r_pbd produce very different rCART partitions (Table 2). In Fig 6A, the r_pb partition for the split value, x_j = 0.8, produces subnodes with comparable sample sizes, N_A = 5742 and N_B = 4890 (Table 2). It is useful to view this partition from a statistical perspective. As a first approximation, we expect that the majority of nursing homes belong to a broad distribution for average performance. Then, the r_pb partition with a split value close to the median, 0.85, is analogous to splitting a normal distribution nearly in half, producing subsets with different mean ‘Emergency visits’ values {0.5, 1.4} that nevertheless correspond to entities with average performance. Thus, r_pb and IG_MSE produce rCART subsets that are not well distinguished from a functional perspective. In comparison, for r_pbd, there are two possible rCART partitions at either low (x_j = 0.3) or high (x_j = 2.5) split values. Each partition produces a large subset corresponding to a broad distribution for average performance and a much smaller subset for either above- or below-average performance. Thus, r_pbd produces more functionally relevant classifications.
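The dependence of r_pb on the sampling factor can be illustrated with a minimal sketch, which scans candidate split values and reports δ, μ, Cohen’s d, $\sqrt{p_{A} p_{B}}$, and r_pb at each split. It uses synthetic skewed data rather than the NHC data, the function name `split_statistics` is hypothetical, and assigning subnode A to values above the split is a convention assumed here. r_pb is computed from the standard identity r_pb = δ $\sqrt{p_{A} p_{B}}$ / s_y, where s_y is the standard deviation of the combined response; r_pbd and ρ_pb are not computed because their definitions are not restated in this section.

```python
import numpy as np

def split_statistics(x, y, split):
    """Point-biserial statistics for response y partitioned on x at `split`."""
    in_a = x > split  # assumed convention: subnode A lies above the split
    y_a, y_b = y[in_a], y[~in_a]
    n_a, n_b = y_a.size, y_b.size
    n = n_a + n_b
    p_a, p_b = n_a / n, n_b / n
    delta = y_a.mean() - y_b.mean()       # difference between mean values
    mu = 0.5 * (y_a.mean() + y_b.mean())  # center of mass coordinate
    # Pooled standard deviation used in Cohen's d.
    s_p = np.sqrt(((n_a - 1) * y_a.var(ddof=1)
                   + (n_b - 1) * y_b.var(ddof=1)) / (n - 2))
    return {
        "split": split, "n_a": n_a, "n_b": n_b,
        "delta": delta, "mu": mu, "d": delta / s_p,
        "sampling_factor": np.sqrt(p_a * p_b),
        "r_pb": delta * np.sqrt(p_a * p_b) / y.std(ddof=0),
    }

# Synthetic, right-skewed stand-ins for a partition variable x and a response y.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=0.5, size=5000)
y = 1.0 + 0.4 * x + rng.gamma(shape=2.0, scale=0.35, size=5000)

# Scan split values at evenly spaced quantiles of the partition variable.
splits = np.quantile(x, np.linspace(0.05, 0.95, 19))
rows = [split_statistics(x, y, s) for s in splits]
for row in rows:
    print(f"split={row['split']:.2f}  d={row['d']:.3f}  "
          f"sqrt(pA*pB)={row['sampling_factor']:.3f}  r_pb={row['r_pb']:.3f}")
```

In this sketch, the sampling factor reaches its maximum of 0.5 at the median split and shrinks towards the tails, so a scan that maximizes r_pb is pulled towards balanced subnodes even when d is larger at an extreme split.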

Fig 4

Skewed distributions for NHC quality measures.

A. Histogram of ‘Average number of residents per day’ for 15341 nursing homes. B. Two-dimensional Gaussian kernel density estimate of the distribution of ‘Number of outpatient emergency department visits per 1000 long-stay resident days’ (‘Emergency visits’) versus ‘Number of hospitalizations per 1000 long-stay resident days’ (‘Hospitalizations’), with correlation r = 0.37.

Fig 5

The relation between r_pbd and ρ_pb in rCART.

These graphs display data obtained from association graphs for 380 pairs of quality measures, {(Q_i, Q_j)|i ≠ j}. A. r_pbd effect size for rCART split versus correlation r(Q_i, Q_j). On average, the largest information gain is obtained when the response and partition variables are highly correlated. B. Correlation r(r_pbd, ρ_pb) between effect size and r(Q_i, Q_j) for association graphs. There is good correlation between r_pbd and ρ_pb in many cases, but there are exceptions.

Fig 6

rCART association graphs for effect size.

A,B: ‘Hospitalizations’ response versus ‘Emergency visits’ partition variables, with correlation r(r_pbd, ρ_pb) = 0.93. C,D: ‘Emergency visits’ response versus ‘Hospitalizations’ partition variables, with correlation r(r_pbd, ρ_pb) = 0.49. Bar plot histograms are shown for ‘Emergency visits’ (B inset) and ‘Hospitalizations’ (D inset). r_pb: point-biserial correlation coefficient, {p_A, p_B}: sample size proportions, r_pbd: sample size corrected correlation coefficient, ρ_pb: nonoverlap proportion, (δ, μ): center of mass parameters $({\bar{y}}_{A} - {\bar{y}}_{B}, ({\bar{y}}_{A} + {\bar{y}}_{B}) / 2)$.

Table 2

rCART subnode parameters.

Response variable	Partition variable	Split value	Subnode A $(\bar{y}, σ, N)$	Subnode B $(\bar{y}, σ, N)$
Hospitalizations	Emergency visits	r_pbd: 0.3	1.8, 0.7, 9909	1.2, 0.7, 723
Hospitalizations	Emergency visits	r_pb: 0.8	2.0, 0.7, 5742	1.5, 0.7, 4890
Hospitalizations	Emergency visits	r_pbd: 2.5	2.5, 0.8, 320	1.7, 0.7, 10312
Emergency visits	Hospitalizations	r_pbd: 0.7	1.0, 0.6, 10137	0.5, 0.4, 495
Emergency visits	Hospitalizations	r_pb: 1.7	1.2, 0.7, 5318	0.8, 0.5, 5314
Emergency visits	Hospitalizations	r_pbd: 3.3	1.5, 1.0, 330	1.0, 0.6, 10302

Summary of the rCART partition values and subnode statistics $(\bar{y}, σ, N)$ for the association graphs in Fig 6. r_pb: point-biserial correlation coefficient, r_pbd: sample size corrected correlation, $(\bar{y}, σ, N)$: mean value, standard deviation, sample size.
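As a worked illustration of the center of mass coordinates, the rounded subnode means in Table 2 for the ‘Emergency visits’ response give the following approximate values of $\delta = {\bar{y}}_{A} - {\bar{y}}_{B}$ and $\mu = ({\bar{y}}_{A} + {\bar{y}}_{B}) / 2$ at the three ‘Hospitalizations’ split values. Because the tabulated means are rounded to one decimal place, these numbers are indicative only.

$$
\begin{aligned}
x_j = 0.7 \ (r_{pbd}): &\quad \delta \approx 1.0 - 0.5 = 0.5, &\quad \mu \approx (1.0 + 0.5)/2 = 0.75 \\
x_j = 1.7 \ (r_{pb}): &\quad \delta \approx 1.2 - 0.8 = 0.4, &\quad \mu \approx (1.2 + 0.8)/2 = 1.00 \\
x_j = 3.3 \ (r_{pbd}): &\quad \delta \approx 1.5 - 1.0 = 0.5, &\quad \mu \approx (1.5 + 1.0)/2 = 1.25
\end{aligned}
$$

At this precision, the two r_pbd partitions have the same δ but different μ, so δ alone does not distinguish them.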

The importance of accounting for variation in both degrees of freedom, (r_pbd, μ), is illustrated in Fig 6B and 6D. Here, μ is monotonically increasing, and one of the r_pbd partitions might be preferred depending on μ. However, this requires an assessment of the cost-benefit trade-offs for (r_pbd, μ) variation, which will depend on the particular application. A close correspondence between r_pbd and ρ_pb is observed in many cases, with r(r_pbd, ρ_pb) ≥ 0.8 in 68% of the association graphs (Fig 5B), but there are many cases where they differ depending on how the (Q_i, Q_j) data are skewed. Fig 6C shows an example of the difference between the ρ_pb and r_pbd curves with r(r_pbd, ρ_pb) = 0.49. The r_pbd partition for the lower split value might be preferred because it is associated with higher ρ_pb, depending on how the cost-benefit trade-off is assessed for (r_pbd, ρ_pb) variation. Consequently, three coordinates (r_pbd, μ, ρ_pb) are needed to distinguish different forms of point-biserial variation. These observations provide support for previous remarks stating that interpreting the magnitude of an effect size as a measure of substantive significance depends on the particular application [6, 18]. A more precise approach would take into account the multidimensional nature of point-biserial variation and involve the specification of functional or engineering requirements for a relevant vector basis. Then, an analysis of the effect size for the system response could involve separate thresholds for each coordinate. The ability to account for all relevant degrees of freedom is also important in assessing reproducibility. A one-parameter representation using an effect size such as r_pbd or Cohen’s d gives an incomplete picture and leads to ambiguous results because of the loss of information.

3.2 Distributed effects in point-biserial variation

The reproducibility of nursing home performance data depends on stochastic effects in the measurement of patient outcome. Then, the observed data are associated with a distribution of data sets, $P (y)$, and corresponding distributions of the statistical parameters $P (v_{pb})$ and effect size. The specification of $P (y)$ must be based on a realistic assessment of all sources of error and uncertainty to form an error model for the data, $E (y)$. Then, the determination of the distribution for the effect size requires propagation of the error in $P (y)$. For fractional quantities such as Cohen’s d and r_pbd, it is necessary to account for stochastic effects in both the numerator and denominator. However, analytical methods for estimating distributions for ratios [28, 29], proportions [30, 31], and correlation coefficients [32] are complicated by fractional transformation, a bounded range, and discreteness. Thus, iterative procedures are needed for the analysis of noncentral effect size distributions and estimating confidence intervals for deviations above and below the effect size estimate [5, 18]. Alternatively, Monte Carlo (MC) methods [2, 33, 34] provide a more practical approach to estimating the distribution for the effect size. In an MC simulation, $E (y)$ specifies error parameters for each observed value in the original data. Then, a point-biserial MC data set is obtained by random sampling to produce MC instances for y_A and y_B. The MC sampling process is repeated many times to obtain a collection of MC data sets to form an estimate, $P_{i} (y)$. Statistical parameters are calculated for the data sets in $P_{i} (y)$ to obtain estimates of distributions and histograms for point-biserial effects. Many MC runs are performed to obtain a set, $\{P_{i} (y) \mid 1 \leq i \leq N_{MCruns}\}$, which allows the determination of the degree of convergence for the MC simulation.
However, the information needed to construct an error model is not included in the NHC quality measures data. For this demonstration, we provided a rudimentary ‘Emergency visits’ error model, where σ_i = y_i/5. MC simulations for (r_pbd, μ) and (r_pbd, ρ_pb) for ‘Emergency visits’ response with ‘Hospitalizations’ rCART split value, 3.3 (Table 2), are shown in Fig 7. The discrete structure of the ρ_pb distribution is due to stochastic effects in the c_y sorting. The separate confidence intervals in Fig 6 for positive and negative deviation from the observed effect size estimate were estimated from the MC distributions. In practical applications, the advantage of the MC method is that it allows detailed simulation of the data acquisition process, including heterogeneity within groups, and specifications for $E (y)$ can include heteroscedasticity, measurement error, and misclassification [17, 35, 36].
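The MC procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for Fig 7: the data are synthetic stand-ins for the two subnodes, the error model σ_i = y_i/5 is applied with a normal distribution (the functional form is an assumption), Cohen’s d with the pooled standard deviation stands in for r_pbd, ρ_pb is omitted, and the names `cohens_d` and `mc_run` are hypothetical. The run and sample counts are reduced for speed.

```python
import numpy as np

def cohens_d(y_a, y_b):
    """Cohen's d using the pooled standard deviation."""
    n_a, n_b = len(y_a), len(y_b)
    pooled_var = ((n_a - 1) * y_a.var(ddof=1)
                  + (n_b - 1) * y_b.var(ddof=1)) / (n_a + n_b - 2)
    return (y_a.mean() - y_b.mean()) / np.sqrt(pooled_var)

def mc_run(y_a, y_b, n_samples, rng, rel_sigma=0.2):
    """One MC run: n_samples perturbed data sets, each reduced to (d, mu)."""
    params = np.empty((n_samples, 2))
    for k in range(n_samples):
        # Error model sigma_i = y_i / 5; the normal form is an assumption.
        mc_a = rng.normal(loc=y_a, scale=rel_sigma * y_a)
        mc_b = rng.normal(loc=y_b, scale=rel_sigma * y_b)
        params[k] = cohens_d(mc_a, mc_b), 0.5 * (mc_a.mean() + mc_b.mean())
    return params

rng = np.random.default_rng(1)
# Synthetic, positive, right-skewed stand-ins for y_A and y_B (not NHC data).
y_a = rng.gamma(shape=2.0, scale=0.75, size=330)
y_b = rng.gamma(shape=3.0, scale=0.35, size=3000)

# Several independent runs; agreement between runs indicates convergence.
runs = [mc_run(y_a, y_b, n_samples=200, rng=rng) for _ in range(5)]
d_means = np.array([run[:, 0].mean() for run in runs])
pooled = np.vstack(runs)

# Separate lower and upper limits taken from percentiles of the MC distribution.
lo, hi = np.percentile(pooled[:, 0], [2.5, 97.5])
print(f"mean d per run: {np.round(d_means, 3)}")
print(f"between-run spread of mean d: {d_means.std(ddof=1):.4f}")
print(f"95% MC interval for d: [{lo:.3f}, {hi:.3f}]")
```

Taking percentiles of the pooled MC distribution gives separate limits for deviations above and below the central value without assuming a symmetric distribution for the ratio. A joint 2D histogram of the two columns of `pooled` would correspond to the kind of (effect size, μ) display described for the simulation.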

Fig 7

Monte Carlo simulation of the distribution of stochastic effects for point-biserial variation.

2D histograms of MC distributions for (r_pbd, μ) (A) and (r_pbd, ρ_pb) (B) for ‘Emergency visits’ response with ‘Hospitalizations’ rCART split value, 3.3 (Table 2). The 1σ error bars for the r_pbd histogram (A inset) serve as an indication of convergence for the simulation; the mean for the normal curve corresponds to the observed r_pbd value, 0.398. r_pbd: sample size corrected correlation, ρ_pb: nonoverlap proportion, μ: center of mass parameter $({\bar{y}}_{A} + {\bar{y}}_{B}) / 2$, number of MC runs: 25, samples per MC run: 4000.

4 Discussion

In this work, we use sort as an intrinsic property of both numbers and labels to generate a complete set of parameters for point-biserial variation, v_pb. We demonstrate that Cohen’s d is associated with the center of mass representation for a two-component system of normal distributions. However, a parameterization can also be constructed for skewed distributions. We do not attempt to incorporate requirements for ‘substantive significance’ because this depends on the particular application, which might require different or additional parameters. The specification of performance criteria for all of the parameters in v_pb is also required. The (δ, μ) effect size representation does not generalize because there is no standard center of mass parameterization for a multicomponent system. However, this does not constitute a fundamental limitation in the application of effect size for high-dimensional data analytics. Instead, the (δ, μ) coordinates serve as a minimal framework for analyzing dependency using exploratory methodologies such as rCART. CART methodology is useful in population studies where the performance or system response is distributed due to complex interactions. Then, a decision tree for identifying outperforming individuals can help in the determination of predictive criteria for improved performance, and the construction of a functional model. We also demonstrate the use of replication as a nonparametric method for equalizing sample sizes in the estimation of ρ_pb. This replication protocol can be used in other classification algorithms where adjustment for unbalanced sample size is needed. We also demonstrate that the Monte Carlo method is a practical way to estimate the distribution of a fractional statistical quantity from the detailed specification of an error model for the data. Then, the assessment of substantive significance must take into account the distribution in effect size parameters. 
We conclude that a better understanding of the applied algebraic foundations and an improved methodology are important for the application of effect size in data analytics.

Acknowledgements

I thank many former colleagues in the Genetic Discovery group at DuPont for stimulating my interest in statistical problems in genome-wide association studies and CART.

References

A Beló, SD Luck. Association Mapping for the Exploration of Genetic Diversity and Identification of Useful Loci for Plant Breeding. In: K Meksem, G Kahl, editors. The Handbook of Plant Mutation Screening. Weinheim, Germany: Wiley-VCH Verlag GmbH & Co. KGaA; 2010. p. 231–246. Available from: http://onlinelibrary.wiley.com/doi/10.1002/9783527629398.ch14/summary http://doi.wiley.com/10.1002/9783527629398.ch14.

S Luck. Factoring a 2 x 2 contingency table. PLOS ONE. 2019;14(10):e0224460. doi:10.1371/journal.pone.0224460

RE McGrath, GJ Meyer. When effect sizes disagree: The case of r and d. Psychological Methods. 2006;11(4):386–401. doi:10.1037/1082-989X.11.4.386

RJ Grissom, JJ Kim. Effect Sizes for Research. 2nd ed. New York, NY: Routledge; 2011.

G Cumming. Understanding The New Statistics. New York, NY: Routledge; 2012.

K Kelley, KJ Preacher. On effect size. Psychological Methods. 2012;17(2):137–152. doi:10.1037/a0028086

CJ Huberty. A History of Effect Size Indices. Educational and Psychological Measurement. 2002;62(2):227–240. doi:10.1177/0013164402062002002

M Pastore, A Calcagnì. Measuring Distribution Similarities Between Samples: A Distribution-Free Overlapping Index. Frontiers in Psychology. 2019;10:1089. doi:10.3389/fpsyg.2019.01089

J Lee Rodgers, WA Nicewander. Thirteen Ways to Look at the Correlation Coefficient. The American Statistician. 1988;42(1):59–66. doi:10.2307/2685263

M Gradstein. Maximal Correlation between Normal and Dichotomous Variables. Journal of Educational Statistics. 1986;11(4):259–261. doi:10.3102/10769986011004259

RG Chambers. Correlation coefficients from 2 x 2 tables and from biserial data. British Journal of Mathematical and Statistical Psychology. 1982;35(2):216–227. doi:10.1111/j.2044-8317.1982.tb00654.x

Y Cheng, H Liu. A short note on the maximal point-biserial correlation under non-normality. British Journal of Mathematical and Statistical Psychology. 2016;69(3):344–351. doi:10.1111/bmsp.12075

JM Johnson, TM Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data. 2019;6(1):27. doi:10.1186/s40537-019-0192-5

M Krzywinski, N Altman. Classification and regression trees. Nature Methods. 2017;14(8):757–758. doi:10.1038/nmeth.4370

SP Boyd, L Vandenberghe. Convex optimization. New York, NY: Cambridge University Press; 2004.

T Schäfer, MA Schwarz. The Meaningfulness of Effect Sizes in Psychological Research: Differences Between Sub-Disciplines and the Impact of Potential Biases. Frontiers in Psychology. 2019;10(APR):813. doi:10.3389/fpsyg.2019.00813

S Nakagawa, IC Cuthill. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biological Reviews. 2007;82(4):591–605. doi:10.1111/j.1469-185X.2007.00027.x

CO Fritz, PE Morris, JJ Richler. Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General. 2012;141(1):2–18. doi:10.1037/a0024338

JD Logan. Applied Mathematics. 2nd ed. New York, NY: John Wiley & Sons, Inc; 1997.

JTE Richardson. Measures of effect size. Behavior Research Methods, Instruments, & Computers. 1996;28(1):12–22. doi:10.3758/BF03203631

G Casella, R Berger. Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury; 2002.

T Hastie, R Tibshirani, J Friedman. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY: Springer New York; 2009.

B de Ville. Decision trees. Wiley Interdisciplinary Reviews: Computational Statistics. 2013;5(6):448–455. doi:10.1002/wics.1278

S Ghali. Introduction to Geometric Computing. London: Springer London; 2008.

Nursing Home Compare datasets; 2020. Available from: https://data.medicare.gov/data/nursing-home-compare.

NHC Quality Measures; 2020. Available from: https://www.medicare.gov/NursingHomeCompare/About/nhcinformation.html.

S Luck. Data for the paper “Nonoverlap proportion and point-biserial variation”; 2020. Available from: 10.6084/m9.figshare.11591334.v2.

G Marsaglia. Ratios of Normal Variables. Journal of Statistical Software. 2006;16(4):1–10. doi:10.18637/jss.v016.i04

U von Luxburg, VH Franz. A Geometric Approach to Confidence Sets for Ratios: Fieller’s Theorem, Generalizations, and Bootstrap. Statistica Sinica. 2009;19:1095–1117. doi:10.2307/24308947

RG Newcombe. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine. 1998;17(8):873–890. doi:10.1002/(SICI)1097-0258(19980430)17:8<873::AID-SIM779>3.0.CO;2-I

A Agresti. Dealing with discreteness: making ‘exact’ confidence intervals for proportions, differences of proportions, and odds ratios more exact. Statistical Methods in Medical Research. 2003;12(1):3–21. doi:10.1191/0962280203sm311ra

AJ Bishara, JB Hittner. Reducing Bias and Error in the Correlation Coefficient Due to Nonnormality. Educational and Psychological Measurement. 2015;75(5):785–804. doi:10.1177/0013164414557639

PR Bevington, DK Robinson. Data Reduction and Error Analysis for the Physical Sciences. 3rd ed. New York, NY: McGraw-Hill; 2003.

DP Kroese, T Brereton, T Taimre, ZI Botev. Why the Monte Carlo method is so important today. Wiley Interdisciplinary Reviews: Computational Statistics. 2014;6(6):386–392. doi:10.1002/wics.1314

M Höfler. The effect of misclassification on the estimation of association: a review. International Journal of Methods in Psychiatric Research. 2005;14(2):92–101. doi:10.1002/mpr.20

JP Buonaccorsi. Measurement error: models, methods, and applications. Boca Raton: Chapman and Hall/CRC; 2010.