Method Overview of gcor
1 Introduction
gcor is software that provides generalized correlation measures and is available for R and Python. This document describes the methods used in gcor.
1.1 Overview
We consider a measure defined for any pair of random variables \((X, Y)\), including continuous, discrete, or a mixture of the two. The measure takes values in the interval \([0,1]\). It is \(0\) when \(X\) and \(Y\) are independent and \(1\) when they are mutually completely dependent; moreover, when \((X, Y)\) follows a bivariate normal distribution with correlation coefficient \(\rho\), the measure coincides with \(|\rho|\), the absolute value of the correlation coefficient.
Measures satisfying these properties can be defined in various ways; we call them generalized correlation measures. gcor provides measures based on an information-theoretic approach and can assess not only linear but also nonlinear relationships.
These measures are population parameters determined by the joint distribution of \(X\) and \(Y\), so we need to estimate them from data in practice. This document describes the measures used in gcor and presents their definitions and estimation methods.
2 Definition of the measure
This chapter describes the definition of generalized correlation measures under a population distribution.
2.1 Preliminaries
We consider assessing the strength of the relationship between two random variables \(X\) and \(Y\). Each of \(X\) and \(Y\) may be continuous, discrete, or a mixture of the two.
Suppose that the pair \((X, Y)\) follows a joint distribution \(P_{X,Y}\). In the discrete case, we write the joint probability as \(p_{X,Y}(x,y)\) and the marginal probabilities as \(p_X(x)\) and \(p_Y(y)\). In the continuous case, we write the joint density as \(f_{X,Y}(x,y)\) and the marginal densities as \(f_X(x)\) and \(f_Y(y)\).
2.2 Complete dependence
To assess the strength of the relationship between random variables, we use the notion of complete dependence due to Lancaster (1963). This can be summarized as follows:
If the value of \(Y\) is uniquely determined (with probability one) given the value of \(X\), then \(Y\) is said to be completely dependent on \(X\). More formally, this means that there exists a measurable function \(g\) such that \(P(Y = g(X)) = 1\). If \(X\) is also completely dependent on \(Y\), then \(X\) and \(Y\) are said to be mutually completely dependent.
2.3 Generalized correlation measures
Let \(r\) be a quantity determined by the joint distribution \(P_{X,Y}\) of a pair of random variables \((X, Y)\). If \(r\) satisfies all of the following properties, we call \(r\) a generalized correlation measure:
1. \(0 \le r \le 1\)
2. If \(X\) and \(Y\) are independent, then \(r = 0\)
3. If \(X\) and \(Y\) are mutually completely dependent, then \(r = 1\)
4. If \((X, Y)\) follows a bivariate normal distribution with correlation coefficient \(\rho\), then \(r = |\rho|\)
Note that if both \(X\) and \(Y\) are constant (i.e., each is a random variable that takes a fixed value with probability one), then they are independent and also mutually completely dependent, so properties (2) and (3) conflict. Therefore, in this case, \(r\) is undefined in theory. In practice, one may set it to \(0\) or \(1\) for convenience, depending on the use case.
Rényi (1959) discusses measures of dependence and lists similar conditions among his axioms; that work likewise excludes constant random variables from consideration.
2.4 Chi-squared informational correlation
Based on an information-theoretic approach, we define a measure that satisfies the properties of a generalized correlation measure.
The chi-squared mutual information between random variables \(X\) and \(Y\) can be written as follows (Polyanskiy and Wu 2025; Csiszár 1967):
\[ I_{\chi^2}(X;Y) = \begin{cases} \displaystyle \mathbb{E}_{P_{X,Y}} \bigg[ \frac{dP_{X,Y}}{d(P_X \otimes P_Y)} \bigg] - 1 & (P_{X,Y} \ll P_X \otimes P_Y) \\ \infty & (\text{otherwise}) \end{cases} \]
Here \(P_X \otimes P_Y\) denotes the product distribution of \(P_X\) and \(P_Y\), \(dP/dQ\) denotes the Radon–Nikodym derivative, and \(P \ll Q\) indicates that \(P\) is absolutely continuous with respect to \(Q\).
To define our measure, we use \(\psi = I_{\chi^2}(X;Y) + 1\), with the constant \(1\) added. When both \(X\) and \(Y\) are discrete, or both are continuous, \(\psi\) can be expressed under certain conditions as:
\[ \psi = \begin{cases} \displaystyle \sum_x \sum_y \frac{p_{X,Y}(x,y)^2}{p_X(x)p_Y(y)} & (\text{discrete}) \\[1em] \displaystyle \iint \frac{f_{X,Y}(x,y)^2}{f_X(x)f_Y(y)} dx dy & (\text{continuous}) \end{cases} \]
Here \(p_{X,Y}(x,y)\) denotes the joint probability, and \(p_X(x)\) and \(p_Y(y)\) denote the marginal probabilities. Similarly, \(f_{X,Y}(x,y)\) denotes the joint density, and \(f_X(x)\) and \(f_Y(y)\) denote the marginal densities.
Using this quantity, we define the chi-squared informational correlation as follows:
\[ \rchi(X,Y) = \begin{cases} \displaystyle \sqrt{\frac{1-\psi^{-1}}{\sqrt{1-k_X^{-1}}\sqrt{1-k_Y^{-1}}}} & (k_X > 1) \land (k_Y > 1) \\[1em] 0 & (k_X = 1 < k_Y) \lor (k_Y = 1 < k_X) \end{cases} \]
Here \(k_X\) and \(k_Y\) denote the cardinalities of the support sets of \(X\) and \(Y\), respectively. For example, if \(X\) is a finite discrete random variable, then \(k_X\) is the number of elements \(x\) such that \(p_X(x) > 0\). If the support set is infinite, we set \(k_X = \infty\) without distinguishing infinite cardinalities. The same definition applies to \(k_Y\).
As noted above, \(\rchi\) is undefined when \(k_X = k_Y = 1\).
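For a finite discrete distribution, \(\psi\) and the resulting measure can be computed directly from the probability table. The following Python sketch is illustrative only (it is not part of gcor's API, and the joint probability table is made up); it evaluates the discrete-case formula for \(\psi\) and then the \((k_X > 1) \land (k_Y > 1)\) branch of the definition:

```python
import numpy as np

# Made-up joint probability table p_{X,Y}(x, y); rows index x, columns index y.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])

px = p.sum(axis=1)  # marginal p_X
py = p.sum(axis=0)  # marginal p_Y

# Discrete case: psi = sum_{x,y} p(x,y)^2 / (p_X(x) p_Y(y))
psi = (p**2 / np.outer(px, py)).sum()

# Support sizes k_X and k_Y (values with positive marginal probability).
kx = int((px > 0).sum())
ky = int((py > 0).sum())

# Chi-squared informational correlation, case k_X > 1 and k_Y > 1.
r = np.sqrt((1 - 1 / psi) / (np.sqrt(1 - 1 / kx) * np.sqrt(1 - 1 / ky)))
print(round(float(psi), 4), round(float(r), 4))
```

For an independent table, i.e. one equal to the outer product of its margins, \(\psi = 1\) and the measure is \(0\); here the diagonal concentration yields a strictly positive value.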
2.4.1 Properties
The chi-squared informational correlation \(\rchi\) has the following properties:
1. \(0 \le \rchi \le 1\)
2. \(X\) and \(Y\) are independent if and only if \(\rchi = 0\)
3. If \(X\) and \(Y\) are mutually completely dependent, then \(\rchi = 1\)
4. If \((X, Y)\) follows a bivariate normal distribution with correlation coefficient \(\rho\), then \(\rchi = |\rho|\)
5. Symmetry: \(\rchi(X,Y) = \rchi(Y,X)\)
6. Invariance under transformations: for measurable isomorphisms \(u\) and \(v\), \(\rchi(u(X),v(Y)) = \rchi(X,Y)\)
Properties 1–4 correspond to those of a generalized correlation measure (Property 2 is an equivalence, so \(\rchi = 0\) guarantees independence). Properties 5 and 6 are also desirable for a generalization of the correlation coefficient.
3 Estimation method
This chapter describes an estimation method for generalized correlation measures.
Generalized correlation measures are defined under a population distribution, so we need to estimate them from data in practice. gcor uses a nonparametric estimation method that does not impose strong assumptions on the functional form of the distribution.
In theory, generalized correlation measures can be defined for any pair of random variables. For estimation, we need to handle the data appropriately according to their types. gcor supports common data types used in statistical software, including numeric, categorical, and date/datetime values. It can also handle data with missing values.
3.1 Preliminaries
For a pair of random variables \((X, Y)\), we write observed data as \(\mathbf{x} = (x_1, x_2, \cdots, x_n)\) and \(\mathbf{y} = (y_1, y_2, \cdots, y_n)\), assuming an independently and identically distributed (i.i.d.) sample of size \(n\). This is a paired sample where the index corresponds to each observational unit. That is, the pair \((x_i, y_i)\) represents the data observed for the \(i\)-th unit.
3.2 Data types
The observed data \(\mathbf{x}\) and \(\mathbf{y}\) can take values of various types. Examples include the following data types:
- Integer
- Floating-point number
- String
- Boolean
- Datetime
From the perspective of estimation, we classify them as follows:
- Categorical: data with a finite number of possible values. Values are often represented as strings, booleans, or integers, but floating-point numbers or datetimes may also be treated as categorical as long as the number of possible values is finite. Missing values (NA) may be included.
- Numeric/ordered: non-categorical data for which a total order is defined. This includes integers, floating-point numbers, and datetimes. Positive and negative infinity (Inf/-Inf) may be included. Missing values (NA) may be included as an exception for which a total order is not defined.
- Other: data that do not fall into any of the above classes. That is, the number of possible values is not finite and a total order is not defined. Examples include complex numbers (except when the number of possible values is finite). Such data are out of scope for the estimation method described in this document.
Here, the notation for missing values and other special values follows the statistical computing environment R. Practical details are given below.
3.2.1 Handling Not-a-Number
In this document, we treat Not-a-Number values (NaN) as equivalent to missing values (NA). In fact, Python pandas often uses NaN (numpy.nan) as an internal representation of missing values. In R, NA and NaN have different internal representations, but they may be treated equivalently in processing; for example, is.na(NaN) returns TRUE.
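This equivalence can be checked quickly on the Python side; a minimal illustration using pandas:

```python
import numpy as np
import pandas as pd

# pandas reports the floating-point NaN as missing, mirroring R's is.na(NaN).
print(pd.isna(np.nan))                           # True
print(pd.Series([1.0, np.nan]).isna().tolist())  # [False, True]
```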
3.2.2 Classification procedure
In real data, the number of observed values is always finite, and a total order can be defined in some way. However, it is not appropriate to treat all data as categorical. In practice, we determine the classification using the following procedure:
1. If an explicit categorical data type is set, treat it as categorical. This corresponds to types such as factor in R and pandas.Categorical in Python.
2. If a data type representing values such as numeric values or datetimes is set, treat it as numeric/ordered.
3. For other data types such as strings, treat them as categorical if the number of distinct values is less than or equal to a threshold. The threshold can be set by the max_levels argument.
4. If none of the above applies, treat it as other.
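The procedure can be sketched in Python with pandas. This is an illustrative reimplementation, not gcor's actual code, and the default threshold value below is made up:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

def classify(s: pd.Series, max_levels: int = 30) -> str:
    """Classify a column per the procedure above (sketch; default threshold is made up)."""
    # 1. Explicit categorical dtype -> categorical.
    if isinstance(s.dtype, pd.CategoricalDtype):
        return "categorical"
    # 2. Numeric or datetime dtype -> numeric/ordered.
    if is_numeric_dtype(s) or is_datetime64_any_dtype(s):
        return "numeric/ordered"
    # 3. Other dtypes (e.g. strings): categorical if few distinct values.
    if s.nunique(dropna=True) <= max_levels:
        return "categorical"
    # 4. Otherwise: out of scope.
    return "other"
```

For example, `classify(pd.Series(["a", "b", "a"]))` yields `"categorical"`, while a float column yields `"numeric/ordered"`.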
3.3 Estimation by quantile binning
This section describes the estimation method used in gcor. The method first converts non-categorical data (e.g., numeric values) into categories using empirical quantiles. It then approximately estimates the measure using the transformed categorical data.
3.3.1 Preparation
Dropping records with missing values
By default, missing values (NA) are treated as a separate category. This allows us to incorporate patterns such as “when \(X\) is missing, \(Y\) tends to take larger values” into the assessment.
If this behavior is undesirable, records with missing values can be dropped in advance. In the R implementation, this is controlled by the dropNA argument, and in the Python implementation by the drop_na argument.
When records with missing values are dropped, the number of remaining records is regarded as the sample size \(n\).
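A generic NumPy sketch of this pairwise dropping (the dropNA/drop_na options mentioned above control whether it happens; the arrays here are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0])
y = np.array([np.nan, 1.0, 2.0, 3.0])

# Keep only records where both values are observed.
keep = ~np.isnan(x) & ~np.isnan(y)
x_kept, y_kept = x[keep], y[keep]
n = int(keep.sum())  # the remaining record count serves as the sample size n
print(n, x_kept, y_kept)
```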
Choosing the number of bins
For quantile binning, we need to choose an integer \(k \ge 2\) as the number of bins. A larger \(k\) allows us to capture more complex relationships, but it also requires a larger sample size to maintain estimation accuracy.
By default, the number of bins is selected automatically, as described later. It can also be specified manually using the k argument.
3.3.2 Algorithm
The estimation algorithm consists of three steps.
Step 1: Categorization
In this step, we convert the observed data into categorical data.
For the observed data \(\mathbf{x}\) and \(\mathbf{y}\), we denote the converted data by \(\mathbf{x'}\) and \(\mathbf{y'}\), respectively. Below, we describe the conversion using \(\mathbf{x}\) as an example. The same applies to \(\mathbf{y}\).
If the number of bins \(k\) is not specified, it is set according to the following 10-to-2 rule, under which each tenfold increase in the number of non-missing observations doubles \(k\) (up to flooring):
\[ k = \max \bigg\{2,\; \Bigl\lfloor \tfrac12 (n_\mathbf{x})^{\log_{10} 2} \Bigr\rfloor \bigg\} \]
Here \(n_\mathbf{x}\) denotes the number of \(x_i\) that are observed and not missing. Denoting missing values by \(\mathtt{NA}\), we can write it as follows:
\[ n_\mathbf{x} = \sum_{i=1}^n \mathbf{1} \{x_i \ne \mathtt{NA}\} \]
If records with missing values have been dropped in advance, or if the data contain no missing values, then \(n_\mathbf{x} = n\).
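The rule is a one-liner in code; the function name below is made up for illustration:

```python
import math

def default_bins(n_obs: int) -> int:
    """10-to-2 rule: k = max(2, floor(n_obs^(log10 2) / 2))."""
    return max(2, math.floor(0.5 * n_obs ** math.log10(2)))

# Each tenfold increase in the non-missing count roughly doubles k.
print([default_bins(n) for n in (100, 10_000, 1_000_000)])
```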
If either of the following holds, we simply let \(\mathbf{x'} = \mathbf{x}\):
- \(\mathbf{x}\) is categorical
- \(\mathbf{x}\) takes at most \(k\) distinct values
Otherwise, that is, if \(\mathbf{x}\) is numeric/ordered and takes more than \(k\) distinct values, we compute the empirical \(k\)-quantiles \(\hat{q}_X[m] \; (0 \le m \le k)\) based on the empirical distribution. We define the empirical distribution function of \(X\) as follows:
\[ \hat{F}_X(x) = \frac{1}{n_\mathbf{x}} \sum_{i:\,x_i \ne \mathtt{NA}} \mathbf{1} \{x_i \le x\} \]
For a given \(k\), we define the \(m\)-th quantile of \(X\) as follows:
\[ \hat{q}_X[m] := \min \bigg\{ x_i : x_i \ne \mathtt{NA},\, \hat{F}_X(x_i) \ge \frac{m}{k} \bigg\} \quad (0 \le m \le k) \]
Here, when \(m = 0\), \(\hat{q}_X[0] = \min \{x_i : x_i \ne \mathtt{NA}\}\), and when \(m = k\), \(\hat{q}_X[k] = \max \{x_i : x_i \ne \mathtt{NA}\}\).
For example, when \(k = 4\), the quantiles are as follows:
\[ \begin{aligned} \hat{q}_X[0] &: \text{minimum value} \\ \hat{q}_X[1] &: \text{first quartile} \\ \hat{q}_X[2] &: \text{second quartile (median)} \\ \hat{q}_X[3] &: \text{third quartile} \\ \hat{q}_X[4] &: \text{maximum value} \\ \end{aligned} \]
Here, the quartiles and the median are defined based on the empirical distribution.
Using these quantiles, we define the empirical \(k\)-quantile intervals \(\hat{Q}_X[m] \; (1 \le m \le k)\). We use a closed interval only when \(m = 1\); otherwise, we use a half-open interval:
\[ \hat{Q}_X[m] = \begin{cases} \Big[ \hat{q}_X[0], \; \hat{q}_X[1] \Big] & (m = 1) \\[1em] \Big( \hat{q}_X[m - 1], \; \hat{q}_X[m] \Big] & (2 \le m \le k) \\ \end{cases} \]
We then define the categorized data \(\mathbf{x'} = (x'_1, \cdots, x'_n)\) using these intervals as follows:
\[ x'_i = \begin{cases} m & (x_i \in \hat{Q}_X[m]) \\[1em] \mathtt{NA} & (x_i = \mathtt{NA}) \end{cases} \]
We obtain \(\mathbf{y'} = (y'_1, \cdots, y'_n)\) in the same way.
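Step 1 can be sketched as follows: an illustrative NumPy reimplementation for numeric data, using NaN as the missing value, and not gcor's actual code. It computes the empirical \(k\)-quantiles with integer arithmetic and assigns each observation to its interval, with the first interval closed and the rest half-open:

```python
import numpy as np

def categorize(x: np.ndarray, k: int) -> np.ndarray:
    """Map numeric data to bin labels 1..k using empirical k-quantiles.

    Missing values (NaN) are carried through unchanged, as in the text.
    """
    obs = np.sort(x[~np.isnan(x)])
    n_obs = obs.size
    # q[m] = min{x_i : F_hat(x_i) >= m/k}; on sorted data this is the element
    # at index ceil(m * n_obs / k) - 1 (integer arithmetic avoids rounding error).
    q = np.array([obs[max(0, (m * n_obs + k - 1) // k - 1)] for m in range(k + 1)])
    out = np.full(x.shape, np.nan)
    for i, xi in enumerate(x):
        if np.isnan(xi):
            continue
        # Q[1] = [q[0], q[1]] is closed; Q[m] = (q[m-1], q[m]] for m >= 2.
        out[i] = 1 if xi <= q[1] else int(np.searchsorted(q, xi, side="left"))
    return out
```

With \(\mathbf{x} = (1, \ldots, 8)\) and \(k = 4\), each quartile interval receives two observations.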
Step 2: Tabulation
In this step, we scan the pairs of categorized values \((x'_i, y'_i)\) and count the occurrences of each value combination.
Suppose that \(x'_i\) takes \(s\) distinct values and \(y'_i\) takes \(t\) distinct values. A missing value, if present, is also counted as a distinct value. Values that never appear in the data are excluded from consideration.
For notational convenience, we write \(x'_i \in \{1, \cdots, s\}\) and \(y'_i \in \{1, \cdots, t\}\). Then the required counts are given by:
\[ \begin{aligned} n_{ml} &= \sum_{i=1}^n \mathbf{1}\{x'_i = m , \, y'_i = l \}\\ n_{m \cdot} &= \sum_{i=1}^n \mathbf{1}\{x'_i = m \}\\ n_{\cdot l} &= \sum_{i=1}^n \mathbf{1}\{y'_i = l \} \end{aligned} \]
Here \(1 \le m \le s\) and \(1 \le l \le t\). These counts can be represented in the following two-way contingency table:
\[ \begin{array}{c|ccc|c} & & Y & & \Sigma \\ \hline & n_{11} & \cdots & n_{1t} & n_{1\cdot} \\ X & \vdots & & \vdots & \vdots \\ & n_{s1} & \cdots & n_{st} & n_{s\cdot} \\ \hline \Sigma & n_{\cdot 1} & \cdots & n_{\cdot t} & n \end{array} \]
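Step 2 amounts to building this contingency table. A minimal NumPy sketch, assuming the categorized values have already been relabeled to \(1, \cdots, s\) and \(1, \cdots, t\) (the data here are made up):

```python
import numpy as np

# Categorized data from Step 1, relabeled to 1..s and 1..t.
xp = np.array([1, 1, 1, 2, 2, 2])
yp = np.array([1, 1, 2, 2, 2, 1])

s, t = int(xp.max()), int(yp.max())
n_ml = np.zeros((s, t), dtype=int)  # joint counts n_{ml}
for m, l in zip(xp, yp):
    n_ml[m - 1, l - 1] += 1
n_m_dot = n_ml.sum(axis=1)  # row sums n_{m.}
n_dot_l = n_ml.sum(axis=0)  # column sums n_{.l}
print(n_ml)
```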
Step 3: Computation
In this step, we compute an estimate of the generalized correlation measure using the counts.
For the chi-squared informational correlation, we consider the following quantity based on the chi-squared mutual information of the categorized data:
\[ \psi' = I_{\chi^2}(X';Y') + 1 \]
We compute its estimate as follows:
\[ \hat\psi' = \sum_{m=1}^s \sum_{l=1}^t \frac{n^2_{ml}}{n_{m \cdot}n_{\cdot l}} \]
Using this, we compute the estimate of the chi-squared informational correlation \(\rchi\) as follows:
\[ \hat{r}_{\chi^2} = \begin{cases} \displaystyle \sqrt{\frac{1-1/\hat\psi'}{\sqrt{1-1/s}\sqrt{1-1/t}}} & (s > 1) \land (t > 1) \\[1em] 0 & (s = 1 < t) \lor (t = 1 < s) \\[1em] 1 & (s = t = 1) \end{cases} \]
When \(s = t = 1\), both \(\mathbf{x'}\) and \(\mathbf{y'}\) take only a single value and are therefore constant over the observed data. Theoretically, \(r_{\chi^2}\) is undefined for a pair of constant variables, but here we set \(\hat{r}_{\chi^2} = 1\) for convenience. This may be handled differently in the future, for example by returning a missing value.
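Step 3 can then be computed directly from the table of counts. The sketch below follows the case analysis above, including the \(s = t = 1\) convention; it is illustrative, not gcor's code:

```python
import numpy as np

def r_chi2_hat(n_ml: np.ndarray) -> float:
    """Estimate of the chi-squared informational correlation from an s x t count table."""
    s, t = n_ml.shape
    if s == 1 and t == 1:
        return 1.0  # constant-vs-constant convention described in the text
    if s == 1 or t == 1:
        return 0.0
    n_m = n_ml.sum(axis=1, keepdims=True)  # n_{m.}
    n_l = n_ml.sum(axis=0, keepdims=True)  # n_{.l}
    psi = (n_ml.astype(float) ** 2 / (n_m * n_l)).sum()  # psi-hat
    return float(np.sqrt((1 - 1 / psi) / (np.sqrt(1 - 1 / s) * np.sqrt(1 - 1 / t))))
```

For example, the 2-by-2 table with counts \(((2, 1), (1, 2))\) gives \(\hat\psi' = 10/9\) and \(\hat{r}_{\chi^2} = \sqrt{0.2} \approx 0.447\).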