## INTRODUCTION

How do major climate modes such as the El Niño Southern Oscillation (ENSO) influence remote regions via global teleconnections? How are physiological processes in the human body coupled? Through which pathways do different brain regions interact? Identifying causal association networks of multiple variables and quantifying causal strength are key challenges in the analysis of complex dynamical systems, especially since real interventional experiments, the gold standard of scientific discovery, are often unethical or practically impossible. In climate research, model simulations can help to test causal mechanisms, but these are very expensive, time consuming, and represent only an approximation of the real-world physical processes (*1*). We here introduce an approach that learns causal association networks directly from time series data. Such data-driven approaches have become increasingly attractive as recent decades have seen an explosion in data availability from simulations and real-world observations, for example, in Earth sciences (*2*). We therefore identify an urgent need for the development of novel causal discovery methods that can take advantage of this recent surge of big data, which, as we show here, has the potential to facilitate progress in many areas of science.

In a typical observational analysis scenario, for example, in climate science, a researcher has a hypothesis on the causal influence between two processes given observed time series data. The data may consist of different climatological variables (e.g., temperature and pressure) at one location, or of time series that represent regional averages of climatological variables, for example, commonly defined climate indices. For example, she may be interested in the influence of the regional ENSO index on an index characterizing the temperature variability over certain land areas of North America. Suppose the time series show a clear correlation, suggesting a relationship between the two processes. To exclude other possible hypotheses that may explain such a correlation, she will then include other relevant variables. In highly interconnected systems, there are typically many possible drivers she could test, quickly leading to high-dimensional causal discovery problems.

The goal in time series causal discovery from complex dynamical systems is to statistically reliably estimate causal links, including their time lags. Climatic teleconnections, for example, can take days to months. Two key challenges are the typically high dimensionality of these causal discovery problems and the often strong interdependencies. For instance, in a system comprising dozens to hundreds of variables (e.g., different regional climate indices), correlations will arise not only because of direct causal effects but also because of autocorrelation effects within each time series, indirect links, or common drivers (Fig. 1). Ideally, a causal discovery method detects as many true causal relationships as possible (high detection power) and controls the number of false positives (incorrect link detections). Causal discovery can help to better understand physical mechanisms, to build more parsimonious prediction models, and to more reliably estimate the strength of causal effects, which can be done in different frameworks, for example, the potential outcome (*3*) or graphical model frameworks (*4*, *5*). Put simply, causal discovery will be useful in situations where researchers wish to study complex dynamical systems in a way that goes beyond simple correlation analyses. Of course, any causal interpretation will rest on a number of assumptions (*4*, *5*) as we further discuss below.

A major current approach not only in Earth data analysis (*6*–*9*) but also in neuroscience (*10*, *11*) is to estimate time-lagged causal associations using autoregressive models in the framework of Granger causality (*12*, *13*). If implemented using standard regression techniques, then the high dimensionality of typical datasets leads to very low detection power (the “curse of dimensionality”) since sample sizes are often only on the order of a few hundred (e.g., for a monthly time resolution with 30 years of satellite data). This shortcoming leads to a dilemma that has limited applications of Granger causality mostly to bivariate analyses that cannot, however, account for indirect links or common drivers. Complementary to linear Granger causality, state-space methods (*14*, *15*) better address nonlinear state-dependent couplings, but these are also difficult to extend to high-dimensional scenarios.
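To make the dimensionality problem concrete, a lag-specific Granger-type analysis amounts to regressing the target on every variable at every lag and inspecting individual coefficients, so a model with *N* variables and τ_max lags already has *N*·τ_max coefficients. A minimal numpy sketch (helper names are ours, not from any package):

```python
import numpy as np

def lagged_design(data, target, tau_max):
    """Stack all N variables at lags 1..tau_max into one design matrix."""
    T, N = data.shape
    X = np.column_stack([data[tau_max - tau: T - tau, j]
                         for tau in range(1, tau_max + 1)
                         for j in range(N)])
    y = data[tau_max:, target]
    return X, y

def granger_coeffs(data, target, tau_max):
    """Least-squares coefficients of the full autoregressive model;
    entry [tau-1, j] belongs to variable j at lag tau."""
    X, y = lagged_design(data, target, tau_max)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta.reshape(tau_max, data.shape[1])
```

With only a few hundred samples, the estimation variance of these *N*·τ_max coefficients is precisely what erodes detection power.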

There are methods that can cope with high dimensionality, such as regularized regression techniques (*16*–*18*), but mainly in the context of prediction and not causal discovery where assessing the significance of causal links is more important. An exception is Lasso regression (*17*), which also allows discovering active variables. Another approach with some recent applications in geosciences (*19*–*24*) is algorithms aimed specifically at causal discovery (*4*, *5*, *25*), which use iterative independence and conditional independence testing. However, both regularized regression (*26*) and recent implementations of causal discovery algorithms do not deal well with the strong interdependencies due to the spatiotemporal nature of the variables, as we show here. In particular, controlling false positives at a desired level is difficult for these methods and becomes even more challenging for nonlinear estimators. In summary, these problems lead to brittle estimates of causal networks and causal effects, and a more reliable methodology is required. In (*2*), the authors present an overview of the state of the art in causal inference methods and discuss related challenges with a focus on Earth sciences.

We present a causal network discovery method based on the graphical causal model framework (*5*) that scales well with large time series datasets featuring linear and nonlinear, time-delayed dependencies. Through analytical results, real-world applications, and extensive numerical experiments, we demonstrate that the proposed method has substantial advantages over the current state of the art in dealing with interdependent time series datasets on the order of dozens to hundreds of variables for sample sizes of a few hundred or more, yielding reliable false-positive control and higher detection power. We also find that more reliable causal network estimates yield more precise estimates of causal effects, bridging causal discovery with causal effect inference frameworks such as the potential outcome framework. Our approach enables causal analyses among more variables, opening up new opportunities to more credibly estimate causal networks and causal effects from time series in Earth system science, physiology, neuroscience, and other fields.

## CAUSAL DISCOVERY

### Motivating example from climate science

In the following, we illustrate the causal discovery problem on a well-known long-range teleconnection. We highlight two main factors that lead the common autoregressive Granger causal modeling approach to have low detection power: reduced effect size due to conditioning on irrelevant variables and high dimensionality.

Given a finite time series sample, every causal discovery method has to balance the trade-off between too many false positives (incorrect link detections) and too few true positives (correct link detections). A causality method ideally controls false positives at a predefined significance level (e.g., 5%) and maximizes detection power. The power of a method to detect a causal link depends on the available sample size, the significance level, the dimensionality of the problem (e.g., the number of coefficients in an autoregressive model), and the effect size, which, here, is the magnitude of the effect as measured by the test statistic (e.g., the partial correlation coefficient). Since the sample size and the significance level are usually fixed in the present context, a method’s power can only be improved by reducing the dimensionality or increasing the effect size (or both).
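For a correlation-type test statistic, the interplay of sample size, significance level, and effect size can be made concrete with the Fisher z approximation. The following helper (our own illustration, not from the paper) estimates the smallest correlation detectable at a given level and power:

```python
import numpy as np
from scipy import stats

def min_detectable_corr(n, alpha=0.05, power=0.8):
    """Approximate smallest (absolute) correlation detectable with a
    two-sided test at level alpha and the given power (Fisher z)."""
    z_needed = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    return np.tanh(z_needed / np.sqrt(n - 3))
```

For *n* = 468 monthly samples this gives roughly 0.13, so effects much below that level are likely to be missed unless dimensionality is reduced or effect size increased.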

Consider a typical causal discovery scenario in climate research (Fig. 2). We wish to test whether the observational data support the hypothesis that tropical Pacific surface temperatures, as represented by the monthly Nino 3.4 index (further referred to as Nino; see map and region in fig. S2) (*27*), causally affected extratropical land air temperatures (*28*) over British Columbia (BCT) for 1979–2017 (*T* = 468 months). We chose this example since it is well established and physically understood that atmospheric wave trains induced by increased sea surface temperatures over the tropical Pacific can affect North American temperatures but not the other way around (*9*, *29*–*31*). Thus, the ground truth here is Nino → BCT on the (intra)seasonal time scale, allowing us to validate causality methods.

We start with a time-lagged correlation analysis and find that the two variables are correlated in both directions, that is, for both positive and negative lags (Fig. 2A and see fig. S2 for lag functions), suggesting also an influence from BCT on Nino. The correlation Nino → BCT has an effect size of ≈ 0.3 (*P* < 10^{−4}) at a lag of 2 months. In the networks in Fig. 2, the link colors denote effect sizes (gray links are spurious), and the node colors denote the autocorrelation strength.

Lagged correlation cannot be used to infer causal directionality and not even the correct time lag of a coupling (*20*). Hence, we now move to causal methods. To test Nino → BCT, the most straightforward approach is to fit a linear autoregressive model of BCT on past lags of itself and Nino and test whether and which past coefficients of Nino are significantly different from zero. This is equivalent to a lag-specific version of Granger causality, but one can phrase this problem also more generally as testing for conditional independence between Nino_{t−τ} and BCT_{t} conditional on (or controlling for) the common past ${\mathbf{X}}_{t}^{-}=({\mathbf{X}}_{t-1},{\mathbf{X}}_{t-2},\dots )$, denoted ${\text{Nino}}_{t-\tau} ⫫ {\text{BCT}}_{t}\mid {\mathbf{X}}_{t}^{-}\setminus \{{\text{Nino}}_{t-\tau}\}$. For estimating conditional independencies (see below), the time index *t* runs through the samples up to the time series length *T*. ${\mathbf{X}}_{t}^{-}$ is, in practice, truncated at a maximum time lag τ_{max}, which depends on the application and can be chosen according to the maximum causal time lag expected in the complex system or based on the largest lag with significant correlation. We call this general approach full conditional independence testing (FullCI; see table S3 for an overview of methods considered in this paper) and illustrate it in a linear partial correlation implementation for this example, that is, we test $\rho ({\text{Nino}}_{t-\tau},{\text{BCT}}_{t}\mid {\mathbf{X}}_{t}^{-}\setminus \{{\text{Nino}}_{t-\tau}\})$ for different lags τ, which is the effect size for FullCI.
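In a linear implementation, such a partial correlation can be computed by regressing the conditioning set out of both variables and correlating the residuals. A minimal sketch (helper name is ours):

```python
import numpy as np

def partial_corr(x, y, Z):
    """Partial correlation of x and y given the columns of Z,
    computed by correlating the two regression residuals."""
    Z1 = np.column_stack([Z, np.ones(len(x))])  # include an intercept
    rx = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]
    ry = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]
```

Conditioning on a common driver in `Z` removes the spurious association that plain correlation would report.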

Using a maximum time lag τ_{max} = 6 months, we find a significant FullCI partial correlation for Nino → BCT at lag 2 of 0.1 (*P* = 0.037) (Fig. 2A) and no significant dependency in the other direction. That is, the effect size of FullCI is strongly reduced compared to the correlation (≈ 0.3) when taking into account the past. However, as mentioned before, such a bivariate analysis can usually not be interpreted causally, because other processes might explain the relationship. To further test our hypothesis, we include another variable *Z* that may explain the dependency between Nino and BCT (Fig. 2B). Here, we generate *Z* artificially for illustration purposes and define it as a linear function of Nino_{t−1} plus independent standard normal noise ${\eta}_{t}^{Z}$. Thus, Nino drives *Z* with lag 1, but *Z* has no causal effect on BCT, which we assume a priori unknown. Here, we simulated different realizations of *Z* to measure detection power and false-positive rates. We find that the correlation would be even more misleading for a causal interpretation since we observe spurious links between all variables (Fig. 2B). The FullCI partial correlation, with ${\mathbf{X}}_{t}^{-}$ now including the past of all three processes and not just BCT and Nino, has an effect size of 0.09 for Nino_{t−2} → BCT_{t} compared to 0.1 in the bivariate case. At a 5% significance level, this link is only detected in 53% of the realizations (true-positive rate, arrow width in Fig. 2). Since *Z* carries information about the past of Nino, conditioning on it "explains away" some part of the partial correlation $\rho ({\text{Nino}}_{t-2},{\text{BCT}}_{t}\mid {\mathbf{X}}_{t}^{-}\setminus \{{\text{Nino}}_{t-2}\})$: an effect size just 0.01 smaller already strongly reduces the detection rate.

Suppose we got one of the realizations of *Z* for which the link Nino_{t−2} → BCT_{t} is still significant. To further illustrate the effect of high dimensionality on detection power, we now include six more variables *W*^{i} (*i* = 1, …, 6), which are all independent of Nino, BCT, and *Z* but coupled between each other in the following way (Fig. 2C): ${W}_{t}^{i}=a\,{W}_{t-1}^{i}+c\,{W}_{t-2}^{i-1}+{\eta}_{t}^{i}$ for *i* = 2, 4, 6 and ${W}_{t}^{i}=a\,{W}_{t-1}^{i}+{\eta}_{t}^{i}$ for *i* = 1, 3, 5, all with the same coupling coefficient *c* = 0.15 and pair autocorrelation coefficients *a*^{1,2} = 0.1, *a*^{3,4} = 0.5, and *a*^{5,6} = 0.9, where *a*^{i−1,i} denotes the autocorrelation *a* shared by the pair (*W*^{i−1}, *W*^{i}). Now, the FullCI effect size for Nino_{t−2} → BCT_{t} is still 0.09, but the detection power is even lower than before and decreases from 53% to only 40% because of the higher dimensionality. Thus, the true causal link Nino_{t−2} → BCT_{t} is likely to be overlooked.

Effect size is also affected by autocorrelation effects of the included variables: The coupled variable pairs ${W}_{t-2}^{i-1}\to {W}_{t}^{i}$ (*i* = 2, 4, 6) differ in their autocorrelation (as visualized by their node color in Fig. 2C) and, although the coupling coefficient *c* is the same for each pair, their FullCI partial correlations are 0.15, 0.13, and 0.11 (from lower to higher autocorrelation). Similar to the above case, conditioning on past lags of the driver variables *W*^{i} (*i* = 1, 3, 5) explains away information, leading to a smaller effect size and lower power the higher their autocorrelation is. Conversely, we here observe more spurious correlations for higher autocorrelations (Fig. 2C, left).
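This toy model can be simulated directly. The sketch below follows the pair structure and coefficients stated in the text (the unit-variance noise scale and the function name are our assumptions):

```python
import numpy as np

def simulate_w_pairs(T=20000, c=0.15, autocorrs=(0.1, 0.5, 0.9), seed=1):
    """Three driver-receiver pairs: each driver W^{i-1} is AR(1) and each
    receiver W^i additionally receives W^{i-1}_{t-2} with coupling c."""
    rng = np.random.default_rng(seed)
    W = np.zeros((T, 6))
    eta = rng.standard_normal((T, 6))
    for t in range(2, T):
        for k, a in enumerate(autocorrs):
            d, r = 2 * k, 2 * k + 1          # driver, receiver columns
            W[t, d] = a * W[t - 1, d] + eta[t, d]
            W[t, r] = a * W[t - 1, r] + c * W[t - 2, d] + eta[t, r]
    return W
```

Even though *c* is identical for all three pairs, the conditional effect sizes shrink with increasing autocorrelation, as described above.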

### Causal network discovery with PCMCI

Consider an underlying time-dependent system ${\mathbf{X}}_{t}=({X}_{t}^{1},\dots ,{X}_{t}^{N})$ with

$${X}_{t}^{j}={f}_{j}(\mathcal{P}({X}_{t}^{j}),{\eta}_{t}^{j})\qquad(1)$$

where *f*_{j} is some potentially nonlinear functional dependency and ${\eta}_{t}^{j}$ represents mutually independent dynamical noise. The nodes in a time series graph represent the variables ${X}_{t}^{j}$ at different lag times, and $\mathcal{P}({X}_{t}^{j})\subset {\mathbf{X}}_{t}^{-}=({\mathbf{X}}_{t-1},{\mathbf{X}}_{t-2},\dots )$ denotes the causal parents of variable ${X}_{t}^{j}$ (Fig. 3B, nodes with black arrows) among the past of all *N* variables. A causal link ${X}_{t-\tau}^{i}\to {X}_{t}^{j}$ exists if ${X}_{t-\tau}^{i}\in \mathcal{P}({X}_{t}^{j})$. Another way to define links is that ${X}_{t-\tau}^{i}$ is not conditionally independent of ${X}_{t}^{j}$ given the past of all variables, defined by ${X}_{t-\tau}^{i}\,\overline{⫫}\,{X}_{t}^{j}\mid {\mathbf{X}}_{t}^{-}\setminus \{{X}_{t-\tau}^{i}\}$, with $\overline{⫫}$ denoting the absence of a (conditional) independence (*34*). The goal in causal discovery is then to estimate the causal parents from time series data. FullCI directly tests the link-defining conditional independence, but recall that, as in Fig. 3A, the high dimensionality of including *N*τ_{max} − 1 conditions on the one hand, and the reduced effect size due to conditioning on irrelevant variables such as ${X}_{t-1}^{2}$ (similar to the example in Fig. 2) on the other, lead to a potentially drastically reduced detection power of FullCI.

Causal discovery theory (*4*, *5*) tells us that the parents $\mathcal{P}({X}_{t}^{j})$ of a variable ${X}_{t}^{j}$ are a sufficient conditioning set that allows establishing conditional independence [causal Markov property (*5*)]. Thus, in contrast to conditioning on the whole past of all processes as in FullCI, conditioning only on a set that at least includes the parents of a variable suffices to identify spurious links. Markov discovery algorithms (*5*, *35*) such as the PC algorithm (named after its inventors) (*25*) allow us to detect these parents and can be flexibly implemented with different kinds of conditional independence tests that can accommodate nonlinear functional dependencies and variables that are discrete or continuous. These properties allow for greater flexibility than attempting to directly fit the possibly very complex functional dependencies in Eq. 1. However, as shown in our numerical experiments, the PC algorithm should not be directly used for the time series case.

Our method, PCMCI, proceeds in two stages: (i) the PC_{1} condition selection stage (Fig. 3B and algorithm S1), which estimates relevant conditions $\hat{\mathcal{P}}({X}_{t}^{j})$ for all time series variables ${X}_{t}^{j}\in \{{X}_{t}^{1},\dots ,{X}_{t}^{N}\}$, and (ii) the momentary conditional independence (MCI) test (Fig. 3C and algorithm S2) to test whether ${X}_{t-\tau}^{i}\to {X}_{t}^{j}$ with

$$\text{MCI}:\;{X}_{t-\tau}^{i}\,\overline{⫫}\,{X}_{t}^{j}\mid \hat{\mathcal{P}}({X}_{t}^{j})\setminus \{{X}_{t-\tau}^{i}\},\;\hat{\mathcal{P}}({X}_{t-\tau}^{i})\qquad(2)$$

Thus, MCI conditions on both the parents of ${X}_{t}^{j}$ and the time-shifted parents of ${X}_{t-\tau}^{i}$. The two stages (i) and (ii) serve the following purposes: PC_{1} is a Markov set discovery algorithm based on the PC-stable algorithm (*36*) that removes irrelevant conditions for each of the *N* variables by iterative independence testing (illustrated by shades of red and blue in Fig. 3B). A liberal significance level α_{PC} in these tests lets PC_{1} adaptively converge to typically only few relevant conditions (dark red/blue) that include the causal parents $\mathcal{P}$ in Eq. 1 with high probability but might also include some false positives (marked with a star in Fig. 3B). The MCI test (Fig. 3C) then addresses false-positive control for the highly interdependent time series case.

More precisely, in the PC_{1} algorithm, we start for every variable ${X}_{t}^{j}$ by initializing the preliminary parents $\hat{\mathcal{P}}({X}_{t}^{j})=({\mathbf{X}}_{t-1},{\mathbf{X}}_{t-2},\dots ,{\mathbf{X}}_{t-{\tau}_{\text{max}}})$. In the first iteration (*p* = 0), we conduct unconditional independence tests and remove ${X}_{t-\tau}^{i}$ from $\hat{\mathcal{P}}({X}_{t}^{j})$ if the null hypothesis ${X}_{t-\tau}^{i} ⫫ {X}_{t}^{j}$ cannot be rejected at a significance level α_{PC}. In Fig. 3B, this would likely be the case for the lagged variables ${X}_{t-\tau}^{4}$ (light shades of red). In each next iteration (*p* → *p* + 1), we first sort the preliminary parents by their (absolute) test statistic value and then conduct conditional independence tests ${X}_{t-\tau}^{i} ⫫ {X}_{t}^{j}\mid \mathcal{S}$, where $\mathcal{S}$ are the strongest *p* parents in $\hat{\mathcal{P}}({X}_{t}^{j})$. After each iteration, independent parents are removed from $\hat{\mathcal{P}}({X}_{t}^{j})$, and the algorithm converges if no more conditions can be tested (see details in Materials and Methods). In Fig. 3B, for ${X}_{t}^{3}$ (blue shades), the algorithm converges already after *p* = 1-dimensional conditions have been tested. Since these tests are all very low dimensional compared to FullCI (or Granger causality), they have higher detection power.
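The PC_{1} stage can be sketched in a few lines of numpy/scipy. This is a simplified illustration with our own helper names, not the paper's reference implementation (the authors provide one in the tigramite Python package):

```python
import numpy as np
from scipy import stats

def parcorr_pval(x, y, Z):
    """Partial-correlation test of x vs. y given the columns of Z."""
    Z1 = np.column_stack([Z, np.ones(len(x))]) if Z.size else np.ones((len(x), 1))
    rx = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]
    ry = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    dof = len(x) - Z1.shape[1] - 2
    tval = r * np.sqrt(dof / (1.0 - r ** 2))
    return abs(r), 2 * stats.t.sf(abs(tval), dof)

def pc1(data, j, tau_max, alpha=0.2):
    """Simplified PC_1: iteratively drop lagged variables that are
    independent of X^j_t given the p strongest remaining parents."""
    T, N = data.shape
    y = data[tau_max:, j]
    lag = lambda i, tau: data[tau_max - tau: T - tau, i]
    parents = [(i, tau) for tau in range(1, tau_max + 1) for i in range(N)]
    strength = dict.fromkeys(parents, np.inf)  # untested parents sort first
    p = 0
    while p <= len(parents) - 1:
        parents.sort(key=lambda q: -strength[q])
        for q in list(parents):
            conds = [c for c in parents if c != q][:p]
            Z = (np.column_stack([lag(*c) for c in conds])
                 if conds else np.empty((T - tau_max, 0)))
            r, pval = parcorr_pval(lag(*q), y, Z)
            if pval > alpha:
                parents.remove(q)
            else:
                strength[q] = r
        p += 1
    return parents
```

Each test conditions on at most *p* variables, which keeps the estimation dimension far below the *N*τ_{max} conditions of FullCI.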

In the second stage, the MCI test (Fig. 3C) uses the estimated conditions as follows. For testing ${X}_{t-2}^{1}\to {X}_{t}^{3}$, the conditions $\hat{\mathcal{P}}({X}_{t}^{3})$ (blue boxes in Fig. 3C) are sufficient to establish conditional independence (Markov property), that is, to identify indirect and common cause links. The additional condition on the lagged parents $\hat{\mathcal{P}}({X}_{t-2}^{1})$ (red boxes) accounts for autocorrelation, leading to correctly controlled false-positive rates at the expected level, as further discussed below in our theoretical results. The significance of each link can be assessed based on the *P* values of the MCI test. These can, subsequently, also be adjusted according to procedures such as false discovery rate control (*37*). The main free parameter of PCMCI is the significance level α_{PC} in PC_{1}, which should be regarded as a hyperparameter and can be chosen on the basis of model selection criteria such as the Akaike information criterion (AIC) or cross-validation. Further technical details can be found in Materials and Methods.
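Given condition sets from a PC_{1}-type stage, the MCI test of Eq. 2 is again a partial correlation, now on the union of both parent sets. A self-contained sketch (our own helper, assuming τ_max is large enough to cover the time-shifted lags):

```python
import numpy as np
from scipy import stats

def mci_parcorr(data, i, tau, j, parents_j, parents_i, tau_max):
    """MCI partial correlation for X^i_{t-tau} -> X^j_t: condition on the
    parents of X^j_t (minus the tested link) plus the parents of
    X^i_{t-tau}, time-shifted by tau."""
    T = data.shape[0]
    lag = lambda k, l: data[tau_max - l: T - l, k]
    conds = [c for c in parents_j if c != (i, tau)]
    conds += [(k, l + tau) for (k, l) in parents_i]   # time-shifted parents
    Z = np.column_stack([lag(*c) for c in conds] + [np.ones(T - tau_max)])
    x, y = lag(i, tau), lag(j, 0)
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    dof = len(x) - Z.shape[1] - 1
    tval = r * np.sqrt(dof / (1.0 - r ** 2))
    return r, 2 * stats.t.sf(abs(tval), dof)
```

The extra conditioning on the lagged parents is what distinguishes MCI from a plain Markov-blanket test and equalizes effect sizes across differently autocorrelated variables.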

### Assumptions of causal discovery from observational data

Like all causal discovery from purely observational data, our method rests on a set of standard assumptions: Causal Sufficiency, which requires that all common drivers of the observed variables are among the observed variables; the Causal Markov condition, which states that ${X}_{t}^{j}$ is independent of ${\mathbf{X}}_{t}^{-}\setminus \mathcal{P}({X}_{t}^{j})$ given its parents $\mathcal{P}({X}_{t}^{j})$; and the Faithfulness assumption, which requires that all observed conditional independencies arise from the causal graphical structure. For the present time series case, we assume no contemporaneous causal effects and, since typically only a single realization is available, we also assume stationarity. Another option would be to use independent ensembles of realizations of lagged processes. We elaborate on these assumptions in Discussion. See (*2*, *34*) for an overview of causal discovery on time series.

## RESULTS

### Theoretical properties of PCMCI

A distinguishing feature of the MCI test is its additional conditioning on the estimated parents $\hat{\mathcal{P}}({X}_{t-\tau}^{i})$ of the lagged variable. Theoretical results for finite samples would require strong assumptions (*45*, *46*) or are mostly impossible, especially for nonlinear associations. Because of the condition selection stage, MCI typically has a much lower conditioning dimensionality than FullCI. Further, avoiding conditioning on irrelevant variables can be shown to always yield a greater (or equal) effect size than FullCI. Irrelevant variables are not explanatory for causal relationships, and they may also lead to smaller effect sizes if they are caused by the considered driver variable. Both of these factors lead to typically much higher detection power than FullCI (or Granger causality) for small and large numbers of variables, as further discussed in section S5.4. Last, while detecting the causal network structure is the main goal of PCMCI, the MCI test statistic also yields a well-interpretable notion of a normalized causal strength, as further discussed in section S5.5 and (*38*, *39*). Thus, the value of the MCI statistic (e.g., partial correlation or CMI) allows us to rank causal links in large-scale studies in a meaningful way.

### Real-world applications

Returning to the example of Fig. 2, the MCI effect sizes are nearly identical for the coupled variable pairs ${W}_{t-2}^{i-1}\to {W}_{t}^{i}$ (*i* = 2, 4, 6), resulting in similar detection power irrespective of different autocorrelations in the different *W*^{i} time series.

In Fig. 4A, we show that PCMCI can reconstruct the Walker circulation (*47*) in the tropical Pacific including the link to the Atlantic, where the underlying physical mechanism is theoretically well understood and has been validated with detailed physical simulation experiments (*48*): Warm surface air temperature anomalies in the East Pacific (EPAC) are carried westward by trade winds across the Central Pacific (CPAC). Then, the moist air rises over the West Pacific (WPAC), and the circulation is closed by the cool and dry air sinking eastward across the entire tropical Pacific. Furthermore, the CPAC region links temperature anomalies to the tropical Atlantic (ATL) via an atmospheric bridge (*49*). Pure lagged correlation analysis results in a completely connected graph with significant correlations at almost all time lags (see lag functions in fig. S3), while PCMCI with the linear ParCorr conditional independence test better identifies the Walker circulation and Atlantic teleconnection. In particular, the link from EPAC to WPAC is correctly identified as indirectly mediated through CPAC.

### Estimating causal effects

In Pearl's causal effect framework (*4*, *5*), the causal effect of a link ${X}_{t-\tau}^{i}\to {X}_{t}^{j}$ is based on the interventional distribution $P({X}_{t}^{j}\mid \text{do}({X}_{t-\tau}^{i}=x))$, which is the probability distribution of ${X}_{t}^{j}$ at time *t* if ${X}_{t-\tau}^{i}$ was forced exogenously to have a value *x*. Causal effects can also be studied using the framework of potential outcomes (*3*), which mainly targets the social sciences and medicine. In this framework, causal effects can be defined as the difference between two potential outcomes, one where a subject *u* has received a treatment, denoted *Y*_{u}(*X* = 1), and one where no treatment was given, denoted *Y*_{u}(*X* = 0). In observational causal inference, *Y*_{u}(*X* = 1) and *Y*_{u}(*X* = 0) are never measured simultaneously (one subject cannot be treated and untreated at the same time), requiring an assumption called strong ignorability to identify causal effects. In our case, one could write ${X}_{t}^{j}$ as ${X}_{t}^{j}({\mathbf{X}}_{t}^{-})$, where time *t* replaces unit *u* and where we are interested in testing whether or not an entry ${X}_{t-\tau}^{i}$ in ${\mathbf{X}}_{t}^{-}$ appears in the treatment function determining the causal influence from the past.

Both in Pearl's causal effect framework and in the potential outcome framework (*3*), one typically assumes that the causal interdependency structure is known qualitatively (existence and absence of links), and the interest lies more in quantifying causal effects. In the case of linear continuous variables, the average causal effect, and equivalently the potential outcome, for an intervention $\text{do}({X}_{t-\tau}^{i}=x)$ can be estimated from observational data with linear regression using a suitable adjustment set of regressors. In the following, we assume Causal Sufficiency and compare three approaches to estimate the linear causal effect when the true causal interdependency structure is unknown. CE-Corr simply is a univariate linear regression of ${X}_{t}^{j}$ on ${X}_{t-\tau}^{i}$; CE-Full is a multivariate regression of ${X}_{t}^{j}$ on the whole past of the multivariate process ${\mathbf{X}}_{t}^{-}$ up to a maximum time lag τ_{max} = 5; and, finally, CE-PCMCI is a multivariate regression of ${X}_{t}^{j}$ on just the parents $\hat{\mathcal{P}}({X}_{t}^{j})$ obtained with PCMCI.

In Fig. 6, we investigate these approaches numerically. Different from the model setup before, we now fix a network size of *N* = 20 time series variables (*T* = 150) and consider models with different link coefficients *c* (*x* axis in Fig. 6). The bottom panels show the distribution of causal effects. The absolute value of the true causal effect is ∣*c*∣ (black lines). The top panels show the distribution of true-positive rates (across all links in the model) for an *F*-test under the null hypothesis that the effect is zero at a significance level of 5%. The rightmost panels show the results for a regression on the true parents (CE-True).

The full past ${\mathbf{X}}_{t}^{-}$ used by CE-Full includes the true parents as a sufficient adjustment set. However, the high dimensionality of this adjustment set leads to a large estimation variance, which, in particular, implies that causal effects are less reliably estimated, as evidenced by the low true-positive rates in the top panel. Last, CE-PCMCI better estimates causal effects and even comes close to the detection rate for CE-True based on the true parents. While the parents are a sufficient adjustment set to estimate causal effects, other adjustment sets may yield even better estimates, but in any case, knowledge of the dependency structure as estimated with PCMCI is beneficial (*54*).
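The adjustment logic can be illustrated on an artificial confounded model (coefficients and helper names are ours): regressing on the link variable alone overestimates the effect, while adding the confounding parent to the regression recovers the true coefficient.

```python
import numpy as np

def causal_effect(data, i, tau, j, adjust, tau_max):
    """Linear effect of X^i_{t-tau} on X^j_t: the coefficient of the link
    variable in a regression that also includes the adjustment set."""
    T = data.shape[0]
    lag = lambda k, l: data[tau_max - l: T - l, k]
    X = np.column_stack([lag(i, tau)] + [lag(*c) for c in adjust]
                        + [np.ones(T - tau_max)])
    beta, *_ = np.linalg.lstsq(X, lag(j, 0), rcond=None)
    return beta[0]
```

With an empty adjustment set this reduces to CE-Corr; passing the (estimated) parents corresponds to the CE-PCMCI approach above.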

## MATERIALS AND METHODS

### Detailed description of PCMCI

Our causal discovery technique to estimate the time series graph $\hat{\mathcal{G}}$ is based on a two-stage procedure:

1. Condition selection via PC_{1}: Obtain an estimate $\hat{\mathcal{P}}({X}_{t}^{j})$ of (a superset of) the parents $\mathcal{P}({X}_{t}^{j})$ for all variables ${X}_{t}^{j}\in {\mathbf{X}}_{t}=({X}_{t}^{1},{X}_{t}^{2},\dots ,{X}_{t}^{N})$ with algorithm S1.

2. Use these parents as conditions in the MCI causal discovery stage (algorithm S2), which tests all variable pairs $({X}_{t-\tau}^{i},{X}_{t}^{j})$ with *i*, *j* ∈ {1, …, *N*} and time delays τ ∈ {1, …, τ_{max}} and establishes a link ${X}_{t-\tau}^{i}\to {X}_{t}^{j}$ if and only if

$$\text{MCI}:\;{X}_{t-\tau}^{i}\,\overline{⫫}\,{X}_{t}^{j}\mid \hat{\mathcal{P}}({X}_{t}^{j})\setminus \{{X}_{t-\tau}^{i}\},\;{\hat{\mathcal{P}}}_{{p}_{X}}({X}_{t-\tau}^{i})\qquad(3)$$

where ${\hat{\mathcal{P}}}_{{p}_{X}}({X}_{t-\tau}^{i})\subseteq \hat{\mathcal{P}}({X}_{t-\tau}^{i})$ denotes the *p*_{X} strongest parents according to the sorting in algorithm S1. This parameter is an optional choice. One can also restrict the maximum number of parents used for $\hat{\mathcal{P}}({X}_{t}^{j})$, but here, we impose no restrictions. For τ = 0, one can also consider undirected contemporaneous links (*39*).

In the PC_{1} algorithm (algorithm S1), for every variable ${X}_{t}^{j}$, first the preliminary parents $\hat{\mathcal{P}}({X}_{t}^{j})=({\mathbf{X}}_{t-1},{\mathbf{X}}_{t-2},\dots ,{\mathbf{X}}_{t-{\tau}_{\text{max}}})$ are initialized. Starting with *p* = 0, iteratively *p* → *p* + 1 is increased in an outer loop and, in an inner loop, it is tested for all variables ${X}_{t-\tau}^{i}$ from $\hat{\mathcal{P}}({X}_{t}^{j})$ whether the null hypothesis

$$\text{PC}:\;{X}_{t-\tau}^{i}\,\overline{⫫}\,{X}_{t}^{j}\mid \mathcal{S}\quad \text{for any}\ \mathcal{S}\ \text{with}\ \mid \mathcal{S}\mid =p\qquad(4)$$

can be rejected at a significance threshold α_{PC}. For the PC algorithm implemented here, $\mathcal{S}$ iterates through different combinations of subsets of $\hat{\mathcal{P}}({X}_{t}^{j})\setminus \{{X}_{t-\tau}^{i}\}$ with cardinality *p*, up to a maximum number of combinations *q*_{max}. Our fast variant PC_{1} is obtained by only testing the *p* parents with the strongest dependency, that is, restricting the maximum number of combinations per iteration to *q*_{max} = 1. In the first iteration (*p* = 0), $\mathcal{S}$ is empty and, thus, unconditional dependencies are tested. In each next iteration, the cardinality is increased *p* → *p* + 1, and Eq. 4 is tested again. If the null hypothesis cannot be rejected, then the link is removed from $\hat{\mathcal{P}}({X}_{t}^{j})$ at the end of each *p* iteration. The algorithm converges for a link ${X}_{t-\tau}^{i}\to {X}_{t}^{j}$ once $\mathcal{S}=\hat{\mathcal{P}}({X}_{t}^{j})\setminus \{{X}_{t-\tau}^{i}\}$ and the null hypothesis ${X}_{t-\tau}^{i} ⫫ {X}_{t}^{j}\mid \hat{\mathcal{P}}({X}_{t}^{j})\setminus \{{X}_{t-\tau}^{i}\}$ is rejected (if the null hypothesis cannot be rejected, then the link is removed). $\hat{\mathcal{P}}({X}_{t}^{j})$ is sorted after every iteration according to the absolute test statistic value (ParCorr, GPDC, or CMI), and $\mathcal{S}$ is picked in lexicographic order (only relevant for *q*_{max} > 1). Other causal variable selection algorithms use similar heuristics (*35*, *63*). The MCI stage is inspired by the information-theoretic measure momentary information transfer introduced in different variants in (*55*, *64*).

We denote PCMCI run with a restricted number of conditions *p*_{X} in Eq. 3 as PC_{1}+MCI_{*p*_{X}}, if not clear from the context.

*Choice of* τ_{max}. The maximum time delay depends on the application and should be chosen according to the maximum physical time lag expected in the complex system. If a relevant time lag, which may explain a dependency between two other variables, is not included, then the Causal Sufficiency assumption would be violated. In practice, we recommend a rather large choice, e.g., the largest lag with significant unconditional dependency, because a too large choice of τ_{max} merely leads to longer runtimes of PCMCI but not so much to an increased estimation dimension as for FullCI.
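The "largest lag with significant unconditional dependency" heuristic can be sketched as follows (our own helper; the two-sigma threshold 2/√T is a rough approximation for uncorrelated data):

```python
import numpy as np

def largest_significant_lag(x, y, max_lag=20):
    """Heuristic tau_max: the largest lag whose lagged correlation of x
    with y exceeds an approximate two-sigma noise level of 2/sqrt(T)."""
    T = len(x)
    thresh = 2.0 / np.sqrt(T)
    sig = [tau for tau in range(1, max_lag + 1)
           if abs(np.corrcoef(x[:-tau], y[tau:])[0, 1]) > thresh]
    return max(sig) if sig else 1
```

In a multivariate analysis, one would apply this across all variable pairs and take the maximum.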

*Choice of* α_{PC}. α_{PC} should not be seen as a significance test level in PC_{1}, since the iterative hypothesis tests do not allow for a precise assessment of uncertainties at this stage. Here, α_{PC} rather takes the role of a regularization parameter, as in model selection techniques. The conditioning sets estimated with PC_{1} should include the true parents and, at the same time, be small in cardinality to reduce the estimation dimension of the MCI test and improve its power. The first demand, however, is typically more important (see section S5.3). In fig. S8, we investigated the performance of PCMCI implemented with ParCorr, GPDC, and CMI for different α_{PC}. Too small values of α_{PC} result in many true links not being included in the condition set for the MCI tests and, hence, increase false positives. Too high levels of α_{PC} lead to a high dimensionality of the condition set, which reduces detection power and increases the runtime. Note that for a threshold α_{PC} = 1 in PC_{1}, no parents are removed and all *N*τ_{max} variables would be selected as conditions; the MCI test then becomes a FullCI test. As in any variable selection method (*35*), α_{PC} can be optimized using cross-validation or scores such as the Bayesian information criterion (BIC) or the Akaike information criterion (AIC). For all ParCorr experiments (except for the ones labeled +MCI_{p_X}), we optimized α_{PC} with AIC as a selection criterion. More precisely, for each variable ${X}_{t}^{j}$, we ran PC_{1} separately for each α_{PC} ∈ {0.1, 0.2, 0.3, 0.4}, yielding different conditioning sets ${\hat{\mathcal{P}}}_{{\mathrm{\alpha}}_{\text{PC}}}({X}_{t}^{j})$. Then, we fit a linear model for each α_{PC}

$${X}_{t}^{j}=\sum_{X\in {\hat{\mathcal{P}}}_{{\mathrm{\alpha}}_{\text{PC}}}({X}_{t}^{j})}{\mathrm{\beta}}_{X}\,X+{\mathrm{\epsilon}}_{t}^{j}\qquad(5)$$

yielding the residual sum of squares (RSS), and selected α_{PC} according to AIC (modulo constants)

$$\text{AIC}({\mathrm{\alpha}}_{\text{PC}})=n\,\mathrm{ln}(\text{RSS})+2\,\vert {\hat{\mathcal{P}}}_{{\mathrm{\alpha}}_{\text{PC}}}({X}_{t}^{j})\vert \qquad(6)$$

where *n* is the sample size (typically the time series length *T* minus a cutoff due to τ_{max}) and ∣ · ∣ denotes cardinality. For GPDC, one can similarly select α_{PC} based on the log marginal likelihood of the fitted Gaussian process, while for CMI, one can use cross-validation based on nearest-neighbor predictions for the different conditioning sets. But since GPDC and CMI are already quite computationally demanding, we picked α_{PC} = 0.2 in all experiments, based on our findings in fig. S8. In the bottom panels of figs. S4 to S7, we also analyzed α_{PC} = 0.2 for ParCorr for all numerical experiments and found that this option also gives good results for sparse networks and runs even faster than Lasso, although with a potentially higher risk of inflated false positives. Unfortunately, we have no finite sample consistency results for choosing α_{PC}.
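The AIC-based selection described above can be sketched as follows, assuming ordinary least squares fits of the target variable on each candidate conditioning set; the function and variable names are ours:

```python
import numpy as np

def aic_select(y, candidate_sets):
    """Pick the alpha_pc whose conditioning set minimizes
    AIC = n * log(RSS) + 2 * |conditioning set|, modulo constants.
    `candidate_sets` maps each alpha_pc to a design matrix whose
    columns are the lagged parents selected by PC1 at that level."""
    best_alpha, best_aic = None, np.inf
    for alpha_pc, X in candidate_sets.items():
        n = len(y)
        # Ordinary least squares fit of y on the candidate parents
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        aic = n * np.log(rss) + 2 * X.shape[1]
        if aic < best_aic:
            best_alpha, best_aic = alpha_pc, aic
    return best_alpha
```

Since AIC penalizes each extra condition by 2, a larger conditioning set is only preferred if it reduces the RSS enough to offset the penalty.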

*Choice of* p_{X}. While the parents ${\hat{\mathcal{P}}}({X}_{t}^{j})$ are sufficient to assess conditional independence, the additional conditions ${\hat{\mathcal{P}}}_{{p}_{X}}({X}_{t-\mathrm{\tau}}^{i})\subseteq {\hat{\mathcal{P}}}({X}_{t-\mathrm{\tau}}^{i})$ are used to account for autocorrelation and make the MCI test statistic a measure of causal strength, as analyzed in section S5.5. To limit high dimensionality, one can strongly restrict the number of conditions ${\hat{\mathcal{P}}}_{{p}_{X}}({X}_{t-\mathrm{\tau}}^{i})$ with the free parameter *p*_{X}. To avoid having another free parameter, we kept *p*_{X} unrestricted in most experiments. In some experiments (see figs. S4 to S7, S10, and S12), we found that a small value *p*_{X} = 3 already suffices to reduce inflated false positives due to strong autocorrelation and to estimate causal strength. The reason is that the largest driver will typically be the autodependency, and conditioning out its influence already diminishes the effect of strong autocorrelation. In theory, a too small *p*_{X} should lead to a less well-calibrated test (see section S5.3), but in practice, it seems to be a sensible trade-off. In section S3, we describe a PCMCI variant for *p*_{X} = 0 and a bivariate variant that does not condition on external variables. Neither of these can guarantee consistent causal graph estimates, and both likely feature inflated false positives, especially for strong autocorrelation.
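One plausible way to implement the restriction to the *p*_{X} strongest conditions is sketched below, under the assumption that parents are ranked by the absolute value of the test statistic obtained during the condition selection stage (the ranking criterion and function name are our assumptions):

```python
def restrict_conditions(parents_with_scores, p_x=3):
    """Keep only the p_x strongest conditions, ranked by the absolute
    test statistic from the condition selection stage.
    `parents_with_scores` is a list of (parent, score) pairs."""
    ranked = sorted(parents_with_scores, key=lambda ps: abs(ps[1]), reverse=True)
    return [parent for parent, score in ranked[:p_x]]
```

With the autodependency typically scoring highest, it would be retained first, which is consistent with the observation above that conditioning on it already curbs autocorrelation effects.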

*False discovery rate control*. PCMCI can also be combined with false discovery rate control, e.g., using the Benjamini-Hochberg approach (*37*). This approach controls the expected proportion of false discoveries by adjusting the *P* values resulting from the MCI stage for the whole time series graph. More precisely, we obtain the *q* values as

$$q=\frac{m}{r}\,P\qquad(7)$$

where *P* is the original *P* value, *r* is the rank of the original *P* value when *P* values are sorted in ascending order, and *m* is the total number of computed *P* values, that is, *m* = *N*^{2}τ_{max} when adjusting only directed links for τ > 0, and correspondingly larger if contemporaneous links for τ = 0 are also taken into account. In our numerical experiments, we did not control the false discovery rate since we were interested in the individual link performances.
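The adjustment of Eq. 7 can be sketched as follows; note that the standard Benjamini-Hochberg procedure additionally enforces monotonicity of the adjusted values along the sorted order, which we include here (function name is ours):

```python
import numpy as np

def q_values(pvals):
    """Adjust P values as q = (m / r) * P, where r is the ascending
    rank and m the total number of tests, then enforce the usual
    monotonicity and cap at 1 as in Benjamini-Hochberg."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranks = np.empty(m, dtype=int)
    ranks[order] = np.arange(1, m + 1)
    q = p * m / ranks
    # Make q values non-decreasing when sorted by P value
    q_sorted = np.minimum.accumulate(q[order][::-1])[::-1]
    q_final = np.empty(m)
    q_final[order] = np.minimum(q_sorted, 1.0)
    return q_final
```

A link is then reported at false discovery rate level α whenever its *q* value is below α.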

### Real-world applications

The climate time series are regional averages (see boxes in Fig. 4) from the reanalysis (*65*) for the period 1948–2012 (780 months). WPAC denotes monthly surface pressure anomalies in the West Pacific, CPAC and EPAC denote surface air temperature anomalies in the Central and East Pacific, respectively, and ATL denotes surface air temperature anomalies in the tropical Atlantic. Anomalies are taken with respect to the whole period. The data are freely available from www.esrl.noaa.gov.
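A minimal sketch of the anomaly computation for a monthly index series (the function name and input layout are our assumptions; the actual index construction additionally involves the regional averaging over the boxes in Fig. 4):

```python
import numpy as np

def monthly_anomalies(series, start_month=1):
    """Remove the seasonal cycle from a monthly time series by
    subtracting each calendar month's mean over the whole period."""
    series = np.asarray(series, dtype=float)
    # Calendar month (0..11) of each entry, given the starting month
    months = (np.arange(len(series)) + start_month - 1) % 12
    anomalies = series.copy()
    for m in range(12):
        mask = months == m
        anomalies[mask] -= series[mask].mean()
    return anomalies
```

Applied to a 780-month series starting in January 1948, this yields anomalies with respect to the whole 1948–2012 period, as used for the WPAC, CPAC, EPAC, and ATL indices.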

## SUPPLEMENTARY MATERIALS

Section S1. Time series graphs

Section S2. Alternative methods

Section S3. Further PCMCI variants

Section S4. Conditional independence tests

Section S5. Theoretical properties of PCMCI

Section S6. Numerical experiments

Algorithm S1. Pseudo-code for condition selection algorithm.

Algorithm S2. Pseudo-code for MCI causal discovery stage.

Algorithm S3. Pseudo-code for adaptive Lasso regression.

Table S1. Overview of conditional independence tests.

Table S2. Model configurations for different experiments.

Table S3. Overview of methods compared in numerical experiments.

Table S4. Summarized ANOVA results for high-dimensionality ParCorr experiments.

Table S5. Summarized ANOVA results for high-density ParCorr experiments.

Table S6. Summarized ANOVA results for high-dimensionality GPDC and CMI experiments.

Table S7. Summarized ANOVA results for sample size experiments.

Table S8. Summarized ANOVA results for noise and nonstationarity experiments.

Table S9. ANCOVA results for FullCI.

Table S10. ANCOVA results for Lasso.

Table S11. ANCOVA results for PC.

Table S12. ANCOVA results for FullCI.

Fig. S1. Illustration of notation.

Fig. S2. Motivational climate example.

Fig. S3. Real climate and cardiovascular applications.

Fig. S4. Experiments for linear models with short time series length.

Fig. S5. Experiments for linear models with longer time series length.

Fig. S6. Experiments for dense linear models with short time series length.

Fig. S7. Experiments for dense linear models with longer time series length.

Fig. S8. Experiments for different method parameters.

Fig. S9. Experiments for linear methods with different sample sizes.

Fig. S10. Experiments for nonlinear models (part 1).

Fig. S11. Experiments for nonlinear models with different sample sizes (part 1).

Fig. S12. Experiments for nonlinear models (part 2).

Fig. S13. Experiments for nonlinear models with different sample sizes (part 2).

Fig. S14. Experiments for observational noise models.

Fig. S15. Experiments for nonstationary models.

Fig. S16. Runtimes for numerical experiments.

Fig. S17. ANCOVA interaction plots.