29 August 2019

**Disclaimer** – *The views and opinions expressed in this blog are those of the author and do not necessarily reflect the views of Scalable Capital GmbH or its subsidiaries. Further information can be found at the end of this article.*

- Estimating dependencies between individual assets is part of many financial applications like portfolio optimisation or risk management and the most commonly used measure is the Pearson correlation coefficient.
- We demonstrate that estimated correlations can differ substantially depending on whether ETF or index data is used, and that the cause is asynchronicity.
- Using a simple geometric Brownian motion we explain and provide theoretical justification for the downward bias in estimated correlations caused by asynchronicity.
- Alternative approaches to overcome the issue, like using aggregated returns or different estimators, are discussed.

In a recent article, Michael Harris pointed out that:

Understanding the data is the most difficult and time consuming part of investment and trading strategy design.

– Michael Harris in the article "Understanding Data Before Developing Models" (2019)

A main message of his article is that exploratory data analysis is an essential part of every analysis or test of an investment strategy. Harris (2019) shows that using non-investable index data to backtest an ETF strategy can produce misleading results. This is nevertheless often done, because index data usually comes with longer histories. As an example, Harris (2019) tests a simple investment strategy based on the Internal Bar Strength indicator. The analysis is done twice, once using S&P 500 index data and once with the SPY ETF. The results differ quite substantially, and as a possible explanation Harris (2019) shows that there have been rather extreme differences between the ETF and index returns on some days before 2010.

In our blog article, we want to present another case where working with index or ETF data can lead to substantially different results. We will explain why estimated correlations for asynchronously observed return data have a systematic downward bias. As a consequence, substantial differences in estimated correlations can be observed when backtesting with either ETFs or index data if assets from multiple regions are considered. To demonstrate, we consider ETFs and indices for three major international stock markets: Japan, Europe and the USA.

Assessing dependencies is part of many financial applications, like portfolio optimisation or risk management. A commonly used measure for dependence is the Pearson correlation coefficient. It measures the linear association of random variables and is also called linear or product moment correlation. The correlation of two random variables $X$ and $Y$ is defined as

$$\text{Corr}(X,Y)=\frac{\text{Cov}(X,Y)}{(\text{Var}(X)\text{Var}(Y){)}^{\frac{1}{2}}}$$

and takes on values in $[-1, 1]$. It is easy to show that independence of random variables implies uncorrelatedness but the converse is **not** true. Multivariate elliptical distributions are fully determined by their mean vector, variance vector, the correlation matrix and a characteristic generator function (see for example McNeil, Frey and Embrechts (2005)). As a consequence, the correlation is particularly useful for elliptically distributed random variables, but not necessarily a sound dependence measure for non-elliptical multivariate distributions^{1}. However, the correlation is widely used among practitioners and researchers in all sorts of applications in the subject areas of risk management and portfolio optimisation.
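The statement that uncorrelatedness does not imply independence can be illustrated with a small numerical sketch. The example below is our own choice and not taken from the cited literature: it assumes a standard normal $X$ and sets $Y = X^2$, so that $Y$ is a deterministic function of $X$ while $\text{Cov}(X,Y)=\mathbb{E}[X^3]=0$.

```python
import numpy as np

# Illustrative assumption: X standard normal, Y = X^2.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = x ** 2  # fully determined by x, hence clearly dependent on it

# Pearson correlation: covariance divided by the product of standard deviations.
corr_xy = np.corrcoef(x, y)[0, 1]
print(f"sample correlation of X and X^2: {corr_xy:.3f}")
```

The sample correlation is close to zero although the dependence between the two variables could hardly be stronger, which is why the correlation should be read as a measure of linear association only.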

A typical example where correlations are estimated is the assessment of dependencies and diversification potential between international stock markets represented (or summarised) by indices. As an example, we consider the US stock market, the European stock market and the Japanese stock market. For estimating the correlations, we use two different types of data: total return index data (S&P 500 Index, S&P Europe 350 Index, MSCI Japan Index) and total returns for ETFs (IVV, IEV, EWJ) traded at NYSE that track these indices. Data is obtained from Bloomberg. Using daily returns in USD for the past fifteen years (Jan-2004 to Dec-2018), we obtain the sample correlation matrices shown in Figure 1. The differences between the estimated correlations for the ETFs (shown on the right) and the underlying indices (on the left) are obvious. A main contribution of this blog article will be an explanation for these differences in estimated correlations when either using index data or ETF data.

Instead of using daily return data, we could also estimate correlations from lower frequency data like weekly returns. To obtain weekly returns we aggregate daily returns and use overlapping samples to estimate correlations. Overlapping data is used in order to prevent effects like spurious seasonality (Kurz and Mittnik (2018)) and phenomena like timing luck (Hoffstein, Sibears and Faber (2019)). The estimated correlation matrices for weekly return data are shown in Figure 2. We can clearly see that the differences in the estimated correlations are now much smaller than before with daily return data.

A main cause for the differences in estimated correlations from index data versus estimated correlations from ETF data is that end-of-day ETF returns can be observed synchronously. This is not the case for the indices representing the stock markets from different regions. The trading hours of international markets can differ substantially. When the stock exchange in Tokyo (Japan) is closing, the European market has not even opened yet and the European markets only have a partial overlap with the US market. After the European stock exchanges close, the US market is still trading for about four hours and this repeats on a daily basis. An overview of the different trading hours is given in a visualisation in Figure 3.

For illustrative purposes we show the closing time of representative stock exchanges. The exact timestamps for the official closing price of indices can vary from day to day. Besides that, trading hours in UTC can change over the year due to daylight saving time. Additionally, trading hours for electronic trading and floor trading might differ. For example, at the Frankfurt Stock Exchange floor trading starts earlier and ends later than electronic trading on the associated Xetra platform.

Before we come to the effect of asynchronously traded markets on estimated correlations, estimation uncertainty under ideal conditions should be briefly illustrated to serve as a reference point for the later presented results with asynchronicity. Correlations can be easily estimated from observed data given some weak regularity conditions. However, one needs to be aware of the estimation uncertainty when working with correlations. To keep things simple we can perform some trivial simulation experiments to illustrate the magnitude of estimation uncertainty. All simulations in this article will be based on 25,000 simulated paths. The data is simulated from a bivariate normal distribution with mean zero and variance $1.25$ for both random variables. For the correlation we consider two cases $\rho =0.5$ and $\rho =0.8$. In Figure 4 we provide kernel density plots for the estimated sample correlations for different sample sizes of between 250 (roughly one year of daily data) and 2,500 (roughly ten years of historic data). We can see how estimated sample correlations vary for different sample sizes. Obviously the estimation uncertainty is larger for smaller sample sizes and we can see that the variation in absolute terms is more pronounced for $\rho =0.5$ than for $\rho =0.8$.
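The experiment described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind Figure 4: the function name `sample_correlations`, the seed and the reduced number of 1,000 paths (instead of 25,000, to keep memory use small) are our own assumptions. As a reference value we also print the well-known large-sample approximation $(1-\rho^2)/\sqrt{n}$ for the standard deviation of the sample correlation under bivariate normality.

```python
import numpy as np

def sample_correlations(rho, n_obs, n_paths, rng, variance=1.25):
    """Sample correlations of n_paths independent bivariate normal samples."""
    cov = variance * np.array([[1.0, rho], [rho, 1.0]])
    chol = np.linalg.cholesky(cov)
    # Shape (n_paths, n_obs, 2): each path holds n_obs correlated pairs.
    x = rng.standard_normal((n_paths, n_obs, 2)) @ chol.T
    x -= x.mean(axis=1, keepdims=True)
    cross = (x[..., 0] * x[..., 1]).sum(axis=1)
    norm = np.sqrt((x[..., 0] ** 2).sum(axis=1) * (x[..., 1] ** 2).sum(axis=1))
    return cross / norm

rng = np.random.default_rng(42)
n_paths = 1_000  # assumption: far fewer paths than the 25,000 used in the article
spread = {}
for rho in (0.5, 0.8):
    for n_obs in (250, 2_500):
        est = sample_correlations(rho, n_obs, n_paths, rng)
        spread[(rho, n_obs)] = est.std()
        approx = (1 - rho ** 2) / np.sqrt(n_obs)  # large-sample approximation
        print(f"rho={rho}, n={n_obs}: mean={est.mean():.3f}, "
              f"std={est.std():.4f}, approx std={approx:.4f}")
```

The approximation also indicates why the spread is larger for $\rho = 0.5$ than for $\rho = 0.8$: the factor $1-\rho^2$ shrinks as the true correlation approaches one.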

To set up a simulation study showcasing the effect of asynchronicity, we assume that prices are generated by a continuous process. We choose a geometric Brownian motion with price process given by

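$$P_i(t) = P_i(0)\exp\left(\left(\mu_i - \frac{\sigma_i^2}{2}\right)t + \sigma_i W_i(t)\right),$$

where $\mu_i$ and $\sigma_i$ denote the drift and volatility of asset $i$, the Brownian motions $W_i$ are correlated with correlation matrix $R$, and $P_i(t)$ is the price of asset $i$ at time $t$. We use the notation ${\rho}_{ij}$ for the correlation between $W_i$ and $W_j$, i.e., ${\rho}_{ij}$ is the $ij$-th element of the correlation matrix $R$. The geometric Brownian motion satisfies the system of stochastic differential equations

$$dP_i(t) = \mu_i P_i(t)\,dt + \sigma_i P_i(t)\,dW_i(t)$$

for $i = 1, \ldots, n$.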

Note that by intention we use an extremely over-simplified model. In no way do we intend to claim that independent identically and normally distributed increments are a "good" way to model financial markets. Also the assumption of a 24-hours price-building process with constant variance is questionable. However, for our over-simplified data generating process, the effect of asynchronously observed data becomes visible and it is much easier to understand its size and the mechanism behind it. It is obvious that moving from our over-simplified model to real data, the effect will not vanish and that is why we use a simple model for demonstrating and reasoning.

Simulated paths from a geometric Brownian motion can be obtained with the usual algorithms (see for example Algorithm 2.1.1 in Remillard (2013)). In our study we simulate hourly increments to allow testing the effect of asynchronously trading stock exchanges on the estimates of correlations. For simplicity, we assume that the hidden true continuous price process follows a geometric Brownian motion for which we simulate 24 increments for 24 equidistant time points representing the hours of a day. A first impression of a simulated bivariate sample path (with $\rho =0.8$) can be obtained in Figure 5. The left plot shows in the background paths of a bivariate process with hourly observations. We assume that every 24 hours, a closing price for each process can be observed. If the two assets (or indices) are traded synchronously there is a lag of zero. The animation shows the effect of observing the second process (in turquoise) with a lag of zero to 24 hours. In the zoomed area, we can see how the visual impression of the daily sampled path changes with the lag. Additionally, on the right hand side we can see a scatterplot of the daily returns for different lags. The regression line shows how the correlation of daily return observations changes with an increasing lag^{2}. Remember that there are 10.5 hours between the market close in Japan and Europe, and then another 4.5 hours until the US market closes.

We now run a simulation study in which we generate paths with a length of 500 trading days. The paths are simulated from a bivariate geometric Brownian motion, with each daily return generated as the sum of 24 hourly increments. This allows us to compute daily returns for different sampling lags. We consider lags between zero and 15 hours. The lag-zero case corresponds to the synchronously sampled case. The distribution of estimated correlations is shown in Figure 6. The left plot corresponds to a true correlation of $\rho =0.5$ and the right plot to a true correlation of $\rho =0.8$. We clearly see that the estimated correlations become smaller as the lag increases. Furthermore, especially for $\rho =0.8$, we can observe an increased variation of the estimates.
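A minimal sketch of such a simulation study is given below. It is not the code behind Figure 6 and rests on several assumptions of our own: zero-drift log-returns (a constant drift does not affect correlations), a daily volatility of 1% for both assets, integer lags on the hourly grid, only 200 paths per lag, and the helper names `hourly_log_returns`, `daily_returns` and `async_correlation`. For comparison it prints the model-implied value $\frac{24-\delta}{24}\rho$ that is derived further below in this article.

```python
import numpy as np

HOURS = 24

def hourly_log_returns(rho, n_days, rng, sigma=0.01):
    """Zero-drift hourly log-returns of two correlated assets (daily vol = sigma)."""
    cov = sigma ** 2 / HOURS * np.array([[1.0, rho], [rho, 1.0]])
    return rng.standard_normal((n_days * HOURS, 2)) @ np.linalg.cholesky(cov).T

def daily_returns(hourly, first_hour):
    """Sum blocks of 24 hourly returns; the first block starts at first_hour."""
    n_days = (len(hourly) - first_hour) // HOURS
    block = hourly[first_hour:first_hour + n_days * HOURS]
    return block.reshape(n_days, HOURS).sum(axis=1)

def async_correlation(hourly, lag):
    """Sample correlation when asset 0 closes `lag` hours before asset 1."""
    early = daily_returns(hourly[:, 0], HOURS - lag)
    late = daily_returns(hourly[:, 1], HOURS)
    n = min(len(early), len(late))
    return np.corrcoef(early[:n], late[:n])[0, 1]

rng = np.random.default_rng(1)
rho, n_days, n_paths = 0.8, 500, 200
mean_estimate = {}
for lag in (0, 5, 10, 15):
    estimates = [async_correlation(hourly_log_returns(rho, n_days, rng), lag)
                 for _ in range(n_paths)]
    mean_estimate[lag] = float(np.mean(estimates))
    theory = (HOURS - lag) / HOURS * rho
    print(f"lag={lag:2d}h: mean estimate={mean_estimate[lag]:.3f}, "
          f"model-implied={theory:.3f}")
```

Shifting only the sampling window of one asset, while leaving the underlying hourly process untouched, isolates the effect of asynchronous observation from any change in the true dependence.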

It is important to point out that this effect of asynchronicity on correlation estimates is well known and has been studied in numerous papers, for example Burns, Engle and Mezrich (1998) or Martens and Poon (2001). What is maybe less well known to practitioners is the magnitude of the effect. Even under the idealised and unrealistic condition of data being generated from a geometric Brownian motion, the effect can be huge, as shown in Figure 6.

In the idealised situation of a geometric Brownian motion as data generating process, the bias of the estimation can be easily computed. Let the log-return $r_i(t_{\tau,k})$ be defined as

$$r_i(t_{\tau,k}) = \log P_i(t_{\tau,k}) - \log P_i(t_{\tau,k-1})$$

for $\tau \in \mathbb{N}$, $k = 1, \ldots, 24$. It is the $k$-th hourly return on day $\tau $. For notational simplicity we have set $P_i(t_{\tau,0}) = P_i(t_{\tau-1,24})$. The daily return (i.e., the return over the last 24 hours) observed in hour $k$ of day $\tau $ (i.e., at time $t_{\tau,k}$) can be obtained via

$$R_i(t_{\tau,k}) = \sum_{l=1}^{k} r_i(t_{\tau,l}) + \sum_{l=k+1}^{24} r_i(t_{\tau-1,l}).$$

Further note that the covariance of the two assets $i$ and $j$ for synchronously observed hourly observations is given by

$$\text{Cov}({r}_{i}({t}_{\tau ,k}),{r}_{j}({t}_{\tau ,k}))=\frac{1}{24}{\sigma}_{i}{\sigma}_{j}{\rho}_{ij}$$

and asynchronously observed non-overlapping hourly returns are uncorrelated, i.e.,

$$\begin{array}{cccc}& \begin{array}{r}{\displaystyle \text{Cov}({r}_{i}({t}_{\tau ,k}),{r}_{j}({t}_{\kappa ,l}))=0,}\end{array}& & \text{(}{\textstyle \star}\text{)}\end{array}$$

for $t_{\tau,k} \neq t_{\kappa,l}$. Obviously, it holds $\text{Var}({R}_{i}({t}_{\tau ,k}))={\sigma}_{i}^{2}$. The correlation for asynchronously observed (lag $\delta $ hours) daily returns is defined as

$$\text{Corr}({R}_{i}({t}_{\tau ,k-\delta}),{R}_{j}({t}_{\tau ,k}))=\frac{\text{Cov}({R}_{i}({t}_{\tau ,k-\delta}),{R}_{j}({t}_{\tau ,k}))}{(\text{Var}({R}_{i}({t}_{\tau ,k-\delta}))\text{Var}({R}_{j}({t}_{\tau ,k})){)}^{\frac{1}{2}}}\mathrm{.}$$

W.l.o.g. we set $k=24$ and obtain

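$$R_i(t_{\tau,24-\delta}) = \sum_{l=1}^{24-\delta} r_i(t_{\tau,l}) + \sum_{l=24-\delta+1}^{24} r_i(t_{\tau-1,l})$$

and

$$R_j(t_{\tau,24}) = \sum_{m=1}^{24} r_j(t_{\tau,m}).$$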

For the covariance it follows

$$\begin{array}{rl}{\displaystyle \text{Cov}({R}_{i}({t}_{\tau ,24-\delta}),{R}_{j}({t}_{\tau ,24}))}& {\displaystyle =\sum _{l=1}^{24-\delta}\sum _{m=1}^{24}\text{Cov}({r}_{i}({t}_{\tau ,l}),{r}_{j}({t}_{\tau ,m}))+\sum _{l=24-\delta +1}^{24}\sum _{m=1}^{24}\text{Cov}({r}_{i}({t}_{\tau -1,l}),{r}_{j}({t}_{\tau ,m}))}\\ {\displaystyle}& {\displaystyle \stackrel{(\star )}{=}\sum _{l=1}^{24-\delta}\text{Cov}({r}_{i}({t}_{\tau ,l}),{r}_{j}({t}_{\tau ,l}))}\\ {\displaystyle}& {\displaystyle =\frac{24-\delta}{24}{\sigma}_{i}{\sigma}_{j}{\rho}_{ij}}\end{array}$$

and for the correlation

$$\text{Corr}({R}_{i}({t}_{\tau ,24-\delta}),{R}_{j}({t}_{\tau ,24}))=\frac{24-\delta}{24}{\rho}_{ij}\mathrm{.}$$

This implies that with a $\delta $ hour lag the bias, i.e., the expected estimate minus the true correlation, is

$$\frac{24-\delta}{24}{\rho}_{ij}-{\rho}_{ij}=-\frac{\delta}{24}{\rho}_{ij}\mathrm{.}$$

This confirms our simulation results in Figure 6. Interestingly, the relative error $-\frac{\delta}{24}$ is always the same, independent of the true correlation. Additionally, the error is proportional to the lag length of $\delta $ hours. The derivations also explain the much smaller differences in estimated correlations from ETF and index data when using weekly instead of daily returns. While the lag in hours stays the same, the longer return window (five business days with 24 hours each, i.e., 120 hours) results in a much bigger denominator in the formula for the error and bias: in the simplified model the relative error shrinks to $-\frac{\delta}{120}$.
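To get a feeling for the magnitudes, the relative error can be evaluated for the closing-time gaps quoted earlier (10.5 hours between Japan and Europe, 4.5 hours between Europe and the US, hence 15 hours between Japan and the US). The short sketch below is purely arithmetic; the helper `relative_bias` is a hypothetical name, and the numbers are what the simplified model implies, not empirical estimates.

```python
def relative_bias(lag_hours, window_hours=24):
    """Model-implied relative error of the correlation estimate: -lag / window."""
    return -lag_hours / window_hours

# Closing-time gaps in hours, as quoted in the text above.
lags = {"Japan-Europe": 10.5, "Europe-US": 4.5, "Japan-US": 15.0}

for pair, lag in lags.items():
    daily = relative_bias(lag)                         # 24-hour return window
    weekly = relative_bias(lag, window_hours=5 * 24)   # 120-hour return window
    print(f"{pair}: daily {daily:.2%}, weekly {weekly:.2%}")
```

Under these assumptions the daily estimate for the Japan-US pair loses more than half of the true correlation, while the same lag costs only one eighth of it at weekly frequency.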

Not only contemporaneous correlations are biased when estimated with asynchronously observed return data. Looking at temporal cross-correlations, we can also observe substantial differences between estimates using asynchronously observed index returns and estimates based on synchronously observed ETF returns. In Figure 7 we show estimated correlations for index returns. The left plot is the same as shown on the left of Figure 1. On the right we additionally show the cross-correlations with the returns lagged by one day. We clearly see that there is a positive correlation between the lagged US equity returns and the returns for European and Japanese equities. Furthermore, the correlation of lagged European equities with Japanese equities is also clearly positive.

In contrast, when we base correlation estimates on synchronously observed ETF returns all temporal cross-correlations (shown on the right of Figure 8) are rather close to zero.

Going back to our simplified model, the geometric Brownian motion, we find the cause for these temporal cross-correlations. For the temporal cross-covariance it follows

$$\begin{array}{rl}{\displaystyle \text{Cov}({R}_{i}({t}_{\tau ,24-\delta}),{R}_{j}({t}_{\tau -1,24}))}& {\displaystyle =\sum _{l=1}^{24-\delta}\sum _{m=1}^{24}\text{Cov}({r}_{i}({t}_{\tau ,l}),{r}_{j}({t}_{\tau -1,m}))+\sum _{l=24-\delta +1}^{24}\sum _{m=1}^{24}\text{Cov}({r}_{i}({t}_{\tau -1,l}),{r}_{j}({t}_{\tau -1,m}))}\\ {\displaystyle}& {\displaystyle \stackrel{(\star )}{=}\sum _{l=24-\delta +1}^{24}\text{Cov}({r}_{i}({t}_{\tau -1,l}),{r}_{j}({t}_{\tau -1,l}))}\\ {\displaystyle}& {\displaystyle =\frac{\delta}{24}{\sigma}_{i}{\sigma}_{j}{\rho}_{ij}}\end{array}$$

and for the correlation

$$\text{Corr}({R}_{i}({t}_{\tau ,24-\delta}),{R}_{j}({t}_{\tau -1,24}))=\frac{\delta}{24}{\rho}_{ij},$$

while the true value (for $\delta =0$) is zero.

An estimator which in our simplified model overcomes the bias caused by asynchronously observed returns is the so-called Hayashi-Yoshida estimator, see Section 1.2.1.2 in Roncalli (2013). The estimator was originally introduced by Hayashi and Yoshida (2005) in the high-frequency context for realised (co-)variances. The Hayashi-Yoshida estimator for the correlation of $R_i(t_{\tau,24-\delta})$ and $R_j(t_{\tau,24})$ is defined as the sample version of

$${\text{Corr}}_{HY}({R}_{i}({t}_{\tau ,24-\delta}),{R}_{j}({t}_{\tau ,24}))=\text{Corr}({R}_{i}({t}_{\tau ,24-\delta}),{R}_{j}({t}_{\tau ,24}))+\text{Corr}({R}_{i}({t}_{\tau ,24-\delta}),{R}_{j}({t}_{\mathrm{\mathit{\tau}\mathit{-}\mathbf{1}},24}))\mathrm{.}$$

For the simplified model, i.e., a geometric Brownian motion as data generating process, the Hayashi-Yoshida estimator is an asymptotically unbiased estimator.
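The correction can be checked numerically with a short sketch. It implements only the daily-data version stated above, i.e., the sum of the contemporaneous and the one-day-lagged sample correlation, not the general Hayashi-Yoshida estimator for irregularly spaced observations. The zero drift, the 1% daily volatility, the lag of 10 hours, the single long path of 50,000 days and the helper `daily_returns` are assumptions made for illustration.

```python
import numpy as np

HOURS = 24

def daily_returns(hourly, first_hour):
    """Sum blocks of 24 hourly returns; the first block starts at first_hour."""
    n_days = (len(hourly) - first_hour) // HOURS
    block = hourly[first_hour:first_hour + n_days * HOURS]
    return block.reshape(n_days, HOURS).sum(axis=1)

rng = np.random.default_rng(7)
rho, lag, n_days = 0.8, 10, 50_000

# Zero-drift hourly log-returns of two correlated assets (assumption).
cov = 0.01 ** 2 / HOURS * np.array([[1.0, rho], [rho, 1.0]])
hourly = rng.standard_normal((n_days * HOURS, 2)) @ np.linalg.cholesky(cov).T

early = daily_returns(hourly[:, 0], HOURS - lag)  # closes `lag` hours earlier
late = daily_returns(hourly[:, 1], HOURS)
n = min(len(early), len(late))
early, late = early[:n], late[:n]

# Contemporaneous sample correlation: biased towards zero.
plain = np.corrcoef(early, late)[0, 1]
# Early-closing asset today versus late-closing asset on the previous day.
lagged = np.corrcoef(early[1:], late[:-1])[0, 1]
hayashi_yoshida = plain + lagged

print(f"plain={plain:.3f}, lagged={lagged:.3f}, "
      f"Hayashi-Yoshida={hayashi_yoshida:.3f}, true={rho}")
```

In this setting the lagged term recovers the part of the co-movement that falls outside the overlapping window, so the sum should end up close to the true correlation.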

To assure ourselves, we apply the Hayashi-Yoshida estimator for the same simulated data used to generate the plots shown in Figure 6. The results are presented in Figure 9 and we clearly see that the Hayashi-Yoshida estimator is capable of overcoming the bias produced by applying standard correlation estimators to asynchronously observed return data.

The Hayashi-Yoshida estimator can also be applied to the studied real data example, i.e., the asynchronously observed index returns for the US, European and Japanese stock market. In Figure 10 we show on the left the standard sample correlation estimates based on asynchronously observed daily returns. The plot on the right shows the Hayashi-Yoshida estimates. We clearly see the expected downwards bias when using the plain-vanilla sample correlation as estimator for asynchronously observed data.

In this blog article we discuss the issue of downwards biased correlation estimates for asynchronously observed return data. The relevance of the bias was illustrated by comparing correlation estimates based on asynchronously observed index data with those based on synchronously observed ETF data. We explain, demonstrate and provide theoretical justification for the magnitude of, and mechanism behind, the phenomenon using a simplified geometric Brownian motion as a data generating process.

The bias can be overcome by using estimators like the Hayashi-Yoshida estimator^{3}. However, in reality the choice of data and estimator is more complex. The Hayashi-Yoshida estimator for correlations is not bounded between minus and plus one. Additionally, for a larger number of assets it often results in covariance matrix estimates that are not positive semi-definite. A second alternative is to aggregate data to a lower frequency for which the bias is less severe (as demonstrated with weekly returns). However, at lower frequencies other issues, like spurious seasonality (Kurz and Mittnik (2018)), occur and the choice between overlapping and non-overlapping sampling becomes non-trivial. A third alternative is modelling based on synchronously observed ETF data, which should not suffer from the bias caused by asynchronicity. Nevertheless, ETF data histories are often rather short and in the early years might be more noisy, as demonstrated by Harris (2019). It becomes clear that there is no single perfect solution. We therefore conclude that data and estimator need to be chosen carefully, and that exploratory data analysis is essential for understanding the factors driving observed results. No general recommendation is possible: the best choice depends on the research question and on data availability.

Burns, P., Engle, R. F. and Mezrich, J. J. (1998), Correlations and Volatilities of Asynchronous Data. The Journal of Derivatives 5 (4), 7--18, https://doi.org/10.3905/jod.1998.408000.

Harris, M. (2019), Understanding Data Before Developing Models, https://www.priceactionlab.com/Blog/2019/06/understanding-data-developing-models/.

Hayashi, T. and Yoshida, N. (2005), On covariance estimation of non-synchronously observed diffusion processes. Bernoulli 11 (2), 359--379, https://doi.org/10.3150/bj/1116340299.

Hoffstein, C., Sibears, D. J. and Faber, N. (2019), Rebalance Timing Luck: The Difference Between Hired and Fired, The Journal of Index Investing, https://doi.org/10.3905/jii.2019.1.070.

Kurz, M. S. and Mittnik, S. (2018), Risk Assessment and Spurious Seasonality, Center for Quantitative Risk Analysis (CEQURA), Working Paper Number 19, 2018. Available at SSRN: http://dx.doi.org/10.2139/ssrn.2990772

Martens, M. and Poon, S.-H. (2001), Returns synchronization and daily correlation dynamics between international stock markets, Journal of Banking & Finance 25 (10), 1805--1827, https://doi.org/10.1016/S0378-4266(00)00159-X.

McNeil, A. J., Frey, R. and Embrechts, P. (2005), Quantitative Risk Management, Princeton Series in Finance. Princeton NJ: Princeton Univ. Press.

Remillard, B. (2013), Statistical Methods for Financial Engineering, Boca Raton, FL: CRC Press.

Roncalli, T. (2013), Introduction to Risk Parity and Budgeting, Chapman & Hall/CRC Financial Mathematics Series.

1: Using copulas it is easy to show that the multivariate distribution is by far not fully specified by the correlation matrix. Furthermore, the set of attainable correlations can be substantially smaller than the entire interval $[-1, 1]$ for given univariate marginal distributions. Examples and details can be found in Chapter 5 of McNeil, Frey and Embrechts (2005).

2: Note that we did not calibrate the regression line from the data but used the theoretical true correlation of the asynchronously observed returns to come up with the slope of the line.

3: Alternative estimators for asynchronously observed returns can, for example, be found in Martens and Poon (2001).

**Disclaimer** – The views and opinions expressed in this blog are those of the author and do not necessarily reflect the views of Scalable Capital GmbH, its subsidiaries or its employees ("Scalable Capital", "we"). The content is provided to you solely for informational purposes and does not constitute, and should not be construed as, an offer or a solicitation of an offer, advice or recommendation to purchase any securities or other financial instruments. Any representation is for illustrative purposes only and is not representative of any Scalable Capital product or investment strategy. The academic concepts set forth herein are derived from sources believed by the author and Scalable Capital to be reliable and have no connection with the financial services offered by Scalable Capital. Past performance and forward-looking statements are not reliable indicators of future performance. The return may rise or fall as a result of currency fluctuations. Please refer to our risk information.

**Risk notice** – Investing involves risks and can lead to the loss of the capital invested. Neither past performance nor forecasts are a reliable indicator of future performance. We do not provide investment, legal and/or tax advice. Should this website contain information about the capital market, financial instruments and/or other topics relevant to investing, this information serves solely as a general explanation of the investment services provided by companies of our group. Please also read our risk information and terms of use.