Math and Code

This post shows some math and code.

Correlations are ubiquitous. For example, news articles often report that a research paper found no correlation between X and Y. Correlation is also related to (in)dependence, which plays an important role in linear regression. This post explains the Pearson correlation coefficient. The explanation is mainly based on the book by Hogg et al. (2018).

Let A, B and C be discrete random variables defined by, respectively, f_A(x) = x + 1, f_B(x) = 0.5x + 3, and f_C(x) = 5 for the integers x from 1 to 7. Let D be the reverse of A, that is, the values of A in reverse order. The probabilities are chosen such that every value of each random variable is equally likely.

We can put this data in a table (DataFrame):

using AlgebraOfGraphics
using CairoMakie
using DataFrames
using Statistics: mean
df = let
    X = collect(1:7)
    A = [x + 1 for x in X]      # f_A(x) = x + 1
    B = [0.5x + 3 for x in X]   # f_B(x) = 0.5x + 3
    C = [5 for x in X]          # f_C(x) = 5 (constant)
    D = reverse(A)              # values of A in reverse order
    DataFrame(; X, A, B, C, D)
end
X  A  B    C  D
1  2  3.5  5  8
2  3  4.0  5  7
3  4  4.5  5  6
4  5  5.0  5  5
5  6  5.5  5  4
6  7  6.0  5  3
7  8  6.5  5  2

We can then plot the variables to obtain the following figure:

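# Reshape from wide to long format: one row per (X, variable, value).
sdf = stack(df, [:A, :B, :C, :D])
# Put X and value on the axes and color by variable.
xv = data(sdf) * mapping(:X, :value; color=:variable)
draw(xv)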

Figure: Data visualization
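Because every value is equally likely, the expected value of each variable is simply the average of its column. The following is a minimal sketch of that idea, not part of the original post: it rebuilds the same vectors without DataFrames, and the helper name `expected_value` and the probability vector `p` are assumptions introduced here for illustration.

```julia
using Statistics: mean

# Same values as in the table above.
X = collect(1:7)
A = [x + 1 for x in X]
B = [0.5x + 3 for x in X]
C = [5 for x in X]
D = reverse(A)

# Equal probabilities: each of the 7 values gets probability 1/7.
p = fill(1 / length(X), length(X))

# Hypothetical helper: expected value as a probability-weighted sum.
expected_value(vals, probs) = sum(vals .* probs)

# Compare the weighted sum with the plain average for each variable.
for (name, vals) in [("A", A), ("B", B), ("C", C), ("D", D)]
    println(name, ": E = ", expected_value(vals, p), ", mean = ", mean(vals))
end
```

With these values, all four variables have an expected value of 5, up to floating-point rounding in the weighted sum. So the variables share the same center even though they relate to X very differently.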